[CONNECTOR] Add script to start hadoop #1438

Merged
1 change: 1 addition & 0 deletions .gitignore
@@ -18,6 +18,7 @@ spark.dockerfile
deps.dockerfile
worker.dockerfile
etl.dockerfile
hadoop.dockerfile
# we don't put binary file to git repo
gradle-wrapper.jar
VersionUtils.java
21 changes: 11 additions & 10 deletions README.md
@@ -21,16 +21,17 @@
1. [Quickly start Zookeeper](./docs/run_zookeeper.md): spin up a `zookeeper` service quickly with containers
2. [Quickly start Kafka](./docs/run_kafka_broker.md): spin up a `kafka broker` service quickly with containers
3. [Quickly start Worker](./docs/run_kafka_worker.md): spin up a `kafka worker` service quickly with containers
4. [Quickly start Prometheus](./docs/run_prometheus.md): build a metrics collection system for the `Kafka` cluster
5. [Quickly start Grafana](./docs/run_grafana.md): set up a graphical dashboard for monitoring `kafka` cluster usage
6. [Performance Tool](./docs/performance_benchmark.md): simulate a variety of workloads to verify the throughput and latency of a `Kafka` cluster
7. [Web Server](./docs/web_server/README.md): operate a `Kafka` cluster through `Restful APIs`
8. [Dispatcher](docs/dispatcher/README.md): a powerful and efficient Kafka partitioner implementation
9. [Balancer](docs/balancer/README.md): a server-side load balancing tool for `Kafka`
10. [GUI](docs/gui/README.md): a simple, easy-to-use graphical tool for viewing cluster information
11. [Connector](./docs/connector/README.md): efficient, parallelized tools built on `kafka connector`, including performance testing and data migration tools
12. [Build](docs/build_project.md): how to build and test each module of this project
13. [etl](./docs/etl/README.md): build a spark-kafka data pipeline
4. [Quickly start Hadoop](./docs/run_hadoop.md): spin up a `hadoop` service quickly with containers
5. [Quickly start Prometheus](./docs/run_prometheus.md): build a metrics collection system for the `Kafka` cluster
6. [Quickly start Grafana](./docs/run_grafana.md): set up a graphical dashboard for monitoring `kafka` cluster usage
7. [Performance Tool](./docs/performance_benchmark.md): simulate a variety of workloads to verify the throughput and latency of a `Kafka` cluster
8. [Web Server](./docs/web_server/README.md): operate a `Kafka` cluster through `Restful APIs`
9. [Dispatcher](docs/dispatcher/README.md): a powerful and efficient Kafka partitioner implementation
10. [Balancer](docs/balancer/README.md): a server-side load balancing tool for `Kafka`
11. [GUI](docs/gui/README.md): a simple, easy-to-use graphical tool for viewing cluster information
12. [Connector](./docs/connector/README.md): efficient, parallelized tools built on `kafka connector`, including performance testing and data migration tools
13. [Build](docs/build_project.md): how to build and test each module of this project
14. [etl](./docs/etl/README.md): build a spark-kafka data pipeline

# Technical Publications

239 changes: 239 additions & 0 deletions docker/start_hadoop.sh
@@ -0,0 +1,239 @@
#!/bin/bash
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

declare -r DOCKER_FOLDER=$(cd -- "$(dirname -- "${BASH_SOURCE[0]}")" &>/dev/null && pwd)
source $DOCKER_FOLDER/docker_build_common.sh

# ===============================[global variables]===============================
declare -r VERSION=${VERSION:-3.3.4}
declare -r REPO=${REPO:-ghcr.io/skiptests/astraea/hadoop}
declare -r IMAGE_NAME="$REPO:$VERSION"
declare -r DOCKERFILE=$DOCKER_FOLDER/hadoop.dockerfile
declare -r EXPORTER_VERSION="0.16.1"
declare -r EXPORTER_PORT=${EXPORTER_PORT:-"$(getRandomPort)"}
declare -r HADOOP_PORT=${HADOOP_PORT:-"$(getRandomPort)"}
declare -r HADOOP_JMX_PORT="${HADOOP_JMX_PORT:-"$(getRandomPort)"}"
declare -r HADOOP_HTTP_ADDRESS="${HADOOP_HTTP_ADDRESS:-"$(getRandomPort)"}"
declare -r JMX_CONFIG_FILE_IN_CONTAINER_PATH="/opt/jmx_exporter/jmx_exporter_config.yml"
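# JVM options injected into the Hadoop daemons: expose an unauthenticated JMX endpoint on
# HADOOP_JMX_PORT and attach the Prometheus JMX exporter javaagent listening on EXPORTER_PORT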
declare -r JMX_OPTS="-Dcom.sun.management.jmxremote \
-Dcom.sun.management.jmxremote.authenticate=false \
-Dcom.sun.management.jmxremote.ssl=false \
-Dcom.sun.management.jmxremote.port=$HADOOP_JMX_PORT \
-Dcom.sun.management.jmxremote.rmi.port=$HADOOP_JMX_PORT \
-Djava.rmi.server.hostname=$ADDRESS \
-javaagent:/opt/jmx_exporter/jmx_prometheus_javaagent-${EXPORTER_VERSION}.jar=$EXPORTER_PORT:$JMX_CONFIG_FILE_IN_CONTAINER_PATH"
declare -r HDFS_SITE_XML="/tmp/${HADOOP_PORT}-hdfs.xml"
declare -r CORE_SITE_XML="/tmp/${HADOOP_PORT}-core.xml"
# clean up the files if they already exist
[[ -f "$HDFS_SITE_XML" ]] && rm -f "$HDFS_SITE_XML"
[[ -f "$CORE_SITE_XML" ]] && rm -f "$CORE_SITE_XML"

# ===================================[functions]===================================

function showHelp() {
echo "Usage: [ENV] start_hadoop.sh"
echo "ENV: "
echo " REPO=astraea/hadoop set the docker repo"
echo " VERSION=3.3.4 set version of hadoop distribution"
echo " BUILD=false set true if you want to build image locally"
echo " RUN=false set false if you want to build/pull image only"
}

function generateDockerfile() {
echo "#this dockerfile is generated dynamically
FROM ubuntu:22.04 AS build

#install tools
RUN apt-get update && apt-get install -y wget

# download jmx exporter
RUN mkdir /opt/jmx_exporter
WORKDIR /opt/jmx_exporter
RUN wget https://repo1.maven.org/maven2/io/prometheus/jmx/jmx_prometheus_javaagent/${EXPORTER_VERSION}/jmx_prometheus_javaagent-${EXPORTER_VERSION}.jar
RUN touch $JMX_CONFIG_FILE_IN_CONTAINER_PATH
RUN echo \"rules:\\n- pattern: \\\".*\\\"\" >> $JMX_CONFIG_FILE_IN_CONTAINER_PATH

#download hadoop
WORKDIR /tmp
RUN wget https://archive.apache.org/dist/hadoop/common/hadoop-${VERSION}/hadoop-${VERSION}.tar.gz
RUN mkdir /opt/hadoop
RUN tar -zxvf hadoop-${VERSION}.tar.gz -C /opt/hadoop --strip-components=1

FROM ubuntu:22.04

#install tools
RUN apt-get update && apt-get install -y openjdk-11-jre

#copy hadoop
COPY --from=build /opt/jmx_exporter /opt/jmx_exporter
COPY --from=build /opt/hadoop /opt/hadoop

#add user
RUN groupadd $USER && useradd -ms /bin/bash -g $USER $USER

#edit hadoop-env.sh
RUN echo "export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64" >> /opt/hadoop/etc/hadoop/hadoop-env.sh

#change user
RUN chown -R $USER:$USER /opt/hadoop
USER $USER

#export ENV
ENV HADOOP_HOME /opt/hadoop
WORKDIR /opt/hadoop
" >"$DOCKERFILE"
}

function rejectProperty() {
local key=$1
local file=$2
if grep -q "<name>$key</name>" $file; then
echo "$key is NOT configurable"
exit 2
fi
}

function requireProperty() {
local key=$1
local file=$2
if ! grep -q "<name>$key</name>" $file; then
echo "$key is required"
exit 2
fi
}

function setProperty() {
local name=$1
local value=$2
local path=$3

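# escape the slashes in the value (e.g. hdfs://host:port) so the entry survives the sed
# substitution, then insert the property right before the closing </configuration> tag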
local entry="<property><name>$name</name><value>$value</value></property>"
local escapedEntry=$(echo $entry | sed 's/\//\\\//g')
sed -i "/<\/configuration>/ s/.*/${escapedEntry}\n&/" $path
}

function initArg() {
local node

echo -e "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<?xml-stylesheet type=\"text/xsl\" href=\"configuration.xsl\"?>\n<configuration>\n</configuration>" > $HDFS_SITE_XML
echo -e "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<?xml-stylesheet type=\"text/xsl\" href=\"configuration.xsl\"?>\n<configuration>\n</configuration>" > $CORE_SITE_XML

while [[ $# -gt 0 ]]; do
if [[ "$1" == "help" ]]; then
showHelp
exit 0
fi
if [[ "$1" == "namenode" || "$1" == "datanode" ]]; then
node=$1
shift
continue
fi
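# remaining arguments are <key>=<value> pairs; fs.defaultFS belongs in core-site.xml,
# everything else goes into hdfs-site.xml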
local name=${1%=*}
local value=${1#*=}
if [[ "$name" == "fs.defaultFS" ]]; then
setProperty $name $value $CORE_SITE_XML
else
setProperty $name $value $HDFS_SITE_XML
fi
shift
done

if [[ "$node" == "namenode" ]]; then
startNamenode
elif [[ "$node" == "datanode" ]]; then
startDatanode
else
echo "Please specify namenode or datanode as argument."
exit 2
fi
}

# ===================================[namenode]===================================

function startNamenode() {
declare -r CONTAINER_NAME=namenode-$HADOOP_PORT

rejectProperty fs.defaultFS $CORE_SITE_XML
rejectProperty dfs.namenode.datanode.registration.ip-hostname-check $HDFS_SITE_XML

setProperty dfs.namenode.datanode.registration.ip-hostname-check false $HDFS_SITE_XML
setProperty fs.defaultFS hdfs://$CONTAINER_NAME:8020 $CORE_SITE_XML

docker run -d --init \
--name $CONTAINER_NAME \
-h $CONTAINER_NAME \
-e HDFS_NAMENODE_OPTS="$JMX_OPTS" \
-v $HDFS_SITE_XML:/opt/hadoop/etc/hadoop/hdfs-site.xml:ro \
-v $CORE_SITE_XML:/opt/hadoop/etc/hadoop/core-site.xml:ro \
-p $HADOOP_HTTP_ADDRESS:9870 \
-p $HADOOP_JMX_PORT:$HADOOP_JMX_PORT \
-p $HADOOP_PORT:8020 \
-p $EXPORTER_PORT:$EXPORTER_PORT \
"$IMAGE_NAME" /bin/bash -c "./bin/hdfs namenode -format && ./bin/hdfs namenode"

echo "================================================="
echo "http address: ${ADDRESS}:$HADOOP_HTTP_ADDRESS"
echo "jmx address: ${ADDRESS}:$HADOOP_JMX_PORT"
echo "exporter address: ${ADDRESS}:$EXPORTER_PORT"
echo "run $DOCKER_FOLDER/start_hadoop.sh datanode fs.defaultFS=hdfs://${ADDRESS}:$HADOOP_PORT to join datanode"
echo "================================================="
}

# ===================================[datanode]===================================

function startDatanode() {
declare -r CONTAINER_NAME=datanode-$HADOOP_PORT

rejectProperty dfs.datanode.address $HDFS_SITE_XML
rejectProperty dfs.datanode.use.datanode.hostname $HDFS_SITE_XML
rejectProperty dfs.client.use.datanode.hostname $HDFS_SITE_XML
requireProperty fs.defaultFS $CORE_SITE_XML

setProperty dfs.datanode.address 0.0.0.0:$HADOOP_PORT $HDFS_SITE_XML
setProperty dfs.datanode.use.datanode.hostname true $HDFS_SITE_XML
setProperty dfs.client.use.datanode.hostname true $HDFS_SITE_XML

docker run -d --init \
--name $CONTAINER_NAME \
-h ${ADDRESS} \
-e HDFS_DATANODE_OPTS="$JMX_OPTS" \
-v $HDFS_SITE_XML:/opt/hadoop/etc/hadoop/hdfs-site.xml:ro \
-v $CORE_SITE_XML:/opt/hadoop/etc/hadoop/core-site.xml:ro \
-p $HADOOP_HTTP_ADDRESS:9864 \
-p $HADOOP_PORT:$HADOOP_PORT \
-p $HADOOP_JMX_PORT:$HADOOP_JMX_PORT \
-p $EXPORTER_PORT:$EXPORTER_PORT \
"$IMAGE_NAME" /bin/bash -c "./bin/hdfs datanode"

echo "================================================="
echo "http address: ${ADDRESS}:$HADOOP_HTTP_ADDRESS"
echo "jmx address: ${ADDRESS}:$HADOOP_JMX_PORT"
echo "exporter address: ${ADDRESS}:$EXPORTER_PORT"
echo "================================================="
}

# ===================================[main]===================================

checkDocker
buildImageIfNeed "$IMAGE_NAME"
if [[ "$RUN" != "true" ]]; then
echo "docker image: $IMAGE_NAME is created"
exit 0
fi

checkNetwork

initArg "$@"
64 changes: 64 additions & 0 deletions docs/run_hadoop.md
@@ -0,0 +1,64 @@
### Run Hadoop

#### Introduction to Hadoop
[Apache Hadoop](https://github.com/apache/hadoop) is an open-source project that provides reliable, scalable, distributed computing.

`Apache Hadoop` enables distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale from a single server to thousands of machines, each offering local computation and storage. Rather than relying on hardware for availability, it detects and handles failures at the application layer, so it can deliver a highly available service on top of a cluster of machines that may individually fail.

#### Introduction to the Hadoop Distributed File System (HDFS)

`HDFS` has a master/slave architecture. In `HDFS`, a file is split into one or more blocks that are stored across a set of DataNodes.

- The `NameNode` executes file system namespace operations such as opening, closing, and renaming files and directories. It also determines the mapping of blocks to DataNodes.

- The `DataNode` serves read and write requests from the file system's clients. DataNodes also perform block creation, deletion, and replication upon instruction from the NameNode (see the client-side sketch below).
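
A minimal, hypothetical client-side sketch of this division of labour is shown below. It assumes a running cluster reachable at `hdfs://namenode-host:8020` and a Hadoop installation under `/opt/hadoop` (both placeholders): the NameNode only answers namespace and block-location queries, while the file bytes themselves are streamed to and from DataNodes.

```bash
# hypothetical smoke test against a running HDFS cluster
export HADOOP_HOME=/opt/hadoop
cd "$HADOOP_HOME"

# create a directory and upload a local file: the NameNode records the namespace change,
# the DataNodes receive the actual blocks
./bin/hdfs dfs -fs hdfs://namenode-host:8020 -mkdir -p /demo
./bin/hdfs dfs -fs hdfs://namenode-host:8020 -put /etc/hosts /demo/hosts

# read the file back: the client asks the NameNode for block locations,
# then streams the data directly from the DataNodes
./bin/hdfs dfs -fs hdfs://namenode-host:8020 -cat /demo/hosts
```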

##### Deploying with the script

1. Start the `NameNode`
##### Script
```bash
./docker/start_hadoop.sh namenode [OPTIONS]
```
`[OPTIONS]` is one or more `hdfs-site.xml` name=value pairs; see the [official docs](https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml) for the available parameters and their default values.

- For example, to set the replication factor to 2:
```bash
./docker/start_hadoop.sh namenode dfs.replication=2
```

If the NameNode starts successfully, the script prints output like the following:
##### Script output
```bash
6c615465ad844041ee0bf12f0353e735216b8d6b897e34871a97d038f9da24f4
=================================================
http address: 192.168.103.44:14273
jmx address: 192.168.103.44:15411
exporter address: 192.168.103.44:15862
run /home/chaoheng/IdeaProjects/astraea/docker/start_hadoop.sh datanode fs.defaultFS=hdfs://192.168.103.44:16462 to join datanode
=================================================
```
You can open the official WebUI at the printed `http address`.

---
2. Start the `DataNode`

After the NameNode is up, the script prints the command for deploying a DataNode; the trailing `fs.defaultFS` parameter is the NameNode's hostname and port.
##### Script
```bash
./docker/start_hadoop.sh datanode fs.defaultFS=hdfs://192.168.103.44:16462 [OPTIONS]
```
`[OPTIONS]` is one or more `hdfs-site.xml` name=value pairs; see the [official docs](https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml) for the available parameters and their default values.

If the DataNode starts successfully, the script prints output like the following:
##### Script output
```bash
c72f5fa958dcd95e4114deeeb61a49313ceccf433f2525b19dbf3b6937ce9aec
=================================================
http address: 192.168.103.44:12163
jmx address: 192.168.103.44:16783
exporter address: 192.168.103.44:16395
=================================================
```
Again, you can open the official WebUI at the printed `http address`.

Run this script repeatedly to start multiple DataNodes under the same NameNode, as shown in the sketch below.
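
For example, a minimal loop could start three DataNodes (using the sample NameNode address printed above; each invocation picks fresh random ports, so the containers do not conflict):

```bash
# start three DataNodes that all register with the same NameNode
for i in 1 2 3; do
  ./docker/start_hadoop.sh datanode fs.defaultFS=hdfs://192.168.103.44:16462
done
```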