
Commit

Minor tweaks to configuration options
SomeBottle committed Feb 4, 2024
1 parent 36a070c commit b53c628
Showing 6 changed files with 62 additions and 19 deletions.
17 changes: 12 additions & 5 deletions Dockerfile
@@ -29,14 +29,21 @@ ENV PATH="$HADOOP_HOME/bin:/opt/somebottle/haspark/tools:$ZOOKEEPER_HOME/bin:$PA
ENV TEMP_PASS_FILE="/root/temp.pass"
# User .ssh configuration directory
ENV USR_SSH_CONF_DIR="/root/.ssh"
# Whether Hadoop HDFS starts along with the container
ENV GN_HDFS_SETUP_ON_STARTUP="true"
# Whether Hadoop YARN starts along with the container
ENV GN_YARN_SETUP_ON_STARTUP="true"
# Flag file marking the container's first startup
ENV INIT_FLAG_FILE="/root/init_flag"
# High availability - HDFS Nameservice
# Default values for several environment variables used for Hadoop initialization
ENV HADOOP_LAUNCH_MODE="general"
ENV HADOOP_HDFS_REPLICATION="2"
ENV HADOOP_MAP_MEMORY_MB="1024"
ENV HADOOP_REDUCE_MEMORY_MB="1024"
ENV GN_DATANODE_ON_MASTER="false"
ENV GN_NODEMANAGER_WITH_RESOURCEMANAGER="false"
ENV GN_NODEMANAGER_WITH_RESOURCEMANAGER="false"
ENV GN_HDFS_SETUP_ON_STARTUP="false"
ENV GN_YARN_SETUP_ON_STARTUP="false"
ENV HA_HDFS_NAMESERVICE="hacluster"
ENV HA_HDFS_SETUP_ON_STARTUP="false"
ENV HA_YARN_SETUP_ON_STARTUP="false"

# Run as the root user
USER root
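These ENV defaults can be overridden per container at creation time; a minimal sketch using `docker run` (the chosen values are illustrative only, not recommendations):

```bash
# Sketch: override the Hadoop initialization defaults when starting a container.
# The tag-less image name and the chosen values here are illustrative assumptions.
docker run -d \
  -e HADOOP_LAUNCH_MODE="ha" \
  -e HADOOP_HDFS_REPLICATION="3" \
  -e HA_HDFS_SETUP_ON_STARTUP="true" \
  somebottle/haspark
```
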
31 changes: 26 additions & 5 deletions README.md
@@ -1,4 +1,4 @@
# Hadoop + Spark distributed containerized deployment
# Hadoop + Spark distributed containerized deployment image

This image is based on the `bitnami/spark:3.5.0` image; the OS is `Debian 11` and the executing user is `root`.

@@ -116,7 +116,13 @@ docker pull somebottle/haspark
1. Start the corresponding daemons on every node in the cluster: `start-dfs.sh | stop-dfs.sh | start-yarn.sh | stop-yarn.sh | start-all.sh | stop-all.sh`
2. Start the corresponding daemons on the local machine only: `start-dfs-local.sh | stop-dfs-local.sh | start-yarn-local.sh | stop-yarn-local.sh | start-all-local.sh | stop-all-local.sh`

The scripts actually reside in `/opt/somebottle/haspark/tools/`.
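For example, a typical cluster-wide cycle might look like the sketch below (it assumes the tools directory is on `PATH`, as the Dockerfile above sets up):

```bash
# Bring up HDFS and YARN daemons on every node in the cluster.
start-all.sh
# ...do some work...
# Stop only this host's daemons, leaving the rest of the cluster running.
stop-all-local.sh
# Or stop everything cluster-wide.
stop-all.sh
```
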
### 4. WordCount test script

This script is used to test whether the Hadoop cluster is working properly.

Command line: `test-wordcount.sh`

The script actually resides at `/opt/somebottle/haspark/tools/test-wordcount.sh`.
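The script body is not shown in this diff; as a rough sketch, a WordCount smoke test of this kind usually does something like the following (the HDFS paths and the examples-jar wildcard are assumptions, not the actual script contents):

```bash
# Hypothetical sketch of a WordCount smoke test; not the actual test-wordcount.sh.
hdfs dfs -mkdir -p /tmp/wc-in                      # staging dir on HDFS (assumed path)
echo "hello haspark hello hadoop" | hdfs dfs -put - /tmp/wc-in/words.txt
# Run the WordCount example bundled with Hadoop's MapReduce examples jar.
hadoop jar "$HADOOP_HOME"/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
  wordcount /tmp/wc-in /tmp/wc-out
hdfs dfs -cat /tmp/wc-out/part-r-00000             # expect per-word counts
```
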

## Container deployment

@@ -158,7 +164,20 @@ docker pull somebottle/haspark[:tag]
```bash
docker compose up -d
```

### 4. Taking the containers down
### 4. Stopping and starting the containers

> ⚠️ It is recommended to call the `stop-all.sh` script to stop the Hadoop cluster before performing this step.

Run the following in the directory containing `docker-compose.yml`.

```bash
docker compose stop # stop the containers
docker compose start # start the containers
```

### 5. Taking the containers down

> ⚠️ It is recommended to call the `stop-all.sh` script to stop the Hadoop cluster before performing this step.

Run the following in the directory containing `docker-compose.yml`.

@@ -176,7 +195,7 @@ docker compose down
```bash
docker compose down -v # v stands for volumes
```

### 5. Starting and stopping Hadoop
### 6. Starting and stopping Hadoop

In principle, after the containers start, **the Hadoop cluster startup script is executed automatically once passwordless SSH login has been configured**; if that does not happen, you can run it manually:

@@ -221,7 +240,9 @@ stop-hadoop.sh
* DataNode: `/root/hdfs/data`
* JournalNode: `/root/hdfs/journal`

> It is recommended to mount volumes on the NameNode and DataNode directories so HDFS data is preserved.
> It is recommended to mount volumes on the NameNode, DataNode, and JournalNode directories so HDFS data is preserved.
> Especially in a **high-availability** cluster, take care to decide the mount rules according to which containers host the NameNode and DataNode.
> For example, on a node that only runs a NameNode you can mount just the NameNode volume, but if it also runs a DataNode, mount the DataNode volume as well.

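As an illustration of that rule (the volume names here are made up for the example; the compose file below uses its own names):

```bash
# NameNode-only host: persisting just the NameNode directory is enough.
docker run -d -v my-nn-vol:/root/hdfs/name somebottle/haspark
# Host running both a NameNode and a DataNode: persist both directories.
docker run -d \
  -v my-nn-vol:/root/hdfs/name \
  -v my-dn-vol:/root/hdfs/data \
  somebottle/haspark
```
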
### Logs
* Hadoop logs are located in the `/opt/hadoop/logs` directory.
18 changes: 15 additions & 3 deletions docker-compose.yml
@@ -8,7 +8,9 @@ services:
environment:
- SPARK_MODE=master
volumes:
- haspark-hdfs-name-data:/root/hdfs/name:copy # Map a Docker volume to /root/hdfs/name in the main container; on creation the volume copies the namenode data initialized in the image
- haspark-hdfs-shmain-name:/root/hdfs/name:copy # Map a Docker volume to /root/hdfs/name in the main container; on creation the volume copies the namenode data initialized in the image
- haspark-hdfs-shmain-journal:/root/hdfs/journal
- haspark-hdfs-shmain-data:/root/hdfs/data
- ~/docker/spark/share:/opt/share # all three containers map to the same shared directory
ports:
- '8080:8080'
@@ -28,6 +30,8 @@ services:
- SPARK_WORKER_CORES=1
volumes:
- ~/docker/spark/share:/opt/share
- haspark-hdfs-worker1-name:/root/hdfs/name:copy # namenode data
- haspark-hdfs-worker1-journal:/root/hdfs/journal
- haspark-hdfs-worker1-data:/root/hdfs/data # datanode data
ports:
- '8081:8081'
@@ -42,13 +46,21 @@ services:
- SPARK_WORKER_CORES=1
volumes:
- ~/docker/spark/share:/opt/share
- haspark-hdfs-worker2-name:/root/hdfs/name:copy # namenode data
- haspark-hdfs-worker2-journal:/root/hdfs/journal
- haspark-hdfs-worker2-data:/root/hdfs/data # datanode data
ports:
- '8082:8081'
- '8089:8088'
- '9871:9870'

volumes:
haspark-hdfs-name-data:
haspark-hdfs-shmain-name:
haspark-hdfs-shmain-data:
haspark-hdfs-shmain-journal:
haspark-hdfs-worker1-name:
haspark-hdfs-worker1-data:
haspark-hdfs-worker2-data:
haspark-hdfs-worker1-journal:
haspark-hdfs-worker2-name:
haspark-hdfs-worker2-data:
haspark-hdfs-worker2-journal:
4 changes: 2 additions & 2 deletions scripts/entry.sh
@@ -13,8 +13,8 @@ mkdir -p /opt/somebottle/haspark/logs
# Create the daemon startup record directory
# This stores the startup order of the daemons, used by the start/stop dfs/yarn/all scripts.
mkdir -p /opt/somebottle/haspark/daemon_sequence
touch $HDFS_DAEMON_SEQ_FILE
touch $YARN_DAEMON_SEQ_FILE
echo '' >$HDFS_DAEMON_SEQ_FILE
echo '' >$YARN_DAEMON_SEQ_FILE

# The exports above only take effect in the current shell and its child processes
# Export them to /etc/profile so they also remain effective in new shells after user login
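The switch from `touch` to `echo '' >` appears deliberate: `touch` only creates the file (or updates its timestamp) and leaves any previous contents intact, while the redirection truncates the file first, so each container start begins with a clean daemon-sequence record. A quick demonstration with a throwaway file:

```bash
printf 'stale entry\n' > /tmp/seq.demo
touch /tmp/seq.demo       # file keeps its old contents
cat /tmp/seq.demo         # -> stale entry
echo '' > /tmp/seq.demo   # redirection truncates before writing a newline
cat /tmp/seq.demo         # -> (one empty line)
```
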
3 changes: 2 additions & 1 deletion scripts/hadoop-general-setup.sh
@@ -31,7 +31,8 @@ if [ -e $INIT_FLAG_FILE ]; then
echo "Formatting HDFS..."
if [ -z "$(ls /root/hdfs/name 2>/dev/null)" ]; then
# Format only when the NameNode directory is empty
$HADOOP_HOME/bin/hdfs namenode -format
# The nonInteractive option ensures that if formatting has already been done, the user is not asked to format again; the step is skipped instead.
$HADOOP_HOME/bin/hdfs namenode -format -nonInteractive
else
echo "NameNode directory already formatted, skipping format."
fi
8 changes: 5 additions & 3 deletions scripts/hadoop-ha-setup.sh
@@ -107,11 +107,12 @@ if [[ "$HA_HDFS_SETUP_ON_STARTUP" == "true" ]]; then
if [ -z "$(ls /root/hdfs/name 2>/dev/null)" ]; then
# Format only when the NameNode directory is empty
echo "-> Formatting NameNode..."
$HADOOP_HOME/bin/hdfs namenode -format
# The nonInteractive option ensures that if formatting has already been done, the user is not asked to format again; the step is skipped instead.
$HADOOP_HOME/bin/hdfs namenode -format -nonInteractive
elif [ -z "$(ls /root/hdfs/journal 2>/dev/null)" ]; then
# Initialize only when the JournalNode directory is empty
echo "-> Initializing JournalNode..."
hdfs namenode -initializeSharedEdits
hdfs namenode -initializeSharedEdits -nonInteractive
else
echo "NameNode and JournalNode directory already formatted, skipping format."
fi
Expand All @@ -120,7 +121,8 @@ if [[ "$HA_HDFS_SETUP_ON_STARTUP" == "true" ]]; then
elif [[ "$HA_NAMENODE_HOSTS" = *$(hostname)* ]]; then
# If this host is not the first NameNode but is still a NameNode, sync the metadata
echo "Syncing HDFS metadata..."
$HADOOP_HOME/bin/hdfs namenode -bootstrapStandby
# The nonInteractive option ensures that if it has already been formatted, the user is not prompted again; the step is skipped instead.
$HADOOP_HOME/bin/hdfs namenode -bootstrapStandby -nonInteractive
fi
fi

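After an HA startup like the one above, one quick sanity check is to ask each NameNode for its state via `hdfs haadmin` (the service IDs `nn1`/`nn2` below are placeholders; the real ones come from `hdfs-site.xml`):

```bash
# Confirm one NameNode is active and the other is standby.
hdfs haadmin -getServiceState nn1   # expect: active
hdfs haadmin -getServiceState nn2   # expect: standby
```
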
