Hadoop 3.1.2
---------------------------Must have---------------------------
"""
********************************
First of all, you need a Linux OS.
I was using WSL (Ubuntu 18.04 LTS): I simply went to the Windows Store, searched for 'WSL' and chose 'Ubuntu 18.04 LTS', which I highly recommend for Standalone or Pseudo-distributed operation.
Of course, you have to set up real Linux machines if you want to build a truly distributed system.
"""
"""
********************************
Secondly, install 'ssh' and 'pdsh'.
Check pdsh's default rcmd type:
pdsh -q -w localhost
If it is not ssh, add this line at the end of .bashrc or .zshrc:
export PDSH_RCMD_TYPE=ssh
Check the ssh connection to localhost:
ssh localhost
Sometimes you will get this message:
'ssh: connect to host localhost port 22: Connection refused'
In that case run this (on WSL):
sudo service ssh start
Then rerun the command above (you will need password authentication).
If you only work with operations I and II, just do exactly what is written above. Operation III will be a little different.
"""
sudo apt install ssh
sudo apt install pdsh
echo 'export PDSH_RCMD_TYPE=ssh' >> ~/.bashrc
echo 'export PDSH_RCMD_TYPE=ssh' >> ~/.zshrc
source ~/.bashrc
source ~/.zshrc
sudo service ssh start
ssh localhost
"""
********************************
Thirdly, Java must be installed.
At the time I write this note, Java 8 is the version that works best with Hadoop 3.1.2, so I highly recommend it.
In case you don't know: Oracle Java downloads now require logging in to an Oracle account, so you cannot install Oracle Java 8 from a PPA at the moment, but OpenJDK still works fine :)
Simply set up OpenJDK 8 with:
sudo apt install openjdk-8-jdk
If you have installed multiple JDKs, you can switch between them with:
sudo update-alternatives --config java
One more thing, you also need to switch jar and javac:
sudo update-alternatives --config jar
sudo update-alternatives --config javac
Make sure java, jar and javac are all version 1.8; I don't recommend using Java 11 for now (at the time of writing).
Optionally, add these lines to .zshrc or .bashrc:
export JAVA_HOME=<jdk_path>
-> export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export PATH="$PATH:${JAVA_HOME}/bin"
Two reasons why I said 'Optionally':
1. You can install the JDK in many ways, so the path could be different.
2. I added JAVA_HOME as an environment variable in .bashrc or .zshrc for other work. If you only work with Hadoop, which needs Java itself, you just need to add the lines above to hadoop-env.sh. We will go into detail in the Operations below.
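To double-check which Java is active and to find the path to use for JAVA_HOME, here is a quick sketch (the example output path matches the OpenJDK 8 package above and may differ on your machine):
java -version
javac -version
# Resolve the real JDK directory behind the javac symlink:
readlink -f "$(which javac)" | sed 's:/bin/javac::'
# e.g. /usr/lib/jvm/java-8-openjdk-amd64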
"""
"""
********************************
Finally, download the Hadoop package from the Apache Hadoop official site (Hadoop 3.1.2), then unpack it (see the sketch after this block for one way to do that).
You have 2 options:
1. Run Hadoop commands globally:
Add these lines to .bashrc or .zshrc:
export HADOOP_CLASSPATH=${JAVA_HOME}/lib/tools.jar
export HADOOP_HOME=/home/msaio/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
export HADOOP_MAPRED_HOME=${HADOOP_HOME}
export HADOOP_COMMON_HOME=${HADOOP_HOME}
export HADOOP_HDFS_HOME=${HADOOP_HOME}
export HADOOP_YARN_HOME=${HADOOP_HOME}
export HADOOP_CONF_DIR=/home/msaio/hadoop/etc/hadoop
export LD_LIBRARY_PATH=/home/msaio/hadoop/lib/native:$LD_LIBRARY_PATH
2. Run Hadoop commands from inside the package:
The Hadoop commands live inside bin/ and sbin/, so running a command looks like this:
/home/msaio/hadoop/bin/hdfs dfs -ls /
This line lists the root path of the distributed file system.
Then, last but not least, add the Java path at the end of the file hadoop-env.sh:
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export PATH=${JAVA_HOME}/bin:${PATH}
as I mentioned above.
"""
"""
Now we do a mini test:
Run this command if you chose option 1 above (it should print the hadoop usage text):
hadoop
Run these commands if you chose option 2:
cd hadoop
bin/hadoop
"""
So now we can deal with 3 main Operations of Hadoop:
I. Standalone Operation:
"""
A non-distributed mode
Runs as a single Java process
Useful for debugging
Data stays on the local filesystem
Only the Java path needs to be configured
"""
- Set Up:
1. Do the 'Must Have'
2. Do not configure any file except for adding the Java path to hadoop-env.sh
3. Testing wordcount:
+ Create an input_dir directory and copy some .txt files into it:
mkdir input_dir
cp /path/to/some/*.txt input_dir/
+ Do NOT create output_dir yourself: the job creates it and fails if it already exists.
+ Run this command from the unpacked hadoop folder:
bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.2.jar wordcount input_dir output_dir
+ Check output_dir for the results:
cat output_dir/*
II. Pseudo-distributed Operation
"""
A single-node cluster: the namenode and the datanode run on the same computer.
Each Hadoop daemon runs in a separate Java process.
Everything runs on localhost.
"""
- Set Up:
1. Do the "Must Have"
2. Config etc/hadoop/core-site.xml:
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
This configuration means your file system is served on port 9000 of localhost, so a file on the distributed file system is addressed like this: "hdfs://localhost:9000/<path>"
You don't have to use localhost; you can use an IPv4 address or the computer's name on the LAN instead when working with a multi-node cluster.
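For example, once the daemons from step 6 below are running, these two commands are equivalent, because fs.defaultFS fills in the scheme, host and port for you:
bin/hdfs dfs -ls hdfs://localhost:9000/
bin/hdfs dfs -ls /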
3. Config etc/hadoop/hdfs-site.xml:
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
This setting controls how many copies of each file block HDFS stores. Don't set it higher than the number of datanodes you have.
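As a side note, you can also adjust replication per file after the fact; a hedged example (the path is hypothetical):
bin/hdfs dfs -setrep 1 /user/msaio/input/core-site.xml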
4. Set up an ssh connection to localhost:
Run the command:
ssh localhost
If you cannot ssh to localhost, execute the following commands:
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 0600 ~/.ssh/authorized_keys
If you are already familiar with git, you can simply reuse the ssh key you use for git.
You can also use multiple keys by adding them on separate lines in the authorized_keys file.
One more thing: you will need an ssh key when working with a multi-node cluster, so choose carefully.
5. Next, you need to format the namenode:
hdfs namenode -format
or
bin/hdfs namenode -format
So, why do you have to format? Formatting initializes the namenode's metadata (and wipes any existing HDFS metadata), so you run it once before the first start, and again when the two configs above change in a fundamental way:
1. Connection settings:
You are using port 9000 on localhost. If you switch fs.defaultFS to, say, port 10000 on 192.168.56.20, you will need to reformat.
2. Replication:
At the first format you set replication to 1. If you later connect 3 more nodes and raise dfs.replication to 3, the new value only applies to newly written files (existing files keep their old replication unless you change it with 'hdfs dfs -setrep'), so reformatting is the simplest way to start clean with the new setting.
6. Start the namenode daemon and the datanode daemon:
sbin/start-dfs.sh
or
start-dfs.sh
7. Check
On terminal:
bin/hdfs dfsadmin -report
or
hdfs dfsadmin -report
On the web interface (default for version 3.1.2):
http://localhost:9870/
Of course, you can change this address in the configuration.
8. Testing wordcount:
Hadoop expects a home directory for your user on HDFS; relative paths like 'input' below resolve to /user/<username>/input, so create it first:
bin/hdfs dfs -mkdir /user
bin/hdfs dfs -mkdir /user/<username>
Create an input dir:
bin/hdfs dfs -mkdir input
Simply copy the config folder contents into the input dir:
bin/hdfs dfs -put etc/hadoop/*.xml input
Now we run wordcount:
bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.2.jar wordcount input output
We can get the output by copying it to the local filesystem:
bin/hdfs dfs -get output output
cat output/*
Or just:
bin/hdfs dfs -cat output/*
or
hdfs dfs -cat output/*
9. Done.
Stop HDFS:
sbin/stop-dfs.sh
or
stop-dfs.sh
***
If you want to work with YARN, you have to configure mapred-site.xml and yarn-site.xml and then run start-yarn.sh.
Try: https://hadoop.apache.org/docs/r3.2.0/hadoop-project-dist/hadoop-common/SingleCluster.html
If you want to see every available property and its default value, check
https://hadoop.apache.org/docs/r3.2.0/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml
https://hadoop.apache.org/docs/r3.2.0/hadoop-yarn/hadoop-yarn-common/yarn-default.xml
and copy the properties you need into your mapred-site.xml and yarn-site.xml.
It will work fine.
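As a rough sketch, the minimal pseudo-distributed YARN configuration from the SingleCluster guide boils down to the two properties below (the guide also sets a couple of classpath/env-whitelist properties; double-check against the page linked above):
mapred-site.xml:
<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>
yarn-site.xml:
<configuration>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
</configuration>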
***
III. Multi-node Operation (Multi-node Cluster):
"""
HDFS on many computers
Distributed Computing
"""
- Setup:
The idea: make sure every node has Java 8 installed, ssh and pdsh installed, the Hadoop package installed, and the same Hadoop config on all nodes. We will go into details below.
Make sure all nodes can reach each other on the LAN.
One more thing I almost forgot: use the same username on every node.
1. On namenode:
+ add to /etc/hosts:
Ipv4_of_namenode_computer namenode_computer_name
Ipv4_of_datanode_computer_1 datanode_computer_name_1
Ipv4_of_datanode_computer_2 datanode_computer_name_2
...
+ It will look like this:
192.168.56.1 master
192.168.56.2 node1
192.168.56.3 node2
+ Create an ssh connection to each node:
ssh-copy-id -i $HOME/.ssh/id_rsa.pub msaio@node1
ssh-copy-id -i $HOME/.ssh/id_rsa.pub msaio@node2
'msaio' is the username on my machines.
+ Check the connection to every node, e.g.:
ssh node1
+ Configure Hadoop on the namenode first, then copy the config to all nodes like this (a sketch of what that shared config looks like follows after the loop):
for node in node1 node2; do
scp ~/hadoop/etc/hadoop/* $node:/home/msaio/hadoop/etc/hadoop/;
done
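For a Hadoop 3.x multi-node cluster, the parts of that shared config that usually differ from the pseudo-distributed case look roughly like this (a sketch based on the /etc/hosts example above; the hostnames and the port are assumptions):
etc/hadoop/core-site.xml should point at the master instead of localhost:
<property>
    <name>fs.defaultFS</name>
    <value>hdfs://master:9000</value>
</property>
etc/hadoop/workers (one datanode hostname per line) tells start-dfs.sh which machines to start datanodes on:
node1
node2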
2. From namenode:
+ Format the namenode:
bin/hdfs namenode -format
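+ Before checking the report, the HDFS daemons need to be running; assuming the workers file sketched above is in place, run this from the master and it will start the datanodes on the worker nodes over ssh:
sbin/start-dfs.sh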
+ Check Cluster:
hdfs dfsadmin -report
or on browser interface:
http://master:9870
3. Execute the wordcount job:
It is exactly the same as in the pseudo-distributed operation.
Remember: every time you re-execute the wordcount job, you must first remove the output directory left over from the previous execution.
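For example, assuming the output directory is named 'output' as in the pseudo-distributed test:
hdfs dfs -rm -r output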
I have built 3 OS images on VirtualBox with all the configs and the wordcount example:
https://drive.google.com/drive/folders/1_ecMBROPE7MPMF2EiG3DpnQ12sqAFRH5
Here are the results:
check tk.txt for details