website/docs/docker_demo.md (24 additions, 18 deletions)
@@ -5,15 +5,17 @@ toc: true
 last_modified_at: 2019-12-30T15:59:57-04:00
 ---

-## A Demo using docker containers
+## A Demo using Docker containers

-Lets use a real world example to see how hudi works end to end. For this purpose, a self contained
-data infrastructure is brought up in a local docker cluster within your computer.
+Let's use a real-world example to see how Hudi works end to end. For this purpose, a self-contained
+data infrastructure is brought up in a local Docker cluster within your computer. It requires the
+Hudi repo to have been cloned locally.

 The steps have been tested on a Mac laptop

 ### Prerequisites

+* Clone the [Hudi repository](https://github.com/apache/hudi) to your local machine.
 * Docker Setup : For Mac, please follow the steps as defined in [Install Docker Desktop on Mac](https://docs.docker.com/desktop/install/mac-install/). For running Spark-SQL queries, please ensure at least 6 GB of memory and 4 CPUs are allocated to Docker (see Docker -> Preferences -> Advanced). Otherwise, Spark-SQL queries could be killed because of memory issues.
 * kcat : A command-line utility to publish/consume from Kafka topics. Use `brew install kcat` to install kcat.
 * /etc/hosts : The demo references many services running in containers by hostname. Add the following settings to /etc/hosts:
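The concrete /etc/hosts entries are collapsed in this hunk. A sketch of the kind of entries expected, assuming the demo's default container hostnames (confirm the authoritative list against the compose files under `docker/compose/` in the repo):

```
# Illustrative /etc/hosts entries for the demo services (hostnames are assumptions)
127.0.0.1 adhoc-1
127.0.0.1 adhoc-2
127.0.0.1 namenode
127.0.0.1 datanode1
127.0.0.1 hiveserver
127.0.0.1 hivemetastore
127.0.0.1 kafkabroker
127.0.0.1 sparkmaster
127.0.0.1 zookeeper
```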
@@ -41,16 +43,20 @@ Also, this has not been tested on some environments like Docker on Windows.

 ### Build Hudi

-The first step is to build hudi. **Note**This step builds hudi on default supported scala version -2.11.
+The first step is to build Hudi. **Note:** This step builds Hudi on the default supported Scala version, 2.11.
+
+NOTE: Make sure you've cloned the [Hudi repository](https://github.com/apache/hudi) first.
+
 ```java
 cd <HUDI_WORKSPACE>
 mvn clean package -Pintegration-tests -DskipTests
 ```

 ### Bringing up Demo Cluster

-The next step is to run the docker compose script and setup configs for bringing up the cluster.
-This should pull the docker images from docker hub and setup docker cluster.
+The next step is to run the Docker Compose script and set up configs for bringing up the cluster. These files are in the [Hudi repository](https://github.com/apache/hudi), which you should already have locally from the previous steps.
+
+This should pull the Docker images from Docker Hub and set up the Docker cluster.

 ```java
 cd docker
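The hunk is truncated here, mid code block. Given the later reference on this page to `docker/setup_demo.sh`, the step presumably continues by running that script; a sketch of the complete block:

```
cd docker
./setup_demo.sh
```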
@@ -112,7 +118,7 @@ Copying spark default config and setting up configs
 $ docker ps
 ```

-At this point, the docker cluster will be up and running. The demo cluster brings up the following services
+At this point, the Docker cluster will be up and running. The demo cluster brings up the following services

 * HDFS Services (NameNode, DataNode)
 * Spark Master and Worker
@@ -1317,21 +1323,21 @@ This brings the demo to an end.

 ## Testing Hudi in Local Docker environment

-You can bring up a hadoop docker environment containing Hadoop, Hive and Spark services with support for hudi.
+You can bring up a Hadoop Docker environment containing Hadoop, Hive and Spark services with support for Hudi.
 ```java
 $ mvn pre-integration-test -DskipTests
 ```
-The above command builds docker images for all the services with
+The above command builds Docker images for all the services with
 current Hudi source installed at /var/hoodie/ws and also brings up the services using a compose file. We
-currently use Hadoop (v2.8.4), Hive (v2.3.3) and Spark (v2.4.4) in docker images.
+currently use Hadoop (v2.8.4), Hive (v2.3.3) and Spark (v2.4.4) in Docker images.

 To bring down the containers
 ```java
 $ cd hudi-integ-test
 $ mvn docker-compose:down
 ```

-If you want to bring up the docker containers, use
+If you want to bring up the Docker containers, use
 ```java
 $ cd hudi-integ-test
 $ mvn docker-compose:up -DdetachedMode=true
@@ -1345,21 +1351,21 @@ docker environment (See __hudi-integ-test/src/test/java/org/apache/hudi/integ/IT

 ### Building Local Docker Containers:

-The docker images required for demo and running integration test are already in docker-hub. The docker images
+The Docker images required for the demo and for running integration tests are already in Docker Hub. The Docker images
 and compose scripts are carefully implemented so that they serve a dual purpose:

-1. The docker images have inbuilt hudi jar files with environment variable pointing to those jars (HUDI_HADOOP_BUNDLE, ...)
+1. The Docker images have inbuilt Hudi jar files with environment variables pointing to those jars (HUDI_HADOOP_BUNDLE, ...)
 2. For running integration-tests, we need the jars generated locally to be used for running services within docker. The
 docker-compose scripts (see `docker/compose/docker-compose_hadoop284_hive233_spark244.yml`) ensure local jars override
-inbuilt jars by mounting local HUDI workspace over the docker location
-3. As these docker containers have mounted local HUDI workspace, any changes that happen in the workspace would automatically
+inbuilt jars by mounting the local Hudi workspace over the Docker location
+3. As these Docker containers have mounted the local Hudi workspace, any changes that happen in the workspace would automatically
 reflect in the containers. This is a convenient way for developing and verifying Hudi for
 developers who do not own a distributed environment. Note that this is how integration tests are run.

-This helps avoid maintaining separate docker images and avoids the costly step of building HUDI docker images locally.
-But if users want to test hudi from locations with lower network bandwidth, they can still build local images
+This helps avoid maintaining separate Docker images and avoids the costly step of building Hudi Docker images locally.
+But if users want to test Hudi from locations with lower network bandwidth, they can still build local images:
 run the script
-`docker/build_local_docker_images.sh` to build local docker images before running `docker/setup_demo.sh`
+`docker/build_local_docker_images.sh` to build local Docker images before running `docker/setup_demo.sh`
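Taken together, the low-bandwidth path amounts to the following sketch (assuming you run from the repo root, using the two scripts named in this section):

```
cd <HUDI_WORKSPACE>
# Build the demo images locally instead of pulling from Docker Hub
./docker/build_local_docker_images.sh
# Then bring up the demo cluster
./docker/setup_demo.sh
```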
website/docs/hoodie_cleaner.md (66 additions, 14 deletions)
@@ -14,15 +14,22 @@ each commit, to delete older file slices. It's recommended to leave this enabled
 When cleaning old files, you should be careful not to remove files that are being actively used by long running queries.
 Hudi cleaner currently supports the below cleaning policies to keep a certain number of commits or file versions:

--**KEEP_LATEST_COMMITS**: This is the default policy. This is a temporal cleaning policy that ensures the effect of
-having lookback into all the changes that happened in the last X commits. Suppose a writer is ingesting data
-into a Hudi dataset every 30 minutes and the longest running query can take 5 hours to finish, then the user should
-retain atleast the last 10 commits. With such a configuration, we ensure that the oldest version of a file is kept on
-disk for at least 5 hours, thereby preventing the longest running query from failing at any point in time. Incremental cleaning is also possible using this policy.
--**KEEP_LATEST_FILE_VERSIONS**: This policy has the effect of keeping N number of file versions irrespective of time.
-This policy is useful when it is known how many MAX versions of the file does one want to keep at any given time.
-To achieve the same behaviour as before of preventing long running queries from failing, one should do their calculations
-based on data patterns. Alternatively, this policy is also useful if a user just wants to maintain 1 latest version of the file.
+- **KEEP_LATEST_COMMITS**: This is the default policy. It is a temporal cleaning policy that lets you look back at all the
+changes that happened in the last X commits. Suppose a writer is ingesting data
+into a Hudi dataset every 30 minutes and the longest running query can take 5 hours to finish; then the user should
+retain at least the last 10 commits. With such a configuration, we ensure that the oldest version of a file is kept on
+disk for at least 5 hours, thereby preventing the longest running query from failing at any point in time. Incremental cleaning is also possible using this policy.
+The number of commits to retain can be configured by `hoodie.cleaner.commits.retained`.
+
+- **KEEP_LATEST_FILE_VERSIONS**: This policy has the effect of keeping N file versions irrespective of time.
+It is useful when it is known how many MAX versions of a file one wants to keep at any given time.
+To achieve the same behaviour as before of preventing long running queries from failing, one should do the calculations
+based on data patterns. Alternatively, this policy is also useful if a user just wants to maintain 1 latest version of the file.
+The number of file versions to retain can be configured by `hoodie.cleaner.fileversions.retained`.
+
+- **KEEP_LATEST_BY_HOURS**: This policy cleans up based on hours. It is simple and useful when you know for how many hours files should be retained.
+File slices corresponding to commits with commit times older than the configured number of hours are cleaned.
+Currently you can configure this with the parameter `hoodie.cleaner.hours.retained`.

 ### Configurations
 For details about all possible configurations and their default values see the [configuration docs](https://hudi.apache.org/docs/configurations#Compaction-Configs).
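As a quick illustration (values are arbitrary examples, not defaults; see the configuration docs above), each policy is selected via `hoodie.cleaner.policy` together with its matching retention knob:

```
# Keep the last 10 commits (the default policy)
hoodie.cleaner.policy=KEEP_LATEST_COMMITS
hoodie.cleaner.commits.retained=10

# Or keep at most 3 versions of each file
hoodie.cleaner.policy=KEEP_LATEST_FILE_VERSIONS
hoodie.cleaner.fileversions.retained=3

# Or keep files written within the last 24 hours
hoodie.cleaner.policy=KEEP_LATEST_BY_HOURS
hoodie.cleaner.hours.retained=24
```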
@@ -32,12 +39,52 @@ Hoodie Cleaner can be run as a separate process or along with your data ingestio
 ingesting data, configs are available which enable you to run it [synchronously or asynchronously](https://hudi.apache.org/docs/configurations#hoodiecleanasync).

 You can use this command for running the cleaner independently:
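The spark-submit invocation itself is collapsed in this view. A minimal sketch, assuming the `org.apache.hudi.utilities.HoodieCleaner` utility class from the hudi-utilities bundle (treat the jar path, Scala/Hudi versions, and table path as placeholders):

```
spark-submit --master local \
  --class org.apache.hudi.utilities.HoodieCleaner \
  packaging/hudi-utilities-bundle/target/hudi-utilities-bundle_2.11-*.jar \
  --target-base-path /path/to/hoodie_table \
  --hoodie-conf hoodie.cleaner.policy=KEEP_LATEST_COMMITS \
  --hoodie-conf hoodie.cleaner.commits.retained=10 \
  --hoodie-conf hoodie.cleaner.parallelism=200
```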
 Note: The parallelism takes the min of the number of partitions to clean and `hoodie.cleaner.parallelism`.

 ### Run Asynchronously
 In case you wish to run the cleaner service asynchronously with writing, please configure the below:
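The configuration lines are truncated in this hunk. Based on the async-cleaning option linked above, they presumably enable automatic cleaning in asynchronous mode (an assumption to verify against the configuration docs):

```
hoodie.clean.automatic=true
hoodie.clean.async=true
```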
@@ -54,4 +101,9 @@ CLI provides the below commands for cleaner service:
 - `clean showpartitions`
 - `cleans run`

+Example of running the cleaner, retaining the latest 3 commits:
+```
+cleans run --sparkMaster local --hoodieConfigs hoodie.cleaner.policy=KEEP_LATEST_COMMITS,hoodie.cleaner.commits.retained=3,hoodie.cleaner.parallelism=200
+```

 You can find more details and the relevant code for these commands in the [`org.apache.hudi.cli.commands.CleansCommand`](https://github.com/apache/hudi/blob/master/hudi-cli/src/main/java/org/apache/hudi/cli/commands/CleansCommand.java) class.
website/versioned_docs/version-0.11.1/hoodie_cleaner.md (56 additions, 7 deletions)
@@ -18,14 +18,18 @@ Hudi cleaner currently supports the below cleaning policies to keep a certain number of commits or file versions:
 having lookback into all the changes that happened in the last X commits. Suppose a writer is ingesting data
 into a Hudi dataset every 30 minutes and the longest running query can take 5 hours to finish, then the user should
 retain at least the last 10 commits. With such a configuration, we ensure that the oldest version of a file is kept on
-disk for at least 5 hours, thereby preventing the longest running query from failing at any point in time. Incremental cleaning is also possible using this policy.
+disk for at least 5 hours, thereby preventing the longest running query from failing at any point in time. Incremental cleaning is also possible using this policy.
+The number of commits to retain can be configured by `hoodie.cleaner.commits.retained`.

 - **KEEP_LATEST_FILE_VERSIONS**: This policy has the effect of keeping N file versions irrespective of time.
 This policy is useful when it is known how many MAX versions of a file one wants to keep at any given time.
 To achieve the same behaviour as before of preventing long running queries from failing, one should do the calculations
 based on data patterns. Alternatively, this policy is also useful if a user just wants to maintain 1 latest version of the file.
+The number of file versions to retain can be configured by `hoodie.cleaner.fileversions.retained`.

 - **KEEP_LATEST_BY_HOURS**: This policy cleans up based on hours. It is simple and useful when you know for how many hours files should be retained.
 File slices corresponding to commits with commit times older than the configured number of hours are cleaned.
-Currently you can configure by parameter 'hoodie.cleaner.hours.retained'.
+Currently you can configure this with the parameter `hoodie.cleaner.hours.retained`.

 ### Configurations
 For details about all possible configurations and their default values see the [configuration docs](https://hudi.apache.org/docs/configurations#Compaction-Configs).
@@ -35,12 +39,52 @@ Hoodie Cleaner can be run as a separate process or along with your data ingestio
 ingesting data, configs are available which enable you to run it [synchronously or asynchronously](https://hudi.apache.org/docs/configurations#hoodiecleanasync).

 You can use this command for running the cleaner independently:
 Note: The parallelism takes the min of the number of partitions to clean and `hoodie.cleaner.parallelism`.

 ### Run Asynchronously
 In case you wish to run the cleaner service asynchronously with writing, please configure the below:
@@ -57,4 +101,9 @@ CLI provides the below commands for cleaner service:
 - `clean showpartitions`
 - `cleans run`

+Example of running the cleaner, retaining the latest 3 commits:
+```
+cleans run --sparkMaster local --hoodieConfigs hoodie.cleaner.policy=KEEP_LATEST_COMMITS,hoodie.cleaner.commits.retained=3,hoodie.cleaner.parallelism=200
+```

 You can find more details and the relevant code for these commands in the [`org.apache.hudi.cli.commands.CleansCommand`](https://github.com/apache/hudi/blob/master/hudi-cli/src/main/java/org/apache/hudi/cli/commands/CleansCommand.java) class.