Commit a10ced8

Merge branch 'asf-site' into fix-link-rendering
2 parents 90ac46d + df4e119

File tree

4 files changed (+203, -44 lines)

website/docs/docker_demo.md

Lines changed: 24 additions & 18 deletions
@@ -5,15 +5,17 @@ toc: true
last_modified_at: 2019-12-30T15:59:57-04:00
---

-## A Demo using docker containers
+## A Demo using Docker containers

-Lets use a real world example to see how hudi works end to end. For this purpose, a self contained
-data infrastructure is brought up in a local docker cluster within your computer.
+Let's use a real-world example to see how Hudi works end to end. For this purpose, a self-contained
+data infrastructure is brought up in a local Docker cluster within your computer. It requires the
+Hudi repo to have been cloned locally.

The steps have been tested on a Mac laptop.

### Prerequisites

+* Clone the [Hudi repository](https://github.com/apache/hudi) to your local machine.
* Docker Setup : For Mac, please follow the steps as defined in [Install Docker Desktop on Mac](https://docs.docker.com/desktop/install/mac-install/). For running Spark-SQL queries, please ensure at least 6 GB of memory and 4 CPUs are allocated to Docker (see Docker -> Preferences -> Advanced). Otherwise, Spark-SQL queries could be killed because of memory issues.
* kcat : A command-line utility to publish/consume from Kafka topics. Use `brew install kcat` to install kcat.
* /etc/hosts : The demo references many services running in containers by hostname. Add the following settings to /etc/hosts (a sketch of typical entries follows this hunk).
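
Since the actual host list sits outside this hunk, here is a minimal sketch of the kind of /etc/hosts entries the demo expects. The hostnames are assumptions inferred from the services the demo cluster brings up; verify them against the compose file (`docker/compose/docker-compose_hadoop284_hive233_spark244.yml`) before use.

```
# Assumed demo hostnames -- confirm against the compose file and adjust as needed.
sudo tee -a /etc/hosts <<'EOF'
127.0.0.1 adhoc-1
127.0.0.1 adhoc-2
127.0.0.1 namenode
127.0.0.1 datanode1
127.0.0.1 hiveserver
127.0.0.1 hivemetastore
127.0.0.1 kafkabroker
127.0.0.1 sparkmaster
127.0.0.1 zookeeper
EOF
```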
@@ -41,16 +43,20 @@ Also, this has not been tested on some environments like Docker on Windows.

### Build Hudi

-The first step is to build hudi. **Note** This step builds hudi on default supported scala version - 2.11.
+The first step is to build Hudi. **Note** This step builds Hudi on the default supported Scala version - 2.11.
+
+NOTE: Make sure you've cloned the [Hudi repository](https://github.com/apache/hudi) first.
+
```java
cd <HUDI_WORKSPACE>
mvn clean package -Pintegration-tests -DskipTests
```

### Bringing up Demo Cluster

-The next step is to run the docker compose script and setup configs for bringing up the cluster.
-This should pull the docker images from docker hub and setup docker cluster.
+The next step is to run the Docker compose script and set up configs for bringing up the cluster. These files are in the [Hudi repository](https://github.com/apache/hudi), which you should already have locally on your machine from the previous steps.
+
+This should pull the Docker images from Docker Hub and set up the Docker cluster.

```java
cd docker
@@ -112,7 +118,7 @@ Copying spark default config and setting up configs
$ docker ps
```

-At this point, the docker cluster will be up and running. The demo cluster brings up the following services
+At this point, the Docker cluster will be up and running. The demo cluster brings up the following services:

* HDFS Services (NameNode, DataNode)
* Spark Master and Worker
@@ -1317,21 +1323,21 @@ This brings the demo to an end.

## Testing Hudi in Local Docker environment

-You can bring up a hadoop docker environment containing Hadoop, Hive and Spark services with support for hudi.
+You can bring up a Hadoop Docker environment containing Hadoop, Hive and Spark services with support for Hudi.
```java
$ mvn pre-integration-test -DskipTests
```
-The above command builds docker images for all the services with
+The above command builds Docker images for all the services with
current Hudi source installed at /var/hoodie/ws and also brings up the services using a compose file. We
-currently use Hadoop (v2.8.4), Hive (v2.3.3) and Spark (v2.4.4) in docker images.
+currently use Hadoop (v2.8.4), Hive (v2.3.3) and Spark (v2.4.4) in Docker images.

To bring down the containers
```java
$ cd hudi-integ-test
$ mvn docker-compose:down
```

-If you want to bring up the docker containers, use
+If you want to bring up the Docker containers, use
```java
$ cd hudi-integ-test
$ mvn docker-compose:up -DdetachedMode=true
@@ -1345,21 +1351,21 @@ docker environment (See __hudi-integ-test/src/test/java/org/apache/hudi/integ/IT

### Building Local Docker Containers:

-The docker images required for demo and running integration test are already in docker-hub. The docker images
+The Docker images required for the demo and for running integration tests are already in Docker Hub. The Docker images
and compose scripts are carefully implemented so that they serve a dual purpose:

-1. The docker images have inbuilt hudi jar files with environment variable pointing to those jars (HUDI_HADOOP_BUNDLE, ...)
+1. The Docker images have inbuilt Hudi jar files with environment variables pointing to those jars (HUDI_HADOOP_BUNDLE, ...)
2. For running integration tests, we need the jars generated locally to be used for running services within Docker. The
docker-compose scripts (see `docker/compose/docker-compose_hadoop284_hive233_spark244.yml`) ensure local jars override
-inbuilt jars by mounting local HUDI workspace over the docker location
-3. As these docker containers have mounted local HUDI workspace, any changes that happen in the workspace would automatically
+inbuilt jars by mounting the local Hudi workspace over the Docker location.
+3. As these Docker containers have mounted the local Hudi workspace, any changes that happen in the workspace automatically
reflect in the containers. This is a convenient way of developing and verifying Hudi for
developers who do not own a distributed environment. Note that this is how integration tests are run.

-This helps avoid maintaining separate docker images and avoids the costly step of building HUDI docker images locally.
-But if users want to test hudi from locations with lower network bandwidth, they can still build local images
+This helps avoid maintaining separate Docker images and avoids the costly step of building Hudi Docker images locally.
+But if users want to test Hudi from locations with lower network bandwidth, they can still build local images:
run the script
-`docker/build_local_docker_images.sh` to build local docker images before running `docker/setup_demo.sh`
+`docker/build_local_docker_images.sh` to build local Docker images before running `docker/setup_demo.sh`.

Here are the commands:
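
The command listing itself sits outside this hunk; below is a minimal sketch based on the two script paths named above (the exact invocation is an assumption, so check the scripts for their actual usage).

```
cd docker
# Build the demo/integration-test images locally instead of pulling them from Docker Hub.
./build_local_docker_images.sh
# Then bring up the demo cluster as usual.
./setup_demo.sh
```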

website/docs/hoodie_cleaner.md

Lines changed: 66 additions & 14 deletions
@@ -14,15 +14,22 @@ each commit, to delete older file slices. It's recommended to leave this enabled
When cleaning old files, you should be careful not to remove files that are being actively used by long running queries.
Hudi cleaner currently supports the below cleaning policies to keep a certain number of commits or file versions:

-- **KEEP_LATEST_COMMITS**: This is the default policy. This is a temporal cleaning policy that ensures the effect of
-having lookback into all the changes that happened in the last X commits. Suppose a writer is ingesting data
-into a Hudi dataset every 30 minutes and the longest running query can take 5 hours to finish, then the user should
-retain atleast the last 10 commits. With such a configuration, we ensure that the oldest version of a file is kept on
-disk for at least 5 hours, thereby preventing the longest running query from failing at any point in time. Incremental cleaning is also possible using this policy.
-- **KEEP_LATEST_FILE_VERSIONS**: This policy has the effect of keeping N number of file versions irrespective of time.
-This policy is useful when it is known how many MAX versions of the file does one want to keep at any given time.
-To achieve the same behaviour as before of preventing long running queries from failing, one should do their calculations
-based on data patterns. Alternatively, this policy is also useful if a user just wants to maintain 1 latest version of the file.
+- **KEEP_LATEST_COMMITS**: This is the default policy. It is a temporal cleaning policy that ensures the effect of
+having lookback into all the changes that happened in the last X commits. Suppose a writer is ingesting data
+into a Hudi dataset every 30 minutes and the longest running query can take 5 hours to finish, then the user should
+retain at least the last 10 commits. With such a configuration, we ensure that the oldest version of a file is kept on
+disk for at least 5 hours, thereby preventing the longest running query from failing at any point in time. Incremental cleaning is also possible using this policy.
+The number of commits to retain can be configured by `hoodie.cleaner.commits.retained`.
+
+- **KEEP_LATEST_FILE_VERSIONS**: This policy has the effect of keeping N file versions irrespective of time.
+This policy is useful when it is known how many MAX versions of a file one wants to keep at any given time.
+To achieve the same behaviour as before of preventing long running queries from failing, one should do their calculations
+based on data patterns. Alternatively, this policy is also useful if a user just wants to maintain 1 latest version of the file.
+The number of file versions to retain can be configured by `hoodie.cleaner.fileversions.retained`.
+
+- **KEEP_LATEST_BY_HOURS**: This policy cleans up based on hours. It is simple and useful when you know how long you want to keep files around at any given time.
+Commits with commit times older than the configured number of hours to be retained are cleaned.
+Currently you can configure this with the parameter `hoodie.cleaner.hours.retained`.

### Configurations
For details about all possible configurations and their default values see the [configuration docs](https://hudi.apache.org/docs/configurations#Compaction-Configs).
@@ -32,12 +39,52 @@ Hoodie Cleaner can be run as a separate process or along with your data ingestio
ingesting data, configs are available which enable you to run it [synchronously or asynchronously](https://hudi.apache.org/docs/configurations#hoodiecleanasync).

You can use this command for running the cleaner independently:
-```java
-[hoodie]$ spark-submit --class org.apache.hudi.utilities.HoodieCleaner \
-  --props s3:///temp/hudi-ingestion-config/kafka-source.properties \
-  --target-base-path s3:///temp/hudi \
-  --spark-master yarn-cluster
```
+spark-submit --master local --class org.apache.hudi.utilities.HoodieCleaner `ls packaging/hudi-utilities-bundle/target/hudi-utilities-bundle-*.jar` --help
+Usage: <main class> [options]
+  Options:
+    --help, -h
+
+    --hoodie-conf
+      Any configuration that can be set in the properties file (using the CLI
+      parameter "--props") can also be passed command line using this
+      parameter. This can be repeated
+      Default: []
+    --props
+      path to properties file on localfs or dfs, with configurations for
+      hoodie client for cleaning
+    --spark-master
+      spark master to use.
+      Default: local[2]
+  * --target-base-path
+      base path for the hoodie table to be cleaner.
+```
+Here are some examples of running the cleaner.
+
+Keep the latest 10 commits:
+```
+spark-submit --master local --class org.apache.hudi.utilities.HoodieCleaner `ls packaging/hudi-utilities-bundle/target/hudi-utilities-bundle-*.jar` \
+  --target-base-path /path/to/hoodie_table \
+  --hoodie-conf hoodie.cleaner.policy=KEEP_LATEST_COMMITS \
+  --hoodie-conf hoodie.cleaner.commits.retained=10 \
+  --hoodie-conf hoodie.cleaner.parallelism=200
+```
+Keep the latest 3 file versions:
+```
+spark-submit --master local --class org.apache.hudi.utilities.HoodieCleaner `ls packaging/hudi-utilities-bundle/target/hudi-utilities-bundle-*.jar` \
+  --target-base-path /path/to/hoodie_table \
+  --hoodie-conf hoodie.cleaner.policy=KEEP_LATEST_FILE_VERSIONS \
+  --hoodie-conf hoodie.cleaner.fileversions.retained=3 \
+  --hoodie-conf hoodie.cleaner.parallelism=200
+```
+Clean commits older than 24 hours:
+```
+spark-submit --master local --class org.apache.hudi.utilities.HoodieCleaner `ls packaging/hudi-utilities-bundle/target/hudi-utilities-bundle-*.jar` \
+  --target-base-path /path/to/hoodie_table \
+  --hoodie-conf hoodie.cleaner.policy=KEEP_LATEST_BY_HOURS \
+  --hoodie-conf hoodie.cleaner.hours.retained=24 \
+  --hoodie-conf hoodie.cleaner.parallelism=200
+```
+Note: The parallelism used is the minimum of the number of partitions to clean and `hoodie.cleaner.parallelism`.

### Run Asynchronously
In case you wish to run the cleaner service asynchronously with writing, please configure the below:
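
The configuration block itself lies outside this hunk. As a sketch, the asynchronous-cleaning switch referenced by the config-docs link above is `hoodie.clean.async`; pairing it with automatic cleaning is an assumption, so confirm both keys against the configuration docs.

```
# Assumed writer-side properties for async cleaning; verify against the config docs.
hoodie.clean.automatic=true
hoodie.clean.async=true
```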
@@ -54,4 +101,9 @@ CLI provides the below commands for cleaner service:
- `clean showpartitions`
- `cleans run`

+Example of the cleaner keeping the latest 10 commits:
+```
+cleans run --sparkMaster local --hoodieConfigs hoodie.cleaner.policy=KEEP_LATEST_COMMITS,hoodie.cleaner.commits.retained=10,hoodie.cleaner.parallelism=200
+```
+
You can find more details and the relevant code for these commands in the [`org.apache.hudi.cli.commands.CleansCommand`](https://github.com/apache/hudi/blob/master/hudi-cli/src/main/java/org/apache/hudi/cli/commands/CleansCommand.java) class.
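
The `cleans run` command above is issued from inside the Hudi CLI shell. A minimal sketch of getting there, assuming the CLI launcher script shipped in the repo (`hudi-cli/hudi-cli.sh`) and a hypothetical table path:

```
cd hudi-cli
./hudi-cli.sh
# inside the CLI shell: connect to the table first, then run the cleaner
connect --path /path/to/hoodie_table
cleans run --sparkMaster local --hoodieConfigs hoodie.cleaner.policy=KEEP_LATEST_COMMITS,hoodie.cleaner.commits.retained=10,hoodie.cleaner.parallelism=200
```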

website/versioned_docs/version-0.11.1/hoodie_cleaner.md

Lines changed: 56 additions & 7 deletions
@@ -18,14 +18,18 @@ Hudi cleaner currently supports the below cleaning policies to keep a certain nu
having lookback into all the changes that happened in the last X commits. Suppose a writer is ingesting data
into a Hudi dataset every 30 minutes and the longest running query can take 5 hours to finish, then the user should
retain at least the last 10 commits. With such a configuration, we ensure that the oldest version of a file is kept on
-disk for at least 5 hours, thereby preventing the longest running query from failing at any point in time. Incremental cleaning is also possible using this policy.
+disk for at least 5 hours, thereby preventing the longest running query from failing at any point in time. Incremental cleaning is also possible using this policy.
+Number of commits to retain can be configured by `hoodie.cleaner.commits.retained`.
+
- **KEEP_LATEST_FILE_VERSIONS**: This policy has the effect of keeping N number of file versions irrespective of time.
This policy is useful when it is known how many MAX versions of the file does one want to keep at any given time.
To achieve the same behaviour as before of preventing long running queries from failing, one should do their calculations
based on data patterns. Alternatively, this policy is also useful if a user just wants to maintain 1 latest version of the file.
+Number of file versions to retain can be configured by `hoodie.cleaner.fileversions.retained`.
+
- **KEEP_LATEST_BY_HOURS**: This policy cleans up based on hours. It is simple and useful when you know how long you want to keep files around at any given time.
Commits with commit times older than the configured number of hours to be retained are cleaned.
-Currently you can configure by parameter 'hoodie.cleaner.hours.retained'.
+Currently you can configure this with the parameter `hoodie.cleaner.hours.retained`.

### Configurations
For details about all possible configurations and their default values see the [configuration docs](https://hudi.apache.org/docs/configurations#Compaction-Configs).
@@ -35,12 +39,52 @@ Hoodie Cleaner can be run as a separate process or along with your data ingestio
ingesting data, configs are available which enable you to run it [synchronously or asynchronously](https://hudi.apache.org/docs/configurations#hoodiecleanasync).

You can use this command for running the cleaner independently:
-```java
-[hoodie]$ spark-submit --class org.apache.hudi.utilities.HoodieCleaner \
-  --props s3:///temp/hudi-ingestion-config/kafka-source.properties \
-  --target-base-path s3:///temp/hudi \
-  --spark-master yarn-cluster
```
+spark-submit --master local --class org.apache.hudi.utilities.HoodieCleaner `ls packaging/hudi-utilities-bundle/target/hudi-utilities-bundle-*.jar` --help
+Usage: <main class> [options]
+  Options:
+    --help, -h
+
+    --hoodie-conf
+      Any configuration that can be set in the properties file (using the CLI
+      parameter "--props") can also be passed command line using this
+      parameter. This can be repeated
+      Default: []
+    --props
+      path to properties file on localfs or dfs, with configurations for
+      hoodie client for cleaning
+    --spark-master
+      spark master to use.
+      Default: local[2]
+  * --target-base-path
+      base path for the hoodie table to be cleaner.
+```
+Here are some examples of running the cleaner.
+
+Keep the latest 10 commits:
+```
+spark-submit --master local --class org.apache.hudi.utilities.HoodieCleaner `ls packaging/hudi-utilities-bundle/target/hudi-utilities-bundle-*.jar` \
+  --target-base-path /path/to/hoodie_table \
+  --hoodie-conf hoodie.cleaner.policy=KEEP_LATEST_COMMITS \
+  --hoodie-conf hoodie.cleaner.commits.retained=10 \
+  --hoodie-conf hoodie.cleaner.parallelism=200
+```
+Keep the latest 3 file versions:
+```
+spark-submit --master local --class org.apache.hudi.utilities.HoodieCleaner `ls packaging/hudi-utilities-bundle/target/hudi-utilities-bundle-*.jar` \
+  --target-base-path /path/to/hoodie_table \
+  --hoodie-conf hoodie.cleaner.policy=KEEP_LATEST_FILE_VERSIONS \
+  --hoodie-conf hoodie.cleaner.fileversions.retained=3 \
+  --hoodie-conf hoodie.cleaner.parallelism=200
+```
+Clean commits older than 24 hours:
+```
+spark-submit --master local --class org.apache.hudi.utilities.HoodieCleaner `ls packaging/hudi-utilities-bundle/target/hudi-utilities-bundle-*.jar` \
+  --target-base-path /path/to/hoodie_table \
+  --hoodie-conf hoodie.cleaner.policy=KEEP_LATEST_BY_HOURS \
+  --hoodie-conf hoodie.cleaner.hours.retained=24 \
+  --hoodie-conf hoodie.cleaner.parallelism=200
+```
+Note: The parallelism used is the minimum of the number of partitions to clean and `hoodie.cleaner.parallelism`.

### Run Asynchronously
In case you wish to run the cleaner service asynchronously with writing, please configure the below:
@@ -57,4 +101,9 @@ CLI provides the below commands for cleaner service:
- `clean showpartitions`
- `cleans run`

+Example of the cleaner keeping the latest 10 commits:
+```
+cleans run --sparkMaster local --hoodieConfigs hoodie.cleaner.policy=KEEP_LATEST_COMMITS,hoodie.cleaner.commits.retained=10,hoodie.cleaner.parallelism=200
+```
+
You can find more details and the relevant code for these commands in the [`org.apache.hudi.cli.commands.CleansCommand`](https://github.com/apache/hudi/blob/master/hudi-cli/src/main/java/org/apache/hudi/cli/commands/CleansCommand.java) class.
