Commit a10ced8

Merge branch 'asf-site' into fix-link-rendering
2 parents 90ac46d + df4e119

File tree

4 files changed (+203, -44 lines)

website/docs/docker_demo.md

Lines changed: 24 additions & 18 deletions
@@ -5,15 +5,17 @@ toc: true
last_modified_at: 2019-12-30T15:59:57-04:00
---

-## A Demo using docker containers
+## A Demo using Docker containers

-Lets use a real world example to see how hudi works end to end. For this purpose, a self contained
-data infrastructure is brought up in a local docker cluster within your computer.
+Let's use a real-world example to see how Hudi works end to end. For this purpose, a self-contained
+data infrastructure is brought up in a local Docker cluster within your computer. It requires the
+Hudi repo to have been cloned locally.

The steps have been tested on a Mac laptop.

### Prerequisites

+* Clone the [Hudi repository](https://github.com/apache/hudi) to your local machine.
* Docker Setup : For Mac, please follow the steps as defined in [Install Docker Desktop on Mac](https://docs.docker.com/desktop/install/mac-install/). For running Spark-SQL queries, please ensure at least 6 GB of memory and 4 CPUs are allocated to Docker (see Docker -> Preferences -> Advanced). Otherwise, Spark-SQL queries could be killed because of memory issues.
* kcat : A command-line utility to publish/consume from Kafka topics. Use `brew install kcat` to install kcat.
* /etc/hosts : The demo references many services running in containers by hostname. Add the following settings to /etc/hosts (a sketch of typical entries follows this hunk).
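
Since the actual host list sits outside this hunk, here is a minimal sketch of the kind of /etc/hosts entries the demo expects. The hostnames are assumptions inferred from the services the demo cluster brings up; verify them against the compose file (`docker/compose/docker-compose_hadoop284_hive233_spark244.yml`) before use.

```
# Assumed demo hostnames -- confirm against the compose file and adjust as needed.
sudo tee -a /etc/hosts <<'EOF'
127.0.0.1 adhoc-1
127.0.0.1 adhoc-2
127.0.0.1 namenode
127.0.0.1 datanode1
127.0.0.1 hiveserver
127.0.0.1 hivemetastore
127.0.0.1 kafkabroker
127.0.0.1 sparkmaster
127.0.0.1 zookeeper
EOF
```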
@@ -41,16 +43,20 @@ Also, this has not been tested on some environments like Docker on Windows.

### Build Hudi

-The first step is to build hudi. **Note** This step builds hudi on default supported scala version - 2.11.
+The first step is to build Hudi. **Note** This step builds Hudi on the default supported Scala version - 2.11.
+
+NOTE: Make sure you've cloned the [Hudi repository](https://github.com/apache/hudi) first.
+
```java
cd <HUDI_WORKSPACE>
mvn clean package -Pintegration-tests -DskipTests
```

### Bringing up Demo Cluster

-The next step is to run the docker compose script and setup configs for bringing up the cluster.
-This should pull the docker images from docker hub and setup docker cluster.
+The next step is to run the Docker compose script and set up configs for bringing up the cluster. These files are in the [Hudi repository](https://github.com/apache/hudi), which you should already have locally on your machine from the previous steps.
+
+This should pull the Docker images from Docker Hub and set up the Docker cluster.

```java
cd docker
@@ -112,7 +118,7 @@ Copying spark default config and setting up configs
$ docker ps
```

-At this point, the docker cluster will be up and running. The demo cluster brings up the following services
+At this point, the Docker cluster will be up and running. The demo cluster brings up the following services:

* HDFS Services (NameNode, DataNode)
* Spark Master and Worker
@@ -1317,21 +1323,21 @@ This brings the demo to an end.

## Testing Hudi in Local Docker environment

-You can bring up a hadoop docker environment containing Hadoop, Hive and Spark services with support for hudi.
+You can bring up a Hadoop Docker environment containing Hadoop, Hive and Spark services with support for Hudi.
```java
$ mvn pre-integration-test -DskipTests
```
-The above command builds docker images for all the services with
+The above command builds Docker images for all the services with
current Hudi source installed at /var/hoodie/ws and also brings up the services using a compose file. We
-currently use Hadoop (v2.8.4), Hive (v2.3.3) and Spark (v2.4.4) in docker images.
+currently use Hadoop (v2.8.4), Hive (v2.3.3) and Spark (v2.4.4) in Docker images.

To bring down the containers
```java
$ cd hudi-integ-test
$ mvn docker-compose:down
```

-If you want to bring up the docker containers, use
+If you want to bring up the Docker containers, use
```java
$ cd hudi-integ-test
$ mvn docker-compose:up -DdetachedMode=true
@@ -1345,21 +1351,21 @@ docker environment (See __hudi-integ-test/src/test/java/org/apache/hudi/integ/IT

### Building Local Docker Containers:

-The docker images required for demo and running integration test are already in docker-hub. The docker images
+The Docker images required for the demo and for running integration tests are already in Docker Hub. The Docker images
and compose scripts are carefully implemented so that they serve a dual purpose:

-1. The docker images have inbuilt hudi jar files with environment variable pointing to those jars (HUDI_HADOOP_BUNDLE, ...)
+1. The Docker images have inbuilt Hudi jar files with environment variables pointing to those jars (HUDI_HADOOP_BUNDLE, ...)
2. For running integration tests, we need the jars generated locally to be used for running services within Docker. The
docker-compose scripts (see `docker/compose/docker-compose_hadoop284_hive233_spark244.yml`) ensure local jars override
-inbuilt jars by mounting local HUDI workspace over the docker location
-3. As these docker containers have mounted local HUDI workspace, any changes that happen in the workspace would automatically
+inbuilt jars by mounting the local Hudi workspace over the Docker location.
+3. As these Docker containers have mounted the local Hudi workspace, any changes that happen in the workspace automatically
reflect in the containers. This is a convenient way of developing and verifying Hudi for
developers who do not own a distributed environment. Note that this is how integration tests are run.

-This helps avoid maintaining separate docker images and avoids the costly step of building HUDI docker images locally.
-But if users want to test hudi from locations with lower network bandwidth, they can still build local images
+This helps avoid maintaining separate Docker images and avoids the costly step of building Hudi Docker images locally.
+But if users want to test Hudi from locations with lower network bandwidth, they can still build local images:
run the script
-`docker/build_local_docker_images.sh` to build local docker images before running `docker/setup_demo.sh`
+`docker/build_local_docker_images.sh` to build local Docker images before running `docker/setup_demo.sh`.

Here are the commands:
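
The command listing itself sits outside this hunk; below is a minimal sketch based on the two script paths named above (the exact invocation is an assumption, so check the scripts for their actual usage).

```
cd docker
# Build the demo/integration-test images locally instead of pulling them from Docker Hub.
./build_local_docker_images.sh
# Then bring up the demo cluster as usual.
./setup_demo.sh
```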

website/docs/hoodie_cleaner.md

Lines changed: 66 additions & 14 deletions
@@ -14,15 +14,22 @@ each commit, to delete older file slices. It's recommended to leave this enabled
When cleaning old files, you should be careful not to remove files that are being actively used by long running queries.
Hudi cleaner currently supports the below cleaning policies to keep a certain number of commits or file versions:

-- **KEEP_LATEST_COMMITS**: This is the default policy. This is a temporal cleaning policy that ensures the effect of
-having lookback into all the changes that happened in the last X commits. Suppose a writer is ingesting data
-into a Hudi dataset every 30 minutes and the longest running query can take 5 hours to finish, then the user should
-retain atleast the last 10 commits. With such a configuration, we ensure that the oldest version of a file is kept on
-disk for at least 5 hours, thereby preventing the longest running query from failing at any point in time. Incremental cleaning is also possible using this policy.
-- **KEEP_LATEST_FILE_VERSIONS**: This policy has the effect of keeping N number of file versions irrespective of time.
-This policy is useful when it is known how many MAX versions of the file does one want to keep at any given time.
-To achieve the same behaviour as before of preventing long running queries from failing, one should do their calculations
-based on data patterns. Alternatively, this policy is also useful if a user just wants to maintain 1 latest version of the file.
+- **KEEP_LATEST_COMMITS**: This is the default policy. It is a temporal cleaning policy that ensures the effect of
+having lookback into all the changes that happened in the last X commits. Suppose a writer is ingesting data
+into a Hudi dataset every 30 minutes and the longest running query can take 5 hours to finish, then the user should
+retain at least the last 10 commits. With such a configuration, we ensure that the oldest version of a file is kept on
+disk for at least 5 hours, thereby preventing the longest running query from failing at any point in time. Incremental cleaning is also possible using this policy.
+The number of commits to retain can be configured by `hoodie.cleaner.commits.retained`.
+
+- **KEEP_LATEST_FILE_VERSIONS**: This policy has the effect of keeping N file versions irrespective of time.
+This policy is useful when it is known how many MAX versions of a file one wants to keep at any given time.
+To achieve the same behaviour as before of preventing long running queries from failing, one should do their calculations
+based on data patterns. Alternatively, this policy is also useful if a user just wants to maintain 1 latest version of the file.
+The number of file versions to retain can be configured by `hoodie.cleaner.fileversions.retained`.
+
+- **KEEP_LATEST_BY_HOURS**: This policy cleans up based on hours. It is simple and useful when you know how long you want to keep files around at any given time.
+Commits with commit times older than the configured number of hours to be retained are cleaned.
+Currently you can configure this with the parameter `hoodie.cleaner.hours.retained`.

### Configurations
For details about all possible configurations and their default values see the [configuration docs](https://hudi.apache.org/docs/configurations#Compaction-Configs).
@@ -32,12 +39,52 @@ Hoodie Cleaner can be run as a separate process or along with your data ingestio
ingesting data, configs are available which enable you to run it [synchronously or asynchronously](https://hudi.apache.org/docs/configurations#hoodiecleanasync).

You can use this command for running the cleaner independently:
-```java
-[hoodie]$ spark-submit --class org.apache.hudi.utilities.HoodieCleaner \
-  --props s3:///temp/hudi-ingestion-config/kafka-source.properties \
-  --target-base-path s3:///temp/hudi \
-  --spark-master yarn-cluster
```
+spark-submit --master local --class org.apache.hudi.utilities.HoodieCleaner `ls packaging/hudi-utilities-bundle/target/hudi-utilities-bundle-*.jar` --help
+Usage: <main class> [options]
+  Options:
+    --help, -h
+
+    --hoodie-conf
+      Any configuration that can be set in the properties file (using the CLI
+      parameter "--props") can also be passed command line using this
+      parameter. This can be repeated
+      Default: []
+    --props
+      path to properties file on localfs or dfs, with configurations for
+      hoodie client for cleaning
+    --spark-master
+      spark master to use.
+      Default: local[2]
+  * --target-base-path
+      base path for the hoodie table to be cleaner.
+```
+Here are some examples of running the cleaner.
+
+Keep the latest 10 commits:
+```
+spark-submit --master local --class org.apache.hudi.utilities.HoodieCleaner `ls packaging/hudi-utilities-bundle/target/hudi-utilities-bundle-*.jar` \
+  --target-base-path /path/to/hoodie_table \
+  --hoodie-conf hoodie.cleaner.policy=KEEP_LATEST_COMMITS \
+  --hoodie-conf hoodie.cleaner.commits.retained=10 \
+  --hoodie-conf hoodie.cleaner.parallelism=200
+```
+Keep the latest 3 file versions:
+```
+spark-submit --master local --class org.apache.hudi.utilities.HoodieCleaner `ls packaging/hudi-utilities-bundle/target/hudi-utilities-bundle-*.jar` \
+  --target-base-path /path/to/hoodie_table \
+  --hoodie-conf hoodie.cleaner.policy=KEEP_LATEST_FILE_VERSIONS \
+  --hoodie-conf hoodie.cleaner.fileversions.retained=3 \
+  --hoodie-conf hoodie.cleaner.parallelism=200
+```
+Clean commits older than 24 hours:
+```
+spark-submit --master local --class org.apache.hudi.utilities.HoodieCleaner `ls packaging/hudi-utilities-bundle/target/hudi-utilities-bundle-*.jar` \
+  --target-base-path /path/to/hoodie_table \
+  --hoodie-conf hoodie.cleaner.policy=KEEP_LATEST_BY_HOURS \
+  --hoodie-conf hoodie.cleaner.hours.retained=24 \
+  --hoodie-conf hoodie.cleaner.parallelism=200
+```
+Note: The parallelism used is the minimum of the number of partitions to clean and `hoodie.cleaner.parallelism`.

### Run Asynchronously
In case you wish to run the cleaner service asynchronously with writing, please configure the below:
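
The configuration block itself lies outside this hunk. As a sketch, the asynchronous-cleaning switch referenced by the config-docs link above is `hoodie.clean.async`; pairing it with automatic cleaning is an assumption, so confirm both keys against the configuration docs.

```
# Assumed writer-side properties for async cleaning; verify against the config docs.
hoodie.clean.automatic=true
hoodie.clean.async=true
```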
@@ -54,4 +101,9 @@ CLI provides the below commands for cleaner service:
- `clean showpartitions`
- `cleans run`

+Example of the cleaner keeping the latest 10 commits:
+```
+cleans run --sparkMaster local --hoodieConfigs hoodie.cleaner.policy=KEEP_LATEST_COMMITS,hoodie.cleaner.commits.retained=10,hoodie.cleaner.parallelism=200
+```
+
You can find more details and the relevant code for these commands in the [`org.apache.hudi.cli.commands.CleansCommand`](https://github.com/apache/hudi/blob/master/hudi-cli/src/main/java/org/apache/hudi/cli/commands/CleansCommand.java) class.
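
The `cleans run` command above is issued from inside the Hudi CLI shell. A minimal sketch of getting there, assuming the CLI launcher script shipped in the repo (`hudi-cli/hudi-cli.sh`) and a hypothetical table path:

```
cd hudi-cli
./hudi-cli.sh
# inside the CLI shell: connect to the table first, then run the cleaner
connect --path /path/to/hoodie_table
cleans run --sparkMaster local --hoodieConfigs hoodie.cleaner.policy=KEEP_LATEST_COMMITS,hoodie.cleaner.commits.retained=10,hoodie.cleaner.parallelism=200
```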

website/versioned_docs/version-0.11.1/hoodie_cleaner.md

Lines changed: 56 additions & 7 deletions
@@ -18,14 +18,18 @@ Hudi cleaner currently supports the below cleaning policies to keep a certain nu
having lookback into all the changes that happened in the last X commits. Suppose a writer is ingesting data
into a Hudi dataset every 30 minutes and the longest running query can take 5 hours to finish, then the user should
retain at least the last 10 commits. With such a configuration, we ensure that the oldest version of a file is kept on
-disk for at least 5 hours, thereby preventing the longest running query from failing at any point in time. Incremental cleaning is also possible using this policy.
+disk for at least 5 hours, thereby preventing the longest running query from failing at any point in time. Incremental cleaning is also possible using this policy.
+Number of commits to retain can be configured by `hoodie.cleaner.commits.retained`.
+
- **KEEP_LATEST_FILE_VERSIONS**: This policy has the effect of keeping N number of file versions irrespective of time.
This policy is useful when it is known how many MAX versions of the file does one want to keep at any given time.
To achieve the same behaviour as before of preventing long running queries from failing, one should do their calculations
based on data patterns. Alternatively, this policy is also useful if a user just wants to maintain 1 latest version of the file.
+Number of file versions to retain can be configured by `hoodie.cleaner.fileversions.retained`.
+
- **KEEP_LATEST_BY_HOURS**: This policy cleans up based on hours. It is simple and useful when you know how long you want to keep files around at any given time.
Commits with commit times older than the configured number of hours to be retained are cleaned.
-Currently you can configure by parameter 'hoodie.cleaner.hours.retained'.
+Currently you can configure this with the parameter `hoodie.cleaner.hours.retained`.

### Configurations
For details about all possible configurations and their default values see the [configuration docs](https://hudi.apache.org/docs/configurations#Compaction-Configs).
@@ -35,12 +39,52 @@ Hoodie Cleaner can be run as a separate process or along with your data ingestio
ingesting data, configs are available which enable you to run it [synchronously or asynchronously](https://hudi.apache.org/docs/configurations#hoodiecleanasync).

You can use this command for running the cleaner independently:
-```java
-[hoodie]$ spark-submit --class org.apache.hudi.utilities.HoodieCleaner \
-  --props s3:///temp/hudi-ingestion-config/kafka-source.properties \
-  --target-base-path s3:///temp/hudi \
-  --spark-master yarn-cluster
```
+spark-submit --master local --class org.apache.hudi.utilities.HoodieCleaner `ls packaging/hudi-utilities-bundle/target/hudi-utilities-bundle-*.jar` --help
+Usage: <main class> [options]
+  Options:
+    --help, -h
+
+    --hoodie-conf
+      Any configuration that can be set in the properties file (using the CLI
+      parameter "--props") can also be passed command line using this
+      parameter. This can be repeated
+      Default: []
+    --props
+      path to properties file on localfs or dfs, with configurations for
+      hoodie client for cleaning
+    --spark-master
+      spark master to use.
+      Default: local[2]
+  * --target-base-path
+      base path for the hoodie table to be cleaner.
+```
+Here are some examples of running the cleaner.
+
+Keep the latest 10 commits:
+```
+spark-submit --master local --class org.apache.hudi.utilities.HoodieCleaner `ls packaging/hudi-utilities-bundle/target/hudi-utilities-bundle-*.jar` \
+  --target-base-path /path/to/hoodie_table \
+  --hoodie-conf hoodie.cleaner.policy=KEEP_LATEST_COMMITS \
+  --hoodie-conf hoodie.cleaner.commits.retained=10 \
+  --hoodie-conf hoodie.cleaner.parallelism=200
+```
+Keep the latest 3 file versions:
+```
+spark-submit --master local --class org.apache.hudi.utilities.HoodieCleaner `ls packaging/hudi-utilities-bundle/target/hudi-utilities-bundle-*.jar` \
+  --target-base-path /path/to/hoodie_table \
+  --hoodie-conf hoodie.cleaner.policy=KEEP_LATEST_FILE_VERSIONS \
+  --hoodie-conf hoodie.cleaner.fileversions.retained=3 \
+  --hoodie-conf hoodie.cleaner.parallelism=200
+```
+Clean commits older than 24 hours:
+```
+spark-submit --master local --class org.apache.hudi.utilities.HoodieCleaner `ls packaging/hudi-utilities-bundle/target/hudi-utilities-bundle-*.jar` \
+  --target-base-path /path/to/hoodie_table \
+  --hoodie-conf hoodie.cleaner.policy=KEEP_LATEST_BY_HOURS \
+  --hoodie-conf hoodie.cleaner.hours.retained=24 \
+  --hoodie-conf hoodie.cleaner.parallelism=200
+```
+Note: The parallelism used is the minimum of the number of partitions to clean and `hoodie.cleaner.parallelism`.

### Run Asynchronously
In case you wish to run the cleaner service asynchronously with writing, please configure the below:
@@ -57,4 +101,9 @@ CLI provides the below commands for cleaner service:
- `clean showpartitions`
- `cleans run`

+Example of the cleaner keeping the latest 10 commits:
+```
+cleans run --sparkMaster local --hoodieConfigs hoodie.cleaner.policy=KEEP_LATEST_COMMITS,hoodie.cleaner.commits.retained=10,hoodie.cleaner.parallelism=200
+```
+
You can find more details and the relevant code for these commands in the [`org.apache.hudi.cli.commands.CleansCommand`](https://github.com/apache/hudi/blob/master/hudi-cli/src/main/java/org/apache/hudi/cli/commands/CleansCommand.java) class.
