diff --git a/website/docs/clustering.md b/website/docs/clustering.md
index bccba93f8ac45..58c237fa1531a 100644
--- a/website/docs/clustering.md
+++ b/website/docs/clustering.md
@@ -161,11 +161,11 @@ Users can leverage [HoodieClusteringJob](https://cwiki.apache.org/confluence/dis
 to setup 2-step asynchronous clustering.
 
 ### HoodieClusteringJob
-With the release of Hudi version 0.9.0, we can schedule as well as execute clustering in the same step. We just need to
-specify the `—mode` or `-m` option. There are three modes:
+By specifying the `scheduleAndExecute` mode, both scheduling and execution of clustering can be achieved in the same step.
+The appropriate mode can be specified using the `--mode` or `-m` option. There are three modes:
 
 1. `schedule`: Make a clustering plan. This gives an instant which can be passed in execute mode.
-2. `execute`: Execute a clustering plan at given instant which means --instant-time is required here.
+2. `execute`: Execute a clustering plan at a particular instant. If no instant-time is specified, HoodieClusteringJob will execute the plan for the earliest instant on the Hudi timeline.
 3. `scheduleAndExecute`: Make a clustering plan first and execute that plan immediately.
 
 Note that to run this job while the original writer is still running, please enable multi-writing:
diff --git a/website/docs/compaction.md b/website/docs/compaction.md
index c3110aaeeaa08..70fba50361384 100644
--- a/website/docs/compaction.md
+++ b/website/docs/compaction.md
@@ -5,11 +5,7 @@ toc: true
 last_modified_at:
 ---
 
-For Merge-On-Read table, data is stored using a combination of columnar (e.g parquet) + row based (e.g avro) file formats.
-Updates are logged to delta files & later compacted to produce new versions of columnar files synchronously or
-asynchronously. One of the main motivations behind Merge-On-Read is to reduce data latency when ingesting records.
-Hence, it makes sense to run compaction asynchronously without blocking ingestion.
-
+Compaction is executed asynchronously with Hudi by default.
 
 ## Async Compaction
 
@@ -19,15 +15,13 @@ Async Compaction is performed in 2 steps:
    slices** to be compacted. A compaction plan is finally written to Hudi timeline.
 1. ***Compaction Execution***: A separate process reads the compaction plan and performs compaction of file slices.
 
-
 ## Deployment Models
 
 There are few ways by which we can execute compactions asynchronously.
 
 ### Spark Structured Streaming
 
-With 0.6.0, we now have support for running async compactions in Spark
-Structured Streaming jobs. Compactions are scheduled and executed asynchronously inside the
+Compactions are scheduled and executed asynchronously inside the
 streaming job. Async Compactions are enabled by default for structured streaming jobs
 on Merge-On-Read table.
 
@@ -74,22 +68,44 @@ spark-submit --packages org.apache.hudi:hudi-utilities-bundle_2.11:0.6.0 \
 --continous
 ```
 
+### Hudi Compactor Utility
+Hudi provides a standalone tool to execute specific compactions asynchronously. Below is an example, and you can read more in the [deployment guide](/docs/deployment#compactions).
+
+Example:
+```properties
+spark-submit --packages org.apache.hudi:hudi-utilities-bundle_2.11:0.6.0 \
+--class org.apache.hudi.utilities.HoodieCompactor \
+--base-path <base_path> \
+--table-name <table_name> \
+--schema-file <schema_file> \
+--instant-time <compaction_instant>
+```
+
+Note that the `instant-time` parameter is now optional for the Hudi Compactor Utility. If the utility is run without `--instant-time`,
+the spark-submit will execute the earliest scheduled compaction on the Hudi timeline.
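+
+For example, a minimal sketch of the same invocation relying on this default (same assumed placeholders as above, with `--instant-time` simply omitted):
+
+```properties
+spark-submit --packages org.apache.hudi:hudi-utilities-bundle_2.11:0.6.0 \
+--class org.apache.hudi.utilities.HoodieCompactor \
+--base-path <base_path> \
+--table-name <table_name> \
+--schema-file <schema_file>
+```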
+
 ### Hudi CLI
-Hudi CLI is yet another way to execute specific compactions asynchronously. Here is an example
+Hudi CLI is yet another way to execute specific compactions asynchronously. Here is an example, and you can read more in the [CLI guide](/docs/cli#compactions).
+
+Example:
 ```properties
 hudi:trips->compaction run --tableName <table_name> --parallelism <parallelism> --compactionInstant <compactionInstant> ...
 ```
 
-### Hudi Compactor Script
-Hudi provides a standalone tool to also execute specific compactions asynchronously. Below is an example and you can read more in the [deployment guide](/docs/next/deployment#compactions)
+## Synchronous Compaction
+By default, compaction is run asynchronously.
 
-```properties
-spark-submit --packages org.apache.hudi:hudi-utilities-bundle_2.11:0.6.0 \
---class org.apache.hudi.utilities.HoodieCompactor \
---base-path \
---table-name \
---instant-time \
---schema-file
-```
+If latency of ingesting records is important for you, you are most likely using Merge-On-Read tables.
+Merge-On-Read tables store data using a combination of columnar (e.g. parquet) + row-based (e.g. avro) file formats.
+Updates are logged to delta files & later compacted to produce new versions of columnar files.
+To improve ingestion latency, Async Compaction is the default configuration.
+
+If immediate read performance of a new commit is important for you, or you want the simplicity of not managing separate compaction jobs,
+you may want synchronous compaction, which means that as a commit is written it is also compacted by the same job.
+
+Compaction is run synchronously by passing the `--disable-compaction` flag (meaning async compaction scheduling is disabled).
+When both ingestion and compaction are running in the same Spark context, you can use resource allocation configuration
+in the DeltaStreamer CLI (such as `--delta-sync-scheduling-weight`,
+`--compact-scheduling-weight`, `--delta-sync-scheduling-minshare`, and `--compact-scheduling-minshare`)
+to control executor allocation between ingestion and compaction.
diff --git a/website/docs/hoodie_deltastreamer.md b/website/docs/hoodie_deltastreamer.md
index a97f1cb5e6aa9..c1475002a7174 100644
--- a/website/docs/hoodie_deltastreamer.md
+++ b/website/docs/hoodie_deltastreamer.md
@@ -209,3 +209,8 @@ A deltastreamer job can then be triggered as follows:
 ```
 
 Read more in depth about concurrency control in the [concurrency control concepts](/docs/concurrency_control) section
+
+## Hudi Kafka Connect Sink
+If you want to perform streaming ingestion into Hudi format similar to HoodieDeltaStreamer, but you don't want to depend on Spark,
+try out the new experimental release of the Hudi Kafka Connect Sink. Read the [README](https://github.com/apache/hudi/tree/master/hudi-kafka-connect)
+for full documentation.
\ No newline at end of file
diff --git a/website/versioned_docs/version-0.9.0/cli.md b/website/versioned_docs/version-0.9.0/cli.md
new file mode 100644
index 0000000000000..64f0c1eb7fdd7
--- /dev/null
+++ b/website/versioned_docs/version-0.9.0/cli.md
@@ -0,0 +1,357 @@
---
title: CLI
keywords: [hudi, cli]
last_modified_at: 2021-08-18T15:59:57-04:00
---

Once hudi has been built, the shell can be fired up via `cd hudi-cli && ./hudi-cli.sh`. A hudi table resides on DFS, in a location referred to as the `basePath`, and
we need this location in order to connect to a Hudi table. The Hudi library effectively manages this table internally, using a `.hoodie` subfolder to track all metadata.

To initialize a hudi table, use the following command.

```java
===================================================================
* ___ ___ *
* /\__\ ___ /\ \ ___ *
* / / / /\__\ / \ \ /\ \ *
* / /__/ / / / / /\ \ \ \ \ \ *
* / \ \ ___ / / / / / \ \__\ / \__\ *
* / /\ \ /\__\ / /__/ ___ / /__/ \ |__| / /\/__/ *
* \/ \ \/ / / \ \ \ /\__\ \ \ \ / / / /\/ / / *
* \ / / \ \ / / / \ \ / / / \ /__/ *
* / / / \ \/ / / \ \/ / / \ \__\ *
* / / / \ / / \ / / \/__/ *
* \/__/ \/__/ \/__/ Apache Hudi CLI *
* *
===================================================================

hudi->create --path /user/hive/warehouse/table1 --tableName hoodie_table_1 --tableType COPY_ON_WRITE
.....
```

To see the description of a hudi table, use the command:

```java
hudi:hoodie_table_1->desc
18/09/06 15:57:19 INFO timeline.HoodieActiveTimeline: Loaded instants []
 _________________________________________________________
 | Property | Value |
 |========================================================|
 | basePath | ... |
 | metaPath | ... |
 | fileSystem | hdfs |
 | hoodie.table.name | hoodie_table_1 |
 | hoodie.table.type | COPY_ON_WRITE |
 | hoodie.archivelog.folder| |
```

Following is a sample command to connect to a Hudi table that contains uber trips.

```java
hudi:trips->connect --path /app/uber/trips

16/10/05 23:20:37 INFO model.HoodieTableMetadata: All commits :HoodieCommits{commitList=[20161002045850, 20161002052915, 20161002055918, 20161002065317, 20161002075932, 20161002082904, 20161002085949, 20161002092936, 20161002105903, 20161002112938, 20161002123005, 20161002133002, 20161002155940, 20161002165924, 20161002172907, 20161002175905, 20161002190016, 20161002192954, 20161002195925, 20161002205935, 20161002215928, 20161002222938, 20161002225915, 20161002232906, 20161003003028, 20161003005958, 20161003012936, 20161003022924, 20161003025859, 20161003032854, 20161003042930, 20161003052911, 20161003055907, 20161003062946, 20161003065927, 20161003075924, 20161003082926, 20161003085925, 20161003092909, 20161003100010, 20161003102913, 20161003105850, 20161003112910, 20161003115851, 20161003122929, 20161003132931, 20161003142952, 20161003145856, 20161003152953, 20161003155912, 20161003162922, 20161003165852, 20161003172923, 20161003175923, 20161003195931, 20161003210118, 20161003212919, 20161003215928, 20161003223000, 20161003225858, 20161004003042, 20161004011345, 20161004015235, 20161004022234, 20161004063001, 20161004072402, 20161004074436, 20161004080224, 20161004082928, 20161004085857, 20161004105922, 20161004122927, 20161004142929, 20161004163026, 20161004175925, 20161004194411, 20161004203202, 20161004211210, 20161004214115, 20161004220437, 20161004223020, 20161004225321, 20161004231431, 20161004233643, 20161005010227, 20161005015927, 20161005022911, 20161005032958, 20161005035939, 20161005052904, 20161005070028, 20161005074429, 20161005081318, 20161005083455, 20161005085921, 20161005092901, 20161005095936, 20161005120158, 20161005123418, 20161005125911, 20161005133107, 20161005155908, 20161005163517, 20161005165855, 20161005180127, 20161005184226, 20161005191051, 20161005193234, 20161005203112, 20161005205920, 20161005212949, 20161005223034, 20161005225920]}
Metadata for table trips loaded
```

Once connected to the table, a lot of other commands become available. The shell has contextual autocomplete help (press TAB), and below is a list of all commands, a few of which are reviewed in this section.

```java
hudi:trips->help
* ! 
- Allows execution of operating system (OS) commands
* // - Inline comment markers (start of line only)
* ; - Inline comment markers (start of line only)
* addpartitionmeta - Add partition metadata to a table, if not present
* clear - Clears the console
* cls - Clears the console
* commit rollback - Rollback a commit
* commits compare - Compare commits with another Hoodie table
* commit showfiles - Show file level details of a commit
* commit showpartitions - Show partition level details of a commit
* commits refresh - Refresh the commits
* commits show - Show the commits
* commits sync - Compare commits with another Hoodie table
* connect - Connect to a hoodie table
* date - Displays the local date and time
* exit - Exits the shell
* help - List all commands usage
* quit - Exits the shell
* records deduplicate - De-duplicate a partition path contains duplicates & produce repaired files to replace with
* script - Parses the specified resource file and executes its commands
* stats filesizes - File Sizes. Display summary stats on sizes of files
* stats wa - Write Amplification. Ratio of how many records were upserted to how many records were actually written
* sync validate - Validate the sync by counting the number of records
* system properties - Shows the shell's properties
* utils loadClass - Load a class
* version - Displays shell version

hudi:trips->
```


### Inspecting Commits

The task of upserting or inserting a batch of incoming records is known as a **commit** in Hudi. A commit provides basic atomicity guarantees such that only committed data is available for querying.
Each commit has a monotonically increasing string/number called the **commit number**. Typically, this is the time at which we started the commit.

To view some basic information about the last 10 commits,


```java
hudi:trips->commits show --sortBy "Total Bytes Written" --desc true --limit 10
 ________________________________________________________________________________________________________________________________________________________________________
 | CommitTime | Total Bytes Written| Total Files Added| Total Files Updated| Total Partitions Written| Total Records Written| Total Update Records Written| Total Errors|
 |=======================================================================================================================================================================|
 ....
 ....
 ....
```

At the start of each write, Hudi also writes a .inflight commit to the .hoodie folder. You can use the timestamp there to estimate how long the commit has been inflight.


```java
$ hdfs dfs -ls /app/uber/trips/.hoodie/*.inflight
-rw-r--r-- 3 vinoth supergroup 321984 2016-10-05 23:18 /app/uber/trips/.hoodie/20161005225920.inflight
```


### Drilling Down to a specific Commit

To understand how the writes are spread across specific partitions,


```java
hudi:trips->commit showpartitions --commit 20161005165855 --sortBy "Total Bytes Written" --desc true --limit 10
 __________________________________________________________________________________________________________________________________________
 | Partition Path| Total Files Added| Total Files Updated| Total Records Inserted| Total Records Updated| Total Bytes Written| Total Errors|
 |=========================================================================================================================================|
 ....
 .... 
```

If you need file level granularity, we can do the following


```java
hudi:trips->commit showfiles --commit 20161005165855 --sortBy "Partition Path"
 ________________________________________________________________________________________________________________________________________________________
 | Partition Path| File ID | Previous Commit| Total Records Updated| Total Records Written| Total Bytes Written| Total Errors|
 |=======================================================================================================================================================|
 ....
 ....
```


### FileSystem View

Hudi views each partition as a collection of file-groups, with each file-group containing a list of file-slices in commit order (see concepts).
The below commands allow users to view the file-slices for a data-set.

```java
hudi:stock_ticks_mor->show fsview all
 ....
 _______________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________
 | Partition | FileId | Base-Instant | Data-File | Data-File Size| Num Delta Files| Total Delta File Size| Delta Files |
 |==============================================================================================================================================================================================================================================================================================================================================================================================================|
 | 2018/08/31| 111415c3-f26d-4639-86c8-f9956f245ac3| 20181002180759| hdfs://namenode:8020/user/hive/warehouse/stock_ticks_mor/2018/08/31/111415c3-f26d-4639-86c8-f9956f245ac3_0_20181002180759.parquet| 432.5 KB | 1 | 20.8 KB | [HoodieLogFile {hdfs://namenode:8020/user/hive/warehouse/stock_ticks_mor/2018/08/31/.111415c3-f26d-4639-86c8-f9956f245ac3_20181002180759.log.1}]|



hudi:stock_ticks_mor->show fsview latest --partitionPath "2018/08/31"
 ......
 __________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________
 | Partition | FileId | Base-Instant | Data-File | Data-File Size| Num Delta Files| Total Delta Size| Delta Size - compaction scheduled| Delta Size - compaction unscheduled| Delta To Base Ratio - compaction scheduled| Delta To Base Ratio - compaction unscheduled| Delta Files - compaction scheduled | Delta Files - compaction unscheduled|
 |=================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================|
 | 2018/08/31| 111415c3-f26d-4639-86c8-f9956f245ac3| 20181002180759| hdfs://namenode:8020/user/hive/warehouse/stock_ticks_mor/2018/08/31/111415c3-f26d-4639-86c8-f9956f245ac3_0_20181002180759.parquet| 432.5 KB | 1 | 20.8 KB | 20.8 KB | 0.0 B | 0.0 B | 0.0 B | [HoodieLogFile {hdfs://namenode:8020/user/hive/warehouse/stock_ticks_mor/2018/08/31/.111415c3-f26d-4639-86c8-f9956f245ac3_20181002180759.log.1}]| [] |

```


### Statistics

Since Hudi directly manages file sizes for tables on DFS, it is good to get an overall picture


```java
hudi:trips->stats filesizes --partitionPath 2016/09/01 --sortBy "95th" --desc true --limit 10
 ________________________________________________________________________________________________
 | CommitTime | Min | 10th | 50th | avg | 95th | Max | NumFiles| StdDev |
 |===============================================================================================|
 | | 93.9 MB | 93.9 MB | 93.9 MB | 93.9 MB | 93.9 MB | 93.9 MB | 2 | 2.3 KB |
 ....
 ....
```

If a Hudi write starts taking much longer, it is good to check the write amplification for any sudden increases


```java
hudi:trips->stats wa
 __________________________________________________________________________
 | CommitTime | Total Upserted| Total Written| Write Amplifiation Factor|
 |=========================================================================|
 ....
 ....
```


### Archived Commits

To limit the growth of .commit files on DFS, Hudi archives older .commit files (with due respect to the cleaner policy) into a commits.archived file.
This is a sequence file that contains a mapping from commitNumber => json, with raw information about the commit (the same information that is nicely rolled up above).


### Compactions

To get an idea of the lag between compaction and writer applications, use the below command to list all
pending compactions.

```java
hudi:trips->compactions show all
 ___________________________________________________________________
 | Compaction Instant Time| State | Total FileIds to be Compacted|
 |==================================================================|
 | <INSTANT_1> | REQUESTED| 35 |
 | <INSTANT_2> | INFLIGHT | 27 |
```

To inspect a specific compaction plan, use

```java
hudi:trips->compaction show --instant <INSTANT>
 _________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________
 | Partition Path| File Id | Base Instant | Data File Path | Total Delta Files| getMetrics |
 |================================================================================================================================================================================================================================================
 | 2018/07/17 | | | viewfs://ns-default/.../../UUID_.parquet | 1 | {TOTAL_LOG_FILES=1.0, TOTAL_IO_READ_MB=1230.0, TOTAL_LOG_FILES_SIZE=2.51255751E8, TOTAL_IO_WRITE_MB=991.0, TOTAL_IO_MB=2221.0}|

```

To manually schedule or run a compaction, use the below command. This command uses the Spark launcher to perform compaction
operations.

**NOTE:** Make sure no other application is scheduling compaction for this table concurrently
{: .notice--info}

```java
hudi:trips->help compaction schedule
Keyword: compaction schedule
Description: Schedule Compaction
 Keyword: sparkMemory
 Help: Spark executor memory
 Mandatory: false
 Default if specified: '__NULL__'
 Default if unspecified: '1G'

* compaction schedule - Schedule Compaction
```

```java
hudi:trips->help compaction run
Keyword: compaction run
Description: Run Compaction for given instant time
 Keyword: tableName
 Help: Table name
 Mandatory: true
 Default if specified: '__NULL__'
 Default if unspecified: '__NULL__'

 Keyword: parallelism
 Help: Parallelism for hoodie compaction
 Mandatory: true
 Default if specified: '__NULL__'
 Default if unspecified: '__NULL__'

 Keyword: schemaFilePath
 Help: Path for Avro schema file
 Mandatory: true
 Default if specified: '__NULL__'
 Default if unspecified: '__NULL__'

 Keyword: sparkMemory
 Help: Spark executor memory
 Mandatory: true
 Default if specified: '__NULL__'
 Default if unspecified: '__NULL__'

 Keyword: retry
 Help: Number of retries
 Mandatory: true
 Default if specified: '__NULL__'
 Default if unspecified: '__NULL__'

 Keyword: compactionInstant
 Help: Base path for the target hoodie table
 Mandatory: true
 Default if specified: '__NULL__'
 Default if unspecified: '__NULL__'

* compaction run - Run Compaction for given instant time
```

### Validate Compaction

Validating a compaction plan: Check if all the files necessary for compactions are present and are valid

```java
hudi:stock_ticks_mor->compaction validate --instant 20181005222611
...

  COMPACTION PLAN VALID

 ___________________________________________________________________________________________________________________________________________________________________________________________________________________________
 | File Id | Base Instant Time| Base Data File | Num Delta Files| Valid| Error|
 |==========================================================================================================================================================================================================================|
 | 05320e98-9a57-4c38-b809-a6beaaeb36bd| 20181005222445 | hdfs://namenode:8020/user/hive/warehouse/stock_ticks_mor/2018/08/31/05320e98-9a57-4c38-b809-a6beaaeb36bd_0_20181005222445.parquet| 1 | true | |



hudi:stock_ticks_mor->compaction validate --instant 20181005222601

  COMPACTION PLAN INVALID

 _______________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________
 | File Id | Base Instant Time| Base Data File | Num Delta Files| Valid| Error |
 |=====================================================================================================================================================================================================================================================================================================|
 | 05320e98-9a57-4c38-b809-a6beaaeb36bd| 20181005222445 | hdfs://namenode:8020/user/hive/warehouse/stock_ticks_mor/2018/08/31/05320e98-9a57-4c38-b809-a6beaaeb36bd_0_20181005222445.parquet| 1 | false| All log files specified in compaction operation is not present. Missing .... |
```

**NOTE:** The following commands must be executed without any other writer/ingestion application running.
{: .notice--warning}

Sometimes, it becomes necessary to remove a fileId from a compaction plan in order to speed up or unblock a compaction
operation. Any new log files written to this file after the compaction was scheduled will be safely renamed
so that they are preserved. Hudi provides the following CLI to support this:


### Unscheduling Compaction

```java
hudi:trips->compaction unscheduleFileId --fileId <FileUUID>
....
No File renames needed to unschedule file from pending compaction. Operation successful.
```

In other cases, an entire compaction plan needs to be reverted. This is supported by the following CLI:

```java
hudi:trips->compaction unschedule --compactionInstant <compactionInstant>
.....
No File renames needed to unschedule pending compaction. Operation successful.
```

### Repair Compaction

The above compaction unscheduling operations could sometimes fail partially (e.g. DFS temporarily unavailable). With
partial failures, the compaction operation could become inconsistent with the state of file-slices. When you run
`compaction validate`, you can notice invalid compaction operations if there are any. In these cases, the repair
command comes to the rescue: it will rearrange the file-slices so that there is no loss and the file-slices are
consistent with the compaction plan.

```java
hudi:stock_ticks_mor->compaction repair --instant 20181005222611
......
Compaction successfully repaired
.....
```
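
As a sanity check, you can re-run `compaction validate` after the repair (a sketch, assuming the same instant as in the example above); a successfully repaired plan should now be reported as valid:

```java
hudi:stock_ticks_mor->compaction validate --instant 20181005222611
...

  COMPACTION PLAN VALID
```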