28 changes: 27 additions & 1 deletion docs-new/_config.yml
@@ -11,7 +11,10 @@ hudi_style_skin : "hudi"
version : &version "0.5.1-SNAPSHOT"

previous_docs:
latest: /docs/quick-start-guide.html
- version: latest
en: /docs/quick-start-guide.html
cn: /cn/docs/quick-start-guide.html

# 0.5.0-incubating: /versions/0.5.0-incubating/docs/quick-start-guide.html


@@ -53,6 +56,29 @@ author:
icon: "fa fa-navicon"
url: "https://issues.apache.org/jira/projects/HUDI/summary"

cn_author:
name : "Quick Links"
bio : "Hudi *ingests* & *manages* storage of large analytical datasets over DFS."
links:
- label: "Documentation"
icon: "fa fa-book"
url: "/cn/docs/quick-start-guide"
- label: "Technical Wiki"
icon: "fa fa-wikipedia-w"
url: "https://cwiki.apache.org/confluence/display/HUDI"
- label: "Contribution Guide"
icon: "fa fa-thumbs-o-up"
url: "/cn/contributing"
- label: "Join on Slack"
icon: "fa fa-slack"
url: "https://join.slack.com/t/apache-hudi/shared_invite/enQtODYyNDAxNzc5MTg2LTE5OTBlYmVhYjM0N2ZhOTJjOWM4YzBmMWU2MjZjMGE4NDc5ZDFiOGQ2N2VkYTVkNzU3ZDQ4OTI1NmFmYWQ0NzE"
- label: "Fork on GitHub"
icon: "fa fa-github"
url: "https://github.com/apache/incubator-hudi"
- label: "Report Issues"
icon: "fa fa-navicon"
url: "https://issues.apache.org/jira/projects/HUDI/summary"


# Layout Defaults
defaults:
55 changes: 51 additions & 4 deletions docs-new/_data/navigation.yml
@@ -11,17 +11,13 @@ main:
url: https://cwiki.apache.org/confluence/display/HUDI/FAQ
- title: "Releases"
url: /releases.html
# - title: "Roadmap"
# url: /roadmap.html

# doc links
docs:
- title: Getting Started
children:
- title: "Quick Start"
url: /docs/quick-start-guide.html
# - title: "Structure"
# url: /docs/structure.html
- title: "Use Cases"
url: /docs/use_cases.html
- title: "Talks & Powered By"
@@ -51,3 +47,54 @@ docs:
- title: "Privacy Policy"
url: /docs/privacy.html

cn_main:
- title: "文档"
url: /cn/docs/quick-start-guide.html
- title: "社区"
url: /cn/community.html
- title: "动态"
url: /cn/activity.html
- title: "FAQ"
url: https://cwiki.apache.org/confluence/display/HUDI/FAQ
- title: "发布"
url: /cn/releases.html

# doc links
cn_docs:
- title: 入门指南
children:
- title: "快速开始"
url: /cn/docs/quick-start-guide.html
- title: "使用案例"
url: /cn/docs/use_cases.html
- title: "演讲 & hudi 用户"
url: /cn/docs/powered_by.html
- title: "对比"
url: /cn/docs/comparison.html
- title: "Docker 示例"
url: /cn/docs/docker_demo.html
- title: 帮助文档
children:
- title: "概念"
url: /cn/docs/concepts.html
- title: "写入数据"
url: /cn/docs/writing_data.html
- title: "查询数据"
url: /cn/docs/querying_data.html
- title: "配置"
url: /cn/docs/configurations.html
- title: "性能"
url: /cn/docs/performance.html
- title: "管理"
url: /cn/docs/admin_guide.html
- title: 其他信息
children:
- title: "文档版本"
url: /cn/docs/docs-versions.html
- title: "版权信息"
url: /cn/docs/privacy.html





82 changes: 82 additions & 0 deletions docs-new/_docs/0_1_s3_filesystem.cn.md
@@ -0,0 +1,82 @@
---
title: S3 Filesystem
keywords: hudi, hive, aws, s3, spark, presto
permalink: /cn/docs/s3_hoodie.html
summary: In this page, we go over how to configure Hudi with S3 filesystem.
last_modified_at: 2019-12-30T15:59:57-04:00
language: cn
---
In this page, we explain how to get your Hudi Spark job to store data in AWS S3.

## AWS configs

There are two configurations required for Hudi-S3 compatibility:

- Adding AWS Credentials for Hudi
- Adding required Jars to classpath

### AWS Credentials

The simplest way to use Hudi with S3 is to configure your `SparkSession` or `SparkContext` with S3 credentials; Hudi will automatically pick them up and talk to S3.
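
As a minimal sketch (assuming a Spark application or spark-shell; the environment variable names are assumptions, not part of this guide), the credentials can be set on the session's Hadoop configuration:

```scala
// Minimal sketch: put S3 credentials on the SparkSession's Hadoop configuration so Hudi's
// filesystem calls can reach the bucket. The environment variable names are assumptions.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("hudi-on-s3")
  .getOrCreate()

val hadoopConf = spark.sparkContext.hadoopConfiguration
hadoopConf.set("fs.s3a.access.key", sys.env("AWS_ACCESS_KEY_ID"))
hadoopConf.set("fs.s3a.secret.key", sys.env("AWS_SECRET_ACCESS_KEY"))
```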

Alternatively, add the required configs to your core-site.xml, from where Hudi can fetch them. Replace `fs.defaultFS` with your S3 bucket name, and Hudi should be able to read from and write to the bucket.

```xml
<property>
<name>fs.defaultFS</name>
<value>s3://ysharma</value>
</property>

<property>
<name>fs.s3.impl</name>
<value>org.apache.hadoop.fs.s3native.NativeS3FileSystem</value>
</property>

<property>
<name>fs.s3.awsAccessKeyId</name>
<value>AWS_KEY</value>
</property>

<property>
<name>fs.s3.awsSecretAccessKey</name>
<value>AWS_SECRET</value>
</property>

<property>
<name>fs.s3n.awsAccessKeyId</name>
<value>AWS_KEY</value>
</property>

<property>
<name>fs.s3n.awsSecretAccessKey</name>
<value>AWS_SECRET</value>
</property>
```


Utilities such as the hudi-cli or the DeltaStreamer tool can pick up S3 credentials via environment variables prefixed with `HOODIE_ENV_`. For example, below is a bash snippet that sets up such variables so the CLI can work on datasets stored in S3:

```bash
export HOODIE_ENV_fs_DOT_s3a_DOT_access_DOT_key=$accessKey
export HOODIE_ENV_fs_DOT_s3a_DOT_secret_DOT_key=$secretKey
export HOODIE_ENV_fs_DOT_s3_DOT_awsAccessKeyId=$accessKey
export HOODIE_ENV_fs_DOT_s3_DOT_awsSecretAccessKey=$secretKey
export HOODIE_ENV_fs_DOT_s3n_DOT_awsAccessKeyId=$accessKey
export HOODIE_ENV_fs_DOT_s3n_DOT_awsSecretAccessKey=$secretKey
export HOODIE_ENV_fs_DOT_s3n_DOT_impl=org.apache.hadoop.fs.s3a.S3AFileSystem
```



### AWS Libs

AWS Hadoop libraries to add to the classpath:

- com.amazonaws:aws-java-sdk:1.10.34
- org.apache.hadoop:hadoop-aws:2.7.3

AWS Glue Data Catalog libraries are needed if the AWS Glue Data Catalog is used:

- com.amazonaws.glue:aws-glue-datacatalog-hive2-client:1.11.0
- com.amazonaws:aws-java-sdk-glue:1.11.475
7 changes: 6 additions & 1 deletion docs-new/_docs/0_1_s3_filesystem.md
@@ -56,7 +56,7 @@ Alternatively, add the required configs in your core-site.xml from where Hudi ca
Utilities such as the hudi-cli or the DeltaStreamer tool can pick up S3 credentials via environment variables prefixed with `HOODIE_ENV_`. For example, below is a bash snippet that sets up such variables so the CLI can work on datasets stored in S3:

```Java
```java
export HOODIE_ENV_fs_DOT_s3a_DOT_access_DOT_key=$accessKey
export HOODIE_ENV_fs_DOT_s3a_DOT_secret_DOT_key=$secretKey
export HOODIE_ENV_fs_DOT_s3_DOT_awsAccessKeyId=$accessKey
@@ -74,3 +74,8 @@ AWS hadoop libraries to add to our classpath

- com.amazonaws:aws-java-sdk:1.10.34
- org.apache.hadoop:hadoop-aws:2.7.3

AWS Glue Data Catalog libraries are needed if the AWS Glue Data Catalog is used:

- com.amazonaws.glue:aws-glue-datacatalog-hive2-client:1.11.0
- com.amazonaws:aws-java-sdk-glue:1.11.475
62 changes: 62 additions & 0 deletions docs-new/_docs/0_2_gcs_filesystem.cn.md
@@ -0,0 +1,62 @@
---
title: GCS Filesystem
keywords: hudi, hive, google cloud, storage, spark, presto
permalink: /cn/docs/gcs_hoodie.html
summary: In this page, we go over how to configure hudi with Google Cloud Storage.
last_modified_at: 2019-12-30T15:59:57-04:00
language: cn
---
For Hudi storage on GCS, **regional** buckets provide a DFS API with strong consistency.

## GCS Configs

There are two configurations required for Hudi GCS compatibility:

- Adding GCS Credentials for Hudi
- Adding required jars to classpath

### GCS Credentials

Add the required configs to your core-site.xml, from where Hudi can fetch them. Replace `fs.defaultFS` with your GCS bucket name, and Hudi should be able to read from and write to the bucket.

```xml
<property>
<name>fs.defaultFS</name>
<value>gs://hudi-bucket</value>
</property>

<property>
<name>fs.gs.impl</name>
<value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem</value>
<description>The FileSystem for gs: (GCS) uris.</description>
</property>

<property>
<name>fs.AbstractFileSystem.gs.impl</name>
<value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS</value>
<description>The AbstractFileSystem for gs: (GCS) uris.</description>
</property>

<property>
<name>fs.gs.project.id</name>
<value>GCS_PROJECT_ID</value>
</property>
<property>
<name>google.cloud.auth.service.account.enable</name>
<value>true</value>
</property>
<property>
<name>google.cloud.auth.service.account.email</name>
<value>GCS_SERVICE_ACCOUNT_EMAIL</value>
</property>
<property>
<name>google.cloud.auth.service.account.keyfile</name>
<value>GCS_SERVICE_ACCOUNT_KEYFILE</value>
</property>
```

### GCS Libs

GCS Hadoop libraries to add to the classpath:

- com.google.cloud.bigdataoss:gcs-connector:1.6.0-hadoop2
73 changes: 73 additions & 0 deletions docs-new/_docs/0_3_migration_guide.cn.md
@@ -0,0 +1,73 @@
---
title: Migration Guide
keywords: hudi, migration, use case
permalink: /cn/docs/migration_guide.html
summary: In this page, we will discuss some available tools for migrating your existing dataset into a Hudi dataset
last_modified_at: 2019-12-30T15:59:57-04:00
language: cn
---

Hudi maintains metadata such as the commit timeline and indexes to manage a dataset. The commit timeline helps in understanding the actions happening on a dataset as well as its current state. Hudi uses indexes to maintain a record-key-to-file-id mapping, which lets it locate a record efficiently. At the moment, Hudi supports writing only the parquet columnar format.
To start using Hudi with an existing dataset, you will need to migrate it into a Hudi-managed dataset. There are a couple of ways to achieve this.


## Approaches


### Use Hudi for new partitions alone

Hudi can be used to manage an existing dataset without affecting/altering the historical data already present in the
dataset. Hudi is compatible with such a mixed dataset, with the caveat that a given Hive partition is either completely
Hudi-managed or not managed at all; thus the lowest granularity at which Hudi manages a dataset is a Hive partition.
Start using the datasource API or the WriteClient to write to the dataset, and make sure you either write to a new
partition or convert your last N partitions into Hudi, rather than the entire table. Note that since the historical
partitions are not managed by Hudi, none of the primitives provided by Hudi work on the data in those partitions; more concretely, you cannot perform upserts or incremental pulls on such older, non-Hudi-managed partitions.
Take this approach if your dataset is append-only and you do not expect to perform any updates to existing (non-Hudi-managed) partitions. A partition-scoped write is sketched below.
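
A hedged sketch of such a partition-scoped write through the Spark datasource API (the DataFrame, paths, and field names are placeholders, not values prescribed by this guide):

```scala
// A sketch (assumes an active SparkSession `spark`, e.g. spark-shell): write only the new
// partition through the datasource API, leaving historical (non-Hudi) partitions untouched.
// Paths and field names (_row_key, partitionStr, ts) are placeholders, not prescribed values.
import org.apache.spark.sql.SaveMode

val newPartitionDF = spark.read.format("parquet")
  .load("/data/source_table/new_partition")                             // hypothetical source path

newPartitionDF.write
  .format("org.apache.hudi")
  .option("hoodie.table.name", "hoodie_table")
  .option("hoodie.datasource.write.recordkey.field", "_row_key")
  .option("hoodie.datasource.write.partitionpath.field", "partitionStr")
  .option("hoodie.datasource.write.precombine.field", "ts")             // assumed to exist in the data
  .mode(SaveMode.Append)                                                // append: existing partitions stay as-is
  .save("/user/hoodie/dataset/basepath")                                // Hudi base path (placeholder)
```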


### Convert existing dataset to Hudi

Import your existing dataset into a Hudi-managed dataset. Since all the data is Hudi-managed, none of the limitations
of Approach 1 apply here. Updates spanning any partitions can be applied to this dataset, and Hudi will efficiently
make them available to queries. Note that not only do you get to use all Hudi primitives on this dataset,
there are additional advantages to doing this. Hudi automatically manages file sizes of a Hudi-managed dataset:
you can define the desired file size when converting the dataset, and Hudi will ensure it writes out files
adhering to that config. It will also correct smaller files later by routing some new inserts into them rather than
writing new small ones, thus maintaining the health of your cluster. The sizing configs are sketched below.
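
As a sketch of the sizing knobs referred to above (the keys are standard Hudi write configs; the byte values are purely illustrative):

```scala
// Sketch: the two file-sizing configs most relevant during conversion. Values are illustrative.
val sizingOpts = Map(
  "hoodie.parquet.max.file.size"    -> (128 * 1024 * 1024).toString, // target size for written files
  "hoodie.parquet.small.file.limit" -> (100 * 1024 * 1024).toString  // files below this attract new inserts
)
// e.g. inputDF.write.format("org.apache.hudi").options(sizingOpts) /* plus the usual key/partition options */ .save(basePath)
```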

There are a few options when choosing this approach.

**Option 1**
Use the HDFSParquetImporter tool. As the name suggests, this only works if your existing dataset is in the parquet file format.
This tool essentially starts a Spark job that reads the existing parquet dataset and converts it into a Hudi-managed dataset by re-writing all the data.

**Option 2**
For huge datasets, this could be as simple as:
```scala
for (partition <- listOfPartitionsInSourceDataset) {   // pseudocode: iterate over the source partitions
  val inputDF = spark.read.format("any_input_format").load("partition_path")
  inputDF.write.format("org.apache.hudi").option()....save("basePath")
}
```

**Option 3**
Write your own custom logic for loading an existing dataset into a Hudi-managed one. Please read about the RDD API
[here](/cn/docs/quick-start-guide.html). To use the HDFSParquetImporter tool: once Hudi has been built via `mvn clean install -DskipTests`, the hudi-cli shell can be
fired up via `cd hudi-cli && ./hudi-cli.sh`.

```java
hudi->hdfsparquetimport
--upsert false
--srcPath /user/parquet/dataset/basepath
--targetPath
/user/hoodie/dataset/basepath
--tableName hoodie_table
--tableType COPY_ON_WRITE
--rowKeyField _row_key
--partitionPathField partitionStr
--parallelism 1500
--schemaFilePath /user/table/schema
--format parquet
--sparkMemory 6g
--retry 2
```
2 changes: 1 addition & 1 deletion docs-new/_docs/0_3_migration_guide.md
@@ -51,7 +51,7 @@ for partition in [list of partitions in source dataset] {

**Option 3**
Write your own custom logic of how to load an existing dataset into a Hudi managed one. Please read about the RDD API
[here](/docs/quick-start-guide). Using the HDFSParquetImporter Tool. Once hudi has been built via `mvn clean install -DskipTests`, the shell can be
[here](/docs/quick-start-guide.html). Using the HDFSParquetImporter Tool. Once hudi has been built via `mvn clean install -DskipTests`, the shell can be
fired up via `cd hudi-cli && ./hudi-cli.sh`.

```java