From c81f1ace1a0bcf907a5885e2515d2ea659985bea Mon Sep 17 00:00:00 2001 From: ConradJam Date: Tue, 6 Aug 2024 19:50:53 +0800 Subject: [PATCH 1/3] Docs: fixed apache amoro(incubating) with iceberg (#11965) --- docs/docs/amoro.md | 89 ++++++++++++++++++++++++++++++++++++++++++++++ docs/mkdocs.yml | 2 ++ 2 files changed, 91 insertions(+) create mode 100644 docs/docs/amoro.md diff --git a/docs/docs/amoro.md b/docs/docs/amoro.md new file mode 100644 index 000000000000..b2e7875120eb --- /dev/null +++ b/docs/docs/amoro.md @@ -0,0 +1,89 @@ +--- +title: "Apache Amoro" +--- + + +# Apache Amoro With Iceberg + +**[Apache Amoro(incubating)](https://amoro.apache.org)** is a Lakehouse management system built on open data lake formats. Working with compute engines including Flink, Spark, and Trino, Amoro brings pluggable and +**[Table Maintenance](https://amoro.apache.org/docs/latest/self-optimizing/)** features for a Lakehouse to provide out-of-the-box data warehouse experience, and helps data platforms or products easily build infra-decoupled, stream-and-batch-fused and lake-native architecture. +**AMS(Amoro Management Service)** provides Lakehouse management features, like self-optimizing, data expiration, etc. It also provides a unified catalog service for all compute engines, which can also be combined with existing metadata services like HMS(Hive Metastore). + +# Auto Self-optimizing + +Lakehouse is characterized by its openness and loose coupling, with data and files maintained by users through various engines. While this +architecture appears to be well-suited for T+1 scenarios, as more attention is paid to applying Lakehouse to streaming data warehouses and real-time +analysis scenarios, challenges arise. For example: + +- Streaming writes bring a massive amount of fragment files +- CDC ingestion and streaming updates generate excessive redundant data +- Using the new data lake format leads to orphan files and expired snapshots. 
+ +These issues can significantly affect the performance and cost of data analysis. Therefore, Amoro has introduced a Self-optimizing mechanism to +create an out-of-the-box Streaming Lakehouse management service that is as user-friendly as a traditional database or data warehouse. Self-optimizing involves various procedures such as file compaction, deduplication, and sorting. + +The architecture and working mechanism of Self-optimizing are shown in the figure below: + +![Self-optimizing architecture](https://github.com/apache/amoro/blob/master/docs/images/concepts/self-optimizing_arch.png) + +The Optimizer is a component responsible for executing Self-optimizing tasks. It is a resident process managed by [AMS](https://amoro.apache.org/docs/latest/#architecture). AMS is responsible for +detecting and planning Self-optimizing tasks for tables, and then scheduling them to Optimizers for distributed execution in real-time. Finally, AMS +is responsible for submitting the optimizing results. Amoro achieves physical isolation of Optimizers through the Optimizer Group. + +The core features of [Amoro's Self-optimizing](https://amoro.apache.org/docs/latest/self-optimizing/) are: + +- Automated, Asynchronous and Transparent — Continuous background detecting of file changes, asynchronous distributed execution of optimizing tasks, + transparent and imperceptible to users +- Resource Isolation and Sharing — Allow resources to be isolated and shared at the table level, as well as setting resource quotas +- Flexible and Scalable Deployment — Optimizers support various deployment methods and convenient scaling + + +# Iceberg Format + +Apache Amoro supports all catalog types supported by Iceberg, including common catalog: [REST](https://iceberg.apache.org/concepts/catalog/#decoupling-using-the-rest-catalog), Hadoop, Hive, Glue, JDBC, Nessie and other third-party catalog. 
+Amoro supports all storage types supported by Iceberg, including common store: Hadoop, S3, GCS, ECS, OSS, and so on. + +At the same time, we also provide a unique form based on Apache Iceberg, including mixed-Iceberg Format and mixed-Hive Format, so that you can quickly upgrade to the iceberg+hive Mixed table while compatible with the original Hive data + +## Mixed-Iceberg Format + +[Mixed-Iceberg Format](https://amoro.apache.org/docs/latest/mixed-iceberg-format/) Compared with Iceberg format, Mixed-Iceberg format provides more features: + +- Stronger primary key constraints that also apply to Spark +- OLAP performance that is production-ready for real-time data warehouses through the auto-bucket mechanism +- LogStore configuration that can reduce data pipeline latency from minutes to milliseconds/seconds +- Transaction conflict resolution mechanism that enables concurrent writes with the same primary key +- The design intention of Mixed-Iceberg format is to provide a storage layer for stream-batch integration and offline-real-time unified data warehouses for big data platforms based on data lakes. Under this goal-driven approach, Amoro designs Mixed-Iceberg format as a three-tier structure, with each level named after a different TableStore: + +![mixed_format](https://github.com/apache/amoro/blob/master/docs/images/formats/mixed_format.png) + +- BaseStore — stores the stock data of the table, usually generated by batch computing or optimizing processes, and is more friendly to ReadStore for reading. +- ChangeStore — stores the flow and change data of the table, usually written in real-time by streaming computing, and can also be used for downstream CDC consumption, and is more friendly to WriteStore for writing. +- LogStore — serves as a cache layer for ChangeStore to accelerate stream processing. Amoro manages the consistency between LogStore and ChangeStore. 
+ + +## Mixed-Hive Format + +[Mixed-Hive](https://amoro.apache.org/docs/latest/mixed-hive-format/) format is a format that has better compatibility with Hive than Mixed-Iceberg format. Mixed-Hive format uses a Hive table as the BaseStore and an Iceberg table as the ChangeStore. Mixed-Hive format supports: + +![mixed_format](https://github.com/apache/amoro/blob/master/docs/images/formats/mixed_format.png) + +- Schema, partition, and types consistent with Hive format +- Using the Hive connector to read and write Mixed-Hive format tables as Hive tables +- Upgrading a Hive table in-place to a Mixed-Hive format table without data rewriting or migration, with a response time in seconds +- All the functional features of Mixed-Iceberg format diff --git a/docs/mkdocs.yml b/docs/mkdocs.yml index bc8809595d6e..89cece20a4ab 100644 --- a/docs/mkdocs.yml +++ b/docs/mkdocs.yml @@ -58,6 +58,7 @@ nav: - Presto: https://prestodb.io/docs/current/connector/iceberg.html - Dremio: https://docs.dremio.com/data-formats/apache-iceberg/ - Starrocks: https://docs.starrocks.io/en-us/latest/data_source/catalog/iceberg_catalog + - Amoro: https://amoro.apache.org/docs/latest/iceberg-format - Amazon Athena: https://docs.aws.amazon.com/athena/latest/ug/querying-iceberg.html - Amazon EMR: https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-iceberg-use-cluster.html - Amazon Data Firehose: https://docs.aws.amazon.com/firehose/latest/dev/apache-iceberg-destination.html @@ -69,6 +70,7 @@ nav: - Druid: https://druid.apache.org/docs/latest/development/extensions-contrib/iceberg/ - Kafka Connect: kafka-connect.md - Integrations: + - amoro.md - aws.md - dell.md - jdbc.md From 37aa2c74acfc20d938fe0720950c6504df7e61cd Mon Sep 17 00:00:00 2001 From: ConradJam Date: Tue, 18 Feb 2025 16:12:24 +0800 Subject: [PATCH 2/3] fix: fix doc --- docs/docs/amoro.md | 44 ++++++++++++-------------------------------- 1 file changed, 12 insertions(+), 32 deletions(-) diff --git a/docs/docs/amoro.md b/docs/docs/amoro.md 
index b2e7875120eb..f74fd65ec7df 100644 --- a/docs/docs/amoro.md +++ b/docs/docs/amoro.md @@ -22,19 +22,11 @@ title: "Apache Amoro" **[Apache Amoro(incubating)](https://amoro.apache.org)** is a Lakehouse management system built on open data lake formats. Working with compute engines including Flink, Spark, and Trino, Amoro brings pluggable and **[Table Maintenance](https://amoro.apache.org/docs/latest/self-optimizing/)** features for a Lakehouse to provide out-of-the-box data warehouse experience, and helps data platforms or products easily build infra-decoupled, stream-and-batch-fused and lake-native architecture. -**AMS(Amoro Management Service)** provides Lakehouse management features, like self-optimizing, data expiration, etc. It also provides a unified catalog service for all compute engines, which can also be combined with existing metadata services like HMS(Hive Metastore). +**[AMS](https://amoro.apache.org/docs/latest/#architecture)(Amoro Management Service)** provides Lakehouse management features, like self-optimizing, data expiration, etc. It also provides a unified catalog service for all compute engines, which can also be combined with existing metadata services like HMS(Hive Metastore). -# Auto Self-optimizing +## Auto Self-optimizing -Lakehouse is characterized by its openness and loose coupling, with data and files maintained by users through various engines. While this -architecture appears to be well-suited for T+1 scenarios, as more attention is paid to applying Lakehouse to streaming data warehouses and real-time -analysis scenarios, challenges arise. For example: - -- Streaming writes bring a massive amount of fragment files -- CDC ingestion and streaming updates generate excessive redundant data -- Using the new data lake format leads to orphan files and expired snapshots. - -These issues can significantly affect the performance and cost of data analysis. 
Therefore, Amoro has introduced a Self-optimizing mechanism to +Amoro has introduced a Self-optimizing mechanism to create an out-of-the-box Streaming Lakehouse management service that is as user-friendly as a traditional database or data warehouse. Self-optimizing involves various procedures such as file compaction, deduplication, and sorting. The architecture and working mechanism of Self-optimizing are shown in the figure below: @@ -45,45 +37,33 @@ The Optimizer is a component responsible for executing Self-optimizing tasks. It detecting and planning Self-optimizing tasks for tables, and then scheduling them to Optimizers for distributed execution in real-time. Finally, AMS is responsible for submitting the optimizing results. Amoro achieves physical isolation of Optimizers through the Optimizer Group. -The core features of [Amoro's Self-optimizing](https://amoro.apache.org/docs/latest/self-optimizing/) are: +The core features of [Amoro Self Optimizing](https://amoro.apache.org/docs/latest/self-optimizing/) are: - Automated, Asynchronous and Transparent — Continuous background detecting of file changes, asynchronous distributed execution of optimizing tasks, transparent and imperceptible to users - Resource Isolation and Sharing — Allow resources to be isolated and shared at the table level, as well as setting resource quotas - Flexible and Scalable Deployment — Optimizers support various deployment methods and convenient scaling - -# Iceberg Format +## Table Format Apache Amoro supports all catalog types supported by Iceberg, including common catalog: [REST](https://iceberg.apache.org/concepts/catalog/#decoupling-using-the-rest-catalog), Hadoop, Hive, Glue, JDBC, Nessie and other third-party catalog. Amoro supports all storage types supported by Iceberg, including common store: Hadoop, S3, GCS, ECS, OSS, and so on. 
At the same time, we also provide a unique form based on Apache Iceberg, including mixed-Iceberg Format and mixed-Hive Format, so that you can quickly upgrade to the iceberg+hive Mixed table while compatible with the original Hive data -## Mixed-Iceberg Format +### Iceberg Format -[Mixed-Iceberg Format](https://amoro.apache.org/docs/latest/mixed-iceberg-format/) Compared with Iceberg format, Mixed-Iceberg format provides more features: +Starting from Apache Amoro v0.4, the Iceberg format (both v1 and v2) is supported. Users only need to register an Iceberg catalog in Amoro to hand its tables over to Amoro for maintenance. Amoro keeps Iceberg tables performant and cost-efficient with minimal read/write overhead through small-file merging, conversion of eq-delete files to pos-delete files, +duplicate data elimination, and file cleaning, without intruding on the functionality of Iceberg. -- Stronger primary key constraints that also apply to Spark -- OLAP performance that is production-ready for real-time data warehouses through the auto-bucket mechanism -- LogStore configuration that can reduce data pipeline latency from minutes to milliseconds/seconds -- Transaction conflict resolution mechanism that enables concurrent writes with the same primary key -- The design intention of Mixed-Iceberg format is to provide a storage layer for stream-batch integration and offline-real-time unified data warehouses for big data platforms based on data lakes. Under this goal-driven approach, Amoro designs Mixed-Iceberg format as a three-tier structure, with each level named after a different TableStore: +### Mixed-Iceberg Format -![mixed_format](https://github.com/apache/amoro/blob/master/docs/images/formats/mixed_format.png) +The structure of [Mixed-Iceberg Format](https://amoro.apache.org/docs/latest/mixed-iceberg-format/) is similar to that of clustered indexes in databases. Each TableStore can use a different table format.
Mixed-Iceberg format provides high-freshness OLAP through merge-on-read between BaseStore and ChangeStore. To keep merge-on-read fast, BaseStore and ChangeStore use identical partitioning and layout, and both support auto-bucket. - BaseStore — stores the stock data of the table, usually generated by batch computing or optimizing processes, and is more friendly to ReadStore for reading. - ChangeStore — stores the flow and change data of the table, usually written in real-time by streaming computing, and can also be used for downstream CDC consumption, and is more friendly to WriteStore for writing. - LogStore — serves as a cache layer for ChangeStore to accelerate stream processing. Amoro manages the consistency between LogStore and ChangeStore. +### Mixed-Hive Format -## Mixed-Hive Format - -[Mixed-Hive](https://amoro.apache.org/docs/latest/mixed-hive-format/) format is a format that has better compatibility with Hive than Mixed-Iceberg format. Mixed-Hive format uses a Hive table as the BaseStore and an Iceberg table as the ChangeStore. Mixed-Hive format supports: - -![mixed_format](https://github.com/apache/amoro/blob/master/docs/images/formats/mixed_format.png) - -- Schema, partition, and types consistent with Hive format -- Using the Hive connector to read and write Mixed-Hive format tables as Hive tables -- Upgrading a Hive table in-place to a Mixed-Hive format table without data rewriting or migration, with a response time in seconds -- All the functional features of Mixed-Iceberg format +[Mixed-Hive](https://amoro.apache.org/docs/latest/mixed-hive-format/) format offers better compatibility with Hive than Mixed-Iceberg format. Mixed-Hive format uses a Hive table as the BaseStore and an Iceberg table as the ChangeStore.
From bd42aeb5fe705cc71325a13707bcba8275bbb016 Mon Sep 17 00:00:00 2001 From: ConradJam Date: Wed, 19 Feb 2025 14:50:41 +0800 Subject: [PATCH 3/3] fix: fix doc with mkdocs --- docs/mkdocs.yml | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/docs/mkdocs.yml b/docs/mkdocs.yml index 89cece20a4ab..7273a1946896 100644 --- a/docs/mkdocs.yml +++ b/docs/mkdocs.yml @@ -58,7 +58,7 @@ nav: - Presto: https://prestodb.io/docs/current/connector/iceberg.html - Dremio: https://docs.dremio.com/data-formats/apache-iceberg/ - Starrocks: https://docs.starrocks.io/en-us/latest/data_source/catalog/iceberg_catalog - - Amoro: https://amoro.apache.org/docs/latest/iceberg-format + - Amoro: amoro.md - Amazon Athena: https://docs.aws.amazon.com/athena/latest/ug/querying-iceberg.html - Amazon EMR: https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-iceberg-use-cluster.html - Amazon Data Firehose: https://docs.aws.amazon.com/firehose/latest/dev/apache-iceberg-destination.html @@ -70,7 +70,6 @@ nav: - Druid: https://druid.apache.org/docs/latest/development/extensions-contrib/iceberg/ - Kafka Connect: kafka-connect.md - Integrations: - - amoro.md - aws.md - dell.md - jdbc.md