✏️大规模数据处理

0xcaffebabe · Dec 20, 2022 · f7e7b6e · f7e7b6e
1 parent d02bb43
commit f7e7b6e
Showing 1 changed file with 36 additions and 0 deletions.
diff --git a/doc/数据技术/大规模数据处理.md b/doc/数据技术/大规模数据处理.md
@@ -218,3 +218,39 @@ Structured Streaming 提供一个 level 更高的 API，这样的数据抽象可
 - 无界：Beam 要统一表达有界数据和无界数据，所以没有限制它的容量
 - 不可变
 
+### Transform
+
+```mermaid
+stateDiagram-v2
+  数据1 --> Transform
+  数据2 --> Transform
+  Transform --> 数据3
+  Transform --> 数据4
+```
+
+常见的 Transform 接口：
+
+- ParDo：类似于flatMap
+- GroupByKey：把一个 Key/Value 的数据集按 Key 归并
+
+### Pipeline
+
+```mermaid
+stateDiagram-v2
+  输入 --> PCollection1: Transform1
+  PCollection1 --> PCollection2: Transform2
+  PCollection2 --> PCollection3: Transform3
+  PCollection3 --> 输出: Transform4
+```
+
+分布式环境下，整个数据流水线会启动 N 个 Workers 来同时处理 PCollection，在具体处理某一个特定 Transform 的时候，数据流水线会将这个 Transform 的输入数据集 PCollection 里面的元素分割成不同的 Bundle，将这些 Bundle 分发给不同的 Worker 来处理
+
+在单个 Transfrom中，如果某一个 Bundle 里面的元素因为任意原因导致处理失败了，则这整个 Bundle 里的元素都必须重新处理
+
+在多步骤的 Transform 上，如果处理的一个 Bundle 元素发生错误了，则这个元素所在的整个 Bundle 以及与这个 Bundle 有关联的所有 Bundle 都必须重新处理
+
+### IO
+
+- TextIO.read()
+- TextIO.write()
+