Core: Add RocksDBStructLikeMap #2680
Conversation
| compile "com.fasterxml.jackson.core:jackson-databind" | ||
| compile "com.fasterxml.jackson.core:jackson-core" | ||
| compile "com.github.ben-manes.caffeine:caffeine" | ||
| compile "org.rocksdb:rocksdbjni" |
I know this is a WIP, but is it possible to avoid bringing this into the class path once this is done?
I believe Flink brings in their own RocksDB fork (frocksdb maybe?) and I imagine others might too.

Since the StructLikeMap is in the core code path, which doesn't depend on any specific compute engine (Spark/Flink/Hive/Presto, etc.), we can't assume that the engine runtime will include this dependency jar. It's better to include it in Iceberg's jar if possible.

That makes sense.
I'd advocate for possibly shading it though, so that multiple versions can exist together.
With Spark 3.2 bringing support for a RocksDB state store as well, it might be wise to shade it in the final outcome. But that is admittedly a long way off.

  RocksDB.loadLibrary();
}

public static RocksDBStructLikeMap create(String path,

Currently, if multiple tasks are located on the same node, they will use the same RocksDB location, which will mess up the spilled data. We will need to create a separate RocksDB directory for each task even when they are located on the same node.

> We will need to create separate rocksdb dir for different tasks even if they are located in the same node.

I think this comment is out of date now because we already generate a unique directory name for each BaseTaskWriter, please see here:
https://github.com/apache/iceberg/pull/2680/files#diff-e550ee80e8343a3396e67ab7fec4fe50a2ac633b02682cdc330d9a2b719e1a5eR61-R68
Yes, I think we could add more RocksDB options (such as the RocksDB write buffer size, etc.) here so that we can control the exact behavior of RocksDB: https://github.com/apache/iceberg/pull/2680/files#diff-125c2d685d98fcab9fcf124dd51f55dafe83462f6aa0462eeb1fe034e136f8afR86-R91
Thanks for the great feedback, @zhougit86!
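
For illustration only, a directory-per-writer scheme could look roughly like the sketch below; the class and method names here are hypothetical and not from this PR, which derives the unique path inside BaseTaskWriter instead.

```java
import java.io.File;
import java.util.UUID;

class SpillLocations {
  // Hypothetical helper: give every writer its own RocksDB directory so that
  // multiple tasks on the same node never share spilled state.
  static File newRocksDbDir(String baseSpillDir) {
    File dir = new File(baseSpillDir, "rocksdb-" + UUID.randomUUID());
    if (!dir.mkdirs() && !dir.isDirectory()) {
      throw new IllegalStateException("Failed to create RocksDB spill dir: " + dir);
    }
    return dir;
  }
}
```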

I think allowing the RocksDB DBOptions and ColumnFamilyOptions to be changed is quite important for debugging and tuning OOM issues with RocksDB. Maybe we should allow the user to supply a config factory, similar to how Flink does with its RocksDB state backend.
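
As a rough sketch of what such a hook could look like (the interface and method names below are hypothetical, loosely modeled on Flink's RocksDBOptionsFactory, and not part of this PR):

```java
import org.rocksdb.ColumnFamilyOptions;
import org.rocksdb.DBOptions;

// Hypothetical user-supplied hook for tuning the embedded RocksDB instance.
interface RocksDBConfigFactory {
  DBOptions configure(DBOptions currentOptions);

  ColumnFamilyOptions configure(ColumnFamilyOptions currentOptions);
}

// Example implementation that bumps background jobs and the write buffer size.
class LargeWriteBufferConfig implements RocksDBConfigFactory {
  @Override
  public DBOptions configure(DBOptions current) {
    return current.setMaxBackgroundJobs(4);
  }

  @Override
  public ColumnFamilyOptions configure(ColumnFamilyOptions current) {
    return current.setWriteBufferSize(64L * 1024 * 1024); // 64 MB memtables
  }
}
```

The map factory would invoke these hooks before opening the database, so tuning stays with the user rather than being hard-coded.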

I've been looking at our task writers to implement merge-on-read in Spark, so I should be able to help review this.

Thanks for your review bandwidth, @aokolnychyi!

@openinx Has this patch been verified? Can it be used in a production environment?

@coolderli, I think this PR is ready for review and has fully covered unit tests, but we still haven't merged it because I don't know whether reviewers still have concerns about it. I think you need to do a basic test in a staging env before publishing it in a prod env.

I think the

@stevenzwu, could you help review this?


    record.set(i, null);
  } else {
    byte[] fieldData = new byte[length];
    int fieldDataSize = dis.read(fieldData);

I think this should be int fieldDataSize = length > 0 ? dis.read(fieldData) : 0;, because dis.read() will return -1 when the fieldData length is 0 (e.g. an empty string will cause this) and make the Preconditions.checkState fail.
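
A self-contained sketch of the suggested guard, assuming the field bytes are length-prefixed and that Preconditions refers to Iceberg's relocated Guava class (illustrative only; the PR's actual deserializer differs):

```java
import java.io.DataInputStream;
import java.io.IOException;

import org.apache.iceberg.relocated.com.google.common.base.Preconditions;

class FieldReads {
  // Skip the read entirely for zero-length fields: reading into an empty buffer
  // can return -1 at end of stream, which would trip the checkState below.
  static byte[] readField(DataInputStream dis, int length) throws IOException {
    byte[] fieldData = new byte[length];
    int fieldDataSize = length > 0 ? dis.read(fieldData) : 0;
    Preconditions.checkState(fieldDataSize == length,
        "Expected %s bytes for the field but read %s", length, fieldDataSize);
    return fieldData;
  }
}
```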

} else if (s instanceof String) {
  encoder.writeString(new Utf8((String) s));
} else if (s instanceof CharSequence) {
  encoder.writeString((CharSequence) s);

Why is this needed? We generally want to avoid writing CharSequence directly because it will require conversion.

public static Map<StructLike, StructLike> load(Types.StructType keyType,
                                               Types.StructType valType,
                                               Map<String, String> properties) {

I don't think this is a good way to control whether the feature is turned on. Passing generic properties all the way down to a factory method for a specific class isn't a good option. Instead, the calling code should decide which map to use. We want to limit where we pass property maps.
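
To make the suggestion concrete, here is a hedged sketch of caller-side selection; the helper class, flag, and supplier below are hypothetical, and the point is only that the writer decides, so no property map reaches the map factory:

```java
import java.util.Map;
import java.util.function.Supplier;

import org.apache.iceberg.StructLike;
import org.apache.iceberg.types.Types;
import org.apache.iceberg.util.StructLikeMap;

class InsertedRowMaps {
  // Hypothetical helper: the writer, which already knows its own configuration,
  // picks the map implementation instead of passing generic table properties down.
  static Map<StructLike, StructLike> create(Types.StructType keyType, boolean spillToDisk,
                                            Supplier<Map<StructLike, StructLike>> spillableMap) {
    if (spillToDisk) {
      // The supplier would hand back the RocksDB-backed map proposed in this PR.
      return spillableMap.get();
    }
    // Default: the existing in-memory map keyed by the equality fields.
    return StructLikeMap.create(keyType);
  }
}
```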

import org.apache.iceberg.relocated.com.google.common.collect.Lists;
import org.apache.iceberg.util.ByteBuffers;

public class Serializers {

What is the scope of this class? What will use it and how do we avoid exposing it in a confusing way in the API?
Don't we have other methods of serializing data that should be sufficient?

package org.apache.iceberg.types;

public interface Serializer<T> {
  byte[] serialize(T object);

Is it necessary to use byte[]? Normally we use ByteBuffer to avoid the copies necessarily introduced by byte[].
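
For comparison, a ByteBuffer-based variant of the interface might look like the following; this is purely illustrative and not what the patch currently defines:

```java
import java.nio.ByteBuffer;

// Illustrative alternative to the byte[]-based Serializer: returning ByteBuffer
// lets implementations hand out read-only slices or reuse buffers instead of
// copying into fresh arrays.
interface ByteBufferSerializer<T> {
  ByteBuffer serialize(T object);

  T deserialize(ByteBuffer data);
}
```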

@Override
public <T> void set(int pos, T value) {
  put(pos, value);

Why use a private put method instead of moving the switch here in set? Won't this need a SuppressWarnings either way?

@openinx, thanks for working on this. The approach of adding a RocksDBStructLikeMap seems good to me. There are a couple of issues that I think we need to work out before getting into details though:
I think the first thing to do is to get a PR with just the StructLikeMap implementation in, and then we can work on integrating this with the rest of the code.

Thanks for the feedback, @rdblue. I will answer your questions sometime later; the next PR will also reflect the changes.

@openinx Using RocksDB can really solve the OOM problem, but will it make things more complicated? Do users need to participate in tuning RocksDB?

We have used RocksDB in the past with Flink, and it does require some tuning to avoid OOM. For example, we have to change the memory allocator to
It's true that RocksDB introduces more complexity, but it's really a trade-off. Users don't necessarily have to opt in either.
Additionally, RocksDB is arguably the most common state backend for Flink. And as of Spark 3.2, there's a RocksDB state store for structured streaming contributed by Databricks. Given these two things, I don't think the additional settings available are a huge issue. I think the Flink project has done a lot of work to minimize the necessary tuning for RocksDB. With time, similar things could be introduced here. But even making this a possibility will be a really large win for very large jobs that might have a large amount of state or need to otherwise spill to disk.

I agree with you that having RocksDB is a win; our stateful Flink jobs all use the RocksDB backend :) But I think we need to communicate the trade-off to the user more clearly. For example, RocksDB can be an order of magnitude slower than an in-memory hash map even with a very fast SSD, depending on the read/write pattern.

For sure. No disagreement. But we're just maybe not there yet is all I mean. Having the ability to spill to disk is a lot of work, which @openinx and others have been doing a great job with. But even look at the age of this PR: it's moving along, but it's definitely a process. It's good to be aware of how the need to pass configurations will affect the rest of the codebase, as Ryan mentioned that was one of the bigger areas of concern. Ideally we can pass configs to RocksDB without too much disruption to the rest of the codebase. Once that goal has been reached, we can worry more about the user experience when using RocksDB. And I'll happily drop most anything I'm doing to review most documentation PRs! And when the time comes, if you want to write a blog post about using RocksDB, I'd happily review any drafts or make sure it's prominently displayed on the Apache Flink website 🙂 For now, let's focus on getting RocksDB or something similar available to developers within Iceberg, and then we can definitely focus on the user experience. It is definitely true that there are many situations where it makes less sense than remaining in-memory, but having the option sure is nice. Still, it's good to always be thinking about the end-user experience. It's definitely one of my biggest concerns with all things too. So many thanks for that. 😀

(Shameless plug while I have your attention) Also, if you're looking to help out or get more involved, we can definitely use more reviewers on the Flink side of the codebase who have the necessary context with Flink to help properly review code! I definitely appreciate your concern for end users as well. I don't consider myself "well-informed" about much of anything, least of all Flink. But I have experience with it, and I'll check out PRs from GitHub locally (I use the GitHub CLI to clone by PR number) and poke around the Iceberg and Flink source code in my IDE to be able to try to contribute to those conversations. And if you can download it and run some sample jobs locally, even better! That's also always very appreciated. Hope to hear more from you. 😀

Are we still considering this feature? I wonder what the performance of this would be versus in-memory vectorized read + compaction. If the latter is good enough, we can probably avoid introducing this complication.

Discussion came back up, but I'm not sure. Besides other perceived possible benefits from RocksDB (which admittedly might not be reason enough to introduce the added complexity for some use cases), the motivating use case is really the potentially much-larger-than-typical memory usage required to process the initial table snapshot for CDC, which is a common use case. To take a snapshot or begin a CDC stream from MySQL, for example, a full table copy is required before the initial checkpoint can be taken, IIUC. So this has historically led users to majorly increase their initial memory/resource allocation and then tune it back down after the job had successfully checkpointed. I'll follow up with openinx and steven about the status of this feature and the state of the problem in general. Without that issue, I'd lean towards keeping as few elements in state as possible to avoid having to worry about further serialized forms. But I think the issue is still outstanding, as a full snapshot is needed before any deltas can be written, IIUC.

@kbendick @openinx If we are not inclined toward using RocksDB, can we trigger checkpoints based on the number of events, as we have with the S3 Kafka sink connector's flush.size? I am not so sure whether something like this can be done in Flink.

@rdblue @stevenzwu Hi, what do you think of this PR? In my company, there are some big tables, such as TiDB or MySQL binlog tables, that will use Flink to load data into Iceberg. For example, we have a TiDB table that has six hundred million records. If we use Flink streaming mode, it will cost too much time. If we use batch mode, the executor needs a large heap to avoid OOM. Any suggestions about this? I think there are two different problems.

@coolderli Are you using the latest Flink CDC connector and Iceberg to export the stream? I remember the latest Flink CDC connector was refactored to use the Netflix DBLog algorithm to export existing RDBMS records in parallel. So in theory, if we don't have any performance blocker in the Flink -> Iceberg path, there should not be anything that costs too much time. What's the bottleneck in your CDC export path?

@openinx Hello, will OOM not be triggered if we use mysql-cdc 2.0 to sync data to Iceberg, since mysql-cdc 2.0 checkpoints at the chunk level?

Currently, the insertedRowMap in BaseEqualityDeltaWriter is an in-memory hash map, which means it will easily hit OOM if the data set is slightly larger than the memory given to the task manager. For example, if we are migrating the full snapshot of a MySQL table to an Apache Iceberg table, the existing data set from the MySQL table will be quite large, but all those rows will be exported in the same Flink checkpoint, so OOM will easily happen.
In this patch, we are trying to provide a map that is backed by an embedded RocksDB, which means we can spill the rows to disk when they exceed a given threshold. The patch is still a work in progress and will need more test cases before it is ready for review.
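
For readers unfamiliar with the approach, a minimal sketch of the general idea follows. It is not the PR's RocksDBStructLikeMap: the key/value serializer functions are assumed to be supplied by the caller, and the options/database lifecycle is simplified for brevity.

```java
import java.util.function.Function;

import org.rocksdb.Options;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;

// Minimal sketch of a disk-backed map: keys and values are converted to bytes by
// caller-supplied functions and stored in an embedded RocksDB instance, so the
// inserted-row state can grow beyond the task manager's heap.
class RocksDBBackedMap<K, V> implements AutoCloseable {
  private final RocksDB db;
  private final Function<K, byte[]> keySerde;
  private final Function<V, byte[]> valSerde;
  private final Function<byte[], V> valDeserde;

  RocksDBBackedMap(String path, Function<K, byte[]> keySerde,
                   Function<V, byte[]> valSerde,
                   Function<byte[], V> valDeserde) throws RocksDBException {
    RocksDB.loadLibrary();
    this.db = RocksDB.open(new Options().setCreateIfMissing(true), path);
    this.keySerde = keySerde;
    this.valSerde = valSerde;
    this.valDeserde = valDeserde;
  }

  void put(K key, V value) throws RocksDBException {
    db.put(keySerde.apply(key), valSerde.apply(value));
  }

  V get(K key) throws RocksDBException {
    byte[] bytes = db.get(keySerde.apply(key));
    return bytes == null ? null : valDeserde.apply(bytes);
  }

  @Override
  public void close() {
    db.close();
  }
}
```

The map proposed in the PR additionally exposes the java.util.Map interface over StructLike keys and values, which is where the Serializers discussed above come in.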