
Potential memory issue when using COPY with PARTITIONED BY #11042

Open
hveiga opened this issue Jun 21, 2024 · 18 comments
Labels: bug (Something isn't working)

Comments

@hveiga (Contributor) commented Jun 21, 2024

Describe the bug

Memory does not get freed after executing multiple COPY ... TO ... PARTITIONED BY ... queries. I have not been able to identify what is causing this behavior.

To Reproduce

The behavior can be observed using datafusion-cli. I have been monitoring the memory usage through Activity Monitor.

  1. Download the test parquet file (120MB): https://file.io/eKiHwu4waHVN
  2. Run datafusion-cli
  3. Create an external table:
CREATE EXTERNAL TABLE my_table
        (
            col1 VARCHAR NOT NULL,
            timestamp BIGINT NOT NULL,
            col2 VARCHAR NOT NULL,
            col3 VARCHAR NOT NULL,
            col4 VARCHAR NOT NULL,
            col5 VARCHAR NOT NULL,
            col6 VARCHAR NOT NULL,
            col7 VARCHAR NOT NULL,
            col8 VARCHAR NOT NULL,
            col9 VARCHAR NOT NULL,
            col10 VARCHAR NOT NULL,
            col11 VARCHAR NOT NULL,
            col12 DOUBLE
        )
        WITH ORDER (col1 ASC, timestamp ASC) STORED AS PARQUET LOCATION 'test_file.parquet';
  4. Execute COPY .. PARTITIONED BY query:
COPY (SELECT col1, timestamp, col10, col12 FROM my_table ORDER BY col1 ASC, timestamp ASC)
TO './output' STORED AS PARQUET PARTITIONED BY (col1) OPTIONS (compression 'uncompressed');
  5. Monitor memory usage.
  6. Repeat execution of COPY .. PARTITIONED BY query and continue monitoring memory usage.
  7. Observation: memory does not get released.

Expected behavior

My expectation is to be able to run the COPY command multiple times without the memory usage increasing every time.

Additional context

There is more context of what I am trying to do in Discord: https://discord.com/channels/885562378132000778/1166447479609376850/1253419900043526236

I am also experiencing the same behavior when running my application in Kubernetes. K8s terminates my pod once it exceeds the pod memory limits:

[Screenshot: Kubernetes pod memory usage, 2024-06-20]
@hveiga added the bug label Jun 21, 2024
@devinjdangelo (Contributor) commented

Does the issue happen if you do COPY without PARTITIONED BY?

The partitioning code spawns a potentially large number of tokio tasks (one for each partition), so if those tasks are not being cleaned up properly, it could lead to a memory leak.
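
To illustrate the general pattern (a minimal sketch only, not DataFusion's actual partitioning code), the concern is roughly this:

// Illustrative sketch: one tokio task per output partition. If the task
// handles (and any buffers the tasks own) are never joined, their memory
// can outlive the query. This is NOT DataFusion's actual code.
use tokio::task::JoinSet;

#[tokio::main]
async fn main() {
    let partitions: Vec<String> = (0..100).map(|i| format!("col1={i}")).collect();

    let mut writers = JoinSet::new();
    for partition in partitions {
        writers.spawn(async move {
            // ... serialize and write this partition's rows to ./output/<partition>/ ...
            partition.len() // placeholder for "bytes written"
        });
    }

    // Joining every task lets its owned buffers drop. With raw tokio::spawn,
    // dropping the JoinHandle instead detaches the task.
    let mut total = 0;
    while let Some(result) = writers.join_next().await {
        total += result.expect("partition writer task panicked");
    }
    println!("wrote {total} placeholder bytes across all partitions");
}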

@hveiga (Contributor, Author) commented Jun 21, 2024

Yes, I can see the same behavior by running the following query multiple times:

COPY (SELECT col1, timestamp, col10, col12 FROM my_table ORDER BY col1 ASC, timestamp ASC)
TO './output/output.parquet' STORED AS PARQUET OPTIONS (compression 'uncompressed');

However, the memory increase is smaller and it takes many more queries to make it noticeable (10+).

@hveiga (Contributor, Author) commented Jun 25, 2024

@devinjdangelo do you have any suggestions on how to investigate this issue further? I am happy to take the lead on it. I was chatting with @alamb yesterday and he suggested using heaptrack, but I was wondering if you would suggest other options. Thanks!

@devinjdangelo (Contributor) commented

> @devinjdangelo do you have any suggestions on how to investigate this issue further? I am happy to take the lead on it. I was chatting with @alamb yesterday and he suggested using heaptrack, but I was wondering if you would suggest other options. Thanks!

A self-contained example script may be helpful. I have used the peak_alloc crate in the past as a very simple way to measure how much memory is being consumed.

Heaptrack will provide more detail and will likely help narrow down the source of the issue faster, but a self-contained script is, I think, useful for demonstration and sanity checking.
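
Something along these lines could work as a starting point; this is a rough sketch assuming the peak_alloc crate and the DataFusion SessionContext API, with the file path and query taken from the reproduction steps above:

// Rough repro sketch: report allocator usage after each COPY, using the
// peak_alloc crate as the global allocator. Assumes test_file.parquet from
// the reproduction steps in this issue.
use datafusion::prelude::*;
use peak_alloc::PeakAlloc;

#[global_allocator]
static PEAK_ALLOC: PeakAlloc = PeakAlloc;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    let ctx = SessionContext::new();
    ctx.register_parquet("my_table", "test_file.parquet", ParquetReadOptions::default())
        .await?;

    for i in 0..5 {
        ctx.sql(
            "COPY (SELECT col1, timestamp, col10, col12 FROM my_table \
                   ORDER BY col1 ASC, timestamp ASC) \
             TO './output' STORED AS PARQUET PARTITIONED BY (col1) \
             OPTIONS (compression 'uncompressed')",
        )
        .await?
        .collect()
        .await?;

        println!(
            "after COPY #{i}: current = {} MB, peak = {} MB",
            PEAK_ALLOC.current_usage_as_mb(),
            PEAK_ALLOC.peak_usage_as_mb()
        );
    }
    Ok(())
}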

@alamb (Contributor) commented Jun 26, 2024

BTW, something we have seen in InfluxDB, especially for very compressible data, was that the arrow writer was consuming substantial memory.

Something that might be worth testing would be to set the parquet writer's data_page_row_limit option to something like 20,000.

By default it is unlimited. We just changed the default upstream in arrow-rs in apache/arrow-rs#5957, but that is not yet released.
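
For anyone who wants to test the same knob outside of COPY, here is a rough sketch (to the best of my recollection of the arrow-rs API) of setting it when writing with ArrowWriter directly:

// Sketch (not DataFusion's code path): write a small parquet file with
// arrow-rs, capping the number of rows per data page via WriterProperties.
use std::{fs::File, sync::Arc};

use arrow::array::{ArrayRef, Int64Array};
use arrow::record_batch::RecordBatch;
use parquet::arrow::ArrowWriter;
use parquet::file::properties::WriterProperties;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let col: ArrayRef = Arc::new(Int64Array::from(vec![1, 2, 3]));
    let batch = RecordBatch::try_from_iter(vec![("timestamp", col)])?;

    // ~20k rows per page; the old default was effectively unlimited
    // (apache/arrow-rs#5957 changes the upstream default).
    let props = WriterProperties::builder()
        .set_data_page_row_count_limit(20_000)
        .build();

    let file = File::create("limited_pages.parquet")?;
    let mut writer = ArrowWriter::try_new(file, batch.schema(), Some(props))?;
    writer.write(&batch)?;
    writer.close()?;
    Ok(())
}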

@alamb (Contributor) commented Jul 9, 2024

Possibly related to #11344, where the memory tracking for parquet writing could be improved.

@hveiga (Contributor, Author) commented Jul 9, 2024

> BTW, something we have seen in InfluxDB, especially for very compressible data, was that the arrow writer was consuming substantial memory.
>
> Something that might be worth testing would be to set the parquet writer's data_page_row_limit option to something like 20,000.
>
> By default it is unlimited. We just changed the default upstream in arrow-rs in apache/arrow-rs#5957, but that is not yet released.

Thanks for the suggestion @alamb. I tested with a different value for data_page_row_limit but got the same result.

In general I have been having a hard time trying to debug this since there is no heaptrack for Mac and the build process for heaptrack_gui is also broken at the moment as I cannot install one of the required dependencies. I did try valgrind and the RustRover profiler, but could not find anything relevant.

I'll wait for #11344 to land and test again.

@alamb (Contributor) commented Jul 9, 2024

> In general I have been having a hard time trying to debug this since there is no heaptrack for Mac and the build process for heaptrack_gui is also broken at the moment as I cannot install one of the required dependencies. I did try valgrind and the RustRover profiler, but could not find anything relevant.

I have had good luck with the Instruments tool that comes with Xcode on Mac, specifically the "Allocations" tool:

[Screenshot: Xcode Instruments "Allocations" tool, 2024-07-09]

But I think @wiedld said she didn't have good luck with it so your mileage may vary

@wiedld (Contributor) commented Jul 9, 2024

> But I think @wiedld said she didn't have good luck with it so your mileage may vary

While using the Xcode allocations tool, I was getting <10% of the allocations vs the peak measured with the time builtin. (Note: our process was an application-triggered datafusion query and not via the datafusion-cli.) As a result I ended up using heaptrack.

> In general I have been having a hard time trying to debug this since there is no heaptrack for Mac and the build process for heaptrack_gui is also broken

I ran into the same problems. Here is the work-around I used (I'm sure there are others):

  1. Ran heaptrack on a Linux VM. Instead of using the recommended cargo-heaptrack, I was building & running a Rust project using cargo-with. Below is a rough idea (dependencies may differ for you).

    sudo apt install git-all
    curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
    source "$HOME/.cargo/env"
    sudo apt-get install gcc build-essential time heaptrack
    cargo install cargo-with
    cargo with 'heaptrack' -- run --profile=quick-release --no-default-features -- <process cmd>
    
    # results in <heaptrack_output_file>.zst
    
  2. Use heaptrack_print to convert the output files into stack files for flamegraph

    heaptrack_print heaptrack_output_file.zst --flamegraph-cost-type peak -F my_stack_file.txt
    
    # results in output file, and summary stats
    total runtime: 27.98s.
    calls to allocation functions: 12432130 (444322/s)
    temporary memory allocations: 1214337 (43400/s)
    peak heap memory consumption: 1.26G
    peak RSS (including heaptrack overhead): 3.22G
    total memory leaked: 1.88M
    
  3. Confirm the heaptrack peak memory (summary stats from above) is ~= the time builtin

    /usr/bin/time -l -h -p <process>
    
  4. Move the files to your macOS machine, then build & analyze the flamegraphs.

    git clone [email protected]:brendangregg/FlameGraph.git
    cd FlameGraph
    
    # generate flamegraph files
    ./flamegraph.pl --title "heaptrack: peak memory" --colors mem --countname peak_bytes < my_stack_file.txt > my_flamegraph.svg
    open my_flamegraph.svg
    

If you run into any issues, or find any better alternatives, please let me know, @hveiga.

@wiedld (Contributor) commented Jul 9, 2024

Ah, I forgot to mention a key point. When extracting data via heaptrack_print, I was looking at memory peaks and hence used --flamegraph-cost-type peak. You may want to check other options for investigating leaks.

@hveiga (Contributor, Author) commented Jul 17, 2024

I finally have some time to continue investigating this issue. I have not been able to make heaptrack work (yet!) but I did try using dhat and I got an interesting lead:

[Screenshot: dhat output highlighting allocations in dict_encoder.rs, 2024-07-16]

I won't claim I am experienced with this tool, but I was curious why it was highlighting dict_encoder.rs. When writing parquet from a COPY query, the DICTIONARY_ENABLED option is enabled by default, so I decided to try disabling it with DICTIONARY_ENABLED false.

After disabling it, I see the memory increasing only marginally with every invocation (in the 100-200MB range), while with DICTIONARY_ENABLED true each invocation increases memory usage by multiple GBs (2-3GB) and it seems it never gets freed again.
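
For anyone who wants to poke at this outside of COPY: I believe the same toggle exists on the arrow-rs parquet writer properties. A minimal sketch, assuming the parquet crate's WriterProperties builder:

// Sketch: disable dictionary encoding at the parquet writer level,
// mirroring the COPY option DICTIONARY_ENABLED false used above.
// dict_encoder.rs is the code path used when this is left enabled.
use parquet::file::properties::WriterProperties;

fn main() {
    let _props = WriterProperties::builder()
        .set_dictionary_enabled(false)
        .build();
}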

I don't have a root cause for the issue yet, but wanted to share this behavior in case somebody else finds this pattern familiar. It might also just be a red herring and not the actual issue.

I also found apache/arrow-rs#5828 which might be related and/or relevant.

@alamb (Contributor) commented Jul 17, 2024

> I also found apache/arrow-rs#5828 which might be related and/or relevant.

I would expect that the memory usage highlighted in apache/arrow-rs#5828 would be directly reduced by setting the data_page_row_limit.

> After disabling it, I see the memory increasing only marginally with every invocation (in the 100-200MB range), while with DICTIONARY_ENABLED true each invocation increases memory usage by multiple GBs (2-3GB) and it seems it never gets freed again.

I wonder if this could be related to DataFusion overriding the data_page_row_limit setting in #11367 (that @wiedld is working on)

I think you can set this option like

COPY (SELECT col1, timestamp, col10, col12 FROM my_table ORDER BY col1 ASC, timestamp ASC)
TO './output' STORED AS PARQUET PARTITIONED BY (col1) 
OPTIONS (
  compression 'uncompressed', 
  'format.parquet.data_pagesize_limit' 20000
);

@wiedld (Contributor) commented Jul 17, 2024

> I wonder if this could be related to DataFusion overriding the data_page_row_limit setting in #11367 (that @wiedld is working on)

@alamb is mentioning the data_page_row_limit since in our own work we found that the dict_encoder used a lot more memory with the datafusion default data_page_row_limit=usize::MAX. It was only once we set data_page_row_limit=20k that we fixed the memory issue. As a result, we decided to change the arrow-rs/parquet default to 20k.

The current gotchas with using the defaults in COPY TO

So the arrow-rs/parquet writer now has a default data_page_row_limit=20k, which means we should see that default when we use datafusion COPY TO, right?

Wrong. How the datafusion session's write configuration is treated in arrow-rs/parquet's ArrowWriter (i.e. whether or not it's actually used in the parquet writing) is a little confusing, IMO:

  • In some cases, the datafusion defaults override the arrow-rs/parquet defaults.
    • e.g. datafusion default data_page_row_limit=usize::MAX, overwrites arrow-rs/parquet data_page_row_limit=20k.
    • short term fix: do as alamb suggests and configure the sql COPY TO for data_page_row_limit=20k.
  • In other cases, the datafusion defaults get ignored in arrow-rs/parquet.
    • specifically, these are in cases when the datafusion default is None.
    • e.g. datafusion default dictionary_enabled=None gets overridden, and the actual default behavior is to turn it on (as @hveiga noticed above).
    • short term fix: explicitly set dictionary_enabled=Some(false)

The PR description here gives an overview of how the default datafusion settings are treated in arrow-rs/parquet.

@alamb (Contributor) commented Jul 17, 2024

Also, spoiler alert: @wiedld is in the process of harmonizing the default configuration for the parquet writer and the arrow settings as part of #11367.

@wiedld (Contributor) commented Jul 17, 2024

> I also found apache/arrow-rs#5828 which might be related and/or relevant.

@hveiga is correct that this is one suspected place of extra memory usage (specifically in the dict_encoder) when processing many rows per page. But that is not the only possible impact of very many rows per page, which is why we focused on changing the config setting data_page_row_limit=20k.

@hveiga (Contributor, Author) commented Jul 17, 2024

Thanks for all the feedback; we are actively testing this and can report our findings later today.

Regarding data_page_row_limit=20k, what is the right key to pass in the COPY command? I can see the Write Options docs say format.data_pagesize_limit, but #11524 talks about data_page_row_limit. If the different naming is an issue, I am happy to PR a change to consolidate.

@alamb (Contributor) commented Jul 17, 2024

data_page_row_limit is the right one (based on rows)

There is another similarly named one called data_pagesize_limit (I would have expected it to be called data_page_size_limit) that is a limit in bytes

Sorry this is so confusing
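
For reference, a rough sketch of how the two show up at the arrow-rs WriterProperties level (builder method names as I recall them), which may help keep them straight:

// Sketch: the two similarly named data page limits on the parquet
// WriterProperties builder -- one counted in rows, one counted in bytes.
use parquet::file::properties::WriterProperties;

fn main() {
    let _props = WriterProperties::builder()
        // limit measured in rows per data page
        .set_data_page_row_count_limit(20_000)
        // limit measured in bytes per data page
        .set_data_page_size_limit(1024 * 1024)
        .build();
}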

@hveiga (Contributor, Author) commented Jul 17, 2024

I believe the right one is data_page_row_count_limit instead of data_page_row_limit. So many similar configs :)
