
Parallel Table.append #428

Closed

bigluck opened this issue Feb 14, 2024 · 32 comments
@bigluck

bigluck commented Feb 14, 2024

Apache Iceberg version

main (development)

Please describe the bug 🐞

While doing some tests with the latest RC (v0.6.0rc5), I generated a ~6.7 GB Arrow table and appended it to a new table.

In terms of performance, I got similar results (writing to S3) on these two types of EC2 machines:

  • c5ad.8xlarge: 32 cores, 64 GB RAM, 10 Gbps NIC -> wrote 1 Parquet file of 2 GB in 31s
  • c5ad.16xlarge: 64 cores, 128 GB RAM, 20 Gbps NIC -> wrote 1 Parquet file of 1.6 GB in 28s

Using htop, I noticed that the code was only using one thread during the append operation, which means the write operation is not being parallelized.

[Screenshot: htop showing a single busy core during the append]
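For reference, here is roughly the shape of the benchmark; this is a minimal sketch, and the catalog name, table identifier, and cached Arrow file name are placeholders:

import time

import pyarrow as pa
from pyiceberg.catalog import load_catalog

# Assumed: an existing catalog configuration and a cached ~6.7 GB Arrow IPC file.
catalog = load_catalog("default")
with pa.memory_map("table_10000000.arrow", "r") as source:
    pa_table = pa.ipc.open_file(source).read_all()

table = catalog.create_table("bench.users", schema=pa_table.schema)

start = time.perf_counter()
table.append(pa_table)  # while this runs, htop shows only one busy core
print(f"append took {time.perf_counter() - start:.1f}s")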

@Fokko
Contributor

Fokko commented Feb 14, 2024

@bigluck Thanks for raising this. This is on my list to look into!

Parallelizing this is always hard, since it is difficult to know in advance exactly how big the Parquet file will be; efficient encoding and compression can vary a lot. However, using just one core does not make any sense.

@sungwy sungwy added this to the PyIceberg 0.7.0 release milestone Feb 14, 2024
@kevinjqliu
Contributor

Copy/paste my comments from Slack.

PyIceberg currently uses ParquetWriter's write_table function to write data files (see write_file), which I believe only uses 1 core (need to verify this).

PyArrow also has the write_dataset function for writing multiple files, which has a use_threads argument to toggle writing in parallel.

https://arrow.apache.org/cookbook/r/reading-and-writing-data---multiple-files.html#write-partitioned-data---parquet
Describes using write_dataset to write multiple files
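For context, a minimal sketch of the pyarrow.dataset.write_dataset path; the output directory and row limits are placeholders, and this is not the PyIceberg write path itself:

import pyarrow as pa
import pyarrow.dataset as ds

# Assumed: `pa_table` stands in for the table being appended.
pa_table = pa.table({"id": list(range(1_000_000))})

ds.write_dataset(
    pa_table,
    base_dir="/tmp/parquet_out",
    format="parquet",
    use_threads=True,            # parallel writing across files
    max_rows_per_file=250_000,   # size control is row-based, not byte-based
    max_rows_per_group=250_000,
    existing_data_behavior="overwrite_or_ignore",
)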

@kevinjqliu
Contributor

kevinjqliu commented Feb 17, 2024

There aren't many mentions of using Arrow to write a single file with multiple threads. The only thing I found was use_threads() here.

I don't think this is exposed for writing a single file (write_table), but it is for writing multiple files (write_dataset).

@kevinjqliu
Contributor

Looks like we have to use a higher-level API to force parallelism, i.e. write_dataset to write multiple Parquet files using multiple threads.

@kevinjqliu
Contributor

Looks like write_file already gets an iterator of WriteTasks.

Maybe we can use the ParquetWriter.write_dataset implementation to parallelize the data file writes.

@kevinjqliu
Contributor

Support encoding a single parquet file using multiple threads

Looks like the bottleneck might be encoding rather than IO.
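One quick way to check that is to time a write into an in-memory sink, which takes IO out of the picture entirely; a rough sketch, with a small stand-in table:

import time

import pyarrow as pa
import pyarrow.parquet as pq

# Assumed: `pa_table` stands in for the table being appended.
pa_table = pa.table({"x": list(range(2_000_000))})

sink = pa.BufferOutputStream()  # pure in-memory target, so no network/disk cost
start = time.perf_counter()
pq.write_table(pa_table, sink)  # if this dominates the wall time, encoding is the bottleneck
print(f"encode-only write took {time.perf_counter() - start:.2f}s")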

@kevinjqliu
Contributor

Related to #208 and #353

@zeddit

zeddit commented Feb 18, 2024

I wonder whether a PyIceberg write will preserve row order when the data is read back later.
I ask because some engines, like Spark, don't preserve it, even when the sortOrder field is specified.

@Fokko
Contributor

Fokko commented Feb 18, 2024

There aren't many mentions of using Arrow to write a single file with multiple threads. The only thing I found was use_threads() here

This is for reading, so each core reads a separate row group.

Maybe we can use the ParquetWriter.write_dataset implementation to parallelize the data file writes.

How would we control the size of a file (or, as a second best, the number of records)?

I wonder, if we generate a fair amount of data, like @bigluck is doing:

import multiprocessing
import os
import time
import uuid
from functools import partial
from typing import Any

import pyarrow as pa
from faker import Faker
from pyiceberg.catalog import load_catalog


def generate_fake_rows(batch_size, i) -> list[dict[str, Any]]:
    fake = Faker()
    return [
        {
            **fake.profile(),
            'password': fake.sha256(),
            'device_id': fake.uuid4(),
            'device_ip': fake.ipv4(),
            'device_user_agent': fake.user_agent(),
            'phone': fake.phone_number(),
            'salary': fake.random_int(15_000, 250_000),
            'job_description': fake.sentence(nb_words=15),
            'cc_expire': fake.credit_card_expire(),
            'cc_number': fake.credit_card_number(),
            'cc_security_code': fake.credit_card_security_code(),
        }
        for _ in range(batch_size)
    ]


def generate_fake_table(num_records: int, batch_size: int) -> pa.Table:
    num_batches = num_records // batch_size

    generate_fake_rows_partial = partial(generate_fake_rows, batch_size)

    with multiprocessing.Pool(processes=multiprocessing.cpu_count() - 1) as pool:
        results = pool.map(generate_fake_rows_partial, range(num_batches))
    return pa.Table.from_pylist([row for batch in results for row in batch])

how many Arrow buffers do we get, and can we smartly join these buffers (we already have a bin-packing algorithm in the code)?

We could use the write_batch operation of ParquetWriter to write certain batches to a file.
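A minimal sketch of that idea with a plain PyArrow ParquetWriter; the file path is a placeholder, and this is not the PyIceberg write path itself:

import pyarrow as pa
import pyarrow.parquet as pq

# Assumed: `pa_table` stands in for the incoming Arrow table.
pa_table = pa.table({"x": list(range(1_000_000))})

with pq.ParquetWriter("/tmp/out.parquet", pa_table.schema) as writer:
    for batch in pa_table.to_batches():  # zero-copy split into RecordBatches
        writer.write_batch(batch)        # each batch is written as a row group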

@kevinjqliu
Contributor

kevinjqliu commented Feb 18, 2024

I took the above code and did some investigation.
Here's the notebook to see it in action https://colab.research.google.com/drive/12O4ARckCwJqP2U6L4WxREZbyPv_AmWC2?usp=sharing

I'll summarize.

I used the above functions to generate 1 million records and save them in a feather file.
The file on disk is 447 MB in size.
Reading the feather file back into Python, the pyarrow.Table is 685.46 MB in size, according to nbytes.

We can use the to_batches function to chop the table into RecordBatches, using the default settings.
This turns the table into 16 batches of around 40 MB each.

Given this, it seems plausible to chunk the table into batches and use the write_batch function. We can set the default chunk size to 512 MB or some other recommended setting.

Note: to_batches is a "zero-copy" method, so there shouldn't be a performance impact.
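To reproduce that inspection, assuming the fake data was saved to fake_data.feather as above:

import pyarrow as pa
from pyarrow import feather

arrow_tbl = feather.read_table("fake_data.feather")
print(f"table: {arrow_tbl.nbytes / 1024**2:.2f} MB in memory")

batches = arrow_tbl.to_batches()  # zero-copy; default max_chunksize
print(f"{len(batches)} batches")
for batch in batches:
    print(f"  {batch.num_rows} rows, {batch.nbytes / 1024**2:.1f} MB")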

@kevinjqliu
Contributor

It seems like there's an upper bound on the size of the RecordBatches produced by to_batches. I tried setting max_chunksize from 16 MB to 256 MB; all the batches produced are around 45 MB in size.

I guess this is what you mean by bin-packing: we can bin-pack these batches into 512 MB Parquet files.

@kevinjqliu
Contributor

Also, @bigluck, while running the code to generate the data using faker, I opened htop and saw that it was using 6 CPUs. I'm using an M1 Mac.

@bigluck
Author

bigluck commented Feb 19, 2024

Thanks @kevinjqliu
Last week, I didn't test the code on my MBP; I did all the tests directly on the EC2 instance.

BTW it seems to use all the cores on my M2 Max:

[Screenshot: htop on the M2 Max showing all cores in use]

@Fokko
Contributor

Fokko commented Feb 19, 2024

Do we know if this is for data generation only, or also for writing? In the end, it would be good to be able to split the data into multiple files. MacBooks have huge IO, so it might be that the CPU is the bottleneck rather than the IO.

@kevinjqliu
Contributor

It was for data generation only. I can't seem to reproduce the parallelism issue for append, probably due to the MacBook's huge IO.

@kevinjqliu
Contributor

we already have a bin-packing algorithm in the code

@Fokko can you point me to that? I couldn't find it

@Fokko
Contributor

Fokko commented Feb 19, 2024

@kevinjqliu It is under utils/bin_packing.py.

@kevinjqliu
Contributor

Thanks! I found it; I had to fuzzy-search in VS Code :)

Here's an example of bin-packing an Arrow table:
https://colab.research.google.com/drive/1FM8mdr4j5KgsjBYmsp9_eH8SBlwVC6C6?usp=sharing

I didn't fully understand lookback and largest_bin_first, so I just picked arbitrary values by looking at the test examples.
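For reference, roughly how the bin-packing utility can be applied to a table's record batches; the constructor arguments below follow the test examples and are my best guess at the signature, so double-check against utils/bin_packing.py:

import pyarrow as pa
from pyiceberg.utils.bin_packing import PackingIterator

# Assumed: `arrow_tbl` stands in for the table to split; pack by in-memory Arrow size.
arrow_tbl = pa.table({"x": list(range(5_000_000))})
TARGET_BYTES = 512 * 1024 * 1024

bins = PackingIterator(
    items=arrow_tbl.to_batches(),
    target_weight=TARGET_BYTES,
    lookback=2,                      # arbitrary, mirrors the test examples
    weight_func=lambda b: b.nbytes,  # weight = in-memory size of each RecordBatch
    largest_bin_first=False,
)
for i, packed in enumerate(bins):
    total_mb = sum(b.nbytes for b in packed) / 1024**2
    print(f"bin {i}: {len(packed)} batches, {total_mb:.1f} MB")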

@kevinjqliu
Contributor

kevinjqliu commented Feb 19, 2024

Integrating this with the write path, I see 2 approaches:

  1. Refactor write_file so that it can write multiple Parquet files. This means 1 WriteTask can produce multiple DataFiles.
  2. Bin-pack earlier (in _dataframe_to_data_files) and create multiple WriteTasks that are then passed to write_file.

I like the second option, but we need to coordinate with how we implement partitioned writes.
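A toy illustration of the second option, using simplified, hypothetical stand-ins rather than PyIceberg's real WriteTask/write_file types; the one-batch-per-task split and output paths are assumptions:

from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass

import pyarrow as pa
import pyarrow.parquet as pq


@dataclass
class FakeWriteTask:  # hypothetical stand-in for PyIceberg's WriteTask
    task_id: int
    batches: list[pa.RecordBatch]


def write_task(task: FakeWriteTask, schema: pa.Schema) -> str:
    path = f"/tmp/00000-{task.task_id}.parquet"
    pq.write_table(pa.Table.from_batches(task.batches, schema), path)
    return path


# Assumed: one task per bin; here each record batch simply becomes its own task.
arrow_tbl = pa.table({"x": list(range(2_000_000))})
tasks = [FakeWriteTask(i, [batch]) for i, batch in enumerate(arrow_tbl.to_batches())]

with ThreadPoolExecutor() as pool:  # parallelism across output files
    paths = list(pool.map(lambda t: write_task(t, arrow_tbl.schema), tasks))
print(paths)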

@kevinjqliu
Contributor

kevinjqliu commented Feb 19, 2024

#444 does something like this.

It wrote out 3 files:

(env) ➜  iceberg-python git:(kevinjqliu/bin-pack-write) ✗ ll -h /tmp/iceberg_warehouse8868767412496944120/iceberg_data/fake/fake_data/data   
total 498312
-rw-r--r--  1 kevinliu  wheel    80M Feb 19 10:19 00000-0-a61f9655-0d76-45ca-b85d-4d8dc8dbcbd9.parquet
-rw-r--r--  1 kevinliu  wheel    84M Feb 19 10:19 00000-1-a61f9655-0d76-45ca-b85d-4d8dc8dbcbd9.parquet
-rw-r--r--  1 kevinliu  wheel    80M Feb 19 10:19 00000-2-a61f9655-0d76-45ca-b85d-4d8dc8dbcbd9.parquet

Reading the data back returns the same number of records.

@kevinjqliu
Contributor

Oh interesting: the input is 1M records, 685.46 MB in memory. We bin-pack the Arrow representation into 256 MB chunks (['224.61 MB', '236.23 MB', '224.62 MB']), but writing these out as Parquet produces files of only ~80 MB each.

We'd want to take into account the size of the Parquet file when written to disk, not the size of the Arrow representation in memory.
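One way to do that is to estimate the Arrow-to-Parquet size ratio from a small sample before choosing the bin size; a rough sketch, where the 10% sample is arbitrary and real data may compress unevenly:

import pyarrow as pa
import pyarrow.parquet as pq

# Assumed: `arrow_tbl` stands in for the in-memory table to be written.
arrow_tbl = pa.table({"x": list(range(2_000_000))})
TARGET_PARQUET_BYTES = 512 * 1024 * 1024

# Write a small slice to an in-memory buffer to estimate the compression ratio.
sample = arrow_tbl.slice(0, max(1, len(arrow_tbl) // 10))
sink = pa.BufferOutputStream()
pq.write_table(sample, sink)
ratio = sink.getvalue().size / sample.nbytes  # parquet bytes per arrow byte

target_arrow_bytes = int(TARGET_PARQUET_BYTES / ratio)
print(f"ratio={ratio:.2f}; pack Arrow batches to ~{target_arrow_bytes / 1024**2:.0f} MB per file")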

@kevinjqliu
Contributor

Here's how it's done in Java: BaseTaskWriter.java checks targetFileSize while writing and creates a new file once the file size target is met.

targetFileSize is set in SparkWrite, which refers to the targetDataFileSize property.

The default setting is 512 MB.
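On the PyIceberg side, the matching knob is the write.target-file-size-bytes table property (tested later in this thread); a sketch of setting it at table creation, with the catalog name and identifier as placeholders:

import pyarrow as pa
from pyiceberg.catalog import load_catalog

# Assumed: an existing catalog configuration; 128 MB target instead of the 512 MB default.
arrow_tbl = pa.table({"x": list(range(1_000_000))})
catalog = load_catalog("default")

table = catalog.create_table(
    "fake.fake_data",
    schema=arrow_tbl.schema,
    properties={"write.target-file-size-bytes": str(128 * 1024 * 1024)},
)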

@kevinjqliu
Contributor

@bigluck, #444 should allow parallelized writes. Do you mind giving it a try?

@bigluck
Author

bigluck commented Feb 24, 2024

Ciao @kevinjqliu, thanks!

I've tested it on the same c5ad.16xlarge machine, but the results are pretty similar: 27s vs 28s for this table:

$ pip install git+https://github.com/kevinjqliu/iceberg-python.git@kevinjqliu/bin-pack-write
$ python3 benchmark.py
...
Generating 10,000,000 records in 1,000 batches
 - generate_users: done (in 582.4395 seconds)
 - table size: 7188266296 bytes, 6.69 GB, 10,000,000 records (in 0.0002 seconds)
 - create empty table: users done (in 0.6575 seconds)
 - append data: done (in 27.7934 seconds)
...

-rw-rw-r--. 1 ec2-user ec2-user 6.7G Feb 24 09:04 table_10000000.arrow
[Screenshot: htop during the append]

This is the final table parquet file on s3:

[Screenshot: the resulting ~1.6 GB Parquet file on S3]

@kevinjqliu
Contributor

Hm, it looks like something weird is going on if the resulting Parquet file is 1.6 GB. Each Parquet file should be at most 512 MB, if not less; see the bin-packing logic.

Here's something we can run for diagnostics,

# https://stackoverflow.com/questions/12523586/python-format-size-application-converting-b-to-kb-mb-gb-tb
def humanbytes(B):
    """Return the given bytes as a human friendly KB, MB, GB, or TB string."""
    B = float(B)
    KB = float(1024)
    MB = float(KB ** 2) # 1,048,576
    GB = float(KB ** 3) # 1,073,741,824
    TB = float(KB ** 4) # 1,099,511,627,776

    if B < KB:
        return '{0} {1}'.format(B, 'Bytes' if B == 0 or B > 1 else 'Byte')
    elif KB <= B < MB:
        return '{0:.2f} KB'.format(B / KB)
    elif MB <= B < GB:
        return '{0:.2f} MB'.format(B / MB)
    elif GB <= B < TB:
        return '{0:.2f} GB'.format(B / GB)
    elif TB <= B:
        return '{0:.2f} TB'.format(B / TB)

import pyarrow as pa
from pyarrow import feather

arrow_tbl = feather.read_table('fake_data.feather')
print(f"Table has {len(arrow_tbl)} records")

from pyiceberg.catalog import load_catalog
catalog = load_catalog()
try:
    catalog.drop_table("fake.fake_data")  # drop_table returns None; no need to keep the result
except Exception:
    pass
iceberg_tbl = catalog.create_table("fake.fake_data", arrow_tbl.schema)

from pyiceberg.io.pyarrow import bin_pack_arrow_table
bins = bin_pack_arrow_table(arrow_tbl,iceberg_tbl.properties)
for bin in bins:
    print(f"total={humanbytes(sum(map(lambda x:x.nbytes, bin)))}, chunks={[humanbytes(batch.nbytes) for batch in bin]}")

You might have to change arrow_tbl and iceberg_tbl. This will show us how the Arrow table is bin-packed during writes.

For write operations (append/overwrite), parallelism only kicks in during the actual writing. To take advantage of it, you have to set the PYICEBERG_MAX_WORKERS environment variable to the number of CPUs.
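For example (the value is a placeholder; it has to be set before PyIceberg creates its worker pool, so exporting it before launching the script is the safest option):

import os

# Assumed: set before any PyIceberg table operation runs in this process.
os.environ["PYICEBERG_MAX_WORKERS"] = "64"  # match the machine's CPU count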

@bigluck
Author

bigluck commented Feb 25, 2024

Hey @kevinjqliu , we're currently debugging the issue on Slack, but I thought it would be helpful to report our findings here as well. In my tests, the pyarrow table is generated using the following code:

return pa.Table.from_pylist([row for batch in results for row in batch])

I've also cached the table on disk to save time, and it's read using the following code:

with pa.memory_map('table_10000000.arrow', 'r') as source:
    pa_table = pa.ipc.open_file(source).read_all()

Although I know that using record batches would be the right way to read the file, I'm explicitly using read_all() because I can't control what kind of table users' code will generate. This is to evaluate the edge case where the table has to be read without using streaming functions.

After importing, bin_pack_arrow_table returns only one chunk:

total=6.69 GB, chunks=['6.69 GB']
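A quick way to see why everything lands in a single bin is to inspect the batch layout of the table after read_all(); a small diagnostic sketch using the same file as above:

import pyarrow as pa

with pa.memory_map("table_10000000.arrow", "r") as source:
    pa_table = pa.ipc.open_file(source).read_all()

batches = pa_table.to_batches()
print(f"{len(batches)} record batch(es)")  # likely one giant batch, matching the single 6.69 GB chunk above
print(f"first column: {pa_table.column(0).num_chunks} chunk(s)")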

Let me know if you have any questions, and thanks for your time!

@bigluck
Author

bigluck commented Feb 26, 2024

@kevinjqliu, your latest changes are mind-blowing (#444 (comment) for reference)

I have tested your latest changes on c5ad.8xlarge and c5ad.16xlarge instances using my 10,000,000-record table.

  • On a c5ad.8xlarge instance with a 10 Gbps NIC, 32 cores, and 64 GB of RAM, it took an average of 3.9s to write 14 Parquet files. Previously, using PyIceberg 0.6.0, it took 31s.
  • On a c5ad.16xlarge instance with a 20 Gbps NIC, 64 cores, and 128 GB of RAM, it took approximately 3.6s to complete the same task, compared to 28.2s using PyIceberg 0.6.0.

I have been experimenting with different settings to improve write performance further, but without success.
I tried adjusting the PYICEBERG_MAX_WORKERS variable, but it did not make much difference. This might be due to the small size of my dataset (only 6.69 GB in Arrow format), which resulted in only 14 output files.
I also tested the write.target-file-size-bytes property, which produced 27 files when set to 268435456 and 54 files when set to 134217728.
However, even when I set PYICEBERG_MAX_WORKERS to 64, the total write operation still took 3.6 seconds.

Overall, I am very impressed with how it works now! Well done!

@kevinjqliu
Contributor

As a way to benchmark multithreaded writes to multiple Parquet files: I noticed that DuckDB's COPY command has per_thread_output and file_size_bytes options.

Using the DuckDB CLI:

.timer on
CREATE TABLE tbl AS SELECT * FROM read_parquet('table_10000000.parquet');

SELECT COUNT(*) FROM tbl;

COPY
    (SELECT * FROM tbl)
    TO 'parquet_files'
    (FORMAT PARQUET, PER_THREAD_OUTPUT true, FILE_SIZE_BYTES '512MB', OVERWRITE_OR_IGNORE true);

Result,

Run Time (s): real 14.588 user 66.250569 sys 8.554207

[Screenshots: DuckDB run output and the resulting Parquet files]

And setting FILE_SIZE_BYTES to 256MB,

COPY
    (SELECT * FROM tbl)
    TO 'parquet_files'
    (FORMAT PARQUET, PER_THREAD_OUTPUT true, FILE_SIZE_BYTES '256MB', OVERWRITE_OR_IGNORE true);
Run Time (s): real 15.575 user 66.483984 sys 10.547852

[Screenshots: DuckDB run output and the resulting Parquet files]

I'm not sure if there's a way to specify the number of threads DuckDB can use, but watching htop while executing the statements, I can see that all the cores are used.

@bigluck
Author

bigluck commented Feb 27, 2024

@kevinjqliu nice, DuckDB's thread count should be configurable via its settings: https://duckdb.org/docs/sql/configuration.html

@kevinjqliu
Contributor

Thanks! I can see that it's using 8 threads with

SELECT * FROM duckdb_settings();

I also ran

SET threads TO 8;

just in case

@kevinjqliu
Contributor

kevinjqliu commented Feb 27, 2024

Here's the script I used to run the append() function with 8 threads writing multiple Parquet files:
https://gist.github.com/kevinjqliu/e738641ec8f96de554c5ed39ead3f09a
[Screenshots: run output and the 14 resulting Parquet files]

Wrote 14 files in around 16.5 seconds

And on the main branch, without the parallel write from #444:

[Screenshot: run output on main]

Wrote a single 1.6 GB file in around 32.8 seconds.

@Fokko
Contributor

Fokko commented May 13, 2024

Fixed in #444
