Transaction log parsing performance regression #2760

Tom-Newton · 2024-08-12T10:09:14Z

Environment

Delta-rs version: 0.18.2

Binding: Python

Environment:

Cloud provider: Azure
OS: Ubuntu 20.04
Other: Python 3.10

Bug

What happened:
Performance regression in transaction log parsing compared to deltalake 0.10.1

Y axis is time in seconds.

What you expected to happen:
New versions of deltalake to be the same or better performance than older versions.

How to reproduce it:

deltalake.DeltaTable(
    "abfss://<container>@<storage account>.dfs.core.windows.net/path/to/table/with/long/transaction_log",
    storage_options={"use_azure_cli": "true"},
)

Compare the time taken for this compared to the same thing when using deltalake 0.10.1.

More details:
I think I have identified the 2 reasons for the performance regression.

deltalake now relies on ObjectStore.list_with_offset which uses an inefficient implementation for Azure but is probably advantageous on GCS or S3.
1. I'm not particularly worried about this. Use reconstructed ListBlobs marker to provide list offset support in MicrosoftAzure store apache/arrow-rs#6174 (comment) is trying to solve it and we're trying to get Azure to help with that.
2. I have a temporary solution that re-implements the old behaviour of deltalake where it iteratively checks if commit versions exist, instead of using a list operation. This gives us the "0.18.2 modified" results in my graph above.
3. For the record list_with_offset was introduced by feat: buffered reading of transaction logs #1549
It looks like the latest checkpoint and subsequent log entries are download from blob storage twice.
1. First downloaded to get the table protocol and metadata, second download to get relevant parquet files
2. Looks like this behaviour was introduced by feat: arrow backed log replay and table state #2037
3. This one concerns me a bit more. Is there any plan to try to improve this?

(Reporting as a bug seemed more appropriate than a feature request but is not ideal)

The text was updated successfully, but these errors were encountered:

rtyler · 2024-08-12T15:27:38Z

interesting find, thanks for the report! 👋

ion-elgreco · 2024-08-12T18:54:59Z

@Tom-Newton If I am bored next weekend, I can take a look at the second issue ; )

ion-elgreco · 2024-08-12T19:34:52Z

@Tom-Newton does this have some negligible change for you? #2764

Tom-Newton · 2024-08-12T20:38:13Z

Thanks @ion-elgreco it looks like this did provide a measurable improvement

Middle is what I previously called "0.18.2 modified", right is the same again but additionally with #2764

It looks like its still about 41-55% slower than deltalake 0.10.1 though. I have some network download stats and it looks like its still downloaded about 120MB just to initialise deltalake.DeltaTable() while the same test downloaded about 60MB on 0.10.1.

ion-elgreco · 2024-08-12T20:45:41Z

@Tom-Newton awesome! Thanks for checking this!

Yeah so like you mentioned before we read the files twice.

I was thinking of perhaps modifying the commit_stream/checkpoint_stream so that it caches the result prior to returning. Then the stream function should also be able to drain the cache when it's executed for the second time.

rtyler · 2024-08-14T21:39:56Z

@Tom-Newton I'm curious if you can run your test with the latest 0.19.0 that just released, I'm very curious how well @roeap 's optimizations might have also helped your usecase 😄

Tom-Newton · 2024-08-14T22:52:46Z

Its looking good

The latest version now slightly outperforms 0.10.1 and it only downloads 30MiB compared to 120MiB previously (30MiB matches up nicely with the size of the checkpoint plus subsequent commits) 🎉.

I assume it was #2772 that made the difference, but I'm a bit confused about how.

rtyler · 2024-08-15T04:22:17Z

✨ magic! ✨

more efficient use of the checkpoints and other metadata in the log is likely causing fewer log files to be fetched from storage. Either way, thank you so much for providing pretty graphs! I'm going to stick a fork in this turkey and call it done!

ion-elgreco · 2024-08-15T06:07:25Z

@rtyler I do believe we can squeeze more performance out though

Tom-Newton · 2024-08-15T07:43:19Z

I think we are still reading the checkpoint and subsequent log files twice but with the new optimisations we're taking advantage of the features of parquet to not download columns and row groups of the checkpoint that we don't need. This seems to be enough to edge out the 0.10.1 performance even when doing that twice.

Anyway, thanks everyone 🙂. I'm impressed how quickly this was resolved.

ion-elgreco · 2024-08-15T07:45:26Z

I'll leave it open though, to not forget to look into this

Actually will just create new issue instead

alexwilcoxson-rel · 2024-08-21T15:52:44Z

saw this late but one thing that may have helped also was #2717 this should result in only reading the columns of parquet necessary from the checkpoint for the actions being queried

Tom-Newton added the bug Something isn't working label Aug 12, 2024

rtyler added the binding/rust Issues for the Rust crate label Aug 12, 2024

rtyler closed this as completed Aug 15, 2024

ion-elgreco reopened this Aug 15, 2024

ion-elgreco closed this as completed Aug 15, 2024

ion-elgreco mentioned this issue Aug 15, 2024

Add caching mechanism for log/checkpoint stream #2776

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Transaction log parsing performance regression #2760

Transaction log parsing performance regression #2760

Tom-Newton commented Aug 12, 2024 •

edited

Loading

rtyler commented Aug 12, 2024

ion-elgreco commented Aug 12, 2024

ion-elgreco commented Aug 12, 2024

Tom-Newton commented Aug 12, 2024

ion-elgreco commented Aug 12, 2024

rtyler commented Aug 14, 2024

Tom-Newton commented Aug 14, 2024

rtyler commented Aug 15, 2024

ion-elgreco commented Aug 15, 2024

Tom-Newton commented Aug 15, 2024

ion-elgreco commented Aug 15, 2024 •

edited

Loading

alexwilcoxson-rel commented Aug 21, 2024

Transaction log parsing performance regression #2760

Transaction log parsing performance regression #2760

Comments

Tom-Newton commented Aug 12, 2024 • edited Loading

Environment

Bug

rtyler commented Aug 12, 2024

ion-elgreco commented Aug 12, 2024

ion-elgreco commented Aug 12, 2024

Tom-Newton commented Aug 12, 2024

ion-elgreco commented Aug 12, 2024

rtyler commented Aug 14, 2024

Tom-Newton commented Aug 14, 2024

rtyler commented Aug 15, 2024

ion-elgreco commented Aug 15, 2024

Tom-Newton commented Aug 15, 2024

ion-elgreco commented Aug 15, 2024 • edited Loading

alexwilcoxson-rel commented Aug 21, 2024

Tom-Newton commented Aug 12, 2024 •

edited

Loading

ion-elgreco commented Aug 15, 2024 •

edited

Loading