Conversation

@Polectron (Contributor) commented Mar 21, 2023:

Building on top of #7033, this uses pyarrow.dataset.Scanner.head(num_rows) and a multiprocessing.Value as a counter to limit the number of rows retrieved in pyiceberg.io.pyarrow.project_table(). It avoids reading more files once the quantity specified by limit has been reached.

Closes #7013
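The mechanism described above can be sketched as follows. This is an illustrative sketch, not the actual pyiceberg code: names like `scan_fragment`, `rows_counter`, and the list-slice standing in for `Scanner.head(limit)` are assumptions for the example.

```python
import multiprocessing

def scan_fragment(fragment_rows, rows_counter, limit):
    """Stand-in for one per-file task: skip the expensive read entirely
    if earlier tasks already produced enough rows, otherwise read at
    most `limit` rows and bump the shared counter under its lock."""
    with rows_counter.get_lock():
        if rows_counter.value >= limit:
            return []  # enough rows already collected; skip this file
    rows = fragment_rows[:limit]  # stands in for Scanner.head(limit)
    with rows_counter.get_lock():
        rows_counter.value += len(rows)
    return rows

rows_counter = multiprocessing.Value("i", 0)
limit = 5
files = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
chunks = [scan_fragment(f, rows_counter, limit) for f in files]
result = [row for chunk in chunks for row in chunk][:limit]  # final trim
print(result)  # [1, 2, 3, 4, 5]; the third file is never read
```

Note that each fragment may still return up to `limit` rows, so the concatenated result is trimmed once more at the end; the counter's job is only to short-circuit files that no longer need reading.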

@github-actions github-actions bot added the INFRA label Mar 21, 2023
@Fokko (Contributor) left a comment:
This is looking great @Polectron! I've left some suggestions; let me know what you think.


```python
if limit:
    arrow_table = fragment_scanner.head(limit)
    with rows_counter.get_lock():
```
@Fokko (Contributor):
I think we can remove this lock because we already did the expensive work. This will make the code a bit simpler and avoid locking.

@Polectron (Contributor, Author):
The official Python documentation for multiprocessing.Value suggests that non-atomic operations like += should use a lock. Otherwise a race condition could occur where multiple reads finish at the same time and overwrite the row counter, causing us to keep reading rows even after we have already read enough.
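To make the race concrete, here is a runnable sketch (names are illustrative, not the PR's code) where four threads each increment a shared multiprocessing.Value 10,000 times. `counter.value += 1` is a read followed by a write, so without the lock two threads can read the same value and lose an update; holding `get_lock()` makes the final count deterministic:

```python
import threading
from multiprocessing import Value

counter = Value("i", 0)

def add_rows(n: int) -> None:
    for _ in range(n):
        # += is not atomic: a read and a write as two separate steps,
        # so we serialize them with the Value's built-in lock.
        with counter.get_lock():
            counter.value += 1

threads = [threading.Thread(target=add_rows, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter.value)  # 40000, deterministically, because of the lock
```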

@Fokko (Contributor):
Ah, thanks for the explanation. Currently we don't do multi-processing, but multi-threading. I did some extensive testing and noticed that multi-processing wasn't substantially faster than multi-threading, probably because most of the time is spent fetching the files and reading the data, which all happens in Arrow and bypasses the GIL.

@Polectron (Contributor, Author) commented Mar 22, 2023:

I think you made the right choice going for multi-threading, because the task is IO-bound and there are potentially thousands of tasks that need to be executed. Even though we use multi-threading, which is bound by the GIL, some operations might still not be thread-safe (like +=, which performs a read and a write as two separate operations).
Also, even though the tasks run on threads in a ThreadPool, multiprocessing.Value uses a multiprocessing.RLock by default, which is compatible with both processes and threads.

@Fokko (Contributor):
Thanks for clearing this up!

```python
def __setattr__(self, name: str, value: Any) -> None:
    # The file_format is written as a string, so we need to cast it to the Enum
    if name == "file_format":
        value = FileFormat[value]
```
@Fokko (Contributor):
Just curious. Is this related to changes in pyarrow.py or just an independent fix for parsing file_format in DataFile?

@Polectron (Contributor, Author):

This was introduced by merging #7033
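For context, the cast in that `__setattr__` hook relies on Python's Enum name lookup, `EnumClass[name]`. A minimal sketch with a stand-in FileFormat (a hypothetical enum for illustration, not pyiceberg's actual definition):

```python
from enum import Enum

class FileFormat(Enum):  # hypothetical stand-in, not the pyiceberg class
    AVRO = "avro"
    ORC = "orc"
    PARQUET = "parquet"

# Name lookup casts the stored string back to the enum member, which is
# what `value = FileFormat[value]` does in the hook above.
fmt = FileFormat["PARQUET"]
print(fmt is FileFormat.PARQUET)  # True
```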

@Polectron (Contributor, Author) commented:
I don't really understand what's failing in the Python integration tests: they sometimes don't find some of the tables created by provision.py. Maybe the sleep after launching the Docker container in python-integration.yml isn't long enough, so the tests start running before the tables have finished being created?

@Polectron Polectron requested a review from Fokko March 22, 2023 11:25
@Fokko (Contributor) commented Mar 22, 2023:

@Polectron I noticed this as well. The current master is red because of failing Python integration tests. Maybe we can increase the timeout for now.

```diff
  docker-compose -f dev/docker-compose-integration.yml build
  docker-compose -f dev/docker-compose-integration.yml up -d
- sleep 20
+ sleep 30
```
@Fokko (Contributor):
Works for now 👍🏻

@Fokko (Contributor) commented Mar 22, 2023:

@Polectron thanks, this looks great! I have one final ask, could you update the docs under python/mkdocs/docs/? Otherwise, people won't use this awesome feature.

@Polectron (Contributor, Author):
@Fokko
Done! 🎉

@Fokko (Contributor) commented Mar 23, 2023:

@Polectron looks like #7148 went in, can you rebase?

@Polectron Polectron requested a review from Fokko March 23, 2023 19:12


@Fokko Fokko merged commit 25360c0 into apache:master Mar 24, 2023
@Fokko (Contributor) commented Mar 24, 2023:

Thanks for working on this @Polectron, this is a great addition 👍🏻

@Fokko Fokko added this to the PyIceberg 0.4.0 release milestone Mar 24, 2023

Successfully merging this pull request may close these issues.

PyIceberg, Implement Limit filter to preview Table Data

3 participants