Python: Add limit parameter to table scan #7163
Conversation
Fokko left a comment
This is looking great @Polectron! I've left some suggestions, let me know what you think.
```python
if limit:
    arrow_table = fragment_scanner.head(limit)
    with rows_counter.get_lock():
```
I think we can remove this lock because we already did the expensive work. This will make the code a bit simpler and avoid locking.
The official Python documentation for `multiprocessing.Value` suggests that non-atomic operations like `+=` should use a lock. Otherwise a race condition could occur where multiple reads finish at the same time and overwrite the row counter, causing us to keep reading rows even though we have already read enough.
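A minimal repro of that race, under the assumption of two plain threads bumping a shared `multiprocessing.Value` (names here are illustrative, not from the PR):

```python
import threading
from multiprocessing import Value

counter = Value("i", 0)

def bump_unsafe(n: int) -> None:
    for _ in range(n):
        # += expands to a read followed by a write; a thread switch in
        # between lets one of the two increments overwrite the other.
        counter.value += 1

threads = [threading.Thread(target=bump_unsafe, args=(100_000,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Typically prints less than 200000; wrapping the += in
# `with counter.get_lock():` makes it reliably 200000.
print(counter.value)
```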
Ah, thanks for the explanation. Currently we don't do multi-processing but multi-threading. I did some extensive testing and noticed that multi-processing wasn't substantially faster than multi-threading, probably because most of the time is spent fetching the files and reading the data, which all happens in Arrow and therefore bypasses the GIL.
I think you made the right choice going for multi-threading, because the task is IO-bound and there are potentially thousands of tasks that need to be executed. Even though we use multi-threading, which is bound by the GIL, some operations might still not be safe (like `+=`, which performs a read and a write as two separate operations).
Also, even though the tasks run on threads in a `ThreadPool`, `multiprocessing.Value` uses a `multiprocessing.RLock` by default, which is compatible with both processes and threads.
Thanks for clearing this up!
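Putting the thread's conclusion together, a rough sketch of the pattern (function and variable names are illustrative, not the PR's actual code): each worker skips its file if the counter already hit the limit, reads at most `limit` rows via `Scanner.head`, and increments the shared counter under its lock.

```python
from multiprocessing import Value
from typing import Optional

import pyarrow as pa
import pyarrow.dataset as ds

rows_counter = Value("i", 0)  # shared by all worker threads
limit = 1000  # illustrative

def scan_fragment(fragment: ds.Fragment, schema: pa.Schema) -> Optional[pa.Table]:
    # Skip the file entirely if earlier tasks already produced enough rows.
    if rows_counter.value >= limit:
        return None
    fragment_scanner = ds.Scanner.from_fragment(fragment, schema=schema)
    # head() stops reading the file once `limit` rows are materialized.
    arrow_table = fragment_scanner.head(limit)
    # += is a read plus a write, so it must happen under the lock.
    with rows_counter.get_lock():
        rows_counter.value += len(arrow_table)
    return arrow_table

# The tasks would then be fanned out with a ThreadPoolExecutor, e.g.:
#   with ThreadPoolExecutor() as pool:
#       tables = [t for t in pool.map(scan, fragments) if t is not None]
```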
```python
def __setattr__(self, name: str, value: Any) -> None:
    # The file_format is written as a string, so we need to cast it to the Enum
    if name == "file_format":
        value = FileFormat[value]
```
Just curious: is this related to changes in `pyarrow.py`, or is it an independent fix for parsing `file_format` in `DataFile`?
This was introduced by merging #7033.
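For context, indexing an enum class by member name returns the member, which is what `FileFormat[value]` relies on. A small stand-in (not pyiceberg's actual enum definition):

```python
from enum import Enum

class FileFormat(Enum):
    # Stand-in members; pyiceberg's real FileFormat enum lives in the library.
    AVRO = "AVRO"
    ORC = "ORC"
    PARQUET = "PARQUET"

# Name lookup turns the serialized string back into the Enum member.
assert FileFormat["PARQUET"] is FileFormat.PARQUET
```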
I don't really understand what's failing on the Python integration test; it sometimes doesn't find some of the tables created by

@Polectron I noticed this as well. The current master is red because of failing Python integration tests. Maybe we can increase the timeout for now.
```diff
 docker-compose -f dev/docker-compose-integration.yml build
 docker-compose -f dev/docker-compose-integration.yml up -d
-sleep 20
+sleep 30
```
Works for now 👍🏻
@Polectron thanks, this looks great! I have one final ask: could you update the docs under

@Fokko

@Polectron looks like #7148 went in, can you rebase?
Thanks for working on this @Polectron, this is a great addition 👍🏻
Building on top of #7033, this uses `pyarrow.dataset.Scanner.head(num_rows)` and a `multiprocessing.Value` as a counter to limit the number of rows retrieved in `pyiceberg.io.pyarrow.project_table()`, and avoids reading more files once the desired quantity specified by `limit` has been reached.

Closes #7013
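A hedged usage sketch of the new parameter (the catalog and table names below are placeholders):

```python
from pyiceberg.catalog import load_catalog

catalog = load_catalog("default")              # placeholder catalog name
table = catalog.load_table("examples.events")  # placeholder table identifier

# At most 100 rows are materialized; once the counter reaches the limit,
# the remaining data files are not read at all.
arrow_table = table.scan(limit=100).to_arrow()
print(len(arrow_table))  # <= 100
```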