# Fixed the limit bug and added test for count() method and documentation for count() #2423
**New documentation file** (diff `@@ -0,0 +1,96 @@`):
---
title: Count Recipe - Efficiently Count Rows in Iceberg Tables
---

# Counting Rows in an Iceberg Table

This recipe demonstrates how to use the `count()` method to efficiently count rows in an Iceberg table using PyIceberg. The count operation is optimized for performance: it reads file metadata rather than scanning the actual data.

## How Count Works

The `count()` method leverages Iceberg's metadata architecture to provide fast row counts through the following steps (sketched in code after the list):

1. **Reading file manifests**: examines metadata about data files without loading the actual data
2. **Aggregating record counts**: sums the per-file record counts recorded in manifest entries (originally taken from Parquet file footers at write time)
3. **Applying filters at the metadata level**: pushes down predicates to skip irrelevant files
4. **Handling deletes**: accounts for delete files so that deleted rows are excluded from the count
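
A minimal, illustrative sketch of that metadata-only strategy (not the actual library implementation; it assumes each planned task exposes `file.record_count`, a `residual` expression, and `delete_files`, as PyIceberg's `FileScanTask` does):

```python
from pyiceberg.expressions import AlwaysTrue

def sketch_count(scan) -> int:
    """Sum per-file record counts from scan-plan metadata (illustrative)."""
    total = 0
    for task in scan.plan_files():
        # A file can be counted wholesale only when the pushed-down filter
        # matches it entirely (residual is AlwaysTrue) and no delete files
        # apply; otherwise its rows would need to be read to stay exact.
        if task.residual == AlwaysTrue() and not task.delete_files:
            total += task.file.record_count
    return total
```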

## Basic Usage

Count all rows in a table:

```python
from pyiceberg.catalog import load_catalog

catalog = load_catalog("default")
table = catalog.load_table("default.cities")

# Get the total row count
row_count = table.scan().count()
print(f"Total rows in table: {row_count}")
```

## Count with Filters

Count rows matching specific conditions:

```python
from pyiceberg.expressions import GreaterThan, EqualTo, And

# Count rows with population > 1,000,000
large_cities = table.scan().filter(GreaterThan("population", 1000000)).count()
```

**Suggested change** (from review):

```diff
-large_cities = table.scan().filter(GreaterThan("population", 1000000)).count()
+large_cities = table.scan().filter("population > 1000000").count()
```
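
As the suggestion above implies, `filter()` accepts either expression objects or string predicates. For combining conditions, a sketch using the `And` and `EqualTo` imports above (the `country` column is hypothetical, not part of the recipe's sample data):

```python
from pyiceberg.expressions import And, EqualTo, GreaterThan

# Equivalent string form: "population > 1000000 AND country = 'US'";
# `country` is a hypothetical column used only for illustration.
us_large_cities = table.scan().filter(
    And(GreaterThan("population", 1000000), EqualTo("country", "US"))
).count()
```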
**New test file** (diff `@@ -0,0 +1,129 @@`):
| """ | ||
| Unit tests for the DataScan.count() method in PyIceberg. | ||
|
|
||
| The count() method is essential for determining the number of rows in an Iceberg table | ||
| without having to load the actual data. It works by examining file metadata and task | ||
| plans to efficiently calculate row counts across distributed data files. | ||
|
|
||
| These tests validate the count functionality across different scenarios: | ||
| 1. Basic counting with single file tasks | ||
| 2. Empty table handling (zero records) | ||
| 3. Large-scale counting with multiple file tasks | ||
|
|
||
| The tests use mocking to simulate different table states without requiring actual | ||
| Iceberg table infrastructure, ensuring fast and isolated unit tests. | ||
| """ | ||
|
|
||
| import pytest | ||
| from unittest.mock import MagicMock, Mock, patch | ||
| from pyiceberg.table import DataScan | ||
| from pyiceberg.expressions import AlwaysTrue | ||
|
|
||
|
|
||

class DummyFile:
    """
    Mock representation of an Iceberg data file.

    In real scenarios, this would contain metadata about Parquet files,
    including record counts, file paths, and statistics.
    """

    def __init__(self, record_count):
        self.record_count = record_count
```

> **Contributor:** I think we could write real data files and use those for testing, wdyt? Here are some fixtures we could use to get a […]. Maybe we can also add some more fixtures to get FileScanTasks for empty files and large ones.
>
> **Author:** Yep, it will be a good addition actually.
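
A rough sketch of what such a fixture-backed test could look like (purely illustrative; the `SqlCatalog` setup and all names here are assumptions, not part of this PR):

```python
import pyarrow as pa
from pyiceberg.catalog.sql import SqlCatalog

def test_count_with_real_data(tmp_path):
    # Hypothetical integration-style test: write a real data file through
    # a SQLite-backed catalog, then count it with a real scan.
    catalog = SqlCatalog(
        "default",
        uri=f"sqlite:///{tmp_path}/catalog.db",
        warehouse=f"file://{tmp_path}",
    )
    catalog.create_namespace("default")
    data = pa.table({"n": pa.array([1, 2, 3], type=pa.int64())})
    table = catalog.create_table("default.t", schema=data.schema)
    table.append(data)
    assert table.scan().count() == 3
```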

```python
class DummyTask:
    """
    Mock representation of a scan task in Iceberg query planning.

    A scan task represents work to be done on a specific data file,
    including any residual filters and delete files that need to be applied.
    In actual usage, tasks are generated by the query planner based on
    partition pruning and filter pushdown optimizations.
    """

    def __init__(self, record_count, residual=None, delete_files=None):
        self.file = DummyFile(record_count)
        self.residual = residual if residual is not None else AlwaysTrue()
        self.delete_files = delete_files or []


def test_count_basic():
    """
    Test basic count functionality with a single file containing data.

    This test verifies that the count() method correctly aggregates record
    counts from a single scan task. It simulates a table with one data file
    containing 42 records and validates that the count method returns the
    correct total.

    The test demonstrates the typical use case where:
    - A table has one or more data files
    - Each file has metadata containing record counts
    - The count() method aggregates these counts efficiently
    """
    # Create a mock DataScan with the necessary attributes
    scan = Mock(spec=DataScan)

    # Mock the plan_files method to return our dummy task
    task = DummyTask(42, residual=AlwaysTrue(), delete_files=[])
    scan.plan_files = MagicMock(return_value=[task])

    # Bind the real count method to the mock and call it
    scan.count = DataScan.count.__get__(scan, DataScan)

    assert scan.count() == 42


def test_count_empty():
    """
    Test count functionality on an empty table.

    This test ensures that the count() method correctly handles empty tables
    that have no data files or scan tasks. It validates that an empty table
    returns a count of 0 without raising any errors.

    This scenario is important for:
    - Newly created tables before any data is inserted
    - Tables where all data has been deleted
    - Tables with restrictive filters that match no data
    """
    # Create a mock DataScan with the necessary attributes
    scan = Mock(spec=DataScan)

    # Mock the plan_files method to return no tasks
    scan.plan_files = MagicMock(return_value=[])

    # Bind the real count method to the mock and call it
    scan.count = DataScan.count.__get__(scan, DataScan)

    assert scan.count() == 0


def test_count_large():
    """
    Test count functionality with multiple files containing large datasets.

    This test validates that the count() method can efficiently handle tables
    with multiple data files and large record counts. It simulates a
    distributed scenario where data is split across multiple files, each
    containing 500,000 records, for a total of 1 million records.

    This test covers:
    - Aggregation across multiple scan tasks
    - Handling of large record counts (performance implications)
    - Distributed data scenarios common in big data environments
    """
    # Create a mock DataScan with the necessary attributes
    scan = Mock(spec=DataScan)

    # Mock the plan_files method to return multiple tasks
    tasks = [
        DummyTask(500000, residual=AlwaysTrue(), delete_files=[]),
        DummyTask(500000, residual=AlwaysTrue(), delete_files=[]),
    ]
    scan.plan_files = MagicMock(return_value=tasks)

    # Bind the real count method to the mock and call it
    scan.count = DataScan.count.__get__(scan, DataScan)

    assert scan.count() == 1000000
```
> **Reviewer:** It could be worth mentioning as a note that we could get the total count of a table from the snapshot properties by doing:
>
> ```python
> table.current_snapshot().summary.additional_properties["total-records"]
> ```
>
> so users can avoid doing a full table scan.
>
> **Author:** Thank you for the comments, I will work on them 😊
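
For context, a minimal sketch of that snapshot-summary shortcut (`quick_total_records` is a hypothetical helper built on the property named in the comment above, not part of this PR):

```python
from typing import Optional

def quick_total_records(table) -> Optional[int]:
    """Read the table-level record count from snapshot metadata.

    Returns None for a table with no snapshot (never written to). Unlike
    scan().count(), this cannot apply row filters, and depending on how
    delete files are tracked it may differ from an exact scan-based count.
    """
    snapshot = table.current_snapshot()
    if snapshot is None:
        return None
    return int(snapshot.summary.additional_properties["total-records"])
```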