UN-2882 [FIX] Fix BigQuery float precision issue in PARSE_JSON for metadata serialization #1593
Conversation
…tadata serialization

- Added BigQuery-specific float sanitization with IEEE 754 double precision safe zone
- Consolidated duplicate float sanitization logic into shared utilities
- Fixed insertion errors caused by floats with >15 significant figures

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
Summary by CodeRabbit
Walkthrough

Adds a shared float-sanitization utility and integrates it into BigQuery JSON/STRING handling and worker database utilities; BigQuery now sanitizes nested dicts/lists and floats (NaN/±Inf → None, large floats rounded) before JSON serialization and when parsing SQL values.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    participant Caller
    participant BigQuery as BigQuery.execute_query
    participant Sanit as _sanitize_for_bigquery / sanitize_floats_for_database
    participant JSON as json.dumps
    participant DB as BigQuery Service
    Caller->>BigQuery: execute_query(values including dict/list/JSON)
    rect rgba(220,235,255,0.4)
        BigQuery->>Sanit: sanitize(value)\n- recurse dict/list\n- NaN/±Inf -> None\n- round to 15 sig figs
        Sanit-->>BigQuery: sanitized_value
        alt sanitized_value is not None
            BigQuery->>JSON: json.dumps(sanitized_value)
            JSON-->>BigQuery: json_string
            BigQuery->>DB: send query with json_string
        else sanitized_value is None
            BigQuery->>DB: send query with NULL representation
        end
    end
    DB-->>BigQuery: query result
    BigQuery-->>Caller: result
```
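The sanitization step in the flow above can be sketched as follows. This is a minimal illustration based on the walkthrough's description (recurse into dicts/lists, map NaN/±Inf to None, pass other values through); the function name mirrors the shared utility, but the body is an assumption, not the project's actual implementation:

```python
import math


def sanitize_floats_for_database(value):
    """Sketch: replace NaN/±Inf floats with None, recursing into containers.

    Precision limiting is left to database-specific connectors."""
    if isinstance(value, float):
        return None if (math.isnan(value) or math.isinf(value)) else value
    if isinstance(value, dict):
        return {k: sanitize_floats_for_database(v) for k, v in value.items()}
    if isinstance(value, list):
        return [sanitize_floats_for_database(v) for v in value]
    return value


clean = sanitize_floats_for_database(
    {"ok": 1760509016.282637, "bad": float("inf"), "nested": [float("nan"), 2.5]}
)
# Regular floats are preserved unchanged; only NaN/±Inf become None.
```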
Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20–30 minutes

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✨ Finishing touches
🧪 Generate unit tests (beta): ✅ Unit Test PR creation complete.
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Actionable comments posted: 2
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to Reviews > Disable Cache setting
Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting
📒 Files selected for processing (3)
- unstract/connectors/src/unstract/connectors/databases/bigquery/bigquery.py (4 hunks)
- unstract/connectors/src/unstract/connectors/databases/utils.py (1 hunks)
- workers/shared/infrastructure/database/utils.py (4 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: build
🔇 Additional comments (5)
unstract/connectors/src/unstract/connectors/databases/bigquery/bigquery.py (2)
66-109: LGTM! Solid implementation of magnitude-based precision limiting.

The algorithm correctly limits total significant figures to 15 for IEEE 754 compatibility. The magnitude calculation `floor(log10(abs(data))) + 1` properly handles both large numbers (like Unix timestamps) and small numbers (like costs) by adjusting decimal precision based on the number's scale. Edge cases are well handled:
- Zero returns 0.0 directly
- NaN/Inf converted to None
- Small numbers (< 1) get extra decimal places, but this doesn't violate the 15 significant figure limit since leading zeros aren't significant
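The magnitude-based rounding described in this comment can be sketched as a standalone function. The formula (`safe_decimals = max(0, 15 - magnitude)`) is quoted from the review; the function name and surrounding scaffolding are hypothetical:

```python
import math


def round_to_sig_figs(x: float, sig: int = 15) -> float:
    """Limit total significant figures via magnitude-based decimal rounding."""
    if x == 0.0:
        return 0.0  # zero is returned directly
    # Number of digits before the decimal point (negative for values < 1)
    magnitude = math.floor(math.log10(abs(x))) + 1
    safe_decimals = max(0, sig - magnitude)
    return float(f"{x:.{safe_decimals}f}")


# Large number: 16 significant figures trimmed to 15
print(round_to_sig_figs(1760509016.282637))  # 1760509016.28264
# Small number: leading zeros are not significant, value preserved
print(round_to_sig_figs(0.001228))  # 0.001228
```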
370-373: LGTM! Correct application of sanitization after parsing.

The sanitization is correctly applied after `json.loads()` without problematic conditional checks. The parsed value is sanitized and assigned directly.

Minor note: The comment "json.loads() creates new float objects that may have binary precision problems" is slightly imprecise: it's the original JSON string's precision that may exceed binary representation limits, not `json.loads()` itself causing the problem. However, this is a documentation nitpick and doesn't affect correctness.
unstract/connectors/src/unstract/connectors/databases/utils.py (1)
11-51: LGTM! Clean shared utility with clear scope.

The implementation correctly provides minimal NaN/Inf sanitization for all databases, with clear documentation that precision handling should be implemented in specific connectors. The recursive approach properly handles nested structures.

The docstring examples effectively demonstrate the utility's limited scope, showing that regular floats (like 1760509016.282637) are preserved unchanged and delegating precision handling to database-specific implementations like BigQuery's `_sanitize_for_bigquery()`.

workers/shared/infrastructure/database/utils.py (2)
17-17: LGTM! Correct import of shared sanitizer.

The import properly brings in the centralized `sanitize_floats_for_database` utility, replacing the internal implementation that was removed.
297-297: LGTM! Consistent usage of shared sanitizer across all call sites.

All four call sites correctly use `sanitize_floats_for_database()` to handle NaN/Inf values before database operations. The refactoring properly centralizes the sanitization logic without changing behavior.

Also applies to: 365-365, 377-377, 384-384
- Changed 'if sanitized_value' to 'if sanitized_value is not None'
- Prevents empty dicts {}, empty lists [], and zero values from becoming None
- Addresses CodeRabbit AI feedback on PR #1593
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <[email protected]>
…stack/unstract into fix/UN-2882-bigquery-float-precision
Actionable comments posted: 0
🧹 Nitpick comments (1)
unstract/connectors/src/unstract/connectors/databases/bigquery/bigquery.py (1)
89-102: Consider refining magnitude calculation for numbers < 1.

The magnitude-based calculation works correctly for large numbers (timestamps), but for small numbers (< 1) the formula `safe_decimals = max(0, 15 - magnitude)` allows more than 15 decimal places since magnitude is negative. For example, 0.001228 has magnitude = -2, yielding safe_decimals = 17.

While the f-string format `.{N}f` still rounds naturally and this likely isn't a practical issue (small numbers rarely exceed float precision), the implementation doesn't strictly enforce the "15 significant figures" claim for all inputs. Consider using a library like `decimal` for precise significant figure handling if strict enforcement is needed:

```python
from decimal import Decimal, ROUND_HALF_UP

# Example: strictly limit to 15 significant figures
sig_figs = 15
d = Decimal(str(data))
# Calculate the exponent and round to sig_figs
exponent = d.adjusted()  # Position of most significant digit
quantize_exp = Decimal(10) ** (exponent - sig_figs + 1)
return float(d.quantize(quantize_exp, rounding=ROUND_HALF_UP))
```

That said, given that the primary use case (large timestamps) is handled correctly and small numbers with many significant figures are rare, the current implementation is acceptable for this defensive fix.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to Reviews > Disable Cache setting
Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting
📒 Files selected for processing (1)
- unstract/connectors/src/unstract/connectors/databases/bigquery/bigquery.py (4 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: build
🔇 Additional comments (5)
unstract/connectors/src/unstract/connectors/databases/bigquery/bigquery.py (5)
66-86: LGTM: Clear documentation and purpose.

The method signature and docstring clearly explain the BigQuery PARSE_JSON compatibility requirements and provide helpful examples.
104-109: LGTM: Recursive handling is correct.

The recursive processing of dicts and lists ensures nested structures are thoroughly sanitized, and the default passthrough preserves non-float types appropriately.
250-256: LGTM: Truthiness issue resolved.

The code now correctly uses `is not None` instead of a truthy check, preserving empty dicts, empty lists, and zeros as requested in previous reviews. The sanitization is properly applied before JSON serialization.
267-273: LGTM: Truthiness issue resolved.

Identical to lines 250-256, the code now correctly uses `is not None` to avoid dropping valid empty values and zeros. The sanitization is properly applied before JSON serialization.
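The distinction this comment calls out can be shown in a few lines. This is a hypothetical sketch of the corrected branch, not the connector's actual code: with a truthy check (`if sanitized_value:`), empty dicts, empty lists, and zeros would all fall through to the NULL path.

```python
import json


def serialize_for_query(sanitized_value):
    """Sketch of the corrected branch: only a real None becomes SQL NULL."""
    if sanitized_value is not None:  # a truthy check would drop {}, [], 0, 0.0
        return json.dumps(sanitized_value)
    return "NULL"


print(serialize_for_query({}))    # {} is preserved, not turned into NULL
print(serialize_for_query(0.0))   # 0.0 is preserved
print(serialize_for_query(None))  # only None maps to NULL
```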
374-377: LGTM: Sanitization after parsing is appropriate.

Applying sanitization after `json.loads()` is a good defensive measure, as parsed float objects may have binary precision issues. This ensures consistent handling across all three code paths where floats enter the BigQuery pipeline.
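The parse-then-sanitize path can be illustrated as below. `json.loads()` faithfully parses a 16-digit literal into the nearest double, which BigQuery's PARSE_JSON may later reject; rounding after parsing trims it into the safe zone. The `_round15` helper is a hypothetical stand-in for the connector's magnitude-based sanitizer:

```python
import json
import math


def _round15(x: float) -> float:
    """Hypothetical scalar sanitizer: cap at 15 significant figures."""
    magnitude = math.floor(math.log10(abs(x))) + 1
    return float(f"{x:.{max(0, 15 - magnitude)}f}")


# A 16-significant-figure timestamp survives json.loads() but would be
# unsafe to round-trip; sanitize after parsing.
raw = json.loads('{"ts": 1760509016.282637}')
safe = {k: _round15(v) if isinstance(v, float) else v for k, v in raw.items()}
```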
Note: Unit test generation is an Early Access feature. Expect some limitations and changes as we gather feedback and continue to improve it.

Generating unit tests... This may take up to 20 minutes.
✅ UTG Post-Process Complete

No new issues were detected in the generated code and all check runs have completed. The unit test generation process has completed successfully.
Creating a PR to put the unit tests in... The changes have been created in this pull request: View PR



What
Why
- 1760509016.282637) have 16 significant figures, exceeding the IEEE 754 double precision safe zone (15 digits)
- `Invalid input: Input number: 1760509016.282637 cannot round-trip through string representation`

How
- `sanitize_floats_for_database()` utility in `unstract/connectors/databases/utils.py` that handles NaN/Inf for all databases
- `_sanitize_for_bigquery()` method that limits total significant figures to 15 using a magnitude-based calculation: `safe_decimals = max(0, 15 - magnitude)`
- `_sanitize_floats_for_database()` method from `workers/shared/infrastructure/database/utils.py` and updated 4 call sites to use the common utility
- `json.dumps()` to ensure clean binary representation for BigQuery's PARSE_JSON

Can this PR break any existing features? If yes, please list possible items. If no, please explain why.
No breaking changes expected:
- 1760509016.282637 → 1760509016.28264), but maintains sufficient accuracy for millisecond-level timing
- 0.001228 remains 0.001228)

Database Migrations
Env Config
Relevant Docs
Related Issues or PRs
Dependencies Versions
Notes on Testing
Screenshots
N/A - Backend fix with no UI changes
Checklist
I have read and understood the Contribution Guidelines.
🤖 Generated with Claude Code