UN-2882 [FIX] Fix BigQuery float precision issue in PARSE_JSON for metadata serialization #1593
Conversation
…tadata serialization

- Added BigQuery-specific float sanitization with IEEE 754 double precision safe zone
- Consolidated duplicate float sanitization logic into shared utilities
- Fixed insertion errors caused by floats with >15 significant figures

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
Summary by CodeRabbit
Walkthrough

Adds a shared float-sanitization utility and integrates it into BigQuery JSON/STRING handling and worker database utilities; BigQuery now sanitizes nested dicts/lists and floats (NaN/±Inf → None, large floats rounded) before JSON serialization and when parsing SQL values.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    participant Caller
    participant BigQuery as BigQuery.execute_query
    participant Sanit as _sanitize_for_bigquery / sanitize_floats_for_database
    participant JSON as json.dumps
    participant DB as BigQuery Service
    Caller->>BigQuery: execute_query(values including dict/list/JSON)
    rect rgba(220,235,255,0.4)
        BigQuery->>Sanit: sanitize(value)\n- recurse dict/list\n- NaN/±Inf -> None\n- round to 15 sig figs
        Sanit-->>BigQuery: sanitized_value
        alt sanitized_value is not None
            BigQuery->>JSON: json.dumps(sanitized_value)
            JSON-->>BigQuery: json_string
            BigQuery->>DB: send query with json_string
        else sanitized_value is None
            BigQuery->>DB: send query with NULL representation
        end
    end
    DB-->>BigQuery: query result
    BigQuery-->>Caller: result
```
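The sanitization step in the flow above can be sketched as follows. This is a minimal illustration based on the walkthrough's description (recurse into dicts/lists, map NaN/±Inf to None, pass other values through); the function name mirrors the shared utility, but the body is an assumption, not the project's actual implementation:

```python
import math


def sanitize_floats_for_database(value):
    """Sketch: replace NaN/±Inf floats with None, recursing into containers.

    Precision limiting is left to database-specific connectors."""
    if isinstance(value, float):
        return None if (math.isnan(value) or math.isinf(value)) else value
    if isinstance(value, dict):
        return {k: sanitize_floats_for_database(v) for k, v in value.items()}
    if isinstance(value, list):
        return [sanitize_floats_for_database(v) for v in value]
    return value


clean = sanitize_floats_for_database(
    {"ok": 1760509016.282637, "bad": float("inf"), "nested": [float("nan"), 2.5]}
)
# Regular floats are preserved unchanged; only NaN/±Inf become None.
```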
Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20–30 minutes

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✨ Finishing touches
🧪 Generate unit tests (beta): ✅ Unit Test PR creation complete.
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Actionable comments posted: 2
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to Reviews > Disable Cache setting
Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting
📒 Files selected for processing (3)
- unstract/connectors/src/unstract/connectors/databases/bigquery/bigquery.py (4 hunks)
- unstract/connectors/src/unstract/connectors/databases/utils.py (1 hunks)
- workers/shared/infrastructure/database/utils.py (4 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: build
🔇 Additional comments (5)
unstract/connectors/src/unstract/connectors/databases/bigquery/bigquery.py (2)
66-109: LGTM! Solid implementation of magnitude-based precision limiting.

The algorithm correctly limits total significant figures to 15 for IEEE 754 compatibility. The magnitude calculation `floor(log10(abs(data))) + 1` properly handles both large numbers (like Unix timestamps) and small numbers (like costs) by adjusting decimal precision based on the number's scale. Edge cases are well handled:
- Zero returns 0.0 directly
- NaN/Inf converted to None
- Small numbers (< 1) get extra decimal places, but this doesn't violate the 15 significant figure limit since leading zeros aren't significant
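The magnitude-based rounding described in this comment can be sketched as a standalone function. The formula (`safe_decimals = max(0, 15 - magnitude)`) is quoted from the review; the function name and surrounding scaffolding are hypothetical:

```python
import math


def round_to_sig_figs(x: float, sig: int = 15) -> float:
    """Limit total significant figures via magnitude-based decimal rounding."""
    if x == 0.0:
        return 0.0  # zero is returned directly
    # Number of digits before the decimal point (negative for values < 1)
    magnitude = math.floor(math.log10(abs(x))) + 1
    safe_decimals = max(0, sig - magnitude)
    return float(f"{x:.{safe_decimals}f}")


# Large number: 16 significant figures trimmed to 15
print(round_to_sig_figs(1760509016.282637))  # 1760509016.28264
# Small number: leading zeros are not significant, value preserved
print(round_to_sig_figs(0.001228))  # 0.001228
```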
370-373: LGTM! Correct application of sanitization after parsing.

The sanitization is correctly applied after `json.loads()` without problematic conditional checks. The parsed value is sanitized and assigned directly.

Minor note: The comment "json.loads() creates new float objects that may have binary precision problems" is slightly imprecise: it's the original JSON string's precision that may exceed binary representation limits, not `json.loads()` itself causing the problem. However, this is a documentation nitpick and doesn't affect correctness.
unstract/connectors/src/unstract/connectors/databases/utils.py (1)
11-51: LGTM! Clean shared utility with clear scope.

The implementation correctly provides minimal NaN/Inf sanitization for all databases, with clear documentation that precision handling should be implemented in specific connectors. The recursive approach properly handles nested structures.

The docstring examples effectively demonstrate the utility's limited scope, showing that regular floats (like 1760509016.282637) are preserved unchanged and delegating precision handling to database-specific implementations like BigQuery's `_sanitize_for_bigquery()`.

workers/shared/infrastructure/database/utils.py (2)
17-17: LGTM! Correct import of shared sanitizer.

The import properly brings in the centralized `sanitize_floats_for_database` utility, replacing the internal implementation that was removed.
297-297: LGTM! Consistent usage of shared sanitizer across all call sites.

All four call sites correctly use `sanitize_floats_for_database()` to handle NaN/Inf values before database operations. The refactoring properly centralizes the sanitization logic without changing behavior.

Also applies to: 365-365, 377-377, 384-384
- Changed 'if sanitized_value' to 'if sanitized_value is not None'
- Prevents empty dicts {}, empty lists [], and zero values from becoming None
- Addresses CodeRabbit AI feedback on PR #1593
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <[email protected]>
…stack/unstract into fix/UN-2882-bigquery-float-precision
Actionable comments posted: 0
🧹 Nitpick comments (1)
unstract/connectors/src/unstract/connectors/databases/bigquery/bigquery.py (1)
89-102: Consider refining magnitude calculation for numbers < 1.

The magnitude-based calculation works correctly for large numbers (timestamps), but for small numbers (< 1) the formula `safe_decimals = max(0, 15 - magnitude)` allows more than 15 decimal places since magnitude is negative. For example, 0.001228 has magnitude = -2, yielding safe_decimals = 17.

While the f-string format `.{N}f` still rounds naturally and this likely isn't a practical issue (small numbers rarely exceed float precision), the implementation doesn't strictly enforce the "15 significant figures" claim for all inputs. Consider using a library like `decimal` for precise significant figure handling if strict enforcement is needed:

```python
from decimal import Decimal, ROUND_HALF_UP

# Example: strictly limit to 15 significant figures
sig_figs = 15
d = Decimal(str(data))
# Calculate the exponent and round to sig_figs
exponent = d.adjusted()  # Position of most significant digit
quantize_exp = Decimal(10) ** (exponent - sig_figs + 1)
return float(d.quantize(quantize_exp, rounding=ROUND_HALF_UP))
```

That said, given that the primary use case (large timestamps) is handled correctly and small numbers with many significant figures are rare, the current implementation is acceptable for this defensive fix.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to Reviews > Disable Cache setting
Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting
📒 Files selected for processing (1)
- unstract/connectors/src/unstract/connectors/databases/bigquery/bigquery.py (4 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: build
🔇 Additional comments (5)
unstract/connectors/src/unstract/connectors/databases/bigquery/bigquery.py (5)
66-86: LGTM: Clear documentation and purpose.

The method signature and docstring clearly explain the BigQuery PARSE_JSON compatibility requirements and provide helpful examples.
104-109: LGTM: Recursive handling is correct.

The recursive processing of dicts and lists ensures nested structures are thoroughly sanitized, and the default passthrough preserves non-float types appropriately.
250-256: LGTM: Truthiness issue resolved.

The code now correctly uses `is not None` instead of a truthy check, preserving empty dicts, empty lists, and zeros as requested in previous reviews. The sanitization is properly applied before JSON serialization.
267-273: LGTM: Truthiness issue resolved.

Identical to lines 250-256, the code now correctly uses `is not None` to avoid dropping valid empty values and zeros. The sanitization is properly applied before JSON serialization.
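The distinction this comment calls out can be shown in a few lines. This is a hypothetical sketch of the corrected branch, not the connector's actual code: with a truthy check (`if sanitized_value:`), empty dicts, empty lists, and zeros would all fall through to the NULL path.

```python
import json


def serialize_for_query(sanitized_value):
    """Sketch of the corrected branch: only a real None becomes SQL NULL."""
    if sanitized_value is not None:  # a truthy check would drop {}, [], 0, 0.0
        return json.dumps(sanitized_value)
    return "NULL"


print(serialize_for_query({}))    # {} is preserved, not turned into NULL
print(serialize_for_query(0.0))   # 0.0 is preserved
print(serialize_for_query(None))  # only None maps to NULL
```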
374-377: LGTM: Sanitization after parsing is appropriate.

Applying sanitization after `json.loads()` is a good defensive measure, as parsed float objects may have binary precision issues. This ensures consistent handling across all three code paths where floats enter the BigQuery pipeline.
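The parse-then-sanitize path can be illustrated as below. `json.loads()` faithfully parses a 16-digit literal into the nearest double, which BigQuery's PARSE_JSON may later reject; rounding after parsing trims it into the safe zone. The `_round15` helper is a hypothetical stand-in for the connector's magnitude-based sanitizer:

```python
import json
import math


def _round15(x: float) -> float:
    """Hypothetical scalar sanitizer: cap at 15 significant figures."""
    magnitude = math.floor(math.log10(abs(x))) + 1
    return float(f"{x:.{max(0, 15 - magnitude)}f}")


# A 16-significant-figure timestamp survives json.loads() but would be
# unsafe to round-trip; sanitize after parsing.
raw = json.loads('{"ts": 1760509016.282637}')
safe = {k: _round15(v) if isinstance(v, float) else v for k, v in raw.items()}
```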
Note: Unit test generation is an Early Access feature. Expect some limitations and changes as we gather feedback and continue to improve it.

Generating unit tests... This may take up to 20 minutes.
✅ UTG Post-Process Complete

No new issues were detected in the generated code and all check runs have completed. The unit test generation process has completed successfully.
Creating a PR to put the unit tests in... The changes have been created in this pull request: View PR



What
Why
- 1760509016.282637) have 16 significant figures, exceeding the IEEE 754 double precision safe zone (15 digits)
- `Invalid input: Input number: 1760509016.282637 cannot round-trip through string representation`

How
- `sanitize_floats_for_database()` utility in `unstract/connectors/databases/utils.py` that handles NaN/Inf for all databases
- `_sanitize_for_bigquery()` method that limits total significant figures to 15 using a magnitude-based calculation: `safe_decimals = max(0, 15 - magnitude)`
- `_sanitize_floats_for_database()` method from `workers/shared/infrastructure/database/utils.py` and updated 4 call sites to use the common utility
- `json.dumps()` to ensure clean binary representation for BigQuery's PARSE_JSON

Can this PR break any existing features? If yes, please list possible items. If no, please explain why.
No breaking changes expected:
- 1760509016.282637 → 1760509016.28264), but maintains sufficient accuracy for millisecond-level timing
- 0.001228 remains 0.001228)

Database Migrations
Env Config
Relevant Docs
Related Issues or PRs
Dependencies Versions
Notes on Testing
Screenshots
N/A - Backend fix with no UI changes
Checklist
I have read and understood the Contribution Guidelines.
🤖 Generated with Claude Code