Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(airbyte-cdk): replace pydantic BaseModel with dataclasses in protocol #44026

Closed
wants to merge 9 commits into from

Conversation

artem1205
Copy link
Collaborator

@artem1205 artem1205 commented Aug 14, 2024

What

Resolving https://github.com/airbytehq/airbyte-internal-issues/issues/8945 by use of dataclasses

How

  1. Generate models from airbyte-protocol repo using dataclasses.dataclass as output
  datamodel-codegen \
    --input "$ROOT_DIR/$YAML_DIR/$filename_wo_ext.yaml" \
    --output "$ROOT_DIR/$OUTPUT_DIR/$filename_wo_ext.py" \
    --output-model-type dataclasses.dataclass \
    --target-python-version 3.10 \
    --use-title-as-name \
    --disable-timestamp
  1. Copy generated models to airbyte-cdk/python/airbyte_cdk/models/airbyte_protocol.py
  2. Ref to use dataclasses:
  • use orjson.dumps().decode() for serialization

Review guide

  1. airbyte-cdk/python/airbyte_cdk/models/airbyte_protocol.py
  2. airbyte-cdk/python/airbyte_cdk/models/well_known_types.py

Note

In this PR pydantic was completely replaced with dataclasses, but we can (should??) consider replacing only part of protocol entities (heavily used ones).

User Impact

-- changes in protocol models only

known problem

Caution

orjson serializes all fields (even None)

Tests

Test with docker run; 10_000_000 records; stream: dummy_records

/usr/bin/time -h docker run --rm -v $(pwd)/secrets:/secrets -v $(pwd)/integration_tests:/integration_tests airbyte/source-hardcoded-records:dev read --config /secrets/config.json --catalog /integration_tests/configured_catalog.json > /tmp/test.txt

10,000,000 records are about 1,770 MB.

TIME real user sys Mb/s
Before (0.0.6) 2m37.51s 3.83s 14.85s 11.27
After (dev) 1m30.22s (1.56x) 2.71s 7.53s 19.66 (1.74x)

Test with local platform (deployed with abctl)

24,000,000 records are about 4,270 MB.

Version Platform sync time E2E Mb/s
Before (0.0.6) 10.43s 6.64
After (dev) 5.52s 12.13 (1.82x)

Can this PR be safely reverted and rolled back?

  • YES 💚
  • NO ❌

Signed-off-by: Artem Inzhyyants <[email protected]>
Copy link

vercel bot commented Aug 14, 2024

The latest updates on your projects. Learn more about Vercel for Git ↗︎

1 Skipped Deployment
Name Status Preview Comments Updated (UTC)
airbyte-docs ⬜️ Ignored (Inspect) Visit Preview Aug 16, 2024 11:25am

@octavia-squidington-iii octavia-squidington-iii added the CDK Connector Development Kit label Aug 14, 2024
@artem1205 artem1205 changed the title Airbyte CDK: add dataclass-wizard Airbyte CDK: replace pydantic with dataclasses in protocol Aug 14, 2024
Signed-off-by: Artem Inzhyyants <[email protected]>
Signed-off-by: Artem Inzhyyants <[email protected]>
@artem1205 artem1205 self-assigned this Aug 14, 2024
@artem1205 artem1205 changed the title Airbyte CDK: replace pydantic with dataclasses in protocol feat(airbyte-cdk): replace pydantic BaseModel with dataclasses in protocol Aug 14, 2024
…dk-protocol-dataclasses

# Conflicts:
#	airbyte-cdk/python/poetry.lock
Signed-off-by: Artem Inzhyyants <[email protected]>
Signed-off-by: Artem Inzhyyants <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CDK Connector Development Kit
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants