Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bulk Load CDK: prefix search nonambiguous; filename uses wall time #48623

Conversation

johnny-schmidt
Copy link
Contributor

What

  • fix for truncate sync:
    • when we search metadata for old files, we
      • start with a prefix that is the longest possible constant path across streams (take the path, sub in stream name and namespace, then take everything up to the first unrealized variable)
      • match against a matcher that matches the path/file pattern exactly
    • bug: we were matching against everything except UUID (doesn't work; because even tho write time is constant across the sync, it isn't across sync(s)
    • fix: only match against the stream name & namespace

Also:

  • change the filename epoch to use wall time instead of load time (the YEAR/MONTH/DAY/etc in the path use load time, but the filename in the old code uses wall time)
  • change the integration test data dumper so it uses the same strategy as metadata; this way you can actually have full paths in the test configs; as soon as this is merged, I will update the secrets so the ITs are more robust

@johnny-schmidt johnny-schmidt requested a review from a team as a code owner November 22, 2024 21:51
Copy link

vercel bot commented Nov 22, 2024

The latest updates on your projects. Learn more about Vercel for Git ↗︎

1 Skipped Deployment
Name Status Preview Comments Updated (UTC)
airbyte-docs ⬜️ Ignored (Inspect) Visit Preview Nov 23, 2024 2:16am

@octavia-squidington-iii octavia-squidington-iii added the CDK Connector Development Kit label Nov 22, 2024
@johnny-schmidt
Copy link
Contributor Author

@davinchia added you here since you raised this as part of your IAM work

} else {
pathConstant
pathFactory.getFinalDirectory(stream, streamConstant = true).toString().takeWhile {
it != '$'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the significance of $ here? Is there a variable interpolation scheme going on?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. Basically, give me a directory with only the values constant across syncs realized, then take everything up to the start of the first variable.

FileVariable("timestamp", """\d+""") {
// NOTE: We use a constant time for the path but wall time for the files
Instant.now().toEpochMilli().toString()
},
Copy link
Contributor

@tryangul tryangul Nov 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Is there a way to use the DI'd clock here? Also, this being in a companion object, when does it actually get called?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it's init, maybe we can initialize it in an init block and reference it

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The DI'd clock is "sync time". It's used once and the same value is constant across all uploads, and is used for all the time-based variables except this one, which is supposed to be the actual load time.

val prefix =
pathFactory.getFinalDirectory(stream, streamConstant = true).toString().takeWhile {
it != '$'
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: should we share this code since it's coupled?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, it should really go into the path factory, and be like "longestStreamConstantPrefix" or something

Copy link
Contributor

@tryangul tryangul left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left some questions and nits, but not blocking

@johnny-schmidt johnny-schmidt force-pushed the jschmidt/s3-v2/path-matcher-for-metadata-fix branch from 937566f to deafd18 Compare November 23, 2024 02:16
@johnny-schmidt johnny-schmidt merged commit b65a21f into jschmidt/s3v2/issue-10732/staging-tests Nov 23, 2024
19 of 21 checks passed
@johnny-schmidt johnny-schmidt deleted the jschmidt/s3-v2/path-matcher-for-metadata-fix branch November 23, 2024 02:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CDK Connector Development Kit
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants