[ETL-245] Add script for resolving ETL-245 #90

Merged
merged 1 commit into from
Aug 30, 2022

Conversation

philerooski
Contributor

No description provided.

@philerooski philerooski requested a review from a team as a code owner July 30, 2022 23:15
@philerooski philerooski temporarily deployed to develop July 30, 2022 23:17 Inactive
except FileNotFoundError:
study_counts[study]["parquet"].append((dataset, 0))
study_counts[study]["json"] = sorted(study_counts[study]["json"])
study_counts[study]["parquet"] = sorted(study_counts[study]["parquet"])
Member

What is the purpose of the comparison? Is this just to make sure all the file counts match the unique record counts?

Contributor Author

Each record (zip file) from Bridge contains a number of JSON files. The only required JSON file in each record is metadata.json (which conforms to the ArchiveMetadata schema), so comparing the file count of the ArchiveMetadata JSON dataset to the record count in the ArchiveMetadata Parquet dataset tells us whether any records failed to process during the export to Parquet. I think every record from MTB thus far includes a taskData.json (sharedSchema) as well, so we can also compare those counts to the ArchiveMetadata counts. The other datasets are less useful for QA unless you count only the subset of records that contain AudioLevelRecord, MotionRecord, or WeatherResult data, but they are still interesting to count.
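
For illustration, the comparison described above could look roughly like the sketch below. It assumes `study_counts` has the shape seen in the diff (per study, a `"json"` and a `"parquet"` list of `(dataset, count)` tuples). The function name `find_count_mismatches`, the study name, and the counts are hypothetical and are not taken from the script in this PR.

```python
def find_count_mismatches(study_counts):
    """Return {study: [(dataset, json_count, parquet_count), ...]}
    for every dataset whose JSON file count differs from its
    Parquet record count."""
    mismatches = {}
    for study, counts in study_counts.items():
        # Each list holds (dataset, count) tuples, so dict() gives a lookup table.
        json_counts = dict(counts["json"])
        parquet_counts = dict(counts["parquet"])
        # Check every dataset seen on either side; a missing dataset counts as 0.
        for dataset in sorted(set(json_counts) | set(parquet_counts)):
            n_json = json_counts.get(dataset, 0)
            n_parquet = parquet_counts.get(dataset, 0)
            if n_json != n_parquet:
                mismatches.setdefault(study, []).append(
                    (dataset, n_json, n_parquet)
                )
    return mismatches


# Hypothetical counts, for illustration only.
example = {
    "study-a": {
        "json": [("ArchiveMetadata", 10), ("sharedSchema", 10)],
        "parquet": [("ArchiveMetadata", 9), ("sharedSchema", 10)],
    },
}
print(find_count_mismatches(example))
# → {'study-a': [('ArchiveMetadata', 10, 9)]}
```

A mismatch on ArchiveMetadata would suggest that a record failed to process during the export, since every record is expected to contain metadata.json.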

@philerooski philerooski merged commit 6d19983 into main Aug 30, 2022
@philerooski philerooski deleted the etl-245 branch August 30, 2022 21:02