chore: backfill commit data to storage #78

Merged
merged 2 commits into from
Sep 8, 2023

giovanni-guidini
Contributor

Now that we are writing data to GCS with an acceptable degree of success, we want to backfill data from existing commits.
For that purpose we are introducing a new task.
These changes implement that task.

closes codecov/engineering-team#189

Legal Boilerplate

Look, I get it. The entity doing business as "Sentry" was incorporated in the State of Delaware in 2015 as Functional Software, Inc. In 2022 this entity acquired Codecov and as result Sentry is going to need some rights from me in order to utilize my contributions in this PR. So here's the deal: I retain all rights, title and interest in and to my contributions, and by keeping this boilerplate intact I confirm that Sentry can use, modify, copy, and redistribute my contributions, under Sentry's choice of terms.

codecov bot commented Aug 29, 2023

Codecov Report

Merging #78 (0fdcf04) into main (69bdfd1) will increase coverage by 0.00%.
The diff coverage is 99.15%.

@@           Coverage Diff            @@
##             main      #78    +/-   ##
========================================
  Coverage   98.47%   98.48%            
========================================
  Files         362      364     +2     
  Lines       26560    26797   +237     
========================================
+ Hits        26156    26392   +236     
- Misses        404      405     +1     
Flag                       Coverage Δ
integration                98.45% <99.15%> (+0.01%) ⬆️
latest-uploader-overall    98.45% <99.15%> (+0.01%) ⬆️
onlysomelabels             98.48% <99.15%> (+<0.01%) ⬆️
unit                       98.45% <99.15%> (+0.01%) ⬆️

Flags with carried forward coverage won't be shown.

Components      Coverage Δ
NonTestCode     97.14% <97.33%> (+0.01%) ⬆️
OutsideTasks    98.25% <100.00%> (+<0.01%) ⬆️

Files Changed                                            Coverage Δ
tasks/backfill_commit_data_to_storage.py                 97.29% <97.29%> (ø)
database/tests/factories/core.py                         100.00% <100.00%> (ø)
tasks/__init__.py                                        100.00% <100.00%> (ø)
.../unit/test_backfill_commit_data_to_storage_task.py    100.00% <100.00%> (ø)

... and 1 file with indirect coverage changes

This change has been scanned for critical changes.

)
return {"success": False, "errors": [BackfillError.missing_data.value]}

def handle_all_report_rows(
Contributor

I believe there's already a method to do all this in the report service:

async def initialize_and_save_report(

Are we able to use that code? Or is there a subtle difference here?

Contributor Author

The subtle difference I found, which led me to nearly copy the code from there, is that initialize_and_save_report calls save_full_report.

And that actually creates an Upload instance. We don't have one that isn't in the database, so I thought that was a no-no.

So the version here just calls save_report.
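
To make the distinction concrete, here is a minimal sketch using a stand-in class. Only the method names save_report and save_full_report come from this thread; FakeReportService, its in-memory lists, the method bodies, and backfill_commit are hypothetical illustrations and do not mirror the real report service.

```python
from dataclasses import dataclass, field


@dataclass
class FakeReportService:
    """Hypothetical stand-in; NOT the real report service."""

    saved_reports: list = field(default_factory=list)
    uploads: list = field(default_factory=list)

    def save_report(self, commit_id: str, report: dict) -> None:
        # Persists only the report itself.
        self.saved_reports.append((commit_id, report))

    def save_full_report(self, commit_id: str, report: dict) -> None:
        # Persists the report and, as a side effect, records a new upload.
        self.save_report(commit_id, report)
        self.uploads.append({"commit_id": commit_id})


def backfill_commit(service: FakeReportService, commit_id: str, report: dict) -> None:
    # Assumption for this sketch: a backfill only relocates existing data,
    # so it takes the path that creates no upload rows.
    service.save_report(commit_id, report)
```

With this shape, calling backfill_commit leaves uploads untouched, while save_full_report would add one entry per call.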

Contributor

@scott-codecov scott-codecov left a comment

One little question, but otherwise looks good.

db_session.add(report_details)
db_session.flush()

repo_yaml = get_repo_yaml(commit.repository)
Contributor

Should this be get_current_yaml instead so that it takes into account the commit's YAML?

Contributor Author

I don't think it matters for the operations we are doing, and using the repo_yaml saves us a request to the git provider.

However, from a correctness point of view, given that we will be working mostly with old commits, I guess the most accurate option would be to use only the commit YAML (more recent changes are probably merged into the repo YAML already and might have been made in the owner YAML).

What do you think is best?

Contributor

I think you're right that it doesn't matter. Looks like there are 2 methods being called here:

  • report_service.get_existing_report_for_commit_from_legacy_data
  • report_service.save_report

I just looked through both and AFAICT neither relies on self.current_yaml in any way. You could probably even pass ReportService({}) here.
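
As a rough illustration of why the YAML choice should not change the outcome, here is a small sketch. YamlAgnosticService and its method body are hypothetical stand-ins, not the real ReportService; the only property it models is the one claimed above, that the method never reads self.current_yaml.

```python
class YamlAgnosticService:
    """Hypothetical stand-in; NOT the real ReportService."""

    def __init__(self, current_yaml: dict):
        # Stored for parity with the discussion, but never read below.
        self.current_yaml = current_yaml

    def save_report(self, commit_id: str, report: dict) -> dict:
        # Depends only on its arguments, not on self.current_yaml.
        return {"commit_id": commit_id, "files": sorted(report)}


# Hypothetical YAML contents, used only to show they have no effect.
with_yaml = YamlAgnosticService({"coverage": {"round": "down"}})
without_yaml = YamlAgnosticService({})

report = {"b.py": 1, "a.py": 2}
same = with_yaml.save_report("abc", report) == without_yaml.save_report("abc", report)
```

If a method did start reading self.current_yaml later, a comparison like this would be a cheap way to notice that the empty-YAML shortcut is no longer safe.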

@giovanni-guidini giovanni-guidini merged commit 3056f09 into main Sep 8, 2023
12 checks passed
@giovanni-guidini giovanni-guidini deleted the gio/backfill-commit-data-task branch September 8, 2023 20:29
Development

Successfully merging this pull request may close these issues.

Create GCS Backfill task
2 participants