Remove DataFiles table from TransformationDB #7752

Open
chaen opened this issue Aug 13, 2024 · 0 comments

chaen commented Aug 13, 2024

Looking into the performance of the TransformationSystem, and its DB in particular, the hottest spot is the DataFiles table.
The aim of this table is to deduplicate LFNs in the DB: if multiple transformations are applied to the same file, the LFN is stored only once in DataFiles, and TransformationFiles just refers to it via a foreign key.
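
A minimal sketch of the deduplicating layout described above, using an in-memory SQLite database as a stand-in for the real backend. Only `LFN` and `FileID` come from the issue text; the `TransformationID` column, the constraints, and the `add_files` helper are assumptions made for illustration, not the actual TransformationDB schema or code.

```python
import sqlite3

# Simplified, assumed schema: only LFN and FileID appear in the issue text;
# TransformationID and the exact constraints are illustrative.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE DataFiles (
        FileID INTEGER PRIMARY KEY AUTOINCREMENT,
        LFN    TEXT NOT NULL UNIQUE
    );
    CREATE TABLE TransformationFiles (
        TransformationID INTEGER NOT NULL,
        FileID           INTEGER NOT NULL REFERENCES DataFiles(FileID),
        PRIMARY KEY (TransformationID, FileID)
    );
""")


def add_files(conn, transformation_id, lfns):
    """Hypothetical helper: attach LFNs to a transformation, storing each LFN once."""
    placeholders = ",".join("?" * len(lfns))
    # Step 1: look up which LFNs are already known.
    known = dict(conn.execute(
        f"SELECT LFN, FileID FROM DataFiles WHERE LFN IN ({placeholders})", lfns
    ))
    # Step 2: insert only the LFNs that are missing.
    for lfn in lfns:
        if lfn not in known:
            cur = conn.execute("INSERT INTO DataFiles (LFN) VALUES (?)", (lfn,))
            known[lfn] = cur.lastrowid
    # Step 3: reference the files from TransformationFiles by FileID.
    conn.executemany(
        "INSERT INTO TransformationFiles (TransformationID, FileID) VALUES (?, ?)",
        [(transformation_id, known[lfn]) for lfn in lfns],
    )


add_files(conn, 1, ["/lhcb/a", "/lhcb/b"])
add_files(conn, 2, ["/lhcb/b", "/lhcb/c"])
n_lfns = conn.execute("SELECT COUNT(*) FROM DataFiles").fetchone()[0]
n_links = conn.execute("SELECT COUNT(*) FROM TransformationFiles").fetchone()[0]
```

Here `/lhcb/b` is used by two transformations but stored once in DataFiles, at the cost of a lookup against DataFiles on every insert.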

When many transformations are running, the DataFiles table can grow large (currently 80M rows in LHCb). The queries we run against it look like this:

SELECT LFN, FileID FROM DataFiles WHERE LFN IN ('a', 'b', 'c')

In our case they can take up to half an hour.
In effect, the DataFiles table:

  • is inefficient to query (which we do very often, even to insert new files)
  • is subject to race conditions (the code tries to protect against them in various places, but they remain possible)
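
To illustrate the second point, here is a small sketch of how a look-up-then-insert pattern can race. It replays the interleaving of two hypothetical clients in a single thread against an in-memory SQLite table. The UNIQUE constraint on `LFN` and the `lookup` helper are assumptions for the example; the real schema and code paths may differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumption: LFN carries a UNIQUE constraint; the real schema may differ.
conn.execute(
    "CREATE TABLE DataFiles (FileID INTEGER PRIMARY KEY, LFN TEXT NOT NULL UNIQUE)"
)


def lookup(conn, lfn):
    """Hypothetical helper: return the FileID for an LFN, or None if unknown."""
    row = conn.execute(
        "SELECT FileID FROM DataFiles WHERE LFN = ?", (lfn,)
    ).fetchone()
    return row[0] if row else None


# Both clients check for the LFN before either of them has inserted it.
seen_by_a = lookup(conn, "/lhcb/new")
seen_by_b = lookup(conn, "/lhcb/new")

# Both conclude the LFN is missing, so both try to insert it.
conn.execute("INSERT INTO DataFiles (LFN) VALUES (?)", ("/lhcb/new",))  # client A
try:
    conn.execute("INSERT INTO DataFiles (LFN) VALUES (?)", ("/lhcb/new",))  # client B
    second_insert_failed = False
except sqlite3.IntegrityError:
    # With a UNIQUE constraint the second insert is rejected; without one,
    # the table would silently end up with a duplicate LFN row instead.
    second_insert_failed = True
```

Any shared lookup table filled this way needs extra protection around the check and the insert, which is the kind of guarding the issue says is scattered through the code.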

I propose removing the DataFiles table and adding an indexed LFN column to the TransformationFiles table. This may make the DB slightly larger, but performance should improve dramatically.
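
The proposed layout could look roughly like the sketch below, again with in-memory SQLite standing in for the real backend. The `TransformationID` column, the composite primary key, the index name `idx_tf_lfn`, and the `add_files` helper are assumptions for illustration; the issue only specifies an indexed LFN column on TransformationFiles.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Assumed shape: the LFN lives directly in TransformationFiles.
    CREATE TABLE TransformationFiles (
        TransformationID INTEGER NOT NULL,
        LFN              TEXT NOT NULL,
        PRIMARY KEY (TransformationID, LFN)
    );
    -- Index on LFN alone, so lookups by LFN do not depend on TransformationID.
    CREATE INDEX idx_tf_lfn ON TransformationFiles (LFN);
""")


def add_files(conn, transformation_id, lfns):
    """Hypothetical helper: a single insert, with no prior lookup in a shared table."""
    conn.executemany(
        "INSERT INTO TransformationFiles (TransformationID, LFN) VALUES (?, ?)",
        [(transformation_id, lfn) for lfn in lfns],
    )


add_files(conn, 1, ["/lhcb/a", "/lhcb/b"])
add_files(conn, 2, ["/lhcb/b", "/lhcb/c"])

# Which transformations use a given file? Answered from one table via the LFN index.
users_of_b = [
    row[0]
    for row in conn.execute(
        "SELECT TransformationID FROM TransformationFiles "
        "WHERE LFN = ? ORDER BY TransformationID",
        ("/lhcb/b",),
    )
]
n_rows = conn.execute("SELECT COUNT(*) FROM TransformationFiles").fetchone()[0]
```

The trade-off is visible in the data: `/lhcb/b` is now stored twice, once per transformation, which is where the extra DB size comes from, while inserts no longer need a check against a shared table.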

fstagni added this to the After v9 milestone on Aug 15, 2024