
Query Performance: Hard Compactions #582

Closed
11 tasks done
sergiimk opened this issue Apr 8, 2024 · 4 comments

sergiimk commented Apr 8, 2024

Future idea: Proactively detect when a derivative dataset's inputs have broken history and offer the user a way to mitigate the problem by triggering a hard compaction

sergiimk changed the title Data query performance → Query performance (hard compactions) Jun 26, 2024
sergiimk changed the title Query performance (hard compactions) → Query Performance: Hard Compactions Jun 26, 2024
sergiimk commented Jul 25, 2024

After some testing, the following defects were identified:

1) User feedback about history-breaking changes on derivative dataset inputs

Epic mentions a subtask:
Display task outcome to let user know that derivative dataset cannot be updated because of root compaction

However, currently a generic error is shown in this scenario:
Image

According to @zaychenko-sergei, a special interpretation of the Invalid block interval error was added to the UI to explain it to the user, so it looks like this functionality is no longer working.

2) It's not possible to recover a derivative dataset and bring it into a good state if one of its inputs was compacted.

Consider a scenario where:

  • We run a "full" compaction on the root dataset
  • Attempting to update the derivative dataset returns an error (as per 1.)
  • In settings there is no option to run a compaction or otherwise reset the derivative dataset - we're stuck

3) "Metadata only" option for root datasets should be better explained

Even I was not entirely sure what to expect from the "metadata only" option. Currently it's not at all clear that it will delete all data (!!!)

Image

4) Not possible to compact a root and trigger recursive "metadata-only" compactions for derivatives

This scenario was the main reason for introducing recursive compactions, but it's not possible in the current UI:

  • The recursive flow can be triggered only via the "metadata only" option on a root dataset, which deletes all data
  • If you run a normal compaction that preserves the data - you can no longer compact the derivatives:
    • neither recursively
    • nor non-recursively, because of problem 2)

5) Incorrect info in "Recursive" flag tooltip

Image


As a suggestion to address 2), 3), and 4), I propose the following steps:

  1. Remove the "Metadata Only" option from root compaction settings, leaving only the "full" mode (fixes 3.)
    a. Better document what a regular hard compaction does
  2. Add a "Reset dataset" option in the "General" tab, next to "Delete dataset"
    a. By default, "reset" will remove everything except the Seed block (preserving only the dataset's identity)
    b. If the "Flatten metadata" checkbox is checked - this triggers our "metadata only" compaction
    c. If the "Recursive" checkbox is checked - this runs the "metadata only" compaction recursively
    d. As a separate feature later, "Reset" can also allow resetting to a specified block hash - i.e. we bring reset and "metadata only" compactions under the same UX umbrella
  3. Make "Reset dataset" available for both root and derivative datasets (fixes 2.)
  4. Add a "Reset downstream datasets recursively" checkbox to the "Compaction" tab (fixes 4.)
    a. This allows running a normal compaction on the root while triggering recursive "metadata only" compactions for the derivatives downstream.

In other words:

  • I think it's simpler for users to understand the concept of "reset with flattened metadata" than "metadata-only compaction"
  • Separating reset into its own function group avoids making users choose between two very different compaction modes, one of which actually drops data

I think the above can be done with no or minimal backend changes; the only thing missing, I believe, is the ability to run a normal compaction on the root while triggering metadata-only compactions for the derivatives.
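To make the proposed "Reset" semantics concrete, here is a minimal Python sketch of the behavior described in steps 2-4. Everything here (Block, has_data, reset, reset_recursive) is a hypothetical illustration of the intended UX semantics, not the actual kamu backend API:

```python
# Hypothetical model: a dataset is an ordered chain of metadata blocks,
# where "Seed" is always the first block and data-bearing blocks are
# marked with has_data=True. (Illustration only - not kamu's real types.)
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Block:
    kind: str          # e.g. "Seed", "SetTransform", "AddData"
    has_data: bool = False

def reset(chain, flatten_metadata=False):
    """Proposed 'Reset dataset' behavior.

    - Default: drop everything except the Seed block, preserving only
      the dataset's identity.
    - flatten_metadata=True: behave like a 'metadata only' compaction,
      i.e. keep the metadata blocks but strip all data from them.
    """
    if not flatten_metadata:
        return [chain[0]]  # Seed only
    return [replace(b, has_data=False) for b in chain]

def reset_recursive(datasets, downstream, name, flatten_metadata=False):
    """Reset `name`, then run metadata-only resets on everything downstream."""
    datasets[name] = reset(datasets[name], flatten_metadata)
    for dep in downstream.get(name, []):
        reset_recursive(datasets, downstream, dep, flatten_metadata=True)

# Example: the gps -> gps-deriv-1 -> gps-deriv-2 chain from the test setup
datasets = {
    name: [Block("Seed"), Block("SetTransform"), Block("AddData", True)]
    for name in ("gps", "gps-deriv-1", "gps-deriv-2")
}
downstream = {"gps": ["gps-deriv-1"], "gps-deriv-1": ["gps-deriv-2"]}

reset_recursive(datasets, downstream, "gps", flatten_metadata=True)
assert all(not b.has_data for d in datasets.values() for b in d)
```

Under these assumptions, a plain reset keeps only identity, while "flatten metadata" keeps the metadata chain but drops the data - matching the distinction the proposal wants to surface in the UI.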

sergiimk commented Sep 1, 2024

sergiimk commented Sep 9, 2024

My test setup is a chain of datasets:

gps -> gps-deriv-1 -> gps-deriv-2

The derivative datasets simply do a select * from <input>.

Issues found:

  1. When the root dataset is hard compacted (non-recursively) and I trigger a derivative update manually, I get an error (as expected), but the error shows the name of my derivative dataset instead of the name of the root input
    image

  2. When I do Reset + flatten metadata on a derivative dataset, I see a flow confusingly called "Hard Compaction". I expected to see "Reset"
    image

Please ticket these up as low-priority bugs - they are not blocking this epic's completion.

Aside: I had a stable repro of an issue on the demo environment where running Reset with the "recursive" flag did not affect the downstream datasets - only the current one.
image

I then tried the same on europort and everything worked. I then recreated the datasets on demo and it started working there as well. Just a note that there might be something fishy going on.

sergiimk closed this as completed Sep 9, 2024
dmitriy-borzenko commented Sep 10, 2024

  1. Created a backend ticket: Change input dataset for derived datasets when manual update fails #820
  2. Reset + flatten metadata on a derivative dataset starts a Hard Compaction under the hood - that's why the result looks like this:

image

Maybe this should be split out into a separate flow? We need to discuss what the best way to proceed is.

image

I just checked this case and everything works - every dataset in the chain has empty data. I can't reproduce this problem.
