Conversation

@aokolnychyi
Contributor

This PR changes metrics collection for position delete files to drop boundaries that are not useful. As discussed earlier, loading and comparing lower and upper boundaries in DeleteFileIndex is fairly expensive and should be done only if it makes sense. Given the nature of generated data file paths, it only makes sense to persist the boundaries if a position delete file covers a single data file. Otherwise, the boundaries only harm planning performance.
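To illustrate the reasoning above, here is a minimal sketch of a path-range check. It is not the DeleteFileIndex code; the class name `PathBoundsSketch`, the method `mayContain`, and the example paths are all hypothetical. It only shows that a lower/upper bound pair on the file path is exact when the delete file references one data file (lower equals upper) and tends to be too wide to prune anything when it references several.

```java
public class PathBoundsSketch {

  // Hypothetical helper: returns true if a delete file whose file-path
  // bounds are [lower, upper] could contain deletes for dataFilePath.
  static boolean mayContain(String lower, String upper, String dataFilePath) {
    return lower.compareTo(dataFilePath) <= 0 && dataFilePath.compareTo(upper) <= 0;
  }

  public static void main(String[] args) {
    // Single referenced data file: lower == upper, so the check is exact.
    System.out.println(mayContain("data/f3.parquet", "data/f3.parquet", "data/f3.parquet"));
    System.out.println(mayContain("data/f3.parquet", "data/f3.parquet", "data/f7.parquet"));

    // Several referenced data files: every path inside the range matches,
    // including files the delete file never references.
    System.out.println(mayContain("data/a1.parquet", "data/z9.parquet", "data/f7.parquet"));
  }
}
```

In the multi-file case the comparison still costs time during planning but rarely excludes a delete file, which is the motivation for not persisting those boundaries.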

@aokolnychyi aokolnychyi force-pushed the ignore-file-stats-position-deletes branch from be8ca32 to c7698f5 Compare August 20, 2023 19:29
@github-actions github-actions bot removed the API label Aug 20, 2023
@aokolnychyi aokolnychyi force-pushed the ignore-file-stats-position-deletes branch from c7698f5 to 94cda10 Compare August 21, 2023 16:35
@aokolnychyi aokolnychyi force-pushed the ignore-file-stats-position-deletes branch from 94cda10 to 96dabf6 Compare August 21, 2023 18:23
Member

@RussellSpitzer RussellSpitzer left a comment

+1 for OSS

Contributor

@singhpk234 singhpk234 left a comment

LGTM as well, considering lower and upper bounds are optional in spec.

private Metrics metrics() {
  Metrics metrics = appender.metrics();
  if (referencedDataFiles.size() > 1) {
    return MetricsUtil.copyWithoutFieldBounds(metrics, SINGLE_REFERENCED_FILE_BOUNDS_ONLY);
Contributor

Looks like the idea is only to drop the bounds for _file and _pos? I guess that would mean that we keep data column ranges that might be used for filtering?

Contributor Author

@aokolnychyi aokolnychyi Aug 21, 2023

I am not aware of any engines that persist deleted rows, but I opted for an incremental change in behavior to be safe.
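The behavior discussed in this thread can be sketched with plain maps. This is only an illustration under stated assumptions, not the `MetricsUtil.copyWithoutFieldBounds` implementation: the class `BoundsFilterSketch`, its methods, and the field IDs `FILE_PATH_ID` and `POS_ID` are placeholders invented for the example. Bounds for the path and position fields are dropped when more than one data file is referenced, while bounds for any other columns are kept.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

public class BoundsFilterSketch {

  // Placeholder field IDs for this sketch only; not the real metadata column IDs.
  static final int FILE_PATH_ID = 1001;
  static final int POS_ID = 1002;

  // Returns a copy of the bounds map without the excluded field IDs.
  static <V> Map<Integer, V> withoutFields(Map<Integer, V> bounds, Set<Integer> excluded) {
    if (bounds == null) {
      return null;
    }
    Map<Integer, V> copy = new HashMap<>(bounds);
    copy.keySet().removeAll(excluded);
    return copy;
  }

  // Keeps all bounds when at most one data file is referenced; otherwise
  // drops only the path and position bounds and keeps the rest.
  static <V> Map<Integer, V> boundsToPersist(Map<Integer, V> bounds, int referencedDataFileCount) {
    if (referencedDataFileCount > 1) {
      return withoutFields(bounds, Set.of(FILE_PATH_ID, POS_ID));
    }
    return bounds;
  }
}
```

With this shape, a delete file that also stores deleted rows would retain its data column ranges for filtering, which matches the incremental change described above.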

  Assert.assertNull(deleteFile.lowerBounds());
  Assert.assertNull(deleteFile.upperBounds());
} else {
  Assert.assertEquals(2, deleteFile.lowerBounds().size());
Contributor

Should this also assert that referencedDataFiles.size() == 1?

Contributor Author

I'll add a check.

@rdblue
Contributor

rdblue commented Aug 21, 2023

Looks good to me.

@aokolnychyi aokolnychyi merged commit 74a7d95 into apache:master Aug 22, 2023
@aokolnychyi
Contributor Author

Thanks for reviewing, @RussellSpitzer @singhpk234 @rdblue!

4 participants