Skip to content

Conversation

@aokolnychyi
Copy link
Contributor

@aokolnychyi aokolnychyi commented Jul 6, 2019

This PR introduces MetricsConfig that allows us to control metrics collection and resolves #173.

Currently, this logic is integrated only into the Parquet write path.

@aokolnychyi
Copy link
Contributor Author

@rdblue @vgankidi this what I had before we merged PR #254. It would be great to get preliminary feedback and align on whether truncation should be part of MetricsSpec or not.

| write.avro.compression-codec | gzip | Avro compression codec |

| write.metadata.metrics.default | full | Default metrics mode for all columns in the table; none, counts or full |
| write.metadata.metrics.column.col1 | (not set) | Metrics mode for column 'col1' to allow per-column tuning; none, counts or full |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: I like using a serial comma.

@rdblue
Copy link
Contributor

rdblue commented Jul 7, 2019

@aokolnychyi, this looks great! I think we need to merge it with the changes from #254, but this is an awesome start!

import static org.apache.iceberg.TableProperties.DEFAULT_WRITE_METRICS_MODE;
import static org.apache.iceberg.TableProperties.DEFAULT_WRITE_METRICS_MODE_DEFAULT;

public class MetricsSpec {
Copy link
Contributor

@rdblue rdblue Jul 7, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about naming this MetricsConfig instead? In a lot of places, we refer to just "spec" to mean the partition spec and I don't want to make the code confusing.

@aokolnychyi aokolnychyi force-pushed the metrics-config branch 3 times, most recently from 0f68b9d to d23687f Compare July 7, 2019 23:21
@aokolnychyi aokolnychyi changed the title [WIP] Make metrics collection configurable via table properties Make metrics collection configurable via table properties Jul 7, 2019
@aokolnychyi
Copy link
Contributor Author

@rdblue this one is ready for another review round.

| write.parquet.compression-codec | gzip | Parquet compression codec |
| write.avro.compression-codec | gzip | Avro compression codec |

| write.metadata.metrics.default | truncate(16) | Default metrics mode for all columns in the table; none, counts, truncate(lengthInBytes), or full |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Truncate length isn't in bytes. It is bytes for binary and characters (unicode codepoints) for UTF-8.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess I misinterpreted the property name WRITE_METADATA_TRUNCATE_BYTES.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We probably should have changed that name, good catch.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MetricsModes {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be good to have some docs here for what the modes are.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added short descriptions to MetricsModes and each MetricsMode.

@rdblue
Copy link
Contributor

rdblue commented Jul 12, 2019

@aokolnychyi, just a couple minor comments and this is ready to go. It also needs a rebase. Thanks!

@aokolnychyi aokolnychyi mentioned this pull request Jul 15, 2019
@rdblue rdblue merged commit c2957eb into apache:master Jul 15, 2019
@rdblue
Copy link
Contributor

rdblue commented Jul 15, 2019

Merged. Thanks for fixing this, @aokolnychyi!

rdblue pushed a commit to rdblue/iceberg that referenced this pull request Aug 7, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Make collection of lower/upper bounds configurable

2 participants