Skip to content

Conversation

@chenjunjiedada
Copy link
Collaborator

FlinkAppenderFactory uses the wrong metric configs for position deletes, leading to ineffective filtering. Refactoring the whole write path to FileWriterFactory could fix this while we are using FileAppenderFactory for a long time. Not sure what other benefits can bring if we change that. So here is a simple fix.

build.gradle Outdated
oldGroup = project.group
oldName = project.name
oldVersion = "1.1.0"
oldVersion = "1.0.0"
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just for build, will change back if the master pass the build.

@chenjunjiedada
Copy link
Collaborator Author

@stevenzwu @hililiwei You guys may have an interest in this.

*
* @param props table configuration
*/
public static MetricsConfig forPositionDelete(Map<String, String> props) {
Copy link
Contributor

@stevenzwu stevenzwu Nov 30, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we use the method from line 91?

 public static MetricsConfig forPositionDelete(Table table) {

Flink write properties allows override of table properties for write (e.g. compression configs). Do we have such use cases for metrics config?

I am asking because the current fromProperties method (for data files) is deprecated in favor of the method forTable. Looks like deprecation is from PR #2240 by @szehon-ho .

  /**
   * Creates a metrics config from table configuration.
   *
   * @param props table configuration
   * @deprecated use {@link MetricsConfig#forTable(Table)}
   */
  @Deprecated
  public static MetricsConfig fromProperties(Map<String, String> props) {

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At least, we can avoid code duplication by extracting common code btw these two methods, if we do need this new variant.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Extraction done.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@chenjunjiedada what do you think of using the existing API? I know it would require changes in FlinkAppenderFactory to pass in the Table object.

public static MetricsConfig forPositionDelete(Table table) {

The benefit is that position delete file will get the consistent behavior as data file. unless we are saying position delete has its own schema, shared code path doesn't really make sense.

  private static MetricsConfig from(Map<String, String> props, Schema schema, SortOrder order) {
    ...
  }

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, positions delete does NOT share the same behavior as data files, SortedPosDeleteWriter sorts the content.

What do you mean by has its own schema? Position deletes can have two schemas, one is [path, pos] and another is [path, pos, row]. Currently, it writes [path, pos] in BaseTaskWriter.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@stevenzwu I recheck the metric config, the schema and sort order are for the data row. Change to use forTable now.

Copy link
Contributor

@stevenzwu stevenzwu Dec 6, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@chenjunjiedada hmm. I think we should revert the last commit. if we are going to pass in the Table object, it should be non-null. Then we can consistently use forTable everywhere.

I would also be open with the forProperties approach, as you were saying that schema and sort order are for data rows/files. I had some reservations because MetricConfig deprecated forProperties for data rows/files. It is a little weird to introduce it back for position delete. Flink can use forPositionDelete(Table table) API by passing a valid SerializableTable object to FlinkAppenderFactory. With Table object, we can remove the Schema and PartitionSpec from the current FlinkAppenderFactory constructor

Copy link
Contributor

@stevenzwu stevenzwu Dec 6, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW, Anton had a previous attempt for moving Flink to SerializableTable for Flink source/reader. #2987.

We can revive that effort as a separate PR? this can be merged after that.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me take a look at that one fisrtly.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@stevenzwu , That one (#6407) is for source. This PR is for the sink side. Let me take a look at the sink as well. Maybe eventually we will change back to adopt FileWriterFactory.

columnModes.put(MetadataColumns.DELETE_FILE_PATH.name(), MetricsModes.Full.get());
columnModes.put(MetadataColumns.DELETE_FILE_POS.name(), MetricsModes.Full.get());

MetricsConfig tableConfig = from(props, null, null);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since this is not forTable method, it shouldn't be called tableConfig.

Copy link
Contributor

@hililiwei hililiwei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tend to use this simple fix first.


MetricsConfig metricsConfig = MetricsConfig.fromProperties(props);
MetricsConfig metricsConfig =
table != null ? MetricsConfig.forTable(table) : MetricsConfig.fromProperties(props);
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The data row in equality delete should share the same metrics config as the table, so I change this as well.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this actually makes a stronger case that we should pass in a valid Table object to the FlinkAppenderFactory as I mentioned in the other comment. we should also use forTable for position delete although it doesn't need schema or SortOrder as you said.

EncryptedOutputFile outputFile, FileFormat format, StructLike partition) {
MetricsConfig metricsConfig = MetricsConfig.fromProperties(props);
MetricsConfig metricsConfig =
table != null ? MetricsConfig.forTable(table) : MetricsConfig.fromProperties(props);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should be forPositionDelete

public FileAppender<RowData> newAppender(OutputFile outputFile, FileFormat format) {
MetricsConfig metricsConfig = MetricsConfig.fromProperties(props);
MetricsConfig metricsConfig =
table != null ? MetricsConfig.forTable(table) : MetricsConfig.fromProperties(props);
Copy link
Contributor

@stevenzwu stevenzwu Dec 13, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MetricsConfig.fromProperties is deprecated. I am wondering if we should just ensure table is never null? That would require changing the current 2 constructors (technically breaking) without adding a new constructor. but this class should be an @Internal class. what do you think? @pvary @hililiwei @chenjunjiedada

@stevenzwu stevenzwu merged commit b735401 into apache:master Dec 14, 2022
@stevenzwu
Copy link
Contributor

thanks @chenjunjiedada for the contribution

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants