Skip to content

Conversation

@jackye1995
Copy link
Contributor

@aokolnychyi this should probably go to 0.11.1

HadoopFileIO currently does not have a no-arg constructor and could not be loaded by the dynamic FileIO loader. But our documentation claims it can be loaded, so we need to fix it asap. There are 2 ways to fix it,

  1. add a no-arg constructor to HadoopFileIO, let it implement Configurable, and set configuration in setConf method.
  2. (current approach in PR) allow the loader to load an implementation that has a constructor taking Hadoop configuration.

To do approach 1, we need to change HadoopFileIO class variable to be not final, and I am a bit hesitated to do that for performance concerns, so I went with approach 2 for now.

@github-actions github-actions bot added the core label Mar 15, 2021
} catch (NoSuchMethodException e) {
throw new IllegalArgumentException(String.format(
"Cannot initialize FileIO, missing no-arg constructor: %s", impl), e);
LOG.error("No-arg constructor not found for {}, try to use Hadoop Configuration constructor", impl, e);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

try -> will attempt to use

Just because the current wording makes it sound like the user needs to try to do something

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also shouldn't this be "info"? It's not necessarily an error right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, updated

LOG.error("No-arg constructor not found for {}, try to use Hadoop Configuration constructor", impl, e);
try {
ctor = DynConstructors.builder(Catalog.class).impl(impl, Configuration.class).buildChecked();
useNoArgConstructor = false;
Copy link
Member

@RussellSpitzer RussellSpitzer Mar 15, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we just invoke the constructor where we load it and skip the state variable passing through here?

Something like

try {
   ctor = ..
   fileIO = ctor.newInstance
} catch {
   ctor = ... single are ctor
   fileIo = ctor.newInstance(hadoopConf)
}

To avoid try nesting we could extract that to a helper function and then have something like

try {
  fileIO = loadFileIO(impl)
} catch {
  Cannot find a 0-arg or 1-arg constructor to initalize FileImplemenation
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One additional question, do we always want the 0 arg constructor to get priority over a 1-arg? Or should we throw an error if you provide both?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we always want the 0 arg constructor to get priority over a 1-arg

I think it is better to let 0 arg constructor to get priority, so it does not block any FileIO that might have both constructors.

To avoid try nesting we could extract that to a helper function

Yeah I tried about that but there was duplicated logic for finding ClassCastException. Using a helper function is probably a way to solve it, let me update with that, thank you!

@jackye1995
Copy link
Contributor Author

@RussellSpitzer updated based on all the comments, please take a look when you have time, thanks!

* a Hadoop config will be passed using {@link Configurable#setConf(Configuration)}.
* {@link FileIO#initialize(Map properties)} is called to complete the initialization.
*
* For compatibility with HadoopFileIO,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we starts to support 1-arg constructor as a valid case in the code and prints error messages about it, should we mention this as a fully supported case instead of a backward compatibility support? Especially that the reason for you to choose this option is performance concern.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I also noticed this is a bit annoying to see a stack trace printed when trying to load it, let me just not print the error.

Copy link
Member

@RussellSpitzer RussellSpitzer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me

Configuration configuration = new Configuration();
configuration.set("key", "val");
FileIO fileIO = CatalogUtil.loadFileIO(
TestFileIOHadoopConstructor.class.getName(), Maps.newHashMap(), configuration);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not just load the HadoopFileIO directly ? Here we are trying to load an TestFileIOHadoopConstructor, I'm concerning that we could not detect the breaking case if someone change the HadoopFileIO constructor....

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that would be possible, since HadoopFileIO is public and the constructor is also public, we cannot simply change it. But this is a good point, I can just load it instead of having this dummy test class.

* If the class implements {@link Configurable},
* a Hadoop config will be passed using {@link Configurable#setConf(Configuration)}.
* {@link FileIO#initialize(Map properties)} is called to complete the initialization.
*
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Paragraphs should be separated by <p>.

}
} catch (NoSuchMethodException e) {
LOG.info("0-arg constructor not found for {}, will attempt to use Hadoop Configuration constructor", impl);
ctor = DynConstructors.builder(Catalog.class).impl(impl, Configuration.class).buildChecked();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this can be collapsed into a single DynConstructors call that lists both implementations. The first one that is found will be selected. And the newInstance method should discard arguments that are not needed so you can use the same call for both.

@rdblue
Copy link
Contributor

rdblue commented Mar 24, 2021

@jackye1995, can you elaborate on your performance concerns if the configuration isn't final? I'm not sure that would be a problem.

@aokolnychyi
Copy link
Contributor

Sorry, I missed this PR. I can hold 0.11.1 RC for a little longer.

@jackye1995
Copy link
Contributor Author

jackye1995 commented Mar 24, 2021

@rdblue

I think this can be collapsed into a single DynConstructors call that lists both implementations.

The issue with that approach is that I still need a boolean to check if I need to call the setConf for no-arg constructor, and there was no way to get that boolean. I guess I can bypass that by checking instanceof Configurable for all FileIOs, so if we decide to go with that route I can make the change accordingly.

can you elaborate on your performance concerns if the configuration isn't final?

I would actually prefer to use this approach of adding no-arg consturctor for HadoopFileIO which makes the whole concept of FIleIO loading much simpler. But I see conf() is called in all read and write paths to create a serializable configuration to avoid Kryo serialization issue. There is definitely a small difference for using final versus not final, but the difference might not be significant from a broader view, I do not have a definitive answer for that.

@jackye1995
Copy link
Contributor Author

Sorry, I missed this PR. I can hold 0.11.1 RC for a little longer.

Thank you Anton!

@rdblue
Copy link
Contributor

rdblue commented Mar 25, 2021

@jackye1995, let's go with the no-arg constructor and remove final. Configuration is primarily accessed to create FileIO, which can then have a final reference. So I don't think this is in a critical path and we shouldn't worry about it. Probably not worth making the contract for dynamic loading more complicated and difficult to maintain.

@jackye1995
Copy link
Contributor Author

@rdblue I was just going to reply, I did some benchmarking, as expected there is a difference but it was at microsecond level, so I think it should be fine to just remove the final. I think I was overly conservative on this, thanks for the suggestion!

@rdblue rdblue merged commit d30a225 into apache:master Mar 26, 2021
@aokolnychyi
Copy link
Contributor

Thanks everyone! I am going to include this change in 0.11.1.

@aokolnychyi aokolnychyi added this to the Java 0.11.1 Release milestone Mar 30, 2021
coolderli pushed a commit to coolderli/iceberg that referenced this pull request Apr 26, 2021
stevenzwu pushed a commit to stevenzwu/iceberg that referenced this pull request Jul 28, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants