-
Notifications
You must be signed in to change notification settings - Fork 1.5k
PARQUET-674: Add InputFile abstraction for openable files. #368
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
@julienledem and @robert3005, we might want to get this in 1.9.0 (though this is a low priority). It adds |
3bb875c to
f677944
Compare
|
👍 That looks like a good abstraction in general. Thanks! |
| import java.util.concurrent.Executors; | ||
| import java.util.concurrent.Future; | ||
|
|
||
| import org.apache.commons.math3.analysis.function.Add; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove import?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ugh, I hate how IntelliJ now adds static imports automatically. It constantly adds these when I pause typing a name. Thanks for catching it!
f677944 to
4a7c327
Compare
| /** | ||
| * Returns the file location. | ||
| */ | ||
| String getLocation(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we keep this as Path?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, this is an abstraction that isn't tied to Hadoop or another FS library. A string location should be portable across implementations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I was thinking more on the lines of a java Path / URI.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we need the location at all? Is this for showing in error messages? Maybe just toString is enough?
|
👍 |
|
@julienledem, could you take a look at this? It would be better to do this for 1.9.0 than to do it later because it would prevent exposing a public method. Thanks! |
julienledem
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
overall this looks good to me. I made some comments
| * {@code ParquetDataSource} is an interface with the methods needed by Parquet | ||
| * to read data files using {@link SeekableInputStream} instances. | ||
| */ | ||
| public interface ParquetDataSource { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a SeekableInputStream provider with a length.
maybe call it InputFile ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I wasn't too happy with the name either. InputFile is something I hadn't though of and sounds pretty good. I'll go with that.
| import org.apache.parquet.io.ParquetDataSource; | ||
| import java.io.IOException; | ||
|
|
||
| public class HadoopDataSource implements ParquetDataSource, Configurable { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to make it Configurable?
The conf is passed in the constructor and does not need to be settable or exposed.
even better once initialized, the conf is not used anymore. I would remove the conf field as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was to be able to create a ParquetMetadataConverter using its constructor that takes a Configuration, but I think it's better to remove this because that option should be removed in the next release.
|
@julienledem, thanks for your comments. I implemented your suggestions so I think this is about ready when tests are passing. |
|
+1 |
Author: Ryan Blue <[email protected]> Closes apache#368 from rdblue/PARQUET-674-add-data-source and squashes the following commits: 8c689e9 [Ryan Blue] PARQUET-674: Implement review comments. 4a7c327 [Ryan Blue] PARQUET-674: Add DataSource abstraction for openable files.
Author: Ryan Blue <[email protected]> Closes apache#368 from rdblue/PARQUET-674-add-data-source and squashes the following commits: 8c689e9 [Ryan Blue] PARQUET-674: Implement review comments. 4a7c327 [Ryan Blue] PARQUET-674: Add DataSource abstraction for openable files.
Author: Ryan Blue <[email protected]> Closes apache#368 from rdblue/PARQUET-674-add-data-source and squashes the following commits: 8c689e9 [Ryan Blue] PARQUET-674: Implement review comments. 4a7c327 [Ryan Blue] PARQUET-674: Add DataSource abstraction for openable files.
Author: Ryan Blue <[email protected]> Closes apache#368 from rdblue/PARQUET-674-add-data-source and squashes the following commits: 8c689e9 [Ryan Blue] PARQUET-674: Implement review comments. 4a7c327 [Ryan Blue] PARQUET-674: Add DataSource abstraction for openable files.
No description provided.