@rdblue
Contributor

@rdblue rdblue commented Sep 12, 2016

No description provided.

@rdblue
Contributor Author

rdblue commented Sep 12, 2016

@julienledem and @robert3005, we might want to get this into 1.9.0, though it is a low priority. It adds ParquetDataSource, which we discussed adding on PARQUET-400, to encapsulate a file's size and location and to provide new SeekableInputStream instances. I think this would be a bit cleaner for the readFooters method added in #357. Currently this only works with Hadoop, but you could easily add an implementation for other file systems.
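
For illustration only, here is a minimal sketch of the kind of abstraction described above: one object that reports a file's size and location and hands out fresh seekable streams. The names `SeekableStream`, `DataSourceSketch`, and `LocalFileDataSource` are hypothetical stand-ins and are not taken from this PR. The local-file implementation is an assumption showing what a non-Hadoop backend might look like.

```java
import java.io.Closeable;
import java.io.IOException;
import java.io.RandomAccessFile;

// Hypothetical stand-in for Parquet's SeekableInputStream, kept minimal
// so that this sketch compiles without any Parquet or Hadoop dependency.
interface SeekableStream extends Closeable {
  void seek(long position) throws IOException;
  int read() throws IOException;
}

// Sketch of the abstraction: file size, location, and new stream instances.
interface DataSourceSketch {
  long getLength() throws IOException;
  String getLocation();
  SeekableStream newStream() throws IOException;
}

// Hypothetical non-Hadoop implementation backed by a local file.
final class LocalFileDataSource implements DataSourceSketch {
  private final String path;

  LocalFileDataSource(String path) {
    this.path = path;
  }

  @Override
  public long getLength() throws IOException {
    try (RandomAccessFile file = new RandomAccessFile(path, "r")) {
      return file.length();
    }
  }

  @Override
  public String getLocation() {
    return path;
  }

  @Override
  public SeekableStream newStream() throws IOException {
    // Each call opens an independent handle, so callers never share a position.
    final RandomAccessFile file = new RandomAccessFile(path, "r");
    return new SeekableStream() {
      @Override public void seek(long position) throws IOException { file.seek(position); }
      @Override public int read() throws IOException { return file.read(); }
      @Override public void close() throws IOException { file.close(); }
    };
  }
}
```

A reader written against `DataSourceSketch` would not need to know which file system sits behind it, which is the portability argument made above.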

@rdblue rdblue force-pushed the PARQUET-674-add-data-source branch 2 times, most recently from 3bb875c to f677944 Compare September 12, 2016 17:22
@robert3005

👍 That looks like a good abstraction in general. Thanks!

import java.util.concurrent.Executors;
import java.util.concurrent.Future;

import org.apache.commons.math3.analysis.function.Add;

remove import?

Contributor Author

Ugh, I hate how IntelliJ now adds static imports automatically. It constantly adds these when I pause typing a name. Thanks for catching it!

@rdblue rdblue force-pushed the PARQUET-674-add-data-source branch from f677944 to 4a7c327 Compare September 13, 2016 17:43
/**
* Returns the file location.
*/
String getLocation();

should we keep this as Path?

Contributor Author

No, this is an abstraction that isn't tied to Hadoop or another FS library. A string location should be portable across implementations.

Yeah, I was thinking more along the lines of a Java Path / URI.

Member

@julienledem julienledem Sep 19, 2016

Do we need the location at all? Is this for showing in error messages? Maybe just toString is enough?

@piyushnarang

👍

@rdblue
Contributor Author

rdblue commented Sep 17, 2016

@julienledem, could you take a look at this? It would be better to do this for 1.9.0 than to do it later because it would prevent exposing a public method. Thanks!

Member

@julienledem julienledem left a comment

Overall, this looks good to me. I made some comments.

* {@code ParquetDataSource} is an interface with the methods needed by Parquet
* to read data files using {@link SeekableInputStream} instances.
*/
public interface ParquetDataSource {
Member

It's a SeekableInputStream provider with a length.
Maybe call it InputFile?

Contributor Author

Yeah, I wasn't too happy with the name either. InputFile is something I hadn't thought of and sounds pretty good. I'll go with that.

import org.apache.parquet.io.ParquetDataSource;
import java.io.IOException;

public class HadoopDataSource implements ParquetDataSource, Configurable {
Member

Do we need to make it Configurable?
The conf is passed in the constructor and does not need to be settable or exposed.
Even better, once initialized, the conf is not used anymore. I would remove the conf field as well.

Contributor Author

This was to be able to create a ParquetMetadataConverter using its constructor that takes a Configuration, but I think it's better to remove this because that option should be removed in the next release.
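
As a rough sketch of the review suggestion in this thread (configuration consumed in the constructor, with no setter and no stored field), something like the following could work. Everything here is hypothetical: the class name, the `Map` standing in for a Hadoop-style `Configuration`, and the `example.buffer.size` key are illustrative assumptions, not code from this PR.

```java
import java.util.Map;

// Hypothetical source that reads what it needs from the configuration once.
final class ConstructorConfiguredSource {
  private final String location;
  private final int bufferSize;

  // The Map stands in for a Hadoop-style Configuration (an assumption of this sketch).
  ConstructorConfiguredSource(String location, Map<String, String> conf) {
    this.location = location;
    // Resolve everything derived from conf here; conf itself is not stored,
    // so later changes to it cannot affect this instance.
    this.bufferSize = Integer.parseInt(conf.getOrDefault("example.buffer.size", "4096"));
  }

  String getLocation() {
    return location;
  }

  int getBufferSize() {
    return bufferSize;
  }
}
```

Because only derived values are kept, the object is effectively immutable after construction, which avoids exposing a configuration getter or setter in the public API.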

@rdblue
Contributor Author

rdblue commented Oct 3, 2016

@julienledem, thanks for your comments. I implemented your suggestions, so I think this is ready once the tests pass.

@rdblue rdblue changed the title from "PARQUET-674: Add DataSource abstraction for openable files." to "PARQUET-674: Add InputFile abstraction for openable files." Oct 3, 2016
@julienledem
Member

+1

@asfgit asfgit closed this in b59be86 Oct 3, 2016
robert3005 pushed a commit to palantir/parquet-mr that referenced this pull request Oct 7, 2016
Author: Ryan Blue <[email protected]>

Closes apache#368 from rdblue/PARQUET-674-add-data-source and squashes the following commits:

8c689e9 [Ryan Blue] PARQUET-674: Implement review comments.
4a7c327 [Ryan Blue] PARQUET-674: Add DataSource abstraction for openable files.
rdblue added a commit to rdblue/parquet-mr that referenced this pull request Jan 6, 2017
rdblue added a commit to rdblue/parquet-mr that referenced this pull request Jan 10, 2017