@@ -375,7 +375,7 @@ private[parquet] object ParquetTypesConverter extends Logging {

     val children = fs.listStatus(path).filterNot { status =>
       val name = status.getPath.getName
-      name(0) == '.' || name == FileOutputCommitter.SUCCEEDED_FILE_NAME
+      (name(0) == '.' || name(0) == '_') && name != ParquetFileWriter.PARQUET_METADATA_FILE
     }

// NOTE (lian): Parquet "_metadata" file can be very slow if the file consists of lots of row
Contributor
hmm, a better solution for all of this could be to drop `val children = fs.listStatus(path)...` entirely, and instead do:

val metaFile = fs.listStatus(path).find(_.getPath.getName == ParquetFileWriter.PARQUET_METADATA_FILE)
val dataFile = fs.listStatus(path).find(status => isNotHiddenFile(status.getPath.getName))

where isNotHiddenFile simply checks (name(0) != '.' && name(0) != '_')

then, since find returns an Option, something like:

dataFile match {
  case Some(file) => ParquetFileReader.readFooter(conf, file)
  case None       => ParquetFileReader.readFooter(conf, metaFile.get)
}
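The hidden-file check suggested above can be sketched as a small self-contained predicate (plain Scala over file names, no Hadoop dependency; `HiddenFileCheck` is a hypothetical name used only for illustration):

```scala
object HiddenFileCheck {
  // Hadoop convention: names starting with '.' or '_' are hidden
  // (e.g. ".part-00000.crc" checksum files, the "_SUCCESS" marker).
  def isNotHiddenFile(name: String): Boolean =
    name.nonEmpty && name(0) != '.' && name(0) != '_'
}
```

For example, `isNotHiddenFile("part-00000.parquet")` is true, while `isNotHiddenFile("_metadata")` and `isNotHiddenFile(".part-00000.crc")` are both false.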

And moreover, @liancheng, after carefully reading the following comments, I finally understand what you meant by "a complete Parquet file on HDFS should be a directory" #2044 (comment)

You mean the whole directory is "a single Parquet file", and the files inside it are the "data"? But such a definition is really confusing... are you sure about it? I just googled but found nothing, only statements like "Parquet files are self-describing so the schema is preserved".

So, since they are self-describing, each "data file" in a Parquet file (a Parquet folder, actually...) is itself a valid Parquet-format file, and should also be usable as an input source for a Parquet reader like our Spark SQLContext...

Contributor
A listStatus(path) call can be expensive, especially when the path is an S3 URI. In that case, an HTTP request is issued to fetch all the FileStatus objects, which may take hundreds of milliseconds to return (depending on how many HDFS files there are). That's why the children value is used.

And yeah, I also find the definition of "Parquet file" somewhat confusing, and even the official Parquet documentation doesn't provide a precise one. IMO a part-* file is roughly equivalent to a row group of a Parquet file, but it also carries its own metadata and is thus self-describing.
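The call-listStatus-once idea from this exchange can be sketched as follows (plain Scala over file names, no Hadoop dependency; `FooterSource` and `chooseFooterFile` are hypothetical names, and the "_metadata" constant is assumed to mirror ParquetFileWriter.PARQUET_METADATA_FILE):

```scala
object FooterSource {
  // Assumed to mirror ParquetFileWriter.PARQUET_METADATA_FILE.
  val ParquetMetadataFile = "_metadata"

  // Given the file names from a single directory listing, prefer reading
  // the footer from a real data file; fall back to the "_metadata" summary.
  def chooseFooterFile(children: Seq[String]): Option[String] = {
    val dataFile = children.find(n => n.nonEmpty && n(0) != '.' && n(0) != '_')
    val metaFile = children.find(_ == ParquetMetadataFile)
    dataFile.orElse(metaFile)
  }
}
```

For example, `chooseFooterFile(Seq("_SUCCESS", "_metadata", "part-00000.parquet"))` returns `Some("part-00000.parquet")`, while a directory containing only summary files falls back to `Some("_metadata")`. The point is that the (possibly slow) listing happens once; both lookups reuse its result.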

Contributor
Thanks for the info. Yep, this method is a confusing point; maybe we can reference some other Parquet reader implementations.
