Fokko commented Apr 8, 2021

Removes:

  • parquet-tools-deprecated
  • parquet-scrooge-deprecated
  • parquet-cascading-common23-deprecated
  • parquet-cascading-deprecated
  • parquet-cascading3-deprecated

Make sure you have checked all steps below.

Jira

Tests

  • My PR adds the following unit tests OR does not need testing for this extremely good reason:

Commits

  • My commits all reference Jira issues in their subject lines. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Documentation

  • In case of new functionality, my PR adds documentation that describes how to use it.
    • All the public functions and classes in the PR contain Javadoc that explains what they do

Fokko requested review from gszadovszky and shangxinli on April 8, 2021 20:45
Fokko force-pushed the fd-remove-unused-modules branch 2 times, most recently from ecab1cc to 9cf6bc9, on April 8, 2021 21:05
Fokko force-pushed the fd-remove-unused-modules branch from 9cf6bc9 to dd25b7d on April 8, 2021 21:06
Fokko merged commit 907314c into apache:master on Apr 19, 2021
elikkatz added a commit to TheWeatherCompany/parquet-mr that referenced this pull request Jun 2, 2021
* 'master' of https://github.com/apache/parquet-mr: (222 commits)
  PARQUET-2052: Integer overflow when writing huge binary using dictionary encoding (apache#910)
  PARQUET-2041: Add zstd to `parquet.compression` description of ParquetOutputFormat Javadoc (apache#899)
  PARQUET-2050: Expose repetition & definition level from ColumnIO (apache#908)
  PARQUET-1761: Lower Logging Level in ParquetOutputFormat (apache#745)
  PARQUET-2046: Upgrade Apache POM to 23 (apache#904)
  PARQUET-2048: Deprecate BaseRecordReader (apache#906)
  PARQUET-1922: Deprecate IOExceptionUtils (apache#825)
  PARQUET-2037: Write INT96 with parquet-avro (apache#901)
  PARQUET-2044: Enable ZSTD buffer pool by default (apache#903)
  PARQUET-2038: Upgrade Jackson version used in parquet encryption. (apache#898)
  Revert "[WIP] Refactor GroupReadSupport to unuse deprecated api (apache#894)"
  PARQUET-2027: Fix calculating directory offset for merge (apache#896)
  [WIP] Refactor GroupReadSupport to unuse deprecated api (apache#894)
  PARQUET-2030: Expose page size row check configurations to ParquetWriter.Builder (apache#895)
  PARQUET-2031: Upgrade to parquet-format 2.9.0 (apache#897)
  PARQUET-1448: Review of ParquetFileReader (apache#892)
  PARQUET-2020: Remove deprecated modules (apache#888)
  PARQUET-2025: Update Snappy version to 1.1.8.3 (apache#893)
  PARQUET-2022: ZstdDecompressorStream should close `zstdInputStream` (apache#889)
  PARQUET-1982: Random access to row groups in ParquetFileReader (apache#871)
  ...

# Conflicts:
#	parquet-column/src/main/java/org/apache/parquet/example/data/simple/SimpleGroup.java
#	parquet-hadoop/pom.xml
#	parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java
#	parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java
ChrisCollinsIBM commented Oct 5, 2022

@Fokko @gszadovszky

I hope you can point me in the right direction. With the deprecation and removal of parquet-tools, is there any way in Java code to render a JSON representation of a Parquet record?

We were previously using parquet.tools.util.JsonRecordFormatter like this:

HadoopInputFile inputFile = HadoopInputFile.fromPath(new Path(filePath), hadoopConfig);

try (ParquetFileReader reader = ParquetFileReader.open(inputFile))
{
	// The file schema drives both record materialization and JSON formatting.
	MessageType schema = reader.getFooter().getFileMetaData().getSchema();
	// JsonRecordFormatter comes from the removed parquet-tools module.
	JsonRecordFormatter.JsonGroupFormatter formatter = JsonRecordFormatter.fromSchema(schema);
	PageReadStore pages;

	// Read one row group at a time until the file is exhausted.
	while ((pages = reader.readNextRowGroup()) != null)
	{
		long rows = pages.getRowCount();
		MessageColumnIO columnIO = new ColumnIOFactory().getColumnIO(schema);
		// SimpleRecord and SimpleRecordMaterializer also come from parquet-tools.
		RecordReader<SimpleRecord> recordReader = columnIO.getRecordReader(pages, new SimpleRecordMaterializer(schema));

		// Use a long counter, since getRowCount() returns a long.
		for (long i = 0; i < rows; i++)
		{
			SimpleRecord simpleRecord = recordReader.read();
			System.out.println(formatter.formatRecord(simpleRecord));
		}
	}
}

Is there anything in the remaining libraries that can achieve this? If not, could we look at pulling these classes back in, perhaps into parquet-format-structures or another related module where they make sense?

I would have opened this as an issue, but I don't see issues enabled for this repository.

EDIT: Question cross-posted to https://issues.apache.org/jira/browse/PARQUET-2020
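
For reference, here is a minimal sketch of what a standalone replacement for the JSON-rendering step could look like. It is not an API from any remaining Parquet module. It assumes each record has already been materialized into nested java.util.Map and Iterable values, and how to do that with the remaining libraries is not covered here. The names RecordJsonSketch and toJson are hypothetical and chosen only for illustration.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: renders nested Map / Iterable / scalar values as JSON text.
// Assumption: records were already converted to plain Java collections.
public class RecordJsonSketch {

    /** Returns a compact JSON string for the given value. */
    public static String toJson(Object value) {
        StringBuilder out = new StringBuilder();
        write(value, out);
        return out.toString();
    }

    private static void write(Object value, StringBuilder out) {
        if (value == null) {
            out.append("null");
        } else if (value instanceof Map) {
            // Groups (nested records) become JSON objects.
            out.append('{');
            boolean first = true;
            for (Map.Entry<?, ?> entry : ((Map<?, ?>) value).entrySet()) {
                if (!first) {
                    out.append(',');
                }
                first = false;
                writeString(String.valueOf(entry.getKey()), out);
                out.append(':');
                write(entry.getValue(), out);
            }
            out.append('}');
        } else if (value instanceof Iterable) {
            // Repeated fields become JSON arrays.
            out.append('[');
            boolean first = true;
            for (Object item : (Iterable<?>) value) {
                if (!first) {
                    out.append(',');
                }
                first = false;
                write(item, out);
            }
            out.append(']');
        } else if (value instanceof Number || value instanceof Boolean) {
            // Note: NaN and infinite doubles are not valid JSON and are not handled here.
            out.append(value);
        } else {
            // Everything else is rendered as a quoted string.
            writeString(value.toString(), out);
        }
    }

    private static void writeString(String s, StringBuilder out) {
        out.append('"');
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '"':  out.append("\\\""); break;
                case '\\': out.append("\\\\"); break;
                case '\n': out.append("\\n"); break;
                case '\r': out.append("\\r"); break;
                case '\t': out.append("\\t"); break;
                default:
                    if (c < 0x20) {
                        // Remaining control characters get a four-digit hex escape.
                        out.append(String.format("\\u%04x", (int) c));
                    } else {
                        out.append(c);
                    }
            }
        }
        out.append('"');
    }

    public static void main(String[] args) {
        // Illustrative record only; the field names are made up.
        Map<String, Object> nested = new LinkedHashMap<>();
        nested.put("city", "Budapest");

        Map<String, Object> record = new LinkedHashMap<>();
        record.put("id", 42L);
        record.put("tags", Arrays.asList("x", "y"));
        record.put("address", nested);

        System.out.println(toJson(record));
    }
}
```

A LinkedHashMap keeps fields in insertion order, so the output can follow schema order if records are populated that way. Binary and logical-type handling (for example INT96 or decimals) would need extra cases and is deliberately left out of this sketch.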

ChrisCollinsIBM commented

Would still love to get some clarity on a path forward on this @Fokko @gszadovszky

Thanks!
