Port OrcFileWriterFactory.createOrcDataSink to TrinoFileSystem #14627
Conversation
CI: #14631
Force-pushed adb3467 to dd5fe8f
This is a prerequisite to be able to track FileSystem memory utilization while writing ORC data - #14023
Sorry, I didn't think anyone was going to review it at this stage.
I am not going to review it then, but you can @-mention me when you want me to take a look.
Yup. Please mark it ready for review and re-request review when this is ready.
Force-pushed dd18dca to 2c9cae3
@alexjo2144 @sopel39 there is no button for re-requesting the review, but please take a look ;)
```diff
         implements ConnectorPageSinkProvider
 {
     private final Set<HiveFileWriterFactory> fileWriterFactories;
+    private final TrinoFileSystemFactory trinoFileSystemFactory;

     @Inject
     public HivePageSinkProvider(
             Set<HiveFileWriterFactory> fileWriterFactories,
+            TrinoFileSystemFactory trinoFileSystemFactory,
             HiveWriterStats hiveWriterStats)
     {
         this.fileWriterFactories = ImmutableSet.copyOf(requireNonNull(fileWriterFactories, "fileWriterFactories is null"));
+        this.trinoFileSystemFactory = requireNonNull(trinoFileSystemFactory, "trinoFileSystemFactory is null");
```
just fileSystemFactory here and in other places
```diff
 for (TempFile tempFile : files) {
-    Path file = tempFile.getPath();
+    String file = tempFile.getPath();
+    TrinoInputFile trinoInputFile = trinoFileSystem.newInputFile(file);
```
```diff
-FileSystem fileSystem = hdfsEnvironment.getFileSystem(identity, path, configuration);
-FSDataInputStream inputStream = hdfsEnvironment.doAs(identity, () -> fileSystem.open(path));
+TrinoFileSystem fileSystem = trinoFileSystemFactory.create(identity);
+TrinoInput inputStream = hdfsEnvironment.doAs(identity, () -> fileSystem.newInputFile(path).newInput());
```
you don't need to wrap in doAs anymore. See io.trino.filesystem.hdfs.HdfsInputFile#newInput
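To illustrate the point with hypothetical stand-in types (the real Trino classes are HdfsEnvironment and io.trino.filesystem.hdfs.HdfsInputFile, which differ in detail): the file-system layer performs the identity switch inside newInput itself, so callers can drop their own doAs wrapper.

```java
import java.util.function.Supplier;

// Hypothetical stand-ins for HdfsEnvironment.doAs and HdfsInputFile#newInput;
// this is a sketch of the pattern, not the real Trino implementation.
public class DoAsSketch
{
    // Stand-in for HdfsEnvironment.doAs: runs an action as the given identity.
    static <T> T doAs(String identity, Supplier<T> action)
    {
        // A real implementation would switch the security context here.
        return action.get();
    }

    static class InputFile
    {
        private final String identity;
        private final String path;

        InputFile(String identity, String path)
        {
            this.identity = identity;
            this.path = path;
        }

        // The identity wrapping happens inside the file-system layer,
        // so callers no longer wrap this call in doAs themselves.
        String newInput()
        {
            return doAs(identity, () -> "input:" + path);
        }
    }

    public static void main(String[] args)
    {
        // Caller side: a plain call, no doAs in sight.
        System.out.println(new InputFile("user", "/tmp/file.orc").newInput());
    }
}
```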
```diff
 OrcDataSink orcDataSink = createOrcDataSink(trinoFileSystem, path.toString());

 Optional<Supplier<OrcDataSource>> validationInputFactory = Optional.empty();
+String stringPath = path.toString();
```
if you move it higher, you can also use it in createOrcDataSink(trinoFileSystem, path.toString());
```diff
-FileSystem fileSystem = hdfsEnvironment.getFileSystem(identity, path, configuration);
-FSDataInputStream inputStream = hdfsEnvironment.doAs(identity, () -> fileSystem.open(path));
+TrinoFileSystem trinoFileSystem = trinoFileSystemFactory.create(identity);
+TrinoInput inputStream = hdfsEnvironment.doAs(identity, () -> trinoFileSystem.newInputFile(path.toString()).newInput());
```
```diff
-FileSystem fileSystem = hdfsEnvironment.getFileSystem(identity, splitPath, configuration);
-FSDataInputStream inputStream = hdfsEnvironment.doAs(identity, () -> fileSystem.open(splitPath));
+TrinoFileSystem trinoFileSystem = trinoFileSystemFactory.create(identity);
+TrinoInput trinoInput = hdfsEnvironment.doAs(identity, () -> trinoFileSystem.newInputFile(splitPath).newInput());
```
trinoInput -> inputFile would be better (here and in other places)
But that is not a file; it is more like an input stream, see io.trino.filesystem.TrinoInput
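The distinction, sketched with hypothetical stand-in types (the real interfaces are io.trino.filesystem.TrinoInputFile and TrinoInput, whose methods differ): the input file is a reopenable handle, while the input obtained from it is the closeable object you actually read from.

```java
public class InputSketch
{
    // Stand-in for TrinoInputFile: a reopenable handle to a file.
    record SketchInputFile(byte[] contents)
    {
        SketchInput newInput()
        {
            return new SketchInput(contents);
        }
    }

    // Stand-in for TrinoInput: closeable and supports positioned reads,
    // which is why "inputStream" describes it better than "file".
    static class SketchInput
            implements AutoCloseable
    {
        private final byte[] contents;

        SketchInput(byte[] contents)
        {
            this.contents = contents;
        }

        byte[] readFully(int position, int length)
        {
            byte[] result = new byte[length];
            System.arraycopy(contents, position, result, 0, length);
            return result;
        }

        @Override
        public void close() {}
    }

    public static void main(String[] args)
    {
        SketchInputFile file = new SketchInputFile("ORC1".getBytes());
        // The handle stays; each newInput() yields a fresh closeable input.
        try (SketchInput input = file.newInput()) {
            System.out.println(new String(input.readFully(0, 3)));
        }
    }
}
```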
```diff
 public Optional<HivePageSourceFactory> getHivePageSourceFactory(HdfsEnvironment hdfsEnvironment)
 {
-    return Optional.of(new OrcPageSourceFactory(new OrcReaderOptions(), hdfsEnvironment, new FileFormatDataSourceStats(), UTC));
+    return Optional.of(new OrcPageSourceFactory(new OrcReaderOptions(), hdfsEnvironment, new FileFormatDataSourceStats(), UTC, new HdfsFileSystemFactory(hdfsEnvironment)));
```
why not use HDFS_FILE_SYSTEM_FACTORY?
Force-pushed 24a085f to 8f35ca7
CI: #14686
Force-pushed bafffe5 to 8f35ca7
move below makeRowIdSortingWriter
I think you could add a unit test for it in io.trino.plugin.hive.TestHiveWriterFactory if you make this method static and package-private (then move this method to the bottom of the class)
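For illustration, a standalone sketch of such a test, assuming the helper is roughly a scheme-prefixing method; the real method name and scheme handling in HiveWriterFactory may differ.

```java
// Hypothetical sketch; the real helper lives in HiveWriterFactory and may
// handle schemes differently.
public class SchemeHelperSketch
{
    // Package-private and static, so a test class in the same package
    // (e.g. TestHiveWriterFactory) can call it directly.
    static String setSchemeToFileIfAbsent(String path)
    {
        // Assumption: any "scheme://" prefix means a scheme is already present.
        if (path.contains("://")) {
            return path;
        }
        return "file://" + path;
    }

    public static void main(String[] args)
    {
        // These checks would become assertions in the unit test.
        if (!setSchemeToFileIfAbsent("/tmp/data/part-0").equals("file:///tmp/data/part-0")) {
            throw new AssertionError("scheme should be added");
        }
        if (!setSchemeToFileIfAbsent("hdfs://nn:8020/tmp/x").equals("hdfs://nn:8020/tmp/x")) {
            throw new AssertionError("existing scheme should be kept");
        }
        System.out.println("ok");
    }
}
```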
use setSchemeToFileIfOneDoesNotExist
It would be cleaner if this was a TrinoInputFile. Then HdfsOrcDataSource would really own the lifecycle of the TrinoInput (both create it and close it).
Can HdfsOrcDataSource just accept inputFile?
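A sketch of the suggested ownership, using hypothetical stand-in interfaces (the real class is HdfsOrcDataSource): the data source accepts the input file, opens the input itself, and closes it in its own close().

```java
public class DataSourceSketch
{
    // Stand-in for TrinoInputFile.
    interface InputFile
    {
        Input newInput();
    }

    // Stand-in for TrinoInput.
    interface Input
            extends AutoCloseable
    {
        @Override
        void close();
    }

    // Sketch of a data source that owns the full input lifecycle:
    // it creates the input from the file and is responsible for closing it.
    static class OrcDataSourceSketch
            implements AutoCloseable
    {
        private final Input input;

        OrcDataSourceSketch(InputFile inputFile)
        {
            this.input = inputFile.newInput();
        }

        @Override
        public void close()
        {
            input.close();
        }
    }

    public static void main(String[] args)
    {
        boolean[] closed = {false};
        InputFile file = () -> () -> closed[0] = true;
        try (OrcDataSourceSketch dataSource = new OrcDataSourceSketch(file)) {
            // reads would happen here
        }
        System.out.println("input closed: " + closed[0]);
    }
}
```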
If it doesn't throw anything, just remove the try/catch.
I don't want to change which specific exception is thrown; if you check deeper, you will see that I catch IOException there and throw UncheckedIOException. I will change this to catch UncheckedIOException only. Does that make sense?
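What the described change looks like with plain JDK types (the readResource helper here is hypothetical): the inner layer catches IOException and rethrows UncheckedIOException, so the caller catches only the unchecked variant and everything else propagates unchanged.

```java
import java.io.IOException;
import java.io.UncheckedIOException;

public class ExceptionSketch
{
    // Hypothetical inner layer: catches the checked IOException and
    // rethrows it as UncheckedIOException, as described in the comment.
    static String readResource(boolean fail)
    {
        try {
            if (fail) {
                throw new IOException("disk error");
            }
            return "data";
        }
        catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args)
    {
        // Caller catches UncheckedIOException only; other exception types
        // are left alone, so the thrown-exception contract is unchanged.
        String result;
        try {
            result = readResource(true);
        }
        catch (UncheckedIOException e) {
            result = "recovered: " + e.getCause().getMessage();
        }
        System.out.println(result);
    }
}
```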
Super nit: I would just inline this
```diff
-String file = tempFile.getPath();
-fileSystem.deleteFile(file);
+fileSystem.deleteFile(tempFile.getPath());
```
You could migrate verifyAcidSchema pretty easily as well. Looks like it just uses the path in error messages
I actually tried that, and it turned out it is not that simple; I would also have to make changes in OrcPageSourceFactory to verifyFileHasColumnNames and some more places. There would be a lot of changes, and this is not really related.
Nit: I'd put this at the top where HdfsEnvironment was
This is kind of a half-way migration, right? We still need the hdfsEnvironment in some places. Let's put the two fields together in the field list.
Force-pushed c15232f to 60d4f1b
Force-pushed 60d4f1b to 6e815fa
TrinoFileSystem fileSystem = ...
you can inline it: String parentPath = setSchemeToFileIfAbsent(parent.toString())
Force-pushed 6e815fa to 1782f52
CI: #12818
CI: #14814
```diff
@@ -123,16 +124,23 @@ public class OrcPageSourceFactory
 private static final Pattern DEFAULT_HIVE_COLUMN_NAME_PATTERN = Pattern.compile("_col\\d+");

 private final OrcReaderOptions orcReaderOptions;
 private final HdfsEnvironment hdfsEnvironment;
```
you can remove hdfsEnvironment now. It seems unused
It is still used; it is passed to io.trino.plugin.hive.orc.OrcPageSourceFactory#createOrcPageSource and then used to create OrcDeletedRows.
We could try to simplify it further
```diff
 boolean originalFilesPresent = acidInfo.isPresent() && !acidInfo.get().getOriginalFiles().isEmpty();
 try {
     FileSystem fileSystem = hdfsEnvironment.getFileSystem(identity, path, configuration);
```
HdfsEnvironment hdfsEnvironment is unused now
Description
This is a prerequisite to be able to track FileSystem memory utilization while writing ORC data - #14023
Non-technical explanation
Allows better tracking of memory consumed internally.
Release notes
(x) This is not user-visible or docs only and no release notes are required.
( ) Release notes are required, please propose a release note for me.
( ) Release notes are required, with the following suggested text: