Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

enable users to add support for LIstingTable / object_store table formats of different types #8345

Closed
tychoish opened this issue Nov 28, 2023 · 8 comments · Fixed by #11060
Closed
Labels
enhancement New feature or request

Comments

@tychoish
Copy link

Is your feature request related to a problem or challenge?

I've been exploring adding new file formats and types to our system (@GlareDB), and have run into the problem where the FileFormat trait depends on the FileType enum, which means it's impossible to implement entirely new filetypes except from within the package.

This is sort of gnarly to unwind as I've dug in, and I don't have a particular solution that I favor.

Describe the solution you'd like

I think collapsing FileType into FileFormat is maybe my favorite solution conceptually, but I think (more narrowly) removing the references to FileType in the FileFormat (as in my branch) would be good.

Describe alternatives you've considered

There are a few ways to resolve this:

  • make the FileType an implementable trait rather than an enum.
  • remove dependency between the interfaces I tried to do this here, which I think is a good start, but I think the issue runs even deeper. Having said that, the only casualty here would be the listing code, which is maybe something that can be worked around.
  • move all properties of "file_type" into the FileFormat (which kind of makes sense? I'm not sure that these are actually meaningfully separate abstractions.)

Additional context

No response

@tychoish tychoish added the enhancement New feature or request label Nov 28, 2023
@alamb
Copy link
Contributor

alamb commented Dec 1, 2023

Thank you for bringing this up @tychoish -- I agree the current state of FileType/FileFormat is not ideal as it means, as you have pointed out, that it is not possible to use ListingTable with user defined file formats.

make the FileType an implementable trait rather than an enum.

I would very much like to see this and I think it would result in the code being significantly more modular as well.

That being said as you have discovered it is a pretty major undertaking, but I would be interested in helping make it happen / reviewing PRs and designs (though I don't think I have time at the moment to contribute code)

perhaps @devinjdangelo or @tustvold have some ideas to add as well

@tychoish
Copy link
Author

tychoish commented Dec 1, 2023

I guess my first question is "are FileType and FileFormat actually meaningfully distinct?" Do we imagine that there are cases where you'd want to have different FileType values but the same FileFormat? Seem's unlikely to me but I don't know.

Second very high level question: is there anything to be done in ListingTable separetly from either "FileType as a trait" or flattening FileType into FileFormat.

Secondly, (and this is somewhat orthogonal) but for the project that brought me to this point; I'm leaning towards just to implementing StreamingTable (rather than FileFormat or a TableProvier from the ground up, something approximating it?) For things like JSON and CSV this seems maybe more ideal? Though of course, I'm new and don't know all of the innards. Obviously this is very format specific.

Would love to hear additional

@devinjdangelo
Copy link
Contributor

I guess my first question is "are FileType and FileFormat actually meaningfully distinct?" Do we imagine that there are cases where you'd want to have different FileType values but the same FileFormat? Seem's unlikely to me but I don't know.

I don't see any fundamental reason why there should be multiple file format related abstractions. I think it is mostly a product of development on different parts of the codebase related to files progressing at the same time. I know I was guilty at one point of creating a third file related enum, which was consolidated into FileType later.

I took a brief look at where we are matching on the FileType enum, and I don't see any reason why those sections could not be refactored as method calls on a trait object. It would probably make the most sense to add that functionality to FileFormat and consolidate all file related abstractions under one trait.

Second very high level question: is there anything to be done in ListingTable separetly from either "FileType as a trait" or flattening FileType into FileFormat.

I'm not sure, but I think it makes a lot of sense to handle file type specific related operations using a single trait object, similar to how we currently do with ObjectStore.

Secondly, (and this is somewhat orthogonal) but for the project that brought me to this point; I'm leaning towards just to implementing StreamingTable (rather than FileFormat or a TableProvier from the ground up, something approximating it?) For things like JSON and CSV this seems maybe more ideal? Though of course, I'm new and don't know all of the innards. Obviously this is very format specific.

I am not very familiar with the new StreamingTable, but it appears to be a struct rather than a trait. StreamingTable itself implements TableProvider, so I'm not quite sure how it helps here. @tustvold may know better than I on this though.

@tustvold
Copy link
Contributor

tustvold commented Dec 2, 2023

https://docs.rs/datafusion/latest/datafusion/datasource/streaming/struct.StreamingTable.html

Is simply a way to reduce the amount of boilerplate when implementing a TableProvider. It makes a lot of sense if taking this route.

As for the distinction between the enumeration and trait, serialization might be part of the reason behind this design. It is certainly part of the rationale behind using ListingTableUrl to provide a serializable version of ObjectStore

@alamb
Copy link
Contributor

alamb commented Dec 3, 2023

BTW I really love the idea of a trait based implementation of file formats as it would permit (eventually) pulling the format support (and its dependecies) into separate crates and would make the codebase more manageable.

I took a brief look at where we are matching on the FileType enum, and I don't see any reason why those sections could not be refactored as method calls on a trait object. It would probably make the most sense to add that functionality to FileFormat and consolidate all file related abstractions under one trait.

I agree this sounds like a good idea. If we want to head down this route, I would suggest making a Proof of Concept (POC) PR that adds the trait and shows how it might work for one of the formats (you could temporarily add a new FileType::Dynamic(Arc<dyn FileTypeTrait>) variant)

The POC would help figure out what areas might need additional work / that we haven't figured out yet.

As for the distinction between the enumeration and trait, serialization might be part of the reason behind this design. It is certainly part of the rationale behind using ListingTableUrl to provide a serializable version of ObjectStore

@tustvold are you referring to

  1. plan serialization (e.g. datafusion-proto)
  2. data serialization (e.g. RecordBatches --> Parquet bytes)?
  3. Something else ?

@tustvold
Copy link
Contributor

tustvold commented Dec 3, 2023

I was referring to plan serialization, it certainly isn't the only way to achieve this, but it might be part of the motivation for the current design

@alamb alamb changed the title enable users to add support for object_store table formats of different types enable users to add support for LIstingTable / object_store table formats of different types Dec 26, 2023
@alamb
Copy link
Contributor

alamb commented Dec 26, 2023

This feature request came up again in #8637

I filed #8657 with a summary of the ideas in this ticket's threads

@tychoish
Copy link
Author

Interesting, and I'm glad this is becoming more relevant!

For my part, I managed to use the StreamingTable implementation to add (to GlareDB) support for reading/writing BSON without needing any of these changes,1 and I'm quite pleased with the results. It isn't a 100% replacement, through the gaps sort of feel like (to me!) cases where the FileFormat/FileType infrastructure is a bit overbuilt, though this might just be application specific concerns

  • the write path goes through a slightly different code path that's disconnected (though frankly, given atomicity requirements and having multiple files feeding a single table, it feels weirder to combine both of these things,
  • and schema inference is a bit fiddly, though I think schema inference is something that should itself be pluggable and maybe separately,

Footnotes

  1. See here for the main code; there's some other plumbing and wiring to connect it in, but it both works pretty well.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
4 participants