Trigger closing of a time partition after a certain event time has been observed. #103

Open
krisskross opened this issue Aug 18, 2016 · 4 comments

Comments

@krisskross

krisskross commented Aug 18, 2016

Hi

It would be nice to be able to trigger a close of a file when a certain event time has been observed, so that a given hour can be considered finished instead of waiting for flush.size or rotate.interval.ms to trigger the close. This would make Parquet files larger and finish them earlier. Win-win.

Cheers,
-Kristoffer

@ewencp
Contributor

ewencp commented Aug 29, 2016

@krisskross This seems like basically what the Partitioner classes and the partitioner.class config were designed for. They don't close the files explicitly, but they allow you to divide data across files, and once you hit rotation, the file will be closed anyway.

Is there a specific case you're thinking of where you need to close the file sooner? I think one common example might be a time-based partitioner where timestamps are guaranteed to be monotonically increasing, in which case you know when it is safe to rotate a file (although note that the requirement that it is monotonically increasing is harder to guarantee in practice than you may think). Is there some other example you're thinking of that isn't addressed by the existing approach, or are you just trying to reduce the latency of delivery for the final file?

@krisskross
Author

The use case I'm after is closing files earlier so that batch jobs can process them as early as possible. It might also be possible to get larger files (ideally one per partition), which have the benefit of more efficient compression and faster batch processing.

We use a modified version of HourlyPartitioner that looks at the event time of the log, which is monotonically increasing, so we can determine (with some exceptions) when no more events will arrive for a given partition/hour. But there is no mechanism to trigger the close and move the file to the final directory.
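A minimal sketch of that check, assuming event time is available as epoch milliseconds. `HourWatermark` and `allowedLatenessMs` are hypothetical names for illustration, not part of the connector; the lateness tolerance is one possible way to cover the "exceptions" where event time is not strictly monotonic.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

/**
 * Hypothetical helper (not part of the connector): tracks which hourly
 * partitions are still open and reports the ones that can be considered
 * finished once a late enough event time has been observed.
 */
class HourWatermark {
    private static final long HOUR_MS = 3_600_000L;

    // Hours (as epoch-hour indices) that have received data and are not yet closed.
    private final TreeSet<Long> openHours = new TreeSet<>();
    // Assumed tolerance for late events; 0 means strictly monotonic event time.
    private final long allowedLatenessMs;
    // Highest event time seen so far, so an out-of-order record cannot move the watermark back.
    private long maxEventTimeMs = Long.MIN_VALUE;

    HourWatermark(long allowedLatenessMs) {
        this.allowedLatenessMs = allowedLatenessMs;
    }

    /**
     * Records an event and returns the hours that are now safe to close,
     * oldest first. An hour is safe once the highest observed event time,
     * minus the allowed lateness, has reached the end of that hour.
     */
    List<Long> observe(long eventTimeMs) {
        maxEventTimeMs = Math.max(maxEventTimeMs, eventTimeMs);
        openHours.add(Math.floorDiv(eventTimeMs, HOUR_MS));
        long watermarkHour = Math.floorDiv(maxEventTimeMs - allowedLatenessMs, HOUR_MS);
        // headSet is exclusive: every open hour strictly before the watermark hour.
        List<Long> closable = new ArrayList<>(openHours.headSet(watermarkHour));
        openHours.removeAll(closable);
        return closable;
    }
}
```

With `allowedLatenessMs` set to 0, the first record of hour N reports hour N-1 as closable. What is missing is a hook that acts on the returned list.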

@ewencp
Contributor

ewencp commented Aug 30, 2016

Sure, so maybe a way to implement this would be to add a method to the Partitioner interface to allow it to indicate what files are now safe to close, and have the connector invoke it after each record? I think it'll make tracking the necessary state a bit more complicated (we normally commit all outstanding data, but in this case we'd change several states to handle only a subset of the outstanding data), but probably won't make things much more complicated.
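To make the state change concrete, here is a rough sketch of committing only a subset of the outstanding data. `OpenFiles` and its methods are hypothetical stand-ins rather than the connector's actual writer classes, and the records are plain strings to keep the example self-contained.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical stand-in for the connector's per-partition writer state. */
class OpenFiles {
    // Buffered records per encoded partition, e.g. "2016/08/18/09".
    private final Map<String, List<String>> buffers = new LinkedHashMap<>();
    // Encoded partitions whose files have been closed, in commit order.
    private final List<String> committed = new ArrayList<>();

    void write(String encodedPartition, String record) {
        buffers.computeIfAbsent(encodedPartition, k -> new ArrayList<>()).add(record);
    }

    /** Current behaviour in this sketch: rotation commits everything outstanding. */
    void commitAll() {
        // Copy the keys first so the map is not modified while iterating over it.
        commit(new ArrayList<>(buffers.keySet()));
    }

    /** Proposed behaviour: commit only the partitions reported as safe to close. */
    void commit(Iterable<String> safeToClose) {
        for (String partition : safeToClose) {
            if (buffers.remove(partition) != null) {
                committed.add(partition);
            }
        }
    }

    List<String> committed() {
        return committed;
    }

    Set<String> open() {
        return buffers.keySet();
    }
}
```

The connector would call `commit(...)` with whatever the partitioner reports after each record, and keep `commitAll()` for normal rotation. Offset bookkeeping for the partitions that stay open is left out of this sketch.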

@krisskross
Author

Yes, that sounds like a reasonable way forward. Maybe as a default method on the interface, so as not to break backward compatibility?
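A sketch of what the default-method approach could look like, assuming Java 8. `PartitionerSketch` is a simplified stand-in and not the real Partitioner signature; `partitionsSafeToClose` is a hypothetical method name.

```java
import java.util.Collections;
import java.util.Set;

/** Simplified stand-in for the Partitioner interface; the real one has other methods. */
interface PartitionerSketch {
    String encodePartition(long eventTimeMs);

    // Hypothetical new method. Returning an empty set by default means
    // existing implementations compile and behave exactly as before.
    default Set<String> partitionsSafeToClose() {
        return Collections.<String>emptySet();
    }
}

/** An existing implementation that knows nothing about the new method. */
class LegacyHourly implements PartitionerSketch {
    @Override
    public String encodePartition(long eventTimeMs) {
        return "hour=" + Math.floorDiv(eventTimeMs, 3_600_000L);
    }
}

/** An event-time-aware implementation that opts in by overriding the default. */
class EventTimeHourly extends LegacyHourly {
    private String current;
    private Set<String> finished = Collections.<String>emptySet();

    @Override
    public String encodePartition(long eventTimeMs) {
        String encoded = super.encodePartition(eventTimeMs);
        // Assumes monotonically increasing event time: a new hour ends the previous one.
        if (current != null && !current.equals(encoded)) {
            finished = Collections.singleton(current);
        } else {
            finished = Collections.<String>emptySet();
        }
        current = encoded;
        return encoded;
    }

    @Override
    public Set<String> partitionsSafeToClose() {
        return finished;
    }
}
```

`LegacyHourly` never reports anything as closable, so it keeps relying on flush.size and rotate.interval.ms, while `EventTimeHourly` reports the previous hour as soon as the first record of the next hour is encoded.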
