
FlatGeoBuf driver loads the whole feature offset index / inefficient for network access #1966

Closed
rouault opened this issue Oct 29, 2019 · 16 comments · Fixed by #1973


rouault commented Oct 29, 2019

CC @bjornharrtell
See https://lists.osgeo.org/pipermail/gdal-dev/2019-October/050946.html


mloskot commented Oct 29, 2019

Seems like this #1964 added recently by @bjornharrtell is related

@mloskot mloskot changed the title FlatGeoBuf driver loads the whole feature offet index / inefficient for network access FlatGeoBuf driver loads the whole feature offset index / inefficient for network access Oct 29, 2019

rouault commented Oct 29, 2019

> Seems like this #1964 added recently by @bjornharrtell is related

Not really :-) #1964 is about the writing side, this one is about the reading side

@bjornharrtell (Contributor)

Ah yes, I've been semi-aware of this shortcoming.

@bjornharrtell (Contributor)

I want to take a stab at this but I'm unsure about the best strategy. I would try an adaptive block cache for the I/O here, but I'm unsure whether I should implement that directly in the driver or attempt it at a higher abstraction level. Do you have any guidance @rouault?
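For illustration, the "adaptive block cache" idea mentioned above could look roughly like this minimal sketch (a hypothetical `BlockCachedReader`, not GDAL's actual cache): reads are served from fixed-size blocks kept in an LRU map, so repeated small reads in the same region of the file hit the cache instead of the underlying I/O.

```python
# Minimal sketch of a block cache for file reads (hypothetical, for
# illustration only): fixed-size blocks are cached in an LRU map.
from collections import OrderedDict


class BlockCachedReader:
    def __init__(self, fileobj, block_size=16384, max_blocks=64):
        self.f = fileobj
        self.block_size = block_size
        self.max_blocks = max_blocks
        self.blocks = OrderedDict()  # block index -> bytes

    def _block(self, idx):
        if idx in self.blocks:
            self.blocks.move_to_end(idx)      # mark as most recently used
            return self.blocks[idx]
        self.f.seek(idx * self.block_size)
        data = self.f.read(self.block_size)
        self.blocks[idx] = data
        if len(self.blocks) > self.max_blocks:
            self.blocks.popitem(last=False)   # evict least recently used
        return data

    def read_at(self, offset, size):
        # Assemble the requested range from one or more cached blocks.
        out = bytearray()
        while size > 0:
            idx, off = divmod(offset, self.block_size)
            chunk = self._block(idx)[off:off + size]
            if not chunk:
                break  # past end of file
            out += chunk
            offset += len(chunk)
            size -= len(chunk)
        return bytes(out)
```

A real implementation in the driver would sit on top of the VSI file API rather than a Python file object, but the caching logic is the same.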


rouault commented Nov 1, 2019

I would just suggest implementing the simplest strategy: issue a VSIFRead() to get the offset of just the feature you need, at the time you need it. For local I/O, the operating system will cache things, and /vsicurl/ and related also have some caching of their own.
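The suggestion above can be sketched as follows (a simplified, hypothetical layout: a contiguous little-endian uint64 offset index starting at `index_base`; the real FlatGeobuf index layout differs):

```python
# Sketch: read a single feature offset from an on-disk index on demand,
# instead of loading the whole index up front. Layout is hypothetical:
# little-endian uint64 offsets starting at index_base.
import struct


def read_feature_offset(f, index_base, feature_id):
    """Seek into the index and read only the one 8-byte offset needed."""
    f.seek(index_base + 8 * feature_id)
    (offset,) = struct.unpack("<Q", f.read(8))
    return offset
```

Each feature request then costs one small seek-and-read into the index, which the OS page cache (or the /vsicurl/ cache) can absorb.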

@bjornharrtell (Contributor)

Mmm yes, that will be very simple and will probably work well for most cases. 👍

@bjornharrtell (Contributor)

@rouault it seems accessing via vsizip performs very badly on large files (uncompressed >1 GB), and I suspect it may not have been so bad before this PR. I've looked through cpl_vsil_gzip but fail to find any limit on the snapshot cache that would explain it... I'll investigate further when time permits, but maybe you have an idea what I should look for? (Accessing comparable shapefile data via vsizip performs well.)


rouault commented Nov 6, 2019

Ah, it might be that the many seeks you need to do hit the snapshot mechanism hard. But random access in a zip file is going to be slow anyway...

@bjornharrtell (Contributor)

Well, a spatial filter query on a shapefile via vsizip actually performs rather well, even over vsicurl. (!)

@bjornharrtell (Contributor)

My theory is that because a shapefile is split across several files, you get several caches and less seeking per file. But the cache could/should be smarter than that. I will try to understand it and improve it when I can. :)

@bjornharrtell (Contributor)

But I slowly realize that deflating a single stream requires sequential reads, unless some kind of extra/custom indexing is done, e.g. https://github.com/pauldmccarthy/indexed_gzip#overview.
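The point above can be demonstrated with a small sketch: with a plain deflate/zlib stream, reaching an arbitrary uncompressed offset means inflating everything before it and throwing the prefix away (which is exactly the cost an extra index, like the one indexed_gzip builds, is designed to avoid). The function name is hypothetical.

```python
# Illustration: random access into a single deflate stream requires
# sequential decompression from the start of the stream.
import zlib


def read_at_uncompressed_offset(compressed, offset, size):
    """Inflate sequentially, discarding data until the target offset."""
    d = zlib.decompressobj()
    out = bytearray()
    pos = 0  # current uncompressed position
    for i in range(0, len(compressed), 4096):
        chunk = d.decompress(compressed[i:i + 4096])
        end = pos + len(chunk)
        if end > offset:
            # Keep only the part of this chunk past the target offset.
            out += chunk[max(0, offset - pos):]
            if len(out) >= size:
                return bytes(out[:size])
        pos = end
    return bytes(out[:size])
```

Every byte before `offset` still has to be inflated, so seeking backwards (or far forwards) in a compressed stream is O(offset), not O(1).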

@bjornharrtell (Contributor)

OK, some more thought on this. I now believe that simply reading out the feature offset indexes in advance (before getting any feature data) and sorting them (to avoid backward seeks when reading out features) will be better for both vsicurl and vsizip, and could possibly compete with the performance I see with shp.
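The strategy described above can be sketched like this (hypothetical layout again: a uint64 offset index at `index_base`, and a caller-supplied `read_feature` callback standing in for the driver's feature decoder):

```python
# Sketch: read the needed offsets up front, then visit the feature data
# in increasing file order so all seeks go forward, which suits
# sequential-friendly backends such as /vsicurl/ and /vsizip/.
import struct


def read_features_sorted(f, index_base, feature_ids, read_feature):
    # 1. Read the offset of every requested feature from the index.
    offsets = []
    for fid in feature_ids:
        f.seek(index_base + 8 * fid)
        (off,) = struct.unpack("<Q", f.read(8))
        offsets.append((off, fid))
    # 2. Visit the data section in increasing offset order (forward seeks only).
    results = {}
    for off, fid in sorted(offsets):
        f.seek(off)
        results[fid] = read_feature(f)
    # 3. Return features in the originally requested order.
    return [results[fid] for fid in feature_ids]
```

The sort costs almost nothing compared to what a single backward seek costs over HTTP or inside a gzip stream.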

@bjornharrtell (Contributor)

Looks like there are some compression methods out there that split the stream into indexed blocks for random access. The most prominent I've found is https://lz4.github.io/lz4/, which is also very fast. It would be fun to eventually support that all the way.
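The block-splitting idea can be illustrated with a toy format (this is an illustrative layout, not the actual lz4 frame format, and zlib stands in for lz4 here): each fixed-size block is compressed independently and an index records where each compressed block starts, so any uncompressed offset is reachable by inflating a single block.

```python
# Toy block-compressed format with an offset index for random access.
# zlib stands in for lz4/zstd; the layout is illustrative only.
import zlib

BLOCK = 4096  # uncompressed block size


def compress_blocks(data):
    blocks, index, pos = [], [], 0
    for i in range(0, len(data), BLOCK):
        c = zlib.compress(data[i:i + BLOCK])
        index.append(pos)       # compressed offset of this block
        blocks.append(c)
        pos += len(c)
    return b"".join(blocks), index


def read_random(compressed, index, offset, size):
    # Only the blocks covering [offset, offset + size) are inflated.
    out = bytearray()
    while size > 0:
        bidx, boff = divmod(offset, BLOCK)
        start = index[bidx]
        end = index[bidx + 1] if bidx + 1 < len(index) else len(compressed)
        chunk = zlib.decompress(compressed[start:end])[boff:boff + size]
        if not chunk:
            break
        out += chunk
        offset += len(chunk)
        size -= len(chunk)
    return bytes(out)
```

This trades a little compression ratio (each block is compressed without cross-block context) for O(1) random access, which is the property a single deflate stream lacks.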


rouault commented Nov 8, 2019

We already have the ZSTD dependency, which has an excellent compression ratio / speed tradeoff

@bjornharrtell (Contributor)

Ah yes, zstd is quite nice too, even if lz4 wins by far on decompression speed. But I fail to find any info on whether zstd has out-of-the-box stream splitting, which is what I found interesting about lz4 (however, I do not know how mature it is). It looks like this is a work in progress for zstd, see facebook/zstd#395.

@bjornharrtell (Contributor)

And the somewhat related project https://blosc.org/ is also interesting.

rouault pushed a commit that referenced this issue Nov 10, 2019
A follow-up optimization after the conclusions in #1966.

Tested locally to significantly improve read performance when reading through vsigzip.