
FlatGeoBuf driver loads the whole feature offset index / inefficient for network access #1966

Closed
rouault opened this issue Oct 29, 2019 · 16 comments · Fixed by #1973


rouault commented Oct 29, 2019

CC @bjornharrtell
See https://lists.osgeo.org/pipermail/gdal-dev/2019-October/050946.html


mloskot commented Oct 29, 2019

Seems like this #1964 added recently by @bjornharrtell is related

@mloskot mloskot changed the title FlatGeoBuf driver loads the whole feature offet index / inefficient for network access FlatGeoBuf driver loads the whole feature offset index / inefficient for network access Oct 29, 2019

rouault commented Oct 29, 2019

> Seems like this #1964 added recently by @bjornharrtell is related

Not really :-) #1964 is about the writing side, this one is about the reading side

@bjornharrtell (Contributor)

Ah yes, I've been semi-aware of this shortcoming.

@bjornharrtell (Contributor)

I want to take a stab at this but I'm unsure about the best strategy. I would try an adaptive block cache for the I/O here, but I'm unsure whether I should implement that directly in the driver or attempt it at a higher abstraction level. Do you have any guidance @rouault?
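For illustration, the "adaptive block cache" idea mentioned above could look roughly like this minimal sketch (a hypothetical `BlockCachedReader`, not GDAL's actual cache): reads are served from fixed-size blocks kept in an LRU map, so repeated small reads in the same region of the file hit the cache instead of the underlying I/O.

```python
# Minimal sketch of a block cache for file reads (hypothetical, for
# illustration only): fixed-size blocks are cached in an LRU map.
from collections import OrderedDict


class BlockCachedReader:
    def __init__(self, fileobj, block_size=16384, max_blocks=64):
        self.f = fileobj
        self.block_size = block_size
        self.max_blocks = max_blocks
        self.blocks = OrderedDict()  # block index -> bytes

    def _block(self, idx):
        if idx in self.blocks:
            self.blocks.move_to_end(idx)      # mark as most recently used
            return self.blocks[idx]
        self.f.seek(idx * self.block_size)
        data = self.f.read(self.block_size)
        self.blocks[idx] = data
        if len(self.blocks) > self.max_blocks:
            self.blocks.popitem(last=False)   # evict least recently used
        return data

    def read_at(self, offset, size):
        # Assemble the requested range from one or more cached blocks.
        out = bytearray()
        while size > 0:
            idx, off = divmod(offset, self.block_size)
            chunk = self._block(idx)[off:off + size]
            if not chunk:
                break  # past end of file
            out += chunk
            offset += len(chunk)
            size -= len(chunk)
        return bytes(out)
```

A real implementation in the driver would sit on top of the VSI file API rather than a Python file object, but the caching logic is the same.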


rouault commented Nov 1, 2019

I would just suggest implementing the simplest strategy: issue a VSIFRead() to get the offset of just the feature you need, at the time you need it. For local I/O, the operating system will cache things, and /vsicurl/ and related also have some caching of their own.
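The suggestion above can be sketched as follows (a simplified, hypothetical layout: a contiguous little-endian uint64 offset index starting at `index_base`; the real FlatGeobuf index layout differs):

```python
# Sketch: read a single feature offset from an on-disk index on demand,
# instead of loading the whole index up front. Layout is hypothetical:
# little-endian uint64 offsets starting at index_base.
import struct


def read_feature_offset(f, index_base, feature_id):
    """Seek into the index and read only the one 8-byte offset needed."""
    f.seek(index_base + 8 * feature_id)
    (offset,) = struct.unpack("<Q", f.read(8))
    return offset
```

Each feature request then costs one small seek-and-read into the index, which the OS page cache (or the /vsicurl/ cache) can absorb.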

@bjornharrtell (Contributor)

Mmm yes, that will be very simple and will probably work well for most cases. 👍

@bjornharrtell (Contributor)

@rouault it seems accessing via vsizip performs very badly on large files (uncompressed >1 GB), and I suspect it may not have been so bad before this PR. I've looked through cpl_vsil_gzip but fail to find any limit on the snapshot cache that would explain it... I'll investigate further when time permits, but maybe you have an idea what I should look for? (Accessing comparable shapefile data via vsizip performs well.)


rouault commented Nov 6, 2019

Ah, it might be that the many seeks you need to do hit the snapshot mechanism hard. But random access in a zip file is going to be slow anyway...

@bjornharrtell (Contributor)

Well, a spatial filter query on a shapefile via vsizip actually performs rather well, even over vsicurl. (!)

@bjornharrtell (Contributor)

My theory is that because a shapefile is split across several files, you get several caches and less seeking per file. But the cache could/should be smarter than that. I will try to understand it and improve it when I can. :)

@bjornharrtell (Contributor)

But I slowly realize that deflating a single stream requires sequential reads, unless some kind of extra/custom indexing is done, e.g. https://github.com/pauldmccarthy/indexed_gzip#overview.
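The point above can be demonstrated with a small sketch: with a plain deflate/zlib stream, reaching an arbitrary uncompressed offset means inflating everything before it and throwing the prefix away (which is exactly the cost an extra index, like the one indexed_gzip builds, is designed to avoid). The function name is hypothetical.

```python
# Illustration: random access into a single deflate stream requires
# sequential decompression from the start of the stream.
import zlib


def read_at_uncompressed_offset(compressed, offset, size):
    """Inflate sequentially, discarding data until the target offset."""
    d = zlib.decompressobj()
    out = bytearray()
    pos = 0  # current uncompressed position
    for i in range(0, len(compressed), 4096):
        chunk = d.decompress(compressed[i:i + 4096])
        end = pos + len(chunk)
        if end > offset:
            # Keep only the part of this chunk past the target offset.
            out += chunk[max(0, offset - pos):]
            if len(out) >= size:
                return bytes(out[:size])
        pos = end
    return bytes(out[:size])
```

Every byte before `offset` still has to be inflated, so seeking backwards (or far forwards) in a compressed stream is O(offset), not O(1).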

@bjornharrtell (Contributor)

OK, some more thought on this. I now believe that simply reading out the feature offset indexes in advance (before getting any feature data) and sorting them (to avoid backward seeks when reading out features) will be better for both vsicurl and vsizip, and could possibly compete with the performance I see with shp.
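The strategy described above can be sketched like this (hypothetical layout again: a uint64 offset index at `index_base`, and a caller-supplied `read_feature` callback standing in for the driver's feature decoder):

```python
# Sketch: read the needed offsets up front, then visit the feature data
# in increasing file order so all seeks go forward, which suits
# sequential-friendly backends such as /vsicurl/ and /vsizip/.
import struct


def read_features_sorted(f, index_base, feature_ids, read_feature):
    # 1. Read the offset of every requested feature from the index.
    offsets = []
    for fid in feature_ids:
        f.seek(index_base + 8 * fid)
        (off,) = struct.unpack("<Q", f.read(8))
        offsets.append((off, fid))
    # 2. Visit the data section in increasing offset order (forward seeks only).
    results = {}
    for off, fid in sorted(offsets):
        f.seek(off)
        results[fid] = read_feature(f)
    # 3. Return features in the originally requested order.
    return [results[fid] for fid in feature_ids]
```

The sort costs almost nothing compared to what a single backward seek costs over HTTP or inside a gzip stream.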

@bjornharrtell (Contributor)

Looks like there are some compression methods out there that split the stream into indexed blocks for random access. The most prominent I've found is https://lz4.github.io/lz4/, which is also very fast. It would be fun to eventually support that all the way.
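The block-splitting idea can be illustrated with a toy format (this is an illustrative layout, not the actual lz4 frame format, and zlib stands in for lz4 here): each fixed-size block is compressed independently and an index records where each compressed block starts, so any uncompressed offset is reachable by inflating a single block.

```python
# Toy block-compressed format with an offset index for random access.
# zlib stands in for lz4/zstd; the layout is illustrative only.
import zlib

BLOCK = 4096  # uncompressed block size


def compress_blocks(data):
    blocks, index, pos = [], [], 0
    for i in range(0, len(data), BLOCK):
        c = zlib.compress(data[i:i + BLOCK])
        index.append(pos)       # compressed offset of this block
        blocks.append(c)
        pos += len(c)
    return b"".join(blocks), index


def read_random(compressed, index, offset, size):
    # Only the blocks covering [offset, offset + size) are inflated.
    out = bytearray()
    while size > 0:
        bidx, boff = divmod(offset, BLOCK)
        start = index[bidx]
        end = index[bidx + 1] if bidx + 1 < len(index) else len(compressed)
        chunk = zlib.decompress(compressed[start:end])[boff:boff + size]
        if not chunk:
            break
        out += chunk
        offset += len(chunk)
        size -= len(chunk)
    return bytes(out)
```

This trades a little compression ratio (each block is compressed without cross-block context) for O(1) random access, which is the property a single deflate stream lacks.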


rouault commented Nov 8, 2019

We already have the ZSTD dependency, which has an excellent compression ratio / speed tradeoff

@bjornharrtell (Contributor)

Ah yes, zstd is quite nice too, even if lz4 wins by far on decompression speed. But I fail to find any info on whether zstd has out-of-the-box stream splitting, which is what I found interesting about lz4 (however, I do not know how mature it is). It looks like this is a work in progress for zstd, see facebook/zstd#395.

@bjornharrtell (Contributor)

And the somewhat related project https://blosc.org/ is also interesting.

rouault pushed a commit that referenced this issue Nov 10, 2019
A follow-up optimization after the conclusions in #1966.

Tested locally to significantly improve read performance when reading through vsigzip.