It all started so simply, with datasources internal to a geoprocessing project. You just needed to maintain some parameters to import local internal datasources (meaning reproject and transform them into one or more formats: cog, fgb, geojson) before publishing, plus timestamps for when the datasource was first and last imported (which have never been used effectively).
You precalc some metrics for the whole dataset (overall sum, area, cell count).
Then the datasource gets published to the project's S3 datasets bucket.
This import/precalc/publish flow was tightly coupled initially. In fact, precalc stats were built into the datasource objects as `keyStats`.
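For concreteness, here is a minimal sketch of what an internal datasource entry looked like at this stage, with import parameters, timestamps, and baked-in stats all on one object. The property names and values are illustrative, not the exact schema:

```ts
// Illustrative shape of an early internal datasource entry (not the exact schema).
interface InternalDatasourceSketch {
  datasourceId: string;
  geo_type: "vector" | "raster";
  formats: ("fgb" | "json" | "tif")[]; // published cloud-optimized formats
  src: string; // path to the local source file to import
  created: string; // ISO timestamp of first import (rarely used)
  lastUpdated: string; // ISO timestamp of last import (rarely used)
  keyStats?: Record<string, number>; // precalc results baked into the datasource
}

const eelgrass: InternalDatasourceSketch = {
  datasourceId: "eelgrass",
  geo_type: "vector",
  formats: ["fgb"],
  src: "data/src/eelgrass.shp",
  created: "2022-01-01T00:00:00Z",
  lastUpdated: "2022-06-01T00:00:00Z",
  keyStats: { count: 120, area: 34500 },
};
```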
Then things got a little more complicated:
Support for external datasources was loosely added on in the first version of datasource support. It's a simpler datasource object with a url instead of a file path, and none of the import configuration. Think of it as a datasource that's already published, so you don't need as much.
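By contrast, an external datasource entry carries little more than a url. Again an illustrative sketch, with a placeholder id and URL:

```ts
// Illustrative external datasource: already published somewhere else,
// so no import configuration, just where to fetch it from.
const globalEez = {
  datasourceId: "global-eez", // placeholder id
  geo_type: "vector" as const,
  formats: ["fgb" as const],
  url: "https://example.com/global-eez.fgb", // placeholder URL
};
```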
You then have one project's internal datasources being used as another project's external datasources, like the global datasources project, which has recently become the central repository of global cloud-optimized datasources for all SeaSketch projects.
Subdivided datasources still have their place for efficient loading of vector data. But `bundle-features`, for producing/publishing subdivided datasources to their own S3 bucket, has fallen into disrepair (slonik upgrades not working as expected). The `VectorDatasource` client, which reads bundles and reconstructs features using a union implemented in JS, still works well. It has a caching layer for reusing fetched features that the flatgeobuf client doesn't have. `subdivided` is currently its own special type of external datasource.
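A rough usage sketch of that client; the import name, method signature, and endpoint URL here are assumptions for illustration rather than verified API:

```ts
import { VectorDataSource } from "@seasketch/geoprocessing";

// Subdivided datasources are fetched by bounding box; the client reconstructs
// whole features by unioning the pieces on a shared id property, and caches
// fetched bundles so repeated requests over nearby areas don't re-download them.
const land = new VectorDataSource("https://example.cloudfront.net/"); // placeholder URL
const bbox: [number, number, number, number] = [-67.5, 44.2, -66.8, 45.0];
const landPolygons = await land.fetchUnion(bbox, "gid");
```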
So now you have one `datasources.json` file containing internal and external datasources, raster and vector, cog/fgb/subdivided. And you have type interfaces to tell them apart, plus helper functions for getting and filtering the different datasource types and narrowing their return types accordingly.
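Those interfaces and helpers amount to a discriminated union plus type guards, roughly like this simplified sketch (the real schemas have more fields):

```ts
// Simplified sketch of the datasource union and a narrowing helper.
type BaseDatasource = {
  datasourceId: string;
  geo_type: "vector" | "raster";
  formats: ("fgb" | "json" | "tif" | "subdivided")[];
};

type InternalDatasource = BaseDatasource & { src: string };
type ExternalDatasource = BaseDatasource & { url: string };
type Datasource = InternalDatasource | ExternalDatasource;

/** Narrow to external by the presence of a url */
function isExternalDatasource(ds: Datasource): ds is ExternalDatasource {
  return "url" in ds;
}

function getExternalDatasources(datasources: Datasource[]): ExternalDatasource[] {
  return datasources.filter(isExternalDatasource);
}
```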
`ProjectClientBase.ts` smooths over a lot of this. It parses the `datasources.json` file using zod, ensuring each entry fits one of the 4 datasource subtypes, then offers getter methods for working with the datasources, handling much of the filtering and type narrowing. But it's getting complicated and verbose.
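The parsing step is essentially this pattern, sketched here with simplified stand-in schemas rather than the real ones:

```ts
import fs from "node:fs";
import { z } from "zod";

// Simplified stand-ins for the real datasource schemas.
const baseSchema = z.object({
  datasourceId: z.string(),
  geo_type: z.enum(["vector", "raster"]),
  formats: z.array(z.enum(["fgb", "json", "tif", "subdivided"])),
});
const internalSchema = baseSchema.extend({ src: z.string() });
const externalSchema = baseSchema.extend({ url: z.string().url() });
const datasourcesSchema = z.array(z.union([internalSchema, externalSchema]));

// Read and validate datasources.json up front, failing loudly on a malformed entry.
const raw = JSON.parse(fs.readFileSync("project/datasources.json", "utf-8"));
const datasources = datasourcesSchema.parse(raw);
```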
`import:data` was created to simplify importing local files as datasources and transforming them to be ready for publishing. The user doesn't have to touch `datasources.json`, until they do: maybe they need to tweak the object to edit property names or change the path, or maybe add an external datasource. Suddenly they need to know what they are trying to build. Thankfully zod will complain up front if an object in `datasources.json` is malformed, but still in a slightly cryptic way.
I wonder if there's a simpler and better way to accomplish all of this. Here are some aspects that may be a guiding light:
`getFeatures()` and `ProjectClientBase.getDatasourceUrl()` were developed so that you give them a datasource and they just handle it, fetching the resource whether internal/external, vector/raster, using the appropriate client for the format (fgb, cog). These are unifying functions that gp functions, preprocessors, precalc, etc. all now use. The developer no longer needs to figure out which cloud-optimized client to use.
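The value is in the dispatch: one call site, with the format-specific client chosen internally. A conceptual sketch of that idea (not the library implementation), reusing the `Datasource` union from the sketch above and hypothetical stand-in clients:

```ts
// Conceptual sketch of unified fetching; fetchFgb and fetchCog are
// hypothetical stand-ins for the format-specific cloud-optimized clients.
type BBox = [number, number, number, number];

const fetchFgb = async (url: string, bbox: BBox) =>
  console.log(`would stream flatgeobuf features from ${url} within bbox`, bbox);
const fetchCog = async (url: string) =>
  console.log(`would open cloud-optimized geotiff at ${url}`);

/** One entry point: callers hand over a datasource, the right client is picked for them */
async function getResource(ds: Datasource, url: string, bbox: BBox) {
  return ds.geo_type === "raster" ? fetchCog(url) : fetchFgb(url, bbox);
}
```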
Datasources are always eventually published and then accessed via a url. The difference is that we construct the url for an internal datasource, while you need to know the url for an external datasource.
One project's internal datasources are another project's external datasources. Maybe they should all just be datasources, with further configuration dictating what and where they are.
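Taken together, these observations suggest every datasource reduces to something that resolves to a url. A sketch of that resolution, reusing the union and guard from above, with a made-up bucket naming convention:

```ts
// Sketch: resolve any datasource to its published url.
// The bucket url pattern here is made up for illustration.
function getDatasourceUrlSketch(
  ds: Datasource,
  projectBucketUrl = "https://my-project-datasets.s3.amazonaws.com/"
): string {
  if (isExternalDatasource(ds)) return ds.url; // external: the url is given
  const format = ds.formats[0]; // e.g. fgb, json, tif
  return `${projectBucketUrl}${ds.datasourceId}.${format}`; // internal: constructed
}
```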
Path forward:
Maybe the datasources data model should get slimmer, with the configuration needed to import a datasource kept separate. This is something your average UI-driven product experience backed by a database would handle. These JSON files (datasources, metrics, geographies, basic, etc.) are just little document databases. One reason we do it this way is to keep things lightweight, allowing the report developer to push it aside and do their own thing or customize it when needed. It's all just reading/writing JSON from disk.
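One hypothetical shape for that split: a slim published-datasource record containing only what reports need, with import configuration held in a separate document (all names here are made up):

```ts
// Hypothetical split of the data model: what reports need vs. what import needs.
type PublishedDatasource = {
  datasourceId: string;
  geo_type: "vector" | "raster";
  format: "fgb" | "tif";
  url: string; // every datasource, internal or external, resolves to a url
};

type ImportConfig = {
  datasourceId: string; // references the published record
  src: string; // local source file to import
  propertiesToKeep?: string[];
  explodeMulti?: boolean;
};
```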
Something like LowDB might help. A lot of boilerplate gets written for reading/writing/editing each new type of thing in JSON.
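A sketch of how lowdb could replace that boilerplate, reusing the hypothetical `PublishedDatasource` above (assumes lowdb v7; the API differs slightly in older versions):

```ts
import { Low } from "lowdb";
import { JSONFile } from "lowdb/node";

type Db = { datasources: PublishedDatasource[] };

// One adapter per JSON document database (datasources, metrics, geographies, ...)
const adapter = new JSONFile<Db>("project/datasources.json");
const db = new Low<Db>(adapter, { datasources: [] });

await db.read();
db.data.datasources.push({
  datasourceId: "eelgrass",
  geo_type: "vector",
  format: "fgb",
  url: "https://example.com/eelgrass.fgb", // placeholder url
});
await db.write();
```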
Subdivided datasources as they exist today probably need to go away as a separate and special thing. The flatgeobuf client needs to get wrapped in a set of libraries that take on subdivision and unioning, as well as caching.
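One possible shape for that wrapper, purely hypothetical:

```ts
import type { Feature } from "geojson";

type BBox = [number, number, number, number];

// Hypothetical interface for a flatgeobuf-backed client that absorbs the
// responsibilities of the old subdivided VectorDatasource client.
interface CachingVectorClient {
  /** Fetch features intersecting bbox, serving repeat requests from an in-memory cache */
  fetch(bbox: BBox): Promise<Feature[]>;
  /** Fetch subdivided pieces and union them back into whole features, keyed on idProperty */
  fetchUnion(bbox: BBox, idProperty: string): Promise<Feature[]>;
  /** Drop cached bundles */
  clearCache(): void;
}
```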