Access Google Cloud Storage via NIO #6775

Merged: 35 commits from google-cloud into master on Feb 7, 2023

Conversation

fm3 (Member) commented Jan 23, 2023

  • Adds support for remote datasets hosted on Google Cloud Storage (gs://), with optional GoogleServiceAccountCredentials
  • Refactors NIO usage: file systems are no longer obtained via NIO's lookup by URI scheme but are constructed explicitly (see the sketch after this list). This means a lot less magic and more direct, type-safe passing of the different kinds of credentials to the file systems. The META-INF service files are also no longer needed 🎉
  • Note: the paths passed to FileSystem.getPath are no longer full URIs but only paths inside the bucket scope. The GCS file system requires this, and the others support it as well. The paths have to be converted back to URIs when passed to MagLocator
  • Slightly refactors credential handling and renames some fields
  • Uses asynchronous caching for file system creation to avoid duplicates caused by parallel requests
  • Tries to gunzip data returned by the NIO file system (I was surprised to randomly receive gzipped data from a GCS bucket)
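
For illustration, a minimal Scala sketch of the explicit, scheme-free file system handling and the asynchronous cache described above. It assumes the google-cloud-nio and google-auth libraries; the names GoogleServiceAccountCredential, GcsFileSystemFactory and RemoteFileSystemCache are placeholders, not necessarily the classes used in this PR:

import java.io.ByteArrayInputStream
import java.nio.file.FileSystem
import scala.collection.concurrent.TrieMap
import scala.concurrent.{ExecutionContext, Future}

import com.google.auth.oauth2.ServiceAccountCredentials
import com.google.cloud.storage.StorageOptions
import com.google.cloud.storage.contrib.nio.{CloudStorageConfiguration, CloudStorageFileSystem}

// Placeholder credential holder; the PR's credential classes may differ.
case class GoogleServiceAccountCredential(secretJson: String)

object GcsFileSystemFactory {
  // Build the file system explicitly instead of relying on NIO's
  // FileSystems.newFileSystem lookup by URI scheme (and META-INF service files).
  def create(bucket: String, credential: Option[GoogleServiceAccountCredential]): FileSystem = {
    val storageOptions = StorageOptions.newBuilder()
    credential.foreach { c =>
      storageOptions.setCredentials(
        ServiceAccountCredentials.fromStream(new ByteArrayInputStream(c.secretJson.getBytes("UTF-8"))))
    }
    // Paths obtained from this file system are bucket-relative, not full gs:// URIs.
    CloudStorageFileSystem.forBucket(bucket, CloudStorageConfiguration.DEFAULT, storageOptions.build())
  }
}

// Asynchronous cache so that parallel requests for the same bucket and credential
// do not create duplicate file systems (sketch only; the PR uses its own cache).
class RemoteFileSystemCache(implicit ec: ExecutionContext) {
  private val cache =
    TrieMap.empty[(String, Option[GoogleServiceAccountCredential]), Future[FileSystem]]

  def getOrCreate(bucket: String, credential: Option[GoogleServiceAccountCredential]): Future[FileSystem] =
    cache.getOrElseUpdate((bucket, credential), Future(GcsFileSystemFactory.create(bucket, credential)))
}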

URL of deployed dev instance (used for testing):

TODO

  • access GCS data anonymously
  • access GCS data with service account credential json
  • integrate GCS file system into managed file systems creation
  • unify file system handling
  • clean up credential naming, enums
  • re-check that s3 and https credentials and URI styles still work (rework so that paths never contain the URL scheme and host/bucket? see the path-splitting sketch after this list)
  • add route to create GCS credential
  • adapt front-end for uploading service account credential json
  • remove fast-start changes to application.conf
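
As an illustration of the "paths never contain the URL scheme and host/bucket" idea above, a small Scala sketch; the RemotePath helper is hypothetical and not code from this PR:

import java.net.URI

// Hypothetical helper: split a remote URI into the bucket/host and the
// bucket-relative path that FileSystem.getPath expects, and rebuild the
// full URI when it is needed again, e.g. for MagLocator.
object RemotePath {
  def split(uri: URI): (String, String) =
    (uri.getHost, Option(uri.getPath).filter(_.nonEmpty).getOrElse("/"))

  def join(scheme: String, bucket: String, pathInBucket: String): URI = {
    val absolutePath = if (pathInBucket.startsWith("/")) pathInBucket else "/" + pathInBucket
    new URI(scheme, bucket, absolutePath, null)
  }
}

// Example: RemotePath.split(new URI("gs://my-bucket/color/1")) == ("my-bucket", "/color/1"),
// and RemotePath.join("gs", "my-bucket", "/color/1") yields gs://my-bucket/color/1 again.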

Steps to test:

  • The front-end is not ready yet, but this feature can already be tested by pasting a gs:// Zarr URI and ignoring the front-end's warnings
  • Adding a credential is also possible by using whichever secret field is shown (and entering an arbitrary non-empty string into the left username/key-id field)
  • Also test that s3 and https datasets can still be explored and viewed, with and without credentials.

Issues:


fm3 self-assigned this Jan 23, 2023
@@ -311,7 +311,6 @@ function AddZarrLayer({
(userInput.indexOf("https://") !== 0 && userInput.indexOf("s3://") === 0)
) {
setSelectedProtocol(userInput.indexOf("https://") === 0 ? "https" : "s3");
setShowCredentialsFields(userInput.indexOf("s3://") === 0);
fm3 (Member Author) commented on these lines:

Note that all supported schemes now support their form of credentials, so this is no longer needed.

fm3 marked this pull request as ready for review February 1, 2023 10:28
fm3 requested a review from frcroth February 1, 2023 10:30
fm3 (Member Author) commented Feb 1, 2023

@frcroth I’d appreciate it if you could already have a look at the backend changes even though the frontend part is not yet complete. @philippotto agreed to adapt the front-end in the coming days.

philippotto (Member) commented, quoting an earlier reply:

@philippotto Great, thanks! Yes, the JSON is sent correctly :) Compare https://scm.slack.com/archives/C5AKLAV0B/p1675339672375929?thread_ts=1674548652.080749&cid=C5AKLAV0B for access to example data with credentials.

Perfect 👍 Exploration works, but the data fetch requests don't really work for me. The front-end waits forever since the requests don't finish (until they presumably time out). The console says:

java.lang.NoClassDefFoundError: Could not initialize class org.blosc.IBloscDll
        at com.scalableminds.webknossos.datastore.datareaders.BloscCompressor.cbufferSizes(Compressor.scala:254)
        at com.scalableminds.webknossos.datastore.datareaders.BloscCompressor.uncompress(Compressor.scala:240)
        at com.scalableminds.webknossos.datastore.datareaders.ChunkReader.$anonfun$readBytes$2(ChunkReader.scala:38)
        at scala.Option.map(Option.scala:230)
        at com.scalableminds.webknossos.datastore.datareaders.ChunkReader.$anonfun$readBytes$1(ChunkReader.scala:35)
        at scala.util.Using$Manager.scala$util$Using$Manager$$manage(Using.scala:171)
        at scala.util.Using$Manager$.$anonfun$apply$2(Using.scala:225)
        at scala.util.Try$.apply(Try.scala:213)
        at scala.util.Using$Manager$.apply(Using.scala:225)
        at com.scalableminds.webknossos.datastore.datareaders.ChunkReader.readBytes(ChunkReader.scala:34)
        at com.scalableminds.webknossos.datastore.datareaders.ChunkReader.read(ChunkReader.scala:31)
        at com.scalableminds.webknossos.datastore.datareaders.DatasetArray.$anonfun$getSourceChunkDataWithCache$1(DatasetArray.scala:96)
        at akka.http.caching.LfuCache$.$anonfun$toJavaMappingFunction$2(LfuCache.scala:97)
        at scala.compat.java8.functionConverterImpls.AsJavaBiFunction.apply(FunctionConverters.scala:41)
        at com.github.benmanes.caffeine.cache.LocalAsyncCache.lambda$get$2(LocalAsyncCache.java:92)
        at com.github.benmanes.caffeine.cache.BoundedLocalCache.lambda$doComputeIfAbsent$14(BoundedLocalCache.java:2413)
        at java.util.concurrent.ConcurrentHashMap.compute(ConcurrentHashMap.java:1853)
        at com.github.benmanes.caffeine.cache.BoundedLocalCache.doComputeIfAbsent(BoundedLocalCache.java:2411)
        at com.github.benmanes.caffeine.cache.BoundedLocalCache.computeIfAbsent(BoundedLocalCache.java:2394)
        at com.github.benmanes.caffeine.cache.LocalAsyncCache.get(LocalAsyncCache.java:91)
        at com.github.benmanes.caffeine.cache.LocalAsyncCache.get(LocalAsyncCache.java:82)
        at akka.http.caching.LfuCache.getOrLoad(LfuCache.scala:126)
        at com.scalableminds.webknossos.datastore.datareaders.DatasetArray.getSourceChunkDataWithCache(DatasetArray.scala:96)
        at com.scalableminds.webknossos.datastore.datareaders.DatasetArray.$anonfun$readAsFortranOrder$4(DatasetArray.scala:77)
        at com.scalableminds.util.tools.Fox$.runNext$3(Fox.scala:131)
        at com.scalableminds.util.tools.Fox$.serialCombined(Fox.scala:137)
        at com.scalableminds.webknossos.datastore.datareaders.DatasetArray.readAsFortranOrder(DatasetArray.scala:75)
        at com.scalableminds.webknossos.datastore.datareaders.DatasetArray.readBytes(DatasetArray.scala:54)
        at com.scalableminds.webknossos.datastore.datareaders.DatasetArray.readBytesXYZ(DatasetArray.scala:46)
        at com.scalableminds.webknossos.datastore.dataformats.zarr.ZarrCubeHandle.cutOutBucket(ZarrBucketProvider.scala:23)
        at com.scalableminds.webknossos.datastore.dataformats.BucketProvider.$anonfun$load$2(BucketProvider.scala:23)
        at com.scalableminds.webknossos.datastore.storage.DataCubeCache.$anonfun$withCache$5(DataCubeCache.scala:76)
        at com.scalableminds.util.tools.Fox.$anonfun$flatMap$1(Fox.scala:259)
        at scala.concurrent.Future.$anonfun$flatMap$1(Future.scala:307)
        at scala.concurrent.impl.Promise.$anonfun$transformWith$1(Promise.scala:41)
        at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:64)
        at java.util.concurrent.ForkJoinTask$RunnableExecuteAction.exec(ForkJoinTask.java:1402)
        at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
        at java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056)
        at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692)
        at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:175)

Is there something wrong with my setup? I did a ./clean after checking out this branch.

  • The label of the dropzone says "(Optional)", which seems a bit redundant since there is already the radio selection between anonymous access and access with credentials above. Maybe this could be removed (or the radio selection could be removed and the auth fields made optional for all cases?)

Done 👍

  • If I press reset after a dataset has been recovered, the state of which protocol was selected seems to be lost. The gs URI is still in the field, but the auth fields show the basic-auth case. Not super important, since the reset button will probably be used rarely *shrug*

Good point. I adapted the reset button to also reset the original input field.

fm3 (Member Author) commented Feb 3, 2023

Thanks!

The error reads to me as if blosc is not installed. That is a data (de)compression library. Does loading other blosc-compressed Zarr datasets work for you? Could you try apt install libblosc1? Compare https://github.com/scalableminds/webknossos/blob/master/DEV_INSTALL.md

fm3 requested a review from frcroth February 6, 2023 09:01
daniel-wer (Member) left a comment

Frontend code almost LGTM. I'll try to test and will report back.

CHANGELOG.unreleased.md — review thread (outdated, resolved)
Comment on lines 37 to 38
const jsonString = await readFileAsText(file);
return JSON.parse(jsonString);
A Member commented:

Should this be guarded somehow? The page shouldn't crash if a file with the wrong format was uploaded.

A Member added:

I tested this case. Currently, the page doesn't crash, but also nothing happens and no error is shown.

A Member replied:

Should give a proper error msg now :)

if (credentials) {
  return exploreRemoteDataset([datasourceUrl], {
    username: "",
    pass: JSON.stringify(credentials),
A Member commented:

I tried to find out which type credentials has, but it's not strictly specified in wk. The only thing I found was Record<string, any> in the NeuroglancerDatasetConfig. Is there a more specific definition of the credential type? And is stringifying all of it and passing it as the password here correct?

The same Member followed up:

Ah, I found the link to the documentation which shows what the credentials file looks like (https://cloud.google.com/iam/docs/creating-managing-service-account-keys?hl=de#creating). In that case, there doesn't need to be a more specific type, and it also makes sense to pass all of it as pass here, so never mind :)
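
For context, a hedged Scala sketch of how the backend could check that the string arriving in the pass field really is a service account key before storing it as a credential. This is illustrative only, not necessarily what this PR does, and assumes Play JSON is available:

import scala.util.Try

import play.api.libs.json.Json

// Illustrative check: a GCP service account key is a JSON object with
// "type": "service_account" plus private_key and client_email fields.
object CredentialValidation {
  def looksLikeServiceAccountKey(secret: String): Boolean =
    Try(Json.parse(secret)).toOption.exists { json =>
      (json \ "type").asOpt[String].contains("service_account") &&
      (json \ "private_key").asOpt[String].isDefined &&
      (json \ "client_email").asOpt[String].isDefined
    }
}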

daniel-wer (Member) commented Feb 6, 2023

Works very nicely 👍

Notes from testing:

  • The "Add Remote Zarr / N5 Dataset" page should list which storage types are supported: https, s3, gcs afaik. There is a validation error if a non-supported url is pasted, but it would be nice to find out, before.
    • There is a typo on the page: "segmentattion" with double t

Not from this PR, but I really like the error log if the import doesn't work! 🥇

philippotto (Member) commented:
[x] The "Add Remote Zarr / N5 Dataset" page should list which storage types are supported: https, s3, gcs afaik. There is a validation error if a non-supported url is pasted, but it would be nice to find out, before.

  • There is a typo on the page: "segmentattion" with double t

Done :)

fm3 enabled auto-merge (squash) February 7, 2023 08:59
fm3 merged commit a185e65 into master Feb 7, 2023
fm3 deleted the google-cloud branch February 7, 2023 09:18
hotzenklotz added a commit that referenced this pull request Feb 7, 2023
…_editable_text_style

* 'master' of github.com:scalableminds/webknossos:
  Fix error message when trying to join an orga you are already in (#6824)
  Access Google Cloud Storage via NIO (#6775)
hotzenklotz added a commit that referenced this pull request Feb 7, 2023
…a_owner

* 'master' of github.com:scalableminds/webknossos:
  Fix error message when trying to join an orga you are already in (#6824)
  Access Google Cloud Storage via NIO (#6775)

Successfully merging this pull request may close these issues.

Support Zarr/N5 streaming from Google Cloud Storage
4 participants