Recursive exploration of remote datasets #7912

MichaelBuessemeyer · 2024-07-02T15:18:01Z

This PR adds recursive exploration to the existing remote dataset exploration. It is supported for the local file system, GCS, and S3.

URL of deployed dev instance (used for testing):

https://___.webknossos.xyz

Steps to test:

Test GCS via gs://neuroglancer-fafb-data/fafb_v14/. This should result in a successfully explored dataset.
Test S3 via s3://janelia-cosem-datasets/jrc_mus-nacc-4/jrc_mus-nacc-4.zarr/. This should result in a successfully explored dataset.
Test locally:
- Create new local folders, e.g. <wk-root>/binaryData2/some_dir/more_dir and <wk-root>/binaryData2/other_dir/more_dir
- Add <wk-root>/binaryData2/ to the whitelist in the application.conf in line 197.
- Enter file:///binaryData2/ into the add remote dataset form. The request should fail and only include a short error message that does not leak any information about the underlying folder structure of the server.
- Add a new dataset (not wkw, as wkw exploration is not implemented) e.g. l4_sample_zarr3_sharded to <wk-root>/binaryData2/other_dir/more_dir
- Enter file:///binaryData2/ into the add remote dataset form again. This time the request should successfully find the dataset.

TODOs:

Currently, if the exploration fails, the backend leaks the directory structure of the directories allowed by the whitelisting feature. This should not be exposed. Moreover, @normanrz argued that the information is not useful to users.
-> I'd add a warning to the docs about the whitelisting feature: the whitelisted directories should only contain datasets and no sensitive information such as SSH keys. Moreover, the debug log should not be exposed to end users, at least when a local file system is used. When wk crawls remote cloud storage, the person using wk already has the credentials (if any are needed) to list that storage, so wk does not leak any information the user would not already have.
Should the mutable report be included in the client answer even for non-local datasets? -> It is kept for now and may be removed in the future.

Issues:

fixes #7714

(Please delete unneeded items, merge only when none are left open)

Updated changelog
Updated documentation if applicable
Removed dev-only changes like prints and application.conf edits
Considered common edge cases
Needs datastore update after deployment

MichaelBuessemeyer · 2024-07-03T11:30:42Z

project/Dependencies.scala

@@ -56,7 +56,7 @@ object Dependencies {
    // MultiArray (ndarray) handles. import ucar
    "edu.ucar" % "cdm-core" % "5.4.2",
    // Amazon S3 cloud storage client. import com.amazonaws
-    "com.amazonaws" % "aws-java-sdk-s3" % "1.12.584",
+    "com.amazonaws" % "aws-java-sdk-s3" % "1.12.584", // TODO Update?!!


There exists a version two of the lib. See https://github.com/aws/aws-sdk-java-v2/#using-the-sdk

Do we want to migrate to version V2? This should be rather easy as the lib is afaik only used in S3DataVault.scala

I wrote #7913

Sounds good (reads like it may improve async api?) but should not block this PR

fm3

Cool stuff, thanks for looking into this!

To be honest, I had a bit of a hard time reading the new code in ExploreRemoteLayerService. This is complicated nested logic, and a lot is happening in very compact code. I think that a few extractions into well-named private methods could already help a lot.

Also, please have a look at my individual comments. Feel free to ask if anything is unclear :)

fm3 · 2024-07-04T08:03:46Z

...sos-datastore/app/com/scalableminds/webknossos/datastore/datavault/FileSystemDataVault.scala


 class FileSystemDataVault extends DataVault {

-  override def readBytesAndEncoding(path: VaultPath, range: RangeSpecifier)(
-      implicit ec: ExecutionContext): Fox[(Array[Byte], Encoding.Value)] = {
+  private def vaultPathToLocalPath(path: VaultPath)(implicit ec: ExecutionContext): Fox[Path] = {


Please move the private method below the public one that uses it, either directly below the first usage or further down. (That’s the typical reading order of the backend files, or at least it should be 😅)

I hope that's correct now 🙈

fm3 · 2024-07-04T08:20:40Z

...datastore/app/com/scalableminds/webknossos/datastore/explore/ExploreRemoteLayerService.scala

+          .toFox
+          .flatten
+          .futureBox
+          .flatMap(


Phew, that’s a mouthful. Maybe extracting some of these into functions with annotated return types could help with understanding what’s going on here?

Maybe the newest version is easier to read? Each method body now has at most about 8 lines.

MichaelBuessemeyer · 2024-07-16T11:40:58Z

There is a problem I noticed: Exploring gs://iarpa_microns/minnie/minnie65/ takes very long as it contains a lot of datasets / subfolders. The current code supports a maximum depth of 8 and explores at most 10 subfolders per level. That allows up to 10 x 10 x 10 x ... x 10 = 10^8 explored subfolders, which potentially takes a lot of time 🥴. Should I decrease both limits?

Another thing: The exploration does not yield any results on many folders that sound like a valid dataset / layer. But I cannot verify this, as I was unable to find a proper GCS explorer in a reasonable time frame :/

And I also have the feeling that the sibling exploration is not working as I / we want it to:
If a dataset is like this:

dataset
- layer 1
  - mag 1
    - shard 1
      - actual data files
    - shard 2
      - actual data files
    - ...
  - mag 2
    - shard 1
      - actual data files
    - shard 2
      - actual data files
    - ...
- layer 2
  - mag 1
    - shard 1
      - actual data files
    - shard 2
      - actual data files
    - ...
  - mag 2
    - shard 1
      - actual data files
    - shard 2
      - actual data files
    - ...

then the current exploration would find dataset/layer 1/mag 1 as a layer and then would treat dataset/layer 1/mag 2 as a sibling and thus an additional layer although it is a different mag of the layer...
How should we handle this? Should the sibling check be something like:
val sameParentRemainingPaths = remainingPaths.filter(_._1.parent == path.parent.parent).

Moreover, I suspect the code keeps running the recursive search in the sibling directories. IMO we should restrict this to a max nesting level of 2, as the data of sibling layers should be at the same nesting level.
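For illustration, the effect of the proposed parent-of-parent filter can be sketched on plain `java.nio.file.Path` values. This is only a sketch under assumptions: `remaining` being a list of `(path, depth)` pairs and the name `layerLevelCandidates` are hypothetical, and the real code works on `VaultPath`, not on `java.nio` paths.

```scala
import java.nio.file.{Path, Paths}

object SiblingFilterSketch {
  // Keeps only the remaining paths whose parent is the grandparent of `found`,
  // i.e. candidates on the layer level rather than other mags of the same layer.
  // The (Path, Int) pair shape is an assumption made for this sketch.
  def layerLevelCandidates(found: Path, remaining: List[(Path, Int)]): List[(Path, Int)] = {
    // getParent returns null when there is no parent, so wrap it in Option.
    val grandParent = Option(found.getParent).flatMap(parent => Option(parent.getParent))
    remaining.filter { case (candidate, _) => grandParent.contains(candidate.getParent) }
  }

  def main(args: Array[String]): Unit = {
    val found = Paths.get("dataset", "layer 1", "mag 1")
    val remaining = List(
      (Paths.get("dataset", "layer 1", "mag 2"), 3), // other mag of the same layer
      (Paths.get("dataset", "layer 2"), 2)           // candidate for a sibling layer
    )
    layerLevelCandidates(found, remaining).foreach { case (path, depth) =>
      println(s"$path (depth $depth)")
    }
  }
}
```

With `dataset/layer 1/mag 1` as the found layer, `dataset/layer 1/mag 2` is dropped because its parent is `dataset/layer 1`, while `dataset/layer 2` is kept because its parent is `dataset`.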

fm3 · 2024-07-16T11:54:12Z

Yes, 10^8 is too big as a maximum. How about 10^3? How long does the tested dataset take then?

Not sure I fully understood the problem with mixed siblings/mags, but yes, I guess we can restrict this more, with the assumption that the structure is not arbitrarily mixed.

MichaelBuessemeyer · 2024-07-17T07:59:48Z

> How about 10^3

Done: Max 3 depth & max 10 items per level. But it seems a little restrictive in depth :/. Maybe something like 4 depth & 6 items per level (max 1296) or 5 depth and 5 items per level (max 3125) would be better 🤔?
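To compare the candidate limits, here is a small sketch of the worst-case arithmetic. It assumes every folder really has the maximum number of subfolders and that the root itself is not counted; `ExplorationBounds` and its method names are made up for this example and are not part of the PR.

```scala
object ExplorationBounds {
  // Folders on the deepest level if every folder has `maxItems` subfolders.
  def deepestLevel(maxDepth: Int, maxItems: Int): BigInt =
    BigInt(maxItems).pow(maxDepth)

  // Folders on all levels 1..maxDepth (geometric sum), i.e. every folder
  // that could be visited in the worst case.
  def allLevels(maxDepth: Int, maxItems: Int): BigInt =
    (1 to maxDepth).map(level => BigInt(maxItems).pow(level)).sum

  def main(args: Array[String]): Unit = {
    // (maxDepth, maxItems) pairs mentioned in this thread.
    val candidates = Seq((8, 10), (3, 10), (4, 6), (5, 5))
    candidates.foreach { case (depth, items) =>
      println(s"depth=$depth items=$items deepest=${deepestLevel(depth, items)} total=${allLevels(depth, items)}")
    }
  }
}
```

The figures quoted in this thread (10^8, 10^3, 1296, 3125) correspond to `deepestLevel`; counting the intermediate levels as well makes the totals slightly larger, e.g. 1110 instead of 1000 for depth 3 with 10 items.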

> How long does the tested dataset take then?

I tested twice. Once it took 30 sec, the other run it took 60 sec 🤷‍♂️. Still quite long.

> Not sure I fully understood the problem with mixed siblings/mags, but yes, I guess we can restrict this more, with the assumption that the structure is not arbitrarily mixed

The "Sibling exploration" was now removed and an issue for this was created (see #7924). See also: https://scm.slack.com/archives/C5AKLAV0B/p1721129347974129 where it was decided to postpone this feature and move it to a separate issue.

fm3 · 2024-07-17T08:19:35Z

I think 60s is acceptable. I guess it will only take this long if there are actually this many folders, right? Maybe you could test a couple more from our example remote dataset table. I have no strong opinion on depth 3 or 4. 3125 tests seems like too much, though.
Did the sibling test removal help with runtime by any chance?

MichaelBuessemeyer · 2024-07-18T07:30:27Z

> I think 60s is acceptable. I guess it will only take this long if there are actually this many folders, right?

Yeah

> Did the sibling test removal help with runtime by any chance?

Oh right, that might explain the difference between 30 sec and 60 sec 🤔

> Maybe you could test a couple more from our example remote dataset table. I have no strong opinion on depth 3 or 4. 3125 tests seems like too much, though.

I'll do some more :)

Edit: See https://scm.slack.com/archives/C5AKLAV0B/p1721293670420979 for testing results

fm3

Yes, this is more readable, thanks! Of course, the logic is still pretty complicated, and with everything being async, it will always be a bit hard to read, but the task is fairly complicated, and I think this is certainly better than before :)

I added a couple more comments

MichaelBuessemeyer

Thanks for your feedback. I hopefully covered everything :)

MichaelBuessemeyer · 2024-07-18T16:55:47Z

webknossos-datastore/app/com/scalableminds/webknossos/datastore/datavault/S3DataVault.scala

+    try {
+      val listObjectsRequest = new ListObjectsV2Request
+      listObjectsRequest.setBucketName(bucketName)
+      listObjectsRequest.setPrefix(keyPrefix)
+      listObjectsRequest.setDelimiter("/")
+      listObjectsRequest.setMaxKeys(maxItems)
+      val objectListing = client.listObjectsV2(listObjectsRequest)
+      val s3SubPrefixes = objectListing.getCommonPrefixes.asScala.toList
+      Fox.successful(s3SubPrefixes)
+    } catch {
+      case e: AmazonServiceException =>
+        e.getStatusCode match {
+          case 404 => Fox.empty
+          case _   => Fox.failure(e.getMessage)
+        }
+      case e: Exception => Fox.failure(e.getMessage)
+    }
+


I have a question @fm3.
Should I use tryo here or this try and catch construct? Which one is better in which case?

tryo is a shortcut: it always produces a Box.Failure in case of any exception. In this code, however, we want to create different results based on a parameter of the exception (Fox.empty for status code 404), so we need the full try/catch to implement that custom logic.
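The difference can be sketched without any project dependencies. Everything here is a simplified stand-in and an assumption for illustration: `Result`, `Full`, `Empty`, and `Failure` only mimic the shape of Box/Fox, `StatusException` stands in for `AmazonServiceException`, and `tryoLike` only imitates the behaviour of `tryo` described above.

```scala
object TryoVsTryCatch {
  // Simplified stand-in for Box/Fox (assumption, not the real types).
  sealed trait Result[+A]
  final case class Full[A](value: A) extends Result[A]
  case object Empty extends Result[Nothing]
  final case class Failure(message: String) extends Result[Nothing]

  // Hypothetical exception carrying an HTTP status code.
  final class StatusException(val statusCode: Int, message: String) extends RuntimeException(message)

  // tryo-like shortcut: every exception becomes a Failure.
  def tryoLike[A](block: => A): Result[A] =
    try Full(block)
    catch { case e: Exception => Failure(e.getMessage) }

  // Full try/catch: a 404 maps to Empty, everything else to Failure.
  def withCustomHandling[A](block: => A): Result[A] =
    try Full(block)
    catch {
      case e: StatusException if e.statusCode == 404 => Empty
      case e: Exception                              => Failure(e.getMessage)
    }
}
```

With the shortcut, a 404 is indistinguishable from any other error; with the explicit try/catch, callers can treat "nothing found under this prefix" differently from a real failure.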

fm3

Good Stuff :) I’d say this is good to go

Could you update the PR description before merging? (The TODOs are solved, right?)

MichaelBuessemeyer added 7 commits June 21, 2024 20:04

WIP: Add recursive exporation of remote s3 layer

a54c5b3

WIP: finish first version of recursive exploration of remote s3 layer

f4e973a

WIP: add gcs support

2ad6765

WIP: add gcs support

d5fba44

WIP: run explorers in parallel on same subdirectory

7b06e4e

Code clean up (mainly extracted methods)

b7e7096

Merge branch 'master' of github.com:scalableminds/webknossos into recursive-exploration

20cf6b5

MichaelBuessemeyer added backend new feature labels Jul 2, 2024

MichaelBuessemeyer self-assigned this Jul 2, 2024

MichaelBuessemeyer added 10 commits July 2, 2024 17:42

add local file system exploration

1e42c45

Merge branch 'master' of github.com:scalableminds/webknossos into recursive-exploration

c3f0ec3

do not include mutableReport in requests regarding the local file system

2798b1d

add missing override of listDirectory of MockDataVault

3ea72a1

some cleanup

abc7b2e

add command to build backend parts like in CI to be ablte to detect errors before pushing

10d5c3e

clean up code

4db966f

format backend code

bc6e93e

update docs to mention recursive exploration

45a86ff

add changelog entry

ca418b6

MichaelBuessemeyer requested a review from fm3 July 3, 2024 13:10

MichaelBuessemeyer marked this pull request as ready for review July 3, 2024 13:10

MichaelBuessemeyer commented Jul 3, 2024

fm3 requested changes Jul 4, 2024

apply some feedback

c8e35bf

fm3 mentioned this pull request Jul 8, 2024

Bounding Box tool improvements #7892

Merged

MichaelBuessemeyer and others added 4 commits July 15, 2024 12:49

Merge branch 'master' into recursive-exploration

aeb7569

apply some feedback; Mainly extract methods in ExploreRemoteLayerService.scala

d6b5719

Merge branch 'recursive-exploration' of github.com:scalableminds/webknossos into recursive-exploration

b8cdb7e

Merge branch 'master' of github.com:scalableminds/webknossos into recursive-exploration

e8d1da6

fm3 mentioned this pull request Jul 16, 2024

Upgrade to AWS Client lib v2, use new async API? #7913

Closed

Only let explorers of simple dataset formats explore for additional layers in sibling folders - And remove exploration of sibling folders to find additional layers

afea2df

MichaelBuessemeyer mentioned this pull request Jul 17, 2024

Support finding mutliple data layers during dataset exploration for simpler dataset formats without a metadata file #7924

Open

MichaelBuessemeyer requested a review from fm3 July 18, 2024 12:48

fm3 reviewed Jul 18, 2024

apply pr feedback

ef69593

MichaelBuessemeyer commented Jul 18, 2024

Merge branch 'master' of github.com:scalableminds/webknossos into recursive-exploration

dafcc7e

MichaelBuessemeyer requested a review from fm3 July 22, 2024 08:06

fm3 approved these changes Jul 22, 2024

MichaelBuessemeyer and others added 2 commits July 22, 2024 15:12

restore accidentally deleted changelog entry

84fea6d

Merge branch 'master' into recursive-exploration

d4b1465

MichaelBuessemeyer enabled auto-merge (squash) July 22, 2024 13:32

MichaelBuessemeyer merged commit d08affc into master Jul 22, 2024
2 checks passed

MichaelBuessemeyer deleted the recursive-exploration branch July 22, 2024 13:33
