
Recursive exploration of remote datasets #7912

Merged: 27 commits merged into master on Jul 22, 2024

Conversation

@MichaelBuessemeyer (Contributor) commented on Jul 2, 2024:

This PR adds recursive exploration to the existing remote dataset exploration. It is supported for the local file system, GCS, and S3.

URL of deployed dev instance (used for testing):

  • https://___.webknossos.xyz

Steps to test:

  • Test GCS via gs://neuroglancer-fafb-data/fafb_v14/. This should result in a successfully explored dataset.
  • Test S3 via s3://janelia-cosem-datasets/jrc_mus-nacc-4/jrc_mus-nacc-4.zarr/. This should result in a successfully explored dataset.
  • Test locally:
    • Create new local folders, e.g. <wk-root>/binaryData2/some_dir/more_dir and <wk-root>/binaryData2/other_dir/more_dir.

    • Add <wk-root>/binaryData2/ to the whitelist in application.conf (line 197).

    • Enter file:///binaryData2/ into the add remote dataset form. The request should fail with only a short error message, without leaking any information about the server's underlying folder structure.

    • Add a new dataset (not wkw, as wkw exploration is not implemented), e.g. l4_sample_zarr3_sharded, to <wk-root>/binaryData2/other_dir/more_dir.

    • Enter file:///binaryData2/ into the add remote dataset form again. The request should now successfully find the dataset.

TODOs:

  • Currently, the backend leaks the directory structure of the whitelisted directories when exploration fails. This should not be exposed. Moreover, @normanrz argued that the information is not useful to users.
    -> I'd add a warning to the docs for the whitelisting feature: whitelisted directories should only contain datasets in their subdirectories, never sensitive information such as SSH keys. Moreover, the debug log should not be exposed to end users, at least when a local file system is used. When wk crawls remote cloud storage, the person using wk already has the credentials (if any are needed) to list that storage, so wk does not leak any information the user does not already have.
  • Should the mutable report be included in the client answer even for non-local datasets? -> It is kept for now and may be removed in the future.

Issues:

  • Recursive explore for listable DataVaults

@MichaelBuessemeyer marked this pull request as ready for review on July 3, 2024, 13:10.
@@ -56,7 +56,7 @@ object Dependencies {
// MultiArray (ndarray) handles. import ucar
"edu.ucar" % "cdm-core" % "5.4.2",
// Amazon S3 cloud storage client. import com.amazonaws
"com.amazonaws" % "aws-java-sdk-s3" % "1.12.584",
"com.amazonaws" % "aws-java-sdk-s3" % "1.12.584", // TODO Update?!!
@MichaelBuessemeyer (Contributor, Author) commented:

There exists a version two of the lib. See https://github.com/aws/aws-sdk-java-v2/#using-the-sdk

Do we want to migrate to V2? This should be rather easy, as the lib is afaik only used in S3DataVault.scala.
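For reference, a rough sketch (not tested against this codebase) of what the delimiter-based listing in S3DataVault.scala might look like with SDK v2; bucketName, keyPrefix, and maxItems are placeholders assumed to mirror the v1 snippet discussed further down:

```scala
import software.amazon.awssdk.services.s3.S3Client
import software.amazon.awssdk.services.s3.model.ListObjectsV2Request
import scala.jdk.CollectionConverters._

val bucketName = "example-bucket" // placeholder
val keyPrefix = "some/prefix/"    // placeholder
val maxItems = 10                 // placeholder

// SDK v2 replaces the mutable request objects of v1 with immutable builders.
val client: S3Client = S3Client.builder().build()
val request = ListObjectsV2Request
  .builder()
  .bucket(bucketName)
  .prefix(keyPrefix)
  .delimiter("/")
  .maxKeys(maxItems)
  .build()
// commonPrefixes() lists the "subfolders" below the prefix, as getCommonPrefixes does in v1.
val subPrefixes: List[String] =
  client.listObjectsV2(request).commonPrefixes().asScala.map(_.prefix()).toList
```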

@fm3 (Member) replied:

I wrote #7913 Sounds good (reads like it may improve async api?) but should not block this PR

package.json — review thread resolved
@fm3 (Member) left a comment:

Cool stuff, thanks for looking into this!

To be honest, I had a bit of a hard time reading the new code in ExploreRemoteLayerService. This is complicated nested logic, and a lot is happening in very compact code. I think that a few extractions into well-named private methods could already help a lot.

Also, please have a look at my individual comments. Feel free to ask if anything is unclear :)


class FileSystemDataVault extends DataVault {

  override def readBytesAndEncoding(path: VaultPath, range: RangeSpecifier)(
      implicit ec: ExecutionContext): Fox[(Array[Byte], Encoding.Value)] = {
  private def vaultPathToLocalPath(path: VaultPath)(implicit ec: ExecutionContext): Fox[Path] = {
@fm3 (Member):

Please move the private method below the public one that uses it. Either below the first usage, or further down. (That’s the typical reading order of the backend files. or at least it should be 😅)

@MichaelBuessemeyer (Contributor, Author) replied:

I hope that's correct now 🙈

.toFox
.flatten
.futureBox
.flatMap(
@fm3 (Member):

Phew, that’s a mouthful. Maybe extracting some of these into functions with annotated return types could help with understanding what’s going on here?

@MichaelBuessemeyer (Contributor, Author) replied:

Maybe the newest version is easier to read? Each method body is now at most around 8 lines.

@fm3 mentioned this pull request on Jul 8, 2024.
@MichaelBuessemeyer (Contributor, Author) commented on Jul 16, 2024:

There is a problem I noticed: exploring gs://iarpa_microns/minnie/minnie65/ takes very long, as it contains a lot of datasets / subfolders. The current code supports a maximum depth of 8, and at most 10 subfolders are explored at each level. That allows up to 10 × 10 × … × 10 = 10^8 explored subfolders, which can take a lot of time 🥴. Should I decrease both limits?
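For illustration, here is a minimal sketch of how these two limits bound the search; listSubfolders, maxDepth, and maxItems are hypothetical stand-ins, not the actual ExploreRemoteLayerService code:

```scala
// Stand-in for listing a directory on GCS, S3, or the local file system.
def listSubfolders(path: String): List[String] = ???

// Visit at most maxItems children per folder, down to at most maxDepth levels.
def explore(path: String, depth: Int, maxDepth: Int = 8, maxItems: Int = 10): List[String] =
  if (depth >= maxDepth) Nil
  else {
    val children = listSubfolders(path).take(maxItems)
    children ++ children.flatMap(explore(_, depth + 1, maxDepth, maxItems))
  }

// Worst case: 10 + 10^2 + … + 10^8 folders are visited, hence the long runtimes.
```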

Another thing: the exploration seems to yield no results for many folders whose names sound like a valid dataset / layer. But I cannot verify this, as I was unable to find a proper GCS browser in a reasonable time frame :/

And I also have the feeling that the sibling exploration is not working the way I / we want it to. If a dataset is structured like this:

  • dataset
    • layer 1
      • mag 1
        • shard 1
          • actual data files
        • shard 2
          • actual data files
        • ...
      • mag 2
        • shard 1
          • actual data files
        • shard 2
          • actual data files
        • ...
    • layer 2
      • mag 1
        • shard 1
          • actual data files
        • shard 2
          • actual data files
        • ...
      • mag 2
        • shard 1
          • actual data files
        • shard 2
          • actual data files
        • ...

then the current exploration finds dataset/layer 1/mag 1 as a layer and then treats dataset/layer 1/mag 2 as a sibling, and thus as an additional layer, although it is just a different mag of the same layer...
How should we handle this? Should the sibling check be something like the following (illustrated below)?
val sameParentRemainingPaths = remainingPaths.filter(_._1.parent == path.parent.parent)
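For context, a hypothetical illustration of why this check compares against path.parent.parent, based on the tree above; path and remainingPaths are the names from the snippet:

```scala
// path               = dataset/layer 1/mag 1   (the found layer candidate)
// path.parent        = dataset/layer 1         (the actual layer)
// path.parent.parent = dataset                 (where true sibling layers live)
// Only remaining paths directly under this grandparent are real sibling layers;
// dataset/layer 1/mag 2 fails the check, since its parent is dataset/layer 1.
val sameParentRemainingPaths = remainingPaths.filter(_._1.parent == path.parent.parent)
```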

Moreover, I suspect the code keeps running the recursive search in the sibling directories. IMO we should restrict this to a max nesting level of 2, as the data of sibling layers should be at the same nesting level.

@fm3 (Member) commented on Jul 16, 2024:

Yes, 10^8 is too big as a maximum. How about 10^3? How long does the tested dataset take then?

Not sure I fully understood the problem with mixed siblings/mags, but yes, I guess we can restrict this more, with the assumption that the structure is not arbitrarily mixed.

Commits: “…ayers in sibling folders”
- And remove exploration of sibling folders to find additional layers
@MichaelBuessemeyer (Contributor, Author) replied:
> How about 10^3

Done: max depth 3 & max 10 items per level. But the depth limit seems a little restrictive :/. Maybe something like depth 4 & 6 items per level (max 6^4 = 1296) or depth 5 & 5 items per level (max 5^5 = 3125) would be better 🤔?

> How long does the tested dataset take then?

I tested twice. One run took 30 sec, the other took 60 sec 🤷‍♂️. Still quite long.

> Not sure I fully understood the problem with mixed siblings/mags, but yes, I guess we can restrict this more, with the assumption that the structure is not arbitrarily mixed

The sibling exploration has now been removed and an issue for it was created (see #7924). See also https://scm.slack.com/archives/C5AKLAV0B/p1721129347974129, where it was decided to postpone this feature and move it to a separate issue.

@fm3 (Member) commented on Jul 17, 2024:

I think 60s is acceptable. I guess it will only take this long if there really are that many folders, right? Maybe you could test a couple more from our example remote dataset table. I have no strong opinion on depth 3 or 4; 3125 seems like too much, though.
Did the sibling-exploration removal help with runtime, by any chance?

@MichaelBuessemeyer (Contributor, Author) commented on Jul 18, 2024:

> I think 60s is acceptable. I guess it will only take this long if there really are that many folders, right?

Yeah

> Did the sibling-exploration removal help with runtime, by any chance?

Oh right, that might explain the difference between 30 sec and 60 sec 🤔

> Maybe you could test a couple more from our example remote dataset table. I have no strong opinion on depth 3 or 4; 3125 seems like too much, though.

I'll do some more :)

Edit: See https://scm.slack.com/archives/C5AKLAV0B/p1721293670420979 for testing results

@fm3 (Member) left a comment:

Yes, this is more readable, thanks! Of course, the logic is still pretty complicated, and with everything being async, it will always be a bit hard to read, but the task is fairly complicated, and I think this is certainly better than before :)

I added a couple more comments

@MichaelBuessemeyer (Contributor, Author) left a comment:

Thanks for your feedback. I hopefully covered everything :)

test/backend/DataVaultTestSuite.scala — review thread resolved (outdated)
Comment on lines +101 to +118
try {
  // List "subfolders" of keyPrefix by asking S3 for common prefixes up to the "/" delimiter.
  val listObjectsRequest = new ListObjectsV2Request
  listObjectsRequest.setBucketName(bucketName)
  listObjectsRequest.setPrefix(keyPrefix)
  listObjectsRequest.setDelimiter("/")
  listObjectsRequest.setMaxKeys(maxItems)
  val objectListing = client.listObjectsV2(listObjectsRequest)
  val s3SubPrefixes = objectListing.getCommonPrefixes.asScala.toList
  Fox.successful(s3SubPrefixes)
} catch {
  case e: AmazonServiceException =>
    e.getStatusCode match {
      case 404 => Fox.empty // a missing prefix/bucket means "no subfolders", not an error
      case _   => Fox.failure(e.getMessage)
    }
  case e: Exception => Fox.failure(e.getMessage)
}

@MichaelBuessemeyer (Contributor, Author) asked:

I have a question, @fm3: should I use tryo here or this try/catch construct? Which one is better in which case?

@fm3 (Member) replied:

tryo is a shortcut: it always produces a Box.Failure in case of any exception. In this code, however, we want to create different results based on a property of the exception (Fox.empty for status code 404), so we need the full try/catch to implement that custom logic.
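To make the contrast concrete, a hedged sketch, assuming tryo comes from lift's Helpers, that a Box-to-Fox conversion like .toFox exists in the codebase (as suggested by earlier snippets), and that client and request are the names from the snippet above:

```scala
import com.amazonaws.AmazonServiceException
import net.liftweb.util.Helpers.tryo
import scala.collection.JavaConverters._
// Fox is assumed to be the codebase's own async result type, as used above.

// With tryo, any exception (including a 404) collapses into one generic Failure:
def listWithTryo(): Fox[List[String]] =
  tryo(client.listObjectsV2(request).getCommonPrefixes.asScala.toList).toFox

// With try/catch, the 404 case can be mapped to Fox.empty instead:
def listWithTryCatch(): Fox[List[String]] =
  try {
    Fox.successful(client.listObjectsV2(request).getCommonPrefixes.asScala.toList)
  } catch {
    case e: AmazonServiceException if e.getStatusCode == 404 => Fox.empty
    case e: Exception                                        => Fox.failure(e.getMessage)
  }
```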

@fm3 (Member) left a comment:

Good Stuff :) I’d say this is good to go

Could you update the PR description before merging? (The TODOs are solved, right?)

@MichaelBuessemeyer enabled auto-merge (squash) on July 22, 2024, 13:32.
@MichaelBuessemeyer merged commit d08affc into master on Jul 22, 2024; 2 checks passed.
@MichaelBuessemeyer deleted the recursive-exploration branch on July 22, 2024, 13:33.