-
Notifications
You must be signed in to change notification settings - Fork 24
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Recursive exploration of remote datasets #7912
Conversation
…rrors before pushing
project/Dependencies.scala
Outdated
@@ -56,7 +56,7 @@ object Dependencies { | |||
// MultiArray (ndarray) handles. import ucar | |||
"edu.ucar" % "cdm-core" % "5.4.2", | |||
// Amazon S3 cloud storage client. import com.amazonaws | |||
"com.amazonaws" % "aws-java-sdk-s3" % "1.12.584", | |||
"com.amazonaws" % "aws-java-sdk-s3" % "1.12.584", // TODO Update?!! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There exists a version two of the lib. See https://github.com/aws/aws-sdk-java-v2/#using-the-sdk
Do we want to migrate to version V2? This should be rather easy as the lib is afaik only used in S3DataVault.scala
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wrote #7913 Sounds good (reads like it may improve async api?) but should not block this PR
...-datastore/app/com/scalableminds/webknossos/datastore/controllers/DataSourceController.scala
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool stuff, thanks for looking into this!
To be honest, I had a bit of a hard time reading the new code in ExploreRemoteLayerService. This is complicated nested logic, and a lot is happening in very compact code. I think that a few extractions into well-named private methods could already help a lot.
Also, please have a look at my individual comments. Feel free to ask if anything is unclear :)
webknossos-datastore/app/com/scalableminds/webknossos/datastore/datavault/DataVault.scala
Outdated
Show resolved
Hide resolved
webknossos-datastore/app/com/scalableminds/webknossos/datastore/datavault/DataVault.scala
Outdated
Show resolved
Hide resolved
|
||
class FileSystemDataVault extends DataVault { | ||
|
||
override def readBytesAndEncoding(path: VaultPath, range: RangeSpecifier)( | ||
implicit ec: ExecutionContext): Fox[(Array[Byte], Encoding.Value)] = { | ||
private def vaultPathToLocalPath(path: VaultPath)(implicit ec: ExecutionContext): Fox[Path] = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please move the private method below the public one that uses it. Either below the first usage, or further down. (That’s the typical reading order of the backend files. or at least it should be 😅)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I hope that's correct now 🙈
...os-datastore/app/com/scalableminds/webknossos/datastore/datavault/GoogleCloudDataVault.scala
Outdated
Show resolved
Hide resolved
...datastore/app/com/scalableminds/webknossos/datastore/explore/ExploreRemoteLayerService.scala
Outdated
Show resolved
Hide resolved
...datastore/app/com/scalableminds/webknossos/datastore/explore/ExploreRemoteLayerService.scala
Outdated
Show resolved
Hide resolved
.toFox | ||
.flatten | ||
.futureBox | ||
.flatMap( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Phew, that’s a mouthful. Maybe extracting some of these into functions with annotated return types could help with understanding what’s going on here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe the newest version is easier to read? Each method has something like max 8 lines as a method body
...datastore/app/com/scalableminds/webknossos/datastore/explore/ExploreRemoteLayerService.scala
Outdated
Show resolved
Hide resolved
...datastore/app/com/scalableminds/webknossos/datastore/explore/ExploreRemoteLayerService.scala
Outdated
Show resolved
Hide resolved
…nossos into recursive-exploration
There is a problem I noticed: Exploring Another thing: The exploration seems to not yield any results on many folders that seem to sound like a valid dataset / layer. But I cannot test this, as I was unable to find a proper gcs explorer in a reasonable time frame :/ And I also have the feeling that the sibling exploration is not working as I / we want it to:
then the current exploration would find dataset/layer 1/mag 1 as a layer and then would treat dataset/layer 1/mag 2 as a sibling and thus an additional layer although it is a different mag of the layer... Moreover, I suspect the code to keep on running recursive search in the sibling directories. IMO we should restrict this to a max nesting level of 2, as the data of sibling layers should be at the same nesting level imo. |
Yes, 10^8 is too big as a maximum. How about 10^3? How long does the tested dataset take then? Not sure I fully understood the problem with mixed siblings/mags, but yes, I guess we can restrict this more, with the assumption that the structure is not arbitrarily mixed. |
…ayers in sibling folders - And remove exploration of sibling folders to find additional layers
Done: Max 3 depth & max 10 items per level. But it seems a little restrictive in depth :/. Maybe something like 4 depth & 6 items per level (max 1296) or 5 depth and 5 items per level (max 3125) would be better 🤔?
I tested twice. Once it took 30 sec, the other run it took 60 sec 🤷♂️. Still quite long.
The "Sibling exploration" was now removed and an issue for this was created (see #7924). See also: https://scm.slack.com/archives/C5AKLAV0B/p1721129347974129 where it was decided to postpone this feature and move it to a separate issue. |
I think 60s is acceptable. I guess it will only take this long if there are actually this many folders, right? Maybe you could test a couple more from our example remote dataset table. I have no strong opinion on depth 3 or 4. 3125 tests seems like too much, though. |
Yeah
Oh right, that might explain the difference between 30 sec and 60 sec 🤔
I'll do some more :) Edit: See https://scm.slack.com/archives/C5AKLAV0B/p1721293670420979 for testing results |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this is more readable, thanks! Of course, the logic is still pretty complicated, and with everything being async, it will always be a bit hard to read, but the task is fairly complicated, and I think this is certainly better than before :)
I added a couple more comments
...sos-datastore/app/com/scalableminds/webknossos/datastore/datavault/FileSystemDataVault.scala
Outdated
Show resolved
Hide resolved
...os-datastore/app/com/scalableminds/webknossos/datastore/datavault/GoogleCloudDataVault.scala
Outdated
Show resolved
Hide resolved
webknossos-datastore/app/com/scalableminds/webknossos/datastore/datavault/S3DataVault.scala
Outdated
Show resolved
Hide resolved
...datastore/app/com/scalableminds/webknossos/datastore/explore/ExploreRemoteLayerService.scala
Outdated
Show resolved
Hide resolved
...sos-datastore/app/com/scalableminds/webknossos/datastore/explore/N5MultiscalesExplorer.scala
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your feedback. I hopefully covered everything :)
try { | ||
val listObjectsRequest = new ListObjectsV2Request | ||
listObjectsRequest.setBucketName(bucketName) | ||
listObjectsRequest.setPrefix(keyPrefix) | ||
listObjectsRequest.setDelimiter("/") | ||
listObjectsRequest.setMaxKeys(maxItems) | ||
val objectListing = client.listObjectsV2(listObjectsRequest) | ||
val s3SubPrefixes = objectListing.getCommonPrefixes.asScala.toList | ||
Fox.successful(s3SubPrefixes) | ||
} catch { | ||
case e: AmazonServiceException => | ||
e.getStatusCode match { | ||
case 404 => Fox.empty | ||
case _ => Fox.failure(e.getMessage) | ||
} | ||
case e: Exception => Fox.failure(e.getMessage) | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have a question @fm3.
Should I use tryo here or this try and catch construct? Which one is better in which case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tryo is a shortcut, it will always produce a Box.Failure in case of any exception. In this code, however, we want to create different results based on a parameter of the exception (Fox.empty for status code 404), so we need the full try/catch to implement that custom logic.
...datastore/app/com/scalableminds/webknossos/datastore/explore/ExploreRemoteLayerService.scala
Outdated
Show resolved
Hide resolved
...sos-datastore/app/com/scalableminds/webknossos/datastore/explore/N5MultiscalesExplorer.scala
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good Stuff :) I’d say this is good to go
Could you update the PR description before merging? (The TODOs are solved, right?)
This PR adds recursive exploration to the already existing remote dataset exploration. This is supported for the local file system, GCS and S3.
URL of deployed dev instance (used for testing):
Steps to test:
gs://neuroglancer-fafb-data/fafb_v14/
. This should result in a successfully explored dataset.s3://janelia-cosem-datasets/jrc_mus-nacc-4/jrc_mus-nacc-4.zarr/
. This should result in a successfully explored dataset.Create a new local folders e.g.
<wk-root>/binaryData2/some_dir/more_dir
,<wk-root>/binaryData2/other_dir/more_dir
Add
<wk-root>/binaryData2/
to the whilelist in theapplication.conf
in line 197.Enter
file://
/binaryData2/` into the add remote dataset form. The request should fail and only include a short error message, not leaking any information about the underlying folder structure of the server.Add a new dataset (not wkw, as wkw exploration is not implemented) e.g.
l4_sample_zarr3_sharded
to<wk-root>/binaryData2/other_dir/more_dir
Enter
file://
/binaryData2/` into the add remote dataset form. The request should successfully find the dataset.TODOs:
-> I'd add a warning to the docs about the whitelisting feature to only include datasets in the subdirectories and not any kind of sensitive information like ssh key and such. Moreover, the debug log should not be exposed to the end users. At least in case a local file system is used. In case wk crawls remote cloud storages, the person using wk already has the necessary credentials (if necessary) to list the cloud storage. In that case wk does not leak any information the user would already have.
Issues:
(Please delete unneeded items, merge only when none are left open)