Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enable S3 compliant data vaults using https and http #8167

Merged
merged 2 commits into from
Nov 4, 2024

Conversation

frcroth
Copy link
Member

@frcroth frcroth commented Nov 4, 2024

URL of deployed dev instance (used for testing):

  • https://___.webknossos.xyz

Weird things happen when querying an object storage that supports HTTPS via HTTP. This PR first sends a get request to a datavault to check if it supports HTTPs and then uses it.
This way both HTTP and HTTPS should be supported (as sending HTTPS to a HTTP vault will cause the request to fail because of an SSL error).

This did not come up with #7453, since it was only tested on local (http) minio object storage.

Steps to test:

  • Explore s3://fsn1.your-objectstorage.com/webknossos-test/l4_sample with the access key id and secret access key
  • Explore a local minio object storage

Issues:


(Please delete unneeded items, merge only when none are left open)

  • Updated changelog
  • Needs datastore update after deployment

Summary by CodeRabbit

Release Notes

  • New Features

    • Added metadata support for annotations on Trees and Segments.
    • Introduced a summary row in the time tracking overview.
    • Enhanced slider functionality for improved user interaction.
    • Users can now search for unnamed segments using their full default names.
    • Support for remote OME-Zarr NGFF version 0.5 datasets.
    • S3-compliant object storage can now be accessed via HTTPS.
  • Bug Fixes

    • Resolved issues with dataset uploads and bbox export.
    • Fixed problems with sharing tokens during annotation saves.
    • Improved functionality for skeleton searches and public dataset annotations.
  • Chores

    • Updated dependencies and scripts for improved performance and compatibility.

Copy link

coderabbitai bot commented Nov 4, 2024

Walkthrough

The pull request introduces several enhancements and fixes across the WEBKNOSSOS application, including new features for metadata handling, improved search functionalities, and enhanced support for S3-compliant object storage. Notable changes include updates to the package.json for script and dependency management, modifications to the S3DataVault and DataVaultService classes to integrate a web service client, and improvements in test cases for better HTTP request handling. Additionally, the changelog has been updated to reflect these changes comprehensively.

Changes

File Path Change Summary
CHANGELOG.unreleased.md Updated to include new features like metadata for annotations, improved slider functionality, and S3 access.
package.json Scripts updated for startup process; dependencies updated to newer versions.
test/backend/DataVaultTestSuite.scala Integrated WsTestClient for HTTP requests in S3 data vault tests, ensuring consistent client usage.
webknossos-datastore/app/com/scalableminds/webknossos/datastore/datavault/S3DataVault.scala Updated constructor and methods to include WSClient for HTTP requests; added protocol determination logic.
webknossos-datastore/app/com/scalableminds/webknossos/datastore/storage/DataVaultService.scala Modified createVault method to include WSClient for S3 and HTTPS vault creation.

Assessment against linked issues

Objective Addressed Explanation
Debug compatibility with Hetzner Object Storage (#8153)

Possibly related PRs

Suggested labels

backend, testing

Suggested reviewers

  • MichaelBuessemeyer
  • hotzenklotz

Poem

In the meadow where data flows,
New features bloom like springtime's rose.
Metadata dances, sliders glide,
With S3 access, we take pride.
Bugs are fixed, and tests run free,
Hooray for changes, hop with glee! 🐇✨


Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

@frcroth frcroth self-assigned this Nov 4, 2024
@frcroth frcroth marked this pull request as ready for review November 4, 2024 14:56
@frcroth frcroth requested a review from fm3 November 4, 2024 14:58
@@ -123,7 +123,7 @@
"ts-coverage": "typescript-coverage-report",
"find-cyclic-dependencies": "yarn run dpdm -T --tree false --warning false --extensions .ts,.tsx frontend/javascripts/main.tsx",
"check-cyclic-dependencies": "node ./tools/check-cyclic-dependencies.js",
"startf": "yarn rm-fossil-lock; yarn kill-listeners; yarn start",
"startf": "yarn rm-fossil-lock; yarn kill-listeners; rm -r webknossos-jni/target; yarn start",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you mean to commit this? Seems like some that is covered by clean.sh already.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, because this is always annoying me because the webknossos-jni folder causes an error after using docker compose up (e.g. e2e tests). startf is "start forcefully", and this directory in particular is sometimes leading to me not being able to start.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 8

🧹 Outside diff range and nitpick comments (8)
webknossos-datastore/app/com/scalableminds/webknossos/datastore/storage/DataVaultService.scala (2)

50-50: Document protocol detection behavior.

Given that this change enables automatic protocol detection for S3 vaults, consider adding a code comment explaining this behavior to help future maintainers understand the purpose of passing WSClient.

+      // WSClient is used to detect if the S3 vault supports HTTPS by sending a test request
+      // before falling back to HTTP if necessary
       S3DataVault.create(remoteSource, ws)

Line range hint 41-64: Consider refactoring scheme handling for better maintainability.

The current implementation uses string comparisons for scheme detection. Consider extracting these into a sealed trait/enum for better type safety and maintainability.

Example refactor:

sealed trait VaultScheme {
  def value: String
}
object VaultScheme {
  case object S3 extends VaultScheme { val value = "s3" }
  case object HTTPS extends VaultScheme { val value = "https" }
  case object HTTP extends VaultScheme { val value = "http" }
  case object GS extends VaultScheme { val value = "gs" }
  case object File extends VaultScheme { val value = "file" }

  def fromString(scheme: String): Option[VaultScheme] = 
    values.find(_.value == scheme)
  
  def values: Seq[VaultScheme] = Seq(S3, HTTPS, HTTP, GS, File)
}
package.json (1)

126-126: Consider adding error handling to the startup script.

The new startf script combines cleanup and startup steps, which is helpful for development workflow. However, consider making it more robust:

-    "startf": "yarn rm-fossil-lock; yarn kill-listeners; rm -r webknossos-jni/target; yarn start",
+    "startf": "yarn rm-fossil-lock; yarn kill-listeners; rm -rf webknossos-jni/target 2>/dev/null || true; yarn start",

The changes:

  • Added -f flag to rm to prevent errors if files don't exist
  • Added 2>/dev/null || true to suppress errors and continue execution
test/backend/DataVaultTestSuite.scala (3)

141-149: Enhance data integrity verification in brotli-compression tests.

The current test only verifies the first 10 bytes of the decompressed data. Consider:

  1. Verifying the complete data integrity
  2. Adding test cases for different compression types
  3. Testing large file handling
  4. Adding partial read tests

180-203: Enhance directory listing test coverage.

Current tests could be more comprehensive. Consider adding test cases for:

  1. Empty directories
  2. Directories with exactly maxItems entries
  3. Pagination scenarios
  4. Special characters in paths
  5. Nested directory structures

Also, consider parameterizing the tests with different maxItems values instead of hardcoding 3:

Seq(0, 1, 5, 10).foreach { maxItems =>
  s"list available directories with maxItems=$maxItems" in {
    // Test implementation
  }
}

Based on the code analysis, I can now generate the final response:

Test suite needs focus on HTTPS fallback mechanism and data vault interactions

The current test suite focuses on basic data vault operations but misses critical scenarios:

  • The HTTPS fallback mechanism is not explicitly tested, despite being a core feature of HttpsDataVault
  • No tests verify the automatic fallback to HTTPS when other protocols fail
  • Missing edge cases for range requests and protocol switching

Suggested improvements:

  • Add test cases for protocol fallback scenarios in DataVaultTestSuite.scala
  • Include tests for supportsRangeRequests behavior in HttpsDataVault
  • Test error handling during protocol transitions
🔗 Analysis chain

Line range hint 1-277: Consider restructuring tests for better organization and coverage.

While the test suite provides good basic coverage, consider:

  1. Creating a dedicated test section for HTTPS fallback mechanism (PR's main objective)
  2. Using shared test fixtures to reduce duplication
  3. Organizing tests by feature rather than by request type
  4. Adding integration tests with Hetzner Object Storage (mentioned in PR objectives)

Let's verify the test coverage:

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check if there are any direct references to Hetzner storage in tests
rg -i "hetzner" --type scala

Length of output: 28


Script:

#!/bin/bash
# Let's check for any S3 or HTTPS fallback related code in the codebase
rg -A 5 "HttpsDataVault" --type scala

# Also check for any other test files related to data vaults
fd ".*Test.*\.scala$" | grep -i "vault"

Length of output: 5446


Script:

#!/bin/bash
# Let's check the implementation of HttpsDataVault to understand the fallback mechanism
ast-grep --pattern 'class HttpsDataVault $_' -A 30

# Also check for any S3 fallback related code
rg "fallback" --type scala

Length of output: 21184

webknossos-datastore/app/com/scalableminds/webknossos/datastore/datavault/S3DataVault.scala (2)

Line range hint 131-138: Potential Unhandled Exceptions in getObjectSummaries

While fetching object summaries, exceptions other than NoSuchBucketException may not be adequately handled, potentially causing the application to crash on unexpected errors.

Implement comprehensive exception handling to manage all possible exceptions from listObjectsV2:

objectListing <- notFoundToFailure(client.listObjectsV2(listObjectsRequest).asScala).recoverWith {
  case exception: SomeSpecificExceptionType =>
    // Handle specific exception
    Fox.failure(exception)
}

232-233: Sequential Dependency on Protocol Determination

The construction of the S3 client depends on the result of determineProtocol, which might introduce latency due to the network call. This could be problematic in high-throughput scenarios.

Consider decoupling client creation from protocol determination or caching the result to minimize repeated network calls.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 159f868 and af3bf71.

📒 Files selected for processing (5)
  • CHANGELOG.unreleased.md (1 hunks)
  • package.json (1 hunks)
  • test/backend/DataVaultTestSuite.scala (2 hunks)
  • webknossos-datastore/app/com/scalableminds/webknossos/datastore/datavault/S3DataVault.scala (7 hunks)
  • webknossos-datastore/app/com/scalableminds/webknossos/datastore/storage/DataVaultService.scala (1 hunks)
🔇 Additional comments (4)
webknossos-datastore/app/com/scalableminds/webknossos/datastore/storage/DataVaultService.scala (2)

Line range hint 41-64: Implementation aligns well with PR objectives.

The changes effectively address the Hetzner Object Storage compatibility issue by:

  1. Adding protocol detection capability through WSClient
  2. Maintaining consistent error handling
  3. Preserving backward compatibility with other vault types

Line range hint 1-29: Verify WSClient lifecycle management.

The injected WSClient is used across multiple vault types. Ensure proper cleanup of WSClient resources when the service is terminated.

webknossos-datastore/app/com/scalableminds/webknossos/datastore/datavault/S3DataVault.scala (2)

227-239: AWS Region Parsing May Fail for Custom Endpoints

The call to AwsHostNameUtils.parseSigningRegion may return None for custom or non-standard endpoints, leading to a default region of US_EAST_1. This might cause authentication failures with certain S3-compatible services.

Verify that the region is correctly determined for all intended endpoints. If necessary, allow configuration of the region or implement a fallback mechanism.


227-239: Ensure Thread Safety in Asynchronous Client Creation

The getAmazonS3Client method assembles the S3 client asynchronously. If multiple threads access clientFox, there might be potential race conditions during client initialization.

Please verify that clientFox is thread-safe and properly handles concurrent access. Run the following script to check for any shared mutable state that could lead to race conditions:

@@ -38,6 +38,7 @@ For upgrade instructions, please check the [migration guide](MIGRATIONS.released
- Users without edit permissions to a dataset can no longer delete sharing tokens via the API. [#8083](https://github.com/scalableminds/webknossos/issues/8083)
- Fixed downloading task annotations of teams you are not in, when accessing directly via URI. [#8155](https://github.com/scalableminds/webknossos/pull/8155)
- Deleting a bounding box is now possible independently of a visible segmentation layer. [#8164](https://github.com/scalableminds/webknossos/pull/8164)
- S3-compliant object storages can now be accessed via HTTPS. [#8167](https://github.com/scalableminds/webknossos/pull/8167)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Enhance and relocate the changelog entry.

The changelog entry should be moved to the "Added" section since it represents a new feature rather than a fix. Additionally, the description should be expanded to better explain the functionality.

Apply this diff to improve the changelog entry:

- - S3-compliant object storages can now be accessed via HTTPS. [#8167](https://github.com/scalableminds/webknossos/pull/8167)
+ ### Added
+ - Added automatic protocol detection for S3-compliant object storages, allowing seamless access via both HTTPS and HTTP. The system now automatically determines the supported protocol by first attempting HTTPS and falling back to HTTP if necessary. This improves compatibility with various object storage providers, including Hetzner Object Storage. [#8167](https://github.com/scalableminds/webknossos/pull/8167)

Committable suggestion skipped: line range outside the PR's diff.

Comment on lines +96 to +103
WsTestClient.withClient { ws =>
val vaultPath =
new VaultPath(uri, S3DataVault.create(RemoteSourceDescriptor(uri, None), ws)(globalExecutionContext))
val bytes =
(vaultPath / "s0/5/5/5").readBytes(Some(range))(globalExecutionContext).get(handleFoxJustification)
assert(bytes.length == range.length)
assert(bytes.take(10).sameElements(Array(0, 0, 0, 3, 0, 0, 0, 64, 0, 0)))
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Add test coverage for HTTPS fallback mechanism and edge cases.

While the basic range request test is good, consider adding test cases for:

  1. HTTPS fallback mechanism (key PR objective)
  2. Invalid range requests (e.g., overlapping ranges, out-of-bounds)
  3. Edge cases (e.g., range boundaries, empty ranges)

Example test structure:

"when vault supports HTTPS" should {
  "fallback to HTTPS successfully" in {
    WsTestClient.withClient { ws =>
      val uri = new URI("http://https-supporting-bucket.example.com")
      val vaultPath = new VaultPath(uri, S3DataVault.create(RemoteSourceDescriptor(uri, None), ws))
      // Test HTTPS fallback
    }
  }
}

Comment on lines +155 to +172
WsTestClient.withClient { ws =>
val s3DataVault = S3DataVault.create(RemoteSourceDescriptor(uri, None), ws)(globalExecutionContext)
val vaultPath = new VaultPath(uri, s3DataVault)
val result = vaultPath.readBytes()(globalExecutionContext).await(handleFoxJustification)
assertBoxEmpty(result)
}
}
}

"return empty box" when {
"requesting a non-existent object in existent bucket" in {
val uri = new URI(s"s3://open-neurodata/non-existent-object${UUID.randomUUID}")
val s3DataVault = S3DataVault.create(RemoteSourceDescriptor(uri, None))
val vaultPath = new VaultPath(uri, s3DataVault)
val result = vaultPath.readBytes()(globalExecutionContext).await(handleFoxJustification)
assertBoxEmpty(result)
WsTestClient.withClient { ws =>
val s3DataVault = S3DataVault.create(RemoteSourceDescriptor(uri, None), ws)(globalExecutionContext)
val vaultPath = new VaultPath(uri, s3DataVault)
val result = vaultPath.readBytes()(globalExecutionContext).await(handleFoxJustification)
assertBoxEmpty(result)
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Expand error handling test coverage.

While basic error cases are covered, consider adding tests for:

  1. Network errors (connection refused, timeout)
  2. Permission denied scenarios
  3. Rate limiting responses
  4. Timeout handling

Example test:

"handle network errors gracefully" in {
  WsTestClient.withClient { ws =>
    // Mock WS client to simulate network errors
    val mockWs = // ... setup mock
    val uri = new URI("s3://example-bucket")
    val s3DataVault = S3DataVault.create(RemoteSourceDescriptor(uri, None), mockWs)
    val vaultPath = new VaultPath(uri, s3DataVault)
    val result = vaultPath.readBytes()(globalExecutionContext).await(handleFoxJustification)
    assertBoxFailure(result)
  }
}

Comment on lines +45 to +49
class S3DataVault(s3AccessKeyCredential: Option[S3AccessKeyCredential],
uri: URI,
ws: WSClient,
implicit val ec: ExecutionContext)
extends DataVault {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Potential Performance Impact During Initialization

The constructor now includes a network call to determine the protocol using WSClient. This synchronous call during initialization can introduce latency and block the thread, potentially affecting application startup time and scalability.

Consider deferring the protocol determination until it is actually needed or making the network call asynchronously to avoid blocking the main execution thread.

Comment on lines +212 to +224
private def determineProtocol(uri: URI, ws: WSClient)(implicit ec: ExecutionContext): Fox[String] = {
// If the endpoint supports HTTPS, use it. Otherwise, use HTTP.
val httpsUri = new URI("https", uri.getAuthority, "", "", "")
val httpsFuture = ws.url(httpsUri.toString).get()

val protocolFuture = httpsFuture.transformWith({
case TrySuccess(_) => Future.successful("https")
case TryFailure(_) => Future.successful("http")
})
for {
protocol <- protocolFuture.toFox
} yield protocol
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Robustness of Protocol Determination Logic

The determineProtocol method relies on attempting an HTTPS request to infer protocol support. This approach may not be reliable due to potential transient network issues or endpoints that respond slowly, leading to false negatives.

Consider implementing a more reliable mechanism for protocol detection, such as:

  • Configurable protocol preference with fallback options.
  • Checking endpoint capabilities using a dedicated health-check endpoint if available.
  • Introducing retries with backoff strategies to handle transient failures.

Comment on lines +55 to +56
private lazy val clientFox: Fox[S3AsyncClient] =
S3DataVault.getAmazonS3Client(s3AccessKeyCredential, uri, ws)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Error Handling for Client Initialization Failures

The clientFox lazy initialization wraps getAmazonS3Client which can fail. If the initialization fails, subsequent calls will also fail without a clear indication.

Ensure that failures during client initialization are properly propagated and handled. Consider eagerly initializing the client and handling errors to prevent silent failures during runtime.

Comment on lines +150 to +157
def create(remoteSourceDescriptor: RemoteSourceDescriptor, ws: WSClient)(
implicit ec: ExecutionContext): S3DataVault = {
val credential = remoteSourceDescriptor.credential.flatMap {
case f: S3AccessKeyCredential => Some(f)
case f: LegacyDataVaultCredential => Some(f.toS3AccessKey)
case _ => None
}
new S3DataVault(credential, remoteSourceDescriptor.uri)
new S3DataVault(credential, remoteSourceDescriptor.uri, ws, ec)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Constructor Change May Break Existing Instantiations

The create method signature has changed to include WSClient. This might affect other parts of the codebase that instantiate S3DataVault without this parameter.

Ensure all instantiations of S3DataVault across the codebase are updated accordingly. Update documentation and constructor references.

Comment on lines +216 to +220

val protocolFuture = httpsFuture.transformWith({
case TrySuccess(_) => Future.successful("https")
case TryFailure(_) => Future.successful("http")
})
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Exception Handling May Mask Underlying Issues

In the determineProtocol method, exceptions during the HTTPS request are caught broadly, and any failure results in falling back to HTTP without logging the cause. This can mask underlying issues such as misconfigured SSL certificates or network problems.

Refine the exception handling to log the specific reasons for HTTPS failures. This can aid in debugging and ensure that genuine issues are not overlooked.

+ import play.api.Logger
...
  val protocolFuture = httpsFuture.transformWith {
    case TrySuccess(_) => Future.successful("https")
    case TryFailure(exception) =>
+     Logger.warn(s"HTTPS request failed: ${exception.getMessage}")
      Future.successful("http")
  }

Committable suggestion skipped: line range outside the PR's diff.

Copy link
Member

@fm3 fm3 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚀

@frcroth frcroth merged commit 96c9d7f into master Nov 4, 2024
3 checks passed
@frcroth frcroth deleted the fix-s3-compliant-https branch November 4, 2024 15:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Debug compatibility with Hetzner Object Storage
3 participants