Enable S3 compliant data vaults using https and http #8167

Merged: 2 commits, Nov 4, 2024

1 change: 1 addition & 0 deletions CHANGELOG.unreleased.md
Expand Up @@ -38,6 +38,7 @@ For upgrade instructions, please check the [migration guide](MIGRATIONS.released
- Users without edit permissions to a dataset can no longer delete sharing tokens via the API. [#8083](https://github.com/scalableminds/webknossos/issues/8083)
- Fixed downloading task annotations of teams you are not in, when accessing directly via URI. [#8155](https://github.com/scalableminds/webknossos/pull/8155)
- Deleting a bounding box is now possible independently of a visible segmentation layer. [#8164](https://github.com/scalableminds/webknossos/pull/8164)
+ - S3-compliant object storages can now be accessed via HTTPS. [#8167](https://github.com/scalableminds/webknossos/pull/8167)

🛠️ Refactor suggestion

Enhance and relocate the changelog entry.

The changelog entry should be moved to the "Added" section since it represents a new feature rather than a fix. Additionally, the description should be expanded to better explain the functionality.

Apply this diff to improve the changelog entry:

- - S3-compliant object storages can now be accessed via HTTPS. [#8167](https://github.com/scalableminds/webknossos/pull/8167)
+ ### Added
+ - Added automatic protocol detection for S3-compliant object storages, allowing seamless access via both HTTPS and HTTP. The system now automatically determines the supported protocol by first attempting HTTPS and falling back to HTTP if necessary. This improves compatibility with various object storage providers, including Hetzner Object Storage. [#8167](https://github.com/scalableminds/webknossos/pull/8167)

Committable suggestion skipped: line range outside the PR's diff.
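
A minimal sketch of the HTTPS-first behaviour described in this entry, using only standard-library futures. `probeHttps` is a hypothetical stand-in for the real HTTPS request and `ProtocolDetectionSketch` is an illustrative name; neither is part of the WEBKNOSSOS code.

```scala
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._
import scala.util.{Failure, Success}

object ProtocolDetectionSketch {
  // `probeHttps` is an assumption: any function that starts an HTTPS request
  // against the endpoint and completes successfully if the endpoint answered.
  def determineProtocol(probeHttps: () => Future[Unit])(implicit ec: ExecutionContext): Future[String] =
    probeHttps().transformWith {
      case Success(_) => Future.successful("https") // the endpoint answered over HTTPS
      case Failure(_) => Future.successful("http")  // any failure falls back to HTTP
    }
}
```

Note that every failure of the probe, including a transient one, selects HTTP in this sketch.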


### Removed

2 changes: 1 addition & 1 deletion package.json
Expand Up @@ -123,7 +123,7 @@
"ts-coverage": "typescript-coverage-report",
"find-cyclic-dependencies": "yarn run dpdm -T --tree false --warning false --extensions .ts,.tsx frontend/javascripts/main.tsx",
"check-cyclic-dependencies": "node ./tools/check-cyclic-dependencies.js",
- "startf": "yarn rm-fossil-lock; yarn kill-listeners; yarn start",
+ "startf": "yarn rm-fossil-lock; yarn kill-listeners; rm -r webknossos-jni/target; yarn start",

Member

Did you mean to commit this? Seems like something that is already covered by clean.sh.

Member Author

Yes. The webknossos-jni folder keeps causing an error after running docker compose up (e.g. for e2e tests), which annoys me every time. startf means "start forcefully", and this directory in particular sometimes prevents me from starting at all.

"beautify-front": "yarn fix-frontend && yarn typecheck",
"beautify": "yarn format-backend && yarn beautify-front"
},
87 changes: 50 additions & 37 deletions test/backend/DataVaultTestSuite.scala
Expand Up @@ -93,11 +93,14 @@ class DataVaultTestSuite extends PlaySpec {
"using S3 data vault" should {
"return correct response" in {
val uri = new URI("s3://janelia-cosem-datasets/jrc_hela-3/jrc_hela-3.n5/em/fibsem-uint16/")
- val vaultPath = new VaultPath(uri, S3DataVault.create(RemoteSourceDescriptor(uri, None)))
- val bytes =
-   (vaultPath / "s0/5/5/5").readBytes(Some(range))(globalExecutionContext).get(handleFoxJustification)
- assert(bytes.length == range.length)
- assert(bytes.take(10).sameElements(Array(0, 0, 0, 3, 0, 0, 0, 64, 0, 0)))
+ WsTestClient.withClient { ws =>
+   val vaultPath =
+     new VaultPath(uri, S3DataVault.create(RemoteSourceDescriptor(uri, None), ws)(globalExecutionContext))
+   val bytes =
+     (vaultPath / "s0/5/5/5").readBytes(Some(range))(globalExecutionContext).get(handleFoxJustification)
+   assert(bytes.length == range.length)
+   assert(bytes.take(10).sameElements(Array(0, 0, 0, 3, 0, 0, 0, 64, 0, 0)))
+ }
Comment on lines +96 to +103

🛠️ Refactor suggestion

Add test coverage for the HTTPS-to-HTTP fallback mechanism and edge cases.

While the basic range request test is good, consider adding test cases for:

  1. The protocol fallback, where HTTPS is tried first and HTTP is used if it fails (key PR objective)
  2. Invalid range requests (e.g., overlapping ranges, out-of-bounds)
  3. Edge cases (e.g., range boundaries, empty ranges)

Example test structure:

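"when the endpoint supports HTTPS" should {
  "use HTTPS without falling back to HTTP" in {
    WsTestClient.withClient { ws =>
      val uri = new URI("http://https-supporting-bucket.example.com")
      // create takes the execution context in a second parameter list
      val vaultPath =
        new VaultPath(uri, S3DataVault.create(RemoteSourceDescriptor(uri, None), ws)(globalExecutionContext))
      // Test that HTTPS is selected, and that HTTP is only used when the HTTPS request fails
    }
  }
}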

}
}
}
Expand Down Expand Up @@ -135,59 +138,69 @@ class DataVaultTestSuite extends PlaySpec {
"using s3 data vault" should {
"return correctly decoded brotli-compressed data" in {
val uri = new URI("s3://open-neurodata/bock11/image/4_4_40")
- val vaultPath = new VaultPath(uri, S3DataVault.create(RemoteSourceDescriptor(uri, None)))
- val bytes =
-   (vaultPath / "33792-34304_29696-30208_3216-3232")
-     .readBytes()(globalExecutionContext)
-     .get(handleFoxJustification)
- assert(bytes.take(10).sameElements(Array(-87, -95, -85, -94, -101, 124, 115, 100, 113, 111)))
+ WsTestClient.withClient { ws =>
+   val vaultPath =
+     new VaultPath(uri, S3DataVault.create(RemoteSourceDescriptor(uri, None), ws)(globalExecutionContext))
+   val bytes =
+     (vaultPath / "33792-34304_29696-30208_3216-3232")
+       .readBytes()(globalExecutionContext)
+       .get(handleFoxJustification)
+   assert(bytes.take(10).sameElements(Array(-87, -95, -85, -94, -101, 124, 115, 100, 113, 111)))
+ }
}

"return empty box" when {
"requesting a non-existent bucket" in {
val uri = new URI(s"s3://non-existent-bucket${UUID.randomUUID}/non-existent-object")
- val s3DataVault = S3DataVault.create(RemoteSourceDescriptor(uri, None))
- val vaultPath = new VaultPath(uri, s3DataVault)
- val result = vaultPath.readBytes()(globalExecutionContext).await(handleFoxJustification)
- assertBoxEmpty(result)
+ WsTestClient.withClient { ws =>
+   val s3DataVault = S3DataVault.create(RemoteSourceDescriptor(uri, None), ws)(globalExecutionContext)
+   val vaultPath = new VaultPath(uri, s3DataVault)
+   val result = vaultPath.readBytes()(globalExecutionContext).await(handleFoxJustification)
+   assertBoxEmpty(result)
+ }
}
}

"return empty box" when {
"requesting a non-existent object in existent bucket" in {
val uri = new URI(s"s3://open-neurodata/non-existent-object${UUID.randomUUID}")
- val s3DataVault = S3DataVault.create(RemoteSourceDescriptor(uri, None))
- val vaultPath = new VaultPath(uri, s3DataVault)
- val result = vaultPath.readBytes()(globalExecutionContext).await(handleFoxJustification)
- assertBoxEmpty(result)
+ WsTestClient.withClient { ws =>
+   val s3DataVault = S3DataVault.create(RemoteSourceDescriptor(uri, None), ws)(globalExecutionContext)
+   val vaultPath = new VaultPath(uri, s3DataVault)
+   val result = vaultPath.readBytes()(globalExecutionContext).await(handleFoxJustification)
+   assertBoxEmpty(result)
+ }
Comment on lines +155 to +172

🛠️ Refactor suggestion

Expand error handling test coverage.

While basic error cases are covered, consider adding tests for:

  1. Network errors (connection refused, timeout)
  2. Permission denied scenarios
  3. Rate limiting responses
  4. Timeout handling

Example test:

"handle network errors gracefully" in {
  WsTestClient.withClient { ws =>
    // Placeholder: replace ??? with a WSClient mock that simulates network errors
    val mockWs: WSClient = ???
    val uri = new URI("s3://example-bucket")
    // create takes the execution context in a second parameter list
    val s3DataVault = S3DataVault.create(RemoteSourceDescriptor(uri, None), mockWs)(globalExecutionContext)
    val vaultPath = new VaultPath(uri, s3DataVault)
    val result = vaultPath.readBytes()(globalExecutionContext).await(handleFoxJustification)
    assertBoxFailure(result)
  }
}

}
}
}
}

"using directory list requests" when {
val uri = new URI("s3://janelia-cosem-datasets/jrc_hela-3/jrc_hela-3.n5/em/fibsem-uint16/")
- val vaultPath = new VaultPath(uri, S3DataVault.create(RemoteSourceDescriptor(uri, None)))

- "using s3 data vault" should {
-   "list available directories" in {
-     val result = vaultPath.listDirectory(maxItems = 3)(globalExecutionContext).get(handleFoxJustification)
-     assert(result.length == 3)
-     assert(
-       result.exists(
-         _.toUri == new URI("s3://janelia-cosem-datasets/jrc_hela-3/jrc_hela-3.n5/em/fibsem-uint16/s0/")))
-   }
+ WsTestClient.withClient { ws =>
+   val vaultPath =
+     new VaultPath(uri, S3DataVault.create(RemoteSourceDescriptor(uri, None), ws)(globalExecutionContext))
+
+   "using s3 data vault" should {
+     "list available directories" in {
+       val result = vaultPath.listDirectory(maxItems = 3)(globalExecutionContext).get(handleFoxJustification)
+       assert(result.length == 3)
+       assert(
+         result.exists(
+           _.toUri == new URI("s3://janelia-cosem-datasets/jrc_hela-3/jrc_hela-3.n5/em/fibsem-uint16/s0/")))
+     }

- "return failure" when {
-   "requesting directory listing on non-existent bucket" in {
-     val uri = new URI(f"s3://non-existent-bucket${UUID.randomUUID}/non-existent-object/")
-     val s3DataVault = S3DataVault.create(RemoteSourceDescriptor(uri, None))
-     val vaultPath = new VaultPath(uri, s3DataVault)
-     val result = vaultPath.listDirectory(maxItems = 5)(globalExecutionContext).await(handleFoxJustification)
-     assertBoxFailure(result)
+ "return failure" when {
+   "requesting directory listing on non-existent bucket" in {
+     val uri = new URI(f"s3://non-existent-bucket${UUID.randomUUID}/non-existent-object/")
+     val s3DataVault = S3DataVault.create(RemoteSourceDescriptor(uri, None), ws)(globalExecutionContext)
+     val vaultPath = new VaultPath(uri, s3DataVault)
+     val result = vaultPath.listDirectory(maxItems = 5)(globalExecutionContext).await(handleFoxJustification)
+     assertBoxFailure(result)
}
}
}

}
}
}

@@ -1,7 +1,7 @@
package com.scalableminds.webknossos.datastore.datavault

import com.scalableminds.util.tools.Fox
- import com.scalableminds.util.tools.Fox.box2Fox
+ import com.scalableminds.util.tools.Fox.{box2Fox, future2Fox}
import com.scalableminds.webknossos.datastore.storage.{
LegacyDataVaultCredential,
RemoteSourceDescriptor,
Expand All @@ -10,6 +10,7 @@ import com.scalableminds.webknossos.datastore.storage.{
import net.liftweb.common.Box.tryo
import net.liftweb.common.{Box, Empty, Full, Failure => BoxFailure}
import org.apache.commons.lang3.builder.HashCodeBuilder
+ import play.api.libs.ws.WSClient
import software.amazon.awssdk.auth.credentials.{
AnonymousCredentialsProvider,
AwsBasicCredentials,
Expand Down Expand Up @@ -41,14 +42,18 @@ import scala.jdk.FutureConverters._
import scala.jdk.OptionConverters.RichOptional
import scala.util.{Failure => TryFailure, Success => TrySuccess}

- class S3DataVault(s3AccessKeyCredential: Option[S3AccessKeyCredential], uri: URI) extends DataVault {
+ class S3DataVault(s3AccessKeyCredential: Option[S3AccessKeyCredential],
+                   uri: URI,
+                   ws: WSClient,
+                   implicit val ec: ExecutionContext)
+     extends DataVault {
Comment on lines +45 to +49

⚠️ Potential issue

Potential Performance Impact During Initialization

The constructor now includes a network call to determine the protocol using WSClient. This synchronous call during initialization can introduce latency and block the thread, potentially affecting application startup time and scalability.

Consider deferring the protocol determination until it is actually needed or making the network call asynchronously to avoid blocking the main execution thread.
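
For reference, a `lazy val` that holds a future already defers the work until first access and reuses the result afterwards, which appears to be the pattern `clientFox` follows in this diff. The sketch below uses plain `Future` as a stand-in for `Fox`; `LazyProtocol`, `LazyProtocolDemo` and `countingProbe` are hypothetical names used only for illustration.

```scala
import java.util.concurrent.atomic.AtomicInteger
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

// Hypothetical holder, not the actual S3DataVault: the probe is only started
// when `protocol` is first accessed, and the same Future is returned afterwards.
class LazyProtocol(probe: () => Future[String]) {
  lazy val protocol: Future[String] = probe()
}

object LazyProtocolDemo {
  // Counts how often the (simulated) network probe actually runs.
  val probeCount = new AtomicInteger(0)

  def countingProbe()(implicit ec: ExecutionContext): Future[String] =
    Future { probeCount.incrementAndGet(); "https" }
}
```

Constructing a `LazyProtocol` performs no request; the probe runs once, on the first access of `protocol`.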

private lazy val bucketName = S3DataVault.hostBucketFromUri(uri) match {
case Some(value) => value
case None => throw new Exception(s"Could not parse S3 bucket for ${uri.toString}")
}

- private lazy val client: S3AsyncClient =
-   S3DataVault.getAmazonS3Client(s3AccessKeyCredential, uri)
+ private lazy val clientFox: Fox[S3AsyncClient] =
+   S3DataVault.getAmazonS3Client(s3AccessKeyCredential, uri, ws)
Comment on lines +55 to +56

⚠️ Potential issue

Error Handling for Client Initialization Failures

The clientFox lazy initialization wraps getAmazonS3Client which can fail. If the initialization fails, subsequent calls will also fail without a clear indication.

Ensure that failures during client initialization are properly propagated and handled. Consider eagerly initializing the client and handling errors to prevent silent failures during runtime.
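
One possible way to keep a failed initialization from being cached forever is sketched below. It uses plain `Future` as a stand-in for `Fox`, and `RetryOnFailure` is a hypothetical helper that does not exist in the PR.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._

// Hypothetical helper, not part of the PR: caches the initialization future,
// but starts a fresh attempt once the cached future has completed with a failure.
class RetryOnFailure[A](init: () => Future[A]) {
  private var cached: Option[Future[A]] = None

  def get: Future[A] = synchronized {
    cached match {
      // Reuse the cached future while it is still running or has succeeded.
      case Some(f) if !f.value.exists(_.isFailure) => f
      // Nothing cached yet, or the cached attempt failed: initialize again.
      case _ =>
        val f = init()
        cached = Some(f)
        f
    }
  }
}
```

Callers still see the failure of the attempt they triggered, so errors are propagated rather than hidden; only later calls get a new attempt.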


private def getRangeRequest(bucketName: String, key: String, range: NumericRange[Long]): GetObjectRequest =
GetObjectRequest.builder().bucket(bucketName).key(key).range(s"bytes=${range.start}-${range.end - 1}").build()
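
The `range.end - 1` above converts an exclusive Scala range into the inclusive byte positions of an HTTP `Range` header. A small standalone sketch of that conversion, assuming an exclusive range with step 1 (`RangeHeaderSketch` is an illustrative name, not part of the PR):

```scala
import scala.collection.immutable.NumericRange

object RangeHeaderSketch {
  // Assumes `range` is exclusive (built with `until`) and has step 1,
  // so its last included byte is `range.end - 1`.
  def rangeHeader(range: NumericRange[Long]): String =
    s"bytes=${range.start}-${range.end - 1}"
}
```

For example, `RangeHeaderSketch.rangeHeader(0L until 1024L)` yields `bytes=0-1023`, which covers exactly 1024 bytes.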
Expand All @@ -64,6 +69,7 @@ class S3DataVault(s3AccessKeyCredential: Option[S3AccessKeyCredential], uri: URI
val responseTransformer: AsyncResponseTransformer[GetObjectResponse, ResponseBytes[GetObjectResponse]] =
AsyncResponseTransformer.toBytes
for {
+ client <- clientFox
responseBytesObject: ResponseBytes[GetObjectResponse] <- notFoundToEmpty(
client.getObject(request, responseTransformer).asScala)
encoding = responseBytesObject.response().contentEncoding()
Expand Down Expand Up @@ -122,6 +128,7 @@ class S3DataVault(s3AccessKeyCredential: Option[S3AccessKeyCredential], uri: URI
val listObjectsRequest =
ListObjectsV2Request.builder().bucket(bucketName).prefix(keyPrefix).delimiter("/").maxKeys(maxKeys).build()
for {
+ client <- clientFox
objectListing: ListObjectsV2Response <- notFoundToFailure(client.listObjectsV2(listObjectsRequest).asScala)
s3SubPrefixes: List[CommonPrefix] = objectListing.commonPrefixes().asScala.take(maxItems).toList
} yield s3SubPrefixes.map(_.prefix())
Expand All @@ -140,13 +147,14 @@ class S3DataVault(s3AccessKeyCredential: Option[S3AccessKeyCredential], uri: URI
}

object S3DataVault {
- def create(remoteSourceDescriptor: RemoteSourceDescriptor): S3DataVault = {
+ def create(remoteSourceDescriptor: RemoteSourceDescriptor, ws: WSClient)(
+     implicit ec: ExecutionContext): S3DataVault = {
val credential = remoteSourceDescriptor.credential.flatMap {
case f: S3AccessKeyCredential => Some(f)
case f: LegacyDataVaultCredential => Some(f.toS3AccessKey)
case _ => None
}
- new S3DataVault(credential, remoteSourceDescriptor.uri)
+ new S3DataVault(credential, remoteSourceDescriptor.uri, ws, ec)
Comment on lines +150 to +157

⚠️ Potential issue

Constructor Change May Break Existing Instantiations

The create method signature has changed to include WSClient. This might affect other parts of the codebase that instantiate S3DataVault without this parameter.

Ensure all instantiations of S3DataVault across the codebase are updated accordingly. Update documentation and constructor references.

}

private def hostBucketFromUri(uri: URI): Option[String] = {
Expand Down Expand Up @@ -201,16 +209,34 @@ object S3DataVault {
private def isNonAmazonHost(uri: URI): Boolean =
(isPathStyle(uri) && !uri.getHost.endsWith(".amazonaws.com")) || uri.getHost == "localhost"

- private def getAmazonS3Client(credentialOpt: Option[S3AccessKeyCredential], uri: URI): S3AsyncClient = {
+ private def determineProtocol(uri: URI, ws: WSClient)(implicit ec: ExecutionContext): Fox[String] = {
+   // If the endpoint supports HTTPS, use it. Otherwise, use HTTP.
+   val httpsUri = new URI("https", uri.getAuthority, "", "", "")
+   val httpsFuture = ws.url(httpsUri.toString).get()
+
+   val protocolFuture = httpsFuture.transformWith({
+     case TrySuccess(_) => Future.successful("https")
+     case TryFailure(_) => Future.successful("http")
+   })
Comment on lines +216 to +220

🛠️ Refactor suggestion

Exception Handling May Mask Underlying Issues

In the determineProtocol method, exceptions during the HTTPS request are caught broadly, and any failure results in falling back to HTTP without logging the cause. This can mask underlying issues such as misconfigured SSL certificates or network problems.

Refine the exception handling to log the specific reasons for HTTPS failures. This can aid in debugging and ensure that genuine issues are not overlooked.

+ import play.api.Logger
...
  val protocolFuture = httpsFuture.transformWith {
    case TrySuccess(_) => Future.successful("https")
    case TryFailure(exception) =>
+     Logger.warn(s"HTTPS request failed: ${exception.getMessage}")
      Future.successful("http")
  }

Committable suggestion skipped: line range outside the PR's diff.

+   for {
+     protocol <- protocolFuture.toFox
+   } yield protocol
+ }
Comment on lines +212 to +224

🛠️ Refactor suggestion

Robustness of Protocol Determination Logic

The determineProtocol method relies on attempting an HTTPS request to infer protocol support. This approach may not be reliable due to potential transient network issues or endpoints that respond slowly, leading to false negatives.

Consider implementing a more reliable mechanism for protocol detection, such as:

  • Configurable protocol preference with fallback options.
  • Checking endpoint capabilities using a dedicated health-check endpoint if available.
  • Introducing retries with backoff strategies to handle transient failures.
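
The retry suggestion could look roughly like the following sketch, written against standard-library futures only. `RetrySketch` and `withBackoff` are hypothetical names, and the blocking `Thread.sleep` is a simplification; a production version would more likely use a non-blocking scheduler.

```scala
import scala.concurrent.{Await, ExecutionContext, Future, blocking}
import scala.concurrent.duration._

object RetrySketch {
  // Runs `attempt`; on failure, waits `delay` and tries again with a doubled
  // delay, up to `retries` additional times. The last failure is returned as is.
  def withBackoff[A](retries: Int, delay: FiniteDuration)(attempt: () => Future[A])(
      implicit ec: ExecutionContext): Future[A] =
    attempt().recoverWith {
      case _ if retries > 0 =>
        // Simplification: sleeps on a pool thread instead of using a scheduler.
        Future(blocking(Thread.sleep(delay.toMillis)))
          .flatMap(_ => withBackoff(retries - 1, delay * 2)(attempt))
    }
}
```

With such a helper, the HTTPS probe would be wrapped in `withBackoff`, and the fallback to HTTP would only happen after the retries are exhausted.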


+ private def getAmazonS3Client(credentialOpt: Option[S3AccessKeyCredential], uri: URI, ws: WSClient)(
+     implicit ec: ExecutionContext): Fox[S3AsyncClient] = {
  val basic =
    S3AsyncClient.builder().credentialsProvider(getCredentialsProvider(credentialOpt)).crossRegionAccessEnabled(true)
- if (isNonAmazonHost(uri))
-   basic
-     .forcePathStyle(true)
-     .endpointOverride(new URI(s"http://${uri.getAuthority}"))
-     .region(AwsHostNameUtils.parseSigningRegion(uri.getAuthority, "s3").toScala.getOrElse(Region.US_EAST_1))
-     .build()
- else basic.region(Region.US_EAST_1).build()
+ if (isNonAmazonHost(uri)) {
+   for {
+     protocol <- determineProtocol(uri, ws)
+   } yield
+     basic
+       .forcePathStyle(true)
+       .endpointOverride(new URI(s"${protocol}://${uri.getAuthority}"))
+       .region(AwsHostNameUtils.parseSigningRegion(uri.getAuthority, "s3").toScala.getOrElse(Region.US_EAST_1))
+       .build()
+ } else Fox.successful(basic.region(Region.US_EAST_1).build())
}

}
Expand Up @@ -47,7 +47,7 @@ class DataVaultService @Inject()(ws: WSClient) extends LazyLogging {
val fs: DataVault = if (scheme == DataVaultService.schemeGS) {
GoogleCloudDataVault.create(remoteSource)
} else if (scheme == DataVaultService.schemeS3) {
- S3DataVault.create(remoteSource)
+ S3DataVault.create(remoteSource, ws)
} else if (scheme == DataVaultService.schemeHttps || scheme == DataVaultService.schemeHttp) {
HttpsDataVault.create(remoteSource, ws)
} else if (scheme == DataVaultService.schemeFile) {