Conversation
src/main/java/bio/terra/workspace/service/datareference/DataReferenceService.java
        referenceRequest.workspaceId(),
        referenceRequest.referenceType(),
        referenceRequest.name());
    throw new DuplicateDataReferenceException(
It would be good to include the name, type, and workspace ID in the error message.
My concern with that is that name is a user provided string, and the input from security has been to avoid directly returning user-provided strings where possible. A common pattern we use instead is returning an error message without the values and logging the actual values, which I've added here.
We will be returning the name string in many cases. It is the user's name for the object and we will need to return it in get and enumerate requests. If we need to constrain the contents of the name, we should do that. That seems to be Google's approach; for example, BigQuery dataset naming
And there is nothing dangerous about returning the reference type or workspaceId is there?
So if we fixed all of that, you could skip the logging and just throw and let the global exception handler do the logging.
Makes sense, I'll add bucket and dataset name validation and then return the names as part of the error response.
src/main/java/bio/terra/workspace/service/datareference/model/BigQueryDatasetReference.java
src/main/java/bio/terra/workspace/service/datareference/utils/DataReferenceValidationUtils.java
src/main/java/bio/terra/workspace/service/datareference/utils/DataReferenceValidationUtils.java
        new DataRepoSnapshot().snapshot("snapshot-name").instanceName("instance-name");
    CreateDataReferenceRequestBody deprecatedRequest =
        new CreateDataReferenceRequestBody()
            .referenceType(ReferenceTypeEnum.DATA_REPO_SNAPSHOT)
I think it's not too late to rename ReferenceTypeEnum to ReferenceType. Clients have access to that name but since the ordinals are the same it should just work, right?
src/test/java/bio/terra/workspace/service/datareference/DataReferenceServiceTest.java
src/test/java/bio/terra/workspace/service/datareference/model/ReferenceObjectTest.java
jaycarlton left a comment:
Thanks for the detailed responses!
FYI @MatthewBemis for BigQuery references and @zarsky-broad for APIs related to managing references.
davidangb left a comment:
👍 overall, some non-blocking comments inline.
    switch (referenceType()) {
      case DATA_REPO_SNAPSHOT:
        // TODO: we populate both the deprecated reference field and the new referenceInfo fields
        // until all clients have migrated to use the new field.
this will require more than just adopting the new client library - for instance, the Terra UI is looking inside the reference key to find snapshot ids. We can migrate but it'll be more work.
Hm, good to know. I think we would still want to migrate, but it sounds like it's going to be worth coordinating more on this. I've currently been tracking this as PF-404, but we may end up with AS tickets as well.
    } catch (BigQueryException e) {
      throw new InvalidDataReferenceException("Error while trying to access BigQuery dataset", e);
    }
    }
I wonder if each ReferenceObject subclass should define their own validate() method inside their own class - similar to how toJson() is defined. Then we'd just delegate to their own implementations. This might require passing a CrlService or DataRepoService to it.
Having to pass the services to the objects was the reason I didn't want to take that approach - it feels like we're drifting further away from these being value classes. (They already kind of weren't with things like toApiModel(), but adding more members makes serialization/deserialization trickier).
What is the benefit of them being value classes if we have to write switch statements to dispatch methods on them?
I think it would be a good separation of concerns to have the reference objects encapsulate the knowledge of the cloud resource they reference.
The value is that they can be serialized into the database and back out, which is how we store the actual references. If we require the reference objects to have a DataRepoService field, that gets more complicated - do we serialize the DataRepoService field with the reference? and if not, how do you deserialize into a valid reference object?
I think there is a middle ground. I wrote a doc for further discussion: Workspace Manager Object Hierarchy
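One shape that middle ground could take (a sketch with hypothetical names, not the actual workspace-manager classes): each reference subclass owns its validate() logic, dispatched polymorphically with no switch statement, but the service dependency is passed in as a method argument rather than held as a field, so the objects remain plain serializable value classes.

```java
// Hypothetical sketch of the "middle ground": validation is dispatched
// polymorphically (no switch statement), but the cloud lookup is a method
// parameter, so the reference object carries no service field and stays a
// plain, serializable value class. All names here are illustrative.
interface CloudLookup {
  // Stand-in for the CrlService / DataRepoService calls the real code makes.
  boolean exists(String resourceName);
}

abstract class ReferenceObjectSketch {
  abstract void validate(CloudLookup lookup);
}

class BucketReferenceSketch extends ReferenceObjectSketch {
  private final String bucketName;

  BucketReferenceSketch(String bucketName) {
    this.bucketName = bucketName;
  }

  @Override
  void validate(CloudLookup lookup) {
    // The subclass knows which cloud resource it references and how to check it.
    if (!lookup.exists(bucketName)) {
      throw new IllegalArgumentException("Error while trying to access GCS bucket");
    }
  }
}
```

Because the lookup arrives as an argument, serializing and deserializing the reference object stays trivial: there is no service field to exclude or reconstruct.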
    # PLACEHOLDER_VALUE should be removed when we have an actual second value
    # to use.
    enum: ['DATA_REPO_SNAPSHOT', 'PLACEHOLDER_VALUE'] # eventually include GCS bucket, etc.
    enum: ['DATA_REPO_SNAPSHOT', 'GOOGLE_BUCKET', 'BIG_QUERY_DATASET']
glad to have PLACEHOLDER_VALUE out of here!
Same, that truncation is still one of the strangest opinions I've seen swagger take.
    # type: string
    # datasetName:
    #   description: The name of this dataset
    #   type: string

ddietterich left a comment:
I think there is an upward compatibility issue and maybe a validation issue.
    description: The ID of the Data Repo snapshot
    type: string

    # GoogleBucket:
Nope, we replaced those definitions with shared ones from the common library and I forgot to remove them
    CreateDataReferenceRequestBody:
      type: object
      required: [name, cloningInstructions]
      required: [name, referenceType, cloningInstructions]
Does that cause an upward compatibility problem with Rawls?
If they can specify the type, then they can also change from using reference to using referenceInfo.
Otherwise, I think we need to leave it optional - maybe with a default value?
I think that's a good point. Default it to data repo snapshot and leave it optional.
Rawls is already specifying type, and this was previously enforced by rejecting requests with no referenceType field in ControllerValidationUtils.
We did that validation as part of the app logic instead of putting it in Swagger because the request shape was (kind of) also used for controlled resource requests, so only common fields between the two were made required. Now that we're removing all the unused controlled resource stuff here, we can move referenceType to required. I'd also like to move referenceInfo there once we stop using the reference field.
    final DataReferenceInfo referenceInfo = body.getReferenceInfo();
    final ReferenceObject referenceObject;
    if (body.getReference() != null) {
      referenceObject =
It looks like the referenceType is required in the OpenAPI yaml, so should we verify that it is set to data repo snapshot in this case?
I see later that there is a request validator, but it will error if the referenceType is not specified which seems incompatible.
We also previously errored in ControllerValidationUtils if referenceType was not specified, so that is actually compatible.
    // TODO(PF-404): requests using the deprecated reference field will not set the referenceInfo
    // field. referenceInfo can be made required and this check can be removed once clients migrate.
    DataReferenceInfo info =
        body.getReferenceInfo() == null ? new DataReferenceInfo() : body.getReferenceInfo();
But that reference is not filled in with anything. Either this is broken or I am confused...
This handles the existing case where a client specifies the old reference field but does not specify the new referenceInfo field. You can't have bucket or dataset references this way (as getGoogleBucket() and getBigQueryDataset() on this reference will be null), but you can specify a snapshot in the old reference field this way, and we want to support that until we move clients. That's why we check either the referenceInfo or the reference field when validating snapshots here.
(As above, existing clients do already use the DATA_REPO_SNAPSHOT referenceType, so that hasn't changed)
    typeMap.put(DataReferenceType.GOOGLE_BUCKET, ReferenceTypeEnum.GOOGLE_BUCKET);
    typeMap.put(DataReferenceType.BIG_QUERY_DATASET, ReferenceTypeEnum.BIG_QUERY_DATASET);

    sqlMap.put(DATA_REPO_SNAPSHOT, "DATA_REPO_SNAPSHOT");
This map is unnecessary. The enum's built-in name() and valueOf() methods will do the same thing, and you can convert to them without changing your interface.
The idea of the sql map is to have invariant names, so that even if they refactor the enum names in Java, we can serialize correctly. .name() is too smart for this purpose, and would return the new name.
Cutting out the typeMap seems fine. I'm a little less sure about sqlMap for the reason Jay stated, since this value does get serialized directly.
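To make the trade-off concrete, here is a minimal sketch (illustrative names, not the actual DAO code) of the invariant-name pattern: the serialized string is pinned by an explicit map, so a later Java-side rename of an enum constant cannot silently change what is written to or read from the database, which a name()/valueOf() round-trip would.

```java
import java.util.EnumMap;
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch, not the actual workspace-manager DAO: the serialized
// form is pinned by an explicit map, so renaming a Java enum constant later
// cannot change the stored value, unlike Enum.name()/valueOf() round-trips.
public class SqlNameSketch {
  public enum DataReferenceType { DATA_REPO_SNAPSHOT, GOOGLE_BUCKET, BIG_QUERY_DATASET }

  private static final Map<DataReferenceType, String> SQL_MAP =
      new EnumMap<>(DataReferenceType.class);
  private static final Map<String, DataReferenceType> INVERSE_SQL_MAP = new HashMap<>();

  static {
    SQL_MAP.put(DataReferenceType.DATA_REPO_SNAPSHOT, "DATA_REPO_SNAPSHOT");
    SQL_MAP.put(DataReferenceType.GOOGLE_BUCKET, "GOOGLE_BUCKET");
    SQL_MAP.put(DataReferenceType.BIG_QUERY_DATASET, "BIG_QUERY_DATASET");
    // Build the inverse map for deserialization from stored rows.
    SQL_MAP.forEach((type, sql) -> INVERSE_SQL_MAP.put(sql, type));
  }

  public static String toSql(DataReferenceType type) {
    return SQL_MAP.get(type);
  }

  public static DataReferenceType fromSql(String sql) {
    return INVERSE_SQL_MAP.get(sql);
  }
}
```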
    static {
      typeMap.put(DataReferenceType.DATA_REPO_SNAPSHOT, ReferenceTypeEnum.DATA_REPO_SNAPSHOT);
      typeMap.put(DataReferenceType.GOOGLE_BUCKET, ReferenceTypeEnum.GOOGLE_BUCKET);
And this map is overkill. Maybe I will write you an "enum" class that will use a list search - which is generally faster than a hash for lists smaller than 10 items...
I know, it probably isn't important and it is all static. If we were running on a PDP-11/70 you would thank me :)
Yeah, and we could have a parent class with just the type as a public member in case we need it.
As posted in Slack: It might make the new version cleaner because we could require referenceType and remove the reference field. And we can make a general GET for /resources that returns everything we've got. I know this is a late-breaking change, but I just thought of it after Jay's suggestion on the controlled resources spec.
re: moving to I'm also thinking about our model of resources - the DB structure suggests that we have data references to all resources (both controlled and uncontrolled), so should the common GET be on
Per earlier discussion, we'll break out each type into a separate endpoint, while still supporting the existing data reference APIs for compatibility in the short term. Moving this to a draft while I work on that.
...huh. I'm wary of blocking on something that's been in progress for almost a year, but I want to keep an eye on that. Thanks for the pointer!

Also, don't we use swagger-codegen and not openapi-generator? ::shakes fist at whoever forked the projects, however good their intentions may have been::
Update: I've broken out the create, get, get-by-name, and delete endpoints for each resource type separately, and the new interface no longer uses any polymorphic types. I'm not totally sure this structure is right (delete and get-by-id don't really depend on type and could be methods on
ddietterich left a comment:
Lotta good stuff in here! I want to discuss the object model before we merge, and maybe remove some endpoints.
src/main/java/bio/terra/workspace/app/controller/DataRepoReferenceController.java
    '500':
      $ref: '#/components/responses/ServerError'

    /api/workspaces/v1/{id}/resources/references/datarepo/snapshot/name/{name}:
Is name URL safe? Or you just assume it is encoded?
Names must be URL safe, and they're checked for validity in controller methods.
    .cloningInstructions(
        CloningInstructions.fromApiModel(body.getMetadata().getCloningInstructions()))
    .referenceObject(referenceObject)
    .build();
this needs a rebase now that #221 merged; this should include .referenceDescription(body.getReferenceDescription())
    referenceId.toString(), id.toString(), userReq.getEmail()));
    return new ResponseEntity<>(HttpStatus.NO_CONTENT);
    }
    }
Also from #221: this needs to include updateDataReference(...).
    .toApiModel())
    .metadata(ControllerTranslationUtils.metadataFromDataReference(ref));
    return new ResponseEntity<>(response, HttpStatus.OK);
    }
similar to private void deleteDataReference and private DataReference createDataReference, could the get-by-id and get-by-name methods be factored into a common method?
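One way that factoring could look (a sketch with hypothetical names and return types, not the actual controller code): since get-by-id and get-by-name differ only in how the reference is looked up, both endpoints can delegate to a single response builder that takes the lookup as a Supplier.

```java
import java.util.UUID;
import java.util.function.Supplier;

// Hypothetical sketch of factoring get-by-id and get-by-name into a common
// helper: the two endpoints differ only in the lookup, so it is passed in as
// a Supplier and the response-building code lives in one place.
public class GetEndpointSketch {
  static class DataReference {
    final String name;
    DataReference(String name) { this.name = name; }
  }

  public static String getDataReference(UUID workspaceId, UUID referenceId) {
    return buildResponse(() -> lookupById(workspaceId, referenceId));
  }

  public static String getDataReferenceByName(UUID workspaceId, String name) {
    return buildResponse(() -> lookupByName(workspaceId, name));
  }

  // Common path: run the lookup, then translate to the API response shape.
  private static String buildResponse(Supplier<DataReference> lookup) {
    DataReference ref = lookup.get();
    return "reference:" + ref.name;
  }

  private static DataReference lookupById(UUID workspaceId, UUID referenceId) {
    return new DataReference("looked-up-by-id"); // stand-in for a service call
  }

  private static DataReference lookupByName(UUID workspaceId, String name) {
    return new DataReference(name); // stand-in for a service call
  }
}
```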
    // TODO(PF-404): this is a workaround until clients migrate off this endpoint.
    if (ref.referenceType() == DataReferenceType.DATA_REPO_SNAPSHOT) {
      responseList.addResourcesItem(ref.toApiModel());
    }
If this omits non-snapshot references from its response payload, it could easily return incorrect pagination - user requested a limit of 5, but we only return 3.
I don't think this is worth fixing in this PR - we should encourage clients to move to the new endpoints instead. But let's remain aware of this behavior in case it causes strange problems elsewhere, potentially in automated tests.
Agreed. I think the main way to mitigate this is not adopting the new endpoints (including creating non-snapshot references) until the enumerate is ready, but that's a risk.
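The hazard is easy to demonstrate with a toy version (illustrative sketch, not the actual enumerate code): the page limit is applied before the type filter, so the page the client sees shrinks.

```java
import java.util.List;
import java.util.stream.Collectors;

// Illustrative sketch (not the actual enumerate code) of the pagination
// hazard: the service/DAO applies the page limit first, and the controller's
// snapshot-only workaround filter then shrinks the page the client sees.
public class PaginationSketch {
  public enum RefType { DATA_REPO_SNAPSHOT, GOOGLE_BUCKET }

  public static List<RefType> enumerateSnapshots(List<RefType> stored, int limit) {
    return stored.stream()
        .limit(limit) // page limit applied before filtering...
        .filter(t -> t == RefType.DATA_REPO_SNAPSHOT) // ...then non-snapshots dropped
        .collect(Collectors.toList());
  }
}
```

With a stored list alternating snapshot and bucket references, a requested page of 5 comes back with only 3 items, exactly the mismatch described above.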
    */
    public static final Pattern BQ_DATASET_NAME_VALIDATION_PATTERN =
        Pattern.compile("^[_a-zA-Z0-9]{1,1024}$");
do any of the Google client libs offer name validation that we could delegate to, instead of writing our own regexes? (honest question, I don't know if they do or not)
Generally the client libs don't validate the names, you'll just get a 40X response from GCP. Agreed that rolling our own regexes for a "necessary but not sufficient" check is a little messy, but I think it's better than passing around totally unverified strings
Devil's advocate: what are we ever going to use the string for except to pass to Google to make a resource? What's the harm in late validation?
We want to send it back to the user in an error message. Security doesn't want us putting unvalidated user input into an output string.
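For reference, a minimal sketch of that "necessary but not sufficient" validation. The BigQuery dataset pattern is the one from the diff above; the bucket pattern is my approximation of GCS naming rules (lowercase, 3-63 characters, starting and ending with a letter or digit), not the exact production regex.

```java
import java.util.regex.Pattern;

// Minimal validation sketch. The BigQuery dataset pattern matches the diff
// above; the bucket pattern is an illustrative approximation of GCS naming
// rules, not the exact production regex. Passing these checks does not prove
// the resource exists - it only keeps unvalidated user input out of logs and
// error messages.
public class NameValidationSketch {
  public static final Pattern BQ_DATASET_NAME_VALIDATION_PATTERN =
      Pattern.compile("^[_a-zA-Z0-9]{1,1024}$");
  public static final Pattern BUCKET_NAME_VALIDATION_PATTERN =
      Pattern.compile("^[a-z0-9][-_.a-z0-9]{1,61}[a-z0-9]$");

  public static boolean isValidDatasetName(String name) {
    return name != null && BQ_DATASET_NAME_VALIDATION_PATTERN.matcher(name).matches();
  }

  public static boolean isValidBucketName(String name) {
    return name != null && BUCKET_NAME_VALIDATION_PATTERN.matcher(name).matches();
  }
}
```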
    throw new InvalidDataReferenceException(
        String.format("Error while trying to access GCS bucket %s", ref.bucketName()), e);
    }
    }
I haven't looked at the innards of CRL - do you know what happens here if the bucket is requester-pays?
I haven't seen that case specifically, but CRL is a really thin wrapper around the Google client library. If we're calling GET on a bucket (which I think is a class B operation), it looks like this will fail with a 400 UserProjectMissing and we'd end up throwing. Is that a use case we need to support?
    private AuthenticatedUserRequest getAuthenticatedInfo() {
      return authenticatedUserRequestFactory.from(request);
    }
DataRepoReferenceController and GoogleReferenceController share boilerplate - is there opportunity for a ReferenceController base class?
Done. I'm not sure I took the cleanest approach given the way I'm passing beans around, but it does cut down on boilerplate. I might try to revisit this a bit to shrink the argument list
    /api/workspaces/v1/{id}/resources/references/datarepo/snapshots:
      parameters:
        - $ref: '#/components/parameters/Id'
not for this PR: I really wish this was named workspaceId instead of the totally-ambiguous id
It really should be; that's tripped people up in client tests before when id gets mixed up between workspaces and references.
    application/json:
      schema:
        $ref: '#/components/schemas/BigQueryDatasetReference'
This will also need updates based on #221 - adding description to DataReferenceMetadata (I assume that's where it would go), plus patch APIs for each reference type.
Update: the current approach is per-type create, patch, and get (and get-by-name) endpoints. Delete and enumerate will be shared at a
ddietterich left a comment:
Looking good. A scattering of comments.
I want to understand moving the untar of the common swagger into the source tree. I'd like to avoid that if possible.
    GoogleBucketReference response =
        new GoogleBucketReference()
            .bucket(
                ((bio.terra.workspace.service.datareference.model.GoogleBucketReference)
I think we should have some prefix or suffix like API on the generated code so we don't run into these name collisions.
Duplicate names will make it easy to screw up and confusing to understand and debug.
It's easy to set up a global prefix in Swagger.
Agreed, but I think making that change (especially if we add a global prefix) ends up out of scope for this PR.
    new GoogleBucketReference()
        .bucket(
            ((bio.terra.workspace.service.datareference.model.GoogleBucketReference)
                ref.referenceObject())
We can leave this for now, but this is where having the class hierarchy would make things cleaner.
    }

    logger.info(
        String.format(
You don't need the String.format almost all the time; you can use placeholders {} in the log message.
Ah, I missed this one. thanks!
    from zipTree(configurations.cloudResourceSchema.singleFile).matching {
      include 'cloud_resources_uid.yaml'
    }
    into "${resourceDir}/api/common"
Why do we need to unpack this into the source tree?
Is an alternative to copy our resources into the build tree?
It would be nice to deploy our yaml file into a build dir so we could use relative (non-upward) paths.
I unpacked it into the source tree so that our swagger-ui (also in the source dir) can parse references to it. otherwise, those schemas won't show up in swagger-ui, which breaks example values.
We can leave it for now. Another approach might be to move all of the parts into a build subdirectory and do the codegen from there. I'll make a ticket.
    // Internal dependencies
    // Note The Open API schema depends on an external library - cloud-resource-schema, so need to unzip it first.
    swaggerSources.server.code.dependsOn tasks.unzipCloudResourceSchema
what about the swagger client tasks?
This took longer to investigate and fix than I want to admit, but you're right that swagger client tasks didn't depend on this. Should be fixed now
    @Override
    public ResponseEntity<Void> updateGoogleBucketReference(
        UUID id, UUID referenceId, UpdateDataReferenceRequestBody body) {
this method can just be updateBucketReference because it's on the Google controller.
    public void testInvalidCharInBucketName() {
      assertThrows(
          InvalidDataReferenceException.class,
          () -> DataReferenceValidationUtils.validateBucketName("INVALIDBUCKETNAME"));
It would be nice to have this validation at a higher level so controlled resources can use it too.
This is in a validation util class called by the controller; I think it's actually reasonable for the various controlled resource controllers to use this bean as well.
Published version
Pair programmed with @jaycarlton.
This change adds support for GCS bucket and BigQuery dataset data references in addition to the existing Data Repo snapshot references. It also introduces a new pattern for reference endpoints broken out by reference type. Cleanup of the old endpoints after migration from the old API to the new API is tracked in PF-404.
To preserve backwards compatibility, the existing endpoints and interface objects are left unchanged. This means the existing GET and enumerate endpoints will not work with bucket and dataset references, as that would require changing the response object.
This includes basic reference validation for the new types, similar to what we do for TDR snapshots. In this case we make a GET call for the specified bucket or dataset on behalf of the user, which requires that the user have appropriate permission on the object. This does not do any further validation, such as examining bucket contents or BQ tables.
This change is not feature-gated behind MC_WORKSPACE; Rawls will have access to these new reference types.
This requires publishing a new client library version, which I'll do after approvals but before merging.