
Eng 1241 Add ability to fetch an artifact from an existing flow version #157

Merged — 23 commits into main on Jul 13, 2022

Conversation

@Fanjia-Yan (Contributor) commented Jun 28, 2022

Describe your changes and why you are making these changes

This PR adds a get(artifact_name) method to the Flow object. Given an artifact name, it returns the Artifact associated with that name; if no match is found, it raises an ArtifactNotFound error.

I have also created an integration test: it creates a workflow and publishes it. After the workflow has finished, I use get() to extract the Artifact for the given artifact name, then check that the returned Artifact object has the same name as the one provided.
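The lookup-or-raise behavior described above can be sketched as follows. These are illustrative stand-ins, not the actual SDK classes: only the get()/ArtifactNotFound contract from the PR description is assumed.

```python
# Stand-ins for the SDK's Flow/Artifact classes; only the
# lookup-or-raise behavior described in the PR is sketched here.
class ArtifactNotFound(Exception):
    pass

class Artifact:
    def __init__(self, name):
        self.name = name

class Flow:
    def __init__(self, artifacts):
        self._artifacts = {a.name: a for a in artifacts}

    def get(self, artifact_name):
        # Return the Artifact with the matching name, or raise.
        if artifact_name not in self._artifacts:
            raise ArtifactNotFound("The artifact name provided does not exist.")
        return self._artifacts[artifact_name]

flow = Flow([Artifact("churn_table")])
fetched = flow.get("churn_table")
```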

Related issue number (if any)

Eng 1241

Checklist before requesting a review

  • [x] I have performed a self-review of my code.
  • [x] If this is a new feature, I have added unit tests and integration tests.
  • [x] I have manually run the integration tests and they are passing.
  • All features on the UI continue to work correctly.

Update

Many changes to this PR since the last review:

  1. Completed get_workflow() so that the workflow response also has the serialized function attached to each operator. To fetch a serialized function, I hit the "/api/function/%s/export" route, keyed by operator_id.
  2. Renamed the previous flow.get() to flow.artifact(). In flow.artifact(), we first fetch the artifact.Artifact object with the matching artifact_name and convert it to a generic_artifact.Artifact so that we can call .get() on it.
  3. For the integration test, I fetch the artifact using flow.artifact(output_artifact.name()) and check for name equality and dataframe equality between the locally created artifact and the one fetched from the server.
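The fetch-and-wrap step in item 2 could look roughly like this. DagArtifact and GenericArtifact are hypothetical stand-ins for the SDK's artifact.Artifact and generic_artifact.Artifact classes.

```python
class DagArtifact:
    # Stand-in for artifact.Artifact: a raw dag node with an id and name.
    def __init__(self, id, name, value):
        self.id, self.name, self.value = id, name, value

class GenericArtifact:
    # Stand-in for generic_artifact.Artifact: wraps a dag node and
    # exposes .get() to materialize its value.
    def __init__(self, dag_artifact):
        self._dag_artifact = dag_artifact

    def get(self):
        return self._dag_artifact.value

def flow_artifact(dag_artifacts, name):
    # Find the dag artifact with a matching name, then wrap it so the
    # caller can call .get() on the result.
    for a in dag_artifacts:
        if a.name == name:
            return GenericArtifact(a)
    raise KeyError(name)
```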

Update 2.0

  1. Migrated flow.artifact() to flow_run.artifact().
  2. Created an export_serialized_function() helper that handles attaching serialized functions to the dag_response, rather than doing it all inside get_workflow().
  3. As Kenny mentioned, operating directly on operators crosses the abstraction barrier, so I removed all operator manipulation from the api_client and moved it into workflow_dag_response.
  4. To prevent reuse of an artifact returned by flow_run.artifact(), I added a boolean from_flow_run parameter to the Artifact classes (TableArtifact, etc.). It defaults to False but can be set to True when the Artifact object is created. In decorator.py, when operators and artifacts are wrapped, I check whether any input artifact has from_flow_run set to True and, if so, raise an exception.
  5. I also created a test case verifying that we catch input artifacts that came from a flow run: I publish a flow, fetch an artifact, and reuse it as input to the sentiment model, expecting an exception.
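Items 4 and 5 amount to a defensive flag plus a check at wrap time. A rough sketch under assumed names (InvalidUserActionException and wrap_operator are illustrative; the real check lives in decorator.py):

```python
class InvalidUserActionException(Exception):
    pass

class TableArtifact:
    def __init__(self, name, from_flow_run=False):
        self.name = name
        # True only when this artifact was fetched via flow_run.artifact().
        self.from_flow_run = from_flow_run

def wrap_operator(*input_artifacts):
    # Mirrors the decorator.py check: artifacts fetched from a flow run
    # may not be fed back into new computation.
    for artifact in input_artifacts:
        if artifact.from_flow_run:
            raise InvalidUserActionException(
                "Artifacts fetched from a flow run cannot be reused as inputs."
            )
```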

Update 3.0

Since the last review, several things have changed:

  1. Operator updates now take place in dag.py instead of api_client.py.
  2. Attaching serialized functions to the dag now happens in flow.py's construct_flow_run() instead of api_client.py's get_workflow().
  3. Fixed miscellaneous issues and updated the documentation for the _from_flow_run parameter of InputArtifact.

This version looks nicer, since several lines of code have been removed from the api_client file :)

@Fanjia-Yan Fanjia-Yan requested review from kenxu95 and cw75 June 28, 2022 21:42
@vsreekanti (Contributor) left a comment
@Fanjia-Yan, this mostly looks good, but I left a few comments and nits inline.

Let's also please make sure to title our pull requests clearly -- GitHub defaults to pulling from the branch name, but that's a little unhelpful/unclear in this case.

Thanks!

integration_tests/sdk/flow_test.py (outdated; resolved)
)
wait_for_flow_runs(client, flow.id(), num_runs=1)
artifact_return = flow.get(get_artifact_name())
assert artifact_return.name == get_artifact_name()
Contributor:
Is there any other metadata returned that we can check other than the name of the artifact?

@@ -153,3 +154,19 @@ def describe(self) -> None:
)
)
print(json.dumps(self.list_runs(), sort_keys=False, indent=4))

def get(self, artifact_name: str) -> Artifact:
@vsreekanti (Contributor):
get() feels a little ambiguous to me. Are we getting code, data, metadata? Feels like a generic get() method should support all of those. I would suggest doing something specific like either flow.artifact(name) or flow.get_artifact(name).

@kenxu95, any thoughts?

@kenxu95 (Contributor):
I think flow.artifact(name) is consistent with how we usually do things.

assert (
len(resp.workflow_dag_results) > 0
), "Every flow must have at least one run attached to it."
latest_result = resp.workflow_dag_results[-1]
@vsreekanti (Contributor):
Are we assuming here that when we have a handle to a flow, it always points to the latest version @kenxu95?

@vsreekanti (Contributor):
I guess I'm really asking how this interacts with versioning -- can we access older versions?

@kenxu95 (Contributor):
yeah, @Fanjia-Yan this method shouldn't be on the flow, this should be on the flow run. A flow itself is actually dag-agnostic right now, since a dag is always associated with a particular run. You can also grab a run with flow.latest() or flow.fetch()
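The run-scoped shape being described might look like this. The classes are stand-ins; only flow.latest() and run.artifact() from the discussion are assumed API names.

```python
class FlowRun:
    def __init__(self, artifacts_by_name):
        self._artifacts = artifacts_by_name

    def artifact(self, name):
        # Artifacts hang off a specific run, since each run pins a dag.
        return self._artifacts.get(name)

class Flow:
    # A flow itself is dag-agnostic; it just tracks its runs.
    def __init__(self, runs):
        self._runs = runs

    def latest(self):
        return self._runs[-1]

flow = Flow([FlowRun({"predictions": "v1"}), FlowRun({"predictions": "v2"})])
run = flow.latest()
```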

@Fanjia-Yan Fanjia-Yan changed the title Eng 1241 add ability to fetch an artifact from an Eng 1241 Add ability to fetch an artifact from an existing flow version Jun 29, 2022
@kenxu95 (Contributor) left a comment
So checking the artifact name is fine, but it's incomplete. We should also make sure that calling artifact.get() returns the same value as before! An artifact fetched from a flow should be functionally indistinguishable from an artifact that was locally created.

In order for that to work, you'll have to modify the get_workflow route to also return the serialized functions corresponding to each dag element.
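In test terms, the point is that name equality alone is a weak check; comparing materialized values is the stronger one. A schematic version, with plain values standing in for artifact .get() results:

```python
# local_value stands in for output_artifact.get() computed locally;
# fetched_value stands in for the value returned by the artifact
# fetched back from the flow. The two must be equal even though they
# are distinct objects round-tripped through the server.
local_value = [1, 2, 3]
fetched_value = list(local_value)  # simulated server round trip

assert fetched_value == local_value
assert fetched_value is not local_value
```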


@Fanjia-Yan Fanjia-Yan requested a review from kenxu95 July 2, 2022 00:41
@kenxu95 (Contributor) left a comment
This is awesome! I have a number of comments but thanks for getting this working.

There are two other things I thought of while reviewing:

  1. Did you want to rename the generic artifact to GenericArtifact? We can probably just do it (here or in a separate PR)
  2. A returned artifact currently cannot be used to construct a new dag right? Because the underlying dag objects are different. Could we add a defensive check in there for now, where if you attempt to use a returned artifact in new computation, we'd throw an error? This might be an interesting follow up :)

integration_tests/sdk/flow_test.py (outdated; resolved)
self.EXPORT_FUNCTION_ROUTE % str(operator.id), self.use_https
)
operator_resp = requests.get(operator_url, headers=headers)
operator.change_file(operator_resp.content)
@kenxu95 (Contributor):
instead of change_file() on the operator, I'd like to keep our current abstraction where the dag class is the only thing that can modify underlying operator and artifact objects. Having fewer places that can modify the dag makes things easier to maintain and leaves us with fewer bugs.

Would you be able to use dag.update_operator_spec() instead? You can probably write a helper to construct the new operator spec from the old one + serialized contents.
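One way to keep the dag as the sole mutator, sketched with hypothetical names: the OperatorSpec fields and the helper are illustrative, while dag.update_operator_spec() is the method named above.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class OperatorSpec:
    name: str
    serialized_function: bytes

def spec_with_serialized_function(old_spec, contents):
    # The suggested helper: build a new spec from the old one plus the
    # freshly fetched serialized contents, without mutating anything.
    return replace(old_spec, serialized_function=contents)

class Dag:
    # Only the dag touches its own operator specs.
    def __init__(self):
        self._specs = {}

    def update_operator_spec(self, operator_id, new_spec):
        self._specs[operator_id] = new_spec

dag = Dag()
old = OperatorSpec(name="clean", serialized_function=b"")
dag.update_operator_spec("op-1", spec_with_serialized_function(old, b"\x00\x01"))
```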

@Fanjia-Yan (author) commented Jul 6, 2022:
@kenxu95 Since get_workflow returns a WorkflowDagResponse, the serialized function is inserted into the WorkflowDagResponse rather than into the dag class. Should update_operator_spec() therefore live on WorkflowDagResponse?

sdk/aqueduct/api_client.py (outdated; resolved)
sdk/aqueduct/flow.py (outdated; resolved)
sdk/aqueduct/flow.py (outdated; resolved)
@Fanjia-Yan (author) commented Jul 7, 2022 (see "Update 2.0" above).
@Fanjia-Yan Fanjia-Yan requested a review from kenxu95 July 7, 2022 18:34
@Fanjia-Yan Fanjia-Yan added the run_integration_test Triggers integration tests label Jul 8, 2022
@kenxu95 (Contributor) left a comment
It's getting better! Thanks for fixing the incompatible dag issue. There are a lot of code quality and abstraction issues still here so I still need to block. I think maybe there was miscommunication about what abstractions are present in the SDK? Especially around how we update the dag.

sdk/aqueduct/api_client.py (outdated; resolved)
self.EXPORT_FUNCTION_ROUTE % str(operator.id), self.use_https
)
operator_resp = requests.get(operator_url, headers=headers)
work_flow_dag.update_operator_spec(operator, operator_resp)
@kenxu95 (Contributor):
OK, this is better, but I was really saying that this APIClient layer should be as thin as possible: it should literally just take in a single operator id and return the serialized function for that operator. The double for-loop belongs in the caller.

Question: does an operator always correspond to exactly one workflow dag, or is there a case where an operator corresponds to multiple workflow dags on the backend? cc @cw75
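The "thin client" shape being asked for, as a sketch. The route string comes from the code under review; the transport is faked with a lambda, and the caller owns all iteration.

```python
EXPORT_FUNCTION_ROUTE = "/api/function/%s/export"

class APIClient:
    def __init__(self, fetch):
        self._fetch = fetch  # stand-in for an authenticated requests.get

    def export_serialized_function(self, operator_id):
        # Takes a single operator id, returns its serialized function;
        # no looping over the dag happens at this layer.
        return self._fetch(EXPORT_FUNCTION_ROUTE % operator_id)

# The caller, not the client, iterates over the dag's operators:
client = APIClient(lambda url: b"serialized:" + url.encode())
blobs = [client.export_serialized_function(op_id) for op_id in ("op-1", "op-2")]
```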

sdk/aqueduct/check_artifact.py (resolved)
sdk/aqueduct/flow_run.py (outdated; resolved)
sdk/aqueduct/flow_run.py (outdated; resolved)
sdk/aqueduct/flow_run.py (outdated; resolved)
sdk/aqueduct/generic_artifact.py (outdated; resolved)
sdk/aqueduct/responses.py (outdated; resolved)
sdk/aqueduct/operators.py (outdated; resolved)
sdk/aqueduct/decorator.py (outdated; resolved)
@Fanjia-Yan (author) commented Jul 11, 2022 (see "Update 3.0" above).
@Fanjia-Yan Fanjia-Yan requested a review from kenxu95 July 11, 2022 23:33
@kenxu95 (Contributor) left a comment
Awesome! Thanks for pushing this through 👍 Left a few stylistic/cleanup comments but looks good to me. I think @vsreekanti also needs to unblock.

artifact_from_dag = flow_run_dag.get_artifacts_by_name(name)

if artifact_from_dag is None:
raise ArtifactNotFoundException("The artifact name provided does not exist.")
@kenxu95 (Contributor):
I wonder if it might be nicer for the user if we return None when the artifact doesn't exist? Maybe they just want to check for the existence of an artifact, and it's a bit less annoying to check for None than it is to catch an error.
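The ergonomics being argued for, schematically. The function name mirrors the flow_run.artifact() discussed in this PR, but the implementation here is a stand-in over a plain dict.

```python
def artifact(artifacts_by_name, name):
    """Return the named artifact, or None if it does not exist."""
    return artifacts_by_name.get(name)

# Existence becomes a simple None check rather than a try/except:
result = artifact({"predictions": "some-artifact"}, "missing")
exists = result is not None
```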

Contributor:
If you add that, could you also put that in the function comment string?

@Fanjia-Yan (author):
I think that's a good idea!

elif get_artifact_type(artifact_from_dag) is ArtifactType.PARAM:
return ParamArtifact(self._api_client, self._dag, artifact_from_dag.id, True)

return None
@kenxu95 (Contributor):
Here I actually think we should raise an exception (aka fail very loudly), since this means something went wrong in our system. Is there an Internal error we can throw?

@Fanjia-Yan (author):
@kenxu95 I think this error is very unlikely to trigger, since an artifact always falls into one of the four categories. If we do raise something here, it would probably be either InvalidArtifactError (which doesn't exist yet in error.py) or UnprocessableEntityError (raised when the Aqueduct system fails to process certain inputs).

@Fanjia-Yan (author):
Or maybe we can throw the general AqueductError?

@kenxu95 (Contributor):
Let's throw an InternalAqueductError. The reason that this needs to be internal is because its definitely something thats gone wrong in our system, and is not the user's fault.

This error should never happen, as you pointed out, but it's important in our code to make sure that we cover all assumptions. If any assumption breaks, we should also fail loudly so that the person who broke the assumption can fix the issue. In this case, the assumption is that the artifact is one of four types, but that can easily change (imagine someone adds a new type on the backend but doesn't find this function and doesn't update it).
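The exhaustive dispatch plus fail-loud fallback could look like this. ArtifactType.PARAM and InternalAqueductError follow names used in this thread; the other enum members and the handler strings are placeholders.

```python
from enum import Enum

class ArtifactType(Enum):
    TABLE = "table"
    FLOAT = "float"
    BOOL = "bool"
    PARAM = "param"

class InternalAqueductError(Exception):
    pass

def wrap_artifact(artifact_type):
    handlers = {
        ArtifactType.TABLE: "TableArtifact",
        ArtifactType.FLOAT: "FloatArtifact",
        ArtifactType.BOOL: "BoolArtifact",
        ArtifactType.PARAM: "ParamArtifact",
    }
    if artifact_type not in handlers:
        # A new backend type that nobody wired up here should fail
        # loudly: this is our bug, not the user's.
        raise InternalAqueductError("Unexpected artifact type: %r" % (artifact_type,))
    return handlers[artifact_type]
```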

sdk/aqueduct/operators.py (outdated; resolved)
sdk/aqueduct/flow_run.py (outdated; resolved)
@Fanjia-Yan Fanjia-Yan requested a review from vsreekanti July 12, 2022 18:34
@vsreekanti (Contributor):
Hey @Fanjia-Yan, Kenny knows this codebase much better than me, so if he's signed off on the PR, you should be good to go. 🙂 No need to wait for my approval here.

@Fanjia-Yan Fanjia-Yan merged commit b487ff5 into main Jul 13, 2022
@vsreekanti vsreekanti deleted the eng-1241-add-ability-to-fetch-an-artifact-from-an branch July 13, 2022 19:15