Eng 1241 Add ability to fetch an artifact from an existing flow version #157
Conversation
@Fanjia-Yan, this mostly looks good, but I left a few comments and nits inline.
Let's also please make sure to title our pull requests clearly -- GitHub defaults to pulling from the branch name, but that's a little unhelpful/unclear in this case.
Thanks!
integration_tests/sdk/flow_test.py (outdated):

```python
wait_for_flow_runs(client, flow.id(), num_runs=1)
artifact_return = flow.get(get_artifact_name())
assert artifact_return.name == get_artifact_name()
```
Is there any other metadata returned that we can check other than the name of the artifact?
sdk/aqueduct/flow.py (outdated):

```python
print(json.dumps(self.list_runs(), sort_keys=False, indent=4))

def get(self, artifact_name: str) -> Artifact:
```
`get()` feels a little ambiguous to me. Are we getting code, data, metadata? Feels like a generic `get()` method should support all of those. I would suggest doing something specific like either `flow.artifact(name)` or `flow.get_artifact(name)`.
@kenxu95, any thoughts?
I think `flow.artifact(name)` is consistent with how we usually do things.
sdk/aqueduct/flow.py (outdated):

```python
assert (
    len(resp.workflow_dag_results) > 0
), "Every flow must have at least one run attached to it."
latest_result = resp.workflow_dag_results[-1]
```
Are we assuming here that when we have a handle to a `flow`, it always points to the latest version @kenxu95?
I guess I'm really asking how this interacts with versioning -- can we access older versions?
Yeah, @Fanjia-Yan, this method shouldn't be on the flow; it should be on the flow run. A flow itself is actually dag-agnostic right now, since a dag is always associated with a particular run. You can also grab a run with `flow.latest()` or `flow.fetch()`.
So checking the artifact name is fine, but it's incomplete. We should also make sure that calling `artifact.get()` returns the same value as before! An artifact fetched from a flow should be functionally indistinguishable from an artifact that was locally created.

In order for that to work, you'll have to modify the `get_workflow` route to also return the serialized functions corresponding to each dag element.
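The round-trip check described above can be sketched as follows. This is a hedged illustration, not the real SDK: `FakeBackend` and this `Artifact` class are stand-ins for whatever storage and artifact classes Aqueduct actually uses.

```python
# Stand-in for the Aqueduct backend: stores published artifact values
# and serves them back by name.
class FakeBackend:
    def __init__(self):
        self._store = {}

    def publish(self, name, value):
        self._store[name] = value

    def fetch(self, name):
        return self._store[name]


# Stand-in artifact: get() resolves the current value from the backend.
class Artifact:
    def __init__(self, name, backend):
        self.name = name
        self._backend = backend

    def get(self):
        return self._backend.fetch(self.name)


backend = FakeBackend()
backend.publish("predictions", [0.1, 0.9])

local = Artifact("predictions", backend)
fetched = Artifact("predictions", backend)  # as if via flow.artifact(...)

# Name equality alone is incomplete; the resolved values must match too.
assert fetched.name == local.name
assert fetched.get() == local.get() == [0.1, 0.9]
```

The point of the test is exactly this pair of assertions: a fetched artifact and a locally created one should be interchangeable both in identity and in the value `get()` returns.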
Many changes to this PR after last review
This is awesome! I have a number of comments but thanks for getting this working.
There are two other things I thought of while reviewing:
- Did you want to rename the generic artifact to `GenericArtifact`? We can probably just do it (here or in a separate PR).
- A returned artifact currently cannot be used to construct a new dag, right? Because the underlying dag objects are different. Could we add a defensive check in there for now, where if you attempt to use a returned artifact in new computation, we'd throw an error? This might be an interesting follow-up :)
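The defensive check suggested above could look roughly like this. It is a sketch using a `from_flow_run` flag of the kind this PR later introduces; `wrap_operator` and the exception type are hypothetical names, not the real decorator machinery.

```python
class Artifact:
    def __init__(self, name, from_flow_run=False):
        self.name = name
        # Marks artifacts that were fetched from an existing flow run.
        self.from_flow_run = from_flow_run


def wrap_operator(*inputs):
    # Reject artifacts fetched from a previous flow run: their
    # underlying dag objects differ from locally created ones, so
    # using them in a new dag would fail in confusing ways later.
    for artifact in inputs:
        if artifact.from_flow_run:
            raise RuntimeError(
                "Artifacts fetched from a flow run cannot be used "
                "to construct a new dag."
            )
    return Artifact("output")


local = Artifact("local_input")
fetched = Artifact("fetched_input", from_flow_run=True)

assert wrap_operator(local).name == "output"
try:
    wrap_operator(fetched)  # fails loudly instead of building a broken dag
except RuntimeError:
    pass
```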
sdk/aqueduct/api_client.py (outdated):

```python
    self.EXPORT_FUNCTION_ROUTE % str(operator.id), self.use_https
)
operator_resp = requests.get(operator_url, headers=headers)
operator.change_file(operator_resp.content)
```
Instead of `change_file()` on the operator, I'd like to keep our current abstraction where the dag class is the only thing that can modify underlying operator and artifact objects. Having fewer places that can modify the dag will make things easier to maintain and lead to fewer bugs.

Would you be able to use `dag.update_operator_spec()` instead? You can probably write a helper to construct the new operator spec from the old one plus the serialized contents.
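A minimal sketch of that abstraction, assuming simplified shapes for the spec and dag (the real `OperatorSpec`/`Dag` classes will differ): the dag owns all mutation via `update_operator_spec()`, and a small helper builds the new spec from the old one plus the serialized function bytes.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class OperatorSpec:
    name: str
    serialized_function: bytes = b""


class Dag:
    def __init__(self, specs):
        self._specs = {spec.name: spec for spec in specs}

    def update_operator_spec(self, name, new_spec):
        # The only entry point that mutates operator state.
        if name not in self._specs:
            raise KeyError(name)
        self._specs[name] = new_spec

    def spec(self, name):
        return self._specs[name]


def with_serialized_function(old_spec, contents):
    # Helper: construct a new spec from the old one + serialized
    # contents, instead of mutating the operator in place.
    return replace(old_spec, serialized_function=contents)


dag = Dag([OperatorSpec(name="predict")])
new_spec = with_serialized_function(dag.spec("predict"), b"\x80bytes")
dag.update_operator_spec("predict", new_spec)
assert dag.spec("predict").serialized_function == b"\x80bytes"
```

Freezing the spec dataclass makes the "construct, don't mutate" rule hard to violate by accident.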
@kenxu95 Since `get_workflow` returns a `WorkflowDagResponse` class, the serialized function is inserted into `WorkflowDagResponse` rather than the `dag` class. Therefore I believe `update_operator_spec()` should be under `WorkflowDagResponse`?
The following is the update after the last review
It's getting better! Thanks for fixing the incompatible dag issue. There are still a lot of code quality and abstraction issues here, so I still need to block. I think maybe there was a miscommunication about what abstractions are present in the SDK, especially around how we update the dag.
sdk/aqueduct/api_client.py (outdated):

```python
    self.EXPORT_FUNCTION_ROUTE % str(operator.id), self.use_https
)
operator_resp = requests.get(operator_url, headers=headers)
work_flow_dag.update_operator_spec(operator, operator_resp)
```
OK, this is better, but I was really saying this APIClient layer should be as thin as possible: it should literally just take in a single operator id and return the serialized function for that operator. The double for loop belongs in the caller.

Question: does an operator always correspond to only a single workflow dag? Is there a case where an operator corresponds to multiple workflow dags on the backend? cc @cw75
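The "thin client" shape being asked for might look like the sketch below. The route string comes from the PR description; the class names and the fake transport are assumptions for illustration.

```python
# Route format taken from the PR description.
EXPORT_FUNCTION_ROUTE = "/api/function/%s/export"


class APIClient:
    def __init__(self, transport):
        self._transport = transport  # e.g. a requests-like session

    def export_serialized_function(self, operator_id: str) -> bytes:
        # One operator id in, one payload out: no dag knowledge here.
        return self._transport.get(EXPORT_FUNCTION_ROUTE % operator_id)


class FakeTransport:
    # Stand-in for an HTTP session; echoes the URL it was asked for.
    def get(self, url):
        return ("payload for " + url).encode()


client = APIClient(FakeTransport())

# The caller owns the loop over dag elements, not the client.
payloads = {
    op_id: client.export_serialized_function(op_id)
    for op_id in ["op-1", "op-2"]
}
assert payloads["op-1"] == b"payload for /api/function/op-1/export"
```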
After the last review, several things have been changed. This version looks nicer since several lines of code are taken away from the api_client file :)
Awesome! Thanks for pushing this through 👍 Left a few stylistic/cleanup comments but looks good to me. I think @vsreekanti also needs to unblock.
sdk/aqueduct/flow_run.py (outdated):

```python
artifact_from_dag = flow_run_dag.get_artifacts_by_name(name)

if artifact_from_dag is None:
    raise ArtifactNotFoundException("The artifact name provided does not exist.")
```
I wonder if it might be nicer for the user if we return None when the artifact doesn't exist? Maybe they just want to check for the existence of an artifact, and it's a bit less annoying to check for None than it is to catch an error.
If you add that, could you also put that in the function comment string?
I think that's a good idea!
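Putting the two suggestions together, the agreed behavior might look like this sketch: return `None` for a missing artifact and document that in the docstring. The `FlowRun` class here is a simplified stand-in.

```python
from typing import Optional


class FlowRun:
    def __init__(self, artifacts_by_name):
        self._artifacts = artifacts_by_name

    def artifact(self, name: str) -> Optional[str]:
        """Return the artifact with the given name.

        Returns:
            The matching artifact, or None if no artifact with that
            name exists in this flow run.
        """
        return self._artifacts.get(name)


run = FlowRun({"features": "features-artifact"})
assert run.artifact("features") == "features-artifact"
# Existence check without a try/except:
assert run.artifact("missing") is None
```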
sdk/aqueduct/flow_run.py (outdated):

```python
elif get_artifact_type(artifact_from_dag) is ArtifactType.PARAM:
    return ParamArtifact(self._api_client, self._dag, artifact_from_dag.id, True)

return None
```
Here I actually think we should raise an exception (aka fail very loudly), since this means something went wrong in our system. Is there an Internal error we can throw?
@kenxu95 I think this error is really unlikely to trigger, as an artifact falls into one of the four categories. However, if there is to be an error, I think it is likely to be either "InvalidArtifactError" (which does not exist yet in error.py) or "UnprocessableEntityError" (the exception raised when the Aqueduct system fails to process certain inputs).
Or maybe we can throw the general AqueductError?
Let's throw an `InternalAqueductError`. The reason that this needs to be internal is that it's definitely something that's gone wrong in our system, and is not the user's fault.

This error should never happen, as you pointed out, but it's important in our code to make sure that we cover all assumptions. If any assumption breaks, we should also fail loudly so that the person who broke the assumption can fix the issue. In this case, the assumption is that the artifact is one of four types, but this can easily change (imagine someone adds a new type on the backend but doesn't find this function and doesn't update it).
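That fail-loudly pattern can be sketched as below. `InternalAqueductError` and the `ArtifactType` members are modeled here rather than imported, and the exact four type names are assumptions; the shape of the dispatch is what matters.

```python
from enum import Enum


class InternalAqueductError(Exception):
    """Something went wrong inside the system; not the user's fault."""


class ArtifactType(Enum):
    TABLE = "table"
    NUMERIC = "numeric"
    BOOL = "bool"
    PARAM = "param"


def wrap_artifact(artifact_type: ArtifactType) -> str:
    if artifact_type is ArtifactType.TABLE:
        return "TableArtifact"
    elif artifact_type is ArtifactType.NUMERIC:
        return "NumericArtifact"
    elif artifact_type is ArtifactType.BOOL:
        return "BoolArtifact"
    elif artifact_type is ArtifactType.PARAM:
        return "ParamArtifact"
    # Covers the "can never happen" case loudly instead of silently
    # returning None, so a new backend type that skips this function
    # is caught immediately.
    raise InternalAqueductError(
        "Unexpected artifact type: %s" % artifact_type
    )


assert wrap_artifact(ArtifactType.PARAM) == "ParamArtifact"
```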
Hey @Fanjia-Yan, Kenny knows this codebase much better than me, so if he's signed off on the PR, you should be good to go. 🙂 No need to wait for my approval here.
Describe your changes and why you are making these changes

This PR adds a `get(artifact_name)` feature to the `flow` object. By passing in an artifact name, the function returns the Artifact associated with that name; if none exists, it raises an ArtifactNotFound error.

I have created an integration test: the test creates a workflow and publishes it online. After the workflow has finished, I use `get()` to extract the Artifact given the artifact name. Finally, I check that the returned Artifact object has the same name as the one provided.
Related issue number (if any)
Eng 1241
Checklist before requesting a review
Update
Many changes to this PR after last review
get_workflow()
so that theworkflowresponse
also has serialized function attached to each operator. To get the serialized function, I route "/api/function/%s/export" and request by operator_idflow.get()
intoflow.artifact()
. Inflow.artifact()
, we first fetch theartifact.Artifact
object with matchingartifact_name
and convert it togeneric_artifact.Artifact
so that we can call.get()
.flow.artifact(output_artifact.name())
. And check for name equality and dataframe equality between locally created artifact and fetched from server artifact.Update 2.0
flow.artifact()
toflowrun.artifact()
export_serialized_function()
to handle the getting serialized function into the dag_response rather than putting them all inget_workflow()
operator
is crossing the the abstraction barrier. Therefore, in theapi_client
, I eliminated any operation onoperator
and put them intoworkflow_dag_response
artifact
fromflowrun.arifact()
, I create a boolean parameter for Artifact(TableArtifact etc.)from_flow_run
. The parameter defaults to false but can be set to true when creating the Artifact Object. Then, indecorator.py
, when the operators and artifacts are wrapped, I check whether any input artifactfrom_flow_run
is set to true. If so return a Exception.Update 3.0
After last review, several things have been changed:
dag.py
instead ofapi_client.py
flow.py: construct_flow_run()
instead ofapi_client.py: get_workflow()
_from_flow_run
parameter for InputArtifactThis version looks nicer since serveral lines of codes are taken away from the api_client file :)