Multiple jobs running on Flink session cluster reuse the persistent Python environment. #21123

Closed
damccorm opened this issue Jun 4, 2022 · 2 comments · Fixed by #16658
Labels
bug done & done Issue has been reviewed after it was closed for verification, followups, etc. flink harness P1 python runners

Comments

@damccorm
Contributor

damccorm commented Jun 4, 2022

I'm running TFX pipelines on a Flink cluster using Beam in k8s. However, extra Python packages passed to the Flink runner (or rather, to the Beam worker sidecar) are installed only once per deployment cycle. Example:

  • Flink is deployed and is up and running.
  • A TFX pipeline starts and submits a job to Flink along with a Python whl of custom code and Beam ops.
  • The Beam worker installs the package and the pipeline finishes successfully.
  • A new TFX pipeline is built in which a new Beam fn is introduced; the pipeline is started and the new whl is submitted as in the second step.
  • This time, the new package is not installed in the Beam worker, so the job fails on a reference that does not exist in the worker's environment.
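
One way to confirm the symptom described in the steps above is to report, from inside the worker, which version of the custom package is actually installed. The following is only a minimal diagnostic sketch using the Python standard library; the helper names `installed_version` and `describe_worker_env` and the package name `my_custom_ops` are hypothetical and do not come from the original report.

```python
import importlib.metadata
import os
import socket
import sys


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return importlib.metadata.version(dist_name)
    except importlib.metadata.PackageNotFoundError:
        return None


def describe_worker_env(dist_name):
    """Collect facts that identify this Python environment and the package state."""
    return {
        "host": socket.gethostname(),   # which machine or container ran the code
        "pid": os.getpid(),             # which worker process ran the code
        "python": sys.executable,       # which interpreter (and thus environment)
        "package": dist_name,
        "version": installed_version(dist_name),  # None means "not installed"
    }


if __name__ == "__main__":
    # "my_custom_ops" is a placeholder for the name of the submitted whl.
    print(describe_worker_env("my_custom_ops"))
```

Calling `describe_worker_env` from code that runs on the worker (for example inside a `beam.Map` function or a `DoFn.setup` method) and logging the result for two consecutive jobs would show whether the second job still sees the version installed by the first. Bumping the whl version between submissions makes the comparison unambiguous.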

 

I started using Flink with Beam version 2.27, and this has been an issue the whole time.

Imported from Jira BEAM-12792. Original Jira may contain additional context.
Reported by: ConverJens.

@kennknowles
Member

@damccorm @tvalentyn as experts in our ML integrations and Python, do you know how these installs work and why the install would not happen a second time? I guess it has to do with how the portable Flink runner deploys Python SDK harness containers. But wouldn't a fresh container start up for each new pipeline, and hence give a clean start?

@tvalentyn
Contributor

We actually have an open PR on this: #16658
There was a seemingly working solution, but it showed very strange behavior on GCE VMs that we didn't root-cause. I'll take another look.

@github-actions github-actions bot added this to the 2.44.0 Release milestone Nov 10, 2022
@tvalentyn tvalentyn added the done & done Issue has been reviewed after it was closed for verification, followups, etc. label Dec 6, 2022