
docker worker does not have aws credentials #18186

Closed
xwjiang2010 opened this issue Aug 28, 2021 · 2 comments · Fixed by #18220
Labels
bug: Something that is supposed to be working, but isn't
triage: Needs triage (eg: priority, bug/not-bug, and owning component)

Comments


xwjiang2010 commented Aug 28, 2021

What is the problem?

I was trying to see whether syncing to S3 works, to bypass the rsync issue for our customer.
The setup is the same as in #17940, with the following change:

        sync_config=tune.SyncConfig(
            sync_to_driver=False,
            upload_dir="s3://xwjiang-test/keep_checkpoints_num/",
        ),

I also wrapped the trainable in tune.durable.
In cluster.yaml's setup commands, I added:

        pip install awscli
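For reference, the upload_dir above is a plain s3:// URI. A minimal stdlib sketch of how such a URI breaks down into a bucket and key prefix (the split_s3_uri helper is hypothetical, not part of Ray):

```python
from urllib.parse import urlparse


def split_s3_uri(uri: str) -> tuple:
    # Split an s3://bucket/prefix URI into (bucket, key prefix).
    parsed = urlparse(uri)
    if parsed.scheme != "s3":
        raise ValueError(f"not an S3 URI: {uri}")
    return parsed.netloc, parsed.path.lstrip("/")


# The upload_dir from the snippet above:
bucket, prefix = split_s3_uri("s3://xwjiang-test/keep_checkpoints_num/")
```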

The workflow is otherwise the same, but I ran into the following error on a worker:

    Error message (1): fatal error: Unable to locate credentials

On Driver

(base) ray@ip-172-31-24-198:~$ aws configure list
      Name                    Value             Type    Location
      ----                    -----             ----    --------
   profile                <not set>             None    None
access_key     ****************2RSP         iam-role
secret_key     ****************S2Jm         iam-role
    region                <not set>             None    None

On worker

(base) ray@ip-172-31-19-194:~$ aws configure list
      Name                    Value             Type    Location
      ----                    -----             ----    --------
   profile                <not set>             None    None
access_key                <not set>             None    None
secret_key                <not set>             None    None
    region                <not set>             None    None
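To make the difference between the two outputs concrete: a small stdlib sketch that scans `aws configure list` output for unset keys (has_credentials is a hypothetical helper for illustration, not an AWS CLI feature):

```python
def has_credentials(configure_list_output: str) -> bool:
    # Credentials are usable only if both the access_key and secret_key
    # rows in `aws configure list` output show something other than
    # "<not set>".
    found = {}
    for line in configure_list_output.splitlines():
        parts = line.split()
        if parts and parts[0] in ("access_key", "secret_key"):
            found[parts[0]] = "<not set>" not in line
    return found.get("access_key", False) and found.get("secret_key", False)


# Abbreviated rows from the driver and worker outputs above:
driver = (
    "access_key     ****************2RSP         iam-role\n"
    "secret_key     ****************S2Jm         iam-role"
)
worker = (
    "access_key                <not set>             None    None\n"
    "secret_key                <not set>             None    None"
)
```

On the driver both keys resolve via the instance's IAM role; on the worker neither is set, which is exactly what the "Unable to locate credentials" error reflects.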

Is this expected? Am I missing something?

Reproduction (REQUIRED)

Please provide a short code snippet (less than 50 lines if possible) that can be copy-pasted to reproduce the issue. The snippet should have no external library dependencies (i.e., use fake or mock data / environments):

If the code snippet cannot be run by itself, the issue will be closed with "needs-repro-script".

  • I have verified my script runs in a clean environment and reproduces the issue.
  • I have verified the issue also occurs with the latest wheels.
xwjiang2010 added the bug and triage labels on Aug 28, 2021
@xwjiang2010

@richardliaw


xwjiang2010 commented Aug 30, 2021

Per Richard's suggestion, added IamInstanceProfile to cluster.yaml.

worker_nodes:
    InstanceType: m5.xlarge
    ImageId: latest_dlami
    KeyName: xwjiang-test
    IamInstanceProfile:
        Arn: arn:aws:iam::YOUR_AWS_ACCOUNT_NUMBER:YOUR_INSTANCE_PROFILE

which grants S3 access to worker nodes.
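As a sanity check on the Arn value: IAM instance-profile ARNs follow the shape arn:aws:iam::&lt;12-digit account&gt;:instance-profile/&lt;name&gt;. A stdlib sketch encoding that shape (the helper itself is hypothetical, for illustration only):

```python
import re

# IAM instance-profile ARNs look like:
#   arn:aws:iam::123456789012:instance-profile/MyProfile
_INSTANCE_PROFILE_ARN = re.compile(
    r"^arn:aws:iam::\d{12}:instance-profile/[\w+=,.@-]+$"
)


def is_instance_profile_arn(arn: str) -> bool:
    # True if the string matches the instance-profile ARN shape.
    return bool(_INSTANCE_PROFILE_ARN.match(arn))
```

A role ARN (arn:aws:iam::...:role/...) would not match; the cluster.yaml field expects the instance profile, not the role itself.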

I will add some documentation to Ray Tune's SyncConfig section. Keeping this bug open to track that.
