
Dataset permission errors from the tokenizer in finetune-workflow #129

Open

parallelo opened this issue Jan 19, 2023 · 1 comment

@parallelo
Contributor
Hi! I'm working on reproducing your Argo workflow for fine-tuning GPT-J.

I'm able to create a PVC, download the dataset into it, and submit the Argo workflow:

kubectl apply -f finetune-pvc.yaml
kubectl apply -f finetune-download-dataset.yaml
kubectl apply -f inference-role.yaml
argo submit finetune-workflow.yaml \
        -p run_name=example-gpt-j-6b \
        -p dataset=dataset \
        -p reorder=random \
        -p run_inference=true \
        -p inference_only=false \
        -p model=EleutherAI/gpt-j-6B \
        --serviceaccount inference

However, whenever I try to read the dataset in the tokenizer step of the workflow, it hits a filesystem access error for the PVC:

2023/01/19 04:30:29 Downloaded /finetune-data/models/EleutherAI/gpt-j-6B/tokenizer.json... 1.4 MB completed.
2023/01/19 04:30:29 Resolving /finetune-data/models/EleutherAI/gpt-j-6B/config.json...
2023/01/19 04:30:29 Downloaded /finetune-data/models/EleutherAI/gpt-j-6B/config.json... 930 B completed.
2023/01/19 04:30:29 open /finetune-data/dataset/: permission denied
time="2023-01-19T04:30:30.343Z" level=info msg="sub-process exited" argo=true error="<nil>"
Error: exit status 1
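For reference, the `permission denied` above looks like a plain POSIX `open()` failure (EACCES) rather than anything Kubernetes-API-level. A minimal local sketch of the same failure mode, with illustrative paths that are not the real PVC mount:

```shell
#!/bin/sh
# Minimal local sketch of the failure mode seen in the tokenizer step.
# Paths are illustrative, not the actual PVC mount.
mkdir -p /tmp/pvc-demo/dataset
chmod 000 /tmp/pvc-demo/dataset            # strip all r/w/x bits

# A non-root process calling open() on this directory would now get
# EACCES, i.e. the same "permission denied" string seen in the log.
stat -c '%a %n' /tmp/pvc-demo/dataset      # mode is now 0

chmod 755 /tmp/pvc-demo/dataset            # restoring sane permissions
stat -c '%a %n' /tmp/pvc-demo/dataset      # clears the error for readers
```

If the tokenizer container runs as a non-root UID that doesn't own the dataset directory on the PVC, it would hit exactly this.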

I tried updating the role / role bindings to grant access to the PVC, but it still fails:

$ git diff inference-role.yaml
diff --git a/finetuner-workflow/inference-role.yaml b/finetuner-workflow/inference-role.yaml
index 7d99bd1..3d50526 100644
--- a/finetuner-workflow/inference-role.yaml
+++ b/finetuner-workflow/inference-role.yaml
@@ -21,6 +21,9 @@ rules:
       - revisions
     verbs:
       - '*'
+  - apiGroups: [""]
+    resources: ["persistentvolumeclaims"]
+    verbs: ["get", "watch", "list"]
 ---
 apiVersion: rbac.authorization.k8s.io/v1
 kind: RoleBinding

The events listed for the relevant tokenizer pods do not show any warnings/errors for attaching to the PVC.

Still troubleshooting... must be missing some further permissions somewhere. Please let me know if you have suggestions in the meantime. Thanks in advance!
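(One caveat on the RBAC route above: Role/RoleBinding rules only govern Kubernetes API access, e.g. listing PVC objects — they don't affect filesystem permissions inside the mounted volume. If the tokenizer container runs as a non-root user, a pod-level `securityContext` with `fsGroup` is the usual way to make the volume accessible; a sketch, with illustrative UID/GID values that are not taken from this workflow:

```yaml
# Sketch only: UID/GID values are illustrative.
securityContext:
  runAsUser: 1000   # non-root UID the container runs as
  fsGroup: 1000     # kubelet applies this GID to the volume on mount
```

Whether this applies depends on what user the tokenizer image actually runs as.)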

@parallelo
Contributor Author

parallelo commented Jan 19, 2023

Still digging... seems like the mountPaths are goofed up in the Argo Workflow?

filebrowser pod (WORKS CORRECTLY):

    volumeMounts:
    - mountPath: /data/finetune-data
      name: finetune-data
  ...
  volumes:
  - name: finetune-data
    persistentVolumeClaim:
      claimName: finetune-data

finetune-model-tokenizer pod (READ PERMISSION ERROR):

    volumeMounts:
    - mountPath: /finetune-data
      name: finetune-data
  ...
  volumes:
  - name: finetune-data
    persistentVolumeClaim:
      claimName: finetune-data

Edit: Previously referenced mainctrfs, but that was just the wait container. Now just looking into mountPath values set as:

  • /data/finetune-data (working)
  • /finetune-data (not working)
