Problems in run train.py #85

Open
JianchengZ opened this issue Jun 30, 2024 · 3 comments

Comments

@JianchengZ

Hello, I have run into a problem.
When I run: [torchrun --nproc_per_node 4 train.py --scale small --data_dir ./Data --output_dir ./Results/ --exp_name clip_score_train_results],
I get this error: [from training.distributed import world_info_from_env
ModuleNotFoundError: No module named 'training'].
I have tried both pip and conda, but I still cannot install the module.

@JianchengZ
Author

Sorry, I have solved the problem above, but another problem still exists.
The new problem is:
[W socket.cpp:464] [c10d] The server socket cannot be initialized on [::]:29500 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:697] [c10d] The client socket cannot be initialized to connect to [localhost]:29500
Could you please help me with this?
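
One way to narrow down warnings like these is to check which address families can actually bind the rendezvous port. The sketch below is only a diagnostic under assumptions: `can_bind` is a hypothetical helper, not part of PyTorch, and 29500 is simply the port shown in the log. errno 97 (EAFNOSUPPORT) on `[::]` usually means the host has no IPv6 support, so the IPv4 result is the one that matters here.

```python
import errno
import socket
from typing import Tuple


def can_bind(family: int, host: str, port: int) -> Tuple[bool, str]:
    """Try to bind a TCP socket of the given family; return (ok, reason)."""
    try:
        # Creating the socket can itself fail when the family is unsupported,
        # so it sits inside the try block as well.
        with socket.socket(family, socket.SOCK_STREAM) as sock:
            sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
            sock.bind((host, port))
        return True, "ok"
    except OSError as exc:
        name = errno.errorcode.get(exc.errno, "unknown")
        return False, f"{name}: {exc.strerror}"


if __name__ == "__main__":
    port = 29500  # port taken from the warning messages above
    print("IPv6 [::]    :", can_bind(socket.AF_INET6, "::", port))
    print("IPv4 0.0.0.0 :", can_bind(socket.AF_INET, "0.0.0.0", port))
```

If the IPv4 bind succeeds, the IPv6 warning on its own may be harmless. If the IPv4 bind also fails because the port is in use, a different port can be passed to torchrun with `--master_port`.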

@HariSeldon11988

@JianchengZ

I have the same problem as you (from training.distributed ..... No module named 'training').
Can you tell me how you solved the problem? It would help me a lot :)

@JianchengZ
Author

@HariSeldon11988

Yes, here is the answer (as commented by others):
The training module comes from open_clip. If you are interested in looking at the source code, you can find the module in the open_clip repository: https://github.com/mlfoundations/open_clip/tree/v2.16.1/src/training
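
As a minimal sketch of how this can be applied, assuming the open_clip repository has been cloned locally: the snippet below puts its `src` directory on `sys.path` and checks whether `training` can then be found. The helper `ensure_importable` and the path `./open_clip/src` are hypothetical and only for illustration; adjust the path to wherever your checkout lives.

```python
import importlib
import importlib.util
import sys
from pathlib import Path


def ensure_importable(module_name: str, src_dir: str) -> bool:
    """Add src_dir to sys.path if needed; return True if module_name can be found."""
    if importlib.util.find_spec(module_name) is not None:
        return True  # already importable, nothing to do
    src = Path(src_dir).expanduser().resolve()
    if src.is_dir() and str(src) not in sys.path:
        sys.path.insert(0, str(src))
        importlib.invalidate_caches()
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # Hypothetical checkout location; not taken from the issue.
    found = ensure_importable("training", "./open_clip/src")
    print("training importable:", found)
```

Setting the `PYTHONPATH` environment variable to the same `src` directory before calling torchrun has the same effect without touching the script.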
