-
Notifications
You must be signed in to change notification settings - Fork 516
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Documentation Improvements #745
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dirkgr Can you review this?
README.md
Outdated
@@ -58,7 +55,7 @@ The core models in the OLMo family released so far are (all trained on the [Dolm | |||
URLs to checkpoints at intermediate steps of the models' trainings can be found in the csv files under [`checkpoints/official/`](https://github.com/allenai/OLMo/blob/main/checkpoints/official). These 'directory' URLs cannot currently be directly accessed, but files within the directory are publicly accessible. These URLs can also be provided to the training script to resume training from the checkpoint (see [Training](#training)). Each checkpoint directory consists of: | |||
|
|||
- `config.yaml`: the config at that training step. | |||
- `model.pt`, `optim.pt`, `train.pt`: model, optimizer and training state at that training step. | |||
- `model.safetensors`, `optim.safetensors`, `train.pt`: model, optimizer and training state at that training step. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
train.safetensors
? Also, for the original model we just have *.pt
so we should have that format mentioned somewhere.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are going to save in .safetensors starting from OLMo-2
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, but people might still try to use older OLMo models. The documentation should be backwards-compatible?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, not backwards compatible
scripts/download_checkpoints.py
Outdated
except requests.exceptions.HTTPError as e: | ||
print(f"HTTP error for {pattern}: {e}") | ||
except requests.exceptions.RequestException as e: | ||
print(f"Connection error for {pattern}: {e}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is still swallowing exceptions. It just prints about them. This part of the code should check whether the exception is a 404 error, in which case it's fine, and otherwise let the exception propagate.
scripts/download_checkpoints.py
Outdated
print(f"Saving to: {base_path}") | ||
available_files = try_get_directory_listing(public_url) | ||
|
||
if not available_files: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if not available_files: | |
if len(available_files) <= 0: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How does this make any better? found_files can never be negative right? if not available_files:
is same as if len(available_files) == 0:
.
scripts/download_checkpoints.py
Outdated
try: | ||
test_url = urljoin(url.rstrip('/') + '/', pattern) | ||
response = requests.head(test_url) | ||
# response.raise_for_status() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Leftover debugging code?
scripts/download_checkpoints.py
Outdated
try: | ||
print(f"\nDownloading: {file}") | ||
download_file(file_url, file_path) | ||
except requests.exceptions.Timeout: | ||
print(f"Timeout error for {file}, retrying once...") | ||
try: | ||
download_file(file_url, file_path) | ||
except requests.exceptions.RequestException as e: | ||
failed_files.append(file) | ||
print(f"Failed to download {file}: {e}") | ||
except requests.exceptions.RequestException as e: | ||
failed_files.append(file) | ||
print(f"Failed to download {file}: {e}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't retry in this way. The Python requests library already supports retries. In this code we should just call download_file()
. If a file fails, we fail the whole process (i.e., we let the exception propagate).
scripts/download_checkpoints.py
Outdated
response = requests.get(url, stream=True) | ||
response.raise_for_status() | ||
total_size = int(response.headers.get('content-length', 0)) | ||
save_path.parent.mkdir(parents=True, exist_ok=True) | ||
|
||
with open(save_path, 'wb') as f: | ||
with tqdm(total=total_size, unit='B', unit_scale=True, desc=save_path.name) as pbar: | ||
for chunk in response.iter_content(chunk_size=chunk_size): | ||
if chunk: | ||
f.write(chunk) | ||
pbar.update(len(chunk)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Retries should be handled in here. If the request fails with a recoverable error code (anything that starts with 5XX, 408, 409, 429), it should wait one second, and then try again. Try up to 5 times, and then give up. When giving up, it must make sure that the file at save_path
does not exist.
scripts/download_checkpoints.py
Outdated
print(f"Error: Step {args.step} not found in the CSV file.") | ||
print("Use list argument to see available step numbers.") | ||
return |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't print and then return. If you do this, the process will return normally, and whoever called it will think it succeeded. If you want to fail, raise an exception and let it crash the process.
scripts/download_checkpoints.py
Outdated
reader = csv.DictReader(f) | ||
urls = [(row['Step'], row['Checkpoint Directory']) for row in reader] | ||
|
||
if args.command == 'list': |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be an if ladder:
if args.command == 'foo':
do_foo_things()
do_other_things()
elif args.command == 'bar':
do_bar_things()
do_more_things()
else:
raise RuntimeException(f"Unexpected command: {args.command}")
scripts/download_checkpoints.py
Outdated
print(f"Step {step}") | ||
return | ||
|
||
if args.step: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if args.step: | |
if args.step is not None: |
scripts/download_checkpoints.py
Outdated
help='Download checkpoints from CSV file') | ||
download_parser.add_argument('csv_file', type=str, | ||
help='Path to the CSV file containing checkpoint URLs') | ||
download_parser.add_argument('--step', type=str, required=True, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the code it seems like step
is not required if you want to download everything?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, step is required argument. It throws this error when user fails to add --step
.
python scripts/download_checkpoints.py download checkpoints/official/OLMo-1B.csv --save-dir ./new_checkpoints
usage: download_checkpoints.py download [-h] --step STEP [--save-dir SAVE_DIR] csv_file
download_checkpoints.py download: error: the following arguments are required: --step
scripts/download_checkpoints.py
Outdated
r2_url = convert_to_r2_url(url) | ||
public_url = convert_to_public_url(r2_url) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The URLs in the CSV are already public URLs? Why are we doing this change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've taken these functions from my another file where I will have r2 urls in csv while uploading. I missed this here.
Documentation Improvements
Changes Made
scripts/download_checkpoints.py
to automate checkpoint downloads.scripts/train.py
.New Features
The new
scripts/download_checkpoints.py
script: