From 29fed188871b3bf3d9d7bf4534ad08826a305988 Mon Sep 17 00:00:00 2001
From: Akshita Bhagia
Date: Thu, 25 Jan 2024 21:00:46 -0800
Subject: [PATCH 1/5] update readme with inference section

---
 README.md | 34 ++++++++++++++++++++++++++++++++++
 1 file changed, 34 insertions(+)

diff --git a/README.md b/README.md
index 3e63e51dc..f239a0e0a 100644
--- a/README.md
+++ b/README.md
@@ -46,3 +46,37 @@ torchrun --nproc_per_node=8 scripts/train.py {path_to_train_config} \
 ```

 Note: passing CLI overrides like `--reset_trainer_state` is only necessary if you didn't update those fields in your config.
+
+
+## Inference
+
+To run inference on the OLMo checkpoints:
+
+```python
+from hf_olmo import *
+
+from transformers import AutoModelForCausalLM, AutoTokenizer
+olmo = AutoModelForCausalLM.from_pretrained("allenai/OLMo-7B")
+tokenizer = AutoTokenizer.from_pretrained("allenai/OLMo-7B")
+message = ["Language modeling is "]
+
+inputs = tokenizer(message, return_tensors='pt', return_token_type_ids=False)
+response = olmo.generate(**inputs, max_new_tokens=100, do_sample=True, top_k=50, top_p=0.95)
+print(tokenizer.batch_decode(response, skip_special_tokens=True)[0])
+```
+
+Alternatively, with the HuggingFace pipeline abstraction:
+
+```python
+from transformers import pipeline
+olmo_pipe = pipeline("text-generation", model="allenai/OLMo-7B")
+print(olmo_pipe("Language modeling is"))
+```
+
+### Quantization
+
+```python
+olmo = AutoModelForCausalLM.from_pretrained("allenai/OLMo-7B", torch_dtype=torch.float16, load_in_8bit=True) # requires bitsandbytes
+```
+
+The quantized model is more sensitive to input types and CUDA handling, so it is recommended to pass the inputs explicitly as `inputs.input_ids.to('cuda')` to avoid potential issues.

From 761fa94d66e9d3c243656cc4dbe9cdd1d98c78a1 Mon Sep 17 00:00:00 2001
From: Akshita Bhagia
Date: Thu, 25 Jan 2024 21:06:36 -0800
Subject: [PATCH 2/5] add subsection about hf conversion

---
 README.md | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/README.md b/README.md
index f239a0e0a..3da6ca978 100644
--- a/README.md
+++ b/README.md
@@ -73,6 +73,15 @@ olmo_pipe = pipeline("text-generation", model="allenai/OLMo-7B")
 print(olmo_pipe("Language modeling is"))
 ```

+
+### Inference on finetuned checkpoints
+
+If you finetune the model using the code above, you can use the conversion script to convert a native olmo checkpoint to a huggingface-compatible checkpoint
+
+```bash
+python hf_olmo/convert_olmo_to_hf.py --checkpoint-dir /path/to/checkpoint
+```
+
 ### Quantization

 ```python
 olmo = AutoModelForCausalLM.from_pretrained("allenai/OLMo-7B", torch_dtype=torch.float16, load_in_8bit=True) # requires bitsandbytes
 ```

 The quantized model is more sensitive to input types and CUDA handling, so it is recommended to pass the inputs explicitly as `inputs.input_ids.to('cuda')` to avoid potential issues.

From 35fcf30dd0fc1cd0b59c5bd33ebf423bf4f7363b Mon Sep 17 00:00:00 2001
From: Akshita Bhagia
Date: Fri, 26 Jan 2024 11:46:04 -0800
Subject: [PATCH 3/5] Update README.md

Co-authored-by: Pete
---
 README.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/README.md b/README.md
index 3da6ca978..c5bf876b7 100644
--- a/README.md
+++ b/README.md
@@ -54,12 +54,12 @@ To run inference on the OLMo checkpoints:

 ```python
 from hf_olmo import *
-
 from transformers import AutoModelForCausalLM, AutoTokenizer
+
 olmo = AutoModelForCausalLM.from_pretrained("allenai/OLMo-7B")
 tokenizer = AutoTokenizer.from_pretrained("allenai/OLMo-7B")
-message = ["Language modeling is "]

+message = ["Language modeling is "]
 inputs = tokenizer(message, return_tensors='pt', return_token_type_ids=False)
 response = olmo.generate(**inputs, max_new_tokens=100, do_sample=True, top_k=50, top_p=0.95)
 print(tokenizer.batch_decode(response, skip_special_tokens=True)[0])

From 99d5090cb5df6eceabacbbb14ea93f4dec72ba3f Mon Sep 17 00:00:00 2001
From: Akshita Bhagia
Date: Fri, 26 Jan 2024 11:46:19 -0800
Subject: [PATCH 4/5] Update README.md

Co-authored-by: Pete
---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index c5bf876b7..e126f1d66 100644
--- a/README.md
+++ b/README.md
@@ -76,7 +76,7 @@ print(olmo_pipe("Language modeling is"))
 ```

 ### Inference on finetuned checkpoints

-If you finetune the model using the code above, you can use the conversion script to convert a native olmo checkpoint to a huggingface-compatible checkpoint
+If you finetune the model using the code above, you can use the conversion script to convert a native OLMo checkpoint to a HuggingFace-compatible checkpoint.

 ```bash
 python hf_olmo/convert_olmo_to_hf.py --checkpoint-dir /path/to/checkpoint
 ```

From 780e38627d5335ef887012e62026560b9f8a22a3 Mon Sep 17 00:00:00 2001
From: Akshita Bhagia
Date: Tue, 30 Jan 2024 11:25:10 -0800
Subject: [PATCH 5/5] update content

---
 README.md | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/README.md b/README.md
index a96976b26..d84f5a660 100644
--- a/README.md
+++ b/README.md
@@ -71,10 +71,11 @@ Note: passing CLI overrides like `--reset_trainer_state` is only necessary if yo

 ## Inference

-To run inference on the OLMo checkpoints:
+You can utilize our HuggingFace integration to run inference on the OLMo checkpoints:

 ```python
-from hf_olmo import *
+from hf_olmo import * # registers the Auto* classes
+
 from transformers import AutoModelForCausalLM, AutoTokenizer

 olmo = AutoModelForCausalLM.from_pretrained("allenai/OLMo-7B")
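
A minimal end-to-end sketch of the quantized inference path described in the patches above, assuming a CUDA-capable GPU and the `bitsandbytes` package are available:

```python
import torch
from hf_olmo import *  # registers the OLMo Auto* classes

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the checkpoint in 8-bit as in the Quantization section (requires bitsandbytes).
olmo = AutoModelForCausalLM.from_pretrained(
    "allenai/OLMo-7B", torch_dtype=torch.float16, load_in_8bit=True
)
tokenizer = AutoTokenizer.from_pretrained("allenai/OLMo-7B")

message = ["Language modeling is "]
inputs = tokenizer(message, return_tensors='pt', return_token_type_ids=False)

# The quantized model is sensitive to input types and device placement, so pass the
# token ids to CUDA explicitly instead of forwarding the whole tokenizer output.
response = olmo.generate(input_ids=inputs.input_ids.to('cuda'), max_new_tokens=100, do_sample=True, top_k=50, top_p=0.95)
print(tokenizer.batch_decode(response, skip_special_tokens=True)[0])
```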