Commit 1849301: initial version

4 files changed: +2708, -0 lines

README.md (+58 lines)

# Flux training

I made this repo to share the details of how I fine-tuned the Flux model. The main steps follow this [tutorial](https://medium.com/@geronimo7/how-to-train-a-flux1-lora-for-1-dfd1800afce5).

## Hardware requirements

Training requires at least 24 GB of VRAM, and currently only single-GPU training is supported, so a SageMaker notebook on a G5 instance or higher will do. Inference currently needs a bit more, at least 28 GB of VRAM; I couldn't run inference on our ML-PT account, so I ended up renting an instance with 48 GB of VRAM on Vast.AI. You will also need at least 100 GB of storage.

## Training steps

### Step 1: Clone repo and install dependencies

The tutorial I followed is based on Ostris' [AI-Toolkit](https://github.com/ostris/ai-toolkit). Start by cloning that repo and installing its dependencies. The tutorial uses `/workspace` as the working folder; you can choose whatever suits you.

```
%cd /workspace
# %cd (rather than !cd) so the directory change persists for the commands below
!git clone https://github.com/ostris/ai-toolkit.git
!cd ai-toolkit && git submodule update --init --recursive && pip install -r requirements.txt
```

### Step 2: Upload images and generate captions

Next, upload a folder with your images and captions. If you don't have captions, you can generate them automatically using the `image-caption.ipynb` notebook. The code takes the folder with your images, generates a caption for every image, and stores it in a `.txt` file with the same name; keep in mind that you may need to adapt it to your folder structure. The expected layout is sketched below.
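
For orientation, assuming a flat folder of `.jpg` files (the file names here are just examples), the dataset ends up looking like this:

```
images/
  img1.jpg
  img1.txt   <- caption generated for img1.jpg
  img2.jpg
  img2.txt
  ...
```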

### Step 3: Log into Hugging Face

Access to the FLUX.1-dev model is gated, so you first have to accept its terms. Log into your Hugging Face account (or create one) and accept the terms in the [FLUX.1-dev repository](https://huggingface.co/black-forest-labs/FLUX.1-dev).

Next, generate a Hugging Face API token on your account and log in:

```
!huggingface-cli login --token hf_XXXXTOKENXXXX
```

### Step 4: Define training parameters

In the first cell of `train-flux.ipynb` you need to define:

* INPUT_FOLDER: where your images and captions are stored,
* OUTPUT_FOLDER: where to store results such as samples and weights,
* TRIGGER_WORD: the name/word for the object or subject you are fine-tuning on,

plus a few other training parameters you can play with, such as after how many steps to save the weights or produce sample images to measure progress. For your first run I would leave them as they are and adjust on later runs if you feel the need; a sketch of such a cell follows.
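
As a rough illustration only (the variable names and step counts below are assumptions, not the exact contents of the notebook):

```
# Hypothetical first cell of train-flux.ipynb; adapt names and values to your notebook
INPUT_FOLDER = "/workspace/images"    # folder with imgN.jpg + imgN.txt caption pairs
OUTPUT_FOLDER = "/workspace/output"   # samples and LoRA weights get written here
TRIGGER_WORD = "mysubject"            # word that will activate your subject at inference time

SAVE_EVERY = 250      # save LoRA weights every N steps (assumed default)
SAMPLE_EVERY = 250    # generate sample images every N steps (assumed default)
TOTAL_STEPS = 2250    # total training steps (matches the run described in Step 6)
```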

### Step 5: Define the job parameter dictionary

That's what the second cell of the notebook does. You can dig deeper into each parameter; my main advice is that if you are training with limited VRAM, leave the `low_vram` parameter uncommented, and if you have VRAM to spare, comment it out so the training takes much less time. The sketch below shows roughly what such a dictionary looks like.
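
For orientation only, here is a heavily abridged sketch in the style of AI-Toolkit's Flux LoRA examples; treat every key and value as an assumption and defer to the actual cell in the notebook:

```
from collections import OrderedDict

# Abridged, hypothetical job config; the real cell has more keys (save, sample, EMA, etc.)
job_config = OrderedDict({
    "job": "extension",
    "config": {
        "name": "my_flux_lora",
        "process": [{
            "type": "sd_trainer",
            "training_folder": OUTPUT_FOLDER,
            "trigger_word": TRIGGER_WORD,
            "network": {"type": "lora", "linear": 16, "linear_alpha": 16},
            "datasets": [{
                "folder_path": INPUT_FOLDER,
                "caption_ext": "txt",
                "resolution": [512, 768, 1024],
            }],
            "train": {
                "batch_size": 1,
                "steps": TOTAL_STEPS,
                "lr": 1e-4,
                "gradient_checkpointing": True,
                "noise_scheduler": "flowmatch",
                "optimizer": "adamw8bit",
            },
            "model": {
                "name_or_path": "black-forest-labs/FLUX.1-dev",
                "is_flux": True,
                "quantize": True,
                "low_vram": True,  # comment this out if you have VRAM to spare
            },
        }],
    },
})
```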

### Step 6: Run the training job

Finally, run the training job using the last cell of the notebook. The actual training time will vary depending on how many images you used, your GPU's VRAM, how many steps you defined, and so on. In my case, using 24 GB of VRAM and `low_vram` mode, it took around 4.5 hours to run 2250 training steps.
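
A minimal sketch of what that last cell can look like, assuming the `run_job` helper from AI-Toolkit's example notebooks is available (if your version differs, the CLI equivalent is running `python run.py <your_config>.yaml` from the ai-toolkit folder):

```
# Hypothetical last cell; assumes ai-toolkit is cloned under /workspace as in Step 1
import sys
sys.path.append("/workspace/ai-toolkit")

from toolkit.job import run_job

run_job(job_config)  # blocks until all steps finish; weights and samples land in OUTPUT_FOLDER
```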

## Inference

Finally, you can use the `flux-lora-img-gen-results.ipynb` notebook to load your fine-tuned model and generate cool images with it. There are several parameters to consider when running inference (see the sketch after this list):

* prompt and negative prompt: the actual textual description of the image you want to generate. Flux apparently relies on plain-text prompts, unlike Stable Diffusion, which used a bunch of flags and parameters. For inspiration you can visit [PromptHero](https://prompthero.com/flux-prompts), although I found it quite biased towards suggestive images 😒. I also left a couple of interesting prompts in the inference notebook.
* num_inference_steps: how many denoising steps the model runs before returning the image. You will have to find the number that gives you the best results; for me, between 20 and 40 worked best.
* width
* height
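
A minimal sketch of this kind of inference using the `FluxPipeline` from diffusers; the LoRA folder and file name below are assumptions, so point them at whatever your training run actually produced:

```
import torch
from diffusers import FluxPipeline

# Load the base FLUX.1-dev model (gated: requires the Hugging Face login from Step 3)
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
)
pipe.enable_model_cpu_offload()  # helps when VRAM is tight; drop it if you have plenty

# Attach the LoRA weights produced by training (hypothetical path and file name)
pipe.load_lora_weights("/workspace/output/my_flux_lora", weight_name="my_flux_lora.safetensors")

image = pipe(
    prompt="photo of mysubject hiking next to a mountain lake at golden hour",  # include your TRIGGER_WORD
    num_inference_steps=30,  # 20-40 worked best for me
    guidance_scale=3.5,
    width=1024,
    height=1024,
).images[0]
image.save("result.png")
```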

## Extra info

Bojan Jakimovski shared with me another repository for fine-tuning Flux that looks even easier: [FluxGym](https://github.com/cocktailpeanut/fluxgym). I believe it is from the person who made the AI browser [Pinokio](https://pinokio.computer/). If anyone tries it out, please let me know and we can add your experience to this repo.

flux-lora-img-gen-results.ipynb (+712 lines)

Large diffs are not rendered by default.

image-caption.ipynb (+205 lines)

This notebook installs its dependencies, loads Microsoft's Florence-2 model, and then captions every `.jpg` in `./images`, writing each caption to a `.txt` file with the same base name. Its code cells are:

```
# Cells 1-4: install dependencies
!python -m pip install --upgrade pip wheel setuptools
!pip install torch
!FLASH_ATTENTION_SKIP_CUDA_BUILD=TRUE pip install flash-attn --no-build-isolation
!pip install transformers timm

# Cell 5: load Florence-2 and define a captioning helper
from transformers import AutoProcessor, AutoModelForCausalLM
from PIL import Image
import requests
import copy

model_id = 'microsoft/Florence-2-large'
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True).eval().cuda()
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

def run_example(task_prompt, text_input=None):
    # note: relies on the global `image` defined in the calling cell
    if text_input is None:
        prompt = task_prompt
    else:
        prompt = task_prompt + text_input

    inputs = processor(text=prompt, images=image, return_tensors="pt")
    generated_ids = model.generate(
        input_ids=inputs["input_ids"].cuda(),
        pixel_values=inputs["pixel_values"].cuda(),
        max_new_tokens=1024,
        early_stopping=False,
        do_sample=False,
        num_beams=3,
    )
    generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
    parsed_answer = processor.post_process_generation(
        generated_text,
        task=task_prompt,
        image_size=(image.width, image.height)
    )

    return parsed_answer

# Cell 6: caption a single image as a sanity check
image = Image.open("img16.jpg").convert("RGB")

task_prompt = "<MORE_DETAILED_CAPTION>"
answer = run_example(task_prompt=task_prompt)

print(answer)
# Example output:
# {'<MORE_DETAILED_CAPTION>': 'The image shows a young man standing on a sandy beach with a lake
#  and mountains in the background. [...] The sky is clear and blue.'}

# Cells 7-8: inspect the image folder
import os
os.listdir('./images')

# Cell 9: caption every .jpg and write a .txt file with the same name
# (prints "Captioning image: /images/imgN" for each of the 16 images)
folder = './images'

list_of_img = os.listdir(folder)

for img in list_of_img:
    if img.endswith('.jpg'):
        file_path = (folder+'/'+img).split('.')[1]
        print(f'Captioning image: {file_path}')
        image_path = '.'+file_path+'.jpg'
        image = Image.open(image_path).convert("RGB")
        task_prompt = "<MORE_DETAILED_CAPTION>"
        answer = run_example(task_prompt=task_prompt)
        text_path = '.'+file_path+'.txt'
        with open(text_path, 'w') as f:
            f.write(answer['<MORE_DETAILED_CAPTION>'])
```
