This document describes the steps to adapt SeeClick to downstream tasks, including data download, preprocessing, visualization, model fine-tuning, and testing.
- Mind2Web: Download the screenshots and annotations (train set and test set of Domain/Website/Task). Note that, per the Mind2Web policy, please DO NOT redistribute the unzipped data files online.
- AITW: Download the screenshots and annotations (train/val/test). Check the original AITW project for details and data usage.
- MiniWob: Download the screenshots and annotations (2.8K train set). These trajectories were rolled out with Synapse, a recent LLM agent framework; check their repo for more details.
conda create --name env_name python=3.8
source activate env_name
pip install -r requirements_agent.txt
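If you want to confirm the environment before moving on, the following quick check (a minimal sketch; the exact package list is defined by requirements_agent.txt, so adjust the imports accordingly) verifies that the core dependencies import cleanly:

```python
# Hypothetical sanity check: verify that core packages used by the agent scripts import.
# The authoritative dependency list is requirements_agent.txt; adjust as needed.
import importlib

for pkg in ["torch", "transformers", "peft", "PIL"]:
    try:
        mod = importlib.import_module(pkg)
        print(f"{pkg}: {getattr(mod, '__version__', 'ok')}")
    except ImportError as err:
        print(f"{pkg} is missing: {err}")
```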
Place the downloaded annotations in the data folder. Then process the Mind2Web training set to generate the json file for SFT of LVLMs:
cd agent_tasks
python mind2web_process.py --imgs_dir mind2web_imgs
The mind2web_imgs should be replaced by the actual directory of the downloaded Mind2Web screenshots.
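After the processing script finishes, it can help to peek at the generated file before fine-tuning. The sketch below only assumes the output is a json list of training samples; the path and field names are illustrative, so check mind2web_process.py for the actual output location and schema.

```python
# Inspect the generated SFT data (path and field names are assumptions;
# see mind2web_process.py for the real output location and schema).
import json

with open("../data/mind2web_train_sft.json", "r") as f:  # assumed output location
    samples = json.load(f)

print("number of training samples:", len(samples))
print("keys of the first sample:", list(samples[0].keys()))
print(json.dumps(samples[0], indent=2)[:500])            # preview the first sample
```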
Uncomment lines 84-87 of the processing script to visualize annotated Mind2Web episodes.
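If you prefer not to edit the script, a quick way to eyeball an annotation is to draw the target click point on the screenshot yourself. This is a generic sketch; the file name and the normalized-coordinate convention are assumptions, and the script's built-in visualization remains the reference.

```python
# Generic sketch for eyeballing an annotation; the processing script's own visualization is authoritative.
from PIL import Image, ImageDraw

img = Image.open("mind2web_imgs/example_screenshot.png")  # hypothetical screenshot
x, y = 0.62, 0.35                                         # assumed click point, normalized to [0, 1]
w, h = img.size

draw = ImageDraw.Draw(img)
cx, cy, r = x * w, y * h, 12
draw.ellipse((cx - r, cy - r, cx + r, cy + r), outline="red", width=4)
img.save("annotated_example.png")
```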
bash finetune/finetune_lora_ds.sh --save-name SeeClick_test --max-length 704 --micro-batch-size 4 --save-interval 500 \
    --train-epochs 10 --nproc-per-node 2 --data-path xxxx/mind2web_train_sft.json --learning-rate 3e-5 \
    --gradient-accumulation-steps 8 --qwen-ckpt xxxx/Qwen-VL-Chat --pretrain-ckpt xxxx/SeeClick-pretrain \
    --save-path xxxx/checkpoint_qwen
- data-path: sft data generated in the above step
- qwen-ckpt: original Qwen-VL checkpoint path, for loading the tokenizer
- pretrain-ckpt: base model for fine-tuning, e.g. SeeClick-pretrain or Qwen-VL
- save-path: directory to save training checkpoints
The fine-tuning script is similar to Qwen-VL's, except that we use LoRA to fine-tune customized parameters, as in finetune/finetune.py lines 315-327. This script fine-tunes the pre-trained LVLM with LoRA and multi-GPU training; for more options such as full fine-tuning, Q-LoRA, and single-GPU training, please refer to Qwen-VL.
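For orientation, attaching LoRA adapters with peft typically looks like the following. This is only a minimal sketch: the rank, alpha, and target modules here are illustrative assumptions, not the exact configuration in finetune/finetune.py lines 315-327.

```python
# Minimal LoRA sketch with peft (illustrative hyperparameters, not the repo's exact config).
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Placeholder path as in the commands above; replace with your own checkpoint.
model = AutoModelForCausalLM.from_pretrained(
    "xxxx/SeeClick-pretrain", trust_remote_code=True
)

lora_config = LoraConfig(
    r=64,                                       # illustrative rank
    lora_alpha=16,                              # illustrative scaling
    lora_dropout=0.05,
    target_modules=["c_attn", "attn.c_proj"],   # assumed attention projections; check finetune.py
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()              # only the LoRA parameters should be trainable
```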
After fine-tuning the LVLM on the above sft data, evaluation is performed on the three test subsets (Domain/Website/Task).
Alternatively, we provide a fine-tuned SeeClick checkpoint for evaluation.
cd agent_tasks
python mind2web_test.py --model_path xxxx/SeeClick-mind2web --qwen_path xxxx/Qwen-VL-Chat --imgs_dir mind2web_imgs --task website
- model_path: the trained checkpoint of the LVLM/SeeClick model
- qwen_path: the original checkpoint of Qwen-VL-Chat, for loading the tokenizer and config
- imgs_dir: the directory of downloaded mind2web screenshots
- task: evaluation subset, one of domain, website, and task
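Before running the full evaluation, it can be useful to sanity-check the checkpoint on a single screenshot. The sketch below assumes the Qwen-VL-Chat-style chat interface (trust_remote_code) and a hypothetical screenshot path and prompt; adapt it to the prompts actually used in mind2web_test.py.

```python
# Hypothetical single-step inference sketch (not the evaluation script itself).
from transformers import AutoModelForCausalLM, AutoTokenizer

qwen_path = "xxxx/Qwen-VL-Chat"        # original Qwen-VL-Chat, for tokenizer/config
model_path = "xxxx/SeeClick-mind2web"  # fine-tuned checkpoint

tokenizer = AutoTokenizer.from_pretrained(qwen_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path, device_map="auto", trust_remote_code=True
).eval()

# Qwen-VL-Chat packs the image and the text query into a single prompt.
query = tokenizer.from_list_format([
    {"image": "mind2web_imgs/example_screenshot.png"},              # hypothetical screenshot path
    {"text": "What is the next action to complete the task?"},      # hypothetical instruction
])
response, _ = model.chat(tokenizer, query=query, history=None)
print(response)
```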
Place the downloaded annotations in the data folder. Then process the AITW training set to generate the json file for SFT of LVLMs:
cd agent_tasks
python aitw_process.py --imgs_dir aitw_imgs
The aitw_imgs should be replaced by the actual directory of the downloaded AITW screenshots.
Uncomment lines 99-104 of the processing script to visualize annotated AITW episodes.
bash finetune/finetune_lora_ds.sh --save-name SeeClick_test --max-length 704 --micro-batch-size 4 --save-interval 500 \
    --train-epochs 10 --nproc-per-node 2 --data-path xxxx/aitw_train_sft.json --learning-rate 3e-5 \
    --gradient-accumulation-steps 8 --qwen-ckpt xxxx/Qwen-VL-Chat --pretrain-ckpt xxxx/SeeClick-pretrain \
    --save-path xxxx/checkpoint_qwen
- data-path: sft data generated in the above step
- qwen-ckpt: original Qwen-VL checkpoint path, for loading the tokenizer
- pretrain-ckpt: base model for fine-tuning, e.g. SeeClick-pretrain or Qwen-VL
- save-path: directory to save training checkpoints
The fine-tuning script is similar to Qwen-VL's, except that we use LoRA to fine-tune customized parameters, as in finetune/finetune.py lines 315-327. This script fine-tunes the pre-trained LVLM with LoRA and multi-GPU training; for more options such as full fine-tuning, Q-LoRA, and single-GPU training, please refer to Qwen-VL.
After fine-tuning the LVLM on the above sft data, evaluation is performed on the test set. Our evaluation follows the official AITW repo to calculate the action matching score; a simplified sketch of this matching is given after the parameter list below.
cd agent_tasks
python aitw_test.py --model_path xxxx/SeeClick-aitw --qwen_path xxxx/Qwen-VL-Chat --imgs_dir aitw_imgs
- model_path: the trained checkpoint of the LVLM/SeeClick model
- qwen_path: the original checkpoint of Qwen-VL-Chat, for loading the tokenizer and config
- imgs_dir: the directory of downloaded AITW screenshots
Place the downloaded annotations in the data folder. Then process the MiniWob training set to generate the json file for SFT of LVLMs:
cd agent_tasks
python miniwob_process.py --imgs_dir miniwob_imgs
The miniwob_imgs should be replaced by the actual directory of the downloaded MiniWob screenshots.
Uncomment lines 50-55 of the processing script to visualize annotated MiniWob episodes.
bash finetune/finetune_lora_ds.sh --save-name SeeClick_test --max-length 704 --micro-batch-size 4 --save-interval 500 \
    --train-epochs 10 --nproc-per-node 2 --data-path xxxx/miniwob_train_sft.json --learning-rate 3e-5 \
    --gradient-accumulation-steps 8 --qwen-ckpt xxxx/Qwen-VL-Chat --pretrain-ckpt xxxx/SeeClick-pretrain \
    --save-path xxxx/checkpoint_qwen
- data-path: sft data generated in the above step
- qwen-ckpt: original Qwen-VL checkpoint path, for loading the tokenizer
- pretrain-ckpt: base model for fine-tuning, e.g. SeeClick-pretrain or Qwen-VL
- save-path: directory to save training checkpoints
The fine-tuning script is similar to Qwen-VL's, except that we use LoRA to fine-tune customized parameters, as in finetune/finetune.py lines 315-327. This script fine-tunes the pre-trained LVLM with LoRA and multi-GPU training; for more options such as full fine-tuning, Q-LoRA, and single-GPU training, please refer to Qwen-VL.
After fine-tuning the LVLM on the above sft data, evaluation is performed in the MiniWob environment. Each MiniWob episode is initialized with a random seed, so the instructions and environments seen during evaluation are unseen in training.
Our evaluation code uses the MiniWob environment from Synapse. The environment is built on Chrome and Selenium, so you need to install Chrome and a compatible ChromeDriver first.
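Before launching the full evaluation, it can help to confirm that Chrome and ChromeDriver work on your machine. The following is a minimal, standalone Selenium check, independent of the Synapse environment; it simply opens a headless browser and closes it again.

```python
# Minimal standalone check that Chrome + ChromeDriver are usable (not part of the evaluation code).
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")      # required on servers without a display
options.add_argument("--no-sandbox")    # often needed inside containers

driver = webdriver.Chrome(options=options)
try:
    driver.get("about:blank")
    print("Chrome launched, current title:", driver.title)
finally:
    driver.quit()
```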
cd agent_tasks
python miniwob_test.py --model_path xxxx/SeeClick-miniwob --qwen_path xxxx/Qwen-VL-Chat --imgs_dir miniwob_imgs
- model_path: the trained checkpoint of the LVLM/SeeClick model
- qwen_path: the original checkpoint of Qwen-VL-Chat, for loading the tokenizer and config
- imgs_dir: the directory of downloaded MiniWob screenshots
- num_episodes: the number of evaluation episodes for each task
- env_name: specific task name; the default all tests on all 55 available tasks
- headless: servers without a Graphical User Interface need to evaluate in headless mode