
Commit c4e9dd9

Merge pull request EvolvingLMMs-Lab#63 from hunterheiden/hsh/new_task/screenspot
New Task: ScreenSpot - Grounding (REC) and instruction generation (REG) on screens
2 parents d8a3a99 + 319afcc commit c4e9dd9

9 files changed: +458 -0 lines changed

Diff for: README.md

+3 lines

@@ -242,6 +242,9 @@ We also provide the raw data exported from Weights & Biases for the detailed results

  - ScienceQA (scienceqa_full)
  - ScienceQA Full (scienceqa)
  - ScienceQA IMG (scienceqa_img)
+ - ScreenSpot (screenspot)
+ - ScreenSpot REC / Grounding (screenspot_rec)
+ - ScreenSpot REG / Instruction Generation (screenspot_reg)
  - SeedBench (seedbench)
  - SeedBench 2 (seedbench_2)
  - ST-VQA (stvqa)

Diff for: lmms_eval/tasks/screenspot/README.md

+54 lines

@@ -0,0 +1,54 @@

# ScreenSpot

## GUI Grounding Benchmark: ScreenSpot

ScreenSpot is an evaluation benchmark for GUI grounding, comprising over 1200 instructions from iOS, Android, macOS, Windows, and Web environments, along with annotated element types (Text or Icon/Widget).

## Groups

- `screenspot`: This group bundles both the original grounding task and the new instruction generation task.

## Tasks

- `screenspot_rec_test`: the original evaluation of `{img} {instruction} --> {bounding box}`, called grounding or Referring Expression Comprehension (REC);
- `screenspot_reg_test`: the new evaluation of `{img} {bounding box} --> {instruction}`, called instruction generation or Referring Expression Generation (REG).

A toy example of the document fields both tasks read is sketched below.
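The field names in this sketch (`image`, `instruction`, `bbox`, `file_name`, `data_type`, `data_source`) are the ones consumed by `utils.py` further down in this commit; the concrete values are invented for illustration.

```python
# Hypothetical ScreenSpot-style document; all values are illustrative only.
from PIL import Image

doc = {
    "image": Image.new("RGB", (1920, 1080)),    # full screenshot
    "instruction": "open the settings menu",    # natural-language target description
    "bbox": [1620.0, 40.0, 1900.0, 120.0],      # target element box as [x1, y1, x2, y2]
    "file_name": "example_screenshot_000.png",  # used as an annotation id downstream
    "data_type": "icon",                        # element type: Text or Icon/Widget
    "data_source": "macos",                     # iOS / Android / macOS / Windows / Web
}

# REC: given image + instruction, the model must return the bounding box.
# REG: given the image with the box highlighted, the model must return the instruction.
```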
### REC Metrics

REC/Grounding requires that a model output a bounding box for the target element in the image. The evaluation metrics are listed here, with a small sketch of how they can be computed after the list:
- `IoU`: Intersection over Union (IoU) between the predicted bounding box and the ground-truth bounding box.
- `ACC@IoU`: We use `IoU` to create `ACC@IoU` metrics at different IoU thresholds, where an output with an IoU above the threshold is considered correct.
- `CENTER ACC`: The predicted bounding box is considered correct if its center lies within the ground-truth bounding box. This is what's reported in the paper.
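The following is a minimal, self-contained sketch of these three metrics, not the task's actual scoring code (which lives in `utils_rec.py`); boxes are assumed to be `[x1, y1, x2, y2]` in a shared coordinate system.

```python
# Minimal sketch of the REC metrics; not the task's actual scoring code.
def iou(pred, gt):
    # Boxes are [x1, y1, x2, y2]; IoU = intersection area / union area.
    ix1, iy1 = max(pred[0], gt[0]), max(pred[1], gt[1])
    ix2, iy2 = min(pred[2], gt[2]), min(pred[3], gt[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    pred_area = (pred[2] - pred[0]) * (pred[3] - pred[1])
    gt_area = (gt[2] - gt[0]) * (gt[3] - gt[1])
    union = pred_area + gt_area - inter
    return inter / union if union > 0 else 0.0


def acc_at_iou(pred, gt, threshold=0.5):
    # ACC@IoU: the prediction counts as correct if its IoU clears the threshold.
    return float(iou(pred, gt) >= threshold)


def center_acc(pred, gt):
    # CENTER ACC: correct if the center of the predicted box lies inside the ground-truth box.
    cx, cy = (pred[0] + pred[2]) / 2.0, (pred[1] + pred[3]) / 2.0
    return float(gt[0] <= cx <= gt[2] and gt[1] <= cy <= gt[3])


print(iou([10, 10, 50, 50], [20, 20, 60, 60]))         # ~0.391
print(acc_at_iou([10, 10, 50, 50], [20, 20, 60, 60]))  # 0.0 at the default 0.5 threshold
print(center_acc([10, 10, 50, 50], [20, 20, 60, 60]))  # 1.0 -- center (30, 30) is inside the gt box
```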
### REG Metrics

REG/Generation requires that a model output the instruction that describes the target element in the image. Currently, this element is highlighted in red in the image. The evaluation metric is:
- `CIDEr`: The CIDEr metric is used to evaluate the quality of the generated instruction. As the paper doesn't consider this task, we have selected this metric as the standard for evaluating the quality of the generated instruction. This matches what other works such as ScreenAI have done for instruction generation on the RICO datasets. A minimal example of computing CIDEr follows.
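The sketch below shows a toy CIDEr computation with `pycocoevalcap`, the same package the full aggregation in `utils.py` later in this commit relies on; the captions here are invented.

```python
# Toy CIDEr computation with pycocoevalcap; captions are invented.
from pycocoevalcap.cider.cider import Cider

# Keys are arbitrary example ids; each maps to a list of reference / candidate strings.
gts = {
    0: ["click the red close button"],
    1: ["open the settings menu"],
}
res = {
    0: ["press the red close button"],
    1: ["open settings"],
}

score, per_image_scores = Cider().compute_score(gts, res)
print(score)  # corpus-level CIDEr; per_image_scores holds one value per id
```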
## Baseline Scores

As a baseline, here is how LLaVA-v1.5-7b performs on the ScreenSpot dataset:
- `IoU`: 0.051
- `CENTER ACC`: 0.097
- `CIDEr`: 0.097

## References

- ArXiv: [SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents](https://arxiv.org/abs/2401.10935)
- GitHub: [njucckevin/SeeClick](https://github.com/njucckevin/SeeClick)

```bibtex
@misc{cheng2024seeclick,
      title={SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents},
      author={Kanzhi Cheng and Qiushi Sun and Yougang Chu and Fangzhi Xu and Yantao Li and Jianbing Zhang and Zhiyong Wu},
      year={2024},
      eprint={2401.10935},
      archivePrefix={arXiv},
      primaryClass={cs.HC}
}
```

Diff for: lmms_eval/tasks/screenspot/_default_template_rec_yaml

+33 lines

@@ -0,0 +1,33 @@

dataset_path: rootsautomation/ScreenSpot
output_type: generate_until
doc_to_visual: !function utils_rec.screenspot_rec_doc_to_visual
doc_to_text: !function utils_rec.screenspot_rec_doc_to_text
doc_to_target: "bbox"
generation_kwargs:
  until:
    - "ASSISTANT:"
process_results: !function utils_rec.screenspot_rec_process_result
metric_list:
  - metric: screenspot_IoU
    aggregation : !function utils_rec.screenspot_rec_iou
    higher_is_better : true
  - metric: screenspot_ACC@0.1
    aggregation : !function utils_rec.screenspot_rec_acc01
    higher_is_better : true
  - metric: screenspot_ACC@0.3
    aggregation : !function utils_rec.screenspot_rec_acc03
    higher_is_better : true
  - metric: screenspot_ACC@0.5
    aggregation : !function utils_rec.screenspot_rec_acc05
    higher_is_better : true
  - metric: screenspot_ACC@0.7
    aggregation : !function utils_rec.screenspot_rec_acc07
    higher_is_better : true
  - metric: screenspot_ACC@0.9
    aggregation : !function utils_rec.screenspot_rec_acc09
    higher_is_better : true
  - metric: screenspot_Center_ACC
    aggregation : !function utils_rec.screenspot_rec_center_acc
    higher_is_better : true
metadata:
  version: '0.0'
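
The `utils_rec.py` module referenced by this template is not shown in this excerpt. Purely to illustrate the contract these hooks follow (the same one `utils.py` below uses for REG), here is a hypothetical stand-in: `process_results` returns values keyed by the metric names declared in `metric_list`, and each `aggregation` function reduces the collected per-document values to a single number. The function bodies are placeholders, not the actual implementation.

```python
# Hypothetical stand-in illustrating the process_results/aggregation contract;
# not the actual utils_rec.py implementation.
def example_rec_process_result(doc, result):
    iou_value = 0.42  # placeholder: IoU between the parsed prediction and doc["bbox"]
    center_hit = 1.0  # placeholder: 1.0 if the predicted center falls inside doc["bbox"]
    # Keys match metric names declared in metric_list above (only two shown here).
    return {"screenspot_IoU": iou_value, "screenspot_Center_ACC": center_hit}


def example_rec_iou_aggregation(values):
    # Wired in via `aggregation : !function ...`; receives every per-document value.
    return sum(values) / max(len(values), 1)
```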

Diff for: lmms_eval/tasks/screenspot/_default_template_reg_yaml

+15 lines

@@ -0,0 +1,15 @@

dataset_path: rootsautomation/ScreenSpot
output_type: generate_until
doc_to_visual: !function utils.screenspot_bbox_doc_to_visual
doc_to_text: !function utils.screenspot_doc_to_text
doc_to_target: "instruction"
generation_kwargs:
  until:
    - "ASSISTANT:"
process_results: !function utils.screenspot_process_result
metric_list:
  - metric: screenspot_CIDEr
    aggregation : !function utils.screenspot_cider
    higher_is_better : true
metadata:
  version: '0.0'

Diff for: lmms_eval/tasks/screenspot/_screenspot.yaml

+4 lines

@@ -0,0 +1,4 @@

group: screenspot
task:
  - screenspot_reg_test
  - screenspot_rec_test

Diff for: lmms_eval/tasks/screenspot/screenspot_rec_test.yaml

+4 lines

@@ -0,0 +1,4 @@

group: screenspot_rec
task: screenspot_rec_test
include: _default_template_rec_yaml
test_split: test

Diff for: lmms_eval/tasks/screenspot/screenspot_reg_test.yaml

+4 lines

@@ -0,0 +1,4 @@

group: screenspot_reg
task: screenspot_reg_test
include: _default_template_reg_yaml
test_split: test

Diff for: lmms_eval/tasks/screenspot/utils.py

+126 lines

@@ -0,0 +1,126 @@

from PIL import ImageDraw
from pycocoevalcap.eval import COCOEvalCap, Bleu, Meteor, Rouge, Cider, Spice
from pycocoevalcap.tokenizer.ptbtokenizer import PTBTokenizer
from pycocotools.coco import COCO

# COCO_METRICS = ["Bleu_4", "Bleu_3", "Bleu_2", "Bleu_1", "METEOR", "ROUGE_L", "CIDEr"]  # , "SPICE"]
COCO_METRICS = ["CIDEr"]

import logging

eval_logger = logging.getLogger("lmms-eval")


def screenspot_bbox_doc_to_visual(doc):
    # Highlight the target element by drawing its bounding box in red on the screenshot.
    bbox = doc["bbox"]
    image = doc["image"].convert("RGB")
    draw = ImageDraw.Draw(image)
    bbox_xy = [bbox[0], bbox[1], bbox[2], bbox[3]]
    draw.rectangle(bbox_xy, outline="red", width=3)
    return [image.convert("RGB")]


def screenspot_process_result(doc, result):
    """
    Args:
        doc: an instance of the eval dataset
        result: [pred]
    Returns:
        a dictionary with key: metric name (in this case screenspot_CIDEr), value: metric value
    """
    pred = result[0] if len(result) > 0 else ""
    ann_id = doc["file_name"]
    data_dict = {"instruction": doc["instruction"], "pred": pred, "ann_id": ann_id, "data_type": doc["data_type"], "data_source": doc["data_source"]}
    return {f"screenspot_{metric}": data_dict for metric in COCO_METRICS}


def screenspot_doc_to_text(doc):
    return f"Direct a user to interact with the highlighted region [{doc['bbox'][0]:.2f}, {doc['bbox'][1]:.2f}, {doc['bbox'][2]:.2f}, {doc['bbox'][3]:.2f}]."


def screenspot_aggregation_result(results, metric):
    # scorers = [(Bleu(4), "Bleu_1"), (Bleu(4), "Bleu_2"), (Bleu(4), "Bleu_3"), (Bleu(4), "Bleu_4"), (Meteor(), "METEOR"), (Rouge(), "ROUGE_L"), (Cider(), "CIDEr"), (Spice(), "SPICE")]
    scorers = [(Cider(), "CIDEr")]
    scorers_dict = {s[1]: s for s in scorers}

    stored_results = []
    # For the COCO eval tools to successfully create an index, the dataset
    # needs at least two entries: 'annotations' and 'images'.
    # 'annotations' exactly reproduces the original annotations, while
    # 'images' only needs the image id, which is contained in the file name.
    dataset = {"annotations": [], "images": []}
    idx = 0
    ann_id = 0
    for result in results:
        stored_results.append({"image_id": idx, "caption": result["pred"]})
        # for s in result["answer"]:
        dataset["annotations"].append({"image_id": idx, "caption": result["instruction"], "id": ann_id})
        ann_id += 1

        dataset["images"].append({"id": idx})
        idx += 1

    coco = COCO()
    # Manually create the index here
    coco.dataset = dataset
    coco.createIndex()

    coco_result = coco.loadRes(stored_results)
    coco_eval = COCOEvalCap(coco, coco_result)

    imgIds = coco_eval.params["image_id"]
    gts = {}
    res = {}
    for imgId in imgIds:
        gts[imgId] = coco_eval.coco.imgToAnns[imgId]
        res[imgId] = coco_eval.cocoRes.imgToAnns[imgId]

    eval_logger.info("tokenization...")
    tokenizer = PTBTokenizer()
    gts = tokenizer.tokenize(gts)
    res = tokenizer.tokenize(res)

    eval_logger.info(f"Computing {metric} scores...")

    score, scores = scorers_dict[metric][0].compute_score(gts, res)
    # coco_eval.setEval(score, metric)

    # When the metric is one of the Bleu variants, score is a list
    if type(score) == list:
        n = int(metric.split("_")[-1])
        score = score[n - 1]

    return score


def screenspot_bleu4(results):
    return screenspot_aggregation_result(results, "Bleu_4")


def screenspot_bleu3(results):
    return screenspot_aggregation_result(results, "Bleu_3")


def screenspot_bleu2(results):
    return screenspot_aggregation_result(results, "Bleu_2")


def screenspot_bleu1(results):
    return screenspot_aggregation_result(results, "Bleu_1")


def screenspot_meteor(results):
    return screenspot_aggregation_result(results, "METEOR")


def screenspot_rougel(results):
    return screenspot_aggregation_result(results, "ROUGE_L")


def screenspot_cider(results):
    return screenspot_aggregation_result(results, "CIDEr")


def screenspot_spice(results):
    return screenspot_aggregation_result(results, "SPICE")
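
Below is a quick, hypothetical smoke test of the REG helpers in this file; the document values are invented, and it assumes the module above is importable with its dependencies (`pillow`, `pycocotools`, `pycocoevalcap`) installed. The final aggregation call is shown as a comment because `PTBTokenizer` additionally needs a Java runtime on the machine.

```python
# Hypothetical smoke test for the REG helpers above; all values are invented.
from PIL import Image

doc = {
    "image": Image.new("RGB", (800, 600)),
    "instruction": "open the settings menu",
    "bbox": [650.0, 20.0, 780.0, 70.0],
    "file_name": "toy_screenshot.png",
    "data_type": "icon",
    "data_source": "web",
}

visuals = screenspot_bbox_doc_to_visual(doc)   # screenshot with the bbox outlined in red
prompt = screenspot_doc_to_text(doc)           # "Direct a user to interact with the highlighted region [...]"
scored = screenspot_process_result(doc, ["click the settings gear icon"])
print(prompt)
print(scored["screenspot_CIDEr"]["pred"])

# Aggregation over many such per-document dicts (requires pycocoevalcap's
# Java-based PTB tokenizer); `list_of_scored_docs` is a hypothetical list of
# outputs from screenspot_process_result:
# cider = screenspot_cider([d["screenspot_CIDEr"] for d in list_of_scored_docs])
```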
