
Commit c4e9dd9

Merge pull request EvolvingLMMs-Lab#63 from hunterheiden/hsh/new_task/screenspot
New Task: ScreenSpot - Grounding (REC) and instruction generation (REG) on screens
2 parents d8a3a99 + 319afcc commit c4e9dd9

9 files changed: +458 -0 lines changed

Diff for: README.md

+3 lines

@@ -242,6 +242,9 @@ We also provide the raw data exported from Weights & Biases for the detailed results

  - ScienceQA (scienceqa_full)
  - ScienceQA Full (scienceqa)
  - ScienceQA IMG (scienceqa_img)
+ - ScreenSpot (screenspot)
+ - ScreenSpot REC / Grounding (screenspot_rec)
+ - ScreenSpot REG / Instruction Generation (screenspot_reg)
  - SeedBench (seedbench)
  - SeedBench 2 (seedbench_2)
  - ST-VQA (stvqa)

Diff for: lmms_eval/tasks/screenspot/README.md

+54 lines

@@ -0,0 +1,54 @@

# ScreenSpot

## GUI Grounding Benchmark: ScreenSpot

ScreenSpot is an evaluation benchmark for GUI grounding, comprising over 1200 instructions from iOS, Android, macOS, Windows, and Web environments, along with annotated element types (Text or Icon/Widget).

## Groups

- `screenspot`: This group bundles both the original grounding task and the new instruction generation task.

## Tasks

- `screenspot_rec_test`: the original evaluation of `{img} {instruction} --> {bounding box}`, called grounding or Referring Expression Comprehension (REC);
- `screenspot_reg_test`: the new evaluation of `{img} {bounding box} --> {instruction}`, called instruction generation or Referring Expression Generation (REG).

A toy example of the document fields both tasks read is sketched below.
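The field names in this sketch (`image`, `instruction`, `bbox`, `file_name`, `data_type`, `data_source`) are the ones consumed by `utils.py` further down in this commit; the concrete values are invented for illustration.

```python
# Hypothetical ScreenSpot-style document; all values are illustrative only.
from PIL import Image

doc = {
    "image": Image.new("RGB", (1920, 1080)),    # full screenshot
    "instruction": "open the settings menu",    # natural-language target description
    "bbox": [1620.0, 40.0, 1900.0, 120.0],      # target element box as [x1, y1, x2, y2]
    "file_name": "example_screenshot_000.png",  # used as an annotation id downstream
    "data_type": "icon",                        # element type: Text or Icon/Widget
    "data_source": "macos",                     # iOS / Android / macOS / Windows / Web
}

# REC: given image + instruction, the model must return the bounding box.
# REG: given the image with the box highlighted, the model must return the instruction.
```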
### REC Metrics

REC/Grounding requires that a model output a bounding box for the target element in the image. The evaluation metrics are listed here, with a small sketch of how they can be computed after the list:
- `IoU`: Intersection over Union (IoU) between the predicted bounding box and the ground-truth bounding box.
- `ACC@IoU`: We use `IoU` to create `ACC@IoU` metrics at different IoU thresholds, where an output with an IoU above the threshold is considered correct.
- `CENTER ACC`: The predicted bounding box is considered correct if its center lies within the ground-truth bounding box. This is what's reported in the paper.
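The following is a minimal, self-contained sketch of these three metrics, not the task's actual scoring code (which lives in `utils_rec.py`); boxes are assumed to be `[x1, y1, x2, y2]` in a shared coordinate system.

```python
# Minimal sketch of the REC metrics; not the task's actual scoring code.
def iou(pred, gt):
    # Boxes are [x1, y1, x2, y2]; IoU = intersection area / union area.
    ix1, iy1 = max(pred[0], gt[0]), max(pred[1], gt[1])
    ix2, iy2 = min(pred[2], gt[2]), min(pred[3], gt[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    pred_area = (pred[2] - pred[0]) * (pred[3] - pred[1])
    gt_area = (gt[2] - gt[0]) * (gt[3] - gt[1])
    union = pred_area + gt_area - inter
    return inter / union if union > 0 else 0.0


def acc_at_iou(pred, gt, threshold=0.5):
    # ACC@IoU: the prediction counts as correct if its IoU clears the threshold.
    return float(iou(pred, gt) >= threshold)


def center_acc(pred, gt):
    # CENTER ACC: correct if the center of the predicted box lies inside the ground-truth box.
    cx, cy = (pred[0] + pred[2]) / 2.0, (pred[1] + pred[3]) / 2.0
    return float(gt[0] <= cx <= gt[2] and gt[1] <= cy <= gt[3])


print(iou([10, 10, 50, 50], [20, 20, 60, 60]))         # ~0.391
print(acc_at_iou([10, 10, 50, 50], [20, 20, 60, 60]))  # 0.0 at the default 0.5 threshold
print(center_acc([10, 10, 50, 50], [20, 20, 60, 60]))  # 1.0 -- center (30, 30) is inside the gt box
```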
### REG Metrics

REG/Generation requires that a model output the instruction that describes the target element in the image. Currently, this element is highlighted in red in the image. The evaluation metric is:
- `CIDEr`: The CIDEr metric is used to evaluate the quality of the generated instruction. As the paper doesn't consider this task, we have selected this metric as the standard for evaluating the quality of the generated instruction. This matches what other works such as ScreenAI have done for instruction generation on the RICO datasets. A minimal example of computing CIDEr follows.
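The sketch below shows a toy CIDEr computation with `pycocoevalcap`, the same package the full aggregation in `utils.py` later in this commit relies on; the captions here are invented.

```python
# Toy CIDEr computation with pycocoevalcap; captions are invented.
from pycocoevalcap.cider.cider import Cider

# Keys are arbitrary example ids; each maps to a list of reference / candidate strings.
gts = {
    0: ["click the red close button"],
    1: ["open the settings menu"],
}
res = {
    0: ["press the red close button"],
    1: ["open settings"],
}

score, per_image_scores = Cider().compute_score(gts, res)
print(score)  # corpus-level CIDEr; per_image_scores holds one value per id
```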
## Baseline Scores

As a baseline, here is how LLaVA-v1.5-7b performs on the ScreenSpot dataset:
- `IoU`: 0.051
- `CENTER ACC`: 0.097
- `CIDEr`: 0.097

## References

- ArXiv: [SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents](https://arxiv.org/abs/2401.10935)
- GitHub: [njucckevin/SeeClick](https://github.com/njucckevin/SeeClick)

```bibtex
@misc{cheng2024seeclick,
      title={SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents},
      author={Kanzhi Cheng and Qiushi Sun and Yougang Chu and Fangzhi Xu and Yantao Li and Jianbing Zhang and Zhiyong Wu},
      year={2024},
      eprint={2401.10935},
      archivePrefix={arXiv},
      primaryClass={cs.HC}
}
```

Diff for: lmms_eval/tasks/screenspot/_default_template_rec_yaml

+33 lines

@@ -0,0 +1,33 @@

dataset_path: rootsautomation/ScreenSpot
output_type: generate_until
doc_to_visual: !function utils_rec.screenspot_rec_doc_to_visual
doc_to_text: !function utils_rec.screenspot_rec_doc_to_text
doc_to_target: "bbox"
generation_kwargs:
  until:
    - "ASSISTANT:"
process_results: !function utils_rec.screenspot_rec_process_result
metric_list:
  - metric: screenspot_IoU
    aggregation : !function utils_rec.screenspot_rec_iou
    higher_is_better : true
  - metric: screenspot_ACC@0.1
    aggregation : !function utils_rec.screenspot_rec_acc01
    higher_is_better : true
  - metric: screenspot_ACC@0.3
    aggregation : !function utils_rec.screenspot_rec_acc03
    higher_is_better : true
  - metric: screenspot_ACC@0.5
    aggregation : !function utils_rec.screenspot_rec_acc05
    higher_is_better : true
  - metric: screenspot_ACC@0.7
    aggregation : !function utils_rec.screenspot_rec_acc07
    higher_is_better : true
  - metric: screenspot_ACC@0.9
    aggregation : !function utils_rec.screenspot_rec_acc09
    higher_is_better : true
  - metric: screenspot_Center_ACC
    aggregation : !function utils_rec.screenspot_rec_center_acc
    higher_is_better : true
metadata:
  version: '0.0'
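
The `utils_rec.py` module referenced by this template is not shown in this excerpt. Purely to illustrate the contract these hooks follow (the same one `utils.py` below uses for REG), here is a hypothetical stand-in: `process_results` returns values keyed by the metric names declared in `metric_list`, and each `aggregation` function reduces the collected per-document values to a single number. The function bodies are placeholders, not the actual implementation.

```python
# Hypothetical stand-in illustrating the process_results/aggregation contract;
# not the actual utils_rec.py implementation.
def example_rec_process_result(doc, result):
    iou_value = 0.42  # placeholder: IoU between the parsed prediction and doc["bbox"]
    center_hit = 1.0  # placeholder: 1.0 if the predicted center falls inside doc["bbox"]
    # Keys match metric names declared in metric_list above (only two shown here).
    return {"screenspot_IoU": iou_value, "screenspot_Center_ACC": center_hit}


def example_rec_iou_aggregation(values):
    # Wired in via `aggregation : !function ...`; receives every per-document value.
    return sum(values) / max(len(values), 1)
```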

Diff for: lmms_eval/tasks/screenspot/_default_template_reg_yaml

+15 lines

@@ -0,0 +1,15 @@

dataset_path: rootsautomation/ScreenSpot
output_type: generate_until
doc_to_visual: !function utils.screenspot_bbox_doc_to_visual
doc_to_text: !function utils.screenspot_doc_to_text
doc_to_target: "instruction"
generation_kwargs:
  until:
    - "ASSISTANT:"
process_results: !function utils.screenspot_process_result
metric_list:
  - metric: screenspot_CIDEr
    aggregation : !function utils.screenspot_cider
    higher_is_better : true
metadata:
  version: '0.0'

Diff for: lmms_eval/tasks/screenspot/_screenspot.yaml

+4 lines

@@ -0,0 +1,4 @@

group: screenspot
task:
  - screenspot_reg_test
  - screenspot_rec_test

Diff for: lmms_eval/tasks/screenspot/screenspot_rec_test.yaml

+4 lines

@@ -0,0 +1,4 @@

group: screenspot_rec
task: screenspot_rec_test
include: _default_template_rec_yaml
test_split: test

Diff for: lmms_eval/tasks/screenspot/screenspot_reg_test.yaml

+4 lines

@@ -0,0 +1,4 @@

group: screenspot_reg
task: screenspot_reg_test
include: _default_template_reg_yaml
test_split: test

Diff for: lmms_eval/tasks/screenspot/utils.py

+126 lines

@@ -0,0 +1,126 @@

from PIL import ImageDraw
from pycocoevalcap.eval import COCOEvalCap, Bleu, Meteor, Rouge, Cider, Spice
from pycocoevalcap.tokenizer.ptbtokenizer import PTBTokenizer
from pycocotools.coco import COCO

# COCO_METRICS = ["Bleu_4", "Bleu_3", "Bleu_2", "Bleu_1", "METEOR", "ROUGE_L", "CIDEr"]  # , "SPICE"]
COCO_METRICS = ["CIDEr"]

import logging

eval_logger = logging.getLogger("lmms-eval")


def screenspot_bbox_doc_to_visual(doc):
    # Highlight the target element by drawing its bounding box in red on the screenshot.
    bbox = doc["bbox"]
    image = doc["image"].convert("RGB")
    draw = ImageDraw.Draw(image)
    bbox_xy = [bbox[0], bbox[1], bbox[2], bbox[3]]
    draw.rectangle(bbox_xy, outline="red", width=3)
    return [image.convert("RGB")]


def screenspot_process_result(doc, result):
    """
    Args:
        doc: an instance of the eval dataset
        result: [pred]
    Returns:
        a dictionary with key: metric name (in this case screenspot_CIDEr), value: metric value
    """
    pred = result[0] if len(result) > 0 else ""
    ann_id = doc["file_name"]
    data_dict = {"instruction": doc["instruction"], "pred": pred, "ann_id": ann_id, "data_type": doc["data_type"], "data_source": doc["data_source"]}
    return {f"screenspot_{metric}": data_dict for metric in COCO_METRICS}


def screenspot_doc_to_text(doc):
    return f"Direct a user to interact with the highlighted region [{doc['bbox'][0]:.2f}, {doc['bbox'][1]:.2f}, {doc['bbox'][2]:.2f}, {doc['bbox'][3]:.2f}]."


def screenspot_aggregation_result(results, metric):
    # scorers = [(Bleu(4), "Bleu_1"), (Bleu(4), "Bleu_2"), (Bleu(4), "Bleu_3"), (Bleu(4), "Bleu_4"), (Meteor(), "METEOR"), (Rouge(), "ROUGE_L"), (Cider(), "CIDEr"), (Spice(), "SPICE")]
    scorers = [(Cider(), "CIDEr")]
    scorers_dict = {s[1]: s for s in scorers}

    stored_results = []
    # For the COCO eval tools to successfully create an index, the dataset
    # needs at least two entries: 'annotations' and 'images'.
    # 'annotations' exactly reproduces the original annotations, while
    # 'images' only needs the image id, which is contained in the file name.
    dataset = {"annotations": [], "images": []}
    idx = 0
    ann_id = 0
    for result in results:
        stored_results.append({"image_id": idx, "caption": result["pred"]})
        # for s in result["answer"]:
        dataset["annotations"].append({"image_id": idx, "caption": result["instruction"], "id": ann_id})
        ann_id += 1

        dataset["images"].append({"id": idx})
        idx += 1

    coco = COCO()
    # Manually create the index here
    coco.dataset = dataset
    coco.createIndex()

    coco_result = coco.loadRes(stored_results)
    coco_eval = COCOEvalCap(coco, coco_result)

    imgIds = coco_eval.params["image_id"]
    gts = {}
    res = {}
    for imgId in imgIds:
        gts[imgId] = coco_eval.coco.imgToAnns[imgId]
        res[imgId] = coco_eval.cocoRes.imgToAnns[imgId]

    eval_logger.info("tokenization...")
    tokenizer = PTBTokenizer()
    gts = tokenizer.tokenize(gts)
    res = tokenizer.tokenize(res)

    eval_logger.info(f"Computing {metric} scores...")

    score, scores = scorers_dict[metric][0].compute_score(gts, res)
    # coco_eval.setEval(score, metric)

    # When the metric is one of the Bleu variants, score is a list
    if type(score) == list:
        n = int(metric.split("_")[-1])
        score = score[n - 1]

    return score


def screenspot_bleu4(results):
    return screenspot_aggregation_result(results, "Bleu_4")


def screenspot_bleu3(results):
    return screenspot_aggregation_result(results, "Bleu_3")


def screenspot_bleu2(results):
    return screenspot_aggregation_result(results, "Bleu_2")


def screenspot_bleu1(results):
    return screenspot_aggregation_result(results, "Bleu_1")


def screenspot_meteor(results):
    return screenspot_aggregation_result(results, "METEOR")


def screenspot_rougel(results):
    return screenspot_aggregation_result(results, "ROUGE_L")


def screenspot_cider(results):
    return screenspot_aggregation_result(results, "CIDEr")


def screenspot_spice(results):
    return screenspot_aggregation_result(results, "SPICE")
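
Below is a quick, hypothetical smoke test of the REG helpers in this file; the document values are invented, and it assumes the module above is importable with its dependencies (`pillow`, `pycocotools`, `pycocoevalcap`) installed. The final aggregation call is shown as a comment because `PTBTokenizer` additionally needs a Java runtime on the machine.

```python
# Hypothetical smoke test for the REG helpers above; all values are invented.
from PIL import Image

doc = {
    "image": Image.new("RGB", (800, 600)),
    "instruction": "open the settings menu",
    "bbox": [650.0, 20.0, 780.0, 70.0],
    "file_name": "toy_screenshot.png",
    "data_type": "icon",
    "data_source": "web",
}

visuals = screenspot_bbox_doc_to_visual(doc)   # screenshot with the bbox outlined in red
prompt = screenspot_doc_to_text(doc)           # "Direct a user to interact with the highlighted region [...]"
scored = screenspot_process_result(doc, ["click the settings gear icon"])
print(prompt)
print(scored["screenspot_CIDEr"]["pred"])

# Aggregation over many such per-document dicts (requires pycocoevalcap's
# Java-based PTB tokenizer); `list_of_scored_docs` is a hypothetical list of
# outputs from screenspot_process_result:
# cider = screenspot_cider([d["screenspot_CIDEr"] for d in list_of_scored_docs])
```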
