This is the PyTorch implementation of the Findings of ACL 2022 paper Towards Few-shot Entity Recognition in Document Images: A Label-aware Sequence-to-Sequence Framework.
See requirements.txt (generated by pipreqs). If there are any issues, please contact us :)
Please download the pre-trained weights of LayoutReader from here or here, and copy pytorch_model.bin into ./weights/layoutreader/.
git clone https://github.com/zlwang-cs/LASER-release.git
cd LASER-release
mkdir outputs
cd shell_scripts
sh run_few_shot_FUNSD.sh 0
The digit at the end of the last command is the ID of the GPU you want to use.
Each dataset consists of three kinds of files. Please put them in a folder under /data:
- meta.json
- train/test-text-s2s.jsons
- train/test-layout-s2s.jsons
meta.json contains the following attributes:
- labels: The entity types
- words: The words used in the labels
- tokens: The tokens used in the labels (to achieve better performance, please use simple label words so that each label involves only a single token)
- token_dict: A dictionary mapping each token to its token index
- next_token_dict: A dictionary mapping each token to the possible following tokens (refer to the provided dataset, FUNSD, for an example)
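The attribute list above can be checked mechanically before training. The following is a minimal sketch, not part of the released code: the helper names `missing_meta_keys` and `load_meta` are hypothetical, and only the five attribute names come from this README.

```python
import json

# Attribute names taken from the list above.
REQUIRED_META_KEYS = ("labels", "words", "tokens", "token_dict", "next_token_dict")


def missing_meta_keys(meta: dict) -> list:
    """Return the required attribute names that are absent from `meta`."""
    return [key for key in REQUIRED_META_KEYS if key not in meta]


def load_meta(path: str) -> dict:
    """Load a meta.json file and fail early if an attribute is missing."""
    with open(path, encoding="utf-8") as f:
        meta = json.load(f)
    missing = missing_meta_keys(meta)
    if missing:
        raise ValueError(f"{path} is missing attributes: {missing}")
    return meta
```

Running a check like this on a custom dataset should surface a misspelled or forgotten attribute before a training run starts.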
File for text data (train-text-s2s.jsons and test-text-s2s.jsons for train/test respectively).
Each line is a JSON object which has 4 attributes:
- src: The input text
- tgt: The input text embedded with tags: <BEGIN> Sender <END> question <TAG_END>
- filename: Name of the file
- part_idx: If a file is too long or is augmented into multiple samples, there will be several inputs from the same file. They are numbered sequentially.
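To illustrate the tag format in tgt, here is a small sketch that recovers (entity words, label words) pairs from a tagged target. It assumes tgt is a single whitespace-separated string, which this README does not state explicitly; `extract_entities` is a hypothetical helper and not a function from the repository.

```python
import re

# An entity span is written as: <BEGIN> entity words <END> label words <TAG_END>
TAGGED_SPAN = re.compile(r"<BEGIN>\s*(.*?)\s*<END>\s*(.*?)\s*<TAG_END>")


def extract_entities(tgt: str) -> list:
    """Return (entity_text, label_text) pairs found in a tagged target string."""
    return [(m.group(1), m.group(2)) for m in TAGGED_SPAN.finditer(tgt)]


if __name__ == "__main__":
    # Example taken from the attribute description above.
    print(extract_entities("<BEGIN> Sender <END> question <TAG_END>"))
```

Text outside the tags is ignored, so the same helper can be applied to a full tgt string that mixes tagged and untagged words.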
File for layout data (train-layout-s2s.jsons and test-layout-s2s.jsons for train/test respectively).
Each line is a JSON object which has 4 attributes:
- src: A list of normalized bounding boxes. Each box corresponds to a word in the src of the text part. Example: [[335, 154, 389, 169], ...]
- tgt: A list of normalized bounding boxes. Each box corresponds to a word in the tgt of the text part. The special tags are as follows:
  - <BEGIN>: [1001, 1001, 1001, 1001]
  - <END>: [1002, 1002, 1002, 1002]
  - <TAG_END>: [1003, 1003, 1003, 1003]
  - the entity type labels: [1004, 1004, 1004, 1004], ...
- w: the original width of the page
- h: the original height of the page
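The layout attributes above can be sketched as follows. This is an illustration under assumptions, not the repository's preprocessing: the 0 to 1000 normalization scale is assumed (it is suggested by the special boxes starting at 1001 but is not stated in this README), and `normalize_box` and `is_special_box` are hypothetical helpers. Only the three special tag boxes are copied from the list above.

```python
# Special tag boxes copied from the list above.
SPECIAL_TAG_BOXES = {
    "<BEGIN>": [1001, 1001, 1001, 1001],
    "<END>": [1002, 1002, 1002, 1002],
    "<TAG_END>": [1003, 1003, 1003, 1003],
}


def normalize_box(box, w, h, scale=1000):
    """Rescale a pixel box [x0, y0, x1, y1] by the page size (w, h).

    ASSUMPTION: coordinates are mapped to the range 0..scale with scale=1000.
    """
    x0, y0, x1, y1 = box
    return [
        int(scale * x0 / w),
        int(scale * y0 / h),
        int(scale * x1 / w),
        int(scale * y1 / h),
    ]


def is_special_box(box):
    """True for placeholder boxes such as [1001, 1001, 1001, 1001]."""
    return len(set(box)) == 1 and box[0] > 1000
```

Under this assumption, real word boxes never exceed 1000, so any box whose four values are equal and above 1000 can be treated as a tag or label placeholder rather than a position on the page.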
A JSON file under data_utils that lists, for each number of shots and each random seed, the files used for few-shot training:
{
    "1": { // the number of shots
        "1": // the random seed used to generate this few-shot list
            [ "A" ], // the exact filenames in this list
        "2":
            [ "B" ],
        ...
    }
}
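A sketch of how such a few-shot list might be consumed: look up the filenames for a given number of shots and seed, then keep only the training samples from those files. The function names are hypothetical, and the lookup assumes the keys are strings, as in the JSON structure above. Note that a real file must not contain the // comments shown there, since Python's json module rejects them.

```python
def select_few_shot_files(few_shot_lists: dict, num_shots: int, seed: int) -> list:
    """Return the filenames listed for the given number of shots and seed."""
    # JSON object keys are always strings, so convert the integers first.
    return few_shot_lists[str(num_shots)][str(seed)]


def filter_samples(samples: list, filenames: list) -> list:
    """Keep only samples whose `filename` attribute is in the few-shot list."""
    allowed = set(filenames)
    return [sample for sample in samples if sample["filename"] in allowed]


if __name__ == "__main__":
    lists = {"1": {"1": ["A"], "2": ["B"]}}
    samples = [{"filename": "A", "part_idx": 0}, {"filename": "B", "part_idx": 0}]
    print(filter_samples(samples, select_few_shot_files(lists, num_shots=1, seed=2)))
```

Filtering by filename rather than by line position keeps all parts of a split or augmented file together in the selected subset.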
See shell_scripts/run_few_shot_CUSTOM.sh
See collect_results.ipynb
If you find the project useful, please cite our paper:
@inproceedings{wang2022towards,
title={Towards Few-shot Entity Recognition in Document Images: A Label-aware Sequence-to-Sequence Framework},
author={Wang, Zilong and Shang, Jingbo},
booktitle={Findings of the Association for Computational Linguistics: ACL 2022},
pages={4174--4186},
year={2022}
}