This is the PyTorch implementation of the ACL '22 Findings paper Towards Few-shot Entity Recognition in Document Images: A Label-aware Sequence-to-Sequence Framework.
See requirements.txt (generated by pipreqs). If there are any issues, please contact us :)
Please download the pre-trained weights of LayoutReader from here or here, and copy pytorch_model.bin into ./weights/layoutreader/.
git clone https://github.com/zlwang-cs/LASER-release.git
cd LASER-release
mkdir outputs
cd shell_scripts
sh run_few_shot_FUNSD.sh 0
The digit at the end of the last command is the ID of the GPU you want to use.
Each dataset consists of three files. Please put them in a folder under /data:
- meta.json
- train/test-text-s2s.jsons
- train/test-layout-s2s.jsons
meta.json contains the following attributes:
- labels: The entity types
- words: The words used in the labels
- tokens: The tokens used in the labels (to achieve better performance, please use simple label words so that each label involves only a single token)
- token_dict: A dictionary mapping each token to its token index
- next_token_dict: A dictionary mapping each token to the possible following tokens (refer to the provided dataset, FUNSD, for an example)
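Before training on a custom dataset, it can help to confirm that meta.json carries all five attributes listed above. The following is a minimal sketch, not part of the released code; the helper names `check_meta` and `load_meta` are hypothetical, and only the presence of the keys is checked because the exact value types are best taken from the provided FUNSD example.

```python
import json

# Attribute names taken from the list above.
REQUIRED_META_KEYS = ("labels", "words", "tokens", "token_dict", "next_token_dict")


def check_meta(meta):
    """Return the required attributes that are missing from a meta dict."""
    return [key for key in REQUIRED_META_KEYS if key not in meta]


def load_meta(path):
    """Load a meta.json file and fail early if an attribute is missing."""
    with open(path, encoding="utf-8") as f:
        meta = json.load(f)
    missing = check_meta(meta)
    if missing:
        raise KeyError(f"meta.json is missing attributes: {missing}")
    return meta
```

Calling `load_meta` on the meta.json of your dataset folder raises a `KeyError` that names the missing attributes, which is usually easier to act on than a failure deep inside training.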
File for text data (train-text-s2s.jsons and test-text-s2s.jsons for train/test respectively).
Each line is a JSON object which has 4 attributes:
- src: The input text
- tgt: The input text embedded with tags: <BEGIN> Sender <END> question <TAG_END>
- filename: Name of the file
- part_idx: When a file is too long or is augmented into multiple samples, it yields several inputs from the same file. These pieces are numbered sequentially.
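To illustrate the tagged target format, here is a small sketch that reads a .jsons file line by line and pulls (entity text, label words) pairs out of a `tgt` string. It assumes `tgt` is a whitespace-separated string that follows the `<BEGIN> ... <END> ... <TAG_END>` pattern of the example above; the function names `read_jsons` and `extract_entities` are hypothetical and not part of the repository.

```python
import json
import re

# Pattern assumed from the example: <BEGIN> entity words <END> label words <TAG_END>
TAGGED_SPAN = re.compile(r"<BEGIN>\s*(.*?)\s*<END>\s*(.*?)\s*<TAG_END>")


def extract_entities(tgt):
    """Return (entity_text, label_words) pairs found in a tagged target string."""
    return [(m.group(1), m.group(2)) for m in TAGGED_SPAN.finditer(tgt)]


def read_jsons(path):
    """Yield one JSON object per non-empty line of a .jsons file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)


print(extract_entities("<BEGIN> Sender <END> question <TAG_END>"))
```

If your `tgt` is stored as a list of tokens instead, joining it with single spaces before calling `extract_entities` should give the same result.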
File for layout data (train-layout-s2s.jsons and test-layout-s2s.jsons for train/test respectively).
Each line is a JSON object which has 4 attributes:
- src: A list of normalized bounding boxes. Each box corresponds to a word in the src of the text part. Example: [[335, 154, 389, 169],...]
- tgt: A list of normalized bounding boxes. Each box corresponds to a word in the tgt of the text part. The special tags are as follows:
  - <BEGIN>: [1001, 1001, 1001, 1001]
  - <END>: [1002, 1002, 1002, 1002]
  - <TAG_END>: [1003, 1003, 1003, 1003]
  - the entity type labels: [1004, 1004, 1004, 1004], ...
- w: the original width of the page
- h: the original height of the page
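The layout attributes can be sketched in a few lines. The example below assumes that "normalized" means scaling pixel coordinates into a 0 to 1000 range using `w` and `h`, a convention common in LayoutLM-style models but not confirmed by this README, and that words are separated by whitespace. The helper names are hypothetical; only the three special-tag boxes are copied from the list above.

```python
# Special-tag boxes as listed above.
SPECIAL_BOXES = {
    "<BEGIN>": [1001, 1001, 1001, 1001],
    "<END>": [1002, 1002, 1002, 1002],
    "<TAG_END>": [1003, 1003, 1003, 1003],
}


def normalize_box(box, w, h, scale=1000):
    """Scale a pixel box [x0, y0, x1, y1] into the 0..scale range (assumed convention)."""
    x0, y0, x1, y1 = box
    return [
        int(scale * x0 / w),
        int(scale * y0 / h),
        int(scale * x1 / w),
        int(scale * y1 / h),
    ]


def is_special_box(box):
    """True if all four values are equal and above the assumed 0..1000 word range."""
    return len(set(box)) == 1 and box[0] > 1000


def boxes_match_words(text, boxes):
    """Check the one-box-per-word alignment between a text string and its boxes."""
    return len(text.split()) == len(boxes)
```

Running `boxes_match_words` on each pair of text and layout lines is a cheap way to catch misaligned samples before training.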
The few-shot splits are listed in a JSON file under data_utils:
{
  "1": {          // the number of shots
    "1":          // the random seed used to generate this few-shot list
      [ "A" ],    // the exact filenames in this list
    "2":
      [ "B" ],
    ...
  }
}
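As a rough illustration of how such a split file can be used, the sketch below looks up the filenames for a given number of shots and random seed, then keeps only the samples whose `filename` attribute appears in that list. The function names are hypothetical, the inline JSON stands in for the real file (which must not contain the `//` comments shown above), and matching on `filename` is an assumption about how the list relates to the text and layout samples.

```python
import json


def select_few_shot(split, shots, seed):
    """Return the filenames listed for a shot count and seed (JSON keys are strings)."""
    return split[str(shots)][str(seed)]


def filter_samples(samples, filenames):
    """Keep only the samples whose 'filename' attribute is in the few-shot list."""
    keep = set(filenames)
    return [s for s in samples if s["filename"] in keep]


# Stand-in for the split file, using the placeholder names from the example above.
split = json.loads('{"1": {"1": ["A"], "2": ["B"]}}')
picked = select_few_shot(split, shots=1, seed=2)

samples = [{"filename": "A", "part_idx": 0}, {"filename": "B", "part_idx": 0}]
print(filter_samples(samples, picked))
```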
See shell_scripts/run_few_shot_CUSTOM.sh
See collect_results.ipynb
If you find the project useful, please cite our paper:
@inproceedings{wang2022towards,
title={Towards Few-shot Entity Recognition in Document Images: A Label-aware Sequence-to-Sequence Framework},
author={Wang, Zilong and Shang, Jingbo},
booktitle={Findings of the Association for Computational Linguistics: ACL 2022},
pages={4174--4186},
year={2022}
}