LASER

This is the PyTorch implementation of the ACL '22 Findings paper Towards Few-shot Entity Recognition in Document Images: A Label-aware Sequence-to-Sequence Framework.

Quick Start

See requirements.txt (generated by pipreqs). If there are any issues, please contact us :)

Pretrained Weight

Please download the pre-trained weight of LayoutReader from here or here, and copy pytorch_model.bin into ./weights/layoutreader/.

Train & Decode & Evaluate FUNSD

git clone https://github.com/zlwang-cs/LASER-release.git
cd LASER-release
mkdir outputs
cd shell_scripts
sh run_few_shot_FUNSD.sh 0

The digit at the end of the last command is the ID of the GPU you want to use.

Your Custom Dataset

Required Format

Each dataset consists of three kinds of files. Please put them in a folder under /data:

meta.json
train/test-text-s2s.jsons
train/test-layout-s2s.jsons

Dataset Meta (A json file describing the dataset information)

Contains the following attributes:

labels: The entity types
words: The words used in the labels
tokens: The tokens used in the labels (for better performance, use simple label words so that each label maps to a single token)
token_dict: A dictionary mapping the token to the token index
next_token_dict: A dictionary mapping the token to the possible following token (Refer to the provided dataset, FUNSD, to see the example)
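Before training on a custom dataset, it can help to confirm that meta.json carries all five attributes listed above. The following is a minimal sketch, not part of the released code: `check_meta` is a hypothetical helper, and the example values are illustrative only. For the real value layout, refer to the provided FUNSD dataset.

```python
# Attribute names taken from the list above.
REQUIRED_KEYS = ["labels", "words", "tokens", "token_dict", "next_token_dict"]


def check_meta(meta):
    """Return a list of problems found in a meta dict; an empty list means OK.

    Only the presence of each attribute is checked. The structure of the
    values is not validated here because it depends on the dataset.
    """
    return [f"missing attribute: {key}" for key in REQUIRED_KEYS if key not in meta]


# Illustrative, deliberately incomplete meta (values are made up).
incomplete_meta = {
    "labels": ["question", "answer"],
    "words": ["question", "answer"],
}

for problem in check_meta(incomplete_meta):
    print(problem)
```

In practice the dict would come from `json.load` on the dataset's meta.json rather than from a literal.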

File for text data (`train-text-s2s.jsons` and `test-text-s2s.jsons` for train/test respectively).

Each line is a json object which has 4 attributes:

src: The input text
tgt: The input text embedded with tags, e.g. <BEGIN> Sender <END> question <TAG_END>
filename: Name of the file
part_idx: If a file is too long or is augmented into multiple samples, it yields several input pieces from the same file. They are numbered sequentially.
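Since each line of a .jsons file is an independent JSON object, a file can be sanity-checked line by line. The sketch below assumes only the four attribute names listed above. `read_text_records` is a hypothetical helper, and the sample record (including the use of an integer for part_idx) is invented for illustration.

```python
import json

# Attribute names taken from the list above.
TEXT_KEYS = {"src", "tgt", "filename", "part_idx"}


def read_text_records(lines):
    """Parse JSON-lines input and verify each record has the four attributes.

    `lines` is any iterable of strings, e.g. an open file handle.
    Blank lines are skipped.
    """
    records = []
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        missing = TEXT_KEYS - record.keys()
        if missing:
            raise ValueError(f"line {line_no}: missing {sorted(missing)}")
        records.append(record)
    return records


# Invented sample; a real call would pass open("train-text-s2s.jsons").
sample_lines = [
    json.dumps({
        "src": "Sender",
        "tgt": "<BEGIN> Sender <END> question <TAG_END>",
        "filename": "A",
        "part_idx": 0,
    }),
    "",
]
records = read_text_records(sample_lines)
print(len(records), records[0]["tgt"])
```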

File for layout data (`train-layout-s2s.jsons` and `test-layout-s2s.jsons` for train/test respectively).

Each line is a json object which has 4 attributes:

src: A list of normalized bounding boxes. Each box corresponds to a word in the src of the text part. [[335, 154, 389, 169],...]
tgt: A list of normalized bounding boxes. Each box corresponds to a word in the tgt of the text part. The special tags are as follows:

<BEGIN>: [1001, 1001, 1001, 1001]
<END>: [1002, 1002, 1002, 1002]
<TAG_END>: [1003, 1003, 1003, 1003]
the entity type labels: [1004, 1004, 1004, 1004], ...

w: the original width of the page
h: the original height of the page
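The layout tgt mirrors the text tgt word for word, so each tag in the text has a matching placeholder box. The sketch below shows one way to build the boxes for a single tagged entity. Two points are assumptions not stated above: that boxes are normalized to a 0 to 1000 range using w and h (suggested by the special values starting at 1001), and that the i-th entity label uses the value 1004 + i. All function names are hypothetical.

```python
# Placeholder boxes for the special tags, as listed above.
SPECIAL_BOXES = {
    "<BEGIN>": [1001, 1001, 1001, 1001],
    "<END>": [1002, 1002, 1002, 1002],
    "<TAG_END>": [1003, 1003, 1003, 1003],
}


def normalize_box(box, w, h, scale=1000):
    """Scale a pixel box [x0, y0, x1, y1] to the range 0..scale.

    Assumption: normalization divides by the original page width/height.
    """
    x0, y0, x1, y1 = box
    return [scale * x0 // w, scale * y0 // h, scale * x1 // w, scale * y1 // h]


def label_box(label_index):
    """Placeholder box for an entity label.

    Assumption: the i-th label (0-based) is encoded as 1004 + i.
    """
    value = 1004 + label_index
    return [value, value, value, value]


def entity_tgt_boxes(word_boxes, label_index):
    """Boxes matching '<BEGIN> words... <END> label <TAG_END>' in the text tgt."""
    return [
        SPECIAL_BOXES["<BEGIN>"],
        *word_boxes,
        SPECIAL_BOXES["<END>"],
        label_box(label_index),
        SPECIAL_BOXES["<TAG_END>"],
    ]


# One word with an already-normalized box, tagged with the first label.
boxes = entity_tgt_boxes([[335, 154, 389, 169]], label_index=0)
print(boxes)
```

If a label spans several tokens, the label box would presumably be repeated once per token so that the two tgt sequences stay aligned. Check the provided FUNSD files to confirm.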

Few-shot Info

A JSON file under data_utils that lists, for each number of shots and each random seed, the files used for training:

{
  "1": {                // the number of shots
    "1":                // the random seed used to generate this few-shot list
      [ "A" ],          // the exact filenames in this list
    "2": 
      [ "B" ],
    ...
  }
}
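A file with this nesting (number of shots, then seed, then filenames) could be generated along the following lines. This is a sketch under stated assumptions: the sampling procedure used for the paper is not described here, so `build_few_shot_info` is a hypothetical helper that simply draws a seeded random sample of filenames for each setting.

```python
import json
import random


def build_few_shot_info(filenames, shots, seeds):
    """Return {str(n_shots): {str(seed): [filenames...]}}.

    Assumption: each list is a uniform random sample of n_shots filenames,
    drawn with a random generator seeded by `seed`.
    """
    pool = sorted(filenames)  # fixed order so results depend only on the seed
    info = {}
    for n_shots in shots:
        per_seed = {}
        for seed in seeds:
            rng = random.Random(seed)
            per_seed[str(seed)] = sorted(rng.sample(pool, n_shots))
        info[str(n_shots)] = per_seed
    return info


# Invented filenames for illustration.
info = build_few_shot_info(["A", "B", "C", "D"], shots=[1, 2], seeds=[1, 2])
print(json.dumps(info, indent=2))
```

Writing the result with `json.dump` into a file under data_utils would produce the layout shown above, without the `//` comments, which are not valid JSON.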

Shell Script to Run the Experiments

See shell_scripts/run_few_shot_CUSTOM.sh

Collect Results

See collect_results.ipynb

Citation

If you find the project useful, please cite our paper:

@inproceedings{wang2022towards,
  title={Towards Few-shot Entity Recognition in Document Images: A Label-aware Sequence-to-Sequence Framework},
  author={Wang, Zilong and Shang, Jingbo},
  booktitle={Findings of the Association for Computational Linguistics: ACL 2022},
  pages={4174--4186},
  year={2022}
}
