
Conversation

mraunak (Contributor) commented Apr 27, 2022

What does this PR do?

This PR adds a new feature for fine-tuning transformer models, called Information Gain Filtration (IGF).

Motivation

The quality of a fine-tuned model depends on the quality of the data samples used in the first few batches. Because the process is stochastic, the random seed influences the quality of the final fine-tuned model. We propose a novel and robust fine-tuning method, "Information Gain Filtration" (IGF), which selects informative training samples before fine-tuning (training), improving both the overall training efficiency and the final performance of the fine-tuned language model.
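
To make the mechanism concrete, here is a minimal, hypothetical sketch of the filtering loop (illustrative only, not the code added in this PR): a small secondary learner scores each candidate context by its predicted information gain, and only contexts scoring above a threshold are used for a fine-tuning step. The toy models, shapes, and threshold are all assumptions made for the example.

    import torch
    import torch.nn as nn

    torch.manual_seed(0)
    VOCAB, CTX = 100, 32

    # Toy stand-in for the language model being fine-tuned (GPT-2 in the real example).
    lm = nn.Sequential(nn.Embedding(VOCAB, 64), nn.Flatten(), nn.Linear(64 * CTX, VOCAB))

    # Secondary learner: predicts the "information gain" of training on a context,
    # i.e. how much a fine-tuning step on it is expected to reduce test loss.
    secondary = nn.Sequential(nn.Embedding(VOCAB, 16), nn.Flatten(), nn.Linear(16 * CTX, 1))

    optimizer = torch.optim.Adam(lm.parameters(), lr=1e-4)
    loss_fn = nn.CrossEntropyLoss()
    threshold = 0.0  # illustrative cutoff on predicted gain

    for step in range(100):
        contexts = torch.randint(0, VOCAB, (8, CTX))  # batch of candidate contexts
        targets = torch.randint(0, VOCAB, (8,))       # next-token targets

        with torch.no_grad():
            gain = secondary(contexts).squeeze(-1)    # predicted gain per context

        keep = gain > threshold                       # IGF: drop uninformative contexts
        if keep.sum() == 0:
            continue                                  # nothing informative in this batch

        loss = loss_fn(lm(contexts[keep]), targets[keep])
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

In the full method, the secondary learner would itself be trained beforehand on contexts paired with their measured information gain on the language model.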

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guidelines,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. This may be of interest to @sgugger.

Models:

  • gpt2

Examples:

HuggingFaceDocBuilderDev commented Apr 27, 2022

The documentation is not available anymore as the PR was closed or merged.

LysandreJik (Member) left a comment

LGTM, thank you @mraunak!

LysandreJik (Member) commented

Could you run the code quality tools to make sure the quality checks pass? You can install them with the following, from the root of your clone:

pip install -e ".[quality]"

And then run them with:

make fixup
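
For context, make fixup applies the style tools to the files modified on the branch and then runs the repository consistency checks (the utils/check_*.py scripts that appear in the output below).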

mraunak (Contributor, Author) commented May 2, 2022

Running the command make fixup gives an error that does not reference anything from my PR.

The output with the error is shown below. Please advise. Thanks!

(igfprnew) mraunak@bcl-main1:~/transformers$ make fixup
No library .py files were modified
python utils/custom_init_isort.py
python utils/style_doc.py src/transformers docs/source --max_len 119
running deps_table_update
updating src/transformers/dependency_versions_table.py
python utils/check_copies.py
python utils/check_table.py
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
python utils/check_dummies.py
python utils/check_repo.py
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
Checking all models are included.
Checking all models are public.
Checking all models are properly tested.
Checking all objects are properly documented.
Checking all models are in at least one auto class.
utils/check_repo.py:456: UserWarning: Full quality checks require all backends to be installed (with pip install -e .[dev] in the Transformers repo, the following are missing: PyTorch, TensorFlow, Flax. While it's probably fine as long as you didn't make any change in one of those backends modeling files, you should probably execute the command above to be on the safe side.
warnings.warn(
python utils/check_inits.py
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
Traceback (most recent call last):
File "utils/check_inits.py", line 265, in
check_submodules()
File "utils/check_inits.py", line 256, in check_submodules
raise ValueError(
ValueError: The following submodules are not properly registed in the main init of Transformers:

  • sagemaker
  • activations
  • activations_tf
  • convert_slow_tokenizer
  • deepspeed
  • generation_beam_constraints
  • generation_beam_search
  • generation_flax_logits_process
  • generation_flax_utils
  • generation_logits_process
  • generation_stopping_criteria
  • generation_tf_logits_process
  • generation_tf_utils
  • generation_utils
  • image_utils
  • keras_callbacks
  • modeling_flax_outputs
  • modeling_flax_utils
  • modeling_outputs
  • modeling_tf_outputs
  • modeling_tf_utils
  • modeling_utils
  • optimization
  • optimization_tf
  • pytorch_utils
  • tf_utils
  • trainer
  • trainer_pt_utils
  • trainer_seq2seq
  • trainer_tf
  • data.datasets

Make sure they appear somewhere in the keys of _import_structure with an empty list as value.
make: *** [repo-consistency] Error 1
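
Note: the UserWarning in the log above points at a likely cause: the consistency checks ran in an environment without PyTorch, TensorFlow, or Flax installed. Assuming the missing backends are indeed the problem, installing all of them from the root of the clone should let the checks run cleanly:

pip install -e ".[dev]"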

TuKo commented May 10, 2022

@LysandreJik thanks for the suggestion! We were able to correct the quality check issues.
Let us know if you need us to run/test anything else. Thank you!

LysandreJik merged commit 5fdb54e into huggingface:main May 18, 2022
LysandreJik (Member) commented

Thank you for your contributions!

TuKo commented May 18, 2022

Thank you for accepting our work!

TuKo deleted the igf-pr branch May 18, 2022 14:47
ArthurZucker added a commit to ArthurZucker/transformers that referenced this pull request May 20, 2022
commit 5419205
Author: Patrick von Platen <[email protected]>
Date:   Thu May 19 23:46:26 2022 +0200

    [Test OPT] Add batch generation test opt (huggingface#17359)

    * up

    * up

commit 48c2269
Author: ddobokki <[email protected]>
Date:   Fri May 20 05:42:44 2022 +0900

    Fix bug in Wav2Vec2 pretrain example (huggingface#17326)

commit 5d6feec
Author: Nathan Dahlberg <[email protected]>
Date:   Thu May 19 16:21:19 2022 -0400

    fix for 17292 (huggingface#17293)

commit 518bd02
Author: Patrick von Platen <[email protected]>
Date:   Thu May 19 22:17:02 2022 +0200

    [Generation] Fix Transition probs (huggingface#17311)

    * [Draft] fix transition probs

    * up

    * up

    * up

    * make it work

    * fix

    * finish

    * update

commit e8714c0
Author: Patrick von Platen <[email protected]>
Date:   Thu May 19 22:15:36 2022 +0200

    [OPT] Run test in lower precision on GPU (huggingface#17353)

    * [OPT] Run test only in half precision

    * up

    * up

    * up

    * up

    * finish

    * fix on GPU

    * Update tests/models/opt/test_modeling_opt.py

commit 2b28229
Author: Nicolas Patry <[email protected]>
Date:   Thu May 19 20:28:12 2022 +0200

    Adding `batch_size` test to QA pipeline. (huggingface#17330)

commit a4386d7
Author: Nicolas Patry <[email protected]>
Date:   Thu May 19 10:29:16 2022 +0200

    [BC] Fixing usage of text pairs (huggingface#17324)

    * [BC] Fixing usage of text pairs

    The BC is actually preventing users from misusing the pipeline since
    users could have been willing to send text pairs and the pipeline would
    instead understand the thing as a batch returning bogus results.

    The correct usage of text pairs is preserved in this PR even when that
    makes the code clunky.

    Adds support for {"text":..,, "text_pair": ...} inputs for both dataset
    iteration and more explicit usage to pairs.

    * Updating the doc.

    * Update src/transformers/pipelines/text_classification.py

    Co-authored-by: Sylvain Gugger <[email protected]>

    * Update src/transformers/pipelines/text_classification.py

    Co-authored-by: Sylvain Gugger <[email protected]>

    * Update tests/pipelines/test_pipelines_text_classification.py

    Co-authored-by: Lysandre Debut <[email protected]>

    * quality.

    Co-authored-by: Sylvain Gugger <[email protected]>
    Co-authored-by: Lysandre Debut <[email protected]>

commit 3601aa8
Author: Stas Bekman <[email protected]>
Date:   Wed May 18 16:00:47 2022 -0700

    [tests] fix copy-n-paste error (huggingface#17312)

    * [tests] fix copy-n-paste error

    * fix

commit 1b20c97
Author: Yih-Dar <[email protected]>
Date:   Wed May 18 21:49:08 2022 +0200

    Fix ci_url might be None (huggingface#17332)

    * fix

    * Update utils/notification_service.py

    Co-authored-by: Lysandre Debut <[email protected]>

    Co-authored-by: ydshieh <[email protected]>
    Co-authored-by: Lysandre Debut <[email protected]>

commit 6aad387
Author: Yih-Dar <[email protected]>
Date:   Wed May 18 21:26:44 2022 +0200

    fix (huggingface#17337)

    Co-authored-by: ydshieh <[email protected]>

commit 1762ded
Author: Zachary Mueller <[email protected]>
Date:   Wed May 18 14:17:40 2022 -0400

    Fix metric calculation in examples and setup tests to run on multi-gpu for no_trainer scripts (huggingface#17331)

    * Fix length in no_trainer examples

    * Add setup and teardown

    * Use new accelerator config generator to automatically make tests able to run based on environment

commit 6e195eb
Author: Jader Martins <[email protected]>
Date:   Wed May 18 14:18:43 2022 -0300

    docs for typical decoding (huggingface#17186)

    Co-authored-by: Jader Martins <[email protected]>

commit 060fe61
Author: Yih-Dar <[email protected]>
Date:   Wed May 18 19:07:48 2022 +0200

    Not send successful report (huggingface#17329)

    * send report only if there is any failure

    Co-authored-by: ydshieh <[email protected]>

commit b3b9f99
Author: Yih-Dar <[email protected]>
Date:   Wed May 18 17:57:23 2022 +0200

    Fix test_t5_decoder_model_past_large_inputs (huggingface#17320)

    Co-authored-by: ydshieh <[email protected]>

commit 6da76b9
Author: Jingya HUANG <[email protected]>
Date:   Wed May 18 17:52:13 2022 +0200

    Add onnx export cuda support (huggingface#17183)

    Co-authored-by: Lysandre Debut <[email protected]>

    Co-authored-by: lewtun <[email protected]>

commit adc0ff2
Author: NielsRogge <[email protected]>
Date:   Wed May 18 17:47:18 2022 +0200

    Add CvT (huggingface#17299)

    * Adding cvt files

    * Adding cvt files

    * changes in init file

    * Adding cvt files

    * changes in init file

    * Style fixes

    * Address comments from code review

    * Apply suggestions from code review

    Co-authored-by: Sylvain Gugger <[email protected]>

    * Format lists in docstring

    * Fix copies

    * Apply suggestion from code review

    Co-authored-by: AnugunjNaman <[email protected]>
    Co-authored-by: Ayushman Singh <[email protected]>
    Co-authored-by: Niels Rogge <[email protected]>
    Co-authored-by: Sylvain Gugger <[email protected]>

commit 4710702
Author: Sylvain Gugger <[email protected]>
Date:   Wed May 18 10:46:40 2022 -0400

    Fix style

commit 5fdb54e
Author: mraunak <[email protected]>
Date:   Wed May 18 10:39:02 2022 -0400

    Add Information Gain Filtration algorithm (huggingface#16953)

    * Add information gain filtration algorithm

    * Complying with black requirements

    * Added author

    * Fixed import order

    * flake8 corrections

    Co-authored-by: Javier Turek <[email protected]>

commit 91ede48
Author: Kamal Raj <[email protected]>
Date:   Wed May 18 19:59:53 2022 +0530

    Fix typo (huggingface#17328)

commit fe28eb9
Author: Yih-Dar <[email protected]>
Date:   Wed May 18 16:06:41 2022 +0200

    remove (huggingface#17325)

    Co-authored-by: ydshieh <[email protected]>

commit 2cb2ea3
Author: Nicolas Patry <[email protected]>
Date:   Wed May 18 16:06:24 2022 +0200

    Accepting real pytorch device as arguments. (huggingface#17318)

    * Accepting real pytorch device as arguments.

    * is_torch_available.

commit 1c9d1f4
Author: Nicolas Patry <[email protected]>
Date:   Wed May 18 15:46:12 2022 +0200

    Updating the docs for `max_seq_len` in QA pipeline (huggingface#17316)

commit 60ad734
Author: Patrick von Platen <[email protected]>
Date:   Wed May 18 15:08:56 2022 +0200

    [T5] Fix init in TF and Flax for pretraining (huggingface#17294)

    * fix init

    * Apply suggestions from code review

    * fix

    * finish

    * Update src/transformers/modeling_tf_utils.py

    Co-authored-by: Sylvain Gugger <[email protected]>

    Co-authored-by: Sylvain Gugger <[email protected]>

commit 7ba1d4e
Author: Joaq <[email protected]>
Date:   Wed May 18 09:23:47 2022 -0300

    Add type hints for ProphetNet (Pytorch) (huggingface#17223)

    * added type hints to prophetnet

    * reformatted with black

    * fix bc black misformatted some parts

    * fix imports

    * fix imports

    * Update src/transformers/models/prophetnet/configuration_prophetnet.py

    Co-authored-by: Matt <[email protected]>

    * update OPTIONAL type hint and docstring

    Co-authored-by: Matt <[email protected]>

commit d6b8e9c
Author: Carl <[email protected]>
Date:   Wed May 18 01:07:43 2022 +0200

    Add trajectory transformer (huggingface#17141)

    * Add trajectory transformer

    Fix model init

    Fix end of lines for .mdx files

    Add trajectory transformer model to toctree

    Add forward input docs

    Fix docs, remove prints, simplify prediction test

    Apply suggestions from code review

    Co-authored-by: Sylvain Gugger <[email protected]>
    Apply suggestions from code review

    Co-authored-by: Lysandre Debut <[email protected]>
    Co-authored-by: Sylvain Gugger <[email protected]>
    Update docs, more descriptive comments

    Apply suggestions from code review

    Co-authored-by: Sylvain Gugger <[email protected]>
    Update readme

    Small comment update and add conversion script

    Rebase and reformat

    Fix copies

    Fix rebase, remove duplicates

    Fix rebase, remove duplicates

    * Remove tapex

    * Remove tapex

    * Remove tapex

commit c352640
Author: Patrick von Platen <[email protected]>
Date:   Wed May 18 00:34:31 2022 +0200

    fix (huggingface#17310)

commit d9050dc
Author: Cesare Campagnano <[email protected]>
Date:   Tue May 17 23:44:37 2022 +0200

    [LED] fix global_attention_mask not being passed for generation and docs clarification about grad checkpointing (huggingface#17112)

    * [LED] fixed global_attention_mask not passed for generation + docs clarification for gradient checkpointing

    * LED docs clarification

    Co-authored-by: Patrick von Platen <[email protected]>

    * [LED] gradient_checkpointing=True should be passed to TrainingArguments

    Co-authored-by: Patrick von Platen <[email protected]>

    * [LED] docs: remove wrong word

    Co-authored-by: Patrick von Platen <[email protected]>

    * [LED] docs fix typo

    Co-authored-by: Patrick von Platen <[email protected]>

    Co-authored-by: Patrick von Platen <[email protected]>

commit bad3583
Author: Jean Vancoppenolle <[email protected]>
Date:   Tue May 17 23:42:14 2022 +0200

    Add support for pretraining recurring span selection to Splinter (huggingface#17247)

    * Add SplinterForSpanSelection for pre-training recurring span selection.

    * Formatting.

    * Rename SplinterForSpanSelection to SplinterForPreTraining.

    * Ensure repo consistency

    * Fixup changes

    * Address SplinterForPreTraining PR comments

    * Incorporate feedback and derive multiple question tokens per example.

    * Update src/transformers/models/splinter/modeling_splinter.py

    Co-authored-by: Patrick von Platen <[email protected]>

    * Update src/transformers/models/splinter/modeling_splinter.py

    Co-authored-by: Patrick von Platen <[email protected]>

    Co-authored-by: Jean Vancoppenole <[email protected]>
    Co-authored-by: Tobias Günther <[email protected]>
    Co-authored-by: Tobias Günther <[email protected]>
    Co-authored-by: Patrick von Platen <[email protected]>

commit 0511305
Author: Yih-Dar <[email protected]>
Date:   Tue May 17 18:56:58 2022 +0200

    Add PR author in CI report + merged by info (huggingface#17298)

    * Add author info to CI report

    * Add merged by info

    * update

    Co-authored-by: ydshieh <[email protected]>

commit 032d63b
Author: Sylvain Gugger <[email protected]>
Date:   Tue May 17 12:56:24 2022 -0400

    Fix dummy creation script (huggingface#17304)

commit 986dd5c
Author: Sylvain Gugger <[email protected]>
Date:   Tue May 17 12:50:14 2022 -0400

    Fix style

commit 38ddab1
Author: Karim Foda <[email protected]>
Date:   Tue May 17 09:32:12 2022 -0700

    Doctest longformer (huggingface#16441)

    * Add initial doctring changes

    * make fixup

    * Add TF doc changes

    * fix seq classifier output

    * fix quality errors

    * t

    * swithc head to random init

    * Fix expected outputs

    * Update src/transformers/models/longformer/modeling_longformer.py

    Co-authored-by: Yih-Dar <[email protected]>

    Co-authored-by: Yih-Dar <[email protected]>

commit 10704e1
Author: Patrick von Platen <[email protected]>
Date:   Tue May 17 18:20:36 2022 +0200

    [Test] Fix W2V-Conformer integration test (huggingface#17303)

    * [Test] Fix W2V-Conformer integration test

    * correct w2v2

    * up

commit 28a0811
Author: regisss <[email protected]>
Date:   Tue May 17 17:58:14 2022 +0200

    Improve mismatched sizes management when loading a pretrained model (huggingface#17257)

    - Add --ignore_mismatched_sizes argument to classification examples

    - Expand the error message when loading a model whose head dimensions are different from expected dimensions

commit 1f13ba8
Author: Patrick von Platen <[email protected]>
Date:   Tue May 17 15:48:23 2022 +0200

    correct opt (huggingface#17301)

commit 349f1c8
Author: Matt <[email protected]>
Date:   Tue May 17 14:36:23 2022 +0100

    Rewrite TensorFlow train_step and test_step (huggingface#17057)

    * Initial commit

    * Better label renaming

    * Remove breakpoint before pushing (this is your job)

    * Test a lot more in the Keras fit() test

    * make fixup

    * Clarify the case where we flatten y dicts into tensors

    * Clarify the case where we flatten y dicts into tensors

    * Extract label name remapping to a method

commit 651e48e
Author: Matt <[email protected]>
Date:   Tue May 17 14:14:17 2022 +0100

    Fix tests of mixed precision now that experimental is deprecated (huggingface#17300)

    * Fix tests of mixed precision now that experimental is deprecated

    * Fix mixed precision in training_args_tf.py too

commit 6d21142
Author: SaulLu <[email protected]>
Date:   Tue May 17 14:33:13 2022 +0200

    fix retribert's `test_torch_encode_plus_sent_to_model` (huggingface#17231)

elusenji pushed a commit to elusenji/transformers that referenced this pull request Jun 12, 2022

* Add information gain filtration algorithm

* Complying with black requirements

* Added author

* Fixed import order

* flake8 corrections

Co-authored-by: Javier Turek <[email protected]>