Releases · tshauck/gcgc

Fix bug that occurred when a supplied token id overrides the default of
an inferred token.
Add pad_at_end boolean setting that when True pads at the end of the
sequence, and when False pads at the beginning.
Add dedicated Vocab object which replaces the dictionary of string to
integer.
Update tokenizer integration to override convert_tokens_to_string.
Fix bug when trying to save the huggingface tokenizer.
Make the third-party dependencies "extras" in the Python packaging.
Add better testing and batch encoding operations.
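
The pad_at_end behaviour described above can be pictured with a small standalone sketch. The helper name `pad_ids` and the padding id `0` are assumptions made for illustration; this is not gcgc's API.

```python
from typing import List


def pad_ids(ids: List[int], max_length: int, pad_id: int = 0, pad_at_end: bool = True) -> List[int]:
    """Pad a list of token ids up to max_length (hypothetical helper, not gcgc's API)."""
    padding = [pad_id] * max(0, max_length - len(ids))
    # pad_at_end=True appends the padding; pad_at_end=False prepends it.
    return ids + padding if pad_at_end else padding + ids


print(pad_ids([5, 6, 7], 5, pad_at_end=True))   # [5, 6, 7, 0, 0]
print(pad_ids([5, 6, 7], 5, pad_at_end=False))  # [0, 0, 5, 6, 7]
```

Sequences already at or beyond `max_length` are returned unchanged in this sketch; truncation is left out on purpose.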

25 Jan 19:11

tshauck

v0.12.0

ec1b6e7

0.12.0 (2020-01-25)

Improved the docs to reflect the SequenceTokenizerSpec that was added in
0.11.0.
Made max length optional for the tokenizer.
Added a CLI that parses using the SentencePiece library.
Began versioning the Docker build and made pushing easier during the build process.
Have the tokenizer resolve the named alphabets.
Use poetry, along with general updates to the build pipeline.

01 Dec 20:10

tshauck

v0.11.0

e8e27fd

v0.11.0 (2019-11-15)

Added

Added the SequenceTokenizerSpec object for specifying the tokenizer.
Added Vocab object for storing the int to token, and token to int encodings.
Added example of using tensorflow/keras together with gcgc.
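
As a rough picture of what an object holding both directions of the encoding might look like, here is a minimal sketch. The class name `SimpleVocab` and its methods are hypothetical and do not reproduce the gcgc Vocab interface.

```python
from typing import Dict, Iterable, List


class SimpleVocab:
    """Bidirectional token <-> integer mapping (illustrative only)."""

    def __init__(self, tokens: Iterable[str]) -> None:
        self.token_to_int: Dict[str, int] = {}
        self.int_to_token: Dict[int, str] = {}
        for token in tokens:
            self.add(token)

    def add(self, token: str) -> int:
        """Add a token if it is new and return its integer id."""
        if token not in self.token_to_int:
            index = len(self.token_to_int)
            self.token_to_int[token] = index
            self.int_to_token[index] = token
        return self.token_to_int[token]

    def encode(self, tokens: Iterable[str]) -> List[int]:
        """Map tokens to their integer ids."""
        return [self.token_to_int[t] for t in tokens]

    def decode(self, ids: Iterable[int]) -> List[str]:
        """Map integer ids back to tokens."""
        return [self.int_to_token[i] for i in ids]


vocab = SimpleVocab(["A", "T", "C", "G"])
encoded = vocab.encode("GATTACA")
print(encoded)                          # [3, 0, 1, 1, 0, 2, 0]
print("".join(vocab.decode(encoded)))   # GATTACA
```

Keeping both dictionaries in one object means the two directions cannot drift apart, which is the main advantage over a bare string-to-integer dictionary.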

09 Nov 18:59

tshauck

v0.10.0

f0237ff

v0.10.0 (2019-11-09)

gcgc has been revamped quite a bit to better support existing processing
pipelines for NLP without trying to do too much. See the docs for more
information about how this works.

06 Aug 03:07

tshauck

v0.9.1

4c4d82e

v0.9.1

0.9.0 (2019-08-05)

Added

Parser now outputs the length of the tensor not including padding. This is
useful for packing and length based iteration.
Generating masked output from the parse_record method is now available.
Alphabet can include an optional mask token.
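
To illustrate why the unpadded length is useful, the sketch below pads a batch of encoded sequences and keeps each original length alongside. The function name `pad_batch` and the padding id `0` are assumptions, not part of gcgc.

```python
from typing import List, Tuple


def pad_batch(batch: List[List[int]], pad_id: int = 0) -> Tuple[List[List[int]], List[int]]:
    """Pad every sequence to the longest one and also return the original lengths."""
    lengths = [len(seq) for seq in batch]
    width = max(lengths, default=0)
    padded = [seq + [pad_id] * (width - len(seq)) for seq in batch]
    return padded, lengths


padded, lengths = pad_batch([[4, 5, 6], [7], [8, 9]])
# Indices sorted longest-first, the order that sorted packing routines expect.
order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
print(padded)   # [[4, 5, 6], [7, 0, 0], [8, 9, 0]]
print(lengths)  # [3, 1, 2]
print(order)    # [0, 2, 1]
```

With the lengths available, downstream code can ignore the padded positions, for example when packing sequences for a recurrent model or when bucketing batches by length.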

Changed

Can now specify the kmer step size to use when supplying a kmer value.
Renamed EncodedSeq.integer_encoded to EncodedSeq.get_integer_encoding, which
takes a kmer_step_size specifying the step size to use when encoding.
Add parsed_seq_len to the SequenceParser object to control how much padding to
apply to the end of the integer encoded sequence. This is useful since a batch
of tensors is expected to have the same size.
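
The effect of a kmer step size can be sketched without the library. The function `kmers_with_step` below is a hypothetical stand-in and does not mirror the signature of EncodedSeq.get_integer_encoding.

```python
from typing import List


def kmers_with_step(sequence: str, kmer_size: int, step_size: int = 1) -> List[str]:
    """Return the k-mers of `sequence`, advancing `step_size` positions each time."""
    if kmer_size < 1 or step_size < 1:
        raise ValueError("kmer_size and step_size must be positive")
    last_start = len(sequence) - kmer_size
    return [sequence[i:i + kmer_size] for i in range(0, last_start + 1, step_size)]


overlapping = kmers_with_step("ATCGAT", kmer_size=3, step_size=1)
non_overlapping = kmers_with_step("ATCGAT", kmer_size=3, step_size=3)
print(overlapping)      # ['ATC', 'TCG', 'CGA', 'GAT']
print(non_overlapping)  # ['ATC', 'GAT']
```

A step of 1 yields overlapping kmers, while a step equal to the kmer size yields non-overlapping ones, so the step size directly controls how long the encoded sequence becomes.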

0.8.0 (2019-07-04)

Fixed

Broken test due to platform differences in Path.glob sorting.

Added

Users can optionally specify start or end tokens.

Removed

Removed one_hot_encoding. The user can do that pretty easily if needed. E.g.
see scatter in PyTorch.
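
For readers who relied on one_hot_encoding, the operation is short to reproduce. The sketch below uses plain Python lists so that it has no dependencies; with PyTorch, the same result can be obtained by scattering ones into a zero tensor. The helper `one_hot` is hypothetical.

```python
from typing import List


def one_hot(ids: List[int], vocab_size: int) -> List[List[int]]:
    """Turn integer ids into one-hot rows of width vocab_size."""
    rows = []
    for index in ids:
        if not 0 <= index < vocab_size:
            raise ValueError(f"id {index} is outside the vocabulary of size {vocab_size}")
        row = [0] * vocab_size
        row[index] = 1  # the single "hot" position for this id
        rows.append(row)
    return rows


encoded = one_hot([2, 0, 1], vocab_size=4)
print(encoded)  # [[0, 0, 1, 0], [1, 0, 0, 0], [0, 1, 0, 0]]
```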

0.7.0 (2019-06-22)

Added

Properties to access the integer encodings of special tokens. (35cae2a)
- Alphabet.encoded_start
- Alphabet.encoded_end
- Alphabet.encoded_padding
Remove uniprot dataset creation. (e233162)
Simplify index handling for GenomicDataset. (3213a9e)

0.6.1 (2019-06-10)

Added

Updated package management so gcgc is easier to use with other versions of
torch.

0.6.0 (2019-04-04)

Added

Ability for kmer size to be passed to an alphabet.

0.5.2 (2019-03-21)

Added

Add Dockerfile and docker-compose.yml for development.
EncodedSeq.shift, which will shift sequence by an offset integer.
EncodedSeq.from_integer_encoded_seq will take a list of integers and an
alphabet and return an EncodedSeq object.
Add the ability to apply a function to the rollout_kmers yielded values.
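
Applying a function to each rolled-out kmer might look like the following sketch. The generator `rollout` and its `func` parameter are illustrative names and are not the actual rollout_kmers signature.

```python
from typing import Callable, Iterator, Optional


def rollout(sequence: str, kmer_length: int,
            func: Optional[Callable[[str], object]] = None) -> Iterator[object]:
    """Yield successive k-mers, optionally passing each one through func first."""
    for start in range(len(sequence) - kmer_length + 1):
        kmer = sequence[start:start + kmer_length]
        yield func(kmer) if func is not None else kmer


plain = list(rollout("ATCG", 2))
gc_counts = list(rollout("ATCG", 2, func=lambda k: k.count("G") + k.count("C")))
print(plain)      # ['AT', 'TC', 'CG']
print(gc_counts)  # [0, 1, 2]
```

Passing the function into the generator keeps the transformation lazy, so a long sequence is never materialised as a full list of kmers unless the caller asks for one.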

Changed

Alphabet special characters are now located at the start, rather than the end,
of the letters and token sequence.
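
One consequence of placing the special characters first is that their integer ids stay fixed no matter how many letters follow. A sketch, in which the special characters `>`, `<`, and `|` and the helper `build_encoding` are assumptions rather than gcgc's documented defaults:

```python
from typing import Dict, Sequence


def build_encoding(letters: str, special: Sequence[str] = (">", "<", "|")) -> Dict[str, int]:
    """Assign ids with the special characters first, then the alphabet letters."""
    tokens = list(special) + [c for c in letters if c not in special]
    return {token: index for index, token in enumerate(tokens)}


dna = build_encoding("ATCG")
protein_like = build_encoding("ACDEFGHIKLMNPQRSTVWY")
# The padding character gets the same id in both alphabets.
print(dna["|"], protein_like["|"])  # 2 2
print(dna["A"])                     # 3
```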

0.5.1 (2019-01-09)

Added

Add extra CSS to underline links in articles.
Exit if the download directory doesn't exist in the call to download organism.
Wording improvements in docs.

0.5.0 (2018-12-31)

Added

Include seq_tensor_one_hot in the PyTorch Parser.
Added a GCGCRecord.encoded_seq property.
New gcgc.random module to start holding sequence data.
New gcgc.rollout module to handle working through chunks of sequences.
- rollout_kmers will roll out kmers.
- rollout_seq_features will roll out the SeqFeatures from a SeqRecord.
EncodingAlphabet now can optionally take a gap_characters set of characters to add to the
alphabet letters. It also takes add_lower_case_for_inserts which will duplicate the alphabet,
but convert the letters to lowercase.
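
The two EncodingAlphabet options can be pictured with a small sketch. The function `expand_letters` is hypothetical, and the ordering it uses (lowercase copies first, then gap characters) is an assumption, not gcgc's documented behaviour.

```python
from typing import Iterable, List


def expand_letters(letters: str, gap_characters: Iterable[str] = (),
                   add_lower_case_for_inserts: bool = False) -> List[str]:
    """Return the alphabet letters extended with lowercase copies and gap characters."""
    expanded = list(letters)
    if add_lower_case_for_inserts:
        # Duplicate the alphabet in lowercase.
        expanded += [c.lower() for c in letters]
    # Append gap characters in a stable order, skipping any already present.
    for gap in sorted(set(gap_characters)):
        if gap not in expanded:
            expanded.append(gap)
    return expanded


letters = expand_letters("ATCG", gap_characters={"-", "."}, add_lower_case_for_inserts=True)
print(letters)  # ['A', 'T', 'C', 'G', 'a', 't', 'c', 'g', '-', '.']
```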

Changed

Fixed

Fixed bug in GenomicDataset.from_path where it still referred to init_from_path_generator.

0.4.0

Added

EncodedSeq now supports iterating through kmers, see EncodedSeq.rollout_kmers for options.
GCGC is citable.
GCGC now has a CHANGELOG.md.

06 Aug 03:07

tshauck

v0.8.1

4c4d82e

# Changelog

Development

01 Aug 04:43

tshauck

v0.9.0

d49c060

Release v0.9.0

Added

Parser now outputs the length of the tensor not including padding. This is
useful for packing and length based iteration.
Generating masked output from the parse_record method is now available.
Alphabet can include an optional mask token.

Changed

Can now specify the kmer step size to use when supplying a kmer value.
Renamed EncodedSeq.integer_encoded to EncodedSeq.get_integer_encoding, which
takes a kmer_step_size specifying the step size to use when encoding.
Add parsed_seq_len to the SequenceParser object to control how much padding to
apply to the end of the integer encoded sequence. This is useful since a batch
of tensors is expected to have the same size.

09 Jul 00:13

tshauck

v0.8.0

088504b

v0.8.0

Fixed

Broken test due to platform differences in Path.glob sorting.

Added

Users can optionally specify start or end tokens.

Removed

Removed one_hot_encoding. The user can do that pretty easily if needed. E.g.
see scatter in PyTorch.

Releases: tshauck/gcgc

v0.12.4

v0.12.4 (2020-09-18)

Fix

v0.12.3

v0.12.3 (2020-09-18)

v0.12.2
