- dense embeddings, rather than sparse "one-hot"
- guiding without fixing: no frozen dictionary, context-agnostic
- tokenization independent of the input partitioning / shift
- dense encoding != one-hot vectors on the vocabulary
- composite tokens have a parent / child relation: "splitting" carries the information of "split" and "ing"
- reduce the token dimension: from several hundred thousand (the vocabulary size) down to 256! (see the sketch below)
- better support for non-Latin scripts: Arabic, Chinese, Hindi, etc.
- spatial distribution of the tokens in embedding space
- relation between the embeddings of child and parent tokens
- limit on the embedding size = the point where reconstruction fidelity starts to drop = max compression = x64?
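
A minimal sketch (NumPy only) of the contrast between the two encodings, assuming a 100k-entry subword vocabulary, groups of 4 bytes per composite token and 64-dimensional byte embeddings; the group size and byte dimension are illustrative choices picked so that the parent embedding lands at 256:

```python
import numpy as np

VOCAB_SIZE = 100_000          # typical subword vocabulary (assumed)
GROUP = 4                     # child bytes merged into one composite token (assumed)
BYTE_DIM = 64                 # embedding dimension of a single byte (assumed)
TOKEN_DIM = GROUP * BYTE_DIM  # 256: the dense composite embedding

# sparse route: one scalar id => a 100k-dimensional one-hot vector
token_id = 4242
one_hot = np.zeros(VOCAB_SIZE, dtype=np.float32)
one_hot[token_id] = 1.0

# dense route: each of the 4 child bytes gets a small embedding,
# and the parent token is simply their concatenation
byte_table = np.random.normal(size=(256, BYTE_DIM)).astype(np.float32)
child_bytes = np.frombuffer(b'spli', dtype=np.uint8)
parent_embedding = byte_table[child_bytes].reshape(TOKEN_DIM)

print(one_hot.shape)           # (100000,) sparse, almost entirely zeros
print(parent_embedding.shape)  # (256,) dense, carries the child information
```
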
- shift training data by 1, 2, ..., G - 1 ticks along the time / context axis
- switch between equivalent formats:
- byte shift
- abbreviations: "can't" <=> "cannot"
- change number format (while keeping the same value)
- random perturbations on the inputs:
- letter capitalization
- byte replacement
- byte insertion
- reversing order in groups?
- equivalence 1 <=> 4 <=> 4 x 4:
- pad the data with 0s so that smaller units fill the bigger tokens and match their parts (see the augmentation sketch below)
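
A minimal sketch of these augmentations on raw UTF-8 bytes, assuming a group size G = 4 and arbitrary perturbation rates; the helpers `shift_variants`, `perturb` and `pad_to` are hypothetical names, purely for illustration:

```python
import random

G = 4  # group size: number of bytes merged into one composite token (assumed)

def shift_variants(data: bytes) -> list:
    # the original sample plus shifts by 1, 2, ..., G - 1 ticks along the context axis,
    # so the model cannot rely on a fixed partitioning of the input
    return [data[i:] for i in range(G)]

def perturb(data: bytes, rate: float = 0.05) -> bytes:
    # random byte replacement / insertion + letter capitalization switch
    out = []
    for b in data:
        r = random.random()
        if r < rate:
            out.append(random.randrange(256))          # byte replacement
        elif r < 2.0 * rate:
            out.extend([b, random.randrange(256)])     # byte insertion
        elif r < 3.0 * rate and b < 128 and chr(b).isalpha():
            out.append(ord(chr(b).swapcase()))         # capitalization switch
        else:
            out.append(b)
    return bytes(out)

def pad_to(data: bytes, size: int) -> bytes:
    # pad with 0s so a smaller unit fills a bigger token: 1 <=> 4 <=> 4 x 4
    return data + b'\x00' * (-len(data) % size)

sample = 'splitting the data'.encode('utf-8')
print([len(v) for v in shift_variants(sample)])  # [18, 17, 16, 15]
print(perturb(sample))                           # randomly altered bytes
print(pad_to(b's', 4), pad_to(b'spli', 16))      # padded to 4 and 16 bytes
```
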
- tokenization:
- simplify: divide + position + merge = reshape + dense (the position embedding becomes the bias of the dense layer on the merged vector)
- in the end, move the data from axis 0 to axis -1: (B * G, E) => (B, G * E) (see the sketch below)
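
A minimal NumPy sketch of this simplified block, with illustrative sizes (B = 8, G = 4, E = 64) and random stand-ins for the learned weights:

```python
import numpy as np

B, G, E = 8, 4, 64  # batch, group size, byte-embedding dimension (assumed)
x = np.random.normal(size=(B * G, E)).astype(np.float32)

# merge: move the data from axis 0 to axis -1, (B * G, E) => (B, G * E)
merged = x.reshape(B, G * E)

# dense projection; its bias plays the role of the position embedding on the merged vector
W = 0.02 * np.random.normal(size=(G * E, G * E)).astype(np.float32)
b = np.zeros(G * E, dtype=np.float32)  # learned in practice: acts as the position bias
tokens = merged @ W + b                # (B, G * E) composite token embeddings

print(tokens.shape)  # (8, 256)
```
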
- detokenization:
- simplify: same as the tokenization block, mirrored (see the sketch below)
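
And the mirrored block, under the same assumptions: a dense projection followed by a split back to the child axis:

```python
import numpy as np

B, G, E = 8, 4, 64  # same illustrative sizes as the tokenization sketch
tokens = np.random.normal(size=(B, G * E)).astype(np.float32)

W = 0.02 * np.random.normal(size=(G * E, G * E)).astype(np.float32)
b = np.zeros(G * E, dtype=np.float32)

# dense projection, then split: (B, G * E) => (B * G, E)
decoded = (tokens @ W + b).reshape(B * G, E)
print(decoded.shape)  # (32, 64)
```
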
- head (see the VAE sketch after this list):
- VAE
- VAE + CNN
- VAE + CNN + attention
- VAE + hierarchical CNN
- VAE + hierarchical CNN + attention
- VAE + hierarchical CNN + attention + normalization
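
A minimal NumPy sketch of the simplest candidate, a plain VAE bottleneck on the composite token embeddings (forward pass only; the latent size and the weights are illustrative assumptions, and the CNN / attention / hierarchical / normalization variants would replace the dense encoder and decoder):

```python
import numpy as np

B, D, LATENT = 8, 256, 64  # batch, token dimension, latent dimension (assumed)
x = np.random.normal(size=(B, D)).astype(np.float32)

def dense(inputs, out_dim):
    # random stand-in for a learned dense layer
    w = 0.02 * np.random.normal(size=(inputs.shape[-1], out_dim)).astype(np.float32)
    return inputs @ w

# encoder: predict the parameters of the latent distribution
mean, logvar = dense(x, LATENT), dense(x, LATENT)

# reparameterization trick: sample z while keeping the computation differentiable
z = mean + np.exp(0.5 * logvar) * np.random.normal(size=mean.shape)

# decoder: reconstruct the dense token embedding from the latent code
x_hat = dense(z, D)

# losses: reconstruction + KL divergence toward the unit Gaussian prior
reconstruction = np.mean((x - x_hat) ** 2)
kl = -0.5 * np.mean(1.0 + logvar - mean ** 2 - np.exp(logvar))
print(reconstruction, kl)
```
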