TODO

Objectives

  • dense embeddings, rather than sparse "one-hot"
  • guiding without fixing: no frozen dictionary, context-agnostic
  • tokenization independent of the input partitioning / shift
  • dense encoding != one-hot vectors on the vocabulary
  • composite tokens have a parent / child relation: "splitting" carries the information of "split" and "ing"
  • reduce token dimension: from several hundred thousand down to 256! (see the sketch after this list)
  • better support for non-Latin scripts: Arabic, Chinese, Hindi, etc.
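
A minimal sketch of this objective, in plain NumPy with illustrative names and sizes (nothing below is the repo's actual API): a sparse one-hot vector over a subword vocabulary, next to a dense embedding over raw bytes.

```python
import numpy as np

# Illustrative sizes, not the repo's constants.
WORD_VOCAB = 100_000   # subword vocabularies reach several 100k entries
BYTE_VOCAB = 256       # raw bytes: the whole "vocabulary" fits in 256
EMBED_DIM = 256        # target dense dimension from the objectives above

# Sparse one-hot: one 100k-dim vector per token, almost entirely zeros.
one_hot = np.zeros(WORD_VOCAB, dtype=np.float32)
one_hot[42_017] = 1.0

# Dense byte-level lookup: a (256, 256) table covers any input, in any
# script, with no frozen dictionary to fall out of vocabulary.
table = np.random.normal(size=(BYTE_VOCAB, EMBED_DIM)).astype(np.float32)
dense = table[list("splitting".encode("utf-8"))]  # (9, 256): one vector per byte

print(one_hot.shape, dense.shape)  # (100000,) vs (9, 256)
```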

Dataviz

  • spatial distribution of tokens (see the sketch after this list)
  • embeddings of child <=> parent tokens
  • find the limit embedding size, i.e. the point where reconstruction fidelity starts to drop = max compression = x64?
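
A possible starting point for the spatial view, assuming only NumPy and matplotlib; `table` is a random stand-in for the learned embeddings:

```python
import numpy as np
import matplotlib.pyplot as plt

# Random placeholder for the model's learned embedding table,
# with shape (vocab, embed_dim).
table = np.random.normal(size=(256, 256))

# Center the embeddings, then project on the top-2 principal directions.
centered = table - table.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
coords = centered @ vt[:2].T  # (256, 2): one point per token

plt.scatter(coords[:, 0], coords[:, 1], s=4)
plt.title("Token embeddings, PCA projection")
plt.show()
```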

Curriculum

  • shift training data by 1, 2, ..., G - 1 ticks along the time / context axis (see the sketch after this list)
  • switch between equivalent formats:
    • byte shift
    • contractions: "can't" <=> "cannot"
    • change number format (while keeping the same value)
  • random perturbations on the inputs:
    • letter capitalization
    • byte replacement
    • byte insertion
    • reversing byte order within groups?
  • equivalence 1 <=> 4 <=> 4 x 4:
    • pad data with zeros to fill the bigger tokens, until they match their component parts
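
A rough sketch of the shift and perturbation items above; `G`, the function names and the rates are assumptions, not the repo's code:

```python
import random

G = 4  # assumed group size of the tokenizer

def shifted_views(data: bytes, group: int = G) -> list[bytes]:
    # Shift the stream by 1 .. G - 1 ticks so the model sees every
    # partition offset of the same content.
    return [data[k:] for k in range(1, group)]

def perturb(data: bytes, rate: float = 0.05) -> bytes:
    # Random byte replacement and letter-case flips on the input.
    out = bytearray(data)
    for i, b in enumerate(out):
        if random.random() < rate:
            out[i] = random.randrange(256)  # byte replacement
        elif chr(b).isalpha() and random.random() < rate:
            out[i] = b ^ 0x20               # flip ASCII letter case
    return bytes(out)

sample = "It can't rain forever.".encode("utf-8")
views = shifted_views(sample) + [perturb(sample)]
```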

Blocks

  • tokenization:
    • simplify: divide + position + merge = reshape + dense (position = dense bias on the merged vector)
    • move data from axis 0 to axis -1 in the end: (B * G, E) => (B, G * E) (see the sketch after this list)
  • detokenization
    • simplify: same as the tokenization block
  • head
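
A minimal sketch of the simplified tokenization block, assuming a Keras backend: divide + position + merge collapses into a reshape followed by a dense layer, whose bias takes the role of the positional term. All names, shapes and sizes are placeholders:

```python
import tensorflow as tf

class Tokenize(tf.keras.layers.Layer):
    def __init__(self, group: int, embed_dim: int, **kwargs):
        super().__init__(**kwargs)
        self._group = group
        self._dense = tf.keras.layers.Dense(embed_dim, use_bias=True)

    def call(self, inputs):
        # (B * G, E) => (B, G * E): move data from axis 0 to axis -1.
        merged = tf.reshape(inputs, (-1, self._group * inputs.shape[-1]))
        # The dense bias plays the role of the positional embedding.
        return self._dense(merged)

x = tf.random.normal((8 * 4, 256))        # B = 8 groups of G = 4 units
y = Tokenize(group=4, embed_dim=256)(x)   # => (8, 256)
```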

Models

  • VAE (baseline, sketched after this list)
  • VAE + CNN
  • VAE + CNN + attention
  • VAE + hierarchical CNN
  • VAE + hierarchical CNN + attention
  • VAE + hierarchical CNN + attention + normalization
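
For reference, a minimal sketch of the VAE baseline at the top of this ladder, again assuming Keras; the CNN / attention / hierarchy variants would replace the dense encoder and decoder. Sizes are placeholders:

```python
import tensorflow as tf

class VAE(tf.keras.Model):
    def __init__(self, embed_dim: int = 256, latent_dim: int = 64, **kwargs):
        super().__init__(**kwargs)
        self._mean = tf.keras.layers.Dense(latent_dim)
        self._logvar = tf.keras.layers.Dense(latent_dim)
        self._decode = tf.keras.layers.Dense(embed_dim)

    def call(self, inputs):
        mean, logvar = self._mean(inputs), self._logvar(inputs)
        # Reparameterization trick: z = mean + sigma * epsilon.
        z = mean + tf.exp(0.5 * logvar) * tf.random.normal(tf.shape(mean))
        # KL divergence to a unit Gaussian, added as a regularization loss.
        self.add_loss(-0.5 * tf.reduce_mean(
            1.0 + logvar - tf.square(mean) - tf.exp(logvar)))
        return self._decode(z)

reconstruction = VAE()(tf.random.normal((8, 256)))  # => (8, 256)
```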