
Commit 8880051

note
lucidrains committed Dec 21, 2024
1 parent 8b367e6 commit 8880051
Showing 2 changed files with 2 additions and 2 deletions.
2 changes: 1 addition & 1 deletion README.md
@@ -2240,7 +2240,7 @@ ids_out, num_out, is_number_mask = model.generate(start_ids, start_nums, 17)
 }
 ```

-```
+```bibtex
 @article{Yang2017BreakingTS,
     title   = {Breaking the Softmax Bottleneck: A High-Rank RNN Language Model},
     author  = {Zhilin Yang and Zihang Dai and Ruslan Salakhutdinov and William W. Cohen},
2 changes: 1 addition & 1 deletion x_transformers/x_transformers.py
@@ -1650,7 +1650,7 @@ def __init__(
         unet_skips = False,
         num_residual_streams = 1,
         reinject_input = False, # seen first in the DEQ paper https://arxiv.org/abs/1909.01377, but later used in a number of papers trying to achieve depthwise generalization https://arxiv.org/abs/2410.03020v1
-        add_value_residual = False, # resformer from Zhou et al - https://arxiv.org/abs/2410.17897v1
+        add_value_residual = False, # resformer from Zhou et al - https://arxiv.org/abs/2410.17897v1 - further corroborated by https://arxiv.org/abs/2412.15113 (faster emergence of ICL) - it looks like this setting may become a necessity for every transformer soon
         learned_value_residual_mix = True, # seeing big improvements when the value residual mix is learned per token - credit goes to @faresobeid for taking the first step with a learned scalar mix, then @Blinkdl for taking it a step further with a data-dependent mix. here we use a per-token learned mix
         rel_pos_kwargs: dict = dict(),
         **kwargs
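For context on the option this commit annotates, here is a minimal sketch in PyTorch of the ResFormer-style value residual with a learned per-token, data-dependent mix that the comments describe. It is an illustration, not the library's actual code: the module name `ValueResidualMix`, the `to_mix` projection, and the `(batch, seq, dim)` value shape (before any head split) are all assumptions for the sketch.

```python
import torch
from torch import nn

class ValueResidualMix(nn.Module):
    # hypothetical module: mixes each attention layer's values with the
    # *first* layer's values, with the mix predicted per token (data dependent)
    def __init__(self, dim):
        super().__init__()
        self.to_mix = nn.Linear(dim, 1, bias = False)

    def forward(self, values, first_values, tokens):
        # mix in (0, 1), shape (batch, seq, 1), broadcast over the feature dim
        mix = self.to_mix(tokens).sigmoid()
        # convex combination of current-layer and first-layer values
        return values * mix + first_values * (1. - mix)

# usage sketch
dim = 512
mixer = ValueResidualMix(dim)

tokens       = torch.randn(2, 16, dim)  # token embeddings at the current depth
first_values = torch.randn(2, 16, dim)  # values cached from the first attention layer
values       = torch.randn(2, 16, dim)  # values at the current layer

mixed = mixer(values, first_values, tokens)
assert mixed.shape == (2, 16, dim)
```

The per-token, data-dependent gate follows the progression credited in the comment: a learned scalar mix first, then a mix conditioned on each token.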
