Hi,

Thanks very much for your work and for publishing your code. I am currently working on integrating SpinQuant into torch/ao, and I would like to clarify something about the code that would help me with my implementation.
In the paper, the following is mentioned in footnote 3:
In a pre-norm LLM like LLaMA, we can convert a transformer network into a rotation-invariant network by incorporating the RMSNorm scale parameters α into the weight matrix right after the RMSNorm layer.
In the code, this appears to be done in the fuse_layer_norms function.
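For my own understanding, here is a minimal sketch of how I picture that fusion working for one LLaMA-style decoder layer; the function and variable names are mine, not the actual fuse_layer_norms implementation:

```python
import torch

def fuse_rmsnorm_scale(norm, linears):
    """Fold the RMSNorm scale (alpha) into each linear layer that consumes the
    normalized activations, then reset the scale to ones.

    Sketch only: Linear(x_hat * alpha) equals Linear'(x_hat) where
    Linear'.weight[:, j] = Linear.weight[:, j] * alpha[j].
    """
    for linear in linears:
        W = linear.weight.data.double()  # (out_features, in_features)
        # Scale each input column j by alpha[j] (broadcast over rows).
        linear.weight.data = (W * norm.weight.data.double()).to(linear.weight.dtype)
    norm.weight.data = torch.ones_like(norm.weight.data)

# Example usage for one decoder layer (attribute names as in the HF LLaMA implementation):
# fuse_rmsnorm_scale(layer.input_layernorm,
#                    [layer.self_attn.q_proj, layer.self_attn.k_proj, layer.self_attn.v_proj])
# fuse_rmsnorm_scale(layer.post_attention_layernorm,
#                    [layer.mlp.gate_proj, layer.mlp.up_proj])
```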
However, I also noticed that in that same function the embedding weights are modified as well, in the following lines:

SpinQuant/utils/fuse_norm_utils.py, lines 42 to 45 (commit 7f5bf66)
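Since the snippet isn't embedded in this text, here is my rough paraphrase of what I understand those lines to do (not the exact code from the repo):

```python
import torch

def subtract_embedding_mean(embedding: torch.nn.Embedding) -> None:
    # My paraphrase of the cited lines: make each token embedding zero-mean
    # along the hidden dimension.
    W = embedding.weight.data.double()  # (vocab_size, hidden_size)
    embedding.weight.data = (W - W.mean(dim=-1, keepdim=True)).to(embedding.weight.dtype)
```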
Could you help me understand why this is done, i.e., why the mean is subtracted from the input embeddings? I don't see the connection to the RMSNorm layer fusion, so I must be missing something.
Thanks in advance.