
Squeezeformer implementation #1431

Closed · wants to merge 9 commits

Conversation

swigls
Contributor

@swigls commented Sep 5, 2022

This PR contains my personal implementation of Squeezeformer for the WeNet encoder structure.
(Original paper: https://arxiv.org/abs/2206.00888)
(Original code: https://github.com/kssteven418/Squeezeformer)

  • Added features

    • SqueezeformerEncoder / SqueezeformerEncoderLayer
      • 2x Time reduce & recover logic (in forward functions of BaseEncoder)
        • TimeReduction2 layer (in subsampling.py)
      • Transformer-style block (Att.->Feedforward->Conv->Feedforward)
      • Scale & Bias layer (in place of pre layer-norm)
      • Option: not to use GLU (in convolution.py)
      • Depthwise conv2d input layers (in subsampling.py)
  • Experimental validation (on LibriSpeech)

    • Batch training
    • Batch inference (both for full-utterance & chunk-wise)
    • Streaming inference (e.g., JIT)
    • Configuration file and WER results

Squeezeformer can be seen as an extension of the Conformer structure.
It could therefore be implemented by modifying the ConformerEncoder class, but I added a separate SqueezeformerEncoder class instead to avoid confusion.

On the other hand, the 2x time reduce & recover logic is inserted in the middle of the forward functions of the BaseEncoder class, which might cause unintended side effects.
(Likewise, the option not to use GLU is inserted into the existing scripts of convolution.py.)
These parts need special care during review.
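To make the reduce & recover idea concrete, here is a minimal, self-contained sketch of what a 2x time reduction followed by a 2x recovery with a residual connection could look like. All names (TimeReduction2, recover) and layer details are illustrative assumptions, not the exact WeNet code:

```python
import torch
import torch.nn as nn


class TimeReduction2(nn.Module):
    """Hypothetical 2x time-reduction layer: a strided depthwise conv
    halves the number of frames. Illustrative, not the WeNet API."""

    def __init__(self, dim: int):
        super().__init__()
        self.dw_conv = nn.Conv1d(dim, dim, kernel_size=3, stride=2,
                                 padding=1, groups=dim)

    def forward(self, xs: torch.Tensor) -> torch.Tensor:
        # xs: (B, T, D) -> (B, ceil(T/2), D)
        return self.dw_conv(xs.transpose(1, 2)).transpose(1, 2)


def recover(xs: torch.Tensor, residual: torch.Tensor) -> torch.Tensor:
    """2x time recovery: repeat each frame twice, trim to the saved
    length, and add back the pre-reduction residual."""
    xs = torch.repeat_interleave(xs, 2, dim=1)[:, :residual.size(1)]
    return xs + residual


xs = torch.randn(2, 10, 8)
residual = xs                 # saved before the reduction
xs = TimeReduction2(8)(xs)    # (2, 5, 8)
xs = recover(xs, residual)    # back to (2, 10, 8)
```

The real implementation additionally has to shorten and re-extend the chunk masks so that attention over the reduced sequence stays consistent.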

@robin1001
Collaborator

Great job, I will go through the details as soon as I can.

if time_recover_idx is not None
else None
)
if len(time_reduce_idx) > 0:
Contributor

This will raise an error if time_reduce_idx is None.

@@ -79,14 +81,15 @@ def __init__(self,
self.norm = nn.LayerNorm(channels)
Contributor

The argument of BatchNorm1d or LayerNorm should probably be channels if use_glu else 2 * channels.
In addition, the groups argument of the depthwise conv should probably be channels if use_glu else 2 * channels too.
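The reviewer's point is a channel-bookkeeping issue: the first pointwise conv outputs 2 * channels, and nn.GLU halves that back to channels; without GLU, the depthwise conv and the norm must be sized for 2 * channels. A minimal sketch (module names and shapes are illustrative, not the exact convolution.py code):

```python
import torch
import torch.nn as nn

channels, use_glu = 8, False
inner = channels if use_glu else 2 * channels

pointwise_conv1 = nn.Conv1d(channels, 2 * channels, kernel_size=1)
activation = nn.GLU(dim=1) if use_glu else nn.Identity()
# groups must match the actual channel count after the activation
depthwise_conv = nn.Conv1d(inner, inner, kernel_size=3, padding=1,
                           groups=inner)
norm = nn.LayerNorm(inner)

x = torch.randn(2, channels, 10)     # (B, C, T)
y = depthwise_conv(activation(pointwise_conv1(x)))
y = norm(y.transpose(1, 2))          # LayerNorm over the channel dim
print(y.shape)                       # torch.Size([2, 10, 16])
```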

@swigls
Contributor Author

swigls commented Sep 9, 2022

Thanks for the reviews, TeaPoly. I've checked them and fixed the corresponding codes.

Additionally, I found a mistake: use_glu was set to True in the SqueezeformerEncoder in the last commit.
I have fixed this part as well.

for i, recover_layer in enumerate(self.time_recover_layers):
    if time_reduce_level == i:
        xs = recover_layer(xs)
        xs += residual_xs  # (B,T,D)
xs, chunk_masks, _, _ = layer(xs, chunk_masks, pos_emb, mask_pad)
if self.normalize_before:
    xs = self.after_norm(xs)
Contributor

This after_norm only makes sense together with a PreLN at the beginning of the blocks. For Squeezeformer it would be redundant, because there is already a PostLN at the end of every Squeezeformer block.

Contributor Author

As Squeezeformer always employs PostLN rather than PreLN, there is no option to choose PreLN.
Therefore, to avoid this duplication problem, I inserted an assertion to make sure normalize_before is False in SqueezeformerEncoder.
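The guard described above could look like the following minimal sketch (the constructor is reduced to the one relevant argument; the real class takes many more):

```python
class SqueezeformerEncoder:
    """Sketch: Squeezeformer blocks end with PostLN, so a trailing
    after_norm driven by normalize_before would apply LayerNorm twice."""

    def __init__(self, normalize_before: bool = False):
        assert normalize_before is False, (
            "Squeezeformer always uses PostLN; "
            "normalize_before must be False")
        self.normalize_before = normalize_before


enc = SqueezeformerEncoder()  # ok: defaults to PostLN
# SqueezeformerEncoder(normalize_before=True) raises AssertionError
```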

@uloveqian2021
Contributor

Thanks for the reviews, TeaPoly. I've checked them and fixed the corresponding codes.

Additionally, I found a mistake that use_glu was set to be True in the SqueezeformerEncoder of the last commit. I also modified this part.

How to set the values of time_reduce_idx and time_recover_idx?

@swigls
Contributor Author

swigls commented Sep 13, 2022

How to set the values of time_reduce_idx and time_recover_idx?
time_reduce_idx and time_recover_idx should be set to lists of (encoder layer) indices at which time reduction or recovery happens.

For example, let time_reduce_idx=[2,5] and time_recover_idx=[8,11] where num_blocks=12.
It means on top of basic conv2d-subsampling (e.g., conv2d with 4x subsampling), additional 2x time reductions are done at the 3rd and 6th encoder blocks. Likewise, 2x time recoveries with residual connections are done at the 9th and last blocks.
In this case, the 1st, 2nd, and the last blocks are processed with a 40 ms stride while the 3rd, 4th, 5th, 9th, 10th, and 11th blocks are processed with an 80 ms stride. The 6th, 7th, and 8th blocks are processed with a 160 ms stride.

This convention, setting time_reduce_idx and time_recover_idx to be lists of indexes, came from the original code (https://github.com/kssteven418/Squeezeformer).
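The stride pattern in the example above can be computed mechanically. The helper below is hypothetical (not part of the PR); it assumes each reduction halves the frame rate and each recovery restores it, starting from the 40 ms stride produced by 4x conv2d subsampling of 10 ms frames:

```python
def block_strides(num_blocks, time_reduce_idx, time_recover_idx,
                  base_stride_ms=40):
    """Return the frame stride (ms) seen by each encoder block,
    given the lists of 2x reduce / recover block indices."""
    strides, stride = [], base_stride_ms
    for i in range(num_blocks):
        if i in time_reduce_idx:
            stride *= 2   # 2x time reduction at this block
        if i in time_recover_idx:
            stride //= 2  # 2x time recovery at this block
        strides.append(stride)
    return strides


print(block_strides(12, [2, 5], [8, 11]))
# [40, 40, 80, 80, 80, 160, 160, 160, 80, 80, 80, 40]
```

The output reproduces the description above: blocks 1, 2, and 12 run at 40 ms, blocks 3-5 and 9-11 at 80 ms, and blocks 6-8 at 160 ms.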

@robin1001
Collaborator

Great job! Any experimental result of SqueezeFormer on wenet? And I'm curious about the performance compared to Conformer.

@robin1001
Collaborator

robin1001 commented Sep 15, 2022

On the other hand, 2x Time reduce & recover logic code is inserted in the middle of forward functions in BaseEncoder class, which might cause unintended side effects.

I think it is tricky if we insert the time reduction and upsampling in BaseEncoder, and it becomes even trickier once we take streaming, JIT export, and ONNX support into consideration. It would be better to decouple it from BaseEncoder, which also avoids unintended side effects.

@robin1001
Collaborator

@yygle is also working on SqueezeFormer, and he has made great progress, please see #1446. We can work together.

@swigls
Contributor Author

swigls commented Sep 15, 2022

On the other hand, 2x Time reduce & recover logic code is inserted in the middle of forward functions in BaseEncoder class, which might cause unintended side effects.

I think it is tricky if we insert the time reduction and upsampling in BaseEncoder, and it becomes even trickier once we take streaming, JIT export, and ONNX support into consideration. It would be better to decouple it from BaseEncoder, which also avoids unintended side effects.

That's right. As I'm working on other urgent projects now, I'll try to decouple it from BaseEncoder later.

@swigls
Contributor Author

swigls commented Sep 15, 2022

Great job! Any experimental result of SqueezeFormer on wenet? And I'm curious about the performance compared to Conformer.

I'm trying to get WER results on LibriSpeech. Maybe I can share some results in a few days if no more errors are found.

@robin1001
Collaborator

Thanks for the contribution. #1447 is taken since it provides a complete implementation and thorough experimental results.

@heyuandeng

Squeezeformer applies 2x time reduction to get smaller FLOPs. I wonder whether the performance of Squeezeformer will be much worse than Conformer's when the speech is fast?

@yygle
Contributor

yygle commented Nov 10, 2022

Maybe you are right. The authors haven't done an ablation study on this. In our experiments, the 'squeeze' operation shows no degradation on public datasets. Most of the time, there are far more audio frames than actual words or CTC path length. You could decrease the downsampling factor, or set the reduce and recover idx to None, to get better performance on fast audio.

@xingchensong
Member

close this PR and leave it as a reference

7 participants