SUPERB


A comprehensive and reproducible benchmark for Self-supervised Speech Representation Learning
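SUPERB evaluates each self-supervised upstream model frozen: every downstream task trains only a lightweight prediction head on top of a learnable weighted sum of the upstream's layer-wise hidden states. A minimal NumPy sketch of that layer-pooling step (shapes and function names are illustrative, not s3prl's actual API):

```python
import numpy as np

def softmax(w):
    """Numerically stable softmax over a 1-D weight vector."""
    e = np.exp(w - w.max())
    return e / e.sum()

def weighted_layer_sum(hidden_states, weights):
    """Collapse per-layer features [L, T, D] into [T, D] using
    softmax-normalized scalar weights, one per layer."""
    w = softmax(np.asarray(weights, dtype=np.float64))  # [L]
    return np.tensordot(w, hidden_states, axes=1)       # [T, D]

# Toy example: 3 layers, 5 frames, 4 feature dims.
h = np.random.randn(3, 5, 4)
# Equal logits -> uniform weights -> plain mean over layers.
pooled = weighted_layer_sum(h, [0.0, 0.0, 0.0])
assert np.allclose(pooled, h.mean(axis=0))
```

In SUPERB the scalar weights are trained jointly with the downstream head, so each task can emphasize whichever layers carry the most relevant information.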


Params is the upstream parameter count; the five MACs columns give multiply-accumulate counts overall and for short, medium, long, and longer inputs; Score is the leaderboard ranking score and SUPERB score the overall benchmark score (the FBANK baseline scores 0). Task metrics: KS and IC report accuracy (%, ↑), PR phone error rate (%, ↓), ASR word error rate (%, ↓), ER accuracy (%, ↑), QbE maximum term-weighted value (×100, ↑), SF slot F1 (%, ↑) and slot CER (%, ↓), SID accuracy (%, ↑), ASV equal error rate (%, ↓), and SD diarization error rate (%, ↓). A dash marks a field the submission did not report.

| Model | Submitter | Description | Params | MACs | MACs (short) | MACs (medium) | MACs (long) | MACs (longer) | Score | SUPERB score | KS Acc↑ | IC Acc↑ | PR PER↓ | ASR WER↓ | ER Acc↑ | QbE MTWV↑ | SF F1↑ | SF CER↓ | SID Acc↑ | ASV EER↓ | SD DER↓ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| WavLM Large | Microsoft | M-P + VQ + GREP + Utterance Mixing | 3.166e+8 | 4.326e+12 | 3.863e+11 | 6.764e+11 | 1.094e+12 | 2.169e+12 | 38 | 1145 | 97.86 | 99.31 | 3.06 | 3.44 | 70.62 | 8.86 | 92.21 | 18.36 | 95.49 | 3.77 | 3.24 |
| WavLM Base+ | Microsoft | M-P + VQ + GREP + Utterance Mixing | 9.470e+7 | 1.670e+12 | 1.493e+11 | 2.614e+11 | 4.226e+11 | 8.367e+11 | 36.25 | 1106 | 97.37 | 99 | 3.92 | 5.59 | 68.65 | 9.88 | 90.58 | 21.2 | 89.42 | 4.07 | 3.5 |
| IIITD | MIDAS_IIITD | JSC | 9.618e+7 | 9.618e+7 | 9.618e+7 | 9.618e+7 | 9.618e+7 | 9.618e+7 | 32.65 | 1080 | 97.34 | 98.21 | 5.54 | 7.09 | 68.25 | 10.82 | 88.64 | 24.38 | 85.36 | 4.33 | 3.78 |
| WavLM Base | Microsoft | M-P + VQ + GREP + Utterance Mixing | 9.470e+7 | 1.670e+12 | 1.493e+11 | 2.614e+11 | 4.226e+11 | 8.367e+11 | 32.05 | 1019 | 96.79 | 98.63 | 4.84 | 6.21 | 65.94 | 8.7 | 89.38 | 22.86 | 84.51 | 4.69 | 4.55 |
| LightHuBERT Stage 1 | LightHuBERT | Once-for-All HuBERT + Two-Stage Distillation | 9.500e+7 | - | - | - | - | - | 30.8 | 959 | 96.82 | 98.5 | 4.15 | 5.71 | 66.25 | 7.37 | 88.44 | 25.92 | 80.01 | 5.14 | 5.51 |
| data2vec Large | Cl Tang | Masked Generative | 3.143e+8 | 4.306e+12 | 3.841e+11 | 6.735e+11 | 1.089e+12 | 2.159e+12 | 30.2 | 949 | 96.75 | 98.31 | 3.6 | 3.36 | 66.31 | 6.28 | 90.98 | 22.16 | 76.77 | 5.73 | 5.53 |
| data2vec-aqc Base | Speech Lab, IITM | Masked Generative (M-G) + M-C + VQ | 9.384e+7 | 1.657e+12 | 1.480e+11 | 2.594e+11 | 4.192e+11 | 8.300e+11 | 27.95 | 935 | 96.36 | 98.92 | 4.11 | 5.39 | 67.59 | 6.65 | 89.39 | 22.88 | 59.87 | 5.82 | 4.84 |
| HuBERT Base | paper | M-P + VQ | 9.470e+7 | 1.669e+12 | 1.493e+11 | 2.613e+11 | 4.224e+11 | 8.363e+11 | 27.65 | 941 | 96.3 | 98.34 | 5.41 | 6.42 | 64.92 | 7.36 | 88.53 | 25.2 | 81.42 | 5.11 | 5.88 |
| HuBERT Large | paper | M-P + VQ | 3.166e+8 | 4.324e+12 | 3.861e+11 | 6.761e+11 | 1.094e+12 | 2.168e+12 | 27.55 | 919 | 95.29 | 98.76 | 3.53 | 3.62 | 67.62 | 3.53 | 89.81 | 21.76 | 90.33 | 5.98 | 5.75 |
| CoBERT Base | ByteDance AI Lab | Code Representation Learning + Self-Distillation | 9.435e+7 | 1.660e+12 | 1.480e+11 | 2.594e+11 | 4.192e+11 | 8.300e+11 | 26.7 | 894 | 96.36 | 98.87 | 3.08 | 4.74 | 65.32 | 5.07 | 89.04 | 23.35 | 72.66 | 6.13 | 5.74 |
| ccc-wav2vec 2.0 Base | Speech Lab, IITM | M-C + VQ | 9.504e+7 | 1.670e+12 | 1.493e+11 | 2.617e+11 | 4.228e+11 | 8.367e+11 | 26.4 | 940 | 96.72 | 96.47 | 5.95 | 6.3 | 64.17 | 6.73 | 88.08 | 24.34 | 72.84 | 5.61 | 4.27 |
| wav2vec 2.0 Large | paper | M-C + VQ | 3.174e+8 | 4.326e+12 | 3.861e+11 | 6.762e+11 | 1.094e+12 | 2.169e+12 | 26.15 | 914 | 96.66 | 95.28 | 4.75 | 3.75 | 65.64 | 4.89 | 87.11 | 27.31 | 86.14 | 5.65 | 5.62 |
| data2vec Base | Cl Tang | Masked Generative (M-G) | 9.375e+7 | 1.657e+12 | 1.480e+11 | 2.594e+11 | 4.192e+11 | 8.300e+11 | 25.05 | 884 | 96.56 | 97.63 | 4.69 | 4.94 | 66.27 | 5.76 | 88.59 | 25.27 | 70.21 | 5.77 | 6.67 |
| STaRHuBERT-L | Kangwook Jang | Temporal Gram Matrix Distillation | 2.663e+7 | 5.119e+11 | 4.406e+10 | 7.793e+10 | 1.278e+11 | 2.621e+11 | 24.85 | 901 | 96.56 | 97.5 | 7.39 | 8.9 | 63.48 | 7 | 88.01 | 25.36 | 78.66 | 5.45 | 5.83 |
| DPWavLM | Yifan Peng | Task-agnostic compression via joint distillation and structured pruning | 2.359e+7 | 5.892e+11 | 5.356e+10 | 9.334e+10 | 1.499e+11 | 2.924e+11 | 24.5 | 926 | 96.27 | 98.58 | 8.22 | 10.19 | 65.24 | 8.74 | 87.68 | 26.11 | 82.11 | 5.98 | 5.53 |
| LightHuBERT Small | LightHuBERT | Once-for-All HuBERT + Two-Stage Distillation | 2.700e+7 | 8.607e+11 | 7.721e+10 | 1.351e+11 | 2.180e+11 | 4.304e+11 | 23.8 | 901 | 96.07 | 98.23 | 6.6 | 8.34 | 64.12 | 7.64 | 87.58 | 26.9 | 69.7 | 5.42 | 5.85 |
| ARMwavLM-S | Kangwook Jang | Attention map reusing + Masking distillation | 2.239e+7 | 4.499e+11 | 3.924e+10 | 6.915e+10 | 1.129e+11 | 2.287e+11 | 22.65 | 861 | 96.98 | 97.76 | 7.43 | 9.95 | 64.08 | 7.41 | 87.46 | 26.09 | 71.18 | 5.9 | 6.78 |
| STaRHuBERT | Kangwook Jang | Temporal Gram Matrix Distillation | 2.231e+7 | 4.635e+11 | 3.953e+10 | 7.009e+10 | 1.154e+11 | 2.385e+11 | 21.95 | 880 | 96.27 | 97.55 | 7.97 | 9.35 | 63.01 | 6.88 | 87.94 | 25.31 | 77.58 | 5.71 | 6.05 |
| FaST-VGS+ | Puyuan Peng, David Harwath | FaST-VGS loss + w2v2 loss | 2.172e+8 | - | - | - | - | - | 21.5 | 809 | 97.27 | 98.97 | 7.76 | 8.83 | 62.71 | 5.62 | 88.15 | 27.12 | 41.34 | 5.87 | 6.05 |
| DPHuBERT | Yifan Peng | Task-agnostic compression via joint distillation and structured pruning | 2.359e+7 | 6.541e+11 | 5.960e+10 | 1.038e+11 | 1.666e+11 | 3.241e+11 | 20.9 | 866 | 96.36 | 97.92 | 9.67 | 10.47 | 63.16 | 6.93 | 86.86 | 28.26 | 76.83 | 5.84 | 5.92 |
| ARMHuBERT | Kangwook Jang | Attention map reusing + Masking distillation | 2.645e+7 | 5.016e+11 | 4.375e+10 | 7.710e+10 | 1.258e+11 | 2.549e+11 | 20.55 | 832 | 97.05 | 97.23 | 7.73 | 10.08 | 62.77 | 6.35 | 87.21 | 26.88 | 65.19 | 5.65 | 6.78 |
| wav2vec 2.0 Base | paper | M-C + VQ | 9.504e+7 | 1.669e+12 | 1.493e+11 | 2.613e+11 | 4.224e+11 | 8.363e+11 | 19.5 | 818 | 96.23 | 92.35 | 5.74 | 6.43 | 63.43 | 2.33 | 88.3 | 24.77 | 75.18 | 6.02 | 6.08 |
| STaRHuBERT-S | Kangwook Jang | Temporal Gram Matrix Distillation | 1.411e+7 | 3.563e+11 | 3.036e+10 | 5.384e+10 | 8.865e+10 | 1.835e+11 | 17.75 | 847 | 95.98 | 96.18 | 10.08 | 10.29 | 62.03 | 6.67 | 87.03 | 27.79 | 70.09 | 5.82 | 5.88 |
| STaRHuBERT-XS | Kangwook Jang | Temporal Gram Matrix Distillation | 9.393e+6 | 2.959e+11 | 2.513e+10 | 4.461e+10 | 7.354e+10 | 1.526e+11 | 15.3 | 804 | 95.33 | 94.12 | 11.83 | 11.37 | 61.24 | 6.78 | 85.9 | 29.42 | 64.77 | 5.95 | 6.49 |
| DistilHuBERT | Heng-Jui Chang | Multi-task layer-wise distillation | 2.349e+7 | 7.859e+11 | 7.251e+10 | 1.259e+11 | 2.010e+11 | 3.865e+11 | 15 | 717 | 95.98 | 94.99 | 16.27 | 13.37 | 63.02 | 5.11 | 82.57 | 35.59 | 73.54 | 8.55 | 6.19 |
| DeCoAR 2.0 | paper | M-G + VQ | 8.984e+7 | 1.114e+12 | 9.719e+10 | 1.713e+11 | 2.796e+11 | 5.661e+11 | 13.55 | 722 | 94.48 | 90.8 | 14.93 | 13.02 | 62.47 | 4.06 | 83.28 | 34.73 | 74.42 | 7.16 | 6.59 |
| wav2vec | paper | F-C | 3.254e+7 | 1.086e+12 | 1.016e+11 | 1.760e+11 | 2.795e+11 | 5.291e+11 | 10.5 | 529 | 95.59 | 84.92 | 31.58 | 15.86 | 59.79 | 4.85 | 76.37 | 43.71 | 56.56 | 7.99 | 9.9 |
| admin_baseline | Leo Yang | Used to make sure the server is working correctly | 0.000e+0 | 0.000e+0 | 0.000e+0 | 0.000e+0 | 0.000e+0 | 0.000e+0 | 9.1 | 370 | 95.94 | 74.69 | 41.98 | 24.28 | 66.67 | 1.77 | 70.46 | 51.57 | 60.42 | 10.03 | 10.53 |
| vq-wav2vec | paper | F-C + VQ | 3.415e+7 | 1.118e+12 | 1.046e+11 | 1.813e+11 | 2.878e+11 | 5.449e+11 | 8.4 | 422 | 93.38 | 85.68 | 33.48 | 17.71 | 58.24 | 4.1 | 77.68 | 41.54 | 38.8 | 10.38 | 9.93 |
| WavLM Base+ | Lawrance | WavLM Base+ | 9.470e+7 | 1.600e+2 | 1.600e+2 | 1.600e+2 | 1.600e+2 | 1.600e+2 | 7.4 | - | 96.92 | - | 4.64 | - | - | - | - | - | - | - | - |
| VQ-APC | paper | F-G + VQ | 4.630e+6 | 5.135e+11 | 1.814e+10 | 3.327e+10 | 5.256e+10 | 9.447e+10 | 7.05 | 377 | 91.11 | 74.48 | 41.08 | 21.2 | 59.66 | 2.51 | 68.53 | 52.91 | 60.15 | 8.72 | 10.45 |
| APC | paper | F-G | 4.105e+6 | 5.017e+11 | 1.704e+10 | 3.137e+10 | 4.953e+10 | 8.874e+10 | 6.95 | 392 | 91.01 | 74.69 | 41.98 | 21.28 | 59.33 | 3.1 | 70.46 | 50.89 | 60.42 | 8.56 | 10.53 |
| NPC | paper | M-G + VQ | 1.938e+7 | 4.349e+11 | 4.063e+10 | 7.043e+10 | 1.119e+11 | 2.119e+11 | 6.6 | 386 | 88.96 | 69.44 | 43.81 | 20.2 | 59.08 | 2.46 | 72.79 | 48.44 | 55.92 | 9.4 | 9.34 |
| modified CPC | paper | F-C | 1.843e+6 | 2.026e+11 | 1.510e+10 | 2.635e+10 | 4.175e+10 | 7.832e+10 | 6.4 | 278 | 91.88 | 64.09 | 42.54 | 20.18 | 60.96 | 3.26 | 71.19 | 49.91 | 39.63 | 12.86 | 10.38 |
| Hubert Base | Lawrance | PR and KS | 9.440e+7 | 1.600e+2 | 1.600e+2 | 1.600e+2 | 1.600e+2 | 1.600e+2 | 5.8 | - | 96.53 | - | 5.99 | - | - | - | - | - | - | - | - |
| Wav2Vec 2.0 | Lawrance | Wav2Vec2-Base-960h | 9.500e+7 | 1.600e+2 | 1.600e+2 | 1.600e+2 | 1.600e+2 | 1.600e+2 | 5.2 | - | 96.33 | - | 6.49 | - | - | - | - | - | - | - | - |
| TERA | paper | time/freq M-G | 2.133e+7 | 5.677e+11 | 4.789e+10 | 8.579e+10 | 1.432e+11 | 2.908e+11 | 4.2 | 150 | 89.48 | 58.42 | 49.17 | 18.17 | 56.27 | 0.13 | 67.5 | 54.17 | 57.57 | 15.89 | 9.96 |
| layer10 | gaeulisautumn | 1 | 1.000e+0 | 1.000e+0 | 1.000e+0 | 1.000e+0 | 1.000e+0 | 1.000e+0 | 3.2 | - | - | - | - | - | - | 7.29 | - | - | - | - | - |
| HuBERT | gaeulisautumn | dtw to 1.3.0 | 2.000e+0 | 2.000e+0 | 2.000e+0 | 2.000e+0 | 2.000e+0 | 2.000e+0 | 3.1 | - | - | - | - | - | - | 7.19 | - | - | - | - | - |
| PASE+ | paper | multi-task | 7.833e+6 | 4.954e+11 | 4.648e+10 | 8.036e+10 | 1.275e+11 | 2.411e+11 | 3.05 | 149 | 82.54 | 29.82 | 58.87 | 25.11 | 57.86 | 0.72 | 62.14 | 60.17 | 37.99 | 11.61 | 8.68 |
| b0990106x | 陳äș­ç‘‹ | wav2vec2-ctc PR | 1.600e+2 | 1.600e+2 | 1.600e+2 | 1.600e+2 | 1.600e+2 | 1.600e+2 | 2.8 | - | - | - | 6.28 | - | - | - | - | - | - | - | - |
| FBANK | paper | classic feature | 0.000e+0 | 4.791e+8 | 4.477e+7 | 7.760e+7 | 1.233e+8 | 2.334e+8 | 2.15 | 0 | 41.38 | 9.65 | 82.01 | 23.18 | 48.24 | 0.58 | 69.64 | 52.94 | 20.06 | 9.56 | 10.05 |
| distilHubert_base KS | 陳äș­ç‘‹ | distilHubert_base KS | 0.000e+0 | 1.600e+2 | 1.600e+2 | 1.600e+2 | 1.600e+2 | 1.600e+2 | 2 | - | 96.2 | - | - | - | - | - | - | - | - | - | - |
| wav2vec2 SF | 陳äș­ç‘‹ | wav2vec2 SF | 0.000e+0 | 1.600e+2 | 1.600e+2 | 1.600e+2 | 1.600e+2 | 1.600e+2 | 1.95 | - | - | - | - | - | - | - | 86.89 | 26.44 | - | - | - |
| Mockingjay | paper | time M-G | 8.512e+7 | 2.076e+12 | 1.909e+11 | 3.368e+11 | 5.317e+11 | 1.017e+12 | 1.75 | 54 | 83.67 | 34.33 | 70.19 | 22.82 | 50.28 | 0.07 | 61.59 | 58.89 | 32.29 | 11.66 | 10.54 |
| wav2vec2_large | gaeulisautumn | dtw to 1.3.0 | 1.000e+0 | 1.000e+0 | 1.000e+0 | 1.000e+0 | 1.000e+0 | 1.000e+0 | 1.7 | - | - | - | - | - | - | 5.03 | - | - | - | - | - |
| distilhubert_base | 陳äș­ç‘‹ | distilhubert_base ctc PR | 1.600e+2 | 1.600e+2 | 1.600e+2 | 1.600e+2 | 1.600e+2 | 1.600e+2 | 1.6 | - | - | - | 14.11 | - | - | - | - | - | - | - | - |
| wav2vec2 KS | 陳äș­ç‘‹ | wav2vec2 KS | 0.000e+0 | 1.600e+2 | 1.600e+2 | 1.600e+2 | 1.600e+2 | 1.600e+2 | 1.5 | - | 95.85 | - | - | - | - | - | - | - | - | - | - |
| distilHubert_base SF | 陳äș­ç‘‹ | distilHubert_base SF | 0.000e+0 | 1.600e+2 | 1.600e+2 | 1.600e+2 | 1.600e+2 | 1.600e+2 | 1.35 | - | - | - | - | - | - | - | 82.94 | 34.36 | - | - | - |
| HuBERT_Large | gaeulisautumn | dtw to 1.3.0 | 1.000e+0 | 1.000e+0 | 1.000e+0 | 1.000e+0 | 1.000e+0 | 1.000e+0 | 1.1 | - | - | - | - | - | - | 3.4 | - | - | - | - | - |
| wav2vec2 | gaeulisautumn | dtw to 1.3.0 | 1.000e+0 | 1.000e+0 | 1.000e+0 | 1.000e+0 | 1.000e+0 | 1.000e+0 | 0.5 | - | - | - | - | - | - | 2.13 | - | - | - | - | - |
| SF | èŹæ‰żäżź | sf | 1.600e+2 | 1.600e+2 | 1.600e+2 | 1.600e+2 | 1.600e+2 | 1.600e+2 | 0.4 | - | 8.15 | - | 81.96 | - | - | - | 68.15 | 53.53 | - | - | - |
| Alan | èŹæ‰żäżź | PR | 1.600e+2 | 1.600e+2 | 1.600e+2 | 1.600e+2 | 1.600e+2 | 1.600e+2 | 0.1 | - | - | - | 81.96 | - | - | - | - | - | - | - | - |
| KS | èŹæ‰żäżź | ks | 1.600e+2 | 1.600e+2 | 1.600e+2 | 1.600e+2 | 1.600e+2 | 1.600e+2 | 0.1 | - | 8.15 | - | 81.96 | - | - | - | - | - | - | - | - |
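The overall leaderboard scores above aggregate performance across tasks whose metrics point in different directions (accuracy up, error rates down). As a rough illustration of how such an aggregate can be built, here is a simple average-rank scheme; this is NOT the official SUPERB scoring formula, and the demo values are only loosely drawn from the table:

```python
def rank_scores(results, higher_is_better):
    """Average-rank aggregation (illustrative only, not SUPERB's formula).

    results: {model: {task: value}}
    higher_is_better: {task: bool} giving each metric's direction.
    Returns {model: mean rank}, where rank 1 is best on each task.
    """
    models = list(results)
    ranks = {m: [] for m in models}
    for task, up in higher_is_better.items():
        # Sort best-first, respecting whether the metric is ↑ or ↓.
        order = sorted(models, key=lambda m: results[m][task], reverse=up)
        for r, m in enumerate(order, start=1):
            ranks[m].append(r)
    return {m: sum(rs) / len(rs) for m, rs in ranks.items()}

demo = {
    "A": {"ks_acc": 97.9, "asr_wer": 3.4},   # accuracy ↑, WER ↓
    "B": {"ks_acc": 96.3, "asr_wer": 6.4},
}
print(rank_scores(demo, {"ks_acc": True, "asr_wer": False}))
# A ranks first on both tasks, so its mean rank is 1.0; B's is 2.0.
```

Inverting the sort for lower-is-better metrics is what lets accuracies and error rates be combined on one scale; the actual leaderboard score additionally normalizes each task against reference systems.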