Please cite the following if you make use of the dataset.
@INPROCEEDINGS{10448079,
author={Wang, Haoxu and Yu, Fan and Shi, Xian and Wang, Yuezhang and Zhang, Shiliang and Li, Ming},
booktitle={ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
title={SlideSpeech: A Large Scale Slide-Enriched Audio-Visual Corpus},
year={2024},
pages={11076-11080},
keywords={Visualization;Text recognition;Pipelines;Streaming media;Benchmark testing;Web conferencing;Signal processing;audio visual speech recognition;corpus;slides},
doi={10.1109/ICASSP48485.2024.10448079}
}
The provided SlideSpeech metadata is available to download under a Creative Commons Attribution-ShareAlike 4.0 International License (CC BY-SA 4.0).
The official scripts for downloading the SlideSpeech Corpus: GitHub Scripts.
Here is the wavid2channel of the SlideSpeech Train, Dev, and Test sets.
Here are the segments, transcription text, and OCR results (~1.37 GB) of the SlideSpeech Train, Dev, and Test sets.
Performance of the baseline model and the contextual ASR benchmark on the SlideSpeech Corpus.
w/o K stands for without Keyword, and w K stands for with Keyword. OCR, R1, and Keyword indicate that performance is calculated against the OCR, R1, or Keyword bias list, respectively. U/B/R stands for U-WER/B-WER/Recall. Recall is reported as a percentage (%).
If you would like your results to be listed on the leaderboard, please send a link to the paper, the WER, U-WER, B-WER, Recall, number of parameters, and code implementation (if available) to here.
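To make the U/B/R columns concrete, here is a minimal Python sketch of how such scores can be computed for a single utterance. It is not the official scoring script. It assumes the convention common in contextual-biasing work: words in the bias list count toward B-WER, all other words count toward U-WER, an insertion is charged to the class of the inserted word, and Recall is the share of reference bias words recognized correctly. The names `align` and `biased_scores` are hypothetical, and the bias list is assumed to hold single words rather than phrases.

```python
from typing import List, Set, Tuple


def align(ref: List[str], hyp: List[str]) -> List[Tuple[str, str]]:
    """Levenshtein alignment; returns (ref_word, hyp_word) pairs, '' marks a gap."""
    n, m = len(ref), len(hyp)
    # dist[i][j] = edit distance between ref[:i] and hyp[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i
    for j in range(1, m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = dist[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dist[i][j] = min(sub, dist[i - 1][j] + 1, dist[i][j - 1] + 1)
    # Backtrace from the bottom-right corner to recover one optimal alignment.
    pairs = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            pairs.append((ref[i - 1], hyp[j - 1]))  # match or substitution
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            pairs.append((ref[i - 1], ""))  # deletion
            i -= 1
        else:
            pairs.append(("", hyp[j - 1]))  # insertion
            j -= 1
    pairs.reverse()
    return pairs


def biased_scores(ref: List[str], hyp: List[str], bias: Set[str]) -> Tuple[float, float, float]:
    """Return (U-WER, B-WER, Recall) in percent under the assumed convention."""
    u_total = sum(w not in bias for w in ref)
    b_total = len(ref) - u_total
    u_err = b_err = b_hit = 0
    for r, h in align(ref, hyp):
        if r == h:
            b_hit += r in bias  # correctly recognized bias word
            continue
        # Deletions and substitutions are charged to the reference word's class,
        # insertions to the inserted word's class (an assumption of this sketch).
        word = r if r else h
        if word in bias:
            b_err += 1
        else:
            u_err += 1

    def pct(count: int, total: int) -> float:
        return 100.0 * count / total if total else 0.0

    return pct(u_err, u_total), pct(b_err, b_total), pct(b_hit, b_total)


ref = "the slide shows a transformer encoder".split()
hyp = "the slide shows a transformer in coder".split()
print(biased_scores(ref, hyp, {"transformer", "encoder"}))
```

In this toy utterance the inserted word is outside the bias list, so it counts against U-WER, while the misrecognized bias word counts against B-WER and lowers Recall. Corpus-level figures would accumulate the error and word counts over all utterances before dividing.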
Model | Set | WER | OCR(U/B/R) | R1(U/B/R) | Keyword(U/B/R) |
---|---|---|---|---|---|
Baseline CTC/AED | S95 | 21.05 | 21.54/19.04/82.64 | 18.75/100/0.00 | 20.29/31.27/68.76 |
Contextualized CTC/AED w/o K | S95 | 21.06 | 21.56/19.01/82.63 | 18.98/92.63/7.83 | 20.29/31.37/68.73 |
Contextualized CTC/AED w K | S95 | 20.80 | 21.48/18.00/83.64 | 18.83/88.56/11.81 | 20.22/28.61/71.48 |
Model | Set | WER | OCR(U/B/R) | R1(U/B/R) | Keyword(U/B/R) |
---|---|---|---|---|---|
Baseline CTC/AED | S95 | 21.22 | 21.97/18.14/83.69 | 19.17/100/0.00 | 20.83/26.60/73.51 |
Contextualized CTC/AED w/o K | S95 | 21.25 | 21.97/18.27/83.51 | 19.39/92.52/8.00 | 20.83/26.96/73.17 |
Contextualized CTC/AED w K | S95 | 20.95 | 21.85/17.24/84.51 | 19.21/87.76/12.86 | 20.73/24.05/76.10 |
Model | Set | WER | OCR(U/B/R) | R1(U/B/R) | Keyword(U/B/R) |
---|---|---|---|---|---|
Baseline CTC/AED | L95 | 13.09 | 13.70/10.58/90.47 | 11.75/100/0.00 | 12.87/16.13/83.90 |
Contextualized CTC/AED w/o K | L95 | 12.91 | 13.50/10.46/90.66 | 11.65/93.87/6.27 | 12.70/15.67/84.36 |
Contextualized CTC/AED w K | L95 | 12.64 | 13.46/9.25/91.85 | 11.57/81.89/18.25 | 12.66/12.39/87.64 |
Model | Set | WER | OCR(U/B/R) | R1(U/B/R) | Keyword(U/B/R) |
---|---|---|---|---|---|
Baseline CTC/AED | L95 | 12.89 | 13.70/9.59/91.45 | 11.78/100/0.00 | 12.90/12.70/87.43 |
Contextualized CTC/AED w/o K | L95 | 12.64 | 13.46/9.28/91.80 | 11.63/91.93/8.55 | 12.64/12.63/87.54 |
Contextualized CTC/AED w K | L95 | 12.38 | 13.42/8.13/92.91 | 11.53/78.87/21.81 | 12.60/9.32/90.86 |
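Each U/B/R cell in the tables above packs three numbers into one string, which is awkward when comparing systems programmatically. The following is a small, hypothetical Python helper (the names `UBR`, `parse_ubr`, and `relative_reduction` are illustrative and not part of any official tooling). It splits a cell into its three values and computes a relative error reduction against a baseline.

```python
from typing import NamedTuple


class UBR(NamedTuple):
    u_wer: float
    b_wer: float
    recall: float


def parse_ubr(cell: str) -> UBR:
    """Split a 'U/B/R' table cell such as '21.54/19.04/82.64' into floats."""
    parts = [p.strip() for p in cell.split("/")]
    if len(parts) != 3:
        raise ValueError(f"expected three '/'-separated values, got {cell!r}")
    return UBR(*(float(p) for p in parts))


def relative_reduction(baseline: float, system: float) -> float:
    """Relative reduction in percent; positive means the system has the lower error."""
    return 100.0 * (baseline - system) / baseline


# OCR-column cells copied from the first table (baseline vs. contextualized model with keywords).
baseline = parse_ubr("21.54/19.04/82.64")
with_keywords = parse_ubr("21.48/18.00/83.64")
print(relative_reduction(baseline.b_wer, with_keywords.b_wer))
```

A relative figure is often easier to compare across the S95 and L95 settings than the absolute difference, since the baselines sit at different error levels.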