LectBench-95: Preparing a University-Lecture Corpus for A/B Evaluation of Lecture-Processing Techniques
DOI:
https://doi.org/10.21928/uhdjst.v9n2y2025.pp216-230
Keywords:
University Lecture Corpus, A/B Evaluation, Lecture Processing Techniques, Educational Natural Language Processing
Abstract
Tools for processing educational lectures are advancing rapidly, but the field lacks a diverse, balanced, and high-quality corpus of university lecture transcripts. Existing datasets are limited to K-12, are tutorial-style and lack interactivity, focus on narrow disciplines, are paywalled or otherwise inaccessible, or are impractically large. We introduce LectBench-95, a publicly available corpus of 95 video lecture transcripts spanning 3 disciplines and 17 subjects within them. With strict filtering for audio quality (SNR ≥ 25 dB) and transcription confidence (mean 0.84, minimum 0.7), transcript quality controls, and a sample size guided by power analysis, the 94-h dataset is designed to detect performance differences of ≥20% between competing systems with 95% confidence in head-to-head A/B experiments. LectBench-95 contains 816 k words (≈123 k unique), has a mean Measure of Textual Lexical Diversity (MTLD) of 54.5, and has a mean speech rate of 144 words/min, mirroring real-world university lectures. A toy A/B test on zero-shot summarization (Gemini-1.5-Flash vs. Gemini-1.5-Flash-8B) illustrates the corpus's utility, yielding a statistically significant 43% win-rate gap (P = 3 × 10⁻⁵). Released under CC BY-NC-SA 4.0, LectBench-95 provides a modest yet statistically robust dataset for future research and prototyping in educational natural language processing.
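As an illustration of the kind of head-to-head comparison described above, the following is a minimal sketch of a paired win-rate test over a fixed set of lectures. It assumes a two-sided exact binomial (sign) test with ties discarded; the abstract does not state which test the authors used, and the function names and the win counts in the usage note are hypothetical.

```python
from math import comb


def win_rate_gap(wins_a: int, wins_b: int) -> float:
    """Difference in win rate between system A and system B (ties excluded)."""
    n = wins_a + wins_b
    if n == 0:
        raise ValueError("at least one non-tied comparison is required")
    return (wins_a - wins_b) / n


def sign_test_p_value(wins_a: int, wins_b: int) -> float:
    """Two-sided exact binomial (sign) test.

    Null hypothesis: on any lecture, either system is equally likely to win.
    Assumption: ties were dropped before counting wins.
    """
    n = wins_a + wins_b
    if n == 0:
        raise ValueError("at least one non-tied comparison is required")
    k = max(wins_a, wins_b)
    # Probability of k or more wins out of n under a fair coin.
    upper_tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    # Double the one-sided tail for a two-sided test, capped at 1.
    return min(1.0, 2 * upper_tail)


if __name__ == "__main__":
    # Hypothetical counts for 95 lectures, chosen only for illustration;
    # the abstract does not report per-system win counts.
    wins_a, wins_b = 68, 27
    print(f"win-rate gap: {win_rate_gap(wins_a, wins_b):.1%}")
    print(f"two-sided p-value: {sign_test_p_value(wins_a, wins_b):.2e}")
```

With one judged outcome per lecture, the test depends only on the two win counts, so it can be rerun for any pair of systems evaluated on the same 95 transcripts.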
License
Copyright (c) 2025 Ari A. Aziz, Aree A. Mohammed

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
