Speech Recognition

1198 papers with code • 236 benchmarks • 89 datasets

Speech Recognition is the task of converting spoken language into text. It involves recognizing the words spoken in an audio recording and transcribing them into a written format. The goal is to accurately transcribe the speech in real-time or from recorded audio, taking into account factors such as accents, speaking speed, and background noise.

( Image credit: SpecAugment )
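In practice, the task reduces to mapping an audio waveform to a string. A minimal sketch with a pretrained model, assuming the Hugging Face `transformers` library is installed and a 16 kHz recording is on disk (the file name below is a hypothetical placeholder):

```python
from transformers import pipeline

# Load a pretrained ASR model; "openai/whisper-large-v2" corresponds to the
# Whisper (Large v2) entry in the benchmark list below and is on the HF Hub.
asr = pipeline("automatic-speech-recognition", model="openai/whisper-large-v2")

# "meeting_recording.wav" is a placeholder path for any speech recording.
result = asr("meeting_recording.wav")
print(result["text"])  # the transcribed text
```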

Benchmarks

Best models across benchmarks:
FAdam
parakeet-rnnt-1.1b
wav2vec 2.0
IBM (LSTM+Conformer encoder-decoder)
SpeechStew 100M
Qwen-Audio
wav2vec 2.0 XLS-R 1B + TEVR (5-gram)
ConformerCTC-L (4-gram)
ConformerCTC-L (5-gram)
Quartznet
Conformer-Transducer (no LM)
W2V2-L-LL60K (+ TED-LIUM 3 LM)
XLSR-53-Viet
IBM (LSTM+Conformer encoder-decoder)
Paraformer-large
LAS + SpecAugment (with LM, Switchboard mild policy)
wav2vec 2.0 Large-10h-LV-60k
wav2vec 2.0 Large-10h-LV-60k
ReVISE (bf)
ConformerXXL-PS + G-Augment
CTC-CRF ST-NAS
Deep Speech 2
Triphone (39 features) + LDA and MLLT + SGMM
parakeet-rnnt-1.1b
ConformerXXL-PS + G-Augment
Whisper (Large v2)
Icefall - zipformer transducer
ConformerXXL-P + Downstream NST
parakeet-rnnt-1.1b
Vietnamese end-to-end speech recognition using wav2vec 2.0 by VietAI
wav2vec2-base-vietnamese-160h (No Language Model)
ConformerXXL-P + Downstream NST
ConformerXXL-P
mllp_2021_offline_verb
mllp_2021_offline_filt
Liquid-S4
AV-HuBERT Large
TS-SEP
Paraformer-large
Whisper-LLaMa-7b
End-to-end LF-MMI
Espresso
CTC-CRF
wav2vec_wav2letter
wav2vec_wav2letter
XLSR53 Wav2Vec2 Portuguese by Orlem Santos
Whisper (Large v2)
wav2vec2-large-xls-r-1b-frisian
Whisper (Large v2)
Conformer/Transformer-AED
Conformer/Transformer-AED
Conformer/Transformer-AED
SpeechStew (100M)
SpeechStew (100M)
ImportantAug
RAVEn Large
TDT 0-2
TDT 0-4
Qwen-Audio
Qwen-Audio
Qwen-Audio
WavLM Large & EEND-vector clustering
Branchformer + GFSA
Branchformer + GFSA

Latest papers

Beyond Levenshtein: Leveraging Multiple Algorithms for Robust Word Error Rate Computations and Granular Error Classifications

shuffle-project/beyond-levenshtein • 28 Aug 2024

The Word Error Rate (WER) is the common measure of accuracy for Automatic Speech Recognition (ASR).
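The definition has a standard closed form: WER = (S + D + I) / N, where S, D, and I are word substitutions, deletions, and insertions against a reference of N words. A self-contained sketch computing it via word-level Levenshtein distance (plain Python, no dependencies):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: (substitutions + deletions + insertions) / reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over words (Levenshtein).
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                      # delete all remaining reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j                      # insert all remaining hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # 1/6 ≈ 0.167
```

The paper's point is that this single number hides which alignment algorithm and text normalization produced it; the sketch above is only the textbook baseline.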

Self-supervised Speech Representations Still Struggle with African American Vernacular English

cmu-llab/s3m-aave • 26 Aug 2024

Underperformance of ASR systems for speakers of African American Vernacular English (AAVE) and other marginalized language varieties is a well-documented phenomenon, and one that reinforces the stigmatization of these varieties.

Approaching Deep Learning through the Spectral Dynamics of Weights

We propose an empirical approach centered on the spectral dynamics of weights -- the behavior of singular values and vectors during optimization -- to unify and clarify several phenomena in deep learning.
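A hedged sketch of what such tracking could look like in practice: log the singular values of each 2-D weight matrix during training. The toy model and names below are illustrative, not the paper's setup:

```python
import torch

# Toy model standing in for any network whose weight spectra we want to track.
model = torch.nn.Sequential(
    torch.nn.Linear(64, 128), torch.nn.ReLU(), torch.nn.Linear(128, 10)
)

def weight_spectra(model: torch.nn.Module) -> dict:
    """Singular values of every 2-D weight matrix (biases skipped)."""
    return {
        name: torch.linalg.svdvals(p.detach())
        for name, p in model.named_parameters()
        if p.ndim == 2
    }

# Called periodically inside a training loop, this traces how each layer's
# spectrum evolves over optimization.
for name, s in weight_spectra(model).items():
    print(name, s[:3])  # a few leading singular values
```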

Generating Data with Text-to-Speech and Large-Language Models for Conversational Speech Recognition

In this work, we propose a synthetic data generation pipeline for multi-speaker conversational ASR, leveraging a large language model (LLM) for content creation and a conversational multi-speaker text-to-speech (TTS) model for speech synthesis.

SER Evals: In-domain and Out-of-domain Benchmarking for Speech Emotion Recognition

Speech emotion recognition (SER) has made significant strides with the advent of powerful self-supervised learning (SSL) models.

MooER: LLM-based Speech Recognition and Translation Models from Moore Threads

We achieve performance comparable to other open source models trained with up to hundreds of thousands of hours of labeled speech data.

wav2graph: A Framework for Supervised Learning Knowledge Graph from Speech

Knowledge graphs (KGs) enhance the performance of large language models (LLMs) and search engines by providing structured, interconnected data that improves reasoning and context-awareness.

HydraFormer: One Encoder For All Subsampling Rates

In automatic speech recognition, subsampling is essential for tackling diverse scenarios.
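For context, conventional encoders commit to a single subsampling rate in their convolutional front-end; a minimal sketch of the common stride-2 design (illustrative shapes, not HydraFormer's multi-rate architecture):

```python
import torch

# Two stride-2 convolutions give a fixed 4x reduction along the time axis.
frontend = torch.nn.Sequential(
    torch.nn.Conv2d(1, 32, kernel_size=3, stride=2, padding=1),
    torch.nn.ReLU(),
    torch.nn.Conv2d(32, 32, kernel_size=3, stride=2, padding=1),
    torch.nn.ReLU(),
)

mels = torch.randn(1, 1, 400, 80)  # (batch, channel, frames, mel bins)
print(frontend(mels).shape)        # torch.Size([1, 32, 100, 20]): 400 -> 100 frames
```

Supporting several rates with one encoder, as HydraFormer proposes, avoids training a separate front-end like this per scenario.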

Preserving spoken content in voice anonymisation with character-level vocoder conditioning

Voice anonymisation can be used to help protect speaker privacy when speech data is shared with untrusted others.

The NPU-ASLP System Description for Visual Speech Recognition in CNVSRC 2024

This paper delineates the visual speech recognition (VSR) system introduced by the NPU-ASLP (Team 237) in the second Chinese Continuous Visual Speech Recognition Challenge (CNVSRC 2024), engaging in all four tracks, including the fixed and open tracks of Single-Speaker VSR Task and Multi-Speaker VSR Task.

An optimized attention based hybrid deep learning framework for automatic speaker identification from speech signals

  • Published: 23 August 2024

  • Venkata Subba Reddy Gade
  • M. Sumathi

Speaker recognition (SR) is the identification of speakers from the characteristics of their voice samples, and it has been researched extensively for many years; advances in technology have made it an increasingly popular research topic. Deep learning (DL)-based SR systems are the most advanced and effective among the extensive SR work documented in the literature, yielding higher accuracy. Nevertheless, to establish practical significance, the effects of noise in the input signal and the neglect of important details during the learning phase must be closely investigated. This work presents a novel automated DL-based hybrid framework for the accurate identification of male speakers. The voice samples drawn from the input dataset pass through pre-processing, feature extraction, feature selection, and recognition stages. First, noise and interference are removed from the audio samples using a two-stage Savitzky-Golay filtering technique (2S-SGF). After denoising, a rich set of features is extracted from the signal to inform the recognition model, and a Chaotic Honey Badger Optimization Algorithm (ChHBOA) selects the most informative of them. The selected features are fed to a DenseNet121 self-attention deep convolutional neural network (D121_SAttnDCNN), whose self-attention layer focuses on the most informative features, to perform recognition. Finally, comprehensive evaluations are carried out by simulating the model in Python. A variety of experiments on the publicly available VoxCeleb-1 gender dataset demonstrates the model's performance: the proposed SR model secured an overall accuracy of 98% and can be applied to voice-based authentication in fields such as forensics, security, personal smart devices, and remote payment.
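As a hedged illustration of the denoising stage only, two successive Savitzky-Golay passes can be sketched with `scipy`; the window lengths and polynomial orders below are illustrative guesses, not the paper's tuned 2S-SGF parameters:

```python
import numpy as np
from scipy.signal import savgol_filter

rng = np.random.default_rng(0)
t = np.linspace(0, 1, 16000, endpoint=False)       # 1 s at 16 kHz
clean = np.sin(2 * np.pi * 220 * t)                # toy stand-in for a voice signal
noisy = clean + 0.3 * rng.standard_normal(t.size)

# Stage 1: coarse smoothing; stage 2: a finer pass over the smoothed signal.
stage1 = savgol_filter(noisy, window_length=31, polyorder=3)
stage2 = savgol_filter(stage1, window_length=11, polyorder=2)

print(f"residual power before: {np.mean((noisy - clean) ** 2):.4f}")
print(f"residual power after:  {np.mean((stage2 - clean) ** 2):.4f}")
```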

Data availability

Data sharing is not applicable to this article.


Funding

No funding was provided for the preparation of the manuscript.

Author information

Authors and Affiliations

Department of Electronics and Communication Engineering, Sathyabama Institute of Science and Technology, Chennai, Tamil Nadu, 600119, India

Venkata Subba Reddy Gade & M. Sumathi

Contributions

All authors read and approved the final manuscript.

Corresponding author

Correspondence to Venkata Subba Reddy Gade .

Ethics declarations

Ethical approval

This article does not contain any studies with human participants or animals performed by any of the authors.

Consent to participate

All the authors involved have agreed to participate in this submitted article.

Consent to publish

All the authors involved in this manuscript give full consent for publication of this submitted article.

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Reprints and permissions

About this article

Gade, V.S.R., Sumathi, M. An optimized attention based hybrid deep learning framework for automatic speaker identification from speech signals. Multimed Tools Appl (2024). https://doi.org/10.1007/s11042-024-19996-x

Received : 09 January 2023

Revised : 24 July 2024

Accepted : 30 July 2024

Published : 23 August 2024

DOI : https://doi.org/10.1007/s11042-024-19996-x

Keywords

  • Speaker recognition
  • Two stage-Savitzky Golay filtering
  • Chaotic honey badger optimization
  • Dense network
  • Deep convolutional neural network
  • DOI: 10.1109/SPCOM60851.2024.10631626
  • Corpus ID: 271936800

Effect of Speech Modification on Wav2Vec2 Models for Children Speech Recognition

  • Abhijit Sinha, Mittul Singh, +2 authors, H. Kathania
  • Published in International Conference on… 1 July 2024
  • Computer Science

Interspeech 2024

Apple is sponsoring the 25th annual Interspeech conference, in Kos, Greece, September 1 to 5. Interspeech focuses on research surrounding the science and technology of spoken language processing. Below is the schedule of Apple-sponsored workshops and events at Interspeech 2024.

Stop by the Apple booth in the Kipriotis Hotels & Conference Center, Floor 1, Booth #4, from 10:30 - 19:00 on Monday, September 2; 09:30 - 18:00 on Tuesday, September 3, and Wednesday, September 4; and 10:30 - 16:00 on Thursday, September 5 (all times GMT+3).

Saturday, August 31

  • Young Female* Researchers in Speech Workshop (YFRSW)
  • 13:15 - 14:15 GMT+3, 2nd Lyceum of Kos
  • Carolina Brum will be representing Apple during the mentoring hour at the workshop.

Wednesday, September 4

Positional Description for Numerical Normalization

  • 10:00 - 12:00 GMT+3, Poster Area 4B

Deepanshu Gupta, Javier Latorre Martinez

Novel-view Acoustic Synthesis from 3D Reconstructed Rooms

  • 13:30 - 15:30 GMT+3, Poster Area 2A

Byeongjoo Ahn, Karren Yang, Brian Hamilton, Jonathan Sheaffer, Anurag Ranjan, Oncel Tuzel, Miguel Sarabia del Castillo, Rick Chang

RepCNN: Micro-sized, Mighty Models for Wakeword Detection

  • 14:30 - 14:50 GMT+3, Hippocrates

Arnav Kundu, Prateeth Nayak, Priyanka Padmanabhan, Devang Naik

Transformer-based Model for ASR N-Best Rescoring and Rewriting

  • 14:50 - 15:10 GMT+3, Aegle B

Edwin Kang, Christophe Van Gysel, Man-Hung Siu

Thursday, September 5

Can You Remove the Downstream Model for Speaker Recognition with Self-Supervised Speech Features?

  • 10:00 - 12:00 GMT+3, Yanis Club

Zak Aldeneh, Takuya Higuchi, Jee-weon Jung, Skyler Seto, Tatiana Likhomanenko, Stephen Shum, Ahmed Hussen Abdelaziz, Shinji Watanabe

Enhancing CTC-based Speech Recognition with Diverse Modeling Units

  • 10:00 - 12:00 GMT+3, Poster Area 3B

Michael Han, Zhihong Lei, Mingbin Xu, Xingyu Na, Zhen Huang

ESPnet-SPK: Full Pipeline Speaker Verification Toolkit with Multiple Reproducible Recipes, Self-Supervised Front-Ends, and Off-the-Shelf Models

  • 11:00 - 11:20 GMT+3, Iasso

Jee-weon Jung, Wangyou Zhang, Jiatong Shi, Zak Aldeneh, Takuya Higuchi, Barry Theobald, Ahmed Hussen Abdelaziz, Shinji Watanabe

Comparative Analysis of Personalized Voice Activity Detection Systems: Assessing Real-World Effectiveness

Satyam Kumar, Sai Srujana Buddi, Oggy Sarawgi, Vineet Garg, Shivesh Ranjan, Oggi Rudovic, Ahmed Hussen Abdelaziz, Saurabh Adya

Multimodal Large Language Models with Fusion Low Rank Adaptation for Device Directed Speech Detection

Shruti Palaskar, Oggi Rudovic, Sameer Dharur, Florian Pesce, Gautam Krishna, Aswin Sivaraman, Jack Berkowitz, Ahmed Hussen Abdelaziz, Saurabh Adya, Ahmed Tewfik

Accepted Papers

Acknowledgements.

Arnav Kundu, Ilya Oparin, Javier Latorre Martinez, Lyan Verwimp, Markus Nussbaum-Thom, Mirko Hannemann, Thiago Fraga da Silva, Tuomo Raitio, and Tatiana Likhomanenko are reviewers for Interspeech.

Related readings and updates

International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2023

Apple sponsored the International Conference on Acoustics, Speech and Signal Processing (ICASSP), which took place in person from June 4 to 10 in Rhodes Island, Greece. ICASSP is the IEEE Signal Processing Society's flagship conference on signal processing and its applications. Below was the schedule of Apple sponsored workshops and events at ICASSP 2023.

Interspeech 2022

Apple sponsored the 23rd Interspeech conference, which was held in Incheon, Republic of Korea, from September 18 to 22. Interspeech is a global conference focused on cognitive intelligence for speech processing and its applications.

Self-Supervised Learning and Data Augmentation Technologies for AI Speech Recognition Paper Accepted by INTERSPEECH 2024

Development of Efficient and Effective Training Methods and Verification of Enhancements to AI Speech Recognition Performance

TOKYO, August 29, 2024 – Ricoh today announced that its paper on Self-Supervised Learning and Data Augmentation Technologies for Artificial Intelligence (AI) Speech Recognition will be presented at INTERSPEECH 2024, the international spoken language processing conference. This is the first time a Ricoh paper has been accepted by INTERSPEECH.

The paper presented at this conference introduces the development of an efficient and effective training method for AI speech recognition models using only speech data without transcripts.

Traditionally, supervised learning methods for AI speech recognition require speech data paired with corresponding transcripts to teach the AI the relationship between that speech and the text. However, this method demands a large volume of transcribed speech data, making it costly to acquire. Moreover, the audio quality recorded in real-world environments can vary depending on factors such as application and location, necessitating enhanced tolerance to acoustic noise for broader usability across different settings. Ricoh's newly developed self-supervised learning method, combined with data augmentation techniques that strengthen resistance to acoustic noise, achieves more accurate speech recognition performance at a lower cost compared to traditional methods.
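The data-augmentation half of that claim rests on a standard, reproducible idea: mix recorded noise into clean speech at randomly drawn signal-to-noise ratios so the model sees realistic acoustic conditions. A generic sketch of the technique (Ricoh's actual method is not public here; the arrays stand in for real recordings):

```python
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Add noise scaled so that 10*log10(P_speech / P_noise_scaled) == snr_db."""
    noise = noise[: len(speech)]
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise

rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)  # placeholder for a clean utterance
noise = rng.standard_normal(16000)   # placeholder for recorded room noise
augmented = mix_at_snr(speech, noise, snr_db=rng.uniform(0, 20))
```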

AI speech recognition is a technology that recognizes and analyzes spoken words, voices, and conversations, converting them into text data for output. Today, this technology is widely used in business, such as displaying subtitles during meetings, creating meeting minutes, and generating reports. AI speech recognition allows for faster transcription and data entry into systems compared to manual typing, making it an extremely effective tool for improving operational efficiency. Ricoh's advanced AI speech recognition technology offers highly accurate recognition of casual conversations, as well as voices recorded at a distance from the microphone, even where there is noise or reverberations, making it ideal for business environments where accuracy is critical.

This capability is also integrated into Ricoh's AI Agent under development, which quickly recognizes and analyzes speech, dynamically generates follow-up questions, and engages in ongoing dialogues to anticipate customer needs and provide precise recommendations. Additionally, by combining AI speech recognition with other AI technologies, Ricoh delivers digital services that help customers effectively utilize speech and conversational data in their workplaces, transforming the data into new value and addressing their management challenges.

INTERSPEECH is the world's largest and most comprehensive conference on the science and technology of spoken language processing, and is organized by the International Speech Communication Association (ISCA). The accepted papers will be presented at the 25th INTERSPEECH Conference, to be held on Kos Island, Greece, from September 1-5, 2024.

As its mid-term vision, Ricoh aims to provide consistent services globally as a workplace services provider in the changing workplace. Ricoh empowers its customers' creativity by improving efficiency in the workplace (any location or space where people work, beyond traditional offices) through AI and data, while fostering collaboration and innovation. Ricoh promotes AI research and development in the spoken language and imaging fields to bolster the digital services it provides to those sectors, and will support the digital transformation of its customers through AI service offerings tailored to each business and industry.

Relevant Contents

  • Self-supervised learning and data augmentation technologies for automatic speech recognition (ASR)

Related Links

  • INTERSPEECH 2024

News release in PDF format

  • Self-Supervised Learning and Data Augmentation Technologies for AI Speech Recognition Paper Accepted by INTERSPEECH 2024 (3 pages / 248 KB)

About Ricoh

Ricoh is a leading provider of integrated digital services and print and imaging solutions designed to support digital transformation of workplaces, workspaces and optimize business performance.

Headquartered in Tokyo, Ricoh's global operation reaches customers in approximately 200 countries and regions, supported by cultivated knowledge, technologies, and organizational capabilities nurtured over its 85-year history. In the financial year ended March 2024, Ricoh Group had worldwide sales of 2,348 billion yen (approx. 15.5 billion USD).

It is Ricoh's mission and vision to empower individuals to find Fulfillment through Work by understanding and transforming how people work so we can unleash their potential and creativity to realize a sustainable future.

For further information, please visit www.ricoh.com

© 2024 RICOH COMPANY, LTD. All rights reserved. All referenced product names are the trademarks of their respective companies.

Electrophysiological analysis of brain network for augmented recognition of driving distractions

A research paper by scientists at Beijing Jiaotong University proposed an electrophysiological analysis-based brain network method for the augmented recognition of different types of distractions during driving.

The new research, published on July 4 in the journal Cyborg and Bionic Systems, designed and conducted a simulated experiment comprising four distracted-driving subtasks. Three connectivity indices, including both linear and nonlinear synchronization measures, were chosen to construct the brain network.

Driver distractions, such as cognitive processing and visual disruptions during driving, lead to distinct alterations in the electroencephalogram (EEG) signals and the extracted brain networks. "By computing connectivity strengths and topological features, we explored the potential relationship between brain network configurations and states of driver distraction," explained study author Wei Guan, a professor at Beijing Jiaotong University. Statistical analysis of network features indicates substantial differences between normal and distracted states, suggesting a reconfiguration of the brain network under distracted conditions. Different brain network features and their combinations were fed into varied machine learning classifiers to recognize the distracted driving states. The results indicate that XGBoost demonstrates superior adaptability, outperforming the other classifiers across all selected network features. For individual networks, features constructed using synchronization likelihood (SL) achieved the highest accuracy in distinguishing between cognitive and visual distraction. The optimal feature set from the three network combinations achieves an accuracy of 95.1% for binary classification and 88.3% for ternary classification of normal, cognitively distracted, and visually distracted driving states.

" The proposed method could accomplish the augmented recognition of distracted driving states and may serve as a valuable tool for further optimizing driver assistance systems with distraction control strategies, as well as a reference for future research on the brain-computer interface in autonomous driving ." said study authors.

The aim of this study was to establish an augmented recognition framework for distracted driving states by leveraging varied synchronization indicators in brain networks. "A simulated car-following experiment containing 4 distraction subtasks was designed to encompass the cognitive distraction and visual distraction states. Three connectivity indices, including synchronization likelihood (SL), phase locking value (PLV), and a coherence indicator, were selected to construct functional brain networks. The connectivity strength as well as 4 global topological features were calculated to explore the potential relationship between the configuration of the brain network and the occurrence of driving distraction. Subsequently, machine learning classifiers were trained and implemented to recognize the different distracted driving states based on brain network features," said Geqi Qi.
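Of the three indicators, the phase locking value has a particularly compact definition: PLV = |mean over time of exp(i(φx(t) − φy(t)))|, with the instantaneous phases obtained from the Hilbert transform. A minimal sketch for one channel pair (toy signals; the paper's SL, coherence, topology, and classifier stages are not reproduced):

```python
import numpy as np
from scipy.signal import hilbert

def plv(x: np.ndarray, y: np.ndarray) -> float:
    """Phase locking value in [0, 1]; 1 means perfectly phase-locked channels."""
    phase_x = np.angle(hilbert(x))
    phase_y = np.angle(hilbert(y))
    return float(np.abs(np.mean(np.exp(1j * (phase_x - phase_y)))))

rng = np.random.default_rng(0)
t = np.linspace(0, 2, 512, endpoint=False)
ch1 = np.sin(2 * np.pi * 10 * t) + 0.5 * rng.standard_normal(t.size)  # toy 10 Hz channel
ch2 = np.sin(2 * np.pi * 10 * t + 0.3) + 0.5 * rng.standard_normal(t.size)
print(f"PLV: {plv(ch1, ch2):.3f}")  # high for these strongly phase-locked signals
```

Computing such an index for every channel pair yields a weighted adjacency matrix, from which the global topological features can then be derived.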

The main contributions of the paper are as follows:

  • The configuration of the functional brain network during distracted driving is constructed through electrophysiological analysis, using 3 synchronization indicators as network edges and 4 global topological features as network properties.
  • The performance of the different synchronization indicators is compared, and SL presents the best recognition capability for distinguishing between normal and distracted driving states using single-network knowledge.
  • An augmented framework for recognizing normal, visual distraction, and cognitive distraction states is proposed; the best classification performance is achieved by combining the global topological features of the 3 brain networks characterized by the different synchronization indicators.

Overall, such electrophysiological analysis of the brain network provides a foundation for the advancement of driver assistance systems with distraction control strategies and for the development of brain-controlled systems, in both conventional human driving and autonomous driving contexts.

Authors of the paper include Geqi Qi, Rui Liu, Wei Guan, and Ailing Huang.

This work was supported by the National Natural Science Foundation of China (grant nos. 72101014 and 72271018) and the Key Laboratory of Brain-Machine Intelligence for Information Behavior, Ministry of Education, China (2023JYBKFKT009).

Beijing Institute of Technology Press Co., Ltd

Qi, G., et al. (2024) Augmented Recognition of Distracted State Based on Electrophysiological Analysis of Brain Network. Cyborg and Bionic Systems. doi.org/10.34133/cbsystems.0130.


Computer Science > Computation and Language

Title: Speech Recognition Transformers: Topological-Lingualism Perspective

Abstract: Transformers have evolved with great success in various artificial intelligence tasks. Thanks to the recent prevalence of self-attention mechanisms, which capture long-term dependency, phenomenal outcomes in speech processing and recognition tasks have been produced. The paper presents a comprehensive survey of transformer techniques oriented to the speech modality. The main contents of this survey include (1) the background of traditional ASR, the end-to-end transformer ecosystem, and speech transformers; (2) foundational models in speech via the lingualism paradigm, i.e., monolingual, bilingual, multilingual, and cross-lingual; (3) datasets and languages, acoustic features, architectures, decoding, and evaluation metrics from a specific topological-lingualism perspective; and (4) popular speech transformer toolkits for building end-to-end ASR systems. Finally, we highlight open challenges and potential research directions for the community to conduct further research in this domain.
Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Cite as: arXiv:2408.14991 [cs.CL]
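For readers new to the mechanism the abstract credits with capturing long-term dependency, a minimal scaled dot-product self-attention sketch over a sequence of acoustic frames (the shapes and weights are illustrative, not from the paper):

```python
import torch

def self_attention(x: torch.Tensor, w_q: torch.Tensor,
                   w_k: torch.Tensor, w_v: torch.Tensor) -> torch.Tensor:
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.transpose(-2, -1) / (k.shape[-1] ** 0.5)
    weights = torch.softmax(scores, dim=-1)  # every frame attends to every frame
    return weights @ v

frames, d = 100, 64                          # 100 acoustic frames, 64-dim features
x = torch.randn(frames, d)
w_q, w_k, w_v = (torch.randn(d, d) for _ in range(3))
print(self_attention(x, w_q, w_k, w_v).shape)  # torch.Size([100, 64])
```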

Related papers and resources

  1. Automatic Speech Recognition: Systematic Literature Review

    A huge amount of research has been done in the field of speech signal processing in recent years. In particular, there has been increasing interest in the automatic speech recognition (ASR) technology field. ASR began with simple systems that responded to a limited number of sounds and has evolved into sophisticated systems that respond fluently to natural language. This systematic review of ...

  2. SPEECH RECOGNITION SYSTEMS

    Speech Recognition is a technology with the help of which a machine can acknowledge spoken words and phrases, which can further be used to generate text. Speech Recognition System works ...

  3. Speech Recognition Using Deep Neural Networks: A Systematic Review

    Abstract: Over the past decades, a tremendous amount of research has been done on the use of machine learning for speech processing applications, especially speech recognition. However, in the past few years, research has focused on utilizing deep learning for speech-related applications. This new area of machine learning has yielded far better results when compared to others in a variety of ...

  4. Speech Recognition

    Speech Recognition. 1194 papers with code • 236 benchmarks • 89 datasets. Speech Recognition is the task of converting spoken language into text. It involves recognizing the words spoken in an audio recording and transcribing them into a written format. The goal is to accurately transcribe the speech in real-time or from recorded audio ...

  5. Recent Advances in End-to-End Automatic Speech Recognition

    View a PDF of the paper titled Recent Advances in End-to-End Automatic Speech Recognition, by Jinyu Li. Recently, the speech community is seeing a significant trend of moving from deep neural network based hybrid modeling to end-to-end (E2E) modeling for automatic speech recognition (ASR). While E2E models achieve the state-of-the-art results ...

  6. Trends and developments in automatic speech recognition research

    This paper discusses how automatic speech recognition systems are and could be designed, in order to best exploit the discriminative information encoded in human speech. This contrasts with many recent machine learning approaches that apply general recognition architectures to signals to identify, with little concern for the nature of the input.

  7. Robust Speech Recognition via Large-Scale Weak Supervision

    View a PDF of the paper titled Robust Speech Recognition via Large-Scale Weak Supervision, by Alec Radford and 5 other authors. We study the capabilities of speech processing systems trained simply to predict large amounts of transcripts of audio on the internet. When scaled to 680,000 hours of multilingual and multitask supervision, the ...

  8. [2403.01255] Automatic Speech Recognition using Advanced Deep Learning

    View a PDF of the paper titled Automatic Speech Recognition using Advanced Deep Learning Approaches: A survey, by Hamza Kheddar and 2 other authors. Recent advancements in deep learning (DL) have posed a significant challenge for automatic speech recognition (ASR). ASR relies on extensive training datasets, including confidential ones, and ...

  9. Automatic Speech Recognition (ASR)

    Automatic Speech Recognition (ASR) involves converting spoken language into written text. It is designed to transcribe spoken words into text in real time, allowing people to communicate with computers, mobile devices, and other technology using their voice. The goal of Automatic Speech Recognition is to accurately ...

  10. PDF Robust Speech Recognition via Large-Scale Weak Supervision

    pre-training has been underappreciated so far for speech recognition. We achieve these results without the need for the self-supervision or self-training techniques that have been a mainstay of recent large-scale speech recognition work. To serve as a foundation for further research on robust speech recognition, we release inference code and ...

  11. Speech Recognition

    Speech Recognition. 1190 papers with code • 234 benchmarks • 89 datasets. Speech Recognition is the task of converting spoken language into text. It involves recognizing the words spoken in an audio recording and transcribing them into a written format. The goal is to accurately transcribe the speech in real-time or from recorded audio ...

  12. Automatic Speech Recognition: Systematic Literature Review

    ASR can be defined as the process of deriving the transcription of speech, known as a word sequence, in which the focus is on the shape of the speech wave [1]. In actuality, speech recognition ...

  13. A comprehensive survey on automatic speech recognition using ...

    Table 2 shows the 148 research papers shortlisted for review in this study. Primary studies featuring major terms such as speech recognition along with Deep Neural Network, Recurrent Neural Network, Convolution Neural Network, Long Short-Term Memory, Denoising, Neural Network, Deep Learning, Transfer Learning, and End-to-End ASR have been included.

  14. Speech Recognition Using Deep Neural Networks: A Systematic Review

    Abstract: Over the past decades, a tremendous amount of research has been done on the use of machine learning for speech processing applications, especially speech recognition. However, in the ...

  15. A review on speech emotion recognition: A survey, recent ...

    The research paper [156] delves into speech emotion classification, discussing feature selection, classification strategies, and the creation of emotional speech databases. It acknowledges the varying accuracy of current SER systems, with speaker-independent models seeing less success. ... However, when building speech recognition systems, it is ...

  16. Speech emotion recognition using machine learning

    Speech emotion recognition (SER) as a Machine Learning (ML) problem continues to garner a significant amount of research interest, especially in the affective computing domain. This is due to its increasing potential, algorithmic advancements, and applications in real-world scenarios. Human speech contains para-linguistic information that can ...

  17. A Comprehensive Review of Speech Emotion Recognition Systems

    During the last decade, Speech Emotion Recognition (SER) has emerged as an integral component within Human-computer Interaction (HCI) and other high-end speech processing systems. Generally, an SER system targets the speaker's existence of varied emotions by extracting and classifying the prominent features from a preprocessed speech signal. However, the way humans and machines recognize and ...

  18. [2303.03329] End-to-End Speech Recognition: A Survey

    Title: End-to-End Speech Recognition: A Survey. Authors: Rohit Prabhavalkar, Takaaki Hori, Tara N. Sainath, Ralf Schlüter, Shinji Watanabe. View a PDF of the paper titled End-to-End Speech Recognition: A Survey, by Rohit Prabhavalkar and 4 other authors

  19. An optimized attention based hybrid deep learning framework ...

    Speaker recognition (SR) is the identification of speakers using the characteristics of their voice notes, and it has been researched extensively for many years. Technology advancements have made SR a more popular research topic in recent years. Deep learning (DL)-based SR works are the most advanced and effective among the extensive SR works documented in the literature, leading to higher ...

  20. Effect of Speech Modification on Wav2Vec2 Models for Children Speech

    It is observed that all Wav2Vec2 variants still underperform for children under 10 years; speech modification methods and their combinations improve performance for small and large Wav2Vec2 models but leave plenty of room for improvement. Speech modification methods normalize children's speech towards adults' speech, enabling off-the-shelf generic automatic speech recognition (ASR) for ...

  21. Speech emotion recognition approaches: A systematic review

    The speech-emotion recognition (SER) field has become crucial in advanced human-computer interaction (HCI). ... QARs were utilized to assess the quality of the research papers following the study objectives. 5 QARs were identified, each worth one point out of five: a fully answered criterion scored 1, an above-average answer 0.75 ...

  22. (PDF) Speech Recognition: A review

    This paper presents the fundamental concepts of speech processing systems. It explores pattern matching techniques in speech recognition systems in noisy as well as noiseless environments. A ...

  23. Deep Speech: Scaling up end-to-end speech recognition

    View a PDF of the paper titled Deep Speech: Scaling up end-to-end speech recognition, by Awni Hannun and 9 other authors. We present a state-of-the-art speech recognition system developed using end-to-end deep learning. Our architecture is significantly simpler than traditional speech systems, which rely on laboriously engineered processing ...

  24. Researchers identify basic approaches for how people ...

    McMurray has been studying word recognition in children and in older adults for three decades. His research has shown differences in how people across all ages recognize spoken language.

  25. Interspeech 2024

    Accepted Papers. Can You Remove the Downstream Model for Speaker Recognition with Self-Supervised Speech Features? Zak Aldeneh, Takuya Higuchi, Jee-weon Jung, Skyler Seto, Tatiana Likhomanenko, Stephen Shum, Ahmed Hussen Abdelaziz, Shinji Watanabe. Comparative Analysis of Personalized Voice Activity Detection Systems: Assessing Real-World ...

  26. Self-Supervised Learning and Data Augmentation Technologies for AI

    TOKYO, August 29, 2024 - Ricoh today announced that its paper on Self-Supervised Learning and Data Augmentation Technologies for Artificial Intelligence (AI) Speech Recognition will be presented at INTERSPEECH 2024, the international spoken language processing conference. This is the first time a Ricoh paper has been accepted by INTERSPEECH.

  27. The Impact of Neural Networks on Image and Speech Recognition

    This paper provides a comprehensive overview of ChatGPT, exploring its development, underlying technology, applications, ethical considerations, and future implications.

  28. Electrophysiological analysis of brain network for augmented

    A research paper by scientists at Beijing Jiaotong University proposed an electrophysiological analysis-based brain network method for the augmented recognition of different types of distractions ...

  29. (PDF) A Study on Automatic Speech Recognition

    2. Automatic Speech Recognition. Automatic speech recognition is one of the most important automatic speech processing areas, allowing the machine to understand the user's speech and convert it into a ...

  30. [2408.14991] Speech Recognition Transformers: Topological-lingualism

    Transformers have evolved with great success in various artificial intelligence tasks. Thanks to the recent prevalence of self-attention mechanisms, which capture long-term dependency, phenomenal outcomes in speech processing and recognition tasks have been produced. The paper presents a comprehensive survey of transformer techniques oriented to the speech modality. The main contents of this ...