
Research Scientist – Speech and Audio Understanding (Large Models & Multimodal Systems)


Job Source

Tencent Group

Location

Bellevue, Washington, United States

Salary

Negotiable

Job Type

Full Time

Job Posted Date

20 June 2025

Job Description

Job Responsibilities:
We are building large-scale, native multimodal model systems that jointly support vision, audio, and text to enable comprehensive perception and understanding of the physical world. You will join the core research team focused on speech and audio, contributing to the following key research areas:
Develop general-purpose, end-to-end large speech models covering multilingual automatic speech recognition (ASR), speech translation, speech synthesis, paralinguistic understanding, and general audio understanding.
Advance research on speech representation learning and encoder/decoder architectures to build unified acoustic representations for multi-task and multimodal applications.
Explore representation alignment and fusion mechanisms between audio/speech and other modalities in large multimodal models, enabling joint modeling with image and text.
Build and maintain high-quality multimodal speech datasets, including automatic annotation and data synthesis technologies.
Work Location: US-Washington-Bellevue

Job Requirements

Ph.D. in Computer Science, Electrical Engineering, Artificial Intelligence, Linguistics, or a related field; or Master’s degree with several years of relevant experience.
Solid understanding of speech and audio signal processing, acoustic modeling, language modeling, and large model architectures.
Proficient in one or more core speech system development pipelines such as ASR, text-to-speech (TTS), or speech translation; experience with multilingual, multitask, or end-to-end systems is a plus.
Candidates with in-depth research or practical experience in the following areas are strongly preferred:
Speech representation pretraining (e.g., HuBERT, Wav2Vec, Whisper)
Multimodal alignment and cross-modal modeling (e.g., audio-visual-text)
Experience driving state-of-the-art (SOTA) performance on audio understanding tasks with large models
Proficient in deep learning frameworks such as PyTorch or TensorFlow; experience with large-scale training and distributed systems is a plus.
Familiar with Transformer-based architectures and their applications in speech and multimodal training/inference.


