NVIDIA has just released a new streaming English transcription model, Nemotron Speech ASR, built specifically for low-latency voice agents and live captioning. The checkpoint nvidia/nemotron-speech-streaming-en-0.6b on Hugging Face combines a cache-aware FastConformer encoder with an RNNT decoder and is tuned for both streaming and batch workloads on modern NVIDIA GPUs.
Model design, architecture and input assumptions
Nemotron Speech ASR (Automatic Speech Recognition) is a 600M-parameter model based on a cache-aware FastConformer encoder with 24 layers and an RNNT decoder. The encoder uses aggressive 8x convolutional downsampling to reduce the number of time steps, which directly lowers compute and memory costs for streaming workloads. The model consumes 16 kHz mono audio and requires a minimum of 80 ms of input audio per chunk.
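The input constraints above are easy to turn into numbers. The following back-of-the-envelope helper (the constants come from the article; the function names are our own) shows how chunk duration maps to raw samples and to post-downsampling encoder frames:

```python
SAMPLE_RATE_HZ = 16_000  # model expects 16 kHz mono audio
FRAME_MS = 80            # one encoder frame spans 80 ms after 8x downsampling

def chunk_to_samples(chunk_ms: int) -> int:
    """Number of 16 kHz samples contained in a chunk of `chunk_ms` milliseconds."""
    return SAMPLE_RATE_HZ * chunk_ms // 1000

def chunk_to_encoder_frames(chunk_ms: int) -> int:
    """Encoder frames produced per chunk; chunks come in multiples of 80 ms."""
    if chunk_ms % FRAME_MS != 0:
        raise ValueError(f"chunk_ms must be a multiple of {FRAME_MS} ms")
    return chunk_ms // FRAME_MS

# The minimum 80 ms chunk is 1,280 samples and exactly one encoder frame;
# the largest 1.12 s mode is 17,920 samples and 14 frames.
```

So each streaming step hands the encoder only a handful of frames, which is what makes the per-chunk compute so small.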
Runtime latency is controlled through configurable context sizes. The model exposes four standard chunk configurations, corresponding to roughly 80 ms, 160 ms, 560 ms and 1.12 s of audio. These modes are driven by the att_context_size parameter, which sets the left and right attention context in multiples of 80 ms frames and can be changed at inference time without retraining.
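A small sketch of how the right attention context relates to the advertised chunk sizes. This assumes NeMo's usual cache-aware convention that the model waits for the current frame plus its right context before emitting output; the exact att_context_size values for this checkpoint should be read from the model card:

```python
FRAME_MS = 80  # encoder frame duration after 8x downsampling

def chunk_latency_ms(right_context_frames: int) -> int:
    """Audio the model must buffer before emitting: the current 80 ms frame
    plus `right_context_frames` frames of lookahead."""
    return (right_context_frames + 1) * FRAME_MS

# Under this convention, the four modes correspond to right contexts of
# 0, 1, 6 and 13 frames: 80 ms, 160 ms, 560 ms and 1,120 ms respectively.
```

This is why the latency modes come in 80 ms multiples: the lookahead is counted in whole encoder frames.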
Cache-aware streaming, not buffered sliding windows
Conventional ‘streaming ASR’ often relies on overlapping windows. Each incoming window reprocesses part of the earlier audio to maintain context, which wastes compute and causes latency to drift upward as concurrency increases.
Nemotron Speech ASR instead keeps a cache of encoder states for all self-attention and convolution layers. Each new chunk is processed once, with the model reusing cached activations rather than recomputing overlapping context. This provides:
- Non-overlapping frame processing, so work scales linearly with audio length
- Predictable memory growth, because cache size grows with sequence length rather than with concurrency-related duplication
- Stable latency under load, which is essential for turn taking and interruption in voice agents
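The compute saving can be made concrete with a toy accounting of encoder work. The window and hop sizes below are illustrative, not the model's actual configuration; the point is only the ratio between the two strategies:

```python
import math

def buffered_work(total_frames: int, window: int, hop: int) -> int:
    """Frames encoded by a sliding-window decoder: every hop re-encodes the
    full window, including the window - hop frames of overlap."""
    num_windows = math.ceil(total_frames / hop)
    return num_windows * window

def cache_aware_work(total_frames: int) -> int:
    """Cache-aware streaming encodes every frame exactly once."""
    return total_frames

# 60 s of audio in 80 ms frames is 750 frames. A 10-frame window advancing
# 2 frames at a time re-encodes each frame five times:
print(buffered_work(750, window=10, hop=2))  # 3750 frames of encoder work
print(cache_aware_work(750))                 # 750 frames of encoder work
```

With cache-aware processing the work per stream is fixed by the audio itself, which is why concurrency scales without latency drift.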
Accuracy vs latency: WER under streaming constraints
Nemotron Speech ASR is evaluated on the Hugging Face OpenASR leaderboard datasets, including AMI, Earnings22, Gigaspeech and LibriSpeech. Accuracy is reported as word error rate (WER) for various chunk sizes.
Averaged across these benchmarks, the model achieves:
- About 7.84% WER at 0.16 s chunk size
- About 7.22% WER at 0.56 s chunk size
- About 7.16% WER at 1.12 s chunk size
This illustrates the latency-accuracy tradeoff. Larger chunks give more phonetic context and slightly lower WER, but even the 0.16 s mode keeps WER below 8% while remaining usable for real-time agents. Developers can choose the operating point at inference time depending on application needs, for example 160 ms for aggressive voice agents, or 560 ms for transcription-centric workflows.
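That selection rule is simple enough to encode directly. A minimal sketch, using the benchmark figures quoted above (the function and its policy are our own, not part of the model's API):

```python
# (chunk duration in seconds, approximate average WER %) from the benchmarks above
MODES = [(0.16, 7.84), (0.56, 7.22), (1.12, 7.16)]

def pick_mode(latency_budget_s: float) -> tuple:
    """Pick the largest chunk that fits the latency budget, since larger
    chunks give lower WER at the cost of higher latency."""
    eligible = [m for m in MODES if m[0] <= latency_budget_s]
    if not eligible:
        raise ValueError("budget is below the smallest supported chunk")
    return max(eligible, key=lambda m: m[0])

# A voice agent with a 200 ms budget lands on the 0.16 s mode;
# a captioning job that tolerates 1 s lands on the 0.56 s mode.
```

Because the mode is an inference-time setting, the same deployed checkpoint can serve both workloads.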
Throughput and concurrency on modern GPUs
The cache-aware design has a measurable impact on concurrency. On an NVIDIA H100 GPU, Nemotron Speech ASR supports about 560 concurrent streams at a 320 ms chunk size, roughly 3x the concurrency of a baseline streaming system at the same latency target. RTX A5000 and DGX B200 benchmarks show similar throughput gains, with more than 5x concurrency on A5000 and up to 2x on B200 across typical latency settings.
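Those concurrency numbers imply a tight per-chunk compute budget, which is worth spelling out. A small illustrative calculation (our own framing, not a published NVIDIA metric): for every stream to stay real time, the GPU must clear one chunk from each stream within one chunk interval.

```python
def per_chunk_budget_ms(chunk_ms: float, concurrent_streams: int) -> float:
    """Amortized GPU time available per chunk: each stream emits one chunk
    every chunk_ms, so concurrent_streams chunks must be processed in that
    interval for all streams to keep up with real time."""
    return chunk_ms / concurrent_streams

# At 560 streams of 320 ms chunks on H100, the amortized budget is about
# 0.57 ms of GPU time per chunk.
budget = per_chunk_budget_ms(320, 560)
```

Meeting a sub-millisecond amortized budget is only plausible when chunks are batched across streams and no overlapping audio is recomputed.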
Equally important, latency stays stable as concurrency increases. In Modal’s tests with 127 concurrent WebSocket clients in the 560 ms mode, the system maintained a median end-to-end delay of around 182 ms without drift, which is critical for agents that must stay synchronized with live speech over multi-minute sessions.
Training data and ecosystem integration
Nemotron Speech ASR is trained primarily on the English portion of NVIDIA’s Granary dataset along with a large mix of public speech corpora, for a total of about 285k hours of audio. Datasets include YouTube Commons, YODAS2, MOSEL, Libri-Light, Fisher, Switchboard, WSJ, VCTK, VoxPopuli and several Mozilla Common Voice releases. Labels combine human and ASR-generated transcripts.
Key Takeaways
- Nemotron Speech ASR is a 0.6B-parameter English streaming model that uses a cache-aware FastConformer encoder with an RNNT decoder and operates on 16 kHz mono audio with a minimum of 80 ms input chunks.
- The model exposes four inference-time chunk configurations, about 80 ms, 160 ms, 560 ms and 1.12 s, which let engineers trade latency for accuracy without retraining while keeping WER around 7.2% to 7.8% on standard ASR benchmarks.
- Cache-aware streaming removes overlapping-window recomputation so each audio frame is encoded once, which yields about 3x more concurrent streams on H100, more than 5x on RTX A5000 and up to 2x on DGX B200 compared with a buffered streaming baseline at comparable latency.
- In an end-to-end voice agent with Nemotron Speech ASR, Nemotron 3 Nano 30B and Magpie TTS, the measured median time to final transcription is about 24 ms and server-side voice-to-voice latency on RTX 5090 is around 500 ms, which makes ASR a small fraction of the total latency budget.
- Nemotron Speech ASR is released as a NeMo checkpoint under the NVIDIA Permissive Open Model License with open weights and training details, so teams can self-host, fine-tune and profile the full stack for low-latency voice agents and speech applications.
Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of Artificial Intelligence for social good. His most recent endeavor is the launch of an Artificial Intelligence media platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience. The platform boasts over 2 million monthly views, illustrating its popularity among readers.