
Neuphonic Open-Sources NeuTTS Air: A 748M-Parameter On-Device Speech Language Model With Instant Voice Cloning



Neuphonic has released NeuTTS Air, an open-source text-to-speech (TTS) speech language model designed to run locally, in real time, on CPUs. The Hugging Face model card lists 748M parameters (Qwen2 architecture), and the model ships in GGUF quantizations (Q4/Q8), enabling inference via llama.cpp/llama-cpp-python without cloud dependencies. It is licensed under Apache-2.0 and includes a runnable demo and examples.

So, what’s new?

NeuTTS Air couples a 0.5B-class Qwen backbone with Neuphonic’s NeuCodec audio codec. Neuphonic positions the system as a “super-realistic, on-device” TTS LM that clones a voice from ~3 seconds of reference audio and synthesizes speech in that style, targeting voice agents and privacy-sensitive applications. The model card and repository explicitly emphasize real-time CPU inference and small-footprint deployment.

Key Features

  • Realism at sub-1B scale: human-like prosody and timbre preservation from a ~0.7B (Qwen2-class) text-to-speech LM.
  • On-device deployment: distributed in GGUF (Q4/Q8) with CPU-first paths; suitable for laptops, phones, and Raspberry Pi-class boards.
  • Instant speaker cloning: voice cloning from ~3 seconds of reference audio (a reference WAV plus its transcript).
  • Compact LM+codec stack: a Qwen 0.5B backbone paired with NeuCodec (0.8 kbps / 24 kHz) to balance latency, footprint, and output quality.

How is the model structured and run?

  • Backbone: a Qwen 0.5B LM conditions speech generation; the hosted artifact is reported as 748M parameters under the qwen2 architecture on Hugging Face.
  • Codec: NeuCodec provides low-bitrate acoustic tokenization/decoding; it targets 0.8 kbps at 24 kHz output, enabling compact representations for efficient on-device use.
  • Quantization & format: prebuilt GGUF backbones (Q4/Q8) are available; the repo includes instructions for llama-cpp-python and an optional ONNX decoder path.
  • Dependencies: uses espeak for phonemization; examples and a Jupyter notebook are provided for end-to-end synthesis.
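To get a feel for how compact NeuCodec’s representation is, compare its stated 0.8 kbps token stream with raw 16-bit mono PCM at 24 kHz. The rates come from the model card; the arithmetic below is purely illustrative:

```python
# Illustrative arithmetic: how much smaller is NeuCodec's 0.8 kbps token
# stream than uncompressed 16-bit mono PCM at 24 kHz? (Rates from the
# model card; this is back-of-the-envelope math, not a measurement.)

def raw_pcm_kbps(sample_rate_hz: int, bits_per_sample: int, channels: int = 1) -> float:
    """Bitrate of uncompressed PCM audio in kilobits per second."""
    return sample_rate_hz * bits_per_sample * channels / 1000

raw = raw_pcm_kbps(24_000, 16)   # 384.0 kbps for 24 kHz / 16-bit mono
codec = 0.8                      # NeuCodec's stated bitrate, kbps
ratio = raw / codec              # ~480x reduction

print(f"raw PCM: {raw} kbps, NeuCodec: {codec} kbps, ~{ratio:.0f}x smaller")
```

That roughly 480x reduction is what makes low-latency, small-footprint decoding on CPUs plausible.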

On-device efficiency focus

NeuTTS Air advertises “real-time generation on mid-range devices” and defaults to CPU-first inference; the GGUF quantizations are intended for laptops and single-board computers. While no RTF (real-time factor) numbers are published on the model card, the distribution targets local inference with no GPU and demonstrates a working pipeline via the provided examples and Hugging Face Space.
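Since the card publishes no RTF figures, anyone benchmarking locally would measure it themselves. RTF is simply synthesis wall-clock time divided by the duration of the audio produced (below 1.0 means faster than real time). A minimal harness, where `synthesize` is a stand-in for whatever TTS call you wire up (the stub here is not the NeuTTS Air API):

```python
import time

def real_time_factor(synthesize, text: str, sample_rate_hz: int) -> float:
    """RTF = wall-clock synthesis time / duration of generated audio.
    RTF < 1.0 means faster than real time. `synthesize` must return
    a sequence of audio samples at `sample_rate_hz`."""
    start = time.perf_counter()
    samples = synthesize(text)
    elapsed = time.perf_counter() - start
    audio_seconds = len(samples) / sample_rate_hz
    return elapsed / audio_seconds

# Stub standing in for a real TTS call, just to show the shape:
if __name__ == "__main__":
    dummy = lambda text: [0] * 24_000  # pretend "1 second" of 24 kHz audio
    print(f"RTF: {real_time_factor(dummy, 'hello', 24_000):.4f}")
```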


Voice cloning workflow

NeuTTS Air requires (1) a reference WAV and (2) the transcript of that reference. It encodes the reference into style tokens and then synthesizes arbitrary text in the reference speaker’s timbre. The Neuphonic team recommends 3–15 s of clean, mono audio and provides pre-encoded samples.
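A practical pre-flight step is checking that a reference clip meets the recommended 3–15 s, mono guideline before encoding it. The helper below is my own sketch using only the standard-library `wave` module, not part of the NeuTTS Air tooling:

```python
import wave

def check_reference_clip(path: str, min_s: float = 3.0, max_s: float = 15.0) -> float:
    """Validate a voice-cloning reference WAV against the recommended
    3-15 s, mono guideline and return its duration in seconds.
    Raises ValueError if the clip is stereo or outside the range."""
    with wave.open(path, "rb") as wav:
        if wav.getnchannels() != 1:
            raise ValueError("reference audio should be mono")
        duration = wav.getnframes() / wav.getframerate()
    if not (min_s <= duration <= max_s):
        raise ValueError(f"reference should be {min_s}-{max_s} s, got {duration:.1f} s")
    return duration
```

Pairing a check like this with the required transcript gives you the two inputs the cloning path expects.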

Privacy, responsibility, and watermarking

Neuphonic frames the model for on-device privacy (no audio or text leaves the machine without user consent) and notes that all generated audio carries a Perth (Perceptual Threshold) watermark to support responsible use and provenance.

How does it compare?

Open, local TTS systems exist (e.g., GGUF-based pipelines), but NeuTTS Air is notable for packaging a small LM + neural codec with instant cloning, CPU-first quantizations, and watermarking under a permissive license. The “world’s first super-realistic, on-device speech LM” phrasing is the vendor’s claim; the verifiable facts are the model size, formats, cloning procedure, license, and provided runtimes.

The main story is the system trade-offs: a ~0.7B Qwen-class backbone with GGUF quantization, paired with NeuCodec at 0.8 kbps/24 kHz, is a practical recipe for real-time, CPU-only TTS that preserves timbre from ~3–15 s style references while keeping latency and memory predictable. The Apache-2.0 license and built-in watermarking are deployment-friendly, but publishing RTF/latency figures on commodity CPUs and cloning-quality-versus-reference-length curves would enable rigorous benchmarking against existing local pipelines. Operationally, an offline path with minimal dependencies (espeak, llama.cpp/ONNX) lowers privacy and compliance risk for edge agents without sacrificing intelligibility.


Check out the model card on Hugging Face and the project’s GitHub page for the code, tutorials, and notebooks.

Michal Sutter is a data science professional with a Master of Science in Data Science from the University of Padova. With a solid foundation in statistical analysis, machine learning, and data engineering, Michal excels at transforming complex datasets into actionable insights.



