Liquid AI has released LFM2-VL-3B, a 3B-parameter vision-language model for image-plus-text to text tasks. It extends the LFM2-VL family beyond the 450M and 1.6B variants. The model targets higher accuracy while preserving the speed profile of the LFM2 architecture. It is available on LEAP and Hugging Face under the LFM Open License v1.0.
Model overview and interface
LFM2-VL-3B accepts interleaved image and text inputs and produces text outputs. The model exposes a ChatML-like template, and the processor inserts an image sentinel that is replaced with encoded image tokens at run time. The default text context length is 32,768 tokens. These details help developers reproduce evaluations and integrate the model with existing multimodal pipelines.
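To make the interface concrete, the snippet below sketches an interleaved image-plus-text input. The structure follows the standard Hugging Face chat-template convention for vision-language models; the file path and prompt are placeholders, not values from the model card.

```python
# Hypothetical conversation payload in the Hugging Face chat-template format.
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "document_page.png"},  # placeholder path
            {"type": "text", "text": "Summarize the table on this page."},
        ],
    }
]
# processor.apply_chat_template(conversation, ...) renders a ChatML-like prompt,
# and the image sentinel is replaced with encoded image tokens at run time.
```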
Architecture
The stack pairs a language tower with a shape-aware vision tower and a projector. The language tower is LFM2-2.6B, a hybrid convolution-plus-attention backbone. The vision tower is SigLIP2 NaFlex at 400M parameters; it preserves native aspect ratios and avoids distortion. The connector is a 2-layer MLP with pixel unshuffle that compresses image tokens before fusion with the language space. This design lets users cap vision token budgets without retraining the model.
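Pixel unshuffle is a standard space-to-depth operation: it trades spatial resolution for channel depth, so fewer image tokens reach the language model. The sketch below illustrates the idea in PyTorch; the feature dimensions, unshuffle factor, and MLP widths are assumptions for illustration, not the released model's exact values.

```python
import torch

# Illustrative pixel-unshuffle + 2-layer MLP projector (dimensions are assumed).
vision_dim, text_dim, factor = 1024, 2048, 2

# One 16x16 grid of vision-encoder features, i.e. 256 image tokens for a tile.
feats = torch.randn(1, vision_dim, 16, 16)

# PixelUnshuffle trades resolution for channels: 16x16 -> 8x8, so 256 tokens
# become 64, each carrying factor**2 = 4 times as many channels.
compressed = torch.nn.PixelUnshuffle(factor)(feats)   # (1, 4096, 8, 8)
tokens = compressed.flatten(2).transpose(1, 2)         # (1, 64, 4096)

# A small 2-layer MLP projects the compressed tokens into the language width.
projector = torch.nn.Sequential(
    torch.nn.Linear(vision_dim * factor**2, text_dim),
    torch.nn.GELU(),
    torch.nn.Linear(text_dim, text_dim),
)
print(projector(tokens).shape)  # torch.Size([1, 64, 2048])
```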
The encoder processes native resolutions up to 512×512. Larger inputs are split into non-overlapping 512×512 patches, and a thumbnail pathway provides global context during tiling. The efficient token mapping is documented with concrete examples: a 256×384 image maps to 96 tokens, and a 1000×3000 image maps to 1,020 tokens. The model card exposes user controls for the minimum and maximum number of image tokens and for the tiling toggle. These controls tune speed and quality at inference time.
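The 256×384 example is consistent with a budget of roughly one token per 32×32 pixel block, as the quick check below shows. This is a back-of-the-envelope reading inferred from that single documented figure, not an official formula; the 1000×3000 figure additionally reflects tiling, the thumbnail pathway, and the token caps, so it does not follow from the simple rate alone.

```python
# Back-of-the-envelope check of the documented 256x384 -> 96 token mapping.
# The 32-pixels-per-token-side rate is inferred from that example, not an
# official constant from the model card.
def approx_image_tokens(width: int, height: int, px_per_token_side: int = 32) -> int:
    return (width // px_per_token_side) * (height // px_per_token_side)

print(approx_image_tokens(256, 384))  # 96, matching the documented example
```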
Inference settings
The Hugging Face model card provides recommended parameters. Text generation uses temperature 0.1, min_p 0.15, and a repetition penalty of 1.05. Vision settings use a minimum of 64 image tokens, a maximum of 256 image tokens, and image splitting enabled. The processor applies the chat template and the image sentinel automatically. The example uses AutoModelForImageTextToText and AutoProcessor with bfloat16 precision.
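Putting those settings together, the sketch below shows a plausible end-to-end call. It assumes the repository id LiquidAI/LFM2-VL-3B, a recent transformers release with LFM2-VL support, and that the vision-side defaults (64–256 image tokens, splitting enabled) ship in the processor config; verify the exact names against the model card before relying on it.

```python
import torch
from transformers import AutoModelForImageTextToText, AutoProcessor

# Assumed repo id; requires a transformers version that supports LFM2-VL.
model_id = "LiquidAI/LFM2-VL-3B"

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

conversation = [
    {"role": "user", "content": [
        {"type": "image", "url": "https://example.com/scene.jpg"},  # placeholder
        {"type": "text", "text": "Describe this image in one sentence."},
    ]},
]

# The processor applies the ChatML-like template and the image sentinel; the
# min/max image token counts (64/256) and image splitting are processor defaults.
inputs = processor.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

# Decoding settings recommended in the model card.
output = model.generate(
    **inputs,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.1,
    min_p=0.15,
    repetition_penalty=1.05,
)
print(processor.batch_decode(output, skip_special_tokens=True)[0])
```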
How is it trained?
Liquid AI describes a staged approach. The team performs joint mid-training that adjusts the text-to-image ratio over time. The model then undergoes supervised fine-tuning focused on image understanding. The data sources are large-scale open datasets plus in-house synthetic vision data for task coverage.
Benchmarks
The research team reports competitive results among lightweight open VLMs. On MM-IFEval the model reaches 51.83. On RealWorldQA it reaches 71.37. On MMBench-dev-en it reaches 79.81. The POPE score is 89.01. The table notes that scores for other systems were computed with VLMEvalKit, and it excludes Qwen3-VL-2B because that system was released a day earlier.
Language performance stays close to the LFM2-2.6B backbone: the team cites about 30 percent on GPQA and 63 percent on MMLU, which matters when perception tasks include knowledge queries. The team also reports expanded multilingual visual understanding across English, Japanese, French, Spanish, German, Italian, Portuguese, Arabic, Chinese, and Korean.
Why should edge users care?
The architecture keeps compute and memory within small-device budgets. Image tokens are compressible and user-constrained, so throughput is predictable. The SigLIP2 400M NaFlex encoder preserves aspect ratios, which helps fine-grained perception, and the projector reduces tokens at the connector, which improves tokens per second. The team also published a GGUF build for on-device runtimes. These properties are useful for robotics, mobile, and industrial clients that need local processing and strict data boundaries.
Key Takeaways
- Compact multimodal stack: the 3B-parameter LFM2-VL-3B pairs an LFM2-2.6B language tower with a 400M SigLIP2 NaFlex vision encoder and a 2-layer MLP projector for image-token fusion. NaFlex preserves native aspect ratios.
- Resolution handling and token budgets: images run natively up to 512×512; larger inputs tile into non-overlapping 512×512 patches with a thumbnail pathway for global context. Documented token mappings include 256×384 → 96 tokens and 1000×3000 → 1,020 tokens.
- Inference interface: ChatML-like prompting with an image sentinel, a default text context of 32,768 tokens, recommended decoding settings, and processor-level controls for image splitting enable reproducible evaluation and easy integration into multimodal pipelines.
- Measured performance: reported results include MM-IFEval 51.83, RealWorldQA 71.37, MMBench-dev-en 79.81, and POPE 89.01. Language-only signals from the backbone are about 30% GPQA and 63% MMLU, useful for mixed perception-plus-knowledge workloads.
LFM2-VL-3B is a practical step for edge multimodal workloads. The 3B stack pairs LFM2-2.6B with a 400M SigLIP2 NaFlex encoder and an efficient projector that lowers image token counts for predictable latency. Native-resolution processing with 512×512 tiling and token caps gives deterministic budgets, and reported scores on MM-IFEval, RealWorldQA, MMBench, and POPE are competitive for this size. Open weights, a GGUF build, and LEAP access reduce integration friction. Overall, this is an edge-ready VLM release with clear controls and transparent benchmarks.
Check out the model on Hugging Face and the technical details.
Michal Sutter is a data science professional with a Master of Science in Data Science from the University of Padova. With a solid foundation in statistical analysis, machine learning, and data engineering, Michal excels at transforming complex datasets into actionable insights.

