GLM 4.6V Review: One of the Best Open-Source Multimodal Models
If you’ve spent any time around AI lately, you know the hype cycle never stops. Every few weeks there’s a new “breakthrough” model that promises to change everything. Most of the time it’s just noise.
Then something like GLM 4.6V shows up and quietly reminds everyone what real progress looks like.
Released on December 8, 2025, by Z.ai, GLM 4.6V isn’t trying to win a beauty contest with flashy demos. It’s built for people who need an AI that can look at a picture, read a 200-page report, watch a video, and then actually do something useful—all in one coherent conversation.
There’s a full-size 106B version for heavy lifting and a surprisingly capable 9B “Flash” edition that runs beautifully on a single high-end GPU. Both are fully open-source (MIT license), both support native tool calling, and both handle up to 128,000 tokens of mixed image + video + text context without breaking a sweat.
In short: this is the first open-source vision-language model that feels ready for production, not just research papers.
At Siray.AI we got early access and have been putting it through its paces for the past week. The results speak for themselves. Below is our honest take: no marketing fluff, just the facts, benchmarks, and real examples you can try yourself, starting with the quick-start sketch right below.
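If you want to kick the tires before wading into the benchmarks, here is a minimal quick-start sketch. It assumes the Flash weights live at the Hugging Face id zai-org/GLM-4.6V-Flash (a placeholder; check the actual repo name) and that you serve them with vLLM’s OpenAI-compatible server, after which any OpenAI-style client works unchanged.

```python
# Minimal sketch: serve GLM-4.6V-Flash locally, then send one image + text prompt.
# Assumption: the weights are published as zai-org/GLM-4.6V-Flash (verify the real repo id).
#
# Start the server first (one high-end GPU is enough for the 9B Flash model):
#   vllm serve zai-org/GLM-4.6V-Flash --port 8000
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="zai-org/GLM-4.6V-Flash",
    messages=[{
        "role": "user",
        "content": [
            # Any reachable image URL or a base64 data URL works here.
            {"type": "image_url",
             "image_url": {"url": "https://example.com/quarterly-chart.png"}},
            {"type": "text",
             "text": "What does this chart show? Answer in two sentences."},
        ],
    }],
)
print(response.choices[0].message.content)
```

The same request shape works for the full 106B model; only the served model name and the GPU budget change.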

How GLM 4.6V Actually Compares to the Other Leading Vision-Language Models
Everyone wants the same chart, so here it is—updated December 2025 numbers pulled from ArtificialAnalysis.ai, Hugging Face Open LLM Leaderboard (vision edition), and Z.ai’s own eval suite.
| Category | Benchmarks | GLM-4.6V (106B) | GLM-4.6V-Flash (9B) | GLM-4.5V (106B) | GLM-4.1V-9B-Thinking (9B) | Qwen3-VL-8B-Thinking (8B) | Qwen3-VL-235B-A22B-Thinking (235B) | Kimi-VL-A3B-Thinking-2506 (16B) | Step3 321B (321B) |
|---|---|---|---|---|---|---|---|---|---|
| General VQA | MMBench V1.1 | 88.8 | 86.9 | 88.2 | 85.8 | 84.3* | 90.6 | 84.4 | 81.1* |
| General VQA | MMBench V1.1 (CN) | 88.2 | 85.9 | 88.3 | 84.7 | 83.3* | 87.2* | 80.7* | 81.5* |
| General VQA | MMStar | 75.9 | 74.7 | 75.3 | 72.9 | 75.3 | 78.7 | 70.4 | 69.0* |
| General VQA | BLINK (Val) | 65.5 | 65.5 | 65.3 | 65.1 | 64.7 | 67.1 | 53.5* | 62.7* |
| General VQA | MUIRBENCH | 77.1 | 75.7 | 75.3 | 74.7 | 76.8 | 80.1 | 63.8* | 75.0* |
| Multimodal Reasoning | MMMU (Val) | 76 | 71.1 | 75.4 | 68 | 74.1 | 80.6 | 64 | 74.2 |
| Multimodal Reasoning | MMMU_Pro | 66 | 60.6 | 65.2 | 57.1 | 60.4 | 69.3 | 46.3 | 58.6 |
| Multimodal Reasoning | VideoMMMU | 74.7 | 70.1 | 72.4 | 61 | 72.8 | 80 | 65.2 | / |
| Multimodal Reasoning | MathVista | 85.2 | 82.7 | 84.6 | 80.7 | 81.4 | 85.8 | 80.1 | 79.2* |
| Multimodal Reasoning | AI2D | 88.8 | 89.2 | 88.1 | 87.9 | 84.9 | 89.2 | 81.9* | 83.7* |
| Multimodal Reasoning | DynaMath | 54.5 | 43.7 | 53.9 | 42.5 | 41.1* | 56.5* | 28.1* | 50.1 |
| Multimodal Reasoning | WeMath | 69.8 | 60 | 68.8 | 63.8 | 66.5* | 74.5* | 42.0* | 59.8 |
| Multimodal Reasoning | ZeroBench (sub) | 25.8 | 22.5 | 23.4 | 19.2 | - | 27.7 | 16.2* | 23 |
| Multimodal Agentic | MMBrowseComp | 7.6 | 7.1 | / | / | 6.9* | 6.7* | / | / |
| Multimodal Agentic | Design2Code | 88.6 | 69.8 | 82.2 | 64.7 | 56.6* | 93.4 | 38.8* | 34.1* |
| Multimodal Agentic | Flame-React-Eval | 86.3 | 78.8 | 82.5 | 72.5 | 56.3* | 73.8* | 36.3* | 63.8* |
| Multimodal Agentic | OSWorld | 37.2 | 21.1 | 35.8 | 14.9 | 33.9 | 38.1 | 8.2 | / |
| Multimodal Agentic | AndroidWorld | 57 | 42.7 | 57 | 41.7 | 50 | / | - | / |
| Multimodal Agentic | WebVoyager | 81 | 71.8 | 84.4 | 69 | 47.7* | 30.4* | / | / |
| Multimodal Agentic | Webquest-SingleQA | 79.5 | 75.1 | 76.9 | 72.1 | 70.7* | 76.4* | 35.6* | 58.7* |
| Multimodal Agentic | Webquest-MultiQA | 59 | 53.4 | 60.6 | 54.7 | 60.3* | 17.3* | 11.1* | 52.8* |
| Multimodal Long Context | MMLongBench-Doc | 54.9 | 53 | 44.7 | 42.4 | 48 | 56.2 | 42.1 | 31.8* |
| Multimodal Long Context | MMLongBench-128K | 64.1 | 63.4 | / | / | 49.5* | 61.8* | / | / |
| Multimodal Long Context | LVBench | 59.5 | 49.5 | 53.8 | 44 | 55.8 | 63.6 | 47.6* | / |
| OCR & Chart | OCRBench | 86.5 | 84.7 | 86.5 | 84.2 | 81.9 | 87.5 | 86.9 | 83.7* |
| OCR & Chart | OCR-Bench_v2 (EN) | 65.1 | 63.5 | 60.8 | 57.4 | 63.9 | 66.8 | / | / |
| OCR & Chart | OCR-Bench_v2 (CN) | 59.6 | 59.5 | 59 | 54.6 | 59.2 | 63.5 | / | / |
| OCR & Chart | ChartQAPro | 65.5 | 62.6 | 64 | 59.5 | 58.4* | 63.6* | 23.7* | 56.4* |
| OCR & Chart | ChartMuseum | 58.4 | 49.8 | 55.3 | 48.8 | 46.7* | / | 33.6* | 40.0* |
| OCR & Chart | CharXiv_Val-Reasoning | 63.2 | 59.6 | 58.4 | 55 | 53 | 66.1 | 39.6* | / |
| Spatial & Grounding | OmniSpatial | 52 | 50.6 | 51 | 47.7 | 51.3* | - | 37.3* | 47.0* |
| Spatial & Grounding | RefCOCO-avg (val) | 88.6 | 85.6 | 91.3 | 85.3 | 89.3* | 92.4 | 33.6* | 20.2* |
| Spatial & Grounding | TreeBench | 51.4 | 45.7 | 50.1 | 37.5 | 34.3* | 50.9 | 41.5* | 41.3* |
| Spatial & Grounding | Ref-L4-test | 88.9 | 87.7 | 89.5 | 86.8 | 88.6* | 90.4 | 51.3* | 12.2* |
Yes, Qwen3-VL-235B-A22B-Thinking still edges ahead on some academic benchmarks. But look closer:
- On the agentic rows, GLM 4.6V posts the top scores in the table on MMBrowseComp, Flame-React-Eval, and Webquest-SingleQA, and native tool calling works out of the box (a code sketch follows below).
- The 9B Flash model lands within a handful of points of the full 106B model on most benchmarks, with larger gaps mainly on heavy agentic tasks like Design2Code and OSWorld. No other lightweight VLM in the table comes close to that.
In everyday language: you’re giving up very little capability, and in exchange you get full control of the weights, no per-token API fees, and the option to run it completely offline. For most teams that’s not a trade-off; it’s a straight upgrade.
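Here is what native tool calling looks like in practice through the same OpenAI-compatible endpoint. This is a hedged sketch: render_bar_chart is a made-up tool you would implement yourself, not part of any GLM API, and whether structured tool calls are parsed server-side depends on how your serving stack is configured.

```python
# Sketch of native tool calling via an OpenAI-compatible endpoint.
# The tool below is hypothetical; the model returns a structured call,
# your code executes it and feeds the result back as a "tool" message.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

tools = [{
    "type": "function",
    "function": {
        "name": "render_bar_chart",  # hypothetical tool you implement yourself
        "description": "Render a bar chart from labels and numeric values.",
        "parameters": {
            "type": "object",
            "properties": {
                "labels": {"type": "array", "items": {"type": "string"}},
                "values": {"type": "array", "items": {"type": "number"}},
            },
            "required": ["labels", "values"],
        },
    },
}]

response = client.chat.completions.create(
    model="zai-org/GLM-4.6V-Flash",  # placeholder model id
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": "https://example.com/revenue-table.png"}},
            {"type": "text",
             "text": "Extract revenue by quarter from this table and chart it."},
        ],
    }],
    tools=tools,
)

# If the model decided to call the tool, the arguments arrive as a JSON string.
for call in response.choices[0].message.tool_calls or []:
    print(call.function.name, call.function.arguments)
```

From there you run the tool, append its output as a tool-role message, and let the model finish the answer with the chart embedded.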

Three Use Cases We’ve Fallen in Love With
- Turning Screenshots into Working Front-End Code: Drop in a Figma export or even a photo of a hand-drawn wireframe. GLM 4.6V doesn’t just describe the layout; it writes clean React + Tailwind (or plain HTML/CSS if you prefer) and iterates in seconds when you say “make the primary button larger and change the accent color to teal.” On the Flame-React-Eval benchmark above it scores 86.3 (78.8 for Flash), which is remarkably strong for first-pass generation. A minimal request sketch follows this list.
- Making Sense of 150-Page Financial Reports in One Prompt: Upload four quarterly PDFs with charts, tables, and footnotes. Ask it to “compare revenue growth, highlight red flags, and build a summary dashboard.” It pulls the numbers, spots inconsistencies (yes, it caught a transposed digit in one real report we tested), calls a charting tool if you allow it, and hands you back a polished markdown table with embedded images. The whole thing takes under a minute on Siray.AI; the second sketch after this list shows the request shape.

- Hour-Long Video Reviews Without Watching the Whole Thing: We fed it a 75-minute product demo recording. Prompt: “Give me a timestamped summary of every feature announcement, plus the pricing slide at the end.” Output: 14 bullet points with exact timestamps and a perfectly extracted pricing table, with no hallucinations we could find. Try getting that level of precision from any other open model right now.
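
First, the screenshot-to-code sketch promised above. A local screenshot is base64-encoded into a data URL and the model is asked for React + Tailwind only; the file name, model id, and endpoint are placeholders for your own setup.

```python
# Sketch: turn a local wireframe screenshot into a React + Tailwind component.
# "wireframe.png" and the model id are placeholders; adjust to your setup.
import base64

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

with open("wireframe.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="zai-org/GLM-4.6V-Flash",
    temperature=0.2,  # keep the layout reproduction close to deterministic
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            {"type": "text",
             "text": "Reproduce this layout as a single React component styled "
                     "with Tailwind. Return only the code."},
        ],
    }],
)
print(response.choices[0].message.content)
```

Iterating (“make the primary button larger, accent color teal”) is just another user turn in the same conversation.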

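And the second sketch, for the report use case: each PDF page is rasterized with pdf2image (which needs the Poppler binaries installed) and packed into one long multimodal message, which is where the 128K context earns its keep. File names are placeholders, and your server must be configured to accept that many images in a single request.

```python
# Sketch: rasterize quarterly PDFs and ask for a cross-report comparison in one prompt.
# Requires `pdf2image` plus Poppler; file names and the model id are placeholders.
import base64
import io

from openai import OpenAI
from pdf2image import convert_from_path

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

def pdf_as_image_parts(path: str, dpi: int = 120) -> list[dict]:
    """Render every page of a PDF to a PNG data URL the chat API accepts."""
    parts = []
    for page in convert_from_path(path, dpi=dpi):
        buf = io.BytesIO()
        page.save(buf, format="PNG")
        b64 = base64.b64encode(buf.getvalue()).decode()
        parts.append({"type": "image_url",
                      "image_url": {"url": f"data:image/png;base64,{b64}"}})
    return parts

content = []
for report in ["q1.pdf", "q2.pdf", "q3.pdf", "q4.pdf"]:
    content += pdf_as_image_parts(report)
content.append({"type": "text",
                "text": "Compare revenue growth across these four quarterly reports, "
                        "flag any inconsistencies, and return a markdown summary table."})

response = client.chat.completions.create(
    model="zai-org/GLM-4.6V-Flash",
    messages=[{"role": "user", "content": content}],
)
print(response.choices[0].message.content)
```
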
Final Thoughts
GLM 4.6V isn’t perfect. Text-only reasoning is still a hair behind the very latest Claude or GPT-4 variants, and the training data cutoff is mid-2025. But for anything that involves images, documents, video, or tools, it’s the new open-source king—and it’s not particularly close.
The fact that you can spin it up locally, fine-tune it, or just jump on a hosted instance and start playing within minutes makes the decision even easier.
That’s why we’ve rolled it out front-and-center on Siray.AI. No waitlist, no credits to buy for the first few dozen runs—just log in and go.
Want to see what the fuss is about?
Upload an image, ask it to code something, or throw your messiest report at it. We bet you’ll be impressed.