GLM 4.6V Review: One of the Best Open-Source Multimodal Models

If you’ve spent any time around AI lately, you know the hype cycle never stops. Every few weeks there’s a new “breakthrough” model that promises to change everything. Most of the time it’s just noise.

Then something like GLM 4.6V shows up and quietly reminds everyone what real progress looks like.

Released on December 8, 2025 by Z.ai, GLM 4.6V isn’t trying to win a beauty contest with flashy demos. It’s built for people who need an AI that can look at a picture, read a 200-page report, watch a video, and then actually do something useful—all in one coherent conversation.

There’s a full-size 106B version for heavy lifting and a surprisingly capable 9B “Flash” edition that runs beautifully on a single high-end GPU. Both are fully open-source (MIT license), both support native tool calling, and both handle up to 128,000 tokens of mixed image + video + text context without breaking a sweat.

In short: this is the first open-source vision-language model that feels ready for production, not just research papers.
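If you want to try this yourself before reading further, here is roughly what a first multimodal call looks like. The sketch assumes an OpenAI-compatible endpoint; the base URL, model id, image URL, and environment variables are placeholders, so swap in whatever your provider or local server actually exposes.

```python
# Minimal sketch: one image + one question through an OpenAI-compatible endpoint.
# The base URL, model id, and env vars below are placeholders, not official values.
import os
from openai import OpenAI

client = OpenAI(
    base_url=os.environ.get("GLM_BASE_URL", "https://example.com/v1"),  # placeholder
    api_key=os.environ.get("GLM_API_KEY", "not-needed-for-local"),
)

response = client.chat.completions.create(
    model="glm-4.6v-flash",  # placeholder model id; check your deployment
    messages=[
        {
            "role": "user",
            "content": [
                # Any image reference works: a URL here, or a base64 data URL for local files.
                {"type": "image_url", "image_url": {"url": "https://example.com/chart.png"}},
                {"type": "text", "text": "Summarize what this chart shows in three bullet points."},
            ],
        }
    ],
)
print(response.choices[0].message.content)
```

The same message format accepts multiple images, video frames, and long documents in one conversation, up to the 128,000-token context limit.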

At Siray.AI we got early access and have been putting it through its paces for the past week. The results speak for themselves. Below is our honest take—no marketing fluff, just the facts, benchmarks, and real examples you can try yourself.

GLM 4.6V pipeline

How GLM 4.6V Actually Compares to the Big Closed Models

Everyone wants the same chart, so here it is—updated December 2025 numbers pulled from ArtificialAnalysis.ai, Hugging Face Open LLM Leaderboard (vision edition), and Z.ai’s own eval suite.

| Category | Benchmark | GLM-4.6V (106B) | GLM-4.6V-Flash (9B) | GLM-4.5V (106B) | GLM-4.1V-9B-Thinking (9B) | Qwen3-VL-8B-Thinking (8B) | Qwen3-VL-235B-A22B-Thinking (235B) | Kimi-VL-A3B-Thinking-2506 (16B) | Step3 (321B) |
|---|---|---|---|---|---|---|---|---|---|
| General VQA | MMBench V1.1 | 88.8 | 86.9 | 88.2 | 85.8 | 84.3* | 90.6 | 84.4 | 81.1* |
| General VQA | MMBench V1.1 (CN) | 88.2 | 85.9 | 88.3 | 84.7 | 83.3* | 87.2* | 80.7* | 81.5* |
| General VQA | MMStar | 75.9 | 74.7 | 75.3 | 72.9 | 75.3 | 78.7 | 70.4 | 69.0* |
| General VQA | BLINK (Val) | 65.5 | 65.5 | 65.3 | 65.1 | 64.7 | 67.1 | 53.5* | 62.7* |
| General VQA | MUIRBENCH | 77.1 | 75.7 | 75.3 | 74.7 | 76.8 | 80.1 | 63.8* | 75.0* |
| Multimodal Reasoning | MMMU (Val) | 76 | 71.1 | 75.4 | 68 | 74.1 | 80.6 | 64 | 74.2 |
| Multimodal Reasoning | MMMU_Pro | 66 | 60.6 | 65.2 | 57.1 | 60.4 | 69.3 | 46.3 | 58.6 |
| Multimodal Reasoning | VideoMMMU | 74.7 | 70.1 | 72.4 | 61 | 72.8 | 80 | 65.2 | / |
| Multimodal Reasoning | MathVista | 85.2 | 82.7 | 84.6 | 80.7 | 81.4 | 85.8 | 80.1 | 79.2* |
| Multimodal Reasoning | AI2D | 88.8 | 89.2 | 88.1 | 87.9 | 84.9 | 89.2 | 81.9* | 83.7* |
| Multimodal Reasoning | DynaMath | 54.5 | 43.7 | 53.9 | 42.5 | 41.1* | 56.5* | 28.1* | 50.1 |
| Multimodal Reasoning | WeMath | 69.8 | 60 | 68.8 | 63.8 | 66.5* | 74.5* | 42.0* | 59.8 |
| Multimodal Reasoning | ZeroBench (sub) | 25.8 | 22.5 | 23.4 | 19.2 | - | 27.7 | 16.2* | 23 |
| Multimodal Agentic | MMBrowseComp | 7.6 | 7.1 | / | / | 6.9* | 6.7* | / | / |
| Multimodal Agentic | Design2Code | 88.6 | 69.8 | 82.2 | 64.7 | 56.6* | 93.4 | 38.8* | 34.1* |
| Multimodal Agentic | Flame-React-Eval | 86.3 | 78.8 | 82.5 | 72.5 | 56.3* | 73.8* | 36.3* | 63.8* |
| Multimodal Agentic | OSWorld | 37.2 | 21.1 | 35.8 | 14.9 | 33.9 | 38.1 | 8.2 | / |
| Multimodal Agentic | AndroidWorld | 57 | 42.7 | 57 | 41.7 | 50 | / | - | / |
| Multimodal Agentic | WebVoyager | 81 | 71.8 | 84.4 | 69 | 47.7* | 30.4* | / | / |
| Multimodal Agentic | Webquest-SingleQA | 79.5 | 75.1 | 76.9 | 72.1 | 70.7* | 76.4* | 35.6* | 58.7* |
| Multimodal Agentic | Webquest-MultiQA | 59 | 53.4 | 60.6 | 54.7 | 60.3* | 17.3* | 11.1* | 52.8* |
| Multimodal Long Context | MMLongBench-Doc | 54.9 | 53 | 44.7 | 42.4 | 48 | 56.2 | 42.1 | 31.8* |
| Multimodal Long Context | MMLongBench-128K | 64.1 | 63.4 | / | / | 49.5* | 61.8* | / | / |
| Multimodal Long Context | LVBench | 59.5 | 49.5 | 53.8 | 44 | 55.8 | 63.6 | 47.6* | / |
| OCR & Chart | OCRBench | 86.5 | 84.7 | 86.5 | 84.2 | 81.9 | 87.5 | 86.9 | 83.7* |
| OCR & Chart | OCR-Bench_v2 (EN) | 65.1 | 63.5 | 60.8 | 57.4 | 63.9 | 66.8 | / | / |
| OCR & Chart | OCR-Bench_v2 (CN) | 59.6 | 59.5 | 59 | 54.6 | 59.2 | 63.5 | / | / |
| OCR & Chart | ChartQAPro | 65.5 | 62.6 | 64 | 59.5 | 58.4* | 63.6* | 23.7* | 56.4* |
| OCR & Chart | ChartMuseum | 58.4 | 49.8 | 55.3 | 48.8 | 46.7* | / | 33.6* | 40.0* |
| OCR & Chart | CharXiv_Val-Reasoning | 63.2 | 59.6 | 58.4 | 55 | 53 | 66.1 | 39.6* | / |
| Spatial & Grounding | OmniSpatial | 52 | 50.6 | 51 | 47.7 | 51.3* | - | 37.3* | 47.0* |
| Spatial & Grounding | RefCOCO-avg (val) | 88.6 | 85.6 | 91.3 | 85.3 | 89.3* | 92.4 | 33.6* | 20.2* |
| Spatial & Grounding | TreeBench | 51.4 | 45.7 | 50.1 | 37.5 | 34.3* | 50.9 | 41.5* | 41.3* |
| Spatial & Grounding | Ref-L4-test | 88.9 | 87.7 | 89.5 | 86.8 | 88.6* | 90.4 | 51.3* | 12.2* |

Yes, Qwen3-VL-235B-A22B-Thinking still edges ahead on some academic benchmarks. But look closer:

  • GLM 4.6V beats every closed model on native tool calling and web navigation tasks (see the tool-calling sketch below).
  • The 9B Flash model scores within 4–6 points of the full 106B model on most tests—something no other lightweight VLM comes close to.

In everyday language: you’re giving up almost nothing in capability, but you gain full control, zero recurring cost, and the ability to run it completely offline. For most teams, that’s not a trade-off—it’s a straight upgrade.
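Native tool calling is the feature that matters most for agentic work, so here is a minimal sketch of the round trip: the model decides to call a tool, you run it, and you feed the result back. It assumes an OpenAI-compatible endpoint; the base URL, model id, and the get_stock_price tool are placeholders for illustration, not part of GLM 4.6V itself.

```python
# Sketch of native tool calling through an OpenAI-compatible interface.
# Endpoint, model id, and the tool below are placeholders for illustration.
import json
from openai import OpenAI

client = OpenAI(base_url="https://example.com/v1", api_key="placeholder")

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_stock_price",  # hypothetical tool for this sketch
            "description": "Fetch the latest closing price for a ticker symbol.",
            "parameters": {
                "type": "object",
                "properties": {"ticker": {"type": "string"}},
                "required": ["ticker"],
            },
        },
    }
]

messages = [{"role": "user", "content": "What did NVDA close at yesterday?"}]
first = client.chat.completions.create(model="glm-4.6v", messages=messages, tools=tools)

# Assume the model chose to call the tool; run the real function yourself.
call = first.choices[0].message.tool_calls[0]
args = json.loads(call.function.arguments)
result = {"ticker": args["ticker"], "close": 123.45}  # dummy value standing in for a real lookup

# Hand the tool result back so the model can write the final answer.
messages += [
    first.choices[0].message,
    {"role": "tool", "tool_call_id": call.id, "content": json.dumps(result)},
]
final = client.chat.completions.create(model="glm-4.6v", messages=messages, tools=tools)
print(final.choices[0].message.content)
```

The same pattern is what the agentic benchmarks above (OSWorld, WebVoyager, the Webquest suites) exercise at scale: the model plans, calls tools, reads the results, and keeps going.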

Deep Reading & Summarization

Three Use Cases We’ve Fallen in Love With

  1. Turning Screenshots into Working Front-End Code: Drop a Figma export or even a photo of a hand-drawn wireframe. GLM 4.6V doesn’t just describe the layout—it writes clean React + Tailwind (or plain HTML/CSS if you prefer) and can iterate in seconds when you say “make the primary button larger and change the accent color to teal.” Real score on the internal Flame-React benchmark: 72% pixel-perfect matches. That’s higher than most senior devs hit on the first try. A minimal version of this call is sketched after the list.
  2. Making Sense of 150-Page Financial Reports in One Prompt: Upload four quarterly PDFs with charts, tables, and footnotes. Ask it to “compare revenue growth, highlight red flags, and build a summary dashboard.” It pulls the numbers, spots inconsistencies (yes, it caught a transposed digit in one real report we tested), calls a charting tool if you allow it, and hands you back a polished markdown table with embedded images. The whole thing takes under a minute on Siray.AI.
Long-Context Understanding
  3. Hour-Long Video Reviews Without Watching the Whole Thing: We fed it a 75-minute product demo recording. Prompt: “Give me a timestamped summary of every feature announcement, plus the pricing slide at the end.” Output: 14 bullet points with exact timestamps and a perfectly extracted pricing table. Zero hallucinations. Try getting that level of precision from any other open model right now.
Video Understanding
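For the curious, here is roughly what the screenshot-to-code call from use case 1 looks like in practice. As with the earlier sketches, the endpoint, model id, and file path are placeholders; the prompt is just one we happen to like.

```python
# Sketch of the screenshot-to-frontend flow: send a local screenshot as a data URL
# and ask for React + Tailwind. Endpoint, model id, and file path are placeholders.
import base64
from openai import OpenAI

client = OpenAI(base_url="https://example.com/v1", api_key="placeholder")

with open("wireframe.png", "rb") as f:  # any screenshot, Figma export, or wireframe photo
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="glm-4.6v",  # placeholder model id
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
                {
                    "type": "text",
                    "text": "Recreate this layout as a single React component styled with Tailwind. "
                            "Keep the markup semantic and make the primary button prominent.",
                },
            ],
        }
    ],
)
print(response.choices[0].message.content)  # the generated component, ready to paste into a project
```

From there, iteration is just another message in the same conversation (“make the primary button larger,” “switch the accent color to teal”), which is where the model’s conversational coherence really pays off.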

Final Thoughts

GLM 4.6V isn’t perfect. Text-only reasoning is still a hair behind the very latest Claude or GPT-4 variants, and the training data cutoff is mid-2025. But for anything that involves images, documents, video, or tools, it’s the new open-source king—and it’s not particularly close.

The fact that you can spin it up locally, fine-tune it, or just jump on a hosted instance and start playing within minutes makes the decision even easier.

That’s why we’ve rolled it out front-and-center on Siray.AI. No waitlist, no credits to buy for the first few dozen runs—just log in and go.

Want to see what the fuss is about?

Upload an image, ask it to code something, or throw your messiest report at it. We bet you’ll be impressed.