“"GLM 5.2 (xhigh) spent over 15 minutes (!), spending about 45k tokens, before it finally wrote the first file." — user Tiberium, Hacker News https://news.ycombinator.com/item?id=48567759 (June 2026)”
You know that feeling when you need a frontier-grade AI agent for long software engineering tasks but every model smart enough is locked behind a paid API where you cannot audit the weights, control the data, or negotiate on cost? GLM-5.2 targets that gap: a fully open-weights 744B model you download and run on your own hardware. The second problem is architectural — 1M-token context at 744B scale is expensive; standard sparse attention rebuilds its index at every transformer layer, making the compute prohibitive. IndexShare cuts 75% of that index computation by reusing indices across layers, which is what makes the economics of self-hosted long-context inference plausible here.
Think of a librarian who indexes 1 million books across 100 floors. Standard sparse attention rebuilds that index from scratch on every floor — expensive at this scale. IndexShare designates 'Full layers' every few floors that compute the full index, while in-between 'Shared layers' borrow the nearest Full layer's result. Adjacent transformer layers turn out to select nearly identical top-k positions anyway, so borrowing costs almost nothing in quality while removing 75% of index compute. On top of this, training uses an asynchronous RL framework called slime: rollout workers generate long agent responses while training workers apply gradient updates in parallel, so the training loop is never blocked waiting for a 1M-token generation to finish. The result is a 744B MoE model where 40B parameters activate per token, with 1M-token context that costs 2.9× less compute at full context length than standard sparse attention.
If you are an ML engineer or AI infrastructure developer who needs a frontier-capable agent for long-horizon coding tasks and cannot use closed APIs due to data sovereignty, compliance requirements, or cost predictability, this is the strongest open-weights option as of June 2026. You need either a multi-A100 cluster or a 256 GB unified-memory Mac — 245 GB RAM is the minimum for the lowest-quality (2-bit) quantization. Skip this if you need something that runs on a single consumer GPU or a standard cloud instance; the hardware floor rules that out.
Worth evaluating if you have the hardware and need the strongest open-weights intelligence scores available — the model is production-deployed by HPC-AI and Unsloth, which means the deployment path is tested and documented. Do not use 'max' reasoning effort in production; community reports of 45k tokens spent before writing a single file mean you will burn significant budget. Set reasoning_effort to 'high' from the start.
Deep-dive insight, Easy and Pro modes, plus action playbooks — the full breakdown is one tap away.