The Circuitry
© 2026 The Circuitry
Fact-check summary

Google's official blog, NVIDIA, The New Stack, and Hugging Face confirm the June 10, 2026 DiffusionGemma release: a 26B MoE diffusion model delivering up to 4x faster local text generation.

By Xavier Rivera

Google DeepMind releases DiffusionGemma for 4x faster local AI

Google DeepMind has released DiffusionGemma, a parallel text-generation model that produces up to four times as many tokens per second as similarly sized autoregressive Gemma models on local GPUs. The approach accepts a higher error rate in exchange for better use of local compute and gains on non-linear tasks, and Google describes it as experimental.

Source: Ars Technica
TL;DR

Google DeepMind has released DiffusionGemma, a 26-billion-parameter Mixture of Experts model that generates text in parallel via diffusion rather than token by token. It reaches about 700 tokens per second on an RTX 5090 and more than 1,000 on an H100, roughly four times the output of similarly sized autoregressive Gemma models, by shifting the bottleneck from memory to compute on local hardware.

Google DeepMind has released DiffusionGemma, a new member of the Gemma 4 open model family that generates text in parallel rather than one token at a time.

DiffusionGemma borrows its parallel generation approach from image models. Where autoregressive models produce text left to right, one token at a time, DiffusionGemma starts with a canvas of placeholder tokens and makes multiple passes over it. Each pass fills in the likeliest tokens, and those help refine the estimates for the rest, until the model finalizes its output as one large block of denoised text. Google says this makes the model faster and more efficient on local hardware such as an Nvidia DGX or a gaming GPU.
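
The multi-pass fill-in described above can be pictured with a toy loop. This is only an illustrative sketch, not DiffusionGemma's actual sampler: the `target` list stands in for a model's predictions (an oracle), random scores stand in for per-position confidence, and the names `MASK` and `toy_parallel_decode` are invented for this example. It also never revises a committed token, which a real diffusion model may do.

```python
import random

MASK = "[MASK]"  # placeholder token; the name is illustrative


def toy_parallel_decode(target, steps=4, seed=0):
    """Fill a canvas of placeholders over several passes.

    `target` stands in for a model's predictions (an oracle), and random
    scores stand in for per-position confidence. Returns the canvas after
    each pass, starting with the fully masked canvas.
    """
    rng = random.Random(seed)
    canvas = [MASK] * len(target)
    history = [list(canvas)]
    for step in range(steps):
        masked = [i for i, tok in enumerate(canvas) if tok == MASK]
        if not masked:
            break
        # Commit enough positions per pass to finish by the last pass.
        passes_left = steps - step
        k = -(-len(masked) // passes_left)  # ceiling division
        confidence = {i: rng.random() for i in masked}
        for i in sorted(masked, key=confidence.get, reverse=True)[:k]:
            canvas[i] = target[i]
        history.append(list(canvas))
    return history


if __name__ == "__main__":
    words = "the cat sat on the mat".split()
    for canvas in toy_parallel_decode(words, steps=3):
        print(" ".join(canvas))
```

Unlike left-to-right decoding, the positions filled in each pass can sit anywhere on the canvas, and the number of passes is set by `steps` rather than by the length of the text.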

The model is a 26-billion-parameter Mixture of Experts design. Only 3.8 billion parameters activate during inference, allowing it to fit in 18GB of memory on a high-end GPU. In testing on an RTX 5090, DiffusionGemma produces around 700 tokens per second. On a single Nvidia H100, it reaches more than 1,000 tokens per second, roughly four times the output of similarly sized autoregressive Gemma models.
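
A few figures follow from these numbers by simple arithmetic. The sketch below is a back-of-the-envelope check, not anything Google has published: it assumes 1 GB means 10^9 bytes, that the full 18GB holds nothing but weights (so the bits-per-parameter figure is an upper bound), and that the "roughly four times" claim applies to the 1,000 tokens-per-second H100 figure. The function name is invented for illustration.

```python
TOTAL_PARAMS = 26e9          # from the article
ACTIVE_PARAMS = 3.8e9        # from the article
MEMORY_BYTES = 18e9          # assumption: 1 GB = 10^9 bytes
H100_TOKENS_PER_SEC = 1000   # from the article ("1,000-plus")
SPEEDUP = 4                  # from the article ("roughly four times")


def derived_figures():
    """Return quantities implied by the article's numbers under the stated assumptions."""
    return {
        # Share of parameters that are active for any one token.
        "active_fraction": ACTIVE_PARAMS / TOTAL_PARAMS,
        # Upper bound on storage per parameter if all 18GB were weights.
        "max_bits_per_param": MEMORY_BYTES * 8 / TOTAL_PARAMS,
        # Autoregressive throughput implied by a 4x speedup on the H100.
        "implied_baseline_tps": H100_TOKENS_PER_SEC / SPEEDUP,
    }


if __name__ == "__main__":
    for name, value in derived_figures().items():
        print(f"{name}: {value:.3f}")
```

Under those assumptions, about 14.6% of the parameters are active per token, the weights would need to average no more than about 5.5 bits each, and the comparable autoregressive model would run at about 250 tokens per second.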
Official announcement from @GoogleDeepMind: https://x.com/GoogleDeepMind/status/2064741061352636762

Parallel generation shifts the performance bottleneck from memory to compute. The model can generate up to 256 tokens at once. Google reports measurable gains on non-linear tasks including in-line editing, molecular sequencing and mathematical graphing. An animation in the release demonstrates how DiffusionGemma solves Sudoku puzzles by continuously self-correcting large sets of tokens, a task that challenges standard autoregressive models because each token depends on tokens that have not yet been generated.
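
One way to see why producing many tokens per pass moves the bottleneck is a simple roofline-style estimate. The sketch below is a rough model under stated assumptions, not a measurement: it assumes about 2 floating-point operations per parameter per token, that the weights are read once per forward pass however many tokens are processed (which ignores Mixture of Experts routing), and 16-bit weights. The hardware figures in the usage example are hypothetical round numbers, not the specifications of an RTX 5090 or H100.

```python
def arithmetic_intensity(tokens_per_pass, bytes_per_param=2.0):
    """Approximate FLOPs performed per byte of weights read in one pass.

    Assumes ~2 FLOPs per parameter per token and one read of each weight
    per pass, so the parameter count cancels out.
    """
    return 2.0 * tokens_per_pass / bytes_per_param


def bottleneck(tokens_per_pass, peak_flops, mem_bandwidth, bytes_per_param=2.0):
    """Classify a pass as 'memory' or 'compute' bound with a roofline test."""
    ridge = peak_flops / mem_bandwidth  # FLOPs per byte the hardware can sustain
    intensity = arithmetic_intensity(tokens_per_pass, bytes_per_param)
    return "compute" if intensity >= ridge else "memory"


if __name__ == "__main__":
    # Hypothetical accelerator: 100 TFLOP/s peak, 1 TB/s memory bandwidth.
    PEAK_FLOPS = 100e12
    MEM_BANDWIDTH = 1e12
    for tokens in (1, 256):
        print(tokens, bottleneck(tokens, PEAK_FLOPS, MEM_BANDWIDTH))
```

With these made-up hardware numbers, a pass that yields one token is limited by how fast weights can be read, while a pass over 256 tokens does enough arithmetic per byte to be limited by compute instead.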

Text diffusion carries trade-offs that limit its use in cloud models. Google has experimented with the technique for its Gemini models but notes a higher error rate. In language, a single bad token can render an entire block meaningless and require restarting, unlike image generation where one flawed pixel rarely ruins the result. Diffusion models also waste resources on short outputs that autoregressive models can complete in just a few steps.
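
The cost on short outputs can be illustrated by counting forward passes. This is a simplified sketch under assumptions the article does not state: the number of denoising steps per block (`denoise_steps=16`) is a hypothetical value, every block is assumed to need the full number of steps regardless of how little text is wanted, and a diffusion pass over a whole block costs more compute than a single autoregressive pass. Only the 256-token block size comes from the article.

```python
import math

BLOCK_TOKENS = 256  # from the article: up to 256 tokens generated at once


def autoregressive_passes(n_tokens):
    """One forward pass per generated token."""
    return n_tokens


def diffusion_passes(n_tokens, denoise_steps=16, block_tokens=BLOCK_TOKENS):
    """A fixed number of denoising passes per block, however short the output.

    `denoise_steps` is a hypothetical value chosen for illustration.
    """
    blocks = math.ceil(n_tokens / block_tokens)
    return blocks * denoise_steps


if __name__ == "__main__":
    for n in (5, 16, 256):
        print(n, autoregressive_passes(n), diffusion_passes(n))
```

In this toy model a five-token reply takes 5 autoregressive passes but 16 diffusion passes, while a full 256-token block takes 256 autoregressive passes against the same 16, so the advantage only appears once the output is longer than the step count.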

Local hardware benefits more from diffusion than cloud systems do. Cloud autoregressive models batch jobs across users and leverage high-bandwidth memory to stay efficient. Local AI often faces idle cycles and lower memory bandwidth. Diffusion makes better use of available compute, outperforming even Google's Multi-Token Prediction drafters that also target wasted cycles. Google describes DiffusionGemma as experimental.