THE CIRCUITRY: Your one-stop source for all tech news

By Xavier Rivera · 2 min read

Google DeepMind releases DiffusionGemma for 4x faster local AI

Google DeepMind released DiffusionGemma, a parallel text-generation model that produces up to four times as many tokens per second as similarly sized autoregressive Gemma models on local GPUs. The approach accepts a higher error rate in exchange for better compute efficiency, with gains on non-linear tasks, and Google still describes it as experimental.

Source: Ars Technica

Google DeepMind has released DiffusionGemma, a new member of the Gemma 4 open model family that generates text in parallel rather than one token at a time.

DiffusionGemma borrows its parallel generation approach from image models. Autoregressive models produce text left to right, one token at a time. DiffusionGemma instead starts with a field of placeholder tokens and makes multiple passes over that canvas, filling in likely tokens and using them to improve its estimates of the rest, before finalizing the output as one large block of denoised text. Google says this makes the model faster and more efficient on local hardware such as an Nvidia DGX or a gaming GPU.
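To make the multi-pass idea concrete, here is a minimal toy sketch in Python. It is not DiffusionGemma's algorithm or API: the stand-in "model" simply knows the target sentence, its confidence rule (higher when neighbors are already filled) is invented for illustration, and the names `MASK`, `toy_model`, `denoise` and `tokens_per_pass` are all hypothetical. Unlike the real model, this toy never revises a token once it is committed.

```python
import random

MASK = "<mask>"  # hypothetical placeholder token


def toy_model(canvas, target, rng):
    """Stand-in for a model: propose a token and a confidence for each masked slot.

    Assumption for illustration only: confidence rises when neighboring
    slots are already filled, plus a little seeded noise.
    """
    proposals = {}
    for i, tok in enumerate(canvas):
        if tok != MASK:
            continue
        neighbors = [canvas[j] for j in (i - 1, i + 1) if 0 <= j < len(canvas)]
        filled = sum(1 for n in neighbors if n != MASK)
        confidence = 0.3 + 0.3 * filled + 0.1 * rng.random()
        proposals[i] = (target[i], confidence)
    return proposals


def denoise(target, tokens_per_pass=3, seed=0):
    """Fill a fully masked canvas over several passes; return (canvas, passes)."""
    rng = random.Random(seed)
    canvas = [MASK] * len(target)
    passes = 0
    while MASK in canvas:
        proposals = toy_model(canvas, target, rng)
        # Commit only the most confident proposals in this pass.
        best = sorted(proposals, key=lambda i: proposals[i][1], reverse=True)
        for i in best[:tokens_per_pass]:
            canvas[i] = proposals[i][0]
        passes += 1
    return canvas, passes


if __name__ == "__main__":
    sentence = "the cat sat on the warm window sill".split()
    canvas, passes = denoise(sentence)
    print(" ".join(canvas), "| passes:", passes)
```

With eight tokens and three commits per pass, the canvas is complete after three passes, where a left-to-right decoder would take eight sequential steps. The order in which slots fill depends on the seeded noise and is not left to right.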

The model is a 26-billion-parameter Mixture of Experts design. Only 3.8 billion parameters are active during inference, and the model fits in 18 GB of memory on a high-end GPU. In testing on an RTX 5090, DiffusionGemma produces around 700 tokens per second. On a single Nvidia H100, it exceeds 1,000 tokens per second, roughly four times the output of similarly sized autoregressive Gemma models.
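As a rough worked example of what those figures imply, the sketch below backs out the autoregressive baseline from the reported speedup and compares wall-clock time for a reply. The rounding is an assumption: "1,000-plus" is treated as exactly 1,000 and "roughly four times" as exactly 4, and the 500-token reply length is a made-up illustration, not a figure from Google.

```python
def implied_baseline(tokens_per_sec: float, speedup: float) -> float:
    """Throughput of the slower model implied by a reported speedup."""
    return tokens_per_sec / speedup


def seconds_to_generate(num_tokens: int, tokens_per_sec: float) -> float:
    """Wall-clock time to emit num_tokens at a steady throughput."""
    return num_tokens / tokens_per_sec


# Figures from the article, rounded (assumption): H100 rate and speedup.
H100_TOKENS_PER_SEC = 1000
REPORTED_SPEEDUP = 4
REPLY_TOKENS = 500  # hypothetical reply length for illustration

baseline = implied_baseline(H100_TOKENS_PER_SEC, REPORTED_SPEEDUP)

if __name__ == "__main__":
    print("implied autoregressive baseline:", baseline, "tokens/s")
    print("diffusion:", seconds_to_generate(REPLY_TOKENS, H100_TOKENS_PER_SEC), "s")
    print("autoregressive:", seconds_to_generate(REPLY_TOKENS, baseline), "s")
```

Under these rounded numbers the implied baseline is 250 tokens per second, so the hypothetical 500-token reply takes 0.5 seconds instead of 2 seconds.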

Post from @GoogleDeepMind (official announcement):

https://x.com/GoogleDeepMind/status/2064741061352636762

Parallel generation shifts the performance bottleneck from memory to compute. The model can generate up to 256 tokens at once. Google reports measurable gains on non-linear tasks including in-line editing, molecular sequencing and mathematical graphing. An animation in the release demonstrates how DiffusionGemma solves Sudoku puzzles by continuously self-correcting large sets of tokens, a task that challenges standard autoregressive models because each token depends on future ones.
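One simplified way to see the shift from memory to compute is to count how often the active weights must be streamed from memory. The sketch below assumes single-user decoding where each forward pass reads the active weights once, and it ignores attention caches, expert routing and everything else a real runtime does. The 256-token block comes from the article; the pass count of 32 is purely hypothetical, since the article does not say how many passes DiffusionGemma runs.

```python
def weight_reads_autoregressive(num_tokens: int) -> int:
    """One forward pass, and so one sweep over the weights, per generated token."""
    return num_tokens


def weight_reads_diffusion(num_tokens: int, block_size: int = 256,
                           passes_per_block: int = 32) -> int:
    """One sweep over the weights per denoising pass, shared by the whole block.

    passes_per_block is a hypothetical value chosen for illustration.
    """
    blocks = -(-num_tokens // block_size)  # ceiling division
    return blocks * passes_per_block


if __name__ == "__main__":
    for n in (8, 256, 1024):
        print(n, "tokens:",
              weight_reads_autoregressive(n), "reads (autoregressive) vs",
              weight_reads_diffusion(n), "reads (diffusion)")
```

Under these assumptions a full 256-token block costs 32 weight sweeps instead of 256, so each byte read from memory feeds far more arithmetic. The same count shows the downside for short replies: an 8-token answer needs only 8 sweeps autoregressively but still the full 32 passes here.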

Text diffusion carries trade-offs that limit its use in cloud models. Google has experimented with the technique for its Gemini models but notes a higher error rate. In language, a single bad token can render an entire block meaningless and require restarting, unlike image generation where one flawed pixel rarely ruins the result. Diffusion models also waste resources on short outputs that autoregressive models can complete in just a few steps.

Local hardware benefits more from diffusion than cloud systems do. Cloud autoregressive models batch jobs across users and leverage high-bandwidth memory to stay efficient. Local AI often faces idle cycles and lower memory bandwidth. Diffusion makes better use of available compute, outperforming even Google's Multi-Token Prediction drafters that also target wasted cycles. Google describes DiffusionGemma as experimental.
