Fine-grain clock gating, or FGCG, is the kind of optimization that every power-conscious chip team knows it should do more of and almost never does enough of. The idea is simple: if a register or a block is not changing on a given cycle, do not toggle its clock, because the clock tree is one of the largest consumers of dynamic power on a modern SoC. The hard part is not the principle but the labor — finding exactly where, in millions of cycles of real workload, a gate can be inserted without breaking functionality. A June 2026 arXiv preprint, AUTOGATE: Automated Clock Gating via Toggling-Aware LLM-based RTL Rewriting, by Yiting Wang, Chenhui Deng, Chia-Tung Ho, Yanqing Zhang, Zhuo Feng, Cunxi Yu and colleagues, tries to hand that labor to an automated agent.

The authors are blunt about why prior LLM-for-RTL efforts have stalled on this specific problem. Language models cannot ingest a waveform trace that spans millions of cycles, and they struggle to keep correctness intact when asked to rewrite large hierarchical codebases. Those two failure modes — context limits on dynamic data and scaling limits on static code — are exactly what stand between a clever demo and a tool an industrial team would trust on production RTL.

"Recent LLM-based RTL optimization approaches remain limited by two key drawbacks: (1) the inability to process long waveform traces spanning millions of cycles, and (2) the difficulty of scaling optimization to large hierarchical codebases while preserving correctness." (arXiv:2606.17461, Wang et al.)

The ML-LLM co-design

AUTOGATE's central move is to refuse to make the LLM read the waveform at all. Instead it introduces what the authors call an ML-LLM co-design: a machine-learning clustering algorithm first distills raw toggling traces into compact, structured representations, and only those distilled representations are handed to the language model to guide RTL rewriting. This is a sensible division of labor. Toggling statistics — which signals switch together, which stay quiet across a workload — are a clustering problem, not a language problem. By compressing millions of cycles into a structured summary of where activity correlates, the framework gives the LLM a tractable map of gating opportunities rather than an unreadable firehose of simulation data.

That design choice directly answers the first drawback. The model never tries to fit a multi-million-cycle trace into its context window; it reasons over a clustered abstraction of that trace and then edits the Verilog accordingly. The clustering is doing the workload-awareness, and the LLM is doing the code transformation — each tool kept to the job it is actually good at.
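
To make the idea of distilling a trace concrete, here is a minimal Python sketch. It is not the paper's clustering algorithm, which the abstract does not specify. As a deliberately simple stand-in, it groups signals whose idle windows coincide exactly, then emits a one-line summary per group of the kind that could plausibly fit in a prompt. The function names, the fixed window size, and the 0/1 list trace format are all assumptions made for illustration.

```python
from collections import defaultdict


def toggle_profile(trace, window):
    """Count value changes per fixed-size window of a 0/1 sample trace."""
    profile = []
    last = len(trace) - 1  # number of sample-to-sample transitions
    for start in range(0, last, window):
        end = min(start + window, last)
        profile.append(sum(trace[i] != trace[i + 1] for i in range(start, end)))
    return profile


def idle_signature(profile):
    """Reduce a profile to booleans: True where the window saw no toggles."""
    return tuple(count == 0 for count in profile)


def cluster_by_idle_windows(traces, window):
    """Group signal names whose idle windows match exactly (hypothetical stand-in
    for a real clustering step)."""
    clusters = defaultdict(list)
    for name, trace in traces.items():
        clusters[idle_signature(toggle_profile(trace, window))].append(name)
    return dict(clusters)


def summarize(clusters):
    """Turn each cluster into one short line of text."""
    lines = []
    for signature, names in clusters.items():
        idle = sum(signature)
        lines.append(f"{', '.join(sorted(names))}: idle in {idle}/{len(signature)} windows")
    return lines
```

Signals that fall into the same group are idle at the same times, which is what makes them candidates for sharing a single gate. A real flow would need a similarity measure that tolerates near-matches rather than demanding identical signatures, and it would read traces from a simulator dump rather than from Python lists.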

Hierarchy as the scaling strategy

The second drawback, scaling across large hierarchical codebases while preserving correctness, is addressed structurally. AUTOGATE uses a hierarchical multi-agent architecture that decomposes a large design into independently optimizable modules, then coordinates optimization across the deep hierarchy. This mirrors how human RTL teams already work — nobody optimizes a full SoC as one flat blob; they work module by module and reconcile at the boundaries. By making the agent topology follow the design hierarchy, the framework keeps each rewriting task small enough to verify locally while still propagating gating decisions up and down the tree.
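
The module-by-module pattern can be sketched as a post-order walk over the design tree. This is an illustration of the general idea only, not AUTOGATE's agent architecture. The `Module` class and the `rewrite` and `check` callables are hypothetical placeholders for, respectively, a design node, an LLM rewriting step, and a local correctness check.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Module:
    name: str
    rtl: str
    children: List["Module"] = field(default_factory=list)


def optimize_hierarchy(module, rewrite, check):
    """Optimize children before their parent; keep a rewrite only if it passes
    the local check. Returns the names of modules whose RTL was replaced."""
    accepted = []
    for child in module.children:
        accepted.extend(optimize_hierarchy(child, rewrite, check))
    candidate = rewrite(module)  # hypothetical: would call a rewriting agent
    if candidate != module.rtl and check(module, candidate):
        module.rtl = candidate
        accepted.append(module.name)
    return accepted
```

Because a rejected rewrite leaves the module's original RTL in place, a failure in one leaf does not contaminate its siblings or its parent. The sketch leaves out the harder part the article describes, which is coordinating gating decisions that cross module boundaries.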

The correctness emphasis is the part that separates this from a research toy. Clock gating that changes functional behavior is not a power saving, it is a bug, and any tool that touches RTL in an industrial flow lives or dies on whether the gated design remains logically equivalent to the original. The authors frame their hierarchy explicitly around preserving correctness during decomposition, which is the right priority even if the abstract does not detail the equivalence-checking methodology behind it.
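
A toy cycle-level model shows what "equivalent" means for the simplest case, a register with an enable. The Python sketch below compares an always-clocked register that recirculates its value through a mux against a version whose clock is suppressed when the enable is low. It counts delivered clock edges as a rough proxy for clock activity. Everything here is an assumption made for illustration: the function names, the random stimulus, and the model itself, which ignores glitches, gating-cell timing, and reset behavior. Random simulation is also much weaker than the formal equivalence checking an industrial flow would require.

```python
import random


def run_enable_mux(data, enable, reset_value=0):
    """Register clocked every cycle; holds its value when enable is low."""
    q, outputs, clock_edges = reset_value, [], 0
    for d, en in zip(data, enable):
        q = d if en else q
        clock_edges += 1  # the flop sees a clock edge on every cycle
        outputs.append(q)
    return outputs, clock_edges


def run_clock_gated(data, enable, reset_value=0):
    """Register whose clock edge is suppressed when enable is low."""
    q, outputs, clock_edges = reset_value, [], 0
    for d, en in zip(data, enable):
        if en:
            q = d
            clock_edges += 1  # the edge is delivered only when enabled
        outputs.append(q)
    return outputs, clock_edges


def compare(cycles=1000, seed=0):
    """Drive both models with the same random stimulus and compare outputs."""
    rng = random.Random(seed)
    data = [rng.randrange(256) for _ in range(cycles)]
    enable = [rng.random() < 0.25 for _ in range(cycles)]
    ref_out, ref_edges = run_enable_mux(data, enable)
    gated_out, gated_edges = run_clock_gated(data, enable)
    return ref_out == gated_out, ref_edges, gated_edges
```

In this model the two outputs match on every cycle, while the gated version receives a clock edge only on enabled cycles. A rewrite that used the wrong enable condition would show up as an output mismatch, which is the failure a correctness check exists to catch.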

The numbers, read carefully

The reported results span a deliberately wide range of design sizes. On a suite of small RTL designs, AUTOGATE reduces dynamic power by 49.31% on average — a large figure that reflects how much low-hanging gating opportunity exists in unoptimized small blocks. The more telling numbers come from real, sizable open designs: 19.34% on NVDLA, NVIDIA's open deep-learning accelerator, and 7.96% on BlackParrot, a well-known open RISC-V multicore. And on what the authors describe as highly optimized proprietary production designs, the framework still finds up to 6.86%.

That declining gradient (roughly 49% on average for small designs, about 19% on NVDLA, about 8% on BlackParrot, and at most about 7% on tuned production designs) is itself the honest story of the paper. It says exactly what an experienced power engineer would expect: the more a design has already been hand-optimized, the less headroom an automated pass can find. The last figure is a reported maximum rather than an average, so it is not strictly comparable with the others. Even so, a single-digit reduction on a design that human experts have already worked over is arguably the most impressive number in the set, not the least, because it represents savings on top of best-effort manual gating, found automatically.

What the abstract leaves open is the cost side of the ledger. It does not report area or timing overhead from the inserted gating logic, the runtime of the clustering-plus-LLM pipeline on a large design, or how correctness was formally verified after rewriting — all of which determine whether the power savings come for free or trade against frequency and die area. Nor does it name the specific model or the simulation flow that produced the toggling traces. Those omissions matter for reproducibility, but they do not undercut the framing.

The broader significance is that AUTOGATE positions itself as the first agentic framework for industry-grade RTL power optimization, and the choice of NVDLA, BlackParrot and proprietary production designs as benchmarks signals an ambition beyond academic demonstration. If the correctness story holds up under independent scrutiny, the contribution is less about any single percentage and more about a workable template: let ML compress the dynamic data, let an LLM rewrite the static code, and let the design hierarchy bound the blast radius of each edit. In a sector where dynamic power increasingly sets the thermal and packaging envelope of AI silicon, automating even single-digit gains on already-tuned designs is the kind of leverage that scales with every chip a team ships.