Speeding Up Per-CPU Operations: A Controversial Kernel Proposal

The Linux kernel uses this_cpu operations to access per-CPU variables quickly, avoiding costly locks. However, performance varies across architectures. At the 2026 Linux Storage, Filesystem, Memory Management, and BPF Summit, Yang Shi proposed a fundamental change to how these operations work—a change that sparked debate. Below, we explore the proposal, its motivations, and the controversy surrounding it.

What are this_cpu operations and why are they important?

this_cpu operations are kernel primitives that read or modify per-CPU variables without global synchronization. In a multiprocessor system, each CPU has its own copy of certain variables (e.g., counters, statistics). A this_cpu operation can update the local copy quickly because it is guaranteed to be atomic with respect to preemption and interrupts on the local CPU; it makes no promise about accesses from other CPUs. This avoids expensive spinlocks or atomic operations that would otherwise serialize access across CPUs. The result is better performance in hot paths like memory allocation, scheduling, and network statistics. However, the efficiency of these operations depends heavily on the underlying hardware architecture: some CPUs handle them natively with single instructions, while others require workarounds that degrade performance. Yang Shi’s proposal aims to make these operations consistently fast across different CPU designs.
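The per-CPU counter idea can be illustrated with a minimal userspace sketch. This is not kernel code: NR_CPUS, the sketch_slot layout, the 64-byte spacing, and both function names are assumptions made for illustration, and the CPU number is passed in explicitly because a userspace program has no per-CPU base register.

```c
#include <assert.h>
#include <stddef.h>

#define NR_CPUS 4  /* assumed CPU count for this sketch */

/* One slot per CPU, spaced 64 bytes apart (an assumed cache-line size)
 * to reduce false sharing between neighbouring CPUs. */
struct sketch_slot {
    long count;
    char pad[64 - sizeof(long)];
};

static struct sketch_slot slots[NR_CPUS];

/* Hypothetical stand-in for this_cpu_inc(): touches only the caller's slot,
 * so no lock or cross-CPU atomic is needed. */
static void sketch_this_cpu_inc(int cpu)
{
    assert(cpu >= 0 && cpu < NR_CPUS);
    slots[cpu].count++;
}

/* Readers fold all slots together; in a live system the sum may lag
 * slightly behind concurrent updates. */
static long sketch_counter_sum(void)
{
    long total = 0;
    for (size_t i = 0; i < NR_CPUS; i++)
        total += slots[i].count;
    return total;
}
```

Calling sketch_this_cpu_inc(0) twice and sketch_this_cpu_inc(3) once leaves a total of 3 in sketch_counter_sum(), with each update confined to its own slot.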

What specific change did Yang Shi propose at the 2026 summit?

Yang Shi suggested altering the implementation of this_cpu operations so that they are no longer guaranteed to be atomic on the local CPU in all cases. Instead, he proposed that the operations become race-tolerant: they could tolerate concurrent modifications, but without enforcing strict ordering or atomicity. The core idea was to replace the heavyweight protection that guards each update on some architectures with lighter-weight synchronization, or even none, relying on the fact that per-CPU variables are rarely accessed from remote CPUs. This would allow architectures that currently suffer from expensive emulation (e.g., certain RISC CPUs) to benefit from simpler, faster sequences. However, this change breaks the traditional contract of this_cpu operations, which has always promised local atomicity for single-CPU updates.

Why does the current this_cpu implementation perform poorly on some architectures?

On x86, this_cpu operations map neatly to single instructions (e.g., an add with a segment-relative memory operand) that are inherently atomic on the local CPU and need no lock prefix. ARM64 has no equivalent addressing mode, so its implementation wraps the access in a short preemption-disabled section. On architectures like RISC-V or MIPS, there is no such single instruction either. The kernel must instead use a load-modify-store sequence guarded by explicit locking or interrupt disabling to ensure atomicity. This increases instruction count and can cause pipeline stalls. Moreover, some architectures require additional memory barriers to maintain ordering, further slowing things down. The discrepancy has become more apparent as per-CPU variables are used in increasingly performance-critical paths. Yang Shi’s proposal would remove the atomicity guarantee, allowing those architectures to use simpler, unguarded sequences and thus run faster.
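The cost of the guarded fallback can be sketched in userspace C. The example below is modeled loosely on the kernel's generic fallback, which brackets the update with raw_local_irq_save() and raw_local_irq_restore(); here interrupts are simulated with a plain flag, and every name (irqs_enabled, guard_ops, sketch_irq_save, sketch_irq_restore, sketch_fallback_add) is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Simulated interrupt-enable state for one CPU (an assumption standing in
 * for real hardware state). */
static bool irqs_enabled = true;

/* Counts the extra guard steps executed, purely for illustration. */
static unsigned long guard_ops;

static bool sketch_irq_save(void)
{
    bool flags = irqs_enabled;  /* remember the previous state */
    irqs_enabled = false;       /* "disable interrupts" */
    guard_ops++;
    return flags;
}

static void sketch_irq_restore(bool flags)
{
    irqs_enabled = flags;       /* put the previous state back */
    guard_ops++;
}

/* Guarded load-modify-store: nothing on this CPU can run between
 * the load and the store while the guard is held. */
static void sketch_fallback_add(long *slot, long delta)
{
    bool flags = sketch_irq_save();
    long tmp = *slot;           /* load */
    assert(!irqs_enabled);      /* the guard is in effect */
    tmp += delta;               /* modify */
    *slot = tmp;                /* store */
    sketch_irq_restore(flags);
}
```

Each call performs two guard steps on top of the three-step update, which is the overhead a single-instruction implementation avoids.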

What was the main controversy or objection to the proposal?

Many kernel developers worried that removing the atomicity guarantee from this_cpu operations would open the door to subtle, hard-to-debug data races. Currently, kernel code can safely update a per-CPU variable on its own CPU without worrying about being preempted or interrupted mid-update, because local atomicity makes the operation indivisible to anything else running on that CPU. If that guarantee is weakened, a remote CPU might see an intermediate state, or a preemption between load and store could cause lost updates. Critics argued that the gain in performance might not justify the increased complexity and risk. Some also pointed out that the worst-performing architectures are not the most common ones, so the change would penalize well-optimized platforms like x86 and ARM64 by possibly requiring extra overhead to preserve correctness. The debate highlighted the tension between micro-optimization and maintainability.
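The lost-update hazard can be shown with a deterministic userspace sketch that replays one bad interleaving by hand, with no real threads or preemption involved. The names sketch_rmw, sketch_load, sketch_store, sketch_interleaved_incs, and sketch_indivisible_incs are hypothetical.

```c
#include <assert.h>

/* Private state a task holds between its load and its store. */
struct sketch_rmw {
    long loaded;
};

static void sketch_load(struct sketch_rmw *t, const long *var)
{
    assert(t && var);
    t->loaded = *var;
}

static void sketch_store(const struct sketch_rmw *t, long *var)
{
    assert(t && var);
    *var = t->loaded + 1;  /* increment based on the (possibly stale) load */
}

/* Two increments where task A is "preempted" between load and store. */
static long sketch_interleaved_incs(void)
{
    long counter = 0;
    struct sketch_rmw task_a, task_b;

    sketch_load(&task_a, &counter);   /* A reads 0, then is preempted */
    sketch_load(&task_b, &counter);   /* B reads 0 on the same CPU */
    sketch_store(&task_b, &counter);  /* B writes 1 */
    sketch_store(&task_a, &counter);  /* A resumes and writes 1: B's update is lost */
    return counter;
}

/* The same two increments when each one runs to completion. */
static long sketch_indivisible_incs(void)
{
    long counter = 0;
    struct sketch_rmw task_a, task_b;

    sketch_load(&task_a, &counter);
    sketch_store(&task_a, &counter);
    sketch_load(&task_b, &counter);
    sketch_store(&task_b, &counter);
    return counter;
}
```

sketch_interleaved_incs() returns 1 although two increments ran, while sketch_indivisible_incs() returns 2; the gap between them is the kind of silent miscount critics were concerned about.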

How would the proposed change affect existing kernel code?

If implemented, existing code that uses this_cpu operations would not break immediately; it would just lose the atomicity guarantee. However, developers who rely on that guarantee would need to audit their code and possibly add explicit synchronization (e.g., local_bh_disable() or preempt_disable()) to prevent races. For example, a common pattern like this_cpu_inc(variable) might become non-atomic unless surrounding code already provides mutual exclusion. The kernel community would need to update documentation, introduce new API variants (e.g., a hypothetically named this_cpu_rcu for race-tolerant uses), and gradually convert users. A migration plan would be essential to avoid regressions. Some subsystems, like networking or slab allocators, which heavily use per-CPU counters, might need careful re-engineering. The proposal ultimately aims to provide a faster alternative for code that can tolerate races, while preserving the old semantics for safety-critical paths.
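The pattern of caller-provided protection can be sketched as follows. This is a userspace illustration under stated assumptions: preempt_depth stands in for the kernel's preemption count, and sketch_preempt_disable(), sketch_preempt_enable(), sketch_relaxed_inc(), and sketch_account_packet() are hypothetical names. The assertion inside the relaxed helper is similar in spirit to the debug checks the kernel can apply to its existing __this_cpu_* variants.

```c
#include <assert.h>

/* Hypothetical stand-in for the kernel's preemption count. */
static int preempt_depth;

static void sketch_preempt_disable(void)
{
    preempt_depth++;
}

static void sketch_preempt_enable(void)
{
    assert(preempt_depth > 0);
    preempt_depth--;
}

/* Relaxed update: a plain load-modify-store that is only valid
 * while the caller already holds the guard. */
static void sketch_relaxed_inc(long *slot)
{
    assert(preempt_depth > 0);  /* catch callers that forgot the guard */
    *slot = *slot + 1;
}

struct sketch_stats {
    long packets;
    long kilobytes;
};

/* One guard covers several relaxed updates, so the protection cost
 * is paid once per event rather than once per field. */
static void sketch_account_packet(struct sketch_stats *st, long kb)
{
    sketch_preempt_disable();
    sketch_relaxed_inc(&st->packets);
    st->kilobytes += kb;
    sketch_preempt_enable();
}
```

Auditing code for a relaxed API would amount to checking that every relaxed update sits inside such a guarded region, or that a lost update is acceptable for that counter.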

What were the potential benefits cited by Yang Shi?

Yang Shi argued that making this_cpu operations faster on a wider range of architectures would improve overall kernel performance, especially for workloads that are sensitive to per-CPU variable access, such as memory management, networking, and I/O statistics. On slow architectures, he suggested, the improvement could be dramatic, reducing overhead by 50% or more in certain hot paths. Furthermore, simplifying the implementation could reduce code complexity in architecture-specific low-level handlers. By decoupling the API from the atomicity guarantee, the kernel could offer both a fast, relaxed variant and a strict, safe variant, letting developers choose based on their needs. This aligns with Linux’s philosophy of providing flexibility while maintaining backward compatibility.

What is the likely next step for this proposal?

After the summit, the proposal remains under discussion. A common next step is to produce a proof-of-concept patch series that implements the relaxed this_cpu operations on one or two architectures (e.g., RISC-V and MIPS) while keeping the existing atomic version as a fallback. This would allow benchmarking and testing to see if the performance gains materialize without breaking existing workloads. The kernel community will evaluate the trade-offs through code reviews and stress tests. If the approach proves viable, it may be merged in a future kernel cycle, possibly as an opt-in feature behind a Kconfig option. However, given the controversy, it may take several releases before a consensus is reached.