
Comparing SPEC CPU2006 and SPEC CPU2026 at the Instruction Level

This is a personal study note — I write these to solidify my own understanding. If you spot anything wrong or have thoughts, feel free to reach out. Layout, formatting, and grammar assisted by Claude.


TL;DR
Both source year and compiler version create small differences in the binary; neither dominates. Holding the toolchain at GCC 13 and swapping SPEC2006 source for SPEC2026 source adds 29 new mnemonics; holding the source at SPEC2006 and swapping GCC 9 for GCC 13 adds only 12. Both numbers are small relative to the ~210 mnemonics shared within each comparison.
Held constant ↔ varied | Mnemonics only in older version | Mnemonics only in newer version
Source year varies (GCC 13 fixed): SPEC2006 → SPEC2026 | 5 | 29
Compiler varies (SPEC2006 fixed): GCC 9 → GCC 13, -O2 and -O3 aggregated | 5 | 12

Context

I work on binary translation. My evaluation uses SPEC CPU2006 compiled with GCC 13, and I wanted to know — quantitatively — how the instruction stream in those binaries compares to what you'd get from other reasonable combinations of source year and compiler version. For a binary translator the input is the instruction stream, so the differences worth measuring are at the instruction level, along two independent axes:

  1. Source-axis: SPEC2006 vs SPEC2026, both built with GCC 13.
  2. Compiler-axis: the same SPEC2006 source built with GCC 9 versus GCC 13.

The rest of the post walks through heatmaps that show what each comparison contains.

Setup

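  • Compiler: GCC 13 (x87 instruction emission disabled). The compiler-axis comparison later in the post additionally uses a GCC 9.4.0 build of the same SPEC2006 source.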
  • Inputs: 12 SPEC CPU2006 integer benchmarks, 17 SPEC CPU2026 binaries (14 benchmark folders; 3 ship a main + helper binary). For each, both -O2 and -O3 builds are aggregated.
  • Method: static disassembly of the executable text. The first opcode byte (after stripping legacy + REX prefixes) is binned into a 16×16 grid — row = high nibble, column = low nibble. This is the canonical Intel SDM Vol. 2 Appendix A layout [1].

x86 has two opcode spaces in scope here: instructions identified by a single byte (mov, add, jmp, cmp, …), and instructions whose opcode starts with 0F and uses a second byte to identify them (SSE, conditional moves, long conditional jumps, …). Each figure below shows one grid per space.
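The binning step can be sketched in a few lines of C. This is a minimal illustration of the method as described, not the actual tooling behind the figures: the names opcode_cell and bin_opcode are made up for this example, it assumes 64-bit code where bytes 0x40-0x4F are REX prefixes, and it ignores VEX/EVEX encodings and the three-byte 0F 38 / 0F 3A maps.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Which 16x16 grid an instruction lands in. */
typedef enum { SPACE_ONE_BYTE, SPACE_TWO_BYTE_0F } opcode_space;

typedef struct {
    opcode_space space;
    uint8_t row; /* high nibble of the identifying opcode byte */
    uint8_t col; /* low nibble of the identifying opcode byte  */
} opcode_cell;

/* Legacy prefixes: segment overrides, operand/address size, lock, repne, rep. */
static bool is_legacy_prefix(uint8_t b) {
    switch (b) {
    case 0x26: case 0x2E: case 0x36: case 0x3E:
    case 0x64: case 0x65: case 0x66: case 0x67:
    case 0xF0: case 0xF2: case 0xF3:
        return true;
    default:
        return false;
    }
}

/* Assumption: 64-bit mode, where 0x40-0x4F are REX prefixes. */
static bool is_rex_prefix(uint8_t b) { return (b & 0xF0) == 0x40; }

/* Bin one instruction. Returns false if the bytes run out before an opcode. */
static bool bin_opcode(const uint8_t *insn, size_t len, opcode_cell *out) {
    size_t i = 0;
    while (i < len && is_legacy_prefix(insn[i])) i++;
    if (i < len && is_rex_prefix(insn[i])) i++;
    if (i >= len) return false;

    out->space = SPACE_ONE_BYTE;
    if (insn[i] == 0x0F) {           /* escape byte: second byte identifies it */
        if (++i >= len) return false;
        out->space = SPACE_TWO_BYTE_0F;
    }
    out->row = insn[i] >> 4;
    out->col = insn[i] & 0x0F;
    return true;
}
```

For example, the bytes 48 89 e5 (mov rbp, rsp) skip the REX prefix 48 and land in the one-byte grid at row 8, column 9. One consequence of stripping 66/F2/F3 is that SSE instructions distinguished only by such a prefix share a cell in the 0F grid.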

Hover any cell for its mnemonic breakdown and total count.

Figure 1 — Top mnemonics, side by side

The 20 most common mnemonics in each suite, shown as fractions of that suite's total static instructions. The top 14 are the same in both, in the same rank order, with percentages that match to within ~3 points. mov is ~34% of all 2006 instructions and ~36% of all 2026 instructions; call, cmp, lea, je, nop, test, jmp, pop, jne, add, push, xor follow in essentially the same order on both sides. The 2026 distribution isn't shifted; it's the same distribution with more instructions in it.

Figure 2 — Encoding-space heatmap

Same data binned by opcode byte instead of by mnemonic. Both halves use the same color scale, so cells with the same intensity have the same static count. Look at the bright cells — mov, call, lea, cmp, jmp, je, push, pop, add, sub, xor. They sit in the same positions, with the same relative intensities. 2026 lights up a few extra cells in the lower rows of each pane (mostly atomics and SSE2 packed-integer ops), but the overall shape is unmistakably the same instruction set.

The downside of this view: a single mnemonic like add has 6+ valid opcode-byte encodings, so it spreads across multiple cells. The heatmap is honest about x86's encoding layout, not about how much each operation is used — that's what Figure 1 is for.
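To make the spreading concrete, here is a small sketch that enumerates the one-byte-space cells where an add can live. The function names are invented for this example. It covers opcodes 00-05 plus the immediate group 80/81/83, where the ModRM reg field selects the operation, and it leaves out the 0x82 alias that is invalid in 64-bit mode.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if a one-byte-space opcode (plus its ModRM byte, where relevant)
   encodes add. Opcodes 0x00-0x05 are add outright. 0x80, 0x81 and 0x83
   are the immediate group: ModRM.reg (bits 5:3) picks the operation,
   and reg == 0 means add. */
static bool is_add_encoding(uint8_t opcode, uint8_t modrm) {
    if (opcode <= 0x05) return true;
    if (opcode == 0x80 || opcode == 0x81 || opcode == 0x83)
        return ((modrm >> 3) & 7) == 0;
    return false;
}

/* Count how many distinct opcode-byte cells can hold an add. */
static int count_add_cells(void) {
    int cells = 0;
    for (int op = 0; op < 256; op++) {
        /* ModRM 0xC0 is register-direct with reg field 0. */
        if (is_add_encoding((uint8_t)op, 0xC0)) cells++;
    }
    return cells;
}
```

The three immediate-group cells are shared with or, adc, sbb, and, sub, xor and cmp, so a bright cell at 83 says little about add specifically. That is the sense in which the heatmap tracks encoding layout rather than operation usage.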

Figure 3 — Which suite owns each cell

A categorical view of the same data: indigo = only in SPEC2006, teal = only in SPEC2026, green = in both suites. The figure is a wall of green: 210 of the 215 mnemonics in the 2006 build also appear in 2026, about 98% overlap. The indigo cells are nearly invisible. There are 5 of them (cmpleps, cmpneqps, minps, movsw, psubusw): incidental SSE/SSE2 and string-instruction variants that GCC 13 happened to emit for specific loop shapes in the 2006 source but not in the 2026 source. They don't represent a category of instruction 2026 is missing; they're just artifacts of which exact loops GCC encountered. The teal cells cluster in the lower rows of each pane: the atomics and SSE2 packed-int additions.
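The three-way classification behind this figure is a plain set comparison. A minimal sketch follows, assuming each suite's mnemonics are available as a duplicate-free array of strings; the overlap struct and function names are made up for this example, and the linear search is only reasonable because the sets hold a few hundred entries.

```c
#include <stddef.h>
#include <string.h>

typedef struct { size_t only_a, only_b, both; } overlap;

/* Linear membership test over a small array of mnemonic strings. */
static int contains(const char *const *set, size_t n, const char *m) {
    for (size_t i = 0; i < n; i++)
        if (strcmp(set[i], m) == 0) return 1;
    return 0;
}

/* Split the union of A and B into only-in-A, only-in-B and shared.
   Assumes neither array contains duplicates. */
static overlap classify(const char *const *a, size_t na,
                        const char *const *b, size_t nb) {
    overlap r = {0, 0, 0};
    for (size_t i = 0; i < na; i++) {
        if (contains(b, nb, a[i])) r.both++;
        else r.only_a++;
    }
    for (size_t i = 0; i < nb; i++)
        if (!contains(a, na, b[i])) r.only_b++;
    return r;
}

/* Share of A's mnemonics that also appear in B, in percent. */
static double overlap_percent(overlap r) {
    size_t total_a = r.only_a + r.both;
    return total_a ? 100.0 * (double)r.both / (double)total_a : 0.0;
}
```

Plugging in the counts from the text (5 only in 2006, 29 only in 2026, 210 shared) gives 210/215, which is where the "about 98%" figure comes from.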

For a binary translator's correctness surface, the two suites cover essentially the same instruction set. SPEC2026 binaries are roughly 10× larger on average (1.8M instructions per binary vs 181K), so they exercise more code paths and longer functions per run.

The 29 mnemonics that are only in 2026 group cleanly:

Category | Count | Examples
Atomics / concurrency | 5 | lock add, lock cmpxchg, lock or, lock sub, lock xadd
BMI / bitmanip | 1 | tzcnt
CPU feature detection | 2 | cpuid, xgetbv
Debugger / runtime traps | 1 | int3
SSE2 packed integer | 9 | pmaxsw, pmaxub, pminub, pmulhuw, pmulhw, pmullw, psraw, psubb, psubw
SSE packed compare / move | 6 | cmpeqps, cmpltps, cmpltpd, cmpnltsd, movhpd, movmskpd
Misc base x86 | 5 | cbw, dec, inc, jno, sets

Atomics and CPU detection reflect runtime patterns that SPEC2026 source uses and SPEC2006 source doesn't: multithreading primitives and runtime feature dispatching. The SSE2 packed-integer and packed-compare clusters aren't a newer ISA; SSE2 has been baseline on x86-64 since the architecture was specified, and the 2006 build does emit several of them at -O3 (paddb, pcmpgtb, psllw, punpckhbw, psllq, punpckhqdq all appear in the 2006-O3 binaries), just less often than the 2026 source triggers them. The misc category is base x86. The only post-2003 ISA additions in the whole "only in 2026" set are tzcnt (BMI1, 2013) and xgetbv (which arrived with the XSAVE extension in the late 2000s): two instructions out of 29.
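As a rough illustration of where lock-prefixed mnemonics typically come from, here is a C11 sketch using stdatomic.h. It is not taken from the SPEC2026 source, and the function names are invented. The instruction named in each comment is what GCC commonly emits on x86-64 at -O2, stated as an expectation rather than a guarantee.

```c
#include <stdatomic.h>

static atomic_long counter;

/* Result unused: GCC on x86-64 commonly lowers this to lock add. */
static void bump(void) {
    atomic_fetch_add(&counter, 1);
}

/* Old value is needed: commonly lowered to lock xadd. */
static long take_ticket(void) {
    return atomic_fetch_add(&counter, 1);
}

/* Compare-and-swap: commonly lowered to lock cmpxchg.
   Returns nonzero if *slot held `expected` and was replaced by `desired`. */
static int try_claim(atomic_long *slot, long expected, long desired) {
    return atomic_compare_exchange_strong(slot, &expected, desired);
}
```

Source that never touches shared state across threads has no reason to contain calls like these, which fits the atomics group sitting in the "only in 2026" column.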

So Figure 3 shows that source year barely shifts the binary when the compiler is held fixed. The converse experiment closes the loop: hold the source fixed and change the compiler.

I happen to have SPEC2006 also compiled with GCC 9.4.0 on my machine, both -O2 and -O3 — the work of supporting both the GCC 9 and the GCC 13 builds in the translator was substantial, so I have everything lying around. Strictly this isn't only a compiler change either: the GCC 9 build links against the 2019-era Linux toolchain, including glibc 2.31; the GCC 13 build links against glibc 2.35+. A concrete fingerprint of that libc jump shows up right in the symbol table:

GCC 9 build, bzip2 symbols:
  __libc_csu_init       in .text (101 bytes)
  __libc_csu_fini       in .text (5 bytes)
  __libc_start_main@@GLIBC_2.2.5   (undefined, resolved by ld.so)

GCC 13 build, bzip2 symbols:
  __libc_start_main@GLIBC_2.34     (undefined, resolved by ld.so)
  — __libc_csu_init / __libc_csu_fini are gone

In glibc 2.34 (August 2021), the long-standing __libc_csu_init and __libc_csu_fini "C Start-Up" routines that every dynamically-linked binary used to carry were removed [2]. Their job — walking .init_array to call constructors and .fini_array to call destructors — moved into __libc_start_main itself, exposed under a new symbol version __libc_start_main@GLIBC_2.34 [2]. The same release also merged libpthread, libdl, libutil, libanl into libc.so.6, so binaries no longer need -lpthread / -ldl / -lutil / -lanl to use those interfaces [3]. The CSU removal additionally closed the well-known ret2csu ROP gadget that exploit writers had relied on for years [4].

It's concrete enough that it shows up in my translator's source. The translator intercepts __libc_start_main from the translated x86 binary. Before glibc 2.34, the binary passes a pointer to its own __libc_csu_init via the init argument and we just call it. After glibc 2.34, the init argument is NULL, and the wrapper has to walk the ELF's PT_DYNAMIC segment by hand to find DT_INIT and DT_INIT_ARRAY and call each constructor itself:

if (init != NULL) {
  // OLD glibc (<2.34): init points to __libc_csu_init in the translated binary,
  // which runs _init and walks .init_array itself.
  init();
} else {
  // NEW glibc (>=2.34): init is NULL. We must walk PT_DYNAMIC ourselves.
  Elf64_Dyn *dyn = find_pt_dynamic(CTX->x86_ehdr);

  // 1. Call _init() via DT_INIT, if the binary has one.
  uint64_t init_vaddr = dyn_lookup(dyn, DT_INIT);
  if (init_vaddr) ((void (*)(void))rva_to_host(init_vaddr))();

  // 2. Walk .init_array via DT_INIT_ARRAY / DT_INIT_ARRAYSZ.
  uint64_t arr_vaddr = dyn_lookup(dyn, DT_INIT_ARRAY);
  uint64_t arr_size  = dyn_lookup(dyn, DT_INIT_ARRAYSZ);
  if (arr_vaddr && arr_size) {  // skip binaries that have no .init_array
    void (**arr)(void) = rva_to_host(arr_vaddr);
    for (size_t i = 0; i < arr_size / sizeof(void *); i++) {
      // Skip empty slots and the -1 sentinel. Compare as integers, since
      // ISO C does not define comparing a function pointer with a void *.
      uintptr_t fn = (uintptr_t)arr[i];
      if (fn != 0 && fn != UINTPTR_MAX) arr[i]();
    }
  }
}

The pre-2.34 branch is two lines: check that init isn't NULL, call it. The post-2.34 branch is the entire else. None of it would exist if glibc hadn't moved CSU into __libc_start_main.
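The dyn_lookup helper called in the wrapper above isn't shown in this post. A plausible minimal version, written here as an assumption about its shape rather than the translator's real code, scans the dynamic section until the DT_NULL terminator. It relies only on the Elf64_Dyn layout and DT_* constants from the system's elf.h header.

```c
#include <elf.h>
#include <stdint.h>

/* Hypothetical stand-in for dyn_lookup: return the value stored under
   `tag` in a DT_NULL-terminated dynamic section, or 0 if the tag is absent
   (or if no dynamic section was found). */
static uint64_t dyn_lookup(const Elf64_Dyn *dyn, int64_t tag) {
    if (!dyn) return 0;
    for (; dyn->d_tag != DT_NULL; dyn++) {
        if (dyn->d_tag == tag) return dyn->d_un.d_val;
    }
    return 0;
}
```

Returning 0 for a missing tag cannot be told apart from a tag whose stored value is 0. For DT_INIT and DT_INIT_ARRAY that ambiguity is harmless, because the wrapper treats 0 as "nothing to call" anyway.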

The next heatmap pair is really "2019-era Linux toolchain vs 2024-era Linux toolchain," with this libc shift baked in. The libc change shows up as a few startup symbols that are no longer present; the rest of what differs is a small set of codegen refinements that we'll see next.

Figure 4 — Which compiler emits each cell

Same coloring as Figure 3, just with the axis swapped to compiler version: indigo = only emitted by GCC 9, teal = only emitted by GCC 13, green = emitted by both. Same SPEC2006 source on both sides, both with -O2 and -O3 builds aggregated. The indigo cells are tiny (5 of them: cld, movhpd, rep cmpsb, repne jmp, repne scasb — old-style string-instruction idioms GCC 13 has stopped using, plus one stray SSE2 move). The teal cells are also small (12), and the figure is overwhelmingly green.

The 12 mnemonics GCC 13 emits that GCC 9 doesn't, with their counts:

Category | Mnemonics (static count)
SSE / SSE2 vector (8) | movlps (177), unpcklps (117), pshuflw (61), paddw (6), movlpd (5), maxps (2), minps (2), pminsw (2)
Base x86 (3) | xchg (52), shld (30), shrd (10)
SSE compare (1) | cmplesd (4)

A small list — and most of it isn't "new" instructions. The bulk of GCC 13's vectorization repertoire (paddd, paddq, pand, por, pshufd, pmuludq, punpckhdq, punpckldq, punpcklqdq, psrldq, pcmpgtd, cvtdq2pd, addps, mulps, divps, shufps) is also emitted by GCC 9 once it's allowed -O3. When the two compilers compete on equal footing, they reach very similar code on this source.

The takeaway from comparing Figures 3 and 4 together: across both axes the SPEC2006-with-GCC-13 binary differs from the other builds in this study by a small fraction of its mnemonic set — 29 on the source axis, 12 on the compiler axis. The shared core is ~210 mnemonics in both comparisons; the differences sit at the edges.

Caveats

  • Static counts, not dynamic frequency. Execution-time questions need separate measurement.
  • Mnemonic-level, not encoding-level. mov r, imm and mov r/m, r both land in the mov bucket even though they're distinct opcodes.
  • Main executable text only. Dynamic libraries (libc, libstdc++) aren't measured here.
  • GCC specifically. Different compiler families (Clang, ICC) might produce different shapes.
  • Instruction-set surface only. This comparison doesn't speak to workload character — concurrency patterns, indirect-branch density, working-set size, dynamic execution profile.

References

[1] Intel's canonical opcode map — the 16×16 layout used in this post, with one grid for the 1-byte primary opcode space and another for the 2-byte 0F-prefixed space: Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 2, Appendix A — Opcode Map

[2] The glibc 2.34 release announcement describes the removal of __libc_csu_init / __libc_csu_fini, the new __libc_start_main@GLIBC_2.34 symbol version that absorbs their work, and the consolidation of libpthread / libdl / libutil / libanl into libc.so.6: GNU C Library 2.34 — sourceware.org libc-alpha announcement, August 2021

[3] Red Hat's write-up of the library consolidation explains the engineering motivation and backward-compatibility approach (empty static archives for libpthread.a etc. so existing build systems keep working): Why glibc 2.34 removed libpthread — Red Hat Developer, December 2021

[4] Background on the ret2csu ROP technique that the CSU removal closes. The gadget at the end of __libc_csu_init controlled rdi/rsi/rdx/rbp/r12-r15 in a single call and was a standard primitive in CTF exploits before glibc 2.34: CSU Hardening — ir0nstone binary-exploitation notes