Running SPEC CPU2017 at Chips and Cheese?

SPEC, or Standard Performance Evaluation Corporation, maintains and publishes various benchmark suites that are often taken as an industry standard. Here, I’ll be looking at SPEC CPU2017, which measures CPU performance. Chips and Cheese has a free press license for SPEC CPU2017, and we thank SPEC for graciously donating that. Running SPEC CPU2017 is interesting because OEMs often publish SPEC CPU2017 scores to set performance expectations for their systems, and SPEC CPU test suites are sometimes used as an optimization target by CPU makers.

Personally, I want to leave benchmarking to mainstream tech sites. Chips and Cheese is a free time project for me, meaning I have to pick the areas I want to cover. I started writing articles about hardware architecture because I felt mainstream sites were not covering that in anywhere near enough detail. But mainstream sites have important advantages. With people working full time, they’re far better positioned to run a lot of benchmarks and repeat runs should issues come up. But I may make an exception for SPEC CPU2017 because I don’t see more established sites running it, and we do have a license. Anandtech used to run SPEC CPU tests, but unfortunately they’ve closed down.

SPEC CPU2017 is a particularly challenging benchmark to run. It’s distributed in source code form to let it run across a broad range of systems. But that means compiler selection and optimization flags can influence results. Running SPEC CPU2017 also takes a long time once compilation finishes. A run can take several hours even on a fast CPU like the Ryzen 9 9950X. On Ampere Altra, a run took more than 11 hours to complete. Testing different compilers and flags, and doing multiple runs to collect performance counter data can easily turn into a giant time investment.

Here, I’m going to share my early impressions of running SPEC CPU2017. Nothing here is final, including the test methodology or whether I’ll actually run SPEC CPU2017 going forward.

Acknowledgment

We would like to thank the SPEC organization for providing us with a SPEC CPU 2017 license to use free of charge.

Initial Run Configuration

My initial methodology is loosely based on Anandtech’s. However, I’m using GCC 14.2.0 instead of clang. Like Anandtech, I want to skip working with vendor-specific compilers like AMD’s AOCC, Intel’s ICC, and Arm’s armclang. I know that many submitted SPEC results use vendor-specific compilers, and they may give better scores. But working out flags for each compiler is not a time investment I’m willing to make. I also want to focus on what scores look like with a standard, widely deployed compiler. I did look at clang, but it wasn’t able to compile some SPEC workloads out of the box. GCC wasn’t able to either, but SPEC already documented workarounds for most of the problems I encountered. The rest were fixed by passing -Wno-implicit-int and -Wno-error=implicit-int to GCC, and -std=legacy to GFortran. clang threw more complex errors that didn’t immediately lead me to an obvious debugging route.
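
For reference, those workarounds slot into a SPEC config file along the lines of the fragment below. The field names follow SPEC’s Example-gcc config files, but treat this as an illustrative sketch rather than my exact configuration:

default:
   CPORTABILITY = -Wno-implicit-int -Wno-error=implicit-int
   FPORTABILITY = -std=legacy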

As with compiler selection, I’m keeping optimization flags simple:

-O3 -fomit-frame-pointer -mcpu=native

I’m using -O3 instead of the -Ofast Anandtech used, because -Ofast causes a failure in SPEC CPU2017’s floating point suite. -mcpu=native is a compromise so I don’t have to pass different tuning and ISA feature targets for different CPUs. On aarch64, -mcpu=native tells GCC to target the current CPU’s ISA feature set and tune for it. On x86-64, however, -mcpu=native only tells GCC to tune for the current CPU. When I have more time, I do intend to see what’ll happen if I split out -march and -mtune options. -fomit-frame-pointer tells the compiler not to dedicate a register to the stack frame pointer when a function doesn’t need one, freeing that register up for other uses. I’m only including it because Anandtech used it. From brief testing, Zen 4 had identical scores with and without the flag. Ampere Altra got a 1.2% higher integer score with -fomit-frame-pointer, which is negligible in my opinion.
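
If you’re curious what the native options actually resolve to on a given machine, GCC can dump its target settings. This is just a sanity check, not part of the run itself; the grep patterns only trim the output:

# x86-64
gcc -march=native -Q --help=target | grep -E 'march=|mtune='
# aarch64
gcc -mcpu=native -Q --help=target | grep -E 'mcpu=|mtune='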

I’ll be focusing on single threaded results by running a single copy of the rate tests. SPEC CPU2017 allows running multiple copies of the rate tests, or speed tests that use multiple threads to measure multithreaded performance. I think that’s a bridge too far because I’ve already sunk too much time into getting single threaded test results. Finally, I’m using bare metal Linux except in cloud environments where virtualization isn’t avoidable.
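
For anyone wanting to reproduce this, a single-copy rate run boils down to an invocation along these lines, where the config file name is a placeholder for whatever holds the flags above:

runcpu --config=gcc-o3.cfg --tune=base --copies=1 intrate fprate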

Initial Results

Results are marked “estimated” because they haven’t been submitted to SPEC’s site.

Linux didn’t support boost on the 3950X. Also, what’s Bulldozer doing here?

For reference, here are Anandtech’s results from their last article on the Ryzen 9 9950X:

At a glance, scores are reasonable. My runs on the Ryzen 9 9950X scored 8.6% and 11.7% higher than Anandtech’s in the integer and floating point suites, respectively. If I use my Ryzen 9 7950X3D’s non-VCache die as a proxy for the Ryzen 9 7950X, my runs scored 6.2% and 7.3% higher in integer and floating point. That difference can be partially explained by faster memory. Cheese set the Ryzen 9 9950X test system up with DDR5-6000, while Anandtech used DDR5-5600. On my system, I’m using DDR5-5600. Anandtech didn’t specify what they tested Zen 4 with. Compiler selection and flags may also have made a difference, as GCC is a more mature compiler.

The SPEC CPU2017 scores actually indicate how fast the tested system is compared to a 2.1 GHz UltraSPARC IV+ from 2006, with tests compiled by Oracle Developer Studio 12.5. Current consumer cores are more than 10x faster than the old UltraSPARC IV+. Even a density optimized core like Crestmont still posts a more than 5x performance advantage.
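
For context on how those scores are computed: each subtest’s ratio is the reference machine’s run time divided by the tested machine’s run time, and the suite score is the geometric mean of those ratios. A score of 10 therefore means roughly 10x the speed of that UltraSPARC IV+ system.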

Diving into Subtests

SPEC CPU2017’s floating point and integer suites are collections of workloads, each with different characteristics. I want to understand how each workload challenges the CPU’s pipeline, and what bottlenecks they face. First, let’s look at the scores.

Modern CPUs post huge gains over the UltraSPARC IV+ across the board, but 548.exchange2 is an outlier in the integer suite. Outliers get even more extreme in the floating point suite. Even the FX-8150, which was infamously bad at single threaded performance, obliterates the UltraSPARC IV+ in 503.bwaves.

To avoid cluttering graphs and limit time expenditure, I’ll focus on Redwood Cove and Zen 5 for performance counter analysis. They’re the most modern Intel and AMD architectures I have access to, and their high performance means runs can complete faster.

Within the integer suite, 548.exchange2, 525.x264, and 500.perlbench are very high IPC workloads. IPC is very low in 505.mcf and 520.omnetpp. Zen 5 and Redwood Cove have broadly similar IPC patterns across the test suite.

SPEC CPU2017’s floating point suite generally has higher IPC workloads. 538.imagick has very high IPC in particular. 549.fotonik3d sits at the other end, and shows low IPC on both Zen 5 and Redwood Cove.

Performance Counter Data

Top-down analysis tries to account for lost throughput in a CPU’s pipeline. It measures at the narrowest stage, where lost throughput can’t be recovered by bursty behavior later. 505.mcf and 520.omnetpp are the lowest IPC workloads in the integer suite, making them interesting targets for analysis.
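
The top-down figures here come from performance counters. As a rough idea of the tooling, recent Intel cores expose a top-level slot breakdown directly through Linux perf, while AMD’s equivalents go through its own PMU metric groups; a minimal sketch, assuming a recent perf build, a core with the necessary slots events, and a placeholder workload binary:

perf stat --topdown -- ./some_spec_workload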

On Zen 5, omnetpp is heavily limited by backend memory latency. 505.mcf’s situation is more complex. Frontend latency is the biggest limiting factor, but every other bottleneck category plays a role too.

Intel’s Redwood Cove is balanced differently than Zen 5. 520.omnetpp gets destroyed by backend memory latency, likely because Meteor Lake has a higher latency cache and memory hierarchy. In 505.mcf, Redwood Cove is also more memory bound. But Intel also loses far more throughput to bad speculation. That’s when the frontend sends the core down the wrong path, usually due to branch mispredicts.

Performance counters show Intel and AMD’s branch predictors have a hard time in 505.mcf. Technically 541.leela challenges the branch predictor even more, but only 16.47% of leela’s instruction stream consists of branches. 505.mcf is a branch nightmare, with branches accounting for 22.5% of its instruction stream. Normalizing for that with mispredicts per instruction shows just how nasty 505.mcf is.

xalancbmk has an even higher branch rate, but high prediction accuracy means its IPC ends up being ok-ish in the end.
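
Branch and mispredict rates like these are straightforward to reproduce with generic Linux perf events; a minimal sketch, again with a placeholder workload binary:

perf stat -e instructions,branches,branch-misses -- ./some_spec_workload
# branch MPKI = branch-misses / (instructions / 1000)
# mispredict rate = branch-misses / branches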

Mispredicted branches also incur frontend latency because the frontend has to clear out its queues and eat latency from whichever level of cache the correct branch target is at. Thankfully, 505.mcf’s instruction footprint is pretty tame. The test fits within the micro-op cache on both Zen 5 and Redwood Cove.

I find it interesting that Zen 5 suffers hard from both frontend latency and bandwidth even when using its fastest method of instruction delivery. Zooming back up, Intel and AMD’s latest micro-op caches are big enough to contain the majority of the instruction stream across SPEC CPU2017’s integer suite. Zen 5 does especially well, delivering over 90% of micro-ops from the micro-op cache even in more challenging workloads like leela and deepsjeng.

505.mcf

At this point, I decided 505.mcf was interesting enough to warrant further investigation. Redwood Cove’s frontend situation makes sense. High micro-op cache hitrate keeps frontend latency and frontend bandwidth bottlenecks under control, but mispredicts cause a lot of wasted work. Zen 5’s situation is more difficult to understand, because frontend latency plays a huge role even though the op cache is the lowest latency source of instruction delivery.

Instruction Cache fills from L2/Ki: 0.06
Instruction Cache fills from System/Ki (System = L3, DRAM, or another CCX): 0.08

L1 instruction cache misses aren’t a significant factor. High op cache hitrate already implies that, but it’s good to make sure. Because instruction-side cache miss latency shouldn’t be a factor, it’s time to look at branches. 505.mcf has a ton. Returns and indirect branches account for a minority of them.

As a percentage of retired (actually executed) instructions:
Branches: 22.5%
Taken branches: 12.5%
Returns: 2.1%
Indirect branches: 2%

Zen 5 can’t do zero-bubble branching for returns and indirect branches with changing targets. Those should have minor impact because they only account for a small percentage of the instruction stream. However, Zen 5 uses its indirect predictor more than twice for every indirect branch that shows up in the instruction stream.

Retired indirect branches per 1000 instructions: 20.72
Variable target predictions per 1000 instructions: 49.86
Variable target predictions per retired indirect branch: 2.4
Branch predictor pipe corrections (i.e. L2 predictor overrides) per 1000 instructions: 1.45
Decoder overrides per 1000 instructions: 0.12

Speculative events like how often the core uses its indirect predictor are expected to count higher than retired events (instructions committed at the end of the pipeline). Mispredicts will always cause extra wasted work. The branch predictor may predict an indirect branch, then have to do it again if a branch right before that was mispredicted. But a more than 2x count is suspiciously high. Add returns and direct branches that miss the L1 BTB, and you have about 72.8 branch predictor delays per 1000 instructions.

I suspect these delays cause Zen 5 to suffer more from frontend latency than Redwood Cove. But those delays may also mean Zen 5 doesn’t get as far ahead of a mispredicted branch before it discovers the mispredict. Redwood Cove loses less from frontend latency, but that simply means it sends more wasted work to the backend.

Redwood Cove also suffers more from backend memory-bound stalls in 505.mcf. Its L1 data cache sees a lot of misses, as does its L1 DTLB. Intel’s events count at retirement, and therefore exclude prefetches and accesses from wrong-path instructions that were later flushed out.

Misses per 1000 instructions (MPKI), by cache level:
L1 data cache: 25.44
L2 cache: 4.15
L3 cache: 1.97
L1 DTLB: 19.2
L2 TLB: 0.41

Most misses are caught by the second- and third-level caches, which incurs extra latency. However, DRAM latency is always a nasty thing, especially with LPDDR5. 1.97 L3 MPKI might not seem too bad, but DRAM takes so long to access that it ends up accounting for the bulk of memory-bound execution delays. Intel’s cores can track when all of their execution units were idle while there’s a pending cache miss.

An execution stall doesn’t necessarily impact performance because the execution stage is a very wide part of the pipeline. The core could momentarily have a lot of execution units busy once cache miss data shows up, minimizing or even completely hiding a prior stall. But I expect these memory bound stalls to correlate with how often a cache miss leads to a dispatch stall at the renamer, after which the core will never recover the lost throughput.

Memory Bandwidth Usage

Memory latency can be exacerbated by approaching bandwidth limits. That’s typically not an issue for a single core, and bandwidth usage across the integer suite is well under control. 505.mcf does have the highest bandwidth usage across the integer tests, but 8.77 GB/s is not high in absolute terms. So while Redwood Cove’s caches struggle a bit more with 505.mcf than in other tests, the cache hierarchy is still doing its job of avoiding memory bandwidth bottlenecks.
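
For scale, 8.77 GB/s works out to roughly 137 million 64-byte cache line transfers per second, well below what even a single core can pull from DRAM as the floating point results below show.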

Another observation is that write bandwidth often accounts for a small minority of bandwidth usage. Read bandwidth is more important.

Floating point tests often follow the same pattern, with low bandwidth usage and not a lot of writes. But a few subtests stand out with very high memory bandwidth usage, and more write bandwidth too.

549.fotonik3d and 503.bwaves are outliers. From both IPC data above and top-down analysis data below, 503.bwaves isn’t held back by memory bandwidth. IPC is high, and backend memory-bound stalls are only a minor factor. 549.fotonik3d however suffers heavily from backend memory stalls. It eats 21 GB/s of read bandwidth and 7.23 GB/s of write bandwidth. For perspective, I measured 23.95 GB/s of DRAM bandwidth from a single Redwood Cove core using a read-only pattern, and 38.92 GB/s using a read-modify-write pattern. 549.fotonik3d is heavily bound by memory latency, but memory latency in this case is high because the workload is approaching bandwidth limits for a single core.

Besides being more bandwidth heavy, SPEC CPU2017’s floating point tests tend to be more core bound than the integer ones. And as the higher IPC would imply, a larger share of pipeline slots ends up retiring useful work. Generally, the SPEC CPU2017 floating point tests make better use of core width than the integer ones, and shift emphasis towards execution latency and throughput. Zen 5 shows similar patterns, though AMD’s latest core still suffers a bit from frontend latency.

Micro-op cache hitrates are generally high, except in 507.cactuBSSN. It’s the only test across SPEC CPU2017 that sees Zen 5 drop below 90% micro-op cache hitrate. Redwood Cove gets hit harder, with micro-op cache hitrate dropping to 58.98%.

Still, both cores suffer very little in the way of frontend bandwidth or latency losses in 507.cactuBSSN. Missing the micro-op cache doesn’t seem to be a big deal for either core. Part of this could be because branches account for just 2.22% of cactuBSSN’s instruction stream and both cores enjoy over 99.9% branch prediction accuracy. That makes it easy for the branch predictor to run far ahead of instruction fetch, hiding cache miss latency. It’s a stark contrast to 505.mcf, and suggests having branchy code is far worse than having a large instruction footprint.

Backend Memory Footprint

Micro-op cache coverage is pretty good throughout much of SPEC CPU2017’s test suite. But backend memory latency and occasionally memory bandwidth play a role too. Caches help mitigate those issues, insulating cores from slow DRAM. I collected demand cache fill data on Zen 4 because the 7950X3D lets me check the impact of more L3 cache without changing other variables. “Demand” means an instruction initiated a memory access, as opposed to a prefetch. However, a demand access might be started by an instruction that was never retired, like one flushed out because of a branch mispredict. Intel in contrast reports load data sources at retirement, which excludes instructions fetched down the wrong path. Therefore, figures here aren’t comparable to Redwood Cove data above.
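
As a rough pointer to the counters involved, demand fill breakdowns on Zen come from PMU events along the lines of the ones below, exposed through Linux perf. The exact event and umask names vary by core generation and kernel version, so treat this as a sketch rather than the precise command used:

perf stat -e ls_dmnd_fills_from_sys.lcl_l2,ls_dmnd_fills_from_sys.int_cache,ls_dmnd_fills_from_sys.mem_io_local -- ./some_spec_workload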

L1D miss rate is surprisingly low across a large portion of SPEC CPU2017’s integer tests. 548.exchange2 is the most extreme case, fitting entirely within the L1 data cache. It also enjoys 98.89% micro-op cache hitrate on Zen 4, explaining its very high 4.1 IPC. Other tests that take a few L1D misses often see those misses satisfied from L2 or L3. 502.gcc, 505.mcf, and 520.omnetpp are exceptions, and challenge the cache hierarchy a bit more.

The same pattern persists across many floating point tests. Zen 4’s 32 KB L1D sometimes sees more misses, but again L2 and L3 are extremely effective in catching them. 549.fotonik3d and 554.roms are extreme exceptions with very high L3 miss rates. There’s nothing in between.

That likely explains why VCache does poorly in the floating point suite. 96 MB of L3 will have little effect if 32 MB of L3 already covers a workload’s data footprint. If the data footprint is too large, 96 MB might not be enough to dramatically increase hitrate. Then, higher clock speed from the non-VCache die may be enough to negate the IPC gain enabled by VCache.

Increased L3 hitrate in 549.fotonik3d and 554.roms produced a 17% IPC increase, which should have been enough to offset VCache’s clock speed loss. However, the lower score suggests the tested VCache core did not maintain its top clock speed during the test.

The integer suite is a little more friendly to VCache. 520.omnetpp is a great example, where a 96 MB L3 cache is nearly able to contain the workload’s data footprint. The resulting 52% IPC increase is more than enough to negate the non-VCache core’s clock speed advantage.
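
Since performance scales roughly with IPC multiplied by clock speed, a 52% IPC gain can absorb up to about a 34% clock deficit (1/1.52 ≈ 0.66) before it stops paying off, while the 17% gain in the floating point outliers only covers a deficit of around 15%.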

Still, many tests fit within L3 and therefore punish VCache’s lower clock speed. The total integer score sees VCache pull even with its non-VCache counterpart, which isn’t a good showing when the VCache die almost certainly costs more to make.

Final Words

Running SPEC CPU2017 was an interesting if time consuming exercise. I think there’s potential in running the test for future articles. The floating point and integer suites both have a diverse set of workloads, and examining subscores can hint at a CPU’s strengths and weaknesses. At a higher level, the integer and floating point tests draw a clear contrast where the former tends to be more latency bound, and the latter tends to be more throughput bound.

I have criticisms, though. Many of SPEC CPU2017’s workloads have small instruction footprints, meaning they’re almost entirely contained within the micro-op cache on recent CPUs. From the data side, many tests fit within a 32 MB L3 or even a 1 MB L2. The floating point suite’s tests are either extremely cache friendly or very bound by DRAM bandwidth and latency.

And of course, there’s no getting away from the time commitment involved. Having a run fail several hours in because -Ofast caused a segfault in a floating point subtest is quite a frustrating experience. Even if everything works out, getting the info I want out of SPEC CPU2017 can easily take several days on a fast system.

Going forward, I might include SPEC CPU2017 runs if the settings above work across a wide variety of systems. That’ll let me reuse results across articles, rather than having to do multiple runs per article. If I end up having to go down a rabbit hole with different compilers and flags, I’ll probably stick to benchmarks that are easier to run. And of course, I’ll continue poking at architecture details with microbenchmarks.

If you like our articles and journalism, and you want to support us in our endeavors, then consider heading over to our Patreon or our PayPal if you want to toss a few bucks our way. If you would like to talk with the Chips and Cheese staff and the people behind the scenes, then consider joining our Discord.

20 thoughts on “Running SPEC CPU2017 at Chips and Cheese?”

  1. I’d go with `-march=native` for x86.
    Don’t use `-mcpu` for x86: It’s a deprecated synonym for `-mtune`. https://gcc.gnu.org/onlinedocs/gcc/x86-Options.html
    If you don’t set `-march` then by default you’ll use an ancient x86 target (possibly from the pre-AVX/pre-Haswell era).

    FWIW, as of (relatively) recent GCC versions `-march=native` will map to `-mcpu=native` on AArch64; `-mcpu=native` has always been *the* way to go for AArch64, https://maskray.me/blog/2022-08-28-march-mcpu-mtune, https://community.arm.com/arm-community-blogs/b/tools-software-ides-blog/posts/compiler-flags-across-architectures-march-mtune-and-mcpu

    1. I may try that eventually if time permits. Right now GCC with the ancient x86 target seems to be outperforming clang hitting a modern target, so I doubt different targets will make a difference.

      1. If you have the time to watch an entire talk on this, I’d highly recommend “What GCC optimization level is best for you?” by Jan Hubicka (GCC developer) at SUSE Labs Conference 2022: https://www.youtube.com/watch?v=GufwdypTfrE
        If you don’t, then the summary slide (slide 29/36, around 39 minutes into the talk) as well as the “Common issues with optimization level settings” slide (35/36, around 48 minutes in) are still worth looking into 🙂

        Despite the title it also covers Clang/LLVM–and uses SPEC 2017 as one of the example benchmarks:

        > We discuss GCC optimization levels (-O1, -O2, -O3, -Ofast, -Og) and other key features code quality related features (architecture setting, CPU tuning, link-time optimization and profile feedback) .We show some data on performance and code size of GCC, Clang, Firefox and SPEC2k17.

        Worth noting that LTO (link-time optimization) and PGO (profile-guided optimization) make a major difference–also (which may not be commonly known) in the relative sense (as in: when an option for GCC is better and when it’s worse than Clang/LLVM–and which compiler is better overall–will differ depending whether you compile without any, with LTO, or with LTO+PGO). When targeting AMD Zen 3 Milan, they do get 12% for SPECfp2017 (but only 1% for SPECint2017) from the native tuning.

        1. Honestly I’d rather skip the software side and focus on the hardware side as much as possible. I want to discuss hardware, not benchmark compilers.

    2. While -march=native is better than -mtune=native, since it lets the autovectorizer use AVX (depending on the arch), it’s dangerous as it only works for targets that the compiler recognizes. Therefore it’s better to use option sets like x86-64-v4 if you don’t plan to update the compiler often to catch up with new microarchs.

      1. It will error out immediately, won’t it? So not a real “risk”.

        Using x86-64-v4 also has limitations, i.e. it doesn’t include the VNNI and newer additions to AVX512.

  2. Hi, are the branch MPKI and top-down analysis from the speed or rate version of the benchmark? Also, why not run the speed benchmarks, as they measure the performance of a single core?

    1. The speed benchmarks are allowed to use multi-threading so they do *not* measure the performance of a single core. (I’d link the relevant part of the SPEC documentation, but that causes my comment to get deleted!)

  3. >Many of SPEC CPU2017’s workloads have small instruction footprints, meaning they’re almost entirely contained within the micro-op cache on recent CPUs.

    To be fair, if you look at the benchmarks, even GCC is served 80-90% from the micro-op cache, and a compiler to me feels like a stereotypical case of a really large (active) code footprint. The only one that does worse is deepsjeng – which surprises me a bit.

    I’m curious what common workloads you think are worse. Maybe browsers having to execute newly JIT-ed JavaScript code from a page with 100 different ads on it?

    1. As always – it’d be nice to assemble and profile a set of “representative workloads” to answer the question about the code footprint. SPEC has been doing a good job in this regard.

    2. GCC is fine, but a lot of other SPEC CPU2017 workloads have >90% op cache hitrates. I cover a couple of workloads with lower op cache hitrate at https://chipsandcheese.com/2024/08/14/amds-ryzen-9950x-zen-5-on-desktop/, namely Linux kernel compilation and SVT-AV1 encoding.

      Another example not mentioned there is games, which often see under 90% op cache hitrate. I wasn’t able to test that on Zen 5 because I don’t physically have the system, but on Zen 4 it’s often in the low 80% range. GCC, funny enough, is the only test across the integer suite that ends up in the same range. The rest enjoy much higher op cache coverage, including deepsjeng (89.79%, which basically rounds to 90%).

      There’s one FP test, cactuBSSN, with low op cache hitrate. But it has predictable branches and not a lot of them, so the frontend eats it for lunch.

  4. Hello Clam Chowder,

    I am glad that you are finding the CPU 2017 benchmarks interesting – thanks for your article! As one of the developers of SPEC CPU, I could add a few notes:

    You wrote “DRAM takes so long to access”. Yes indeed. Even 20+ years ago, my friends in a compiler group were dealing with L1 cache latency of 1 or 2 cycles, L2 on the order of 10 cycles, and main memory more than 100 cycles away, leading them to refer to main memory as “fast disk”.

    You are correct that it can be difficult for SPEC to find benchmarks that exceed the available caches. With every release of SPEC CPU, there have been attempts to increase the size of the benchmarks, but on the other hand compiler developers are always trying to avoid that long trip to main memory (“fast disk”) and make better use of the caches.

    You are right that Blast Waves (503.bwaves) does particularly poorly on that ancient SPARC v490. In at least some of my tests (though apparently not in yours) I have seen it be sensitive to both main memory bandwidth and to compiler transformations that reduce its bandwidth needs. The SPARC v490 and the compiler that was used were limited in both regards.

    There are various hints about GCC usage in $SPEC/config/Example-gcc*.cfg which may be useful.

    “Having a run fail several hours in because -Ofast caused a segfault” – sorry you ran into that. One thing that might help: If you are testing “NewCompiler dash-dash-NewOptimization”, you might want to say something like

    runcpu --parallel_test=22 --parallel-test-workloads=ref intrate

    to do correctness testing of all the integer benchmarks at once. The above won’t give you valid performance results (or at least not valid for your scenario of single copy SPECrate) but it would at least tell you quickly whether there are any failures. After the machine goes idle, you can summarize any errors with:

    port_progress -q -a --table

    https://www.spec.org/cpu2017/Docs/utility.html#port_progress

    It is good that you call your results “Estimates” if you have not gone through all the tedium of the run rules (!) but just fyi it is NOT required that you submit your results to SPEC in order to remove the word “Estimate”. That is, SPEC CPU (unlike some other SPEC benchmarks) allows independent publication. See http://www.spec.org/fairuse.html for details.

    Lastly, note that SPEC CPU intentionally attempts to test the CPU chip, the memory hierarchy (including caches), and the compiler. That being said, there’s absolutely nothing wrong with choosing to hold one of these constant – for example, picking gcc dash-dash-my-favorite-opt-level and choosing not to budge.

    Again, I am very glad to see that you find SPEC CPU interesting.

    John Henning
    (One of the developers of SPEC CPU but NOT speaking on behalf of SPEC; my opinions are my own)

    1. Thanks for the comment, and I’ll keep the parallel trick in mind if I have to validate other compiler flags! Looking back, prefetchers could have helped Blast Waves/503.bwaves because I’m only looking at demand cache misses. Investigating that of course would be another rabbit hole.

      As far as run rules (https://www.spec.org/cpu2017/Docs/runrules.html#rule_2.3) I should be well within the rules with regards to compiler flags, one runcpu --reportable invocation for an entire suite, and using a single file system. Basically I’m trying to get as close to a compliant run as I can without the tedium of checking every last box. As a stretch goal I might go back and see if I can meet all the rules.
