Telum II at Hot Chips 2024: Mainframe with a Unique Caching Strategy

Mainframes still play a vital role in today, providing extremely high uptime and low latency for financial transactions. Telum II is IBM’s latest mainframe processor, and is designed unlike any other server CPU. It only has eight cores, but runs them at a very high 5.5 GHz and feeds them with 360 MB of on-chip cache. IBM also includes a DPU for accelerating IO, along with an on-board AI accelerator. Telum II is implemented on Samsung’s leading edge 5 nm process node.

IBM’s presentation has already been covered by other outlets. Therefore I’ll focus on what I feel like is Telum (II)’s most interesting features. DRAM latency and bandwidth limitations often mean good caching is critical to performance, and IBM has a often deployed interesting caching solutions. Telum II is no exception, carrying forward a virtual L3 and virtual L4 strategy from prior IBM chips.

Virtual L3

Telum II has ten 36 MB L2 on-chip, which is absolutely huge. For perspective, the L3 cache on AMD’s Zen 3 desktop and server CPUs is typically 32 MB. Eight of Telum II’s L2 caches areare attached to cores, another is attached to the DPU, and a final tenth one isn’t attached to anything. Another comparison is Qualcomm’s Oryon core in the Snapdragon X Elite, which has a 12 MB L2 cache with 5.29 ns of latency. Qualcomm considers a tightly coupled high capacity cache. Telum II has 3.6 ns of L2 latency, and more capacity to boot.

Qualcomm’s Oryon has 5.28 ns of L2 latency

These giant L2s are great for cutting down memory access latency, but put Telum II in a funny position. Modern CPUs typically have a large cache shared across multiple cores. A shared cache can give a single thread more capacity in low-threaded loads, and reduce duplication for shared data in multithreaded loads. But eight cores in Telum II would already use 288 MB of SRAM capacity, along with the associated area and power cost. An even larger L3 would be expensive even for a specialized mainframe chip.

IBM’s solution is to reduce data duplication within its caches, and re-use the chip’s massive L2 capacity to create a virtual L3 cache. From IBM’s patent, each L2 has a “Saturation Metric” based on how often its core brings data into it (to satisfy misses). When a L2 kicks out a cache line to make room for incoming data, that evicted line goes another L2 with a lower Saturation Metric. That way, a core already pushing the limits of its own L2 capacity can keep that capacity to itself. The L2 slice without an attached core always has the lowest possible Saturation Metric, making it a preferred destination for lines evicted out of L2. If another L2 already has a copy if the evicted line, Telum II gives that L2 ownership of the line instead of putting it into virtual L3, also reducing data duplication.

Green = L2, Red = Virtual L3 (Lines evicted from L2 persisted in other L2s)

Tracking Saturation Metrics isn’t the only way IBM keeps the virtual L3 from monopolizing L2 capacity. IBM’s patent also mentions inserting virtual L3 lines into intermediate LRU positions. Replacement strategy is how a cache chooses a line to kick out when a new line is brought in. LRU stands for Least Recently Used, and is a replacement policy where the least recently used line gets kicked out. Normally, a newly inserted line gets put into the MRU (most recently used) position, putting it last in line to get replaced. But virtual L3 lines can get put into intermediate positions, giving priority to L2 lines brought in by the core.

Using intermediate positions also lets IBM control how much L2 capacity the virtual L3 can use. For example, putting virtual L3 fills halfway between the LRU and MRU positions would limit the virtual L3 to using half of L2 capacity. It’s easy to see how IBM can use this to elegantly adapt to changing application demands. If one core goes idle, its L2 could start inserting virtual L3 fills at the LRU position, letting the virtual L3 use all available capacity in that slice.

Besides making L2 capacity doesn’t get squished out, the virtual L3 faces another challenge. A line in the virtual L3 could be in any of Telum II’s ten L2 slices. In AMD, Arm, and Intel’s L3 caches, an address always goes to the same slice. A core can look at a subset of an address’s bits, and knows exactly which L3 slice to send a request to. That’s not the case for IBM. Apparently IBM handles this by potentially checking all L2 slices for a virtual L3 access. I asked if they were worried about overhead from more tag comparisons, but they said it wasn’t an issue because L2 miss rate was low. It’s important to remember that engineers try to keep a CPU design balanced, rather than pouring resources into making sure it’s great at every last thing. A Telum II core can use up to 36 MB of L2 capacity. Other factors being equal, it should miss L2 less often than a Zen 5 core misses in its 32 MB L3. A L3 access on Zen 5 may cost less power, but AMD needs that to be the case because Zen 5 cores have an order of magnitude less L2 capacity than Telum II’s cores.

Why Stop at L3?

IBM’s mainframe chips aren’t deployed in isolation. Up to 32 Telum II processors can be connected to form a large shared memory system. Again, IBM is looking at a lot of cache capacity, this time across a system instead of a single Telum II chip. And again, they applied the same virtual cache concept. Telum II creates a 2.8 GB virtual L4 by sending L3 victims to other Telum II chips with spare cache capacity.

A CPC (Central Processor Complex) Drawer on the prior Z16/Telum

I’m not sure how IBM implemented the virtual L4, but prior IBM mainframe designs might provide some clues. IBM organizes mainframe CPUs into CPC drawers, which somewhat resemble a rack-mount servers. A CPC can’t operate independently like a server because it doesn’t include non-volatile storage or a power supply unit, but it does place a set of CPUs and DRAM in close physical proximity. Prior IBM designs had a drawer-wide L4, and Telum II might do the same. 2.8 GB of L4 aligns with on-chip cache capacity across eight Telum II chips. If a drawer has eight Telum II dies, just as a Z16 drawer had eight Telum dies, then IBM could be persisting L3 victims across a drawer.

Go back another generation, and z15 has a System Controller (SC) chip with a huge 960 MB L4 shared across a drawer

IBM claims 48.5 ns of latency for a virtual L4 access. Achieving that across drawer boundaries would be quite a challenge. As ambitious as Telum II is, I think that would be a bridge too far. On the latency subject, IBM’s likely betting that L3 misses are rare enough that 48.5 ns of L4 latency isn’t a big deal. IBM also deserves a lot of credit for getting latency to be so low when crossing die boundaries. For perspective, Nvidia’s Grace superchip has over 42 ns of L3 latency with a monolithic die.

Looking Back at Telum

When talking about Telum II’s caching strategy, it’s impossible to ignore the original Telum/Z16. That’s where IBM brought virtual L3 and L4 caches into their mainframe line. Telum was presented at last year’s Hot Chips 2023. Since then, IBM has published more info on how Telum’s virtual L3/L4 setup worked. It’s a fairly simple scheme where each 32 MB L2 slice is split into two 16 MB segments. One segments can become part of the virtual L3/L4 if a core doesn’t need all of its L2 capacity. If the core is idle, all 32 MB can go towards the virtual L3/L4.

IBM’s technical overview furthermore clarifies that the virtual L4 is implemented across a drawer. A Z16 drawer has eight Telum chips, each with 256 MB of L2 capacity. A single threaded workload running on a drawer gets its own 32 MB L2. All the other on-die L2 caches become that thread’s 224 MB L3, and other L2 caches within the drawer combine to form a 1.75 GB virtual L4.

Telum II takes that strategy further, with higher cache capacity at L2, virtual L3, and virtual L4. When I asked how Telum II handled its virtual L3 after Cheese did his interview, they mentioned congruence classes. I had no clue what a congruence class was. IBM after all uses different terminology that I wasn’t familiar with, and it took quite some searching to figure out that a congruence class was basically a cache set.

The Thing	IBM’s Term	Typical Term
Memory addresses that directly correspond to locations on DRAM chips or memory-mapped IO	Real Address	Physical Address
Memory addresses used by programs, translated via page tables to real/physical addresses	Effective Address	Virtual Address
Line brought into a cache	Cast-in	Fill
Line kicked out of a cache	Cast-out	Eviction
A group of positions in a cache that an address can be cached in	Congruence Class	Set

Funny enough, I asked why IBM had different terminology compared to the rest of the industry. They said IBM came up with the terminology first, and later on the industry adopted different terminology. It was all lighthearted and funny. But it’s also a reminder that IBM was a pioneering company in developing high performance CPUs.

Final Words

CPU designers have to strike a balance between single and multithreaded performance. Server CPUs tend to focus on the latter, while client designs do the opposite. Telum II and prior IBM mainframe chips handle server tasks like financial transactions, but curiously seem to prioritize single threaded performance. IBM has actually decreased core counts per die from 12 in Z15 to 8 in Telum and Telum II. The focus on single threaded performance is very visible in IBM’s caching strategy as well. A single thread running on Telum II can enjoy client-like L2 and L3 access latencies, but with an order of magnitude more capacity at each level.

Rendering of a Telum II chip from IBM’s site

I wonder if IBM’s strategies can be applicable to client designs. AMD’s highest end CPUs notably have a lot of L3 cache capacity on the chip. For example, the Ryzen 9 9950X has 32 MB of L3 and 8 MB of L2 per CCD. Both CCDs together would give 64 MB of L3 and 16 MB of L2, for 80 MB of total caching capacity. If all that could be used for a single thread, it’s not far off from the 96 MB of L3 cache that a VCache part gets. A hypothetical dual CCD part with VCache on both dies could have over 200 MB of virtual L3 capacity.

Of course, pulling that off isn’t so straightforward. IBM’s mainframe chips use very fast cross-die interconnects. A Zen 5 CCD’s Infinity Fabric link only provides 64 GB/s and 32 GB/s of read and write bandwidth, respectively. IBM’s prior generation Telum already has twice as much bandwidth between dual-chip modules, and even more than that between Telum chips located on the same module.

Perhaps AMD calculated that consumers weren’t willing the pay the premium for advanced packaging technologies in client CPUs. But I do wonder what would happen if AMD could be the Nvidia of the GPU market, where enthusiasts don’t object to paying over $2000 for a top end SKU. Games tend to be cache sensitive, as VCache has shown. Perhaps such a part with a CoWoS-R RDL interposer or a similar packaging technology and a virtual L3 could push gaming performance even further.

I’d like to thank IBM for giving an excellent presentation at Hot Chips 2024, as well as discussing parts of their architecture in side conversations.

If you like our articles and journalism, and you want to support us in our endeavors, then consider heading over to our Patreon or our PayPal if you want to toss a few bucks our way. If you would like to talk with the Chips and Cheese staff and the people behind the scenes, then consider joining our Discord.

Author

clamchowder

View all posts

13 thoughts on “Telum II at Hot Chips 2024: Mainframe with a Unique Caching Strategy”

Dolda2000 says:

September 8, 2024 at 12:55 pm

In the images, many of the L2 slices look slightly different from one another. Any idea what’s up with that?

1. Sean Kurtz says:
  
  September 9, 2024 at 10:46 am
  
  I believe this was addressed in the associated interview around ~7:50. The lopsided slice is additional cache for the virtual L3 and intended to stay colder than the directly attached to a core cache slices (based on the Saturation Metric).
  
  It’s like the “first line” if the virtual L3 before it starts putting lines into another cores L2.
  
  1. Sean Kurtz says:
    
    September 9, 2024 at 6:40 pm
    
    Rephrasing my last sentence…it’s like the “first priority” slice in the virtual L3 and will take on lines headed to L3 before they go to another (attached to core) L2 cache block, based on the saturation metric (as least until that slice warms up).
    
Dolda2000 says:

September 8, 2024 at 12:58 pm

Also, any word on die sizes? Curious how the cache density compares to other designs.

1. eastcoastpete says:
  
  September 8, 2024 at 5:27 pm
  
  Somewhat building on the question about die size, the die shot provided by IBM seems to show 10 cores, yet Telum II only uses eight per CPU. Is that to get to a somewhat reasonable yield, i.e. two cores can be subpar, and they are just disabled? Since Telum II has reportedly at least 33 billion transistors, even a “mature” process* will have a significant number of defects, especially in such a large monolith, so this could be one way of getting still reasonable yields.
  If you have a chance for a follow up interview with IBM, I would be curious to hear why they didn’t chose to move to a tile/CCD type approach. Were they concerned about the interconnect causing losses in speed?
  
  * Some would argue that Samsung VLSI’s 5 nm node is not as low in defect density as, for example, TSMC’s 5 nm nodes are.
  
  1. Sean Kurtz says:
    
    September 9, 2024 at 6:44 pm
    
    Looks like one of those cores is missing, and one is actually the DPU. So it doesn’t look like they are disabling any or anything like that. It seems like it was designed for 8 cores.
    
    Interesting conversation about yields though and maybe that has something to do with it.
    
  2. Sean Kurtz says:
    
    September 9, 2024 at 6:49 pm
    
    I believe the original Telum was actually a dual chip module. Based on that (and the mention in this article of potentially 8 CPUs being included per drawer…leads me to believe that the Telum II may be the same and also shipped as a dual chip module.
    
2. Pierce Wiederecht says:
  
  September 8, 2024 at 7:49 pm
  
  The cache capacity will be based on Samsung’s 5nm sram cell size and basically nothing else.
  
3. Jz says:
  
  September 9, 2024 at 7:07 am
  
  Isn’t it just 43b and 600mm²?
  
Sean Kurtz says:

September 8, 2024 at 5:29 pm

These Mainframe chips are always some of the most interesting.

I think the focus on single threaded performance comes from Mainframes unique position in the market…they tend to be used by single organizations (not split up and rented out to multiple customers like in a cloud situation), and those customers tend to be clustered in industries like finance where parallelization of a lot of the code still tends to be uniquely challenging. Not to mention, these orgs tend to be old and have A LOT of code, so even if it were possible it would be massively expensive to rewrite to take advantage of larger core counts…it may very well be cheaper to pay IBMs large premium and eek out every last little bit of single threaded performance possible.

For an example of the kinds of workloads that run on Mainframes typically:

Think transaction processing of an enormous batch of transactions from a banks daily operations.

There is a very low tolerance for any failure in that sort of workload, and it’s inherently transactional and cannot be easily “sharded”.

So in summary, I think that’s why the unusual focus on cache and single threaded performance for these chips. I’d also be interested to see similar technoques in client designs, but you are right that the cost might be obscene.

1. Pierce Wiederecht says:
  
  September 8, 2024 at 7:52 pm
  
  I’d imagine that in finance there’s just a lot explicitly sequential code and algorithms based simply on the nature of the business.
  
  1. Sean Kurtz says:
    
    September 9, 2024 at 8:22 am
    
    Definitely. Each transaction might implicitly depend on the completion of each prior one running to completion.
    
    I.e. imagine you have debits/credits between two accounts that causes one to go to at or near zero balance. Then the transaction concludes. Now, imagine there is a similar transaction halfway across the world where one of the ends of the first transaction is involved again and it tries to move money from the account that’s near zero.
    
    There isn’t an east way to reorder those in a non-time based purely sequential manner that would remain logically consistent. Which is what you would need to do to parallelize and “split the transactions up” into buckets. There are techniques but they aren’t foolproof. You’d need to do massive amounts of preprocessing to determine dependencies…in which case…might as well just run the transactions.
    
    So in some sense a lot of these large batch workloads are inherently sequential and the algorithms are very sequential.
    
    “Eventual consistency” and other typical web techniques for horizontal scaling just don’t work as easily for those sorts of business processes.
    
    1. Joe Temple says:
      
      September 10, 2024 at 1:58 pm
      
      Yes this is the crux of it. The resulting serialization reduces the ability to scale and also increases the value of more cache. You will find that the mainframe has always traded off core count for more cache since at least the early 90’s. You will also find that in the very early CMOS mainframes cycle time was treaded off as well.
      
      A patent entitled Translation Lookahead Based Cache Access fixed the cycle time problem. (US5148538A), but the “Nest intense” nature of mainframe loads still drives the CPU design toward fewer faster cores. IBM achieves this with high clock rates and lots of cache.
      
      This drives the high cost of mainframes: “Processors are cheap, Bandwidth is expensive” was true in the 1980s and 1990s; it is still true today. It is also why mainframe emulation on less expensive conputers has always failed.