(Tech Babble #18): Why Ampere (RTX 30 series) "doesn't perform as high" with so many "CUDA cores".
Oh my god, A Tech Babble! What is this?! Are you feeling OK, Ash? Yes. Shut up while I type!
This babble will be based on my understanding of NVIDIA's 30 series, using the 'Ampere' architecture. It will contain knowledge that I have learned myself by reading technical details, thinking, and talking to people like Nemez (GPUsAreMagic on Twitter; no, I won't link Twitter on my website) who are extremely knowledgeable about this subject. A special nod and thanks to Nemez, because they answered my questions on the subject and I learned a good portion of my understanding from them. Aight, so without further ado...
Basic idea behind the Babble.
I often hear people saying how Ampere has regressed in "IPC" and how the CUDA cores are "weaker" than previous architectures (Turing, 20-series, and even Pascal, 10-series), and it bothers me that they use this assumption to insult the architecture and the engineers that designed it as "bad", or whatever. Since at first, I was also quite confused about how Ampere works with the huge number of advertised CUDA cores, I did some reading, asked some questions, thought a lot about it, and here we go: a Tech Babble. As is often the case, I will break down the Babble into subheadings addressing specific points, or ideas that I want to address and/or put forward.
Ampere's SM is actually about higher utilisation, not lower.
Wrap your head around this - the argument is often put forward that Ampere's huge number of advertised CUDA cores are underutilised, and well... from a completely top-level perspective, they are. I mean, from such a perspective, you can say that GA102 isn't using anywhere near all of the 10752 "CUDA Cores" at a given time, but also - it wasn't designed to.
That's where the low-down on the Ampere SM comes in. So, let's have a look at some cool diagrams of Turing, Ampere's direct predecessor and the architecture it is based on, and then Ampere itself:
Aight, so aside from the other differences like the Tensor blocks, I want to highlight the math execution pipelines, which is where the "CUDA cores" are, as they are the focus of this Babble.
Okay, you probably already know about Turing and the 20 series, and how "Turing Shaders" were advertised with the ability to co-execute Floating Point and Integer math, right? Well, you can see how that happens up there in that diagram. The important thing to note here is that Turing's SM actually has 128 shader processors, not the 64 that NVIDIA advertises as "CUDA Cores". You are also probably aware that one of the changes in the Turing SM from Pascal is that the number of "advertised CUDA cores" (a good name, since it's marketing) went from 128/SM to 64/SM, but Turing just had quite a lot more SMs (30 on GP102, 72 on TU102), yes?
Well, both Turing and Pascal have essentially the same number of actual math pipelines in the SM; the difference is how they're wired. On Pascal, the 128 math pipelines ("CUDA cores") can do either Integer operations or Floating Point. That means all 128 can work with numbers that have a decimal place or, instead, with whole numbers (without tracking the decimal place). Both those formats are useful, but the Pascal SM does either/or, so you're either doing 128 Float or 128 Int with the 128 total lanes ("CUDA cores!").
Turing changes the SM wiring by splitting those 128 lanes up into two groups of 64, and making it so each group does a specific format. So you have 128 lanes, but only 64 can do Float, and only 64 can do Int. The big point here is that you can do 64 Float and 64 Int at the same time, concurrently, which is where you get this "co-execution" business from. And it's very likely that the 128 lanes on Turing use fewer transistors than the 128 lanes on Pascal, because not all of the lanes have to account for the decimal places (please note this is a theory), and since Turing has way more SMs, performance goes up overall. This split approach is also better for compute and allows Turing to work with native 32-bit integers, which Pascal can't do entirely at full speed.
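To make the wiring difference a bit more concrete, here is a minimal Python sketch of the two SMs as described above. It is a toy model and not how the hardware schedules anything: the function names are made up for illustration, it assumes work packs perfectly onto the lanes, and it ignores dependencies, warps and everything else that happens in a real SM.

```python
import math

def clocks_pascal(float_ops, int_ops, lanes=128):
    """Pascal-style SM (toy model): all 128 lanes can take Float or Int,
    so both kinds of work compete for the same pool of lanes."""
    return math.ceil((float_ops + int_ops) / lanes)

def clocks_turing(float_ops, int_ops, float_lanes=64, int_lanes=64):
    """Turing-style SM (toy model): 64 Float-only lanes and 64 Int-only
    lanes that run concurrently, so the slower of the two sets the pace."""
    return max(math.ceil(float_ops / float_lanes),
               math.ceil(int_ops / int_lanes))

# (float ops, int ops) waiting on a single SM
for f, i in [(128, 0), (64, 64), (128, 64)]:
    print(f"{f} float + {i} int: "
          f"Pascal {clocks_pascal(f, i)} clk, Turing {clocks_turing(f, i)} clk")
```

In this toy model a single Turing SM never beats a single Pascal SM, and that is kind of the point: the argument above is that the split lanes are probably cheaper to build, so you get a lot more SMs.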
Advertised "CUDA Cores" and "Actual math lanes".
Some of the confusion (and/or excitement) over Ampere's huge number of "Advertised CUDA cores" comes from the fact that NVIDIA decided to only call FP32-capable math units "CUDA cores", and even then didn't set any other criteria for what qualifies as a "CUDA Core", such as "all of them can be used independently". I say that because when you look at Turing, technically it has 128 dispatch ports for 128 math lanes per SM, exactly like Pascal, but NVIDIA is only advertising half of them now.
Yes, that means the RTX 2080 Ti actually has 8704 "Stream processors" (128/SM), but only 50% of that, 4352, is advertised, because only that many can do FP32, which NVIDIA decided is the sole criterion for a "CUDA Core".
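The arithmetic behind those numbers can be written out as a quick Python sketch. The per-SM figures and the GP102/TU102 SM counts are the ones used in this Babble; the 2080 Ti's SM count is simply derived from its advertised 4352 cores at 64 per SM, and the helper name is made up for illustration.

```python
def lane_counts(sm_count, lanes_per_sm, fp32_lanes_per_sm):
    """Return (total math lanes, advertised 'CUDA cores') for a GPU,
    where only FP32-capable lanes count as 'CUDA cores'."""
    return sm_count * lanes_per_sm, sm_count * fp32_lanes_per_sm

# RTX 2080 Ti: SM count derived from the advertised figure (4352 / 64 per SM)
rtx_2080_ti_sms = 4352 // 64

gpus = {
    "GP102 (Pascal, full die)": (30, 128, 128),   # every lane can do FP32
    "TU102 (Turing, full die)": (72, 128, 64),    # half the lanes are Int-only
    "RTX 2080 Ti (Turing)": (rtx_2080_ti_sms, 128, 64),
}

for name, (sms, lanes, fp32) in gpus.items():
    total, advertised = lane_counts(sms, lanes, fp32)
    print(f"{name}: {sms} SMs, {total} math lanes, {advertised} advertised")
```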
So what about Ampere?
This is where it gets confusing (at least at first) because to start this off I have to say straight off the bat, Ampere also has 128 dispatch ports per SM. So, technically it should also have 128 Shader processors per SM, just like Pascal and Turing, right? RIGHT?
No, because this is where some design choices in wiring the circuit come into play, and how NVIDIA can essentially double the advertised "CUDA Cores" per SM vs Turing, without actually having more math dispatch ports.
Aight, so the big difference here is that the second (on the right!) bunch of math pipes now has the ability to also do Floating point, like the first, or Integer. That's the important bit we have to understand when "measuring" Ampere's performance expectations from that frankly huge bloated number of "CUDA Cores" NVIDIA put on the spec sheets.
The way to look at this is that Ampere actually has 192 Shader Processors / math lanes per SM, 128 of which can do Float and 64 of which can do Integer. That's 64 more than Pascal and Turing, but the catch is that the extra 64 Float lanes share dispatch ports with the existing 64 Integer lanes from Turing's design. What does that mean? It means those extra 64 Float lanes (you know, the ones that give Ampere the 2X "CUDA Core" count on the spec sheets) can only be used some of the time. That is not a bottleneck or an architectural "issue"; it's by design from the start, which brings me to the next point in this Babble.
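Here is a small Python sketch of what "sharing dispatch ports" means for a single clock, under the description above. It is a toy model with made-up names: it assumes the shared group of 64 ports can be split freely between Int and the extra Float lanes within one clock, and that Int gets first pick of those ports. Neither assumption comes from NVIDIA; they are just the simplest way to show the idea.

```python
def ampere_issue(float_ready, int_ready, ports_per_group=64):
    """One clock on an Ampere-style SM (toy model).

    Group A: 64 dispatch ports feeding Float-only lanes.
    Group B: 64 dispatch ports shared by 64 Int lanes and 64 extra Float lanes.
    Returns (float ops issued, int ops issued) for this clock.
    """
    # Assumption: Int work claims the shared ports first.
    int_issued = min(int_ready, ports_per_group)
    # Float fills group A, then whatever is left of group B.
    float_a = min(float_ready, ports_per_group)
    float_b = min(float_ready - float_a, ports_per_group - int_issued)
    return float_a + float_b, int_issued

# 192 physical lanes, but never more than 128 busy in one clock
print(ampere_issue(128, 0))    # no Int around: both groups do Float
print(ampere_issue(128, 64))   # Int fills the shared ports: same as Turing
print(ampere_issue(128, 32))   # partial Int: the rest of group B does Float
```

However the work is split, the two numbers never add up to more than 128, which is why counting 128 "CUDA cores" per SM and expecting all of them to be busy alongside Int work doesn't line up.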
About that "Ampere is for higher utilisation" point I made earlier.
So I digressed quite a bit, but whatever. Anyway, the point I am trying to make here is that, from the math dispatch perspective (Pascal, Turing and Ampere all have 128 math ports/SM), the extra "CUDA Cores" on Ampere are actually there to improve utilisation of the 64 ports that would otherwise sit unused when the SM is done working on Integer, or when there are no Integer operations to complete at a given time.
You'll probably be aware of the graph that NVIDIA made when they compared the Pascal-based GTX 1060 to the Turing-based GTX 1660 Ti:
What I want to highlight about this graph is that it appears to be from the GPU top level and not the SM level, so it's just saying that for every 100 instructions going into the GPU (not the SM), Float and Integer can be co-issued like that, reducing the time needed to complete the work, which is now only 62 slots on Turing.
Thing is, see that bit at the back of the 1660 Ti's instruction slots, where only Float is being performed? That's where Ampere's upgrade comes in. Now, it's not applicable to the visualisation here (i.e. the extra Float lanes aren't always filling in the back of the workload, more on that in a sec), but Ampere's design essentially allows Turing's design to double up on Float operations when there aren't Integer instructions being issued. But as you can probably tell, that Float-only part isn't even half of the number of instructions being worked on.
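As a rough back-of-the-envelope version of that, here is a Python sketch that counts instruction slots for the same 100-instruction workload on a Turing-style and an Ampere-style design. The 62 Float / 38 Int split is my assumption, picked because it reproduces the 62-slot Turing figure mentioned above. The function names are made up, and the model is an idealised best case that ignores dependencies and scheduling, so treat the Ampere number as a lower bound rather than a measurement.

```python
import math

def slots_turing(float_instr, int_instr):
    """Turing-style: one Float path and one Int path running side by side,
    so the longer of the two queues decides how many slots are needed."""
    return max(float_instr, int_instr)

def slots_ampere(float_instr, int_instr):
    """Ampere-style: one Float-only path plus one path that does Float or Int.
    Int can only use the second path; total work is spread over two paths."""
    return max(int_instr, math.ceil((float_instr + int_instr) / 2))

FLOAT_INSTR, INT_INSTR = 62, 38   # assumed split of the 100 instructions

turing = slots_turing(FLOAT_INSTR, INT_INSTR)
ampere = slots_ampere(FLOAT_INSTR, INT_INSTR)
float_only_tail = FLOAT_INSTR - INT_INSTR   # slots where Turing's Int path idles

print(f"Turing: {turing} slots, Ampere (best case): {ampere} slots")
print(f"Float-only tail on Turing: {float_only_tail} of {turing} slots")
```

With these assumed numbers, only the Float-only tail gets doubled up, so the doubled "CUDA core" count buys a lot less than 2X here.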
After having an awesome conversation with Nemez (to whom the credit for these diagram edits goes!), they came up with something like this that highlights what Ampere's extra Float units are actually doing:
Nemez edited this (credit to GPUsAreMagic on Twitter!) to show me how this more or less works. So, you can see that the Ampere SM has that second floating point execution path at the bottom, which takes the remaining Float workloads that would normally be done with 5 instructions on Turing. Even though both SMs have 128 lanes, only 64 on Turing can do Float, so you end up with 50% of the ports doing nothing when it's just