Oh my god, A Tech Babble! What is this?! Are you feeling OK, Ash? Yes. Shut up while I type!
This babble will be based on my understanding of NVIDIA's 30 series, using the 'Ampere' architecture. It will contain knowledge that I have learned myself by reading technical details, thinking, and talking to people who are extremely knowledgeable about this subject, like Nemez (GPUsAreMagic on Twitter; no, I won't link Twitter on my website). A special nod and thanks to Nemez, because they answered my questions on the subject and I learned a good portion of my understanding from them. Aight, so without further ado...
Basic idea behind the Babble.
I often hear people saying how Ampere has regressed in "IPC" and how the CUDA cores are "weaker" than previous architectures (Turing, 20-series, and even Pascal, 10-series), and it bothers me that they use this assumption to insult the architecture and the engineers that designed it as "bad", or whatever. Since at first, I was also quite confused about how Ampere works with the huge number of advertised CUDA cores, I did some reading, asked some questions, thought a lot about it, and here we go: a Tech Babble. As is often the case, I will break down the Babble into subheadings addressing specific points, or ideas that I want to address and/or put forward.
Ampere's SM is actually about higher utilisation, not lower.
Wrap your head around this - the argument is often put forward that Ampere's huge number of advertised CUDA cores are underutilised, and well... from a completely top-level perspective, they are. I mean, from such a perspective, you can say that GA102 isn't using anywhere near all of the 10752 "CUDA Cores" at a given time, but also - it wasn't designed to.
That's where the low-down on the Ampere SM comes in. So, let's have a look at some cool diagrams of Turing, Ampere's direct predecessor and the architecture it is based on, and then of Ampere itself:
Aight, so aside from the other differences like the Tensor blocks, I want to highlight the math execution pipelines, which is where the "CUDA cores" are, as they are the focus of this Babble.
Okay, you probably already know about Turing and the 20 series, and how "Turing Shaders" were advertised with the ability to co-execute Floating Point and Integer math, right? Well, you can see how that happens up there in that diagram. The important thing to note here is that Turing's SM actually has 128 shader processors, not the 64 that NVIDIA advertises as "CUDA Cores". You are probably also aware that one of the changes in the Turing SM from Pascal is that the number of "advertised CUDA cores" (a good name, since it's marketing) went from 128/SM to 64/SM, while Turing just had quite a lot more SMs (30 on GP102, 72 on TU102), yes?
Well, both Turing and Pascal have essentially the same number of actual math pipelines in the SM; the difference is how they're wired. On Pascal, the 128 math pipelines ("CUDA cores") can do either Integer operations or Floating Point. That means all 128 can work with numbers that have a decimal place or, instead, with whole numbers without tracking the decimal place. Both formats are useful, but the Pascal SM can only do one or the other at a time, so you're either doing 128 Float or 128 Int with the 128 total lanes ("CUDA cores!").
Turing changes the SM wiring by splitting those 128 lanes up into two groups of 64, and making it so each group does a specific format. So you have 128 lanes, but only 64 can do Float, and only 64 can do Int. The big point here is that you can do 64 float and 64 int at the same time, concurrently, which is where you get this "co-execution" business from. And it's very likely that the 128 lanes on Turing use fewer transistors than the 128 lanes on Pascal, because not all of the lanes have to account for the decimal places (please note this is a theory), and since Turing has way more SMs, performance goes up overall. This split approach is also better for compute and allows Turing to work with native 32-bit integers, which Pascal can't do entirely at full speed.
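To make the wiring difference a bit more concrete, here is a minimal Python sketch of a toy per-SM model. It is my own simplification and not anything from NVIDIA: the function names `pascal_cycles` and `turing_cycles` are made up, work is counted in single lane operations, and scheduling, latency and memory are ignored entirely.

```python
import math


def pascal_cycles(fp_ops: int, int_ops: int) -> int:
    """Toy model of one Pascal SM: 128 lanes that do float OR int in a cycle."""
    # All 128 lanes work on float for some cycles, then on int for the rest.
    return math.ceil(fp_ops / 128) + math.ceil(int_ops / 128)


def turing_cycles(fp_ops: int, int_ops: int) -> int:
    """Toy model of one Turing SM: 64 float lanes and 64 int lanes, side by side."""
    # Both groups run concurrently, so the slower group sets the total time.
    return max(math.ceil(fp_ops / 64), math.ceil(int_ops / 64))


# An even float/int mix versus a float-only workload of the same total size.
mixed = (pascal_cycles(6400, 6400), turing_cycles(6400, 6400))
float_only = (pascal_cycles(12800, 0), turing_cycles(12800, 0))
```

In this toy model the even mix takes the same time on both SMs, while the float-only case leaves Turing's 64 integer lanes with nothing to do, which is exactly the gap the rest of this Babble is about.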
Advertised "CUDA Cores" and "Actual math lanes".
Some of the confusion (and/or excitement) over Ampere's huge number of "Advertised CUDA cores" comes from the fact that NVIDIA decided to only call FP32-capable math units "CUDA cores", and even then didn't set any other criteria for what qualifies as a "CUDA Core", such as "all of them can be used independently". I say that because when you look at Turing, technically it has 128 dispatch ports for 128 math lanes per SM, exactly like Pascal, but NVIDIA is only advertising half of them now.
Yes, that means RTX 2080 Ti actually has 8704 "Stream processors" (128/SM), but you're only advertised 50% of that, at 4352, because only that many can do FP32, which NVIDIA decided is the sole criterion for a "CUDA Core".
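The arithmetic behind those two numbers is just SM count times lanes per SM. A quick sketch, where the helper name `lane_counts` is made up for illustration and the 68 SM figure is the RTX 2080 Ti's SM count:

```python
def lane_counts(sm_count: int, lanes_per_sm: int = 128, fp32_lanes_per_sm: int = 64) -> dict:
    """Return total math lanes and advertised (FP32-capable) lanes for a Turing-style GPU."""
    return {
        "math_lanes": sm_count * lanes_per_sm,                  # float lanes + int lanes
        "advertised_cuda_cores": sm_count * fp32_lanes_per_sm,  # FP32-capable lanes only
    }


# RTX 2080 Ti: 68 SMs, each with 64 float lanes and 64 int lanes.
rtx_2080_ti = lane_counts(68)
```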
So what about Ampere?
This is where it gets confusing (at least at first) because to start this off I have to say straight off the bat, Ampere also has 128 dispatch ports per SM. So, technically it should also have 128 Shader processors per SM, just like Pascal and Turing, right? RIGHT?
No, because this is where some design choices in wiring the circuit come into play, and how NVIDIA can essentially double the advertised "CUDA Cores" per SM vs Turing, without actually having more math dispatch ports.
Aight, so the big difference here is that the second (on the right!) bunch of math pipes now has the ability to also do Floating point, like the first, or Integer. That's the important bit we have to understand when "measuring" Ampere's performance expectations from that frankly huge bloated number of "CUDA Cores" NVIDIA put on the spec sheets.
The way to look at this is that Ampere actually has 192 Shader Processors / Math lanes per SM, 128 of which can do Float and 64 of which can do Integer. That's 64 more than Pascal and Turing, but the catch is that the extra 64 float lanes are sharing dispatch ports with the existing 64 Integer lanes from Turing's design. What does that mean? It means those extra 64 Float lanes (you know, the ones that give Ampere the 2X "CUDA Core" count on the spec sheets) can only be used some of the time. That is not a bottleneck or an architectural "issue"; it's by design from the start, which brings me to the next point in this Babble.
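Here is the same lane bookkeeping for all three SM designs in one place, as a small Python sketch. The `SMLayout` class and its field names are my own invention for this comparison, and the per-SM numbers are simply the ones described in this Babble.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SMLayout:
    """Hypothetical per-SM bookkeeping; field names are made up for this sketch."""
    name: str
    dispatch_ports: int   # math dispatch ports per SM
    dual_lanes: int       # lanes that can do float or int (Pascal style)
    fp_only_lanes: int    # lanes that can only do float
    int_only_lanes: int   # lanes that can only do int

    @property
    def total_lanes(self) -> int:
        return self.dual_lanes + self.fp_only_lanes + self.int_only_lanes

    @property
    def advertised_cuda_cores(self) -> int:
        # Only FP32-capable lanes are counted as "CUDA cores".
        return self.dual_lanes + self.fp_only_lanes


pascal = SMLayout("Pascal", dispatch_ports=128, dual_lanes=128, fp_only_lanes=0, int_only_lanes=0)
turing = SMLayout("Turing", dispatch_ports=128, dual_lanes=0, fp_only_lanes=64, int_only_lanes=64)
# Ampere: the extra 64 float lanes share dispatch ports with the 64 int lanes.
ampere = SMLayout("Ampere", dispatch_ports=128, dual_lanes=0, fp_only_lanes=128, int_only_lanes=64)
```

The thing to notice is that `dispatch_ports` is 128 in all three rows; only the number and type of lanes hanging off those ports changes.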
About that "Ampere is for higher utilisation" point I made earlier.
So I digressed quite a bit, but whatever. Anyway, the point I am trying to make here is that from the math dispatch perspective (Pascal, Turing and Ampere all have 128 math ports/SM), the extra "CUDA Cores" on Ampere are actually there to improve utilisation of the 64 ports that would otherwise sit unused when the SM is done working on Integer, or when there are no Integer operations to complete at a given time.
You'll probably be aware of the graph that NVIDIA made when they compared the Pascal-based GTX 1060 to the Turing-based GTX 1660 Ti:
What I want to highlight about this graph is that it appears to be from the GPU top level and not the SM level, so it's just saying that for every 100 instructions going into the GPU (not the SM), float and integer can be co-issued like that, reducing the time needed to complete the work, which is now only 62 slots on Turing.
Thing is, see that bit at the back of the 1660 Ti's instruction slots, where only float is being performed? That's where Ampere's upgrade comes in. Now, it's not exactly how the visualisation shows it (i.e. the extra float lanes aren't always filling in the back of the workload, more on that in a sec), but Ampere's design essentially allows Turing's design to double up on float operations when there aren't Integer instructions being issued. But as you can probably tell, that's not actually even half of the number of instructions being worked on.
After having an awesome conversation with Nemez (to whom the credit for these diagram edits goes!), they came up with something like this that highlights what Ampere's extra Float units are actually doing:
Nemez edited this (credit to GPUsAreMagic on Twitter!) to show me how this more or less works. So, you can see that the Ampere SM has that second floating point execution path at the bottom, which is taking the remaining float work that would normally need 5 instruction slots on Turing. Even though both SMs have 128 ports, only 64 on Turing can do Float, so you end up with 50% of the ports doing nothing when it's just float work. Ampere changes that by allowing the second bunch of the 128 ports to also do float if they are not doing Integer, and what this does is allow those 5 instruction slots to become 2.5 - more performance, since the same number of instructions took fewer slots/less time!
In this rough diagram, you can see what Ampere is doing. Where there has been a gap in Integer instructions, the other bunch of ports (each row represents 64 ports out of the 128) can also do floats. Do you see where this is going? It is an attempt to fill in the gaps where Turing's dedicated Integer units weren't working because the work was only float - and how NVIDIA achieved that is by tacking another bunch of 64 Float units onto a shared dispatch connection with the integer ones.
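The gap-filling idea can be sketched as a tiny slot-counting model in Python. This is a rough illustration under my own assumptions, not a description of the real scheduler: `turing_slots` and `ampere_slots` are made-up names, one "instruction" is one row's worth (64 ports) of work, and the packing is assumed to be ideal.

```python
from fractions import Fraction


def turing_slots(fp_instr: int, int_instr: int) -> Fraction:
    """Row 1 does float only, row 2 does int only; the longer row sets the time."""
    return Fraction(max(fp_instr, int_instr))


def ampere_slots(fp_instr: int, int_instr: int) -> Fraction:
    """Row 1 does float only, row 2 does int, or float when no int is waiting."""
    # With ideal packing all work spreads over two rows, but int can only
    # ever use row 2, so the int count is a lower bound on the slot count.
    return max(Fraction(fp_instr + int_instr, 2), Fraction(int_instr))


# The float-only tail from the diagram: 5 slots on Turing become 2.5 on Ampere.
tail_turing = turing_slots(5, 0)   # Fraction(5, 1)
tail_ampere = ampere_slots(5, 0)   # Fraction(5, 2)
```

When the work is mostly integer (say 3 float and 7 integer instructions), both functions return 7 slots, because the extra float lanes never get a free port to use.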
The conclusion is simply this: look at the difference between RTX 2080 Ti and RTX 3080 (both have 68 SMs!). The 2080 Ti is advertised as having 4352 "CUDA Cores" and the 3080 is advertised as having literally twice that, at 8704, but the extra +100% "CUDA Cores" are merely present to fill in gaps in the Integer dispatch ports during float-heavy workloads.
It's not because "Ampere is poorly designed or bloated" (Well, it is a bit bloated, it's a fairly transistor heavy approach but then Turing was anyway), since the SM still has the same number of math dispatch ports as Pascal and Turing, but they are just being used more often because the second bunch can fill in no-int gaps with some extra floating point work.
But in gaming workloads, Ampere often cannot occupy all of those "CUDA Cores" with meaningful work (nor was it designed to), especially in shader code that uses a lot of Integer instructions.
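To put a rough number on that, here is a sketch of how busy the advertised float lanes would be for a given instruction mix. It is a toy estimate under loud assumptions: the function name is made up, packing is ideal, and the 62 float / 38 integer split is only my reading of what "100 instructions in 62 slots on Turing" implies.

```python
def ampere_fp_utilisation(fp_instr: int, int_instr: int) -> float:
    """Fraction of advertised float issue opportunities that do real float work."""
    # Two rows of 64 ports; int can only use the second row.
    slots = max((fp_instr + int_instr) / 2, int_instr)
    if slots == 0:
        return 0.0
    # Every slot offers two float opportunities, one per row of advertised lanes.
    return fp_instr / (2 * slots)


mixed_workload = ampere_fp_utilisation(62, 38)   # assumed 62/38 split
float_only = ampere_fp_utilisation(100, 0)
int_heavy = ampere_fp_utilisation(20, 80)
```

In this model the advertised "CUDA Cores" only hit full utilisation when there is no integer work at all, and the figure drops quickly once integer instructions dominate.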
So why not just make it so both sets of math lanes can do Integer or float? Well, it would likely require significantly more transistors in a situation where the Integer side of the execution lanes would see underutilisation in gaming, but keep in mind that the A100 datacentre processor doesn't have the dual-purpose FP/Int capability on the second bunch of math ports like "consumer" Ampere does - so it was entirely for visualisation/graphics workloads which are often floating point heavy. It was also likely a fairly "simple" upgrade to the Turing SM design since it didn't actually add any more width to the math dispatch, because the extra +100% "CUDA Cores" are sharing with the 64 Integer ones on the Turing-design.
It was designed that way. NVIDIA just likes to advertise really high numbers, and, well, a "CUDA Core" isn't really well defined. I guess NVIDIA considers it a math processor that can do floating point (an FPU), but didn't set any criteria on whether it can be used 100% of the time when you're also feeding Integer instructions into a Turing-style SM in which half of those advertised FPUs share dispatch ports with the integer units.
So are Ampere CUDA cores real?
Yes. And no, they don't have "SMT" like I saw someone comment on the internet (please, no). RTX 3090 literally has 10496 Floating point math lanes, physically in silicon; it's just that 5248 of those are sharing dispatch bandwidth with the chip's 5248 Integer math lanes, and that means those extra ones can only do work when Integer isn't being fed into the SM. So, no, you're not getting a doubling of performance in FP32 shading, but it does mean that for an SM with 128 math ports (like Pascal and Turing), half of them are not idling when only floats are coming into the SM.
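As a closing sanity check, the RTX 3090 numbers above can be worked through in a few lines of Python. The constant and function names are made up for this sketch, and treating the integer load as a simple "fraction of time the shared ports are busy with Integer" is my own simplification.

```python
FP_LANES = 10496           # advertised "CUDA Cores" on RTX 3090
INT_LANES = 5248           # integer lanes on the shared dispatch ports
SHARED_FP_LANES = 5248     # float lanes sharing ports with the integer lanes
DEDICATED_FP_LANES = FP_LANES - SHARED_FP_LANES

SM_COUNT = FP_LANES // 128  # 128 float lanes per SM


def usable_fp_lanes(int_busy_fraction: float) -> float:
    """Average float lanes available when the shared ports spend this fraction on int."""
    if not 0.0 <= int_busy_fraction <= 1.0:
        raise ValueError("int_busy_fraction must be between 0 and 1")
    return DEDICATED_FP_LANES + SHARED_FP_LANES * (1.0 - int_busy_fraction)
```

With no integer work, `usable_fp_lanes(0.0)` gives the full 10496; with the shared ports permanently busy on Integer, `usable_fp_lanes(1.0)` falls back to 5248, which is the Turing-style half.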