Mini-Tech Babble #14: Ampere, are they 'real' 'shader processors' or not?

Yay! A tech babble. Time for Sash to type some babbly crap on a subject relating to tech. This time, it's about Ampere, the new RTX 30-series NVIDIA just announced. Okay, so this is a "mini" tech babble, which is what I call one when I'm not willing to commit to typing a "Normal" one, and it makes me feel better. I think, at this point, mini and normal only differ by how confident I am when I start typing.

Please remember I'm babbling to the best of my knowledge. That knowledge also gets updated quite often, so not everything I say here is to be taken as 1000% concrete, even though I am confident I have a high level of factual accuracy. Thanks!

What's a 'CUDA Core'?

This question comes from people who didn't quite catch the pretty significant changes that were made in Turing, and before that, Volta. I'm going to keep this very simple and make it as easy to understand as possible.

Historically, Nvidia's 'CUDA core' is a math-processing block of very simple logic that takes a number and performs an operation on it. The CUDA core itself is nothing like a CPU core: the closest thing to that comparison is the SM. The CUDA core is more like an individual ALU pathway or SIMD lane (32-bit for FP32) within the width of a CPU's FPU. For example, 512-bit AVX = 16 FP32 "lanes", or 16 CUDA cores' worth of throughput. Kinda.
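If the lane maths is easier to see as code, here is a tiny Python sketch of it. The function name `fp32_lanes` is made up for this example, and the SSE and AVX2 rows are only there for scale; the AVX-512 row is the one from the paragraph above.

```python
def fp32_lanes(vector_width_bits: int, element_bits: int = 32) -> int:
    """Return how many elements of `element_bits` fit in one vector register."""
    if vector_width_bits % element_bits != 0:
        raise ValueError("vector width must be a multiple of the element size")
    return vector_width_bits // element_bits


# Register widths in bits. Each FP32 lane is roughly "one CUDA core's worth"
# of math per clock in the loose comparison made above.
for name, width in (("SSE", 128), ("AVX2", 256), ("AVX-512", 512)):
    print(f"{name}: {fp32_lanes(width)} FP32 lanes")
```

It is only a width calculation. It says nothing about clocks, scheduling or how many of those vector units a real CPU core has.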

Here is a simplified top-level block view of what a "CUDA core" has meant, up to Pascal, basically.

So this whole thing boils down to what Nvidia says a 'CUDA core' is. From Fermi to Pascal (to my knowledge) the 'CUDA core' has been a dual-function simple math execution unit with a pipeline for integer work and a pipeline for floating point work. Integer work is exactly what it says: whole numbers only, like "586", with no "." for the logic to track. Floating point can include non-integer numbers with a point (a floating point), like 5.86. The CUDA core has historically been advertised by NVIDIA as a math unit that can do either of those types of work. Not at the same time, it has to choose which one it does at a given moment, but it can do both.

Here is a Pascal "Streaming Multiprocessor" - A collection of execution resources comparable to a "CPU core" if you make some big jumps. Please don't hate me.

Non GP100 Pascal SM.

The SM is split into four 'quads', each with 32 "CUDA Cores" as you can see. Ignore the Load/Store and Special Function Units for now, they are beside the point. Essentially, each of those light-green "Core" blocks is one of the units shown in the first diagram: a simple math unit with an INT and an FP pipe combined, which can do both, just not at the same time. The Pascal SM has 4x32 = 128 of these math pipes in total. Oh, that's the same as Ampere per SM, right? Uh, yes, but there are some catches, I think. Please keep reading.
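Here is a rough Python sketch of that either/or behaviour, in case it helps. It is a toy model with perfect scheduling and nothing else in the way (no memory stalls, no issue limits), and the function name and the op counts fed into it are made up for illustration.

```python
import math

QUADS_PER_SM = 4       # the four 'quads' in the SM diagram
CORES_PER_QUAD = 32    # unified INT/FP "CUDA cores" in each quad
CORES_PER_SM = QUADS_PER_SM * CORES_PER_QUAD


def pascal_sm_cycles(fp_ops: int, int_ops: int) -> int:
    """Clocks for one SM to get through the given ops in this toy model.

    Every core retires one op per clock, INT or FP but never both, so the
    two kinds of work compete for the same slots each clock.
    """
    return math.ceil((fp_ops + int_ops) / CORES_PER_SM)


print(CORES_PER_SM)                 # 128
print(pascal_sm_cycles(1280, 0))    # 10
print(pascal_sm_cycles(1280, 640))  # 15
```

The thing to notice is that adding integer work makes the float work take longer, because both are queuing for the same 128 units.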

Volta and Turing kinda change this a bit.

This is where it gets interesting. I typed some bits on Turing in the past, and I mentioned something along the lines of "Turing has a huge die size because of the split-design CUDA cores and caches, not because of the RT cores". And I stand by that. The RT cores are quite small on the die, and you can see good evidence of that with Nemes's awesome annotations, here.

What I want to say here is that Turing is fat for how few CUDA cores it has. (Volta was a DC chip, so let's ignore it for now, since 30-series Ampere is Turing's gaming successor.) And when I say "CUDA cores" I mean those that NVIDIA is advertising as CUDA cores. You know? 4608 "CUDA cores" on the RTX TITAN?

Okay, let's take TU102. The bandwagon among techies is that this die is huge (750mm2 is quite large) because of the "RT and Tensor" cores. Understandable, but read what I said above. The BVH-traversal fixed-function logic blocks are quite small. You can see this by comparing the TU10x chips with the TU11x chips, which have no RT or Tensor cores: size normalised, the TPC doesn't look that much smaller, does it? And that's what I mean.

This is because Turing actually has twice the math pipelines ("""CUDA CORES""") that Nvidia advertises on its spec sheet. Yes, twice. RTX TITAN has 9,216 "CUDA math pipeline-things". :O
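The arithmetic behind that number, as a quick Python sketch. The "two pipes per advertised core" figure is this post's way of counting, not something on NVIDIA's spec sheet, and the 64 advertised cores per Turing SM is the figure this post gets to further down.

```python
ADVERTISED_CUDA_CORES = 4608    # RTX TITAN, as quoted above
PIPES_PER_ADVERTISED_CORE = 2   # one FP32 pipe + one INT32 pipe (this post's counting)
ADVERTISED_CORES_PER_SM = 64    # FP/INT "pairs" per Turing SM

math_pipes = ADVERTISED_CUDA_CORES * PIPES_PER_ADVERTISED_CORE
sm_count = ADVERTISED_CUDA_CORES // ADVERTISED_CORES_PER_SM

print(math_pipes)              # 9216
print(sm_count)                # 72
print(math_pipes // sm_count)  # 128
```

So by this way of counting, a Turing SM has 128 math pipes in it, even though only 64 "CUDA cores" per SM are advertised.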

A wild Turing SM diagram appeared.

Well, take a look at that. Please ignore the Load/Store, SFU and Tensor units, as they are sort of beside the point (again). So, here we see the green blocks are no longer labelled "Core" as on Pascal. Why is that? Because they no longer fit the definition of a "CUDA core": what NVIDIA essentially did here is take the INT math component and the FP math component of a Pascal "CUDA core" and split them into two separate pipelines.

What this means is that the INT pipes can accept work while the FP pipes are also occupied. That's "Integer Co-execution", and Nvidia advertises it with Turing.
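A small Python sketch of why that matters. It compares 64 split FP/INT pairs against 64 hypothetical either/or units (a made-up baseline, just to isolate the effect of splitting), assuming perfect scheduling. The op counts are an arbitrary mix picked for illustration, not a measurement of any real workload.

```python
import math

PIPES = 64  # per Turing SM in this post's model: 64 FP32 pipes and 64 INT32 pipes


def either_or_cycles(fp_ops: int, int_ops: int, units: int = PIPES) -> int:
    """Either/or units: INT and FP ops share the same slots each clock."""
    return math.ceil((fp_ops + int_ops) / units)


def split_cycles(fp_ops: int, int_ops: int,
                 fp_pipes: int = PIPES, int_pipes: int = PIPES) -> int:
    """Separate pipes: INT and FP overlap, so the longer stream sets the time."""
    return max(math.ceil(fp_ops / fp_pipes), math.ceil(int_ops / int_pipes))


fp_ops, int_ops = 6400, 2304  # arbitrary illustrative mix
print(either_or_cycles(fp_ops, int_ops))  # 136
print(split_cycles(fp_ops, int_ops))      # 100
```

In the split case the integer work hides entirely behind the float work, which is the whole point of co-execution.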

Fancy.

My concentration is waning so I'll get to the point.

Turing's math units are grouped into pairs because Nvidia's definition of a CUDA core required a unit flexible enough to do both INT and FP work.

Yeah, basically, Nvidia only wanted to sell you a CUDA core if it was flexible enough to do both INT and FP math work. Because I just found this article which said, essentially, everything I wanted to say, I'm going to stop now and save myself the time. x.x

Blah blah, shut up, Sash.

Anyway, Ampere basically goes ahead and merges the 64 INT and 64 FP pipes back into 64 unified ones, like on Pascal (FP or INT, not both, remember?), and then adds another bunch of FP-only pipes, 64 of them, alongside that. So you get 128 FP32 pipes per SM, like Pascal, but only 64 of them can do INT, whereas all 128 on Pascal can do INT or FP. Confusing, right?

TLDR: Ampere's SM has 50% of the integer throughput of a Pascal SM, despite having the same number of """"CUDA cores"""". Same in floats, though. Also, Pascal can only do full-speed INT24, not INT32. :D Meow.
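The same TLDR as a lookup table in Python, for anyone who prefers that. The numbers are the peak per-SM pipe counts from this post's model, so treat them as ops per clock in the best case only. The `PEAK` table and the `ratio` helper are made up for this example.

```python
# Peak per-SM rates in ops per clock, under this post's model.
PEAK = {
    "Pascal": {"fp32": 128, "int32": 128},  # 128 either/or cores
    "Turing": {"fp32": 64, "int32": 64},    # 64 FP pipes + 64 INT pipes
    "Ampere": {"fp32": 128, "int32": 64},   # 64 either/or cores + 64 FP-only pipes
}


def ratio(arch: str, baseline: str, kind: str) -> float:
    """Peak rate of `arch` relative to `baseline` for one kind of math."""
    return PEAK[arch][kind] / PEAK[baseline][kind]


print(ratio("Ampere", "Pascal", "int32"))  # 0.5
print(ratio("Ampere", "Pascal", "fp32"))   # 1.0
print(ratio("Ampere", "Turing", "fp32"))   # 2.0
```

These are peaks with only one kind of math running. What happens when the two are mixed is a separate question.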

Ampere SM looks like this:

Ugh, I need a simple way of describing this. Okay.

Pascal's "CUDA Core"

Pascal's CUDA core is a jack-of-all-trades math unit that can do both floating point math and integer math, just not at the same time. So: "I'm doing INT or float. But I can do both!" (Not at the same time!) 128 of those per SM!

Turing's "CUDA Core"

Turing's CUDA core is made up of two little CUDA cores, one for each math type: one for INT and one for FP. This means these "cores that are actually made up of 2 cores" can do INT and float at the same time, since there are dedicated pathways for each. Nvidia only advertises the pairs because that was the standard definition of a CUDA core. Otherwise it'd be like selling a car without wheels. Kinda? There are 64 of those "pairs" in one SM.

Ampere's "CUDA Core"

Ampere's CUDA core is sort of a hybrid between Pascal's and Turing's. However, Nvidia is changing the way it advertises CUDA cores to you. Instead of only calling something a CUDA core if it can do INT OR FLOAT (not necessarily at the same time, but with that flexibility in math type), they now advertise every FLOATY pipe as one, whether or not it can also do INT.

In design, Ampere's SM is made up of 64 Pascal-type "jack of all trades" CUDA cores (FP OR INT, not both at the same time) and 64 FP-ONLY CUDA cores, i.e. fucking useless for integer math. That is where you get the 128 """"""CUDA CORES"""""" from. Yes?

At the end of the day, the Ampere SM is like a Turing SM in INT throughput (64 op/clock) and a Pascal SM in FP throughput (128 op/clock).

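In raw shader throughput it can actually be weaker than Pascal per SM in mixed (float+int) code: going by the pipe counts above, it keeps up while integer ops are no more than half of the mix, and falls behind once they are more than that.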
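To put rough numbers on the mixed case, here is a Python sketch of sustained ops per clock per SM as the integer share of the instruction mix changes. It is an idealised model built only from the pipe counts in this post: perfect scheduling, no memory stalls, no issue or dispatch limits. The function name and the sample fractions are made up for illustration, and real hardware will not behave this cleanly.

```python
def sustained_ops_per_clock(arch: str, int_fraction: float) -> float:
    """Idealised per-SM throughput for a mix where `int_fraction` of ops are INT."""
    if not 0.0 <= int_fraction <= 1.0:
        raise ValueError("int_fraction must be between 0 and 1")
    f = int_fraction
    if arch == "Pascal":
        # 128 either/or cores: the mix does not matter.
        return 128.0
    if arch == "Turing":
        # 64 FP pipes and 64 INT pipes: whichever side saturates first limits the rate.
        limits = []
        if f < 1.0:
            limits.append(64.0 / (1.0 - f))
        if f > 0.0:
            limits.append(64.0 / f)
        return min(limits)
    if arch == "Ampere":
        # 64 either/or cores + 64 FP-only pipes: at most 64 INT ops per clock,
        # and at most 128 ops per clock overall.
        return 128.0 if f == 0.0 else min(128.0, 64.0 / f)
    raise ValueError(f"unknown architecture: {arch}")


for f in (0.0, 0.25, 0.5, 0.75, 1.0):
    row = ", ".join(
        f"{arch} {sustained_ops_per_clock(arch, f):6.1f}"
        for arch in ("Pascal", "Turing", "Ampere")
    )
    print(f"INT share {f:.2f}: {row}")
```

In this model Ampere never beats a Pascal SM, matches it up to a 50% integer share, and drops towards 64 ops per clock as the mix becomes all integer.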

Good day to you!

Eridonia Archives