Mini-Tech Babble #9: Turing Tensor/RT 'cores'. Just how dedicated are they?
Updated: Mar 16, 2020
Okay, this will be a very small Tech Babble, hence the 'Mini Tech Babble' in the title. That is, I don't really want to spend a huge amount of time typing this one, because I am procrastinating and I want to play Warframe. That said, I usually just go off and start babbling endlessly, and by the time I realise, I've typed 8000 words.
For your information: I can't verify whether this is true, and it might not be. Take it with a huge grain of salt, and don't take it as fact. But it's an interesting concept nonetheless.
So let me start with a simple question.
Does Turing actually have dedicated Tensor and Ray Trace cores?
Before my head gets bitten off, let me explain. I read some very interesting comments on Anandtech from someone who has apparently done a lot of research (and their homework!) on how Turing's Tensor and RT ops are performed. At first I was skeptical, but then, the more I think about it the more it actually makes sense.
Consider the possibility that the 'Tensor Core' in the TU10x class GPUs is not actually a dedicated, fixed-function piece of logic separate from the shader processors. How would that work? Well, the Tensor Core could very well be a grouping of Shader Processors (CUDA cores) that are scheduled in a very different way to a usual SIMD operation on a Warp for a 3D graphics task.
What I read is that the Shader Processors are grouped into 8x8 clusters, and when a Tensor op is performed they run separately from normal operations, in a different way. That is, the Shaders (CUDA Cores) themselves are doing the Tensor op, but they are using a vastly different logical pathway to do it. This actually makes a lot of sense.
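To make "the shaders themselves are doing the Tensor op" a bit more concrete, here is a minimal Python sketch of what a tensor op boils down to mathematically: a small matrix multiply-accumulate, D = A x B + C, which is nothing more than a pile of independent fused multiply-adds. The function name `tile_mma` is made up for illustration, and the tile size of 8 is only borrowed from the "8x8" figure above as an assumption. This says nothing about how Turing actually wires or schedules the work; it only shows that the arithmetic maps naturally onto many simple FMA lanes.

```python
def tile_mma(a, b, c):
    """Compute D = A x B + C for square tiles (lists of lists).

    Each output element (i, j) is treated as one hypothetical 'lane'
    that performs n multiply-adds. Returns the result and the FMA count.
    """
    n = len(a)
    d = [[0] * n for _ in range(n)]
    fma_count = 0
    for i in range(n):
        for j in range(n):
            acc = c[i][j]              # start from the accumulator tile
            for k in range(n):
                acc += a[i][k] * b[k][j]
                fma_count += 1
            d[i][j] = acc
    return d, fma_count


N = 8  # assumed tile size, taken loosely from the "8x8" grouping
identity = [[1 if i == j else 0 for j in range(N)] for i in range(N)]
b = [[i * N + j for j in range(N)] for i in range(N)]
ones = [[1] * N for _ in range(N)]

d, fmas = tile_mma(identity, b, ones)
print(d[0][:4], fmas)
```

With an 8x8 tile that is 64 lanes doing 8 multiply-adds each, 512 in total, and every lane is independent of the others, which is exactly the kind of work a wide array of FP units is already good at.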
The Ray Trace core is supposedly something very similar: a small amount of dedicated BVH-traversal logic, with the tracing itself performed on the shader processors. I read that the Tensor & RT cores cannot be utilised simultaneously with the CUDA cores, which would also support this "not so dedicated" theory. Apparently this comes from developers who are working with the GPUs themselves.
Take a look at this Nvidia-provided graph of a frame with RTX in Metro Exodus:
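For anyone wondering what "BVH traversal" actually involves, here is a minimal Python sketch, assuming a tiny hand-built tree of axis-aligned boxes. The `Node` class, `ray_hits_box` and `traverse` are names I made up for illustration; this is the textbook slab test plus a stack-based walk, not Nvidia's implementation. The point is that traversal is a loop of cheap box tests that throws away most of the scene, and only the surviving leaves need any heavier work afterwards.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class Node:
    lo: Vec3                        # box minimum corner
    hi: Vec3                        # box maximum corner
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    prim: Optional[str] = None      # set only on leaves


def ray_hits_box(origin, direction, lo, hi):
    """Slab test: does the ray (t >= 0) pass through the box?"""
    t_near, t_far = 0.0, float("inf")
    for o, d, l, h in zip(origin, direction, lo, hi):
        if d == 0.0:
            # Ray is parallel to this pair of planes.
            if o < l or o > h:
                return False
            continue
        t0, t1 = (l - o) / d, (h - o) / d
        if t0 > t1:
            t0, t1 = t1, t0
        t_near, t_far = max(t_near, t0), min(t_far, t1)
        if t_near > t_far:
            return False
    return True


def traverse(root, origin, direction):
    """Return the leaf primitives whose boxes the ray touches."""
    hits, stack = [], [root]
    while stack:
        node = stack.pop()
        if not ray_hits_box(origin, direction, node.lo, node.hi):
            continue                # prune this whole subtree
        if node.prim is not None:
            hits.append(node.prim)  # leaf: hand off for real intersection/shading
        else:
            stack.extend(c for c in (node.left, node.right) if c is not None)
    return hits


leaf_a = Node((0.0, 0.0, 0.0), (1.0, 1.0, 1.0), prim="A")
leaf_b = Node((2.0, 2.0, 2.0), (3.0, 3.0, 3.0), prim="B")
root = Node((0.0, 0.0, 0.0), (3.0, 3.0, 3.0), left=leaf_a, right=leaf_b)

print(traverse(root, (0.5, 0.5, -1.0), (0.0, 0.0, 1.0)))
```

In the theory above, the loop in `traverse` is the part that small dedicated logic would handle, while whatever happens to the returned leaves would run on the shader processors.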
If you look at when the RT and Tensor operations are actually being performed, they are mostly not concurrent with other shader operations. Whether this is because those ops are being done on the shaders themselves is not certain, but it is interesting.
The small amount of concurrent work could be because not all of the SMs are occupied with that specific Tensor/RT op. However, RT seems to take up almost the entire load % of the GPU's engine during the process, which makes sense because it's resource heavy. The Tensor work seems to be dispatched alongside some FP and INT work, but only a very small amount. Keep in mind this graph is not a single SM; it is the entire GPU's array, in which SMs are able to work on different tasks in parallel.
The question at hand is: just how much dedicated processing logic is present within each SM?
Funnily enough, this is actually quite similar to how AMD proposed to do Ray Tracing.
This is something the author of what I was reading pointed out, and looking at it, it does indeed seem so. AMD proposed to do Ray Tracing on the shader processors with a hybrid approach, using some dedicated logic with the bulk of the work being done on the CU's SPs. Could it be that "Ray Tracing core" and "Tensor Core" are marketing-driven names given to additional functionality provided by the Shader Processors?
So why can't Pascal do RT as quickly?
Because Pascal's Shader Processors are wired to do very specific types of Floating-Point and Integer operations. The entire logical layout is built for that highly parallel vector workload like any GPU. If the cores were wired to do branchy code with a complex front-end, then you'd just have a CPU with some really beefy FPU pipes (lol?).
The functional units on Turing might have alternative scheduling/logic pathways to perform other types of operations, supported by a very light implementation of dedicated logic (BVH lookup for example). Why add more execution elements if the existing ones can be "repurposed" to do Tensor & RT operations?
A small nod to this theory is that the TU116 GPU is not actually that much smaller than the TU106 when you normalise the other resource counts (SP, ROP, Bus, etc), despite lacking the "RT" and "Tensor Cores". The bulk of Turing's die size increase, as I mentioned earlier, in my opinion comes from beefed up caches and separated INT32 and FP32 logic.
It could also be that TU116's FP16x2 accumulators are similar to this in nature, but I haven't read anything on that topic.
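Here is a rough back-of-the-envelope version of that normalisation in Python. The die areas and SM counts are the commonly reported public figures for the full chips, not numbers from the comments I was reading, so treat them as assumptions and double-check them. Scaling by SM count alone is also a deliberately naive choice: it ignores ROPs, memory bus width and everything else on the die.

```python
# Commonly reported figures for the full dies (assumption: verify before relying on them).
TU106 = {"die_mm2": 445.0, "sms": 36}   # has RT and Tensor cores
TU116 = {"die_mm2": 284.0, "sms": 24}   # lacks RT and Tensor cores

# Naively shrink TU106 to TU116's SM count.
scale = TU116["sms"] / TU106["sms"]
expected_mm2 = TU106["die_mm2"] * scale

# How much smaller is the real TU116 than that scaled-down TU106?
saving = 1.0 - TU116["die_mm2"] / expected_mm2

print(f"TU106 scaled to 24 SMs: {expected_mm2:.1f} mm^2")
print(f"Actual TU116:           {TU116['die_mm2']:.1f} mm^2")
print(f"Difference:             {saving:.1%}")
```

On these numbers the gap comes out at only a few percent, which fits the "not that much smaller" observation, though a crude per-SM estimate like this can't prove anything about what is or isn't inside each SM.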
Please take this with a grain of salt. I thought this was very interesting, and I wanted to post and type a bit about it.
Thanks for reading. And thanks to the person commenting on Anandtech who gave a very insightful view into just how Turing might be doing Tensor/RT operations.