GCN Geometry Performance
I was curious about the architectural gains in heavy geometry workloads between some different revisions of the GCN architecture. Of particular interest to me was a comparison between the Tahiti XT, Tonga XT and Polaris 10/20 PRO chips, due to their very similar top-level core configuration (32 CU & 32 ROP). Tahiti of course has a 384-bit bus (so does Tonga, actually, but no product ever used it). But we can get around that by tweaking the memory speeds.
So I got my hands on some cards with these chips: an R9 280X for Tahiti XT, an R9 380X for Tonga XT, and an RX 570 (4G) for the Polaris 20 PRO chip. Each of these processors has 32 Compute Units (2048 Stream Processors, 128 Texture Units) and 32 pixels/clock of ROPs. The main difference here is Tahiti, which only has a dual-raster design, whereas the other two have a quad-raster design. That also includes the geometry/tessellator engines, so I expected to see big gains between Tahiti and Tonga, but maybe smaller ones with Polaris. I was a bit surprised by what I found, though. More on this in a moment.
First I'm going to talk about what I know is different between these GPUs, going by what AMD has said. So Tahiti is the first iteration of GCN, and this chip was announced back in December 2011 and released in January the next year, so it's old. 7 years old. This GPU has a pair of Shader Engines (not sure if AMD called them that back then) with 16 CUs each. Each of these has a geometry processor, a tessellator and a raster engine. So Tahiti can spew out 2 primitives per clock. Tonga bumps this up to 4 primitives per clock by adding another pair of Shader Engines and lowering the CU count to 8 each, so you get the same number of CUs but the chip has significantly more geometry performance. Well, technically this quad-raster design was first implemented on GCN2 with Hawaii, but I don't have one of those for testing (yet).
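To put numbers on that layout, here's a quick Python sketch of the Shader Engine maths. The class and field names are my own, and the one-primitive-per-clock-per-engine figure is an assumption taken from the 2 vs 4 prim/clock totals above, so treat it as a back-of-the-envelope model and not anything official.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GpuConfig:
    """Hypothetical description of a GCN chip's front-end layout."""
    name: str
    shader_engines: int
    cus_per_engine: int
    # Assumption: each Shader Engine's geometry/raster block handles
    # one primitive per clock (matches the 2 and 4 prim/clock totals).
    prims_per_clock_per_engine: int = 1

    @property
    def total_cus(self) -> int:
        return self.shader_engines * self.cus_per_engine

    @property
    def prims_per_clock(self) -> int:
        return self.shader_engines * self.prims_per_clock_per_engine

    def peak_prims_per_second(self, core_mhz: float) -> float:
        """Theoretical peak primitive rate at a given core clock."""
        return self.prims_per_clock * core_mhz * 1e6


TAHITI = GpuConfig("Tahiti XT", shader_engines=2, cus_per_engine=16)
TONGA = GpuConfig("Tonga XT", shader_engines=4, cus_per_engine=8)

for gpu in (TAHITI, TONGA):
    print(gpu.name, gpu.total_cus, "CUs,", gpu.prims_per_clock, "prim/clock")
```

Same 32 CUs either way, but at a 1000 MHz core clock that works out to a theoretical 2 billion primitives per second for Tahiti against 4 billion for Tonga. Real throughput will be lower, since this ignores everything else in the pipeline.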
Tonga's geometry processors are also improved: they can re-use triangle data from earlier in the pipeline and have improved throughput per clock, per engine. Polaris has all of these upgrades and quad-raster, too, but also has something really interesting that AMD called the 'Primitive Discard Accelerator'. To my understanding this is essentially a hardware-level (on silicon) engine that runs very fast checks to see if a triangle/primitive has any visible parts (that you can see on the current frame), and if it doesn't (e.g. it's degenerate), it discards that primitive early in the pipeline to prevent it from being drawn and wasting GPU time for no reason (you can't see it, so why draw it?). Of course a good game engine should cull triangles or geometry you can't see, but this is faster and doesn't require devs to do it (do you remember Crysis 2?).
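To make the idea a bit more concrete, here's a rough software sketch in Python of the kind of check I imagine such a unit doing. This is purely my own illustration and NOT AMD's actual hardware logic: I'm assuming two discard cases (a zero-area triangle, and a triangle so small it slips between pixel centres), one sample per pixel at the pixel centre, and all the function names are made up.

```python
import math


def signed_area_2x(a, b, c):
    """Twice the signed area of screen-space triangle (a, b, c)."""
    return (b[0] - a[0]) * (c[1] - a[1]) - (c[0] - a[0]) * (b[1] - a[1])


def covers_any_pixel_center(a, b, c):
    """True if any pixel centre (x + 0.5, y + 0.5) is inside or on the triangle."""
    xs = (a[0], b[0], c[0])
    ys = (a[1], b[1], c[1])
    # Only pixels whose centres fall inside the bounding box can be covered.
    x0, x1 = math.ceil(min(xs) - 0.5), math.floor(max(xs) - 0.5)
    y0, y1 = math.ceil(min(ys) - 0.5), math.floor(max(ys) - 0.5)
    for py in range(y0, y1 + 1):
        for px in range(x0, x1 + 1):
            p = (px + 0.5, py + 0.5)
            w0 = signed_area_2x(a, b, p)
            w1 = signed_area_2x(b, c, p)
            w2 = signed_area_2x(c, a, p)
            # Same sign on all three edge functions means p is inside
            # (checked for both winding orders).
            if (w0 >= 0 and w1 >= 0 and w2 >= 0) or (w0 <= 0 and w1 <= 0 and w2 <= 0):
                return True
    return False


def should_discard(a, b, c):
    """Assumed discard rule: zero area, or no pixel centre covered."""
    if signed_area_2x(a, b, c) == 0:
        return True  # degenerate: all three vertices are collinear
    return not covers_any_pixel_center(a, b, c)


print(should_discard((0, 0), (1, 1), (2, 2)))              # collinear vertices
print(should_discard((0.6, 0.6), (0.9, 0.6), (0.6, 0.9)))  # sits between pixel centres
print(should_discard((0, 0), (10, 0), (0, 10)))            # large, clearly visible
```

The first two triangles get thrown away and the third one is kept. Heavy tessellation produces loads of tiny triangles like the second one, which is why a check like this could matter in these tests.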
So in a nutshell, Polaris should be able to throw out a ton of primitives very early if they are not visible, and that means the GPU time saved can be used to draw the ones you can see, a lot faster. Hence the potential FPS gains to be had in geometry-heavy workloads. Shall we test that?
You probably want to know what I'm testing these cards with. So here you go.
Ryzen 3 1200 3.1 GHz
2x8GB 2400 MHz CL14
MSI B350M Mortar
EVGA 500W Bronze
Radeon Driver Adrenalin 19.4.3
Tessellation set to Use Application settings
The CPU isn't the fastest thing around but it gets the job done. And I didn't want to take apart my main PC, I've got that just perfect. Anyway, I made sure all tests were GPU bound so the results are accurate. The CPU was running at 3.1 GHz for Firestrike, Heaven and TessMark, but I clocked it up to 3.7 GHz for the gaming tests and Timespy. Oh, and the AMD driver will limit insane tessellation factors if left on default (you do remember Crysis 2, right? Yeah. This is for things like that). So I set that to Use Application settings, so we can run these synthetics. Anyway, moving on~
Okay so I am setting all three cards to run a set frequency on the core, to allow us to see the per-clock gains of the architecture. That's 1000 MHz core, and I also set the memory speeds to produce the same raw bandwidth, of 192GB/s. Tahiti's wider bus meant I had to set its memory 2Gbps slower (4 vs 6) to achieve the same bandwidth. Just FYI. Anyway...
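If you want to check the bandwidth maths yourself, it's just bus width times per-pin data rate, divided by 8 to get bytes. Here's a tiny Python sketch (the function names are mine). Note the 256-bit figure for the Tonga and Polaris cards is what falls out of 192GB/s at 6Gbps, and I haven't stated it separately above.

```python
def memory_bandwidth_gbs(bus_width_bits: int, data_rate_gbps: float) -> float:
    """Raw memory bandwidth in GB/s from bus width (bits) and per-pin rate (Gbps)."""
    return bus_width_bits * data_rate_gbps / 8


def data_rate_for_bandwidth(bus_width_bits: int, target_gbs: float) -> float:
    """Per-pin data rate (Gbps) needed to hit a target bandwidth in GB/s."""
    return target_gbs * 8 / bus_width_bits


# Tahiti: 384-bit bus at 4 Gbps
print(memory_bandwidth_gbs(384, 4))       # 192.0
# Tonga / Polaris: 256-bit bus (inferred) at 6 Gbps
print(memory_bandwidth_gbs(256, 6))       # 192.0
# What does each bus need for 192 GB/s?
print(data_rate_for_bandwidth(384, 192))  # 4.0
print(data_rate_for_bandwidth(256, 192))  # 6.0
```

So the wider bus needs only two thirds of the data rate to land on the same raw bandwidth. This is raw bandwidth only and ignores things like memory compression.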
First up is TessMark. Now this is a pure synthetic test that tessellates the crap out of an object and gets the GPU to spew out a simply mental number of triangles. Over 11 million of them on 64X, actually. It's extremely light on all other GPU resources so it's basically a pure triangle-throughput test. If PDA is doing its job, we should see it here...
Well, there you go. In a pure tessellation/primitive rate synthetic workload we can really see some pretty enormous gains between the generations. Tahiti's performance is essentially doubled by Tonga, which isn't really surprising when you consider it has literally double the geometry hardware. The more interesting thing is Polaris. It gains a huge 40% increase in frames per second rendered over Tonga, at the same clocks, with the same number of geometry engines. I think it's safe to say Primitive Discard is doing something here. As the tessellated blob rotates, some of those triangles are being obscured, and this thing prevents them from being rendered. That's a lot of saved GPU time.
Here's the same bench with 8X MSAA thrown in. Testing because AMD did a slide saying PDA scales with MSAA loads.
Okay so the gains are much lower here. I guess bottleneck shifted to Render backend or something. Tahiti is still left behind, though, with its measly 2 prim/clock throughput.
Next I'm testing Unigine Heaven, because it has very configurable Tessellation settings. I'm not doing the full bench because it wouldn't let me test with custom Tessellation settings (Factor, distance). So I just bench the scene that spins the camera around the Dragon Statue in the courtyard. Here's a couple of pictures with Tessellation toggled to show the massive increase in geometry load caused by these settings.
First, I'm testing without Anti-Aliasing.
Aside from the pretty enormous more-than-doubling of performance from Tahiti to Tonga, as you'd expect from the doubled hardware, the Polaris gains are much smaller. The improvements to the geometry processors are still doing something, but it's not quite as ideal a situation as the TessMark benchmark. I consider this a 'semi-synthetic' workload, as it's a gaming engine (that you can walk around in!) but no game developer in their right mind would tessellate the pants out of the environment like I set it to here (unless sponsored by NVIDIA lol). But still, it's a geometry test so...