GCN Geometry Performance

Sasha W.
May 12, 2019

I was curious about the architectural gains in heavy geometry workloads between some different revisions of the GCN architecture. Of particular interest to me was a comparison between the Tahiti XT, Tonga XT and Polaris 10/20 PRO chips, due to their very similar top-level core configuration (32 CU & 32 ROP). Tahiti of course has a 384-bit bus (so does Tonga, actually, but no product ever used it). But we can get around that by tweaking the memory speeds.

So I got my hands on some cards with these chips: an R9 280X for Tahiti XT, an R9 380X for Tonga XT, and an RX 570 (4G) for the Polaris 20 PRO chip. Each of these processors has 32 Compute Units (2048 Stream Processors, 128 Texture units) and 32 pixel/clock of ROPs. The main difference here is Tahiti, with only a dual-raster design, whereas the other two have a quad-raster design. That also includes the geometry/tessellator engines, so I expected to see big gains between Tahiti and Tonga, but maybe smaller ones with Polaris. I was a bit surprised by what I found, though. More on this in a moment.

First I'm going to talk about what I know is different between these GPUs, from what AMD has said. Tahiti is the first iteration of GCN; the chip was announced back in December 2011 and released in January the next year, so it's old. 7 years old. This GPU has a pair of Shader Engines (not sure if AMD called them that back then) with 16 CU each. Each of these has a geometry processor, a tessellator and a raster engine, so Tahiti can spew out 2 primitives per clock. Tonga bumps this up to 4 primitives per clock by adding another pair of Shader Engines and lowering the CU count to 8 each, so you get the same number of CUs but the chip has significantly more geometry performance. Well, technically this quad-raster design was first implemented on GCN2 with Hawaii, but I don't have one of those for testing (yet).
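To put the "primitives per clock" figures into absolute terms, here is a rough back-of-the-envelope sketch. It assumes one primitive per clock per Shader Engine (which is what the 2 and 4 prim/clock figures above imply) and uses 1000 MHz purely as an illustrative clock. The function name is made up for this example, and real throughput will be lower than this theoretical peak.

```python
def peak_primitive_rate(shader_engines: int, core_clock_mhz: float,
                        prims_per_engine_per_clock: float = 1.0) -> float:
    """Theoretical peak primitives per second (an upper bound, not a measurement)."""
    return shader_engines * prims_per_engine_per_clock * core_clock_mhz * 1e6


# Shader Engine counts as described in the text; the clock is an illustrative value.
configs = {"Tahiti XT": 2, "Tonga XT": 4, "Polaris 20 PRO": 4}

for name, engines in configs.items():
    rate = peak_primitive_rate(engines, core_clock_mhz=1000)
    print(f"{name}: {rate / 1e9:.1f} Gprim/s theoretical peak")
```

On paper Tonga and Polaris come out identical, so any per-clock difference between those two has to come from something other than the raw engine count.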

Tonga's geometry processors are also improved: they can re-use triangle data from earlier in the pipeline and have improved throughput per clock, per engine. Polaris has all of these upgrades and quad-raster, too, but also has something really interesting that AMD called the 'Primitive Discard Accelerator' (PDA). To my understanding this is essentially a hardware-level (on silicon) engine that runs very fast checks to see if a triangle/primitive has any visible parts (that you can see on the current frame), and if it doesn't (for example because it's degenerate), it discards that primitive early in the pipeline to prevent it from being drawn and wasting GPU time for no reason (you can't see it, so why draw it?). Of course a good game engine should cull triangles or geometry you can't see, but this is faster and doesn't require devs to do it (do you remember Crysis 2?).
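To make the idea of early primitive discard a bit more concrete, here is a small software sketch. This is not AMD's hardware logic, just an illustration of the kind of cheap test involved. It assumes screen-space coordinates in pixels, one sample per pixel at the pixel centre, and two discard conditions: a zero-area (degenerate) triangle, or a triangle whose bounding box contains no sample point. All names are hypothetical.

```python
from math import ceil, floor
from typing import Tuple

Point = Tuple[float, float]  # screen-space position in pixels


def signed_area_x2(a: Point, b: Point, c: Point) -> float:
    """Twice the signed area of triangle abc; zero means the points are collinear."""
    return (b[0] - a[0]) * (c[1] - a[1]) - (c[0] - a[0]) * (b[1] - a[1])


def axis_has_centre(lo: float, hi: float) -> bool:
    """True if some pixel centre (k + 0.5, k an integer) lies within [lo, hi]."""
    return ceil(lo - 0.5) <= floor(hi - 0.5)


def covers_no_sample(a: Point, b: Point, c: Point) -> bool:
    """Conservative test: the triangle's bounding box contains no pixel centre."""
    xs = (a[0], b[0], c[0])
    ys = (a[1], b[1], c[1])
    return not (axis_has_centre(min(xs), max(xs)) and
                axis_has_centre(min(ys), max(ys)))


def should_discard(a: Point, b: Point, c: Point) -> bool:
    """Discard triangles that cannot produce any pixels under the assumptions above."""
    return signed_area_x2(a, b, c) == 0 or covers_no_sample(a, b, c)


print(should_discard((0, 0), (1, 1), (2, 2)))              # collinear points
print(should_discard((0.6, 0.6), (0.9, 0.6), (0.6, 0.9)))  # tiny, between pixel centres
print(should_discard((0, 0), (10, 0), (0, 10)))            # ordinary visible triangle
```

The bounding-box check is deliberately conservative: it only throws a triangle away when no sample could possibly be hit. With extreme tessellation factors a lot of triangles end up smaller than a pixel, which is exactly where a test like this would save work.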

So in a nutshell, Polaris should be able to throw out a ton of primitives very early if they are not visible, which means the GPU time saved can be used to draw the ones you can see a lot faster. Hence the potential FPS gains to be had in geometry-heavy workloads. Shall we test that?

You probably want to know what I'm testing these cards with. So here you go.

Ryzen 3 1200 3.1 GHz
2x8GB 2400 MHz CL14
MSI B350M Mortar
EVGA 500W Bronze
Radeon Driver Adrenalin 19.4.3
Tessellation set to Use Application settings

CPU isn't the fastest thing around but it gets the job done. And I didn't want to take apart my main PC; I've got that just perfect. Anyway, I made sure all tests were GPU bound so the results are accurate. The CPU was running at 3.1 GHz for Firestrike, Heaven and TessMark, but I clocked it up to 3.7 GHz for the gaming tests and Timespy. Oh, and the AMD driver will limit insane tessellation factors if left on default (you do remember Crysis 2, right? Yeah. This is for things like that). So I set that to use Application settings, so we can run these synthetics. Anyway, moving on~

Okay so I am setting all three cards to run a set frequency on the core, to allow us to see the per-clock gains of the architecture. That's 1000 MHz core, and I also set the memory speeds to produce the same raw bandwidth, of 192GB/s. Tahiti's wider bus meant I had to set its memory 2Gbps slower (4 vs 6) to achieve the same bandwidth. Just FYI. Anyway...
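For anyone who wants to check the bandwidth matching, the arithmetic is just bus width times per-pin data rate, divided by 8 to go from bits to bytes. A quick sketch follows; the 256-bit bus width for the 380X and RX 570 is not stated above and is filled in here from the cards' public specs, and the function name is just for illustration.

```python
def memory_bandwidth_gbs(bus_width_bits: int, data_rate_gbps: float) -> float:
    """Raw memory bandwidth in GB/s from bus width (bits) and per-pin data rate (Gbps)."""
    return bus_width_bits * data_rate_gbps / 8


# Tahiti: 384-bit bus at 4 Gbps. Tonga and Polaris: 256-bit bus (assumed from specs) at 6 Gbps.
cards = {
    "R9 280X (Tahiti XT)": (384, 4.0),
    "R9 380X (Tonga XT)": (256, 6.0),
    "RX 570 (Polaris 20 PRO)": (256, 6.0),
}

for name, (bus, rate) in cards.items():
    print(f"{name}: {memory_bandwidth_gbs(bus, rate):.0f} GB/s")
```

All three land on the same 192 GB/s, which is the point of running Tahiti's memory 2 Gbps slower.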

Squidgy Tessellated blob of geometry Death..

First up is TessMark. Now this is a pure synthetic test that tessellates the crap out of an object and gets the GPU to spew out a simply mental number of triangles. Over 11 million of them on 64X, actually. It's extremely light on all other GPU resources so it's basically a pure triangle-throughput test. If PDA is doing its job, we should see it here...

Well there you go. In a pure tessellation/primitive rate synthetic workload we can really see some pretty enormous gains between the generations. Tahiti's performance is essentially doubled by Tonga, which isn't really surprising when you consider it has literally double the geometry hardware. The more interesting thing is Polaris. It gains a huge 40% increase in frames per second rendered over Tonga, at the same clocks, with the same number of geometry engines. I think it's safe to say Primitive Discard is doing something here. As the tessellated blob rotates, some of those triangles are being obscured, and this thing prevents them from being rendered. That's a lot of saved GPU time.

Here's the same bench with 8X MSAA thrown in. Testing because AMD did a slide saying PDA scales with MSAA loads.

Okay so the gains are much lower here. I guess bottleneck shifted to Render backend or something. Tahiti is still left behind, though, with its measly 2 prim/clock throughput.

Next I'm testing Unigine Heaven, because it has very configurable Tessellation settings. I'm not doing the full bench because it wouldn't let me test with custom Tessellation settings (Factor, distance). So I just bench the scene that spins the camera around the Dragon Statue in the courtyard. Here's a couple of pictures with Tessellation toggled to show the massive increase in geometry load caused by these settings.

One pretty smooth Wireframe Dragon Statue

First, I'm testing without Anti-Aliasing.

Aside from the pretty enormous more-than-doubling of performance from Tahiti to Tonga, as you'd expect from the doubled hardware, the Polaris gains are much smaller. The improvements to the geometry processors are still doing something, but it's not quite as ideal a situation as the TessMark benchmark. I consider this a 'semi-synthetic' workload, as it's a gaming engine (that you can walk around in!) but no game developer in their right mind would tessellate the pants out of the environment like I set it to here (unless sponsored by NVIDIA lol). But still, it's a geometry test so...

Here's the same scene but with 8XAA thrown in just for academics.

Let's move on to everyone's favourite benchmark, 3DMark Firestrike. I used some custom settings for these runs that I couldn't fit on the chart title. So here they are:

Custom Run, Graphics Test 1
1920x1080
Texture Filter Trilinear
MSAA Sample Count (See graph)
Tessellation Detail Level 10
Tessellation Factor 64
Shadowmap Size 512
Surface Shadows 1
Volumetric Illumination Quality 1, Sample count 1.0
Particle Illumination Quality 1
Ambient Occlusion Quality 1
Depth of Field quality 1

The Tessellation factor of 64 will hopefully put a bit more geometry load on this usually-shader/compute heavy benchmark. Results time!

Even with MSAA Sample count set to 1, the gains here are pretty small. But more significant is the Tahiti -> Tonga improvement as you'd expect. But Firestrike is still very heavy on Compute throughput so you can expect a lot of the frames these GPUs are rendering are limited by something other than primitive-rate. Interesting to note that Polaris's CU can pre-fetch instructions and AMD touted a ~15% increase in shading performance best-case, so maybe that is helping here instead.

With 2 MSAA Samples, the three GPUs are very close. Not a lot of gains to be had in this benchmark so let's move on.

Timespy is an updated benchmark from 3DMark using DX12. It also has configurable tessellation settings which I'm using here on custom runs.

Custom Run, Graphics Test 1
1920x1080
Async enabled
Texture filter Trilinear
Max tessellation factor - see chart
Tessellation factor scale - see chart

Here Timespy is configured to use a tessellation factor of 64. It still doesn't have huge gains between the architectures (which shows why they are fairly limited in real-world gaming scenarios). That's because not everything is limited by geometry (though it is an issue for GCN in my opinion). Also, Primitive Discard can only really shine in scenes with lots of hidden geometry; if the scene is fairly large like here, you're going to be limited by raw throughput, because there's still a boat-load of visible geometry.

Let's turn the tessellation down a bit and have a look.

So Tahiti is lagging behind as expected. Also worth noting that Tahiti's implementation of Asynchronous compute is actually, last time I checked, not used. It also has worse DX12 feature support: as far as I know it only exposes feature level 11_1 under DX12, while Tonga and Polaris expose 12_0. So keep that in mind. Polaris shows no improvement over Tonga here, so I guess the bottleneck is not hidden geometry.

I tested two of my favourite games, to give a rough look at the gaming impact of these architectural differences. Both also make use of heavy tessellation, and in Metro Exodus's case, probably the most complex geometry in a current video game.

Exodus is up first. I tested in DX12 mode.

At 1920x1080, Tahiti barely keeps it over 30 FPS, even on Low settings. There are major gains from Tahiti to Tonga here. Exodus is running in DX12 in this test, so maybe some of Tahiti's drawbacks in hardware support are showing. But I also think raw primitive-rate is holding it back. Metro Exodus is probably the most heavily tessellated game you can play right now (and it looks gorgeous). Polaris has small gains but nothing massive.

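Switching the resolution down to 720p, we obviously get some higher framerates. What's interesting is Tonga gains around 27% from going down to 720p, while Tahiti only gains about 19%. Do you know what doesn't scale with resolution? Geometry. So the smaller gain suggests Tahiti is spending a bigger slice of each frame on work that doesn't shrink with the pixel count.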
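You can turn those two scaling numbers into a rough estimate of how much of the 1080p frame time does not depend on resolution. This is a toy model with loud assumptions: frame time is split into a fixed part plus a part that scales linearly with pixel count, and 720p means 1280x720. The fixed part lumps together everything that doesn't scale with resolution (geometry, but also driver and CPU overhead), so treat it as a ballpark only. The function name is made up for this sketch.

```python
def resolution_independent_share(fps_gain: float, pixel_ratio: float) -> float:
    """Fraction of the high-resolution frame time that does not scale with pixel count.

    Toy model: t_high = fixed + pixel_work and t_low = fixed + pixel_work / pixel_ratio,
    with fps_gain = t_high / t_low - 1. Solving for fixed / t_high gives the result.
    """
    inv = 1.0 / pixel_ratio
    return (1.0 / (1.0 + fps_gain) - inv) / (1.0 - inv)


PIXEL_RATIO = (1920 * 1080) / (1280 * 720)  # 2.25x fewer pixels at 720p (assumed 1280x720)

# FPS gains from 1080p to 720p quoted in the text above.
for name, gain in {"Tonga": 0.27, "Tahiti": 0.19}.items():
    share = resolution_independent_share(gain, PIXEL_RATIO)
    print(f"{name}: ~{share:.0%} of 1080p frame time is resolution-independent")
```

Under these assumptions Tonga comes out at roughly 62% and Tahiti at roughly 71%, which lines up with the idea that Tahiti has more fixed per-frame work, like geometry, holding it back.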

That's a huge 44.1% increase in per-clock, same-bandwidth performance between GCN1 and GCN4. (Keep in mind Polaris also has a much larger L2$, and shader prefetching, and Delta Colour Compression to optimise bandwidth use. Those are almost certainly contributing to these gains, so it's not all from geometry improvements. But I think a large chunk of it is, especially in this game).

Metro 2033 Redux is next, the redux version of probably the best looking game of 2010. Here everything is set to maximum settings except for Motion blur, SSAA and PhysX (done on CPU with Radeon, do you want to kill my R3 1200?).

So there you can see what sort of architectural gains can be had between these three similar (from top-level) core configs in a gaming scenario with tessellation and fairly complex geometry. Here we see an almost 30% increase in 'IPC' (don't hate me) between these three GPUs, each with 2048 Stream Processors, 128 Texture units and 32 ROPs, and at the same memory bandwidth.

Conclusion?

You want me to write a fancy conclusion? :D

Basically in very synthetic tests showing just pure tessellation/geometry throughput, we can see AMD hasn't been sitting doing nothing all this time. They've made some huge gains. But when we switch these workloads to more gaming-focused ones, we can see that the gains are much smaller. Geometry isn't everything in every frame, and the bottleneck shifts very quickly depending on what you're looking at, etc. Over the years games have been using more complex geometry (moving off PS3/360 helped a lot with ports), so these gains increased a bit, especially versus early tests of Tonga vs Tahiti that I have seen.

By far the biggest gains in sort-of-realistic 3D workloads are from Tahiti to Tonga. Which shows that brute-forcing the issue and doubling up on raw hardware is the best approach, right? Well, it's good to have both. As far as I'm aware GCN is limited to a quad-raster implementation and that's not going to change until AMD launches a brand-new uArch. So they've been trying to squeeze as much performance out of that quad-raster design ever since Hawaii first implemented it, back in 2013. And they've done well, considering. But this also highlights that other aspects of the GPU (Compute, Render backend, etc) are not showing a lot of improvement.

Oh, you want to see something funny? I tested my Radeon VII (GCN5, Vega) in TessMark at the same settings and clock speeds and my results proved that GCN's lop-sided Compute/geometry ratio isn't doing the bigger chips any favours (at least that's what I think).

I couldn't set the HBM2 speed low enough (it'd have to run very low; that 4096-bit bus is wide). So here you go, with Radeon VII pumping a terabyte per second of raw bandwidth, but the core at 1000 MHz like the other GPUs. That's an 8% gain in a pure synthetic tessellation/geometry benchmark over Polaris, an RX 570 actually. If this doesn't highlight an architecture imbalance then I don't know what does. Make of it what you will.
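For reference, here is a quick sketch of how low the HBM2 would have had to go to match the 192 GB/s used on the other three cards. It is the same bus-width arithmetic run in reverse. The 2.0 Gbps stock per-pin rate is taken from the card's public spec rather than from anything I measured, and the function name is just for illustration.

```python
def required_data_rate_gbps(target_bandwidth_gbs: float, bus_width_bits: int) -> float:
    """Per-pin data rate (Gbps) needed to hit a target bandwidth on a given bus width."""
    return target_bandwidth_gbs * 8 / bus_width_bits


HBM2_BUS_BITS = 4096     # bus width quoted above
STOCK_RATE_GBPS = 2.0    # assumed stock Radeon VII per-pin rate (public spec)

needed = required_data_rate_gbps(192, HBM2_BUS_BITS)
stock_bandwidth = HBM2_BUS_BITS * STOCK_RATE_GBPS / 8

print(f"Needed for 192 GB/s: {needed} Gbps per pin")
print(f"Stock bandwidth: {stock_bandwidth:.0f} GB/s")
```

That works out to 0.375 Gbps per pin, less than a fifth of the stock rate, which is why matching bandwidth wasn't an option for this card.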

Thanks for reading! ^-^
