(GPU) AMD Radeon RX 6900 XT

(Profile updated as of 16/12/2021)

AMD's return to ultra-high-end graphics cards, and its first card with hardware Ray Tracing: this one is a very special entry for me. It's also a sad one, since I only owned this card for a week or so before accidentally breaking it (long story), but in that short time it proved to be a truly impressive piece of hardware. So, of course, I had to write a profile for it. Here it is.

Graphics Card Information

Graphics Card: AMD Radeon RX 6900 XT

Graphics Card Manufacturer: Advanced Micro Devices

Graphics Card Release Date: December 2020

Graphics Card MSRP: $999 USD

Graphics Processor Codename: "Navi 21"

Graphics Processor Manufacturer: Advanced Micro Devices

Graphics Processor Implementation: Full Die

Graphics Interface: PCI-E 16x Gen4

Architecture: RDNA2

Lithography Process: TSMC 7nm (N7P) FinFET

Approximate die size: 520mm²

Sasha's GPU die Size Rating: large

Approximate Transistor Count: 26,800 Million

Approximate Transistor Density: 51 Million / Square Millimetre
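As a quick sanity check on the density figure, dividing the approximate transistor count by the approximate die area reproduces it. This is a minimal sketch using only the two numbers listed above; the constant and function names are my own.

```python
# Approximate figures taken from this profile.
DIE_AREA_MM2 = 520             # die size in mm²
TRANSISTORS_MILLIONS = 26_800  # transistor count, in millions


def density_mtr_per_mm2(transistors_millions: float, area_mm2: float) -> float:
    """Transistor density in millions of transistors per square millimetre."""
    return transistors_millions / area_mm2


density = density_mtr_per_mm2(TRANSISTORS_MILLIONS, DIE_AREA_MM2)
print(f"{density:.1f} million transistors / mm²")  # 51.5 million transistors / mm²
```

The result (about 51.5) rounds down to the "51 Million" figure above.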

GPU Features

Double-speed FP16 Shading: Yes (Rapid Packed Math)

Asynchronous Compute Capability: Full

DirectX Hardware Support: DX12 Ultimate (FL 12_2)

Dedicated DXR Acceleration on chip: Yes

Variable-rate Shading: Yes

Adv. Geometry shading: Yes (Primitive Shader Triangle Culling)

Adv. Geometry shading (Programmable/DX12 Mesh Shaders): Yes

AI/ML Acceleration: No

Advanced Memory Management: No

Integer and Float Shader Co-execution: No

Tile-based Renderer: No
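"Rapid Packed Math" doubles FP16 throughput by treating one 32-bit register as two 16-bit halves, so a single instruction works on both at once. The sketch below only illustrates that data layout in software, using the half-precision format code (`"e"`) of Python's `struct` module. The low-half/high-half lane order and the helper names are assumptions for illustration; it says nothing about how the hardware actually implements the operation.

```python
import struct
from typing import Tuple


def pack_half2(a: float, b: float) -> int:
    """Store two values as IEEE 754 half-precision floats in one 32-bit word.

    Assumed lane order: `a` in the low 16 bits, `b` in the high 16 bits.
    """
    raw = struct.pack("<ee", a, b)       # 2 x 16-bit halves = 4 bytes
    return struct.unpack("<I", raw)[0]   # reinterpret the 4 bytes as one uint32


def unpack_half2(word: int) -> Tuple[float, float]:
    """Split a 32-bit word back into its two half-precision lanes."""
    return struct.unpack("<ee", struct.pack("<I", word))


def packed_fma(x: int, y: int, z: int) -> int:
    """Lane-wise x * y + z on two packed FP16 pairs.

    Each lane is computed in double precision and rounded once to FP16 when
    repacked, so one call updates both lanes, as a packed instruction would.
    """
    lanes = zip(unpack_half2(x), unpack_half2(y), unpack_half2(z))
    return pack_half2(*(xi * yi + zi for xi, yi, zi in lanes))


word = pack_half2(1.5, -2.0)
print(hex(word))  # 0xc0003e00
```

For example, `packed_fma(pack_half2(1.5, 2.0), pack_half2(2.0, 3.0), pack_half2(0.5, 1.0))` unpacks to `(3.5, 7.0)`: two multiply-adds from one call.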

GPU Computing Resources

GPU Substructures: 4 Shader Engines, 40 Workgroup Processors

Graphics Cores: 80 Compute Units

Graphics Cores per Substructure: 4 Shader Engines with 10 Workgroup Processors each

Total Stream Processors (ALU/Shaders): 5120

Stream Processors per Graphics Core: 64

Graphics Core SIMD Structure: 2 x 32

Total Special Execution Units: 160 Scalar Units *

Special Execution Units per Graphics Core: 2 Scalar Units per Compute Unit, 4 per Workgroup Processor *

Total Texturing Units: 320

Texturing Units per Graphics Core: 4 (Texture filter)

Pixel Pipelines (ROPs): 128 (16 x Render Backend with 8 Pixels per clock)

Level 2 shared on-chip cache: 4096 KiB

Level 3 shared on-chip cache: 131,072 KiB (128 MiB Infinity Cache)

Geometry/Tessellation Processors: 1 Geometry Engine, 4 Primitive Units (Including Tessellation)

Raster Engines: 4 Primitive Units (Including Rasterisation)
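The counts above nest multiplicatively: Shader Engines contain Workgroup Processors, which contain Compute Units, which contain SIMDs. The sketch below rebuilds the headline totals from the per-unit figures listed in this profile; the class and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Navi21Layout:
    """Hypothetical description of Navi 21's execution hierarchy (figures from this profile)."""
    shader_engines: int = 4
    wgps_per_engine: int = 10      # Workgroup Processors per Shader Engine
    cus_per_wgp: int = 2           # Compute Units per Workgroup Processor
    simds_per_cu: int = 2          # the "2 x 32" SIMD structure
    lanes_per_simd: int = 32
    tmus_per_cu: int = 4           # texture filter units per CU
    scalar_units_per_cu: int = 2

    @property
    def workgroup_processors(self) -> int:
        return self.shader_engines * self.wgps_per_engine

    @property
    def compute_units(self) -> int:
        return self.workgroup_processors * self.cus_per_wgp

    @property
    def stream_processors(self) -> int:
        return self.compute_units * self.simds_per_cu * self.lanes_per_simd

    @property
    def texture_units(self) -> int:
        return self.compute_units * self.tmus_per_cu

    @property
    def scalar_units(self) -> int:
        return self.compute_units * self.scalar_units_per_cu


navi21 = Navi21Layout()
print(navi21.workgroup_processors, navi21.compute_units,
      navi21.stream_processors, navi21.texture_units, navi21.scalar_units)
# 40 80 5120 320 160
```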

GPU Memory Subsystem

Graphics Memory Type: GDDR6

Graphics Memory Standard Capacity: 16,384 MB

Graphics Memory Composition: 8 x 2048 MB GDDR6 DRAM chips

Graphics Memory Access Granularity: 32-bit (4 bytes)

Graphics Memory Standard Clock Speed / Data Rate: 2000 MHz / 16,000 MT/s (16 Gbps per pin)

Graphics Memory Full Interface Width: 256-bit (32 bytes per transfer)

Graphics Memory Peak Memory Bandwidth: 512 GB/s
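Peak bandwidth is simply the bus width in bytes multiplied by the per-pin data rate. A minimal sketch, assuming the 16 Gbps effective data rate above and a 32-bit interface per GDDR6 chip (which is what makes 8 chips add up to 256 bits):

```python
GDDR6_CHIPS = 8
CHIP_CAPACITY_MB = 2048
CHIP_INTERFACE_BITS = 32        # assumption: x32 interface per GDDR6 chip
DATA_RATE_GBPS_PER_PIN = 16.0   # 16,000 MT/s effective


def peak_bandwidth_gb_s(bus_width_bits: int, data_rate_gbps: float) -> float:
    """Bytes moved per transfer across the whole bus, times billions of transfers per second."""
    return (bus_width_bits / 8) * data_rate_gbps


bus_width = GDDR6_CHIPS * CHIP_INTERFACE_BITS   # 256-bit
capacity_mb = GDDR6_CHIPS * CHIP_CAPACITY_MB    # 16,384 MB
bandwidth = peak_bandwidth_gb_s(bus_width, DATA_RATE_GBPS_PER_PIN)
print(bus_width, capacity_mb, bandwidth)  # 256 16384 512.0
```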

GPU Frequency and Peak performance

Graphics Engine Clock: 2250 MHz *

GPU Computing Power FP16: 46,080,000 Million operations per second (FMA)

GPU Computing Power FP32: 23,040,000 Million operations per second (FMA)

GPU Computing Power FP64: 1,440,000 Million operations per second (FMA)

GPU Texturing Rate INT8: 720,000 Million Texels per second

GPU Texturing Rate FP16: 720,000 Million Texels per second

GPU Pixel Rate: 288,000 Million Pixels per second

GPU Primitive Rate: 9,000 Million triangles per second rasterised (out). 18,000 Million triangles per second into the pipeline before Hardware culling (in). *
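Every peak figure in this section is just a unit count multiplied by the 2250 MHz engine clock, with FMA counted as two operations. FP32 works out to 5120 × 2 × 2250 = 23,040,000 million operations per second, and double-rate packed FP16 to twice that. In the sketch below, the 1/16 FP64 ratio is inferred from the FP64 figure in this profile, and the "double in" triangle rate follows my understanding described in the notes; the names are my own.

```python
CLOCK_MHZ = 2250          # engine clock used throughout this profile
STREAM_PROCESSORS = 5120
TEXTURE_UNITS = 320
ROPS = 128
PRIMITIVE_UNITS = 4


def fma_mops(alus: int, clock_mhz: int, rate: float = 1.0) -> float:
    """Million operations per second, counting one FMA as two operations.

    `rate` scales relative to FP32: 2.0 for packed FP16, 1/16 for FP64.
    """
    return alus * clock_mhz * 2 * rate


peak = {
    "fp16": fma_mops(STREAM_PROCESSORS, CLOCK_MHZ, 2.0),
    "fp32": fma_mops(STREAM_PROCESSORS, CLOCK_MHZ),
    "fp64": fma_mops(STREAM_PROCESSORS, CLOCK_MHZ, 1 / 16),
    "texels": TEXTURE_UNITS * CLOCK_MHZ,
    "pixels": ROPS * CLOCK_MHZ,
    "triangles_out": PRIMITIVE_UNITS * CLOCK_MHZ,
    "triangles_in": 2 * PRIMITIVE_UNITS * CLOCK_MHZ,  # "double in" rasteriser input
}

for name, value in peak.items():
    print(f"{name:>14}: {value:,.0f} million per second")
```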

GPU Thermal and Power

Standard Cooling Solution: Triple-Fan Axial Cooler

Typical Board Power: 300W

Maximum Board Power: 340W

Standard External Power Connectors: 8 + 8 pin

Maximum Allowed Junction Temperature (TJ Max): 105°C

Graphics Card description

The Radeon RX 6900 XT marks AMD's return to the ultra-high-end graphics card segment after many years of offering solid, but not quite flagship-class, graphics processors. It is also AMD's first foray into hardware-accelerated Ray Tracing, a feature first introduced by NVIDIA's "Turing" architecture (20-series) in 2018.

The card features the full implementation of the "Navi 21" graphics processor, based on the "RDNA2" graphics microarchitecture. The processor builds on the fundamental core design of the RDNA1-based Navi 10 processor, adding more functional units as well as new technologies such as the industry's first large L3 cache on a dedicated GPU. In terms of specification, Navi 21 essentially doubles Navi 10's execution units, with a total of 80 "Compute Units" arranged in 40 Workgroup Processors, for an impressive 5120 shader ALU pipelines. Perhaps most impressive, though, is the clock speed they run at: over 2.2 GHz, which indicates that AMD has invested significant engineering time in optimising the physical layout of the circuit for higher clock rates without large increases in voltage or current.

Another notable feature of Navi 21 compared to its predecessor is the massive 128 MiB on-chip L3 cache, which AMD calls "Infinity Cache". In my understanding, the primary goal of this large on-chip memory buffer is to store the framebuffer (the frame's pixel colour information) on-chip, allowing the Render Outputs to work directly from the cache rather than from the dedicated off-chip video memory (which incurs performance and power costs), and without relying on a potentially slower tiled renderer to keep it in the L2 cache, as NVIDIA's GPUs have done since Maxwell's first GPU, the GM107 on the GTX 750 Ti. This frees up significant bandwidth for other operations such as shader compute, and so allows Navi 21 to achieve high performance with a narrower memory interface, in this case "only" 256 bits wide, saving cost, complexity and power in the process.
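To get a feel for why 128 MiB can hold a frame's main render targets, the sketch below computes their raw sizes at common resolutions. The bytes-per-pixel values are assumptions (4 bytes for an RGBA8 colour target, 8 for RGBA16F, 4 for 32-bit depth), and it ignores everything else competing for the cache (textures, compression, hit rates), so it is a back-of-the-envelope check rather than a model of how Infinity Cache actually behaves.

```python
MIB = 1024 * 1024
INFINITY_CACHE_MIB = 128

RESOLUTIONS = {
    "1080p": (1920, 1080),
    "1440p": (2560, 1440),
    "4K": (3840, 2160),
}


def render_target_mib(width: int, height: int, bytes_per_pixel: int) -> float:
    """Raw, uncompressed size of a render target in MiB."""
    return width * height * bytes_per_pixel / MIB


def fits_in_cache(width: int, height: int, bytes_per_pixel: int,
                  cache_mib: float = INFINITY_CACHE_MIB) -> bool:
    """True if the raw target would fit in the cache on its own (ignores all other traffic)."""
    return render_target_mib(width, height, bytes_per_pixel) <= cache_mib


# Assumed per-pixel budgets: RGBA8 colour (4 B) + 32-bit depth (4 B),
# and an HDR variant with RGBA16F colour (8 B) + 32-bit depth (4 B).
for name, (w, h) in RESOLUTIONS.items():
    sdr = render_target_mib(w, h, 4 + 4)
    hdr = render_target_mib(w, h, 8 + 4)
    print(f"{name:>6}: colour+depth {sdr:6.1f} MiB, HDR colour+depth {hdr:6.1f} MiB")
```

Even the 4K HDR case comes to roughly 95 MiB, under the 128 MiB capacity.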

From the block diagram, Navi 21 doesn't increase the Rasteriser or Primitive Unit count; this remains at 4 for both, with a single centralised Geometry Engine as in the previous RDNA1 implementation. However, Navi 21 implements AMD's "Next Generation Geometry / Fast Path" shader-based culling system to drastically improve geometry handling performance at a low level (in concept, this is essentially what Vega's fabled "Primitive Shaders" were supposed to do). This is in addition to carrying over the Rasteriser's "double in" feature, where it can accept 2 triangles while drawing just one, potentially saving time if one of the triangles has no visible pixels and can be discarded.
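As a minimal sketch of the kind of test a culling stage performs before rasterisation, the code below discards back-facing, zero-area and fully off-screen triangles, so only the survivors reach the rasteriser. This is a generic software illustration, not AMD's NGG implementation; it assumes y-up screen coordinates with counter-clockwise front faces, and the function names are my own.

```python
from typing import List, Tuple

Vec2 = Tuple[float, float]
Triangle = Tuple[Vec2, Vec2, Vec2]


def signed_area(tri: Triangle) -> float:
    """Half the 2D cross product of two edges; positive for counter-clockwise winding."""
    (x0, y0), (x1, y1), (x2, y2) = tri
    return 0.5 * ((x1 - x0) * (y2 - y0) - (x2 - x0) * (y1 - y0))


def is_culled(tri: Triangle, width: float, height: float) -> bool:
    """Reject back-facing or zero-area triangles, and ones entirely off-screen."""
    if signed_area(tri) <= 0.0:   # clockwise (back-facing) or degenerate
        return True
    xs = [x for x, _ in tri]
    ys = [y for _, y in tri]
    # Bounding box completely outside the viewport on any side.
    return max(xs) < 0 or min(xs) > width or max(ys) < 0 or min(ys) > height


def cull(tris: List[Triangle], width: float, height: float) -> List[Triangle]:
    """Keep only the triangles that still need to be rasterised."""
    return [t for t in tris if not is_culled(t, width, height)]
```

Feeding two triangles in and getting one out, e.g. `cull([front, back], 1920, 1080)` where `back` has clockwise winding, mirrors the "two in, one drawn" idea.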

Furthermore, the Render Backends have been doubled up to handle 8 pixels per clock per backend, compared to 4 on Navi 10. So while Navi 21 keeps 16 backends, the GPU can now handle 128 pixels per clock ("ROPs") instead of 64: a significant increase in pixel throughput, likely made practical by the Infinity Cache.

The Graphics Cores themselves remain similar to the RDNA1 cores, but with several major changes. One thing to note is that, in my understanding, the RDNA2 CU has a slightly lower instruction-handling throughput than RDNA1, likely as a result of transistor optimisation or to allow higher clock speeds, but this doesn't drastically reduce per-CU, per-clock performance. Perhaps the biggest change to the Graphics Core is the addition of dedicated, fixed-function ray intersection logic, which AMD calls "Ray Accelerators". Similar in principle to the "RT Cores" on NVIDIA's Turing and Ampere architectures, these logic blocks let the GPU quickly check whether a ray intersects a triangle in a mesh (and, as a result, affects the colour of the associated pixel on screen), enabling much faster real-time Ray Tracing. This works by organising the scene's geometry into a tree structure called a Bounding Volume Hierarchy (BVH). The BVH contains nested "boxes", each enclosing a group of triangles from the mesh. For each ray cast into the scene, the Ray Accelerator runs fast, hardware-accelerated tests to find which boxes the ray passes through, and then which triangle within those boxes it hits; on RDNA2 the walk through the hierarchy itself is steered by shader code, which also uses the final result to alter the colour of the pixel, producing realistic lighting effects. The same process is possible with GPGPU (shader-only) operations, but is orders of magnitude faster with dedicated hardware. AMD's Ray Accelerators are comparable in performance to Turing's RT cores, but are significantly behind Ampere's implementation in raw Ray Tracing performance.
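The following is a CPU-side sketch of the general technique described above: a slab test for ray-versus-box, the well-known Möller-Trumbore ray-versus-triangle test, and a simple stack-based BVH walk that returns the nearest hit distance. It is an illustration under my own assumptions, not AMD's implementation: the `BVHNode` layout is hypothetical (real drivers build BVHs in an opaque, hardware-specific format), and on RDNA2 only the box and triangle tests run on the Ray Accelerators, with the traversal loop itself running as shader code.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Vec3 = Tuple[float, float, float]
Tri = Tuple[Vec3, Vec3, Vec3]


def sub(a: Vec3, b: Vec3) -> Vec3:
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def dot(a: Vec3, b: Vec3) -> float:
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def cross(a: Vec3, b: Vec3) -> Vec3:
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def ray_hits_box(origin: Vec3, direction: Vec3, lo: Vec3, hi: Vec3) -> bool:
    """Slab test: does the ray (t >= 0) pass through the axis-aligned box [lo, hi]?"""
    t_near, t_far = 0.0, float("inf")
    for axis in range(3):
        if direction[axis] == 0.0:
            # Ray parallel to this pair of planes: it must already lie between them.
            if not lo[axis] <= origin[axis] <= hi[axis]:
                return False
            continue
        t0 = (lo[axis] - origin[axis]) / direction[axis]
        t1 = (hi[axis] - origin[axis]) / direction[axis]
        if t0 > t1:
            t0, t1 = t1, t0
        t_near, t_far = max(t_near, t0), min(t_far, t1)
        if t_near > t_far:
            return False
    return True


def ray_hits_triangle(origin: Vec3, direction: Vec3, tri: Tri,
                      eps: float = 1e-9) -> Optional[float]:
    """Möller-Trumbore: return the hit distance t, or None on a miss."""
    v0, v1, v2 = tri
    e1, e2 = sub(v1, v0), sub(v2, v0)
    p = cross(direction, e2)
    det = dot(e1, p)
    if abs(det) < eps:               # ray parallel to the triangle's plane
        return None
    inv_det = 1.0 / det
    s = sub(origin, v0)
    u = dot(s, p) * inv_det          # first barycentric coordinate
    if u < 0.0 or u > 1.0:
        return None
    q = cross(s, e1)
    v = dot(direction, q) * inv_det  # second barycentric coordinate
    if v < 0.0 or u + v > 1.0:
        return None
    t = dot(e2, q) * inv_det
    return t if t > eps else None


@dataclass
class BVHNode:
    """Hypothetical node: a bounding box plus child nodes and/or leaf triangles."""
    lo: Vec3
    hi: Vec3
    children: List["BVHNode"] = field(default_factory=list)
    triangles: List[Tri] = field(default_factory=list)


def closest_hit(root: BVHNode, origin: Vec3, direction: Vec3) -> Optional[float]:
    """Stack-based traversal: box tests prune subtrees, triangle tests run at the leaves."""
    best: Optional[float] = None
    stack = [root]
    while stack:
        node = stack.pop()
        if not ray_hits_box(origin, direction, node.lo, node.hi):
            continue                 # the ray misses this box, so skip everything inside it
        for tri in node.triangles:
            t = ray_hits_triangle(origin, direction, tri)
            if t is not None and (best is None or t < best):
                best = t
        stack.extend(node.children)
    return best
```

The box test is what makes this fast: a single miss on a large box skips every triangle beneath it, which is why the hierarchy is worth building at all.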

Unlike Turing or Ampere, RDNA2 doesn't implement wide matrix-math hardware units for AI / neural-net acceleration. Instead, AMD opts for a fast shader-based approach to upscaling with FidelityFX Super Resolution, which can harness some of Navi 21's extremely high half-precision float shader performance (around 46 TFLOPS).

Graphics Card approximate 3D Performance

Sasha's gaming performance rating (2021): Great for 1440p high refresh (144 FPS+) or 4K 60 FPS+ with maximum settings, minus Ray Tracing. 1080p 60 FPS+ with full Ray Tracing enabled.

The Radeon RX 6900 XT is one of the fastest graphics cards available at the time of writing, which is refreshing, since the top spot has almost always belonged to NVIDIA since the 2013-2015 era. "Navi 21" achieves performance comparable to NVIDIA's almost fully enabled "GA102" graphics processor in the RTX 3090 flagship, but uses less power thanks to a vastly more efficient memory subsystem, and has fewer transistors to boot. The RX 6900 XT is often faster in GPU-bound scenarios at high frame rates and lower resolutions such as 1080p or 1440p, where superior geometry processing performance (likely enabled by the large L3 cache and higher clock rates) can fill out more triangles, and there's less emphasis on the raw brute force of Ampere's huge number of FP32 shader processors. At 4K, the RX 6900 XT often trails the 3090, but not by much; again, an impressive feat for a much leaner graphics chip. The only area where AMD is not quite competitive at a similar performance level is Ray Tracing: the RX 6900 XT falls far behind the GA102-based cards in almost all games using the technology, landing somewhere around the RTX 3060 Ti or 3070 in the worst cases, though there are a few exceptions where it can manage RTX 3080 performance. Considering this is AMD's first-generation Ray Tracing implementation versus NVIDIA's second (with all the experience NVIDIA has in tuning its software and drivers for performance), I don't see this as a huge technological disadvantage, though it is one worth mentioning.

Navi 21 lacks any on-chip AI-acceleration hardware, but AMD's shader-based FidelityFX Super Resolution (FSR) puts up a respectable showing against NVIDIA's hardware-accelerated DLSS, which helps level the playing field in supported games.

Notes

Special Execution Units:

I am currently unsure of the number of Load/Store Units in the Navi CU. I think it remains the same as GCN, at 16, but I will not list them here until I have solid confirmation, as the new block diagrams for Navi don't show LSUs at all.

Graphics Engine Clock:

The RX 6900 XT will boost above this under light or intermittent load, often peaking over 2300 MHz, or over 2400 MHz on aftermarket models.

GPU Primitive Rate:
This is based on my understanding of Navi's Primitive Shading implementation for fast hardware culling.

Misc.
This bit is for my personal opinion on this Graphics card / Graphics processor

I love this chip. It represents such a monumental shift in GPU design, and it feels more innovative to me than Turing's RT cores, especially since RDNA2 brings that technology, too. AMD took aim at NVIDIA's latest and most powerful GPU, the GA102, and they've matched it: in a single generation AMD has doubled its performance and gone from competing with the "xx70" class right to the top, even surpassing the newly reintroduced "xx90" class in many cases, while using less power on a smaller die with fewer transistors, no less. From an engineering perspective this is remarkable, and it proves (along with Zen) that AMD has incredibly talented engineers and extremely competent management. I just wish my time with Navi 21 could have been longer.

Sasha's Awesomeness Rating: Epic.