Sash Thought: The Curious Case of TIM-based Performance Variation (UPDATE)
Updated: Mar 8, 2020
Something weird happened. My RX 590 with the graphite thermal pad suddenly lost its performance again and reverted to the level it had with the paste application. A reboot resolved it. I think my RX 590 is "Special". That doesn't change the previous experiences I've had, so the original post remains the same.
Here's a small thought I wanted to share. This is not an 'issue' I'm new to. It's happened before, several times, on various graphics cards I've owned, from an RX 480, through 590, to GTX 1070 Ti and RX Vega 56, 64 and Radeon VII.
That 'issue' is the repeatable phenomenon of performance variation depending on the thermal material used to create a thermal bond between the silicon die of the processor and its cooling solution.
I won't babble too much, as I want to get straight to the point (lol). The long and the short of it is that sometimes when I apply a thermal compound (paste) to a GPU, the performance takes a small but repeatable hit, despite reported GPU temperatures being consistent and acceptable between each run.
Actually, one such time resulted in poor GPU utilisation and stuttering on an RX 480, and reapplying the paste resolved the issue. The temperature reported by the GPU was actually lower in the first run, likely due to the low utilisation.
Recently, I sold my RTX 2070 and started using my trusty RX 590 again, after I had liquid cooled it. Well, fast forward to today and Sash changed his mind again and wanted to put the stock air cooler back on it (I have a love/hate relationship with liquid cooling...). So, I did just that, and pasted the GPU using my usual method, which involves spreading a thin layer of thermal paste (in this case MX-4) over the surface of the GPU die.
If you'll excuse the utter mess I make doing this (which probably doesn't even help...), the paste is non-conductive (obviously), and this method has served me very well the countless times I've had to reapply thermal compound on a naked-die GPU.
Get to the point.
Anyway, I'll get to the point. I had successfully liquid cooled this RX 590 and managed 1650 MHz at 1.2 V, with sub-40°C GPU temperatures under load (my room is cold), and scored some pretty impressive results. However, I couldn't quite achieve the same performance as some other - lower clocked - results in the 3DMark database. By the way, I am using normal Firestrike as a baseline for comparisons here because it's easily repeatable.
So I decided to take the liquid cooler off. I used the same paste method, got everything installed, and the card worked fine in games. Temperatures were good and the performance seemed OK... until I started up Fallout 76. Now, this is hardly the shining example of a well performing game, but something felt 'off'.
By the way, my hands are cold and I'm really hungry. Having a cold room is great for my WCG server farm, but not so great for the squishy organic bit that maintains them. Sorry if I make mistakes or seem like I'm in a hurry to go make some warm instant noodles.
So, in a burst of anxiety, as is often the case for Sash, I fired up Firestrike (ha, a Pun is detected). And lo and behold, my Firestrike score was bordering on 10% worse at the same clocks than the scores I'd set just days before with the liquid cooler.
One thing I noticed is that the GPU wasn't getting very hot, and the power usage was quite low in certain games. This suggests an I/O or memory controller (IMC) issue, and I'll get to some further evidence for that in a moment...
So I re-seated the cooler, reapplying the paste several times using the method above, to no avail. The score wouldn't increase. Then, on the final attempt, I realised I hadn't taken the plastic cover off the thermal pads I used on the GDDR5 memory (duh...). I fixed that, but the score only went up slightly: an increase flirting with the margin of error (1%) but still measurable.
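For anyone wanting to check whether a score difference like this is real or just run-to-run noise, here is a minimal sketch in Python. It assumes you have at least two benchmark scores per cooler configuration; the function names, the 1% noise floor default, and the example scores are all made up for illustration and are not my actual results.

```python
from statistics import mean, stdev


def relative_change(baseline_runs, candidate_runs):
    """Percent change of the candidate mean relative to the baseline mean."""
    base = mean(baseline_runs)
    return (mean(candidate_runs) - base) * 100.0 / base


def run_to_run_noise(runs):
    """Sample standard deviation as a percentage of the mean.

    Needs at least two runs, otherwise stdev() raises StatisticsError.
    """
    return stdev(runs) * 100.0 / mean(runs)


def is_measurable(baseline_runs, candidate_runs, noise_floor_pct=1.0):
    """True if the change is larger than the noise of either set of runs.

    noise_floor_pct is an assumed minimum margin of error (1% here).
    """
    noise = max(
        noise_floor_pct,
        run_to_run_noise(baseline_runs),
        run_to_run_noise(candidate_runs),
    )
    return abs(relative_change(baseline_runs, candidate_runs)) > noise


if __name__ == "__main__":
    # Hypothetical scores for illustration only.
    liquid_runs = [15000, 15050, 14980]
    paste_runs = [13600, 13550, 13620]
    print(f"change: {relative_change(liquid_runs, paste_runs):.1f}%")
    print("measurable:", is_measurable(liquid_runs, paste_runs))
```

Averaging several runs per configuration is the important part: a single run that lands 1% higher tells you very little on its own.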
So what gives?
Finally I took the card apart again, and this time thought about my previous experience with a Polaris-based RX 480, which had a more severe problem after a re-paste despite really good reported temperatures. Basically, after the re-paste, that card would stutter like crazy in all games, the video memory wouldn't even clock up properly, core utilisation would be very low, and power usage uncharacteristically low.
GDDR5 or memory controller hotspot? Perhaps...
Graphite Pads are cool
I had been using graphite thermal pads on some of my CPUs for a while, due to re-usability and cleanliness (also, they aren't sticky, so you can't pull the AM4 CPU out of the socket with a stubborn cooler that won't twist and lift... which has happened so many times).
One thing I notice with graphite pads is that mounting pressure is even more important than it usually is with paste, especially with a naked die like this. So I tighten it up until the screws stop. There's little chance of cracking the die if the pressure is evenly distributed (i.e., don't do one side tight then the other; do alternating screws diagonally, each a couple of rotations at a time, then the next pair, and so on, until the cooler is fully seated).
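The diagonal tightening order can be written down as a tiny Python sketch. It assumes the screws are numbered 0 to 3 going around the die, so screw i sits diagonally opposite screw i + 2. The function names and the fixed turn count are made up for illustration; in practice I just keep going until the screws stop.

```python
def cross_pattern(num_screws=4):
    """Return a screw order where each screw is followed by its opposite.

    Screws are assumed to be numbered consecutively around the die.
    """
    if num_screws < 2 or num_screws % 2 != 0:
        raise ValueError("expected an even number of screws")
    half = num_screws // 2
    order = []
    for i in range(half):
        order.extend([i, i + half])  # a screw, then its diagonal opposite
    return order


def tightening_steps(total_turns, turns_per_pass=2, num_screws=4):
    """Yield (screw, turns) steps, a few turns per screw on each pass."""
    remaining = total_turns
    while remaining > 0:
        turns = min(turns_per_pass, remaining)
        for screw in cross_pattern(num_screws):
            yield screw, turns
        remaining -= turns


if __name__ == "__main__":
    # total_turns=6 is a hypothetical value, not a measured one.
    for screw, turns in tightening_steps(total_turns=6):
        print(f"screw {screw}: {turns} turn(s)")
```

The point of doing a couple of turns per pass is that no corner ever ends up much tighter than its opposite, which keeps the pressure on the die even.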
My experience with these graphite pads is that they are usually the same as or slightly worse than decent thermal paste in practice (on AM4 CPUs with an IHS), despite significantly higher thermal conductivity on paper. That said, they have innate advantages: they are re-usable, non-sticky, and give even distribution every time.
So I fitted one to the RX 590's die. It was a success, because my score went back up (it even exceeded the liquid-cooled result at the same clocks, which also used thermal paste), and the temperatures, while a tad higher (anecdotal, as my room ambient might well have increased in the meantime), were perfectly acceptable.
Here is a comparison between the paste application score, and the graphite pad (on the left).
Conclusion and Speculation
I was going to type a bit about other GPUs on which I have experienced similar issues, such as my various Vega-based cards and even a GTX 1070 Ti, but for now I'll just conclude the post because it's already super long and I'm babbling again.
Essentially, all silicon processors will have hot-spots in the die. Some GPUs even expose a hot-spot sensor (such as Vega and Navi); others do not. Most GPUs (pre-Vega, and most NVIDIA) likely report a single temperature for the die, not the actual temperature of the various parts of logic inside the silicon itself.
My only conclusion is that slightly poor paste application (in my case, likely over-application) is causing hot spots in certain parts of the chip that might trigger internal throttling, which reduces performance in certain workloads. Or it causes cache errors that are corrected by internal ECC at the cost of extra cycles. Something like that.
I believe it has something to do with the memory controller, due to the issues with VRAM I had with the RX 480.
Ultimately, I might change my paste method, or just use a graphite thermal pad for GPUs, since I've had success with this one. That said, Polaris 30 is tiny, and the issue of hotspots gets worse for larger dies with more transistors.
Thanks for reading my garbage. NOODLE TIME! <3