Tech Babble #11: Simultaneous Multi-Threading. And how does it affect World Community Grid?
Updated: Jun 18, 2021
Simultaneous Multi-Threading, or SMT (Intel calls this 'Hyper-Threading'), is a technology that enables a single (usually wide) core to 'work on' multiple logical software threads simultaneously, hence the name. SMT is not limited to just two threads per core, as we see in most modern x86-64 CPUs like AMD's Zen and Intel's Skylake. However, two threads is the most effective arrangement for these designs.
Modern high-performance x86 cores are very wide. There are multiple execution resources within each CPU "core", often in parallel, allowing the core to execute instructions out of the order in which they were given and so extract Instruction Level Parallelism. That is to say, the core can decode future instructions (within a single thread), determine which ones are not dependent on earlier results, and execute them out of order alongside other non-dependent instructions. It then has the result 'on-hand' when the program thread requests it. This increases IPC (instructions per clock).
SMT is a broader, more top-level way of increasing parallelism inside the CPU core. Not all the instructions being fed into the core can be made to run in parallel; many require inputs based on the outputs of previous instructions. This is a dependency, and in this situation you simply can't produce an accurate result without first completing the instructions it depends on.
SMT exploits these "gaps" in core execution, where some of the other pipelines within the core are idle because of stalls caused by dependent instructions, among other things. Essentially, the core is able to 'juggle' two software threads (two separate streams of instructions) that, by the very nature of multi-threading, are often independent (though thread sync/dependency is not uncommon, hence the difficulty of multi-threading). In a very broad sense, SMT allows the resources of the core that are not being used by the first thread to be used by the second thread, and vice versa. That reduces idle time and improves overall throughput by keeping more parts of the core busy.
A broad example of this: if thread A was integer heavy and thread B was FPU heavy, then on a core like Zen (see above) each thread could theoretically execute large portions of its code in parallel on the separate engines of the core, in my understanding.
Because the resources are still shared, there is no doubling of performance, as someone less informed could be forgiven for thinking given the doubling of threads available to the operating system. The performance gain from SMT is usually somewhere between nil (0%) and about 40%, depending entirely on the code from each thread.
SMT technically reduces single-thread performance in situations where all threads are fully loaded, because the two threads per core are actually competing for execution resources. In this situation, as long as your program scales well across threads, running twice as many of them is what delivers the 30-40% upper gain from SMT stated above. However, modern CPU cores are very intelligent with scheduling, and in situations where a single thread is a priority for execution, such as in a game, the core can juggle all that internally to maintain 1T performance despite having other threads running.
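To put rough numbers on that trade-off, here is a minimal Python sketch. The function name and the example gains are my own illustrative assumptions, not measurements; it simply assumes the two threads split the core's total throughput evenly, which real workloads won't do exactly.

```python
def per_thread_speed(smt_gain: float, threads_per_core: int = 2) -> float:
    """Relative speed of each thread on a fully loaded SMT core.

    smt_gain is the core's total throughput gain from SMT (0.30 means +30%).
    A result of 1.0 means "as fast as one thread with the core to itself".
    Assumption: throughput is split evenly between the threads.
    """
    return (1.0 + smt_gain) / threads_per_core


# Illustrative gains spanning the 0% to 40% range mentioned in the text.
for gain in (0.0, 0.2, 0.4):
    speed = per_thread_speed(gain)
    print(f"SMT gain {gain:.0%}: each thread runs at {speed:.0%} of 1T speed, "
          f"total core throughput {1 + gain:.0%}")
```

So even in the best case each individual task runs noticeably slower; the gain only shows up because two of them run at once.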
This brings me to the point of this post. On forums dedicated to World Community Grid, a distributed computing project that lets your PC run various humanitarian science projects, I have read that SMT can sometimes adversely affect some workloads. I can't back these claims up without doing some Science!
There are some interesting points of note here that I wanted to state. These are aside from the obvious fact that SMT slows down each thread/Task, which could be entirely mitigated (plus some gain) by running twice as many tasks at once.
Fewer Threads Means more Cache per Thread.
A potential source of better performance (relative to power use) with SMT disabled is the fact that if you run only 8 threads on a CPU with 8 cores, each Task/Thread technically has access (if shared equally) to about twice as much on-chip cache, in L2 and L3, as it would if you ran SMT with 16 tasks in parallel. That broadly means the tasks are going to miss L3 less often, and hit the RAM less often. That alone, in some projects, could work towards offsetting the overall performance gain SMT gets from exploiting execution gaps by running two threads per core. SMT is, indeed, a huge balancing act.
World Community Grid runs an independent, completely single-threaded task on each logical thread/processor available to the system. For example, a CPU with 8 cores and 16 threads would essentially run 16 single-threaded computational tasks together. They don't need information from each other, there are likely no dependencies, and the threads would likely never even need to communicate. This is, actually, an ideal workload for the Zen topology, which trades off fast all-core-to-all-core interconnects for massive scalability/modularity. (FYI: Threads within a 4-core CCX still have a fast connection).
Zen1 and Zen+ (Ryzen 1000 and 2000 series desktop CPUs; I use only Ryzen 7 2700s for my WCG farm, so I am focusing on these designs) have 8MiB of L3 cache per 4-core CCX. The Ryzen 7 2700, which I have 4 of in my farm, consists of two 4-core CCXs linked with an internal fabric, so the silicon has 16MiB of L3 cache in total, in 2x 8MiB chunks.
For this processor, running 16 tasks on WCG, each task would theoretically be given access to an average share of the cache. (That is if it is shared evenly; you also have to consider that not all tasks need as large a working set in cache, but for the sake of simplicity, let's assume they are broadly similar, as I'm working with mixed projects right now.) That average per task on a Ryzen 7 2700 with SMT is 1MiB of L3 cache and 256 KiB of L2 cache.
If you disabled SMT, that grows to 2MiB of L3 per task and the full 512 KiB of private L2 per core, just for the one thread. That directly results in the task having to seek data from memory less often, improving performance by reducing execution stalls that wait on memory access. It also reduces power consumption through less reliance on the DDR4 bus, which brings me to the second point...
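The cache arithmetic above can be sketched in a few lines of Python. The cache sizes are the Ryzen 7 2700 figures from this post; the function name and the perfectly even split are simplifying assumptions of mine, since real tasks won't share cache this neatly.

```python
KIB = 1024
MIB = 1024 * KIB

# Ryzen 7 2700 figures from the text:
# 8 cores, two CCXs with 8 MiB of L3 each, 512 KiB of private L2 per core.
CORES = 8
L3_TOTAL = 2 * 8 * MIB
L2_PER_CORE = 512 * KIB


def cache_per_task(threads_per_core: int) -> tuple[float, float]:
    """Return (L3 in MiB, L2 in KiB) available per task.

    Assumes one WCG task per hardware thread and a perfectly even
    split of each cache between the tasks that share it.
    """
    tasks = CORES * threads_per_core
    l3_mib = L3_TOTAL / tasks / MIB
    l2_kib = L2_PER_CORE / threads_per_core / KIB
    return l3_mib, l2_kib


for label, threads_per_core in (("SMT on", 2), ("SMT off", 1)):
    l3, l2 = cache_per_task(threads_per_core)
    print(f"{label}: {l3:.0f} MiB L3 and {l2:.0f} KiB L2 per task")
```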
SMT uses more overall power per core.
Running distributed computing like WCG on a 24 hours a day, 7 days a week basis requires one to be aware of performance per watt because, at the end of the month, I am having to pay a good chunk of money for the power my farm uses. Power consumption is important.
SMT actually increases the power consumption of the CPU core because, simply put, it's working harder. Fewer idle resources mean more power use, because in a modern core, if those resources are doing nothing, they are likely gated and using very little power. So it only makes sense that a more highly utilised core will use a bit more power. However, in most workloads the increase in performance of the core with SMT enabled is greater than the % increase in power use, improving the overall performance per watt of the design, which is why SMT is widely used on x86 HP cores.
This might not always be the case. So, with that in mind, it could be interesting to see how WCG with mixed projects reacts to running 8 tasks at lower power, with faster execution per task, versus 16 slower tasks at higher power. I think there will be a fine balance here, and it might well depend on the tasks being run. Some projects need a lot more data in their working set, like Africa Rainfall Project, of which a single Task can use over 700 MiB of system memory, compared to Microbiome Immunity Project, which in my observation uses less than 100 MiB of system memory.
In my testing, each core used about 0.5-1W less with SMT disabled, and the entire CPU about 6-7W less. That may not sound like much, but multiply that by four and you have a reasonably significant cost saving in energy and heat output.
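As a rough sketch of what that saving adds up to over a month of 24/7 running, here is the arithmetic in Python. The 6-7W figure and the four machines come from this post; the function name, the 30-day month and the price per kWh are assumptions of mine (the tariff is purely hypothetical, so substitute your own). Note this counts CPU power only, not power at the wall.

```python
def monthly_kwh_saved(watts_saved_per_cpu: float, machines: int = 4,
                      hours: float = 24 * 30) -> float:
    """Energy saved over a 30-day month of 24/7 running, in kWh."""
    return watts_saved_per_cpu * machines * hours / 1000.0


PRICE_PER_KWH = 0.20  # hypothetical tariff, NOT from the post; use your own

for watts in (6.0, 7.0):
    kwh = monthly_kwh_saved(watts)
    print(f"{watts:.0f} W saved per CPU: {kwh:.2f} kWh per month, "
          f"about {kwh * PRICE_PER_KWH:.2f} per month at the assumed tariff")
```

With these inputs the saving works out to roughly 17 to 20 kWh a month across the four machines.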
The experiment is simple: I have four machines, each with a Ryzen 7 2700. Admittedly, my RAM is a bit mismatched, and RAM performance (as noted above) can have a significant impact, but I don't have enough high performance kits to use the same speed everywhere. Two of my machines are running DDR4-2933 15-17-17, one is running 2667 16-18-18 and the other is running 2400 16-16-16. All four CPUs are locked to 3025 MHz per core, on all cores, with a tuned voltage of 928mV for Vcore. With SMT, they use around 55-60W. Without, they use about 50W, ish. These figures are me 'eyeballing' it, so keep that in mind. I'll do more reliable testing soon (ish).
I have disabled SMT on the system with the weakest memory. My thinking was that the lower cache load per thread would put less strain on the weaker memory, which makes the best use of my faster kits. Anyway, this CPU is running at the exact same frequency as the others, but without SMT.
I am going to wait around a week for averages to plateau, then I am going to check how many results it submits, and how many points per day it produces, compared to the other three systems, relative to the amount of power it uses.
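For the comparison itself, a minimal Python sketch of the metric might look like this. The function names are hypothetical, and the only numbers used are the eyeballed power figures from this post (about 50W without SMT, 55-60W with it); no points-per-day results are assumed here.

```python
def points_per_watt(points_per_day: float, cpu_watts: float) -> float:
    """Efficiency metric: WCG points per day for each watt of CPU power."""
    return points_per_day / cpu_watts


def break_even_ratio(watts_smt_off: float, watts_smt_on: float) -> float:
    """Fraction of the SMT-on machine's points per day that the SMT-off
    machine must reach to match it on points per watt."""
    return watts_smt_off / watts_smt_on


# Eyeballed CPU power figures from the post: ~50 W off, 55-60 W on.
for on_watts in (55.0, 60.0):
    ratio = break_even_ratio(50.0, on_watts)
    print(f"SMT on at {on_watts:.0f} W: the SMT-off machine needs at least "
          f"{ratio:.0%} of the points per day to break even")
```

Put the other way around, at these power figures SMT has to add more than roughly 10 to 20% throughput to pay for the extra power it draws.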
Then I can determine if it's better overall, at least on Zen1/+ to disable SMT for World Community Grid's projects right now. And it'll be some fun SCIENCE!, too.