Hyperthreaded virtual cores, different C-States?

turbostat reports the C-states of all CPU cores, and includes entries for each hyperthreaded logical core as well. Often enough, the two logical cores on a single physical core will list different C-state percentages. Does that make any sense?

Is it reporting the C-states of the few duplicated parts that support hyperthreading, versus the actual computing units in the single physical core?

This isn't a turbostat-specific question; that just happens to be the tool I used to display the info. It's more a question about hyperthreading in general.
Edit: The CPU is an Intel 5820K hexacore, if that matters. It's my first hyperthreaded CPU.

Good question.

Virtual cores aren't real cores, but Linux treats them as such to simplify its scheduler, to the point that they appear in /proc/cpuinfo. As such, they sometimes get tallied in ways that don't make perfect sense.

I don't have a hyperthreaded core to compare with, but I suspect that exploring the structure inside /sys/ would reveal the true, more complex, grouping.
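As a rough sketch of that exploration (assuming the standard Linux sysfs layout; the helper names here are mine, not from any tool), each CPU's thread_siblings_list file shows which logical CPUs share a physical core:

```python
def parse_siblings(s):
    """Parse a thread_siblings_list string such as "0,6" or "0-1"
    into a sorted list of logical CPU numbers."""
    cpus = []
    for part in s.strip().split(","):
        if "-" in part:
            lo, hi = part.split("-")
            cpus.extend(range(int(lo), int(hi) + 1))
        else:
            cpus.append(int(part))
    return sorted(cpus)

def siblings_of(cpu):
    """Read the sibling list for one logical CPU from sysfs (Linux-only)."""
    path = f"/sys/devices/system/cpu/cpu{cpu}/topology/thread_siblings_list"
    with open(path) as f:
        return parse_siblings(f.read())
```

On a hyperthreaded machine, siblings_of(0) would return something like [0, 6], matching the pairing discussed below.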

The info in /proc/cpuinfo, as well as the output of turbostat, shows how the virtual/logical cores map to physical ones. I would expect logical cores 0 and 6, both on physical core 0, to report the exact same C-state times/percentages. A lot of the time they do. But then sometimes not. Puzzled. Curious.

Log Phys
----------
0 0
1 1
2 2
3 3
4 4
5 5
6 0
7 1
8 2
9 3
10 4
11 5
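The mapping in the table above can be recovered programmatically by pairing each "processor" entry in /proc/cpuinfo with its "core id". A hedged sketch (the function name is mine; field names follow the usual x86 cpuinfo layout):

```python
def logical_to_physical(cpuinfo_text):
    """Build a {logical CPU: physical core} map from /proc/cpuinfo-style
    text by pairing each "processor" line with the following "core id"."""
    mapping = {}
    processor = None
    for line in cpuinfo_text.splitlines():
        if ":" not in line:
            continue
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key == "processor":
            processor = int(value)
        elif key == "core id" and processor is not None:
            mapping[processor] = int(value)
    return mapping
```

Feeding it the real file, open("/proc/cpuinfo").read(), would reproduce the table: logical 0 and 6 both mapping to physical core 0, and so on.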

Edit:

Speaking of Linux scheduling: I have also noticed that the scheduler will sometimes put two tasks on the same physical core but leave another physical core idle. I guess that when deciding which core is most available, two virtual cores might be idle while another core is still finishing something up. Not sure how fast load should be rebalanced (if it is at all). Gotta break out the OS internals books and refresh.
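One way to take the guesswork out of placement is to pin a task yourself. A minimal Linux-only sketch using Python's wrapper around sched_setaffinity (the helper name is mine, not a standard API):

```python
import os

def pin_to_cpu(cpu):
    """Restrict the calling process to one logical CPU so the scheduler
    cannot migrate it elsewhere (Linux-only; pid 0 means "this process")."""
    os.sched_setaffinity(0, {cpu})
    return os.sched_getaffinity(0)
```

Pinning two busy tasks to sibling logical CPUs (e.g. 0 and 6 in the table above) versus two separate physical cores is also a quick way to measure the throughput cost of sharing one physical core.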

Moving LWPs from core to core can reduce cache hits, so the dispatching logic may look at longer-term stats before reassigning an LWP.

Hyperthreading allows two threads to use different parts of one core -- one might be using the ALU for integer math while another does floating point, or reads from memory, etc. It's still just one core, but sometimes it can slip in an extra cycle here and there using parts of itself that happen to be free.

DGPickett: good point about cache implications of moving from core to core.

Hi Guys,

I'd just like to chuck in my two cents' worth on this. I've fallen victim to the performance issues that "cache thrashing" can cause, and it took me some time to work out what the issue actually was.

The issue was in my case Solaris-based and was due to my configuration of the system (down to me, I'm afraid). The system in question, a Sun "T" series, had been domained, and I had set up some containers/zones. Due to my lack of understanding, I set up a small domain across core boundaries, with the result that the four "VCPUs" (actually threads) spent a high percentage of their time shuttling cache from core to core.

A lesson well learned at the time. Although I think that in later versions of the OS, the related software, and the firmware the impact of such a mistake is reduced, I tend to shy away from configuring domains or VMs, particularly small ones, across core boundaries.

Regards

Dave

In my case, 'the fix' was to put a whole-core constraint on the most utilized ldoms (databases) and keep the VCPU count inside a core boundary.

For instance, the SPARC T5-2 has 256 available VCPUs (threads), which translates into 32 cores, or two sockets of 16 cores each. For best performance, one should assign VCPU resources in multiples of 8.
Be sure to reboot the hypervisor after such major changes.
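That sizing rule can be sketched as a small helper that rounds a VCPU request up to a whole-core multiple (the 8-threads-per-core constant is an assumption matching the T5 described above; it is not read from the system):

```python
THREADS_PER_CORE = 8  # SPARC T5: 8 hardware threads per physical core

def aligned_vcpus(requested):
    """Round a VCPU request up to a whole-core multiple so an ldom
    never straddles a core boundary."""
    cores = -(-requested // THREADS_PER_CORE)  # ceiling division
    return cores * THREADS_PER_CORE
```

So a request for 12 VCPUs would be bumped to 16 (two whole cores) rather than splitting a core with another domain.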

Regarding HPVM (now vPars and Integrity VM), I would recommend using vPars, since they can only be configured in such a manner (dedicated cores for the virtual machines and the hypervisor).

Integrity VM can suffer from such 'misconfiguration' as well, since it (can) share cores.

A nice blog about it, a bit old but good:
https://blogs.oracle.com/jsavit/entry/best_practices_core_allocation

Often the cache is just reloaded from the lower, slower layers on the new core, and eventually snooped empty on the old core when the data is modified. This means that while a different core may be available at a given instant, it can be better to wait a bit for the old core, which may not be 100% busy over the longer term. Of course, some caches are keyed on virtual addresses, not physical ones, and may be flushed when other processes use the core. For them, dispatching multiple threads of the same process in succession reduces cache flushing. So, while you have asked for concurrent threads, the system may in practice make that less true in the fine detail.

Hyperthreading works best for same-process threads, since they share the same virtual address space (though each logical CPU keeps its own architectural state, so siblings can also run unrelated processes). It is a nice way to increase use of CPU resources, with some added delay when the threads' needs collide. It is an interesting alternate direction to the trend in modern CPU design of doing speculative operations that are 50% or more a waste of the resource but speed up the critical thread. I find it reminiscent of the old Honeywell-800, where the CPU ran instructions of up to 8 threads more or less in rotation. (If you loaded the accumulator, it did not 'hunt', so many programmers used the accumulator as a register to hog the CPU and speed up their thread.) I used to fix this stuff, before it crawled inside a chip!