
while inserting Group0 (cpuset 0xffffffff,,0xffffffff) at Package (P#0 cpuset 0xffffffff,0xffffffff) #712

ChaoHsin-fang opened this issue Apr 14, 2025 · 11 comments


@ChaoHsin-fang

What version of hwloc are you using?
2.7.0

Which operating system and hardware are you running on?
Ubuntu 22.04, KVM guest on Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz (all PCIe devices passed through to the VM)

Details of the problem

[screenshot: lstopo output showing the Group0 insertion warning]

lscpu | grep -i numa
NUMA node(s): 2
NUMA node0 CPU(s): 0-31,64-95
NUMA node1 CPU(s): 32-63,96-127

numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
node 0 size: 352671 MB
node 0 free: 350266 MB
node 1 cpus: 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127
node 1 size: 352777 MB
node 1 free: 350764 MB
node distances:
node 0 1
0: 10 20
1: 20 10

@bgoglin
Contributor

bgoglin commented Apr 14, 2025

Hello
Looks like the NUMA information is invalid in this KVM. Assuming this happens with "lstopo --no-io" as well, please run "hwloc-gather-topology foo" and send us the generated foo.tar.bz2 so that we may debug remotely by looking at what's buggy in your /sys files.
If lstopo --no-io works fine but lstopo (without options) shows the warning, you'll need to pass "--io" to hwloc-gather-topology which will make the script slower and the tarball bigger.
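In shell form, the requested sequence is (run inside the VM; "foo" is just an output prefix):

lstopo --no-io            # does the Group0 warning appear without I/O discovery?
hwloc-gather-topology foo # dumps the /sys topology files into foo.tar.bz2
# only if the warning needs I/O discovery to trigger:
hwloc-gather-topology --io foo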

@bgoglin
Contributor

bgoglin commented Apr 14, 2025

Also, if you created the VM by specifying some core and NUMA topology, the bug might be there; it would be useful to see what you specified.

@ChaoHsin-fang
Author

> Hello. Looks like the NUMA information is invalid in this KVM. Assuming this happens with "lstopo --no-io" as well, please run "hwloc-gather-topology foo" and send us the generated foo.tar.bz2 so that we may debug remotely by looking at what's buggy in your /sys files. If lstopo --no-io works fine but lstopo (without options) shows the warning, you'll need to pass "--io" to hwloc-gather-topology which will make the script slower and the tarball bigger.

lstopo.log

foo.tar.gz

@ChaoHsin-fang
Author

ChaoHsin-fang commented Apr 14, 2025

> Also, if you created the VM by specifying some core and NUMA topology, the bug might be there; it would be useful to see what you specified.

I’ve tried configuring NUMA in KVM XML, but it’s not working as expected. Any clues?

<cpu mode='host-passthrough' check='none' migratable='on'>
    <topology sockets='2' cores='32' threads='2'/> 
    <numa>
      <cell id='0' cpus='0-31,64-95' memory='350' unit='GiB'/> <!-- NUMA0 350GB -->
      <cell id='1' cpus='32-63,96-127' memory='350' unit='GiB'/> <!-- NUMA1 350GB -->
    </numa>
  </cpu>
  <numatune>
    <memory nodeset='0,1' mode='strict'/>
    <memnode cellid='0' nodeset='0'/>
    <memnode cellid='1' nodeset='1'/>
  </numatune>
  <cputune>
  <!-- NUMA 0 -->
  <!-- vCPU 0-31 → host pCPU 0-31  -->
  <vcpupin vcpu='0' cpuset='0'/>
  <vcpupin vcpu='1' cpuset='1'/>
  <vcpupin vcpu='2' cpuset='2'/>
  <vcpupin vcpu='3' cpuset='3'/>
  <vcpupin vcpu='4' cpuset='4'/>
  <vcpupin vcpu='5' cpuset='5'/>
  <vcpupin vcpu='6' cpuset='6'/>
  <vcpupin vcpu='7' cpuset='7'/>
  <vcpupin vcpu='8' cpuset='8'/>
  <vcpupin vcpu='9' cpuset='9'/>
  <vcpupin vcpu='10' cpuset='10'/>
  <vcpupin vcpu='11' cpuset='11'/>
  <vcpupin vcpu='12' cpuset='12'/>
  <vcpupin vcpu='13' cpuset='13'/>
  <vcpupin vcpu='14' cpuset='14'/>
  <vcpupin vcpu='15' cpuset='15'/>
  <vcpupin vcpu='16' cpuset='16'/>
  <vcpupin vcpu='17' cpuset='17'/>
  <vcpupin vcpu='18' cpuset='18'/>
  <vcpupin vcpu='19' cpuset='19'/>
  <vcpupin vcpu='20' cpuset='20'/>
  <vcpupin vcpu='21' cpuset='21'/>
  <vcpupin vcpu='22' cpuset='22'/>
  <vcpupin vcpu='23' cpuset='23'/>
  <vcpupin vcpu='24' cpuset='24'/>
  <vcpupin vcpu='25' cpuset='25'/>
  <vcpupin vcpu='26' cpuset='26'/>
  <vcpupin vcpu='27' cpuset='27'/>
  <vcpupin vcpu='28' cpuset='28'/>
  <vcpupin vcpu='29' cpuset='29'/>
  <vcpupin vcpu='30' cpuset='30'/>
  <vcpupin vcpu='31' cpuset='31'/>

  <!-- vCPU 32-63 → host pCPU 64-95  -->
  <vcpupin vcpu='32' cpuset='64'/>
  <vcpupin vcpu='33' cpuset='65'/>
  <vcpupin vcpu='34' cpuset='66'/>
  <vcpupin vcpu='35' cpuset='67'/>
  <vcpupin vcpu='36' cpuset='68'/>
  <vcpupin vcpu='37' cpuset='69'/>
  <vcpupin vcpu='38' cpuset='70'/>
  <vcpupin vcpu='39' cpuset='71'/>
  <vcpupin vcpu='40' cpuset='72'/>
  <vcpupin vcpu='41' cpuset='73'/>
  <vcpupin vcpu='42' cpuset='74'/>
  <vcpupin vcpu='43' cpuset='75'/>
  <vcpupin vcpu='44' cpuset='76'/>
  <vcpupin vcpu='45' cpuset='77'/>
  <vcpupin vcpu='46' cpuset='78'/>
  <vcpupin vcpu='47' cpuset='79'/>
  <vcpupin vcpu='48' cpuset='80'/>
  <vcpupin vcpu='49' cpuset='81'/>
  <vcpupin vcpu='50' cpuset='82'/>
  <vcpupin vcpu='51' cpuset='83'/>
  <vcpupin vcpu='52' cpuset='84'/>
  <vcpupin vcpu='53' cpuset='85'/>
  <vcpupin vcpu='54' cpuset='86'/>
  <vcpupin vcpu='55' cpuset='87'/>
  <vcpupin vcpu='56' cpuset='88'/>
  <vcpupin vcpu='57' cpuset='89'/>
  <vcpupin vcpu='58' cpuset='90'/>
  <vcpupin vcpu='59' cpuset='91'/>
  <vcpupin vcpu='60' cpuset='92'/>
  <vcpupin vcpu='61' cpuset='93'/>
  <vcpupin vcpu='62' cpuset='94'/>
  <vcpupin vcpu='63' cpuset='95'/>

  <!-- NUMA 1 -->
  <!-- vCPU 64-95 → host pCPU 32-63 -->
  <vcpupin vcpu='64' cpuset='32'/>
  <vcpupin vcpu='65' cpuset='33'/>
  <vcpupin vcpu='66' cpuset='34'/>
  <vcpupin vcpu='67' cpuset='35'/>
  <vcpupin vcpu='68' cpuset='36'/>
  <vcpupin vcpu='69' cpuset='37'/>
  <vcpupin vcpu='70' cpuset='38'/>
  <vcpupin vcpu='71' cpuset='39'/>
  <vcpupin vcpu='72' cpuset='40'/>
  <vcpupin vcpu='73' cpuset='41'/>
  <vcpupin vcpu='74' cpuset='42'/>
  <vcpupin vcpu='75' cpuset='43'/>
  <vcpupin vcpu='76' cpuset='44'/>
  <vcpupin vcpu='77' cpuset='45'/>
  <vcpupin vcpu='78' cpuset='46'/>
  <vcpupin vcpu='79' cpuset='47'/>
  <vcpupin vcpu='80' cpuset='48'/>
  <vcpupin vcpu='81' cpuset='49'/>
  <vcpupin vcpu='82' cpuset='50'/>
  <vcpupin vcpu='83' cpuset='51'/>
  <vcpupin vcpu='84' cpuset='52'/>
  <vcpupin vcpu='85' cpuset='53'/>
  <vcpupin vcpu='86' cpuset='54'/>
  <vcpupin vcpu='87' cpuset='55'/>
  <vcpupin vcpu='88' cpuset='56'/>
  <vcpupin vcpu='89' cpuset='57'/>
  <vcpupin vcpu='90' cpuset='58'/>
  <vcpupin vcpu='91' cpuset='59'/>
  <vcpupin vcpu='92' cpuset='60'/>
  <vcpupin vcpu='93' cpuset='61'/>
  <vcpupin vcpu='94' cpuset='62'/>
  <vcpupin vcpu='95' cpuset='63'/>

  <!-- vCPU 96-127 → host pCPU 96-127  -->
  <vcpupin vcpu='96' cpuset='96'/>
  <vcpupin vcpu='97' cpuset='97'/>
  <vcpupin vcpu='98' cpuset='98'/>
  <vcpupin vcpu='99' cpuset='99'/>
  <vcpupin vcpu='100' cpuset='100'/>
  <vcpupin vcpu='101' cpuset='101'/>
  <vcpupin vcpu='102' cpuset='102'/>
  <vcpupin vcpu='103' cpuset='103'/>
  <vcpupin vcpu='104' cpuset='104'/>
  <vcpupin vcpu='105' cpuset='105'/>
  <vcpupin vcpu='106' cpuset='106'/>
  <vcpupin vcpu='107' cpuset='107'/>
  <vcpupin vcpu='108' cpuset='108'/>
  <vcpupin vcpu='109' cpuset='109'/>
  <vcpupin vcpu='110' cpuset='110'/>
  <vcpupin vcpu='111' cpuset='111'/>
  <vcpupin vcpu='112' cpuset='112'/>
  <vcpupin vcpu='113' cpuset='113'/>
  <vcpupin vcpu='114' cpuset='114'/>
  <vcpupin vcpu='115' cpuset='115'/>
  <vcpupin vcpu='116' cpuset='116'/>
  <vcpupin vcpu='117' cpuset='117'/>
  <vcpupin vcpu='118' cpuset='118'/>
  <vcpupin vcpu='119' cpuset='119'/>
  <vcpupin vcpu='120' cpuset='120'/>
  <vcpupin vcpu='121' cpuset='121'/>
  <vcpupin vcpu='122' cpuset='122'/>
  <vcpupin vcpu='123' cpuset='123'/>
  <vcpupin vcpu='124' cpuset='124'/>
  <vcpupin vcpu='125' cpuset='125'/>
  <vcpupin vcpu='126' cpuset='126'/>
  <vcpupin vcpu='127' cpuset='127'/>
  </cputune>
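For reference, the 128 vcpupin lines above follow a regular pattern, so a small script like the following sketch (not part of the original config) could generate them from the intended vCPU-to-pCPU mapping:

#!/bin/sh
# Sketch: emit the <vcpupin> block from the mapping described above:
# vCPU 0-31 -> pCPU 0-31, 32-63 -> 64-95, 64-95 -> 32-63, 96-127 -> 96-127.
for v in $(seq 0 127); do
  if   [ "$v" -lt 32 ]; then p=$v           # vCPU 0-31   -> pCPU 0-31
  elif [ "$v" -lt 64 ]; then p=$((v + 32))  # vCPU 32-63  -> pCPU 64-95
  elif [ "$v" -lt 96 ]; then p=$((v - 32))  # vCPU 64-95  -> pCPU 32-63
  else p=$v                                 # vCPU 96-127 -> pCPU 96-127
  fi
  printf "  <vcpupin vcpu='%d' cpuset='%d'/>\n" "$v" "$p"
done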

@ChaoHsin-fang
Author

Thanks for your reply.
The issue seems to be with the NUMA topology in the KVM XML file. I tried this setup as well,

<cpu mode='host-passthrough' check='none' migratable='on'>
    <topology sockets='2' cores='32' threads='2'/> 
    <numa>
      <cell id='0' cpus='0-31,64-95' memory='350' unit='GiB'/> <!-- NUMA0 350GB -->
      <cell id='1' cpus='32-63,96-127' memory='350' unit='GiB'/> <!-- NUMA1 350GB -->
    </numa>
  </cpu>
  <numatune>
    <memory nodeset='0,1' mode='strict'/>
    <memnode cellid='0' nodeset='0'/>
    <memnode cellid='1' nodeset='1'/>
  </numatune>

but it's still not working.

@bgoglin
Contributor

bgoglin commented Apr 14, 2025

From the /sys point of view, there's a clear bug in the topology:

  • each package has 32 hyperthreaded cores
  • however, each NUMA node has 64 single-threaded cores

You just need to fix the CPU numbers in the NUMA config; replace this

<cell id='0' cpus='0-31,64-95' memory='350' unit='GiB'/>
<cell id='1' cpus='32-63,96-127' memory='350' unit='GiB'/>

with

<cell id='0' cpus='0-63' memory='350' unit='GiB'/>
<cell id='1' cpus='64-127' memory='350' unit='GiB'/>
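A minimal verification sketch, assuming the libvirt domain is named "vm1" (a placeholder):

virsh edit vm1        # apply the corrected <cell> lines, then restart the guest
# inside the guest afterwards:
lscpu | grep -i numa  # expect node0: 0-63 and node1: 64-127
lstopo --no-io        # the Group0 insertion warning should be gone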

@ChaoHsin-fang
Author

ChaoHsin-fang commented Apr 14, 2025

I replaced the NUMA configuration as suggested, but it's still not taking effect.

  <cpu mode='host-passthrough' check='none' migratable='on'>
    <topology sockets='2' cores='32' threads='2'/>
    <numa>
      <cell id='0' cpus='0-63' memory='350' unit='GiB'/> 
      <cell id='1' cpus='64-127' memory='350' unit='GiB'/> 
    </numa>
  </cpu>
  <numatune>
    <memory nodeset='0,1' mode='strict'/>
    <memnode cellid='0' nodeset='0'/>
    <memnode cellid='1' nodeset='1'/>
  </numatune>


(kvm) numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63
node 0 size: 352719 MB
node 0 free: 350173 MB
node 1 cpus: 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127
node 1 size: 352729 MB
node 1 free: 351282 MB
node distances:
node 0 1
0: 10 20
1: 20 10
(kvm) lscpu | grep -i numa
NUMA node(s): 2
NUMA node0 CPU(s): 0-63
NUMA node1 CPU(s): 64-127

@bgoglin
Contributor

bgoglin commented Apr 14, 2025

Which issue are we supposed to see in your outputs? The hwloc warning seems to be gone, which is what I expected.

@ChaoHsin-fang
Author

My task is to ensure that the nvidia-smi topo output in KVM matches the bare-metal topology exactly.
The issue is that in the nvidia-smi topo -m output, the NUMA Affinity column inside the virtual machine looks different from a normal bare-metal host: the NUMA node binding in KVM didn't take effect.

numactl -H output in kvm

available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63
node 0 size: 352719 MB
node 0 free: 350173 MB
node 1 cpus: 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127
node 1 size: 352729 MB
node 1 free: 351282 MB
node distances:
node 0 1
0: 10 20
1: 20 10

lscpu | grep -i numa output in kvm

NUMA node(s): 2
NUMA node0 CPU(s): 0-63
NUMA node1 CPU(s): 64-127
Normal topology (from the bare-metal host): [screenshot showing devices split across NUMA 0 and NUMA 1]

But the nvidia-smi topo -m output in KVM is:

[screenshot: nvidia-smi topo -m in the VM, with the NUMA Affinity column differing from bare metal]

Could the topology problems shown in nvidia-smi topo -m be connected to hwloc's NUMA identification issues?

Expected topology in KVM (0-63, 64-127 is also acceptable):
[screenshot: expected nvidia-smi topo -m output]

@bgoglin
Contributor

bgoglin commented Apr 15, 2025

Ah, I see. I am not a KVM expert, but I don't understand why you're talking about NUMA identification instead of PCI NUMA affinity here. Each PCI root complex reports a local NUMA node through ACPI tables, but your VM doesn't seem to specify any, hence the GPU is attached to the entire machine (all CPUs, and no specific NUMA node). I think you're just missing that in your KVM config.

hwloc just reads files such as these:
/sys/bus/pci/devices//local_cpulist
/sys/bus/pci/devices//numa_node
As long as those files differ between bare metal and the VM, there's no way the CPU affinity and NUMA affinity columns will match in nvidia-smi.
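For example, a quick comparison on bare metal vs. in the VM (the BDF 0000:17:00.0 is a made-up placeholder; take the real address from lspci):

BDF=0000:17:00.0
cat /sys/bus/pci/devices/$BDF/numa_node      # -1 means no NUMA affinity reported
cat /sys/bus/pci/devices/$BDF/local_cpulist  # CPUs the CPU Affinity column is derived from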

A quick search turns up similar issues such as kubevirt/kubevirt#13926, but I don't know whether this relates to PCI passthrough.
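One possible direction (a hypothetical sketch, not verified with this passthrough setup): libvirt can report a guest NUMA node for a PCI hierarchy through a pcie-expander-bus controller, with the hostdev then addressed behind a root port on that bus; the busNr value below is illustrative:

<controller type='pci' model='pcie-expander-bus'>
  <target busNr='180'>
    <node>1</node>  <!-- guest NUMA node reported for devices behind this bus -->
  </target>
</controller>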

@ChaoHsin-fang
Author

Appreciate the help! I'll dig deeper into what's causing this problem. Thanks!
