
hwloc-calc --local-memory and -I numa produce different results #716

Closed
@antoine-morvan

Description


What version of hwloc are you using?

From 2.10.0 through the latest release (using the --oo flag of hwloc-calc); built from source.

Which operating system and hardware are you running on?

Tested on RHEL 9 (kernel 5.14) on various systems (AMD Milan, Intel SPR, NVIDIA Grace Hopper).

Details of the problem

We use hwloc-calc --local-memory to find the memory nodes nearest to a set of compute resources, so that we can bind allocations there. Typically:

hwloc-bind core:0 --membind $(hwloc-calc --oo --local-memory core:0) -- $exe

The above works fine. But in more complex situations, we observe behavior we cannot explain.

Empty results

The first issue occurs when querying the local memory of several cores:

+ hwloc-calc --number-of core numa:0
16
+ hwloc-calc --oo --local-memory core:0
NUMANode:0
+ hwloc-calc --oo -I numa core:0
NUMANode:0
+ hwloc-calc --oo --local-memory core:16
NUMANode:1
+ hwloc-calc --oo -I numa core:16
NUMANode:1
+ hwloc-calc --oo --local-memory core:0 core:16

+ hwloc-calc --oo -I numa core:0 core:16
NUMANode:0,NUMANode:1

In this example, we ask for the local memory of core 0 (in numa 0) and core 16 (in numa 1): we expect hwloc-calc --oo --local-memory core:0 core:16 to return numa 0 and 1, as the last command does. But it returns an empty string.
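Since each single-core query above returns the expected node, one possible workaround is to query every location separately and merge the results. The following is only a minimal sketch under that assumption, not a fix for the underlying behavior. The function name local_memory_union and the HWLOC_CALC override variable are hypothetical helpers introduced here for illustration; only the hwloc-calc --oo --local-memory invocation comes from the report.

```shell
# Sketch: union of the local NUMA nodes of several locations,
# obtained by querying hwloc-calc once per location.
# HWLOC_CALC is a hypothetical override (not an hwloc feature) that lets
# the command be swapped out; it defaults to the real hwloc-calc.
local_memory_union() {
    for loc in "$@"; do
        # One query per location, e.g. "core:0" prints "NUMANode:0"
        "${HWLOC_CALC:-hwloc-calc}" --oo --local-memory "$loc"
    done |
        tr ',' '\n' |              # one node per line
        sed '/^$/d' |              # drop empty results
        sort -t: -k2,2n -u |       # sort by node index, drop duplicates
        paste -sd, -               # join back into a comma-separated list
}
```

With the single-core outputs shown above, local_memory_union core:0 core:16 would be expected to print NUMANode:0,NUMANode:1, which can then be passed to --membind.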

Missing results

The second issue occurs when querying the local memory of a range of cores:

+ hwloc-calc --number-of core numa:0
16
+ hwloc-calc --oo --local-memory core:0
NUMANode:0
+ hwloc-calc --oo -I numa core:0
NUMANode:0
+ hwloc-calc --oo --local-memory core:16
NUMANode:1
+ hwloc-calc --oo -I numa core:16
NUMANode:1
+ hwloc-calc --oo --local-memory core:0-16
NUMANode:0
+ hwloc-calc --oo -I numa core:0-16
NUMANode:0,NUMANode:1

The command hwloc-calc --oo --local-memory core:0-16 asks for a range covering all of numa:0 plus the first core of numa:1. As the -I numa output shows, we expect both numa 0 and 1 to be part of the result, but only 0 appears.

Note that when we give a range covering all the cores of the second numa, the output is correct:

+ hwloc-calc --oo --local-memory core:0-31
NUMANode:0,NUMANode:1
+ hwloc-calc --oo -I numa core:0-31
NUMANode:0,NUMANode:1

But until a numa is fully covered by the range, it does not show up:

+ hwloc-calc --oo --local-memory core:0-46
NUMANode:0,NUMANode:1
+ hwloc-calc --oo -I numa core:0-46
NUMANode:0,NUMANode:1,NUMANode:2

When we do not add any flag or best-memattr, we expect --local-memory to behave like -I numa.
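To make the discrepancy easier to spot across many inputs, the two outputs can be compared mechanically. The sketch below is a hypothetical helper (the name missing_nodes is not part of hwloc) that takes two comma-separated node lists as printed by hwloc-calc --oo and prints the nodes present in the first list but absent from the second.

```shell
# Sketch: print nodes that appear in the reference list ($1, e.g. from
# "-I numa") but not in the list under test ($2, e.g. from "--local-memory").
missing_nodes() {
    # Feed the tested list first, then a separator, then the reference list.
    printf '%s\n' "$2" "---" "$1" | tr ',' '\n' | awk '
        $0 == "---"   { ref = 1; next }      # reference list starts here
        $0 == ""      { next }               # ignore empty results
        !ref          { seen[$0] = 1; next } # remember tested nodes
        !($0 in seen) { print }              # reference node not in tested list
    '
}
```

For instance, missing_nodes "$(hwloc-calc --oo -I numa core:0-46)" "$(hwloc-calc --oo --local-memory core:0-46)" would print NUMANode:2 given the outputs reported above, and nothing if the two commands agreed.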

Additional information

N/A
