
Benchmark Protocol

David Tanner edited this page Feb 22, 2017 · 12 revisions

Old Benchmark Architecture was Intractable

The benchmarking strategy in version 1 was plain brute force: every combination of every parameter value was benchmarked. The number of kernel enqueues is the product (8 WorkGroups) * (12 ThreadTiles) * (4 NumLoadsCoalescedAs) * (4 NumLoadsCoalescedBs) * (3 LoopUnrolls) * (5 BranchTypes) * ... * (1024 ProblemSizes) = 23,592,960, and a product like this grows very quickly. Adding one more boolean parameter doubles the number of kernel enqueues in the benchmark.
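The arithmetic above can be checked with a short Python sketch. It uses only the factors named in the text and treats the elided "..." terms as 1, which is an assumption; `HypotheticalFlag` is a made-up boolean parameter used only to show the doubling.

```python
from math import prod

# Factors named in the text; the elided "..." terms are treated as 1 (assumption).
version1_factors = {
    "WorkGroups": 8,
    "ThreadTiles": 12,
    "NumLoadsCoalescedAs": 4,
    "NumLoadsCoalescedBs": 4,
    "LoopUnrolls": 3,
    "BranchTypes": 5,
    "ProblemSizes": 1024,
}


def brute_force_enqueues(factors):
    """Kernel enqueues when every combination of every parameter is benchmarked."""
    return prod(factors.values())


baseline = brute_force_enqueues(version1_factors)

# "HypotheticalFlag" is an illustrative boolean parameter, not a real Tensile one.
with_extra_boolean = brute_force_enqueues({**version1_factors, "HypotheticalFlag": 2})

print(baseline)            # 23592960
print(with_extra_boolean)  # 47185920
```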

Incremental Benchmark is Faster

Tensile version 2 lets the user manually break the product into a sum of smaller products, replacing some "multiplies" with "additions", i.e., (8 WorkGroups) * (12 ThreadTiles) + (4 NumLoadsCoalescedAs) * (4 NumLoadsCoalescedBs) * (3 LoopUnrolls) + (5 BranchTypes) * ... + (1024 ProblemSizes) = 1,151, which is a dramatically smaller number of enqueues. Now, adding one more boolean parameter may add only 2 more enqueues.
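A minimal Python sketch of why the incremental scheme scales so much better: each benchmark step costs the product of its own parameter counts, and the steps are summed. The grouping follows the expression above with the elided terms left out (an assumption), so the sketch reports only the cost of adding a boolean parameter, not the absolute total. `incremental_enqueues` is an illustrative helper, not a Tensile function.

```python
from math import prod

# Each inner list is one benchmark step; the values are the parameter counts
# benchmarked together in that step. Elided "..." terms are omitted (assumption).
steps = [
    [8, 12],    # WorkGroups * ThreadTiles
    [4, 4, 3],  # NumLoadsCoalescedAs * NumLoadsCoalescedBs * LoopUnrolls
    [5],        # BranchTypes
    [1024],     # ProblemSizes
]


def incremental_enqueues(step_list):
    """Enqueues when the steps run one after another: a sum of per-step products."""
    return sum(prod(step) for step in step_list)


before = incremental_enqueues(steps)

# Benchmark a hypothetical boolean parameter as a step of its own.
after = incremental_enqueues(steps + [[2]])

print(after - before)  # 2
```

If the same boolean were instead folded into an existing step, it would double only that step's cost rather than the whole benchmark.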

Phases of Benchmark

To make Tensile's programmability more manageable for both the user and the developer, the benchmarking protocol is split into several steps encoded in a config.yaml file. The sections below refer to the following config.yaml. Note that this config.yaml is a simple illustration and does not represent a good benchmark protocol. See the configs included in the repository (/Tensile/Configs) for examples of good benchmarking configs.

BenchmarkProblems:
  - ProblemType:
      OperationType: GEMM
    
    InitialSolutionParameters:
      - WorkGroupShape: [ 0 ]
      - NumLoadsCoalescedA: [ 1 ]
      - NumLoadsCoalescedB: [ 1 ]
      - WorkGroupEdge: [ 16 ]
      - ThreadTileEdge: [ 4 ]

    BenchmarkCommonParameters:
      - ProblemSizes: [ [512], [512], [512] ]
      - WorkGroupShape: [ -1, 0, 1 ]
        ThreadTileShape: [ -1, 0, 1 ]
    ForkParameters:
      - WorkGroupEdge: [8, 16]
      - ThreadTileEdge: [2, 4, 8 ]
    BenchmarkForkParameters:
      - ProblemSizes: [ [2880], [2880], [2880] ]
      - NumLoadsCoalescedA: [ 1, 2, 3, 4, 6 ]
      - NumLoadsCoalescedB: [ 1, 2, 3, 4, 6 ]
    JoinParameters:
      - MacroTile
    BenchmarkJoinParameters:
      - LoopUnroll: [8, 16]
    BenchmarkFinalParameters:
      - ProblemSizes: [ [16, 128], [16, 128], [256] ]

Initial Solution Parameters

A Solution comprises ~20 parameters, and all of them are needed to create a kernel. So during the first benchmark, which determines the fastest WorkGroupShape, what values do the other 19 solution parameters take in the kernels being benchmarked? That is what InitialSolutionParameters is for: the solutions used to benchmark WorkGroupShape take all of their other parameters from InitialSolutionParameters. The user must choose good default solution parameters in order to correctly identify subsequent optimal parameters.
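The role of InitialSolutionParameters can be sketched in Python as a dictionary of defaults that each benchmark candidate overrides for the parameter under test. The values are taken from the example config.yaml; `make_candidates` is a hypothetical helper for illustration and is not part of Tensile.

```python
# Defaults from the InitialSolutionParameters section of the example config.
initial_solution_parameters = {
    "WorkGroupShape": 0,
    "NumLoadsCoalescedA": 1,
    "NumLoadsCoalescedB": 1,
    "WorkGroupEdge": 16,
    "ThreadTileEdge": 4,
}


def make_candidates(initial, parameter, values):
    """Build one full solution per value; all other parameters keep their defaults."""
    return [{**initial, parameter: value} for value in values]


# Hypothetical helper call: vary only WorkGroupShape.
candidates = make_candidates(initial_solution_parameters, "WorkGroupShape", [-1, 0, 1])

for candidate in candidates:
    print(candidate)
```

Every candidate is a complete solution, which is why poor defaults can skew which value of the benchmarked parameter appears fastest.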

Benchmark Common Parameters

During this first phase of benchmarking, we examine parameters which will be the same for all solutions for this ProblemType. Each benchmarking step has exactly one winner. In the above example we are benchmarking the dictionary {WorkGroupShape: [ -1, 0, 1 ], ThreadTileShape: [ -1, 0, 1 ]}; therefore, this benchmark step generates 9 solution candidates, and the winner will be the fastest WorkGroupShape, ThreadTileShape combination. Assuming the winner is WGS=0, TTS=0, then all solutions for this ProblemType will have WGS=0 and TTS=0. Also, once a parameter has been determined, all subsequent benchmarking steps use the determined value rather than pulling it from InitialSolutionParameters. Because the common parameters apply to all kernels, they are typically parameters which are compiler-dependent or hardware-dependent rather than tile-dependent.
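The step described above can be sketched in Python: the dictionary expands into its 3 x 3 = 9 candidates, a single winner is picked, and the winning values then take precedence over the initial defaults in later steps. `hypothetical_time_ms` is a made-up cost function standing in for a real kernel timing, and the helper names are illustrative rather than Tensile's own.

```python
from itertools import product

# One benchmark step: both parameters are benchmarked jointly.
benchmark_step = {"WorkGroupShape": [-1, 0, 1], "ThreadTileShape": [-1, 0, 1]}

# A subset of the initial defaults from the example config.
initial = {"WorkGroupShape": 0, "WorkGroupEdge": 16, "ThreadTileEdge": 4}


def expand_step(step):
    """Return every combination of the step's parameter values as a dict."""
    names = list(step)
    return [dict(zip(names, combo)) for combo in product(*(step[n] for n in names))]


def hypothetical_time_ms(candidate):
    # Placeholder for a measured kernel time; this made-up cost favors (0, 0).
    return 1.0 + abs(candidate["WorkGroupShape"]) + abs(candidate["ThreadTileShape"])


candidates = expand_step(benchmark_step)
winner = min(candidates, key=hypothetical_time_ms)

# Determined parameters override the initial defaults in all later steps.
determined = dict(winner)
next_step_base = {**initial, **determined}

print(len(candidates))  # 9
print(winner)           # {'WorkGroupShape': 0, 'ThreadTileShape': 0}
```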

Fork Parameters

Benchmark Fork Parameters

Join Parameters

Benchmark Join Parameters

Benchmark Final Parameters
