Benchmark Protocol
The benchmarking strategy in version 1 was plain brute force: (8 WorkGroups)* (12 ThreadTiles)* (4 NumLoadsCoalescedAs)* (4 NumLoadsCoalescedBs)* (3 LoopUnrolls)* (5 BranchTypes)* ...*(1024 ProblemSizes)=23,592,960. This is a multiplicative series, which grows very quickly: adding one more boolean parameter doubles the number of kernel enqueues in the benchmark.
Tensile version 2 allows the user to manually interrupt the multiplicative series with "additions" instead of "multiplies", i.e., (8 WorkGroups)* (12 ThreadTiles)+ (4 NumLoadsCoalescedAs)* (4 NumLoadsCoalescedBs)* (3 LoopUnrolls)+ (5 BranchTypes)* ...+(1024 ProblemSizes)=1,151, which is a dramatically smaller number of enqueues. Now, adding one more boolean parameter may add only 2 more enqueues.
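The difference between the two strategies can be checked with a few lines of arithmetic. The sketch below is illustrative only: the helper names are hypothetical, the value counts are the ones quoted above, and the way parameters are grouped into stages is an assumption made for this example (the expression above elides some terms), so treat the staged total as indicative rather than as Tensile's actual accounting.

```python
from math import prod

def brute_force_enqueues(sizes):
    """Version 1 style: every parameter multiplies the enqueue count."""
    return prod(sizes)

def staged_enqueues(groups):
    """Version 2 style: each group is benchmarked separately, so the
    per-group products are added instead of multiplied together."""
    return sum(prod(group) for group in groups)

# Value counts quoted in the text: WorkGroups, ThreadTiles,
# NumLoadsCoalescedA, NumLoadsCoalescedB, LoopUnrolls, BranchTypes,
# ProblemSizes.
sizes = [8, 12, 4, 4, 3, 5, 1024]
v1 = brute_force_enqueues(sizes)               # 23,592,960
v1_plus_bool = brute_force_enqueues(sizes + [2])  # doubles

# ASSUMPTION: this grouping into stages is hypothetical, chosen only to
# illustrate how additions replace multiplications.
groups = [[8, 12], [4, 4], [3, 5], [1024]]
v2 = staged_enqueues(groups)                   # 96 + 16 + 15 + 1024
v2_plus_bool = staged_enqueues(groups + [[2]])  # only 2 more

print(v1, v1_plus_bool, v2, v2_plus_bool)
```

With brute force, a new boolean parameter doubles the total; when that parameter is benchmarked as its own stage, it adds only its two values to the sum.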
To make Tensile's programmability more manageable for both the user and the developer, the benchmarking protocol has been split into several steps encoded in a config.yaml file. The sections below reference the following config.yaml. Note that this config.yaml was created as a simple illustration and does not represent a good benchmark protocol. See the configs included in the repository (/Tensile/Configs) for examples of good benchmarking configs.
BenchmarkProblems:
  - ProblemType:
      OperationType: GEMM
    InitialSolutionParameters:
      - WorkGroupShape: [ 0 ]
      - NumLoadsCoalescedA: [ 1 ]
      - NumLoadsCoalescedB: [ 1 ]
      - WorkGroupEdge: [ 16 ]
      - ThreadTileEdge: [ 4 ]
    BenchmarkCommonParameters:
      - ProblemSizes: [ [512], [512], [512] ]
      - WorkGroupShape: [ -1, 0, 1 ]
        ThreadTileShape: [ -1, 0, 1 ]
    ForkParameters:
      - WorkGroupEdge: [8, 16]
      - ThreadTileEdge: [2, 4, 8 ]
    BenchmarkForkParameters:
      - ProblemSizes: [ [2880], [2880], [2880] ]
      - NumLoadsCoalescedA: [ 1, 2, 3, 4, 6 ]
      - NumLoadsCoalescedB: [ 1, 2, 3, 4, 6 ]
    JoinParameters:
      - MacroTile
    BenchmarkJoinParameters:
      - LoopUnroll: [8, 16]
    BenchmarkFinalParameters:
      - ProblemSizes: [ [16, 128], [16, 128], [256] ]
A Solution comprises ~20 parameters, all of which are needed to create a kernel. So during the first benchmark, which determines which WorkGroupShape is fastest, what values do the other 19 solution parameters take for the kernels being benchmarked? That is what InitialSolutionParameters are for: the solution used for benchmarking WorkGroupShape takes its remaining parameters from InitialSolutionParameters. The user must choose good default solution parameters in order to correctly identify subsequent optimal parameters.
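As a rough illustration of how defaults fill in the parameters that are not under test, the sketch below overlays one value being benchmarked on top of the InitialSolutionParameters from the example config. The helper `build_candidate` is hypothetical and is not Tensile's actual code; only the five parameters listed in the example config are shown, whereas a real Solution has many more.

```python
# Defaults copied from InitialSolutionParameters in the example config.
initial_solution_parameters = {
    "WorkGroupShape": 0,
    "NumLoadsCoalescedA": 1,
    "NumLoadsCoalescedB": 1,
    "WorkGroupEdge": 16,
    "ThreadTileEdge": 4,
}

def build_candidate(defaults, under_test):
    """Hypothetical helper: start from the defaults, then override the
    parameters that the current benchmark step is varying."""
    candidate = dict(defaults)    # copy so the defaults stay untouched
    candidate.update(under_test)  # values under test take precedence
    return candidate

# One candidate for the WorkGroupShape benchmark: WorkGroupShape is the
# value under test, everything else comes from the defaults.
candidate = build_candidate(initial_solution_parameters, {"WorkGroupShape": -1})
print(candidate)
```

Because every candidate in a step shares the same defaults, a poor default (for example an unsuitable WorkGroupEdge) can skew which value of the parameter under test appears fastest.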
During this first phase of benchmarking (BenchmarkCommonParameters), we examine parameters which will be the same for all solutions for this ProblemType. During each step of benchmarking, there is only 1 winner. In the above example we are benchmarking the dictionary {WorkGroupShape: [ -1, 0, 1 ], ThreadTileShape: [ -1, 0, 1 ]}; therefore, this benchmark step generates 9 solution candidates, and the winner will be the fastest (WorkGroupShape, ThreadTileShape) combination. Assuming the winner is WGS=0, TTS=0, all solutions for this ProblemType will have WGS=0 and TTS=0. Also, once a parameter has been determined, all subsequent benchmarking steps will use this determined value rather than pulling values from InitialSolutionParameters. Because the common parameters apply to all kernels, they are typically the parameters which are compiler-dependent or hardware-dependent rather than tile-dependent.
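The candidate count and the "one winner per step" rule can be sketched as follows. This is a minimal illustration, not Tensile's implementation: `enumerate_candidates` and `fake_time_ms` are hypothetical names, and the timing function is a made-up stand-in for real kernel measurements, constructed so that WGS=0, TTS=0 wins as in the example above.

```python
from itertools import product

# The dictionary benchmarked in the common-parameters step of the example.
common_step = {
    "WorkGroupShape": [-1, 0, 1],
    "ThreadTileShape": [-1, 0, 1],
}

def enumerate_candidates(step):
    """Cross product of all value lists in one benchmark dictionary."""
    names = list(step)
    value_lists = [step[name] for name in names]
    return [dict(zip(names, values)) for values in product(*value_lists)]

def fake_time_ms(candidate):
    """ASSUMPTION: made-up cost, not a real measurement. It is smallest
    at WorkGroupShape=0, ThreadTileShape=0."""
    return 1.0 + abs(candidate["WorkGroupShape"]) + abs(candidate["ThreadTileShape"])

candidates = enumerate_candidates(common_step)  # 3 * 3 = 9 candidates
winner = min(candidates, key=fake_time_ms)      # exactly one winner

# Once determined, the winning values take precedence over the defaults
# in every later step.
initial = {"WorkGroupShape": 0, "WorkGroupEdge": 16, "ThreadTileEdge": 4}
base_for_later_steps = {**initial, **winner}
print(len(candidates), winner, base_for_later_steps)
```

Putting two parameters in the same dictionary benchmarks them jointly (9 candidates); listing them as separate steps would instead benchmark 3 + 3 candidates, at the cost of missing interactions between the two.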