Benchmark Protocol
The benchmarking strategy in version 1 was plain brute force: (8 WorkGroups) * (12 ThreadTiles) * (4 NumLoadsCoalescedAs) * (4 NumLoadsCoalescedBs) * (3 LoopUnrolls) * (5 BranchTypes) * ... * (1024 ProblemSizes) = 23,592,960 kernel enqueues. This is a multiplicative series, which grows very quickly: adding one more boolean parameter doubles the number of kernel enqueues in the benchmark.
Tensile version 2 lets the user manually break up the multiplicative series by replacing some "multiplies" with "additions", i.e., (8 WorkGroups) * (12 ThreadTiles) + (4 NumLoadsCoalescedAs) * (4 NumLoadsCoalescedBs) * (3 LoopUnrolls) + (5 BranchTypes) * ... + (1024 ProblemSizes) = 1,151, a dramatically smaller number of enqueues. Now, adding one more boolean parameter may add only 2 more enqueues.
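The difference between the two growth patterns can be checked with a few lines of Python. This is only a sketch of the arithmetic, not Tensile code: the helper names `brute_force_enqueues` and `staged_enqueues` are hypothetical, the stage grouping simply follows where the "+" signs appear above, and the elided "..." terms are left out, so the staged total it prints is illustrative rather than a reproduction of the figure quoted above.

```python
from math import prod

# Value counts named in the text. The elided "..." terms are omitted
# (assumption), so only the listed parameters are counted.
param_counts = {
    "WorkGroups": 8,
    "ThreadTiles": 12,
    "NumLoadsCoalescedAs": 4,
    "NumLoadsCoalescedBs": 4,
    "LoopUnrolls": 3,
    "BranchTypes": 5,
    "ProblemSizes": 1024,
}

# Hypothetical grouping: each inner list is one stage whose parameters are
# still multiplied together; stages are added, mirroring the "+" signs.
stages = [
    ["WorkGroups", "ThreadTiles"],
    ["NumLoadsCoalescedAs", "NumLoadsCoalescedBs", "LoopUnrolls"],
    ["BranchTypes"],
    ["ProblemSizes"],
]


def brute_force_enqueues(counts):
    """Version 1 style: every combination of every parameter is enqueued."""
    return prod(counts.values())


def staged_enqueues(counts, stages):
    """Version 2 style: multiply within a stage, add across stages."""
    return sum(prod(counts[name] for name in stage) for stage in stages)


print("brute force:", brute_force_enqueues(param_counts))
print("staged:     ", staged_enqueues(param_counts, stages))

# Adding one boolean parameter (2 values) as its own stage:
with_flag = dict(param_counts, NewBooleanFlag=2)
print("brute force with flag:", brute_force_enqueues(with_flag))
print("staged with flag:     ", staged_enqueues(with_flag, stages + [["NewBooleanFlag"]]))
```

With the extra boolean, the brute-force count doubles while the staged count grows by 2, which is the behavior the two paragraphs above describe.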
To make Tensile's programmability more manageable for users and developers, the benchmarking protocol is split into several steps encoded in a config.yaml file. The sections below reference the following config.yaml:
BenchmarkProblems:
  - ProblemType:
      OperationType: GEMM
    InitialSolutionParameters:
      - WorkGroupShape: [ 0 ]
      - NumLoadsCoalescedA: [ 1 ]
      - NumLoadsCoalescedB: [ 1 ]
      - WorkGroupEdge: [ 16 ]
      - ThreadTileEdge: [ 4 ]
    BenchmarkCommonParameters:
      - ProblemSizes: [ [512], [512], [512] ]
      - WorkGroupShape: [ -1, 0, 1 ]
        ThreadTileShape: [ -1, 0, 1 ]
    ForkParameters:
      - WorkGroupEdge: [8, 16]
      - ThreadTileEdge: [2, 4, 8 ]
    BenchmarkForkParameters:
      - ProblemSizes: [ [2880], [2880], [2880] ]
      - NumLoadsCoalescedA: [ 1, 2, 3, 4, 6 ]
      - NumLoadsCoalescedB: [ 1, 2, 3, 4, 6 ]
    JoinParameters:
      - MacroTile
    BenchmarkJoinParameters:
      - LoopUnroll: [8, 16]
    BenchmarkFinalParameters:
      - ProblemSizes: [ [16, 128], [16, 128], [256] ]
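
As a rough illustration of how one step of this config might be expanded, the sketch below enumerates the ForkParameters entries. It assumes that forking means taking every combination of the listed values; that is an assumption made for the example, not a description of Tensile's implementation, and `expand_forks` is a hypothetical helper. The values are transcribed by hand into the list-of-mappings shape a YAML loader would produce, so no YAML library is needed.

```python
from itertools import product

# ForkParameters from the config above, transcribed as a list of
# single-key mappings (the shape a YAML loader would return).
fork_parameters = [
    {"WorkGroupEdge": [8, 16]},
    {"ThreadTileEdge": [2, 4, 8]},
]


def expand_forks(fork_parameters):
    """Return one dict per combination of forked parameter values.

    Assumption: forks are the Cartesian product of the listed values.
    """
    names = []
    value_lists = []
    for entry in fork_parameters:
        for name, values in entry.items():
            names.append(name)
            value_lists.append(values)
    return [dict(zip(names, combo)) for combo in product(*value_lists)]


forks = expand_forks(fork_parameters)
print(len(forks), "forks")
for fork in forks:
    print(fork)
```

Under that assumption, each resulting dict would be one candidate carried into the BenchmarkForkParameters step.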