
Speed up advancing within a sparse block in IndexedDISI. #14371


Open
wants to merge 22 commits into
base: main
from

Conversation

vsop-479
Contributor

Description

Similar to #13692.

@gf2121
Contributor

gf2121 commented Mar 20, 2025

Thanks @vsop-479 , have you been able to measure the performance of your patch?

I had a similar idea recently. If you look at the newest code in Lucene101PostingsReader, you will find we use VectorMask to speed this up. That is what I had in mind: get a MemorySegment slice if it is not null, and process it with VectorMask.

for (; from + INT_SPECIES.length() < to; from += INT_SPECIES.length() + 1) {
  if (buffer[from + INT_SPECIES.length()] >= target) {
    IntVector vector = IntVector.fromArray(INT_SPECIES, buffer, from);
    VectorMask<Integer> mask = vector.compare(VectorOperators.LT, target);
    return from + mask.trueCount();
  }
}
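The loop above can be read without the Vector API. The following is a minimal scalar sketch of the same idea, assuming a sorted int array; `CountLessThanSketch`, `LANES`, and the scalar tail loop are hypothetical names and additions for illustration, not code from Lucene or from this PR. The inner counting loop stands in for the vector compare followed by `trueCount()`.

```java
/** Scalar emulation of the "count lanes below target" idea; it does not use the Vector API. */
final class CountLessThanSketch {
  // Hypothetical lane count standing in for INT_SPECIES.length().
  static final int LANES = 8;

  /** Returns the first index in [from, to) whose value is >= target, or `to` if there is none. */
  static int findNextGEQ(int[] buffer, int target, int from, int to) {
    // Each window covers LANES + 1 values; skip it if its last value is still below target.
    for (; from + LANES < to; from += LANES + 1) {
      if (buffer[from + LANES] >= target) {
        // The input is sorted, so the number of values below target in the window
        // equals the offset of the first value >= target.
        int lessThan = 0;
        for (int i = 0; i < LANES; i++) {
          lessThan += (buffer[from + i] < target) ? 1 : 0;
        }
        return from + lessThan;
      }
    }
    // Scalar tail for the values that do not fill a whole window.
    for (; from < to; from++) {
      if (buffer[from] >= target) {
        return from;
      }
    }
    return to;
  }
}
```

The count-based step only works because the values are sorted: every value below the target sits before the first value that reaches it.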

@vsop-479
Contributor Author

Thanks for your feedback @gf2121. This patch is still in progress and has not been measured yet.

If you look at the newest code in Lucene101PostingsReader, you will find we use VectorMask to speed this up

Thanks for the reminder. I noticed the vectorization approach when I found that #13692 had been reverted, but I am not sure we can use vectors for IndexedDISI.slice.

That is what I had in mind: get a MemorySegment slice if it is not null, and process it with VectorMask.

That would be nice, I just noticed ShortVector#fromMemorySegment.

@vsop-479
Contributor Author

@gf2121
For what it's worth, I implemented this patch, and measured with luceneutil on wikimedium10m.

                            Task  QPS baseline      StdDev  QPS my_modified_version      StdDev                Pct diff p-value
           HighTermDayOfYearSort      470.94     (11.4%)      441.83      (8.5%)   -6.2% ( -23% -   15%) 0.051
                 MedSloppyPhrase      183.03      (8.5%)      175.29      (6.1%)   -4.2% ( -17% -   11%) 0.070
                     LowSpanNear      411.55     (10.3%)      396.07      (9.5%)   -3.8% ( -21% -   17%) 0.230
           BrowseMonthSSDVFacets       27.44      (8.5%)       26.47      (5.4%)   -3.6% ( -16% -   11%) 0.113
                         MedTerm     1178.02      (9.8%)     1142.75      (6.9%)   -3.0% ( -17% -   15%) 0.261
                        HighTerm      929.38      (7.6%)      902.10      (7.9%)   -2.9% ( -17% -   13%) 0.232
                       MedPhrase      251.86      (7.8%)      245.03      (9.7%)   -2.7% ( -18% -   16%) 0.330
                    OrHighNotLow     1175.04      (9.4%)     1143.49     (11.8%)   -2.7% ( -21% -   20%) 0.425
                   OrNotHighHigh      877.79      (7.1%)      858.18      (7.3%)   -2.2% ( -15% -   13%) 0.326
                         LowTerm     1419.65      (9.5%)     1389.75     (10.4%)   -2.1% ( -20% -   19%) 0.504
             MedIntervalsOrdered       59.98      (8.2%)       58.76      (7.2%)   -2.0% ( -16% -   14%) 0.403
                      TermDTSort      454.05     (11.0%)      444.95     (12.6%)   -2.0% ( -23% -   24%) 0.593
                          Fuzzy1      130.17      (3.4%)      128.06      (4.1%)   -1.6% (  -8% -    6%) 0.172
                    OrNotHighMed     1033.67     (10.7%)     1017.41     (10.3%)   -1.6% ( -20% -   21%) 0.636
                     AndHighHigh      388.27      (8.1%)      382.25      (7.5%)   -1.5% ( -15% -   15%) 0.530
                HighSloppyPhrase      132.66      (5.2%)      130.62      (6.4%)   -1.5% ( -12% -   10%) 0.402
             LowIntervalsOrdered      637.45      (6.9%)      627.85      (6.9%)   -1.5% ( -14% -   13%) 0.488
                          IntNRQ      438.20     (11.4%)      431.78     (10.2%)   -1.5% ( -20% -   22%) 0.668
            HighTermTitleBDVSort      123.85      (8.9%)      122.20      (8.7%)   -1.3% ( -17% -   17%) 0.633
       BrowseDayOfYearSSDVFacets       26.84     (10.1%)       26.50      (9.3%)   -1.3% ( -18% -   20%) 0.683
                          Fuzzy2      102.89      (2.1%)      101.68      (3.6%)   -1.2% (  -6% -    4%) 0.204
                 LowSloppyPhrase      683.96     (10.5%)      676.73     (12.0%)   -1.1% ( -21% -   23%) 0.766
                         Respell      134.93      (2.2%)      133.51      (2.6%)   -1.0% (  -5% -    3%) 0.167
        AndHighHighDayTaxoFacets       67.12      (4.2%)       66.44      (4.6%)   -1.0% (  -9% -    8%) 0.471
                        Wildcard      379.66      (8.4%)      376.74      (9.8%)   -0.8% ( -17% -   19%) 0.790
                     MedSpanNear      227.24      (4.8%)      225.86      (5.6%)   -0.6% ( -10% -   10%) 0.715
               HighTermMonthSort     1811.33     (12.1%)     1803.90     (13.4%)   -0.4% ( -23% -   28%) 0.919
                      AndHighMed      820.66      (8.4%)      817.55      (9.8%)   -0.4% ( -17% -   19%) 0.895
                   OrHighNotHigh      757.11      (8.3%)      754.48      (7.8%)   -0.3% ( -15% -   17%) 0.892
                    OrNotHighLow     1757.90     (11.0%)     1754.29      (8.9%)   -0.2% ( -18% -   22%) 0.948
            MedTermDayTaxoFacets      148.12      (4.6%)      148.00      (5.1%)   -0.1% (  -9% -   10%) 0.956
                        PKLookup      293.33      (4.4%)      293.35      (2.9%)    0.0% (  -7% -    7%) 0.995
                         Prefix3      707.43     (14.9%)      708.70     (12.0%)    0.2% ( -23% -   31%) 0.967
                    HighSpanNear       75.55      (3.7%)       75.81      (4.5%)    0.3% (  -7% -    8%) 0.793
         AndHighMedDayTaxoFacets      217.31      (6.1%)      218.27      (5.5%)    0.4% ( -10% -   12%) 0.808
          OrHighMedDayTaxoFacets       47.73      (4.8%)       47.97      (3.9%)    0.5% (  -7% -    9%) 0.712
                           range     6721.40     (10.5%)     6784.00      (9.0%)    0.9% ( -16% -   22%) 0.763
               HighTermTitleSort      138.34      (5.1%)      139.89      (4.1%)    1.1% (  -7% -   10%) 0.443
                       OrHighMed      677.06     (14.7%)      687.35     (10.5%)    1.5% ( -20% -   31%) 0.707
            HighIntervalsOrdered      119.78     (10.0%)      121.82      (8.4%)    1.7% ( -15% -   22%) 0.562
                       OrHighLow     1099.81      (6.6%)     1118.62      (9.2%)    1.7% ( -13% -   18%) 0.499
                      HighPhrase       20.60      (4.2%)       20.98      (5.4%)    1.8% (  -7% -   11%) 0.232
                    OrHighNotMed      931.08      (6.7%)      950.36      (9.9%)    2.1% ( -13% -   19%) 0.438
     BrowseRandomLabelSSDVFacets       20.46      (4.4%)       20.95      (8.6%)    2.4% ( -10% -   16%) 0.270
                      OrHighHigh      263.33     (12.8%)      272.70     (13.9%)    3.6% ( -20% -   34%) 0.398
                      AndHighLow     2129.61     (16.0%)     2216.44     (11.3%)    4.1% ( -19% -   37%) 0.351
                       LowPhrase      218.81      (7.0%)      227.97     (11.5%)    4.2% ( -13% -   24%) 0.164
       BrowseDayOfYearTaxoFacets       35.03     (35.9%)       36.90     (34.8%)    5.4% ( -48% -  118%) 0.632
            BrowseDateTaxoFacets       34.70     (36.1%)       36.80     (35.2%)    6.0% ( -47% -  120%) 0.592
           BrowseMonthTaxoFacets       39.54     (33.0%)       42.11     (28.8%)    6.5% ( -41% -  101%) 0.508
            BrowseDateSSDVFacets        5.19     (20.1%)        5.59     (19.9%)    7.7% ( -26% -   59%) 0.222
     BrowseRandomLabelTaxoFacets       36.35     (51.7%)       39.68     (53.7%)    9.2% ( -63% -  236%) 0.583

@vsop-479
Contributor Author

Maybe I should measure it with DVBench in luceneutil, or add a bench in jmh.

@vsop-479 vsop-479 marked this pull request as ready for review March 24, 2025 06:57
@gf2121
Contributor

gf2121 commented Mar 24, 2025

Thanks for running the benchmark @vsop-479!

Maybe I should measure it with DVBench in luceneutil, or add a bench in jmh.

Yes, you are right, a bench in jmh will be great. So far, luceneutil has no tasks that measure IndexedDISI.

@vsop-479
Contributor Author

vsop-479 commented Mar 26, 2025

a bench in jmh will be great.

I measured it with AdvanceSparseDISIBenchmark:

Benchmark                                             Mode  Cnt    Score   Error   Units
AdvanceSparseDISIBenchmark.advance                   thrpt   15  669.502 ± 4.531  ops/ms
AdvanceSparseDISIBenchmark.advanceBinarySearch       thrpt   15  358.620 ± 1.102  ops/ms
AdvanceSparseDISIBenchmark.advanceExact              thrpt   15  752.444 ± 1.810  ops/ms
AdvanceSparseDISIBenchmark.advanceExactBinarySearch  thrpt   15  547.818 ± 2.278  ops/ms

Even when I set the interval between target docs to 10, there is still a big performance degradation. Maybe I use too many disi.slice.seek calls in this binary search version.
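To make the seek concern concrete, here is a minimal sketch that counts how many values each strategy reads. It assumes a sparse block can be modelled as a sorted array of unsigned 16-bit doc IDs held in a `short[]`; the class and method names are hypothetical, and the real IndexedDISI reads from an IndexInput slice, where each binary-search probe would correspond to a seek plus a read rather than an array access.

```java
/** Hypothetical in-memory stand-in for a sparse block: sorted doc IDs stored as unsigned 16-bit values. */
final class SparseBlockSearchSketch {

  /** Sequential scan starting at `from`; returns {index of first value >= target, values read}. */
  static int[] linearAdvance(short[] docs, int from, int target) {
    int probes = 0;
    int i = from;
    while (i < docs.length) {
      probes++;
      if (Short.toUnsignedInt(docs[i]) >= target) {
        break;
      }
      i++;
    }
    return new int[] {i, probes};
  }

  /** Lower-bound binary search over [from, docs.length); returns {index, values read}. */
  static int[] binaryAdvance(short[] docs, int from, int target) {
    int lo = from;
    int hi = docs.length;
    int probes = 0;
    while (lo < hi) {
      int mid = (lo + hi) >>> 1;
      probes++; // on disk, this read would sit at a non-sequential position
      if (Short.toUnsignedInt(docs[mid]) < target) {
        lo = mid + 1;
      } else {
        hi = mid;
      }
    }
    return new int[] {lo, probes};
  }
}
```

When the target is only a few entries ahead of the current position, the sequential scan reads about as many values as the binary search, and it reads them in order. That may be part of why the binary search version does not pay off at short intervals, though the sketch does not measure the actual cost of seeking.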

you will find we use VectorMask to speed this up. That is what I had in mind: get a MemorySegment slice if it is not null, and process it with VectorMask.

I will try VectorMask when I get a chance.

@vsop-479
Contributor Author

@gf2121
I implemented the VectorMask approach. There is still a slowdown. I suspect the reason is my laptop (Mac M2).

Benchmark                                  Mode  Cnt    Score    Error   Units
AdvanceSparseDISIBenchmark.advance        thrpt   15  654.472 ±  2.349  ops/ms
AdvanceSparseDISIBenchmark.advanceVector  thrpt   15  498.590 ± 66.751  ops/ms

@vsop-479
Contributor Author

@gf2121
I also implemented advanceExact with vectors, and there is still a slowdown. I will try to measure it on another machine (with more vector lanes).

Benchmark                                       Mode  Cnt    Score    Error   Units
AdvanceSparseDISIBenchmark.advanceExact        thrpt   15  727.403 ± 33.060  ops/ms
AdvanceSparseDISIBenchmark.advanceExactVector  thrpt   15  520.427 ±  0.868  ops/ms

@vsop-479
Contributor Author

vsop-479 commented Apr 1, 2025

Adjusted ENABLE_ADVANCE_WITHIN_BLOCK_VECTOR_OPTO to require at least 16 lanes (such as AVX, AVX-512).
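For reference, the lane count of a vector is its bit size divided by the element bit size. The sketch below works through a few cases; `LaneCountSketch` is a hypothetical helper, the 128-bit figure for the Mac M2 is an assumption, and the thread does not state which element type the constant above refers to.

```java
/** Hypothetical helper: lanes = vector bit size / element bit size. */
final class LaneCountSketch {
  static int lanes(int vectorBits, int elementBits) {
    return vectorBits / elementBits;
  }

  public static void main(String[] args) {
    // 16-bit elements (short), as a ShortVector would hold.
    System.out.println("128-bit, short: " + lanes(128, Short.SIZE)); // 8
    System.out.println("256-bit, short: " + lanes(256, Short.SIZE)); // 16
    System.out.println("512-bit, short: " + lanes(512, Short.SIZE)); // 32
    // 32-bit elements (int), as an IntVector would hold.
    System.out.println("512-bit, int:   " + lanes(512, Integer.SIZE)); // 16
  }
}
```

Under these assumptions a 128-bit machine would stay below a 16-lane threshold for both element types, so the vector path would not be enabled there.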

@vsop-479
Contributor Author

vsop-479 commented Apr 3, 2025

@gf2121, I measured it on a Linux server (uses preferredBitSize=512; FMA enabled), and there is still a massive slowdown. I will dig more ...

Benchmark                                       Mode  Cnt    Score   Error   Units
AdvanceSparseDISIBenchmark.advance             thrpt   15  386.100 ± 0.162  ops/ms
AdvanceSparseDISIBenchmark.advanceVector       thrpt   15  162.697 ± 0.581  ops/ms
AdvanceSparseDISIBenchmark.advanceExact        thrpt   15  437.998 ± 0.644  ops/ms
AdvanceSparseDISIBenchmark.advanceExactVector  thrpt   15  271.823 ± 0.625  ops/ms
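To put the table above in relative terms, the sketch below computes the throughput change of each vector variant against its scalar counterpart, using only the scores reported in the table. `ThroughputRatioSketch` is a hypothetical helper, and it ignores the reported error margins. The vector variants come out at roughly 58% (advance) and 38% (advanceExact) lower throughput.

```java
/** Hypothetical helper that expresses benchmark scores as a relative change. */
final class ThroughputRatioSketch {
  /** Fractional change of the candidate relative to the baseline; negative means slower. */
  static double relativeChange(double baselineOpsPerMs, double candidateOpsPerMs) {
    return (candidateOpsPerMs - baselineOpsPerMs) / baselineOpsPerMs;
  }

  public static void main(String[] args) {
    // Scores in ops/ms, copied from the Linux server run above.
    System.out.printf("advance vs advanceVector: %.1f%%%n",
        100 * relativeChange(386.100, 162.697));
    System.out.printf("advanceExact vs advanceExactVector: %.1f%%%n",
        100 * relativeChange(437.998, 271.823));
  }
}
```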


This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the [email protected] list. Thank you for your contribution!

@github-actions github-actions bot added the Stale label Apr 18, 2025