Preserve word boundaries when indexing RTE content with <br> tags #19540

steveatkiss · 2025-06-12T13:18:20Z

Prerequisites

[ x ] I have added steps to test this contribution in the description below

Description

When Rich Text Editor content containing <br> tags is processed for Examine indexing, the HTML stripping process doesn't preserve whitespace between words separated by line breaks. As a result, adjacent words are concatenated in the index.

eg:
John Smith<br>Company ABC<br>London

Current behavior: Indexes as "John SmithCompany ABCLondon"
Expected behavior: Should index as "John Smith Company ABC London"

This affects search functionality as users searching for "Smith" won't find content where it appears as "SmithCompany" in the index.

The proposed pull request fixes this issue, where <br> tags in Rich Text Editor content don't create proper word boundaries during Examine indexing, causing words to be concatenated in the index.

Changes

Modified RichTextPropertyIndexValueFactory.GetIndexValues() to preprocess HTML content
Added regex pattern to replace <br> tags with spaces before HTML stripping
Maintains all existing functionality while fixing the word boundary issue

Replace <br> tags with spaces before HTML stripping to prevent word concatenation in Examine index. Fixes issue where "John Smith<br>Company ABC" was indexed as "John SmithCompany ABC" instead of "John Smith Company ABC".
- Add regex to replace <br> variants with spaces in RichTextPropertyIndexValueFactory
- Handles <br>, <br/> with spaces and attributes
- Maintains existing StripHtml() behavior for all other HTML tags

github-actions · 2025-06-12T13:18:28Z

Hi there @steveatkiss, thank you for this contribution! 👍

While we wait for one of the Core Collaborators team to have a look at your work, we wanted to let you know that we have a checklist for some of the things we will consider during review:

It's clear what problem this is solving, there's a connected issue or a description of what the changes do and how to test them
The automated tests all pass (see "Checks" tab on this PR)
The level of security for this contribution is the same or improved
The level of performance for this contribution is the same or improved
Avoids creating breaking changes; note that behavioral changes might also be perceived as breaking
If this is a new feature, Umbraco HQ provided guidance on the implementation beforehand
💡 The contribution looks original and the contributor is presumably allowed to share it

Don't worry if you got something wrong. We like to think of a pull request as the start of a conversation, we're happy to provide guidance on improving your contribution.

If you realize that you might want to make some changes then you can do that by adding new commits to the branch you created for this work and pushing new commits. They should then automatically show up as updates to this pull request.

Thanks, from your friendly Umbraco GitHub bot 🤖 🙂

steveatkiss · 2025-06-12T13:47:43Z

An additional comment, which could become a commit: it is probably better to have the regex as something like:
(<br\s+[^>]*/?>\s*|<br\s*>) or <br\b[^>]*/?>\s*

ie:
html = Regex.Replace(html, @"<br\b[^>]*/?>\s*", " ", RegexOptions.IgnoreCase);

to avoid matches on <break> or <branything>.
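To illustrate why the `\b` matters, here is a minimal sketch. The `LoosePattern` constant is a hypothetical pattern used only for comparison (the regex originally committed in the PR is not shown in this thread), and the class and method names are made up for the example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BrBoundaryDemo
{
    // Hypothetical loose pattern, for comparison only (not taken from the PR).
    public const string LoosePattern = @"<br[^>]*>";

    // Pattern suggested in the comment above.
    public const string StrictPattern = @"<br\b[^>]*/?>\s*";

    // Replaces every match of the given pattern with a single space.
    public static string ReplaceBreaks(string html, string pattern) =>
        Regex.Replace(html, pattern, " ", RegexOptions.IgnoreCase);

    public static void Main()
    {
        const string html = "John Smith<br/>Company ABC<break>London";

        // The loose pattern also swallows <break>:
        // prints "John Smith Company ABC London"
        Console.WriteLine(ReplaceBreaks(html, LoosePattern));

        // \b requires a word boundary after "br", so <break> is left alone:
        // prints "John Smith Company ABC<break>London"
        Console.WriteLine(ReplaceBreaks(html, StrictPattern));
    }
}
```

With the strict pattern, anything that is not a real `<br>` tag is left for the later HTML-stripping step to handle.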

emmagarland · 2025-06-16T07:31:03Z

Hi @steveatkiss,

Thanks for your PR to fix the indexing scenario where line breaks are removed, causing potential problems with the content.

One of the Core Collaborators team will review this as soon as possible. I can't find a related issue - just wondering if there is one or if the issue is as described in here only?

I've given HQ a heads-up just to check this is not going to have any knock-on effects.

Also, it looks like this is your first PR to this repository! Nice work! #H5YR 🎉

Best wishes

Emma

emmagarland · 2025-06-19T09:20:01Z

Hi @steveatkiss

I've tested the PR as is and you're right, it matches on <branything>. I've taken the liberty of adding a unit test with various test cases that should cover these scenarios and more.

Just FYI this regex still matches on the <break>/<branything> but I have included that in the tests in case you wanted to try and mitigate it in another regex.

The new test cases have revealed that there are some other attributes that will cause similar indexing issues, but happy to stick with this scenario for now on this PR!

I've asked HQ to take a look at this now since I've amended it too.

Thanks

Emma

steveatkiss · 2025-06-19T11:51:25Z

Thanks @emmagarland !

I hadn't added the updated Regex to the code, it was just in the comment further up. I've now grabbed your unit tests and pushed the new Regex pattern up, are you able to re-run the tests at your end with the updated version? I don't have the site properly configured at this end in order to run those but if required I can.

emmagarland · 2025-06-19T13:46:56Z

Thanks @steveatkiss!

If you're expecting the tests output to be John Smith<break>Company ABC<break>London, that is not happening and it is still being stripped out with the new regex. The tests also break because they are now outputting:

John SmithCompany ABCLondon

instead of what I originally put from the old regex:

John Smith Company ABC London

I think it's worth getting the tests running locally, but let me know if you're having any issues doing so.

Thanks

Emma

steveatkiss · 2025-06-19T14:11:06Z

Hi @emmagarland - thanks for the quick follow up. Actually the <break> bit isn't needed; I don't think it's a valid HTML tag. It was just an example of something that the Regex shouldn't pick up, like the <branything>. Sorry, I should have made that a bit clearer.

Really I think it should only work correctly for ones like:

<br>
<br />
<br class="something">
<br class="something" />
<br data-test="different-attribute">
<br
class="newlines-inside"
>
etc

so all with a valid <br> tag of any variation, as that could appear within Rich Text content (although it would usually be converted to a p tag, but not when it's in a table cell, for example).
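As a quick sanity check of the variants listed above, the sketch below runs each one through the pattern from the earlier comment. The class and method names are hypothetical; only the pattern itself comes from this thread.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BrVariantCheck
{
    // Pattern from the earlier comment in this thread.
    private static readonly Regex BrTag =
        new Regex(@"<br\b[^>]*/?>\s*", RegexOptions.IgnoreCase);

    // True when the input contains something the pattern treats as a <br> tag.
    public static bool IsBrTag(string tag) => BrTag.IsMatch(tag);

    public static void Main()
    {
        string[] tags =
        {
            "<br>",
            "<br />",
            "<br class=\"something\">",
            "<br class=\"something\" />",
            "<br data-test=\"different-attribute\">",
            "<br\nclass=\"newlines-inside\"\n>", // [^>] also matches newlines
            "<break>",                           // should not match
            "<branything>",                      // should not match
        };

        foreach (var tag in tags)
        {
            // Show newlines as a visible escape so each result stays on one line.
            var shown = tag.Replace("\n", "\\n");
            Console.WriteLine(IsBrTag(tag) + "\t" + shown);
        }
    }
}
```

The first six entries should report True and the last two False, since the negated character class `[^>]` spans newlines while `\b` rejects tag names that merely start with "br".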

So the tests can be updated to:

[TestCase("<p>Sample text</p>", "Sample text")]
[TestCase("<p>John Smith<br>Company ABC<br>London</p>", "John Smith Company ABC London")]
[TestCase("<p>John Smith<break>Company ABC<break>London</p>", "John SmithCompany ABCLondon")]
[TestCase("<p>John Smith<br>Company ABC<branything>London</p>", "John Smith Company ABCLondon")]
[TestCase("<p>Another sample text with <strong>bold</strong> content</p>", "Another sample text with bold content")]
[TestCase("<p>Text with <a href=\"https://example.com\">link</a></p>", "Text with link")]
[TestCase("<p>Text with <img src=\"image.jpg\" alt=\"image\" /></p>", "Text with ")]
[TestCase("<p>Text with <span style=\"color: red;\">styled text</span></p>", "Text with styled text")]
[TestCase("<p>Text with <em>emphasized</em> content</p>", "Text with emphasized content")]
[TestCase("<p>Text with <u>underlined</u> content</p>", "Text with underlined content")]
[TestCase("<p>Text with <code>inline code</code></p>", "Text with inline code")]
[TestCase("<p>Text with <pre><code>code block</code></pre></p>", "Text with code block")]
[TestCase("<p>Text with <blockquote>quoted text</blockquote></p>", "Text with quoted text")]
[TestCase("<p>Text with <ul><li>list item 1</li><li>list item 2</li></ul></p>",
    "Text with list item 1list item 2 ")]
[TestCase("<p>Text with <ol><li>ordered item 1</li><li>ordered item 2</li></ol></p>",
    "Text with ordered item 1ordered item 2")]
[TestCase("<p>Text with <div class=\"class-name\">div content</div></p>", "Text with div content")]
[TestCase("<p>Text with <span class=\"class-name\">span content</span></p>", "Text with span content")]
[TestCase("<p>Text with <strong>bold</strong> and <em>italic</em> content</p>",
    "Text with bold and italic content")]
[TestCase("<p>Text with <a href=\"https://example.com\" target=\"_blank\">external link</a></p>",
    "Text with external link")]

Only these two have changed:

[TestCase("<p>John Smith<break>Company ABC<break>London</p>", "John SmithCompany ABCLondon")]
[TestCase("<p>John Smith<br>Company ABC<branything>London</p>", "John Smith Company ABCLondon")]

And maybe useful to have a couple more like below to test the attributes and new lines (I'm not 100% sure that's how you would test the new lines!):

[TestCase("<p>John Smith<br class=\"test\">Company ABC<br>London</p>", "John Smith Company ABC London")]
[TestCase("<p>John Smith<br \r\n />Company ABC<br>London</p>", "John Smith Company ABC London")]
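For anyone who wants to try these expectations without the full test project, here is a rough standalone sketch of the two-step flow (replace `<br>` with a space, then strip the remaining tags). The `StripTags` helper is a simplified stand-in written for this example and is only an assumption about how tag stripping behaves; it is not Umbraco's actual `StripHtml()` implementation, and `ToIndexText` is a hypothetical name.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RteIndexSketch
{
    // Step 1: turn every <br> variant into a single space (pattern from this thread).
    public static string ReplaceBreaks(string html) =>
        Regex.Replace(html, @"<br\b[^>]*/?>\s*", " ", RegexOptions.IgnoreCase);

    // Step 2: simplified stand-in for HTML stripping (assumption, not Umbraco's code).
    public static string StripTags(string html) =>
        Regex.Replace(html, @"<[^>]*>", string.Empty);

    public static string ToIndexText(string html) => StripTags(ReplaceBreaks(html));

    public static void Main()
    {
        // prints "John Smith Company ABC London"
        Console.WriteLine(ToIndexText("<p>John Smith<br \r\n />Company ABC<br>London</p>"));

        // <branything> is not a <br>, so it is stripped without adding a space:
        // prints "John Smith Company ABCLondon"
        Console.WriteLine(ToIndexText("<p>John Smith<br>Company ABC<branything>London</p>"));
    }
}
```

The order matters: if tags were stripped first, the `<br>` elements would already be gone before a space could be substituted for them.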

emmagarland · 2025-06-19T14:32:28Z

Thanks @steveatkiss!

Those tests all pass as expected now :)

We should be good to get this merged. I imagine HQ are pretty busy at Codegarden this week but once someone has confirmed (since I made some changes too) then it should be good to go into v13 and hopefully cherry-picked into v16 too!

I'll check in soon. Thanks again,

Emma

steveatkiss · 2025-06-19T14:56:08Z

That's great, thanks for the tests and comments too @emmagarland - fingers crossed all OK with that as it will solve an issue we have open from a client without us having to create a more involved solution in the site itself.

umbracocommunity · 2025-06-19T16:44:04Z

This pull request has been mentioned on Umbraco community forum. There might be relevant details there:

https://forum.umbraco.com/t/blockgrid-rte-examine-indexing-issues/4034/3

steveatkiss marked this pull request as ready for review June 12, 2025 13:19

emmagarland added community/pr category/dx Developer experience labels Jun 18, 2025

emmagarland self-assigned this Jun 19, 2025

Added unit test with test cases for expected index values

13f1f7c

steveatkiss added 2 commits June 19, 2025 12:43

Regex tweak to avoid matches on <break> <branything> etc

111c953

Merge remote-tracking branch 'origin/v13/feat/improve-rte-indexing' i…

50349fa

…nto v13/feat/improve-rte-indexing

Tweaked tests as per PR feedback

0b5dfc4
