Skip to content

Preserve word boundaries when indexing RTE content with <br> tags #19540

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 5 commits into
base: v13/main
Choose a base branch
from

Conversation

steveatkiss
Copy link

Prerequisites

  • [ x ] I have added steps to test this contribution in the description below

Description

When Rich Text Editor content containing <br> tags is processed for Examine indexing, the HTML stripping process doesn't preserve whitespace between words separated by line breaks. This results in search terms being concatenated.

eg:
<p>John Smith<br>Company ABC<br>London</p>

Current behavior: Indexes as "John SmithCompany ABCLondon"
Expected behavior: Should index as "John Smith Company ABC London"

This affects search functionality as users searching for "Smith" won't find content where it appears as "SmithCompany" in the index.

The proposed pull request fixes this issue where <br> tags in Rich Text Editor content don't create proper word boundaries during Examine indexing, causing words to concatenate incorrectly in search results.

Changes

  • Modified RichTextPropertyIndexValueFactory.GetIndexValues() to preprocess HTML content
  • Added regex pattern to replace <br> tags with spaces before HTML stripping
  • Maintains all existing functionality while fixing the word boundary issue

Replace <br> tags with spaces before HTML stripping to prevent word
concatenation in Examine index. Fixes issue where "John Smith<br>Company ABC"
was indexed as "John SmithCompany ABC" instead of "John Smith Company ABC".

- Add regex to replace <br> variants with spaces in RichTextPropertyIndexValueFactory
- Handles <br>, <br/> with spaces and attributes
- Maintains existing StripHtml() behavior for all other HTML tags
Copy link

github-actions bot commented Jun 12, 2025

Hi there @steveatkiss, thank you for this contribution! 👍

While we wait for one of the Core Collaborators team to have a look at your work, we wanted to let you know about that we have a checklist for some of the things we will consider during review:

  • It's clear what problem this is solving, there's a connected issue or a description of what the changes do and how to test them
  • The automated tests all pass (see "Checks" tab on this PR)
  • The level of security for this contribution is the same or improved
  • The level of performance for this contribution is the same or improved
  • Avoids creating breaking changes; note that behavioral changes might also be perceived as breaking
  • If this is a new feature, Umbraco HQ provided guidance on the implementation beforehand
  • 💡 The contribution looks original and the contributor is presumably allowed to share it

Don't worry if you got something wrong. We like to think of a pull request as the start of a conversation, we're happy to provide guidance on improving your contribution.

If you realize that you might want to make some changes then you can do that by adding new commits to the branch you created for this work and pushing new commits. They should then automatically show up as updates to this pull request.

Thanks, from your friendly Umbraco GitHub bot 🤖 🙂

@steveatkiss steveatkiss marked this pull request as ready for review June 12, 2025 13:19
@steveatkiss
Copy link
Author

steveatkiss commented Jun 12, 2025

Additional comment which could be a commit, probably better to have the regex as something like:
(<br\s+[^>]*/?>\s*|<br\s*>) or <br\b[^>]*/?>\s*

ie:
html = Regex.Replace(html, @"<br\b[^>]*/?>\s*", " ", RegexOptions.IgnoreCase);

to avoid matches on <break> or <branything>.

@emmagarland
Copy link
Contributor

Hi @steveatkiss,

Thanks for your PR to fix the indexing scenario where breakpoints are removed, causing potential problems with the content.

One of the Core Collaborators team will review this as soon as possible. I can't find a related issue - just wondering if there is one or if the issue is as described in here only?

I've given HQ a heads-up just to check this is not going to have any knock-on effects.

Also, it looks like this is your first PR to this repository! Nice work! #H5YR 🎉

Best wishes

Emma

@emmagarland emmagarland added community/pr category/dx Developer experience labels Jun 18, 2025
@emmagarland emmagarland self-assigned this Jun 19, 2025
@emmagarland
Copy link
Contributor

emmagarland commented Jun 19, 2025

Hi @steveatkiss

I've tested the PR as is and you're right, it matches on <branything>. I've taken the add a unit test with various that should cover these scenarios and more.

Just FYI this regex still matches on the <break>/<branything> but I have included that in the tests in case you wanted to try and mitigate it in another regex.

The new test cases have revealed that there are some other attributes that will cause similar indexing issues, but happy to stick with this scenario for now on this PR!

I've asked HQ to take a look at this now since I've amended it too.

Thanks

Emma

@steveatkiss
Copy link
Author

steveatkiss commented Jun 19, 2025

Thanks @emmagarland !

I hadn't added the updated Regex to the code, it was just in the comment further up. I've now grabbed your unit tests and pushed the new Regex pattern up, are you able to re-run the tests at your end with the updated version? I don't have the site properly configured at this end in order to run those but if required I can.

@emmagarland
Copy link
Contributor

Thanks @steveatkiss!

If you're expecting the tests output to be John Smith<break>Company ABC<break>London, that is not happening and it is still being stripped out with the new regex. The tests also break because they are now outputting:

John SmithCompany ABCLondon

instead of what I originally put from the old regex:

John Smith Company ABC London

I think its worth getting the tests running locally but let me know if you're having any issues doing so,

Thanks

Emma

@steveatkiss
Copy link
Author

steveatkiss commented Jun 19, 2025

Hi @emmagarland - thanks for the quick follow up, actually the <break> bit isn't needed, I don't think it's a valid HTML tag, it was just an example of something that the Regex shouldn't pick up, like the <branything>, sorry I should have made that a bit clearer.

Really I think it should only work correctly for ones like:

<br>
<br />
<br class="something">
<br class="something" />
<br data-test="different-attribute">
<br
class="newlines-inside"
>
etc 

so all with a valid <br> tag of any variation as that could appear within Rich Text content (although would usually be converted to a p tag but not when it's in a table cell for example).

So the tests can be updated to:

[TestCase("<p>Sample text</p>", "Sample text")]
[TestCase("<p>John Smith<br>Company ABC<br>London</p>", "John Smith Company ABC London")]
[TestCase("<p>John Smith<break>Company ABC<break>London</p>", "John SmithCompany ABCLondon")]
[TestCase("<p>John Smith<br>Company ABC<branything>London</p>", "John Smith Company ABCLondon")]
[TestCase("<p>Another sample text with <strong>bold</strong> content</p>", "Another sample text with bold content")]
[TestCase("<p>Text with <a href=\"https://example.com\">link</a></p>", "Text with link")]
[TestCase("<p>Text with <img src=\"image.jpg\" alt=\"image\" /></p>", "Text with ")]
[TestCase("<p>Text with <span style=\"color: red;\">styled text</span></p>", "Text with styled text")]
[TestCase("<p>Text with <em>emphasized</em> content</p>", "Text with emphasized content")]
[TestCase("<p>Text with <u>underlined</u> content</p>", "Text with underlined content")]
[TestCase("<p>Text with <code>inline code</code></p>", "Text with inline code")]
[TestCase("<p>Text with <pre><code>code block</code></pre></p>", "Text with code block")]
[TestCase("<p>Text with <blockquote>quoted text</blockquote></p>", "Text with quoted text")]
[TestCase("<p>Text with <ul><li>list item 1</li><li>list item 2</li></ul></p>",
    "Text with list item 1list item 2 ")]
[TestCase("<p>Text with <ol><li>ordered item 1</li><li>ordered item 2</li></ol></p>",
    "Text with ordered item 1ordered item 2")]
[TestCase("<p>Text with <div class=\"class-name\">div content</div></p>", "Text with div content")]
[TestCase("<p>Text with <span class=\"class-name\">span content</span></p>", "Text with span content")]
[TestCase("<p>Text with <strong>bold</strong> and <em>italic</em> content</p>",
    "Text with bold and italic content")]
[TestCase("<p>Text with <a href=\"https://example.com\" target=\"_blank\">external link</a></p>",
    "Text with external link")]

which are these two changed:

[TestCase("<p>John Smith<break>Company ABC<break>London</p>", "John SmithCompany ABCLondon")]
[TestCase("<p>John Smith<br>Company ABC<branything>London</p>", "John Smith Company ABCLondon")]

And maybe useful to have a couple more like below to test the attributes and new lines (I'm not 100% sure that's how you would test the new lines!):

[TestCase("<p>John Smith<br class=\"test\">Company ABC<br>London</p>", "John Smith Company ABC London")]
[TestCase("<p>John Smith<br \r\n />Company ABC<br>London</p>", "John Smith Company ABC London")]

@emmagarland
Copy link
Contributor

Thanks @steveatkiss!

Those tests all pass as expected now :)

We should be good to get this merged. I imagine HQ are pretty busy at Codegarden this week but once someone has confirmed (since I made some changes too) then it should be good to go into v13 and hopefully cherry-picked into v16 too!

I'll check in soon. Thanks again,

Emma

@steveatkiss
Copy link
Author

That's great, thanks for the tests and comments too @emmagarland - fingers crossed all OK with that as it will solve an issue we have open from a client without us having to create a more involved solution in the site itself.

@umbracocommunity
Copy link

This pull request has been mentioned on Umbraco community forum. There might be relevant details there:

https://forum.umbraco.com/t/blockgrid-rte-examine-indexing-issues/4034/3

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
category/dx Developer experience community/pr
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants