Preserve word boundaries when indexing RTE content with <br> tags #19540
base: v13/main
Replace <br> tags with spaces before HTML stripping to prevent word concatenation in Examine index. Fixes issue where "John Smith<br>Company ABC" was indexed as "John SmithCompany ABC" instead of "John Smith Company ABC".
- Add regex to replace <br> variants with spaces in RichTextPropertyIndexValueFactory
- Handles <br>, <br/> with spaces and attributes
- Maintains existing StripHtml() behavior for all other HTML tags
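The approach described in the commit message can be sketched roughly as follows. This is an illustration in Python, not the C# code from the PR: the pattern `BR_TAG`, the helper `strip_html_naive` (a crude stand-in for Umbraco's `StripHtml()`), and the function `index_text` are all hypothetical names, and the regex is an assumption rather than the one that was merged.

```python
import re

# Assumed pattern: "<br" followed by a word boundary, then anything up to ">".
# The \b keeps tags that merely start with "br" (e.g. a custom <brand> tag) from matching.
BR_TAG = re.compile(r"<br\b[^>]*>", re.IGNORECASE)

# Crude stand-in for StripHtml(): drops every tag without adding whitespace.
ANY_TAG = re.compile(r"<[^>]+>")


def strip_html_naive(html: str) -> str:
    """Remove all tags, leaving the surrounding text untouched."""
    return ANY_TAG.sub("", html)


def index_text(html: str) -> str:
    """Replace <br> variants with a space first, then strip the remaining tags."""
    with_spaces = BR_TAG.sub(" ", html)
    return strip_html_naive(with_spaces)


sample = "<p>John Smith<br>Company ABC<br>London</p>"
print(strip_html_naive(sample))  # John SmithCompany ABCLondon
print(index_text(sample))        # John Smith Company ABC London
```

Doing the replacement before stripping leaves the handling of every other tag unchanged, which matches the stated goal of keeping the existing `StripHtml()` behavior.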
Hi there @steveatkiss, thank you for this contribution! 👍 While we wait for one of the Core Collaborators team to have a look at your work, we wanted to let you know that we have a checklist for some of the things we will consider during review:
Don't worry if you got something wrong. We like to think of a pull request as the start of a conversation, and we're happy to provide guidance on improving your contribution. If you realize that you might want to make some changes, you can do that by adding new commits to the branch you created for this work and pushing them. They should then automatically show up as updates to this pull request. Thanks, from your friendly Umbraco GitHub bot 🤖 🙂
Additional comment which could be a commit, probably better to have the regex as something like: ie: to avoid matches on |
Hi @steveatkiss, Thanks for your PR to fix the indexing scenario where line breaks are removed, causing potential problems with the content. One of the Core Collaborators team will review this as soon as possible. I can't find a related issue, so I'm just wondering whether there is one or whether the issue is only described here. I've given HQ a heads-up just to check this is not going to have any knock-on effects. Also, it looks like this is your first PR to this repository! Nice work! #H5YR 🎉 Best wishes, Emma
Hi @steveatkiss I've tested the PR as is and you're right, it matches on Just FYI this regex still matches on the The new test cases have revealed that there are some other attributes that will cause similar indexing issues, but happy to stick with this scenario for now on this PR! I've asked HQ to take a look at this now since I've amended it too. Thanks Emma |
…nto v13/feat/improve-rte-indexing
Thanks @emmagarland ! I hadn't added the updated Regex to the code, it was just in the comment further up. I've now grabbed your unit tests and pushed the new Regex pattern up, are you able to re-run the tests at your end with the updated version? I don't have the site properly configured at this end in order to run those but if required I can. |
Thanks @steveatkiss! If you're expecting the tests output to be
instead of what I originally put from the old regex:
I think it's worth getting the tests running locally, but let me know if you have any issues doing so. Thanks, Emma
Hi @emmagarland - thanks for the quick follow up, actually the Really I think it should only work correctly for ones like:
so all with a valid So the tests can be updated to:
which are these two changed:
And maybe useful to have a couple more like below to test the attributes and new lines (I'm not 100% sure that's how you would test the new lines!):
Thanks @steveatkiss! Those tests all pass as expected now :) We should be good to get this merged. I imagine HQ are pretty busy at Codegarden this week but once someone has confirmed (since I made some changes too) then it should be good to go into v13 and hopefully cherry-picked into v16 too! I'll check in soon. Thanks again, Emma |
That's great, thanks for the tests and comments too @emmagarland - fingers crossed all OK with that as it will solve an issue we have open from a client without us having to create a more involved solution in the site itself. |
This pull request has been mentioned on Umbraco community forum. There might be relevant details there: https://forum.umbraco.com/t/blockgrid-rte-examine-indexing-issues/4034/3 |
Description
When Rich Text Editor content containing <br> tags is processed for Examine indexing, the HTML stripping process doesn't preserve whitespace between words separated by line breaks. This results in search terms being concatenated, e.g.:
<p>John Smith<br>Company ABC<br>London</p>
Current behavior: Indexes as "John SmithCompany ABCLondon"
Expected behavior: Should index as "John Smith Company ABC London"
This affects search functionality as users searching for "Smith" won't find content where it appears as "SmithCompany" in the index.
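To make the search impact concrete, here is a small Python sketch using the two index strings quoted above. The `tokens` helper is hypothetical and only approximates an analyzer by lowercasing and splitting on whitespace; the real Examine/Lucene analyzer is more involved, so treat this as an illustration of the word-boundary problem only.

```python
def tokens(text: str) -> list[str]:
    """Very rough stand-in for an index analyzer: lowercase, split on whitespace."""
    return [t.lower() for t in text.split()]


current = "John SmithCompany ABCLondon"      # indexed value without the fix
expected = "John Smith Company ABC London"   # indexed value with the fix

print(tokens(current))             # ['john', 'smithcompany', 'abclondon']
print("smith" in tokens(current))  # False: the term only exists fused with "Company"
print("smith" in tokens(expected)) # True
```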
The proposed pull request fixes this issue where <br> tags in Rich Text Editor content don't create proper word boundaries during Examine indexing, causing words to concatenate incorrectly in search results.
Changes
- Replace <br> tags with spaces before HTML stripping