Add additional unicode escape support #4296

jltrfl · 2025-07-08T18:00:20Z

Description:

Currently, TruffleHog supports three Unicode escape formats:

\u0041
\\u0041
U+0041

This PR adds support for several other common formats:

\u{41}
\U00000041
\x{41}
\41
&#x41;
%u0041
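For illustration, here is a minimal sketch of how most of these formats could be decoded with one capture-group pattern each. It is not the code in this PR: `escapePatterns` and `decodeEscapes` are hypothetical names, and the CSS `\41` form is left out because it needs delimiter handling.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
	"unicode/utf8"
)

// escapePatterns is a hypothetical table (not taken from the PR).
// Every pattern captures the hex digits of the code point in group 1.
var escapePatterns = []*regexp.Regexp{
	regexp.MustCompile(`\\u\{([0-9a-fA-F]{1,6})\}`), // \u{41}
	regexp.MustCompile(`\\U([0-9a-fA-F]{8})`),       // \U00000041
	regexp.MustCompile(`\\x\{([0-9a-fA-F]{1,6})\}`), // \x{41}
	regexp.MustCompile(`&#x([0-9a-fA-F]{1,6});`),    // &#x41;
	regexp.MustCompile(`%u([0-9a-fA-F]{4})`),        // %u0041
}

// decodeEscapes replaces each recognized escape with its UTF-8 encoding.
// Escapes that do not name a valid code point are left untouched.
func decodeEscapes(input []byte) []byte {
	out := input
	for _, pat := range escapePatterns {
		out = pat.ReplaceAllFunc(out, func(match []byte) []byte {
			sub := pat.FindSubmatch(match)
			cp, err := strconv.ParseUint(string(sub[1]), 16, 32)
			if err != nil || cp > utf8.MaxRune || !utf8.ValidRune(rune(cp)) {
				return match
			}
			return utf8.AppendRune(nil, rune(cp))
		})
	}
	return out
}

func main() {
	in := []byte(`\u{41}\U00000042\x{43}&#x44;%u0045`)
	fmt.Printf("%s\n", decodeEscapes(in)) // prints "ABCDE"
}
```

Running one regex pass per format keeps the sketch short, but it scans the input five times, so it says nothing about the throughput of the real decoder.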

Unit tests were added to ensure proper decoding of each new format, and benchmark tests were added to measure the performance impact. Based on my testing, there is no performance cost increase; however, I encourage the reviewer to conduct additional testing.

Beyond additional testing, I encourage the reviewer to consider adding more Unicode escape formats. I attempted to add a generic space-separated hex notation (0xHH 0xHH), but it increased memory usage enough to cause a significant performance impact. I think this case is worth supporting, so I left my work in the changed files, commented out.
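One possible direction for the space-separated hex case is a single pass that writes into one pre-sized buffer instead of using a regex. The sketch below is only an assumption about how that might look; `decodeSpacedHex` and `hexByteAt` are hypothetical helpers, not the commented-out code in the PR, and the approach has not been benchmarked.

```go
package main

import "fmt"

// hexNibble converts one ASCII hex digit to its value.
func hexNibble(c byte) (byte, bool) {
	switch {
	case c >= '0' && c <= '9':
		return c - '0', true
	case c >= 'a' && c <= 'f':
		return c - 'a' + 10, true
	case c >= 'A' && c <= 'F':
		return c - 'A' + 10, true
	}
	return 0, false
}

// hexByteAt reports whether a 0xHH token starts at index i and returns its value.
func hexByteAt(b []byte, i int) (byte, bool) {
	if i+3 >= len(b) || b[i] != '0' || (b[i+1] != 'x' && b[i+1] != 'X') {
		return 0, false
	}
	hi, ok1 := hexNibble(b[i+2])
	lo, ok2 := hexNibble(b[i+3])
	if !ok1 || !ok2 {
		return 0, false
	}
	return hi<<4 | lo, true
}

// decodeSpacedHex is a hypothetical single-pass decoder for "0xHH 0xHH" runs.
// The output is never longer than the input, so one allocation is enough.
func decodeSpacedHex(input []byte) []byte {
	out := make([]byte, 0, len(input))
	for i := 0; i < len(input); {
		if v, ok := hexByteAt(input, i); ok {
			out = append(out, v)
			i += 4
			// Drop a single space only when another token follows it.
			if i < len(input) && input[i] == ' ' {
				if _, next := hexByteAt(input, i+1); next {
					i++
				}
			}
			continue
		}
		out = append(out, input[i])
		i++
	}
	return out
}

func main() {
	fmt.Printf("%s\n", decodeSpacedHex([]byte("key=0x41 0x42 0x43 end"))) // prints "key=ABC end"
}
```

This version decodes any `0xHH` it sees, including the prefix of a longer literal such as `0x4100`, so a real implementation would need a rule for rejecting such false positives.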

CLAassistant · 2025-07-08T18:00:27Z

Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
2 out of 3 committers have signed the CLA.

✅ jltrfl
✅ shahzadhaider1
❌ Joe Leon

Joe Leon seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account.
You have signed the CLA already but the status is still pending? Let us recheck it.


kashifkhan0771 · 2025-07-10T06:34:26Z

pkg/decoders/escaped_unicode.go

 }
+
+// decodeHexEscape handles 0xX format - Hexadecimal notation with space separation
+// func decodeHexEscape(input []byte) []byte {


Are we intentionally commenting out this code?

jltrfl · 2025-07-14

I left this commented out in case you want to try an implementation that doesn't bloat memory. I couldn't figure one out, but I think supporting this format would be valuable, since it's fairly common from what I've seen. If that needs to be put on pause, that's fine, and we can remove the commented-out code. What do you think?

dustin-decker · 2025-07-15T23:12:20Z

pkg/decoders/escaped_unicode.go

+	// \x{X} format - Perl (variable length hex in braces)
+	perlEscapePat = regexp.MustCompile(`\\x\{([a-fA-F0-9]{1,6})\}`)
+
+	// \X format - CSS (hex without padding). Go's regexp (RE2) has no look-ahead, so we
+	// include the delimiter (whitespace, another backslash, or end-of-string) in the
+	// match using a non-capturing group. The delimiter is later re-inserted by the
+	// decoder when necessary.
+	cssEscapePat = regexp.MustCompile(`\\([a-fA-F0-9]{1,6})(?:\s|\\|$)`)


If these are specific to Perl and CSS, I think we could probably omit them for better performance.
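To make the delimiter handling from the `cssEscapePat` comment concrete, here is a hedged sketch of a decoder loop built around that pattern. `decodeCSS` is a hypothetical function, not the one in the PR. It assumes that a whitespace terminator is consumed along with the escape (as CSS does) and that a backslash terminator is handed back so the next escape can still match.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
	"unicode/utf8"
)

// Pattern copied from the diff above; the trailing group is the delimiter.
var cssEscapePat = regexp.MustCompile(`\\([a-fA-F0-9]{1,6})(?:\s|\\|$)`)

// decodeCSS is a hypothetical decoder loop (an assumption, not the PR's code).
func decodeCSS(input []byte) []byte {
	var out []byte
	pos := 0
	for pos < len(input) {
		loc := cssEscapePat.FindSubmatchIndex(input[pos:])
		if loc == nil {
			break
		}
		start, end := pos+loc[0], pos+loc[1]
		hexStart, hexEnd := pos+loc[2], pos+loc[3]
		out = append(out, input[pos:start]...)

		cp, err := strconv.ParseUint(string(input[hexStart:hexEnd]), 16, 32)
		if err != nil || !utf8.ValidRune(rune(cp)) {
			// Not a valid code point: keep the backslash and rescan after it.
			out = append(out, input[start])
			pos = start + 1
			continue
		}
		out = utf8.AppendRune(out, rune(cp))

		// A backslash delimiter starts the next escape, so give it back.
		if end > hexEnd && input[end-1] == '\\' {
			end--
		}
		pos = end
	}
	return append(out, input[pos:]...)
}

func main() {
	fmt.Printf("%s\n", decodeCSS([]byte(`\41\42 \43`))) // prints "ABC"
}
```

A plain `ReplaceAll` would not work here: once the backslash delimiter is consumed by one match, the following escape has lost its leading backslash and is skipped.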

dustin-decker · 2025-07-21T16:49:41Z

I'm a bit concerned about the throughput of this decoder overall, since it is the only regex-based one and runs on every chunk. I think it might be possible to use string operations instead of regex to improve that.
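As a rough sketch of what a regex-free path could look like, the example below handles only the `\u0041` form, using `bytes.IndexByte` to jump between backslashes. `decodeBackslashU` and `parseU4` are hypothetical names, and whether this is actually faster than the regex version would need to be confirmed with the benchmark tests.

```go
package main

import (
	"bytes"
	"fmt"
	"unicode/utf8"
)

// parseU4 parses a \uXXXX escape at the start of b without allocating.
func parseU4(b []byte) (rune, bool) {
	if len(b) < 6 || b[0] != '\\' || b[1] != 'u' {
		return 0, false
	}
	var r rune
	for _, c := range b[2:6] {
		switch {
		case c >= '0' && c <= '9':
			r = r<<4 | rune(c-'0')
		case c >= 'a' && c <= 'f':
			r = r<<4 | rune(c-'a'+10)
		case c >= 'A' && c <= 'F':
			r = r<<4 | rune(c-'A'+10)
		default:
			return 0, false
		}
	}
	return r, utf8.ValidRune(r) // rejects lone surrogates
}

// decodeBackslashU is a hypothetical regex-free decoder for \uXXXX only.
func decodeBackslashU(input []byte) []byte {
	i := bytes.IndexByte(input, '\\')
	if i < 0 {
		return input // fast path: no backslash, no allocation
	}
	out := make([]byte, 0, len(input))
	for {
		out = append(out, input[:i]...)
		input = input[i:]
		if r, ok := parseU4(input); ok {
			out = utf8.AppendRune(out, r)
			input = input[6:]
		} else {
			out = append(out, '\\') // not an escape: copy the backslash
			input = input[1:]
		}
		i = bytes.IndexByte(input, '\\')
		if i < 0 {
			return append(out, input...)
		}
	}
}

func main() {
	fmt.Printf("%s\n", decodeBackslashU([]byte(`token=\u0041\u0042C \n`))) // prints `token=ABC \n`
}
```

The early return matters most for throughput: chunks that contain no backslash at all are passed through without any copying.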

jltrfl and others added 5 commits July 8, 2025 09:50

Update escaped_unicode.go

bb1de3a

Create escaped_unicode_bench_test.go

31183ec

Update escaped_unicode_test.go

d2717c6

updated unicode escape logic

e809c3a

comment updates

e4b378b

jltrfl requested review from a team as code owners July 8, 2025 18:00

shahzadhaider1 approved these changes Jul 10, 2025


Merge branch 'main' into cursor/add-unicode-escape-support-for-truffl…

58d5dae

…ehog-185c

kashifkhan0771 reviewed Jul 10, 2025


dustin-decker reviewed Jul 15, 2025
