Skip to content

Add additional unicode escape support #4296

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 6 commits into
base: main
Choose a base branch
from

Conversation

jltrfl
Copy link

@jltrfl jltrfl commented Jul 8, 2025

Description:

Currently TruffleHog supports 3 unicode escape formats:

\u0041
\\u0041
U+0041

This PR adds support for several other common formats:

\u{41} 
\U00000041
\x{41}
\41
A
%u0041

Unittests were added to ensure proper decoding of each new format, and benchmark tests was added to understand performance impacts. Based on my testing, there is no performance cost increase; however, I encourage the reviewer to conduct additional testing.

Beyond additional testing, I encourage the reviewer to consider adding additional unicode escape formats. I attempted to add a generic 0xHH 0xHH hex notation with space separation, but memory bloated too much and caused a significant performance impact. I think it'd be worth adding support for this case, and I left my work in the changed files, but it's commented out.

@jltrfl jltrfl requested review from a team as code owners July 8, 2025 18:00
@CLAassistant
Copy link

CLAassistant commented Jul 8, 2025

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
2 out of 3 committers have signed the CLA.

✅ jltrfl
✅ shahzadhaider1
❌ Joe Leon


Joe Leon seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you have already a GitHub account, please add the email address used for this commit to your account.
You have signed the CLA already but the status is still pending? Let us recheck it.

}

// decodeHexEscape handles 0xX format - Hexadecimal notation with space separation
// func decodeHexEscape(input []byte) []byte {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we intentionally commenting out this code?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left this commented out, in case you wanted to try an implementation that doesn't bloat memory. I couldn't figure it out, but I think supporting this format would be really great given it's a pretty common format from what I've seen. But if that needs to be put on pause, that's fine, and we can remove the comments. What do you think?

Comment on lines +35 to +42
// \x{X} format - Perl (variable length hex in braces)
perlEscapePat = regexp.MustCompile(`\\x\{([a-fA-F0-9]{1,6})\}`)

// \X format - CSS (hex without padding). Go's regexp (RE2) has no look-ahead, so we
// include the delimiter (whitespace, another backslash, or end-of-string) in the
// match using a non-capturing group. The delimiter is later re-inserted by the
// decoder when necessary.
cssEscapePat = regexp.MustCompile(`\\([a-fA-F0-9]{1,6})(?:\s|\\|$)`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If these are specific to Perl and CSS I think we could probably omit them for better performance

@dustin-decker
Copy link
Contributor

I'm a bit concerned about throughput of this decoder overall since it is the only regex based one and runs on every chun. I think might be possible to do string operations instead of regex to improve that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants