-
Notifications
You must be signed in to change notification settings - Fork 1.9k
Add additional unicode escape support #4296
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Add additional unicode escape support #4296
Conversation
Joe Leon seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you have already a GitHub account, please add the email address used for this commit to your account. You have signed the CLA already but the status is still pending? Let us recheck it. |
} | ||
|
||
// decodeHexEscape handles 0xX format - Hexadecimal notation with space separation | ||
// func decodeHexEscape(input []byte) []byte { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we intentionally commenting out this code?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left this commented out, in case you wanted to try an implementation that doesn't bloat memory. I couldn't figure it out, but I think supporting this format would be really great given it's a pretty common format from what I've seen. But if that needs to be put on pause, that's fine, and we can remove the comments. What do you think?
// \x{X} format - Perl (variable length hex in braces) | ||
perlEscapePat = regexp.MustCompile(`\\x\{([a-fA-F0-9]{1,6})\}`) | ||
|
||
// \X format - CSS (hex without padding). Go's regexp (RE2) has no look-ahead, so we | ||
// include the delimiter (whitespace, another backslash, or end-of-string) in the | ||
// match using a non-capturing group. The delimiter is later re-inserted by the | ||
// decoder when necessary. | ||
cssEscapePat = regexp.MustCompile(`\\([a-fA-F0-9]{1,6})(?:\s|\\|$)`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If these are specific to Perl and CSS I think we could probably omit them for better performance
I'm a bit concerned about throughput of this decoder overall since it is the only regex based one and runs on every chun. I think might be possible to do string operations instead of regex to improve that. |
Description:
Currently TruffleHog supports 3 unicode escape formats:
This PR adds support for several other common formats:
Unittests were added to ensure proper decoding of each new format, and benchmark tests was added to understand performance impacts. Based on my testing, there is no performance cost increase; however, I encourage the reviewer to conduct additional testing.
Beyond additional testing, I encourage the reviewer to consider adding additional unicode escape formats. I attempted to add a generic
0xHH 0xHH
hex notation with space separation, but memory bloated too much and caused a significant performance impact. I think it'd be worth adding support for this case, and I left my work in the changed files, but it's commented out.