Skip to content

Conversation

@erik-krogh
Copy link
Contributor

@erik-krogh erik-krogh commented Aug 26, 2021

I've made a JS+Python query to detect regular expressions that are bad at matching HTML tags.
The query is currently very focused on <script> and <!-- --> tags.


Take as an example the below regexp.

var reg = /<script[^>]>.*</script>/g;

There are multiple problems with this regexp:

  • It doesn't match mixed-case / upper-case <SCRIPT> tags.
  • It doesn't match <script> tags where the body contain newlines.
  • It doesn't match </script > end tags containing superfluous whitespace. (A parse error, but browsers render it anyway).

The more complex cases in the query has to do with capture groups.
Consider the below program.

var reg = /^(?:<!--([\w\W]*?)-->|<([^>]*?)>)$/;

console.log(reg.exec("<!-- foo -->")[1]); // prints " foo "
console.log(reg.exec("<!-- foo --!>")[1]); // prints undefined
console.log(reg.exec("<!-- foo --!>")[2]); // prints "!-- foo --!"

The regular expression matches both <!-- foo --> and <!-- foo --!> (both are "valid" HTML5 comments).
However, it matches the two types of comments with different capture groups, which will confuse parsers that use regular expressions like the above. (This is a root cause for at least 2 XSS CVEs).


The query works by implementing a regular-expression evaluator on top of ReDoSUtil.qll, and using that evaluator to test for potential pitfalls when implementing regular expressions matching HTML tags.
The BadTagFilterQuery.qll file is shared between JS/Python, so there is essentially no language specific implementation for this query.


Copilot helped me write the example functions.
It fell into the same pitfall for both JavaScript and Python.


CVE-2021-33829: TP/TN
CVE-2020-17480: TP/TN

For context, see The Curious Case of Copy & Paste.


JS Evaluation shows a improvement in performance. (The baseline is main + a where none() edition of the query).
The new cached predicates have no noticeable effect on total DB size.

The Python evaluation looks OK.

@erik-krogh erik-krogh changed the title JS/Python;: add a bad-tag-filter query for Python and JavaScript JS/Python: add a bad-tag-filter query for Python and JavaScript Aug 26, 2021
@erik-krogh erik-krogh force-pushed the htmlReg branch 6 times, most recently from 39f5150 to fe10a90 Compare September 2, 2021 11:26
@erik-krogh erik-krogh added the Awaiting evaluation Do not merge yet, this PR is waiting for an evaluation to finish label Sep 2, 2021
@erik-krogh erik-krogh force-pushed the htmlReg branch 3 times, most recently from ac17564 to 232c8b0 Compare September 2, 2021 14:21
@erik-krogh erik-krogh force-pushed the htmlReg branch 6 times, most recently from c2ddeeb to d9f22af Compare September 20, 2021 07:33
@erik-krogh erik-krogh requested a review from a team as a code owner October 26, 2021 13:19
@github-actions github-actions bot added the Ruby label Oct 26, 2021
@erik-krogh erik-krogh changed the title JS/Python: add a bad-tag-filter query for Python and JavaScript JS/Py/Ruby: add a bad-tag-filter query Oct 26, 2021
@erik-krogh
Copy link
Contributor Author

I'm not sure why the Query help preview is failing.

@aibaars
Copy link
Contributor

aibaars commented Oct 26, 2021

I'm not sure why the Query help preview is failing.

The error message suggests some Actions related permissions problem. Perhaps it is because your PR is from a fork.

@erik-krogh
Copy link
Contributor Author

erik-krogh commented Oct 26, 2021

The error message suggests some Actions related permissions problem. Perhaps it is because your PR is from a fork.

If that's the case then it should be fixed.
I always open PRs from forks, and external contributors are forced to do it that way.

It works fine locally, so you guys can ignore the error when reviewing.

@RasmusWL
Copy link
Member

RasmusWL commented Oct 27, 2021

@aibaars I think the workflow needs to run on pull_request_target (docs), like our labeling action does. This does have security concerns, which we need to keep in mind 😊

EDIT: That was apparently already discovered, and a PR was made #6971

@github-actions
Copy link
Contributor

github-actions bot commented Nov 4, 2021

QHelp previews:

javascript/ql/src/Security/CWE-116/BadTagFilter.qhelp

Bad HTML filtering regexp

It is possible to match some single HTML tags using regular expressions (parsing general HTML using regular expressions is impossible). However, if the regular expression is not written well it might be possible to circumvent it, which can lead to cross-site scripting or other security issues.

Some of these mistakes are caused by browsers having very forgiving HTML parsers, and will often render invalid HTML containing syntax errors. Regular expressions that attempt to match HTML should also recognize tags containing such syntax errors.

Recommendation

Use a well-tested sanitization or parser library if at all possible. These libraries are much more likely to handle corner cases correctly than a custom implementation.

Example

The following example attempts to filters out all <script> tags.

function filterScript(html) {
    var scriptRegex = /<script\b[^>]*>([\s\S]*?)<\/script>/gi;
    var match;
    while ((match = scriptRegex.exec(html)) !== null) {
        html = html.replace(match[0], match[1]);
    }
    return html;
}

The above sanitizer does not filter out all <script> tags. Browsers will not only accept </script> as script end tags, but also tags such as </script foo="bar"> even though it is a parser error. This means that an attack string such as <script>alert(1)</script foo="bar"> will not be filtered by the function, and alert(1) will be executed by a browser if the string is rendered as HTML.

Other corner cases include that HTML comments can end with --!>, and that HTML tag names can contain upper case characters.

References

python/ql/src/Security/CWE-116/BadTagFilter.qhelp

Bad HTML filtering regexp

It is possible to match some single HTML tags using regular expressions (parsing general HTML using regular expressions is impossible). However, if the regular expression is not written well it might be possible to circumvent it, which can lead to cross-site scripting or other security issues.

Some of these mistakes are caused by browsers having very forgiving HTML parsers, and will often render invalid HTML containing syntax errors. Regular expressions that attempt to match HTML should also recognize tags containing such syntax errors.

Recommendation

Use a well-tested sanitization or parser library if at all possible. These libraries are much more likely to handle corner cases correctly than a custom implementation.

Example

The following example attempts to filters out all <script> tags.

import re

def filterScriptTags(content): 
    oldContent = ""
    while oldContent != content:
        oldContent = content
        content = re.sub(r'<script.*?>.*?</script>', '', content, flags= re.DOTALL | re.IGNORECASE)
    return content

The above sanitizer does not filter out all <script> tags. Browsers will not only accept </script> as script end tags, but also tags such as </script foo="bar"> even though it is a parser error. This means that an attack string such as <script>alert(1)</script foo="bar"> will not be filtered by the function, and alert(1) will be executed by a browser if the string is rendered as HTML.

Other corner cases include that HTML comments can end with --!>, and that HTML tag names can contain upper case characters.

References

ruby/ql/src/queries/security/cwe-116/BadTagFilter.qhelp

Bad HTML filtering regexp

It is possible to match some single HTML tags using regular expressions (parsing general HTML using regular expressions is impossible). However, if the regular expression is not written well it might be possible to circumvent it, which can lead to cross-site scripting or other security issues.

Some of these mistakes are caused by browsers having very forgiving HTML parsers, and will often render invalid HTML containing syntax errors. Regular expressions that attempt to match HTML should also recognize tags containing such syntax errors.

Recommendation

Use a well-tested sanitization or parser library if at all possible. These libraries are much more likely to handle corner cases correctly than a custom implementation.

Example

The following example attempts to filters out all <script> tags.

def filterScripTags(html)
  oldHtml = "";
  while (html != oldHtml)
    oldHtml = html;
    html = html.gsub(/<script[^>]*>.*<\/script>/m, "");
  end
  return html;
end

The above sanitizer does not filter out all <script> tags. Browsers will not only accept </script> as script end tags, but also tags such as </script foo="bar"> even though it is a parser error. This means that an attack string such as <script>alert(1)</script foo="bar"> will not be filtered by the function, and alert(1) will be executed by a browser if the string is rendered as HTML.

Other corner cases include that HTML comments can end with --!>, and that HTML tag names can contain upper case characters.

References

@erik-krogh
Copy link
Contributor Author

The Ruby QLDoc Checks is failing. But that doesn't appear to have anything to do with the code changed in this PR.

@aibaars
Copy link
Contributor

aibaars commented Nov 4, 2021

The Ruby QLDoc Checks is failing. But that doesn't appear to have anything to do with the code changed in this PR.

The complaints seems to be related to the file that was moved in this PR.

@erik-krogh
Copy link
Contributor Author

The complaints seems to be related to the file that was moved in this PR.

I think we can just ignore that error for this PR, as the files were just moved.
Let me know if you don't think we can ignore these.

Copy link
Contributor

@nickrolfe nickrolfe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Ruby-specific changes look pretty reasonable, but I have a suggestion on making the qhelp example a little more idiomatic.

Co-authored-by: Nick Rolfe <nickrolfe@github.com>
@github-actions
Copy link
Contributor

QHelp previews:

javascript/ql/src/Security/CWE-116/BadTagFilter.qhelp

Bad HTML filtering regexp

It is possible to match some single HTML tags using regular expressions (parsing general HTML using regular expressions is impossible). However, if the regular expression is not written well it might be possible to circumvent it, which can lead to cross-site scripting or other security issues.

Some of these mistakes are caused by browsers having very forgiving HTML parsers, and will often render invalid HTML containing syntax errors. Regular expressions that attempt to match HTML should also recognize tags containing such syntax errors.

Recommendation

Use a well-tested sanitization or parser library if at all possible. These libraries are much more likely to handle corner cases correctly than a custom implementation.

Example

The following example attempts to filters out all <script> tags.

function filterScript(html) {
    var scriptRegex = /<script\b[^>]*>([\s\S]*?)<\/script>/gi;
    var match;
    while ((match = scriptRegex.exec(html)) !== null) {
        html = html.replace(match[0], match[1]);
    }
    return html;
}

The above sanitizer does not filter out all <script> tags. Browsers will not only accept </script> as script end tags, but also tags such as </script foo="bar"> even though it is a parser error. This means that an attack string such as <script>alert(1)</script foo="bar"> will not be filtered by the function, and alert(1) will be executed by a browser if the string is rendered as HTML.

Other corner cases include that HTML comments can end with --!>, and that HTML tag names can contain upper case characters.

References

python/ql/src/Security/CWE-116/BadTagFilter.qhelp

Bad HTML filtering regexp

It is possible to match some single HTML tags using regular expressions (parsing general HTML using regular expressions is impossible). However, if the regular expression is not written well it might be possible to circumvent it, which can lead to cross-site scripting or other security issues.

Some of these mistakes are caused by browsers having very forgiving HTML parsers, and will often render invalid HTML containing syntax errors. Regular expressions that attempt to match HTML should also recognize tags containing such syntax errors.

Recommendation

Use a well-tested sanitization or parser library if at all possible. These libraries are much more likely to handle corner cases correctly than a custom implementation.

Example

The following example attempts to filters out all <script> tags.

import re

def filterScriptTags(content): 
    oldContent = ""
    while oldContent != content:
        oldContent = content
        content = re.sub(r'<script.*?>.*?</script>', '', content, flags= re.DOTALL | re.IGNORECASE)
    return content

The above sanitizer does not filter out all <script> tags. Browsers will not only accept </script> as script end tags, but also tags such as </script foo="bar"> even though it is a parser error. This means that an attack string such as <script>alert(1)</script foo="bar"> will not be filtered by the function, and alert(1) will be executed by a browser if the string is rendered as HTML.

Other corner cases include that HTML comments can end with --!>, and that HTML tag names can contain upper case characters.

References

ruby/ql/src/queries/security/cwe-116/BadTagFilter.qhelp

Bad HTML filtering regexp

It is possible to match some single HTML tags using regular expressions (parsing general HTML using regular expressions is impossible). However, if the regular expression is not written well it might be possible to circumvent it, which can lead to cross-site scripting or other security issues.

Some of these mistakes are caused by browsers having very forgiving HTML parsers, and will often render invalid HTML containing syntax errors. Regular expressions that attempt to match HTML should also recognize tags containing such syntax errors.

Recommendation

Use a well-tested sanitization or parser library if at all possible. These libraries are much more likely to handle corner cases correctly than a custom implementation.

Example

The following example attempts to filters out all <script> tags.

def filter_script_tags(html)
  old_html = ""
  while (html != old_html)
    old_html = html
    html = html.gsub(/<script[^>]*>.*<\/script>/m, "")
  end
  html
end

The above sanitizer does not filter out all <script> tags. Browsers will not only accept </script> as script end tags, but also tags such as </script foo="bar"> even though it is a parser error. This means that an attack string such as <script>alert(1)</script foo="bar"> will not be filtered by the function, and alert(1) will be executed by a browser if the string is rendered as HTML.

Other corner cases include that HTML comments can end with --!>, and that HTML tag names can contain upper case characters.

References

Copy link
Member

@RasmusWL RasmusWL left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Python 👍 (based on old reviews saying all LGTM, and no Python related changes)

Copy link
Contributor

@yoff yoff left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still LGTM :-)

/**
* A word boundary, that is, a regular expression term of the form `\b`.
*/
class RegExpWordBoundary extends RegExpEscape {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not convinced \b should be parsed as RegExpEscape, because it can result in ReDoS query FPs that say "...strings starting with 'b'...", but I can handle that in #7120.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see.
Yes, that causes FPs in Ruby/Python (I checked).

Nice if you can handle it in #7120.

@erik-krogh erik-krogh requested a review from asgerf November 16, 2021 22:54
@erik-krogh erik-krogh merged commit 1cca377 into github:main Nov 18, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants