JS/Py/Ruby: add a bad-tag-filter query #6561

erik-krogh · 2021-08-26T17:38:09Z

I've made a JS+Python query to detect regular expressions that are bad at matching HTML tags.
The query is currently very focused on <script> and HTML comment (<!-- ... -->) tags.

Take as an example the below regexp.

var reg = /<script[^>]*>.*<\/script>/g;

There are multiple problems with this regexp:

It doesn't match mixed-case / upper-case <SCRIPT> tags.
It doesn't match <script> tags where the body contains newlines.
It doesn't match </script > end tags containing superfluous whitespace. (A parse error, but browsers render it anyway).
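
The three misses can be checked by running the pattern against one sample per case. The sketch below is a Python transcription of the regexp above, written as `<script[^>]*>.*</script>` (an assumption made for illustration: the `g` flag is dropped because it does not affect whether a match exists). The sample strings are made up for this example.

```python
import re

# Python transcription of the JavaScript regexp (assumed equivalent for
# these inputs; no flags, so it is case-sensitive and "." skips newlines).
naive = re.compile(r"<script[^>]*>.*</script>")

# Illustrative inputs, one per problem listed above.
samples = {
    "plain lowercase": "<script>alert(1)</script>",
    "upper-case tag": "<SCRIPT>alert(1)</SCRIPT>",
    "newline in body": "<script>\nalert(1)\n</script>",
    "whitespace in end tag": "<script>alert(1)</script >",
}

results = {name: naive.search(html) is not None for name, html in samples.items()}
for name, matched in results.items():
    print(f"{name}: {'matched' if matched else 'missed'}")
```

Only the plain lowercase sample is matched; the other three slip past the pattern.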

The more complex cases in the query have to do with capture groups.
Consider the below program.

var reg = /^(?:<!--([\w\W]*?)-->|<([^>]*?)>)$/;

console.log(reg.exec("<!-- foo -->")[1]); // prints " foo "
console.log(reg.exec("<!-- foo --!>")[1]); // prints undefined
console.log(reg.exec("<!-- foo --!>")[2]); // prints "!-- foo --!"

The regular expression matches both <!-- foo --> and <!-- foo --!> (both are "valid" HTML5 comments).
However, it matches the two types of comments with different capture groups, which will confuse parsers that use regular expressions like the above. (This is a root cause for at least 2 XSS CVEs).
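
To illustrate how this confuses a consumer, here is a minimal Python sketch using the same pattern. The `classify` function is hypothetical: it stands in for any parser that assumes group 1 means "comment" and group 2 means "tag".

```python
import re

# Same pattern as the JavaScript example, transcribed to Python.
reg = re.compile(r"^(?:<!--([\w\W]*?)-->|<([^>]*?)>)$")

def classify(token):
    """Hypothetical consumer: group 1 is comment text, group 2 is a tag."""
    m = reg.match(token)
    if m is None:
        return ("text", token)
    if m.group(1) is not None:
        return ("comment", m.group(1))
    return ("tag", m.group(2))

print(classify("<!-- foo -->"))   # handled as a comment
print(classify("<!-- foo --!>"))  # also a comment to a browser, but handled as a tag
```

The second call returns `("tag", "!-- foo --!")`, so comment content ends up on the code path meant for tags.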

The query works by implementing a regular-expression evaluator on top of ReDoSUtil.qll, and using that evaluator to test for potential pitfalls when implementing regular expressions matching HTML tags.
The BadTagFilterQuery.qll file is shared between JS/Python, so there is essentially no language-specific implementation for this query.
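
The query itself is written in QL and reasons about the regexp's states rather than running it. As a rough dynamic analogue only (not the query's implementation), the same pitfalls can be probed by running a candidate pattern against a few hand-picked strings. The function name `missed_variants` and the probe list are hypothetical.

```python
import re

# Hypothetical probe list: spellings of a script element that browsers accept.
PROBES = {
    "upper-case tag name": "<SCRIPT>x</SCRIPT>",
    "newline in body": "<script>\nx\n</script>",
    "whitespace in end tag": "<script>x</script >",
    "attribute in end tag": '<script>x</script foo="bar">',
}

def missed_variants(pattern, flags=0):
    """Return the probe names that `pattern` fails to match.

    Returns None when the pattern does not even match a plain
    <script>x</script>, i.e. it is probably not a script-tag filter.
    """
    regex = re.compile(pattern, flags)
    if regex.search("<script>x</script>") is None:
        return None
    return [name for name, html in PROBES.items() if regex.search(html) is None]

print(missed_variants(r"<script.*?>.*?</script>", re.DOTALL | re.IGNORECASE))
```

With the flags shown, only the two end-tag probes are reported as missed. A static analysis has the advantage of not needing a probe list, which is why this is only an analogue.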

Copilot helped me write the example functions.
It fell into the same pitfall for both JavaScript and Python.

CVE-2021-33829: TP/TN
CVE-2020-17480: TP/TN

For context, see The Curious Case of Copy & Paste.

JS Evaluation shows an improvement in performance. (The baseline is main + a where none() edition of the query).
The new cached predicates have no noticeable effect on total DB size.

The Python evaluation looks OK.

erik-krogh · 2021-10-26T13:57:49Z

I'm not sure why the Query help preview is failing.

aibaars · 2021-10-26T16:04:42Z

> I'm not sure why the Query help preview is failing.

The error message suggests some Actions related permissions problem. Perhaps it is because your PR is from a fork.

erik-krogh · 2021-10-26T21:38:34Z

> The error message suggests some Actions related permissions problem. Perhaps it is because your PR is from a fork.

If that's the case then it should be fixed.
I always open PRs from forks, and external contributors are forced to do it that way.

It works fine locally, so you guys can ignore the error when reviewing.

RasmusWL · 2021-10-27T09:22:30Z

@aibaars I think the workflow needs to run on pull_request_target (docs), like our labeling action does. This does have security concerns, which we need to keep in mind 😊

EDIT: That was apparently already discovered, and a PR was made #6971

github-actions · 2021-11-04T11:59:39Z

QHelp previews:

javascript/ql/src/Security/CWE-116/BadTagFilter.qhelp

Bad HTML filtering regexp

It is possible to match some single HTML tags using regular expressions (parsing general HTML using regular expressions is impossible). However, if the regular expression is not written well it might be possible to circumvent it, which can lead to cross-site scripting or other security issues.

Some of these mistakes are caused by browsers having very forgiving HTML parsers, which will often render invalid HTML containing syntax errors. Regular expressions that attempt to match HTML should also recognize tags containing such syntax errors.

Recommendation

Use a well-tested sanitization or parser library if at all possible. These libraries are much more likely to handle corner cases correctly than a custom implementation.

Example

The following example attempts to filter out all <script> tags.

function filterScript(html) {
    var scriptRegex = /<script\b[^>]*>([\s\S]*?)<\/script>/gi;
    var match;
    while ((match = scriptRegex.exec(html)) !== null) {
        html = html.replace(match[0], match[1]);
    }
    return html;
}

The above sanitizer does not filter out all <script> tags. Browsers accept not only </script> as a script end tag, but also tags such as </script foo="bar">, even though this is a parse error. This means that an attack string such as <script>alert(1)</script foo="bar"> will not be filtered by the function, and alert(1) will be executed by a browser if the string is rendered as HTML.

Other corner cases include that HTML comments can end with --!>, and that HTML tag names can contain upper case characters.

References

Securitum: The Curious Case of Copy & Paste.
stackoverflow.com: You can't parse [X]HTML with regex.
HTML Standard: Comment end bang state.
stackoverflow.com: Why aren't browsers strict about HTML?.
Common Weakness Enumeration: CWE-116.
Common Weakness Enumeration: CWE-20.

python/ql/src/Security/CWE-116/BadTagFilter.qhelp

Bad HTML filtering regexp

It is possible to match some single HTML tags using regular expressions (parsing general HTML using regular expressions is impossible). However, if the regular expression is not written well it might be possible to circumvent it, which can lead to cross-site scripting or other security issues.

Some of these mistakes are caused by browsers having very forgiving HTML parsers, which will often render invalid HTML containing syntax errors. Regular expressions that attempt to match HTML should also recognize tags containing such syntax errors.

Recommendation

Use a well-tested sanitization or parser library if at all possible. These libraries are much more likely to handle corner cases correctly than a custom implementation.
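
When none of the input needs to survive as markup, escaping it avoids tag matching altogether. This is a different technique from filtering, shown here only as a minimal sketch with Python's standard `html.escape`. It is not a substitute for a sanitizer library when some HTML must be allowed through.

```python
import html

# Attack string taken from the explanation in this query help.
attack = '<script>alert(1)</script foo="bar">'

# html.escape replaces &, <, > and (by default) quotes with entities,
# so the browser renders the text instead of parsing it as a tag.
escaped = html.escape(attack)
print(escaped)
```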

Example

The following example attempts to filter out all <script> tags.

import re

def filterScriptTags(content):
    oldContent = ""
    while oldContent != content:
        oldContent = content
        content = re.sub(r'<script.*?>.*?</script>', '', content, flags=re.DOTALL | re.IGNORECASE)
    return content

The above sanitizer does not filter out all <script> tags. Browsers accept not only </script> as a script end tag, but also tags such as </script foo="bar">, even though this is a parse error. This means that an attack string such as <script>alert(1)</script foo="bar"> will not be filtered by the function, and alert(1) will be executed by a browser if the string is rendered as HTML.
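
The bypass can be reproduced directly. The sketch below repeats the example function unchanged and runs it on a benign input (made up for this illustration) and on the attack string from the text.

```python
import re

def filterScriptTags(content):
    # Same function as in the example above, repeated so the snippet runs on its own.
    oldContent = ""
    while oldContent != content:
        oldContent = content
        content = re.sub(r'<script.*?>.*?</script>', '', content, flags=re.DOTALL | re.IGNORECASE)
    return content

benign = '<p>hi</p><script>alert(1)</script>'
attack = '<script>alert(1)</script foo="bar">'

print(filterScriptTags(benign))  # the script element is removed
print(filterScriptTags(attack))  # the attack string comes back unchanged
```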

Other corner cases include that HTML comments can end with --!>, and that HTML tag names can contain upper case characters.

References

Securitum: The Curious Case of Copy & Paste.
stackoverflow.com: You can't parse [X]HTML with regex.
HTML Standard: Comment end bang state.
stackoverflow.com: Why aren't browsers strict about HTML?.
Common Weakness Enumeration: CWE-116.
Common Weakness Enumeration: CWE-20.

ruby/ql/src/queries/security/cwe-116/BadTagFilter.qhelp

Bad HTML filtering regexp

It is possible to match some single HTML tags using regular expressions (parsing general HTML using regular expressions is impossible). However, if the regular expression is not written well it might be possible to circumvent it, which can lead to cross-site scripting or other security issues.

Some of these mistakes are caused by browsers having very forgiving HTML parsers, which will often render invalid HTML containing syntax errors. Regular expressions that attempt to match HTML should also recognize tags containing such syntax errors.

Recommendation

Use a well-tested sanitization or parser library if at all possible. These libraries are much more likely to handle corner cases correctly than a custom implementation.

Example

The following example attempts to filter out all <script> tags.

def filterScriptTags(html)
  oldHtml = "";
  while (html != oldHtml)
    oldHtml = html;
    html = html.gsub(/<script[^>]*>.*<\/script>/m, "");
  end
  return html;
end

The above sanitizer does not filter out all <script> tags. Browsers accept not only </script> as a script end tag, but also tags such as </script foo="bar">, even though this is a parse error. This means that an attack string such as <script>alert(1)</script foo="bar"> will not be filtered by the function, and alert(1) will be executed by a browser if the string is rendered as HTML.

Other corner cases include that HTML comments can end with --!>, and that HTML tag names can contain upper case characters.

References

Securitum: The Curious Case of Copy & Paste.
stackoverflow.com: You can't parse [X]HTML with regex.
HTML Standard: Comment end bang state.
stackoverflow.com: Why aren't browsers strict about HTML?.
Common Weakness Enumeration: CWE-116.
Common Weakness Enumeration: CWE-20.

erik-krogh · 2021-11-04T12:21:20Z

The Ruby QLDoc Checks is failing. But that doesn't appear to have anything to do with the code changed in this PR.

aibaars · 2021-11-04T14:37:22Z

> The Ruby QLDoc Checks is failing. But that doesn't appear to have anything to do with the code changed in this PR.

The complaints seem to be related to the file that was moved in this PR.

erik-krogh · 2021-11-08T09:05:41Z

> The complaints seem to be related to the file that was moved in this PR.

I think we can just ignore that error for this PR, as the files were just moved.
Let me know if you don't think we can ignore these.

nickrolfe

The Ruby-specific changes look pretty reasonable, but I have a suggestion on making the qhelp example a little more idiomatic.

ruby/ql/src/queries/security/cwe-116/examples/BadTagFilter.rb

Co-authored-by: Nick Rolfe <nickrolfe@github.com>

github-actions · 2021-11-11T13:05:34Z

QHelp previews:

ruby/ql/src/queries/security/cwe-116/BadTagFilter.qhelp

Bad HTML filtering regexp

It is possible to match some single HTML tags using regular expressions (parsing general HTML using regular expressions is impossible). However, if the regular expression is not written well it might be possible to circumvent it, which can lead to cross-site scripting or other security issues.

Some of these mistakes are caused by browsers having very forgiving HTML parsers, which will often render invalid HTML containing syntax errors. Regular expressions that attempt to match HTML should also recognize tags containing such syntax errors.

Recommendation

Use a well-tested sanitization or parser library if at all possible. These libraries are much more likely to handle corner cases correctly than a custom implementation.

Example

The following example attempts to filter out all <script> tags.

def filter_script_tags(html)
  old_html = ""
  while (html != old_html)
    old_html = html
    html = html.gsub(/<script[^>]*>.*<\/script>/m, "")
  end
  html
end

The above sanitizer does not filter out all <script> tags. Browsers accept not only </script> as a script end tag, but also tags such as </script foo="bar">, even though this is a parse error. This means that an attack string such as <script>alert(1)</script foo="bar"> will not be filtered by the function, and alert(1) will be executed by a browser if the string is rendered as HTML.

Other corner cases include that HTML comments can end with --!>, and that HTML tag names can contain upper case characters.

References

Securitum: The Curious Case of Copy & Paste.
stackoverflow.com: You can't parse [X]HTML with regex.
HTML Standard: Comment end bang state.
stackoverflow.com: Why aren't browsers strict about HTML?.
Common Weakness Enumeration: CWE-116.
Common Weakness Enumeration: CWE-20.

RasmusWL

Python 👍 (based on old reviews saying all LGTM, and no Python related changes)

yoff

Still LGTM :-)

nickrolfe · 2021-11-16T14:06:44Z

python/ql/lib/semmle/python/RegexTreeView.qll

+/**
+ * A word boundary, that is, a regular expression term of the form `\b`.
+ */
+class RegExpWordBoundary extends RegExpEscape {


I'm not convinced \b should be parsed as RegExpEscape, because it can result in ReDoS query FPs that say "...strings starting with 'b'...", but I can handle that in #7120.

erik-krogh · 2021-11-18

I see.
Yes, that causes FPs in Ruby/Python (I checked).

Nice if you can handle it in #7120.

github-actions bot added documentation JS Python labels Aug 26, 2021

erik-krogh changed the title ~~JS/Python;: add a bad-tag-filter query for Python and JavaScript~~ JS/Python: add a bad-tag-filter query for Python and JavaScript Aug 26, 2021

erik-krogh force-pushed the htmlReg branch 6 times, most recently from 39f5150 to fe10a90 September 2, 2021 11:26

erik-krogh added the Awaiting evaluation label Sep 2, 2021

erik-krogh force-pushed the htmlReg branch 3 times, most recently from ac17564 to 232c8b0 September 2, 2021 14:21

erik-krogh force-pushed the htmlReg branch 6 times, most recently from c2ddeeb to d9f22af September 20, 2021 07:33

erik-krogh force-pushed the htmlReg branch from 9b7d583 to c1ba6ac September 21, 2021 08:24

erik-krogh added 6 commits September 21, 2021 12:13

use toUnicode in RegexTreeView

8535e6f

implement RegExpWordBoundary in RegexTreeView

01e345c

cache isInterpretedAsRegExp

6099321

cache TopLevel::isMinified

672e4a3

make isStartState public in ReDoSUtil

c40ffab

don't give group numbers to non-capturing groups

fd64ff9

erik-krogh force-pushed the htmlReg branch from c1ba6ac to 67d62fb September 21, 2021 10:16

add a bad-tag-filter query for Python and JavaScript

99ed4a1

erik-krogh force-pushed the htmlReg branch from 67d62fb to 99ed4a1 September 21, 2021 13:04

erik-krogh added 3 commits October 26, 2021 14:46

Merge branch 'main' of github.com:github/codeql into htmlReg

44afa34

move ruby files to match file structure from js/py

2ddf445

update ReDoSUtil in ruby

c15ddf6

erik-krogh dismissed yoff’s stale review via 55e1d5a October 26, 2021 13:18

erik-krogh requested a review from a team as a code owner October 26, 2021 13:19

github-actions bot added the Ruby label Oct 26, 2021

erik-krogh changed the title ~~JS/Python: add a bad-tag-filter query for Python and JavaScript~~ JS/Py/Ruby: add a bad-tag-filter query Oct 26, 2021

erik-krogh added 2 commits October 26, 2021 15:25

add the bad tag filter query to ruby

97264b5

make the RegExpEscape::getUnescaped predicate public in python

62e7295

erik-krogh force-pushed the htmlReg branch from 0ab13a2 to 62e7295 October 26, 2021 13:25

fix imports

8a4b043

Merge branch 'main' into htmlReg

02f500b

nickrolfe reviewed Nov 11, 2021

ruby/ql/src/queries/security/cwe-116/examples/BadTagFilter.rb

update ruby example

b639a8d

Co-authored-by: Nick Rolfe <nickrolfe@github.com>

RasmusWL approved these changes Nov 16, 2021

nickrolfe mentioned this pull request Nov 16, 2021

Ruby/Python: parse anchors in regexes as special characters #7120

Merged

yoff approved these changes Nov 16, 2021

nickrolfe approved these changes Nov 16, 2021

erik-krogh requested a review from asgerf November 16, 2021 22:54

asgerf approved these changes Nov 18, 2021

erik-krogh merged commit 1cca377 into github:main Nov 18, 2021
