Why Can't You Parse HTML with Regular Expressions? (And When You Actually Can)

Problem

I wanted to extract all opening HTML tags from a document. I tried writing a regex pattern:

Simple regex pattern
<([a-z]+) *[^/]*?>

But when I tested it on various HTML snippets, it failed on edge cases. Then I searched online and found the famous Stack Overflow answer with 4.1 million views, warning that “HTML is not a regular language and hence cannot be parsed by regular expressions.”

So, is regex completely useless for HTML? Or are there cases where it’s acceptable?

Environment

  • Python 3.x (for BeautifulSoup examples)
  • JavaScript (for DOMParser examples)
  • The question from Stack Overflow: matching opening HTML tags like <p> and <a href="foo"> while excluding self-closing tags like <br /> and <hr class="foo" />

What Happened?

I was scraping a website. I thought, “HTML is just text with tags. Regex is great at pattern matching. Why not use regex?”

I wrote a simple pattern and tested it on a few cases. It worked! Then I ran it on real-world HTML, and it broke on:

  • Attributes containing / like <img src="/path/to/image.png">
  • Self-closing tags without space: <br/> vs <br />
  • Nested quotes: <a onclick="alert('hello')">
  • HTML comments: <!-- <p>commented out</p> -->
  • Script tags: <script>var x = "<p>";</script>
  • CDATA sections: <![CDATA[<p>literal text</p>]]>

Each time I fixed one edge case, another appeared.
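
Some of these failures are easy to reproduce. The following is a minimal sketch in Python that runs the pattern from the Problem section over a few of the snippets listed above. The helper name `naive_tags` and the sample strings are my own, chosen only for illustration.

```python
import re

# The naive pattern from the Problem section.
NAIVE = re.compile(r'<([a-z]+) *[^/]*?>')

def naive_tags(html):
    """Return the tag names the naive pattern claims to find."""
    return NAIVE.findall(html)

samples = [
    '<p>ok</p>',
    '<img src="/path/to/image.png">',
    '<!-- <p>commented out</p> -->',
    '<script>var x = "<p>";</script>',
]
for sample in samples:
    print(f'{sample!r:45} -> {naive_tags(sample)}')
```

The first sample works as intended. The `<img>` tag is missed entirely because `[^/]` refuses the slash inside the attribute value, and both the comment and the script body produce a phantom `p` that is not a real element.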

The Technical Reason

The core issue is grammar complexity. HTML belongs to a more powerful class of languages than regex can handle.

Here’s the Chomsky hierarchy that formal language theory defines:

Chomsky Hierarchy
┌───────────────────────────────────────────────────┐
│ Type 0: Recursively Enumerable                    │
│ ┌───────────────────────────────────────────────┐ │
│ │ Type 1: Context-Sensitive                     │ │
│ │ ┌───────────────────────────────────────────┐ │ │
│ │ │ Type 2: Context-Free                      │ │ │
│ │ │ ┌───────────────────────────────────────┐ │ │ │
│ │ │ │ Type 3: Regular (Regex)               │ │ │ │
│ │ │ │                                       │ │ │ │
│ │ │ │ • Finite automata                     │ │ │ │
│ │ │ │ • No recursion/nesting                │ │ │ │
│ │ │ └───────────────────────────────────────┘ │ │ │
│ │ │ • HTML, XML, JSON, programming languages  │ │ │
│ │ │ • Can handle nested structures            │ │ │
│ │ └───────────────────────────────────────────┘ │ │
│ │ • Context-dependent rules                     │ │
│ └───────────────────────────────────────────────┘ │
│ • Most powerful, least restricted                 │
└───────────────────────────────────────────────────┘

Regex is Type 3 (Regular). HTML is Type 2 (Context-Free). The key difference: nested structures.

HTML allows tags inside tags:

Nested HTML structure
<div>
<div>
<div>Content</div>
</div>
</div>

Regular expressions cannot count arbitrary nesting depth. They have no memory of “I opened 3 divs, so I need to close 3 divs.”
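
To make the missing memory concrete, here is a minimal sketch in Python. It first shows a non-greedy regex cutting a nested element short, then tracks depth with a counter, which is exactly the unbounded state a finite automaton does not have. The function name `max_div_depth` is my own, and it only understands bare `<div>` and `</div>`; it is an illustration, not a parser.

```python
import re

nested = '<div><div><div>Content</div></div></div>'

# A non-greedy regex pairs the first <div> with the FIRST </div> it meets,
# so the "element" it returns stops in the middle of the nesting.
first = re.search(r'<div>.*?</div>', nested).group()
print(first)  # <div><div><div>Content</div>

def max_div_depth(html):
    """Return the deepest <div> nesting level, or None if unbalanced."""
    depth = deepest = 0
    # The regex is only a tokenizer here; the counter does the remembering.
    for token in re.findall(r'</?div>', html):
        if token == '<div>':
            depth += 1
            deepest = max(deepest, depth)
        else:
            depth -= 1
            if depth < 0:  # a close with no matching open
                return None
    return deepest if depth == 0 else None

print(max_div_depth(nested))               # 3
print(max_div_depth('<div></div></div>'))  # None
```

Note the division of labour: the regex is perfectly good at recognising individual tokens, and the counter supplies the state needed to pair them up.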

When Regex Can Work

Despite the theoretical limitation, regex can be practical in specific scenarios:

  1. One-time scraping with known structure - you control the input and the format is consistent
  2. Extracting specific patterns - URLs, emails, or simple tags with a predictable format
  3. Pre-processing before a parser - for example, stripping comments or CDATA first

Someone on Stack Overflow shared a real success story:

“I once had to pull some data off ~10k pages, all with the same HTML template. They were littered with HTML errors that caused parsers to choke… After fighting all day with the ‘right’ approach, I finally switched to a regex solution and had it working in an hour.”

The key insight: known, limited, controlled HTML makes regex viable.
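
As a rough sketch of that situation in Python: the markup below is a hypothetical template that I made up, and the pattern assumes every page uses exactly this structure, attribute order, and quoting. Under that assumption a plain regex is enough.

```python
import re

# Hypothetical fragment: every page is assumed to use exactly this markup.
page = '''
<tr><td class="name">Widget</td><td class="price">12.50</td></tr>
<tr><td class="name">Gadget</td><td class="price">7.25</td></tr>
'''

# Tied to the template: fixed attribute order, double quotes, no nested tags.
ROW = re.compile(
    r'<td class="name">([^<]*)</td><td class="price">([0-9.]+)</td>'
)

rows = [(name, float(price)) for name, price in ROW.findall(page)]
print(rows)  # [('Widget', 12.5), ('Gadget', 7.25)]
```

The moment the template changes (an extra attribute, single quotes, a `<span>` inside a cell), the pattern silently returns fewer rows, so it is worth checking the row count against what you expect.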

How to Solve It Properly

For robust HTML parsing, use a dedicated parser.

Python with BeautifulSoup:

parse_html.py
from bs4 import BeautifulSoup

html = '<p>Text</p><br /><a href="foo">Link</a>'
soup = BeautifulSoup(html, 'html.parser')

# Get only non-self-closing opening tags
# (is_empty_element is True for void tags such as <br />)
for tag in soup.find_all():
    if not tag.is_empty_element:
        print(tag.name)  # Prints p, then a

JavaScript with DOMParser:

parse_html.js
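const parser = new DOMParser();
const html = '<p>Text</p><br /><a href="foo">Link</a>';
const doc = parser.parseFromString(html, 'text/html');

// Void elements never take a closing tag (list from the HTML spec)
const VOID_ELEMENTS = new Set([
  'area', 'base', 'br', 'col', 'embed', 'hr', 'img', 'input',
  'link', 'meta', 'source', 'track', 'wbr',
]);

// Search inside <body> so the implied <html>, <head>, and <body> are skipped
doc.body.querySelectorAll('*').forEach(el => {
  const name = el.tagName.toLowerCase();
  if (!VOID_ELEMENTS.has(name)) {
    console.log(name); // Logs p, then a
  }
});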

Both handle edge cases automatically: malformed HTML, different quoting styles, comments, CDATA, script content.
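
If installing a third-party package is not an option, Python's standard library includes `html.parser.HTMLParser`, which is also what the `'html.parser'` argument to BeautifulSoup selects. Here is a minimal sketch; the class name `OpeningTagCollector` and the helper `opening_tags` are my own, and the input combines the earlier example with two of the edge cases from above.

```python
from html.parser import HTMLParser

class OpeningTagCollector(HTMLParser):
    """Collect names of opening tags, skipping XHTML-style <tag /> forms."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_startendtag(self, tag, attrs):
        # Called for <br />, <hr class="foo" /> and similar; ignore them.
        pass

def opening_tags(html):
    collector = OpeningTagCollector()
    collector.feed(html)
    collector.close()
    return collector.tags

html = (
    '<p>Text</p><br /><a href="/foo">Link</a>'
    '<!-- <p>commented out</p> -->'
    '<script>var x = "<p>";</script>'
)
print(opening_tags(html))  # ['p', 'a', 'script']
```

The `<p>` inside the comment and the one inside the script string are not reported, because the parser knows it is not in ordinary markup at that point. A bare `<br>` without the slash would still reach `handle_starttag`, so combine this with a list of void element names if that distinction matters.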

When to Use Regex Anyway

If you decide regex fits your use case, here’s a more robust pattern:

HTML tag matcher
<(?:"[^"]*"['"]*|'[^']*'['"]*|[^'">])+>

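This handles quoted attributes, including a > inside a quoted value. To exclude self-closing tags, add a negative lookbehind. The lookbehind below is variable-width, which JavaScript (ES2018 and later) and .NET accept, but Python's built-in re module rejects it: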

Exclude self-closing tags
<(?:"[^"]*"['"]*|'[^']*'['"]*|[^'">])+(?<!/\s*)>

But remember: both patterns also match closing tags such as </p>, and they still fail on CDATA, comments, and script/style content. Test thoroughly.
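
Python's built-in `re` module only supports fixed-width lookbehind, so the second pattern does not compile there. One workaround, sketched below, is to keep the first pattern and do the filtering in ordinary code. The helper name `opening_tags` is my own, and the same limitations around comments, CDATA, and script/style content still apply.

```python
import re

# First pattern from this section, unchanged.
TAG = re.compile(r'''<(?:"[^"]*"['"]*|'[^']*'['"]*|[^'">])+>''')

def opening_tags(html):
    """Return full tag strings that are neither closing nor self-closing."""
    result = []
    for match in TAG.finditer(html):
        tag = match.group()
        if tag.startswith('</'):             # closing tag such as </p>
            continue
        if tag[:-1].rstrip().endswith('/'):  # <br/>, <br />, <hr class="foo" />
            continue
        result.append(tag)
    return result

print(opening_tags('<p>Text</p><br /><br/><a href="/foo">Link</a>'))
# ['<p>', '<a href="/foo">']
```

Doing the exclusion in code rather than inside the pattern also keeps the regex readable, and it makes each rule easy to test on its own.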

Summary

In this post, I explained why regex cannot reliably parse arbitrary HTML. The key point is that HTML's nested structure needs at least a context-free grammar, while regular expressions describe only regular languages and have no memory for nesting depth.

For robust parsing, use BeautifulSoup (Python) or DOMParser (JavaScript). For quick one-off scraping of known HTML, regex can be the pragmatic choice, but test edge cases carefully.

The famous Stack Overflow warning is technically correct for general HTML parsing. Just remember: pragmatism sometimes wins over theoretical purity when you control the input.
