Why Can't You Parse HTML with Regular Expressions? (And When You Actually Can)

May 12, 2026

Problem

I wanted to extract all opening HTML tags from a document. I tried writing a regex pattern:

<([a-z]+) *[^/]*?>

But when I tested it on various HTML snippets, it failed on edge cases. Then I searched online and found the famous Stack Overflow answer with 4.1 million views, warning that “HTML is not a regular language and hence cannot be parsed by regular expressions.”

So, is regex completely useless for HTML? Or are there cases where it’s acceptable?

Environment

Python 3.x (for BeautifulSoup examples)
JavaScript (for DOMParser examples)
The question from Stack Overflow: matching opening HTML tags like <p> and <a href="foo"> while excluding self-closing tags like <br /> and <hr class="foo" />

What Happened?

I was scraping a website. I thought, “HTML is just text with tags. Regex is great at pattern matching. Why not use regex?”

I wrote a simple pattern and tested it on a few cases. It worked! Then I ran it on real-world HTML, and it broke on:

Attributes containing / like <img src="/path/to/image.png">
Self-closing tags without space: <br/> vs <br />
Nested quotes: <a onclick="alert('hello')">
HTML comments: <!-- <p>commented out</p> -->
Script tags: <script>var x = "<p>";</script>
CDATA sections: <![CDATA[<p>literal text</p>]]>

Each time I fixed one edge case, another appeared.
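These failures are easy to reproduce. The snippet below is a minimal sketch that runs my original pattern against a few of the cases; the sample strings (including the comment example) and the name NAIVE_TAG are made up for illustration.

```python
import re

# The pattern from the Problem section, unchanged.
NAIVE_TAG = re.compile(r"<([a-z]+) *[^/]*?>")

# Illustrative inputs (assumed, not taken from a real page).
samples = [
    '<a href="foo">',                    # plain opening tag
    '<img src="/path/to/image.png">',    # "/" inside an attribute value
    '<script>var x = "<p>";</script>',   # tag-like text inside a script
    '<!-- <p>hidden</p> -->',            # tag-like text inside a comment
]

for s in samples:
    # findall returns the captured tag names found in each sample
    print(f"{s!r:40} -> {NAIVE_TAG.findall(s)}")
```

The img tag is missed entirely because [^/] refuses the slash in the path, while the script and comment samples each report a p tag that is not really part of the document.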

The Technical Reason

The core issue is grammar complexity. Describing HTML requires a more powerful class of grammar than regular expressions provide.

Here’s the Chomsky hierarchy that formal language theory defines:

┌─────────────────────────────────────────────────┐
│           Type 0: Recursively Enumerable        │
│  ┌───────────────────────────────────────────┐  │
│  │           Type 1: Context-Sensitive       │  │
│  │  ┌─────────────────────────────────────┐  │  │
│  │  │       Type 2: Context-Free          │  │  │
│  │  │  ┌───────────────────────────────┐  │  │  │
│  │  │  │  Type 3: Regular (Regex)      │  │  │  │
│  │  │  │                               │  │  │  │
│  │  │  │  • Finite automata            │  │  │  │
│  │  │  │  • No recursion/nesting       │  │  │  │
│  │  │  └───────────────────────────────┘  │  │  │
│  │  │  • HTML, XML, JSON, programming     │  │  │
│  │  │    languages                        │  │  │
│  │  │  • Can handle nested structures     │  │  │
│  │  └─────────────────────────────────────┘  │  │
│  │  • Context-dependent rules                │  │
│  └───────────────────────────────────────────┘  │
│  • Most powerful, least restricted              │
└─────────────────────────────────────────────────┘

Classic regular expressions are Type 3 (Regular). HTML's tag nesting needs at least Type 2 (Context-Free) power. The key difference: nested structures.

HTML allows tags inside tags:

<div>
  <div>
    <div>Content</div>
  </div>
</div>

Regular expressions cannot count arbitrary nesting depth. They have no memory of “I opened 3 divs, so I need to close 3 divs.”
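To make the "no memory" point concrete, here is a small sketch. It assumes a toy input that contains only bare <div> tags, and the helper name first_balanced_div is hypothetical. The regex is used only to find individual tags; an explicit counter does the job the pattern itself cannot.

```python
import re

SAMPLE = "<div>a<div>b</div>c</div><div>d</div>"

# A lazy pattern stops at the first </div>, cutting the outer div short.
lazy = re.search(r"<div>.*?</div>", SAMPLE).group(0)

# A greedy pattern runs to the last </div>, swallowing the sibling div.
greedy = re.search(r"<div>.*</div>", SAMPLE).group(0)


def first_balanced_div(html):
    """Return the first <div>...</div> span with balanced nesting, or None."""
    depth = 0
    start = None
    # The regex only tokenizes; the depth counter supplies the missing memory.
    for m in re.finditer(r"<(/?)div>", html):
        if m.group(1) == "":      # opening tag
            if depth == 0:
                start = m.start()
            depth += 1
        elif depth > 0:           # closing tag that has a matching opener
            depth -= 1
            if depth == 0:
                return html[start:m.end()]
    return None


print(lazy)                        # too short
print(greedy)                      # too long
print(first_balanced_div(SAMPLE))  # the outer div and nothing more
```

Neither quantifier gets the outer div right, because neither can track how many divs are still open. The counter is essentially a one-symbol stack, which is the extra machinery a context-free parser has and a finite automaton lacks.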

When Regex Can Work

Despite the theoretical limitation, regex can be practical in specific scenarios:

One-time scraping with known structure - You control the input, format is consistent
Extracting specific patterns - URLs, emails, simple tags with predictable format
Pre-processing before a parser - Strip comments or CDATA first

Someone on Stack Overflow shared a real success story:

“I once had to pull some data off ~10k pages, all with the same HTML template. They were littered with HTML errors that caused parsers to choke… After fighting all day with the ‘right’ approach, I finally switched to a regex solution and had it working in an hour.”

The key insight: known, limited, controlled HTML makes regex viable.
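As an illustration of that "known template" case, here is a sketch under strong assumptions: the markup, the class names, and the ROW pattern are all invented, and the approach only holds while every page follows exactly this shape.

```python
import re

# Hypothetical template: every row is assumed to have exactly this shape.
ROW = re.compile(
    r'<span class="name">(?P<name>[^<]*)</span>\s*'
    r'<span class="price">\$(?P<price>[0-9]+\.[0-9]{2})</span>'
)

page = """
<li><span class="name">Widget</span> <span class="price">$4.50</span></li>
<li><span class="name">Gadget</span> <span class="price">$12.00</span></li>
"""

# Pull (name, price) pairs out of each row that matches the template.
items = [(m["name"], float(m["price"])) for m in ROW.finditer(page)]
print(items)
```

The pattern never tries to understand HTML. It matches one fixed string layout, so a change in attribute order or an extra nested tag would silently produce zero matches, which is worth checking for in a real script.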

How to Solve It Properly

For robust HTML parsing, use a dedicated parser.

Python with BeautifulSoup:

from bs4 import BeautifulSoup

html = '<p>Text</p><br /><a href="foo">Link</a>'
soup = BeautifulSoup(html, 'html.parser')

# Get only non-self-closing opening tags
for tag in soup.find_all():
    if not tag.is_empty_element:
        print(tag.name)  # Outputs: p, a

JavaScript with DOMParser:

const parser = new DOMParser();
const html = '<p>Text</p><br /><a href="foo">Link</a>';
const doc = parser.parseFromString(html, 'text/html');

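// Void elements (per the HTML standard) never have a closing tag
const VOID_ELEMENTS = new Set([
    'area', 'base', 'br', 'col', 'embed', 'hr', 'img',
    'input', 'link', 'meta', 'source', 'track', 'wbr',
]);

// Search inside <body> so the implicit html/head/body wrappers are skipped
doc.body.querySelectorAll('*').forEach(el => {
    const name = el.tagName.toLowerCase();
    if (!VOID_ELEMENTS.has(name)) {
        console.log(name);  // Outputs: p, a
    }
});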

Both handle edge cases automatically: malformed HTML, different quoting styles, comments, CDATA, script content.
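If installing a third-party package is not an option, Python's standard library html.parser module can answer the original question too. This is only a sketch: the class name OpeningTagCollector and the sample string are my own, and it classifies tags purely by syntax, so a bare <br> without the slash would be reported as an opening tag.

```python
from html.parser import HTMLParser


class OpeningTagCollector(HTMLParser):
    """Collect names of opening tags, skipping tags written as <name ... />."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # Called for ordinary opening tags such as <p> or <a href="foo">
        self.tags.append(tag)

    def handle_startendtag(self, tag, attrs):
        # Called for self-closing syntax such as <br /> or <hr class="foo" />
        pass


html = (
    '<p>Text</p><br /><hr class="foo" /><a href="foo">Link</a>'
    '<img src="/path/to/image.png">'
    '<!-- <p>hidden</p> -->'
    '<script>var x = "<p>";</script>'
)

collector = OpeningTagCollector()
collector.feed(html)
collector.close()
print(collector.tags)
```

The slash inside the img path, the commented-out p, and the p inside the script string need no special handling here, because the parser already tracks which context it is in.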

When to Use Regex Anyway

If you decide regex fits your use case, here’s a more robust pattern:

<(?:"[^"]*"['"]*|'[^']*'['"]*|[^'">])+>

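This handles quoted attributes, although it also matches closing tags such as </p>. To exclude self-closing tags, add a negative lookbehind. Because of the \s*, the lookbehind is variable-length, which JavaScript and .NET accept but Python's re module rejects: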

<(?:"[^"]*"['"]*|'[^']*'['"]*|[^'">])+(?<!/\s*)>

But remember: this still fails on CDATA, comments, and script/style content. Test thoroughly.
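Python's re module only allows fixed-width lookbehinds, so one alternative there is to match with the first pattern and filter the results in code. The function name opening_tags and the sample input below are assumptions for illustration, and the same limitations around CDATA, comments, and script content still apply.

```python
import re

# The quote-aware pattern, without the lookbehind.
TAG = re.compile(r"""<(?:"[^"]*"['"]*|'[^']*'['"]*|[^'">])+>""")


def opening_tags(html):
    """Return matched tags that are neither closing, declaration-like, nor self-closing."""
    result = []
    for m in TAG.finditer(html):
        tag = m.group(0)
        # Skip closing tags (</p>) and anything starting with <! (comments, doctype)
        if tag.startswith("</") or tag.startswith("<!"):
            continue
        # Skip self-closing tags: a "/" right before ">", ignoring whitespace
        if tag[:-1].rstrip().endswith("/"):
            continue
        result.append(tag)
    return result


sample = (
    '<p>Text</p><br /><hr class="foo" /><a href="foo">Link</a>'
    '<img src="/path/to/image.png">'
)
print(opening_tags(sample))
```

Doing the self-closing check in plain code keeps the regex simpler and makes each rule easy to test on its own.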

Summary

In this post, I explained why regex cannot reliably parse arbitrary HTML. The key point is that HTML's nested structure needs at least a context-free grammar, while classic regular expressions describe only regular languages and have no memory for nesting depth.

For robust parsing, use BeautifulSoup (Python) or DOMParser (JavaScript). For quick one-off scraping of known HTML, regex might be pragmatic, but test edge cases carefully.

The famous Stack Overflow warning is technically correct for general HTML parsing. Just remember: pragmatism sometimes wins over theoretical purity when you control the input.

Final Words + More Resources

My intention with this article was to help others by sharing my knowledge and experience. If you want to contact me, you can reach me by email.

Here are also the most important links from this article along with some further resources that will help you in this scope:

👨‍💻 Stack Overflow: RegEx match open tags except XHTML self-contained tags
👨‍💻 BeautifulSoup Documentation

Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!