Regular Expressions

Master the re module — patterns, groups, substitution, compiled patterns, and common regex recipes.

Intermediate 30 min read 🐍 Python

What Are Regular Expressions?

Regular expressions (regex) are patterns that describe sets of strings. They are used to search, validate, and transform text, and Python's re module provides the regex engine. The basics come up regularly in data cleaning, log parsing, input validation, and web scraping.

When to Use Regex

Regex is well suited to validating emails and phone numbers, extracting data from unstructured text, pattern-based find-and-replace, and parsing log files. Don't use it to parse HTML, though: nested markup is more than a regex handles reliably, so use a proper parser such as BeautifulSoup instead.

Basic Matching

import re

text = "Call me at 555-1234 or 555-5678"

# findall — find ALL matches
phones = re.findall(r'\d{3}-\d{4}', text)
print(f"All phones: {phones}")

# search — find FIRST match
match = re.search(r'\d{3}-\d{4}', text)
if match:
    print(f"First match: {match.group()}")
    print(f"Position: {match.start()}-{match.end()}")

# match — match at START of string only
result = re.match(r'\d+', "123 abc")
print(f"match: {result.group() if result else None}")

result = re.match(r'\d+', "abc 123")
print(f"match from start: {result}")
Output
All phones: ['555-1234', '555-5678']
First match: 555-1234
Position: 11-19
match: 123
match from start: None
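
Since match() only anchors at the start of the string, it can accept input with trailing junk. The sketch below contrasts it with re.fullmatch(), and uses re.finditer() to get the position of every match rather than just the first. The sample strings are made up for illustration.

```python
import re

candidate = "555-1234 ext. 9"
pattern = r'\d{3}-\d{4}'

# match() only anchors at the start, so the trailing text is ignored
print(bool(re.match(pattern, candidate)))        # True
# fullmatch() requires the whole string to fit the pattern
print(bool(re.fullmatch(pattern, candidate)))    # False
print(bool(re.fullmatch(pattern, "555-1234")))   # True

# finditer() yields match objects, so positions are available for every hit
text = "Call me at 555-1234 or 555-5678"
spans = [(m.group(), m.start()) for m in re.finditer(pattern, text)]
print(spans)  # [('555-1234', 11), ('555-5678', 23)]
```

For validating a whole input value, fullmatch() is usually the safer choice of the three.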

Pattern Syntax Quick Reference

Pattern   Matches                             Example
\d        Any digit (0-9)                     \d{3} matches "123"
\w        Word char (letter, digit, _)        \w+ matches "hello_42"
\s        Whitespace (space, tab, newline)    \s+ matches " \t"
.         Any character (except newline)      a.c matches "abc", "a1c"
*         0 or more of previous               ab*c matches "ac", "abc", "abbc"
+         1 or more of previous               ab+c matches "abc", "abbc" (not "ac")
?         0 or 1 of previous                  colou?r matches "color", "colour"
{n,m}     Between n and m of previous         \d{2,4} matches "12", "123", "1234"
[abc]     Any one of a, b, or c               [aeiou] matches any vowel
^         Start of string                     ^Hello matches "Hello world"
$         End of string                       world$ matches "Hello world"
|         OR                                  cat|dog matches "cat" or "dog"
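
Key Takeaway: Always use raw strings (r'...') for regex patterns. Without the r prefix, Python processes backslash escapes before the regex engine sees the pattern: '\b' becomes a backspace character instead of a word boundary, and unrecognized escapes such as '\d' trigger a warning in recent Python versions.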
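
A small sketch of why the r prefix matters, using \b (word boundary) as the example. The sample text is made up for illustration.

```python
import re

# In a normal string, '\b' is a single backspace character (ASCII 8);
# in a raw string it stays as two characters: backslash and 'b'.
print(len('\b'), len(r'\b'))   # 1 2

text = "cat concat catalog"

# Raw string: \b reaches the regex engine as a word boundary
raw_hits = re.findall(r'\bcat\b', text)
# Normal string: the pattern looks for backspace + "cat" + backspace
plain_hits = re.findall('\bcat\b', text)

print(raw_hits)    # ['cat']
print(plain_hits)  # []
```

The second call fails silently, which is what makes a missing r prefix hard to spot.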

Groups and Capturing

Parentheses create groups that capture parts of the match. This lets you extract specific pieces of data from a larger pattern:

import re

# Numbered groups
text = "John Smith, age 30, from NYC"
match = re.search(r'(\w+) (\w+), age (\d+)', text)
if match:
    print(f"Full match: {match.group(0)}")
    print(f"First name: {match.group(1)}")
    print(f"Last name: {match.group(2)}")
    print(f"Age: {match.group(3)}")

# Named groups — much more readable
pattern = r'(?P<first>\w+) (?P<last>\w+), age (?P<age>\d+)'
match = re.search(pattern, text)
if match:
    print(f"\nNamed: {match.group('first')} {match.group('last')}, {match.group('age')}")
    print(f"As dict: {match.groupdict()}")
Output
Full match: John Smith, age 30
First name: John
Last name: Smith
Age: 30

Named: John Smith, 30
As dict: {'first': 'John', 'last': 'Smith', 'age': '30'}
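
Groups also change what findall() and finditer() return when a pattern matches more than once. The following is a minimal sketch on a made-up input string containing two records:

```python
import re

log = "John Smith, age 30; Ana Lopez, age 27"
pattern = re.compile(r'(?P<first>\w+) (?P<last>\w+), age (?P<age>\d+)')

# With several groups, findall() returns one tuple of groups per match
print(pattern.findall(log))
# [('John', 'Smith', '30'), ('Ana', 'Lopez', '27')]

# finditer() yields match objects, so named groups stay available
people = [m.groupdict() for m in pattern.finditer(log)]
print(people)
```

Captured values are always strings, so the age still needs int() before any arithmetic.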

Search and Replace

re.sub() replaces each match with a string, or with the result of a function that is called with the match object. That makes it a good fit for data transformation:

import re

# Simple replacement
text = "Hello 123 World 456"
result = re.sub(r'\d+', 'NUM', text)
print(result)

# Replace with a function
def double_number(match):
    return str(int(match.group()) * 2)

result = re.sub(r'\d+', double_number, text)
print(result)

# Clean up whitespace
messy = "  too   many    spaces   "
clean = re.sub(r'\s+', ' ', messy).strip()
print(f"'{clean}'")
Output
Hello NUM World NUM
Hello 246 World 912
'too many spaces'
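
The replacement string can also refer back to captured groups. The sketch below reorders date parts with numbered (\1) and named (\g<name>) backreferences, and uses re.subn() to count replacements. The input strings are made up for illustration.

```python
import re

dates = "Released 2024-03-15, patched 2024-04-02"

# \1, \2, \3 in the replacement refer to the numbered groups
us_style = re.sub(r'(\d{4})-(\d{2})-(\d{2})', r'\2/\3/\1', dates)
print(us_style)   # Released 03/15/2024, patched 04/02/2024

# \g<name> refers to a named group
named = re.sub(
    r'(?P<y>\d{4})-(?P<m>\d{2})-(?P<d>\d{2})',
    r'\g<d>.\g<m>.\g<y>',
    dates,
)
print(named)      # Released 15.03.2024, patched 02.04.2024

# subn() returns the new string and the number of replacements made
masked, count = re.subn(r'\d', '#', "PIN 4921")
print(masked, count)   # PIN #### 4
```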

Compiled Patterns

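If you use the same pattern repeatedly (for example, in a loop), compile it once up front. The re module already caches recently used patterns, so the speed gain is usually modest; the larger benefit is readability, because the compiled pattern gets a descriptive name: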

import re

# Compile once, use many times
email_pattern = re.compile(
    r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'
)

# Placeholder addresses for illustration
emails = [
    "user@example.com",
    "invalid@",
    "jane.doe@mail.example.org",
    "no-at-sign.com",
    "dev+test@example.co.uk",
]

for email in emails:
    # fullmatch() requires the whole string to fit the pattern
    if email_pattern.fullmatch(email):
        print(f"  Valid:   {email}")
    else:
        print(f"  Invalid: {email}")
Output
  Valid:   user@example.com
  Invalid: invalid@
  Valid:   jane.doe@mail.example.org
  Invalid: no-at-sign.com
  Valid:   dev+test@example.co.uk
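
Compiling is also the natural place to attach flags. As a sketch of the readability point, the same basic email shape can be written with re.VERBOSE so that each part carries a comment, and with re.IGNORECASE so the character classes can stay lowercase. The sample addresses are placeholders.

```python
import re

# re.VERBOSE ignores whitespace in the pattern and allows # comments
email_pattern = re.compile(
    r"""
    [a-z0-9._%+-]+      # local part
    @
    [a-z0-9.-]+         # domain labels
    \.[a-z]{2,}         # top-level domain
    """,
    re.VERBOSE | re.IGNORECASE,
)

samples = ["Dev@Example.COM", "invalid@", "no-at-sign.com"]
results = {s: bool(email_pattern.fullmatch(s)) for s in samples}
print(results)
# {'Dev@Example.COM': True, 'invalid@': False, 'no-at-sign.com': False}
```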

Common Regex Recipes

import re

# Email (basic)
email = re.compile(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}')

# Phone: (555) 123-4567 or 555-123-4567
phone = re.compile(r'\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}')

# URL
url = re.compile(r'https?://[\w.-]+(?:/[\w./-]*)*')

# IP address (shape only: does not check that each octet is 0-255)
ip = re.compile(r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}')

# Date: YYYY-MM-DD
date = re.compile(r'\d{4}-(?:0[1-9]|1[0-2])-(?:0[1-9]|[12]\d|3[01])')

# Extract all URLs from text
text = "Visit https://example.com or http://test.org/page for more"
urls = url.findall(text)
print(f"URLs: {urls}")
Output
URLs: ['https://example.com', 'http://test.org/page']
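
These recipes check shape, not meaning. The IP pattern, for instance, accepts 999.1.1.1. One way to tighten it is to let the regex confirm the format and do the range check in plain Python. The helper name is_valid_ipv4 below is hypothetical, and this is a sketch rather than a complete validator.

```python
import re

ip = re.compile(r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}')

def is_valid_ipv4(candidate):
    """Hypothetical helper: dotted-quad shape with every octet in 0-255."""
    # The regex confirms the overall shape of the string
    if not ip.fullmatch(candidate):
        return False
    # Plain Python handles the numeric range check
    return all(0 <= int(octet) <= 255 for octet in candidate.split('.'))

for candidate in ["192.168.1.10", "999.1.1.1", "10.0.0"]:
    print(candidate, is_valid_ipv4(candidate))
# 192.168.1.10 True
# 999.1.1.1 False
# 10.0.0 False
```

For production code, the standard library's ipaddress module is likely a better fit than a hand-written check.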

🔍 Deep Dive: Lookahead and Lookbehind

Lookahead ((?=...)) and lookbehind ((?<=...)) match a position without consuming characters. For example, (?<=\$)\d+ matches digits that are preceded by $ but doesn't include the $ in the match. Negative versions ((?!...) and (?<!...)) match positions where the pattern does NOT appear. These are advanced features, most useful when a match depends on surrounding context that should stay out of the result. Note that Python's re module requires a lookbehind pattern to match strings of a fixed length.
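
A short sketch of these assertions in action, on a made-up price string:

```python
import re

prices = "Item A costs $25, item B costs 40 EUR, item C costs $130"

# Lookbehind: digits preceded by "$"; the "$" is not part of the match
dollars = re.findall(r'(?<=\$)\d+', prices)
print(dollars)      # ['25', '130']

# Lookahead: digits followed by " EUR"; the suffix is not part of the match
euros = re.findall(r'\d+(?= EUR)', prices)
print(euros)        # ['40']

# Negative lookbehind: digits NOT preceded by "$" or by another digit
# (the digit check stops the match from starting in the middle of "$25")
non_dollar = re.findall(r'(?<![\$\d])\d+', prices)
print(non_dollar)   # ['40']
```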

⚠️ Common Mistake: Greedy vs Lazy Matching

Wrong:

import re

html = '<b>bold</b> and <b>more bold</b>'
# Greedy: .* matches as MUCH as possible
result = re.findall(r'<b>(.*)</b>', html)
print(result)  # ['bold</b> and <b>more bold'] — Wrong!

Why: .* is greedy — it matches the longest possible string, gobbling up everything between the first <b> and the LAST </b>.

Instead:

# Lazy: .*? matches as LITTLE as possible
result = re.findall(r'<b>(.*?)</b>', html)
print(result)  # ['bold', 'more bold'] — Correct!
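
A common alternative to the lazy quantifier is a negated character class, which says directly what the content may not contain. The sketch below compares the two, including a multi-line case where . stops at the newline unless re.DOTALL is set. It assumes the tag content itself never contains a < character.

```python
import re

html = '<b>bold</b> and <b>more bold</b>'

lazy = re.findall(r'<b>(.*?)</b>', html)
# [^<]* matches any run of characters that are not "<"
negated = re.findall(r'<b>([^<]*)</b>', html)
print(lazy == negated)   # True

multiline = '<b>line one\nline two</b>'
# "." does not match a newline by default, so the lazy version finds nothing
print(re.findall(r'<b>(.*?)</b>', multiline))                    # []
# The negated class matches newlines too
print(re.findall(r'<b>([^<]*)</b>', multiline))
# re.DOTALL makes "." match newlines as well
print(re.findall(r'<b>(.*?)</b>', multiline, flags=re.DOTALL))
```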