What Are Regular Expressions?
Regular expressions (regex) are patterns that describe sets of strings. They are a powerful tool for searching, validating, and transforming text. Python's built-in re module provides the regex engine. The basics are worth knowing because they come up in data cleaning, log parsing, input validation, and web scraping.
When to Use Regex
Regex works well for validating emails and phone numbers, extracting data from unstructured text, find-and-replace with patterns, and parsing log files. Don't use regex to parse HTML, though: nested or malformed markup defeats simple patterns, so use a proper parser like BeautifulSoup instead.
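To illustrate the parser route without extra dependencies, here is a minimal sketch that swaps in Python's standard-library html.parser in place of BeautifulSoup. The class name BoldTextCollector, the helper bold_texts, and the sample markup are hypothetical, chosen only for this example.

```python
from html.parser import HTMLParser


class BoldTextCollector(HTMLParser):
    """Hypothetical helper: collects the text inside <b>...</b> elements."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # number of <b> tags currently open
        self.chunks = []    # one string per outermost <b> element

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <B> arrives as "b"
        if tag == "b":
            if self.depth == 0:
                self.chunks.append("")
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "b" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        # Keep text only while we are inside a <b> element
        if self.depth > 0:
            self.chunks[-1] += data


def bold_texts(html):
    parser = BoldTextCollector()
    parser.feed(html)
    parser.close()
    return parser.chunks


print(bold_texts('<b class="x">bold</b> and <B>more <i>bold</i></B>'))
# ['bold', 'more bold']
```

Attributes, tag case, and nested tags are handled by the parser here, which is the kind of detail a regex would have to account for by hand.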
Basic Matching
import re
text = "Call me at 555-1234 or 555-5678"
# findall — find ALL matches
phones = re.findall(r'\d{3}-\d{4}', text)
print(f"All phones: {phones}")
# search — find FIRST match
match = re.search(r'\d{3}-\d{4}', text)
if match:
    print(f"First match: {match.group()}")
    print(f"Position: {match.start()}-{match.end()}")
# match — match at START of string only
result = re.match(r'\d+', "123 abc")
print(f"match: {result.group() if result else None}")
result = re.match(r'\d+', "abc 123")
print(f"match from start: {result}")
All phones: ['555-1234', '555-5678']
First match: 555-1234
Position: 11-19
match: 123
match from start: None
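Two related functions from the same re module round out the picture: re.finditer, which yields match objects instead of plain strings, and re.fullmatch, which requires the entire string to match. The sketch below reuses the phone text from the example above; the "ext. 9" string is made up for illustration.

```python
import re

text = "Call me at 555-1234 or 555-5678"
phone = r'\d{3}-\d{4}'

# finditer: like findall, but yields match objects, so positions are available
spans = [(m.group(), m.start(), m.end()) for m in re.finditer(phone, text)]
print(spans)  # [('555-1234', 11, 19), ('555-5678', 23, 31)]

# match only anchors at the start, so trailing text is tolerated
print(bool(re.match(phone, "555-1234 ext. 9")))      # True

# fullmatch succeeds only if the ENTIRE string matches the pattern
print(bool(re.fullmatch(phone, "555-1234 ext. 9")))  # False
print(bool(re.fullmatch(phone, "555-1234")))         # True
```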
Pattern Syntax Quick Reference
| Pattern | Matches | Example |
|---|---|---|
| \d | Any digit (0-9) | \d{3} matches "123" |
| \w | Word char (letter, digit, _) | \w+ matches "hello_42" |
| \s | Whitespace (space, tab, newline) | \s+ matches " \t" |
| . | Any character (except newline) | a.c matches "abc", "a1c" |
| * | 0 or more of previous | ab*c matches "ac", "abc", "abbc" |
| + | 1 or more of previous | ab+c matches "abc", "abbc" (not "ac") |
| ? | 0 or 1 of previous | colou?r matches "color", "colour" |
| {n,m} | Between n and m of previous | \d{2,4} matches "12", "123", "1234" |
| [abc] | Any one of a, b, or c | [aeiou] matches any vowel |
| ^ | Start of string | ^Hello matches "Hello world" |
| $ | End of string | world$ matches "Hello world" |
| \| | OR | cat\|dog matches "cat" or "dog" |
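The rows of the table above can be checked mechanically. The sketch below runs each pattern against strings that should and should not match; the negative examples and the helper name check are additions for illustration, not part of the table.

```python
import re

# Each entry: (pattern, strings that should match fully, strings that should not)
cases = [
    (r'\d{3}',   ["123"],                ["12", "abc"]),
    (r'\w+',     ["hello_42"],           ["hello world"]),
    (r'a.c',     ["abc", "a1c"],         ["ac", "a\nc"]),
    (r'ab*c',    ["ac", "abc", "abbc"],  ["adc"]),
    (r'ab+c',    ["abc", "abbc"],        ["ac"]),
    (r'colou?r', ["color", "colour"],    ["colouur"]),
    (r'\d{2,4}', ["12", "123", "1234"],  ["1", "12345"]),
    (r'[aeiou]', ["a", "u"],             ["b"]),
    (r'cat|dog', ["cat", "dog"],         ["cow"]),
]


def check(pattern, good, bad):
    """Hypothetical helper: True if every good string matches and no bad one does."""
    return (all(re.fullmatch(pattern, s) for s in good)
            and not any(re.fullmatch(pattern, s) for s in bad))


for pattern, good, bad in cases:
    print(f"{pattern:10} {'ok' if check(pattern, good, bad) else 'MISMATCH'}")

# ^ and $ anchor a search to the start and end of the string
print(bool(re.search(r'^Hello', "Hello world")))  # True
print(bool(re.search(r'^world', "Hello world")))  # False
print(bool(re.search(r'world$', "Hello world")))  # True
```

re.fullmatch is used so that each test string must match in its entirety, which makes the negative cases meaningful.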
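Always use raw strings (r'...') for regex patterns. Without the r prefix, Python processes backslash escapes before the regex engine sees them, so '\b' becomes a backspace character instead of the word-boundary pattern. Unrecognized escapes such as '\d' happen to pass through unchanged, but recent Python versions warn about them.
Groups and Capturing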
Parentheses create groups that capture parts of the match. This lets you extract specific pieces of data from a larger pattern:
import re
# Numbered groups
text = "John Smith, age 30, from NYC"
match = re.search(r'(\w+) (\w+), age (\d+)', text)
if match:
    print(f"Full match: {match.group(0)}")
    print(f"First name: {match.group(1)}")
    print(f"Last name: {match.group(2)}")
    print(f"Age: {match.group(3)}")
# Named groups: (?P<name>...) labels each group, which is much more readable
pattern = r'(?P<first>\w+) (?P<last>\w+), age (?P<age>\d+)'
match = re.search(pattern, text)
if match:
    print(f"\nNamed: {match.group('first')} {match.group('last')}, {match.group('age')}")
    print(f"As dict: {match.groupdict()}")
Full match: John Smith, age 30
First name: John
Last name: Smith
Age: 30
Named: John Smith, 30
As dict: {'first': 'John', 'last': 'Smith', 'age': '30'}
Search and Replace
re.sub() replaces matches with a string or the result of a function. It's incredibly powerful for data transformation:
import re
# Simple replacement
text = "Hello 123 World 456"
result = re.sub(r'\d+', 'NUM', text)
print(result)
# Replace with a function
def double_number(match):
    return str(int(match.group()) * 2)
result = re.sub(r'\d+', double_number, text)
print(result)
# Clean up whitespace
messy = "  too   many    spaces  "
clean = re.sub(r'\s+', ' ', messy).strip()
print(f"'{clean}'")
Hello NUM World NUM
Hello 246 World 912
'too many spaces'
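The replacement string can also refer back to captured groups, which combines re.sub() with the groups covered earlier. The following sketch uses made-up sample text and reorders dates with numbered (\1) and named (\g<name>) backreferences; it also shows the count argument.

```python
import re

# Illustrative sample text (not from a real source)
text = "Released 2024-03-15, patched 2024-04-02"

# Reorder captured groups: YYYY-MM-DD -> DD/MM/YYYY
by_number = re.sub(r'(\d{4})-(\d{2})-(\d{2})', r'\3/\2/\1', text)
print(by_number)  # Released 15/03/2024, patched 02/04/2024

# The same replacement with named groups
by_name = re.sub(
    r'(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})',
    r'\g<day>/\g<month>/\g<year>',
    text,
)
print(by_name == by_number)  # True

# count limits how many matches are replaced
first_only = re.sub(r'\d+', 'NUM', "Hello 123 World 456", count=1)
print(first_only)  # Hello NUM World 456
```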
Compiled Patterns
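If you use the same pattern repeatedly (for example, in a loop), compile it once and reuse the compiled object. The re module caches recently used patterns, so the speedup is usually modest, but compiling avoids the repeated cache lookup and makes code more readable by giving the pattern a name: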
import re
# Compile once, use many times
email_pattern = re.compile(
r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'
)
emails = [
"[email protected]",
"invalid@",
"[email protected]",
"no-at-sign.com",
"[email protected]",
]
for email in emails:
    # fullmatch requires the whole string to match; match() would accept trailing junk
    if email_pattern.fullmatch(email):
        print(f" Valid: {email}")
    else:
        print(f" Invalid: {email}")
 Valid: [email protected]
 Invalid: invalid@
 Valid: [email protected]
 Invalid: no-at-sign.com
 Valid: [email protected]
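On the readability point, re.compile also accepts flags. A minimal sketch, assuming the same basic email pattern as above: with re.VERBOSE, whitespace and # comments inside the pattern are ignored, so each part can be annotated. The sample addresses are hypothetical.

```python
import re

# Same basic email pattern, spread over several lines with re.VERBOSE
email_pattern = re.compile(
    r"""
    [a-zA-Z0-9._%+-]+    # local part
    @
    [a-zA-Z0-9.-]+       # domain labels
    \.[a-zA-Z]{2,}       # top-level domain, at least two letters
    """,
    re.VERBOSE,
)

# Hypothetical sample addresses for illustration
samples = ["alice@example.com", "invalid@", "no-at-sign.com"]
results = {s: bool(email_pattern.fullmatch(s)) for s in samples}
for address, ok in results.items():
    print(f"{address}: {'valid' if ok else 'invalid'}")
```

Like the original, this pattern only checks the general shape of an address; it is not a full validator.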
Common Regex Recipes
import re
# Email (basic)
email = re.compile(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}')
# Phone: (555) 123-4567 or 555-123-4567
phone = re.compile(r'\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}')
# URL
url = re.compile(r'https?://[\w.-]+(?:/[\w./-]*)*')
# IP address (shape only; also matches out-of-range octets such as 999)
ip = re.compile(r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}')
# Date: YYYY-MM-DD
date = re.compile(r'\d{4}-(?:0[1-9]|1[0-2])-(?:0[1-9]|[12]\d|3[01])')
# Extract all URLs from text
text = "Visit https://example.com or http://test.org/page for more"
urls = url.findall(text)
print(f"URLs: {urls}")
URLs: ['https://example.com', 'http://test.org/page']
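Recipes like these check shape, not meaning. As a sketch of one way to tighten the IP recipe, the regex below confirms the four-number layout and plain Python checks that each octet is in range. The helper name is_valid_ipv4 is hypothetical.

```python
import re

# Same shape as the IP recipe, with each octet captured in a group
ip_shape = re.compile(r'(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})')


def is_valid_ipv4(candidate):
    """Hypothetical helper: regex checks the shape, int() checks the range."""
    m = ip_shape.fullmatch(candidate)
    return bool(m) and all(0 <= int(octet) <= 255 for octet in m.groups())


for candidate in ["192.168.1.1", "999.999.999.999", "10.0.0"]:
    print(candidate, is_valid_ipv4(candidate))
# 192.168.1.1 True
# 999.999.999.999 False
# 10.0.0 False
```

For production code, the standard-library ipaddress module is likely a better fit than a hand-written check.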
🔍 Deep Dive: Lookahead and Lookbehind
Lookahead ((?=...)) and lookbehind ((?<=...)) match a position without consuming characters. For example, (?<=\$)\d+ matches digits that are preceded by $ but doesn't include the $ in the match. Negative versions ((?!...) and (?<!...)) match positions where the pattern does NOT appear. These are advanced but extremely useful for complex extraction.
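The four lookaround forms can be tried on a short string. This is a minimal sketch with made-up sample text; the variable names are chosen only for this example.

```python
import re

text = "Widget costs $25, gadget costs $140, shipping takes 3 days"

# Lookbehind: digits preceded by "$"; the "$" itself is not part of the match
prices = re.findall(r'(?<=\$)\d+', text)
print(prices)  # ['25', '140']

# Negative lookbehind: digits NOT preceded by "$" or by another digit
others = re.findall(r'(?<![\$\d])\d+', text)
print(others)  # ['3']

# Lookahead: a number only if it is followed by " days"
durations = re.findall(r'\d+(?= days)', text)
print(durations)  # ['3']

# Negative lookahead: "<word> costs" unless it is followed by " $140"
not_140 = re.findall(r'\w+ costs(?! \$140)', text)
print(not_140)  # ['Widget costs']
```

The negative lookbehind also excludes a preceding digit. Without that, the engine could start matching in the middle of a price (the "5" in "$25").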
⚠️ Common Mistake: Greedy vs Lazy Matching
Wrong:
html = '<b>bold</b> and <b>more bold</b>'
# Greedy: .* matches as MUCH as possible
result = re.findall(r'<b>(.*)</b>', html)
print(result) # ['bold</b> and <b>more bold'] — Wrong!
Why: .* is greedy — it matches the longest possible string, gobbling up everything between the first <b> and the LAST </b>.
Instead:
# Lazy: .*? matches as LITTLE as possible
result = re.findall(r'<b>(.*?)</b>', html)
print(result) # ['bold', 'more bold'] — Correct!
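A common alternative to the lazy quantifier is a negated character class, which states directly which characters the group may contain. A minimal sketch on the same input:

```python
import re

html = '<b>bold</b> and <b>more bold</b>'

# [^<]* matches any run of characters that are not "<",
# so it cannot run past the start of the closing tag.
result = re.findall(r'<b>([^<]*)</b>', html)
print(result)  # ['bold', 'more bold']

# Greedy and lazy quantifiers side by side on the same input
greedy = re.search(r'<b>.*</b>', html).group()
lazy = re.search(r'<b>.*?</b>', html).group()
print(greedy)  # <b>bold</b> and <b>more bold</b>
print(lazy)    # <b>bold</b>
```

The negated class would not match bold text that contains nested tags, so it suits simple, flat snippets like this one.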