A Comprehensive Guide to Regular Expressions in Python
Regular expressions (regex) provide a powerful mechanism for pattern matching and string manipulation. In Python, the re module supports searching, splitting, replacing, and validating text based on specific patterns. This guide covers the essential syntax, core functions, and advanced usage of regular expressions in Python 3.
Basic Pattern Matching
The entry point for most regex tasks is the re module. The re.findall function is commonly used to retrieve all non-overlapping matches of a pattern in a string as a list.
import re
source_text = "The version numbers are 1.2, 3.45, and 6.789."
pattern = r"\d+\.\d+"
matches = re.findall(pattern, source_text)
print(matches)
# Output: ['1.2', '3.45', '6.789']
Special Character Sequences
Regex patterns use special sequences to define character types:
- \d: Matches any decimal digit (0-9; for str patterns this also includes other Unicode decimal digits unless re.ASCII is set).
- \D: Matches any non-digit character.
- \w: Matches any word character (alphanumeric and underscore).
- \W: Matches any non-word character.
- \s: Matches any whitespace character (spaces, tabs, newlines).
- \S: Matches any non-whitespace character.
log_data = "User_123: Error 404 found."
# Extract alphanumeric sequences
print(re.findall(r"\w+", log_data))
# Output: ['User_123', 'Error', '404', 'found']
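The remaining sequences from the list above can be tried on the same log line. This is a small illustrative sketch; the variable names are chosen here for the example and are not part of any API.

```python
import re

log_data = "User_123: Error 404 found."

digits = re.findall(r"\d+", log_data)      # runs of digits
non_digits = re.findall(r"\D+", log_data)  # runs of anything except digits
tokens = re.findall(r"\S+", log_data)      # runs of non-whitespace characters
spaces = re.findall(r"\s", log_data)       # individual whitespace characters

print(digits)       # ['123', '404']
print(non_digits)   # ['User_', ': Error ', ' found.']
print(tokens)       # ['User_123:', 'Error', '404', 'found.']
print(len(spaces))  # 3
```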
Word Boundaries
Assertions like \b and \B match positions rather than characters. \b asserts a position at a word boundary (start or end), while \B asserts a position that is not a word boundary.
sentence = "replay the reply"
# Match 're' only at the start of a word
print(re.findall(r"\bre", sentence))
# Output: ['re', 're'] (one from 'replay', one from 'reply')
# Match 'play' not at the start of a word
print(re.findall(r"\Bplay", sentence))
# Output: ['play']
Anchors: String Position
Anchors are used to match positions relative to the start or end of a string or line.
- \A: Matches only at the start of the string.
- \Z: Matches only at the end of the string.
- ^: Matches the start of the string (or line in re.MULTILINE mode).
- $: Matches the end of the string (or line in re.MULTILINE mode).
config = "DEBUG=true\nENABLED=false"
# Match start of line with 'MULTILINE' flag
print(re.findall(r"^\w+", config, re.M))
# Output: ['DEBUG', 'ENABLED']
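To make the difference between the line anchors and the string anchors concrete, the sketch below runs all four on the same two-line string with re.MULTILINE enabled. The variable names are illustrative only.

```python
import re

config = "DEBUG=true\nENABLED=false"

# With re.MULTILINE, ^ matches at the start of every line...
line_starts = re.findall(r"^\w+", config, re.MULTILINE)
# ...while \A still matches only at the very start of the string.
string_start = re.findall(r"\A\w+", config, re.MULTILINE)

# Likewise, $ matches at the end of each line,
# and \Z only at the very end of the string.
line_ends = re.findall(r"\w+$", config, re.MULTILINE)
string_end = re.findall(r"\w+\Z", config, re.MULTILINE)

print(line_starts)   # ['DEBUG', 'ENABLED']
print(string_start)  # ['DEBUG']
print(line_ends)     # ['true', 'false']
print(string_end)    # ['false']
```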
Quantifiers
Quantifiers specify how many instances of a character, group, or character class must be present for a match to be found.
- *: Matches 0 or more repetitions.
- +: Matches 1 or more repetitions.
- ?: Matches 0 or 1 repetition (optional).
- {m}: Matches exactly m repetitions.
- {m,n}: Matches between m and n repetitions, inclusive.
text = "color colour colouuur"
# Match 'color' where 'u' appears 0 to 2 times
print(re.findall(r"colou{0,2}r", text))
# Output: ['color', 'colour']
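Each quantifier in the list can be compared side by side on a small set of strings. The helper matching() below is a hypothetical convenience written for this example; it relies on re.fullmatch, which succeeds only if the whole string matches the pattern.

```python
import re

samples = ["ac", "abc", "abbc", "abbbc"]

def matching(pattern):
    """Return the samples that the pattern matches in full."""
    return [s for s in samples if re.fullmatch(pattern, s)]

star = matching(r"ab*c")             # zero or more 'b'
plus = matching(r"ab+c")             # one or more 'b'
optional = matching(r"ab?c")         # zero or one 'b'
exactly_two = matching(r"ab{2}c")    # exactly two 'b'
one_to_two = matching(r"ab{1,2}c")   # one or two 'b'

print(star)         # ['ac', 'abc', 'abbc', 'abbbc']
print(plus)         # ['abc', 'abbc', 'abbbc']
print(optional)     # ['ac', 'abc']
print(exactly_two)  # ['abbc']
print(one_to_two)   # ['abc', 'abbc']
```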
Character Sets and Grouping
Square brackets [] define a set of characters to match. For example, [a-z] matches any lowercase ASCII letter. A caret ^ as the first character inside a set negates it.
data = "Item A, Item B, Item 1, Item 2"
# Match 'Item' followed by uppercase A-C
print(re.findall(r"Item [A-C]", data))
# Output: ['Item A', 'Item B']
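Negation and mixed sets can be shown on the same data. This is a brief sketch with variable names chosen for illustration.

```python
import re

data = "Item A, Item B, Item 1, Item 2"

# [^A-Z] matches any single character that is NOT an uppercase ASCII letter,
# so only the numbered items match here.
not_upper = re.findall(r"Item [^A-Z]", data)

# A set can mix literal characters and ranges: 'A' or any digit.
a_or_digit = re.findall(r"Item [A0-9]", data)

print(not_upper)   # ['Item 1', 'Item 2']
print(a_or_digit)  # ['Item A', 'Item 1', 'Item 2']
```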
Parentheses () create groups, allowing you to apply quantifiers to multiple characters or capture specific parts of the match.
dates = "2023-01-01, 2024-12-31"
# Capture year, month, and day
print(re.findall(r"(\d{4})-(\d{2})-(\d{2})", dates))
# Output: [('2023', '01', '01'), ('2024', '12', '31')]
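The other use of parentheses, applying a quantifier to several characters at once, might look like the sketch below. It also reads captured parts back from a match object; the sample strings are made up for the example.

```python
import re

# Without a group, + applies only to the preceding character 'b'.
without_group = bool(re.fullmatch(r"ab+", "ababab"))
# With a group, + applies to the whole sequence 'ab'.
with_group = bool(re.fullmatch(r"(ab)+", "ababab"))

print(without_group)  # False
print(with_group)     # True

# Captured parts can be unpacked from a single match.
m = re.search(r"(\d{4})-(\d{2})-(\d{2})", "Released on 2023-01-01")
year, month, day = m.groups()
print(year, month, day)  # 2023 01 01
```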
Core re Module Functions
re.search()
Scans through the string looking for the first location where the pattern produces a match. Returns a match object or None.
code = "function_123()"
if re.search(r"function_\d+\(\)", code):
    print("Function declaration found.")
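Because re.search returns None when nothing matches, calling a method on the result without checking it first raises an AttributeError. One way to guard against that is sketched below; find_status_code is a hypothetical helper, not part of the re module.

```python
import re

def find_status_code(line):
    """Return the first standalone three-digit number in line, or None."""
    m = re.search(r"\b\d{3}\b", line)
    # m is None when there is no match, so test it before calling group().
    return m.group() if m else None

found = find_status_code("GET /index.html 404 Not Found")
missing = find_status_code("no digits here")

print(found)    # 404
print(missing)  # None
```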
re.match()
Similar to search, but only checks for a match at the beginning of the string.
header = "HTTP/1.1 200 OK"
if re.match(r"HTTP/\d\.\d", header):
    print("Valid HTTP version detected.")
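The difference from re.search shows up as soon as the pattern does not sit at the very start of the string. A minimal comparison, using a made-up input line:

```python
import re

line = "Status: HTTP/1.1 200 OK"
pattern = r"HTTP/\d\.\d"

at_start = re.match(pattern, line)   # pattern must match at index 0
anywhere = re.search(pattern, line)  # pattern may match at any position

print(at_start)          # None
print(anywhere.group())  # HTTP/1.1
```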
re.split()
Splits the string by occurrences of the pattern. This is more flexible than the standard string split() method.
text = "Words, separated; by: various-punctuations"
# Split on non-word characters
print(re.split(r"\W+", text))
# Output: ['Words', 'separated', 'by', 'various', 'punctuations']
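As a rough illustration of that extra flexibility, the sketch below splits one string with str.split and with re.split, and also uses the maxsplit argument of re.split. The sample record is invented for the example.

```python
import re

record = "alpha, beta;gamma  delta"

# str.split handles one fixed separator at a time.
by_str = record.split(", ")
# re.split can treat commas, semicolons, and runs of whitespace alike.
by_regex = re.split(r"[,;\s]+", record)
# maxsplit limits how many splits are performed.
first_split = re.split(r"[,;\s]+", record, maxsplit=1)

print(by_str)       # ['alpha', 'beta;gamma  delta']
print(by_regex)     # ['alpha', 'beta', 'gamma', 'delta']
print(first_split)  # ['alpha', 'beta;gamma  delta']
```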
re.finditer()
Returns an iterator yielding match objects over all non-overlapping matches. This is memory-efficient for large strings.
for match in re.finditer(r"\d+", "ID: 101, 202, 303"):
    print(f"Found {match.group()} at {match.span()}")
re.sub() and re.subn()
re.sub replaces occurrences of the pattern with a replacement string. re.subn performs the same operation but returns a tuple containing the new string and the number of replacements made.
original = "2023-12-01"
# Replace hyphens with slashes
new_date = re.sub(r"-", "/", original)
print(new_date)
# Output: 2023/12/01
count_tuple = re.subn(r"\d", "X", "A1B2C3")
print(count_tuple)
# Output: ('AXBXCX', 3)
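The replacement string can also refer back to captured groups with \1, \2, and so on, and the count argument limits how many replacements are made. Both are sketched below on the same date string; the variable names are illustrative.

```python
import re

original = "2023-12-01"

# \1, \2, \3 in the replacement refer to the captured year, month, and day.
reordered = re.sub(r"(\d{4})-(\d{2})-(\d{2})", r"\3/\2/\1", original)

# count=1 replaces only the first occurrence.
first_only = re.sub(r"-", "/", original, count=1)

print(reordered)   # 01/12/2023
print(first_only)  # 2023/12-01
```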
re.compile()
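Compiles a regex pattern into a pattern object. This is useful when the same pattern is used many times, since the compiled object can be stored and reused. The re module also caches recently compiled patterns, so the speedup is usually modest.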
pattern = re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")
is_valid = pattern.search("contact@example.com")
print(is_valid)
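A typical reuse scenario might look like the following sketch, which compiles the pattern once and applies it to several inputs. The candidate strings are made up, and the pattern is a simplified email check rather than a complete validator.

```python
import re

# Same simplified email pattern, compiled once and reused.
email_pattern = re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")

candidates = ["contact@example.com", "not-an-email", "a.b@sub.example.org"]

# fullmatch requires the entire candidate to match the pattern.
valid = [c for c in candidates if email_pattern.fullmatch(c)]

print(valid)  # ['contact@example.com', 'a.b@sub.example.org']
```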
Regex Flags
Flags modify the behavior of the regex engine.
- re.I (re.IGNORECASE): Performs case-insensitive matching.
- re.M (re.MULTILINE): Makes ^ and $ match after/before newlines.
print(re.search(r"python", "I love PYTHON", re.I))
# Output: <re.Match object; span=(7, 13), match='PYTHON'>
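Several flags can be combined with the bitwise OR operator. The sketch below contrasts re.I alone with re.I | re.M on a made-up three-line string.

```python
import re

text = "python is fun\nPython is popular\nPYTHON is everywhere"

# Case-insensitive AND multiline: ^ matches at the start of every line.
both_flags = re.findall(r"^python", text, re.I | re.M)

# Case-insensitive only: ^ matches just at the start of the string.
ignorecase_only = re.findall(r"^python", text, re.I)

print(both_flags)       # ['python', 'Python', 'PYTHON']
print(ignorecase_only)  # ['python']
```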
The Match Object
Methods like search, match, and finditer return match objects which provide detailed information about the match.
- group(): Returns the string of the matched text. Using arguments returns specific subgroups.
- groups(): Returns a tuple containing all the subgroups of the match.
- start() and end(): Return the indices of the start and end of the matched substring.
- span(): Returns a tuple containing the (start, end) indices.
m = re.match(r"(\w+)@(\w+)", "user@domain")
print(m.group(0)) # 'user@domain'
print(m.group(1)) # 'user'
print(m.groups()) # ('user', 'domain')
print(m.span()) # (0, 11)
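The start() and end() methods are easiest to see when the match does not begin at index 0. The following sketch uses re.search on a longer, invented string and shows that slicing with those indices reproduces the matched text.

```python
import re

text = "contact: user@domain"
m = re.search(r"(\w+)@(\w+)", text)

print(m.start(), m.end())             # 9 20
print(text[m.start():m.end()])        # user@domain
print(m.group(1, 2))                  # ('user', 'domain')
print(m.span(2))                      # (14, 20), the span of the second group
```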