
A Comprehensive Guide to Regular Expressions in Python

Tech May 17 14

Regular expressions (regex) provide a powerful mechanism for pattern matching and string manipulation. In Python, the re module supports searching, splitting, replacing, and validating text based on specific patterns. This guide covers the essential syntax, core functions, and advanced usage of regular expressions in Python 3.

Basic Pattern Matching

The entry point for most regex tasks is the re module. The findall method is commonly used to retrieve all non-overlapping matches of a pattern in a string as a list.

import re

source_text = "The version numbers are 1.2, 3.45, and 6.789."
pattern = r"\d+\.\d+"
matches = re.findall(pattern, source_text)
print(matches) 
# Output: ['1.2', '3.45', '6.789']

Special Character Sequences

Regex patterns use special sequences to define character types:

\d: Matches any decimal digit (0-9).
\D: Matches any non-digit character.
\w: Matches any word character (alphanumeric and underscore).
\W: Matches any non-word character.
\s: Matches any whitespace character (spaces, tabs, newlines).
\S: Matches any non-whitespace character.

log_data = "User_123:  Error 404 found."
# Extract alphanumeric sequences
print(re.findall(r"\w+", log_data))
# Output: ['User_123', 'Error', '404', 'found']
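
The remaining sequences can be tried on the same string. The snippet below is a small sketch that reuses the log line from above; the variable names are illustrative only.

```python
import re

log_line = "User_123:  Error 404 found."

# \s+ matches each run of whitespace; the double space counts as one run
whitespace_runs = re.findall(r"\s+", log_line)
print(whitespace_runs)
# Output: ['  ', ' ', ' ']

# \D+ matches runs of non-digit characters, so the digits act as separators
non_digit_runs = re.findall(r"\D+", log_line)
print(non_digit_runs)
# Output: ['User_', ':  Error ', ' found.']
```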

Word Boundaries

Assertions like \b and \B match positions rather than characters. \b asserts a position at a word boundary (start or end), while \B asserts a position that is not a word boundary.

sentence = "replay the reply"
# Match 're' only at the start of a word
print(re.findall(r"\bre", sentence))
# Output: ['re', 're']

# Match 'play' not at the start of a word
print(re.findall(r"\Bplay", sentence))
# Output: ['play']
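
A common use of \b is matching a whole word and nothing longer. The following sketch uses a made-up sentence and reports match positions rather than text, so it is clear which occurrences were picked up.

```python
import re

sentence = "cat concatenate catalog cat."

# \bcat\b requires a boundary on both sides, so only the standalone words match
whole_word_starts = [m.start() for m in re.finditer(r"\bcat\b", sentence)]
print(whole_word_starts)
# Output: [0, 24]

# \Bcat\B requires word characters on both sides, as in "concatenate"
embedded_starts = [m.start() for m in re.finditer(r"\Bcat\B", sentence)]
print(embedded_starts)
# Output: [7]
```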

Anchors: String Position

Anchors are used to match positions relative to the start or end of a string or line.

\A: Matches only at the start of the string.
\Z: Matches only at the end of the string.
^: Matches the start of the string (or line in re.MULTILINE mode).
$: Matches the end of the string or just before a trailing newline (and the end of each line in re.MULTILINE mode).

config = "DEBUG=true\nENABLED=false"
# Match start of line with 'MULTILINE' flag
print(re.findall(r"^\w+", config, re.M))
# Output: ['DEBUG', 'ENABLED']
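
The difference between the line anchors and the string anchors shows up only when re.MULTILINE is set. As a brief sketch on the same config string:

```python
import re

config = "DEBUG=true\nENABLED=false"

# ^ matches at the start of every line under re.MULTILINE
line_starts = re.findall(r"^\w+", config, re.MULTILINE)
# \A ignores the flag and matches only at the very start of the string
string_start = re.findall(r"\A\w+", config, re.MULTILINE)
print(line_starts, string_start)
# Output: ['DEBUG', 'ENABLED'] ['DEBUG']

# $ matches at the end of every line; \Z only at the end of the string
line_ends = re.findall(r"\w+$", config, re.MULTILINE)
string_end = re.findall(r"\w+\Z", config, re.MULTILINE)
print(line_ends, string_end)
# Output: ['true', 'false'] ['false']
```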

Quantifiers

Quantifiers specify how many instances of a character, group, or character class must be present for a match to be found.

*: Matches 0 or more repetitions.
+: Matches 1 or more repetitions.
?: Matches 0 or 1 repetition (optional).
{m}: Matches exactly m repetitions.
{m,n}: Matches between m and n repetitions, inclusive.

text = "color colour colouuur"
# Match 'color' where 'u' appears 0 to 2 times
print(re.findall(r"colou{0,2}r", text))
# Output: ['color', 'colour']
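
The other quantifiers follow the same idea. The sketch below uses invented sample strings to show *, ?, and {m} side by side.

```python
import re

# b* allows zero or more 'b' characters after the 'a'
star_matches = re.findall(r"ab*", "a ab abbb")
print(star_matches)
# Output: ['a', 'ab', 'abbb']

text = "Call 555-1234 or 5551234, not 55-1234."
# \d{3} needs exactly three digits; -? makes the hyphen optional
phone_numbers = re.findall(r"\b\d{3}-?\d{4}\b", text)
print(phone_numbers)
# Output: ['555-1234', '5551234']
```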

Character Sets and Grouping

Square brackets [] define a set of characters, any one of which may match. For example, [a-z] matches any single lowercase letter. A caret ^ as the first character inside a set negates it.

data = "Item A, Item B, Item 1, Item 2"
# Match 'Item' followed by uppercase A-C
print(re.findall(r"Item [A-C]", data))
# Output: ['Item A', 'Item B']
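
Negation and mixed sets work on the same data. A short sketch:

```python
import re

data = "Item A, Item B, Item 1, Item 2"

# [^A-Z] matches any single character that is NOT an uppercase letter
non_letter_items = re.findall(r"Item [^A-Z]", data)
print(non_letter_items)
# Output: ['Item 1', 'Item 2']

# A set can mix a range with individual characters
mixed_items = re.findall(r"Item [A-C1]", data)
print(mixed_items)
# Output: ['Item A', 'Item B', 'Item 1']
```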

Parentheses () create groups, allowing you to apply quantifiers to multiple characters or capture specific parts of the match.

dates = "2023-01-01, 2024-12-31"
# Capture year, month, and day
print(re.findall(r"(\d{4})-(\d{2})-(\d{2})", dates))
# Output: [('2023', '01', '01'), ('2024', '12', '31')]
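
Applying a quantifier to a group can be sketched as follows. Note that this example also uses the non-capturing form (?:...), which is not covered above; it groups without capturing, so findall returns whole matches instead of the group contents.

```python
import re

chant = "ha haha hahaha"

# (?:ha){2,} repeats the two-character unit "ha" at least twice
repeated = re.findall(r"\b(?:ha){2,}\b", chant)
print(repeated)
# Output: ['haha', 'hahaha']

# With a capturing group, findall returns the group instead,
# and a repeated group keeps only its last repetition
captured = re.findall(r"\b(ha){2,}\b", chant)
print(captured)
# Output: ['ha', 'ha']
```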

Core re Module Functions

re.search()

Scans through the string looking for the first location where the pattern produces a match. Returns a match object or None.

code = "function_123()"
if re.search(r"function_\d+\(\)", code):
    print("Function declaration found.")
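
Because search returns None when nothing matches, calling a method on the result without checking it raises AttributeError. One way to guard against that is sketched below; find_status_code is a hypothetical helper written for this illustration, not part of the re module.

```python
import re

def find_status_code(line):
    """Return the first standalone three-digit number in line, or None."""
    match = re.search(r"\b\d{3}\b", line)
    # Only touch the match object when the search succeeded
    return match.group() if match else None

found = find_status_code("GET /index.html 404 Not Found")
missing = find_status_code("no status here")
print(found, missing)
# Output: 404 None
```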

re.match()

Similar to search, but only checks for a match at the beginning of the string.

header = "HTTP/1.1 200 OK"
if re.match(r"HTTP/\d\.\d", header):
    print("Valid HTTP version detected.")
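
The difference from search is easiest to see when the pattern occurs later in the string. The sketch below also uses re.fullmatch, a related function that requires the whole string to match; the sample line is made up.

```python
import re

pattern = r"HTTP/\d\.\d"
line = "Status: HTTP/1.1 200 OK"

# match anchors at position 0, so the leading "Status: " makes it fail
at_start = re.match(pattern, line)
# search scans forward and finds the pattern at index 8
anywhere = re.search(pattern, line)
print(at_start, anywhere.start())
# Output: None 8

# fullmatch succeeds only if the pattern covers the entire string
exact = re.fullmatch(pattern, "HTTP/1.1")
partial = re.fullmatch(pattern, "HTTP/1.1 200 OK")
print(exact is not None, partial is not None)
# Output: True False
```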

re.split()

Splits the string by occurrences of the pattern. This is more flexible than the standard string split() method.

text = "Words, separated; by: various-punctuations"
# Split on non-word characters
print(re.split(r"\W+", text))
# Output: ['Words', 'separated', 'by', 'various', 'punctuations']
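
Two details of re.split are worth a quick sketch: the maxsplit argument limits the number of splits, and a separator at the very end of the string leaves an empty string in the result. The sample strings are illustrative.

```python
import re

record = "alpha, beta;gamma:  delta"
separator = r"[,;:]\s*"

# Split on any of three delimiters, swallowing trailing whitespace
parts = re.split(separator, record)
print(parts)
# Output: ['alpha', 'beta', 'gamma', 'delta']

# maxsplit=1 stops after the first separator
head_tail = re.split(separator, record, maxsplit=1)
print(head_tail)
# Output: ['alpha', 'beta;gamma:  delta']

# A trailing separator yields a final empty string
trailing = re.split(r"\W+", "one, two.")
print(trailing)
# Output: ['one', 'two', '']
```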

re.finditer()

Returns an iterator yielding match objects over all non-overlapping matches. Because matches are produced one at a time, this uses less memory than findall when a large string contains many matches.

for match in re.finditer(r"\d+", "ID: 101, 202, 303"):
    print(f"Found {match.group()} at {match.span()}")

re.sub() and re.subn()

re.sub replaces occurrences of the pattern with a replacement string. re.subn performs the same operation but returns a tuple containing the new string and the number of replacements made.

original = "2023-12-01"
# Replace hyphens with slashes
new_date = re.sub(r"-", "/", original)
print(new_date)
# Output: 2023/12/01

count_tuple = re.subn(r"\d", "X", "A1B2C3")
print(count_tuple)
# Output: ('AXBXCX', 3)
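
The replacement does not have to be a fixed string. As a sketch, the examples below show group references in the replacement, a function as the replacement, and the count argument; double_number is a hypothetical helper used only for illustration.

```python
import re

iso_date = "2023-12-01"

# \1, \2, \3 in the replacement refer to the captured groups
us_style = re.sub(r"(\d{4})-(\d{2})-(\d{2})", r"\2/\3/\1", iso_date)
print(us_style)
# Output: 12/01/2023

def double_number(match):
    """Receive a match object and return the replacement text."""
    return str(int(match.group()) * 2)

doubled = re.sub(r"\d+", double_number, "3 apples and 10 pears")
print(doubled)
# Output: 6 apples and 20 pears

# count limits how many occurrences are replaced
first_only = re.sub(r"-", "/", iso_date, count=1)
print(first_only)
# Output: 2023/12-01
```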

re.compile()

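Compiles a regex pattern into a reusable pattern object. This is useful when the same pattern is used many times: the pattern gets a name, and each call skips the lookup in the module's internal cache of compiled patterns. Since the module-level functions already cache recent patterns, the speed gain is usually modest.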

pattern = re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")
is_valid = pattern.search("contact@example.com")
print(is_valid)
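
A compiled pattern exposes the same operations as methods, which makes reuse across many inputs straightforward. The sketch below assumes a small list of made-up candidate strings and uses fullmatch so that the whole string must look like an address.

```python
import re

email_pattern = re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")

candidates = ["contact@example.com", "not-an-email", "dev@sub.example.org"]

# Reuse the one compiled object for every candidate
valid = [c for c in candidates if email_pattern.fullmatch(c)]
print(valid)
# Output: ['contact@example.com', 'dev@sub.example.org']
```

This pattern is a simplified check, not a complete validator for every legal email address.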

Regex Flags

Flags modify the behavior of the regex engine.

re.I (re.IGNORECASE): Performs case-insensitive matching.
re.M (re.MULTILINE): Makes ^ and $ match at the start and end of each line, not only at the start and end of the whole string.

print(re.search(r"python", "I love PYTHON", re.I))
# Output: <re.Match object; span=(7, 13), match='PYTHON'>
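
Flags can be combined with the bitwise OR operator. A minimal sketch with an invented three-line string:

```python
import re

text = "first line\nSecond line\nFIRST again"

# Without flags, only the lowercase word at the start of the string matches
plain = re.findall(r"^first", text)
print(plain)
# Output: ['first']

# IGNORECASE and MULTILINE together match at each line start, in any case
combined = re.findall(r"^first", text, re.IGNORECASE | re.MULTILINE)
print(combined)
# Output: ['first', 'FIRST']
```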

The Match Object

Methods like search, match, and finditer return match objects which provide detailed information about the match.

group(): Returns the full matched text. Passing a group number or name returns that subgroup instead.
groups(): Returns a tuple containing all the subgroups of the match.
start() and end(): Return the indices of the start and end of the matched substring.
span(): Returns a tuple containing the (start, end) indices.

m = re.match(r"(\w+)@(\w+)", "user@domain")
print(m.group(0))   # 'user@domain'
print(m.group(1))   # 'user'
print(m.groups())   # ('user', 'domain')
print(m.span())     # (0, 11)
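
The start() and end() methods are easiest to follow when the match is not at position 0. The sketch below uses search on a longer, made-up string and also passes arguments to group() and start().

```python
import re

text = "Contact: user@domain for help"
match = re.search(r"(\w+)@(\w+)", text)

if match is not None:
    start, end = match.start(), match.end()
    print(start, end)
    # Output: 9 20

    # Slicing the original string with these indices gives the matched text
    sliced = text[start:end]
    print(sliced == match.group())
    # Output: True

    # group() with several arguments returns a tuple of those subgroups
    both = match.group(1, 2)
    # start() with a group number gives that subgroup's start index
    domain_start = match.start(2)
    print(both, domain_start)
    # Output: ('user', 'domain') 14
```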

Copyright © fadingcoder.top
