Fading Coder

A Comprehensive Guide to Regular Expressions in Python

Tech May 17 2

Regular expressions (regex) provide a powerful mechanism for pattern matching and string manipulation. In Python, the re module supports searching, splitting, replacing, and validating text based on specific patterns. This guide covers the essential syntax, core functions, and advanced usage of regular expressions in Python 3.

Basic Pattern Matching

The entry point for most regex tasks is the re module. Its findall function returns all non-overlapping matches of a pattern in a string as a list.

import re

source_text = "The version numbers are 1.2, 3.45, and 6.789."
pattern = r"\d+\.\d+"
matches = re.findall(pattern, source_text)
print(matches) 
# Output: ['1.2', '3.45', '6.789']

Special Character Sequences

Regex patterns use special sequences to define character types:

  • \d: Matches any decimal digit (0-9).
  • \D: Matches any non-digit character.
  • \w: Matches any word character (alphanumeric and underscore).
  • \W: Matches any non-word character.
  • \s: Matches any whitespace character (spaces, tabs, newlines).
  • \S: Matches any non-whitespace character.

log_data = "User_123:  Error 404 found."
# Extract alphanumeric sequences
print(re.findall(r"\w+", log_data))
# Output: ['User_123', 'Error', '404', 'found']
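The remaining sequences from the list can be tried on the same log line. The following is a small sketch; the variable names are illustrative only.

```python
import re

log_data = "User_123:  Error 404 found."

digits = re.findall(r"\d+", log_data)      # runs of decimal digits
non_digits = re.findall(r"\D+", log_data)  # runs of anything except digits
gaps = re.findall(r"\s+", log_data)        # runs of whitespace
tokens = re.findall(r"\S+", log_data)      # runs of non-whitespace

print(digits)      # ['123', '404']
print(non_digits)  # ['User_', ':  Error ', ' found.']
print(gaps)        # ['  ', ' ', ' ']
print(tokens)      # ['User_123:', 'Error', '404', 'found.']
```

Note that \S+ keeps the punctuation attached to each token, whereas \w+ drops it.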

Word Boundaries

Assertions like \b and \B match positions rather than characters. \b asserts a position at a word boundary (start or end), while \B asserts a position that is not a word boundary.

sentence = "replay the reply"
# Match 're' only at the start of a word ('replay' and 'reply' both qualify)
print(re.findall(r"\bre", sentence))
# Output: ['re', 're']

# Match 'play' not at the start of a word
print(re.findall(r"\Bplay", sentence))
# Output: ['play']
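A common use of \b is whole-word matching, with a boundary on both sides of the word. The sketch below uses a made-up sample string and reports match positions through finditer so the two assertions can be compared directly.

```python
import re

text = "cat concatenate bobcat"

# \bcat\b: 'cat' with a word boundary on each side (a standalone word)
whole_words = [m.span() for m in re.finditer(r"\bcat\b", text)]

# \Bcat\B: 'cat' with word characters on both sides (strictly inside a word)
inner = [m.span() for m in re.finditer(r"\Bcat\B", text)]

print(whole_words)  # [(0, 3)]
print(inner)        # [(7, 10)]
```

The 'cat' at the end of 'bobcat' matches neither pattern: it has a boundary after it but not before it.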

Anchors: String Position

Anchors are used to match positions relative to the start or end of a string or line.

  • \A: Matches only at the start of the string.
  • \Z: Matches only at the end of the string.
  • ^: Matches the start of the string (or line in re.MULTILINE mode).
  • $: Matches the end of the string or just before a trailing newline (or the end of each line in re.MULTILINE mode).

config = "DEBUG=true\nENABLED=false"
# Match start of line with 'MULTILINE' flag
print(re.findall(r"^\w+", config, re.M))
# Output: ['DEBUG', 'ENABLED']
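The difference between the string-only anchors and the line-aware ones shows up once re.MULTILINE is set. A short sketch using the same config string:

```python
import re

config = "DEBUG=true\nENABLED=false"

# ^ and $ become line-aware under re.M; \A and \Z never do
line_starts = re.findall(r"^\w+", config, re.M)
string_start = re.findall(r"\A\w+", config, re.M)
line_ends = re.findall(r"\w+$", config, re.M)
string_end = re.findall(r"\w+\Z", config, re.M)

print(line_starts)   # ['DEBUG', 'ENABLED']
print(string_start)  # ['DEBUG']
print(line_ends)     # ['true', 'false']
print(string_end)    # ['false']
```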

Quantifiers

Quantifiers specify how many instances of a character, group, or character class must be present for a match to be found.

  • *: Matches 0 or more repetitions.
  • +: Matches 1 or more repetitions.
  • ?: Matches 0 or 1 repetition (optional).
  • {m}: Matches exactly m repetitions.
  • {m,n}: Matches between m and n repetitions (inclusive).

text = "color colour colouuur"
# Match 'color' where 'u' appears 0 to 2 times
print(re.findall(r"colou{0,2}r", text))
# Output: ['color', 'colour']
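To compare the quantifiers side by side, each one can be applied to the same set of candidate strings. This is a sketch: the samples list and the matching helper are hypothetical, and re.fullmatch (which requires the whole string to match) is used so that each result depends only on the quantifier.

```python
import re

samples = ["ac", "abc", "abbc", "abbbc"]

def matching(pattern):
    """Return the samples that the pattern matches in full (illustrative helper)."""
    return [s for s in samples if re.fullmatch(pattern, s)]

star = matching(r"ab*c")        # zero or more 'b'
plus = matching(r"ab+c")        # one or more 'b'
optional = matching(r"ab?c")    # zero or one 'b'
exact = matching(r"ab{2}c")     # exactly two 'b'
ranged = matching(r"ab{1,2}c")  # one or two 'b'

print(star)      # ['ac', 'abc', 'abbc', 'abbbc']
print(plus)      # ['abc', 'abbc', 'abbbc']
print(optional)  # ['ac', 'abc']
print(exact)     # ['abbc']
print(ranged)    # ['abc', 'abbc']
```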

Character Sets and Grouping

Square brackets [] define a set of characters to match. For example, [a-z] matches any lowercase letter. The caret ^ inside a set negates it.

data = "Item A, Item B, Item 1, Item 2"
# Match 'Item' followed by uppercase A-C
print(re.findall(r"Item [A-C]", data))
# Output: ['Item A', 'Item B']
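Negation with the caret can be illustrated on the same data. In this sketch, [^A-Z] accepts any single character that is not an uppercase letter, and [^, ]+ collects runs of characters that are neither commas nor spaces.

```python
import re

data = "Item A, Item B, Item 1, Item 2"

# 'Item ' followed by one character that is NOT an uppercase letter
not_upper = re.findall(r"Item [^A-Z]", data)

# Runs of characters that are neither a comma nor a space
tokens = re.findall(r"[^, ]+", data)

print(not_upper)  # ['Item 1', 'Item 2']
print(tokens)     # ['Item', 'A', 'Item', 'B', 'Item', '1', 'Item', '2']
```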

Parentheses () create groups, allowing you to apply quantifiers to multiple characters or capture specific parts of the match.

dates = "2023-01-01, 2024-12-31"
# Capture year, month, and day
print(re.findall(r"(\d{4})-(\d{2})-(\d{2})", dates))
# Output: [('2023', '01', '01'), ('2024', '12', '31')]
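Groups also let a quantifier apply to a multi-character sequence. The sketch below uses a made-up sentence. It also shows one thing to watch for: when a pattern contains a capturing group, findall returns the group contents rather than the whole match, and a non-capturing group (?:...) avoids that.

```python
import re

text = "He said hahaha!"

m = re.search(r"(ha)+", text)
whole = m.group(0)       # the entire repeated run
last_group = m.group(1)  # a repeated group keeps only its last iteration

captured = re.findall(r"(ha)+", text)    # returns group contents
uncaptured = re.findall(r"(?:ha)+", text)  # returns whole matches

print(whole)       # hahaha
print(last_group)  # ha
print(captured)    # ['ha']
print(uncaptured)  # ['hahaha']
```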

Core re Module Functions

re.search()

Scans through the string looking for the first location where the pattern produces a match. Returns a match object or None.

code = "function_123()"
if re.search(r"function_\d+\(\)", code):
    print("Function declaration found.")
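Because re.search returns None when nothing matches, calling a method on the result without checking raises AttributeError. A minimal sketch of the usual guard follows; find_status and the sample log lines are hypothetical.

```python
import re

def find_status(line):
    """Return the first standalone three-digit number in line, or None."""
    m = re.search(r"\b\d{3}\b", line)
    return m.group() if m else None

found = find_status("GET /index.html 200 OK")
missing = find_status("no status here")

print(found)    # 200
print(missing)  # None
```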

re.match()

Similar to search, but only checks for a match at the beginning of the string.

header = "HTTP/1.1 200 OK"
if re.match(r"HTTP/\d\.\d", header):
    print("Valid HTTP version detected.")
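The difference from search becomes visible when the pattern occurs later in the string. A short sketch with an illustrative input:

```python
import re

line = "Server says: HTTP/1.1 200 OK"
pattern = r"HTTP/\d\.\d"

at_start = re.match(pattern, line)   # only tries position 0
anywhere = re.search(pattern, line)  # scans the whole string

print(at_start)         # None
print(anywhere.span())  # (13, 21)
```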

re.split()

Splits the string by occurrences of the pattern. This is more flexible than the standard string split() method.

text = "Words, separated; by: various-punctuations"
# Split on non-word characters
print(re.split(r"\W+", text))
# Output: ['Words', 'separated', 'by', 'various', 'punctuations']
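Two details of re.split are worth knowing in practice: the maxsplit argument limits the number of splits, and a delimiter at the very start or end of the string produces an empty string in the result. The sketch below uses made-up inputs.

```python
import re

record = "name = Ada Lovelace = pioneer"

# Split on the first '=' only, ignoring surrounding whitespace
key_value = re.split(r"\s*=\s*", record, maxsplit=1)

# Several different delimiters in one pattern
parts = re.split(r"[;,]\s*", "a, b;c")

# A trailing delimiter leaves an empty string at the end
trailing = re.split(r"\W+", "Hello, world!")

print(key_value)  # ['name', 'Ada Lovelace = pioneer']
print(parts)      # ['a', 'b', 'c']
print(trailing)   # ['Hello', 'world', '']
```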

re.finditer()

Returns an iterator yielding match objects over all non-overlapping matches. This is memory-efficient for large strings.

for match in re.finditer(r"\d+", "ID: 101, 202, 303"):
    print(f"Found {match.group()} at {match.span()}")
# Output:
# Found 101 at (4, 7)
# Found 202 at (9, 12)
# Found 303 at (14, 17)

re.sub() and re.subn()

re.sub replaces occurrences of the pattern with a replacement string. re.subn performs the same operation but returns a tuple containing the new string and the number of replacements made.

original = "2023-12-01"
# Replace hyphens with slashes
new_date = re.sub(r"-", "/", original)
print(new_date)
# Output: 2023/12/01

count_tuple = re.subn(r"\d", "X", "A1B2C3")
print(count_tuple)
# Output: ('AXBXCX', 3)
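The replacement does not have to be a fixed string. re.sub also accepts backreferences such as \1 to captured groups, a function that receives each match object, and a count argument that limits the number of replacements. A sketch with illustrative inputs (the double helper is hypothetical):

```python
import re

iso = "2023-12-01"

# Reorder captured groups: year-month-day becomes day/month/year
reordered = re.sub(r"(\d{4})-(\d{2})-(\d{2})", r"\3/\2/\1", iso)

def double(match):
    """Replacement function: return the matched number doubled."""
    return str(int(match.group()) * 2)

doubled = re.sub(r"\d+", double, "3 apples and 10 pears")

# Replace only the first occurrence
first_only = re.sub(r"-", "/", iso, count=1)

print(reordered)   # 01/12/2023
print(doubled)     # 6 apples and 20 pears
print(first_only)  # 2023/12-01
```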

re.compile()

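Compiles a regex pattern into a pattern object whose methods (search, match, findall, and so on) mirror the module-level functions. This is useful when the same pattern is used many times. The re module already caches recently compiled patterns, so the speed gain is usually modest, but a compiled object skips the cache lookup and gives the pattern a reusable name.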

pattern = re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")
match = pattern.search("contact@example.com")
print(match)
# Output: <re.Match object; span=(0, 19), match='contact@example.com'>
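A compiled pattern is most convenient when it is applied to many inputs. The sketch below reuses the same simplified email pattern, which is not a full address validator. The EMAIL name and the candidate strings are illustrative, and fullmatch is used so that the entire string must fit the pattern.

```python
import re

# Simplified pattern from the example above; real-world addresses are more varied
EMAIL = re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")

candidates = ["contact@example.com", "not-an-email", "a.b@sub.domain.org"]

# Reuse the compiled object for every candidate
valid = [c for c in candidates if EMAIL.fullmatch(c)]

print(valid)  # ['contact@example.com', 'a.b@sub.domain.org']
```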

Regex Flags

Flags modify the behavior of the regex engine.

  • re.I (re.IGNORECASE): Performs case-insensitive matching.
  • re.M (re.MULTILINE): Makes ^ match at the start of each line and $ match at the end of each line, not only at the start and end of the string.

print(re.search(r"python", "I love PYTHON", re.I))
# Output: <re.Match object; span=(7, 13), match='PYTHON'>
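Flags can be combined with the bitwise OR operator. The following sketch, with a made-up multi-line string, compares a pattern with and without re.I | re.M:

```python
import re

text = "python\nPython\nPYTHON rocks"

# Case-insensitive, and ^/$ anchored to each line
with_flags = re.findall(r"^python$", text, re.I | re.M)

# Without flags, ^ and $ refer to the whole string, so nothing matches
without_flags = re.findall(r"^python$", text)

print(with_flags)     # ['python', 'Python']
print(without_flags)  # []
```

The third line is excluded even with the flags because ' rocks' follows the word before the end of the line.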

The Match Object

Methods like search, match, and finditer return match objects which provide detailed information about the match.

  • group(): Returns the string of the matched text. Using arguments returns specific subgroups.
  • groups(): Returns a tuple containing all the subgroups of the match.
  • start() and end(): Return the indices of the start and end of the matched substring.
  • span(): Returns a tuple containing the (start, end) indices.

m = re.match(r"(\w+)@(\w+)", "user@domain")
print(m.group(0))   # 'user@domain'
print(m.group(1))   # 'user'
print(m.groups())   # ('user', 'domain')
print(m.span())     # (0, 11)
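The positional methods are easiest to read when the match does not start at index 0. In this sketch with an illustrative string, start(), end(), and span() also accept a group number, and slicing the original string with the reported indices reproduces the matched text.

```python
import re

text = "Contact: user@domain today"
m = re.search(r"(\w+)@(\w+)", text)

whole_span = (m.start(), m.end())  # indices of the entire match
user_span = m.span(1)              # indices of the first group
domain_span = m.span(2)            # indices of the second group
sliced = text[m.start():m.end()]   # same text as m.group(0)

print(whole_span)   # (9, 20)
print(user_span)    # (9, 13)
print(domain_span)  # (14, 20)
print(sliced)       # user@domain
```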
