22  Regex

import re

22.1 Common Functions

22.1.1 re.search()

  • Searches for the first occurrence of a pattern within a string.
  • Returns a match object if the pattern is found; otherwise, returns None.
import re

text = "hello world"
match = re.search(r"hello", text)
print(match)

if match:
    print("Pattern found!")
else:
    print("Pattern not found.")
<re.Match object; span=(0, 5), match='hello'>
Pattern found!

22.1.2 re.match()

  • Checks if the pattern matches at the beginning of the string.
  • Returns a match object if it matches the start of the string, otherwise returns None.
text = "hello world"
match = re.match(r"hello", text)

if match:
    print("Pattern matches the start!")
else:
    print("No match at the start.")
Pattern matches the start!

22.1.3 re.findall()

text = "My phone number is 1234, and my zip code is 56789."
matches = re.findall(r"\d+", text)
matches
['1234', '56789']

22.1.4 re.sub()

  • Substitutes all occurrences of a pattern with a replacement string.
  • Returns a new string with the substitutions.
text = "I have a dog. My neighbor has a dog too."
new_text = re.sub(r"dog", "cat", text)
new_text
'I have a cat. My neighbor has a cat too.'

22.2 Regex Syntax

Regular expressions use special characters to define patterns. Here are some of the most commonly used characters:

22.2.1 Metacharacters:

  • . : Matches any single character except newline (\n).
  • ^ : Matches the start of a string.
  • $ : Matches the end of a string.
  • * : Matches 0 or more repetitions of the preceding character.
  • + : Matches 1 or more repetitions of the preceding character.
  • ? : Matches 0 or 1 occurrence of the preceding character.
  • {} : Specifies the number of repetitions (e.g., {2} means exactly two, {2,4} means between two and four).

22.2.2 Character Classes:

  • \d : Matches any digit (equivalent to [0-9]).
  • \w : Matches any alphanumeric character (equivalent to [a-zA-Z0-9_]).
  • \s : Matches any whitespace character (spaces, tabs, newlines).
  • \D, \W, \S : Match the opposite of \d, \w, and \s.

22.2.3 Anchors:

  • ^ : Anchors the pattern to the start of the string.
  • $ : Anchors the pattern to the end of the string.

Example:

pattern = r"^\d+"  # Matches digits at the start of the string
text = "1234abc"
match = re.search(pattern, text)
if match:
    print("Found at the start:", match.group())  # Output: Found at the start: 1234

22.2.4 Groups:

  • Parentheses () are used to create groups in regex.
  • You can extract matched groups using .group() or .groups().

Example:

pattern = r"(hello) (world)"
text = "hello world"
match = re.search(pattern, text)

if match:
    print(match.group(1))  # Output: hello
    print(match.group(2))  # Output: world
hello
world

22.2.5 Escaping Special Characters

If you want to match one of the special regex characters literally, you need to escape it using a backslash (\).

Example:

pattern = r"\$100"  # Matches the string "$100"
text = "The price is $100."
match = re.search(pattern, text)

if match:
    print("Price found:", match.group()) 
Price found: $100

22.2.6 Flags in Regex

You can modify the behavior of regex with flags, such as: - re.IGNORECASE or re.I : Makes the regex case-insensitive. - re.MULTILINE or re.M : Allows ^ and $ to match the start and end of each line in a multi-line string. - re.DOTALL or re.S : Makes . match newlines as well.

pattern = r"hello"
text = "HELLO world"
match = re.search(pattern, text, re.IGNORECASE)

if match:
    print("Case-insensitive match found!") 
Case-insensitive match found!

22.3 Use Cases

22.3.1 No Match - Exception

if match is None

class Money:
    def __init__(self, dollars, cents):
        self.dollars = dollars
        self.cents = cents
    def __repr__(self):
        return f"Money({self.dollars}, {self.cents})"
import re
def money_from_string(amount):
    match = re.search(
        r'^\$(?P<dollars>\d+)\.(?P<cents>\d\d)$', amount)
    # Adding the next two lines here
    if match is None:
        raise ValueError(f"Invalid amount: {amount}")
    dollars = int(match.group('dollars'))
    cents = int(match.group('cents'))
    return Money(dollars, cents)
money_from_string("$12.34")
Money(12, 34)
try:
    money_from_string("Big")
except ValueError as e:
    print(e)
Invalid amount: Big

22.4 re vs regex

import re
import regex
import timeit

text = "hello world" * 1000000
pattern = r'world'

# re module
def re_search():
    return len(re.findall(pattern, text))

# regex module
def regex_search():
    return len(regex.findall(pattern, text))
# Benchmark
print("re module:", timeit.timeit(re_search, number=100))
print("regex module:", timeit.timeit(regex_search, number=100))
re module: 2.7286510409903713
regex module: 10.390523624955676

22.5 Fuzzy Matching

Let me explain fuzzy matching through a practical example that you might encounter in medical records.

Imagine you’re working with patient names in a database. With regular exact matching, searching for “Johnson” would only find “Johnson” - it wouldn’t find common variations or typos like “Johnsen”, “Jonson”, or “Johnson”. This is where fuzzy matching comes in.

Fuzzy matching is a technique that finds strings that approximately match a pattern, even when they’re not exactly the same. It measures how similar two strings are and can match them if they’re “close enough.” This is especially useful when dealing with:

  1. Misspellings: Like matching “penicillin” with “penicilin”
  2. Name variations: Like matching “Catherine” with “Katherine”
  3. OCR errors: When scanned text isn’t perfectly recognized
  4. Data entry errors: When humans make typing mistakes

Here’s a practical example using Python’s regex module with fuzzy matching:

import regex

# Regular exact matching
text = "The patient Smith was prescribed penicillin"
exact_pattern = r'penicillin'
print("Exact match:", regex.findall(exact_pattern, text))  # Finds 'penicillin'

# Fuzzy matching with maximum 2 differences allowed
fuzzy_pattern = r'(?:penicillin){e<=2}'  # e<=2 means allow up to 2 errors
texts = [
    "The patient Smith was prescribed penicilin",   # Missing 'l'
    "The patient Smith was prescribed peniciilin",  # Extra 'i'
    "The patient Smith was prescribed penicilln"    # Missing 'i'
]

for t in texts:
    matches = regex.findall(fuzzy_pattern, t)
    print(f"Fuzzy matches in '{t}': {matches}")
Exact match: ['penicillin']
Fuzzy matches in 'The patient Smith was prescribed penicilin': [' penicilin']
Fuzzy matches in 'The patient Smith was prescribed peniciilin': [' peniciilin']
Fuzzy matches in 'The patient Smith was prescribed penicilln': [' penicilln']

The magic happens in how fuzzy matching calculates the “distance” between strings. The most common method is Levenshtein distance, which counts the minimum number of single-character edits needed to change one string into another. For example:

  • “penicillin” → “penicilin” (distance = 1, one deletion)
  • “Smith” → “Smyth” (distance = 1, one substitution)
  • “Katherine” → “Catherine” (distance = 1, one substitution)

Think of it like measuring how many steps it takes to transform one word into another, where each step can be: - Inserting a character - Deleting a character - Substituting one character for another

This is particularly valuable in medical contexts where accuracy is crucial but variations are common. For instance, when searching through medical records, fuzzy matching could help you find relevant cases even when drug names or conditions are slightly misspelled.