import re
22 Regex
22.1 Common Functions
22.1.1 re.search()
- Searches for the first occurrence of a pattern within a string.
- Returns a match object if the pattern is found; otherwise, returns None.
import re
text = "hello world"
match = re.search(r"hello", text)
print(match)
if match:
    print("Pattern found!")
else:
    print("Pattern not found.")
<re.Match object; span=(0, 5), match='hello'>
Pattern found!
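The match object printed above carries more than a yes/no answer. As a small sketch (the sample string and pattern are chosen only for illustration), the matched text and its position can be read back like this:

```python
import re

text = "hello world"
match = re.search(r"world", text)

if match:
    matched_text = match.group()  # the substring that matched
    start = match.start()         # index where the match begins
    end = match.end()             # index just past the last matched character
    span = match.span()           # (start, end) as a tuple
    print(matched_text, start, end, span)
```

Slicing the original string with the span, `text[start:end]`, gives back the same substring as `match.group()`.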
22.1.2 re.match()
- Checks if the pattern matches at the beginning of the string.
- Returns a match object if it matches the start of the string; otherwise, returns None.
text = "hello world"
match = re.match(r"hello", text)

if match:
    print("Pattern matches the start!")
else:
    print("No match at the start.")
Pattern matches the start!
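The difference from re.search() is easiest to see with a pattern that occurs later in the string. The sketch below is illustrative only; it also shows re.fullmatch(), which is not covered above and which requires the whole string to match:

```python
import re

text = "hello world"

# "world" is present, but not at the beginning of the string
at_start = re.match(r"world", text)    # None: re.match only tries position 0
anywhere = re.search(r"world", text)   # match object: re.search scans the string

# re.fullmatch succeeds only if the pattern covers the entire string
partial = re.fullmatch(r"hello", text)        # None: " world" is left over
whole = re.fullmatch(r"hello world", text)    # match object

print(at_start, anywhere, partial, whole)
```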
22.1.3 re.findall()
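- Finds all non-overlapping matches of a pattern in a string.
- Returns them as a list of strings, or an empty list if nothing matches.
text = "My phone number is 1234, and my zip code is 56789."
matches = re.findall(r"\d+", text)
matches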
['1234', '56789']
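One behavior worth knowing: when the pattern contains capturing groups, re.findall() returns the groups rather than the full match. A minimal sketch, using a made-up key=value string:

```python
import re

text = "alice=30, bob=25"

# No groups: each list item is the full matched text
full_matches = re.findall(r"\w+=\d+", text)

# Two groups: each list item is a tuple with one entry per group
pairs = re.findall(r"(\w+)=(\d+)", text)

print(full_matches)
print(pairs)
```

Here `full_matches` is a list of strings, while `pairs` is a list of `(name, number)` tuples.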
22.1.4 re.sub()
- Substitutes all occurrences of a pattern with a replacement string.
- Returns a new string with the substitutions.
text = "I have a dog. My neighbor has a dog too."
new_text = re.sub(r"dog", "cat", text)
new_text
'I have a cat. My neighbor has a cat too.'
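re.sub() also accepts a `count` argument to limit the number of replacements, and the replacement can be a function that receives each match object. The following is a sketch with invented sample strings:

```python
import re

text = "I have a dog. My neighbor has a dog too."

# Replace only the first occurrence
first_only = re.sub(r"dog", "cat", text, count=1)

# Use a function to compute each replacement from the match
doubled = re.sub(
    r"\d+",
    lambda m: str(int(m.group()) * 2),  # double every number found
    "3 apples and 10 pears",
)

print(first_only)
print(doubled)
```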
22.2 Regex Syntax
Regular expressions use special characters to define patterns. Here are some of the most commonly used characters:
22.2.1 Metacharacters:
- .: Matches any single character except newline (\n).
- ^: Matches the start of a string.
- $: Matches the end of a string.
- *: Matches 0 or more repetitions of the preceding character.
- +: Matches 1 or more repetitions of the preceding character.
- ?: Matches 0 or 1 occurrence of the preceding character.
- {}: Specifies the number of repetitions (e.g., {2} means exactly two, {2,4} means between two and four).
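To make the metacharacters above concrete, here is a short sketch that applies each quantifier to a small made-up string:

```python
import re

# . matches any character except a newline
dot = re.findall(r"c.t", "cat cot c\nt")

# * allows zero or more "b", + requires at least one
star = re.findall(r"ab*", "a ab abbb")
plus = re.findall(r"ab+", "a ab abbb")

# ? makes the preceding "u" optional
optional = re.findall(r"colou?r", "color colour")

# {2,4} matches between two and four digits (greedy: takes as many as allowed)
counted = re.findall(r"\d{2,4}", "1 12 123 12345")

print(dot, star, plus, optional, counted)
```

Note that in the last case "12345" yields only "1234": the quantifier stops at four digits, and the single "5" left over is too short to match.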
22.2.2 Character Classes:
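- \d: Matches any digit (equivalent to [0-9] when the re.ASCII flag is set; by default, str patterns also match other Unicode digits).
- \w: Matches any word character, meaning letters, digits, and underscore (equivalent to [a-zA-Z0-9_] when the re.ASCII flag is set).
- \s: Matches any whitespace character (spaces, tabs, newlines).
- \D, \W, \S: Match the opposite of \d, \w, and \s.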
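A quick sketch of the character classes in action, using short ASCII strings invented for the example:

```python
import re

text = "Room 42, Floor 7"

digits = re.findall(r"\d", text)          # individual digit characters
words = re.findall(r"\w+", text)          # runs of word characters
fields = re.split(r"\s+", "a  b\tc\nd")   # split on any run of whitespace
non_digits = re.findall(r"\D+", "a1b22c") # runs of anything that is not a digit

print(digits, words, fields, non_digits)
```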
22.2.3 Anchors:
- ^: Anchors the pattern to the start of the string.
- $: Anchors the pattern to the end of the string.
Example:
pattern = r"^\d+"  # Matches digits at the start of the string
text = "1234abc"
match = re.search(pattern, text)
if match:
    print("Found at the start:", match.group())  # Output: Found at the start: 1234
22.2.4 Groups:
- Parentheses () are used to create groups in regex.
- You can extract matched groups using .group() or .groups().
Example:
pattern = r"(hello) (world)"
text = "hello world"
match = re.search(pattern, text)

if match:
    print(match.group(1))  # Output: hello
    print(match.group(2))  # Output: world
hello
world
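Groups can also be given names with the (?P<name>...) syntax, which tends to read better than numeric indexes. A minimal sketch follows; the group names `first` and `second` are arbitrary choices for illustration:

```python
import re

match = re.search(r"(?P<first>\w+) (?P<second>\w+)", "hello world")

if match:
    whole = match.group(0)         # group 0 is always the entire match
    all_groups = match.groups()    # tuple of every captured group
    first = match.group("first")   # look up a group by name
    by_name = match.groupdict()    # dict mapping group names to text
    print(whole, all_groups, first, by_name)
```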
22.2.5 Escaping Special Characters
If you want to match one of the special regex characters literally, you need to escape it using a backslash (\).
Example:
pattern = r"\$100"  # Matches the string "$100"
text = "The price is $100."
match = re.search(pattern, text)

if match:
    print("Price found:", match.group())
Price found: $100
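When the literal text comes from a variable, escaping by hand is error-prone. The standard library's re.escape() adds the backslashes for you. A short sketch, with a made-up search string:

```python
import re

text = "The price is $100 (approx.) today"

# Unescaped, "$" means "end of string", so this pattern cannot match
unescaped = re.search(r"$100", text)

# re.escape() escapes every special character in the literal
literal = "$100 (approx.)"
escaped = re.search(re.escape(literal), text)

print(unescaped)
print(escaped.group() if escaped else None)
```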
22.2.6 Flags in Regex
You can modify the behavior of regex with flags, such as:
- re.IGNORECASE or re.I: Makes the regex case-insensitive.
- re.MULTILINE or re.M: Allows ^ and $ to match the start and end of each line in a multi-line string.
- re.DOTALL or re.S: Makes . match newlines as well.
pattern = r"hello"
text = "HELLO world"
match = re.search(pattern, text, re.IGNORECASE)

if match:
    print("Case-insensitive match found!")
Case-insensitive match found!
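The other two flags can be sketched the same way. The two-line string below is invented for the example:

```python
import re

text = "first line\nsecond line"

# Without re.MULTILINE, ^ matches only at the very start of the string
default_starts = re.findall(r"^\w+", text)
line_starts = re.findall(r"^\w+", text, flags=re.MULTILINE)

# Without re.DOTALL, "." stops at the newline, so the pattern cannot span lines
default_dot = re.search(r"first.*second", text)
dotall = re.search(r"first.*second", text, flags=re.DOTALL)

print(default_starts, line_starts)
print(default_dot, dotall)
```

Flags can be combined with `|`, for example `re.MULTILINE | re.IGNORECASE`.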
22.3 Use Cases
22.3.1 No Match - Exception
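When nothing matches, re.search() returns None. Check for this with if match is None and raise an exception before calling .group().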
class Money:
    def __init__(self, dollars, cents):
        self.dollars = dollars
        self.cents = cents

    def __repr__(self):
        return f"Money({self.dollars}, {self.cents})"

import re

def money_from_string(amount):
    match = re.search(
        r'^\$(?P<dollars>\d+)\.(?P<cents>\d\d)$', amount)
    # Adding the next two lines here
    if match is None:
        raise ValueError(f"Invalid amount: {amount}")
    dollars = int(match.group('dollars'))
    cents = int(match.group('cents'))
    return Money(dollars, cents)

money_from_string("$12.34")
Money(12, 34)
try:
    money_from_string("Big")
except ValueError as e:
    print(e)
Invalid amount: Big
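For comparison, this is roughly what happens if the None check is left out: calling .group() on None raises an AttributeError whose message says nothing about the bad input. The sketch below reuses the same pattern idea with unnamed groups for brevity:

```python
import re

match = re.search(r"^\$(\d+)\.(\d\d)$", "Big")  # no match, so match is None

try:
    dollars = int(match.group(1))  # fails: None has no .group() method
except AttributeError as e:
    message = str(e)
    print(message)
```

Raising a ValueError that names the offending input, as money_from_string() does, gives callers a far clearer error.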
22.4 re vs regex
import re
import regex
import timeit
text = "hello world" * 1000000
pattern = r'world'

# re module
def re_search():
    return len(re.findall(pattern, text))

# regex module
def regex_search():
    return len(regex.findall(pattern, text))
# Benchmark
print("re module:", timeit.timeit(re_search, number=100))
print("regex module:", timeit.timeit(regex_search, number=100))
re module: 2.7286510409903713
regex module: 10.390523624955676
In this run, the standard-library re module was nearly four times faster than the third-party regex module on a plain literal search. Absolute timings depend on the machine and the library versions.
22.5 Fuzzy Matching
Let me explain fuzzy matching through a practical example that you might encounter in medical records.
Imagine you’re working with patient names in a database. With regular exact matching, searching for “Johnson” would only find “Johnson”. It wouldn’t find common variations or typos like “Johnsen” or “Jonson”. This is where fuzzy matching comes in.
Fuzzy matching is a technique that finds strings that approximately match a pattern, even when they’re not exactly the same. It measures how similar two strings are and can match them if they’re “close enough.” This is especially useful when dealing with:
- Misspellings: Like matching “penicillin” with “penicilin”
- Name variations: Like matching “Catherine” with “Katherine”
- OCR errors: When scanned text isn’t perfectly recognized
- Data entry errors: When humans make typing mistakes
Here’s a practical example using the third-party regex module (not the standard-library re module), which supports fuzzy matching:
import regex

# Regular exact matching
text = "The patient Smith was prescribed penicillin"
exact_pattern = r'penicillin'
print("Exact match:", regex.findall(exact_pattern, text))  # Finds 'penicillin'

# Fuzzy matching with maximum 2 differences allowed
fuzzy_pattern = r'(?:penicillin){e<=2}'  # e<=2 means allow up to 2 errors
texts = [
    "The patient Smith was prescribed penicilin",   # Missing 'l'
    "The patient Smith was prescribed peniciilin",  # Extra 'i'
    "The patient Smith was prescribed penicilln"    # Missing 'i'
]
for t in texts:
    matches = regex.findall(fuzzy_pattern, t)
    print(f"Fuzzy matches in '{t}': {matches}")
Exact match: ['penicillin']
Fuzzy matches in 'The patient Smith was prescribed penicilin': [' penicilin']
Fuzzy matches in 'The patient Smith was prescribed peniciilin': [' peniciilin']
Fuzzy matches in 'The patient Smith was prescribed penicilln': [' penicilln']
The magic happens in how fuzzy matching calculates the “distance” between strings. The most common method is Levenshtein distance, which counts the minimum number of single-character edits needed to change one string into another. For example:
- “penicillin” → “penicilin” (distance = 1, one deletion)
- “Smith” → “Smyth” (distance = 1, one substitution)
- “Katherine” → “Catherine” (distance = 1, one substitution)
Think of it like measuring how many steps it takes to transform one word into another, where each step can be:
- Inserting a character
- Deleting a character
- Substituting one character for another
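The three edit operations above can be turned into a small distance function. The following is a minimal pure-Python sketch of the classic dynamic-programming approach, written for illustration; the function name `levenshtein` is our own, and this is not how the regex module computes matches internally:

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character insertions,
    deletions, or substitutions needed to turn a into b."""
    # previous[j] holds the distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into "" takes i deletions
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


print(levenshtein("penicillin", "penicilin"))
print(levenshtein("Smith", "Smyth"))
print(levenshtein("Katherine", "Catherine"))
```

Each of the three pairs from the list above comes out at a distance of 1. Only two rows of the table are kept at a time, so memory use grows with the length of the second string rather than with the product of both lengths.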
This is particularly valuable in medical contexts where accuracy is crucial but variations are common. For instance, when searching through medical records, fuzzy matching could help you find relevant cases even when drug names or conditions are slightly misspelled.