import re
import re2
text = "Hello, my email is example@example.com"
pattern = r'(\w+)@(\w+)\.com'
re.compile(pattern, 0)
re2.compile(pattern, 0)<re2._Regexp at 0x105a51780>
Both Python’s built-in re module and Google’s re2 module are used for regular expression operations, but they have some important differences in their implementation and API. Let’s examine each of the functions you mentioned.
import re
import re2
text = "Hello, my email is example@example.com"
pattern = r'(\w+)@(\w+)\.com'
re.compile(pattern, 0)
re2.compile(pattern, 0)<re2._Regexp at 0x105a51780>
search()re.search():
re.search(pattern, string, flags=0)re2.search():
re2.search(pattern, string, flags=0)Return Value: Both return a match object if found, or None if no match is found.
Flags Support:
re supports all Python’s regular expression flags like re.IGNORECASE, re.MULTILINE, etc.re2 supports only a subset of flags: re2.IGNORECASE, re2.MULTILINE, re2.DOTALL, and re2.UNICODE.Backreference Support:
re fully supports backreferences in search patterns.re2 doesn’t support backreferences in the pattern itself.import re
import re2
text = "Hello, my email is example@example.com"
# Using re
re_result = re.search(r'(\w+)@(\w+)\.com', text)
if re_result:
print(f"re found: {re_result.group(0)}")
print(f"Username: {re_result.group(1)}, Domain: {re_result.group(2)}")
# Using re2
re2_result = re2.search(r'(\w+)@(\w+)\.com', text)
if re2_result:
print(f"re2 found: {re2_result.group(0)}")
print(f"Username: {re2_result.group(1)}, Domain: {re2_result.group(2)}")re found: example@example.com
Username: example, Domain: example
re2 found: example@example.com
Username: example, Domain: example
Both will output similar results for this simple case, but re2 won’t work with patterns containing backreferences like r'(\w+)@\1\.com'.
re2 is different than flag in reimport re2
text = "Hello, my email is EXAMPLE@example.com"
# RE2
# Create options object with CASE_INSENSITIVE flag
options = re2.Options()
options.case_sensitive = False
# Pass options as the flags parameter
match = re2.search(r'(\w+)@(\w+)\.com', text, options)
print(match.group(0))EXAMPLE@example.com
options.NAMES('max_mem',
'encoding',
'posix_syntax',
'longest_match',
'log_errors',
'literal',
'never_nl',
'dot_nl',
'never_capture',
'case_sensitive',
'perl_classes',
'word_boundary',
'one_line')
flags = re.IGNORECASE
if flags is re.IGNORECASE:
print("IGNORECASE")IGNORECASE
combined_flags = re.IGNORECASE | re.MULTILINE # Value: 10 (binary: 00001010)
# This returns False because combined_flags isn't exactly IGNORECASE
combined_flags == re.IGNORECASE False
if combined_flags & re.IGNORECASE:
print("IGNORECASE is set")IGNORECASE is set
findall()re.findall():
re.findall(pattern, string, flags=0)re2.findall():
re2.findall(pattern, string, flags=0)re, if there’s only one capturing group, you get a list of the contents of that group.re2, if there’s only one capturing group, you still get the full match unless you explicitly use non-capturing groups for the rest.re2.findall() is typically faster for complex patterns on large strings due to re2’s linear-time matching guarantees.import re
import re2
text = "Email me at user1@example.com or user2@test.com"
reg_cap = r'(\w+)@(\w+)\.com'
reg_nc = r'(?:\w+)@(?:\w+)\.com'
# Using re
re_result = re.findall(reg_cap, text)
print(f"re findall: {re_result}")
# Using re2
re2_result = re2.findall(reg_cap, text)
print(f"re2 findall: {re2_result}") # Same resultre findall: [('user1', 'example'), ('user2', 'test')]
re2 findall: [('user1', 'example'), ('user2', 'test')]
For patterns with one capturing group, the difference becomes apparent:
reg_cap = r'(\w+@\w+\.com)'
reg_nc = r'(?:\w+@\w+\.com)'
# With re
re_result = re.findall(reg_cap, text)
print(f"re findall: {re_result}") # Will return ['user1@example.com', 'user2@test.com']
# With re2
re2_result = re2.findall(reg_cap, text)
print(f"re2 findall: {re2_result}") # Same result, but for different reasonsre findall: ['user1@example.com', 'user2@test.com']
re2 findall: ['user1@example.com', 'user2@test.com']
.finditer()re.finditer():
re.finditer(pattern, string, flags=0)re2.finditer():
re2.finditer(pattern, string, flags=0)Return Value: Both return an iterator yielding match objects.
Match Object Methods:
re match objects have methods like .start(), .end(), .span(), .group(), .groups(), and .groupdict().re2 match objects have the same methods, but some complex operations involving lookahead/lookbehind may behave differently.Named Group Support:
(?P<name>...) syntax.re2 has limitations with some named group features that re supports.import re
import re2
text = "Contact us: user1@example.com or user2@test.com"
# Using re
for match in re.finditer(r'(\w+)@(\w+)\.com', text):
print(f"re match: {match.group(0)}, groups: {match.groups()}")
print(f" Position: {match.start()}-{match.end()}")
# Using re2
for match in re2.finditer(r'(\w+)@(\w+)\.com', text):
print(f"re2 match: {match.group(0)}, groups: {match.groups()}")
print(f" Position: {match.start()}-{match.end()}")re match: user1@example.com, groups: ('user1', 'example')
Position: 12-29
re match: user2@test.com, groups: ('user2', 'test')
Position: 33-47
re2 match: user1@example.com, groups: ('user1', 'example')
Position: 12-29
re2 match: user2@test.com, groups: ('user2', 'test')
Position: 33-47
| Feature | re module |
re2 module |
|---|---|---|
| Fundamental Implementation | Uses backtracking, which can be exponential | Uses finite automata with linear-time guarantees |
| Backreferences | Fully supported | Not supported |
| Lookahead/Lookbehind | Fully supported | Limited support (only fixed-width lookbehind) |
| Flag Support | All standard flags | Limited subset (IGNORECASE, MULTILINE, DOTALL, UNICODE) |
| Memory Usage | Can be high for complex patterns | Generally lower and more predictable |
| Performance | May have catastrophic backtracking on complex patterns | Guaranteed linear time complexity |
| Group Handling | Special handling for single capture group in findall() |
Consistent behavior |
re2:
re:
Remember that re2 is designed with safety and performance in mind, specifically to avoid the catastrophic backtracking that can happen with traditional regex engines like the one in Python’s re module. This comes at the cost of some advanced regex features that cannot be implemented efficiently while maintaining linear-time guarantees.