27  Re2 vs Re

Both Python’s built-in re module and Google’s re2 module provide regular expression operations, but they differ in implementation and, in places, in API. This section compares them function by function.

27.1 Compile

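import re
import re2

text = "Hello, my email is example@example.com"
pattern = r'(\w+)@(\w+)\.com'

# Compile the same pattern with each module; 0 requests the default behavior.
re.compile(pattern, 0)
# Only the last expression is echoed, so the output below is the compiled re2 object.
re2.compile(pattern, 0)

<re2._Regexp at 0x105a51780>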
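
Compiling pays off when the same pattern is applied to many strings. The following is a minimal sketch of that reuse with re. The re2 branch runs only if an re2 binding is importable, and it assumes the compiled re2 object exposes search() the way re's does. The sample lines and variable names are illustrative, not from the original example.

```python
import re

try:
    import re2  # assumption: an re2 binding such as google-re2 is installed
except ImportError:
    re2 = None

pattern = r'(\w+)@(\w+)\.com'
lines = [
    "Hello, my email is example@example.com",
    "No address on this line",
]

# Compile once, then reuse the compiled object for every line.
compiled = re.compile(pattern)
re_groups = []
for line in lines:
    m = compiled.search(line)
    re_groups.append(m.groups() if m else None)
print(f"re groups: {re_groups}")

# Same loop with re2, skipped when the module is not available.
if re2 is not None:
    compiled2 = re2.compile(pattern)
    re2_groups = []
    for line in lines:
        m2 = compiled2.search(line)
        re2_groups.append(m2.groups() if m2 else None)
    print(f"re2 groups: {re2_groups}")
```

With re, the first line yields ('example', 'example') and the second yields None because it contains no address.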

27.3 findall()

27.3.1 Basic API Structure

re.findall():

re.findall(pattern, string, flags=0)

re2.findall():

re2.findall(pattern, string, flags=0)

27.3.2 Key Differences

  1. Return Value Handling:
    • Both return a list of matching strings, or a list of tuples when the pattern has more than one capturing group.
    • With re, a pattern with exactly one capturing group yields a list of that group’s contents rather than the full matches; use a non-capturing group (?:...) when you want the full match.
    • The re2 results in the examples below are identical to re’s, so no difference in group handling shows up here.
  2. Performance:
    • re2 matches in time linear in the length of the input, so re2.findall() avoids the worst-case blowups that backtracking can cause in re. For simple patterns on short strings the two may perform similarly, so measure on your own data.
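
How the return shape of findall() depends on the number of capturing groups is easiest to see side by side. This sketch uses only the standard re module. The three patterns are illustrative variations on the email pattern used in this section.

```python
import re

text = "Email me at user1@example.com or user2@test.com"

# No capturing groups: each item is the full match.
no_groups = re.findall(r'\w+@\w+\.com', text)

# Exactly one capturing group: each item is that group's contents.
one_group = re.findall(r'(\w+)@\w+\.com', text)

# Two capturing groups: each item is a tuple of the groups.
two_groups = re.findall(r'(\w+)@(\w+)\.com', text)

print(no_groups)   # ['user1@example.com', 'user2@test.com']
print(one_group)   # ['user1', 'user2']
print(two_groups)  # [('user1', 'example'), ('user2', 'test')]
```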

27.3.3 Example

import re
import re2

text = "Email me at user1@example.com or user2@test.com"
reg_cap = r'(\w+)@(\w+)\.com'
reg_nc = r'(?:\w+)@(?:\w+)\.com'
# Using re
re_result = re.findall(reg_cap, text)
print(f"re findall: {re_result}") 

# Using re2
re2_result = re2.findall(reg_cap, text)
print(f"re2 findall: {re2_result}")  # Same result as re

re findall: [('user1', 'example'), ('user2', 'test')]
re2 findall: [('user1', 'example'), ('user2', 'test')]

With a single capturing group that wraps the entire pattern, the group’s contents and the full match coincide, so both modules again return the same list:

reg_cap = r'(\w+@\w+\.com)'
reg_nc = r'(?:\w+@\w+\.com)'

# With re
re_result = re.findall(reg_cap, text)
print(f"re findall: {re_result}")  # Will return ['user1@example.com', 'user2@test.com']

# With re2
re2_result = re2.findall(reg_cap, text)
print(f"re2 findall: {re2_result}")  # Same result as re

re findall: ['user1@example.com', 'user2@test.com']
re2 findall: ['user1@example.com', 'user2@test.com']

27.4 finditer()

27.4.1 Basic API Structure

re.finditer():

re.finditer(pattern, string, flags=0)

re2.finditer():

re2.finditer(pattern, string, flags=0)

27.4.2 Key Differences

  1. Return Value: Both return an iterator yielding match objects.

  2. Match Object Methods:

    • re match objects have methods like .start(), .end(), .span(), .group(), .groups(), and .groupdict().
    • re2 match objects expose the same methods. RE2 does not support lookahead or lookbehind, so patterns that rely on them cannot be compiled by the RE2 engine in the first place.
  3. Named Group Support:

    • Both support named groups with (?P<name>...) syntax.
    • re2 does not support named backreferences, (?P=name), because RE2 has no backreferences of any kind.
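
Named groups and the match-object methods listed above combine naturally with finditer(). The following is a small sketch using the standard re module. The group names user and domain are illustrative choices, not names from the original example.

```python
import re

text = "Contact us: user1@example.com or user2@test.com"

# Named groups make each match self-describing via groupdict().
named = re.compile(r'(?P<user>\w+)@(?P<domain>\w+)\.com')

records = []
spans = []
for m in named.finditer(text):
    records.append(m.groupdict())  # group name -> matched text
    spans.append(m.span())         # (start, end) of the full match

print(records)
print(spans)  # [(12, 29), (33, 47)]
```

The spans agree with the positions printed in the example below, since naming the groups does not change what is matched.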

27.4.3 Example

import re
import re2

text = "Contact us: user1@example.com or user2@test.com"

# Using re
for match in re.finditer(r'(\w+)@(\w+)\.com', text):
    print(f"re match: {match.group(0)}, groups: {match.groups()}")
    print(f"  Position: {match.start()}-{match.end()}")

# Using re2
for match in re2.finditer(r'(\w+)@(\w+)\.com', text):
    print(f"re2 match: {match.group(0)}, groups: {match.groups()}")
    print(f"  Position: {match.start()}-{match.end()}")

re match: user1@example.com, groups: ('user1', 'example')
  Position: 12-29
re match: user2@test.com, groups: ('user2', 'test')
  Position: 33-47
re2 match: user1@example.com, groups: ('user1', 'example')
  Position: 12-29
re2 match: user2@test.com, groups: ('user2', 'test')
  Position: 33-47

28 Key Differences Summary

| Feature | re module | re2 module |
|---|---|---|
| Fundamental Implementation | Backtracking, which can take exponential time in the worst case | Finite automata with linear-time guarantees |
| Backreferences | Fully supported | Not supported |
| Lookahead/Lookbehind | Supported (lookbehind must be fixed-width) | Not supported |
| Flag Support | All standard flags | Limited subset (IGNORECASE, MULTILINE, DOTALL, UNICODE) |
| Memory Usage | Can be high for complex patterns | Generally lower and more predictable |
| Performance | May backtrack catastrophically on some patterns | Time linear in the input length |
| Group Handling | findall() returns group contents when the pattern has capturing groups | Same findall() results as re in the examples above |
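
The feature rows above can be probed by simply trying to compile a pattern. The sketch below does this for a backreference and a lookahead. The helper name compiles is hypothetical. The re2 probe runs only when an re2 binding is importable, and its outcome depends on the binding: some raise an error for unsupported syntax, while others may fall back to re.

```python
import re

try:
    import re2  # assumption: an re2 binding is installed
except ImportError:
    re2 = None


def compiles(module, pattern):
    """Hypothetical helper: True if module.compile() accepts the pattern."""
    try:
        module.compile(pattern)
        return True
    except Exception:
        return False


backref = r'\b(\w+) \1\b'  # a word repeated twice, using a backreference
lookahead = r'\w+(?=@)'    # a word that is followed by '@'

re_support = {"backref": compiles(re, backref),
              "lookahead": compiles(re, lookahead)}
print(f"re:  {re_support}")

if re2 is not None:
    re2_support = {"backref": compiles(re2, backref),
                   "lookahead": compiles(re2, lookahead)}
    print(f"re2: {re2_support}")

# With re, the backreference finds the doubled word.
repeated = re.search(backref, "this is is a test").group(1)
print(repeated)  # is
```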

28.1 Practical Implications

  1. When to use re2:
    • Processing large text files
    • When you need performance guarantees (to avoid regex denial-of-service)
    • When your patterns do not need backreferences or lookaround
  2. When to use re:
    • When you need backreferences
    • When you need lookahead or lookbehind
    • For compatibility with existing Python code
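
One way to act on this guidance is to prefer re2 and fall back to re when a pattern needs features re2 lacks. The following is a sketch of that idea, not a recommended production policy. The helper compile_preferring_re2 is hypothetical, and it assumes the installed re2 binding signals unsupported syntax by raising an exception from compile().

```python
import re

try:
    import re2  # assumption: an re2 binding is installed
except ImportError:
    re2 = None


def compile_preferring_re2(pattern):
    """Hypothetical helper: use re2 when it accepts the pattern, else re."""
    if re2 is not None:
        try:
            return re2.compile(pattern)
        except Exception:
            pass  # unsupported syntax such as a backreference: fall through
    return re.compile(pattern)


# A plain pattern can be handled by either engine.
email = compile_preferring_re2(r'(\w+)@(\w+)\.com')
email_hits = email.findall("Email me at user1@example.com or user2@test.com")

# A backreference needs the re engine.
doubled = compile_preferring_re2(r'\b(\w+) \1\b')
doubled_hit = doubled.search("this is is a test").group(1)

print(email_hits)
print(doubled_hit)
```

Falling back silently gives up the linear-time guarantee for the patterns that end up on re, so for untrusted patterns or inputs it may be safer to reject them instead.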

Remember that re2 is designed with safety and performance in mind, specifically to avoid the catastrophic backtracking that can happen with traditional regex engines like the one in Python’s re module. This comes at the cost of some advanced regex features that cannot be implemented efficiently while maintaining linear-time guarantees.