Regex - What are “regular expressions”?
Regular expressions in Python are very powerful, but can be a little confusing at first. That is one reason I am writing these notes down for myself.
What are regular expressions?
You can think of regex as a special language that can be used to define text patterns. This allows you to search for patterns in texts instead of exact words, for example. Instead of searching for exact email addresses or phone numbers in texts, you can search for all phone numbers or all email addresses. This will become even clearer in an example.
They are used for tasks such as searching, replacing, or validating text.
Why are they so useful?
- Complex searches: Search for and find text that does not exactly match the search query, but corresponds to a pattern.
- Data extraction: You can extract specific information from large amounts of text (such as all phone numbers).
- Data validation: Check whether an input matches a specific format (for example, whether an email address is valid).
- Text editing: Find and replace patterns in text.
Regex in Python: re module
Python has a built-in module called re that allows you to work with regular expressions.
The most important functions are probably:
- re.search(): Searches for the first occurrence of a pattern anywhere in a string and returns a so-called match object if it is found, otherwise None.
- re.match(): Looks for a pattern only at the beginning of a string. Works similarly to re.search(), but is more limited because it never scans further into the string.
- re.findall(): Finds all occurrences of a pattern in a string and returns them as a list of strings (or a list of tuples if the pattern contains more than one group).
- re.sub(): Replaces occurrences of a pattern with another string.
- re.split(): Splits a string wherever the pattern matches.
- re.compile(): Compiles a regex pattern into a reusable pattern object, which is convenient and can save some work when the same pattern is used many times.
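To make the differences between these functions more concrete, here is a minimal sketch. The sample string and the variable names are made up for illustration and are not part of the notes above.

```python
import re

# Made-up sample text with inconsistent separators
text = "apple, banana;cherry  date"

# re.match() only succeeds if the pattern matches at position 0
at_start = re.match(r"banana", text)
print(at_start)          # None, because the string starts with "apple"

# re.search() scans the whole string for the first match
anywhere = re.search(r"banana", text)
print(anywhere.group())  # banana

# re.split() cuts the string wherever the pattern matches:
# here, any run of commas, semicolons, or whitespace
parts = re.split(r"[,;\s]+", text)
print(parts)             # ['apple', 'banana', 'cherry', 'date']

# re.compile() returns a pattern object with the same methods
word = re.compile(r"[a-z]+")
words = word.findall(text)
masked = word.sub("X", text)
print(words)             # ['apple', 'banana', 'cherry', 'date']
print(masked)            # X, X;X  X
```

The compiled pattern object offers search(), match(), findall(), sub(), and split() as methods, so the pattern only has to be written once.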
Basic regex syntax
Below, I list some of the most common elements you will find in regex patterns.
Literal characters
Most characters match themselves.
- a: Searches for the letter a
- Hello: Searches for the entire string Hello
Metacharacters - Characters that have a special meaning
- .: Matches any single character (except newline)
  - a.b matches acb, a2b, …
- *: Matches zero or more repetitions of the preceding character or group
  - a* matches the empty string, a, aa, aaa, …
  - ab*c matches ac, abc, abbbc, …
- +: Matches one or more repetitions of the preceding character or group
  - a+ matches a, aa, aaa, … but, unlike a*, not the empty string
  - ab+c matches abc, abbc, … but, unlike ab*c, not ac
- ?: Matches zero or one repetition of the preceding character or group (makes it optional)
  - colou?r matches color and colour, covering both spellings
- []: Matches a single character listed in the brackets
  - [abc]: Matches a, b, or c
  - [0-9]: Matches any digit from 0 to 9
  - [a-z]: Matches any lowercase letter
  - [A-Z]: Matches any uppercase letter
  - [a-zA-Z0-9]: Matches any alphanumeric character
  - [^abc]: Matches any character that is not a, b, or c
- \: Escapes a metacharacter so that it is treated as a literal. Also used to introduce special sequences.
  - \.: Matches an actual period (an unescaped . would match any character, as described above)
  - \$: Matches an actual dollar sign
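The metacharacters above can be tried out quickly with re.fullmatch(), which only succeeds if the whole string fits the pattern. This is a small sketch; the helper fully_matches() and the test strings are hypothetical and only serve the illustration.

```python
import re

def fully_matches(pattern, candidates):
    """Hypothetical helper: keep the candidates the pattern matches from start to end."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

dot = fully_matches(r"a.b", ["acb", "a2b", "ab"])
star = fully_matches(r"ab*c", ["ac", "abc", "abbbc"])
plus = fully_matches(r"ab+c", ["ac", "abc", "abbc"])
optional = fully_matches(r"colou?r", ["color", "colour", "colouur"])
negated = fully_matches(r"[^abc]", ["a", "d", "1"])
escaped = fully_matches(r"\$[0-9]+\.[0-9]+", ["$3.50", "$3x50"])

print(dot)       # ['acb', 'a2b']           "." needs exactly one character
print(star)      # ['ac', 'abc', 'abbbc']   "b*" allows zero b's
print(plus)      # ['abc', 'abbc']          "b+" needs at least one b
print(optional)  # ['color', 'colour']      "u?" allows at most one u
print(negated)   # ['d', '1']               anything except a, b, c
print(escaped)   # ['$3.50']                "\$" and "\." are literals
```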
Special, frequently used sequences
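- \d: Matches any digit 0 to 9 (analogous to [0-9]; for Python str patterns it also matches other Unicode digits unless the re.ASCII flag is set)
- \D: Matches any non-digit character
- \w: Matches any word character, that is, alphanumeric characters and the underscore (analogous to [a-zA-Z0-9_]; for Python str patterns it also matches non-ASCII letters such as ü unless re.ASCII is set)
- \W: Matches any non-word character
- \s: Matches any whitespace character (including spaces, tabs, newlines, etc.)
- \S: Matches any non-whitespace character
- \b: Matches a word boundary. This is the position between a word character (\w) and a non-word character (\W), or between a word character and the beginning/end of a string.
  - \bHund\b: Matches Hund as in Der Hund frisst, but not hundemüde.
- \B: Matches a non-word boundary. It is the opposite of \b and matches at every position where \b would not match.
  - \BHund\B: Matches Hund only when it has word characters directly on both sides. It matches neither the standalone word Hund nor the Hund in Hundemüde, because there Hund starts at a word boundary.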
- ^: Matches the beginning of a string
  - ^Hallo: Matches Hallo Welt, but not Oh, Hallo Welt
- $: Matches the end of a string
  - World$: Matches Hello World, but not The world is beautiful
- |: Logical OR
  - cat|dog: Matches cat or dog
- (): Grouping patterns. Allows operations to be applied to a group, or parts of the match to be extracted.
  - (ab)+: Matches ab, abab, ababab, …
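The sequences in this section can be checked in a few lines. The following is a rough sketch that reuses the example words from the list; the full sentences and the variable names are assumptions made for the illustration.

```python
import re

# Made-up sentence containing both "Hund" and "hundemüde"
sentence = "Der Hund frisst, dann ist er hundemüde."

# \b...\b only matches the standalone, capitalized word
whole_word = re.findall(r"\bHund\b", sentence)
# Without boundaries and ignoring case, both occurrences are found
any_case = re.findall(r"hund", sentence, flags=re.IGNORECASE)
print(whole_word)  # ['Hund']
print(any_case)    # ['Hund', 'hund']

# ^ and $ tie the pattern to the start or end of the string
starts = bool(re.search(r"^Hallo", "Hallo Welt"))
not_at_start = bool(re.search(r"^Hallo", "Oh, Hallo Welt"))
ends = bool(re.search(r"World$", "Hello World"))
print(starts, not_at_start, ends)  # True False True

# | matches either alternative
pets = re.findall(r"cat|dog", "a cat, a dog, a bird")
print(pets)        # ['cat', 'dog']

# () lets a quantifier apply to a whole group
repeated = re.fullmatch(r"(ab)+", "ababab") is not None
print(repeated)    # True

# \d+ collects runs of digits
digits = re.findall(r"\d+", "Room 12, floor 3")
print(digits)      # ['12', '3']
```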
Examples
I think an example will make it clearer.
import re

text = "Hello world, I am a developer. My email is test@example.com and my phone number is 12-345-6789."

# Example 1: re.search() - Find the first occurrence
# Matching is case-sensitive by default, so the pattern must be lowercase here
match = re.search(r"developer", text)
if match:
    # match.group() returns the string found
    # match.start() returns the start index of the match
    # match.end() returns the end index of the match (exclusive)
    print(f"Found: '{match.group()}' at position {match.start()} to {match.end()}")
else:
    print("Not found.")

# Example 2: re.findall() - Find all occurrences of a pattern
numbers = re.findall(r"\d+", text)  # \d+ matches one or more digits
print(f"All numbers: {numbers}")

# Example 3: re.sub() - Replace patterns
new_text = re.sub(r"developer", "programmer", text)
print(f"Text after replacement: {new_text}")

# Example 4: Find email address (more complex pattern)
# r"..." is a "raw string" so that backslashes do not have to be double-escaped
# Inside [...] a | is a literal pipe character, so the class is [A-Za-z], not [A-Z|a-z]
email_pattern = r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"
email_match = re.search(email_pattern, text)
if email_match:
    print(f"Email found: {email_match.group()}")

# Example 5: Find phone number (with groups)
# (\d{2}) - Group 1: 2 digits
# -?      - Optional hyphen
# (\d{3}) - Group 2: 3 digits
# -?      - Optional hyphen
# (\d{4}) - Group 3: 4 digits
phone_pattern = r"(\d{2})-?(\d{3})-?(\d{4})"
phone_match = re.search(phone_pattern, text)
if phone_match:
    print(f"Phone number found: {phone_match.group()}")
    print(f"Area code: {phone_match.group(1)}")
    print(f"Middle part: {phone_match.group(2)}")
    print(f"End part: {phone_match.group(3)}")
The output:
Found: 'developer' at position 20 to 29
All numbers: ['12', '345', '6789']
Text after replacement: Hello world, I am a programmer. My email is test@example.com and my phone number is 12-345-6789.
Email found: test@example.com
Phone number found: 12-345-6789
Area code: 12
Middle part: 345
End part: 6789
Summary
Regular expressions are a powerful tool for searching, extracting, validating, and manipulating text in Python. By mastering the basic syntax and understanding how to use the re module, you can efficiently handle a wide variety of text processing tasks. While regex can seem complex at first, practice and experimentation will make it an invaluable part of your coding toolkit.
Key points
- Regex allows flexible pattern matching beyond exact text.
- The re module provides essential functions for working with regex in Python.
- Understanding metacharacters and special sequences is crucial for building effective patterns.
- Practical examples help clarify how regex works in real-world scenarios.