Python Regular Expressions - Mastering RegEx

Introduction to Regular Expressions

Regular expressions (regex, for short) are a language for describing patterns. You write a pattern, and it is applied to strings. It is an efficient way to find patterns and extract information from text. Within a pattern, each symbol represents a type of information. In this article, we will see the main symbols and their meanings with the help of several examples.

Python has the re module for regular expressions. The documentation is very detailed, and the official Regular Expression HOWTO offers a more step-by-step introduction to the module.

The module has three functions that are commonly used and will be our focus here: match, search, and findall.

Match

The match function looks for the pattern at the beginning of the string. Its signature is re.match(pattern, string, flags=0). It returns None when it does not find the pattern. When it finds one, it returns a Match object, which we will detail later in the article.

Let’s start by searching for the pattern “Fran” in the string “Francisco Bustamante”, author of this site:

>>> import re

>>> re.match('Fran', 'Francisco Bustamante')
<re.Match object; span=(0, 4), match='Fran'>

>>> re.match('Fran', 'Bustamante, Francisco')

# notice that the latter returns None as it does not find it at the beginning of the string

The span=(0, 4) means that the pattern was found between indices 0 and 4. Remember that counting in Python sequences starts at zero and that the last index is exclusive.
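
As a small sketch of what the span means in practice, the two indices can be used to slice the original string. The variable names below are only illustrative, and the span method used here is described later in the article.

```python
import re

text = 'Francisco Bustamante'
m = re.match('Fran', text)

# span() gives (start, end); the end index is exclusive, as in slicing
start, end = m.span()
matched = text[start:end]
print(matched)  # Fran
```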

Search

The search function looks for the pattern anywhere in the string. Its signature is re.search(pattern, string, flags=0). It returns None when it does not find the pattern. When it finds one, it returns a Match object, which we will discuss throughout the article. It is important to note that it returns only the first location where the pattern was found.

>>> re.search('an', 'Francisco Bustamante')
<re.Match object; span=(2, 4), match='an'>

The pattern an appears in the string more than once. However, search reports in span only the indices of the first occurrence.
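
If the position of every occurrence is needed, one option that this article does not cover is re.finditer, which yields a Match object for each occurrence. A minimal sketch, assuming the same string as above:

```python
import re

text = 'Francisco Bustamante'

first = re.search('an', text)
print(first.span())  # (2, 4)

# finditer yields one Match object per non-overlapping occurrence
all_spans = [m.span() for m in re.finditer('an', text)]
print(all_spans)  # [(2, 4), (16, 18)]
```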

Findall

The findall function looks for the pattern throughout the string. Its signature is re.findall(pattern, string, flags=0). It returns an empty list when it does not find the pattern. When it finds matches, it returns a list with each occurrence. It is essential to note that it returns all occurrences found.

>>> re.findall('an', 'Francisco Bustamante')
['an', 'an']

>>> re.findall('an', 'Chico')
[]

The power of findall will be better explored with some tools that we will learn throughout the article.

Flags

The signature of the functions presented has an argument flags. With flags, we can modify some aspects of how regular expressions work. See the difference in behavior in the following two examples:

>>> re.match('fran', 'Francisco Bustamante')

>>> re.match('fran', 'Francisco Bustamante', re.IGNORECASE)
<re.Match object; span=(0, 4), match='Fran'>

In the first case, the return was None, so the interpreter displayed nothing. This is because the pattern was passed with a lowercase initial letter while, in the string, it is uppercase. The re.IGNORECASE flag, as the name suggests, disregards the difference between uppercase and lowercase.

Flag	Meaning
ASCII, A	Makes the special sequences `\w`, `\b`, `\s` and `\d` match only ASCII characters
DOTALL, S	Allows the metacharacter `.` to match any character, including line breaks
IGNORECASE, I	Performs matching without differentiating between uppercase and lowercase
LOCALE, L	Makes matching depend on the current locale
MULTILINE, M	Multiline matching, affecting `^` and `$`
VERBOSE, X (from ‘extended’)	Enables verbose regular expressions, which can be laid out more clearly and commented

The letters after each name are the abbreviations that can be used instead of the full name.
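
The sketch below illustrates the abbreviations and also shows that several flags can be combined with the bitwise OR operator |. The combination with re.ASCII is only there to show the syntax; it does not change the result for this string.

```python
import re

# re.I is the abbreviation of re.IGNORECASE
m_full = re.match('fran', 'Francisco Bustamante', re.IGNORECASE)
m_short = re.match('fran', 'Francisco Bustamante', re.I)
print(m_full.span() == m_short.span())  # True

# Several flags can be combined with the bitwise OR operator |
m_both = re.match('fran', 'Francisco Bustamante', re.I | re.ASCII)
print(m_both.span())  # (0, 4)
```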

Some new words appeared in this table. Don’t worry, they will be explained in due time. One of the words is metacharacter, our next topic.

The power of regular expressions comes from metacharacters: characters that stand for a whole set of characters or for a general pattern, instead of for themselves.

The dot metacharacter (`.`)

The metacharacter . represents any character except a line break. We will use it with the match function seen previously:

>>> re.match('.', 'Francisco Bustamante') 
<re.Match object; span=(0, 1), match='F'>

>>> re.match('.', '42')
<re.Match object; span=(0, 1), match='4'>

>>> re.match('.', ' Francisco Bustamante')
<re.Match object; span=(0, 1), match=' '>

Notice in the last example that there was a space at the beginning of the string and this space was recognized by the metacharacter.

According to the definition, the metacharacter . should consider control characters except \n, which indicates a line break. Let’s see:

>>> re.match('.', '\t\t')  # \t represents TAB
<re.Match object; span=(0, 1), match='\t'>

>>> re.match('.', '\n')

>>> print(re.match('.', '\n'))
None

This behavior of ignoring \n can be modified by one of the flags we saw earlier, the DOTALL:

>>> re.match('.', '\n', re.DOTALL)
<re.Match object; span=(0, 1), match='\n'>

We saw earlier that the search searches for the pattern throughout the string and returns the first position where it finds it. Let’s see the behavior with .:

>>> re.search('.', ' Francisco Bustamante')
<re.Match object; span=(0, 1), match=' '>

>>> re.search('.', 'Francisco Bustamante')
<re.Match object; span=(0, 1), match='F'>

>>> re.search('.', '\nFrancisco Bustamante')
<re.Match object; span=(1, 2), match='F'>

>>> re.search('.', '\nFrancisco Bustamante', re.DOTALL)
<re.Match object; span=(0, 1), match='\n'>

The first two examples return the first position, as expected. In the third example, the control character \n is skipped because . does not match it, so the result is the first character right after it, the letter F. This behavior is modified by the DOTALL flag in the last example.

We saw earlier that the findall searches for the pattern provided throughout the string, returning a list with all occurrences. Let’s combine it with .:

>>> re.findall('.', 'Chemistry\nProgramming')
['C', 'h', 'e', 'm', 'i', 's', 't', 'r', 'y', 'P', 'r', 'o', 'g', 'r', 'a', 'm', 'm', 'i', 'n', 'g']

The control character \n was not matched, and all other characters were returned in the form of a list.

Anchors

The symbol ^ represents the beginning of the string, and the symbol $ represents the end of the string. Let’s test it with the findall function:

>>> re.findall('^.', 'Chemistry\nProgramming\nPython')
['C']

>>> re.findall('^.', 'Chemistry\nProgramming\nPython', re.MULTILINE)
['C', 'P', 'P']

The pattern ^. means searching for any character that is not a line break at the beginning of the string. Thus, in the first case, it returns the letter C. But notice that the passed string has line breaks. Therefore, we can pass the re.MULTILINE flag so that the pattern is searched for in each line. That is why the second example returns a list with the first letter of the string and also the first letter after the control character \n.

We can apply the same logic to the pattern .$, which will search at the end of the string:

>>> re.findall('.$', 'Chemistry\nProgramming\nPython')
['n']

>>> re.findall('.$', 'Chemistry\nProgramming\nPython', re.MULTILINE)
['y', 'g', 'n']

Some edge cases happen when we have a string with only one character, empty, or with a line break:

>>> re.match('^.$', 'a') 
<re.Match object; span=(0, 1), match='a'>

>>> re.match('^$', '')  # the beginning is equal to the end, empty string
<re.Match object; span=(0, 0), match=''>

>>> re.findall('^$', '\n', re.MULTILINE)
['', '']
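
One more edge case, sketched below with a hypothetical string: even without any flag, $ also matches just before a line break that ends the string. The \Z sequence, which this article does not cover, matches only at the very end. The r prefix in the patterns marks a raw string, explained in the Raw strings section.

```python
import re

text = 'Python\n'  # note the trailing line break

# $ matches at the end of the string or just before a final line break
with_dollar = re.findall(r'.$', text)
print(with_dollar)  # ['n']

# \Z matches only at the very end, and . cannot match the line break before it
with_z = re.findall(r'.\Z', text)
print(with_z)  # []
```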

The metacharacter . is very comprehensive; usually we want to be a bit more specific.

Character sets

Brackets in a pattern declare a set of characters. Any single character listed between the brackets counts as a match. Let’s look for lowercase vowels in the string Chemistry Programming:

>>> re.findall('[aeiou]', 'Chemistry Programming')                                 
['e', 'i', 'o', 'a', 'i']

The symbol ^ when inside a character set means negation. So, if we are looking for everything but lowercase vowels:

>>> re.findall('[^aeiou]', 'Chemistry Programming')
['C', 'h', 'm', 's', 't', 'r', 'y', ' ', 'P', 'r', 'g', 'r', 'm', 'm', 'n', 'g']

It is also possible to define ranges of characters. Searching from “a” to “f”:

>>> re.findall('[a-f]', 'Chemistry Programming') 
['e', 'a']

We can define more than one range. Searching from “a” to “f” and from “A” to “Z”:

>>> re.findall('[a-fA-Z]', 'Chemistry Programming') 
['C', 'e', 'P', 'a']

A very common pattern is to search for all letters and digits and “_”, since they tend to be the most accepted characters in online form fields like email, for example:

>>> re.findall('[a-zA-Z0-9_]', 'Chemistry Programming')
['C', 'h', 'e', 'm', 'i', 's', 't', 'r', 'y', 'P', 'r', 'o', 'g', 'r', 'a', 'm', 'm', 'i', 'n', 'g']

This set is so common that there is a shortcut for it, the special sequence \w:

>>> re.findall(r'\w', 'Chemistry Programming')  # the r prefix marks a raw string, explained in the Raw strings section
['C', 'h', 'e', 'm', 'i', 's', 't', 'r', 'y', 'P', 'r', 'o', 'g', 'r', 'a', 'm', 'm', 'i', 'n', 'g']

The result is the same as in the previous example, but no brackets are needed to use \w. There is one difference, though: if non-ASCII letters or digits were present in the string, \w would recognize them too, because by default it matches Unicode word characters. If you really want only the equivalent of the set [a-zA-Z0-9_], use the re.ASCII flag:

>>> re.findall(r'\w', 'Chemistry Programming', re.ASCII)
['C', 'h', 'e', 'm', 'i', 's', 't', 'r', 'y', 'P', 'r', 'o', 'g', 'r', 'a', 'm', 'm', 'i', 'n', 'g']

The main special sequences defined by default are:

\d equivalent to any unicode digit, which includes [0-9]
\D equivalent to the negation of \d
\s equivalent to unicode whitespace characters, which includes [ \t\n\r\f\v]
\S equivalent to the negation of \s
\w equivalent to unicode word characters (letters, digits and the underscore), which includes [a-zA-Z0-9_]
\W equivalent to the negation of \w

When using the re.ASCII flag, each previous case is restricted to the bracket representation presented. In Unicode, they are more comprehensive.
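
The difference only shows up with non-ASCII text. The sketch below uses two hypothetical strings, written with escape codes so that the exact characters are unambiguous:

```python
import re

word = 'a\u00e7\u00e3o'  # the Portuguese word "ação"
digits = '4\u0663'       # ASCII digit 4 and ARABIC-INDIC DIGIT THREE

unicode_letters = re.findall(r'\w', word)
ascii_letters = re.findall(r'\w', word, re.ASCII)
print(unicode_letters)  # ['a', 'ç', 'ã', 'o']
print(ascii_letters)    # ['a', 'o']

unicode_digits = re.findall(r'\d', digits)
ascii_digits = re.findall(r'\d', digits, re.ASCII)
print(len(unicode_digits))  # 2
print(ascii_digits)         # ['4']
```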

Raw strings

All special sequences use the symbol \. This is problematic in Python, because the backslash is also what introduces control characters in ordinary string literals. In a string literal, Python evaluates the character after the backslash to check whether the pair forms a control character. Let’s see an example:

>>> print('1\n2')
1
2

When we want to indicate that it is not to be considered as a control character, we use another backslash to signal escape:

>>> print('1\\n2')  # the first backslash escapes the second
1\n2

This behavior related to backslashes can be very problematic in some contexts. For example, in scientific fields, LaTeX is widely used for document production. And LaTeX environments are delimited with commands that use \. For example, \begin{equation}...\end{equation} delimits an environment for a mathematical equation. The \section command indicates the beginning of a section in a document. Let’s see how the Python interpreter recognizes the command followed by a line break:

>>> '\section\n' # \s has no meaning in Python
'\\section\n'
>>> text = '\section\n'  # LaTeX
>>> print(text)
\section

Notice that \s is not a valid escape sequence for the Python interpreter, so the backslash is kept as a literal character, and the string representation shows it doubled (\\). Recent Python versions also emit a warning for invalid escape sequences like this one. When using print, the command text appears normally, followed by a line break.

We can verify that control characters are considered as a single character by checking the length of the string stored in the variable text:

>>> len(text)
9

We have 8 characters in “\section” and the ninth is the control character \n.

Theoretically, there should be a match when searching for \\section in the string stored in text:

>>> print(re.match('\\section', text))  # \s is a special sequence in regex
None

But Python first turns '\\section' into \section and, as we saw in the previous section, \s has a meaning of its own within the re module: whitespace. Thus, the backslash must be escaped once for Python and once more for the regular expression:

>>> print(re.match('\\\\section', text))
<re.Match object; span=(0, 8), match='\\section'>

Welcome to what we call backslash hell! But don’t worry, there’s a way to avoid this bunch of backslashes. Python has what we call raw strings. In this type of string, denoted by an r before the quotes, backslashes are not treated in any special way. Note the difference:

>>> len('\n')
1

>>> len(r'\n')
2

In the raw string, we have two characters, the “\” and the “n”, while in the normal (literal) string, there is only one character representing a line break.

Returning to our example, just use a raw string in the pattern:

>>> print(re.match(r'\\section', text))  # raw string: Python leaves the backslashes untouched
<re.Match object; span=(0, 8), match='\\section'>

Thus, a vital tip: write regular expression patterns as raw strings, so that the backslashes reach the re module exactly as typed.
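
When the text to be found should be matched literally, a related helper is re.escape, which this article does not cover. A minimal sketch, reusing the LaTeX example (the variable names are illustrative):

```python
import re

text = '\\section\n'   # same content as in the example above
literal = '\\section'  # the text we want to find, exactly as it is

# re.escape adds the backslashes needed for a literal match
pattern = re.escape(literal)
print(pattern)  # \\section

m = re.match(pattern, text)
print(m.span())  # (0, 8)
```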

Use of pipe

The pipe | means “or” and indicates alternatives in a regular expression. See some examples:

>>> re.search('a|b', 'abc')
<re.Match object; span=(0, 1), match='a'>

>>> re.search('a|b', 'bcd')
<re.Match object; span=(0, 1), match='b'>

>>> re.search('a|b', 'cde')

Observe in the first example that, even though there is “b” in the string, as “a” was found first, only the position of “a” was returned. With findall, both are returned:

>>> re.findall('b|a', 'abc')
['a', 'b']
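
The order of the alternatives matters when one is a prefix of another. At a given position, the alternatives are tried from left to right and the first one that succeeds is used. A small sketch with hypothetical words (the group method is described later in the article):

```python
import re

# The leftmost alternative that matches wins, even if a longer one exists
short_first = re.search('Java|JavaScript', 'JavaScript')
print(short_first.group())  # Java

long_first = re.search('JavaScript|Java', 'JavaScript')
print(long_first.group())  # JavaScript
```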

Repetitions

Searching for repetitions is one of the most common reasons for using regular expressions. Let’s check the various possible ways.

Specific amounts

Let’s check the behavior of the pattern \d{4}, which looks for four consecutive digits.

>>> re.match(r'\d{4}', '1234')                 # string with 4 digits
<re.Match object; span=(0, 4), match='1234'>

>>> re.match(r'\d{4}', '123')                  # string with 3 digits

>>> re.match(r'\d{4}', '12345')                # string with 5 digits
<re.Match object; span=(0, 4), match='1234'>

In the examples, it is clear that when there are fewer than four digits, the return is None. In other cases, it returns the first four digits. The same happens with the use of the search function:

>>> re.search(r'\d{4}', 'abc123def12345')
<re.Match object; span=(9, 13), match='1234'>

The first segment 123 was ignored for having fewer than 4 digits. The second segment, which has 5 digits, was considered and returned the first four digits.
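
If the goal is to accept a string only when it consists of exactly four digits, one option outside the three functions covered here is re.fullmatch, which requires the whole string to fit the pattern. A minimal sketch:

```python
import re

# fullmatch succeeds only if the whole string fits the pattern
exact = re.fullmatch(r'\d{4}', '1234')
too_long = re.fullmatch(r'\d{4}', '12345')
print(exact)     # a Match object
print(too_long)  # None
```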

Minimum and maximum quantity

A comma after the number indicates that it is the minimum number of repetitions. If there are more digits, the behavior is greedy: the match extends to as many digits as possible beyond the minimum. If there are fewer digits than the minimum, the return will be None.

>>> re.match(r'\d{2,}', '12')
<re.Match object; span=(0, 2), match='12'>

>>> re.match(r'\d{2,}', '12345')  # greedy: takes as many digits as possible
<re.Match object; span=(0, 5), match='12345'>

>>> re.match(r'\d{2,}', '1')

Using the ? symbol after the braces makes the behavior lazy, so this symbol is a repetition modifier.

>>> re.match(r'\d{2,}?', '12345')  # lazy, minimum possible
<re.Match object; span=(0, 2), match='12'>

A value after the comma indicates the maximum value. The other previous explanations remain valid.

>>> re.match(r'\d{2,4}', '12345')
<re.Match object; span=(0, 4), match='1234'>

>>> re.match(r'\d{2,4}', '123')
<re.Match object; span=(0, 3), match='123'>

>>> re.match(r'\d{2,4}', '12')
<re.Match object; span=(0, 2), match='12'>

>>> re.match(r'\d{2,4}', '1')

>>> re.match(r'\d{2,4}?', '12345')  # greedy to lazy
<re.Match object; span=(0, 2), match='12'>

0 or 1 occurrence, optional element

Searching for an optional element in the string is actually a special case of minimum and maximum, with a minimum of zero and a maximum of one.

>>> re.match(r'\d{0,1}', '12345')
<re.Match object; span=(0, 1), match='1'>

>>> re.match(r'\d{,1}', '12345')  # 0 can be omitted
<re.Match object; span=(0, 1), match='1'>

The ? symbol after a regular expression has the same effect as {,1}. Therefore:

>>> re.match(r'\d?', '12345') 
<re.Match object; span=(0, 1), match='1'>

But we have already seen that the same symbol turns the search from greedy to lazy. Thus, the following expression returns an empty string, as the minimum is zero occurrences:

>>> re.match(r'\d??', '12345') 
<re.Match object; span=(0, 0), match=''>

Let’s take it step by step. The first question mark is a repetition operator applied to the regular expression immediately before it, looking for 0 or 1 occurrence of \d. The second question mark is a modifier of that repetition operator, making it lazy.
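
A typical use of the optional element is to accept two spellings of the same word. The sketch below uses a hypothetical sample text:

```python
import re

text = 'color and colour'  # hypothetical sample text

# u? makes the letter u optional
spellings = re.findall(r'colou?r', text)
print(spellings)  # ['color', 'colour']
```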

0 or more times

Another special case of minimum and maximum. It also has a special symbol, the *. Observe the examples:

>>> re.match(r'\d{0,}', '12345')
<re.Match object; span=(0, 5), match='12345'>

>>> re.match(r'\d{,}', '12345')  # 0 can be omitted
<re.Match object; span=(0, 5), match='12345'>

>>> re.match(r'\d*', '12345')  # symbol
<re.Match object; span=(0, 5), match='12345'>

>>> re.match(r'\d*?', '12345')  # lazy
<re.Match object; span=(0, 0), match=''>

>>> re.match(r'\d*', 'abc')
<re.Match object; span=(0, 0), match=''>

In the last example, it returns an empty string in the match, as the minimum is zero occurrences.

1 or more times

Another special case of minimum and maximum. It also has a special symbol, the +. Observe the examples:

>>> re.match(r'\d{1,}', '12345')
<re.Match object; span=(0, 5), match='12345'>

>>> re.match(r'\d+', '12345')  # + requires at least one occurrence, being greedy
<re.Match object; span=(0, 5), match='12345'>

>>> re.match(r'\d+?', '12345')  # turns into lazy
<re.Match object; span=(0, 1), match='1'>

>>> re.match(r'\d+', 'abc')

In the last example, it returns None, as the minimum is one occurrence, and there are no digits in the string.
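
Combined with findall, the + operator is a simple way to extract every number from a text. A minimal sketch, reusing a string from a previous example:

```python
import re

text = 'abc123def12345'

numbers = re.findall(r'\d+', text)
print(numbers)  # ['123', '12345']

# findall returns strings; convert them when numeric values are needed
values = [int(n) for n in numbers]
print(values)  # [123, 12345]
```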

Understanding the importance of repetition control

If everything has seemed very abstract so far, let’s start to put some situations closer to reality. Consider the string below from which we would like to extract all the attribute values.

>>> text = 'name="Francisco" site="Chemistry Programming"'

Since the values are between double quotes, we might initially imagine the pattern r'".+"', since . would catch the characters with the + modifier indicating one or more times. But see the result:

>>> re.findall(r'".+"', text)
['"Francisco" site="Chemistry Programming"']

It didn’t work because .+ is greedy: it takes as many characters as possible, including the quotes in the middle. The match extends from the first double quote to the last. In reality, we want what is between each pair of quotes. Therefore, we should make the search lazy:

>>> re.findall(r'".+?"', text)
['"Francisco"', '"Chemistry Programming"']

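Now we have what we wanted. But it’s still not a good way to get the values. Notice what would happen if empty fields were passed:

>>> text = 'name="" site=""'

>>> re.findall(r'".+?"', text)
['"" site="']

The result is wrong because the + requires at least one character between the quotes. The closing quote of the first empty value is consumed as that character, and the match only ends at the next double quote, which is the opening quote of the second value. What we really want is 0 or more occurrences, which we can get with *: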

>>> re.findall(r'".*?"', text)
['""', '""']
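
Another common way to get the same result, shown here as an alternative sketch and not as a replacement for the lazy version, is a negated character set. Instead of "any character, as few as possible", the pattern says "any character that is not a double quote":

```python
import re

filled = 'name="Francisco" site="Chemistry Programming"'
empty = 'name="" site=""'

# [^"]* matches zero or more characters that are not a double quote
pattern = r'"[^"]*"'

filled_values = re.findall(pattern, filled)
empty_values = re.findall(pattern, empty)
print(filled_values)  # ['"Francisco"', '"Chemistry Programming"']
print(empty_values)   # ['""', '""']
```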

Understanding the Match Object

In several examples, we saw that the result is a Match object. Let’s understand this object a little better. Four of its methods are enough for the level of this article:

>>> m = re.match(r'\d+', '12345')

>>> type(m)
<class 're.Match'>

>>> m.group()  # returns the substring that was matched
'12345'
'12345'

>>> m.start()  # initial position of the match
0

>>> m.end()    # final position of the match
5

>>> m.span()   # tuple with the initial and final position of the match
(0, 5)

The group method will be important for the next section.
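
Since match and search return None when nothing is found, calling a method directly on the result can fail. The sketch below checks for None first; the function name first_number is hypothetical.

```python
import re

def first_number(text):
    """Return the first run of digits in text, or None if there is none."""
    m = re.search(r'\d+', text)
    if m is None:  # no match: calling m.group() would raise AttributeError
        return None
    return m.group()

found = first_number('abc123def')
missing = first_number('abc')
print(found)    # 123
print(missing)  # None
```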

Capture Groups

Often the string being analyzed has several fields from which we want to extract data. We can then define capture groups. Groups are delimited by parentheses, and we can repeat the content of a group with a repetition quantifier, such as *, +, ?, or {m,n}. For example, (ab)* will correspond to zero or more repetitions of ab.

Consider the following HTML tag, from which we want to extract the tag name (input) and the values of type, id, and name. We can store the pattern in a variable, following the format of the string. In each group, we intend to capture one or more characters in a lazy way, so as not to have the problems seen in the repetition control section:

>>> html = '<input type="text" id="id_cpf" name="cpf">'

>>> pattern = r'<(.+?) type="(.+?)" id="(.+?)" name="(.+?)"'
# any character before a space for the tag name, same between quotes for the other groups

>>> m = re.match(pattern, html)  # storing the match result in a variable

>>> m
<re.Match object; span=(0, 41), match='<input type="text" id="id_cpf" name="cpf"'>

The variable m stores the result, a Match object. By default, it shows the complete match, but we can explore more details of the object. The groups method returns all captured groups, and the group method returns specific groups:

>>> m.groups()  # capture groups
('input', 'text', 'id_cpf', 'cpf')

>>> m.group(0)  # entire match
'<input type="text" id="id_cpf" name="cpf"'

>>> m.group(1)  # first group
'input'

>>> m.group(2, 1, 3)  # specific groups
('text', 'input', 'id_cpf')

Suppose now that a change in the order of the attributes of the HTML tag may occur:

>>> html1 = '<input type="text" id="id_cpf" name="cpf">'

>>> html2 = '<input id="id_cpf" name="cpf" type="text">'

Clearly, our previous pattern will not work, as it depends strongly on the order of the attributes. We have already seen the metacharacter |, which indicates an alternative. It will be useful here because we want to capture each attribute regardless of the order.

Another concept we will use is that of a non-capturing group, represented by (?:...), where the ellipsis stands for a regular expression. To understand it, consider the following examples:

>>> m = re.match('name="(.+?)"', 'name="Francisco"')

>>> m.groups()
('Francisco',)

>>> m = re.match('name="(?:.+?)"', 'name="Francisco"')

>>> m.groups()
()

In the first case, we used the pattern in the group to obtain the value “Francisco” and in the second, we excluded that group.

We can combine alternatives and non-capturing groups to obtain each attribute from the HTML tag regardless of the order:

>>> pattern = r'<(.+?) (?:(?:type="(.+?)"|id="(.+?)"|name="(.+?)") ?)*'
# the inner group lists the alternatives; the outer group adds an optional trailing space and is repeated

>>> m = re.match(pattern, html1)

>>> m
<re.Match object; span=(0, 41), match='<input type="text" id="id_cpf" name="cpf"'>

>>> m.groups()
('input', 'text', 'id_cpf', 'cpf')

>>> m = re.match(pattern, html2)

>>> m
<re.Match object; span=(0, 41), match='<input id="id_cpf" name="cpf" type="text"'>

>>> m.groups()
('input', 'text', 'id_cpf', 'cpf')
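
As an alternative sketch, and not the approach used in this article, findall can collect every attribute in one pass. When the pattern has more than one group, findall returns a list of tuples, one tuple per match. The name attr_pattern is illustrative.

```python
import re

html1 = '<input type="text" id="id_cpf" name="cpf">'
html2 = '<input id="id_cpf" name="cpf" type="text">'

# Two groups: the attribute name and the value between double quotes
attr_pattern = r'(\w+)="(.*?)"'

pairs = re.findall(attr_pattern, html1)
print(pairs)  # [('type', 'text'), ('id', 'id_cpf'), ('name', 'cpf')]

# Converting to dictionaries makes the attribute order irrelevant
attrs1 = dict(re.findall(attr_pattern, html1))
attrs2 = dict(re.findall(attr_pattern, html2))
print(attrs1 == attrs2)  # True
```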

Named Groups

Finally, when there are several groups, it is useful to name them. Python has a specific syntax to give a name to a group: (?P<name>...).

>>> pattern = r'<(?P<tag>.+?) (?:(?:type="(?P<type>.+?)"|id="(?P<id>.+?)"|name="(?P<name>.+?)") ?)*'

>>> m = re.match(pattern, html1)

>>> m.groups()
('input', 'text', 'id_cpf', 'cpf')

When using named groups, there is a very useful method, the groupdict, which returns the name of the group and the value in the form of a dictionary:

>>> m.groupdict() 
{'tag': 'input', 'type': 'text', 'id': 'id_cpf', 'name': 'cpf'}

>>> m = re.match(pattern, html2)

>>> m.groups()
('input', 'text', 'id_cpf', 'cpf')

>>> m.groupdict()
{'tag': 'input', 'type': 'text', 'id': 'id_cpf', 'name': 'cpf'}
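
Named groups can also be read one at a time, by name. The sketch below uses a hypothetical date string and additionally illustrates the re.VERBOSE flag from the flags table, which allows spreading the pattern over several lines with comments. The re.compile function, not covered in this article, builds a reusable pattern object.

```python
import re

# With re.VERBOSE, whitespace in the pattern is ignored and # starts a comment
date_pattern = re.compile(r"""
    (?P<year>\d{4})   # four digits for the year
    -                 # literal separator
    (?P<month>\d{2})  # two digits for the month
    -
    (?P<day>\d{2})    # two digits for the day
""", re.VERBOSE)

m = date_pattern.search('Published on 2024-03-15.')  # hypothetical text
print(m.group('year'))  # 2024
print(m['month'])       # 03
print(m.groupdict())    # {'year': '2024', 'month': '03', 'day': '15'}
```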

Conclusion and an IMPORTANT note

The important observation is: just because something can be solved with regular expressions doesn’t mean it should be solved with regular expressions.

Specifically in Python, simple string manipulations involving substitutions are often better handled by string methods like replace. The official documentation makes this clear.

Similarly, although I used the case of the HTML tag as an example, there are various online discussions about more efficient ways to extract information from HTML and why regex is not usually the best option. Be prepared for heated discussions. And exercise good sense to know when it is acceptable to use regex and when another tool should be used.

Regular expressions are useful, but they can be very difficult to read and debug. Try to make them as short and specific as possible. The subject is much more extensive than what is presented here. I tried to cover what I consider a good start and, obviously, what is within my knowledge. It is certainly a very valuable tool to know.

Enjoy and read other articles on the site about Python.

See you next time!

About The Author

Francisco Bustamante
