Introduction #
Regular expressions (regex, for short) are a small language for describing patterns in text. You write a pattern and apply it to strings, which makes it an efficient way to find matches and extract information from texts. Within a pattern, each symbol has a specific meaning. In this article, we will see the main symbols and their meanings with the help of several examples.
Python has the re module for regular expressions. The
documentation is very detailed, and
there is also this text with a more step-by-step introduction to the
module.
The module has three functions that are commonly used and will be our focus
here: match, search, and findall.
Match #
The match function looks for the pattern at the beginning of the string. Its
signature is re.match(pattern, string, flags=0). It returns None when it
does not find the pattern. When it finds one, it returns a Match object, which
we will detail later in the article.
Let’s start by searching for the pattern “Fran” in the string “Francisco Bustamante”, author of this site:
>>> import re
>>> re.match('Fran', 'Francisco Bustamante')
<re.Match object; span=(0, 4), match='Fran'>
>>> re.match('Fran', 'Bustamante, Francisco')
# notice that the latter returns None as it does not find it at the beginning of the string
The span=(0, 4) means that the pattern was found between indices 0 and 4.
Remember that counting in Python sequences starts at zero and that the last
index is exclusive.
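Because re.match returns None on failure, it is common to test the result before using it. The sketch below shows one way to do that. The helper name starts_with_pattern is hypothetical, created only for this illustration, and the group method it calls is explained later in the article.

```python
import re

def starts_with_pattern(pattern, string):
    """Return the matched text if `string` begins with `pattern`, else None."""
    m = re.match(pattern, string)
    if m is None:          # no match at the beginning of the string
        return None
    return m.group()       # the text that was matched

print(starts_with_pattern('Fran', 'Francisco Bustamante'))   # Fran
print(starts_with_pattern('Fran', 'Bustamante, Francisco'))  # None
```

Testing with `is None` avoids an AttributeError that would occur when calling a method on a failed match.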
Search #
The search function looks for the pattern anywhere in the string. Its signature is
re.search(pattern, string, flags=0). It returns None when it does not find
the pattern. When it finds one, it returns a Match object, which we will
discuss throughout the article. It is important to note that it returns only the
first location where the pattern was found.
>>> re.search('an', 'Francisco Bustamante')
<re.Match object; span=(2, 4), match='an'>
The pattern an appears in the string more than once; however, search
returns in span only the indices of the first occurrence.
Findall #
The findall function searches for the pattern throughout the string. Its
signature is re.findall(pattern, string, flags=0). It returns an empty list
when it does not find the pattern. When it finds matches, it returns a list with each
occurrence. It is essential to note that it returns all occurrences found.
>>> re.findall('an', 'Francisco Bustamante')
['an', 'an']
>>> re.findall('an', 'Chico')
[]
The power of findall will be better explored with some tools that we will
learn throughout the article.
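To compare the three functions side by side, here is a small sketch that applies the same pattern to the same string. The variable names are only illustrative.

```python
import re

text = 'Bustamante, Francisco'

at_start = re.match('an', text)   # None: the string does not start with 'an'
first = re.search('an', text)     # Match object for the first 'an' only
every = re.findall('an', text)    # list with every 'an' in the string

print(at_start)       # None
print(first.span())   # (6, 8)
print(every)          # ['an', 'an']
```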
Flags #
The signature of the functions presented has an argument flags. With flags, we can modify some aspects of how regular expressions work. See the difference in behavior in the following two examples:
>>> re.match('fran', 'Francisco Bustamante')
>>> re.match('fran', 'Francisco Bustamante', re.IGNORECASE)
<re.Match object; span=(0, 4), match='Fran'>
In the first case, the return was None, so the interpreter displayed
nothing. This is because the pattern starts with a lowercase letter
while the string starts with an uppercase one. The re.IGNORECASE flag, as the
name suggests, makes the match ignore the difference between uppercase and lowercase.
| Flag | Meaning |
|---|---|
| ASCII, A | Makes \w, \b, \s and \d (and their negations) match only ASCII characters |
| DOTALL, S | Makes the metacharacter . match any character, including line breaks |
| IGNORECASE, I | Matches without distinguishing between uppercase and lowercase |
| LOCALE, L | Makes \w, \b and case-insensitive matching depend on the current locale (bytes patterns only) |
| MULTILINE, M | Makes ^ and $ also match at the beginning and end of each line |
| VERBOSE, X (from ‘extended’) | Allows whitespace and comments inside the pattern, so it can be laid out more clearly and understandably |
The letters after each name are the abbreviations that can be used instead of the full name.
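As a quick illustration of the abbreviations, and of how more than one flag can be passed at once, consider the sketch below. It reuses the example string from above; the variable names are only illustrative.

```python
import re

text = 'Francisco Bustamante'

# re.I is simply a shorter name for re.IGNORECASE.
short = re.match('fran', text, re.I)
print(short.group())      # Fran

# Several flags are combined with the bitwise OR operator (|).
combined = re.match('fran', text, re.IGNORECASE | re.ASCII)
print(combined.group())   # Fran
```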
Some new words appeared in this table. Don’t worry, they will be explained in due time. One of the words is metacharacter, our next topic.
The power of regular expressions comes from metacharacters: characters that, instead of standing for themselves, represent a set of characters or a general pattern.
The dot metacharacter (.) #
The metacharacter . represents any character, except newline breaks. We will
use it with the match function seen previously:
>>> re.match('.', 'Francisco Bustamante')
<re.Match object; span=(0, 1), match='F'>
>>> re.match('.', '42')
<re.Match object; span=(0, 1), match='4'>
>>> re.match('.', ' Francisco Bustamante')
<re.Match object; span=(0, 1), match=' '>
Notice in the last example that there was a space at the beginning of the string and this space was recognized by the metacharacter.
According to the definition, the metacharacter . should also match control
characters, except \n, which
indicates a line break. Let’s see:
>>> re.match('.', '\t\t') # \t represents TAB
<re.Match object; span=(0, 1), match='\t'>
>>> re.match('.', '\n')
>>> print(re.match('.', '\n'))
None
This behavior of ignoring \n can be modified by one of the flags we saw
earlier, the DOTALL:
>>> re.match('.', '\n', re.DOTALL)
<re.Match object; span=(0, 1), match='\n'>
We saw earlier that the search searches for the pattern throughout the string
and returns the first position where it finds it. Let’s see the behavior with
.:
>>> re.search('.', ' Francisco Bustamante')
<re.Match object; span=(0, 1), match=' '>
>>> re.search('.', 'Francisco Bustamante')
<re.Match object; span=(0, 1), match='F'>
>>> re.search('.', '\nFrancisco Bustamante')
<re.Match object; span=(1, 2), match='F'>
>>> re.search('.', '\nFrancisco Bustamante', re.DOTALL)
<re.Match object; span=(0, 1), match='\n'>
The first two examples return the first position, as expected. In the third
example, the control character \n is ignored, returning the first character
right after, the letter F. This behavior is modified by the DOTALL flag in the
last example.
We saw earlier that the findall searches for the pattern provided throughout
the string, returning a list with all occurrences. Let’s combine it with .:
>>> re.findall('.', 'Chemistry\nProgramming')
['C', 'h', 'e', 'm', 'i', 's', 't', 'r', 'y', 'P', 'r', 'o', 'g', 'r', 'a', 'm', 'm', 'i', 'n', 'g']
The control character \n was ignored, and all other characters were returned
in the form of a list.
Anchors #
The symbol ^ represents the beginning of the string, and the symbol $
represents the end of the string. Let’s test it with the findall function:
>>> re.findall('^.', 'Chemistry\nProgramming\nPython')
['C']
>>> re.findall('^.', 'Chemistry\nProgramming\nPython', re.MULTILINE)
['C', 'P', 'P']
The pattern ^. means searching for any character that is not a line break at
the beginning of the string. Thus, in the first case, it returns the letter C.
But notice that the passed string has line breaks. Therefore, we can pass the
re.MULTILINE flag so that the pattern is searched for in each line. That is
why the second example returns a list with the first letter of the string and
also the first letter after the control character \n.
We can apply the same logic to the pattern .$, which will search at the end of
the string:
>>> re.findall('.$', 'Chemistry\nProgramming\nPython')
['n']
>>> re.findall('.$', 'Chemistry\nProgramming\nPython', re.MULTILINE)
['y', 'g', 'n']
Some edge cases happen when we have a string with only one character, empty, or with a line break:
>>> re.match('^.$', 'a')
<re.Match object; span=(0, 1), match='a'>
>>> re.match('^$', '') # the beginning is equal to the end, empty string
<re.Match object; span=(0, 0), match=''>
>>> re.findall('^$', '\n', re.MULTILINE)
['', '']
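The sketch below contrasts the anchors with and without re.MULTILINE, and shows that using both anchors forces the whole string to fit the pattern. The variable names are only illustrative.

```python
import re

text = 'Chemistry\nProgramming\nPython'

# Without MULTILINE, ^ only matches at the very start of the string.
single = re.findall('^P', text)
# With MULTILINE, ^ also matches right after each line break.
multi = re.findall('^P', text, re.MULTILINE)
print(single)   # []
print(multi)    # ['P', 'P']

# Anchoring both ends requires the entire string to fit the pattern.
whole = re.match('^Python$', 'Python')
partial = re.match('^Python$', 'Python 3')
print(whole is not None)   # True
print(partial)             # None
```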
The metacharacter . is very comprehensive; usually we want to be a bit more
specific.
Character sets #
When the pattern presents brackets, these declare a set of characters. Each
character between the brackets will be searched for in the string text. Let’s
look for lowercase vowels in the string Chemistry Programming:
>>> re.findall('[aeiou]', 'Chemistry Programming')
['e', 'i', 'o', 'a', 'i']
The symbol ^ when inside a character set means negation. So, if we are looking
for everything but lowercase vowels:
>>> re.findall('[^aeiou]', 'Chemistry Programming')
['C', 'h', 'm', 's', 't', 'r', 'y', ' ', 'P', 'r', 'g', 'r', 'm', 'm', 'n', 'g']
It is also possible to define ranges of characters. Searching from “a” to “f”:
>>> re.findall('[a-f]', 'Chemistry Programming')
['e', 'a']
We can define more than one range. Searching from “a” to “f” and from “A” to “Z”:
>>> re.findall('[a-fA-Z]', 'Chemistry Programming')
['C', 'e', 'P', 'a']
A very common pattern is to search for all letters and digits and “_”, since they tend to be the most accepted characters in online form fields like email, for example:
>>> re.findall('[a-zA-Z0-9_]', 'Chemistry Programming')
['C', 'h', 'e', 'm', 'i', 's', 't', 'r', 'y', 'P', 'r', 'o', 'g', 'r', 'a', 'm', 'm', 'i', 'n', 'g']
It’s such a special sequence that there’s a shortcut, the \w:
>>> re.findall('\w', 'Chemistry Programming')
['C', 'h', 'e', 'm', 'i', 's', 't', 'r', 'y', 'P', 'r', 'o', 'g', 'r', 'a', 'm', 'm', 'i', 'n', 'g']
Note a small difference between this approach and the previous example: no
brackets are needed to use \w. Besides that, if non-ASCII characters were
present, they would be recognized, because by default \w matches
Unicode characters. If you really want
only the equivalent of the set [a-zA-Z0-9_], use the re.ASCII flag:
>>> re.findall('\w', 'Chemistry Programming', re.ASCII)
['C', 'h', 'e', 'm', 'i', 's', 't', 'r', 'y', 'P', 'r', 'o', 'g', 'r', 'a', 'm', 'm', 'i', 'n', 'g']
The main special sequences defined by default are:
- \d equivalent to any Unicode digit, which includes [0-9]
- \D equivalent to the negation of \d
- \s equivalent to Unicode whitespace characters, which includes [ \t\n\r\f\v]
- \S equivalent to the negation of \s
- \w equivalent to characters that can be used in general texts, which includes [a-zA-Z0-9_]
- \W equivalent to the negation of \w
When using the re.ASCII flag, each previous case is restricted to the bracket
representation presented. In Unicode, they are more comprehensive.
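To make the Unicode versus ASCII difference visible, the sketch below uses a word with a non-ASCII letter. The sample strings are made up for this illustration, and the patterns use the r prefix that the next section explains.

```python
import re

word = 'S\u00e3o'   # the word "São", written with an escape for the letter ã

unicode_chars = re.findall(r'\w', word)
ascii_chars = re.findall(r'\w', word, re.ASCII)
print(unicode_chars)   # ['S', 'ã', 'o']
print(ascii_chars)     # ['S', 'o']

text = 'Room 42\tBlock B'
print(re.findall(r'\d', text))    # ['4', '2']
print(re.findall(r'\s', text))    # [' ', '\t', ' ']
print(re.findall(r'\D', '4a2'))   # ['a']
```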
Raw strings #
All special sequences use the symbol \. This is a problem in Python, because
string literals also use the backslash for their own escape sequences, such as
control characters. Python evaluates the character after the backslash to check
whether the pair forms an escape sequence or not.
Let’s see an example:
>>> print('1\n2')
1
2
When we want to indicate that it should not be treated as a control character, we use another backslash to escape it:
>>> print('1\\n2') # the first backslash escapes the second
1\n2
This behavior related to backslashes can be very problematic in some contexts.
For example, in scientific fields, LaTeX is widely used for document production.
And LaTeX environments are delimited with commands that use \. For example,
\begin{equation}...\end{equation} delimits an environment for a mathematical
equation. The \section command indicates the beginning of a section in a
document. Let’s see how the Python interpreter recognizes the command followed
by a line break:
>>> '\section\n' # \s has no meaning in Python
'\\section\n'
>>> text = '\section\n' # LaTeX
>>> print(text)
\section
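Notice that, since \s is not an escape sequence for the Python interpreter, the
backslash is kept as an ordinary character, and the interpreter displays it
doubled (\\) to make that explicit (recent Python versions also emit a warning
for such unrecognized escapes). When using print, we see the command text as
written, followed by a blank line produced by \n.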
We can verify that control characters are considered as a single character by
checking the length of the string stored in the variable text:
>>> len(text)
9
We have 8 characters in “\section” and the ninth is the control character \n.
Theoretically, there should be a match when searching for \\section in the
string stored in text:
>>> print(re.match('\\section', text)) # \s is a special sequence in regex
None
But, as we saw in the previous section, \s has meaning within the context of
the re module. Thus, we need to indicate that the backslashes should be
escaped:
>>> print(re.match('\\\\section', text))
<re.Match object; span=(0, 8), match='\\section'>
Welcome to what we call the backslash hell! But don’t worry, there’s a way to
avoid this bunch of backslashes. In Python, there is what we call raw strings.
In this type of strings, denoted by an r before the quotes, backslashes are
not treated in any special way. Note the difference:
>>> len('\n')
1
>>> len(r'\n')
2
In the raw string, we have two characters, the “\” and the “n”, while in the normal (literal) string, there is only one character representing a line break.
Returning to our example, just use a raw string in the pattern:
>>> print(re.match(r'\\section', text)) # raw string: Python leaves the backslashes untouched
<re.Match object; span=(0, 8), match='\\section'>
Thus, a vital tip: use raw strings for patterns that contain backslashes, so that Python passes them unchanged to the re module.
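As a small sketch of the equivalence between the two notations, the code below compares a regular and a raw string and then uses re.escape, a function of the re module not covered in this article, to build a pattern that matches a text literally. The sample text is made up for the illustration.

```python
import re

# The same two-character pattern, written both ways.
print('\\d' == r'\d')   # True
print(len(r'\d'))       # 2

text = r'\section{Intro}'   # raw string: the text really starts with a backslash

# re.escape adds the backslashes needed for a literal match.
pattern = re.escape(r'\section')
print(pattern == r'\\section')           # True
print(re.match(pattern, text).group())   # \section
```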
Use of pipe #
The pipe | means “or”, indicating alternatives in a regular
expression. See the examples:
>>> re.search('a|b', 'abc')
<re.Match object; span=(0, 1), match='a'>
>>> re.search('a|b', 'bcd')
<re.Match object; span=(0, 1), match='b'>
>>> re.search('a|b', 'cde')
Observe in the first example that, even though there is “b” in the string, as
“a” was found first, only the position of “a” was returned. With findall, both
are returned:
>>> re.findall('b|a', 'abc')
['a', 'b']
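Alternatives are not limited to single characters. The sketch below, with a made-up sample string, uses whole words and also shows that, at a given position, the alternatives are tried from left to right.

```python
import re

text = 'Python and Chemistry, chemistry and python'

exact = re.findall('Python|Chemistry', text)
anycase = re.findall('python|chemistry', text, re.IGNORECASE)
print(exact)     # ['Python', 'Chemistry']
print(anycase)   # ['Python', 'Chemistry', 'chemistry', 'python']

# The first alternative that fits wins, even if a later one is longer.
print(re.search('Fran|Francisco', 'Francisco').group())   # Fran
print(re.search('Francisco|Fran', 'Francisco').group())   # Francisco
```

When one alternative is a prefix of another, placing the longer one first is a simple way to get the longer match.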
Repetitions #
Searching for repetitions is one of the most common reasons for using regular expressions. Let’s check the various possible ways.
Specific amounts #
Let’s check the behavior of searching for the pattern \d{4} which searches for
any digit four times.
>>> re.match(r'\d{4}', '1234') # string with 4 digits
<re.Match object; span=(0, 4), match='1234'>
>>> re.match(r'\d{4}', '123') # string with 3 digits
>>> re.match(r'\d{4}', '12345') # string with 5 digits
<re.Match object; span=(0, 4), match='1234'>
In the examples, it is clear that when there are fewer than four digits, the
return is None. In other cases, it returns the first four digits. The same
happens with the use of the search function:
>>> re.search(r'\d{4}', 'abc123def12345')
<re.Match object; span=(9, 13), match='1234'>
The first segment 123 was ignored for having fewer than 4 digits. The second
segment, which has 5 digits, was considered and returned the first four digits.
Minimum and maximum quantity #
A comma after the number indicates that the number is the minimum.
That is, if there are more digits, the behavior will be greedy and the match
will include the remaining digits beyond the minimum. If there are fewer digits than the
minimum, the return will be None.
>>> re.match(r'\d{2,}', '12')
<re.Match object; span=(0, 2), match='12'>
>>> re.match(r'\d{2,}', '12345') # greedy: takes as many digits as possible
<re.Match object; span=(0, 5), match='12345'>
>>> re.match(r'\d{2,}', '1')
Using the ? symbol after the braces makes the behavior lazy, so this
symbol is a repetition modifier.
>>> re.match(r'\d{2,}?', '12345') # lazy, minimum possible
<re.Match object; span=(0, 2), match='12'>
A value after the comma indicates the maximum value. The other previous explanations remain valid.
>>> re.match(r'\d{2,4}', '12345')
<re.Match object; span=(0, 4), match='1234'>
>>> re.match(r'\d{2,4}', '123')
<re.Match object; span=(0, 3), match='123'>
>>> re.match(r'\d{2,4}', '12')
<re.Match object; span=(0, 2), match='12'>
>>> re.match(r'\d{2,4}', '1')
>>> re.match(r'\d{2,4}?', '12345') # greedy to lazy
<re.Match object; span=(0, 2), match='12'>
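The same quantifier can be observed with findall, which makes the difference between greedy and lazy easier to see over several numbers at once. The sample string is made up for this sketch.

```python
import re

text = '1 12 123 12345'

greedy = re.findall(r'\d{2,4}', text)
lazy = re.findall(r'\d{2,4}?', text)

print(greedy)   # ['12', '123', '1234']
print(lazy)     # ['12', '12', '12', '34']
```

In both cases the isolated 1, the final 3 of 123 (lazy case) and the final 5 of 12345 are left out, since a single digit does not reach the minimum of two.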
0 or 1 occurrence, optional element #
Searching for an optional element in the string is actually a special case of minimum and maximum, with a minimum of zero and a maximum of one.
>>> re.match(r'\d{0,1}', '12345')
<re.Match object; span=(0, 1), match='1'>
>>> re.match(r'\d{,1}', '12345') # 0 can be omitted
<re.Match object; span=(0, 1), match='1'>
The ? symbol after a regular expression has the same effect as {,1}.
Therefore:
>>> re.match(r'\d?', '12345')
<re.Match object; span=(0, 1), match='1'>
But we have already seen that the same symbol turns the search from greedy to lazy. Thus, the following expression returns an empty string, as the minimum is zero occurrences:
>>> re.match(r'\d??', '12345')
<re.Match object; span=(0, 0), match=''>
Let’s break it down. The first question mark is a repetition modifier of the
regular expression immediately before, searching for 0 or 1 occurrence of \d.
The second question mark is a modifier of the repetition operator, making it
lazy.
0 or more times #
Another special case of minimum and maximum. It also has a special symbol, the
*. Observe the examples:
>>> re.match(r'\d{0,}', '12345')
<re.Match object; span=(0, 5), match='12345'>
>>> re.match(r'\d{,}', '12345') # 0 can be omitted
<re.Match object; span=(0, 5), match='12345'>
>>> re.match(r'\d*', '12345') # symbol
<re.Match object; span=(0, 5), match='12345'>
>>> re.match(r'\d*?', '12345') # lazy
<re.Match object; span=(0, 0), match=''>
>>> re.match(r'\d*', 'abc')
<re.Match object; span=(0, 0), match=''>
In the last example, it returns an empty string in the match, as the minimum is no occurrences.
1 or more times #
Another special case of minimum and maximum. It also has a special symbol, the
+. Observe the examples:
>>> re.match(r'\d{1,}', '12345')
<re.Match object; span=(0, 5), match='12345'>
>>> re.match(r'\d+', '12345') # + requires at least one occurrence, being greedy
<re.Match object; span=(0, 5), match='12345'>
>>> re.match(r'\d+?', '12345') # turns into lazy
<re.Match object; span=(0, 1), match='1'>
>>> re.match(r'\d+', 'abc')
In the last example, it returns None, as the minimum is one occurrence, and
there are no digits in the string.
Understanding the importance of repetition control #
If everything has seemed very abstract so far, let’s look at some situations closer to reality. Consider the string below, from which we would like to extract all the attribute values.
>>> text = 'name="Francisco" site="Chemistry Programming"'
Since the values are between double quotes, we might initially imagine the
pattern r'".+"', since . would catch the characters with the + modifier
indicating one or more times. But see the result:
>>> re.findall(r'".+"', text)
['"Francisco" site="Chemistry Programming"']
It didn’t work because + is greedy: .+ matches as many characters as possible, quotes included, so the match extends from the first double quote to the last. In reality, we want what is between each pair of quotes. Therefore, we should make the search lazy:
>>> re.findall(r'".+?"', text)
['"Francisco"', '"Chemistry Programming"']
Now we have what we wanted. But it’s still not a good way to get the values. Notice what would happen if empty fields were passed:
>>> text = 'name="" site=""'
>>> re.findall(r'".+?"', text)
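['"" site="']
The result is wrong because + requires at least one character: the closing quote of the first field is consumed as that character, and the match only stops at the opening quote of the second field. In
reality, we want 0 or more occurrences, which we can solve by using *: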
>>> re.findall(r'".*?"', text)
['""', '""']
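Another common way to get the same result, shown here as an alternative sketch rather than as the approach of this article, is to replace the lazy dot with a negated character set that cannot cross a double quote.

```python
import re

filled = 'name="Francisco" site="Chemistry Programming"'
empty = 'name="" site=""'

# [^"]* means: zero or more characters that are not a double quote.
pattern = r'"[^"]*"'

print(re.findall(pattern, filled))   # ['"Francisco"', '"Chemistry Programming"']
print(re.findall(pattern, empty))    # ['""', '""']
```

Since the set excludes the quote itself, the repetition can stay greedy and still stop at the closing quote of each field.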
Understanding the Match Object #
In several examples, we saw that the result is a Match object. Let’s understand this object a little better. There are four important methods for this article’s level:
>>> m = re.match(r'\d+', '12345')
>>> type(m)
<class 're.Match'>
>>> m.group() # returns the substring that was matched
'12345'
>>> m.start() # initial position of the match
0
>>> m.end() # final position of the match
5
>>> m.span() # tuple with the initial and final position of the match
(0, 5)
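The positions returned by these methods can be used directly to slice the original string. A minimal sketch, with a made-up sample string:

```python
import re

text = 'abc123def12345'
m = re.search(r'\d+', text)

if m is not None:
    start, end = m.span()
    matched = text[start:end]     # same text as m.group()
    before = text[:m.start()]     # everything before the match
    after = text[m.end():]        # everything after the match
    print(matched, before, after)   # 123 abc def12345
```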
The group method will be important for the next section.
Capture Groups #
Often the string being analyzed has several fields from which we want to
extract data. We can then define capture groups. Groups are
delimited by parentheses, and we can repeat the content of a group with a
repetition quantifier, such as *, +, ?, or {m,n}. For example, (ab)*
matches zero or more repetitions of ab.
Consider the following HTML tag from which we want to extract the name (input)
and the values of type, id, and name. We can create a variable pattern in the
code, following the format of the string. In each group, we intend to capture
one or more occurrences of characters and in a lazy way, so as not to have the
problems seen in the repetition control section:
>>> html = '<input type="text" id="id_cpf" name="cpf">'
>>> pattern = r'<(.+?) type="(.+?)" id="(.+?)" name="(.+?)"'
# any character before a space for the tag name, same between quotes for the other groups
>>> m = re.match(pattern, html) # storing the match result in a variable
>>> m
<re.Match object; span=(0, 41), match='<input type="text" id="id_cpf" name="cpf"'>
The variable m stores the result, a Match object. By default, it
displays the complete match, but we can explore more details of the object.
The groups method returns all captured groups, and the group method
returns specific groups:
>>> m.groups() # capture groups
('input', 'text', 'id_cpf', 'cpf')
>>> m.group(0) # entire match
'<input type="text" id="id_cpf" name="cpf"'
>>> m.group(1) # first group
'input'
>>> m.group(2, 1, 3) # specific groups
('text', 'input', 'id_cpf')
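Since groups returns a tuple, its items can be unpacked straight into variables. The sketch below assumes a hypothetical date string in the year-month-day format, chosen only for illustration.

```python
import re

m = re.match(r'(\d{4})-(\d{2})-(\d{2})', '2024-03-15')

year, month, day = m.groups()   # unpack the three captured groups
print(year, month, day)                   # 2024 03 15
print(int(year), int(month), int(day))    # 2024 3 15
print(m.group(0))                         # 2024-03-15
```

The captured values are always strings, so a conversion such as int is needed before doing arithmetic with them.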
Suppose now that a change in the order of the attributes of the HTML tag may occur:
>>> html1 = '<input type="text" id="id_cpf" name="cpf">'
>>> html2 = '<input id="id_cpf" name="cpf" type="text">'
Clearly, our previous pattern will not work, as it depends strongly on the
order of the attributes. We have already seen the metacharacter | which
indicates an alternative, something that will be useful here because we want to
capture each attribute regardless of the order.
Another concept we will use is that of a non-capture group, represented by
(?:...), replacing the ellipses with a regular expression. To understand,
consider the following examples:
>>> m = re.match('name="(.+?)"', 'name="Francisco"')
>>> m.groups()
('Francisco',)
>>> m = re.match('name="(?:.+?)"', 'name="Francisco"')
>>> m.groups()
()
In the first case, the parentheses captured the value “Francisco”. In the second, the group still took part in the match, but its content was not captured.
We can combine alternatives and non-capture groups to obtain each group from the HTML tag regardless of the order:
>>> pattern = r'<(.+?) (?:(?:type="(.+?)"|id="(.+?)"|name="(.+?)") ?)*'
# the inner group lists the alternatives; the outer group adds an optional space and is repeated
>>> m = re.match(pattern, html1)
>>> m
<re.Match object; span=(0, 41), match='<input type="text" id="id_cpf" name="cpf"'>
>>> m.groups()
('input', 'text', 'id_cpf', 'cpf')
>>> m = re.match(pattern, html2)
>>> m
<re.Match object; span=(0, 41), match='<input id="id_cpf" name="cpf" type="text"'>
>>> m.groups()
('input', 'text', 'id_cpf', 'cpf')
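A different way to become independent of the attribute order, sketched here as an alternative and not as the approach of this article, is to let findall collect every attribute and value pair. The helper name parse_tag is hypothetical.

```python
import re

html1 = '<input type="text" id="id_cpf" name="cpf">'
html2 = '<input id="id_cpf" name="cpf" type="text">'

def parse_tag(html):
    """Hypothetical helper: return (tag name, dict of attributes)."""
    tag = re.match(r'<(\w+)', html).group(1)
    # With two groups, findall returns a list of (attribute, value) tuples.
    attributes = dict(re.findall(r'(\w+)="(.*?)"', html))
    return tag, attributes

print(parse_tag(html1))
# ('input', {'type': 'text', 'id': 'id_cpf', 'name': 'cpf'})
print(parse_tag(html1) == parse_tag(html2))   # True
```

This version also accepts attributes that were not foreseen in the pattern, which may or may not be desirable depending on the case.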
Named Groups #
Finally, when there are several groups, it is useful to name them. Python has a
specific syntax to indicate the name of each group: (?P<name>...).
>>> pattern = r'<(?P<tag>.+?) (?:(?:type="(?P<type>.+?)"|id="(?P<id>.+?)"|name="(?P<name>.+?)") ?)*'
>>> m = re.match(pattern, html1)
>>> m.groups()
('input', 'text', 'id_cpf', 'cpf')
When using named groups, there is a very useful method, the groupdict, which
returns the name of the group and the value in the form of a dictionary:
>>> m.groupdict()
{'tag': 'input', 'type': 'text', 'id': 'id_cpf', 'name': 'cpf'}
>>> m = re.match(pattern, html2)
>>> m.groups()
('input', 'text', 'id_cpf', 'cpf')
>>> m.groupdict()
{'tag': 'input', 'type': 'text', 'id': 'id_cpf', 'name': 'cpf'}
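Named groups can also be read one at a time. The sketch below assumes a hypothetical time string in the hour:minute format, and the indexing form m['minute'] requires Python 3.6 or later.

```python
import re

m = re.match(r'(?P<hour>\d{2}):(?P<minute>\d{2})', '14:35')

print(m.group('hour'))   # 14
print(m['minute'])       # 35 (same as m.group('minute'))
print(m.group(1))        # 14 (named groups are still numbered)
print(m.groupdict())     # {'hour': '14', 'minute': '35'}
```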
Conclusion and an IMPORTANT note #
The important observation is: just because something can be solved with regular expressions doesn’t mean it should be solved with regular expressions.
Specifically in Python, string manipulations involving substitutions are often
more effective with string methods like replace. The official
documentation
makes this clear.
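As a small illustration of that point, the sketch below places string methods next to their regex counterparts for fixed text. Note that re.sub is a function of the re module that was not covered in this article.

```python
import re

text = 'Francisco Bustamante'

# Substitution: str.replace versus re.sub.
with_method = text.replace('Francisco', 'Chico')
with_regex = re.sub('Francisco', 'Chico', text)
print(with_method)   # Chico Bustamante
print(with_regex)    # Chico Bustamante

# Fixed-text checks: string tools instead of re.match and re.search.
print(text.startswith('Fran'))   # True
print('an' in text)              # True
```

When the text to find is fixed, the string methods say the same thing with less machinery; regex pays off when the pattern is truly variable.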
Similarly, although I used the case of the HTML tag as an example, there are various online discussions about more efficient ways to extract information from HTML and why regex is not usually the best option. See here, here and here. Be prepared for heated discussions. And exercise good sense to know when it is acceptable to use and when another tool should be used.
Regular expressions are useful, but can be very difficult to read and debug. Try to make them as short and specific as possible. The subject is much more extensive than what is presented here; I tried to cover what I consider a good start and, obviously, what is within my knowledge. It is certainly a very valuable tool to know.
Did you like this article? It is part of Python Drops, a set of shorter posts focused on fundamentals of the Python language and programming in general. You can read more of these articles by searching for the “drops” tag here on the site.
See you next time!