Python Regular Expression Tutorial

In this tutorial, you will learn about regular expressions and the regular expression operations defined in Python's re module. re is part of the Python standard library and provides pattern-matching operations on strings.

A regular expression (RE) is a sequence of characters that describes a search pattern in a formal syntax; a given string either matches the pattern or it does not. You can think of regular expressions as a small programming language embedded in Python.

A regular expression defines a set of rules, and those rules determine which strings, or which parts of a given string, match the pattern. The re module compiles a pattern into a set of instructions and then runs them against the string.

The match() Function:

You can use the match function to check whether the RE pattern matches at the beginning of the given string. The match function also accepts flags. Flags change the behavior of a regular expression and can take different values, which you will see later in this tutorial.

The following is the syntax of match function in Python:

re.match(pattern, string, flags=0)

It has three arguments,

  1. pattern is the regular expression pattern which is to be matched
  2. string is the given string which is to be matched with regular expression
  3. flags is used to change the behavior of regular expression, and it is optional.

If the match succeeds, a Match object is returned; otherwise None is returned. A Match object has two main methods, group(num) and groups(). group(num) returns the entire match or a specific subgroup, and groups() returns all the subgroups as a tuple.
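As a minimal sketch of how group(num) and groups() differ, assume a hypothetical pattern with two capturing groups. The pattern and the sample text here are illustrative only and are not taken from the examples in this tutorial:

```python
import re

# Hypothetical pattern with two capturing groups: a word and a number.
mobj = re.match(r"(\w+) (\d+)", "Python 3 tutorial")

if mobj is not None:           # re.match returns None when nothing matches
    whole = mobj.group()       # the entire match
    first = mobj.group(1)      # only the first subgroup
    every = mobj.groups()      # all subgroups as a tuple
    print(whole, first, every)
```

Checking for None before calling group() avoids an AttributeError when the pattern does not match.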

Using the match function

The following example demonstrates how you can use the match function:

import re
strTest = "Hello Python Programming"
mobj = re.match(r"hello", strTest, re.I)
print(mobj.group())

In this code, the re module is imported first. The string strTest is then compared with the RE pattern, and the value returned by the match function is assigned to mobj. In the call to re.match, the first argument is the pattern to be matched, the second is the string to match against, and the third is a flag value. Here re.I is the flag value, which means IGNORECASE, so the match succeeds even where the pattern and the string differ in letter case.

The output is :

Hello

In this example, the prefix r marks the pattern as a raw string. In a raw string a backslash is kept as written, so you write a single \ where a regular string would need a doubled \\. This handling of backslashes is the only difference between a regular string and a raw string.
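The effect of the r prefix can be sketched by comparing the two spellings directly. The variable names and the sample text "room 42" are made up for this illustration:

```python
import re

regular = "\\d+"    # regular string: the backslash has to be doubled
raw = r"\d+"        # raw string: the backslash is written once

same_pattern = regular == raw      # both spell the pattern \d+
tab_length = len("\t")             # regular string: one tab character
raw_tab_length = len(r"\t")        # raw string: a backslash and the letter t

print(same_pattern, tab_length, raw_tab_length)
print(re.search(raw, "room 42").group())
```

Writing patterns as raw strings keeps them readable because every backslash reaches the re module exactly as typed.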

Using match function with regular string

Consider the example below in which a regular string is used instead of a raw string:

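import re
str = "\\tHello Python Programming"     #the string holds a backslash followed by t
mobj = re.match("\\thello", str, re.I) #no match: the pattern \t means a tab character

str = "\tHello Python Programming"      #the string starts with a real tab
mobj = re.match("\\thello", str, re.I) #match: \thello matches the tab followed by Hello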

The search() Function:

You can use the search() function to search for the RE pattern in the given string. The search function takes three arguments: the pattern, the given string, and flags (optional), respectively.

The following is the syntax of the search function in Python:

re.search(pattern, string, flags=0)

The following Python code demonstrates the use of search() function:

import re
str = "Hello Python Programming"
sobj = re.search(r"programming", str, re.I)
print(sobj.group())
Programming

This code searches for the word programming. The difference between search and match is that match checks only at the beginning of the string, whereas search scans the entire string and returns the first match it finds.
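The difference between match and search can be sketched with one pattern applied both ways. The variable names are illustrative:

```python
import re

text = "Hello Python Programming"

at_start = re.match(r"Python", text)    # None: the text does not start with "Python"
anywhere = re.search(r"Python", text)   # found further into the text

print(at_start)
print(anywhere.group(), anywhere.start())
```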

Searching at the beginning

If you want to search at the beginning of the string then you can use ^. Consider the following example:

import re
str = "Hello Python Programming"
sobj = re.search(r"^programming", str, re.I)
print(sobj) #None: no match is found, so sobj.group() would raise AttributeError

sobj = re.search(r"^hello", str, re.I)
print(sobj.group()) #matching: Hello

Here ^ will make the search only at the beginning of the string.

Searching at the end

You can also search at the end of the given string. It can be done using $ at the end of the pattern. Consider the code below:

import re
str = "Hello Python Programming"
sobj = re.search(r"programming$", str, re.I)
print(sobj.group()) #matching: Programming

sobj = re.search(r"hello$", str, re.I)
print(sobj) #None: no match is found, so sobj.group() would raise AttributeError

Compiling regular expressions:

When a regular expression is compiled, it is converted into a pattern object. Pattern objects provide methods for different tasks, such as searching, matching, and replacing.

Once you compile a pattern, you can reuse it later in the program without compiling it again.

Using precompiled patterns

In the code below, the pattern r"(\d)", which matches a single digit, is compiled. The search method of the compiled pattern is then called with a string and returns the first digit found anywhere in that string. Similarly, you can use the precompiled pattern with the match method, which looks for a digit at the beginning of the string:

import re
compPat = re.compile(r"(\d)")
sobj = compPat.search("Lalalala 123")
print(sobj.group())

mobj = compPat.match("234Lalalala 123456789")
print(mobj.group())
1
2

Flags:

You can use flags to change the behavior of a regular expression. In a function, flags are optional. You can pass flags in two ways: with the keyword argument flags, or by writing the flag value directly as a positional argument. You can combine more than one flag value by using the bitwise OR operator |.

Consider the following table, which describes some of the commonly used flags:

Flag Value   Description
re.I   Ignores the case of strings and patterns while matching.
re.L   Makes \w, \W, \b and \B depend on the current locale. In Python 3 it can be used only with bytes patterns.
re.M   Makes $ match at the end of each line and not only at the end of the string. Similarly, ^ matches at the beginning of each line instead of only at the beginning of the string.
re.S   Makes a dot . match any character, including a newline.
re.U   Interprets characters according to the Unicode character set. In Python 3 this is already the default for string patterns.
re.X   Ignores whitespace inside the pattern and treats # as the start of a comment.
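A few of the flags in the table above can be sketched as follows. The sample text and variable names are assumptions made for this illustration:

```python
import re

text = "first line\nsecond line"

# re.M: ^ also matches right after each newline.
starts = re.findall(r"^\w+", text, flags=re.M)

# re.S: the dot also matches the newline, so .* can span both lines.
without_s = re.match(r".*", text).group()
with_s = re.match(r".*", text, flags=re.S).group()

# re.X: whitespace and # comments inside the pattern are ignored.
number = re.compile(r"""
    \d+      # integer part
    \.       # literal dot
    \d+      # fractional part
""", re.X)
found = number.search("pi is 3.14").group()

print(starts, without_s, found)
```

The re.X form is mainly useful for long patterns, where the comments document each part.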

Using multiple flag values

Consider the following Python code in which you will see how to use multiple flag values to change the behavior of RE. Multiple flag values can be included by bitwise OR (|) operator:

import re
s = re.search("L", "Hello")
print(s)		#Output: None, l is there but in lowercase and no flag is used

s = re.search("L", "Hello", re.I)
print(s.group())		#Output: l

s = re.search("^L", "Hello\nlist", re.I | re.M)
print(s.group())		#Output: l, ^ matches at the start of the second line and case is ignored

Checking for allowed characters:

You can also check if a certain string contains some particular range of characters or not.

Defining a function and checking allowed characters

Consider the following example, in which a function uses a precompiled pattern to check whether the passed string contains only the allowed characters:

import re
def check(text):
	# [^A-Z] matches any character that is not an uppercase letter
	pattern = re.compile(r'[^A-Z]')
	disallowed = pattern.search(text)
	return not bool(disallowed)
print(check("HELLOPYTHON"))		#Output: True
print(check("hellopython"))		#Output: False

In this function, the pattern r'[^A-Z]' is compiled and used to search the string passed to check. The pattern matches any character other than an uppercase letter A-Z, so the function returns True only when the string consists entirely of uppercase letters. When you pass a string in lowercase letters, False is returned.

Search and replace:

The re module provides the sub function, which replaces occurrences of the pattern in the given string with the repl argument. If count is nonzero, at most count occurrences are replaced; the default value of 0 replaces all occurrences. The sub function returns the updated string.

The following is the syntax of sub function:

re.sub(pattern, repl, string, count=0, flags=0)

Using sub function

Consider the example below in which sub function replaces the entire string with a given string:

import re
s = "Playing 4 hours a day"
obj = re.sub(r'^.*$',"Working",s)
print(obj)
Working

Here, the sub function is used. In the pattern r'^.*$', ^ anchors the match at the start of the string, .* matches whatever is in the string, and $ anchors the match at the end of the string. The argument "Working" therefore replaces the entire string s.

Using sub function to delete all the digits from a string

Consider the following example in which sub function deletes the digits in the given string. For this purpose you can use \d:

import re
s = "768 Working 2343 789 five 234 656 hours 324 4646 a 345 day"
obj = re.sub(r'\d',"",s)
print(obj)
Working   five   hours   a  day

Similarly, you can delete every character that is not a digit. For this purpose you can use \D.

import re
s = "768 Working 2343 789 five 234 656 hours 324 4646 a 345 day"
obj = re.sub(r'\D',"",s)
print(obj)
76823437892346563244646345
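The count argument of sub, described at the start of this section, is not used in the examples above. A minimal sketch, assuming a shortened version of the sample string and "#" as an arbitrary replacement:

```python
import re

s = "768 Working 2343 789 five"

# count=2 stops after the first two replacements.
first_two = re.sub(r"\d+", "#", s, count=2)

# count is left at its default of 0, so every run of digits is replaced.
all_runs = re.sub(r"\d+", "#", s)

print(first_two)
print(all_runs)
```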

The findall() function:

The findall function returns a list of all the strings that match the pattern. The difference between search and findall is that findall finds all the matches, whereas search finds only the first match. findall finds the non-overlapping matches and returns them as a list of strings.

The following is the syntax of findall function:

re.findall(pattern, string, flags=0)

Here pattern is the RE pattern to look for in the given string, and flags optionally takes flag values, for example re.I to ignore case.

Find all non-overlapping matches:

In the following example, findall finds non-overlapping matches:

import re
str = "Working 6 hours a day. Studying 4 hours a day."
mobj = re.findall(r'[0-9]', str)
print(mobj)
['6', '4']

r'[0-9]' is a pattern that finds all the digits in the given string. The matches are returned as a list of strings (even though they are digits), which is stored in mobj.

findall with files:

You can also use findall to search the contents of a file. When you use findall on the text of a file, it returns a list of all the matching strings in the file. Because the file's read() method returns the entire text of the file as a string, you do not have to iterate through each line of the file using a loop. Consider the following example:

import re
with open('asd.txt', 'r') as file:
    mobj = re.findall(r'arg.', file.read())
print(mobj)
['arg,', 'arg,', 'arg,', 'argv', 'argv', 'argv']

In this example, file is opened first in read mode. The pattern r'arg.' is matched with the content of the file and you have the list of matching strings in the output.

The finditer() function:

The finditer function finds the RE pattern in a string and reports where each match is located, that is, the index of each match. It returns an iterator that yields a Match object for every non-overlapping match.

The following is the syntax of finditer function:

re.finditer(pattern, string, flags=0)

Iterating over matches:

The difference between findall and finditer is that findall returns the matching strings, whereas finditer yields Match objects, which give the index of each match as well as the matching string. In the code below, finditer is used to find the locations of the matching strings while iterating over the matches using a for loop.

import re
str = "Working 6 hours a day. Studying 4 hours a day."
pat = r'[0-9]'
for mobj in re.finditer(pat, str):
    s = mobj.start()
    e = mobj.end()
    g = mobj.group()
    print('{} found at location [{},{}]'.format(g, s, e))
6 found at location [8,9]
4 found at location [32,33]

In this example, the pattern matches any digit from 0 to 9 in str. The for loop iterates over the Match objects yielded by finditer. In each iteration, the methods start, end and group return the start index, the end index and the matched text respectively.

The split() function:

The split function splits a string at each occurrence of the pattern.

The following is the syntax of split function:

re.split(pattern, string, maxsplit=0, flags=0)

Here maxsplit is the maximum number of splits. If maxsplit is nonzero, at most maxsplit splits occur, and the remainder of the string is returned as the final element of the list. The default value of maxsplit is 0, which means unlimited splits.

Splitting a string:

The split function can be used to return each word in a string. In the code below, a string is split according to the given pattern and the maximum number of splits.

import re
str = "Birds fly high in the sky for ever"
mobj = re.split(r'\s+', str, maxsplit=5)
print(mobj)
['Birds', 'fly', 'high', 'in', 'the', 'sky for ever']

In this example, \s is a special sequence that matches a whitespace character; for ASCII text it is equivalent to [ \t\n\r\f\v]. The words are therefore separated at each run of whitespace. The value of maxsplit is 5 here, which makes 5 splits and produces 6 elements; the last element is the remainder of the string after the 5th split.
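The effect of maxsplit can be sketched by splitting the same sentence with and without a limit. The variable names are illustrative:

```python
import re

text = "Birds fly high in the sky for ever"

unlimited = re.split(r"\s+", text)             # maxsplit defaults to 0: split everywhere
limited = re.split(r"\s+", text, maxsplit=2)   # 2 splits give 3 elements

print(len(unlimited), unlimited)
print(len(limited), limited)
```

Passing maxsplit by keyword keeps the call unambiguous, since the positional form is easy to confuse with the flags argument.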

Basic patterns of re:

Regular expressions can specify patterns that are compared to given strings. The following are the basic Patterns of regular expression:

Pattern Description
^ It is used to match at the beginning of the string.
$ This pattern will match at the end of the string.
. Dot is used to match one character (newline is not included).
[...] It is used to match any single character listed within the brackets.
[^...] This will match any single character that is not listed in the brackets.
* 0 or more occurrences of preceding re in given string.
+ 1 or more occurrences of preceding re in given string.
? 0 or 1 occurrences of preceding re in given string.
{n} It will match exactly n occurrences of the preceding re.
{n,} It will match n or more occurrences of the preceding re.
{n,m} This pattern is used to match at least n and at most m occurrences of the preceding re.
a|b It will match either a or b.
(re) This pattern is used to group the regular expressions and it will remember the matched text.
(?imx) It will toggle on i or m or x in RE. In Python this form applies to the whole pattern and must appear at the start of the RE.
(?-imx) It will toggle off i or m or x in RE. Python's re module supports this only in the scoped form (?-imx: re).
(?: re) This pattern is used to group the regular expressions but it will not remember the matched text.
(?imx: re) It will temporarily toggle on i or m or x in RE inside parentheses.
(?-imx: re) It will temporarily toggle off i or m or x in RE inside parentheses.
(?#...) It is a comment.
(?= re) Positive lookahead. It matches at a position where re matches next, without consuming any characters.
(?! re) Negative lookahead. It matches at a position where re does not match next, without consuming any characters.
(?> re) This pattern is used to match an independent (atomic) pattern without backtracking into it. Python's re module supports it from Python 3.11 onward.
\w This pattern is used to match a word character (a letter, a digit or an underscore).
\W This pattern is used to match a non-word character.
\s It will match a whitespace character. For ASCII text, \s is equal to [ \t\n\r\f\v].
\S It will match a non-whitespace character.
\d It matches a digit. For ASCII text, \d is equal to [0-9].
\D It matches a non-digit.
\A match the beginning of the string.
\Z match only at the end of the string. In Python, unlike $, it does not match before a trailing newline.
\G match at the point where the last match finished. Python's re module does not support \G.
\b match a word boundary when outside brackets; inside brackets it matches a backspace.
\B match non-word boundaries.
\n, \t, etc. \n is used to match newlines, \t will match tab and so on.
\1...\9 This pattern will match the same text as the nth grouped subexpression.
\10 \10 matches the same text as the 10th grouped subexpression. Python reads an escape as the octal representation of a character code only when it starts with 0 or is three octal digits long.
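A few of the patterns from the table above, sketched in code. The sample strings and variable names are made up for this illustration:

```python
import re

# a|b : alternation, either alternative may match
pets = re.findall(r"cat|dog", "cat, dog, bird")

# (?=re) : lookahead, " hours" must follow but is not part of the match
before_hours = re.findall(r"\d+(?= hours)", "6 hours, 30 minutes, 4 hours")

# \1 : backreference to the first group, used here to find a doubled word
doubled = re.search(r"\b(\w+) \1\b", "it is is working").group(1)

print(pets, before_hours, doubled)
```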

Repetition cases:

The following table demonstrates some examples of repetition cases with description:

Examples Descriptions
ab? It will match either a or ab.
ab* It will match a followed by zero or more b’s, for example a, ab or abbb.
ab+ It will match a followed by one or more b’s; a alone does not match.
\d{2} It will match exactly 2 digits.
\d{2,} It will match 2 or more digits.
\d{2,4} It will match 2, 3 or 4 digits.
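The repetition cases in the table can be checked with re.fullmatch, which succeeds only when the whole string matches. The helper name fully_matches is hypothetical and exists only for this sketch:

```python
import re

def fully_matches(pattern, text):
    """Return True when the whole text matches the pattern."""
    return re.fullmatch(pattern, text) is not None

print(fully_matches(r"ab?", "ab"), fully_matches(r"ab?", "abb"))
print(fully_matches(r"ab*", "a"), fully_matches(r"ab*", "abbb"))
print(fully_matches(r"ab+", "a"), fully_matches(r"ab+", "abb"))
print(fully_matches(r"\d{2,4}", "123"), fully_matches(r"\d{2,4}", "12345"))
```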

Nongreedy repetition:

In regular expressions, repetition is greedy by default: it tries to match as many repetitions as possible.

The quantifiers *, + and ? are greedy quantifiers. When you use .*, it performs a greedy match and matches as many characters as possible, which here is the entire string. Consider the code below:

import re
mobj = re.match(r'.*', "Birds fly high in sky")
print(mobj.group())
Birds fly high in sky

So you can see here the entire string is matched.

When you add ? after .+ you have a non-greedy re, and the pattern .+? will match as few characters as possible in the string.

import re
mobj = re.match(r'.+?', "Birds fly high in sky")
print(mobj.group())

The result is the first character of the string:

B
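The greedy and non-greedy behavior described in this section is easier to see when the pattern has a closing delimiter. A minimal sketch, assuming a small made-up piece of markup as the input:

```python
import re

html = "<b>bold</b>"

greedy = re.match(r"<.*>", html).group()    # runs on to the last >
lazy = re.match(r"<.*?>", html).group()     # stops at the first >

print(greedy)
print(lazy)
```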

Special characters and sequences in re:

Special sequences in re start with a \. For example, \A matches only at the beginning of the string.

These special sequences are described in the table above.

This section demonstrates some of the special sequences:

import re
str = "Birds fly high in the sky"
# \A
mobj = re.match(r'\Ab', str, re.I) #OUTPUT: B, here \A will match at beginning only.

#\d
mobj = re.match(r'\d', "4 birds are flying") #OUTPUT: 4

#\s
mobj = re.split(r'\s+', "birds fly high in the sky", maxsplit=1) #OUTPUT: ['birds', 'fly high in the sky']

The escape function:

The escape function escapes the special characters in a string by putting a backslash before them. ASCII letters, numbers, and _ are not escaped. The escape function is useful when you want to match a string literally even though it may contain metacharacters.

Following is the syntax of escape function:

re.escape(pattern)

In the following example, the string www.python.com is passed to the escape function. The string contains ., which is a metacharacter, so it is escaped:

import re
print(re.escape('www.python.com'))
www\.python\.com

Here . is a metacharacter, so it is escaped. Whenever the escape function escapes a metacharacter, you will have \ before the character.
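Why escaping matters can be sketched by matching with and without re.escape. The string "wwwXpythonXcom" is a made-up input chosen to show the difference:

```python
import re

literal = "www.python.com"

# Unescaped: each dot matches any character, so this unrelated text matches.
unescaped = re.fullmatch(literal, "wwwXpythonXcom")

# Escaped: each dot must be a real dot, so the same text no longer matches.
escaped = re.fullmatch(re.escape(literal), "wwwXpythonXcom")

# The escaped pattern still matches the literal string itself.
exact = re.fullmatch(re.escape(literal), literal)

print(unescaped is not None, escaped is not None, exact is not None)
```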

Escaping special characters:

Characters like the brackets [ and ] have a special meaning in a pattern, so they are not matched literally. Consider the following example:

import re
mobj = re.search(r'[a]', '[a]b')
print(mobj.group())
a

Here you can see that the brackets [ and ] are not matched, because [a] is read as a character set containing a.

You can match them by escaping them with a backslash:

import re
mobj = re.search(r'\[a\]', '[a]b')
print(mobj.group())
[a]

The group() function:

The group function is used to return one or more subgroups of the found match. The group function can have some arguments.

The following is the syntax of group function:

group(group1, group2,..., groupN)

If you pass a single argument to the group function, the result is a single string; when you pass more than one argument, the result is a tuple containing one item per argument.

When there is no argument, the argument defaults to zero and the entire match is returned. Passing zero explicitly as groupN has the same effect.

When you specify a group number that is negative or larger than the number of groups in the pattern, an IndexError exception is raised.
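These rules for the arguments of group can be sketched as follows. The pattern and variable names are assumptions made for this illustration:

```python
import re

mobj = re.match(r"(\d+) (\w+)", "6 hours a day")

single = mobj.group(2)      # one argument: a string
pair = mobj.group(1, 2)     # several arguments: a tuple, one item per argument

raised = False
try:
    mobj.group(3)           # the pattern has only two groups
except IndexError:
    raised = True

print(single, pair, raised)
```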

Consider the code below in which there is no argument in group function which is equivalent to group(0).

import re
str = "Working 6 hours a day"
mobj = re.match(r'^.*', str)
print(mobj.group())
Working 6 hours a day

Here group() is used and you have the entire matched string.

Picking parts of matching texts

In the following example, group function is used with arguments to pick up matching groups:

import re
a = re.compile('(p(q)r)s')
b = a.match('pqrs')
print(b.group(0))
print(b.group(1))
print(b.group(2))
pqrs
pqr
q

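Here group(0) returns the entire match. group(1) returns the text matched by the first group, which is pqr, and group(2) returns the text matched by the second group, which is q. Groups are numbered by the position of their opening parenthesis.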

Named groups:

A named group is a capturing group that has a name. You can then refer to the group by its name instead of its number. Consider the example below:

import re
mobj = re.search(r'Hi (?P<name>\w+)', 'Hi Roger')
print(mobj.group('name'))
Roger
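Named groups can also be read all at once with groupdict(). A minimal sketch, assuming a second hypothetical group called age that is not part of the example above:

```python
import re

mobj = re.search(r"Hi (?P<name>\w+), you are (?P<age>\d+)", "Hi Roger, you are 30")

by_name = mobj.group("name")
by_number = mobj.group(1)      # named groups are still numbered
as_dict = mobj.groupdict()     # every named group, keyed by its name

print(by_name, by_number, as_dict)
```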

Non-capturing groups:

A non-capturing group is created using ?: after the opening parenthesis. A non-capturing group is used when you need to group part of a pattern but do not want to capture the content of the group.

import re
mobj = re.match("(?:[pqr])+", "pqr")
print(mobj.groups())
()
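For comparison, the same pattern can be sketched with a capturing group in place of the non-capturing one. The variable names are illustrative:

```python
import re

with_capture = re.match(r"([pqr])+", "pqr")
without_capture = re.match(r"(?:[pqr])+", "pqr")

# Both match the same text, but only the capturing version records a group.
# A repeated capturing group keeps the text of its last repetition.
print(with_capture.group(), with_capture.groups())
print(without_capture.group(), without_capture.groups())
```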