How to use RegEx in Python Programming

Validate data by matching with RegEx patterns

By Thiyari Sai Manikanth. Published 4 years ago. 9 min read.
Regular expressions are used to validate data. Many applications use them to validate form fields on the client side before a request moves through the server tiers of a three-tier architecture. Python supports regular expressions as well, and they are useful in data science projects. If you are new to Python programming or want to learn more about coding, the Python Training Certification course is available online. In this article, I give an overview of regular expressions in Python.

Python has a module named “re” to work with regular expressions. This module is imported as shown below.

import re

This module defines several functions and constants for working with RegEx. Let us now discuss a few of the functions it provides. To try the programs yourself, type the code into the “main.py” editor of an online Python compiler and press the “execute” button.

1. re.findall()

This method returns a list of strings containing all non-overlapping matches. Consider the example below.

Example:

import re

string = 'hello 12 hi 89. Howdy 34'
pattern = r'\d+'  # one or more digits

result = re.findall(pattern, string)
print(result)

Output:

['12', '89', '34']

Code Description:

The function re.findall() returns all matches of the pattern as a list. The text 'hello 12 hi 89. Howdy 34' is stored in the variable string, and the pattern \d+, which matches one or more digits, is stored in the variable pattern. Both variables are passed to re.findall(), which returns every run of digits in the string. The resulting list contains the strings '12', '89' and '34', and this list is printed to the output terminal.
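As a small sketch of how the return value changes with the pattern, the snippet below reuses the same input string. The second pattern and all variable names are illustrative additions, not part of the example above.

```python
import re

text = 'hello 12 hi 89. Howdy 34'

# No capture group: each list item is the full matched substring.
numbers = re.findall(r'\d+', text)

# Two capture groups: each list item is a tuple with one entry per group.
word_number_pairs = re.findall(r'([A-Za-z]+) (\d+)', text)

# No match at all: the result is an empty list, not None.
no_match = re.findall(r'\d+', 'no digits here')

print(numbers)
print(word_number_pairs)
print(no_match)
```

Because an empty list is falsy, the result can be tested directly with an if statement.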

2. re.split()

This method splits the string at each match and returns a list of the substrings between the matches.

Example 1:

import re

string = 'Thirty One:31 Eight:8.'
pattern = r'\d+'  # one or more digits

result = re.split(pattern, string)
print(result)

Output:

['Thirty One:', ' Eight:', '.']

Code Description:

The method splits the input string at every run of digits, since the pattern is \d+. The result is “Thirty One:”, “ Eight:” and “.”. If the pattern is not found, re.split() returns a list containing the original string, unsplit.
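A minimal check of the no-match case, using a second input string that is made up for this sketch:

```python
import re

pattern = r'\d+'

# The pattern matches twice, so the string is cut at both numbers.
with_digits = re.split(pattern, 'Thirty One:31 Eight:8.')

# The pattern never matches, so the whole string comes back as the only item.
without_digits = re.split(pattern, 'Thirty One, Eight')

print(with_digits)
print(without_digits)
```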

We can pass a maxsplit argument to re.split() to set the maximum number of splits to perform.

Example 2:

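import re

string = 'Thirty One:31 Eight:8. Eighty Nine:89'
pattern = r'\d+'  # one or more digits

# maxsplit is passed by keyword; newer Python versions deprecate passing it positionally
result = re.split(pattern, string, maxsplit=1)
print(result)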

Output:

['Thirty One:', ' Eight:8. Eighty Nine:89']

Code Description:

The default value of maxsplit is 0, meaning all possible splits. Here we have passed a maxsplit of 1, so the string is split only once, at the first match, and is divided into two substrings. The output is “Thirty One:” and “ Eight:8. Eighty Nine:89”.
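To compare several maxsplit values side by side, a short sketch (the results dictionary is only a convenience for this illustration):

```python
import re

text = 'Thirty One:31 Eight:8. Eighty Nine:89'

# Map each maxsplit value to the list that re.split() returns for it.
results = {limit: re.split(r'\d+', text, maxsplit=limit) for limit in (0, 1, 2)}

for limit, parts in results.items():
    print(limit, len(parts), parts)
```

With maxsplit=0 the list ends in an empty string, because the last match (89) sits at the very end of the input.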

3. re.sub()

The method returns a string where matched occurrences are replaced with the content of the replace variable. The syntax for using this method is as below.

Syntax:

re.sub(pattern, replace, string)

Example 1:

# Program to remove all whitespace
import re

# string written across two source lines; the trailing backslash joins them
string = 'apple 12\
banana 23 \n orange45 6'

# matches one or more whitespace characters
pattern = r'\s+'

# empty string
replace = ''

new_string = re.sub(pattern, replace, string)
print(new_string)

Output:

apple12banana23orange456

Code Description:

The above program matches every run of whitespace characters and replaces it with an empty string. If the pattern is not found, re.sub() returns the original string.

In the above example, the string literal is written across two source lines:

apple 12\
banana 23 \n orange45 6

The trailing backslash joins the two lines, so the only line break in the string is the \n escape. Once every run of whitespace has been replaced, the output is apple12banana23orange456.

You can pass count as a fourth argument to re.sub(). If omitted, it defaults to 0, which replaces all occurrences.

Example 2:

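# Program to remove only the first run of whitespace
import re

# string written across two source lines; the trailing backslash joins them
string = 'apple 12\
banana 23 \n orange45 6'

# matches one or more whitespace characters
pattern = r'\s+'

# empty string
replace = ''

# count is passed by keyword; newer Python versions deprecate passing it positionally
new_string = re.sub(pattern, replace, string, count=1)
print(new_string)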

Output:

apple12banana 23
 orange45 6

Code Description:

The above program passes a count of 1 to re.sub(), so only the first run of whitespace (the space between “apple” and “12”) is replaced with an empty string. The rest of the string, including the newline, is left unchanged.
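The three behaviours of re.sub() described in this section can be compared on one short string. This is a sketch with a made-up input; the \d{5} pattern is chosen only because it cannot match here.

```python
import re

text = 'apple 12 banana 23'

# count omitted: every run of whitespace is replaced.
all_removed = re.sub(r'\s+', '', text)

# count=1: only the first run of whitespace is replaced.
first_only = re.sub(r'\s+', '', text, count=1)

# Pattern that never matches: the original string is returned unchanged.
unchanged = re.sub(r'\d{5}', '', text)

print(all_removed)
print(first_only)
print(unchanged)
```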

4. re.subn()

re.subn() is similar to re.sub(), except that it returns a tuple of two items: the new string and the number of substitutions made.

Example:

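# Program to remove all whitespace
import re

# string written across two source lines; the trailing backslash joins them
string = 'apple 12\
banana 23 \n orange45 6'

# matches one or more whitespace characters
pattern = r'\s+'

# empty string
replace = ''

# new_string holds a tuple: (resulting string, number of substitutions)
new_string = re.subn(pattern, replace, string)
print(new_string)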

Output:

('apple12banana23orange456', 4)

Code Description:

The output is a tuple, enclosed in parentheses “()”. The first item is the string with all whitespace removed, and the second item, 4, is the number of substitutions made. The pattern \s+ matches runs of whitespace, and the string contains four such runs, so four substitutions are made.
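Since re.subn() returns a tuple, the two items can be unpacked into separate names. A minimal sketch with a made-up input string:

```python
import re

text = 'apple 12 banana 23'

# Unpack the tuple into the new string and the substitution count.
new_text, replacements = re.subn(r'\s+', '', text)

# When nothing matches, the string is unchanged and the count is 0.
same_text, zero = re.subn(r'\d{5}', '', text)

print(new_text, replacements)
print(same_text, zero)
```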

5. re.search()

This method takes two arguments, a pattern and a string. The method looks for the first location where the RegEx pattern produces a match with the string.

If the search is successful, re.search() returns a match object; if not, it returns None.

Example:

import re

string = "Python 3.7 is awesome"

# checks if Python is at the beginning
match = re.search(r'\APython', string)

if match:
    print("pattern found inside the string")
else:
    print("pattern not found")

Output:

pattern found inside the string

Code Description:

In the above example, re.search() checks whether the string contains a match for the pattern. The pattern “\APython” uses \A, which matches only at the start of the string, so it checks whether the string begins with “Python”. Because the string does begin with “Python”, re.search() returns a match object, the if branch runs, and the output is “pattern found inside the string”.
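To see both outcomes of re.search(), the sketch below runs an anchored pattern that matches, an anchored pattern that does not, and an unanchored one. The extra patterns are illustrative and not taken from the example above.

```python
import re

text = 'Python 3.7 is awesome'

# 'Python' is at the start of the string, so \A allows the match.
anchored_hit = re.search(r'\APython', text)

# 'awesome' exists in the string but not at the start, so the result is None.
anchored_miss = re.search(r'\Aawesome', text)

# Without \A the same word is found wherever it occurs.
unanchored = re.search(r'awesome', text)

print(anchored_hit is not None)
print(anchored_miss)
print(unanchored.group(), unanchored.start())
```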

Match object

You can list the methods and attributes of a match object with the dir() function. Some of the commonly used methods and attributes of match objects are:

1. match.group()

The group() method returns the part of the string where there is a match.

Example:

import re

string = '39801 356, 2102 1111'

# Four digit number followed by space followed by three digit number
pattern = r'(\d{4}) (\d{3})'

# match variable contains a Match object.
match = re.search(pattern, string)

if match:
    print(match.group())
else:
    print("pattern not found")

Output:

9801 356

Code Description:

The pattern ‘(\d{4}) (\d{3})’ is searched for in the string '39801 356, 2102 1111'. The first match is “9801 356”: the four digits “9801”, a space, and the three digits “356”. The leading “3” is skipped because “3980” is followed by “1” rather than a space.

Our pattern (\d{4}) (\d{3}) has two subgroups, (\d{4}) and (\d{3}). You can retrieve the part of the string matched by each parenthesized subgroup. Here's how:

>>> match.group(1)
'9801'
>>> match.group(2)
'356'
>>> match.group(1, 2)
('9801', '356')
>>> match.groups()
('9801', '356')
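Subgroups can also be given names with the (?P<name>...) syntax of the re module. The sketch below is an optional variation on the pattern above; the group names "first" and "second" are made up for illustration.

```python
import re

# The group names "first" and "second" are hypothetical, chosen for this sketch.
pattern = r'(?P<first>\d{4}) (?P<second>\d{3})'
match = re.search(pattern, '39801 356, 2102 1111')

if match:
    whole = match.group(0)           # same as match.group()
    by_number = match.group(1)       # subgroup selected by position
    by_name = match.group('first')   # same subgroup selected by name
    named = match.groupdict()        # all named subgroups as a dict
    print(whole, by_number, by_name, named)
else:
    print("pattern not found")
```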

2. match.start()

The start() function returns the index of the start of the matched substring.

Example:

import re

string = '39801 356, 2102 1111'

# Four digit number followed by space followed by three digit number
pattern = r'(\d{4}) (\d{3})'

# match variable contains a Match object.
match = re.search(pattern, string)

if match:
    print(match.start())
else:
    print("pattern not found")

Output:

1

Code Description:

The code is the same as in the previous example, except that it calls match.start(). The output is 1, the index of the digit ‘9’ where the match begins.

3. match.end()

The end() function returns the end index of the matched substring.

Example:

import re

string = '39801 356, 2102 1111'

# Four digit number followed by space followed by three digit number
pattern = r'(\d{4}) (\d{3})'

# match variable contains a Match object.
match = re.search(pattern, string)

if match:
    print(match.end())
else:
    print("pattern not found")

Output:

9

Code Description:

As in the previous example, the match is “9801 356”. match.end() returns 9, which is one past the index of the last matched character: the digit ‘6’ sits at index 8.

4. match.span()

The span() function returns a tuple containing the start and end index of the matched part.

Example:

import re

string = '39801 356, 2102 1111'

# Four digit number followed by space followed by three digit number
pattern = r'(\d{4}) (\d{3})'

# match variable contains a Match object.
match = re.search(pattern, string)

if match:
    print(match.span())
else:
    print("pattern not found")

Output:

(1, 9)

Code Description:

In the above example, the match starts at index 1 (the digit ‘9’) and ends just before index 9 (the digit ‘6’ is at index 8). match.span() returns these two values as the tuple (1, 9).
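One way to see what the two numbers mean is to use them as a slice of the original string. A short sketch with illustrative variable names:

```python
import re

text = '39801 356, 2102 1111'
match = re.search(r'(\d{4}) (\d{3})', text)

if match:
    start, end = match.span()

    # Slicing with the span gives back exactly the matched text.
    sliced = text[start:end]

    # The end index is exclusive, so the last matched character is at end - 1.
    last_char = text[end - 1]

    # span() also accepts a group number and reports that subgroup's position.
    second_group_span = match.span(2)

    print(sliced, last_char, second_group_span)
```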

R Prefix before RegEx

When the r or R prefix is placed before a string literal, the literal is a raw string. For example, '\n' is a newline, whereas r'\n' is two characters: a backslash \ followed by n.

The backslash \ is used to escape various characters, including all metacharacters. With the r prefix, Python treats \ in the literal as a normal character, so the backslash reaches the re module unchanged.

Example:

import re

string = '\n and \r are escape sequences, \n is new line.'

result = re.findall(r'[\n\r]', string)
print(result)

Output:

['\n', '\r', '\n']

Code Description:

In the above program, re.findall() returns every match of the pattern r'[\n\r]' in the string. The string literal is not raw, so it contains a real newline, a carriage return and another newline. The pattern is a character class that matches either of those characters, so all three are returned in the list.
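The effect of the r prefix is easiest to see with \b, which Python reads as a backspace character in a normal string but which means a word boundary to the re module. The sketch below uses a made-up sentence to show the difference.

```python
import re

# A normal string turns \n into one character; a raw string keeps both characters.
plain_len = len('\n')
raw_len = len(r'\n')

text = 'cat concat cat'

# Raw string: re receives \b and treats it as a word boundary.
word_hits = re.findall(r'\bcat\b', text)

# Normal string: Python converts \b to a backspace character before re sees it,
# so the pattern looks for backspace characters that are not in the text.
backspace_hits = re.findall('\bcat\b', text)

print(plain_len, raw_len)
print(word_hits)
print(backspace_hits)
```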

Conclusion:

Regular expressions make validation easy: you describe the expected pattern and match the data against it. All you need to know is which pattern syntax and which re functions to apply.
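As a closing sketch of validation, the snippet below checks a form field against a pattern. The postal code rule, the function name and the sample values are all assumptions made for illustration. It uses re.compile() and fullmatch(), two parts of the re module not covered above, because validation usually needs the whole value to match rather than a part of it.

```python
import re

# Hypothetical rule: five digits, optionally followed by a hyphen and four digits.
POSTAL_CODE = re.compile(r'\d{5}(-\d{4})?')


def is_valid_postal_code(value):
    """Return True only if the entire value matches the pattern."""
    return POSTAL_CODE.fullmatch(value) is not None


for candidate in ['12345', '12345-6789', '1234', '12345 ', 'ab12345']:
    print(repr(candidate), is_valid_postal_code(candidate))
```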

About the Creator

Thiyari Sai Manikanth

Hi, I am currently working at HKR Trainings. I am passionate about researching technical skills and sharing that knowledge with readers. I have experience in writing content and publishing technical documents.
