Python regular expressions in 5 minutes

An introduction and a quick reference for your work.

Regular expressions in Python are a fundamental brick in the wall of our data science skills. They let us select text quickly and simply and use it in our daily work.

Imagine you want to select some recurring text from a huge page of sentences. For example, in a sentiment analysis of tweets you might want to select all the user IDs, such as “@myname”, or all the hashtags in a particular tweet, such as “#datascience” or “#regex”.

In Python this is possible through the “re” module of the standard library (the third-party “regex” package offers a compatible interface). It allows us to select, match and substitute recognized patterns in our text. Let’s go straight to the point!

With regex we recognize each letter, symbol or pattern using special characters; the main ones are listed below.

  • . (dot) : Matches any character except a newline.
  • \d : Matches any decimal digit, so the numbers [0-9].
  • \s : Represents a whitespace character (space, tab or newline); it is useful when we want to select more than one word.
  • \w : Matches any word character, so the letters [a-zA-Z], the numbers [0-9] and the underscore.
  • \D \S \W : The uppercase versions indicate the opposite of the lowercase symbols, so they match a non-digit, a non-whitespace and a non-word character.
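
As a quick illustration of the character classes listed above, here is a minimal sketch using the standard library `re` module. The sample strings and variable names are made up for this example.

```python
import re

# '.' matches any character except a newline
any_char = re.findall(r".", "a\nb")        # ['a', 'b']

# '\d' matches a single decimal digit
digits = re.findall(r"\d", "a1b2")         # ['1', '2']

# '\D' is the opposite of '\d': any non-digit character
non_digits = re.findall(r"\D", "a1b2")     # ['a', 'b']

# '\s' matches whitespace characters such as space and tab
spaces = re.findall(r"\s", "a b\tc")       # [' ', '\t']

# '\w' matches letters, digits and the underscore, but not punctuation
word_chars = re.findall(r"\w", "a_1 !")    # ['a', '_', '1']

print(any_char, digits, non_digits, spaces, word_chars)
```

Each call returns one list entry per matched character, because no quantifier has been applied yet.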

All these symbols select characters in a sentence, but they are of limited use without repetition quantifiers, which let us refine the selection, for example by giving the number of digits or the number of times a particular character is repeated. A quantifier is placed right after a regex element and refers to the element immediately preceding it. The main quantifiers are:

  • * : Matches zero or more repetitions of the preceding regex element; for example, ‘ab*’ matches ‘a’ followed by any number of ‘b’s (‘a’, ‘ab’, ‘abb’, etc.).
  • + : Matches one or more repetitions of the preceding regex element.
  • {n}; {n,m} : Exactly n repetitions, or from n to m repetitions, of the preceding element.
  • ? : Zero or one repetition of the preceding element.
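
The quantifiers above can be compared side by side on the same input. This is a small sketch with an invented sample string; each pattern is ‘a’ followed by a differently quantified ‘b’.

```python
import re

sample = "a ab abb abbb"

# '*': zero or more 'b' after the 'a'
star = re.findall(r"ab*", sample)        # ['a', 'ab', 'abb', 'abbb']

# '+': at least one 'b' after the 'a'
plus = re.findall(r"ab+", sample)        # ['ab', 'abb', 'abbb']

# '{2}': exactly two 'b'; in 'abbb' only the first two are consumed
exact = re.findall(r"ab{2}", sample)     # ['abb', 'abb']

# '{1,2}': between one and two 'b'
ranged = re.findall(r"ab{1,2}", sample)  # ['ab', 'abb', 'abb']

# '?': zero or one 'b'
optional = re.findall(r"ab?", sample)    # ['a', 'ab', 'ab', 'ab']

print(star, plus, exact, ranged, optional)
```

Note that the quantifier applies only to the ‘b’, the element right before it, not to the whole ‘ab’.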

There are also other special characters that determine the position of the pattern within the string, define sets of characters, or escape special characters. For instance, we can decide whether we want the match at the beginning or at the end of the string.

  • ^ : Matches the start of the string.
  • $ : Matches the end of the string.
  • [] : Used to select any character in a particular set, for example all the uppercase letters ([A-Z]).
  • \ : If we want to match a special character literally, like ‘$’, we have to put the ‘\’ symbol before it. So to select a dollar sign we have to write ‘\$’.
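
Here is a short sketch of the anchors, sets and escaping described above. The sample line and variable names are invented for illustration.

```python
import re

line = "Total: $42 USD"

# '^' anchors the pattern at the start of the string
starts_with_total = re.search(r"^Total", line) is not None   # True

# '$' anchors the pattern at the end of the string
ends_with_usd = re.search(r"USD$", line) is not None         # True

# '[A-Z]' matches any single uppercase letter
uppercase = re.findall(r"[A-Z]", line)    # ['T', 'U', 'S', 'D']

# '\$' matches a literal dollar sign
price = re.findall(r"\$\d+", line)        # ['$42']

# Unescaped, '$' means "end of string", so no digits can follow it
unescaped = re.findall(r"$\d+", line)     # []

print(starts_with_total, ends_with_usd, uppercase, price, unescaped)
```

Writing patterns as raw strings (the `r"..."` prefix) keeps Python from interpreting the backslashes before the regex engine sees them.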

Finally, there are the functions of the module, which are used to select, substitute or find a pattern in the text. We will explore them through some examples.

Now we finally have all the basic skills to try some examples and exercises with the module. First of all, import it:

import re  # standard library module; 'import regex as re' also works if the third-party regex package is installed

The first function we will use selects all the occurrences of a pattern in a text: ‘findall’. Imagine you have a string with names and telephone numbers and you want to select only the latter. In this case we will use the special character ‘\d’.

phone_numbers = 'Luke: 4-567-123-6789 Mike: 97-567-78-54376'
# Four groups of digits separated by hyphens; the raw string keeps the backslashes intact
re.findall(r"\d{1,2}-\d{3}-\d{2,3}-\d{4,5}", phone_numbers)
# output: ['4-567-123-6789', '97-567-78-54376']

But let’s come back to the first paragraph: what if we are conducting an analysis and we want to select the usernames and the hashtags from a Twitter post?

On Twitter, a username is valid if:

  • It starts with a ‘@’, so in regex ‘^@’.
  • It contains only alphanumeric characters (letters A-Z, numbers 0-9). In this case we can use either ‘[A-Za-z0-9]’ or ‘[\w\d]’ (note that ‘\w’ already includes the digits and also matches the underscore). We use the brackets to define a set to which we then apply a quantifier.
  • It is between 4 and 15 characters long, so {4,15}.

You can use the re.match() or re.search() functions to find a match in your string: re.match() looks for the pattern only at the beginning of the string, while re.search() scans the whole string and returns the first match.
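
The three username rules can be combined into one pattern. The sketch below is one possible way to do it, not necessarily the original author's code; `USERNAME`, `is_valid_username` and the sample tweet are hypothetical names introduced for this example.

```python
import re

# '@' followed by 4 to 15 letters or digits (the rules listed above)
USERNAME = r"@[A-Za-z0-9]{4,15}"

def is_valid_username(handle):
    """Check a single handle: anchors force the pattern to span the whole string."""
    return re.match(r"^" + USERNAME + r"$", handle) is not None

tweet = "Thanks @myname and @bob for the #regex tips"

# re.match only looks at the start of the string, so it finds nothing here
at_start = re.match(USERNAME, tweet)     # None

# re.search scans the whole string and returns the first match
first = re.search(USERNAME, tweet)

print(is_valid_username("@myname"))      # True
print(is_valid_username("@bob"))         # False, only 3 characters
print(first.group())                     # @myname
```

The anchored version suits validating one handle at a time, while the unanchored pattern with re.search() suits locating a handle inside a longer text.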

To select all the hashtags we use the ‘#’ symbol at the beginning and then an undefined number of digits or word characters, so: ‘#[\w\d]+’.
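
A minimal sketch of the hashtag pattern with ‘findall’, using an invented tweet:

```python
import re

tweet = "Learning #datascience and #regex in 5 minutes #100DaysOfCode"

# '#' followed by one or more word characters or digits
hashtags = re.findall(r"#[\w\d]+", tweet)

# '\w' already covers digits, so this shorter pattern gives the same result
hashtags_short = re.findall(r"#\w+", tweet)

print(hashtags)   # ['#datascience', '#regex', '#100DaysOfCode']
```

The standalone ‘5’ is not selected because it is not preceded by ‘#’.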

The last function we explore substitutes a particular expression in a text. It is ‘re.sub(pattern, replacement, string)’, which returns a copy of the string where every match of the pattern is replaced by the replacement text.
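
As an illustration, here is a small sketch of re.sub() that anonymizes usernames and masks digits. The sample strings and variable names are assumptions made for this example.

```python
import re

tweet = "Thanks @myname and @datafan for the tips"

# Replace every username with a generic placeholder
anonymized = re.sub(r"@\w+", "@user", tweet)
# 'Thanks @user and @user for the tips'

# The optional count argument limits the number of replacements
only_first = re.sub(r"@\w+", "@user", tweet, count=1)
# 'Thanks @user and @datafan for the tips'

# Mask every digit of a phone number
masked = re.sub(r"\d", "X", "Luke: 4-567-123-6789")
# 'Luke: X-XXX-XXX-XXXX'

print(anonymized, only_first, masked, sep="\n")
```

The original string is never modified; re.sub() always returns a new one.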

How to use the re.sub() method.

Regular expressions are a powerful and useful tool to select, substitute and match particular text expressions in a long and complicated text. This is a quick introduction and cheatsheet; I append some links to help you improve your skills with regular expressions.
