Unit 4: Regular expressions
1 Overview
In this unit, we will look at a very powerful tool called regular expressions, which we will use to investigate, categorize, and manipulate data (especially text-based data). Regular expressions are a mini language of their own, and they are embedded in most modern programming languages, including Python and bash, as well as in common command-line tools.
We will gain some proficiency and comfort with using regular expressions both in Python with the `re` module and in command-line tools such as `grep`, `find`, and `sed`.
2 Resources
2.1 Regex in bash
TLCL, Chapter 19: Regular Expressions
(Required reading)
How to write regexes in bash, mostly focusing on the `grep` and `find` commands for examples.
TLCL, Chapter 20: Text Processing
A broader view of handling text files at the command line, including important commands like `tr`, `sed`, and `diff`.
Focus especially on the section “Editing on the fly”, which discusses the use of `tr` and `sed`.
ABSG, Chapter 18: Regular Expressions
A quick overview of the most common regex operations in bash. Very
useful as a reference once you understand the basics.
2.2 Regex in Python
P4E, Chapter 11: Regular Expressions
(Required reading)
A fantastic overview of how to use the `re` module.
P4DA, Section 7.3: String Manipulation
A broader view of string operations that includes a brief discussion
of regular expressions, all focusing on the kinds of operations most
useful for data science.
3 Regex examples
Regex | Example matches | Example mismatches | Notes |
---|---|---|---|
`mid` | “mid” | anything else | |
`m.d` | “mid”, “mad” | “md”, “mild” | |
`m.*d` | “mid”, “mild”, “md”, “my bad” | | |
`\.*` | “”, “......” | “x” | |
`\bmid` | “mid”, “middle”, “#mids” | “humid” | |
`^mid` | “mid”, “middle” | “#mids”, “sponsor mid” | |
`mid$` | “mid”, “humid”, “sponsor mid” | “middle” | |
`ren\|stimpy` | “ren”, “stimpy” | anything else | requires `-E` for `grep` and `sed` |
`[efv]at` | “vat”, “eat”, “fat” | anything else | |
`player[1-5]` | “player1”, “player3” | “player”, “player-” | |
`a(bc\|de*\|)f` | “af”, “abcf”, “adeeef” | “abcdef” | requires `-E` |
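A few of these rows can be checked directly in Python. The sketch below is only an illustration: the `cases` list is a hypothetical selection of table entries, tested with `re.search`, which looks for the pattern anywhere in the text (much as `grep` does for each line). It also suggests one reading of the `mid` and `\.*` rows: their listed mismatches hold when the pattern must cover the whole string, which `re.fullmatch` checks.

```python
import re

# Each entry: (pattern, text, expected result of re.search on the text).
# The entries are a hypothetical selection of rows from the table.
cases = [
    (r"m.d", "mad", True),
    (r"m.d", "md", False),
    (r"m.*d", "my bad", True),
    (r"\bmid", "#mids", True),
    (r"\bmid", "humid", False),
    (r"^mid", "sponsor mid", False),
    (r"mid$", "humid", True),
    (r"ren|stimpy", "stimpy", True),
    (r"player[1-5]", "player-", False),
    (r"a(bc|de*|)f", "adeeef", True),
    (r"a(bc|de*|)f", "abcdef", False),
]

for pattern, text, expected in cases:
    found = re.search(pattern, text) is not None
    print(f"{pattern!r:16} {text!r:14} search={found}")
    assert found == expected

# Searching finds "mid" inside "humid"; a whole-string match does not.
print(re.search(r"mid", "humid") is not None)     # True
print(re.fullmatch(r"mid", "humid") is not None)  # False

# \.* can match the empty string, so a search succeeds on any text.
print(re.search(r"\.*", "x") is not None)         # True
print(re.fullmatch(r"\.*", "x") is not None)      # False
```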
4 Bash examples
Find lines that are not empty
grep '.'
Find lines that contain any 4-letter word
grep '\b\w\w\w\w\b'
Fix some profanity
sed 's/damn\b/dang/g'
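For comparison, a rough Python counterpart of this substitution can be sketched with `re.sub`, which replaces every match by default (similar to the `g` flag in `sed`). The function name `soften` is a hypothetical helper introduced only for this illustration.

```python
import re

def soften(text):
    # Replace "damn" only where a word boundary follows it,
    # so words that merely start with "damn" are left alone.
    return re.sub(r"damn\b", "dang", text)

print(soften("damn, that was a damn good game"))  # dang, that was a dang good game
print(soften("damnation"))                        # damnation
```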
Indent every line 4 spaces
sed 's/^/    /'
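The same idea can be sketched in Python. Because `sed` applies its command to one line at a time, the hypothetical helper `indent_lines` below does the same, substituting at the `^` anchor of each line separately. The `width` parameter is an assumption added for illustration.

```python
import re

def indent_lines(lines, width=4):
    # ^ matches the empty position at the start of each line,
    # so the substitution inserts the padding there.
    return [re.sub(r"^", " " * width, line) for line in lines]

for line in indent_lines(["alpha", "", "beta"]):
    print(repr(line))
```

As with the `sed` command, empty lines also receive the padding.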
5 Python examples
Find all four-letter words in a sentence

    import re

    sentence = input("Enter a sentence: ")
    print("4-letter words used:")
    # \b marks a word boundary; \w matches one letter, digit, or underscore
    for word in re.findall(r'\b\w\w\w\w\b', sentence):
        print(word)
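A variant of this example, sketched as a function so it can be run without typing input: the hypothetical `four_letter_words` uses the quantifier `\w{4}` as shorthand for `\w\w\w\w`. Note that `\w` also matches digits and underscores, so "word" here is looser than in everyday usage.

```python
import re

def four_letter_words(sentence):
    # \w{4} means exactly four word characters; the \b on each side
    # prevents matching four characters inside a longer word.
    return re.findall(r"\b\w{4}\b", sentence)

print(four_letter_words("This unit will look at some very handy text tools"))
# ['This', 'unit', 'will', 'look', 'some', 'very', 'text']

print(four_letter_words("room 1234"))
# ['room', '1234']
```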
Python program that does the equivalent of a basic grep

    import re

    regex_string = input("Enter a regex: ")
    filename = input("Enter a filename: ")
    # The with statement closes the file when the loop finishes
    with open(filename) as fh:
        for line in fh:
            line = line.rstrip('\n')
            match = re.search(regex_string, line)
            if match is not None:
                print(line)
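The matching logic can also be separated from the file handling, which makes it easier to try out on a small list of strings. The sketch below is one possible arrangement, not a required design: `grep_lines` and `sample` are hypothetical names, the pattern is compiled once with `re.compile`, and an invalid pattern is reported as a `ValueError`.

```python
import re

def grep_lines(regex_string, lines):
    """Return the lines that contain a match for regex_string."""
    try:
        pattern = re.compile(regex_string)
    except re.error as err:
        # re.compile raises re.error for malformed patterns such as "("
        raise ValueError(f"invalid regex {regex_string!r}: {err}") from err
    return [line for line in lines if pattern.search(line)]

sample = ["mid", "humid", "middle", "mad"]
print(grep_lines(r"^mid", sample))  # ['mid', 'middle']
print(grep_lines(r"mid$", sample))  # ['mid', 'humid']
```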