SD 212 Spring 2024 / Notes


Unit 3: Regular expressions

1 Overview

In this unit, we will look at regular expressions, a powerful tool for investigating, categorizing, and manipulating data (especially text-based data). Regular expressions are a mini language of their own, and they are supported in most modern programming languages, including Python and bash, as well as in common command-line tools.

We will gain some proficiency and comfort with using regular expressions both in Python, with the re module, and in command-line tools such as grep, find, and sed.

2 Resources

2.1 Regex in bash

  • TLCL, Chapter 19: Regular Expressions (Required reading)

    How to write regexes in bash, mostly using the grep and find commands as examples.

  • TLCL, Chapter 20: Text Processing

    A broader view of handling text files at the command line, including important commands like tr, sed and diff.

    Focus especially on the section “Editing on the fly” which discusses the use of tr and sed.

  • ABSG, Chapter 18: Regular Expressions

    A quick overview of the most common regex operations in bash. Very useful as a reference once you understand the basics.

2.2 Regex in Python

3 Regex examples

Regex          Example matches                  Example mismatches        Notes
mid            “mid”                            anything else
m.d            “mid”, “mad”                     “md”, “mild”
m.*d           “mid”, “mild”, “md”, “my bad”
\.*            “”, “......”                     “x”
\bmid          “mid”, “middle”, “#mids”         “humid”
^mid           “mid”, “middle”                  “#mids”, “sponsor mid”
mid$           “mid”, “humid”, “sponsor mid”    “middle”

ren|stimpy     “ren”, “stimpy”                  anything else             requires -E for grep and sed
[efv]at        “vat”, “eat”, “fat”              anything else
player[1-5]    “player1”, “player3”             “player”, “player-”
a(bc|de*|)f    “af”, “abcf”, “adeeef”           “abcdef”                  requires -E
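
Several rows of the table can be checked mechanically in Python. The following is a minimal sketch, assuming the examples are read with search semantics (a match anywhere in the string, which is what grep and re.search do). The names ROWS and check_rows are made up for this illustration. Under search semantics a pattern such as \.* succeeds on every string, because it is allowed to match zero characters, so the last lines use re.fullmatch to show the whole-string reading for that row.

```python
import re

# (pattern, strings expected to match, strings expected not to match)
# Rows are copied from the table above. Rows whose mismatch column says
# "anything else" are left out, since that cannot be listed exhaustively.
ROWS = [
    (r'm.d', ['mid', 'mad'], ['md', 'mild']),
    (r'm.*d', ['mid', 'mild', 'md', 'my bad'], []),
    (r'\bmid', ['mid', 'middle', '#mids'], ['humid']),
    (r'^mid', ['mid', 'middle'], ['#mids', 'sponsor mid']),
    (r'mid$', ['mid', 'humid', 'sponsor mid'], ['middle']),
    (r'player[1-5]', ['player1', 'player3'], ['player', 'player-']),
    (r'a(bc|de*|)f', ['af', 'abcf', 'adeeef'], ['abcdef']),
]

def check_rows(rows):
    """Return the (pattern, string) pairs that disagree with the table."""
    failures = []
    for pattern, matches, mismatches in rows:
        for s in matches:
            if re.search(pattern, s) is None:
                failures.append((pattern, s))
        for s in mismatches:
            if re.search(pattern, s) is not None:
                failures.append((pattern, s))
    return failures

print(check_rows(ROWS))  # an empty list means every row agrees

# \.* may match zero characters, so re.search succeeds on any string;
# re.fullmatch requires the entire string to consist of dots.
print(re.search(r'\.*', 'x') is not None)
print(re.fullmatch(r'\.*', 'x') is not None)
print(re.fullmatch(r'\.*', '......') is not None)
```

Python's re module supports alternation with | and grouping with parentheses directly, so there is no counterpart to the -E flag in the last column.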

4 Bash examples

  • Find lines that are not empty

    grep '.'
  • Find lines that contain any 4-letter word

    grep '\b\w\w\w\w\b'
  • Fix some profanity

    sed 's/damn\b/dang/g'
  • Indent every line 4 spaces

    sed 's/^/    /'
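
For comparison, the four commands above could be approximated in Python with the re module. This is only a sketch under the assumption that the input has already been split into lines without trailing newlines; the function names are invented for this example and are not part of any library.

```python
import re

def non_empty_lines(lines):
    """Like grep '.': keep lines containing at least one character."""
    return [line for line in lines if re.search(r'.', line)]

def lines_with_four_letter_word(lines):
    """Like grep '\\b\\w\\w\\w\\w\\b': keep lines with a 4-character word."""
    return [line for line in lines if re.search(r'\b\w\w\w\w\b', line)]

def soften(line):
    """Like sed 's/damn\\b/dang/g': replace every 'damn' that ends a word."""
    return re.sub(r'damn\b', 'dang', line)

def indent(line):
    """Like sed 's/^/    /': insert 4 spaces at the start of the line."""
    return re.sub(r'^', '    ', line)

sample = ['this is fine', '', 'damn it, damned if I know']
print(non_empty_lines(sample))
print(lines_with_four_letter_word(sample))
print([indent(soften(line)) for line in sample])
```

re.sub replaces every match by default, which corresponds to the g flag in the sed command. The indent function works one line at a time, as sed does, rather than using re.MULTILINE on a whole file's text.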

5 Python examples

  • Find all four-letter words in a sentence

    import re
    
    sentence = input("Enter a sentence: ")
    
    print("4-letter words used:")
    for word in re.findall(r'\b\w\w\w\w\b', sentence):
        print(word)
  • Python program which does the equivalent of a basic grep

    import re
    
    regex_string = input("Enter a regex: ")
    filename = input("Enter a filename: ")
    
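    # Open the file in a with block so that it is closed automatically
    with open(filename) as fh:
        for line in fh:
            line = line.rstrip('\n')
            # re.search looks for a match anywhere in the line, as grep does
            if re.search(regex_string, line) is not None:
                print(line)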
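
The matching logic of the basic grep can also be separated from the input and output so that it is easier to test. The sketch below is one possible arrangement, not the only one: grep_lines is a hypothetical helper name, and converting re.error into ValueError is a design choice made for this example.

```python
import re

def grep_lines(regex_string, lines):
    """Return the lines that contain a match for regex_string."""
    try:
        # Compile once so the pattern is not re-parsed for every line
        pattern = re.compile(regex_string)
    except re.error as err:
        # A user-typed regex may be malformed, e.g. an unclosed "("
        raise ValueError(f"invalid regex {regex_string!r}: {err}")
    return [line for line in lines if pattern.search(line)]

sample = ["humid day", "the middle", "nothing here"]
print(grep_lines(r'\bmid', sample))
```

With a file, the same function could be called as grep_lines(regex_string, (line.rstrip('\n') for line in fh)) inside a with block.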