SD 212 Spring 2023 / Notes


This is the archived website of SD 212 from the Spring 2023 semester. Feel free to browse around; you may also find more recent offerings at my teaching page.

Unit 2: Command line

1 Overview

As data scientists, we have to be comfortable handling files! So far you have written programs to deal with text files, csv files, tab-separated files, a few image formats, and zip files.

Some datasets we may want to handle consist of a single very large file, and some consist of many (sometimes hundreds or thousands of) small files. Either way, we need the ability to quickly and easily perform basic tasks with these files: seeing how long they are, what they are called, what format they are in, renaming or combining them, etc.

2 Reference

3 Concepts you should understand

  • The structure of a command line: prompt, command, options, arguments
  • How files are organized into directories and sub-directories
  • What the home directory and root directory are
  • What it means to talk about the “type” of a file, and how this relates to the filename extension
  • The three streams (standard in, standard out, standard error) that every process has
  • Redirecting process input/output to/from files
  • Piping process input/output to other processes
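
To make the three streams concrete, here is a minimal sketch. It assumes a throwaway directory created with mktemp, and the file names (present.txt, missing.txt, out.txt, err.txt) are hypothetical. One cat command writes to both standard out and standard error, and each stream is captured in its own file.

```shell
# Work in a scratch directory so nothing real is touched.
workdir=$(mktemp -d)
cd "$workdir"
printf 'apple\nbanana\ncherry\n' >present.txt

# cat sends file contents to standard out and complaints to standard error.
# ">" captures standard out in out.txt; "2>" captures standard error in err.txt.
# missing.txt does not exist, so cat fails and the echo after || runs.
cat present.txt missing.txt >out.txt 2>err.txt || echo "cat failed with exit status $?"

# "<" connects a file to standard in, so wc never sees a file name.
out_lines=$(wc -l <out.txt)
err_lines=$(wc -l <err.txt)

# "|" connects standard out of one process to standard in of the next.
second=$(cat present.txt | head -n2 | tail -n1)

echo "lines captured from standard out: $out_lines"
echo "lines captured from standard error: $err_lines"
echo "second line of present.txt: $second"
```

The error message about missing.txt never appears on the terminal and never mixes into out.txt, because standard error was redirected separately.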

4 Commands you should be familiar with

  • cd
  • ls
  • pwd
  • cat
  • head
  • tail
  • wc
  • touch
  • tr
  • mv
  • cp
  • rm
  • Redirection operators: <, >, >>, 2>
  • Piping operator: |
  • type
  • man
  • help
  • whatis
  • apropos
  • curl
  • wget
  • tar
  • zip
  • unzip
  • grep
  • sort
  • uniq
  • find
  • sed
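
Several commands in this list (tr, sed, sort, uniq) are mostly useful in combination. As a rough sketch, the pipeline below counts how often each word occurs in a short piece of text. The sample sentence and the variable names are made up for illustration.

```shell
# Hypothetical sample text; the same pipeline works on any text.
text='The goat saw the GOAT. The end.'

# tr -s ' ' '\n'  turns each run of spaces into one newline (one word per line)
# tr 'A-Z' 'a-z'  lowercases everything
# sed             deletes periods and commas
# sort            puts identical words next to each other
# uniq -c         counts each run of identical adjacent lines
# sort -rn        sorts numerically, largest count first
counts=$(echo "$text" | tr -s ' ' '\n' | tr 'A-Z' 'a-z' | sed 's/[.,]//g' | sort | uniq -c | sort -rn)
echo "$counts"

# uniq -c pads the count with spaces; squeeze them so cut can pick the word.
top_word=$(echo "$counts" | head -n1 | tr -s ' ' | cut -d' ' -f3)
echo "most common word: $top_word"
```

The sort before uniq matters: uniq only merges lines that are adjacent, so unsorted input would leave repeated words counted separately.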

5 Some examples you should understand

5.1 Basic command usage

  • Get help on a normal command or program

    man cat
    
    # or, just listing the options
    cat --help
    
    # or, just showing some common usages
    tldr cat
  • Get help on a built-in bash command

    help for
  • List detailed info on all txt files in sd212 directory

    ls -l ~/sd212/*.txt
  • Go to home directory

    cd
  • Go to sd212 directory

    cd sd212
  • Show contents of file.txt on the terminal

    cat file.txt
  • Combine in1.txt and in2.txt, save combined file as out.txt

    cat in1.txt in2.txt >out.txt
  • Show only lines of file.txt that mention soccer (upper or lowercase)

    grep -i soccer file.txt
  • Extract the 3rd column of a csv file

    cut -d',' -f3 data.csv
  • Display a message

    echo "Hello world"
  • Create a new subdirectory called images under the current directory

    mkdir images
  • Move all jpg files to an images subdirectory

    mv *.jpg images/
  • Create file2.txt which is a complete copy of file1.txt

    cp file1.txt file2.txt
  • Create directory folder2 which has a complete copy of everything in folder1

    cp -R folder1 folder2
  • Delete file temp.txt

    rm temp.txt
  • Delete the folder garbage and everything inside it

    rm -r garbage
  • Delete the folder garbage without having to say “yes” over and over (POTENTIALLY DANGEROUS!)

    rm -rf garbage
  • Find all txt files under the current directory

    find . -name '*.txt'
  • Find all subdirectories of /etc/systemd

    find /etc/systemd -type d
  • Display the number of words in each text file

    find . -name '*.txt' -exec 'wc' '-w' '{}' ';'
  • Create a new empty file

    touch empty.txt
  • Download the SD212 word cloud header from this web site

    wget 'https://usna.edu/Users/cs/roche/212/scripts/header.png'
  • Extract a tarball

    tar -xzvf something.tgz
  • Extract a zip file

    unzip something.zip
  • Decompress an xz file

    unxz something.xz
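
The examples above only extract archives. As a hedged sketch of the full round trip, the commands below create a tarball, list its contents, and extract it somewhere else. The folder and file names are hypothetical, and diff (not in the command list above) is used only to check the result.

```shell
# Work in a scratch directory so nothing real is touched.
workdir=$(mktemp -d)
cd "$workdir"

# Build a small folder to archive.
mkdir project
echo "first file" >project/a.txt
echo "second file" >project/b.txt

# c = create, z = compress with gzip, f = the archive file name comes next.
tar -czf project.tgz project

# t = list what is inside the archive without extracting anything.
listing=$(tar -tzf project.tgz)
echo "$listing"

# x = extract; -C chooses the directory to extract into.
mkdir restored
tar -xzf project.tgz -C restored

# diff -r prints nothing and succeeds when two folders have identical contents.
diff -r project restored/project && echo "round trip OK"
```

Listing with -t before extracting is a cheap way to see whether an unfamiliar tarball will unpack into its own folder or scatter files into the current directory.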

5.2 Variables and loops

  • Define and/or set a variable

    name="Bill the Goat"
  • Refer to a variable

    echo "My name is $name."
  • For-each loop

    for file in *.txt
    do
      echo -n "last line of $file: "
      tail -n1 "$file"
    done
    
    for folder in */
    do
      echo -n "number of files in folder $folder: "
      ls "$folder" | wc -l
    done
  • Command substitution

    echo "book.txt has $(wc -l <book.txt) lines in it"
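
The pieces above (a variable, a for-each loop, command substitution) can be combined. The sketch below totals the line counts of all txt files in a scratch directory. The file names are made up, and the $(( )) arithmetic syntax is an addition that does not appear elsewhere in these notes.

```shell
# Work in a scratch directory with two small hypothetical files.
workdir=$(mktemp -d)
cd "$workdir"
printf 'one\ntwo\n' >'my notes.txt'
printf 'three\n' >plain.txt

total=0
for file in *.txt
do
  # The quotes keep "my notes.txt" as ONE argument despite the space.
  lines=$(wc -l <"$file")
  echo "$file has $lines lines"
  # $(( ... )) evaluates integer arithmetic.
  total=$((total + lines))
done
echo "total: $total lines"
```

Without the double quotes around $file, the shell would split "my notes.txt" into two words and the redirection would fail.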

5.3 Problem solving (pipelines!)

  • Get the first word in every line that mentions food

    grep food book.txt | cut -d' ' -f1
  • Get lines 3 through 5 of a file

    head -n5 file.txt | tail -n3
  • Find the names of all libraries imported into python programs in the current directory

    grep "import" *.py | cut -d' ' -f2 | sort | uniq
  • Show the third line of every .txt file in the current directory

    for tfile in *.txt
    do
      head -n3 "$tfile" | tail -n1
    done
  • Same as previous, but now include every .txt file in all subdirectories too

    for tfile in $(find . -name '*.txt')
    do
      head -n3 "$tfile" | tail -n1
    done
  • Move all .png files into a new directory called images

    mkdir images
    mv *.png images/
    # or
    for f in *.png
    do
      mv "$f" "images/$f"
    done
  • Same as above, using find

    mkdir images
    # -maxdepth 1 keeps find from descending into images/ and other subdirectories
    find . -maxdepth 1 -name '*.png' -exec 'mv' '{}' 'images/' ';'
  • Make 4 copies of every pdf file. For example, paper.pdf would be copied to paper.pdf.1, paper.pdf.2, etc.

    for f in *.pdf
    do
      for n in $(seq 1 4)
      do
        cp "$f" "$f.$n"
      done
    done
  • Same as above, using find

    for n in $(seq 1 4)
    do
      find . -name '*.pdf' -exec 'cp' '{}' "{}.$n" ';'
    done
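
One caveat about looping over $(find ...): the shell splits the output on whitespace, so a file name containing a space is broken into pieces. A possible workaround, sketched below with hypothetical file names, is to let find hand each name to a tiny sh script as a single argument. The sh -c idiom is an addition that is not covered elsewhere in these notes.

```shell
# Work in a scratch directory; one file name deliberately contains a space.
workdir=$(mktemp -d)
cd "$workdir"
mkdir sub
printf 'a1\na2\na3\n' >'top file.txt'
printf 'b1\nb2\nb3\n' >sub/inner.txt

# For each match, find runs: sh -c '<script>' sh <filename>
# Inside the script, "$1" is the file name, passed whole, with no splitting.
# The trailing sort only makes the output order predictable.
third=$(find . -name '*.txt' -exec sh -c 'head -n3 "$1" | tail -n1' sh '{}' ';' | sort)
echo "$third"
```

The pipe inside the single quotes belongs to the inner sh script, which is why a head-then-tail pipeline can run once per file here.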