This is the archived website of SD 212 from the Spring 2023 semester. Feel free to browse around; you may also find more recent offerings at my teaching page.

Unit 2: Command line

1 Overview
2 Reference
3 Concepts you should understand
4 Commands you should be familiar with
5 Some examples you should understand

1 Overview

As data scientists, we have to be comfortable handling files! So far you have written programs to deal with text files, csv files, tab-separated files, a few image formats, and zip files.

Some datasets we may want to handle consist of a single very large file, and some consist of many (sometimes hundreds or thousands of) small files. Either way, we need the ability to quickly and easily perform basic tasks with these files: seeing how long they are, what they are called, what format they are in, renaming or combining them, etc.

2 Reference

The Linux Command Line

This book is a really nice walkthrough, starting from no background assumptions, on using the Linux command line. Each chapter is very short and just introduces a few new commands. Below I am listing the most important commands that are introduced in each chapter, so that you can easily look up examples and find extra help.

After the initial chapters, in some places this book goes into slightly more detail than what we need, but the book is overall very readable, to the point, and filled with plenty of examples to check your understanding.
- Chapter 1 “What is the shell”: Opening your terminal; date, cal, df, exit
- Chapter 2 “Navigation”: pwd, cd, ls
- Chapter 3 “Exploring the system”: ls, file, less, directory structure, symbolic links
- Chapter 4 “Manipulating files and directories”: cp, mv, mkdir, rm, ln, wildcards
- Chapter 5 “Working with commands”: type, which, man, help, whatis, alias
- Chapter 6 “Redirection”: cat, sort, uniq, grep, wc, head, tail, tee, piping, redirects
- Chapter 17 “Searching for files”: locate, find, touch

Data Science at the Command Line

This book is written from a data science perspective, and the intro especially gives a fantastic overview of why data scientists need to be comfortable on the command line.

However, the downside of the book is that it sometimes goes a little too fast in the later chapters, and it relies on some non-standard tools written by the book’s author that we won’t be using.

Overall, the book gives a great conceptual understanding of a few topics such as piping, and helps put the command line tools in the context of data science, but when reading feel free to skip over the non-standard tools (i.e., the ones not listed above).
- Chapter 1 “Introduction”: How command-line skills can help with data science
- Chapter 2 “Getting Started”: Just focus on the section “Essential Unix concepts”; skip everything about downloading the docker image and files.
- Chapter 3 “Obtaining Data”: Focus on the sections “Downloading from the internet” (curl, wget) and “Decompressing files” (zip, tar)
- Chapter 7 “Exploring Data”: The section on “Inspecting data and its properties” is the one to focus on here (head, less, wc, csvkit)

Advanced Bash-Scripting Guide

This book is harder to read, but it can still serve as a good reference, especially chapter 16, which goes through many useful common commands, and chapter 11, which covers for loops.

When reading here, keep in mind that some of the details and commands mentioned are not required for this class. Try to pinpoint the information you’re seeking, understand it slowly and completely, and then get back to work.

3 Concepts you should understand

The structure of a command line: prompt, command, options, arguments
How files are organized into directories and sub-directories
What the home directory and root directory are
What it means to talk about the “type” of a file, and how this relates to the filename extension
The three streams (standard in, standard out, standard error) that every process has
Redirecting process input/output to/from files
Piping process input/output to other processes
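The last three concepts are easiest to see in a small experiment. Here is a minimal sketch, assuming a bash shell; the scratch directory and the file names (exists.txt, missing.txt, out.txt, err.txt) are made up for illustration.

```shell
# Work in a throwaway directory so nothing real is touched.
workdir=$(mktemp -d)
cd "$workdir"

# ls writes the listing of an existing file to standard out,
# and its complaint about a missing file to standard error.
touch exists.txt
ls exists.txt missing.txt >out.txt 2>err.txt

# > sent standard out to out.txt; 2> sent standard error to err.txt.
echo "standard out was:"
cat out.txt
echo "standard error was:"
cat err.txt

# < feeds a file to standard in of a process.
line_count=$(wc -l <out.txt)

# | connects standard out of one process to standard in of the next.
file_count=$(ls | wc -l)
echo "out.txt has $line_count line; the directory holds $file_count files"
```

Without the 2> redirect, the error message would still show up on the terminal even though standard out was sent to a file, because the two streams are separate.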

4 Commands you should be familiar with

cd
ls
pwd
cat
head
tail
wc
touch
tr
mv
cp
rm
Redirection operators: <, >, >>, 2>
Piping operator: |
type
man
help
whatis
apropos
curl
wget
tar
zip
unzip
grep
sort
uniq
find
sed
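
Two commands in this list, tr and sed, do not appear in the examples in the next section. Here is a short sketch of the most common use of each; the input strings are invented for illustration.

```shell
# tr translates characters read from standard in; here lowercase to uppercase.
shout=$(echo "go navy" | tr 'a-z' 'A-Z')
echo "$shout"

# tr -d deletes characters instead of translating them; here every comma.
no_commas=$(echo "1,234,567" | tr -d ',')
echo "$no_commas"

# sed applies an editing command to each line. s/old/new/ substitutes the
# first match on a line; the trailing g substitutes every match.
fixed=$(echo "colour and colour" | sed 's/colour/color/g')
echo "$fixed"
```

Both commands read standard in and write standard out, so they fit naturally in the middle of a pipeline.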

5 Some examples you should understand

5.1 Basic command usage

Get help on a normal command or program
```
man cat

# or, just listing the options
cat --help

# or, just showing some common usages
tldr cat
```

Get help on a built-in bash command
```
help for
```
List detailed info on all txt files in sd212 directory
```
ls -l ~/sd212/*.txt
```
Go to home directory
```
cd
```
Go to sd212 directory
```
cd sd212
```
Show contents of file.txt on the terminal
```
cat file.txt
```
Combine in1.txt and in2.txt, save combined file as out.txt
```
cat in1.txt in2.txt >out.txt
```
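The > in the command above replaces whatever out.txt already contained. A small sketch of the difference between > and >>, using a made-up file log.txt in a scratch directory:

```shell
workdir=$(mktemp -d)
cd "$workdir"

echo "first" >log.txt     # > creates log.txt, or empties it if it already exists
echo "second" >log.txt    # so this line replaces "first"
echo "third" >>log.txt    # >> appends to the end instead

cat log.txt
log_contents=$(cat log.txt)
```

Only "second" and "third" survive, one per line.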
Show only lines of file.txt that mention soccer (upper or lowercase)
```
grep -i soccer file.txt
```
Extract the 3rd column of a csv file
```
cut -d',' -f3 data.csv
```
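To see what cut is doing, it helps to try it on a tiny file. This sketch builds a hypothetical data.csv (the names and numbers are invented) and pulls out columns from it.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# A made-up three-column csv file with a header row.
printf 'name,team,goals\nAlex,Navy,3\nSam,Army,1\n' >data.csv

# -d',' sets the delimiter to a comma; -f3 keeps only the third field.
third=$(cut -d',' -f3 data.csv)
echo "$third"

# -f accepts a list of fields; tail -n +2 starts at line 2, skipping the header.
names_goals=$(cut -d',' -f1,3 data.csv | tail -n +2)
echo "$names_goals"
```

Keep in mind that cut splits on every comma, so it gives wrong results on csv files where a quoted field contains a comma.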
Display a message
```
echo "Hello world"
```
Create a new subdirectory called images under the current directory
```
mkdir images
```
Move all jpg files to an images subdirectory
```
mv *.jpg images/
```
Create file2.txt which is a complete copy of file1.txt
```
cp file1.txt file2.txt
```
Create directory folder2 which has a complete copy of everything in folder1
```
cp -R folder1 folder2
```
Delete file temp.txt
```
rm temp.txt
```
Delete the folder garbage.
```
rm -r garbage
```
Delete the folder garbage without having to say “yes” over and over (POTENTIALLY DANGEROUS!)
```
rm -rf garbage
```
Find all txt files under the current directory
```
find . -name '*.txt'
```
Find all subdirectories of /etc/systemd
```
find /etc/systemd -type d
```

Display the number of words in each text file
```
find . -name '*.txt' -exec 'wc' '-w' '{}' ';'
```

Create a new empty file
```
touch empty.txt
```

Download the SD212 word cloud header from this web site
```
wget 'https://usna.edu/Users/cs/roche/212/scripts/header.png'
```

Extract a tarball
```
tar -xzvf something.tgz
```
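The option letters make more sense after a full round trip. This is a sketch that creates a small tarball, lists it, and extracts it again; the directory project and the file notes.txt are made up for illustration.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# Something small to archive.
mkdir project
echo "hello" >project/notes.txt

# c = create, z = compress with gzip, f = the archive file name comes next.
tar -czf something.tgz project

# t = list the contents without extracting anything.
listing=$(tar -tzf something.tgz)
echo "$listing"

# x = extract; -C picks the directory to extract into.
mkdir unpacked
tar -xzf something.tgz -C unpacked
restored=$(cat unpacked/project/notes.txt)
echo "$restored"
```

The v in -xzvf just means verbose: tar prints each file name as it goes.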
Extract a zip file
```
unzip something.zip
```
Decompress an xz file
```
unxz something.xz
```

5.2 Variables and loops

Define and/or set a variable
```
name="Bill the Goat"
```
Refer to a variable
```
echo "My name is $name."
```
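Quoting matters more in bash than in most languages. This sketch, assuming bash, shows the difference between double quotes, single quotes, and no quotes; the helper function count_words is made up for illustration.

```shell
name="Bill the Goat"    # no spaces are allowed around the =

# Double quotes expand variables; single quotes keep the text literally.
double="My name is $name."
single='My name is $name.'
echo "$double"
echo "$single"

# A tiny helper (hypothetical) that reports how many arguments it received.
count_words() { echo $#; }

# Unquoted, the value is split at its spaces into separate arguments.
unquoted=$(count_words $name)
# Quoted, it stays together as one argument.
quoted=$(count_words "$name")
echo "unquoted: $unquoted arguments, quoted: $quoted argument"
```

This is why the loops in these notes write "$file" rather than $file: a file name containing a space would otherwise be treated as two names.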

For-each loop
```
for file in *.txt
do
  echo -n "last line of $file: "
  tail -n1 "$file"
done

for folder in */
do
  echo -n "number of files in folder $folder: "
  ls "$folder" | wc -l
done
```

Command substitution
```
echo "book.txt has $(wc -l <book.txt) lines in it"
```
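
The redirect inside $( ) in that example is deliberate. A quick sketch of why, using a made-up three-line book.txt:

```shell
workdir=$(mktemp -d)
cd "$workdir"
printf 'one\ntwo\nthree\n' >book.txt

# Reading from standard in, wc has no file name to report, so it prints just the count.
lines=$(wc -l <book.txt)

# Given the file as an argument, wc prints the name after the count.
with_name=$(wc -l book.txt)

echo "redirected:  $lines"
echo "as argument: $with_name"
```

The output of a command substitution can also be saved in a variable, as done here, and reused later in the script.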

5.3 Problem solving (pipelines!)

Get the first word in every line that mentions food
```
grep food book.txt | cut -d' ' -f1
```
Get lines 3 through 5 of a file
```
head -n5 file.txt | tail -n3
```
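The numbers work like this: head -n5 keeps lines 1 through 5, and tail -n3 keeps the last 3 of those, which are lines 3, 4, and 5. In general, lines a through b come from head -n b followed by tail -n (b - a + 1). Here is a sketch that checks this on a made-up file whose line N contains the number N, and compares it with sed, which can do the same job in one step.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# seq 1 10 prints the numbers 1 to 10, one per line.
seq 1 10 >file.txt

via_head_tail=$(head -n5 file.txt | tail -n3)
echo "$via_head_tail"

# sed -n suppresses the normal output; 3,5p prints only lines 3 through 5.
via_sed=$(sed -n '3,5p' file.txt)
echo "$via_sed"
```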
Find the names of all libraries imported into python programs in the current directory
```
grep "import" *.py | cut -d' ' -f2 | sort | uniq
```
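A good habit with pipelines is to build them one stage at a time and look at the output after each stage. Here is a sketch of that process on two made-up python files, a.py and b.py:

```shell
workdir=$(mktemp -d)
cd "$workdir"
printf 'import numpy\nimport pandas\nprint("hi")\n' >a.py
printf 'import pandas\nimport sys\n' >b.py

# Stage 1: with several files, grep prefixes each match with "filename:".
grep "import" *.py

# Stage 2: splitting on spaces, field 2 is the library name.
grep "import" *.py | cut -d' ' -f2

# Stage 3: uniq only removes adjacent duplicates, so sort has to come first.
libs=$(grep "import" *.py | cut -d' ' -f2 | sort | uniq)
echo "$libs"

# Variation: uniq -c counts the repeats, and sort -rn puts the largest count first.
counts=$(grep "import" *.py | cut -d' ' -f2 | sort | uniq -c | sort -rn)
echo "$counts"
```

This simple version assumes every import is written as "import name" at the start of a line; a line such as "from x import y" would report x.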
Show the third line of every .txt file in the current directory
```
for tfile in *.txt
do
  head -n3 "$tfile" | tail -n1
done
```

Same as previous, but now include every .txt file in all subdirectories too
```
for tfile in $(find . -name '*.txt')
do
  head -n3 "$tfile" | tail -n1
done
```
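
One caution about looping over $(find ...): the output of find is split at every space, so a file called "my notes.txt" would be treated as two separate names. A sketch of one common alternative, assuming bash, is to pipe find into a while loop that reads one whole line at a time. The files here are made up for illustration.

```shell
workdir=$(mktemp -d)
cd "$workdir"
mkdir sub
printf 'a1\na2\na3\n' >plain.txt
printf 'b1\nb2\nb3\n' >"sub/my notes.txt"

# read -r takes one full line from find into tfile, spaces included.
# IFS= keeps leading and trailing spaces in the name intact.
third_lines=$(
  find . -name '*.txt' | while IFS= read -r tfile
  do
    head -n3 "$tfile" | tail -n1
  done | sort
)
echo "$third_lines"
```

This handles spaces, though it would still be confused by a file name that contains a newline, which is rare in practice.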

Move all .png files into a new directory called images
```
mkdir images
mv *.png images/
# or
for f in *.png
do
  # quotes keep file names with spaces in one piece
  mv "$f" "images/$f"
done
```

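Same as above, using find
```
mkdir images
# -maxdepth 1 keeps find in the current directory, so it does not
# descend into images/ and try to move the files a second time
find . -maxdepth 1 -name '*.png' -exec 'mv' '{}' 'images/' ';'
```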

Make 4 copies of every pdf file. For example paper.pdf would create paper.pdf.1, paper.pdf.2, etc.
```
for f in *.pdf
do
  for n in $(seq 1 4)
  do
    cp "$f" "$f.$n"
  done
done
```

Same as above, using find
```
for n in $(seq 1 4)
do
  find . -name '*.pdf' -exec 'cp' '{}' "{}.$n" ';'
done
```
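
Loops like these copy or move many files at once, so it is worth checking them before running them for real. One simple trick is to put echo in front of the command, which prints what would be run without doing it. This sketch uses a made-up paper.pdf in a scratch directory.

```shell
workdir=$(mktemp -d)
cd "$workdir"
touch paper.pdf

# Dry run: echo prints each cp command instead of running it.
for n in $(seq 1 4)
do
  echo cp paper.pdf "paper.pdf.$n"
done
before=$(ls | wc -l)

# Real run: same loop with the echo removed.
for n in $(seq 1 4)
do
  cp paper.pdf "paper.pdf.$n"
done
after=$(ls | wc -l)

echo "files before: $before, files after: $after"
```

Once the printed commands look right, delete the echo and run the loop again.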
