Unit 2: Command line
1 Overview
As data scientists, we have to be comfortable handling files! So far you have written programs to deal with text files, csv files, tab-separated files, a few image formats, and zip files.
Some datasets we may want to handle consist of a single very large file, and some consist of many (sometimes hundreds or thousands of) small files. Either way, we need the ability to quickly and easily perform basic tasks with these files: seeing how long they are, what they are called, what format they are in, renaming or combining them, etc.
2 Reference
-
This book is a really nice walkthrough, starting from no background assumptions, on using the Linux command line. Each chapter is very short and just introduces a few new commands. Below I am listing the most important commands that are introduced in each chapter, so that you can easily look up examples and find extra help.
After the initial chapters, in some places this book goes into slightly more detail than what we need, but the book is overall very readable, to the point, and filled with plenty of examples to check your understanding.
- Chapter 1 “What is the shell”: Opening your terminal; date, cal, df, exit
- Chapter 2 “Navigation”: pwd, cd, ls
- Chapter 3 “Exploring the system”: ls, file, less, directory structure, symbolic links
- Chapter 4 “Manipulating files and directories”: cp, mv, mkdir, rm, ln, wildcards
- Chapter 5 “Working with commands”: type, which, man, help, whatis, alias
- Chapter 6 “Redirection”: cat, sort, uniq, grep, wc, head, tail, tee, piping, redirects
- Chapter 17 “Searching for files”: locate, find, touch
Data Science at the Command Line
This book is written from a data science perspective, and the intro especially gives a fantastic overview of why data scientists need to be comfortable on the command line.
However, the downside of the book is that it sometimes goes a little too fast in the later chapters, and it relies on some non-standard tools written by the book’s author that we won’t be using.
Overall, the book gives a great conceptual understanding of a few topics such as piping, and helps put the command line tools in the context of data science, but when reading feel free to skip over the non-standard tools (i.e., the ones not listed above).
- Chapter 1 “Introduction”: How command-line skills can help with data science
- Chapter 2 “Getting Started”: Just focus on the section “Essential Unix concepts”; skip everything about downloading the docker image and files.
- Chapter 3 “Obtaining Data”: Focus on the sections “Downloading from the internet” (curl, wget) and “Decompressing files” (zip, tar)
- Chapter 7 “Exploring Data”: The section on “Inspecting data and its properties” is the one to focus on here (head, less, wc, csvkit)
-
This book is less easy to read, but can still serve as a good reference, especially chapter 16 which goes through a whole bunch of useful common commands, and chapter 11 on for loops.
When reading here, keep in mind that some of the details or commands mentioned aren’t required to be understood for this class. Try to pinpoint the information you’re seeking, understand it slowly and completely, and then get back to work.
3 Concepts you should understand
- The structure of a command-line: prompt, command, options, arguments
- How files are organized into directories and sub-directories
- What are the home directory and root directory
- What it means to talk about the “type” of a file, and how this relates to the filename extension
- The three streams (standard in, standard out, standard error) that every process has
- Redirecting process input/output to/from files
- Piping process input/output to other processes
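The last few concepts in this list can be seen together in one short terminal session. What follows is only a minimal sketch, assuming a bash shell; the directory and file names (demo, fruit.txt, missing.txt, out.txt, err.txt, sorted.txt) are made up for illustration.

```shell
# Work in a scratch directory (hypothetical name) so nothing important is touched
mkdir -p demo
cd demo

# Output redirection: > creates or overwrites a file, >> appends to it
echo "banana" >fruit.txt
echo "apple" >>fruit.txt

# Structure of a command: ls is the command, -l is an option, fruit.txt is an argument
ls -l fruit.txt

# ls writes the name it finds to standard out and its complaint about the
# missing file to standard error; each stream is sent to its own file.
# ls reports failure here, and "|| true" just keeps a script from stopping.
ls fruit.txt missing.txt >out.txt 2>err.txt || true

# Input redirection: sort reads fruit.txt on standard in
sort <fruit.txt >sorted.txt

# Piping: standard out of cat becomes standard in of wc
lines=$(cat fruit.txt | wc -l)
echo "fruit.txt has $lines lines"

cd ..
```

After running this, out.txt holds only the normal output, while err.txt holds only the error message, which shows that the two output streams really are separate.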
4 Commands you should be familiar with
cd
ls
pwd
cat
head
tail
wc
touch
tr
mv
cp
rm
- Redirection operators: <, >, >>, 2>
- Piping operator: |
type
man
help
whatis
apropos
curl
wget
tar
zip
unzip
grep
sort
uniq
find
sed
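Two commands in this list, tr and sed, do not appear in the examples further down, so here is a small sketch of their most common uses. It assumes a bash shell with the usual GNU or BSD versions of the tools; the file cheer.txt and the variable names are made up for illustration.

```shell
# Made-up sample text
printf 'Go Navy\nBeat Army\n' >cheer.txt

# tr translates characters read from standard in: here, lowercase to uppercase
upper=$(tr 'a-z' 'A-Z' <cheer.txt)
echo "$upper"

# tr -d deletes characters instead: here, every space
nospace=$(tr -d ' ' <cheer.txt)
echo "$nospace"

# sed applies an editing command to each line: substitute Army with Air Force
swapped=$(sed 's/Army/Air Force/' cheer.txt)
echo "$swapped"

# sed -n with p prints only the selected lines: here, just line 2
second=$(sed -n '2p' cheer.txt)
echo "$second"
```

Note that tr only ever reads standard in, so it needs a redirect or a pipe, while sed accepts a filename as an argument.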
5 Some examples you should understand
5.1 Basic command usage
Get help on a normal command or program
man cat
# or, just listing the options
cat --help
# or, just showing some common usages
tldr cat
Get help on a built-in bash command
help for
List detailed info on all txt files in sd212 directory
ls -l ~/sd212/*.txt
Go to home directory
cd
Go to sd212 directory
cd sd212
Show contents of file.txt on the terminal
cat file.txt
Combine in1.txt and in2.txt, save combined file as out.txt
cat in1.txt in2.txt >out.txt
Show only lines of file.txt that mention soccer (upper or lowercase)
grep -i soccer file.txt
Extract the 3rd column of a csv file
cut -d',' -f3 data.csv
Display a message
echo "Hello world"
Create a new subdirectory called images under the current directory
mkdir images
Move all jpg files to an images subdirectory
mv *.jpg images/
Create file2.txt which is a complete copy of file1.txt
cp file1.txt file2.txt
Create directory folder2 which has a complete copy of everything in folder1
cp -R folder1 folder2
Delete file temp.txt
rm temp.txt
Delete the folder garbage
rm -r garbage
Delete the folder garbage without having to say “yes” over and over (POTENTIALLY DANGEROUS!)
rm -rf garbage
Find all txt files under the current directory
find . -name '*.txt'
Find all subdirectories of /etc/systemd
find /etc/systemd -type d
Display the number of words in each text file
find . -name '*.txt' -exec 'wc' '-w' '{}' ';'
Create a new empty file
touch empty.txt
Download the SD212 word cloud header from this web site
wget 'https://usna.edu/Users/cs/roche/212/scripts/header.png'
Extract a tarball
tar -xzvf something.tgz
Extract a zip file
unzip something.zip
Decompress an xz file
unxz something.xz
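To practice the tar command without downloading anything, one option is to build a small tarball first and then extract it. This is a rough sketch, assuming bash and a tar that supports gzip (both GNU tar and BSD tar do); the names project, project.tgz, and restored are made up.

```shell
# Made-up files to archive
mkdir -p project
echo "hello" >project/a.txt
echo "world" >project/b.txt

# c = create, z = compress with gzip, f = name of the archive file
tar -czf project.tgz project

# t = list the contents without extracting anything
listing=$(tar -tzf project.tgz)
echo "$listing"

# x = extract; -C chooses the directory to extract into
mkdir -p restored
tar -xzf project.tgz -C restored
cat restored/project/a.txt
```

Listing with -t before extracting is a good habit, since it shows whether the archive will unpack into its own folder or scatter files into the current directory.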
5.2 Variables and loops
Define and/or set a variable
name="Bill the Goat"
Refer to a variable
echo "My name is $name."
For-each loop
for file in *.txt
do
  echo -n "last line of $file: "
  tail -n1 "$file"
done
for folder in */
do
  echo -n "number of files in folder $folder: "
  ls "$folder" | wc -l
done
Command substitution
echo "book.txt has $(wc -l <book.txt) lines in it"
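Command substitution can also store a command's output in a variable so it can be reused later, for instance in shell arithmetic. A minimal sketch, assuming bash; the file notes.txt and the variable names count and half are hypothetical.

```shell
# Made-up sample file with three lines
printf 'one\ntwo\nthree\n' >notes.txt

# Capture the line count; reading from standard in keeps the filename
# out of wc's output, so the variable holds just the number
count=$(wc -l <notes.txt)

# $(( )) does integer arithmetic on variables
half=$((count / 2))
echo "notes.txt has $count lines; half of that, rounded down, is $half"
```

If the filename were given as an argument instead (wc -l notes.txt), the output would also contain the name, and the arithmetic step would fail.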
5.3 Problem solving (pipelines!)
Get the first word in every line that mentions food
grep food book.txt | cut -d' ' -f1
Get lines 3 through 5 of a file
head -n5 file.txt | tail -n3
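The numbers in this pipeline follow a pattern: to get lines A through B, keep the first B lines with head, then keep the last B - A + 1 of those with tail (here 5 - 3 + 1 = 3). A small sketch of that arithmetic, assuming bash; the function name lines_between and the file count.txt are made up for illustration.

```shell
# Hypothetical helper: print lines $2 through $3 of the file named $1
lines_between() {
    head -n "$3" "$1" | tail -n "$(( $3 - $2 + 1 ))"
}

# Made-up test file containing the numbers 1 to 10, one per line
seq 1 10 >count.txt

middle=$(lines_between count.txt 3 5)
echo "$middle"

# sed can select the same range directly
same=$(sed -n '3,5p' count.txt)
echo "$same"
```

One caveat: if the file has fewer than B lines, the tail step counts back from the wrong place and returns earlier lines than intended. The sed form does not have that problem.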
Find the names of all libraries imported into python programs in the current directory
grep "import" *.py | cut -d' ' -f2 | sort | uniq
Show the third line of every .txt file in the current directory
for tfile in *.txt
do
  head -n3 "$tfile" | tail -n1
done
Same as previous, but now include every .txt file in all subdirectories too
for tfile in $(find . -name '*.txt')
do
  head -n3 "$tfile" | tail -n1
done
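One caveat with the loop above: the shell splits the output of $(find ...) on whitespace, so a name like "my notes.txt" would be treated as two separate words. A possible alternative, sketched here assuming bash and made-up file names, is to let find run the commands itself, once per file.

```shell
# Made-up directory tree with a space in one of the file names
mkdir -p notes/sub
printf 'a1\na2\na3\n' >'notes/my notes.txt'
printf 'b1\nb2\nb3\n' >notes/sub/other.txt

# find starts one small shell per file and hands it the file name as $1,
# so the name is never split on spaces; sort just makes the order predictable
thirds=$(find notes -name '*.txt' -exec sh -c 'head -n3 "$1" | tail -n1' sh '{}' ';' | sort)
echo "$thirds"
```

The extra sh -c is needed because -exec runs a single program, not a pipeline; the small shell is what understands the | between head and tail.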
Move all .png files into a new directory called images
mkdir images
mv *.png images/
# or
for f in *.png
do
  mv "$f" "images/$f"
done
Same as above, using find
mkdir images
# -maxdepth 1 keeps find in the current directory, so it does not descend into images
find . -maxdepth 1 -name '*.png' -exec mv '{}' images/ ';'
Make 4 copies of every pdf file. For example paper.pdf would create paper.pdf.1, paper.pdf.2, etc.
for f in *.pdf
do
  for n in $(seq 1 4)
  do
    cp "$f" "$f.$n"
  done
done
Same as above, using find
for n in $(seq 1 4)
do
  find . -name '*.pdf' -exec 'cp' '{}' "{}.$n" ';'
done