Unit 5: Versions and packaging
1 Overview
The main goal of this unit is to introduce you to git, the most popular tool used by software developers and data scientists alike to share and collaborate on coding projects.
At the end of this unit, you should be comfortable collaborating in a single-branch git repository, and using the git commands clone, add, commit, pull, and push. There is a lot more to git that you might learn about later on, or on your own!
We will also briefly discuss how big software package management systems such as pip and conda/mamba work at a high level. Many of them are closely integrated with git!
2 Resources
-
The first two chapters on “Getting Started” and “Git Basics” correspond to what we’ll cover here. The book is very well-written with short, straightforward chunks of information, and it’s all free online.
git “simple guide” by Roger Dudler
A very basic down-to-business overview of the most useful git commands and workflows. Great for a quick reference or refresher.
-
Goes over how to create, clone, and contribute to projects hosted on GitHub. Remember that git is not github! GitHub is just one of many (free and very popular) places where people store their git repos.
3 Command line vs VS Code (GUI)
VS Code has very good integration with git so that you can easily pull, add, commit, and push files by clicking things in your code editor. This integration is built in to VS Code, so you don’t need to do anything special other than create a git repository and open that folder in VS Code.
But we are still going to learn how to use the command line git commands. Why? Scott Chacon, the author of Pro Git, put it better than I could:
For this book, we will be using Git on the command line. For one, the command line is the only place you can run all Git commands — most of the GUIs implement only a partial subset of Git functionality for simplicity. If you know how to run the command-line version, you can probably also figure out how to run the GUI version, while the opposite is not necessarily true.
4 Git structure
When using git, your files typically exist in three places:
Working directory: This is the folder containing your actual files and subfolders. It appears just like any other folder and any other files on your computer. You edit these files in VS Code or wherever, run the code, etc.
Local repo: This is created alongside the working directory on your own computer, in a special folder called .git inside your working directory. It is a database that stores snapshots (called commits) of your working directory.
Remote repo: While not technically required to use git, we typically use git to sync up a locally-stored working directory and local repo with a server. The most popular such hosting service is GitHub (github.com). Once everything is synced, the remote repo stores exactly the same database as your local repo, containing snapshots (commits) of the working directory.
Understanding these three copies is crucial to being a competent user of git! It can be very easy to conflate the working directory and local repo, or the local repo and remote repo, leading to confusion about what is going on.
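To make the three places concrete, here is a minimal sketch you could try in a terminal. The folder name demo and the file notes.txt are made up for this example, and everything happens in a throwaway temporary directory.

```shell
# Work in a throwaway directory so nothing else on your machine is touched.
cd "$(mktemp -d)"
mkdir demo
cd demo

# 1. Working directory: ordinary files in an ordinary folder.
echo "hello" > notes.txt

# 2. Local repo: git init creates the hidden .git folder next to your files.
git init
ls -a            # lists notes.txt alongside the .git folder

# 3. Remote repo: nothing is connected yet, so this prints nothing.
git remote -v
```

Note that at this point the local repo exists but is still empty: notes.txt lives only in the working directory until you commit it.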
5 Git commands
git is a single program with many “sub-commands”. To use it, you run a command like this on the command line:
git SUBCOMMAND ARGUMENTS
where SUBCOMMAND is something like commit, pull, status, etc.
While there are around a hundred git subcommands, we can focus on just a few of the most common ones. The best way to understand these commands is in terms of how they move files and snapshots between the working directory, local repo, and remote repo.
The absolute most common every-day commands are git commit, git pull, and git push, so be sure you know them by heart!
5.1 Setup commands
git init: Starting from a working directory only (i.e., just some files in a folder), create the local repo .git directory. (Note, no snapshots are added to the local repo yet; it is created as an empty database.)
git clone REMOTE_URL: Given the URL of a remote git repo server, create a local repo and working directory with the contents of that remote repo. On popular hosting sites like GitHub and GitLab, you can find the REMOTE_URL to use by clicking the big “Clone” or “Code” button. There are typically two URL options, HTTPS or SSH. HTTPS is good for public repos where you are just downloading them; SSH requires a little setup but works better for repos where you will be contributing code.
git remote add NAME URL: This is not required if you created the repo using git clone, since that will automatically create a remote called origin. But if you make a repo with git init and want to later connect it with a remote server, use this command; origin is the most common choice for NAME. You can see all remote repos connected with git remote -v, and adjust the URL for a given remote after adding it with git remote set-url.
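The setup commands above can be sketched as follows. To keep the example self-contained, a local "bare" repo named server.git stands in for a hosting site; that name, the folder names, and the moved.git path are all assumptions for illustration. With a real project you would paste the HTTPS or SSH URL from the Clone/Code button instead.

```shell
cd "$(mktemp -d)"

# Stand-in for a hosting site: a bare repo in a local folder.
git init --bare server.git

# Option 1: git clone creates the working directory, the local repo,
# and a remote named origin, all in one step.
git clone server.git cloned
git -C cloned remote -v

# Option 2: start from git init, then connect the remote by hand.
mkdir fresh
cd fresh
git init
git remote add origin ../server.git
git remote -v

# Later, point origin at a different URL (this path is made up).
git remote set-url origin ../moved.git
git remote get-url origin
```

Cloning an empty repo prints a warning, which is expected here since nothing has been pushed to the stand-in server yet.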
5.2 Working directory and local repo commands
git status: Show the state of the working directory compared to the local repo. In particular, this will show any files you have changed or added that are not yet part of the local repo.
git add FILE: Tell git to start “tracking” the named file. You can also put a directory here to tell git to track all files in that directory. (You typically only need to do this when you create a new file.)
git commit -a -m "MESSAGE": Take a snapshot of all changes from all tracked files in the working directory, and add that snapshot to the local repo. (The -a part means “all”, to commit all changes to tracked files. The message is required!) (Note: this does not upload to the remote repo yet!)
git merge: Update the working directory based on any new commits in the local repo. This is the opposite of git commit. (Typically you don’t run this directly, but it happens as part of a git pull command.)
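Here is a minimal sketch of the everyday status, add, commit cycle in a scratch repo. The file name, commit messages, and the name and email passed to git config are placeholders for this example only.

```shell
cd "$(mktemp -d)"
git init

# Identity recorded in commits, set for this demo repo only (placeholder values).
git config user.name "Demo Student"
git config user.email "student@example.com"

echo "first line" > notes.txt
git status --short              # "?? notes.txt": the file is not tracked yet

git add notes.txt               # start tracking the new file
git commit -a -m "Add notes"    # first snapshot goes into the local repo

echo "second line" >> notes.txt
git status --short              # " M notes.txt": a tracked file has changed
git commit -a -m "Extend notes" # no git add needed, the file is already tracked

git log --oneline               # lists both commits, newest first
```

Nothing here touches a remote repo: both commits exist only in the .git folder until you push.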
5.3 Remote sync commands
git push: Upload all commits from the local repo to the remote repo. (The first time you do push you have to specify the remote name and branch, like git push -u origin main. The -u part here remembers the names so in the future you can just say git push.) (Note: you should typically pull before you push, to make sure your local repo is up to date before you try to share your changes with others.)
git fetch: Download all commits from the remote repo to the local repo. It is the opposite of push. (Typically you don’t run this directly, but it happens as part of a git pull command.)
git pull: Do a git fetch followed by a git merge, to download any commits from the remote repo to the local repo, and then incorporate those changes into the working directory. (You should typically commit before you pull, to make sure you have a snapshot of your working directory before trying to merge in any changes from others.)
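The push and pull round trip can be sketched with two collaborators sharing one remote. As before, a local bare repo (server.git) stands in for a hosting site, and the names alice and bob, the file shared.txt, and the identities are all invented for the example.

```shell
cd "$(mktemp -d)"

# Stand-in remote repo whose default branch is main.
git init --bare server.git
git -C server.git symbolic-ref HEAD refs/heads/main

# Alice clones, makes the first commit, and pushes it.
git clone server.git alice
cd alice
git config user.name "Alice"
git config user.email "alice@example.com"
echo "from alice" > shared.txt
git add shared.txt
git commit -a -m "Start shared file"
git branch -M main            # make sure the branch is called main
git push -u origin main       # first push: name the remote and branch

# Bob clones, adds a line, and pushes his commit.
cd ..
git clone server.git bob
cd bob
git config user.name "Bob"
git config user.email "bob@example.com"
echo "from bob" >> shared.txt
git commit -a -m "Add Bob's line"
git push                      # clone already set up origin/main for Bob

# Alice pulls: fetch Bob's commit, then merge it into her working directory.
cd ../alice
git pull
cat shared.txt                # now shows both lines
```

In this sketch Alice has no uncommitted work when she pulls, so the merge step simply fast-forwards her copy to Bob's commit.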
6 Pip and conda/mamba
Most of the software we have used in SD211 and SD212 is maintained by communities of programmers who typically use git to coordinate their changes, track bugs and issues, and share their code with others.
The Python packages that we have installed were either installed with pip, which downloads them from PyPI, or with mamba, which downloads them from conda-forge.
While the details differ, it is important to understand that anyone can propose to add their own software package to these repositories, and the process is fairly streamlined. Yes, this means that even you could write some Python code, share it on GitHub, and get it added to PyPI and conda-forge with a few hours’ work. This openness is a big part of why Python has become so popular for data science, but it also means that we should be careful and thoughtful about what software we are downloading when we run pip install or mamba install.
As an example, we created a very simple Python package to help with the last homework, and got it published on PyPI at https://pypi.org/project/sd212review/. You can see that the code is actually maintained on GitHub at https://github.com/sd212usna/sd212review. Scroll down on the README to see exactly the commands needed to publish this git repo to PyPI; it only took a couple of minutes!