SD 212 Spring 2023 / Notes


This is the archived website of SD 212 from the Spring 2023 semester. Feel free to browse around; you may also find more recent offerings at my teaching page.

Unit 14: Versions and packaging

1 Overview

The main goal of this unit is to introduce you to git, the most popular tool used by software developers and data scientists alike to share and collaborate on coding projects.

At the end of this unit, you should be comfortable collaborating in a single-branch git repository, and using the git commands clone, add, commit, pull, and push. There is a lot more to git that you might learn about later on, or on your own!

We will also briefly discuss how big software package management systems such as pip and conda/mamba work at a high level. Many of them are closely integrated with git!

2 Resources

  • Pro Git by Scott Chacon

    The first two chapters on “Getting Started” and “Git Basics” correspond to what we’ll cover here. The book is very well-written with short, straightforward chunks of information, and it’s all free online.

  • git “simple guide” by Roger Dudler

    A very basic down-to-business overview of the most useful git commands and workflows. Great for a quick reference or refresher.

  • GitHub “Quickstart”

    Goes over how to create, clone, and contribute to projects hosted on GitHub. Remember that git is not GitHub! GitHub is just one of many (free and very popular) places where people store their git repos.

  • Using git in VS Code

3 Command line vs VS Code (GUI)

VS Code has very good integration with git, so you can easily pull, add, commit, and push files by clicking things in your code editor. This git support is built into VS Code by default, so you don’t need to do anything special other than create a git repository and open that folder in VS Code.

But we are still going to learn how to use the command line git commands. Why? Scott Chacon, the author of Pro Git, put it better than I could:

For this book, we will be using Git on the command line. For one, the command line is the only place you can run all Git commands — most of the GUIs implement only a partial subset of Git functionality for simplicity. If you know how to run the command-line version, you can probably also figure out how to run the GUI version, while the opposite is not necessarily true.

4 Git structure

When using git, your files typically exist in three places:

  • Working directory: This is the folder containing your actual files and subfolders. It appears just like any other folder and any other files on your computer. You edit these files in VS Code or wherever, run the code, etc.

  • Local repo: This is created alongside the working directory on your own computer, in a special folder called .git inside your working directory. It is a database that stores snapshots (called commits) of your working directory.

  • Remote repo: While not technically required to use git, we typically use git to sync up a locally-stored working directory and local repo with a server. The most popular such hosting service is github.com.

    The remote repo stores the same kind of database as your local repo, containing snapshots (commits) of the working directory. Once you have synced the two, they hold exactly the same commits.

Understanding these three copies is crucial to being a competent user of git! It can be very easy to conflate the working directory and local repo, or the local repo and remote repo, leading to confusion about what is going on.
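
To make the first two places concrete, here is a minimal sketch you could try in a terminal. The folder name places-demo and the file analysis.py are made up for this illustration, and it assumes git is already installed.

```shell
# Working directory: an ordinary folder with ordinary files (names are hypothetical).
mkdir places-demo
cd places-demo
echo "x = 1" > analysis.py

# Local repo: git init creates the hidden .git folder inside the working directory.
git init

# Listing hidden files shows both analysis.py (working directory)
# and .git (the local repo database, which has no commits yet).
ls -a

# Remote repo: nothing is connected yet, so this prints nothing.
git remote -v
```

Deleting the .git folder would throw away the local repo but leave analysis.py untouched, which is a good reminder that the two really are separate things.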

5 Git commands

git is a single program with many “sub-commands”. To use it, you do a command like this on the command line:

git SUBCOMMAND ARGUMENTS

where SUBCOMMAND is something like commit, pull, status, etc.

While there are around a hundred git subcommands, we can focus on just a few of the most common ones. The best way to understand these commands is how they move files and snapshots between the working directory, local repo, and remote repo.

The absolute most common every-day commands are git commit, git pull, and git push, so be sure you know them by heart!
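
As a quick sketch of the git SUBCOMMAND ARGUMENTS pattern, the commands below can be typed one at a time. The folder name pattern-demo is just a placeholder chosen for this example.

```shell
# No subcommand at all, just a global option: prints the installed git version.
git --version

# SUBCOMMAND = init, ARGUMENTS = pattern-demo (a hypothetical folder name).
git init pattern-demo
cd pattern-demo

# SUBCOMMAND = status, with no arguments.
git status

# SUBCOMMAND = help, ARGUMENTS = -a: lists the available subcommands.
# Piping through head keeps only the first 20 lines of a long list.
git help -a | head -n 20
```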

5.1 Setup commands

  • git init: Starting from a working directory only (i.e., just some files in a folder), create the local repo .git directory.

    (Note, no snapshots are added to the local repo yet; it is created as an empty database.)

  • git clone REMOTE_URL: Given the URL of a remote git repo server, create a local repo and working directory with the contents of that remote repo.

    On popular hosting sites like GitHub and GitLab, you can find the REMOTE_URL to use by clicking the big “Clone” or “Code” button. There are typically two URL options, HTTPS or SSH. HTTPS is good for public repos where you are just downloading them; SSH requires a little setup but works better for repos where you will be contributing code.

  • git remote add NAME URL: This is not required if you created the repo using git clone, since that will automatically create a remote called origin. But if you make a repo with git init and want to later connect it with a remote server, use this command; origin is the most common choice for NAME.

    You can see all remote repos connected with git remote -v, and adjust the URL for a given remote after adding it with git remote set-url.
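
Here is a rough sketch of the init-then-connect route described above. The example.com URLs are placeholders that do not point to a real server; none of these commands actually contact the remote, so they are safe to try as-is.

```shell
# Start from a plain folder and create the local repo.
mkdir setup-demo
cd setup-demo
git init

# Connect a remote named origin (hypothetical HTTPS URL).
git remote add origin https://example.com/hypothetical/setup-demo.git
git remote -v

# Later, switch the same remote over to an SSH-style URL (also hypothetical).
git remote set-url origin git@example.com:hypothetical/setup-demo.git
git remote -v
```

If you had started with git clone instead, the origin remote would already be there and only the git remote -v step would be needed to inspect it.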

5.2 Working directory and local repo commands

  • git status: Show the state of the working directory compared to the local repo. In particular, this will show any files you have changed or added that are not yet part of the local repo.

  • git add FILE: Tell git to start “tracking” the named file. You can also put a directory here to tell git to track all files in that directory. (You typically only need to do this when you create a new file.)

  • git commit -a -m "MESSAGE": Take a snapshot of all changes to all tracked files in the working directory, and add that snapshot to the local repo.

    (The -a part means “all”, to commit all changes to tracked files. The message is required!)

    (Note: this does not upload to the remote repo yet!)

  • git merge: Update the working directory based on any new commits in the local repo. This is the opposite of git commit.

    (Typically you don’t run this directly, but it happens as part of a git pull command.)
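
The status, add, and commit commands above fit together roughly like this. It is only a sketch: the folder, file name, and the "Demo Student" identity are invented for the example, and the two git config lines are there because git needs a name and email before it will commit.

```shell
mkdir commit-demo
cd commit-demo
git init
git config user.name "Demo Student"          # hypothetical identity, stored for this repo only
git config user.email "student@example.com"

# A new file starts out untracked.
echo 'print("hello")' > hello.py
git status --short                           # ?? hello.py

# Track the new file, then snapshot it into the local repo.
git add hello.py
git commit -a -m "Add hello.py"

# Later edits to a tracked file do not need another git add when using -a.
echo 'print("goodbye")' >> hello.py
git status --short                           #  M hello.py
git commit -a -m "Print a goodbye too"

# The local repo now holds two snapshots.
git log --oneline
```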

5.3 Remote sync commands

  • git push: Upload all commits from the local repo to the remote repo.

    (The first time you do push you have to specify the remote name and branch, like git push -u origin main. The -u part here remembers the names so in the future you can just say git push.)

    (Note: you should typically pull before you push, to make sure your local repo is up to date before you try to share your changes with others.)

  • git fetch: Download all commits from the remote repo to the local repo. It is the opposite of push.

    (Typically you don’t run this directly, but it happens as part of a git pull command.)

  • git pull: Do a git fetch followed by a git merge, to download any commits from the remote repo to the local repo, and then incorporate those changes into the working directory.

    (You should typically commit before you pull, to make sure you have a snapshot of your working directory before trying to merge in any changes from others.)
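
To see push and pull in action without needing a GitHub account, the sketch below uses a "bare" repo in a local folder as a stand-in for the remote server. That stand-in, the two collaborators alice and bob, and the file notes.txt are all assumptions made up for this demo; with a real hosting service the URL would come from the “Clone” or “Code” button instead.

```shell
# Stand-in remote repo: a bare repo (database only, no working directory).
git init --bare hub.git
HUB="$PWD/hub.git"

# Alice starts the project and pushes her first commit.
mkdir alice
cd alice
git init
git config user.name "Alice"
git config user.email "alice@example.com"
echo "line 1 from Alice" > notes.txt
git add notes.txt
git commit -a -m "Start notes"
git branch -M main                 # make sure the branch is called main
git remote add origin "$HUB"
git push -u origin main            # first push: name the remote and branch
cd ..

# Bob clones, commits, pulls (nothing new yet), then pushes.
git clone -b main "$HUB" bob
cd bob
git config user.name "Bob"
git config user.email "bob@example.com"
echo "line 2 from Bob" >> notes.txt
git commit -a -m "Bob adds a line"
git pull
git push
cd ..

# Alice pulls: fetch downloads Bob's commit, merge updates her working directory.
cd alice
git pull
cat notes.txt
cd ..
```

After the last pull, Alice's copy of notes.txt contains both lines, even though she never edited the file a second time herself.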

6 Pip and conda/mamba

Most of the software we have used in SD211 and SD212 is maintained by communities of programmers who typically use git to coordinate their changes, track bugs and issues, and share their code with others.

The Python packages that we have installed are either installed with pip and distributed through PyPI, or installed with mamba and distributed through conda-forge.

While the details differ, it is important to understand that anyone can propose to add their own software package to these repositories, and the process is fairly streamlined. Yes, this means that even you could write some Python code, share it on GitHub, and get it added to PyPI and conda-forge with a few hours’ work. This openness is a big reason why Python has become so popular for data science, but it also means that we should be careful and thoughtful about what software we are downloading when we run pip install or mamba install.

As an example, we created a very simple Python package to help with the last homework, and got it published on PyPI at https://pypi.org/project/sd212review/. You can see that the code is actually maintained on GitHub at https://github.com/sd212usna/sd212review. Scroll down on the README to see exactly the commands needed to publish this git repo on PyPI; it only took a couple of minutes!