SD 212 Spring 2023 / Labs


This is the archived website of SD 212 from the Spring 2023 semester. Feel free to browse around; you may also find more recent offerings at my teaching page.

Lab 7: Submit system data science

1 Overview

For our last lab we are asking you to turn your data science skills on yourself — or more specifically, on the data from all of your submit system interactions since SD211.

This is the most open-ended lab of the semester. We are not telling you specifically what questions you should be investigating. Instead, we are specifying the kinds of analyses we want you to do. Basically, your analysis should incorporate a large segment of the input data, should preferably involve some use of sklearn, and should lead to a clear and compelling recommendation or insight for next year. You will turn in your code (of course) as well as a 1-slide (PDF) presentation highlighting your results.

In order to facilitate discussions and possibly help ease some of your end-of-the-semester stress, we are also allowing you to optionally work in pairs on this lab. The idea is that you will actually work together to come to a better result than you would individually. Both partners will submit the full lab and indicate each other’s names in the markdown file.

1.1 Deadlines

  • Milestone: 2359 on Monday, 1 May
  • Complete lab: 2359 on Thursday, 4 May

1.2 Learning goals

  • Experience handling a very large JSON data dump
  • Practice working with raw data which may contain omissions and inconsistencies
  • Apply domain-specific knowledge to identify potential insights in a dataset

2 The data

2.1 Download

The raw data consists of a single large JSON file that you can download from this link or in the terminal directly:

wget "https://roche.work/courses/s23sd212/lab/submit/submit-data.json.xz"

Once you have it, use the unxz command to decompress it and produce the actual JSON file submit-data.json:

unxz submit-data.json.xz

2.2 Structure

This JSON file is organized as a list of dicts, where:

  • There are around 37000 entries in the list

  • Each list entry is a dictionary with the same keys (but different values of course!)

  • Each dict in the list represents a single test run of a single submission.

  • The user fields have been anonymized from actual Midshipmen alphas to randomly-assigned superhero names. The renaming is consistent, so that the same “superhero” over multiple entries actually corresponds to the same student.

  • Hierarchically, the data is structured like this:

    • There are two courses in the dataset (SD211 and SD212)
    • Each course contains multiple projects (assignment names like hw01 or lab07)
    • Each project has multiple test cases with their own rulenames and “test case id” or tid
    • Each student (superhero) may make multiple submissions to various courses and assignments; each submission gets its own “submission id” or sid
    • Each entry in the list is a single (submission,testcase) pair, which contains information like whether that testcase passed and how long it took to run.
  • For example, the dataset contains 1155 entries for SD212 lab05. That is because this particular lab had 7 test cases, and of the 44 students in the dataset, each submitted about 3.8 times on average. It comes to 1155 entries for this lab because each submission by each student creates a separate entry for each test case.

It’s doubtful you will learn much by opening this file in a text editor. Instead, you need to use your tools!

Primarily, in Python, remember that you can use the json library to turn this file into a list of dicts.
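A minimal sketch of that first step (the one-entry sample string below just mimics the real structure; with the real file you would pass an open file handle to json.load instead):

```python
import json

# With the real file: data = json.load(open("submit-data.json"))
# Illustrative stand-in with the same list-of-dicts shape:
raw = '[{"user": "Batman", "course": "sd212", "project": "lab05", "pass": true}]'
data = json.loads(raw)

print(len(data))        # number of (submission, testcase) entries
print(data[0]["user"])  # superhero name of the submitter
```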

There is also a handy-dandy command line tool called jq which can be used to get a quick glimpse of this data. For example, to look at index 30000 in this list you could run, in bash:

jq '.[30000]' <submit-data.json

2.3 Data dictionary

The actual meaning of each dictionary entry is summarized in the table below (mostly copied from the actual documentation of the submit system).

Keep in mind, you won’t actually need to use most of these entries, so if you’re not sure exactly what something means, don’t sweat it.

Also remember that many of these fields will be repeated across many dictionaries. For example, all 1155 entries for SD212 lab05 have the same close deadline.

Key Description
user The (superhero) username of the student who made this submission
pid The process id of the process that actually executed this testcase on the server
year Academic year
semester FALL, SPRING, or SUMMER
block Always 1 in fall and spring
course the course in which the project was created
project project name for submissions
open time when the project becomes available for students to submit to
close time after which any student submissions received will be marked as late
realclose time after which any student submissions will be rejected by the system
type defines the type of the project (for grading purposes) examples: homework, lab, project, etc.
title simple title for the project to remind students what they are submitting to
link a URL that will be made available for students to click on (usually the assignment location)
description describes the test case
maxattempts the maximum number of times a student may attempt to submit an assignment
waitperiod time, in seconds, that the student will be forced to wait before resubmitting an assignment
compile_target makefile target to compile the student code
run_target makefile target to run the student code
analysis_target makefile target to analyze the student code
language used only for makefile assistance to inform the system of what language will be used, not required for test case runs
makefile actual makefile that will be used to run student code
run whether to automatically run this test case
showgrades whether the results are displayed to students
sid student submission ID
datestamp Date and time when this submission was received
tid test case ID
rulename rule name for the test case
points the point value given to the student upon a successful run
source what to perform the test case against
sourcefile specify the file to be used when the source is ‘Created File’
stdin text provided as input to the program being tested
outvalue expected output to be compared against
cond condition that must be satisfied
infinite time in seconds after which the system will terminate the program and mark it as an infinite loop
view_open time at which the students will be able to see the testcase
view_close time at which the students will stop being able to see the testcase
hide hide the existence of the test case from the students
docker identifies which setup actually ran the testcase
returnval value returned from the executing program
stime time it took the program to run
stderr results returned via stderr from the running program
stdout results returned via stdout from the running program
pass overall result: did the student pass
diff string that the expected output is compared against

3 Your tasks

3.1 Markdown file to fill in

Here is the file with questions to fill in and submit for today’s lab: lab07.md

You can run this wget command to download the blank md file directly from the command line:

wget "https://roche.work/courses/s23sd212/lab/md/lab07.md"

3.2 Code to submit

You will write a single program analyze.py that reads in the JSON data without modifying it, performs any necessary calculations and computations, and then probably produces some visualization that will be included in your 1-slide presentation.

This code must be clearly documented so it makes sense when someone else (your teammate, or your instructor, or yourself!) opens it to understand what you did.

(It’s OK if you have some smaller .py programs as you are working through things, but you should combine them into a single one-shot program in order to turn in. Make sure you run it and it actually works though!!)
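If it helps to picture the overall shape, here is one possible skeleton for analyze.py (the function names are just suggestions; the demonstration call at the bottom uses a made-up entry rather than the real file):

```python
import json

def load_entries(path):
    """Read the raw JSON dump into a list of dicts, without modifying it."""
    with open(path) as f:
        return json.load(f)

def analyze(entries):
    """Placeholder for the real work: compute whatever answers your question."""
    return {"total_entries": len(entries)}

# Demonstration on a made-up entry; the real run would call
# analyze(load_entries("submit-data.json")) and then build a visualization.
demo = analyze([{"user": "Batman", "pass": True}])
print(demo)
```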

3.3 1-slide presentation

Create a single slide which displays your results and gives the recommendation that goes along with it in a clear and compelling way.

Your single slide must be turned in as a PDF file. Remember, just changing the file name to end in .pdf doesn’t do anything; you have to actually “save as PDF” or “export to PDF”, depending on what program you use to make it.

Call the file slide.pdf.

3.4 Turn in

By the final deadline, both partners should submit the same thing. (It’s OK if only one partner submits for the milestone however.)

submit -c=sd212 -p=lab07 lab07.md analyze.py slide.pdf

or

club -csd212 -plab07 lab07.md analyze.py slide.pdf

or use the web interface.

3.5 Rubric

Here’s the grading breakdown:

  • 20%: The question being investigated is interesting and appropriate to the dataset. The explanations given in the markdown file provide a good overview of the approach and conclusions.

  • 30%: Analysis incorporates a substantial fraction (at least 10000) of the data points to produce a clear result, and runs in under a minute (hopefully much less!) on the original, unmodified raw data.

    (This does not mean you have to use every single part of each data point, but for example focusing on just one lab or HW would not qualify.)

  • 20%: Analysis makes sensible use of supervised or unsupervised learning in sklearn

  • 20%: 1-slide presentation is clear and compelling, and includes a good visualization of the analysis.

  • 10%: Directions were followed along the way (turning things in correctly and on time, proper formatting and naming, etc.)

4 Questions to answer

Download, fill in, and submit: lab07.md

  1. What are the names and alphas of you and your partner for this lab?

    (If you are working by yourself, just put your own name and alpha.)

  2. What specific work did each partner do to contribute?

    (Come back and edit this question later as you make progress.)

  3. What specific question or problem are you investigating in the data?

  4. When did you discuss this question and run it by your instructor?

    (Yes, you need to do that before you get too far into things.)

  5. Explain in a few sentences (as much as you need, but doesn’t need to be too long) how your analysis code works to read in and process the data.

  6. Explain in a few sentences how your code analyzes the data. If you used the sklearn library as requested, this is the time to explain your use of supervised or unsupervised learning.

The above questions must be submitted (along with your initial analysis code) by the milestone deadline.

  7. What conclusions do you draw from the analysis? What specific recommendations would you make for next year’s students or instructors?

  8. Any comments or suggestions about this final lab, or anything else? We hope it was fun!

5 Tips and suggestions

  • To develop an initial question or approach, first get familiar with the data. Notice which fields of each dictionary are kind of boring or irrelevant, and which ones may lead to some insights.

  • You will likely need to combine the information from multiple data points along some direction, like combining all the submissions to each assignment together, or all the submissions by a single student across multiple assignments, etc.

    You can do this combining either in your Python code as you read in and loop through the JSON dictionaries, or you can do it in Pandas.

    It’s up to you which makes the most sense, but be deliberate about your approach and think ahead towards your ultimate goal.
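For instance, the combining step might look like this in Pandas (the entries below are made up but use the same keys as the real data; the clustering line at the end is just one way to bring sklearn into the picture):

```python
import pandas as pd
from sklearn.cluster import KMeans

# Made-up entries with the same keys as the real data.
entries = [
    {"user": "Batman",   "project": "lab05", "pass": True,  "stime": 0.4},
    {"user": "Batman",   "project": "lab05", "pass": False, "stime": 1.2},
    {"user": "Superman", "project": "hw01",  "pass": True,  "stime": 0.3},
    {"user": "Flash",    "project": "hw01",  "pass": True,  "stime": 0.5},
]
df = pd.DataFrame(entries)

# Combine along one direction: per-student pass rate and average run time.
per_student = df.groupby("user").agg(
    pass_rate=("pass", "mean"),
    avg_stime=("stime", "mean"),
)
print(per_student)

# The aggregated features can then feed an sklearn model, e.g. clustering:
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(per_student)
```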

  • Like any real dataset, the entries have some missing data or inconsistencies. For example, some test cases were put in by your instructors but later deleted, or some submissions might have been uploaded and then re-submitted too quickly for the initial test cases to have a chance to even run. Usually these kinds of things are indicated with None entries in the dicts.
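A sketch of how you might screen those out while reading the data (the two entries are invented; the idea is that pass is None when the test never actually ran):

```python
# Invented entries: the second submission's test never ran, so its results are None.
data = [
    {"sid": 1, "tid": 7, "pass": True, "stime": 0.4},
    {"sid": 2, "tid": 7, "pass": None, "stime": None},
]

# Keep only entries whose test case actually produced a result.
ran = [e for e in data if e["pass"] is not None]
print(len(ran))  # -> 1
```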

  • You are likely to want to do something with the many timestamps present in this data. Fortunately, all the timestamps are consistently formatted into standardized strings such as

    2023-01-17 09:55:10

    You can convert such strings to datetime objects either by calling datetime.fromisoformat(str) using the datetime library or pd.to_datetime(str) using Pandas.

    Either way, for your analysis you will probably be most interested in the differences between times, like between a submission time and the deadline. To get the differences, just perform normal subtraction in Python.
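As a quick sketch (the submission timestamp comes from the example above; the deadline value is invented for illustration):

```python
from datetime import datetime

submitted = datetime.fromisoformat("2023-01-17 09:55:10")  # timestamp from the data
deadline  = datetime.fromisoformat("2023-01-19 23:59:00")  # invented close time

# Subtracting two datetimes gives a timedelta.
early = deadline - submitted
hours_early = early.total_seconds() / 3600
print(hours_early)  # how many hours before the deadline this submission arrived
```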

  • Talk to your partner and your instructor to discuss options and approaches. Always be thinking towards your final goal so you don’t get caught up on irrelevant things or waste your time.

  • Have fun! This should be a good chance for you to show off and practice what you’ve learned this year.