Lab 7: Submit system data science
1 Overview
For our last lab we are asking you to turn your data science skills on yourself — or more specifically, on the data from all of your submit system interactions since SD211.
This is the most open-ended lab of the semester. We are not telling you specifically what questions you should be investigating. Instead, we are specifying the kinds of analyses we want you to do. Basically, your analysis should incorporate a large segment of the input data, should preferably involve some use of sklearn, and should lead to a clear and compelling recommendation or insight for next year. You will turn in your code (of course) as well as a 1-slide (PDF) presentation highlighting your results.
In order to facilitate discussions and possibly help ease some of your end-of-the-semester stress, we are also allowing you to optionally work in pairs on this lab. The idea is that you will actually work together to come to a better result than you would individually. Both partners will submit the full lab and indicate each other’s names in the markdown file.
1.1 Deadlines
- Milestone: 2359 on Monday, 1 May
- Complete lab: 2359 on Thursday, 4 May
1.2 Learning goals
- Experience handling a very large JSON data dump
- Practice working with raw data which may contain omissions and inconsistencies
- Apply domain-specific knowledge to identify potential insights in a dataset
2 The data
2.1 Download
The raw data consists of a single large JSON file that you can download from this link or in the terminal directly:
wget "https://roche.work/courses/s23sd212/lab/submit/submit-data.json.xz"
Once you have it, use the `unxz` command to decompress it and give you the actual JSON file `submit-data.json`.
2.2 Structure
This JSON file is organized as a list of dicts, where:

- There are around 37000 entries in the list
- Each list entry is a dictionary with the same keys (but different values, of course!)
- Each dict in the list represents a single test run of a single submission.

The `user` fields have been anonymized from actual Midshipmen alphas to randomly-assigned superhero names. The renaming is consistent, so that the same “superhero” over multiple entries actually corresponds to the same student.

Hierarchically, the data is structured like this:
- There are two `course`s in the dataset (SD211 and SD212)
- Each course contains multiple `project`s (assignment names like `hw01` or `lab07`)
- Each project has multiple test cases, each with its own `rulename` and “test case id” or `tid`
- Each student (superhero) may make multiple submissions to various courses and assignments; each submission gets its own “submission id” or `sid`
- Each entry in the list is a single (submission, testcase) pair, which contains information like whether that testcase passed and how long it took to run.
For example, the dataset contains 1155 entries for SD212 lab05. That is because this particular lab had 7 test cases, and each of the 44 students in the dataset submitted about 3.8 times on average. It comes to 1155 entries for this lab because each submission by each student creates a separate entry for each test case.
It’s doubtful you will learn much by opening this file in a text editor. Instead, you need to use your tools!
Primarily, in Python, remember that you can use the `json` library to turn this file into a list of dicts.
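For example, here is a minimal sketch of reading it in (the exact `course` and `project` strings used below are a guess; check what the data actually uses):

```python
import json

# Read the decompressed file into a big list of dicts
with open("submit-data.json") as f:
    data = json.load(f)

print(len(data))       # around 37000 entries
print(data[0].keys())  # every entry has the same keys

# Count the entries for one assignment (the exact course/project
# string values are an assumption -- inspect the data to confirm)
lab05 = [d for d in data if d["course"] == "SD212" and d["project"] == "lab05"]
print(len(lab05))      # should be around 1155, per the example above
```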
There is also a handy-dandy command line tool called jq which can be used to get a quick glimpse of this data. For example, to look at index 30000 in this list you could run, in bash:
jq '.[30000]' <submit-data.json
2.3 Data dictionary
The actual meaning of each dictionary entry is summarized in the table below (mostly copied from the actual documentation of the submit system).
Keep in mind, you won’t actually need to use most of these entries, so if you’re not sure exactly what something means, don’t sweat it.
Also remember that many of these fields will be repeated across many dictionaries. For example, all 1155 entries for SD212 lab05 have the same `close` deadline.
Key | Description |
---|---|
user | The (superhero) username of the student who made this submission |
pid | The process id of the process that actually executed this testcase on the server |
year | Academic year |
semester | FALL, SPRING, or SUMMER |
block | Always 1 in fall and spring |
course | defines the course in which the project will be added |
project | project name for submissions |
open | time when the project becomes available for students to submit to |
close | time after which any student submissions received will be marked as late |
realclose | time after which any student submissions will be rejected by the system |
type | defines the type of the project (for grading purposes) examples: homework, lab, project, etc. |
title | simple title for the project to remind students what they are submitting to |
link | a URL that will be made available for students to click on (usually the assignment location) |
description | describes the test case |
maxattempts | the maximum number of times a student may attempt to submit an assignment |
waitperiod | time, in seconds, that the student will be forced to wait before resubmitting an assignment |
compile_target | makefile target to compile the student code |
run_target | makefile target to run the student code |
analysis_target | makefile target to analyze the student code |
language | used only for makefile assistance to inform the system of what language will be used, not required for test case runs |
makefile | actual makefile that will be used to run student code |
run | automatically run this test case |
showgrades | whether the results are displayed to students |
sid | student submission ID |
datestamp | Date and time when this submission was received |
tid | test case ID |
rulename | rule name for the test case |
points | the point value given to the student upon a successful run |
source | what to run the test case against |
sourcefile | specify the file to be used when the source is ‘Created File’ |
stdin | text provided as input to the program being tested |
outvalue | expected output to be compared against |
cond | condition that must be satisfied |
infinite | time in seconds after which the system will terminate the program and mark it as an infinite loop |
view_open | time at which the students will be able to see the testcase |
view_close | time at which the students will stop being able to see the testcase |
hide | hide the existence of the test case from the students |
docker | identifies which setup actually ran the testcase |
returnval | value returned from the executing program |
stime | time it took the program to run |
stderr | results returned via stderr from the running program |
stdout | results returned via stdout from the running program |
pass | overall results: did the student pass |
diff | string that the expected output is compared against |
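Since every entry has the same keys, the whole list also converts cleanly into a Pandas DataFrame, which is a convenient way to explore these fields; a quick sketch:

```python
import json
import pandas as pd

with open("submit-data.json") as f:
    df = pd.DataFrame(json.load(f))  # one row per (submission, testcase) pair

# Peek at a few of the fields from the table above
print(df[["user", "course", "project", "rulename", "pass"]].head())
print(df["course"].value_counts())   # how the entries split between the two courses
```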
3 Your tasks
3.1 Markdown file to fill in
Here is the file with questions to fill in and submit for today’s lab: lab07.md
You can run this wget command to download the blank md file directly from the command line:
wget "https://roche.work/courses/s23sd212/lab/md/lab07.md"
3.2 Code to submit
You will write a single program `analyze.py` that reads in the JSON data without modifying it, performs any necessary calculations and computations, and then probably produces some visualization that will be included in your 1-slide presentation.
This code must be clearly documented so it makes sense when someone else (your teammate, or your instructor, or yourself!) opens it to understand what you did.
(It’s OK if you have some smaller `.py` programs as you are working through things, but you should combine them into a single one-shot program in order to turn in. Make sure you run it and it actually works though!!)
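There is no required structure for `analyze.py`, but if it helps, here is a rough skeleton of the overall shape; the analysis itself is yours to fill in, and the function names here are just suggestions:

```python
"""analyze.py: Lab 7 submit-system analysis (skeleton only)."""
import json
import pandas as pd
import matplotlib.pyplot as plt


def load_data(path="submit-data.json"):
    """Read the raw JSON dump, without modifying the file, into a DataFrame."""
    with open(path) as f:
        return pd.DataFrame(json.load(f))


def analyze(df):
    """Do the calculations for your question; return something plottable."""
    ...  # your aggregation and/or sklearn model goes here


def main():
    df = load_data()
    result = analyze(df)
    # ... produce the visualization for your 1-slide presentation here,
    # e.g. with result.plot(...) followed by plt.savefig("figure.png")


if __name__ == "__main__":
    main()
```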
3.3 1-slide presentation
Create a single slide which displays your results and gives the recommendation that goes along with it in a clear and compelling way.
Your single slide must be turned in as a PDF file. Remember, just changing the file name to end in `.pdf` doesn’t do anything; you have to actually “save as PDF” or “export to PDF”, depending on what program you use to make it.

Call the file `slide.pdf`.
3.4 Turn in
By the final deadline, both partners should submit the same thing. (It’s OK if only one partner submits for the milestone however.)
submit -c=sd212 -p=lab07 lab07.md analyze.py slide.pdf
or
club -csd212 -plab07 lab07.md analyze.py slide.pdf
or use the web interface
3.5 Rubric
Here’s the grading breakdown:
20%: The question being investigated is interesting and appropriate to the dataset. The explanations given in the markdown file provide a good overview of the approach and conclusions.
30%: Analysis incorporates a substantial fraction (at least 10000) of the data points to produce a clear result, and runs in under a minute (hopefully much less!) on the original, unmodified raw data.
(This does not mean you have to use every single part of each data point, but for example focusing on just one lab or HW would not qualify.)
20%: Analysis makes sensible use of supervised or unsupervised learning in `sklearn`.
20%: 1-slide presentation is clear and compelling, and includes a good visualization of the analysis.
10%: Directions were followed along the way (turning things in correctly and on time, proper formatting and naming, etc.)
4 Questions to answer
Download, fill in, and submit: lab07.md
What are the names and alphas of you and your partner for this lab?
(If you are working by yourself, just put your own name and alpha.)
What specific work did each partner do to contribute?
(Come back and edit this question later as you make progress.)
What specific question or problem are you investigating in the data?
When did you discuss this question and run it by your instructor?
(Yes, you need to do that before you get too far into things.)
Explain in a few sentences (as much as you need, but it doesn’t need to be too long) how your analysis code works to read in and process the data.
Explain in a few sentences how your code analyzes the data. If you used the sklearn library as requested, this is the time to explain your use of supervised or unsupervised learning.
The above questions must be submitted (along with your initial analysis code) by the milestone deadline.
What conclusions do you draw from the analysis? What specific recommendations would you make for next year’s students or instructors?
Any comments or suggestions about this final lab, or anything else? We hope it was fun!
5 Tips and suggestions
To develop an initial question or approach, first get familiar with the data. Notice which fields of each dictionary are kind of boring or irrelevant, and which ones may lead to some insights.
You will likely need to combine the information from multiple data points along some direction, like combining all the submissions to each assignment together, or all the submissions by a single student across multiple assignments, etc.
You can do this combining either in your Python code as you read in and loop through the JSON dictionaries, or you can do it in Pandas. Either way is fine; do whatever makes the most sense, but be deliberate about your approach and think ahead towards your ultimate goal.
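For example, here is a minimal Pandas sketch of the per-assignment version (assuming `df` is the DataFrame of all entries, and that `pass` holds True/False values):

```python
# Distinct submissions and overall test-case pass rate for each assignment
per_project = df.groupby(["course", "project"]).agg(
    submissions=("sid", "nunique"),
    pass_rate=("pass", "mean"),
)
print(per_project.sort_values("pass_rate"))
```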
Like any real dataset, the entries have some missing data or inconsistencies. For example, some test cases were put in by your instructors but later deleted, or some submissions might have been uploaded and then re-submitted too quickly for the initial test cases to have a chance to even run. Usually these kinds of things are indicated with `None` entries in the dicts.

You are likely to want to do something with the many timestamps present in this data. Fortunately, all the timestamps are consistently formatted for your convenience, into standardized strings such as `2023-01-17 09:55:10`. You can convert such strings to `datetime` objects either by calling `datetime.fromisoformat(str)` using the datetime library or `pd.to_datetime(str)` using Pandas.

Either way, for your analysis you will probably be most interested in the differences between times, like between a submission time and the deadline. To get the differences, just perform normal subtraction in Python.
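For instance, a quick sketch of that subtraction in Pandas (any `None` timestamps become `NaT`; how you handle those is up to you):

```python
import pandas as pd

df["datestamp"] = pd.to_datetime(df["datestamp"])
df["close"] = pd.to_datetime(df["close"])

# Positive margin means the submission arrived before the deadline
margin = df["close"] - df["datestamp"]
print(margin.dt.total_seconds().describe())
```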
Talk to your partner and your instructor to discuss options and approaches. Always be thinking towards your final goal so you don’t get caught up on irrelevant things or waste your time.
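And if you’re not sure where to begin with the sklearn requirement, one possible starting point is a small unsupervised sketch like the one below; the per-student features here are placeholder ideas, not requirements:

```python
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical per-student feature table: one row per superhero
features = df.groupby("user").agg(
    n_submissions=("sid", "nunique"),
    pass_rate=("pass", "mean"),
)

# Standardize, then look for a few clusters of submission behavior
X = StandardScaler().fit_transform(features)
features["cluster"] = KMeans(n_clusters=3, n_init=10).fit_predict(X)
print(features.groupby("cluster").mean())
```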
Have fun! This should be a good chance for you to show off and practice what you’ve learned this year.