If you are trying to predict the calories in a specific food using regression, which column could be useful? A: Grams of fat B: Food Names C: Food Category D: Measure amount

Answer

A
What is a common ethical concern in The Ethics of Using Hacked Data, Optimizing Schools, and Diabetes AI Agent?

A. If the math used to analyze the data was correct B. If the survey questions were fair C. If the data was used without permission or broke people’s privacy D. If the government should be in charge of the data
Answer
Which method would you use to permanently remove all rows from a DataFrame df that contain any missing values?

A. df.remove_na() B. df.dropna() C. df.drop(missing=True) D. df.fillna(method=“drop”)

Answer

B
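One caveat worth illustrating: `df.dropna()` returns a new DataFrame by default, so the rows only disappear from `df` once the result is assigned back. A minimal sketch, assuming a small made-up DataFrame (the column names and values are hypothetical):

```python
import pandas as pd

# Hypothetical data: two of the three rows contain a missing value.
df = pd.DataFrame({"name": ["Ann", "Bob", None], "score": [90.0, None, 75.0]})

cleaned = df.dropna()      # returns a new DataFrame; df itself is unchanged
rows_before = len(df)      # df still has every row here
df = df.dropna()           # reassign to keep only the complete rows
rows_after = len(df)

print(rows_before, rows_after)
```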
Which regular expression pattern would match a string that contains EXACTLY a 4-digit number (e.g., “1234” or “5678”)?
2. ^[0-9]{4}$
3. [0-9]{4}
Answer

d
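The idea of matching exactly four digits can be checked in Python. This is only a sketch: the helper name `is_four_digit` and the test strings are made up for illustration, and `re.fullmatch` is used so the whole string has to match, which plays the role of anchoring the pattern at both ends.

```python
import re

def is_four_digit(text):
    # fullmatch succeeds only if the entire string is exactly four digits
    return re.fullmatch(r"[0-9]{4}", text) is not None

samples = ["1234", "5678", "12345", "12a4", "abc1234"]
checks = {s: is_four_digit(s) for s in samples}
print(checks)

# Without anchoring, the same pattern is found inside a longer number.
unanchored = re.search(r"[0-9]{4}", "12345") is not None
print(unanchored)
```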
Which level in the computer storage Memory Hierarchy is included in Primary Storage but NOT found on the CPU?
1. Caches
2. Registers
3. Flash Disk
4. Main Memory
5. Internet
Answer

d
When is data considered public and ethical to use?
1. If it can be web scraped
2. If it’s already been used in another study
3. If it’s data the person consented to give
4. both a and c
Answer

c
You have just made changes to a Python file in your Git-tracked project, but you have not run ‘git add’ or ‘git commit’ yet. Where do your changes currently exist?
1. the remote repository only
2. the working directory only
3. both the working directory and remote repository
4. the local repository only
Answer

b
Answer

c
Which of the following best describes a process with regard to an operating system? A) A program loaded into memory B) Different commands controlling a computer C) A program that is currently running D) A way to make a program more efficient, limiting I/O time

Answer

C
Which of the following best describes ‘user’ time for CPU outputs using multiprocessing?
1. The time it takes the CPU to complete one process
2. The time it takes for the user to click start
3. The time measuring the total CPU usage across all processes
4. Measurement of how long the user waits for the output to appear
Answer
1. The time measuring the total CPU usage across all processes
In a bash if statement, if the command being tested causes the then branch to run, what exit status did the command return?
1. 0
2. -1
3. 1
4. 2
Answer
1. 0
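The same convention can be observed from Python: an exit status of 0 means success, which is the case in which a bash `if` runs its `then` branch. A small sketch, assuming two tiny child Python processes stand in for arbitrary commands:

```python
import subprocess
import sys

# sys.executable is the Python interpreter running this script.
ok = subprocess.run([sys.executable, "-c", "raise SystemExit(0)"])
bad = subprocess.run([sys.executable, "-c", "raise SystemExit(1)"])

print(ok.returncode)   # exit status of the command that succeeded
print(bad.returncode)  # exit status of the command that failed

# Mirror of `if command; then ...; fi`: the branch runs only on status 0.
branch_taken = (ok.returncode == 0)
```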
Which line of code will search for two different words?
1. grep -E 'data|science'
2. grep data[science]
3. grep data|science
4. grep data/science
Answer
1. grep -E 'data|science'
What is the third step in the “data science lifecycle”?
1. Data Analysis
2. Data Processing and Cleaning
3. Data Acquisition
4. Data Storage
5. Data Visualization, Interface, and Communication
Answer

B
Which is not a violation of data ethics? A. Sharing customer data with other parties without consent B. Skewing data to create biases about communities C. Measuring user trends to improve features for products D. Failing to encrypt sensitive information.

Answer

C
What class concept controls the way built-in Python operators and functions work with classes we write?
1. Inheritance
2. Encapsulation
3. operator overloading
4. Polymorphism
Answer
1. operator overloading
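As an illustration of operator overloading, here is a minimal sketch using a hypothetical `Money` class (not from the course materials). The special methods `__add__`, `__eq__`, and `__str__` decide what `+`, `==`, and `print()` do with objects of the class.

```python
class Money:
    """Hypothetical class used only to illustrate operator overloading."""

    def __init__(self, cents):
        self.cents = cents

    def __add__(self, other):
        # Called for: money1 + money2
        return Money(self.cents + other.cents)

    def __eq__(self, other):
        # Called for: money1 == money2
        return isinstance(other, Money) and self.cents == other.cents

    def __str__(self):
        # Called by str() and print()
        return f"${self.cents / 100:.2f}"

total = Money(150) + Money(275)
print(total)
```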
Which of the following is true?
1. Clustering is a form of unsupervised learning.
2. K-means is a method for linear regression.
3. Continuous data would be like 1,2,3,4.
4. Artificial intelligence is not required to learn.
Answer

A
Which of these methods would merge two dataframes, df1 and df2, on the ‘name’ column?

a.) pd.merge(df2, df1, on = ‘name’)

b.) df1.merge(df2, on = ‘name’)

c.) pd.merged(df1, df2, on = ‘name’)

d.) df1.join(df2, on = ‘name’)

Answer

a
Which of these lines of code will print “it’s true!”?
1. if false, true, then echo “it’s true!”, fi
2. if true, false,then echo “it’s true!”
3. if false; true; then echo “it’s true!”; fi
4. if true; false; then echo “it’s true!”;
Answer

c
I run my program coolthings.py from a terminal with time in front of the command. When I get my output, I see that my user time is much less than my real time. What is happening here?
1. My work CPU doesn’t have enough cores to handle the program.
2. The program’s OS overhead time is high.
3. A lack of RAM is affecting the performance.
4. The program is waiting for user input.
5. Real time represents the time spent on core so it makes sense.
Answer

d
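The gap between real time and CPU time can be reproduced inside Python. In this sketch, `time.sleep` stands in for waiting on user input (an assumption made so the example runs without a keyboard): the wall clock keeps running while the process uses almost no CPU.

```python
import time

start_wall = time.perf_counter()   # wall-clock ("real") time
start_cpu = time.process_time()    # CPU time used by this process (user + system)

time.sleep(0.3)                    # stands in for waiting on input or I/O

wall = time.perf_counter() - start_wall
cpu = time.process_time() - start_cpu
print(f"real: {wall:.2f}s  cpu: {cpu:.2f}s")
```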
What does the command cd, run with no arguments, do when you are already inside several subdirectories?
1. Moves into the directory directly behind
2. Moves to home directory
3. Creates a new directory
4. Orders doordash to Gate 1
Answer

Moves to the home directory
When given a bank csv file with the columns: “NAME,ID#,SAVINGS,CHECKING,WITHDRAW,DEPOSIT”. You create a dataframe, df, to go through the data in the csv file. Which of these Python commands would replace all unfilled entries of the “WITHDRAW” or “DEPOSIT” columns with the number 0? (select the best option)
1. df = df.fillna(0)
2. df = df[[‘WITHDRAW’, ‘DEPOSIT’]].fillna(0)
3. df = df[‘WITHDRAW’,‘DEPOSIT’].fillna(0)
Answer

b
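One thing to watch for: selecting the two columns and assigning the result back to `df` keeps only those two columns. A sketch of filling just those columns while keeping the rest of the table, assuming a couple of made-up rows with the same header:

```python
import io
import pandas as pd

# Hypothetical sample rows with the same columns as the question.
csv_text = """NAME,ID#,SAVINGS,CHECKING,WITHDRAW,DEPOSIT
Brian,12345,1000,500,,
Bob,54321,,100,50,
"""
df = pd.read_csv(io.StringIO(csv_text))

cols = ["WITHDRAW", "DEPOSIT"]
df[cols] = df[cols].fillna(0)   # fill only these two columns, keep all others

print(df)
```

Here the missing SAVINGS value for Bob is left alone, which is the difference from `df.fillna(0)`.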
What are the three main commands of git needed to add a file and changes to a repo and in which order? a) git commit, git pull, git push b) git add, git pull, git push c) git add, git commit, git push d) git status, git add, git commit

Answer

c
Which of the following answer choices is a key ethical consideration when using data that was obtained through previous hacking?
1. Ensuring that the data is significant and can be used for programming.
2. Maximizing the amount of accuracy that the data can provide regardless of how personal the information may seem.
3. Determining whether the data can be considered public, or if there are any stipulations surrounding the data.
4. Minimizing the amount of time that it takes for the data to be processed.
Answer

c
What is the difference between supervised and unsupervised learning? Give an example of each.

Answer

Supervised learning is when you already have the target answers (labels) for your data. Unsupervised learning is when you use data to find patterns or clusters within it, without known answers. An example of supervised learning would be predicting housing prices based on data where the prices are already known. An example of unsupervised learning would be clustering MIDS into groups based on their performance data.
Across the four case studies mentioned in class, what is one ethical challenge that data scientists should consider?

Answer

One ethical challenge data scientists should consider is ensuring informed consent and respecting privacy when collecting or using data.

Write the code to print a dataframe with only the books that appear in both sales and ratings, including their sales and rating info.

sales.csv

    title  copies_sold
0  Book A         1200
1  Book B          800
2  Book C          450

ratings.csv

    title  rating
0  Book B     4.5
1  Book D     3.8
2  Book A     4.7

Answer

import pandas as pd

sales = pd.read_csv('sales.csv')
ratings = pd.read_csv('ratings.csv')


merged = pd.merge(sales, ratings, on='title')


print(merged)
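To try this without the CSV files on disk, the sketch below loads the same sample rows from in-memory strings (the use of `io.StringIO` is an assumption for illustration only) and spells out `how='inner'`, which is the default for `pd.merge`.

```python
import io
import pandas as pd

sales = pd.read_csv(io.StringIO("title,copies_sold\nBook A,1200\nBook B,800\nBook C,450\n"))
ratings = pd.read_csv(io.StringIO("title,rating\nBook B,4.5\nBook D,3.8\nBook A,4.7\n"))

# An inner merge keeps only titles that appear in both frames.
merged = pd.merge(sales, ratings, on="title", how="inner")
print(merged)
```

Book C (no rating) and Book D (no sales) drop out of the result.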

Write a bash command that reads a file named file.txt and replaces all 3-letter words (e.g., “the”, “and”) with the word “???”.

Answer

sed -E 's/\b[A-Za-z]{3}\b/???/g' file.txt
python3 is the name of an executable program on your computer and myprogram.py is an input file that is read by the python3 program. What does it mean when we say that the python3 program acts as an interpreter for the file?

Answer

python3 is translating between the Python programming language and machine instructions “on the fly”, reading each instruction and executing it. (taken from course notes)
Give an example of a case study we discussed in class and one ethical dilemma it brought up.

Answer

The optimizing schools study asked whether tracking kids without telling them or their parents, and restricting their choices, was really ethical.
What is the correct sequence of GitHub commands for accessing and collaborating on a Github repository?

Answer

git pull, git add, git commit -m, git push
Write code that merges two pandas DataFrames on the column food.
Answer
```
Df3 = pd.merge(Df1,Df2, on='food', how='inner')
```
What is a deadlock in regards to multithreading?

Answer

A deadlock is when two threads get hung up because each is waiting on a component that the other is supposed to provide, and neither can run until that component is sent, thus resulting in a locked system.
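A deadlock of this kind can be sketched in Python with two locks taken in opposite order. Everything here is illustrative: the names `lock_a`, `lock_b`, and `worker` are hypothetical, the barriers only force the unlucky timing to happen every run, and the `timeout` is there so the demonstration reports the problem instead of hanging forever.

```python
import threading

lock_a = threading.Lock()
lock_b = threading.Lock()
both_hold_first = threading.Barrier(2)     # wait until each thread holds one lock
both_tried_second = threading.Barrier(2)   # wait until each thread has tried the other
results = {}

def worker(name, first, second):
    with first:
        both_hold_first.wait()
        # The other thread holds `second` and is waiting for `first`,
        # so without the timeout this call would block forever.
        got_second = second.acquire(timeout=0.2)
        results[name] = got_second
        both_tried_second.wait()
        if got_second:
            second.release()

t1 = threading.Thread(target=worker, args=("t1", lock_a, lock_b))
t2 = threading.Thread(target=worker, args=("t2", lock_b, lock_a))
t1.start()
t2.start()
t1.join()
t2.join()
print(results)
```

A common way to avoid the problem is to have every thread acquire the locks in the same order.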

Here is a short program which uses threading:

from threading import Thread
from time import sleep

total = 0

def addtototal(n):
    global total
    print('Adding to total:',n)
    sleep(0.5)
    total += n
    print('The new total is',total)

if __name__ == '__main__':
    thr = []
    for i in [10,20,30]:
        t = Thread(target=addtototal, args=(i,))
        t.start()
        thr.append(t)

    for t in thr:
        t.join() #wait

    print('Done! Total is',total)

Now let’s say this program does not speed up the run time because the threads cannot run simultaneously. Change this program to use multiprocessing.

Answer

from multiprocessing import Process, SimpleQueue
from time import sleep

def addtototal(n, q):
    print('Adding to total:', n)
    sleep(0.5)
    q.put(n)
    print('Done adding', n)

if __name__ == '__main__':
    q = SimpleQueue()
    procs = []

    for i in [10, 20, 30]:
        p = Process(target=addtototal, args=(i, q))
        p.start()
        procs.append(p)

    for p in procs:
        p.join()

    total = 0
    while not q.empty():
        total += q.get()

    print('Done! Total is', total)

Write Python code that goes through the filenames in the list FILENAMES and prints out the lines in each file, keeping in mind that some of the filenames in the list do not exist.
Answer
```
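for name in FILENAMES:
    try:
        # 'with' closes the file even if reading fails partway through
        with open(name) as file:
            for line in file:
                print(line, end='')  # each line already ends with a newline
    except FileNotFoundError:
        # skip filenames that do not exist
        continue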
```
What is the key difference between the bash commands grep and find?

Answer

The key difference is that grep can be used to find a particular string within a file, while find can be used to search for a particular file in a directory.
Why would someone prefer to use DictReader rather than Pandas? Consider the complexity of the csv file.

Answer

This is because DictReader can be more intuitive for simple csv files. It reads row by row by default, and can be good for tasks like appending specific rows.
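A minimal sketch of the row-by-row style, assuming a made-up two-row CSV held in memory (the column names and values are hypothetical):

```python
import csv
import io

# Hypothetical CSV text, kept in memory so the sketch needs no file on disk.
csv_text = "name,breed\nBetty,Ragdoll\nTom,Siamese\n"

rows = []
for row in csv.DictReader(io.StringIO(csv_text)):
    # each row is a dictionary keyed by the header names
    rows.append(row)

siamese_names = [r["name"] for r in rows if r["breed"] == "Siamese"]
print(siamese_names)
```

With a real file, `io.StringIO(csv_text)` would be replaced by an opened file object.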
Describe or explain an ethical objection or problem with one of the case studies we discussed in class.

Answer

With the fictional Princeton-AI-Ethics-Case, one of the ethical objections concerned consent and transparency. The developers offered sub-optimal solutions to some users as treatment without their consent. They argued it was for the greater good because it would help improve the program. However, it is unethical to do this at the expense of worsening the health of others when they believe they are being treated properly.

from Player import Player, Guard, Forward

lebron = Forward('LeBron James', 'Forward', 38)
steph = Guard('Stephen Curry', 'Guard', 45)
giannis = Forward('Giannis Antetokounmpo', 'Forward')
luka = Guard('Luka Doncic', 'Guard')

lebron.get_stats()   # prints "LeBron James plays Forward and has scored 38 points."
steph.get_stats()    # prints "Stephen Curry plays Guard and has scored 45 points."
giannis.get_stats()  # prints "Giannis Antetokounmpo plays Forward and has scored 0 points."
luka.get_stats()     # prints "Luka Doncic plays Guard and has scored 0 points."

lebron.score(2)    # prints "LeBron James scores 2 points!"
steph.dribble()    # prints "Stephen Curry is expertly dribbling the ball."
steph.score(3)     # prints "Stephen Curry scores 3 points!"
giannis.rebound()  # prints "Giannis Antetokounmpo grabs a crucial rebound!"
giannis.score(2)   # prints "Giannis Antetokounmpo scores 2 points!"
luka.dribble()     # prints "Luka Doncic is expertly dribbling the ball."
luka.score(3)      # prints "Luka Doncic scores 3 points!"

lebron.get_stats()   # prints "LeBron James plays Forward and has scored 40 points."
steph.get_stats()    # prints "Stephen Curry plays Guard and has scored 48 points."
giannis.get_stats()  # prints "Giannis Antetokounmpo plays Forward and has scored 2 points."
luka.get_stats()     # prints "Luka Doncic plays Guard and has scored 3 points."

Answer

class Player:
    def __init__(self, name, position, points_scored=0):
        self.name = name
        self.position = position
        self.points_scored = points_scored

    def score(self, points):
        print(f"{self.name} scores {points} points!")
        self.points_scored += points

    def get_stats(self):
        print(f"{self.name} plays {self.position} and has scored {self.points_scored} points.")

class Guard(Player):
    def dribble(self):
        print(f"{self.name} is expertly dribbling the ball.")

class Forward(Player):
    def rebound(self):
        print(f"{self.name} grabs a crucial rebound!")

Suppose you are given a dataset mids.csv with the following columns: name, credit_hours, qpr, prt_score, number_of_billets, and rank_in_company. Use k-means clustering to divide the students into 3 groups.

Answer

import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

df = pd.read_csv('mids.csv')

# drop the non-numeric name column before clustering
mids = df.drop(columns=['name'])

# standardize so no single column dominates the distance calculation
scaled = StandardScaler().fit_transform(mids)
model = KMeans(n_clusters=3, random_state=123)
model.fit(scaled)
df['group'] = model.labels_

I have a dataframe, df. How do I get rid of all of the rows in the dataframe that have null values under the “time” column?
Answer
```
df = df[df['time'].notna()]
```
Write a program that reads in a csv file called “cats.csv”, changes the ‘breed’ and ‘name’ of the first cat in the dataframe, and prints “Sorry this is a CATastrophe” if there is an error.
Answer
```
import pandas as pd

try:
    cats = pd.read_csv('cats.csv')
    # .loc takes a row label and a column name; row 0 is the first cat
    cats.loc[0, 'breed'] = 'Rag Doll'
    cats.loc[0, 'name'] = 'Betty'
except Exception:
    print("Sorry this is a CATastrophe")
```

Given this code, I find that my real time is often much higher than my user time. I want to minimize my real time so that my user time is the main limiting factor. How can I edit the program to minimize real time as much as possible?

from random import randint
luck = input("What do you think your lucky number is?")
luckynumber = randint(1, 100)
print(f"{luck} isn't a bad number, but I think {luckynumber} might serve you better today")

Answer

from random import randint
luck = 7  # hard-coded so the program never waits on input()
luckynumber = randint(1, 100)
print(f"{luck} isn't a bad number, but I think {luckynumber} might serve you better today")

Write a command-line statement to search for the word “error” (case-insensitive) in a file named log.txt and display all matching lines.
Answer
```
grep -i "error" log.txt
```

Given the following “Bank.csv”:

```
NAME,ID#,SAVINGS,CHECKING,WITHDRAW,DEPOSIT
Brian,12345,$1000,500,0,0
Bob,54321,2000,100,$50,0
Ben,43215,$1500,350,0,50
Bart,23451,1800,$400,0,0
```

Use bash to narrow the data down to anyone who is not withdrawing or depositing, and then remove any dollar signs that are causing problems in using the data. Write the new csv to a file called "clean_data.csv".

Answer
```
grep -E ',0,0$' Bank.csv | sed 's/\$//g' > clean_data.csv
```

Write a Python program that takes a person’s favorite number and returns YES if the number is greater than 20 and NO otherwise. Then, write the command line commands to add this file to a repo and commit your changes.
Answer
```
integer = int(input('Favorite number? '))

if integer > 20:
    print('YES')
else:
    print('NO')
```
command line: git pull, git add favenumber.py, git commit -m 'here is file', git push
What could be a potential risk of using data that has been hacked or is not fully anonymous, meaning people’s personal information can still be accessed?

Answer

A potential risk of using this type of data is that you are using information that someone didn’t give you approval to use. You would therefore be violating their right to privacy, and they could take legal action if they felt their privacy had been violated.


SD 212 Spring 2026 / Admin