This is the archived website of SD 212 from the Spring 2023 semester. Feel free to browse around; you may also find more recent offerings at my teaching page.

Final Exam Review Problems

Unit 2: Command line
Unit 3: Statistical data types
Unit 4: Regular expressions
Unit 5: Error handling
Unit 6: Data cleaning
Unit 8: Hardware and OS
Unit 9: Concurrency
Unit 10: Data Ethics
Unit 11: OOP in Python
Unit 12: Typing
Unit 13: Machine learning with sklearn
Unit 14: Versions and packaging

Unit 2: Command line

Readings/notes page

What input lines would be matched with the following command: grep -E '\bis$'
1. That is
2. yeah it tis
3. is this right
4. what is?
5. not is
Answer

a,e
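The answer above can be cross-checked in Python with the re module. This is only a sketch: it assumes that Python's `\b` and `$` behave like grep -E's for these simple inputs, and the variable names are made up for illustration.

```python
import re

# The five candidate input lines from the question
lines = ["That is", "yeah it tis", "is this right", "what is?", "not is"]

# \b is a word boundary and $ anchors the match at the end of the line
pattern = re.compile(r"\bis$")

matched = [line for line in lines if pattern.search(line)]
print(matched)
```

Only the lines ending in the whole word "is" survive the filter.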
How would you move the file ‘image.jpg’ into a new subdirectory called images?
1. move images image.jpg
2. mv image.jpg /images/
3. mv image.jpg images/
4. move image.jpg /images
Answer

c
Which of the following pipelines correctly grabs the first 7 lines of a file ‘book.txt’ and counts the number of times the word ‘the’ appears?
1. head book.txt | grep -c 'the'
2. head -n 7 book.txt | grep -c 'the'
3. cut -n 7 book.txt | grep -c 'the'
4. head -n 7 book.txt | grep 'the'
Answer

b
Suppose we have a file called ‘midshipmen.csv’, where each line has some information about midshipmen, including their alpha (starting with m). Write a bash command that counts the number of youngsters in this file.
Answer
```
grep -E ',m25[0123456789]{4},' midshipmen.csv | wc -l
```

"Year","Sex","Rank","Name","Count","Data_Revision_Date"
2000,Female,2,ASHLEY,2815,11/07/2022
2000,Female,3,SAMANTHA,2576,11/07/2022
2000,Female,3,JESSICA,2467,11/07/2022
2000,Female,5,JENNIFER,2256,11/07/2022
2000,Female,6,ALYSSA,2003,11/07/2022
2000,Female,7,HANNAH,1849,11/07/2022
2000,Female,3,SARAH,1847,11/07/2022
2000,Female,3,ELIZABETH,1830,11/07/2022
2000,Female,10,ALEXIS,1825,11/07/2022

In this CSV called ‘babies.csv’, how would you use a bash command to find how many names in this data have a rank of 3?

Answer

grep ',3,' babies.csv |wc -l

Write a bash script that takes the 5th through 8th lines of a file book.txt and places them in a new file called new.txt. (You should include the 5th and 8th line)
Answer
```
head -n 8 book.txt | tail -n 4 > new.txt
```
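For comparison, the same "lines 5 through 8" selection can be sketched in Python. The helper name lines_between and the sample data are hypothetical; the sketch uses itertools.islice, which counts from 0 and excludes its end point, so the inclusive 1-indexed range 5..8 becomes islice(lines, 4, 8).

```python
from itertools import islice

def lines_between(lines, first, last):
    """Return lines first..last (1-indexed, inclusive) from an iterable of lines."""
    return list(islice(lines, first - 1, last))

# Hypothetical stand-in for the contents of book.txt
sample = [f"line {i}\n" for i in range(1, 11)]

picked = lines_between(sample, 5, 8)
print("".join(picked), end="")
```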

Unit 3: Statistical data types

Readings/notes page

Suppose you and your friend are racing each other in a school zone. What type of statistical data best describes how fast you were traveling?
1. ordinal data
2. discrete numerical data
3. continuous data
4. nominal numerical data
Answer

c
What kind of statistical data type can be used to explain Order Of Merit (OOM)?
1. numerical and continuous
2. numerical and discrete
3. categorical and nominal
4. categorical and ordinal
Answer

d
What kind of data is your Overall Order of Merit (OOM) number?
1. Ordinal
2. Categorical
3. Nominal
4. Numerical
5. Both A and B
6. B and D
Answer

E

Use a csv of car information to print out a dataframe of model names, one categorical data type, and one discrete numerical data type sorted in descending order by the discrete numerical data.

model,type,seats,drivetrain,msrp,horsepower,mpg_city,weight
rainier,suv,7,all,37895.0,275.0,15.0,4600.0
rendezvous,suv,5,front,26545.0,185.0,19.0,4024.0
century,sedan,5,front,22180.0,175.0,20.0,3353.0
lesabre,sedan,5,front,26470.0,205.0,20.0,3567.0
regal,sedan,4,front,24895.0,200.0,20.0,3461.0

Answer

import pandas as pd

cars = pd.read_csv('cars2.csv')
# keep model (name), drivetrain (categorical), and seats (discrete numerical)
new = cars.drop(columns=['type','msrp','horsepower','mpg_city','weight'])
print(new.sort_values(by='seats',ascending=False))

Why is it important to understand what type of data you are working with in regards to analysis and visualization of data?

Answer

When working with data, it is important to understand which statistical type it is so that it can be properly used and analyzed. For example, categorical data can be extremely misleading when arithmetic is done on it (averaging, adding, comparing, etc.). Without proper knowledge of whether data is categorical or numerical, math may be performed on the wrong types of data, resulting in highly misleading conclusions. We run into similar problems when working with visualizations. For example, nominal and numerical data can be confused, leading to graphs that imply a different connection than what is truly present. This is common when a trend line suggests a correlation only because data that is neither ordinal nor numerical was arranged in a certain order. With knowledge of statistical types, you may be more inclined to use a different form of graph to remove confusion.
What Python command can you use to determine how many distinct values exist and how many times they are repeated?

(Challenge: answer it using command line)
Answer
Python:
```
.value_counts()
```
Command Line:
```
sort | uniq -c
```
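Without pandas, the same tally can be sketched in plain Python with collections.Counter, which plays roughly the role of `sort | uniq -c`. The sample values below are hypothetical.

```python
from collections import Counter

# Hypothetical sample values, standing in for one column of data
values = ["sedan", "suv", "sedan", "sedan", "suv", "truck"]

counts = Counter(values)

# most_common() lists each distinct value with its count, largest count first
for value, count in counts.most_common():
    print(count, value)
```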

Unit 4: Regular expressions

Readings/notes page

Which of the choices will not match with the regular expression N(AV*|OS*)Y? (Select all that apply)
1. NAVY
2. NAVVY
3. NAAVVY
4. NOSSY
5. NOSY
Answer

C
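One way to check an answer like this is to test each choice against the pattern in Python. The sketch below uses re.fullmatch so that the whole string has to match; the variable names are only for illustration.

```python
import re

pattern = re.compile(r"N(AV*|OS*)Y")
candidates = ["NAVY", "NAVVY", "NAAVVY", "NOSSY", "NOSY"]

# fullmatch requires the whole string to match, not just part of it
non_matching = [s for s in candidates if not pattern.fullmatch(s)]
print(non_matching)
```

NAAVVY fails because the pattern allows only a single A before the run of V's.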
Which of the following regular expressions would match the name of India.Arie? (Select all that apply.)
1. India.Arie
2. India\.Arie
3. .*
4. [A-Za-z]*[.][A-Za-z]*
5. [^x]+
Answer

All of them: a, b, c, d, e
Which one of these is a mismatch for the given regex example mid$ ?
1. mid
2. humid
3. sponsor mid
4. middle
Answer

d

Write a Python program to count how many 800 numbers like 1-800-XXX-XXXX are in a file called telephone.txt.

Answer

import re

count = 0
with open('telephone.txt') as f:
    for line in f:
        # findall returns every 800 number found on this line
        count += len(re.findall(r'\b1-800-[0-9]{3}-[0-9]{4}\b', line))
print(count)

Write a bash script with regular expressions that loops through a folder containing .txt files and counts the number of files that contain the word Goat.

(Upper or lowercase, but only count whole words. So GOAT and goat should both count, but not goatee or scapegoat.)
Answer
```
for file in *.txt
do
grep -E -i -m 1 '\bgoat\b' "$file"
done | wc -l
```
Write a bash command that would change the format of a date from 02/01/2023 to 2023-02-01 in a file called dates.txt
Answer
```
sed -i 's/\([0-9]\{2\}\)\/\([0-9]\{2\}\)\/\([0-9]\{4\}\)$/\3-\1-\2/' dates.txt
```
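The same capture-group substitution can be sketched in Python with re.sub. The function name reformat_date is hypothetical; the three groups capture month, day, and year, and the replacement reorders them as year-month-day.

```python
import re

def reformat_date(text):
    """Rewrite every MM/DD/YYYY date in text as YYYY-MM-DD."""
    return re.sub(r"([0-9]{2})/([0-9]{2})/([0-9]{4})", r"\3-\1-\2", text)

print(reformat_date("02/01/2023"))
```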

Unit 5: Error handling

Readings/notes page

What does the following bash script do?
```
if grep 'CLASSIFIED' file.txt
then
echo "REDACTED"
else
cat file.txt
fi
```
1. Replaces the word CLASSIFIED with the word REDACTED in file.txt
2. Turns file.txt into a cat
3. Prints REDACTED if the file contains the word CLASSIFIED, and otherwise displays the contents of the file
4. Prints REDACTED if the file does NOT contain the word CLASSIFIED, and otherwise displays the contents of the file
Answer

c
What is the purpose of error handling in programming?
1. To intentionally cause errors in a program.
2. To ignore any errors that may occur in a program.
3. To prevent a program from crashing when errors occur.
4. To make a program run faster.
Answer

Answer: C) To prevent a program from crashing when errors occur.
Write a Python function first_line(fname) that takes a string for the name of a file, and returns a string for the first line of that file. If the file does not exist, your function should return an empty string.
Answer
```
def first_line(fname):
    try:
        f = open(fname)
    except FileNotFoundError:
        return ''
    with f:
        # readline() returns '' if the file is empty
        return f.readline()
```
Write a Python function called divide that takes two arguments a and b, and returns the result of dividing a by b. However, if b is equal to zero, the function should raise a ValueError with the message “Cannot divide by zero”.
Answer
```
def divide(a, b):
    try:
        return a / b
    except ZeroDivisionError:
        raise ValueError("Cannot divide by zero")
```
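A caller can then handle the ValueError instead of letting the program crash. The sketch below repeats divide so that it runs on its own; safe_divide is a hypothetical helper added only to show the try/except on the calling side.

```python
def divide(a, b):
    try:
        return a / b
    except ZeroDivisionError:
        raise ValueError("Cannot divide by zero")

def safe_divide(a, b):
    """Hypothetical caller: returns None instead of crashing when b is 0."""
    try:
        return divide(a, b)
    except ValueError as err:
        print("Error:", err)
        return None

print(safe_divide(6, 3))
print(safe_divide(1, 0))
```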

Unit 6: Data cleaning

Readings/notes page

Given a dataframe df with missing values, which pandas method will replace any NaN values with a given value?
1. df.fillna
2. df.dropna
3. df.subna
4. df.isnull
Answer

a. df.fillna
What is the proper format to combine two dataframes together?
1. pd.merge(df1,df2, on = 'name')
2. df1.merge(df2, on = 'name')
3. pd.combine(df1,df2, on = 'name')
4. df1.combine(df2, on = 'name')
Answer
a. pd.merge(df1,df2, on = 'name'). Option b, df1.merge(df2, on = 'name'), performs the same merge as a DataFrame method.
What syntax would you use to set the index of a dataframe, df, according to one of its columns entitled ‘names’?
1. set_index(df['names'])
2. df.index(['names'])
3. df.set_index('names')
Answer

c
What would this command do?
```
cut -d ',' -f3 planes.csv | sed "s/blue/red/g"
```
1. delete the 3rd column of planes.csv and replace all “blue” with “red”
2. pull only the 3rd row of planes.csv and replace all “blue” with “red”
3. pull only the 3rd column of planes.csv and replace all “blue” with “red”
4. delete all occurrences of “blue” and “red”, then only take out the 3rd column
Answer

c
How does one drop all columns with more than two NaN values?
1. df.dropna()
2. df.dropna(thresh=3)
3. df.dropna(how='all')
4. df.dropna(thresh=2,axis=1)
Answer

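D. Note that thresh counts non-NaN values: with thresh=2 and axis=1, a column is kept only if it has at least two non-NaN values.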
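Because thresh is easy to misread, here is a small sketch of what dropna(thresh=2, axis=1) keeps. The dataframe is hypothetical and the example assumes pandas is installed.

```python
import pandas as pd

# Hypothetical data: column A has one non-NaN value, B has two, C has three
df = pd.DataFrame({
    "A": [1.0, None, None],
    "B": [1.0, 2.0, None],
    "C": [1.0, 2.0, 3.0],
})

# axis=1 works on columns; thresh=2 keeps columns with at least 2 non-NaN values
kept = df.dropna(thresh=2, axis=1)
print(list(kept.columns))
```

Column A is dropped because it has only one non-NaN value.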
How would you merge three dataframes together while keeping all of the rows intact?
Answer
```
left2.join([right2, another], how='outer')
```

Given place.csv:

Abbreviation,State Name,population
AL,Alabama,10000
AK,Alaska,100000
AZ,Arizona,10000
AR,Arkansas,100000
CA,California,52000
...

and given crime.csv:

state, crime rate, deaths
AL,.32,18
AK,.12,40
AZ,.68,13
AR,.22,8
CA,.47,78
...

Using these two csv files, create a dataframe which has two additional columns: one labeled ‘crimes’, which is the number of crimes committed in each state (hint: crime rate times population), and the other labeled ‘death rate’ (deaths divided by population).

Answer

import pandas as pd

place = pd.read_csv('place.csv')
new = place.rename(columns={'Abbreviation': 'state'})
# skipinitialspace drops the space after each comma in the crime.csv header
crime = pd.read_csv('crime.csv', skipinitialspace=True)

df = pd.merge(new,crime, on = 'state')

df['crimes'] = df['crime rate'] * df['population']
df['death rate'] = df['deaths'] / df['population']

print(df)

Given two dataframes:

data1 =
   Missouri  Alabama   Oregon
a    NaN      NaN        8.0
c    9.0     10.0        10.0
e   13.0     14.0       12.0
g   12.0      5.0        8.0

data2 =
    OH       NV        NY
a   1.0     2.0       NaN
c   3.0     NaN       9.0
e   NaN     6.0      11.0

Replace all NaN values in Ohio with 1.0 and those in other states with 6.0, and combine these dataframes to make one big dataframe called data_comb. We also have some data on New Jersey: [4.0,5.0]. Add this to the dataframe.

Answer

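import pandas as pd

# data1 and data2 are the two dataframes given above
data_comb = pd.merge(data2,data1, how='outer',left_index=True, right_index=True)

# rename and fillna return new dataframes, so assign the results
data_comb = data_comb.rename(columns={'OH':'Ohio','NV':'Nevada','NY':'New York'})
data_comb = data_comb.fillna({'Ohio': 1.0,'Nevada':6.0,'New York':6.0,'Missouri':6.0,'Alabama':6.0})

# New Jersey only has data for rows a and c; the other rows get 6.0
New_Jersey = pd.Series({'a':4.0,'c':5.0,'e':6.0,'g':6.0}, name='New Jersey')

data_comb = pd.concat([data_comb,New_Jersey], axis=1)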

It should look like this:

   Ohio  Nevada  New York  Missouri  Alabama  Oregon  New Jersey
a   1.0     2.0       6.0       6.0      6.0     8.0         4.0
c   3.0     6.0       9.0       9.0     10.0    10.0         5.0
e   1.0     6.0      11.0      13.0     14.0    12.0         6.0
g   1.0     6.0       6.0      12.0      5.0     8.0         6.0

Two csv files:

clothes.csv:                           coolness.csv:
item,size                              item,coolpoints
Fortnite shirt,M                       Fortnite shirt,800000
Emoji pants,L                          Emoji pants,0.33333333
Bronies hoodie,XXXL                    Bronies hoodie,911

Write a short Python program to join clothes.csv and coolness.csv into a single DataFrame with three rows and three columns (item, size, and coolpoints), and print out that merged DataFrame.

Answer

import pandas as pd

clothes = pd.read_csv('clothes.csv')
cool = pd.read_csv('coolness.csv')
bigdf = pd.merge(clothes,cool, on= ['item'])
print(bigdf)

Given a dataframe, df, with 5 columns, two of the columns, ‘name’ and ‘shape’, contain null values. Reorganize the dataframe so that all of the NaN values are set to 0 and the data is indexed by the second column, entitled ‘color’.
Answer
```
df[['name','shape']] = df[['name','shape']].fillna(0)
df = df.set_index('color')
```

Unit 8: Hardware and OS

Readings/notes page

Which of the following processor instructions might be required to execute a line of Python code like x = y + 2? Select all that apply.
1. Arithmetic instruction to do the addition with +
2. Arithmetic instruction to do the comparison with =
3. Load instruction to look-up the value of x
4. Load instruction to look-up the value of y
5. Store instruction to save the value of x
6. Store instruction to save the value of y
7. Control flow instruction to perform the assignment
8. Logic instruction to determine the type
Answer

a, d, e
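As a loose analogy only, Python's dis module shows the bytecode that the interpreter runs for this line. Bytecode is for Python's virtual machine, not real processor instructions, and the exact instruction names vary between Python versions, but the same load, arithmetic, and store pattern is visible.

```python
import dis

# Compile the statement without running it (y does not need to exist)
code = compile("x = y + 2", "<example>", "exec")

# Collect the name of each bytecode instruction in order
opnames = [instr.opname for instr in dis.get_instructions(code)]
print(opnames)
```

The list should include a load for y, a binary arithmetic operation, and a store for x.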
Which aspect of the memory hierarchy is considered secondary storage?
1. registers
2. caches
3. flash disk
4. main memory
Answer

c
Why do we have the memory hierarchy with faster and slower parts? Why not just store everything in the fastest type of storage like cache or registers?

Answer

The faster parts of the memory hierarchy like registers and cache are also very expensive in terms of cost, size, and/or power consumption, so their capacity is limited. There are typically only a few bytes of register storage available, for example. The slower parts of the memory hierarchy, such as disk, are comparatively very cheap, so they can have huge capacity like terabytes of data.
What is one advantage and disadvantage of compiled languages?

Answer

An advantage is that compiled languages tend to be more efficient at run-time after compilation, and the compiler can help spot errors at the initial compilation stage before the program is actually run. A disadvantage is that the code must be recompiled after every change before it can be run, which slows down the edit-and-test cycle.

Unit 9: Concurrency

Readings/notes page

Which is true regarding multiprocessing? Select all correct answers:
1. Each process has a copy of global variables
2. Not effective for CPU-intensive tasks because of the GIL
3. Affected by global interpreter lock
4. Works well for IO-bound tasks
Answer

a,d
Which command pauses a process for a given number of seconds?
1. sleep
2. wait
3. kill
4. ps -A
Answer

a. The sleep command pauses a process for a given number of seconds.
Which of the following is NOT true about multithreading?
1. Can effectively use multiple CPU cores
2. Works well with IO bound tasks
3. Affected by the global interpreter lock (GIL), which prevents multiple threads from running Python code at the same time
4. Each thread has shared access to the SAME global variable
Answer

a
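To illustrate option 4 above, here is a minimal sketch of several threads updating the same global variable. The names counter and add_many are hypothetical, and the Lock is included because `counter += 1` is not guaranteed to be a single uninterruptible step.

```python
from threading import Thread, Lock

counter = 0          # one global, visible to every thread
lock = Lock()

def add_many(n):
    global counter
    for _ in range(n):
        # the lock makes each read-modify-write step safe
        with lock:
            counter += 1

threads = [Thread(target=add_many, args=[1000]) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)
```

All four threads add to the same counter, so the final value reflects the work of every thread.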
What is the PID?
1. Process Identifier
2. Process in Disguise
3. Process inside Disk
4. Penguin Identifying as a Dog
Answer

a

A popular theory among Swifties is that the second song of each Taylor Swift album is one of her best songs. Using the list of Taylor Swift albums, each album made up of its own list of songs, write a multithreaded program that picks the second song from each album, adds it to a new list, then prints that list in alphabetical order.

TaylorSwift = ['Tim McGraw', 'Picture to Burn', ...]
Fearless = ['Fearless', 'Fifteen', 'LoveStory'...]
albums = [TaylorSwift, Fearless, SpeakNow,...Midnights] # list of all albums

Answer

from threading import Thread
my_list = []

def get_song(album):
    global my_list
    my_list.append(album[1])

if __name__ == '__main__':
    children = []
    for album in albums:
        child = Thread(target = get_song, args = [album])
        child.start()
        children.append(child)

    for child in children:
        child.join()

    print(sorted(my_list))

Write a multithreaded program to retrieve 15 random car facts from an API.

Answer

import requests
from threading import Thread
requests.packages.urllib3.disable_warnings()

link = ??????
carfacts = []

def carinfo():
    global carfacts
    resp = requests.get(link, verify=False)
    carfact = resp.json()['text']
    carfacts.append(carfact)

if __name__ == '__main__':
    children = []
    for _ in range(15):
        child = Thread(target=carinfo, args=[])
        child.start()
        children.append(child)
    for child in children:
        child.join()
    for carfact in carfacts:
        print(carfact)

Write a multiprocess program that runs a function named function over the range 0 to 1000000, split into chunks. The function takes the arguments start_value and end_value.

Answer

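from multiprocessing import Process

# function(start_value, end_value) is assumed to be defined elsewhere
if __name__ == '__main__':
    children = []
    start_value = 0
    # end values 250000, 500000, 750000, 1000000: four equal chunks
    for x in range(250000, 1000001, 250000):
        child = Process(target=function, args=[start_value, x])
        child.start()
        children.append(child)
        start_value += 250000

    for child in children:
        child.join()

    print("Done")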

You are given four csv files (usna.csv, usma.csv, usafa.csv, uscga.csv) from the different service academies, containing phone usage data. Each csv is formatted as shown below.

------usna.csv-------
name,app,apptype,minutes
Tim,tiktok,entertainment,45
Sam,instagram,entertainment,30
Peter,googledrive,academic,120

Write a Python program that calculates the total hours students from all academies spend on non-academic apps in one day.

Answer

from threading import Thread, Lock
import pandas as pd

totalmin = 0
lock = Lock()

def total_min(fname):
    global totalmin
    df = pd.read_csv(fname)
    nonacademic_df = df[df["apptype"] != "academic"]
    # add up this file's minutes first, then update the shared total once
    subtotal = int(nonacademic_df["minutes"].sum())
    # the lock keeps two threads from updating totalmin at the same time
    with lock:
        totalmin += subtotal

if __name__ == "__main__":
    children = []
    files = ['usna.csv','usma.csv','usafa.csv','uscga.csv']
    for fname in files:
        child = Thread(target=total_min, args = [fname])
        child.start()
        children.append(child)
    for child in children:
        child.join()

    hours = totalmin // 60
    mins = totalmin % 60
    print(hours, "hours", mins, "mins")

Unit 10: Data Ethics

Readings/notes page

It is ethical to take data from sources you were not authorized to use.
1. True
2. False
Answer

b False
Which of the following is NOT a primary tenet of data ethics?
1. Promote transparency
2. Hold oneself and others accountable
3. Avoid using large data sets
4. Stay informed of developments in the fields of data management and data science
Answer

C
It is the year 2075, and in preparation for your big 50 year reunion, the Commandant has given you access to a master file that includes major life updates such as birth of children, marriages, and deaths for the class of 2025. You decide to make a slideshow presentation featuring some of the highlights of the data you’ve found (such as 50% of those married within a month of graduation are now divorced). What are two ethical dilemmas that this situation presents?

Answer

One ethical problem that comes up is lack of transparency for those whose data it is. Some people might not be OK with personal information, like how often they have had children, being displayed publicly. Another issue is the possibility of this data being accidentally released to the general public and creating a bias against Midshipman graduates. Negative biases could hurt application rates or cause alumni to have a harder time getting jobs.
You have a source of data containing personal information, but the data that you need can’t be traced back to any one person (blood type, etc.). Can you use this data? If not, what should you do in order to use it?

Answer

You should try to get in contact with the people who gave their information and ask if you can use the data. If they say yes, feel free to use it.

Unit 11: OOP in Python

Readings/notes page

If x is located somewhere in our code, what would bool(x) return?
1. would return the “type” of x
2. would return either true or false
3. either a 0 or 1 value
4. you would receive an error message
Answer

b, or possibly (d) if the type of x does not allow it to be converted to a true/false.
What is the difference between a class and an instance variable?
1. They are interchangeable
2. A class variable is shared throughout the class and an instance variable applies only to each unique instance of a class.
3. A class variable applies only to each unique instance of a class and an instance variable is shared throughout the class.
4. A class variable is used inside the class and instance variables are used outside the class.
Answer

b
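A short sketch of the difference, using a hypothetical class Mid: the class variable school is stored once on the class, while each object gets its own name.

```python
class Mid:
    # class variable: one value shared by every instance
    school = "USNA"

    def __init__(self, name):
        # instance variable: each object gets its own
        self.name = name

a = Mid("Tim")
b = Mid("Sam")

# changing the class variable on the class is visible through both objects
Mid.school = "Naval Academy"
print(a.school, b.school)
print(a.name, b.name)
```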
Which of the following is NOT an object?
1. 1234
2. [5,6,7]
3. if y:... else:...
4. df.sort_values(by='year')
5. d[7]
Answer

c
Which of the following describes the term “method” in regards to Object Oriented Programming?
1. A variable that is part of a class
2. A function that is contained within a class and the objects that are constructed from the class
3. A constructed instance of a class
4. A new class created when a parent class is extended
Answer

b
Explain what the use of the __init__ method is in a class and why it is important for a class. What occurs to the objects of the class when there is no __init__ method?

Answer

The __init__ method sets up the attributes of a newly created instance. When a class defines __init__, the arguments passed in when the class is called must correspond to __init__’s parameters, except for self. When there is no __init__ method in a class, the class must be called without any arguments, since there is no __init__ to accept them. As a result, new instances of a class without an __init__ method have no instance-specific attributes.
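The behavior described above can be tried out with two tiny classes. The class names WithInit and WithoutInit are hypothetical; the sketch shows that a class with no __init__ can only be called with no arguments.

```python
class WithInit:
    def __init__(self, title):
        self.title = title   # attribute set on each new instance

class WithoutInit:
    pass

movie = WithInit("Top Gun")
print(movie.title)

plain = WithoutInit()        # fine: no arguments expected
try:
    WithoutInit("Top Gun")   # the default __init__ accepts no extra arguments
    accepted = True
except TypeError:
    accepted = False
print(accepted)
```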
For the following example, what would the print statement inform the user about the function “total”? Why can this be beneficial?
```
class Counter:
   x = 0
   def total(self) :
     self.x = self.x + 1
     print("Adding",self.x)

example = Counter()
print ("Type", type(example.total))
```
Answer

The following example would print out: Type <class 'method'>.

Typically you would not want to do this, but it tells us that total is a function in the class that needs to be called like example.total(), rather than a class variable or something else.

Construct a class that takes the title of a movie and the length of the movie in minutes. Then, define a method of the class that prints the name and length of the movie in a nice format on different lines.

Answer

class Movie:
    def __init__(self, name, length):
        self._name = name
        self._length = length

    def show(self):
        print("Movie Title:", self._name)
        print("Movie Length:", self._length,"minutes.")

spiderman = Movie("spiderman",120)
spiderman.show()

class Sport:

    sport = 'ball sports'

    def __init__(self, name):
        self.name = name

What would two.sport return?
What would one.name return?

Answer

‘ball sports’
‘Basketball’

Unit 12: Typing

Readings/notes page

Say we have a function called funky_function that takes the arguments words, a list of strings, and numb, a float OR an int, and returns an int. How would one define the function using type annotations?
1. def funky_function(words[str], numb: float | int) -> int
2. def funky_function(words:list[str], numb: float|int) -> int
3. def funky_function(words:list[str], numb: float|int): int
4. def funky_function(words:list[str], numb: float/int) -> int
Answer

b
Which of the following is NOT a type annotation?
1. bool
2. Process
3. function
4. None
Answer

c
Using type hints, how would you create a variable “Classyear” which holds the integer 2025?
1. Classyear = 2025
2. Classyear: int = 2025
3. Classyear = 2025: int
4. Class year = 2025 -> int
Answer
b. Classyear: int = 2025
Which of the following is the correct type annotation to take in a string (line) and any number (num) and returns None? Make sure that all variables are annotated.
1. def show(line:int|float, num:str) -> None
2. def show(line:str, num:int)
3. def show(int[line], str[num]) -> None
4. def show(line:str, num:int|float) -> None
5. All of the above
Answer

d
Write a function called years_to_grad using type hints, where the only argument is an integer that is the user’s class year, and returns a string that says “Congratulations! You have x years until you graduate!” where x is their class year - 2023.

For example: years_to_grad(2025) would print:
```
Congratulations! You have 2 years until you graduate!
```
Answer
```
def years_to_grad(classyear:int) -> str:
    diff: int = classyear - 2023
    message: str = "Congratulations! You have " + str(diff) + " years until you graduate!"
    print(message)
    return message
```

Given the function below, write type annotations for all variables.

def whichunit(alpha):
    """Tells which unit to write review questions for based on your alpha."""
    if isinstance(alpha, str):
        alnum = int(alpha[-6:])
    else:
        alnum = alpha
    return (alnum // 21) % 14 + 1

Answer

def whichunit(alpha: int | str) -> int:
    """Tells which unit to write review questions for based on your alpha."""
    if isinstance(alpha, str):
        alnum: int = int(alpha[-6:])
    else:
        alnum = alpha
    return (alnum // 21) % 14 + 1

Write a function called measure that takes a string and prints a string saying how many characters long it is in the format: “wow your string is ____ characters long!” use typing.
Answer
```
def measure(word:str)-> None:
    a: int = len(word)
    print(f'wow your string is {a} characters long!')
```

Write a program that has three things in a class called Ball:

roll(x): moves the ball forward by the given amount
kick(): always just moves the ball forward 5
print(): prints the number that the ball is at

if __name__ == '__main__':
    ball = Ball(6)
    ball.roll(3)
    ball.roll(2)
    ball.kick()
    ball.print()

Required Output:

Ball starting at 6
Rolled 3
Rolled 2
Kicked
The ball is at 16

Answer

class Ball:
    def __init__(self, amt:int) -> None:
        self.points=amt
        print("Ball starting at " +str(self.points))

    def roll(self, amt:int) -> None :
        self.points=self.points + amt
        print("Rolled " +str(amt))
    def kick(self) -> None:
        self.points=self.points + 5
        print("Kicked")
    def print(self) -> None:
        print("The ball is at " +str(self.points))

Unit 13: Machine learning with sklearn

Readings/notes page

What is the correct order to create a pipeline?
1. pipe = make_pipeline(Ridge(), StandardScaler()); pipe.predict(X_data, X_label); pipe.fit(X_test)
2. pipe = make_pipeline(Ridge(), StandardScaler()); pipe.fit(X_data, X_label); pipe.predict(X_test)
3. pipe = make_pipeline(StandardScaler(), Ridge()); pipe.predict(X_data, X_label); pipe.fit(X_test)
4. pipe = make_pipeline(StandardScaler(), Ridge()); pipe.fit(X_data, X_label); pipe.predict(X_test)
Answer

d
What is a crucial aspect of the matrices when creating a Ridge regression model?
1. the numbers must all be positive
2. the data inside the columns should be both numeric and non-numeric
3. the data inside the columns should be only numeric
4. all the matrices need to be the same size
Answer

c
What kind of algorithm attempts to find distinct groups of data without reference to any labels?
1. Regression
2. Clustering
3. Supervised Learning
4. Classification
Answer
b. Clustering
Which of the following types of machine learning is defined by its use of labeled datasets to train algorithms to classify data or predict outcomes accurately?
1. Supervised
2. Unsupervised
3. Cluster
4. Reinforcement
Answer

a
Which two models are used in supervised learning?
1. Regression and Dimensionality reduction
2. Dimensionality reduction and Clustering
3. Clustering and Regression
4. Classification and Regression
Answer

D
What are regression and classification algorithms? Describe how they are different, and an example of where you would use each.

Answer

Both are examples of supervised learning, used to predict some label value using previously-known labels, based on a common set of features.

Regression algorithms are used to predict continuous values (e.g., age or height) while classification algorithms are used to assign data to different categories (e.g., gender, classes, groups).
List one type of model from both supervised learning and unsupervised learning.

Answer

Supervised: Regression Unsupervised: Clustering
Why is scaling dataframes important for machine learning?

Answer

It is important to scale dataframes in order to normalize the data, so that every column ends up on a comparable scale. If the dataframes are not normalized, then columns that just naturally have larger numbers (say, a year, which is typically around 2000 or more) will be weighted as more significant than columns that naturally have smaller numbers (say, a GPA, which is typically between 0 and 4). This can make it harder to find a proper linear equation for the data, skewing the results.
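The idea can be sketched by hand with the statistics module: subtract each column's mean and divide by its standard deviation, so that years and GPAs end up on the same scale. The helper standardize and the sample columns are hypothetical; in sklearn this job is normally left to a scaler such as StandardScaler.

```python
from statistics import mean, pstdev

def standardize(values):
    """Rescale values to mean 0 and standard deviation 1 (z-scores)."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]

years = [2001, 2005, 2010, 2020]   # hypothetical column with large numbers
gpas = [2.0, 3.0, 3.5, 4.0]        # hypothetical column with small numbers

scaled_years = standardize(years)
scaled_gpas = standardize(gpas)
print(scaled_years)
print(scaled_gpas)
```

After scaling, both columns are centered on 0 with the same spread, so neither dominates just because of its units.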
What is the difference between supervised and unsupervised learning?

Answer

Supervised learning predicts labels based on labeled training data, while unsupervised learning identifies structure in unlabeled data.
Given a training dataframe X, a testing dataframe Y, and a vector of training labels V, write code that fits a linear regression model and makes a prediction on the testing data.
Answer
```
from sklearn.linear_model import Ridge

model = Ridge()
model.fit(X, V)
predictions = model.predict(Y)
```

Unit 14: Versions and packaging

Readings/notes page

Circle which git command is responsible for downloading changes from remote to local repository.
1. add
2. pull
3. clone
4. push
Answer

b
To update your local repository to the newest commit, execute ______ in your working directory to fetch and merge remote changes.
1. git add
2. git pull
3. git merge
4. git clone
Answer
b. git pull
What is the git command to retrieve an entire remote repository and create a working copy?
1. Fetch
2. Merge
3. Commit
4. Pull
5. Clone
Answer

E
Dr. Timcenko wants to add a file with the name exam.txt to a folder called sec5 along with the message “Go Navy, Beat Army!”. What commands should he type in the command line to commit the changes to the local copy?
Answer
```
git add sec5/exam.txt
git commit -a -m "Go Navy, Beat Army!"
```
I have finished working on a coding assignment and now want to put it in my github. Assume I have already set up the repository and just need to run the git commands at this point. I also want to note that this is the third time I am committing these files. What two commands do I need to run to accomplish this?
Answer
```
git commit -a -m "Third commit"
git push -u origin main
```
Please state the correct usage for using github, how collaborating on these pages is beneficial to the coding community, and something we have used from the github page created by the programming community.

Answer

Github is a website created for community collaboration, where people of all skill levels can create, use, and collaborate on code and packages. It is highly beneficial for coders: as data scientists, we need tools such as pandas to parse through data in a more effective and efficient way.
