SD 212 Spring 2024 / Homeworks


hw33: Reading about sklearn

  • Due before the beginning of class on Monday, April 22

In this unit, we have been learning the terminology of machine learning and seeing how to apply it with Python’s sklearn library.

While this is a challenging subject to dive into for the first time, it is also well documented because it is so popular. Today’s reading will reinforce what we saw in class about unsupervised learning and give a preview of how supervised learning works as well.

Reading

Python Data Science Handbook Chapter 38: Introducing Scikit-Learn

Read everything up to the section on Applications to handwritten digit recognition. Some of it should be familiar review from class, and some of it will be new.

Questions

Download the file hw33.md to fill in and submit for this homework
  1. How much did you read carefully?

    1. The entire chapter
    2. The beginning up to the last section on digit recognition
    3. Some of it
    4. None of it too carefully

    (Answer with just the letter of your choice.)

  2. What does each row of the input array to a machine learning model represent?

    1. The algorithm parameters
    2. The algorithm hyperparameters
    3. A single observation or sample
    4. A single feature or attribute
    5. The labels for each observation
  3. What does each column represent?

    1. The algorithm parameters
    2. The algorithm hyperparameters
    3. A single observation or sample
    4. A single feature or attribute
    5. The labels for each observation
  4. Suppose I have three variables:

    • A 2D array tested containing information about a bunch of drugs that have been tested in the lab,
    • A 1D array results indicating (1 or 0) whether each tested drug was effective, and
    • Another 2D array untested containing the same information about a few drugs that haven’t been tried out yet.

    We want to use machine learning to predict whether each untested drug will be effective.

    What kind of machine learning problem is this?

    (Select all letters that apply.)

    1. Supervised learning
    2. Unsupervised learning
    3. Classification
    4. Regression
    5. Clustering
  5. Using the same setup as the previous problem, complete the code below so that it predicts whether each untested drug will be effective. There are three missing steps; in this problem and the next two, select which line of code belongs in each step.

    Here is the incomplete code:

    from sklearn.naive_bayes import GaussianNB
    tested = ... # big 2D array of numbers
    results = ... # 1D array of 1/0
    untested = ... # smaller 2D array of numbers
    
    # QUESTION 4 step
    # QUESTION 5 step
    # QUESTION 6 step
    
    print(predictions)

    What line of code should be filled in for QUESTION 4 STEP?

    1. model = GaussianNB()
    2. fit = naive_bayes()
    3. model = np.linspace(-1, 11)
    4. model = PCA(n_components=2)
  6. What line of code should be filled in for QUESTION 5 STEP?

    1. model.fit(tested)
    2. fit.model(results)
    3. model.fit(tested, untested)
    4. model.fit(tested, results)
    5. model.fit(untested, tested)
  7. What line of code should be filled in for QUESTION 6 STEP?

    1. predictions = model.labels_
    2. predictions = fit.predict(results)
    3. predictions = fit.predict(untested, results)
    4. predictions = model.predict(untested)
    5. predictions = model.predict(results)
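
For orientation, the three arrays described in the drug-testing problems might look something like the sketch below. Everything here is an assumption for illustration only: the sizes are arbitrary, the values are random placeholders rather than real lab data, and `shapes_are_consistent` is a hypothetical helper, not part of sklearn. It only checks that the arrays line up the way the problem statement describes, and deliberately leaves the modeling steps out.

```python
import numpy as np

# Placeholder data only: random numbers standing in for real lab measurements.
rng = np.random.default_rng(0)

# Arbitrary sizes chosen for illustration.
n_tested, n_untested, n_measurements = 8, 3, 4

tested = rng.normal(size=(n_tested, n_measurements))      # 2D: info on tested drugs
results = rng.integers(0, 2, size=n_tested)               # 1D: 1 or 0 per tested drug
untested = rng.normal(size=(n_untested, n_measurements))  # 2D: same info, untried drugs


def shapes_are_consistent(tested, results, untested):
    """Hypothetical helper: check the arrays line up as the problem describes."""
    return (
        tested.ndim == 2
        and untested.ndim == 2
        and results.ndim == 1
        and tested.shape[0] == results.shape[0]   # one result per tested drug
        and tested.shape[1] == untested.shape[1]  # same information for both groups
    )


print(tested.shape, results.shape, untested.shape)
print(shapes_are_consistent(tested, results, untested))
```

Running a check like this before fitting anything can catch a mismatched or transposed array early.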

Submit command

To submit files for this homework, run one of these commands:

submit -c=sd212 -p=hw33 hw33.md 
club -csd212 -phw33 hw33.md