

This is the archived website of SD 212 from the Spring 2023 semester. Feel free to browse around; you may also find more recent offerings at my teaching page.

Unit 10: Data Ethics

This short unit gives us a chance to think about some of the real ethical questions and dilemmas faced by working data scientists.

While there are numerous proposed ways of categorizing ethical principles for data scientists, what makes real-world situations challenging is that several of these principles will inherently conflict with one another.

Our goal here is not to establish a certain procedure or “right answer” when it comes to challenging situations, but rather to gain practice in recognizing the kinds of questions that should be asked and considered when deciding whether to engage in a particular use of data.

1 Resources

  • Data Ethics Framework developed by the U.S. federal government and initially released in 2019. Among other things, it provides 7 “Tenets” of ethical data use.

  • Academic Data Science Alliance “Lenses” provide four perspectives to use when considering the ethical implications of data collection, analysis, and dissemination.

  • The big-name global management consulting firm McKinsey & Company published an article on Data Ethics primarily aimed at corporate leaders considering how to think about the risks and benefits of data use within their companies.

  • The DoD Data Strategy from September 2020 gives a high-level overview of the goals and needs of data science from a U.S. military standpoint, as well as stating 8 “guiding principles” for responsible data use within the DoD.

2 Case studies

In class we will read through and discuss one or more of these case studies. Again, the purpose is not to reach a certain conclusion about whether the actors in each case are “ethical” or “unethical”, but to ask ourselves what the benefits and drawbacks of data use are in each scenario, and how we would act as data scientists in these situations.

  1. The Ethics of using Hacked Data

    Read from the beginning until the “Discussion” section at the top of page 5.

    This real case provided by the nonprofit organization Data & Society looks at how to think about the source of data and when it can be truly considered “public”.

  2. The Maryland Commuter Survey

    Read the “Summary of Methodology” section from the bottom of page 4 to the top of page 10.

    This is not a “case study” but rather an actual research project conducted to learn about how people get to and from work in the state. The methodology section describes how the researchers acquired and processed the survey results, and allows us to think about what it means to get “good” survey results and what ethical questions arise in conducting such a survey and processing its data.

  3. Optimizing Schools

    Read from page 2 until the bottom of page 6, skipping the “discussion question” inserts.

    This is a fictional case study, based on real experiences, developed by Princeton’s Dialogues on AI and Ethics. It considers a (fictional) highly beneficial use of data analytics to help high school students succeed, but where the way the data was collected and shared may cause a public outcry.

  4. Dynamic Sound Identification

    Read from page 2 until the bottom of page 5, skipping the “discussion question” inserts.

    Another fictional case study, based on real experiences, developed by Princeton’s Dialogues on AI and Ethics. This one considers how a generally useful AI app can have drawbacks or cause harm to minority populations, and how to consider unintended uses of even seemingly simple data-centric apps.