Introduction
Data science is an exciting field, but it’s also full of ethical pitfalls. If you want to avoid getting drawn into a situation that could damage your company or client, it’s important to understand the basics of data ethics and how they can affect your work.
Learning from the past
Learning from the past is important because it can help us avoid repeating ethical mistakes that others have already made. The following examples are a sampling of some of the most widely publicized data science ethics issues over the years:
- Google’s Street View privacy concerns (2011)
In 2011, Google came under fire for collecting personal data from unsecured wireless networks while taking pictures for its Street View service. The company said it was an accident and that it had not intended to collect any information from these Wi-Fi networks. However, computer security experts discovered code in the software used by Google’s Street View cars indicating that it was designed to do just that.
Google said it had collected data such as emails and passwords, but it wasn’t sure how much. The fact that Google didn’t know what data had been collected raised concerns about its commitment to user privacy and security.
- Facebook’s Cambridge Analytica scandal (2018)
In March 2018, Facebook revealed that it had been the victim of a massive data breach involving tens of millions of users. The company said that Cambridge Analytica, a political consultancy firm with ties to the Trump presidential campaign, had gained access to information from up to 87 million users through a third-party quiz app.
Users were unaware that the app was collecting data from them, and Facebook did not do enough to prevent it. The incident led to intense scrutiny of Facebook’s use of user data and its responsibility for protecting users’ privacy. In April 2018, Facebook CEO Mark Zuckerberg testified before the U.S. Congress on the matter. He apologized for his company’s mistakes in handling user data and outlined steps he would take to ensure that similar breaches didn’t happen again.
- IBM’s Photo-scraping scandal (2019)
IBM faced a photo-scraping scandal in 2019 that centered on about one million pictures of human faces the company scraped from Flickr, the online photo-hosting site.
The scandal brought to light how people’s data can be used without their knowledge: the individuals in those photos never consented to having their likenesses used for profit.
- Predictive Policing Software
It would be nice to be able to predict crime, or at least to estimate where law enforcement needs to be to prevent it, but many predictions made by this kind of software don’t come true. This happens because these tools are fed bad data.
“Data collected by police is notoriously bad, easily manipulated, glaringly incomplete, and too often undermined by racial bias.” – Ezekiel Edwards, ACLU
This polluted data makes the software produce equally contaminated results: it ends up predicting policing rather than predicting crime, and that prediction becomes a self-fulfilling prophecy, as the toy simulation below illustrates.
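To make that feedback loop concrete, here is a minimal, purely hypothetical sketch (not any real predictive-policing product): two neighborhoods with identical true crime rates, where one starts out with more recorded incidents simply because it was historically policed more heavily. Patrols are allocated in proportion to recorded incidents, and more patrols mean more incidents get observed and recorded, so the biased record keeps justifying itself. All names and numbers here are made up for illustration.

```python
import random

random.seed(0)

# Two neighborhoods with the SAME true crime rate, but "A" starts with more
# recorded incidents because it was historically policed more heavily.
true_crime_rate = {"A": 0.10, "B": 0.10}   # identical underlying rates
recorded_incidents = {"A": 120, "B": 40}   # biased historical record
patrols_per_round = 10

for _ in range(20):
    # "Predictive" step: allocate patrols in proportion to recorded incidents.
    total = sum(recorded_incidents.values())
    patrols = {
        n: round(patrols_per_round * recorded_incidents[n] / total)
        for n in recorded_incidents
    }
    # Feedback step: each patrol gets 100 chances to observe a crime, so more
    # patrols in a neighborhood means more crimes get *recorded* there, even
    # though the true rate is identical in both neighborhoods.
    for n, p in patrols.items():
        observed = sum(random.random() < true_crime_rate[n] for _ in range(p * 100))
        recorded_incidents[n] += observed

# A's recorded count keeps pulling ahead of B's, mirroring the initial bias.
print(recorded_incidents)
```

After 20 rounds, the recorded numbers still say neighborhood A is several times more “criminal” than B, even though the simulated true rates are identical; the model’s output keeps justifying the very patrol allocation that produced it, and the data never gets a chance to correct the initial bias.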
Conclusion
These examples show that there are many ways to make mistakes when it comes to handling user data, but they also serve as excellent educational resources for anyone interested in learning more about how these kinds of problems can be avoided in future projects.
There are lots of resources for learning more about this topic and how to address these issues when working with machine learning models and user data. I’ll list some, but remember this is a broad and deep topic, and my research may not cover everything there is to know.