Even if you’re not working in AI, chances are you’ve heard about AI bias. It’s a topic that’s widely discussed in the media, in politics, and of course in AI research. But what is AI bias really, and why is it being talked about now?
The first question that usually comes up in connection with AI bias: Is AI racist? No. Well,… hm… no… unless?
The answer isn’t really straightforward. AI is not some magical entity that is racist at its core. What most people call AI is a Machine Learning algorithm, trained on one or more specific datasets. Those datasets, often curated by humans, may carry racial biases, which can then propagate into the algorithm – and that can lead to decisions that are considered racist.
The issue with AI bias is a lot more extensive than just race, though: any feature that is fed into such an algorithm may suffer from certain types of bias. Let’s get into some examples.
Say you receive a private dataset from a small credit card company, annotated with cases that were found to be fraudulent. Amazing, right? Real-life data with a lot of features, including each person’s location, credit score, and the times and dates of transactions. Let’s get to work: feed those features into a neural network, get an accuracy of 99% – we’re done. Right?
Unless you know exactly how the fraudulent cases were annotated, your model most likely inherits confirmation bias, among others. Your Machine Learning algorithm won’t find fraudulent cases that human annotators never found, and it will happily pick up whatever incidental patterns sit in the dataset. Maybe the previous frauds only happened at night? Maybe previously confirmed frauds were only made on fresh, new credit cards? Such an algorithm may be easy to dupe.
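Here’s a minimal sketch of that failure mode, on synthetic data with hypothetical feature names: if the annotators only ever flagged night-time transactions, the model scores well by keying on the hour alone.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 10_000
hour = rng.integers(0, 24, n)       # hour of the transaction
amount = rng.exponential(50, n)     # transaction amount

# Biased labels: annotators only ever found fraud that happened at night.
fraud = (hour < 6) & (rng.random(n) < 0.3)

X = np.column_stack([hour, amount])
clf = RandomForestClassifier(random_state=0).fit(X, fraud)
print(dict(zip(["hour", "amount"], clf.feature_importances_)))
# "hour" dominates: the model learned the annotation bias, not fraud.
# Daytime fraud on a fresh card would sail right through.
```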
Let’s get to real life. Quite recently, a research paper was published {source needed, please email me if you find it} about skin cancer detection. The researchers assembled a private dataset of mole images, labeled with whether there was a chance of skin cancer. They reached a really good accuracy with their approach; otherwise the paper seemed quite unspectacular.
The issue? In their private dataset, all positive images (i.e. the ones with skin cancer) had been marked up with a pen, and the negative images had not. The algorithm simply learned to look for pen marks on the skin, not at the mole itself. Whoops?
So why does it matter? The results of that paper were acknowledged, and nothing happened. Most algorithms end up as a number in a spreadsheet and then get buried.
But the truth is, not all of them end there. Machine Learning has been leaking into the production pipelines of many different companies – not only Google and AI startups – for quite a while now. Fraud detection, image recognition, conversational interfaces – they all run on Machine Learning-backed infrastructures now. So why is this an issue now and not before?
Most existing Machine Learning infrastructures use the outputs of their systems cautiously, as a second or third signal. But with Machine Learning getting more accessible – through APIs and new ways to build algorithms – bias can now be baked in by people working much closer to production, who may ship a model’s output straight into their app without any filter.
A small (fictional) startup doing credit scoring may run solely on a biased ML algorithm.
There are enough solutions out there, but many of them take a lot of time and effort. Machine Learning nowadays is a numbers game: the metrics tell us which algorithm fits the source data best, and that one gets shipped. That needs to change in a production-oriented ML team.
Choosing the right numbers
If you are maximizing for numbers, you’d better choose the right ones. Not every metric is suitable for the job, so choose wisely. This takes some experience and a lot of reading – which is a problem, because nowadays even a layman can easily deploy a model.
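The classic trap, as a minimal sketch: on a rare-event problem like fraud, a model that never flags anything still scores 99% accuracy.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

y_true = np.array([0] * 990 + [1] * 10)   # 1% fraud rate
y_pred = np.zeros(1000, dtype=int)        # "model" that never predicts fraud

print(accuracy_score(y_true, y_pred))     # 0.99 -- looks great
print(recall_score(y_true, y_pred))       # 0.0  -- catches zero fraud
```

Precision, recall, or ROC-AUC on the scores will tell you what accuracy hides.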
Feature Engineering
For most Machine Learning algorithms, feature engineering is absolutely crucial. In the Deep Learning era (where neural networks are in use), people tend not to do it anymore, since the magic neural network can usually figure things out on its own.
If you’re optimizing for a bias-free algorithm, you need to figure out which features are adding value, and which are adding bias. Look up Feature Dependence Graphs if you’d like an easy way to do so. You might realize that you don’t need age and gender in your HR software ;)
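One low-effort way to see which features your model actually leans on is permutation importance. A minimal sketch with hypothetical HR data (the feature names and the biased labels are made up for illustration): if "gender" moves the score, the model is using it – drop it or audit it.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
n = 5_000
X = np.column_stack([
    rng.integers(20, 65, n),   # age
    rng.integers(0, 2, n),     # gender (encoded)
    rng.normal(5, 2, n),       # years_experience
])
# Assume historical hiring decisions were themselves biased by gender.
y = (X[:, 2] + 2 * X[:, 1] + rng.normal(0, 1, n)) > 6

model = GradientBoostingClassifier(random_state=0).fit(X, y)
result = permutation_importance(model, X, y, n_repeats=10, random_state=0)
for name, imp in zip(["age", "gender", "years_experience"], result.importances_mean):
    print(f"{name}: {imp:.3f}")
# A large drop when "gender" is shuffled means the model depends on it.
```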
Representation
In your dataset, are all possible cases represented? Often this is not obvious and extremely difficult to solve for; in other cases, you just need to actually check your data. Have you ever tried MS-Celeb-1M, one of the biggest and most popular face datasets out there?
Turns out, Hollywood’s biases are deeply embedded in it: you’ll have a hard time finding Black or Asian representation.
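The cheapest sanity check is to just count who is in your data. A minimal sketch, assuming a hypothetical metadata file and column names:

```python
import pandas as pd

# Hypothetical metadata file with one row per image in the dataset.
df = pd.read_csv("face_dataset_metadata.csv")

print(df["ethnicity"].value_counts(normalize=True))
print(df["gender"].value_counts(normalize=True))
# If one group makes up 2% of the data, don't expect the model to work on it.
```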
In the same way, in your credit card fraud dataset, you should be sure that all the positive cases have actually been found and that all negative examples really are negative. There might be some bias embedded there, too.
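You can’t recover fraud the annotators never saw, but you can flag suspicious labels. One sketch of how to do it (this is my suggestion, not a standard recipe): get out-of-fold predicted probabilities and surface the examples where the model strongly disagrees with the human label.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict

def suspicious_labels(X, y, threshold=0.9):
    """Return indices where the model strongly disagrees with the label."""
    proba = cross_val_predict(
        RandomForestClassifier(random_state=0), X, y,
        cv=5, method="predict_proba",
    )[:, 1]
    # Labeled negative but the model is very confident it's fraud,
    # or labeled positive but the model is very confident it isn't.
    return np.where(((y == 0) & (proba > threshold)) |
                    ((y == 1) & (proba < 1 - threshold)))[0]
```

Those indices are candidates for a second round of human review, not automatic relabeling.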
So much more
This is really a blog post (okay, a blog series) in itself. But if you have some other low-effort/high-impact ways to test for this, or are working on something interesting, please email me at [email protected].