5 Unusual Ways Bias Can Sneak into Your Models

    These “Generally Good Practices” have their downsides.


    We should be using more AI solutions by now. But there’s this bias issue to consider!

    We’ve seen AI models perform differently for underrepresented groups, and these issues have been debated heavily in recent years. When we look at why bias arises, we find that it can enter a model in far more ways than a trainer’s deliberate choices.

    Yet, when other people’s lives and jobs are at stake, a creator’s good intentions are no excuse. Customer backlash, public opinion, and bad press can harm your reputation, and it may be tough to recover.

    Thus it’s critical to understand AI bias. You can’t manage something you don’t understand.

    Here are five situations where bias can sneak into your models.

    Reusability is a priority for developers and organizations. The benefits are even higher for machine learning models.

    Training time and resource consumption are prime concerns for companies adopting AI. Hence repurposing old models, or reusing models built for a different purpose, often makes sense.

    Of course, cloud computing and platform-as-a-service (PaaS) solutions have revolutionized data science. Yet the models we train have also grown much larger in recent years.

    The language model GPT-3 has 175 billion parameters, and training it from scratch costs roughly $4.6 million. That’s not something everyone can afford.

    But you can use the OpenAI API for as little as $0.0004 per 1,000 tokens. That’s far more affordable. Better yet, you can fine-tune these models for your specific use cases.

    However, you don’t know the original dataset these models were trained on, and reusing them blindly could introduce bias into your applications.

    Direct reuse of existing models may introduce bias

    Besides OpenAI’s GPT-3, there are plenty of other places where you can find pre-trained models. The Keras documentation lists a range of them. You can grab one and use it on a similar use case.

    For example, you can grab VGG16 and start classifying images. But it may label a person as an animal if the model hasn’t seen enough examples of people with certain characteristics (for example, skin color).
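    Grabbing a pre-trained model really is that easy, which is exactly why the bias risk is easy to overlook. A minimal sketch with Keras is below; it uses `weights=None` just to show the wiring without the large ImageNet download, whereas in practice you would pass `weights="imagenet"` to get the trained classifier.

```python
# A minimal sketch of reusing a pre-trained architecture from Keras.
# weights=None builds the network without downloading the ImageNet
# weights; in a real application you would use weights="imagenet".
import numpy as np
from tensorflow.keras.applications import VGG16
from tensorflow.keras.applications.vgg16 import preprocess_input

model = VGG16(weights=None)  # 1000-class ImageNet classifier

image = np.random.rand(1, 224, 224, 3) * 255.0  # stand-in for a real photo
preds = model.predict(preprocess_input(image))

print(preds.shape)  # one probability per ImageNet class
```

    Nothing in this snippet tells you what data the original weights were trained on — that opacity is where the bias sneaks in.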

    Even your own model may create bias when used in a different context. For instance, a chatbot trained on American inputs may perform worse for Australian users.

    Thus it’s better not to reuse a model in a different context unless you’re certain about the consequences. When reuse is necessary, add a human in the loop if you can: have either all predictions, or just the low-confidence ones, verified by a person before they make an impact.

    Of course, not all applications can have a human in the loop. In that case, updating the model using transfer learning or ensemble methods is advisable.

    Use transfer learning to update the model and give it contextual information.

    Transfer learning is a widespread practice among ML engineers to reuse existing models. You can use this technique on deep neural networks (DNN).

    In essence, if you have a model that identifies dogs in an image, transfer learning lets you retrain it to recognize cats. After all, cats and dogs share a lot of features — four legs, ears, tails, and so on.

    You can either add a new layer to your DNN or unfreeze the last layer. Then train the model with your specific domain examples.
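    The freeze-and-retrain idea can be sketched in a few lines of Keras. The tiny base network below is a stand-in for whatever trained model you are reusing; only the newly attached head learns from your domain examples.

```python
# A minimal sketch of the two options above: freeze the existing
# layers and attach a fresh output layer for the new domain.
import tensorflow as tf

# Stand-in for a previously trained model.
base = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(8,)),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(16, activation="relu"),
])
base.trainable = False  # keep the learned features fixed

# New head for the target task (3 classes, assumed for illustration).
model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dense(3, activation="softmax"),
])
_ = model(tf.zeros((1, 8)))  # build the head
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# Only the new head's weights will be updated during training.
trainable = int(sum(tf.size(w).numpy() for w in model.trainable_weights))
print(trainable)  # 16 * 3 weights + 3 biases = 51
```

    Unfreezing the last base layer instead (`base.layers[-1].trainable = True`) lets the model adapt its highest-level features to the new context as well.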

    Transfer learning is a cost-effective, time-saving technique that yields better outcomes. And using it on your models before you apply them to a new context will reduce the chances of bias.

    You can read more about it in my previous post.

    Use ensembling to remove unwanted bias in your models.

    An ensemble is simply a group of models. If you already have a model that predicts cats, you can attach an additional model to take on a new responsibility — finding dogs.

    If you’re reusing a model trained for a similar purpose, you can pair it with another model trained on your new data to minimize bias.
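    A simple form of this is averaging the two models’ predicted probabilities. The two “models” below are stand-in functions for illustration; the blend weight is an assumption you would tune.

```python
# A minimal sketch of ensembling by averaging: a reused (possibly
# biased) model's probabilities are blended with those of a second
# model trained on your own, more representative data.
import numpy as np

def reused_model(x):   # stand-in: pre-trained model, may carry bias
    return np.array([0.9, 0.1])

def local_model(x):    # stand-in: model fitted on your own data
    return np.array([0.4, 0.6])

def ensemble(x, weight=0.5):
    """Blend predictions; `weight` controls trust in the reused model."""
    return weight * reused_model(x) + (1 - weight) * local_model(x)

probs = ensemble(x=None)
print(probs)  # [0.65 0.35] -- pulled toward the local model's view
```

    The locally trained member acts as a counterweight: where the reused model’s bias shows, your own data gets an equal vote.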

    You should be more careful when your model learns from your users.

    Microsoft’s Twitter chatbot, Tay, is an excellent case study. Tay learned from Twitter conversations with other users. But the bot was shut down after only a few hours because it picked up offensive language from those users and started speaking like them.

    If your models are left to learn from users, they’re exposed to the risk of bias. Avoiding bias in active learning is still a research problem, and we don’t have solid workarounds yet. So be especially careful when opting for this approach.

    If learning happens in batches, you have some control over it. Before you feed new data to your model, you could check them for any known biases. Also, you could put in more checks before you publish a new version of your model.
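    One such check is vetting each incoming batch before it reaches the model. The sketch below rejects a batch when any known group falls below a minimum share; the group field and threshold are assumptions to adapt to your data.

```python
# A minimal sketch of vetting an incoming batch before feeding it
# to a model that learns incrementally.
from collections import Counter

def batch_is_balanced(records, group_key="group", min_share=0.2):
    """Reject a batch if any known group falls below a minimum share."""
    counts = Counter(r[group_key] for r in records)
    total = sum(counts.values())
    return all(c / total >= min_share for c in counts.values())

batch = [{"group": "A"}] * 9 + [{"group": "B"}] * 1
print(batch_is_balanced(batch))  # False: group B is only 10% of the batch
```

    A failed check doesn’t have to block training — it can simply flag the batch for review before the model sees it.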

    It’s a good practice to have a model registry. Model registries help you experiment with several models for the same problem. When you find issues with a model in your production environment, you can easily switch back to an older one and minimize the impact.
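    At its core, a registry is just a versioned history with a rollback switch. A bare-bones sketch, with class and method names made up for illustration (real registries such as MLflow’s offer far more):

```python
# A minimal sketch of a model registry: versions are kept so production
# can roll back quickly when a newer model turns out to be biased.
class ModelRegistry:
    def __init__(self):
        self._versions = []  # ordered history of (name, model) pairs

    def register(self, name, model):
        self._versions.append((name, model))

    def current(self):
        return self._versions[-1]

    def rollback(self):
        """Drop the latest version and fall back to the previous one."""
        self._versions.pop()
        return self.current()

registry = ModelRegistry()
registry.register("v1", "model-object-v1")
registry.register("v2", "model-object-v2")  # later found to be biased
name, _ = registry.rollback()
print(name)  # v1
```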

    But in continuously learning systems, such as online reinforcement learning algorithms, you leave that control to the machine.

    Sometimes synthetic, artificially created data is used to train machine learning models. Although it may sound counterintuitive, synthetic data has many uses.

    Engineers use synthetic data to train ML models when collecting real data is difficult or costly. Synthetic data is also beneficial when anonymity matters.

    Usually, synthetic data generation models the underlying probability distribution of variables and draws new samples from it.
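    In its simplest form, that looks like the sketch below: fit a distribution (here, a single Gaussian) to real data and sample fresh points from it. Real generators are far more sophisticated, but the bias risk is the same — samples reflect the fitted distribution, not the individual records.

```python
# A minimal sketch of synthetic data generation: model the underlying
# distribution of a variable and draw new samples from it.
import numpy as np

rng = np.random.default_rng(seed=0)
real = rng.normal(loc=50_000, scale=12_000, size=1_000)  # e.g. incomes

mu, sigma = real.mean(), real.std()            # model the distribution...
synthetic = rng.normal(mu, sigma, size=1_000)  # ...and draw fresh samples

# The synthetic set matches the bulk statistics, not the original rows --
# any skew in who was sampled in the first place is silently preserved.
print(abs(synthetic.mean() - mu) / mu < 0.05)  # True: means closely agree
```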

    Because synthetic data generalizes the distribution, it loses the original context of the dataset. Hence the chances of spotting bias before its consequences appear are low.

    This is one reason why image generation algorithms have been debated heavily in recent years. Image augmentation is a widespread practice for training neural networks, and it’s all but necessary to avoid overfitting.

    Besides hiding the details, synthetic data augmentation can also amplify the differences. With more artificial data, you now have even more data to represent your dominant classes.

    Like synthetic data, dimensionality reduction techniques such as PCA also bury the context and create abstract variables. It doesn’t mean we should avoid such practices. But be aware of the risk.

    They make it hard to understand the input variables and detect bias in the early stages. It would be challenging to trace back the sources of bias in abstract variables.

    Imagine you build a model to predict credit scores. Your input dataset has an income variable. After PCA, you have PC1, PC2, and so on — not the original labels.

    Maybe you’d have sampled a population where low-income people are most likely to default. But you’d never know it’s a sampling issue because the variables are abstract.
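    One partial remedy is to inspect the PCA loadings rather than treating the components as black boxes. A pure-NumPy sketch (PCA via SVD, with feature names made up for illustration) shows how a component can be traced back to the variable that dominates it:

```python
# A minimal sketch of tracing an abstract component back to the
# original variables by inspecting PCA loadings.
import numpy as np

rng = np.random.default_rng(seed=1)
features = ["income", "age", "balance"]
X = rng.normal(size=(200, 3))
X[:, 0] *= 10            # give "income" much larger variance

Xc = X - X.mean(axis=0)  # center before PCA
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)

loadings = Vt[0]         # PC1's weights over the original features
dominant = features[int(np.argmax(np.abs(loadings)))]
print(dominant)  # income -- so PC1 is largely an income axis
```

    Knowing that PC1 is essentially an income axis is what lets you ask whether a sampling issue, like the one above, is driving the model.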

    Bias is not always a product of biased datasets. It also depends on how your model selects features from your dataset.

    A deep neural network’s (DNN’s) greatest promise is automatic feature selection. It’s highly beneficial, yet it comes with a drawback: you have less control over which features the model selects.

    When driving from city A to city B, you may not care much about the path as long as you get to B safely and on time. But in machine learning, it matters!

    Explainable AI (XAI) has received much attention in recent years, as concerns were raised over the predictions of AI models for underrepresented groups.

    That being said, we cannot ignore the benefits of DNNs and avoid them altogether. A reasonable benchmark is human-level performance: verify that a human would make the same call. When prediction confidence is low, filter those cases out and handle them manually.
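    That filtering step is straightforward to implement: accept confident predictions automatically and route the rest to a human. The threshold below is an assumption to tune per application.

```python
# A minimal sketch of routing low-confidence predictions to a human
# reviewer while accepting the confident ones automatically.
import numpy as np

probs = np.array([
    [0.98, 0.02],   # confident
    [0.55, 0.45],   # uncertain -> send to a human
    [0.10, 0.90],   # confident
])

threshold = 0.8
confidence = probs.max(axis=1)
needs_review = np.where(confidence < threshold)[0]

print(needs_review.tolist())  # [1]: only the second case is escalated
```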

    Minimizing the impact of AI bias is a challenge for the data science community.

    At this point, the world has clearly understood the benefits of AI and wants to move forward. Yet, we’ve already seen several issues with a machine’s predictions, and we know there’s more to learn.

    Algorithms are not biased by nature. But the examples they learn from could alter their behavior. In this post, we’ve discussed some indirect ways bias could enter your models.

