
Audit Your Data — The Right Way

So you want to use machine learning to build better products and improve your users’ experience, but knowing where to start can be difficult.

In our Principles of Applied AI whitepaper, we explain the importance of properly identifying opportunities for integrating AI into your product. To do that, you need to review the business processes that you enable for your customer and prioritize improving the ones that will deliver the most benefit for you and your customers (see page 7 of our whitepaper for more on this!).

After you shortlist a few opportunities with the highest impact, you need to start thinking about the data you need to build those AI-enabled systems. We asked our Head of Applied Research, Parinaz Sobhani, for her top tips on data audits and making sure you have the high-quality data needed to avoid a “garbage-in, garbage-out” scenario.

Start by ranking opportunities based on their value rather than data availability

After you’ve identified and ranked opportunities based on customer value, you then need to think about the level of effort required to bring those opportunities to life.

“It starts with the opportunity identification, prioritization and product strategy rather than the data audit,” said Parinaz. Following that, “The second step might be, ‘what is the level of effort that is required? And am I ready to start executing on it or not?’ and then the data audit comes into the picture.”

Quality vs. quantity in data audits

Let’s take an example of a company that wants to build an engine that scores customer leads based on the likelihood they will become paying customers. Historical data will be the main ingredient to train this engine. To conduct a data audit, this company needs to review the quality and quantity of historical sales data. 

Quality issues might arise from missing information or from inconsistency in how data is collected.

“You might realize that, to score the leads, you have to use your existing CRM system data, and it may be partially populated through automation, while humans have populated other values in that CRM system,” said Parinaz. “We know humans might capture information in very inconsistent ways.”
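To make that concrete, here is a minimal sketch of what a quality audit can surface, using pandas on a hypothetical CRM export. The column names and values below are illustrative, not from a real system:

```python
import pandas as pd

# Hypothetical CRM export; column names and values are illustrative only.
leads = pd.DataFrame({
    "company_size": [250, None, 40, 40, None],
    "industry": ["Fintech", "fin-tech", "Healthcare", "FinTech", None],
    "converted": [True, False, False, True, False],
})

# How much of each column is missing?
print(leads.isna().mean())

# How inconsistent are the human-entered categories?
# "Fintech", "fin-tech", and "FinTech" are all the same industry.
print(leads["industry"].nunique(), "raw spellings")
print(
    leads["industry"].str.lower().str.replace("-", "", regex=False).nunique(),
    "after normalizing",
)
```

Even a quick pass like this tells you whether the data is usable as-is or needs cleanup before any modelling starts.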

In the case of quantity, you simply might not have enough data to train models; supervised machine learning models need both input and outcome data.

In this example, input data can be firmographic information in B2B sales, and the outcome is whether a sales lead is converted into a customer or not. If the company has just started to use a CRM system to log the data, it might not have enough historical data to train such systems. 
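As a rough sketch of what “input and outcome data” means in practice, the following trains a toy lead scorer with scikit-learn. The firmographic features, values, and labels are all made up for illustration:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Hypothetical training data: firmographic inputs plus a conversion outcome.
data = pd.DataFrame({
    "company_size":        [250, 40, 1200, 15, 300, 80, 600, 25],
    "annual_revenue_musd": [30,  2,  150,  1,  45,  5,  70,  3],
    "converted":           [1,   0,  1,    0,  1,   0,  1,   0],
})

X = data[["company_size", "annual_revenue_musd"]]  # input data
y = data["converted"]                               # outcome data

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# A simple baseline scorer; a real system needs far more history than this.
model = LogisticRegression().fit(X_train, y_train)
lead_scores = model.predict_proba(X_test)[:, 1]  # probability each lead converts
print(roc_auc_score(y_test, lead_scores))
```

The point of the sketch is the shape of the problem: no matter the model, you need enough logged inputs and enough logged outcomes before training is even possible.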

If your data isn’t clean, start automating

In this example of building a customer lead engine, information that is automatically logged is more likely to be better quality because it is more likely to be consistent. So, if your data isn’t high-quality, it’s time to see how to automate parts of your data collection. 

“We can clean data using more advanced NLP techniques, or you can get business analysts to go over all your historical data and try to clean them up,” said Parinaz. Another option is to have better integration with other parts of the business so you can collect data automatically, rather than having people enter every piece of information.
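As a simplified sketch of the rule-based end of that spectrum (the “more advanced NLP techniques” Parinaz mentions would go well beyond this), hypothetical human-entered values can be normalized like so:

```python
import pandas as pd

# Hypothetical inconsistent, human-entered industry labels.
raw = pd.Series(["Fin-Tech", "fintech ", "FINTECH", "Health Care", "healthcare"])

# A small alias map; real cleanups often rely on fuzzy matching or NLP models.
aliases = {"fintech": "Fintech", "healthcare": "Healthcare"}

normalized = (
    raw.str.strip()                              # drop stray whitespace
       .str.lower()                              # unify casing
       .str.replace(r"[\s\-]", "", regex=True)   # drop spaces and hyphens
       .map(aliases)                             # map to a canonical label
)
print(normalized)
```

Rules like these patch up the past; automating collection fixes the problem at the source going forward.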

In the case of quantity, companies may just have to wait another six months for more interactions between the sales team and potential leads. These interactions will generate the outcome data needed to train the machine learning models.

Make sure you’re building a feedback loop

The power of machine learning comes from the fact that it’s self-learning. To benefit from it, you need to keep providing input and outcome data consistently so models can keep learning and getting better. If you can close the feedback loop, you can automatically collect outcome data as actions are taken within your product. 

“For example, imagine you’re building a product recommendation engine, but you haven’t thought about how you can log which item the user clicked, opened, or purchased,” said Parinaz. “If you’re not logging that information, there’s no way to improve your recommendation engine.”
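In practice, closing the loop can be as simple as logging every outcome event as it happens. Here is a minimal sketch; the function name, event schema, and JSONL destination are assumptions for illustration:

```python
import json
import time

def log_recommendation_event(user_id, item_id, action, path="feedback_log.jsonl"):
    """Append one outcome event (e.g. 'clicked', 'purchased') to a JSONL log.

    These logged outcomes are the training signal that lets the
    recommendation engine keep improving.
    """
    event = {"ts": time.time(), "user_id": user_id, "item_id": item_id, "action": action}
    with open(path, "a") as f:
        f.write(json.dumps(event) + "\n")

# Record that a user clicked on a recommended item.
log_recommendation_event(user_id=42, item_id="sku-123", action="clicked")
```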

The other way is to collect explicit feedback, like asking customers how their experience was after working with a customer service agent. Explicit feedback can be hard to collect, so proxy feedback sometimes helps: in this example, you might infer the experience wasn’t good because the call took longer than average. Even if the customer didn’t answer your survey, this clue can indicate what their experience was like.

“It’s most likely an indicator that something went wrong and the agent didn’t do a good job really helping the customer, or there were several transfers to other departments,” said Parinaz.
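Here is a sketch of how such a proxy signal might be computed from hypothetical call records (the column names and thresholds are made up for illustration):

```python
import pandas as pd

# Hypothetical call records: duration in minutes and number of transfers.
calls = pd.DataFrame({
    "call_id": [1, 2, 3, 4],
    "duration_min": [6, 42, 8, 55],
    "transfers": [0, 3, 1, 2],
})

# Calls well above the average duration, or with repeated transfers,
# become a proxy label for a likely poor experience.
avg_duration = calls["duration_min"].mean()
calls["likely_bad_experience"] = (
    (calls["duration_min"] > 1.5 * avg_duration) | (calls["transfers"] >= 2)
)
print(calls)
```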

Overall, it’s important to collect input and outcome data in a consistent and automated way to get the full benefit of machine learning technologies.

“Otherwise, it’s going to be garbage in, garbage out, and it’s going to be pretty much a useless system,” said Parinaz. 
