You Can Never Be Too Data-Rich

By 2022, 35% of large organizations will be either sellers or buyers of data via formal online data marketplaces, up from 25% in 2020. With AI and ML supplementing existing data sources, there is always more value to be derived from large quantities of data.

For years, the data management industry has been talking about the ever-growing volume, velocity, and variety of data. For traditional analytics, the challenge has been how to reduce the data used in reporting and BI: how to separate the signal from the noise, how to prioritize the most relevant and accurate data, and how to make a company’s universe of data usable to an increasingly self-service user population. This notion of having too much data is well-founded, because so much of the data in an organization isn’t readily useful for traditional analytics. Data may be incomplete, inaccurate, too granular, unavailable, or simply not useful for a particular use case. In AI and ML, however, it turns out that having more data, drawn from as many sources as possible, is one of the most important ingredients in building a successful model.

In traditional analytics, the user decides which data is most useful to their analysis and, in so doing, taints their results through their own intentional omissions and unintentional biases. But in AI/ML, and especially when we’re leveraging Automated Machine Learning (AML) technologies, we really can’t have too much good data. We can throw massive amounts of data at the problem and let AML ascertain what’s relevant and helpful and what isn’t. We want lots of data, and unfortunately we usually don’t have enough.

In a recent project, we met a customer who, like most, believed that they had all the data they needed to accurately predict insurance loss risk. They knew their customers, their properties, various demographics, payment histories, and so on. So we built a loss prediction model for them and got good results. The customer was very pleased.

Then we decided to train the model with a combination of internal and third-party data to see whether there would be a difference. We loaded several sets of data that significantly enriched that customer’s already voluminous customer and property data. The result was a 25% increase in the efficacy of the AI model, which, as any data scientist will tell you, is a massive improvement. And the cost of that data was a drop in the bucket relative to the overall project budget.
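To make the enrichment step concrete, here is a minimal sketch of what attaching third-party attributes to internal records can look like. It is not the pipeline we used on the project; the `enrich` helper, the `property_id` key, and every field name and value below are hypothetical and chosen purely for illustration.

```python
from typing import Dict, List


def enrich(internal: List[dict], external: Dict[str, dict], key: str) -> List[dict]:
    """Left-join external attributes onto internal records by `key`.

    Every internal record is kept. Records with no external match simply
    come back without the extra columns.
    """
    enriched = []
    for row in internal:
        extra = external.get(row[key], {})
        # Internal values win if both sources define the same column.
        enriched.append({**extra, **row})
    return enriched


# Hypothetical internal policy records (illustrative values only).
policies = [
    {"property_id": "P1", "prior_claims": 0},
    {"property_id": "P2", "prior_claims": 2},
]

# Hypothetical third-party property attributes, keyed by property_id.
third_party = {
    "P1": {"flood_zone": "low", "distance_to_fire_station_km": 1.2},
}

enriched_policies = enrich(policies, third_party, "property_id")
```

In this sketch the first record gains the two external columns and the second is left unchanged because the vendor data has no match for it. In practice, the share of records a vendor can match is one of the first things worth checking when evaluating an external data set.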

My message to customers facing these issues has evolved; I now encourage them to seek out more data than they already have. Including external data at marginal cost can drive substantial improvements in the quality of models and outputs. And many data vendors have made it easier to test and acquire data, and to pinpoint where it has the most impact. The bottom line is that, in the area of AI, more is definitely better, and you can never be too data-rich.

Ironside and our partner Precisely recently published a white paper on data enrichment for data science, which you can download here.