How ML can help you extract real business value from your data
The mission to become more data-driven has been motivating companies for years now. Aware that they are awash with data that could help inform competition-crushing business decisions, they are relentlessly pursuing strategies to extract more value from the data they are amassing – with mixed results.
One technology that offers huge promise here is machine learning (ML). In fact, speaking at Google Next 2022, Irina Farooq, Senior Director, Product Management, Smart Analytics at Google Cloud, predicted that, by 2025, 90% of data will be actionable using ML.
Let’s look at what makes data-driven success so difficult, the role of ML in extracting value from data and the real-world results ML is generating.
Why data is not adding value
Research highlights the struggles companies experience trying to mine their data for business value. A 2019 Accenture survey revealed that just 32% of companies are able to realize tangible value from their data, and a 2021 NewVantage study found that only 24% of executives consider their companies to be data-driven. Businesses are managing data infrastructure, moving data and making it available to users, often without any clear roadmap for capturing the potential of all that information.
Obstacles to harnessing the business value of data include company culture, the sheer scale of data flooding organizations and concerns about data ownership and privacy. Faced with these hurdles, many company leaders struggle to craft realistic data strategies. Some may adopt a centralized program, with one team extracting, cleansing and aggregating data, resulting in a blanket approach that is often misaligned with end users’ specific needs. Alternatively, they might use separate teams to create customized data pipelines – with limited potential for repurposing.
Instead, companies need to design incremental data strategies primed to deliver value quickly but with scalability built in for future use.
How machine learning can help
Machine learning is a branch of artificial intelligence (AI) that feeds historical data to algorithms to identify patterns and predict future outcomes. This ability to turn an organization’s own data into predictions, decisions and recommendations is what makes ML so appealing to data-driven organizations.
ML algorithms process historical data (generally called training data) to create a predictive model. Each ML dataset comprises variables (features) and observations (records). Predictive ML solutions must identify the independent variables (inputs) with the most influence on the dependent variable – the outcome we wish to predict.
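As a concrete illustration, here is a minimal sketch of that workflow using scikit-learn. The dataset is synthetic, and the churn-style feature and target names are hypothetical stand-ins for real business records:

```python
# A minimal sketch of training a predictive model on historical (training) data.
# The dataset is synthetic; the feature/target names are hypothetical stand-ins.
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Each row is an observation (record); each column is a variable (feature).
X, y = make_classification(n_samples=1_000, n_features=3, n_informative=3,
                           n_redundant=0, random_state=42)
data = pd.DataFrame(X, columns=["tenure_months", "monthly_spend", "support_tickets"])
data["churned"] = y  # dependent variable: the outcome we want to predict

features = ["tenure_months", "monthly_spend", "support_tickets"]  # independent variables
X_train, X_test, y_train, y_test = train_test_split(
    data[features], data["churned"], test_size=0.2, random_state=42
)

model = RandomForestClassifier(n_estimators=200, random_state=42)
model.fit(X_train, y_train)                  # learn patterns from historical records
print(classification_report(y_test, model.predict(X_test)))
```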
Unsupervised ML models group and categorize data to identify patterns rather than predicted outcomes. This enables content streaming companies to help customers discover content they might like via recommendations and search, for example.
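For instance, a simple unsupervised model might cluster viewers by their watching habits. The sketch below is a toy example with made-up viewing hours per genre, grouped with k-means:

```python
# A minimal sketch of unsupervised grouping with k-means; the data is made up.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Rows: viewers; columns: hours watched per genre. No labelled outcome to predict.
viewing_hours = np.array([
    [12.0, 0.5, 1.0],   # drama, documentary, comedy
    [11.5, 0.2, 0.8],
    [0.3, 9.0, 0.4],
    [0.1, 8.5, 0.6],
    [1.0, 0.7, 10.0],
    [0.8, 1.1, 9.5],
])

scaled = StandardScaler().fit_transform(viewing_hours)
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(scaled)
print(clusters)  # viewers in the same cluster can share content recommendations
```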
How to harness ML effectively
ML is no magic wand for managing data. Businesses with legacy systems will have to modernize them to ensure they work effectively with ML solutions. Relevant stakeholders need to prioritize the quality of the raw data feeding the training dataset at every stage of the process, from data acquisition to data preparation to the evaluation of results. And leadership must champion machine learning solutions as a means of achieving identified business objectives.
The importance of data quality
Machine learning algorithms trained on poor-quality datasets produce inaccurate results. Raw data extracted from real-world scenarios is always affected by noise and missing values created by manual errors, technical problems, unforeseen events and other issues. Most algorithms are not designed to handle missing values, and noise can obscure the sample’s true pattern. Data preprocessing is therefore needed before the algorithm can consume the data: filling missing values, denoising, resolving inconsistencies and removing outliers.
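As a rough illustration, the sketch below imputes missing values and then applies a simple interquartile-range rule to drop an extreme outlier. The column names, values and thresholds are illustrative assumptions, not a prescription:

```python
# A minimal preprocessing sketch: fill missing values, then drop an extreme outlier.
# The column names, values and IQR rule are illustrative assumptions.
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

raw = pd.DataFrame({
    "order_value": [20.0, 25.0, np.nan, 23.0, 5000.0],   # one missing value, one outlier
    "items":       [1.0, 2.0, 2.0, np.nan, 3.0],
})

# Fill missing values with each column's median.
imputed = pd.DataFrame(
    SimpleImputer(strategy="median").fit_transform(raw), columns=raw.columns
)

# Drop rows whose order value falls outside 1.5x the interquartile range.
q1, q3 = imputed["order_value"].quantile([0.25, 0.75])
upper_bound = q3 + 1.5 * (q3 - q1)
clean = imputed[imputed["order_value"] <= upper_bound]
print(clean)
```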
Validating your ML model
Once you have built your ML model, you need to evaluate its real-world usefulness. Choosing the right validation metric is particularly important in the case of imbalanced datasets, where the class distribution is significantly skewed and the sample for the positive class is so small that the model struggles to learn its patterns.
This is a common issue in medical and genomics ML initiatives. Say, for example, you are developing a classification algorithm that predicts whether or not someone has a genetic disorder. If just 1% of the population has this disorder, you could build a classifier that always predicts that the person does not have the disease, which would mean your model was 99% accurate – but utterly useless. This imbalance can be accounted for by techniques that involve randomly undersampling the majority class and oversampling the minority class, and it can be detected with more appropriate scoring metrics such as F1-score instead of accuracy.
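The sketch below makes this concrete on synthetic data with a 1% positive class: an “always negative” baseline scores roughly 99% accuracy but an F1-score of zero, while randomly oversampling the minority class gives the model something to learn from:

```python
# A minimal sketch of why accuracy misleads on imbalanced data, and of random
# oversampling of the minority class. The data is synthetic.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Roughly 1% positive class, mirroring the rare-disorder example above.
X, y = make_classification(
    n_samples=10_000, n_features=10, weights=[0.99, 0.01], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# A "predict nobody has the disorder" baseline looks accurate but is useless.
always_negative = np.zeros_like(y_test)
print(accuracy_score(y_test, always_negative))                   # ~0.99
print(f1_score(y_test, always_negative, zero_division=0))        # 0.0

# Randomly oversample the minority class in the training set, then retrain.
minority = X_train[y_train == 1]
minority_upsampled = resample(minority, n_samples=(y_train == 0).sum(), random_state=0)
X_balanced = np.vstack([X_train[y_train == 0], minority_upsampled])
y_balanced = np.array([0] * (y_train == 0).sum() + [1] * len(minority_upsampled))

model = LogisticRegression(max_iter=1000).fit(X_balanced, y_balanced)
print(f1_score(y_test, model.predict(X_test)))  # judge the model on F1, not accuracy
```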
Trusting the data
At Google Next 2022, Irina Farooq spoke about the need to be able to see and trust the data for ML to be effective. That means leveraging automated cataloging tools to discover and manage your data from one central location. It also means being able to work with the data in real time: combining proprietary and open-source tools so your teams can work across all of your data, and applying streaming analytics to act on the data as you collect it.
When it comes to trust, explainability has become an important element of ML, focusing attention on what happens in an ML model between input and output and placing a new emphasis on transparency. Explainable artificial intelligence (XAI) has developed as a set of processes and methods to make the results and output created by machine learning algorithms understandable and trustworthy. It is a key consideration for businesses seeking to pursue responsible ML initiatives.
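One widely used, model-agnostic explainability technique is permutation importance, which measures how much a model’s performance drops when each feature is shuffled. The sketch below applies it to a stand-in scikit-learn model and dataset, not to any specific XAI product:

```python
# A minimal sketch of one explainability technique: permutation importance.
# The dataset and model here are stand-ins for your own pipeline.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, random_state=0
)

model = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)

# Report the features the model relies on most, so stakeholders can sanity-check them.
for i in result.importances_mean.argsort()[::-1][:5]:
    print(f"{data.feature_names[i]}: {result.importances_mean[i]:.3f}")
```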
Optimizing your models
Short feedback loops are also essential if your ML initiatives are to deliver meaningful value. Iterative optimization reduces the error between your model’s predicted output and the true output, as measured by a cost function. To avoid generating unused models with your ML proof of concept, there should be a strong correlation between the cost function your ML algorithm optimizes and a business metric such as ROI.
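One simple way to create that correlation is to express the cost of prediction errors directly in business terms. The sketch below uses hypothetical dollar costs for false negatives and false positives and picks the decision threshold that minimizes the total cost:

```python
# A minimal sketch of aligning an ML cost function with a business metric.
# The dollar figures and example data are hypothetical assumptions.
import numpy as np

COST_FALSE_NEGATIVE = 500.0   # assumed revenue lost per missed fraud case
COST_FALSE_POSITIVE = 20.0    # assumed cost of manually reviewing a flagged order

def business_cost(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Total dollar cost of a batch of predictions: the quantity we want to minimize."""
    false_negatives = np.sum((y_true == 1) & (y_pred == 0))
    false_positives = np.sum((y_true == 0) & (y_pred == 1))
    return false_negatives * COST_FALSE_NEGATIVE + false_positives * COST_FALSE_POSITIVE

# Pick the decision threshold that minimizes the business cost on validation data.
y_true = np.array([0, 0, 1, 0, 1, 0, 0, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.9, 0.2, 0.05, 0.6])  # model probabilities
thresholds = np.linspace(0.05, 0.95, 19)
best = min(thresholds, key=lambda t: business_cost(y_true, (scores >= t).astype(int)))
print(best, business_cost(y_true, (scores >= best).astype(int)))
```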
Practices such as writing automated tests, embracing continuous integration and continuous delivery (CI/CD) and applying effective user testing before launching a comprehensive ML effort will speed up the process of optimizing your ML models significantly. By applying DevOps principles to every stage of ML system construction, organizations can work toward a mature MLOps culture in which both ML and CI/CD pipelines are automated.
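For example, a single automated test run in your CI/CD pipeline can block a release when a candidate model falls below an agreed quality bar. The pytest-style sketch below uses a stand-in dataset, model and threshold in place of your own pipeline artifacts:

```python
# A minimal sketch of one MLOps practice: an automated test (runnable with pytest)
# that fails the CI/CD pipeline if a candidate model misses a quality threshold.
# The dataset, model and 0.9 threshold are stand-ins, not a recommended bar.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

MIN_F1 = 0.9  # assumed minimum acceptable quality for release

def train_candidate_model():
    """Stand-in for loading the model produced by the training pipeline."""
    X_train, X_test, y_train, y_test = train_test_split(
        *load_breast_cancer(return_X_y=True), random_state=0
    )
    model = LogisticRegression(max_iter=5000).fit(X_train, y_train)
    return model, X_test, y_test

def test_model_meets_quality_bar():
    model, X_holdout, y_holdout = train_candidate_model()
    score = f1_score(y_holdout, model.predict(X_holdout))
    assert score >= MIN_F1, f"F1 {score:.3f} is below the {MIN_F1} release threshold"
```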
Where ML is driving valuable data insights
DoiT works with a range of clients who are applying machine learning to their data in creative ways – with impressive results. Here are just a few:
A streamlined retail experience
CB4 uses ML to make the in-store experience easier for retail staff and customers. With the ML-powered solution, store staff can make simple adjustments such as ordering additional units of a product or taking another product out of backstock to help customers and generate new sales. Each store is sent a tailored list of recommendations for stock-keeping units (SKUs) it could sell more of, based on its unique selling patterns and operating conditions.
CB4 leveraged Google Cloud tools and worked with DoiT to build a streamlined data pipeline, 30% more performant ML operations and enhanced cost visibility. The new system also helps the company ensure secure storage of data, in compliance with GDPR and other international data protection regulations. On the performance side, it can integrate new retailers easily into its data solution and maintain high availability even during demand peaks and while scaling.
Scalable online storytelling
Apester helps businesses get their message across via interactive social experiences such as quizzes and polls that integrate seamlessly with their websites and can be distributed at scale. As growing user numbers amplified the volume of data it was handling, the company needed to adopt an easily scalable business intelligence (BI) and data warehousing solution.
It built this around Google Cloud, incorporating Cloud Dataflow, Cloud Dataproc and Cloud Bigtable for data processing and analytics. With its built-in ML and BI capabilities, the BigQuery data warehouse became Apester’s main analytics solution. The data held in BigQuery and the company’s work with Cloud Natural Language modules created a foundation for an ML initiative, and it is now investing heavily in its ML capabilities. It uses the ML platform TensorFlow for its pipeline, allowing it to respond faster to customers’ needs even as it scales.
Real-time fraud detection
Fraud detection company 24metrics offers a solution called ClickShield, which helps businesses identify fraudulent users in real time. It normally takes weeks to establish whether app users are real and not bots, but 24metrics uses ML in its solutions to help predict the quality of users. DoiT helped the company identify the appropriate ML tools and, after an initial session with the DoiT team, they were able to train their first model themselves.
Unhappy with the results of the model they created, they consulted DoiT, which helped them analyze those results, identify potential issues with their ML training approach and consider alternatives. Once they followed DoiT’s recommendations, they quickly developed a well-trained model, which DoiT helped them deploy cost-effectively. 24metrics had projected that it would take more than five months to build the ML algorithm and deploy the new feature, but with DoiT’s support it took just two months and was easier than expected.
Intuitive content-editing at scale
Lightricks apps such as Facetune, Videoleap and Photoleap help streamline content-editing for professional video makers, graphic designers and web builders. With some online ad campaigns needing to create near-instant reports on several terabytes of data, these apps ingest and analyze huge volumes of largely mobile data, often in near real time. The company uses Google Cloud Dataflow to process user behavior data, which is then ingested into BigQuery for analysis at scale.
DoiT provides ongoing support for this elaborate machine learning program, offering guidance on everything from architecture to problem-solving. Lightricks is expanding its ML program, with its marketing, product optimization and recommendation engine teams all creating machine learning models now. Having started with self-managed ML on Google Cloud Compute Engine, it is gradually migrating to managed services on Google Cloud’s Vertex AI for even faster scaling.
What to do next
Machine learning may not be the complete solution for companies grappling with their data, but it can be part of it. With the right leadership, culture and structures in place, companies can use ML to harness their data quickly and effectively to extract maximum business value from it. For companies considering ML as part of their data solution and for those already well advanced on the ML path, DoiT can offer support and guidance to accelerate and optimize your efforts.