Rico Lamein
Feb 29, 2024 10:07:10 AM

The 2020s have been a decade of rapid influx for Artificial Intelligence solutions, and the technology has leapfrogged over the last few years. Every organization aspires to join the AI trend and implement applications like ChatGPT and DALL-E to accelerate processes, achieve scalability, and optimize resource utilization. However, the biggest mistake organizations make in the rush to adopt AI is not having the right processes in place to guarantee data quality. The output then disappoints, because ‘garbage in, garbage out’.

Therefore, in this blog, we will cover the effects of bad data on AI, and data quality best practices to keep in mind while adopting AI for your organization.

Why is data quality important for AI programs? 

A European Parliament resolution on industrial policy on AI and robotics states that the “quality and accuracy, as well as the representative nature of data used in the development and deployment of algorithms” can impact their success. Organizations are rushing to adopt AI despite implementation costs and data security concerns. Yet according to the 2023 AI and Machine Learning Research Report, one of the barriers hindering AI adoption is a lack of confidence in the organization’s data quality.
(Source: https://www.rackspace.com/sites/default/files/2023-02/AI-Machine-Learning-Research-Report.pdf)  

AI algorithms can be used across industries to ease and automate different tasks: from healthcare diagnostic patterns to recommendations on an e-commerce website based on a customer’s buying history. AI can be used for functions such as document processing, customer engagement, content creation, image recognition, research, translation, intelligent search, and much more. In each of these implementations, the AI is trained on huge volumes of data. It extracts patterns and trends from the training data and teaches itself those characteristics. When a new observation is presented, the AI compares it to what it has learned so far and predicts the outcome of the new observation.

Let’s take the example of training an AI to recognize puppies. Thousands of pictures of puppies are shown to the AI, which teaches itself the characteristics that belong to a puppy: size, pupil dilation, color, and so on. If we then present a new image of an animal to the AI, it can autonomously detect whether the image shows a puppy or not. This rests on one important assumption: the dataset used to train the AI is of good quality. If, however, we train the AI with images showing both puppies and kittens while labeling them all as puppies, it will incorrectly classify kittens as puppies on new observations. The same goes for humans: if a baby is never properly taught the difference between a puppy and a kitten, they will have a hard time telling them apart later in life.
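To make the effect of mislabeled training data concrete, here is a toy sketch in Python (not a real vision model): a nearest-centroid classifier trained on invented (size, ear length) features. All the numbers, features, and labels below are made up for illustration only.

```python
# Toy sketch (not a real vision model): a nearest-centroid classifier
# trained on invented (size, ear_length) features for puppies and kittens.

def centroid(points):
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

def train(samples):
    # samples: list of ((size, ear_length), label) pairs
    by_label = {}
    for features, label in samples:
        by_label.setdefault(label, []).append(features)
    return {label: centroid(pts) for label, pts in by_label.items()}

def predict(model, features):
    def dist2(a, b):
        return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2
    return min(model, key=lambda label: dist2(model[label], features))

# Correctly labeled training data: the model learns both classes.
clean = [((8, 6), "puppy"), ((9, 7), "puppy"),
         ((3, 2), "kitten"), ((4, 3), "kitten")]
print(predict(train(clean), (3.5, 2.5)))   # kitten

# Mislabeled training data: kittens tagged as puppies, so the model
# only ever knows one class and answers "puppy" for everything.
noisy = [((8, 6), "puppy"), ((9, 7), "puppy"),
         ((3, 2), "puppy"), ((4, 3), "puppy")]
print(predict(train(noisy), (3.5, 2.5)))   # puppy
```

With clean labels the new observation is correctly classified as a kitten; with the mislabeled set, the kitten-sized animal is inevitably classified as a puppy, exactly the failure mode described above.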

Therefore, the AI system is only as good as its training data. If you use labeled, structured, and accurate data, the AI program will generate better results. However, if the data used to train a model is biased, incorrect, incomplete, or inconsistent, the program will generate biased, incorrect, incomplete, or inconsistent results as well.

The assumption that “Big Data” can bypass the data quality issue is not something organizations should bank on. The reverse is true. Bad data quality in big data sets used to train AI models only amplifies the problem.  

What are the results of poor data quality used to train AI programs? 

While most businesses want to start using AI to automate processes, they do not want to spend time on data cleansing. Much of the problem lies in historical data. In fact, according to one report, major enterprises are aborting their AI projects due to poor data quality. So, what are the data quality issues that affect AI output?
(Source: https://cio.economictimes.indiatimes.com/news/strategy-and-management/data-quality-top-reason-why-enterprises-are-pulling-plugs-on-ai-projects/69649074)


Biased data

Data is created by humans, who can be biased. When biased data is used to train an AI model, the outcome is tainted with bias as well. In a world where equality, diversity, and inclusion are priorities for organizations, an AI assistant for the HR or legal team should not show bias toward any race or gender. It is therefore important to evaluate data not just for accuracy but also for discriminatory biases that can color the worldview of the AI program. Parity tests should be conducted during the training stages to evaluate any biases in the system.
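As an illustration of a parity test, the sketch below compares the rate at which a hypothetical screening model selects candidates from two groups. The predictions, group labels, and alert threshold are all invented for this example; real parity checks would use your own model's outputs and your organization's fairness policy.

```python
# Hypothetical parity check: compare a screening model's positive
# ("shortlisted") rate across two groups. All data here is invented.

def positive_rate(predictions, groups, group):
    selected = [p for p, g in zip(predictions, groups) if g == group]
    return sum(selected) / len(selected)

def parity_gap(predictions, groups):
    rates = {g: positive_rate(predictions, groups, g) for g in set(groups)}
    return max(rates.values()) - min(rates.values())

predictions = [1, 1, 0, 1, 0, 0, 1, 0]   # 1 = shortlisted by the model
groups      = ["A", "A", "A", "A", "B", "B", "B", "B"]

gap = parity_gap(predictions, groups)
print(f"parity gap: {gap:.2f}")   # group A: 0.75, group B: 0.25, gap 0.50
if gap > 0.2:  # the threshold is a policy choice, not a standard
    print("warning: model may be biased against a group")
```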

Measurement and representation errors

Measurement and representation errors usually stem from training data that is personal or subjective. Measurement errors often occur when there are no guidelines for labeling data. How the data is labeled is crucial, and because labeling is done by humans, it can also introduce bias if there is no quality control.

A measurement error occurs when the data cannot accurately reflect what is intended to be measured because of the nuances of human nature or unavailable data. For example, if you had to define a ‘good citizen’, would it be someone who does not litter, or someone who only purchases locally made products? For the AI model to process this and produce an output, the exact characteristics of a good citizen need to be defined, and they need to be measurable. Since this is subjective, answering such questions is difficult for the AI program. When certain information is not available, the AI program might use the closest available information instead (a proxy), and the output might therefore not be accurate.

Representation errors occur when there is not enough detail in the training data. If the AI program is being trained on the diversity of the US population but the data does not adequately represent immigrants, there will be errors in representation.

Data is not up to date

Timeliness and consistency are important aspects of data quality that AI programs might not be able to achieve. If the program was trained on historical data from the last decade and the data has changed in the 2020s, the output will fall short on the data quality dimensions of timeliness and completeness. For example, some visitors may opt out of your company’s cookies and browsing-data tracking on your website, making it difficult to assess their journey. There will therefore always be missing data, and such conditions need to be kept in mind when training the AI algorithm.

Reliability and validity

This is linked to measurement and representation errors. The output AI generates is shaped by the quality of its training data, so even a large dataset does not guarantee reliability and validity. Big data might draw on a large number of statistics or observations, and the output based on the average could be correct. However, if that data has not been validated or is missing information, the output could contain bias and errors, and therefore not be reliable. Additionally, the output generated by AI can vary considerably for the same question.

Limitations of access

If you use out-of-the-box AI algorithms, you can only train them with non-sensitive data. This limited access to data impacts the output, so such an AI program cannot be used for reporting or revenue management. Alternatively, if you are using sensitive data to train your AI program, you need the correct security measures in place to ensure compliance with industry rules and regulations. However, this adds overhead and implementation costs.

Limited understanding of concepts

Although large datasets are shared with AI programs, gated industry research and scientific papers cannot be. The program’s understanding of concepts depends directly on the data it is given and can improve over time with feedback. Additionally, AI processes data differently from humans and develops associations based on how the tool was trained.

Errors due to duplication and lack of standardization 

When data is pulled from several sources but not checked for duplicates, errors follow. Additionally, if that data is not classified (for example, unstructured data without a metadata description), it can create confusion and make analysis difficult for the AI program. An example is time or date data entered in different formats without establishing a relationship or metadata for it.
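As a small illustration of the standardization problem, the Python sketch below normalizes dates arriving in mixed formats before deduplicating records. The formats and records are invented; without the normalization step, the two spellings of the same order date would survive as separate rows.

```python
# Sketch: normalize mixed date formats, then deduplicate.
# Formats and records are invented for illustration.
from datetime import datetime

FORMATS = ["%Y-%m-%d", "%d/%m/%Y", "%b %d, %Y"]

def normalize_date(raw):
    for fmt in FORMATS:
        try:
            return datetime.strptime(raw, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {raw!r}")

records = [
    {"customer": "Acme", "order_date": "2024-02-29"},
    {"customer": "Acme", "order_date": "29/02/2024"},   # same order, other format
    {"customer": "Bolt", "order_date": "Mar 01, 2024"},
]

seen, unique = set(), []
for rec in records:
    key = (rec["customer"], normalize_date(rec["order_date"]))
    if key not in seen:
        seen.add(key)
        unique.append(rec)

print(len(unique))  # 2: the two Acme rows collapse into one
```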

What are some best practices when adopting AI? 

Implement data governance

Having data quality rules, standards, roles and responsibilities, and a team focused on data governance helps you eliminate data quality issues. Checks for data standardization and deduplication when data is merged from several sources should be prioritized. All of this helps you achieve higher accuracy when training the AI model. Additionally, if you are using historical data, ensure it has been cleaned and validated. Assigning roles and creating processes for data validation can improve the accuracy and reliability of data.

Use data quality solutions

Data quality solutions can help you deduplicate, validate, and flag errors in your systems and applications at data entry, and even in historical data. At STAEDEAN, our data quality solution can help you improve data quality while migrating to Microsoft Dynamics 365 Finance & Supply Chain Management ERP and even after going live. Add data quality rules, establish validation processes, and run data quality checks at regular intervals. Also, set up processes to check data for timeliness and completeness.

Communicate and collaborate 

Ensure your data governance team is sharing updates on the monitoring and improvement processes and best practices with the rest of the organization. Regular communication and training about data governance processes will also help improve data quality across departments. Set up and measure data quality metrics to track the accuracy levels of the AI output and be open to feedback from other teams and data providers. This can help identify and resolve issues and improve the performance of the AI program.

Focus on data preparation 

When you begin with raw data, ensure your team has time to structure, label, classify, and add metadata to it before beginning data cleansing. Data can range from plain text to heavy Excel files with images. If these are classified rather than merged across formats, the AI predictions will be of better quality. Trying to fix such issues after the data has been shared is a manually intensive process. Debugging data to identify which dataset is causing issues during model training is detective work: the cause could be anything from a lack of data governance processes to missing metadata, and tracking down and fixing the exact issue is a resource- and time-intensive process.

Data quality monitoring

Even after the AI model has been trained and launched, it needs to be monitored and updated based on feedback and issues. Once a model is in use, it might be exposed, via prompts, to information that differs or contrasts greatly with the training data. This is known as data drift, and it should be monitored and reported. The model can then be retrained based on the accumulated drift and data quality issues. Having a process to report and record such issues is important, and looking for patterns related to the errors covered in this blog will help identify data quality problems. Improving the AI model takes time and should be a continuous process, with data governance at the center.
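A minimal sketch of drift monitoring, assuming incoming prompts can be bucketed by topic: it compares the topic mix seen at training time with live traffic using total variation distance. The topic counts and alert threshold are invented; production monitors use richer statistics, but the shape is the same.

```python
# Minimal drift check (a sketch, not a production monitor): compare
# the share of prompts per topic at training time vs. in live traffic.
# All counts below are invented.

def distribution(counts):
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}

def total_variation(p, q):
    # 0 = identical distributions, 1 = completely disjoint
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0) - q.get(k, 0)) for k in keys)

training = distribution({"billing": 500, "shipping": 400, "returns": 100})
live     = distribution({"billing": 100, "shipping": 150, "returns": 750})

drift = total_variation(training, live)
print(f"drift score: {drift:.2f}")
if drift > 0.25:  # the alert threshold is a policy choice
    print("data drift detected: consider retraining")
```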

How can organizations use AI to optimize business processes?

From manufacturing robots to smart assistants, businesses are combining artificial intelligence with robotics and other technologies to improve business processes, empower employees, and scale production. As more and more businesses adopt AI, the performance and maturity of the technology will improve and change. Based on the Forbes Advisor survey*, 56% of the businesses that participated use AI for customer service and 51% are currently using it for managing cybersecurity and fraud. (*Source: https://www.forbes.com/advisor/business/software/ai-in-business/)

The sky is the limit when it comes to the use of AI in businesses across processes. But we need to err on the side of caution as AI is evolving at break-neck speed and also has its risks and disadvantages. Having said that, let’s dive into some of the most commonly used processes where Artificial Intelligence is currently being used by organizations.


Data cleansing

AI algorithms can be used in data cleansing solutions to point out duplicates. For example, our AI algorithm in Data Quality Studio for the Dataverse can identify and surface duplicate data. The solution assigns a weight to the likelihood that two or more fields are duplicates: the higher the weight, the more likely they are duplicate entries.
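The weighting logic in Data Quality Studio is proprietary, so the sketch below only approximates the idea with standard-library string similarity: each pair of records gets a weight, and higher weights indicate likelier duplicates. The records and fields are invented.

```python
# Illustrative only: approximates duplicate-weight scoring with
# difflib string similarity. Records and fields are invented.
from difflib import SequenceMatcher

def field_similarity(a, b):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def duplicate_weight(rec_a, rec_b, fields):
    # Average per-field similarity; higher weight = likelier duplicates.
    return sum(field_similarity(rec_a[f], rec_b[f]) for f in fields) / len(fields)

a = {"name": "Jon Smith",  "city": "Amsterdam"}
b = {"name": "John Smith", "city": "Amsterdam"}
c = {"name": "Mary Jones", "city": "Oslo"}

print(round(duplicate_weight(a, b, ["name", "city"]), 2))  # high: likely duplicates
print(round(duplicate_weight(a, c, ["name", "city"]), 2))  # low: distinct records
```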

Related reading: STAEDEAN Unveils Free AI-driven Data Quality Tool for Dataverse

Customer service chatbot

Chatbots have been around for a long time and can respond to customer questions and issues around the clock. Using the data from these chats, you can train the AI algorithm to provide better responses, understand the most frequently asked questions and issues customers face, and work on resolutions. A chatbot on your website frees up your employees to take on more critical issues and improves the overall customer experience.

Product recommendations

Most commonly used in e-commerce, product recommendations are shared based on the customer’s browsing history. This keeps the customer engaged longer on the website and encourages more sales without running paid ads. The technique is also used by streaming platforms, which share personalized recommendations based on viewing data to keep consumers engaged longer on OTT platforms.

We have added a product recommendation engine feature to our DynaRent Solutions Suite for the Equipment as a Service industry. This is a configurable AI algorithm combined with machine learning that recommends frequently rented or purchased products during order creation in D365 F&SCM.
Related reading: Product Recommendation Engine: Use AI in D365 to Recommend Products to Rental Customers

Fraud detection

Used especially in the financial industry, machine learning algorithms can be trained to identify and call out suspicious transactions, helping detect fraud and act on it sooner. When a suspicious transaction is flagged, the application can stop the transaction from going through and alert the relevant fraud management teams. The algorithm can be trained on historical data to flag unusual, higher-volume transactions based on customer behavior. Similarly, machine learning algorithms can be trained to detect cybersecurity threats.
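As a highly simplified sketch of flagging unusual transactions, the example below scores a new amount against a customer's historical spend with a z-score. Real systems use far richer features than amount alone, and all figures here are invented.

```python
# Hedged sketch: flag transactions far above a customer's historical
# mean using a z-score. Amounts are invented for illustration.
from statistics import mean, stdev

def flag_unusual(history, amount, threshold=3.0):
    mu, sigma = mean(history), stdev(history)
    z = (amount - mu) / sigma
    return z > threshold

history = [42.0, 55.0, 38.0, 60.0, 47.0, 51.0]  # typical card spend
print(flag_unusual(history, 49.0))    # False: in line with history
print(flag_unusual(history, 900.0))   # True: stop and alert the fraud team
```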

Predictive Analytics 

When trained on previous demand and supply information, AI-driven predictive analytics can be used to forecast demand. But do ensure you also plan for unforeseen events such as sudden catastrophes. Based on the analysis, you can optimize inventory levels and adjust to demand, saving you from investing in extra inventory. This can help organizations improve overall supply chain efficiency and profits.
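A minimal sketch of demand forecasting, assuming monthly unit sales: a moving average over the most recent months. Production systems would account for seasonality and trend; the sales history below is invented.

```python
# Simple demand-forecasting sketch: moving average over recent months.
# The monthly history is invented for illustration.
def moving_average_forecast(demand, window=3):
    return sum(demand[-window:]) / window

monthly_units = [120, 130, 125, 140, 150, 145]
forecast = moving_average_forecast(monthly_units)
print(f"next month forecast: {forecast:.0f} units")  # (140+150+145)/3 = 145
```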

Improving diagnosis and treatment

AI algorithms can be trained with a large amount of anonymized patient data keeping in mind healthcare regulations to identify patterns of disease and even exceptions. This can help doctors and healthcare professionals better diagnose and treat their patients.

Currently, AI is being used to speed up drug discovery for different diseases. Instead of testing every possible compound, AI processes biological data and narrows down possible compounds to a shortlist that can be tested to treat an ailment. This reduces time and cost and opens a pathway for discoveries in the pharmaceutical industry. Additionally, AI is being used to test existing drugs as possible treatments for other diseases.

Improving production

AI is being used by food manufacturing companies to monitor product quality and enhance supply chain efficiency. Usually paired with robotics, AI-powered automated systems can monitor and optimize inventory levels, detect anomalies in production, and reduce waste. Used for quality control, such systems help food manufacturers ensure compliance with health and industry regulations and thus reduce the risk of recalls.

Equipment management 

Different sectors from utilities to heavy machinery can use AI to monitor and improve operations and optimize maintenance. AI algorithms can be used in conjunction with sensors to track data on equipment to predict potential issues and share a warning. This predictive maintenance approach minimizes downtime, extends the lifespan of equipment, and reduces overall operational costs. Companies can allocate resources more strategically, ensuring the reliable and uninterrupted delivery of services.

Related reading: How to use ERP rental software to manage the maintenance of your equipment?

Optimizing renewable energy

Renewable energy depends on the availability of natural resources such as sunlight and wind, which might not be available around the clock and vary by season. The sector is using AI to analyze weather patterns and energy consumption data and to plan storage and distribution. Getting this right will play a huge role in the success of renewable energy adoption and can help companies optimize and improve the efficiency of renewable energy production and storage.

Autonomous driving 
AI is being tested to improve safety and navigation in autonomous vehicles. These vehicles carry an array of sensors, and AI and machine learning algorithms are trained to process their data in real time. The programs can analyze surroundings, detect obstacles, and make split-second decisions to navigate the vehicle safely. Additionally, AI is used in Advanced Driver-Assistance Systems (ADAS) to improve vehicle performance, optimize fuel consumption, and reduce emissions.

Legal document review 

AI combined with Natural Language Processing algorithms can be trained to review and categorize large volumes of legal documents and extract the relevant information. This speeds up the initial stages of document review, reduces the manual work involved, and gives the legal team more time to focus on the more complex aspects of a case.

Are you looking for a data quality solution? 

As we have covered in this blog, organizations across industries are leveraging Artificial Intelligence and witnessing a multitude of benefits. To summarize:

  • AI can be used to detect data duplicates, thereby improving data quality.
  • AI chatbots can be trained to deliver better responses to customers for basic requests.
  • AI programs can be used to share personalized products, services, and content preferences based on browsing history.
  • Machine learning algorithms can be used for fraud detection and to track unusual activity.
  • AI can be used to optimize the supply chain by forecasting demand and calculating inventory data.
  • AI is aiding breakthrough discoveries in pharmaceuticals and also helping diagnose patients better.
  • AI programs can be used in manufacturing to detect anomalies, reduce waste, and reduce compliance issues.
  • AI can be used to detect issues in equipment, reallocate resources, and facilitate uninterrupted services.
  • Predictive analytics combined with AI can help to proactively predict and balance supply and demand for manufacturers across industries.
  • AI and machine learning algorithms used in autonomous vehicles can read real-time data from sensors and make split-second decisions.
  • AI programs can be used by legal professionals to speed up the initial review process of legal documents.

Organizations adopting or looking to adopt AI across processes and industries can considerably benefit from increased efficiency, improved customer experiences, enhanced decision-making, and better use of their resources. As AI technologies continue to evolve and grow, businesses must evaluate and adapt the use of AI while managing the associated risks.

If you are considering adopting Artificial Intelligence, we recommend starting with data governance. An AI program cannot be effectively trained on poor-quality data and will not yield the desired results. Keep data quality issues in mind and develop a process that assesses your AI program through a continuous feedback loop. This will help you improve the processing and quality of the responses your AI program shares.

At STAEDEAN, we have built solutions for data quality management of master data and also offer a free AI-powered data quality tool. We understand how the quality of data impacts the quality of the output an AI program generates. We offer two data quality solutions for different platforms: Data Quality Studio for Microsoft Dynamics 365 Finance & Supply Chain Management and Data Quality Studio for the Dataverse. If you want to test our data quality solution before upgrading, we also offer a free, AI-powered version of Data Quality Studio for the Dataverse. Our solution recommends an AI algorithm based on the chosen column and lets you identify and merge duplicate data to improve data accuracy.

If you are interested in learning more about our data quality solution, visit our product page and download the factsheet from the link below.
