Introduction: The Shift from Model-Centric to Data-Centric AI
What if the key to unlocking AI’s full potential isn’t hidden in more complex algorithms, but in the one commodity that’s often overlooked: data? Since the seminal 1956 Dartmouth conference, AI has evolved from rule-based expert systems to deep learning. Yet, as we approach the next era of AI, it’s clear that the future lies in rethinking our approach to data.
For years, most discussions surrounding AI have focused on developing more advanced models. This model-centric approach has yielded impressive results, from image recognition systems that outperform humans to large language models (LLMs) and small language models (SLMs) capable of generating human-like text. Yet, as organizations work to implement AI at scale, they’re encountering limitations that no amount of model tweaking can overcome. The root cause? The quality and relevance of the data feeding these models.
While impactful, the model-centric approach often treats data as a static input, focusing innovation efforts on algorithmic improvements. This approach has led to diminishing returns, with marginal gains in performance coming at the cost of exponentially increasing model complexity. Additionally, it has created a disconnect between AI systems and the real-world environments in which they operate, resulting in models that may perform well during training but crash and burn when faced with the nuances and variability of practical applications.
While model-centric AI has achieved remarkable feats, its limitations are becoming evident. This realization is driving a shift towards data-centric AI, which emphasizes the quality and management of data over algorithmic complexity. This shift represents more than just a change in methodology; it’s a fundamental realignment that places data at the heart of AI innovation. For instance, in the healthcare industry, data-centric AI is being used to improve patient outcomes by analyzing large volumes of patient data to identify patterns and predict health risks. By focusing on systematically improving data accessibility, findability, quality, relevance, and representation, companies can unlock new levels of AI performance and reliability, paving the way for more widespread and impactful AI adoption.
What is Data-Centric AI?
Data-centric AI represents a paradigm shift in the approach to developing AI systems. At its core, data-centric AI is the discipline of systematically engineering the data used to build an AI system.[1] This approach differs from traditional AI development by placing the spotlight squarely on the quality and management of data rather than solely on the algorithms. Gartner states,
“Data-centric AI is an approach that focuses on enhancing and enriching training data to drive better AI outcomes, as opposed to a model-centric approach wherein AI outcomes are driven by model tuning. Data-centric AI also addresses data quality, privacy, and scalability.”[2]
Key principles of this systematic data engineering include:
- Data Quality: Ensuring data is accurate and reliable.
- Consistency: Maintaining uniformity in data definitions, formats, and structures.
- Relevance: Using data that is pertinent to the specific AI application.
To understand the contrast with traditional model-centric approaches, consider the typical AI development process. In a model-centric world, data scientists spend the majority of their time tweaking model architectures, fine-tuning hyperparameters, and experimenting with different algorithms. While these activities are undoubtedly important, they often yield diminishing returns, especially when working with shoddy data.
Data-centric AI flips this script. Instead of treating the data as a fixed variable, it encourages teams to iteratively improve their datasets. This might involve cleaning noisy data, augmenting existing datasets with synthetic data, or redefining how data is collected and labeled. The goal is to create a virtuous cycle where better data leads to better models, which in turn inform further data improvements.
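This clean-then-augment cycle can be sketched in a few lines of Python. The schema (a `label` field and a numeric `length` field) and the cleaning rules below are purely hypothetical, chosen only to illustrate the loop of filtering noisy records and enlarging the clean set:

```python
import random

def clean(rows):
    """Drop rows with invalid labels or missing/out-of-range values (hypothetical schema)."""
    return [r for r in rows
            if r.get("label") in {"cat", "dog"}
            and r.get("length") is not None
            and 0 < r["length"] < 100]

def augment(rows, jitter=0.05, n_copies=1, seed=0):
    """Enlarge the clean set with slightly perturbed copies (a simple tabular augmentation)."""
    rng = random.Random(seed)
    out = list(rows)
    for _ in range(n_copies):
        for r in rows:
            copy = dict(r)
            copy["length"] *= 1 + rng.uniform(-jitter, jitter)
            out.append(copy)
    return out

raw = [
    {"label": "cat", "length": 42.0},
    {"label": "??", "length": 13.0},   # bad label -> dropped
    {"label": "dog", "length": None},  # missing value -> dropped
]
cleaned = clean(raw)
training_set = augment(cleaned, n_copies=2)
print(len(cleaned), len(training_set))  # 1 clean row -> 3 rows after augmentation
```

In a real pipeline, the model's errors would feed back into which rows get re-labeled or re-collected, closing the virtuous cycle described above.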
Andrew Ng, an AI guru and advocate for data-centric AI, succinctly captures the essence of this approach: “Focus on developing systematic engineering practices for improving data”.[3] This perspective underscores the need for a structured, repeatable process for data enhancement, moving beyond ad-hoc data cleaning to establish robust data engineering practices as a cornerstone of AI development.
By adopting a data-centric AI approach, organizations can address many of the challenges that have historically hindered AI adoption and scaling. This approach promotes collaboration between domain experts, data scientists, and data engineers and ensures that AI systems are built on a foundation of high-quality, relevant data. It ultimately leads to more accurate, reliable, and trustworthy AI solutions.
Why Data-Centric AI Matters for Business
In today’s rapidly evolving business landscape, the adoption of data-centric AI is crucial for organizations seeking to maintain a competitive edge. This approach offers significant advantages that directly impact a company’s bottom line and operational efficiency.
Improved Model Accuracy and Reliability
Focusing on data quality, consistency, and relevance with data-centric AI can improve model accuracy, reliability, and robustness. As in cooking, where fresh, high-quality ingredients make the dish, AI systems built on fresh, high-quality data produce more accurate and dependable results. This increased reliability translates into better decision-making processes, reduced errors, and ultimately, improved business outcomes.
Faster Development and Deployment Cycles
The data-centric approach to AI accelerates the development and deployment cycles. In some cases, organizations have reported building computer vision applications up to 10 times faster than with traditional approaches.[4] Faster deployment means businesses can start reaping the benefits of their AI investments sooner, leading to quicker returns on investment and the ability to adapt more rapidly to changing market conditions.
Enhanced Collaboration Between Technical and Domain Experts
Data-centric AI fosters an environment where technical experts and domain specialists can work together more effectively. This approach bridges the gap between those who understand the intricacies of AI systems and those who possess deep industry-specific knowledge. By focusing on data quality and relevance, domain experts can contribute their insights more directly to the AI development process, resulting in solutions that are not only technically sound but also highly relevant to real-world business challenges. This collaboration also helps data scientists and AI practitioners feel valued and integral to the process.
By prioritizing data quality and relevance, data-centric AI enables businesses to create more effective, efficient, and reliable AI systems.
The Role of Standards in Data-Centric AI
Why Standards Matter
Standards are crucial in ensuring the quality, interoperability, and compliance of data-centric AI systems. As organizations increasingly rely on data to drive their AI initiatives, having common standards helps establish best practices, enables consistency across implementations, and facilitates trust in AI systems. Standards provide a shared language and guidelines for data collection, preparation, and usage in AI applications. This helps ensure data quality by defining metrics and processes for assessing and improving data. Standards also enable interoperability by specifying common data formats and interfaces, allowing AI systems and datasets from different sources to work together seamlessly. Additionally, standards support regulatory compliance by codifying requirements around data privacy, security, and ethical AI practices.
Key Standards
Several international standards bodies have developed relevant standards for data and AI, including:
- ISO/IEC 20546—Information technology—Big data—Overview and vocabulary: Provides terminology and conceptual frameworks for big data.
- ISO/IEC 20547—Information technology—Big data reference architecture: Specifies a reference architecture for big data systems.
- ISO/IEC 23053—Framework for Artificial Intelligence (AI) Systems Using Machine Learning (ML): Establishes a framework for describing AI systems that use machine learning.
- ISO/IEC 42001—Information technology—Artificial intelligence—Management system: Provides guidance on managing AI systems within organizations.
- ISO/IEC 5259 series—Artificial intelligence—Data quality for analytics and machine learning: Covers various aspects of data quality for AI applications.
These standards provide guidelines on data management, quality assessment, AI system development, and governance that are highly relevant for data-centric AI initiatives.
Data Risks in AI Systems
Poor Data Quality
Poor data quality can have a significant negative impact on AI system performance. When AI models are trained on inaccurate, incomplete, or inconsistent data, they will likely produce unreliable or erroneous outputs. As they say, “garbage in, garbage out” – the quality of an AI system’s results directly depends on the quality of its training data.
A particularly concerning phenomenon related to data quality is the concept of data cascades. Data cascades refer to compounding events causing adverse, downstream effects from data issues. These are often triggered by AI development practices that undervalue data quality. Research by Google found data cascades to be present in 92% of the high-stakes AI projects studied.[5] Data cascades can have severe consequences in critical domains like healthcare, criminal justice, and financial services, where AI predictions can significantly impact people’s lives. For example, poor-quality data cascading through an AI system could lead to incorrect cancer diagnoses, unfair facial recognition in law enforcement, or biased loan approvals.
Data Bias
Data bias poses a major risk for AI systems, with significant implications for fairness and ethics. When AI models are trained on biased datasets, they can perpetuate and even amplify existing societal biases. This can lead to discriminatory outcomes across various applications, from hiring processes to criminal sentencing.
For instance, an investigation in the US found that AI-powered lending systems were more likely to deny home loans to people of color than to other applicants; Black applicants were 80% more likely to be denied than comparable white applicants.[6] This exemplifies how biased training data can result in unfair and potentially illegal discrimination in high-stakes decisions.
Addressing data bias is crucial for developing ethical AI systems that treat all individuals and groups fairly. It requires careful consideration of data collection methods, preprocessing techniques, and ongoing monitoring of AI system outputs for potential biases.
Data Privacy
Data privacy and security are critical considerations in AI systems, especially given the large volumes of potentially sensitive data often used in training and inference. AI models may inadvertently memorize and potentially expose private information from their training data. Additionally, the data used for AI inference could contain personal or confidential information that needs protection.
Regulatory frameworks like the European Union’s General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA) have established strict requirements for handling personal data, including in AI systems. These regulations mandate practices such as data minimization, purpose limitation, and the right to erasure, which all have significant implications for AI development and deployment.
Ensuring data privacy in AI systems involves various technical approaches, such as differential privacy, federated learning, and secure multi-party computation. It also requires robust data governance practices and a privacy-by-design approach to AI development. Balancing the need for data to train effective AI models with the imperative to protect individual privacy remains an ongoing challenge in the field.
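As a concrete illustration of one of these techniques, the Laplace mechanism at the heart of differential privacy can be sketched in plain Python. This is a minimal, illustrative sketch of a private counting query (counting queries have sensitivity 1, so the noise scale is 1/ε), not a production-grade implementation:

```python
import math
import random

def dp_count(values, predicate, epsilon=1.0, seed=None):
    """Differentially private count: the true count plus Laplace(1/epsilon) noise."""
    rng = random.Random(seed)
    true_count = sum(1 for v in values if predicate(v))
    # Sample Laplace noise via inverse-CDF: u uniform on (-0.5, 0.5)
    u = rng.random() - 0.5
    while abs(u) >= 0.5:  # guard against the rare u == -0.5
        u = rng.random() - 0.5
    scale = 1.0 / epsilon
    noise = -scale * math.copysign(1, u) * math.log(1 - 2 * abs(u))
    return true_count + noise

ages = [34, 51, 29, 62, 45, 38]
noisy = dp_count(ages, lambda a: a >= 40, epsilon=0.5, seed=42)
```

Smaller ε means stronger privacy but noisier answers; the analyst sees only the noisy count, never the raw records.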
Implementing Data-Centric AI
Implementing data-centric AI requires a systematic approach that prioritizes data quality, governance, and responsible use throughout the AI lifecycle. Here are some critical steps organizations can take:
1. Be Diligent in Data Acquisition
The foundation of data-centric AI is high-quality, diverse data. Organizations should carefully evaluate data sources and implement rigorous data collection practices. This includes assessing data relevance, completeness, accuracy, and potential biases. Diversifying data sources can help ensure broader representation. Organizations should also consider supplementing their own data with third-party datasets when appropriate, while thoroughly vetting those external sources.
2. Employ Effective Data Lifecycle Management
Data should be managed strategically from acquisition through disposal. This involves implementing data cataloging and metadata management to maintain visibility into available datasets. Version control for datasets is crucial as data evolves over time. Organizations should establish data cleaning, transformation, and integration processes to prepare data for use in AI systems. Archiving and retention policies should be defined to preserve historical data when needed while complying with regulations.
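Dataset version control, one of the practices above, can start as simply as deriving a deterministic version identifier from the data’s content. The function below is a hypothetical sketch using a content hash; real systems add storage, lineage, and diffing on top of this idea:

```python
import hashlib
import json

def dataset_version(rows):
    """Deterministic content hash of a dataset, usable as a version id.
    Rows are serialized with sorted keys so logically equal datasets
    always hash to the same identifier."""
    digest = hashlib.sha256()
    for row in rows:
        digest.update(json.dumps(row, sort_keys=True).encode("utf-8"))
    return digest.hexdigest()[:12]

v1 = dataset_version([{"id": 1, "label": "cat"}])
v2 = dataset_version([{"id": 1, "label": "dog"}])
assert v1 != v2  # any change to the data yields a new version id
```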
3. Implement Data Quality Governance
Maintaining high data quality requires ongoing governance. Organizations should define data quality metrics and implement automated data quality checks. Regular data profiling can help identify quality issues. Data stewards should be appointed to oversee quality of critical datasets. Processes should be established for data cleansing and enrichment when quality issues are found. Feedback loops between AI teams and data owners can help continuously improve data quality.
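Automated quality checks of this kind can be as simple as a set of named expectations evaluated against every row, with failing rows reported for a data steward to triage. The field names and rules below are hypothetical:

```python
def profile(rows, rules):
    """Run simple expectation checks over a dataset and report failing row indices.
    `rules` maps a rule name to a per-row predicate."""
    report = {}
    for name, predicate in rules.items():
        failures = [i for i, row in enumerate(rows) if not predicate(row)]
        report[name] = failures
    return report

customers = [
    {"email": "a@example.com", "age": 34},
    {"email": "", "age": 29},
    {"email": "c@example.com", "age": -5},
]
rules = {
    "email_present": lambda r: bool(r["email"]),
    "age_in_range": lambda r: 0 <= r["age"] <= 120,
}
report = profile(customers, rules)
# report -> {"email_present": [1], "age_in_range": [2]}
```

Running such checks on every data delivery, and alerting when a rule starts failing, is the essence of what commercial data observability platforms automate at scale.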
4. Address Data Privacy and Security Risks
As data becomes central to AI systems, protecting it becomes paramount. Organizations should implement robust data access controls and encryption. Data anonymization and pseudonymization techniques should be applied where appropriate. Privacy-preserving AI techniques like federated learning can enable AI development while protecting sensitive data. Organizations must also ensure compliance with relevant data protection regulations in their jurisdictions.
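Pseudonymization, for instance, can be sketched with a keyed hash: identifiers are replaced with stable tokens that still support joins across datasets but cannot be reversed without the key. A minimal illustration (key handling is simplified here; in practice the key lives in a secrets manager and is rotated):

```python
import hashlib
import hmac

def pseudonymize(value, secret_key):
    """Replace an identifier with a stable pseudonym using a keyed hash (HMAC-SHA256).
    The same input always maps to the same token, so joins still work,
    but the original value cannot be recovered without the key."""
    mac = hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256)
    return mac.hexdigest()[:16]

key = b"rotate-me-regularly"  # illustrative only; store real keys in a secrets manager
token = pseudonymize("jane.doe@example.com", key)
assert token == pseudonymize("jane.doe@example.com", key)  # stable
assert token != pseudonymize("john.doe@example.com", key)  # distinct
```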
5. Mitigate Data Bias and Discrimination
Identifying and mitigating bias in data is critical for responsible AI. Organizations should conduct thorough analyses of training data to detect potential biases related to protected attributes like race, gender, age, etc. Techniques like reweighting, resampling, or augmenting datasets can help address imbalances. Ongoing monitoring of AI system outputs is necessary to detect emergent biases. Cross-functional teams, including ethicists and domain experts, should be involved in bias mitigation efforts.
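Reweighting, one of the techniques mentioned above, can be illustrated with inverse-frequency weights that give under-represented groups equal total influence during training. The labels below are hypothetical:

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Per-sample weights inversely proportional to group frequency,
    so each group carries equal total weight in training."""
    counts = Counter(labels)
    n_groups = len(counts)
    total = len(labels)
    return [total / (n_groups * counts[lbl]) for lbl in labels]

labels = ["approved"] * 8 + ["denied"] * 2
weights = inverse_frequency_weights(labels)
# Each "approved" sample: 10 / (2 * 8) = 0.625; each "denied": 10 / (2 * 2) = 2.5
assert abs(sum(weights) - len(labels)) < 1e-9  # total weight is preserved
```

These weights would typically be passed to a training loop or a library’s `sample_weight`-style parameter; the same idea underlies resampling, which duplicates or subsamples rows instead of weighting them.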
By taking a systematic approach to data acquisition, management, quality, privacy, and fairness, organizations can build a strong foundation for data-centric AI. This data-first mindset, supported by robust processes and governance, enables more effective and responsible AI development.
Tools and Technologies for Data-Centric AI
Overview of Tools
Various tools and technologies have emerged to support the data-centric AI paradigm. These tools help organizations systematically engineer their data to improve AI model performance. Some key categories include:
- Data labeling platforms: Tools like Labelbox, Scale AI, and Appen provide interfaces for efficiently labeling training data. They often incorporate active learning to prioritize the most valuable samples for labeling. Many offer features for quality control and managing large labeling workforces.
- Data augmentation tools: Libraries such as Albumentations and imgaug allow users to automatically generate additional training samples through techniques like cropping, flipping, and color adjustments. More advanced tools leverage generative AI to create synthetic data.
- Data quality and observability platforms: Platforms like Anomalo, Bigeye, and Monte Carlo help validate data quality by allowing users to define expectations about their data and automatically check if they are met. This helps catch data issues early in the AI development process.
- Data catalogs: Tools like Alation provide centralized repositories for documenting and organizing an organization’s data assets. They typically include features for data discovery, lineage tracking, and collaborative metadata management. Data catalogs are crucial for helping data scientists and engineers find relevant datasets for AI projects.
- Data governance platforms: Solutions like Alation provide capabilities for defining and enforcing data policies, managing data access controls, and monitoring data usage across an organization. These platforms help ensure data is used responsibly and compliantly in AI systems.
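The core idea behind the flip-style augmentation mentioned above fits in a few lines of plain Python (libraries such as Albumentations wrap many such transforms with richer APIs; no real library call is assumed here):

```python
def hflip(image):
    """Horizontal flip of an image stored as rows of pixel values."""
    return [list(reversed(row)) for row in image]

def augment(images):
    """Double a tiny image dataset by adding flipped copies."""
    return images + [hflip(img) for img in images]

img = [[1, 2, 3],
       [4, 5, 6]]
batch = augment([img])
# batch[1] -> [[3, 2, 1], [6, 5, 4]]
```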
Integration with Existing Systems
Integrating data-centric AI tools into existing business processes and IT infrastructure requires careful planning but can yield significant benefits. Some key considerations include:
- API-driven integration: Many modern data tools offer APIs to integrate them into existing data pipelines and workflows. For example, a data labeling platform could be connected to a data lake to automatically pull in new samples for labeling as they arrive.
- Metadata standards: Adopting common metadata standards (e.g., Dublin Core, schema.org) across tools can facilitate interoperability. This allows metadata from a data catalog to be easily consumed by other systems like data governance platforms.
- Data catalogs: Data catalogs can provide a unified view of data across disparate systems without physically moving it. This allows data-centric AI tools to access data from legacy systems without disrupting existing processes.
- MLOps integration: Data-centric AI tools should be integrated into broader MLOps processes and platforms to ensure data work is coordinated with model development and deployment activities.
By thoughtfully integrating data-centric AI tools, organizations can create a cohesive ecosystem that supports the entire AI lifecycle – from data preparation to model deployment and monitoring.
Practical Advice and Next Steps
- Shift focus to data quality and engineering: Prioritize systematic data improvement over model tweaking. Implement processes to continuously refine existing data and acquire high-quality new data to enhance AI model performance.
- Invest in data-centric AI tools and capabilities: Develop or acquire tools for data labeling, synthetic data generation, data quality measurement, data cataloging, and data governance. Build internal capabilities around data engineering and domain-specific data work to support your AI initiatives.
- Establish data governance for AI: Implement robust data governance practices tailored for AI development, including data versioning, quality monitoring, and cross-organizational data sharing standards. This will ensure high-quality, compliant data inputs for your AI systems.
Summary
- Data-centric AI is a complementary strategy that emphasizes systematically improving data quality and quantity to enhance AI system performance. It complements the traditional model-centric approach by focusing on refining existing data and extending datasets through targeted data acquisition.
- Critical aspects of data-centric AI include improving data quality at the instance and dataset levels, acquiring new data to address gaps, and leveraging domain knowledge in data work. This approach is particularly valuable for specialized domains with limited data availability.
- Data cataloging, data governance, and AI governance are essential technologies that will shape the future of data management. These tools ensure data quality, compliance, and effective AI utilization.
If you enjoyed this article, please like it, highlight interesting sections, and share comments. Consider following me on Medium and LinkedIn.
Please consider supporting TinyTechGuides by purchasing any of the following books
- Generative AI Business Applications: An Executive Guide with Real-Life Examples and Case Studies
- The CIO’s Guide to Adopting Generative AI: Five Keys to Success
- Mastering the Modern Data Stack
- Modern B2B Marketing: A Practitioner’s Guide for Marketing Excellence
- Artificial Intelligence: An Executive Guide to Make AI Work for Your Business
[1] “Data-Centric AI.” n.d. LandingAI. Accessed June 23, 2024. https://landing.ai/data-centric-ai.
[2] “Data-Centric AI.” n.d. Gartner.com. Gartner, Inc. Accessed June 23, 2024. https://www.gartner.com/en/information-technology/glossary/data-centric-ai.
[3] “Data-Centric AI.” n.d. LandingAI. Accessed June 23, 2024. https://landing.ai/data-centric-ai.
[4] “Data-Centric AI.” n.d. LandingAI. Accessed June 23, 2024. https://landing.ai/data-centric-ai.
[5] Sambasivan, Nithya, Shivani Kapania, Hannah Highfill, Diana Akrong, Praveen Paritosh, and Lora Aroyo. 2021. “‘Everyone Wants to Do the Model Work, Not the Data Work’: Data Cascades in High-Stakes AI.” https://doi.org/10.1145/3411764.3445518.
[6] Popick, Stephen. 2022. “Did Minority Applicants Experience Worse Lending Outcomes in the Mortgage Market? A Study Using 2020 Expanded HMDA Data.” SSRN Electronic Journal. https://doi.org/10.2139/ssrn.4131603.