Cloudera Blog

From Hybrid by Accident to Hybrid by Design: Mastering Data Sovereignty and AI Cost Control

Kierstan Williams — Thu, 18 Jun 2026 13:00:00 UTC

No enterprise intentionally builds a chaotic tech landscape. It usually sneaks up through acquisitions, teams buying their own tools, and disconnected, often partial, cloud migrations. The result is a "hybrid by accident" architecture: an IT environment built on years of reactive choices rather than intentional actions. Worse, nobody has a real plan to fix it.

As enterprise AI moves from experimentation into production, accidental hybrid is more than a technical inconvenience; it’s a strategic liability. To maintain data sovereignty and avoid skyrocketing AI costs, organizations need to embrace a hybrid-by-design architecture. Those that make this shift deliberately will unlock the full value of their data assets. Those that don't will find their architectural debt compounding with every passing year and every new AI initiative.

That was the topic at hand in my recent conversation with guest speaker Noel Yuhanna, VP Principal Analyst at Forrester, as part of the Cloudera webinar "Welcome to the Era of Hybrid by Design."

In this blog, I'm expanding on that conversation and outlining how organizations can create a hybrid-by-design architecture by leveraging unified governance, open standards, and a clear map of AI lifecycle requirements.

How Enterprises End Up Here

Beyond M&A activity, siloed tool adoption, and lines of business locking in preferred vendors, the rapid push to the cloud over the past decade is a major culprit. Many organizations migrated fast and broadly, and now the pendulum is swinging back. Repatriation is a real and growing conversation.

The numbers tell the story: CIO-reported plans to repatriate workloads back on-premises rose from 43% in 2020 to 83% in 2024. This isn't a rejection of the cloud; it's a maturing recognition that not every workload belongs there. In fact, as Yuhanna points out, roughly 80% of transactional processing in banking and healthcare still runs on-premises today. The early "cloud is cheap" misconception has given way to hard questions about architecture optimization, overprovisioning, and egress costs that erode the value proposition.

The Regulatory Pressure Underneath It All

Compliance is forcing immediate action on what used to be a slow-moving IT problem. Regulations like GDPR, the EU Data Act, and HIPAA demand strict data sovereignty. Meanwhile, the US CLOUD Act, which allows US authorities to access data globally, is colliding with EU and APAC privacy rules, actively driving enterprises toward sovereign, non-US cloud providers. In the financial sector, DORA is mandating vendor exit strategies because relying too heavily on a single cloud is now a systemic risk.

With new AI regulations demanding strict traceability, this pressure will only increase in 2026 and beyond. Data and AI governance are merging into one massive compliance hurdle; companies without the right architecture face a painful, expensive retrofit.

Why AI Makes This Urgent Now

Enterprise AI converts a chronic infrastructure problem into an acute one. The AI lifecycle has vastly different needs at each stage: training and contextualization demand bursty, large-scale compute, which may make them better suited to the cloud, while steady-state inference is often more economical on-premises. "Intentional hybrid" means mapping each stage to the infrastructure that actually fits it, rather than defaulting to a single environment and absorbing the penalties.

Data gravity complicates this further. AI requires massive volumes of distributed data, and moving it across environments carries real latency and egress costs. You are forced into a corner: constrain models to a limited dataset (sacrificing quality) or absorb massive fees to centralize the data (destroying the business case).

Agentic AI sharpens this challenge considerably. Because these systems require real-time, trusted data to take action, batch-lagged pipelines simply won’t survive. As Yuhanna notes, agentic AI adoption currently sits around 24% and is expected to double by the end of 2026. The organizations building a proactive architecture for that reality today will capture the value tomorrow.

The Case for Open Standards

Vendor lock-in isn't just a theoretical risk; it’s an active cost at both the infrastructure and software layers. When your proprietary data tools only run on a specific cloud, you face a compounding "double lock-in." This hands all leverage to the vendor, creating a severe bottleneck the moment a workload needs to move, whether you are shifting a finished cloud pilot to run on spare data center capacity, or migrating to a new sovereign cloud to meet compliance. Organizations reclaim that leverage through workload portability and open standards.

Two defining standards make this possible:

Kubernetes acts as a universal abstraction layer for your underlying infrastructure. By providing a consistent cloud-native operational model regardless of what hardware or cloud provider sits underneath, it eliminates the "platform-hopping tax"—the re-engineering overhead that accumulates every time a workload crosses an infrastructure boundary.

Apache Iceberg does the equivalent job at the data layer. It isn't just about abstracting where your data lives; it's about expanding who can access it. The open table format and the Iceberg REST catalog allow organizations to share data in-place with any third-party system. This means you can leave your governed data exactly where it is, while allowing external analytics platforms to query it directly. By completely decoupling data from vendor-specific compute engines, organizations gain genuine, future-proof flexibility in how and where they run AI.

Consider what scale actually looks like in practice. Yuhanna recently encountered a customer connecting 50,000 databases across 1,000 disparate source systems. At that magnitude, complexity doesn't grow linearly; it compounds. Open standards aren't a nice-to-have; they're how enterprises stay in control of their own environments.

The Governance Gap and What It Costs

Fragmented infrastructure reliably produces fragmented governance. As Yuhanna highlights, roughly 70% of enterprise data lacks proper metadata and cataloging, and only 25% is actually used for analytics, leaving most enterprise data completely untouched. In 2006, British mathematician and data science pioneer Clive Humby famously coined the phrase "data is the new oil," noting that raw data must be refined by AI and analytics to drive real value. If every piece of data contains potential insight, why would you tolerate an architecture that actively prevents you from using all of it?

The security implications are just as concrete. According to IBM’s 2025 Data Breach Report, multi-environment breaches average over $5 million (well above the $4.44 million global average) and now account for roughly 30% of all incidents. The reason is simple: breaches happen at integration points, and every environment boundary is an integration point.

The answer is a unified policy layer: a single, federated control plane spanning classification, access control, lineage, auditing, and compliance. In this model, policies follow the data, applying consistently and in real time across the entire ecosystem.
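To make "policies follow the data" concrete, here is a minimal sketch of one policy table keyed by data classification rather than by environment. The class names, roles, and classification labels are hypothetical and chosen only for illustration; they do not describe any particular product's API.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Dataset:
    name: str
    environment: str      # e.g. "on-prem", "public-cloud", "sovereign-cloud"
    classification: str   # e.g. "public", "internal", "pii"

# One policy set for every environment (hypothetical roles and labels).
POLICIES = {
    "public": frozenset({"analyst", "engineer", "auditor"}),
    "internal": frozenset({"engineer", "auditor"}),
    "pii": frozenset({"auditor"}),
}

AUDIT_LOG = []

def is_allowed(role: str, dataset: Dataset) -> bool:
    """Decide access from the data's classification alone, and record the decision."""
    allowed_roles = POLICIES.get(dataset.classification, frozenset())  # unknown label: deny
    allowed = role in allowed_roles
    AUDIT_LOG.append((role, dataset.name, dataset.environment, allowed))
    return allowed

claims_onprem = Dataset("claims", "on-prem", "pii")
claims_cloud = Dataset("claims_replica", "public-cloud", "pii")
```

Because the decision never reads `environment`, a replica of the same data in another cloud gets the same answer, and every decision lands in a single audit trail.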

Where to Start

Driven by the real-world demands of AI in production, tightening data sovereignty requirements, and a sharper focus on what infrastructure actually costs, organizations need to move from hybrid-by-accident to hybrid-by-design architectures. Here’s how to get started:

Establish clarity of purpose. Before touching any technology, build an 18-month roadmap anchored to concrete business outcomes, whether that is revenue growth, cost optimization, or resilience targets.

Conduct a data gravity audit. Map out where data actually lives, who accesses it, and your latency and egress exposure. This reliably surfaces forgotten workloads, duplicate data, and compliance blind spots.

Execute deliberate rationalization. Streamline overlapping tools, consolidate vendor relationships, standardize governance, and build for workload portability.
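A data gravity audit can start as a simple inventory. The sketch below totals monthly cross-environment transfer cost per dataset. The inventory rows and the per-GB rate are placeholders rather than real pricing, and treating every cross-environment transfer as billable is a deliberate simplification.

```python
from dataclasses import dataclass

# Placeholder rate for illustration only; substitute your providers' actual pricing.
ASSUMED_USD_PER_GB = 0.09

@dataclass(frozen=True)
class DataFlow:
    dataset: str
    home_env: str        # where the data lives
    consumer_env: str    # where it is read from
    gb_per_month: float

def egress_exposure(flows, usd_per_gb=ASSUMED_USD_PER_GB):
    """Sum monthly cross-environment transfer cost per dataset, largest first."""
    exposure = {}
    for flow in flows:
        if flow.home_env == flow.consumer_env:
            continue  # data consumed where it lives is not counted
        cost = flow.gb_per_month * usd_per_gb
        exposure[flow.dataset] = exposure.get(flow.dataset, 0.0) + cost
    return dict(sorted(exposure.items(), key=lambda kv: kv[1], reverse=True))

# Hypothetical inventory rows.
FLOWS = [
    DataFlow("orders", "cloud-a", "on-prem", 2000),
    DataFlow("orders", "cloud-a", "cloud-a", 500),
    DataFlow("sensor_logs", "on-prem", "cloud-a", 10000),
]

for dataset, usd in egress_exposure(FLOWS).items():
    print(f"{dataset}: ~${usd:,.0f}/month")
```

Sorting by exposure puts the datasets with the strongest gravity at the top, which is usually where the rationalization conversation should begin.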

To learn more, replay my conversation with Noel Yuhanna and dive deeper with the “From Chaos to Control: Why ‘Hybrid by Design’ Is the Future of Enterprise Data Strategy” Industry Trend Report.

AI Readiness Depends on Real-Time Data

Katie Gdula — Wed, 17 Jun 2026 13:00:00 UTC

What You Need for Real-Time AI

To succeed in an AI-first world, modern enterprises must transition to a data pipeline architecture built on three core pillars:

Non-stop data movement: Pipelines must continuously ingest and process structured and unstructured data at massive scale to ensure AI agents always operate on fresh information.
Context and traceability: These systems must be deeply embedded within the broader context of modern AI infrastructure, and need to plug directly into your existing AI and software tools.
Flexibility and scalability: Enterprises need out-of-the-box processors that securely connect any data source to any destination.

By mastering these three capabilities, organizations can eliminate data silos and build the agile, automated foundation to power next-generation AI agents.

The Unique Cloudera Solution: Total Control from the Edge to AI

Cloudera is the only enterprise-grade solution for delivering trusted and governed Edge to AI workflows at scale across hybrid environments.

With an end-to-end solution that securely delivers real-time data to your AI applications, Cloudera ingests, processes, and transforms data from anywhere, including edge devices, and delivers it to any destination. This unique solution addresses the challenges surrounding real-time AI, accelerating the development of enterprise AI and agentic AI.

Cloudera Data in Motion consists of four key components: Cloudera Edge Management, Cloudera Data Flow, Cloudera Streaming Analytics, and Cloudera Streams Messaging.

The Four Offerings of Cloudera Data in Motion

Cloudera Edge Management

The real-time data lifecycle begins with information collected from edge devices. These may be sensors, cameras, phones, or other IoT devices. Cloudera Edge Management provides edge device data collection and processing with easy-to-use central command and control.

Cloudera Data Flow

The data lifecycle continues with Cloudera Data Flow, a management hub for data that’s in transit. With hybrid deployments, automated scaling, granular security controls, and single-pane-of-glass observability, Cloudera Data Flow empowers engineering teams to focus entirely on building high-value data pipelines rather than wrestling with complex systems administration.

Cloudera Streaming Analytics

Real-time AI requires processing and analytics to happen in real time. This is made possible with Cloudera Streaming Analytics, which offers a framework for real-time stream processing and streaming analytics that reduces management complexity and costs for real-time business activities.

Cloudera Streams Messaging

Finally, a messaging backbone is needed for ensuring continuous, high-speed data flow without bottlenecks. Cloudera Streams Messaging provides a robust, scalable, and secure event-driven architecture for real-time applications. This acts as a reliable data transport layer for the most demanding development workloads.

Consistent and Unified Security and Governance

Cloudera ensures more successful real-time AI projects with secure and trusted data. As a key part of the Cloudera portfolio, Cloudera Data in Motion seamlessly integrates the security and governance policies across the data lifecycle. The result is persistent context for real-time data across all analytics on any infrastructure, covering cloud, data center, and edge.

Learn more

How Cloudera Data in Motion Accelerates Your AI Journey

Today’s businesses are racing to deploy real-time AI and Agentic workflows—autonomous AI agents capable of making split-second business decisions, automated customer service adjustments, and live fraud assessments. For these systems to work, they cannot rely on what happened yesterday, or even an hour ago. They need to know what is happening right now.

Cloudera Data in Motion delivers an optimized flow and streaming solution for data ingestion, processing, and analysis, ensuring that enterprises have the freshest and most reliable data for their real-time AI applications.

#ClouderaLife Employee Spotlight: Meet Donna Beasley, Chief AI Enablement Officer

Debbie Kruger — Tue, 16 Jun 2026 13:00:00 UTC

Making Cloudera an AI-first company starts by applying the same enterprise AI principles internally that the company delivers to customers around the world. That mission sits at the center of Donna Beasley's role as Cloudera's first-ever Chief AI Enablement Officer.

"My job is to make Cloudera an AI-first company from the inside out. Not in a slogan way, but in a 'what are people actually using on Monday morning' way."

It's a role that blends technology, strategy, governance, and change management. As Chief AI Enablement Officer, Donna helps employees across the company use AI effectively while balancing innovation with responsible use. She also leads Cloudera's AI Steering Committee, which guides investment and adoption decisions. Since its start in 2024, the committee has reviewed 145 AI tools, with 64% receiving a path to approval.

Behind all of it is one critical piece of the job.

"Most of the role is translation," Donna explains. "Engineering speaks one language, executives speak another, security and legal speak a third. The interesting work happens when you can get all three in a room and turn a good theory into something people can actually use."

Let's get to know Donna and learn how she's helping teams across Cloudera turn AI's potential into everyday impact.

Building Credibility Through Action

Being the first person in a new role meant there was no roadmap to follow. For Donna, that meant focusing on execution before strategy.

"I spent my first couple of months teaching people how to use the AI tools that had already been approved, not presenting strategy decks," she says. "When people can see something running, the conversation shifts from 'What should we do about AI?' to 'Here's what I'd change about this thing that already exists.'"

That hands-on approach is rooted in a career that spans sales engineering, product management, support, professional services, operations, customer success, and marketing. Those experiences taught Donna that every team experiences technology and change differently.

"Seeing the same company from that many angles is a privilege most leaders don't get," she says. "Every team is convinced they're the one carrying the company, and they're each usually a little bit right."

Today, that mentality continues to shape her leadership style: "The distance between a leader and the actual work is where bad decisions hide."

That same belief in staying connected to people and understanding their experiences extends beyond her work in AI enablement. It carries into one of the roles she finds most meaningful at Cloudera.

Creating Community Through ERGs

As executive sponsor of Cloudera's LGBTQ+ Employee Resource Group, Donna has seen firsthand how ERGs help strengthen both company culture and employee growth.

"ERGs are one of the few places where culture actually gets practiced, not declared."

She views them as spaces where employees build community and find mentorship that can shape the trajectory of their careers. As someone in the LGBTQ+ community, she understands how powerful it can be to work in an environment that welcomes authenticity.

"That's how careers get unstuck. That's how the next generation of leadership at this company gets built."

The ability to be open about who you are at work is something many people still cannot take for granted. Through her work with the LGBTQ+ ERG, Donna helps create space for employees to connect and bring a wide range of perspectives into the business.

Just as importantly, Donna sees the role as an opportunity to ensure the company's actions continue to align with its values as the organization grows. For her, creating opportunities for employees to feel represented and empowered is one of the most meaningful aspects of leadership.

Life Beyond Work

A few years ago, Donna faced a life-changing cancer diagnosis and recovery journey. The experience gave her a renewed perspective on how she wants to invest her time and energy.

Today, she and her wife call rural Maine home, where life moves at a different pace. Weekends are often spent on the water before sunrise, surrounded by the natural beauty that drew them there in the first place. From wildlife and astrophotography to FPV drone flying, Donna finds joy in activities that encourage patience and perspective.

"The work is intense and I love it," she says. "But the reason I can show up and do it well, year after year, is that I've built a life around it that is the opposite of the work: quiet, slow, full of things that don't get faster no matter how much you push them."

Closing Thoughts

When reflecting on what makes Cloudera unique, Donna points to a culture that values substance and authenticity.

"Cloudera attracts people who are wired the same way: they take the work seriously and themselves a little less so. There's a respect for technical depth here I haven't always found at other places. And a real impatience with anything that smells like theater instead of substance."

Her perspective captures the spirit of Cloudera: a place where talented people come together to tackle meaningful challenges, support one another, and continuously push the boundaries of what's possible.

To hear more about Donna's journey and her vision for AI enablement at Cloudera, check out her appearance on The AI Forecast podcast here. You can also meet more inspiring Clouderans here.

The New Bottleneck for Enterprise AI Sits Inside the Document

Sid Manchkanti, Abhas Ricky — Tue, 09 Jun 2026 13:00:00 UTC

For most of the last two years, enterprise AI conversations started with the model. Organizations debated which foundation model to use, how to fine-tune it, and which orchestration framework would deliver the best results.

That conversation is changing. As foundation models become more capable, accessible, and interchangeable across providers, many organizations are discovering that model performance is no longer the primary constraint on AI outcomes. Instead, the bottleneck has moved earlier in the pipeline: into the document layer that feeds AI systems in the first place.

In conversations with CIOs and CDOs across financial services, healthcare, and telecommunications, the same observation comes up repeatedly. The challenge is no longer how the model reasons. It’s what the model is reasoning over.

The model is no longer the bottleneck for enterprise AI inference. The document understanding layer is. That sentiment reflects a structural shift in where value and risk now sit in the enterprise AI stack.

The Real Problem: Unstructured Data and Document Intelligence

In regulated enterprise environments, most critical data does not live in clean, structured warehouse tables. It lives in unstructured formats: PDFs, scanned filings, claims schedules, contracts, financial statements, lab reports, and rate exhibits. This is the data that feeds AI systems.

Document intelligence refers to the process of converting this unstructured data into structured, usable inputs for AI models. When this step fails, the consequences ripple through the entire AI pipeline.

The failure mode is often deceptively simple. A misread table or merged cell creates a malformed extraction. That extraction produces flawed embeddings, which return the wrong context during retrieval. The model then generates a confident answer from a confidently wrong input.

At that point, even the most advanced model cannot compensate for the underlying error. Improvements to model performance do not correct a structural issue that occurred before the model ever ran. In practice, the quality of the document pipeline often determines the AI system's accuracy ceiling.
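One way to catch this failure mode before it reaches the embedding step is a structural gate on every extracted table. The sketch below is a minimal illustration, not a full parser check: the function names are hypothetical, and it assumes the expected column count is known and that empty cells are always an error, which will not hold for every document type.

```python
def table_shape_issues(rows, expected_columns):
    """Return human-readable structural problems found in an extracted table."""
    issues = []
    for i, row in enumerate(rows):
        if len(row) != expected_columns:
            issues.append(f"row {i}: expected {expected_columns} cells, got {len(row)}")
        if any(cell is None or str(cell).strip() == "" for cell in row):
            issues.append(f"row {i}: empty cell")
    return issues

def gate_for_embedding(rows, expected_columns):
    """Pass a table downstream only when no structural issue is found."""
    issues = table_shape_issues(rows, expected_columns)
    return (len(issues) == 0, issues)

# A clean extraction, and one where two cells were merged into a single string.
clean = [["Fund", "NAV", "Date"], ["A", "10.5", "2026-01-31"]]
merged = [["Fund", "NAV", "Date"], ["A 10.5", "2026-01-31"]]

print(gate_for_embedding(clean, 3))
print(gate_for_embedding(merged, 3))
```

Rejecting the malformed table here is far cheaper than tracing a confidently wrong answer back through retrieval and embeddings later.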

When Parsing Errors Become Business Risks

The impacts of poor document parsing show up directly in business outcomes. They are concrete, measurable, and largely underappreciated at the executive level. This is what makes document intelligence more than a technical challenge. For many organizations, it has become an operational and financial one.

Parsing errors are rarely visible when they occur. Instead, they compound downstream through workflows, decisions, and business processes. By the time the issue surfaces, the cost of remediation is often far greater than the cost of prevention.

Financial Services

In financial services, a single misread value in a fund administrator’s capital account statement can cascade into downstream errors in underwriting or reserving models. These errors can carry regulatory implications and lead to costly remediation efforts that often run into the millions.

Healthcare

Healthcare organizations continue to rely heavily on manual document abstraction for claims, remittance advice, and clinical documentation. This is driven in part by the structural complexity of the data, and in part by strict requirements around protected health information.

Manual document abstraction is consistently one of the largest line items in health data operations budgets.

Telecommunications

In telecom, vendor interconnect billing and service-level agreements (SLAs) often contain complex rate tables that few systems read accurately at scale. Even small inaccuracies can translate into hundreds of millions of dollars of leakage at carrier scale.

The pattern across these industries is the same. Inaccurate document understanding is not a technical inconvenience. It is a P&L problem that quietly compounds.

The Trade-Off: Accuracy vs. Data Sovereignty

Layered on top of the accuracy problem is a second constraint specific to regulated enterprises: where AI processing happens.

Over the past several years, most of the enterprise AI inference stack has steadily moved into controlled customer environments: virtual private clouds (VPCs), on-premises infrastructure, or sovereign cloud regions. Models, vector stores, orchestration layers, and observability now operate within the same governance controls as the underlying data. Document parsing has been the forced exception.

Historically, the most accurate document processing options were delivered via SaaS APIs only, which left regulated customers choosing between accuracy and sovereignty:

Route the most sensitive documents in the enterprise out to a third-party API for higher accuracy, or
Keep data within the enterprise boundary and accept a meaningful accuracy gap on the workflows that matter most.

Compliance, legal, and risk teams have long viewed both options as compromises. As a result, many organizations have struggled to balance two equally important priorities: achieving the accuracy required for business-critical workflows while maintaining control over where sensitive data is processed.

Until recently, there was no clear path to achieving both.

A Maturing Category for Enterprise Document Intelligence

The good news is that this trade-off is beginning to close. Across the industry, organizations are applying greater rigor to how document intelligence systems are evaluated, particularly for complex tables and highly structured business documents that have historically challenged traditional parsing approaches.

At the same time, a new generation of document intelligence providers is making it possible to achieve high levels of parsing accuracy within customer-controlled environments. Recently, the team at Pulse open-sourced PulseBench-Tab, a frontier benchmark for table parsing built specifically around the kinds of documents regulated enterprises actually run on.

It contains 1,820 human-annotated tables drawn from real financial filings, government reports, corporate disclosures, and regulatory filings, spanning 9 languages and 4 scripts, many of which contain merged or spanning cells and complex structures that commonly break traditional parsing systems.

Importantly, the benchmark introduces T-LAG, a unified scoring approach that captures both text and structural accuracy. This ensures that systems are not rewarded for extracting approximate text while silently breaking the table’s shape.
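To see why scoring text alone can hide a broken table, consider the toy comparison below. It is emphatically not the T-LAG definition, which is not described here; both metrics and the sample table are invented purely to illustrate the gap between matching text and matching structure.

```python
def text_only_score(pred, truth):
    """Fraction of ground-truth cell strings found anywhere in the prediction."""
    remaining = [cell for row in pred for cell in row]
    truth_cells = [cell for row in truth for cell in row]
    hits = 0
    for cell in truth_cells:
        if cell in remaining:
            remaining.remove(cell)  # each predicted cell can match only once
            hits += 1
    return hits / len(truth_cells)

def position_aware_score(pred, truth):
    """Fraction of ground-truth cells matched at the same (row, column)."""
    total = sum(len(row) for row in truth)
    hits = 0
    for r, row in enumerate(truth):
        for c, cell in enumerate(row):
            if r < len(pred) and c < len(pred[r]) and pred[r][c] == cell:
                hits += 1
    return hits / total

truth = [["Region", "Q1", "Q2"], ["EMEA", "10", "12"], ["APAC", "7", "9"]]
# Every string was extracted, but the two value columns were swapped.
swapped = [["Region", "Q1", "Q2"], ["EMEA", "12", "10"], ["APAC", "9", "7"]]

print(text_only_score(swapped, truth))       # every string is present
print(position_aware_score(swapped, truth))  # only 5 of 9 cells are in place
```

The swapped table contains every correct string, so a text-only metric rates it perfect, while any downstream agent reading Q1 for EMEA would get the wrong number.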

Results from this benchmark show that frontier-level accuracy in document parsing is now achievable without a third-party SaaS endpoint, bringing a new level of reliability to enterprise AI pipelines.

Nine providers were evaluated independently and in the open, and the methodology benefited from academic contributions from members of S&P Global’s Enterprise Data Organization. On that benchmark, Pulse delivered a T-LAG score of 0.9347 with full coverage across all 1,820 samples, materially ahead of the next closest provider at 0.8155.

Bringing Document Intelligence Inside the Enterprise AI Stack

This progress unlocks a new architecture for enterprise AI, one where document intelligence runs inside the same operational and governance boundary as the rest of the AI stack rather than at an external endpoint.

Combined with an AI-powered lakehouse architecture, this creates a more unified approach to managing structured and unstructured data, with consistent security, lineage, observability, and governance controls from ingestion through inference.

Solutions such as Pulse demonstrate what this architecture can look like in practice, enabling organizations to parse and structure complex documents without requiring sensitive data to leave the enterprise environment.

The result is a fully integrated pipeline that can:

Parse unstructured documents
Convert them into structured data
Embed and retrieve relevant context
Generate outputs using AI models

All within the same controlled environment.
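The four stages above can be sketched as a single in-process pipeline. Everything here is a stand-in: the parser only handles "key: value" lines, the "embedding" is a bag of words, and `generate` is a placeholder for a model call. The point is only that each stage hands its output to the next without any data leaving the process.

```python
import math
from collections import Counter

def parse(document: str):
    """Split a raw document into non-empty lines (stand-in for a real parser)."""
    return [line.strip() for line in document.splitlines() if line.strip()]

def structure(lines):
    """Turn 'key: value' lines into records."""
    records = []
    for line in lines:
        key, sep, value = line.partition(":")
        if sep:
            records.append({"field": key.strip(), "value": value.strip()})
    return records

def embed(text: str) -> Counter:
    """Toy bag-of-words vector; a real system would use an embedding model."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(question: str, records):
    """Return the record most similar to the question (records must be non-empty)."""
    q = embed(question)
    return max(records, key=lambda r: cosine(q, embed(r["field"] + " " + r["value"])))

def generate(question: str, record) -> str:
    """Placeholder for a model call: answer strictly from the retrieved context."""
    return f"{record['field']}: {record['value']}"

def answer(document: str, question: str) -> str:
    records = structure(parse(document))
    return generate(question, retrieve(question, records))
```

Swapping any stand-in for a production component changes the quality of the answer, but not the shape of the pipeline or the boundary it runs in.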

For the CIO, that means a single governance boundary across the AI workflow rather than a patchwork of disconnected environments to secure, audit, and manage.

For the CFO, it can shift document processing from a recurring external service cost to an internal capability built on infrastructure that already supports broader AI and data initiatives.

More importantly, it changes where organizations should focus their investments. As models become increasingly accessible, competitive advantage is shifting toward the quality, governance, and reliability of the data pipeline that powers them.

What This Unlocks for Regulated Industries

For executives setting AI strategy in regulated industries, improvements in document intelligence create a visible impact on operating metrics.

Financial Services

Financial services teams can keep 10-K analysis and filings, fund administration records, bordereaux processing, claims schedules, and actuarial reports entirely within their governed environment, with structural accuracy high enough that downstream agents can be trusted with the output, reducing pressure and time spent on human review cycles.

Headcount that was previously dedicated to manual reconciliation can be redirected to higher-value analytical work.

Healthcare

Healthcare organizations can bring document-heavy workflows like clinical trial data extraction, lab panel ingestion, and explanation of benefits (EOB) processing into the same environment as their structured PHI and automate them there. This materially reduces one of the largest line items in health data operations while accelerating revenue cycle times and clinical research workflows.

Telecom

Telecom operators gain the ability to accurately interpret interconnect agreements and billing structures at the level of detail required to recover the revenue leakage that has historically been buried inside complex rate tables.

In each case, improved document intelligence directly translates into measurable business value.

The Future: The Sovereign AI Stack

The center of gravity in enterprise AI is shifting. As models continue to converge in capability, the durable competitive advantage is moving one layer down, into the data-to-inference pipeline: specifically, how effectively organizations can process and govern unstructured data.

Document intelligence now sets the ceiling for accuracy and ROI. At the same time, data sovereignty is non-negotiable in regulated industries: AI must run where the data lives. This is where Cloudera’s AI and data anywhere vision applies: deploy AI across hybrid and multi-cloud environments, keep data in place, and enforce consistent governance.

Combined with Pulse, the regulated enterprise has a path to AI Native that protects accuracy, control, and the underlying ROI of every workflow built on top. That is the sovereign AI stack our customers have been asking for, and it is now within reach.

From Data Chaos to AI Confidence: Getting Enterprises Production-Ready

Suri Nuthalapati — Thu, 04 Jun 2026 13:00:00 UTC

Image: Most organizations are stuck between blocked AI and limited pilots and are far from production-ready.

Image: Cloudera's Data Readiness Index 2026—global insights from 1,270 IT leaders across AMER, EMEA, and APAC.

This isn't just an uncomfortable statistic. This is an operational risk. Teams that believe their data is accurate and complete will build AI applications on top of it and discover the quality problems only when those applications produce outputs that are wrong, biased, or indefensible. By that point, the damage to organizational trust in AI can take a long time to repair.

The first step toward closing this gap is replacing confidence with evidence: a scored, objective baseline across all the dimensions of AI readiness, built from actual assessment of the data estate, governance policies, and infrastructure.

What Becomes Possible on the Other Side

When enterprises establish a true AI-ready foundation, the unlocked use cases are significant and multiply quickly.

These use cases range from private enterprise Q&A chatbots and agentic AI workflows, to KYC automation, real-time retail insights, AI-driven document governance, intelligent operations, and portfolio analytics. A governed, unified, production-grade data estate is what separates organizations running isolated pilots from those deploying AI at scale.

The engagement covers the full scope of AI readiness: use case alignment and priority dataset identification, data quality and metadata assessment, retrieval and vector store readiness, governance and PII risk evaluation, platform compute and inference assessment, and an executive roadmap with prioritized remediation actions. For organizations ready to build, a custom implementation track takes those findings into production by deploying end-to-end data pipelines, governance and security configuration, applications, and the AI inference platform itself.

Image: Ways Cloudera PS&T helps you get there—assess where you stand, then build for production.

These aren't model or algorithm problems; they’re data readiness problems. The organizations that solve them are creating a durable competitive advantage that can't easily be replicated because it's built on data and processes that are uniquely theirs.

What Data Readiness for AI Actually Means

The term "data readiness" has very specific dimensions. Missing any one of them can derail an AI program. AI readiness is the ability to deliver accurate, trusted, and real-time AI outcomes using enterprise proprietary data. That means achieving readiness across six interconnected areas:

Data & Context Readiness: Do you have unified, high-quality, context-rich data that AI models can actually consume? Fragmented datasets, incomplete metadata, and poor data quality are among the most common failure points in RAG pipelines and fine-tuning workflows.

Platform Readiness: Is your infrastructure optimized for AI? Scalable compute (CPU/GPU), model inference capability, and auto-scaling across cloud and on-premises environments are foundational, not optional.

Data Access & Retrieval: Can AI systems reliably access the data they need, in real time, with minimal data movement? The ability to query across structured and unstructured stores, including vector databases, with low latency, is critical for agentic AI and real-time inference.

Data Governance & Trust: Can you trace where your AI's answers came from? Beyond compliance requirements, lineage, traceability, and auditability are what differentiate AI outputs that get acted on from those that get ignored.

Unified Security: Are role-based and fine-grained access controls consistently enforced across all your workloads, including your AI data stores? Inconsistent security posture is one of the most common blockers to enterprise AI adoption, particularly in regulated industries.

Operational Readiness: Do your data pipelines reliably deliver fresh, high-quality data to AI systems on the cadence those systems require? SLA adherence, data freshness monitoring, and inference latency management are operational disciplines that most organizations haven't yet developed.
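To make the idea of a scored baseline across these six areas concrete, here is a minimal sketch in Python. It is illustrative only: the dimension keys, the 0 to 100 scale, the unweighted average, and the threshold of 60 are all assumptions made for this example, not Cloudera's assessment methodology, and the sample scores are invented rather than taken from the Data Readiness Index.

```python
from dataclasses import dataclass

# The six readiness areas listed above, as hypothetical dictionary keys.
DIMENSIONS = (
    "data_context",
    "platform",
    "access_retrieval",
    "governance_trust",
    "unified_security",
    "operational",
)


@dataclass(frozen=True)
class ReadinessReport:
    scores: dict     # dimension -> score on an assumed 0-100 scale
    overall: float   # unweighted mean across all six dimensions
    weakest: str     # dimension with the lowest score
    gaps: list       # dimensions below the threshold, worst first


def score_readiness(scores, threshold=60):
    """Summarize per-dimension scores into a simple baseline report."""
    missing = [d for d in DIMENSIONS if d not in scores]
    if missing:
        # Every dimension must be assessed; a skipped area is itself a gap.
        raise ValueError(f"missing dimensions: {missing}")
    for d in DIMENSIONS:
        if not 0 <= scores[d] <= 100:
            raise ValueError(f"score for {d} must be between 0 and 100")

    overall = sum(scores[d] for d in DIMENSIONS) / len(DIMENSIONS)
    weakest = min(DIMENSIONS, key=lambda d: scores[d])
    gaps = sorted(
        (d for d in DIMENSIONS if scores[d] < threshold),
        key=lambda d: scores[d],
    )
    return ReadinessReport(dict(scores), overall, weakest, gaps)


# Invented example scores for a single organization.
example = score_readiness({
    "data_context": 70,
    "platform": 80,
    "access_retrieval": 50,
    "governance_trust": 30,
    "unified_security": 65,
    "operational": 55,
})
print(example.weakest, example.gaps)
```

The point of the sketch is that an average can look acceptable while a single dimension, here governance, sits far below the bar. Reporting the weakest area and the ordered gap list keeps that visible.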

Enterprises today are under enormous pressure to deliver on their AI investments. Boards are asking for return on investment (ROI). Business units are asking for production deployments. IT leaders are asking why their well-funded AI initiatives keep stalling. The answer, more often than not, has nothing to do with the models and everything to do with the data underneath the AI.

Cloudera's Data Readiness Index 2026, a survey of 1,270 IT leaders across AMER, EMEA, and APAC, puts a number on what practitioners already feel in the field: 84% of organizations feel confident in their data accuracy, yet only 18% have fully governed data. That 66-point gap is exactly where AI projects quietly fail.

Let’s explore what it takes to close that gap—and how Cloudera Professional Services & Training is helping enterprises move from fragmented, ungoverned data estates to production-ready AI foundations.

The Real Reason Your AI Initiatives are Stuck in Pilot Mode

Ask any enterprise data team what's blocking their AI roadmap, and the answers tend to cluster around the same themes: pilots that can't graduate to production, models that produce outputs nobody trusts, and infrastructure that was built for analytics workloads rather than AI inference at scale.

The Data Readiness Index confirms exactly this. 79% of organizations say their data-backed initiatives are hindered because they cannot access 100% of their data across environments. Data quality issues are the single biggest reason AI ROI falls short, cited even more often than cost overruns and weak integration. 73% say infrastructure performance has hindered operational initiatives, with nearly a third saying this is the consistent norm, not an occasional exception.

Most enterprises have made real progress on one or two of these dimensions. Very few have addressed all of them in an integrated, production-grade way. That gap is what keeps AI stuck. In short: AI Readiness = Data + Context + Governance + Platform.

The Confidence Gap Is More Dangerous Than It Looks

One of the most striking findings from the Data Readiness Index is what Cloudera calls the confidence gap. Organizations feel ready, but their governance posture tells a different story.

The business value is real. Organizations that achieve AI readiness, rather than relying on a patchwork of disconnected tools, benefit from faster time-to-insight, higher accuracy (driven by better data quality and context), built-in compliance for regulated industries, and the dramatically lower cost and complexity that come from implementing a unified data and AI platform.

How Cloudera Professional Services & Training Helps

Cloudera Professional Services & Training has delivered thousands of engagements across financial services, healthcare, telecommunications, energy, manufacturing, and the public sector. Our data and AI experts work from advisory through end-to-end implementation, all backed by deep alignment with Cloudera engineering, product, industry, and support teams.

Our Data Readiness for AI offering was built specifically to address the challenges surfaced by the Data Readiness Index. It helps enterprises move from fragmented, ungoverned data estates to production-ready AI foundations by using a structured methodology, clear deliverables at each phase, and a prioritized path forward.

Image: A strong AI-ready foundation unlocks an entire portfolio of AI-powered capabilities across the enterprise.

Image: Cloudera PS&T covers every layer—from platform foundations to production AI apps.

Image: Most organizations are stuck between blocked AI and limited pilots and are far from production-ready.

Image: The outcomes and key takeaways of an AI-ready data foundation.

If you're wondering where your organization falls on the readiness curve, the most valuable thing you can do right now is find out, with an objective, scored evaluation of your actual data estate.

Start with a Data Readiness Assessment: a structured advisory engagement that scores your data estate across all readiness dimensions and delivers a prioritized executive roadmap your team can act on.

Learn more about Cloudera Professional Services & Training

Download the Data Readiness Index 2026

Cloudera Professional Services brings capability and accelerators at every layer of the stack, from platform engineering and data engineering through AI engineering and agentic AI application development. Whether your organization needs to assess where it stands, remediate specific readiness gaps, or build a production AI pipeline from the ground up, PS&T can meet you where you are.

Get Started Today!

The Data Readiness Index makes one thing clear: the organizations that will lead in AI are not necessarily the ones with the most advanced models. They are the ones with the most trustworthy, accessible, and governed data—because that is what enterprise AI actually runs on.

Shifting Gears in Manufacturing: Predictive Maintenance with Cloudera and ServiceNow

Jeremiah Morrow — Tue, 02 Jun 2026 13:00:00 UTC

The manufacturing industry is facing significant headwinds in 2026. Global competition is driving commoditization and shrinking margins. Geopolitical instability, increasing frequency of extreme weather events, and aging infrastructure are making supply chains more volatile and vulnerable to disruption. And the global focus on sustainability is forcing every manufacturer to implement Environmental, Social, & Governance (ESG) strategies or risk losing investors, churning customers, and facing regulatory action.

While navigating all of this disruption, manufacturers must also continue advancing Industry 4.0 and 5.0 initiatives—the shift toward connected, intelligent, and increasingly human-centric industrial operations. The software-defined factory, which is at the heart of digital transformation, often feels out of reach for many manufacturers who are still trying to digitize physical assets and wrangle massive volumes of Internet of Things (IoT) data.

Predictive maintenance has the potential to address many of the challenges facing manufacturers. Reactive maintenance, fixing a machine only after it has failed, increases production costs and timelines, leads to stockouts and shipping delays, and increases capital expenditures by shortening the lifespan of equipment.

This blog post details how ServiceNow and Cloudera enable predictive maintenance for our customers, and how shifting from reactive to predictive maintenance can ensure operational continuity, reduce costs, and maximize the value of every industrial asset.

The Barriers to Moving from Reactive to Predictive Maintenance

While predictive maintenance has many potential benefits for manufacturers, several architectural and operational barriers often stall progress. To move beyond pilot projects to enterprise-scale solutions, manufacturers must address these challenges:

The massive scale of IoT data: A single machine can produce as many as a million data points every single day. Any one of these data points can indicate a potential failure. Collecting and processing all of that data is a massive challenge on its own.

Data movement tax: Traditional architectures require replicating massive datasets from the factory floor to the cloud for analysis and model training, which introduces significant costs, storage overhead, and operational risk.

Latency and fault tolerance: While most manufacturers want to achieve predictive maintenance, connectivity and latency often prevent them from moving beyond remote monitoring. Factories are often built in areas with low connectivity, and the gap between identifying a potential failure and taking action to prevent it is often too great.

Machine learning vs. AI: Most ML-based solutions successfully identify risks or detect anomalies, but stop at the insight stage. Manual procedures and human handoffs are still required to trigger a repair, slowing down the response and diminishing the ROI of predictive maintenance investments.

To overcome these obstacles, manufacturers need a solution that brings AI to the data, even in areas with low or no connectivity, and can collect, process, and analyze massive volumes of IoT data in near-real time.

Enabling Predictive Maintenance with Cloudera & ServiceNow

To solve these architectural and operational challenges, Cloudera and ServiceNow deliver a unified, closed-loop governance ecosystem. It combines Cloudera’s hybrid data lakehouse and unified data fabric with ServiceNow’s intelligent orchestration layer to give manufacturers access to all of their data for AI.

Here is how it works:

Cloudera manages data at scale. Cloudera’s open data lakehouse architecture supports the real-time ingestion, processing, and analysis of massive volumes of data from equipment sensors, as well as historical data like sales data, maintenance schedules, and diagnostics. All of this data is critical for building and training models that can accurately identify potential issues.

Bring AI to the data. Cloudera analyzes sensor data at the edge to detect anomalies in near-real time, enabling fault tolerance and low-latency alerting.

ServiceNow closes the loop. ServiceNow’s AI agents pick up the alert and take action, scheduling maintenance, ordering parts, rerouting production, and notifying logistics and supply chain teams of any potential disruptions.

Traceability and auditability. Cloudera’s unified data fabric provides end-to-end governance, security, and lineage, so every automated decision made by a ServiceNow AI agent can be traced back to the underlying data for auditability of the full data and AI lifecycle.
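The detect-then-act flow described in the steps above can be sketched in a few lines of Python. This is a generic illustration and not Cloudera or ServiceNow code: the rolling z-score technique, the class and function names, the window size, the threshold, and the alert fields are all assumptions chosen for this example. A real deployment would use the platforms' own ingestion and workflow interfaces.

```python
from collections import deque
from statistics import mean, pstdev


class RollingAnomalyDetector:
    """Flags sensor readings that deviate sharply from a recent baseline."""

    def __init__(self, window=20, z_threshold=3.0):
        self.window = deque(maxlen=window)   # most recent normal readings
        self.z_threshold = z_threshold

    def observe(self, value):
        """Return the z-score if the reading is anomalous, otherwise None."""
        anomaly = None
        if len(self.window) == self.window.maxlen:
            mu = mean(self.window)
            sigma = pstdev(self.window)
            # A perfectly flat baseline (sigma == 0) is skipped in this sketch.
            if sigma > 0:
                z = (value - mu) / sigma
                if abs(z) >= self.z_threshold:
                    anomaly = z
        if anomaly is None:
            # Only normal readings extend the baseline, so a fault
            # does not quietly become the new normal.
            self.window.append(value)
        return anomaly


def build_alert(machine_id, sensor, value, z_score):
    """Build a hypothetical alert payload for a downstream workflow system."""
    return {
        "machine_id": machine_id,
        "sensor": sensor,
        "value": value,
        "z_score": round(z_score, 2),
        "suggested_action": "schedule_maintenance",
    }


detector = RollingAnomalyDetector(window=5, z_threshold=3.0)
alerts = []
for reading in [10.0, 10.2, 9.8, 10.1, 9.9, 10.0, 15.0]:
    z = detector.observe(reading)
    if z is not None:
        alerts.append(build_alert("press-07", "vibration_mm_s", reading, z))
print(alerts)
```

Because the detector needs only a small in-memory window, a check like this can run next to the machine and keep working when connectivity drops. Only the resulting alert has to travel to the system that schedules the repair.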

By leveraging a unified view of organizational data, enabling security and governance across the entire data and AI lifecycle, and closing the loop with AI agents that can take action, manufacturers can finally move from reactive to predictive maintenance.

The Value of Predictive Maintenance

Transitioning from reactive to predictive maintenance represents a significant shift towards resilience, one of the pillars of Industry 5.0 transformation. By combining Cloudera’s ability to deliver a foundation of trusted data at enterprise scale with ServiceNow’s workflow automation, manufacturers can realize several business benefits:

Eliminate unplanned downtime. Predicting failures before they happen minimizes disruptions and ensures operational continuity.

Reduce OpEx. Keeping data at the source eliminates expensive data movement and transformation costs and reduces infrastructure overhead. Optimizing maintenance schedules reduces labor costs.

Reduce CapEx. Proactive resolution prevents the cascading failures that occur when machinery runs to the point of failure, extending the lifespan and maximizing the value of industrial assets.

Ready to move from Reactive to Predictive Maintenance?

Although manufacturers are under significant pressure to make progress on Industry 4.0 and 5.0 transformation, the ability to operationalize AI at enterprise scale has been a significant barrier to success. By breaking down the barrier between the data and agentic workflows, Cloudera and ServiceNow provide the capabilities necessary to harness massive volumes of IoT data, identify potential failures, and take action in real time, enabling manufacturers to transform maintenance workflows and improve productivity and profitability of factory operations.

To learn more about the partnership, read the Omdia Whitepaper: Workflow Data Fabric: Powering Private AI Agents and Real-Time Intelligence with Cloudera and ServiceNow.

How Cloudera’s Mental Health First Aid Champions Are Making an Impact

Debbie Kruger — Fri, 29 May 2026 13:00:00 UTC

Supporting the well-being of Clouderans remains a core part of building a healthy, connected workplace culture. With May recognized as Mental Health Awareness Month, Cloudera is continuing to emphasize the importance of mental health in the workplace and the role open, empathetic conversations play in creating a supportive environment. While discussions around mental health have historically carried stigma in many professional settings, initiatives like the Mental Health First Aid program are helping normalize those conversations and empower employees to better support one another.

To learn more about the program’s impact, we spoke with Staff Software Engineer Vignesh Baskaran, one of Cloudera’s Mental Health First Aid-certified champions, about the skills gained through the training and why fostering a compassionate workplace culture matters.

What led you to become a Mental Health First Aid–certified champion at Cloudera?

My main goal was to better understand how to recognize when someone may be struggling and how to respond in an empathetic, informed way. The Mental Health First Aid program felt like a practical and accessible starting point to build that knowledge, and after completing the course, I felt more confident and prepared to help create a workplace culture where people feel safe asking for support.

How would you explain this program and your role to a colleague who has never heard of this benefit?

It’s a lot like conventional first aid and CPR training, but for mental health. As a Mental Health First Aid–certified champion, I am an approachable first point of contact for colleagues seeking guidance. This program aims to enhance understanding of mental health by offering practical methods to recognize signs of stress, burnout, anxiety, or emotional distress, helping individuals identify when someone may need support and teaching First Aiders to engage in conversations with empathy and compassion.

What is the process like when someone does reach out to you?

The first step is simply creating a safe space for them to talk openly without fear of judgment. Mental Health First Aiders aren’t trained to diagnose or provide therapy, but we do help the person feel heard and supported while assessing whether additional help may be beneficial. From there, we know where to guide colleagues toward the appropriate employee resources or services available through the company or externally.

Are there any misconceptions you have seen relating to mental health support at work?

One common misconception is that discussing mental health or seeking support at work is a sign of weakness, when in reality, reaching out for help shows a lot of self-awareness and strength. Mental health challenges are part of being human, and these kinds of programs exist to help people navigate difficult periods in a beneficial way. Creating open conversations about mental health helps reduce stigma and encourages a more empowering and understanding workplace.

How do you think teams can work to create a more supportive environment day to day?

A supportive culture is built when people feel comfortable being themselves and know they will be treated with respect and compassion. In many cases, people reach out simply because they need a safe space to talk without judgment. If that conversation helps reduce feelings of uncertainty or hesitation around seeking direction, that alone is valuable. Small actions such as checking in on colleagues, listening without judgment, respecting boundaries, and being understanding during stressful periods can have a really meaningful impact.

It also helps when teams normalize conversations about well-being and encourage balance rather than constant pressure. Creating an environment where people are valued strengthens both individual well-being and team collaboration.

How has this experience changed the way you think about mental health in the workplace?

This program has helped me better understand that everyone can go through highs and lows in the workplace at different points in their lives and careers. Mental health challenges are more common than many people realize, and they can affect anyone, regardless of role or experience. The course reinforced the importance of approaching colleagues with empathy and awareness, and the case studies and practical scenarios included in the training made the learning more relatable and realistic.

Have you learned any skills as part of the program that you can now apply to other aspects of your work/life?

One of the biggest changes I’ve noticed has been becoming more self-aware and reflective about my mental well-being. It has helped me recognize early signs of stress and negative thought patterns before they grow into larger challenges.

The training also reinforced the importance of active listening and thoughtful communication, which are valuable in both professional and personal relationships. Being able to take a beat to reflect and take constructive action earlier has helped me approach situations with greater balance and understanding.

Why is it meaningful to you that Cloudera offers such a program?

Programs like this show a genuine commitment to employee well-being beyond day-to-day work responsibilities, and send a strong message that people matter and that creating a healthy workplace culture is important.

I feel grateful to be part of both the program and an organization that encourages these conversations. Overall, I believe programs like this help people become better versions of themselves, both professionally and personally.

Explore how Cloudera supports employees in a culture rooted in empathy and fellowship.

Mastering Data Sovereignty: The Ultimate Competitive Advantage

Jessica Espinoza — Thu, 28 May 2026 13:00:00 UTC

Today’s competitive climate is influenced by many moving parts: geopolitical uncertainty, tightening regulations like the EU AI Act, and rapid AI adoption are just a few. As these forces converge, compliance is becoming more dynamic and multifaceted, extending to how AI systems are trained and deployed. Still, many organizations treat data sovereignty as a compliance exercise, focusing narrowly on where data resides.

Organizations that maintain control over their entire AI lifecycle, including their data and infrastructure, will gain the most durable and significant competitive advantage. Counterintuitively, compliance and guardrails power faster AI innovation at scale by giving teams clear boundaries that reduce complexity and streamline execution.

Here’s what leaders can take away from a recent webinar on data sovereignty, along with some key insights uncovered during the conversation.

Sovereignty Is Expanding Beyond Data

Dario Maisto kicked off the webinar by defining data sovereignty as being all about independence. Maintaining that sovereignty means organizations are not subject to undue influence from external entities, such as foreign governments and jurisdictions. As digital ecosystems grow more complex and globally distributed, this independence becomes essential for maintaining operational resilience, protecting sensitive assets, and preserving long-term strategic flexibility.

Dario also warned against the common error of equating data sovereignty with data residency, which covers only where your data is physically stored. Leaders may believe that if their data is stored locally, it’s automatically sovereign, but in an AI-driven world, residency without sovereignty creates a false sense of security. Even if data sits in a specific country, a foreign parent company may still own the infrastructure, and external governments may still have legal access.

At its core, data sovereignty is the alignment of jurisdiction and governance to exercise legal and operational control over data, wherever it resides.
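The difference between residency and sovereignty can be expressed as two separate checks. The Python sketch below is a simplified illustration under stated assumptions: the `Deployment` fields, the country codes, and the rule that sovereignty requires both local storage and a locally governed infrastructure owner are hypothetical simplifications of the point made above. They are not a legal test or a Cloudera feature.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Deployment:
    name: str
    storage_country: str        # where the data physically sits
    operator_jurisdiction: str  # legal home of the company that owns the infrastructure


def is_resident(deployment, allowed_jurisdictions):
    """Residency: the data is stored inside an allowed jurisdiction."""
    return deployment.storage_country in allowed_jurisdictions


def is_sovereign(deployment, allowed_jurisdictions):
    """Sovereignty (simplified): residency plus an infrastructure owner
    that answers to an allowed jurisdiction."""
    return (
        is_resident(deployment, allowed_jurisdictions)
        and deployment.operator_jurisdiction in allowed_jurisdictions
    )


allowed = {"DE", "FR"}
local_cloud = Deployment("local-cloud", storage_country="DE", operator_jurisdiction="DE")
foreign_owned = Deployment("foreign-owned", storage_country="DE", operator_jurisdiction="US")

for d in (local_cloud, foreign_owned):
    print(d.name, is_resident(d, allowed), is_sovereign(d, allowed))
```

The second deployment passes the residency check and fails the sovereignty check, which is the false sense of security described above.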

Why Sovereignty Matters Now

The urgency around sovereignty isn’t new, but the stakes are different today. So why sovereignty, and why now?

First of all, organizations are now navigating increased geopolitical fragmentation. Regions are creating their own rules for data privacy and AI governance that often conflict or don’t align globally, leaving companies unable to operate under a single global standard.

Chris Royles raised the issue of rising regulatory pressure and growing concerns about supply chain and infrastructure risks, which are connected to this decentralization. International conflicts and trade restrictions disrupt access to hardware because cloud infrastructure is still tied to physical regions and political systems. At the same time, many are shifting from cloud adoption to AI deployment, which forces new decisions: Where should AI run? Who controls the models? Can workloads move across regions if conditions change?

As Chris noted, organizations need to build something once and run it wherever they do business, designing for flexibility without losing control over data and operations. Sovereignty enables organizations to adapt quickly to regulatory, operational, or market changes without being locked into a single environment.

From Compliance to Competitive Advantage

Rodrigue Vitini touched on how sovereignty removes blockers that can prevent organizations from scaling innovation. There are plenty of these that enterprises must now account for simultaneously, including regulatory barriers, security concerns, and operational constraints.

With the right sovereignty strategy, enterprises can accelerate AI deployment without compliance trade-offs. Normally, companies face a trade-off between speed and compliance, but sovereignty resolves that tension by bringing AI to the data and applying policy controls and security across environments. This way, innovation can scale without compromising regulatory requirements.

The crux of the issue isn’t following the rules or checking the correct boxes. Leaders must lean into sovereignty to gain the control they need to withstand the current storm that is making supply chains fragile and expensive.

Enabling Sovereignty with Cloud Anywhere and Unified Governance

The four experts also discussed the importance of a “Cloud Anywhere” approach, which involves deploying data and AI workloads across public, private, and on-premises environments without sacrificing consistency or control. This flexibility ensures that data stays within required jurisdictions, and workloads can shift as new regulations and changing geopolitical needs alter the playing field.

Having unified governance is a key piece of this puzzle. When it comes to sovereignty, it’s all about maintaining consistent policies across everything from data collection to model deployment. This means keeping a firm grip on who can access data, implementing strong encryption and security to protect sensitive information, and ensuring that data usage can be traced. Being fully aware of where AI models come from and who owns them helps leaders understand how models are developed and deployed, so they can retain control over the value they generate.

A Practical Path Forward

Data sovereignty doesn’t require a complete overhaul from day one. Enterprises should avoid trying to solve everything at once and instead focus on achieving a minimum viable level of sovereignty that can evolve as their needs change.

In an AI-driven world, organizations must go beyond data residency and achieve full data sovereignty to unlock the economic value of their data. By identifying the sensitive data and workloads where control is crucial, and building gradually, they can operationalize sovereignty and remain competitive.

Delve deeper into these changes with us in our Mastering Data Sovereignty in the Cloud Era webinar. For more insights into how these observations translate into practice and how your organization can maximize data sovereignty in your environment, explore Cloudera’s latest resources.  

WLIT Webinar: Human-Centered Leadership with Dr. Jeanette Epps

Divya Karmagam — Tue, 26 May 2026 13:00:00 UTC

In an era defined by rapid change and rising expectations, leadership is being redefined in real time. Technical expertise still matters, but what sets leaders apart now is the ability to lead with resilience and humanity.

That idea came to life in the “Launching The Fourth Era of WLIT: A Universe of Potential” webinar, where Cloudera CMO Mary Wells sat down with former NASA astronaut Dr. Jeanette Epps for a wide-ranging and deeply personal conversation. At its core, their insights show how human-centered leadership drives success in high-stakes, high-change environments, with lessons that resonate as powerfully in the boardroom as they do 250 miles above Earth.

Mission-First Leadership in High-Stakes Environments

Mary: You’ve spoken about how leadership isn’t about the size of the team, but the influence you have on those around you. In high-stakes environments, how can women in tech balance that mission-first focus while still advocating for the unique perspectives that they bring to the table?

Jeanette: Students and other groups always ask, ‘So what was it like to be a Black female astronaut?’ And I have to remind them that, ‘Hey, I'm actually just an astronaut.’ I do the exact same work as all the guys, and I do it just as well, if not better.

That sense of belonging matters. I’m part of the crew. I’m not a separate piece of the crew that’s female and Black, I’m a fully trained, fully participating member of the team. That’s the mindset I think we need to carry forward.

At the same time, because we are female, our presence matters. Being in the room matters. You don’t have to tell people you’re the female in the room, because they already know. So our presence in these boardrooms, on missions, on teams, is everything. You show up, you participate, and you contribute as a strong member of the team. That's one of the biggest things that I try to impart to young women. Sometimes, when people focus too much on being the only one, I’ve noticed they start to shrink back, and they hesitate to participate. But when you see yourself as simply a member of the team, you step in, roll up your sleeves alongside everyone else, and get to work.

Resilience Through Setbacks

Mary: Speaking of resilience, you faced a very public challenge when you were reassigned from your 2018 mission. I think many women in technology or business face similar career reroutes. What's your advice for regrouping and recentering when a career path you've worked years for suddenly changes?

Jeanette: In 2018, I became the official backup on the Soyuz, and when the Russians declare you the official backup, it means you’ve passed everything, you’ve met all their standards.

We had taken long exams, and none of them were written. They were all oral, in front of a committee. My colleagues were amazing. We worked really, really well together. And then, at the end of all that, they said, “We’re taking Jeanette off.” This was five months before the actual mission. As you can imagine, it was a very sad time. Devastating, really. People were calling me in tears on my behalf because they were so excited to see this mission. I didn’t know what to do at first.

But what I didn’t do was overreact. I chose to be proactive. I found allies, worked with them, and controlled the narrative. Most importantly, I reminded myself of who I am and what actually happened. You can feel shame in moments like that, but it’s important to take stock of what you’ve done and what you haven’t done and be honest with yourself.

Then you show up. Just show up and keep moving forward.

That’s what I did. I showed up every day, kept moving forward, and continued to train. And eventually I was reassigned to Boeing Starliner.

Show Up, Reflect, and Keep Moving Forward

Mary: What’s next for you?

Jeanette: So I did retire from NASA, but my alma mater, the University of Maryland, is giving me the opportunity to do the commencement address this year.

Mary: Oh, right on.

Jeanette: So, in that, I'm just reflecting back on my life and how my advisor and the people there really are a part of my network and my team who really got me to this point. In my speech, I want to share a few key ideas we’ve talked about today.

First, you are not alone in what you’re going to face. Many of the challenges you encounter have happened to others before, and the most important thing is to keep moving forward. Show up. Keep going. Be a contributing member of whatever team you’re on. Roll up your sleeves, participate, and follow your dreams.

Someone once asked me why I tell students to dream big, saying it could set them up for failure. And I thought, if I had never dreamed big, I’d probably still be back in Syracuse, having never done any of this. The truth is, even if you don’t reach the exact endpoint, dreaming big pushes you much further than you would have gone otherwise.

That’s what I want students to understand. You may not get all the way to that final goal, but you will go so much further because you had it. And that’s what matters.

Because when you finally reach something like earning your degree, you realize that that’s not the end. You start asking, “What’s next?” And that’s the point. It’s not just about the milestone. It’s everything you learn along the way that shapes who you are, and then you keep moving forward.

Mary: It’s about being intentional and taking the time to pause and reflect: “Look how far I’ve come, now what’s next?”

Dr. Jeanette Epps continues the conversation with Paul Muller on The AI Forecast. Listen to the full episode on Spotify, then join Cloudera’s global WLIT LinkedIn community to keep the dialogue going with other tech leaders.

Defining The Hybrid Modern Data Platform

Angela Mann, Suzy Tonini — Thu, 21 May 2026 13:00:00 UTC

While the information technology industry is fast-moving, narratives about what a modern data platform is and what it delivers are not. Many organizations still associate enterprise data management with the manual complexities of a decade ago, unaware of how far the industry has evolved.

In this blog, we’ll cover what a modern data platform looks like in 2026, and why organizations across industries rely on Cloudera to transform decision-making, boost bottom lines, safeguard against threats, and save lives.

6 Reasons Why Cloudera is The Go-To Modern Hybrid Data and AI Platform

A modern hybrid data platform is a unified environment where your data, governance, and AI workloads run securely at any scale, on any cloud. Cloudera is the only true hybrid data and AI platform that brings AI to data anywhere: in the cloud, data centers, and at the edge.

1. Cloudera is Built on Open Standards, Not Proprietary Lock-in

There is a common misconception that enterprise platforms are proprietary. In reality, Cloudera is built on more than 50 Apache open-source projects. We use Apache Iceberg to ensure your data remains in open table formats, accessible across all major clouds, and shareable with other ecosystem tools like Snowflake and Databricks via our REST Catalog. Additionally, you can deploy anywhere: on-premises, across all major public clouds, or in hybrid environments, all with consistent security and governance.

2. Cloudera Offers True Zero Downtime Upgrades

While some alternative distributions require planned maintenance windows of 8 hours or more for in-place upgrades, Cloudera allows for continuous operations. Cloudera offers Zero Downtime Upgrades (ZDU) for core services, allowing your business to stay online while your infrastructure evolves.

3. "Manual Tuning" is a Thing of the Past

The era of manual resource optimization is over. Modern platforms must be self-healing and automated to survive at scale. Our native observability includes Auto Actions that automatically terminate runaway jobs and provide prescriptive recommendations for cost optimization. Additionally, built-in tools provide capacity forecasting, budgeting, and cost-center tracking to manage your spend without manual intervention.

4. True AI Readiness Requires More Than a Notebook

"AI-ready" is often used to describe basic tools like JupyterHub or MLflow. While these are useful for experimentation, they are only the beginning of a production AI lifecycle. Cloudera AI, accelerated by NVIDIA AI infrastructure, software, and open models, provides a production-grade platform, including model registry, explainability, and inference serving. With Cloudera Agent Studio, organizations can now build and orchestrate multi-agent AI workflows on governed enterprise data with capabilities that go far beyond simple notebooks.

5. Your Data Doesn't Have to Leave Your Premises

Many modern cloud-native tools rely on a multi-tenant SaaS control plane, meaning your sensitive telemetry and metadata are processed in an external environment. For organizations with strict compliance or air-gapped requirements, data sovereignty is paramount. Cloudera’s observability can run fully on-premises. No metadata or telemetry ever leaves your environment, ensuring total sovereignty.

6. Unified Governance is the Foundation of Scale

Scaling to more than 30 exabytes in production requires more than just connecting tools together; it requires unified governance. Cloudera Shared Data Experience (SDX) provides unified security, metadata, and governance across all clusters and environments. Additionally, we maintain the highest levels of enterprise readiness, including FedRAMP Moderate, GovRAMP Authorized, and TX-RAMP Level 2 certifications.

Next Steps

To learn about the latest innovations in data, analytics, and AI, watch our ClouderaNOW virtual event.

From Analytics Platform to an AI Operating System: Data Lakehouse in the Agentic AI Era

Navita Sood — Wed, 20 May 2026 13:00:00 UTC

The lakehouse architecture was developed to combine the unstructured scale of the data lake with the structured performance of the data warehouse. This shift unified enterprise data and delivered the first true "single source of truth." But in 2026, the mission has expanded. As we enter the era of Agentic AI, the lakehouse is evolving from a repository for retrospective reporting that supports decision-making into a high-performance context layer that powers autonomous enterprise agents and enables immediate action. Its open, flexible, and reliable foundation is enhanced with interoperability, real-time data handling, security, governance, cross-cloud and on-premises portability, and built-in AI automations for all administrative and operational functions.

At Cloudera, we are seeing a fundamental transformation in how Fortune 2000 leaders view their data estates. The pressure comes from their need to feed autonomous AI agents efficiently. They are using the Cloudera lakehouse to unify structured, semi-structured, and unstructured data to enable “zero-copy”, “zero-ETL”, near real-time model fine-tuning, and real-time inferencing. The lakehouse enables RAG pipelines, AI feature stores, and real-time streaming pipelines, delivering governance frameworks, semantic context layers, and operational intelligence for enterprise agents.

Evolution of the Data Lakehouse

Interoperability: Breaking the "Consolidation-First" Trap

In the AI era, your data is your biggest moat. So it's only right that your data strategy defines which tools you use and where you train and run your AI, not the other way around. However, many vendors still push a "consolidation-first" model, requiring you to move or copy your data into their proprietary governance or cloud environment before you can use it. Not only does this add cost, complexity, and risk to your data strategy, it also often requires you to surrender ownership and control of your data.

Your data lakehouse must be open, flexible, portable, interoperable, and adaptable so that if your data strategy changes, your lakehouse adapts with it. Hence, open table formats (Apache Iceberg), open catalogs (Apache Polaris), open query engines, REST APIs, and federated access are becoming the new baseline and form the core building blocks of Cloudera’s lakehouse.

Context-Aware Hybrid Lakehouse

LLMs are trained on the Internet. They don't know your business. AI success is no longer determined by model quality. It depends on which workflows you are automating and on the accuracy of the business context you provide to the models: ERP records, financial transactions, supply chain logs, and so on.

Cloudera Data Lakehouse provides a secure, well-guarded context-aware layer for your agents:

360-degree Context: Unify and make available data from the edge, data centers, and in the clouds with a single governance layer providing complete 360-degree context.
Multi-Modal Data: Transform, clean, and unify unstructured data such as logs, videos, and images, augmenting analytics and reasoning together with structured tables.
Shared Semantics: Combine technical, business, and operational metadata to make it easy for agents to discover, understand, and use your data in the correct business context.
Full-Spectrum Lineage: When an AI agent makes a $1M procurement decision, you need a "paper trail", or explainability. Cloudera provides this explainability via end-to-end traceability and automated lineage from the edge sensor to the final model output.
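
To make the "paper trail" idea concrete, here is a minimal sketch of walking a lineage graph from a decision back to its sources. The artifact names and the `UPSTREAM` dictionary are hypothetical and invented for illustration; this is not Cloudera's lineage API, just the general shape of an upstream trace.

```python
from collections import deque

# Hypothetical lineage graph: each artifact maps to the artifacts it was derived from.
UPSTREAM = {
    "procurement_decision": ["demand_forecast_model_output"],
    "demand_forecast_model_output": ["feature_table"],
    "feature_table": ["cleaned_sensor_readings", "erp_orders"],
    "cleaned_sensor_readings": ["edge_sensor_raw"],
    "erp_orders": [],
    "edge_sensor_raw": [],
}


def trace_upstream(artifact, upstream):
    """Return every ancestor of `artifact`, breadth-first, without duplicates."""
    seen, order = set(), []
    queue = deque(upstream.get(artifact, []))
    while queue:
        node = queue.popleft()
        if node in seen:
            continue  # an artifact reachable by two paths is reported once
        seen.add(node)
        order.append(node)
        queue.extend(upstream.get(node, []))
    return order


for ancestor in trace_upstream("procurement_decision", UPSTREAM):
    print(ancestor)
```

In this toy graph, the trace for the procurement decision ends at the raw edge sensor feed and the ERP orders, which is the kind of end-to-end answer an auditor would ask for.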

Cloudera’s lakehouse delivers real-time context across distributed and heterogeneous environments, enabling enterprises to keep their data, models, and business rules in their control while delivering complete context to AI systems.

Portable AI

Cloudera allows you to bring analytics and AI to the data, wherever it lives. Whether your data resides in an on-premises object store, a private cloud, or across multiple public clouds, our lakehouse delivers portable AI with a unified, zero-copy architecture. You can build in the cloud and run inference on premises, without any refactoring costs, to keep your data in your control and prevent IP leakage. For global financial institutions like OCBC Bank, this architectural openness makes it possible to scale AI/ML capabilities across the entire group while meeting strict regional data residency and sovereignty requirements.

Self-Optimizing Autonomous Lakehouse

AI systems are highly sensitive to data quality, freshness, and consistency. As data volumes and AI workflows grow exponentially, manual optimization becomes unsustainable. Cloudera integrates AI-driven automations directly inside the lakehouse platform for:

Data access
Data optimization
Compaction
Schema evolution
Tagging and classification
Workload tuning
Quality monitoring
Governance enforcement
Lineage
Lifecycle management

It continuously self-optimizes while reducing operational complexity for data and AI teams. Using Cloudera Agent Studio, our customers are deploying agents that autonomously monitor, transform, and move data based on business intent.

From Batch to Continuous: The Streaming Lakehouse

The distinction between "streaming" and "batch" is evaporating. To support agentic workflows, data cannot be minutes or hours old—it must be continuous.

Cloudera Open Data Lakehouse serves as a streaming lakehouse that treats every data point as an event, allowing AI agents to respond to supply chain disruptions or financial anomalies the millisecond they occur. It processes these events right where they originate and performs complex analytics on streaming data before ingesting it into the lakehouse for near-real-time decisioning. It also delivers the pre-processed streaming data to agents at inference time for real-time action. In addition, the lakehouse includes data sharing and federation capabilities, ensuring that data from other sources can be acted upon with minimal latency, without unnecessary data movement or transformation.

The Edge-to-AI Continuum: Edge Inference Extends the Lakehouse Beyond the Data Center

The lakehouse is not a centralized monolith. As IoT, smart factories, and mobile applications proliferate, edge inference has become critical. Cloudera extends the lakehouse outward, allowing analytics and action where the data is generated, at the edge, while synchronizing the insights back to the central hub. Navistar is one example: by processing sensor data from thousands of connected trucks in real time and automatically triggering proactive maintenance actions, the company has reduced maintenance costs by 30%.

Convergence of Data Fabric and Lakehouse

At Cloudera, we are seeing a convergence of the Lakehouse and Fabric architectures. While the Lakehouse unifies the data, the Fabric activates the metadata (automated capture at ingestion: lineage, sensitivity tags, and more). Together, this helps to automate data discovery, integration, and governance. This simplifies access to data anywhere with zero-copy, zero-ETL, and zero-redundancy security.

From AI that Talks to AI that Predicts and Acts

The first wave of AI was about conversation. The next wave is about agents. The winners in this era won't be those who simply "store" the most data; they will be the ones who can provide trusted, continuous, multi-modal context to autonomous systems, making clear recommendations and decisions. By providing AI agents with governed, federated access to any data, Cloudera is helping the world's largest enterprises move from "chatting" to "acting."

Whether your data is in the data center, the clouds, or at the edge, Cloudera Open Data Lakehouse serves as a hybrid lakehouse to ensure it is ready for the agentic future.

Watch the video to learn how the Cloudera Open Data Lakehouse works.

Visit Cloudera Open Data Lakehouse to learn more.

Healthcare AI: Building Trustworthy Data Pipelines for Patient Insights

Rameez Chatni — Mon, 18 May 2026 13:00:00 UTC

You’ll hardly ever hear an IT leader in any industry complain about a lack of data; that’s one thing nearly every enterprise has in spades. It’s a shortage of trustworthy, usable data that is causing bottlenecks in this competitive landscape, tripping enterprises up before they can reach the finish line of complete AI success.

In healthcare, the conversation around AI often centers on how to get patient insights from AI, yet the reality is more complicated. While AI is already showing that it can surface powerful patient insights, unreliable data pipelines render them risky or unusable. Critical data resides across electronic health records (EHRs), labs, imaging, and claims systems, which remain fragmented and non-interoperable, leading to incomplete patient views. Clinicians and analysts are often forced to make decisions without a full picture of the patient, limiting both care quality and AI effectiveness.

Regulatory pressure also increases compliance costs, and many healthcare AI models remain in pilot stages because poor data governance produces untrustworthy outputs that clinicians won't rely on. That’s why trusted, governed data pipelines are the foundation for clinically actionable healthcare AI, and ultimately determine how successfully organizations can get patient insights from AI that clinicians will actually use.

From Data Chaos to Trusted Data Pipelines

Healthcare data doesn’t live in one place, and for strict regulatory reasons, it likely never will. In practice, many organizations adopt a hybrid approach, centralizing what they can while leaving high-value systems like EHRs and imaging platforms in place. These systems aren’t designed for high query volumes and, in many cases, can’t be freely accessed, making full consolidation impractical.

End-to-end data pipelines shift healthcare data from static and delayed to continuous and usable, but that only matters if each stage actually solves a real bottleneck. Rather than relying on periodic batch uploads, modern pipelines capture data as it’s generated, from EHR transactions and lab results to claims feeds and connected medical devices. This reduces the lag between when an event occurs (for example, a change in patient condition) and when it becomes visible to downstream systems. In clinical environments, that latency directly impacts intervention timing and patient outcomes.

One of the biggest sources of inconsistency in healthcare is parallel data preparation, or different teams reshaping the same data for different purposes. End-to-end pipelines apply common standards and quality checks upstream, so the data feeding the healthcare AI models is aligned, ensuring the models are trained on the same version of truth that the business relies on.

End-to-end data pipelines also deliver insights directly into operational and clinical workflows in near real time. Insights only create value if they show up where decisions are made. This becomes even more critical as organizations adopt generative and agent-driven AI, where performance depends heavily on delivering the right clinical context at the right moment—something far more complex in fragmented healthcare environments than in controlled demos. Instead of routing outputs to separate analytics tools, mature pipelines integrate results into existing systems, so a clinician doesn’t need to dig for it. It’s surfaced in context, at the moment of care, where it can influence decisions.

Governance Drives Trusted Healthcare AI

In healthcare, governance has often been treated as a barrier to innovation, but in practice, the opposite is proving true. Without clear data lineage, healthcare AI outputs struggle to gain the trust of clinicians and regulators alike, especially when auditability and HIPAA compliance are at stake.

Forward-looking organizations are embedding governance directly into their data pipelines, enabling them to trace how data is transformed and used in models and to ensure compliance without slowing down workflows. In turn, this strengthens healthcare workers' confidence in both the data they’re using and the decisions they base on it.

Curious to see how healthcare organizations are building that trusted data foundation to operationalize AI while protecting patient health information, compliance, and security postures?

Learn more

Infrastructure Makes or Breaks AI Scale

Many healthcare organizations have successfully piloted healthcare AI models, but far fewer have operationalized them at scale. At the same time, healthcare is seeing a surge of high-value, specialized AI solutions, from ambient documentation tools to radiology models and automated claims processing. While each delivers value independently, they often operate in isolation, creating new islands of intelligence. Without a unifying layer to connect these outputs to a patient’s longitudinal record, organizations struggle to turn point solutions into coordinated, system-wide impact. This is where a unified data and AI platform becomes critical, bridging these systems while maintaining governance, residency, and control.

In many organizations, models are developed in isolated environments that don’t reflect production conditions. Moving from one deployment to another often requires rework, introducing delays and risk. Scalable healthcare AI requires standardized deployment frameworks that allow models to run consistently across on-prem and cloud environments, with minimal friction between experimentation and production.

Many existing pipelines are built for either real-time insights, such as ICU alerts, or batch-generated insights, like population health trends, but rarely for both. Healthcare decisions don’t happen on a single timeline, so when real-time capabilities are missing, insights arrive too late to influence care, leading to preventable missed interventions. To scale, AI outputs must be embedded in workflows to inform decisions in real time. Without these capabilities, AI remains confined to isolated proofs of concept that demonstrate potential but fail to deliver sustained value.

Patient populations change, clinical practices evolve, and data distributions shift. Without continuous monitoring, organizations risk relying on outdated or unexplainable outputs. In a regulated environment, this is a huge liability. The organizations moving ahead are those that assign the same rigor and governance to their AI as any other critical healthcare system.
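
One common way to put a number on "data distributions shift" is the Population Stability Index (PSI), which compares how a feature is distributed in a baseline sample versus a recent one. The sketch below is an illustrative choice of technique, not something the article prescribes; the function name, the bin count, and the sample data are all assumptions.

```python
import math
from typing import Sequence


def population_stability_index(
    baseline: Sequence[float], current: Sequence[float], bins: int = 10
) -> float:
    """Compare two samples of one numeric feature; 0.0 means identical bin shares."""
    if not baseline or not current:
        raise ValueError("both samples must be non-empty")
    lo, hi = min(baseline), max(baseline)
    if hi == lo:
        raise ValueError("baseline needs at least two distinct values")
    width = (hi - lo) / bins  # equal-width bins defined by the baseline range

    def bin_shares(values: Sequence[float]) -> list:
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            idx = min(max(idx, 0), bins - 1)  # clamp out-of-range values into edge bins
            counts[idx] += 1
        # A small floor avoids log(0) and division by zero for empty bins.
        return [max(c / len(values), 1e-6) for c in counts]

    expected = bin_shares(baseline)
    actual = bin_shares(current)
    return sum((a - e) * math.log(a / e) for a, e in zip(actual, expected))


# Hypothetical data: a feature whose recent values have shifted upward.
baseline_sample = [float(x) for x in range(100)]
recent_sample = [x + 30.0 for x in baseline_sample]
print(f"PSI: {population_stability_index(baseline_sample, recent_sample):.2f}")
```

A frequently quoted rule of thumb treats a PSI above roughly 0.25 as a significant shift, but the right threshold depends on the feature and the clinical context, so any alerting cutoff should be validated rather than assumed.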

Trust Is the Differentiator

The healthcare organizations where AI has made meaningful impact are doing it with stronger data pipelines than their peers. Their success stems from treating data as a governed, strategic asset that supports clinical-grade decision-making.

Platforms like Cloudera support this shift and can help your organization turn fragmented data environments into reliable foundations for clinical and operational intelligence.

As AI adoption accelerates, organizations with governed, scalable data foundations will lead in both innovation and patient outcomes. Learn more about how Cloudera helps transform fragmented data into reliable, actionable patient insights.

Six Cloudera Leaders Named to CRN’s 2026 Women of the Channel List

Cloudera — Thu, 14 May 2026 13:00:00 UTC

Solving today’s most complex business challenges, from hybrid cloud to AI and advanced analytics, depends on a dynamic partner ecosystem built on deep collaboration and shared expertise. At Cloudera, our Partner Organization plays a central role in driving this momentum, enabling innovation and helping customers realize meaningful, long-term value.

This year, six exceptional women from Cloudera have been named to CRN’s 2026 Women of the Channel (WOTC) list, a recognition that honors influential leaders shaping the future of the IT channel through vision, execution, and impact.

Among them, Michelle Hoover, SVP of Global Alliances & Channels, has been named to the prestigious Power 100, an honor reserved for leaders whose contributions are redefining what success looks like across the channel.

Join us in celebrating Cloudera’s Women of the Channel and getting to know the leaders behind this recognition.

Michelle Hoover, SVP, Global Alliances & Channels – Michelle earned a place on CRN’s Power 100 list for a second consecutive year, recognizing a year of transformative leadership across Cloudera’s partner ecosystem. Her work has accelerated enterprise AI adoption and strengthened Cloudera’s cloud and AI integrations, positioning the company at the center of a rapidly evolving ecosystem.

With more than two decades of expertise in partner experience, Michelle leads Cloudera’s Global Alliances & Channels organization, focusing on building high-impact partnerships and aligning them with customer outcomes. She has been instrumental in advancing Cloudera’s AI Ecosystem, a collaborative group of leading technology providers dedicated to helping enterprises scale AI initiatives with greater efficiency and security. Her leadership style embodies Cloudera’s values by prioritizing collaboration with stakeholders and team members.

Michelle believes effective leadership involves leading from the front, actively engaging with partners and customers to maximize each team member's potential. This approach promotes strong collaboration and unity with sales, which is essential for a successful Cloudera partner ecosystem. 

Natascha Lee, Head of Global Partner & Alliance Marketing – A seven-time winner of the Women of the Channel awards, Natascha Lee leads Cloudera’s Global Partner Marketing organization with a track record of building high-performing, partner-first programs. With more than 20 years in channel marketing, she drives initiatives that deepen partner engagement and leads innovative programs spanning a vast ecosystem of technology partners. Her leadership blends creative instinct with analytical rigor, activating partners and customers through precise messaging and segmentation while consistently surpassing ambitious revenue targets.

Valaretha Brown, Senior Partner Marketing Manager – Valaretha Brown has played a key role in strengthening Cloudera’s global partner network through thoughtful program design and execution. She develops joint go-to-market strategies that expand revenue opportunities while reinforcing strong, trust-based partner relationships. A five-time winner of the Women of the Channel awards with over 15 years of experience in B2B technology marketing, Valaretha brings a sharp ability to identify high-impact initiatives and turn them into scalable programs. Her work spans digital campaigns and content strategy, making her a master at uncovering strategic initiatives that deliver immediate impact while establishing her as a trusted advisor among her marketing counterparts. She is dedicated to developing impactful demand generation that drives new pipeline while advancing existing opportunities through the funnel.

Lan Chu, Senior Partner Marketing Manager – Lan Chu brings a cross-functional perspective to partner marketing, combining experience in marketing strategy, partnerships, and sales enablement to deliver programs that drive measurable results. At Cloudera, she has built strong partner relationships while launching initiatives that directly support revenue growth. Her approach centers on close collaboration across internal teams, partners, and vendors, translating strategy into execution through targeted, channel-focused campaigns. Lan’s ability to connect the dots across functions makes her a key driver of integrated, high-performing partner programs, and she applies that range to strengthen alignment and support sustained revenue growth.

Janet O'Sullivan, Senior Partner Marketing Manager – Janet O’Sullivan leads partner marketing initiatives across four continents, designing programs that expand Cloudera’s ecosystem while delivering clear and measurable value to customers. Her work has fueled strong pipeline growth through a combination of multi-partner campaigns, targeted account-based marketing strategies, and tight regional alignment with sales and partner teams. Operating at a global scale, Janet brings a disciplined, execution-focused approach to building and activating partner networks. She identifies where joint value can be created, then translates that into coordinated programs that address real customer challenges. Her ability to align diverse stakeholders across regions has been a key factor in scaling impact, enabling Cloudera to grow its partner ecosystem in a strategic, sustainable way.

Jessica Espinoza, Senior Partner Marketing Manager – With more than 20 years of experience, Jessica Espinoza leads marketing efforts for Cloudera’s Cloud Alliances, shaping integrated campaigns that align closely with business priorities and scale globally. She brings a balance of creative thinking and operational discipline to every initiative. Jessica has led multi-million-dollar co-marketing programs, produced large-scale events with tens of thousands of attendees, and developed content strategies across digital and social channels. Known for her collaborative approach and bilingual fluency, she builds strong partner relationships while delivering campaigns that drive measurable growth.

Together, these Women of the Channel winners showcase a culture built on collaboration that translates strong partnerships into innovation and measurable results across the ecosystem.

Learn more about how Cloudera’s partner ecosystem can support your hybrid cloud journey. 

When Seconds Matter: Building AI You Can Depend On

Ian Brooks, Oliver Zarate, Pamela Pan — Mon, 11 May 2026 13:00:00 UTC

For the past few years, the AI conversation has been about access: getting models in front of teams, experimenting fast, proving out use cases. That chapter is closing. The questions organizations are asking now are different: Who controls the model? Where does the data go? What happens when it fails?

Picture a hospital using AI to help diagnose pneumonia from chest X-rays. A patient comes in struggling to breathe. The doctor uploads the scan and waits, but the system isn't responding—the model that the diagnosis application depends on is hosted in the public cloud, and it’s temporarily unavailable.

In healthcare, that kind of delay matters. It's a scenario worth thinking about carefully, because it gets at something that doesn't come up enough in AI conversations: where your model runs is just as important as what model you run.

Designing for Reliability

Public cloud has made AI accessible to a huge range of organizations, and that's genuinely valuable. At the same time, for applications where uptime isn’t negotiable, introducing external dependencies becomes an important architectural consideration.

One way to think about this is through uptime expectations. A 99.9% uptime service-level agreement (SLA) still allows for nearly nine hours of downtime per year. For a consumer app, that's an inconvenience. For a hospital radiology system, a trading platform executing millions of transactions, or an air traffic management tool, even short interruptions may require additional planning.
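
The arithmetic behind that figure is simple enough to check directly. The short sketch below converts an uptime percentage into the yearly downtime it still permits; the function name is illustrative, and the calculation assumes a 365-day year.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def allowed_downtime_hours(sla_percent: float) -> float:
    """Return the yearly downtime, in hours, that an uptime SLA still permits."""
    if not 0 <= sla_percent <= 100:
        raise ValueError("sla_percent must be between 0 and 100")
    return (1 - sla_percent / 100) * HOURS_PER_YEAR


for sla in (99.0, 99.9, 99.99):
    hours = allowed_downtime_hours(sla)
    print(f"{sla}% uptime allows about {hours:.2f} hours of downtime per year")
```

At 99.9%, the result is about 8.76 hours per year, which is the "nearly nine hours" cited above. Each additional nine cuts the allowance by a factor of ten.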

When external services are part of the stack, some aspects of reliability are shared across providers. As AI gets used in more critical parts of the business, teams often complement it with additional design considerations—such as fallback strategies and deployment flexibility—to align with their specific requirements.
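
As a rough illustration of a fallback strategy, the sketch below tries a list of inference backends in priority order and returns the first successful response. Everything here is hypothetical: `hosted_model` and `local_model` are stand-in functions rather than real service clients, and a production version would add timeouts, retries, and logging.

```python
from typing import Callable, Sequence


class AllBackendsFailed(RuntimeError):
    """Raised when every configured inference backend has failed."""


def infer_with_fallback(
    payload: dict, backends: Sequence[tuple[str, Callable[[dict], dict]]]
) -> dict:
    """Call each backend in order and return the first successful result."""
    errors = []
    for name, call in backends:
        try:
            return {"backend": name, "result": call(payload)}
        except Exception as exc:  # any failure moves on to the next backend
            errors.append(f"{name}: {exc}")
    raise AllBackendsFailed("; ".join(errors))


# Hypothetical stand-ins for a hosted endpoint and a locally deployed model.
def hosted_model(payload: dict) -> dict:
    raise TimeoutError("hosted endpoint unavailable")


def local_model(payload: dict) -> dict:
    return {"scan_id": payload["scan_id"], "status": "processed locally"}


response = infer_with_fallback(
    {"scan_id": "demo-001"},
    [("hosted", hosted_model), ("local", local_model)],
)
print(response)
```

The ordering of the list is the design decision: teams that prioritize control might list the local deployment first and treat the hosted endpoint as the backup instead.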

The Solution: Running AI Where Your Data Lives

In contrast, if you run AI where your data already lives, you can choose the environment that fits your needs and, importantly, retain control over system reliability.

With Cloudera AI Inference service, models can be deployed on-premises, in a private cloud, or across a hybrid setup. That flexibility lets teams align inference with their data, workloads, and risk profile, without forcing everything through a single architecture.

In practice, that looks like:

Operational continuity: Your applications keep running regardless of what's happening outside your walls
Predictable costs: Moving away from variable pricing (for example, per call) toward compute you control and can plan around
Real-time performance: As shown in our radiology demo, imaging analysis completed in under a second, giving clinicians immediate results

On top of that foundation, teams get model flexibility by default. A curated AI model registry—including providers like NVIDIA, Cohere, and Mistral AI—makes it easy to choose the right model for each use case. And with no lock-in, you aren’t dependent on a single vendor’s roadmap and can change AI models as better options emerge.

Everything is designed for production from day one. Autoscaling absorbs demand spikes, high availability removes single points of failure, and performance optimizations for sub-second response times are built directly into deployment—not layered on later.

Governance is embedded throughout. An AI Gateway enforces access control and policy before requests reach a model, while a monitoring layer provides continuous visibility into latency, throughput, and resource usage.

The result is a system where the entire inference pipeline stays within your control—from model selection to production execution—while still giving you the flexibility to run AI wherever it works best.

Why Maintaining Control Over Data is Especially Critical for Regulated Industries

For healthcare, financial services, or national security, data privacy is a legal obligation. When model inputs, outputs, and prompts travel to an external vendor for inference, the question is no longer only one of latency; it becomes one of maintaining compliance and sovereignty.

Think about what actually gets sent during an inference call. In radiology, that might be a patient scan tied to a medical record. In financial services, it could be a transaction history used to flag fraud. In legal or defense contexts, it might be documents that are sensitive by nature. Each of those calls is a data transfer, and with external APIs, that transfer crosses a boundary you don't fully control.

Keeping inference on-premises or in a private cloud means data stays where it belongs, proprietary models remain fully owned by the organization, and audit trails stay internal. Built-in observability gives teams real-time visibility into latency and resource usage without that activity touching an outside vendor, which matters both for compliance reporting and for understanding how your models are actually behaving in production.

Stop Debating “Cloud Vs. On-Premises” and Build Intentional Hybrid Architectures

AI should be an asset that makes your systems more reliable, not a new single point of failure. Healthcare makes the stakes visceral, but the same logic applies anywhere the impact of downtime is high: manufacturing lines, real-time financial systems, and logistics networks. To mitigate downtime and capitalize on AI benefits, organizations need to intentionally build hybrid architectures, so that their most critical workloads run on infrastructure they control.

Curious how this looks in practice?
Watch the full Cloudera AI Inference demo.

Cloudera Trusts Chainguard to Help Secure the Foundation of Enterprise Data

Sarah Haberman — Tue, 05 May 2026 13:00:00 UTC

The Challenge

Cloudera’s mission is bold: AI anywhere, cloud anywhere, data anywhere. As the company powers some of the world’s most data-intensive enterprises, the stakes are high, particularly in highly regulated sectors.

"At Cloudera, security is critical for our customers, especially in highly regulated environments like government, healthcare, and financial services.”

— Katie Boswell, Vice President, Product Security, Cloudera

Security and compliance sit at the heart of that responsibility. As customer expectations evolved and compliance requirements tightened, Cloudera realized that it needed to evolve how it managed its container security posture as challenges grew in scale and complexity.

The distinction between public-sector and commercial compliance environments was disappearing, making scalable vulnerability remediation essential. Instead of relying on a sustainable, automation-first approach, Cloudera’s engineering teams were investing significant time in patching base images, rebuilding, and revalidating, even for vulnerabilities with no runtime impact, creating an opportunity to optimize and modernize their process. Katie Boswell, Vice President, Product Security, explained, “Every hour an engineer spends in remediating a CVE is an hour taken away from building better features and higher quality for our customers.”

With exabytes of customer data under management and the rapid rise of AI increasing both opportunity and risk, Cloudera knew it needed a scalable, secure foundation that could maintain its FedRAMP requirements and security posture across both federal and commercial environments, reduce its attack surface area, and ensure it remained a trusted, resilient backbone in its customers’ data supply chain.

The Solution

Cloudera evaluated several paths forward, from major OS container vendors to building an in-house solution. Competing options focused on patching, leaving the long tail of medium and low CVEs and the maintenance burden squarely on Cloudera’s teams. Chainguard offered something different: verified container images that removed vulnerabilities across all severities, allowing engineers to stay focused on innovation.

Cloudera adopted Chainguard Containers to rebuild its container foundation from the ground up. With secure-by-default, continuously verified base images, Cloudera saw immediate reductions in vulnerabilities and gained end-to-end provenance for every image in its supply chain. The company completed integration into production pipelines in just 90 days, setting a new standard for security automation and deployment speed.

Despite some initial hesitation around bringing on a vendor for such critical infrastructure, Chainguard quickly proved its value by aligning with Cloudera’s deep security and compliance culture. The new approach freed engineering teams to focus on delivering high-quality, secure data products for customers.

“Chainguard is now a standard across our container ecosystem, powering both our FedRAMP and commercial control planes.”

— Katie Boswell, Vice President, Product Security, Cloudera

The Results

Working with Chainguard, Cloudera reduced its container CVE footprint by more than 90%, strengthening its security posture and ability to scale securely while also establishing a more resilient, future-ready foundation to support evolving AI workloads, compliance requirements, and emerging supply chain risks.

Cloudera also saw significant gains in speed and compliance readiness. By shifting to secure-by-default containers, the company maintained its FedRAMP compliance while accelerating delivery cycles. As Jamison Bennett, Security Engineer, shared, “Chainguard has allowed Cloudera to reliably ship our product faster with fewer CVEs.”

From a leadership perspective, the results were equally transformative. Working with Chainguard, Cloudera can reallocate engineering resources toward innovation and customer outcomes while strengthening trust and data integrity across its enterprise open source platform. As Katie Boswell explained, “Chainguard has become a key weapon in our arsenal of tools we use against the security threats that are out there. It’s helped us stay ahead of emerging threats, including those amplified by AI, and frees our teams to focus on delivering enterprise innovation instead of chasing vulnerability noise.”

“Chainguard gives us peace of mind and knowing that the OS and the system vulnerabilities that are out there that would be under attack are being taken care of for us.”

— Jamison Bennett, Security Engineer, Cloudera

Special thanks to Sarah Haberman, Senior Customer Marketing Manager at Chainguard, and Drew Kelly of ORO Productions for production of the Chainguard video case study and written companion piece above.

Bridging the Gap Between High Performance Computing and Sovereign AI: Part Three of Three

Gabriele Folchi, Lama Itani — Mon, 04 May 2026 13:00:00 UTC

This blog is the last in a three-part series: part one covers the basics of high performance computing (HPC), and part two covers the importance of a sovereign data lakehouse.

The Cloudera Advantage for HPC and Sovereign AI

While a data lakehouse on its own does not support HPC (HPC simulations require a substantially different technology platform), it's the ideal complement for operationalizing a strategy focused on reduced-order models (ROMs), providing essential capabilities such as structured MLOps, experiment support, cost-effective data archiving, simplified access, and a collaboration toolchain.

Cloudera uniquely bridges the gap between massive-scale, specialized physics data (HPC) and the agile requirements of modern AI training (MLOps). By providing a cloud-agnostic, sovereign-ready architecture, it ensures compliance and gives enterprises a secure, viable path to operationalize ROMs.

Cloudera supports this convergence through the following specific capabilities:

1. Handling Data at Scale with Sovereign Control

The Challenge: As mentioned above, storing and managing petabytes of historical Full Order Model (FOM) snapshots is often expensive and complex in traditional storage. However, engineers also need a way to ingest, transform, and archive these massive datasets with strict governance while maintaining "Operational Sovereignty," thereby ensuring the data never leaves the desired jurisdiction.

The Cloudera Solution:

Cloudera DataFlow: Acting as the universal ingestion engine, Cloudera DataFlow allows engineers to build multi-modal pipelines with a no-code experience in a collaborative environment. It can ingest raw solver files (CFD/FEA logs), transform unstructured data into structured features, and store them directly in the data lakehouse’s object storage (Cloudera Object Storage, based on Apache Ozone) for easy access when training or retraining ROMs.
Provenance & Auditing: Crucially, DataFlow provides built-in data lineage and provenance. This ensures that every "feature" used to train a ROM can be traced back to its original source file, providing the audit trail required for safety-critical engineering.
Cloudera SDX then provides a unified point for designing and enforcing authorization policies across every data and AI service, giving teams a single pane of glass for controlling access to the sensitive IP contained in FOM datasets and ROM features.
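
The provenance idea above can be illustrated with a minimal sketch. This is not Cloudera DataFlow's API; the `LINEAGE` mapping, the `trace_sources` helper, and the artifact names are hypothetical, chosen only to show how a feature might be walked back to the raw solver files it derives from.

```python
from typing import Dict, List

# Hypothetical lineage records: each artifact maps to the artifacts it was derived from.
# An artifact with no recorded parents is treated as an original source file.
LINEAGE: Dict[str, List[str]] = {
    "feature:pressure_mean": ["table:cfd_clean"],
    "table:cfd_clean": ["file:run_001.log", "file:run_002.log"],
    "file:run_001.log": [],
    "file:run_002.log": [],
}


def trace_sources(artifact: str, lineage: Dict[str, List[str]]) -> List[str]:
    """Return the sorted original sources an artifact ultimately derives from."""
    sources = set()
    seen = set()
    stack = [artifact]
    while stack:
        node = stack.pop()
        if node in seen:
            continue  # skip nodes already visited (shared parents or cycles)
        seen.add(node)
        parents = lineage.get(node, [])
        if not parents:
            sources.add(node)  # no recorded parents: an original source
        else:
            stack.extend(parents)
    return sorted(sources)


print(trace_sources("feature:pressure_mean", LINEAGE))
```

In a real deployment the lineage records would come from the platform's own provenance store rather than a hand-written dictionary; the traversal is the part an auditor cares about.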

2. Precision and Reuse: Team Tracking of ML Experiments

The Challenge: Developing accurate ROMs involves hundreds of iterations. Without a central system of record, R&D teams struggle with "version chaos," losing track of which hyper-parameters or datasets produced the best results.

The Cloudera Solution:

Cloudera AI Workbench: This service provides a collaborative environment with secure, open-source Notebooks-as-a-Service (Jupyter). To further enhance developer productivity, the workbench provides the flexibility to use preferred third-party editors, including VS Code, PyCharm, and RStudio, either in-browser or as local IDEs connected to the workbench's compute resources. Moreover, the workbench integrates natively with MLflow, allowing users to create a documented "Source of Truth" for every ROM project by logging hyper-parameters, evaluation metrics, and training dataset versions used for each specific version of an AI model produced by any team. This promotes visibility and reuse, allowing different teams to easily adapt a model architecture based on their subject expertise.
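
To make the "system of record" idea concrete, here is a minimal, standard-library sketch of what gets logged per run and how the best run can be looked up later. It deliberately does not use MLflow's API; `Run`, `ExperimentLog`, and the parameter and dataset names are hypothetical stand-ins for whatever the tracking service actually stores.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Run:
    """One training run: what went in and what came out."""
    run_id: str
    params: Dict[str, float]
    dataset_version: str
    metrics: Dict[str, float] = field(default_factory=dict)


class ExperimentLog:
    """In-memory stand-in for a shared tracking store (hypothetical)."""

    def __init__(self) -> None:
        self.runs: List[Run] = []

    def log(self, run: Run) -> None:
        self.runs.append(run)

    def best(self, metric: str) -> Run:
        """Return the run with the lowest value of `metric` (e.g. an error measure)."""
        candidates = [r for r in self.runs if metric in r.metrics]
        if not candidates:
            raise ValueError(f"no run reports metric {metric!r}")
        return min(candidates, key=lambda r: r.metrics[metric])


log = ExperimentLog()
log.log(Run("run-1", {"latent_dim": 16, "lr": 1e-3}, "snapshots-v1", {"rmse": 0.042}))
log.log(Run("run-2", {"latent_dim": 32, "lr": 1e-3}, "snapshots-v2", {"rmse": 0.031}))

winner = log.best("rmse")
print(winner.run_id, winner.params, winner.dataset_version)
```

Because each run carries its hyper-parameters and dataset version alongside its metrics, another team can see exactly which configuration produced the best result before adapting it.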

3. Cloud-Like PaaS Experience with Predictable Economics

The Challenge: R&D teams need instant access to compute not just for iterative training but also for production-grade inference of AI models. Public cloud inference services often lead to "token shock" or runaway costs due to high-volume inference loops. Conversely, on-premises IT often lacks the agility to provision resources quickly.

The Cloudera Solution:

PaaS-by-Design Architecture: Built on top of Kubernetes, Cloudera offers a modern, multi-tenant platform where data and AI services are self-provisioned by practitioners. The platform auto-scales based on current workload demands, regardless of whether it is running in a sovereign datacenter or a private cloud subscription.
Cloudera AI Inference Service: This service in particular allows engineers to deploy versioned releases of models, along with standard REST APIs for immediate production use. Because it runs on self-hosted infrastructure, the charging model is based on compute-hours (per GPU/CPU) rather than "per-token." This allows for the consolidation of tens of different models onto a single cluster, introducing significant economies of scale for high-volume engineering workloads.
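
The difference between the two charging models can be sketched with simple arithmetic. All prices, GPU counts, and hours below are hypothetical placeholders, not Cloudera or cloud-provider pricing; the point is only that a compute-hour bill is flat with respect to token volume, so there is a break-even volume above which it becomes the cheaper option.

```python
def per_token_cost(tokens: int, price_per_million_tokens: float) -> float:
    """Cost when billed by token volume."""
    return tokens / 1_000_000 * price_per_million_tokens


def compute_hour_cost(gpu_count: int, hours: float, price_per_gpu_hour: float) -> float:
    """Cost when billed by reserved compute time, independent of token volume."""
    return gpu_count * hours * price_per_gpu_hour


def break_even_tokens(gpu_count: int, hours: float, price_per_gpu_hour: float,
                      price_per_million_tokens: float) -> float:
    """Token volume at which both billing models cost the same."""
    fixed = compute_hour_cost(gpu_count, hours, price_per_gpu_hour)
    return fixed / price_per_million_tokens * 1_000_000


# Hypothetical figures for one month (720 hours) on a 4-GPU cluster.
GPUS, HOURS, GPU_HOUR_PRICE, TOKEN_PRICE = 4, 720, 2.0, 5.0

fixed = compute_hour_cost(GPUS, HOURS, GPU_HOUR_PRICE)
threshold = break_even_tokens(GPUS, HOURS, GPU_HOUR_PRICE, TOKEN_PRICE)
print(f"cluster cost: {fixed:.2f}")
print(f"break-even volume: {threshold:,.0f} tokens")
```

The sketch ignores cluster throughput limits, utilization, power, and staffing, all of which matter in a real comparison; it only illustrates why consolidating many high-volume models on one cluster changes the cost curve.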

4. From Datacenter to the Physical World: Edge Deployment

The Challenge: The ultimate value of a ROM is often realized outside the datacenter—embedded on a manufacturing floor or a power plant controller for real-time predictive maintenance.

The Cloudera Solution:

Cloudera Edge Management: This service allows practitioners to build and deploy data pipelines that include "in-process" model inference directly to edge infrastructure. With a no-code visual interface, engineers can push their trained ROMs to fleets of remote agents, closing the loop between the digital twin and the physical asset.

5. Future-Proofing via Open Standards

The Challenge: Engineering lifecycles are measured in decades. Proprietary tools or closed cloud formats create unacceptable vendor lock-in risks for long-term product data.

The Cloudera Solution:

Open Source Core: Cloudera’s entire data and AI platform is built on open community technologies (e.g., Apache NiFi, Apache Spark, Apache Iceberg, Apache Ozone, CNCF Kubernetes, and more).
Enhanced Experience: By wrapping these standards in a unified, secure, and user-friendly control plane, Cloudera bridges the gap between the freedom of open source and the ease of use expected of a modern cloud platform. This ensures that your critical IP remains portable and accessible forever.

Most importantly, End-to-End Sovereignty Without Compromise

Unlike competing data lakehouse platforms, which often fragment the lifecycle between proprietary storage and third-party compute or force a public-cloud-only form factor, Cloudera offers all of the above capabilities in a single, unified platform.

Cloudera combines this modern, PaaS-centric user experience with the unique flexibility to deploy the entire platform in a fully sovereign datacenter. This effectively allows advanced manufacturing customers operating in regulated markets or on strategically sensitive projects to execute a cutting-edge AI strategy in the most secure environment possible—satisfying the strictest requirements for both Data Residency and Operational Sovereignty.

Next Steps

The future of HPC and enterprise AI is sovereign, open, and operationally unified, and that future is built on Cloudera. Our Private AI Anywhere platform, which runs in any cloud and data center, delivers end-to-end, governed control over all mission-critical data, models, agents, and inference to ensure sovereignty, regulatory compliance, and proven business value at scale.

How Zero-Trust Principles Apply to Modern Data and AI Platforms

Carolyn Duby — Fri, 01 May 2026 13:00:00 UTC

In the past, traditional security models assumed clear perimeters and centralized data, but today’s landscape is much more complex. Data and AI workloads now operate across cloud, on-premises, and edge environments, creating new attack surfaces for cybersecurity threats.

Zero trust has been a foundational cybersecurity approach for years, and it’s only becoming more important for a future-proof, resilient security posture. So how can organizations continue to implement it in the next generation of enterprise technology?

What Is Zero Trust in the AI Era?

Zero trust is a proven security approach that assumes no user or device is automatically trusted, even within your network. While perimeter-based security assumes that internal users and devices are safe once inside the network, zero trust treats every access request as potentially risky and therefore requires continuous validation. In practice, this means that even if a user is connected to their company's Wi-Fi, they still need multi-factor authentication for each access request, and even then, they can only access specific, necessary systems.

The catchphrase most commonly associated with zero-trust architecture is “never trust, always verify.” That still applies in the AI era, but its scope has expanded beyond users, devices, and networks to include models, pipelines, and environments. Zero trust must now extend across the entire AI lifecycle, from data and model access and usage to inference flows and cross-environment workloads.

Applying Zero Trust to Data and AI Platforms

Verify All Data Access and Enforce Governance Throughout the AI Lifecycle

Enterprises should implement identity-based, context-aware access controls across all their data. Every time data is accessed, the interaction should be authenticated, authorized, and auditable to ensure security and trustworthiness.
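
As a rough illustration of "authenticated, authorized, and auditable," the sketch below denies by default, allows a request only when the identity was strongly verified and holds a permitted role, and records every decision. The `POLICY` table, role names, and the `mfa_verified` context signal are hypothetical; a real platform would enforce this through its own policy engine rather than application code like this.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Request:
    user: str
    roles: frozenset
    resource: str
    mfa_verified: bool  # context signal: was this request strongly authenticated?


# Hypothetical policy: resource -> roles allowed to read it.
POLICY = {
    "claims_table": {"claims_analyst"},
    "fom_snapshots": {"simulation_engineer"},
}

AUDIT_LOG: List[Tuple[str, str, str]] = []  # (user, resource, outcome)


def authorize(req: Request) -> bool:
    """Deny by default; allow only verified identities holding a permitted role."""
    allowed_roles = POLICY.get(req.resource, set())
    allowed = req.mfa_verified and bool(req.roles & allowed_roles)
    AUDIT_LOG.append((req.user, req.resource, "allow" if allowed else "deny"))
    return allowed


ok = authorize(Request("ana", frozenset({"claims_analyst"}), "claims_table", True))
no_mfa = authorize(Request("ana", frozenset({"claims_analyst"}), "claims_table", False))
wrong_role = authorize(Request("bo", frozenset({"intern"}), "fom_snapshots", True))
print(ok, no_mfa, wrong_role)
print(AUDIT_LOG)
```

Note that denied requests are logged as well as allowed ones; the audit trail is only useful if it captures every attempt.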

This becomes even more critical as AI systems depend on 100% of enterprise data to generate accurate, reliable outcomes. Without consistent governance, gaps in access control can lead to biased models, data leakage, or regulatory risk. The opportunity is to apply these controls uniformly across hybrid and multi-cloud environments.

Zero trust is also fundamental to strengthening your security stance. When implemented with proper governance, zero trust allows effective data sharing across the organization. This approach is mutually beneficial: it keeps data secure while ensuring those who need access can obtain it. Organizations need a platform that delivers a consistent, cloud-like approach to security and governance across all data, anywhere it lives.

Secure Models and Inference as First-Class Assets

Think of models as sensitive information. The prompts employees input often contain proprietary business context, and the outputs models generate can expose confidential or classified insights and decisions. In effect, models become both consumers and producers of sensitive data.

That’s why zero-trust principles must extend beyond data to include models, prompts, and inference endpoints. Keeping AI assets within trusted enterprise boundaries is critical. This means enforcing granular access controls, so only authorized users and systems can interact with specific models or datasets. It also requires versioning and lineage, ensuring organizations can track how models were trained, what data was used, and how outputs are generated—essential for auditability and compliance.

Operate Consistently Across Hybrid and Multi-Cloud Environments

Fragmentation in any part of an enterprise introduces risk, and zero-trust strategies are no exception. With agents and models creating new attack surfaces, organizations must be more aware of blind spots caused by inconsistently enforced security and governance policies, which can be exploited and lead to operational issues. Security is only as strong as its weakest link.

To be effective, zero trust must be uniform and portable. Access controls, governance policies, and monitoring standards should follow the data, models, and workloads to ensure that every interaction is consistently governed, whether in a public cloud environment or deep within a data center.

Organizations need a unified approach that eliminates policy gaps and delivers a consistent, cloud-like experience across data anywhere. When security and governance are applied the same way everywhere, teams reduce complexity and can move faster with confidence. The result is less fragmentation and a stronger foundation for scaling AI across the enterprise, without sacrificing control or trust.

The Future of Zero Trust

A unified platform approach brings data, analytics, and AI together from the ground up. Under a single, consistent framework, organizations can eliminate fragmentation, reduce risk, and apply zero-trust principles uniformly across cloud, on-premises, and hybrid environments. With the right platform in place, organizations can confidently bring AI to their data anywhere it lives, unlocking value while maintaining the control over compliance and reliability that modern enterprises demand.

Learn more about Cloudera’s approach to security and compliance here.

Bridging the Gap Between High Performance Computing (HPC) and Sovereign AI: Part Two of Three

Gabriele Folchi, Lama Itani — Thu, 30 Apr 2026 13:00:00 UTC

If you haven’t read part one on the basics of high performance computing (HPC), check it out now!

Key Principles of a Sovereign Data Lakehouse

The Open Data Lakehouse: A Simple PaaS for Engineers

While traditional engineering simulation software excels at helping mechanical engineers prepare, execute, and analyze simulation jobs, it lacks a native design to manage modern machine learning (ML) workflows and data pipelines. An open data lakehouse can bridge this gap, offering R&D engineers robust, contemporary capabilities on a platform that the IT department is likely already familiar with.

Key use cases and benefits of an open data lakehouse include:

Cost-effective, governed data archiving: Offers virtually unlimited, low-cost storage for archiving years of simulation snapshots (the datasets generated by solver sessions). This storage is managed and governed consistently across all engineering and IT organizations or teams. Critically, essential metadata and lineage are preserved for each dataset, transforming it from an opaque file into a trusted asset that can be easily reused beyond its original creator.

Simplified access to compute resources: Engineers can easily and rapidly deploy shared notebooks and Apache Spark or Python Ray clusters. These often share the same dedicated GPU resources used by the main HPC cluster.

Protection via open standards: An open data lakehouse prioritizes open standards like Apache Iceberg, Parquet, and Python over proprietary engineering formats. This is crucial for safeguarding a company's Intellectual Property (IP), ensuring simulation data remains accessible and usable by any tool, now and in the future, regardless of the company's evolving IT infrastructure or provider strategy.

A cloud-like PaaS experience: Data lakehouses structured as user-friendly, self-service platform-as-a-service (PaaS) stacks simplify the use of complex data engineering and MLOps tools, effectively bridging the knowledge gap between users from different technical backgrounds and fostering productive knowledge sharing.

The Risk of Public Cloud for Protecting IP of R&D

While a data lakehouse offers many advantages, it’s not, by itself, a complete solution for highly regulated sectors (such as aerospace, defense, energy, and automotive) where sovereignty is a non-negotiable requirement. Simply put: not every data lakehouse can be deployed and operated in compliance with data sovereignty mandates, and relying on the public cloud carries significant risk to maintaining the strictest control over proprietary IP.

For instance, a single snapshot of a computational fluid dynamics (CFD) job—like a new engine design—effectively represents the complete blueprint of its performance and industrial design; this dataset is a company's crown jewel. It is therefore crucial to determine which key non-functional capabilities of a data lakehouse can provide the absolute legal assurance of operational sovereignty necessary to store such strategic assets. This leads directly to the core of the residency vs. sovereignty debate.

Data Residency vs. Sovereignty

The traditional definition of sovereignty as operating in an enterprise's home country is an outdated notion, a remnant of the pre-cloud era. Previously, data center infrastructure was typically managed by local personnel, inherently subjecting it to the company's local jurisdiction and legal obligations. However, the rise of commercial cloud offerings, and the need for providers to guarantee extremely high service-level objectives 24/7, has fully enabled remote, follow-the-sun global cloud operations. This makes it impossible to guarantee the residency of the management team, at least in standard commercial regions, thereby severing the link between "data residency" and true "sovereignty."

Consequently, the most dependable architecture for handling and processing critical engineering data is a sovereign data lakehouse: an open data lakehouse that's natively hybrid and cloud-agnostic.

This approach offers the speed and ease of a cloud-like PaaS experience along with by-design compliance, enabling an enterprise to meet national or other jurisdictional policies that mandate operating entirely within a sovereign, private, and controlled environment, run by equally controlled personnel.

Term	Explanation	Business Impact
Data Residency	The data physically sits on hardware inside a specific country's geopolitical boundaries.	Handles basic local compliance requirements; the benefit is less about security than about latency between the data and the IT solutions consuming it.
Operational Sovereignty	Ensures that the people managing the cloud infrastructure (Cloud Ops) and the legal framework governing the provider are also local and under the right sovereign governance.	Prevents the risk of foreign government access requests that could legally force the provider to hand over sensitive IP without the company's consent.

AI Economics: Achieving Cost Predictability for AI Models

Beyond security and legal compliance, a sovereign data lakehouse architecture offers another crucial advantage: predictable cost management for implementing AI workflows.

The financial model of running AI services in the public cloud is inherently variable and consumption-based, tying costs directly to usage metrics (such as GPU-hours, processed tokens, operational volume, and data scanned). As more teams, projects, and applications draw on cloud infrastructure, costs climb with every additional workload and become harder to forecast. This model is particularly challenging for high-demand tasks like training complex generative AI (GenAI) models or heavy autoencoders, which require dedicated, constant, and massive GPU usage that is often difficult to share efficiently.

Transitioning to a sovereign data lakehouse deployed in a private or fixed-cost colocation data center shifts an organization to predictable spending by:

Establishing fixed asset investment: Organizations invest in fixed, sharable infrastructure. This setup allows multiple teams and projects to use the same resources, effectively driving the marginal cost of initiating new R&D experiments down to near zero.
Eliminating "bill shock": This architecture completely removes any financial risk associated with unexpected, massive expenses, such as those caused by high-volume inference, continuous iterative R&D training loops, or prohibitive data transfer fees common across public cloud zones.

To Learn More, Keep Reading in Part Three!

Delivering Exam-Ready AI Decisions in Insurance with Cloudera

Tom Gannon — Thu, 30 Apr 2026 13:00:00 UTC

Property & Casualty (P&C) insurance carriers have been pursuing digital transformation to protect their combined ratio and grow market share for well over a decade. AI represents a powerful new opportunity to automate and streamline workflows, manage risk, and improve profitability, but most insurers struggle to move from pilot projects to deploying AI in production. To build AI models that insurance carriers can trust to run core business processes, they must build their AI strategies on three pillars that ensure accuracy, consistency, and explainability of AI outputs.

The urgency for this shift is no longer theoretical. Regulators have signaled a clear expectation: insurers must maintain robust governance and documentation for every AI-supported decision. As states rapidly adopt these frameworks, often adding their own unique requirements, the move to production-grade AI has become a mission-critical endeavor.

In this blog, we will discuss those three pillars, and how Cloudera helps the largest insurance carriers in the world deliver exam-ready decisions with AI.

The AI Opportunity in Insurance

AI has the potential to transform many workflows within insurance:

Intelligent underwriting. Carriers need to improve loss ratios by moving from static models to more accurate, data-driven risk scoring and reduce underwriting overhead. Generative and agentic AI can capture nuance and context in complex submissions, synthesize the data, and arrive at a decision in a matter of seconds.

Claims velocity. Claims adjusters often deal with a backlog of First Notice of Loss (FNOL) documents and photos that require manual categorization and routing. By using AI to summarize and triage claims, insurers can significantly reduce the administrative burden and operational overhead.

Fraud prevention. Traditional machine-learning-based fraud scoring still requires a significant amount of manual investigation work when a claim is flagged, leading to long resolution times and a poor customer experience. AI can provide the reasoning behind a flag, identifying patterns across disparate datasets and reducing the time to resolution.

Catastrophe (CAT) response. While carriers around the world deal with an increase in volatile flash events, CAT response is often delayed by the need to wait for post-event, manual damage assessments. AI can integrate real-time data and imagery so insurers can model impact dynamically as an event unfolds, enabling proactive resource allocation and faster policyholder support.

The potential value of AI is clear, with many insurers running AI pilots or deploying AI in isolated pockets to prove out that value. However, the industry faces significant scrutiny across audits, litigation, and disputes, and every AI decision must be explainable, accurate, and consistent. There are significant technical barriers to deploying AI that meets the regulatory standards for explainability.

The Three Pillars for Exam-Ready AI Decisions

To overcome the technical, business, and regulatory challenges to deploying AI at enterprise scale, insurers should build models on the following three pillars for exam-ready AI decisions.

Truth. The quality, accuracy, and consistency of AI decisions depend heavily on the data it’s trained on. Most insurers are managing a distributed data estate, with legacy data warehouses, cloud and on-premises data lakes, and point solutions for various business processes. Each of these silos contains important policyholder and organizational data that is critical for AI success.

To trust that data, insurers must have an end-to-end view of its lineage: they should be able to see where the raw data came from, where and how often it moved and transformed, and where and how it is consumed across the organization.

Control. One of the core tensions related to AI in insurance is this: a significant portion of sensitive data resides on premises or in private cloud environments, while the majority of AI development, training, and deployment occurs in the public cloud, creating a gap between the data and the models. To produce exam-ready AI outputs, insurers must develop accurate, more deterministic models by training them on 100% of the organization’s data while complying with internal Governance, Risk, and Compliance (GRC) frameworks and external regulatory requirements for data privacy and security.

Defensibility. In litigation-heavy industries like insurance, AI governance must go well beyond explainability. Every AI decision must hold up in court, and when AI makes a decision, insurers must be able to recreate the AI model, the output, and the underlying view of the data it was based on. Insurers need end-to-end visibility and auditability of the data and AI lifecycle, governance over the data and the models, and security across the entire data estate to meet the industry standard for defensibility.
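
One way to picture the "recreate the model, the output, and the underlying view of the data" requirement is a tamper-evident decision record. The sketch below is an assumption-laden illustration, not a Cloudera feature: `fingerprint`, `decision_record`, `verify`, and the example claim fields are all hypothetical, and it uses only SHA-256 hashing from the Python standard library.

```python
import hashlib
import json
from typing import Any, Dict


def fingerprint(obj: Any) -> str:
    """Stable SHA-256 over a JSON-serializable object (keys sorted)."""
    payload = json.dumps(obj, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def decision_record(model_version: str, inputs: Dict[str, Any], output: Any) -> Dict[str, Any]:
    """Bundle what is needed to later show which model saw which data and said what."""
    record = {
        "model_version": model_version,
        "input_hash": fingerprint(inputs),
        "output": output,
    }
    record["record_hash"] = fingerprint(record)  # tamper evidence over the fields above
    return record


def verify(record: Dict[str, Any], inputs: Dict[str, Any]) -> bool:
    """Check that a stored record matches the inputs and has not been altered."""
    body = {k: v for k, v in record.items() if k != "record_hash"}
    return (record["input_hash"] == fingerprint(inputs)
            and record["record_hash"] == fingerprint(body))


claim = {"claim_id": "C-1", "amount": 1200, "region": "MA"}
rec = decision_record("fraud-model:1.4.0", claim, {"flag": False})
print(verify(rec, claim))
print(verify(rec, {**claim, "amount": 9999}))
```

A record like this only proves that inputs and outputs match what was stored; actually reproducing the decision still requires retaining the model version and the data snapshot themselves.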

Cloudera Provides a Data and AI Platform for Exam-Ready AI Decisions

Insurance companies like Allianz Australia use Cloudera to unify customer, operational, and external data to train models that can predict the potential impact of adverse weather events and respond proactively. Cloudera’s platform is built on the three pillars for delivering exam-ready AI decisions.

Build trust in AI with end-to-end lineage. Cloudera provides automated, end-to-end lineage across every data source and system, so data teams and regulators can easily trace data from its source all the way to consumption.

Maintain control with private AI. With private AI, insurers can build and train models on 100% of their data because the entire AI lifecycle runs in their private environment, behind their firewall. They can also deploy and run models directly on their data in a secure environment. As a result, AI decisions are based on organizational context, leading to more accurate and consistent AI outputs without compromising on security and governance.

Deploy defensible AI with a unified data fabric. Cloudera’s unified data fabric provides consistent security, governance, and access to data across your data estate, ensuring visibility and transparency into AI workloads. Models, outputs, and the underlying state of the data that produced them are easy to reproduce.

Together, these capabilities provide a platform that enables insurance companies to safely move from AI pilots to the production-grade AI they need to transform underwriting, claims, fraud, catastrophe response, and more.

For Insurance, the Time for AI Transformation is Now

Insurance is a business model built around risk management. AI represents one of the best opportunities for carriers to optimize that model and significantly improve their combined ratio, boosting profit margins and growth. However, the key to success is mitigating the new risk AI introduces. By building AI on the three pillars of trust, control, and defensibility, insurers can mitigate risk and deliver exam-ready AI decisions across their business.

Join the Conversation

To connect with Cloudera and learn more about how your peers are operationalizing defensible AI, join us at our insurance roundtable, “Defensible AI Decisions in Insurance,” in Boston on May 13, 2026. Register here.

Mending the Broken Link: Real-Time Data for AI in Financial Services

Dennis Duckworth — Wed, 29 Apr 2026 13:00:00 UTC

In this new phase of AI adoption, ideas and pilot models are no longer enough. Increasingly, operations leaders and boards alike want to see AI at full-scale production—complete with measurable returns. But that’s proving to be a more difficult task than anticipated, especially in financial services. As of now, a reported 88% of enterprise AI projects stall before reaching production because their existing infrastructure can’t keep up with real-time data needs.

In the financial services sector, the gap between "having data" and "driving value" often boils down to a single factor: latency. While many institutions have spent the last decade perfecting "lakehouse" models for static data, the strongest AI use cases require a fundamental shift toward real-time data or data in motion.

A recent roundtable with experts from IBM and Cloudera explored the core challenge for leaders: understanding the imperative of this shift and choosing the right architectural partner. The discussion centered on how real-time architecture is finally mending the "broken link" in financial AI.

The Imperative for Real-Time AI in Financial Services

The driver for real-time data goes deeper than technical speed; it's about repairing a massive operational leak. Financial institutions have long tolerated "dark hours" where data sits idle, waiting for overnight batch processing. In recent years, this delay has become a competitive liability.

Focusing on Immediate ROI: The Back and Middle Office

In a recent solution brief, technology research and advisory firm Omdia explored real-time AI use cases in financial services, including:

Real-time fraud prevention and security
Customer experience and loyalty
Data ingestion, transformation, and flow management
Platform modernization and reporting

Check out the brief for more information

While consumer-facing generative AI for things like customer experience and loyalty is tempting, for many financial services companies, the most immediate ROI is being delivered in the back and middle office. These "unsexy" use cases translate directly into massive efficiency gains.

Touchless Operations: Applying real-time AI to internal financial forecasting is making processes 94-95% touchless
Massive Efficiency: Automating data aggregation for complex reporting is reducing operating expenses by 30% to 40%
Scale of Impact: For enterprise-level banks, these optimizations translate into hundreds of millions of dollars in reclaimed productivity

The Advantage of Cloudera and IBM Together: Hybrid Efficiency and Sovereignty

The increasing costs of cloud operations and intensifying regulatory scrutiny make the choice of platform a strategic pivot point for financial services. Cloudera’s approach to data sovereignty aligns closely with IBM’s, prioritizing secure, governed access over data movement. Together, they enable a federation-in-place model that allows financial institutions to access and analyze data anywhere it lives—across core banking systems, trading platforms, cloud environments, and edge channels—without moving it. This approach supports real-time insights while helping institutions meet regulatory requirements, reduce operational risk, stabilize compute costs, and maintain strict control over sensitive financial data.

Hybrid Flexibility for Cost Control

Real-time AI in financial services demands "always-on" compute to support use cases like payments processing, risk modeling, and trading operations. While cloud environments offer agility for experimentation, the total cost of ownership (TCO) for stable, high-throughput workloads like transaction processing or regulatory reporting can be significantly lower on premises. Cloudera's hybrid platform enables data and application portability so institutions can run latency-sensitive and cost-intensive workloads where they make the most financial and operational sense.

Mending the "Broken Link" with Governance

A major obstacle for AI in financial services is the difficulty data scientists and risk teams face in discovering, trusting, and governing data in motion. Cloudera addresses this by extending consistent governance, lineage, cataloging, and security controls to streaming data, ensuring that real-time data used for decisions is as auditable and trustworthy as data at rest. This is critical for meeting compliance requirements and supporting explainable AI.

AI and Model Sovereignty

Institutions are moving beyond data residency into the era of AI and model sovereignty. With Cloudera and IBM, organizations can ensure that both data and models remain within required geographic or regulatory boundaries—supporting compliance with evolving data protection and financial regulations. This approach prevents sensitive data from leaving a jurisdiction while maintaining performance. Additionally, IBM Granite models provide auditable, enterprise-grade provenance, reducing the risk associated with opaque or unverified training data.

The Path Forward: Edge AI and Event-Driven Architecture

To enable real-time decisioning, such as fraud prevention, credit adjudication, and trade validation, financial institutions need to move beyond batch processing to event-driven architectures powered by technologies like NiFi and Flink.

Edge AI: Moving decision-making closer to the point of interaction (or the "edge")—like the point of sale, an ATM, or within a mobile app—enables real-time fraud detection and transaction validation. This allows institutions to stop fraudulent activity before a transaction is completed, rather than identifying it after settlement.
Small Language Models (SLMs): Not every financial services use case requires a large-scale model. Compact models (under 10B parameters) can be deployed at the edge or within controlled environments to support use cases like customer authentication, document processing, and compliance checks, delivering lower latency, improved privacy and reduced infrastructure costs.
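
To make the event-driven idea above concrete, here is a minimal sketch of a per-event decision: each transaction is evaluated the moment it arrives, so a suspicious one can be declined before it completes rather than flagged after settlement. This is plain in-process Python, not the NiFi or Flink API, and the `Transaction` fields, the `VelocityCheck` class, and the thresholds are all hypothetical names and values chosen for illustration.

```python
from collections import defaultdict, deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Transaction:
    card_id: str
    amount: float
    timestamp: float  # seconds since some reference point


class VelocityCheck:
    """Declines a card that already has max_events approved transactions
    inside a sliding window of window_seconds (hypothetical rule)."""

    def __init__(self, max_events: int = 3, window_seconds: float = 60.0):
        self.max_events = max_events
        self.window_seconds = window_seconds
        self._recent = defaultdict(deque)  # card_id -> approved timestamps

    def on_event(self, tx: Transaction) -> str:
        """Decide on one event as it arrives, before it is settled."""
        window = self._recent[tx.card_id]
        # Evict timestamps that have fallen out of the sliding window.
        while window and tx.timestamp - window[0] >= self.window_seconds:
            window.popleft()
        if len(window) >= self.max_events:
            return "DECLINE"  # too many recent transactions on this card
        window.append(tx.timestamp)
        return "APPROVE"
```

A batch job would apply the same rule hours later to a full day of records. The difference in the event-driven version is that state is kept per card and updated on every call to `on_event`, which is roughly the role keyed state plays in a stream processor.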

Future-Proofing the AI Enterprise with Real-Time Data

The era of the "Field of Dreams" approach—building massive data lakes and simply hoping value will follow—is long over. In financial services, value is measured in proven results.

The time to act is now. Real-time data is no longer a luxury, but the essential foundation of modern banking, payments, insurance, and capital markets operations. It transforms static reporting into continuous, event-driven decisioning, enabling dynamic workflows that adapt in real time. By leveraging Cloudera’s hybrid platform and data-in-motion offerings alongside IBM watsonx for AI and aligning these technologies with clear business outcomes, financial institutions can turn real-time data into a permanent competitive advantage without losing the control, governance, and resilience this sector demands.

Cloudera Earns AWS AI Competency, Bringing Secure, Enterprise-Grade AI to Data Anywhere

Michelle Hoover — Tue, 28 Apr 2026 13:00:00 UTC

Today, most enterprises distribute their data and workloads across both cloud and data centers. Organizations require a consistent, dependable experience across their entire data estate, allowing them to deploy AI where it makes the most sense for the business without sacrificing control or flexibility.

We’re excited to share that Cloudera has earned the prestigious Amazon Web Services (AWS) AI Competency. This recognition comes at a time when many organizations are moving beyond early AI experimentation and confronting the more complex reality of operationalizing AI across distributed data environments.

The Significance of AWS AI Competency

AWS AI Competency Partners are recognized for their ability to build AI-driven solutions that deliver tangible business outcomes. Earning this designation reflects real-world experience supporting customers across the full AI lifecycle, from early development through production deployment and ongoing optimization.

To qualify, partners must demonstrate not only strong technical skills, but also a consistent track record of helping organizations operationalize AI in practical, scalable ways. That includes enabling teams to move from initial experimentation to systems that can be reliably deployed, monitored, and integrated into existing business processes.

Cloudera’s inclusion stems from its experience supporting customers in running AI in complex environments, including use cases involving generative and agentic applications. It also points to a broader shift in how organizations are approaching AI, moving toward solutions that are production-ready and aligned with enterprise requirements.

Bringing Compute to Data

Enterprises today need AI that works where their data lives and delivers real business outcomes. Achieving this AWS AI Competency validates our ability to help customers operationalize AI across hybrid environments, combining the scalability of AWS with the governance and security required to deploy agentic AI at enterprise scale.

Cloudera’s approach is grounded in its AI Anywhere strategy, which enables organizations to run AI workloads across any cloud and data center while maintaining full control over data. This includes a “compute to data” architecture, which brings AI workloads directly to governed data sources rather than moving data to centralized systems. This is especially critical for large enterprises managing sensitive datasets across on-premises environments and cloud platforms like AWS. By keeping data in place, organizations maintain strict control over security and compliance requirements.

The result is a more efficient and scalable path to enterprise AI, from optimized infrastructure and operational costs to stronger data and model sovereignty, all without compromising performance or flexibility.

Secure, governed AI across AWS and beyond

Cloudera enables organizations to leverage AWS services such as Amazon Bedrock and Amazon SageMaker while maintaining a secure, governed environment for AI development and deployment.

Through this integration, enterprises can:

Run AI across hybrid environments with a consistent, cloud-like experience
Apply fine-grained access controls and unified governance across the AI lifecycle
Ensure auditability and compliance for regulated industries
Use any model based on business needs

AI is only as effective as the data behind it. Large enterprises operate across complex, distributed environments where data spans clouds, on-premises systems, and the edge. Cloudera enables access to 100% of that data so organizations can build more accurate, impactful AI applications.

This reflects a core principle of Cloudera: delivering end-to-end control across models, agents, and inference to ensure trusted AI outcomes at scale.

Continuing the momentum with AWS

Cloudera will continue to deepen its collaboration with AWS, helping customers unlock the full value of their data and AI investments by enabling them to build, deploy, and scale AI across hybrid and multicloud environments. Together, Cloudera and AWS provide a consistent, secure foundation for running AI where data lives by combining AWS's scalability with the full control enterprises need to deliver real business outcomes.

Cloudera will also sponsor the upcoming AWS Summit series, including AWS re:Invent in November. For more information on events, visit our events page.

Bridging the Gap Between High Performance Computing and Sovereign AI: Part One of Three

Gabriele Folchi, Lama Itani — Mon, 27 Apr 2026 13:00:00 UTC

Historically, high-performance computing data analytics focused primarily on R&D for engineering and manufacturing industries, whereas operational use cases for data analytics, relying on similar big data systems, operated in isolation.

Today, the rise of generative AI (GenAI) and machine learning (ML) presents a significant opportunity to bridge these two domains. This synergy allows enterprises with both divisions to leverage their respective expertise and infrastructure investments, leading to increased productivity and a competitive edge for R&D organizations. Specifically, mechanical engineers working with high-performance computing can dramatically accelerate product development and gain deeper operational insights by employing intelligent, AI-driven compression methods (like reduced order models) trained on big data platforms.

This blog series, delivered in three parts, illustrates how and why a sovereign data lakehouse–an open data lakehouse that can operate under the sovereignty of a customer, not the jurisdiction of the infrastructure provider–is the architecture needed to scale experimental physics and AI workflows into a robust, enterprise-grade capability. We also cover why Cloudera is the go-to choice for organizations looking to merge the precision of engineering with the agility of modern data analytics.

The Basics of High-Performance Computing and Reduced Order Solvers

The Full Order Model

Understanding the mechanics of simulations is key to appreciating AI's transformative role in engineering. Traditional multi-physics simulations, such as finite element analysis (used to test real-world structural integrity) or computational fluid dynamics (used to model how air or liquid moves), work by breaking a physical structure (like a bridge) into a “mesh” or system of millions of tiny elements. The mathematical representation of these elements often takes the form of a system of interacting tensors, i.e., structured sets of numbers used to model how forces, pressure, temperature, and motion interact across the system.

The full-order model is the most detailed and physically accurate model of that system. Its physical behavior is simulated by a solver (e.g., OpenFOAM) which continuously calculates complex equations. This process calculates the changes in these tensors based on physics, including how a single element's reaction affects its closest neighbors and the system as a whole. While this offers incredible precision, it comes at a cost: these simulations are intensely computationally demanding, often requiring a supercomputer cluster to run for days just to analyze one scenario, limiting how quickly teams can iterate, test alternatives, or bring products to market.

The Reduced-Order Model

A reduced-order model is an AI-driven technique that dramatically simplifies complex simulations. It builds on advanced mathematical techniques, ranging from classic methods like singular value decomposition to modern artificial neural network architectures such as autoencoders, to approximate highly complex, non-linear systems.

At its core, a reduced-order model identifies and captures the most important, defining patterns within the massive volumes of simulated tensor data generated by a full-order model.

By distilling the problem, the reduced-order model effectively shrinks the enormous computational space into a much smaller “latent space” – a simplified mathematical representation of the system (effectively, a “digital twin”). This means that instead of a traditional solver having to process millions of complex equations, the reduced-order model might only need to solve for 50 latent variables to account for 99% of the underlying physics.
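
One common way to decide how many latent variables to keep is an energy criterion on the singular values of the simulation snapshot data. The sketch below assumes that "99% of the underlying physics" is measured as the share of total squared singular values, which is a typical convention but not something the text above specifies. It also assumes the singular values have already been computed elsewhere (for example with an SVD routine from a numerical library). The function name `modes_for_energy` is hypothetical.

```python
from itertools import accumulate


def modes_for_energy(singular_values, target=0.99):
    """Return the smallest number of leading modes whose cumulative
    squared singular values reach `target` of the total energy."""
    if not 0.0 < target <= 1.0:
        raise ValueError("target must be in (0, 1]")
    # Energy of a mode is its squared singular value; largest modes first.
    energies = [s * s for s in sorted(singular_values, reverse=True)]
    total = sum(energies)
    if total == 0.0:
        raise ValueError("singular values must not all be zero")
    for k, cumulative in enumerate(accumulate(energies), start=1):
        if cumulative >= target * total:
            return k
    # Guard against floating-point round-off when target is exactly 1.0.
    return len(energies)
```

When the singular values decay quickly, the returned count is tiny compared with the size of the original mesh, which is the effect described above: a handful of latent variables standing in for millions of equations.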

For mechanical engineers, whose daily workflow revolves around optimizing product performance, reliability, and cost across countless combinations of geometry, materials, thickness, and weight, this capability changes the pace of innovation. Their workflow is essentially a continuous sequence of what-if scenarios, drawing on both synthetic knowledge from physics-based solvers and real-world deployment data. Integrating reduced-order models into this process provides a number of significant strategic advantages, such as:

Reduced-Order Model Strategic Opportunity	Explanation	Business Impact
Rapid Iteration	Run thousands of design changes and what-if scenarios in seconds.	Cuts product development time from months to just days.
Edge Compute Deployment	Reduced-order models are small and fast enough to run directly on embedded controllers or internet of things (IoT) devices out in the field.	Enables real-time, on-device decision-making and automated control with or without cloud connectivity.
Real-Time Digital Twins	Powers a physics-informed neural network (PINN) that runs alongside the actual machine, using live sensor data to predict system behaviors and anomalies.	Shifts maintenance from fixing things after they break to proactive maintenance, reducing downtime and extending the asset’s life.

Reduced-Order Model Development: From Theory to Production

Reduced-order models deliver substantial value by accelerating engineering workflows, but successful deployment requires navigating specific technical constraints and operational realities that organizations must address systematically.

Training Data Requirements

Accurate reduced-order models require large volumes of data from full-order models. For example, building a reliable automotive crash-analysis reduced-order model requires 500 to 2000 full-order model runs across different material and geometry configurations, representing weeks of high-performance computing cluster time. Sparse training data produces reduced-order models that fail catastrophically outside tested conditions. Automated design of experiments tools help optimize which simulations to run, reducing required full-order model simulations by 30 to 40% while maintaining accuracy.
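
The text above does not say which design-of-experiments method is used, so the following is only an illustration of the general idea: Latin hypercube sampling, a widely used space-filling scheme that spreads a fixed budget of full-order runs evenly across each parameter's range. The function name and the parameter names in the usage note are hypothetical, and the sketch does not by itself demonstrate the 30 to 40% reduction quoted above.

```python
import random


def latin_hypercube(n_runs, bounds, seed=0):
    """Return n_runs design points as dicts. Each parameter's range is
    split into n_runs equal strata and every stratum is sampled once."""
    if n_runs < 1:
        raise ValueError("n_runs must be at least 1")
    rng = random.Random(seed)
    columns = []
    for low, high in bounds.values():
        width = (high - low) / n_runs
        # One random point inside each stratum of this parameter.
        points = [low + (i + rng.random()) * width for i in range(n_runs)]
        # Shuffle so strata are paired at random across parameters.
        rng.shuffle(points)
        columns.append(points)
    names = list(bounds)
    return [dict(zip(names, row)) for row in zip(*columns)]
```

Calling `latin_hypercube(8, {"thickness_mm": (1.0, 3.0), "yield_strength_mpa": (200.0, 600.0)})` would return eight configurations to hand to the full-order solver, with no two runs sharing the same slice of either parameter.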

Accuracy Trade-offs

Reduced-order model performance degrades outside training boundaries. For example, a turbine blade reduced-order model trained for 800 to 1200°C operating temperatures may produce 15 to 20% error at 1250°C. This can be addressed through ensemble modeling techniques and uncertainty quantification. When model confidence drops below predefined thresholds, automated triggers can initiate validation runs using the original full-order model.
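
As a rough sketch of such a trigger, the snippet below uses the disagreement between ensemble members as a simple proxy for model confidence and flags an input for a full-order validation run when the relative spread exceeds a threshold. The function name, the 5% threshold, and the toy surrogate functions are assumptions made for illustration. They are not real turbine physics and not a prescribed method.

```python
from statistics import mean, stdev


def predict_with_fallback(ensemble, x, max_rel_spread=0.05):
    """Average an ensemble of reduced-order models and flag the input for
    a full-order run when the members disagree too much."""
    predictions = [model(x) for model in ensemble]
    centre = mean(predictions)
    spread = stdev(predictions)  # sample standard deviation across members
    # A zero mean gives no meaningful relative spread, so treat it as uncertain.
    rel_spread = spread / abs(centre) if centre != 0 else float("inf")
    return {
        "prediction": centre,
        "relative_spread": rel_spread,
        "needs_full_order_run": rel_spread > max_rel_spread,
    }


# Toy surrogates that agree up to 1200 and diverge beyond it (illustrative only).
toy_ensemble = [
    lambda t: 0.5 * t,
    lambda t: 0.5 * t + 0.02 * max(0.0, t - 1200.0) ** 2,
    lambda t: 0.5 * t - 0.02 * max(0.0, t - 1200.0) ** 2,
]
inside_range = predict_with_fallback(toy_ensemble, 1000.0)
outside_range = predict_with_fallback(toy_ensemble, 1250.0)
```

In this toy setup the members agree exactly inside the training range, so `inside_range` is not flagged, while `outside_range` is flagged because the members diverge. A production system would calibrate the threshold against held-out full-order results.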

Validation Burden

In safety-critical environments (automotive, aerospace, energy, etc.), reduced-order model applications require rigorous validation against full-order models, often consuming significant effort (such as extensive correlation studies). That’s because regulatory bodies demand documented equivalence before approving their use.

While the validation process can be intensive, once validated, reduced-order models enable thousands of rapid iterations that would be infeasible with traditional simulation (full-order models) alone.

Skills Gap

Effective reduced-order model development requires expertise in both machine learning engineering and domain physics. A data scientist working alone may build mathematically elegant models that lack physical interpretability. A mechanical engineer working alone may struggle with hyperparameter optimization (e.g., architecture selection and model scaling). Therefore, small cross-functional teams consistently outperform larger siloed groups. It’s important to invest in training programs that teach engineers modern machine learning tools.

Edge Deployment

Real-time control scenarios require deterministic inference (<10 milliseconds latency) on embedded hardware. Not all reduced-order model architectures meet these latency and memory requirements. Deep neural networks often exceed resource budgets, while overly simplified linear reduced-order models sacrifice accuracy.

Current best practice is phased deployment:

Start with cloud-based reduced-order models for digital twin visualization and predictive maintenance.
Then deploy edge controllers only after extensive hardware-in-the-loop testing validates real-time performance.
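
A hardware-in-the-loop campaign ultimately produces a set of measured inference latencies, and the go/no-go question is whether the worst case fits the budget. The sketch below shows that check under the assumption of the 10 millisecond budget mentioned above. The function name is hypothetical, and real validation would also cover memory use and jitter under load.

```python
def meets_latency_budget(latencies_ms, budget_ms=10.0):
    """Return (ok, worst_ms). ok is True only if every measured inference
    finished strictly under the budget, because a real-time control loop
    is judged by its worst case rather than its average."""
    if not latencies_ms:
        raise ValueError("need at least one latency measurement")
    worst = max(latencies_ms)
    return worst < budget_ms, worst
```

For example, measurements of 2.1, 2.3, 2.2, and 11.7 milliseconds average well under 10 milliseconds, yet the check fails because a single 11.7 millisecond inference would miss the control deadline.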

Scaling Reduced-Order Models: From Ad-Hoc Scripts to Enterprise Machine-Learning Ops (MLOps)

While the mathematical foundation of reduced-order models is sound, the primary obstacle lies in standardizing their development and deployment across an entire organization. Currently, many R&D teams rely on a decentralized collection of Python scripts, unmanaged file systems, or proprietary vendor environments. These approaches may work for individual projects, but they fall short of governance and compliance requirements and of industry-standard open community practices.

To achieve scale, reduced-order model training must treat simulation data with the same rigorous data governance principles that are standard for handling financial records or customer data, for example.

Addressing this shift involves resolving concerns such as:

MLOps Requirement	Explanation	Business Impact
Handling Data at Scale	Scalable data pipelines and transformation tools (like Spark) pull out key features and standardize huge amounts of historical simulation data from different solvers (such as OpenFOAM).	Ensures complicated simulation data is clean, governed, and ready for reliable AI training, reducing rework and risk.
Team Experiment Tracking	Secure, shared environments (like Jupyter Notebooks) equipped with newer machine-learning experiment tracking (like MLflow) allow physicists and data scientists to co-develop code, try different AI models, and consistently tag runs with hyperparameters and metrics such as loss.	Guarantees full history and reproducibility. When a reduced-order model goes live, teams can instantly trace it back to the exact version of the model, data, settings, accuracy evaluation metrics at the time of build, and hyperparameter configuration used to get that result – critical for regulated industries.
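
To illustrate what "trace it back to the exact version" means in practice, here is a minimal stand-in for the kind of record an experiment tracker keeps. It is deliberately not the MLflow API. The function names and record fields are hypothetical, and a real tracker would also store code versions, artifacts, and environment details.

```python
import hashlib
import json


def fingerprint(payload: bytes) -> str:
    """Content hash used to pin an exact version of data or configuration."""
    return hashlib.sha256(payload).hexdigest()


def build_run_record(training_data: bytes, hyperparameters: dict, metrics: dict) -> dict:
    """Bundle what is needed to trace a deployed model back to its build."""
    record = {
        "data_sha256": fingerprint(training_data),
        "hyperparameters": hyperparameters,
        "metrics": metrics,
    }
    # Deterministic id: identical data and hyperparameters give the same id,
    # regardless of the order in which the hyperparameters were written.
    canonical = json.dumps(
        {"data": record["data_sha256"], "hyperparameters": hyperparameters},
        sort_keys=True,
    ).encode()
    record["build_id"] = fingerprint(canonical)[:12]
    return record
```

Because the id is derived from the data hash and the hyperparameters, two builds that claim to be the same can be checked mechanically, which is the property auditors in regulated industries tend to ask for.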

To Learn More, Keep Reading in Part Two!

Data Readiness to Data Reality: How Key Industries Are Rewiring Their Data Strategies

Cloudera — Wed, 22 Apr 2026 13:00:00 UTC

Data readiness is no longer just a technical ambition; it’s an operational requirement. Still, execution across industries is lagging. Data foundations weren’t built for the demands of the AI era, and while these challenges manifest differently across sectors, the mandate is consistent: organizations must rethink how they unify, govern, and access their data to bring AI to their data, wherever it lives.

Cloudera’s recent Data Readiness Index digs into what organizations need to build a solid foundation that can fuel AI at scale. The survey results show that enterprises remain constrained by structural, cultural, and governance obstacles; however, these challenges manifest differently across industries. These insights can help leaders anticipate the strategic changes needed to bridge the gap between ambition and execution.

Technology: Scaling AI Meets Data Fragmentation

Technology companies have long been some of the most AI-forward organizations, but the survey reveals that even in advanced settings, scale is exposing structural weaknesses. More than half (56%) of technology organizations report they lack full access to their data, despite significant investment in cloud and modern data platforms.

The shift to production-scale AI requires technology organizations to rethink their infrastructure. Fragmented, unreliable data systems hinder scaling because they make AI difficult to operationalize across products and teams. This is reflected in the 30% of leaders who cite data quality as the main reason AI projects fail to deliver ROI, and the 39% who say infrastructure issues always hinder operations.

In the technology sector, closing the data readiness gap involves enabling AI to run where data already lives—without requiring costly data movement. This starts with creating a unified, governed data and AI foundation across clouds, data centers, and edge environments, delivering a consistent experience while maintaining full control over distributed data.

Manufacturing: Legacy Systems Collide with Real-Time Demands

Manufacturing companies are always pushing to streamline operations across the product lifecycle, but fragmented data prevents full optimization of these efforts. Forty-two percent of manufacturing organizations cited siloed data as preventing teams from using their data effectively, and over half (52%) still lack full access to their data. Clearly, access is a central barrier to achieving data readiness, and operational complexity is compounded by isolated and unreachable data. The operational task of closing the gap between data ambition and execution requires making sure teams can access 100% of their data across environments, not just isolated subsets.

For manufacturers, production uptime, predictive maintenance, and supply chain continuity all depend on timely and reliable data. Equally important is investment in data integration and standardization layers, addressing the 20% of manufacturers who cite weak workflow integration as the primary reason data initiatives fail to deliver ROI. By focusing on scalable data pipelines and industrial platforms that operate across facilities, a unified, real-time infrastructure embedding data into core workflows can become a reality.

Energy & Utilities: Governance Becomes the Gatekeeper of Scale

Highly regulated environments, like those faced by IT leaders in the energy and utilities industry, require a careful balance between innovation and control. Regulatory compliance and grid reliability are both at stake, as energy and utilities organizations must ensure that data is not only accurate and secure, but also consistently governed across highly distributed environments. Energy and utilities organizations show relatively strong governance maturity, with 65% reporting that all or almost all their data is governed.

On the other hand, 25% cite cost overruns as the main reason data initiatives fall short of ROI, pointing to the financial and operational challenges of modernizing data infrastructure in highly regulated and distributed settings. Strict regulatory requirements need complete visibility and control over data, while real-time grid operations rely on timely, reliable data to balance supply and demand, prevent outages, and handle disruptions. Any gaps in accessibility can lead to security and compliance threats.

Energy and utilities operate in environments where every decision carries regulatory, financial, and public safety implications. That means data must be accessible, auditable, and secure across every system it touches.

Telecommunications: Complexity at Scale

Massive, distributed telecom environments create complex data and high stakes. Maintaining performance is one of those stakes: it requires real-time monitoring and quick adjustments, and lapses directly affect the customer experience. Issues like dropped calls, slow data speeds, and service interruptions quickly translate into customer dissatisfaction and churn. Telecom environments generate massive volumes of streaming data, and without the ability to process and act on data in real time, both network performance and customer experience suffer.

Telecommunications organizations lead in several areas of data readiness, with 54% reporting full visibility into their data and 51% able to access it across environments. They also report the highest level of fully governed data, with one-third (33%) of respondents reporting fully governed data environments. Yet despite this maturity, 60% say infrastructure performance consistently hinders operations—by far the highest of any industry surveyed. Scale and complexity, not access, are now the primary barriers, and data latency is an operational risk.

To overcome the gap between data readiness and operational performance, telecommunications organizations should invest in infrastructure built for speed, scale, and continuous processing. When latency directly impacts service quality, the solution is to enable telecom providers to automate network operations so that their experts can deliver consistent, high-quality customer experiences.

The Bottom Line

Across various sectors, a common theme appears: organizations need to implement data effectively at scale. Data readiness enables organizations to bring AI to their data wherever it lives, unlocking the full value of 100% of their data across clouds, data centers, and edge environments. Cloudera’s Data Readiness Index demonstrates the opportunity for organizations to invest in data readiness now, ensuring they are well-positioned to lead in an AI-driven future.

How confident are you in your data readiness? Read the full report to gain deeper insights into how global organizations are approaching the data foundations that enable AI at scale.

Beyond the Notebook: Architecting Data Readiness for Production-Grade AI

Robert Hryniewicz — Tue, 21 Apr 2026 13:00:00 UTC

Gartner predicts that 60% of enterprise AI initiatives will be abandoned before reaching production. This attrition rate is rarely a failure of model parameters or raw compute availability; rather, it is a structural failure of data readiness.

Organizations frequently encounter a bottleneck when attempting to bridge the gap between fragmented, siloed raw data and a production-grade AI pipeline. Without a unified data foundation, the transition from experiments to AI systems running live, production workloads remains blocked by legacy infrastructure debt.

Architectural Foundation: The Open Data Lakehouse

Solving the data-readiness deficit requires an architectural transition to an Open Data Lakehouse that functions across the entire data estate. By maintaining data in an open format (like Apache Iceberg), enterprises avoid the high Total Cost of Ownership (TCO) of proprietary storage. This ensures that massive datasets remain queryable and AI-ready without redundant replication.

Unified Governance with Shared Data Experience (SDX)

Security and governance are the primary inhibitors to AI speed-to-market. Standard protocols usually break when moving across disparate compute environments. Cloudera Shared Data Experience (SDX) addresses this risk by decoupling security policies from the underlying engines—ensuring that governance follows AI models and data.

The Three-Phase Path to Production

Phase 1: Validating Business Value with RAG Studio

To avoid high-cost project abandonment, organizations must pivot from speculative development to rapid validation. Cloudera RAG Studio allows developers to iteratively test different embedding models and LLMs against data. This quantifies retrieval accuracy before committing to full-scale production infrastructure.
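
"Quantifying retrieval accuracy" usually comes down to a few standard metrics computed over a set of test questions with known relevant documents. The sketch below shows two of them, recall at k and mean reciprocal rank, in plain Python. It is not the RAG Studio API. The function names and document ids are hypothetical, and the choice of metrics is an assumption about what a team might measure.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of relevant document ids that appear in the top-k results."""
    relevant = set(relevant)
    if not relevant:
        raise ValueError("relevant set must not be empty")
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant result, or 0.0 if none is retrieved."""
    relevant = set(relevant)
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def evaluate(runs, k=3):
    """Average both metrics over (retrieved, relevant) pairs, one per query."""
    recalls = [recall_at_k(ret, rel, k) for ret, rel in runs]
    rranks = [reciprocal_rank(ret, rel) for ret, rel in runs]
    return {"recall@k": sum(recalls) / len(runs), "mrr": sum(rranks) / len(runs)}
```

Running `evaluate` once per candidate embedding model on the same question set gives a like-for-like comparison, which is the kind of evidence worth having before committing to production infrastructure.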

Phase 2: Optimization with Synthetic Data Studio

Data scarcity and stringent privacy constraints for personally identifiable information (PII) frequently stall LLM fine-tuning cycles. Cloudera Synthetic Data Studio addresses this bottleneck by generating statistically representative datasets that mimic production data without exposing sensitive information. This lowers engineering costs and accelerates training without compromising compliance.

Phase 3: Operationalizing Intelligence with Agent Studio

Simple chatbots are no longer enough. The goal is autonomous business processes: AI that can “do” rather than just “talk.” Cloudera Agent Studio provides the framework to define workflows, tool-calling logic, and multi-step feedback loops, turning models into functional agents capable of complex reasoning.

Accelerating the Baseline: AI Accelerators

For organizations requiring rapid time-to-value without the overhead of building bespoke pipelines, Cloudera AI Accelerators (aka AMPs) provide end-to-end reference architectures. These include pre-configured data ingestion scripts, containerized model configurations, and UI components for high-impact use cases like churn prediction or agentic security analysis. What used to take months of engineering now takes days.

Infrastructure Portability: Avoiding the “Cloud Tax”

The primary architectural advantage of Cloudera AI is the decoupling of workflows from specific infrastructure providers. By maintaining a consistent data and tool layer across multi-cloud VPCs and on-premises data centers, enterprises avoid the "cloud tax" and egress penalties associated with proprietary data and AI stacks. This portability ensures that the cost per AI inference remains predictable—avoiding token-driven cost spikes—as workloads transition from experimental dev-test environments to global production.

The Path to Production-Grade AI

The journey to ROI shouldn't be blocked by fragmented data or proprietary silos. By combining a unified governance layer with specialized tools for RAG and synthetic data generation, model training and inference at scale, agent orchestration and more, Cloudera AI brings AI to the data with a clear, governed path to production-grade intelligence.

Learn more

Cloudera vs Snowflake vs Databricks: Which Federation Model Best Supports Enterprise AI?

Navita Sood — Mon, 20 Apr 2026 13:00:00 UTC

AI is forcing enterprises to confront a project they’ve deferred for years: fragmented data estates.

Fragmentation used to be an inconvenience. Sure, it took a few extra steps—and a few extra days—to pull reports across regions or departments. The IT team might have to step in to reconcile discrepancies. But none of that was enough of a disturbance to be a deal-breaker.

Until now.

Why Data Federation Matters Now

In an AI context, a splintered data estate means:

Models trained on incomplete context
Agents making decisions with stale or invalid data
Governance policies applied inconsistently across environments

It means duplication, latency, and blind spots at exactly the moment enterprises are trying to operationalize AI at scale.

In other words, fragmentation is suddenly a deal-breaker.

In our previous post, we explored why unified, governed data access is the foundation for trusted AI, and why consolidation alone is not the answer. Centralizing data (i.e., moving it all into one physical location) may sound clean in theory, but in practice, it introduces operational trade-offs that enterprises can no longer afford. Click here to read why.

The alternative is federation—enabling organizations to operate as if their data is unified. But there’s a nuance many buyers are now discovering:

Not all federation strategies are created equal.

Two Competing Federation Strategies: Centralize First or Federate Where Data Lives

Most vendors use the term “federation” to describe a benefit of their data and AI platform (i.e., allowing organizations to use all of their data to run analytics and AI), but they don’t always mean the same thing by that term. When evaluating a platform, it’s critical to understand exactly what each vendor is offering and how well it aligns with your needs before you overcommit.

Generally speaking, there are two dominant approaches on the market today: consolidation-first federation and federation-in-place (often referred to as data virtualization).

Model 1: Consolidation-First Federation (Databricks’ and Snowflake’s Approach)

The first federation model is what’s known as a ‘consolidation-first’ approach—federation becomes possible after you’ve consolidated data into the vendor's cloud environment or inside their governance model. If you want cross-system access, that typically means regularly copying or ingesting data into their platform.

Put simply, it is federation because you can analyze all your data in one place. But you have to move everything into their house first.

For enterprise leaders, there are tangible implications to this approach, including:

Higher storage and data processing costs
Increased data duplication
Governance policy and permissions replication across systems
Greater compliance and audit complexity

In other words, the more places your data goes, the more expensive and harder to secure it becomes. For cloud-native companies, this approach may be acceptable. But for hybrid, regulated enterprises, it introduces friction that compounds over time.

Model 2: Federation-in-Place (Cloudera’s Approach)

The alternative federation model, championed by Cloudera, takes a fundamentally different stance: bring compute and AI to the data, no matter where it lives, instead of forcing the data to move.

Federation-in-place brings data together logically rather than physically, so teams can access and analyze it where it already lives—across public, private, and on-premises environments—without copying it into another platform first.

It sounds like a subtle difference, but in practice, it changes everything:

Lower infrastructure and storage costs by minimizing unnecessary data movement
Less duplication across environments
Greater flexibility across multi-cloud and on-prem architectures
Reduced exposure to cloud concentration risk
Single security and governance model with end-to-end lineage across all your data, wherever it lives

As a result, your data stays where it makes the most sense for regulatory, operational, or performance reasons, and your teams still get a complete, real-time view across it.

What Federation-in-Place Enables That Consolidation-First Models Can’t

When federation works across hybrid environments without replication (i.e., federation-in-place), it creates conditions that consolidation-first models struggle to match. That distinction changes the risk profile of your entire AI strategy outside of cloud-only environments.

1. Zero-Redundancy Security

In consolidation-first models (offered by vendors like Databricks and Snowflake), data may appear unified, but it still exists in multiple environments. It is copied, ingested, or replicated into a vendor-controlled platform before it can be analyzed. Every additional copy expands the compliance surface.

More environments mean more permissions to manage, more policies to synchronize, and more audit scope to reconcile. As replication grows, so does governance complexity.

Federation-in-place models, like Cloudera’s, leave the data where it is. As such, governance policies are defined once and enforced consistently everywhere. Instead of recreating permissions across systems, a single, consistent control plane governs access across hybrid environments. At Cloudera, we call it governance that moves with your data.

Think of it like a global corporate badge system. You wouldn't want to issue a new security badge every time an employee visits a different office. Access permissions are defined centrally, and that same badge works across headquarters, regional offices, and data centers, enforcing the same security rules everywhere.

You define the rules once, and every door recognizes them—even in different locations. That’s zero-redundancy security, and it’s a huge advantage for risk containment because complexity doesn’t multiply as your environment grows.
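To make the "define once, enforce everywhere" idea concrete, here is a minimal sketch in Python. It is not Cloudera's implementation or API. The `Policy` class, the `POLICIES` list, the environment names, and both functions are hypothetical names chosen for illustration. The point is only that a single rule set answers access questions for every environment, so nothing has to be re-declared per site.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    """One access rule, defined once in a central place (hypothetical model)."""
    role: str
    dataset: str
    allowed_actions: frozenset


# Hypothetical central policy store: the single source of truth.
POLICIES = [
    Policy("analyst", "claims", frozenset({"read"})),
    Policy("steward", "claims", frozenset({"read", "write"})),
]

# Hypothetical environments where the same dataset is reachable.
ENVIRONMENTS = ["on_prem", "aws", "azure"]


def is_allowed(role, dataset, action):
    """Evaluate a request against the central policy list."""
    return any(
        p.role == role and p.dataset == dataset and action in p.allowed_actions
        for p in POLICIES
    )


def check_everywhere(role, dataset, action):
    """Answer the same question for every environment from the same rules."""
    return {env: is_allowed(role, dataset, action) for env in ENVIRONMENTS}


print(check_everywhere("analyst", "claims", "read"))
print(check_everywhere("analyst", "claims", "write"))
```

In a replicated setup, each environment would carry its own copy of `POLICIES`, and the copies would have to be kept in sync. Here a change to one rule is visible everywhere at once.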

2. End-to-End Lineage Across Hybrid Sources

Across industries, AI is taking on more responsibility, and with that comes a growing need for accountability and explainability.

When AI influences credit approvals, fraud flags, pricing decisions, or supply chain adjustments, for example, every output must be defensible. Regulators, auditors, and executive leadership increasingly expect to see not just the result, but the full path that produced it.

In hybrid enterprises, that path rarely lives in one environment. Data may originate on premises or at the edge, be enriched in a public cloud, joined with SaaS data, and consumed by a model running elsewhere. Traceability across that reality is non-negotiable.

Consolidation-first federation approaches attempt to simplify lineage by centralizing data. But in practice, replication creates parallel histories: original datasets in source systems and transformed copies in analytical environments. Over time, explaining a decision may require reconciling multiple versions of the same data across systems. Lineage becomes something you’d have to reconstruct.

With federation-in-place integrated into data lineage capabilities (like Cloudera’s data lineage tools), that’s a non-issue. Because data is accessed where it lives (rather than replicated into a separate environment), lineage remains anchored to the original source.

That distinction matters most in hybrid and edge-dependent workflows. With a federation-in-place approach, you can rest assured that if a regulator or new CRO shows up years from now asking how a specific decision was made, the answer won’t be buried in a black box that needs deciphering. It’s documented, traceable, and defensible.
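As a rough illustration of lineage that stays anchored to original sources, the sketch below models a decision as a small graph and walks it back to the systems where the data originated. It is a toy model under stated assumptions, not a description of any Cloudera lineage tool. `LineageNode`, `trace_sources`, and every dataset and location name are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class LineageNode:
    """A dataset, transformation output, or model decision (hypothetical model)."""
    name: str
    location: str
    inputs: list = field(default_factory=list)


def trace_sources(node):
    """Walk upstream and return (name, location) for every original source."""
    if not node.inputs:
        return [(node.name, node.location)]
    sources = []
    for parent in node.inputs:
        for source in trace_sources(parent):
            if source not in sources:
                sources.append(source)
    return sources


# Hypothetical path: two sources, one enrichment step, one model decision.
core_banking = LineageNode("core_banking", "on_prem")
bureau_feed = LineageNode("bureau_feed", "saas")
enriched = LineageNode("enriched_profile", "public_cloud", [core_banking, bureau_feed])
decision = LineageNode("credit_decision", "model_serving", [enriched])

print(trace_sources(decision))
```

Because every node points at the data it actually read, answering "where did this decision come from" is a graph walk rather than a reconciliation of parallel copies.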

3. A Stronger Foundation for Real-World AI Systems

In consolidation-first models, AI operates inside the environment where data has been centralized. That works, as long as data movement keeps pace with operational reality. In hybrid enterprises, it rarely does.

When AI is responsible for real-world outcomes like dynamic pricing or supply chain adjustments, it must operate within live, distributed systems, not downstream analytical copies. Every replication step adds a dependency, which introduces ingestion latency and the potential for drift between the operational systems and the AI models that rely on them.

Federation-in-place, on the other hand, keeps AI aligned with operational reality. Context stays current, which supports operational AI use cases outside the cloud that a consolidation-first federation strategy would struggle to keep up with.

Operational AI in Practice: Logistics Industry

To see why all of this matters in practice, let’s walk through an example. Consider a global logistics company deploying AI to optimize delivery routes in real time. A single routing decision may depend on:

Driver availability data from a workforce management system
Real-time GPS feeds from vehicles
Traffic and weather data from external APIs
Inventory availability across regional warehouses
Fuel efficiency metrics from IoT sensors
Local regulatory constraints or union rules

If that AI model is operating on snapshots copied to a single cloud days, or even hours, earlier, it’s making decisions with partial context. It might reroute drivers without accounting for updated inventory levels or optimize for speed without factoring in regional compliance constraints. It might rely on outdated telemetry from vehicles already off the route.
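The staleness risk in this example can be sketched as a simple freshness check that a routing service might run before trusting its inputs. This is an illustrative sketch only. The input names, the freshness budgets in `MAX_AGE`, and the `stale_inputs` function are all assumptions, and real limits would depend on the use case.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical freshness budgets per input; real limits depend on the use case.
MAX_AGE = {
    "gps": timedelta(seconds=30),
    "inventory": timedelta(minutes=5),
    "weather": timedelta(minutes=15),
}


def stale_inputs(last_updated, now):
    """Return the names of inputs whose age exceeds their freshness budget."""
    return sorted(
        name for name, ts in last_updated.items()
        if now - ts > MAX_AGE[name]
    )


# Hypothetical snapshot: GPS is live, but inventory and weather were copied earlier.
now = datetime(2026, 4, 20, 12, 0, tzinfo=timezone.utc)
snapshot = {
    "gps": now - timedelta(seconds=10),
    "inventory": now - timedelta(hours=6),
    "weather": now - timedelta(days=2),
}

print(stale_inputs(snapshot, now))
```

With the sample values above, the inventory and weather inputs exceed their budgets, so a decision made on this snapshot would rest on partial context.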

When AI systems can safely access distributed data where it already lives with zero-redundancy security and full lineage visibility, organizations unlock fully operational AI that acts in real time, works within policy boundaries, and scales across environments without adding risk.

How to Choose a Federation Vendor: Questions Every Enterprise Should Ask

As we’ve explored, not all federation strategies are built for the same outcome.

Some prioritize consolidation, and others prioritize hybrid flexibility and governed access. When evaluating Cloudera vs. Databricks vs. Snowflake (or any data federation solution or combination thereof), these questions help surface the real differences:

Does federation require data movement? Can you access data where it already lives, or will it need to be copied into a centralized cloud first?
Where are governance policies defined? Are access controls set once and inherited everywhere, or recreated across systems?
Is hybrid treated as permanent? Does the architecture support on prem and multi-cloud long term, or does it assume eventual consolidation?
Can lineage extend beyond the vendor’s environment? Is traceability end-to-end across distributed sources, including non-native systems?
Is the platform designed for operational AI anywhere? Can AI safely access live, governed data in real time, or only centralized snapshots?

The answers to these questions will help you determine whether federation will become a convenience feature centered on analytics use cases, or the long-term foundation for trusted, cost-controlled, enterprise-scale AI.

Federation Only Works If It’s Architected Intentionally

Designing a federated environment means looking under the hood—aligning governance models, regulatory constraints, performance requirements, and existing integrations while connecting systems in a way that supports long-term flexibility.

Cloudera’s Professional Services & Training (PS&T) team has guided organizations across industries through this process countless times. Whether establishing a new federation strategy or optimizing an existing environment, having experienced advisors on your side can help ensure your federated environment is not only set up correctly, but is also truly AI-ready and built to deliver measurable outcomes.

Keep Reading: How Federation Works in Financial Services

The choice between consolidation-first and federation-in-place determines whether AI stays in pilot mode or scales safely into operations.

Nowhere is that more critical than in financial services, where fraud detection, risk management, and regulatory reporting depend on fresh, cross-system data. In our next article, we’ll explore how federation is reshaping real-time analytics and AI governance in banking.

The AI Moment Is Here, But Are Organizations Data Ready?

Cloudera — Thu, 16 Apr 2026 13:00:00 UTC

As AI, analytics, and real-time decision-making reshape how businesses compete, data readiness has emerged as a critical prerequisite for turning ambition into impact. Yet while organizations are eager to unlock value from their data, many are discovering a hard truth: their foundations weren’t built for the demands of the AI era.

To identify the missing pieces to the data puzzle, Cloudera surveyed over 1,200 IT leaders across 14 countries to examine how prepared organizations are to translate data into business value across all areas of the enterprise. The results revealed that now more than ever, data is firmly established as a strategic priority, with strong executive buy-in and increasing investment across the board.

But beneath that momentum lies a more complex reality. While most organizations recognize the importance of data readiness, significant structural, cultural, and governance challenges continue to limit progress. The following findings point to a widening gap between aspiration and execution that will ultimately define which organizations can successfully scale AI and which will fall behind.

Data Readiness Is a Strategic Asset

Data readiness is a core enabler of competitive advantage in the AI era, and this belief is evident in strong executive alignment. Eighty-nine percent of respondents state that senior leadership understands and prioritizes the data infrastructure required to enable AI at scale, a clear signal that data conversations have entered the boardroom.

With this alignment comes a tighter connection between data and business outcomes. Eighty-six percent of respondents reported that their organizations have well-defined data strategies tied to business objectives. To enable those strategies, 86% of organizations are increasing cloud spend for data infrastructure, reflecting a widespread push toward more scalable, flexible architectures capable of supporting advanced analytics and AI workloads.

This stage in the AI adoption cycle is also marked by experimentation and openness to change. Nearly all organizations (94%) report a willingness to adopt or evolve governance frameworks, an important signal that enterprises understand the need to balance innovation with control, trust, and compliance.

What’s Holding Data-Driven Organizations Back

Even as ambition, alignment, and investment reach new highs, the path to true data readiness remains uneven. Despite growing investment, the survey suggests that aspiration is still ahead of execution, and organizations still face deep structural challenges.

The necessary data exists, but people can’t easily find or access it, and organizational silos slow collaboration. More than one-third (34%) of respondents said siloed data was a top issue preventing them from collaborating, sharing, managing, and using data effectively. Data silos can persist because data is not well integrated across enterprise systems. Most reported that their data sources were somewhat integrated across different environments, but significant gaps remain. Only 30% of IT leaders stated that their data sources were fully integrated, while 52% said they were mostly integrated. While this represents progress, the gap indicates that many enterprises are still not fully equipped to support large-scale AI initiatives.  

IT leaders also cited a host of other barriers to collaboration with data, including complicated access requirements and processes (47%), limited visibility into where data resides (44%), insufficient training and data literacy (41%), and cultural resistance to data sharing (34%). Clearly, there is more than one obstacle blocking the path to full data readiness, and enterprises must account for each one to cross the finish line.

The Data Paradox: Investment vs. Readiness

The survey reveals a paradox: companies are investing heavily in data platforms and AI, yet they still struggle with governance and access complications. Although only 20% of respondents answered that all of their data is governed, 90% responded that most of their data is governed, which looks strong on paper. However, this contrasts with the 80% who said their data initiatives are hindered by a lack of access to all the necessary data. Even when organizations believe their data is largely governed, that governance lacks the accessibility and integration needed to support real-world use cases. As a result, data may be technically “governed,” but still fragmented and difficult to discover, therefore limiting its value.

Technology adoption alone doesn’t guarantee data readiness. Although the survey noted strong governance adoption, data access remains a critical bottleneck. Nearly a quarter of respondents (24%) lack full confidence in accessing their enterprise data, meaning that even in relatively mature environments, universal data access is not guaranteed.

The name of the game with data readiness is cohesion and accessibility. Until organizations bridge this divide, investments in AI and advanced analytics will continue to underdeliver, constrained by the practical realities of getting the right data into the right hands at the right time.

The Next Competitive Frontier

The solution to this paradox isn't just about collecting more data. It depends on organizations being able to manage, access, trust, and collaborate on the data they already have.

Data readiness is crucial for unlocking AI’s full potential. It involves more than just having data; it requires using the entire dataset, wherever it’s stored, to gather valuable insights and build AI capabilities that support strategic goals. Cloudera’s Data Readiness Survey clearly shows the opportunity for organizations to invest in data readiness now to be best prepared to lead in an AI-driven future.

Cloudera supports enterprise organizations as they prepare their data for an AI-driven future. To learn more about accelerating your data readiness journey, visit our website.

Scaling Opportunity and Empathy: A Conversation with CHRO Amy Nelson

Debbie Kruger — Wed, 15 Apr 2026 13:00:00 UTC

At Cloudera, we believe in the power of turning theory into action, and that extends far beyond technology to the fabric of our everyday culture. As we continue our celebration of allyship this month, this is a perfect moment to reflect on what that looks like in practice, showcasing the initiatives that empower and connect our employees.

To explore how growth and mentorship come to life across the organization, we spoke with Amy Nelson, Cloudera’s Chief Human Resources Officer. Amy believes great companies are built by strong, connected communities. At Cloudera, she leads the people strategies that bring that belief to life, spanning workforce planning, leadership development, and inclusion and engagement. In our conversation, she shares how Cloudera is building meaningful development pathways for all Clouderans, while championing a culture grounded in connection, compassion and community.

Here’s Amy's take.

Tell us a bit about the upcoming mentorship program at Cloudera. What excites you about this initiative? What prompted the team to explore launching this program?

We are incredibly excited to pilot our first company-wide mentorship program. While we have seen strong success with organic, team and location-based mentorship efforts over the years, our Culture Survey has consistently identified a clear opportunity. Our employees are looking for more structured, accessible pathways for growth and connection.

What makes this initiative especially meaningful is its ability to operate at scale in a globally distributed environment. This program is a direct response to what our employees have told us they need, so it’s an investment in developing our people and reinforcing a learning culture.

How do you define successful mentorship? How do you measure its impact, individually and organizationally?

At Cloudera, we define successful mentorship through a one-size-fits-one approach to development. We provide the structure and access, but success is ultimately measured by the individual.

We measure impact on two levels. At an individual level, we look at progress against self-defined career milestones, along with shifts in confidence before and after the program. At the organizational level, we focus on broader outcomes, including engagement as reflected in our Culture Survey and stronger cross-functional and global connectivity across the company.

How have you seen cross-department or cross-level mentorship impact performance and growth at Cloudera?

We’ve seen that some of the most meaningful growth happens at the intersections of our business. For the past five years, our Sponsorship Program has intentionally paired high-potential talent with senior leaders outside their immediate functions. That cross-functional exposure often brings new perspectives and unlocks opportunities that wouldn’t emerge in more siloed environments.

We’ve seen these connections accelerate development and broaden leadership capabilities. Just as importantly, our data shows a clear link between these experiences and sustained improvement in employee engagement. When people feel supported and connected beyond their immediate team, they perform their best.

This focus on mentorship, support, and growth comes at the perfect time, as we celebrate Allyship April. How have you seen allyship in action at Cloudera?

At Cloudera, allyship is something we practice every day. It’s embedded in how we lead, collaborate, and support one another. We see it come to life through our Employee Resource Groups, which hosted over 60 global events this past year to foster connection and education.

It’s also reflected in how we operate as a company. From our continued Fair Pay Workplace recertification to achieving a top score on the Corporate Equality Index and recognition as a Best Place to Work for Disability Inclusion, we hold ourselves accountable to building an environment that is equitable and accessible.

For us, allyship is about consistent action. It’s how we ensure every employee feels seen, supported, and empowered to contribute, and how we translate our values into measurable impact across the organization.

How do company-wide groups and programs, like the ERGs, influence an inclusive environment?

At Cloudera, our Employee Resource Groups (ERGs) are a cornerstone of how we bring inclusion to life. In a globally distributed organization, silos can naturally emerge. ERGs can break those down by creating meaningful communities that connect employees across regions, functions, and backgrounds. More importantly, they give employees a voice, shaping how we think about policies, programs, and the overall employee experience.

ERGs act as catalysts and compasses, pushing us to continuously raise the bar. They help ensure inclusion goes beyond aspiration to something our employees genuinely experience every day.

Cloudera has been certified a Fair Pay Workplace for a third consecutive year. What makes this achievement important to Cloudera, and how does the organization work to uphold those standards?

At Cloudera, creating a workplace where people feel valued and respected starts with how we approach pay. We bring the same data-driven rigor that defines our business into our compensation practices, ensuring decisions are consistent, transparent, and grounded in measurable impact.

Earning Fair Pay Certification for the third consecutive year is a meaningful validation of that commitment. It reflects the discipline we’ve built into our processes, from regular audits and governance to clear frameworks that support equitable outcomes at scale. More importantly, it reinforces a core belief: when you lead with data and accountability, you create a foundation that sustains fairness over time.

What is a lesson you’ve learned about building a people-centric tech organization that might be surprising?

What may be surprising is that data doesn’t replace empathy; it actually scales it. It helps us identify where to lean in, but it’s the human side, specifically through listening, context, and real conversations, that turns insight into meaningful action. Tools like our Culture Survey can tell us what is happening, but they don’t tell us why or how to respond.

Building a truly people-centric organization is about striking that balance. We use data to surface the opportunities, but it’s the stories and experiences behind the data that ultimately drive better decisions and stronger outcomes.

Discover how Cloudera empowers employees to thrive in an environment rooted in allyship and inclusion, and check out career opportunities at Cloudera.

Inside Cloudera IMPACT26: How Partners Are Driving Enterprise AI Anywhere

Natascha Lee — Fri, 10 Apr 2026 13:00:00 UTC

Attendees at our Virginia Watch Party

Collectively, our partner ecosystem is enabling organizations to build and run data and AI solutions anywhere (across public clouds, private infrastructure, and the edge) while maintaining the governance, security, and control required for enterprise-scale AI, without compromising flexibility or scale.

This is what “AI Anywhere” looks like in practice: a coordinated effort across partners to help customers operationalize AI with confidence, no matter what their data architecture or tech stack looks like today.

To learn more about Cloudera’s partner ecosystem, visit Cloudera.com/partners.

2025 was a year of meaningful momentum for Cloudera’s partner ecosystem, and nowhere was that more evident than at IMPACT26, our annual Partner Kickoff event.

IMPACT26 was primarily virtual, but Clouderans and partners tuned in from Watch Parties around the world, creating a shared energy felt across every session and conversation. These parties served as hubs of forward-thinking leaders across the Cloudera partner ecosystem, coming together to exchange ideas, spark innovation, and collectively shape a better future for each other, the industry, and our customers navigating rapid technological change—especially as organizations look to operationalize AI across increasingly complex, distributed environments.

Our conversations reflected a shared, deep understanding of what it takes to turn vision into tangible customer outcomes with AI in the real world.

The event created an opportunity to reflect on what we’ve built together so far and, more importantly, what comes next. This year, it brought together more than 800 attendees across every region, exhibiting the global scale and momentum of Cloudera’s partner ecosystem.

That spirit continues to sharpen into focus: enterprise AI will be led by north-star aligned partnerships, and that future is already taking shape at Cloudera.

“Through our joint innovation with Cloudera, we’re tightly integrating Dell’s industry‑leading storage with Cloudera’s data platform to bring AI directly to where the data lives,” said Travis Vigil, senior vice president, ISG Product Management, Dell Technologies. “Together, we’re helping customers operationalize AI at scale with the security, performance, and confidence required for production environments.”

Working Together with Partners Around the Globe for AI Anywhere

A key theme across these sessions was a shared focus on “AI Anywhere”—the idea that AI, when used to its full advantage, cannot be confined to a single platform, environment, or use case. It must be able to operate anywhere data lives.

In practice, a fragmented data estate creates real challenges: AI agents making decisions based on stale or incomplete data, governance policies applied inconsistently across environments, and patchwork fixes to address duplication, latency, and blind spots—hindering enterprises working to operationalize AI at scale.

Designing a unified AI environment means going deeper: bringing partners together to align on integrations that allow governance models, regulatory requirements, performance standards, and existing systems to work seamlessly across the board for long-term flexibility and resilience.

Executive presentations helped clarify this vision, aligning leaders on what it takes to operationalize AI across hybrid environments while maintaining consistency, control, and trust.

Attendees at our Mexico City Watch Party.

Cloudera’s IMPACT26 Global Honorees

Cloudera’s partners are at the forefront of turning data into value and delivering AI anywhere, and we’re proud to celebrate those making a lasting impact across the industry.

At IMPACT26, that influence was recognized through the 2026 Global Partners of the Year Awards. The awards recognize organizations that have driven meaningful customer outcomes through technical excellence and innovation across the Cloudera ecosystem. This year’s global honorees are:

Amazon Web Services (Cloud Partner of the Year)
IBM (OEM Partner of the Year)
AMD (Technology Partner of the Year)
NVIDIA (AI Partner of the Year)
Dell Technologies (IMPACT Partner of the Year)
Protegrity (ISV Partner of the Year)

These partners demonstrate what’s possible when complementary strengths are brought together to solve real customer challenges. From bringing AI directly to where data lives, to enabling enterprise-grade performance, to embedding security into every layer of the data lifecycle, these collaborations are turning strategy into execution.

Cloudera’s IMPACT26 Regional Partners of the Year

The same energy seen with our global partners was equally present among our regional and specialized partners.

“Receiving Cloudera’s AMER Partner of the Year award is a great honor for Compwire and a reflection of the strong, trusted partnership we have built together,” said Ricardo Vinicius de Godoi, Director of Business Solutions at Compwire. “Together, we are helping organizations accelerate their data and AI initiatives in Brazil, and we look forward to continuing to drive innovation, customer value, and growth with Cloudera.”

Cloudera’s Regional Partners of the Year are:

Compwire (AMER)
IBM Consulting (APAC)
Puedata (EMEA)
ThunderCat Technology (Public Sector)

Every day, their local expertise helps customers accelerate their data and AI initiatives in ways that are both practical and scalable, grounded in the realities of specific markets and industries.

Our Emerging Partners of the Year are:

Codename37 (AMER)
Novare Technologies (APAC)
Engineering Ingegneria Informatica S.p.A. (EMEA)

These partners play a critical role in extending our reach across regions, environments, and use cases.

Vibe Coding and Cloud Accountability with David Linthicum

Cloudera — Thu, 09 Apr 2026 13:00:00 UTC

In episode 65 of The AI Forecast, "The Vibecoding Liability: How Unchecked AI Can Kill Cloud ROI," David Linthicum joins host Paul Muller to reveal the hidden costs of hybrid and multi-cloud environments and explain why cloud governance and resilience have become boardroom priorities.

As high-profile cloud outages expose hidden dependencies and single points of failure, IT leaders must rethink resilience, data management, and accountability across hybrid cloud environments.

Here’s what stood out from Paul and David’s conversation:

The Core Differentiator: Reliability Versus Resilience

Paul: Resilience is a funny word because I think people often equate resilience with reliability, and there's a really big difference, isn't there?

David: There is. I mean, resilience is your ability to not let these disasters stop your processing and your business. In other words, what are plan A, plan B, and plan C? How resilient and fault-tolerant is this going to be? Reliability is basically about a component: how well it's going to maintain itself, and if it's going to fall out of place, how to recover from that. Resilience is your responsibility, reliability is not. Typically, if it's with a cloud provider, it's their responsibility, but you'll still be affected. You're going to be paying the bill. There's no credit you're going to get from these cloud providers when they go down.

Paul: Resilience is an architectural artifact, not a consequence of a component, isn't it? It's how you design your system. It goes back to that enterprise architecture.

David: It's all architecture, and it's on the application and enterprise layers. You've got to build and plan for resiliency. It won't happen automatically, and it’s not contained in the clouds. That's where people were surprised. They thought they would be completely resilient to any issues they have, but now they realize they're fallible like everybody else. Part of building an AI system, an enterprise architecture, or any kind of architectural planning is about resilience. It's as important, if not more important, than security, governance, and the other things we have to go through. It has to be operationalized so you can actually prove, with metrics, that this thing won't stop the business from processing if the worst happens. And you basically have to spend the money and time to figure that out.

If you don't have resilience, you won't be able to recover from these kinds of things.

Accountability and Observability in a Hybrid World

Paul: Now, a lot of people are talking about hybrid clouds, but it seems, in some respects, to be a combination of the best and worst attributes of both on-prem and cloud worlds. How do we build clear accountability and observability in what will ultimately be a hybrid world?

David: If you're building hybrid and multi-cloud solutions, you have to basically manage the complexity that's part of the solutions, and resilience is going to be a common control plane that goes through that. People think, "Well, I'm going to build this thing in a hybrid way where I'm going to be able to fail over to my on-prem systems, or even fail over to another cloud." That's perfectly fine, and it works, but it's going to cost you money. I think the ability to understand what those costs and resources are, and how to manage them, becomes the biggest point of contention.

Multi-cloud is great because you're allowed to use the best technology to build more efficient systems, but resilience and reliability are going to be issues within those architectures. I always say, you can have resilience, and you can have efficiency, but you can't have both. We either have to build the architecture for resilience, or we're going to have to deal with outages three or four times a year that will cost the business billions.

Rising Cloud Costs and The Repatriation Trend

Paul: With regards to things like catastrophic outages, cost overruns, and complex accountabilities, it's not surprising that a lot of companies are thinking about repatriating workloads. What's the state of play there, and what are some of the struggles people have as they're trying to bring some of those workloads back on-prem?

David: The big thing would be the cost of doing it. There are two layers there. First, you’ve already spent around half a million dollars on applications and migrating everything to the cloud, and now you’d need to spend a similar amount to move it back. Second, you’d have to go to the board of directors and explain that decision and the path forward. That’s a difficult conversation, because it means acknowledging that the move to the cloud, which was originally expected to be more valuable and reliable, didn’t deliver as planned. Someone will have to go hat in hand and explain that, as a result, the organization needs to shift back to an environment where it has more control over the hardware.

Typically, going to colocation providers and managed service providers is much more efficient, and enterprises are reeling from the cost of the cloud. Now that they're looking at AI workloads, they're trying to make that move even quicker because they can't afford the cloud. The cloud is going to be the easy button for AI; it's the path of least resistance for building these systems. You get a whole ecosystem ready to go on demand, but it's too expensive for most enterprises. If we're going back on-prem for economic reasons, then we have to put some resources in place to ensure we do it effectively.

Paul: How many other developers in how many enterprises have spun up a little side project in a vibe coding app that's generated, you know, incredible compute workloads or storage workloads that are resulting in cost overruns?

David: You're coding by telling the AI system your interpretation of what it needs to code. The thing is, it doesn't understand the nuances there. It doesn't understand how to deal with the efficiencies, and you end up spending more money. The vibe coding stuff is fun to think about, but we have to get some human control over these things. Most of my clients are trying these coding systems, and the more I see of them, the more I see them failing because they can't get to the efficiency they need.

Catch the full conversation with David Linthicum on The AI Forecast on Spotify, Apple Podcasts, and YouTube.

Unified Data Access is the Foundation for Trusted AI

Navita Sood — Mon, 06 Apr 2026 13:00:00 UTC

IT leaders have been under pressure for years to shore up AI plans that deliver on enterprise goals. But the move from pilot to production has proven harder than anyone expected.

That’s because, in hindsight, these early experiments weren’t quite as well-structured as they should have been. AI models were layered on top of data estates that weren't ready for them. Experiments were run in isolation, so governance and security had to be retrofitted across the enterprise to scale. Meanwhile, departments running unsanctioned AI experiments introduced shadow AI that now must be brought back under policy, auditability, and control.

Delivering on AI goals means untangling messy, fragmented, and physically distributed data estates that get ever more complicated by the day. The scalable path forward is bringing AI to the data, and rethinking how AI accesses it. Without unified, governed access down to the studs, accountability and results are fundamentally at odds.

Why Consolidation is the Wrong Strategy

For years, the cleanest answer (and most common advice) was data estate centralization: move everything into one lake, warehouse, or cloud to create one source of truth. Cut down silos and end fragmentation by physically eliminating distribution.

In theory, it sounds efficient. But reality has shown that, at least in an enterprise context, it’s untenable.

Data volumes make large-scale movement expensive and slow
IT and Data Engineering teams have to facilitate access, creating delays and preventing self-service use cases (like department-specific AI agents and tools)
Regulatory boundaries restrict where data can live and how it can be processed
Hybrid environments have become permanent fixtures, not the transition stage they were once thought to be
Centralization itself introduces latency, undermining real-time analytics and AI use cases

Ultimately, consolidation forces enterprises into tradeoffs they can no longer afford in the AI era, when real-time responsiveness and context are crucial to realizing value. Waiting for data to move, or duplicating it across environments, erodes both.

The better approach is data federation: enabling enterprises to operate as if their data is unified without forcing it to move.

What Data Federation Really Means

Data federation is often described in technical terms—query engines, connectors, and distributed compute. For operations leaders, its impact is far more strategic.

Put simply, data federation enables unified access to data across distributed systems without physically centralizing or duplicating it. But the outcome is what matters. Data federation allows teams to work with data where it already lives, enabling leaders to get accurate, up-to-the-minute answers to questions that span cloud, on-prem, and edge systems.

Imagine a global retailer asking, “Where is my inventory of X?” and receiving a single, contextualized answer that reflects warehouse stock, brick-and-mortar shelves, goods in transit, and e-commerce fulfillment centers simultaneously.

Or picture a state agency asking, “Is this applicant eligible for Program X?” and receiving a unified response that reflects tax records, income verification, and existing benefit enrollment—even though those datasets remain within separate department systems.

Data federation makes those outcomes possible because beneath the user interface sits a single, unified governance framework, in which rules are tied to the data itself, not to the storage systems where it happens to live.

In effect, this is a logical data unification instead of a physical one. It means authorized queries can span the data estate end-to-end, utilizing the compute closest to the data, while remaining governed, keeping every access point consistent, and ensuring every output is traceable and auditable.
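To make the idea concrete, here is a minimal Python sketch of logical unification, using the retailer's inventory question as the example. It is an illustration under stated assumptions, not Cloudera's API: the `Source` and `FederatedCatalog` classes, the `POLICY` table, and the sample data are all hypothetical. The point it shows is that the access rule is keyed to the data's classification tag rather than to the system that stores it, and that every access decision is logged.

```python
from dataclasses import dataclass, field

# Hypothetical policy table: rules are keyed by the data's classification tag,
# not by the storage system the data happens to live in.
POLICY = {
    "public": {"analyst", "auditor"},
    "restricted": {"auditor"},
}


@dataclass
class Source:
    """A stand-in for one system in the estate (on-prem, cloud, or edge)."""
    name: str
    location: str
    rows: list


@dataclass
class FederatedCatalog:
    """Fans a query out to every source without copying data into one store."""
    sources: list
    audit_log: list = field(default_factory=list)

    def query(self, role, predicate):
        results = []
        for src in self.sources:
            for row in src.rows:
                if not predicate(row):
                    continue
                # The same policy check runs no matter where the row lives.
                allowed = role in POLICY.get(row["tag"], set())
                # Every access decision is recorded so outputs stay traceable.
                self.audit_log.append((role, src.name, row["sku"], allowed))
                if allowed:
                    results.append({**row, "source": src.name})
        return results


catalog = FederatedCatalog([
    Source("warehouse-onprem", "on-prem", [{"sku": "X", "qty": 120, "tag": "public"}]),
    Source("stores-cloud", "cloud", [
        {"sku": "X", "qty": 30, "tag": "public"},
        {"sku": "Y", "qty": 5, "tag": "public"},
    ]),
    Source("supplier-edge", "edge", [{"sku": "X", "qty": 500, "tag": "restricted"}]),
])

# "Where is my inventory of X?" asked once, answered across all three systems.
analyst_rows = catalog.query("analyst", lambda r: r["sku"] == "X")
analyst_total = sum(r["qty"] for r in analyst_rows)

auditor_rows = catalog.query("auditor", lambda r: r["sku"] == "X")
auditor_total = sum(r["qty"] for r in auditor_rows)
```

In this sketch the analyst sees only the rows tagged `public`, while the auditor sees all three sources, yet neither query names a storage system. A production federation layer would push the filter down to each engine rather than scanning rows in Python; the loop here only stands in for that fan-out.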

That foundation is what makes AI scalable and trustworthy.

The Operational Model of “Govern Once, Access Everywhere”

If federation is the architectural shift, “govern once, access everywhere” is the operating model—it changes how enterprises think about control and scale.

As we briefly touched on earlier in this article, with a federation strategy, governance policies follow the data, not its physical storage location. In practice, that means security rules apply consistently wherever the data is stored and however it is accessed. That makes traceability and auditability foundational, built-in capabilities rather than bolt-ons retrofitted after deployment.

Beyond audit mechanics, it also improves top-layer AI apps and agents by enabling them to access broader context in real time within existing governance controls.

For operations leaders, the implications are tangible:

Faster AI deployments, accelerating automation and efficiency gains
Fewer compliance bottlenecks across regions and regulatory frameworks
Reduced duplication across teams, lowering both infrastructure and processing costs
Real-time visibility across distributed operations, allowing everyone to work from the same source of truth at the same time
Greater executive confidence in AI outputs and decisions, accelerating trust and time-to-value

This frees up teams to drive outcomes rather than getting stuck in the weeds of reconciling across environments and auditing results for consistency.

Preparing for the Era of AI Anywhere

Modern platforms are evolving beyond storage-centric design toward intelligent data access layers built for hybrid permanence, regulatory scrutiny, and AI-powered automation.

This evolution reflects a broader platform direction: bringing AI to the data anywhere it lives, rather than forcing data to conform to infrastructure constraints. As AI embeds itself deeper into supply chains, financial forecasting, fraud detection, and customer engagement, the cost of fragmented access only grows.

Industry analysts have reached the same conclusion. This is reflected in Forrester’s evaluation of data fabric providers, where unified, governed access across hybrid environments is treated as a core architectural capability for enterprise AI. That evaluation named Cloudera a Q4 2025 Leader.

Unified, governed access is the foundation for trusted AI—and that starts with federation.

But not all federation strategies are created equal.

In our next article, we’ll explore how different federation models compare, and what enterprises should look for when choosing a platform built for true hybrid data access, unified governance, and AI at scale.

Data for AI Anywhere: Cloudera’s AI Investments Are Fueling a Hiring Surge

Angela Mann — Thu, 02 Apr 2026 13:00:00 UTC

In an industry defined by reductions in force and hiring freezes, Cloudera is taking a different path and actively expanding its global workforce to meet accelerating demand for enterprise AI.

This growth is the direct result of a multi-year investment strategy in Research & Development (R&D) and AI, which is now entering its breakout phase. We are augmenting our teams globally to build the platform that makes enterprise AI possible anywhere.

Why R&D is Our North Star

As our CTO, Sergio Gago, recently noted, we have entered the “Era of Convergence,” where data centers and cloud come together so that AI can be managed “as another part of the workforce.” This shift from experimental pilots to enterprise-scale impact is exactly why we are expanding our R&D teams to build a unified architecture that allows our customers to bring AI to their data, anywhere it lives.

The Strategy: Investing in the "Era of Convergence"

The experimentation phase of AI is over. Enterprises are moving from simple proofs of concept to agentic AI with autonomous workflows that require secure, governed access to data across hybrid environments.

To meet this demand, we have significantly ramped up our R&D spending, focusing on:

Cloudera AI Inference: Powered by NVIDIA technology to scale GenAI, agentic workflows, and traditional predictive ML use cases
AI Agent Studio: Empowering developers and business teams to build autonomous agents within a trusted data ecosystem using low- and no-code techniques
Unified Data: Blurring the lines between the clouds and on-premises data centers to ensure 100% of your data can be made "AI-ready" without friction

Deep Dive on the Launch of Cloudera Agent Studio

The surge in our R&D hiring is a direct response to a fundamental shift in the market. In 2024 and 2025, enterprises were experimenting with LLMs. In 2026, they are operationalizing them.

To lead this transition, we recently unveiled Cloudera Agent Studio, a centerpiece of our AI roadmap. Agentic AI is the new frontier: systems that can plan, reason, and execute multi-step tasks across a company's entire data estate.

Why This Product Matters

Cloudera Agent Studio is an orchestration layer that allows developers to build autonomous agents that are:

Context-Aware: They use your actual enterprise data (stored in the Cloudera platform) to provide accurate, governed answers
Hybrid-Ready: With our new AI Inference service powered by NVIDIA, these agents can run just as efficiently in your private data center as they do in the public clouds
Secure by Design: Every action an agent takes is logged and governed by Cloudera Shared Data Experience (SDX), ensuring that AI never sees data it isn't supposed to

Growing the Team: What We’re Looking For

We are building the future of the hybrid data and AI platform, so we are looking for builders. Our hiring remains heavy on the Research & Development and Engineering side, but our growth is felt across the entire organization.

We are currently seeking experts who can bridge the gap between "data in motion" and "intelligence at scale." Current high-priority roles include:

AI Solutions Engineering - Building RAG pipelines and custom GenAI prototypes for global enterprises
Platform Engineering - Optimizing Lakehouse architectures and hybrid-cloud deployments (K8s, Iceberg)
Machine Learning Ops - Scaling model serving and observability via MLflow and Cloudera AI
Data Architecture - Designing the streaming foundations (NiFi, Flink) that feed real-time AI

Why Join Us Now?

Cloudera is building to accelerate. We offer a stable, high-innovation environment where you can work on the world's most complex data challenges with leading brands that collectively manage more than 30 exabytes of enterprise data.

If you’re ready to move past the AI hype and start building AI that works in the real world, we have a seat for you.

Explore our open opportunities and help us build for the Era of Convergence.

Navigating the Future of Data & AI: Key Takeaways from Gartner Data & Analytics 2026

Katie Gdula — Wed, 01 Apr 2026 13:00:00 UTC

At Gartner’s 2026 Data & Analytics Summit, the message was clear: the era of experimental AI is over, and the era of integrated, governed, and value-driven AI has begun. As organizations race to modernize, the focus has moved from "What is AI?" to "How do we scale AI reliably?"

Here are five key takeaways from the conference and how Cloudera can help you deliver business value in each of these areas.

5 Key Takeaways from Gartner’s D&A Conference

1. There is No AI Without AI-Ready Data

AI-ready data is the prerequisite for successful AI initiatives. The market is moving toward converged platforms that simplify operations, specifically the open data lakehouse architecture.

A data lakehouse combines the benefits of a traditional data warehouse with the flexibility of a data lake. The lakehouse is expected to replace traditional data warehouses because it provides the necessary access to unstructured data, the lifeblood of modern generative AI (GenAI).

The Cloudera Edge: Cloudera’s Open Data Lakehouse allows organizations to manage structured and unstructured data across hybrid and multi-cloud environments. By providing a single, unified integrated architecture, Cloudera eliminates data silos, ensuring all your data is AI-ready regardless of where it resides.

2. The Rise of Agentic Systems

2026 is the year of AI agents. Unlike simple chatbots, these agents move toward autonomous decision-making and require robust agentic data management to automate complex tasks. AI agents must be governed, budgeted, and contextualized to create value and reduce risk.

The Cloudera Edge: Cloudera provides the high-performance data streaming and real-time processing power needed to fuel agentic ecosystems. With Cloudera Data in Motion, enterprises can build the real-time pipelines that allow AI agents to act on the most current data, ensuring autonomous decisions are based on reality, not stale information.

3. Context is King: Semantics and Graph RAG

Gartner highlighted that for AI to be trustworthy, it must understand the context of specific jobs and processes. This is driving a shift toward knowledge graphs and graph retrieval-augmented generation (RAG) to handle content complexity and ensure traceability. Leaders need a composite semantic layer to ensure interoperability and transparency.

The Cloudera Edge: Cloudera’s Unified Data Fabric is designed to handle the complexity of massive datasets while maintaining metadata integrity. By integrating specialized tools for vector databases and knowledge graphs, Cloudera enables graph RAG at scale, allowing enterprises to feed their large language models (LLMs) highly specific, proprietary context while maintaining a clear audit trail of where that information came from.

4. Governance as a Risk Mitigator

Gartner also warned, "Governance derisks our aspirations." In other words, without right-sized governance, AI initiatives will fail to build the trust they need to scale. D&A leaders must modernize governance to meet the requirements of the entire AI lifecycle, from data ingestion to model deployment.

The Cloudera Edge: Cloudera Shared Data Experience (SDX) offers enterprise-grade security and governance that follows data wherever it goes. Whether you are running a model on-premises or in a public cloud, Cloudera SDX provides a consistent security policy, ensuring that sovereign AI is not just a buzzword, but a reality for regulated industries.

5. The Hybrid Mandate: Sovereign AI

A significant focus at the summit was the need for sovereign AI solutions that allow organizations to localize D&A control, particularly for compliance and data privacy. Organizations need platforms that offer unified management while allowing for localized control over data and models.

The Cloudera Edge: As the only true hybrid platform for data and AI, Cloudera gives you the ability to run high-performance AI workloads in the cloud and keep your most sensitive data on-premises. This hybrid flexibility is the cornerstone of a sovereign AI strategy, giving you total control over your intellectual property.

Final Thoughts: Moving to an AI-First Mentality

The industry is moving away from fragmented tools toward unified data management solutions. Success in this new era requires a platform that can handle the entire lifecycle—from data ingestion and engineering to warehousing, machine learning, and monitoring.

Cloudera’s hybrid, open, and secure platform provides the foundation for AI-ready data and the governance to protect it, empowering leaders to turn AI disruption into a sustainable competitive advantage.

To learn more about how Cloudera can power your AI use cases, check out our webinar series “Accelerate Enterprise & Agentic AI: From Development to Inference with Private AI.”

Moneyball’s Billy Beane on Why Ignoring Data Is the Biggest Risk of All

Cloudera — Wed, 25 Mar 2026 13:00:00 UTC

Baseball always ran on gut instinct and tradition… until Billy Beane proved the numbers could win.

In episode 62 of The AI Forecast, “How Moneyball's Billy Beane Changed Baseball Forever with Data Analytics,” Billy Beane joins host Paul Muller to discuss how evidence-based decisions challenged traditional baseball. He explains how constraints spur innovation, why questioning assumptions is vital, and how data helps organizations reinvent decision-making.

From evaluating talent to managing resources, Billy asserts that success depends on creating systems that prioritize evidence over ego. Below are a few of the main moments from Paul and Billy’s fascinating discussion.

Reframing Risk

Paul: How tough is it to navigate that point where you're confident in the idea, but the results aren't showing up quickly enough?

Billy: That’s a great question, and I leaned on my assistant. He used to say that if you're going to take a math test and someone is going to give you the answers, wouldn't you take them? We felt like using data was like that. They were giving you the answer to the test. Now, we wanted to leverage data and make a lot of decisions. We knew we weren't going to be right every single time; we weren't going to win every hand, but if we were disciplined with the data, ruthless with the numbers, and consistent in how we made decisions, over time, we would be correct.

I think a lot of people assumed that, while we were doing these things, we were nervous about how it was going to turn out, but we felt the opposite, completely. We felt the use of data was kind of a roadmap and a fog light for us. And again, we weren't going to be right about every single decision, but if we were consistent with the way we made decisions over time, we would end up where we wanted to be, and it was going to be that discipline that was going to carry us through.

If you're right three times in a row, everybody's on board. Then the fourth time, if you're wrong, everybody says, ‘Oh, well, I told you that numbers don’t tell you the whole thing.’ And they sort of jump back to an emotional decision-making position, yet they don't hold emotional decisions to the same standard. One of the things we get complimented for, which I think is a little misguided, is that we were risk takers. We were actually completely the opposite. We wanted to manage risk, we wanted to be actuaries, and we thought what was risky was having information to help you make predictive decisions and not using that. That, to us, was the risk.

Data Over Orthodoxy

Paul: The good news is you got famous, and the bad news is you got famous. As other teams figured out what you're up to, how did you find a new edge? How did you stay scrappy?

Billy: I think the real revolution was when other teams started realizing the importance of data, collecting their own data, and using that data to build more predictive models. When we first started making decisions, we based them on statistics. Statistics are a result. What teams started figuring out was that there was a better way to measure process, which was a better predictor of skill, and that data collection was important. And quite frankly, it wasn't just about collecting data, but about bringing in some really, really bright, passionate people into our business who previously weren't working there.

The thing about the book Moneyball was that everything in that book was public information. We basically stole Bill James's ideas. The culture allowed us to do it because nobody really tried the ideas of Bill James or what he talked about in his pamphlets for years after that. Over the next 20 years, though, and as we sit here now, teams have become very private. They hire, and they have very large analytical staffs with bright young men and women helping them build these models using biometrics to improve player performance. It's gotten very, very sophisticated—far beyond even my understanding, to be totally frank.

Everyone’s a Data Person… Until It Disagrees With Them

Paul: In my experience, the challenge now is that you may be dealing with really bright, experienced people who will say, “I'm a data-driven person,” and they'll point to data, and they'll agree with it. But as soon as the data shows something that doesn't back up their experience, they may say, “Well, that data's not right, and I'm not going to use that data.” In short, cherry-picking the data is something that I've seen happen, and it goes back to the statement I made about everyone being a data person until it doesn't back up their opinion.

Billy: To me, that's the real opportunity. The experiences of a really successful long-term CEO in a business are data, and drawing on those experiences to help him make decisions is data. But I think in many cases, when you're with experienced people, we have a tendency to give in when they say, “Hey, that data’s not right.” Well, my response is usually that you don't get to disagree with the data, because it's not an opinion. It's a fact. In today's world, with all the data that we have exposure to, the real opportunity is when the data tells you one thing and your own experiences tell you something else. Personally, I prefer to always nod towards the data and ignore my own experiences when making decisions. And again, I know many people will disagree with that. To me, the opportunity is when really smart people see the same thing and the data tells them something, because you have to assume your competitor is going to see the same thing you do and make a decision along those lines.

Catch the full conversation with Billy Beane on The AI Forecast on Spotify, Apple Podcasts, and YouTube.

#ClouderaLife Employee Spotlight: Meet Jim Ewton, A Veteran Building Community and Mission Impact at Cloudera

Debbie Kruger — Fri, 20 Mar 2026 13:00:00 UTC

“At Cloudera, we just seem to draw individuals from the military,” Jim says. “And that makes me happy because it means they feel comfortable coming to talk to and work with us.”

At Cloudera, innovation starts with belonging. We work to build an environment where people from all backgrounds, including those who have served, can continue their mission in new ways. For Jim Ewton, a U.S. Air Force veteran and active member of Cloudera’s Veterans Employee Resource Group (ERG), that sense of purpose and community is what makes being a Clouderan special.

Let’s meet Jim and learn how a lifetime of service led him to Cloudera, and how he’s helping fellow veterans find belonging along the way.

From the Air Force to Cloudera Government Solutions

Jim spent 23 and a half years in the U.S. Air Force, traveling the world and serving in roles spanning communications and law enforcement. His career took him across Asia and South America, and to 39 U.S. states, including four years at the Pentagon.

“When you wear the uniform that long, it becomes part of who you are,” he says.

After retiring in 2002, Jim transitioned into government contracting before joining Cloudera in 2015. Today, he’s part of Cloudera Government Solutions, the company’s public sector arm that supports sensitive U.S. government missions.

That work carries deep responsibility. Cloudera Government Solutions operates under strict security and compliance standards, supporting agencies that rely on secure, mission-critical data capabilities every day.

“We do a lot of sensitive work,” Jim says. “There are multiple agencies that depend on our capabilities and our software every day.”

The Hardest Mission: Transitioning to Civilian Life

The path from military service to civilian life isn’t seamless.

“Even when you take off the uniform, it’s not an immediate immersion into civilian life,” Jim says. “It takes a while. It’s a different world. It can be scary.”

He speaks candidly about the challenges many veterans face—from having to pick out an outfit for work for the first time, to translating military experience into a civilian résumé, to navigating invisible wounds like PTSD or social anxiety. Everything feels new, and recognizing that shock is an important part of the process.

“I say this a lot in my mentorship,” he explains. “A lot of folks coming out of the military have visible or invisible health issues. It’s important to help them find value again in who they are in their new endeavor.”

That belief is what drew him deeper into Cloudera’s Veterans ERG.

Building Community Through Cloudera’s Veterans ERG

At Cloudera, our Veterans ERG offers an incredible support system. Members support one another, mentor transitioning service members, and seek ways to give back to the broader military community.

Jim is especially passionate about mentorship, helping veterans translate their skills and experiences into new opportunities.

“The ERGs help create a sense of community,” he says. “I’ve really enjoyed getting involved more, and I hope more Clouderans learn about them and the good work they do.”

Cloudera’s veteran presence extends well beyond the ERG. Veterans serve at every level of the organization, including executive leadership. Seeing that representation throughout sends a powerful message: your background is understood here, and your experience has a place at the table.

“When you see veterans across leadership, it reinforces that you belong here,” Jim says.

A Culture That Makes Space

One of the first things Jim noticed when he joined Cloudera was the environment itself. After decades in highly structured military settings, Cloudera’s approachable, casual culture stood out.

“It wasn’t suits and ties. It wasn’t stuffy,” he says. “It was comfortable. People were accepted no matter where they came from—background, education, experience.”

Over more than a decade at Cloudera, Jim has seen the company evolve from its early Hadoop foundations to today’s leadership in hybrid data and AI. Through growth and change, one thing has remained constant: a focus on team building.

“Every time we change direction or pace, leadership comes back to team building,” he says. “That’s always been fundamental.” Now past his 10-year mark, Jim calls Cloudera “one of the best environments I’ve ever been in.”

Giving Back Is Part of the Mission

For Jim, being a Clouderan also means giving back. Through the Veterans ERG and Cloudera Cares initiatives, he supports organizations like Fisher House, which provides housing for military families while loved ones receive medical care, and Operation Motorsport, which helps veterans rediscover purpose and community through hands-on engagement in motorsports.

“The testimonies from the young folks are what turned me into a true believer,” he says of Operation Motorsport. “They’re thankful. I cannot begin to tell you how many times they said ‘thank you’ during the event.”

“Just a little bit of energy goes a long way when it comes to offering a helping hand,” he adds. “That’s one thing Cloudera does exceptionally well—we give back.”

Jim also brings a deeply personal dimension to this work. He is supported by a service dog who accompanies him to the office and business reviews, helping to create a sense of calm wherever she goes. “When a dog walks into a room, it changes the environment immediately,” he says. “It provides comfort. Relief. That’s powerful.”

The openness and flexibility to bring his full self (and his fluffy support system!) to work isn’t something he takes lightly.

Continuing the Mission

Jim’s story is ultimately about belonging and how powerful it can be when that feeling extends beyond the workplace. Organizations like Operation Motorsport are doing life-changing work to help veterans rediscover purpose and community after service. The impact is tangible, personal, and lasting.

At its best, Cloudera’s culture has always been about showing up—for each other and for the communities around us. As Jim’s journey reflects, there is always room to deepen that impact and to show what it truly means to be a Clouderan: mission-driven, people-first, and committed to making a difference.

Hear from another Clouderan and explore career opportunities at Cloudera.

Reimagining Prescription Analysis: How Specialized AI Agents Solve Healthcare's Toughest Document Processing Challenges

Vish Rajagopalan, Kathy Wong, Maximilian Engelhardt, Laurent Edel, Maxim Belikov — Thu, 19 Mar 2026 13:00:00 UTC

In document-intensive fields such as healthcare and pharmaceuticals, the speed and accuracy of data extraction are critical for patient safety and timely care. Prescriptions are critical documents in healthcare workflows, and accurate transcription is paramount to reducing medication errors and adverse drug events.

This blog shows how Cloudera can help healthcare organizations modernize, improving the speed and accuracy of data extraction and prescription generation by replacing traditional optical character recognition (OCR) with specialized AI agents.

Modernizing the US Pharmacy with Agentic AI

The US pharmacy sector faces rising demand, tighter margins, and increasing expectations for accuracy and speed. More than 6 billion prescriptions are generated in the US alone every year, yet dispensing still relies heavily on manual data entry, verification, and documentation.

Pharmacist wages have grown, while reimbursement pressure from pharmacy benefit managers (PBMs) and operational friction continue to compress profitability. Pharmacies face a structural challenge: delivering faster, safer dispensing at a time when labor is costly, workflows are increasingly complex, and reimbursement is becoming more volatile.

US pharmacies are experiencing a dual squeeze of rising workload and falling margins:

The labor gap: Pharmacist wages average $66/hr, yet a large proportion of their time is consumed by manual data entry and clerical verification.

The audit: Pharmacy benefit managers recoup billions annually via clawbacks, which are retroactive payment reversals triggered by minor documentation errors.

The revenue shift: Dispensing margins continue to decline, while clinical services offer materially stronger economics for pharmacies.

Moving Beyond Traditional Entity Extraction

For many years, optical character recognition has been the de facto technology for transcribing prescriptions. However, it continues to struggle with real-world complexity, such as:

Lack of standardized formats: Prescriptions vary widely in format, and handwritten prescriptions further increase complexity due to differences in handwriting and language.

High error rates: This variability leads to frequent errors when optical character recognition processes written text, requiring significant manual review and correction.

Custom software stack: Most optical character recognition-based solutions employ a custom software stack. As such, healthcare systems struggle with licensing, upgrades, and staff training.

Privacy and PII regulations: There’s a high degree of regulatory compliance (such as GDPR) around patient records, which constrains the storage, transmission, and processing of health records.

The Business Value of AI-Enabled Prescription Verification

AI-enabled verification strengthens pharmacists rather than replacing them, automating repetitive, potentially error-prone steps and converting unstructured prescriptions into reliable data.

Labor Optimization

Verification is one of the most time-intensive steps in the dispensing workflow, as pharmacists must intake, interpret, transcribe, and confirm each prescription. AI-enabled optical character recognition automates prescription intake and verification, reducing manual effort and allowing pharmacies to meet demand with existing staff—lowering overtime and reliance on relief pharmacists.

Reallocated Capacity

By reducing time spent on fulfillment, pharmacists regain time for higher-margin clinical services—such as vaccinations, medication therapy management (MTM), and point-of-care testing—improving overall margin mix.

Error Reduction

Medication errors and clerical discrepancies often stem from inconsistent handwriting, incomplete information, or manual data entry. During pharmacy benefit manager audits, even small documentation errors can result in full claim clawbacks, creating significant financial exposure. AI-enabled optical character recognition adds an automated safety layer by flagging ambiguous or inconsistent data before submission. This improves documentation quality, reduces dispensing errors, and lowers the risk of audit recoupments.

Reimbursement Accuracy

Pharmacy benefit managers manage most prescription claims and enforce strict documentation standards. Small discrepancies in directions, quantities, or prescriber information frequently trigger claim denials, creating rework and administrative burden. AI-enabled optical character recognition improves documentation accuracy at the point of entry, reducing avoidable denials and the time spent correcting and resubmitting claims. This results in fewer reworks, faster reimbursement, and more predictable cash flow in an already margin-constrained environment.

Success Story: How a Healthcare Provider Transformed Prescription Analysis with Cloudera AI

A Central European healthcare provider partnered with Cloudera to modernize prescription analysis under strict PII regulations. The solution replaced a single-pass optical character recognition workflow with an agent-based AI pipeline deployed in a private, air-gapped environment. Further, the solution improved accuracy by over 16%, reached near human-level performance, and scaled from proof of concept to production in a matter of weeks.

A Specialized Agentic Approach

The solution’s effectiveness comes from an orchestrated, AI agent-based workflow that combines fine-tuned vision models with authoritative medical data validation.

First, a Cloudera AI agent extracts prescription data using a vision optical character recognition model specifically trained on real-world prescription formats and handwriting patterns.

Then, the extracted drug names, dosages, and ingredients are validated against certified medical and drug databases using probabilistic matching.

Finally, human-in-the-loop feedback continuously retrains the model, allowing the system to learn from prior errors and steadily improve accuracy. This closed-loop approach moves prescription analysis beyond static optical character recognition into a self-improving, production-grade workflow.
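
The validation step can be illustrated with a minimal sketch. It is not the deployed solution: the reference list below is a hypothetical stand-in for a certified drug database, the standard-library difflib similarity ratio stands in for probabilistic matching, and the function name and threshold are assumptions made for illustration.

```python
from difflib import SequenceMatcher

# Hypothetical reference list standing in for a certified drug database.
KNOWN_DRUGS = ["amoxicillin", "atorvastatin", "metformin", "metoprolol"]


def match_drug(extracted: str, threshold: float = 0.8):
    """Return (best_match, score, needs_review) for an OCR-extracted name."""
    cleaned = extracted.strip().lower()
    scored = [
        (SequenceMatcher(None, cleaned, name).ratio(), name)
        for name in KNOWN_DRUGS
    ]
    score, best = max(scored)
    # Low-confidence matches are routed to a reviewer, not auto-accepted.
    return best, round(score, 2), score < threshold
```

A misspelled extraction such as "Amoxicilin" resolves to "amoxicillin" with a high score, while an unrecognizable string falls below the threshold and is flagged for human review, which is where the feedback loop described above would pick it up.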

Benefits Achieved with Cloudera AI

This agentic workflow delivered clear operational and financial benefits:

Improved accuracy: Certified medical database validation reduced optical character recognition and documentation errors.

Lower operational costs: Automation reduced manual review, error correction, and audit-related rework.

Faster processing: Automated inference shortened fulfillment cycles and freed pharmacist capacity.

Next Steps

Pharmacies that adopt agentic workflows gain speed, resilience, and economic advantage. Those that delay face rising labor costs, greater audit exposure, and widening competitive pressure driven by pharmacy benefit manager requirements.

To learn more about how Cloudera AI can power your use cases, check out our webinar series “Accelerate Enterprise & Agentic AI: From Development to Inference with Private AI.”

Beyond the Screen: Deepfakes, Trust, and the Next Cybersecurity Frontier

Cloudera — Wed, 18 Mar 2026 13:00:00 UTC

Trust is the foundation of cooperation, trade, and enterprise decision-making. In the digital age, trust is established through signatures, voices, and virtual interactions. But as deepfake technology rapidly advances, that trust erodes, creating new risks that bypass decades of cybersecurity investment.

In this episode of The AI Forecast, Paul Muller speaks with Jim Brennan, Chief Product and Technical Officer at GetReal Security, about how AI-powered authenticity threats change the enterprise security equation. Their conversation reveals why deepfakes are the new face of social engineering, why technology—not the human eye—must lead the defense, and how leaders can protect their businesses and people.

The Human Layer Has Become the Weakest Link

Paul: Decades of digital transformation gave us the ability to collaborate instantly. But now the very thing we rely on—the little window on our screens—has become the new attack surface. If I can’t trust what I see, the only fallback is expensive, slow, physical interactions.

Jim: A CIO told me, ‘This little window is where I run my business and now, I can’t trust anything coming through it.’ That’s profound. The human eye can’t detect this level of sophistication. Most people are guessing 50/50. That’s why technology, not instinct, has to lead the defense.

Trust fuels cooperation, and cooperation powers business. But deepfakes undermine that trust at its most personal level: the daily conversations and video calls leaders depend on. Jim describes this as a new human-facing interaction layer, which he calls the “display layer” and which Paul jokingly dubbed “Liar 8,” an entirely new attack surface. Unlike the layers that firewalls and intrusion detection systems protect, this one is human, not technical. The medium executives use to communicate and make decisions is now open to manipulation.

Boards Respond to Realistic Threats, Not Hollywood Plots

Paul: Do boards risk dismissing deepfakes as something that could never happen to them?

Jim: It only takes seeing it once to believe it’s real. However, the real challenge is showing boards what it means for their business. If you lean on big sensational stories, they may shrug them off. The reality is that smaller, everyday incidents are already happening, which resonate far more.

He points to fraudulent hiring as a prime example. Attackers are using deepfakes to impersonate candidates and slip through HR processes. Sometimes the motive is simple financial gain, like pocketing a sign-on bonus. Other times, it’s far more serious: nation-state actors planting impostors inside companies for espionage or large-scale fraud.

Jim: In the last three months, every Fortune 500 and 1000 company I’ve spoken to has told us it’s having issues with fraudulent hiring. HR teams aren’t built to think like attackers, making hiring an easy target.

Technology Must Lead the Fight for Digital Authenticity

Paul: We’ve always used technology to fight technology—firewalls, antivirus, intrusion detection. Can we do the same against deepfakes?

Jim: You can’t simply train your way out of this problem. Standing up a black-box model and feeding it real and fake examples won’t cut it. The better approach is to use digital forensics to study the artifacts deepfakes leave behind, whether it’s facial distortions, audio noise, or lighting inconsistencies, and then use machine learning to find those signals at scale.

Jim explained that effective defenses must go beyond generic AI, getting “under the covers” of generation tools to identify subtle traces and artifacts. Practically, enterprises can deploy these protections through APIs from platforms like Zoom or Teams, avoiding endpoint installs and keeping defenses scalable. At the same time, awareness is critical—webinars, demos, and simulations give employees the context to pause and think before acting. Technology and training form the two layers needed to protect digital trust.

Closing Insight for Enterprise Leaders

Jim: We live in an age where you can’t trust anything in this window or screen. New policies for organizations are called for, and new ways of operating are called for as well.

The threat landscape has shifted. Deepfakes are not just a futuristic risk. They are here, undermining both enterprise decision-making and personal safety. From fraudulent hires to AI-cloned ransom calls, digital trust is no longer guaranteed.

The path forward is threefold:

Educate boards with credible, relatable examples that fit existing risk frameworks
Equip employees with awareness that “seeing” and “hearing” are no longer enough to establish truth
Deploy technology that can detect and respond to authenticity threats in real time

Catch the whole conversation with Jim Brennan on The AI Forecast on Spotify, Apple Podcasts, and YouTube.

Cloudera Agent Studio and NVIDIA Bring Next-Gen Agents to Enterprise AI

Charu Anchlia, Suryakant Bhardwaj, Pamela Pan — Wed, 18 Mar 2026 13:00:00 UTC

The Foundation: Private Model Deployment with NVIDIA Nemotron

Enterprise AI starts with data governance. Prompts, proprietary data, and model outputs must stay within the organization's operational boundary, meeting compliance mandates without architectural compromise. This is the core requirement of Private AI: the full inference stack running inside the enterprise, not outside it.

Cloudera AI Inference service, powered by NVIDIA NIM microservices, enables high-performance, scalable model serving directly within the enterprise environment, keeping prompts, data, and outputs inside the security perimeter. Accelerated by the NVIDIA AI stack, including Blackwell GPUs and Dynamo-Triton, the service supports a wide range of models, including NVIDIA’s Nemotron model family for agentic AI with advanced reasoning, tool use, and long-horizon workflows. This foundation allows organizations to build and run enterprise AI agents directly on their data—securely and at scale.

Four Pillars of Cloudera Agent Studio

1. Dynamic, Iterative, Multi-Step Planning

Enterprise data environments are not clean. Real deployments involve dozens of databases with inconsistent schemas, sparse documentation, and no deterministic path from a business question to the right data source. The agent must construct that path at runtime.

Agent Studio's orchestrator treats exploration as part of execution. It decomposes complex requests into multi-step plans, executes them iteratively, and self-evaluates after each step before committing to a path. This self-correcting planning loop makes agents reliable in environments they have never encountered and sustains long-horizon workflows across many sequential steps.
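
The self-correcting loop described here can be sketched in a few lines. This is an illustrative outline only, not Agent Studio's implementation: run_plan and the execute, evaluate, and replan callables are hypothetical names, and a real orchestrator would call a model where this sketch calls plain functions.

```python
from typing import Callable, List


def run_plan(
    steps: List[str],
    execute: Callable[[str], str],
    evaluate: Callable[[str, str], bool],
    replan: Callable[[str], str],
    max_retries: int = 2,
) -> List[str]:
    """Execute steps in order, self-evaluating each result before moving on."""
    results = []
    for step in steps:
        attempt = step
        for _ in range(max_retries + 1):
            output = execute(attempt)
            if evaluate(attempt, output):
                results.append(output)
                break
            # Evaluation failed: revise the step instead of committing to it.
            attempt = replan(attempt)
        else:
            # Every attempt failed evaluation, so stop rather than guess.
            raise RuntimeError(f"could not complete step: {step}")
    return results
```

The point of the design is that a failed evaluation changes the next attempt rather than ending the run, and that a hard retry limit keeps a long workflow from looping indefinitely on one step.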

2. Multi-Agent Collaboration: Reusability and Transparency

Complex enterprise workflows span multiple domains, each requiring distinct reasoning strategies and specialized tools. A single agent attempting to cover all of them cannot be well-optimized for any, and the broader its scope, the harder it becomes to understand and govern agent behavior.

Agent Studio is built around specialized agents, each scoped to a specific domain and equipped with the appropriate tools, coordinated by an orchestrator that understands how to delegate. What makes this collaboration transparent and reusable is how agents communicate: each agent writes structured outputs to shared project context, and subsequent agents consume those outputs as explicit, inspectable inputs. The full chain of reasoning is traceable at every step, providing the auditability enterprises require and the reusability to build on prior work across runs.

3. Context Engineering: Accuracy, Speed, and Cost

At enterprise data scales, passing raw data directly to the model does not work. Context windows are finite, and as unstructured context grows, accuracy degrades well before the window limit is reached.

Agent Studio treats the context window as a precision instrument: at each step, only the information relevant to that agent's specific task reaches the model. This artifact-driven design reduces token consumption, cutting inference cost and latency while improving accuracy. That combination is what makes long-horizon workflows tractable at enterprise scale.
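
As a rough illustration of the idea, the sketch below selects only task-relevant artifacts and stops at a token budget. The Artifact type, the tag-overlap relevance score, and the word-count token estimate are all simplifying assumptions for this example and do not describe how Agent Studio scores or stores artifacts.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class Artifact:
    name: str
    tags: Set[str]
    text: str


def build_context(
    artifacts: List[Artifact], task_tags: Set[str], budget: int
) -> List[str]:
    """Pick artifacts sharing tags with the task, most relevant first."""
    relevant = [a for a in artifacts if a.tags & task_tags]
    relevant.sort(key=lambda a: len(a.tags & task_tags), reverse=True)
    chosen, used = [], 0
    for art in relevant:
        # Crude token estimate: whitespace-separated words.
        cost = len(art.text.split())
        if used + cost <= budget:
            chosen.append(art.name)
            used += cost
    return chosen
```

Unrelated artifacts never reach the model regardless of budget, and relevant ones are admitted in priority order until the budget is spent, which is what keeps token consumption bounded as the shared project context grows.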

4. Sandboxed Execution

What makes autonomous agents genuinely powerful is their ability to dynamically generate tools, skills, and executable code as workflows demand them, capabilities that Agent Studio supports natively. But without isolation, agent-generated code and tools executing directly against enterprise systems present unacceptable risk.

We architected Agent Studio's execution layer around isolation by default. All agent-generated code and tool execution runs in a sandboxed runtime with no access to systems outside its defined scope. Agents begin with zero permissions, and every action is policy-enforced at the infrastructure layer, not inside the agent process itself. This gives regulated industries the auditability they require, without restricting what agents can do.
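
The zero-permission starting point can be illustrated with a small deny-by-default policy gate. This sketch covers only the authorization and audit idea, not process or network isolation, and the PolicyEnforcer class and its methods are hypothetical names rather than part of any Cloudera API.

```python
from typing import Dict, List, Set, Tuple


class PolicyEnforcer:
    """Deny-by-default gate that sits outside the agent and logs decisions."""

    def __init__(self) -> None:
        self._grants: Dict[str, Set[Tuple[str, str]]] = {}
        self.audit_log: List[Tuple[str, str, str, bool]] = []

    def grant(self, agent: str, action: str, resource: str) -> None:
        """Explicitly allow one action on one resource for one agent."""
        self._grants.setdefault(agent, set()).add((action, resource))

    def authorize(self, agent: str, action: str, resource: str) -> bool:
        # Agents start with zero permissions: anything not granted is denied.
        allowed = (action, resource) in self._grants.get(agent, set())
        # Both allowed and denied requests are recorded for later audit.
        self.audit_log.append((agent, action, resource, allowed))
        return allowed
```

Because the check lives in a separate component, an agent cannot widen its own permissions, and the log captures denied attempts as well as approved ones.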

Customer Story: Agentic AI Transforming Petabyte-Scale Data Analytics

Cloudera manages over 30 exabytes of structured data across its customer base, which makes structured data analytics the area where this architecture delivers immediate impact. A major media and entertainment company deployed it to give business users and analysts a natural language interface to their operational data. Their data estate spanned petabytes across dozens of databases, often with conflicting metadata and sparse documentation.

Cloudera Agent Studio orchestrated specialized agents backed by NVIDIA Nemotron running inside the customer's private network. A business user's analytical question triggered an iterative planning loop: the orchestrator explored the data estate, navigated schema ambiguity, and identified the right data sources autonomously. When the analysis required statistical computation beyond what SQL could express, the orchestrator delegated to the appropriate code execution agent. Intermediate outputs were written as artifacts and passed forward through the long-horizon workflow. All generated code executed in a sandboxed environment, maintaining a complete audit trail throughout.

Workflows that once required a data engineer, a developer, and an analyst working in sequence became accessible to any business user. The agents' outputs, including SQL commands, generated code, and visualizations, were written to shared project context throughout, each inspectable and auditable. Those artifacts were also exportable as production pipelines. Because the code that agents generate is deterministic even when the underlying models are not, those pipelines are reliable and reproducible without additional engineering.

Architecture as Competitive Advantage

Every pillar in this architecture builds on the one before it. A private inference layer provides the foundation, supporting the call volumes and reliability that long-horizon workflows require. Iterative planning enables agents to navigate environments they have never seen. Multi-agent collaboration brings domain precision to multi-step reasoning. Artifact-based context management improves accuracy while reducing inference cost and latency. Sandboxed execution ensures agents operate safely within defined boundaries, with every action governed and auditable.

Cloudera and NVIDIA bring this architecture to life through Cloudera Agent Studio, Cloudera AI Inference powered by NVIDIA NIM, and the NVIDIA Nemotron family of models. Together, they deliver the foundation for the orchestration and agentic reasoning needed to run enterprise AI agents directly on enterprise data securely, privately, and at scale.

To learn more, see Cloudera Agent Studio in action.

Autonomous agents act toward complex goals without requiring human direction at each step. In enterprise environments, deploying these agents introduces a more exacting set of challenges: they must navigate heterogeneous data systems; satisfy compliance, audit, and data sovereignty mandates; and keep all data within the organization's operational boundary.

Long-horizon agents represent a new class of autonomous AI, extending beyond single tasks to pursue objectives across dozens of sequential decisions, running workflows for hours or days while maintaining context throughout. At enterprise scale, every one of those challenges is amplified.

An Architecture Built for Enterprise AI Agents

Cloudera designed Cloudera Agent Studio (part of Cloudera AI Studios) in collaboration with NVIDIA to address exactly these challenges.

NVIDIA Nemotron provides the model foundation: it’s purpose-built for agentic AI and the high-throughput inference demands of long-horizon workflows.

Cloudera Agent Studio provides the orchestration layer that builds on that foundation through four architectural pillars: dynamic multi-step planning, transparent multi-agent collaboration, context engineering for accuracy, and sandboxed execution. Each pillar addresses a specific requirement that emerges when autonomous agents operate at enterprise scale.

Adam Skotnicky on Taming Data Complexity and Building Cloud-Like Simplicity

Cloudera — Tue, 17 Mar 2026 13:00:00 UTC

If there’s one thing serial entrepreneur Adam Skotnicky would warn organizations about, it’s data complexity. As VP of Engineering at Cloudera and founder of tcp.cloud and Taikun, which was recently acquired by Cloudera, Adam is an expert at capitalizing on emerging opportunities in the tech sector without letting complicated data structures hold him back.

Paul Muller, host of The AI Forecast Podcast, and Adam discuss how engineering teams can find their way back to simplicity while maintaining flexibility and control. They delve into why IT teams feel swamped by tooling and operational challenges, how platform engineering can make things easier for users, and what it really means to achieve that cloud-like agility in hybrid environments.

Here are a few of the main points from the discussion.

The Pitfall of Overengineering

Paul: Organizations today are managing data across multiple clouds, on-prem, and hybrid environments. From your perspective, what are the biggest challenges they face in that complexity?

Adam: The thing is that you need to focus on the core value of what you’re trying to build.

If you go all in, you might overengineer your solution. You don’t need to have all the features on the planet. It’s like a candy shop for engineers, right? They go crazy. Then you have the sugar rush, and then you have this huge fall after that. It’s exactly what it is.

The Future Is Workload-First, Invisible Infrastructure

Paul: What was the inspiration to try to create a more cloud-like experience in your data center? I think a lot of technologists would say that the issue with this promise of hybrid has always been that my on-premises stuff might have a little bit of automation, but it’s nowhere near as slick or as simple as when I’m using a public cloud service, where they spend a lot of engineering dollars to make it really feel like a catalog. Do you agree that that’s been the compromise in the past, and how did you get around that with what you were doing with Taikun?

Adam: If you want to build something similar, the cloud-like experience means removing people from the process. If you have any ticket between you and your application, or if I own this application, you log in, go to the catalog, and deploy things. That’s the ultimate goal. Beyond that, no people touch it; they observe, make sure it works, and ensure it’s performant and secure. They do this without you, without requiring anything from you, and that’s how public cloud works. That’s the experience; that’s what cloud-like means.

Paul: Talk to me about what you’re seeing in the marketplace as it comes to deploying these big data workloads. How does a self-service, flexible cloud experience empower teams to focus on insights rather than infrastructure?

Adam: I absolutely agree that it’s about workload and workload only. It’s not about the infrastructure, and that’s why we don’t want anyone to touch it. You want to abstract the infrastructure completely, but we still allow you to go and tinker with it. You can do that and explore, but in production environments, you shouldn’t touch it. You should follow best practices because then you can finally focus on the workload, and you shouldn’t go from the workload down. The infrastructure should be there. That’s what we’re doing at Taikun. We focus on the workload.

One Platform, Any Environment

Paul: What are people using workloads like the Cloudera platform going to notice that's different about this new way of working as they start to deploy?

Adam: We are now the abstraction layer for Cloudera services, so Cloudera services will be independent of that environment, so they can run on public or private cloud on your few servers or hundreds or thousands of servers and still have the same experience. You can now run as many of them as you want, connect them to as many endpoints as you want, choose where to combine, and then configure them. It’s not a public cloud or a hybrid cloud. You can use both. You can run your production environments, which you can scale on-prem because of data sovereignty, and you can play with technologies in the public cloud because you can scale up and down from zero to a hundred in minutes. You can combine these approaches.

Paul: Amazing. What do people need to do to start preparing for this new world? Is it something they can just instantly drop in, and it’s a technology problem, or how much of this is a people problem where you need to start to get people to think differently? What do I need to do to get ready to get the most out of hybrid?

Adam: You can choose your approach. You can go with my preferred way, which we call the “golden pot.” Everything is built in, so you can go one way or another or somewhere in between. You can still run your old, good virtual machine side by side with this environment. There are loads and loads of know-how built into the structures and processes already in place. Both approaches will be there, and in Cloudera products, if you choose not to interface with this new world, it’ll be embedded for you.

Catch the full conversation with Adam Skotnicky on The AI Forecast on Spotify, Apple Podcasts, and YouTube.

Now is the Time for Higher Education Institutions to Master Data Lineage

Jeremiah Morrow, Hilary Billingslea, Art Jordan — Mon, 16 Mar 2026 17:43:00 UTC

In today's state, local, and education (SLED) environments—especially higher education—budgets are under constant scrutiny, and the demand for data excellence is constant. That means doing more with fewer resources. One high-impact change to your data workflows that can transform the quality of your data and AI while lowering costs is automating and documenting data lineage.

Higher education institutions are battling data complexity: critical data lives across systems and environments that were never designed to talk to each other, including on-premises databases, cloud environments, and edge devices. Managing fields like student IDs, grant IDs, or year-to-date endowment performance across sources and teams is necessary but difficult, manual, and prone to error.

Without trusted, high-quality data, high-impact analytics and AI use cases remain a pipe dream. However, if higher ed institutions have a unified view of data lineage across systems, they can successfully leverage this data for AI-driven insights and actions in curriculum development, student recruiting, student retention, efficient campus operations, migrations to the cloud, and so much more.

Cloudera Data Lineage provides an automated and consistent way to map the flow of data from its creation (source) to its ultimate consumption (BI or AI). It harvests and interprets metadata very quickly, helping organizations build a comprehensive knowledge graph that shows exactly how data is created, transformed, and consumed, consistently across the entire map with no gaps.

Achieving Data Excellence with Cloudera Data Lineage

In our recent webinar, Building Trust and Compliance in SLED Organizations, hosted by Cloudera and partner Carahsoft, panelist Art Jordan (Sales Go-to-Market Director, Data Intelligence Products for Cloudera Data Lineage) noted that “data lineage is a billion-dollar problem.” If you rely on manual processes and have blind spots in your data mapping, inefficiencies and delays are inevitable, which creates critical challenges around explainable AI, personally identifiable information (PII) privacy, and regulatory compliance.

Cloudera Data Lineage addresses these challenges by providing detailed views of lineage with dependencies and transformations consistently across the entire map:

Cross-system lineage: Provides lineage at the system level from the entry point, all the way to reporting, analytics, and any data consumer.

Inner-system lineage: Details the asset-level lineage within an extract, transform, and load (ETL) process, report, or database object. This includes seeing how a field is derived or calculated inside a pipeline or repository.

End-to-end lineage: End-to-end asset-level lineage between systems. This accounts for complex relationships where one field may feed multiple systems or come from multiple sources (one-to-many and many-to-one).

Mastering lineage gives higher education institutions the ability to perform upstream and downstream analytics and mapping quickly. It provides end-to-end visibility and governance, enabling organizations to understand where their data is going, where it came from, and how it was derived. This transparency, and the ability to guarantee integrity, is essential for ensuring that the data used in AI models and delivered to senior leadership and external partners is trusted and high quality.
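
To make upstream and downstream analysis concrete, here is a minimal sketch that treats lineage as a directed graph of field-level edges and walks it in either direction. The field names and the EDGES list are invented for illustration, and the code does not reflect how Cloudera Data Lineage stores or queries metadata.

```python
from collections import defaultdict, deque

# Hypothetical field-level edges: (source field, derived field).
EDGES = [
    ("sis.student_id", "warehouse.student_key"),
    ("crm.applicant_id", "warehouse.student_key"),  # many-to-one
    ("warehouse.student_key", "bi.retention_report"),
    ("warehouse.student_key", "ml.retention_features"),  # one-to-many
]


def _reach(start, adjacency):
    """Breadth-first search returning every node reachable from start."""
    seen, queue = set(), deque([start])
    while queue:
        for nxt in adjacency[queue.popleft()]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def downstream(field):
    """Everything that would be affected if this field changed."""
    adj = defaultdict(list)
    for src, dst in EDGES:
        adj[src].append(dst)
    return _reach(field, adj)


def upstream(field):
    """Every source that contributes to this field."""
    adj = defaultdict(list)
    for src, dst in EDGES:
        adj[dst].append(src)
    return _reach(field, adj)
```

Calling downstream on a source field answers the impact-analysis question (what breaks if this changes), while upstream on a report field answers the audit question (where did this number come from), including the one-to-many and many-to-one cases.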

Success Story: How The University of Arizona Improved Efficiency and Cut Costs with Cloudera Data Lineage

The University of Arizona (U of A), a major research university, implemented Cloudera Data Lineage within their University Analytics and Institutional Research department. Their environment included running 10,000 extract, transform, and load (ETL) jobs each night and housing close to 40,000 distinct columns in their data warehouse. Manual data documentation was challenging due to this sheer volume.

The university achieved significant efficiency gains and cost reduction by:

Performing ETL impact analysis: Analyzing the impact of major PeopleSoft updates (which change data types and lengths or delete columns) previously took the data engineering team a week or more. Cloudera Data Lineage cut this time down to a few days.

Consolidating artifacts: Each ETL job consumes compute, storage, and logging resources. Using Cloudera’s end-to-end metadata view, U of A consolidated artifacts, reducing ETL jobs from 10,000 down to 8,000. This 20% reduction lowered infrastructure costs, decreased pipeline complexity, and reduced operational overhead while improving data consistency and governance across the environment.

Leveraging rapid discovery: Using the Cloudera Data Lineage discovery module, the team compiled a list of all ETL jobs containing specific commented-out SQL. This task, which was required for a major system upgrade, would have taken significant time to perform manually but was completed instantly via automation.
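
For a sense of what the discovery task involves, the sketch below scans a set of job definitions for a fragment that appears only inside SQL comments. It is a simplified stand-in, not the Cloudera Data Lineage discovery module: the job dictionary and function name are hypothetical, and the regular expression does not account for comment markers that appear inside string literals.

```python
import re

# Matches SQL line comments (-- ...) and block comments (/* ... */).
COMMENT_RE = re.compile(r"--[^\n]*|/\*.*?\*/", re.DOTALL)


def jobs_with_commented_sql(jobs: dict, fragment: str) -> list:
    """Return names of jobs whose SQL comments contain the given fragment."""
    needle = fragment.lower()
    hits = []
    for name, sql in jobs.items():
        comments = COMMENT_RE.findall(sql)
        # Only text inside comments counts; live SQL is ignored.
        if any(needle in c.lower() for c in comments):
            hits.append(name)
    return sorted(hits)
```

Run across thousands of job definitions, a scan like this turns a manual review into a single pass, which is the kind of saving the example above describes.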

Crucially, Cloudera Data Lineage strengthened audit readiness and data accuracy by providing stakeholders with clear visibility into how data flows through pipelines, repositories, and BI reports. Instead of relying solely on the data engineering team to manually trace data origins and transformations, compliance, institutional research, and finance teams could independently verify where data came from and how it was calculated. This reduced the risk of reporting errors, accelerated responses to regulatory and accreditation inquiries, and more—all while easing pressure on lean IT budgets and resources.

Take the Next Step

Are you confident in your organization’s ability to prove compliance and data accuracy when faced with budget scrutiny or rapid operational change? What is the single most complex data pipeline transformation you would like to automatically document and map next week?

Let’s discuss how Cloudera Data Lineage can help you achieve data excellence.

Enterprise Landing Zones Matter: Why Cloudera Runs Natively Within Governed AWS Environments

Corin Bishop, Peter Ryan — Fri, 13 Mar 2026 19:26:00 UTC

Enterprise cloud adoption has matured. Organizations no longer deploy workloads into isolated or unrestricted cloud accounts. Instead, they operate within governed, cloud provider landing zones that enforce security, identity, networking, and compliance controls by default.

When data and AI platforms don’t integrate cleanly into these landing zones and instead expect customers to weaken governance or introduce exceptions to cloud controls, deployments slow down. Security reviews become more complex, operational risk increases, and platform teams lose confidence in long-term scalability.

Enterprise buyers increasingly expect data and AI platforms to work with their cloud governance models, not around them. Reflecting real customer conditions, we’re proud to note that the Cloudera platform runs natively inside AWS Control Tower–managed landing zones, delivering scale, compliance, and long-term trust.

Landing Zones Are Now the Enterprise Default

Landing zones act as a standardized cloud foundation, allowing organizations to scale securely and consistently. They define how accounts are created, how identity and access are managed, how networks are structured, and how security controls are enforced.

For large enterprises and regulated industries, operating within landing zones isn’t optional; it’s the default for running workloads in public clouds at scale.

Validating Cloudera with AWS Control Tower

To validate Cloudera under real-world enterprise conditions, we deployed the platform within an Amazon Web Services (AWS) landing zone built using AWS Control Tower. This environment included:

A multi-account structure aligned with enterprise patterns
Centralized AWS identity and access management (IAM)
Preventive and detective security guardrails
Standardized networking, logging, and monitoring

The validation demonstrated that Cloudera can be deployed, operated, and scaled without breaking or bypassing AWS landing zone controls. Running Cloudera natively within this environment reduces deployment risk, shortens security review cycles, and accelerates time to value for enterprise customers.
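As an illustration of what a preventive guardrail looks like, the sketch below builds a simplified region-restriction policy in the JSON shape used by AWS service control policies, which is one common way preventive guardrails are expressed. The approved regions, the `Sid`, and the `is_denied` helper are assumptions made for this example; the policy is deliberately simplified (a production policy would normally exempt global services), and the helper is a toy check, not AWS policy evaluation or anything taken from the validation itself.

```python
import json

# Simplified, illustrative region-restriction policy. The approved regions
# are hypothetical and chosen only for this example.
REGION_GUARDRAIL = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DenyOutsideApprovedRegions",
            "Effect": "Deny",
            "Action": "*",
            "Resource": "*",
            "Condition": {
                "StringNotEquals": {
                    "aws:RequestedRegion": ["us-east-1", "eu-west-1"]
                }
            },
        }
    ],
}


def is_denied(policy: dict, requested_region: str) -> bool:
    """Toy check: True if a Deny statement's region condition matches.

    Handles only the StringNotEquals / aws:RequestedRegion condition used
    above. It is not a substitute for real AWS policy evaluation.
    """
    for statement in policy["Statement"]:
        if statement.get("Effect") != "Deny":
            continue
        condition = statement.get("Condition", {}).get("StringNotEquals", {})
        allowed = condition.get("aws:RequestedRegion")
        if allowed is not None and requested_region not in allowed:
            return True
    return False


print(json.dumps(REGION_GUARDRAIL, indent=2))
print(is_denied(REGION_GUARDRAIL, "us-east-1"))   # approved region
print(is_denied(REGION_GUARDRAIL, "ap-south-1"))  # outside the approved list
```

A workload that runs natively in the landing zone is one that never needs a request this kind of policy would deny, so no exception has to be carved out for it.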

Specific outcomes from the validation exercise include:

Cloudera operates within AWS Control Tower–managed accounts without requiring privileged exceptions

Security and compliance guardrails remain intact

Platform operations align with enterprise IAM and networking models

Customers can deploy Cloudera as a first-class workload within their governed AWS environments

Governance and Innovation Are Not Opposites

There is a persistent misconception that governance slows innovation. In practice, strong cloud foundations enable faster and safer adoption by removing ambiguity and reducing operational friction.

By aligning our platform with enterprise landing zone architectures, Cloudera supports both innovation and control. Customers can confidently adopt advanced analytics and AI capabilities on the Cloudera platform without compromising their cloud governance model.

To learn more about how you can deploy Cloudera natively within governed AWS environments, reach out to our professional services team or check out our product demos.

How Cloudera and Salt AI Deliver a Flagship AI Foundation for Life Sciences

Aber Whitcomb, Andreas Skouloudis — Thu, 12 Mar 2026 13:00:00 UTC

Figure 1. How The Cloudera and Salt AI Partnership Accelerates Innovation in Life Sciences

From Experiments to Business Value

In enterprise deployments, combinations of Cloudera and Salt AI have enabled organizations to achieve unprecedented scale, with a throughput of thousands of data engineering jobs per hour, faster prototyping of complex R&D workflows, and step-change performance and cost improvements for machine learning workloads like AlphaFold2. For example, Salt AI has delivered processing times 22x faster than previous benchmarks for AlphaFold2. Equally important, these gains come with full telemetry, governance inheritance, and a clear audit trail for every workflow run. Ultimately, teams can focus on scientific outcomes rather than on integrating existing data and technology solutions.

Salt AI will continue to invest in interoperability with a broad ecosystem of clouds, data platforms, and models while collaborating with partners like Cloudera to publish concrete patterns that regulated industries can adopt and adapt. For life sciences teams, that means more choices—and clearer examples—for turning AI experiments into durable, trustworthy systems. Learn more about Cloudera capabilities and the Salt AI platform.

Life Sciences AI Needs Patterns, Not One‑Off Proofs Of Concept

Life sciences teams are working with more data, models, and regulatory scrutiny than ever before. And much of that data—omics, imaging, electronic health records, trial protocols, real‑world evidence, and more—is stored in unstructured formats that are hard to search and govern.

AI has the potential to redefine what’s possible in the life sciences—transforming vast, disconnected stores of biological and clinical data into actionable intelligence that accelerates discovery, sharpens decision making, and ultimately helps bring lifesaving innovations to patients faster. But first, organizations must prove that AI‑driven decisions are explainable, stable, and compliant.

In this environment, one‑off proofs of concept (POCs) are not enough. To achieve an acceptable level of governance and trust in AI-driven insights, life sciences organizations need to combine a trusted data and compute foundation with an intelligence layer that can orchestrate models and workflows at scale.

The Cloudera and Salt AI Partnership: A Reference Architecture for Contextual, Trusted AI at Scale

Cloudera and Salt AI are partnering to offer one powerful reference combination for life sciences teams.

Cloudera provides an open data lakehouse and enterprise AI platform that integrates data streaming, data engineering, data warehousing, and ML/GenAI at scale with a unified security and governance layer through SDX. This framework features attribute-based data access controls, lineage, and active metadata enrichment and cataloging.

Salt AI leverages those foundational security mechanisms and adds an orchestration layer across AI models and data. The scalable infrastructure continuously captures context—prompts, system prompts, workflow designs, run performance, user roles, and data sources—enabling complex use cases that capture full value from both specialized and general AI models. Tool calls for agentic operations can be readily spun up through Salt’s txt2 assistant, and pipelines come alive visually in the canvas, showcasing exactly how data flows.

This partnership enables life sciences organizations to apply fine-grained controls across on-premises, public cloud, and hybrid environments; use any model appropriate to a given task; and achieve an auditable, visual record of how AI systems make decisions.

In addition, both Cloudera and Salt AI drive computational and operational efficiencies across the data lifecycle. Leveraging GPU acceleration frameworks, Cloudera delivers improvements on data engineering and LLM inferencing workloads of up to 20x and 36x, respectively. Similarly, Salt AI offers optimizations such as a split-compute architecture that balances CPU and GPU processes, a sophisticated caching system, and the ability to swap, mix, and combine AI models into workflows. The more complex the pipeline and the more it is run, the greater the compute efficiencies when running on Salt.

Built to Live in a Broader Ecosystem

The Cloudera and Salt AI solution is explicitly designed to work seamlessly within each customer’s existing ecosystem of clouds, data platforms, and AI tools. It can be deployed in a customer’s virtual private cloud (VPC), with no public egress, and integrates with a diverse array of model providers, vector stores, and data systems.

Cloudera’s open data lakehouse, built on Apache Iceberg, offers a flexible and performant table format that combines multi-function analytics and automated data management capabilities (e.g., schema and partition evolution). This approach standardizes feature engineering workflows across disparate and diverse data sources, facilitating GxP compliance in life sciences.
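Schema evolution in Apache Iceberg is a metadata-only operation because columns are tracked by stable field IDs rather than by name or position. The toy model below illustrates that idea in plain Python; it does not use the Iceberg or Cloudera APIs, and the `ToySchema` class, column names, and sample row are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ToySchema:
    """Toy schema mapping column names to stable field IDs."""
    name_to_id: Dict[str, int] = field(default_factory=dict)
    next_id: int = 1

    def add_column(self, name: str) -> None:
        # A new column receives a fresh ID; existing data is not rewritten.
        self.name_to_id[name] = self.next_id
        self.next_id += 1

    def rename_column(self, old: str, new: str) -> None:
        # Metadata-only change: the field ID, and so the stored data, is untouched.
        self.name_to_id[new] = self.name_to_id.pop(old)


def read_rows(schema: ToySchema, stored: List[Dict[int, object]]) -> List[Dict[str, object]]:
    """Project rows stored by field ID onto the schema's current column names."""
    return [
        {name: row.get(field_id) for name, field_id in schema.name_to_id.items()}
        for row in stored
    ]


schema = ToySchema()
schema.add_column("patient_id")
schema.add_column("assay")
stored = [{1: "P-001", 2: "ELISA"}]  # hypothetical row, keyed by field ID

schema.rename_column("assay", "assay_type")
schema.add_column("batch")  # rows written before this column read it as None
print(read_rows(schema, stored))
```

The row written before the rename and the new column is still readable under the current schema, which is the property that lets feature engineering pipelines keep running as source schemas change.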

Additionally, the Cloudera Iceberg REST catalog enables data sharing with other public cloud data platforms (e.g., Databricks, Snowflake) that support Apache Iceberg tables. Salt AI offers a mechanism that transforms text queries into R&D workflows that orchestrate LLMs, graph databases, modeling tools, and internal systems. Furthermore, it empowers researchers to convert code (e.g., Python scripts) into visual workflows, improving cross-functional collaboration among research teams. These capabilities accelerate innovation cycles by democratizing siloed research initiatives and automating the integration of complex systems without the labor-intensive effort to build custom integration and orchestration logic.

For organizations standardizing on Cloudera, this partnership offers a fast path: governed data combined with contextual orchestration, ready for use cases like molecule design, drug repurposing, translational medicine, protocol authoring, and medical affairs assistants. For others, it serves as a blueprint for marrying existing data platforms with a context‑first AI orchestration layer.

Accelerating Humanitarian Impact with AI

Debbie Kruger — Thu, 12 Mar 2026 13:00:00 UTC

Mercy Corps operates in environments where timely, well-informed decisions are essential to effective crisis response. Teams must rapidly assess conditions and draw on research and historical knowledge, often under intense pressure.

As global crises have increased in scale and complexity, this model has become harder to sustain. At the same time, funding constraints have driven sector-wide contraction, requiring organizations like Mercy Corps to do more with fewer resources, even as delays in analysis can have real consequences on the ground.

To address this challenge, Mercy Corps began exploring how data and AI could reduce friction in crisis research without replacing human judgment. By combining Mercy Corps’ humanitarian expertise with Cloudera’s data and AI capabilities, the two organizations set out to strengthen crisis response and support Mercy Corps’ mission at scale.

Managing Processes at Scale

Mercy Corps’ Global Crisis Analysis teams support decision-making across the organization by producing research on aid and development topics in rapidly changing contexts. Their work informs everything from emergency response planning to longer-term program design. These teams analyze conflict dynamics, food insecurity, displacement trends, and economic shocks to help anticipate needs and guide action.

Historically, this research relied on manual processes. Analysts navigated across numerous news sources, websites, and information platforms, copying and recording information into spreadsheets and documents before synthesizing it into reports. While thorough, this process was time consuming and created bottlenecks when rapid crisis analysis was required.

As the scale and pace of crises increased, Mercy Corps recognized that this model was not sustainable. The organization also faced practical constraints. Technical capacity was limited, teams were under-resourced, and building new AI solutions internally would have required investments that were difficult to absorb while maintaining existing operations.

Realizing the Power of Professional Services

Cloudera’s Professional Services team provided the capacity and expertise Mercy Corps needed at a critical moment. And through this partnership, Mercy Corps gained support from leading technical experts without the added strain of bringing in additional staff or infrastructure.

“The intention of this project wasn’t just to come in, do the work and then leave,” said Laurence Da Luz, Senior Director, CTO & Portfolio. “It was to set them up to be self-sufficient.”

Cloudera’s team brought deep experience in data, analytics, and AI, along with a clear understanding of the operational and mission constraints humanitarian organizations face. Working closely with Mercy Corps stakeholders, the Professional Services team helped translate real-world challenges into a scalable solution that could evolve as needs changed.

Rather than approaching the engagement as a one-time delivery, the focus was on partnership and enablement. The goal was to move quickly during a period of crisis while setting Mercy Corps up with a solution they could adapt, extend, and sustain over time.

A Human-Centered Approach to AI

From the outset, the partnership was guided by a clear objective: start with the people and decisions that matter most. Cloudera Professional Services worked closely with Mercy Corps teams to understand how crisis research happens in practice and where delays and bottlenecks most directly affect outcomes.

“Recognizing that there is still a human element in the solution was vitally important,” said Da Luz. “The goal for us wasn’t to replace what they do with AI, as much of the work still requires human nuance and expertise.”

Rather than attempting to automate judgment, the solution was designed to accelerate it. AI was applied to handle information aggregation and early summarization, enabling analysts to spend more time interpreting findings and applying contextual expertise where human judgment is essential.

This approach resulted in a flexible, AI-driven research capability that brings fragmented workflows into a more unified experience. These capabilities allowed analysts to quickly identify, access, and synthesize information from diverse sources, reducing research cycle time while maintaining human oversight.

At a technical level, Mercy Corps’ solution leverages multiple agentic workflows aligned to different humanitarian research themes. These agent workflows process large volumes of diverse, fast-changing humanitarian and social data. The resulting output helps surface highly relevant information based on the analyst’s stated objectives. Because the system supports conversational interaction, analysts can iteratively refine results and guide the output toward their specific scenario, while retaining full control over interpretation and final conclusions.

Designed for the realities of humanitarian work, the solution adapts to varied geographies, audiences, and crisis types without requiring significant changes to existing workflows. Support for evolving research needs, multilingual sources, and rapidly changing conditions allows teams to respond faster and make more informed decisions in moments where timing and context are critical.

Impact Beyond Innovation

For Cloudera team members, working on the Mercy Corps project has been especially meaningful. Beyond the technical challenges, the work offered a direct connection between technology and social impact. Many involved have spoken about the pride that comes from knowing their work helps support humanitarian efforts around the world.

“It’s quite humbling when you sit and understand the work they’re doing and the reasons behind it,” said Alastair Elliot, Director of Professional Services, North EMEA.

The project gave the team new insights and learnings to help refine and expand on Cloudera’s existing AI capabilities. It also directly helped to strengthen Cloudera’s library of proven patterns and reference architectures, applicable across industries. This combination of learning and collaboration reflects the company’s culture of empowering teams to pursue work that aligns with both business goals and values.

AI Solutions for a Deeply Human Purpose

Cloudera’s partnership with Mercy Corps demonstrates what is possible when advanced data and AI capabilities are paired with a clear mission and a collaborative approach. By focusing on human needs, operational realities, and long-term sustainability, the two organizations delivered a solution that accelerates impact where it matters most.

We are proud of the work accomplished together and inspired by the potential ahead. This collaboration serves as a model for how organizations can apply AI responsibly, effectively, and with purpose, not just to solve technical problems, but to support people and communities around the world.

Learn more about how Cloudera’s Professional Services team can support the most complex data and AI initiatives.

Scalable AI Economics: Achieving Secure, Hybrid Intelligence with Cloudera, AMD, and Dell Technologies

Stephen Catanzano — Wed, 11 Mar 2026 13:00:00 UTC

Enterprise interest in generative and agentic AI has accelerated dramatically over the past two years. Organizations across industries are exploring how AI agents, intelligent assistants, and automation can improve productivity, streamline operations, and unlock insights from growing volumes of enterprise data. Yet as enthusiasm grows, so do questions around cost, security, and operational complexity.

One reality is becoming increasingly clear: not every AI workload requires graphics processing units (GPUs) or massive foundation models. In fact, many high-value enterprise use cases can be delivered efficiently using central processing units (CPUs) and smaller, task-focused language models, particularly when deployed close to the data they serve.

A growing number of organizations are now reevaluating their AI strategies through this lens. Rather than pursuing scale at any cost, they are prioritizing return on intelligence: the ability to deploy AI solutions securely, economically, and at scale. This shift is shaping how enterprises think about infrastructure, data architecture, and governance as AI moves from experimentation into production.

A Shift in Enterprise AI Economics

Research from Enterprise Strategy Group (now part of Omdia) indicates that approximately 80% of organizations view AI agents as a top or high business priority. These agents promise tangible benefits through automation, faster decision-making, and improved employee and customer experiences. However, many organizations continue to struggle with the cost and operational burden associated with GPU-centric deployments.

GPU infrastructure can introduce significant capital expense, power consumption, and supply-chain constraints. For many real-time inference and knowledge-driven workloads, this approach can be misaligned with business needs. As a result, enterprises are increasingly exploring alternatives that better match compute resources to workload requirements.

This is where CPU-based AI, paired with smaller language models, has emerged as a practical option. Rather than pursuing the largest possible models, organizations are using the assets they already own to ease the budget pressure of purchasing or renting GPUs. The aim is to right-size AI architectures for efficiency, security, and scalability.

Right-Sized AI and the Role of Small Language Models

Small language models (SLMs) are designed to perform specific enterprise tasks such as summarization, question answering, content generation, and code assistance. Typically containing far fewer parameters than large language models, SLMs can run effectively on modern CPUs while delivering strong performance for targeted use cases.

This approach offers several advantages. CPU-based inference reduces infrastructure costs, lowers power consumption, and simplifies deployment. It also enables organizations to run AI workloads within existing data centers or private cloud environments, addressing concerns around data sovereignty and regulatory compliance.
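A rough way to see why parameter count matters for CPU deployment is to estimate the memory needed just to hold a model's weights: parameters multiplied by bits per parameter, divided by eight. The sketch below applies that arithmetic; the two model sizes are hypothetical round numbers, and the estimate ignores activations, caches, and runtime overhead.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate memory for model weights alone, in gigabytes (10^9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# Hypothetical model sizes, chosen only to show the scale difference.
for label, params in [("3B-parameter model", 3e9), ("70B-parameter model", 70e9)]:
    for bits in (16, 8, 4):
        print(f"{label} at {bits}-bit: {weight_memory_gb(params, bits):.1f} GB")
```

Under these assumptions, the smaller model's weights fit in a few gigabytes of ordinary server memory, while the larger one needs tens to over a hundred gigabytes, which is part of why task-focused models are a natural fit for existing CPU infrastructure.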

Within this context, Cloudera has positioned its Private AI strategy around enabling enterprises to deploy and operate AI systems entirely within their own controlled environments. By combining an open data lakehouse architecture with integrated governance and MLOps capabilities, Cloudera supports AI development that remains close to enterprise data.

Infrastructure Matters: CPUs and Enterprise Platforms

The effectiveness of CPU-based AI depends heavily on the underlying infrastructure. Advances in modern processors have significantly improved performance-per-dollar for analytics and inference workloads. AMD EPYC™ processors, for example, are designed to deliver high core density, strong memory bandwidth, and built-in security features, making them well suited for AI inference and data-intensive workloads.

When deployed on enterprise-grade systems from Dell Technologies, organizations can scale AI workloads reliably while leveraging validated architectures optimized for data and AI platforms. This combination allows enterprises to modernize AI capabilities without re-architecting their entire infrastructure footprint.

From an operational perspective, this model enables organizations to reuse existing investments, accelerate deployment timelines, and reduce dependency on specialized hardware. Across these scenarios, the emphasis is not on model size, but on efficiency, responsiveness, and trust.

Practical AI Use Cases With CPUs

Many of today’s most valuable AI applications can run efficiently on CPUs without the need for massive models or GPU acceleration. Examples include:

Internal Knowledge Assistants

Enterprises often store critical knowledge across documents, emails, and reports. By applying SLMs to this data, organizations can enable natural-language access to internal information, improving decision-making while keeping sensitive data on premises.

Employee and Agent Assist Chatbots

HR, IT, and customer support teams face recurring questions that can be automated through secure, internal chatbots. CPU-based AI enables always-available assistance without introducing external data exposure.

Content and Documentation Generation

Marketing, compliance, and engineering teams frequently produce repetitive content. AI-assisted generation and summarization can accelerate workflows while maintaining consistency and governance.

Software Development Support

SLM-powered assistants can generate code snippets, tests, and documentation within enterprise firewalls, helping development teams improve productivity without sending intellectual property to public AI services.

Predictive Analytics and Optimization

In manufacturing and operations, CPU-based AI models analyze sensor and operational data to predict failures and optimize performance, reducing downtime and operational costs.

Data Gravity and the Importance of On-Premises AI

Despite widespread cloud adoption, a significant portion of enterprise data remains on premises. Omdia research indicates that many organizations keep between 26% and 75% of their data in local or private environments. This data gravity presents challenges when AI processing requires moving sensitive information to external platforms.

Private AI architectures address this challenge by bringing AI to the data rather than the other way around. By running AI workloads within existing environments, organizations reduce latency, improve performance, and maintain compliance with regulations such as GDPR, HIPAA, and industry-specific mandates.

Cloudera’s approach integrates data ingestion, governance, model management, and serving within a single platform. Combined with CPU-based infrastructure, this enables enterprises to move from pilot projects to production AI more efficiently.

From Pilot to Production: Measuring Outcomes

One of the most significant barriers to AI adoption has been the gap between proof-of-concept and production deployment. CPU-based AI architectures help narrow this gap by reducing cost and operational complexity.

Organizations adopting this approach report several outcomes:

Lower total cost of ownership for inference-heavy workloads
Faster deployment cycles by avoiding specialized hardware procurement
Reduced energy consumption aligned with sustainability goals
Improved ROI through workload-appropriate compute selection

These benefits reinforce a growing consensus that enterprise AI success depends as much on economics and governance as it does on model performance.

Conclusion: A Practical Path Forward for Enterprise AI

The next phase of enterprise AI will not be defined by the largest models or the most powerful hardware. Instead, it will be shaped by organizations that can deploy AI securely, economically, and at scale, using architectures aligned with real business needs.

By combining Cloudera’s data and governance platform with AMD EPYC processors and Dell Technologies infrastructure, enterprises have a viable path to operationalizing AI within their own environments. This right-sized approach enables organizations to focus on outcomes, not infrastructure complexity, and to unlock AI value where their data already lives.

As enterprises continue to move AI initiatives from experimentation into production, practical, CPU-based Private AI architectures are likely to play an increasingly important role.

To learn more about achieving economical AI with Cloudera, AMD, and Dell Technologies, download the Omdia Showcase Brief.

When AI Models Converge, Proprietary Data Becomes the Advantage

Pamela Pan — Tue, 10 Mar 2026 16:09:00 UTC

Today’s leading large language models (LLMs)—including Claude, GPT, Gemini, Grok, Mistral, and Llama—are all trained on broadly available public internet data and built on comparable architectures. As a result, performance gaps between models are shrinking, and the competitive edge once associated with choosing a specific AI model is narrowing. At the same time, business research and executive commentary increasingly point to the same dynamic: AI delivers the greatest long-term value when it can run on proprietary, organizational data that competitors cannot access or replicate.

"For these [foundation] models to reach their peak value, you need to train them not just on publicly available data, but you need to make privately owned data available to those models." -Oracle Founder and CEO Larry Ellison, Oracle AI World 2025

As foundational capabilities become more standardized, differentiation shifts from the model itself to how effectively enterprises capture, govern, and operationalize their unique data assets. That shift raises a practical question: how do organizations turn proprietary data into a lasting AI advantage?

RAG Is a Starting Point, Not a Differentiation Strategy

Many organizations begin their AI journey with a simple architecture: call a cloud-hosted model and add retrieval-augmented generation (RAG) to pull in internal documents. This approach is effective for early experimentation. It allows teams to build prototypes quickly and demonstrate value immediately.

However, it has limitations when the goal is competitive differentiation. RAG retrieves information at query time, but it does not fundamentally change how the model understands a domain. The model remains general-purpose, and the underlying enterprise knowledge stays external to the model itself. If competitors can access the same base models and implement similar retrieval pipelines, the resulting capabilities are difficult to distinguish.
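The query-time nature of RAG can be sketched in a few lines. The example below ranks documents by simple word-overlap cosine similarity and pastes the best match into a prompt; production systems typically use embedding models and a vector store instead, and the documents, question, and function names here are hypothetical. No model is called, which is the point: retrieval changes the prompt, not the model.

```python
import math
import re
from collections import Counter
from typing import List


def tokenize(text: str) -> Counter:
    """Lowercase bag-of-words representation of a text."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query: str, docs: List[str], k: int = 1) -> List[str]:
    """Return the k documents most similar to the query."""
    q = tokenize(query)
    return sorted(docs, key=lambda d: cosine(q, tokenize(d)), reverse=True)[:k]


def build_prompt(query: str, docs: List[str]) -> str:
    """Assemble the text that would be sent to an unchanged, general-purpose model."""
    context = "\n".join(retrieve(query, docs))
    return f"Context:\n{context}\n\nQuestion: {query}"


# Hypothetical internal documents, for illustration only.
docs = [
    "Refund requests above 500 dollars require manager approval.",
    "The cafeteria is open from 8am to 3pm on weekdays.",
]
print(build_prompt("Who approves refund requests?", docs))
```

Any competitor with the same base model can build an equivalent pipeline over its own documents, which is why retrieval alone is hard to turn into lasting differentiation.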

For enterprises seeking durable advantage, simply retrieving proprietary data is not enough. The model must learn from it.

Building AI on Proprietary Data

To turn proprietary data into a lasting advantage, organizations need to go beyond simply querying external models. They need to adapt models to their own data and run them within environments they control. This is where fine tuning and private inference become important.

Fine Tuning

Fine tuning allows organizations to adjust a model’s internal weights using proprietary datasets so that domain knowledge is embedded in how the model behaves. Instead of retrieving information at query time, the model begins to understand the organization’s terminology, workflows, and decision patterns.
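To show what "adjusting internal weights" means mechanically, here is a deliberately tiny sketch: a two-parameter linear model stands in for a far larger network, and gradient descent on a small dataset moves its weights from their "pretrained" starting values toward the pattern in that data. The starting weights, the data, and the hyperparameters are all hypothetical; real fine tuning of language models uses the same principle at vastly greater scale and with dedicated tooling.

```python
from typing import List, Tuple


def fine_tune(
    w: float,
    b: float,
    data: List[Tuple[float, float]],
    lr: float = 0.05,
    epochs: int = 1000,
) -> Tuple[float, float]:
    """Adjust the weights of y = w*x + b by gradient descent on mean squared error."""
    n = len(data)
    for _ in range(epochs):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


def mse(w: float, b: float, data: List[Tuple[float, float]]) -> float:
    """Mean squared error of the model on the data."""
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)


# Hypothetical "proprietary" data that follows y = 2x + 1.
proprietary = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]

w0, b0 = 1.0, 0.0  # stand-in for general-purpose "pretrained" weights
w1, b1 = fine_tune(w0, b0, proprietary)

print(f"error before: {mse(w0, b0, proprietary):.3f}")
print(f"error after:  {mse(w1, b1, proprietary):.6f}")
print(f"weights after fine tuning: w={w1:.3f}, b={b1:.3f}")
```

After training, the domain pattern lives in the weights themselves, so nothing needs to be retrieved at query time to reproduce it.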

In many cases, organizations also augment their training pipelines with synthetic data, generating enterprise-grade datasets that expand training coverage while addressing compliance and data availability challenges. Over time, these approaches create AI systems that are aligned with the business itself, not just the public Internet.

AI Inference

Once models are adapted to proprietary data, the next step is how they are deployed and operated in production. Running AI inference within private infrastructure allows organizations to operate AI systems directly within their enterprise environment. This approach provides several important benefits:

Data privacy and control. Prompts, model artifacts, and outputs remain within the organization’s environment rather than being sent to external services.

Improved performance. Deploying models closer to where enterprise data resides can reduce latency and improve responsiveness for production applications.

Unified governance. Security policies, access controls, and data lineage can be maintained consistently across the entire AI lifecycle.

At enterprise scale, competitive advantage increasingly comes from the ability to adapt models to proprietary data and run models where that data resides.

Your Data, Your Models, Your Way

In a world where foundation models continue to converge, the ability to operationalize AI on unique enterprise data will increasingly define long-term competitive advantage.

Cloudera believes the next era of enterprise AI will be defined by this shift toward Private AI architectures. With Cloudera AI Workbench, AI Inference Service, and AI Studios, which include low-code tools for RAG and model fine tuning, we provide the end-to-end, governed control needed to ingest, fine-tune, and serve models within your trusted perimeter, across any cloud or data center.

Dr. Jake Trippel on Why Your Technical Debt Is Compounding

Cloudera — Tue, 10 Mar 2026 14:00:00 UTC

AI is only as powerful as the data architecture behind it.

In episode 52 of The AI Forecast, Why LLMs Aren’t Enough and How AI Fabrics Will Change Everything, host Paul Muller sits down with Dr. Jake Trippel, Dean of the College of Business and Technology at Concordia University, St. Paul, and Co-Founder & CTO of Codename 37, to unpack what’s holding enterprises back from scaling AI:

Siloed data architecture
Misunderstanding of the power of machine learning, deep learning, and neural networks
Compounding technical debt

Their conversation ranges from cloud versus on-prem economics to the coming shift from SaaS applications to bot-based experiences. Below are key moments from their discussion.

Why AI Architectures Are Hitting Their Limits

Paul: Tell us about what we’ve seen in the past with AI and data architectures, and why we need to rethink them now.

Jake: We went through the digital transformation era, that was the challenge with data. We stayed in data silos because that's how our platforms were architected, and that's how data was organized. Then we tried to do a bunch of integrations. We tried to do all these app integration engines. We tried to find nifty ways to do it, but what happened was we created a spaghetti mess pulling ELT to ETL, system to system.

Now fast forward to today. The challenge now is that these organizations are incentivized to keep us in silos because now comes AI data silos, the data still in silos, and that's where the power of cloud comes in. That's where we're proud to be a Cloudera partner.

Imagine the same problem, except amplified. I’ve got AI agents up the kazoo — awesome — but they’re only working inside their own data silo.

People are going to want more. They’re going to want agents that can work together, talk together, and reason together. But how do you do that if your data is still stuck in silos? To get to this data mesh state is going to require a transformational change, and that's why Cloudera is a cool solution that can help folks do that.

Why Large Language Models Aren’t Enough

Paul: What are some of the hacks, best practices, tips or tricks that you use to help you get the most out of what you do with data?

Jake: The biggest thing is understanding that large language models are not the answer for everything. AI is a big world.

Large language models are awesome for some things, but they’re really bad for others. People have to understand the power of machine learning, deep learning, and neural networks — which are really the guts of the other two.

The skillset of our time right now is being able to develop or use the right models for the right use cases, and to rapidly get through data. That’s where people need to focus.

The Compounding Effect of Technical Debt

Paul: How do organizations, in your opinion and experience, pragmatically start to move from where they’ve been to where they’re going? How do they clean their data up? Is there a mechanism by which they can do it without breaking?

Jake: That's a big loaded question, so I'll try to pull it apart a little bit. You’re three decades in for a reason. We still see AS/400s out there — and they work. You got to give IBM credit.

The challenge that these organizations have though is how much capital are you expending? Because of the compounding effect of this technical debt — you can kick the can down the road year after year, decade after decade. The cost is only going to grow.

But now at least you have options. We can take the data out and we can do a lot more with it than we ever have before. Instead of the rip-off-the-Band-Aid approach, as long as we have, and continue to have, access to the data, we can now create any type of experience we want in parallel.

Why Some AI Workloads Are Moving Back On-Prem

Paul: What are you seeing with your existing clients today as they’re looking to deploy new workloads?

Jake: We are seeing a massive migration back to on-prem. Couldn’t believe it. Never would have predicted that.

As these organizations are doing more model development, training, and so on, the cloud cost model is just too expensive. I have not met a CFO who’s excited about spending how much a month training these models.

So, they’re making the investment. They’re going back to data centers. They’re depreciating it over the next five years. We’re seeing this in medical devices, financial services, aviation — it’s typically hybrid, but for particular workloads, especially training and development, it’s way more cost effective.

AI as an Amplifier for Learning — Good and Bad

Paul: What are you seeing in terms of the academic world and how we prepare the workforce of the future?

Jake: AI is an amplifier. It’s going to amplify the good — and it’s going to amplify the bad.

On the good side, people will learn 10, 20 times faster than they ever have before. I’ve built models that can read books in three seconds flat. I can now immerse myself in the data and create any type of learning experience I want adapted to my learning style.

The bad side is students choosing, I don’t have to do anything. I can let AI do all my work and I’m not going to learn anything. That’s the part that scares me.

The skillset of our time is, I hope you like learning. You’re going to be doing it every single day of the rest of your career.

Listen to the full conversation with Dr. Jake Trippel on The AI Forecast on Spotify, Apple Podcasts, and YouTube.

Cloudera’s 2026 Trends in Data and AI Webinar Recap

Robert Hryniewicz — Mon, 09 Mar 2026 14:00:00 UTC

I recently sat down with Manasi Vartak, Cloudera’s chief AI architect, and Mike Gualtieri, vice president and principal analyst at Forrester Research, for Cloudera’s 2026 Trends in Data and AI webinar to discuss how to deploy agentic AI at scale.

While our conversation had a forward-thinking, future-oriented slant, I kicked off the webinar by posing this retrospective question: What is one belief about AI that died in 2025?

Between the three of us, we discovered that in 2025, several long-held beliefs about AI finally collapsed. I want to share with you the philosophies Manasi and Mike identified that we are leaving behind as we step into this new and exciting year in AI development.

The Beliefs That Died: The Intellectual Gatekeeping of Agentic AI 

2025 began with the belief that agentic AI would be accessible only to a select few. With novel technologies, it is a basic instinct to defer to the tried-and-true experts: PhDs, engineers, and so on.

However, we are now seeing regular business users build their own functional AI pipelines. Manasi recalled the “lightning strike moment” from last year that sparked this realization—at a hackathon in our Agent Studio, an employee from our strategy department built a complete pipeline that had the potential to save $3 million a year. This was an incredible feat performed by someone without specialized training in agentic AI strategy.

To Manasi, this was the sign that agentic AI is truly being democratized across the board.

The Beliefs That Died: Ubiquity of AI Hallucinations 

This past year, Mike noticed a marked reduction in AI hallucinations. He acknowledged they still occur but pointed out that, in the past, conversations surrounding AI use focused heavily on them as a threat to its dependability. Now, these fears are much less common.

Mike posited that people now have a better understanding of how to control the scope of an LLM through prompting, RAG techniques, and other methods. Enough users now understand the circumstances in which hallucinations arise, as well as the techniques that mitigate or eliminate them.

The Bigger Pattern 

AI has become genuinely actionable because it is now reliable and usable at scale. As agentic AI becomes more democratized, autonomous systems are no longer limited to elite technical teams—they can be deployed across organizations to execute defined tasks end-to-end. Improved accuracy and fewer hallucinations mean these systems can operate with minimal human oversight, shifting AI from an advisory role to an operational one.

Operational AI truly stands out because it reliably eases manual work while achieving impressive results like quicker cycle times, cost savings, and better decision-making. It’s exciting to see how automation brings real value to daily operations, making them smarter and more efficient, rather than just being limited to isolated tests.

Why These Belief Shifts Matter Going Into 2026

As trust in AI becomes informed rather than aspirational, the question is no longer whether AI can act, but where it is allowed to act. With increased confidence in data integrity and greater output reliability, AI can now move beyond isolated silos into core business processes and decision-making loops.

The real challenge now is whether organizations are structured to support this democratization. Spreading AI throughout the entire company means shifting away from bottlenecks that restrict experimentation to just a few technical teams. When operational leaders can safely access data across different environments, they’re empowered to build, test, and launch AI-powered tools that truly meet business needs. Without wider, well-managed access to data, AI stays centralized and disconnected from daily operations.

Organizations stuck in old beliefs or unwilling to adapt to new ones risk stalling and falling by the wayside of technological advancements. Cloudera’s platform is designed to avoid this outcome and weather these changes in the ever-volatile AI landscape. Whether your data resides in the cloud, in data centers, or at the edge, Cloudera provides universal access to data for AI across the entire enterprise, with governed, enterprise-wide intelligence.

These themes and more are covered in detail by Manasi, Mike, and me in our talk, and I invite you to explore these shifts in greater depth with us in our 2026 Trends in Data and AI webinar. For more insight into what these observations mean in practice and how your organization can make the most of democratized AI in your own environment, explore Cloudera’s latest resources.  

From Log Overload to Mission Readiness: Rethinking Government Data Architecture

Ian Brooks — Mon, 02 Mar 2026 17:00:00 UTC

Across government agencies today, data is both a mission enabler and a hidden drain on resources. From cybersecurity and threat detection to compliance and citizen service delivery, public-sector missions depend on timely, trusted data. Yet the success of these programs—and the regulations that ensure their accountability—create an invisible cost: a flood of log data that strains infrastructure, slows systems, and inflates storage budgets.

To stay compliant, agencies and other regulated organizations must manage this growing data volume responsibly. But as it accumulates, log data can overwhelm even the most capable environments—consuming storage, increasing processing time, and degrading overall performance.

For many agencies, security information and event management (SIEM) platforms like Splunk sit at the heart of cybersecurity operations, yet even these best-in-class tools can struggle to keep pace. That’s why progressive agencies are rethinking the data architecture behind their SIEM platforms. Not abandoning SIEM, but optimizing how data moves into and through those systems. Let’s talk about what that looks like in practice.

A New Approach to Data Movement: Cloudera Data Flow

Public-sector organizations are increasingly adopting solutions to streamline data movement. Smarter data distribution helps agencies improve system performance and reliability, control costs, and maintain end-to-end awareness of how data moves across their environments.

Cloudera Data Flow provides centralized control and visibility across on-premises and cloud environments, helping agencies manage data more securely and efficiently at scale. Rather than relying on one-off pipelines or manual integrations, Cloudera Data Flow functions as a connective layer that intelligently routes, filters, and delivers data where it’s needed. In short, it connects and manages data intelligently across environments, minimizing duplication and complexity while conserving both infrastructure and human resources.

For agencies balancing tight budgets and strict mandates, Cloudera Data Flow offers clear advantages, including:

Optimized resources: Route only the most critical data to Splunk or other SIEM tools, while archiving less-urgent logs in cost-effective object storage

Reduced noise: Preprocess and filter high-volume data to accelerate analysis and improve the signal-to-noise ratio

Maintained compliance: Preserve auditable chains of custody and full observability of every data flow

Hybrid continuity: Support mission operations seamlessly across secure on-premises environments and evolving cloud initiatives
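
The routing and filtering idea in the list above can be sketched in a few lines of plain Python. This is only an illustration of the logic, not Cloudera Data Flow or Apache NiFi configuration: the `route_record` and `distribute` functions, the severity names, and the `siem`, `archive`, and `drop` destinations are all hypothetical, and a real flow would express similar rules in the tool's own processors.

```python
from collections import defaultdict

# Hypothetical severities that should reach the SIEM immediately.
SIEM_SEVERITIES = {"critical", "error", "warning"}
# Hypothetical noise that is filtered out before delivery.
DROPPED_SEVERITIES = {"debug"}


def route_record(record: dict) -> str:
    """Return the destination name for a single log record."""
    severity = record.get("severity", "info").lower()
    if severity in DROPPED_SEVERITIES:
        return "drop"
    if severity in SIEM_SEVERITIES:
        return "siem"
    # Everything else goes to lower-cost archival storage.
    return "archive"


def distribute(records: list) -> dict:
    """Group records by destination, leaving out the dropped ones."""
    routed = defaultdict(list)
    for record in records:
        destination = route_record(record)
        if destination != "drop":
            routed[destination].append(record)
    return dict(routed)


if __name__ == "__main__":
    sample = [
        {"severity": "CRITICAL", "msg": "failed login burst"},
        {"severity": "info", "msg": "scheduled job finished"},
        {"severity": "debug", "msg": "cache lookup"},
    ]
    for destination, items in distribute(sample).items():
        print(destination, len(items))
```

Whether any record may be dropped at all depends on an agency's retention rules; where every log must be kept, the `drop` branch would simply route to the archive instead.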

Interested in a deep dive into how universal data distribution works with Cloudera?

Explore the step-by-step guide on optimizing Splunk log ingestion with Cloudera Data Flow to see how this can be implemented in practice.

Rethinking the Data Pipeline

The shift toward universal data distribution reflects a larger change in how agencies think about data pipelines. For years, data integration was treated more like retrofitted plumbing—cobbling together different pipes and materials to connect and move data stored in different formats, within different tools, and governed by different rules.

Today, the limitations of that approach are clear. For true operational resilience, data flows need to be unified and transparent, regardless of where the data lives. Open-source technologies like Apache NiFi have made this approach more accessible, allowing agencies to test, replay, and adjust data flows without disruption.

Using an open-source framework allows these disparate systems and data formats to work together seamlessly, enabling modernization without abandoning existing investments. For public sector IT leaders, this evolution strengthens mission continuity.

By reimagining data distribution as a core capability, agencies can turn what was once operational overhead into an architectural advantage that keeps everything operating smoothly and in sync.

A Future-Proof Data Strategy for the Public Sector

Looking ahead, data complexity isn’t going away; it’s accelerating. The growth of technologies such as edge devices, IoT sensors, and AI-enabled monitoring will only increase the volume and variety of data that must be collected, secured, and analyzed while remaining compliant.

Agencies that invest now in flexible, distribution-first architectures will strengthen both their cybersecurity and compliance postures while ensuring they’re well positioned to adapt to whatever comes next. Tools like Cloudera Data Flow make it possible to achieve the scalability, observability, and performance that today’s public sector organizations demand.

Why Native Observability is the Heart of Hybrid Cloud

Ron Pick — Fri, 27 Feb 2026 14:00:00 UTC

In the current enterprise technology landscape, we’re witnessing an industry-wide scramble. As organizations shift from monolithic architectures to complex environments leveraging heterogeneous infrastructures, cloud-based data platforms are hitting a visibility (that is, observability) wall. Their response has been a wave of reactive, multi-billion-dollar acquisitions designed to bolt on the observability that they lack natively.

But observability shouldn't be a postscript or a line item from a recent merger; it must be a core capability. At Cloudera, we’re evolving our native observability DNA into a unified, hybrid-first powerhouse, proving that true insight across the entire data estate is a foundational requirement for a unified data fabric, open data lakehouse, data in motion, AI, and your data platform as a whole. This is true whether you run your apps, workloads, models, and agents in public clouds, on-premises in data centers, or at the edge.

The Multi-Faceted Nature of Observability: Beyond Simple Monitoring

True observability is not a single tool; it’s a foundational capability baked into the data platform to answer critical questions for every stakeholder across the data estate. Whether it’s a business analyst wondering why a dashboard hasn't refreshed, a database admin investigating a long-running query, or a system admin identifying skewed data storage across cluster nodes, observability must offer telemetry that’s integrated to provide immediate, actionable answers.

In the reality of hybrid and multi-cloud landscapes, relying on separate, single-purpose tools (for data quality, cloud performance, infrastructure health, and so on) that don’t operate across the entire data landscape doesn’t grant true visibility. Instead, it creates a data silo problem of disconnected islands of observed systems.

It’s the interplay between these systems (in data, workloads, resource utilization, etc.) that necessitates observability. When these categories are disconnected, organizations lose the deep context required for operational excellence. To achieve that level of insight requires visibility that links logs, metrics, and traces cohesively between the data layer and the underlying infrastructure, along with everything in between.
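
To make "linking logs, metrics, and traces" concrete, here is a minimal sketch that joins three separately collected telemetry streams on a shared identifier. It assumes every record carries a `trace_id` field, which is a simplification (real metrics often carry no trace ID at all), and the `correlate` function and all field names are hypothetical rather than part of any Cloudera or OpenTelemetry API.

```python
from collections import defaultdict


def correlate(logs: list, metrics: list, spans: list) -> dict:
    """Group telemetry from three sources by their shared trace_id."""
    by_trace = defaultdict(lambda: {"logs": [], "metrics": [], "spans": []})
    sources = (("logs", logs), ("metrics", metrics), ("spans", spans))
    for kind, items in sources:
        for item in items:
            by_trace[item["trace_id"]][kind].append(item)
    return dict(by_trace)


if __name__ == "__main__":
    # Hypothetical telemetry for one slow query (t1) and one healthy one (t2).
    logs = [{"trace_id": "t1", "message": "query spilled to disk"}]
    metrics = [
        {"trace_id": "t1", "name": "node_cpu_percent", "value": 97},
        {"trace_id": "t2", "name": "node_cpu_percent", "value": 12},
    ]
    spans = [
        {"trace_id": "t1", "name": "run_query", "duration_ms": 48000},
        {"trace_id": "t2", "name": "run_query", "duration_ms": 300},
    ]
    view = correlate(logs, metrics, spans)
    # One lookup now answers "what else happened during the slow query?"
    print(view["t1"])
```

Without the shared key, the slow span, the disk-spill log line, and the saturated CPU reading would sit in three different tools, which is the disconnected-islands problem described above.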

The Inevitable Complexity of the Hybrid AI Era

The rise of generative AI and large-scale modeling has fundamentally transformed hybrid architecture from a strategic choice into a technical necessity. AI workloads demand a delicate balance between massive cloud-scale compute for training and localized, on-premises data gravity for privacy and low-latency inference, leading the modern enterprise to become an intricate web of heterogeneous environments.

This shift toward a truly distributed footprint—spanning from the core data center to the public cloud and out to the edge—inherently magnifies complexity, as workloads behave differently both within and between these various infrastructures. This complexity makes it exponentially harder to get to the critical "why" behind performance lags, cost spikes, or consumption issues. In this hybrid AI era, system complexity without a unified view and telemetry becomes an unmanageable black box, leaving IT leaders unable to predict or prevent critical failures.

The "Bolt-On" Trap: Why Observability Cannot Be an Afterthought

There’s been a recent surge in cloud-based data providers acquiring observability startups: Snowflake acquiring Observe, Palo Alto Networks acquiring Chronosphere, and more. These multi-billion-dollar acquisitions show that when data platforms lack native observability, they eventually hit a "visibility wall." These providers are now attempting to bolt on what should have been a core capability.

For the modern enterprise, a fragmented, cloud-only approach will not provide the visibility they need to achieve true operational excellence:

Cloud-only tools are restricted to a specific segment of the stack, ignoring the vast data estate existing outside the public cloud.

Tools with bolted-on observability struggle to provide the unified context needed to understand the cause of issues across complex hybrid environments. Customers frequently find themselves juggling disjointed interfaces for logs, metrics, and traces, which highlights a significant lack of cohesion between the data layer and the infrastructure supporting it.

Cloudera's Native and Unified Observability Capability

Cloudera Observability is a native, foundational capability that moves beyond simple monitoring to act as a unifying powerhouse. By positioning visibility as a foundational requirement, Cloudera provides total insight across the entire hybrid cloud: on premises, in the public cloud, and at the edge. And by leveraging OpenTelemetry as the observability framework to collect and capture distributed traces and metrics, we’re aligned with a leading open standard for observability.

Cloudera Observability delivers more than just the "why" behind performance; it provides a comprehensive cycle of insight. We’ve "bottled" the diagnostic intelligence gathered from more than 1.3 million nodes under subscription to create sophisticated diagnostic tools. Now, with the integration of Cloudera Cloud Factory (formerly known as Taikun CloudWorks), we’re best placed to extend these capabilities beyond cloud-native infrastructure management.

This evolution places predictive reliability firmly within reach for the modern enterprise, transforming maintenance from a cycle of reactive patching into a proactive strategy. By leveraging advanced warnings on known issues and security vulnerabilities, organizations can finally transcend traditional troubleshooting to achieve a state of continuous, reliable performance across their entire data estate.

Ultimately, navigating the complexity of the hybrid AI era requires a data platform built with observability in its DNA. To learn more about how you can achieve true observability with Cloudera, reach out to our professional services team or check out our product demos.

Bring AI Models to Your Data with Cloudera AI Inference Service

Pamela Pan, Peter Ableda — Mon, 23 Feb 2026 14:00:00 UTC

We’ve entered a new phase of AI adoption: 88% of enterprise AI projects stall before reaching production, not because of poor ideas or weak models, but because infrastructure can’t keep up. Cloud APIs get expensive fast. Governance is an afterthought. Latency adds up. And for regulated industries, moving sensitive data to a public endpoint is just not an option.

Closing the gap between an AI pilot and full-scale production requires bringing intelligence directly to the source. Cloudera AI Inference service gives enterprise teams a secure, performant, and cost-effective production model serving layer—running directly where the data lives.

Instead of sending your data to the cloud as context for models, Cloudera brings the models to you—unblocking intelligence exactly where it’s needed, securing it by design, and scaling it confidently behind your own firewall.

3 Reasons Why Bringing AI to Your Data is Important: Privacy, Cost, and Choice at Scale

Keep Data Private and Protected

Most AI services require you to send data to the cloud, creating risks around compliance, cost, and latency. Cloudera’s approach is to bring models to where your data already lives. Whether it’s in a secure virtual private cloud (VPC) or within an air-gapped (fully offline and isolated) on-premises environment, this model-to-data strategy ensures your information stays private and governed, while still enabling high-performance inference to power AI in production.

Predictable Economics in the Long Run

Running AI in the cloud 24/7 can lead to spiraling, unpredictable expenses. Per-request fees create a budget that fluctuates with usage, making long-term forecasting difficult. By shifting inference to infrastructure the organization already owns and controls, teams can bypass these external usage fees. Once AI moves into steady-state production, costs become more predictable, allowing for a higher return on investment as workloads scale.
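
A quick back-of-the-envelope model shows why the economics shift at scale. The sketch below compares a usage-based fee with a fixed monthly cost for owned infrastructure and computes the break-even request volume. Every number in it (hardware price, depreciation period, operating cost, fee per 1,000 requests) is an invented assumption for illustration, not Cloudera pricing or a benchmark, and the function names are hypothetical.

```python
import math


def monthly_api_cost(requests_per_month: int, fee_per_1k_requests: float) -> float:
    """Usage-based cost: grows linearly with request volume."""
    return requests_per_month / 1000 * fee_per_1k_requests


def monthly_owned_cost(hardware_cost: float, depreciation_months: int,
                       monthly_operations: float) -> float:
    """Fixed cost: straight-line depreciation plus power, space, and staff."""
    return hardware_cost / depreciation_months + monthly_operations


def break_even_requests(hardware_cost: float, depreciation_months: int,
                        monthly_operations: float,
                        fee_per_1k_requests: float) -> int:
    """Monthly request volume at which owned infrastructure costs the same."""
    owned = monthly_owned_cost(hardware_cost, depreciation_months,
                               monthly_operations)
    return math.ceil(owned / fee_per_1k_requests * 1000)


if __name__ == "__main__":
    # Assumed figures: $600,000 of hardware over 60 months, $5,000 per month
    # to operate, and $2.00 per 1,000 requests on a hosted API.
    threshold = break_even_requests(600_000, 60, 5_000, 2.0)
    print(f"Break-even at {threshold:,} requests per month")
    for volume in (1_000_000, 10_000_000, 50_000_000):
        api = monthly_api_cost(volume, 2.0)
        owned = monthly_owned_cost(600_000, 60, 5_000)
        print(f"{volume:>11,} requests: API ${api:,.0f} vs owned ${owned:,.0f}")
```

Below the break-even volume the hosted API is cheaper; above it, the fixed cost stays flat while the usage-based bill keeps growing. The model ignores utilization limits and hardware refreshes, so treat it as a starting point for a real forecast.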

Control and Choice

Most cloud AI providers steer customers into their proprietary ecosystem, making it hard to switch, extend, or fully control your models. With Cloudera AI Inference service, you can deploy a wide range of AI capabilities, from open-source GenAI LLMs like NVIDIA’s Nemotron to traditional predictive models, without giving up control or ownership of your intellectual property. Accelerated by the NVIDIA AI stack—NVIDIA Blackwell GPUs, NVIDIA Dynamo-Triton, and NVIDIA NIM microservices for high-performance, scalable model serving—Cloudera AI Inference service lets you innovate freely while keeping your AI infrastructure flexible, portable, and future-proof.

Success Stories: Early Adoption of Cloudera AI Inference Service On Premises

Cloudera AI Inference service is unlocking new AI use cases in places where the cloud can’t go: offline environments, sovereign infrastructure, and latency-critical operations. Here are three real-world scenarios now enabled by Cloudera AI Inference service and already underway with early adopters.

National Security: Air-Gapped Intelligence That Never Sleeps or Leaks

In national defense, speed and security are non-negotiable. But until recently, intelligence officers spent thousands of hours manually sifting through sensitive, offline documents—slowed by process, overwhelmed by volume, and unable to leverage public AI tools without risking exposure.

Now, with Cloudera AI Inference service running inside air-gapped environments, defense agencies can deploy powerful LLM assistants that scan and summarize massive document collections in seconds. These models operate entirely offline: no internet, no cloud dependencies, no data leakage, helping analysts make faster decisions without compromising security.

Global Finance: Instant Operations, Zero Data Exposure

Cross-border finance lives in dozens of languages. Previously, translating documents like contracts, fraud reports, or compliance updates meant using external tools, raising serious concerns over data exposure and auditability.

Today, one of the top global credit card providers is exploring Cloudera AI Inference service and testing on-premises deployment of multilingual models to translate sensitive communications across more than 200 markets in real time, and fully under internal control. By running inference on their own infrastructure, they’re unlocking faster internal operations and customer response times, while avoiding the compliance risks of third-party APIs.

Public Sector: AI Agents for Every Employee

Government agencies are under pressure to serve more people, faster—yet employees often rely on outdated portals and dense policy manuals. Public GenAI tools aren’t an option due to privacy mandates and unpredictable costs.

Early implementations of Cloudera AI Inference service are supporting on-premises AI chatbots trained on internal agency documentation. These agents help staff and constituents navigate complex topics with speed and confidence, delivering answers instantly, while maintaining full control over the data, prompts, and outputs.

Looking Ahead: The Future of AI is Anywhere Data Lives

By bringing the model to where your data lives, Cloudera AI Inference service is helping organizations scale intelligence on their own terms—with predictable cost and flexibility to choose from a wide range of production models. Whether you’re navigating air-gapped security mandates or optimizing high-volume global operations, the path to production-grade AI is now open.

Cloudera AI is the trusted foundation for building, deploying, and governing all types of AI—from generative and agentic AI to traditional machine learning—across your data estate.

Ready to scale? Don’t let infrastructure limit your AI strategy. Visit the Cloudera AI Inference service webpage for use case demos, learn more about it in this webinar, or book a demo to see how to turn “AI anywhere” into a reality.

#ClouderaLife Employee Spotlight: Meet Josephine Tan, Cloudera’s Senior Director, Human Resources, APAC

Debbie Kruger — Tue, 17 Feb 2026 20:00:00 UTC

At Cloudera, we pride ourselves on fostering an environment focused on employee well-being and professional growth. At the end of 2025, that commitment was recognized as several Cloudera offices earned Best Places to Work honors—including the Singapore office, a close-knit and highly connected team. One of the team members fueling that success is Josephine Tan, Senior Director, Human Resources for the Asia-Pacific (APAC) region.

Approaching her sixth year with the company, Josephine is proud to share the firm foundation of the Singapore office’s culture. “When a culture is strong, that’s where trust within the team grows,” she shared. “We empower people; there’s this level of trust and honesty.”

This time of year, the office looks forward to celebrating both big and small wins, whether it’s the end of a quarter or the start of the Lunar New Year. “You work hard, you play hard, and that’s very much what the Singapore team believes.”

Let’s take a moment to get to know Josephine Tan better, explore her journey with Cloudera, and discover how the Lunar New Year is bringing the Singapore office together at the start of this new season.

Meet Josephine Tan

Josephine joined Cloudera in March 2020, mere days before lockdown took effect. At that time, she not only had to learn the ropes of a new job but also navigate an entirely uncharted professional landscape, as business was conducted online. “Luckily, a growth mindset is part of Cloudera’s DNA.”

She is dedicated to leading the region’s people strategy with a warm focus on nurturing talent, fostering a positive culture, and supporting organizational growth. She always keeps the community at the heart of everything she does.

“What drives employees in Singapore is one objective, one goal. It’s all about the power of ‘we.’” This credo is Josephine’s North Star. “I believe in growing the team’s expertise.”

For Josephine, her role in HR is truly about inspiring progress: “I see this as a place where I can make a difference. HR is not all about maintenance, it’s about making possible change happen.”

Her commitment to driving actionable change shows up in both her professional and extracurricular life. Even outside of work, she prioritizes philanthropic efforts and public service: “In my free time, I will ask myself: how can I give back to the community?” It’s a question that defines her approach to leadership and lifestyle, rooted in impact and meaning.

Singapore’s Collaborative Character

While it may have presented new challenges, the idea of a remote work environment was not an obstacle. Rather, it was a novel opportunity to foster stronger connections with other Clouderans. “We’re so close-knit now because we went from working fully remote to having the luxury of coming back to the office,” she asserts. This shared experience drew the team together, strengthening and reaffirming their dedication to their work.

When asked what makes the Singapore office special, Josephine happily shared, “the people.” She believes this quality comes from the cosmopolitan and welcoming spirit ingrained in Singapore. “There are easily five different cultures present in this one small office,” she notes. The diverse makeup of the office fosters a spirit of collaboration and inclusion among teams, which is a big reason why the office has been proudly recognized as a ‘Great Place to Work’ for two consecutive years.

Celebrating The Lunar New Year

This solidarity shows up in how the workplace comes together and celebrates special occasions, such as the Lunar New Year. “All nationalities are welcome to celebrate, and we embrace that,” Josephine said. One of the Singapore office’s favorite Lunar New Year activities is the prosperity salad toss, or Yu Sheng. People gather to toss mixed ingredients like shredded vegetables, crackers, and raw fish high in the air while shouting auspicious phrases. “It’s like having turkey at Christmas,” she explained, a holiday tradition that symbolizes abundance and vigor.

Clouderans in Singapore also celebrate the holiday with festive activities such as decorating the office, exchanging oranges as gifts, and enjoying a special quarter-end lunch, where the prosperity salad toss is often the star. They also have fun dressing up to match the theme: “We dress in denim with a touch of gold or red. Red is an auspicious color for the Lunar New Year.”

Josephine attributes part of her office’s camaraderie to hosting activities like this—an example of how Cloudera leadership globally supports initiatives as part of the company’s commitment to inclusive, locally led culture-building. Lunar New Year values such as renewal, reunion, and recognition are reflected in this way at Cloudera Singapore: “This is a time to celebrate and express gratitude and appreciation for the time we spent building the business up to this point, to reinforce our shared commitment, and to look forward to a new chapter.”

Closing Thoughts

Josephine’s dedication to service and her community exemplifies Cloudera’s caring, people-centered culture. Her story shows how proactive effort and collaboration lead to meaningful growth and strong bonds. For Josephine, Cloudera is the perfect place to live out these values.

Hear from another Clouderan and explore career opportunities at Cloudera.

You Can Build It Yourself, But Should You? Protecting the Value of Modern Data Platforms

Jim Bisordi — Tue, 10 Feb 2026 17:00:00 UTC

Organizations don’t invest in modern data platforms casually. They invest to support a range of mission-critical needs—from real-time fraud detection and global inventory visibility, to private AI readiness and consistent governance across complex regulatory environments.

With those outcomes in mind, teams come in ready to move fast and build with purpose. But it doesn’t take long to realize that translating intent to impact and value is harder than expected.

In complex environments, early implementation decisions often determine whether a platform becomes a durable foundation or an expensive capability that never quite delivers on its promise.

Why Experience Compresses Time-to-Value

The problem is that implementation is often treated as a checklist—specific steps that ladder up to a specific outcome—when it’s really a decision tree. Each choice made along the way can take teams down very different paths with long-term consequences that aren’t always obvious at the time.

These learning curves are costly: they can quietly lock in architectural and governance decisions that limit flexibility, scale, and trust long after launch, dramatically increasing total cost of ownership and time to value.

Teams with deep platform and solution implementation experience approach these projects with a seasoned perspective. They recognize patterns early, know which trade-offs actually matter (and which don’t), and design for real operating conditions rather than idealized ones, shaping early decisions that protect the platform’s long-term value and accelerate the path to durable outcomes.

What Professional Services & Training Actually Means in Practice

This is where Professional Services & Training (PS&T) comes in, a team that works with you to bridge the gap between purchasing a new platform and seeing it adopted across the organization. This phase is a critical time in the platform’s lifecycle, as these early steps set the organization up for long-term success.

Industry-specific experts on PS&T teams act as an extension of in-house teams during platform adoption and use case implementation, bringing the perspective of having done this hundreds of times before in similarly complex environments. They help shape early decisions, navigate trade-offs, and avoid common pitfalls in data flow, governance, security and integration, so teams don’t discover too late that something foundational needs to be reworked. Just as importantly, they transfer that knowledge back to internal teams, ensuring long-term platform ownership, confidence, and self-sufficiency remain internal.

By engaging PS&T early, organizations can move from evaluation to execution more quickly and confidently, avoiding unexpected challenges along the way. Instead of spending months tuning pipelines, rethinking governance models, or retrofitting for scale, teams start with a foundation designed to support today’s use cases and grow with them over time.

When “Working” Still Isn’t Enough

Once the platform is live, teams often assume the job is complete, but it’s really just the beginning. Despite having the tools they asked for, many still struggle to extract real value from their data. Doing so requires building trust, broadening adoption, and confidently operationalizing insights.

The gap between standing up a platform and genuinely using it is often driven by subtle, slow-moving issues—ones that don’t immediately break the system outright, but quietly erode confidence. Over time, this can lead to fragmented usage, shadow systems, stalled initiatives, and growing skepticism about the platform’s ROI. By the time these issues are recognized, momentum can be hard to recover.

Early decisions set the trajectory for whether a platform becomes foundational or gradually sidelined.

AI-Driven Use Cases in Regulated Environments

This dynamic becomes even more pronounced in messy, real-world environments with regulatory or operational complexity. Here, early decisions can determine whether private AI initiatives, for example, become durable assets, or introduce new risk.

Healthcare

In healthcare, private AI enables a wide range of use cases, from automating administrative workflows to supporting advanced imaging and diagnostics. But realizing those benefits starts well before any model is trained.

It all starts at the foundation—bringing data together across hybrid environments and ensuring it is properly permissioned, tagged, and contextualized. Without that structure, AI outputs can lack the clinical or regulatory context needed to be trusted, undermining decision integrity, defensibility, and compliance. In these environments, early implementation decisions determine whether AI capabilities mature into trusted clinical tools or remain constrained by governance and data access limitations.

Telecommunications

Telecommunications organizations face similar challenges. Data is generated continuously across highly distributed infrastructure, often spanning regions and regulatory jurisdictions.

Private AI can open up real-time threat detection, outage prediction, and network optimization, but only when governance, lineage, and access controls are consistent. When these foundations are uneven, AI-driven insights may look actionable on the surface, but lack the context needed to be truly useful.

While AI initiatives (the examples used here) tend to surface these challenges quickly, the same dynamics apply to analytics modernization, regulatory reporting, operational intelligence, and any use case that depends on trusted, well-governed data. In any case, success depends less on how sophisticated the models are, and more on consistency in early architecture and governance decisions that shape how data is accessed, secured, and interpreted.

Where Implementation Becomes Adoption: How Momentum Is Built

Even with the right technical foundation, realizing the full value of the data platform doesn’t happen all at once. It’s a deliberate process—one that builds confidence incrementally as teams validate results, expand usage, and integrate insights into everyday workflows.

Teams that succeed tend to treat implementation as the beginning of the journey, not the finish line. They start with well-scoped use cases, build trust in the results, and scale deliberately as confidence grows.

This is where Professional Services & Training plays a guiding role—partnering with teams to sequence adoption, reinforce governance as usage expands, drive new AI use cases, and keep momentum moving without introducing rework. The result is a solution that steadily proves its value over time, protects the original investment, and becomes a dependable foundation for analytics, AI, and future data initiatives.

For teams thinking about how to move from standing up a platform to fully realizing its value, Cloudera’s PS&T resources explore what that journey looks like in practice.

The Next Evolution of Enterprise Analytics – The Data Intelligence Platform

Divya Karmagam — Mon, 09 Feb 2026 14:00:00 UTC

See It in Action

Want to see what a data intelligence platform looks like in practice?
See how Iceberg tables managed by Cloudera can be queried by Snowflake and Databricks without copying data or compromising governance.

How to Shift to an Intelligence-First Platform

Adopting an intelligence platform represents a fundamental shift not just in infrastructure, but in how organizations think about and trust their data. The transition period is especially critical because it sets the expectations for reliability, integration, and adoption across teams. Early missteps can create lingering challenges and resistance to longer-term adoption.

Done well, this shift balances stability and progress, keeping mission-critical processes running while delivering early wins that build confidence and momentum.

Cloudera’s Professional Services & Training (PS&T) team helps organizations navigate this shift with care, avoiding common architectural pitfalls and building a durable foundation that supports future analytics and AI use cases.

Learn more about our PS&T capabilities here.

Lakehouses solved a lot of enterprise problems by unifying and simplifying data storage. But the operating landscape at the enterprise level has shifted. Today, organizations are coordinating more tools, managing more data, operationalizing AI, and navigating increasing regulatory scrutiny.

As a result, data can no longer be treated as something that’s queried occasionally or in isolation. It now needs to be operational—meaning ready for real-time use, automated decision-making, and AI-driven workflows across the organization. This shift is pushing architectures beyond lakehouses and toward a more dynamic data intelligence platform.

What Changed? Analytics Became Multi-Platform

Modern enterprises rely on multiple analytics platforms to support a wide range of workloads, including business intelligence and reporting, real-time analytics, observability, machine learning, and AI.

Each team brings its own needs to the same data, and in practice, platform choices are driven by productivity and speed rather than architectural purity. Much of that data also remains on premises or in regulated environments, where moving it to the cloud isn’t practical or permitted.

The original lakehouse model assumed convergence on a small number of analytics platforms. Reality proved otherwise: tools, users, and workloads diverged. The challenge now is supporting that diversity without sacrificing consistency or control.

The Cost of Treating Data as Platform-Owned

Despite lakehouse implementations, enterprise data often remains tightly coupled to the platform that manages it. When another platform needs access, the data is often copied, transformed, or exported to fit that environment.

Over time, simply keeping data consistent and accessible across these various platforms becomes a challenge. Duplicate datasets, fragile pipelines, delayed insights, and inconsistent governance introduce operational risk and drive up costs.

The result is a familiar pattern: rising spend, growing complexity, and declining trust in the data and its outputs.

From Lakehouse to Intelligence Infrastructure

The lakehouse helped bring structure to a fragmented analytics landscape, making it easier for data systems to work together. As enterprises move into the era of full-scale data intelligence platforms, the focus changes.

Instead of data being shaped and owned by individual tools, it becomes the foundation of the architecture—anywhere that data physically resides. All tools sit on top of a shared data layer, rather than pulling data into isolated environments and producing siloed outputs.

This shift allows teams to choose the right compute engine for each workload—whether it’s SQL analytics, large-scale processing, or AI—confident they’re operating on the same governed, trusted data foundation.

What is a Data Intelligence Platform?

A data intelligence platform is a shared infrastructure for data. Think of it like city infrastructure—the roads, power lines, and plumbing beneath a city that every building taps into and relies on.

In the same way, a data intelligence platform provides a centralized foundation that powers many different tools, compute engines, and applications, with governance and context embedded by design rather than bolted on later.

It’s characterized by:

A shared data layer built on open data formats

Rich metadata lineage that captures structure, meaning, and history

Built-in governance that travels with the data

Support for multiple analytics and AI engines

The ability to evolve without re-architecting from scratch

Open Foundations Make Data Intelligence Possible

A platform like this only works if data can be shared safely across all tools and environments, whether on premises, in the cloud, at the edge, or a combination. Open table formats are the common foundation that makes cross-engine interoperability possible (to continue with our city metaphor: the building codes and street standards that make the city navigable by everyone).

Without them, connecting tools often means dealing with mismatched formats, inconsistent latencies, proprietary lock-in, or data that must be governed across geographic boundaries. This can lead to familiar pain points: reduced auditability, inconsistent views of data, and growing challenges around trust.

By contrast, open formats reduce lock-in and support a growing ecosystem of tools (i.e., set it up once and let it grow with your tech stack over time). They make it easier to define governance policies once and enforce them everywhere (including where data can’t easily move), regardless of which engine needs access. This also creates a consistent “memory layer” for AI-driven systems, making them more reliable, auditable, and adaptable through built-in traceability and historical context.

Without open formats and embedded governance, intelligence quickly fragments back into silos, eroding the very advantages data intelligence platforms are designed to deliver.

Cheers! To Professional Growth With Toastmasters

Debbie Kruger — Wed, 28 Jan 2026 14:00:00 UTC

Cloudera’s culture is rooted in empowerment, continuous learning, and creating spaces where people can thrive both personally and professionally. It’s a mindset and approach that is reflected by every Clouderan across the globe, building an environment where anyone can feel empowered to take on new challenges and grow.

A fantastic example of this culture in action comes from our team in Cork, Ireland. Here, thanks to the enthusiastic efforts of Clouderans like Noel Hayes, Senior Manager Global Order Management, the office has started a Toastmasters club: a vibrant and entirely in-person community that in just the last year has already made a meaningful impact on its members’ confidence, communication skills, and leadership growth.

Here’s how Noel’s own journey led him to get the club off the ground, empower others to join, and continue growing its footprint.

Creating an Environment of Growth and Understanding

Noel, a long-time participant in Toastmasters himself, has always understood how much structured public speaking practice can shape a person’s confidence and professional capability. Early in his career he found presenting challenging, but through regular participation and taking on both speaking roles and leadership roles in meetings, he gained confidence and strengthened his ability to lead with presence.

Noel saw an opportunity to bring that same growth experience to his colleagues. When he was starting out, Cloudera’s leadership encouraged him to revisit the idea once people began returning to the office. The first meetings began in late 2024, and by 2025, the club was officially chartered. Today, the club has soared to over 40 members.

With meetings every two weeks, members gather in person to practice speaking, take on a number of different roles that strengthen leadership and listening skills, and encourage one another through structured feedback and shared experiences.

The environment is intentionally welcoming. Members range from people who once dreaded speaking in public to those who had never stepped into a role like this before.

Taking the Plunge into Professional Growth

One great example of this group’s impact comes from the experiences of Barry O’Driscoll, Senior Sales Operations Analyst. Barry’s journey to the Toastmasters club started with a conversation with Noel. That conversation snowballed quickly: Barry joined the group, finished second in an internal competition, and later placed second in an international Toastmasters event.

That’s just one of many who have joined Toastmasters since the group started meeting. And while it may feel overwhelming, it’s possible to start small. “Just join a meeting,” said O’Driscoll. “Once you see the energy in the room, get to know the way it works, it will blow your mind.”

These experiences demonstrate how opportunities like the Cork Toastmasters club align with Cloudera’s broader values. By empowering employees to lead initiatives, providing space for learning, and supporting each person’s development journey, Cloudera continues to build a culture where people feel supported to grow and contribute in meaningful ways.

Fostering a Stronger Community

Toastmasters is a powerful tool for Clouderans in the Cork office to sharpen their skills, get more comfortable with anxiety-inducing elements of their work, or just step outside their comfort zones. But beyond that, it’s a place where employees can forge a broader sense of community. Oftentimes it’s easy to get siloed into groups based on your role and the kind of work you’re involved in.

A club like this is open to everyone, from HR to engineers. In any given meeting, one might get to interact with someone they would virtually never cross paths with. And because the club is fully in person, members have the chance to build that rapport on an even deeper level and support each other’s professional development.

Looking to the Future

As the Toastmasters club continues to develop, members are setting new goals: advancing through Toastmasters’ structured learning program, achieving distinguished status, and connecting with other clubs in the community. There is also interest in exploring how this model might support employees in other Cloudera locations, helping them build confidence and community through shared learning experiences.

The success of the Cork Toastmasters club is a reminder that development happens in many forms, and that when people are encouraged, supported, and trusted to lead, their potential expands far beyond what they once believed possible.

Find out more about opportunities in our Cork, Ireland, office. And learn more about how Cloudera is helping build a workplace where employees can learn, grow, and thrive.

Openness in the Age of AI

Matthew Michaelides — Tue, 27 Jan 2026 14:00:00 UTC

If the AI revolution has given rise to one universal data management truth, it’s the need for openness and interoperability across the data estate. After all, AI is only as good as the data it can actually reach.

No longer are enterprises willing to invest in disconnected legacy technologies. The cost of silos, once measured in infrastructure alone, is now exponentially higher when measured in lost time to value and the inability to run AI at scale. Considering this landscape, enterprises can’t afford not to rethink their data architectures.

At Cloudera, we define openness as a three-layered data management architecture (see Figure 1):

Open compute: The ability to use any engine regardless of where the data is stored
Open catalog: The ability to swap in and out, and interoperate across different data access layers, ensuring schema and governance are consistent regardless of the viewing engine
Open data: The ability to move and access data assets wherever they sit

More broadly, openness is at the heart of who we are at Cloudera:

Early proponent of Apache Iceberg: Cloudera began supporting Iceberg in our public cloud Lakehouse in 2021. Other vendors quickly followed suit—implicitly acknowledging Iceberg as the winner of the open table format war. In 2024, Databricks acquired Tabular, due in part to its open governance and sophisticated features. In 2025, both Snowflake and Amazon Web Services (AWS) invested in expanding Iceberg support and features.

Open-source foundation and ecosystem: Deeply embedded in the open-source community since its founding in 2008, Cloudera was the first company to commercialize open-source data lake technology and continues to contribute to and support more than 50 open-source projects. Our open-source foundation gives freedom of choice by allowing our customers to opt in or out of Cloudera distributions far more easily compared to vendors whose proprietary overlays lock them in. Cloudera customers don’t have to stay; they choose to stay.

Interoperability across the data management stack: Providing open compute, catalog, and data ensures interoperability at each level of the data management stack so our customers can truly win in the age of AI without having to build from scratch. Additionally, Cloudera provides the flexibility to use any compute engine or land data in any cloud service provider (CSP), and provides full access to features regardless of where the data resides or what compute engine is used. Conversely, some vendors restrict access to features based on whether all layers of the stack are running in the same platform. Own your data. Control your data. Use your data—that is the promise of Cloudera.

For a deeper dive on the importance of openness in the age of AI, read our blog: The Future Delivered Today: The AI-Powered Data Lakehouse.

Figure 1: How Cloudera Powers Unparalleled Openness and Interoperability

2025 Was the Year the Cloud Reminded Us Who's Really in Control

Suzy Tonini — Mon, 26 Jan 2026 14:00:00 UTC

Why the outages keep happening, and what you can actually do about it

2025 was rough if you were betting your business on a single cloud vendor. In December, Snowflake customers watched helplessly as a schema update cascaded across multiple regions, blocking queries for 13 hours. Databricks users dealt with days of degraded AI services.

In October, the Amazon Web Services (AWS) US-East-1 region went dark for 15 hours: a DNS error affecting DynamoDB took down over 1,000 companies. In June, a null pointer exception in Google Cloud's Service Control binary disabled multiple systems including Cloud Storage, Compute Engine, and BigQuery for several hours, with ripple effects hitting Spotify, Discord, and OpenAI.

Across all of these incidents, the pattern was the same: customers refreshed status pages and waited for someone else to fix the problem. The difference between vendors is not whether outages happen; it’s what options you have when they do.

The Pattern: Single Points of Failure with Global Reach

Snowflake’s December incident was triggered by a backwards-incompatible database schema update. Version mismatch errors caused operations to fail or hang indefinitely across multiple regions on AWS, Microsoft Azure, and Google Cloud Platform (GCP). Snowflake's communications stated there were no workarounds except for customers who had pre-configured replication to non-impacted regions. Everyone else waited.

Databricks’ December outage (spanning multiple days) included Unity Catalog issues, compute degradation across multiple regions, and a Mosaic AI disruption that stretched for days. Status updates repeatedly noted they were "working with the cloud provider on potential mitigation paths." That phrase tells you everything about the dependency chain: when Azure has a bad day, Databricks customers on Azure regions have a bad day too.

The Google Cloud June incident revealed the same vulnerability. A faulty policy with blank fields was inserted into global configuration tables and replicated worldwide within seconds. The corrupted data triggered crash loops that took down core services for 7.5 hours. Google's own status dashboards were initially unavailable—SRE teams could not even confirm the scope of the disaster.

Regional redundancy does not help when the failure is logical rather than physical. When a platform relies on globally coordinated metadata or shared configuration, a single bad update propagates everywhere. The failure follows you from region to region.

Additionally, in these scenarios, the infrastructure is distributed, but control remains centralized. When Snowflake's control plane breaks, it doesn’t matter that they run on AWS, Azure, and Google Cloud underneath. When Databricks is waiting on Azure to fix something, multi-cloud marketing does not help. The single point of failure is the proprietary layer on top.

What Analysts Are Saying

The Gartner® 2025 analysis of cloud adoption trends estimates that more than 50% of organizations will not get the expected results from their multi-cloud implementations by 2029. The core problem: lack of interoperability between environments.

In Forrester Predictions 2026: Cloud Outages, Private AI On Private Clouds, And The Rise Of The Neoclouds, the research firm predicts at least two major multiday cloud outages in 2026. The cloud industry is undergoing a massive infrastructure transition as hyperscalers race to build AI-native data centers. That investment is coming at a cost: legacy x86 and ARM environments are being deprioritized, leading to aging infrastructure faltering amid growing complexity.

In the same Forrester predictions piece, they estimate that at least 15% of enterprises will shift toward private AI deployments built on private clouds in 2026. The drivers: rising AI costs, concerns about data lock-in, and the operational risk of depending on infrastructure that is increasingly optimized for someone else's priorities. The 2025 outages were a preview of what happens when your workloads are not the provider's top concern.

Architect for Resilience with Cloudera

Most enterprises have “accidental multi-cloud” architectures by way of acquisitions, shadow IT, or best-of-breed tool selection—not through deliberate architectural planning. Their workloads are scattered across providers but they lack the ability to move data and workloads when things go wrong.

Architecting for resilience involves ensuring your data and AI platform enables portability and eliminates single points of failure.

The Cloudera platform is designed for portability, giving you the ability to fail over between environments to maintain operations: workloads and data can move across AWS, Azure, Google Cloud, and on-premises environments without rewrites, friction, or vendor lock-in. Updates are not forced as global, non-backward-compatible changes.

When the inevitable outage happens, you have options: fail over to another cloud or move workloads back to your data center. You’re not stuck watching a status page—you remain in control of your data and can maintain consistent operations and compliance no matter where data resides.

For a deeper dive on how to build a resilient architecture with Cloudera, read our blog: Architecting for Data Resilience: Ensuring Business Continuity with Cloudera

Looking Ahead

The AI buildout is straining infrastructure, and analyst firms point to more turbulence moving forward: Forrester predicts multiday outages, Gartner predicts defensive multi-cloud adoption. Enterprises that come through 2026 in good shape will be those who treat resilience as an architectural principle rather than a compliance checkbox.

Cloudera does not have push-button cross-cloud failover out of the box—nobody does. But we’re architecturally positioned to support that resilience in ways proprietary platforms are not.

If the 2025 outages made you uncomfortable, we would like to have that conversation. Because the cloud is just someone else's computer. And when that computer has a bad day, you should have somewhere else to go.

To learn more about how you can architect for resilience with Cloudera, reach out to our professional services team, or check out our product demos.

Hybrid by Design: The New AI Mandate

Blake Tow — Fri, 23 Jan 2026 14:00:00 UTC

For the better part of a decade, the enterprise technology mandate was simple: “cloud first,” or more pointedly “cloud only.” Modernizing meant moving to the public cloud, and on-premises architecture was viewed as legacy infrastructure to be maintained until it could eventually be migrated.

Fast forward to today, and that narrative has shifted dramatically, with AI as the major catalyst. A recent ZDNet article, citing research from Deloitte and 451 Research, declared that the cloud-first era is over as we enter a more pragmatic, hybrid-by-design era. This approach elevates on-premises infrastructure from legacy debt to the central pillar of a strategic, optimized architecture.

At Cloudera, we’re living for this moment. While the industry swung wildly toward cloud only, we realized that what organizations really needed was "the cloud experience, anywhere.” Now, the market is catching up, and enterprises are waking up to the fact that workloads must move fluidly between public clouds, private data centers, and the edge. Here’s why the shift is happening, and why Cloudera is uniquely positioned to lead it.

The Inference Economics Wake-Up Call

The primary driver of this shift is what analysts call the "AI infrastructure reckoning." In the early days of generative AI (GenAI), everyone rushed to the cloud for massive compute power to contextualize models. But as organizations move from experimentation to production, the math changes.

The critical tipping point? Inference costs. While contextualizing a model is a massive, episodic burst of compute that’s perfect for the public cloud, running that model (inference) requires 24/7 compute. When you scale AI to enterprise levels, the recurring costs of cloud inference and data egress become prohibitively expensive.
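To make that arithmetic concrete, here is a minimal sketch comparing an episodic burst of compute with always-on inference. Every number in it (the GPU counts, the hours, and the price per GPU-hour) is a hypothetical placeholder rather than a measured or quoted price; only the shape of the calculation matters.

```python
HOURS_PER_MONTH = 730  # approximate hours in one month


def burst_cost(gpus: int, hours: float, rate: float) -> float:
    """Cost of a one-off, episodic burst of compute."""
    return gpus * hours * rate


def always_on_cost(gpus: int, rate: float, months: int) -> float:
    """Cost of keeping GPUs running around the clock for inference."""
    return gpus * HOURS_PER_MONTH * rate * months


# Hypothetical inputs, chosen only to show the shape of the math.
RATE = 4.0  # assumed price per GPU-hour
burst = burst_cost(gpus=64, hours=72, rate=RATE)
serving = always_on_cost(gpus=8, rate=RATE, months=12)

print(f"one-off burst:       {burst:,.0f}")
print(f"12 months inference: {serving:,.0f}")
print(f"ratio:               {serving / burst:.1f}x")
```

With these placeholder inputs, a year of serving on one-eighth as many GPUs costs about 15 times the one-off burst, before any data egress charges are counted. Swapping in your own rates and utilization is the point of the exercise.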

In 2026, the smart play is workload-first rather than cloud-first:

Public cloud: Ideal for bursty training workloads and elastic experimentation.
On premises: The cost-effective powerhouse for consistent, high-volume production inference.
Edge: Critical for low-latency decision-making where the speed of light is the bottleneck.

Cloudera allows you to execute a workload-first approach seamlessly. With Cloudera AI, you can spin up a workspace in one infrastructure to contextualize a model on massive datasets, and then deploy that same model to another infrastructure for inference, without refactoring. We bring the compute to the data, rather than paying the "gravity tax" of moving petabytes of data to the compute. This empowers you to choose the deployment pattern that fits your reality, whether that means training on premises to secure sensitive IP and deploying to the cloud, or vice versa.

Resilience via Hybrid Failover

Another reason why enterprises are rethinking their cloud-only strategy is “concentration risk,” or more plainly: if all your workloads are tied to a single cloud provider, your business goes dark when the inevitable outage happens. Relying on a single public cloud provider for all data and AI operations creates a single point of failure. This is no longer just a matter of good business sense. Regulators are stepping in with frameworks like DORA (Digital Operational Resilience Act) to prevent concentration risk from causing systemic catastrophes.

For many, cloud-only resilience is simply too little. True resilience now requires the agility to move workloads instantly, whether to survive outages or navigate geopolitical mandates.

In a hybrid world, resilience comes from diversity. A proper hybrid architecture allows you to fail over not just from one region to another, but from public cloud to private cloud, or even from one hyperscaler to another.

Cloudera supports a resilient architecture. Our platform can be configured to replicate data, metadata, and security policies across environments. This setup establishes a powerful "failover anywhere" capability. With these configurations in place, mission-critical applications can fail over in any direction, whether moving from a downed public cloud region to a private data center, or shifting from on premises to the cloud to handle sudden spikes.
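As a rough illustration of what "failover in any direction" implies, the sketch below picks the first healthy environment that already holds a replica. It is not a Cloudera API: the environment names, the health map, and the `pick_failover_target` function are all hypothetical, and a real setup would also have to verify replication lag and policy sync before switching.

```python
from typing import Optional


def pick_failover_target(
    current: str,
    healthy: dict[str, bool],
    replicas: list[str],
) -> Optional[str]:
    """Return the first healthy replica environment other than the current one.

    `replicas` is ordered by preference; environments missing from
    `healthy` are treated as unavailable.
    """
    for env in replicas:
        if env != current and healthy.get(env, False):
            return env
    return None  # nowhere to go: no healthy replica exists


# Hypothetical estate: one public cloud region is down.
health = {"aws_us_east": False, "private_dc": True, "azure_west": True}
preference = ["aws_us_east", "private_dc", "azure_west"]

target = pick_failover_target("aws_us_east", health, preference)
print(target)
```

The same function works in the other direction: passing `"private_dc"` as the current environment would select a healthy cloud region instead, which is the symmetry the paragraph above describes.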

Security and Governance: The Sovereignty Factor

Another friction point in the cloud-first approach is governance. Fragmented policies across hyperscalers and on-premises systems create security blind spots. As data sovereignty and regulatory pressures intensify, enterprises are facing a complex web of compliance requirements. Whether navigating regional mandates like GDPR and the EU Data Act, industry standards like HIPAA and PCI DSS, or self-imposed controls for IP protection, organizations are realizing they cannot simply expose sensitive data to public environments. Instead, many are moving workloads back on-premises to regain control.

The challenge: How do you govern a hybrid estate without massively multiplying your workload?

Cloudera’s unified data fabric solves this challenge by first unlocking data access and automating understanding from a business perspective, regardless of location. This foundation allows you to decouple security and governance from the underlying infrastructure. You simply define a policy once, such as masking PII for specific users, and that policy follows the data, whether it resides in an S3 bucket, an on-premises cluster, or an edge stream.
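The "define once, enforce everywhere" idea can be sketched in a few lines. This is a toy model, not Cloudera's implementation or API: `MaskingPolicy`, `apply_policy`, the role names, and the sample records are all hypothetical. What it shows is that the policy is a single object, and every read path consults it regardless of where the rows came from.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MaskingPolicy:
    """A single policy definition: mask `column` unless the role is allowed."""
    column: str
    allowed_roles: frozenset


def apply_policy(row: dict, role: str, policy: MaskingPolicy) -> dict:
    """Return a copy of the row with the protected column masked if needed."""
    if role in policy.allowed_roles or policy.column not in row:
        return dict(row)
    masked = dict(row)
    masked[policy.column] = "***"
    return masked


# Defined once (hypothetical column and role names).
PII_POLICY = MaskingPolicy(column="ssn", allowed_roles=frozenset({"compliance"}))

# Hypothetical data living in three different places.
SOURCES = {
    "s3_bucket": [{"id": 1, "ssn": "123-45-6789"}],
    "on_prem_cluster": [{"id": 2, "ssn": "987-65-4321"}],
    "edge_stream": [{"id": 3, "ssn": "555-12-3456"}],
}


def read(location: str, role: str) -> list:
    """Every location is read through the same policy check."""
    return [apply_policy(row, role, PII_POLICY) for row in SOURCES[location]]


for location in SOURCES:
    print(location, read(location, "analyst"), read(location, "compliance"))
```

The design choice worth noting is that the policy lives outside any one storage system. Adding a fourth location means adding data, not writing a fourth copy of the rule.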

We’ve further strengthened this fabric with the addition of Cloudera Data Lineage (formerly Octopai), which delivers automated, end-to-end visibility into your data's journey. These advanced capabilities allow teams to trace data flows across complex hybrid environments to ensure compliance and trust, earning Cloudera recognition as a Leader in The Forrester Wave™: Data Fabric Platforms, Q4 2025. While others may stitch together separate tools, Cloudera delivers a unified platform that secures and manages the entire experience.

Not All Hybrid Architectures Are Created Equal

The 2025 outages may have served as the nail in the coffin of the cloud-only era. But as 451 Research notes, there’s a critical difference between hybrid-by-accident architectures that leave organizations struggling with silos and complexity, and an architecture that’s hybrid by design. A designed approach includes a consistent, portable platform that abstracts complexity across data centers, clouds, and the edge, anchored by a unified data fabric with replication.

To succeed in 2026 and beyond, organizations cannot afford accidental architectures. Cloudera’s hybrid-by-design architecture enables enterprises to stop compromising on where their data lives. Instead, they can start capitalizing on what their data can do, turning the inherent diversity of the hybrid estate into a strategic asset rather than a burden.

We deliver a consistent cloud experience by bringing the best parts of the cloud to wherever the data lives. This includes cost efficiency, scalability, elasticity, increased agility, reduced IT effort, faster access to innovation, and high availability. We’re the only data and AI platform company that brings AI to data anywhere: in clouds, data centers, and at the edge.

To learn more about how you can build a hybrid-by-design architecture with Cloudera, reach out to our professional services team or check out our product demos.

Luz Erez on Bringing Humanity Back to Healthcare with AI

Cloudera — Thu, 22 Jan 2026 17:01:00 UTC

Few types of data carry as much potential—or as much responsibility—as medical data. When used properly, healthcare data can improve outcomes, accelerate research, and quite literally save lives. But accessing and analyzing that data remains one of the hardest challenges in enterprise AI.

In this episode of The AI Forecast, host Paul Muller sits down with Luz Erez, founder of MDClone, to explore how synthetic data is changing the way healthcare organizations conduct research, deploy AI, and safeguard sensitive information. Their conversation spans everything from clinical workflows and physician burnout to the role synthetic data plays in validating AI agents safely at scale.

Here are the key takeaways from the conversation.

Bringing Humanity Back to Healthcare Through Automation

Paul: We talk about AI and what it’s going to do for business, and specifically what it’s going to do in the medical industry—but what would you like to see AI automate for you personally?

Luz: One of the things that happened to us during the last 40 years, we lost contact with people. A physician today spends 60% of his time behind the desk doing registering things, regulation, dosing, and so on.

He should be with you as a person. The rest of the work—important work, but work that AI can do—will be done by machines. And it will totally alter the way that we interact. Free time will be more, and many of the interactions that we don’t like about work will be done by machines. I really believe it’s a much, much better future. I’m excited.

Why Medical Research Needs a New Data Model

Paul: You talk about the complexity of retrospective research, but what does retrospective research mean in this context?

Luz: Retrospective research means I’m doing research by looking at data of patients that already exist. And most of the time, people understand there is a difference between correlation and causality.

A researcher might look at medical data and say: I want all the medications of patients that had a relapse in kidney disease while on a beta blocker. Tools like SQL can’t answer this because first you have to define what “on a beta blocker” means, and what a “relapse” means.

As a physicist, I ask: what are the basic rules? The basic rules are rules of time and people, which means this is longitudinal. So, the main question is, how frequent is something taking place? Once I put the mathematics inside it, I could build logic and a system on top of it. But then I saw another problem. I can find the answers, but I cannot give them to anyone. A physician can ask about their patient, but population-level research requires consent, privacy, and governance. So how do you solve this?

We built something called synthetic data. The engine looks at real data, but it doesn’t give you that data. It gives you a list of avatars that look like the original data. Any statistics will be the same, but there’s no one-to-one correlation with real people. There is no PHI issue.

Synthetic data allows you to share data, train models, and collaborate—without violating privacy. And today, synthetic data plays a major role in AI.

Synthetic Data as the Foundation for Safe Medical AI

Paul: To get synthetic data of sufficient fidelity, surely it still has to come from actual data?

Luz: Sure. It looks at actual data and creates synthetic data. There is a balance between privacy and utility. You set the level of privacy, and we give you the best utility possible.

The key difference is governance. When a machine does this automatically and users only see synthetic data, all the ethical and privacy issues go away. Not everything can be synthetic—rare cases are hard—but for common medical data, synthetic data works extremely well.

Paul: How do you see this changing the future of medicine?

Luz: AI agents are already doing things like offering dosing recommendations with incredible accuracy, but not enough yet. For dosing, you need absolute certainty. To validate these agents, you need hundreds of thousands of cases—and you don’t have them yet.

With synthetic data, we can bootstrap. We generate more and more cases until we can prove the agent works 100% of the time. We’ve shown that primary caregivers can save 40–60% of session time using agents like this, but only if they’re validated correctly.

Without synthetic data, you can’t safely test these systems at scale. I truly believe synthetic data is one of the bedrocks of medical AI.

Listen to the full conversation with Luz Erez on The AI Forecast on Spotify, Apple Podcasts, and YouTube.

You can also learn more about Cloudera’s partnership with MDClone at cloudera.com.

Winter, Unplugged: Making the most of the Holiday Season

Ashton Stockstill — Tue, 20 Jan 2026 14:00:00 UTC

Is there anything about Unplug you want to share?

“This isn't just a holiday break; it’s a necessary pause to fully embrace this moment before a long farewell.” - Jennifer Parker, Sr. Manager, Contracts

“It's great, and I really enjoyed these days. Especially as an employee, these breaks are important to refresh our minds and be 100 percent when we return.” - Gaurav Sharma, Software QA Engineer

“Unplug gives me the freedom to step away, recharge, and return with better focus and energy. It’s a strong signal of trust and a people-first culture.” - Vishnuprakash Palanisamy, Staff Software Engineer

Learn more about how Cloudera is helping create an inclusive and supportive workplace for everyone.

“These unplugged trips have always turned out to be more productive than I expected. This time, they gave me a fresh perspective on life and its surprises — along with an answer to the philosophical question most present in my mind, just as similar trips have done before.” - Shivam Kumar, Software Engineer II

“When not spending time with my four children and family, I was able to work on building a robotic dog.” – Dr. Christopher Royles, Field CTO EMEA

“Since Brazil is in the summer season, I spent time doing some cycling training in beautiful places.” - Everton Fernandes, Sr. Manager, Solutions Engineering

“One of the most rewarding moments for us on the talent acquisition team is when candidates tell us they’ve already heard about our Unplug program through friends or social media. When people want to work here not just for the role, but because we genuinely respect personal time, you know the culture is real.” - Rachit Chandra, Director, Talent Acquisition, APAC

“I stayed in a wooden house in the mountains of Chiang Mai, Thailand, spending my days meditating, reading, and disconnecting from the rhythm of city life.” – Ziyang Yang, Senior Talent Acquisition Advisor

“I was able to spend time with family and friends - Pantomime, Christmas markets, dining out, touring London, watching the Christmas lights and firework show on New Year’s Eve.” - Deepa Pednekar, Senior Practice Manager, EMEA

“I really appreciate the fact that I can disconnect completely without worrying about tons of emails waiting for me on my return to the office. Getting quality time off makes me very productive and motivated to give my best after the break!” - Stamatis Zampetakis, Senior Staff Engineer

What does it mean to you to work at a company that offers Unplug?

“As a busy mom, wife, and professional, having the flexibility to take time off when I need it—whether that’s to align with my kids’ school schedules, hockey tournaments, or gymnastics meets—means a lot. Cloudera Unplug gives me the freedom to step away without guilt, knowing I can take care of what matters most at home and come back recharged and ready to give my very best at work.” – Molly Boyer, Sr. Director, Communications and Analyst Relations

“Unplug days offer a chance for employees to recharge mentally and build a stronger connection with family, which in turn prevents burnout and leads to stronger loyalty, creativity and productivity.” - Dimas Ramaditya, Partner Sales Manager

“Having the opportunity to pursue one’s dreams with a healthy body creates a healthy mind.” - Niel Dunnage, Principal Strategic Customer Success Manager

“Scuba Diving! I went to Mexico and did a lot of cave diving in the famous cenotes. And then taught my oldest to dive as well!” - Sergio Gago, CTO

“It is nothing short of amazing to have this time off. I have always appreciated and thoroughly cherished any company that supports their employee’s well-being with much needed time off to unplug and recharge the batteries for a new year!” - Morry Bowling, Partner Sales Director

Every year, Cloudera offers “Unplug” days to help employees truly disconnect from work, recharge, and focus on what matters to them outside of the office. For some, that means pursuing a long-awaited passion project. For others, that means seeing more of the world or spending quality time with family and friends. No matter how Clouderans choose to spend them, these dedicated, enterprise-wide breaks reinforce Cloudera’s commitment to the humanity behind the company—ensuring our team’s well-being and promoting a positive work-life balance.

We just returned from our Winter Unplug, which gave Clouderans the opportunity to close out an incredible year with a moment to reset. With the new year underway and more employees returning from their breaks, we checked in with Clouderans from across the globe to see how they spent their 2025 Winter Unplug.

“My family plans our year around unplug days, and we all look forward to them. We schedule vacations to coincide with unplug periods, allowing for extended breaks without missing extra workdays. This approach has significant business benefits too. My coworkers do the same, so out-of-office times cluster around unplug days, resulting in more days when everyone is online together.” - Jason Fehr, Senior Staff Engineer

2026 Predictions: The Architecture, Governance, and AI Trends Every Enterprise Must Prepare For

Cloudera — Thu, 08 Jan 2026 14:00:00 UTC

2026 marks the transition from experimentation to intelligence orchestration—a moment where AI, data, infrastructure, and governance converge into a single operating model. If 2024 and 2025 were defined by proofs of concept and one-off model deployments, 2026 will be the breakout year when enterprises begin operationalizing AI at scale, safely and with measurable ROI.

According to our Cloudera leadership team, this is the year when data evolves from passive storage to active organizational memory, enabling data everywhere for AI anywhere by unifying cloud and on-prem control planes. It’s also the year when AI agents move from demonstrations to becoming part of the digital workforce, but only if enterprises put governance, security, and responsible AI practices on equal footing with compute priorities.

Here’s what our leaders predict for the year ahead.

Abhas Ricky, Chief Strategy Officer: The Data Foundation Becomes the Intelligence Layer

In 2026, the leaders in the race to capitalize on AI will be the organizations that recognize that data’s value comes from how well it can be understood and acted on (not merely from how much of it exists). Data must function as a living, semantic, and governed memory system that AI can learn from and reason with.

In other words, you can’t scale AI until you re-architect the data beneath it.

Every dataset—whether structured, unstructured, real-time, or generated by a model—must carry its own semantics, lineage, and guardrails. This embedded context allows the modern data lakehouse to evolve from passive storage into an active intelligence layer that can contextualize information, enforce policy, audit decisions, and preserve traceability.

With this foundation in place, enterprises can begin building truly autonomous workflows that recall, adapt, and self-correct—the capabilities that will define AI ROI in the years ahead.

Manasi Vartak, Chief AI Architect: Agentic AI Moves to Production and Governance Becomes Non-negotiable

Despite headlines predicting a slowdown, enterprise demand for generative and agentic AI will continue to rise in 2026, but with a decisive shift toward measurable ROI (i.e., fewer rogue experiments, and more predictable and intentional use-case-based applications). Much of that value will come from enterprise-adapted models, gradually reducing reliance on public models as organizations prioritize solutions tailored to their own data and workflows.

The last few years were about testing AI’s limits.

2026 is about scaling what works.

To deploy agentic systems in production, organizations will need:

Strong governance frameworks

Clear data access controls

Security rules and permission frameworks defining what data agents can access and what actions they are allowed to take

Observability into agent actions and decision-making

Agent registries and workflow versioning to track how agents evolve over time
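
As a rough illustration of how the permission, observability, and registry items above might fit together, here is a small Python sketch. Everything in it is hypothetical (the `AgentPermissions` and `Gatekeeper` classes, the agent name, the dataset name); it shows the general pattern of checking each agent action against a registry and logging the decision, not any specific product feature.

```python
from dataclasses import dataclass, field


@dataclass
class AgentPermissions:
    """Hypothetical record: which datasets an agent may touch and which actions it may take."""
    readable_datasets: set
    allowed_actions: set


@dataclass
class Gatekeeper:
    registry: dict  # agent name -> AgentPermissions (a toy agent registry)
    audit_log: list = field(default_factory=list)

    def authorize(self, agent: str, action: str, dataset: str) -> bool:
        perms = self.registry.get(agent)
        allowed = (
            perms is not None
            and action in perms.allowed_actions
            and dataset in perms.readable_datasets
        )
        # Record every decision, allowed or not, so agent behavior stays observable.
        self.audit_log.append((agent, action, dataset, allowed))
        return allowed


gate = Gatekeeper({"fraud-agent": AgentPermissions({"transactions"}, {"read", "flag"})})
ok = gate.authorize("fraud-agent", "flag", "transactions")
denied = gate.authorize("fraud-agent", "delete", "transactions")
unknown = gate.authorize("unregistered-agent", "read", "transactions")
```

Unregistered agents and unlisted actions are refused by default, and the audit log keeps the refused attempts alongside the approved ones.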

This necessarily broadens the definition of responsible AI. Fairness and bias mitigation remain important, but enterprises now require end-to-end accountability across data pipelines, system behaviors, and the choices AI agents make if they want to scale agentic AI safely and profitably.

Sergio Gago, CTO: The Era of Convergence and the Rise of One Control Plane

After years of tension between on-prem control and cloud elasticity, 2026 is the year of true convergence. Hybrid infrastructure is no longer a compromise between legacy and cloud systems. It has instead become the architectural backbone that enables intelligence at scale.

Across Cloudera’s leadership team, one theme stood out: AI agents will become part of the operational workflow. But until now, their effectiveness has been limited by fragmented data access. Some models could reach only cloud-based data, while others pieced together partial views across environments. Most thought a unified control plane simply wasn’t possible.

That changes in 2026.

Cloudera’s hybrid architecture allows workloads (including AI agents) to run wherever they make the most sense, guided by policy, governance, and efficiency rather than storage location, unlocking the next generation of intelligent, coordinated enterprise systems.

Implications by Vertical

These predictions aren’t just theoretical. They stand to impact and influence sector operations. Retail and financial services, in particular, are positioned for profound transformation as data foundations strengthen, agentic AI moves to production, and control planes converge.

Neelabh Pant, Director of Global AI: Retail: From Siloed Systems to Real-Time, Connected Intelligence

Retailers are already seeing outsized returns from AI, with early adopters realizing ROI up to six times faster. In 2026, success will hinge on:

Connecting data across stores, supply chains, customer interactions, and online ecosystems

Enabling AI agents to act on real-time information from inventory updates and returns to customer preferences

Empowering nontechnical teams to create new data connections and workflows without waiting on IT to put it together on their behalf

A unified control plane means AI agents can navigate data and make inferences regardless of where it lives, unlocking personalization, operational efficiency, and faster decision-making. Retailers that modernize their data architectures will continue to set the pace of innovation.

Adrien Chenailler, Sr. Director, AI Industry Solutions: Financial Services: AI Becomes an Operational Layer, Not a Project

Financial institutions have spent years modernizing their data foundations. In 2026, that work pays off. Banks, insurers, and investment firms will increasingly run day-to-day operations on AI, with agents already supporting things like:

Credit risk scoring

Fraud detection and prevention

Compliance investigations

Credit memo preparation

Customer service workflows

With 91% of financial services leaders already calling hybrid AI highly valuable, there’s a reduced need for experimentation—we've already done that. Now, enterprises will compete on execution. Unified control planes provide the secure, governed environment AI needs to analyze sensitive data across systems without compromising compliance or sovereignty.

Cloudera’s platform is built for exactly this moment, enabling access to data anywhere for AI everywhere with governed, enterprise-wide intelligence, whether your data lives in the cloud, in data centers, or at the edge.

To learn how your organization can prepare for 2026 and beyond, explore Cloudera’s latest resources and insights.

Unleash Peak Performance: Get 13x Faster Queries with Cloudera Lakehouse Optimizer

Adam Benlemlih, Navita Sood — Wed, 31 Dec 2025 14:00:00 UTC

Cloudera's commitment to an open data lakehouse empowers customers with the flexibility to use any engine or tool of choice—whether from Cloudera, other vendors, or open source. We understand the complexity of modern data ecosystems, and our engine-neutral approach ensures seamless collaboration across teams accessing data to build analytical or AI applications and agents. We continuously enhance our lakehouse with innovative features for speed, security, automation, and interoperability, ensuring all engines run concurrently and efficiently and have access to all features and optimizations.

The Cloudera Lakehouse Optimizer provides predictive and intelligent optimizations, automating Apache Iceberg table maintenance and ensuring your open data lakehouse remains performant, scalable, and cost-effective. This service empowers data teams with a cost-efficient lakehouse for all their AI and analytical workloads.

The Proof is in the Performance: 13x Faster Queries and 36% Storage Cost Reduction!

We know that performance and cost efficiency are paramount, which is why we're sharing compelling results from our internal benchmarks. We tested Cloudera Lakehouse Optimizer using 7 TPC-DS tables (107 GB of data), executing TPC-DS queries before and after optimization. Even after accounting for caching and removing outliers, the results are significant:

13x faster queries: Our data shows an average 13x query time improvement, reducing average query time from 24 seconds to a mere 1.8 seconds after optimization!
36% storage cost reduction: Cloudera Lakehouse Optimizer also drives substantial cost savings by optimizing your storage footprint. Our benchmarks revealed a 36% reduction in dataset size–from 107 GB to 68 GB. This directly translates to a lower total cost of ownership (TCO).

These results demonstrate how Cloudera Lakehouse Optimizer improves query performance for downstream AI, reporting, and analytics, and also significantly reduces your storage costs.
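
As a quick check on the headline figures, the arithmetic can be reproduced from the four benchmark numbers quoted above. This snippet only restates those published values; it does not rerun any benchmark.

```python
# Figures quoted in the benchmark summary above.
before_s, after_s = 24.0, 1.8        # average query time in seconds
before_gb, after_gb = 107.0, 68.0    # dataset size in GB

speedup = before_s / after_s
storage_reduction = (before_gb - after_gb) / before_gb

print(f"speedup: {speedup:.1f}x")                      # speedup: 13.3x
print(f"storage reduction: {storage_reduction:.0%}")   # storage reduction: 36%
```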

What Makes Cloudera Lakehouse Optimizer Stand Out?

Whether you're a platform lead focused on cost controls, a data architect designing scalable solutions, or a data engineer streamlining processes, Cloudera Lakehouse Optimizer is built for you. It comes with policy templates and defaults, enabling immediate optimization without extensive configuration. For specific requirements, the graphical user interface (GUI) and application programming interface (API) offer best-in-class controls.

Let's explore how Cloudera Lakehouse Optimizer uniquely tackles table optimization to deliver these performance and storage benefits:

Intelligent policies: Cloudera Lakehouse Optimizer assesses whether a table requires optimization, ensuring only necessary actions are executed, and autonomously runs the optimizations as and when necessary. It offers rich and configurable action arguments against all Iceberg optimizations, covering a large set of arguments to enable maximum performance.

Engine and storage agnostic: Once the tables are optimized by the Lakehouse Optimizer, any engine accessing the data from the lakehouse will see exactly the same improvements in the performance of the queries, whether those engines are Cloudera owned, open source, or from another vendor. These optimizations also apply to data stored in any cloud object storage or on-premises object stores.

Unmatched scope and control: Cloudera Lakehouse Optimizer allows granular control over policy application. You can create and apply policies at the table, namespace, or even entire catalog level, offering flexible and scalable management as your lakehouse evolves. Optimizations can be defined against nearly all arguments, enabling the best policy definition for your tables. This broad scope is a significant differentiator compared to other solutions with more limited policy application. The optimizer also includes a dedicated GUI, enabling all users to comfortably configure and monitor optimizations. For programmatic control, comprehensive API and command line interface (CLI) access is also available. The optimizer also provides unparalleled flexibility and control over when and how optimizations run:
- Event-based intelligent scheduling: Automatically triggers optimizations when a table event occurs, such as an update, insert, or delete.
- Time-based scheduling: Allows you to schedule optimizations on a set, recurring basis using a cron-like schedule—a feature not available from AWS S3 Table Maintenance or Databricks Predictive Optimizer.
- Manual executions: Trigger manual executions of policies, enabling on-demand optimization.
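
The interplay between the three trigger types and the "only optimize when needed" idea can be sketched in a few lines of Python. This is an illustrative model, not the Lakehouse Optimizer's actual logic or API: the `Trigger`, `TableStats`, and `CompactionPolicy` names, the small-file heuristic, the 0.3 threshold, and the choice to always honor manual runs are all assumptions made for the example.

```python
from dataclasses import dataclass
from enum import Enum


class Trigger(Enum):
    EVENT = "event"        # a table update, insert, or delete occurred
    SCHEDULE = "schedule"  # a recurring, cron-like tick fired
    MANUAL = "manual"      # an operator asked for an on-demand run


@dataclass
class TableStats:
    small_files: int  # files below the target size
    total_files: int


@dataclass
class CompactionPolicy:
    """Hypothetical policy: compact when the share of small files exceeds a threshold."""
    small_file_ratio: float = 0.3

    def needs_compaction(self, stats: TableStats) -> bool:
        if stats.total_files == 0:
            return False
        return stats.small_files / stats.total_files > self.small_file_ratio


def should_run(trigger: Trigger, policy: CompactionPolicy, stats: TableStats) -> bool:
    # Assumption: manual runs are always honored; event and scheduled
    # triggers first check whether the table actually needs work.
    if trigger is Trigger.MANUAL:
        return True
    return policy.needs_compaction(stats)


policy = CompactionPolicy()
healthy = TableStats(small_files=5, total_files=100)
fragmented = TableStats(small_files=60, total_files=100)

decisions = {
    "event_healthy": should_run(Trigger.EVENT, policy, healthy),
    "schedule_fragmented": should_run(Trigger.SCHEDULE, policy, fragmented),
    "manual_healthy": should_run(Trigger.MANUAL, policy, healthy),
}
```

Here an event on a healthy table does nothing, a scheduled tick on a fragmented table triggers compaction, and a manual request runs regardless.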

Ready to Transform Your Lakehouse?

Experience the power of automated, intelligent Iceberg table optimization and realize significant performance and cost benefits today.

Learn more about the Cloudera Lakehouse Optimizer by watching a demo.
Take advantage of our special promotional offer: All data processed through Cloudera Lakehouse Optimizer will be free until April 26th, 2026! While there is a minimal base cost, this promotion ensures you can explore Cloudera Lakehouse Optimizer’s capabilities without worrying about data processing fees. Furthermore, you can set consumption limits via the Cloudera Management Console to ensure costs never exceed your expectations.

Deliver Repeatable, Measurable, and Enterprise-Ready AI for Life Sciences

Laura Blewitt — Tue, 30 Dec 2025 14:00:00 UTC

Pharmaceutical and life science companies use AI to enhance drug discovery, clinical development, and patient experiences. In these types of regulated environments, the key to unlocking AI-assisted breakthroughs and return on investment (ROI) is a back-to-basics approach—focusing on data unification, interoperability, and security and governance.

On the latest episode of the Healthcare IT News podcast, HIMSSCast, Rameez Chatni, Global Director of AI Solutions at Cloudera, explains that the industry is transitioning from a nascent focus on AI strategy back to the bedrock of a robust data foundation.

Ensure Interoperability Across The Value Chain

The typical global pharma organization comprises 12 to 15 distinct, enterprise-like verticals—R&D, manufacturing, commercial, and so on—and building an AI-ready data set requires managing sophisticated, distributed architectures.

Data unification is difficult, and the solution isn't to force all data into one homogenous system. Instead, organizations are embracing a hybrid architecture that accommodates on-premises systems, multiple clouds, and software-as-a-service (SaaS) solutions.

Using open-source, interoperable technologies that support open data formats ensures that multiple query engines can access data for a variety of engineering, analytic, and AI workloads, and reduces the risk of vendor lock-in.

The ultimate goal for data unification is to give AI models the context they need to connect the dots across the organization and provide better outputs. One contextual model many pharma companies are leveraging is a knowledge graph. This structure captures relationships within the business that humans often miss, linking drugs to genes, diseases, clinical trials, and commercial data, and in doing so creates a truly comprehensive and usable data set.
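
For readers unfamiliar with the structure, a knowledge graph can be pictured as a set of (subject, relation, object) edges that can be walked to find indirect connections. The Python sketch below is a toy version under that assumption; the `KnowledgeGraph` class and every entity and relation name in it are made up for illustration.

```python
from collections import defaultdict


class KnowledgeGraph:
    """Tiny triple store of (subject, relation, object) edges; all names are hypothetical."""

    def __init__(self):
        self._out = defaultdict(list)

    def add(self, subject, relation, obj):
        self._out[subject].append((relation, obj))

    def paths(self, start, goal, max_hops=3):
        """Return every edge path from start to goal within max_hops."""
        results, stack = [], [(start, [])]
        while stack:
            node, path = stack.pop()
            if node == goal and path:
                results.append(path)
                continue
            if len(path) < max_hops:
                for rel, nxt in self._out[node]:
                    stack.append((nxt, path + [(node, rel, nxt)]))
        return results


kg = KnowledgeGraph()
kg.add("drug:X", "targets", "gene:G1")
kg.add("gene:G1", "associated_with", "disease:D1")
kg.add("drug:X", "studied_in", "trial:T1")
kg.add("trial:T1", "investigates", "disease:D1")

# Two independent routes connect the drug to the disease:
# one through the gene it targets, one through a clinical trial.
links = kg.paths("drug:X", "disease:D1")
```

An AI model given these paths as context can explain why a drug is relevant to a disease instead of relying on a single flat table.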

However, these advanced architectures hinge on one critical, often-overlooked first step: data inventory and data lineage. These are the unsung heroes and foundational pillars that prevent different functions (like R&D and manufacturing) from duplicating licenses for the same data sets and wasting resources.

Treat Governance as a Feature, Not a Bug

In a sector that is trying to innovate quickly with data, governance is frequently an afterthought, and projects can stall for as long as nine months as a result. Rameez argues that governance must be treated as a feature, not a bug. This means transforming it into “governance as a service,” a proactive, continuous capability within the enterprise.

The only way to achieve governance as a service is through a multidisciplinary center of excellence (CoE) that connects business leaders, data strategists, technology architects, and privacy and legal counsel. This ensures technical teams, who understand how data moves, can communicate effectively with legal teams, who understand privacy and consent restrictions.

Crucially, governance should be applied early. Failure to consider compliance, like restrictions on using clinical trial data for secondary purposes, can halt an entire project late in the game. In fact, AI should be applied to governance itself to accelerate contract reviews and ensure compliance checks are automated and auditable.

Prove ROI To Achieve Scale

The industry is littered with reports of AI pilot failures. Organizations that are just starting their AI journeys should find the operational AI use cases first. Automating "boring" tasks like clinical trial protocol writing (saving a week on each of a thousand documents) or processing adverse events faster are clear, quick wins.

Rameez advises that success starts with defining a clear, measurable ROI that aligns with the business. In pharma, enabling a “fail fast” culture is itself a form of ROI: computational failure is significantly cheaper than a late-stage clinical trial crash.

Rameez frames this ROI simply, advising that organizations take steps to identify and solve issues quickly, before they snowball: "The earlier you find problems... you can get to a (solution) much faster before it becomes a much bigger problem."

Finally, standardize your systems: define the agentic frameworks, the tools, the support models, and, most importantly, have clear rules for promotion from development to a validated, auditable production environment.

The Next Frontier: Personalized AI

Looking ahead, the next three to five years promise even greater transformation. We’ll see a rise in personalized agents that tailor interactions and insights to the individual user.

AI models will evolve to optimize for multiple parameters simultaneously. Instead of optimizing just for efficacy, models will suggest molecules that are effective, non-toxic, manufacturable, and have a good shelf life, all at once. We may even see the first commercially available drug marketed as “generated by AI.”

Want to learn how to prepare your organization for this future? Listen to the full conversation with Rameez Chatni for all the details on AI implementation and best practices.

Context Is the Hard Part: Practical Lessons in Building Agentic AI Systems

Pamela Pan, Navita Sood — Mon, 29 Dec 2025 14:00:00 UTC

Why context engineering is important, and how teams are delivering it

“How do you get the right data, in the right place, at the right time?”

That’s the core challenge behind bringing agentic AI to life in the enterprise. While large language models (LLMs) have unlocked powerful reasoning and orchestration capabilities, their effectiveness hinges on something more foundational: delivering the right business context for reasoning and taking action. Context engineering is a discipline focused on shaping how data, metadata, access policies, and memory come together to guide agent behavior in a secure and explainable way.

At Cloudera, we see this firsthand while partnering with enterprise customers experimenting with new generative AI (GenAI) and agentic AI use cases. Building agentic AI systems depends on something most organizations struggle with: data architectures that capture, govern, and reuse knowledge across the AI lifecycle.

In this blog, we share our approach to building agentic AI systems, which groups foundational capabilities into three buckets: Connect, Contextualize, and Consume. This approach enables our enterprise customers to build intelligent, trusted, explainable, and production-ready agentic systems.

Connect: Break Down Silos with Control

Modern AI agents can’t thrive in fragmented environments. However, most enterprises have data that’s spread across multiple clouds, data centers, legacy systems, and inconsistent formats. Exposing that data to an AI system without structure or safeguards leads to performance issues and governance risk.

In successful implementations, we’ve seen organizations focus first on creating a unified data layer that spans environments and formats. This doesn’t mean centralizing all data, but instead stitching it together in a data fabric architecture. This provides a unified layer with shared metadata, access policies, federated data engineering, and runtime interoperability.

Implementing an open table format and standard API access simplifies data access while delivering flexibility. Open lakehouse architectures matter here because they provide real-time, consistent views of data across engines—especially for agentic workflows that depend on reliable retrieval augmented generation (RAG) and reasoning.

Contextualize: Give Agents More Than Access

After data is connected, the challenge shifts to helping agents understand what data exists and how it's used. That starts with discovery: automatically identifying data sources across cloud and on-premises systems and activating the metadata—table names, fields, formats, and more. Tools like Cloudera Octopai Data Lineage scan ETL scripts, reverse-engineer pipeline logic, and capture how data moves and transforms across systems from the source to its final destination, capturing all the dependencies on its way.

This information forms the basis for lineage, which shows how datasets are related and how they change over time. Lineage matters when you need to validate a result, explain a recommendation or agent action, or trace a broken output to its source. It creates transparency and confidence in the systems with which agents interact.
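
Tracing a broken output to its source amounts to walking a dependency graph backwards. The following Python sketch shows that idea with a hand-written lineage map. The dataset names and the `upstream` and `root_sources` helpers are hypothetical, and a real lineage tool would discover these edges automatically, not read them from a dictionary.

```python
from collections import deque

# Hypothetical lineage edges: each dataset maps to the datasets it is derived from.
LINEAGE = {
    "dashboard.revenue": ["mart.sales_daily"],
    "mart.sales_daily": ["staging.orders", "staging.refunds"],
    "staging.orders": ["raw.orders"],
    "staging.refunds": ["raw.refunds"],
}


def upstream(dataset: str, lineage: dict = LINEAGE) -> list:
    """Breadth-first walk from a dataset back to everything it depends on."""
    seen, order, queue = {dataset}, [], deque([dataset])
    while queue:
        current = queue.popleft()
        for parent in lineage.get(current, []):
            if parent not in seen:
                seen.add(parent)
                order.append(parent)
                queue.append(parent)
    return order


def root_sources(dataset: str, lineage: dict = LINEAGE) -> list:
    """Upstream datasets that have no recorded parents of their own."""
    return [d for d in upstream(dataset, lineage) if not lineage.get(d)]


suspects = upstream("dashboard.revenue")
origins = root_sources("dashboard.revenue")
```

If the revenue dashboard looks wrong, `suspects` lists every dataset worth checking in order of distance, and `origins` narrows that to the raw inputs.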

Finally, cataloging brings this information into a usable structure. A centralized metadata store helps both humans and agents locate what they need, understand relationships between datasets, and surface policies that affect how data should be handled. A strong catalog acts like a blueprint—delivering a knowledge graph that gives agents a clear, navigable map of the enterprise’s data estate. It captures the technical, operational and business metadata including all the business definitions and the business logic required to understand the data and take action.

Contextualization enables agents to do more than retrieve information. It allows them to explore patterns, ask better questions, and make decisions with a deeper understanding of the environment they operate in.

Consume: Deliver the Right Context at the Right Time

The final step in building agentic systems involves enabling AI to take action in a way that is traceable, safe, and grounded in the right information. This is where architectural choices matter—guardrails, observability, and controlled access all shape whether agents behave predictably when it counts.

We’ve found it helpful to map common context engineering techniques to the underlying data challenges they’re designed to solve. Here are some examples of how they show up in practice:

Data Readiness Challenge	Context Engineering Technique	Cloudera’s Approach
Sensitive data leaking into prompts	Prompt engineering	Prompt gateways to redact sensitive data
Messy, unstructured data or outdated vector indexes	RAG	Governed and secure real-time streaming data pipelines
Lack of lineage, brittle training sets	Fine tuning	Improve AI explainability with lineage tracking
Agents overstepping, opaque decisions	Tool/API access	Metadata tagging, autonomous data classification, fine-grained access and full audit trails on every system call
Agents unable to access internal enterprise knowledge	Model context protocols (MCPs)	Controlled access to Apache Iceberg-backed context with REST catalogs
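As an illustration of the first row in the table, the sketch below shows the general idea behind a prompt gateway that redacts sensitive values before a prompt reaches a model. The regular expressions and placeholder labels are assumptions made for this example; they are not Cloudera's implementation, and a production gateway would rely on vetted classifiers rather than three patterns.

```python
import re

# Illustrative patterns only (assumed for this sketch, not exhaustive).
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    # 13 to 16 digits, optionally separated by single spaces or hyphens.
    "CARD": re.compile(r"\b\d(?:[ -]?\d){12,15}\b"),
}


def redact(prompt):
    """Replace each match with a typed placeholder and report how many were found."""
    findings = {}
    for label, pattern in PATTERNS.items():
        prompt, count = pattern.subn(f"[{label}]", prompt)
        if count:
            findings[label] = count
    return prompt, findings


if __name__ == "__main__":
    safe_prompt, found = redact("Email jane.doe@example.com about SSN 123-45-6789")
    print(safe_prompt)
    print(found)
```

Returning the findings alongside the redacted text lets the gateway write an audit record of what was removed without logging the sensitive values themselves.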

Choosing the right technique depends on the agent’s role, data sensitivity, and operational environment. Below are common enterprise use cases and the recommended combinations that have worked well in practice:

Use Case	Recommended Method(s)
Internal knowledge assistant	RAG + vector DB + prompt engineering fallback
Sales enablement bot with customer relationship management (CRM) data	Function calling + business context injection
Product-specific support agent	Fine-tuning or RAG + MCP shared context
Data analytics multi-agentic workflow to extract insights	LangGraph + MCP + tool access + chunked memory
Document understanding (PDF, Excel)	Multi-modal inputs + preprocessing pipelines

This approach to consumption ensures agents are operating with precision, security, and alignment to business goals.

Takeaways: From Framework to Action

At Cloudera, we’ve spent years navigating the complexities of enterprise data: bridging silos, enforcing governance, building secure pipelines for AI and analytics, and surfacing lineage across hybrid environments. So when agentic AI patterns began emerging, we weren’t starting from scratch. We knew where context lives, and how to capture it safely and securely with the right guardrails.

With Cloudera Octopai Data Lineage, teams can automatically map data flows, trace dependencies, and catalog metadata across cloud and on-premises environments. With data catalogs, observability, and access control layered in, agents can interact with systems more safely and intelligently. Teams gain the visibility, governance, and trust that are critical for scaling these workflows across the enterprise.

To make these pieces actionable, we’ve integrated these capabilities into our Open Data Lakehouse and Cloudera AI Studios, giving enterprises the foundation to design, deploy, and manage secure agentic systems in production.

Learn more about how Cloudera can help you productionize your AI agents with the business context they need.

Cloudera Grows Recognition as Great Place to Work

Debbie Kruger — Fri, 19 Dec 2025 14:00:00 UTC

At Cloudera, our people are the heart of our innovation. That’s what makes recognitions like Great Place to Work so important to us as an organization. Great Place to Work recognizes organizations that put an emphasis on employee well-being and professional growth. These priorities are deeply aligned with Cloudera’s ongoing commitment to put employees first and foster a collaborative work environment. At Cloudera, that means building a workplace where everyone feels included and supported and has opportunities to grow and learn.

This has been an exciting year, with multiple recognitions and certifications for offices that span the globe. Over the course of 2025, the company has secured Great Place to Work certifications across regions, including our offices in Ireland, Singapore, Costa Rica, Spain, Italy, and France. Here’s what our offices have earned so far:

Costa Rica

Best Places to Work Costa Rica (#6 Place)
Best Place to Work By Employee Quantity (20-100 #7 Place)

Spain

1st Time Certifying, Best Workplaces
Best Workplaces in Tech (#2 Place)

Italy

1st Time Certifying, Best Workplaces

France

1st Time Certifying, Best Workplaces

Ireland

Best Workplace in Ireland (Med) (#1 Place)
Best Small/Med Workplace in Europe (#13 Place)
Best Workplace for Women
Best Workplace for Health & Wellbeing

Singapore

Best Place to Work (Small) (#3 Place)

“Cloudera is a global leader in data management and enterprise AI, and that leadership is fueled by the incredible talent and commitment of our people,” said Amy Nelson, Chief Human Resources Officer. “As our global footprint expands, these Great Places to Work certifications mark important milestones and reinforce the culture and dedication our teams bring to life every day.”

To celebrate Cloudera’s commitment to employee well-being, we interviewed country leaders from the teams recognized to discuss what being certified as a Great Place to Work means to them.

Watch our video celebrating these honors.

The 3 Eras of Women Leaders in Technology: Mary Wells’ Perspective

Debbie Kruger — Wed, 17 Dec 2025 14:00:00 UTC

The conversation about women in technology has changed a lot over the years. What began as a push for visibility has become something much bigger: a story about representation, allyship, and influence.

Mary Wells, Chief Marketing Officer at Cloudera, has had a front-row seat for that evolution. Over a career of more than 25 years across some of the biggest names in tech, she’s seen firsthand how women’s roles and voices have transformed. As the executive sponsor of Cloudera’s Women Leaders in Technology (WLIT) initiative, she helps foster that next stage of growth: creating space for women and allies to learn, lead, and lift each other up.

Drawing from her experience, Mary describes the evolution of women leading in technology through three eras. Each builds on the one before it, with a new era just beginning to take shape.

Era One: Representation and Belonging

A couple decades ago, progress meant simply being seen.

Many women in tech were “the only one”—the only woman in a department, on a project team, or even in an entire building. These pioneers faced the dual challenge of doing their jobs while also proving they belonged.

In a recent interview, Mary reflected on her experiences during this era with informal meetups for women in tech at various company and industry events. In hindsight, she sees these as the early grassroots versions of today’s more formal women-in-tech support networks.

Mary recalls women sharing stories of being the only woman on their floor or in their department. Some left those WLIT conversations with other leaders (who happen to be women) in tears—not from sadness, but from relief. For many, it was the first time they realized they weren’t alone in their workplace struggles. Seeing their experiences reflected in others created a sense of representation and belonging.

Simple conversations broke the feelings of isolation, creating a sense of solidarity. Women worked together to listen, encourage, and prove that belonging was a form of strength.

During this era, peer communities gave women the courage to take a seat at the table and stay.

Era Two: Confidence and Voice

Once women had a place in the room, the conversation started to shift. It wasn’t enough to just be present. It was time to meaningfully participate.

That’s why this second era of women leading in tech can be characterized by confidence. Women started searching for ways to use their voices, influence decisions, and lead authentically. Mary recalls that about ten years ago, the questions she heard most often centered on self-doubt. Women were asking, “How do we make our presence count?”

At the time, “imposter syndrome” became the go-to phrase to describe the gap between physically being in the room and truly feeling like you belonged there.

But as time went on, she sensed this was a misnomer. Imposter syndrome wasn’t just a woman’s issue. Everyone experiences self-doubt at some point. The key isn’t to wait until it disappears, but to move forward anyway. For her, confidence often begins with courage. “Do it afraid,” she tells colleagues. A reminder that stepping out of your comfort zone usually means you’re growing.

This was the era when women stopped waiting for permission to lead and began shaping conversations of their own.

Era Three: All are Welcome Through Allyship and Partnership

This third era is about allyship and shared responsibility. It’s no longer just a “women’s issue”—today, all are welcome. Men and women alike are working to build teams that reflect the diversity of the world around them.

Mary has seen this shift firsthand. At a recent women-in-tech-leadership panel during an event in London, she looked out at a crowd that was nearly 60% men. For her, that moment, recognizing allyship and a broader peer group actively listening to these challenges, captured how far the conversation had come.

She recalls a moment when a male colleague questioned why forums like WLIT were needed, and another man quickly stepped in to say, “Look around the table,” implying that for most in attendance, the answer was obvious. That kind of allyship, Mary notes, gives the conversation credibility and momentum.

Progress now depends on everyone showing up, listening, and lifting others along the way.

The Emerging Era: Leadership and Influence

A new chapter is already unfolding, and this next era is about influence: ensuring women aren’t just part of the conversation about the future of tech, but are helping to define it. The WLIT sessions throughout Cloudera’s global EVOLVE event series offer a vivid example of what this new era looks like in practice.

Under the theme “Accelerate Action, Accelerate Innovation,” WLIT brought together leading voices across industries to explore topics ranging from adaptive leadership to responsible AI. Across four events, we saw over 300 external registrants and nearly 200 attendees, demonstrating a strong appetite for these crucial conversations.

Together we discussed:

Leading with governance and transparency (inspired by the rules of robotics)
Shaping a responsible AI future that people are excited to engage with
Cultivating adaptive, human-first leadership styles

The feedback from these sessions reflects how resonant and needed these conversations are.

One attendee shared:

“The WLIT panel from NY was honestly one of the most refreshingly honest and engaging panels I’ve seen. The diversity of thought and representation was great!”

For Mary, the WLIT sessions at EVOLVE demonstrate how influence becomes impact; it’s a natural evolution of the journey. The focus is no longer on women proving they belong in tech leadership. It’s on equally leading the conversations that will shape the future. The goal isn’t to be seen as “women leaders” anymore. Instead, we’d rather simply be seen as leaders.

Looking Ahead: Women Leading in Technology

Each era has paved the way for the next. Belonging built confidence, confidence created allyship, and allyship is leading to influence. That fourth era, Mary says, is already taking shape.

The story of women leading in technology is still being written. It’s a story of resilience, courage, and connection. Of people who chose to lift one another up rather than climb alone.

At Cloudera and across the industry, leaders like Mary Wells remind us that progress is about using our seat at the table to make space for others and to shape what comes next.

Experience the impact of Women Leaders in Technology at EVOLVE25 today:

Want to learn more? Check out our Women Leaders in Tech page.

Patrick Moorhead Insights: Overinvest in Data to Scale AI

Cloudera — Tue, 16 Dec 2025 14:00:00 UTC

Few people have had a front-row seat to more technological revolutions than Patrick Moorhead. As founder, CEO, and chief analyst at Moor Insights & Strategy, he’s spent decades tracking the intersection of hardware, software, and business transformation.

In this episode of The AI Forecast, host Paul Muller sits down with Patrick for a wide-ranging conversation on the evolution of AI—from lessons learned during the dot-com era to the rise of hybrid multi-cloud fabrics and the future of human-machine collaboration.

Here are the key takeaways from the conversation.

Comparing the Dot-Com Era to Today’s AI Moment

Paul: I was listening to Scott Galloway and Ed Elson’s podcast a few days ago, and they were likening the level of exuberance, and frankly even the level of dealmaking we’re seeing in the AI marketplace at the moment, to the dot-com era. And we all know how that ended: the internet won, but it didn’t get there via a straight line. How does today’s AI moment compare to past waves of innovation that you’ve seen?

Patrick: I am more comfortable about this than dot-com. When I was part of dot-com, it was, oh my gosh, I’m losing 35 bucks a bag on dog food and VCs are triple investing in the same businesses. It was like putting multiple chips on a craps table and it was pretty clear that that wasn’t going to work.

I like to call it the law of “if thens”—and the law of if thens said, okay, I’m building all this dark fiber and all this capability. If I can get a service that distributes video over the internet… If I can have a PC connected to a DSL or cable modem… If I have gaming that is not multiplayer yet… then yeah, we can fill up these pipes. That’s too many “if thens.”

Today, if I have a web browser, what I can do is absolutely amazing. All the agents that I have running are through a web browser. Now, don't get me wrong, enterprise AI is challenging, but it’s already making a difference. You can see it touches more than the internet did. This is touching healthcare. This is touching consumers. This is touching entertainment. This is touching every form of personal productivity. It’s touching code development. Google said that they're doing 20% of all their code based on AI. That is absolutely mind blowing.

Garbage In, Garbage Out—Still True in 2025

Paul: What are some of the best practices you’ve learned as someone who works with data all the time?

Patrick: The most important thing I learned, I learned in 1984 in my first computer class—and that was garbage in, garbage out. And it has never changed.

If you look at GenAI today, it’s amplified. Your data has to be that much better to get a good decision. If you have the right workload and the right model, the biggest impediment to enterprise AI success is having the right data. I think it’s one of the reasons that Cloudera is such an important company.

Paul: What are you seeing happen in the C-suite and at the boardroom table when it comes to recognizing and addressing the challenges of data quality and fragmentation?

Patrick: The successful companies really do have a proper data management strategy—bringing multivariate data in multiple formats, making sure it’s clean, tagged correctly, and secure. We’ve been talking about having a data management strategy for decades, and this time, it matters.

The Hybrid Future: Why Optionality Wins

Paul: The idea of being able to get to all your data everywhere all the time is going to be critical to connecting the dots. Because let's face it, when you're in the boardroom as an executive, the whole point of getting all those various functions together in one room was to try and assemble all the various experiences and data points that you had across the business to create a cohesive business decision. So, I suppose this hybrid notion—that you’re going to have to be able to get to data no matter where it is—you were talking about this 10 years ago.

Patrick: I was the analyst who was the cloud denier, saying that hybrid was going to be the way to go. Enterprises want optionality. There are things where they want to leave the driving to someone else, and there are some things they want to control.

Even a 15-year-old company has an Oracle database, an SAP implementation, a mixture of on-prem, public cloud, enterprise SaaS, and now sovereign cloud. You must be able to work across all of those.

If you try to copy all of your data into one giant place—it’s impossible. The cost to copy and assemble it all is a complete and utter failure. That’s why I came up with this idea that the future is going to be about hybrid multi-cloud fabrics—whether it’s security, data, compute, or automation.

You want to choose vendors that operate in every modality. Otherwise, you’ll be playing whack-a-mole until the cows come home.

Paul: For boards preparing for AI at scale—what’s your advice?

Patrick: Overinvest in data. Spend more money than you think you need. Don’t do it alone—find a partner that has hybrid multi-cloud fabrics. If you find a partner that’s cloud-only or on-prem-only, you’ve failed.

Catch the full conversation with Patrick Moorhead on The AI Forecast on Spotify, Apple Podcasts, and YouTube.

Cloudera’s Top Takeaways from AWS re:Invent 2025

Jeremiah Morrow — Thu, 11 Dec 2025 14:00:00 UTC

AWS re:Invent is a technology conference that, for most of the world, needs no introduction. This year, more than 60,000 technology professionals descended on the Venetian in Las Vegas for the week to share best practices, hear about the latest innovations in cloud technology, and discuss infrastructure, data, and AI strategy.

Cloudera had a big presence on the expo floor and throughout the week in sessions. It was our busiest re:Invent ever. Here are some of the top takeaways, stories, and announcements from the week, and what they mean for our customers.

It’s (Still) All About AI

In last year’s recap, our top takeaway was the pervasiveness of AI across the event. This year was much the same. Booth conversations, announcements, sessions, and demos were centered around nearly universal questions from attendees: how can we deploy AI in production safely and securely, and how can we build AI that we trust to run core parts of our business?

While everyone was talking about the promise of AI, many of the execution challenges lie further down the stack, in the data infrastructure. In the largest enterprises, data is still often siloed in various systems, clouds, and data centers, making it difficult to find, unify, secure, govern, and provide access to that data for analytics and AI.

Ultimately, distributed environments will be an inevitability, and hybrid architectures are likely the result. The goal is to leverage a platform that can apply a common set of security and governance policies across distributed data stores and enable portability so customers don’t have to care where their data lives. Taking a “cloud and data anywhere for AI everywhere” approach can solve many of the challenges customers face in delivering on their AI vision.

AWS Announcements

Amazon Web Services (AWS) made several announcements that Cloudera customers should know about.

AWS Graviton5 release. Last year, we wrote a blog post announcing support for AWS Graviton, which provides customers with cheaper, more efficient, and more sustainable compute power. Now, according to AWS’s benchmarks, AWS Graviton5 delivers 25% higher performance and several customers have reported significant performance improvements.

New capabilities in Amazon Bedrock. Many Cloudera customers use Amazon Bedrock for AI model development. AWS announced reinforcement fine tuning within Amazon Bedrock, which enables customers to create more accurate models that learn from feedback and deliver better business results. Reinforcement fine tuning is automated, so even developers who aren’t machine learning (ML) experts can use the tool.

Amazon S3 Vectors is now GA. Amazon S3 Vectors is the first cloud object storage to support the storing and querying of vector data. Now, customers can run and query vector-based AI and ML workloads directly on Amazon S3 without moving the data into a specialized vector database. Amazon S3 Vectors integrates with Amazon Bedrock, further streamlining AI workloads in AWS.
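For readers new to the idea, "querying vector data" usually means finding the stored embeddings closest to a query embedding. The sketch below shows that operation in plain Python with cosine similarity. It is a conceptual illustration only and does not use or mirror the Amazon S3 Vectors API; the index contents and three-dimensional vectors are made up for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query, index, k=2):
    """Return the keys of the k stored vectors most similar to the query."""
    scored = [(cosine_similarity(query, vec), key) for key, vec in index.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [key for _, key in scored[:k]]


# Hypothetical index: document key -> embedding (real embeddings have hundreds of dimensions).
INDEX = {
    "refund-policy": [0.9, 0.1, 0.0],
    "shipping-times": [0.1, 0.9, 0.0],
    "api-reference": [0.0, 0.1, 0.9],
}

if __name__ == "__main__":
    print(top_k([1.0, 0.2, 0.0], INDEX, k=2))
```

A managed vector store performs the same nearest-neighbor lookup at much larger scale, typically with approximate indexes rather than the exhaustive scan shown here.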

Sovereign cloud was another theme of the conference, and a critical point for many customers dealing with sensitive data or regulatory concerns. Ultimately, Cloudera and AWS are working together to ensure customers have access to the cloud innovation they need to be successful with AI while being excellent stewards of their customers’ data.

Cloudera Sessions

Cloudera experts and customers presented at several sessions throughout the week. Here are a few of the topics we covered, with links to the videos where they are available.

From Data to Action: Agentic-Powered Humanitarian Response. This session focuses on how Mercy Corps used Cloudera’s “Data Anywhere for AI Everywhere” approach to build MercyCORE, an agentic AI platform that enables the delivery of mission-critical insights to support humanitarian aid and revolutionize crisis response.

Powering Credit Risk Modernization with Cloudera and AWS. Most financial services institutions still struggle with manual processes, especially when they need to work with sensitive data. This session focuses on how financial institutions can integrate reasoning models and agentic patterns to arrive at a credit decision efficiently while maintaining compliance.

Hands-On Lab: Build an AI Agent with Cloudera Agent Studio. In this hands-on session, we introduced attendees to Cloudera Agent Studio, part of Cloudera AI Studios, and walked them through a workflow to build their own agents.

Cloudera Booth Demos

Cloudera booth demos highlight the depth and breadth of our platform capabilities, and our ability to support our customers’ AI journeys across a diverse set of organizational requirements. While we covered many data and AI topics, here are some of the highlights:

Building a Knowledge Graph with Cloudera. Knowledge graphs solve the two biggest problems with generalized large language models (LLMs) for business use: they are inaccurate and they’re not deterministic. By giving the data relational context through a knowledge graph and making that context available to the model through graph retrieval-augmented generation (GraphRAG), we can produce more accurate, more deterministic results from AI. Cloudera AI supports this process by unifying the underlying data across silos and systems so models have the best context available.
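A rough sketch of the GraphRAG idea follows: find the entities a question mentions, walk a few hops through the graph, and hand the resulting facts to the model as context. The triples, entity-matching rule, and function names are all hypothetical and are not drawn from the Cloudera demo.

```python
# Hypothetical knowledge graph stored as (subject, predicate, object) triples.
TRIPLES = [
    ("Acme Corp", "has_account_manager", "Dana Lee"),
    ("Acme Corp", "subscribed_to", "Premium Plan"),
    ("Premium Plan", "includes", "24/7 support"),
    ("Globex", "subscribed_to", "Basic Plan"),
]


def neighbors(entity, triples, hops=2):
    """Collect triples reachable from `entity` by following up to `hops` outgoing edges."""
    frontier, facts = {entity}, []
    for _ in range(hops):
        next_frontier = set()
        for s, p, o in triples:
            if s in frontier and (s, p, o) not in facts:
                facts.append((s, p, o))
                next_frontier.add(o)
        frontier = next_frontier
    return facts


def build_context(question, triples):
    """Turn graph facts about entities named in the question into plain-text context lines."""
    entities = sorted({s for s, _, _ in triples if s.lower() in question.lower()})
    lines = []
    for entity in entities:
        for s, p, o in neighbors(entity, triples):
            line = f"{s} {p.replace('_', ' ')} {o}"
            if line not in lines:
                lines.append(line)
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_context("What support does Acme Corp get?", TRIPLES))
```

Because the retrieval step is a deterministic graph traversal, the same question always yields the same context, which is one reason graph-backed retrieval can make model output more repeatable.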

Cloudera Octopai Data Lineage. Data lineage is one of the biggest challenges and opportunities in data governance. Often, the large organizations we talk to don’t know where all their data lives. The first step in building a unified view of data is understanding where it lives and how it flows across the organization, and Cloudera Octopai Data Lineage makes it easier than ever to see the full view of your data estate.

Data in Motion. The need for real-time insights has never been greater. Automating operational workflows often requires action at the point of ingestion, well before the data lands in a data lake. Cloudera Data in Motion enables organizations to ingest, process, and analyze data in true real time, and customers are using these capabilities for everything from network monitoring and automation to fraud detection to cybersecurity.

Cloudera AI Inference Service, powered by NVIDIA on AWS. Although most organizations have deployed AI in some capacity, many are struggling to move from experimentation to reliable, scalable deployment. The first step in operationalizing AI is having a trusted, high-performance inference layer that can serve models consistently across teams and environments. Cloudera AI Inference Service makes it easier than ever to deploy, scale, and govern AI workloads with enterprise-grade speed and control.

AI-Powered Knowledge Base on Cloudera with AMD EPYC on AWS. Making data searchable and turning unstructured documents into usable knowledge is a common AI use case, but organizations struggle to do so in a cost-effective manner without relying on large, GPU-intensive models. Small language models (SLMs) on Cloudera—powered by AMD EPYC™ CPUs on AWS—make it easy to build a secure, high-performance knowledge base, delivering fast semantic search with no GPUs required.

Protegrity Banking Demo: Secure GenAI & Analytics on Cloudera with AWS. Protegrity and Cloudera have showcased a banking solution that secures sensitive financial data on the Cloudera platform. By integrating Protegrity’s data-centric protection with Amazon S3, the solution establishes granular, role-based access controls. This approach keeps data protected by default, empowering enterprises to confidently scale their AI and analytics pipelines while adhering to strict compliance mandates.

Next Steps

AWS re:Invent 2025 reinforced what we’ve been hearing from many of our customers: everyone is at least experimenting with AI. Many organizations are under pressure to start showing value from their AI projects. And building AI securely, reliably, and cost-efficiently is mission critical.

Our partnership with AWS, and our joint vision for a unified data and AI ecosystem that ensures unified, secure, governed access to organizational data, is the solution enterprises need to be successful with their AI initiatives.

Has the Recent Acquisition Put Your Streaming Data on Lockdown?

Katie Gdula — Tue, 09 Dec 2025 14:00:00 UTC

Unlock The Power of Agnostic Streaming for Enterprise AI

The competitive landscape of enterprise data management is undergoing a significant reshaping following IBM’s announcement of its planned acquisition of Confluent, a data streaming platform. The deal is valued at $11 billion—a staggering price tag that validates two key components of a modern data strategy:

Real-time data streaming is no longer a luxury, but the indispensable foundation for the next generation of AI agents, intelligent applications, and true business automation.
Data in motion is a critical layer in integrated data and AI platforms—something Cloudera has been offering our customers for many years now.

Additionally, IBM’s acquisition of an independent vendor suggests a market trend toward consolidation, where vendors strive for comprehensive control over the data lifecycle—from ingestion to serving. Importantly, this trend doesn’t always align with customer needs: there are many use cases where organizations need an agile, “drop-in” data-in-motion solution, independent of a data and AI platform, deployable anywhere real-time streaming analytics, insight, and reasoning are needed.

This shift toward consolidation has many implications for organizations interested in an independent data-in-motion solution delivered as an operator for Kubernetes. In this blog, we’ll look at a few key considerations and what a shift toward vendor lock-in could mean for your data estate.

The Hidden Cost of Consolidation

Prior to the acquisition announcement, Confluent was well-known in the market as an open, independent, and cloud-agnostic data streaming solution. For organizations needing to get real-time data to a place where AI can be applied as quickly as possible, adopting a solution like Confluent was an easy choice.

Now, organizations that chose Confluent for its openness and flexibility face the possibility of vendor lock-in. Will this once-independent streaming vendor become a pipeline designed primarily to feed the new parent company’s broader, heavier platform? Will your once agile data-in-motion solution suddenly come with the heavyweight baggage of an entire enterprise stack that you neither want nor need?

This fear is valid: the reality is that when a tech giant consumes a smaller, more focused vendor, priorities inevitably shift.

The time is now to ask: Do you want your real-time data strategy—the lifeblood of your AI future—tied to a single, proprietary ecosystem? Or, do you need a solution built for openness, free of dependency on any specific vendor platform, that integrates into your existing data ecosystem with complete platform independence?

The Power of an Independent Data-in-Motion Solution

If the IBM-Confluent news has you concerned about the future direction, feature focus, or pricing of your current data-in-motion investment, consider an independent, managed, containerized alternative.

Cloudera’s data-in-motion solution is available both as an integrated part of our platform and as an independent operator for Kubernetes. While your use case will dictate which is the better option for you, here are a few of the benefits associated with using an independent solution focused purely on data in motion:

Platform Independence: Your Data, Your Cloud. Cloudera’s data-in-motion operators are engineered from the ground up to be platform agnostic. You can run your critical, real-time pipelines—Kafka, Flink, and more—on any public cloud, on-premises data center, or hybrid environment without penalty. This means you can focus on moving and processing data, not on migrating to a vendor’s preferred ecosystem.

Faster Innovation, Less Bloat. Cloudera natively incorporates the three pillars of data in motion—Apache Kafka, Flink, and NiFi—giving you a complete, visual, drag-and-drop environment for building and running efficient streaming analytics, data flow, ingestion, and routing. For example, because Kafka alone has historically not been the most efficient at data flow processing, Cloudera delivers operators for a combined Kafka/Flink solution as well as a NiFi flow-based engine operator that easily integrates with Kafka/Flink.

Data in Motion Your Way, Not Dictated by a Vendor. Cloudera offers an independent, enterprise-grade Operator for Kubernetes for data in motion that can manage the full suite of real-time needs. And all this can easily integrate with the rest of the Cloudera platform for a full lifecycle option, secure and governed, from edge to generative AI.

Conclusion

We’re here for the customers who believe their data strategy should not be constrained by a single vendor's all-encompassing platform ambitions. We are here for the fastest path to production for your real-time pipelines, driven by open source, and delivered with the freedom to deploy and run anywhere.

If you’re worried about the heavy hand of IBM changing your Confluent experience, or if you simply believe your data-in-motion solution should work everywhere you have data, talk to us. We’re ready to help you navigate this changing landscape and put your data back in motion, on your terms. Try our five-day trial on AWS.

Integrate Agentic Workflows Using Cloudera AI Workbench MCP Server

Patrick Hunt,Peter Ableda,Khauneesh Saigal — Thu, 04 Dec 2025 14:00:00 UTC

Figure 2. Cloudera AI Workbench MCP Server: Security by Design

How to Get Started with Cloudera MCP Server

Cloudera MCP Server is designed to help your assistants interact directly with your platform, all while operating within your established governance.

Getting started is a straightforward process:

Configure the server: Run the open-source server in Docker, providing your Cloudera AI Workbench host and API key as secrets
Connect your client: Point your preferred MCP client (like Cloudera Agent Studio) to the server using its STDIO command
Make your first request: You can test the connection by asking your assistant to “list my projects”

Example Workflows

Here are some examples of tasks you can perform through an assistant connected to the Cloudera MCP Server:

List all my active projects and show me any jobs that are still running
Upload the new-data-august.zip file to the “fraud-detection” project
Create a job using the train-v3.py script, give it 2 CPUs and 8GB of memory, and run it
Log these metrics to the experiment named “resnet-sweep” and tag the run with “new-data”
Take the latest model build and deploy it to the staging endpoint
Restart the “gradio-demo” application

The server includes tools to support these workflows across the project lifecycle, including file management, job execution, experiment tracking, model deployment, and application management.

Learn More

For detailed setup steps, examples, and a full list of capabilities, please visit the Cloudera MCP Server GitHub repository. Note: GitHub projects are provided as-is and are not formally supported by Cloudera. The Cloudera MCP Server project is made available under the Apache 2.0 license, and Cloudera provides no warranty, support, or maintenance for its use.

To learn more about how MCP and Cloudera work together, check out our blog Bringing Context to GenAI with Cloudera MCP Servers.

Figure 1. Cloudera AI Workbench MCP Server: Architecture

Integrates with Existing Governance

Cloudera MCP Server is designed to work with your existing enterprise governance, not bypass it.

For data scientists and AI engineers: This can help reduce context switching, allowing you to stay in your chat or IDE while initiating platform tasks. The assistant can handle the coordination, while the platform handles the execution.

For platform and MLOps teams: It helps with routine operations such as triggering an eval script, uploading new datasets, and launching repeated test runs. The integration also supports updating, deleting, and restarting applications, as well as tracking experiments.

Security by Design

Security is a core component of the server's design, intended to fit within an enterprise environment.

STDIO transport: By default, it uses Standard Input/Output (STDIO) for communication between the assistant and the server. This avoids the need to open and manage a new network endpoint for this interaction.
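To make the STDIO idea concrete, here is a minimal sketch of a client exchanging one newline-delimited JSON-RPC 2.0 message with a child process over stdin and stdout, which is the general shape of MCP's STDIO transport. The child is a toy stand-in, not the Cloudera MCP Server, and the tool name `list_projects` is a hypothetical example.

```python
import json
import subprocess
import sys

# Toy stand-in server: reads one JSON-RPC request per line on stdin and
# writes one reply per line on stdout. No socket or port is opened.
SERVER_SOURCE = r'''
import json, sys
for line in iter(sys.stdin.readline, ""):
    request = json.loads(line)
    reply = {"jsonrpc": "2.0", "id": request["id"],
             "result": {"echoed_method": request["method"]}}
    sys.stdout.write(json.dumps(reply) + "\n")
    sys.stdout.flush()
'''

def call_over_stdio(method: str, params: dict) -> dict:
    """Start the child, send one request on its stdin, parse the first reply line."""
    request = {"jsonrpc": "2.0", "id": 1, "method": method, "params": params}
    completed = subprocess.run(
        [sys.executable, "-c", SERVER_SOURCE],
        input=json.dumps(request) + "\n",
        capture_output=True,
        text=True,
        check=True,
    )
    return json.loads(completed.stdout.splitlines()[0])

reply = call_over_stdio("tools/call", {"name": "list_projects", "arguments": {}})
print(reply)
```

Because the client starts the server as its own child process, the only channel between them is the pair of pipes, so there is no listening endpoint to secure.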

Credential management: The server is designed to read credentials from Docker secrets or environment variables, avoiding the need to hard-code keys or pass them in command-line arguments.
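A minimal sketch of that lookup order, assuming a secret file is preferred and an environment variable is the fallback. Docker mounts secrets under `/run/secrets` by default. The credential names used here are hypothetical, and the real server may resolve credentials differently.

```python
import os
import tempfile
from pathlib import Path

def load_credential(name: str, secrets_dir: str = "/run/secrets") -> str:
    """Prefer a Docker secret file; fall back to an upper-cased environment variable."""
    secret_file = Path(secrets_dir) / name
    if secret_file.is_file():
        return secret_file.read_text().strip()
    value = os.environ.get(name.upper())
    if value:
        return value
    raise RuntimeError(f"credential {name!r} not found in {secrets_dir} or environment")

# Demonstration: a temporary directory stands in for /run/secrets.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "workbench_api_key").write_text("secret-from-file\n")
    from_file = load_credential("workbench_api_key", secrets_dir=tmp)

# No secret file exists here, so the environment variable is used instead.
os.environ["WORKBENCH_HOST"] = "https://workbench.example.com"
from_env = load_credential("workbench_host", secrets_dir="/nonexistent-secrets-dir")
print(from_file, from_env)
```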

Easy access: It uses your existing Cloudera AI Workbench API keys, allowing you to scope permissions appropriately for different users and use cases.

Automate Tasks and Improve Data Practitioner Efficiency

There are quite a few mundane tasks a data scientist or AI engineer does as part of their daily workflow—like uploading datasets, running and iterating the same scripts for different hyperparameters, observing experiments, and so on. Offloading these tasks to an AI agent could save resources and add significant value.

That’s where the Cloudera AI Workbench MCP Server comes in: it’s an open-source Model Context Protocol (MCP) server designed to better integrate with your agentic workflow.

What Cloudera MCP Server Is and How It Helps

Cloudera’s MCP Server acts as a secure translator. It enables assistants (like Cloudera AI Agent Studio, Claude, or Cursor) to execute tasks directly inside your Cloudera AI Workbench environment.

This means you can ask your assistant to list projects, upload files, and run jobs, and the server will carry out the action using the platform's standard APIs.

How to Avoid Building Brick Walls with Your Data and AI Platforms

Jeff Healey — Tue, 02 Dec 2025 14:00:00 UTC

Most large organizations today would never choose just one vendor to run their data and AI initiatives. A single, preferred cloud vendor? Perhaps, but multi-cloud and hybrid adoption have grown, particularly as these organizations prepare for the next, inevitable public cloud outage. Companies need flexible options on where and when they run their workloads in the most cost-optimized ways, say when there’s an economic downturn or as budgets tighten.

If you take a glimpse into the data and AI architectures of Fortune 2000 IT organizations, you’ll find a myriad of technologies implemented from the vendors scattered as dots across Gartner Magic Quadrants and Forrester Waves.

When you’re busy with mergers and acquisitions and need a quick win, it’s easy to buy into the hype of certain vendors’ claims. And despite their best intentions to maintain an open ecosystem approach, these large organizations sometimes fail to read the fine print before investing heavily in overhyped offerings.

The result? Accidental architectures with brick walls—locking organizations into single vendors, which can lead to higher costs, limited flexibility, and slower innovation.

This blog explores the most common vendor lock-in pitfalls and the critical questions you should ask during platform evaluations, with examples of how Cloudera’s open data architecture helps you sidestep these challenges.

Forced Costly Cloud Migrations and Lack of Support for Data Fabric and Data Sovereignty

Does your data and AI platform run where my data lives?

Cloudera runs anywhere your data lives, so you can securely process and govern distributed data across hybrid environments with the same, consistent platform. Cloudera’s integration of Trino takes this even further. It enables fast, federated queries across data warehouses, lakes, and on-premises systems—without moving data. By centralizing access and accelerating insights, Trino is a key enabler for organizations building unified data fabrics and preparing for the next frontier: agentic AI.

Cloud-only data and AI platforms can’t handle on-premises data without forcing cloud migrations that cost millions of dollars in rewrites and refactoring—at the end of which you’re locked into a single vendor.

Does your platform allow me to connect data across silos, from on-premises systems to public clouds and everywhere in between?

That’s what a data fabric supports—allowing data to be accessed and used anywhere, by anyone, securely and efficiently. In recognition of our strengths in this area, Cloudera was just named a Leader in the 2025 Forrester Wave for Data Fabric Platforms.

Vendors that don’t meet the minimum data management requirements to support data fabric use cases aren’t featured in Forrester’s report. Take note of popular platform vendors that are missing from this evaluation—investing in their solutions will force your organization to move all of your data into a single system.

Can your platform run in air-gapped environments to deliver sovereign deployments?

Cloudera delivers private AI by supporting fully air-gapped, sovereign deployments where control planes and data never leave your environment—a requirement for regulated industries, particularly the public sector. Other platforms require constant connection to their control plane, making true private AI impossible.

Catalogs that Only Work Inside a Data Estate with Limited Functionality

Does your data catalog work across my entire data estate?

Cloudera (and particularly Cloudera Octopai Data Lineage) provides full-stack lineage and governance across all your data platforms. Other platforms only govern data that you've migrated into that platform, breaking data mesh architectures. Also, Cloudera Octopai Data Lineage delivers visual lineage out of the box with full integration—this is a key differentiator compared to other vendors that offer an API endpoint but no tooling, UI, or integrations.

Does your data and AI platform deliver complete governance?

Cloudera Shared Data Experience (SDX) has been production-proven for years, providing complete governance across all workloads.

Other vendors fall short in this area: one announced catalog offerings years ago, with features like tag-based governance only recently reaching GA—three years after it was initially announced—while critical capabilities like attribute-based access control remain in public preview. Operating on a two-to-three year gap between big announcements and production delivery is the definition of a hype machine.

Hidden Costs, Lack of Guardrails, and an Immature Data Warehouse

Do you offer transparent pricing with guardrails to avoid bill shock?

Cloudera offers transparent pricing without hidden multipliers or consumption traps. Other vendors introduce features without guardrails, hitting customers with thousands of dollars in surprise bills for even just one day of testing.

Can your data warehouse handle true enterprise demand?

Cloudera Data Warehouse provides production-grade data warehouse capabilities with high availability (HA) and seamless scaling.

While other vendors have added autoscaling and HA, it’s important to review whether these are compatible or separate functions—if the latter, you’ll be forced to choose one or the other. Additional limitations to be on the lookout for are regional and vendor-managed storage.

Limited Data Streaming with a Tax on Dubious Performance Gains

Can your data and AI platform handle data-intensive streaming workloads?

Cloudera delivers production-proven Apache Flink, Kafka, and NiFi for complex streaming workloads. Other vendors can't compete against Flink, specifically, and have no streaming play.

Do you charge for performance gains on streaming workloads?

Cloudera Streaming has no premium pricing tiers. Others force a ~3× cost multiplier, even though streaming workloads often see no performance gain. It’s not uncommon for these vendors to charge you more when you optimize—up to 80% more, based on internal analyses.

Does your platform deliver true open source Kafka or a proprietary, unproven version?

Cloudera relies on mature, open-source Apache Kafka with a proven track record. Others don’t run Apache Kafka at all. They ship a proprietary Kafka-lookalike that’s still early, unproven at scale, and wrapped in opaque pricing.

Lack of Clarity Around AI Ownership (vs. API Access Rentals) and AI Assistants (vs. Chatbots)

With your data and AI platform, will I own my AI models or do you simply charge me for API access?

Cloudera AI enables companies to own and operate their AI models privately on their infrastructure. Other vendors act as “middlemen” for public APIs, exposing customers to sudden service cutoffs and uncapped costs while collecting massive fees.

Is your platform infused with reliable AI assistants to improve productivity?

Cloudera AI Assistants are embedded across the platform from day one with genuine intelligence. Other vendors are repackaging basic retrieve-and-respond chatbots as innovation, but if an assistant can't trace data lineage, enforce governance, or reason across structured and unstructured data, it's just search with a better interface.

Vendors Jumping On The “Open” and “Unified” Bandwagon Without the Infrastructure to Support These Claims

How open is your data and AI platform, really?

Cloudera supports Apache Iceberg and Hudi today across multiple engines without vendor lock-in. Other vendors claim an open approach, but their table format support is often several years away, or still in beta, and essentially remains proprietary, trapping customers.

What level of support does your platform provide for Apache Iceberg?

Cloudera supports Apache Iceberg with full read and write capabilities across the platform without vendor lock-in. Cloudera’s Iceberg REST Catalog further enhances data sharing by delivering an open, universal metadata layer that enables zero-copy access across popular platforms, engines, and teams.

Other vendors claim openness, but their Iceberg support is still in beta. And their “unified” table format? Practitioners skip it in real deployments—using it means duplicating data or sacrificing performance, since their optimizations only work on proprietary formats.

Avoid Vendor Lock-In: Choose an (Actually) Open, Unified, Governed Data and AI Platform

Cloudera is the only data and AI platform company that large organizations trust to bring AI to their data anywhere it lives. Unlike other providers, Cloudera delivers a consistent cloud experience that converges public clouds, data centers, and the edge, leveraging a proven open-source foundation. As the pioneer in big data, Cloudera empowers businesses to apply AI and assert control over 100% of their data, in all forms, delivering unified security, governance, and real-time predictive insights. The world’s largest organizations across all industries rely on Cloudera to transform decision-making and ultimately boost bottom lines, safeguard against threats, and save lives.

To learn more about how to securely prepare, integrate, and analyze data at scale with Cloudera, check out our product demos.

Achieve Workload Portability Without the Rewrite

Blake Tow, Tushar Sharma — Tue, 25 Nov 2025 14:00:00 UTC

Cloudera’s cloud bursting capability brings the cloud to your data

The conversation around cloud adoption has matured significantly. For modern data-driven organizations, it’s no longer a question of if they should use the cloud, but how they can strategically blend public cloud agility with the security and control of their on-premises infrastructure.

Although the hybrid cloud market is projected to grow to over $300 billion by 2030, many organizations are hitting a wall. They’re discovering that simply connecting an on-premises data center to a public cloud doesn't create a truly hybrid platform.

Instead, they’re often forced into a lift-and-shift cycle: permanently relocating applications and continually replicating massive datasets to the cloud just to get temporary compute capacity. This leads to fragmented management, rising costs due to data duplication, and data staleness.

The Problem: How to Handle Data Spikes

Scalability is a top priority for enterprises. Businesses frequently face sudden spikes in data volume that require additional resources—whether it's end-of-month reporting, model training, or seasonal traffic.

Resource contention during these spikes creates bottlenecks that force organizations to miss critical service level agreements or objectives (SLAs and SLOs), which can result in potential regulatory fines and increased customer churn.

Historically, IT leaders had two imperfect choices to handle these spikes:

Over-provisioning: Buying costly on-premises hardware that sits idle most of the time to account for peak demand
Migration: Moving data and workloads to the cloud permanently, which is complex, risky, and fraught with compliance risks

The Solution: Bring the Cloud to Your Data

Unlike the traditional lift-and-shift model, Cloudera’s approach brings the cloud to the data.

Cloudera’s cloud bursting capability enables organizations to extend the private data center into a public cloud—only when needed—and scale back down when the demand subsides. This approach instantly bridges resources to handle demand without the risk or cost of data migration.

Here’s how it works:

Spin up a Hybrid Data Hub in the public cloud. This temporary compute cluster combines cloud elasticity with secure access to your on-premises data to handle heavy workloads (for example, a Spark job).

This cloud workload reads and writes directly from on-premises storage (such as Hadoop Distributed File System, or HDFS), intelligently fetching only the precise data subset required for the specific task rather than moving entire datasets.

Once the job is done, the cloud resources spin down. Your data is never replicated to the cloud; it is read only into memory and stays safely on-premises.

Why This Approach Changes the Game

By using Cloudera’s cloud bursting capability, built on its unified runtime and hybrid control plane, organizations can finally achieve workload portability without the rewrite. Benefits include:

Zero Data Migration

This architecture eliminates the cost and complexity of application redesign and massive data migration. Organizations don't need to create and maintain a copy of their data in the cloud just to run a query, a copy that can fall out of sync with the original before the process is even completed. To optimize performance, the system uses techniques like projection pushdown, which reads only the columns a query needs, and partition pruning, which skips partitions that cannot match the query's filters. This delivers fast query results without the latency or cost of moving massive datasets.
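The two optimizations named above can be sketched with a toy in-memory table. This is only an illustration of the idea, not how Cloudera's engines implement it, and the table contents and column names are invented for the example.

```python
from typing import Dict, Iterator, List

# Toy "on-premises" table: partition key -> rows stored in that partition.
PARTITIONS: Dict[str, List[dict]] = {
    "2025-10": [{"txn_id": 1, "amount": 40.0, "country": "IN", "notes": "x" * 50}],
    "2025-11": [
        {"txn_id": 2, "amount": 900.0, "country": "US", "notes": "y" * 50},
        {"txn_id": 3, "amount": 15.0, "country": "DE", "notes": "z" * 50},
    ],
}

def scan(partition_keys: List[str], columns: List[str]) -> Iterator[dict]:
    """Read only the requested partitions (pruning) and columns (projection)."""
    for key in partition_keys:            # partitions not listed are never touched
        for row in PARTITIONS.get(key, []):
            yield {col: row[col] for col in columns}  # wide columns stay behind

rows = list(scan(partition_keys=["2025-11"], columns=["txn_id", "amount"]))
print(rows)
```

Here the October partition and the wide `notes` column are never read, which is the same reason a bursting job can fetch a small subset instead of a whole dataset.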

Centralized Security and Governance

One of the biggest barriers to hybrid adoption is security. With Cloudera, the security context moves with the workload. We establish a two-way cross-realm trust between on-premises Active Directory and the cloud, which guarantees that the user submitting the job in the cloud is authorized by the same policies defined in Ranger on-premises. All metadata and governance rules remain centralized to maintain compliance with regulations like GDPR and HIPAA.

Strategic Workload Isolation and SLA Assurance

Resource contention on-premises often forces IT to play traffic cop, delaying lower-priority jobs to keep mission-critical ones running. Cloud bursting resolves this conflict. Organizations can use strategic workload isolation to offload specific workloads to the cloud so they can maintain critical SLAs and SLOs for their core business processes. Whether the goal is meeting a strict deadline for regulatory reporting or delivering real-time fraud detection without added latency, teams can ensure performance without over-provisioning hardware.

Real-World Application: Faster Time to Value

Imagine a data engineer working on a fraud detection model. The on-premises cluster is at 95% capacity, and a new threat vector requires immediate model retraining. Running this locally would choke the production pipeline and cause an SLA breach.

With Cloudera, that data engineer can:

Burst to the cloud in real time to access the necessary compute power
Process the sensitive data that lives on-prem without permanently moving it
Shut down the cloud instance immediately after the job completes

This capability also accelerates software development by enabling teams to create instant development environments that leverage zero-copy data access from their production on-premises source.

The Future is Cloud Anywhere

Cloudera is the only data and AI platform company that brings AI to your data anywhere it lives. Whether the data is in the data center, the public cloud, or at the edge, we deliver a consistent cloud experience that empowers you to make smarter, faster decisions.

Ready to bring the cloud to your data?

How Leading Data Teams Build AI-Ready Pipelines with Apache Iceberg and Spark

Pamela Pan, Ying Chen, Akshat Mathur — Mon, 24 Nov 2025 14:00:00 UTC

Lessons from two global enterprises modernizing data engineering for scalable AI

From predictive analytics to generative AI, every business is looking to turn data into value. But for many teams, the real challenge lies beneath the surface—in the data engineering work required to make that data usable, trusted, and scalable. Across complex environments, engineers are still stitching together pipelines using legacy table formats, duplicating logic across tools, and retrofitting governance after the fact. These inefficiencies create drag at every stage, delaying outcomes and limiting the impact of even the most advanced AI and analytics initiatives.

For enterprises looking to streamline and future-proof their data engineering stack, Apache Iceberg as the open table format and Apache Spark as the open compute engine have been proven as a powerful combination. Together, they offer an open, scalable, and standardized foundation for processing and managing petabyte (PB)-scale data—without sacrificing governance, flexibility, or performance.

In this blog, we will take a closer look at how two global organizations transformed their data pipelines using Spark and Iceberg with the Cloudera data and AI platform. We’ll explore how they reduced query times by 80%, standardized workflows across teams, and accelerated their path from raw data to AI-ready insights.

How Vodafone Idea Slashed Query Times by 80%

Vodafone Idea is one of the three major telecommunications companies in India, serving 220 million customers. The company was struggling with scale issues: its Hive-based data lake had ballooned to more than 17 PB, and performance bottlenecks were putting critical business operations at risk. Some reporting queries took more than 70 hours to complete, which delayed compliance, analytics, and regulatory reporting.

Rather than simply upgrading infrastructure, Vodafone Idea chose to re-architect its data platform. Collaborating with Cloudera, the company leveraged Iceberg for faster queries through optimized metadata and schema evolution, and rebuilt its processing workflows on Spark to leverage distributed compute for efficient, large-scale data processing.

For regulatory reporting, they paired Iceberg with Apache Impala as the interactive query engine to support fast, reliable access to PB-scale datasets. While Impala handled the reporting queries, Iceberg played a critical role behind the scenes—its support for ACID transactions (atomicity, consistency, isolation, and durability—properties that ensure database transactions are processed reliably and consistently), flexible schema evolution capabilities, and rich metadata kept reporting workflows consistent, even as data changed.

Through integration with Cloudera Shared Data Experience (SDX), the team also gained fine-grained governance with role-based and attribute-based access control, making sure that the right people had access to the right data. This foundation enabled the business to deliver timely and auditable reports while meeting growing regulatory demands.

Transforming Telecom with Data-Driven Efficiency

By partnering with Cloudera, Vodafone Idea preserved flexibility, strengthened governance, and accelerated insight delivery at scale—without having to rebuild its entire data stack. Using Spark for ingestion, Iceberg for unified table management, and Impala for reporting, they modernized their foundation while reusing existing logic and workflows.

Together, this architecture delivered measurable results:

Reduced query times by 80%.
Decreased pipeline failures via Spark’s resilience at scale and Iceberg’s robust table management capabilities.
Improved regulatory reporting (faster and more reliable).

How a Pharmaceutical Company Consolidated In Order To Scale: One Tech Stack, 10,000 Jobs

A global pharmaceutical company managing PB-scale clinical research data faced a familiar but growing challenge: they had too many tools in play, leading to data reliability challenges and difficulty meeting compliance standards, on top of facing pressure to support faster AI and analytics. The data engineering teams needed to run more than 10,000 daily ETL jobs, but lacked a standardized way to build, govern, or validate pipelines across teams.

With Cloudera on AWS, the company set a clear direction forward. The team standardized all data pipelines using Spark on Cloudera Data Engineering, unifying and scaling processing across batch, streaming, and machine learning workloads. At the same time, they adopted Iceberg as the default open table format to ensure consistent schema evolution, built-in version control, and enterprise-grade governance across teams and environments.

By adopting Spark and Iceberg on Cloudera, the company laid a clean, scalable DataOps foundation that standardized data pipelining, enabled secure data sharing across teams and tools, and paved the way for faster and more advanced AI and analytics. This foundation now supports everything from regulatory audit workflows to AI models that accelerate clinical trial discovery and drug development, ensuring the company can seamlessly integrate any new technology or engine in the future.

Transforming Pharma with a Unified Data Platform

Standardizing on Cloudera’s platform gave the global pharmaceutical company a new level of operational consistency:

Governance without disruption: Iceberg’s write-audit-publish pattern allows upstream teams to validate data before releasing it to production—without breaking downstream workflows.
Time traveling for traceability: Regulatory teams can access historical data snapshots instantly, enabling clean rollback and audit support.
Shared pipeline logic: With Spark as the unified engine, teams—ranging from data engineers to data scientists—can collaborate easily and reuse core transformations across jobs and environments, reducing duplication and simplifying maintenance.
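The write-audit-publish and time-travel ideas in the list above can be illustrated with a toy snapshot table. This is a conceptual sketch in plain Python, not the Iceberg API. The class, the audit rule, and the column names are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class ToyTable:
    snapshots: Dict[int, List[dict]] = field(default_factory=dict)
    current_id: Optional[int] = None  # the snapshot downstream readers see
    _next_id: int = 1

    def stage(self, rows: List[dict]) -> int:
        """Write: create a new snapshot on top of the current one, unpublished."""
        base = self.snapshots.get(self.current_id, [])
        snapshot_id = self._next_id
        self.snapshots[snapshot_id] = base + rows
        self._next_id += 1
        return snapshot_id

    def publish(self, snapshot_id: int) -> None:
        """Publish: move the current pointer so readers see the new snapshot."""
        self.current_id = snapshot_id

    def read(self, snapshot_id: Optional[int] = None) -> List[dict]:
        """Read the current snapshot, or an older one by id (time travel)."""
        chosen = self.current_id if snapshot_id is None else snapshot_id
        return self.snapshots.get(chosen, [])

def audit(rows: List[dict]) -> bool:
    """Audit: a hypothetical check that every row carries a subject id."""
    return all(row.get("subject_id") is not None for row in rows)

table = ToyTable()
first = table.stage([{"subject_id": "S-001", "dose_mg": 10}])
if audit(table.read(first)):
    table.publish(first)

bad = table.stage([{"subject_id": None, "dose_mg": 25}])
if audit(table.read(bad)):  # fails the audit, so it is never published
    table.publish(bad)

second = table.stage([{"subject_id": "S-002", "dose_mg": 20}])
if audit(table.read(second)):
    table.publish(second)

print(len(table.read()), len(table.read(first)))
```

Downstream readers only ever see published snapshots, and the earlier snapshot stays readable by id for audits or rollback.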

Building A Modern Foundation for Data Engineering and AI

These two stories share a common thread: both organizations faced fragmentation, scale pressure, and growing complexity in their data workflows. By standardizing on Apache Spark and Apache Iceberg with Cloudera, they rebuilt their pipelines around open, scalable, and trusted components—enabling better governance, faster performance, and cleaner data flows for AI and analytics.

With Cloudera Data Engineering, enterprises get an end-to-end solution that runs across hybrid and multi-cloud environments. It brings together Spark, Iceberg, and integrated orchestration with Airflow to empower teams to:

Build pipelines once, and run them anywhere—in the data center or on clouds
Maintain trust and governance at scale in the open data lakehouse

Watch this interactive demo to see how Spark and Iceberg power trusted, scalable pipelines on Cloudera.

The Future Delivered Today: The AI-Powered Data Lakehouse

Dipankar Mazumdar — Fri, 21 Nov 2025 14:00:00 UTC

Figure 3: Cloudera AI’s Offering with AI Workbench and Inference Service

Cloudera AI Workbench

Cloudera AI Workbench is the collaborative environment where data scientists, analysts, and engineers develop, fine-tune, and test models. It brings together notebooks, low-code application builders (AMPs), and specialized studios for every stage of AI development. To accelerate AI development and deployment, Cloudera AI Workbench underpins four AI studios that bridge the gap between business and technical teams, fostering collaboration on AI projects.

Synthetic Data Studio generates synthetic datasets for testing and model training when real data is limited or restricted.
Fine-Tuning Studio adapts open foundation models with enterprise-specific datasets for higher relevance and accuracy.
RAG Studio builds RAG pipelines that connect LLMs (such as OpenAI, Anthropic, Amazon Bedrock) to relevant private data for grounded, contextual outputs.
Agent Studio enables the creation of multi-step, agentic workflows that use models, MCPs, APIs, and internal data sources to automate domain-specific tasks.

All of these capabilities operate on the open lakehouse (on Iceberg’s foundations), giving teams governed, zero-copy access to the data needed for specific tasks.

Cloudera MCP Server

Cloudera is also extending the openness of its AI platform through a series of emerging MCP services, beginning with the open-source Cloudera AI Workbench MCP Server. This service is designed for AI system integration, enabling agentic and tool-calling capabilities within the AI Workbench. It provides the framework for LLMs to securely interact with Cloudera AI Workbench features and components—bringing models, data, and applications into automated enterprise workflows. In this architecture, intelligent agents can reason, act, and automate tasks across the trusted, governed Cloudera environment while maintaining the security, control, and auditability required in regulated industries.

Cloudera AI Inference Service

The Cloudera AI Inference Service brings models into production with autoscaling, high availability, and end-to-end observability. It supports both traditional ML models and large language models (LLMs), serving predictions and responses with low latency. Models can be deployed as REST or gRPC endpoints with enterprise-grade security, ensuring reliable and consistent access from applications and agents.

The Cloudera AI Registry, integrated within the inference layer, provides centralized model lifecycle management with MLflow-compatible APIs for tracking, versioning, artifact storage, and lineage. You can choose from a range of open and enterprise language models, such as Llama, Cohere, Gemma, and Mistral.

The inference layer also includes built-in monitoring and observability, enabling teams to track latency, throughput, and model drift while maintaining full lineage and compliance through SDX governance. This ensures that model predictions are explainable and traceable, which is a key requirement for enterprise-grade AI.
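As a rough illustration of what tracking latency and drift can involve, the sketch below computes a nearest-rank latency percentile and a Population Stability Index (PSI) between training-time and live score distributions. PSI is one common drift statistic, chosen here as an assumption. The source does not say which metrics the service computes, and the sample values are invented.

```python
import math
from typing import List, Sequence

def percentile(values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty sample."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

def bin_fractions(values: Sequence[float], edges: Sequence[float]) -> List[float]:
    """Fraction of values per bin; a small floor avoids log(0) for empty bins."""
    counts = [0] * (len(edges) + 1)
    for value in values:
        counts[sum(value >= edge for edge in edges)] += 1
    return [max(count / len(values), 1e-6) for count in counts]

def psi(expected: Sequence[float], actual: Sequence[float], edges: Sequence[float]) -> float:
    """Population Stability Index: sum of (a - e) * ln(a / e) over bins."""
    e = bin_fractions(expected, edges)
    a = bin_fractions(actual, edges)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

latencies_ms = [80, 95, 110, 120, 300]
p95 = percentile(latencies_ms, 95)

training_scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7]
live_scores = [0.6, 0.7, 0.8, 0.9, 0.9, 0.95]
edges = [0.25, 0.5, 0.75]
drift = psi(training_scores, live_scores, edges)
print(p95, drift > 0)
```

A PSI of zero means the binned distributions match, and larger values mean the live inputs have moved further from what the model saw in training.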

The Future is Driven by AI, and AI is Fueled by All Data

AI success depends as much on data architecture as on model/agent capability. The lakehouse provides that foundation, unifying analytical, operational, and AI workloads on a single, governed data plane. When built on open standards, it ensures that data, metadata, and models can interoperate across tools, clouds, and teams without friction.

Together, Cloudera AI Workbench, AI Inference Service, and the integrated AI Registry complete the data-to-AI lifecycle on an open lakehouse foundation. Built directly on governed Iceberg tables and open metadata access, this stack ensures that every model, prompt, and agent operates on trusted, versioned data.

The future of enterprise AI will not be defined by proprietary stacks, but by open foundations that unify data, governance, and intelligence through shared standards and transparent interoperability.

To learn more about how to securely prepare, integrate, and analyze data at scale with Cloudera, check out our product demos.

Figure 1: Cloudera’s Data and AI Platform Built on Open Foundations (Apache Iceberg)

We’ll now review how the different components in Cloudera's platform (Figure 1) support teams in building ML pipelines and GenAI applications, as well as the different stages of the data and AI lifecycle—from ingest to inference—while operating as one interoperable platform. Each component is built on open standards, ensuring flexibility and interoperability across environments.

Storage: Apache Iceberg

Apache Iceberg is the open, versioned, and transactional table format that underpins Cloudera’s lakehouse architecture. Iceberg enables schema evolution, time travel, and atomic operations, allowing both analytical and AI workloads to operate consistently on the same governed data. Cloudera offers a governed and versioned foundation that ensures that every model, prompt, or retrieval task draws from a consistent and traceable view of data.

Iceberg’s native capabilities like schema evolution also align closely with how AI datasets evolve. Feature stores, training datasets, and retrieval corpora can all share the same Iceberg tables in Cloudera’s lakehouse, using snapshots to freeze consistent views for training while new data continues to flow in for inference. This eliminates the divide between analytical tables and AI-specific storage.

Ingestion: Cloudera Data in Motion

Cloudera DataFlow, built on Apache NiFi, forms the foundation for continuous data movement into the lakehouse. It enables low-latency ingestion from diverse enterprise sources—databases, APIs, IoT devices, and event logs—to support both batch and streaming workloads. Recent innovations in NiFi’s native Apache Iceberg integration now allow data to be written directly into the open lakehouse without intermediate staging. This tight coupling between NiFi and Iceberg reduces pipeline complexity and brings ingestion closer to the open table format itself.

In real-time use cases, NiFi, Apache Kafka, and Apache Flink form an event-driven ingestion fabric: NiFi orchestrates and routes data, Kafka provides durable streaming, and Flink enables real-time enrichment before persisting data into Iceberg. This design ensures that data remains both fresh and governed across all downstream consumers. This continuous flow of multimodal data is what also powers AI workloads on the lakehouse. By making real-time data continuously available in Iceberg tables under consistent governance, enterprises can feed GenAI systems with timely, domain-specific information, making RAG pipelines and agentic workflows more precise, grounded, and reliable.

Catalog: Cloudera Iceberg REST Catalog

The Cloudera Iceberg REST Catalog (based on the open REST specification) provides a centralized and interoperable metadata service that allows any third-party engine (such as Snowflake, Redshift, and Databricks) that supports the open specification to have zero-copy access to Iceberg tables. This is a key aspect for organizations, as they are not restricted to just one compute engine offered by one platform and therefore have the flexibility to choose the best compute for the task. Users can use their preferred tools while the same security and governance policies offered by Cloudera follow the data everywhere, ensuring consistency across environments.

Cloudera’s open foundations enable organizations to access 100% of their data, wherever it resides

Across industries, data teams are rethinking how to build and run systems that do more than store information: they’re looking to turn data into intelligence. Just as important, they need these systems to interoperate. AI models, feature pipelines, business intelligence (BI) reports, and batch jobs often span multiple teams and engines. Sharing data across those boundaries without copying or refactoring is now a first-order requirement.

Traditionally, organizations have relied on a two-tier architecture: data warehouses optimized for BI and reporting, and data lakes designed for large-scale AI and machine learning (ML). This separation came at a cost: complex data movement, specialized engineering, and duplicated storage across systems that rarely stayed in sync.

Cloudera’s open lakehouse architecture addresses this challenge, bringing together analytical (BI, ad-hoc queries) and AI (predictive and generative AI, or GenAI) workloads on a single, governed data foundation. With open table formats like Apache Iceberg, this unified data architecture enables organizations to bring compute to data (not the other way around) and provides the foundation for running AI workloads closer to the data. AI workloads on the lakehouse can operate directly on governed, versioned, and high-quality data.

Cloudera is the only data and AI platform company that brings AI to data anywhere. Leveraging our proven open-source foundation, we deliver a consistent cloud experience that converges public clouds, data centers, and the edge.

The Importance of Open Foundations for Running AI Workloads

Over the last decade, enterprises have learned that performance and scalability alone are not enough, and that flexibility and interoperability determine long-term success. AI workloads, in particular, depend on the ability to use disparate data sources, frameworks, and tools without being constrained by proprietary formats or systems.

That’s where open table formats like Apache Iceberg have reshaped the architecture of data platforms. Iceberg separates the logical definition of a table from its physical storage layout, allowing multiple engines and frameworks to read and write the same data with full transactional guarantees. This openness makes it possible to evolve infrastructure and adopt new compute engines without rewriting pipelines.
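One mechanism behind this separation is that Iceberg tracks columns by stable field IDs in table metadata rather than by name or position. The sketch below is a simplified model of that idea, not the Iceberg file format: the dictionaries, field IDs, and column names are made up for illustration.

```python
# Data files store values keyed by stable field IDs, never by column name.
data_file = [
    {1: "acct-001", 2: 250.0},
    {1: "acct-002", 2: 75.5},
]

# Table metadata maps the current column names to those field IDs.
schema_v1 = {"account": 1, "amount": 2}


def read(rows, schema):
    """Project stored rows through a schema, yielding name-keyed records."""
    return [
        {name: row.get(field_id) for name, field_id in schema.items()}
        for row in rows
    ]


# Renaming a column and adding a new one only changes metadata;
# the existing data file is left untouched.
schema_v2 = {"account_id": 1, "amount": 2, "currency": 3}

before = read(data_file, schema_v1)
after = read(data_file, schema_v2)
```

Because old files are read through the new schema, a renamed column keeps its values and a newly added column reads as missing (`None` here) for rows written before the change.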

Running production-grade pipelines requires a unified platform that can connect data, models, and governance across every stage of the AI lifecycle. At the core, there are data and feature engineering pipelines that continuously transform raw structured, semi-structured, and unstructured data into AI-ready features, maintaining lineage and reproducibility for model training and evaluation.

Beyond traditional ML, GenAI introduces new operational requirements. Teams need infrastructure and access to data for retrieval-augmented generation (RAG), fine-tuning large language models (LLMs) on private data, and building agentic workflows that combine models, prompts, and tools exposed through APIs or the Model Context Protocol (MCP) to solve domain-specific tasks. These workloads rely on both tabular and unstructured data (text, documents, images, and embeddings), all governed under a single data and metadata plane. Additionally, a scalable inference layer is essential to deploy and serve these models securely and efficiently.

As AI workloads become increasingly multi-modal and agentic, access to catalogs and metadata becomes just as critical. AI pipelines, retrieval systems, and autonomous agents all rely on metadata to discover datasets, reproduce training states, and maintain lineage. An open catalog provides a universal way for these systems to query, register, and track datasets, regardless of where or how they are processed.

Cloudera’s open foundation enables organizations to support the complete spectrum of analytical, predictive, and GenAI workloads.

Cloudera’s Unified Data and AI Platform

Cloudera’s open data lakehouse unifies data engineering, analytics, and AI on the same governed architecture by building on open foundations like Apache Iceberg and REST catalog. The platform is designed around the principle that workloads (whether analytical or AI) should operate where the data already lives. By eliminating the friction of moving or duplicating data, teams can build continuous pipelines that span ingestion, transformation, analytics, and model operations with full lineage and governance.

Figure 2: Cloudera’s Iceberg REST Catalog Enables Interoperability with Third-Party Engines

This catalog layer is critical for feature engineering pipelines, agentic workflows, and retrieval systems to locate and access governed datasets dynamically. AI agents can query Iceberg tables using the REST Catalog just like a knowledge graph of enterprise data. They can discover available tables, interpret their schemas, and reason over table metadata, such as partitioning, snapshots, and lineage to determine which datasets to use.
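The discovery flow described above maps onto a few read-only routes in the open Iceberg REST catalog specification: list namespaces, list tables in a namespace, and load a table's metadata. The sketch below only builds those URLs and makes no network calls. The base URL, the `sales` namespace, and the `orders` table are hypothetical, and a real deployment may require a catalog-specific prefix and authentication.

```python
from urllib.parse import quote


def catalog_url(base_url, *segments, prefix=""):
    """Build a v1 REST catalog URL from path segments, percent-encoding each one."""
    parts = ["v1"] + ([prefix] if prefix else []) + list(segments)
    return base_url.rstrip("/") + "/" + "/".join(quote(p, safe="") for p in parts)


BASE = "https://catalog.example.com/api"  # hypothetical endpoint

# Step 1: discover which namespaces exist.
list_namespaces = catalog_url(BASE, "namespaces")

# Step 2: discover the tables inside one namespace.
list_tables = catalog_url(BASE, "namespaces", "sales", "tables")

# Step 3: load one table; the response carries its metadata
# (schemas, partition specs, snapshots) for the agent to inspect.
load_table = catalog_url(BASE, "namespaces", "sales", "tables", "orders")
```

An agent or engine would issue authenticated GET requests against these URLs and then decide which dataset to read from the returned metadata.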

Security and Governance: Cloudera SDX

Cloudera Shared Data Experience (SDX) is the unified security and governance framework that spans every service, from ingestion to inference. SDX provides a single, consistent layer for data lineage, auditing, access control, and policy enforcement, ensuring that every workload inherits the same security model regardless of where it runs. It integrates with enterprise identity systems (LDAP, SSO, OAuth) and supports fine-grained, role- and attribute-based access controls across structured and unstructured data.

By coupling SDX with the open lakehouse foundation, Cloudera ensures that data, models, and AI agents operate within the same governed boundary—delivering transparency, reproducibility, and trust for both analytical and GenAI workloads.
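To make the combination of role-based and attribute-based controls concrete, here is a minimal sketch of such a check. It is not the SDX API: `Policy`, `is_allowed`, and the user and resource names are hypothetical, and a real policy engine would cover far more (masking, row filters, auditing).

```python
from dataclasses import dataclass


@dataclass
class Policy:
    resource: str             # e.g. a table name
    allowed_roles: frozenset  # role-based condition
    required_attrs: dict      # attribute-based condition, e.g. {"region": "EU"}


def is_allowed(user, resource, policies):
    """Grant access only if some policy matches on role AND on every attribute."""
    for policy in policies:
        if policy.resource != resource:
            continue
        has_role = bool(policy.allowed_roles & set(user["roles"]))
        attrs_ok = all(
            user["attrs"].get(key) == value
            for key, value in policy.required_attrs.items()
        )
        if has_role and attrs_ok:
            return True
    return False  # default deny when no policy matches


policies = [Policy("claims", frozenset({"analyst"}), {"region": "EU"})]
alice = {"roles": ["analyst"], "attrs": {"region": "EU"}}
bob = {"roles": ["analyst"], "attrs": {"region": "US"}}

alice_can_read = is_allowed(alice, "claims", policies)
bob_can_read = is_allowed(bob, "claims", policies)
```

Evaluating the same policy list for every engine and workload is what lets one definition apply consistently, instead of re-implementing rules per tool.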

Cloudera Data and AI Services

The unified services layer brings together all the functional capabilities that teams need to transform, analyze, and operationalize AI, all while working on the same governed data.

Data Engineering

Cloudera Data Engineering, built on open-source Apache Spark and Apache Airflow, provides a serverless service for building, orchestrating, and scaling data pipelines directly on Iceberg tables—enabling reliable, reproducible ETL and feature pipelines for analytics and AI workloads across hybrid environments.

AI Services

The Cloudera AI services layer operationalizes the full lifecycle of AI, starting from model training and fine-tuning to secure deployment—all running natively on the same governed data foundation with Iceberg. It unifies model development, registry, and inference into a single workflow that bridges data engineering and AI operations.

How Clouderans Give Back During The Season of Thanks

Ashton Stockstill — Thu, 20 Nov 2025 14:00:00 UTC

Our annual Week of Giving is a dedicated time for our global Cloudera community to come together to live out our values, collaborate, and make a positive impact on the world. This year, our theme is “A Season of Thanks, A Week of Giving.” In essence, the week is about far more than just giving back. It’s a time for us all to reflect on the things we are thankful for while embracing opportunities for service through independent volunteering, company events, and donations to local community organizations.

It’s always fulfilling and impactful for Clouderans to get out and make a difference for causes that matter to them. As we wrap up this week, Clouderans have been participating in events around the world. This year, our teams got involved in a variety of efforts that included creating youth mental health toolkits, playing Bingo with seniors, donating coats through One Warm Coat, joining Christmas light workshops, and volunteering with the World Central Kitchen.

Our people are the heart of our Cloudera Cares program and Week of Giving. Their dedication, passion, and time are what make these events so special. With that, let’s hear from Clouderans about what they find special during this time of year.

What makes you most thankful to work at Cloudera?

“I'm thankful that Cloudera empowers its employees to take the lead on supporting causes we're passionate about. It is wonderful to have access to a platform like Benevity, where I can give to causes I believe in and the Company matches those funds, doubling my impact.” – TJ Sundar, Private Cloud Field Specialist (EMEA)

“I'm thankful for the opportunities to be in service to the community that I live in and where I grew up, alongside my fellow Clouderans. I've volunteered for many virtual and in-person Cloudera Cares events, and I always leave feeling more connected to my colleagues.” – Renee Castro, Learning & Enrichment Partner (AMER)

“For me, it’s the people and the culture. I’m constantly grateful to be surrounded by such a talented, smart, and driven group of individuals. There’s a genuine spirit of collaboration, and I feel like I learn something new from my colleagues every single day. The culture here truly encourages growth, curiosity, and supporting one another, which makes coming to work inspiring and rewarding.” – Laura Hughes, Director, R&D Operations & Programs (EMEA)

What makes giving back and volunteering important to you?

“Volunteering for Second Harvest, handing out food and essentials for people reminds me not to take things for granted. It is incredibly meaningful for me to see how thankful the recipients are.” – Westley Chan, Sr. Manager, Business Applications (AMER)

“Giving back is important to me as it reminds me that my actions, no matter how small, can create a meaningful change. Volunteering allows me to help others and impact people's lives. I find this extremely rewarding.” – Deepa Pednekar, Senior Practice Manager (EMEA)

“It’s important to me because I want to lead by example for my child. We take so much from the world, and volunteering gives me a chance to give something back.” – Asha Mohan Chandran, Learning & Enrichment Partner (APAC)

Why is Week of Giving such an important part of the employee experience at Cloudera?

“This is a time to strengthen our culture and remember that, together, we can achieve remarkable things.” – Marcus Fig, Cloud Sales Specialist (AMER)

“It’s one thing for a company to have values; it’s another to live them. Events like this are the 'living' part. They're a core part of our experience because they show that Cloudera Cares is more than just a program. It's an action. Week of Giving is a unique opportunity to bond with colleagues. It's a chance to collaborate with people in a different way than you normally would, whether you’re decorating smiley-face goodie bags or sharing Halloween-themed baked goods. These interactions build our culture and strengthen our relationships.” – TJ Sundar

“I've participated in volunteering events at previous companies that felt like we were just "checking the box." Week of Giving—in addition to all other volunteering events I've participated in at Cloudera—feels full of intention and attracts individuals who truly care about the communities served. From those who run the events to the volunteers themselves, we are doing more than just "checking the box" - we truly care about impact.” – Renee Castro

What have you learned from your time volunteering, both with colleagues and in your community?

“Teamwork and organization. Special thanks and shoutouts to the folks organizing these volunteer events to bring people together and help the community. Every little bit helps!” – Westley Chan

“Volunteering, both with colleagues and in the community, has taught me the power of collaboration and empathy. I’ve learned that even small actions can have a meaningful impact, and that working together toward a common goal strengthens connections and builds a sense of shared purpose. It’s inspiring to see the difference we can make when we combine our skills, time, and energy to help others.” – Laura Hughes

“I've learned the power of collaboration, empathy and giving back to society. I've found that these events help strengthen relationships beyond the workplace. It's taught me humility and just realizing that meaningful change starts small with consistent efforts. You never know - your little contributions can bring big smiles across many individuals and communities.” – Deepa Pednekar

Continuing Cloudera’s Commitment to Our Global Community

As we celebrate another Week of Giving, we want to thank everyone who participated in this year’s activities. It has been so rewarding to see Clouderans from across our global offices volunteer and give back to their communities.

Learn more about how Clouderans are helping shape the communities that we all call home.

Trino: The Federation Engine Powering Your Unified Data Fabric

Katie Gdula — Thu, 20 Nov 2025 05:00:00 UTC

Connect, manage, and govern data across hybrid and multi-cloud environments

In today’s data landscape, organizations often grapple with massive, distributed data estates spanning multiple clouds and on-premises systems. This complexity leads to data silos and costly, time-consuming data movement for analysis.

A unified data fabric addresses this challenge by providing an architectural layer that automates and orchestrates data discovery, access, and management across distributed, hybrid environments. It connects data, without data movement, from any source, applies consistent governance, and delivers unified access for analytics, AI, and real-time decision-making.

Trino, an open-source distributed SQL query engine, is a key component of Cloudera’s data fabric. It enables big data analytics and data engineering by running interactive queries and batch processing across vast amounts of data, without requiring unnecessary data movement or storage format conversions. Trino can, in a single query, collate data from multiple sources, including data lakes, and run federated queries across these disparate systems.

High-Level Use Cases for Trino

Trino is versatile, supporting a diverse array of use cases–from high-speed, ad-hoc analytics to complex batch processes.

Centralized Data Access and Analytics with Query Federation

Query federation is a core strength of Trino. It provides the ability to query many disparate data sources within the same system using a single SQL query. This capability dramatically simplifies analytics for users who need a comprehensive view of all their data. Trino's architecture is designed for diverse connectivity, allowing it to federate across dozens of heterogeneous sources. A key feature is zero-copy data, which eliminates the need for expensive, and sometimes risky, data movement or replication.
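In Trino, each connected source is exposed as a catalog, and tables are addressed as `catalog.schema.table`, so a federated query is ordinary SQL that joins tables from different catalogs. The sketch below only composes such a statement as a string. The catalog, schema, table, and column names are hypothetical.

```python
def qualified(catalog, schema, table):
    """Return a fully qualified Trino table name: catalog.schema.table."""
    return f"{catalog}.{schema}.{table}"


orders = qualified("lakehouse", "sales", "orders")      # hypothetical Iceberg catalog
customers = qualified("crm_pg", "public", "customers")  # hypothetical PostgreSQL catalog

# One statement joins a lakehouse table with an operational database table.
federated_sql = f"""
SELECT c.region, SUM(o.amount) AS revenue
FROM {orders} AS o
JOIN {customers} AS c ON o.customer_id = c.id
GROUP BY c.region
""".strip()
```

The resulting statement could be submitted through any Trino client; the engine reads each side through its own connector, so neither table has to be copied beforehand.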

Interactive and High-Performance Data Analytics

Trino is primarily driven by interactive analytics. It’s built from the ground up for efficient, low-latency query performance. Data analysts and data scientists can query large amounts of data, run hypotheses, conduct A/B testing, and build visualizations or dashboards directly. Trino is designed to be so performant that it enables analytics that were previously impossible or took hours to complete.

Batch ETL Processing Across Disparate Systems

While interactive analysis is key, Trino also accelerates large extract, transform, load (ETL) processes that typically run in batches and are resource-intensive. Engineers can speed up ETL processes using standard SQL statements, avoiding more complex, error-prone, and hard-to-maintain code-based ETL processes that work with a range of data sources and targets.

Cloudera with Trino: A Unified Data Fabric is the Pathway to Agentic AI Anywhere

Cloudera's integration of Trino addresses the needs of organizations with large, heterogeneous data estates, preparing organizations for the future of data: agentic AI. And a unified data fabric is the foundation for trusted AI.

The key differentiators of the Cloudera + Trino integration are hybrid and multi-cloud deployment, low-latency performance that brings real-time reasoning for agentic AI directly into business flows, unified governance and security, and a focused experience with AI automation.

Hybrid and Multi-Cloud Deployment: AI Everywhere

Cloudera provides an anywhere cloud experience with a data and AI platform that allows customers to run the identical software stack and unified control plane across public clouds, private clouds, and on-premises data centers. This is a decisive advantage for organizations concerned with data sovereignty and regulatory requirements.

Trino on Cloudera is optimized for on-premises and cloud environments and can be deployed to federate data across systems using certified connectors. Unlike cloud-native, SaaS-only architectures, Cloudera's hybrid approach is essential for regulated industries, like banking and government, whose operational data cannot be moved to a public cloud vendor’s SaaS platform.

Low-Latency Performance for Agentic AI

Cloudera leverages Trino's architecture to enable operational AI—the application of AI/ML models to live, real-time business processes—key to anyone pursuing agentic AI. Trino’s architecture is massively parallel processing (MPP), in-memory, and pipelined, allowing for sub-second to few-second performance. For interactive analytics workloads, Trino can be 2 to 30 times faster than Apache Spark. Data scientists can embed real-time model inference logic directly into a low-latency, federated Trino query, combining fast federated access with the power of Python AI/ML for true operational AI and agentic workflows.

Unified Governance and Security

For enterprise adoption, centralized governance is paramount. Trino is integrated with Cloudera Shared Data Experience (SDX), ensuring consistent security and management. This added layer of security ensures that all metadata and access controls are unified to simplify management and self-service access. Cloudera delivers a single endpoint to access all data across various engines, including Trino, without needing to replicate access and security policies.

Focused Experience and AI Automation

Cloudera enhances the user experience for administrators and practitioners, driving efficiency and democratizing access to data. Teams benefit from automated warehouse management, natural language access, and simplified administration through guided federation connector setup and a true hybrid deployment model–simplifying data architecture and empowering zero-copy analytics with no ETL burden.

With Trino, Cloudera delivers a "govern once, access everywhere" solution, providing a secure, high-performance query engine that runs identically across your hybrid, multi-cloud estate–a necessity for mastering the complexity of modern enterprise data and enabling real-time AI workflows.

Next Steps: Building a Unified Data Fabric with Cloudera and Trino

Cloudera’s unified data fabric enables organizations to govern every dataset, track every lineage, and trust every prediction, ensuring responsible AI that aligns with enterprise and regulatory standards. Trino extends the value of Cloudera’s data fabric by centralizing data access, performing interactive and high-performance analytics, and running batch processing across disparate systems.

To learn more about how Cloudera with Trino can transform your analytics and AI experience, schedule a virtual demo.

Cloudera was recently named a Leader in The Forrester Wave™: Data Fabric Platforms, Q4 2025. Access the report to understand the trends shaping data fabric architectures—and how we believe Cloudera continues to lead the way.

Inside the Third Wave of Data and AI

Cloudera — Fri, 14 Nov 2025 13:00:00 UTC

From the rise of the internet to the explosion of cloud computing, every major technological era has reshaped how we use—and create—data. Now, according to Cloudera Chief Technology Officer Sergio Gago, we’re entering a third phase of big data focused on convergence.

He recently joined The AI Forecast podcast to discuss how the convergence of cloud and on-premises systems is setting the stage for a new generation of private AI—where enterprises can fully control their data, models, and AI life cycles.

Here are the key takeaways from the conversation.

The Convergence of Cloud and On-Prem—and Why It Enables Private AI

Paul: Let’s talk about your vision. What does the third wave of big data mean to you, and why is it so important?

Sergio: We started with the era of control. Many companies had their own data centers that gave them control of their data. Then the cloud came in and we entered what we call the era of convenience. So, you had teams with a credit card that could go into any hyperscaler and start playing with data either for machine learning or for building dashboards. It was so easy that it brought shadow IT into many enterprises, which made controlling cost, TCO, and data governance growing challenges.

That was the story of cloud and data. Now today, you kick a rock and there are hundreds of engines, databases, and options. We talk about Frankenstein architectures now, where companies have dozens—if not hundreds—of components and are struggling to bring them together. The era of convenience brought this complexity.

Now fast forward with the advent of AI and AI agents and the regulation and compliance requirements for many enterprises and startups alike. To comply, organizations need to bring all the controls of the first era back, especially in large enterprises. All that is forcing companies and individuals to converge and manage both worlds (the data center and the cloud) to have the control and governance of the data center with the convenience of the cloud. That’s why we call the Third Wave the era of convergence.

Private AI: Full-Lifecycle Control and the Human Advantage

Paul: I wanted to talk to you about the private AI component. With private data, I have a tremendous competitive advantage. How does private AI help me tap into that?

Sergio: Private AI is the ability to control the full life cycle of your AI applications. What models do you use? How do you deploy them? Which ones are approved from a compliance perspective? How do you make sure the model weights stay constant for as long as you need? Then you have data from your company that lives both in the cloud and in the data center. You need to safely bring that data into your model—either for training, fine-tuning, or other techniques like RAG. That’s what makes your model unique to you.

The competitive advantage of most companies today is the data, but also the skills—the human capacity to drive insights. It’s not necessarily the data itself but the experience and domain knowledge that allow you to interpret it. Private AI helps you preserve that advantage by controlling everything from model lifecycle to prompt management, lineage, and benchmarking so you can move from proof of concept to true production workloads.

Build for ROI and Risk—With Agents, Governance, and Culture in the Loop

Paul: When we talk about topics like convergence, we sometimes run the risk of alienating businesspeople who'll see this as more of a CTO-type of discussion, a technical discussion. From your perspective, what does something like convergence do to unlock new use cases or business value that you couldn't get before as a CEO or business leader?

Sergio: I think that the CEO will always want to understand the actual value of a tool, either in terms of ROI or cost reduction, or value improvement for your company. GenAI is just the conveyor belt for all those things.

At the same time, the second angle every CEO has front and center is risk—either from FOMO or from fear of becoming the next company in the headlines due to a massive AI hallucination. Those are the two sides of the scale that CEOs are working with.

GenAI use cases need to start from the business side. Bring in compliance, governance, IT, cybersecurity, and legal from the very beginning so that it doesn’t become an experiment in the garage that then doesn’t go anywhere. Showing value in those terms allows you to then take them to the enterprise.

Catch the full conversation with Sergio Gago on The AI Forecast on Spotify, Apple Podcasts, and YouTube.

A 5-Step Framework To Streamline Your Post-Merger Data Strategy

Andreas Skouloudis — Thu, 13 Nov 2025 18:00:00 UTC

Inorganic growth strategies, such as mergers and acquisitions (M&A), serve as strategic growth levers, enabling companies to realize revenue and cost synergies or to rapidly acquire emerging capabilities that will deliver long-term competitive advantage. Today, for instance, we observe major organizations acquiring smaller, innovative AI start-ups to accelerate their AI transformation efforts and gain a competitive edge.

Technology integration plays a crucial role in value capture from M&As. A Deloitte study argues that IT is a key driver of integration benefits, accounting for more than 50% of all synergies. However, due to the proliferation of data silos and varying technology architectures and environments, organizations face several post-merger data challenges in realizing technology integration benefits.

This article introduces a five-step framework to address those challenges and accelerate value capture in M&A settings. This framework will ensure your post-merger data strategy with Cloudera delivers the capabilities needed to streamline the technology integration process.

Figure 1: Post-Merger Data Integration Framework with Cloudera

1. Accelerate Post-Merger Integration with Cloudera Octopai Data Lineage

At the start of post-merger integration, the data discovery phase frequently becomes a bottleneck, since fragmented and undocumented sources delay critical analytics and compliance efforts. Cloudera Octopai Data Lineage addresses this challenge by providing an automated, AI-powered metadata management solution that accelerates data discovery, end-to-end lineage, and cataloging across complex hybrid and multi-cloud environments.

Cloudera Octopai Data Lineage effectively maps data flows and fills metadata gaps, providing multi-dimensional lineage that traces origins and transformations for complete visibility. With more than 60 native integrations and universal connectors for non-native systems, Cloudera Octopai Data Lineage streamlines the onboarding of acquired data estates, thereby improving data transparency, quality, and trust.

For example, in banking merger scenarios, this capability facilitates rapid identification and tagging of risk-related datasets, ensuring compliance with regulatory standards such as BCBS 239, while minimizing the need for extensive manual audits or intervention.
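Tracing a dataset back to its origins is, at its core, a graph traversal over lineage edges. The sketch below shows that idea with a breadth-first search. It is not the Cloudera Octopai Data Lineage API, and the dataset names and edges are invented for illustration.

```python
from collections import deque

# Hypothetical lineage edges: each dataset maps to the datasets it is derived from.
upstream = {
    "risk_report": ["exposure_agg"],
    "exposure_agg": ["trades_clean", "counterparty_dim"],
    "trades_clean": ["trades_raw"],
    "counterparty_dim": ["crm_export"],
}


def trace_origins(dataset, upstream):
    """Return the source datasets (those with no upstream) that feed `dataset`."""
    seen, queue, origins = {dataset}, deque([dataset]), set()
    while queue:
        node = queue.popleft()
        parents = upstream.get(node, [])
        if not parents and node != dataset:
            origins.add(node)  # nothing further upstream: this is an origin
        for parent in parents:
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return origins


origins = trace_origins("risk_report", upstream)
```

In a merger scenario, the same traversal run over the combined lineage graph would show which acquired source systems feed a given regulatory report.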

2. Integrate Disparate Data Sources with Cloudera Data In Motion

Integrating diverse data sources and eliminating complex, custom ETL pipelines is a critical post-merger challenge. Cloudera delivers robust capabilities for batch and real-time data ingestion, processing, and data distribution through Cloudera Data Flow (powered by Apache NiFi) and Cloudera Streaming (powered by Apache Kafka and Apache Flink).

With more than 450 connectors, Cloudera Data Flow provides a visual, drag-and-drop interface to ingest data from a variety of heterogeneous data sources, whether on-premises, in the cloud, or at the edge. In addition, Cloudera Streaming provides a messaging bus architecture that decouples the source systems of the two entities from their consuming systems, thereby eliminating point-to-point integrations that add architectural complexity and cost.

During post-merger integration, these capabilities can significantly accelerate and simplify data movement between organizations. For instance, Cloudera Data Flow can be used to quickly integrate on-premises data from legacy source systems of the acquired company into the cloud-native data warehouse of the parent company, expediting decision-making.
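The cost of point-to-point integration can be seen with simple arithmetic. Assuming the worst case in which every source must reach every consumer, direct wiring needs one integration per pair, while a shared bus needs one connection per system. The system counts below are hypothetical.

```python
def point_to_point(sources, consumers):
    """Every source wired directly to every consumer."""
    return sources * consumers


def via_bus(sources, consumers):
    """Each system connects once to a shared messaging bus."""
    return sources + consumers


sources, consumers = 12, 8  # hypothetical counts across the two merging entities

direct = point_to_point(sources, consumers)  # 12 * 8 = 96 integrations
bus = via_bus(sources, consumers)            # 12 + 8 = 20 connections
```

The gap widens as the merged estate grows, since the direct approach scales with the product of the two counts and the bus approach with their sum.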

3. Build a Secure Data Sharing Layer on Cloudera Open Data Lakehouse with Apache Iceberg

Data sharing between merging entities is an essential requirement for integrated decision-making and deriving insights. This process can be complex due to the diverse exploratory analytics and business intelligence technologies, as well as the varying data security models used by different systems.

An open data lakehouse approach that combines Apache Iceberg, the Cloudera Iceberg REST Catalog, and Cloudera Shared Data Experience (SDX) enables organizations to develop a unified data sharing layer. This layer is compatible with various analytical engines (for example, Snowflake, Databricks, AWS EMR, AWS Athena, and Salesforce Data Cloud, as long as these engines are Iceberg REST Catalog enabled) and provides a fine-grained security and governance model to manage access for a diverse range of users, including the newly integrated data science teams.

For example, two healthcare organizations engaged in drug manufacturing can leverage Cloudera to construct a GxP-compliant data lakehouse that consolidates the data assets of the merging entities while ensuring adherence to regulatory requirements.

4. Standardize Cross-Environmental Initiatives on a Single, Multi-Cloud Environment

The different environments used for analytical activities in the two merging entities lead to duplicative operations throughout the data lifecycle, including multiple data engineering pipelines for common tasks such as data ingestion and standardization.

Cloudera empowers organizations to standardize data and AI operations on a common runtime across various private and public cloud environments. This capability derives from the underlying containerized infrastructure model used across environments, a consistent user authentication and authorization mechanism (Cloudera SDX), and Cloudera Manager, which serves as the single pane of glass for managing clusters across different deployment environments and regions.

In a post-merger context, this standardization is transformative: the two companies can integrate their data lifecycle operations onto a single runtime, eliminating redundant tools and facilitating the sharing of data, insights, and AI models. This leads to reduced technology and labor costs for data operations and AI/ML model development, increased practitioner productivity, consolidation of multiple tools, and reduction of data silos.

5. Scale AI Initiatives Anywhere with Cloudera AI

Post-acquisition or merger, the immediate challenge is integrating the disparate tools, models, and data scientists from the newly acquired innovative start-up, all while managing changing capacity demands. Cloudera AI Workbench and AI Inference empower organizations to scale AI initiatives on-premises or in the cloud by:

Providing a container-based, end-to-end solution for feature engineering, model training, experimentation tracking, and model deployment

Facilitating AI model sharing that allows data scientists to collaborate among disparate teams

Leveraging hardware and software acceleration services from Cloudera partners that can speed up the entire data science lifecycle by improving data engineering performance by 20x and AI inference performance by up to 6x

With Cloudera, the integrated company can achieve substantial cost reduction by moving persistent, compute-intensive workloads such as AI/ML model serving to on-premises environments. More importantly, it can accelerate the time-to-market for new, combined AI applications. This allows the organization to rapidly realize the “competitive advantage” it sought from the M&A in the first place.

Take the Next Step to Ensure Successful Integration After Your Next Merger and Acquisition

Cloudera can accelerate the post-merger integration of data assets and analytical capabilities between the two integrating entities. Our platform offers scalability across the data lifecycle, an infrastructure-agnostic deployment model, and interoperability of the data lakehouse on Cloudera services and Apache Iceberg. This combination provides an architectural blueprint for standardizing AI/ML initiatives and data operations, and for delivering a data sharing model that can be used by both Cloudera and non-Cloudera services.

To schedule a demo or product tour, contact our team.

Cloud Migration Checklist: Getting Your Data Landscape Ready

Ron Pick — Thu, 06 Nov 2025 14:00:00 UTC

Do you know where your data is? The number of people who can pat their server and say fondly, “Right here!” is decreasing. Instead, more people are lifting their eyes to the heavens and answering, “Um… up there… somewhere…” McKinsey reports that in 2025, large enterprises have 60% of their environment in the cloud.

If you’re considering moving your data assets, processes, and applications to the cloud, you’re in good company. But if you’re dreading the move, you’re not alone there either. A data migration will inevitably strain your organization’s time, resources, and patience. But this article is here to help—a good checklist can make the process smoother so you can focus on execution.

We’ve put together a cloud migration checklist below. It’s a helpful framework that covers the points you need to address for the migration to go smoothly.

Do You Have Someone To Head The Migration?

If you can’t check this one, stop in your tracks. Do not pass go; do not go to jail; do not head to Free Parking; do not MOVE!

A revolution without a leader will quickly dissolve into chaos. A cloud migration will face the same fate. The leader of a cloud migration must have both strong technical skills and strong interpersonal skills, because personnel issues can stall or hinder a migration. Your migration leader needs to facilitate a shift in your data’s location as well as a shift of your employees’ attitudes and perspectives about data.

If you don’t have one person who can fulfill both roles, you can divide the leadership between a technical “migration architect” and an interpersonal “migration evangelist,” each responsible for the cloud migration steps in their area of expertise.

One tool that will help your migration evangelist is a data intelligence platform with a data catalog. When every employee can locate the data asset they need, no matter where it’s currently situated, resistance decreases and acceptance increases.

Do You Know What You Can Leave Behind?

Don’t move garbage. If you’re rolling your eyes and saying “Duh!”, then you haven’t been part of lift-and-shift migrations that take a legacy system and move it to a cloud environment, basically as is. If your organization has had its legacy system for more than a few years, it’s almost guaranteed to have garbage: outdated assets, defunct reports, redundant processes… all kinds of digital dust bunnies scampering around.

That’s not to say there isn’t ever a place for lift-and-shift migrations. However, if you’re trying to do the migration thing right, then spend the time sorting through what you have and deciding what’s valuable enough to be migrated, and what should stay behind.

Here an automated data lineage solution can be invaluable. In minutes to hours, automated data lineage can create a complete mapping of your legacy data landscape, revealing your data flow and interconnections. A close reading of this data lineage map will show you almost everything you need to decide what goes to the cloud and what can be relegated to the past.
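To make the idea concrete, here is a minimal sketch of how a lineage map can separate what goes to the cloud from what stays behind. It is not the output or API of any particular lineage product: the asset names, the `LINEAGE` dictionary, and the `ACTIVE_TARGETS` set are all hypothetical, and the rule used (keep only assets that feed something people still use) is just one reasonable assumption.

```python
from collections import deque

# Hypothetical lineage map: each asset points to the assets that consume it.
LINEAGE = {
    "crm_db.customers": ["etl.clean_customers"],
    "etl.clean_customers": ["dw.dim_customer"],
    "dw.dim_customer": ["bi.sales_dashboard"],
    "legacy.fax_log": ["etl.fax_rollup"],
    "etl.fax_rollup": ["bi.fax_report_2009"],
    "bi.sales_dashboard": [],
    "bi.fax_report_2009": [],
}

# Hypothetical: reports and dashboards that people still open.
ACTIVE_TARGETS = {"bi.sales_dashboard"}


def assets_worth_migrating(lineage, active_targets):
    """Return every asset that feeds, directly or indirectly, an active target."""
    # Invert the map so we can walk upstream from each active target.
    upstream = {asset: [] for asset in lineage}
    for source, consumers in lineage.items():
        for consumer in consumers:
            upstream.setdefault(consumer, []).append(source)

    keep = set(active_targets)
    queue = deque(active_targets)
    while queue:
        current = queue.popleft()
        for parent in upstream.get(current, []):
            if parent not in keep:
                keep.add(parent)
                queue.append(parent)
    return keep


def migration_plan(lineage, active_targets):
    """Split all known assets into (migrate, leave_behind), both sorted."""
    keep = assets_worth_migrating(lineage, active_targets)
    all_assets = set(lineage) | {c for cs in lineage.values() for c in cs}
    return sorted(keep), sorted(all_assets - keep)


if __name__ == "__main__":
    migrate, leave_behind = migration_plan(LINEAGE, ACTIVE_TARGETS)
    print("Migrate:", migrate)
    print("Leave behind:", leave_behind)
```

In this toy landscape, the fax log, its rollup job, and the 2009 report feed nothing that is still in use, so they land on the leave-behind list. A real review would still need a human to confirm each candidate.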

Are Your Applications Ready To Take Advantage Of The Cloud’s Benefits?

So you’ve decided what’s coming to the cloud. Fantastic! Now it’s time to look more closely at your applications and pipelines. The real financial and operational benefits of cloud migration are only achieved when your data systems architecture is designed to take advantage of the cloud’s benefits, such as:

Dynamic scaling
Distributed workloads
Serverless computing capabilities
Powerful AI and ML capabilities

Make yourself a checklist for each application that you plan to migrate. For each one, check which cloud benefits that application is poised to take advantage of in its current state. For example, if an application does not yet have the ability to run on variable servers, and you just replicate that in the cloud, it can’t utilize the cloud benefit of distributed workloads.
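If a spreadsheet feels too loose, the per-application checklist can be kept as a small data structure. The sketch below is only an illustration: the benefit names mirror the list above, while the `AppReadiness` class and the sample application are hypothetical.

```python
from dataclasses import dataclass, field

# The four cloud benefits named in the list above.
CLOUD_BENEFITS = (
    "dynamic_scaling",
    "distributed_workloads",
    "serverless",
    "ai_ml",
)


@dataclass
class AppReadiness:
    """Checklist entry for one application (hypothetical structure)."""
    name: str
    ready_for: set = field(default_factory=set)

    def missing(self):
        """Benefits the application cannot use in its current state."""
        return [b for b in CLOUD_BENEFITS if b not in self.ready_for]

    def is_cloud_ready(self):
        return not self.missing()


if __name__ == "__main__":
    billing = AppReadiness("billing", {"dynamic_scaling"})
    print(billing.name, "still needs work for:", billing.missing())
```

The `missing()` list is the input to the next question: what would it take to make this specific application cloud-ready?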

What Needs To Be Done To Make This Specific Application Cloud-Ready?

Sometimes bringing your application up to cloud speed is simple and quick. Sometimes it requires hours upon hours of development time. Possible scenarios include:

Refactoring (reconstructing the application to match cloud capabilities)
Optimizing (the tweaks needed are more minor than refactoring)

When you see the investment needed, you can then make an educated decision as to how to handle the application in question. You may decide to refactor, to optimize, or just to leave it alone for now and do a lift-and-shift, as sometimes the return on investment for refactoring or optimizing just isn’t worth it in your current situation.

Check? Check!

Data migrations aren’t easy or enviable, but with a detailed cloud database migration checklist to guide you, they can at least feel a little more manageable. Ready to bring your data landscape up to speed? Check!

For tips on how to reduce cloud costs once your migration is complete, check out this blog next: 3 Steps to Cutting Cloud Costs with Data Lineage.

Cloudera Named a Leader in the 2025 Forrester Wave for Data Fabric Platforms

Wim Stoop — Wed, 05 Nov 2025 14:00:00 UTC

We’re thrilled to share that Cloudera has been named a Leader in the 2025 Forrester Wave for Data Fabric Platforms. This recognition underscores our commitment to helping organizations unify, secure, and activate their data across hybrid and multi-cloud environments.

In this blog, we cover what a data fabric is and why it matters, what sets Cloudera apart, the key capabilities of the Cloudera platform that resulted in this position as a Leader, and why all of this matters for Cloudera customers.

What is a Data Fabric?

In a world where data is more distributed than ever, enterprises need a way to connect the dots across silos—from on-premises systems to public clouds and everywhere in between. That’s exactly what a data fabric enables.

A data fabric is an architectural approach that connects, manages, and governs data across hybrid and multi-cloud environments. This approach allows data to be accessed and used anywhere, by anyone, securely and efficiently. Instead of forcing organizations to move all their data into a single system, a data fabric creates a virtual, unified layer that integrates data from multiple sources (clouds, on-premises, streaming, and edge) into one consistent framework. It provides end-to-end visibility, lineage, governance, and access, so teams can find, trust, and use the right data in real time.

Why Data Fabric Matters Now

As organizations accelerate AI adoption and cloud transformation, they face a common challenge: data fragmentation. Data lives across multiple clouds, legacy systems, and on-premises environments—making it difficult to govern, secure, and operationalize for business impact.

A data fabric addresses this by providing an architectural layer that automates and orchestrates data management across distributed environments. It connects data from any source, applies consistent governance, and delivers unified access for analytics, AI, and real-time decision-making.

Forrester’s evaluation of the top data fabric vendors highlights the importance of this capability as enterprises seek to implement data and AI initiatives securely and at scale and, in our opinion, Cloudera’s position as a leader in making it a reality.

According to the Forrester report, “[Cloudera’s] focus on private cloud and on-premises deployments gives it a stronghold in industries with data sovereignty or legacy system requirements.” This long-standing foundation, combined with our open hybrid cloud strategy, has helped our customers modernize data architectures without compromising control or governance.

Key Capabilities for Open Data Fabric: Where Cloudera Scored Highest

In our opinion, receiving a 5/5 score from Forrester reflects more than product maturity: it signals leadership, customer validation, and measurable differentiation. In the 2025 Forrester Wave for Data Fabric Platforms, Cloudera received the highest scores possible (5/5) in seven criteria:

End-to-End Integrated Fabric
Unified Data Catalog
Real-Time Performance and Scalability
Metadata Management
Agentic AI
Vision
Roadmap

For End-to-End Integrated Fabric, Forrester defines a score of 5 as delivering advanced data management through a comprehensive, unified management portal that spans distributed environments, with integrated metadata, governance, and policies. It also recognizes the vendor as a leading contributor to key open-source fabric components.

In Unified Data Catalog, a 5/5 score indicates the vendor provides superior support for features such as a unified and automated data catalog across multiple data fabrics, AI-powered discovery, classification and enrichment of metadata, full customization, native integration with third-party catalogs, and the ability for business users to leverage the catalog with full capabilities.

Achieving a 5/5 score in Real-Time Performance and Scalability indicates the vendor provides superior support for features such as certified hardware integration with NVIDIA GPUs, integration with SIMD, automated advanced AI/ML query tuning, automated tiered storage, automatic adding and dropping of resources, AI-enabled intelligent workload management, advanced horizontal scale-out, dynamic sharding and balancing, and automated scale-up and scale-down.

In Metadata Management, Forrester looks for advanced automation, including end-to-end metadata discovery, tagging, and classification, AI automation (such as automated tagging of sensitive data), comprehensive integrated metadata across distributed fabrics, and integrated support for the data product lifecycle. Cloudera’s acquisition of Octopai enhances these capabilities by delivering deep lineage and metadata intelligence across hybrid environments, supporting the full lifecycle of governed data products.

The Agentic AI criterion recognizes vendors that embed autonomous AI agents to support data fabric. To earn a 5, platforms must demonstrate AI agents that automate integration, governance, and discovery, operating collaboratively and contextually.

Forrester’s 5/5 score in the Vision and Roadmap criteria is reserved for vendors whose strategy anticipates customer needs and shapes the direction of the market, along with evidence of execution. Cloudera’s clear, open, and hybrid approach — bridging data, analytics, and AI across any environment — demonstrates a bold and differentiated vision that continues to lead the industry forward. Our investments in intelligent automation, interoperability, and agentic AI illustrate forward momentum in the roadmap.

Together, these 5/5 scores affirm Cloudera’s position as a trusted, future-ready data platform that unifies data, analytics, and AI across any cloud or infrastructure. The 5/5 scores in the metadata management and agentic AI criteria demonstrate how our data fabric continues to evolve to meet the needs of modern, AI-driven enterprises.

What Sets Cloudera Apart: A Strategy Built for the Future of Data and AI

At Cloudera, our mission is to make data and AI work together, seamlessly and securely, across any environment. Cloudera believes it stood out in Forrester’s evaluation for its open, hybrid-by-design architecture that enables enterprises to manage data seamlessly across on-premises and multi-cloud environments, powered by open standards and open source innovation.

The Forrester report notes that “Cloudera’s data fabric strategy tackles the challenge of fragmented data, aiming to deliver integrated governance, visibility, and secure access across hybrid and multicloud environments.”

Key elements of our strategy include:

Integrated governance and visibility: Cloudera Shared Data Experience (SDX) ensures that policies for access, lineage, and compliance are applied consistently across all workloads. This unified approach brings consistency and transparency across all data assets.

Metadata intelligence and lineage: Cloudera Octopai Data Lineage enables end-to-end lineage, impact analysis, and automated metadata management.

Open architecture and interoperability: Cloudera’s AI-ready architecture brings together advanced analytics, machine learning, and real-time streaming to help organizations transform raw data into actionable insight faster. It is designed to work seamlessly with non-Cloudera engines, supporting flexibility and avoiding lock-in.

Intelligent automation: Our roadmap invests in agentic AI, automation, and intelligent data fabric capabilities to optimize workloads and deliver adaptive data experiences.

Trusted and proven: Our platform is proven at global scale and trusted by leading banks, telecommunications providers, and public sector organizations to power some of the world’s most data-intensive and mission-critical operations with reliability and confidence.

These advancements underscore our commitment to helping enterprises simplify complexity, ensure trust, and accelerate innovation as data becomes the foundation of AI-driven transformation.

Why Customers Should Care: Data Fabric is the Foundation for Trusted AI

As enterprises scale their AI initiatives, the importance of a unified, governed data layer cannot be overstated. AI models are only as good as the data they’re built on, and that data must be accessible, high-quality, and compliant. Once you have trusted, governed data from anywhere, you can power trusted AI everywhere.

Cloudera’s data fabric enables organizations to govern every dataset, track every lineage, and trust every prediction, ensuring responsible AI that aligns with enterprise and regulatory standards. Cloudera’s open data lakehouse extends the value of the data fabric by enabling secure analytics, machine learning, and AI on unified, high-quality data.

Together, the Cloudera Unified Data Fabric and Cloudera Open Data Lakehouse form the foundation for a modern enterprise data strategy, one that brings intelligence to every workload, user, and business decision. With Cloudera, organizations don’t just unify data; they unlock its full potential to drive innovation, resilience, and responsible AI at scale.

See the Full Evaluation

We invite you to read The Forrester Wave™: Data Fabric Platforms, Q4 2025 to see how vendors stack up and why Cloudera was named a Leader. Access the report to understand the trends shaping data fabric architectures—and how we believe Cloudera continues to lead the way.

Forrester does not endorse any company, product, brand, or service included in its research publications and does not advise any person to select the products or services of any company or brand based on the ratings included in such publications. Information is based on the best available resources. Opinions reflect judgment at the time and are subject to change. For more information, read about Forrester’s objectivity here.

The Inevitable Outage: Why Your Hybrid Strategy Needs Multi-Cloud Resilience

Blake Tow — Wed, 29 Oct 2025 13:00:00 UTC

The recent global IT outage experienced by a major cloud hyperscaler was a disruptive, real-world reminder that downtime and service disruptions are inevitable. The event impacted services across banking, retail, and healthcare, and served as a powerful warning that relying on any single provider, or even a single cloud region, creates a critical business vulnerability.

This outage highlights the critical risk of a single-provider strategy, rather than an inherent problem with the cloud. It’s the clearest example yet of why a hybrid cloud strategy—one that gives you the freedom to move data and AI workloads between clouds and data centers—must include multi-cloud capabilities.

This is why Cloudera's "anywhere cloud" approach is the clear choice for organizations looking to ensure business continuity. When we say "data anywhere," we mean it: in your data centers, at the edge, and across multiple public clouds.

Hybrid is the Foundation for Freedom

For years, Cloudera has championed a hybrid cloud strategy as the foundation for enterprise freedom. We believe that you should have the flexibility to run your data and AI workloads where it makes the most sense for your business—whether that’s in your own data centers, in a public cloud, or at the edge—and the choice to move them as needed depending on changing business imperatives.

The goal of hybrid is to deliver a consistent cloud experience anywhere, giving you the agility and scalability of the public cloud while maintaining the security and control of your private cloud. This approach is designed to give enterprises the freedom to move data and AI workloads between clouds and data centers, without friction or vendor lock-in. This freedom from infrastructure lock-in is the core of a resilient architecture.

The Key: Hybrid Includes Multi-Cloud for Resilience

While this hybrid foundation provides crucial freedom and choice, the recent outage exposed a critical blind spot in many hybrid strategies: if your architecture simply connects your data center to a single public cloud provider, you’re still dangerously exposed. You’ve merely swapped one single point of failure for another.

As we discussed in our last post, true resilience is about eliminating single points of failure. A modern hybrid strategy, therefore, must be a multi-cloud strategy. Achieving true business continuity means having the freedom to “failover anywhere.” This capability must go beyond a simple on-premises-to-cloud connection to include failover between cloud regions, back to your data center, and critically, from one cloud provider to another.

How Cloudera's Cloud Anywhere Platform Makes This a Reality

On paper, a multi-cloud failover strategy is the obvious answer. In reality, it’s incredibly complex. Different cloud providers have different APIs, data services, and security models. For most organizations, moving a mission-critical data workload from one cloud to another would require a painful, time-consuming effort to refactor applications, re-architect security policies, and migrate data.

Reducing this complexity is precisely the problem our platform was built to solve. Cloudera’s cloud anywhere platform enables a true "failover anywhere" strategy by providing two essential, unique capabilities:

A consistent, portable platform: Our open data lakehouse and portable data services run identically everywhere. We provide a consistent "write-once, run-anywhere" data and AI platform that runs on any cloud, including AWS, Azure, and Google Cloud, as well as in your private data center. This eliminates the need to refactor applications or workloads when moving between different infrastructures, giving you true portability and eliminating infrastructure dependency.

A unified data fabric with replication: A workload encompasses more than data; it also includes the security and governance that must travel with it. Our Unified Data Fabric, powered by Cloudera Shared Data Experience (SDX), ensures that critical metadata, security, and governance policies are consistent everywhere. Capabilities like Cloudera Octopai Data Lineage provide deep metadata management and lineage, which is also critical context for a failover scenario. Our Replication Manager then replicates both the data and its critical context, which includes metadata and policies, to another environment.

This combination makes the multi-cloud resilience scenario a practical reality. You can run your primary workloads on one cloud provider while using Replication Manager to maintain a synchronized, secondary environment on a completely different cloud provider. When an outage strikes your primary provider, you can quickly promote the secondary environment, ensuring business continuity with minimal data loss (recovery point objective, or RPO) and minimal downtime (recovery time objective, or RTO).
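To show how RPO and RTO turn into numbers you can check, here is a minimal sketch. It does not call Replication Manager or any Cloudera API. The function names, timestamps, and the 15-minute RPO target are hypothetical, and it assumes you can obtain the time of the last successful replication from your own monitoring.

```python
from datetime import datetime, timedelta


def replication_lag(last_replicated_at, now):
    """How far the secondary environment is behind the primary."""
    return now - last_replicated_at


def meets_rpo(last_replicated_at, now, rpo_target):
    """True if failing over right now would lose no more data than the RPO allows."""
    return replication_lag(last_replicated_at, now) <= rpo_target


def achieved_rto(outage_started_at, service_restored_at):
    """Downtime actually experienced during a failover or a drill."""
    return service_restored_at - outage_started_at


if __name__ == "__main__":
    # Hypothetical timestamps for a failover drill.
    last_sync = datetime(2025, 10, 29, 12, 50)
    outage = datetime(2025, 10, 29, 13, 0)
    restored = datetime(2025, 10, 29, 13, 20)

    print("Replication lag at outage:", replication_lag(last_sync, outage))
    print("Within 15-minute RPO:", meets_rpo(last_sync, outage, timedelta(minutes=15)))
    print("Achieved RTO:", achieved_rto(outage, restored))
```

Running checks like these during regular drills is one way to confirm that the secondary environment would meet your targets before a real outage tests them.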

Your Hybrid Strategy Must Be Multi-Cloud Ready

The recent outage should be treated as a drill. It was a test of every organization's resilience strategy, and it exposed a common, critical vulnerability: single-provider dependency. A hybrid architecture is the right foundation for the modern enterprise, but if your strategy has a single-provider blind spot, it's not truly resilient. Don't wait for the next, inevitable disruption to discover this.

Cloudera provides the true "cloud experience anywhere," giving you the capability to design a resilience plan that can withstand any failure. To learn more about how to build a truly resilient architecture, read our blogs "Architecting for Data Resilience" and "Mastering Multi-Cloud with Cloudera."

Strengthen Data Governance with the Power of Automated Data Lineage

Ron Pick — Tue, 28 Oct 2025 13:00:00 UTC

Trying to manage governance without a comprehensive data lineage solution can leave you feeling like your data keeps running away. It’s not easy to keep up with data and metadata on the move. Successful governance managers and data stewards leverage a data lineage tool to improve governance a hundredfold in four key ways we’ll explore next.

4 Ways A Data Lineage Tool Will Improve Data Governance

1. Correcting Errors

Maintaining quality is a key goal of data governance. It’s your responsibility to make sure that management and business users make important decisions based on accurate information.

If you find erroneous data, of course remove and replace it ASAP. But if you’re constantly correcting retroactively instead of fixing the origin of the error, you’ll be constantly pulling weeds in that data field. Long term, it’s much more effective to identify where in the system the error was introduced and fix it at the source.

A comprehensive data lineage tool enables you to trace any data point’s journey upstream to origin and downstream to target, inspecting every process that transformed the data along the way.

In the case of flawed data, you can use data lineage to quickly conduct root cause analysis to work backward from where the error first appeared and identify the stage and/or process where the data changed from accurate to flawed. You can then correct the problem at the root, eliminating the proliferation of dirty data and the necessity of correcting that data wherever it travels in your environment.
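The root cause walk described above can be sketched in a few lines. This is an illustration only, not how any specific lineage tool works internally: the `UPSTREAM` map and stage names are hypothetical, and `is_flawed` stands in for whatever data quality check you already run at each stage.

```python
# Hypothetical lineage map: each stage points to the stages it reads from.
UPSTREAM = {
    "bi.revenue_report": ["dw.fact_sales"],
    "dw.fact_sales": ["etl.convert_currency"],
    "etl.convert_currency": ["src.orders"],
    "src.orders": [],
}


def find_root_causes(upstream, start, is_flawed):
    """Walk upstream from `start`; return flawed stages whose inputs are all clean."""
    roots, seen, stack = [], set(), [start]
    while stack:
        node = stack.pop()
        if node in seen or not is_flawed(node):
            continue
        seen.add(node)
        flawed_parents = [p for p in upstream.get(node, []) if is_flawed(p)]
        if flawed_parents:
            # The error was inherited, so keep tracing toward the origin.
            stack.extend(flawed_parents)
        else:
            # Inputs are clean but the output is not: the error starts here.
            roots.append(node)
    return sorted(roots)


if __name__ == "__main__":
    # Hypothetical result of quality checks: these stages hold bad values.
    flawed = {"bi.revenue_report", "dw.fact_sales", "etl.convert_currency"}
    print(find_root_causes(UPSTREAM, "bi.revenue_report", flawed.__contains__))
```

Here the source orders are clean and the currency conversion output is not, so the conversion step is where the fix belongs.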

2. Keeping Up With Minor Changes

If you want to work in an industry where change seems slow, try paleontology. When you work in data governance, change is constant and fast. Technologies evolve, source systems develop, your dataset structure is modified to reflect new business demands from your data, calculation methods change, and so on.

All the constant little changes need to be reflected in your data governance platform, or you’ll quickly wind up with piles of ungoverned data. If it's left up to human, manual effort to keep the data governance platform updated, then it’s very easy for a change to fall through the cracks.

Automated data lineage tools for data governance, on the other hand, will periodically and automatically run through all your metadata and make note of any new additions, deletions or changes. They will then update your data governance platform with the new fields, calculations or other metadata.
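At its core, that periodic sweep is a comparison of two metadata snapshots. The sketch below assumes, purely for illustration, that each snapshot is a dictionary mapping a field name to its definition. The field names are hypothetical and real tools track far richer metadata.

```python
def diff_metadata(previous, current):
    """Compare two metadata snapshots and report additions, deletions, and changes."""
    added = sorted(current.keys() - previous.keys())
    removed = sorted(previous.keys() - current.keys())
    changed = sorted(
        key for key in previous.keys() & current.keys()
        if previous[key] != current[key]
    )
    return {"added": added, "removed": removed, "changed": changed}


if __name__ == "__main__":
    # Hypothetical snapshots taken one scan apart.
    last_scan = {"orders.total": "DECIMAL(10,2)", "orders.fax": "VARCHAR"}
    this_scan = {"orders.total": "DECIMAL(12,2)", "orders.region": "VARCHAR"}
    print(diff_metadata(last_scan, this_scan))
```

Each entry in the result is something the governance platform needs to hear about: a new field to govern, a retired field to archive, or a definition to update.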

With an automated data lineage solution at your back, you can concentrate on managing and governing data instead of chasing it.

3. Preparing For Major Changes

Mergers and migrations and transitions—oh, my! Most data professionals will probably experience, if not preside over, at least one of these major events over the course of their careers.

The transition is usually unavoidable. And it will just as unavoidably wreak havoc with the work of anyone in your business who touches data and its results—from governance to BI to business—unless you foresee where the changes made to accommodate the new system will impact your current workflows.

Short of a crystal ball, this foresight can only be had by creating a complete visualization of your current system and data flow, comparing it with the intended layout and processes of the new system, and planning how to transition smoothly from one to the other.

It also usually involves lots of communication between members of different departments to apprise them of the slated changes and ask how these changes will affect them, their data and their processes (and then hope they actually respond in a timely fashion). This process, when done manually, typically takes an entire data department months to complete.

Furthermore, an upcoming major transition can be an opportunity—an opportunity to make your data governance more efficient by pruning out dormant fields, consolidating overlapping definitions and checking the consistency of process results. But capitalizing on that opportunity can take months of manual mapping efforts just to prepare for the real work of streamlining your data management.

An automated data lineage tool can turn those months of manual impact analysis into days, or even a single day. Talk about efficiency. One small step for an automated data lineage tool; one giant leap for data governance.

4. Setup

Let’s take a trip down memory lane to the day your company got a new enterprise data governance platform: Congratulations! This platform is going to work wonders for your company as soon as you set it up. But that’s easier said than done.

Data governance platforms usually have an incorporated data catalog, and setup means populating that catalog with all the metadata you are planning to govern. That process usually takes months upon months of work. However, with an automated data lineage tool, you can set up an entire data catalog on your lunch break.

As mentioned above, a comprehensive data lineage solution doesn’t lie down on the job after the initial cleanup. It periodically refreshes, updating your data governance platform with any metadata changes or additions, so you don’t have to endanger your working relationship with any other department by reminding them constantly to update you or the platform every time they make a change to a field, a process or a report.

Picking The Right Tool For Data Lineage In Data Governance

Not everything that calls itself a “data lineage” solution can actually perform all the functions above. Some tools come with built-in automated lineage functions that still require significant manual labor (and headache). As such, it’s important to evaluate solutions to ensure they offer the full suite of capabilities and metadata management you need.

To that end, request a demo to get started with Cloudera Octopai Data Lineage—an automated lineage solution that can perform these functions and improve your data governance today.

Architecting for Data Resilience: Ensuring Business Continuity with Cloudera

Jeremiah Morrow, Eileen O’Loughlin — Wed, 22 Oct 2025 18:00:00 UTC

The recent global IT outage experienced by a cloud hyperscaler was a reminder of a universal truth in technology: even if it’s minimal, downtime and service disruptions are inevitable. While the impact was widespread, disrupting services across retail, banking, healthcare, and other sectors, this wasn’t a failure unique to a single provider or a single cloud. It illustrates that disruption can occur anywhere: in any cloud region, with any provider.

The key takeaway is clear: organizations can and must take control by building a resilient data architecture that can adapt and thrive amid constant change. In this blog, we’ll share how Cloudera customers are uniquely positioned to ensure business continuity thanks to the flexibility of our portable architecture and tools that ensure seamless failover and recovery. Cloudera is the only data and AI platform company that brings AI to data anywhere: in clouds, data centers, and at the edge.

What Does it Mean to Architect for Resilience?

Data resilience is an organization's ability to withstand, recover quickly from, and minimize the impact of data-related disruptions or failures. It is a proactive approach to business continuity, going beyond backup or disaster recovery to ensure that critical data always remains:

Available: Accessible to users and applications when needed (minimizing recovery time objective or RTO)
Intact/accurate (data integrity): Uncorrupted and unaltered (minimizing recovery point objective or RPO)
Secure: Protected from unauthorized access, loss, or theft

Architecting for true resilience involves two core, interconnected pillars: technology that enables portability and a vetted process for failover.

1. Enable Failover Anywhere: Eliminate Single Points of Failure

Relying on a single provider, a single cloud, or even a single region within a cloud creates a critical business vulnerability, or single point of failure. Outages occur due to hardware failures, software issues, human error, natural disasters, or cyberattacks. The goal of resilience is to ensure that when one environment goes down, your operations can seamlessly and automatically continue elsewhere.

This means you must be able to failover anywhere—between cloud regions, across cloud providers, and even back to a data center. Business operations must continue, and critical systems must remain up and running, regardless of where the initial disruption occurred.

2. Have a Vetted Plan for Resilience

Technology can provide resilience capability, but the process is essential for successful business continuity. Too many disaster recovery plans are written once and rarely revisited, even as people and technology evolve. A well-vetted plan is documented, practiced, and revisited regularly to ensure that the organization can execute in the event of a failure. Some elements of the plan include:

Prioritizing workloads to ensure mission-critical operations, such as transaction processing in retail and remote monitoring in healthcare, have the strictest service level agreements (SLAs), meaning the lowest RTO and RPO targets.
Ensuring redundancy and high availability by establishing the ability to failover between environments to maintain operations.
Backing up critical data and metadata, and establishing retention policies and governance.
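The prioritization step in the list above can be written down as data, so the recovery order is not left to memory during an incident. This is a minimal sketch: the `Workload` class, the workload names, and the targets are hypothetical, and sorting by RTO first and RPO second is one simple assumption about how to rank them.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class Workload:
    """One entry in a resilience plan (hypothetical structure)."""
    name: str
    rto: timedelta  # maximum tolerable downtime
    rpo: timedelta  # maximum tolerable data loss


def recovery_order(workloads):
    """Most urgent first: lowest RTO, then lowest RPO, then name for stable ties."""
    return sorted(workloads, key=lambda w: (w.rto, w.rpo, w.name))


if __name__ == "__main__":
    plan = [
        Workload("weekly_reporting", timedelta(hours=24), timedelta(hours=12)),
        Workload("transaction_processing", timedelta(minutes=5), timedelta(minutes=1)),
        Workload("remote_monitoring", timedelta(minutes=5), timedelta(minutes=2)),
    ]
    for workload in recovery_order(plan):
        print(workload.name, workload.rto, workload.rpo)
```

Keeping the plan in a reviewable form like this also makes it easier to revisit as people and technology change.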

How Does Cloudera Help Organizations Architect for Resilience?

Cloudera is the only data and AI platform provider that delivers a consistent cloud experience to data anywhere. This gives enterprises the freedom to move data and AI workloads between clouds and data centers—without friction or vendor lock-in—so that you’re no longer tied to any one piece of infrastructure. As a result, organizations can reduce business risk by leveraging Cloudera to architect for resilience and maintain consistent operations and compliance no matter where data resides.

The Cloudera platform supports high availability and disaster tolerance through our solutions and services, including:

Portable Data Services: Cloudera’s platform, including cloud-native data services and data lake, runs consistently on any cloud (AWS, Azure, Google Cloud) and on premises in Kubernetes. The freedom from underlying infrastructure enables customers to configure a variety of available sites—mixing different clouds and on-premises resources—to drastically reduce dependency on a single platform or vendor.

Data in Motion: Cloudera Data Flow, Cloudera Streaming Analytics, and Cloudera Streams Messaging enable customers to capture, process, and distribute data anywhere in real time. For mission-critical, real-time workloads like fraud detection and network monitoring, a potential outage can have significant business impact. Cloudera ensures these services remain highly available and can be replicated across environments.

Replication Manager: This core Cloudera component provides a simplified approach to backup and recovery. It replicates not just the data, but also the metadata and the critical security and governance policies tied to that data. This replication enables easy migration, continuous synchronization, and, most importantly, the ability to quickly fail over by promoting a secondary replicated environment alongside the primary operating environment with minimal data loss.

Open Data Lakehouse: Cloudera’s open data lakehouse provides secure data management and portable cloud-native data analytics with a write-once, run-anywhere approach. This eliminates the time and costs associated with refactoring applications or workloads when moving between different infrastructures.

Figure 1. Cloudera Delivers the Cloud Experience Anywhere for AI Everywhere

Together, these capabilities enable Cloudera customers to run mission-critical data and AI workloads with confidence, ensuring near-zero downtime and data loss for their most important business processes, even during an infrastructure-level outage.

How AM-BITS Architected for Resilience in the Face of Geopolitical Instability

For many businesses, the recent outage was just a blip. But what if the disruption was a true disaster, like a war? Based in Ukraine, AM-BITS, an IT solutions provider for the banking, telecom, and retail sectors, faced an urgent need to secure and migrate their clients’ mission-critical data after geopolitical disruption forced organizations to rapidly accelerate their shift from on-premises systems to the cloud. A typical cloud migration could take six months or more—a timeline that many businesses could not afford.

To address this crisis of continuity, AM-BITS built a modern, multi-tenant data and AI platform powered by Cloudera. Leveraging Cloudera Shared Data Experience (Cloudera SDX), AM-BITS rapidly provided a “technical safe harbor” for its clients’ data assets, drastically reducing the time to securely migrate data to the cloud by 50%. Because Cloudera operates seamlessly across any environment, AM-BITS’ clients gained true flexibility: they could migrate to the cloud quickly, but they also maintained the option to move to a different cloud or bring data back on premises. By leveraging Cloudera, AM-BITS turned portability into a powerful tool for business continuity.

Next Steps

Data-related disruptions and outages can be caused by hardware failures, software issues, human error, natural disasters, cyberattacks, and more. It’s critical that organizations design their systems with those points of failure in mind and have a plan in place to recover their IT systems and data quickly and without significant disruption.

To learn more about how you can architect for resilience with Cloudera, take a look at our disaster recovery checklist and resources, or reach out to our professional services team who can help you design a plan for resilience.

Developer Relations at Cloudera: What We’re Building for Our Developer Community

Dipankar Mazumdar — Wed, 22 Oct 2025 13:00:00 UTC

Figure 4: Cloudera’s Enterprise Intelligent Center at NYC EVOLVE

My Next 30 Days: What I’m Looking Forward To Most

I love how Cloudera focuses on developers. Teams here talk constantly about how to make things easier to use, how to remove friction, and how to listen to feedback from practitioners in the field. That mindset of putting developer productivity and real-world needs first is exactly where Developer Advocacy can add value.

We’re building a home for the developer community—a place where engineers can learn, try out things, and build without friction. Our focus is on helping developers move from “this looks hard” to “I can build this” with the right patterns, explanations, and tools.

To materialize that vision of a true home for developers, keep an eye out for a Cloudera Developer Hub—a central place where the community can find all of this content, access labs, ask questions, and exchange ideas with other practitioners.

More to come on that soon! In the meantime, stay up to date with our latest practitioner news by subscribing to the Cloudera Community.

Figure 3: Cloudera Iceberg REST Catalog and how it offers interoperability with 3rd party engines.

Launching New Technologies

That’s why it was exciting to see the launch of the Cloudera Iceberg REST Catalog at NYC EVOLVE. With this release, developers can use third-party engines to access Cloudera-managed data directly—without copying or moving it around. Just as important, the same security and governance policies follow the data everywhere, ensuring consistency no matter where it’s accessed.

Alongside the REST Catalog, we also announced the Lakehouse Optimizer. For engineers (particularly with Iceberg), this matters because it takes care of the tedious, behind-the-scenes work that usually comes with managing Iceberg tables. Instead of manually handling tasks like compacting small files, rewriting manifests, or cleaning up position deletes, the optimizer does this automatically.

What that translates to is simple: faster queries and lower storage costs, without developers needing to constantly tune or babysit their tables. And since it’s built as an open service, the same optimizations apply no matter which Iceberg-compatible engine you’re running.
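
For context, the manual work being automated usually takes the form of scheduled maintenance calls. The sketch below only builds the SQL text for Apache Iceberg's Spark maintenance procedures (`rewrite_data_files`, `rewrite_manifests`, `rewrite_position_delete_files`). The catalog and table names are hypothetical, and this is not a description of how the Lakehouse Optimizer itself is implemented.

```python
# Iceberg's Spark procedures are invoked as: CALL <catalog>.system.<procedure>(...)
MAINTENANCE_PROCEDURES = (
    "rewrite_data_files",             # compact small data files
    "rewrite_manifests",              # rewrite manifests to speed up scan planning
    "rewrite_position_delete_files",  # compact position delete files
)


def maintenance_statements(catalog: str, table: str) -> list[str]:
    """Build one CALL statement per maintenance procedure for the given table."""
    return [
        f"CALL {catalog}.system.{proc}(table => '{table}')"
        for proc in MAINTENANCE_PROCEDURES
    ]


# Hypothetical catalog and table names.
for statement in maintenance_statements("my_catalog", "db.events"):
    print(statement)
```

Each statement would normally be passed to `spark.sql(...)` on a Spark session configured with the Iceberg runtime and SQL extensions, typically on a schedule per table.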

The same mindset of openness shows up in how Cloudera approaches GenAI workloads. Instead of betting on just closed-source models (which have their own advantages and challenges), Cloudera AI embraces flexibility: support for open-source large language models (LLMs) like LLaMA and Mistral, including models available through Hugging Face, plus the ability to fine-tune them on enterprise-specific data. That matters because developers want choice. They want to train, fine-tune, and deploy models in their own infrastructure with the same security and governance as the rest of their stack.

Creating Two-Way Feedback Loops

And, finally, there’s the momentum. At NYC EVOLVE, I saw firsthand how engineers and decision-makers are leaning in and asking questions about deploying GenAI use cases, integrating Iceberg into their architecture, and making their data architectures more open, future-proof, and cost friendly. That kind of curiosity is what excites me. It reinforces why building a stronger developer community is so important here.

Our goal within Developer Relations is to turn those conversations into something actionable. This means showing how developers can build open architectures with Iceberg, how they can run multi-compute pipelines seamlessly with interoperability guarantees, and how they can build AI agents on top of their lakehouse data, among other groundbreaking innovations.

This blog is part two of two

I recently joined Cloudera to lead Developer Relations (DevRel), and I’m excited to build out this team and connect with the worldwide developer community.

In this blog post, I’ll share what I’ve seen in my first month on the job and what excites me most about what we’re building here. My goal is to enable practitioners to learn, explore, and build with technologies that matter—whether that’s open data architectures with Apache Iceberg; streaming systems with Apache Flink, Kafka, and NiFi; or generative AI (GenAI) applications. So, importantly, I’ll discuss how Cloudera’s platform and data services can support running and delivering these technologies, securely, at scale, anywhere (clouds, data centers, and at the edge) with the openness and trust that developers expect.

In part one, I covered what Developer Relations means for Cloudera. While the DevRel function can look different from one organization to another, our focus will be on educating, engaging, and building a two-way relationship with developers.

My First 30 Days: What I’ve Seen

What struck me right away at Cloudera is how much emphasis is placed on openness and the flexibility it creates for those building on the platform. True to its open-source foundations, Cloudera still values and prioritizes openness, which is evident in its approach to open standards and frameworks. That’s why so much investment is going into technologies that carry that vision forward.

Prioritizing Openness

Take Apache Iceberg, as an example. Cloudera has been an early proponent of Iceberg as the foundation for an open data architecture because it reflects that same vision of openness and interoperability.

Figure 2: Comparative representation of catalogs with different implementations and catalogs that speak ‘REST’

This is exactly the problem the Iceberg REST Catalog was designed to solve. The REST Catalog API provides a universal standard for server–client communication, ensuring that Iceberg clients can interact with any compliant catalog, regardless of the server implementation’s underlying technology or programming language. Users can create tables, branch versions, or list snapshots through the same API—no matter which catalog sits underneath.

For developers, this removes the need for one-off connectors and reduces friction when adopting new engines. For organizations, it helps avoid locking Iceberg tables into a single platform’s catalog, while still keeping governance and security consistent. In short, everyone speaks the same “language.”
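
To illustrate what "the same API" means in practice, here is a minimal sketch that builds a few routes defined by the open Iceberg REST Catalog specification. The base URL, namespace, and table names are hypothetical, the optional catalog prefix segment is omitted for brevity, and no request is actually sent.

```python
from urllib.parse import quote


def rest_routes(base_url: str, namespace: str, table: str) -> dict[str, tuple[str, str]]:
    """Map a few catalog operations to (HTTP method, URL) pairs."""
    root = f"{base_url.rstrip('/')}/v1"
    ns = quote(namespace, safe="")
    tbl = quote(table, safe="")
    return {
        "get_config": ("GET", f"{root}/config"),
        "list_namespaces": ("GET", f"{root}/namespaces"),
        "list_tables": ("GET", f"{root}/namespaces/{ns}/tables"),
        "create_table": ("POST", f"{root}/namespaces/{ns}/tables"),
        "load_table": ("GET", f"{root}/namespaces/{ns}/tables/{tbl}"),
    }


# Hypothetical endpoint and identifiers.
routes = rest_routes("https://catalog.example.com", "sales", "orders")
for operation, (method, url) in routes.items():
    print(f"{operation}: {method} {url}")
```

Because every compliant catalog exposes these same routes, a client written against them does not need to know which catalog implementation is answering.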

Figure 1: Apache Iceberg as the foundation for open data architecture with Cloudera

Iceberg gives developers an open table format that isn’t tied to one engine or one vendor. You can write data with Spark, stream updates with Flink, query with Trino or Hive—all against the same table. That level of interoperability has traditionally been limited, if not absent, in other data architectures such as cloud data warehouses, but it’s exactly what modern data and AI platforms need. For customers, this becomes a real advantage. By building their data architecture on Iceberg, they make themselves future-proof, and any new compute engine can be plugged into the same tables without costly migrations or lock-in.

Building on that, the Iceberg REST Catalog takes openness a step further. While open table formats like Iceberg have broadened access to data, the catalog is another critical component in the lakehouse architecture that needs to be interoperable.

Today, there are many different catalog implementations—both open-source and proprietary. The challenge is that managing Iceberg tables across different catalogs has historically required custom integrations, making true interoperability difficult. On top of that, many vendor platforms only provide full support if developers use their own built-in catalog. That dependency limits what can be shared with other engines and tools, creating a new form of lock-in.

Developer Relations at Cloudera: Introducing DevRel and What Developer Advocacy Means for Cloudera

Dipankar Mazumdar — Tue, 21 Oct 2025 13:00:00 UTC

Figure 3: Cloudera’s support and usage of various open source software

In line with its open-source foundations, Cloudera is built on the principle of being open for integration, frameworks, and standards. That openness gives developers the freedom to use the tools they already know, adopt new ones as the ecosystem evolves, and avoid being locked into a narrow path.

That is why DevRel is so critical here. It means staying deeply engaged with the open-source ecosystem, while also enabling enterprise developers who rely on Cloudera to solve real problems in data and AI using these foundational technologies.

At Cloudera, our DevRel work is anchored on three pillars: awareness, engagement, and impact. Awareness is about making sure developers discover and understand what’s possible. Engagement is about meeting them where they are. And impact is about driving real outcomes: helping developers be more productive, shaping better products through feedback, and strengthening the open-source projects we all depend on.

What We’re Building for Developers

As I wrap up my first month, I keep coming back to a simple thought: my journey has always been shaped by community. I started as an engineer leaning on open source—reading docs, interpreting code, and learning from community blogs. Over time, I contributed back in different ways.

Now at Cloudera, I see the chance to extend that same cycle: to learn, share, and build alongside developers. Here’s what we’ll be working on in the coming months:

Technical deep-dives: This includes blogs, how-tos, and whitepapers on how to operationalize technologies like Iceberg, Spark, Flink, NiFi, Ozone, Kafka, and more at scale with Cloudera. They’ll show real patterns, tradeoffs, and examples you can reuse.

New explainer series: Encompassing short, focused breakdowns of concepts, use cases, and learnings from production in the data and AI space. The goal is to cut through jargon and give developers a clear mental model.

Hands-on labs: These are guided, runnable examples you can try on your own laptop or cloud environments. If a blog tackles the “why,” labs will show the “how.”

Community events: We are meeting engineers wherever they learn and code. So, meetups, workshops, and conference sessions are where we will engage directly, exchange ideas, and learn from one another.

Join me at the Cloudera Community and engage with the content, try out the code, give feedback, and ask questions!

This blog is part one of two

It has been slightly more than four weeks since I joined Cloudera to lead Developer Relations (DevRel). A month may seem brief, but it’s enough to feel the pulse of a community—its culture, its people, and the momentum behind some of the key technologies that Cloudera drives.

In this blog post, I’ll explain what Developer Relations means for Cloudera. The DevRel function can look different from one organization to another, depending on the goals of advocacy. This is my fourth DevRel gig, and at Onehouse, Dremio, and Qlik, the focus was slightly different. But the crux has always been the same: educating, engaging, and building a two-way relationship with developers.

In part two, I’ll share what I’ve seen in my first 30 days, our plans for supporting practitioners in their pursuits and use of the technologies that matter most to them, and how our platform supports their efforts.

What is Developer Advocacy?

Developer Advocacy is a specific role within the DevRel function, and while there are other roles within the function, we will use the two terms interchangeably in this blog.

Figure 2: A day in the life (of DevRel)

On the other hand, it’s about carrying developers' voices back into product engineering and making sure their needs shape what gets built next. When done right, DevRel creates a two-way feedback loop. We show what's possible with a platform, and we also listen to and incorporate where developers get stuck (the issues/errors), what excites them (capabilities), and how the community evolves with the ecosystem.

What Does Developer Advocacy Mean at Cloudera?

At Cloudera, developers have always been at the center. The company sits at a unique intersection: open-source commitment on one side and enterprise adoption on the other. Cloudera has a long history of contributing to foundational Apache projects like Spark, Flink, Kafka, Ozone, NiFi, and Iceberg, while also serving a global customer base that depends on these technologies for production-grade scale and reliability.

Figure 1: Developer Relations as an interface with product, engineering, and marketing teams and developers

At its core, DevRel is the bridge between technologies (products) and developers. On one hand, it’s about enabling developers to be productive, grow, and succeed with a range of data and AI technologies. This involves breaking down complex system internals in the form of blogs, books, or papers; showing how to accomplish something (with code); and exploring possible use cases via demos, hands-on labs, or webinars. It’s about being present where developers already are: meetups/conferences, open-source GitHub repositories, Slack channels, and forums.

The Shifting Airgapped Data Processing Market: What It Means for the Public Sector

Jeremiah Morrow — Thu, 16 Oct 2025 13:00:00 UTC

For organizations in the U.S. public sector, the ability to leverage data in secure, air-gapped cloud and on-premises environments is not a preference—it’s a non-negotiable security and operational requirement.

Many public sector agencies currently use platform-as-a-service (PaaS) solutions for secure data processing using Apache Spark. However, with many solution providers moving to multi-tenant software-as-a-service (SaaS) offerings, these PaaS solutions are being deprecated. Moving forward, organizations for whom single-tenancy is a critical requirement will need to evaluate alternative solutions for air-gapped data processing. For most of them, a multi-tenant SaaS solution is simply not an option.

Cloudera is uniquely positioned to support mission-critical networks as our data and AI platform is designed for absolute control and sovereignty. For public sector agencies looking to maintain secure customer operations, Cloudera provides a clear and stable path forward.

A Platform Built for the Mission, Not the Public Cloud

As a graduated In-Q-Tel portfolio company, Cloudera has a long history of successful, Technology Readiness Level (TRL) 9 mission-proven deployment across the U.S. civilian, defense, and intelligence communities.

To exclusively serve this market, we established Cloudera Government Solutions, Inc. (CGSI). Headquartered in the Washington D.C. metropolitan area, CGSI is a dedicated subsidiary focused solely on the unique needs of government agencies. Our expertise is U.S.-based, cleared, and focused on ensuring mission success, evidenced by a strong, growing presence with Authority to Operate (ATO) qualifications across numerous secure networks.

For program managers and technical leaders re-evaluating their data strategy, the choice is simple: rely on a platform designed to support your specific industry requirements. Cloudera is the proven solution for any public sector agency requiring a robust, self-contained data platform.

The Clear Choice for Secure, Air-Gapped Data and AI

When moving your critical Spark workloads and data pipelines, Cloudera offers distinct advantages that ensure stability, control, and future-readiness:

Cloud Anywhere

Cloudera was built from the ground up to support workloads across hybrid and multi-cloud environments. Our data and AI platform delivers a consistent cloud experience across data centers, private clouds, and at the tactical edge—environments where pure cloud-native solutions simply cannot operate. We’re the de facto Spark provider for on-premises deployments leveraging object stores (S3 and Ozone), Kubernetes, virtual machine (VM), and bare metal technologies. This means your secure, self-managed data environment is our foundation–not an edge case or deprecated feature.

Unified, Open, and Built for Longevity

Our platform is built on an open-source foundation with an open-standards approach to integration, reducing the risk of vendor lock-in and ensuring maximum interoperability. Collectively, Cloudera customers ranging from Global 2000 to government manage more than 25 exabytes of data using our platform, demonstrating unparalleled scale and enterprise stability. Cloudera has more than $1 billion in annual recurring revenue to back our long-term partnership commitment. We provide a single, unified platform with an open data lakehouse and a comprehensive data fabric to manage the entire lifecycle of data–from streaming and data engineering to machine learning and enterprise AI.

Mission-Ready AI Everywhere

The ability to deploy modern AI is increasingly vital to mission success. Cloudera accelerates the full AI lifecycle–from data preparation to private generative and agentic AI–with real-time, low-latency inference. You can deploy in certified AI infrastructure on premises and, crucially, in fully air-gapped cloud environments for absolute data control and sovereignty. This enables you to bring AI to your data, anywhere it resides, without ever compromising security.

Comprehensive Data Control and Governance

In government environments, data control is paramount. Cloudera delivers enterprise-wide data security, governance, lineage, and observability within a single platform. Our technology is tested rigorously to meet the most stringent regulatory and accreditation standards, with documented support for FIPS 140. This comprehensive compliance ensures your programs achieve and maintain their Authority to Operate (ATO) with confidence.

Unwavering Commitment to Your Success

Our investment in your mission goes beyond technology. CGSI provides an ecosystem of support designed specifically for the U.S. government:

Dedicated, cleared U.S. expertise: We offer professional services and 24x7 support from cleared U.S. citizens on U.S. soil. Our subject matter experts are available for everything from hands-on installation and optimization to supporting your most complex, mission-critical cases.

Robust partner ecosystem: We partner with all key federal system integrators (FSIs) and technology providers to ensure seamless integration and mission success.

Expert training: We offer comprehensive training and certification programs via live private on-site, live public, or self-service on-demand training to empower your teams to become self-sufficient experts on the platform.

For government agencies that require the full power of modern data and AI without compromising on security or control, Cloudera is the proven, trusted, and superior choice.

Cloudera and Protegrity: Delivering Secure AI and Analytics for Regulated Industries

Jerome Alexander — Wed, 15 Oct 2025 13:00:00 UTC

The rapid embrace of AI tools and models is yielding serious results for businesses across nearly every industry. Advanced and predictive analytics are providing deeper insights into business operations, newer forms of AI, like agentic AI, are transforming customer experiences, and machine learning is streamlining complex processes.

But for businesses in highly regulated industries—financial institutions, healthcare institutions, or any other business that’s subject to added compliance, security, and privacy considerations—the path to AI acceleration has extra obstacles along the way. For many of those highly regulated organizations, it may feel like AI is simply not an option. That, however, does not have to be the case.

Recently, Cloudera partnered with Protegrity, a global leader in data security and privacy, to address those security, compliance, and privacy concerns that leave regulated industries seemingly hamstrung in their adoption of AI.

Capitalize on AI While Maintaining Compliance

Whether it’s a financial firm contending with GDPR and DORA guidelines or a healthcare institution bound by longstanding regulations like HIPAA, non-compliance is an extremely dangerous prospect.

Adherence to regulatory guidelines isn’t just a security issue. Failure to stay in compliance can bring serious financial and operational consequences that set the business back. Cloudera and Protegrity’s collaboration simplifies governance and auditability, helping streamline protection at scale while reducing operational complexity and costs. For organizations navigating highly regulated environments, this means the ability to innovate securely while ensuring adherence to evolving standards.

Unlike other platforms that require data movement to centralized locations, Cloudera enables businesses to apply AI directly to their data, wherever it resides—in clouds, data centers, or at the edge. That means organizations can avoid the added risk and complexity that comes with moving data from one location to another in order to feed an AI initiative.

And now, the partnership with Protegrity adds advanced data protection tools, such as vaultless tokenization, format-preserving encryption (FPE), dynamic data masking, and anonymization. These tools integrate seamlessly with Cloudera’s platform, enabling organizations to secure sensitive data while applying AI. For example, a financial institution using Cloudera can tokenize customer data with Protegrity’s solutions, ensuring compliance with GDPR while running predictive analytics in real time.
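
As a rough illustration of the general idea behind masking and tokenizing fields before analytics, the sketch below uses only the Python standard library. It is not Protegrity's API: the function names and the key are hypothetical, and the keyed hash shown here is a one-way pseudonymization stand-in, whereas vaultless tokenization and format-preserving encryption are reversible, purpose-built techniques.

```python
import hashlib
import hmac


def mask_card_number(card_number: str, visible: int = 4) -> str:
    """Masking-style view: hide all but the last `visible` digits."""
    digits = [c for c in card_number if c.isdigit()]
    hidden = max(len(digits) - visible, 0)
    return "*" * hidden + "".join(digits[hidden:])


def pseudonymize(value: str, secret_key: bytes) -> str:
    """Deterministic keyed token (HMAC-SHA256), so equal inputs still join and group."""
    digest = hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


# Hypothetical key and record, for illustration only.
key = b"demo-key-do-not-use"
record = {"email": "alice@example.com", "card": "4111 1111 1111 1111"}
protected = {
    "email_token": pseudonymize(record["email"], key),
    "card_masked": mask_card_number(record["card"]),
}
print(protected)
```

Because the token is deterministic for a given key, downstream analytics can still count, join, and group on the protected column without seeing the raw value.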

Partnering to Enhance Data Protection Across Environments

Cloudera and Protegrity bring a deep understanding of the data challenges that face highly regulated businesses, and together provide the heightened level of support and security to unlock the full potential of proprietary data without increasing risk exposure.

Cloudera’s enterprise data platform and Protegrity’s robust data protection enable highly regulated organizations to adopt AI, machine learning, and cloud analytics while ensuring compliance and data protection. These businesses can securely share and analyze sensitive information across teams and third parties, generating and harnessing richer insights and making real-time decisions without compromising trust.

Facing a heightened regulatory and compliance burden doesn’t have to mean sacrificing the benefits of AI, machine learning, and advanced analytics. As the only data and AI platform company that large organizations trust to bring AI to their data anywhere it lives, Cloudera and its partner ecosystem deliver the security and scalability needed to support any enterprise.

Learn more about how Cloudera and its partners can secure AI and advanced analytics for highly regulated industries.

Cloudera Container Service—Built-in Security and Smarter Cost Control

Bhagya Lakshmi Gummalla — Wed, 08 Oct 2025 13:00:00 UTC

Figure 1: Cloudera Container Service Architecture

Simplified Kubernetes Lifecycle Management

Cloudera continues to invest in making Kubernetes and add-on services easier to operate across environments. With Cloudera Container Service, you can now use an intuitive UI to easily deploy Kubernetes clusters. Looking ahead, our roadmap includes extending unified lifecycle management across the whole Cloudera managed cluster estate, enabling enterprise admins to manage lifecycle updates consistently from a unified UI.

Built-In Security and Compliance

Cloudera Container Service provides several security features out of the box, ensuring that Kubernetes deployments are secure from day one, which helps you move faster and reduce risk. These features include:

Istio service mesh: Ensures secure, authenticated communication between microservices, without requiring users to install or configure Istio separately.
Knox gateway (as an Istio External Authorization Provider): Delivers enterprise-grade authentication and access control with external services while maintaining Istio's native security framework.
Calico: Provides network policy enforcement to isolate workloads and meet compliance requirements through fine-grained traffic control for secure pod-to-pod communication.
Private cluster support: Restricts access to within the customer’s cloud network, keeping workloads isolated from public internet exposure and reducing the need for complex network policy configurations.
IMDSv2 (instance metadata service v2): Uses session-based tokens to protect access to AWS instance metadata, mitigating risks and improving cloud workload security.
Non-transparent proxy support: Enables secure, auditable outbound traffic from Kubernetes clusters without requiring manual proxy setup for each data service configuration.
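
To show what "session-based tokens" means for IMDSv2, the sketch below builds the two HTTP requests involved: a PUT that obtains a token, and a GET that presents it. The endpoint and header names follow AWS's documented IMDSv2 flow; the helper function names are hypothetical, and nothing is sent unless the requests are explicitly opened on an EC2 instance.

```python
import urllib.request

IMDS_BASE = "http://169.254.169.254/latest"


def token_request(ttl_seconds: int = 21600) -> urllib.request.Request:
    """Step 1: PUT request that asks the metadata service for a session token."""
    return urllib.request.Request(
        f"{IMDS_BASE}/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": str(ttl_seconds)},
    )


def metadata_request(path: str, token: str) -> urllib.request.Request:
    """Step 2: GET request that presents the session token."""
    return urllib.request.Request(
        f"{IMDS_BASE}/meta-data/{path.lstrip('/')}",
        method="GET",
        headers={"X-aws-ec2-metadata-token": token},
    )


req = token_request()
print(req.get_method(), req.full_url)
```

On an EC2 instance, the token would come from `urllib.request.urlopen(token_request(), timeout=2).read().decode()` and then be passed to `metadata_request("instance-id", token)`. A plain GET without the token header is rejected when IMDSv2 is enforced.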

Smarter, Cost-Optimized Workload Management

“By 2026, organizations performing real-time cost or performance optimization of cloud-based workloads will rise from less than 20% in 2022, to 50%.” - Gartner(™), Evolve Service Management and Cloud Operations

These insights underscore the increasing focus on cloud cost optimization as organizations seek to manage expenses while leveraging cloud technologies.

By giving enterprises control over cost-saving mechanisms, Cloudera ensures that organizations only pay for the resources they actually use while maintaining the flexibility of Kubernetes-based workloads.

Cloudera’s latest enhancements enable organizations to optimize spending while maintaining performance in several ways, including:

AWS Graviton support: Enables cost-effective compute with ARM-based instances, reducing cloud expenses and energy consumption. Further, building multi-architecture container images enables a “build once, deploy anywhere” approach.

Suspend/resume clusters: Allows enterprises to pause workloads when not in use and resume them when needed, cutting down on unnecessary infrastructure costs.

Shared data services: Optimizes resources by allowing multiple data services to leverage shared infrastructure, reducing duplication and improving efficiency.

Apache YuniKorn: Enables higher cluster density, lower operational costs, and improved performance through an intelligent resource scheduler with enhanced workload placement and scheduling techniques like bin-packing, hierarchical quota management, and gang scheduling.
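
To give a feel for why bin-packing raises cluster density, here is a small sketch of the textbook first-fit-decreasing heuristic. It is an illustration of the general technique only, not Apache YuniKorn's scheduler code; the CPU requests and node capacity are hypothetical.

```python
def first_fit_decreasing(pod_cpu_requests, node_capacity):
    """Pack CPU requests onto equally sized nodes, largest request first."""
    nodes = []  # each node is a list of the requests placed on it
    for request in sorted(pod_cpu_requests, reverse=True):
        if request > node_capacity:
            raise ValueError(f"request {request} exceeds node capacity {node_capacity}")
        for node in nodes:
            if sum(node) + request <= node_capacity:
                node.append(request)  # fits on an existing node
                break
        else:
            nodes.append([request])  # no existing node fits; open a new one
    return nodes


# Hypothetical pod CPU requests (in cores) and an 8-core node size.
placement = first_fit_decreasing([4, 2, 6, 2, 3, 1], node_capacity=8)
print(f"{len(placement)} nodes used: {placement}")
```

Filling existing nodes before opening new ones leaves fewer partially used nodes, which is what lets idle capacity be scaled down.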

Leveled-Up: Cloudera AI Inference Service with NVIDIA Accelerated Compute

Cloudera AI Inference service is the first data service onboarded to Cloudera’s enhanced Kubernetes platform. By leveraging Cloudera Container Service, AI workloads can now move from development to production faster, more securely, and more cost-effectively than ever before.

Cloudera’s Container Service plays a critical role in enabling AI inference by providing:

Optimized performance: Efficient scheduling and orchestration of NVIDIA accelerated compute, ensuring AI workloads are allocated the compute power they need without over-provisioning resources.

Enterprise-grade security: AI workloads remain fully contained within Cloudera’s secure, enterprise-ready platform, ensuring data governance and compliance.

Automated infrastructure management: The platform handles cluster scaling, security policies, and workload isolation, allowing data scientists and AI engineers to focus on model optimization instead of infrastructure management.

Future-Ready Kubernetes: Built for AI, Analytics, and Beyond

As part of Cloudera’s broader vision of supporting diverse workloads—from real-time data streaming to large-scale analytics and next-generation enterprise applications—this enhancement is a boon for organizations with an AI-first approach.

With Kubernetes as the foundation, Cloudera solves today’s infrastructure challenges and prepares your organization for future innovation.

Interested in learning more and seeing what’s in store for the future? Contact us to speak directly with a member of our sales team.

Introducing Cloudera Container Service: Simple, Secure, Cost Efficient

Cloudera Container Service is our enhanced Kubernetes platform (replacing Compute Cluster). Enhancements include simplified lifecycle management, built-in security, and cost-optimized workload management across multi-cloud environments.

With Cloudera Container Service, you can focus on innovation rather than infrastructure complexity, ensuring that Kubernetes deployments are secure, scalable, and cost-effective across multi-cloud environments.

“Kubernetes should be an enabler, not an obstacle,” said Karthik Krishnamoorthy, Cloudera’s Vice President for Product Management. “With these enhancements, we’re giving enterprises the tools to manage Kubernetes more efficiently, reduce cloud costs, and onboard powerful AI and data-driven applications, all while ensuring built-in security.”

#ClouderaLife Employee Spotlight: Meet Leo Brunnick, Chief Product Officer

Debbie Kruger — Mon, 06 Oct 2025 13:00:00 UTC

At Cloudera, leadership is about more than just driving business strategy; it’s about inspiring innovation and nurturing community. No one embodies that spirit more clearly than Leo Brunnick, Cloudera’s Chief Product Officer.

As he settles into his tenure, one quality of Cloudera stands out to Leo.

“What I see is energy. What I see is joy. I see a group of people desperately wanting Cloudera to do well and win,” he shared. “I’ve been at many companies, and they’re not all like this.”

That collective drive, he believes, is what sets Cloudera apart. “People aren’t here just for performance reviews or scores—they’re here to make Cloudera successful. And that’s rare.”

Let’s get to know Leo Brunnick and explore how Cloudera has supported his leadership journey and empowered him to shape our product vision.

Meet Leo Brunnick

As Chief Product Officer, Leo guides Cloudera’s product strategy and innovation agenda, helping ensure the company is at the forefront of data, AI, and cloud transformation.

What drew him in was not just the technology, but the people. “Our CEO, Charles Sansbury, had built his dream team across sales, marketing, and finance, and he needed leadership in product and engineering. I saw that what I could bring would make a real difference. That’s what gets me out of bed in the morning.”

Leo’s Journey to Cloudera

When Charles first reached out, Leo was intrigued. After speaking with executives and board members, the decision became clear.

“I don’t think I’ve ever seen a board more supportive of giving a company what it needs to be successful,” Leo said. “Cloudera was in a spot where if it made the right moves, it could take advantage of the mega trends in AI and data. That was terribly exciting.”

For Leo, it wasn’t just about joining a strong company but also helping it break through. “Cloudera is this close to becoming an even bigger success story. It’s fun to have a brass ring to chase.”

Driving Innovation: Cloudera Data Services

Cloudera recently announced the launch of Cloudera Data Services, a transformative platform designed to bring private AI and cloud-native agility directly to the data center.

Leo is energized by what this means for customers and employees alike. “That full easy button of the cloud—now available on-premises. That’s fundamentally different,” he said.

For him, this isn’t just a technology milestone but a company-wide opportunity. “When you can move quickly and deploy differently in the on-prem environment, it impacts the whole company. It is a full team sport regarding what Cloudera is poised to do now.”

Clouderans play a role in shaping this future, from how products are built and packaged to how they’re sold, supported, and scaled. Leo sees this as one of the most exciting parts of the journey: everyone has a hand in making it real.

Culture, Community, and Representation

Leo’s family is deeply connected to Latin heritage, which has shaped his personal life and professional outlook. Having spent time in El Paso and now living in Austin, he’s long embraced Hispanic culture's vibrancy, traditions, and energy.

“I just love the culture, the tradition, the energy, and the vibrancy,” he said. Now, as part of Cloudera’s Latinx Employee Resource Group (ERG), he feels less like a leader with a title and more like a participant in a supportive community. “ERG lead is just a fancy title. Really, I feel grateful to be allowed to be part of the group.”

For him, ERGs are about belonging: “It just feels better when you’re around people you care about and who care about you. ERGs help people connect and feel seen for who they are. Taken together, all those perspectives make Cloudera a special place.”

Leading with Energy and Authenticity

When asked about his leadership style, Leo is quick to ground it in humility.

“I’m never going to be the smartest person in the room, and I’m not always going to be right. But what I bring is energy and authenticity. People want to be part of a winning team—they just want to know how to participate.”

That belief drives his hands-on approach. From San Jose to Costa Rica, Raleigh, Budapest, and Bangalore, Leo embraces “management by walking around.” As he puts it, “You’ve got to get out there, pound the drum, and get people fired up.”

For him, leadership is also about clarity: “This is what we’re doing. This is why. Repeat it repeatedly. That’s how you build trust across teams.”

Closing Thoughts

For those considering a career at Cloudera, Leo’s advice is both candid and inspiring.

“Be ready, because Cloudera is a full-contact sport. This isn’t a place to just punch in and out. People here lean in and give it all they’ve got. And we want others who feel the same way.”

His words reflect the spirit of Cloudera: authentic, passionate, and all-in on success.

Want to learn about more inspiring Clouderans? Read here.

Democratize Data for AI Using Interoperability Across Engines and Zero-Copy Data Collaboration

Pamela Pan, Akshat Mathur, Bill Zhang — Fri, 03 Oct 2025 13:00:00 UTC

How Cloudera Iceberg REST Catalog Enables Open, AI-Ready Enterprises

Interoperability has long been a buzzword, not a capability enterprises can count on in practice. Instead, data architects are often left stitching together fragmented systems, chief data officers face massive risk and vendor lock-in from siloed governance, and platform leaders are restricted from providing a consistent data view to their teams. Whether driven by mergers, multi-cloud strategies, or external partnerships, the pattern repeats: rising costs, slower innovation, and limited ability to scale AI with confidence.

At Cloudera, we’ve helped our customers navigate these challenges—disconnected metadata layers, duplicated data pipelines, and governance models that fail to extend across tools—always striving to enable open, AI-ready enterprises that unlock interoperability at scale.

Why Openness Matters for Enterprise AI

To scale AI workloads, organizations require visibility and control over the data that fuels them. Metadata intelligence plays a critical role in this equation, enabling organizations to understand where data lives, how it’s structured, and how it’s used across teams and tools.

With open standards like Apache Iceberg and the Iceberg REST Catalog, enterprises gain a unified layer of metadata that supports zero-ETL data sharing, enforces governance, and powers secure interoperability across analytics and AI engines. This foundation transforms fragmented infrastructure into a connected, AI-ready data architecture—one where metadata becomes the key to accelerating access to insights while maintaining trust.

Open, Secure, and Simple: Cloudera Iceberg REST Catalog

The Cloudera Iceberg REST Catalog powers our open data lakehouse and helps organizations simplify architecture, reduce duplication, and extend secure data access wherever it’s needed.

It acts as a universal, interoperable metadata layer and provides zero-copy access to Iceberg tables across tools, clouds, and teams, enabling open-source and third-party tools to access the same data. Features and benefits include:

Open and engine-agnostic: Provides standards-based APIs that support tools like Athena, Databricks, Redshift, and Snowflake—enabling interoperability without vendor lock-in
Decoupled by design: Abstracts query engines from backend metastores, reducing complexity and increasing portability across environments
Real-time metadata access: Supports fast, up-to-date metadata queries from Iceberg-compatible metastores, improving data visibility across teams
Governed and secure: Extends fine-grained access controls, row-level permissions, and enterprise identity access management (IAM) integration (such as LDAP and OAuth2) to all connected systems—ensuring consistent policy enforcement at scale

Figure 1. Cloudera's Iceberg REST Catalog provides a universal, interoperable metadata layer, enabling open source and third-party tools to access the same data.
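As a rough illustration of the engine-agnostic access described above, the sketch below connects an open-source client, PyIceberg, to an Iceberg REST catalog. The endpoint URL, warehouse name, token placeholder, and catalog name are hypothetical, and a real deployment may need different or additional properties.

```python
from typing import Dict, Optional


def rest_catalog_properties(
    uri: str, warehouse: str, token: Optional[str] = None
) -> Dict[str, str]:
    """Build the properties a PyIceberg client passes to a REST catalog."""
    props = {"type": "rest", "uri": uri, "warehouse": warehouse}
    if token:
        # Bearer token issued by the identity provider (assumed OAuth2 here).
        props["token"] = token
    return props


def read_table(props: Dict[str, str], identifier: str):
    """Load an Iceberg table through the REST catalog and scan it in place."""
    # Imported lazily so the helper above works without PyIceberg installed.
    from pyiceberg.catalog import load_catalog

    catalog = load_catalog("shared", **props)  # "shared" is an arbitrary local name
    table = catalog.load_table(identifier)
    # No copy of the data is made in another system; the client reads the
    # table's existing data files. to_arrow() also requires PyArrow.
    return table.scan().to_arrow()


# Hypothetical endpoint and warehouse, for illustration only.
props = rest_catalog_properties(
    uri="https://catalog.example.com/iceberg",
    warehouse="lakehouse",
    token="<oauth2-bearer-token>",
)
print(props["type"], props["uri"])
```

Calling `read_table(props, "finance.transactions")` with a real endpoint and an existing table (the identifier here is made up) would return an Arrow table. Any other engine that speaks the same REST protocol could point at the same catalog and see the same tables.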

Real-World Use Cases and Impact of Iceberg REST Catalog

The following real-world examples illustrate how organizations are using the Iceberg REST Catalog to simplify their data stack, reduce total cost of ownership (TCO), and accelerate time to value, all while keeping data where it belongs.

Together, these examples demonstrate how Cloudera’s open and interoperable approach accelerates AI outcomes, drives operational efficiency at enterprise scale, and enables security and compliance.

Data Sharing: Scale AI Applications to 3,000+ Cross-Platform Users

A luxury automotive manufacturer faced mounting challenges in securely sharing data with an external partner using Databricks. Traditional methods relied on data duplication, which introduced cost, complexity, and architectural inflexibility.

By adopting the Iceberg REST Catalog, the customer established secure, zero-ETL data sharing across both internal systems and external platforms. This open, standards-based approach allowed them to choose the best tools for the job—using Spark for complex data pipelines and Impala for fast SQL analytics. With this foundation, the company scaled AI applications to more than 3,000 users while maintaining full governance and control over data access.

Data Warehouse Optimization: Reduce Data Movement Costs 74%

Following a merger, a global satellite company encountered significant roadblocks in unifying fragmented data locked in proprietary systems. Without a consistent, interoperable data layer, their AI and analytics initiatives were slow to scale and difficult to manage.

Cloudera’s open data lakehouse architecture, powered by the Iceberg REST Catalog, helped the customer consolidate these silos and establish a single source of truth for all of its AI and analytics workloads. By querying managed Iceberg tables directly in S3, they eliminated the need for redundant data pipelines and replatforming efforts, leading to a 74% reduction in data movement costs.

Demo: A Closer Look at Data Sharing via Cloudera’s Iceberg REST Catalog

This interactive demo brings the Iceberg REST Catalog to life through a financial services scenario. At the fictional Parent Bank, different teams use their preferred tools—such as Snowflake and AWS Athena—to securely access one governed source of data, all without complex ETL or costly data movement.

For a deeper dive into this offering and how it can benefit your organization, explore these resources:

Visit our product page to learn more about Cloudera’s open data lakehouse.
Read the press release for the full announcement about Cloudera’s vision for open data sharing.

3 Steps to Cutting Cloud Costs with Data Lineage

Ron Pick — Thu, 02 Oct 2025 13:00:00 UTC

Ever promise someone the moon? If so, it’s unlikely you knew the price tag in advance.

Whereas, if you promise someone a cloud, you can calculate your costs down to a thousandth of a cent.

Amazon, Azure, and Google offer cloud data storage cost calculators that will make your head spin with their specificity: How many TiB of data do you need for streaming reads on Google BigQuery? Do you want ra3.4xlarge or ra3.xlplus instances on Amazon Redshift—and how many nodes?

While storing data in the cloud is often billed as being more cost-efficient than using on-premises data storage, in truth reducing your cost for cloud storage requires investigation, elimination, and optimization. Let’s take it step by step.

Step 1: Investigation

One of the simplest ways of reducing data storage costs is to store less data. Obvious, yes. Easy, no.

There’s a reason why you have all that data. Sometimes a good reason—like for operational, administrative, and business processes—but sometimes the reason isn’t all that great, such as “we haven’t gotten rid of it yet.”

In every data ecosystem, there’s outdated, redundant, and bad quality data that you can—and should—get rid of. But how do you locate it?

The answer is automated data lineage: the data housekeeper’s faithful sidekick.

Imagine that you have a magic wand that helps with spring cleaning. This wand tells you where each item in your household was bought, when it was last used, what shape it’s in, if you have any other items that serve the same function, and so on.

This is what automated data lineage does for your data ecosystem. Let it loose, and within minutes you’ll have a complete mapping of your data flow: what data assets feed what reports and trace back to which sources. Comprehensive data lineage shows this both on a zoomed-out, source-system level, as well as on a zoomed-in, column-to-column level. It can even get into the ETL processes and show exactly what transformations were performed on the data as it moved.
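To make the idea of tracing a report back to its sources concrete, here is a minimal sketch that treats lineage as a directed graph. The asset names and edges are invented for illustration; a real lineage tool discovers them automatically and also tracks column-level detail, which this sketch omits.

```python
from collections import defaultdict

# Hypothetical lineage edges: (upstream asset, downstream asset).
EDGES = [
    ("crm.customers", "staging.customers_clean"),
    ("erp.orders", "staging.orders_clean"),
    ("staging.customers_clean", "warehouse.customer_orders"),
    ("staging.orders_clean", "warehouse.customer_orders"),
    ("warehouse.customer_orders", "report.monthly_revenue"),
]


def build_upstream_index(edges):
    """Map each asset to the set of assets that feed it directly."""
    upstream = defaultdict(set)
    for source, target in edges:
        upstream[target].add(source)
    return upstream


def trace_sources(asset, upstream):
    """Return every root source (an asset with no upstream) that feeds `asset`."""
    roots, seen, stack = set(), set(), [asset]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        parents = upstream.get(node, set())
        if not parents and node != asset:
            roots.add(node)
        stack.extend(parents)
    return roots


upstream = build_upstream_index(EDGES)
print(sorted(trace_sources("report.monthly_revenue", upstream)))
# prints ['crm.customers', 'erp.orders']
```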

Once you have the complete picture mapped out, you can move on to the second stage: elimination.

Step 2: Elimination

Take a close look at your data lineage, and ask the following questions:

Are any of these data assets or data uses (reports, for example) redundant?
Are any of these data assets or data uses outdated or otherwise no longer relevant?

Answering “yes” points you to data that can be offloaded, directly reducing cloud-based storage costs. But offload wisely! Even if you’ve identified two data assets that are effectively duplicates, if they are both being used by downstream reports, you can’t just go and delete one of them before you line up its replacement.

Leveraging your data lineage for impact analysis empowers you to foresee the impact of changing a business process and take proper advance action to prevent issues.
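The impact-analysis check described above can be sketched as a downstream walk over the lineage graph. The asset names are hypothetical, and identifying reports by a `report.` name prefix is an assumption made only to keep the example short.

```python
from collections import defaultdict

# Hypothetical lineage edges: (upstream asset, downstream asset).
EDGES = [
    ("warehouse.sales_v1", "report.weekly_sales"),
    ("warehouse.sales_v2", "report.exec_dashboard"),
    ("warehouse.sales_2019_backup", "staging.tmp_export"),
]


def build_downstream_index(edges):
    """Map each asset to the set of assets that consume it directly."""
    downstream = defaultdict(set)
    for source, target in edges:
        downstream[source].add(target)
    return downstream


def impacted_reports(asset, downstream):
    """Return every report reachable downstream of `asset`."""
    reports, seen, stack = set(), set(), [asset]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        for child in downstream.get(node, set()):
            if child.startswith("report."):  # assumed naming convention
                reports.add(child)
            stack.append(child)
    return reports


def safe_to_offload(asset, downstream):
    """An asset is a removal candidate only if no report depends on it."""
    return not impacted_reports(asset, downstream)


downstream = build_downstream_index(EDGES)
for asset in ("warehouse.sales_v1", "warehouse.sales_2019_backup"):
    print(asset, "->", sorted(impacted_reports(asset, downstream)))
```

In this toy graph, `warehouse.sales_v1` and `warehouse.sales_v2` may hold duplicate data, but each still feeds a report, so neither can be dropped until its report is repointed. The backup table feeds no report and is the safer first candidate.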

Now that you’ve identified and eliminated data you don’t need (outdated, redundant, bad quality), it’s time to move on to data that you do need to keep around, but you could store more efficiently.

Step 3: Optimization

Take another look at your data lineage mapping, and ask the following questions about the data you are storing:

What are we using this data for?
How often do we need to access it?
How fast does it need to be available when we do want to access it?

Cloud-based data storage providers usually offer a range of storage levels that vary by their accessibility. For example, Amazon S3 offers Standard storage for frequently accessed data ($0.023 per GB per month), Standard – Infrequent Access storage for data that’s accessed infrequently but should be retrieved in milliseconds when needed ($0.0125 per GB per month), Glacier Flexible Retrieval storage for archive and backup data that should be retrieved in anywhere from 1 minute to 12 hours ($0.0036 per GB per month), and Glacier Deep Archive storage for archive data that's accessed only once or twice a year and can be restored within 12 hours ($0.00099 per GB per month).

Storing 1 TB of data in Standard storage would cost $23 a month. Storing the same 1 TB of data in Glacier Deep Archive Storage would cost $0.99 a month! If your organization currently stuffs all of its data into standard cloud storage without differentiating based on access needs, optimizing your storage can significantly reduce your storage costs.
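The arithmetic behind those figures is easy to reproduce. This sketch uses only the per-GB prices quoted above and assumes a decimal terabyte (1,000 GB); it ignores retrieval, request, and minimum-storage-duration charges, which can matter for the colder tiers.

```python
# Per-GB monthly storage prices as quoted in the text (USD).
PRICE_PER_GB_MONTH = {
    "S3 Standard": 0.023,
    "S3 Standard-IA": 0.0125,
    "S3 Glacier Flexible Retrieval": 0.0036,
    "S3 Glacier Deep Archive": 0.00099,
}

ONE_TB_IN_GB = 1000  # decimal terabyte, which matches the $23 figure


def monthly_cost(gigabytes, tier):
    """Storage-only monthly cost in USD for the given tier."""
    return gigabytes * PRICE_PER_GB_MONTH[tier]


for tier in PRICE_PER_GB_MONTH:
    print(f"{tier}: ${monthly_cost(ONE_TB_IN_GB, tier):.2f} per month")
```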

From Storage to Computing and Back Again

Data lineage can reduce your data storage costs by showing you both:

Which data you can eliminate
Which data you can store more effectively

But that's not all! While less data reduces cloud storage costs, it can also reduce compute costs. Cloud-based data warehouses like Snowflake and Amazon Redshift usually have a pay-per-usage model on compute, charging for the time it takes to run queries across the datasets. The more data you include in your query, the longer it will take to run, and the higher your charge will be.

Reducing the amount of data you’re storing (or keeping in standard storage) will usually mean less data included in your queries, indirectly reducing compute costs. But data lineage also provides you with a direct way to decrease your compute costs: restricting exploration queries.

Exploration queries tend to use a lot of computing power. With a clear data lineage map, your data team can see exactly where the relevant data is, enabling them to run much more targeted queries across the platform, and eliminating or reducing the need for general exploration queries.

Next Steps

If cloud data storage costs are getting you down, it’s time to turn the tables and get them down instead. Just pull out your automated data lineage magic wand and follow these steps: Investigate! Eliminate! Optimize!

See those data storage costs shrink!? Okay, it may take a wee bit more work than that. But when your enterprise gets its next, lower bill from its cloud data services provider, it will still feel magical.

Want to learn more? Request a demo to get started with Cloudera Octopai Data Lineage—an automated data lineage solution that can help you implement these steps and reduce your cloud storage costs today.

Empowering Enterprise AI with Structured Synthetic Data: Preserving Privacy and Source-Statistical Properties

Andreas Tsiartas, Yi-Hsun Tsai, Robert Hryniewicz — Wed, 01 Oct 2025 13:00:00 UTC

In the era of data-driven AI, enterprises need high-quality datasets to analyze or train AI models, yet data privacy regulations and ethical concerns restrict the use or sharing of real-world data. How can organizations innovate without compromising sensitive information?

At Cloudera, we’ve pioneered a solution that bridges this gap. Cloudera’s Synthetic Data Studio—part of the Cloudera AI Studio toolset—is a tool that creates entirely synthetic datasets that mimic an organization's actual data patterns, so organizations can innovate without risk to confidential information.

Key Takeaways

Cloudera’s approach to synthetic data generation offers a blueprint for enterprises wanting to use or share sensitive structured data. The approach illustrates:

Privacy as a feature: Synthetic data becomes a strategic asset that enables innovation in restricted domains

Statistical fidelity matters: Clustering and seed instructions ensure synthetic data retains the nuanced relationships that make models effective

Scalability for enterprise AI: Automated workflows reduce the cost and time of synthetic data generation

The Business Challenge: Leveraging AI Models While Ensuring Compliance

Consider a financial services company striving to predict loan defaults. Real-world data in this domain is a treasure trove of sensitive details: income levels, employment histories, and credit scores. Sharing such data with third parties or AI models is full of regulatory and ethical hurdles.

Traditional synthetic data methods often fall short, failing to capture the nuanced logical relationships between variables—such as how existing debts might influence repayment behavior—or the logical consistency between data points across rows and columns. Companies require a synthetic data solution that can scale, preserve the statistical integrity of the original data, and ensure compliance with privacy standards.

Cloudera’s Solution: Structured Synthetic Data Generation

Cloudera’s solution follows a four-step workflow that incorporates clustering techniques, Cloudera Synthetic Data Studio, and rigorous validation.

Step 1: Profile Data

The journey begins with partitioning and clustering the data to create statistical profiles. By categorizing borrowers into groups based on risk levels—high-risk versus low-risk applicants, for instance—and further clustering numerical variables like loan amounts and interest rates, we distill the dataset into “seed instructions.”

Seed instructions encode the statistical properties of each group, such as means, standard deviations, and correlations, while embedding borrower information such as loan grades or loan statuses. This step ensures that the synthetic data inherits the structure of the original data without exposing sensitive details.

Step 2: Generate Data Using Cloudera Synthetic Data Studio

With these seed instructions in place, the next phase leverages LLM-powered generation. Using advanced models like Llama 3.3-70B-Instruct, we synthesize new records guided by the statistical blueprints seen in the seed instructions. Cloudera Synthetic Data Studio acts as a creative force, generating data that preserves the relationships and patterns defined in the seed instructions.

This is where the magic happens: the model doesn’t just produce random numbers but constructs data that reflects the complexity of real-world scenarios, such as how a borrower’s income might logically influence their repayment history.

Step 3: Filter Data

However, not all generated data meets the required quality. To ensure fidelity, we employ an innovative LLM-as-a-judge workflow.

This step evaluates synthetic outputs against a set of criteria, including formatting consistency, logical coherence (for example, ensuring mortgage accounts align with home ownership status), and realism (for example, generating plausible interest rates). Only data that scores highly—meeting a threshold of 9 out of 10—is retained. This filtering process acts as a quality gate, ensuring that the final dataset is both realistic and statistically robust.
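The quality gate itself is a simple threshold filter, sketched below. The judge here is a rule-based stand-in for an LLM call, and the field names, penalty values, and plausible-rate range are assumptions for illustration only; the one detail taken from the text is the cutoff of 9 out of 10.

```python
from typing import Any, Callable, Dict, List

Record = Dict[str, Any]
THRESHOLD = 9  # keep records scoring at least 9 out of 10


def rule_based_judge(record: Record) -> int:
    """Stand-in for the LLM judge: start at 10 and subtract for failed checks."""
    score = 10
    # Logical coherence: a renter should not hold mortgage accounts.
    if record["mortgage_accounts"] > 0 and record["home_ownership"] == "RENT":
        score -= 5
    # Realism: assumed plausible range for an annual interest rate, in percent.
    if not 0 < record["interest_rate"] < 40:
        score -= 5
    return score


def filter_records(
    records: List[Record],
    judge: Callable[[Record], int] = rule_based_judge,
    threshold: int = THRESHOLD,
) -> List[Record]:
    """Retain only the records whose judge score meets the threshold."""
    return [r for r in records if judge(r) >= threshold]


# Hypothetical generated records.
CANDIDATES = [
    {"home_ownership": "MORTGAGE", "mortgage_accounts": 1, "interest_rate": 7.2},
    {"home_ownership": "RENT", "mortgage_accounts": 2, "interest_rate": 9.1},
    {"home_ownership": "OWN", "mortgage_accounts": 0, "interest_rate": 85.0},
]

kept = filter_records(CANDIDATES)
print(f"kept {len(kept)} of {len(CANDIDATES)} records")
```

Passing the judge in as a function keeps the filter unchanged if the rule-based stand-in is later swapped for a call to a hosted model.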

Step 4: Validate Data

The final phase of the workflow involves statistical and visual validation. By comparing synthetic data to the original dataset using metrics like KL divergence for categorical variables and mean/standard deviation differences for continuous features, we confirm that the synthetic data mirrors the real-world distributions.
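The two kinds of checks mentioned above can be sketched with the standard library. The direction of the divergence (real relative to synthetic), the tiny smoothing constant, and the sample values are assumptions; the source does not specify them.

```python
import math
from collections import Counter
from statistics import mean, stdev


def category_distribution(values, categories, smoothing=1e-9):
    """Relative frequency of each category, smoothed to avoid log(0)."""
    counts = Counter(values)
    total = len(values) + smoothing * len(categories)
    return {c: (counts[c] + smoothing) / total for c in categories}


def kl_divergence(real_values, synthetic_values):
    """KL(real || synthetic) over the union of observed categories, in nats."""
    categories = sorted(set(real_values) | set(synthetic_values))
    p = category_distribution(real_values, categories)
    q = category_distribution(synthetic_values, categories)
    return sum(p[c] * math.log(p[c] / q[c]) for c in categories)


def relative_difference(real_values, synthetic_values, statistic):
    """Absolute difference in a statistic, as a fraction of the real value."""
    real_stat = statistic(real_values)
    return abs(statistic(synthetic_values) - real_stat) / abs(real_stat)


# Hypothetical categorical and continuous columns.
real_grades = ["A", "A", "B", "C"]
synthetic_grades = ["A", "B", "B", "C"]
real_amounts = [10000, 12000, 15000]
synthetic_amounts = [11000, 12500, 14500]

print(f"KL divergence: {kl_divergence(real_grades, synthetic_grades):.4f}")
print(f"mean difference: {relative_difference(real_amounts, synthetic_amounts, mean):.1%}")
print(f"std difference: {relative_difference(real_amounts, synthetic_amounts, stdev):.1%}")
```

A divergence of zero means the categorical distributions match exactly; larger values indicate that the synthetic column has drifted from the original.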

The Impact: Privacy Without Compromise

Cloudera’s approach generates data that is free of personally identifiable information (PII) and sensitive patterns, yet retains the statistical fidelity needed to train accurate models. This enables companies to share synthetic data with third-party systems or collaborate with external partners without fear of data breaches or regulatory penalties.

As shown in Table 1, when we use a Llama 3.3-70B-Instruct model to generate structured loan data (27 columns total), 100% of the generated records match the expected output format, 97.2% contain no logical cross-column errors when judged by an LLM, statistical means deviate 12% from the original distribution, and cross-column correlations deviate by 0.24.

Metric	Result	Interpretation
Data Integrity	100% format accuracy	The synthetic data is a perfect match for the original structure.
Statistical Fidelity	12% mean deviation	The synthetic data accurately mimics the key statistical properties of the original.
Cross-Column Logical Consistency	2.8% logical errors	The generated data reflects real-world logical relationships.
Cross-Column Correlation Preservation	0.24% correlation difference	The key connections between features are authentically preserved.

Table 1: Structured Data Generation Results Using Llama 3.3-70B-Instruct

Conclusion

As AI models grow more complex and privacy regulations tighten, the demand for high-quality, privacy-compliant data will only intensify. In the coming years, we expect structured data generation methodologies to redefine industries from healthcare to finance, where data privacy is non-negotiable.

Cloudera’s structured synthetic data approach shows that enterprises can meet this demand without compromising on privacy or performance. By combining clustering, Cloudera Synthetic Data Studio, and rigorous evaluations, organizations can unlock the full potential of structured data.

If you’re interested in learning more, take our product tour of Cloudera AI Studios, or reach out to our team at ai_feedback@cloudera.com.

A Year-Over-Year Look at AI Challenges and Shifting Perspectives

Cloudera — Tue, 30 Sep 2025 13:30:00 UTC

In just the last few years, artificial intelligence (AI) has exploded across enterprise organizations, with new use cases emerging at a rapid pace. Tools and models like AI agents have introduced new opportunities and innovations that are redefining the marketplace.

Cloudera’s latest report, The Evolution of AI: The State of Enterprise AI and Data Architecture, paints a clear picture. Most organizations have moved beyond experimentation and are integrating AI models into some of the most important facets of their businesses: 96% of IT leaders surveyed say that AI is at least somewhat integrated into core business processes. At the same time, many leaders feel they’ve yet to realize the full potential of AI, and challenges to adoption and secure use of AI persist.

The AI landscape is constantly shifting. So, how does today’s AI environment compare to one year ago? How have attitudes changed? What challenges are enterprise leaders facing when it comes to AI adoption? Let’s dive into some of the biggest shifts.

Confidence in Data is Rising, but with Room to Improve

No matter the industry, maintaining a competitive edge depends on how quickly an organization can make accurate, informed decisions. But going a level deeper, that ability hinges on how well an organization can tap into its own data. For AI to be impactful, IT leaders need to strive to make 100% of their data accessible. Cloudera’s survey reveals a notable gap here: just 9% said that all their data is available and accessible for AI.

Nearly one quarter (24%) of respondents said that they trust their data much more than they did last year, but 41% said they only trusted their data somewhat more. While confidence in data has shown signs of growth, enterprise leaders still hold some security concerns around AI implementation. Of those surveyed, 46% say they’re worried about the security and compliance risks that AI presents. And two of the top concerns relating to AI security are focused on data—50% cite data leakage during model training, and 48% note unauthorized data access as top challenges.

These results are not surprising. Enterprise leaders must maximize value from AI without exposing sensitive data or falling out of compliance. At a time when new regulations are constantly emerging, that can be easier said than done. As organizations strengthen their data architecture and capabilities, governance remains a focal point of any strategy to ensure consistent security.

AI Adoption Challenges Persist

Even as enterprise IT leaders show more trust in their data year over year (YoY), many of the same AI adoption challenges cited in 2024 remain. For example, data integration is still ranked as the top technical limitation in data architectures when supporting AI workloads. Other challenges cited by survey respondents in 2025 included storage performance, compute power, lack of automation, and latency.

While many of the same challenges from 2024 have remained, one of the biggest shifts is the cost to access compute capacity for training models. The number of IT leaders who cite this as a barrier to AI adoption rose from 8% in 2024 to 42% this year—a 34-point jump! As enterprises push for more AI initiatives, with new tools and models, the costs of adoption and operation grow quickly—particularly if the data architecture supporting AI initiatives is not ready to handle more complex systems.

Then there’s the age-old problem of data silos, which have long caused trouble for IT leaders. Breaking down silos is a critical piece of effective AI. When a model is trained on incomplete data, the outputs are vulnerable to inaccuracies that could prove costly. Of the IT leaders surveyed by Cloudera, 61% say that siloed data has at least sometimes negatively impacted their ability to scale AI initiatives, but many are seemingly getting a handle on this problem, with 35% saying this was rarely impacting their own AI initiatives.

What’s Next for AI and Data Architecture

AI is now integrated into some of the most critical business functions across enterprises. As enterprise leaders become more familiar with AI tools and models, the demand for data has accelerated shifts in data architecture. Those shifts have seen organizations become more data-driven culturally, giving leaders more confidence in their organization’s data.

And yet, many of the same challenges surrounding AI adoption and security have remained consistent YoY, while new difficulties around operating costs have emerged.

Wherever an organization finds itself in its AI journey, having the right data architecture and AI infrastructure is critical to establishing long-term success.

Check out the full report and learn more about how Cloudera is helping organizations bring AI to their data, anywhere it resides.

Enterprise AI and Data Architecture in 2025: From Experimentation to Integration

Cloudera — Thu, 25 Sep 2025 13:00:00 UTC

In 2024, Cloudera set out to understand the state of enterprise AI and data architectures, releasing its first survey report on the subject: The State of Enterprise AI and Modern Data Architectures. The results from that survey painted a picture of an enterprise AI landscape where IT leaders were ready to capitalize on AI but struggled with outdated data architectures.

Now, a year later, how are enterprises faring in their AI journeys? To better understand the current state of AI and data architecture, Cloudera fielded a follow-up survey and published the results in a new report: The Evolution of AI: The State of Enterprise AI and Data Architecture.

The survey of 1,574 enterprise IT leaders across the US, EMEA, and APAC shows that AI is moving from experimentation to deep integration, with a focus on data and current data architecture deployments evolving in lockstep.

Let’s dive into the findings.

The State of Enterprise AI: Maximizing Value

This year’s report reveals that enterprise AI has moved from experimentation to full integration in core processes and workflows:

96% of respondents say that AI is at least somewhat integrated into their core business processes
54% say they have significant AI integration
21% say it’s already fully embedded

These numbers make it clear—AI has become table stakes for enterprise success.

And the benefits of AI aren’t something relegated to the abstract or hypothetical. A growing number of IT leaders are seeing real value generated. In fact, most (52%) report they’re significantly successful in realizing measurable value from AI, while only 1% have yet to see results.

So, what types of AI are these organizations utilizing to generate that success? Cloudera’s survey found enterprise IT leaders are tapping into a broad set of AI forms. This includes generative (60%), deep learning (53%), predictive (50%), supervised learning (43%), classification (41%), agentic (36%), and regression (24%) models.

As AI portfolios diversify, the lesson is clear: leaders aren’t relying on a single “hero model” but building collections tuned to use case, risk, and cost. Likewise, they want visibility and control over all their data, not just a subset, so decisions are smarter and AI more effective.

Enterprises are gearing up for newer forms of AI. Agentic capabilities are crossing from experiments to production. Sixty-seven percent feel more prepared to manage agents than a year ago (26% say much more prepared). Already, 36% run agents as a primary model type, and 83% believe investing in agents is essential to maintaining a competitive edge.

Leading organizations will pair guardrails with clear ownership models for agent actions and data access. The pivot from applications to intelligent agents is underway, and success will depend on unifying policies wherever those agents run.

Examining Today’s Data Attitudes and Architectures

Enterprise culture around data is maturing. Eighty-six percent of leaders describe their organization as at least moderately data driven. Those calling their culture extremely data-driven rose to 24%, up from 17% a year ago. That culture shift is accompanied by a growing level of confidence in enterprise data as well.

Among survey respondents, 24% say they trust their organization’s data much more than they did one year ago, and another 41% say they trust their organization’s data somewhat more.

As enterprise leaders look to enable AI at scale, the foundation of data architecture they choose may vary:

63% of organizations are storing their data in private clouds
52% are storing data in public clouds
38% say they rely on on-premises mainframes
32% note they use on-premises distributed options

With data spread across a mix of storage methods, success with AI hinges on an organization’s ability to bring AI to data anywhere: in clouds, data centers, or at the edge.

As Confidence in Data Rises, the Bottlenecks Still Bite

Even as enterprises grow more confident in their data and embrace a wider range of AI models, many adoption and implementation challenges persist. Asked what the biggest technical limitation of their architecture was, respondents chose data integration (37%) as their top issue. This is followed by storage performance (17%), compute power (17%), lack of automation (17%), and latency (12%).

Then there are challenges that have evolved since last year. Compared to 2024, the cost of accessing compute capacity for training AI models is on the rise. One year ago, just 8% of surveyed IT leaders noted these costs were too high. Today, that number has increased to 42%, a 34-point jump!

Many respondents also have challenges around accessing and utilizing their organization's data for AI initiatives. While 38% of global respondents note that most of their organization's data is accessible and usable for these initiatives, just 9% say that all of their data is available. With data inaccessible to AI, these organizations may be missing potential market opportunities or operating with faulty information for decision-making.

Where Are AI and Data Headed Next?

Enterprise leaders are more confident in their data. AI is becoming deeply integrated into core processes, transforming everything from operational efficiency to customer experience. But many have yet to make all of their data accessible to AI. This access gap not only poses serious competitive risks but also means AI initiatives may be less effective than they otherwise could be.

Maximizing the value of AI is critical for the long-term outlook of enterprises, particularly as they seek to scale the technology. Overcoming these challenges starts with understanding internal data needs and prioritizing partners and tools that help bring AI to data wherever it resides.

Read the full report to uncover the current state of AI and data architecture, and learn more about why Cloudera is the only data and AI platform company that large organizations trust to bring AI to their data anywhere it lives.

Revolutionize Your Data Strategy: Unleash the Power of Cloudera Octopai Data Lineage for Seamless Metadata Management and Data Lineage

Cloudera — Thu, 18 Sep 2025 13:00:00 UTC

Today’s data landscape is vast and continues to evolve rapidly. With organizations collecting more data than ever before—across cloud and on-premises platforms and various analytics tools—businesses must navigate an increasingly complex ecosystem of data sources. When data is spread across multiple environments, tracking and understanding its flow becomes complex, error-prone, and time-consuming.

In such complex data ecosystems, metadata and data lineage become the single source of truth, leading to improved data utilization, breaking down data silos, aiding regulatory compliance, and providing AI governance. On the flip side, lacking appropriate metadata and data lineage infrastructure becomes a barrier to achieving actionable insights, and businesses struggle to get a complete view of their data, making it difficult to ensure quality, compliance, and security.

The Challenge in Managing Metadata and Data Lineage Across Various Environments and Tools

Inconsistent Metadata Management

Metadata is often called the "data about data." Metadata can be business-, social-, or operations-related, and it provides essential context to raw data, such as its structure, format, source, and the rules governing its use. When metadata is inconsistent or fragmented across systems, it leads to several challenges, including:

Inconsistent definitions: Different departments or systems may use different terms or definitions for the same data elements. For instance, a customer record in the sales department might not have the same metadata as a customer record in the finance department. This inconsistency creates confusion and reduces the ability to work cross-functionally. The business impact can be significant: sales might report 10,000 active customers based on recent interactions, while finance reports only 7,500 because they define "active" differently. Such discrepancies can lead to misguided strategic decisions, misallocated budgets, and even strained customer relationships due to inconsistent communication across departments.
Difficulties in data discovery: Metadata enables teams to quickly locate the data they need, but when metadata isn’t centralized or well-maintained, it becomes a needle-in-a-haystack situation for data engineers and analysts. Teams waste valuable time searching for the right data and may miss important datasets altogether, resulting in incomplete analyses.
Lack of contextual understanding: Without a clear understanding of how data is structured and its intended use, teams may misinterpret it or apply it incorrectly. For example, if an analyst doesn’t know that a dataset has been cleaned or transformed, they may spend time reprocessing data unnecessarily or using outdated information.
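
To make the "inconsistent definitions" problem concrete, here is a minimal Python sketch. Everything in it is hypothetical: the customer records, the 90-day window, and the two rules are illustrative assumptions, not definitions taken from any real sales or finance system. It only shows how two teams can count the same records and arrive at different "active customer" totals.

```python
from datetime import date

# Toy customer records (made up for illustration only).
CUSTOMERS = [
    {"id": 1, "last_interaction": date(2025, 8, 20), "last_payment": date(2025, 8, 1)},
    {"id": 2, "last_interaction": date(2025, 9, 1), "last_payment": date(2024, 12, 15)},
    {"id": 3, "last_interaction": date(2025, 3, 10), "last_payment": date(2025, 2, 28)},
    {"id": 4, "last_interaction": date(2025, 9, 10), "last_payment": None},
]

AS_OF = date(2025, 9, 18)   # reporting date (assumed)
WINDOW_DAYS = 90            # "recent" threshold (assumed)


def active_for_sales(customer):
    """Hypothetical sales rule: any interaction within the window."""
    return (AS_OF - customer["last_interaction"]).days <= WINDOW_DAYS


def active_for_finance(customer):
    """Hypothetical finance rule: a payment within the window."""
    paid = customer["last_payment"]
    return paid is not None and (AS_OF - paid).days <= WINDOW_DAYS


sales_active = sum(active_for_sales(c) for c in CUSTOMERS)
finance_active = sum(active_for_finance(c) for c in CUSTOMERS)
print(f"sales: {sales_active} active, finance: {finance_active} active")
```

With this toy data, the sales rule counts three active customers and the finance rule counts one. Recording each rule as shared metadata next to the metric is one way to make that gap visible before it reaches a report.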

Poor Data Traceability

Data lineage refers to the traceability of data, including its origins, transformations, and movements throughout an organization's systems. Without clear data lineage, businesses struggle to understand how data flows, where it’s coming from, and how it changes over time. This becomes especially problematic when:

Data is distributed across platforms: Many businesses use a combination of on-premises systems, cloud platforms, and a variety of third-party applications. Each system may use different formats or methodologies for managing metadata and lineage, making it difficult to see a unified view of how data is being used and transformed.
Lack of visibility into transformations: When data moves through multiple stages or systems, it undergoes various transformations. Without clear tracking of these changes, teams can’t confidently rely on the data for analytics, leading to incorrect insights and decisions. Missing or incomplete data lineage also hinders troubleshooting errors or improving processes.
Data traceability gaps: As data moves through pipelines and systems, the traceability is often lost. If teams can’t pinpoint exactly where data has been sourced or how it’s been altered, it becomes a challenge to maintain data integrity and ensure that the data is trustworthy for use in critical decision-making.

Fragmentation from Data Silos

When data is siloed within individual departments or tools, the ability to understand how data moves across the organization is compromised. Data silos cause fragmentation, which exacerbates the challenge of managing metadata and data lineage, including:

Disjointed metadata: As data is stored across multiple systems, metadata often resides in silos as well. Each system might have its own metadata repository, which makes it difficult to maintain a consistent, enterprise-wide understanding of the data’s lifecycle. Without a holistic view of metadata, it becomes nearly impossible to track data lineage accurately.
Inability to integrate new tools: When data is siloed and metadata is not standardized, integrating new tools into the existing ecosystem becomes a monumental task. For example, adding new data sources or analytics tools requires businesses to manually reconcile metadata across systems, which can lead to errors and slow down adoption.
Difficulty in maintaining compliance: As data becomes more fragmented, ensuring that it complies with governance and regulatory standards becomes more challenging. Without a consistent understanding of where data has been and how it’s been altered, businesses cannot guarantee compliance with standards like GDPR, HIPAA, or other industry-specific regulations.

Cloudera Octopai Data Lineage Unifies and Automates Metadata Management and Data Lineage Across Tools

Cloudera Octopai Data Lineage offers a unified, intuitive solution that eliminates the fragmentation caused by data silos and complex integrations, helping organizations strengthen governance and streamline collaboration. Its capabilities act as the backbone of initiatives including data quality, compliance and governance, and cross-team collaboration.

Consistent metadata management: It aggregates metadata from various sources into a single, centralized repository. This ensures that all metadata—whether from cloud platforms, on-premises systems, or third-party tools—is accessible in one place.

Automatic data lineage tracking: It automatically maps and tracks data lineage. This is achieved through intelligent algorithms that scan the data pipelines and connections between systems, creating a visual representation of how data flows across the organization. Data lineage capabilities are multilayered (cross-system, inner-system, and end-to-end column-level), supporting granular governance, debugging, and AI/ML explainability. The result is end-to-end visibility, near real-time updates, and quick error and impact detection.

Breaks down silos with prebuilt connectors: Cloudera Octopai Data Lineage provides more than 60 connectors, covering a range of widely used platforms, including databases, cloud platforms, and ETL and BI tools. While APIs and connectors both serve as means to integrate with other systems and tools, connectors simplify the integration process significantly, providing a ready-to-use interface for connecting to a data source or system without requiring extensive custom development.

Connectors for Apache Hive and Apache Impala workloads on Cloudera platform

Two connectors we want to highlight are those for Apache Hive and Apache Impala, two widely used SQL-based query engines in enterprise data environments. Apache Hive and Impala are critically important in AI/ML workloads, as they are used for staging data, running transformations, and serving real-time analytics.

These connectors offer the following capabilities and benefits:

Seamlessly integrate metadata and data lineage from Hive and Impala into Cloudera Octopai Data Lineage, providing a more complete view of your data ecosystem.

Easily track how data flows and transforms across Hive, Spark and Impala environments, ensuring greater visibility, data quality, and governance.

Accelerate data discovery, enhance collaboration, and improve compliance, all while reducing the complexity of managing metadata across multiple platforms.

What This Means for The Future of Data and AI

Whether managing a small set of data sources or large, complex data ecosystems and AI workloads, Cloudera Octopai Data Lineage is built to scale. Businesses can efficiently manage their metadata and data lineage as their data infrastructure evolves, and have the capabilities and support needed to govern model pipelines, trace training data, and meet AI auditability standards.

In a world where AI is shaping critical decisions, managing data pipelines in isolation is no longer sufficient. Organizations need full transparency into the data entering, flowing through, and leaving AI models. With Cloudera Octopai Data Lineage’s deep lineage and metadata integration, Cloudera extends governance to AI workloads—enabling responsible AI development, deployment, and oversight while ensuring compliance and trust in the data powering AI.

If you would like to know more, then please reach out to your account teams. If you would like to learn about how Cloudera customers are pioneering new use cases then sign up for Cloudera EVOLVE near you.

Cloudera + NVIDIA Deliver AI-Powered Transformation in Financial Services

Andreas Skouloudis — Wed, 17 Sep 2025 13:00:00 UTC

Figure 1: Cloudera and NVIDIA deliver value across the data science lifecycle

In this blog, we will highlight three use cases that showcase how, together, Cloudera and NVIDIA deliver value with analytics and AI for financial services institutions.

NVIDIA RAPIDS Accelerator for Apache Spark for AML/KYC Compliance

The anti-money laundering and know your customer (AML/KYC) compliance lifecycle in large financial organizations is a highly compute-intensive process. This is due to the need to integrate and standardize vast volumes of data across various activities, such as:

Entity resolution, which requires the standardization of cross-border data subject to different data clearance processes and sourced from a wide range of transactional systems and external entities (such as credit card transactions, wire transfers, and SWIFT messages).

Data consolidation from multiple AML/KYC systems that store information in different formats, which must be normalized into a unified schema and structured into data products (such as cross-business-unit AML data marts).

Ongoing transaction monitoring and regulatory reporting that require data processing, enrichment, and the application of rules.

For many Cloudera customers who have implemented AML/KYC use cases, Apache Spark plays a pivotal role in enabling these analytic workloads. Apache Spark is a powerful engine for data engineering, providing capabilities like in-memory computing and distributed processing. However, the surge in transaction volumes and the increasing variety of new data sources for AML/KYC compliance place additional strain on existing compute infrastructure, demanding even greater performance.

The NVIDIA RAPIDS Accelerator for Apache Spark offloads specific data processing operations from CPU to GPU transparently, meaning without any code modifications. As a result, Cloudera customers have experienced performance improvements of up to 20x by using the NVIDIA RAPIDS Accelerator with Apache Spark 3.0 workloads.
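
Because the acceleration is enabled through configuration rather than code changes, a rough sketch of what that configuration might look like can help. The Python helper below only assembles a spark-submit command line as a string. The property names (spark.plugins, spark.rapids.sql.enabled, and the Spark GPU resource settings) are standard Spark and RAPIDS Accelerator options, but the jar path, the application file name, and the GPU and task counts are placeholder assumptions, and a real cluster will need additional, environment-specific settings.

```python
import shlex


def rapids_spark_conf(gpus_per_executor=1, tasks_per_gpu=4):
    """Build a minimal set of Spark properties that turn on the plugin."""
    return {
        # Load the RAPIDS Accelerator SQL plugin.
        "spark.plugins": "com.nvidia.spark.SQLPlugin",
        # Allow supported SQL/DataFrame operations to run on the GPU.
        "spark.rapids.sql.enabled": "true",
        # Standard Spark GPU scheduling: GPUs per executor and the
        # fraction of a GPU each task claims.
        "spark.executor.resource.gpu.amount": str(gpus_per_executor),
        "spark.task.resource.gpu.amount": str(gpus_per_executor / tasks_per_gpu),
    }


def spark_submit_command(app, jar, conf):
    """Render a spark-submit invocation; the job code itself is unchanged."""
    parts = ["spark-submit", "--jars", jar]
    for key, value in conf.items():
        parts += ["--conf", f"{key}={value}"]
    parts.append(app)
    return " ".join(shlex.quote(p) for p in parts)


conf = rapids_spark_conf()
# Both paths below are hypothetical placeholders.
command = spark_submit_command(
    app="aml_entity_resolution.py",
    jar="/opt/jars/rapids-4-spark.jar",
    conf=conf,
)
print(command)
```

The existing Spark job is passed through untouched, which is the point of the "no code modifications" claim. Operations the accelerator does not support continue to run on the CPU.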

NVIDIA NIM Microservices for Fraud Prevention in Payments

Two of the greatest challenges in fraud prevention are the explosion in transaction volumes in digital and credit card payments and the increasing sophistication of fraud techniques. These factors have led to resource contention and scalability challenges for AI/ML inference, necessitating the deployment of multiple composable AI/ML models to address emerging fraud methods.

To tackle these challenges, the Cloudera AI Inference service includes NVIDIA NIM microservices, which are designed to deliver high-performance, low-latency, and high-throughput inference for fraud prevention AI models on NVIDIA accelerated computing. For example, by using NVIDIA NIM, Cloudera AI Inference service can deliver up to 6x performance improvement for PyTorch models (using the Torch-TensorRT library) and a 2.5x improvement for TensorFlow models (using the TF-TensorRT library), both of which are widely used in payments fraud prevention.

In addition, the Cloudera AI Inference service accelerates inference requests executed on NVIDIA accelerated computing by leveraging NVIDIA’s dynamic batching feature. This feature enables the combination of server-side inference requests, avoiding the inefficiency of processing one request at a time, which leaves much of the GPU idle. As a result, the Cloudera AI Inference service with NVIDIA NIM improves GPU utilization, reducing future GPU capital expenditures to meet growing demands for fraud prevention.
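
The idea behind dynamic batching can be illustrated with a small simulation. The Python sketch below is a simplified, offline model and not NVIDIA's implementation: the Request type, the max_batch_size and max_delay_ms parameters, and the arrival times are all assumptions made for illustration. It groups requests that arrive close together so that the model is invoked once per batch rather than once per request.

```python
from dataclasses import dataclass


@dataclass
class Request:
    request_id: int
    arrival_ms: float


def form_batches(requests, max_batch_size=4, max_delay_ms=5.0):
    """Group requests into batches.

    A batch is closed when it is full, or when a new request arrives
    more than max_delay_ms after the batch's first request.
    """
    batches = []
    current = []
    for req in sorted(requests, key=lambda r: r.arrival_ms):
        # Close the open batch if its oldest request has waited too long.
        if current and req.arrival_ms - current[0].arrival_ms > max_delay_ms:
            batches.append(current)
            current = []
        current.append(req)
        # Close the batch as soon as it reaches the size limit.
        if len(current) == max_batch_size:
            batches.append(current)
            current = []
    if current:
        batches.append(current)
    return batches


# Hypothetical arrival times in milliseconds.
arrivals = [0, 1, 2, 3, 4, 20, 21]
requests = [Request(i, t) for i, t in enumerate(arrivals)]
batches = form_batches(requests)
batch_sizes = [len(b) for b in batches]
print(f"{len(requests)} requests served in {len(batches)} batches: {batch_sizes}")
```

In this toy run, seven requests are served in three model invocations (batches of 4, 1, and 2) instead of seven. The trade-off is a small, bounded wait for each request in exchange for better use of the GPU.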

NVIDIA AI-Q Blueprint for Loan Origination in Retail Banking

Credit underwriting is an important capability in banking, spanning many different lending activities such as mortgages, credit card lending, commercial banking, and trade finance. These processes have historically been inefficient given the number of activities involved in the origination process, from application submission to funding, and the numerous roles participating in the decision process.

While traditional AI/ML models can streamline many individual activities in the loan origination workflow, the process from the customer’s perspective still feels slow and fragmented. This is where agentic AI can have a significant impact: in this context, agentic AI can reduce the effort required to collect, summarize information, and draft credit decisions. It can also deliver a personalized and consistent lending experience by standardizing reviews during the approval process. Additionally, it can deliver personalized product recommendations based on the customer’s behaviors and spending patterns, with a multiple-agent workflow that orchestrates various tools, data, and AI agents.

By leveraging NVIDIA AI-Q Blueprint on NVIDIA accelerated computing with the Cloudera AI Inference service, banking organizations can achieve this transformative vision. For example, by using AI-Q Blueprint, Cloudera can orchestrate a multi-agent workflow that includes a GenAI-based personalized loan advisor deployed on NVIDIA NIM, an AI-based document processing agent leveraging optical character recognition (OCR) and natural language processing (NLP) techniques, and existing credit decisioning tools.

Next Steps

The combined power of Cloudera’s unified, cloud-anywhere data platform and NVIDIA’s hardware and software capabilities offers a holistic solution for the development of agentic AI solutions.

Visit this page to learn more about the Cloudera AI Inference service.
Read this whitepaper by Enterprise Strategy Group to learn about the Cloudera + NVIDIA joint value proposition.

Cloudera and NVIDIA enable organizations to streamline complex data pipelines at scale by combining Cloudera’s data management capabilities with NVIDIA’s full-stack services:

Data processing with Apache Spark on Cloudera and NVIDIA RAPIDS Accelerator for Apache Spark streamlines execution of feature engineering and data engineering workloads.

AI/ML model deployment with Cloudera AI Inference and NVIDIA NIM microservices improves the throughput and latency performance of artificial intelligence (AI) models (both traditional AI/ML and generative AI).

Agentic AI orchestration with NVIDIA AI-Q Blueprint enables the integration of AI agents with private data and the interaction with other systems through APIs.

#ClouderaLife Employee Spotlight: Meet Amy Nelson, Cloudera’s Chief Human Resources Officer

Debbie Kruger — Tue, 16 Sep 2025 13:00:00 UTC

“At Cloudera, our greatest strength is our people. What sets us apart is how we empower employees to think independently, act with autonomy, and make a real impact,” Amy Nelson says.

That belief has guided Amy throughout her career and continues to define her leadership at Cloudera. As Chief Human Resources Officer, Amy is at the center of this people-first approach, shaping a workplace where individuals feel valued and empowered.

Let’s meet Amy Nelson and learn about her journey at Cloudera, the culture she’s helping to create, and how she empowers Clouderans to thrive and give back.

Meet Amy Nelson

Amy oversees everything from workforce planning and leadership development to inclusion and engagement programs, always keeping the community at the heart of her work.

That philosophy comes to life in the initiatives Amy drives: expanding learning and development programs, embedding purpose through Cloudera Cares, and advancing accessibility so every employee feels supported. Each effort reinforces her larger promise: to make Cloudera a great place to work, where people truly belong.

What Drew Amy to Cloudera

Amy’s career has always been rooted in people and purpose. When she encountered Cloudera, she saw a company that mirrored her values.

“What initially drew me to the company was its strong commitment to innovation and people-first culture,” she says. “I saw an opportunity to contribute to a company that truly believed in aligning talent strategy with long-term growth.”

Since then, her role has expanded far beyond traditional HR. “The past few years have pushed HR to the forefront of business strategy,” she says. “I’m proud to have helped guide the company through times of change with empathy and purpose.”

From recognition as a Great Place to Work in multiple countries to strengthening career development pathways, Amy has helped Cloudera evolve while keeping its culture grounded in belonging and resilience.

Creating a Workplace Where Everyone Can Thrive

For Amy, a strong employee experience blends vision with practical support. Through her efforts, Cloudera has introduced personalized learning paths, leadership programs, and clear growth frameworks to help employees see a future for themselves at the company.

Listening is central to Amy’s approach. “We place so much emphasis on employee feedback, especially through our engagement surveys,” she says. “We treat that feedback as a strategic input, not just a data point. It helps us make better decisions and evolve our culture in real, responsive ways.”

The employee value proposition has also been reshaped through that input. Focus groups and surveys helped Cloudera articulate what employees already felt: this is a place where your ideas matter, your work shapes the future, and your voice truly counts.

Amy has also championed accessibility and inclusivity. Participation in the Disability Index led to meaningful changes such as flexible work policies, enhanced benefits, and home office stipends. “Ultimately, our goal is to create an environment where every employee, whether they identify as having a disability or not, feels supported, empowered, and able to contribute meaningfully,” she says.

She credits Cloudera’s Learning & Enrichment team for making these programs effective. “Accessibility is another key strength,” she says. “Whether virtual, in-person, or self-led learning, our programs are designed to be flexible and inclusive, giving employees the freedom to grow in a way that fits their learning style and schedule.”

Empowering Clouderans to Give Back

Amy believes people are more engaged and prouder of their work when connected to it.

Through Cloudera Cares, the company’s corporate social responsibility program, she helps create a culture of giving back that embodies the collective spirit of Clouderans worldwide, reinforcing the company’s mission to create positive impact inside and outside the workplace.

For Amy, this passion for service reflects Cloudera’s identity. “Giving back has always been a natural extension of who we are at Cloudera,” she says. “The passion our employees bring to their work is the same passion they bring to their communities, and that’s something we’re incredibly proud of.”

By embedding purpose into the employee experience, Amy helps Clouderans unite globally to deliver innovation and create lasting social impact.

Hiring and Scaling in a Fast-Changing Tech Landscape

As Cloudera grows, Amy ensures its people strategy evolves with it.

“We’re leveraging the right mix of tools, technology, and human connection to make the hiring journey seamless and engaging from first touchpoint to offer decision,” she says.

Looking ahead, she is focused on building teams that reflect diverse perspectives and creating an environment where new talent can contribute, grow, and lead. This forward-looking approach ensures Cloudera is meeting today’s needs and building the foundation for its future.

Closing Thoughts

As Amy looks ahead, her vision for Cloudera is rooted in purpose and belonging. “I want us to continue building an environment where every employee, no matter where they are in the world or in their career, feels a deep sense of purpose and belonging here,” she says. “I want Cloudera to be known not just as a place where people want to work, but a place they’re proud to be part of—a community that supports their growth, reflects their values, and inspires their best every day.”

Her advice for those interested in joining Cloudera reflects the company’s fast-moving, collaborative spirit. “Be ready to collaborate with some of the brightest, most passionate people in the data space,” she says. “Things evolve quickly, but that means you’ll constantly have the chance to grow your skills and make an impact.”

Amy’s journey proves that when people feel supported, valued, and connected to something bigger, they can achieve extraordinary things for themselves, their company, and their communities.

Hear from another Clouderan and explore career opportunities at Cloudera.

“It’s not just about giving back, it’s about embedding purpose into our culture,” she says.

“For me, creating a standout employee experience starts with a simple but powerful belief: everyone should feel like they belong and that their voice matters,” she says. “From day one at Cloudera, I’ve championed the idea that culture isn’t just built for employees, it’s built with them.”

Austin Week of Learning

Savanna Morris — Mon, 15 Sep 2025 13:00:00 UTC

At Cloudera, summer interns are vital to the day-to-day success of our business. But the role interns take on is about much more than just learning a business and supporting teams. It’s about taking on opportunities to learn and grow too.

Recently, one of our 2025 summer interns, Savanna Morris, spent some time in Austin, Texas exploring Cloudera’s Week of Learning and all that the event had to offer for those looking to grow in their careers.

Let’s hear from Savanna about what that experience was like:

As part of my summer internship with Cloudera, I had the opportunity to visit Austin, Texas, for "The Week of Learning," an event dedicated to fostering knowledge and skill development across various disciplines. From interactive workshops to community-driven volunteering events, the week offered a diverse range of opportunities for attendees to expand their horizons and gather insight into their roles at Cloudera.

Highlights of the Week

The event kicked off with the first workshop of the day, “Unlocking your Conflict Style,” featuring Raena Mareder, the Manager of Learning and Enrichment, and Misha D’Andrea, the Learning and Enrichment Partner, who both set an enthusiastic tone for the days to follow. The workshop, with approximately twenty attendees in the first session, offered thoroughly engaging content. Not only were the activities interactive, but they also fostered greater collaboration among colleagues. As a result, I had the opportunity to connect with coworkers from various departments who had flown in from all over the States.

Workshops & Interactive Sessions

A key component of The Week of Learning was its focus on hands-on experiences. Attendees had the chance to participate in workshops covering topics such as:

Unlocking your Conflict Style: This informative workshop discussed conflict styles applicable to different workplace scenarios. As part of the activities, we took the Thomas-Kilmann Conflict Mode assessment to discover our most prominent conflict style. Together with the assessment, the workshop provided a deeper understanding of personal management styles and their individual shortcomings, fostering self-discovery.
Communicate to Connect: Complementing “Unlocking your Conflict Style,” this workshop offered practical insight into honing communication skills through storytelling. For one activity, we were tasked with matching workplace “stories” to common story arcs. In their concluding remarks, Misha and Raena shared insights on public speaking, emphasizing the core principle of "connecting with the audience, through presence, with yourself."
Udemy Lunch and Learn: Complementing the workshops, a "Lunch and Learn" session was held in collaboration with Udemy, focusing on AI enablement. Udemy provides access to over a thousand AI-related courses, with new content added daily. Employees have full access to Udemy's extensive catalog, including certification courses, to enhance their AI knowledge. Shayde Christian, Chief Data Officer, concluded the event by answering questions about AI, offering insightful knowledge on this growing industry.
Conscious Leadership: Misha and Raena wrapped up Austin’s Week of Learning with a highly interactive workshop focused on “conscious leadership.” Members were asked to self-segregate into card-based groups, highlighting our human tendency to gravitate towards similarities rather than embracing differences. For the subsequent activity, attendees were asked to refer to the Ladder of Inference, a method designed to foster a conscious leadership mindset when approaching decisions based on sets of information. Using the Ladder of Inference, attendees were introduced to increasing levels of data to come up with the best course of action. The result was effective collaboration as well as utilization of differing perspectives.

Season of Service Opportunity

In addition to the variety of workshop opportunities, attendees also had a chance to participate in a service opportunity with SAFE, a non-profit dedicated to assisting survivors of child abuse, sexual assault, trafficking, and domestic violence. SAFE’s Volunteer Services Director, Stefanie Lebens, spoke on their effectiveness in managing these sensitive situations while also introducing their work to many of the attendees. Following the presentation, approximately thirty Clouderans filled baskets with essential household items such as dish sets, soap, sponges, and trash bag rolls. To further connect with the recipients, attendees also included personalized encouraging notes in each basket. The event fostered a positive and supportive community and atmosphere, with all volunteers demonstrating hearts ready and willing to serve.

Personal Reflection

Attending Austin’s Week of Learning was truly an unforgettable experience. The Learning & Enrichment team did a wonderful job planning this event. As a remote intern working in Global Communications, I found that the event fostered connections with fellow Clouderans, exposed me to office culture, and helped me develop leadership skills, and I learned more about myself in the process. I will apply the skillsets I was exposed to and encouraged to develop to my life moving forward. Additionally, in an increasingly remote world, the in-person interaction among coworkers was refreshing! I was able to meet so many talented individuals and interact with them in ways I would not have been able to remotely. I am truly grateful for this incredible opportunity.

Looking Forward

The success of The Week of Learning underscores Cloudera’s commitment to continuous growth and education. We can always improve and learn at any stage in our careers to become better, more effective colleagues. I wholeheartedly encourage participating in future Week of Learning workshops and events. Not only will it give you an excellent opportunity to connect with colleagues from various departments and locations, but it will also give you valuable skills applicable to all aspects of your life.

Learn more about how Cloudera is furthering its commitment to fostering growth and education opportunities.

Reduce Data Management and Hosting Costs with Data Lineage

Ron Pick — Mon, 15 Sep 2025 13:00:00 UTC

Data lineage can help large organizations reduce costs across various areas. Here are some common expenditures where data lineage can be beneficial:

Infrastructure and storage: Data lineage allows organizations to understand data usage patterns, access frequencies, and uncover data dependencies. By analyzing this information, organizations can optimize their infrastructure and storage strategies, avoiding unnecessary storage costs and efficiently allocating resources based on data usage patterns.
Data integration and ETL: Large organizations often deal with complex data integration and extract, transform, load (ETL) processes. Data lineage helps identify redundant or inefficient data integration steps, allowing organizations to streamline their processes and reduce development, maintenance, and operational costs associated with ETL.
Data quality: Poor data quality can result in significant costs for organizations. Data lineage helps trace data quality issues back to their source, enabling organizations to identify the responsible processes or systems. By addressing these issues at their root, organizations can reduce costs associated with data cleansing, error correction, and rework caused by poor data quality.
Regulatory compliance: Compliance with data regulations is essential for large organizations, and non-compliance can result in substantial penalties. Data lineage provides transparency into data flows, transformations, and access, supporting organizations in demonstrating compliance and reducing the risk of costly violations.
Analytics and reporting: Data lineage facilitates data discovery and understanding of data sources, transformations, and calculations. By empowering data analysts and business users with self-service analytics capabilities through data lineage, organizations can reduce the time and effort spent on data exploration, preparation, and reporting.
Impact analysis: When making changes to data sources, structures, or processes, organizations need to understand the downstream impact. Data lineage enables impact analysis by tracing the flow of data and identifying the systems, reports, or applications affected by changes. By conducting thorough impact analysis, organizations can mitigate the risks of costly errors and minimize the associated costs.
Operational support: Data lineage provides insights into data dependencies and the relationships between different systems or processes. This information helps organizations troubleshoot issues, identify bottlenecks, and optimize performance. By resolving issues more efficiently and reducing downtime, organizations can lower operational and support costs.
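
The impact analysis item above amounts to a graph traversal. Here is a minimal sketch, assuming lineage is already available as a mapping from each asset to its direct consumers. The asset names and the `downstream_of` helper are hypothetical and are not part of any Cloudera API.

```python
from collections import deque

# Hypothetical lineage graph: each asset maps to the assets that read from it.
LINEAGE = {
    "crm.customers": ["etl.clean_customers"],
    "etl.clean_customers": ["dw.dim_customer"],
    "dw.dim_customer": ["report.churn", "report.revenue"],
    "erp.orders": ["dw.fact_orders"],
    "dw.fact_orders": ["report.revenue"],
}


def downstream_of(asset, lineage):
    """Return every asset reachable from `asset`, in breadth-first order."""
    seen = set()
    order = []
    queue = deque([asset])
    while queue:
        current = queue.popleft()
        for consumer in lineage.get(current, []):
            if consumer not in seen:
                seen.add(consumer)
                order.append(consumer)
                queue.append(consumer)
    return order


# Everything that could be affected by a change to the CRM customer table.
impacted = downstream_of("crm.customers", LINEAGE)
print(impacted)
```

Running the traversal before a schema change gives a checklist of the jobs and reports to retest, which is where the cost saving comes from.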

It’s important to note that the specific impact areas where data lineage can help will vary depending on the organization’s industry, data landscape, and specific challenges. Conducting a thorough assessment of the organization’s data ecosystem and understanding its pain points will help identify the areas where data lineage can provide the most significant cost reduction and process improvement opportunities.

How Data Lineage Can Help Organizations Save

Data lineage can help organizations reduce costs by providing valuable insights into the origin, movement, and transformation of data throughout its lifecycle. Here are some ways to leverage data lineage to reduce costs:

Identify unnecessary data processes: Data lineage allows you to trace the path of data from its source to its destination, enabling you to identify redundant or unnecessary data processes. By eliminating these redundant processes, you can reduce resource consumption and associated costs.
Optimize data storage: Data lineage helps you understand which datasets are frequently accessed and which ones are seldom used. By analyzing this information, you can optimize your data storage strategies, such as implementing tiered storage or archiving infrequently accessed data. This approach helps reduce storage costs by allocating resources more efficiently.
Identify data quality issues: Poor data quality can lead to increased costs due to errors, rework, and inefficiencies. By leveraging data lineage, you can track the origin of data quality issues, identify the responsible processes or systems, and take corrective actions. Improving data quality reduces the need for data cleansing and error correction, leading to cost savings.
Streamline data integration processes: Data lineage enables you to understand how different data sources are integrated into your systems. By analyzing the lineage, you can identify complex and inefficient data integration processes. Simplifying and streamlining these processes can reduce development, maintenance, and operational costs.
Enhance data governance: Data lineage provides transparency into data flows, transformations, and dependencies, supporting robust data governance practices. Effective data governance ensures compliance with regulations, reduces the risk of data breaches or non-compliance penalties, and avoids associated costs.
Support impact analysis: Data lineage helps you understand how changes in data sources, structures, or processes impact downstream systems and applications. By conducting impact analysis, you can identify potential risks, assess the cost implications of changes, and make informed decisions, thereby minimizing the chances of costly errors.
Facilitate data discovery and self-service analytics: Data lineage helps data consumers easily discover relevant datasets and understand their lineage. By empowering users to explore and access data independently, you can reduce the time and effort spent by data engineers or analysts in fulfilling data requests, leading to cost savings.
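
The storage optimization item above can be illustrated with a small classification pass. This is a sketch only, assuming read counts and downstream consumer counts have already been collected from lineage or audit metadata. The dataset names, thresholds, and tier labels are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class DatasetUsage:
    name: str
    reads_last_90_days: int
    downstream_consumers: int


def suggest_tier(usage, hot_reads=100):
    """Suggest a storage tier from usage; the thresholds are illustrative only."""
    if usage.reads_last_90_days == 0 and usage.downstream_consumers == 0:
        # Nothing reads it and nothing depends on it: review for archiving.
        return "archive-candidate"
    if usage.reads_last_90_days >= hot_reads:
        return "hot"
    return "cold"


usages = [
    DatasetUsage("dw.fact_orders", 4200, 3),
    DatasetUsage("staging.orders_2019", 0, 0),
    DatasetUsage("dw.dim_region", 12, 1),
]
tiers = {u.name: suggest_tier(u) for u in usages}
print(tiers)
```

Checking downstream consumers as well as read counts guards against archiving a dataset that is rarely queried directly but still feeds other processes.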

Remember that leveraging data lineage effectively requires proper data governance, documentation, and tools for capturing and visualizing lineage information. It is also crucial to regularly review and update the lineage information as data and processes evolve over time.

Automating Data Lineage with Cloudera

Ready to cut costs and improve efficiency? Request a demo to get started with Cloudera Octopai Data Lineage today.

Redefining AI Leadership: Inside the Rise of the Chief AI Enablement Officer

Cloudera — Fri, 12 Sep 2025 13:00:00 UTC

AI is moving from experimentation to execution. Yet as adoption scales across the enterprise, one truth is becoming clear: tools alone are not enough. Success hinges on the people, processes, and leadership that bring AI into daily business operations.

On a recent episode of The AI Forecast, host Paul Muller sat down with Donna Beasley, Cloudera’s first-ever Chief AI Enablement Officer, to explore this newly emerging role, the challenges of scaling adoption, and what it takes to build organizational readiness when no blueprint exists.

AI’s Impact Relies on Operational Discipline

Paul: AI’s impact still hinges on the basic principles of operational discipline and execution. Yet a McKinsey survey showed that while 78% of companies use AI, only 17% saw a meaningful contribution to earnings before interest and taxes (EBIT). Why is the impact not showing up in the results?

Donna: You can’t push this on people, nor can you hold people back. You’ve got to meet them where they are and then help them just take whatever that next step is. For many people, the first encounters with AI feel uncertain—even intimidating. Some worry it could replace their role, while others don’t know where to begin. I focus on creating space for employees to explore without pressure. The goal isn’t to force everyone into the same pace of change, but to ensure everyone can see how AI connects to their work. Progress comes in steady steps. When someone sees a colleague using AI effectively, they’re far more likely to return and be ready to try it themselves. That momentum—built gradually and reinforced through real examples—turns curiosity into measurable business impact.

The Chief AI Enablement Officer Role Has No Blueprint

Paul: You’ve stepped into a brand-new role at Cloudera, and unlike other executive positions there’s no predecessor, no set KPIs. How did you think about defining success in the absence of a blueprint?

Donna: There is no right or wrong answer to this. We’re kind of forging this path as we go forward. The advantage here is that Cloudera already had guardrails in place—an AI council, security guidelines, and governance. That foundation meant I could focus on putting tools in people’s hands, building confidence, and creating pathways from casual use to real innovation.

I approached success in phases. First, we ensured everyone had access to AI tools so they could start experimenting. Next, I focused on departments eager to adopt and show early wins. From there, the goal has been to build advanced learning paths for power users who want to go deeper. We track progress through adoption metrics and by looking at what people are creating—like the internal tools employees have started building for their teams. That phased approach ensures we’re not just experimenting, but turning new capabilities into practices that can scale across the organization.

Success Starts with the Right Departments

Paul: If you’re advising other companies on where to start, which departments provide the most natural footholds for AI adoption?

Donna: Marketing is absolutely the place where it makes a lot of sense to start. The outputs are already designed for public sharing, so you can sidestep some of the trickier data concerns. Once one team demonstrates success, others line up. Our CMO wanted marketing to be the poster child, and from there, adoption spread quickly to engineering, sales, and beyond.

Each department comes with a different level of complexity. Marketing provides the fastest wins because the work is already meant for public use, which lowers data risk. Sales brings strong opportunities, but requires careful governance since customer information is involved. Engineering is a natural fit because developers already operate within strict guardrails and coding practices. That’s why I always suggest starting where adoption is easiest. Early successes create momentum, and once employees see tangible results, adoption expands naturally across the business without forcing it in areas with higher risks.

Organizational Change Requires Trust and Patience

Paul: One of the biggest challenges leaders face isn’t technical at all. It’s about trust, resistance, and change management. How do you help employees move past fear and skepticism?

Donna: It’s much more carrot than stick. If an idea or practice is powerful enough, the value will come through in the long run. Sometimes the tools don’t work perfectly at the onset, and I’m upfront about that reality. It takes the pressure off people. If they’re not ready today, that’s fine—because once they see colleagues using AI successfully, curiosity takes over. My role is to meet them at their pace and make adoption approachable.

Fear and resistance are normal reactions when people are asked to change how they work. I focus on building trust through transparency—acknowledging when tools don’t perform as expected and reminding teams that a human must always be in the loop. That openness helps take the pressure off and makes adoption feel less risky. I also use peer examples to create positive momentum: when small pilot groups demonstrate success, others naturally want to join. By letting curiosity and proof drive the process, adoption spreads more smoothly and scales in a way that feels approachable rather than forced.

Catch the full conversation with Donna Beasley on The AI Forecast on Spotify, Apple Podcasts, and YouTube.

Unlocking Enterprise AI Potential: Knowledge Distillation for Customer Support Analytics

Andreas Tsiartas, Yi-Hsun Tsai, Jugoslav Djajic, Robert Hryniewicz — Thu, 11 Sep 2025 11:00:00 UTC

Business Challenge: Balancing AI Model Speed and Accuracy Without Compromising Data Privacy

Cloudera’s customer support team uses AI models to analyze and summarize customer support tickets in real time. The system takes comments from customers or Cloudera support agents as input, analyzes each comment, and extracts a set of analytics, such as sentiment and a summary. These analytics are essential to improving the customer experience at Cloudera.

Due to the sensitive nature of the customer data being processed in this pipeline, only models running in local environments can be used and no customer data can be shared with any external sources.

Initially, the team analyzed the comments with a large local LLM (Goliath 120B). It met basic performance requirements but lagged in speed and generation quality: requests arrived every 30 seconds and took 12-15 seconds each to process on average. Adherence to the expected output structure was 77.5%, and generation accuracy was lower than that of proprietary models, which created a bottleneck for scalability and LLM performance.

The challenges of using large local LLMs (Goliath 120B) were clear: slower response times, increased costs, lower generation accuracy than state-of-the-art, cloud-based models, and compliance risks.

Large organizations face similar trade-offs—balancing AI accuracy and speed against the risks of data exposure.

Cloudera’s Solution: Knowledge Distillation with Private Data

Cloudera’s breakthrough lies in a privacy-first approach to knowledge distillation.

Instead of training models on raw customer data, which carried regulatory and exposure risks, we generated synthetic datasets using Cloudera Synthetic Data Studio. This new low-code tool in Cloudera AI mimicked real-world interactions (technical questions, troubleshooting scenarios, and more) without ever exposing private information.

Generating synthetic customer support interactions had regulatory and exposure benefits. It also let the team send the synthetic data to state-of-the-art, cloud-based LLMs to extract insights such as customer sentiment. These cloud-based LLMs extracted information much more accurately than large local LLMs, which made them an ideal source for distillation.

Cloudera’s synthetic data solution eliminated compliance and privacy risks and produced higher-quality synthetic data than the existing large, local LLMs could. This approach unlocked the option to distill knowledge from state-of-the-art models into small LLMs and solve the same problem as Goliath 120B at a lower cost and with higher accuracy.

Our Process

Data generation: Using the Synthetic Data Studio data generation workflow, we crafted a prompt instructing Claude Sonnet to generate customer questions and answers. The prompt instructs the LLM to create customer support questions and answers, impose the tone, and detail the structure. In addition, we provide a list of topics that appear in real-world data (such as customer support for Cloudera AI or Cloudera Data Warehouse) and use seed topics to ensure both diverse and real-world customer support ticket generation.

Evaluation: Using the Synthetic Data Studio evaluation workflow, the team crafted a prompt instructing an LLM-as-a-judge on how to evaluate the quality of the generated data, then filtered out low-quality samples.

Fine-tuning: Using only the filtered data, the team split the data into training and development sets and distilled knowledge from the Claude Sonnet model into a Meta Llama3.1-8B-instruct model. The team ran multiple experiments to select the fine-tuning parameters that maximized the performance of the distilled LLM.
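
The data preparation steps in this process can be sketched in plain Python. This is an illustration under stated assumptions and does not use the Synthetic Data Studio API. The prompt wording, the tone list, the 1-5 judge scale, the score threshold, and every function name are hypothetical. The source confirms only that seed topics and tone are part of the prompt, that low-quality samples are filtered out, and that the data is split before fine-tuning.

```python
import itertools
import random

# Hypothetical seeds: the article names these two topics as examples only.
SEED_TOPICS = ["Cloudera AI", "Cloudera Data Warehouse"]
TONES = ["frustrated", "neutral"]

PROMPT_TEMPLATE = (
    "Write one customer support question and the support agent's answer.\n"
    "Topic: {topic}\n"
    "Customer tone: {tone}\n"
    "Format: a 'Question:' line followed by an 'Answer:' line."
)


def build_prompts(topics, tones):
    """Return one generation prompt per (topic, tone) combination."""
    return [
        PROMPT_TEMPLATE.format(topic=topic, tone=tone)
        for topic, tone in itertools.product(topics, tones)
    ]


def filter_by_judge_score(samples, min_score):
    """Keep samples whose judge score meets the threshold."""
    return [s for s in samples if s["judge_score"] >= min_score]


def train_dev_split(samples, dev_fraction=0.2, seed=0):
    """Shuffle deterministically, then split into train and development sets."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    dev_size = int(len(shuffled) * dev_fraction)
    return shuffled[dev_size:], shuffled[:dev_size]


prompts = build_prompts(SEED_TOPICS, TONES)

# Stand-in for generated samples that a judge model has already scored (1-5).
scores = [5, 2, 4, 4, 1, 5, 3, 5, 4, 5]
samples = [{"id": i, "judge_score": s} for i, s in enumerate(scores)]

kept = filter_by_judge_score(samples, min_score=4)
train, dev = train_dev_split(kept)
print(len(prompts), len(kept), len(train), len(dev))
```

Fixing the shuffle seed keeps the development set stable across fine-tuning experiments, so runs with different parameters stay comparable.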

Using both human and automated LLM-as-a-judge evaluations, the team scored real-world customer support ticketing questions and answers. Cloudera’s team focused on answers where the deployed and distilled LLMs differed and reported the win rate of each LLM. In addition, they measured speed improvements in terms of average running time, adherence to the expected output, and cost to deploy the model.
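
Two of these metrics, win rate and output adherence, are simple to compute once verdicts and raw model outputs are collected. The sketch below assumes the expected output is a JSON object with `sentiment` and `summary` keys. The article does not specify that schema, so treat it as an assumption, and the sample verdicts and outputs are made up for illustration.

```python
import json
from collections import Counter

REQUIRED_KEYS = {"sentiment", "summary"}  # assumed schema, not from the article


def win_rates(judgments):
    """Return each model's share of wins among the compared answers."""
    counts = Counter(judgments)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {model: wins / total for model, wins in counts.items()}


def adheres(raw_output):
    """True if the output parses as a JSON object with the required keys."""
    try:
        parsed = json.loads(raw_output)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and REQUIRED_KEYS.issubset(parsed)


def adherence_rate(outputs):
    """Fraction of outputs that match the expected structure."""
    if not outputs:
        return 0.0
    return sum(adheres(o) for o in outputs) / len(outputs)


# Hypothetical verdicts, one per ticket where the two models disagreed.
judgments = ["distilled"] * 7 + ["deployed"] * 3
rates = win_rates(judgments)

# Hypothetical raw outputs: two well-formed, two not.
outputs = [
    '{"sentiment": "negative", "summary": "Cluster upgrade failed."}',
    "Sentiment: negative. The upgrade failed.",
    '{"sentiment": "positive"}',
    '{"sentiment": "neutral", "summary": "Asked about licensing."}',
]
rate = adherence_rate(outputs)
print(rates, rate)
```

The same `win_rates` function works for both judge types, since a human evaluator and an LLM judge each produce one verdict per compared pair.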

The Results

Improved speed: Processing time dropped 95%.

Better output structure: Output adherence rose from 77.5% to 99.5%.

Higher LLM accuracy: When comparing the smaller distilled LLM (Llama 3.1 8B) against the deployed Goliath LLM (Goliath 120B), win rate was 70% vs. 30% when using Phi-4 as a judge and 63% vs. 37% when using human evaluators to compare the two models.

Improved cost and efficiency: The smaller distilled LLM reduced compute and memory needs while increasing real-time scalability and maintaining data privacy, and throughput improved 11x.
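
As a rough worked example of what the speed figure implies, the sketch below applies the 95% reduction to the 12-15 second average reported for the earlier setup and compares it with the 30-second arrival interval. It assumes a single worker that handles requests one at a time, which the article does not state. The 11x throughput figure is reported separately and is not derived here.

```python
ARRIVAL_INTERVAL_S = 30.0          # one request every 30 seconds (from the text)
BASELINE_LATENCY_S = (12.0, 15.0)  # reported average range for Goliath 120B
REDUCTION = 0.95                   # reported drop in processing time


def after_reduction(latency_s, reduction):
    """Latency remaining after a fractional reduction."""
    return latency_s * (1.0 - reduction)


def utilization(latency_s, interval_s):
    """Fraction of time one sequential worker is busy (assumed setup)."""
    return latency_s / interval_s


for baseline in BASELINE_LATENCY_S:
    distilled = after_reduction(baseline, REDUCTION)
    print(
        f"{baseline:.0f}s -> {distilled:.2f}s per request; "
        f"busy {utilization(baseline, ARRIVAL_INTERVAL_S):.0%} -> "
        f"{utilization(distilled, ARRIVAL_INTERVAL_S):.1%}"
    )
```

Under these assumptions, per-request latency falls to roughly 0.6 to 0.75 seconds, leaving considerable headroom for higher request rates on the same worker.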

The results are clear: enterprises can achieve AI excellence without compromising data privacy. By synthesizing training data and distilling knowledge, businesses avoid trade-offs between innovation and compliance.

Enterprises today face a steep challenge: they want to leverage advanced AI models to stay competitive, but need to keep the high costs of cloud-based large language models (LLMs) under control and stay compliant with data privacy regulations.

So how can businesses explore cutting-edge AI without overextending budgets or exposing sensitive private data? At Cloudera, we’ve developed a solution that turns this challenge into an opportunity—using synthetic data generated from private data and knowledge distillation to build cost-efficient, accurate, and compliant AI systems.

In this article, we discuss how Cloudera’s Synthetic Data Studio, part of Cloudera AI Studios, allows organizations to capitalize on AI innovation even when real-world data is scarce or sensitive.

Synthetic Data Enables Innovation Without Regulatory Risk

By developing a knowledge distillation approach, Cloudera achieved a 95% reduction in processing time, increased output structure adherence to 99.5%, and deployed a distilled Llama 3.1 8B model that outperformed the prior Goliath 120B model, winning 70% of head-to-head comparisons when judged by Phi-4 and 63% when judged by human evaluators.

This method eliminated compliance risks by avoiding direct use of sensitive data and also unlocked 11x greater throughput, showing that smaller, fine-tuned models can surpass larger, resource-intensive alternatives in both speed and precision.

Try our AMP to explore how to use private synthetic data to distill knowledge from a large model to a smaller model for a customer support use case.

Figure 1. The impact of the synthetic data distillation approach on speed, adherence, and cost for the customer support use case. The AWS cost is the hypothetical cost of running the LLM on the AWS Cloud (based on Feb 2025 prices).

Use Case and Key Takeaways

Use case: Drawing from an internal use case, we’ll show how we significantly improved the performance and overall throughput for Cloudera’s customer support ticket pipeline through knowledge distillation using synthetic data generated from private data, while maintaining data privacy and regulatory compliance.

Key takeaways:

Data privacy as a competitive advantage: Synthetic data enables innovation without regulatory risk.

Cost-effective performance: Smaller, fine-tuned models outperform larger, resource-heavy alternatives.

Applicable to multiple use cases: The same approach can power use cases from fraud detection to personalized customer service.

Celebrating Cloudera and IBM’s Milestone Impact in Brazil

Cloudera — Wed, 03 Sep 2025 12:00:00 UTC

Cloudera and IBM are celebrating eight years of collaboration highlighted by continuous innovation, impressive growth, and real-world impact on enterprise digital transformation across Brazil. Since the collaboration began in 2017, Cloudera and IBM have together generated strong business results—achieving US$100 million in annual recurring revenue by 2020, with growth continuing thanks to the deepening synergy between the two companies.

A Powerful Partnership

Cloudera’s open and scalable data platform, integrated seamlessly with IBM’s advanced technologies such as watsonx, BigSQL, and Cognos, is shaping the future of data and AI for large organizations. Joint engineering and solutions teams from both companies collaborate to provide up-to-date integrations and smooth technical support, supported by robust professional services that are central to the partnership’s success.

Delivering End-to-End Value

By combining their expertise, Cloudera and IBM empower clients with end-to-end solutions across the enterprise data and AI lifecycle—from ingestion to inference—regardless of environment: on-premises, hybrid cloud, multicloud, or edge. The shared mission is clear: deliver a seamless, secure, and scalable customer experience that meets the demands of today’s digital business landscape.

“Our collaboration with IBM is one of our most valuable strategic assets,” says Rubia Coimbra, Vice President of Cloudera for Latin America. “We are helping companies unlock real-time value from their data with intelligence and scalability.”

Marcela Vairo, VP of Data & AI, Americas at IBM, echoes this, pointing to the collaboration’s focus on innovation and customer impact: “Together, we are committed to providing seamless, end-to-end solutions that empower enterprises across all environments to unlock the full value of their data and AI investments, ensuring security, governance, and exceptional performance.”

Joint Services and Solutions

Key offerings through the Cloudera-IBM alliance include:

Data in Motion (DIM): Real-time data capture, processing, and analytics
Data Services: Secure infrastructure and intelligent data management tools
Enterprise AI Integration: Predictive and generative modeling with robust privacy and compliance
Integrated Support & Professional Services: Joint technical assistance and tailored deployments
Co-engineering: Ongoing innovation focused on interoperability and performance

Impact on Key Industries

Both the financial and healthcare sectors in Brazil have particularly benefited. Financial organizations have gained agility to meet regulatory requirements and produce real-time insights, while healthcare providers leverage predictive analytics to improve patient care and operational efficiency.

Cloudera and IBM’s ongoing collaboration demonstrates how working together can set a standard for secure, scalable, and innovative data and AI solutions in complex enterprise environments, now and in the years to come.