The Centers for Disease Control and Prevention (CDC), the lead U.S. government agency for public health, is dedicated to safeguarding the health and safety of more than 337 million Americans. As the nation’s leading science-based, data-driven public health service organization, CDC works 24/7 to protect America from health, safety, and security threats, both foreign and in the United States.
By leveraging Cloudera for data and advanced analytics, CDC has enhanced its capabilities for managing and mitigating infectious diseases.
Tackling Data Challenges
Tracking, predicting, and responding to infectious disease outbreaks is a data-intensive task. The ability to import and integrate diverse data from numerous sources is a critical capability.
CDC used Cloudera to create an open data lakehouse architecture (a database-like system that uses distributed file systems for storage and distributed services for querying) that lets it retrieve and manipulate large amounts of respiratory virus sequence surveillance data for reporting and analysis.
Accelerating Workflows, Innovation, and Impact
CDC also uses Cloudera to consolidate data from disparate sources, including surveillance systems, lab results, and sequence data repositories. This simplifies the creation of unified, comprehensive views of virus trends nationally and globally. These data are essential for generating the analytical reports and graphs that data consumers use across the organization.
Cloudera enhances CDC’s genomic analysis capabilities by allowing it to leverage data at scale, which is crucial for understanding and responding to pathogens.
Driving More Collaboration with the Scientific Community
Traditional methods of data analysis require moving among many different programs and files to perform common operations like filtering, subsetting, and transforming data. To promote more rapid exploratory analysis and to deliver more dynamic information products without moving data between systems, CDC built a custom set of user-defined functions (UDFs) that integrate with Apache Impala’s massively parallel SQL engine within Cloudera.
For example, built-in generic functions might make a string uppercase or count the values in a column. These use cases are not domain-specific. CDC’s library instead introduces domain-specific functionality from the field of bioinformatics (such as calculating sequence entropy or translating nucleotides), which CDC analysts need routinely.
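To make the two named examples concrete, here is a minimal Python sketch of what such functions might compute. It is not CDC's implementation: the function names `sequence_entropy` and `translate` are hypothetical, entropy is taken to mean Shannon entropy over the symbol frequencies of a single sequence (one common reading), and translation assumes the standard genetic code with incomplete trailing codons dropped.

```python
import math
from collections import Counter

# Standard genetic code (NCBI translation table 1), built in TCAG order.
_BASES = "TCAG"
_AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
_CODON_TABLE = dict(
    zip((a + b + c for a in _BASES for b in _BASES for c in _BASES), _AMINO)
)


def sequence_entropy(seq: str) -> float:
    """Shannon entropy, in bits, of the symbol frequencies in seq."""
    if not seq:
        return 0.0
    counts = Counter(seq.upper())
    total = sum(counts.values())
    # p * log2(1/p) summed over every observed symbol
    return sum((n / total) * math.log2(total / n) for n in counts.values())


def translate(nucleotides: str) -> str:
    """Translate a nucleotide string to amino acids, codon by codon.

    Unrecognized codons (for example, ones containing N) become 'X';
    an incomplete trailing codon is ignored.
    """
    nt = nucleotides.upper().replace("U", "T")
    usable = len(nt) - len(nt) % 3
    return "".join(_CODON_TABLE.get(nt[i:i + 3], "X") for i in range(0, usable, 3))
```

Inside a query engine, logic like this would run once per row, so an analyst could filter or group on the result without exporting sequences to a separate tool.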
Integrating the new functions into the query engine means analysts do not have to leave the query editor and can make more dynamic visual dashboards that call UDFs in response to user input. CDC calls this approach “database-centric sequence surveillance analytics.” CDC has open-sourced its library, making it available to the broader scientific community via a repository.
CDC’s UDFs efficiently process biological data, enabling complex bioinformatics workflows directly within Impala SQL, such as:
Sequence Analysis: Distance functions, showing mutations, translation/reverse complement, allele and motif extraction, and sequence quality control
Utility Functions: Extra utilities to ease string and list manipulation as well as generate IDs
Statistical Analysis: Extra functions for statistical moments, testing for modalities, and functions for t-tests
Date and Time Utilities: Date conversion and interval functions that are useful in reporting
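As an illustration of the sequence analysis category above, the following Python sketch shows a reverse complement, a simple distance function, and a mutation listing. The names (`reverse_complement`, `hamming_distance`, `list_mutations`) and the mutation notation are assumptions chosen for illustration and do not reflect the names or behavior of CDC's functions; the sketch also assumes pre-aligned sequences of equal length.

```python
# Complement mapping for unambiguous DNA bases; other characters pass through.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Complement each base, then reverse the strand."""
    return seq.translate(_COMPLEMENT)[::-1]


def hamming_distance(a: str, b: str) -> int:
    """Count the positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def list_mutations(reference: str, query: str) -> list[str]:
    """List substitutions as <ref base><1-based position><query base>."""
    if len(reference) != len(query):
        raise ValueError("sequences must have the same length")
    return [
        f"{r}{pos}{q}"
        for pos, (r, q) in enumerate(zip(reference, query), start=1)
        if r != q
    ]
```

For instance, `list_mutations("ACGT", "ACCT")` reports the single substitution at position 3.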
These UDFs deliver greater workflow efficiency, accelerating the processing of large genomic and metadata datasets and reducing the time required for analysis and interpretation. Apache Impala allows for the creation of both scalar and aggregate functions, increasing the scope of domain-specific problems that can be solved. By open-sourcing these UDFs, the CDC fosters collaboration within the global scientific community, encouraging shared innovation and continuous improvement.
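The difference between scalar and aggregate functions can be sketched in Python. A scalar function maps one row to one value, while an aggregate function folds many rows into a single result and, in a distributed engine, must be able to merge partial results computed on different nodes. The sketch below models that with init, update, merge, and finalize steps for a sample variance (a second statistical moment). It is a conceptual illustration under those assumptions, not Impala's API or CDC's code, and all names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MomentState:
    """Intermediate state held by one worker."""
    n: int = 0
    mean: float = 0.0
    m2: float = 0.0  # running sum of squared deviations from the mean


def init() -> MomentState:
    return MomentState()


def update(state: MomentState, x: float) -> MomentState:
    """Fold one input row into the state (Welford's online update)."""
    state.n += 1
    delta = x - state.mean
    state.mean += delta / state.n
    state.m2 += delta * (x - state.mean)
    return state


def merge(a: MomentState, b: MomentState) -> MomentState:
    """Combine two partial states computed on disjoint sets of rows."""
    if a.n == 0:
        return b
    if b.n == 0:
        return a
    n = a.n + b.n
    delta = b.mean - a.mean
    mean = a.mean + delta * b.n / n
    m2 = a.m2 + b.m2 + delta * delta * a.n * b.n / n
    return MomentState(n, mean, m2)


def finalize(state: MomentState) -> Optional[float]:
    """Return the sample variance, or None when fewer than two rows were seen."""
    if state.n < 2:
        return None
    return state.m2 / (state.n - 1)
```

Because `merge` yields the same state as feeding every row through `update` on one worker, the data can be partitioned across nodes without changing the result, up to floating-point rounding.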
Making an Impact on Citizens and Public Health
With Cloudera, CDC delivered a host of benefits that directly affect the quality of life for people, including:
Improved sequence surveillance. CDC improved its ability to surveil circulating viruses by consolidating data at scale from multiple sources and using flexible open-source tooling. Its analysts can now analyze genomic data more easily with these methods.
Innovative genomic research. Adding new capabilities for complex genomic analysis has advanced the CDC's research efforts, providing deeper insights into pathogen behavior and resistance patterns. Cloudera’s open data lakehouse architecture enables efficient storage and analysis of large genomic datasets, facilitating cutting-edge bioinformatics workflows.
CDC’s focus on modernizing its data and analytics architecture has had a significant impact on researchers, public health organizations, and, most importantly, the health of millions of Americans.