Cloudera Powers Opt-In Machine Learning Project for Real-Time Identification of Suicide Risk Factors in Military Veterans
Patterns and Predictions’ Durkheim Project Uses Predictive Analytics Across Data Sources
PALO ALTO, CA – September 25, 2013 – Cloudera, the leader in enterprise analytic data management powered by Apache Hadoop™, today announced that Patterns and Predictions, a predictive analytics company, partnered with Cloudera for an ongoing initiative applying machine learning to the identification of key correlations between military veterans’ communications and suicide risk. The Durkheim Project, as it is called, entails opt-in monitoring across a variety of online and mobile data channels to predict which military veterans are at the highest risk of suicide. It is powered by a real-time risk detection framework co-developed with Cloudera and built on CDH (Cloudera’s Distribution Including Apache Hadoop), Cloudera Impala and Cloudera Search.
“The promise of the Durkheim Project is expressed in its ability to collect, monitor and deliver insights from a diverse repository of complex data, including mobile and social media signals, with the hope of eventually providing real-time triage of interventional actions upon detection of a critical event,” said Patterns and Predictions founder Chris Poulin. “Cloudera's unique software and expertise enable us to make risk assessments faster and across larger data sets, resulting in better clinical outcomes.”
Applied Machine Learning Identifies and Predicts Mental Health Risk Factors
Patterns and Predictions' founder Chris Poulin began working with Dartmouth researchers in 2010 to address the problem of high suicide rates among veterans. Suicide rates among U.S. veterans are approximately twice those of the general population, a challenging phenomenon facing the U.S. Department of Veterans Affairs (VA).
With support from the Defense Advanced Research Projects Agency (DARPA), a research arm of the Department of Defense (DoD), and Dartmouth College, the suicide risk prediction project includes a database of more than 100,000 U.S. veterans, all of whom have volunteered their participation. By mining these veterans’ social media posts and other indicators, Patterns and Predictions – together with a team of experts in artificial intelligence, medical professionals from private companies, and the VA – developed a set of predictive indicators of suicidal risks for military veterans.
The tightly integrated machine learning system was trained by feeding in isolated statistical indicators – keywords, word patterns and other linguistic clues known to be associated with people who needed help – from a variety of veterans’ data sources. Words and linguistic patterns that veterans post online are data-mined for indicators of suicidal behavior and the system identifies useful clues in real data to establish a risk “score.”
With so many veteran participants, the data sets are very large. The veterans who opt into the project receive a unique Facebook app and a mobile app for either the iOS or Android operating system; these are designed to capture posts, Tweets, mobile uploads and geographic location. Additional profile data is captured as well, including physician information and clinical notes. To ensure compliance with various privacy and HIPAA regulations, all captured data is stored in a secure environment behind a medical firewall.
Open Source Hadoop Infrastructure Delivers Operational Efficiency for Critical Research
The Durkheim Project has a highly complex workflow, requiring foundational infrastructure and predictive modeling that supports big data collection and analysis at scale. Moreover, the team wanted to access all of the machine learning through search interfaces, which can get expensive since all of the machine learning is indexed.
The technical objective for building the machine learning data fabric underpinning the initiative was maximum speed at minimum cost. Poulin found most big data solutions either performed poorly in terms of accuracy or were highly complex to implement and/or integrate with Patterns and Predictions’ existing IT environment. Poulin chose to build on Apache Hadoop for its abstraction of underlying data set complexity and selected Cloudera for its category leadership and subject matter expertise in the Hadoop framework, open source and big data infrastructure. CDH, the market-leading, 100% open source distribution of Hadoop and related projects, serves as the cornerstone technology of the Durkheim Project. Using Cloudera Impala and Cloudera Search, the ingestion of data on Hadoop is markedly more efficient, delivering lower costs, better computational throughput and reduced complexity of IT support.
Patterns and Predictions engaged Cloudera Professional Services to co-develop code in the area of real-time prediction on CDH, called Bayesian Counters. The use of text analytics against the continuously fed large data pool delivers an exponential number of variables which can then be compared and analyzed, resulting in a real-time assessment of the participant’s mental health. The computational processing to analyze that data requires a big data fabric, and the benefit is that the output is much more informative.
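As a rough illustration of how a counter-based Bayesian text classifier can be updated and queried continuously, the following minimal Python sketch keeps nothing but running counts and derives a score from them on demand. It is not the Bayesian Counters code co-developed by Patterns and Predictions and Cloudera, whose design is not described here; the class name, method names, labels and tokens are hypothetical placeholders, and the scoring is ordinary naive Bayes with add-one smoothing.

```python
import math
from collections import Counter, defaultdict


class CountingNaiveBayes:
    """Naive Bayes text classifier whose entire state is a set of counters.

    Hypothetical illustration only; not the project's actual implementation.
    """

    def __init__(self, alpha=1.0):
        self.alpha = alpha                        # smoothing constant
        self.doc_counts = Counter()               # label -> documents seen
        self.token_counts = defaultdict(Counter)  # label -> token -> count
        self.token_totals = Counter()             # label -> total tokens seen
        self.vocab = set()                        # all tokens seen so far

    def update(self, tokens, label):
        """Fold one labeled document into the counters (incremental training)."""
        self.doc_counts[label] += 1
        for token in tokens:
            self.token_counts[label][token] += 1
            self.token_totals[label] += 1
            self.vocab.add(token)

    def posterior(self, tokens):
        """Return a dict mapping each label to its posterior probability."""
        if not self.doc_counts:
            raise ValueError("no training data has been counted yet")
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocab) + 1  # one extra slot for unseen tokens
        log_scores = {}
        for label, n_docs in self.doc_counts.items():
            log_p = math.log(n_docs / total_docs)
            denom = self.token_totals[label] + self.alpha * vocab_size
            for token in tokens:
                count = self.token_counts[label][token]
                log_p += math.log((count + self.alpha) / denom)
            log_scores[label] = log_p
        # Normalize in log space to avoid underflow on long documents.
        top = max(log_scores.values())
        weights = {k: math.exp(v - top) for k, v in log_scores.items()}
        norm = sum(weights.values())
        return {k: w / norm for k, w in weights.items()}


# Placeholder tokens and labels; real features would come from text analytics.
model = CountingNaiveBayes()
model.update(["tok_a", "tok_b"], "flag")
model.update(["tok_a", "tok_a"], "flag")
model.update(["tok_c", "tok_d"], "no_flag")
model.update(["tok_c", "tok_b"], "no_flag")
scores = model.posterior(["tok_a"])
```

Because training only increments counters, new observations can be folded in as they arrive without retraining from scratch, which is one plausible reason a counter-based design suits a continuously fed data pool.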
In the Future, Data Could Help Veterans in Crisis
In February 2013, an investigation conducted by Patterns and Predictions, Dartmouth and the VA determined that the accuracy of this risk-prediction data model was statistically significant, with "consistent accuracies" of 65 percent or higher in predicting suicide risk in a veteran control group.
Still in its initial phases, the Durkheim Project is authorized only to monitor and analyze data. While the project has delivered statistically valid results that accurately predict suicide risk in a control group of veterans, its critical research is restricted, at least for the time being, to a non-interventional protocol. Using Cloudera, the project’s continued scaling of risk classifiers will help to establish the necessary confidence in its ability to assess risk in real time, as the team currently seeks approval for an interventional study.
About Patterns and Predictions
Patterns and Predictions is a predictive analytics firm. Its core Centiment® technology provides unstructured and linguistics-driven prediction. It is the technology powering the Durkheim Project’s ‘big data’ analytics network for the assessment of mental health risks. Partners include Bloomberg, The Geisel School of Medicine at Dartmouth, Cloudera, and Attivio. Funding sources include the U.S. Government (DARPA), and customers include Global 100 companies.
About Cloudera
Founded in 2008, Cloudera pioneered the business case for Hadoop with CDH, the world’s most comprehensive, thoroughly tested and widely deployed 100% open source distribution of Apache Hadoop in both commercial and non-commercial environments. Now, the company is redefining data management with its Platform for Big Data, Cloudera Enterprise, empowering enterprises to Ask Bigger Questions™ and gain rich, actionable insights from all their data, to quickly and easily derive real business value that translates into competitive advantage. As the top contributor to the Apache open source community and leading educator of data professionals with the broadest array of Hadoop training and certification programs, Cloudera also offers comprehensive consulting services. Over 700 partners across hardware, software and services have teamed with Cloudera to help meet organizations’ big data goals. With tens of thousands of nodes under management and hundreds of customers across diverse markets, Cloudera is the category leader that has set the standard for Hadoop in the enterprise. www.cloudera.com
Bhava Communications for Cloudera
Ketchum for Cloudera
+44 (0) 20 7611 3788