More from #MongoDBWorld.
How UnitedHealth Group Integrated Open Data into its MongoDB-Based Information System
Cedric Carbone, CTO of Talend and Matt Axsom, Lead Architect at Optum – UnitedHealth Group
Talend is a unified, open source platform: a code generator with a distributed architecture.
Matt Axsom is with Optum. His group focuses on Clinical Performance and Compliance, and handles encounter data processing.
Optum uses Talend Studio for process automation, orchestration and execution, for business continuity and ROI.
Enable business users to change their analytics.
RADAR – Risk Adjustment Data Acquisition and Analysis Repository
Medicare and Retirement Claims related data.
Handling Trading partner file submissions (tracked via GridFS)
Transactions via Collections, correlated with Metadata
Downstream file outputs are stored in GridFS
Audit traceability from cradle to grave.
Simplified auditability and research.
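A minimal sketch of how a submitted file and its transactions might be tied together for this kind of audit trail, assuming PyMongo and its `gridfs` module. The `transactions` collection, the metadata fields and all helper names are hypothetical illustrations, not Optum's actual schema.

```python
from datetime import datetime, timezone


def submission_metadata(partner_id, file_type):
    """Metadata stored with the raw file in GridFS (hypothetical fields)."""
    return {
        "partner_id": partner_id,
        "file_type": file_type,
        "received_at": datetime.now(timezone.utc),
    }


def transaction_doc(file_id, line_no, payload):
    """One transaction, pointing back at the file it was parsed from."""
    return {"source_file_id": file_id, "line_no": line_no, "payload": payload}


def store_submission(db, filename, data, partner_id, file_type, rows):
    """Store the raw file in GridFS, then one document per transaction.

    `db` is a pymongo Database, `data` is bytes or a file-like object.
    Needs the pymongo package and a running server.
    """
    import gridfs  # ships with pymongo; imported here so the helpers above run without it

    fs = gridfs.GridFS(db)
    file_id = fs.put(
        data,
        filename=filename,
        metadata=submission_metadata(partner_id, file_type),
    )
    docs = [transaction_doc(file_id, n, row) for n, row in enumerate(rows, 1)]
    if docs:
        db.transactions.insert_many(docs)  # hypothetical collection name
    return file_id
```

Because every transaction carries `source_file_id`, a reviewer can walk from any record back to the exact file that produced it.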
Trading Partner Submissions
Hospitals, providers and clearinghouses.
Transactions include: Provider Direct, Hospital Data Capture, Chart Review, Delete Submissions
Trading Partner Submissions: Inputs are passed through Talend ETL, deposited in MongoDB, then passed to external CMS (Medicare/Medicaid) and an internal RDBMS.
MongoDB collections: Trading Partner, GridFS, Encounters, Submissions.
The process started from:
SELECT * FROM TX_Detail WHERE Date = '20140623'
Automating this was low priority, but it saves a business user 30 minutes per day spent pulling data, exporting it to Excel and generating a status email.
The datasets are large, so they are not suitable for a laptop or desktop.
This drove the selection of Talend. Talend orchestrates SAS, SSIS, SAP (Business Objects/Crystal Reports) and web hooks (REST/SOAP APIs), as well as others.
No utility scripting required. Talend orchestrates across systems.
Talend can integrate with anything that is accessible from the command line.
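To make the daily status task above concrete (run the query, export the rows, produce a status email), here is a rough Python sketch. It is not how Talend does it. It uses an in-memory SQLite database as a stand-in for the real source, CSV instead of Excel, and hypothetical function names and addresses.

```python
import csv
import io
import sqlite3
from email.message import EmailMessage


def export_status(conn, day):
    """Run the daily query and return (csv_text, row_count)."""
    cur = conn.execute("SELECT * FROM TX_Detail WHERE Date = ?", (day,))
    headers = [col[0] for col in cur.description]
    rows = cur.fetchall()
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(headers)
    writer.writerows(rows)
    return buf.getvalue(), len(rows)


def status_email(day, csv_text, row_count, sender, recipient):
    """Build (but do not send) a status email with the CSV attached."""
    msg = EmailMessage()
    msg["Subject"] = f"TX_Detail status for {day}: {row_count} rows"
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content(f"{row_count} transactions found for {day}. CSV attached.")
    msg.add_attachment(csv_text, subtype="csv", filename=f"tx_detail_{day}.csv")
    return msg
```

The message could then be handed to `smtplib` or any mail gateway. The point is only that the three manual steps collapse into one scheduled job.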
MongoDB selection was driven by…
The file system held no references to the objects stored on it, which made auditability difficult.
Alternatives were BFILE/HDFS/GridFS.
The operational aspects made MongoDB the preferable choice, e.g. cloning data for development staging and running parallel data streams in development.
Data Intake Issues
A lot of disparate data formats. Many can’t be changed.
Modeling the data in an RDBMS was complex due to data disparity. Building a schema to account for all data streams gets massively complex.
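One common way a document store sidesteps this is to wrap each incoming record, whatever its shape, in a small envelope of shared fields and keep the original payload untouched. The sketch below is an illustration under that assumption. The field names, feed names and sample values are hypothetical and were not shown in the talk.

```python
def to_envelope(source, fmt, record):
    """Wrap a record of any shape with a small set of common fields."""
    return {
        "source": source,         # which trading partner or feed sent it
        "format": fmt,            # original format label, kept for audit
        "payload": dict(record),  # original fields, stored as received
    }


# Two feeds describing similar things with different field names.
hospital = {"patient_id": "P1", "dx_codes": ["D1", "D2"], "admit": "2014-06-01"}
provider = {"member": "P1", "diagnosis": "D1"}

docs = [
    to_envelope("hospital_a", "hospital_data_capture", hospital),
    to_envelope("clinic_b", "provider_direct", provider),
]
```

Both documents can live in one collection and be queried on the shared fields, with no need for a relational schema that anticipates every feed's columns.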
Use of Hadoop
- YARN provides Cluster Resource Management
- Talend can generate Pig Latin scripts
- HBase requires taking the system down for upgrades.
- MongoDB offers operational stability. It has only been taken down once in 21 months.
- HBase only reached production level in Q1 2014, whereas Optum had been running since October 2012.
Operational simplicity wins out.
Secondary indexing was required by Optum but this was not available until December 2012. Prior to that, the equivalent of secondary indexing required a full table scan.
– 20 Table Join
– 4 billion Rows
– 856K row input driver
– 100 hour execution cycle
Hive queries are seeing a 100-to-1 improvement by using MongoDB.
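The secondary-indexing point above is easier to see with a toy example. The following sketch is plain Python, not HBase or MongoDB code. It contrasts a lookup on a non-key field done by scanning every row with the same lookup done through a prebuilt index. All names are illustrative.

```python
from collections import defaultdict


def full_scan(rows, field, value):
    """Without a secondary index: touch every row to find matches."""
    return [row for row in rows if row.get(field) == value]


def build_index(rows, field):
    """Build a secondary index: field value -> positions of matching rows."""
    index = defaultdict(list)
    for pos, row in enumerate(rows):
        index[row.get(field)].append(pos)
    return index


def indexed_lookup(rows, index, value):
    """With the index: jump straight to the matching rows."""
    return [rows[pos] for pos in index.get(value, [])]
```

Both functions return the same rows. The scan's cost grows with the size of the table, while the indexed lookup's cost grows only with the number of matches. In MongoDB the corresponding structure is created on a collection with PyMongo's `create_index`.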
OLTP Platform load to Hadoop
– File to HDFS
– Sqoop to HDFS/Hive
Pig – Transformation
MapReduce (data read, shuffle and sort) – low-level Java
HCatalog* – Shared Schema / Data Typing
Partition Notifications on load – MQ Style hook to trigger subsequent analytics after load
Tableau / SAP Business Objects 4
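The "MQ style hook" for partition notifications in the list above can be pictured as a small publish/subscribe loop: the loader posts a message when a partition lands, and downstream analytics subscribe to it. The sketch below uses an in-process `queue.Queue` purely for illustration. The class and method names are hypothetical, and a real deployment would use an actual message broker.

```python
import queue


class LoadNotifier:
    """Toy stand-in for a message queue carrying partition-loaded events."""

    def __init__(self):
        self._queue = queue.Queue()
        self._handlers = []

    def subscribe(self, handler):
        """Register a callable that receives each event dict."""
        self._handlers.append(handler)

    def partition_loaded(self, table, partition):
        """Called by the loader once a partition has been written."""
        self._queue.put({"table": table, "partition": partition})

    def dispatch(self):
        """Deliver pending events to all handlers; return how many were delivered."""
        delivered = 0
        while True:
            try:
                event = self._queue.get_nowait()
            except queue.Empty:
                return delivered
            for handler in self._handlers:
                handler(event)
            delivered += 1
```

A downstream job then starts because data arrived, not because a clock said it probably had.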
100 daily jobs
Moving/Migrating jobs to Data Services
Leveraging tMOM and Camel
– Migrating all core ODS to MongoDB
– ETL/ELT Analytics offload to Hadoop where it makes sense.
Looking at MongoDB Hadoop connector to allow direct run from Hadoop.
NPI – National Provider Identifier
Monthly file is approx 4GB in size.
4+ Million Records
Publicly available data, updated frequently.
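A file of this size is better streamed than loaded whole. Here is a minimal sketch of batched reading, assuming the monthly file is a CSV with a header row. The batch size and function name are illustrative choices.

```python
import csv
from itertools import islice


def read_batches(lines, batch_size=1000):
    """Yield lists of row dicts from CSV lines without reading the whole file.

    `lines` can be an open text file or any iterable of CSV lines.
    """
    reader = csv.DictReader(lines)
    while True:
        batch = list(islice(reader, batch_size))
        if not batch:
            return
        yield batch


# Typical use (hypothetical path and collection):
#   with open("npi_monthly.csv", newline="") as f:
#       for batch in read_batches(f):
#           collection.insert_many(batch)
```

Only one batch is held in memory at a time, so memory use stays flat regardless of file size.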
NPI Permissible use:
Fraud, waste and abuse
Patient Provider linkage.
It is the single source of truth.
PECOS from Government has additional data.
Augment with internal provider data
OIG – list of excluded individuals and entities
SAM – System for Award Management
Acceptable Physician speciality type.
Need to mash up these government data sets.
The Social Security Administration produces the Death Master File, which lists who has died.
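As a rough illustration of such a mashup, the sketch below flags provider records against two exclusion lists. It assumes, for simplicity, that every list can be keyed by NPI. In practice some of these sources are keyed differently and need a matching step first. All field and function names are hypothetical.

```python
def mash_up(providers, oig_excluded, sam_excluded):
    """Attach exclusion flags to each provider record, keyed by NPI."""
    oig = set(oig_excluded)
    sam = set(sam_excluded)
    merged = []
    for provider in providers:
        npi = provider["npi"]
        merged.append({
            **provider,
            "oig_excluded": npi in oig,
            "sam_excluded": npi in sam,
            "eligible": npi not in oig and npi not in sam,
        })
    return merged
```

Each merged record could then be stored as a single document, so one lookup by NPI returns the provider together with its exclusion status.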
I think I want to look at using the RestCAT open source API to ingest NPPES into MongoDB.
CMS captures Licensure but DOES NOT Validate that data. Taxonomy data may be questionable but NPI data is consistent.
Health & Cloud Technology Consultant
Mark is available for challenging assignments at the intersection of Health and Technology using Big Data, Mobile and Cloud Technologies. If you need help to move, or create, your health applications in the cloud let’s talk.
Stay up-to-date: Twitter @ekivemark
Disclosure: I began as a Patient Engagement Advisor and am now CTO to Personiform, Inc. and their Medyear.com platform. Medyear is a powerful free tool that helps you collect, organize and securely share health information, however you want. Manage your own health records today. Medyear: The Power Grid for your Health.