Category Archives: BigData Health

Heading to NYC to discuss consumer health information exchange with @Medyears

Today I am on a flying visit to New York City to discuss a consumer-mediated Health Information Initiative Workshop in Philadelphia. We are planning this for early September.

The objective is to drive adoption of BlueButton and give patients easier access to their health data.

If we run this as a hybrid HealthCa.mp event, who is interested in getting involved?

More information to follow soon as we finalize details…

[tag health cloud BigData MongoDB MongoDBWorld NoSQL]

Mark Scrimshire
Health & Cloud Technology Consultant

Mark is available for challenging assignments at the intersection of Health and Technology using Big Data, Mobile and Cloud Technologies. If you need help to move, or create, your health applications in the cloud let’s talk.
Blog: http://blog.ekivemark.com
email: mark@ekivemark.com
Stay up-to-date: Twitter @ekivemark
Disclosure: I began as a Patient Engagement Advisor and am now CTO to Personiform, Inc. and their Medyear.com platform. Medyear is a powerful free tool that helps you collect, organize and securely share health information, however you want. Manage your own health records today. Medyear: The Power Grid for your Health.

Updated:Building a Healthcare Big Data Platform in the Cloud with @Datastax

Update: The date of the webinar has been changed to Monday, August 18th.

On August 18th, at Noon ET, I will be presenting a webinar with DataStax about building a cloud-based Big Data platform for the healthcare software division of a Fortune 50 company using DataStax.

You can sign up for the webinar here: http://learn.datastax.com/WebinarDataStaxEnterpriseintheCloudHowthisFortune50HealthcareCompanyTurnedBigDataintoBigAnswers.html?utm_source=web&utm_medium=resources&utm_campaign=gtro4

This webinar will bring the presentation I delivered to a group in Boston, in May 2014, to a wider audience. I hope you can join me.

#mongodbworld Eliot Horowitz talking about the future roadmap for @MongoDB

The Future of MongoDB. Beyond the next release.

Eliot Horowitz recapped:

MongoDB 2.8, later in 2014:
– Improved Concurrency
– Storage Engine API

MongoDB 3.0 and Beyond

Partitioned Joins

Why no joins? There is less need for them in a document data model.
MongoDB also doesn’t want features that create surprises when scaling horizontally.

Multi-Document Transactions

Approach these the same way as partitioned joins: plan ahead on joins across collections to ensure that everything is on the same shard.

Schema Validation

For example, add a query document for a collection, then validate each document on submission against that query document.
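
This feature hadn't shipped at the time, so here is a minimal Python sketch of the idea only — validating each submitted document against a simple "query document" of required fields and types. The spec format (field name mapped to expected type) is my own illustrative assumption, not MongoDB's actual validator syntax:

```python
# Minimal sketch of validate-on-submission against a "query document".
# The spec format (field -> expected type) is an illustrative assumption,
# not MongoDB's actual validator syntax.

def validate(doc, query_doc):
    """Return True if doc has every field named in query_doc, with the right type."""
    for field, expected_type in query_doc.items():
        if field not in doc or not isinstance(doc[field], expected_type):
            return False
    return True

# Hypothetical spec: every submitted document must carry these two fields.
patient_spec = {"patient_id": str, "heart_rate": int}
```

A collection configured this way would reject `{"patient_id": "p1"}` (missing field) while accepting `{"patient_id": "p1", "heart_rate": 72}`.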

Multi-Master Databases

Incrementing counters, for example, can work in a multi-master environment.
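
Counters tolerate multi-master writes because increments commute. A toy grow-only counter in Python — not MongoDB code, just the underlying idea, with each master incrementing only its own slot so replicas converge regardless of merge order:

```python
# Toy grow-only counter: each master increments only its own slot,
# so replicas can merge state in any order and converge on the same total.

def increment(counter, node):
    counter = dict(counter)
    counter[node] = counter.get(node, 0) + 1
    return counter

def merge(a, b):
    # Taking the per-node max is safe because each node's slot only ever grows.
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in set(a) | set(b)}

def value(counter):
    return sum(counter.values())
```

Two data centers can each bump the counter independently; merging their states in either order yields the same total.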

Filtered Replica Sets

e.g. a retailer might want data in every store.

Filtering means that only a subset of the data is replicated, e.g. the UK data center gets UK data from a global data center.

Storage Engines

The ability to tailor storage to performance needs. I think Health care will be big on having encryption as a storage engine.

Resource Management

Providing the ability to manage operations across different types of machines. This may be built into future versions of MMS.

Adaptive Provisioning

MMS should be able to automatically adjust cluster size. It looks like some of the ObjectRocket features will be embedded into the core MongoDB offering.

Queryable Backups

MMS – providing the ability to find and restore one or more documents without having to restore the entire database.
Connect the backup to a mongod daemon and it becomes available to query.

Database as a Service Software

MMS will provide this capability for internal or external / cloud use.


#MongoDBworld Hardware provisioning of @MongoDb – What I need to know!

Session details: http://world.mongodb.com/mongodb-world/session/hardware-provisioning-mongodb

Hardware Provisioning for MongoDB

Some of the most common questions we hear from users relate to capacity planning and hardware choices. How many replicas do I need? Should I consider sharding right away? How much RAM will I need for my working set? SSD or HDD? No one likes spending a lot of cash on hardware, and cloud bills can be just as painful. MongoDB is different from traditional RDBMSs in its resource management, so you need to be mindful when deciding on the cluster layout and hardware. In this talk we will review the factors that drive the capacity requirements: volume of queries, access patterns, indexing, working set size, among others. Attendees will gain additional insight as we go through a few real-world scenarios, as experienced with MongoDB Inc customers, and come up with their ideal cluster layout and hardware.

Chad Tindel

Senior Solution Architect at MongoDB

Chad Tindel is a Senior Solution Architect at MongoDB where he specializes in helping customers understand and use the NoSQL product to solve complex business problems. Previously, Chad was a Solution Architect at Cloudera focusing on the Hadoop space and was also a Solution Architect at Red Hat, helping customers build out their enterprise Linux infrastructures. He holds a BS in Computer Science from California Polytechnic in San Luis Obispo as well as an MS in Finance from the University of Denver.

Hardware Provisioning

There is not a lot of information out there on sizing MongoDB – hence this session, even though the last one was well attended.

How do you size?

Customers often over- or under-engineer.

Think of the scenario where your app gets listed somewhere prominent and suddenly attracts lots of sign-ups: the server gets hammered and needs to be re-sized.

Requirements – Step 1

What are the business requirements?

  • Uptime (do you need more than one data center?)
  • Availability
  • Throughput
  • Responsiveness
  • Acceptable latency – especially during peak times.

Constraints

Resources available

Continuing Requirements

  • Requirements can change over time
  • More users, more data, new indexes
  • More writes

Guidance:

  • Collect metrics!
  • Adjust configuration incrementally
  • Plan ahead

Try to avoid a crisis.

Do a Proof of Concept

  • Start small on a single node
  • Design your schema (Read and write applications are different)
  • Understand query patterns
  • Get a handle on working set (the active data)

Then add replication to see impact

Review requirements as a result of the POC

  • Data sizes (Number of documents, Average document size, size of data on disk, size of indexes, expected growth, document model)
  • Ingestion – Throughput / Updates / Deletes per second peak and average
  • Bulk inserts? How large and How often?

Do you have SLAs on this performance?

  • Performance expectations
  • Life of data
  • Security requirements (SSL, Encryption at rest)
  • Number of data centers in use (Active/Active , Active/Passive Cross Data Center latency)

Resource Usage:

IOPS (4K in size)
Size
Data and loading patterns.

CPU tends to be less important

Use fast storage and as much RAM as you can get.

Network latency affects replication lag

IOPS

7200 RPM SATA = 75-100 IOPS
15000 SAS = 175-210 IOPS
Amazon SSD EBS = 4,000 PIOPS / Volume
48,000 PIOPS / Instance

Intel X-25-E SLC = 5,000 IOPS

Use iostat to monitor disk performance (or mongoperf).
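
The per-device figures above make drive-count arithmetic straightforward. A back-of-envelope sketch using those numbers — illustrative only; measure your own devices with iostat or mongoperf before committing to hardware:

```python
# Back-of-envelope: how many devices satisfy a target random-IOPS load,
# using the rough per-device figures quoted above.
import math

DEVICE_IOPS = {
    "7200rpm_sata": 100,       # 75-100 IOPS, taking the top end
    "15k_sas": 210,            # 175-210 IOPS
    "amazon_ebs_piops": 4000,  # per volume
    "intel_x25e_slc": 5000,
}

def devices_needed(target_iops, device):
    return math.ceil(target_iops / DEVICE_IOPS[device])
```

A workload needing 2,000 random IOPS takes roughly 20 SATA spindles but a single SSD, which is why the talk pushes fast storage.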

Release 2.4 added a feature to estimate the size of a working set.

Network Performance

Latency impacts WriteConcern time and ReadPreference

Throughput impacts Update and Write Patterns and Read/Queries

Use Netperf to measure network performance.

CPU Usage

CPU only really comes into play when running queries without indexes, which means performing a collection scan, or when sorting within a shard and merge-sorting the aggregated results.

Aggregation Framework or MapReduce require CPU Performance.

Case Study – Spanish Bank:

  • Logs held for 6 months
  • 18TB at 3TB/month

  • Prod environment:
    3 Nodes / shard * 36 Shards = 108 Physical Machines
    128GB/RAM * 36 = 4.6TB RAM

2 Mongos
3 Config servers (virtual machines)

Online Retailer

  • Moving the product catalog from SQL Server to MongoDB as part of an overhaul toward Open Source
  • 2 Main Data Centers active/active
  • Cyber Monday peaks at 214 Requests/Sec. Budget for 400 Requests/Sec for headroom.
  • Heavy Read process orientation.

POC

  • 4M product SKUs with a JSON document size of 30KB
  • Requests for a specific product (by _id)
  • Products by category (return 72 documents – or 200 if a Google bot)

Math

  • Partition (shard) by category.
  • Products in multiple categories are duplicated; on average a document is in 2 categories, so store 4M SKUs × 2 = 8M documents.

8M docs × 30KB ≈ 240GB, and they want everything in memory: hence 384GB RAM per server.
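
The working-set arithmetic from the retailer POC, spelled out with the numbers from the talk (30KB taken as 30,000 bytes for round figures):

```python
# Retailer POC working-set math, using the figures from the talk.
skus = 4_000_000
avg_categories_per_doc = 2              # duplication across categories
docs = skus * avg_categories_per_doc    # 8M documents to store
doc_size_kb = 30

working_set_gb = docs * doc_size_kb / 1_000_000   # KB -> GB (decimal)
# ~240GB of hot data to keep in memory, hence 384GB-RAM servers with headroom.
```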

Sharding adds a layer of complexity (eg. Add config server) so don’t shard unless you need to.

They determined a 4-node replica set, 2 nodes in each data center, plus an arbiter.

Recommended: a single replica set
– 4-node replica set

But the customer found they could only deploy on 64GB-RAM machines, so they deployed 3 shards of 4 nodes each, plus an arbiter.

Arbiters are small; they exist only for voting. A small machine with 1 vCPU and 4GB RAM is enough.


#mongodbworld Securing @mongodb

I managed to catch the latter part of the session on securing MongoDB. It was good to see that Gazzang is available to support encryption at rest.

Vormetric (licensed by IBM) is also an option.

FIPS 140-2 compliance is possible when using SSL encryption.

Audit Logging.

Audit guarantee – the event is written BEFORE the write reaches the journal.
A write will not complete before being audited.

This is important because otherwise you might miss writes to the database.

Don’t forget to configure your audit logs to write to another machine, preferably not accessible by the same sys admins that manage the MongoDB servers themselves.
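The ordering guarantee above — audit first, then apply — can be sketched as toy Python (this is the idea, not MongoDB internals):

```python
# Toy write path illustrating "audit BEFORE journal": the audit record is
# appended first, so no write can reach the journal without an audit entry.
audit_log = []
journal = []

def audited_write(op):
    audit_log.append(op)   # 1. audit event recorded first
    journal.append(op)     # 2. only then does the write hit the journal
    return op
```

Because step 1 always precedes step 2, every journaled operation is guaranteed to appear in the audit log — the property the speaker called out.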

CRUD Auditing is coming in Release 2.8. Available as experimental code at the moment.

No IP Filtering on the database. Implement this at the server level.

Important Tips from Andreas Nilsson:

Don’t expose database servers to the Internet! No. Never. DON’T DO IT! There is NO GOOD USE CASE.

Design and configure access control.
Enable SSL.
Disable any unnecessary interfaces.
Lock down database files and minimize account privileges, i.e. don’t run the DB, web or other services as root!

I would also add – Do Not use Standard or Default Accounts.
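
Minimizing account privileges in practice means creating users with narrowly scoped roles instead of reusing admin or default accounts. A sketch of the `createUser` command document (MongoDB 2.6+) an app-only account might use — the user, database, and password here are placeholders, not recommendations:

```python
# Shape of a least-privilege user definition for MongoDB 2.6+ createUser.
# User name, database name, and password are placeholders; send this via
# the shell's db.createUser(...) or a driver's command interface.
app_user = {
    "createUser": "health_app",
    "pwd": "change-me",                        # placeholder only
    "roles": [
        {"role": "readWrite", "db": "health"}  # app data only -- no admin roles
    ],
}
```

The key point is what is absent: no `root`, no `userAdminAnyDatabase`, nothing beyond the one database the application actually touches.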

The MongoDB Security Manual and Whitepaper are available at Mongodb.com

The number of questions in this session indicates that this is an area of intense interest.

See the session details below:

Creating a Single View Part 3: Securing Your Deployment

Security is more critical than ever with new computing environments in the cloud and expanding access to the internet. There are a number of security protection mechanisms available for MongoDB to ensure you have a stable and secure architecture for your deployment. We’ll walk through general security threats to databases and specifically how they can be mitigated for MongoDB deployments. Topics will include general security tools and how to configure those for MongoDB, an overview of security features available in MongoDB, including LDAP, SSL, x.509 and Authentication.

Buzz Moschetti

Enterprise Architect, Financial Services at MongoDB

Buzz started his career in software solutions after abandoning the idea of going to medical school. Contrary to popular lore, he does not have the patent on the little light on the CapsLock key. After a very brief stint at Salomon Brothers, he moved to Bear Stearns. As Chief Architect at Bear Stearns and JPMorganChase investment bank, he coded his brains out for 25 years while also keeping about 9000 IT staff moving in roughly the same direction. Buzz enjoys fast cars, cycling, scuba diving, single malts, and writing and recording music in his home studio a.k.a. closet with an outlet and a PC running ProTools.

Brian Goodman

Enterprise Architect at MongoDB

Prior to joining MongoDB, Brian led JPMorgan’s predictive analytics and innovation team. He is an IBM Distinguished Engineer and Master Inventor, having led a variety of teams in advanced and emerging technology. Brian has over 15 years of diverse experiences and client exposure in cloud computing, grassroots collaboration, social software, technology adoption and expertise location. Brian is currently on a hybrid-photo journey mixing analogue and digital photography.

Andreas Nilsson

Software Security Engineer at MongoDB

Andreas is a Software Security Engineer working on the core server team. Prior to joining MongoDB, Andreas was a Security Architect at NASDAQ OMX responsible for the security architecture of the trading systems. Past employment includes Check Point Software Technologies and Certezza. Andreas holds an MS degree in Computer Security from Columbia University and an MS degree in Engineering Physics from KTH Stockholm.


#MongoDBWorld @Genentech: Speeding Drug Research

This is my third session looking at the use of MongoDB in a health setting. See my earlier posts from today.

The Genentech session information:

The Best of Both Worlds: Speeding Up Drug Research with MongoDB & Oracle (Genentech)

Genentech Research and Early Development (gRED) develops drugs for significant unmet medical needs. Key to this effort is providing Investigators with new genetic strains of animals needed to understand disease causes and test new drugs. While these genetic strains have increased greatly in complexity, technology improvements have increased accuracy and throughput while reducing the cost of genetic testing. This has led to an effort to redevelop the Genetic Analysis Lab system to reduce the time needed to introduce new lab instruments from months to weeks or even days. Important to this initiative has been the introduction of MongoDB to capture the variety of data generated by genetic tests and integrate it with the existing Oracle RDBMS environment. Not only has it proved fairly easy to integrate the two, but we have been able to take advantage of the strengths of MongoDB to provide a flexible schema and Oracle to provide transaction management and integration with the existing information system.

Doug Garrett

Software Engineer in Research at Genentech

Doug Garrett, Software Engineer in Research at Genentech, has been developing software for over 20 years. Most recently he has worked on developing systems to support the processes and lab instruments needed for the development of genetic murine models needed for Genentech disease and drug research. Before Genentech, Doug worked for a number of different companies including Nokia, McKesson and Kaiser Medical. Doug holds a B.A. in Physics from Occidental College and an M.S. in Computer Information Systems from Boston University.

Genentech – Speeding Drug Research

The challenge was to integrate MongoDB with Oracle Relational Databases.

BioInformatics is different from IT.

  • The flexibility of the schema is a big benefit
  • It can also easily integrate with a traditional RDBMS
  • Saving time is critical when the goal is saving human lives

Every new lab instrument drove a change to the Oracle RDBMS schema. This created a time lag and slowed genetic testing.

The Development process

  • What is the disease cause? Is it genetic?
  • Develop a new mouse model
  • Develop a new drug. Is it safe and effective?
  • Then move to clinical trials

With Oracle and their schema, it took 6 months to modify the schema to add a new genetic test; the follow-on test took a further 3 months. A more flexible solution was needed that didn’t add to the complexity of the database.

This led to the selection of MongoDB.

1 million rows in Oracle became 4,000 documents in MongoDB.
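
That consolidation — many relational rows collapsing into one document per test run — can be sketched as below. The row fields (`run_id`, `marker`, `value`) are hypothetical, not Genentech's actual schema:

```python
# Sketch of collapsing per-measurement relational rows into one document
# per test run -- the pattern behind ~1M rows becoming ~4,000 documents.
# Field names are hypothetical, not Genentech's actual schema.
from collections import defaultdict

def rows_to_documents(rows):
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["run_id"]].append(
            {"marker": row["marker"], "value": row["value"]})
    return [{"_id": run_id, "results": results}
            for run_id, results in grouped.items()]
```

Each new instrument just contributes more entries to the embedded `results` array; the document shape, and hence the schema, stays put.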

The ingestion process is where the configuration for an instrument is focused.

There is then a generic data loader. The schema complexity is in the mongo document and maintained in one place.

A Java program presents users with a single window, combining the Oracle record view with the content of the relevant MongoDB document(s).

The DB schema is now immune to the introduction of new instruments.

Going Live

Issues – a Disaster Recovery copy

Oracle replication is challenging; it is not a built-in function.

With MongoDB, a replica was implemented in the DR site within a couple of hours.

MongoDB Aggregation Framework change: in Release 2.6 the 16MB limit on result sets was removed.

Other uses for MongoDB

Import from CSV, JSON, XML and other sources.
MongoDB as a data import service is great for building data pipelines.
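
A minimal sketch of the CSV-to-documents step of such a pipeline — the field handling is illustrative, and real loaders like mongoimport do this for you:

```python
# Turn CSV text into JSON-style documents ready for insertion --
# the first step of a MongoDB-backed import pipeline.
import csv
import io

def csv_to_documents(csv_text):
    # DictReader uses the header row as field names; all values arrive as strings.
    return list(csv.DictReader(io.StringIO(csv_text)))
```

From here the documents can be inserted into a collection as-is, with type coercion or enrichment added per-field as the pipeline matures.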


#mongoDBWorld @Sanofi talking about cancer research and Translational Medicine

Sanofi – Big Data and Translational Medicine

Session information: http://world.mongodb.com/mongodb-world/session/translational-medicine-platform-sanofi

Presentation by David Peyruc and David Erwan

www.sanofi.com

The third-largest pharma company.

Invests 4.7B Euros per year, from revenue of 33.4B Euros.

The challenge for Pharma

The classic business model is under threat. Generic drugs are a threat – “The Patent Cliff”

End of the Blockbuster Age.

New Paradigms

  • Personalized
  • Predictive
  • Preventive
  • Participatory

Translational Medicine is about bridging the gap between Clinical and Research worlds. Linking Hypothesis with Evidence.

Genomics and other *omics data needs to merge with other data. (See my other post from the Broad Institute from yesterday).

Translational Medicine Challenges

  • Diversity of data objects.
  • Large storage requirements. Genomics data is big.
  • Consistency and Traceability
  • A user-friendly curation process for annotation, to enable understanding and knowledge extraction

Big Data (MongoDB) is used to help Extract, Curate, Normalize and Load data.
MongoDB is the central repository for Biomarker data.

Why MongoDB?

  • File and Metadata together.
  • Scalable (Data was sharded from day 1)
  • Easy to install, use, understand and adapt/adopt.

The Journey to MongoDB

  • Big Data White Paper
  • Install and Benchmark
  • Proof of concept (a few javascript pages to demonstrate access)

MongoDB
– Running GridFS
– Apache Solr for search access
– Collections of metadata, config, user profiles, logs, tools etc.

  • Implement a REST API service layer
  • Build web, desktop and third-party software integration interfaces (using the REST API)

Use Cases

360 degree Explorer

  • Disease or Syndrome / Receptor

Geographic Zone / Health Activity

  • Show the same data via faceted navigation

The platform is not a standard IT product. So there is minimal IT support.

Benefits

Scientists benefit:

  • More efficient tagging and curation
  • Awareness of more data sources
  • Easy exploration of data
  • Easier integration of external data

IT

  • Faster development
  • Flexibility
  • Performance
  • Documentation, support, training and community support.

Twitter connections for the presenters.
@bob_dit_l_ane
@DPeyruc


#mongodbworld Werner Vogels CTO from @Amazon – #IOT the Internet of Things is already here

http://world.mongodb.com/content/keynote-werner-vogels-amazon

Werner Vogels – The Internet of Things is here

Great presentation from Werner

“The amount of information recorded about a baby in the first day of life is 70 times the information contained in the Library of Congress”

Observations – Theory – Models – Facts

The mapping of the Human Genome: the first genome map cost billions of dollars. Now there is the 1000 Genomes Project at NIH; 1,700 genomes are freely available on AWS, taking about 200TB.

Illumina has BaseSpace – Analytics for Genomics.

Unilever is doing deep sequenced Genome analytics.

The Ocean Observatories have deposited sensors on the ocean floors and are pulling data in to AWS for analysis.

The Mars Rover is using AWS S3 to merge 2 Mega Pixel camera images to create large panoramas.

Consumer world

Dropcam – just got purchased by Nest/Google for $550M. It provides a security camera feed and is one of the biggest inbound video services on the web – more data uploaded than to YouTube, petabytes per day.

[Ed: I wonder how soon this will go to Google’s App Engine?]
Glowcap / GlowPack –

If you don’t take your meds on time, the cap glows. If you ignore it, it plays a tune. If you keep ignoring it, it notifies your social network.

DirectLine in the UK – Insurance Company

It captures data from your smartphone while you drive and feeds your driving profile to the insurer. Drive well, get a discount!

Ask NAO

Autistic kids are happier talking to robots than to people, so robots interact with autistic kids.

Retail

Analytics have always been popular in retail: Who is my customer? What are they saying and doing?

All this drives toward personalization.
The more data you can collect, the more accurate your recommendations can be.

Werner gave some great examples where the Amazon recommendation fails – because there is not enough data.

Industrial

GE engines – turbines are being instrumented. The data flows to AWS and is analyzed. Efficiency gains of 1% can generate millions of dollars of savings.

Shell have instrumented their oil wells.

Kärcher is a German industrial cleaning company. All the devices they build are data generators, and the data is fed back to the equipment fleet owners.

DeConstruction – built mBuilder sensors for building sites. The data flows into an analytics dashboard to tell construction managers what is happening on a building site.

Sports

Each professional sports team has an analytics expert that can influence gameplay.

Some teams are equipping players with heart monitors. Players with same heartbeat profile are in sync and the team performs better.

Shockbox – developed a strip for kids’ ice hockey helmets to assess the likelihood of concussion.

Foursquare, a big MongoDB user: Social Cooler – the beer cooler only opens up if you check in with 3 of your friends. No drinking alone!

Tata – Predictive monitoring for Preventative Maintenance of Truck fleets.

OneBusAway – Public Transit app. Instrumented buses.

Waze and Moovit: Waze is user-generated content for private transport; Moovit is the same for public transport. Both are Israeli companies. Waze was acquired by Google.

Going real time.

We don’t want to know what happened yesterday. We want to know what is happening NOW.

Amazon built a real-time tool: Amazon Kinesis.

A managed service for real-time processing. Data is streamed in, sequenced, and then output to storage, e.g. S3, DynamoDB etc.
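
The stream-in, sequence, output-to-storage pattern described above can be sketched as a toy in Python — this is the shape of the idea, not the Kinesis API:

```python
# Toy version of the stream pattern: records arrive, receive a per-shard
# monotonically increasing sequence number, then go to a storage sink
# (a stand-in here for S3, DynamoDB, etc.).
from collections import defaultdict

class Stream:
    def __init__(self, sink):
        self.sink = sink                  # e.g. a list, or a DB writer
        self.next_seq = defaultdict(int)  # per-shard sequence counters

    def put(self, shard, data):
        seq = self.next_seq[shard]
        self.next_seq[shard] += 1
        self.sink.append({"shard": shard, "seq": seq, "data": data})
        return seq
```

Sequencing per shard is what lets downstream consumers replay a shard's records in order even though many producers write concurrently.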

Beyond the Display

To change behavior you have to find different interfaces.

Ambient Bus Pole – The light rises on the pole as the bus approaches.

A pre-paid electricity company provided customers with a lamp that glows when their account balance gets low. This solved the problem of people failing to top up their accounts on time.
