Tag Archives: BigData

Updated: Building a Healthcare Big Data Platform in the Cloud with @Datastax

Update: The date of the Webinar has been changed to Monday, August 18th.

On August 18th, at Noon ET, I will be presenting on a Webinar with Datastax about building a Cloud-based Big Data platform for the Healthcare software division of a Fortune 50 company using Datastax.

You can sign up for the webinar here: http://learn.datastax.com/WebinarDataStaxEnterpriseintheCloudHowthisFortune50HealthcareCompanyTurnedBigDataintoBigAnswers.html?utm_source=web&utm_medium=resources&utm_campaign=gtro4

This webinar will bring the presentation I delivered to a group in Boston, in May 2014, to a wider audience. I hope you can join me.

#MongoDBWorld Genomics and the Connectivity Map (A presentation from the Broad Institute)

More from #MongoDBWorld.

Presentation by the Broad Institute:

# MongoDB and the Connectivity Map: Making Connections Between Genetics and Disease

“The Broad Institute has developed a novel high-throughput gene-expression profiling technology and has used it to build an open-source catalog of over a million profiles that captures the functional states of cells when treated with drugs and other types of perturbations. Referred to as the Connectivity Map (or CMap), these data when paired with pattern matching algorithms, facilitate the discovery of connections between drugs, genes and diseases. We wished to expose this resource to scientists around the world via an API that is easily accessible to programmers and biologists alike. We required a database solution that could handle a variety of data types and handle frequent changes to the schema. We realized that a relational database did not fit our needs, and gravitated towards MongoDB for its ease of use, support for dynamic schema, complex data structures and expressive query syntax. In this talk, we’ll walk through how we built the CMap library. We’ll discuss why we chose MongoDB, the various schema design iterations and tradeoffs we’ve made, how people are using the API, and what we’re planning for the next generation of biomedical data.”

https://world.mongodb.com/mongodb-world/session/mongodb-and-connectivity-map-making-connections-between-genetics-and-disease

The Connectivity Map began as a pilot project in 2006.

7,000 experiments
19,000 registered users
1,200 Scientific Reports

A single gene expression signature is expensive – thousands of dollars.

As the cost drops, the number of experiments can increase.

This has grown to 1.5 million experiments.

MongoDB came into play because they didn’t know what the data structures needed to be.

The CMap LINCS dataset has built up a library of 1.4M gene expression profiles across 12,488 compounds.

The Connectivity Map is easy to describe but difficult to model.

1.4M profiles times 22,000 genes yields roughly 30B data points.
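To make that multiplication concrete, here is a quick back-of-the-envelope check. The profile and gene counts are from the talk; the 4-bytes-per-value storage size is purely my assumption for illustration.

```python
# Back-of-the-envelope size of the full expression matrix.
profiles = 1_400_000   # gene expression profiles (figure from the talk)
genes = 22_000         # genes per profile (figure from the talk)

data_points = profiles * genes
print(f"{data_points:,} data points")  # 30,800,000,000, i.e. roughly 30B

# Assumption for illustration only: each value stored as a 4-byte float.
BYTES_PER_VALUE = 4
size_gib = data_points * BYTES_PER_VALUE / 2**30
print(f"about {size_gib:.0f} GiB of raw values")
```

Even before any annotation is attached, a matrix of that size explains why the numerical data and the metadata end up being handled differently.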

This is further complicated by the diversity of use cases and users.

Annotation is complex and may be partial. The data is also frequently updated.

The Agile approach:
– Store just what’s needed
– Test and use daily
– Refactor frequently

The initial data model was simply an inventory of signatures.

The first version was 4-5 fields in a JSON data packet.
This evolved from a simple signature_info block to include cell_info and treatment_info.

They then added computed fields and external metadata to signature_info and cell_info. This is easy to do in MongoDB.
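As a rough sketch of that evolution, the snippet below uses plain Python dictionaries to stand in for the documents. Only the three block names come from the talk; every field name and value inside them is hypothetical, and this is not the Broad's actual schema.

```python
import copy

# Iteration 1: a bare inventory entry (all field names are hypothetical).
signature_v1 = {
    "sig_id": "SIG_0001",
    "cell": "A",
    "treatment": "compound_x",
    "dose_um": 10,
}


def split_blocks(sig):
    """Iteration 2: regroup the flat record into separate info blocks."""
    return {
        "signature_info": {"sig_id": sig["sig_id"]},
        "cell_info": {"cell": sig["cell"]},
        "treatment_info": {
            "treatment": sig["treatment"],
            "dose_um": sig["dose_um"],
        },
    }


def add_metadata(doc, computed, external):
    """Iteration 3: attach computed fields and external metadata.

    Returns a new document; the input is left untouched, just as older
    documents without the new keys could stay as they are.
    """
    out = copy.deepcopy(doc)
    out["signature_info"].update(computed)
    out["cell_info"].update(external)
    return out


signature_v2 = split_blocks(signature_v1)
signature_v3 = add_metadata(
    signature_v2,
    computed={"n_replicates": 3},   # hypothetical computed field
    external={"tissue": "breast"},  # hypothetical external annotation
)
print(signature_v3)
```

The point of the sketch is that each iteration only adds keys, so documents from all three iterations could sit side by side in a dynamic-schema store.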

APIs are awesome! Life Sciences need more of them.

Functionality in the API won out over convention, so they used the /siginfo?q={"cell":"A"} query style rather than the folder convention /siginfo/cell/A.
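Here is a minimal sketch of why the query style is attractive: the whole filter travels as one JSON value, so the server can decode it and match documents without a route per field. The handler and sample documents are my own illustration using only the Python standard library, not the CMap API code (which, per the talk, is built on Node.js).

```python
import json
from urllib.parse import parse_qs, quote, urlsplit


def parse_filter(url):
    """Return (resource, filter_dict) from a URL such as /siginfo?q={...}."""
    parts = urlsplit(url)
    resource = parts.path.strip("/")
    params = parse_qs(parts.query)
    raw = params.get("q", ["{}"])[0]  # a missing q means "match everything"
    return resource, json.loads(raw)


def matches(doc, flt):
    """Simple equality match on every key in the filter."""
    return all(doc.get(key) == value for key, value in flt.items())


# Hypothetical documents, for illustration only.
docs = [
    {"sig_id": "SIG_0001", "cell": "A"},
    {"sig_id": "SIG_0002", "cell": "B"},
]

url = "/siginfo?q=" + quote('{"cell":"A"}')
resource, query = parse_filter(url)
hits = [d for d in docs if matches(d, query)]
print(resource, query, hits)
```

Adding a second condition means adding a key to the JSON, whereas the folder convention would need a new path pattern for each combination of fields.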

Node.js and Mongoose (as noted in the earlier LinkedIn session) came into play for easy API creation.

The Compute API, running on AWS, performs message queuing via a capped collection.
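A capped collection is fixed-size and keeps insertion order, discarding the oldest documents once it is full, which is what makes it usable as a lightweight queue. The sketch below imitates that behaviour in memory with a bounded deque. It is only an analogy: the class and method names are mine, and it caps by document count for simplicity, whereas a real capped collection is sized in bytes.

```python
from collections import deque


class CappedQueue:
    """In-memory stand-in for a capped collection used as a message queue."""

    def __init__(self, max_docs):
        # Oldest entries fall off automatically once max_docs is reached.
        self._docs = deque(maxlen=max_docs)
        self._next_seq = 0

    def publish(self, payload):
        self._docs.append({"seq": self._next_seq, "payload": payload})
        self._next_seq += 1

    def read_after(self, last_seq):
        """Messages newer than last_seq, in insertion order.

        This is roughly what a consumer tailing the collection would see.
        """
        return [d for d in self._docs if d["seq"] > last_seq]


queue = CappedQueue(max_docs=3)
for job in ["job-a", "job-b", "job-c", "job-d"]:
    queue.publish(job)

pending = [d["payload"] for d in queue.read_after(last_seq=-1)]
print(pending)  # job-a has already been overwritten
```

The trade-off shows up in the output: a slow consumer can miss messages, so this pattern suits work that can tolerate being dropped or resubmitted.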

HDF5 (Hierarchical Data Format) complements MongoDB for numerical analysis.

GCTX is a binary format based on HDF5, cross platform with multiple language bindings.

Broad’s platform is Lincscloud – targeted to researchers: lincscloud.org
This is free for academic use.

Uses of Broad’s tools:

  • Predicting Drug Function
  • Drug Re-purposing (failed drugs – new uses)
    i.e. drugs whose Phase 2 trial results don’t live up to expectations but where the DRUG IS SAFE!

So can the drug be re-mapped to new targets?
– Pushing from single patient application to two patients and on to population applications.

[tag health cloud BigData MongoDB MongoDBWorld NoSQL]

Mark Scrimshire
Health & Cloud Technology Consultant

Mark is available for challenging assignments at the intersection of Health and Technology using Big Data, Mobile and Cloud Technologies. If you need help to move, or create, your health applications in the cloud let’s talk.
Blog: http://blog.ekivemark.com
email: mark@ekivemark.com
Stay up-to-date: Twitter @ekivemark
Disclosure: I began as a Patient Engagement Advisor and am now CTO to Personiform, Inc. and their Medyear.com platform. Medyear is a powerful free tool that helps you collect, organize and securely share health information, however you want. Manage your own health records today. Medyear: The Power Grid for your Health.

#MongoDBWorld Hidden gems in the new 2.6 version of @mongoDB

More from #MongoDBWorld.

Hidden Gems in the 2.6 Release

Everyone using MongoDB is familiar with the big features of the 2.6 release (and if you’re not, here’s a link) — text search, $out, user-defined roles, X509 authentication, etc. But what about the little guys? Our VP of Engineering, Daniel Pasette, will take you on a tour of five small but mighty features from the 2.6 release that make your MongoDB experience more productive.

Dan Pasette

VP of Core Engineering at MongoDB

Dan is the VP of Core Engineering at MongoDB. Prior to joining MongoDB, Dan was a Development Manager at LimeWire where he led a team working on content ingestion for an (unreleased) digital music service called Grapevine. Past employment includes MTV Networks, Sonicnet, iXL, and Electronic Book Technologies. Dan holds a degree in Computer Science from Brown University.

http://world.mongodb.com/mongodb-world/session/hidden-gems-26-release

The Technical sessions are packed. I was hoping to look at Memory Management but the room was full to overflowing. So I dropped in to the session on the latest release of MongoDB – Version 2.6.

Power of 2 – Now default allocation Strategy

The Power of 2 feature allocates extra space when saving records. It is on by default in the latest release. It is best suited to workloads that frequently rewrite existing records. What typically happens is that a rewrite expands the record so that it no longer fits in its existing space. The extra space enabled by Power of 2 makes it more likely that records can be written back to the blocks they came from.

By adding space to records, it reduces the amount of data movement, because as data grows inside records the records still fit.
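The idea can be sketched in a few lines: round each record's allocation up to the next power of two, then check whether a grown record still fits in its slot. This is a simplified illustration. The 32-byte floor is my assumption, and the server applies its own bounds and special cases.

```python
def power_of_2_allocation(record_bytes, minimum=32):
    """Round a record size up to the next power of two.

    The minimum of 32 bytes is an assumption made for this sketch.
    """
    size = minimum
    while size < record_bytes:
        size *= 2
    return size


def fits_in_place(old_bytes, new_bytes):
    """True if a record that grew to new_bytes still fits its original slot."""
    return new_bytes <= power_of_2_allocation(old_bytes)


allocated = power_of_2_allocation(700)
print(allocated)                  # 1024
print(fits_in_place(700, 1000))   # True: grows inside its padding
print(fits_in_place(700, 1100))   # False: the record would have to move
```

The cost is the padding itself: a 700-byte record occupies 1,024 bytes, which is why the strategy pays off mainly when records are updated and grow.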

Server Side Timeouts

For example, a collection is indexed in staging but the index is forgotten in production. This can cause table scans that lead users to retry or re-scan, which creates socket timeouts and can impact other users on the system. The new feature is maxTimeMS, which lets you set a maximum time for how long an operation can run in the database. Set it anywhere from milliseconds to minutes depending on the operation.
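In the mongo shell this limit is attached to a cursor with maxTimeMS(). To show the behaviour without needing a server, here is a toy model in plain Python: a full scan that checks a deadline as it goes and aborts once the budget is spent. The function, the exception and the fake clock are all hypothetical names of mine, and the real server enforces the limit internally rather than in client code.

```python
import time


class OperationTimedOut(Exception):
    """Hypothetical error raised when an operation exceeds its time budget."""


def monotonic_ms():
    return time.monotonic() * 1000.0


def scan_with_budget(docs, predicate, max_time_ms, clock_ms=monotonic_ms):
    """Unindexed scan that gives up once max_time_ms has elapsed."""
    deadline = clock_ms() + max_time_ms
    matches = []
    for doc in docs:
        if clock_ms() > deadline:
            raise OperationTimedOut(f"operation exceeded {max_time_ms} ms")
        if predicate(doc):
            matches.append(doc)
    return matches


def make_fake_clock(step_ms=10):
    """Deterministic clock for the demo: every reading advances by step_ms."""
    now = [0]

    def tick():
        now[0] += step_ms
        return now[0]

    return tick


docs = [{"n": i} for i in range(100)]

try:
    scan_with_budget(docs, lambda d: True, max_time_ms=50,
                     clock_ms=make_fake_clock())
    timed_out = False
except OperationTimedOut:
    timed_out = True

small = scan_with_budget(docs[:3], lambda d: True, max_time_ms=50,
                         clock_ms=make_fake_clock())
print(timed_out, len(small))
```

The long scan is cut off instead of holding resources while the client retries, and the short one completes normally.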

Query Engine Introspection

This works in conjunction with maxTimeMS. It allows you to delve into queries to resolve problems. The query execution framework was completely rewritten in 2.6. Prior to 2.6, the query path and related details were opaque to users. This changed in 2.6.

The Query Planner chooses the best index for a given query.

The Query Parser sends the query to the Query Planner. This is passed to the Plan Cache, which passes it to the Plan Runner.

The Plan Enumerator passes all the candidate plans to the MultiPlanRunner. This runs the plans for a limited amount of time and then chooses the most efficient.

On subsequent executions of the same query, the query goes straight to the Plan Cache.
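The flow described above can be mimicked with a toy planner: give every candidate plan the same limited work budget, keep the one that produces the most results, and remember the winner per query shape. Everything here is my own simplification (the class, the scoring rule, the two fake plans), not MongoDB's actual code.

```python
def query_shape(query):
    """Cache key: which fields a query touches, ignoring the values."""
    return tuple(sorted(query))


class ToyPlanner:
    def __init__(self, plans, trial_budget=100):
        # plans maps a plan name to a function (query, budget) -> number of
        # results produced within that work budget.
        self.plans = plans
        self.trial_budget = trial_budget
        self.cache = {}
        self.trials_run = 0

    def choose(self, query):
        shape = query_shape(query)
        if shape in self.cache:  # seen before: skip the trial entirely
            return self.cache[shape]
        self.trials_run += 1
        scores = {name: plan(query, self.trial_budget)
                  for name, plan in self.plans.items()}
        best = max(scores, key=scores.get)
        self.cache[shape] = best
        return best

    def clear_cache(self):
        """Stand-in for dropping cached plans, e.g. after an index change."""
        self.cache.clear()


# Two hypothetical plans: the index yields a match per unit of work,
# while the collection scan wastes most of its budget.
plans = {
    "index_on_cell": lambda query, budget: budget,
    "collection_scan": lambda query, budget: budget // 50,
}

planner = ToyPlanner(plans)
first = planner.choose({"cell": "A"})
second = planner.choose({"cell": "B"})  # same shape, served from the cache
print(first, second, planner.trials_run)
```

Because the cache is keyed on shape rather than values, a plan that won for one value is reused for every other value, which is how a cached plan can turn out to be a poor fit for a later query.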

The Plan Cache can end up holding a sub-optimal plan.
Plans are dropped after indexing and other major changes.

getPlanCache

A set of Plan Cache tools to view and manipulate the cache.

Background indexing on Secondaries

This has existed but the feature has been rounded out.

Pre-2.6, background index builds became foreground index builds when replicated to secondaries.

2.6 keeps background index builds in the background on secondaries.
Note: Background index builds are slower, and the resulting indexes are less tightly packed.

User Driven Enhancements

All of these features came about as a result of user feedback that goes through jira.mongodb.org.

Limits on Replica sets

A replica set is limited to 12 nodes, of which at most 7 can be voting members.
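A small sanity check along these lines can catch a bad layout before it is applied. The two limits are the ones quoted in the session. The function and the member dictionaries are my own sketch, with a "votes" key loosely modelled on a replica set member configuration.

```python
MAX_MEMBERS = 12  # limit quoted in the session
MAX_VOTING = 7    # limit quoted in the session


def check_replica_set(members):
    """Return a list of problems for a proposed member list.

    Each member is a dict; a missing "votes" key is treated as 1 vote.
    """
    problems = []
    if len(members) > MAX_MEMBERS:
        problems.append(f"{len(members)} members exceeds the limit of {MAX_MEMBERS}")
    voting = sum(1 for m in members if m.get("votes", 1) > 0)
    if voting > MAX_VOTING:
        problems.append(f"{voting} voting members exceeds the limit of {MAX_VOTING}")
    return problems


# 12 members, 7 of them voting: right at both limits.
ok_layout = [{"votes": 1}] * 7 + [{"votes": 0}] * 5
too_many_voters = [{"votes": 1}] * 8 + [{"votes": 0}] * 4
too_many_members = [{"votes": 1}] * 7 + [{"votes": 0}] * 6

print(check_replica_set(ok_layout))
print(check_replica_set(too_many_voters))
print(check_replica_set(too_many_members))
```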
