
#MongoDBWorld Genomics and the Connectivity Map (A presentation from the Broad Institute)

More from #MongoDBWorld.

Presentation by the Broad Institute:

# MongoDB and the Connectivity Map: Making Connections Between Genetics and Disease

“The Broad Institute has developed a novel high-throughput gene-expression profiling technology and has used it to build an open-source catalog of over a million profiles that captures the functional states of cells when treated with drugs and other types of perturbations. Referred to as the Connectivity Map (or CMap), these data when paired with pattern matching algorithms, facilitate the discovery of connections between drugs, genes and diseases. We wished to expose this resource to scientists around the world via an API that is easily accessible to programmers and biologists alike. We required a database solution that could handle a variety of data types and handle frequent changes to the schema. We realized that a relational database did not fit our needs, and gravitated towards MongoDB for its ease of use, support for dynamic schema, complex data structures and expressive query syntax. In this talk, we’ll walk through how we built the CMap library. We’ll discuss why we chose MongoDB, the various schema design iterations and tradeoffs we’ve made, how people are using the API, and what we’re planning for the next generation of biomedical data.”


The Connectivity Map began as a pilot project in 2006.

7,000 experiments
19,000 registered users
1,200 Scientific Reports

A single gene expression signature is expensive: thousands of dollars.

As the cost drops, the number of experiments can increase.

This has grown to 1.5 million experiments.

MongoDB came into play because the team didn’t yet know what the data structures needed to be.

The CMap LINCS dataset is a library of 1.4M gene expression profiles.
12,488 compounds,

The Connectivity Map is easy to describe but difficult to model.

1.4M profiles times 22,000 genes yields roughly 30B data points.
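The arithmetic behind that figure, as a quick check. The two counts come from the talk; the 4-byte value size is only an assumption for illustration, not something the presenters stated.

```python
# Back-of-the-envelope size of the full expression matrix.
profiles = 1_400_000   # 1.4M profiles (from the talk)
genes = 22_000         # genes measured per profile (from the talk)

data_points = profiles * genes
print(f"{data_points:,} data points")  # 30,800,000,000 data points

# Assumption for illustration only: each value stored as a 4-byte float.
bytes_per_value = 4
size_gb = data_points * bytes_per_value / 1e9
print(f"about {size_gb:.0f} GB uncompressed")
```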

This is further complicated by the diversity of use cases and users.

Annotation is complex and may be partial. The data is also frequently updated.

The Agile approach:
– Store just what’s needed
– Test and use daily
– Refactor frequently

The initial data model was simply an inventory of signatures.

4-5 fields in a JSON data packet.
This evolved from a simple signature_info block to include cell_info and treatment_info.

They then added computed fields and external metadata to signature_info and cell_info. This is easy to do in MongoDB.
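A rough sketch of how one document might grow through those iterations. Only the block names (signature_info, cell_info, treatment_info) come from the talk; every field name and value inside them is hypothetical, and plain Python dicts stand in for MongoDB documents.

```python
import copy
import json

# Version 1: a bare inventory entry with a handful of fields.
# All field names and values below are hypothetical.
signature_v1 = {
    "sig_id": "SIG_0001",
    "signature_info": {"pert_name": "compound-X", "cell": "A", "dose_um": 10},
}

# Version 2: the same document with cell and treatment details split out.
signature_v2 = copy.deepcopy(signature_v1)
signature_v2["cell_info"] = {"cell": "A", "tissue": "lung"}
signature_v2["treatment_info"] = {"pert_name": "compound-X", "duration_h": 6}

# Version 3: a computed field and external metadata added alongside.
signature_v3 = copy.deepcopy(signature_v2)
signature_v3["signature_info"]["n_replicates"] = 3                          # computed
signature_v3["cell_info"]["external_ids"] = {"vendor": "hypothetical-id"}  # external

print(json.dumps(signature_v3, indent=2))
```

With a dynamic schema, the older and newer shapes can sit in the same collection, so each step is an additive change rather than a table migration.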

APIs are awesome! Life Sciences need more of them.

Functionality in the API won out over convention, so they used the query style /siginfo?q={"cell":"A"} rather than the folder-style convention /siginfo/cell/A.
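A minimal sketch of that query style: a MongoDB-like filter document is serialized to JSON and carried in the q parameter. The host api.example.org and both helper functions are placeholders invented for illustration; this is not the actual CMap client or server code.

```python
import json
from urllib.parse import urlencode, urlsplit, parse_qs

def build_query_url(base, resource, query):
    """Encode a filter document into the q parameter of the URL."""
    q = json.dumps(query, separators=(",", ":"))
    return f"{base}/{resource}?{urlencode({'q': q})}"

def parse_query_url(url):
    """Recover the filter document from the q parameter (server side)."""
    params = parse_qs(urlsplit(url).query)
    return json.loads(params["q"][0])

# api.example.org is a placeholder host, not the real endpoint.
url = build_query_url("https://api.example.org", "siginfo", {"cell": "A"})
print(url)
print(parse_query_url(url))
```

The appeal of this style is that any filter the database understands can be passed through unchanged, whereas a path such as /siginfo/cell/A needs a new route for each kind of lookup.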

Node.js and Mongoose (as noted in the earlier LinkedIn session) came into play for easy API creation.

Compute API running on AWS performs message queuing via a capped collection.
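To show why a capped collection suits message queuing, here is an in-memory stand-in for its behavior: fixed capacity, insertion order preserved, oldest entries overwritten when full. This is a simulation in plain Python, not MongoDB code and not the Broad implementation; the CappedQueue class and the job names are invented for illustration. A real capped collection is sized in bytes and is typically read with a tailable cursor.

```python
from collections import deque

class CappedQueue:
    """In-memory stand-in for a capped collection."""

    def __init__(self, max_docs):
        # deque(maxlen=...) silently drops the oldest item when full.
        self._docs = deque(maxlen=max_docs)
        self._next_seq = 0

    def publish(self, message):
        self._docs.append({"seq": self._next_seq, "msg": message})
        self._next_seq += 1

    def read_after(self, last_seq):
        """Return messages newer than last_seq, as a tailing consumer would."""
        return [d for d in self._docs if d["seq"] > last_seq]

queue = CappedQueue(max_docs=3)
for job in ["job-a", "job-b", "job-c", "job-d"]:
    queue.publish(job)

# job-a has been overwritten; a consumer that last saw seq 1 gets the rest.
pending = queue.read_after(last_seq=1)
print([d["msg"] for d in pending])  # ['job-c', 'job-d']
```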

HDF5 (Hierarchical Data Format) complements MongoDB for numerical analysis.

GCTX is a binary format based on HDF5, cross platform with multiple language bindings.

Broad’s platform is Lincscloud – targeted at researchers: Lincscloud.org
This is free for academic use.

Uses of Broad’s tools:

  • Predicting Drug Function
  • Drug Re-purposing (failed drugs – new uses)
    e.g. Phase 2 trials are often where results don’t live up to expectations, even though the drug has been shown to be safe.

So can the drug be re-mapped to new targets?
– Pushing from single patient application to two patients and on to population applications.


Mark Scrimshire
Health & Cloud Technology Consultant

Blog: http://blog.ekivemark.com
Stay up-to-date: Twitter @ekivemark
Disclosure: I began as a Patient Engagement Advisor and am now CTO to Personiform, Inc. and their Medyear.com platform. Medyear is a powerful free tool that helps you collect, organize and securely share health information, however you want. Manage your own health records today. Medyear: The Power Grid for your Health.