Cutting Through The (Healthcare) Big Data Hype at #BigDataTechCon

The best antidote to hype about a technology (in this case, Big Data) is knowledge about the technology (especially the Hadoop ecosystem plus complementary and competing technologies).

So, off I went to Big Data Tech Con in Boston last week. I loved my deep dive into the software nuts and bolts under the hood of Big Data. I learned about (in some cases hands-on) and live-tweeted about:

  • Hadoop: distributed processing of large datasets across clusters of computers using a simple programming model
  • Ambari: deployment, configuration, and monitoring of Hadoop clusters
  • Flume: collection and import of log and event data (of which there is a lot!)
  • HBase: column-oriented NoSQL database scaling to billions of rows
  • HCatalog: schema and data-type sharing across Pig, Hive, and MapReduce
  • HDFS: distributed, redundant file system for Hadoop
  • Hive: data warehouse with SQL-like access
  • Mahout: library of machine learning and data mining algorithms
  • MapReduce: parallel computation on server clusters (a word-count sketch follows this list)
  • Oozie: orchestration and workflow management (BPM for Hadoop)
  • Pig: high-level programming language for Hadoop computations
  • Sqoop: imports data from relational databases (extract/transform/load)
  • ZooKeeper: configuration management and coordination of Hadoop nodes
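
To make the MapReduce item concrete, here is a minimal word-count sketch in the Hadoop Streaming style, where the mapper and reducer are ordinary scripts reading stdin and writing stdout. This is an illustration only, not a production job; a real job would be submitted via the hadoop-streaming jar, which handles the shuffle-and-sort between the two phases.

```python
#!/usr/bin/env python
# Word count in the MapReduce style (Hadoop Streaming convention):
# mapper and reducer are plain scripts; the framework sorts mapper
# output by key before it reaches the reducer.
import sys

def mapper():
    # Emit "word<TAB>1" for every word on every input line.
    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

def reducer():
    # Input arrives sorted by key, so all counts for a word are contiguous.
    current, total = None, 0
    for line in sys.stdin:
        word, count = line.rsplit("\t", 1)
        if word != current:
            if current is not None:
                print(f"{current}\t{total}")
            current, total = word, 0
        total += int(count)
    if current is not None:
        print(f"{current}\t{total}")

if __name__ == "__main__":
    # Run as "script.py map" for the map phase; anything else reduces.
    mapper() if sys.argv[1:] == ["map"] else reducer()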

The list above is adapted from O’Reilly’s Big Data Now. It’s free, and the first chapter is an excellent overview. Below are additional topics covered at Big Data Tech Con:

  • NoSQL: non-relational databases (“Not Only SQL” still includes some SQL)
  • Cassandra: highly available NoSQL database with no single point of failure
  • MongoDB: scalable, high-performance, document-oriented NoSQL database (see the sketch after this list)
  • CouchDB: easily replicated document-oriented NoSQL database
  • Google BigQuery: web service for interactive analysis of massive datasets
  • Google Prediction API: cloud-based machine learning tools for analyzing data
  • R plus Hadoop: statistical analysis of massive datasets stored in Hadoop
  • Storm: real-time complex event processing (Hadoop is batch-oriented)
  • Impala: interactive SQL queries over data in Hadoop’s HDFS file system
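
Since several entries above are document-oriented NoSQL databases, here is a minimal sketch of what “document-oriented” means in practice, using MongoDB and the pymongo driver. It assumes a MongoDB server running locally; the database, collection, and field names are made up for illustration.

```python
# Document-oriented storage sketch: records are schemaless JSON-like
# documents rather than rows in a fixed-schema table.
# Assumes a local mongod instance and the pymongo driver.
from pymongo import MongoClient

client = MongoClient("localhost", 27017)
db = client["demo"]  # hypothetical database name

# Fields can vary from document to document; no schema migration needed.
db.encounters.insert_one({
    "patient_id": "p-001",        # hypothetical fields, illustration only
    "diagnosis": "hypertension",
    "meds": ["lisinopril"],
})

# Query by field value without declaring a table schema up front.
for doc in db.encounters.find({"diagnosis": "hypertension"}):
    print(doc)
```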

What do these technologies do that makes Big Data possible? The shortest commonsense answer is that they can count higher than your laptop, or even the server down the hall or in your IS department.

Count, you say? Count? That doesn’t sound impressive or exotic at all! True. But it turns out that the ability to count items in large sets enables remarkably intelligent software systems. One way to think about Big Data is that, in contrast to sophisticated statistical analysis using a single SQL database, Big Data is advanced applied counting, in parallel, across many databases.

Instead of complex algorithms applied to (relatively) small samples of data, Big Data applies (relatively) simple algorithms (such as frequency counts) to (relatively) large amounts of data, sometimes so much that it is “all” the data generated by some software process. The more data, the better machine learning algorithms work. And the more data monitored in real time, the more individualized smartphone, website, and desktop workflows can become.
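
As a toy illustration of “advanced applied counting, in parallel,” here is a frequency count split across worker processes and then merged; the same split-count-merge shape that MapReduce applies across whole clusters. The corpus is stand-in data.

```python
# Frequency counting in parallel: split the data, count each chunk,
# merge the partial counts -- the shape MapReduce uses at cluster scale.
from collections import Counter
from multiprocessing import Pool

def count_chunk(lines):
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts

if __name__ == "__main__":
    corpus = ["patient admitted with chest pain"] * 1000  # stand-in data
    chunks = [corpus[i::4] for i in range(4)]             # 4 workers
    with Pool(4) as pool:
        partial = pool.map(count_chunk, chunks)           # "map" phase
    total = sum(partial, Counter())                       # "reduce" phase
    print(total.most_common(3))
```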

What can you do with these counts? You can estimate probabilities. With these probabilities you can create automated systems that can predict

  • the next word (useful for natural language processing; see the sketch after this list),
  • entities (people, places, diseases, etc.),
  • relationships (before/during/after, has-a/is-a, over/on/in/beside/below, etc.) among entities, and even
  • human behavior (when will you end your smartphone contract, will you stick to your meds, even whether you will reach for that next piece of pie).
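
Here is the sketch promised above: next-word prediction from nothing fancier than bigram counts, with probabilities estimated as counts divided by totals. The toy corpus is made up; real systems do the same counting over vastly more text.

```python
# Next-word prediction from bigram counts: probability estimates are
# just counts divided by totals. The corpus here is stand-in data.
from collections import Counter, defaultdict

corpus = "the patient is stable the patient is improving".split()

bigrams = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigrams[prev][nxt] += 1

def predict_next(word):
    """Most probable next word and its estimated probability."""
    counts = bigrams[word]
    total = sum(counts.values())
    best, n = counts.most_common(1)[0]
    return best, n / total

print(predict_next("patient"))  # -> ('is', 1.0) on this toy corpus
```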

I’ll cover more of the technologies listed above, perhaps in one of my occasional 5,000-word blog post opuses. Oozie is especially cool. It’s a workflow management system for Hadoop!

In my humble opinion, the best way for a health IT professional to cut through the healthcare Big Data hype is to learn about the nuts and bolts of Big Data software. If you’ve ever taken an introduction-to-programming course (or are willing to take one on Coursera) and know a bit of SQL, Bob’s your uncle!

Then, when someone makes an off-the-wall claim or wild-eyed guesstimate about healthcare Big Data, you can at least try to imagine how it might be accomplished (a mental architectural sniff test). A little scientific and technical knowledge is exactly what more people need to take advantage of the benefits Big Data offers healthcare.

Right now, if you’re in healthcare and you’re interested in this stuff, check out the next Big Data Tech Con, in San Francisco. No, it’s not about healthcare Big Data. It’s about the nuts and bolts of the tools and techniques of Big Data (period), and that’s exactly what you need. It’ll be up to you to apply the tools to healthcare.

That said, healthcare did come up a couple times…

As an attendee I didn’t have access to the attendee list (apparently only exhibitors do; fair enough). But I did ask and was told that a bunch of healthcare folks were there. In conversation, I heard of one well-known EHR vendor investigating Hadoop for storing data from its customers, as well as a health insurance exchange doing something similar.

The next tweet (not tweeted from #BigDataTechCon) links to a delightfully detailed (i.e. “non-hype”) description of one medical center’s use of Hadoop.

(I added the following two tweets on 4/15/2013.)

The software under Big Data’s hood does indeed have the potential to save billions of healthcare dollars. But, it has to be done right.

The biggest obstacle to doing it right? Workflow is the key. And health IT is not (yet) doing workflow right. (You knew I was going to eventually mention workflow, right? I always do.)

More later. Much more. But not too much later.

P.S. Just in case you don’t believe me and think I’m making this all up, here’s my certificate of completion! 🙂

[Image: certificate of completion]

P.P.S. Here are my tweets from #BigDataTechCon, in reverse order (so you don’t have to read from the bottom up).