The best antidote to hype about a technology (in this case, Big Data) is knowledge about the technology (especially the Hadoop ecosystem plus complementary and competing technologies).
Here’s a completely gratuitous photo of default slide displays between speakers. PS Great conference!
— Charles Webster, MD()
So, off I went to Big Data Tech Con in Boston last week. I loved my deep dive into the software nuts and bolts under the hood of Big Data. I learned about (in some cases hands-on), and live-tweeted about, the following:
- Hadoop: distributed processing of large datasets across clusters of computers using a simple programming model
- Ambari: deployment, configuration, and monitoring of Hadoop clusters
- Flume: collection and import of log and event data (of which there is a lot!)
- HBase: column-oriented NoSQL database scaling to billions of rows
- HCatalog: schema and data type sharing across Pig, Hive, and MapReduce
- HDFS: distributed, redundant file system for Hadoop
- Hive: data warehouse with SQL-like access
- Mahout: library of machine learning and data mining algorithms
- MapReduce: parallel computation on server clusters (see the WordCount sketch after this list)
- Pig: high-level programming language for Hadoop computations
- Oozie: orchestration and workflow management (BPM for Hadoop)
- Sqoop: imports data from relational databases (extract/transform/load)
- Zookeeper: configuration management and coordination of Hadoop nodes
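To make the MapReduce entry concrete, here is a minimal sketch of the classic WordCount job in Python, written in the Hadoop Streaming style (the file name wordcount.py and its map/reduce command-line switch are my own choices for illustration, not anything from the conference). The same script can be exercised with an ordinary shell pipe before it ever touches a cluster:

```python
# wordcount.py: an illustrative WordCount in the Hadoop Streaming style.
# Local test (no cluster needed):
#   cat some.txt | python wordcount.py map | sort | python wordcount.py reduce
import sys


def mapper(stream):
    # "Map" step: emit "word<TAB>1" for every word; Hadoop shuffles/sorts these by key.
    for line in stream:
        for word in line.strip().lower().split():
            print(f"{word}\t1")


def reducer(stream):
    # "Reduce" step: input arrives grouped by key, so a running total per word works.
    current_word, count = None, 0
    for line in stream:
        word, value = line.rstrip("\n").split("\t")
        if word != current_word:
            if current_word is not None:
                print(f"{current_word}\t{count}")
            current_word, count = word, 0
        count += int(value)
    if current_word is not None:
        print(f"{current_word}\t{count}")


if __name__ == "__main__":
    mapper(sys.stdin) if sys.argv[1] == "map" else reducer(sys.stdin)
```

That split into a stateless mapper and a per-key reducer is essentially all MapReduce asks of you; Hadoop supplies the distribution across the cluster, the shuffle/sort between the two steps, and the fault tolerance.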
The Hadoop ecosystem list above is adapted from O’Reilly’s Big Data Now. It’s free, and the first chapter is an excellent overview. Below are additional topics covered at Big Data Tech Con:
- NoSQL: non-relational databases (“Not Only SQL”; some do support SQL-like queries)
- Cassandra: highly available NoSQL database with no single point of failure
- MongoDB: scalable, high-performance, document-oriented NoSQL database (see the sketch after this list)
- CouchDB: easily replicated document-oriented NoSQL database
- Google BigQuery: web service for interactive analysis of massive datasets
- Google Prediction API: cloud-based machine learning tools for analyzing data
- R plus Hadoop: statistical analysis of massive datasets stored in Hadoop
- Storm: real-time complex event processing (Hadoop is batch)
- Impala: interactive SQL query engine over data in Hadoop’s HDFS file system
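To give a feel for what “document-oriented” means in the MongoDB and CouchDB entries, here is a minimal sketch, assuming a local MongoDB instance and the pymongo driver; the database, collection, and field names are invented for illustration, not taken from any real system:

```python
# A document-oriented record: the whole encounter travels as one nested, JSON-like
# document instead of being normalized across several relational tables.
from pymongo import MongoClient  # assumes `pip install pymongo` and a local mongod

client = MongoClient("mongodb://localhost:27017")
encounters = client["demo_hospital"]["encounters"]  # invented database/collection names

encounters.insert_one({
    "patient_id": "p-0042",
    "admitted": "2013-04-08",
    "diagnosis_codes": ["E11.9", "I10"],      # arrays are first-class; no join table
    "vitals": {"bp": "128/82", "pulse": 76},  # nested sub-document; no extra table
})

# Query by a value inside the array: no JOINs, and documents may vary in shape.
diabetic_encounters = encounters.count_documents({"diagnosis_codes": "E11.9"})
print(f"encounters coded E11.9: {diabetic_encounters}")
```

The nested vitals and the array of diagnosis codes live inside a single document, so there is no join-table schema to design up front, and documents in the same collection are free to differ.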
What do these technologies do that makes Big Data possible? The shortest commonsense answer is that they can count higher than your laptop, or even the server down the hall or in your IS department.
Addition in the Large: Simple Counts & Not-So-Simple Counts keynote by @‘s Oscar Boykin @
— Charles Webster, MD()
Count? you say. Count? That doesn’t sound impressive or exotic at all! True. But it turns out that the ability to count items in large sets enables remarkably intelligent software systems. One way to think about Big Data is that, in contrast to sophisticated statistical analysis over a single SQL database, Big Data is advanced applied counting, in parallel, across many databases.
Instead of complex algorithms applied to (relatively) small samples of data, Big Data means (relatively) simple algorithms (such as frequency counts) applied to (relatively) huge amounts of data, sometimes so much that it is “all” the data generated by some software process. The more data, the better machine learning algorithms work. The more data monitored in real time, the more individualized smartphone, website, and desktop workflows can become.
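That “simple algorithms over lots of data” idea fits in a few lines. The sketch below (with an invented toy corpus) splits the data, counts each chunk in a separate process, and merges the partial counts; MapReduce does exactly this, only spread across a cluster instead of local processes:

```python
# Frequency counting "in the small": the same split -> count -> merge shape that
# Hadoop applies across a cluster, here spread across local worker processes.
from collections import Counter
from functools import reduce
from multiprocessing import Pool


def count_chunk(lines):
    # Each worker counts words in its own slice of the data (the "map" side).
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


if __name__ == "__main__":
    corpus = ["big data is applied counting",
              "counting is simple",
              "simple counting at scale is not so simple"]  # invented toy corpus
    chunks = [corpus[i::4] for i in range(4)]                # naive 4-way split

    with Pool(processes=4) as pool:
        partial_counts = pool.map(count_chunk, chunks)

    # Merge the partial counts (the "reduce" side); Counter supports +.
    totals = reduce(lambda a, b: a + b, partial_counts, Counter())
    print(totals.most_common(3))
```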
What can you do with these counts? You can estimate probabilities. With these probabilities you can create automated systems that can predict
- the next word (useful for natural language processing; see the sketch after this list),
- entities (people, places, diseases, etc.),
- relationships (before/during/after, has-a/is-a, over/on/in/beside/below, etc.) among entities, and even
- human behavior (when will you end your smartphone contract, will you stick to your meds, even will you reach for that next piece of pie).
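As a toy version of the first item above, here is next-word prediction built from nothing but bigram counts; the three-sentence “corpus” is invented, but a production system applies the same count-then-divide logic to terabytes of text:

```python
# Next-word prediction from bigram counts: count word pairs, turn counts into
# conditional probabilities, and rank the likely followers of a given word.
from collections import Counter, defaultdict

corpus = [
    "patient denies chest pain",
    "patient reports chest pain",
    "patient denies shortness of breath",
]  # invented toy corpus

bigram_counts = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for prev, nxt in zip(words, words[1:]):
        bigram_counts[prev][nxt] += 1


def predict_next(word):
    followers = bigram_counts[word]
    total = sum(followers.values())
    # P(next | word) = count(word, next) / count(word, anything)
    return [(nxt, count / total) for nxt, count in followers.most_common()]


print(predict_next("patient"))  # [('denies', 0.666...), ('reports', 0.333...)]
print(predict_next("chest"))    # [('pain', 1.0)]
```

Every prediction here is just one count divided by another count, which is why being able to count over enormous datasets matters so much.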
I’ll cover more of the technologies listed above, perhaps in one of my occasional 5,000-word blog post opuses. Oozie is especially cool. It’s a workflow management system for Hadoop!
In my humble opinion, the best way for a health IT professional to cut through the healthcare Big Data hype is to learn about the nuts and bolts of Big Data software. If you’ve ever taken an introduction-to-programming course (or are willing to take one on Coursera) and know a bit of SQL, Bob’s your uncle!
Then, when someone makes an off-the-wall claim or wild-eyed guesstimate about healthcare Big Data, you can at least try to imagine how it might be accomplished (a mental architectural sniff test). A little scientific and technical knowledge is exactly what more people need to take advantage of the benefits Big Data offers healthcare.
Right now, if you’re in healthcare, and you’re interested in this stuff, check out the next Big Data Tech Con, in San Francisco. No, it’s not about healthcare Big Data. It’s about the nuts and bolts of the tools and techniques of Big Data (period), and that’s exactly what you need. It’ll be up to you to apply the tools to healthcare.
Dates announced for TechCon in SF Bay Area: Oct 15-17, 2013. See you then!
— Alan Zeichick ()
That said, healthcare did come up a couple times…
Keynoter M Stonebraker on Genomics & Healthcare Informatics (eg. cohort groups for effective studies)
— Charles Webster, MD()
As an attendee I didn’t have access to the attendee list (apparently only exhibitors do, fair enough). But I did ask, and was told that a bunch of healthcare folks were there. In conversation I heard of one well-known EHR vendor investigating Hadoop for storing data from its customers, and of a health insurance exchange doing something similar.
The next tweet (not tweeted from #BigDataTechCon) links to a delightfully detailed (i.e. “non-hype”) description of one medical center’s use of Hadoop.
Sweet awesomeness! progress in healthcare > UC Irvine Health: Improving Quality of Care w/
— Charles Webster, MD()
(I added the following two tweets on 4/15/2013.)
@ great post! Btw, my coworker @ is the rock star behind UCI Hadoop/big data initiatives
— Mark Silverberg ()
@ Nice coverage of and post on Cutting Through The (Healthcare) Big Data Hype. A must read
— Charles Boicey ()
The software under Big Data’s hood does indeed have the potential to save billions of healthcare dollars. But it has to be done right.
The biggest obstacle to doing it right? Workflow. And health IT is not (yet) doing workflow right. (You knew I was eventually going to mention workflow, right? I always do.)
More later. Much more. But not too much later.
P.S. Just in case you don’t believe me and think I’m making this all up, here’s my certificate of completion! 🙂
P.P.S. Here are my tweets from #BigDataTechCon, in reverse order (so you don’t have to read from the bottom up).
I am! Any healthcare attendees? RT @ starts tomorrow! We hope you’re as excited as we are!
— Charles Webster, MD()
Starting > Crash Course in w/@
— Charles Webster, MD()
Machine Learning: All-knowing algorithms vs generic algorithms + data @
— Charles Webster, MD()
Machine Learning Crash Course@
— Charles Webster, MD()
3 kinds: recommenders, clustering, classifiers @
— Charles Webster, MD()
Clustering: K-Means, centroids, Voronoi Diagram @
— Charles Webster, MD()
Wiki:k-means partitions n ob’s to k clusters, each observation belongs to cluster w/nearest mean
— Charles Webster, MD()
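An aside on the k-means tweet above: the whole algorithm really is just the two steps it compresses, assign each point to its nearest centroid, then move each centroid to the mean of its assigned points. Here is a minimal sketch in plain Python, with a handful of invented 2-D points (naive random initialization, fixed iteration count):

```python
# k-means in miniature: repeat (1) assign each point to the nearest centroid,
# (2) move each centroid to the mean of the points assigned to it.
import random


def kmeans(points, k, iterations=20):
    centroids = random.sample(points, k)  # naive initialization: k random points
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for x, y in points:  # assignment step
            nearest = min(range(k),
                          key=lambda i: (x - centroids[i][0]) ** 2 + (y - centroids[i][1]) ** 2)
            clusters[nearest].append((x, y))
        for i, cluster in enumerate(clusters):  # update step
            if cluster:  # keep the old centroid if a cluster is empty
                centroids[i] = (sum(p[0] for p in cluster) / len(cluster),
                                sum(p[1] for p in cluster) / len(cluster))
    return centroids, clusters


# Two obvious blobs of invented 2-D points; expect centroids near (1.3, 1.3) and (8.3, 8.3).
data = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centroids, clusters = kmeans(data, k=2)
print(centroids)
```

Mahout’s k-means clustering runs the same assign/update loop, just distributed over Hadoop-scale data.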
Slide: supervised learning @
— Charles Webster, MD()
languages: Python, R, Julia, Matlab, Octave, Java, Scala, Clojure @
— Charles Webster, MD()
Slide: unsupervised learning @
— Charles Webster, MD()
Google has so many hard drives that they loose hundreds everyday
— Charles Webster, MD()
WordCount in MapReduce “The Hello World of big data”
— Charles Webster, MD()
.@ covers WordCount (Hello World of Big Data) w/MapReduce “Hadoop Data Warehousing w/HIVE”
— Charles Webster, MD()
Talking about Hadoop and EMR (Amazon Elastic MapReduce, not Electronic Medical Record)
— Charles Webster, MD()
Starting: Deep Dive into Apache @ tsn2.bzmedia.com/tradeshows/cla…
— Charles Webster, MD()
Slide: Cassandra thruput benchmarks vs other dbs @
— Charles Webster, MD()
Your relations are in your data, not your database. Denormalization pushes them into your app logic. @
— Charles Webster, MD()
Data warehouses: Traditional ‘row-store’ vendors from 90s 50 times slower than ‘column-store’, some converting – Stonebraker
— Charles Webster, MD()
keynote M Stonebraker is at least 4 Problems: SQL/complex analytics, velocity, diversity
— Charles Webster, MD()
Keynoter M Stonebraker on Genomics & Healthcare Informatics (eg. cohort groups for effective studies)
— Charles Webster, MD()
Stonebraker: AQL is declarative array query language, modeled on SQL, 4 array data @ @
— Charles Webster, MD()
BASE Basic Avail., Soft-state, Eventual Consistency / ACID Atomicity Consistency Isolation Durability
— Charles Webster, MD()
Predictive Modeling 101: Simple Models & Basic Evaluation @ tsn2.bzmedia.com/tradeshows/cla…
— Charles Webster, MD()
Slide: More data, more users, more interactivity are macro trends driving tech @
— Charles Webster, MD()
Non-NoSQL issues driving adoption of tech: inflexible schemas, inability to scale @
— Charles Webster, MD()
Slide: database types > key-value, data structure, document, column, graph @
— Charles Webster, MD()
Slide: market adoption includes healthcare enterprises among many @
— Charles Webster, MD()
apps: sub 1ms latency, hi-thruput 200,000 ops/s, many users, demand spikes, updates! @
— Charles Webster, MD()
Addition in the Large: Simple Counts & Not-So-Simple Counts keynote by @‘s Oscar Boykin @
— Charles Webster, MD()
Slide: “Twitter Scale”
— Charles Webster, MD()
Slide: How many followers of followers of account are there? @ Not in Twitter yet (I want it!)
— Charles Webster, MD()
According to @ 🙂 It may or not be the case that @ has more second level followers than @
— Charles Webster, MD()
Got me a signed copy of HBase In Action by @ & @ at Thank you both for writing it!
— Charles Webster, MD()
Every Thing You Ever Wanted to Know About [hadoop workflow system] But Were Afraid to Ask
— Charles Webster, MD()
Oozie Workflow Language
— Charles Webster, MD()
Good point! RT @ Oscar Boykin @ Why approximation algorithms work so well? Because real data is noisy anyway
— Charles Webster, MD()
Oozie Workflow: Good (hadoop ecosystem integrat, UI track progress, APIs), Bad (XML), Ugly (no loops)
— Charles Webster, MD()
Overall Oozie Execution > ‘objects’ right side, top 2 bottom: Bundle, Coordinator, Workflow, Action
— Charles Webster, MD()
.@ morning of last day of introducing keynote (& reminding attendees 2 submit speaker evals!)
— Charles Webster, MD()
Data Science is “advanced applied counting” according 2 friend of @ from @
— Charles Webster, MD()
Data Scientist = “Turbocharger” feed data exhaust back in2 system 2 improve performance @
— Charles Webster, MD()
Why Bloom Filters Work the Way They Do
— Charles Webster, MD()
HyperLogLog — Cornerstone of a Big Data Infrastructure
— Charles Webster, MD()
Starting: HBase Schema Design @
— Charles Webster, MD()
Count-Min Sketch & Its Applications
— Charles Webster, MD()
Tested 80 million columns in a single row @ Holy cow!
— Charles Webster, MD()
Starting: Data Flow Programing with tsn2.bzmedia.com/tradeshows/cla… @
— Charles Webster, MD()
How-To: Schedule Recurring Hadoop Jobs w/Apache Oozie @ may be of interest to attendees
— Charles Webster, MD()
Oozie Workflow Editor for Hadoop workflow system from @ may be of interest to attendees
— Charles Webster, MD()