Propel healthcare data science by solving the boring problems

Data cleaning is boring but critical.

If you have been paying attention to data science in healthcare, you will have noticed the gradual shift from 2016’s Big Data to 2017’s Machine Learning.  Deep learning techniques, specifically, attract much of the attention.  The FDA recently approved the use of deep learning techniques in cardiac diagnoses.  Enlitic promises to automate the process of radiologic diagnosis for medical imaging.  And with the advent of wearables, there is an ever-increasing volume of health data that requires “smart” algorithms to separate the signal from the noise.

All of them hinge on increasingly sophisticated algorithms.  All of them promise to improve how we live our lives, diagnose diseases, and approach treatments.  All of them claim that with smart automation, we can finally reduce the astronomical money hole that is American healthcare.

It is fitting that major healthcare organizations like the American Medical Association are slowly moving to value-based advocacy rather than advocacy based solely on reimbursement.  Healthcare value, in this case, is defined as the quality of care delivered – whatever that means – relative to the underlying cost of that service.

Data Silos in Healthcare

To deliver better value, we need to improve our data collection.  Data science is no longer optional.  From precision medicine to molecular imaging to consumer health devices, data is the currency of modern healthcare.

However, data is not the same as insight.  Medicine has always been data-dense; the problem is that health data is heavily siloed.  Handwritten physician notes, faxed imaging and pathology reports, and printed prescriptions are still commonplace today.  All of this is data: each item carries a piece of information critical to the correct functioning of the healthcare machine.  Paper is also an extreme form of data silo: each sheet can be independently damaged, lost, or misfiled.


The Boring Parts of Data Science in Healthcare

Digitizing data entry is only the beginning.  When a patient enters the emergency department with GI bleeding, she will touch at least five separate clinical systems that have few means of communicating with one another.  In some hospital systems, the emergency department, inpatient, and outpatient records are managed separately.  Medical images, including radiology examinations, pathology slides, and optical imaging procedures like upper endoscopy, also depend on separate information systems.

Standards from Health Level Seven (HL7), including FHIR, are attempts at achieving healthcare interoperability and have had some success.  However, researchers who wish to correlate data across system boundaries still find themselves navigating the impossibly complex, non-standard data streams coursing through the veins of a hospital system.
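
To make the interoperability problem concrete, here is a minimal sketch of what correlating across two systems can involve: one emits an HL7 v2 style pipe-delimited PID segment, the other a FHIR-style JSON Patient resource.  The sample messages, the field choices, and the function names (from_hl7_pid, from_fhir_patient) are all hypothetical illustrations, not taken from any particular hospital system or library, and real messages are considerably messier.

```python
import json

# Hypothetical HL7 v2 PID segment. The positions used below (PID-3 identifier,
# PID-5 name, PID-7 birth date) follow the usual PID layout; the values are made up.
HL7_PID = "PID|1||12345^^^HOSP^MR||DOE^JANE||19800214|F"

# Hypothetical, heavily trimmed FHIR-style Patient resource for the same person.
FHIR_PATIENT = json.dumps({
    "resourceType": "Patient",
    "identifier": [{"system": "urn:example:hosp", "value": "12345"}],
    "name": [{"family": "Doe", "given": ["Jane"]}],
    "birthDate": "1980-02-14",
    "gender": "female",
})


def from_hl7_pid(segment):
    """Map a PID segment onto a small common record (illustrative only)."""
    fields = segment.split("|")               # fields[n] corresponds to PID-n
    family, given = fields[5].split("^")[:2]  # name components are ^-separated
    dob = fields[7]                           # YYYYMMDD
    return {
        "mrn": fields[3].split("^")[0],
        "family": family.upper(),
        "given": given.upper(),
        "birth_date": f"{dob[0:4]}-{dob[4:6]}-{dob[6:8]}",
    }


def from_fhir_patient(text):
    """Map a FHIR-style Patient JSON document onto the same common record."""
    resource = json.loads(text)
    name = resource["name"][0]
    return {
        "mrn": resource["identifier"][0]["value"],
        "family": name["family"].upper(),
        "given": name["given"][0].upper(),
        "birth_date": resource["birthDate"],
    }


if __name__ == "__main__":
    print(from_hl7_pid(HL7_PID))
    print(from_fhir_patient(FHIR_PATIENT))
```

Even in this toy case, two representations of one patient need two separate parsers and an agreed-upon target shape before they can be compared.  That is the boring work in miniature.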

Implication for the Healthcare Data Enthusiast

A mentor once told me that you shouldn’t pick a career by its most exciting aspects.  Instead, when selecting a career, it is of utmost importance that you can derive at least some fulfillment from doing its most mundane, boring tasks.

Medical informatics is not unlike traditional medical research in this respect: 90% of the time is spent collecting and cleaning the data, getting it into a format that can be analyzed appropriately.  This process is sometimes referred to as data munging or data wrangling.
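
As a rough illustration of what data wrangling looks like in practice, the sketch below normalizes a few hypothetical lab rows that arrive with inconsistent identifiers, date formats, and units.  Everything here is an assumption made for the example: the column names, the list of date formats, and especially the rule that leading zeros in a medical record number are insignificant, which would need to be confirmed against the real source systems.

```python
from datetime import datetime

# Hypothetical raw rows, as they might arrive from three different systems.
RAW_ROWS = [
    {"mrn": " 00123 ", "exam_date": "02/14/2017", "hgb": "7.9 g/dL"},
    {"mrn": "123", "exam_date": "2017-02-14", "hgb": "7.9"},
    {"mrn": "MRN-456", "exam_date": "14.02.2017", "hgb": ""},
]

# Assumed set of date formats; a real feed may contain others.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%d.%m.%Y")


def clean_mrn(raw):
    """Keep digits only and drop leading zeros (an assumption, see lead-in)."""
    digits = "".join(ch for ch in raw if ch.isdigit())
    return digits.lstrip("0") or None


def clean_date(raw):
    """Return an ISO 8601 date string, or None if no known format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def clean_hgb(raw):
    """Parse a hemoglobin value, ignoring a trailing unit; None if missing."""
    token = raw.strip().split(" ")[0]
    try:
        return float(token)
    except ValueError:
        return None


def clean_rows(rows):
    """Normalize every row and drop exact duplicates, preserving order."""
    seen, cleaned = set(), []
    for row in rows:
        record = (
            clean_mrn(row["mrn"]),
            clean_date(row["exam_date"]),
            clean_hgb(row["hgb"]),
        )
        if record not in seen:
            seen.add(record)
            cleaned.append(dict(zip(("mrn", "exam_date", "hgb_g_dl"), record)))
    return cleaned


if __name__ == "__main__":
    for row in clean_rows(RAW_ROWS):
        print(row)
```

The first two raw rows describe the same result and collapse into one record only after normalization; the third keeps a missing value as None rather than guessing.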

With clean data in hand, extracting insight from it becomes far easier.

Standardization is key to breaking down siloed data.  DICOM is an imperfect standard, but the advent of the DICOM file format enabled interoperability: smaller vendors such as Syngo and MIM could create products for specific image processing needs alongside standard PACS systems by using the same standardized datasets.  Today, major vendors also either support 3D image processing natively or provide support for third-party vendors, because the market demands interoperability.


Deep learning is exciting.  It brings a spotlight to medical data science with a burst of excitement that the field has never seen before.  However, what makes breakthroughs in healthcare exciting is not that we are on the cusp of creating algorithms sophisticated enough to serve as an “AI radiologist” or “AI oncologist.”  Those things are possible, but they are only 10% of the story.  To realize the deep promises of healthcare data science, we must first solve the boring, age-old problems of data access, standardization, and integration.


Howard Chen
Vice Chair for Artificial Intelligence at Cleveland Clinic Diagnostics Institute
Howard is passionate about making diagnostic tests more accurate, expedient, and affordable through disciplined implementation of advanced technology. He previously served as Chief Informatics Officer for Imaging, where he led teams deploying and unifying radiology applications and AI in a multi-state, multi-hospital environment. Blog opinions are his own and in no way reflect those of the employer.
