Propel healthcare data science by solving the boring problems

Data cleaning is boring but critical.

If you have been paying attention to data science in healthcare you will have noticed the gradual shift from 2016’s Big Data to 2017’s Machine Learning.  Specifically, deep learning techniques attract much of the attention. The FDA recently approved the use of deep learning techniques in cardiac diagnoses.  Enlitic promises to automate the process of radiologic diagnosis for medical imaging.  And with the advent of wearables, there is an ever-increasing volume of health data that requires “smart” algorithms to parse out the signal from the noise.

All of them hinge on increasingly sophisticated algorithms.  All of them promise to improve how we live our lives, diagnose diseases, and approach treatments.  All of them claim that with smart automation, we can finally reduce the astronomical money hole that is American healthcare.

It is fitting that major healthcare organizations like the American Medical Association are slowly moving toward value-based advocacy rather than advocacy based solely on reimbursement.  Healthcare value, in this case, is defined as the quality of care delivered – whatever that means – as a function of the underlying cost of that service.

Data Silos in Healthcare

To deliver better value, we need to improve our data collection.  Data science is no longer optional.  From precision medicine to molecular imaging to consumer health devices, data is the currency of modern healthcare.

However, there are significant differences between data and insight.  Medicine is data-dense from the very beginning.  The problem arises from the heavily siloed nature of health data.  Handwritten physician notes, faxed imaging and pathology reports, and printed prescriptions are still commonplace today.  All of this is data: each item carries pieces of information critical to the correct functioning of the healthcare machine.  They are also an extreme form of data silo: each sheet of paper can be independently damaged, lost, or misfiled.


The Boring Parts of Data Science in Healthcare

Digitizing data entry is only the beginning.  When a patient enters the emergency department with GI bleeding, she will touch no fewer than 5 separate clinical systems with few means of communication.  In some hospital systems, the emergency department, inpatient, and outpatient records are separately managed.   Medical images including radiology examinations, pathology slides, and optical imaging procedures like upper endoscopy also depend on separate information systems.

The Health Level 7 (HL7) standards, including FHIR, are attempts at achieving healthcare interoperability and have had some success.  However, researchers who wish to correlate across data boundaries continue to find themselves navigating the impossibly complex, non-standard data streams coursing through the veins of a hospital system.
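To make the interoperability problem concrete, here is a minimal sketch of the same patient arriving in two shapes: an HL7 v2 PID segment and a FHIR Patient resource. The sample patient, the function names, and the common record layout (mrn, family, given, birth_date, gender) are assumptions made up for illustration, not part of either standard, and real messages need far more defensive parsing than this.

```python
import json
from datetime import date, datetime


def from_hl7v2_pid(segment: str) -> dict:
    """Map a pipe-delimited HL7 v2 PID segment to a common record (illustrative)."""
    fields = segment.split("|")
    # PID-5 holds the name as family^given; pad in case the given name is missing.
    family, given = (fields[5].split("^") + [""])[:2]
    sex_map = {"F": "female", "M": "male"}
    return {
        "mrn": fields[3].split("^")[0],  # PID-3, first component is the ID
        "family": family.title(),
        "given": given.title(),
        "birth_date": datetime.strptime(fields[7], "%Y%m%d").date(),  # PID-7
        "gender": sex_map.get(fields[8], "unknown"),  # PID-8
    }


def from_fhir_patient(text: str) -> dict:
    """Map a FHIR Patient resource (JSON) to the same common record (illustrative)."""
    resource = json.loads(text)
    name = resource["name"][0]
    return {
        "mrn": resource["identifier"][0]["value"],
        "family": name["family"],
        "given": name["given"][0],
        "birth_date": date.fromisoformat(resource["birthDate"]),
        "gender": resource.get("gender", "unknown"),
    }


# Hypothetical sample data: one fictional patient as seen by two systems.
hl7_segment = "PID|1||12345^^^HOSP^MR||DOE^JANE||19800101|F"
fhir_json = json.dumps({
    "resourceType": "Patient",
    "identifier": [{"system": "urn:example:hosp", "value": "12345"}],
    "name": [{"family": "Doe", "given": ["Jane"]}],
    "gender": "female",
    "birthDate": "1980-01-01",
})

record_a = from_hl7v2_pid(hl7_segment)
record_b = from_fhir_patient(fhir_json)
```

Once both sources land in the same shape, the two records can be compared or joined directly. The unglamorous work is in writing and maintaining one such mapping per system.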

Implication for the Healthcare Data Enthusiast

A mentor once told me that you shouldn’t pick a career by its most exciting aspects.  Instead, when selecting a career, it is of utmost importance that you can derive at least some fulfillment by doing the most mundane, boring tasks.

Medical informatics is not unlike traditional medical research in this respect: 90% of the time is spent collecting and cleaning the data, getting it into a format that can be analyzed appropriately.  This process is sometimes referred to as data munging or data wrangling.
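As a small taste of what that wrangling looks like, here is a sketch that normalizes hemoglobin values entered as free text. The raw strings, the function name, and the choice of g/dL as the target unit are assumptions for illustration; a real pipeline would have to handle many more formats and would log what it rejects.

```python
import re
from typing import Optional

# Accepts a number (decimal point or decimal comma) followed by g/dL or g/L.
_PATTERN = re.compile(r"^\s*(\d+(?:[.,]\d+)?)\s*(g/dl|g/l)\s*$", re.IGNORECASE)


def hemoglobin_g_per_dl(raw: str) -> Optional[float]:
    """Return hemoglobin in g/dL, or None when the entry cannot be parsed."""
    match = _PATTERN.match(raw)
    if match is None:
        return None
    value = float(match.group(1).replace(",", "."))
    if match.group(2).lower() == "g/l":
        value /= 10.0  # 10 g/L equals 1 g/dL
    return round(value, 1)


# Hypothetical raw entries of the kind found in exported lab tables.
raw_values = ["13.5 g/dL", "135 g/L", " 12,9 g/dl", "N/A", ""]
cleaned = [hemoglobin_g_per_dl(v) for v in raw_values]
print(cleaned)  # [13.5, 13.5, 12.9, None, None]
```

Returning None for unparseable entries, rather than guessing, keeps missing data visible to whoever does the analysis later.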

With clean data in hand, extracting insight from it becomes far easier.

Standardization is key to breaking down siloed data.  DICOM is an imperfect standard, but the advent of the DICOM file format enabled interoperability, allowing smaller vendors such as Syngo and MIM to create products for specific image-processing needs alongside standard PACS systems by using the same standardized datasets.  Today, major vendors also either support 3D image processing natively or provide support for third-party vendors, because the market demands interoperability.


Deep learning is exciting.  It brings a spotlight to medical data science with a distinct burst of excitement that the field has never seen. However, breakthroughs in healthcare are not exciting because we are on the cusp of creating algorithms sophisticated enough to create an “AI radiologist” or “AI oncologist.” Those things are possible, but they are only 10% of the story.  To realize the deep promises of healthcare data science, we must first solve the boring, age-old problems of data access, standardization, and integration.


One response to “Propel healthcare data science by solving the boring problems”

  1. Doing data science in a healthcare company can save lives. Predicting which patients have a tumor on an MRI, are at risk of re-admission, or have misclassified diagnoses in electronic medical records are all examples of how predictive models can lead to better health outcomes and improve patients’ quality of life. Nevertheless, the healthcare industry presents many unique challenges and opportunities for data scientists.
    Currently, a major issue facing medical providers is that patients’ data tends to exist in silos. There is little integration across electronic medical record systems (both between and within medical providers), which can lead to fragmented care. Clinicians may receive out-of-date or incomplete information about a patient, or treatments may be duplicated.
    Through a major data engineering effort, these systems could (and should) be integrated. This would vastly increase the potential of data scientists and data engineers, who could then provide analytics services that take into account a patient’s whole history and bring a level of consistency across care providers. Data workers could use such an integrated record to alert clinicians to duplication of procedures or dangerous prescription drug combinations.

    Data scientists have a lot to offer in the healthcare industry. The advances of machine learning and data science can and should be adopted in a space where the health of individuals can be improved. The opportunities for data scientists in this sector are nearly endless, and the potential for good is enormous.
