Big Data has become a radiology buzzword (machine learning, AI, and disruptive innovation are right up there with it).
However, there is a real problem with using the term Big Data – it isn’t just one set of data problems. Big Data is a conglomerate of different data challenges: the volume, heterogeneity, and velocity of data are all important dimensions. Machine learning and the internet of things are further layers superimposed on the big data problem.
Sometimes it is helpful to step back and approach data problems with a common framework, a way to think about how and which facets of data science fit in a real-life workflow in the face of an actual problem.
Below is a 6-element framework that helps me think about data-driven informatics problems. The elements are roughly in chronological order, but they are not “steps,” because you will frequently find yourself going back and redefining many things. The framework helps you maintain a big-picture outlook – which is also the reason any sufficiently complex data problem requires a team approach.
The six elements are:
- Refine the question
- Identify the data
- Plan your approach
- Pick a platform
- Explore the data
- Define a solution
Refine the Question
Although the six elements are not linear progressions, refining a proper data question does deserve its place as the first step (and one you should come back to often).
In scientific research, coming up with a testable, hypothesis-driven question is often sufficient. In informatics, a focused question may also include a tangible result: for example, you may wish to assemble a visualization, create a software tool, or write a research paper. These different end goals will affect how you approach the question.
Trimming the “vagueness” fat is tough, but you can check your progress by focusing on the cleverly named S.M.A.R.T. criteria:
- Specific – the question should focus on a tangible, small scope of practice.
- Measurable – the problem’s magnitude (and progress towards improvement) should have a quantifiable dimension.
- Assignable – the answer to the question should be translatable to actionable items.
- Realistic – the results should be achievable.
- Time-specific – the answer to the question should be achievable by a certain time.
For example, “I want to look at the volume of outpatient CT on the weekends” is vague. “I want to create a quarterly report, starting this October, for the executive committee to monitor weekend outpatient CT volume” is a more focused goal.
Finally, remember that no matter how much you love technology, analytics, or machine learning – a poor question can ruin the project. Commit to your data question only insofar as you can trust it to deliver the tangible end result. At each step of the process, take the opportunity to review your question and ensure that it remains relevant.
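To make the focused version concrete, here is a minimal sketch in Python of the report that question implies. The file name (exams.csv) and column names (exam_datetime, modality, patient_class) are hypothetical stand-ins for whatever your own extract actually provides.

```python
# Minimal sketch: quarterly weekend outpatient CT volume.
# File and column names are hypothetical placeholders.
import pandas as pd

exams = pd.read_csv("exams.csv", parse_dates=["exam_datetime"])

# Keep only outpatient CT exams performed on Saturday or Sunday.
weekend_outpatient_ct = exams[
    (exams["modality"] == "CT")
    & (exams["patient_class"] == "Outpatient")
    & (exams["exam_datetime"].dt.dayofweek >= 5)
]

# Count exams per calendar quarter: the measurable, time-specific part of the question.
quarterly_volume = (
    weekend_outpatient_ct
    .groupby(weekend_outpatient_ct["exam_datetime"].dt.to_period("Q"))
    .size()
)
print(quarterly_volume)
```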
Identify the Data
Once you have a specific question, it’s time to decide what raw data is necessary. Identifying the data means not only knowing what information you need but also collating it in the right format.
In radiology informatics, the primary sources of your data generally fall into one of three categories:
- Radiology-specific database (PACS, order metadata, diagnostic report text)
- Non-radiology specific database (electronic medical record, pathology database)
- Non-electronic data (paperwork)
Dataset acquisition often takes the form of a request to your institution’s data warehouse. The lucky few who have direct access to raw data have the opportunity to make their own inquiries.
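If you are one of those lucky few, the inquiry itself is often just a short query pushed through a standard database connection. Below is a minimal sketch assuming a SQLAlchemy-style connection and a hypothetical warehouse table named rad_orders; the connection string, table, and column names will all differ at your institution.

```python
# A hypothetical direct inquiry against an institutional data warehouse.
# The connection string, table name (rad_orders), and column names are all
# placeholders -- substitute whatever your own warehouse exposes.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://user:password@warehouse-host/radiology")

query = """
    SELECT accession_number, modality, exam_datetime, patient_class, ordering_dept
    FROM rad_orders
    WHERE exam_datetime >= '2017-01-01'
"""
orders = pd.read_sql(query, engine)
print(orders.head())
```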
Sometimes it’s worth thinking about creating a separate analytical database, one specialized for your query. The reason is simple: if you’ve asked a sufficiently introspective question, the answer will not be immediately obvious. Likewise, the path to the best answer is often not immediately obvious, so why pigeonhole yourself into using the data in only one way?
For instance, if you want to create a machine learning algorithm to predict emergency department CT volume, a seemingly reasonable first step is to make educated guesses about the possible correlated causes and submit a query for only those metrics. A focused query is valuable if you have a solid grasp of the causal relationships behind the underlying question.
More frequently, there are no easily discernible causal relationships. In those settings, querying on a short list of data features comes with the risk of tunnel vision: you and your analysis become blind to other possible contributors. The better approach is to cast a wider net and capture as many elements of the data as is reasonable. Even if you have direct access to the radiology information system (RIS), you may find it optimized very differently from your needs.
I favor piping raw data obtained from the RIS or the data warehouse into a dedicated analytical database. Data exploration is a key part of what makes radiology informatics valuable (and fun!), and reindexing the data for proper performance can be extremely helpful.
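Here is a minimal sketch of that piping step, using SQLite as the dedicated analytical database and copying the whole wide extract rather than a hand-picked subset. The file, table, column, and index names are hypothetical placeholders.

```python
# Minimal sketch: pipe a raw warehouse/RIS extract into a dedicated
# analytical database (SQLite here for simplicity). All file, table, and
# column names are hypothetical.
import sqlite3
import pandas as pd

raw = pd.read_csv("warehouse_extract.csv", parse_dates=["exam_datetime"])

conn = sqlite3.connect("radiology_analytics.db")
raw.to_sql("exams", conn, if_exists="replace", index=False)

# Index the columns you expect to filter and group on most often; this is
# where reindexing for performance pays off during exploration.
conn.execute("CREATE INDEX IF NOT EXISTS idx_exams_datetime ON exams (exam_datetime)")
conn.execute("CREATE INDEX IF NOT EXISTS idx_exams_modality ON exams (modality)")
conn.commit()

# Exploratory queries now run against your own copy, not the production RIS.
weekend_ct = pd.read_sql(
    "SELECT COUNT(*) AS n FROM exams "
    "WHERE modality = 'CT' AND strftime('%w', exam_datetime) IN ('0', '6')",
    conn,
)
print(weekend_ct)
```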
To be Continued…