Oct 22, 2019

AI's Next Breakthrough: Analyzing Unstructured Data in Healthcare

Your doctor's handwritten notes are going to be more than good advice that's difficult to read.

Lauren Maffeo, Associate Principal Analyst

Last year, research proclaimed that AI could find the best treatment for patients with sepsis. Researchers at Imperial College London said that 98 percent of the time, their reinforcement learning agent, the Artificial Intelligence Clinician, outperformed doctors at predicting treatments for the leading cause of death in hospitals. When doctors got involved, patient mortality increased.

To train their AI Clinician, researchers gave the tool 48 sets of structured data containing records for about 100,000 intensive care unit patients. Despite this tool's success, training it on structured data alone doesn't paint a full picture of each patient's history. To do that, you'd need to train AI using unstructured data, which makes up as much as 80 percent of the details in electronic health records (EHRs).

What is unstructured data in healthcare?

Unstructured data ranges from PDFs and audio files to emails and handwritten notes. If you've ever sent your doctor a message through their patient portal, or your doctor took notes on your adverse reactions to a new medication, those count as unstructured data.

Although numbers like one's heart rate and blood pressure add context for patients' health, they don't tell the full story. Unstructured data helps doctors capture patient nuances. And converting that data to a structured format for analysis can cause doctors to lose that nuance.

In the case of tasks like patient diagnoses, structured data might be enough to train accurate AI based on past patterns. But when it comes to ongoing treatment for chronic health problems like opioid addiction, AI won't reach its next benchmark without assessing unstructured data.

Dr. Summer Rankin is familiar with this problem. As a data scientist in Booz Allen Hamilton's Strategic Innovation group, Dr. Rankin is the technical lead for Project Shakespeare with Dr. Roselie Bright, an epidemiologist at the Food and Drug Administration (FDA). On this project, she uses machine learning (ML) and natural language processing (NLP) to find indicators for adverse events based on free text (such as notes) from EHRs.

GetApp interviewed Dr. Rankin* to learn more about structured vs. unstructured data.

*We edited this interview for length and clarity.

Unstructured vs. structured data

GetApp: What is unstructured vs. structured data?

Dr. Summer Rankin: Structured data [is] best explained by an example: columns that each contain one type of data. [This] can be text or a number (blood pressure, temperature, provider name, age, ethnicity, etc.).

Unstructured data is a bunch of free text that may contain this structured data, but not in any specific format or template. I think of it as narrative text, but with EHRs it's often not even all that narrative.
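To make the distinction concrete, here is a minimal illustrative sketch in Python (not part of Dr. Rankin's remarks). The field names, the note text, and the `column_types` helper are all hypothetical and do not come from any real record or EHR system.

```python
# Hypothetical example data; none of these values come from a real patient record.

# Structured: each field (column) holds exactly one type of data.
structured_record = {
    "age": 54,
    "temperature_f": 98.6,
    "systolic_bp": 120,
    "diastolic_bp": 80,
    "provider_name": "Dr. Example",
}

# Unstructured: free text that may contain the same facts, in no fixed format.
unstructured_note = (
    "Pt is 54 yo, seen by Dr. Example. BP 120/80, temp 98.6. "
    "Reports feeling dizzy after starting new medication."
)


def column_types(record):
    """Return the Python type name stored in each structured field."""
    return {field: type(value).__name__ for field, value in record.items()}


print(column_types(structured_record))
print(type(unstructured_note).__name__)  # the whole note is one string
```

The structured record can be queried field by field, while the note is a single string that has to be read or parsed before any of its facts can be used.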

Advantages and disadvantages of using unstructured data

GA: Describe the opportunities and limitations of using unstructured vs. structured data in AI for healthcare delivery.

SR: Unstructured data can be time-consuming and/or expensive to process into usable information. The most common way that it is 'processed' is for a provider to read it and interpret it themselves.

The longer a patient's record, the more difficult this becomes, especially when there is a large amount of "note bloat" (copied and pasted text). The provider is fatigued by the end of a long note, and it's easy to miss important items among lots of repeated text.

The other way to process the unstructured data is with NLP, but this takes a well-trained model and the personnel (or ideally software) to do this processing.

Converting data from unstructured formats to structured formats

GA: Does converting data from an unstructured format to a structured format cause physicians to lose crucial details about patient care? If so, how and why is that the case?

SR: Yes, this can be problematic if you are using an algorithm because no algorithm is perfect. When we want to get from unstructured to structured, what we are essentially doing is something called "entity extraction," which is pulling out relevant lab values, medications, vitals, provider names, diagnoses, etc.
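As a rough illustration of the idea (not part of Dr. Rankin's remarks, and not how any production tool works), entity extraction can be sketched with a few regular expressions in Python. The note text, the pattern names, and the `extract_entities` helper are assumptions made up for this example; real clinical NLP systems rely on trained models rather than hand-written patterns.

```python
import re

# Hypothetical note text, invented for illustration only.
NOTE = "BP 142/91, HR 88. Started lisinopril 10 mg daily."

# Hand-written patterns for a few entity types (an assumption of this sketch).
PATTERNS = {
    "blood_pressure": re.compile(r"\bBP\s*(\d{2,3})/(\d{2,3})\b"),
    "heart_rate": re.compile(r"\bHR\s*(\d{2,3})\b"),
    "medication_dose": re.compile(r"\b([A-Za-z]+)\s+(\d+)\s*mg\b"),
}


def extract_entities(note):
    """Return the first match for each pattern as a tuple of captured groups."""
    entities = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(note)
        if match:
            entities[name] = match.groups()
    return entities


print(extract_entities(NOTE))
```

Anything in the note that no pattern covers is simply dropped from the structured result.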

There are some excellent options like Amazon Comprehend Medical, but no algorithm will be perfect and some items will get missed. For example, a provider might misspell a word, or make a typo and forget to put a space between two statistics.

Here's what that would look like if a patient's blood pressure and heart rate merged together by mistake. Without proper spacing between these variables, an algorithm can't correctly assess them:


These values might not get split up properly or counted at all if the algorithm is splitting on whitespace.
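The failure described above can be reproduced with a short Python sketch (an editorial illustration, not part of Dr. Rankin's remarks). The note fragments and both helper functions are hypothetical, and the extractor is deliberately naive: it assumes each label is followed by its value as a separate whitespace-delimited token.

```python
import re

# Hypothetical note fragments; the values are invented for illustration.
clean_note = "BP 142/91 HR 88"
merged_note = "BP 142/91HR 88"  # typo: no space between the two readings


def split_on_whitespace(note):
    """Tokenize the note by splitting on runs of whitespace."""
    return note.split()


def extract_vitals(tokens):
    """Naive extraction: expects a label token followed by a value token."""
    vitals = {}
    for label, value in zip(tokens, tokens[1:]):
        if label == "BP" and re.fullmatch(r"\d{2,3}/\d{2,3}", value):
            vitals["blood_pressure"] = value
        elif label == "HR" and re.fullmatch(r"\d{2,3}", value):
            vitals["heart_rate"] = value
    return vitals


print(extract_vitals(split_on_whitespace(clean_note)))
# {'blood_pressure': '142/91', 'heart_rate': '88'}

print(split_on_whitespace(merged_note))
# ['BP', '142/91HR', '88']

print(extract_vitals(split_on_whitespace(merged_note)))
# {}
```

With the space missing, the blood pressure and the heart rate label fuse into one token, so this extractor recognizes neither value. More robust tools can tolerate some of these errors, but not all of them.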

You would also lose the narrative about this patient that might explain these variables, such as notes about why their blood pressure is so high. This information is crucial to help doctors make decisions about next steps to treat the patient. Even if you do capture that text, it's going to end up as a field of unstructured data among the structured data.

So, it's not a great idea to try to get rid of all the free text (i.e., unstructured data) in a patient's file. In those cases, "Pedestrian vs. car brought into ICU with lung injury" may end up as "lung injury."

Or, "Wife and daughter were present and reported no history of cancer" gets cut down to "no history of cancer" or even "no history," which excludes the details about cancer. In either case, the doctor would miss crucial contextual details.

It's always important to look at each patient as a whole. Reading their narrative reports or medical history is a good way to start doing that before you even encounter them. Doctors in fields like radiology may never meet their patients, but they will still read their images.

What your organization should avoid in AI adoption

GA: What's the biggest mistake you see healthcare organizations make when trying to adopt AI?

SR: I have no real knowledge of how organizations are trying to adopt AI. It often seems to me that they are NOT adopting AI and are slow to adopt any new practices when it comes to EHRs. This is not necessarily a bad thing, as healthcare data is incredibly sensitive and we need to be careful about how we store and share it.

As far as general EHRs, I see some good things happening like patients getting access to their full records and orgs making it easier for patients to share their records across providers.

The thing I often worry about is the security of the EHR systems/software/computers in a hospital. If effort and resources are not put behind security, it can make an organization very vulnerable to failures or malicious attacks, like we see with cities' data getting held for ransom.
