Standardizing Medical Data: Data Mining @ CoL


One of the most apt aphorisms for data science is the saying ‘your results are only as good as your data’. With the upsurge in electronic storage of medical data, one would expect analysing that data to be a straightforward task. However, one of the biggest challenges data scientists face, especially with medical data, is gaining insights from databases that are unstructured, disparate and inconsistent. The task becomes harder still when the data is textual.

With data spread across several silos and formats, one must be able to bring it together, consistently and accurately, into a single comprehensive database from which actionable insights can be gathered.

The Challenge

We were approached by one of the largest tertiary health care centres in the country, which sought to streamline the administrative tasks in its operation theatre in order to deliver better quality of care and reduce mismanagement of resources. The hospital also sought to upgrade its existing legacy system and bring it on par with international standards for hospital database management.

The hospital data set had approximately 100,000 surgery records, of which only approximately 35,000 carried unique names. This shows the lack of a uniform format: the same surgery was often recorded under many different names. The hospital therefore wanted to map its surgeries onto an international database with standardised names for surgeries.

CoL Approach

We at CoL understand the responsibility that is inherent in medical data and are unwilling to compromise on the accuracy of our processes. For this reason, we used a two-pronged approach while mining and understanding the data: an automated analysis followed by manual review.

  • In the first part of the solution, an automated analysis generates suggested changes. This process takes place in two stages. In the first stage, the system builds a list of words that tries to account for spelling mistakes, abbreviations and even synonyms present in the data; for example, it suggests such changes for entries describing cataract surgery.
  • In the second stage, advanced natural language processing algorithms are used to understand the context in which words occur, which further strengthens the mapping.
  • To ensure accurate results, a trained healthcare professional then reviews the suggestions generated by the automated process and manually approves them.
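
The suggest-then-approve flow described above can be sketched roughly as follows. This is a minimal illustration and not CoL's actual system: it swaps in plain string similarity from Python's standard `difflib` module for the spelling step, and the abbreviation table, vocabulary and function names are all hypothetical.

```python
import difflib
import re

# Hypothetical abbreviation table; a real one would be curated with clinicians.
ABBREVIATIONS = {"sx": "surgery", "rt": "right", "lt": "left"}

# Hypothetical list of words that have already been accepted as correct.
VOCABULARY = ["cataract", "surgery", "right", "left", "eye"]


def suggest_changes(raw_name, vocabulary=VOCABULARY, cutoff=0.8):
    """Return (original_word, suggested_word) pairs for one surgery name."""
    suggestions = []
    for word in re.findall(r"[a-z]+", raw_name.lower()):
        if word in ABBREVIATIONS:
            # Known abbreviation: suggest its expansion.
            suggestions.append((word, ABBREVIATIONS[word]))
        elif word not in vocabulary:
            # Unknown word: suggest the closest vocabulary word, if any is close enough.
            close = difflib.get_close_matches(word, vocabulary, n=1, cutoff=cutoff)
            if close:
                suggestions.append((word, close[0]))
    return suggestions


def apply_approved(raw_name, approved):
    """Apply only those suggestions that a reviewer has approved."""
    mapping = dict(approved)
    words = re.findall(r"[a-z]+", raw_name.lower())
    return " ".join(mapping.get(word, word) for word in words)


suggestions = suggest_changes("Catarct Sx Rt eye")
# A reviewer would inspect `suggestions` here; in this sketch all are approved.
cleaned = apply_approved("Catarct Sx Rt eye", suggestions)
```

Keeping suggestion and application as separate functions means nothing changes in the data until a person has approved it, which mirrors the manual review step.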

After this, we train our natural language model to identify the mapping from our structured data to UMLS (the Unified Medical Language System), an international standard for medical terminology. [Still to be done]
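
As a rough picture of what such a mapping step could look like, the sketch below matches a cleaned surgery name against a small table of standard terms and leaves anything without a close match for manual review. It uses simple string similarity rather than a trained model, and the table, the `DEMO-` identifiers and the function name are invented for illustration; they are not real UMLS entries or part of any UMLS API.

```python
import difflib

# Placeholder table: the names and IDs below are invented for illustration
# and are NOT real UMLS concepts or identifiers.
STANDARD_TERMS = {
    "cataract extraction": "DEMO-0001",
    "appendectomy": "DEMO-0002",
    "knee replacement": "DEMO-0003",
}


def map_to_standard(clean_name, terms=STANDARD_TERMS, cutoff=0.8):
    """Return (standard_name, placeholder_id), or None if no term is close enough."""
    match = difflib.get_close_matches(clean_name, list(terms), n=1, cutoff=cutoff)
    if not match:
        # No confident match: leave the record for a human reviewer.
        return None
    return match[0], terms[match[0]]


mapped = map_to_standard("cataract extractions")
unmapped = map_to_standard("tonsillectomy")
```

Returning `None` instead of a best guess keeps uncertain records out of the standardised data until someone has looked at them.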


“Data! Data! Data! I can’t make bricks without clay!”  -Sherlock Holmes

The basic building block that one needs for analysis is data. Through our efforts in standardising the data, we were able to help the hospital view and use its own data:

  • Descriptive Analytics and Visualisation: We designed a dashboard for the hospital which helps them understand their data.
  • Operations Streamlining: We designed an application which helps the hospital organise and plan its surgeries, optimising resources and schedules.
  • Predictive Analytics: We designed an application that predicts the bill that a patient will have to pay based on their surgery, co-morbidities and other patient characteristics.


Medical data is a valuable resource from which several insights can be gleaned. However, before using this data for analysis, it is imperative to build a metathesaurus that identifies synonyms, abbreviations and misspelt words, which can then be rectified to obtain highly structured data. This data can then be leveraged to gain actionable insights.
