Practical challenges in using healthcare data from large hospitals in predictive analytics
Artificial intelligence is driving one of the biggest revolutions in the healthcare industry. From detecting cancer through image recognition to optimizing the utilization of operating theaters, data science is changing the way healthcare providers deliver care to patients. Hospital administrators who were once reluctant to put their data in digital form are now eager to use it to generate actionable insights, improve health outcomes, and increase revenue. Based on our experience of working with some of the largest healthcare datasets in the world, we list here some of the unique challenges in turning data into information.
- Data Entry
Healthcare providers at the bottom of the pyramid often have a limited understanding of why data should be recorded in a standardized format. However accurate or sophisticated a program's logic is, its results will be unreliable if the input data is full of errors: garbage in, garbage out. Some of the common data entry challenges we encountered are:
- Incorrect and inconsistent date formats: Any prediction in the healthcare domain needs accurate chronological data as input. For example, recording the wrong date leads to a wrong length-of-stay calculation, which degrades the predictive model and, in turn, produces wrong estimates of treatment cost and poor operation scheduling for patients. A related issue is that dates are often not recorded in a consistent format.
- Non-standardized format of procedures and diagnoses (ICD and CPT codes required): Most predictive analytics use cases in the healthcare domain require diagnosis and treatment data. These must be captured in standard formats, namely ICD (International Classification of Diseases) and CPT (Current Procedural Terminology). More often than not, however, these practices are not followed, and diseases and treatment procedures are entered as free text with no standardization whatsoever. Converting these entries into usable data points requires a lot of cleaning, both manual and with NLP techniques.
- Missing data: Missing values are another problem that degrades the performance of the resulting models. In addition, patient records that are present in one database but absent from another can sharply reduce the number of usable data points after merging.
- Inconsistencies in doctors' and surgeons' names: Here again, standardization is rare. Cleaning such data is resource- and time-intensive, and the results may still be suboptimal.
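To make the date problem concrete, here is a minimal sketch of how mixed date formats might be normalized before computing length of stay. The list of formats, the day-first reading of ambiguous dates such as 05/03/2019, and the function names `parse_date` and `length_of_stay` are our assumptions for illustration; a real hospital export would need its own list.

```python
from datetime import date, datetime
from typing import Optional

# Assumed formats for illustration; ambiguous dates are read day-first.
KNOWN_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%d-%b-%Y", "%d.%m.%Y")


def parse_date(raw: str) -> Optional[date]:
    """Return a date if raw matches one of KNOWN_FORMATS, else None."""
    text = raw.strip()
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue  # try the next format
    return None


def length_of_stay(admitted: str, discharged: str) -> Optional[int]:
    """Length of stay in days.

    Returns None when either date cannot be parsed or when the
    discharge date precedes the admission date, so that bad rows
    can be flagged for review instead of silently skewing a model.
    """
    start, end = parse_date(admitted), parse_date(discharged)
    if start is None or end is None or end < start:
        return None
    return (end - start).days


print(length_of_stay("2019-03-01", "05/03/2019"))   # 4
print(length_of_stay("01-Mar-2019", "2019-02-27"))  # None
```

Returning None rather than guessing keeps incorrect dates visible, which matters because a single wrong date propagates into the length of stay and every estimate built on it.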
In summary, healthcare providers can improve their data capturing routines in a stepwise manner by prioritizing the data types most valuable to their specific projects. More specifically, providers need to ensure that a proper medical coding system is used (such as ICD and CPT codes). Although machine learning techniques exist that can automatically handle non-standardized string variables, they are suboptimal compared to having standard database practices in place.
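As an illustration of why cleaning free-text names is costly, the sketch below maps inconsistently entered doctor names onto a reference roster. It uses plain string similarity from Python's standard `difflib` module, not a machine learning model, and the roster, the 0.85 cutoff, and the function names are hypothetical choices of ours.

```python
import difflib
import re
from typing import List, Optional


def normalize_name(raw: str) -> str:
    """Lowercase, drop punctuation and a leading 'dr' title, collapse spaces.

    Note: this simple version only keeps the letters a-z, so names with
    accented or non-Latin characters would need different handling.
    """
    text = raw.lower()
    text = re.sub(r"[^a-z\s]", " ", text)
    text = re.sub(r"^\s*dr\s+", "", text)
    return " ".join(text.split())


def match_name(raw: str, roster: List[str], cutoff: float = 0.85) -> Optional[str]:
    """Return the roster entry closest to raw, or None if nothing is close."""
    normalized = {normalize_name(name): name for name in roster}
    hits = difflib.get_close_matches(
        normalize_name(raw), list(normalized), n=1, cutoff=cutoff
    )
    return normalized[hits[0]] if hits else None


# Hypothetical roster, for illustration only.
roster = ["Anita Sharma", "Rajesh Kumar"]
print(match_name("Dr. Anita Sharma", roster))  # Anita Sharma
print(match_name("Anita Sharmaa", roster))     # Anita Sharma
print(match_name("John Smith", roster))        # None
```

Even in this small sketch the cutoff is a judgment call: set it too low and different doctors are merged, too high and obvious typos are missed. This is one reason such fixes remain inferior to recording names from a controlled list at entry time.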
- Numerous Sources of Data
Electronic Health Records are collected from multiple sources (such as labs, operation theatres, IPDs, OPDs, etc.). This is further complicated by the fact that many hospitals have more than one database, each with its own standard practices. Multiple sources of data sometimes mean that there is no single source of truth, or that sources hold conflicting observations, in which case it is difficult to ascertain which one is correct. While analyzing the data, we observed this kind of inconsistency in variables such as age, admission date, and discharge date across different data sources (for example, inpatient data and demographic data).
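A check of this kind can be sketched as follows, assuming each source has already been loaded into a dictionary keyed by patient ID. The function name `reconcile`, the field names, and the sample records are hypothetical and only illustrate the idea.

```python
from typing import Any, Dict, List, Tuple

Records = Dict[str, Dict[str, Any]]


def reconcile(
    inpatient: Records, demographic: Records, fields: List[str]
) -> Tuple[List[str], Dict[str, Dict[str, Tuple[Any, Any]]]]:
    """Compare two sources keyed by patient ID.

    Returns (missing, conflicts):
      missing   - sorted IDs that appear in only one of the two sources
      conflicts - {patient_id: {field: (inpatient_value, demographic_value)}}
                  for shared patients whose values disagree
    """
    missing = sorted(set(inpatient) ^ set(demographic))
    conflicts: Dict[str, Dict[str, Tuple[Any, Any]]] = {}
    for pid in sorted(set(inpatient) & set(demographic)):
        diffs = {
            field: (inpatient[pid].get(field), demographic[pid].get(field))
            for field in fields
            if inpatient[pid].get(field) != demographic[pid].get(field)
        }
        if diffs:
            conflicts[pid] = diffs
    return missing, conflicts


# Made-up records, for illustration only.
inpatient = {
    "P001": {"age": 54, "admission_date": "2019-03-01"},
    "P002": {"age": 37, "admission_date": "2019-03-04"},
}
demographic = {
    "P001": {"age": 45, "admission_date": "2019-03-01"},
    "P003": {"age": 62, "admission_date": "2019-03-07"},
}

missing, conflicts = reconcile(inpatient, demographic, ["age", "admission_date"])
print(missing)    # ['P002', 'P003']
print(conflicts)  # {'P001': {'age': (54, 45)}}
```

The sketch only reports disagreements; deciding which source is right still requires a rule agreed with the hospital, since the code has no way to know which value is the true one.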
- Data Capturing
Healthcare data is present in multiple formats (e.g., numeric, text, images, paper, etc.). Radiology uses images, old medical records exist in paper format, and today's EHRs can hold hundreds of rows of textual and numerical data. Unfortunately, many hospitals are still not able to capture and store such data in digital form. Hospital administrators need to be informed about the potential benefits of digitizing data. Some of the use cases of data digitization are as follows:
- Mammography scans can be used to detect breast cancer
- Patient demographic data can be used to predict the length of stay in a hospital, which in turn leads to better scheduling of surgeries.
The absence of data in a usable format is a major limitation for data scientists trying to extract the required information.
- Limited Computational Resources
Healthcare data is growing rapidly, and so is the computational power needed to retrain predictive models automatically on a periodic basis. Automated retraining of models is impractical on a local desktop because it takes too long. We therefore need to rely on cloud-based services to handle the recurring surges in demand for computational power.
To conclude, we would like to make a few suggestions to hospitals so that data experts can deploy cutting-edge analytics products. Firstly, instead of running multiple Information Management Systems (IMS), deploy a common integrated data management system in which the data can be verified at multiple points of the patient's journey. A common integrated IMS for all data sources would also ensure that the entire journey of the patient is recorded in a single place, thus avoiding inconsistencies. Secondly, medical coders must be hired to ensure that proper ICD and CPT codes are used for diagnoses and procedures respectively. Thirdly, hospitals must start recording all forms of data (i.e., images, handwritten text, etc.) in a usable digital form. We understand that a frequently changing regulatory framework and government attempts to make measures like readmission rates, quality, and pricing information publicly available are nightmares for big hospital chains. These attempts not only increase the reporting burden but also lead to a shift from a brand-based purchasing model to a value-based one. This creates an urgent need for hospitals to employ advanced analytics to stand out in this competitive environment.