How Data and Metrics Can Improve Our Ability to Respond to Pandemics

Data & Analytics

March 31, 2020

We are living in the middle of a pandemic that is likely to lead to a humanitarian crisis, and we still do not know how bad it is going to be. Measured optimism is fine, but it is important to read the signals and understand the numbers to gauge the scale and magnitude of the crisis.

In the past, when the wrong metrics were quoted or the numbers behind them were incorrect, incorrect inferences followed. The idea behind this post is to provide an overview of the metrics referred to in the COVID-19 analytics work shared by data professionals. Most of these metrics are common to the study of any epidemic.

Coming to the data: like many of you, I have been following the coronavirus data daily, and on some days hourly. Over the last month or so, a good number of data dashboards, analyses and Kaggle kernels have been shared by data professionals who diligently analysed the data and published their findings.

Key Datasets:

So what kind of data are practitioners and analysts using for their analysis? While the majority rely on public datasets published on sites like Kaggle, most of these can be traced back to the following three key sources.

Key datasets for Covid-19 cases:

World Health Organization data, published as two sets of reports:
1. Situation Reports
2. Health Emergency Data
Johns Hopkins University dataset, which is available on GitHub
European Centre for Disease Prevention and Control (ECDC)

Key Metrics

One of the websites I have been referring to regularly is this, which shows a high-level summary of cases. However, news sites around the world are quoting vastly different numbers for the fatality rate, which was the trigger for this blog.

First, I wanted to understand the key metrics that healthcare bodies and government organizations use for COVID-19, and how they compare these with previous epidemics like SARS or MERS to gauge the relative scale and magnitude of the current one.

CFR (Case Fatality Rate): The most commonly quoted metric is the fatality rate, which is obtained by dividing the number of deaths by the total number of cases at a point in time. WHO first quoted this number at around 2.1% in February and at 3.4% in the first week of March.
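As a minimal sketch of the CFR calculation described above (the function name and the counts are illustrative assumptions, not real data):

```python
def case_fatality_rate(deaths: int, cases: int) -> float:
    """Return the CFR as a percentage: deaths divided by total cases."""
    if cases <= 0:
        raise ValueError("cases must be a positive number")
    return 100.0 * deaths / cases


# Hypothetical snapshot at a single point in time (not real figures).
cfr = case_fatality_rate(deaths=34, cases=1000)
print(f"CFR: {cfr:.1f}%")  # prints "CFR: 3.4%"
```

Because both counts move daily during an outbreak, the same function applied to snapshots from different dates will generally give different values.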

Mortality Rate: Some websites and newspapers quote the fatality rate as the mortality rate. According to WHO's definition, however, the mortality rate is the number of deaths caused by the disease as a proportion of the given population.

Morbidity Rate: Another important metric for understanding the spread of COVID-19 in a location, obtained by dividing the number of people infected with the disease by the population. A variant of this is reported in the ECDC's dashboard as cases per 100K of population.
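The two population-based rates differ from the CFR only in the denominator. A small sketch, assuming a hypothetical helper `rate_per_100k` and made-up counts for a single location:

```python
def rate_per_100k(count: int, population: int) -> float:
    """Scale a raw count to a rate per 100,000 people."""
    if population <= 0:
        raise ValueError("population must be a positive number")
    return count * 100_000 / population


# Hypothetical figures for one location (not real data).
population = 5_000_000
infected = 2_500
deaths = 50

morbidity_per_100k = rate_per_100k(infected, population)  # cases per 100K
mortality_per_100k = rate_per_100k(deaths, population)    # deaths per 100K

print(f"Cases per 100K:  {morbidity_per_100k:.1f}")
print(f"Deaths per 100K: {mortality_per_100k:.1f}")
```

With these numbers the CFR would be 2% (50 of 2,500 cases), while the mortality rate is only 1 death per 100K of population, which is why the two should not be quoted interchangeably.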

Neither the mortality rate nor the morbidity rate is meaningful during an ongoing outbreak, so WHO usually releases these numbers after the epidemic. The CFR, however, can still be used to understand the severity of cases at different stages of the epidemic and to compare it with other epidemics at a similar stage.

But what are the challenges in using these metrics during an epidemic?

First, the numbers keep changing with the stage of the epidemic in each country.

Second, unreported cases are not taken into consideration in these metrics. Ignoring unreported cases inflates the CFR, because the denominator is smaller than the true number of infections. Unreported deaths could in principle balance this out, but the chances of that are minimal for obvious reasons.
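To see how strongly unreported cases can move the number, here is a rough illustration. The counts and the sizes of the unreported group are purely hypothetical assumptions; estimating the real figure requires the modelling techniques mentioned later in this post.

```python
def fatality_rate_with_unreported(deaths: int, reported: int, unreported: int) -> float:
    """Fatality rate (%) when unreported cases are added to the denominator."""
    total_cases = reported + unreported
    if total_cases <= 0:
        raise ValueError("total number of cases must be positive")
    return 100.0 * deaths / total_cases


# Hypothetical scenario: 30 deaths among 1,000 reported cases.
deaths, reported = 30, 1000
scenarios = [0, 1000, 4000]  # assumed numbers of unreported cases
rates = [fatality_rate_with_unreported(deaths, reported, u) for u in scenarios]

for unreported, rate in zip(scenarios, rates):
    print(f"unreported={unreported:>5}: fatality rate {rate:.1f}%")
```

In this made-up example the rate falls from 3.0% to 1.5% to 0.6% as the assumed number of unreported cases grows, even though the number of deaths never changes.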

Mean Incubation Period: The mean incubation period is the average duration from the time of infection to the appearance of the first symptoms of the disease. Why is a 14-day self-quarantine advised for people with a travel history? Because the current estimated range for the incubation period is 2 to 14 days (with outliers, it is reported at 0 to 27 days). As in any data, outliers were noted here as well: a few people did not report any symptoms for 25 or more days.
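A short sketch of how such summary numbers could be derived, using Python's standard `statistics` module. The list of incubation periods is invented for illustration, including one deliberate outlier:

```python
from statistics import mean, median

# Hypothetical incubation periods in days (not real patient data);
# the final value is a deliberate outlier.
incubation_days = [2, 3, 4, 5, 5, 6, 7, 9, 14, 27]

mean_incubation = mean(incubation_days)
median_incubation = median(incubation_days)

# Share of people whose symptoms appeared within a 14-day quarantine window.
within_14_days = sum(d <= 14 for d in incubation_days) / len(incubation_days)

print(f"Mean:   {mean_incubation:.1f} days")
print(f"Median: {median_incubation:.1f} days")
print(f"Within 14 days: {within_14_days:.0%}")
```

Note how a single long outlier pulls the mean (8.2 days here) well above the median (5.5 days), which is one reason ranges are usually reported alongside the average.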

Median Hospital Stay: The median duration of hospital stay is the period for which patients are admitted to hospital for treatment, up to the time of outcome. This number is crucial for governments planning the healthcare facilities and supplies required. The scale of COVID-19 shows that even developed countries are struggling to accommodate and treat patients, with facilities stretched beyond their capacities.
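Computing this metric needs admission and outcome dates per patient. A minimal sketch, assuming a hypothetical list of (admission date, outcome date) pairs:

```python
from datetime import date
from statistics import median

# Hypothetical (admission date, outcome date) pairs, not real records.
stays = [
    (date(2020, 3, 1), date(2020, 3, 9)),
    (date(2020, 3, 2), date(2020, 3, 14)),
    (date(2020, 3, 5), date(2020, 3, 10)),
    (date(2020, 3, 7), date(2020, 3, 28)),
    (date(2020, 3, 8), date(2020, 3, 18)),
]

# Length of stay in days for each patient.
lengths = [(outcome - admitted).days for admitted, outcome in stays]
median_stay = median(lengths)

print(f"Lengths of stay: {lengths}")
print(f"Median hospital stay: {median_stay} days")
```

The median is used rather than the mean because a few very long stays would otherwise dominate the figure.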

Ratio of Community to Imported Spread: For COVID-19 in particular, governments keenly follow this ratio to understand whether the virus is confined to travellers who arrive already infected or whether there is community spread. India is on the brink of community spread at the time of writing, as the ratio is approaching 1.
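The ratio itself is a simple count of cases by transmission type. A sketch, assuming each case is tagged with a hypothetical label of either "community" or "imported":

```python
from collections import Counter


def community_to_imported_ratio(transmission_types):
    """Ratio of community-transmitted cases to imported cases."""
    counts = Counter(transmission_types)
    if counts["imported"] == 0:
        raise ValueError("no imported cases, ratio is undefined")
    return counts["community"] / counts["imported"]


# Hypothetical case labels (not real data).
cases = ["imported"] * 50 + ["community"] * 45
ratio = community_to_imported_ratio(cases)

print(f"Community-to-imported ratio: {ratio:.2f}")
```

A ratio well below 1 suggests most cases arrived from outside, while a ratio at or above 1 indicates that local transmission has caught up with or overtaken imported cases.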

Some Key Data/Modelling Challenges in Epidemiology

The Unreported Problem: The problem of unreported cases is twofold. On the operational front, unreported cases pose containment challenges, as people with mild or no symptoms could be actively transmitting the virus. On the modelling front, estimating the volume of unreported cases is vital for evaluating the size and severity of the epidemic so that the response can be optimized. There are advanced modelling techniques for estimating unreported cases, but those would require a separate blog post of their own!

A whole lot of advanced analytics can be done around this: forecasting case volumes, predicting average length of stay (which can help hospitals and governments prepare), predicting an individual's COVID-19 risk score based on symptoms and related variables (Apollo Hospitals just released one such AI model to predict the COVID risk score), modelling the spread (R0), modelling survival and progression rates, and so on.
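As a very simple starting point for looking at case growth, one could fit an exponential trend to cumulative counts and read off a doubling time. This is a deliberately basic sketch and not an R0 estimate or a proper forecasting model; the function name and the synthetic series (constructed to double every 3 days) are assumptions for illustration.

```python
import math


def doubling_time(cumulative_cases):
    """Estimate doubling time in days from daily cumulative case counts.

    Fits a straight line to log(cases) against the day index using
    ordinary least squares; the slope is the daily growth rate.
    """
    if len(cumulative_cases) < 2 or min(cumulative_cases) <= 0:
        raise ValueError("need at least two positive counts")
    days = range(len(cumulative_cases))
    logs = [math.log(c) for c in cumulative_cases]
    n = len(logs)
    mean_x = sum(days) / n
    mean_y = sum(logs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(days, logs))
    sxx = sum((x - mean_x) ** 2 for x in days)
    growth_rate = sxy / sxx
    if growth_rate <= 0:
        raise ValueError("cases are not growing, doubling time is undefined")
    return math.log(2) / growth_rate


# Synthetic series that doubles every 3 days (not real data).
synthetic = [100 * 2 ** (day / 3) for day in range(10)]
estimate = doubling_time(synthetic)
print(f"Estimated doubling time: {estimate:.1f} days")
```

Real case series are noisy and affected by testing and reporting delays, so an estimate like this is only a rough indicator of the current trend.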

Lack of case-level data: Most of the datasets mentioned above are time series of daily cases and fatalities by geography; they are not at the grain of individual cases. Of course, such data will be difficult to collect unless governments, hospitals, agencies and everyone else involved are willing to share it. I am sure that with case-level data a whole lot of analytics could be done, yielding interesting and useful insights into other aspects of the epidemic such as the response, the efficacy of treatment methods, medications tried, side effects and so on.

About the Author

Natarajan Ganapathi

Data & AI Solutions Leader

Natarajan Ganapathi is a Data & AI Solutions leader at Hexaware, where he helps customers with his deep expertise across the full data stack, covering data strategy, architecture, engineering, machine learning, and data governance. With a proven track record of leading complex data strategies and large-scale Data & AI transformations, Natarajan is committed to making data work best for business.