Initial investigation of Covid simulation exercises for Malaysia using R
The SIR (Susceptible-Infected-Recovered) model of disease spread is the most widely used epidemiological model for studying the Covid outbreak. It informs policy-making at the WHO and many other public health agencies worldwide. It is a mathematical and statistical model fitted to whatever data the modeler can obtain, which is the core work of data science. Given the massive impact of this outbreak and its unprecedented nature, any lessons learned are worthwhile. Data science, as a new field, could show its usefulness under such conditions. This short paper proposes a way forward in using data science to model the spread of the disease in Malaysia.
Why do we start with the SIR model? Because it is the model commonly used by the WHO and in most epidemiological studies, which allows us to take model parameters from other scientific research and publications.
Model Notations and assumptions:
Parameter | Assumptions | Notes |
---|---|---|
\(t\) | Time period which is measured in days | |
\(O\) | Original state, i.e. the state without intervention measures | |
\(E\) | Effective state, i.e. the state with time-dependent intervention measures in effect | |
\(N_t\) | Total population, which is assumed to be fixed at all times | |
\(S_t\) | Susceptible persons within the population at time \(t\) | e.g. \(S_t \sim \mathrm{Binomial}(S_{t-1}, (1-\alpha)^{I_t})\) |
\(R_t\) | Total number of Recovered people at time \(t\) | e.g. \(R_t \sim \mathrm{Binomial}(I_t, \beta) + R_{t-1}\) |
\(I_t\) | Total number of Infected people at time \(t\) | i.e. \(I_t = N_t - R_{t-1} - S_{t-1}\) |
\(\alpha\) | Probability of infection by an Infected person | i.e. from \(S_t \to I_t\) |
\(\beta\) | Probability of recovery, after being infected | i.e. \(I_t \to R_t\) |
\(\tau_t\) | Fraction of population infected | i.e. \(\frac{I_t}{N_t}\) at time \(t\) |
\((1-\tau_t)\) | Fraction of population NOT infected | i.e. \(e^{-\tau_t R_O}\) as \(N \to \infty\) |
\(p_O \ or \ p_E\) | Probability of disease transmission on contact, during the Original or Effective state | |
\(c_O \ or \ c_E\) | Number of contacts per unit of time \(t\) | i.e. per day |
\(l_O \ or \ l_E\) | Duration of the infectious period in units of \(t\) | i.e. in number of days |
\(SI_t\) | Generation time of the disease: the interval between being infected and starting to infect others (called the Serial Interval) | |
\(R_O\) | The reproduction rate of the disease in the Original state | e.g. modelled as: \(R_O = p_O \cdot c_O \cdot l_O\) |
\(R_E\) | The effective reproduction rate of the disease | e.g. modelled as: \(R_E = p_E \cdot c_E \cdot l_E\) |
\(r_t\) | The incidence growth rate | i.e. growth in terms of the number of Infected people |
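The update rules in the Notes column of the table above can be sketched as a small chain-binomial simulation in R. This is a minimal sketch under stated assumptions: the function name `simulate_sir` and all numeric values (population size, initial cases, \(\alpha\), \(\beta\), number of days) are hypothetical and chosen only for illustration, not estimated from Malaysian data.

```r
# Minimal chain-binomial SIR sketch following the notation table.
# simulate_sir() and every default value are illustrative assumptions.
simulate_sir <- function(N = 10000, I0 = 10, alpha = 5e-5, beta = 0.1,
                         days = 120) {
  S <- numeric(days)
  I <- numeric(days)
  R <- numeric(days)
  S_prev <- N - I0   # susceptible before day 1
  R_prev <- 0        # recovered before day 1
  for (t in seq_len(days)) {
    # I_t = N_t - R_{t-1} - S_{t-1}
    I[t] <- N - R_prev - S_prev
    # S_t ~ Binomial(S_{t-1}, (1 - alpha)^{I_t}): each susceptible person
    # escapes infection from every currently infected person
    S[t] <- rbinom(1, S_prev, (1 - alpha)^I[t])
    # R_t ~ Binomial(I_t, beta) + R_{t-1}
    R[t] <- rbinom(1, I[t], beta) + R_prev
    S_prev <- S[t]
    R_prev <- R[t]
  }
  data.frame(day = seq_len(days), S = S, I = I, R = R)
}

set.seed(42)          # fixed seed so the run is reproducible
sim <- simulate_sir()
head(sim)
```

With these assumed values, \(\alpha N / \beta\) is about 5, which sits inside the range of reproduction rates discussed later. Changing `alpha` or `beta` is the simplest way to explore other regimes.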
The key parameters in the model are \(SI_t\) and \(R\) (i.e. the Serial Interval and the reproduction rates). What interests data scientists is the behavior of the growth rate \(r_t\) during the epidemic, as modeled by these key parameters.
\[r_t = f_t(SI,R)\]
The growth rate \(r_t\) of the number of infected people is a function of \(SI\), the generation time (Serial Interval), and \(R\), the reproduction rate, at the initial stage and at the various stages of intervention. The key objective is to understand what the growth function \(f_t(\cdot)\) looks like and how it behaves over time (i.e. over the days of the epidemic). Obtaining a reliable estimate of \(r_t\) is a major challenge; many epidemiologists have worked on statistical models for it and tested them on past outbreaks such as SARS and MERS, as well as on HIV and several varieties of flu.
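As an illustration of what \(f_t(\cdot)\) can look like, one common textbook form assumes an exponentially distributed generation time with mean \(SI\). This is an assumption made here for the example only, not necessarily the form intended in this paper. Under that assumption the growth rate and the doubling time \(T_d\) (a symbol introduced only for this example) are:

\[r = \frac{R - 1}{SI}, \qquad T_d = \frac{\ln 2}{r}\]

For example, with \(R = 2.5\) and \(SI = 7\) days, \(r = 1.5 / 7 \approx 0.21\) per day and \(T_d \approx 3.2\) days. If measures bring the reproduction rate down to \(R = 1.2\), then \(r \approx 0.029\) per day and \(T_d \approx 24\) days: growth is much slower but still positive.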
From the SIR Model we can derive the following, which can be used for prediction exercises:
During the early stage (Original state) of the epidemic, \(R_O\) is much larger than 1 (i.e. \(R_O \gg 1\)), hence the growth function \(r_t\) is generally taken to follow an exponential curve, with a probability distribution of mean \(\mu\) and standard deviation (SD) \(\sigma\).
During the extended period, once intervention measures have taken place, \(R_E\) is suppressed towards 1. The growth function \(r_t\) still follows an exponential curve, but with a new set of parameters \(\mu\) and \(\sigma\). This is called the flattening part of the growth curve.
However, it is important to note that flattening the curve ONLY slows the growth; it does not necessarily bring it to zero. Growth stops only once \(R_E\) falls below 1, and new cases die out quickly only if \(R_E \ll 1\). The break-even point corresponds to \(E[I_{t+1}] - E[I_{t}] = 0\), i.e. the expected number of infected people no longer increases.
From the assumptions, the key exercise is for us to obtain estimates of \(SI_t\), \(R_O\) and \(R_E\):
From item 4 above, we can deduce a few predictions. If the probability of infection is to be decreased under any form of intervention (such as the Movement Control Order, MCO), we need to control all factors:
The stages of the Covid epidemic could be loosely divided into three:
Stages | Characteristics | Days | Parameters | Malaysia |
---|---|---|---|---|
Early | Onset of the outbreak | 1 to X days | Some threshold decided by the country | About 5 to 10 days |
Mid | Measures being taken | X to Y days | Measures are decided by the country | 7 days in, set to last for 14 days |
Advanced | Effects of measures take hold | Y to Z days | Measures are decided by the country | Unknown |
Globally, the Covid epidemic is now approximately 60 days old. China and Korea are at the most advanced stage (more than 60 days), while other countries are either in a fast-growing Early stage or in the Mid stage. Malaysia is 30 days into the epidemic: about three weeks of the Early stage, and now one week into the Mid stage.
The key problem at any stage is to estimate (or simulate) the growth function \(r_t\) and its parameters, as mentioned before. To perform our simulations, we need to learn what we can about the growth function in countries at more advanced stages as a guide; namely, we can learn from the data of China, Korea and other countries with a sufficient amount of granular data.
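One simple way to estimate a growth rate from daily case counts is a Poisson regression with a log link, where the slope on the day index plays the role of \(r\). The sketch below uses synthetic counts generated inside the code (not real data from any country), and the true rate of 0.2 per day and the starting level of 5 cases are assumptions chosen only to show that the fit recovers the rate.

```r
# Sketch: estimate an exponential growth rate from daily counts.
# The data are synthetic; true_r and the baseline of 5 are assumptions.
set.seed(1)
days <- 1:20
true_r <- 0.2                              # assumed growth rate per day
expected <- 5 * exp(true_r * days)         # assumed exponential trend
cases <- rpois(length(days), expected)     # synthetic daily new cases

# Poisson regression with log link: log(E[cases]) = a + r * days
fit <- glm(cases ~ days, family = poisson(link = "log"))
r_hat <- unname(coef(fit)["days"])         # estimated growth rate per day
doubling_time <- log(2) / r_hat            # implied doubling time in days

c(r_hat = r_hat, doubling_time = doubling_time)
```

Fitting the same model separately to the Early and Mid stage windows of a real series would give one estimate of \(r\) per stage, which is one way to see whether the curve has flattened.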
Important learnings thus far:
What are the dangers that we still cannot see in the available data?
Even if we can get \(R_E\) to approach \(1\), growth is still positive. When will we start to see \(R_E\) significantly below \(1\), say below \(0.5\) or even approaching \(0\)? In Korea, for example, \(R_E\) is close to \(1\) but new cases keep coming in.
The lead-lag effects of the Covid virus (i.e. the \(SI\) and \(l\) parameters) are still not fully understood globally, and the effect of control measures (i.e. reducing \(R_O\) to an \(R_E\) below \(1\)) is also quite uncertain.
There are a few noteworthy limitations of the SIR model or any of its variations:
The SIR model is non-stochastic, which means there are many uncertainties that it cannot easily quantify. The lack of granular, detailed data while the epidemic is underway is to blame, as it limits our ability to model these uncertainties. A better representation is a stochastic SIR model, which allows us to delve deeper into various realistic scenarios and reduce model errors.
It ignores (or does not take into consideration) spatial dimensions of the outbreak.
SIR does not allow for branching of states (i.e. it assumes that probability distributions are stationary over time and that the behavior of the virus does not mutate).
The SIR model assumes a random network model (implied by its binomial and normal distribution assumptions), although it has been shown that disease epidemics rarely follow a random network and may instead have scale-free network properties.
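The difference between a random network and a scale-free network, mentioned in the last limitation above, can be illustrated by comparing degree distributions. The sketch below is a toy example under stated assumptions: the scale-free side uses a simple preferential-attachment rule with one link per new node, the random side approximates degrees as independent binomial draws with the same mean, and the network size of 2000 is arbitrary.

```r
# Toy comparison of contact-degree distributions (illustrative assumptions).
set.seed(7)
n <- 2000                                # arbitrary number of people

# Scale-free sketch: each new node links to one existing node, chosen with
# probability proportional to that node's current degree.
deg_sf <- integer(n)
deg_sf[1:2] <- 1L                        # start from two connected nodes
for (i in 3:n) {
  target <- sample.int(i - 1, size = 1, prob = deg_sf[seq_len(i - 1)])
  deg_sf[target] <- deg_sf[target] + 1L
  deg_sf[i] <- 1L
}

# Random-network sketch: binomial degrees with the same mean degree
# (an approximation that treats degrees as independent).
deg_rand <- rbinom(n, size = n - 1, prob = mean(deg_sf) / (n - 1))

c(mean_sf = mean(deg_sf), max_sf = max(deg_sf),
  mean_rand = mean(deg_rand), max_rand = max(deg_rand))
```

Both networks have roughly the same average number of contacts, but the preferential-attachment network produces a few highly connected hubs. In an epidemic setting such hubs would correspond to people with far more contacts than the average \(c\), which a homogeneous SIR model does not represent.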
However, to quote a saying: “Models are our best effort and duties to perform when no model at all is an even worse proposition. This is despite that models come with errors, both in model errors and errors in the models”.
The best way left (for the time being) is to simulate various scenarios given the broad parameters that we can learn from the data, using the SIR model as the base model. Simulations (instead of predictions) allow us, as data scientists, to present various scenarios as a general guide. Hopefully, public officials can then make their own estimates, which could be critical to their duties under the current Covid crisis, as part of crisis management.
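A scenario exercise of this kind can be sketched by running a stochastic SIR simulation many times per scenario and summarizing the spread of outcomes instead of a single prediction. Everything in the sketch below is a labeled assumption: the helper `simulate_scenario`, the population of 10000, the intervention on day 10, and the infection probabilities before (`alpha_O`) and after (`alpha_E`) the measures are hypothetical values for illustration only.

```r
# Hypothetical helper: chain-binomial SIR in which the infection probability
# changes from alpha_O to alpha_E on day t_int. Returns the infected series.
simulate_scenario <- function(N, I0, alpha_O, alpha_E, beta, t_int, days) {
  S_prev <- N - I0
  R_prev <- 0
  I <- numeric(days)
  for (t in seq_len(days)) {
    alpha <- if (t < t_int) alpha_O else alpha_E
    I[t] <- N - R_prev - S_prev                       # currently infected
    S_prev <- rbinom(1, S_prev, (1 - alpha)^I[t])     # remaining susceptible
    R_prev <- R_prev + rbinom(1, I[t], beta)          # cumulative recovered
  }
  I
}

set.seed(2020)
n_runs <- 200                                         # replicates per scenario
# Assumed post-intervention infection probabilities for two scenarios
scenarios <- list(no_measures = 5e-5, with_measures = 1.5e-5)

# Peak number of infected people in each replicate, one column per scenario
peaks <- sapply(scenarios, function(alpha_E) {
  replicate(n_runs, max(simulate_scenario(
    N = 10000, I0 = 10, alpha_O = 5e-5, alpha_E = alpha_E,
    beta = 0.1, t_int = 10, days = 200)))
})

# 5%, 50% and 95% quantiles of the peak, per scenario
summary_table <- apply(peaks, 2, quantile, probs = c(0.05, 0.5, 0.95))
summary_table
```

Reporting a range of quantiles per scenario, instead of one number, is the point of the exercise: it shows how much of the outcome is down to chance even when the parameters are held fixed.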
General range of estimations on model parameters for Malaysia:
Parameter | Range | Comments |
---|---|---|
\(R_O \ and \ R_E\) | \(1.5 \ to \ 7\) | Based on studies of Covid in China |
\(SI\) | \(5 \ to \ 10 \ days\) | Based on general estimates from SARS outbreak |
\(l\) | \(7 \ to \ 21 \ days\) | Based on general assumptions about the infectious period of the disease |
\(c\) | \(2 \ to \ 8 \ times \ per \ day\) | Based on small world properties of human relations |
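The ranges in the table above can be combined through the relation \(R_O = p_O \cdot c_O \cdot l_O\) from the notation table to see which transmission probabilities they imply, and what a change in one factor does to the reproduction rate. The sketch below is illustrative only: the grid points are picked from within the stated ranges, and the intervention (halving the number of contacts while \(p\) and \(l\) stay unchanged) is a hypothetical scenario.

```r
# Illustrative sweep over the parameter ranges (grid points are assumptions).
grid <- expand.grid(R_O = c(1.5, 2.5, 4, 7),   # reproduction rate
                    c_O = c(2, 4, 8),          # contacts per day
                    l_O = c(7, 14, 21))        # infectious period in days

# Transmission probability per contact implied by R_O = p_O * c_O * l_O
grid$p_O <- grid$R_O / (grid$c_O * grid$l_O)

# Hypothetical intervention: contacts are halved, p and l are unchanged
grid$c_E <- grid$c_O / 2
grid$R_E <- grid$p_O * grid$c_E * grid$l_O
grid$below_one <- grid$R_E < 1

head(grid[order(grid$R_O), ])
```

Under this multiplicative model, halving contacts halves the reproduction rate, so it brings \(R_E\) below 1 only when \(R_O\) is below 2. For larger \(R_O\), more than one factor has to be reduced at the same time.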
What could we gain from the simulations (examples of possible problem statements):
To move forward, data science can provide the tools to perform the above-mentioned tasks and problem solving. However, the exercise depends critically on a) whatever data is available in the most timely manner, b) the readiness of tools to perform the computations, c) modeling of the problem statements, and d) analysis of the simulation results.
For (a), we can only rely on publicly available data (which is insufficient, but given the situation may be the best that we can do). For (b), we are ready to perform all the computations using the R programming language. (c) must be done by people with expert domain knowledge, which we do not possess. For (d), we can assist in interpreting the results from a data analytics point of view, though this also requires expert domain knowledge.
Text and figures are licensed under Creative Commons Attribution CC BY-SA 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".