Preamble

The SIR (Susceptible-Infected-Recovered) model of disease spread is the most widely used epidemiological model for studying the Covid outbreak. It informs policy-making at the WHO and many other public health agencies worldwide. The model is both mathematical and statistical, fitted to whatever data is available to the modeler, which places it at the core of data science work. Given the massive impact of this outbreak and its unprecedented nature, any lessons learned are worthwhile, and data science, as a new field, has a chance to show its usefulness under such conditions. This short paper proposes a way forward for using data science to model the spread of the disease in Malaysia.

The SIR Model

Why start with the SIR model? Because it is the model commonly used by the WHO and in most epidemiological studies, which allows us to take the model parameters from other scientific research and publications.

Model notation and assumptions:

Parameter	Assumptions	Notes
\(t\)	Time period which is measured in days
\(O\)	Original state or state without intervention measures
\(E\)	Effective state, i.e. the state with time-dependent intervention measures in place
\(N_t\)	Total population, which is assumed to be fixed at all times
\(S_t\)	Susceptible persons within the population at time \(t\)	e.g. \(S_t \sim \mathrm{Binomial}[S_{t-1}, (1-\alpha)^{I_t}]\)
\(R_t\)	Total number of Recovered people at time \(t\)	e.g. \(R_t \sim \mathrm{Binomial}[I_t,\beta] + R_{t-1}\)
\(I_t\)	Total number of Infected people at time \(t\)	i.e. \(I_t = N_t - R_{t-1} - S_{t-1}\)
\(\alpha\)	Probability that an Infected person infects a Susceptible person	i.e. the transition \(S_t \to I_t\)
\(\beta\)	Probability of recovery, after being infected	i.e. \(I_t \to R_t\)
\(\tau_t\)	Fraction of population infected	i.e. \(\frac{I_t}{N_t}\) at time \(t\)
\((1-\tau_t)\)	Fraction of population NOT infected	i.e. \(e^{-\tau_t R_o}\) as \(N \to \infty\)
\(p_O\) or \(p_E\)	Probability of disease transmission per contact, during the Original (\(O\)) or Effective (\(E\)) state
\(c_O\) or \(c_E\)	Number of contacts per unit of time \(t\)	i.e. per day
\(l_O\) or \(l_E\)	Duration of the infectious period in units of \(t\)	i.e. in number of days
\(SI_t\)	Generation-time distribution: the time between a person being infected and that person starting to infect others (called the Serial Interval)
\(R_o\)	The basic reproduction number of the disease	e.g. modelled as \(R_o = p_O c_O l_O\)
\(R_E\)	The effective reproduction number of the disease	e.g. modelled as \(R_E = p_E c_E l_E\)
\(r_t\)	The incidence growth rate	i.e. growth in terms of the number of Infected people
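
The recursion in the notation table can be sketched as a small chain-binomial simulation. This is a minimal illustration in Python, not a calibrated model: the function names, the population size and the values of \(\alpha\) and \(\beta\) in the usage example are all hypothetical, and the population \(N\) is held fixed as the table assumes.

```python
import random


def binomial(n, p, rng):
    """Draw from Binomial(n, p) by summing n Bernoulli trials (fine for small n)."""
    return sum(1 for _ in range(n) if rng.random() < p)


def simulate_sir(n_pop, i0, alpha, beta, days, seed=0):
    """Simulate one path of the chain-binomial SIR recursion.

    Returns a list of (S, I, R) tuples, one per day, starting at day 0.
    """
    rng = random.Random(seed)
    s, i, r = n_pop - i0, i0, 0
    path = [(s, i, r)]
    for _ in range(days):
        # Each susceptible escapes infection from every infected person
        # independently, so the escape probability is (1 - alpha) ** I.
        s_new = binomial(s, (1.0 - alpha) ** i, rng)
        # Each infected person recovers with probability beta.
        r_new = r + binomial(i, beta, rng)
        # Whoever is neither susceptible nor recovered is infected.
        s, r = s_new, r_new
        i = n_pop - s - r
        path.append((s, i, r))
    return path


if __name__ == "__main__":
    # Hypothetical inputs, chosen only to show the shape of an outbreak.
    for day, (s, i, r) in enumerate(simulate_sir(1000, 5, 0.0005, 0.1, 60)):
        if day % 10 == 0:
            print(day, s, i, r)
```

Because every draw is random, a single path says little on its own; repeating the call with different seeds gives a spread of outcomes.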

The key parameters in the model are \(SI_t\) and \(R\), i.e. the Serial Interval and the reproduction numbers. What interests data scientists is the behavior of the growth rate \(r_t\) during the epidemic, as modeled by these key parameters.

\[r_t = f_t(SI,R)\]

The rate of growth \(r_t\) of the number of infected people is a function of \(SI\) and \(R\): the generation time (Serial Interval) and the reproduction numbers at the initial stage and at the various stages of intervention. The key objective is to understand what the growth function \(f_t(.)\) looks like and how it behaves over time (i.e. over the days of the epidemic). Obtaining a reliable estimate of \(r_t\) is a major challenge, one on which many epidemiologists have worked, building various statistical models and testing them on past outbreaks such as SARS and MERS, as well as on HIV and several varieties of flu.
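
One simple candidate for \(f_t(.)\), assumed here for illustration and not taken from the text, treats the Serial Interval as a fixed number of days. Under that assumption \(R = e^{r \cdot SI}\), so \(r = \ln(R) / SI\). The Python sketch below uses hypothetical function names and ignores the spread of the Serial Interval distribution, which a fuller treatment would include.

```python
import math


def growth_rate(r_number, serial_interval_days):
    """Daily exponential growth rate r, assuming R = exp(r * SI)."""
    return math.log(r_number) / serial_interval_days


def doubling_time(r_number, serial_interval_days):
    """Days for case counts to double; only defined when R > 1."""
    r = growth_rate(r_number, serial_interval_days)
    if r <= 0:
        raise ValueError("cases do not grow when R <= 1")
    return math.log(2) / r


if __name__ == "__main__":
    # Hypothetical inputs: R = 2.5 with a 7-day Serial Interval.
    print(growth_rate(2.5, 7.0), doubling_time(2.5, 7.0))
```

By construction, \(R = 2\) with a 7-day Serial Interval doubles the case count every 7 days, and any \(R\) below 1 gives a negative growth rate.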

Model predictions from SIR

From the SIR Model we can derive the following, which can be used for prediction exercises:

During the early stage (Original state) of the epidemic, \(R_o\) is much larger than 1 (i.e. \(R_o \gg 1\)), so the growth function \(r_t\) is generally taken to follow an exponential curve, governed by a probability distribution with mean \(\mu\) and standard deviation (SD) \(\sigma\).
During the extended period, once intervention measures have taken effect, \(R_E\) is suppressed towards 1. The growth function \(r_t\) still follows an exponential curve, but with a new set of parameters \(\mu\) and \(\sigma\). This is called the flattening part of the growth curve.
However, it is important to note that flattening the curve ONLY slows the growth; it does not necessarily bring the growth down to zero. That is achieved only if \(R_E\) falls below 1 (ideally \(R_E \ll 1\)). The condition is akin to the equation \(E[I_{t+1}] - E[I_{t}] = 0\), i.e. the expected number of infected people stops growing.
Given these assumptions, the key exercise is to obtain estimates of \(SI_t\), \(R_o\) and \(R_E\):
1. For \(R_o\), data are available from many countries as well as from Malaysia, and we can use them to infer the rate. So far, estimates have ranged anywhere between 2 and 5.
2. Estimating \(R_E\) is extremely tricky because of the many unknown or uncontrolled factors. However, we know from the model that \(R_E\) needs to approach 1, or fall below 1, for the outbreak to slow or die down.
3. Estimating \(SI_t\) proves to be the hardest, given the lack of detailed clinical studies and data. So far, many studies of other diseases such as SARS have generally assumed a median \(SI\) of 7 to 10 days.
4. \(l\) and \(c\) are also hard to estimate, since they represent how many times per day (on average) an infected person comes into contact with Susceptible persons, and the duration of those interactions.
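
The decomposition \(R = p \cdot c \cdot l\) from the notation table makes the role of \(c\) and \(l\) concrete. The Python sketch below is only arithmetic on that product; the function names and the value \(p = 0.05\) are hypothetical, chosen to show how the pieces interact rather than to describe Covid.

```python
def reproduction_number(p, c, l):
    """R = p * c * l: transmission probability per contact,
    contacts per day, and infectious days."""
    return p * c * l


def max_contacts_for_control(p, l, target_r=1.0):
    """Largest daily contact rate c that keeps R at or below target_r."""
    return target_r / (p * l)


def max_infectious_days_for_control(p, c, target_r=1.0):
    """Longest infectious period l (in days) that keeps R at or below target_r."""
    return target_r / (p * c)


if __name__ == "__main__":
    p, c, l = 0.05, 8, 10  # hypothetical values for illustration only
    print(reproduction_number(p, c, l))
    print(max_contacts_for_control(p, l))
    print(max_infectious_days_for_control(p, c))
```

With these made-up inputs \(R\) is about 4. Holding \(p\) and \(l\) fixed, contacts would have to fall to about 2 per day to bring \(R\) down to 1; holding \(p\) and \(c\) fixed, the infectious period at large would have to fall to about 2.5 days.
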
From item 4 above, we can deduce a few predictions. If the probability of infection is to be decreased under any form of intervention (such as the Movement Control Order, MCO), we need to control all of the following factors:
1. \(R_E\), which is a tough element since it is a stochastic variable that changes through time. We can estimate its value only ex post, once the trend in the number of new patients can be tracked clearly. We do know, however, that many measures help to reduce \(R_E\), such as wearing face masks, sanitizing hands and places, increasing personal immunity, quarantining suspected cases, etc.
2. \(SI_t\) needs to be “extracted” from the available incidence data (i.e. bootstrapped).
3. \(c\) can be controlled via social distancing. The greater the distance, the better.
4. \(l\) can be controlled by reducing the time an infected person remains at large: faster testing and faster test results enable quick action to quarantine or contain the infected person as soon as possible.

What have we learned thus far from data?

The stages of the Covid epidemic could be loosely divided into three:

Stages	Characteristics	Days	Parameters	Malaysia
Early	Onset of the outbreak	1 to X days	Some threshold decided by the country	About 5 to 10 days
Mid	Measures being taken	X to Y days	Measures are decided by the country	7 days in, set to last for 14 days
Advanced	Effects of measures take hold	Y to Z days	Measures are decided by the country	Unknown

Globally, the Covid epidemic is now approximately 60 days old. China and Korea are at the most advanced stage (more than 60 days), while other countries are either in a fast-growing Early stage or in the Mid stage. Malaysia is 30 days into the epidemic: about three weeks of Early stage, and now one week into the Mid stage.

The key problem at any stage is to estimate (or simulate) the growth function \(r_t\) and its parameters, as mentioned before. To perform our simulations, we need to use what can be learned about the growth function in countries at more advanced stages as a guide; namely, we can learn from the data of China, Korea and other countries with a sufficient amount of granular data.

Important learnings thus far:

Globally, we now know that estimates of \(R_o\) vary between 2 and 7. In simple terms, this means that the number of confirmed cases is doubling every 1 to 3 days. Only two countries, namely Japan and Singapore, show lower growth (doubling every 5 to 10 days), which indicates that their \(R_o\) is greater than 1 and less than 2 (\(1 < R_o < 2\)). Nevertheless, the rate is still above 1.
For any control measures to be effective, \(R_E\) has to fall below 1 (ideally \(R_E \ll 1\)). So far, only China and Korea have managed to bring \(R_E\) close to 1.
How long does it take for \(\tau\) to peak and start to flatten after control measures are in place? China: 25 to 30 days; Korea: 14 to 21 days.

What are the dangers that we still cannot gauge from the available data?

Even when \(R_E\) approaches 1, growth is still positive. When will we start to see \(R_E\) significantly below 1, for instance below 0.5 or even approaching 0? In the case of Korea, for example, \(R_E\) is close to 1 but new cases keep coming in.
The lead-lag effects of the Covid virus (i.e. the \(SI\) and \(l\) parameters) are still not fully understood globally, and the effect of control measures (i.e. reducing \(R_o\) to an \(R_E\) below 1) is also quite uncertain.
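
Why an \(R_E\) just under 1 still produces a long tail of cases can be seen with a simple generation-by-generation calculation. The Python sketch below assumes a constant \(R_E\) and an effectively unlimited pool of Susceptible people (a branching-process approximation, not the full SIR model); the function names and the starting figure of 100 cases are hypothetical.

```python
def expected_cases_by_generation(i0, r_e, generations):
    """Expected new cases in each generation: i0 * r_e**k for k = 0..generations."""
    return [i0 * r_e ** k for k in range(generations + 1)]


def expected_total_cases(i0, r_e):
    """Expected cumulative cases, including the initial i0, when r_e < 1."""
    if r_e >= 1:
        raise ValueError("total is unbounded in this simple model when R_E >= 1")
    # Geometric series: i0 * (1 + r_e + r_e**2 + ...) = i0 / (1 - r_e)
    return i0 / (1.0 - r_e)


if __name__ == "__main__":
    for r_e in (0.9, 0.5):
        print(r_e, expected_total_cases(100, r_e))
```

Starting from 100 cases, \(R_E = 0.9\) still implies about 1000 expected cases in total under these assumptions, against about 200 for \(R_E = 0.5\).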

Limitations of SIR Model

There are a few noteworthy limitations of the SIR model or any of its variations:

The basic SIR model is non-stochastic, which means that many uncertainties are not easily quantified. The lack of granular, detailed data while the epidemic is underway is to blame, as it limits our ability to model these uncertainties. A better representation is a stochastic SIR model, which allows us to delve deeper into various realistic scenarios and reduce model errors.
It ignores the spatial dimensions of the outbreak.
SIR does not allow for branching of states (i.e. it assumes that the probability distributions are stationary over time and that the behavior of the virus does not mutate).
The SIR model assumes a random network model (by default, through its binomial and normal distribution assumptions), whereas it has been shown that disease epidemics rarely follow a random network and may instead exhibit scale-free network properties.

However, to quote a saying: “Models are our best effort and duties to perform when no model at all is an even worse proposition. This is despite that models come with errors, both in model errors and errors in the models”.

Best effort is to model by simulation

The best way left (for the time being) is to simulate various scenarios, given the broad parameters that we can learn from the data, using the SIR model as the base model. Simulations (instead of predictions) allow us, as data scientists, to present various scenarios as a general guide. Hopefully, public officials can then produce their own estimates, which may be critical to their duties under the current Covid crisis, as part of crisis management.

General range of estimates of model parameters for Malaysia:

Parameter	Range	Comments
\(R_o\) and \(R_E\)	1.5 to 7	Based on studies of Covid in China
\(SI\)	5 to 10 days	Based on general estimates from the SARS outbreak
\(l\)	7 to 21 days	Based on general assumptions about the infectious period of the disease
\(c\)	2 to 8 times per day	Based on small-world properties of human relations
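
One way to turn these ranges into scenarios is to sample from them repeatedly and look at the spread of an outcome of interest. The Python sketch below samples \(R\) and \(SI\) uniformly from the ranges in the table and converts each draw into a doubling time. Both the uniform sampling and the fixed-Serial-Interval relation \(R = e^{r \cdot SI}\) are assumptions made for illustration, and the function names are hypothetical.

```python
import math
import random

# Ranges copied from the table; sampling them uniformly is an assumption.
R_RANGE = (1.5, 7.0)
SI_RANGE = (5.0, 10.0)  # days


def sample_doubling_times(n_draws, seed=0):
    """Return a sorted list of doubling times (days), one per parameter draw."""
    rng = random.Random(seed)
    times = []
    for _ in range(n_draws):
        r_number = rng.uniform(*R_RANGE)
        si = rng.uniform(*SI_RANGE)
        # With R = exp(r * SI), the doubling time is SI * ln(2) / ln(R).
        times.append(si * math.log(2) / math.log(r_number))
    return sorted(times)


def quantile(sorted_values, q):
    """Simple empirical quantile by index (no interpolation)."""
    idx = min(len(sorted_values) - 1, int(q * len(sorted_values)))
    return sorted_values[idx]


if __name__ == "__main__":
    times = sample_doubling_times(10_000)
    print(quantile(times, 0.05), quantile(times, 0.5), quantile(times, 0.95))
```

Reporting a band of quantiles rather than a single number keeps the width of the input ranges visible in the output.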

What could we gain from the simulations (examples of possible problem statements):

We can estimate the probability that a given minimum period is required for the current MCO initiatives, that is, whether there is a strong need to extend the period. The simulations allow us to bootstrap the parameters, which in turn lets us track the outbreak as it progresses and look ahead (forecast) at near-future trends (days ahead).
We can estimate the measurable parameters, namely \(l\) and \(c\), through testing programs. Simulations will show us what level of testing is required to contain the outbreak and what minimum level of social distancing is required. In the case of testing, we could ascertain the minimum period (if it can be reduced) before the MCO can be lifted (i.e. the minimum threshold required). Furthermore, in the case of social distancing, we could simulate the maximum size and the methods of gatherings post-MCO, both immediately after the MCO is lifted and well into the future. This could be used as a guide for the public on the conditions for lifting the MCO.
We could run scenarios to measure the effectiveness of the MCO as it progresses, as well as of other measures to be imposed post-MCO.
We can develop many possible trajectories and feed them into the simulations, which allows us to measure or derive optimal limits under each trajectory.
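
The first of these problem statements can be sketched as a repeated simulation: run the stochastic SIR recursion many times under a given set of parameters and count how often the number of infected people at the end of the control period is lower than at the start. The Python code below is a toy version with a small hypothetical population; the function names, default arguments and all parameter values are assumptions made for illustration.

```python
import random


def binomial(n, p, rng):
    """Draw from Binomial(n, p) by summing n Bernoulli trials (fine for small n)."""
    return sum(1 for _ in range(n) if rng.random() < p)


def infected_after(n_pop, i0, alpha, beta, days, rng):
    """Run the chain-binomial SIR recursion and return I after `days` steps."""
    s, i, r = n_pop - i0, i0, 0
    for _ in range(days):
        s_new = binomial(s, (1.0 - alpha) ** i, rng)  # escape infection
        r += binomial(i, beta, rng)                   # recoveries
        s = s_new
        i = n_pop - s - r
    return i


def prob_outbreak_shrinks(alpha, beta, days, n_pop=300, i0=10, runs=100, seed=1):
    """Fraction of simulated paths in which I at the end is below its initial value."""
    rng = random.Random(seed)
    hits = sum(
        1 for _ in range(runs)
        if infected_after(n_pop, i0, alpha, beta, days, rng) < i0
    )
    return hits / runs


if __name__ == "__main__":
    # Hypothetical scenarios over a 14-day control period:
    # weaker versus stronger reduction of the per-contact infection probability.
    for alpha in (0.002, 0.0002):
        print(alpha, prob_outbreak_shrinks(alpha, beta=0.1, days=14))
```

Comparing the resulting fractions across values of \(\alpha\) and across control periods of different lengths gives a rough, scenario-based answer to whether a period is long enough, always conditional on the assumed inputs.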

Way forward

To move forward, data science can provide all the tools needed to perform the above-mentioned tasks and problem solving. However, the exercise depends critically on a) whatever data is available in the timeliest manner, b) the readiness of tools to perform the computations, c) the modelling of the problem statements, and d) the analysis of the results of the simulations.

For (a), we can rely only on publicly available data (which is insufficient, but given the situation it may be the best that we can do). For (b), we are ready to perform all the computations using the R programming language. (c) must be done by people with expert domain knowledge, which we do not possess. For (d), we can assist in interpreting the results from a data analytics point of view, though it also requires expert domain knowledge.

Covid Malaysia - The need for simulations exercise
