Covid Malaysia - Perspectives from Data Science

Why data science skills matter for crisis management, such as the Covid epidemic

Dr. Wan M Hasni, Chief Data Scientist (Techna Analytics Sdn. Bhd., Malaysia)



Covid-19 started as a remote case on December 8, 2019, when a few individuals were found to be infected with an “unusual case” of pneumonia in Wuhan city, Hubei Province, PRC. Wuhan is a city of 11 million people which, most importantly, sits at the center of mainland China (roughly equidistant between north and south, and east and west). A second factor of major importance is the Chinese Lunar New Year celebration from January 25th to February 8th, when people in China travel home to celebrate and thereafter disperse back to their workplaces. WHO issued its first major warning on January 30th, declaring a Public Health Emergency of International Concern, a date well within the holiday season. On 11 February 2020, WHO announced Covid-19 as the name of the new coronavirus disease, and finally, on 11 March 2020, WHO declared Covid-19 a global pandemic.

As Data Scientists, our tasks are as follows: to understand the problem at hand; to find and obtain data relevant to the problem; to search for and apply the right model to the data; to provide predictions, based on the data and the models, that guide decision-makers; and to provide tools for continuous monitoring of the problem using timely data, by i) updating the models and their predictions as more data stream in, and ii) providing predictions and metrics for monitoring the causes and effects related to the problems and their solutions.

Covid Problem Statements from data science perspectives

Covid-19, like any disease with the potential to break out and spread, is called a “spreading phenomenon” in Data Science, under a sub-field called Network Science. The focus is on how to model outbreaks, epidemics, and pandemics using domain knowledge (epidemiology) and network science, to be confirmed empirically with data. The modeling and empirics could, to some degree, rely on past data and experience (from previous cases, such as SARS) and on the digital data available to researchers.

The main problem statement is: “how could the disease spread, and how can this spreading phenomenon be contained?” Since it is a spreading phenomenon, the objective is to develop and undertake policies which ensure that the phenomenon is understood and that containment procedures are executed with high effectiveness. Since no two outbreaks are ever the same, several things follow: there are severe limitations on what can be learned from past experience (even though it is relevant); effective methods of containment also introduce new errors into the models; externalities in any model are potentially very high (and hence need to be taken into consideration); and learning as we go becomes an important tool of analysis (i.e. as the process takes place, new findings and data have to be factored into the models and the predictions produced).

Characteristics of the Covid problem

The problem of the spread of Covid-19 could be characterized as a) a complicated problem, since how the virus itself behaves is a complicated subject (in medical knowledge and science); b) a complex problem, since many causalities are tied to human and physical networks shaped by human behavior; and c) a dynamic problem, since both the complications and the complexities are interrelated through the dimensions of agents’ behaviors (humans in various capacities), time (as time elapses), and space (as locations are linked). Given these facts, modeling the spreading problem is even more challenging, and modeling all these facets in a single large model is almost impossible (using currently available science).

In short, we could categorize the problem as a “Black Swan” event - one whose occurrence is rare, and yet whose consequences could be devastating. Furthermore, the problem cuts across domains - from medicine (epidemiology), to economics (financial implications), to society (human behaviors), and politics (public policies and decision making).

What could data science contribute?

First and foremost is to understand the limitations of what Data Science could do, and secondly, to know what we could do to contribute positively to the problem-solving process. For this, Data Science could rely on tools that have proven effective in the past on similar problems, in particular methods that show potential for understanding the problem from “network perspectives”.

There are three forms of networks at play: a) the biological network - the behavior of the pathogens involved in a networked setting; b) the social and human network - how human social networks behave under such (biological) circumstances; and c) the digital network - how we could use digital network information and data to help trace these networks.

Furthermore, we need to rely on epidemic modeling, which is cast in two fundamental hypotheses:

  1. Compartmentalization - that is, classifying individuals into three (or more) states: i) Susceptible (healthy individuals who have not yet contacted the pathogen), ii) Infectious (individuals who have already contacted the pathogen and hence could infect others), and iii) Recovered (or terminal) (individuals who have been infected but have recovered or died; in either case, they are no longer Infectious).
  2. Level of homogeneity of mixing - homogeneous versus heterogeneous mixing. Homogeneous mixing assumes that each individual has the same chance of coming into contact with an Infectious individual. This assumption is about the underlying probability model the researcher assumes.

Under these scenarios, there are three broad models that a spreading disease could fall into:

  1. Susceptible-Infectious (SI) model
  2. Susceptible-Infected-Susceptible (SIS) model
  3. Susceptible-Infected-Recovered (SIR) model
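To make the compartment idea concrete, here is a minimal sketch of the SIR model under homogeneous mixing, integrated with a simple forward-Euler step. The function name `simulate_sir`, the symbols `beta` (transmission rate) and `mu` (recovery rate), and the numeric values are illustrative assumptions; they are not fitted to any Covid data.

```python
def simulate_sir(beta, mu, s0, i0, r0, days, dt=0.1):
    """Integrate the homogeneous-mixing SIR equations with forward Euler.

    beta: transmission rate per day (assumed value, not fitted to data)
    mu:   recovery rate per day (assumed value, not fitted to data)
    s0, i0, r0: initial fractions of the population in each compartment
    Returns a list of (day, s, i, r) tuples, one entry per whole day.
    """
    s, i, r = s0, i0, r0
    steps_per_day = round(1 / dt)
    history = [(0, s, i, r)]
    for day in range(1, days + 1):
        for _ in range(steps_per_day):
            new_infections = beta * s * i * dt  # Susceptible -> Infectious
            new_recoveries = mu * i * dt        # Infectious -> Recovered
            s -= new_infections
            i += new_infections - new_recoveries
            r += new_recoveries
        history.append((day, s, i, r))
    return history


# Hypothetical scenario: one person in a thousand is infectious on day 0.
history = simulate_sir(beta=0.3, mu=0.1, s0=0.999, i0=0.001, r0=0.0, days=200)
peak_day, _, peak_i, _ = max(history, key=lambda row: row[2])
print(f"peak infectious fraction {peak_i:.3f} on day {peak_day}")
print(f"final recovered fraction {history[-1][3]:.3f}")
```

Setting `mu=0.0` in the same function removes recovery altogether, which reduces the sketch to the SI model; the SIS model would instead return recovered individuals to the Susceptible compartment.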

Without going into the details of these three models, let’s go directly to the model parameters and understand the significance of each of them.

Network model parameters and predictors

What parameters are important, and how could we rank them by importance and by how they relate to one another? This is the subject of this section.

First and foremost is the underlying probability structure of the network: is it a random network (which is extremely unlikely) or an exponential network (which is highly likely)? In game-theoretic decision making, establishing the probability structure of the network is the first important step.

Assuming (as available evidence shows) that the network is exponential, what type of exponent does it follow? Does it follow a Poisson process (or any Gaussian process), or does it follow another class of non-Gaussian process? The general view (based on past evidence) is that the process, like most complex networks, follows what is called a “Scale-Free Network”, whose underlying probability distribution follows a power law. In such a case, our task is to empirically estimate the degree exponent of the power law (called \(\gamma\)).
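As a rough illustration of what “estimating \(\gamma\)” can look like, the sketch below applies the standard maximum-likelihood estimator for a continuous power law to synthetic data. The helper names `sample_power_law` and `estimate_gamma`, the cut-off `k_min`, and the synthetic sample are all assumptions made for this example. Real degree data are discrete, and choosing `k_min` needs care, so this is a starting point rather than a complete procedure.

```python
import math
import random


def sample_power_law(n, gamma, k_min, rng):
    """Draw n values from a continuous power law p(k) ~ k^(-gamma), k >= k_min,
    using inverse-transform sampling."""
    return [k_min * (1.0 - rng.random()) ** (-1.0 / (gamma - 1.0)) for _ in range(n)]


def estimate_gamma(values, k_min):
    """Maximum-likelihood estimate of the power-law exponent.

    Only values >= k_min are used. Returns (gamma_hat, approximate standard error).
    """
    tail = [k for k in values if k >= k_min]
    log_sum = sum(math.log(k / k_min) for k in tail)
    gamma_hat = 1.0 + len(tail) / log_sum
    std_error = (gamma_hat - 1.0) / math.sqrt(len(tail))
    return gamma_hat, std_error


# Synthetic check: generate data with a known exponent, then try to recover it.
rng = random.Random(2020)
synthetic = sample_power_law(n=20000, gamma=2.5, k_min=1.0, rng=rng)
gamma_hat, std_error = estimate_gamma(synthetic, k_min=1.0)
print(f"estimated gamma = {gamma_hat:.3f} +/- {std_error:.3f}")
```

Recovering a known exponent from synthetic data is a useful sanity check before applying the estimator to real contact or mobility data.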

What does this mean? It means that if we could compute the exponent of this power-law distribution, it would provide a single measure to classify the many nodes (vertices) in the network. That, in turn, would allow decision-makers to quickly identify the nodes that matter when implementing any of the SI, SIS, and SIR models described above.
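The sketch below illustrates why the network’s structure matters for identifying important nodes. It compares the degrees of a random network with those of a network grown by preferential attachment (the Barabási-Albert mechanism, a common way to generate scale-free networks). The choice of this generator, the function names, and the sizes `n` and `m` are assumptions for illustration only; nothing here is derived from real contact data.

```python
import random
from collections import Counter


def random_network_degrees(n, n_edges, rng):
    """Degrees of a random network: each edge joins two nodes picked uniformly."""
    degree = Counter()
    edges = set()
    while len(edges) < n_edges:
        a, b = rng.sample(range(n), 2)
        edge = (min(a, b), max(a, b))
        if edge not in edges:
            edges.add(edge)
            degree[a] += 1
            degree[b] += 1
    return [degree[node] for node in range(n)]


def preferential_attachment_degrees(n, m, rng):
    """Degrees of a growing network where newcomers prefer well-connected nodes."""
    # Start from a small, fully connected core of m + 1 nodes.
    degree = [m] * (m + 1)
    # 'stubs' lists each node once per link end, so a uniform draw from it
    # picks a node with probability proportional to its current degree.
    stubs = [node for node in range(m + 1) for _ in range(m)]
    for new_node in range(m + 1, n):
        targets = set()
        while len(targets) < m:
            targets.add(rng.choice(stubs))
        degree.append(m)
        for target in targets:
            degree[target] += 1
            stubs.extend([new_node, target])
    return degree


rng = random.Random(42)
n, m = 5000, 2
n_edges = m * (m + 1) // 2 + m * (n - m - 1)  # same edge count in both networks
random_deg = random_network_degrees(n, n_edges, rng)
hub_deg = preferential_attachment_degrees(n, m, rng)

print("largest degree, random network:", max(random_deg))
print("largest degree, preferential attachment:", max(hub_deg))
# Rank nodes by degree to list candidate hubs.
top_hubs = sorted(range(n), key=lambda node: hub_deg[node], reverse=True)[:5]
print("five most connected nodes:", top_hubs)
```

Both networks have the same number of nodes and links, yet the preferential-attachment network concentrates many links on a few hubs. Those hubs are the kind of nodes a decision-maker would want to identify first.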

To be useful, the model should be able to answer many lay questions. A few examples:

  1. What type of gathering could be deemed risky (i.e. above threshold) in terms of the number of people and location? Are 50 people too many or too few, say in a cafe in Ampang? (answering specific questions)
  2. What types of self-containment are effective? (answering medical questions)
  3. Where and what are the choke points in the health care system that require immediate and close attention? Should a fully contained center be established with all the medical procedures in place? (answering public health questions)
  4. What types of activities are highly risky? (answering economic questions)
  5. How should companies and businesses react with regard to work and the workplace? (microeconomic questions)
  6. How should businesses adjust their business plans? (macroeconomic questions)
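As a deliberately simplified illustration of how the first question could be turned into a number, the sketch below computes the chance that a gathering includes at least one infectious person. It assumes attendees are independent draws from a population with a given prevalence, and it ignores network structure entirely. The prevalence value is hypothetical, and the function name is introduced only for this example.

```python
def prob_at_least_one_infectious(group_size, prevalence):
    """Chance that a gathering includes at least one infectious person,
    assuming attendees are independent draws from the population."""
    return 1.0 - (1.0 - prevalence) ** group_size


# Hypothetical prevalence: 1 infectious person per 1,000 (not an estimate).
for size in (10, 50, 200):
    risk = prob_at_least_one_infectious(size, prevalence=0.001)
    print(f"{size:>4} people: {risk:.1%} chance of at least one infectious attendee")
```

A network model would refine this by accounting for who actually meets whom, but even this crude calculation shows how a threshold on gathering size can be tied to an explicit risk level.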

The key point is that a coherent and comprehensive network model allows definitive, actionable answers to some questions. Actionable insight from a model based on data and a scientific approach is extremely important from the point of view of public policy and administration.

Current State of Affairs

Promising uses of AI and analytics have shown some initial success in South Korea [1] and China [2]. Similarly, network science expertise from Northeastern University is being used (among other expertise) by the United States Centers for Disease Control and Prevention [3]. All of these, among others, point to the fact that data science is becoming a potent tool for epidemics and disease-spreading problems. Likewise, efforts to develop the Global Assessment Report for risk and disaster management are known to have used tools from Data Science extensively.

Based on our observations, the current state of affairs in Malaysia suggests that: [4]

  1. Decision-makers do not have sufficient knowledge of the problem (evident from the blame games).
  2. The public is poorly informed (evident from the many panicked and irrational reactions).
  3. Healthcare systems are not ready (or not prepared) to deal with a Black Swan event of this nature.
  4. We are still far from ready to model and use scientific tools, and in particular data science tools, for problem-solving.

Globally, there are some cases that offer lessons:

  1. China did a very good job of containment and of dealing with the issues (so far).
  2. The World Health Organization (WHO) has done a tremendous job, despite the many limitations it has.
  3. Sharing of knowledge and information (as well as modeling) across nations, i.e. cross-border coordination, is important; it is still very low and needs much improvement.
  4. Common knowledge and the sharing of data and information across agencies are extremely useful and helpful.

Furthermore, in the case of Malaysia, Disaster Management using Data Science is a very important field, one that requires immediate focus and an up-skilling exercise for the country.

In this series of articles, we will present some of our work on analyzing Covid in Malaysia using tools from Data Science, as part of our effort to educate ourselves about the problem and to share information with interested members of the public.

We welcome comments and views, criticism, and input from any parties sharing the same concerns as us.




  [4] These comments reflect the personal view of the author, based on casual observation. Without further facts and studies, we can neither confirm nor qualify these views.


Text and figures are licensed under Creative Commons Attribution CC BY-SA 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".