Predicting COVID-19 Cases

The project’s purpose is to predict the number of COVID-19 cases in South Korea with SIR modeling.

SIR modeling takes in dynamics of the COVID-19 outbreak as parameters and that is where the acronym is derived from. Those parameters are Susceptibles, Infected, and Recovered for a given time.

Data Set

We are using time series data from the GitHub datasets repository. The data has been cleaned and normalized for public use. The data is sourced from the team at Johns Hopkins University Center for Systems Science and Engineering (CSSE) who have been maintaining and collecting data from around the world.

Relevant Links:
https://github.com/datasets/covid-19
https://github.com/CSSEGISandData/COVID-19

# Loading Libraries
library('ggplot2')
library('deSolve')
library('EpiDynamics')
library('dplyr')

# Reading Data
df <- read.csv("time-series-19-covid-combined.csv", header = TRUE, sep =",")

Exploratory Data Analysis

We filtered the data for only South Korea and visualize the cumulative total of cases since January 22, 2020.

# Group the data by Country and Date
covidFilterData <- df %>%
  dplyr::filter(Country.Region %in% c("Korea, South")) %>%
  group_by(Date, Country.Region) %>%
  summarise(Confirmed = sum(Confirmed), Deaths = sum(Deaths))

# visualize confirmed cases by date
ggplot(data = covidFilterData, aes(x = as.Date(covidFilterData$Date, "%Y-%m-%d"), y = Confirmed, group = Country.Region)) +
  geom_line(aes(color = Country.Region)) +
  labs(x = 'Date', y = 'Cumulative Total') +
  geom_point(aes(color=Country.Region))+
  ggtitle("Covid-19 Confirmed Cases by Date")+
  scale_y_continuous(labels=scales::number_format(scale=.001, suffix="K")) +
  scale_x_date(date_labels="%m-%d", date_breaks="1 week", limits=as.Date(c('2020-01-22', '2020-04-17'))) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

jpeg

Compiling the SIR Model

We estimated South Korea’s population and compiled the SIR model to take time, state, and parameters (estimates of the avgContacts and recoverRate) as inputs. The SIR model estimates the number of cases of Susceptibles, Infected, and Recovered for a given time. avgContacts is the average number of people contacts per person and recoverRate is the recovery rate. We calculate avgContacts and recoverRate by minimizing the Residual Sum of Squares.

# South Korea population
p <- 51178579

SK <- subset(df, Country.Region == "Korea, South")
colnames(SK)[6] <- "cases"
SK$Date <- as.Date(SK$Date)

# SIR Model
SIR <- function(time, state, parameters) {
  par <- as.list(c(state, parameters))
  with(par, {
    dS <- -avgContacts/p * I * S
    dI <- avgContacts/p * I * S - recoverRate * I
    dR <- recoverRate * I
    list(c(dS, dI, dR))
    })
}

# Number of Confirmed COVID-19 Cases
infected <- as.integer(SK$cases)
time <- 1:length(infected)
init <- c(S = p-infected[1], I = infected[1], R = 0)

# Residual Sum of Squares
library(deSolve)
RSS <- function(parameters) {
  names(parameters) <- c("avgContacts", "recoverRate")
  out <- ode(y = init, times = time, func = SIR, parms = parameters)
  fit <- out[ , 3]
  sum((infected - fit)^2)
}

Training the SIR Model

Training the SIR model by using R’s optim function to optimize the parameters and finding estimates for avgContacts and recoverRate.

# Initializing the Optimizer
opt <- optim(c(0.5, 0.5), RSS, method = "L-BFGS-B", lower = c(0, 0), upper = c(1, 1))

# Estimating avgContacts and recoverRate Parameters
opt_par <- setNames(opt$par, c("avgContacts", "recoverRate"))

Fitting the Model and Graphing Predictions

We fit the model with our current data and predict values for 160 days. From the first graph below, it is estimated that the majority of the South Korean population, around 50 million, are susceptible until the 100 days mark. The number of susceptible people begin to decline after 100 days and the number of people in recovery begin to incline after 100 days.

The second graph visualizes the worst case scenario of the number of infected cases. The reason it is worst case scenario is it is still early on in the epidemic and there are still many preventative measures and treatments being evaluated. The second graph predicts a peak at 140 days of almost 1.2 million people being infected.

# T is Time in Days
t <- 1:160

# Fitting the Model
fit <- data.frame(ode(y = init, times = t, func = SIR, parms = opt_par))

ggplot(data = fit, aes(x = time)) + geom_line(aes(y=S, color="steelblue"))+geom_line(aes(y=I, color="red"))+geom_line(aes(y=R, color="black")) + labs(x = 'Day', y = 'Number of People')+ ggtitle("Predicted COVID-19 Cases")+scale_y_continuous(labels=scales::number_format(scale=.000001, suffix="M")) + scale_color_identity(name="", breaks=c("steelblue", "red", "black"), labels=c("Susceptible", "Infected", "Recovered")
,guide = "legend")

jpeg

ggplot(data = fit, aes(x = time))+geom_line(aes(y=I), color="red")+ labs(x = 'Day', y = 'Number of Infected Cases')+ggtitle("Predicted COVID-19 Infected Cases")+scale_y_continuous(labels=scales::number_format(scale=.000001, suffix="M"))

jpeg