Build Your Own COVID Prediction Models In Minutes — Without Writing Any Code

Monument
8 min read · Nov 14, 2020

A COVID active cases and deaths model for New Jersey, built in Monument.

COVID has dramatically altered all of our lives. The corresponding flurry of information — and sometimes misinformation — has made these changes more acute and harder to unpack.

To empower citizens, journalists, and policymakers, Monument has assembled easy-to-follow instructions below for building your own machine learning models of COVID’s spread, mortality rate, and other important factors.

We have laid out the instructions so that they are easy to modify for your own analyses. As always, if you have any questions, feel free to contact us at info@monument.ai.

Step 1: What we want to test

The specific questions we’re answering here are:

  1. What is the number of “active cases” we should expect in New Jersey tomorrow?
  2. What is the number of deaths we should expect in New Jersey tomorrow?

We will include COVID case information from states neighboring New Jersey to improve the accuracy of our model. This assumes that there is frequent travel in and out of New Jersey from Pennsylvania, New York, and Delaware — and that these travelers may contribute to New Jersey trends.

To build this model, we’re going to use a public dataset maintained by the Johns Hopkins University Center for Systems Science and Engineering (CSSE). The data is collected for CSSE’s Novel Coronavirus Visual Dashboard and is also published on GitHub.

At the end of this article, we offer some suggestions for how to adapt and extend this approach for other applications, but for now we’ll proceed with our research question for New Jersey.

Step 2: Import data

You can access the CSSE data here. Clicking the green “Code” button near the top of the screen lets you either fetch the data from the command line with git clone or simply download it as a zip file. Download the data into a directory on your computer. If you’ve downloaded the zip file, you’ll need to unzip it.

We’re particularly interested in the subdirectory inside the dataset called csse_covid_19_daily_reports_us. It contains one CSV of COVID data for each day since CSSE began tracking this information. We’re going to use this data to construct a time-series dataset to import into Monument and forecast future trends.

Because this data is not published as a time-series, we must transform it before importing it into Monument. We wrote a Python script that automates this process.

We pasted the entire script in the code block below. It was run on Python 3.9, but it should run on 3.7 and above. Paste it into a text editor like Notepad and save it with the file extension .py. You’ll then be able to execute the Python script and access the output file, which we will use in Monument.

Before proceeding, there are two places where you will need to edit the code:

  1. Near the beginning of the code, where the path variable is set. This tells the script where you have stored the raw CSSE data.
  2. At the very end, where we tell the script where to save the output file.

import glob
import os

import pandas as pd

myList = []

# Define the path to where you have saved the CSVs
path = "/path/to/your/repo/copy/COVID-19/csse_covid_19_data/csse_covid_19_daily_reports_us/*.csv"

# In this example, we're predicting New Jersey active cases based on data from New Jersey and neighboring states.
targetStates = ['New Jersey', 'Pennsylvania', 'Delaware', 'New York']
targetColumns = ['Province_State', 'Confirmed', 'Active', 'Deaths', 'Recovered', 'Incident_Rate', 'Hospitalization_Rate', 'Mortality_Rate', 'Testing_Rate']

for fname in glob.glob(path):
    df = pd.read_csv(fname)
    # Get the targetColumns for the targetStates
    targetID = [list(df['Province_State']).index(state) for state in targetStates]
    row = df.reindex(index=targetID, columns=targetColumns)
    # Pivot so that each state/metric pair becomes its own column
    row = (
        row.assign(idx=row.groupby('Province_State').cumcount())
        .pivot(index='idx', columns='Province_State')
    )
    # Rename the columns to "<State>_<Metric>", e.g. "New Jersey_Active"
    row.columns = [f'{y}_{x}' for x, y in row.columns]
    # Put the date in, derived from the CSV name
    dateFromFilename = os.path.basename(fname).replace('.csv', '')
    row['Date'] = dateFromFilename
    myList.append(row)

# Combine the per-day rows into a single table
concatList = pd.concat(myList, sort=True)

# Define where you want to save the output CSV
concatList.to_csv('/path/to/output/file.csv', index=False, header=True)
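
To make the output shape concrete, here is a minimal sketch that applies the same idea (one row per date, one column per state-and-metric pair) to a toy daily report. The `flatten_day` helper and all of the numbers are invented for illustration; it is a simplified stand-in for the pivot step, not a replacement for the script.

```python
import pandas as pd

def flatten_day(day_df, states, value_columns, date):
    """Collapse one daily report into a single wide record (hypothetical helper)."""
    indexed = day_df.set_index("Province_State")
    record = {"Date": date}
    for state in states:
        for col in value_columns:
            # Column names follow the "<State>_<Metric>" pattern used by the script
            record[f"{state}_{col}"] = indexed.at[state, col]
    return record

# Toy daily report with invented numbers
toy_day = pd.DataFrame({
    "Province_State": ["Delaware", "New Jersey", "Texas"],
    "Active": [5, 10, 99],
    "Deaths": [1, 2, 9],
})

record = flatten_day(toy_day, ["New Jersey", "Delaware"], ["Active", "Deaths"], "11-12-2020")
wide = pd.DataFrame([record])
print(wide.columns.tolist())
```

Each daily file contributes one such row, and stacking the rows gives the time series that Monument reads.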

After you’ve run the above script, open Monument to start building your models. You can download a free trial of Monument on our homepage.

Note: You may need to open the CSV in Excel and sort by the “Date” column to make sure the rows are in chronological order. Also, if any column starts or ends with null values, you will have to replace those leading or trailing nulls with 0 or a similar placeholder. Monument will soon introduce functionality to handle these situations, but for now they must be addressed before importing the data.
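
If you would rather do this cleanup in Python than in Excel, a sketch along the following lines should work. It assumes the “Date” values follow the MM-DD-YYYY pattern of the CSSE filenames; `prepare_for_import` is a hypothetical helper name, and the toy values are invented.

```python
import pandas as pd

def prepare_for_import(df, date_column="Date"):
    """Sort rows chronologically and zero-fill leading/trailing gaps (hypothetical helper)."""
    out = df.copy()
    # CSSE daily report filenames use the MM-DD-YYYY pattern
    out[date_column] = pd.to_datetime(out[date_column], format="%m-%d-%Y")
    out = out.sort_values(date_column).reset_index(drop=True)
    for col in out.columns.drop(date_column):
        series = out[col]
        leading = series.ffill().isna()   # nothing valid has appeared yet
        trailing = series.bfill().isna()  # nothing valid appears afterwards
        # Only the leading and trailing gaps are replaced; interior gaps are left alone
        out[col] = series.mask(leading | trailing, 0)
    return out

# Toy, unsorted data with invented values
raw = pd.DataFrame({
    "Date": ["01-02-2021", "12-31-2020", "01-01-2021", "01-03-2021"],
    "New Jersey_Active": [float("nan"), float("nan"), 5.0, 7.0],
    "New Jersey_Deaths": [3.0, 1.0, 2.0, float("nan")],
})
clean = prepare_for_import(raw)
print(clean)
```

Parsing the dates before sorting matters because MM-DD-YYYY strings do not sort chronologically once the data spans more than one calendar year.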

Step 3: Building models

When you open Monument, you’ll want to click “New Project” on the welcome screen. This will bring you into a new workspace with an empty “INPUT” node. We need to load the CSSE data into this node by clicking the “CSV” button below the node and locating the output file from our Python script.

The screen will look like this:

The INPUT node workspace while importing the CSSE data.

After clicking “OK” in the bottom right-hand corner of the screen, Monument will automatically create a “MODEL BUILDING” node. This is a workspace that allows us to chart the data and apply machine learning algorithms by dragging and dropping them on the data.

First, let’s drag the “Date” pill into the “COLS (X)” area of the chart. Then, let’s drag the pill called “New Jersey_Active” into the “ROWS (Y)” area. You’ll end up with a chart like the one below.

Our first chart: New Jersey daily active cases over time.

After you have charted the data pills, you’ll notice that Monument has detected that you have loaded time-series data and shows you a list of algorithm pills that you can apply to construct models.

The algorithms are ordered from least complex to most complex. It’s best practice to start with the simplest models. If a simple model captures the trend, it is usually superior. You can read our post on “overfitting” for more background on why this is the case.

The LinReg (Linear Regression) algorithm does a decent job of capturing the trend, but it does lag the actual values a bit in the training period. You can see this clearly if you use the slider on the bottom of the chart to zoom in.

Applying LinReg (Linear Regression).

The ARIMA algorithm does a much better job, as you can see in the screenshot below. (Hint: you can hide data you want to ignore by clicking the data’s chart label at the top of the chart.)

ARIMA’s prediction.

Let’s also adjust the algorithm parameters to something more sensible for the context. We can do this by clicking the drop-down arrow that we see when hovering over the algorithm pill and clicking “Parameters.”

The obvious place to start is the Lookback Period, which we will change from 10 to 7. The Lookback Period tells the algorithm how many past periods to consider when generating the next forward prediction. Most of us live our lives according to standard 7-day cycles of 5 work days and 2 weekend days. Particularly as many people emerge from quarantine and gradually socialize more on weekends, it is likely that these social activities affect infection rates in a cyclical manner.
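
To illustrate what a lookback window means in general terms, the sketch below turns a series into rows of 7 consecutive values, each paired with the value that follows, and fits an ordinary least-squares model on them. This is a generic autoregressive example under our own assumptions, not a description of how Monument implements its algorithms; the function names and the toy series are invented.

```python
import numpy as np

def make_lagged(series, lookback):
    """Build (X, y): each row of X holds `lookback` consecutive values, y is the value that follows."""
    values = np.asarray(series, dtype=float)
    X = np.array([values[i:i + lookback] for i in range(len(values) - lookback)])
    y = values[lookback:]
    return X, y

def fit_and_forecast(series, lookback):
    """Fit least squares on the lagged windows and predict one step past the end of the series."""
    X, y = make_lagged(series, lookback)
    X1 = np.column_stack([X, np.ones(len(X))])  # append an intercept column
    coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
    last_window = np.append(np.asarray(series, dtype=float)[-lookback:], 1.0)
    return float(last_window @ coef)

# Invented series that repeats every 7 days
weekly = [1, 2, 3, 4, 5, 9, 8] * 6
X, y = make_lagged(weekly, lookback=7)
print(X.shape, y.shape)  # (35, 7) (35,)
forecast = fit_and_forecast(weekly, lookback=7)
print(round(forecast, 6))
```

With a lookback of 7, every training row contains exactly one full weekly cycle, which is why a window of that length can pick up day-of-week effects.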

Indeed, when we change the Lookback Period to 7, the Validation Error Rate for ARIMA drops from 12.78 to 12.76. The lower the Validation Error Rate, the more accurate the model, so this is a small improvement.

Another parameter you might want to adjust is the Lookahead Period. By default, Monument forecasts one period forward. We do this because predictions of the near future are generally more reliable than predictions that look farther ahead.

You can experiment with different algorithms and different combinations of “Independents,” short for “Independent Variables.” Use your intuition about which Independents are more likely to contribute to the predictive power of the algorithm. (As an extreme example, you could find and include a dataset on the number of times the Beatles were played on the radio each day and provide this to an algorithm. The algorithm would attempt to use this data, but it would not likely improve your model!)
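
One rough way to sanity-check your intuition outside Monument is to see how strongly yesterday’s value of each candidate tracks today’s value of the target. The sketch below ranks candidates by absolute Pearson correlation; `rank_independents` is a hypothetical helper and the numbers are invented. Because cumulative counts all trend upward, a high score here is only a hint, not evidence that a column will help.

```python
import pandas as pd

def rank_independents(df, target, candidates, lag=1):
    """Rank candidates by |correlation| between their value `lag` rows earlier and the target."""
    scores = {}
    for col in candidates:
        scores[col] = abs(df[target].corr(df[col].shift(lag)))
    return pd.Series(scores).sort_values(ascending=False)

# Toy data with invented values, already sorted by date
toy = pd.DataFrame({
    "New Jersey_Active": [2, 4, 6, 8, 10, 12],
    "New York_Active": [1, 2, 3, 4, 5, 6],      # moves in step with the target
    "Beatles_Radio_Plays": [3, 1, 4, 1, 5, 9],  # unrelated series
})
ranking = rank_independents(toy, "New Jersey_Active", ["Beatles_Radio_Plays", "New York_Active"])
print(ranking)
```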

BEFORE: The mLSTM (Multi-Layer LSTM) before adjusting the Independent Variables.
Adjusting the Independent Variables.
AFTER: The mLSTM after adjusting the Independent Variables.

You can also drag in “New Jersey_Deaths” (or any other data pill that you want to forecast!) and apply algorithms to it as well. Different algorithms will be more suitable for different datasets.

We’ve added New Jersey COVID deaths to the chart. It is the blue line near the bottom.

As you experiment, pay attention to how different combinations of algorithms, parameters, and independents affect the Validation Error Rate. Again, the rule of thumb is: the lower the Validation Error Rate, the more accurate the model.
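
This article does not spell out how Monument computes the Validation Error Rate, so the following is only an illustration of the general idea: hold out the most recent part of the series, forecast it, and measure the average percentage error. The metric used here (mean absolute percentage error), the 80/20 split, and the naive “tomorrow equals today” forecast are all assumptions chosen for simplicity.

```python
def chronological_split(values, validation_fraction=0.2):
    """Keep the earliest values for training and the latest for validation."""
    cut = len(values) - int(len(values) * validation_fraction)
    return values[:cut], values[cut:]

def mean_absolute_percentage_error(actual, predicted):
    """Average of |actual - predicted| / |actual|, expressed as a percentage."""
    pairs = list(zip(actual, predicted))
    return 100 * sum(abs(a - p) / abs(a) for a, p in pairs) / len(pairs)

# Invented series of daily counts
series = [100, 110, 120, 130, 140, 150, 160, 170, 180, 200]
train, validation = chronological_split(series)

# Naive baseline: each validation day is predicted to equal the previous day's value
previous_days = [train[-1]] + validation[:-1]
error = mean_absolute_percentage_error(validation, previous_days)
print(round(error, 2))
```

A baseline like this is a useful yardstick: a model is only worth keeping if its validation error beats what you get from simply repeating yesterday’s number.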

You can see the values produced by changing the View Style to “grid.” You can also export the results as a CSV by clicking the OUTPUT node in the pipeline.

Selecting “grid” in the View Style menu.

Our quick models for active cases and deaths used data up to November 12, 2020:

  • Our LSTM model for active cases predicted 213,522 and the actual was 214,809. We were within 1,287 cases, or about 0.6%, of the actual.
  • Our LinReg model for deaths predicted 16,497 deaths, while the actual was 16,522. We were within 25 deaths, or about 0.2%, of the actual.

Both of these results are excellent for a model constructed in the space of a few minutes, and they could likely be improved with further adjustments to the parameters and independents.
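
The error figures can be checked with a few lines of arithmetic. The `forecast_error` helper below is just an illustrative name; the inputs are the predicted and actual values quoted above.

```python
def forecast_error(predicted, actual):
    """Return the absolute error and the error as a percentage of the actual value."""
    absolute = abs(actual - predicted)
    percent = 100 * absolute / actual
    return absolute, percent

cases_abs, cases_pct = forecast_error(213_522, 214_809)
deaths_abs, deaths_pct = forecast_error(16_497, 16_522)

print(cases_abs, round(cases_pct, 1))    # 1287 0.6
print(deaths_abs, round(deaths_pct, 1))  # 25 0.2
```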

You can see the actual values reported in the CSSE repository — or on your local copy.

Conclusion

This has been an introduction to using CSSE data to build a COVID model for New Jersey, using both New Jersey data and data from neighboring states.

There are many ways to adapt and extend the model, including:

  • Pick a new “target” state and a new set of “neighboring” states.
  • Expand the meaning of “neighboring” to take into account common travel corridors (e.g. include DC and California in a model for New York).
  • Add a column to the input dataset that computes the day-on-day rate of active case growth.
  • Use more granular data to predict infection rates at the county or neighborhood level.
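
As an example of the growth-rate idea, the sketch below adds a day-on-day growth column with pandas. It assumes the rows are already sorted by date; `add_growth_rate` and the “_Growth” suffix are hypothetical names, and the values are invented.

```python
import pandas as pd

def add_growth_rate(df, column, new_column=None):
    """Add a column holding (today / yesterday) - 1 for `column` (hypothetical helper)."""
    out = df.copy()
    new_column = new_column or f"{column}_Growth"
    out[new_column] = out[column] / out[column].shift(1) - 1
    return out

# Toy data with invented values, sorted by date
toy = pd.DataFrame({
    "Date": ["11-10-2020", "11-11-2020", "11-12-2020"],
    "New Jersey_Active": [100.0, 110.0, 121.0],
})
with_growth = add_growth_rate(toy, "New Jersey_Active")
print(with_growth)
```

The first row has no previous day, so its growth value is empty. That is exactly the kind of leading null that needs to be replaced before importing the file into Monument.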

You could also forecast any of the other columns we collected with the Python script, including hospitalization rate, case-fatality ratio, and recovered cases.

To work with the latest data, update your local copy of the CSSE repository each day (with git pull, or by downloading the zip file again) before building your models.

We hope you learned a bit about how to use Monument and feel empowered to build your own COVID models.

Interested in learning more about Monument? Book a free introductory Zoom call here.
