About the data:
The CRS-19 model is built on a sample of over 52 thousand cases in Poland with a positive PCR test for Covid-19. The surveillance data were obtained from NIZP-PZH on November 9th, 2020. The raw data contain 51 variables for 55 950 cases collected between 21/Feb/2020 and 04/Nov/2020. Cases with a very short observation time and cases with a large amount of missing data were removed, leaving 52 580 cases that are used for modelling.
Note that we observe only a fraction of the population of infected cases. First, it is the fraction visible to the health system: undiagnosed cases are not in this database. Second, it is a subsample that covers the period from March to October with a close to uniform throughput of around 220 cases per day. Since a roughly constant number of cases per day is sampled regardless of how many cases were confirmed on that day, we have better coverage of cases from the first wave and from the summer than of cases from the second wave.
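The throughput figure can be checked with simple arithmetic. The Python sketch below is only an illustration (the modelling itself was done in R) and assumes that "from March till October" means 1 March to 31 October 2020 inclusive; the case counts are the ones quoted above.

```python
from datetime import date

# Case counts quoted in the text.
RAW_CASES = 55_950
MODELLING_CASES = 52_580

# Assumption: the sampled period runs from 1 March to 31 October 2020, inclusive.
START, END = date(2020, 3, 1), date(2020, 10, 31)


def cases_per_day(n_cases, start=START, end=END):
    """Average number of sampled cases per calendar day in [start, end]."""
    n_days = (end - start).days + 1
    return n_cases / n_days


n_days = (END - START).days + 1
print(f"period length: {n_days} days")
print(f"modelling sample: {cases_per_day(MODELLING_CASES):.1f} cases per day")
print(f"raw data:         {cases_per_day(RAW_CASES):.1f} cases per day")
```

Under this assumption both averages fall within about 10 cases per day of the quoted figure of around 220.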
The data were checked for the following biases: spatial (voivodeship level), gender, and age distribution. In none of these did we observe significant deviations from the official statistics for all infected cases published by the ministry (details in the report).
About the features:
From all available features, which are sociodemographic (age, gender, city, ...), related to symptoms (cough, fever, loss of taste, …), or related to medical conditions (comorbidities), we selected those that have high predictive power, are consistent with domain knowledge (validated against literature studies), and do not condition on the future (all symptoms were removed for this reason). Note that for a specific purpose (such as monitoring after infection) other features may be more suitable.
Also note that some features are correlated (like age and comorbidities), which poses some challenges during modelling. Some feature engineering was performed to merge comorbidities with low frequency into a single group.
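Merging low-frequency comorbidities can be sketched as follows. This is a minimal Python illustration, not the authors' R code: the column names, the toy records, and the 50% threshold are all hypothetical, and the actual cut-off used for the model is not stated here.

```python
def merge_rare_flags(records, flags, min_share, merged_name="other_comorbidities"):
    """Collapse binary flags whose prevalence is below min_share into one flag.

    records: list of dicts with 0/1 values for each name in flags.
    Returns the transformed records and the list of flags that were merged.
    """
    if not records:
        return [], []
    n = len(records)
    # Prevalence of each flag in the sample.
    share = {f: sum(1 for r in records if r.get(f)) / n for f in flags}
    rare = [f for f in flags if share[f] < min_share]
    merged = []
    for r in records:
        out = {k: v for k, v in r.items() if k not in rare}
        # The merged flag is 1 if the patient has any of the rare conditions.
        out[merged_name] = int(any(r.get(f) for f in rare))
        merged.append(out)
    return merged, rare


# Hypothetical toy data, for illustration only.
toy = [
    {"age": 71, "diabetes": 1, "lung_disease": 0, "liver_disease": 0},
    {"age": 45, "diabetes": 0, "lung_disease": 1, "liver_disease": 0},
    {"age": 63, "diabetes": 1, "lung_disease": 0, "liver_disease": 0},
    {"age": 80, "diabetes": 1, "lung_disease": 0, "liver_disease": 1},
]
merged, rare = merge_rare_flags(
    toy, ["diabetes", "lung_disease", "liver_disease"], min_share=0.5
)
print(rare)
print(merged[1])
```

Grouping rare conditions this way trades detail for stability: each remaining feature is observed often enough for the model to estimate its effect.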
About the model:
Note that this is a statistical summary of Covid-related mortality, expressed as the case fatality rate (CFR), for past cases. We do not claim any generalisation to future mortality, which may be affected by many different factors (the most obvious one is the condition of the health system). The purpose of this app is to show that some groups of patients should be strongly protected because their risk is very high (older than 60 years, with cardiovascular diseases). The model is a tree boosting model (xgboost library) with 150 rounds, trees of depth up to 5, and the binary:logistic objective. These three hyperparameters were chosen by hand to make the results as consistent as possible with literature studies (the model should include all variables and possible interactions).
The variables included in the model are: age, gender, kidney disease, cardiovascular disease, diabetes, cancer and other comorbidities (like lung diseases) with enforced monotonicity constraints for all variables except gender.
The model natively takes into account the interactions between these variables.
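The configuration described above can be written down as an xgboost parameter set. The Python sketch below only builds the parameters and does not train anything. The feature names and their order are hypothetical, and the direction of the constraints (+1, meaning predicted risk does not decrease as the feature increases) is an assumption; the text states only that monotonicity is enforced for all variables except gender.

```python
# Hypothetical feature names, in the order they would appear in the training matrix.
features = [
    "age",
    "gender",
    "kidney_disease",
    "cardiovascular_disease",
    "diabetes",
    "cancer",
    "other_comorbidities",
]
unconstrained = {"gender"}

# Assumption: +1 (non-decreasing risk) for every constrained feature, 0 for no constraint.
constraints = [0 if name in unconstrained else 1 for name in features]

params = {
    "objective": "binary:logistic",  # stated in the text
    "max_depth": 5,                  # stated in the text
    "monotone_constraints": "(" + ",".join(str(c) for c in constraints) + ")",
}
num_boost_round = 150                # stated in the text

print(params)
# With the xgboost Python package installed and a DMatrix `dtrain` available,
# these would be passed as: xgboost.train(params, dtrain, num_boost_round=num_boost_round)
```

The constraint string must follow the column order of the training data, which is why it is derived from the feature list rather than typed by hand.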
The relationships identified by the tree boosting model have been confirmed by logistic regression with splines (rms library) and random forest (which gives predictions with smaller variability but larger bias).
About the explanations:
Model explanations are created using the DALEX library.
The waterfall plot is created with the Break-down algorithm, which is similar to SHAP values, but here BD is preferred in order to identify interactions.
The application is based on precalculated scores and is a simplified version of the Arena dashboard, which runs purely on the client side and is developed in Vue.js.
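The idea behind Break-down can be sketched in a few lines. The Python code below is not the DALEX implementation: it uses a hypothetical toy risk function and toy background data, and it takes the variable order as given, whereas the library chooses an order itself. Variables are fixed one by one to the values of the explained patient, and each contribution is the resulting change in the mean prediction over the background data.

```python
import math


def toy_risk(x):
    """Hypothetical risk model with an age x cardiovascular interaction."""
    z = -6.0 + 0.06 * x["age"] + 0.8 * x["cardio"] + 0.02 * x["age"] * x["cardio"]
    return 1.0 / (1.0 + math.exp(-z))


def mean_prediction(model, data):
    return sum(model(row) for row in data) / len(data)


def break_down(model, data, observation, order):
    """Fixed-order Break-down: returns the baseline and per-variable contributions."""
    current = [dict(row) for row in data]  # copy, so the background data stay untouched
    baseline = mean_prediction(model, current)
    contributions = {}
    previous = baseline
    for var in order:
        for row in current:
            row[var] = observation[var]    # fix this variable for every background row
        updated = mean_prediction(model, current)
        contributions[var] = updated - previous
        previous = updated
    return baseline, contributions


# Hypothetical background data and patient, for illustration only.
background = [
    {"age": 30, "cardio": 0},
    {"age": 50, "cardio": 0},
    {"age": 70, "cardio": 1},
    {"age": 40, "cardio": 0},
]
patient = {"age": 80, "cardio": 1}

base_a, contrib_a = break_down(toy_risk, background, patient, ["age", "cardio"])
base_b, contrib_b = break_down(toy_risk, background, patient, ["cardio", "age"])
print(base_a, contrib_a)
print(base_b, contrib_b)
```

In both orders the baseline plus the contributions adds up to the prediction for the patient, which is what the waterfall plot displays. The contribution of a single variable changes with the order when effects are not additive, and this order dependence is what makes the approach useful for spotting interactions.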
All modelling calculations are done in R.
More information:
This app was prepared by MI² DataLab and MOCOS groups.
The source code is/will be available at https://github.com/MOCOS-COVID19/covid-atlas.
Comments can be submitted as issues at https://github.com/MOCOS-COVID19/covid-atlas/issues.