Model Output Validation
Validation Function
SMHValidation Package
To validate Scenario Modeling Hub (SMH) model projections, we use the `validate_submission()` function from the SMHvalidation R package.
The `validate_submission()` function requires two parameters:

- `path`: path to the submission file (or folder for partitioned data) to test
- `js_def`: path to a JSON file containing the round definitions: names of columns, target names, etc. The JSON file should follow the Hubverse `tasks.json` format. The information in the JSON file can be separated into multiple groups for each round:
  - The "task_ids" object defines both labels and contents for each column in submission files, defining a modeling task. Any unique combination of the values defines a single modeling task. For example, for SMH these are the columns `"scenario_id"`, `"location"`, `"origin_date"`, `"horizon"`, `"target"`, `"age_group"`, and `"race_ethnicity"`, depending on the round.
  - The "output_type" object defines the accepted representations for each task. For example, for SMH it concerns the columns `"output_type"`, `"output_type_id"`, `"run_grouping"`, `"stochastic_run"`, and `"value"`.
  - The "target_metadata" object contains metadata about each unique target, one object for each unique target value.
For the complete function documentation, see the help page: `?validate_submission()`
SMH Sample format
In the Scenario Modeling Hub file format, two possible formats are used for the "sample" output type:

- All the submissions for a specific round should follow the same format, with predefined `output_type_id` column content.
- For a specific round, the hub collects the sample "pairing" (or joint distribution) information each team uses to generate the trajectories. In this case, the `output_type_id` column is set to `NA` for the `"sample"` output type, and the "pairing" information is collected in two columns: `run_grouping` and `stochastic_run`. For more information on this format, please consult the SMH Sample format documentation.

In the `tasks.json` file, the `"sample"` format is expressed with a parameter: `"samples_joint_across": [...]`.
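As an illustration, the pairing declaration in a round's `tasks.json` could look like the following fragment (hypothetical values; the exact placement inside the round definition depends on the hub's configuration):

```json
"samples_joint_across": ["horizon", "age_group"]
```

Here, all trajectories would be jointly distributed across the horizon and age group values.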
Output
The function can generate three different outputs:

- a message, when the submission does not contain any issues;
- a warning + report message, when the submission contains one or multiple minor issues that do not prevent the submission from being included;
- an error + report message, when the submission contains one or multiple minor and/or major issues that prevent the submission from being included. In this case, the submission file will have to be updated to be included in the corresponding SMH round.
File Check List
Submission file format
- The `round_id` information, i.e. the date extracted from the file name(s) (`path` parameter), should match one of the `round_id` values in the associated `tasks.json` file (`js_def` parameter).
- The submission file(s) should be in one of the accepted formats:
  - Parquet file (preferred): `.parquet` or `.gz.parquet`. Partitioning in this format is also accepted.
  - CSV file: `.csv`. Partitioning in this format is also accepted.
  - Compressed format: `.gz` or `.zip`
- All columns containing date information should follow the "YYYY-MM-DD" format.
- No column should be in a "factor" format.
File format (`test_column()`)

The name and number of the columns should correspond to the expected format:

| origin_date | scenario_id | target | horizon | location | age_group* | race_ethnicity* | output_type | output_type_id | run_grouping | stochastic_run | value |
|---|---|---|---|---|---|---|---|---|---|---|---|

*The columns `age_group` and `race_ethnicity` are not required in each round. Please refer to the associated round definition for more information.

The order of the columns is not important, but the file should contain the expected number of columns, with each name correctly spelled. Each column should be in the expected format (no "factor" columns accepted).

Remarks: If one column is missing, the validation will stop immediately and return an error message, without running the other tests.

Remarks: In this example, the "task_ids" columns correspond to: origin_date, scenario_id, target, horizon, location, age_group, race_ethnicity.
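As a sketch, a minimal one-row submission table with the expected columns could be built as follows (hypothetical values; the exact columns and accepted values depend on the round definition):

```r
# Minimal illustrative submission table (hypothetical values).
subm <- data.frame(
  origin_date    = "2023-11-12",
  scenario_id    = "A-2023-10-27",
  target         = "inc hosp",
  horizon        = 1,
  location       = "US",
  age_group      = "0-130",
  race_ethnicity = "overall",
  output_type    = "sample",
  output_type_id = NA,
  run_grouping   = 1,
  stochastic_run = 1,
  value          = 150,
  stringsAsFactors = FALSE
)
# No column should be a factor:
any(vapply(subm, is.factor, logical(1)))  # FALSE
```
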
Scenario Information (`test_scenario()`)

- The scenario IDs correspond to the expected IDs of the expected round, without any spelling errors.
Column “origin_date” (`test_origindate()`)

- The `origin_date` column contains:
  - one unique date value in the `YYYY-MM-DD` format (character or date format accepted; datetime will return a warning);
  - information in "character" or "Date" format.
- Depending on the round, the date should either match the round submission due date or the projection starting date. Please refer to the round documentation for more information.
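Before running the full validation, these two conditions can be checked locally with a quick sketch like the following (hypothetical values):

```r
# Sketch: the origin_date column should contain one unique YYYY-MM-DD value.
origin_date <- c("2023-11-12", "2023-11-12")
ok_format <- !any(is.na(as.Date(origin_date, format = "%Y-%m-%d")))
ok_unique <- length(unique(origin_date)) == 1
ok_format && ok_unique  # TRUE
```
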
“sample” information (`test_sample()`)

- The column `output_type_id` (or `run_grouping` and `stochastic_run`) must contain only integer values for the output type `"sample"`. If the `run_grouping` and `stochastic_run` columns are provided, the `output_type_id` should be set to `NA` for the output type `"sample"`.
- The submission file should contain the expected number of repetitions (number of samples or trajectories) for each task_id group, as indicated in the associated JSON file (`js_def` parameter), and each group should contain the same number of trajectories.
- The submission should contain at least one unique sample identifier per pairing group. For example, if the pairing group is equal to `"horizon"` and `"age_group"`, the sample identifier `1` should contain at least all the possible horizon and age group values once, and can optionally contain specific and multiple values for the other task_id columns (`origin_date`, `scenario`, `location`, `target`, etc.). The `pairing_col` information is extracted from the `"samples_joint_across"` information in the associated `tasks.json` file. If the information is not available, it will be set to `"horizon"` by default.

For more information on pairing or grouping information, please consult the SMH Sample format documentation.

Remarks: These tests are run only for file formats expecting `"sample"` information (required or optional).
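To illustrate the pairing logic, assume samples are joint across `"horizon"` only: each sample identifier (here, `run_grouping`) should then cover every horizon value once. This is a sketch with hypothetical data, not the package's internal implementation:

```r
# Hypothetical trajectories paired (jointly distributed) across horizon:
samples <- expand.grid(horizon = 1:4, run_grouping = 1:3)
samples$stochastic_run <- 1L
samples$value <- 100 + samples$horizon * samples$run_grouping
# Each sample identifier covers all horizon values exactly once:
all(tapply(samples$horizon, samples$run_grouping,
           function(h) setequal(h, 1:4)))  # TRUE
```
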
“cdf” information

- The `output_type_id` column contains the expected epiweek values, noted in the `EWYYYYWW` format.

Remarks: These tests are run only for file formats expecting `"cdf"` information (required or optional).
Quantiles information and value (`test_quantile()`)

- The submission file should contain quantiles matching the expected quantile values (information provided via the `js_def` parameter), for example: 0.010, 0.025, 0.050, 0.100, 0.150, 0.200, 0.250, 0.300, 0.350, 0.400, 0.450, 0.500, 0.550, 0.600, 0.650, 0.700, 0.750, 0.800, 0.850, 0.900, 0.950, 0.975, 0.990. If a submission is missing some (or all) quantiles, it might still be accepted, but it might not be included in the Ensembles.
- For each task_id group, the value increases with the quantiles. For example, for the 1st week ahead of target X, for location Y and scenario A, if quantile `0.01` = 5, then quantile `0.5` should be equal to or greater than 5.

Remarks: These tests are run only for file formats expecting quantiles information (required or optional).
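The monotonicity requirement can be sketched as a simple check on one task_id group (hypothetical values):

```r
# Values for one task_id group, one per quantile level:
quantile_level <- c(0.025, 0.250, 0.500, 0.750, 0.975)
value          <- c(5, 9, 12, 18, 30)
# Values must be non-decreasing when ordered by quantile level:
all(diff(value[order(quantile_level)]) >= 0)  # TRUE
```
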
Value and Types information (`test_val()`)

- Each task_id group combination has one unique projected value. For example: only 1 value for sample `1`, location `US`, target `inc hosp`, horizon `1`, age group `0-130`, and scenario `A`.
- The projection contains the expected values (for example, integer values greater than or equal to zero, or numeric values between 0 and 1). `NA` values are not accepted.
- For each task_id group excluding the horizon information, and except for the locations `66` (Guam), `69` (Northern Mariana Islands), `60` (American Samoa), and `74` (U.S. Minor Outlying Islands), the whole projection does not contain only one unique value. For example, the projection of incident cases for one location and one scenario does not contain only one unique value for the whole time series. As it is possible that 0 deaths or cases are projected, the submission will still be accepted if this test fails, but a warning message will be returned asking to verify the projection.
- Only for cumulative targets, for example `cum hosp`: for each task_id and output_type group (excluding `"horizon"`), the value does not decrease with time.
- [optional] Each projected value cannot be greater than the population size of the corresponding geographical entity. Only run if the `pop_path` parameter is not set to `NULL`. As an individual can be reinfected, the submission will still be accepted if this test fails, but a warning message will be returned asking to verify the projection.
- [optional] For the cumulative cases and deaths projections, the projected value should not be less than the week 0 (or week -1, depending on availability at the time of submission) value of the observed cumulative cases and deaths, respectively. The test allows a difference of 5% to take into account differences in the time of pulling the observed data, sources, etc. (only tested if the `lst_gs` parameter is not `NULL` and contains observed cumulative cases and deaths values).
- [optional] The number of decimal places accepted in the `"value"` column is lower than or equal to the `n_decimal` parameter (only run if `n_decimal` is not set to `NULL`).
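For example, the non-decreasing rule for a cumulative target can be checked on one trajectory like this (hypothetical values):

```r
# Hypothetical "cum hosp" trajectory over four horizons:
cum_value <- c(10, 15, 15, 22)
all(diff(cum_value) >= 0)  # TRUE: the cumulative value never decreases
```
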
Target information and value (`test_target()`)

- The targets correspond to the target names as expressed in the associated metadata JSON file (`js_def` parameter).
- The submission file contains projections for all the required targets. The submission file will be accepted if some targets are missing, but a warning message will be returned and the submission might not be included in the Ensembles.
- For some targets, the submission is expected to contain projections for an expected number of weeks (or horizons). If the file contains more projected weeks than expected, the submission will still be accepted, but a warning message will be returned and the additional weeks will not be included in the visualization on the SMH website. If the file contains fewer projected weeks than expected, the submission might still be accepted, but a warning message will be returned and the submission might not be included in the Ensembles; if the submission is not accepted, an error message will be returned.
- For some targets, for example `peak size hosp`, no `"horizon"` information is expected. In this case, the `"horizon"` column should be set to `NA`.
Column “location” (`test_location()`)

- The locations correspond to the location names as expressed in the associated metadata JSON file (`js_def` parameter). If FIPS codes are expected and the FIPS numbers are missing a leading zero, the submission will be accepted, but a warning message will be returned.
- For targets requiring only specific location(s), no additional location is provided in the submission file.

Remarks: If a submission file contains only state-level projections (one or multiple), the `location` column might be automatically identified as numeric, even if it was submitted in character format. In this case, a warning message will automatically be printed during validation, but please feel free to ignore it.
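If location codes have been read as numeric and lost their leading zeros, they can be padded back to two-character FIPS codes before submission, e.g. (sketch):

```r
# Restore two-character state FIPS codes that were read as numeric:
loc <- c(6, 36, 66)
sprintf("%02d", loc)  # "06" "36" "66"
```
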
Column “age_group” (`test_agegroup()`)

- The submission should contain an `age_group` column with values defined as `<AGEMIN>-<AGEMAX>`.
- The submission should contain the expected `age_group` information, as specified in the associated JSON file (`js_def` parameter).
- For targets requiring only specific age group(s), no additional age group is provided in the submission file. If additional age groups are provided, a warning will be returned and the additional information might not be integrated in the analysis and visualization.

Remarks: These tests are only run if the submission contains an `age_group` column.
Column “race_ethnicity” (`test_raceethnicity()`)

- The submission should contain the expected `race_ethnicity` information, as specified in the associated JSON file (`js_def` parameter).
- For targets requiring only specific race/ethnicity group(s), no additional group is provided in the submission file. If additional groups are provided, a warning will be returned and the additional information might not be integrated in the analysis and visualization.

Remarks: These tests are only run if the submission contains a `race_ethnicity` column.
File Checks Running Locally
Each submission will be validated using the `validate_submission()` function from the SMHvalidation R package.

The package is currently only available on GitHub. To install it, please follow these steps:
install.packages("remotes")
remotes::install_github("midas-network/SMHvalidation", build_vignettes = TRUE)
Alternatively, it can be installed manually by cloning/forking/downloading the package directly from GitHub.
To load the package, execute the following command:
library(SMHvalidation)
The package contains a `validate_submission()` function allowing users to check their SMH submissions locally.

The documentation here is associated with the latest version of the package. For previous versions of the documentation, please consult past versions of the package.
Run the validation
Run without testing against observed data (example with the RSV SMH GitHub repository):
js_def <- "https://raw.githubusercontent.com/midas-network/rsv-scenario-modeling-hub/main/hub-config/tasks.json"
validate_submission("PATH/TO/SUBMISSION", js_def, merge_sample_col = TRUE)
File Visualization Running Locally (only for quantiles values)

The SMHvalidation R package contains plotting functionality to output a plot of each location and target, with all scenarios, for the `quantile` output type only.
To run this visualization locally:
generate_validation_plots(path_proj = "PATH/TO/SUBMISSION", lst_gs = NULL, save_path = getwd(), y_sqrt = FALSE, plot_quantiles = c(0.025, 0.975))