Model Output Validation

Validation Function

SMHvalidation Package

To validate Scenario Modeling Hub (SMH) model projections, we use the `validate_submission()` function from the SMHvalidation R package.

The `validate_submission()` function requires two parameters:

  • path: path to the submission file (or folder, for partitioned data) to test

  • js_def: path to a JSON file containing the round definitions: names of columns, target names, etc. The JSON file should follow the Hubverse `tasks.json` format. The information in the JSON file can be separated into multiple groups for each round:
    • The "task_ids" object defines both labels and contents for each column in submission files defining a modeling task. Any unique combination of the values define a single modeling task. For example, for SMH it will be the columns: "scenario_id", "location", "origin_date", "horizon", "target", "age_group", "race_ethnicity", depending on the round.
    • The "output_type" object defines accepted representations for each task. For example, for SMH it concerns the columns: "output_type", "output_type_id", "run_grouping", "stochastic_run" and "value".
    • The "target_metadata" object containing metadata about each unique target, one object for each unique target value.
For more information on this function, please consult the associated documentation via ?validate_submission.
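As an illustration, the three groups might look like the following for one modeling task. This is a hypothetical, abridged fragment: the exact fields and accepted values are defined by the Hubverse tasks.json schema and by each round's configuration, not by this sketch.

```json
{
  "task_ids": {
    "origin_date": {"required": ["2023-11-12"], "optional": null},
    "scenario_id": {"required": ["A-2023-10-27"], "optional": null},
    "location":    {"required": ["US"], "optional": ["01", "02"]},
    "target":      {"required": ["inc hosp"], "optional": null},
    "horizon":     {"required": [1, 2, 3, 4], "optional": null}
  },
  "output_type": {
    "sample": {
      "output_type_id_params": {
        "type": "integer",
        "min_samples_per_task": 100,
        "max_samples_per_task": 100,
        "samples_joint_across": ["horizon"]
      },
      "value": {"type": "double", "minimum": 0}
    }
  },
  "target_metadata": [
    {"target_id": "inc hosp", "target_name": "Incident hospitalizations"}
  ]
}
```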

SMH Sample format

The Scenario Modeling Hub uses two possible formats for the "sample" output type:

  • All the submissions for a specific round should follow the same format, with predefined output_type_id column content.

  • For a specific round, the hub collects the sample "pairing" (or joint distribution) information each team uses to generate the trajectories. In this case, the output_type_id column is set to "NA" for the "sample" output type and the "pairing" information is collected in two columns: "run_grouping" and "stochastic_run". For more information on this format, please consult the SMH Sample format documentation.
The second format is assumed if the information for the associated round in the tasks.json file for the "sample" output type is expressed with the parameter "samples_joint_across": [...].

Output

The function can generate three different outputs:

  • message when the submission does not contain any issues
  • warning + report message when the submission contains one or multiple minor issues that do not prevent the submission from being included.
  • error + report message when the submission contains one or multiple minor and/or major issues that prevent the submission from being included. In this case the submission file will have to be updated to be included in the corresponding SMH round.

File Check List

Submission file format

  • The round_id information, i.e. the date extracted from the file name(s) (path parameter), should match one of the round_id values in the associated tasks.json file (js_def parameter).

  • The submission file(s) should be in one of the accepted formats:
    • Parquet file (preferred): .parquet or .gz.parquet. Partitioning in this format is also accepted.
    • CSV file: .csv. Partitioning in this format is also accepted.
    • Compressed format: .gz or .zip

  • All columns containing date information should follow the "YYYY-MM-DD" format.

  • No column should be in a "factor" format.
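As a quick local pre-check of these last two points, the following base-R sketch (with hypothetical data) verifies that no column is a factor and that the date column parses as YYYY-MM-DD; the actual checks are performed by validate_submission() itself:

```r
# Sketch (hypothetical data): pre-check column classes and date format
# before running validate_submission().
df <- data.frame(origin_date = "2023-11-12",
                 value       = 10.5,
                 stringsAsFactors = FALSE)

# No column should be in a "factor" format
stopifnot(!vapply(df, is.factor, logical(1)))

# Date columns should parse in the YYYY-MM-DD format
stopifnot(!is.na(as.Date(df$origin_date, format = "%Y-%m-%d")))
```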

File format (test_column())

The names and number of the columns should correspond to the expected format:

origin_date
scenario_id
target
horizon
location
age_group*
race_ethnicity*
output_type
output_type_id
run_grouping
stochastic_run
value

*The columns age_group and race_ethnicity are not required in each round. Please refer to the associated round definition for more information.

The order of the columns is not important, but the file should contain the expected number of columns, with each name correctly spelled. Each column should be in the expected format (no "factor" columns accepted).

Remarks: If a column is missing, the submission test will stop immediately and return an error message without running the other tests.

Remarks: In this example, the "task_ids" columns correspond to: origin_date, scenario_id, target, horizon, location, age_group, race_ethnicity.

Scenario Information (test_scenario())

  • The scenario IDs should correspond to the expected IDs for the round, without any spelling errors.

Column “origin_date” (test_origindate())

  • The origin_date column should contain:
    • one unique date value in the YYYY-MM-DD format ("character" or "Date" format accepted; "datetime" will return a warning).
    • depending on the round, a date that matches either the round submission due date or the projection starting date. Please refer to the round documentation for more information.
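The uniqueness and format requirements can be sketched in base R with hypothetical values (the real test is test_origindate()):

```r
# Sketch (hypothetical values): origin_date should hold one unique
# YYYY-MM-DD value across the whole file.
origin <- c("2023-11-12", "2023-11-12", "2023-11-12")

stopifnot(length(unique(origin)) == 1)
stopifnot(!is.na(as.Date(unique(origin), format = "%Y-%m-%d")))
```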

“sample” information (test_sample())

  • The column output_type_id (or run_grouping and stochastic_run) must only contain integer values for the output type "sample". If the run_grouping and stochastic_run columns are provided, output_type_id should be set to NA for the output type "sample".

  • The submission file should contain the expected number of repetitions (number of samples or trajectories) for each task_id group, as indicated in the associated JSON file (js_def parameter), and each group should contain the same number of trajectories.

  • The submission should contain at least one unique sample identifier per pairing group. For example, if the pairing group is "horizon" and "age_group", sample identifier 1 should contain all possible horizon and age group values at least once, and can optionally contain specific or multiple values for the other task_id columns (origin_date, scenario, location, target, etc.). The pairing column information is extracted from the "samples_joint_across" entry in the associated tasks.json file. If that information is not available, it defaults to "horizon".

For more information on pairing or grouping information, please consult the SMH Sample format documentation.
Remarks: These tests are run only for file formats expecting "sample" information (required or optional).
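As a toy illustration of the pairing logic (hypothetical values, samples joint across "horizon" and "age_group"), each (run_grouping, stochastic_run) pair identifies one trajectory and must cover every horizon/age-group combination:

```r
# Toy sketch (hypothetical values): samples paired across "horizon"
# and "age_group". Each (run_grouping, stochastic_run) pair is one
# trajectory covering every horizon/age_group combination once.
traj <- expand.grid(horizon        = 1:4,
                    age_group      = c("0-17", "18-130"),
                    run_grouping   = 1:2,
                    stochastic_run = 1:2)
traj$output_type    <- "sample"
traj$output_type_id <- NA        # NA when run_grouping/stochastic_run are used
traj$value          <- runif(nrow(traj))

# 4 horizons x 2 age groups x 2 groups x 2 stochastic runs = 32 rows
nrow(traj)
```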

“cdf” information

  • The output_type_id column contains the expected Epiweek values, noted in the EWYYYYWW format.

Remarks: These tests are run only for file formats expecting "cdf" information (required or optional).

Quantiles information and value (test_quantile())

  • The submission file should contain quantiles matching the expected quantile values (provided via the js_def parameter), for example: 0.010, 0.025, 0.050, 0.100, 0.150, 0.200, 0.250, 0.300, 0.350, 0.400, 0.450, 0.500, 0.550, 0.600, 0.650, 0.700, 0.750, 0.800, 0.850, 0.900, 0.950, 0.975, 0.990. If a submission is missing some (or all) quantiles it might still be accepted, but might not be included in the Ensembles.

  • For each task_id group, the value should not decrease as the quantile level increases. For example, for the 1st week ahead of target X, location Y, and scenario A: if the 0.01 quantile is 5, then the 0.5 quantile should be greater than or equal to 5.

Remarks: These tests are run only for file formats expecting quantile information (required or optional).
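The monotonicity requirement amounts to the following base-R check (hypothetical values for one task_id group; the actual test is test_quantile()):

```r
# Sketch (hypothetical values): within one task_id group, projected
# values must not decrease as the quantile level increases.
q   <- c(0.025, 0.250, 0.500, 0.750, 0.975)
val <- c(5, 8, 12, 20, 35)

stopifnot(!is.unsorted(val[order(q)]))
```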

Value and Types information (test_val())

  • Each task_id group combination should have one unique projected value. For example: only 1 value for sample 1, location US, target inc hosp, horizon 1, age group 0-130, and scenario A.

  • The projection should contain values of the expected type (for example, integer values greater than or equal to zero, or numeric values between 0 and 1). NA values are not accepted.

  • For each task_id group (excluding horizon), and except for locations 66 (Guam), 69 (Northern Mariana Islands), 60 (American Samoa), and 74 (U.S. Minor Outlying Islands), the projection should not contain only one unique value. For example, the projected incident cases for one location and one scenario should not be a single constant value across the whole time series. Since zero deaths or cases might legitimately be projected, the submission will still be accepted if this test fails, but a warning message will be returned asking to verify the projection.

  • For cumulative targets only (for example, cum hosp), for each task_id and output_type group (excluding "horizon"), the value should not decrease with time.

  • [optional] Each projected value cannot be greater than the population size of the corresponding geographical entity. Only run if the pop_path parameter is not set to NULL. As an individual can be reinfected, the submission will still be accepted if this test fails, but a warning message will be returned asking to verify the projection.

  • [optional] For the cumulative cases and deaths projections, the projected value should not be less than the week 0 (or week -1, depending on availability at the time of submission) observed cumulative cases and deaths, respectively. The test allows a 5% difference to account for differences in when the observed data were pulled, data sources, etc. (only tested if the lst_gs parameter is not NULL and contains observed cumulative cases and deaths values).

  • [optional] The number of decimal places in the "value" column should be less than or equal to the n_decimal parameter (only run if n_decimal is not set to NULL).
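The cumulative-target rule above reduces to a simple non-decreasing check, sketched here with hypothetical values for one task_id/output_type group:

```r
# Sketch (hypothetical values): a cumulative target must not decrease
# with time within one task_id/output_type group.
cum_hosp <- c(10, 25, 25, 40)   # horizons 1 to 4

stopifnot(all(diff(cum_hosp) >= 0))
```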

Target information and value (test_target())

  • The targets should correspond to the target names as expressed in the associated metadata JSON file (js_def parameter).

  • The submission file should contain projections for all the required targets. The submission file will be accepted if some targets are missing, but a warning message will be returned and the submission might not be included in the Ensembles.

  • For some targets, the submission is expected to contain projections for an expected number of weeks (or horizons). If the file contains more projected weeks than expected, the submission will still be accepted, but a warning message will be returned and the additional weeks will not be included in the visualization on the SMH website. If the file contains fewer projected weeks than expected, the submission might still be accepted, but a warning message will be returned and the submission might not be included in the Ensembles; if the submission is not accepted, an error message will be returned.

  • For some targets, for example peak size hosp, no "horizon" information is expected. In this case, the "horizon" column should be set to NA.

Column “location” (test_location())

  • The locations should correspond to the location names as expressed in the associated metadata JSON file (js_def parameter). If FIPS codes are expected and are missing a leading zero, the submission will be accepted but a warning message will be returned.

  • For targets requiring only specific location(s), no additional locations should be provided in the submission file.

Remarks: If a submission file contains only state-level projections (one or multiple), the location column might be automatically read as numeric even if it was submitted in character format. In this case, a warning message will automatically be printed during validation; please feel free to ignore it.
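If a reader wants to restore the leading zeros before submitting, a base-R sketch (hypothetical values) is:

```r
# Sketch (hypothetical values): restore the leading zero on state FIPS
# codes that were read in as numeric.
loc <- c(1, 2, 6)
loc <- sprintf("%02d", as.integer(loc))
loc  # "01" "02" "06"
```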

Column “age_group” (test_agegroup())

  • The submission should contain a column age_group with values defined as <AGEMIN>-<AGEMAX>

  • The submission should contain the expected age_group information, as specified in the associated JSON file (js_def parameter)

  • For targets requiring only specific age group(s), no additional age groups should be provided in the submission file. If additional age groups are provided, a warning will be returned and the additional information might not be integrated in the analysis and visualization.

Remarks: These tests are only run if the submission contains an age_group column.

Column “race_ethnicity” (test_raceethnicity())

  • The submission should contain the expected race_ethnicity information, as specified in the associated JSON file (js_def parameter)

  • For targets requiring only specific race/ethnicity group(s), no additional groups should be provided in the submission file. If additional groups are provided, a warning will be returned and the additional information might not be integrated in the analysis and visualization.

Remarks: These tests are only run if the submission contains a race_ethnicity column.

File Checks Running Locally

Each submission will be validated using the validate_submission() function from the SMHvalidation R package. The package is currently only available on GitHub; to install it, please follow these steps:

install.packages("remotes")
remotes::install_github("midas-network/SMHvalidation", build_vignettes = TRUE)

Alternatively, it can be installed manually by cloning/forking/downloading the package from GitHub.

To load the package, execute the following command:

library(SMHvalidation)

The package contains a validate_submission() function allowing the user to check their SMH submissions locally.

The documentation here is associated with the latest version of the package. For previous versions of the documentation, please consult past versions of the package.

Run the validation

Run without testing against observed data (example with the RSV SMH GitHub repository):

js_def <- "https://raw.githubusercontent.com/midas-network/rsv-scenario-modeling-hub/main/hub-config/tasks.json"
validate_submission("PATH/TO/SUBMISSION", js_def, merge_sample_col = TRUE)
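The optional checks described in the file check list are enabled through their associated parameters. A sketch of such a call, with hypothetical paths and values (see ?validate_submission for the exact argument formats):

```r
validate_submission("PATH/TO/SUBMISSION", js_def,
                    merge_sample_col = TRUE,
                    pop_path  = "PATH/TO/POPULATION/DATA",  # enables the population-size check
                    n_decimal = 1)                          # enables the decimal-places check
```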

File Visualization Running Locally (only for quantiles values)

The SMHvalidation R package contains plotting functionality to output a plot for each location and target, with all scenarios, for the quantile output type only.

To run this visualization locally:

generate_validation_plots(path_proj = "PATH/TO/SUBMISSION", lst_gs = NULL, save_path = getwd(), y_sqrt = FALSE, plot_quantiles = c(0.025, 0.975))