Validation Checks for US SMH files

Validation Functions

SMHValidation Package

The SMHvalidation R package aims to validate and visualize Scenario Modeling Hub submissions. To validate Scenario Modeling Hub (SMH) model projections, we use the validate_submission() function from this package.

Remarks: Starting January 2025, the SMHvalidation package format and associated JSON file format have been updated to follow the Hubverse schema v5 format. This documentation covers the latest version of the data format.

For previous versions of the documentation, please consult past versions of the package.

The `validate_submission()` function requires two parameters:

  • path: path to the submission file (or folder for partitioned data) to test. PQT, PARQUET, CSV, ZIP (not partitioned), or GZ (not partitioned) file formats are accepted. The path should be relative to the path of the hub containing the file to check.

  • hub_path: path to the hub containing the submission file. The hub should follow the hubverse standard and organization. For more information on hubverse, please consult the associated website.
For more information on this function, please consult the associated documentation at ?validate_submission().
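For example, a minimal call might look like the following (the team/model name, file name, and hub path are hypothetical; adapt them to your own submission):

```r
library(SMHvalidation)

# Hypothetical submission file inside a local clone of a hubverse-style hub.
# The path is relative to the hub containing the file to check.
validate_submission(
  path = "model-output/team1-modela/2025-01-12-team1-modela.parquet",
  hub_path = "./covid19-scenario-modeling-hub/"
)
```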

SMH Sample format

In the Scenario Modeling Hub file format, two possible formats are used for the "sample" output type:

  • All the submissions for a specific round should follow the same format with predefined output_type_id column content.

  • For a specific round, the hub collects the sample "pairing" (or joint distribution) information each team uses to generate the trajectories. In this case, the output_type_id column is set to "NA" for the "sample" output type and the "pairing" information is collected into two columns: "run_grouping" and "stochastic_run". For more information on this format, please consult the SMH Sample format documentation.

The second format is assumed for the SMH `sample` output type. An additional parameter of the validate_submission() function, merge_sample_col, indicates this format when set as merge_sample_col = c("run_grouping", "stochastic_run"). The validation will concatenate the `merge_sample_col` columns by:

  • Uniting all `merge_sample_col` columns into one column, with the values separated by a "_".
  • Transforming the output into a factor to categorize each value of the output.
  • Transforming the output into a numeric to comply with the expected format.
For example, in R: as.numeric(as.factor(paste0(run_grouping, "_", stochastic_run)))
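As a minimal base-R sketch of the three steps above (using a small hypothetical data frame `tbl` with the two pairing columns):

```r
# Hypothetical table with the two SMH pairing columns
tbl <- data.frame(run_grouping   = c(1, 1, 2, 2),
                  stochastic_run = c(1, 2, 1, 2))

# Unite the columns with "_", factor the result, then coerce to numeric
tbl$output_type_id <- as.numeric(as.factor(
  paste0(tbl$run_grouping, "_", tbl$stochastic_run)))

tbl$output_type_id
#> [1] 1 2 3 4
```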

Output

The function returns multiple messages and, if any issues are found, a failure or an error message.

  • If the check succeeds: a message is returned (with a green check mark).
  • If the check fails: an error or a failure message is returned. A failure represents a check that fails without blocking the validation from proceeding to other checks, and is represented by a red cross, whereas an error stops the validation process and is represented by a circled red cross.
  • If a check is not run or outputs additional information, a small i is used to represent it.
For example (function output):
#> Run validation on files: 2023-11-12-team2-modelb.parquet
#> 
#> ── 2023-11-12-team2-modelb.parquet ────
#> 
#> ✔ [valid_round_id_col]: `round_id_col` name is valid.
#> ✔ [unique_round_id]: `round_id` column "origin_date" contains a single, unique
#>   round ID value.
#> ✔ [match_round_id]: All `round_id_col` "origin_date" values match submission
#>   `round_id` from file name.
#> ✔ [colnames]: Column names are consistent with expected round task IDs and std
#>   column names.
#> ✔ [col_types]: Column data types match hub schema.
#> ✔ [valid_vals]: `tbl` contains valid values/value combinations.
#> ✔ [rows_unique]: All combinations of task ID
#>   column/`output_type`/`output_type_id` values are unique.
#> ✔ [req_vals]: Task ID/output type/output_type_id combinations all present.
#> ✔ [value_col_valid]: Values in column `value` all valid with respect to
#>   modeling task config.
#> ℹ [value_col_non_desc]: No quantile or cdf output types to check for
#>   non-descending values. Check skipped.
#> ✔ [spl_compound_taskid_set]: All samples in a model task conform to single,
#>   unique compound task ID set that matches or is coarser than the configured
#>   `compound_taskid_set`.
#> ✔ [spl_compound_tid]: Each sample compound task ID contains single, unique
#>   value.
#> ✔ [spl_non_compound_tid]: Task ID combinations of non compound task id values
#>   consistent across modeling task samples.
#> ✔ [spl_n]: Required samples per compound idx task present.
#> ✔ [na_value]: `value` does not contain `NA` value.
#> ✔ [flat_projection]: All projections don't have a unique value for the whole
#>   projection period.
#> ✔ [cumul_proj]: The cumulative values are not decreasing.

For pull request (PR) output, a function stores the validation output and returns it in a PR message, in a slightly different format. A PR has two possible outcomes, fail or pass. Depending on the validation output, the PR behavior will differ:

  • "Error" (red cross): the validation has failed and returned a message indicating the error(s). The error(s) should be fixed to have the PR accepted.
  • "Warning" (red !): the PR will fail but it can be accepted. It is necessary for the submitting team to validate if the warning(s) are expected or not before merging the PR. Once the warnings are clarified and accepted, the PR can be merge without additional modification.
  • "Information" (i): either the check has not run or the check pass and output some information of interest. The PR will not fail if one or multiple information check are outputted.
  • "Success" (green check): the validation did not found any issue and returns a message indicating that the validation is a success. The PR pass.
For example (PR output):
#> ✅: [valid_round_id_col]: `round_id_col` name is valid. 
#>  
#> ✅: [unique_round_id]: `round_id` column "origin_date" contains a single, unique round ID value. 
#>  
#> ✅: [match_round_id]: All `round_id_col` "origin_date" values match submission `round_id` from file name. 
#>  
#> ✅: [colnames]: Column names are consistent with expected round task IDs and std column names. 
#>  
#> ✅: [col_types]: Column data types match hub schema. 
#>  
#> ✅: [valid_vals]: `tbl` contains valid values/value combinations.  
#>  
#> ✅: [rows_unique]: All combinations of task ID column/`output_type`/`output_type_id` values are unique. 
#>  
#> ✅: [req_vals]: Task ID/output type/output_type_id combinations all present.  
#>  
#> ✅: [value_col_valid]: Values in column `value` all valid with respect to modeling task config. 
#>  
#> ℹ: [value_col_non_desc]: No quantile or cdf output types to check for non-descending values.
#>         Check skipped.
#> ✅: [spl_compound_taskid_set]: All samples in a model task conform to single, unique compound task ID set that matches or is
#>     coarser than the configured `compound_taskid_set`. 
#>  
#> ✅: [spl_compound_tid]: Each sample compound task ID contains single, unique value. 
#>  
#> ✅: [spl_non_compound_tid]: Task ID combinations of non compound task id values consistent across modeling task samples. 
#>  
#> ✅: [spl_n]: Required samples per compound idx task present.  
#>  
#> ✅: [na_value]: `value` does not contain `NA` value. 
#>  
#> ✅: [flat_projection]: All projections don't have a unique value for the whole projection period. 
#>  
#> ✅: [cumul_proj]: The cumulative values are not decreasing. 
#> 

A shortened version of the PR output is available, containing warning, information, and error messages only.

File Check List

Submission file format

  • The round_id information, either the date extracted from the file name(s) (path) or provided via the round_id parameter, should match one of the round_id values in the associated tasks.json file.

  • The submission file(s) should be in one of the accepted formats:

    • Parquet file (preferred): .parquet or .gz.parquet. Partitioning in this format is also accepted.
    • CSV file: .csv. Partitioning in this format is also accepted.
    • Compressed format: .gz or .zip
  • All columns containing dates should contain date information in the “YYYY-MM-DD” format.

  • No column should be in a “factor” format.
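The date-format and no-factor requirements above can be checked (and fixed) before writing a file, for example with this base-R sketch on a hypothetical data frame `proj`:

```r
# Hypothetical projection table with a date column and an accidental factor
proj <- data.frame(origin_date = "2025-01-12",
                   location    = factor(c("06", "US")))

# Ensure the date column follows the "YYYY-MM-DD" format
proj$origin_date <- as.Date(proj$origin_date, format = "%Y-%m-%d")

# Convert any factor columns back to character
fct_cols <- vapply(proj, is.factor, logical(1))
proj[fct_cols] <- lapply(proj[fct_cols], as.character)
```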

File format (test_column())

The names and number of the columns should correspond to the expected format:

origin_date
scenario_id
target
horizon
location
age_group*
race_ethnicity*
output_type
output_type_id
run_grouping
stochastic_run
value

*The columns age_group and race_ethnicity are not required in each round. Please refer to the associated round definition for more information.

The order of the columns is not important, but the file should contain the expected number of columns with each name correctly spelled.

However, the data type of each column should match the expected format as described in the tasks.json file.

Remarks: In this example, the task_ids values correspond to: origin_date, scenario_id, target, horizon, location.

Remarks: If one required column is missing, the submission test will stop immediately and return an error message without running all the other tests.

Round ID and "origin_date"

  • The origin_date column contains:
    • one unique date value in the YYYY-MM-DD format
    • the column should be in “character” or “Date” format

If the origin_date is also the round ID, the origin_date and round ID (date) in the file path should correspond.

Remarks: This check has been implemented since January 2022; submissions prior to this date may have slightly more flexible date information.

Required values/variables

  • The values in each task ID column correspond to the expected values for the round, without any spelling errors.

  • Each task_id/output_type/output_type_id group combination has one unique projected value. For example: only one value associated with quantile 0.5, location US, target inc hosp, horizon 1, and scenario A.

  • The projection contains the expected values (for example, integer values >= 0 or numeric values between 0 and 1). NA values are not accepted.

  • The submission file contains projections for all the required targets. The submission file might be accepted if some targets are missing, but it might return an error message and the submission might not be included in the ensembles.

  • For some targets, for example peak size hosp, no "horizon" information is expected. In this case, the column "horizon" should be set to NA.

Output type / output type ID

  • The column output_type_id (or run_grouping and stochastic_run) for the output type "sample" should only contain integer values.

  • The submission should contain the expected number of trajectories for each task_id group, as indicated in the associated JSON file (js_def parameter), and each group should contain the same number of trajectories.

  • If the submission expects paired samples, the pairing information is extracted from the "compound_taskid_set" information in the associated tasks.json file. The minimal pairing is expected, but additional pairing is accepted. For more information on pairing or grouping, please consult the SMH Sample format documentation. The hubverse defines the pairing information in a different way, with a different vocabulary; for more information on the hubverse version, please consult the Sample output type and Compound modeling tasks information.

  • If the submission file contains quantiles, the value for each group should increase with the quantiles. For example, for the 1st week ahead of target X for location Y and scenario A, if quantile 0.01 = 5, then quantile 0.5 should be equal to or greater than 5.
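The quantile check can be sketched in base R on one toy task-ID group (hypothetical values; the package's actual implementation may differ):

```r
# One hypothetical group: values must be non-decreasing across quantile levels
q <- data.frame(output_type_id = c(0.5, 0.01, 0.99),   # quantile levels
                value          = c(12, 5, 30))

q <- q[order(q$output_type_id), ]   # sort by quantile level
all(diff(q$value) >= 0)             # TRUE if values never decrease
#> [1] TRUE
```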

Additional tests

  • The locations correspond to the location names as expressed in the associated metadata JSON file (js_def parameter). If FIPS codes are expected and the FIPS numbers are missing a leading zero, the submission will be accepted but a warning message will be returned.

Remarks: If a submission file contains only state level projection (one or multiple), the location column might be automatically identified as numeric even if it was submitted in a character format. In this case, a warning message will be automatically printed on the validation but that warning can be safely ignored.

  • For each unique task_id group, excluding horizon information and excepting locations 66 (Guam), 69 (Northern Mariana Islands), 60 (American Samoa), and 74 (U.S. Minor Outlying Islands), the whole projection does not contain only one unique value. For example, the projection of incident cases for one location and one scenario does not contain only one unique value for the whole time series. As 0 deaths or cases might legitimately be projected, the submission will still be accepted if the test fails, but it will return a failure message asking to verify the projection.

  • For the cumulative targets, for example cum hosp, for each task_id and output_type group (excluding "horizon"), the values are not decreasing with time.

  • [optional] No projected value can be greater than the population size of the corresponding geographical entity. This test is run only if the pop_path parameter is not set to NULL. As an individual can be reinfected, the submission will still be accepted if the test fails, but it will return a warning message asking to verify the projection.

  • [optional] For the cumulative cases and deaths projections, the projected value should not be less than the week 0 (or week -1, depending on availability at the time of submission) value of the observed cumulative cases and deaths, respectively. The test allows a difference of 5% to account for differences in the time of pulling the observed data, sources, … (only tested if the target-data parameter is not NULL and contains observed cumulative cases and deaths values).

  • [optional] The number of decimal places accepted in the column "value" is less than or equal to the n_decimal parameter (only run if n_decimal is not set to NULL).
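The flat-projection and cumulative checks above can be sketched in base R on toy values (the package's actual implementation may differ):

```r
traj <- c(0, 3, 7, 7, 12)   # hypothetical incident values over time

# flat_projection: does the whole series contain only one unique value?
length(unique(traj)) == 1
#> [1] FALSE

# cumul_proj: cumulative values should never decrease with time
cum_traj <- cumsum(traj)
all(diff(cum_traj) >= 0)
#> [1] TRUE
```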

Submission file path

If the file is not partitioned, the hubValidations function validate_model_file() will be used to check that:

  • the file is in the expected folder with the expected file name
  • only one file per round is submitted
  • an associated metadata file is present in the hub
hubValidations::validate_model_file(hub_path, path_1)
#> 
#> ── 2023-11-12-team2-modelb.parquet ────
#> 
#> ✔ [file_exists]: File exists at path
#>   'model-output/team2-modelb/2023-11-12-team2-modelb.parquet'.
#> ✔ [file_name]: File name "2023-11-12-team2-modelb.parquet" is valid.
#> ✔ [file_location]: File directory name matches `model_id` metadata in file
#>   name.
#> ✔ [round_id_valid]: `round_id` is valid.
#> ✔ [file_format]: File is accepted hub format.
#> ✔ [file_n]: Number of accepted model output files per round met.
#> ✔ [metadata_exists]: Metadata file exists at path
#>   'model-metadata/team2-modelb.yaml'.

For more information on validate_model_file(), please consult the associated help page: ?validate_model_file

If the file is partitioned, the SMHvalidation package contains a wrapper function based on hubValidations validate_model_file().

The function validate_part_file() requires three parameters:

  • hub_path: path to the repository containing the submission files and the tasks.json file, in the hubverse format.
  • folder_path: path to the folder containing the partitioned submission files for one specific round. The folder is expected to be located in the model-output folder, and the path here should be relative to the hub_path model-output folder.
  • partition: character vector corresponding to the column names of each path segment.

For example:

SMHvalidation::validate_part_file(hub_path, path_2, c("origin_date", "target"))
#> 
#> ── t3-mc ────
#> 
#> ✔ [file_exists]: Files exist at path `/model-output/t3-mc`
#> ✔ [file_n]: Files have the same name.
#> ✔ [partition_name]: Partition exists in the accepted values: `origin_date,
#>   scenario_id, target, horizon, location, age_group`.
#> ✔ [partition_structure]: Files are partitioned in the expected number of
#>   columns.
#> ✔ [partition_value]: Partition values are expected.
#> 
#> 
#> ── 2023-11-12-t3-mc.parquet ────
#> 
#> 
#> 
#> ✔ [round_id_valid]: `round_id` is valid.
#> ✔ [file_name]: File name "2023-11-12-t3-mc.parquet" is valid.
#> ✔ [file_format]: File is accepted hub format.
#> ✔ [metadata_exists]: Metadata file exists at path 'model-metadata/t3-mc.yaml'.

Validation on Metadata and Abstract

For the metadata file, the hubValidations validate_model_metadata() will be used to validate it:

hubValidations::validate_model_metadata(hub_path, "team2-modelb.yaml")
#> 
#> ── model-metadata-schema.json ────
#> 
#> ✔ [metadata_schema_exists]: File exists at path
#>   'hub-config/model-metadata-schema.json'.
#> 
#> 
#> ── team2-modelb.yaml ────
#> 
#> 
#> 
#> ✔ [metadata_file_exists]: File exists at path
#>   'model-metadata/team2-modelb.yaml'.
#> ✔ [metadata_file_ext]: Metadata file extension is "yml" or "yaml".
#> ✔ [metadata_file_location]: Metadata file directory name matches
#>   "model-metadata".
#> ✔ [metadata_matches_schema]: Metadata file contents are consistent with schema
#>   specifications.
#> ✔ [metadata_file_name]: Metadata file name matches the `model_id` specified
#>   within the metadata file.

The validation for the “abstract” files associated with the submission is currently manual.

File Checks Running Locally

Each submission will be validated using the validate_submission() function from the SMHvalidation R package. The package is currently only available on GitHub; to install it, please follow these steps:

install.packages("remotes")
remotes::install_github("midas-network/SMHvalidation", build_vignettes = TRUE)

or it can be manually installed by directly cloning/forking/downloading the package from GitHub.

To load the package, execute the following command:

library(SMHvalidation)

The package contains a validate_submission() function allowing the user to check their SMH submissions locally.

The documentation here is associated with the latest version of the package. For previous versions of the documentation, please consult past versions of the package.

Run the validation

Run without testing against observed data (example with the RSV SMH GitHub repository):

validate_submission(model_out_path, hub_path,
                    merge_sample_col = c("run_grouping", "stochastic_run"))

With:

  • model_out_path: path to the model projection file, relative to the model-output/ folder. For example: "team1-modela/team1-modela.parquet"
  • hub_path: path to the hub. For example: "./covid19-scenario-modeling-hub/"
  • As previously stated, the SMH hubs assume a specific format for "sample" outputs, which requires the "pairing" information to be in two columns: "run_grouping" and "stochastic_run". The merge_sample_col parameter should be used to indicate that.