Validation Checks for US SMH files
Validation Functions
SMHvalidation Package
The main functions in the SMHvalidation R package aim to validate and visualize Scenario Modeling Hub submissions. To validate a Scenario Modeling Hub (SMH) model projection, we use the `validate_submission()` function from the package.
Remarks: Starting January 2025, the SMH file format and associated JSON file format have been updated to follow the Hubverse schema v5 format. This documentation describes the latest version of the format.
For previous versions of the documentation, please consult past versions of the package.
The `validate_submission()` function requires two parameters:
- `path`: path to the submission file (or folder for partitioned data) to test. PQT, PARQUET, CSV, ZIP (not partitioned), or GZ (not partitioned) file formats are accepted. The path should be relative to the path of the hub containing the file to check.
- `hub_path`: path to the hub containing the submission file. The hub should follow the hubverse standard and organization. For more information on hubverse, please consult the associated website.
For more information, please consult the associated help page: `?validate_submission()`.
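For example, a minimal call could look like this (a sketch only; the team folder, file name, and hub path are hypothetical, and the `merge_sample_col` argument is explained in the next section):
library(SMHvalidation)
validate_submission("team1-modela/team1-modela.parquet",
                    hub_path = "./covid19-scenario-modeling-hub/",
                    merge_sample_col = c("run_grouping", "stochastic_run"))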
SMH Sample format
In the Scenario Modeling Hub file format, two possible formats are used for the "sample" output type:
- All the submissions for a specific round should follow the same format with predefined `output_type_id` column content.
- For a specific round, the hub collects the sample "pairing" (or joint distribution) information each team uses to generate the trajectories. In this case, the `output_type_id` column is set to `NA` for the `"sample"` output type and the "pairing" information is collected into two columns: `"run_grouping"` and `"stochastic_run"`. For more information on this format, please consult the SMH Sample format documentation.
The second format is assumed for the SMH `sample` output type. An additional parameter in the `validate_submission()` function, `merge_sample_col`, allows users to indicate that information by setting the parameter to `merge_sample_col = c("run_grouping", "stochastic_run")`.
The validation will concatenate the `merge_sample_col` columns by:
- uniting all `merge_sample_col` columns into one column, with the values separated by a `"_"`;
- transforming the output to a factor to categorize each value;
- transforming the output to a numeric to comply with the expected format:
as.numeric(as.factor(paste0(run_grouping, "_", stochastic_run)))
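For illustration, here is a minimal sketch (not package code) of this transformation on two hypothetical pairing columns:
# Two hypothetical pairing columns describing four trajectories
run_grouping   <- c(1, 1, 2, 2)
stochastic_run <- c(1, 2, 1, 2)
# Unite, categorize as a factor, then convert to a numeric ID
as.numeric(as.factor(paste0(run_grouping, "_", stochastic_run)))
#> [1] 1 2 3 4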
Output
The function returns multiple messages and, if any issue is detected, a failure or an error message:
- If a check succeeds: a message is returned (with a green check mark).
- If a check fails: an error or a failure message is returned. A failure represents a check that fails but does not block the validation from proceeding to other checks, and is represented by a red cross; an error stops the validation process and is represented by a circled red cross.
- If a check is not run or outputs additional information: a small "i" is used to represent it.
#> Run validation on files: 2023-11-12-team2-modelb.parquet
#>
#> ── 2023-11-12-team2-modelb.parquet ────
#>
#> ✔ [valid_round_id_col]: `round_id_col` name is valid.
#> ✔ [unique_round_id]: `round_id` column "origin_date" contains a single, unique
#> round ID value.
#> ✔ [match_round_id]: All `round_id_col` "origin_date" values match submission
#> `round_id` from file name.
#> ✔ [colnames]: Column names are consistent with expected round task IDs and std
#> column names.
#> ✔ [col_types]: Column data types match hub schema.
#> ✔ [valid_vals]: `tbl` contains valid values/value combinations.
#> ✔ [rows_unique]: All combinations of task ID
#> column/`output_type`/`output_type_id` values are unique.
#> ✔ [req_vals]: Task ID/output type/output_type_id combinations all present.
#> ✔ [value_col_valid]: Values in column `value` all valid with respect to
#> modeling task config.
#> ℹ [value_col_non_desc]: No quantile or cdf output types to check for
#> non-descending values. Check skipped.
#> ✔ [spl_compound_taskid_set]: All samples in a model task conform to single,
#> unique compound task ID set that matches or is coarser than the configured
#> `compound_taskid_set`.
#> ✔ [spl_compound_tid]: Each sample compound task ID contains single, unique
#> value.
#> ✔ [spl_non_compound_tid]: Task ID combinations of non compound task id values
#> consistent across modeling task samples.
#> ✔ [spl_n]: Required samples per compound idx task present.
#> ✔ [na_value]: `value` does not contain `NA` value.
#> ✔ [flat_projection]: All projections don't have a unique value for the whole
#> projection period.
#> ✔ [cumul_proj]: The cumulative values are not decreasing.
For pull request (PR) output, a function to store the validation output is used to return it in a PR message, in a slightly different format. A PR has two possible outcomes, fail or pass. Depending on the validation output, the PR behavior will differ:
- "Error" (red cross): the validation has failed and returned a message indicating the error(s). The error(s) should be fixed for the PR to be accepted.
- "Warning" (red !): the PR will fail but it can be accepted. The submitting team needs to confirm whether the warning(s) are expected before merging the PR. Once the warnings are clarified and accepted, the PR can be merged without additional modification.
- "Information" (i): either the check has not run, or the check passed and output some information of interest. The PR will not fail if one or multiple information checks are output.
- "Success" (green check): the validation did not find any issue and returns a message indicating that the validation is a success. The PR passes.
#> ✅: [valid_round_id_col]: `round_id_col` name is valid.
#>
#> ✅: [unique_round_id]: `round_id` column "origin_date" contains a single, unique round ID value.
#>
#> ✅: [match_round_id]: All `round_id_col` "origin_date" values match submission `round_id` from file name.
#>
#> ✅: [colnames]: Column names are consistent with expected round task IDs and std column names.
#>
#> ✅: [col_types]: Column data types match hub schema.
#>
#> ✅: [valid_vals]: `tbl` contains valid values/value combinations.
#>
#> ✅: [rows_unique]: All combinations of task ID column/`output_type`/`output_type_id` values are unique.
#>
#> ✅: [req_vals]: Task ID/output type/output_type_id combinations all present.
#>
#> ✅: [value_col_valid]: Values in column `value` all valid with respect to modeling task config.
#>
#> ℹ: [value_col_non_desc]: No quantile or cdf output types to check for non-descending values.
#> Check skipped.
#>
#> ✅: [spl_compound_taskid_set]: All samples in a model task conform to single, unique compound task ID set that matches or is
#> coarser than the configured `compound_taskid_set`.
#>
#> ✅: [spl_compound_tid]: Each sample compound task ID contains single, unique value.
#>
#> ✅: [spl_non_compound_tid]: Task ID combinations of non compound task id values consistent across modeling task samples.
#>
#> ✅: [spl_n]: Required samples per compound idx task present.
#>
#> ✅: [na_value]: `value` does not contain `NA` value.
#>
#> ✅: [flat_projection]: All projections don't have a unique value for the whole projection period.
#>
#> ✅: [cumul_proj]: The cumulative values are not decreasing.
#>
A shortened version of the PR output is available, containing warning, information, and error messages only.
File Check List
Submission file format
- The `round_id` information, either the date extracted from the file name(s) (`path`) or provided via the `round_id` parameter, should match one of the `round_id` values in the associated `tasks.json` file.
- The submission file(s) should be in one of the accepted formats:
  - Parquet file (preferred): `.parquet` or `.gz.parquet`. Partitioning in this format is also accepted.
  - CSV file: `.csv`. Partitioning in this format is also accepted.
  - Compressed format: `.gz` or `.zip`.
- All columns containing date information should be in the "YYYY-MM-DD" format (see the sketch after this list).
- No column should be in a "factor" format.
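The last two properties can be pre-checked locally with base R and the arrow package. This is a minimal sketch, reusing the example file name from the output above, not a replacement for `validate_submission()`:
library(arrow)
df <- read_parquet("2023-11-12-team2-modelb.parquet")
# All date columns (here, origin_date) should parse in the YYYY-MM-DD format
stopifnot(!any(is.na(as.Date(as.character(df$origin_date), format = "%Y-%m-%d"))))
# No column should be stored as a factor
stopifnot(!any(vapply(df, is.factor, logical(1))))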
File format (test_column())
The name and number of the columns correspond to the expected format:

| origin_date | scenario_id | target | horizon | location | age_group* | race_ethnicity* | output_type | output_type_id | run_grouping | stochastic_run | value |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
*The columns `age_group` and `race_ethnicity` are not required in each round. Please refer to the associated round definition for more information.
The order of the columns is not important, but the file should contain the expected number of columns with each name correctly spelled. The format type of each column should match the expected format as described in the `tasks.json` file.
Remarks: In this example, the `task_ids` values correspond to: `origin_date`, `scenario_id`, `target`, `horizon`, `location`.
Remarks: If one required column is missing, the submission test will stop immediately and return an error message without running all the other tests.
Round ID and "origin_date"
- The `origin_date` column contains:
  - one unique date value in the `YYYY-MM-DD` format,
  - information in "character" or "Date" format.
If the `origin_date` is also the round ID, the `origin_date` and the round ID (date) in the file path should correspond.
Remarks: This check has been implemented since January 2022; submissions prior to this date can have slightly more flexible date information.
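As an illustration (a minimal sketch assuming `df` is the submission table read as in the previous sketch, with a hypothetical file name), the correspondence can be checked locally like this:
# Extract the round date from a file name of the form "YYYY-MM-DD-team-model.parquet"
path <- "2023-11-12-team2-modelb.parquet"
file_date <- as.Date(substr(basename(path), 1, 10))
# origin_date should hold one unique value matching the date in the file name
stopifnot(length(unique(df$origin_date)) == 1,
          all(as.Date(df$origin_date) == file_date))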
Required values/variables
- Each task ID column value corresponds to an expected value of the expected round, without any spelling errors.
- Each task_id/output_type/output_type_id group combination has one unique value projected. For example: only one value associated with quantile `0.5`, location `US`, target `inc hosp`, horizon `1`, and scenario `A` (see the sketch after this list).
- The projection contains the expected values (for example, integer values `>= 0` or numeric values between 0 and 1). `NA` values are not accepted.
- The submission file contains projections for all the required targets. The submission file might be accepted if some targets are missing, but an error message might be returned and the submission might not be included in the ensembles.
- For some targets, for example `peak size hosp`, no `"horizon"` information is expected. In this case, the column `"horizon"` should be set to `NA`.
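The uniqueness and `NA` checks in this list can be sketched locally as follows (a minimal sketch assuming `df` as above, the dplyr package, and the example task ID columns; sample rows are excluded here for simplicity, since their grouping is carried by the pairing columns):
library(dplyr)
# Each task_id/output_type/output_type_id combination should appear only once
dup <- df |>
  filter(output_type != "sample") |>
  count(origin_date, scenario_id, target, horizon, location,
        output_type, output_type_id) |>
  filter(n > 1)
# No duplicated combinations and no NA projected values
stopifnot(nrow(dup) == 0, !anyNA(df$value))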
Output type / output type ID
- The column `output_type_id` (or `run_grouping` and `stochastic_run`) for the output type `"sample"` should only contain integer values.
- The submission should contain the expected number of trajectories for each task_id group, as indicated in the associated JSON file (`js_def` parameter), and each group should contain the same number of trajectories.
- If the submission expects paired samples, the pairing information is extracted from the `"compound_taskid_set"` information in the associated `tasks.json` file. The minimal pairing is expected, but additional pairing is accepted. For more information on pairing or grouping information, please consult the SMH Sample format documentation. The hubverse defines the pairing information in a different way, with a different vocabulary; for more information on the hubverse version, please consult the Sample output type and Compound modeling tasks information.
- If the submission file contains quantiles, the value for each group should increase with the quantiles (see the sketch after this list). For example, for the 1st week ahead of target X for location Y and scenario A, if quantile `0.01` = `5`, then quantile `0.5` should be equal to or greater than `5`.
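A minimal sketch of the quantile monotonicity idea (assuming `df` as above, dplyr, and the example task ID columns; this is not the package's implementation):
library(dplyr)
# For each task ID group, values ordered by quantile level must be non-descending
bad <- df |>
  filter(output_type == "quantile") |>
  group_by(scenario_id, target, horizon, location) |>
  arrange(as.numeric(output_type_id), .by_group = TRUE) |>
  summarise(non_descending = !is.unsorted(value), .groups = "drop") |>
  filter(!non_descending)
stopifnot(nrow(bad) == 0)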
Additional tests
- The locations correspond to the location names as expressed in the associated metadata JSON file (`js_def` parameter). If FIPS codes are expected and the FIPS numbers are missing a leading zero, the submission will be accepted but a warning message will be returned.
Remarks: If a submission file contains only state-level projections (one or multiple), the `location` column might be automatically identified as numeric even if it was submitted in a character format. In this case, a warning message will be automatically printed by the validation, but that warning can be safely ignored.
- For each unique task_id group, excluding horizon information and excluding locations `66` (Guam), `69` (Northern Mariana Islands), `60` (American Samoa), and `74` (U.S. Minor Outlying Islands), the whole projection should not contain only one unique value. For example, the projection of incident cases for one location and one scenario should not contain a single unique value for the whole time series. As it is possible that 0 deaths or cases are projected, the submission will still be accepted if this test fails, but a failure message will be returned asking to verify the projection.
- For the cumulative targets, for example `cum hosp`, for each task_id and output_type group (excluding `"horizon"`), the value should not decrease with time (see the sketch after this list).
- [optional] Each projected value cannot be greater than the population size of the corresponding geographical entity. This test is run only if the `pop_path` parameter is not set to `NULL`. As an individual can be reinfected, the submission will still be accepted if this test fails, but a warning message will be returned asking to verify the projection.
- [optional] For the cumulative cases and deaths projections, the projected value should not be less than the week 0 (or week -1, depending on availability at the time of submission) value of the observed cumulative cases and deaths, respectively. The test allows a difference of 5% to take into account differences in the time the observed data were pulled, sources, etc. (only tested if the `target-data` parameter is not `NULL` and contains observed cumulative cases and deaths values).
- [optional] The number of decimal places accepted in the column `"value"` is lower than or equal to the parameter `n_decimal` (only run if `n_decimal` is not set to `NULL`).
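A minimal sketch of the non-decreasing check on cumulative targets (assuming `df` as above, dplyr, and that cumulative target names start with "cum "; this is not the package's implementation):
library(dplyr)
# Within each group, values ordered by horizon must not decrease over time
bad <- df |>
  filter(grepl("^cum ", target)) |>
  group_by(scenario_id, target, location, output_type, output_type_id) |>
  arrange(horizon, .by_group = TRUE) |>
  summarise(non_decreasing = !is.unsorted(value), .groups = "drop") |>
  filter(!non_decreasing)
stopifnot(nrow(bad) == 0)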
Submission file path
If the file is not partitioned, the hubValidations `validate_model_file()` function will be used to check that:
- the file is in the expected folder with the expected file name,
- only one file per round is submitted,
- the associated metadata file is present in the hub.
hubValidations::validate_model_file(hub_path, path_1)
#>
#> ── 2023-11-12-team2-modelb.parquet ────
#>
#> ✔ [file_exists]: File exists at path
#> 'model-output/team2-modelb/2023-11-12-team2-modelb.parquet'.
#> ✔ [file_name]: File name "2023-11-12-team2-modelb.parquet" is valid.
#> ✔ [file_location]: File directory name matches `model_id` metadata in file
#> name.
#> ✔ [round_id_valid]: `round_id` is valid.
#> ✔ [file_format]: File is accepted hub format.
#> ✔ [file_n]: Number of accepted model output files per round met.
#> ✔ [metadata_exists]: Metadata file exists at path
#> 'model-metadata/team2-modelb.yaml'.
For more information on `validate_model_file()`, please consult the associated help page: `?validate_model_file`.
If the file is partitioned, the SMHvalidation package contains a wrapper function based on hubValidations `validate_model_file()`. The `validate_part_file()` function requires three parameters:
- `hub_path`: path to the repository containing the submission files and the `tasks.json` file, in the hubverse format.
- `folder_path`: path to the folder containing only one round-specific set of partitioned submission files. The folder is expected to be situated in the `model-output` folder, and the path here should be relative to the `hub_path` model output folder.
- `partition`: character vector corresponding to the column names of each path segment.
For example:
SMHvalidation::validate_part_file(hub_path, path_2, c("origin_date", "target"))
#>
#> ── t3-mc ────
#>
#> ✔ [file_exists]: Files exist at path `/model-output/t3-mc`
#> ✔ [file_n]: Files have the same name.
#> ✔ [partition_name]: Partition exists in the accepted values: `origin_date,
#> scenario_id, target, horizon, location, age_group`.
#> ✔ [partition_structure]: Files are partitioned in the expected number of
#> columns.
#> ✔ [partition_value]: Partition values are expected.
#>
#>
#> ── 2023-11-12-t3-mc.parquet ────
#>
#>
#>
#> ✔ [round_id_valid]: `round_id` is valid.
#> ✔ [file_name]: File name "2023-11-12-t3-mc.parquet" is valid.
#> ✔ [file_format]: File is accepted hub format.
#> ✔ [metadata_exists]: Metadata file exists at path 'model-metadata/t3-mc.yaml'.
Validation on Metadata and Abstract
For the metadata file, the hubValidations `validate_model_metadata()` function will be used to validate it:
hubValidations::validate_model_metadata(hub_path, "team2-modelb.yaml")
#>
#> ── model-metadata-schema.json ────
#>
#> ✔ [metadata_schema_exists]: File exists at path
#> 'hub-config/model-metadata-schema.json'.
#>
#>
#> ── team2-modelb.yaml ────
#>
#>
#>
#> ✔ [metadata_file_exists]: File exists at path
#> 'model-metadata/team2-modelb.yaml'.
#> ✔ [metadata_file_ext]: Metadata file extension is "yml" or "yaml".
#> ✔ [metadata_file_location]: Metadata file directory name matches
#> "model-metadata".
#> ✔ [metadata_matches_schema]: Metadata file contents are consistent with schema
#> specifications.
#> ✔ [metadata_file_name]: Metadata file name matches the `model_id` specified
#> within the metadata file.
The validation for the “abstract” files associated with the submission is currently manual.
Running File Checks Locally
Each submission will be validated using the `validate_submission()` function from the SMHvalidation R package.
The package is currently only available on GitHub. To install it, please follow these steps:
install.packages("remotes")
remotes::install_github("midas-network/SMHvalidation", build_vignettes = TRUE)
or it can be manually installed by directly cloning/forking/downloading the package from GitHub.
To load the package, execute the following command:
library(SMHvalidation)
The package contains a `validate_submission()` function allowing the user to check their SMH submissions locally.
The documentation here is associated with the latest version of the package. For previous versions of the documentation, please consult past versions of the package.
Run the validation
Run without testing against observed data (example with the RSV SMH GitHub repository):
validate_submission(model_out_path, hub_path,
merge_sample_col = c("run_grouping", "stochastic_run"))
With:
- `model_out_path`: path to the model projection file, relative to the `model-output/` folder. For example: `"team1-modela/team1-modela.parquet"`.
- `hub_path`: path to the hub. For example: `"./covid19-scenario-modeling-hub/"`.
- As previously stated, the SMH hubs assume a specific format for `"sample"` outputs, which requires the "pairing" information to be in two columns: `"run_grouping"` and `"stochastic_run"`. The `merge_sample_col` parameter should be used to indicate that.