“Grouped” Sample File Format

For a specific round, the hub collects information about the “grouping” of trajectories. This information allows us to characterize the joint distribution of samples, rather than having samples from only the marginal distributions. In other words, we want to identify which trajectories are independent, and which come from the same XXX (what we will call “grouped”).

We say two trajectories are “grouped” if they have the same parameters, initial conditions, etc. Grouping can occur on different levels. We describe a few possibilities below, then outline the file format to record this grouping information.

Grouping on horizon: The concept of a “trajectory” implies that weeks are grouped. In this case, all weeks from a single trajectory are from the same model run, with the same model parameters, etc.
Grouping on horizon and age_group: In this case, a single model run would generate results (e.g., of incident outcomes) for all weeks and all age groups. This is common for age-structured models.

The goal of this file format is to keep track of the “grouped” trajectories.

For example, for a specific round:
- Minimal grouping required: horizon and age group for each model run (the second case described above)

How To Register The “Group” Information:

To keep this “how to” guide short, we use a simplified example with only:

2 age groups: "65-130" and "0-130"
2 locations: "06", "47"
2 scenarios: "A", "B"
2 weeks horizon

For the "sample" output type, the grouping information is collected via two columns: "run_grouping" and "stochastic_run".

run_grouping: This column records any additional grouping that controls for a factor driving the variance between trajectories (e.g., underlying parameters, baseline fit) and that is shared across trajectories in different scenarios. That is, if using this grouping reduces the overall variance compared to analyzing all trajectories as independent, record it by giving all relevant rows the same number. If no such grouping exists, number each model run independently.
stochastic_run: A unique ID that differentiates multiple stochastic runs. If the model is not stochastic, the column contains a single value shared by all rows.

For "sample" rows, the output_type_id column is set to NA; the “grouping” information is carried entirely by the two columns "run_grouping" and "stochastic_run".

Number of Trajectories

Each submission must provide 100 trajectories for each task; that is, the submission file contains 100 repetitions of each modeling task.

In the following examples, only two trajectories are provided.

This first example shows only the skeleton of the file. The two columns "run_grouping" and "stochastic_run" are not filled in yet; the following sections and examples explain how to populate them.

origin_date	scenario_id	target	location	horizon	age_group	output_type	output_type_id
2023-11-12	A	inc hosp	06	1	0-130	sample	NA
2023-11-12	A	inc hosp	06	2	0-130	sample	NA
2023-11-12	A	inc hosp	06	1	65-130	sample	NA
2023-11-12	A	inc hosp	06	2	65-130	sample	NA
2023-11-12	B	inc hosp	06	1	0-130	sample	NA
2023-11-12	B	inc hosp	06	2	0-130	sample	NA
2023-11-12	B	inc hosp	06	1	65-130	sample	NA
2023-11-12	B	inc hosp	06	2	65-130	sample	NA
2023-11-12	A	inc hosp	06	1	0-130	sample	NA
2023-11-12	A	inc hosp	06	2	0-130	sample	NA
2023-11-12	A	inc hosp	06	1	65-130	sample	NA
2023-11-12	A	inc hosp	06	2	65-130	sample	NA
2023-11-12	B	inc hosp	06	1	0-130	sample	NA
2023-11-12	B	inc hosp	06	2	0-130	sample	NA
2023-11-12	B	inc hosp	06	1	65-130	sample	NA
2023-11-12	B	inc hosp	06	2	65-130	sample	NA
2023-11-12	A	inc hosp	47	1	0-130	sample	NA
2023-11-12	A	inc hosp	47	2	0-130	sample	NA
2023-11-12	A	inc hosp	47	1	65-130	sample	NA
2023-11-12	A	inc hosp	47	2	65-130	sample	NA
2023-11-12	B	inc hosp	47	1	0-130	sample	NA
2023-11-12	B	inc hosp	47	2	0-130	sample	NA
2023-11-12	B	inc hosp	47	1	65-130	sample	NA
2023-11-12	B	inc hosp	47	2	65-130	sample	NA
2023-11-12	A	inc hosp	47	1	0-130	sample	NA
2023-11-12	A	inc hosp	47	2	0-130	sample	NA
2023-11-12	A	inc hosp	47	1	65-130	sample	NA
2023-11-12	A	inc hosp	47	2	65-130	sample	NA
2023-11-12	B	inc hosp	47	1	0-130	sample	NA
2023-11-12	B	inc hosp	47	2	0-130	sample	NA
2023-11-12	B	inc hosp	47	1	65-130	sample	NA
2023-11-12	B	inc hosp	47	2	65-130	sample	NA
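
The skeleton above can be generated programmatically. The following is a minimal sketch, assuming the simplified example (2 locations, 2 scenarios, 2 age groups, 2 horizons, 2 trajectories); the function name `build_skeleton` and the loop variable `_draw` are hypothetical helpers for illustration and are not part of the file format.

```python
from itertools import product

COLUMNS = ["origin_date", "scenario_id", "target", "location",
           "horizon", "age_group", "output_type", "output_type_id"]


def build_skeleton(n_trajectories=2):
    """Return one row (as a dict) per task and trajectory, in the table's row order."""
    locations = ["06", "47"]
    scenarios = ["A", "B"]
    age_groups = ["0-130", "65-130"]
    horizons = [1, 2]
    rows = []
    # product() varies the last argument fastest: horizon, then age group,
    # then scenario, then trajectory (_draw, hypothetical), then location.
    for location, _draw, scenario, age_group, horizon in product(
            locations, range(n_trajectories), scenarios, age_groups, horizons):
        rows.append({
            "origin_date": "2023-11-12",
            "scenario_id": scenario,
            "target": "inc hosp",
            "location": location,
            "horizon": horizon,
            "age_group": age_group,
            "output_type": "sample",
            "output_type_id": "NA",
        })
    return rows


skeleton = build_skeleton()
# 2 locations x 2 scenarios x 2 age groups x 2 horizons = 16 tasks,
# each repeated for 2 trajectories.
print(len(skeleton))
```

With the required 100 trajectories, the same 16 tasks would give 1,600 rows.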

“Group” Information

For the stochastic_run column:

If the model runs are not stochastic, the column is set to a single identifier shared by all rows: 1 for all rows.
- see Example 1: column stochastic_run

For both stochastic_run and run_grouping columns:

If model runs differ in run_grouping, in stochastic run, or in both, each “group” has a different identifier:
- As the minimal grouping is by "age_group" and "horizon", each “group” is defined as a set of rows containing all possible values of "age_group" and "horizon". Each such group should have a unique identifier within the submission file.
- It is possible to add additional grouping information:
  - For example, if a team wants to “group” by "age_group", "horizon" and "scenario_id" (or "location"), each “group” is defined as a set of rows containing all possible values of "age_group", "horizon" and "scenario_id" (or "location"), and each group has a unique identifier.
  - Another possibility for additional “grouping” is a subset of values from a specific column:
    - For example, if we expand our example here to 5 scenarios and the submission is “grouped” by "age_group", "horizon" and some subset of "scenario_id", each “group” is defined as a set of rows containing all possible values of "age_group" and "horizon" and the values in that specific subset of "scenario_id".
If some model runs share the same run_grouping or stochastic_run (i.e., they share the same seed), the corresponding “groups” share the same identifier.
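
The identifier rules above can be sketched as a small helper that numbers each distinct combination of chosen columns in order of first appearance. This is an illustration only, not the hub's tooling: the function `assign_ids` and the `draw` field (an index of the model run within a location) are hypothetical and do not appear in the submission file.

```python
from itertools import product


def assign_ids(rows, key_columns):
    """Number each distinct combination of key_columns from 1, in order of first appearance."""
    seen = {}
    ids = []
    for row in rows:
        key = tuple(row[column] for column in key_columns)
        if key not in seen:
            seen[key] = len(seen) + 1
        ids.append(seen[key])
    return ids


# Simplified example: 2 locations, 2 draws (hypothetical helper field),
# 2 scenarios, 2 age groups, 2 horizons.
rows = [
    {"location": loc, "draw": draw, "scenario_id": scen,
     "age_group": age, "horizon": hor}
    for loc, draw, scen, age, hor in product(
        ["06", "47"], [1, 2], ["A", "B"], ["0-130", "65-130"], [1, 2])
]

# Minimal grouping: all age groups and horizons of one model run share an id.
per_run = assign_ids(rows, ["location", "draw", "scenario_id"])

# Additional grouping by scenario_id: scenarios A and B of the same draw share an id.
across_scenarios = assign_ids(rows, ["location", "draw"])

print(sorted(set(per_run)))           # [1, 2, 3, 4, 5, 6, 7, 8]
print(sorted(set(across_scenarios)))  # [1, 2, 3, 4]
```

Under these assumptions, `per_run` follows the numbering pattern used in Example 2 and `across_scenarios` follows the pattern used in Example 5.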

Minimum Grouping

As stated above, trajectories must be “grouped” at least by "age_group" and "horizon". The combination of the run_grouping and stochastic_run columns must therefore provide a unique identifier for each group containing all possible values of "age_group" and "horizon".
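
One possible reading of this requirement can be checked with the sketch below. It is not the hub's validation code: the function `check_minimal_grouping`, the choice of `TASK_COLUMNS`, and the helper `make_rows` are assumptions made for illustration. For every (run_grouping, stochastic_run) pair within a scenario and location, it verifies that each combination of horizon and age group appears exactly once.

```python
from collections import defaultdict
from itertools import product

# Assumption: the columns that identify a task, apart from horizon and age_group.
TASK_COLUMNS = ("origin_date", "scenario_id", "target", "location")


def check_minimal_grouping(rows):
    """Return the group keys that do not cover every (horizon, age_group) pair exactly once."""
    expected = set(product({r["horizon"] for r in rows},
                           {r["age_group"] for r in rows}))
    cells_by_group = defaultdict(list)
    for r in rows:
        key = (r["run_grouping"], r["stochastic_run"]) + tuple(r[c] for c in TASK_COLUMNS)
        cells_by_group[key].append((r["horizon"], r["age_group"]))
    return [key for key, cells in cells_by_group.items()
            if len(cells) != len(expected) or set(cells) != expected]


def make_rows(split_horizons):
    """Two model runs for one scenario and location (hypothetical test data)."""
    rows = []
    for run in (1, 2):
        for age, hor in product(["0-130", "65-130"], [1, 2]):
            # When split_horizons is True, each horizon gets its own id,
            # which breaks the minimal grouping on purpose.
            run_grouping = run * 10 + hor if split_horizons else run
            rows.append({"origin_date": "2023-11-12", "scenario_id": "A",
                         "target": "inc hosp", "location": "06",
                         "horizon": hor, "age_group": age,
                         "run_grouping": run_grouping, "stochastic_run": 1})
    return rows


valid_problems = check_minimal_grouping(make_rows(split_horizons=False))
invalid_problems = check_minimal_grouping(make_rows(split_horizons=True))
print(len(valid_problems), len(invalid_problems))
```

The first call reports no problem; the second flags every identifier, because each one covers a single horizon only.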

Examples

Example 1 (grouped by `age_group` and `horizon`)

In this example, each model run has a different run_grouping (the model runs are independent) and the runs are not stochastic:

origin_date	scenario_id	target	location	horizon	age_group	output_type	output_type_id	run_grouping	stochastic_run
2023-11-12	A	inc hosp	06	1	0-130	sample	NA	1	1
2023-11-12	A	inc hosp	06	2	0-130	sample	NA	1	1
2023-11-12	A	inc hosp	06	1	65-130	sample	NA	1	1
2023-11-12	A	inc hosp	06	2	65-130	sample	NA	1	1
2023-11-12	B	inc hosp	06	1	0-130	sample	NA	2	1
2023-11-12	B	inc hosp	06	2	0-130	sample	NA	2	1
2023-11-12	B	inc hosp	06	1	65-130	sample	NA	2	1
2023-11-12	B	inc hosp	06	2	65-130	sample	NA	2	1
2023-11-12	A	inc hosp	06	1	0-130	sample	NA	3	1
2023-11-12	A	inc hosp	06	2	0-130	sample	NA	3	1
2023-11-12	A	inc hosp	06	1	65-130	sample	NA	3	1
2023-11-12	A	inc hosp	06	2	65-130	sample	NA	3	1
2023-11-12	B	inc hosp	06	1	0-130	sample	NA	4	1
2023-11-12	B	inc hosp	06	2	0-130	sample	NA	4	1
2023-11-12	B	inc hosp	06	1	65-130	sample	NA	4	1
2023-11-12	B	inc hosp	06	2	65-130	sample	NA	4	1
2023-11-12	A	inc hosp	47	1	0-130	sample	NA	5	1
2023-11-12	A	inc hosp	47	2	0-130	sample	NA	5	1
2023-11-12	A	inc hosp	47	1	65-130	sample	NA	5	1
2023-11-12	A	inc hosp	47	2	65-130	sample	NA	5	1
2023-11-12	B	inc hosp	47	1	0-130	sample	NA	6	1
2023-11-12	B	inc hosp	47	2	0-130	sample	NA	6	1
2023-11-12	B	inc hosp	47	1	65-130	sample	NA	6	1
2023-11-12	B	inc hosp	47	2	65-130	sample	NA	6	1
2023-11-12	A	inc hosp	47	1	0-130	sample	NA	7	1
2023-11-12	A	inc hosp	47	2	0-130	sample	NA	7	1
2023-11-12	A	inc hosp	47	1	65-130	sample	NA	7	1
2023-11-12	A	inc hosp	47	2	65-130	sample	NA	7	1
2023-11-12	B	inc hosp	47	1	0-130	sample	NA	8	1
2023-11-12	B	inc hosp	47	2	0-130	sample	NA	8	1
2023-11-12	B	inc hosp	47	1	65-130	sample	NA	8	1
2023-11-12	B	inc hosp	47	2	65-130	sample	NA	8	1

Example 2 (grouped by `age_group` and `horizon`)

In this example, each model run has a different run_grouping (the model runs are independent) for every stochastic run:

origin_date	scenario_id	target	location	horizon	age_group	output_type	output_type_id	run_grouping	stochastic_run
2023-11-12	A	inc hosp	06	1	0-130	sample	NA	1	1
2023-11-12	A	inc hosp	06	2	0-130	sample	NA	1	1
2023-11-12	A	inc hosp	06	1	65-130	sample	NA	1	1
2023-11-12	A	inc hosp	06	2	65-130	sample	NA	1	1
2023-11-12	B	inc hosp	06	1	0-130	sample	NA	2	2
2023-11-12	B	inc hosp	06	2	0-130	sample	NA	2	2
2023-11-12	B	inc hosp	06	1	65-130	sample	NA	2	2
2023-11-12	B	inc hosp	06	2	65-130	sample	NA	2	2
2023-11-12	A	inc hosp	06	1	0-130	sample	NA	3	3
2023-11-12	A	inc hosp	06	2	0-130	sample	NA	3	3
2023-11-12	A	inc hosp	06	1	65-130	sample	NA	3	3
2023-11-12	A	inc hosp	06	2	65-130	sample	NA	3	3
2023-11-12	B	inc hosp	06	1	0-130	sample	NA	4	4
2023-11-12	B	inc hosp	06	2	0-130	sample	NA	4	4
2023-11-12	B	inc hosp	06	1	65-130	sample	NA	4	4
2023-11-12	B	inc hosp	06	2	65-130	sample	NA	4	4
2023-11-12	A	inc hosp	47	1	0-130	sample	NA	5	5
2023-11-12	A	inc hosp	47	2	0-130	sample	NA	5	5
2023-11-12	A	inc hosp	47	1	65-130	sample	NA	5	5
2023-11-12	A	inc hosp	47	2	65-130	sample	NA	5	5
2023-11-12	B	inc hosp	47	1	0-130	sample	NA	6	6
2023-11-12	B	inc hosp	47	2	0-130	sample	NA	6	6
2023-11-12	B	inc hosp	47	1	65-130	sample	NA	6	6
2023-11-12	B	inc hosp	47	2	65-130	sample	NA	6	6
2023-11-12	A	inc hosp	47	1	0-130	sample	NA	7	7
2023-11-12	A	inc hosp	47	2	0-130	sample	NA	7	7
2023-11-12	A	inc hosp	47	1	65-130	sample	NA	7	7
2023-11-12	A	inc hosp	47	2	65-130	sample	NA	7	7
2023-11-12	B	inc hosp	47	1	0-130	sample	NA	8	8
2023-11-12	B	inc hosp	47	2	0-130	sample	NA	8	8
2023-11-12	B	inc hosp	47	1	65-130	sample	NA	8	8
2023-11-12	B	inc hosp	47	2	65-130	sample	NA	8	8

Example 3 (grouped by `age_group` and `horizon`)

In this example, each model run has a run_grouping set that is replicated in multiple stochastic runs:

origin_date	scenario_id	target	location	horizon	age_group	output_type	output_type_id	run_grouping	stochastic_run
2023-11-12	A	inc hosp	06	1	0-130	sample	NA	1	1
2023-11-12	A	inc hosp	06	2	0-130	sample	NA	1	1
2023-11-12	A	inc hosp	06	1	65-130	sample	NA	1	1
2023-11-12	A	inc hosp	06	2	65-130	sample	NA	1	1
2023-11-12	B	inc hosp	06	1	0-130	sample	NA	2	2
2023-11-12	B	inc hosp	06	2	0-130	sample	NA	2	2
2023-11-12	B	inc hosp	06	1	65-130	sample	NA	2	2
2023-11-12	B	inc hosp	06	2	65-130	sample	NA	2	2
2023-11-12	A	inc hosp	06	1	0-130	sample	NA	1	3
2023-11-12	A	inc hosp	06	2	0-130	sample	NA	1	3
2023-11-12	A	inc hosp	06	1	65-130	sample	NA	1	3
2023-11-12	A	inc hosp	06	2	65-130	sample	NA	1	3
2023-11-12	B	inc hosp	06	1	0-130	sample	NA	2	4
2023-11-12	B	inc hosp	06	2	0-130	sample	NA	2	4
2023-11-12	B	inc hosp	06	1	65-130	sample	NA	2	4
2023-11-12	B	inc hosp	06	2	65-130	sample	NA	2	4
2023-11-12	A	inc hosp	47	1	0-130	sample	NA	3	5
2023-11-12	A	inc hosp	47	2	0-130	sample	NA	3	5
2023-11-12	A	inc hosp	47	1	65-130	sample	NA	3	5
2023-11-12	A	inc hosp	47	2	65-130	sample	NA	3	5
2023-11-12	B	inc hosp	47	1	0-130	sample	NA	4	6
2023-11-12	B	inc hosp	47	2	0-130	sample	NA	4	6
2023-11-12	B	inc hosp	47	1	65-130	sample	NA	4	6
2023-11-12	B	inc hosp	47	2	65-130	sample	NA	4	6
2023-11-12	A	inc hosp	47	1	0-130	sample	NA	3	7
2023-11-12	A	inc hosp	47	2	0-130	sample	NA	3	7
2023-11-12	A	inc hosp	47	1	65-130	sample	NA	3	7
2023-11-12	A	inc hosp	47	2	65-130	sample	NA	3	7
2023-11-12	B	inc hosp	47	1	0-130	sample	NA	4	8
2023-11-12	B	inc hosp	47	2	0-130	sample	NA	4	8
2023-11-12	B	inc hosp	47	1	65-130	sample	NA	4	8
2023-11-12	B	inc hosp	47	2	65-130	sample	NA	4	8

Example 4 (grouped by `age_group`, `horizon`, `scenario_id`)

For example, each model run has a run_grouping set grouped by age_group, horizon, scenario_id replicated in multiple stochastic runs. The scenarios are assumed to share the same run_grouping set but different stochastic runs.

origin_date	scenario_id	target	location	horizon	age_group	output_type	output_type_id	run_grouping	stochastic_run
2023-11-12	A	inc hosp	06	1	0-130	sample	NA	1	1
2023-11-12	A	inc hosp	06	2	0-130	sample	NA	1	1
2023-11-12	A	inc hosp	06	1	65-130	sample	NA	1	1
2023-11-12	A	inc hosp	06	2	65-130	sample	NA	1	1
2023-11-12	B	inc hosp	06	1	0-130	sample	NA	1	2
2023-11-12	B	inc hosp	06	2	0-130	sample	NA	1	2
2023-11-12	B	inc hosp	06	1	65-130	sample	NA	1	2
2023-11-12	B	inc hosp	06	2	65-130	sample	NA	1	2
2023-11-12	A	inc hosp	06	1	0-130	sample	NA	1	3
2023-11-12	A	inc hosp	06	2	0-130	sample	NA	1	3
2023-11-12	A	inc hosp	06	1	65-130	sample	NA	1	3
2023-11-12	A	inc hosp	06	2	65-130	sample	NA	1	3
2023-11-12	B	inc hosp	06	1	0-130	sample	NA	1	4
2023-11-12	B	inc hosp	06	2	0-130	sample	NA	1	4
2023-11-12	B	inc hosp	06	1	65-130	sample	NA	1	4
2023-11-12	B	inc hosp	06	2	65-130	sample	NA	1	4
2023-11-12	A	inc hosp	47	1	0-130	sample	NA	2	5
2023-11-12	A	inc hosp	47	2	0-130	sample	NA	2	5
2023-11-12	A	inc hosp	47	1	65-130	sample	NA	2	5
2023-11-12	A	inc hosp	47	2	65-130	sample	NA	2	5
2023-11-12	B	inc hosp	47	1	0-130	sample	NA	2	6
2023-11-12	B	inc hosp	47	2	0-130	sample	NA	2	6
2023-11-12	B	inc hosp	47	1	65-130	sample	NA	2	6
2023-11-12	B	inc hosp	47	2	65-130	sample	NA	2	6
2023-11-12	A	inc hosp	47	1	0-130	sample	NA	2	7
2023-11-12	A	inc hosp	47	2	0-130	sample	NA	2	7
2023-11-12	A	inc hosp	47	1	65-130	sample	NA	2	7
2023-11-12	A	inc hosp	47	2	65-130	sample	NA	2	7
2023-11-12	B	inc hosp	47	1	0-130	sample	NA	2	8
2023-11-12	B	inc hosp	47	2	0-130	sample	NA	2	8
2023-11-12	B	inc hosp	47	1	65-130	sample	NA	2	8
2023-11-12	B	inc hosp	47	2	65-130	sample	NA	2	8

Additional examples are available in the repository as .parquet files. They all contain required and optional targets.

Example 5 (grouped by `age_group`, `horizon`, `scenario_id`)

For example, each model run has a run_grouping set grouped by age_group, horizon, scenario_id (model run independent) with one stochastic run per grouping.

origin_date	scenario_id	target	location	horizon	age_group	output_type	output_type_id	run_grouping	stochastic_run
2023-11-12	A	inc hosp	06	1	0-130	sample	NA	1	1
2023-11-12	A	inc hosp	06	2	0-130	sample	NA	1	1
2023-11-12	A	inc hosp	06	1	65-130	sample	NA	1	1
2023-11-12	A	inc hosp	06	2	65-130	sample	NA	1	1
2023-11-12	B	inc hosp	06	1	0-130	sample	NA	1	1
2023-11-12	B	inc hosp	06	2	0-130	sample	NA	1	1
2023-11-12	B	inc hosp	06	1	65-130	sample	NA	1	1
2023-11-12	B	inc hosp	06	2	65-130	sample	NA	1	1
2023-11-12	A	inc hosp	06	1	0-130	sample	NA	2	2
2023-11-12	A	inc hosp	06	2	0-130	sample	NA	2	2
2023-11-12	A	inc hosp	06	1	65-130	sample	NA	2	2
2023-11-12	A	inc hosp	06	2	65-130	sample	NA	2	2
2023-11-12	B	inc hosp	06	1	0-130	sample	NA	2	2
2023-11-12	B	inc hosp	06	2	0-130	sample	NA	2	2
2023-11-12	B	inc hosp	06	1	65-130	sample	NA	2	2
2023-11-12	B	inc hosp	06	2	65-130	sample	NA	2	2
2023-11-12	A	inc hosp	47	1	0-130	sample	NA	3	3
2023-11-12	A	inc hosp	47	2	0-130	sample	NA	3	3
2023-11-12	A	inc hosp	47	1	65-130	sample	NA	3	3
2023-11-12	A	inc hosp	47	2	65-130	sample	NA	3	3
2023-11-12	B	inc hosp	47	1	0-130	sample	NA	3	3
2023-11-12	B	inc hosp	47	2	0-130	sample	NA	3	3
2023-11-12	B	inc hosp	47	1	65-130	sample	NA	3	3
2023-11-12	B	inc hosp	47	2	65-130	sample	NA	3	3
2023-11-12	A	inc hosp	47	1	0-130	sample	NA	4	4
2023-11-12	A	inc hosp	47	2	0-130	sample	NA	4	4
2023-11-12	A	inc hosp	47	1	65-130	sample	NA	4	4
2023-11-12	A	inc hosp	47	2	65-130	sample	NA	4	4
2023-11-12	B	inc hosp	47	1	0-130	sample	NA	4	4
2023-11-12	B	inc hosp	47	2	0-130	sample	NA	4	4
2023-11-12	B	inc hosp	47	1	65-130	sample	NA	4	4
2023-11-12	B	inc hosp	47	2	65-130	sample	NA	4	4
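
To compare the five examples, the sketch below lists the (run_grouping, stochastic_run) pair of each 4-row block in the tables above, read top to bottom, and counts the distinct values. The names `EXAMPLES`, `summarize` and `SUMMARY` are hypothetical and used only for this illustration.

```python
# (run_grouping, stochastic_run) for each 4-row block of the example tables.
EXAMPLES = {
    "Example 1": [(1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1)],
    "Example 2": [(i, i) for i in range(1, 9)],
    "Example 3": [(1, 1), (2, 2), (1, 3), (2, 4), (3, 5), (4, 6), (3, 7), (4, 8)],
    "Example 4": [(1, 1), (1, 2), (1, 3), (1, 4), (2, 5), (2, 6), (2, 7), (2, 8)],
    "Example 5": [(1, 1), (1, 1), (2, 2), (2, 2), (3, 3), (3, 3), (4, 4), (4, 4)],
}


def summarize(pairs):
    """Count distinct run_grouping values, stochastic_run values, and joint pairs."""
    return {
        "run_groupings": len({run_grouping for run_grouping, _ in pairs}),
        "stochastic_runs": len({stochastic_run for _, stochastic_run in pairs}),
        "joint_ids": len(set(pairs)),
    }


SUMMARY = {name: summarize(pairs) for name, pairs in EXAMPLES.items()}
for name, counts in SUMMARY.items():
    print(name, counts)
```

Examples 1 to 4 each yield 8 distinct joint identifiers, one per block. Example 5 yields 4, because scenarios A and B of the same run share both identifiers.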

Validation

The automatic validation run on each pull request has been updated to verify that:

- the run_grouping and stochastic_run columns contain integers for output type "sample"
- the concatenation of the run_grouping and stochastic_run columns contains the minimal grouping information: each identifier covers all possible values of the horizon and age group columns
- the submission file has the expected number of trajectories (100 trajectories)
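
The integer and trajectory-count checks can be sketched as follows. This is a hedged illustration, not the hub's validation code: the function `validate_samples`, the tuple `TASK_ID_COLUMNS`, and the error messages are assumptions.

```python
from collections import Counter

# Assumption: the columns that together identify one modeling task.
TASK_ID_COLUMNS = ("origin_date", "scenario_id", "target",
                   "location", "horizon", "age_group")


def validate_samples(rows, expected_trajectories=100):
    """Return a list of error messages; an empty list means both checks passed."""
    errors = []
    samples = [r for r in rows if r["output_type"] == "sample"]

    # Check 1: both grouping columns hold integers (bool is excluded on purpose).
    for column in ("run_grouping", "stochastic_run"):
        values = [r.get(column) for r in samples]
        if not all(isinstance(v, int) and not isinstance(v, bool) for v in values):
            errors.append(f"{column} must contain integers")

    # Check 2: every task appears exactly expected_trajectories times.
    counts = Counter(tuple(r[c] for c in TASK_ID_COLUMNS) for r in samples)
    wrong = [task for task, n in counts.items() if n != expected_trajectories]
    if wrong:
        errors.append(
            f"{len(wrong)} task(s) do not have {expected_trajectories} trajectories")
    return errors


# Hypothetical data: 100 trajectories for a single task.
rows = [{"origin_date": "2023-11-12", "scenario_id": "A", "target": "inc hosp",
         "location": "06", "horizon": 1, "age_group": "0-130",
         "output_type": "sample", "output_type_id": "NA",
         "run_grouping": i, "stochastic_run": 1}
        for i in range(1, 101)]

print(validate_samples(rows))       # []
print(validate_samples(rows[:99]))  # one error about the trajectory count
```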

Submission Example files

The RSV GitHub Repository contains multiple example files reproducing the required and optional targets for RSV round 1 (grouped by age group and horizon):

Team 2 - Model B:
- Each model run has a different run_grouping (the model runs are independent) and the runs are not stochastic.
- Submission grouped by horizon and age group
- Example file: 2023-11-12-team2-modelb.parquet
Team 3 - Model C:
- Each model run has a different run_grouping (the model runs are independent) for every stochastic run.
- Submission grouped by horizon and age group
- Example file: 2023-11-12-team3-modelc.parquet
Team 4 - Model D:
- Each model run has a run_grouping set replicated in multiple stochastic runs.
- Submission grouped by horizon and age group
- Example file: 2023-11-12-team4-modeld.parquet
Team 5 - Model E:
- Each model run has a run_grouping set grouped by age_group, horizon, scenario_id and replicated in multiple stochastic runs. The scenarios are assumed to share the same run_grouping set but have different stochastic runs.
- Submission grouped by horizon, age group and scenario
- Example file: 2023-11-12-team5-modele.parquet