For our simulations we have used a synthetic population from which we instantiate the agents. A synthetic population in the right format is required to run this simulation (at least without modifications), but we have not been granted the right to redistribute these files. In this manual we detail the specifications of the synthetic population required to instantiate our simulation, so that you may be able to produce or convert your own.
We have also taken a sample from the synthetic population representing the details and activities of 3 agents (2 adults and one child) from a single household. These files can be found in src/main/resources/charlottesville_example and can serve as extra illustration of this documentation.
Agents in our simulation are drawn from a synthetic population of the state of Virginia, USA. This synthetic population has been constructed from multiple data sources including the American Community Survey (ACS), the National Household Travel Survey (NHTS), and various location and building data sets, as described in this technical report. This gives us a very detailed representation of the region we are studying (multiple counties within Virginia). Agents are assigned demographic variables drawn from the ACS, such as age, sex, race, household income, or, optionally, a designation as an essential worker, e.g., medical or retail.
The behavior of agents is characterized by weekly activity schedules: a set of typical daily activities the agents perform over the course of one week, obtained by integrating data from the NHTS. The activity schedule defines the location, start time and duration of each of the agent's activities, and classifies each activity as one of 7 distinct high-level activity types. Appropriate locations are assigned to different activities using data from multiple sources, including HERE, the Microsoft Building Database, and the National Center for Education Statistics (for school locations).
The synthetic population consists of three classes of files:
- Person files
- Household files
- Activity files
These are all .CSV files. The contents of each class of file may be distributed over multiple actual .CSV files.
The files contain references to records from the other files through IDs.
Specifically, the pid (person ID) is used to denote a unique record from the Person files, and the hid (household ID) to denote a unique record from the Household files.
The person files contain one record for each synthetic person or agent in the population, and specify the relevant socio-demographic characteristics of that person or agent.
Each person belongs to a household, and each household has at least one member.
The members of a household share a residential location, to which agents withdraw when performing the HOME type of activity, or when not performing any activity at all (due to cancellation after normative reasoning).
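The cross-references described above can be checked before running a simulation on a self-made population. The following is a minimal sketch, not code from this repository: the class HouseholdLinkCheck and its methods are hypothetical, and it assumes the pid and hid columns have already been read from the CSV files into plain arrays and sets.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical helper (not part of the repository): checks pid/hid cross-references. */
final class HouseholdLinkCheck {

    /** Groups person IDs by household ID. Each person row is {pid, hid}. */
    static Map<Long, List<Long>> membersByHousehold(long[][] personRows) {
        Map<Long, List<Long>> members = new TreeMap<>();
        for (long[] row : personRows) {
            members.computeIfAbsent(row[1], hid -> new ArrayList<>()).add(row[0]);
        }
        return members;
    }

    /** Household IDs from the Household files that no person record refers to. */
    static Set<Long> householdsWithoutMembers(Set<Long> householdIds, long[][] personRows) {
        Set<Long> empty = new TreeSet<>(householdIds);
        empty.removeAll(membersByHousehold(personRows).keySet());
        return empty;
    }

    /** Household IDs used in the Person files that are missing from the Household files. */
    static Set<Long> unknownHouseholds(Set<Long> householdIds, long[][] personRows) {
        Set<Long> unknown = new TreeSet<>(membersByHousehold(personRows).keySet());
        unknown.removeAll(householdIds);
        return unknown;
    }

    public static void main(String[] args) {
        // Made-up IDs: three persons who all belong to household 100.
        long[][] persons = {{1L, 100L}, {2L, 100L}, {3L, 100L}};
        Set<Long> households = new TreeSet<>(Arrays.asList(100L));
        System.out.println("Members per household: " + membersByHousehold(persons));
        System.out.println("Empty households: " + householdsWithoutMembers(households, persons));
        System.out.println("Unknown households: " + unknownHouseholds(households, persons));
    }
}
```

Both returned sets should be empty for a consistent population: every household has at least one member, and every person refers to an existing household.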
The activity files specify with high granularity for each person in the Person files what they are typically doing over the course of one week. Each activity is encoded as a single record, and characterized by one of 7 high level activity types:
- HOME: stay at or work from home
- WORK: go to work or take a work-related trip
- SHOP: buy goods (e.g., groceries, clothes, appliances)
- SCHOOL: attend school as a student
- COLLEGE: attend college as a student
- RELIGIOUS: religious or other community activities
- OTHER: any other class of activities, including recreational activities, exercise, dining at a restaurant, etc.
The original synthetic population also contains detailed activity types,
but as these have not been guaranteed to have been sampled accurately,
they are not used for the simulation.
However, they can be useful in understanding the semantics of the higher level activity types.
The DetailedActivity.java file specifies an enum in which the detailed activity types that the synthetic population uses are grouped by their higher level activity types.
All the fields used in each of the three files representing the synthetic population will be detailed here.
The synthetic population is split over each county in the state of Virginia. This allows us to run simulations for which we select the counties ourselves, instead of always using the entire state of Virginia. In the sample configuration with which the simulation can be instantiated, we refer to the synthetic population for the county of Charlottesville City.
Not all values in the synthetic population used for this research are actually employed in the simulation. However, some are still parsed by the model, so their presence is required. These unused but parsed values are marked with an asterisk (*) below, and can be given arbitrary values (within their type constraints) without affecting the simulation. Values that occur in the synthetic population (and in the sample files) but are not parsed at all are not documented here. Do note that this repository contains ongoing research, and the values marked as unused may be used in later versions.
In the following, each categorical type is linked to a Java enum, in which the possible values are also documented.
In the sample config, one Person file for Charlottesville City is specified:
charlottesville_examples/usa_va_charlottesville_person_1_6_0.csv
Each of the person files is parsed using the PersonReader and each person record is instantiated in the Person class.
Each record in a person file encodes one person from the synthetic population. Each Person file should have at least the following fields:
- hid: A long-typed numeric value representing a unique household ID.
- pid: A long-typed numeric value representing the unique ID of this record.
- serialno*: An integer-typed numeric value. Originally a unique value that refers to the survey number from which this agent was sampled.
- age: An integer-typed numeric value representing the age of this person.
- relationship*: A categorical integer in the range [0,17] representing the relationship of this person to the reference person of the household. The semantics of the values can be found in the Relationship enum.
- sex*: A categorical integer in the range [1,2] representing the gender of the agent, where 1 means male and 2 means female.
- school_enrollment*: A categorical integer in the range [1,3] or the letter b, representing how the person is enrolled in a school program. 1 means not enrolled, 2 means enrolled in public education, 3 means enrolled in private education or homeschooled, and b means not applicable (for persons aged 3 years or younger).
- grade_level_attending: A categorical integer in the range [1,16] representing the grade level of the person enrolled in a school, or the string bb if not enrolled. 1 means nursery school or preschool, 2 means kindergarten, 3-14 represent the grade levels 1 to 12 respectively, 15 represents an undergraduate student, and 16 represents a graduate student or professional level education beyond a bachelor level.
- employment_status*: A categorical integer in the range [1,6] representing the employment status of the person, or bb if no employment status is available (for persons under the age of 16). 1 means a civilian employed at work, 2 means a civilian employed with a job, but not at work, 3 means unemployed, 4 means armed forces at work, 5 means armed forces with a job but not at work, and 6 means the agent is not in the labor force.
- occupation_socp*: A string that originally represents one of a very large number of jobs. For this reason it is not encoded for the purpose of this simulation, and can be any string value.
- designation: Optional categorical value representing the essential designation of this person's job. Possible designations are {military, government, retail, none, education, medical, care_facilitation, dmv}, where none and a null value are equivalent.
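Several of the person fields above mix integer codes with a placeholder string (b or bb). How the PersonReader handles these is not shown in this document, so the following is only a sketch of one way to validate such columns when producing or converting your own population. The class SentinelFields and its method names are hypothetical.

```java
import java.util.OptionalInt;

/** Hypothetical helper (not part of the repository) for categorical columns with a placeholder. */
final class SentinelFields {

    /**
     * Parses a categorical integer in [min, max], or returns empty
     * if the raw value equals the placeholder (e.g. "b" or "bb").
     */
    static OptionalInt parseCategorical(String raw, String placeholder, int min, int max) {
        String value = raw.trim();
        if (value.equals(placeholder)) {
            return OptionalInt.empty();
        }
        int code = Integer.parseInt(value); // throws NumberFormatException on other strings
        if (code < min || code > max) {
            throw new IllegalArgumentException(
                    "Value " + code + " outside [" + min + "," + max + "]");
        }
        return OptionalInt.of(code);
    }

    /** school_enrollment: [1,3], or "b" when not applicable. */
    static OptionalInt schoolEnrollment(String raw) {
        return parseCategorical(raw, "b", 1, 3);
    }

    /** grade_level_attending: [1,16], or "bb" when not enrolled. */
    static OptionalInt gradeLevel(String raw) {
        return parseCategorical(raw, "bb", 1, 16);
    }

    /** employment_status: [1,6], or "bb" when not available. */
    static OptionalInt employmentStatus(String raw) {
        return parseCategorical(raw, "bb", 1, 6);
    }

    public static void main(String[] args) {
        System.out.println("grade 15 parsed: " + gradeLevel("15").isPresent());
        System.out.println("grade bb parsed: " + gradeLevel("bb").isPresent());
    }
}
```

Returning an empty OptionalInt for the placeholder keeps "not applicable" distinct from any valid code. That matters most for grade_level_attending, which is not marked as unused.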
In the sample config, one Household file for Charlottesville City is specified:
charlottesville_examples/usa_va_charlottesville_household_1_6_0.csv
Each of the household files is parsed using the HouseholdReader and each household record is instantiated in the Household class.
Each record in a household file encodes one household. Each Household file should have at least the following fields:
- hid: A long-typed numeric value representing the unique ID of this record.
- serialno*: An integer-typed numeric value. Originally a unique value that refers to the survey number from which this agent was sampled.
- puma*: Public Use Microdata Area (PUMA) code. See census.gov.
- hh_size*: A categorical integer in the range [1,3] representing the size of the household, where 1 means a house on less than one acre of ground, 2 means a house on between 1 and 10 acres of ground, 3 means a house on more than 10 acres of ground, or b representing a non-single-family house or a mobile home.
- vehicles*: An integer-typed numeric value representing the number of vehicles the household jointly owns.
- hh_income*: An integer-typed numeric value representing the yearly joint household income. For reference, for the synthetic population of Charlottesville City, the range is [0,846000] with an average of 79259, a median of 54100 and a standard deviation of 87745.
- units_in_structure*: A categorical 2-digit 0-padded integer in the range [1,10] representing whether the house or apartment is part of a bigger structure. The semantics of the values can be found in the UnitsInStructure enum.
- business*: Categorical value from {b, 1, 2, 9}, where b represents a non-single-family house or a mobile home, 1 means yes, there is a business on this property, 2 means no, and 9 means the case could not be sampled, as it was from 2016 or later.
- heating_fuel*: Categorical value representing the type of heating fuel used by the household. The semantics of the values can be found in the Fuel enum.
- household_language*: Categorical value representing the primary language spoken by the household. The semantics of the values can be found in the Language enum.
- family_type_and_employment_status*: Categorical value representing the family structure. The semantics of the values can be found in the FamilyEmployment enum.
- workers_in_family*: An integer-typed numeric value representing the number of workers in the family.
- rlid: A unique long-valued ID representing the residence location.
- residence_longitude: The longitude of the residence location. IMPORTANT: This value is used to calculate the radius of gyration, so it should be sampled accurately.
- residence_latitude: The latitude of the residence location. IMPORTANT: This value is used to calculate the radius of gyration, so it should be sampled accurately.
In the sample config, two Activity files for Charlottesville City are specified, one for adult agents, and one for children:
charlottesville_examples/usa_va_charlottesville_activity_assignment_adult_week_1_6_0.csv
charlottesville_examples/usa_va_charlottesville_activity_assignment_child_week_1_6_0.csv
The activity files encode the activities over the course of one week for each agent in the population.
The TRIP activity type is not currently used in the simulation, but it is a good idea to include TRIP activities anyway, or otherwise to leave gaps between activities to account for travel time. Apart from travel time, ideally there should be no gaps in the weekly activity schedule of any agent.
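To see whether a generated schedule leaves gaps, the start times and durations of one agent's activities can be scanned in order. The sketch below is illustrative only: the class ScheduleGaps is hypothetical, and it assumes that all activities of the agent (including any TRIP rows) fall within a single week that starts at second 0.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Hypothetical helper (not part of the repository): finds uncovered time in a weekly schedule. */
final class ScheduleGaps {

    static final long WEEK_SECONDS = 7L * 24 * 60 * 60;

    /**
     * Each activity is {start_time, duration} in seconds.
     * Returns the uncovered intervals as {from, to} pairs, in chronological order.
     */
    static List<long[]> findGaps(long[][] activities) {
        long[][] sorted = activities.clone();
        Arrays.sort(sorted, Comparator.comparingLong((long[] a) -> a[0]));

        List<long[]> gaps = new ArrayList<>();
        long coveredUntil = 0;
        for (long[] activity : sorted) {
            long start = activity[0];
            long end = start + activity[1];
            if (start > coveredUntil) {
                gaps.add(new long[] {coveredUntil, start});
            }
            // Overlapping activities must not move the covered time backwards.
            coveredUntil = Math.max(coveredUntil, end);
        }
        if (coveredUntil < WEEK_SECONDS) {
            gaps.add(new long[] {coveredUntil, WEEK_SECONDS});
        }
        return gaps;
    }

    public static void main(String[] args) {
        // Made-up schedule: one hour, then half an hour, then a 30 minute gap before the rest.
        long[][] schedule = {
            {0, 3600}, {3600, 1800}, {7200, WEEK_SECONDS - 7200}
        };
        for (long[] gap : findGaps(schedule)) {
            System.out.println("Gap from second " + gap[0] + " to " + gap[1]);
        }
    }
}
```

A reported gap is not necessarily an error. If TRIP rows are left out, short gaps between activities are the intended way to account for travel time.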
Each of the activity files is parsed using the ActivityFileReader and each activity record is instantiated in the Activity class.
Each record in the activity file encodes one activity for one agent. Each activity file should have at least the following fields:
- pid: The long-typed value representing the unique ID of the person for this activity.
- hid: The long-typed value representing the unique ID of the household to which the person for this activity belongs.
- activity_number*: A long-valued unique ID for this record.
- activity_type: A categorical integer in the range [0,7] representing either TRIP, HOME, WORK, SHOP, OTHER, SCHOOL, COLLEGE, or RELIGIOUS respectively.
- detailed_activity*: A more detailed specification of what type this activity is. As explained previously, these detailed activity types have not been guaranteed to have been sampled accurately, which is why we have opted not to use them in our model. However, they can be useful to understand the semantics of the higher level activity_type values. See DetailedActivity.java for more information.
- start_time: A long value representing a time stamp for when the activity starts, as the number of seconds since the start of Monday (so 0 represents the first second of Monday, and 24 * 60 * 60 = 86400 represents the first second of Tuesday).
- duration: The number of seconds an activity continues.
- lid♰: A long-typed value representing the unique ID of the location to be visited. Multiple visits of this or other agents to the same location should have the same ID.
- longitude♰: The longitude of the activity location. IMPORTANT: This value is used to calculate the radius of gyration, so it should be sampled accurately.
- latitude♰: The latitude of the activity location. IMPORTANT: This value is used to calculate the radius of gyration, so it should be sampled accurately.
- travel_mode*♰: A categorical integer in the range [-9,-7] ∪ [1,20] ∪ {97} representing the mode of transport employed during a TRIP type activity (no value is required for other activity types). The semantics of the values can be found in the TransportMode enum.
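As an illustration of the activity_type and start_time encodings above, the following sketch maps an integer code to its activity type and converts a start_time into a day of the week and a time of day. The names ActivityFields and HighLevelActivity are hypothetical and chosen for this example only. The repository has its own classes for this, which may be organized differently.

```java
import java.time.DayOfWeek;
import java.time.LocalTime;

/** Hypothetical helper (not part of the repository) for decoding activity fields. */
final class ActivityFields {

    /** Declared in the order of the integer codes 0..7 listed for activity_type. */
    enum HighLevelActivity {
        TRIP, HOME, WORK, SHOP, OTHER, SCHOOL, COLLEGE, RELIGIOUS;

        static HighLevelActivity fromCode(int code) {
            HighLevelActivity[] all = values();
            if (code < 0 || code >= all.length) {
                throw new IllegalArgumentException("Unknown activity_type code: " + code);
            }
            return all[code];
        }
    }

    static final long DAY_SECONDS = 24L * 60 * 60;

    /** Day of the week for a start_time given in seconds since the start of Monday. */
    static DayOfWeek dayOf(long startTime) {
        return DayOfWeek.MONDAY.plus(Math.floorDiv(startTime, DAY_SECONDS));
    }

    /** Time of day for a start_time given in seconds since the start of Monday. */
    static LocalTime timeOf(long startTime) {
        return LocalTime.ofSecondOfDay(Math.floorMod(startTime, DAY_SECONDS));
    }

    public static void main(String[] args) {
        long startTime = 86400L + 9 * 3600; // made-up value: one day plus nine hours
        System.out.println(HighLevelActivity.fromCode(2) + " starts on "
                + dayOf(startTime) + " at " + timeOf(startTime));
    }
}
```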
The location designation of activities can be split to a separate class of files, as long as for each activity number generated in the activity files, there is a location assigned in one of the location designation files. This is the case in the provided samples, but it is not necessary, as all the relevant information used by the simulation can be specified as above.
In the sample config, one location assignment file is specified:
charlottesville_examples/usa_va_charlottesville_location_assignment_week_1_6_0.csv
Each record encodes the location for exactly one activity that is specified in the activity files.
If this approach is taken, the fields marked with a cross (♰) can be moved to this class of files (i.e. deleted from the activity files), while the fields hid, pid, activity_number, activity_type, start_time, and duration should be replicated, with the exact same values for matching records.
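When splitting locations into separate files, it can help to verify that every activity record has a location record with identical replicated fields. The following is a minimal sketch under the assumption that the six replicated columns have already been read as long values. The class ReplicatedFieldCheck is hypothetical and not part of the repository.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper (not part of the repository): matches activity and location records. */
final class ReplicatedFieldCheck {

    /** The replicated columns of one record, in a fixed order. */
    static List<Long> replicated(long hid, long pid, long activityNumber,
                                 long activityType, long startTime, long duration) {
        return Arrays.asList(hid, pid, activityNumber, activityType, startTime, duration);
    }

    /** Returns the activity rows for which no location row has exactly the same replicated values. */
    static List<List<Long>> unmatched(List<List<Long>> activityRows,
                                      List<List<Long>> locationRows) {
        Set<List<Long>> known = new HashSet<>(locationRows);
        List<List<Long>> missing = new ArrayList<>();
        for (List<Long> row : activityRows) {
            if (!known.contains(row)) {
                missing.add(row);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Made-up records: the second activity has a different duration in the location file.
        List<List<Long>> activities = Arrays.asList(
                replicated(100, 1, 1, 1, 0, 3600),
                replicated(100, 1, 2, 2, 3600, 7200));
        List<List<Long>> locations = Arrays.asList(
                replicated(100, 1, 1, 1, 0, 3600),
                replicated(100, 1, 2, 2, 3600, 7000));
        System.out.println("Activities without matching location: "
                + unmatched(activities, locations));
    }
}
```

Comparing all six replicated fields at once, rather than only an ID, also catches records whose start_time or duration drifted apart between the two classes of files.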