-
Download base datasets and files collected and processed by us. Unzip and put this folder in the root of project folder. To see how we generated these files you can run
notebooks/additional_data_preprocessing.ipynb
(seedata exploration
section below for more details). -
Having all needed files, you can generate
train*.pkl
andtest*.pkl
usingnotebooks/generate_data.ipynb
or you can just download it: train - 7 GB, test - 0.8 GB. You can verify that the data in the tables doesn't contain data leaks and data from datasets that aren't available to other participants. Please, store them asdata/train*.pkl
anddata/test*.pkl
. -
Having
train*.pkl
andtest*.pkl
usenotebooks/modelling_fai.ipynb
to train neural network and reproduce our submit. NOTE! For reproducing our results exactly please set PYTHONHASHSEED before running notebook. So, the start command should look like this:env PYTHONHASHSEED=42 jupyter notebook
.
Resulting folder structure should look like this:
Please, don't pay attention to Makefile
, setup.cfg
, apt.txt
,
and Neuro.md
, we used them only for working with our GPU cloud.
Let's discuss exactly what data we used and how we processed
it in notebooks/additional_data_preprocessing.ipynb
.
In our solution we used base data (train.csv
, road_segments
and Sanral_v3
)
and also we used additional data
shared between all participants (Weather Data
, Public holidays
and Uber Movement Data
).
Note, that we didnt't use
data leak which was found in Injuries*.csv
and Vehicles*.csv
.
We marked additional or modified files and directories with _processed
suffix in their names.
Folder: data/SANRAL_processed
.
We automatically assigned each CCTV
, VDS
and VMS
to the nearest road segment and then counted number
of assigned cameras for each road segment - cameras_count.csv
.
Also we convert original SANRAL
data to more usefull format:
cameras_commissioning_dates.csv
based onCommissioning Date
column fromVDS_locations.xlsx
.vds_locations.csv
is the second sheet ofSANRAL_V3/VDS_locations.xlsx
with handled missing records.VDS_hourly_all.csv
is concatenation of all hourly files fromSANRAL_V3/Vehicle detection sensor (VDS)/[2016-2019]
.
Folder: data/road_segments_processed
.
We opened shapefiles and their attribute table in QGis
(free programm for viewing geo files) and
enriched them with 3 new columns:
main_route
- we assigned each road segment to long part of road namedroute#
(see image below).vds_id
- the neareast VDS camera id to the road segment.num
- the enumerate id of the road segment.
Folder: data/weather_processed
.
We downloaded 2 airport weather files
(68816.01.01.2016.31.03.2019.1.0.0.en.utf8.00000000.csv
and
FACT.01.01.2016.31.03.2019.1.0.0.en.utf8.00000000.csv
)
from provided site and combined them into 1 table with small preprocessing
(see weather.csv
folder).
We found that some accidents have only day indicator (with missing hour), so we decided to fill this
missing values with hours taken from Injures*.csv
/ Vehicles.csv
files. As a result we replaced roughly
1500 records and stored fixed table as train_cleaned.csv
.
Please note, as in the train.csv
there is only 2017-2018 year data, we didn't use the leak
from 2019 year.
Folder: data/uber_processed/
We downloaded a lot of files from
Uber Movement
site provided by organizers (see uber_files_downloaded.zip
).
Than we filtered and
combined them into one large table - uber_files_joined.csv
.
Since the travel time data is related to uber zones, we
manually created mapping between uber zones and road segments - enum_to_uzone.json
.
In addition, we had to manually determine which road segments a car consistently
crosses moving from one uber zone to another (routes.json
and sid_neighbors.json
).
For convenience, we store lengths of segments in routes_length.json
and
replaced all road segment names with numbers (sid_to_enum.json
) using
num
column from shapefile from the previous step. As a result we got the average travel times in
submit format: for each road segment for each timestamp - segment_ttime_daily.csv
.
Training model is much simpler than feature engineering: we trained single neural network
(we used fastai
framework for this purpose) without
any ensembling with 1-fold local validation.
- To train the model you have to run
notebooks/modeling_fai.ipynb
, it requirestrain*.pkl
andtest*.pkl
as inputs, generated (or downloaded) on the previous step. As an output, this will give the submit file and the model weights. Estimated time of 1 attempt of model training on nvidia-tesla k80 is about 30 min.