Skip to content

Latest commit

 

History

History

shtruck_en

Folders and files

NameName
Last commit message
Last commit date

parent directory

..
 
 
 
 
 
 

Analyzing Shanghai Truck Dataset with Arctern

This tutorial uses the Shanghai Truck dataset as an example to describe how to use Arctern to process large geospatial data and use kepler.gl to visualize data.

Prerequisite

Install Jupyter Notebook

In the previous step, you have set up the artern_env environment. Enter this environment and run the following command to install Jupyter Notebook:

$ conda install -c conda-forge notebook

Install required libraries

In the arctern_env environment, run the following command to install required libraries:

$ pip install keplergl matplotlib

Data Preparation

Download the Shanghai Truck dataset, which includes more than 2 million Shanghai Truck data records and the Shanghai topographic map. Save the data in /tmp by default:

$ cd /tmp
# Download and unzip the Shanghai Truck data
$ wget https://github.com/arctern-io/arctern-bootcamp/raw/master/shtruck/file/20181016.zip
# Download Shanghai roda data
$ wget https://github.com/arctern-io/arctern-bootcamp/raw/master/shtruck/file/sh_roads.csv

Initialize Jupyter Notebook

Download arctern_shtruck_bootcamp.ipynb, and then start Jupyter Notebook in the arctern_env environment:

$ wget https://raw.githubusercontent.com/arctern-io/arctern-bootcamp/master/shtruck_en/arctern_shtruck_bootcamp.ipynb
# Start Jupyter Notebook
$ jupyter notebook

Open arctern_nytaxi_bootcamp.ipynb in Jupyter Notebook. Let's start to have fun with the example codes.

Introduction of example codes

1. Data loading

Load Shanghai truck trajectory data. The original data has a total of 8 columns, but here we only need 4 columns for analysis: plate number, time, latitude and longitude.

import pandas as pd
import arctern
from arctern import GeoSeries

sh_columns=[
    ("plate_number","string"),
    ("pos_time","string"),
    ("pos_longitude","double"),
    ("pos_latitude","double"),
    ("pos_direction0","double"),
    ("pos_direction1","double"),
    ("pos_direction2","double"),
    ("pos_direction3","double")
]

sh_select_columns={
                   "plate_number",
                   "pos_time",
                   "pos_longitude",
                   "pos_latitude"
                  }

sh_schema={}
sh_use_cols=[]
sh_names=[]
for idx in range(len(sh_columns)):
    if sh_columns[idx][0] in sh_select_columns:
        sh_schema[sh_columns[idx][0]] = sh_columns[idx][1]
        sh_use_cols.append(idx)
        sh_names.append(sh_columns[idx][0])
            
sh_df = pd.read_csv("/tmp/20181016.txt",
                    usecols=sh_use_cols,
                    names=sh_names,
                    dtype=sh_schema,
                    header=None,
                    delimiter="\t",
                    date_parser=pd.to_datetime,
                    parse_dates=["pos_time"])

According to the latitudes and longitudes, construct position information:

sh_df["pos_point"]=GeoSeries.point(sh_df.pos_longitude,sh_df.pos_latitude)
sh_df
plate_number pos_time pos_longitude pos_latitude pos_point
0 沪DK7362 2018-10-16 00:00:00 121.273108 30.989863 POINT (121.273108 30.989863)
1 沪DT0830 2018-10-16 00:00:00 121.471555 31.121763 POINT (121.471555 31.121763)
2 沪EP2723 2018-10-16 00:00:00 121.717205 31.380190 POINT (121.717205 31.38019)
3 沪DH9100 2018-10-16 00:00:00 121.476368 31.197768 POINT (121.476368 31.197768)
4 沪DP8608 2018-10-16 00:00:00 121.826568 31.096545 POINT (121.826568 31.096545)
... ... ... ... ... ...
2076589 沪EG9666 2018-10-16 23:59:31 121.753138 31.356040 POINT (121.753138 31.35604)
2076590 沪DP8746 2018-10-16 23:59:35 121.447145 31.125255 POINT (121.447145 31.125255)
2076591 沪DP8746 2018-10-16 23:59:41 121.448203 31.125408 POINT (121.448203 31.125408)
2076592 沪DP8746 2018-10-16 23:59:48 121.449426 31.125510 POINT (121.449426 31.12551)
2076593 沪DE6779 2018-10-16 23:59:54 121.880973 31.082136 POINT (121.880973 31.082136)

2076594 rows × 5 columns

2. Analyze the running track

We can restore the running track of a truck. First, we select a plate numbers of truck and filter all the data:

one_trunck_plate_number=sh_df.plate_number[0]
print(one_trunck_plate_number)
one_truck_df = sh_df[sh_df.plate_number==one_trunck_plate_number]
沪DK7362

Draw all the track points of this car on the map:

from keplergl import KeplerGl
KeplerGl(data={"car_pos": pd.DataFrame(data={'car_pos':one_truck_df.pos_point.to_wkt()})})

3. Display road network and track

Next, we will look at the running track of the above truck according to the road network information of Shanghai. First, load the road network information:

sh_roads=pd.read_csv("/tmp/sh_roads.csv", 
                     dtype={"roads":"string"},
                     usecols=[0],
                     names=["roads"],
                     header=None,
                     delimiter='|')
sh_roads=GeoSeries(sh_roads.roads)
sh_roads
0        LINESTRING (121.6358731 31.221484,121.6359771 ...
1        LINESTRING (121.6362516 31.2205594,121.6360422...
2        LINESTRING (121.6372043 31.220911,121.6369344 ...
3        LINESTRING (121.4637777 31.2314411,121.4637564...
4        LINESTRING (121.4628334 31.2311683,121.4627892...
                               ...                        
74688    LINESTRING (121.2544395 31.0235354,121.2550238...
74689    LINESTRING (121.6372338 31.2208457,121.6362516...
74690    LINESTRING (121.6372338 31.2208457,121.6373315...
74691    LINESTRING (121.3657763 31.085248,121.3656812 ...
74692    LINESTRING (121.6372043 31.220911,121.6372338 ...
Name: roads, Length: 74693, dtype: GeoDtype

Draw the above track location and road network information on the map:

one_truck_roads=KeplerGl(data={"car_pos": pd.DataFrame(data={'car_pos':one_truck_df.pos_point.to_wkt()})})
one_truck_roads.add_data(data=pd.DataFrame(data={'sh_roads':sh_roads.to_wkt()}),name="sh_roads")
one_truck_roads

The returned map results support interaction. By zooming it in, you can find that some trajectory points of the vehicle are not on the road, and these noise data need to be cleaned.

Data conversion

We regard the track locations that are not on the road as noisy data, and these track points must be bound to the closest roads:

is_near_road=arctern.near_road(sh_roads,sh_df.pos_point)
sh_near_road_df=sh_df[is_near_road]
on_road=arctern.nearest_location_on_road(sh_roads, sh_near_road_df.pos_point)
on_road=GeoSeries(on_road)
on_road
0          POINT (121.273065837839 30.9898629672054)
1          POINT (121.471521117758 31.1218966267949)
2          POINT (121.717183265368 31.3801593122801)
3           POINT (121.47636780833 31.1977688430427)
4          POINT (121.826533061028 31.0965194009541)
                             ...                    
2076589    POINT (121.753124012736 31.3560208068604)
2076590     POINT (121.44712530551 31.1255173541719)
2076591    POINT (121.448188797914 31.1255971887735)
2076592    POINT (121.449412558681 31.1256890544539)
2076593    POINT (121.880966206794 31.0821528456654)
Length: 1807018, dtype: GeoDtype
  • Binding algorithm

Bind the truck locations to the roads, and reconstruct the DataFrame

sh_on_road_df=pd.DataFrame(data={"plate_number":sh_near_road_df.plate_number,
                                 "pos_time":sh_near_road_df.pos_time,
                                 "on_road":on_road
                                })
sh_on_road_df
plate_number pos_time on_road
0 沪DK7362 2018-10-16 00:00:00 POINT (121.273065837839 30.9898629672054)
1 沪DT0830 2018-10-16 00:00:00 POINT (121.471521117758 31.1218966267949)
2 沪EP2723 2018-10-16 00:00:00 POINT (121.717183265368 31.3801593122801)
3 沪DH9100 2018-10-16 00:00:00 POINT (121.47636780833 31.1977688430427)
4 沪DP8608 2018-10-16 00:00:00 POINT (121.826533061028 31.0965194009541)
... ... ... ...
2076589 沪EG9666 2018-10-16 23:59:31 POINT (121.753124012736 31.3560208068604)
2076590 沪DP8746 2018-10-16 23:59:35 POINT (121.44712530551 31.1255173541719)
2076591 沪DP8746 2018-10-16 23:59:41 POINT (121.448188797914 31.1255971887735)
2076592 沪DP8746 2018-10-16 23:59:48 POINT (121.449412558681 31.1256890544539)
2076593 沪DE6779 2018-10-16 23:59:54 POINT (121.880966206794 31.0821528456654)

1807018 rows × 3 columns

Draw the track of the truck and the road network of Shanghai again:
one_on_road_df=sh_on_road_df[sh_on_road_df.plate_number==one_trunck_plate_number]
one_on_roads=KeplerGl(data={"car_pos": pd.DataFrame(data={'car_pos':one_on_road_df.on_road.to_wkt()})})
one_on_roads.add_data(data=pd.DataFrame(data={'sh_roads':sh_roads.to_wkt()}),name="sh_roads")
one_on_roads

By zooming out, we can see that all the locations are on the road:

4. Road analysis

There are 74,693 road network records in Shanghai, but trucks will not cross all roads. So, we will analyze the roads that the truck passes by to see which roads have the highest passing-by frequency.

Draw the road where trucks travel

First, filter out all the roads that the truck passes:

all_roads=arctern.nearest_road(sh_roads,sh_on_road_df.on_road)
all_roads=GeoSeries(all_roads)
road_codes, road_uniques = pd.factorize(all_roads)

Show the road data of all passing trucks and the corresponding percentage:

print(len(road_uniques))
print(len(road_uniques)*100.0/len(sh_roads))
16450
22.0234827895519

Draw all the roads that the truck passes:

KeplerGl(data={"all_roads": pd.DataFrame(data={'all_roads':GeoSeries(road_uniques).to_wkt()})})

For some main roads, there are many trucks pass by, which results in plenty of GPS signal sampling points on these roads. At the same time, due to the high traffic volume of these roads, many trucks drive very slowly, which further strengthenes the GPS sampling points on the roads.

Next, we will find the busy roads according to the number of truck locations on the roads.

Road weight analysis of truck

Count the number of truck locations on each road, and reconstruct DataFrame. We take the number of location points on the road as road weights:

roads_codes_series = pd.Series(road_codes)
roads_codes_series = roads_codes_series.value_counts()
roads_codes_series = roads_codes_series.sort_index()

sh_road_weight = pd.DataFrame(data={"on_road":GeoSeries(road_uniques),
                                    "weight_value":roads_codes_series
                                   })

sh_road_weight
on_road weight_value
0 LINESTRING (121.2730666 30.9888831,121.2730596... 1646
1 LINESTRING (121.4677565 31.1198416,121.4678423... 1579
2 LINESTRING (121.7202751 31.3780202,121.7197339... 141
3 LINESTRING (121.477849 31.1981056,121.4742212 ... 83
4 LINESTRING (121.8374393 31.0816313,121.8345587... 1268
... ... ...
16445 LINESTRING (121.4278848 31.2389835,121.4280869... 3
16446 LINESTRING (121.431042 31.2403309,121.4307167 ... 1
16447 LINESTRING (121.6378175 31.2374256,121.6373658... 1
16448 LINESTRING (121.432118 31.2416392,121.4314564 ... 1
16449 LINESTRING (121.4444113 30.9236353,121.443815 ... 2

16450 rows × 2 columns

Show the road weights:

sh_road_weight.weight_value.describe()
count    16450.000000
mean       109.849119
std        802.067993
min          1.000000
25%          2.000000
50%          5.000000
75%         22.000000
max      28144.000000
Name: weight_value, dtype: float64

It can be found that most roads are not busy, yet some roads are very busy.

Plot weight_value into a histogram:

import matplotlib.pyplot as plt
plt.bar(sh_road_weight.index,sh_road_weight.weight_value)
plt.show()

The histogram can further confirm the previous conclusion.

Sort all roads according to the road weights:

sh_sorted_road=sh_road_weight.sort_values(by=['weight_value'],ascending=False)
sh_sorted_road
on_road weight_value
102 LINESTRING (121.4248121 31.4032657,121.4265065... 28144
23 LINESTRING (121.473513 31.3702961,121.4736103 ... 24364
9 LINESTRING (121.6349225 31.14309,121.6348039 3... 21448
43 LINESTRING (121.2749664 31.0244814,121.2722674... 20599
89 LINESTRING (121.3814009 31.391344,121.3820681 ... 20463
... ... ...
13661 LINESTRING (121.3757469 31.0789625,121.3756868... 1
7980 LINESTRING (121.3675949 31.2531523,121.3677313... 1
11360 LINESTRING (121.6867871 31.104265,121.6815152 ... 1
13664 LINESTRING (121.5174295 31.2760579,121.5171332... 1
14269 LINESTRING (121.3291131 31.7212865,121.3269348... 1

16450 rows × 2 columns

Select the top 100 busiest roads:

sh_sorted_road.iloc[0:100]
on_road weight_value
102 LINESTRING (121.4248121 31.4032657,121.4265065... 28144
23 LINESTRING (121.473513 31.3702961,121.4736103 ... 24364
9 LINESTRING (121.6349225 31.14309,121.6348039 3... 21448
43 LINESTRING (121.2749664 31.0244814,121.2722674... 20599
89 LINESTRING (121.3814009 31.391344,121.3820681 ... 20463
... ... ...
13 LINESTRING (121.4986731 31.2685444,121.4963124... 3668
566 LINESTRING (121.3656233 31.257495,121.3689789 ... 3655
291 LINESTRING (121.5675539 31.3573854,121.5650207... 3625
759 LINESTRING (121.4644001 31.3612873,121.463751 ... 3622
345 LINESTRING (121.297295 31.1168969,121.2973077 ... 3616

100 rows × 2 columns

Draw the top 100 busiest roads on the map:

KeplerGl(data={"on_roads": pd.DataFrame(data={'on_roads':sh_sorted_road.on_road.iloc[0:100].to_wkt()})})