integrity tests and fixes for sequences and circular lines #84

Open
marlinarnz opened this issue Jun 28, 2022 · 1 comment

@marlinarnz
Collaborator

Hi there,
For public transport (PT) networks with hundreds of thousands to millions of links, quetzal's integrity check functions integrity_test_sequences() and integrity_test_circular_lines() take an impractically long time (I had to interrupt the last test, with 2 million links, after one day). I therefore suggest some faster logic:

The sequence test only compares the number of links in a trip with its highest link_sequence, so it can overlook cases like 1-->2-->2-->4, but those are unlikely (they do not occur in my GTFS feeds):

def test_sequences(trip):
    assert len(trip)==trip['link_sequence'].max(), \
        'broken sequence in trip {}'.format(trip['trip_id'].unique()[0])
self.links.groupby('trip_id').apply(test_sequences)
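
A stricter check that also catches cases like 1-->2-->2-->4 can stay vectorized. The following is only a minimal sketch and not part of quetzal: the function name find_broken_sequences and the toy data are made up for illustration, and it assumes the links table has the columns trip_id and link_sequence.

```python
import pandas as pd

def find_broken_sequences(links):
    """Return the trip_ids whose link_sequence is not exactly 1..n."""
    ordered = links.sort_values(['trip_id', 'link_sequence'])
    # Expected position of each link within its trip: 1, 2, ..., n
    expected = ordered.groupby('trip_id').cumcount() + 1
    broken = ordered['link_sequence'] != expected
    return sorted(ordered.loc[broken, 'trip_id'].unique())

# Toy data (hypothetical): t2 has the sequence 1, 2, 2, 4, which passes a
# pure length check because len == max == 4
links = pd.DataFrame({
    'trip_id': ['t1', 't1', 't1', 't2', 't2', 't2', 't2'],
    'link_sequence': [1, 2, 3, 1, 2, 2, 4],
})
broken_trips = find_broken_sequences(links)
print(broken_trips)
```

The sketch returns trip_ids instead of raising, so all broken trips are reported in one pass.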

The circular lines test should account for any case where duplicate stops occur within one trip:

def test_circular(trip):
    if len(set(list(trip['a'])+list(trip['b']))) != len(trip)+1:
        return trip
self.circular_lines = self.links.groupby('trip_id').apply(test_circular).reset_index(level='trip_id', drop=True)
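
The same criterion (a chain of n links touches exactly n + 1 distinct stops) can also be evaluated without a Python function call per trip. This is a hedged sketch under the assumption that the links table has the columns trip_id, a and b; find_circular_trips is a hypothetical helper, not a quetzal function.

```python
import pandas as pd

def find_circular_trips(links):
    """Return the trip_ids that visit at least one stop more than once."""
    # Stack origin and destination stops into one long table
    stops = pd.concat([
        links[['trip_id', 'a']].rename(columns={'a': 'stop'}),
        links[['trip_id', 'b']].rename(columns={'b': 'stop'}),
    ])
    n_stops = stops.groupby('trip_id')['stop'].nunique()
    n_links = links.groupby('trip_id').size()
    # A simple chain of n links touches exactly n + 1 distinct stops
    circular = n_stops.ne(n_links + 1)
    return sorted(circular[circular].index)

# Toy data (hypothetical): t2 runs A -> B -> C -> A and returns to its start
links = pd.DataFrame({
    'trip_id': ['t1', 't1', 't2', 't2', 't2'],
    'a': ['A', 'B', 'A', 'B', 'C'],
    'b': ['B', 'C', 'B', 'C', 'A'],
})
circular_trips = find_circular_trips(links)
print(circular_trips)
```

The affected links can then be selected with links[links['trip_id'].isin(circular_trips)].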

On the other hand, the fix methods are a bit too quick to drop all affected trips. I would suggest a more thorough fix that splits up trip_ids, knowing that this creates additional interchanges. That does not represent reality, but it is better than dropping trips when their number is considerable.

A suggestion for trip sequences:

def fix_sequences(trip):
    if len(trip) > 1:
        trip = trip.sort_values('link_sequence')
        # Check link succession: each link must start where the previous one ends
        ind = list(trip.index)
        for i in range(len(ind) - 1):
            if trip.loc[ind[i], 'b'] != trip.loc[ind[i+1], 'a']:
                # Stop b has no successor link: move the remaining links to a new trip_id
                trip.loc[ind[i+1]:ind[-1], 'trip_id'] = \
                    trip.loc[ind[i+1]:ind[-1], 'trip_id'] + '_' + str(i)
        # Repair sequences: renumber from 1 within each (possibly new) trip_id
        if len(trip) != trip['link_sequence'].max() or trip['trip_id'].nunique() > 1:
            trip['link_sequence'] = trip.groupby('trip_id').cumcount() + 1
    return trip
self.links = self.links.groupby('trip_id').apply(fix_sequences).reset_index(level=0, drop=True)
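
For very large networks, the break detection can also be expressed with shift and cumsum instead of a Python loop per trip. This is a sketch of an alternative technique, not the logic above verbatim: split_broken_trips is a hypothetical name, the suffix is the running number of breaks rather than the link position, and it assumes string trip_ids, a unique index and no existing trip_id that collides with the new suffixed ones.

```python
import pandas as pd

def split_broken_trips(links):
    """Give every unbroken chain of links its own trip_id and renumber it."""
    links = links.sort_values(['trip_id', 'link_sequence']).copy()
    # End stop of the previous link in the same trip (missing for the first link)
    prev_b = links.groupby('trip_id')['b'].shift()
    is_break = prev_b.notna() & (prev_b != links['a'])
    # Number of breaks seen so far within the trip: 0 for the first chain
    segment = is_break.astype(int).groupby(links['trip_id']).cumsum()
    suffix = ('_' + segment.astype(str)).where(segment > 0, '')
    links['trip_id'] = links['trip_id'] + suffix
    # Renumber from 1 within each resulting trip_id
    links['link_sequence'] = links.groupby('trip_id').cumcount() + 1
    return links

# Toy data (hypothetical): the chain breaks between stop C and stop D
links = pd.DataFrame({
    'trip_id': ['t1'] * 4,
    'a': ['A', 'B', 'D', 'E'],
    'b': ['B', 'C', 'E', 'F'],
    'link_sequence': [1, 2, 5, 6],
})
fixed = split_broken_trips(links)
print(fixed)
```

As with the loop version, every split introduces an artificial interchange at the break.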

My suggestion for circular lines fixes 97% of the circularity issues:

import pandas as pd

def fix_circular_split(trip):
    def split_trip(trip, split_by):
        # Positions of all links whose stop in column split_by occurs more than once
        split = [trip.index.get_loc(i) for i in trip.loc[trip[split_by].duplicated(keep=False)].index]
        if len(split) >= 1:
            trips = []
            # First stops
            trips.append(trip.iloc[: split[0]+1])
            # Middle stops
            for i in range(1, len(split)):
                # Copy the slice so that the assignments below do not act on a view
                t = trip.iloc[split[i-1]+1 : split[i]].copy()
                t['trip_id'] = t['trip_id'] + '_' + str(i) + str(split_by)
                t['link_sequence'] = list(range(1, len(t)+1))
                trips.append(t)
            # Last stops
            t = trip.iloc[split[-1] :].copy()
            t['trip_id'] = t['trip_id'] + '_n' + str(split_by)
            t['link_sequence'] = list(range(1, len(t)+1))
            trips.append(t)
            return pd.concat(trips)
        else:
            return trip
    # Split duplicated b stops
    trip = split_trip(trip, 'b')
    # Split duplicated a stops (group_keys=False keeps the original link index)
    trip = trip.groupby('trip_id', group_keys=False).apply(split_trip, 'a')
    return trip
fixed = self.circular_lines.groupby('trip_id').apply(fix_circular_split).reset_index(level='trip_id', drop=True)
initial_circular = self.circular_lines.copy()
# Keep only the trips that are still circular after splitting
self.circular_lines = fixed.groupby('trip_id').apply(test_circular).reset_index(level='trip_id', drop=True)
# Drop the links that could not be fixed
fixed.drop(self.circular_lines.index, inplace=True)
self.links = self.links.loc[~self.links['trip_id'].isin(initial_circular['trip_id'].unique())]
# DataFrame.append was removed in pandas 2.0
self.links = pd.concat([self.links, fixed])

It's all tested with the PT network of the whole of Germany. I hope I made no mistakes translating the logic into quetzal function suggestions.

I would suggest keeping the current methods, but including an option for "quick-checks" and "thorough-fixes".
Cheers

@siforf564
Collaborator

Thank you @marlinarnz 😊

I will look into integrating these suggestions soon.

Simon ✌

@siforf564 siforf564 self-assigned this Jun 28, 2022