
Predicting delay times on the New York City subway

I had the opportunity to work as a Fellow with Insight Data Science for the last couple of months, an experience that was much more challenging than I expected, but also a lot of fun! The people were great -- the program managers were supportive and highly knowledgeable. The other fellows in my session came from different academic backgrounds, yet we were all able to work together to accomplish very difficult tasks in a short amount of time. If you're interested in transitioning from academia to data science in industry, they're worth a peep.
The first few weeks of the fellowship involved putting together data projects that demonstrated our ability to break out of the academic mindset and focus on finding solutions for practical, everyday, or business-related problems. My first attempt at a project, a cross-art recommendation system, failed. What's that? Basically, the idea was that if you have a general interest in fine arts and really like a certain writer / musician / photographer / etc., the recommender would introduce you to specific works from a different art form of your choice. I wanted to do this because I've lately become more interested in visual art, but I know almost nothing about it. The recommender might have been a good way to lead me down a stylistic path based on other art forms that I do know something about. Is the idea practical? Well, recommendation systems are becoming ubiquitous across almost all web-based products and services (Netflix / Spotify / Amazon all have recommenders to expose you to things you might want to watch / listen to / buy). However, a common criticism of recommendation systems is that they rarely introduce genuinely new material, and I thought this might be a good way to build a simple, unique, content-based recommender specifically designed to show you completely new things. More than anything, I wanted to have fun and dip back into an art world I've increasingly neglected as I've gotten older.
After about a week I realized that the data I needed (both the right kind and enough of it) would be too difficult to obtain within the project's timeframe. So I switched. This post focuses primarily on my second project: a Twitter bot that announces expected delay times for the NYC subway. Big shift, right? Yes, but if I couldn't spend a few weeks selfishly delving into art, I still liked the idea of building something that could be easily used and accessed, addressed a real need, and demonstrated my ability to do predictive analytics. Additionally, it was topical -- the NYC subway is in a service crisis of sorts, and has been for years. Check out @MTA on Twitter to gauge public opinion, or a quick Google search will confirm it.
The overall steps to the project were the following: 
1) Find real-time data about subway service 
2) Find historical data from which I could extract information related to specific delays and how long they lasted 
3) Using this data, build an accurate model predicting delay times 
4) Build a Twitter bot which receives real-time information about subway delays and announces anticipated delay times using the predictive model
I will present code snippets from some of the steps outlined above. Additionally, you can see a quick video presentation of my project on YouTube if you're interested.

Finding real-time data about NYC subway service

The MTA (the agency that runs the NYC subway) has a developer page with real-time information about train locations, as well as some historical data on subway service. Without going into too much detail, I decided not to use this information because, given my time constraints (I had already lost a week on the recommendation system), it wasn't suitable for my needs (or abilities). Instead, I turned to the NYC subway Twitter page, which announces the beginning and end of every subway delay as it occurs. The code below can be used to repeatedly pull a single user's tweets, indefinitely, in a while loop:
In [ ]:
#!/usr/bin/env python
import tweepy
import time as tm
import random as rd

# you first need to register for key information to use Twitter's API
CONSUMER_KEY = '' 
CONSUMER_SECRET = ''
ACCESS_TOKEN = ''
ACCESS_TOKEN_SECRET = ''

auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth.secure = True
auth.set_access_token(ACCESS_TOKEN,ACCESS_TOKEN_SECRET)
api = tweepy.API(auth, wait_on_rate_limit=True,wait_on_rate_limit_notify=True)

# Now periodically check the user's twitter feed for new info.
ct =  20 # only pull this many tweets (change this to whatever you want)
newest = []
new_tweets = []
username = '@NYCTSubway'

while True:
    
    if not newest: # if newest is empty, there is no previous tweet
        # makes an initial request for most recent tweets
        new_tweets = api.user_timeline(username, count = ct)
    else:
        new_tweets = api.user_timeline(username, since_id = newest)

    
    if not new_tweets: # if there are no new tweets (or no tweets at all)
        print('no new tweets')
    else:
        outtweets = ([[tweet.id_str, tweet.created_at, tweet.text.encode("utf-8")] 
                      for tweet in new_tweets])
        
        # print out some relevant tweet information
        for twt in outtweets:
            print(twt)

        # save the id of the newest tweet plus one (plus 1 to avoid overlap)
        newest = new_tweets[0].id + 1
        
    # sleep for a bit to avoid rate limit errors
    tm.sleep(rd.randint(180,240))
Note -- make sure you are not trying to pull too much information from Twitter too quickly (respect the rate limits and pause between requests). There are rate limits which, when exceeded, can cause errors or even get your account suspended if you're not careful. See the Twitter developer page for best practices.
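If you want the polling loop to be a bit more defensive, you can wrap the timeline call in a try/except. Here's a minimal sketch assuming tweepy 3.x (where RateLimitError and TweepError exist); the helper name fetch_new_tweets is mine, not part of the original script:

import time as tm
import tweepy

def fetch_new_tweets(api, username, since_id=None, count=20):
    # pull the latest tweets for a user, backing off politely on errors
    try:
        if since_id is None:
            return api.user_timeline(username, count=count)
        return api.user_timeline(username, since_id=since_id)
    except tweepy.RateLimitError:
        # rate limit hit -- sleep out the 15-minute window and try again later
        tm.sleep(15 * 60)
        return []
    except tweepy.TweepError as e:
        # any other API error: log it and try again on the next pass
        print('Twitter API error: %s' % e)
        return []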

Finding historical data

One key issue with getting historical tweets is that the Twitter API only gives you roughly the most recent ~2,000 or so. For my project, 2,000 tweets were nowhere near enough historical data, because less than 5% of them were delay announcements. Instead, I modified code that uses Selenium to scrape data directly from Twitter, allowing me to bypass that limitation. I won't post the code in this blog, but it can be easily adapted from here. This allowed me to collect over 70,000 tweets, including over 3,000 delay announcements, between 2015 and 2017!
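For a sense of how the scraped tweets get narrowed down to delay announcements, here's a rough sketch. The file name, column names, and keywords are hypothetical stand-ins for illustration; the actual filtering rules I used were more involved:

import pandas as pd

# rough sketch: flag likely delay-start and delay-end announcements by keyword
# (the file name, column names, and keywords here are hypothetical)
tweets = pd.read_csv('scraped_tweets.csv')   # columns: id, created_at, text

starts = tweets['text'].str.contains('delays', case=False, na=False)
ends = tweets['text'].str.contains('resum', case=False, na=False)

print('possible delay starts:', starts.sum())
print('possible delay ends:', ends.sum())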

Building an accurate model to predict delay times

This is the complete set of code I used to build the model. I tried multiple ML methods, but ultimately decided to use a regularized linear (ridge) classifier to determine whether delays would be longer or shorter than 15 minutes. Most methods performed similarly, and this was the one I felt most comfortable with.
My decision to use a classifier rather than a regression technique was mostly due to the fact that the model performed better as a classifier (oops, I know I'm not supposed to say that!). However, this decision ended up being quite handy from a user standpoint -- NYC subway delays are horrendously long. It's not uncommon for them to last many dozens of minutes, so beyond a certain point a regression model isn't all that useful (who cares whether a delay will be exactly 35 or 40 minutes?). I felt that knowing whether the delay would be shorter or longer than 15 minutes gave a rider ample information to decide whether to wait or find another route to her destination.
Here's the code:
In [1]:
# import some basic stuff
import pandas as pd
import numpy as np
from copy import deepcopy
from scipy import interp

# stuff for plotting
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# read a database from CSV and load it into a pandas dataframe
file = 'directory for training data csv file'
data = pd.read_csv(file)
del data['id'] #delete the 'id' column because I don't want that in the model
Here's a quick look at the original data that I pulled from the Twitter feed:
In [2]:
data.head(5)
Out[2]:
   delay      structure  latitude   longitude   direction  event_hour  event_month  event_weekday  what_happened        route1 ... routeZ
0  28.983333  Subway     40.665414  -73.992872  n/b        3           5            6              reason_undetermined  0      ... 0
1  14.200000  Elevated   40.636260  -73.994791  s/b        20          1            3              passenger            0      ... 0
2  43.866667  Subway     40.701411  -74.013205  b/d        17          3            2              signal               1      ... 0
3  19.833333  Subway     40.725915  -73.994659  s/b        12          7            6              signal               0      ... 0
4  25.316667  Open Cut   40.646292  -73.994324  s/b        14          2            3              switch               0      ... 0

5 rows × 32 columns
I excluded delays that were longer than 1 hour. Below is the distribution of delay durations in minutes. The median is slightly over 16 minutes, so the 15-minute classification threshold also ended up keeping the data set roughly balanced.
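That exclusion step isn't shown in the notebook; a minimal sketch of it (assuming the 'delay' column is already in minutes) would be:

# drop delays longer than an hour (sketch -- not the original filtering code)
data = data[pd.to_numeric(data['delay']) <= 60].reset_index(drop=True)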
In [3]:
plt.hist(pd.to_numeric(data['delay']))
plt.xlabel('minutes')
plt.ylabel('occurrences')
print('median of delay time is ' + str(np.median(data['delay'])) + ' minutes')
median of delay time is 16.16666667 minutes
This is a quick function to discretize the delays from minutes into classes shorter or longer than 15 minutes. The few functions I have written in this notebook exist because I apply the same processing to a hold-out data set to assess the final accuracy of the model (see below).
In [4]:
# This is a classification problem, so here I discretize the data
y = data['delay']

def discretize_response(y):
    ystr = []
    for i in range(0,len(y)):
        if y[i] < 15:
            s = 0 # less than 15 minutes
        else:
            s = 1 # 15 minutes or greater

        ystr.append(s)

    # return the discretized delays as a one-column dataframe
    y = pd.DataFrame({'delay':ystr})
    return y

y = discretize_response(y)
print(y['delay'].value_counts())
1    1374
0    1206
Name: delay, dtype: int64
Most of the data is categorical, so I expanded the feature set using one-hot-encoding. This created more than 80 features, which I wanted to compress to keep the model as simple as possible (and to avoid over-fitting). The next few lines show how I created dummy variables using one-hot-encoding, and then consolidated the feature space to maintain simplicity.
In [6]:
# do one-hot-encoding for categorical features
X = data.loc[:,'structure':'routeZ']

# print the original list of features
print("Original features:\n", list(X.columns), "\n")
X_dummies = pd.get_dummies(X,columns=['structure', 'event_month','event_weekday',
                                            'direction','what_happened','event_hour'])

del X_dummies['structure_Embankment'] #has to be done to match the test set

# print the list of features after creating dummy variables
print("Features after get_dummies:\n", list(X_dummies.columns))
Original features:
 ['structure', 'latitude', 'longitude', 'direction', 'event_hour', 'event_month', 'event_weekday', 'what_happened', 'route1', 'route2', 'route3', 'route4', 'route5', 'route6', 'route7', 'routeA', 'routeB', 'routeC', 'routeD', 'routeE', 'routeF', 'routeG', 'routeJ', 'routeL', 'routeM', 'routeN', 'routeQ', 'routeR', 'routeS', 'routeW', 'routeZ'] 

Features after get_dummies:
 ['latitude', 'longitude', 'route1', 'route2', 'route3', 'route4', 'route5', 'route6', 'route7', 'routeA', 'routeB', 'routeC', 'routeD', 'routeE', 'routeF', 'routeG', 'routeJ', 'routeL', 'routeM', 'routeN', 'routeQ', 'routeR', 'routeS', 'routeW', 'routeZ', 'structure_At Grade', 'structure_Elevated', 'structure_Open Cut', 'structure_Subway', 'structure_Viaduct', 'event_month_1', 'event_month_2', 'event_month_3', 'event_month_4', 'event_month_5', 'event_month_6', 'event_month_7', 'event_month_8', 'event_month_9', 'event_month_10', 'event_month_11', 'event_month_12', 'event_weekday_0', 'event_weekday_1', 'event_weekday_2', 'event_weekday_3', 'event_weekday_4', 'event_weekday_5', 'event_weekday_6', 'direction_b/d', 'direction_direction_unknown ', 'direction_n/b', 'direction_s/b', 'what_happened_investigation', 'what_happened_mechanical', 'what_happened_passenger', 'what_happened_reason_undetermined', 'what_happened_signal', 'what_happened_switch', 'what_happened_track', 'event_hour_0', 'event_hour_1', 'event_hour_2', 'event_hour_3', 'event_hour_4', 'event_hour_5', 'event_hour_6', 'event_hour_7', 'event_hour_8', 'event_hour_9', 'event_hour_10', 'event_hour_11', 'event_hour_12', 'event_hour_13', 'event_hour_14', 'event_hour_15', 'event_hour_16', 'event_hour_17', 'event_hour_18', 'event_hour_19', 'event_hour_20', 'event_hour_21', 'event_hour_22', 'event_hour_23']
In [7]:
# here's the function to consolidate the dummy features

def consolidate_features(X):
    # reorganize the features
    #X = deepcopy(X)

    # consolidate months down to seasons
    X['winter'] = X['event_month_12'] + X['event_month_1'] + X['event_month_2']
    X['spring'] = X['event_month_3'] + X['event_month_4'] + X['event_month_5']
    X['summer'] = X['event_month_6'] + X['event_month_7'] + X['event_month_8']
    X['autumn'] = X['event_month_9'] + X['event_month_10'] + X['event_month_11']

    # consolidate times of day to rush hour morning / evening, work hours, evening / sleep hours
    # notice, hours are in GMT
    X['time_rush_hour_eve'] = X['event_hour_19'] + X['event_hour_20'] + X['event_hour_21'] #GMT
    X['time_rush_hour_morn'] = X['event_hour_10'] + X['event_hour_11'] + X['event_hour_12'] #GMT
    X['time_work'] = (X['event_hour_13'] + X['event_hour_14'] + X['event_hour_15'] + 
                         X['event_hour_16'] + X['event_hour_17'] + X['event_hour_18']) #GMT
    X['time_eve'] = X['event_hour_22'] + X['event_hour_23'] + X['event_hour_0'] 
    X['time_sleep'] = (X['event_hour_1'] + X['event_hour_2'] + X['event_hour_3'] +
                          X['event_hour_4'] + X['event_hour_5'] + X['event_hour_6'] +
                          X['event_hour_7'] + X['event_hour_8'] + X['event_hour_9'])

    # consolidate day of week into weekday / weekend
    X['weekday'] = (X['event_weekday_0'] + X['event_weekday_1'] + X['event_weekday_2'] +
                      X['event_weekday_3'] + X['event_weekday_4'])
    X['weekend'] = X['event_weekday_5'] + X['event_weekday_6']

    # remove unnecessary data
    # I removed routes because those were highly correlated to the station locations
    cols = [c for c in X.columns if c.lower()[:5] != 'event']
    X = X[cols]
    cols = [c for c in X.columns if c.lower()[:5] != 'route']
    X = X[cols]

    return X

X_dummies = consolidate_features(X_dummies)

# print the new list of consolidated features
print('the new feature names are: ')
list(X_dummies)
the new feature names are: 
Out[7]:
['latitude',
 'longitude',
 'structure_At Grade',
 'structure_Elevated',
 'structure_Open Cut',
 'structure_Subway',
 'structure_Viaduct',
 'direction_b/d',
 'direction_direction_unknown ',
 'direction_n/b',
 'direction_s/b',
 'what_happened_investigation',
 'what_happened_mechanical',
 'what_happened_passenger',
 'what_happened_reason_undetermined',
 'what_happened_signal',
 'what_happened_switch',
 'what_happened_track',
 'winter',
 'spring',
 'summer',
 'autumn',
 'time_rush_hour_eve',
 'time_rush_hour_morn',
 'time_work',
 'time_eve',
 'time_sleep',
 'weekday',
 'weekend']
You can see that after consolidating, the list of features is much more manageable than directly after creating dummy variables. For example, days of the week were consolidated to be either weekdays or weekends; months were consolidated to seasons, etc.
With the preprocessing done, the next task is to build the model. The lines below are pretty self-explanatory. Two additional steps were taken to prepare the data here. First, I normalized the numerical features (the geographic coordinates of the stations). Second, I used SMOTE (synthetic minority oversampling technique) to balance the data set. Even though the data set was already close to balanced, I found that using SMOTE greatly improved the performance of my final model.
In [8]:
# import stuff for building and testing model
from sklearn.metrics import roc_curve, auc, accuracy_score, precision_score, f1_score, classification_report
from sklearn.cross_validation import StratifiedKFold
from sklearn.linear_model import RidgeClassifier

# store the classifier class; the regularization strength (alpha) is chosen
# by cross-validation below
classifier = RidgeClassifier

# normalize the data (this mainly applies to the station geographic coordinates)
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
scaler.fit(X_dummies)
X_dummies_scaled = scaler.transform(X_dummies)

# use SMOTE for creating balanced data sets (oversample the minority class)
from imblearn.over_sampling import SMOTE
smote = SMOTE(kind = "regular")
X_dummies_scaled_sm, y_sm = smote.fit_sample(X_dummies_scaled, y)

Next, I wanted to optimize the regularization parameter, alpha. I performed 5-fold cross-validation, training the model on 80% of the data and validating on the remaining 20% in each fold. The plot below shows the F1 score (the harmonic mean of precision and recall) on the validation data as a function of alpha.
In [10]:
# function for plotting the F1 score vs. regularization strength for each validation fold
def plot_regularization(X,y):
    
    # check for optimal regularization
    alphas = [0.01,0.03,0.1,0.3,1,3,10,30,100,300,1000,3000]
    
    # prepare to run classifier with cross-validation 
    cv = StratifiedKFold(y, n_folds=5)
    
    # create an empty array to hold model performance metrics
    f1scores = np.zeros((5,len(alphas)))

    # loop through validation folds
    for i, (train, val) in enumerate(cv):
        
        # loop through regularization values
        for j, alpha in enumerate(alphas):

            # fit to the training data and validate for the ith fold
            ridge = classifier(alpha).fit(X[train], y[train])

            # get performance metrics
            y_pred = ridge.predict(X[val])
            f1scores[i][j] = f1_score(y[val], y_pred, average="macro")
            
        plt.plot(np.log10(alphas),f1scores[i][:], marker='o', label=('fold ' + str(i)))
        
    #plt.ylim([0.5,0.75])
    plt.xlim([np.log10(alphas[0]),np.log10(100000)])
    plt.xlabel('log (alpha)')
    plt.ylabel('F1 score')
    plt.title('F1 scores')
    plt.legend(loc="lower right")
        
# plot out the effect of regularization on the F1 score        
plot_regularization(X_dummies_scaled_sm, y_sm)
You can see from the plot above that for most validation folds, the F1 score increases slowly and then drops off quickly. I wanted to optimize alpha such that the F1 scores were highest. I chose 300 (log10(300) ≈ 2.48) because for 3 out of the 5 folds, F1 was at or very close to its peak at that value.
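If you'd rather pick alpha programmatically than by eyeballing the plot, averaging the F1 scores across folds and taking the argmax does the job. A small sketch, assuming plot_regularization() were modified to return the alphas list and f1scores array it builds:

# pick the alpha with the highest mean F1 across the five folds
# (assumes plot_regularization returns (alphas, f1scores))
alphas, f1scores = plot_regularization(X_dummies_scaled_sm, y_sm)
mean_f1 = f1scores.mean(axis=0)
best_alpha = alphas[int(np.argmax(mean_f1))]
print('alpha with highest mean F1:', best_alpha)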
To further assess the model performance, I performed an ROC analysis and measured the area under the curve (AUC) using the chosen alpha.
In [12]:
# function for plotting ROC curves for each validation fold
def plot_cv_roc(X,y):
    
    # prepare to run classifier with cross-validation 
    cv = StratifiedKFold(y, n_folds=5)
    
    mean_tpr = 0.0
    mean_fpr = np.linspace(0, 1, 100)

    for i, (train, val) in enumerate(cv):

        # fit to the training data and validate for the ith fold
        # set the regularization parameter, alpha, based on previous plot
        probas_ = classifier(alpha=300).fit(X[train], y[train]).decision_function(X[val])

        # Compute ROC curve and area under the curve
        fpr, tpr, thresholds = roc_curve(y[val], probas_)

        # use to calculate the mean ROC
        mean_tpr += interp(mean_fpr, fpr, tpr)
        mean_tpr[0] = 0.0

        # plot the ROC for the ith fold
        roc_auc = auc(fpr, tpr)
        plt.plot(fpr, tpr, lw=1, label='ROC fold %d (area = %0.2f)' % (i, roc_auc))

    # plot the chance ROC
    plt.plot([0, 1], [0, 1], '--', color=(0.6, 0.6, 0.6), label='Luck')

    # plot the mean ROC
    mean_tpr /= len(cv)
    mean_tpr[-1] = 1.0
    mean_auc = auc(mean_fpr, mean_tpr)
    plt.plot(mean_fpr, mean_tpr, 'k--',
             label='Mean ROC (area = %0.2f)' % mean_auc, lw=2)

    plt.xlim([-0.05, 1.05])
    plt.ylim([-0.05, 1.05])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver operating characteristic')
    plt.legend(loc="lower right")
    plt.show()

    
# plot the ROC curves for each fold
plot_cv_roc(X_dummies_scaled_sm, y_sm)
You can see that the AUC was very close to 0.7 for all folds, indicating the model does a fair job of distinguishing between long and short delays.
Now that cross-validation has shown that the model can predict delay classes with fair accuracy, I build the final model using ALL of the available data, keeping the regularization parameter at the value chosen above.
In [13]:
# cross validation suggests the model is decent --
# now use all the data to create the model
classifier = RidgeClassifier(alpha=300) #use optimum regularization, alpha, as determined above
final_model = classifier.fit(X_dummies_scaled_sm, y_sm)
I then test this final model on a hold-out data set. Before doing that, I have to preprocess this untouched data identically to the training data. The hold-out set has 286 samples, as printed below:
In [16]:
# read in hold-out test data set, and do all the preprocessing that was done before
# on the training data
file = 'directory for hold-out test data csv file'
data = pd.read_csv(file)
del data['id']
data.shape

# discretize the response variable
y_test = data['delay']
y_test = discretize_response(y_test)

# convert categorical data to dummy variables
X_test = data.loc[:,'structure':'routeZ']
X_dummies_test = pd.get_dummies(X_test,columns=['structure', 'event_month','event_weekday',
                                            'direction','what_happened','event_hour'])
#del X_dummies_test['structure_Embankment'] #has to be done to match the test set

# consolidate the dummy variables
X_dummies_test = consolidate_features(X_dummies_test)


# scale the test data according to the training data 
X_dummies_scaled_test = scaler.transform(X_dummies_test)

# print size of the test data set (samples by features)
print(X_dummies_scaled_test.shape)
(286, 29)
And now let's see what the final accuracy is:
In [17]:
pred_test = final_model.predict(X_dummies_scaled_test)

# print ridge regression accuracy on test data
print("ridge regression, test data: ")
print(classification_report(y_test, pred_test,
                          target_names=["less than 15","15 or greater"]))

print('final model accuracy: %0.2f' % accuracy_score(y_test,pred_test))
ridge regression, test data: 
               precision    recall  f1-score   support

 less than 15       0.67      0.72      0.69       138
15 or greater       0.72      0.66      0.69       148

  avg / total       0.69      0.69      0.69       286

final model accuracy: 0.69
You can see it performed very similarly to the validation sets, indicating that training the model did not result in overfitting (great!). Also, note that SMOTE was not applied to the test set, since I wanted to see how the model predicts on real data, regardless of how the long and short delays happen to be balanced.
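As one extra sanity check (not in the original notebook), the confusion matrix makes the split of errors between the two classes explicit:

# confusion matrix for the hold-out predictions (rows: true class, columns: predicted class)
from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_test['delay'], pred_test))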
Finally, let's take a look at the model weights:
In [18]:
# plot the weights of the classifier, one bar per feature
y1 = classifier.coef_.ravel()  # flatten the (1, n_features) coefficient array
x = np.arange(len(y1))
plt.figure(num=None, figsize=(10, 8), dpi=80, facecolor='w', edgecolor='k')
ylab = X_dummies.columns  # one label per feature
plt.barh(x,y1)
plt.yticks(x,ylab)
plt.ylabel("feature")
plt.xlabel("theta")
You can see that the reported delay reasons are most important in determining delay length (not surprisingly). However, there is an interesting story behind this and I urge you to go to the video link above to find out more!
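If you'd like the same information as a ranked list instead of a bar chart, you can sort the features by the absolute value of their weights. A quick sketch (not part of the original notebook):

# rank features by the magnitude of their ridge weights
weights = pd.Series(final_model.coef_.ravel(), index=X_dummies.columns)
ranked = weights.reindex(weights.abs().sort_values(ascending=False).index)
print(ranked.head(10))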

Building the bot

Finally, I won't add the code for the bot here because it essentially combines the Twitter-polling code I posted above with extracting the relevant information from each tweet and posting the prediction from the bot's Twitter account. Pretty boring stuff. But if you're interested in building a Twitter bot, it's surprisingly easy with Heroku. Check out these videos and try it out! Also, check out the bot for this project at https://twitter.com/MTADelayBot.
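That said, here's a rough sketch of what the bot's reply step might look like. The helper build_feature_row() is hypothetical and stands in for the real tweet-parsing and preprocessing code; only api.update_status() and the model/scaler objects from above are real:

# rough sketch of the bot's reply step (build_feature_row is a hypothetical helper
# that returns a 1 x n_features array matching the training features)
def handle_delay_tweet(tweet, api, final_model, scaler):
    text = tweet.text
    if 'delay' not in text.lower():
        return  # not a delay announcement

    features = build_feature_row(text, tweet.created_at)  # hypothetical parsing / preprocessing
    pred = final_model.predict(scaler.transform(features))[0]

    msg = ('Expect this delay to last %s 15 minutes.'
           % ('at least' if pred == 1 else 'less than'))
    api.update_status(msg)  # post the prediction from the bot account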

