Every day, SEPTA faces delays across its many modes of transportation. These delays can make commuters late for work and other important events. Lateness leads to unhappy customers, which hurts SEPTA’s reputation and can ultimately push potential customers toward other methods of transportation.
By analyzing SEPTA’s Regional Rail data, we can identify the factors associated with late trains and predict whether a given train will be delayed.
We used the On-Time Performance dataset provided by SEPTA on Kaggle.com. The file "otp.csv" from this dataset contains On-Time Performance records from March 23 to November 6, 2016, including the train ID, origin, next station, status (delay in minutes), and timestamp of each observation.
We collected weather data using the Dark Sky API for the Philadelphia area, covering the start- and end-points of the Regional Rail lines we modeled. The data is aggregated into a single file so it can be linked to the OTP data, and includes the hourly time, weather station, summary and icon, precipitation type, precipitation intensity and probability, temperature, wind speed, and visibility.
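For reference, below is a minimal sketch of the kind of hourly ("Time Machine") request we issued against the Dark Sky API; the API key, coordinates, and date are placeholders, and the actual collection script is not reproduced here.
library(httr)
library(jsonlite)
# Placeholder key/coordinates - illustrative only
api_key <- "YOUR_DARKSKY_KEY"
lat <- 39.9526   # Philadelphia
lon <- -75.1652
day <- as.integer(as.POSIXct("2016-03-23", tz = "America/New_York"))
# One Time Machine request returns the 24 hourly observations for that day and location
resp <- GET(sprintf("https://api.darksky.net/forecast/%s/%f,%f,%d", api_key, lat, lon, day))
hourly <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))$hourly$data
head(hourly[, c("time", "summary", "temperature", "windSpeed", "visibility")])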
library(dplyr)      # filter(), mutate(), left_join(), select(), %>%
library(lubridate)  # round_date(), hour(), month()
#trainView <- read.csv("trainView.csv")
otp <- read.csv("otp.csv")
##Clean up train names
otp$train_id <- gsub('[A-Za-z/. /-]+', '' ,otp$train_id)
##Standardize status (On Time = 0)
otp$status <- gsub(' min', '', otp$status)
otp$status <- gsub('On Time', '0', otp$status)
otp$delay <- as.numeric(otp$status)
otp$status <- NULL
# SEPTA - On Time Performance Data. 2016-03-23 to 2016-11-06
# subset data for Paoli/Thorndale line and Fox-Chase
Thorndale_list <- c('Downingtown', 'Whitford', 'Exton', 'Malvern', 'Paoli', 'Daylesford', 'Berwyn', 'Devon',
'Strafford', 'Wayne', 'St. Davids', 'Radnor', 'Villanova', 'Rosemont', 'Bryn Mawr', 'Haverford', 'Ardmore', 'Wynnewood',
'Narberth', 'Merion', 'Overbrook', '30th Street Station', 'Suburban Station', 'Temple U')
Thorndale_list_out <- c('Downingtown', 'Whitford', 'Exton', 'Malvern', 'Paoli', 'Daylesford', 'Berwyn', 'Devon',
'Strafford', 'Wayne', 'St. Davids', 'Radnor', 'Villanova', 'Rosemont', 'Bryn Mawr', 'Haverford', 'Ardmore', 'Wynnewood',
'Narberth', 'Merion', 'Overbrook', '30th Street Station', 'Suburban Station', 'Thorndale')
##Subsetting with OTP data
##Thorndale Inbound
thorndale_in_otp <- otp %>%
filter(origin == 'Thorndale' | origin == 'Malvern' | origin == 'Frazer Yard') %>%
filter(next_station %in% Thorndale_list) %>%
arrange(timeStamp)
##Foxchase Inbound
foxchase_in_otp <- otp %>%
filter(origin == 'Fox Chase') %>%
filter(next_station %in% c('30th Street Station','Suburban Station','Jefferson Station','Temple U','Olney','Wayne Junction',
'Lawndale','Cheltenham','Ryers')) %>%
arrange(timeStamp)
##Thorndale Outbound
thorndale_out_otp <- otp %>%
filter(origin %in% c('Temple U','Suburban Station',
'30th Street Station', 'Jefferson Station')) %>%
filter(next_station %in% Thorndale_list_out) %>%
arrange(timeStamp)
##Foxchase Outbound
foxchase_out_otp <- otp %>%
filter(origin == '30th Street Station' | origin == 'Powelton Yard') %>%
filter(next_station %in% c('30th Street Station','Fox Chase','Suburban Station','Jefferson Station','Temple U','Olney','Wayne Junction',
'Lawndale','Cheltenham','Ryers')) %>%
arrange(timeStamp)
##Create a new column with nearest hour of timestamp (otp)
#Thorndale Inbound
thorndale_in_otp$timeStamp <- as.POSIXct(thorndale_in_otp$timeStamp, format="%Y-%m-%d %H:%M:%S")
thorndale_in_otp <- mutate(thorndale_in_otp,
round_timestamp =
round_date(thorndale_in_otp$timeStamp, c("hour")))
##Foxchase Inbound
foxchase_in_otp$timeStamp <- as.POSIXct(foxchase_in_otp$timeStamp, format="%Y-%m-%d %H:%M:%S")
foxchase_in_otp <- mutate(foxchase_in_otp,
round_timestamp =
round_date(foxchase_in_otp$timeStamp, c("hour")))
##Thorndale Outbound
thorndale_out_otp$timeStamp <- as.POSIXct(thorndale_out_otp$timeStamp, format="%Y-%m-%d %H:%M:%S")
thorndale_out_otp <- mutate(thorndale_out_otp,
round_timestamp =
round_date(thorndale_out_otp$timeStamp, c("hour")))
##Foxchase Outbound
foxchase_out_otp$timeStamp <- as.POSIXct(foxchase_out_otp$timeStamp, format="%Y-%m-%d %H:%M:%S")
foxchase_out_otp <- mutate(foxchase_out_otp,
round_timestamp =
round_date(foxchase_out_otp$timeStamp, c("hour")))
### NOTES: need to group by trainID and date
weather <- read.csv("weather.csv")
# convert relevant columns to factors
weather$station <- factor(weather$station)
#weather$time <- as_date(weather$time)
weather$time <- as.POSIXct(weather$time, format = "%m/%d/%y %H:%M")
weather$summary <- factor(weather$summary)
weather$icon <- factor(weather$icon)
weather$precipType <- factor(weather$precipType)
##Split weather data by station
weather_foxchase <- filter(weather, station == "Fox Chase")
weather_phila <- filter(weather, station == "Philadelphia")
weather_thorndale <- filter(weather, station == "Thorndale")
##Merging each dataset with the weather data
##Thorndale Inbound (otp)
thorndale_in_otp_weather <- thorndale_in_otp %>%
mutate(time = round_timestamp) %>%
left_join(weather_thorndale, by = "time") %>%
arrange(time)
##Thorndale Outbound
thorndale_out_otp_weather <- thorndale_out_otp %>%
mutate(time = round_timestamp) %>%
left_join(weather_thorndale, by = "time") %>%
arrange(time)
##Fox Chase Inbound
foxchase_in_otp_weather <- foxchase_in_otp %>%
mutate(time = round_timestamp) %>%
left_join(weather_foxchase, by = "time") %>%
arrange(time)
##Fox Chase Outbound
foxchase_out_otp_weather <- foxchase_out_otp %>%
mutate(time = round_timestamp) %>%
left_join(weather_foxchase, by = "time") %>%
arrange(time)
Random Forest is an extension of the decision tree algorithm. Decision trees work by partitioning the data in a way that yields explicit rules for classifying the outcome of interest from the predictor variables. Decision trees are useful because they are very intuitive and make the role of each variable easy to interpret. However, decision trees on their own do not have very high predictive power, which is where Random Forest comes in. With prediction and classification techniques there is often a tradeoff between interpretability and accuracy: Random Forest is not as intuitive as a single decision tree, but it is still possible to obtain a list of important factors.
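As a point of comparison (illustrative only; rpart is not part of the project code), a single decision tree can be grown on a built-in dataset and printed so that its split rules are readable directly:
library(rpart)
# Printing the fit lists the split rules, which is what makes a single tree easy to interpret
toy_tree <- rpart(Species ~ ., data = iris)
print(toy_tree)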
Random Forest works by repeatedly sampling the input data and building one decision tree per sample; the number of trees grown is a parameter called ntree (e.g., a common ntree is 500, meaning the algorithm grows 500 decision trees). In addition to sampling the rows, the algorithm also samples the predictor variables: the number of predictors considered at each split is a second parameter, mtry. Because Random Forest is an ensemble of decision trees, it combines them using a majority rule. For example, with an ntree of 500, if 300 of the trees predict a train will be late and 200 predict it will not, the final Random Forest predicts that the train will be late.
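To make the ntree and mtry mechanics concrete, here is a toy sketch on a built-in dataset (not the SEPTA data) showing how the randomForest package exposes both parameters and the per-class vote fractions that the majority rule combines:
library(randomForest)
set.seed(1)
# Grow 500 trees, sampling 2 predictors at each split
toy_rf <- randomForest(Species ~ ., data = iris, ntree = 500, mtry = 2)
# Fraction of the 500 trees voting for each class; the predicted class is the majority
head(predict(toy_rf, iris, type = "vote"))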
The Random Forest algorithm can be tuned using the two parameters mentioned earlier, ntree and mtry. To reduce cross-validation time, we kept the default ntree of 500, which has been shown to be a robust choice for the number of trees to grow. To choose the best value for mtry, we ran a 5-fold cross-validation, looping over mtry values of 1 through 11 (11 is one less than the total number of predictors; an mtry of 12 would simply use all the predictors at every split instead of randomly sampling them). The cross-validation showed that the best accuracy is achieved with an mtry of 11. After tuning the parameter, we ran the model on the 30% validation set and obtained an overall accuracy of 83.9%. The breakdown of class accuracies is as follows:
One detail to note is that this random forest model has a slightly higher Type 1 error rate (false positives) than Type 2 error rate (misses). In other words, when the model makes an error, it is more likely to be predicting that a train will be late when it is actually going to be on time. In the future, the model can be tuned to reduce either* the Type 1 or the Type 2 error, depending on which error is worse to make.
*A tradeoff exists between Type 1 and Type 2 error unless the model is perfect. It is common to choose which error is more costly, and then minimize it at the cost of increasing the size of the other error.
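One way such tuning could be done (a sketch, not something we ran in this project; it assumes the 0/1 delay factor and the training/validation split built later in the code) is via the cutoff argument of randomForest, which raises the share of tree votes required before a train is called late and therefore trades false positives for misses:
# Require more than 60% of the trees to vote "late" (class 1) before predicting late;
# this lowers the Type 1 (false positive) rate at the cost of a higher Type 2 rate
rf_conservative <- randomForest(delay ~ ., data = merged_train,
                                mtry = 11, cutoff = c(0.4, 0.6))
table(predict(rf_conservative, merged_test), merged_test$delay)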
#NOTE: This code takes a while to run, so it is included as a reference. The code following this imports an already prepped dataset and uses the parameters obtained from the cross validation to implement the final model.
library(randomForest)
##Merge datasets to run on all of the data
##Create a column for dataset
thorndale_in_otp_weather$dataset <- "thorndale_in"
thorndale_out_otp_weather$dataset <- "thorndale_out"
foxchase_in_otp_weather$dataset <- "foxchase_in"
foxchase_out_otp_weather$dataset <- "foxchase_out"
merged_data <- rbind(thorndale_in_otp_weather, thorndale_out_otp_weather,
foxchase_in_otp_weather, foxchase_out_otp_weather)
##Convert delay to Late (1) / Not Late (0), keeping the 999 "suspended" code so it can be filtered out below
merged_data <- mutate(merged_data, delay =
ifelse(delay == 999, 999,
ifelse(delay > 0, 1, 0)))
##Modifying Variables
##Reference columns directly (not merged_data$...) so the mutate applies to the filtered rows
merged_data <- merged_data %>%
filter(delay != 999) %>%
mutate(delay = factor(delay),
day_of_week = factor(weekdays(round_timestamp)),
hour = factor(hour(round_timestamp)),
month = factor(month(round_timestamp)),
origin = factor(as.character(origin)),
next_station = factor(as.character(next_station)),
icon = factor(as.character(icon)),
dataset = factor(as.character(dataset)),
windSpeed = ifelse(is.na(windSpeed),
mean(windSpeed, na.rm = TRUE), windSpeed))
##Choose the right variables
##Train ID has 94 levels - randomForest supports at most 53 levels for a categorical predictor, so train_id is excluded
rf_vars <- c("origin", "next_station",
"delay", "day_of_week", "precipIntensity", "precipProbability",
"temperature", "windSpeed", "visibility", "month", "hour", "icon", "dataset")
merged_data_model <- select(merged_data, all_of(rf_vars))
##Split into training/testing set
set.seed(100)
train <- sample(nrow(merged_data_model), 0.7*nrow(merged_data_model), replace = FALSE)
merged_train <- merged_data_model[train,]
merged_test <- merged_data_model[-train,]
##Initial Model
Sys.time()
merged_default_model <- randomForest(delay ~ ., data = merged_train, importance = TRUE)
Sys.time()
merged_default_model
# Predicting on train set
merged_pred_train <- predict(merged_default_model, merged_train, type = "class")
# Checking classification accuracy
table(merged_pred_train, merged_train$delay)
# Predicting on Validation set
merged_pred_train_test <- predict(merged_default_model, merged_test, type = "class")
# Checking classification accuracy
#mean(merged_pred_train_test == merged_test$delay)
table(merged_pred_train_test,merged_test$delay)
##importance
importance(merged_default_model)
##Cross Validation
library(caret)
# Grid Search
Sys.time()
control <- trainControl(method="repeatedcv", number=5, repeats=1)
seed <- 100
metric <- "Accuracy"
set.seed(seed)
mtry <- c(1:11)
# ntree is left at the randomForest default of 500; only mtry is tuned in the grid
tunegrid <- expand.grid(.mtry=mtry)
rf_search <- train(delay~., data=merged_train, method="rf", metric=metric, tuneGrid=tunegrid, trControl=control)
print(rf_search)
Sys.time()
plot(rf_search)
library(randomForest)
##Split final model_data
merged_data_model <- readRDS("model_data.rds")
names <- gsub(pattern = " ", replacement = "", x = colnames(merged_data_model))
colnames(merged_data_model) <- names
merged_data_model <- merged_data_model[,-2]
set.seed(100)
train <- sample(nrow(merged_data_model), 0.7*nrow(merged_data_model), replace = FALSE)
merged_train <- merged_data_model[train,]
merged_test_new <- merged_data_model[-train,]
##Run new model with optimal mtry
#Sys.time()
#merged_mtry11_model <- randomForest(delay ~ .,data = merged_train, mtry = 11, importance = TRUE)
merged_mtry11_model <- readRDS("final_model.RDS")
#Sys.time()
# Predicting on Validation set
merged_pred_train_test <- predict(merged_mtry11_model, merged_test_new, type = "class")
# Checking classification accuracy
#mean(merged_pred_train_test == merged_test_new$delay)
holdout_acc <- table(merged_pred_train_test,merged_test_new$delay)
##importance
importance_Valid <- as.data.frame(importance(merged_mtry11_model))
names <- rownames(importance_Valid)
importance_Valid$factor <- names
import_table <- importance_Valid %>%
select(factor, MeanDecreaseGini) %>%
arrange(desc(MeanDecreaseGini)) %>%
head()
| factor                 | MeanDecreaseGini |
|------------------------|------------------|
| temperature            | 4,609.015        |
| windSpeed              | 4,316.436        |
| visibility             | 2,149.204        |
| originFrazerYard       | 2,025.816        |
| originJeffersonStation | 1,125.068        |
| originThorndale        | 1,085.159        |
Above are the top six factors in our random forest model. Among the weather variables, temperature, wind speed, and visibility appear to be important (based on MeanDecreaseGini*). The other important variables are the different train origins, which suggests that some train lines have more late trains than others.
*MeanDecreaseGini is based on the Gini index, a measure of node impurity. When growing a decision tree or a random forest, the algorithm uses the Gini index to decide which variable to split on: the split chosen is the one that most reduces impurity (for example, one branch holds most of the late trains and the other holds most of the on-time trains). MeanDecreaseGini totals these impurity reductions for each variable across the forest, so the variables with the largest values contributed most to separating late from on-time trains and can be considered the most important.
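For a quick visual of the same rankings (a suggestion rather than part of the original analysis), randomForest's built-in importance plot shows MeanDecreaseGini, and MeanDecreaseAccuracy since the model was fit with importance = TRUE, for every predictor at once:
# Dot plot of variable importance for the final model
varImpPlot(merged_mtry11_model, main = "Variable importance - SEPTA delay model")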
One major limitation of the current analysis is that the available data was collected between March and November 2016. During this period, SEPTA had an employee strike which forced people who usually take other forms of transportation onto Regional Rail. This overfilled the trains and most likely contributed to some of the delays. In the future it would be beneficial to obtain data that avoids such abnormal events. However, the current analysis can still serve as a proof of concept.
Further, it would be useful to obtain data covering at least one full year in order to see how lateness varies across all of the months.
Note: This project also included other algorithms such as Logistic Regression, SVM, Naive Bayes, and Neural Networks, but this report only includes Random Forest as it performed the best.