Every day, SEPTA faces delays across its many modes of transportation. These delays can make commuters late for work and other important events. Lateness leads to unhappy customers, which hurts SEPTA’s reputation and can ultimately push potential customers toward other methods of transportation.
By analyzing SEPTA’s Regional Rail data, we can identify the factors associated with late trains and predict whether a given train will be delayed.
We used the On-Time Performance dataset provided by SEPTA on Kaggle.com. The file "otp.csv" from this dataset contains On-Time Performance records from March 23 to November 6, 2016, including the train ID, origin, next station, status (delay in minutes), and timestamp of each observation.
We collected weather data using the Dark Sky API for the Philadelphia area, covering the start- and end-points of the Regional Rail lines we modeled. The data is aggregated into a single file so it can be linked to the OTP data, and includes the hourly time, weather station, summary and icon, precipitation type, precipitation intensity and probability, temperature, wind speed, and visibility.
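For reference, below is a minimal sketch of the kind of hourly ("Time Machine") request we issued against the Dark Sky API; the API key, coordinates, and date are placeholders, and the actual collection script is not reproduced here.
library(httr)
library(jsonlite)
# Placeholder key/coordinates - illustrative only
api_key <- "YOUR_DARKSKY_KEY"
lat <- 39.9526   # Philadelphia
lon <- -75.1652
day <- as.integer(as.POSIXct("2016-03-23", tz = "America/New_York"))
# One Time Machine request returns the 24 hourly observations for that day and location
resp <- GET(sprintf("https://api.darksky.net/forecast/%s/%f,%f,%d", api_key, lat, lon, day))
hourly <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))$hourly$data
head(hourly[, c("time", "summary", "temperature", "windSpeed", "visibility")])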
library(dplyr)      # filter(), mutate(), left_join(), select(), %>%
library(lubridate)  # round_date(), hour(), month()
#trainView <- read.csv("trainView.csv")
otp <- read.csv("otp.csv")
##Clean up train names
otp$train_id <- gsub('[A-Za-z/. /-]+', '' ,otp$train_id)
##Standardize status (On Time = 0)
otp$status <- gsub(' min', '', otp$status)
otp$status <- gsub('On Time', '0', otp$status)
otp$delay <- as.numeric(otp$status)
otp$status <- NULL
# SEPTA - On Time Performance Data. 2016-03-23 to 2016-11-06
# subset data for Paoli/Thorndale line and Fox-Chase
Thorndale_list <- c('Downingtown', 'Whitford', 'Exton', 'Malvern', 'Paoli', 'Daylesford', 'Berwyn', 'Devon',
'Strafford', 'Wayne', 'St. Davids', 'Radnor', 'Villanova', 'Rosemont', 'Bryn Mawr', 'Haverford', 'Ardmore', 'Wynnewood',
'Narberth', 'Merion', 'Overbrook', '30th Street Station', 'Suburban Station', 'Temple U')
Thorndale_list_out <- c('Downingtown', 'Whitford', 'Exton', 'Malvern', 'Paoli', 'Daylesford', 'Berwyn', 'Devon',
'Strafford', 'Wayne', 'St. Davids', 'Radnor', 'Villanova', 'Rosemont', 'Bryn Mawr', 'Haverford', 'Ardmore', 'Wynnewood',
'Narberth', 'Merion', 'Overbrook', '30th Street Station', 'Suburban Station', 'Thorndale')
##Subsetting with OTP data
##Thorndale Inbound
thorndale_in_otp <- otp %>%
filter(origin == 'Thorndale' | origin == 'Malvern' | origin == 'Frazer Yard') %>%
filter(next_station %in% Thorndale_list) %>%
arrange(timeStamp)
##Foxchase Inbound
foxchase_in_otp <- otp %>%
filter(origin == 'Fox Chase') %>%
filter(next_station %in% c('30th Street Station','Suburban Station','Jefferson Station','Temple U','Olney','Wayne Junction',
'Lawndale','Cheltenham','Ryers')) %>%
arrange(timeStamp)
##Thorndale Outbound
thorndale_out_otp <- otp %>%
filter(origin %in% c('Temple U','Suburban Station',
'30th Street Station', 'Jefferson Station')) %>%
filter(next_station %in% Thorndale_list_out) %>%
arrange(timeStamp)
##Foxchase Outbound
foxchase_out_otp <- otp %>%
filter(origin == '30th Street Station' | origin == 'Powelton Yard') %>%
filter(next_station %in% c('30th Street Station','Fox Chase','Suburban Station','Jefferson Station','Temple U','Olney','Wayne Junction',
'Lawndale','Cheltenham','Ryers')) %>%
arrange(timeStamp)
##Create a new column with nearest hour of timestamp (otp)
#Thorndale Inbound
thorndale_in_otp$timeStamp <- as.POSIXct(thorndale_in_otp$timeStamp, format="%Y-%m-%d %H:%M:%S")
thorndale_in_otp <- mutate(thorndale_in_otp,
round_timestamp =
round_date(thorndale_in_otp$timeStamp, c("hour")))
##Foxchase Inbound
foxchase_in_otp$timeStamp <- as.POSIXct(foxchase_in_otp$timeStamp, format="%Y-%m-%d %H:%M:%S")
foxchase_in_otp <- mutate(foxchase_in_otp,
round_timestamp =
round_date(foxchase_in_otp$timeStamp, c("hour")))
##Thorndale Outbound
thorndale_out_otp$timeStamp <- as.POSIXct(thorndale_out_otp$timeStamp, format="%Y-%m-%d %H:%M:%S")
thorndale_out_otp <- mutate(thorndale_out_otp,
round_timestamp =
round_date(thorndale_out_otp$timeStamp, c("hour")))
##Foxchase Outbound
foxchase_out_otp$timeStamp <- as.POSIXct(foxchase_out_otp$timeStamp, format="%Y-%m-%d %H:%M:%S")
foxchase_out_otp <- mutate(foxchase_out_otp,
round_timestamp =
round_date(foxchase_out_otp$timeStamp, c("hour")))
### NOTES: need to group by trainID and date
weather <- read.csv("weather.csv")
# convert relevant columns to factors
weather$station <- factor(weather$station)
#weather$time <- as_date(weather$time)
weather$time <- as.POSIXct(weather$time, format = "%m/%d/%y %H:%M")
weather$summary <- factor(weather$summary)
weather$icon <- factor(weather$icon)
weather$precipType <- factor(weather$precipType)
##Split weather data by station
weather_foxchase <- filter(weather, station == "Fox Chase")
weather_phila <- filter(weather, station == "Philadelphia")
weather_thorndale <- filter(weather, station == "Thorndale")
##Merging each dataset with the weather data
##Thorndale Inbound (otp)
thorndale_in_otp_weather <- thorndale_in_otp %>%
mutate(time = round_timestamp) %>%
left_join(weather_thorndale, by = "time") %>%
arrange(time)
##Thorndale Outbound
thorndale_out_otp_weather <- thorndale_out_otp %>%
mutate(time = round_timestamp) %>%
left_join(weather_thorndale, by = "time") %>%
arrange(time)
##Fox Chase Inbound
foxchase_in_otp_weather <- foxchase_in_otp %>%
mutate(time = round_timestamp) %>%
left_join(weather_foxchase, by = "time") %>%
arrange(time)
##Fox Chase Outbound
foxchase_out_otp_weather <- foxchase_out_otp %>%
mutate(time = round_timestamp) %>%
left_join(weather_foxchase, by = "time") %>%
arrange(time)
Random Forest is an extension of the decision tree algorithm. Decision trees work by partitioning the data in a way that yields explicit rules for classifying the outcome of interest from the predictor variables. Decision trees are useful because they are very intuitive and make the role of each variable easy to interpret. However, decision trees on their own do not have very high predictive power, which is where Random Forest comes in. With prediction and classification techniques there is often a tradeoff between interpretability and accuracy: Random Forest is not as intuitive as a single decision tree, but it is still possible to obtain a list of important factors.
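As a point of comparison (illustrative only; rpart is not part of the project code), a single decision tree can be grown on a built-in dataset and printed so that its split rules are readable directly:
library(rpart)
# Printing the fit lists the split rules, which is what makes a single tree easy to interpret
toy_tree <- rpart(Species ~ ., data = iris)
print(toy_tree)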
Random Forest works by repeatedly sampling the input data and building one decision tree per sample; the number of trees grown is a parameter called ntree (e.g., a common ntree is 500, meaning the algorithm grows 500 decision trees). In addition to sampling the rows, the algorithm also samples the predictor variables: the number of predictors considered at each split is a second parameter, mtry. Because Random Forest is an ensemble of decision trees, it combines them using a majority rule. For example, with an ntree of 500, if 300 of the trees predict a train will be late and 200 predict it will not, the final Random Forest predicts that the train will be late.
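To make the ntree and mtry mechanics concrete, here is a toy sketch on a built-in dataset (not the SEPTA data) showing how the randomForest package exposes both parameters and the per-class vote fractions that the majority rule combines:
library(randomForest)
set.seed(1)
# Grow 500 trees, sampling 2 predictors at each split
toy_rf <- randomForest(Species ~ ., data = iris, ntree = 500, mtry = 2)
# Fraction of the 500 trees voting for each class; the predicted class is the majority
head(predict(toy_rf, iris, type = "vote"))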
The Random Forest algorithm can be tuned using the two parameters mentioned earlier, ntree and mtry. To reduce cross-validation time, we kept the default ntree of 500, which has been shown to be a robust choice for the number of trees to grow. To choose the best value for mtry, we ran a 5-fold cross-validation, looping over mtry values of 1 through 11 (11 is one less than the total number of predictors; an mtry of 12 would simply use all the predictors at every split instead of randomly sampling them). The cross-validation showed that the best accuracy is achieved with an mtry of 11. After tuning the parameter, we ran the model on the 30% validation set and obtained an overall accuracy of 83.9%. The breakdown of class accuracies is as follows:
One detail to note is that this random forest model has a slightly higher Type 1 error rate (false positives) than Type 2 error rate (misses). In other words, when the model makes an error, it is more likely to be predicting that a train will be late when it is actually going to be on time. In the future, the model can be tuned to reduce either* the Type 1 or the Type 2 error, depending on which error is worse to make.
*A tradeoff exists between Type 1 and Type 2 error unless the model is perfect. It is common to choose which error is more costly, and then minimize it at the cost of increasing the size of the other error.
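One way such tuning could be done (a sketch, not something we ran in this project; it assumes the 0/1 delay factor and the training/validation split built later in the code) is via the cutoff argument of randomForest, which raises the share of tree votes required before a train is called late and therefore trades false positives for misses:
# Require more than 60% of the trees to vote "late" (class 1) before predicting late;
# this lowers the Type 1 (false positive) rate at the cost of a higher Type 2 rate
rf_conservative <- randomForest(delay ~ ., data = merged_train,
                                mtry = 11, cutoff = c(0.4, 0.6))
table(predict(rf_conservative, merged_test), merged_test$delay)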
#NOTE: This code takes a while to run, so it is included as a reference. The code following this imports an already prepped dataset and uses the parameters obtained from the cross validation to implement the final model.
library(randomForest)
##Merge datasets to run on all of the data
##Create a column for dataset
thorndale_in_otp_weather$dataset <- "thorndale_in"
thorndale_out_otp_weather$dataset <- "thorndale_out"
foxchase_in_otp_weather$dataset <- "foxchase_in"
foxchase_out_otp_weather$dataset <- "foxchase_out"
merged_data <- rbind(thorndale_in_otp_weather, thorndale_out_otp_weather,
foxchase_in_otp_weather, foxchase_out_otp_weather)
##Convert delay to Late (1) / Not Late (0), keeping the 999 "suspended" code so it can be filtered out below
merged_data <- mutate(merged_data, delay =
ifelse(delay == 999, 999,
ifelse(delay > 0, 1, 0)))
##Modifying Variables
##Reference columns directly (not merged_data$...) so the mutate applies to the filtered rows
merged_data <- merged_data %>%
filter(delay != 999) %>%
mutate(delay = factor(delay),
day_of_week = factor(weekdays(round_timestamp)),
hour = factor(hour(round_timestamp)),
month = factor(month(round_timestamp)),
origin = factor(as.character(origin)),
next_station = factor(as.character(next_station)),
icon = factor(as.character(icon)),
dataset = factor(as.character(dataset)),
windSpeed = ifelse(is.na(windSpeed),
mean(windSpeed, na.rm = TRUE), windSpeed))
##Choose the right variables
##Train ID has 94 levels - randomForest supports at most 53 levels for a categorical predictor, so train_id is excluded
rf_vars <- c("origin", "next_station",
"delay", "day_of_week", "precipIntensity", "precipProbability",
"temperature", "windSpeed", "visibility", "month", "hour", "icon", "dataset")
merged_data_model <- select(merged_data, all_of(rf_vars))
##Split into training/testing set
set.seed(100)
train <- sample(nrow(merged_data_model), 0.7*nrow(merged_data_model), replace = FALSE)
merged_train <- merged_data_model[train,]
merged_test <- merged_data_model[-train,]
##Initial Model
Sys.time()
merged_default_model <- randomForest(delay ~ ., data = merged_train, importance = TRUE)
Sys.time()
merged_default_model
# Predicting on train set
merged_pred_train <- predict(merged_default_model, merged_train, type = "class")
# Checking classification accuracy
table(merged_pred_train, merged_train$delay)
# Predicting on Validation set
merged_pred_train_test <- predict(merged_default_model, merged_test, type = "class")
# Checking classification accuracy
#mean(merged_pred_train_test == merged_test$delay)
table(merged_pred_train_test,merged_test$delay)
##importance
importance(merged_default_model)
##Cross Validation
library(caret)
# Grid Search
Sys.time()
control <- trainControl(method="repeatedcv", number=5, repeats=1)
seed <- 100
metric <- "Accuracy"
set.seed(seed)
mtry <- c(1:11)
# ntree is left at the randomForest default of 500; only mtry is tuned in the grid
tunegrid <- expand.grid(.mtry=mtry)
rf_search <- train(delay~., data=merged_train, method="rf", metric=metric, tuneGrid=tunegrid, trControl=control)
print(rf_search)
Sys.time()
plot(rf_search)
library(randomForest)
##Split final model_data
merged_data_model <- readRDS("model_data.rds")
names <- gsub(pattern = " ", replacement = "", x = colnames(merged_data_model))
colnames(merged_data_model) <- names
merged_data_model <- merged_data_model[,-2]
set.seed(100)
train <- sample(nrow(merged_data_model), 0.7*nrow(merged_data_model), replace = FALSE)
merged_train <- merged_data_model[train,]
merged_test_new <- merged_data_model[-train,]
##Run new model with optimal mtry
#Sys.time()
#merged_mtry11_model <- randomForest(delay ~ .,data = merged_train, mtry = 11, importance = TRUE)
merged_mtry11_model <- readRDS("final_model.RDS")
#Sys.time()
# Predicting on Validation set
merged_pred_train_test <- predict(merged_mtry11_model, merged_test_new, type = "class")
# Checking classification accuracy
#mean(merged_pred_train_test == merged_test_new$delay)
holdout_acc <- table(merged_pred_train_test,merged_test_new$delay)
##importance
importance_Valid <- as.data.frame(importance(merged_mtry11_model))
names <- rownames(importance_Valid)
importance_Valid$factor <- names
import_table <- importance_Valid %>%
select(factor, MeanDecreaseGini) %>%
arrange(desc(MeanDecreaseGini)) %>%
head()
| factor                 | MeanDecreaseGini |
|------------------------|------------------|
| temperature            | 4,609.015        |
| windSpeed              | 4,316.436        |
| visibility             | 2,149.204        |
| originFrazerYard       | 2,025.816        |
| originJeffersonStation | 1,125.068        |
| originThorndale        | 1,085.159        |
Above are the top six factors in our random forest model. Among the weather variables, temperature, wind speed, and visibility appear to be important (based on MeanDecreaseGini*). The other important variables are the different train origins, which suggests that some train lines have more late trains than others.
*MeanDecreaseGini is based on the Gini index, a measure of node impurity. When growing a decision tree or a random forest, the algorithm uses the Gini index to decide which variable to split on: the split chosen is the one that most reduces impurity (for example, one branch holds most of the late trains and the other holds most of the on-time trains). MeanDecreaseGini totals these impurity reductions for each variable across the forest, so the variables with the largest values contributed most to separating late from on-time trains and can be considered the most important.
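For a quick visual of the same rankings (a suggestion rather than part of the original analysis), randomForest's built-in importance plot shows MeanDecreaseGini, and MeanDecreaseAccuracy since the model was fit with importance = TRUE, for every predictor at once:
# Dot plot of variable importance for the final model
varImpPlot(merged_mtry11_model, main = "Variable importance - SEPTA delay model")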
One major limitation of the current analysis is that the available data was collected between March and November 2016. During this period, SEPTA had an employee strike which forced people who usually take other forms of transportation onto Regional Rail. This overfilled the trains and most likely contributed to some of the delays. In the future it would be beneficial to obtain data that avoids such abnormal events. However, the current analysis can still serve as a proof of concept.
Further, it would be useful to obtain data covering at least one full year in order to see how lateness varies across all of the months.
Note: This project also included other algorithms such as Logistic Regression, SVM, Naive Bayes, and Neural Networks, but this report only includes Random Forest as it performed the best.