MTA Daily Subway Ridership Estimator

I. Abstract

New York's Metropolitan Transportation Authority (MTA) has published its ridership and traffic data online each day during the coronavirus pandemic. Data for all of the agency's transport services is available to download. Subway ridership figures are determined from MetroCard and OMNY swipes and taps and include ridership on the Staten Island Railway. Figures from recent days may be revised as data reconciliation processes are carried out. This project uses the data provided by the MTA, together with other data sources, to train an artificial neural network (ANN) to predict the MTA's daily subway ridership. An ANN is a collection of connected units, or nodes, called artificial neurons, which loosely model the neurons in a biological brain.

II. Introduction

The MTA shares its ridership and traffic data each day to show how many people are using its services in and around New York City. Each service is updated with an estimated ridership number for each date, and each figure is also expressed as a percentage of a comparable pre-pandemic day. This project uses the agency's ridership data together with contributing factors that affect overall ridership, such as weather, the public school calendar, and holidays. Each of the chosen factors has been shown to affect overall ridership on a daily basis.

III. Materials and Methods

The contributing factors used to train the ANN were chosen for their direct correlation with overall ridership and for their ease of availability. The weather data is sourced from the National Centers for Environmental Information website, which allows users to download reported weather conditions for a selected city or region (here, Central Park, New York City). The public school calendar is also used as a ridership factor: in 2021-22 there were 1,058,888 students in the NYC school system, the largest school district in the United States. Lastly, holidays are included because school closures and employers' workday closures reduce ridership.

Data Preprocessing

The MTA's ridership data and the NOAA (National Oceanic and Atmospheric Administration) daily summaries for the NY region were compiled into a single daily summary CSV file. School recess days were extracted from the NYC public school academic calendar, and holidays were also incorporated into the file. Each contributing ridership factor is represented either as a decimal number (average wind speed, average temperature, etc.) or as a Boolean flag (weekend, holiday, etc.), in which a value of 1 means the day meets the criterion for the category and a value of 0 means it does not.
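The encoding described above can be sketched as follows. This is a minimal illustration, not the project's preprocessing script: the function name `encode_day`, the column names, the weather values, and the two calendar sets are hypothetical, and the dates assume the year 2022, which the text does not state.

```python
from datetime import date

# Hypothetical calendars for illustration; the real ones come from the
# NYC public school academic calendar and a holiday list.
HOLIDAYS = {date(2022, 9, 5)}  # Labor Day (assumed year: 2022)
SCHOOL_RECESS = {date(2022, 8, 27), date(2022, 8, 28)}  # summer-break days


def encode_day(day, avg_temp_f, avg_wind_mph, precipitation_in):
    """Return one row of model inputs: decimal weather values plus 0/1 flags."""
    return {
        "avg_temp": float(avg_temp_f),
        "avg_wind": float(avg_wind_mph),
        "precipitation": float(precipitation_in),
        "weekend": 1 if day.weekday() >= 5 else 0,  # Saturday=5, Sunday=6
        "holiday": 1 if day in HOLIDAYS else 0,
        "school_recess": 1 if day in SCHOOL_RECESS else 0,
    }


# Weather values below are made up for the example.
row = encode_day(date(2022, 8, 27), avg_temp_f=78.0, avg_wind_mph=4.5, precipitation_in=0.0)
print(row)
```

Each row produced this way would correspond to one line of the daily summary CSV file, with the recorded ridership for that day as the target value.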

Artificial Neural Network Model

The program uses a Keras sequential model, which is a linear stack of layers. Keras is a deep learning API written in Python, running on top of the machine learning platform TensorFlow. The script allows the user to train a new ANN model or load an existing one. When training is selected, the appropriate data file is read and the inputs and output for the model are selected. The model can be configured with the desired number of hidden layers and the desired number of neurons per hidden layer. Each node's activation function transforms the summed weighted input of the node into the node's activation, which is its output for that input.

The rectified linear activation function (ReLU), a piecewise linear function, is chosen for the model: it outputs the input directly if it is positive and outputs zero otherwise. The model uses an arbitrary configuration of two hidden layers with four neurons and three neurons, respectively. The first hidden layer has an input shape of ten inputs and four neurons, while the second hidden layer has three neurons; both use the ReLU activation function. The output layer has a linear activation function, also known as "no activation" or the "identity function" (the input multiplied by 1.0), in which the activation is proportional to the input.
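To make the layer arithmetic concrete, the sketch below runs a forward pass through a 10-4-3-1 network in plain Python rather than Keras. It is not the project's code: the weights are random placeholders (a trained model would supply learned values), and the helper names `dense`, `init_layer`, and `predict` are assumptions made for this example.

```python
import random


def relu(x):
    """Rectified linear unit: pass positive inputs through, output zero otherwise."""
    return x if x > 0 else 0.0


def linear(x):
    """Identity activation used by the output layer."""
    return x


def dense(inputs, weights, biases, activation):
    """One fully connected layer: activation(weighted sum + bias) per neuron."""
    return [
        activation(sum(w * x for w, x in zip(neuron_weights, inputs)) + b)
        for neuron_weights, b in zip(weights, biases)
    ]


def init_layer(n_inputs, n_neurons, rng):
    """Random placeholder weights and zero biases for one layer."""
    weights = [[rng.uniform(-1, 1) for _ in range(n_inputs)] for _ in range(n_neurons)]
    biases = [0.0] * n_neurons
    return weights, biases


rng = random.Random(0)
layer_sizes = [10, 4, 3, 1]  # inputs, hidden layer 1, hidden layer 2, output
layers = [init_layer(n_in, n_out, rng) for n_in, n_out in zip(layer_sizes, layer_sizes[1:])]
activations = [relu, relu, linear]


def predict(features):
    """Run ten input features through both hidden layers and the output layer."""
    values = features
    for (weights, biases), activation in zip(layers, activations):
        values = dense(values, weights, biases, activation)
    return values[0]


# Weights plus biases: (10*4 + 4) + (4*3 + 3) + (3*1 + 1)
n_params = sum(len(w) * len(w[0]) + len(b) for w, b in layers)
print(n_params)
```

With this configuration the network has 63 trainable parameters, which is small enough to train quickly on a few hundred daily records.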

When the model is compiled, the optimizer is one that implements the Adam algorithm, and the loss function computes the mean of the squares of the errors between labels and predictions. Adam optimization is a stochastic gradient descent method based on adaptive estimation of first-order and second-order moments. The purpose of a loss function is to compute the quantity that a model should seek to minimize during training. Mean squared error is calculated as the average of the squared differences between the predicted and actual values. Because the differences are squared, larger mistakes contribute disproportionately more to the loss than smaller ones, so the model is penalized more heavily for larger mistakes. Lastly, the number of training epochs is arbitrarily set to 2000.
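The loss and the optimizer update can be illustrated with a small plain-Python sketch. It fits a single-weight model y = w * x instead of the full network, and the learning rate, the toy data, and the function names are assumptions chosen for the example, not the project's settings. Only the 2000 epochs come from the text.

```python
import math


def mean_squared_error(actual, predicted):
    """Average of the squared differences between targets and predictions."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def adam_fit(xs, ys, epochs=2000, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-7):
    """Fit y = w * x by minimizing mean squared error with Adam updates."""
    w, m, v = 0.0, 0.0, 0.0
    for t in range(1, epochs + 1):
        # Gradient of the MSE with respect to w.
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
        m = beta1 * m + (1 - beta1) * grad        # first-moment (mean) estimate
        v = beta2 * v + (1 - beta2) * grad ** 2   # second-moment estimate
        m_hat = m / (1 - beta1 ** t)              # bias-corrected moments
        v_hat = v / (1 - beta2 ** t)
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w


# Doubling the error quadruples its contribution to the loss.
print(mean_squared_error([0], [10]), mean_squared_error([0], [20]))

# Toy data generated by y = 2x; the fitted weight should approach 2.
w = adam_fit([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
print(w)
```

The first print shows why squaring penalizes large mistakes: an error of 20 costs four times as much as an error of 10, not twice as much.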

IV. Results

Once the ANN model was trained, four predictions were made to test its accuracy: two weekdays in the immediate future and two weekend days in the immediate past.

Weekdays in the immediate future (September 1 & September 2):

Sep. 1

Results show a 99% accuracy for the recorded data on Sep. 1.

Sep. 2

Results show a 99% accuracy for the recorded data on Sep. 2.

Weekend days in the immediate past (August 27 & August 28):

Aug. 27

Results show a 96% accuracy for the recorded data on Aug. 27.

Aug. 28

Results show a 90% accuracy for the recorded data on Aug. 28.
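The text does not state how the accuracy percentages were computed. One plausible definition, shown here purely as an assumption, is 100% minus the absolute percentage error between the predicted and recorded ridership. The ridership figures in the sketch are made up for illustration and are not the project's results.

```python
def percent_accuracy(predicted, actual):
    """Return 100% minus the absolute percentage error, floored at zero."""
    error = abs(predicted - actual) / actual
    return max(0.0, (1.0 - error) * 100.0)


# Hypothetical figures: a prediction 30,000 riders below a recorded 3,000,000.
accuracy = percent_accuracy(predicted=2_970_000, actual=3_000_000)
print(round(accuracy, 1))
```

Under this definition, a prediction that misses the recorded figure by 1% in either direction scores 99%.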

Discussion

Numerically, the model is more accurate when estimating weekday ridership than weekend ridership. Its estimates for both weekdays in the immediate future and weekend days in the immediate past were consistent with the available recorded data. This project focused on subway ridership; however, the same input data could also be used to estimate ridership on buses and the other transport services provided by the MTA.

Potential Issues

Bias

The data used to train the model does not account for several biases that have also been shown to affect transit ridership, nor for momentum and trends in the data.

Example 1: Consider a holiday falling in the middle of a week and the effect it may have on ridership, compared with a holiday falling at the start or end of a weekend. The data presented to the model represents both cases with the same holiday flag, so the model cannot distinguish how the position of a holiday in the week affects ridership.
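As a rough sketch of the distinction in Example 1, the function below classifies a holiday by where it falls in the week. It is not part of the project; the function name, the category labels, and the rule that Monday and Friday count as adjacent to the weekend are all assumptions made for illustration.

```python
from datetime import date


def holiday_position(day):
    """Classify a holiday by its position in the week."""
    weekday = day.weekday()  # Monday=0 ... Sunday=6
    if weekday >= 5:
        return "weekend"
    if weekday in (0, 4):
        return "adjacent_to_weekend"  # Monday or Friday creates a long weekend
    return "midweek"


print(holiday_position(date(2022, 9, 5)))   # a Monday
print(holiday_position(date(2023, 7, 4)))   # a Tuesday
```

A label like this could be encoded as additional Boolean inputs so that the two kinds of holiday are no longer indistinguishable to the model.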

Example 2: Consider the end or start of the school year, when ridership gradually falls or rises over several days. Because each day is presented to the model independently, the model cannot capture this momentum and will fail to estimate ridership accurately during these transitions.

Lack of data

The model's training data set includes publicly available data provided by the MTA, NOAA, and the DOE (Department of Education). Other data that has been shown to influence ridership could be added to build a more accurate model, such as local employment, crime and safety statistics, traffic statistics, and population trends.

Closing thoughts

In conclusion, this project was made possible by the availability of public data. It allowed me to work through the full process of defining a project statement, gathering data, training a model, and considering the biases and possible improvements of the program.

AngelsGills/MTA_Daily_Subway_Ridership_Estimator
