← Back to home

NYC Taxi PySpark Pipeline

Nov 2025 · 6 min read · MSc AI, Kristiania University College — Big Data Systems

PySpark Spark SQL Spark ML Parquet SQLite pandas data engineering

The NYC Taxi and Limousine Commission publishes monthly Parquet files covering every yellow- and green-taxi trip in New York City. Across the 2020–2024 window the data runs to hundreds of millions of rows — far beyond what a single-machine pandas workflow can comfortably handle. The exercise was to treat it as a real big-data problem from the start: build the ingest, cleaning, feature engineering, EDA, and modelling stages so that the same code would scale from a laptop sample to the full multi-year corpus.

The downstream goal was to predict fare_amount from trip attributes (distance, time, pickup/drop-off zones, passenger count) — a regression problem nominally simple, but interesting precisely because the data engineering around it is where most of the work lives.

I structured the pipeline as four discrete stages, each owning a clear contract with the next: ingest → process → persist → analyse. Keeping the boundaries explicit means the heavy Spark stages can be re-run independently when the schema changes, and the downstream EDA can be replayed against the persisted artefact without re-touching the source Parquet.

For the cleaning step I deliberately leaned on Spark SQL rather than the DataFrame API where possible — SQL is the lingua franca of data engineering, easier to review, and easier to port if the pipeline ever moves to a SQL warehouse like BigQuery or Snowflake.

For the ML stage I used Spark ML's Pipeline + VectorAssembler + train/test split idiom and trained three regressors (Linear Regression, Decision Tree, Random Forest) on the cleaned, feature-engineered data to compare how much non-linearity actually buys on this dataset.

Parquet ingest: Read both yellow and green taxi Parquet files directly into Spark DataFrames using spark.read.parquet. The schemas differ slightly between the two taxi types (green has trip_type and ehail_fee; yellow has airport_fee), so I aligned the relevant columns before any cross-type analysis.

Spark SQL cleaning: Filtered out trips with zero or negative fare/distance, null pickup/drop-off zones, and impossibly long trip durations. Standardised the pickup/drop-off datetime columns. Cast all numerics to consistent types so downstream joins and aggregations behaved predictably.

Persisted samples as SQLite: Saved cleaned, sampled outputs to .db files so the EDA stage and any companion demo could query without spinning up a Spark session. This is the same pattern you'd use in a production pipeline that hands curated subsets off to a BI layer.

EDA with Spark SQL: Eight aggregation queries answering questions the operator actually cares about — top pickup zones, hour-of-day distribution, fare-by-distance bins, weekend vs weekday volume, drop-off zones with the highest average fare, payment-method mix, average fare for long trips, and yellow-vs-green comparison.

Feature engineering: Derived trip_duration_minutes from pickup/drop-off timestamps, extracted hour-of-day and day-of-week categoricals, bucketed trip_distance into ordinal bins, and one-hot encoded payment_type for the regression stage.

Spark ML regression: Trained LinearRegression, DecisionTreeRegressor, and RandomForestRegressor through a Spark ML Pipeline on the cleaned, feature-engineered data. Evaluated each on a held-out test split using RMSE and R².

Companion query explorer: A small Gradio app that replays the EDA queries against the persisted SQLite samples — same SQL the Spark pipeline ran, just at a scale you can host on free CPU infrastructure.

Years Covered

2020–2024

Pipeline Stages

4

EDA Queries

8

The EDA confirmed a number of intuitive patterns and surfaced a couple of less obvious ones. Trip volume peaks sharply between 5pm and 7pm, with a smaller morning ridge. Average fare scales near-linearly with distance up to about 10 miles, then flattens for the longer airport-trip-heavy tail. Weekday trip counts substantially exceed weekend trips, but average fare is slightly higher on weekends — consistent with longer leisure trips replacing short commuter hops.

On the regression side, Random Forest outperformed both Linear Regression and the single Decision Tree on test-set R² — unsurprising given the non-linear interaction between distance, time, and zone. Linear Regression remained competitive on RMSE for trips inside the bulk of the distribution, but degraded sharply on the long-distance tail where its inability to model the flattening fare curve cost it.

Operationally, persisting cleaned outputs as SQLite turned out to be a much better choice than I expected. The companion query explorer demo runs on essentially zero infrastructure, queries return in milliseconds, and the SQL is portable to any system that speaks ANSI SQL.

Working in Spark made me a much more deliberate writer of SQL. With pandas, you can chain together steps that work on small data and break silently on large data — type coercions, NULL handling, joins that explode. Spark forces you to be explicit about every cast and every join key, which transfers nicely to any production environment.

The separation between processing (Spark) and serving (SQLite, in this case) is a pattern I'll keep using. Big-data tools are not the right surface for ad-hoc queries; persisting a curated subset to a smaller, faster store after the heavy work is done gives you the best of both worlds.

Cross-validating the same query in two different engines — Spark SQL on the full Parquet, then sqlite3 on the persisted sample — is a useful sanity check. Differences surface schema bugs, type-coercion bugs, and edge cases the smaller dataset hides.