Optimizing Airport Pickup Buffer Calculation Algorithms and Mitigating Data Skew in Distributed Computing
Requirements and Algorithmic Logic
The system requires a dynamic buffer time recommendation based on historical flight data to optimize operational efficiency. The logic determines whether to use specific flight data or aggregated data from similar flights based on order volume thresholds.
Eligibility Criteria
For a target flight to trigger a calculation, it must meet one of the following conditions based on historical data over the past X months:
- Condition A: The specific flight's order volume is greater than or equal to threshold Y.
- Condition B: The combined order volume of the target flight and its 'similar flights' is greater than or equal to threshold Y.
If Condition A is met, the system uses historical data from that specific flight. If Condition A fails but Condition B is met, the system aggregates data from the target flight and similar flights. If neither condition is met, no calculation is performed.
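The two-condition fallback above can be sketched as a small decision function. This is a minimal illustration, assuming the order counts for the specific flight and its similar flights have already been retrieved; the names (`DataSource`, `choose_data_source`) are hypothetical, not part of the production system.

```python
from enum import Enum

class DataSource(Enum):
    SPECIFIC = "specific_flight"          # Condition A met
    AGGREGATED = "specific_plus_similar"  # Condition A failed, Condition B met
    NONE = "no_calculation"               # neither condition met

def choose_data_source(specific_orders: int,
                       similar_orders: int,
                       threshold_y: int) -> DataSource:
    """Apply Condition A first, then Condition B, over the X-month window."""
    if specific_orders >= threshold_y:                    # Condition A
        return DataSource.SPECIFIC
    if specific_orders + similar_orders >= threshold_y:   # Condition B
        return DataSource.AGGREGATED
    return DataSource.NONE
```

Checking Condition A before Condition B matters: data from the exact flight is preferred whenever its own volume clears the threshold.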
Flight Definitions
- Specific Flight: Identical flight number and identical origin/destination airports.
- Similar Flights: Same origin/destination airports where the scheduled landing time is within ± Z hours of the target flight's scheduled landing time.
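The 'similar flight' definition reduces to a simple predicate: same airport pair, and scheduled landing within ± Z hours. A minimal sketch (airport codes and field names are illustrative assumptions):

```python
from datetime import datetime, timedelta

def is_similar_flight(target_origin: str, target_dest: str,
                      target_sched_landing: datetime,
                      cand_origin: str, cand_dest: str,
                      cand_sched_landing: datetime,
                      z_hours: float) -> bool:
    """True if the candidate shares both airports with the target and its
    scheduled landing falls within +/- Z hours of the target's."""
    if (cand_origin, cand_dest) != (target_origin, target_dest):
        return False
    return abs(cand_sched_landing - target_sched_landing) <= timedelta(hours=z_hours)
```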
Calculation Methodology
The core metric is defined as the duration between the actual flight landing time and the time the driver picks up the passenger:
Buffer_Duration = Driver_Pickup_Timestamp - Flight_Actual_Landing_Timestamp
Data Processing Steps
- Retrieve historical data for eligible flights over the last X months.
- Calculate the duration for each record.
- Filter out outliers by removing durations below a lower bound L or above an upper bound U.
- Apply a Weighted Moving Average (WMA) algorithm to predict the optimal buffer time. Recent data is assigned higher significance.
Prediction Algorithm
The WMA formula prioritizes recent months. Let the weights be w1, w2, w3 for periods T1, T2, T3 respectively (where T1 is the most recent).
Predicted_Buffer = Floor(
(Sum(Duration_T1 * w1) + Sum(Duration_T2 * w2) + Sum(Duration_T3 * w3)) /
(Count(T1) * w1 + Count(T2) * w2 + Count(T3) * w3)
)
Applying the floor truncates the final result to a whole number of minutes.
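The WMA formula above is a weighted mean of all retained durations, where each duration's weight depends on which period it falls in. A minimal sketch (the function and argument names are hypothetical):

```python
import math

def predicted_buffer(durations_by_period, weights):
    """Weighted moving average over periods T1..T3 (newest first).

    durations_by_period: list of duration lists (minutes), e.g. [T1, T2, T3].
    weights: [w1, w2, w3], with w1 largest to favor recent data.
    """
    numerator = sum(w * sum(ds) for ds, w in zip(durations_by_period, weights))
    denominator = sum(w * len(ds) for ds, w in zip(durations_by_period, weights))
    return math.floor(numerator / denominator)
```

For example, with durations [[30, 40], [50], [60]] and weights [3, 2, 1], the weighted sum is 370 over an effective count of 9, giving floor(41.1) = 41 minutes.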
Performance Bottleneck Analysis
During the initial implementation of the BI model and SQL logic, the job execution time was significantly longer than anticipated given the data volume. This suggested a performance inefficiency typical of distributed computing environments.
Diagnosing Data Skew
Analysis of the Spark UI revealed a classic long-tail pattern: most tasks completed rapidly, while a small number of tasks in certain Stages ran far longer. This disparity confirmed the presence of data skew.
Common causes for this behavior include:
- Uneven data distribution (Hot Keys).
- Improper Join strategies (e.g., Join keys with many nulls).
- Data type mismatches in Join keys.
Root Cause Investigation
By correlating the stalled Stage IDs in the Spark UI with the underlying SQL execution plan, the bottleneck was isolated to a specific Join operation. Further investigation into the Join keys ruled out null values and type mismatches. The root cause was identified as data distribution non-uniformity caused by specific hotspot keys. A small subset of keys contained a disproportionately large number of records, forcing a single reducer to handle the bulk of the processing.
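A quick way to confirm hotspot keys is to count records per join key and flag any key holding a disproportionate share of the total, mirroring a `GROUP BY key` / `COUNT(*)` check in SQL. A minimal sketch (the 20% threshold is an illustrative assumption, not a value from the investigation):

```python
from collections import Counter

def find_hot_keys(join_keys, ratio_threshold: float = 0.2):
    """Return keys whose share of total records exceeds ratio_threshold.

    In a skewed join, one such key forces a single reducer to process
    the bulk of the data while the rest of the cluster sits idle.
    """
    counts = Counter(join_keys)
    total = len(join_keys)
    return {k for k, c in counts.items() if c / total > ratio_threshold}
```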
Resolution Strategy
To mitigate the skew, the following approaches were evaluated:
- Salting/Isolation: Separating hotspot and non-hotspot data and processing them individually before unioning the results.
- Filtering: Removing the hotspot keys if they hold no analytical value for the business logic.
Given that the hotspot keys in this scenario did not contribute meaningful insights to the buffer time calculation, the optimization involved directly filtering out these specific keys from the Join operation. This balanced the load across the cluster and significantly reduced execution time.
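Filtering was the right fix here because the hot keys carried no analytical value. For cases where they must be kept, the salting approach evaluated above can be sketched as follows: append a random suffix to each hot key so its rows spread across multiple partitions (the other side of the join would be replicated once per suffix). The function name and suffix format are illustrative assumptions.

```python
import random

def salt_hot_keys(records, hot_keys, n_salts: int = 8):
    """Spread rows with hotspot join keys across n_salts sub-keys.

    records: iterable of (key, value) pairs on the skewed side of the join.
    Non-hot keys pass through unchanged, so the common path is unaffected.
    """
    out = []
    for key, value in records:
        if key in hot_keys:
            out.append((f"{key}#{random.randrange(n_salts)}", value))
        else:
            out.append((key, value))
    return out
```

After salting, the formerly single giant partition for a hot key becomes up to n_salts smaller ones, at the cost of replicating the matching rows on the other join side.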