Read and format project data
# url to data
= "https://raw.githubusercontent.com/byuidatascience/data4missing/master/data-raw/flights_missing/flights_missing.json"
flights_url # read data as JSON file
= pd.read_json(flights_url) flights
Course DS 250
Nefi Melgar
Delayed flights are not something most people look forward to. In the best case scenario you may only wait a few extra minutes for the plane to be cleaned. However, those few minutes can stretch into hours if a mechanical issue is discovered or a storm develops. Arriving hours late may result in you missing a connecting flight, job interview, or your best friend’s wedding.
In 2003 the Bureau of Transportation Statistics (BTS) began collecting data on the causes of delayed flights. The categories they use are Air Carrier, National Aviation System, Weather, Late-Arriving Aircraft, and Security. You can visit the BTS website to read definitions of these categories.
CREATE CONSISTENCY BY FIXING MISSING DATA TYPES.
Fix all of the varied missing data types in the data to be consistent (all missing values should be displayed as “NaN”). In your report include one record example (one row) from your new data, in the raw JSON format. Your example should display the “NaN” for at least one missing value.
airport_code | airport_name | month | year | num_of_flights_total | num_of_delays_carrier | num_of_delays_late_aircraft | num_of_delays_nas | num_of_delays_security | num_of_delays_weather | num_of_delays_total | minutes_delayed_carrier | minutes_delayed_late_aircraft | minutes_delayed_nas | minutes_delayed_security | minutes_delayed_weather | minutes_delayed_total | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
919 | IAD | Washington, DC: Washington Dulles International | December | 2015.0 | 2799 | 182 | 183 | 61 | 0 | 17 | 443 | NaN | 15438 | 2826.0 | 0 | 1825 | 31164 |
920 | ORD | Chicago, IL: Chicago O'Hare International | December | 2015.0 | 25568 | 923 | 1755 | 1364 | 11 | 180 | 4233 | 80962.0 | 132055 | 72045.0 | 435 | 22459 | 307956 |
921 | SAN | San Diego, CA: San Diego International | NaN | 2015.0 | 6231 | 480 | 606 | 256 | 5 | 37 | 1383 | 25402.0 | 35796 | 9038.0 | 161 | 2742 | 73139 |
922 | SFO | San Francisco, CA: San Francisco International | December | 2015.0 | 13833 | 757 | 1180 | 2372 | 9 | 147 | 4465 | 55283.0 | 96703 | 193525.0 | 285 | 13788 | 359584 |
923 | SLC | Salt Lake City, UT: Salt Lake City International | December | 2015.0 | 8804 | 483 | 796 | 404 | 5 | 56 | 1745 | 37354.0 | 49549 | 13515.0 | 158 | 6693 | 107269 |
Check row 921, empty value is displayed as NaN.
The following code cleans the DataFrame by addressing various columns:
NA
and fills missing values using forward fill.If the answer to the following question is True, then the cleasing hasn’t worked correctly.
# ***clean airport_name column***
# this code will assign airport name based on airport codes
airport_names = [
"Atlanta, GA: Hartsfield-Jackson Atlanta International",
"Denver, CO: Denver International",
"Washington, DC: Washington Dulles International",
"Chicago, IL: Chicago O'Hare International",
"San Francisco, CA: San Francisco International",
"San Diego, CA: San Diego International",
"Salt Lake City, UT: Salt Lake City International",
]
airport_codes = ["ATL", "DEN", "IAD", "ORD", "SAN", "SFO", "SLC"]
counter = 0
while counter < flights.shape[0]:
# while counter < len(airport_codes):
if flights.loc[counter, "airport_code"] == airport_codes[0]:
flights.at[counter, "airport_name"] = airport_names[0]
elif flights.loc[counter, "airport_code"] == airport_codes[1]:
flights.at[counter, "airport_name"] = airport_names[1]
elif flights.loc[counter, "airport_code"] == airport_codes[2]:
flights.at[counter, "airport_name"] = airport_names[2]
elif flights.loc[counter, "airport_code"] == airport_codes[3]:
flights.at[counter, "airport_name"] = airport_names[3]
elif flights.loc[counter, "airport_code"] == airport_codes[4]:
flights.at[counter, "airport_name"] = airport_names[4]
elif flights.loc[counter, "airport_code"] == airport_codes[5]:
flights.at[counter, "airport_name"] = airport_names[5]
elif flights.loc[counter, "airport_code"] == airport_codes[6]:
flights.at[counter, "airport_name"] = airport_names[6]
counter += 1
# ***clean month column***
flights["month"].value_counts()
# replace all n/a values with NA
flights["month"] = flights["month"].replace("n/a", pd.NA)
# flights.replace("n/a", np.nan, inplace=True)
# check for missing values
flights["month"].isna().value_counts()
# The "ffill" method fills in missing values by forward-filling, which means
# that it uses the last known value to fill in subsequent missing values.
flights["month"].fillna(method="ffill", inplace=True)
# ***clean year column***
flights["year"].fillna(method="ffill", inplace=True)
# ***clean num_of_delays_carrier column***
flights["num_of_delays_carrier"].value_counts()
delays_subset = flights["num_of_delays_carrier"].replace("1500+", pd.NA)
delays_subset.dropna()
# convert column to numeric values and get the mean
delays_subset_numeric = pd.to_numeric(delays_subset, errors="coerce")
mean_delay = round(delays_subset_numeric.mean())
# add the mean of the column to replace those cells with a 1500+ value
flights["num_of_delays_carrier"] = flights["num_of_delays_carrier"].replace(
"1500+", mean_delay)
# ***clean num_of_delays_late_aircraft column***
flights["num_of_delays_late_aircraft"].value_counts()
delays_late_subset = flights["num_of_delays_late_aircraft"].replace("-999", pd.NA)
delays_late_subset.dropna()
# convert column to numeric values and get the mean
delays_late_subset_numeric = pd.to_numeric(delays_late_subset, errors="coerce")
mean_delays_late = round(delays_late_subset_numeric.mean())
# add the mean of the column to replace those cells with a -999 value
flights["num_of_delays_late_aircraft"] = flights["num_of_delays_late_aircraft"].replace(
-999, mean_delays_late)
# ***clean minutes_delayed_Carrier column***
flights["minutes_delayed_carrier"].value_counts()
min_delayed_carrier_mean = round(flights["minutes_delayed_carrier"].mean())
flights["minutes_delayed_carrier"].fillna(min_delayed_carrier_mean, inplace=True)
# ***clean minutes_delayed_nas column***
flights["minutes_delayed_nas"].isna().value_counts()
mins_delayed_nas_subset = flights["minutes_delayed_nas"].replace("-999", pd.NA)
mins_delayed_nas_subset.dropna()
mins_delayed_nas_mean = round(mins_delayed_nas_subset.mean())
# add the mean of the column to replace those cells with a -999 value
flights["minutes_delayed_nas"] = flights["minutes_delayed_nas"].replace(
-999, mins_delayed_nas_mean)
flights["minutes_delayed_nas"].fillna(mins_delayed_nas_mean, inplace=True)
# ***check if there's any NaN value in the df***
has_nan = flights.isnull().any().any()
print("Does the DataFrame have any NaN values?", has_nan)
# ***check if there's any NaN values in the df using columns***
# nan_columns = flights.isnull().any()
# print("Columns with NaN values:\n", nan_columns)
Does the DataFrame have any NaN values? False
Which airport has the worst delays?
To determine the “worst” airport for delays, I chose the metric of the proportion of delayed flights. This metric is calculated as the total number of delayed flights divided by the total number of flights for each airport. The reason for choosing this metric is that it directly indicates the likelihood of a flight being delayed at a given airport, providing a clear measure of delay performance. A higher delay proportion signifies a greater frequency of delays, which is more impactful to passengers and operations compared to just the total number of delays or average delay time. Among the listed airports, San Francisco International Airport (SFO) has the highest delay proportion at 26.10%, making it the “worst” airport in terms of flight delays.
# get total flights for each airport
flight_totals = flights.groupby("airport_code")["num_of_flights_total"].sum()
# get total delays for each airport
delayed_totals = flights.groupby("airport_code")["num_of_delays_total"].sum()
# get proportion of delayed flights
delayed_proportion = (delayed_totals / flight_totals) * 100
delayed_proportion = round(delayed_proportion, 2)
# get minutes delay time in hours
delay_totals_minutes = flights.groupby("airport_code")["minutes_delayed_total"].sum()
delay_totals_hours = round(delay_totals_minutes / 60, 2)
# create the worst airport data frame
worst_df = pd.DataFrame(
{
# "AirportCode": flight_totals.index,
"TotalFlights": flight_totals,
"TotalDelays": delayed_totals,
"DelayProportion": delayed_proportion,
"DelayTime(Hours)": delay_totals_hours,
}
)
worst_df = worst_df.sort_values(by="DelayTime(Hours)", ascending=False)
worst_df
TotalFlights | TotalDelays | DelayProportion | DelayTime(Hours) | |
---|---|---|---|---|
airport_code | ||||
ORD | 3597588 | 830825 | 23.09 | 939268.82 |
ATL | 4430047 | 902443 | 20.37 | 899732.10 |
SFO | 1630945 | 425604 | 26.10 | 442508.22 |
DEN | 2513974 | 468519 | 18.64 | 419556.35 |
IAD | 851571 | 168467 | 19.78 | 171391.30 |
SLC | 1403384 | 205160 | 14.62 | 168722.85 |
SAN | 917862 | 175132 | 19.08 | 137937.47 |
San Francisco International Airport (SFO) has the highest delay proportion with 26.10%, Chicago’s O’Hare International Airport is in second place with a 23.09%.
WHAT IS THE BEST MONTH TO FLY IF YOU WANT TO AVOID DELAYS OF ANY LENGTH?
To determine the best month to fly if you want to avoid delays of any length, I chose the metric of total minutes delayed per month. This metric reflects the overall delay time experienced by flights in each month, providing a clear indicator of which months tend to have fewer delays. After analyzing the data from 2005 to 2015, September emerged as the best month to fly, with the lowest total minutes delayed at 152,221.09. This indicates that flights in September generally experience the least amount of delay time, making it the optimal month for travelers seeking to avoid delays.
# average delay time per each month, reset index and convert to dataframe
flights_month = (
flights.groupby("month")["minutes_delayed_total"].mean().round(2).reset_index()
)
fig = px.scatter(
flights_month,
x="month",
y="minutes_delayed_total",
title="Avg. delay time (mins) by month (2005-2015)",
labels={"minutes_delayed_total": "Average Minutes", "month": "Month"},
)
# format amounts to show (,) and round them, display each 30,000
fig.update_yaxes(tickformat=",.0f", dtick=30000)
fig.show()
September is the best month to fly, with the lowest total minutes delayed at 152,221.09.
LATE FLIGHTS BECAUSE OF WEATHER
According to the BTS website, the “Weather” category only accounts for severe weather delays. Mild weather delays are not counted in the “Weather” category, but are actually included in both the “NAS” and “Late-Arriving Aircraft” categories. A new column were created, it calculates the total number of flights delayed by weather (both severe and mild). All the missing values in the Late Aircraft variable were replaced with the mean value. The first 5 rows of data are shown in a table to demonstrate this calculation. These three rules were for the calculations:
# total flights for each airport
flight_totals = flights.groupby("airport_code")["num_of_flights_total"].sum()
# total delays for each airport
delayed_totals = flights.groupby("airport_code")["num_of_delays_total"].sum()
# RULE 1: 100% of delayed flights in the Weather category are due to weather
weather_100 = flights.groupby("airport_code")["num_of_delays_weather"].sum()
# RULE 2: 30% of all delayed flights in the Late-Arriving category are due to weather
sample_flights_30 = flights.sample(frac=0.3, random_state=1)
weather_30 = sample_flights_30.groupby("airport_code")[
"num_of_delays_late_aircraft"
].sum()
# RULE 3: From April to August, 40% of delayed flights in the NAS category are due to weather.
months_40 = ["April", "May", "June", "July", "August"]
months_65 = [
"January",
"February",
"March",
"September",
"October",
"November",
"December",
]
df_40 = flights.query(f"month in @months_40")
sample_flights_40 = df_40.sample(frac=0.4, random_state=1)
weather_40 = sample_flights_40.groupby("airport_code")["num_of_delays_nas"].sum()
# RULE 3.1: The rest of the months, the proportion rises to 65%
df_65 = flights.query(f"month in @months_65")
sample_flights_65 = df_65.sample(frac=0.65, random_state=1)
weather_65 = sample_flights_65.groupby("airport_code")["num_of_delays_nas"].sum()
# dataframe to display the results
weather_df = pd.DataFrame(
{
"TotalFlights": flight_totals,
"TotalDelays": delayed_totals,
"WeatherDelay": weather_100,
"DelaysLateAircraft": weather_30,
"DelayNAS40": weather_40,
"DelayNAS65": weather_65,
}
).reset_index()
weather_df
airport_code | TotalFlights | TotalDelays | WeatherDelay | DelaysLateAircraft | DelayNAS40 | DelayNAS65 | |
---|---|---|---|---|---|---|---|
0 | ATL | 4430047 | 902443 | 32375 | 56109 | 56969 | 113774 |
1 | DEN | 2513974 | 468519 | 13836 | 41232 | 24368 | 44908 |
2 | IAD | 851571 | 168467 | 4794 | 21143 | 11488 | 14227 |
3 | ORD | 3597588 | 830825 | 20765 | 88245 | 50875 | 105548 |
4 | SAN | 917862 | 175132 | 4320 | 20957 | 8502 | 15181 |
5 | SFO | 1630945 | 425604 | 10377 | 37367 | 45907 | 65225 |
6 | SLC | 1403384 | 205160 | 6831 | 26889 | 9799 | 19775 |
This table look overwhelming as many numbers are shown.
We will summarize the previous table by calculating the proportion of delayed flights due to weather-related factors for each airport in a new DataFrame. A iteration was made through each row of the DataFrame, calculated the total delay time, divided it by the total delayed flights, and converted it to a percentage. The resulting proportions are displayed in a chart with columns for airport codes and their respective delay proportions due to weather.
counter_weather = 0
proportion_delays_totals = []
airtport_codes_weather = []
while counter_weather < len(weather_df["TotalFlights"]):
# make the sum for the rows
rule1_percent = weather_df.WeatherDelay[counter_weather]
rule2_percent = weather_df.DelaysLateAircraft[counter_weather]
rule3_percent = weather_df.DelayNAS40[counter_weather]
rule31_percent = weather_df.DelayNAS65[counter_weather]
# perform the sum of each cell of the row
row_total = rule1_percent + rule2_percent + rule3_percent + rule31_percent
# get elements of the total delays due of weather column
flight_totals_weather = weather_df.TotalDelays[counter_weather]
# get proportion of delayed flights due of weather and append the
# proportion to the list proportion delays total
delayed_proportion_weather = (row_total / flight_totals_weather) * 100
delayed_proportion_weather = round(delayed_proportion_weather, 1)
# print(delayed_proportion_weather)
proportion_delays_totals.append(delayed_proportion_weather)
# get airport codes and append them to the list of airport codes
airport_code_weather = weather_df.airport_code[counter_weather]
airtport_codes_weather.append(airport_code_weather)
# create new data frame
weather_proportions_data = {
"airport_code": airtport_codes_weather,
"proportion": proportion_delays_totals,
}
proportions_weather_df = pd.DataFrame(weather_proportions_data)
counter_weather += 1
proportions_weather_df
airport_code | proportion | |
---|---|---|
0 | ATL | 28.7 |
1 | DEN | 26.5 |
2 | IAD | 30.7 |
3 | ORD | 31.9 |
4 | SAN | 28.0 |
5 | SFO | 37.3 |
6 | SLC | 30.9 |
San Francisco International Airport (SFO) has the highest proportion of delayed flights because of weather at 37.3%, making it the worst for delays also in this category. Chicago O’Hare International Airport (ORD) follows with a 31.9% delay proportion. This indicates that travelers using SFO and ORD are more likely to experience delays due of weather compared to other airports.
Washington Dulles International Airport (IAD) and Salt Lake City International Airport (SLC) also have high delay proportions at around 30%. Overall, SFO and ORD stand out as the airports with the most significant delay issues.