Wings - Machine First AI - Exploratory Data Analysis
Instructions:
- The data required for this task has been provided in the file 'data.csv'.
- Read the question provided for each cell and assign your answers to the respective variables provided in the following cell.
- If an answer is a floating-point number, round it to two decimal places. For example, 10.546 should be read as 10.55, 10.544 as 10.54, and 10.1 as 10.10.
- The pandas and numpy packages are preinstalled, and these packages should be sufficient to solve this task.
- Please don't change the names of the variables meant to hold your answers.
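The rounding rule in the instructions can be illustrated with a small sketch. It assumes the intended rule is round-half-up on the decimal value; the helper name `round_half_up` is hypothetical, and how the grader actually compares answers is not stated. A float such as 10.1 carries no trailing zero, so "10.10" can only appear as a display format.

```python
from decimal import Decimal, ROUND_HALF_UP


def round_half_up(value: str, places: str = "0.01") -> Decimal:
    """Round a decimal string half-up to the given quantum (hypothetical helper)."""
    return Decimal(value).quantize(Decimal(places), rounding=ROUND_HALF_UP)


r1 = round_half_up("10.546")  # third decimal is 6, so the result rounds up
r2 = round_half_up("10.544")  # third decimal is 4, so the result rounds down

# round() keeps 10.1 as the float 10.1; the trailing zero is formatting only
shown = format(round(10.1, 2), ".2f")

print(r1, r2, shown)
```

In the tasks below, the built-in `round(value, 2)` is used on floats, which is normally sufficient for answers of this kind.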
# Run this cell to import the packages.
import pandas as pd
import numpy as np
### Read the data (this will not be graded)
df = pd.read_csv('data.csv')
df.head()
Task 1: What is the standard deviation of maximum windspeed across all the days?
Note: ws_std should be of type float.
# Wrap in float() so the result is a built-in float, as in the other tasks
ws_std = float(round(df['Maximum windspeed (mph)'].std(), 2))
ws_std
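One detail worth checking for Task 1 is which standard deviation is being computed. The sketch below uses made-up values (not taken from data.csv) to show that pandas defaults to the sample standard deviation (`ddof=1`), while NumPy defaults to the population version (`ddof=0`). Which one the task expects is not stated; the solution above uses the pandas default.

```python
import numpy as np
import pandas as pd

# Hypothetical toy values, not taken from data.csv
speeds = pd.Series([10.0, 12.0, 14.0, 16.0, 18.0])

sample_std = speeds.std()                   # pandas default: ddof=1 (sample)
population_std = speeds.std(ddof=0)         # population standard deviation
numpy_default = np.std(speeds.to_numpy())   # NumPy default: ddof=0

print(round(sample_std, 2), round(population_std, 2))
```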
Task 2: What is the difference between the 50th percentile and the 75th percentile of average temperature?
Note: p_range should be of type float
p50 = np.percentile(df['Average temperature (°F)'], 50)
p75 = df['Average temperature (°F)'].quantile(0.75)
p_range = float(round(p75-p50, 2))
p_range
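The solution for Task 2 mixes `np.percentile` and `Series.quantile`. On complete data the two agree, but they treat missing values differently. The sketch below uses a made-up series with one NaN to show the difference; whether data.csv contains missing temperatures is not known here.

```python
import numpy as np
import pandas as pd

# Hypothetical toy values with one missing reading
temps = pd.Series([50.0, 60.0, np.nan, 70.0, 80.0])

q_pandas = temps.quantile(0.75)                   # pandas skips NaN
p_numpy = np.percentile(temps.to_numpy(), 75)     # NaN propagates to the result
p_nan = np.nanpercentile(temps.to_numpy(), 75)    # NaN-aware NumPy variant

print(q_pandas, p_numpy, p_nan)
```

If the column had gaps, using `quantile` for both percentiles (or `np.nanpercentile`) would keep the two values comparable.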
Task 3: What is the Pearson correlation between average dew point and average temperature?
Note: corr should be of type float
Solution 1
correlation_matrix = df[['Average temperature (°F)', 'Average dewpoint (°F)']].corr(method='pearson')
corr = float(round(correlation_matrix.loc['Average temperature (°F)', 'Average dewpoint (°F)'],2))
print(corr)
Solution 2
# Alternative using SciPy (not among the packages the instructions list as preinstalled)
import scipy.stats as stats
cor_coeff_sts, p_value = stats.pearsonr(df['Average temperature (°F)'], df['Average dewpoint (°F)'])
corr = float(round(cor_coeff_sts, 2))
print(corr)
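As a cross-check on both solutions, the Pearson coefficient can be computed directly from its definition, covariance divided by the product of the standard deviations. The sketch below uses made-up values rather than data.csv and needs only pandas and numpy.

```python
import numpy as np
import pandas as pd

# Hypothetical toy values, not taken from data.csv
temp = pd.Series([30.0, 40.0, 50.0, 60.0, 70.0])
dew = pd.Series([20.0, 28.0, 41.0, 47.0, 62.0])

# r = cov(x, y) / (std(x) * std(y)); cov and std both use ddof=1 here
r_manual = temp.cov(dew) / (temp.std() * dew.std())
r_pandas = temp.corr(dew)                # Pearson is the default method
r_numpy = np.corrcoef(temp, dew)[0, 1]   # off-diagonal entry of the 2x2 matrix

print(round(r_manual, 4), round(r_pandas, 4), round(r_numpy, 4))
```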
Task 4: Out of all the available records, which month has the lowest average humidity?
- Assign your answer as the month index; for example, if it is July, the index is 7. Note: dew_month should be of type int
df["Day"] = pd.to_datetime(df["Day"], format="%d/%m/%Y")
# Extract month
df["Month"] = df["Day"].dt.month
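# Month that contains the single lowest 'Average humidity (%)' reading.
# Note: .mean() in place of .min() would instead pick the month with the lowest monthly mean.
dew_month = int(df.groupby("Month")["Average humidity (%)"].min().idxmin())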
dew_month
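The wording of Task 4 allows two readings: the month containing the lowest daily "average humidity" record, or the month whose mean humidity is lowest. The sketch below uses a made-up frame in which the two readings disagree; which reading the grader expects is an open assumption.

```python
import pandas as pd

# Hypothetical toy data: month 1 has one very dry day, month 2 is drier on average
toy = pd.DataFrame({
    "Month": [1, 1, 1, 2, 2, 2],
    "Average humidity (%)": [20, 90, 90, 50, 55, 60],
})
by_month = toy.groupby("Month")["Average humidity (%)"]

month_of_lowest_day = int(by_month.min().idxmin())    # lowest single record
month_of_lowest_mean = int(by_month.mean().idxmin())  # lowest monthly mean

print(month_of_lowest_day, month_of_lowest_mean)
```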
Task 5: Which month has the highest median for maximum_gust_speed out of all the available records? Also find the respective value. Hint: group by month.
Note: max_gust_value should be of type float and max_gust_month should be of type int.
# Compute the per-month medians once, then read off the maximum and its month
monthly_gust_median = df.groupby("Month")["Maximum gust speed (mph)"].median()
max_gust_value = float(round(monthly_gust_median.max(), 2))
max_gust_month = int(monthly_gust_median.idxmax())
max_gust_value, max_gust_month
Task 6: Determine the average temperature from March 2010 to May 2012 (including both months). Note: avg_temp should be of type float
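# Index by date; sorting keeps partial-string date slicing reliable if the rows are not in date order
df2 = df.set_index("Day").sort_index()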
avg_temp = float(round(df2.loc["2010-03":"2012-05"]['Average temperature (°F)'].mean(),2))
avg_temp
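The slice in Task 6 relies on partial string indexing of a `DatetimeIndex`, where a label such as "2010-03" stands for the whole month and both endpoints of the slice are included. A minimal sketch on a made-up daily index (not data.csv):

```python
import pandas as pd

# Hypothetical daily index running from late February to early April 2010
days = pd.date_range("2010-02-27", "2010-04-02", freq="D")
toy = pd.DataFrame({"temp": range(len(days))}, index=days).sort_index()

# Both endpoints are month labels, so every day of March is selected
march = toy.loc["2010-03":"2010-03"]

print(len(march), march.index.min().date(), march.index.max().date())
```

This is why the slice `"2010-03":"2012-05"` covers 1 March 2010 through 31 May 2012 without spelling out the day.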
Task 7: Find the range of average temperature in Dec 2010. Note: temp_range should be of type float
# Range = maximum minus minimum over December 2010
dec_2010 = df2.loc["2010-12"]['Average temperature (°F)']
temp_range = float(round(dec_2010.max() - dec_2010.min(), 2))
temp_range
Task 8: Out of all available records, which day has the highest difference between maximum_pressure and minimum_pressure? Assign the date in string format as 'yyyy-mm-dd'. Make sure you enclose it in single quotes.
df2['Pressure_Diff'] = df2['Maximum pressure '] - df2['Minimum pressure ']
max_p_range_day = df2['Pressure_Diff'].idxmax().strftime('%Y-%m-%d')
# Alternative, if the single quotes must be part of the string itself:
# max_p_range_day = f"'{df2['Pressure_Diff'].idxmax().strftime('%Y-%m-%d')}'"
print(max_p_range_day)
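The Task 8 solution works because `idxmax` returns the index label of the largest value, which is a `Timestamp` once the frame is indexed by date. A small sketch with made-up pressures and hypothetical column names:

```python
import pandas as pd

# Hypothetical toy readings; column names are illustrative only
toy = pd.DataFrame(
    {"max_p": [30.1, 30.4, 30.2], "min_p": [29.9, 29.6, 30.0]},
    index=pd.to_datetime(["2011-01-01", "2011-01-02", "2011-01-03"]),
)

spread = toy["max_p"] - toy["min_p"]            # daily pressure range
best_day = spread.idxmax().strftime("%Y-%m-%d")  # Timestamp label -> 'yyyy-mm-dd'

print(best_day)
```

If several days tie for the largest spread, `idxmax` returns the first one in index order.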
Task 9: How many days fall exactly on the median (i.e., are equal to the median value) of the barometer reading? Note: median_b_days should be of type int
median_b_days = int((df['Average barometer (in)'] == df['Average barometer (in)'].median()).sum())
median_b_days
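Task 9 compares floats with `==`. That is usually fine for values parsed straight from a CSV, but it can miss matches when values come out of arithmetic. The sketch below uses made-up numbers to contrast exact equality with `np.isclose`; using a tolerance is a possible variation, not something the task asks for.

```python
import numpy as np
import pandas as pd

# Hypothetical toy values; 0.1 + 0.2 is stored as 0.30000000000000004
readings = pd.Series([0.1 + 0.2, 0.3, 0.3, 0.2, 0.4])

med = readings.median()                               # middle of five sorted values
exact_days = int((readings == med).sum())             # strict float equality
close_days = int(np.isclose(readings, med).sum())     # equality within a tolerance

print(med, exact_days, close_days)
```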
Task 10: Out of all the available records, how many days are within one standard deviation of the mean of average temperature? Note: num_days_std should be of type int
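m, s = df['Average temperature (°F)'].agg(['mean', 'std'])
ub = m + s  # upper bound: mean plus one standard deviation
lb = m - s  # lower bound: mean minus one standard deviation
print(ub, lb)
# between() is inclusive on both ends, matching the >= and <= comparisons
num_days_std = int(df['Average temperature (°F)'].between(lb, ub).sum())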
num_days_std
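The "within one standard deviation" count in Task 10 can be checked by hand on a small series. The sketch below uses made-up temperatures and compares an explicit boolean mask with `Series.between`, both inclusive of the bounds.

```python
import pandas as pd

# Hypothetical toy values: mean is 60, sample std is about 15.81
temps = pd.Series([40.0, 50.0, 60.0, 70.0, 80.0])

mean, std = temps.mean(), temps.std()
lower, upper = mean - std, mean + std

count_mask = int(((temps >= lower) & (temps <= upper)).sum())  # explicit comparisons
count_between = int(temps.between(lower, upper).sum())         # inclusive by default

print(round(lower, 2), round(upper, 2), count_mask, count_between)
```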