Mini-Project for Data Scientist - Exploratory Data Analysis Fresco Play Handson Solution

Learn data cleaning in Python while working with a pandas DataFrame: parsing dates, selecting rows with iloc and loc, applying lambda functions to columns, and more.

Welcome to Turing Machine Data Scientist Program: Use-case 2 - Exploratory Data Analysis


Q1: What is the standard deviation of maximum windspeed across all the days?

Solution: 1


# import the data file for this hands-on.
import pandas as pd
import numpy as np

#dataDOTcsv  = "https://hr-projects-assets-prod.s3.amazonaws.com/c3pde3c3lgm/963fbab228e2896e79fc09e385ab377d/data.csv"

fl = pd.read_csv("data.csv")

# Task 1: What is the standard deviation of maximum windspeed across all the days?

# np.std uses the population formula (ddof=0) by default; one round(2) is enough.
q1 = np.std(fl["Maximum windspeed (mph)"]).round(2)
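
One detail worth checking in the Q1 solution: `np.std` defaults to the population standard deviation (`ddof=0`), while pandas' `Series.std` defaults to the sample standard deviation (`ddof=1`). The exercise does not say which one the grader expects, so the sketch below only illustrates the difference on a small made-up series (not the hands-on data).

```python
import numpy as np
import pandas as pd

# Hypothetical wind speeds, purely for illustration (not taken from data.csv).
wind = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

pop_std = np.std(wind)     # population standard deviation, ddof=0
sample_std = wind.std()    # sample standard deviation, ddof=1 (pandas default)

print(round(pop_std, 2), round(sample_std, 2))
```

If the submitted value is rejected, switching between the two conventions is a cheap thing to try.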

 

Q2: What is the difference between the 50th percentile and the 75th percentile of average temperature?

Solution: 2


# import the data file for this hands-on.
import pandas as pd
import numpy as np
fl = pd.read_csv("data.csv")

# Task 2: What is the difference between the 50th percentile and the 75th percentile of average temperature?

a = fl["Average temperature (°F)"].quantile(0.75)
b = fl["Average temperature (°F)"].quantile(0.50)
q2 = round(a-b, 2)

 

Q3: What is the Pearson correlation between average dew point and average temperature?

Solution: 3


#Task 3: What is the pearson correlation between average dew point and average temperature.

temp = fl["Average dewpoint (°F)"].corr(fl["Average temperature (°F)"])
q3 = round(temp, 2)
 

Q4: Out of all the available records, which month has the lowest average humidity?

Solution: 4


# Task 4: Out of all the available records, which month has the lowest average humidity?
humidity_col = "Average humidity (%)"

# Locate the record with the lowest average humidity and read its date.
lowest_day = fl.loc[fl[humidity_col].idxmin(), "Day"]

# "Day" is stored as day/month/year, so parse it explicitly before taking the month;
# letting pandas guess the format can swap day and month for dates such as 05/03/2010.
q4 = pd.to_datetime(lowest_day, format="%d/%m/%Y").month
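
The solution above reads Q4 as "the month of the single day with the lowest humidity reading". If the question is instead read as "the calendar month whose mean humidity is lowest", a group-by gives that answer. The sketch below shows both readings on made-up rows; the data and the variable names are assumptions for illustration, and only the column names follow the hands-on file.

```python
import pandas as pd

# Hypothetical rows with day/month/year dates (not taken from data.csv).
toy = pd.DataFrame({
    "Day": ["01/01/2010", "15/01/2010", "03/02/2010", "20/02/2010", "05/03/2010"],
    "Average humidity (%)": [80, 70, 40, 90, 60],
})

months = pd.to_datetime(toy["Day"], format="%d/%m/%Y").dt.month

# Reading 1: month of the single lowest daily value.
single_day_month = int(months.loc[toy["Average humidity (%)"].idxmin()])

# Reading 2: month with the lowest mean humidity over all its days.
monthly_mean = toy.groupby(months)["Average humidity (%)"].mean()
lowest_mean_month = int(monthly_mean.idxmin())

print(single_day_month, lowest_mean_month)
```

On this toy data the two readings disagree, so it is worth checking which one the grader accepts.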

 

Q7: Average temperature between the months March 2010 and May 2012

Solution: 7


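# Task 7: Average temperature between the months March 2010 and May 2012

# "Day" is stored as day/month/year; parse it into a proper datetime column.
fl["Day_New"] = pd.to_datetime(fl['Day'], format='%d/%m/%Y')
in_range = fl[(fl["Day_New"] >= '2010-03-01') & (fl["Day_New"] <= '2012-05-31')]

# Select the temperature column by name instead of relying on its position in describe().
avg_temp = in_range["Average temperature (°F)"].mean()
# fl.drop(columns=["Day_New"], inplace=True)  # to delete the new column

ans = round(avg_temp, 2)
print(ans)
q7 = ans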
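
The same date filter can be written with `Series.between`, which is inclusive on both ends by default. This is a minimal sketch on made-up rows (the values are hypothetical; only the column names follow the hands-on file), not a replacement for the solution above.

```python
import pandas as pd

# Hypothetical rows around the boundaries of the requested period.
toy = pd.DataFrame({
    "Day": ["28/02/2010", "01/03/2010", "05/03/2010", "31/05/2012", "01/06/2012"],
    "Average temperature (°F)": [30.0, 40.0, 50.0, 60.0, 70.0],
})

day = pd.to_datetime(toy["Day"], format="%d/%m/%Y")

# True for rows dated from 1 March 2010 to 31 May 2012, both ends included.
in_range = day.between("2010-03-01", "2012-05-31")

mean_temp = round(toy.loc[in_range, "Average temperature (°F)"].mean(), 2)
print(mean_temp)
```

Including a row just outside each boundary, as here, is a quick way to confirm that the endpoints behave as intended.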

 

Q8: Find the range of average temperature in Dec 2010

Solution: 8


# Task 8: Find the range of average temperature in Dec 2010

fl["Day_New"] = pd.to_datetime(fl['Day'], format='%d/%m/%Y')
temp = fl[(fl["Day_New"] >= '2010-12-01') & (fl["Day_New"] <= '2010-12-31')]

mx = temp["Average temperature (°F)"].max()
mn = temp["Average temperature (°F)"].min()
avg_temp_dec  = round(mx-mn, 2)
print(avg_temp_dec)
q8 = avg_temp_dec
 

Q9: Out of all available records, which day has the highest difference between maximum_pressure and minimum_pressure?

Solution: 9


# Task 9: Out of all available records, which day has the highest difference between maximum_pressure and minimum_pressure?

# Requires the Day_New column created in Task 7.
# The column names below end with a trailing space; adjust them if your header differs.
fl['pressure_diff'] = fl['Maximum pressure '] - fl['Minimum pressure ']

max_press_indx = fl['pressure_diff'].idxmax()
max_press_date = fl.loc[max_press_indx, 'Day_New']
ans = max_press_date.strftime('%Y-%m-%d')
print(ans)
q9 = ans
 

Q10: How many days fall on the median (i.e. are equal to the median value) of the barometer reading?

Solution: 10


# Task 10: How many days fall on the median (i.e. are equal to the median value) of the barometer reading?

medn = fl["Average barometer (in)"].median()
# Count the rows whose reading equals the median exactly.
ans = int((fl["Average barometer (in)"] == medn).sum())
print(ans)
q10 = ans
 

Q11: Out of all the available records, how many days are within one standard deviation of average temperature?

Solution: 11


# Task 11: Out of all the available records, how many days are within one standard deviation of average temperature?

# Refer to the column by name throughout (the original mixed the name with fl.iloc[:,1]).
temp_col = fl["Average temperature (°F)"]
avg_temp_std = round(temp_col.std(), 2)
avg_temp_mean = round(temp_col.mean(), 2)

# Keep days whose temperature lies in [mean - std, mean + std].
num_days_std = len(fl[(temp_col >= avg_temp_mean - avg_temp_std) & (temp_col <= avg_temp_mean + avg_temp_std)])

print('num_days_std =',num_days_std)
q11 = num_days_std
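
The "within one standard deviation" test can also be expressed with `Series.between`. The sketch below uses a made-up series and, unlike the solution above, does not round the mean and standard deviation before comparing; that difference could matter for values sitting right on a boundary, and the exercise does not say which behaviour the grader expects.

```python
import pandas as pd

# Hypothetical temperatures, purely for illustration (not taken from data.csv).
temps = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0])

mean = temps.mean()
std = temps.std()  # sample standard deviation, ddof=1 (pandas default)

# True where mean - std <= value <= mean + std (both ends included).
within = temps.between(mean - std, mean + std)
num_days = int(within.sum())

print(num_days)
```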

 

Q5 - Q6: Which month has the highest median for maximum_gust_speed out of all the available records?

Solution: 5-6


### Which month has the highest median for maximum_gust_speed out of all the available records?
# Also find the respective value - hint: group by month

# Try to write the code for this problem yourself; if you face any issue, feel free to ask in the comment box.

# If your solution is correct, you will get the following values:

# q5 = 34.50
# q6 = 2
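
The "group by month" hint can be sketched as follows. This is not the author's withheld solution: it runs on made-up rows, and it assumes that the gust column is called `Maximum gust speed (mph)` (the name used in the reader comments) and that `Day` is stored as day/month/year. As in the expected values above, q5 holds the median and q6 the month.

```python
import pandas as pd

# Hypothetical rows spanning two years (not taken from data.csv).
toy = pd.DataFrame({
    "Day": ["01/01/2010", "15/01/2010", "03/02/2010", "20/02/2010", "10/02/2011"],
    "Maximum gust speed (mph)": [20.0, 30.0, 40.0, 50.0, 10.0],
})

# Calendar month (1-12) of each record; the same month in different years is pooled.
month = pd.to_datetime(toy["Day"], format="%d/%m/%Y").dt.month

monthly_median = toy.groupby(month)["Maximum gust speed (mph)"].median()

q5 = round(float(monthly_median.max()), 2)  # highest monthly median
q6 = int(monthly_median.idxmax())           # month in which it occurs

print(q5, q6)
```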

 

About the author

D Shwari
I'm a professor at National University's Department of Computer Science. My main areas are data science and data analysis, along with project management for many computer science-related sectors. My next project is on AI with deep learning.

7 comments

  1. Anonymous
    Please correct the code for question 2, and also write the code for questions 5 and 6.
    1. Anonymous
      I also faced the same issue; the problem is with the round function. Try this:
      a = fl["Average temperature (°F)"].quantile(0.75)
      b = fl["Average temperature (°F)"].quantile(0.50)
      q2 = round(a-b, 2)
    2. Anonymous
      Thank you. Could you please help with Q5 & Q6?
    3. Anonymous
      import pandas as pd

      fl = pd.read_csv("data.csv")

      monthly_median = fl.groupby(fl['Day'])['Maximum windspeed (mph)'].median()

      highest_median_month = monthly_median.idxmax()
      highest_median_value = monthly_median.max()

      print("q6 =", q6)
      print("q5 =", q5)

      with open('q5.pickle', 'wb') as f:
          pickle.dump(q5, f)

      with open('q6.pickle', 'wb') as f:
          pickle.dump(q6, f)

      This is my code; there is still an error.
  2. Anonymous
    import pandas as pd
    import pickle

    data = pd.read_csv("data.csv")
    data['Day'] = pd.to_datetime(data['Day'], dayfirst=True)
    data['Month'] = data['Day'].dt.month
    monthly_median = data.groupby('Month')['Maximum gust speed (mph)'].median()
    highest_median_month = monthly_median.idxmax()
    highest_median_value = monthly_median.max()

    q5 = highest_median_value
    q6 = highest_median_month

    print("q5 =", q5)
    print("q6 =", q6)

    with open('q5.pickle', 'wb') as f:
        pickle.dump(q5, f)

    with open('q6.pickle', 'wb') as f:
        pickle.dump(q6, f)
    1. Anonymous
      You would need to download data.csv, check the date format (it needs to be English (UK), i.e. day/month/year), change it if necessary, upload the file again, and then run the code.
    2. Anonymous
      Yup! This solution works perfectly.
      I think the last four lines are not required, so we can skip them:

      with open('q5.pickle', 'wb') as f:
          pickle.dump(q5, f)

      with open('q6.pickle', 'wb') as f:
          pickle.dump(q6, f)