Mini-Project for Data Scientist - Web Scraping Fresco Play HackerRank Hands-On Solution

Learn to build a simple web scraping project in Python, with hands-on use of libraries such as requests for fetching pages and Beautiful Soup (bs4) for parsing the HTML.

Welcome to Turing Machine Data Scientist Program: Use-case 1

Web Scraping Project

Solution: Web Scraping Project

Welcome to your first case study:
  • In this case study you have to scrape weather data from the website "http://www.estesparkweather.net/archive_reports.php?date=200901"
  • Scrape all the available attributes of weather data for each day from 2009-01-01 to 2018-10-28
  • Ignore records for missing days
  • Represent the scraped data as a pandas DataFrame object (a quick preview sketch follows this list).
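Before building the full scraper, a quick fetch-and-preview sketch (assuming only that requests and bs4 are installed) helps confirm the page layout; the date parameter 200901 means January 2009:

import requests
from bs4 import BeautifulSoup

# Fetch one archive page (date=YYYYMM) and peek at the first daily table
res = requests.get("http://www.estesparkweather.net/archive_reports.php?date=200901")
soup = BeautifulSoup(res.content, "html.parser")
tables = soup.find_all("table")
print(len(tables))                                 # roughly one table per day, plus summaries
print(tables[0].get_text(" ", strip=True)[:200])   # first 200 characters of the first table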

Task 1: Run the cell below to import the necessary packages

import bs4
from bs4 import BeautifulSoup
import csv
import requests
import time
import pandas as pd
import urllib
import re
import pickle
Instructions: Dataframe-specific details.
  • Perform the necessary data cleaning and typecast each attribute to a relevant data type (a minimal cleaning sketch follows the column list below)
  • Make sure the index column is in date-time format (yyyy-mm-dd)
  • Each record in the dataframe corresponds to the weather details of a given day
  • Expected column names (order does not matter):-
    ['Average temperature (°F)', 'Average humidity (%)', 'Average dewpoint (°F)', 'Average barometer (in)', 'Average windspeed (mph)', 'Average gustspeed (mph)', 'Average direction (°deg)', 'Rainfall for month (in)', 'Rainfall for year (in)', 'Maximum rain per minute', 'Maximum temperature (°F)', 'Minimum temperature (°F)', 'Maximum humidity (%)', 'Minimum humidity (%)', 'Maximum pressure', 'Minimum pressure', 'Maximum windspeed (mph)', 'Maximum gust speed (mph)', 'Maximum heat index (°F)']
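To see the cleaning step in isolation, here is a minimal sketch on two hypothetical rows (only one attribute column shown; errors='coerce' turns any unparsable string into NaN):

import pandas as pd

# Toy example: scraped values arrive as strings and must be typecast
raw = pd.DataFrame({'Date': ['2009-01-1', '2009-01-2'],
                    'Average temperature (°F)': ['37.8', '35.5']})
raw['Date'] = pd.to_datetime(raw['Date'])         # yyyy-mm-dd index
raw = raw.set_index('Date')
raw = raw.apply(pd.to_numeric, errors='coerce')   # non-numeric strings become NaN
print(raw.dtypes)                                 # float64 after casting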

Task 2: Scrape the website for the given date range.

# Start your Python web scraping code from here:
year = ["2009", "2010", "2011", "2012", "2013", "2014", "2015", "2016", "2017", "2018"]  # 2009-01-01 to 2018-10-28
month = ["01", "02", "03", "04", "05", "06", "07", "08", "09", "10", "11", "12"]
months = ['jan', 'feb', 'mar', 'apr', 'may', 'jun', 'jul', 'aug', 'sep', 'oct', 'nov', 'dec']

final_list = []   # one entry (list of values) per scraped day
temp_list = []    # values for the day currently being parsed
for yy in year:
    for mm in month:
        url = 'https://www.estesparkweather.net/archive_reports.php?date={}{}'.format(yy, mm)
        print(url)
        res = requests.get(url)
        soup = BeautifulSoup(res.content, "html.parser")

        # Each daily report on an archive page is rendered as its own <table>
        for trr in soup.find_all('table'):
            table_has_rows = False
            months_check = True
            for k in trr.find_all('tr'):
                table_has_rows = True
                text = k.get_text(separator=" ").strip()
                match = re.match(r"([a-zA-Z\s]+)(-?[\d\.]+)", text)  # split the label from its numeric value
                if match:
                    key, value = match.groups()
                    # Stop at the monthly summary table at the bottom of the page
                    if 'Average and Extremes for Month of' in key:
                        break
                    # The first row of a daily table reads e.g. "jan 1"; turn it into a date
                    if months_check:
                        if key.strip().lower() in months:
                            months_check = False
                            value = yy + "-" + mm + "-" + value  # yyyy-mm-dd
                    temp_list.append(value)

            if table_has_rows:
                if temp_list:
                    final_list.append(temp_list.copy())
                temp_list.clear()
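Note that the time module is imported in Task 1 but never used above. A hedged variant of the fetch step with a polite delay and basic error handling might look like this (the retry count, timeout, and one-second pause are illustrative choices, not part of the graded solution):

# Sketch: wrap the page fetch in simple retry logic with a pause between attempts
def fetch_page(url, retries=3, delay=1):
    for attempt in range(retries):
        try:
            res = requests.get(url, timeout=10)
            res.raise_for_status()              # raise on 4xx/5xx responses
            return BeautifulSoup(res.content, "html.parser")
        except requests.RequestException as exc:
            print(f"Attempt {attempt + 1} failed for {url}: {exc}")
            time.sleep(delay)                   # be polite between retries
    return None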

Task 3: Create a dataframe and append the scraped data to it.


import pandas as pd
col =['Date', 'Average temperature (°F)', 'Average humidity (%)', 'Average dewpoint (°F)', 'Average barometer (in)', 'Average windspeed (mph)', 'Average gustspeed (mph)', 'Average direction (°deg)', 'Rainfall for month (in)', 'Rainfall for year (in)', 'Maximum rain per minute', 'Maximum temperature (°F)', 'Minimum temperature (°F)', 'Maximum humidity (%)', 'Minimum humidity (%)', 'Maximum pressure', 'Minimum pressure', 'Maximum windspeed (mph)', 'Maximum gust speed (mph)', 'Maximum heat index (°F)']
df = pd.DataFrame(final_list[:-1], columns=col)  # [:-1] excludes the final, trailing scraped record
print(df.shape)
df.head(33)  # last expression in the cell, so the preview is displayed

Task 4: Drop the null values from the dataframe.


print(df.isnull().sum().sum())    # total missing values before dropping
df.dropna(axis=0, inplace=True)   # drop every row that has a missing attribute
print(df.isnull().sum().sum())    # should now be 0
print(df.shape)

Task 5: Set the 'Date' column as the dataframe index.


# Convert 'Date' column to datetime format
df2 = df.copy()
df2['Date'] = pd.to_datetime(df2['Date'])

# Set 'Date' as index
df2.set_index('Date', inplace=True)

print(df2.shape)
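The .loc date-range slicing used in the next task assumes a monotonic DatetimeIndex; the records are scraped in chronological order, but sorting explicitly is a cheap safeguard:

# Optional safeguard: guarantee a sorted index before date-range slicing
df2 = df2.sort_index()
print(df2.index.is_monotonic_increasing)   # should print True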

Task 6: Filter the dataframe to the expected date range.


start_date = '2009-01-01'
end_date = '2018-10-28'

# Filter the DataFrame for the specified date range
filtered_df = df2.loc[start_date:end_date]
filtered_df = filtered_df.astype(float)
print(filtered_df.shape)

Task 7: Save the dataframe in a pickle file.


# Once you are done with your scraping, save your dataframe as a pickle file named 'dataframe.pk'
import pickle
with open("dataframe.pk", "wb") as file:
    pickle.dump(filtered_df, file)
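To verify the file round-trips correctly, it can be read straight back with pandas:

# Sanity check: load the pickle and compare with the original dataframe
restored = pd.read_pickle("dataframe.pk")
print(restored.equals(filtered_df))   # expected: True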

Task 8 (Optional A): Save the dataframe in CSV and Excel formats.


filtered_df.to_excel('scrape.xlsx')
filtered_df.to_csv('scrape.csv')
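Note that to_excel needs an Excel engine such as openpyxl installed. Reading the CSV back also requires rebuilding the datetime index explicitly:

# Reading the CSV back: re-parse the 'Date' column as the index
check = pd.read_csv('scrape.csv', index_col='Date', parse_dates=True)
print(check.shape)   # should match filtered_df.shape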

Task 9 (Optional B): Run the cell below to validate whether your dataframe has the expected values.


import numpy as np

mean = round(np.mean(filtered_df["2011-08-1":"2011-08-20"]["Average windspeed (mph)"]), 2)
assert str(mean) == '4.64', 'Wrong Value'
print('mean:', mean)

max_val = round(np.max(filtered_df["2011-04-20":"2012-01-1"]["Maximum temperature (°F)"]), 2)
assert str(max_val) == '89.7', f'Test Failed: {max_val}'
print('max:', max_val)

std2 = round(np.std(filtered_df["2011-04-20":"2012-01-1"]["Maximum pressure"]), 2)
assert str(std2) == '0.27', f'Test Failed: {std2}'
print('std2:', std2)

About the author

D Shwari
I'm a professor in the Department of Computer Science at National University. My main streams are data science and data analysis, along with project management for several computer-science-related sectors. My next project is on AI with deep learning.
