Mini-Project for Data Scientist - Web Scraping Fresco Play HackerRank Hands-On Solution

Learn to build a simple web scraping project in Python, with hands-on use of libraries such as requests for fetching pages and Beautiful Soup (bs4) for parsing the HTML.

Welcome to Turing Machine Data Scientist Program: Use-case 1

Web Scraping Project

Solution: Web Scraping Project

Welcome to your first case study:
  • In this case study you have to scrape weather data from the website "http://www.estesparkweather.net/archive_reports.php?date=200901"
  • Scrape all the available attributes of weather data for each day from 2009-01-01 to 2018-10-28
  • Ignore records for missing days
  • Represent the scraped data as a pandas DataFrame object (a quick preview sketch follows this list).
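Before building the full scraper, a quick fetch-and-preview sketch (assuming only that requests and bs4 are installed) helps confirm the page layout; the date parameter 200901 means January 2009:

import requests
from bs4 import BeautifulSoup

# Fetch one archive page (date=YYYYMM) and peek at the first daily table
res = requests.get("http://www.estesparkweather.net/archive_reports.php?date=200901")
soup = BeautifulSoup(res.content, "html.parser")
tables = soup.find_all("table")
print(len(tables))                                 # roughly one table per day, plus summaries
print(tables[0].get_text(" ", strip=True)[:200])   # first 200 characters of the first table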

Task 1: Run the cell below to import the necessary packages

import bs4
from bs4 import BeautifulSoup
import csv
import requests
import time
import pandas as pd
import urllib
import re
import pickle
Instructions: Dataframe-specific details.
  • Perform the necessary data cleaning and typecast each attribute to a relevant data type (a minimal cleaning sketch follows the column list below)
  • Make sure the index column is in date-time format (yyyy-mm-dd)
  • Each record in the dataframe corresponds to the weather details of a given day
  • Expected column names (order does not matter):-
    ['Average temperature (°F)', 'Average humidity (%)', 'Average dewpoint (°F)', 'Average barometer (in)', 'Average windspeed (mph)', 'Average gustspeed (mph)', 'Average direction (°deg)', 'Rainfall for month (in)', 'Rainfall for year (in)', 'Maximum rain per minute', 'Maximum temperature (°F)', 'Minimum temperature (°F)', 'Maximum humidity (%)', 'Minimum humidity (%)', 'Maximum pressure', 'Minimum pressure', 'Maximum windspeed (mph)', 'Maximum gust speed (mph)', 'Maximum heat index (°F)']
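To see the cleaning step in isolation, here is a minimal sketch on two hypothetical rows (only one attribute column shown; errors='coerce' turns any unparsable string into NaN):

import pandas as pd

# Toy example: scraped values arrive as strings and must be typecast
raw = pd.DataFrame({'Date': ['2009-01-1', '2009-01-2'],
                    'Average temperature (°F)': ['37.8', '35.5']})
raw['Date'] = pd.to_datetime(raw['Date'])         # yyyy-mm-dd index
raw = raw.set_index('Date')
raw = raw.apply(pd.to_numeric, errors='coerce')   # non-numeric strings become NaN
print(raw.dtypes)                                 # float64 after casting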

Task 2: Scrape the website for the given date range.

# Start your Python web scraping code from here:
year = ["2009", "2010", "2011", "2012", "2013", "2014", "2015", "2016", "2017", "2018"]  # 2009-01-01 to 2018-10-28
month = ["01", "02", "03", "04", "05", "06", "07", "08", "09", "10", "11", "12"]
months = ['jan', 'feb', 'mar', 'apr', 'may', 'jun', 'jul', 'aug', 'sep', 'oct', 'nov', 'dec']

final_list = []   # one entry (list of values) per scraped day
temp_list = []    # values for the day currently being parsed
for yy in year:
    for mm in month:
        url = 'https://www.estesparkweather.net/archive_reports.php?date={}{}'.format(yy, mm)
        print(url)
        res = requests.get(url)
        soup = BeautifulSoup(res.content, "html.parser")

        # Each daily report on an archive page is rendered as its own <table>
        for trr in soup.find_all('table'):
            table_has_rows = False
            months_check = True
            for k in trr.find_all('tr'):
                table_has_rows = True
                text = k.get_text(separator=" ").strip()
                match = re.match(r"([a-zA-Z\s]+)(-?[\d\.]+)", text)  # split the label from its numeric value
                if match:
                    key, value = match.groups()
                    # Stop at the monthly summary table at the bottom of the page
                    if 'Average and Extremes for Month of' in key:
                        break
                    # The first row of a daily table reads e.g. "jan 1"; turn it into a date
                    if months_check:
                        if key.strip().lower() in months:
                            months_check = False
                            value = yy + "-" + mm + "-" + value  # yyyy-mm-dd
                    temp_list.append(value)

            if table_has_rows:
                if temp_list:
                    final_list.append(temp_list.copy())
                temp_list.clear()
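Note that the time module is imported in Task 1 but never used above. A hedged variant of the fetch step with a polite delay and basic error handling might look like this (the retry count, timeout, and one-second pause are illustrative choices, not part of the graded solution):

# Sketch: wrap the page fetch in simple retry logic with a pause between attempts
def fetch_page(url, retries=3, delay=1):
    for attempt in range(retries):
        try:
            res = requests.get(url, timeout=10)
            res.raise_for_status()              # raise on 4xx/5xx responses
            return BeautifulSoup(res.content, "html.parser")
        except requests.RequestException as exc:
            print(f"Attempt {attempt + 1} failed for {url}: {exc}")
            time.sleep(delay)                   # be polite between retries
    return None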

Task 3: Create a dataframe and append the scraped data to it.


import pandas as pd
col =['Date', 'Average temperature (°F)', 'Average humidity (%)', 'Average dewpoint (°F)', 'Average barometer (in)', 'Average windspeed (mph)', 'Average gustspeed (mph)', 'Average direction (°deg)', 'Rainfall for month (in)', 'Rainfall for year (in)', 'Maximum rain per minute', 'Maximum temperature (°F)', 'Minimum temperature (°F)', 'Maximum humidity (%)', 'Minimum humidity (%)', 'Maximum pressure', 'Minimum pressure', 'Maximum windspeed (mph)', 'Maximum gust speed (mph)', 'Maximum heat index (°F)']
df = pd.DataFrame(final_list[:-1], columns=col)  # [:-1] excludes the final, trailing scraped record
print(df.shape)
df.head(33)  # last expression in the cell, so the preview is displayed

Task 4: Drop the null values from the dataframe.


print(df.isnull().sum().sum())    # total missing values before dropping
df.dropna(axis=0, inplace=True)   # drop every row that has a missing attribute
print(df.isnull().sum().sum())    # should now be 0
print(df.shape)

Task 5: Set the 'Date' column as the dataframe index.


# Convert 'Date' column to datetime format
df2 = df.copy()
df2['Date'] = pd.to_datetime(df2['Date'])

# Set 'Date' as index
df2.set_index('Date', inplace=True)

print(df2.shape)
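The .loc date-range slicing used in the next task assumes a monotonic DatetimeIndex; the records are scraped in chronological order, but sorting explicitly is a cheap safeguard:

# Optional safeguard: guarantee a sorted index before date-range slicing
df2 = df2.sort_index()
print(df2.index.is_monotonic_increasing)   # should print True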

Task 6: Filter the dataframe to the expected date range.


start_date = '2009-01-01'
end_date = '2018-10-28'

# Filter the DataFrame for the specified date range
filtered_df = df2.loc[start_date:end_date]
filtered_df = filtered_df.astype(float)
print(filtered_df.shape)

Task 7: Save the dataframe in a pickle file.


# Once you are done with your scraping, save your dataframe as a pickle file named 'dataframe.pk'
import pickle
with open("dataframe.pk", "wb") as file:
    pickle.dump(filtered_df, file)
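To verify the file round-trips correctly, it can be read straight back with pandas:

# Sanity check: load the pickle and compare with the original dataframe
restored = pd.read_pickle("dataframe.pk")
print(restored.equals(filtered_df))   # expected: True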

Task 8 (Optional A): Save the dataframe in CSV and Excel formats.


filtered_df.to_excel('scrape.xlsx')
filtered_df.to_csv('scrape.csv')
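Note that to_excel needs an Excel engine such as openpyxl installed. Reading the CSV back also requires rebuilding the datetime index explicitly:

# Reading the CSV back: re-parse the 'Date' column as the index
check = pd.read_csv('scrape.csv', index_col='Date', parse_dates=True)
print(check.shape)   # should match filtered_df.shape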

Task 9 (Optional B): Run the cell below to validate whether your dataframe has the expected values.


import numpy as np

mean = round(np.mean(filtered_df["2011-08-1":"2011-08-20"]["Average windspeed (mph)"]), 2)
assert str(mean) == '4.64', 'Wrong Value'
print('mean:', mean)

max_val = round(np.max(filtered_df["2011-04-20":"2012-01-1"]["Maximum temperature (°F)"]), 2)
assert str(max_val) == '89.7', f'Test Failed: {max_val}'
print('max:', max_val)

std2 = round(np.std(filtered_df["2011-04-20":"2012-01-1"]["Maximum pressure"]), 2)
assert str(std2) == '0.27', f'Test Failed: {std2}'
print('std2:', std2)

About the author

D Shwari
I'm a professor in the Department of Computer Science at National University. My main streams are data science and data analysis, along with project management for several computer-science-related sectors. My next project is on AI with deep learning.
