Welcome to the Turing Machine Data Scientist Program: Use-case 1
Web Scraping Project
Solution: Web Scraping Project
Welcome to your first case study:
- In this case study you have to scrape weather data from the website "http://www.estesparkweather.net/archive_reports.php?date=200901" (a quick preview sketch follows this list)
- Scrape all the available attributes of weather data for each day from 2009-01-01 to 2018-10-28
- Ignore records for missing days
- Represent the scraped data as a pandas DataFrame object.
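Before diving in, it can help to fetch a single archive page and eyeball the raw report text. A minimal preview sketch (the URL comes from the brief; everything else is purely illustrative):

import requests
from bs4 import BeautifulSoup

# Fetch one month's archive page and preview the first day's table text
res = requests.get("http://www.estesparkweather.net/archive_reports.php?date=200901")
soup = BeautifulSoup(res.content, "html.parser")
first_table = soup.find("table")
if first_table is not None:
    print(first_table.get_text(separator=" ").strip()[:200])  # first 200 characters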
Task 1: Run the cell below to import the necessary packages
import bs4
from bs4 import BeautifulSoup
import csv
import requests
import time
import pandas as pd
import urllib
import re
import pickle
Instructions: DataFrame-specific details.
- Perform the necessary data cleaning and type cast each attribute to a relevant data type (a small parsing sketch follows the column list)
- Make sure the index column is in date-time format (yyyy-mm-dd)
- Each record in the dataframe corresponds to the weather details of a given day
- Expected column names (order does not matter):
['Average temperature (°F)', 'Average humidity (%)', 'Average dewpoint (°F)', 'Average barometer (in)', 'Average windspeed (mph)', 'Average gustspeed (mph)', 'Average direction (°deg)', 'Rainfall for month (in)', 'Rainfall for year (in)', 'Maximum rain per minute', 'Maximum temperature (°F)', 'Minimum temperature (°F)', 'Maximum humidity (%)', 'Minimum humidity (%)', 'Maximum pressure', 'Minimum pressure', 'Maximum windspeed (mph)', 'Maximum gust speed (mph)', 'Maximum heat index (°F)']
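As a small illustration of the cleaning step, the sketch below splits one hypothetical report line into a label and a float value, using the same regex idea the scraper in Task 2 relies on; the sample string is an assumption about the page layout:

import re

# Hypothetical raw row text as it might appear on the archive page (assumption)
sample = "Average temperature 21.8°F"
# Split the leading label from the first numeric token
match = re.match(r"([a-zA-Z\s]+)(-?[\d\.]+)", sample)
if match:
    key, value = match.groups()
    print(key.strip(), float(value))  # -> Average temperature 21.8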
Task 2: Scrape the website for the given date range.
# Start your Python web scraping code from here:
year = ["2009", "2010", "2011", "2012", "2013", "2014", "2015", "2016", "2017", "2018"]  # 2009-01-01 to 2018-10-28
month = ["01", "02", "03", "04", "05", "06", "07", "08", "09", "10", "11", "12"]
months = ['jan', 'feb', 'mar', 'apr', 'may', 'jun', 'jul', 'aug', 'sep', 'oct', 'nov', 'dec']
final_list = []
temp_list = []

for yy in year:
    for mm in month:
        # Each archive page holds one month of daily reports
        url = 'https://www.estesparkweather.net/archive_reports.php?date={}{}'.format(yy, mm)
        print(url)
        res = requests.get(url)
        xml = BeautifulSoup(res.content, "html.parser")
        # Each <table> on the page corresponds to one day's report
        for trr in xml.find_all('table'):
            final_list_append = True
            months_check = True
            for k in trr.find_all('tr'):
                final_list_append = False
                text = k.get_text(separator=" ").strip()
                # Regex to split the label (key) from the first numeric token (value)
                match = re.match(r"([a-zA-Z\s]+)(-?[\d\.]+)", text)
                if match:
                    key, value = match.groups()
                    # Stop at the monthly summary block; it is not a daily record
                    if 'Average and Extremes for Month of' in key:
                        break
                    if months_check:
                        # The first row of a day's table is the date, e.g. "Jan 1"
                        if key.strip().lower() in months:
                            months_check = False
                            value = yy + "-" + mm + '-' + value  # yyyy-mm-dd
                    temp_list.append(value)
                    # print(f"{key.strip()}: {value.strip()}")
            # Only keep tables that actually produced rows
            if not final_list_append:
                if temp_list:
                    final_list.append(temp_list.copy())
                    temp_list.clear()
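Before building the DataFrame, a quick sanity check on the scraped records can save debugging time; the exact counts depend on the live site, so treat the output as illustrative:

# Sanity check: number of daily records and the shape of one record
print(len(final_list), "daily records scraped")
print(final_list[0])  # first record: date string followed by 19 attribute values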
Task 3: Create a dataframe and append the scraped data into the dataframe.
import pandas as pd
col =['Date', 'Average temperature (°F)', 'Average humidity (%)', 'Average dewpoint (°F)', 'Average barometer (in)', 'Average windspeed (mph)', 'Average gustspeed (mph)', 'Average direction (°deg)', 'Rainfall for month (in)', 'Rainfall for year (in)', 'Maximum rain per minute', 'Maximum temperature (°F)', 'Minimum temperature (°F)', 'Maximum humidity (%)', 'Minimum humidity (%)', 'Maximum pressure', 'Minimum pressure', 'Maximum windspeed (mph)', 'Maximum gust speed (mph)', 'Maximum heat index (°F)']
# Build the DataFrame from all but the last scraped record (the trailing record is assumed to be a partial entry)
df = pd.DataFrame(final_list[:-1], columns=col)
print(df.shape)
df.head(33)
Task 4: Drop the Null values from the dataframe.
print(df.isnull().sum().sum())
df.dropna(axis=0,inplace=True)
print(df.isnull().sum().sum())
print(df.shape)
Task 5: Set 'Date' column as dataframe index.
# Convert 'Date' column to datetime format
df2 = df.copy()
df2['Date'] = pd.to_datetime(df2['Date'])
# Set 'Date' as index
df2.set_index('Date', inplace=True)
print(df2.shape)
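One defensive step worth considering: the date-string slicing used in Task 6 requires a monotonic DatetimeIndex, so if the scrape ever produces out-of-order records, sort the index first (a no-op when the data is already chronological):

# Partial string slicing with .loc needs a sorted DatetimeIndex
df2 = df2.sort_index()
print(df2.index.is_monotonic_increasing)  # expect True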
Task 6: Filter the dataframe to the expected date range.
start_date = '2009-01-01'
end_date = '2018-10-28'
# Filter the DataFrame for the specified date range
filtered_df = df2.loc[start_date:end_date]
filtered_df = filtered_df.astype(float)
print(filtered_df.shape)
Task 7: Save the dataframe to a pickle file.
# Once you are done with your scraping, save your dataframe as a pickle file named 'dataframe.pk'
import pickle
with open("dataframe.pk", "wb") as file:
pickle.dump(filtered_df, file)
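To verify the save, load the pickle back and confirm the round trip (a small usage check):

# Reload the pickle and confirm nothing was lost in the round trip
with open("dataframe.pk", "rb") as file:
    restored = pickle.load(file)
print(restored.shape)
print(restored.equals(filtered_df))  # expect True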
Task 8 (Optional A): Save the dataframe in CSV and Excel formats.
filtered_df.to_excel('scrape.xlsx')
filtered_df.to_csv('scrape.csv')
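Note that to_excel requires an Excel engine such as openpyxl to be installed. When reading the CSV back, the saved index needs to be parsed as dates again (a small round-trip sketch):

# Round-trip the CSV; restore the datetime index on read
csv_df = pd.read_csv('scrape.csv', index_col=0, parse_dates=True)
print(csv_df.shape)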
Task 9 (Optional B): Run the cell below to validate whether your dataframe has the expected values.
import numpy as np
mean = round(np.mean(filtered_df.loc["2011-08-01":"2011-08-20", "Average windspeed (mph)"]), 2)
assert str(mean) == '4.64', 'Wrong Value'
print('mean: ', mean)
max_temp = round(np.max(filtered_df.loc["2011-04-20":"2012-01-01", "Maximum temperature (°F)"]), 2)  # avoid shadowing the built-in max
assert str(max_temp) == '89.7', f'Test Failed: {max_temp}'
print('max: ', max_temp)
std2 = round(np.std(filtered_df.loc["2011-04-20":"2012-01-01", "Maximum pressure"]), 2)
assert str(std2) == '0.27', f'Test Failed: {std2}'
print('std2: ', std2)