Mini-Project Digital NLP Fresco Play Hands-on Solution Hacker Rank

Explore NLTK for text processing, sentiment analysis, and the KNN algorithm to enhance NLP tasks, from tokenization to classification.

It's recommended to review the related article before proceeding with the hands-on solutions below for a better understanding.

LAB 1: Welcome to MLT - Sprint 5 - Case Study 1 - NLP Text Processing

Solution: Case Study 1- Text Preprocessing

Question 1: NLP - Python - Processing Raw Text

Instruction: Define a function called `processRawText`, which takes one parameter. The parameter `textURL` is a URL link. Perform the following tasks.
  • Read the text content from the given link `textURL`. Store the content in the variable `textcontent`.
  • Tokenize all the words in `textcontent` and convert them into lower case. Store the tokenized list of words in `tokenizedlcwords`. (Hint: use `word_tokenize`.)
  • Find the number of words in `tokenizedlcwords`, and store the result in `noofwords`.
  • Find the number of unique words in `tokenizedlcwords`, and store the result in `noofunqwords`.
  • Calculate the word coverage of `tokenizedlcwords` from the number of words and the number of unique words. Store the result in `wordcov`.
  • Determine the frequency distribution of all words having only alphabets in `tokenizedlcwords`. Store the result in the variable `wordfreq`.
  • Find the most frequent word in `tokenizedlcwords`. Store the result in the variable `maxfreq`.
  • Return the `noofwords`, `noofunqwords`, `wordcov`, and `maxfreq` variables from the function.

Sample Output
210
127
1
The


#!/bin/python3
 
import math
import os
import random
import re
import sys
import zipfile
os.environ['NLTK_DATA'] = os.getcwd() + "/nltk_data"
import nltk
 
#
# Complete the 'processRawText' function below.
#
# The function accepts STRING textURL as parameter.
#
 
def processRawText(textURL):
    import requests

    # Read the text content from the given URL
    textcontent = requests.get(textURL).text

    # Tokenize into words and convert to lower case
    tokenizedlcwords = [word.lower() for word in nltk.tokenize.word_tokenize(textcontent)]

    # Total number of words
    noofwords = len(tokenizedlcwords)

    # Number of unique words
    noofunqwords = len(set(tokenizedlcwords))

    # Word coverage: average number of occurrences per unique word
    wordcov = math.floor(noofwords / noofunqwords)

    # Frequency distribution of purely alphabetic words
    wordfreq = nltk.FreqDist(word for word in tokenizedlcwords if word.isalpha())

    # Most frequent word
    maxfreq = wordfreq.max()

    return noofwords, noofunqwords, wordcov, maxfreq
 
if __name__ == '__main__':
    textURL = input()
 
    if not os.path.exists(os.getcwd() + "/nltk_data"):
        with zipfile.ZipFile("nltk_data.zip", 'r') as zip_ref:
            zip_ref.extractall(os.getcwd())
 
    noofwords, noofunqwords, wordcov, maxfreq = processRawText(textURL)
    print(noofwords)
    print(noofunqwords)
    print(wordcov)
    print(maxfreq)
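
For intuition, here is a minimal, self-contained sketch of the same steps run on a hardcoded string instead of a URL (the sentence is made up for illustration):

import math
import nltk

nltk.download('punkt', quiet=True)  # tokenizer models, if not already present

# Hypothetical stand-in for the downloaded text
textcontent = "The cat sat on the mat. The mat was flat."

tokens = [w.lower() for w in nltk.tokenize.word_tokenize(textcontent)]
noofwords = len(tokens)                         # 12 (punctuation counts as tokens)
noofunqwords = len(set(tokens))                 # 8 unique tokens
wordcov = math.floor(noofwords / noofunqwords)  # 1

# Frequency distribution restricted to alphabetic tokens
wordfreq = nltk.FreqDist(w for w in tokens if w.isalpha())
print(noofwords, noofunqwords, wordcov, wordfreq.max())  # 12 8 1 the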

LAB 2: Case Study 2 - NLP - Text Representation

Fresco Course ID: 2632

Solution: Run the cell below to install and import the needed libraries.

!pip install nltk
import pandas as pd
import nltk
import sklearn
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

Read the CSV file dataset.csv:

df = pd.read_csv('dataset.csv')

Task 1: Use CountVectorizer to find the vocabulary for the given data set and store it in the variable S1. Note: The output must be a dataframe and its column name should be 'order'.

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(df.iloc[:, 0])  # Fit the vectorizer to the text data
# Note: on scikit-learn >= 1.2, use get_feature_names_out() instead of get_feature_names()
S1 = pd.DataFrame({'order': list(vectorizer.get_feature_names())})
#S1 = pd.DataFrame(sorted(vectorizer.vocabulary_.keys()), columns=['order'])  # Alternative way

Task 2: Find the Bag of Words for the given data set and store it in the variable S2. Note: The output must be a dataframe and its column names should be the feature names (get_feature_names).

# Find Bag of Words representation and store in S2
S2 = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names())
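
To see what CountVectorizer actually builds, here is a minimal sketch on a made-up two-sentence corpus (not dataset.csv); it uses get_feature_names_out(), the modern name for get_feature_names():

from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

corpus = ["the cat sat", "the cat and the dog"]  # hypothetical corpus
cv = CountVectorizer()
counts = cv.fit_transform(corpus)

# vocabulary_ maps each term to its (alphabetically assigned) column index
print(sorted(cv.vocabulary_.items()))  # [('and', 0), ('cat', 1), ('dog', 2), ('sat', 3), ('the', 4)]

# Bag of Words: one row per document, one column per term
print(pd.DataFrame(counts.toarray(), columns=cv.get_feature_names_out()))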

Task 3: Find the Term Frequency (TF) with norm 'l1' and use_idf disabled for the given dataset and store it in the variable S3. Note: The output must be a dataframe and its column names should be the feature names (get_feature_names).

# Initialize TfidfVectorizer with L1 normalization and use_idf=False
vectorizer = TfidfVectorizer(norm='l1', use_idf=False)
# Fit and transform the text data
X = vectorizer.fit_transform(df['reviews'])
# Convert to DataFrame
S3 = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names())
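
As a sanity check on what norm='l1' with use_idf=False produces: each row is just the term counts divided by the row total, so every row of S3 sums to 1. A minimal sketch on a toy corpus:

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["the cat sat", "the cat and the dog"]  # hypothetical corpus
tf_l1 = TfidfVectorizer(norm='l1', use_idf=False).fit_transform(corpus)

# With l1 norm and no idf, each row is raw counts / row total
print(tf_l1.toarray().sum(axis=1))  # -> [1. 1.]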

Task 4: Find the Term Frequency (TF) with norm 'l2' and use_idf disabled for the given dataset and store it in the variable S4. Note: The output must be a dataframe and its column names should be the feature names (get_feature_names).

#write your code below
vectorizer = TfidfVectorizer(norm='l2', use_idf=False)
# Fit and transform the text data
X = vectorizer.fit_transform(df['reviews'])
# Convert to DataFrame
S4 = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names())
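
Likewise, norm='l2' rescales each row to unit Euclidean length (this is also TfidfVectorizer's default norm). A quick check under the same toy-corpus assumption:

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["the cat sat", "the cat and the dog"]  # hypothetical corpus
tf_l2 = TfidfVectorizer(norm='l2', use_idf=False).fit_transform(corpus)

# Each row now has Euclidean (l2) norm 1
print(np.linalg.norm(tf_l2.toarray(), axis=1))  # -> [1. 1.]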

Task 5: Find the TF*IDF (TFIDF) value for the given dataset and store it in the variable S5. Note: The output must be a dataframe and its column names should be the feature names (get_feature_names).

#write your code below
vectorizer = TfidfVectorizer()
# Fit and transform the text data
X = vectorizer.fit_transform(df['reviews'])
# Convert to DataFrame
S5 = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names())
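
Under the hood, the default TfidfVectorizer multiplies raw counts by smoothed IDF weights and then l2-normalizes each row. A short sketch reproducing one row by hand on a toy corpus:

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = ["the cat sat", "the cat and the dog"]  # hypothetical corpus
counts = CountVectorizer().fit_transform(corpus).toarray()

tfidf_vec = TfidfVectorizer()  # defaults: norm='l2', smooth_idf=True
tfidf = tfidf_vec.fit_transform(corpus).toarray()

# Reproduce the first row by hand: counts * idf, then l2-normalize
row = counts[0] * tfidf_vec.idf_
print(np.allclose(row / np.linalg.norm(row), tfidf[0]))  # -> True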

Task 6: Find the Inverse Document Frequency (IDF) value with smooth_idf set to false for the given dataset and store it in the variable S6.

Note: The output must be a dataframe, its index should be the feature names (get_feature_names), and its column name should be 'values'.

#write your code below
vectorizer = TfidfVectorizer(smooth_idf=False)
# Fit and transform the text data
X = vectorizer.fit_transform(df['reviews'])
# Get the IDF values (these are part of the model)
idf_values = vectorizer.idf_
# Convert to DataFrame
S6 = pd.DataFrame(idf_values, index=vectorizer.get_feature_names(), columns=['values'])
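
With smooth_idf=False, scikit-learn computes idf(t) = ln(n / df(t)) + 1, where n is the number of documents and df(t) is how many documents contain term t. A quick verification on a toy corpus:

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["the cat sat", "the cat and the dog"]  # hypothetical corpus
vec = TfidfVectorizer(smooth_idf=False).fit(corpus)

n = len(corpus)
df = np.array([1, 2, 1, 1, 2])  # document frequencies for ['and', 'cat', 'dog', 'sat', 'the']
print(np.allclose(vec.idf_, np.log(n / df) + 1))  # -> True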

LAB 3: Welcome to MLT - Sprint 5 - Case Study 3 - NLP Sentiment Analysis

Case Study 3 - Sentiment Analysis

Fresco Course ID: 2632

Instruction: The data set required for this task is given in the file named 'SA_dataset.csv'.
Read each question, then implement the solution and assign the answer to the respective variables given in the cells below.

Import packages and read dataset.

import pandas as pd
import nltk
import re
from nltk.sentiment import SentimentIntensityAnalyzer
nltk.download('stopwords')
nltk.download('vader_lexicon')
 
from nltk.corpus import stopwords
Use Case: Perform sentiment analysis of the given data set and store the result as 'sentiment.csv'; the column names should be 'content', 'sentiment', and 'value'.
  • Read the data from the CSV file.
  • Convert the content to lower case and remove numbers and special characters.
  • Remove the stop words from the content.
  • Create two columns, 'sentiment' and 'value'.
  • Predict the sentiment for the given data and store it in the respective columns.
  • Once you are done with the above steps, save the dataframe as 'sentiment.csv'.
  • Column 'content' has the text as given in the dataset.
  • Column 'sentiment' has the predictions Positive or Negative.
  • Column 'value' has numerical values 1 or 0. Store 1 if the content is positive, 0 if the content is negative.
#write your code below
 
df = pd.read_csv('SA_dataset.csv')

# Convert text to lowercase and remove numbers, special characters
df['content'] = df['content'].str.lower()
df['content'] = df['content'].apply(lambda x: re.sub(r'[^a-z\s]', '', x))

# Remove stopwords
stop_words = set(stopwords.words('english'))
df['content'] = df['content'].apply(lambda x: ' '.join([word for word in x.split() if word not in stop_words]))


# Perform sentiment analysis using VADER
sia = SentimentIntensityAnalyzer()
def analyze_sentiment(text):
    # Compound score is in [-1, 1]; scores >= 0 are treated as positive here
    score = sia.polarity_scores(text)['compound']
    return 'Positive' if score >= 0 else 'Negative', 1 if score >= 0 else 0
    
    
# Apply sentiment analysis
df[['sentiment', 'value']] = df['content'].apply(lambda x: pd.Series(analyze_sentiment(x)))


# Save results to sentiment.csv
df.to_csv('sentiment.csv', index=False)

# Display the first few rows
print(df.head())
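
To get a feel for VADER's compound score before applying it to a whole dataframe, here is a minimal standalone sketch on two made-up sentences (VADER is lexicon-based, so no training is needed; the >= 0 cutoff mirrors the solution above, though thresholds of ±0.05 are also common):

import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download('vader_lexicon', quiet=True)
sia = SentimentIntensityAnalyzer()

for text in ["great product love it", "terrible waste of money"]:
    score = sia.polarity_scores(text)['compound']  # in [-1, 1]
    print(text, '->', score, '->', 'Positive' if score >= 0 else 'Negative')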
 

About the author

D Shwari
I'm a professor in the Department of Computer Science at National University. My main streams are data science and data analysis, along with project management for many computer science-related sectors. My next project is on AI with deep learning.
