Mini-Project Digital NLP Fresco Play Hands-on Solution Hacker Rank

Explore NLTK for text processing, sentiment analysis, and the KNN algorithm to enhance NLP tasks, from tokenization to classification.

It's recommended to review the related article before proceeding with the hands-on solutions below for a better understanding.

LAB 1: Welcome to MLT - Sprint 5 - Case Study 1 - NLP Text Processing

Solution: Case Study 1- Text Preprocessing

Question 1: NLP - Python - Processing Raw Text

Instruction: Define a function called `processRawText`, which takes one parameter. The parameter `textURL` is a URL link. Perform the following tasks.
  • Read the text content from the given link `textURL`. Store the content in the variable `textcontent`.
  • Tokenize all the words in `textcontent` and convert them into lower case. Store the tokenized list of words in `tokenizedlcwords`. (Hint: use `word_tokenize`.)
  • Find the number of words in `tokenizedlcwords`, and store the result in `noofwords`.
  • Find the number of unique words in `tokenizedlcwords`, and store the result in `noofunqwords`.
  • Calculate the word coverage of `tokenizedlcwords` from the number of words and the number of unique words. Store the result in `wordcov`.
  • Determine the frequency distribution of all words having only alphabets in `tokenizedlcwords`. Store the result in the variable `wordfreq`.
  • Find the most frequent word in `tokenizedlcwords`. Store the result in the variable `maxfreq`.
  • Return the `noofwords`, `noofunqwords`, `wordcov`, and `maxfreq` variables from the function.

Sample Output
210
127
1
The


#!/bin/python3
 
import math
import os
import random
import re
import sys
import zipfile
os.environ['NLTK_DATA'] = os.getcwd() + "/nltk_data"
import nltk
 
#
# Complete the 'processRawText' function below.
#
# The function accepts STRING textURL as parameter.
#
 
def processRawText(textURL):
    import requests

    # Read the text content from the given URL
    textcontent = requests.get(textURL).text

    # Tokenize into words and convert to lower case
    tokenizedlcwords = [word.lower() for word in nltk.tokenize.word_tokenize(textcontent)]

    # Total number of words
    noofwords = len(tokenizedlcwords)

    # Number of unique words
    noofunqwords = len(set(tokenizedlcwords))

    # Word coverage: average number of occurrences per unique word
    wordcov = math.floor(noofwords / noofunqwords)

    # Frequency distribution of purely alphabetic words
    wordfreq = nltk.FreqDist(word for word in tokenizedlcwords if word.isalpha())

    # Most frequent word
    maxfreq = wordfreq.max()

    return noofwords, noofunqwords, wordcov, maxfreq
 
if __name__ == '__main__':
    textURL = input()
 
    if not os.path.exists(os.getcwd() + "/nltk_data"):
        with zipfile.ZipFile("nltk_data.zip", 'r') as zip_ref:
            zip_ref.extractall(os.getcwd())
 
    noofwords, noofunqwords, wordcov, maxfreq = processRawText(textURL)
    print(noofwords)
    print(noofunqwords)
    print(wordcov)
    print(maxfreq)
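
For intuition, here is a minimal, self-contained sketch of the same steps run on a hardcoded string instead of a URL (the sentence is made up for illustration):

import math
import nltk

nltk.download('punkt', quiet=True)  # tokenizer models, if not already present

# Hypothetical stand-in for the downloaded text
textcontent = "The cat sat on the mat. The mat was flat."

tokens = [w.lower() for w in nltk.tokenize.word_tokenize(textcontent)]
noofwords = len(tokens)                         # 12 (punctuation counts as tokens)
noofunqwords = len(set(tokens))                 # 8 unique tokens
wordcov = math.floor(noofwords / noofunqwords)  # 1

# Frequency distribution restricted to alphabetic tokens
wordfreq = nltk.FreqDist(w for w in tokens if w.isalpha())
print(noofwords, noofunqwords, wordcov, wordfreq.max())  # 12 8 1 the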

LAB 2: Case Study 2 - NLP - Text Representation

Fresco Course ID: 2632

Solution: Run the cell below to install and import the needed libraries.

!pip install nltk
import pandas as pd
import nltk
import sklearn
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

Read the CSV file dataset.csv:

df = pd.read_csv('dataset.csv')

Task 1: Use CountVectorizer to find the vocabulary for the given data set and store it in the variable S1. Note: The output must be a dataframe and its column name should be 'order'.

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(df.iloc[:, 0])  # Fit the vectorizer to the text data
# Note: on scikit-learn >= 1.2, use get_feature_names_out() instead of get_feature_names()
S1 = pd.DataFrame({'order': list(vectorizer.get_feature_names())})
#S1 = pd.DataFrame(sorted(vectorizer.vocabulary_.keys()), columns=['order'])  # Alternative way

Task 2: Find the Bag of Words for the given data set and store it in the variable S2. Note: The output must be a dataframe and its column names should be the feature names (get_feature_names).

# Find Bag of Words representation and store in S2
S2 = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names())
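
To see what CountVectorizer actually builds, here is a minimal sketch on a made-up two-sentence corpus (not dataset.csv); it uses get_feature_names_out(), the modern name for get_feature_names():

from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

corpus = ["the cat sat", "the cat and the dog"]  # hypothetical corpus
cv = CountVectorizer()
counts = cv.fit_transform(corpus)

# vocabulary_ maps each term to its (alphabetically assigned) column index
print(sorted(cv.vocabulary_.items()))  # [('and', 0), ('cat', 1), ('dog', 2), ('sat', 3), ('the', 4)]

# Bag of Words: one row per document, one column per term
print(pd.DataFrame(counts.toarray(), columns=cv.get_feature_names_out()))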

Task 3: Find the Term Frequency (TF) with norm 'l1' and use_idf disabled for the given dataset and store it in the variable S3. Note: The output must be a dataframe and its column names should be the feature names (get_feature_names).

# Initialize TfidfVectorizer with L1 normalization and use_idf=False
vectorizer = TfidfVectorizer(norm='l1', use_idf=False)
# Fit and transform the text data
X = vectorizer.fit_transform(df['reviews'])
# Convert to DataFrame
S3 = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names())
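
As a sanity check on what norm='l1' with use_idf=False produces: each row is just the term counts divided by the row total, so every row of S3 sums to 1. A minimal sketch on a toy corpus:

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["the cat sat", "the cat and the dog"]  # hypothetical corpus
tf_l1 = TfidfVectorizer(norm='l1', use_idf=False).fit_transform(corpus)

# With l1 norm and no idf, each row is raw counts / row total
print(tf_l1.toarray().sum(axis=1))  # -> [1. 1.]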

Task 4: Find the Term Frequency (TF) with norm 'l2' and use_idf disabled for the given dataset and store it in the variable S4. Note: The output must be a dataframe and its column names should be the feature names (get_feature_names).

#write your code below
vectorizer = TfidfVectorizer(norm='l2', use_idf=False)
# Fit and transform the text data
X = vectorizer.fit_transform(df['reviews'])
# Convert to DataFrame
S4 = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names())
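
Likewise, norm='l2' rescales each row to unit Euclidean length (this is also TfidfVectorizer's default norm). A quick check under the same toy-corpus assumption:

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["the cat sat", "the cat and the dog"]  # hypothetical corpus
tf_l2 = TfidfVectorizer(norm='l2', use_idf=False).fit_transform(corpus)

# Each row now has Euclidean (l2) norm 1
print(np.linalg.norm(tf_l2.toarray(), axis=1))  # -> [1. 1.]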

Task 5: Find the TF*IDF (TFIDF) value for the given dataset and store it in the variable S5. Note: The output must be a dataframe and its column names should be the feature names (get_feature_names).

#write your code below
vectorizer = TfidfVectorizer()
# Fit and transform the text data
X = vectorizer.fit_transform(df['reviews'])
# Convert to DataFrame
S5 = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names())
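
Under the hood, the default TfidfVectorizer multiplies raw counts by smoothed IDF weights and then l2-normalizes each row. A short sketch reproducing one row by hand on a toy corpus:

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = ["the cat sat", "the cat and the dog"]  # hypothetical corpus
counts = CountVectorizer().fit_transform(corpus).toarray()

tfidf_vec = TfidfVectorizer()  # defaults: norm='l2', smooth_idf=True
tfidf = tfidf_vec.fit_transform(corpus).toarray()

# Reproduce the first row by hand: counts * idf, then l2-normalize
row = counts[0] * tfidf_vec.idf_
print(np.allclose(row / np.linalg.norm(row), tfidf[0]))  # -> True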

Task 6: Find the Inverse Document Frequency (IDF) value with smooth_idf set to false for the given dataset and store it in the variable S6.

Note: The output must be a dataframe, its index should be the feature names (get_feature_names), and its column name should be 'values'.

#write your code below
vectorizer = TfidfVectorizer(smooth_idf=False)
# Fit and transform the text data
X = vectorizer.fit_transform(df['reviews'])
# Get the IDF values (these are part of the model)
idf_values = vectorizer.idf_
# Convert to DataFrame
S6 = pd.DataFrame(idf_values, index=vectorizer.get_feature_names(), columns=['values'])
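
With smooth_idf=False, scikit-learn computes idf(t) = ln(n / df(t)) + 1, where n is the number of documents and df(t) is how many documents contain term t. A quick verification on a toy corpus:

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["the cat sat", "the cat and the dog"]  # hypothetical corpus
vec = TfidfVectorizer(smooth_idf=False).fit(corpus)

n = len(corpus)
df = np.array([1, 2, 1, 1, 2])  # document frequencies for ['and', 'cat', 'dog', 'sat', 'the']
print(np.allclose(vec.idf_, np.log(n / df) + 1))  # -> True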

LAB 3: Welcome to MLT - Sprint 5 - Case Study 3 - NLP Sentiment Analysis

Case Study 3 - Sentiment Analysis

Fresco Course ID: 2632

Instruction: The data set required for this task is given in the file named 'SA_dataset.csv'.
Read each question, then implement the solution and assign the answer to the respective variables given in the cells below.

Import packages and read dataset.

import pandas as pd
import nltk
import re
from nltk.sentiment import SentimentIntensityAnalyzer
nltk.download('stopwords')
nltk.download('vader_lexicon')
 
from nltk.corpus import stopwords
Use Case: Perform sentiment analysis of the given data set and store the result as 'sentiment.csv'; the column names should be 'content', 'sentiment', and 'value'.
  • Read the data from the CSV file.
  • Convert the content to lower case and remove numbers and special characters.
  • Remove the stop words from the content.
  • Create two columns, 'sentiment' and 'value'.
  • Predict the sentiment for the given data and store it in the respective columns.
  • Once you are done with the above steps, save the dataframe as 'sentiment.csv'.
  • Column 'content' has the text as given in the dataset.
  • Column 'sentiment' has the predictions Positive or Negative.
  • Column 'value' has numerical values 1 or 0. Store 1 if the content is positive, 0 if the content is negative.
#write your code below
 
df = pd.read_csv('SA_dataset.csv')

# Convert text to lowercase and remove numbers, special characters
df['content'] = df['content'].str.lower()
df['content'] = df['content'].apply(lambda x: re.sub(r'[^a-z\s]', '', x))

# Remove stopwords
stop_words = set(stopwords.words('english'))
df['content'] = df['content'].apply(lambda x: ' '.join([word for word in x.split() if word not in stop_words]))


# Perform sentiment analysis using VADER
sia = SentimentIntensityAnalyzer()
def analyze_sentiment(text):
    # Compound score is in [-1, 1]; scores >= 0 are treated as positive here
    score = sia.polarity_scores(text)['compound']
    return 'Positive' if score >= 0 else 'Negative', 1 if score >= 0 else 0
    
    
# Apply sentiment analysis
df[['sentiment', 'value']] = df['content'].apply(lambda x: pd.Series(analyze_sentiment(x)))


# Save results to sentiment.csv
df.to_csv('sentiment.csv', index=False)

# Display the first few rows
print(df.head())
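
To get a feel for VADER's compound score before applying it to a whole dataframe, here is a minimal standalone sketch on two made-up sentences (VADER is lexicon-based, so no training is needed; the >= 0 cutoff mirrors the solution above, though thresholds of ±0.05 are also common):

import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download('vader_lexicon', quiet=True)
sia = SentimentIntensityAnalyzer()

for text in ["great product love it", "terrible waste of money"]:
    score = sia.polarity_scores(text)['compound']  # in [-1, 1]
    print(text, '->', score, '->', 'Positive' if score >= 0 else 'Negative')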
 

About the author

D Shwari
I'm a professor in the Department of Computer Science at National University. My main streams are data science and data analysis, along with project management for many computer science-related sectors. My next project is on AI with deep learning.
