LAB 1: Analogy Completion
You will perform the following operations:
- Load the pretrained vectors from the text file
- Write a function to compute the cosine similarity between two word vectors
- Write a function to solve analogy problems such as King : Queen :: Men : __?__
Task 1: Define a function get_word_vectors()
You will perform the following operations:
- A text file containing the trained word vectors is provided for you as word2vec.txt in the same working directory.
- Each line in the file is a space-separated sequence of values, where the first value is the word and the remaining values are its vector representation.
- Define a function get_word_vectors() with parameters: file_name
- returns: word_to_vec: dictionary with the word as key and the corresponding word vector as value, a 1-d array with elements of type float32.
# Task 1:
import numpy as np

def get_word_vectors(file_name):
    word_to_vec = {}
    with open(file_name, 'r') as file:
        for line in file:
            values = line.split()
            word = values[0]
            vector = np.array(values[1:], dtype=np.float32)
            word_to_vec[word] = vector
    return word_to_vec
Using the function you defined above, read the word vectors from the file word2vec.txt and assign the result to the variable word_to_vec
Expected output (showing only first few values of vectors)
- Father: [ 0.095496 0.70418 -0.40777 -0.80844 1.256 0.77071 ...]
- mother: [ 0.4336 1.0727 -0.6196 -0.80679 1.2519 1.3767 ....]
# Run the cell
word_to_vec = get_word_vectors('word2vec.txt')
father = word_to_vec["father"]
mother = word_to_vec["mother"]
print("Father: ", father)
print("mother: ", mother)
Task 2: Define a function named cosine_similarity().
Determine the cosine similarity between two word vectors.
- The formula for cosine similarity is score = (U . V) / (||U|| * ||V||), where ||U|| and ||V|| are the norms of the individual vectors, i.e. the square roots of the sums of the squared elements (a small worked example follows this list).
- parameters: u, v are the word vectors whose similarity has to be determined
- returns: score - cosine similarity of u and v
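For example, with U = [1, 0] and V = [1, 1]: U . V = 1, ||U|| = 1, ||V|| = √2, so score = 1 / √2 ≈ 0.7071.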
# Task 2:
def cosine_similarity(u, v):
    ###Start code here
    # Calculate the dot product of the vectors
    dot_product = np.dot(u, v)
    # Calculate the norms (magnitudes) of the vectors
    norm_u = np.linalg.norm(u)
    norm_v = np.linalg.norm(v)
    # Calculate cosine similarity
    if norm_u == 0 or norm_v == 0:
        return 0.0  # Handle division by zero case
    score = dot_product / (norm_u * norm_v)
    return score
Run the below cell to find the similarity between the word vectors "paris" and "rome"
- Expected output: similarity score : 0.7099411
# Run the cell
paris = word_to_vec["paris"]
rome = word_to_vec["rome"]
print("similarity score :", cosine_similarity(paris, rome))
#similarity score : 0.7099411
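As a quick sanity check (an optional sketch, assuming word_to_vec has been loaded as in Task 1), the cosine similarity of a vector with itself should be 1.0 up to floating-point precision, while unrelated words typically score lower:
# Optional sanity check
print(cosine_similarity(paris, paris))   # expected to be ~1.0
print(cosine_similarity(paris, mother))  # typically lower than the paris/rome score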
Task 3: In the word analogy task, we complete the analogy.
In detail, we are trying to find a word d such that the associated word vectors 𝑢1, 𝑣1, 𝑢2, 𝑣2 are related in the following manner: 𝑢1 − 𝑣1 ≈ 𝑢2 − 𝑣2. We will measure the similarity between 𝑢1 − 𝑣1 and 𝑢2 − 𝑣2 using cosine similarity.
As an example, to find the best possible word for the analogy King : Queen :: Men : __?__ you will perform the following steps:
- Extract the word vectors of the three words king, queen and men
- Find the element-wise difference between the two word vectors king and queen as V1
- Find the element-wise difference between the word vector men and each word vector in the word_to_vec dictionary as V2 (while doing so, exclude the words of interest, i.e. king, queen and men)
- Find the cosine similarity between V1 and V2 and choose the word from the word_to_vec dictionary that has the maximum similarity between V1 and V2 (a quick illustration of the difference-vector idea follows this list).
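A minimal sketch of the difference-vector idea before writing the full search (an optional check, assuming "king", "queen", "men" and "women" are all present in word_to_vec):
# Quick, optional check of the idea behind the analogy task
V1 = word_to_vec["king"] - word_to_vec["queen"]
V2 = word_to_vec["men"] - word_to_vec["women"]
print(cosine_similarity(V1, V2))  # expected to be relatively high if the analogy holds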
Define the function named find_analogy()
- parameters:
  word_1 - string corresponding to word vector 𝑢1
  word_2 - string corresponding to word vector 𝑣1
  word_3 - string corresponding to word vector 𝑢2
  word_to_vec - dictionary of words and their corresponding vectors
- returns: best_word - the word such that 𝑢1 − 𝑣1 is closest to 𝑢2 − 𝑣_best_word, as measured by cosine similarity
# Task 3:
def find_analogy(word_1, word_2, word_3, word_to_vec):
    ####Start code here
    # Extract word vectors for the provided words
    if word_1 not in word_to_vec or word_2 not in word_to_vec or word_3 not in word_to_vec:
        return None  # Return None if any of the words are not found in the dictionary
    vector_1 = word_to_vec[word_1]  # Vector for word_1 (e.g., king)
    vector_2 = word_to_vec[word_2]  # Vector for word_2 (e.g., queen)
    vector_3 = word_to_vec[word_3]  # Vector for word_3 (e.g., men)
    # Calculate the difference vector V1 = vector_1 - vector_2
    V1 = vector_1 - vector_2
    best_similarity = -1  # Initialize best similarity score
    best_word = None      # Initialize the best word
    # Iterate through all words in the dictionary
    for word, vector in word_to_vec.items():
        if word in [word_1, word_2, word_3]:  # Exclude the words of interest
            continue
        # Calculate V2 = vector_3 - vector (where vector is the current word vector)
        V2 = vector_3 - vector
        # Calculate cosine similarity between V1 and V2
        similarity = cosine_similarity(V1, V2)
        # Update best_word if current similarity is greater than the best found so far
        if similarity > best_similarity:
            best_similarity = similarity
            best_word = word
    ###End code
    return best_word
Run the below code to check the function you defined above
- Expected output:
- father -> son :: mother -> daughter
- india -> delhi :: japan -> tokyo
# Run the cell:
print ('{} -> {} :: {} -> {}'.format('father', 'son', 'mother',find_analogy('father', 'son', 'mother', word_to_vec)))
print ('{} -> {} :: {} -> {}'.format('india', 'delhi', 'japan',find_analogy('india', 'delhi', 'japan', word_to_vec)))
# father -> son :: mother -> daughter
# india -> delhi :: japan -> tokyo
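If you want to experiment further (an optional sketch; the exact answer depends on the pretrained vectors, and the words below are assumed to be present in word_to_vec), you can try other analogies in the same way:
# Optional: try another analogy
print('{} -> {} :: {} -> {}'.format('king', 'queen', 'men', find_analogy('king', 'queen', 'men', word_to_vec)))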
LAB 2: Sentiment Classification Hands-on
#!/bin/python
# Import all the necessary packages in the cell below as and when you require them
from keras.datasets import imdb
from keras.datasets.imdb import get_word_index
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense
from keras.preprocessing import sequence
Downloading the dataset
- Keras has a built-in function to download the IMDB movie review dataset.
- Each word in a review is represented by its unique index, and the labels are in binary format representing positive or negative reviews.
- The necessary code to download the dataset has been written for you.
- The variable word_to_id is a dictionary containing words and their corresponding ids.
# Run the cell to download the dataset
vocab_size = 5000
(X_train, Y_train), (X_test, Y_test) = imdb.load_data(num_words=vocab_size)
word_to_id = get_word_index()
print("word to id fist five samples {}".format({key:value for key, value in zip(list(word_to_id.keys())[:5], list(word_to_id.values())[:5])}))
print("\n")
print("sample input\n", X_train[0])
print('\n')
print("target output", Y_train[0])
Task 1:
Each review in the dataset has some special tokens such as:
- START : to identify the start of the sentence
- UNK : used when a word is not found in the vocabulary
- PAD : the value used to fill the sequence if padding is required
# Task 1:
# Offset the word_to_id dictionary by three values such that 0, 1, 2 represent PAD, START, UNK respectively.
# Once you perform the above step, reverse the word_to_id dictionary to represent ids as keys and words as values.
# Assign the resulting dictionary to the id_to_word variable
word_to_id = {k: v + 3 for k, v in word_to_id.items()}  # Shift all word indices up by 3
word_to_id["PAD"] = 0    # pad_sequences fills with 0
word_to_id["START"] = 1  # load_data marks the start of each review with 1
word_to_id["UNK"] = 2    # out-of-vocabulary words are mapped to 2
id_to_word = {value: key for key, value in word_to_id.items()}  # Reverse dictionary: id -> word
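A quick way to sanity-check the mapping (an optional sketch, not part of the graded tasks) is to confirm that the special tokens now occupy the reserved ids:
# Optional check of the special-token ids
print(id_to_word[0], id_to_word[1], id_to_word[2])  # expected: PAD START UNK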
Task 2
# Task 2:
# Run the below code to view the first review in training samples
print(" ".join([id_to_word[i] for i in X_train[0]]))
Task 3: First 500 words
Since each movie reviews are of variable lengths in terms of number of words, so it is necessay to fix the review lenght to few words say upto first 500 words.
- For each of the samples of X_train and X_test sample upto first 500 words
- If reviews are less than 500 words pad the sequence with zeros in the beginning to make up the length upto 500
- Assign the padded sequence to X_train_pad and X_test_pad variables for train and test smaples respectively
# Task 3:
from keras.preprocessing import sequence
X_train_pad = sequence.pad_sequences(X_train, maxlen= 500)
X_test_pad = sequence.pad_sequences(X_test, maxlen= 500)
print(X_train_pad[0])
print("X_train_pad shape:", X_train_pad.shape)
print("Y_train shape:", Y_train.shape)
print("X_test_pad shape:", X_test_pad.shape)
print("Y_test shape:", Y_test.shape)
Task 4:
Use keras Sequential to build an LSTM model with the below specifications
- Add an embedding layer (the look-up table) such that the vocabulary size is 5000 and each word in the vocabulary is a 32-dimensional vector
- Add an LSTM layer with 100 hidden nodes
- Add a final sigmoid activation layer
- Use the adam optimizer, binary cross entropy loss, and accuracy as the metric
# Task 4:
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense

embedding_vector_length = 32
model = Sequential()
# Step 1: Add an embedding layer (vocabulary of 5000 words, 32-dimensional vectors)
model.add(Embedding(input_dim=vocab_size, output_dim=embedding_vector_length))
# Step 2: Add an LSTM layer with 100 hidden nodes
model.add(LSTM(100))
# Step 3: Add a final sigmoid activation layer
model.add(Dense(1, activation='sigmoid'))
# Step 4: Compile the model with the adam optimizer and binary cross entropy loss
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
print(model.summary())
Task 5:
Fit the model with X_train_pad and Y_train as the training data and X_test_pad, Y_test as the validation set
- set the number of epochs to 3
- set the batch size to 64
# Task 5:
# Start code here
# Fit the model
model.fit(
    X_train_pad,                           # Training data
    Y_train,                               # Training labels
    validation_data=(X_test_pad, Y_test),  # Validation data and labels
    epochs=3,                              # Number of epochs
    batch_size=64                          # Batch size
)
#End code
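Once training finishes, you may also want to report the final test accuracy explicitly (an optional sketch, not part of the graded tasks; it reuses the padded test data from Task 3):
# Optional: evaluate the trained model on the test set
loss, accuracy = model.evaluate(X_test_pad, Y_test, batch_size=64)
print("Test loss: {:.4f}, test accuracy: {:.4f}".format(loss, accuracy))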