LAB 1: Analogy Completion
You will perform the following operations:
- Load the pretrained vectors from the text file
- Write a function to compute the cosine similarity between two word vectors
- Write a function to solve analogy problems such as King : Queen :: Men : __?__
Task 1: Define a function get_word_vectors()
You will perform the following operations:
- A text file containing the trained word vectors is provided for you as word2vec.txt in the same working directory.
- Each line in the file is a space-separated sequence of values, where the first value is the word and the remaining values are its vector representation.
- Define a function get_word_vectors() with parameters: file_name
- returns: word_to_vec: dictionary with the word as key and the corresponding word vector as value, a 1-d array with elements of type float32.
# Task 1:
import numpy as np

def get_word_vectors(file_name):
    word_to_vec = {}
    with open(file_name, 'r') as file:
        for line in file:
            values = line.split()
            word = values[0]
            vector = np.array(values[1:], dtype=np.float32)
            word_to_vec[word] = vector
    return word_to_vec
Using the function you defined above, read the word vectors from the file word2vec.txt and assign the result to the variable word_to_vec
Expected output (showing only first few values of vectors)
- Father: [ 0.095496 0.70418 -0.40777 -0.80844 1.256 0.77071 ...]
- mother: [ 0.4336 1.0727 -0.6196 -0.80679 1.2519 1.3767 ....]
# Run the cell
word_to_vec = get_word_vectors('word2vec.txt')
father = word_to_vec["father"]
mother = word_to_vec["mother"]
print("Father: ", father)
print("mother: ", mother)
Task 2: Define a function named cosine_similarity().
Determine the cosine similarity between two word vectors.
- The formula for cosine similarity is score = (U . V) / (||U|| * ||V||), where ||U|| and ||V|| are the norms of the individual vectors, i.e. the square roots of the sums of the squared elements (a small worked example follows this list).
- parameters: u, v are the word vectors whose similarity has to be determined
- returns: score - cosine similarity of u and v
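For example, with U = [1, 0] and V = [1, 1]: U . V = 1, ||U|| = 1, ||V|| = √2, so score = 1 / √2 ≈ 0.7071.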
# Task 2:
def cosine_similarity(u, v):
    ###Start code here
    # Calculate the dot product of the vectors
    dot_product = np.dot(u, v)
    # Calculate the norms (magnitudes) of the vectors
    norm_u = np.linalg.norm(u)
    norm_v = np.linalg.norm(v)
    # Calculate cosine similarity
    if norm_u == 0 or norm_v == 0:
        return 0.0  # Handle division by zero case
    score = dot_product / (norm_u * norm_v)
    return score
Run the below cell to find the similarity between the word vectors "paris" and "rome"
- Expected output: similarity score : 0.7099411
# Run the cell
paris = word_to_vec["paris"]
rome = word_to_vec["rome"]
print("similarity score :", cosine_similarity(paris, rome))
#similarity score : 0.7099411
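As a quick sanity check (an optional sketch, assuming word_to_vec has been loaded as in Task 1), the cosine similarity of a vector with itself should be 1.0 up to floating-point precision, while unrelated words typically score lower:
# Optional sanity check
print(cosine_similarity(paris, paris))   # expected to be ~1.0
print(cosine_similarity(paris, mother))  # typically lower than the paris/rome score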
Task 3: In the word analogy task, we complete the analogy.
In detail, we are trying to find a word d such that the associated word vectors 𝑢1, 𝑣1, 𝑢2, 𝑣2 are related in the following manner: 𝑢1 − 𝑣1 ≈ 𝑢2 − 𝑣2. We will measure the similarity between 𝑢1 − 𝑣1 and 𝑢2 − 𝑣2 using cosine similarity.
As an example, to find the best possible word for the analogy King : Queen :: Men : __?__ you will perform the following steps:
- Extract the word vectors of the three words king, queen and men
- Find the element-wise difference between the two word vectors king and queen as V1
- Find the element-wise difference between the word vector men and each word vector in the word_to_vec dictionary as V2 (while doing so, exclude the words of interest, i.e. king, queen and men)
- Find the cosine similarity between V1 and V2 and choose the word from the word_to_vec dictionary that has the maximum similarity between V1 and V2 (a quick illustration of the difference-vector idea follows this list).
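A minimal sketch of the difference-vector idea before writing the full search (an optional check, assuming "king", "queen", "men" and "women" are all present in word_to_vec):
# Quick, optional check of the idea behind the analogy task
V1 = word_to_vec["king"] - word_to_vec["queen"]
V2 = word_to_vec["men"] - word_to_vec["women"]
print(cosine_similarity(V1, V2))  # expected to be relatively high if the analogy holds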
Define the function named find_analogy()
- parameters:
  word_1 - string corresponding to word vector 𝑢1
  word_2 - string corresponding to word vector 𝑣1
  word_3 - string corresponding to word vector 𝑢2
  word_to_vec - dictionary of words and their corresponding vectors
- returns: best_word - the word such that 𝑢1 − 𝑣1 is closest to 𝑢2 − 𝑣_best_word, as measured by cosine similarity
# Task 3:
def find_analogy(word_1, word_2, word_3, word_to_vec):
    ####Start code here
    # Extract word vectors for the provided words
    if word_1 not in word_to_vec or word_2 not in word_to_vec or word_3 not in word_to_vec:
        return None  # Return None if any of the words are not found in the dictionary
    vector_1 = word_to_vec[word_1]  # Vector for word_1 (e.g., king)
    vector_2 = word_to_vec[word_2]  # Vector for word_2 (e.g., queen)
    vector_3 = word_to_vec[word_3]  # Vector for word_3 (e.g., men)
    # Calculate the difference vector V1 = vector_1 - vector_2
    V1 = vector_1 - vector_2
    best_similarity = -1  # Initialize best similarity score
    best_word = None      # Initialize the best word
    # Iterate through all words in the dictionary
    for word, vector in word_to_vec.items():
        if word in [word_1, word_2, word_3]:  # Exclude the words of interest
            continue
        # Calculate V2 = vector_3 - vector (where vector is the current word vector)
        V2 = vector_3 - vector
        # Calculate cosine similarity between V1 and V2
        similarity = cosine_similarity(V1, V2)
        # Update best_word if current similarity is greater than the best found so far
        if similarity > best_similarity:
            best_similarity = similarity
            best_word = word
    ###End code
    return best_word
Run the below code to check the function you defined above
- Expected output:
- father -> son :: mother -> daughter
- india -> delhi :: japan -> tokyo
# Run the cell:
print ('{} -> {} :: {} -> {}'.format('father', 'son', 'mother',find_analogy('father', 'son', 'mother', word_to_vec)))
print ('{} -> {} :: {} -> {}'.format('india', 'delhi', 'japan',find_analogy('india', 'delhi', 'japan', word_to_vec)))
# father -> son :: mother -> daughter
# india -> delhi :: japan -> tokyo
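If you want to experiment further (an optional sketch; the exact answer depends on the pretrained vectors, and the words below are assumed to be present in word_to_vec), you can try other analogies in the same way:
# Optional: try another analogy
print('{} -> {} :: {} -> {}'.format('king', 'queen', 'men', find_analogy('king', 'queen', 'men', word_to_vec)))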
LAB 2: Sentiment Classification Hands-on
#!/bin/python
# Import all the necessary packages in the cell below as and when you require them
from keras.datasets import imdb
from keras.datasets.imdb import get_word_index
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense
from keras.preprocessing import sequence
Downloading the dataset
- Keras has a built-in function to download the IMDB movie review dataset.
- Each word in a review is represented by its unique index, and the labels are in binary format representing positive or negative reviews.
- The necessary code to download the dataset has been written for you.
- The variable word_to_id is a dictionary containing words and their corresponding ids.
# Run the cell to download the dataset
vocab_size = 5000
(X_train, Y_train), (X_test, Y_test) = imdb.load_data(num_words=vocab_size)
word_to_id = get_word_index()
print("word to id fist five samples {}".format({key:value for key, value in zip(list(word_to_id.keys())[:5], list(word_to_id.values())[:5])}))
print("\n")
print("sample input\n", X_train[0])
print('\n')
print("target output", Y_train[0])
Task 1:
Each review in the dataset has some special tokens such as:
- START : to identify the start of the sentence
- UNK : used when a word is not found in the vocabulary
- PAD : the value used to fill the sequence if padding is required
# Task 1:
# Offset the word_to_id dictionary by three values such that 0, 1, 2 represent PAD, START, UNK respectively.
# Once you perform the above step, reverse the word_to_id dictionary to represent ids as keys and words as values.
# Assign the resulting dictionary to the id_to_word variable
word_to_id = {k: v + 3 for k, v in word_to_id.items()}  # Shift all word indices up by 3
word_to_id["PAD"] = 0    # pad_sequences fills with 0
word_to_id["START"] = 1  # load_data marks the start of each review with 1
word_to_id["UNK"] = 2    # out-of-vocabulary words are mapped to 2
id_to_word = {value: key for key, value in word_to_id.items()}  # Reverse dictionary: id -> word
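A quick way to sanity-check the mapping (an optional sketch, not part of the graded tasks) is to confirm that the special tokens now occupy the reserved ids:
# Optional check of the special-token ids
print(id_to_word[0], id_to_word[1], id_to_word[2])  # expected: PAD START UNK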
Task 2
# Task 2:
# Run the below code to view the first review in training samples
print(" ".join([id_to_word[i] for i in X_train[0]]))
Task 3: First 500 words
Since each movie reviews are of variable lengths in terms of number of words, so it is necessay to fix the review lenght to few words say upto first 500 words.
- For each of the samples of X_train and X_test sample upto first 500 words
- If reviews are less than 500 words pad the sequence with zeros in the beginning to make up the length upto 500
- Assign the padded sequence to X_train_pad and X_test_pad variables for train and test smaples respectively
# Task 3:
from keras.preprocessing import sequence
X_train_pad = sequence.pad_sequences(X_train, maxlen= 500)
X_test_pad = sequence.pad_sequences(X_test, maxlen= 500)
print(X_train_pad[0])
print("X_train_pad shape:", X_train_pad.shape)
print("Y_train shape:", Y_train.shape)
print("X_test_pad shape:", X_test_pad.shape)
print("Y_test shape:", Y_test.shape)
Task 4:
Use keras Sequential to build an LSTM model with the below specifications
- Add an embedding layer (the look-up table) such that the vocabulary size is 5000 and each word in the vocabulary is a 32-dimensional vector
- Add an LSTM layer with 100 hidden nodes
- Add a final sigmoid activation layer
- Use the adam optimizer, binary cross entropy loss, and accuracy as the metric
# Task 4:
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense

embedding_vector_length = 32
model = Sequential()
# Step 1: Add an embedding layer (vocabulary of 5000 words, 32-dimensional vectors)
model.add(Embedding(input_dim=vocab_size, output_dim=embedding_vector_length))
# Step 2: Add an LSTM layer with 100 hidden nodes
model.add(LSTM(100))
# Step 3: Add a final sigmoid activation layer
model.add(Dense(1, activation='sigmoid'))
# Step 4: Compile the model with the adam optimizer and binary cross entropy loss
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
print(model.summary())
Task 5:
Fit the model with X_train_pad and Y_train as the training data and X_test_pad, Y_test as the validation set
- set the number of epochs to 3
- set the batch size to 64
# Task 5:
# Start code here
# Fit the model
model.fit(
    X_train_pad,                           # Training data
    Y_train,                               # Training labels
    validation_data=(X_test_pad, Y_test),  # Validation data and labels
    epochs=3,                              # Number of epochs
    batch_size=64                          # Batch size
)
#End code
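Once training finishes, you may also want to report the final test accuracy explicitly (an optional sketch, not part of the graded tasks; it reuses the padded test data from Task 3):
# Optional: evaluate the trained model on the test set
loss, accuracy = model.evaluate(X_test_pad, Y_test, batch_size=64)
print("Test loss: {:.4f}, test accuracy: {:.4f}".format(loss, accuracy))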