Commit c0774f4f authored by Ankush

DLNLP assignment addition

parent 8a7e507f
Instructions to run the code file for Assignment-2
1. Open the code.ipynb file.
2. Run the cells in the file one by one.
3. Run the cell “To access Pre-Trained word embeddings”, which mounts the pre-trained word embeddings from our Google Drive folder. To get access, please mail us your id and we will share the folder with you, so that all embeddings (glove.6B.300d.txt, glove.6B.200d.txt, wiki-news-300d-1M.vec, and GoogleNews-vectors-negative300.bin) and datasets (train.csv, gold_test.csv) are available to you.
4. Run the “Header files” section.
5. Run the “Pre-processing function” section.
6. Run the “Function for Neural Network” section.
7. Run the “Required for word embedding” section.
8. There are 3 embedding sections in the file: Glove, Word2vec and fasttext. All of them are inside the “Word Embeddings” section.
9. Any one of these embedding sections can be run.
10. Run the “Train Neural Network” section to get the train accuracy.
11. Run the “Test accuracy model” section to get the accuracy on the gold_test data.
12. In short: first run the header section, then the data-preprocessing section, then initialise the Neural Network, then run the required embedding (Glove, word2vec or fasttext), and then train the Neural Network. The train accuracy will be displayed; after that, run the test model.
13. To run the imbalanced-data handling techniques (such as Undersampling Technique-1), first run the functions section, then run the “Set Hyperparameter” section (you can change the hyperparameters here), and then run any of the following sections to use the corresponding technique. By default the GloVe 300d word embedding is used, i.e. embed==1; to change this, go to “Data Imbalanced Handling”, access the Functions section there, and change the initialisation of embed. If embed==2, the fasttext word embedding is used; any integer other than 1 and 2 selects the word2vec pre-trained embeddings.
Steps 1–6 are necessary before executing step 12. A rough sketch of how a pre-trained embedding matrix is typically built is given right below.
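For reference, this is roughly how a pre-trained embedding matrix is built from one of the GloVe files before being handed to the embedding layer. The file path and the use of the fitted Keras tokenizer are illustrative assumptions; the exact cell in code.ipynb may differ:

    import numpy as np

    EMBED_DIM = 300  # matches glove.6B.300d.txt

    # Load GloVe vectors into a dict: word -> vector
    embeddings_index = {}
    with open("glove.6B.300d.txt", encoding="utf-8") as f:
        for line in f:
            values = line.split()
            embeddings_index[values[0]] = np.asarray(values[1:], dtype="float32")

    # Build the matrix row by row from the fitted Keras tokenizer's vocabulary
    vocab_size = len(tokenizer.word_index) + 1
    embedding_matrix = np.zeros((vocab_size, EMBED_DIM))
    for word, i in tokenizer.word_index.items():
        vector = embeddings_index.get(word)
        if vector is not None:
            embedding_matrix[i] = vector

The embed flag described in step 13 simply selects which of the pre-trained embeddings (GloVe, fasttext or word2vec) is used to build this matrix.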
We have also attached a file containing the GUI (GUI.ipynb), from which the GUI can be run. code.ipynb contains the embeddings, whereas GUI.ipynb only provides the GUI: it takes text as input and outputs the predicted probabilities.
In this Python notebook, the last 2 cells should be run to generate the GUI. There is one textbox where the user inserts text; pressing the “Print” button then shows the predicted rating and the probabilities for each rating.
{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"[nltk_data] Downloading package stopwords to /home/rohit/nltk_data...\n",
"[nltk_data] Package stopwords is already up-to-date!\n",
"[nltk_data] Downloading package punkt to /home/rohit/nltk_data...\n",
"[nltk_data] Package punkt is already up-to-date!\n"
]
},
{
"data": {
"text/plain": [
"True"
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import tensorflow as tf\n",
"import pickle \n",
"import pandas as pd\n",
"import numpy as np\n",
"import string\n",
"import nltk\n",
"from nltk.corpus import stopwords\n",
"import tensorflow as tf\n",
"from tensorflow.keras.preprocessing.text import Tokenizer\n",
"from tensorflow.keras.preprocessing.text import one_hot\n",
"from tensorflow.keras.preprocessing.sequence import pad_sequences\n",
"from tensorflow.keras import Sequential\n",
"from tensorflow.keras.layers import Dense, Dropout, Activation, Flatten, SimpleRNN, LSTM, Bidirectional, GRU\n",
"from sklearn import datasets, model_selection, metrics\n",
"from keras.layers.embeddings import Embedding\n",
"from keras.initializers import Constant\n",
"from nltk.tokenize import word_tokenize\n",
"from sklearn.model_selection import train_test_split \n",
"nltk.download('stopwords')\n",
"stopword = stopwords.words('english') \n",
"nltk.download('punkt')"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"def encode_data(tokenizer, text, tokens, preprocessing_training_data = False):\n",
" # This function will be used to encode the reviews using a dictionary (created using corpus vocabulary) \n",
"\n",
" # Example of encoding :\"The food was fabulous but pricey\" has a vocabulary of 4 words, each one has to be mapped to an integer like: \n",
" # {'The':1,'food':2,'was':3 'fabulous':4 'but':5 'pricey':6} this vocabulary has to be created for the entire corpus and then be used to \n",
" # encode the words into integers \n",
"\n",
" # return encoded examples\n",
" if preprocessing_training_data:\n",
" tokenizer = Tokenizer(oov_token = '<oov>')\n",
" tokenizer.fit_on_texts(tokens)\n",
"\n",
" sequences = tokenizer.texts_to_sequences(text)\n",
" return sequences, tokenizer\n",
"\n",
"def convert_to_lower(text):\n",
" # return the reviews after convering then to lowercase\n",
" lower_text = text.lower()\n",
" return lower_text\n",
"\n",
"def perform_tokenization(text):\n",
" # return the reviews after performing tokenization\n",
" token=nltk.word_tokenize(text)\n",
" return token\n",
"\n",
"def remove_stopwords(text):\n",
" # return the reviews after removing the stopwords\n",
" stopword = [] # not any stopword\n",
" removing_stopwords=[word for word in text if word not in stopword]\n",
" return removing_stopwords\n",
"\n",
"def remove_punctuation(text):\n",
" # return the reviews after removing punctuations\n",
" removing_punctuation = [word for word in text if word.isalpha()]\n",
" return removing_punctuation\n",
"\n",
"def perform_padding(data, maxlen):\n",
" # return the reviews after padding the reviews to maximum length\n",
" padded_data = pad_sequences(data, maxlen=maxlen, padding='post')\n",
" return padded_data\n",
"\n",
"def preprocess_data(tokenizer, data, preprocessing_training_data=False, maxlen=None):\n",
" # make all the following function calls on your data\n",
" # EXAMPLE:->\n",
" '''\n",
" review = data[\"reviews\"]\n",
" review = convert_to_lower(review)\n",
" review = remove_punctuation(review)\n",
" review = remove_stopwords(review)\n",
" review = perform_tokenization(review)\n",
" review = encode_data(review)\n",
" review = perform_padding(review)\n",
" '''\n",
" # return processed data\n",
"\n",
" reviews = data[\"reviews\"]\n",
" list_of_reviews = list(reviews)\n",
" string_of_reviews = ' '.join(str(e) for e in list_of_reviews)\n",
"\n",
" lower_text = convert_to_lower(string_of_reviews)\n",
" tokens = perform_tokenization(lower_text)\n",
" tokens = remove_stopwords(tokens)\n",
" tokens = remove_punctuation(tokens)\n",
" encoded_data, tokenizer = encode_data(tokenizer, reviews, tokens, preprocessing_training_data)\n",
" reviews = perform_padding(encoded_data, maxlen)\n",
"\n",
" return pd.DataFrame(reviews), tokenizer"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"maxlen = 31 # verified "
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"! tar -xzf model.tar.gz\n",
"model = tf.keras.models.load_model(\"model\", compile = False)\n",
"\n",
"with open(r\"tokenizer.pkl\", \"rb\") as input_file:\n",
" tokenizer = pickle.load(input_file)"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[[7.46410131e-01 1.02791116e-01 1.10829026e-01 2.15716772e-02\n",
" 1.83980539e-02]\n",
" [9.36661454e-05 1.18578457e-04 2.48400168e-03 7.10299909e-02\n",
" 9.26273704e-01]\n",
" [1.44877762e-03 4.65300196e-04 2.53088167e-03 4.58934791e-02\n",
" 9.49661493e-01]]\n"
]
}
],
"source": [
"# Test input\n",
"Test = ['this is bad', 'wow nice!', 'this is a great product']\n",
"test_reviews = pd.DataFrame(Test, columns=['reviews'])\n",
"\n",
"# pre-process data\n",
"test_reviews, tokenizer = preprocess_data(tokenizer, test_reviews, maxlen=maxlen)\n",
"\n",
"# predict\n",
"test_predictions = model.predict(test_reviews)\n",
"\n",
"# show probabilities\n",
"print(test_predictions)"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"def predict_rating(text):\n",
" test_reviews = pd.DataFrame([text], columns=['reviews'])\n",
"\n",
"# test_reviews, _ = preprocess_data(tokenizer, test_reviews, maxlen=train_reviews.shape[1])\n",
" test_reviews, _ = preprocess_data(tokenizer, test_reviews, maxlen=31)\n",
"\n",
" test_predictions = model.predict(test_reviews)\n",
" test_ratings = np.argmax(np.array(test_predictions), axis=1) + 1\n",
" \n",
" import tabulate\n",
" \n",
" str = f\"\\nPredicted Rating: {test_ratings}\\n\\n\\n\"\n",
" str += tabulate.tabulate(test_predictions, headers=[\"Rating-1\", \"Rating-2\", \"Rating-3\", \"Rating-4\", \"Rating-5\"])\n",
" return str"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"import tkinter as tk\n",
" \n",
"# Top level window \n",
"frame = tk.Tk() \n",
"frame.title(\"Rate reviews\") \n",
"frame.geometry('600x600') \n",
"\n",
"# Function for getting Input from textbox and printing it at label widget \n",
"def printInput(): \n",
" inp = inputtxt.get(1.0, \"end-1c\") \n",
" output = predict_rating(inp)\n",
" lbl.config(text = output) \n",
"\n",
"# TextBox Creation \n",
"inputtxt = tk.Text(frame, \n",
" height = 10, \n",
" width = 40, \n",
" font=(\"Courier\", 18)) \n",
" \n",
"inputtxt.pack() \n",
" \n",
"# Button Creation \n",
"printButton = tk.Button(frame, \n",
" text = \"Print\", \n",
" command = printInput, \n",
" font=(\"monospace\", 14)) \n",
"printButton.pack() \n",
" \n",
"# Label Creation \n",
"lbl = tk.Label(frame, text = \"\") \n",
"lbl.pack() \n",
"frame.mainloop() "
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.5"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
Instructions to run the code file for Assignment-3
WITHOUT PRE-TRAINED EMBEDDING:
1. Open the code.ipynb file.
2. Run the following sections in the file one by one.
* Install python packages
* To access pre-trained embeddings and dataset (train and test)
* Header files
* Preprocessing function
* Define Models (without pre-trained embedding layer)
* Import Datasets
* Without pre-trained word embedding
3. If you want to run the imbalanced-data handling techniques, go to the “Data Imbalanced Handling” section and run the first cell, “Utility Functions”. Then you can run any of the techniques (Undersampling Technique-1, Undersampling Technique-2, and so on); a generic sketch of such an undersampling step is shown after this list.
4. Run any of the models (FFNN, LSTM, RNN, Bi-LSTM, GRU, Bi-GRU) by running the section “Train X Model” (replace X with the model you need, e.g. “Train LSTM Model”).
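As a point of reference, a common way to implement an undersampling technique like the ones above is to downsample every class to the size of the rarest class. This is a generic sketch, not necessarily the exact logic of Undersampling Technique-1 in code.ipynb, and the 'ratings' column name is an assumption:

    import pandas as pd

    def undersample(df, label_col="ratings", random_state=42):
        # Downsample every class to the size of the rarest class, then shuffle
        min_count = df[label_col].value_counts().min()
        parts = [group.sample(n=min_count, random_state=random_state)
                 for _, group in df.groupby(label_col)]
        return pd.concat(parts).sample(frac=1, random_state=random_state)

    train = pd.read_csv("train.csv")
    balanced_train = undersample(train)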
---------------------------------------------------------------------------------------------------------------------------
WITH PRE-TRAINED EMBEDDING:
1. Open the code.ipynb file.
2. Run the cells in the file one by one.
3. Run the “Install python packages” section.
4. Then run the cell “To access pre-trained embeddings and dataset (train and test)”, which mounts the pre-trained word embeddings from our Google Drive folder. To get access, please mail us your id and we will share the folder with you, so that all embeddings (glove.6B.300d.txt, glove.6B.200d.txt, wiki-news-300d-1M.vec, and GoogleNews-vectors-negative300.bin) and datasets (train.csv, gold_test.csv) are available to you. We use word2vec for this assignment as it gives the best results.
5. Run the “Header files” section.
6. Run the “Preprocessing function” section to convert our train dataset into embeddings.
7. We will run the “Define Models (with pretrained embedding layer)” section which contains models for RNN, LSTM, Bi-LSTM, GRU and Bi-GRU. Run one at a time.
8. Run the “Import Datasets” section to import train and test dataset.
9. Then run the “Word Embeddings” section, there are 3 embeddings sections: Glove, Word2vec and fasttext. These all embeddings are present in the Word Embeddings section. We will run the Word2vec section as only this is used in our assignment.
10.We will move to the “Data Imbalanced Handling” section and use one of the sampling techniques (like Undersampling Technique-1 etc.) from all the mentioned techniques in the section. First run the Utility functions section, then run any of the following sections for using the corresponding technique.
11. There are different sections of our models (RNN, LSTM, Bi-LSTM, GRU and Bi-GRU). We will train the model and test the accuracy of every model running these models.
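A minimal sketch of what one such model with a pre-trained embedding layer can look like (a Bi-LSTM here; vocab_size and embedding_matrix come from the word-embedding step, and the layer sizes, dropout rate, X_train/y_train names and training settings are illustrative assumptions, not the exact values in code.ipynb):

    from tensorflow.keras import Sequential
    from tensorflow.keras.layers import Embedding, Bidirectional, LSTM, Dropout, Dense
    from tensorflow.keras.initializers import Constant

    model = Sequential([
        # Frozen pre-trained embeddings (matrix built from word2vec/GloVe/fasttext)
        Embedding(vocab_size, 300,
                  embeddings_initializer=Constant(embedding_matrix),
                  input_length=31, trainable=False),
        Bidirectional(LSTM(128)),
        Dropout(0.2),
        Dense(5, activation="softmax"),  # one output per rating 1-5
    ])
    model.compile(optimizer="adam", loss="categorical_crossentropy",
                  metrics=["accuracy"])
    model.fit(X_train, y_train, validation_split=0.1, epochs=10, batch_size=64)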
----------------------------------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------------------------------
GUI:
We have also attached a file containing the GUI (GUI.ipynb). This file first loads the stored model and tokenizer and uses them throughout the code (code.ipynb contains a section named “Save Model”; run it to store the model and tokenizer. It generates two files, model.tar.gz and tokenizer.pkl, which must be present in the same directory as GUI.ipynb). A rough sketch of what the saving step produces is given below.
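A minimal sketch of what the “Save Model” section is expected to produce (the "model" directory name matches the load call in GUI.ipynb; the exact saving cell in code.ipynb may differ):

    import pickle
    import tarfile

    # Save the trained Keras model and archive it as model.tar.gz
    model.save("model")  # writes a SavedModel directory
    with tarfile.open("model.tar.gz", "w:gz") as tar:
        tar.add("model")

    # Save the fitted tokenizer used during preprocessing
    with open("tokenizer.pkl", "wb") as f:
        pickle.dump(tokenizer, f)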
Libraries to be installed on your local PC (they can be installed with the single pip command shown after the list):
Keras==2.4.3
tensorflow==2.2.0
numpy==1.17.4
nltk==3.4.5
tabulate==0.8.9
pandas==1.0.5
scikit_learn==0.24.1
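For example, from a terminal:

    pip install Keras==2.4.3 tensorflow==2.2.0 numpy==1.17.4 nltk==3.4.5 tabulate==0.8.9 pandas==1.0.5 scikit_learn==0.24.1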
All the cells should be run sequentially; the user will then see a window. There is one textbox where the user inserts the text; pressing the “Print” button then shows the predicted rating and the probabilities for each rating.
Note: this GUI does not work in Colab (tkinter needs a local display) but works perfectly on a local machine. If you want to test the GUI, one approach is to run code.ipynb in Google Colab, download the model.tar.gz and tokenizer.pkl files, put them in the same directory as GUI.ipynb, and run all the cells of GUI.ipynb sequentially.
{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"[nltk_data] Downloading package stopwords to /home/rohit/nltk_data...\n",
"[nltk_data] Package stopwords is already up-to-date!\n",
"[nltk_data] Downloading package punkt to /home/rohit/nltk_data...\n",
"[nltk_data] Package punkt is already up-to-date!\n"
]
},
{
"data": {
"text/plain": [
"True"
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import tensorflow as tf\n",
"import pickle \n",
"import pandas as pd\n",
"import numpy as np\n",
"import string\n",
"import nltk\n",
"from nltk.corpus import stopwords\n",
"import tensorflow as tf\n",
"from tensorflow.keras.preprocessing.text import Tokenizer\n",
"from tensorflow.keras.preprocessing.text import one_hot\n",
"from tensorflow.keras.preprocessing.sequence import pad_sequences\n",
"from tensorflow.keras import Sequential\n",
"from tensorflow.keras.layers import Dense, Dropout, Activation, Flatten, SimpleRNN, LSTM, Bidirectional, GRU\n",
"from sklearn import datasets, model_selection, metrics\n",
"from keras.layers.embeddings import Embedding\n",
"from keras.initializers import Constant\n",
"from nltk.tokenize import word_tokenize\n",
"from sklearn.model_selection import train_test_split \n",
"nltk.download('stopwords')\n",
"stopword = stopwords.words('english') \n",
"nltk.download('punkt')"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"def encode_data(tokenizer, text, tokens, preprocessing_training_data = False):\n",
" # This function will be used to encode the reviews using a dictionary (created using corpus vocabulary) \n",
"\n",
" # Example of encoding :\"The food was fabulous but pricey\" has a vocabulary of 4 words, each one has to be mapped to an integer like: \n",
" # {'The':1,'food':2,'was':3 'fabulous':4 'but':5 'pricey':6} this vocabulary has to be created for the entire corpus and then be used to \n",
" # encode the words into integers \n",
"\n",
" # return encoded examples\n",
" if preprocessing_training_data:\n",
" tokenizer = Tokenizer(oov_token = '<oov>')\n",
" tokenizer.fit_on_texts(tokens)\n",
"\n",
" sequences = tokenizer.texts_to_sequences(text)\n",
" return sequences, tokenizer\n",
"\n",
"def convert_to_lower(text):\n",
" # return the reviews after convering then to lowercase\n",
" lower_text = text.lower()\n",
" return lower_text\n",
"\n",
"def perform_tokenization(text):\n",
" # return the reviews after performing tokenization\n",
" token=nltk.word_tokenize(text)\n",
" return token\n",
"\n",
"def remove_stopwords(text):\n",
" # return the reviews after removing the stopwords\n",
" stopword = [] # not any stopword\n",
" removing_stopwords=[word for word in text if word not in stopword]\n",
" return removing_stopwords\n",
"\n",
"def remove_punctuation(text):\n",
" # return the reviews after removing punctuations\n",
" removing_punctuation = [word for word in text if word.isalpha()]\n",
" return removing_punctuation\n",
"\n",
"def perform_padding(data, maxlen):\n",
" # return the reviews after padding the reviews to maximum length\n",
" padded_data = pad_sequences(data, maxlen=maxlen, padding='post')\n",
" return padded_data\n",
"\n",
"def preprocess_data(tokenizer, data, preprocessing_training_data=False, maxlen=None):\n",
" # make all the following function calls on your data\n",
" # EXAMPLE:->\n",
" '''\n",
" review = data[\"reviews\"]\n",
" review = convert_to_lower(review)\n",
" review = remove_punctuation(review)\n",
" review = remove_stopwords(review)\n",
" review = perform_tokenization(review)\n",
" review = encode_data(review)\n",
" review = perform_padding(review)\n",
" '''\n",
" # return processed data\n",
"\n",
" reviews = data[\"reviews\"]\n",
" list_of_reviews = list(reviews)\n",
" string_of_reviews = ' '.join(str(e) for e in list_of_reviews)\n",
"\n",
" lower_text = convert_to_lower(string_of_reviews)\n",
" tokens = perform_tokenization(lower_text)\n",
" tokens = remove_stopwords(tokens)\n",
" tokens = remove_punctuation(tokens)\n",
" encoded_data, tokenizer = encode_data(tokenizer, reviews, tokens, preprocessing_training_data)\n",
" reviews = perform_padding(encoded_data, maxlen)\n",
"\n",
" return pd.DataFrame(reviews), tokenizer"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"maxlen = 31 # verified "
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"! tar -xzf model.tar.gz\n",
"model = tf.keras.models.load_model(\"model\", compile = False)\n",
"\n",
"with open(r\"tokenizer.pkl\", \"rb\") as input_file:\n",
" tokenizer = pickle.load(input_file)"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[[7.46410131e-01 1.02791116e-01 1.10829026e-01 2.15716772e-02\n",
" 1.83980539e-02]\n",
" [9.36661454e-05 1.18578457e-04 2.48400168e-03 7.10299909e-02\n",
" 9.26273704e-01]\n",
" [1.44877762e-03 4.65300196e-04 2.53088167e-03 4.58934791e-02\n",
" 9.49661493e-01]]\n"
]
}
],
"source": [
"# Test input\n",
"Test = ['this is bad', 'wow nice!', 'this is a great product']\n",
"test_reviews = pd.DataFrame(Test, columns=['reviews'])\n",
"\n",
"# pre-process data\n",
"test_reviews, tokenizer = preprocess_data(tokenizer, test_reviews, maxlen=maxlen)\n",
"\n",
"# predict\n",
"test_predictions = model.predict(test_reviews)\n",
"\n",
"# show probabilities\n",
"print(test_predictions)"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"def predict_rating(text):\n",
" test_reviews = pd.DataFrame([text], columns=['reviews'])\n",
"\n",
"# test_reviews, _ = preprocess_data(tokenizer, test_reviews, maxlen=train_reviews.shape[1])\n",
" test_reviews, _ = preprocess_data(tokenizer, test_reviews, maxlen=31)\n",
"\n",
" test_predictions = model.predict(test_reviews)\n",
" test_ratings = np.argmax(np.array(test_predictions), axis=1) + 1\n",
" \n",
" import tabulate\n",
" \n",
" str = f\"\\nPredicted Rating: {test_ratings}\\n\\n\\n\"\n",
" str += tabulate.tabulate(test_predictions, headers=[\"Rating-1\", \"Rating-2\", \"Rating-3\", \"Rating-4\", \"Rating-5\"])\n",
" return str"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"import tkinter as tk\n",
" \n",
"# Top level window \n",
"frame = tk.Tk() \n",
"frame.title(\"Rate reviews\") \n",
"frame.geometry('600x600') \n",
"\n",
"# Function for getting Input from textbox and printing it at label widget \n",
"def printInput(): \n",
" inp = inputtxt.get(1.0, \"end-1c\") \n",
" output = predict_rating(inp)\n",
" lbl.config(text = output) \n",
"\n",
"# TextBox Creation \n",
"inputtxt = tk.Text(frame, \n",
" height = 10, \n",
" width = 40, \n",
" font=(\"Courier\", 18)) \n",
" \n",
"inputtxt.pack() \n",
" \n",
"# Button Creation \n",
"printButton = tk.Button(frame, \n",
" text = \"Print\", \n",
" command = printInput, \n",
" font=(\"monospace\", 14)) \n",
"printButton.pack() \n",
" \n",
"# Label Creation \n",
"lbl = tk.Label(frame, text = \"\") \n",
"lbl.pack() \n",
"frame.mainloop() "
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.5"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
Instructions to run the code file for Assignment-4
1. Open the code.ipynb file.
2. Run the cells in the file one by one.
3. Run the “Install python packages” section.
4. Then run the cell “To download dataset (train and test)”, which makes the datasets available to the program.
5. Then run the “To Install the ktrain package” cell to download the ktrain package.
6. Then run the “import the packages” cell.
7. The “Download the pre-trained DistilBERT-base-uncased Model” cell downloads the pretrained model.
8. The next cell performs the data preprocessing.
9. The model is trained in the “Training the Model” cell.
10. The predictions are then made in the next cell.
11. The “Explainability of the model” section is used for analysis of the model.
12. Run the section “To resolve confusion between adjacent classes” to examine the overlap of words between classes. A rough ktrain + DistilBERT sketch is given after this list.
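For orientation, a typical ktrain + DistilBERT text-classification pipeline looks roughly like this (the column names, maxlen, batch size, learning rate and number of epochs are illustrative assumptions, not the exact values used in code.ipynb):

    import pandas as pd
    import ktrain
    from ktrain import text

    train_df = pd.read_csv("train.csv")
    test_df = pd.read_csv("gold_test.csv")

    # Wrap the pre-trained DistilBERT model and its tokenizer
    t = text.Transformer("distilbert-base-uncased", maxlen=128,
                         class_names=["1", "2", "3", "4", "5"])
    trn = t.preprocess_train(train_df["reviews"].values,
                             train_df["ratings"].astype(str).values)
    val = t.preprocess_test(test_df["reviews"].values,
                            test_df["ratings"].astype(str).values)

    # Build the classifier and fine-tune it
    model = t.get_classifier()
    learner = ktrain.get_learner(model, train_data=trn, val_data=val, batch_size=16)
    learner.fit_onecycle(2e-5, 1)

    # Wrap everything into a predictor for single reviews
    predictor = ktrain.get_predictor(learner.model, preproc=t)
    print(predictor.predict("this is a great product"))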
----------------------------------------------------------------------------------------------------------------------------
GUI:
We have also attached a file containing the GUI (GUI.ipynb). This file first loads the stored model and tokenizer and uses them throughout the code (code.ipynb contains a section named “Save Model”; run it to store the model and tokenizer. It generates two files, model.tar.gz and tokenizer.pkl, which must be present in the same directory as GUI.ipynb).
All the cells should be run sequentially; the user will then see a window. There is one textbox where the user inserts the text; pressing the “Print” button then shows the predicted rating and the probabilities for each rating.
Note: this GUI does not work in Colab (tkinter needs a local display) but works perfectly on a local machine. If you want to test the GUI, one approach is to run code.ipynb in Google Colab, download the model.tar.gz and tokenizer.pkl files, put them in the same directory as GUI.ipynb, and run all the cells of GUI.ipynb sequentially.