Initial commit

Commit 3ce75350 authored Nov 28, 2023 by Meet Narendra
28 changed files
--- a/.ipynb_checkpoints/retrieval-checkpoint.ipynb
+++ b/.ipynb_checkpoints/retrieval-checkpoint.ipynb
--- a/IMDB-Movie-Data.csv
+++ b/IMDB-Movie-Data.csv
--- a/README.md
+++ b/README.md
+# Knowledge Graph-based Question Answering for Movie Dataset
+
+## Overview
+
+This project focuses on Information Retrieval for a movie dataset using a Knowledge Graph-based Question Answering approach. The knowledge graph comprises nodes representing actors, directors, movie titles, genres, and years. The goal is to retrieve the top-k relevant nodes for a given question about the IMDB movie dataset.
+
+## Directory Structure
+
+```
+.
+├── data
+├── models
+│   ├── query_model.pt
+│   └── node_model.pt
+├── sample_questions.py
+├── eval.py
+├── node_embed.py
+├── question_generation.py
+├── scores.py
+├── generated_questions.csv
+├── README.md
+├── test_questions.csv
+├── graph.json
+├── plot.py
+├── requirements.txt
+├── test.txt
+├── IMDB-Movie-Data.csv
+├── preprocess.py
+└── retrieval.ipynb
+```
+
+### Files and Descriptions
+
+- **data**: Directory to store input data.
+- **models**: Directory to save trained models.
+  - `query_model.pt`: Trained model for question embeddings.
+  - `node_model.pt`: Trained model for node embeddings.
+- **sample_questions.py**: Script containing sample questions for testing.
+- **eval.py**: Evaluation script for assessing the performance of the retrieval system.
+- **node_embed.py**: Script for node embedding generation.
+- **question_generation.py**: Script for generating questions.
+- **scores.py**: Script containing scoring functions.
+- **generated_questions.csv**: CSV file to store generated questions.
+- **README.md**: This file, providing an overview of the project.
+- **test_questions.csv**: CSV file containing test questions.
+- **graph.json**: JSON file representing the knowledge graph.
+- **plot.py**: Script for generating plots or visualizations.
+- **requirements.txt**: File listing the required Python packages.
+- **test.txt**: File for storing test-related information.
+- **IMDB-Movie-Data.csv**: Input dataset containing information about movies.
+- **preprocess.py**: Script for preprocessing the dataset.
+- **retrieval.ipynb**: Jupyter notebook for the retrieval process.
+
+## Getting Started
+
+1. Install the required dependencies using `pip install -r requirements.txt`.
+2. Run `preprocess.py` to preprocess the dataset.
+3. Execute `retrieval.ipynb` for the information retrieval process.
+4. Adjust parameters and experiment with different queries in `sample_questions.py`.
+5. Evaluate the system using `eval.py` to assess retrieval performance.
+
+## Usage
+
+- Modify `sample_questions.py` to add or modify sample questions.
+- Run `question_generation.py` to generate questions based on the knowledge graph.
+- Utilize the trained models (`query_model.pt` and `node_model.pt`) for embeddings.
+- Evaluate the system using `eval.py` to assess retrieval performance.
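A minimal sketch of the top-k retrieval step that the README describes, assuming question and node embeddings are already available as NumPy arrays. The helper `top_k_nodes` and the toy vectors are hypothetical and only illustrate ranking by cosine similarity; they are not taken from the repository.

```python
import numpy as np

def top_k_nodes(query_vec, node_vecs, node_names, k=10):
    """Rank nodes by cosine similarity to the query and return the k best names."""
    q = query_vec / np.linalg.norm(query_vec)
    n = node_vecs / np.linalg.norm(node_vecs, axis=1, keepdims=True)
    scores = n @ q                          # one cosine score per node
    order = np.argsort(scores)[::-1][:k]    # indices of the highest scores first
    return [node_names[i] for i in order]

# Toy 2-D embeddings, made up for illustration only.
names = ["Christopher Nolan", "Inception", "Comedy"]
vecs = np.array([[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]])
result = top_k_nodes(np.array([1.0, 0.1]), vecs, names, k=2)
print(result)
```

In `eval.py` the same ranking is done with `sklearn`'s `cosine_similarity` followed by `argsort`, with k fixed at 10.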
--- a/eval.py
+++ b/eval.py
+import random
+import gradio as gr
+import json
+import numpy as np
+from sentence_transformers import SentenceTransformer
+from sklearn.metrics.pairwise import cosine_similarity
+from tqdm import tqdm
+import pandas as pd
+from transformers import AutoTokenizer,AutoModelForTokenClassification
+from collections import Counter
+import torch
+from transformers import pipeline
+import networkx as nx
+
+device = 'cuda' if torch.cuda.is_available() else 'cpu'
+
+tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/paraphrase-MiniLM-L6-v2')
+NER_tokenizer = AutoTokenizer.from_pretrained("dslim/bert-base-NER")
+NER_model = AutoModelForTokenClassification.from_pretrained("dslim/bert-base-NER")
+query_model = torch.load('query_model.pt').to(device)
+node_model = torch.load('node_model.pt').to(device)
+
+graph = {}
+
+with open('graph.json') as f:
+    graph = json.load(f)
+
+G = nx.Graph()
+G.add_nodes_from(graph['node_list'])
+G.add_edges_from(graph['edge_list'])
+
+
+ner_pipe = pipeline("ner", model=NER_model, tokenizer=NER_tokenizer)
+
+
+def mean_pooling(model_output, attention_mask):
+    #print(model_output[0].shape,attention_mask.shape)
+    token_embeddings = model_output[0] #First element of model_output contains all token embeddings
+    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
+    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)
+
+
+def find_duplicates(lst):
+    counter = Counter(lst)
+    return [item for item, count in counter.items() if count > 1]
+
+nodes = graph['id_dict'].values()
+nodes = [str(x).strip() for x in nodes]
+
+# for x,y in graph['edge_list']:
+#     print(graph['id_dict'][str(x)],graph['id_dict'][str(y)])
+
+with torch.no_grad():
+    encoded_docs = tokenizer(nodes, padding=True, truncation=True, return_tensors='pt').to(device)
+    model_out = node_model(**encoded_docs)
+    frozen_embeddings = mean_pooling(model_out, encoded_docs['attention_mask']).cpu().numpy()
+
+print(frozen_embeddings.shape)
+
+print(len(nodes),len(set(nodes)),find_duplicates(nodes),len(frozen_embeddings),len(frozen_embeddings[0]))
+
+def get_neighbor(node,k):
+    """Collect the nodes reached by walks of length 1..k from `node` (may contain duplicates)."""
+    neighbors = []
+    prev = [node]
+    while k>0:
+        k-=1
+        curr_neighbors = []
+        for n in prev:
+            curr_neighbors.extend(G.neighbors(n))
+        prev = curr_neighbors
+        neighbors.extend(curr_neighbors)
+    return neighbors
+
+
+def get_top_k_nodes(message, history):
+    with torch.no_grad():
+        ner_tags = ner_pipe(message)
+        tags_in_kg = []
+        for tag in ner_tags:
+            if tag['word'] in nodes:
+                tags_in_kg.append(tag['word'])
+        if len(tags_in_kg)==0:
+            #print('Using the default all comparison approach')
+            encoded_query = tokenizer(message, padding=True, truncation=True, return_tensors='pt').to(device)
+            model_out = query_model(**encoded_query)
+            embedded_query = mean_pooling(model_out, encoded_query['attention_mask']).cpu().numpy()
+            scores = cosine_similarity(embedded_query, frozen_embeddings)
+            argsort = scores.argsort()[0][::-1] 
+            selected_nodes = [nodes[x] for x in argsort[:10]]
+            return "\n".join(selected_nodes)
+        else:
+            #collect the node and its neighbors
+            #print('Using the NER approach')
+            neighbors = []
+            for tag in tags_in_kg:
+                neighbors.extend(get_neighbor(graph['dict_id'][tag],1))
+            #print(neighbors)
+            neighbors = [str(graph['id_dict'][str(i)]) for i in neighbors]
+            #print(neighbors)
+            encoded_docs = tokenizer(neighbors, padding=True, truncation=True, return_tensors='pt').to(device)
+            model_out = node_model(**encoded_docs)
+            neighbor_frozen_embeddings = mean_pooling(model_out, encoded_docs['attention_mask']).cpu().numpy()
+            encoded_query = tokenizer(message, padding=True, truncation=True, return_tensors='pt').to(device)
+            model_out = query_model(**encoded_query)
+            embedded_query = mean_pooling(model_out, encoded_query['attention_mask']).cpu().numpy()
+            scores = cosine_similarity(embedded_query, neighbor_frozen_embeddings)
+            argsort = scores.argsort()[0][::-1] 
+            selected_nodes = [neighbors[x] for x in argsort[:10]]
+            return "\n".join(selected_nodes)
+
+
+questions = []
+answers = []
+
+with open('test.txt', 'r') as file:
+    total = 0
+    correct = 0
+    # Each example spans three lines: the question, the answer (after an
+    # 8-character prefix), and a third line that is ignored.
+    for line1, line2, line3 in tqdm(zip(file, file, file)):
+        question = line1.strip()
+        answer = line2[8:].strip()
+        #print(question,answer)
+        # Retrieved node names, lower-cased for a case-insensitive match.
+        predictions = get_top_k_nodes(question,"").split('\n')
+        predictions = [x.strip().lower() for x in predictions]
+        if answer.lower() in predictions:
+            correct+=1
+        else:
+            print(answer)
+        total+=1
+        questions.append(question)
+        answers.append(answer)
+
+test_df = pd.DataFrame({'Questions': questions,'Answers': answers},columns=['Questions','Answers'])
+test_df.to_csv('test_questions.csv')
+
+print(f'Accuracy {correct*100/total}\n{correct} {total}')
+
+
+gr.ChatInterface(get_top_k_nodes).launch()
\ No newline at end of file
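`get_neighbor` in `eval.py` can return the same node several times and, for k > 1, the start node itself. A possible alternative is sketched below as a breadth-first expansion that keeps each node once. It assumes the graph is given as a plain adjacency dictionary; `k_hop_neighbors` and the toy graph are hypothetical names used only for illustration.

```python
from collections import deque

def k_hop_neighbors(adjacency, start, k):
    """Return the nodes within k hops of `start` (excluding it), in BFS order."""
    seen = {start}
    order = []
    frontier = deque([(start, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == k:          # do not expand beyond k hops
            continue
        for nxt in adjacency.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                order.append(nxt)
                frontier.append((nxt, depth + 1))
    return order

# Toy undirected graph: 0-1, 0-2, 1-3.
adjacency = {0: [1, 2], 1: [0, 3], 2: [0], 3: [1]}
one_hop = k_hop_neighbors(adjacency, 0, 1)
two_hop = k_hop_neighbors(adjacency, 0, 2)
print(one_hop, two_hop)
```

Deduplicating the candidates this way would also keep the top-10 list free of repeated names, at the cost of a small `seen` set.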
--- a/generated_questions.csv
+++ b/generated_questions.csv
--- a/graph.json
+++ b/graph.json
--- a/models/1_Pooling/config.json
+++ b/models/1_Pooling/config.json
+{
+  "word_embedding_dimension": 384,
+  "pooling_mode_cls_token": false,
+  "pooling_mode_mean_tokens": true,
+  "pooling_mode_max_tokens": false,
+  "pooling_mode_mean_sqrt_len_tokens": false
+}
\ No newline at end of file
--- a/models/README.md
+++ b/models/README.md
+---
+pipeline_tag: sentence-similarity
+tags:
+- sentence-transformers
+- feature-extraction
+- sentence-similarity
+- transformers
+
+---
+
+# {MODEL_NAME}
+
+This is a [sentence-transformers](https://www.SBERT.net) model: It maps sentences & paragraphs to a 384 dimensional dense vector space and can be used for tasks like clustering or semantic search.
+
+<!--- Describe your model here -->
+
+## Usage (Sentence-Transformers)
+
+Using this model becomes easy when you have [sentence-transformers](https://www.SBERT.net) installed:
+
+```
+pip install -U sentence-transformers
+```
+
+Then you can use the model like this:
+
+```python
+from sentence_transformers import SentenceTransformer
+sentences = ["This is an example sentence", "Each sentence is converted"]
+
+model = SentenceTransformer('{MODEL_NAME}')
+embeddings = model.encode(sentences)
+print(embeddings)
+```
+
+
+
+## Usage (HuggingFace Transformers)
+Without [sentence-transformers](https://www.SBERT.net), you can use the model like this: First, you pass your input through the transformer model, then you have to apply the right pooling-operation on-top of the contextualized word embeddings.
+
+```python
+from transformers import AutoTokenizer, AutoModel
+import torch
+
+
+#Mean Pooling - Take attention mask into account for correct averaging
+def mean_pooling(model_output, attention_mask):
+    token_embeddings = model_output[0] #First element of model_output contains all token embeddings
+    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
+    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)
+
+
+# Sentences we want sentence embeddings for
+sentences = ['This is an example sentence', 'Each sentence is converted']
+
+# Load model from HuggingFace Hub
+tokenizer = AutoTokenizer.from_pretrained('{MODEL_NAME}')
+model = AutoModel.from_pretrained('{MODEL_NAME}')
+
+# Tokenize sentences
+encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
+
+# Compute token embeddings
+with torch.no_grad():
+    model_output = model(**encoded_input)
+
+# Perform pooling. In this case, mean pooling.
+sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
+
+print("Sentence embeddings:")
+print(sentence_embeddings)
+```
+
+
+
+## Evaluation Results
+
+<!--- Describe how your model was evaluated -->
+
+For an automated evaluation of this model, see the *Sentence Embeddings Benchmark*: [https://seb.sbert.net](https://seb.sbert.net?model_name={MODEL_NAME})
+
+
+## Training
+The model was trained with the parameters:
+
+**DataLoader**:
+
+`torch.utils.data.dataloader.DataLoader` of length 443 with parameters:
+```
+{'batch_size': 8192, 'sampler': 'torch.utils.data.sampler.RandomSampler', 'batch_sampler': 'torch.utils.data.sampler.BatchSampler'}
+```
+
+**Loss**:
+
+`sentence_transformers.losses.TripletLoss.TripletLoss` with parameters:
+  ```
+  {'distance_metric': 'TripletDistanceMetric.EUCLIDEAN', 'triplet_margin': 5}
+  ```
+
+Parameters of the fit()-Method:
+```
+{
+    "epochs": 10,
+    "evaluation_steps": 0,
+    "evaluator": "NoneType",
+    "max_grad_norm": 1,
+    "optimizer_class": "<class 'torch.optim.adamw.AdamW'>",
+    "optimizer_params": {
+        "lr": 2e-05
+    },
+    "scheduler": "WarmupLinear",
+    "steps_per_epoch": null,
+    "warmup_steps": 10000,
+    "weight_decay": 0.01
+}
+```
+
+
+## Full Model Architecture
+```
+SentenceTransformer(
+  (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: BertModel 
+  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
+)
+```
+
+## Citing & Authors
+
+<!--- Describe where people can find more information -->
\ No newline at end of file
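The mean-pooling step used in `models/README.md` and `eval.py` can be checked on a tiny example. The sketch below re-expresses it in NumPy rather than PyTorch, purely for illustration; `mean_pool` and the toy arrays are hypothetical.

```python
import numpy as np

def mean_pool(token_embeddings, attention_mask):
    """Average token vectors per sequence, ignoring padded positions.

    token_embeddings: (batch, tokens, dim); attention_mask: (batch, tokens) of 0/1.
    """
    mask = attention_mask[..., None].astype(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)   # avoid division by zero
    return summed / counts

# One sequence with two real tokens and one padding token.
tokens = np.array([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
mask = np.array([[1, 1, 0]])
pooled = mean_pool(tokens, mask)
print(pooled)
```

The padded token does not affect the result because its mask entry is zero, which is why the attention mask has to be passed alongside the model output.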
--- a/models/config.json
+++ b/models/config.json
+{
+  "_name_or_path": "/raid/nlp/pranavg/.cache/torch/sentence_transformers/sentence-transformers_paraphrase-MiniLM-L6-v2/",
+  "architectures": [
+    "BertModel"
+  ],
+  "attention_probs_dropout_prob": 0.1,
+  "classifier_dropout": null,
+  "gradient_checkpointing": false,
+  "hidden_act": "gelu",
+  "hidden_dropout_prob": 0.1,
+  "hidden_size": 384,
+  "initializer_range": 0.02,
+  "intermediate_size": 1536,
+  "layer_norm_eps": 1e-12,
+  "max_position_embeddings": 512,
+  "model_type": "bert",
+  "num_attention_heads": 12,
+  "num_hidden_layers": 6,
+  "pad_token_id": 0,
+  "position_embedding_type": "absolute",
+  "torch_dtype": "float32",
+  "transformers_version": "4.28.1",
+  "type_vocab_size": 2,
+  "use_cache": true,
+  "vocab_size": 30522
+}
--- a/models/config_sentence_transformers.json
+++ b/models/config_sentence_transformers.json
+{
+  "__version__": {
+    "sentence_transformers": "2.0.0",
+    "transformers": "4.7.0",
+    "pytorch": "1.9.0+cu102"
+  }
+}
\ No newline at end of file
--- a/models/modules.json
+++ b/models/modules.json
+[
+  {
+    "idx": 0,
+    "name": "0",
+    "path": "",
+    "type": "sentence_transformers.models.Transformer"
+  },
+  {
+    "idx": 1,
+    "name": "1",
+    "path": "1_Pooling",
+    "type": "sentence_transformers.models.Pooling"
+  }
+]
\ No newline at end of file
--- a/models/pytorch_model.bin
+++ b/models/pytorch_model.bin
--- a/models/sentence_bert_config.json
+++ b/models/sentence_bert_config.json
+{
+  "max_seq_length": 128,
+  "do_lower_case": false
+}
\ No newline at end of file
--- a/models/special_tokens_map.json
+++ b/models/special_tokens_map.json
+{
+  "cls_token": "[CLS]",
+  "mask_token": "[MASK]",
+  "pad_token": "[PAD]",
+  "sep_token": "[SEP]",
+  "unk_token": "[UNK]"
+}
--- a/models/tokenizer.json
+++ b/models/tokenizer.json
--- a/models/tokenizer_config.json
+++ b/models/tokenizer_config.json
+{
+  "clean_up_tokenization_spaces": true,
+  "cls_token": "[CLS]",
+  "do_basic_tokenize": true,
+  "do_lower_case": true,
+  "mask_token": "[MASK]",
+  "model_max_length": 512,
+  "never_split": null,
+  "pad_token": "[PAD]",
+  "sep_token": "[SEP]",
+  "strip_accents": null,
+  "tokenize_chinese_chars": true,
+  "tokenizer_class": "BertTokenizer",
+  "unk_token": "[UNK]"
+}
--- a/models/vocab.txt
+++ b/models/vocab.txt
--- a/node_embed.py
+++ b/node_embed.py
+import networkx as nx
+import random
+import numpy as np
+from typing import List
+from tqdm import tqdm
+import json
+from sentence_transformers import SentenceTransformer, losses, InputExample
+from torch.utils.data import DataLoader
+import os
+
+if not os.path.exists('./models/'):
+    os.makedirs('./models/')
+
+
+# Hyperparams
+# We saw that shorter but wider walks helped generate better node embeddings.
+# We use Sentence-BERT as the embedding layer so that some semantic information
+# from the node values is retained. For example, names of actors or directors
+# already form a cluster without any node-embedding training, so we build on
+# that and fine-tune the model with random walks to produce better embeddings.
+walk_length=5
+num_walks=10
+negative_sample_size=30
+batch_size=8192 #recommended to lower it for a general purpose cpu
+epochs=20
+
+class DeepWalk:
+    def __init__(self, walk_length: int, walks_per_node: int):
+        """
+        :param walk_length: length of the walk
+        :param walks_per_node: number of walks per node
+        """
+        self.walk_length = walk_length
+        self.walk_per_node = walks_per_node
+
+    def random_walk(self, g: nx.Graph, start: str, use_probabilities: bool = False) -> List[str]:
+        """
+        Generate a random walk starting on start
+        :param g: Graph
+        :param start: starting node for the random walk
+        :param use_probabilities: if True take into account the weights assigned to each edge to select the next candidate
+        :return: a random walk starting from the 'start' node
+        """
+        walk = [start]
+        for i in range(self.walk_length):
+            neighbours = g.neighbors(walk[i])
+            neighs = list(neighbours)
+            if use_probabilities:
+                probabilities = [g.get_edge_data(walk[i], neig)["weight"] for neig in neighs]
+                sum_probabilities = sum(probabilities)
+                probabilities = list(map(lambda t: t / sum_probabilities, probabilities))
+                p = np.random.choice(neighs, p=probabilities)
+            else:
+                p = random.choice(neighs)
+            walk.append(p)
+        return walk
+    
+    def get_walks(self, g: nx.Graph, use_probabilities: bool = False) -> List[List[str]]:
+        """
+        Generate all the random walks
+        :param g: Graph
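+        :param use_probabilities: if True, weight each step by the edge weights (see random_walk)
+        :return: a list of random walks, walks_per_node for every node in the graph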
+        """
+        random_walks = []
+        for _ in range(self.walk_per_node):
+            random_nodes = list(g.nodes)
+            random.shuffle(random_nodes)
+            for node in tqdm(random_nodes):
+                random_walks.append(self.random_walk(g=g, start=node, use_probabilities=use_probabilities))
+        return random_walks
+
+graph = {}
+
+with open('graph.json') as f:
+    graph = json.load(f)
+
+#print(graph)
+
+G = nx.Graph()
+G.add_nodes_from(graph['node_list'])
+G.add_edges_from(graph['edge_list'])
+
+# for node in graph['node_list']:
+#     if str(node) not in graph['id_dict']:
+#         print(node)
+
+# for node in G.nodes():
+#     neighbors = list(G.neighbors(node))
+#     print(f'{graph["id_dict"][str(node)]} {[graph["id_dict"][str(x)] for x in neighbors]}')
+# print(len(graph['node_list']),len(set(graph['node_list'])))
+# exit()
+
+DW = DeepWalk(walk_length,num_walks)
+walks = DW.get_walks(G)
+
+def get_negative_samples(walk, nodelist, k):
+    """
+    Get negative samples for training. We take the nodes of a walk and draw k negative samples at random, without replacement, from the nodes in the nodelist that do not appear in the walk. We know this approach is not ideal, but it worked for an assignment, so we went ahead anyway :D
+    :param walk: a single node or a list of nodes
+    :param nodelist: a list of all nodes in the KG
+    :param k: the number of negative samples to extract
+    :return: a set of negative samples
+    """
+    negative_sample_space = list(set(nodelist)-set(walk))
+    selected_samples = np.random.choice(negative_sample_space,k,replace=False)
+    return selected_samples
+
+
+
+train_examples = []
+
+# Here we generate training examples from the extracted walks
+for walk in walks:
+    #print(len(walk))
+    negative_samples = get_negative_samples(walk,graph['node_list'],negative_sample_size)
+    #print(len(negative_samples))
+    sentence_walk = [str(graph['id_dict'][str(x)]) for x in walk]
+    sentence_negatives = [str(graph['id_dict'][str(x)]) for x in negative_samples]
+    #print(sentence_walk,sentence_negatives)
+    for i in range(1,len(sentence_walk)):
+        if sentence_walk[0]!=sentence_walk[i]:
+            for negative in sentence_negatives:
+                train_examples.append(InputExample(texts=[sentence_walk[0],sentence_walk[i],negative]))
+
+                
+print('Length of train examples',len(train_examples))
+
+model = SentenceTransformer('paraphrase-MiniLM-L6-v2')
+train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=batch_size)
+train_loss = losses.TripletLoss(model=model)
+
+model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=epochs, show_progress_bar=True)
+
+model.save('./models/')
+
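The way `node_embed.py` turns random walks into (anchor, positive, negative) triplets can be sketched without `networkx` or `sentence-transformers`. The sketch assumes a plain adjacency dictionary; `random_walk`, `make_triplets` and the toy graph are hypothetical and simplified (unweighted steps, and a cap on negatives when the graph is small).

```python
import random

def random_walk(adjacency, start, length, rng):
    """Take `length` uniform random steps from `start`; stop early at a dead end."""
    walk = [start]
    for _ in range(length):
        neighbours = adjacency[walk[-1]]
        if not neighbours:
            break
        walk.append(rng.choice(neighbours))
    return walk

def make_triplets(walk, all_nodes, num_negatives, rng):
    """Pair the walk's first node with each later node and with sampled non-walk nodes."""
    anchor = walk[0]
    in_walk = set(walk)
    candidates = [n for n in all_nodes if n not in in_walk]
    negatives = rng.sample(candidates, min(num_negatives, len(candidates)))
    return [(anchor, positive, negative)
            for positive in walk[1:] if positive != anchor
            for negative in negatives]

# Toy graph with two components, so negatives always exist for a walk from "a".
adjacency = {"a": ["b"], "b": ["a", "c"], "c": ["b"], "d": ["e"], "e": ["d"]}
rng = random.Random(0)
walk = random_walk(adjacency, "a", 3, rng)
triplets = make_triplets(walk, list(adjacency), 2, rng)
print(walk, triplets[:3])
```

Each triplet then plays the role of one `InputExample(texts=[anchor, positive, negative])` for the triplet loss, which pulls the anchor toward the positive and pushes it away from the negative.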
--- a/node_model.pt
+++ b/node_model.pt
--- a/plot.py
+++ b/plot.py
+import json
+graph = {}
+
+with open('graph.json') as f:
+    graph = json.load(f)
+
+
+from sentence_transformers import SentenceTransformer, losses, InputExample
+model = SentenceTransformer('./models/')
+
+nodes = graph['id_dict'].values()
+nodes = [str(x) for x in nodes][:200]
+
+frozen_embeddings = model.encode(nodes)
+
+from sklearn.manifold import TSNE
+
+# Create a t-SNE model
+tsne = TSNE(n_components=2, random_state=42)
+
+# Fit and transform your embeddings using t-SNE
+embeddings_2d = tsne.fit_transform(frozen_embeddings)
+
+import pandas as pd
+import plotly.express as px
+
+# Create a DataFrame for Plotly Express
+df = pd.DataFrame({'x': embeddings_2d[:, 0], 'y': embeddings_2d[:, 1], 'word': nodes})
+
+# Scatter plot with annotations using Plotly Express
+fig = px.scatter(df, x='x', y='y', text='word', title='t-SNE Visualization of Word Embeddings')
+fig.show()
+
+
--- a/preprocess.py
+++ b/preprocess.py
+import pandas as pd
+
+data = pd.read_csv('IMDB-Movie-Data.csv', index_col='Rank')
+
+print(data.head())
+
+print(data.columns)
+
+titles = set(data['Title'])
+
+genres = set()
+
+description = set(data['Description'])
+
+director = set(data['Director'])
+
+actors = set()
+
+year = set()
+
+for idx,row in data.iterrows():
+    temp = row['Genre'].split(',')
+    temp = [x.strip() for x in temp]
+    genres.update(temp)
+    temp = row['Actors'].split(',')
+    temp = [x.strip() for x in temp]
+    actors.update(temp)
+    year.update([row['Year']])
+
+graph = {
+    # different sets of nodes
+    'titles': list(titles),
+    'genres':  list(genres),
+    #'description': list(description),
+    'director': list(director),
+    'actors': list(actors),
+    'year': list(year),
+}
+
+count = 0
+
+graph['dict_id'] = {}
+graph['id_dict'] = {}
+
+
+for idx,row in data.iterrows():
+    temp = row['Genre'].split(',')
+    temp = [x.strip() for x in temp]
+    for x in temp:
+        if x not in graph['dict_id'] and count not in graph['id_dict']:
+            graph['dict_id'][x] = count
+            graph['id_dict'][count] = x
+            count+=1
+    temp = row['Actors'].split(',')
+    temp = [x.strip() for x in temp]
+    for x in temp:
+        if x not in graph['dict_id'] and count not in graph['id_dict']:
+            graph['dict_id'][x] = count
+            graph['id_dict'][count] = x
+            count+=1
+
+    if row['Year'] not in graph['dict_id'] and count not in graph['id_dict']:
+        graph['dict_id'][row['Year']] = count
+        graph['id_dict'][count] = row['Year']
+        count+=1
+    if row['Title'] not in graph['dict_id'] and count not in graph['id_dict']:
+        graph['dict_id'][row['Title']] = count
+        graph['id_dict'][count] = row['Title']
+        count+=1
+    if row['Director'] not in graph['dict_id'] and count not in graph['id_dict']:
+        graph['dict_id'][row['Director']] = count
+        graph['id_dict'][count] = row['Director']
+        count+=1
+
+edge_list = []
+
+for idx,row in data.iterrows():
+    temp_list = []
+
+    temp = row['Genre'].split(',')
+    temp = [x.strip() for x in temp]
+    temp_list.extend([(graph['dict_id'][x],0) for x in temp])
+    
+    temp = row['Actors'].split(',')
+    temp = [x.strip() for x in temp]
+    temp_list.extend([(graph['dict_id'][x],1) for x in temp])
+
+    temp_list.append((graph['dict_id'][row['Year']],2))
+    temp_list.append((graph['dict_id'][row['Director']],3))
+    temp_list.append((graph['dict_id'][row['Title']],4))
+    print(len(temp_list))
+    for x in temp_list:
+        for y in temp_list:
+            if x[0]!=y[0] and x[1]!=y[1]:
+                # Store both directions; entries are (node_id, type), so index 0 is the id.
+                edge_list.append((x[0],y[0]))
+                edge_list.append((y[0],x[0]))
+
+            
+    
+edge_list = list(set(edge_list))
+#for x,y in edge_list:
+#    print(graph['id_dict'][x],graph['id_dict'][y])
+
+graph['edge_list'] = edge_list
+graph['node_list'] = [i for i in range(count)]
+
+for node in graph['node_list']:
+    if node not in graph['id_dict']:
+        print(node)
+
+print(count, len(edge_list)/2)
+
+import json
+with open('graph.json','w') as f:
+    json.dump(graph,f)
+
+# import networkx as nx
+# import matplotlib.pyplot as plt
+
+
+# nodes = [i for i in range(count)]
+# edges = edge_list
+
+# G = nx.Graph()
+
+# G.add_nodes_from(nodes)
+# G.add_edges_from(edges)
+
+# # Plot the graph interactively
+# pos = nx.spring_layout(G)
+# nx.draw(G, pos, with_labels=True, font_weight='bold', node_size=700, node_color='skyblue', font_color='black', font_size=10, edge_color='gray')
+
+# # Make the plot interactive with zooming capabilities
+# plt.savefig('graph.png')
+
+
+
+
+
+
+
+
+
+
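The graph construction in `preprocess.py` (assign each distinct value an integer id, then connect values of different types that co-occur in one movie) can be sketched on in-memory rows. The sketch assumes rows are plain dictionaries with already-split fields; `build_graph`, the field names and the sample row are hypothetical, and each undirected edge is stored once rather than in both directions.

```python
from itertools import combinations

def build_graph(rows):
    """Return (name_to_id, id_to_name, edges) for a list of movie rows."""
    name_to_id, id_to_name, edges = {}, {}, set()

    def node_id(name):
        # Assign the next free integer id the first time a value is seen.
        if name not in name_to_id:
            idx = len(name_to_id)
            name_to_id[name] = idx
            id_to_name[idx] = name
        return name_to_id[name]

    for row in rows:
        typed = [(node_id(g), "genre") for g in row["genres"]]
        typed += [(node_id(a), "actor") for a in row["actors"]]
        typed += [(node_id(row["year"]), "year"),
                  (node_id(row["director"]), "director"),
                  (node_id(row["title"]), "title")]
        # Link every pair of distinct nodes whose types differ.
        for (u, tu), (v, tv) in combinations(typed, 2):
            if u != v and tu != tv:
                edges.add((min(u, v), max(u, v)))
    return name_to_id, id_to_name, sorted(edges)

rows = [{"genres": ["Action"], "actors": ["A", "B"],
         "year": 2014, "director": "D", "title": "T"}]
name_to_id, id_to_name, edges = build_graph(rows)
print(len(name_to_id), len(edges))
```

Note that `json.dump` writes integer dictionary keys as strings, which is why the scripts that read `graph.json` look up `graph['id_dict'][str(x)]`.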
--- a/query_model.pt
+++ b/query_model.pt
--- a/question_generation.py
+++ b/question_generation.py
+#This file generates synthetic questions for training a retrieval model. We use few shot prompting to generate questions for a particular answer and context. We then filter the questions based on whether the model is able to answer the question correctly or not.
+import os
+from transformers import AutoTokenizer,AutoModelForCausalLM,BitsAndBytesConfig
+import torch
+import pandas as pd
+import numpy as np
+from datasets import Dataset
+#os.environ["CUDA_VISIBLE_DEVICES"] = "3"
+os.system("echo $CUDA_VISIBLE_DEVICES")
+#eval()
+from transformers import AutoTokenizer, DataCollatorWithPadding,AutoModelForCausalLM
+from transformers import LlamaForCausalLM #llama
+from transformers import FalconForCausalLM #Falcon
+from transformers import T5ForConditionalGeneration #t5
+from transformers import AutoModelForSeq2SeqLM #Flant5
+from transformers import GPT2LMHeadModel #gpt2
+import json
+import random
+from sample_questions import few_shot_examples
+from tqdm import tqdm
+from transformers import pipeline
+
+# Quantisation config for 180b falcon
+bnb_config = BitsAndBytesConfig(
+    load_in_4bit=True,
+    bnb_4bit_quant_type="nf4",
+    bnb_4bit_use_double_quant=True,
+    bnb_4bit_compute_dtype=torch.float16
+)
+
+def setup_model(model_name):
+    model_paths = {
+            'llama-2-7b': '/raid/nlp/models/llama-2-7B-hf',
+            'llama-2-7b-chat': '/raid/nlp/models/llama-2-7b-chat-hf',
+            'llama-2-13b': '/raid/nlp/models/llama-2-13b-hf',
+            'llama-2-13b-chat': '/raid/nlp/models/llama-2-13b-chat-hf',
+            'llama-2-70b': '/raid/nlp/models/llama-2-70b-hf',
+            'llama-2-70b-chat': '/raid/nlp/models/llama-2-70b-chat-hf',
+            't5-3b': '/raid/nlp/models/t5-3b',
+            't5-11b': '/raid/nlp/models/t5-11b',
+            'flan-t5-small': '/raid/nlp/models/flan-t5-small',
+            'flan-t5-base': '/raid/nlp/models/flan-t5-base',
+            'flan-t5-large': '/raid/nlp/models/flan-t5-large',
+            'flan-t5-xl': '/raid/nlp/models/flan-t5-xl',
+            'flan-t5-xxl': '/raid/nlp/models/flan-t5-xxl',
+            'falcon-40b-instruct': '/raid/nlp/models/falcon-40b-instruct',
+            'falcon-40b': '/raid/nlp/models/falcon-40b',
+            'gpt2': '/raid/nlp/models/gpt2',
+            'phi-1': '/raid/nlp/models/phi-1',
+            'falcon-7b-instruct': '/raid/nlp/models/falcon-7b-instruct',
+            'falcon-7b': '/raid/nlp/models/falcon-7b',
+            'falcon-180b': '/raid/nlp/models/falcon-180b/falcon-180B',
+            'phi-1_5': '/raid/nlp/models/phi-1_5'
+        }
+    device = "cuda" if torch.cuda.is_available() else "cpu"
+    if model_name not in model_paths.keys():
+        raise Exception('Model not found')
+    model = None
+    tokenizer = None
+    torch.cuda.empty_cache()
+    print('Loading Model...')
+    tokenizer = AutoTokenizer.from_pretrained(model_paths[model_name])
+    if model_name=='llama-2-7b':
+        model = LlamaForCausalLM.from_pretrained(model_paths[model_name], device_map="auto", torch_dtype=torch.float16)
+        model.bfloat16()
+    elif model_name=='llama-2-7b-chat':
+        model = LlamaForCausalLM.from_pretrained(model_paths[model_name], device_map="auto", torch_dtype=torch.float16)
+        model.bfloat16()
+    elif model_name=='llama-2-13b':
+        model = LlamaForCausalLM.from_pretrained(model_paths[model_name], device_map="auto", torch_dtype=torch.float16)
+        model.bfloat16()
+    elif model_name=='llama-2-13b-chat':
+        model = LlamaForCausalLM.from_pretrained(model_paths[model_name], device_map="auto", torch_dtype=torch.float16)
+        model.bfloat16()
+    elif model_name=='llama-2-70b':
+        model = LlamaForCausalLM.from_pretrained(model_paths[model_name], device_map="auto", load_in_8bit=True)
+        model.bfloat16()
+    elif model_name=='llama-2-70b-chat':
+        model = LlamaForCausalLM.from_pretrained(model_paths[model_name], device_map="auto", load_in_8bit=True)
+        model.bfloat16()
+    elif model_name=='t5-3b':
+        model = T5ForConditionalGeneration.from_pretrained(model_paths[model_name],device_map="auto", torch_dtype=torch.float16)
+    elif model_name=='t5-11b':
+        model = T5ForConditionalGeneration.from_pretrained(model_paths[model_name], device_map="auto", torch_dtype=torch.float16)
+    elif model_name=='flan-t5-small':
+        model = AutoModelForSeq2SeqLM.from_pretrained(model_paths[model_name],device_map="auto", torch_dtype=torch.float16)
+    elif model_name=='flan-t5-base':
+        model = AutoModelForSeq2SeqLM.from_pretrained(model_paths[model_name],device_map="auto", torch_dtype=torch.float16)
+    elif model_name=='flan-t5-large':
+        model = AutoModelForSeq2SeqLM.from_pretrained(model_paths[model_name],device_map="auto", torch_dtype=torch.float16)
+    elif model_name=='flan-t5-xl':
+        model = AutoModelForSeq2SeqLM.from_pretrained(model_paths[model_name],device_map="auto", torch_dtype=torch.float16)
+    elif model_name=='flan-t5-xxl':
+        model = AutoModelForSeq2SeqLM.from_pretrained(model_paths[model_name],device_map="auto", torch_dtype=torch.float16)
+    elif model_name=='gpt2':
+        model = AutoModelForCausalLM.from_pretrained(model_paths[model_name],device_map="auto", torch_dtype=torch.float16)
+    elif model_name=='falcon-7b-instruct':
+        model = AutoModelForCausalLM.from_pretrained(model_paths[model_name],device_map="auto",trust_remote_code=True, torch_dtype=torch.float16)
+    elif model_name=='falcon-7b':
+        model = AutoModelForCausalLM.from_pretrained(model_paths[model_name], device_map="auto",trust_remote_code=True,torch_dtype=torch.float16)
+    elif model_name=='falcon-40b-instruct':
+        model = AutoModelForCausalLM.from_pretrained(model_paths[model_name], device_map="auto",trust_remote_code=True,load_in_8bit=True)
+    elif model_name=='falcon-40b':
+        model = AutoModelForCausalLM.from_pretrained(model_paths[model_name], device_map="auto",trust_remote_code=True,load_in_8bit=True)
+    elif model_name=='falcon-180b':
+        model = AutoModelForCausalLM.from_pretrained(
+            model_paths[model_name],
+            quantization_config=bnb_config,
+            trust_remote_code=True,
+            device_map="auto",
+            torch_dtype=torch.float16,
+        )
+        model.config.use_cache = False
+    elif model_name in ('phi-1','phi-1_5'):
+        # No device_map is passed here, so the model is moved to the target device explicitly.
+        model = AutoModelForCausalLM.from_pretrained(model_paths[model_name],trust_remote_code=True, torch_dtype=torch.float16)
+        model.to(device)
+    else:
+        raise ValueError(f'Model not in list: {model_name}')
+    model.eval()
+    print('Loaded')
+    tokenizer.pad_token = tokenizer.eos_token
+    tokenizer.pad_token_id = tokenizer.eos_token_id
+    tokenizer.padding_side = "right"
+    tokenizer.truncation_side = "left"
+    return model,tokenizer
+
+def generate_questions(n=10,model=None,tokenizer=None):
+    questions = []
+    answers = []
+    data = pd.read_csv('IMDB-Movie-Data.csv', index_col='Rank')
+    instruction = 'Your task is to generate a question from a given context and answer such that the question can be answered without the context from a movie dataset. The question should be such that it can be answered using only one item from the given fields (Movie Title, Actors, Genre, Director, Year)'
+    global few_shot_examples
+    count = 0
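+    for idx,row in tqdm(data.iterrows(), total=len(data)):
+        # The `break` further down only exits the inner loop, so stop the outer loop here once n questions exist.
+        if count>=n:
+            break
+        temp_list = []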
+
+        #temp = row['Genre'].split(',')
+        #temp = [x.strip() for x in temp]
+        #temp_list.extend([(x,0) for x in temp])
+        
+        temp = row['Actors'].split(',')
+        temp = [x.strip() for x in temp][:2]
+        temp_list.extend([x for x in temp])
+
+        temp_list.append(row['Year'])
+        #temp_list.append(row['Director'])
+        temp_list.append(row['Title'])
+        description = row['Description'].strip()
+        genre = row['Genre']
+        actors = row['Actors']
+        year = row['Year']
+        director = row['Director']
+        title = row['Title']
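+        # Sample at most 4 candidates; asking for more than len(temp_list) with replace=False raises ValueError.
+        random_elements = np.random.choice(temp_list,min(4,len(temp_list)),replace=False)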
+        for random_element in random_elements:
+            k_shot = np.random.choice(few_shot_examples,2,replace=False)
+            prompt = instruction  + '\n\n' + '\n'.join(k_shot) + '\n' +  f'Context=\nMovie Title: {title}\nDescription: {description}\nActors: {actors}\nGenre: {genre}\nDirector: {director}\nYear: {year}\nAnswer={random_element}\nQuestion='
+            # Send the inputs to the model's device instead of hard-coding 'cuda'.
+            inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
+            sample_out_1 = model.generate(**inputs, do_sample=True, top_p=0.95, top_k=0, num_beams=5, min_new_tokens=5, max_new_tokens=50)
+            #sample_out_2 = model.generate(**inputs, do_sample=True, top_p=0.7, num_beams=5, temperature=1.3, top_k=0,min_new_tokens=15,max_new_tokens=100)
+            #print(sample_out_1)
+            #print(inputs.input_ids.shape,sample_out_1.shape)
+            generated_1 = tokenizer.batch_decode(sample_out_1, skip_special_tokens=True)
+            #generated_1 = [x.split('\n')[0] for x in generated_1]
+            print(generated_1)
+            #generated_2 = tokenizer.batch_decode(sample_out_2, skip_special_tokens=True)
+            questions.extend(generated_1)
+            #questions.extend(generated_2)
+            answers.append(random_element)
+            count+=1
+            if count==n:
+                break
+    return questions,answers
+
+
+if __name__=="__main__":
+    model_name = 'flan-t5-xxl'
+    model, tokenizer = setup_model(model_name)
+    questions,answers = generate_questions(4000,model,tokenizer)
+    print(questions,answers)
+    df = pd.DataFrame({'Questions': questions,'Answers': answers},columns=['Questions','Answers'])
+    df.to_csv('generated_questions.csv')
+
+        
+    
+    
+    
\ No newline at end of file
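
A minimal sketch, separate from the commit, of one way to cap the total number of (row, answer) pairs across nested loops like those in `generate_questions`: flatten the loops into a generator and slice it with `itertools.islice`. The helper name `iter_answer_candidates` and the two sample rows are hypothetical, and the random shuffling of candidates is left out for brevity.

```python
from itertools import islice


def iter_answer_candidates(rows, per_row=4):
    """Yield (row, candidate) pairs, at most per_row candidates per row."""
    for row in rows:
        # First two actors, then year and title, mirroring the fields used above.
        candidates = [a.strip() for a in row["Actors"].split(",")][:2]
        candidates += [str(row["Year"]), row["Title"]]
        for candidate in candidates[:per_row]:
            yield row, candidate


# Hypothetical stand-in rows; the real script reads IMDB-Movie-Data.csv.
rows = [
    {"Title": "The Departed", "Actors": "Leonardo DiCaprio, Matt Damon, Jack Nicholson", "Year": 2006},
    {"Title": "Southpaw", "Actors": "Jake Gyllenhaal, Rachel McAdams", "Year": 2015},
]

# islice stops the whole pipeline after 5 pairs, regardless of loop depth.
pairs = list(islice(iter_answer_candidates(rows), 5))
answers = [candidate for _, candidate in pairs]
```

Because the limit is applied to the flattened stream, no counter or nested `break` is needed; the model call would go inside a plain `for row, candidate in pairs:` loop.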
--- a/requirements.txt
+++ b/requirements.txt
+datasets==2.14.4
+gradio==4.7.1
+networkx==3.1
+numpy==1.24.4
+pandas==2.0.3
+plotly==5.18.0
+scikit_learn==1.2.2
+sentence_transformers==2.2.2
+torch==2.0.1
+tqdm==4.66.1
+transformers==4.31.0
--- a/retrieval.ipynb
+++ b/retrieval.ipynb
--- a/sample_questions.py
+++ b/sample_questions.py
+# Few-shot examples for generating pseudo-questions
+few_shot_examples = [
+'''Context=
+Movie Title: Guardians of the Galaxy
+Description: A group of intergalactic criminals are forced to work together to stop a fanatical warrior from taking control of the universe.
+Actors: Chris Pratt, Vin Diesel, Bradley Cooper, Zoe Saldana
+Genre: Action,Adventure,Sci-Fi
+Director: James Gunn
+Year: 2014
+Answer= Guardians of the Galaxy
+Question= What is the title of the 2014 film starring Chris Pratt and Vin Diesel about a group of intergalactic criminals?
+''',
+'''Context=
+Movie Title: The Departed
+Description: An undercover cop and a mole in the police attempt to identify each other while infiltrating an Irish gang in South Boston.
+Actors: Leonardo DiCaprio, Matt Damon, Jack Nicholson, Mark Wahlberg
+Genre: Crime,Drama,Thriller
+Director: Martin Scorsese
+Year: 2006
+Answer= Leonardo DiCaprio
+Question= Who played the role of an undercover cop in the 2006 crime drama thriller directed by Martin Scorsese titled "The Departed"?
+''',
+'''Context=
+Movie Title: Personal Shopper
+Description: A personal shopper in Paris refuses to leave the city until she makes contact with her twin brother who previously died there. Her life becomes more complicated when a mysterious person contacts her via text message.
+Actors: Kristen Stewart, Lars Eidinger, Sigrid Bouaziz,Anders Danielsen Lie
+Genre: Drama,Mystery,Thriller
+Director: Olivier Assayas
+Year: 2016
+Answer= 2016
+Question= What year did the movie "Personal Shopper," directed by Olivier Assayas and starring Kristen Stewart, Lars Eidinger, Sigrid Bouaziz, and Anders Danielsen Lie, hit the screens?
+''',
+'''Context=
+Movie Title: War Dogs
+Description: Based on the true story of two young men, David Packouz and Efraim Diveroli, who won a $300 million contract from the Pentagon to arm America's allies in Afghanistan.
+Actors: Jonah Hill, Miles Teller, Steve Lantz, Gregg Weiner
+Genre: Comedy,Crime,Drama
+Director: Todd Phillips
+Year: 2016
+Answer= War Dogs
+Question= What is the movie where Jonah Hill and Miles Teller play characters involved in winning a $300 million Pentagon contract to arm America's allies in Afghanistan?
+''',
+'''Context=
+Movie Title: The Accountant
+Description: As a math savant uncooks the books for a new client, the Treasury Department closes in on his activities and the body count starts to rise.
+Actors: Ben Affleck, Anna Kendrick, J.K. Simmons, Jon Bernthal
+Genre: Action,Crime,Drama
+Director: Gavin O'Connor
+Year: 2016
+Answer= 2016
+Question= In which year was the movie "The Accountant" directed by Gavin O'Connor starring Ben Affleck and Anna Kendrick released?
+''',
+'''Context=
+Movie Title: Pirates of the Caribbean: Dead Man's Chest
+Description: Jack Sparrow races to recover the heart of Davy Jones to avoid enslaving his soul to Jones' service, as other friends and foes seek the heart for their own agenda as well.
+Actors: Johnny Depp, Orlando Bloom, Keira Knightley, Jack Davenport
+Genre: Action,Adventure,Fantasy
+Director: Gore Verbinski
+Year: 2006
+Answer= 2006
+Question= In which year was the movie Pirates of the Caribbean: Dead Man's Chest released?
+''',
+'''Context=
+Movie Title: The Avengers
+Description: Earth's mightiest heroes must come together and learn to fight as a team if they are to stop the mischievous Loki and his alien army from enslaving humanity.
+Actors: Robert Downey Jr., Chris Evans, Scarlett Johansson,Jeremy Renner
+Genre: Action,Sci-Fi
+Director: Joss Whedon
+Year: 2012
+Answer= Robert Downey Jr.
+Question= Who played the character of Iron Man in "The Avengers"?
+''',
+'''Context=
+Movie Title: Mad Max: Fury Road
+Description: A woman rebels against a tyrannical ruler in postapocalyptic Australia in search for her home-land with the help of a group of female prisoners, a psychotic worshipper, and a drifter named Max.
+Actors: Tom Hardy, Charlize Theron, Nicholas Hoult, Zoë Kravitz
+Genre: Action,Adventure,Sci-Fi
+Director: George Miller
+Year: 2015
+Answer= Tom Hardy
+Question= Who played the character Max in Mad Max: Fury Road?
+''',
+'''Context=
+Movie Title: Magic Mike
+Description: A male stripper teaches a younger performer how to party, pick up women, and make easy money.
+Actors: Channing Tatum, Alex Pettyfer, Olivia Munn,Matthew McConaughey
+Genre: Comedy,Drama
+Director: Steven Soderbergh
+Year: 2012
+Answer= Channing Tatum
+Question= Who played a leading role in the 2012 movie "Magic Mike" about a male stripper?
+''',
+'''Context=
+Movie Title: The Incredible Hulk
+Description: Bruce Banner, a scientist on the run from the U.S. Government, must find a cure for the monster he emerges whenever he loses his temper.
+Actors: Edward Norton, Liv Tyler, Tim Roth, William Hurt
+Genre: Action,Adventure,Sci-Fi
+Director: Louis Leterrier
+Year: 2008
+Answer= Edward Norton
+Question= Who played Bruce Banner in the 2008 movie The Incredible Hulk about Bruce Banner, a scientist who becomes a monster?
+''',
+'''Context=
+Movie Title: Grown Ups 2
+Description: After moving his family back to his hometown to be with his friends and their kids, Lenny finds out that between old bullies, new bullies, schizo bus drivers, drunk cops on skis, and 400 costumed party crashers sometimes crazy follows you.
+Actors: Adam Sandler, Kevin James, Chris Rock, David Spade
+Genre: Comedy
+Director: Dennis Dugan
+Year: 2013
+Answer= Grown Ups 2
+Question= What is the title of the 2013 movie starring Adam Sandler in a family comedy?
+''',
+'''Context=
+Movie Title: The Wolverine
+Description: When Wolverine is summoned to Japan by an old acquaintance, he is embroiled in a conflict that forces him to confront his own demons.
+Actors: Hugh Jackman, Will Yun Lee, Tao Okamoto, Rila Fukushima
+Genre: Action,Adventure,Sci-Fi
+Director: James Mangold
+Year: 2013
+Answer= The Wolverine
+Question= What is the title of the 2013 movie directed by James Mangold starring Hugh Jackman?
+''',
+'''Context=
+Movie Title: Southpaw
+Description: Boxer Billy Hope turns to trainer Tick Wills to help him get his life back on track after losing his wife in a tragic accident and his daughter to child protection services.
+Actors: Jake Gyllenhaal, Rachel McAdams, Oona Laurence,Forest Whitaker
+Genre: Drama,Sport
+Director: Antoine Fuqua
+Year: 2015
+Answer= Southpaw
+Question= What is the title of a boxing film starring Jake Gyllenhaal in 2015 and directed by Antoine Fuqua?
+'''
+]
\ No newline at end of file
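
A minimal sketch, not part of the commit, of how entries shaped like `few_shot_examples` could be combined into a prompt in the same layout that `question_generation.py` builds: instruction, k sampled examples, then the new context ending in an open `Question=`. The function name `build_prompt`, the shortened example strings, the instruction text, and the abbreviated description are assumptions for illustration; `random.sample` stands in for `np.random.choice(..., replace=False)`.

```python
import random


def build_prompt(instruction, examples, row, answer, k=2, rng=None):
    """Assemble instruction + k sampled few-shot examples + the new context."""
    rng = rng or random.Random()
    shots = rng.sample(examples, k)  # k distinct examples, no repeats
    context = (
        "Context=\n"
        f"Movie Title: {row['Title']}\n"
        f"Description: {row['Description']}\n"
        f"Actors: {row['Actors']}\n"
        f"Genre: {row['Genre']}\n"
        f"Director: {row['Director']}\n"
        f"Year: {row['Year']}\n"
    )
    return instruction + "\n\n" + "\n".join(shots) + "\n" + context + f"Answer={answer}\nQuestion="


# Hypothetical, shortened stand-ins for entries of few_shot_examples.
examples = [
    "Context=\nMovie Title: War Dogs\nYear: 2016\nAnswer= War Dogs\nQuestion= Which 2016 film is this?\n",
    "Context=\nMovie Title: Southpaw\nYear: 2015\nAnswer= 2015\nQuestion= When was Southpaw released?\n",
    "Context=\nMovie Title: Magic Mike\nYear: 2012\nAnswer= 2012\nQuestion= When was Magic Mike released?\n",
]
row = {
    "Title": "The Incredible Hulk",
    "Description": "Bruce Banner must find a cure.",  # abbreviated for the sketch
    "Actors": "Edward Norton, Liv Tyler, Tim Roth, William Hurt",
    "Genre": "Action,Adventure,Sci-Fi",
    "Director": "Louis Leterrier",
    "Year": 2008,
}
prompt = build_prompt(
    "Generate a question for the given answer.", examples, row,
    "Edward Norton", rng=random.Random(0),
)
```

Passing a seeded `random.Random` makes the choice of examples reproducible, which can help when comparing generated questions across runs.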
--- a/test.txt
+++ b/test.txt
--- a/test_questions.csv
+++ b/test_questions.csv