Commit 3ce75350 authored by Meet Narendra

Initial commit

# Knowledge Graph-based Question Answering for Movie Dataset
## Overview
This project focuses on Information Retrieval for a movie dataset using a Knowledge Graph-based Question Answering approach. The knowledge graph comprises nodes representing actors, directors, movie titles, genres, and years. The goal is to retrieve the top-k relevant nodes for a given question about the IMDB movie dataset.
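
As a rough illustration of what "top-k relevant nodes" means here, the sketch below ranks node embeddings by cosine similarity to a question embedding. The function names and the toy 3-dimensional vectors are hypothetical; the real system uses 384-dimensional sentence embeddings.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def top_k_nodes(query_vec, node_vecs, node_names, k=10):
    """Return the k (name, score) pairs with the highest cosine similarity."""
    scored = [(name, cosine(query_vec, vec)) for name, vec in zip(node_names, node_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy embeddings (made-up values, for illustration only)
names = ["Chris Pratt", "James Gunn", "2014", "Action"]
vecs = [[1.0, 0.0, 0.0], [0.8, 0.6, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
query = [0.9, 0.1, 0.0]
ranked = top_k_nodes(query, vecs, names, k=2)
print(ranked)
```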
## Directory Structure
```
.
├── data
├── models
│   ├── query_model.pt
│   └── node_model.pt
├── sample_questions.py
├── eval.py
├── node_embed.py
├── question_generation.py
├── scores.py
├── generated_questions.csv
├── README.md
├── test_questions.csv
├── graph.json
├── plot.py
├── requirements.txt
├── test.txt
├── IMDB-Movie-Data.csv
├── preprocess.py
└── retrieval.ipynb
```
### Files and Descriptions
- **data**: Directory to store input data.
- **models**: Directory to save trained models.
  - `query_model.pt`: Trained model for question embeddings.
  - `node_model.pt`: Trained model for node embeddings.
- **sample_questions.py**: Script containing sample questions for testing.
- **eval.py**: Evaluation script for assessing the performance of the retrieval system.
- **node_embed.py**: Script for node embedding generation.
- **question_generation.py**: Script for generating questions.
- **scores.py**: Script containing scoring functions.
- **generated_questions.csv**: CSV file to store generated questions.
- **README.md**: This file, providing an overview of the project.
- **test_questions.csv**: CSV file containing test questions.
- **graph.json**: JSON file representing the knowledge graph.
- **plot.py**: Script for generating plots or visualizations.
- **requirements.txt**: File listing the required Python packages.
- **test.txt**: File for storing test-related information.
- **IMDB-Movie-Data.csv**: Input dataset containing information about movies.
- **preprocess.py**: Script for preprocessing the dataset.
- **retrieval.ipynb**: Jupyter notebook for the retrieval process.
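
The sketch below shows how a file shaped like `graph.json` could be read, assuming the key names used by the scripts in this commit (`id_dict`, `dict_id`, `node_list`, `edge_list`). The miniature graph and the helper `neighbor_names` are hypothetical. Note that JSON turns the integer keys of `id_dict` into strings, which is why lookups go through `str(...)`.

```python
import json

# Hypothetical miniature of graph.json; the real file is produced by preprocessing
graph_json = json.dumps({
    "id_dict": {"0": "Action", "1": "Chris Pratt", "2": 2014,
                "3": "Guardians of the Galaxy", "4": "James Gunn"},
    "dict_id": {"Action": 0, "Chris Pratt": 1, "2014": 2,
                "Guardians of the Galaxy": 3, "James Gunn": 4},
    "node_list": [0, 1, 2, 3, 4],
    "edge_list": [[3, 0], [3, 1], [3, 2], [3, 4], [1, 4]],
})
graph = json.loads(graph_json)

# Build an undirected adjacency map from the edge list
adjacency = {node: set() for node in graph["node_list"]}
for a, b in graph["edge_list"]:
    adjacency[a].add(b)
    adjacency[b].add(a)

def neighbor_names(name):
    """Return the sorted labels of all nodes one hop away from the named node."""
    node_id = graph["dict_id"][name]
    return sorted(str(graph["id_dict"][str(n)]) for n in adjacency[node_id])

print(neighbor_names("Chris Pratt"))
```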
## Getting Started
1. Install the required dependencies using `pip install -r requirements.txt`.
2. Run `preprocess.py` to preprocess the dataset and build the knowledge graph (`graph.json`).
3. Execute `retrieval.ipynb` for the information retrieval process.
4. Adjust parameters and experiment with different queries in `sample_questions.py`.
5. Evaluate the system using `eval.py` to assess retrieval performance.
## Usage
- Modify `sample_questions.py` to add or modify sample questions.
- Run `question_generation.py` to generate questions based on the knowledge graph.
- Utilize the trained models (`query_model.pt` and `node_model.pt`) for embeddings.
- Evaluate the system using `eval.py` to assess retrieval performance.
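
One way to score retrieval is hit@k: a question counts as correct when its answer appears among the top-k retrieved nodes, compared case-insensitively. The sketch below is a minimal version of that check; `hit_at_k` and the sample data are hypothetical, and `eval.py` may compute something different.

```python
def hit_at_k(predictions, gold_answers, k=10):
    """Fraction of questions whose gold answer appears in the top-k predictions."""
    hits = 0
    for ranked, gold in zip(predictions, gold_answers):
        top = [str(p).strip().lower() for p in ranked[:k]]
        if str(gold).strip().lower() in top:
            hits += 1
    return hits / len(gold_answers) if gold_answers else 0.0

# Made-up ranked outputs for three questions
predictions = [["Chris Pratt", "Vin Diesel"], ["2006", "2014"], ["Drama"]]
gold = ["vin diesel", "2016", "Drama"]
score = hit_at_k(predictions, gold, k=2)
print(f"hit@2 = {score:.3f}")
```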
import random
import gradio as gr
import json
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
from tqdm import tqdm
import pandas as pd
from transformers import AutoTokenizer,AutoModelForTokenClassification
from collections import Counter
import torch
from transformers import pipeline
import networkx as nx
device = 'cuda' if torch.cuda.is_available() else 'cpu'
tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/paraphrase-MiniLM-L6-v2')
NER_tokenizer = AutoTokenizer.from_pretrained("dslim/bert-base-NER")
NER_model = AutoModelForTokenClassification.from_pretrained("dslim/bert-base-NER")
query_model = torch.load('query_model.pt').to(device)
node_model = torch.load('node_model.pt').to(device)
with open('graph.json') as f:
    graph = json.load(f)
G = nx.Graph()
G.add_nodes_from(graph['node_list'])
G.add_edges_from(graph['edge_list'])
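# aggregation_strategy="simple" merges word-piece tokens into whole entities,
# so multi-token names such as "Chris Pratt" can match node labels.
ner_pipe = pipeline("ner", model=NER_model, tokenizer=NER_tokenizer, aggregation_strategy="simple")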
def mean_pooling(model_output, attention_mask):
    """Average the token embeddings, ignoring padded positions via the attention mask."""
    token_embeddings = model_output[0]  # First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)
def find_duplicates(lst):
    """Return the items that occur more than once in lst."""
    counter = Counter(lst)
    return [item for item, count in counter.items() if count > 1]
nodes = graph['id_dict'].values()
nodes = [str(x).strip() for x in nodes]
# for x,y in graph['edge_list']:
# print(graph['id_dict'][str(x)],graph['id_dict'][str(y)])
with torch.no_grad():
    encoded_docs = tokenizer(nodes, padding=True, truncation=True, return_tensors='pt').to(device)
    model_out = node_model(**encoded_docs)
    frozen_embeddings = mean_pooling(model_out, encoded_docs['attention_mask']).cpu().numpy()
print(frozen_embeddings.shape)
print(len(nodes),len(set(nodes)),find_duplicates(nodes),len(frozen_embeddings),len(frozen_embeddings[0]))
def get_neighbor(node, k):
    """Collect every node reachable from `node` within k hops (duplicates are kept)."""
    neighbors = []
    prev = [node]
    while k>0:
        k-=1
        curr_neighbors = []
        for n in prev:
            curr_neighbors.extend(G.neighbors(n))
        prev = curr_neighbors
        neighbors.extend(curr_neighbors)
    return neighbors
def get_top_k_nodes(message, history):
    with torch.no_grad():
        ner_tags = ner_pipe(message)
        tags_in_kg = []
        for tag in ner_tags:
            if tag['word'] in nodes:
                tags_in_kg.append(tag['word'])
        if len(tags_in_kg)==0:
            # Default approach: compare the query against every node embedding
            encoded_query = tokenizer(message, padding=True, truncation=True, return_tensors='pt').to(device)
            model_out = query_model(**encoded_query)
            embedded_query = mean_pooling(model_out, encoded_query['attention_mask']).cpu().numpy()
            scores = cosine_similarity(embedded_query, frozen_embeddings)
            argsort = scores.argsort()[0][::-1]
            selected_nodes = [nodes[x] for x in argsort[:10]]
            return "\n".join(selected_nodes)
        else:
            # NER approach: restrict the candidates to the 1-hop neighbors of the matched nodes
            neighbors = []
            for tag in tags_in_kg:
                neighbors.extend(get_neighbor(graph['dict_id'][tag],1))
            neighbors = [str(graph['id_dict'][str(i)]) for i in neighbors]
            encoded_docs = tokenizer(neighbors, padding=True, truncation=True, return_tensors='pt').to(device)
            model_out = node_model(**encoded_docs)
            neighbor_frozen_embeddings = mean_pooling(model_out, encoded_docs['attention_mask']).cpu().numpy()
            encoded_query = tokenizer(message, padding=True, truncation=True, return_tensors='pt').to(device)
            model_out = query_model(**encoded_query)
            embedded_query = mean_pooling(model_out, encoded_query['attention_mask']).cpu().numpy()
            scores = cosine_similarity(embedded_query, neighbor_frozen_embeddings)
            argsort = scores.argsort()[0][::-1]
            selected_nodes = [neighbors[x] for x in argsort[:10]]
            return "\n".join(selected_nodes)
questions = []
answers = []
total = 0
correct = 0
with open('test.txt', 'r') as file:
    # Each record spans three lines: the question, the answer line, and a third line that is ignored
    for line1, line2, line3 in tqdm(zip(file, file, file)):
        question = line1.strip()
        answer = line2[8:].strip()  # skip the 8-character prefix before the answer text
        predictions = get_top_k_nodes(question,"").split('\n')
        predictions = [x.strip().lower() for x in predictions]
        if answer.lower() in predictions:
            correct+=1
        else:
            print(answer)  # log the answers that were missed
        total+=1
        questions.append(question)
        answers.append(answer)
test_df = pd.DataFrame({'Questions': questions,'Answers': answers},columns=['Questions','Answers'])
test_df.to_csv('test_questions.csv')
if total:  # avoid dividing by zero when test.txt is empty
    print(f'Accuracy {correct*100/total}\n{correct} {total}')
gr.ChatInterface(get_top_k_nodes).launch()
{
"word_embedding_dimension": 384,
"pooling_mode_cls_token": false,
"pooling_mode_mean_tokens": true,
"pooling_mode_max_tokens": false,
"pooling_mode_mean_sqrt_len_tokens": false
}
---
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
- transformers
---
# {MODEL_NAME}
This is a [sentence-transformers](https://www.SBERT.net) model: It maps sentences & paragraphs to a 384 dimensional dense vector space and can be used for tasks like clustering or semantic search.
<!--- Describe your model here -->
## Usage (Sentence-Transformers)
Using this model becomes easy when you have [sentence-transformers](https://www.SBERT.net) installed:
```
pip install -U sentence-transformers
```
Then you can use the model like this:
```python
from sentence_transformers import SentenceTransformer
sentences = ["This is an example sentence", "Each sentence is converted"]
model = SentenceTransformer('{MODEL_NAME}')
embeddings = model.encode(sentences)
print(embeddings)
```
## Usage (HuggingFace Transformers)
Without [sentence-transformers](https://www.SBERT.net), you can use the model like this: First, you pass your input through the transformer model, then you have to apply the right pooling-operation on-top of the contextualized word embeddings.
```python
from transformers import AutoTokenizer, AutoModel
import torch
#Mean Pooling - Take attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0] #First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)
# Sentences we want sentence embeddings for
sentences = ['This is an example sentence', 'Each sentence is converted']
# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('{MODEL_NAME}')
model = AutoModel.from_pretrained('{MODEL_NAME}')
# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)
# Perform pooling. In this case, mean pooling.
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
print("Sentence embeddings:")
print(sentence_embeddings)
```
## Evaluation Results
<!--- Describe how your model was evaluated -->
For an automated evaluation of this model, see the *Sentence Embeddings Benchmark*: [https://seb.sbert.net](https://seb.sbert.net?model_name={MODEL_NAME})
## Training
The model was trained with the parameters:
**DataLoader**:
`torch.utils.data.dataloader.DataLoader` of length 443 with parameters:
```
{'batch_size': 8192, 'sampler': 'torch.utils.data.sampler.RandomSampler', 'batch_sampler': 'torch.utils.data.sampler.BatchSampler'}
```
**Loss**:
`sentence_transformers.losses.TripletLoss.TripletLoss` with parameters:
```
{'distance_metric': 'TripletDistanceMetric.EUCLIDEAN', 'triplet_margin': 5}
```
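
To make the loss parameters concrete, here is a small worked sketch of a triplet loss with Euclidean distance and a margin of 5, written in plain Python. It assumes the usual formulation `max(d(anchor, positive) - d(anchor, negative) + margin, 0)`. The vectors are made up for illustration and are not taken from the trained model.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_loss(anchor, positive, negative, margin=5.0):
    """Zero once the negative is at least `margin` farther from the anchor than the positive."""
    return max(euclidean(anchor, positive) - euclidean(anchor, negative) + margin, 0.0)

anchor = [0.0, 0.0]
positive = [3.0, 4.0]        # distance 5 from the anchor
far_negative = [0.0, 12.0]   # distance 12: margin satisfied, no loss
near_negative = [0.0, 1.0]   # distance 1: closer than the positive, large loss

loss_far = triplet_loss(anchor, positive, far_negative)
loss_near = triplet_loss(anchor, positive, near_negative)
print(loss_far, loss_near)
```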
Parameters of the fit()-Method:
```
{
"epochs": 10,
"evaluation_steps": 0,
"evaluator": "NoneType",
"max_grad_norm": 1,
"optimizer_class": "<class 'torch.optim.adamw.AdamW'>",
"optimizer_params": {
"lr": 2e-05
},
"scheduler": "WarmupLinear",
"steps_per_epoch": null,
"warmup_steps": 10000,
"weight_decay": 0.01
}
```
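
The sketch below estimates the learning rate these settings produce. It assumes that `WarmupLinear` means a linear ramp from zero over `warmup_steps` followed by a linear decay to zero, and that the total number of optimizer steps is the DataLoader length times the number of epochs (443 × 10). Under those assumptions the run ends before the 10000-step warmup finishes, so the learning rate never reaches `2e-05`. The helper name is hypothetical.

```python
def warmup_linear_factor(step, warmup_steps, total_steps):
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

base_lr = 2e-05
warmup_steps = 10000
total_steps = 443 * 10  # assumed: DataLoader length x epochs

# The last step is still inside the warmup phase, so this is the highest rate reached
peak_lr = base_lr * warmup_linear_factor(total_steps, warmup_steps, total_steps)
print(f"total steps: {total_steps}, peak learning rate: {peak_lr:.3e}")
```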
## Full Model Architecture
```
SentenceTransformer(
(0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: BertModel
(1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
)
```
## Citing & Authors
<!--- Describe where people can find more information -->
{
"_name_or_path": "/raid/nlp/pranavg/.cache/torch/sentence_transformers/sentence-transformers_paraphrase-MiniLM-L6-v2/",
"architectures": [
"BertModel"
],
"attention_probs_dropout_prob": 0.1,
"classifier_dropout": null,
"gradient_checkpointing": false,
"hidden_act": "gelu",
"hidden_dropout_prob": 0.1,
"hidden_size": 384,
"initializer_range": 0.02,
"intermediate_size": 1536,
"layer_norm_eps": 1e-12,
"max_position_embeddings": 512,
"model_type": "bert",
"num_attention_heads": 12,
"num_hidden_layers": 6,
"pad_token_id": 0,
"position_embedding_type": "absolute",
"torch_dtype": "float32",
"transformers_version": "4.28.1",
"type_vocab_size": 2,
"use_cache": true,
"vocab_size": 30522
}
{
"__version__": {
"sentence_transformers": "2.0.0",
"transformers": "4.7.0",
"pytorch": "1.9.0+cu102"
}
}
[
{
"idx": 0,
"name": "0",
"path": "",
"type": "sentence_transformers.models.Transformer"
},
{
"idx": 1,
"name": "1",
"path": "1_Pooling",
"type": "sentence_transformers.models.Pooling"
}
]
{
"max_seq_length": 128,
"do_lower_case": false
}
{
"cls_token": "[CLS]",
"mask_token": "[MASK]",
"pad_token": "[PAD]",
"sep_token": "[SEP]",
"unk_token": "[UNK]"
}
{
"clean_up_tokenization_spaces": true,
"cls_token": "[CLS]",
"do_basic_tokenize": true,
"do_lower_case": true,
"mask_token": "[MASK]",
"model_max_length": 512,
"never_split": null,
"pad_token": "[PAD]",
"sep_token": "[SEP]",
"strip_accents": null,
"tokenize_chinese_chars": true,
"tokenizer_class": "BertTokenizer",
"unk_token": "[UNK]"
}
import networkx as nx
import random
import numpy as np
from typing import List
from tqdm import tqdm
import json
from sentence_transformers import SentenceTransformer, losses, InputExample
from torch.utils.data import DataLoader
import os
if not os.path.exists('./models/'):
    os.makedirs('./models/')
# Hyperparams
# We saw that keeping walks shorter but wider helped in generating better node embeddings.
# We use Sentence-BERT as an embedding layer so that some semantic information can be
# retained from the values of the nodes. Example: names of actors or directors already
# form a cluster without any node embedding training, so we utilise that and fine-tune
# it to generate better embeddings using random walks.
walk_length=5
num_walks=10
negative_sample_size=30
batch_size=8192 #recommended to lower it for a general purpose cpu
epochs=20
class DeepWalk:
    def __init__(self, walk_length: int, walks_per_node: int):
        """
        :param walk_length: length of the walk
        :param walks_per_node: number of walks per node
        """
        self.walk_length = walk_length
        self.walks_per_node = walks_per_node
    def random_walk(self, g: nx.Graph, start: str, use_probabilities: bool = False) -> List[str]:
        """
        Generate a random walk starting on start
        :param g: Graph
        :param start: starting node for the random walk
        :param use_probabilities: if True take into account the weights assigned to each edge to select the next candidate
        :return: a random walk starting from the 'start' node
        """
        walk = [start]
        for i in range(self.walk_length):
            neighs = list(g.neighbors(walk[i]))
            if not neighs:
                # Isolated node: stop early instead of sampling from an empty list
                break
            if use_probabilities:
                probabilities = [g.get_edge_data(walk[i], neig)["weight"] for neig in neighs]
                sum_probabilities = sum(probabilities)
                probabilities = [t / sum_probabilities for t in probabilities]
                p = np.random.choice(neighs, p=probabilities)
            else:
                p = random.choice(neighs)
            walk.append(p)
        return walk
    def get_walks(self, g: nx.Graph, use_probabilities: bool = False) -> List[List[str]]:
        """
        Generate all the random walks
        :param g: Graph
        :param use_probabilities: if True, sample each step in proportion to the edge weights
        :return: a list of walks, walks_per_node for every node in the graph
        """
        random_walks = []
        for _ in range(self.walks_per_node):
            random_nodes = list(g.nodes)
            random.shuffle(random_nodes)
            for node in tqdm(random_nodes):
                random_walks.append(self.random_walk(g=g, start=node, use_probabilities=use_probabilities))
        return random_walks
with open('graph.json') as f:
    graph = json.load(f)
#print(graph)
G = nx.Graph()
G.add_nodes_from(graph['node_list'])
G.add_edges_from(graph['edge_list'])
# for node in graph['node_list']:
# if str(node) not in graph['id_dict']:
# print(node)
# for node in G.nodes():
# neighbors = list(G.neighbors(node))
# print(f'{graph["id_dict"][str(node)]} {[graph["id_dict"][str(x)] for x in neighbors]}')
# print(len(graph['node_list']),len(set(graph['node_list'])))
# exit()
DW = DeepWalk(walk_length,num_walks)
walks = DW.get_walks(G)
def get_negative_samples(walk, nodelist, k):
    """
    Get negative samples for training. Given a walk (or a single node), draw k negative
    samples uniformly at random from the nodes that are not part of the walk. This is a
    simple heuristic rather than an ideal sampling strategy, but it was sufficient for
    this assignment.
    :param walk: a single node or a list of nodes
    :param nodelist: a list of all nodes in the KG
    :param k: the number of negative samples to extract
    :return: an array of k negative samples
    """
    negative_sample_space = list(set(nodelist)-set(walk))
    selected_samples = np.random.choice(negative_sample_space,k,replace=False)
    return selected_samples
train_examples = []
# Generate training triplets (anchor, positive, negative) from the extracted walks:
# the anchor is the start of the walk, positives are the other nodes on the walk
for walk in walks:
    negative_samples = get_negative_samples(walk,graph['node_list'],negative_sample_size)
    sentence_walk = [str(graph['id_dict'][str(x)]) for x in walk]
    sentence_negatives = [str(graph['id_dict'][str(x)]) for x in negative_samples]
    for i in range(1,len(sentence_walk)):
        if sentence_walk[0]!=sentence_walk[i]:
            for negative in sentence_negatives:
                train_examples.append(InputExample(texts=[sentence_walk[0],sentence_walk[i],negative]))
print('Length of train examples',len(train_examples))
model = SentenceTransformer('paraphrase-MiniLM-L6-v2')
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=batch_size)
train_loss = losses.TripletLoss(model=model)
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=epochs, show_progress_bar=True)
model.save('./models/')
import json
with open('graph.json') as f:
    graph = json.load(f)
from sentence_transformers import SentenceTransformer, losses, InputExample
model = SentenceTransformer('./models/')
nodes = graph['id_dict'].values()
nodes = [str(x) for x in nodes][:200]
frozen_embeddings = model.encode(nodes)
from sklearn.manifold import TSNE
# Create a t-SNE model
tsne = TSNE(n_components=2, random_state=42)
# Fit and transform your embeddings using t-SNE
embeddings_2d = tsne.fit_transform(frozen_embeddings)
import pandas as pd
import plotly.express as px
# Create a DataFrame for Plotly Express
df = pd.DataFrame({'x': embeddings_2d[:, 0], 'y': embeddings_2d[:, 1], 'word': nodes})
# Scatter plot with annotations using Plotly Express
fig = px.scatter(df, x='x', y='y', text='word', title='t-SNE Visualization of Word Embeddings')
fig.show()
import pandas as pd
data = pd.read_csv('IMDB-Movie-Data.csv', index_col='Rank')
print(data.head())
print(data.columns)
titles = set(data['Title'])
genres = set()
description = set(data['Description'])
director = set(data['Director'])
actors = set()
year = set()
for idx,row in data.iterrows():
    temp = row['Genre'].split(',')
    temp = [x.strip() for x in temp]
    genres.update(temp)
    temp = row['Actors'].split(',')
    temp = [x.strip() for x in temp]
    actors.update(temp)
    year.update([row['Year']])
graph = {
# different sets of nodes
'titles': list(titles),
'genres': list(genres),
#'description': list(description),
'director': list(director),
'actors': list(actors),
'year': list(year),
}
count = 0
graph['dict_id'] = {}
graph['id_dict'] = {}
def add_node(value):
    """Assign the next free integer ID to value unless it already has one."""
    global count
    if value not in graph['dict_id']:
        graph['dict_id'][value] = count
        graph['id_dict'][count] = value
        count+=1
# IDs are assigned row by row in the order: genres, actors, year, title, director
for idx,row in data.iterrows():
    for x in [g.strip() for g in row['Genre'].split(',')]:
        add_node(x)
    for x in [a.strip() for a in row['Actors'].split(',')]:
        add_node(x)
    add_node(row['Year'])
    add_node(row['Title'])
    add_node(row['Director'])
edge_list = []
for idx,row in data.iterrows():
    # Collect (node_id, type_code) pairs for this movie:
    # 0 = genre, 1 = actor, 2 = year, 3 = director, 4 = title
    temp_list = []
    temp = row['Genre'].split(',')
    temp = [x.strip() for x in temp]
    temp_list.extend([(graph['dict_id'][x],0) for x in temp])
    temp = row['Actors'].split(',')
    temp = [x.strip() for x in temp]
    temp_list.extend([(graph['dict_id'][x],1) for x in temp])
    temp_list.append((graph['dict_id'][row['Year']],2))
    temp_list.append((graph['dict_id'][row['Director']],3))
    temp_list.append((graph['dict_id'][row['Title']],4))
    print(len(temp_list))
    # Link every pair of entities of different types that co-occur in this row
    for x in temp_list:
        for y in temp_list:
            if x[0]!=y[0] and x[1]!=y[1]:
                edge_list.append((x[0],y[0]))
                # The reverse edge must use node IDs (index 0), not type codes (index 1)
                edge_list.append((y[0],x[0]))
edge_list = list(set(edge_list))
#for x,y in edge_list:
#    print(graph['id_dict'][x],graph['id_dict'][y])
graph['edge_list'] = edge_list
graph['node_list'] = [i for i in range(count)]
for node in graph['node_list']:
    if node not in graph['id_dict']:
        print(node)
# Each undirected edge is stored in both directions, hence the division by 2
print(count, len(edge_list)/2)
import json
with open('graph.json','w') as f:
    json.dump(graph,f)
# import networkx as nx
# import matplotlib.pyplot as plt
# nodes = [i for i in range(count)]
# edges = edge_list
# G = nx.Graph()
# G.add_nodes_from(nodes)
# G.add_edges_from(edges)
# # Plot the graph interactively
# pos = nx.spring_layout(G)
# nx.draw(G, pos, with_labels=True, font_weight='bold', node_size=700, node_color='skyblue', font_color='black', font_size=10, edge_color='gray')
# # Make the plot interactive with zooming capabilities
# plt.savefig('graph.png')
#This file generates synthetic questions for training a retrieval model. We use few shot prompting to generate questions for a particular answer and context. We then filter the questions based on whether the model is able to answer the question correctly or not.
import os
from transformers import AutoTokenizer,AutoModelForCausalLM,BitsAndBytesConfig
import torch
import pandas as pd
import numpy as np
from datasets import Dataset
#os.environ["CUDA_VISIBLE_DEVICES"] = "3"
os.system("echo $CUDA_VISIBLE_DEVICES")
#eval()
from transformers import AutoTokenizer, DataCollatorWithPadding,AutoModelForCausalLM
from transformers import LlamaForCausalLM #llama
from transformers import FalconForCausalLM #Falcon
from transformers import T5ForConditionalGeneration #t5
from transformers import AutoModelForSeq2SeqLM #Flant5
from transformers import GPT2LMHeadModel #gpt2
import json
import random
from sample_questions import few_shot_examples
from tqdm import tqdm
from transformers import pipeline
# Quantisation config for 180b falcon
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_use_double_quant=True,
bnb_4bit_compute_dtype=torch.float16
)
def setup_model(model_name):
    model_paths = {
        'llama-2-7b': '/raid/nlp/models/llama-2-7B-hf',
        'llama-2-7b-chat': '/raid/nlp/models/llama-2-7b-chat-hf',
        'llama-2-13b': '/raid/nlp/models/llama-2-13b-hf',
        'llama-2-13b-chat': '/raid/nlp/models/llama-2-13b-chat-hf',
        'llama-2-70b': '/raid/nlp/models/llama-2-70b-hf',
        'llama-2-70b-chat': '/raid/nlp/models/llama-2-70b-chat-hf',
        't5-3b': '/raid/nlp/models/t5-3b',
        't5-11b': '/raid/nlp/models/t5-11b',
        'flan-t5-small': '/raid/nlp/models/flan-t5-small',
        'flan-t5-base': '/raid/nlp/models/flan-t5-base',
        'flan-t5-large': '/raid/nlp/models/flan-t5-large',
        'flan-t5-xl': '/raid/nlp/models/flan-t5-xl',
        'flan-t5-xxl': '/raid/nlp/models/flan-t5-xxl',
        'falcon-40b-instruct': '/raid/nlp/models/falcon-40b-instruct',
        'falcon-40b': '/raid/nlp/models/falcon-40b',
        'gpt2': '/raid/nlp/models/gpt2',
        'phi-1': '/raid/nlp/models/phi-1',
        'falcon-7b-instruct': '/raid/nlp/models/falcon-7b-instruct',
        'falcon-7b': '/raid/nlp/models/falcon-7b',
        'falcon-180b': '/raid/nlp/models/falcon-180b/falcon-180B',
        'phi-1_5': '/raid/nlp/models/phi-1_5'
    }
    device = "cuda" if torch.cuda.is_available() else "cpu"
    if model_name not in model_paths.keys():
        raise Exception('Model not found')
    model = None
    tokenizer = None
    torch.cuda.empty_cache()
    print('Loading Model...')
    path = model_paths[model_name]
    tokenizer = AutoTokenizer.from_pretrained(path)
    # Branches that shared identical loading arguments are grouped by model family
    if model_name in ('llama-2-7b', 'llama-2-7b-chat', 'llama-2-13b', 'llama-2-13b-chat'):
        model = LlamaForCausalLM.from_pretrained(path, device_map="auto", torch_dtype=torch.float16)
        model.bfloat16()
    elif model_name in ('llama-2-70b', 'llama-2-70b-chat'):
        model = LlamaForCausalLM.from_pretrained(path, device_map="auto", load_in_8bit=True)
        model.bfloat16()
    elif model_name in ('t5-3b', 't5-11b'):
        model = T5ForConditionalGeneration.from_pretrained(path, device_map="auto", torch_dtype=torch.float16)
    elif model_name in ('flan-t5-small', 'flan-t5-base', 'flan-t5-large', 'flan-t5-xl', 'flan-t5-xxl'):
        model = AutoModelForSeq2SeqLM.from_pretrained(path, device_map="auto", torch_dtype=torch.float16)
    elif model_name=='gpt2':
        model = AutoModelForCausalLM.from_pretrained(path, device_map="auto", torch_dtype=torch.float16)
    elif model_name in ('falcon-7b-instruct', 'falcon-7b'):
        model = AutoModelForCausalLM.from_pretrained(path, device_map="auto", trust_remote_code=True, torch_dtype=torch.float16)
    elif model_name in ('falcon-40b-instruct', 'falcon-40b'):
        model = AutoModelForCausalLM.from_pretrained(path, device_map="auto", trust_remote_code=True, load_in_8bit=True)
    elif model_name=='falcon-180b':
        model = AutoModelForCausalLM.from_pretrained(
            path,
            quantization_config=bnb_config,
            trust_remote_code=True,
            device_map="auto",
            torch_dtype=torch.float16,
        )
        model.config.use_cache = False
    elif model_name in ('phi-1', 'phi-1_5'):
        model = AutoModelForCausalLM.from_pretrained(path, trust_remote_code=True, torch_dtype=torch.float16)
        model.to(device)
    else:
        raise Exception('Model not in list')
    model.eval()
    print('Loaded')
    tokenizer.pad_token = tokenizer.eos_token
    tokenizer.pad_token_id = tokenizer.eos_token_id
    tokenizer.padding_side = "right"
    tokenizer.truncation_side = "left"
    return model,tokenizer
def generate_questions(n=10,model=None,tokenizer=None):
    """Generate up to n (question, answer) pairs by few-shot prompting over the dataset rows."""
    questions = []
    answers = []
    data = pd.read_csv('IMDB-Movie-Data.csv', index_col='Rank')
    instruction = 'Your task is to generate a question from a given context and answer such that the question can be answered without the context from a movie dataset. The question should be such that it can be answered using only one item from the given fields (Movie Title, Actors, Genre, Director, Year)'
    count = 0
    for idx,row in tqdm(data.iterrows()):
        # Candidate answers for this row: the first two actors, the year and the title
        temp_list = []
        temp = row['Actors'].split(',')
        temp = [x.strip() for x in temp][:2]
        temp_list.extend([x for x in temp])
        temp_list.append(row['Year'])
        #temp_list.append(row['Director'])
        temp_list.append(row['Title'])
        description = row['Description'].strip()
        genre = row['Genre']
        actors = row['Actors']
        year = row['Year']
        director = row['Director']
        title = row['Title']
        random_elements = np.random.choice(temp_list,4,replace=False)
        for random_element in random_elements:
            k_shot = np.random.choice(few_shot_examples,2,replace=False)
            prompt = instruction + '\n\n' + '\n'.join(k_shot) + '\n' + f'Context=\nMovie Title: {title}\nDescription: {description}\nActors: {actors}\nGenre: {genre}\nDirector: {director}\nYear: {year}\nAnswer={random_element}\nQuestion='
            # Send the inputs to the model's own device rather than a hard-coded 'cuda'
            inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
            sample_out_1 = model.generate(**inputs, do_sample=True, top_p=0.95, top_k=0, num_beams=5, min_new_tokens=5, max_new_tokens=50)
            generated_1 = tokenizer.batch_decode(sample_out_1, skip_special_tokens=True)
            print(generated_1)
            questions.extend(generated_1)
            answers.extend([random_element])
            count+=1
            if count>=n:
                break
        # Stop once n questions exist, even when n is not a multiple of 4
        if count>=n:
            break
    return questions,answers
if __name__=="__main__":
    model_name = 'flan-t5-xxl'
    model, tokenizer = setup_model(model_name)
    questions,answers = generate_questions(4000,model,tokenizer)
    print(questions,answers)
    df = pd.DataFrame({'Questions': questions,'Answers': answers},columns=['Questions','Answers'])
    df.to_csv('generated_questions.csv')
datasets==2.14.4
gradio==4.7.1
networkx==3.1
numpy==1.24.4
pandas==2.0.3
plotly==5.18.0
scikit_learn==1.2.2
sentence_transformers==2.2.2
torch==2.0.1
tqdm==4.66.1
transformers==4.31.0
#Few shot examples for generating pseudo questions
few_shot_examples = [
'''Context=
Movie Title: Guardians of the Galaxy
Description: A group of intergalactic criminals are forced to work together to stop a fanatical warrior from taking control of the universe.
Actors: Chris Pratt, Vin Diesel, Bradley Cooper, Zoe Saldana
Genre: Action,Adventure,Sci-Fi
Director: James Gunn
Year: 2014
Answer= Guardians of the Galaxy
Question= What is the title of the 2014 film starring Chris Pratt and Vin Diesel about a group of intergalactic criminals?
''',
'''Context=
Movie Title: The Departed
Description: An undercover cop and a mole in the police attempt to identify each other while infiltrating an Irish gang in South Boston.
Actors: Leonardo DiCaprio, Matt Damon, Jack Nicholson, Mark Wahlberg
Genre: Crime,Drama,Thriller
Director: Martin Scorsese
Year: 2006
Answer= Leonardo DiCaprio
Question= Who played the role of an undercover cop in the 2006 crime drama thriller directed by Martin Scorsese titled "The Departed"?
''',
'''Context=
Movie Title: Personal Shopper
Description: A personal shopper in Paris refuses to leave the city until she makes contact with her twin brother who previously died there. Her life becomes more complicated when a mysterious person contacts her via text message.
Actors: Kristen Stewart, Lars Eidinger, Sigrid Bouaziz,Anders Danielsen Lie
Genre: Drama,Mystery,Thriller
Director: Olivier Assayas
Year: 2016
Answer= 2016
Question= What year did the movie "Personal Shopper," directed by Olivier Assayas and starring Kristen Stewart, Lars Eidinger, Sigrid Bouaziz, and Anders Danielsen Lie, hit the screens?
''',
'''Context=
Movie Title: War Dogs
Description: Based on the true story of two young men, David Packouz and Efraim Diveroli, who won a $300 million contract from the Pentagon to arm America's allies in Afghanistan.
Actors: Jonah Hill, Miles Teller, Steve Lantz, Gregg Weiner
Genre: Comedy,Crime,Drama
Director: Todd Phillips
Year: 2016
Answer= War Dogs
Question= What is the movie where Jonah Hill and Miles Teller play characters involved in winning a $300 million Pentagon contract to arm America's allies in Afghanistan?
''',
'''Context=
Movie Title: The Accountant
Description: As a math savant uncooks the books for a new client, the Treasury Department closes in on his activities and the body count starts to rise.
Actors: Ben Affleck, Anna Kendrick, J.K. Simmons, Jon Bernthal
Genre: Action,Crime,Drama
Director: Gavin O'Connor
Year: 2016
Answer= 2016
Question= In which year was the movie "The Accountant" directed by Gavin O'Connor starring Ben Affleck and Anna Kendrick released?
''',
'''Context=
Movie Title: Pirates of the Caribbean: Dead Man's Chest
Description: Jack Sparrow races to recover the heart of Davy Jones to avoid enslaving his soul to Jones' service, as other friends and foes seek the heart for their own agenda as well.
Actors: Johnny Depp, Orlando Bloom, Keira Knightley, Jack Davenport
Genre: Action,Adventure,Fantasy
Director: Gore Verbinski
Year: 2006
Answer= 2006
Question= In which year was the movie Pirates of the Caribbean: Dead Man's Chest released?
''',
'''Context=
Movie Title: The Avengers
Description: Earth's mightiest heroes must come together and learn to fight as a team if they are to stop the mischievous Loki and his alien army from enslaving humanity.
Actors: Robert Downey Jr., Chris Evans, Scarlett Johansson,Jeremy Renner
Genre: Action,Sci-Fi
Director: Joss Whedon
Year: 2012
Answer= Robert Downey Jr.
Question= Who played the character of Iron Man in "The Avengers"?
''',
'''Context=
Movie Title: Mad Max: Fury Road
Description: A woman rebels against a tyrannical ruler in postapocalyptic Australia in search for her home-land with the help of a group of female prisoners, a psychotic worshipper, and a drifter named Max.
Actors: Tom Hardy, Charlize Theron, Nicholas Hoult, Zoë Kravitz
Genre: Action,Adventure,Sci-Fi
Director: George Miller
Year: 2015
Answer= Tom Hardy
Question= Who played the character Max in Mad Max: Fury Road?
''',
'''Context=
Movie Title: Magic Mike
Description: A male stripper teaches a younger performer how to party, pick up women, and make easy money.
Actors: Channing Tatum, Alex Pettyfer, Olivia Munn,Matthew McConaughey
Genre: Comedy,Drama
Director: Steven Soderbergh
Year: 2012
Answer= Channing Tatum
Question= Who played a leading role in the 2012 movie "Magic Mike" about a male stripper?
''',
'''Context=
Movie Title: The Incredible Hulk
Description: Bruce Banner, a scientist on the run from the U.S. Government, must find a cure for the monster he emerges whenever he loses his temper.
Actors: Edward Norton, Liv Tyler, Tim Roth, William Hurt
Genre: Action,Adventure,Sci-Fi
Director: Louis Leterrier
Year: 2008
Answer= Edward Norton
Question= Who played Bruce Banner in the 2008 movie Incredible Hulk about Bruce Banner, a scientist who becomes a monster?
''',
'''Context=
Movie Title: Grown Ups 2
Description: After moving his family back to his hometown to be with his friends and their kids, Lenny finds out that between old bullies, new bullies, schizo bus drivers, drunk cops on skis, and 400 costumed party crashers sometimes crazy follows you.
Actors: Adam Sandler, Kevin James, Chris Rock, David Spade
Genre: Comedy
Director: Dennis Dugan
Year: 2013
Answer= Grown Ups 2
Question= What is the title of the 2013 movie starring Adam Sandler in a family comedy?
''',
'''Context=
Movie Title: The Wolverine
Description: When Wolverine is summoned to Japan by an old acquaintance, he is embroiled in a conflict that forces him to confront his own demons.
Actors: Hugh Jackman, Will Yun Lee, Tao Okamoto, Rila Fukushima
Genre: Action,Adventure,Sci-Fi
Director: James Mangold
Year: 2013
Answer= The Wolverine
Question= What is the title of the 2013 movie directed by James Mangold starring Hugh Jackman?
''',
'''Context=
Movie Title: Southpaw
Description: Boxer Billy Hope turns to trainer Tick Wills to help him get his life back on track after losing his wife in a tragic accident and his daughter to child protection services.
Actors: Jake Gyllenhaal, Rachel McAdams, Oona Laurence,Forest Whitaker
Genre: Drama,Sport
Director: Antoine Fuqua
Year: 2015
Answer= Southpaw
Question= What is the title of a boxing film starring Jake Gyllenhaal in 2015 and directed by Antoine Fuqua?
'''
]
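
The few-shot examples above are plain strings in a `Context=` / `Answer=` / `Question=` layout, so a prompt for a new movie can be assembled by joining a few of them and appending an unfinished example that stops at `Question=`. The sketch below is an assumption about how such a prompt might be built, not the repository's actual code: `build_prompt`, `demo_examples`, and the parameters `k` and `seed` are hypothetical names introduced for illustration, and the demo strings are shortened stand-ins for entries of `few_shot_examples`.

```python
import random

# Hypothetical, shortened stand-ins for entries of few_shot_examples.
demo_examples = [
    "Context=\nMovie Title: Southpaw\nYear: 2015\n"
    "Answer= Southpaw\nQuestion= What is the title of the 2015 boxing film?\n",
    "Context=\nMovie Title: Magic Mike\nYear: 2012\n"
    "Answer= Channing Tatum\nQuestion= Who played a leading role in Magic Mike?\n",
    "Context=\nMovie Title: The Wolverine\nYear: 2013\n"
    "Answer= The Wolverine\nQuestion= What is the title of the 2013 film?\n",
]


def build_prompt(context, answer, examples, k=3, seed=None):
    """Join up to k sampled examples and append an open-ended query block."""
    rng = random.Random(seed)  # local RNG so the global random state is untouched
    shots = rng.sample(examples, min(k, len(examples)))
    parts = [shot.strip() for shot in shots]
    # The final block ends at "Question=" so the model completes the question.
    parts.append(f"Context=\n{context.strip()}\nAnswer= {answer}\nQuestion=")
    return "\n\n".join(parts)


if __name__ == "__main__":
    prompt = build_prompt(
        context="Movie Title: The Departed\nDirector: Martin Scorsese\nYear: 2006",
        answer="2006",
        examples=demo_examples,
        k=2,
        seed=0,
    )
    print(prompt)
```

Sampling a subset of examples per call, rather than always sending the full list, keeps the prompt short and varies the question styles the model sees; whether the original script samples or uses all examples is not shown here.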