Commit f7842ea7 authored by Meet Narendra

Updated readme

parent 3ce75350
Pipeline #1891 canceled with stages
# Knowledge Graph-based Question Answering for Movie Dataset
This README was generated with the help of ChatGPT ;)
## Overview
This project focuses on Information Retrieval for a movie dataset using a Knowledge Graph-based Question Answering approach. The knowledge graph comprises nodes representing actors, directors, movie titles, genres, and years. The goal is to retrieve the top-k relevant nodes for a given question about the IMDB movie dataset.
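As a rough illustration of the graph described above, the sketch below builds a tiny typed graph from two movie records. The record fields, the `build_graph` helper, and the `(node_type, value)` node keys are assumptions made for this example; they are not taken from the project's `preprocess.py`.

```python
from collections import defaultdict

# Hypothetical movie records; the real dataset schema may differ.
MOVIES = [
    {"title": "Inception", "year": 2010, "director": "Christopher Nolan",
     "actors": ["Leonardo DiCaprio"], "genres": ["Sci-Fi"]},
    {"title": "Titanic", "year": 1997, "director": "James Cameron",
     "actors": ["Leonardo DiCaprio", "Kate Winslet"], "genres": ["Romance"]},
]

def build_graph(movies):
    """Return an undirected adjacency map keyed by (node_type, value)."""
    graph = defaultdict(set)
    for movie in movies:
        title = ("title", movie["title"])
        neighbors = [("year", str(movie["year"])), ("director", movie["director"])]
        neighbors += [("actor", a) for a in movie["actors"]]
        neighbors += [("genre", g) for g in movie["genres"]]
        # Link every attribute node to its movie title, in both directions.
        for node in neighbors:
            graph[title].add(node)
            graph[node].add(title)
    return graph

graph = build_graph(MOVIES)
print(sorted(graph[("actor", "Leonardo DiCaprio")]))
```

Keying nodes by type as well as value keeps a year such as `"2010"` from colliding with a title of the same name.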
@@ -9,9 +11,9 @@ This project focuses on Information Retrieval for a movie dataset using a Knowle
```
.
├── data
├── models                  // Contains the SBERT node embedding model
│   ├── query_model.pt
│   └── node_model.pt       // Same as the SBERT node embedding model, converted to .pt format
├── sample_questions.py
├── eval.py
├── node_embed.py
```
@@ -32,12 +34,12 @@ This project focuses on Information Retrieval for a movie dataset using a Knowle
### Files and Descriptions
- **data**: Directory to store input data.
- **models**: Directory to save the node embedding model in Sentence-BERT format.
  - `query_model.pt`: Trained model for question embeddings.
  - `node_model.pt`: Trained model for node embeddings.
- **sample_questions.py**: Script containing sample questions for testing.
- **eval.py**: Evaluation script for assessing the performance of the retrieval system.
- **node_embed.py**: Script for node embedding training.
- **question_generation.py**: Script for generating questions.
- **scores.py**: Script containing scoring functions.
- **generated_questions.csv**: CSV file to store generated questions.
@@ -54,14 +56,13 @@ This project focuses on Information Retrieval for a movie dataset using a Knowle
## Getting Started
1. Install the required dependencies using `pip install -r requirements.txt`.
2. Run `preprocess.py` to preprocess the dataset and generate the nodes and graph in a trainable format. Then run `node_embed.py`, which initialises the node embeddings from an existing Sentence-BERT (SBERT) model and trains them on random walks with a triplet loss: the root node of each walk is the anchor, its neighbors are the positive samples, and a subset of the other nodes are the negative samples. Settings: walk length = 5, walks per node = 10, negative sample size = 30, batch size = 8192, epochs = 20.
3. Run `question_generation.py`. The query model has to be trained to align with the node embeddings from step 2, but few query-node training pairs are available. We therefore use few-shot prompting with an instruction-tuned model (flan-t5-large) to generate questions for a given node and its context. We tried various generation parameters to improve the sampled questions and generated 4000+ question-node pairs. Synthetic questions cannot be fully relied on, but a manual quality check gave us a good idea of how good the generated questions are.
4. Execute `retrieval.ipynb`. At this point we have the SBERT node embeddings and a large set of question-node pairs. We train a separate query model (another pretrained SBERT, different from the node embedding SBERT) to align the query embedding with the document (node) embedding. The setup is motivated by dual-encoder architectures, except that the node embedding SBERT model is kept frozen. The query model is trained with a cosine similarity loss between the query SBERT and node embedding SBERT outputs.
5. Evaluate the system using `eval.py`, which puts everything together. We have a query model and a document model, but to reduce the number of comparisons we first run a BERT-base NER model to identify named entities in the query. If a named entity is found, we retrieve its neighboring nodes up to k hops away and perform similarity matching only on those nodes. If no named entity is found, we compare the query against all nodes and rank them by similarity score. Compared with direct NER-based retrieval, this method can also answer queries that contain no named entities, and it produces a ranking, which direct named-entity hopping cannot. Additionally, a regex can be used to identify year nodes.
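The random-walk triplet sampling from step 2 can be sketched roughly as follows. This is a minimal illustration, not the code in `node_embed.py`: it assumes that every node visited on a walk counts as a positive, that negatives are drawn uniformly from unvisited nodes, and that the loss uses squared Euclidean distance with a margin of 1.0. The function names and the toy graph are hypothetical; the default walk settings mirror the values listed in step 2.

```python
import random

def random_walk(graph, start, walk_length, rng):
    """Uniform random walk of at most `walk_length` steps from `start`."""
    walk = [start]
    for _ in range(walk_length):
        neighbors = sorted(graph[walk[-1]])  # sorted for reproducibility
        if not neighbors:
            break
        walk.append(rng.choice(neighbors))
    return walk

def sample_triplets(graph, walk_length=5, num_walks=10, num_negatives=30, seed=0):
    """Return (anchor, positive, negatives) tuples built from random walks."""
    rng = random.Random(seed)
    nodes = sorted(graph)
    triplets = []
    for anchor in nodes:
        for _ in range(num_walks):
            walk = random_walk(graph, anchor, walk_length, rng)
            visited = set(walk)
            # Assumption: negatives are nodes the walk never touched.
            candidates = [n for n in nodes if n not in visited]
            negatives = rng.sample(candidates, min(num_negatives, len(candidates)))
            for positive in walk[1:]:
                if positive != anchor:
                    triplets.append((anchor, positive, negatives))
    return triplets

def triplet_loss(anchor, positive, negative, margin=1.0):
    """max(0, d(a, p) - d(a, n) + margin) with squared Euclidean distance."""
    d_pos = sum((a - p) ** 2 for a, p in zip(anchor, positive))
    d_neg = sum((a - n) ** 2 for a, n in zip(anchor, negative))
    return max(0.0, d_pos - d_neg + margin)

# Toy graph (hypothetical), small enough to inspect by hand.
toy_graph = {
    "Inception": {"Nolan", "DiCaprio"},
    "Titanic": {"Cameron", "DiCaprio"},
    "Nolan": {"Inception"},
    "DiCaprio": {"Inception", "Titanic"},
    "Cameron": {"Titanic"},
}
triplets = sample_triplets(toy_graph, walk_length=2, num_walks=3, num_negatives=2)
print(triplets[0])
```

In actual training the three vectors passed to `triplet_loss` would be the SBERT embeddings of the anchor, positive, and negative nodes.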
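For the alignment objective in step 4, one plausible form of a cosine similarity loss is the mean squared gap between the query-node cosine similarity and a target of 1.0 for matched pairs. The README does not state the exact formulation, so the target value and the `alignment_loss` helper below are assumptions. The node vectors are treated as constants, which corresponds to keeping the node embedding model frozen.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def alignment_loss(query_vecs, node_vecs):
    """Mean squared gap between cos(query, node) and the target similarity 1.0."""
    gaps = [(cosine(q, n) - 1.0) ** 2 for q, n in zip(query_vecs, node_vecs)]
    return sum(gaps) / len(gaps)

# Only the query vectors would change during training; node vectors stay fixed.
print(alignment_loss([[1.0, 0.0], [1.0, 0.0]], [[2.0, 0.0], [0.0, 1.0]]))
```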
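The retrieval flow from step 5 might look like the sketch below. It is not the logic of `eval.py`: the `entities` argument stands in for the output of the NER model, the year regex is a guessed pattern, the embeddings are hand-made 2-D vectors, and excluding the matched entity itself from the candidates is a design choice of this example.

```python
import math
import re
from collections import deque

YEAR_RE = re.compile(r"\b(?:19|20)\d{2}\b")  # assumed pattern for year nodes

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def k_hop_neighbors(graph, seeds, k):
    """Breadth-first search: all nodes within k hops of any seed node."""
    seen = set(seeds)
    frontier = deque((s, 0) for s in seeds)
    while frontier:
        node, depth = frontier.popleft()
        if depth == k:
            continue
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, depth + 1))
    return seen - set(seeds)

def retrieve(query, query_vec, graph, node_vecs, entities, k=1, top_k=3):
    """Rank k-hop neighbors of matched entities, or all nodes if none match."""
    seeds = [e for e in entities if e in graph]
    seeds += [y for y in YEAR_RE.findall(query) if y in graph]
    candidates = k_hop_neighbors(graph, seeds, k) if seeds else set(graph)
    ranked = sorted(candidates, key=lambda n: cosine(query_vec, node_vecs[n]),
                    reverse=True)
    return ranked[:top_k]

# Toy graph and hand-made embeddings (hypothetical values).
graph = {
    "Inception": {"Christopher Nolan", "2010"},
    "Christopher Nolan": {"Inception"},
    "2010": {"Inception"},
    "Titanic": {"James Cameron"},
    "James Cameron": {"Titanic"},
}
node_vecs = {
    "Inception": [1.0, 0.0], "Christopher Nolan": [0.9, 0.1], "2010": [0.1, 0.9],
    "Titanic": [0.5, 0.5], "James Cameron": [0.8, 0.2],
}
print(retrieve("Who directed Inception?", [1.0, 0.0], graph, node_vecs,
               entities=["Inception"]))
print(retrieve("Which movie came out in 2010?", [1.0, 0.0], graph, node_vecs,
               entities=[]))
```

Restricting candidates to the k-hop neighborhood keeps the number of similarity comparisons small, while the fallback to all nodes covers queries with no recognised entity or year.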
## Usage
- Modify `sample_questions.py` to add or modify sample questions.
- Run `question_generation.py` to generate questions based on the knowledge graph.
- Utilize the trained models (`query_model.pt` and `node_model.pt`) for embeddings.
- Evaluate the system using `eval.py` to assess retrieval performance.
## Contributors
- 22m0742 Meet Doshi (meetdoshi@cse.iitb.ac.in)
- 22m2103 Sameer Pimparkhede (sameerp@cse.iitb.ac.in)