Part Three: How to chat with your documents in a Local Chatbot using OpenAI API

Rohit Raj
2 min readNov 29, 2023


In part two of this series, I showed how to chat with documents using the OpenAI API.

That implementation had a limitation: it sent the entire content of every document as context to the API. In this tutorial we will modify our code so that it sends only the relevant content to the API.

For this we will use the FAISS vector store and OpenAI embeddings. We will read our documents and save them in the FAISS vector store using OpenAI embeddings. Then, whenever we have a query, we will first search the vector store to retrieve the relevant content, and pass only that content to the API.

First we will import the required libraries. Python 3.10.2 is preferred for langchain.

import os

import gradio as gr
import docx2txt
from openai import OpenAI
from langchain.document_loaders import TextLoader, PyPDFLoader, Docx2txtLoader
from langchain.vectorstores import FAISS
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings

Next I wrote a function for reading files, using the loaders provided in langchain:

def read_text_from_file(file_path):
    # Check the file type and pick the matching loader
    if file_path.endswith('.docx'):
        loader = Docx2txtLoader(file_path)
    elif file_path.endswith('.pdf'):
        loader = PyPDFLoader(file_path)
    elif file_path.endswith('.txt'):
        loader = TextLoader(file_path)
    else:
        raise ValueError(f'Unsupported file type: {file_path}')

    return loader.load()

This uses the loader appropriate for the file type and returns the text content of the file as a list of documents.

We run this function for each file in the documents folder and, using OpenAI embeddings, save the results in the vector store with the function below.

def load_documents(file_path):
    global db
    documents = []

    for files in file_path:
        file_contents = read_text_from_file(files)
        documents.extend(file_contents)

    # Split the document text into chunks
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=500,
                                                   chunk_overlap=50)
    texts = text_splitter.split_documents(documents)

    embeddings = OpenAIEmbeddings()
    db = FAISS.from_documents(texts, embeddings)
    print('file loading done')

I am using langchain's RecursiveCharacterTextSplitter to split the text content of the documents into chunks of 500 characters with an overlap of 50 characters. There are other kinds of text splitters available in langchain. The right choice of text splitter and chunk size is crucial to the quality of a question-and-answer application.
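To make the interplay of chunk_size and chunk_overlap concrete, here is a simplified, character-based sketch of overlapping chunking. This is not langchain's actual implementation (which also tries to split on separators such as paragraphs and sentences), just an illustration of the sliding window:

```python
def split_with_overlap(text, chunk_size=500, chunk_overlap=50):
    # Slide a window of chunk_size characters, stepping by
    # chunk_size - chunk_overlap so that consecutive chunks
    # share chunk_overlap characters of context.
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size]
            for i in range(0, max(len(text) - chunk_overlap, 1), step)]

chunks = split_with_overlap("x" * 1200)
print(len(chunks))  # 3 chunks for 1,200 characters
```

The overlap means a sentence cut at a chunk boundary still appears whole in the neighbouring chunk, which helps retrieval quality.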

Next we can query our FAISS database with the following code:

query_results = db.similarity_search(query)
print(query_results)

So we modified our code from part two into the below code for our chatbot. This version of the chatbot can answer queries over a large collection of documents.
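As a reference, here is a minimal sketch of how the retrieval step can be wired into the chat completion call. The model name, the value of k, the character cap, and the prompt wording are illustrative assumptions of mine, not part of the original code; `db` is the FAISS store built above and `client = OpenAI()`:

```python
def build_context(docs, max_chars=3000):
    # Join the page_content of the retrieved chunks and cap the
    # total length so the prompt stays within the context window.
    context = "\n\n".join(doc.page_content for doc in docs)
    return context[:max_chars]

def answer_query(query, db, client):
    # Retrieve only the most relevant chunks instead of all documents.
    docs = db.similarity_search(query, k=4)
    messages = [
        {"role": "system",
         "content": "Answer the question using only the context below.\n\n"
                    + build_context(docs)},
        {"role": "user", "content": query},
    ]
    response = client.chat.completions.create(model="gpt-3.5-turbo",
                                              messages=messages)
    return response.choices[0].message.content
```

The key change from part two is that the system message now contains only the retrieved chunks, not the full document collection.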

If you are not satisfied with the performance of the FAISS vector store, you can use one of the other vector stores available in langchain.

If our collection of documents has a structure, we can describe that structure in a separate call to the OpenAI API to help choose the relevant documents.
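One way to sketch that idea: describe the folder structure in a prompt and ask the model which folder to search first. The helper name and prompt wording below are my own assumptions, not from the original post:

```python
def build_routing_prompt(query, folder_descriptions):
    # folder_descriptions maps folder name -> one-line description.
    lines = [f"- {name}: {desc}"
             for name, desc in folder_descriptions.items()]
    return ("Given these document folders:\n"
            + "\n".join(lines)
            + f"\n\nWhich folder is most relevant to the question: {query!r}?"
              " Reply with the folder name only.")
```

The model's reply can then be used to restrict the similarity search to documents from that folder, for example via metadata filters on the vector store.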

Ultimately, this approach is worthwhile only when the collection of documents you want to query is very large. For smaller collections, OpenAI launched the Assistants API in November 2023. I will explain it in my next blog.
