LlamaIndex: Using data connectors to build a custom ChatGPT for private documents

In this post, we're going to see how we can use LlamaIndex's PDF Loader Data Connector to ingest data from the Domino's Pizza Nutritional Information PDF, then query that data, and print the LLM's response.

⚠️
I highly recommend you go over this short post about LlamaIndex if you're just getting started.

Introduction

Have you ever wanted to quickly get information from your files without reading page after page? Well, with the advancements in LLMs and the tools around them, you can now literally chat with your documents (a PDF, for example). We're going to do exactly that using LlamaIndex and Data Connectors.

LlamaIndex will help you build LLM applications by providing a framework that can easily ingest data from multiple sources and then use that data as context with a Large Language Model (LLM) such as GPT-4.

In this post, we're going to ingest data from a PDF file using a LlamaIndex Data Connector.

What are Data Connectors?

Data Connectors in LlamaIndex are essentially plugins that allow us to take in data from a source (such as PDF files) and then use the loaded data in our LLM application. For this example, we're going to ingest a PDF document, so we'll be using the PDF Loader Data Connector.

💡
Data Connectors are available on the LlamaHub website. I recommend you visit it and explore the available connectors if you're planning on building an LLM app using LlamaIndex.

After ingesting data, we can construct an index and use it to ask specific questions about the data through a Query Engine, or to have a chat-style conversation through a Chat Engine.

LlamaIndex Engines

We're going to quickly define what the Query and Chat Engines are and briefly explain their function.

Query Engine

A query engine is a generic interface that allows you to ask questions about the data ingested from one or more sources using Data Connectors. A query engine takes in a natural language input and returns a response.

A query engine can be initialized by using the as_query_engine() method as shown below:

query_engine = index.as_query_engine()
response = query_engine.query("What are Data Connectors?")

Chat Engine

Similarly, we can think of a chat engine as an extension of a query engine that supports having a conversation (back-and-forth messages) with your data. It achieves this by keeping track of the message history and retaining context for future queries. If you're building a bot for your custom data, or any conversation-style interface, you'll probably use the chat engine. We'll try it on our PDF later in this post.

A chat engine can be initialized by using the as_chat_engine() method as shown below:

chat_engine = index.as_chat_engine()
response = chat_engine.chat("What are Data Connectors?")

Setting Up

To set up our first Data Connector for this example, we'll need an OpenAI API key and a PDF file to process.

Installing LlamaIndex

Let's get started by installing LlamaIndex using pip. In your terminal window, type the following:

pip install llama-index
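
A quick note: the llama-index package is evolving quickly, and this post uses the download_loader-based workflow from the pre-0.10 releases. If the imports below fail for you on a newer version, one option (the version pin here is an assumption, so verify it against your setup) is to install an older release:

pip install "llama-index<0.10"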

Creating an Empty Directory

Next, let's create an empty directory for our project:

mkdir data-connectors

Then, let's cd into our new directory:

cd data-connectors

We can finally create our app.py Python file:

touch app.py

Querying PDF Example

Next, we're going to do the following:

  1. Set the OpenAI API Key
  2. Import required packages
  3. Load LlamaIndex Data Connector: PDF Reader
  4. Ingest a sample PDF file
  5. Use the Query Engine to query OpenAI's LLM

Set up OpenAI API Key

import os
os.environ["OPENAI_API_KEY"] = 'YOUR-API-KEY-HERE'
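
Hardcoding the key works for a quick demo, but it's easy to accidentally commit or share. A slightly safer sketch is to export OPENAI_API_KEY in your shell beforehand and just check that it's present:

import os

# Fail fast if the key isn't already exported in the shell,
# e.g. via: export OPENAI_API_KEY=YOUR-API-KEY-HERE
if "OPENAI_API_KEY" not in os.environ:
    raise RuntimeError("Please set the OPENAI_API_KEY environment variable")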

Import Required Packages

from pathlib import Path
from llama_index import VectorStoreIndex, download_loader

Here we import Path from pathlib, which makes it easier to interact with files and directories. We also import VectorStoreIndex and download_loader from llama_index.

💡
VectorStoreIndex represents a vector index, a type of index used to store and manage multidimensional data called vectors. These vectors are produced by AI models called embedding models, which take something like an article, picture, or video and turn it into a set of numbers (a vector) that represents it.
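
To make that concrete, here's a minimal sketch that turns a sentence into such a vector. It assumes the same pre-0.10 llama_index import paths used in this post and that your OpenAI API key is set:

from llama_index.embeddings import OpenAIEmbedding

embed_model = OpenAIEmbedding()

# Turn a piece of text into a list of floats (the vector)
vector = embed_model.get_text_embedding("Domino's Pizza nutrition guide")

# OpenAI's default embedding model returns a 1536-dimensional vector
print(len(vector), vector[:5])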

download_loader will help us load one of the many LlamaIndex Data Connectors; in our case, we'll be using the PDF Loader connector (as shown below).

Load PDF Reader

Using download_loader we'll now load the PDF Loader Data Connector:

PDFReader = download_loader("PDFReader")

loader = PDFReader()

Ingest Sample PDF File

Next week some friends are coming over and we're having Domino's Pizza for dinner. I genuinely want to query their nutritional information and get more details about my choices, so I decided to use the Canadian Domino's Pizza Nutritional Guide as my sample PDF, but you can obviously swap it with any other PDF based on your use case.

documents = loader.load_data(file=Path('dominos.pdf'))

index = VectorStoreIndex.from_documents(documents)

load_data takes in the PDF's path and returns a list of documents, which VectorStoreIndex.from_documents then turns into an index, as shown above.
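
Because building the index calls the embedding API, re-running the script re-embeds the whole PDF every time. To avoid that, you can persist the index to disk and reload it later; here's a minimal sketch (the ./storage directory name is just an assumption):

from llama_index import StorageContext, load_index_from_storage

# Save the index (embeddings included) to ./storage
index.storage_context.persist(persist_dir="./storage")

# On later runs, rebuild the index from disk instead of re-embedding
storage_context = StorageContext.from_defaults(persist_dir="./storage")
index = load_index_from_storage(storage_context)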

Query the PDF

The final step is to query the PDF:

query_engine = index.as_query_engine()
response = query_engine.query("What is this document about?")

print(response)

The printed response in my case is: This document is about the nutrition guide for Domino's Pizza.
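
If you're curious where that answer came from, the response object also carries the chunks of the PDF that were retrieved to ground it; here's a minimal sketch of inspecting them:

# Each source node is a chunk of the PDF the answer was based on
for node in response.source_nodes:
    print(node.score, node.node.get_content()[:100])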

Here's another interesting query:

response = query_engine.query("How many Pizza types are there?")

To which the LLM responded: There are 6 pizza types mentioned in the context information.
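
We can also reuse the same index with the Chat Engine we defined earlier to have a back-and-forth conversation about the PDF. Here's a minimal sketch; the follow-up question is just a hypothetical example:

chat_engine = index.as_chat_engine()

response = chat_engine.chat("How many Pizza types are there?")
print(response)

# Because the engine keeps the message history, "them" refers back
# to the pizza types from the previous question
response = chat_engine.chat("Which of them has the fewest calories?")
print(response)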

Recap and Next Steps

In this post, we've seen how we can use LlamaIndex's PDF Loader Data Connector to ingest data from the Domino's Pizza Nutritional Information PDF, query that data, and receive a response from OpenAI's model. LlamaIndex supports other LLMs as well; for your specific use case, you could use a model that doesn't require internet access to keep your private data private.
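
As a rough sketch of what that could look like with the same pre-0.10 llama_index API, here's how you might swap in a locally served model. The Ollama server, the llama2 model name, and the "local" embedding setting are all assumptions you'd adapt to your own setup:

from llama_index import ServiceContext, VectorStoreIndex
from llama_index.llms import Ollama

# A local LLM served by Ollama; requests never leave your machine
llm = Ollama(model="llama2")

# embed_model="local" tells LlamaIndex to use a local embedding model too
service_context = ServiceContext.from_defaults(llm=llm, embed_model="local")

index = VectorStoreIndex.from_documents(documents, service_context=service_context)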

You can download the full code from this repo.

Feel free to experiment with your own documents, and stay tuned for future posts if you like the content.