Have you ever wanted to quickly get information from your files without reading a lot of pages? Well, with the advancements in LLMs and the tools around them, you can now literally chat with your documents (a PDF, for example). We're going to do exactly that using LlamaIndex and Data Connectors.
LlamaIndex will help you build LLM applications by providing a framework that can easily ingest data from multiple sources and then use that data as context with a Large Language Model (LLM) such as GPT-4.
In this post, we're going to ingest data from a PDF file using LlamaIndex's PDF Loader Data Connector.
What are Data Connectors?
Data Connectors in LlamaIndex are essentially plugins that allow us to take in data from a source (such as PDF files) and then use the loaded data in our LLM application. For this example, we're going to use the PDF Loader Data Connector.
After ingesting data, an index can be constructed and used to answer specific questions about the data using a Query Engine, or to have a chat-style conversation using a Chat Engine.
We're going to quickly define what the Query and Chat Engines are and briefly explain their function.
A query engine is a generic interface that allows you to ask questions about the data ingested from one or more sources using Data Connectors. A query engine takes in a natural language input and returns a response.
A query engine can be initialized by using the as_query_engine() method as shown below:
query_engine = index.as_query_engine()
response = query_engine.query("What are Data Connectors?")
Similarly, we can think of a chat engine as an extension of a query engine that supports having a conversation (back-and-forth messages) with your data. It achieves this by keeping track of the message history and retaining context for future queries. If you're building a bot for your custom data or any conversation-type interface you'll probably use the chat engine.
A chat engine can be initialized by using the as_chat_engine() method as shown below:
chat_engine = index.as_chat_engine()
response = chat_engine.chat("What are Data Connectors?")
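Under the hood, the history-tracking behavior can be sketched with a toy wrapper (ToyChatEngine and query_fn below are purely illustrative names, not LlamaIndex's API): each new message is answered with the full conversation so far as context.

```python
# Toy sketch of how a chat engine retains context: keep the message
# history and pass the whole conversation to the underlying query function.
class ToyChatEngine:
    def __init__(self, query_fn):
        self.query_fn = query_fn  # e.g. a call out to an LLM
        self.history = []         # list of (role, text) pairs

    def chat(self, message):
        self.history.append(("user", message))
        # Build the context from every message exchanged so far.
        context = "\n".join(f"{role}: {text}" for role, text in self.history)
        response = self.query_fn(context)
        self.history.append(("assistant", response))
        return response

# A fake query function that just reports how much context it received.
engine = ToyChatEngine(lambda ctx: f"saw {len(ctx.splitlines())} messages")
print(engine.chat("What are Data Connectors?"))  # → saw 1 messages
print(engine.chat("Tell me more."))              # → saw 3 messages
```

Because the second call sees the earlier question and answer, follow-ups like "Tell me more." can be resolved against what was already said.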
To set up our first Data Connector for this example, we'll need an OpenAI API Key and a PDF file that you'd like to process.
Let's get started by installing LlamaIndex using pip. In your terminal window, type the following:
pip install llama-index
Creating an Empty Directory
Next, create an empty directory for the project and cd into our new directory. Inside it, we can finally create our app.py Python file.
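The directory setup can be done from the terminal like so (the directory name llama-pdf-demo is just an example, use whatever you like):

```shell
# Create a project directory (name is arbitrary) and move into it
mkdir -p llama-pdf-demo
cd llama-pdf-demo

# Create the Python file we'll be working in
touch app.py
```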
Querying PDF Example
Next, we're going to do the following:
- Set the OpenAI API Key
- Import required packages
- Load a LlamaIndex Data Connector
- Ingest a sample PDF file
- Use the Query Engine to query OpenAI's LLM
Set up OpenAI API Key
import os

os.environ["OPENAI_API_KEY"] = 'YOUR-API-KEY-HERE'
Import Required Packages
from pathlib import Path
from llama_index import VectorStoreIndex, download_loader
Here we'll load pathlib, which makes it easier to interact with files and directories. We'll also import VectorStoreIndex, which represents a vector index: a type of index used to store and manage multidimensional data called vectors. These vectors are produced by AI models called "embedding models". These models take something, like an article, picture, or video, and turn it into a set of numbers, a vector, that represents it.
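To make that idea concrete, here's a toy sketch (toy_embed below is purely illustrative, not a real embedding model): text goes in, a fixed-length vector of numbers comes out, and similar inputs can then be compared by vector similarity.

```python
import math

def toy_embed(text, dim=8):
    # Illustrative only: fold character codes into a fixed-length vector.
    # Real embedding models are neural networks trained to capture meaning.
    vec = [0.0] * dim
    for i, ch in enumerate(text.lower()):
        vec[i % dim] += ord(ch)
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]  # normalize to unit length

def cosine_similarity(a, b):
    # For unit-length vectors, the dot product is the cosine similarity.
    return sum(x * y for x, y in zip(a, b))

vec = toy_embed("Domino's Pizza Nutrition Guide")
print(len(vec))                       # → 8 (a fixed-length vector of numbers)
print(cosine_similarity(vec, vec))    # a vector is maximally similar to itself
```

A vector index stores many such vectors and, given a query vector, finds the stored chunks whose vectors are most similar, which is how relevant parts of your PDF get pulled in as context for the LLM.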
download_loader will help us load one of the many LlamaIndex Data Connectors; in our case, we'll be using the PDF Loader connector (as shown below).
Load PDF Reader
Using download_loader, we'll now load the PDF Loader Data Connector:
PDFReader = download_loader("PDFReader")
loader = PDFReader()
Ingest Sample PDF File
Next week some friends are coming over and we're having Domino's Pizza for dinner. I genuinely want to query their nutritional information and get more details about my choices, so I decided to use the Canadian Domino's Pizza Nutritional Guide as my sample PDF, but you can obviously swap it with any other PDF based on your use case.
documents = loader.load_data(file=Path('dominos.pdf'))
index = VectorStoreIndex.from_documents(documents)
Using the loader's load_data method, which takes in the PDF path, we can convert the PDF's content to a VectorStoreIndex as shown above.
Query the PDF
The final step here is to query the PDF and print the response:
query_engine = index.as_query_engine()
response = query_engine.query("What is this document about?")
print(response)
The printed response in my case is:
This document is about the nutrition guide for Domino's Pizza.
Here's another interesting query:
response = query_engine.query("How many Pizza types are there?")
print(response)
To which the LLM responded:
There are 6 pizza types mentioned in the context information.
Recap and Next Steps
In this post, we've seen how we can use LlamaIndex's PDF Loader Data Connector to ingest data from the Domino's Pizza Nutritional Information PDF, query that data, and receive a response from OpenAI's model. LlamaIndex supports other LLMs too, and for your specific use case you could use a different model that does not require internet access, to keep your private data private.
You can download the full code from this repo.
Feel free to experiment with your own documents, and stay tuned for future posts if you like the content.