Build a PDF ingestion and Question/Answering system
PDF files often hold crucial unstructured data unavailable from other sources. They can be quite lengthy, and unlike plain text files, cannot generally be fed directly into the prompt of a language model.
In this tutorial, you’ll create a system that can answer questions about PDF files. More specifically, you’ll use a Document Loader to load text in a format usable by an LLM, then build a retrieval-augmented generation (RAG) pipeline to answer questions, including citations from the source material.
This tutorial will gloss over some concepts more deeply covered in our RAG tutorial, so you may want to go through those first if you haven’t already.
Let’s dive in!
Loading documents
First, you’ll need to choose a PDF to load. We’ll use a document from Nike’s annual public SEC report. It’s over 100 pages long and contains crucial data mixed with longer explanatory text. However, feel free to use a PDF of your choosing.
Once you’ve chosen your PDF, the next step is to load it into a format that an LLM can more easily handle, since LLMs generally require text inputs. LangChain has a few different built-in document loaders for this purpose which you can experiment with. Below, we’ll use one powered by the pdf-parse package that reads from a filepath:
import "pdf-parse"; // Peer dep
import { PDFLoader } from "@langchain/community/document_loaders/fs/pdf";
const loader = new PDFLoader("../../data/nke-10k-2023.pdf");
const docs = await loader.load();
console.log(docs.length);
107
console.log(docs[0].pageContent.slice(0, 100));
console.log(docs[0].metadata);
Table of Contents
UNITED STATES
SECURITIES AND EXCHANGE COMMISSION
Washington, D.C. 20549
FORM 10-K
{
source: '../../data/nke-10k-2023.pdf',
pdf: {
version: '1.10.100',
info: {
PDFFormatVersion: '1.4',
IsAcroFormPresent: false,
IsXFAPresent: false,
Title: '0000320187-23-000039',
Author: 'EDGAR Online, a division of Donnelley Financial Solutions',
Subject: 'Form 10-K filed on 2023-07-20 for the period ending 2023-05-31',
Keywords: '0000320187-23-000039; ; 10-K',
Creator: 'EDGAR Filing HTML Converter',
Producer: 'EDGRpdf Service w/ EO.Pdf 22.0.40.0',
CreationDate: "D:20230720162200-04'00'",
ModDate: "D:20230720162208-04'00'"
},
metadata: null,
totalPages: 107
},
loc: { pageNumber: 1 }
}
So what just happened?
- The loader reads the PDF at the specified path into memory.
- It then extracts text data using the pdf-parse package.
- Finally, it creates a LangChain Document for each page of the PDF with the page’s content and some metadata about where in the document the text came from.
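Conceptually, each loaded page is just text paired with provenance metadata. Here is an illustrative sketch of that shape in plain TypeScript (a simplified stand-in, not LangChain's full Document class):

```typescript
// Illustrative shape of one loaded page: content plus where it came from.
interface LoadedPage {
  pageContent: string;
  metadata: {
    source: string;
    loc: { pageNumber: number };
  };
}

// A hypothetical page mirroring what the loader produces per PDF page.
const page: LoadedPage = {
  pageContent: "Table of Contents\nUNITED STATES\nSECURITIES AND EXCHANGE COMMISSION",
  metadata: {
    source: "../../data/nke-10k-2023.pdf",
    loc: { pageNumber: 1 },
  },
};

console.log(page.metadata.loc.pageNumber); // 1
```

Because every page keeps its own metadata, you can trace any chunk of text back to its source file and page later in the pipeline.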
LangChain has many other document loaders for other data sources, or you can create a custom document loader.
Question answering with RAG
Next, you’ll prepare the loaded documents for later retrieval. Using a text splitter, you’ll split your loaded documents into smaller documents that can more easily fit into an LLM’s context window, then load them into a vector store. You can then create a retriever from the vector store for use in your RAG chain:
Pick your chat model:
- OpenAI
- Anthropic
- FireworksAI
- MistralAI
- Groq
- VertexAI
Install dependencies
- npm
- yarn
- pnpm
npm i @langchain/openai
yarn add @langchain/openai
pnpm add @langchain/openai
Add environment variables
OPENAI_API_KEY=your-api-key
Instantiate the model
import { ChatOpenAI } from "@langchain/openai";
const model = new ChatOpenAI({ model: "gpt-4o" });
Install dependencies
- npm
- yarn
- pnpm
npm i @langchain/anthropic
yarn add @langchain/anthropic
pnpm add @langchain/anthropic
Add environment variables
ANTHROPIC_API_KEY=your-api-key
Instantiate the model
import { ChatAnthropic } from "@langchain/anthropic";
const model = new ChatAnthropic({
model: "claude-3-5-sonnet-20240620",
temperature: 0
});
Install dependencies
- npm
- yarn
- pnpm
npm i @langchain/community
yarn add @langchain/community
pnpm add @langchain/community
Add environment variables
FIREWORKS_API_KEY=your-api-key
Instantiate the model
import { ChatFireworks } from "@langchain/community/chat_models/fireworks";
const model = new ChatFireworks({
model: "accounts/fireworks/models/llama-v3p1-70b-instruct",
temperature: 0
});
Install dependencies
- npm
- yarn
- pnpm
npm i @langchain/mistralai
yarn add @langchain/mistralai
pnpm add @langchain/mistralai
Add environment variables
MISTRAL_API_KEY=your-api-key
Instantiate the model
import { ChatMistralAI } from "@langchain/mistralai";
const model = new ChatMistralAI({
model: "mistral-large-latest",
temperature: 0
});
Install dependencies
- npm
- yarn
- pnpm
npm i @langchain/groq
yarn add @langchain/groq
pnpm add @langchain/groq
Add environment variables
GROQ_API_KEY=your-api-key
Instantiate the model
import { ChatGroq } from "@langchain/groq";
const model = new ChatGroq({
model: "mixtral-8x7b-32768",
temperature: 0
});
Install dependencies
- npm
- yarn
- pnpm
npm i @langchain/google-vertexai
yarn add @langchain/google-vertexai
pnpm add @langchain/google-vertexai
Add environment variables
GOOGLE_APPLICATION_CREDENTIALS=credentials.json
Instantiate the model
import { ChatVertexAI } from "@langchain/google-vertexai";
const model = new ChatVertexAI({
model: "gemini-1.5-flash",
temperature: 0
});
import { MemoryVectorStore } from "langchain/vectorstores/memory";
import { OpenAIEmbeddings } from "@langchain/openai";
import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";
const textSplitter = new RecursiveCharacterTextSplitter({
chunkSize: 1000,
chunkOverlap: 200,
});
const splits = await textSplitter.splitDocuments(docs);
const vectorstore = await MemoryVectorStore.fromDocuments(
splits,
new OpenAIEmbeddings()
);
const retriever = vectorstore.asRetriever();
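To make the chunking parameters concrete, here is a heavily simplified sketch of fixed-size splitting with overlap in plain TypeScript. The real RecursiveCharacterTextSplitter is smarter: it prefers natural boundaries like paragraphs and sentences before falling back to hard cuts, but the size/overlap arithmetic is the same idea:

```typescript
// Naive fixed-size chunker with overlap; illustrative only.
// Each chunk starts (chunkSize - chunkOverlap) characters after the
// previous one, so consecutive chunks share chunkOverlap characters.
function naiveSplit(
  text: string,
  chunkSize: number,
  chunkOverlap: number
): string[] {
  const chunks: string[] = [];
  const step = chunkSize - chunkOverlap;
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break;
  }
  return chunks;
}

// 2500 characters with chunkSize 1000 / overlap 200 yields three chunks:
// [0, 1000), [800, 1800), [1600, 2500).
const chunks = naiveSplit("a".repeat(2500), 1000, 200);
console.log(chunks.length); // 3
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, which helps retrieval quality.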
Finally, you’ll use some built-in helpers to construct the final ragChain:
import { createRetrievalChain } from "langchain/chains/retrieval";
import { createStuffDocumentsChain } from "langchain/chains/combine_documents";
import { ChatPromptTemplate } from "@langchain/core/prompts";
const systemTemplate = [
`You are an assistant for question-answering tasks. `,
`Use the following pieces of retrieved context to answer `,
`the question. If you don't know the answer, say that you `,
`don't know. Use three sentences maximum and keep the `,
`answer concise.`,
`\n\n`,
`{context}`,
].join("");
const prompt = ChatPromptTemplate.fromMessages([
["system", systemTemplate],
["human", "{input}"],
]);
const questionAnswerChain = await createStuffDocumentsChain({
  llm: model,
  prompt,
});
const ragChain = await createRetrievalChain({
retriever,
combineDocsChain: questionAnswerChain,
});
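Under the hood, this chain roughly does two things: fetch documents relevant to the input, then "stuff" their text into the {context} slot of the prompt before calling the model. A self-contained sketch of that flow, using a hypothetical stand-in retriever in place of the vector store:

```typescript
// Minimal retrieve-then-stuff sketch: fetch docs, join their text into a
// single context string, and substitute it into a prompt template.
type Doc = { pageContent: string };

// Hypothetical stand-in; a real retriever would run a similarity search.
const fakeRetriever = (_query: string): Doc[] => [
  { pageContent: "NIKE, Inc. Revenues were $51.2 billion in fiscal 2023." },
  { pageContent: "NIKE Direct revenues grew 14% in fiscal 2023." },
];

function stuffPrompt(template: string, docs: Doc[], input: string): string {
  const context = docs.map((d) => d.pageContent).join("\n\n");
  // Function replacers avoid String.replace treating "$" in the
  // document text as a special replacement pattern.
  return template
    .replace("{context}", () => context)
    .replace("{input}", () => input);
}

const template = "Answer using this context:\n{context}\n\nQuestion: {input}";
const question = "What was Nike's revenue in 2023?";
const finalPrompt = stuffPrompt(template, fakeRetriever(question), question);
console.log(finalPrompt);
```

createRetrievalChain then sends the stuffed prompt to the model and returns the input, retrieved context, and answer together, as shown in the output below.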
const results = await ragChain.invoke({
input: "What was Nike's revenue in 2023?",
});
console.log(results);
{
input: "What was Nike's revenue in 2023?",
chat_history: [],
context: [
Document {
pageContent: 'Enterprise Resource Planning Platform, data and analytics, demand sensing, insight gathering, and other areas to create an end-to-end technology foundation, which we\n' +
'believe will further accelerate our digital transformation. We believe this unified approach will accelerate growth and unlock more efficiency for our business, while driving\n' +
'speed and responsiveness as we serve consumers globally.\n' +
'FINANCIAL HIGHLIGHTS\n' +
'•In fiscal 2023, NIKE, Inc. achieved record Revenues of $51.2 billion, which increased 10% and 16% on a reported and currency-neutral basis, respectively\n' +
'•NIKE Direct revenues grew 14% from $18.7 billion in fiscal 2022 to $21.3 billion in fiscal 2023, and represented approximately 44% of total NIKE Brand revenues for\n' +
'fiscal 2023\n' +
'•Gross margin for the fiscal year decreased 250 basis points to 43.5% primarily driven by higher product costs, higher markdowns and unfavorable changes in foreign\n' +
'currency exchange rates, partially offset by strategic pricing actions',
metadata: [Object]
},
Document {
pageContent: 'Table of Contents\n' +
'FISCAL 2023 NIKE BRAND REVENUE HIGHLIGHTS\n' +
'The following tables present NIKE Brand revenues disaggregated by reportable operating segment, distribution channel and major product line:\n' +
'FISCAL 2023 COMPARED TO FISCAL 2022\n' +
'•NIKE, Inc. Revenues were $51.2 billion in fiscal 2023, which increased 10% and 16% compared to fiscal 2022 on a reported and currency-neutral basis, respectively.\n' +
'The increase was due to higher revenues in North America, Europe, Middle East & Africa ("EMEA"), APLA and Greater China, which contributed approximately 7, 6,\n' +
'2 and 1 percentage points to NIKE, Inc. Revenues, respectively.\n' +
'•NIKE Brand revenues, which represented over 90% of NIKE, Inc. Revenues, increased 10% and 16% on a reported and currency-neutral basis, respectively. This\n' +
"increase was primarily due to higher revenues in Men's, the Jordan Brand, Women's and Kids' which grew 17%, 35%,11% and 10%, respectively, on a wholesale\n" +
'equivalent basis.',
metadata: [Object]
},
Document {
pageContent: 'Table of Contents\n' +
'EUROPE, MIDDLE EAST & AFRICA\n' +
'(Dollars in millions)\n' +
'FISCAL 2023FISCAL 2022% CHANGE\n' +
'% CHANGE\n' +
'EXCLUDING\n' +
'CURRENCY\n' +
'CHANGESFISCAL 2021% CHANGE\n' +
'% CHANGE\n' +
'EXCLUDING\n' +
'CURRENCY\n' +
'CHANGES\n' +
'Revenues by:\n' +
'Footwear$8,260 $7,388 12 %25 %$6,970 6 %9 %\n' +
'Apparel4,566 4,527 1 %14 %3,996 13 %16 %\n' +
'Equipment592 564 5 %18 %490 15 %17 %\n' +
'TOTAL REVENUES$13,418 $12,479 8 %21 %$11,456 9 %12 %\n' +
'Revenues by: \n' +
'Sales to Wholesale Customers$8,522 $8,377 2 %15 %$7,812 7 %10 %\n' +
'Sales through NIKE Direct4,896 4,102 19 %33 %3,644 13 %15 %\n' +
'TOTAL REVENUES$13,418 $12,479 8 %21 %$11,456 9 %12 %\n' +
'EARNINGS BEFORE INTEREST AND TAXES$3,531 $3,293 7 %$2,435 35 % \n' +
'FISCAL 2023 COMPARED TO FISCAL 2022\n' +
"•EMEA revenues increased 21% on a currency-neutral basis, due to higher revenues in Men's, the Jordan Brand, Women's and Kids'. NIKE Direct revenues\n" +
'increased 33%, driven primarily by strong digital sales growth of 43% and comparable store sales growth of 22%.',
metadata: [Object]
},
Document {
pageContent: 'Table of Contents\n' +
'NORTH AMERICA\n' +
'(Dollars in millions)\n' +
'FISCAL 2023FISCAL 2022% CHANGE\n' +
'% CHANGE\n' +
'EXCLUDING\n' +
'CURRENCY\n' +
'CHANGESFISCAL 2021% CHANGE\n' +
'% CHANGE\n' +
'EXCLUDING\n' +
'CURRENCY\n' +
'CHANGES\n' +
'Revenues by:\n' +
'Footwear$14,897 $12,228 22 %22 %$11,644 5 %5 %\n' +
'Apparel5,947 5,492 8 %9 %5,028 9 %9 %\n' +
'Equipment764 633 21 %21 %507 25 %25 %\n' +
'TOTAL REVENUES$21,608 $18,353 18 %18 %$17,179 7 %7 %\n' +
'Revenues by: \n' +
'Sales to Wholesale Customers$11,273 $9,621 17 %18 %$10,186 -6 %-6 %\n' +
'Sales through NIKE Direct10,335 8,732 18 %18 %6,993 25 %25 %\n' +
'TOTAL REVENUES$21,608 $18,353 18 %18 %$17,179 7 %7 %\n' +
'EARNINGS BEFORE INTEREST AND TAXES$5,454 $5,114 7 %$5,089 0 %\n' +
'FISCAL 2023 COMPARED TO FISCAL 2022\n' +
"•North America revenues increased 18% on a currency-neutral basis, primarily due to higher revenues in Men's and the Jordan Brand. NIKE Direct revenues\n" +
'increased 18%, driven by strong digital sales growth of 23%, comparable store sales growth of 9% and the addition of new stores.',
metadata: [Object]
}
],
answer: 'According to the financial highlights, Nike, Inc. achieved record revenues of $51.2 billion in fiscal 2023, which increased 10% on a reported basis and 16% on a currency-neutral basis compared to fiscal 2022.'
}
You can see that you get both a final answer in the answer key of the results object, and the context the LLM used to generate an answer.

Examining the values under context further, you can see that they are documents that each contain a chunk of the ingested page content. Usefully, these documents also preserve the original metadata from when you first loaded them:
console.log(results.context[0].pageContent);
Enterprise Resource Planning Platform, data and analytics, demand sensing, insight gathering, and other areas to create an end-to-end technology foundation, which we
believe will further accelerate our digital transformation. We believe this unified approach will accelerate growth and unlock more efficiency for our business, while driving
speed and responsiveness as we serve consumers globally.
FINANCIAL HIGHLIGHTS
•In fiscal 2023, NIKE, Inc. achieved record Revenues of $51.2 billion, which increased 10% and 16% on a reported and currency-neutral basis, respectively
•NIKE Direct revenues grew 14% from $18.7 billion in fiscal 2022 to $21.3 billion in fiscal 2023, and represented approximately 44% of total NIKE Brand revenues for
fiscal 2023
•Gross margin for the fiscal year decreased 250 basis points to 43.5% primarily driven by higher product costs, higher markdowns and unfavorable changes in foreign
currency exchange rates, partially offset by strategic pricing actions
console.log(results.context[0].metadata);
{
source: '../../data/nke-10k-2023.pdf',
pdf: {
version: '1.10.100',
info: {
PDFFormatVersion: '1.4',
IsAcroFormPresent: false,
IsXFAPresent: false,
Title: '0000320187-23-000039',
Author: 'EDGAR Online, a division of Donnelley Financial Solutions',
Subject: 'Form 10-K filed on 2023-07-20 for the period ending 2023-05-31',
Keywords: '0000320187-23-000039; ; 10-K',
Creator: 'EDGAR Filing HTML Converter',
Producer: 'EDGRpdf Service w/ EO.Pdf 22.0.40.0',
CreationDate: "D:20230720162200-04'00'",
ModDate: "D:20230720162208-04'00'"
},
metadata: null,
totalPages: 107
},
loc: { pageNumber: 31, lines: { from: 14, to: 22 } }
}
This particular chunk came from page 31 in the original PDF. You can use this data to show which page in the PDF the answer came from, allowing users to quickly verify that answers are based on the source material.
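For example, you could surface that provenance with a small helper that collects the distinct page numbers behind an answer (a sketch; the document shapes mirror the metadata printed above):

```typescript
// Gather the distinct, sorted source pages from a RAG result's context
// so users can verify the answer against the original PDF.
type ContextDoc = {
  pageContent: string;
  metadata: { source: string; loc: { pageNumber: number } };
};

function citedPages(context: ContextDoc[]): number[] {
  const pages = context.map((d) => d.metadata.loc.pageNumber);
  return [...new Set(pages)].sort((a, b) => a - b);
}

// Hypothetical context resembling the results object above.
const context: ContextDoc[] = [
  { pageContent: "...", metadata: { source: "nke-10k-2023.pdf", loc: { pageNumber: 31 } } },
  { pageContent: "...", metadata: { source: "nke-10k-2023.pdf", loc: { pageNumber: 35 } } },
  { pageContent: "...", metadata: { source: "nke-10k-2023.pdf", loc: { pageNumber: 31 } } },
];

console.log(citedPages(context)); // [ 31, 35 ]
```

You could render these page numbers alongside the answer as citations, or link them to a PDF viewer opened at the cited page.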
For a deeper dive into RAG, see this more focused tutorial or our how-to guides.
Next steps
You’ve now seen how to load documents from a PDF file with a Document Loader and some techniques you can use to prepare that loaded data for RAG.
For more on document loaders, you can check out:
- The entry in the conceptual guide
- Related how-to guides
- Available integrations
- How to create a custom document loader
For more on RAG, see: