
This tutorial walks you through building a retrieval-augmented generation (RAG) pipeline using Pinecone as a vector database. You will create vector embeddings from files stored in Box, store them in Pinecone, and query them through a large language model (LLM) to answer questions about your content.

RAG concept overview

What is a vector database?

Pinecone is a managed, cloud-native vector database. A vector database stores and retrieves representations of your data, called embeddings, so you can find results that are similar in meaning rather than literally identical. This is the foundation of semantic search.

What are embeddings?

Embeddings are mathematical representations of the meaning of your content. They are generated by taking input data, splitting (or chunking) it, and using an embedding model to produce a sequence of floating-point values — for example, [0.3, 0.4, 0.1, 1.8, 1.1, ...]. You can think of these values as mapping concepts to points in a high-dimensional space. Similar concepts cluster together: “cat” is close to “kitten,” while “banana” and “dog” are far apart. Different embedding models are trained on different data and may be specialized for specific use cases.
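
To make this concrete, here is a tiny sketch using made-up three-dimensional vectors (real models produce hundreds or thousands of dimensions; the numbers below are illustrative only) that shows how cosine similarity measures closeness in the embedding space:

import numpy as np

# Toy 3-dimensional "embeddings" -- purely illustrative values.
cat = np.array([0.9, 0.8, 0.1])
kitten = np.array([0.85, 0.75, 0.2])
banana = np.array([0.1, 0.2, 0.9])

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: 1.0 means identical direction.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity(cat, kitten))  # high (~0.99): related concepts
print(cosine_similarity(cat, banana))  # low (~0.30): unrelated concepts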

How does RAG work with Box and Pinecone?

In a typical RAG workflow:
  • Indexing: Content from a Box folder is extracted, chunked, and converted into embeddings using an embedding model. These embeddings are stored in Pinecone along with metadata such as the file name and a reference back to the source file in Box.
  • Retrieval: When a user asks a question, the question is converted into an embedding using the same model. Pinecone’s query functionality retrieves the content chunks most relevant to the question.
  • Generation: The original question and the retrieved content chunks are combined into a prompt and sent to an LLM, which generates an answer grounded in your Box content.
The LLM doesn’t have direct access to your Box files. RAG bridges that gap by providing the relevant context so the model can generate accurate answers based on your data.
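
In code, the whole pipeline reduces to a handful of steps. The sketch below uses hypothetical helper names (list_box_files, chunk_text, embed, and so on) as placeholders for the pieces this tutorial builds:

# Conceptual outline only; every helper here is a placeholder for
# logic implemented later in this tutorial.

def index_folder(folder_id):
    # Indexing: extract, chunk, embed, and store with metadata.
    for file in list_box_files(folder_id):
        for i, chunk in enumerate(chunk_text(extract_text(file))):
            store_vector(
                id=f"{file.id}-{i}",
                values=embed(chunk),
                metadata={"file_id": file.id, "text": chunk},
            )

def answer(question):
    # Retrieval: find the chunks closest to the question's embedding.
    matches = query_vectors(embed(question), top_k=5)
    context = "\n".join(m["text"] for m in matches)
    # Generation: ground the LLM's answer in the retrieved context.
    prompt = f"Using only this context:\n{context}\n\nAnswer: {question}"
    return call_llm(prompt)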

Prerequisites

Before you begin, make sure you have the following:
  • A Box folder with files you want to query. The folder must be accessible by the user account that creates the Box application. Note the folder ID from the URL bar. You will need it later.
  • A Pinecone account. Sign up for a free starter plan and generate an API key.
  • An OpenAI account, used as the LLM provider in this tutorial; you can substitute any other LLM provider. Sign up for OpenAI and generate an API key. You may need to attach billing information.
  • Python installed on your machine.

Create a Box custom application

Create an OAuth application in the Box Developer Console:
  1. Click New App in the top right corner.
  2. Enter an app name and select OAuth 2.0 as the authentication method.
  3. Click Create App.
  4. Scroll down to Redirect URIs and add: http://127.0.0.1:5000/callback
  5. Check all boxes under Application Scopes.
  6. Click Save Changes.
Note the Client ID and Client Secret. You will need these for the configuration file.
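
If you want a feel for what this authorization step looks like in code, here is a minimal sketch using the official box-sdk-gen Python SDK (the sample repository may structure this differently; the credential strings are placeholders):

from box_sdk_gen import BoxClient, BoxOAuth, OAuthConfig

# Values from the Box Developer Console.
oauth = BoxOAuth(
    OAuthConfig(client_id="YOUR_CLIENT_ID", client_secret="YOUR_CLIENT_SECRET")
)

# Step 1: send the user to the authorization URL. Box redirects back to
# the registered URI (http://127.0.0.1:5000/callback) with a code.
print(oauth.get_authorize_url())

# Step 2: exchange that code for access and refresh tokens.
oauth.get_tokens_authorization_code_grant("CODE_FROM_CALLBACK")

client = BoxClient(auth=oauth)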

Create a Pinecone index

  1. Log in to the Pinecone Console.
  2. Click Create Index.
  3. Give the index a name (for example, pinecone-demo).
  4. Set the Dimensions field to 1024.
  5. Leave all other settings at their defaults and click Create Index.
Note the index name. You will add it to the configuration file.
You can also create an index programmatically via the Pinecone API.
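For example, with the Pinecone Python SDK (the cloud and region below are illustrative; pick whatever your plan supports):

from pinecone import Pinecone, ServerlessSpec

pc = Pinecone(api_key="YOUR_PINECONE_API_KEY")

# The dimension must match your embedding model; this tutorial uses 1024.
pc.create_index(
    name="pinecone-demo",
    dimension=1024,
    metric="cosine",
    spec=ServerlessSpec(cloud="aws", region="us-east-1"),
)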

Initialize the code repository

Clone the sample code to your local machine:
git clone https://github.com/box-community/box-pinecone-sample.git
cd box-pinecone-sample
Create and activate a virtual environment:
python3 -m venv venv
source venv/bin/activate
Install the dependencies:
pip install -r requirements.txt
Copy the sample configuration file and update it with your credentials:
cp sample_config.py config.py
Open config.py in your editor and fill in:
  • Your Box Client ID and Client Secret
  • Your Box folder ID
  • Your Pinecone API key and index name
  • Your OpenAI API key
Do not set the folder ID to 0. This would attempt to index your entire Box account, which is not recommended. It would exceed rate limits, consume significant resources, and likely fail to complete.
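
The result should look roughly like this (the variable names are illustrative; follow whatever sample_config.py actually uses):

# config.py -- illustrative shape only; match the names in sample_config.py.
BOX_CLIENT_ID = "YOUR_BOX_CLIENT_ID"
BOX_CLIENT_SECRET = "YOUR_BOX_CLIENT_SECRET"
BOX_FOLDER_ID = "123456789"  # from the folder URL; never 0

PINECONE_API_KEY = "YOUR_PINECONE_API_KEY"
PINECONE_INDEX_NAME = "pinecone-demo"

OPENAI_API_KEY = "YOUR_OPENAI_API_KEY"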

Create and store embeddings

With the configuration complete, run the main script to create embeddings from your Box content:
python main.py
The first time you run the application, a browser window opens asking you to grant access. This is the standard OAuth 2.0 authorization flow. Click Grant Access to Box.
The OAuth tokens are stored in a .oauth.json file in the project directory. The refresh token remains valid for 60 days of inactivity.
The script processes each file in the specified Box folder, extracts its text representation, chunks it, generates embeddings using the Pinecone Inference API, and stores them in Pinecone. Metadata, including a reference back to the source file in Box, is attached to each vector. Once complete, you can view the embeddings in the Pinecone Console along with the associated metadata (file name, plain text of the chunk, and more). This metadata is useful for filtering responses and managing document versions.
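
Stripped of batching and error handling, the core of this step looks something like the following sketch (not the literal main.py, but the same Pinecone calls; the file values are placeholders):

from pinecone import Pinecone

pc = Pinecone(api_key="YOUR_PINECONE_API_KEY")
index = pc.Index("pinecone-demo")

# Chunks extracted from one Box file (illustrative values).
file_id, file_name = "123456", "policy.pdf"
chunks = ["first chunk of extracted text...", "second chunk..."]

# Generate 1024-dimensional embeddings with the Pinecone Inference API.
embeddings = pc.inference.embed(
    model="multilingual-e5-large",
    inputs=chunks,
    parameters={"input_type": "passage"},
)

# Upsert vectors with metadata pointing back to the source file in Box.
index.upsert(
    vectors=[
        {
            "id": f"{file_id}-chunk-{i}",  # stable IDs make re-runs update in place
            "values": e["values"],
            "metadata": {"file_id": file_id, "file_name": file_name, "text": chunk},
        }
        for i, (e, chunk) in enumerate(zip(embeddings, chunks))
    ]
)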

Query the LLM

With the embeddings stored, you can ask questions about your Box content. This part of the project uses OpenAI as the LLM provider. Run the query script:
python query.py
Enter a question at the prompt. The script converts your question to an embedding, retrieves the most relevant content chunks from Pinecone, and sends them along with your question to OpenAI to generate an answer.
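
Under the hood, the retrieval-and-generation step looks roughly like this (the model name and prompt wording are illustrative; the actual query.py may differ):

from pinecone import Pinecone
from openai import OpenAI

pc = Pinecone(api_key="YOUR_PINECONE_API_KEY")
index = pc.Index("pinecone-demo")
llm = OpenAI(api_key="YOUR_OPENAI_API_KEY")

question = "What does the travel policy say about hotel costs?"

# Embed the question with the same model used at indexing time.
query_vector = pc.inference.embed(
    model="multilingual-e5-large",
    inputs=[question],
    parameters={"input_type": "query"},
)[0]["values"]

# Retrieve the most relevant chunks and their stored metadata.
results = index.query(vector=query_vector, top_k=5, include_metadata=True)
context = "\n\n".join(m["metadata"]["text"] for m in results["matches"])

# Generate an answer grounded in the retrieved context.
response = llm.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "Answer using only the provided context."},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
    ],
)
print(response.choices[0].message.content)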

Enhancement ideas

The sample runs indexing on demand. You could create a scheduled task or an event-driven service (for example, using Box webhooks) that triggers when files in the Box folder change. The script uses upserts, so re-running it updates existing records.
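For the event-driven route, a small web endpoint could receive Box webhook notifications and re-run indexing. A minimal sketch (production code should also verify the webhook signature, and reindex_folder is a hypothetical wrapper around main.py's indexing logic):

from flask import Flask, request

app = Flask(__name__)

@app.route("/box-webhook", methods=["POST"])
def box_webhook():
    event = request.get_json()
    # Box webhook payloads carry a trigger such as FILE.UPLOADED.
    if event.get("trigger") in ("FILE.UPLOADED", "FILE.TRASHED"):
        reindex_folder()  # hypothetical: re-runs the indexing step
    return "", 204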
The query script runs via the command line. You could build a web UI for a more user-friendly question-and-answer experience.
The demo uses OAuth 2.0. You could integrate JWT or Client Credentials Grant (CCG) authentication for server-to-server use cases.
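For example, box-sdk-gen supports CCG in a few lines (a sketch; the values are placeholders, and your app must have CCG enabled in the Developer Console):

from box_sdk_gen import BoxCCGAuth, BoxClient, CCGConfig

# Authenticate as a service account -- no browser flow required.
auth = BoxCCGAuth(
    CCGConfig(
        client_id="YOUR_CLIENT_ID",
        client_secret="YOUR_CLIENT_SECRET",
        enterprise_id="YOUR_ENTERPRISE_ID",
    )
)
client = BoxClient(auth=auth)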
The embeddings use the Pinecone Inference API, and the query script uses OpenAI. You can substitute different embedding models, LLM providers, vector dimensions, distance metrics, and chunk sizes to fit your use case.
The script processes files that have a text representation in Box (automatically created for supported file types under 500 MB). You could add third-party libraries to handle additional content types or larger files.

Resources

Sample code

Clone the Box + Pinecone sample repository on GitHub.

Pinecone documentation

Learn more about vector databases and how Pinecone works.