# Box and Pinecone

> Build a RAG pipeline by connecting Box content to Pinecone for vector search and question answering.

export const RelatedLinks = ({title, items = []}) => {
  const getBadgeClass = badge => {
    if (!badge) return "badge-default";
    const badgeType = badge.toLowerCase().replace(/\s+/g, "-");
    return `badge-${badge === "ガイド" ? "guide" : badgeType}`;
  };
  if (!items || items.length === 0) {
    return null;
  }
  return <div className="my-8">
      {}
      <h3 className="text-sm font-bold uppercase tracking-wider mb-4">{title}</h3>

      {}
      <div className="flex flex-col gap-3">
        {items.map((item, index) => <a key={index} href={item.href} className="py-2 px-3 rounded related_link hover:bg-[#f2f2f2] dark:hover:bg-[#111827] flex items-center gap-3 group no-underline hover:no-underline border-b-0">
            {}
            <span className={`px-2 py-1 rounded-full text-xs font-semibold uppercase tracking-wide flex-shrink-0 ${getBadgeClass(item.badge)}`}>
              {item.badge}
            </span>

            {}
            <span className="text-base">{item.label}</span>
          </a>)}
      </div>
    </div>;
};

export const SignupCTA = ({children}) => {
  return <div className="flex flex-wrap items-center gap-4 p-5 rounded-lg border border-gray-200 dark:border-gray-700 my-6" style={{
    background: "linear-gradient(135deg, rgba(0, 97, 213, 0.06), rgba(0, 97, 213, 0.02))"
  }}>
      <div className="flex-1 text-sm leading-relaxed text-gray-700 dark:text-gray-300" style={{
    minWidth: "280px"
  }}>
        {children}
      </div>
      <div className="flex flex-col items-center gap-2">
        <a href="https://account.box.com/signup/developer#ty9l3" className="signup-cta-button inline-flex items-center whitespace-nowrap px-5 py-2 text-sm font-semibold text-white no-underline">
          Get started for free
        </a>
        <a href="https://account.box.com/developers/console" className="signup-cta-login text-xs text-gray-500 dark:text-gray-400 no-underline whitespace-nowrap">
          Already have an account? Log in
        </a>
      </div>
    </div>;
};

export const Link = ({href, children, className, ...props}) => {
  const localizedHref = href;
  return <a href={localizedHref} className={className} {...props}>
      {children}
    </a>;
};

This tutorial walks you through building a retrieval-augmented generation
(RAG) pipeline using [Pinecone](https://www.pinecone.io/) as a vector
database. You will create vector embeddings from files
stored in Box, store them in Pinecone, and query them through a large
language model (LLM) to answer questions about your content.

<SignupCTA>
  A free developer account gives you access to the Box API and everything
  you need to build AI-powered workflows with Pinecone.
</SignupCTA>

## RAG concept overview

### What is a vector database?

Pinecone is a managed, cloud-native
[vector database](https://www.pinecone.io/learn/vector-database/). A vector
database stores and retrieves representations of your data, called
embeddings, so you can find results that are similar in meaning rather
than literally identical. This is the foundation of semantic search.

### What are embeddings?

An embedding is a numerical representation of the meaning of a piece of
content. To generate embeddings, you split (or chunk) the input data and pass
each chunk to an embedding model, which produces a fixed-length sequence of
floating-point values, for example `[0.3, 0.4, 0.1, 1.8, 1.1, ...]`.

You can think of these values as mapping concepts to points in a
high-dimensional space. Similar concepts cluster together: "cat" is close
to "kitten," while "banana" and "dog" are far apart. Different embedding
models are trained on different data and may be specialized for specific
use cases.
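
To make "close" and "far apart" concrete, the following sketch compares toy vectors with cosine similarity, a common way to measure how similar two embeddings are. The three-dimensional vectors are invented for illustration and do not come from a real embedding model, which would produce hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors, invented for illustration only (not real model output).
TOY_EMBEDDINGS = {
    "cat": [0.9, 0.8, 0.1],
    "kitten": [0.85, 0.9, 0.15],
    "banana": [0.1, 0.05, 0.95],
}

if __name__ == "__main__":
    for word in ("kitten", "banana"):
        score = cosine_similarity(TOY_EMBEDDINGS["cat"], TOY_EMBEDDINGS[word])
        print(f"cat vs {word}: {score:.3f}")
```

A score near 1 means the vectors point in almost the same direction, and a score near 0 means they are unrelated. With these toy values, "cat" scores much higher against "kitten" than against "banana".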

### How does RAG work with Box and Pinecone?

In a typical RAG workflow:

* **Indexing**: Content from a Box folder is extracted, chunked, and
  converted into embeddings using an embedding model. These embeddings are
  stored in Pinecone along with metadata such as the file name and a
  reference back to the source file in Box.

* **Retrieval**: When a user asks a question, the question is converted
  into an embedding using the same model that indexed the content. Pinecone
  compares that embedding with the stored ones and returns the content
  chunks most relevant to the question.

* **Generation**: The original question and the retrieved content chunks
  are combined into a prompt and sent to an LLM, which generates an answer
  grounded in your Box content.

The LLM doesn't have direct access to your Box files. RAG bridges that gap
by providing the relevant context so the model can generate accurate answers
based on your data.
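
The indexing and retrieval stages can be sketched in miniature without any external service. The example below is an illustration only, not the sample repository's code: `toy_embed` is a word-count stand-in for a real embedding model, the in-memory list stands in for Pinecone, and the function names and chunk sizes are assumptions.

```python
import math
import re
from collections import Counter


def chunk_words(text: str, chunk_size: int = 50, overlap: int = 10) -> list[str]:
    """Split text into word-based chunks that share `overlap` words."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the remaining words are already covered by this chunk
    return chunks


def toy_embed(text: str) -> Counter:
    """Stand-in for an embedding model: a sparse word-count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def build_index(files: dict[str, str], chunk_size: int = 50, overlap: int = 10) -> list[dict]:
    """Indexing: chunk each file and store a vector plus metadata per chunk."""
    index = []
    for file_name, text in files.items():
        for chunk in chunk_words(text, chunk_size, overlap):
            index.append({"file_name": file_name, "text": chunk, "vector": toy_embed(chunk)})
    return index


def retrieve(question: str, index: list[dict], top_k: int = 2) -> list[dict]:
    """Retrieval: embed the question the same way and rank chunks by similarity."""
    query_vector = toy_embed(question)
    ranked = sorted(index, key=lambda rec: cosine(query_vector, rec["vector"]), reverse=True)
    return ranked[:top_k]
```

The overlap between consecutive chunks is a design choice: it reduces the chance that a sentence split across a chunk boundary becomes unfindable. The question must be embedded with the same function as the content, because vectors from different models are not comparable.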

## Prerequisites

Before you begin, make sure you have the following:

* A **Box folder** with files you want to query. The folder must be
  accessible by the user account that creates the Box application. Note the
  folder ID, which is the number at the end of the folder's URL in the Box
  web app. You will need it later.
* A **Pinecone account**.
  [Sign up for a free starter plan](https://docs.pinecone.io/guides/get-started/quickstart)
  and generate an API key.
* An **OpenAI account**, which this tutorial uses as the LLM provider. You
  can substitute another LLM provider.
  [Sign up for OpenAI](https://platform.openai.com/docs/quickstart) and
  generate an API key. You may need to attach billing information.
* [Python 3](https://www.python.org/downloads/) installed on your machine.

## Create a Box custom application

Create an OAuth application in the
[Box Developer Console](https://cloud.app.box.com/developers/console):

1. Click **New App** in the top right corner.
2. Enter an app name and select **OAuth 2.0** as the authentication method.
3. Click **Create App**.
4. Scroll down to **Redirect URIs** and add:
   `http://127.0.0.1:5000/callback`
5. Check all boxes under **Application Scopes**.
6. Click **Save Changes**.

Note the **Client ID** and **Client Secret**. You will need these for the
configuration file.

## Create a Pinecone index

1. Log in to the [Pinecone Console](https://app.pinecone.io/).
2. Click **Create Index**.
3. Give the index a name (for example, `pinecone-demo`).
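4. Set the **Dimensions** field to `1024`. This value must match the length
   of the vectors the script writes. Pinecone rejects vectors whose length
   differs from the index dimension.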
5. Leave all other settings at their defaults and click **Create Index**.

<Frame caption="Create a Pinecone index with 1024 dimensions">
  <img src="https://mintcdn.com/box/nQi7jppEz_5O1YgH/images/ai/vector-databases/pinecone-create-index.png?fit=max&auto=format&n=nQi7jppEz_5O1YgH&q=85&s=5e480208c5cb72805b40b3f6da2b2216" alt="Create a Pinecone index" width="1100" height="610" data-path="images/ai/vector-databases/pinecone-create-index.png" />
</Frame>

Note the index name. You will add it to the configuration file.

<Frame caption="The Pinecone index overview after creation">
  <img src="https://mintcdn.com/box/nQi7jppEz_5O1YgH/images/ai/vector-databases/pinecone-index-screen.png?fit=max&auto=format&n=nQi7jppEz_5O1YgH&q=85&s=310c12351e1e646555e3a8ea03312c15" alt="Pinecone index screen" width="1100" height="612" data-path="images/ai/vector-databases/pinecone-index-screen.png" />
</Frame>

<Tip>
  You can also create an index programmatically via the
  [Pinecone API](https://docs.pinecone.io/reference/api/2024-07/control-plane/create_index).
</Tip>
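
As a rough sketch of the programmatic route, the following code builds the HTTP request for Pinecone's create-index endpoint using only the Python standard library. The `cosine` metric and the serverless `aws` / `us-east-1` placement are assumptions chosen for illustration, and `build_create_index_request` is a hypothetical helper name. Check the Pinecone API reference for the options available on your plan.

```python
import json
import urllib.request

PINECONE_INDEXES_URL = "https://api.pinecone.io/indexes"


def build_create_index_request(
    api_key: str,
    name: str,
    dimension: int = 1024,
    metric: str = "cosine",   # assumption: adjust to your use case
    cloud: str = "aws",       # assumption: serverless placement
    region: str = "us-east-1",
) -> urllib.request.Request:
    """Build (but do not send) a create-index request."""
    body = {
        "name": name,
        "dimension": dimension,
        "metric": metric,
        "spec": {"serverless": {"cloud": cloud, "region": region}},
    }
    return urllib.request.Request(
        PINECONE_INDEXES_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Api-Key": api_key,
            "Content-Type": "application/json",
            "X-Pinecone-API-Version": "2024-07",
        },
        method="POST",
    )
```

To send the request, pass the returned object to `urllib.request.urlopen`. Building and sending are kept separate here so that you can inspect the payload before making a network call.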

## Initialize the code repository

Clone the [sample code](https://github.com/box-community/box-pinecone-sample)
to your local machine:

```bash theme={null}
git clone https://github.com/box-community/box-pinecone-sample.git
cd box-pinecone-sample
```

Create and activate a virtual environment:

```bash theme={null}
python3 -m venv venv
source venv/bin/activate
```

Install the dependencies:

```bash theme={null}
pip install -r requirements.txt
```

Copy the sample configuration file and update it with your credentials:

```bash theme={null}
cp sample_config.py config.py
```

Open `config.py` in your editor and fill in:

* Your Box **Client ID** and **Client Secret**
* Your **Box folder ID**
* Your **Pinecone API key** and **index name**
* Your **OpenAI API key**

<Frame caption="Fill in your credentials and folder ID in config.py">
  <img src="https://mintcdn.com/box/nQi7jppEz_5O1YgH/images/ai/vector-databases/pinecone-config.png?fit=max&auto=format&n=nQi7jppEz_5O1YgH&q=85&s=5cd537b2285acdd7af035184674bb9b6" alt="Configuration file" width="1100" height="689" data-path="images/ai/vector-databases/pinecone-config.png" />
</Frame>

<Warning>
  Do **not** set the folder ID to `0`. Folder `0` is the root folder, so the
  script would attempt to index your entire Box account. Such a run is likely
  to hit rate limits, consume significant resources, and fail to complete.
</Warning>
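
A small validation step can catch configuration mistakes before the script calls any API. The sketch below is an illustration only: the setting names are hypothetical and may not match the names used in the sample's `config.py`.

```python
# Hypothetical setting names; the sample's config.py may use different ones.
REQUIRED_SETTINGS = (
    "BOX_CLIENT_ID",
    "BOX_CLIENT_SECRET",
    "BOX_FOLDER_ID",
    "PINECONE_API_KEY",
    "PINECONE_INDEX_NAME",
    "OPENAI_API_KEY",
)


def validate_config(settings: dict[str, str]) -> list[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    for key in REQUIRED_SETTINGS:
        if not str(settings.get(key, "")).strip():
            problems.append(f"{key} is missing or empty")

    folder_id = str(settings.get("BOX_FOLDER_ID", "")).strip()
    if folder_id == "0":
        problems.append("BOX_FOLDER_ID is 0 (the root folder); choose a specific folder")
    elif folder_id and not folder_id.isdigit():
        problems.append("BOX_FOLDER_ID should contain only digits")
    return problems
```

Returning a list of problems rather than raising on the first one lets you fix every issue in a single pass.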

## Create and store embeddings

With the configuration complete, run the main script to create embeddings
from your Box content:

```bash theme={null}
python main.py
```

The first time you run the application, a browser window opens asking you to
grant access. This is the standard OAuth 2.0 authorization flow. Click
**Grant Access to Box**.

<Note>
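  The OAuth tokens are stored in a `.oauth.json` file in the project
  directory. A Box refresh token is valid for 60 days, so if you do not run
  the script within that period, you must grant access again. This file
  contains credentials, so keep it out of version control.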
</Note>

The script processes each file in the specified Box folder, extracts its
<Link href="/guides/representations/text">text representation</Link>,
chunks it, generates embeddings using the
[Pinecone Inference API](https://docs.pinecone.io/reference/api/2024-07/inference/generate-embeddings),
and stores them in Pinecone. Metadata, including a reference back to the
source file in Box, is attached to each vector.

Once complete, you can view the embeddings in the Pinecone Console along
with the associated metadata (file name, plain text of the chunk, and more).
This metadata is useful for filtering responses and managing document
versions.

<Frame caption="Embeddings with metadata in the Pinecone Console">
  <img src="https://mintcdn.com/box/nQi7jppEz_5O1YgH/images/ai/vector-databases/pinecone-embeddings-console.png?fit=max&auto=format&n=nQi7jppEz_5O1YgH&q=85&s=a28c58eda0c0732d4a62d0dd0c44dfa3" alt="Embeddings in Pinecone Console" width="1100" height="612" data-path="images/ai/vector-databases/pinecone-embeddings-console.png" />
</Frame>
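
The shape of a stored record can be sketched as follows. Pinecone records consist of an `id`, the vector `values`, and optional `metadata`. The metadata field names and the ID scheme below are assumptions for illustration and may differ from what the sample script writes.

```python
def build_records(
    file_id: str,
    file_name: str,
    chunks: list[str],
    embeddings: list[list[float]],
) -> list[dict]:
    """Pair each chunk with its embedding and attach Box metadata."""
    if len(chunks) != len(embeddings):
        raise ValueError("each chunk needs exactly one embedding")

    records = []
    for position, (chunk, vector) in enumerate(zip(chunks, embeddings)):
        records.append(
            {
                # Deterministic ID: re-indexing the same file overwrites
                # the existing record instead of creating a duplicate.
                "id": f"box-{file_id}-chunk-{position}",
                "values": list(vector),
                "metadata": {  # hypothetical field names
                    "file_id": file_id,
                    "file_name": file_name,
                    "chunk_index": position,
                    "text": chunk,
                    "box_url": f"https://app.box.com/file/{file_id}",
                },
            }
        )
    return records
```

With the Pinecone Python client, a list like this can be passed to an index's `upsert` method. Because an upsert replaces any record that has the same ID, deriving the ID from the Box file ID and the chunk position keeps repeated runs from accumulating duplicates.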

## Query the LLM

With the embeddings stored, you can ask questions about your Box content.
This part of the project uses OpenAI as the LLM provider.

Run the query script:

```bash theme={null}
python query.py
```

Enter a question at the prompt. The script converts your question to an
embedding, retrieves the most relevant content chunks from Pinecone, and
sends them along with your question to OpenAI to generate an answer.

<Frame caption="The query service returns an answer based on your Box content">
  <img src="https://mintcdn.com/box/nQi7jppEz_5O1YgH/images/ai/vector-databases/pinecone-query-answer.png?fit=max&auto=format&n=nQi7jppEz_5O1YgH&q=85&s=481a0971063e9f9128e406b0cbc4448a" alt="Query answer from the service" width="1100" height="108" data-path="images/ai/vector-databases/pinecone-query-answer.png" />
</Frame>
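
The generation step comes down to assembling a prompt from the retrieved chunks. The sketch below assumes each match is a dictionary whose `metadata` holds hypothetical `file_name` and `text` fields. The prompt wording is also an assumption, not the text used by `query.py`.

```python
def build_prompt(question: str, matches: list[dict], max_chunks: int = 3) -> str:
    """Combine the question with the top retrieved chunks into one prompt."""
    context_blocks = []
    for match in matches[:max_chunks]:
        metadata = match.get("metadata", {})
        source = metadata.get("file_name", "unknown file")
        text = metadata.get("text", "").strip()
        if text:  # skip matches that carry no chunk text
            context_blocks.append(f"[Source: {source}]\n{text}")

    context = "\n\n".join(context_blocks) or "(no relevant content found)"
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )
```

Labeling each chunk with its source file lets the model, and you, trace an answer back to a specific document in Box. Limiting the number of chunks keeps the prompt within the model's context window.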

## Enhancement ideas

<AccordionGroup>
  <Accordion title="Automate indexing with events">
    The sample runs indexing on demand. You could create a scheduled task or
    an event-driven service using
    <Link href="/guides/events">Box events</Link> that triggers when files
    in the Box folder change. The script uses upserts, so re-running it
    updates existing records.
  </Accordion>

  <Accordion title="Add a user interface">
    The query script runs via the command line. You could build a web UI for
    a more user-friendly question-and-answer experience.
  </Accordion>

  <Accordion title="Use different authentication methods">
    The demo uses OAuth 2.0. You could integrate
    <Link href="/guides/authentication/jwt">JWT</Link> or
    <Link href="/guides/authentication/client-credentials">Client Credentials Grant</Link>
    authentication for server-to-server use cases.
  </Accordion>

  <Accordion title="Swap models or configuration">
    The embeddings use the Pinecone Inference API, and the query script
    uses OpenAI. You can substitute different embedding models, LLM
    providers, vector dimensions, distance metrics, and chunk sizes to fit
    your use case.
  </Accordion>

  <Accordion title="Expand file type support">
    The script processes files that have a
    <Link href="/guides/representations/text">text representation</Link>
    in Box (automatically created for supported file types under 500 MB).
    You could add third-party libraries to handle additional content types
    or larger files.
  </Accordion>
</AccordionGroup>

## Resources

<CardGroup cols={2}>
  <Card title="Sample code" href="https://github.com/box-community/box-pinecone-sample" icon="github">
    Clone the Box + Pinecone sample repository on GitHub.
  </Card>

  <Card title="Pinecone documentation" href="https://www.pinecone.io/learn/vector-database/" icon="database">
    Learn more about vector databases and how Pinecone works.
  </Card>
</CardGroup>

<RelatedLinks
  title="RELATED RESOURCES"
  items={[
{ label: translate("AI integrations"), href: "/ai/integrations", badge: "GUIDE" },
{ label: translate("Box and Weaviate"), href: "/ai/vector-databases/weaviate", badge: "GUIDE" },
{ label: translate("Get started with Box AI"), href: "/guides/box-ai/ai-tutorials/prerequisites", badge: "GUIDE" }
]}
/>
