
Commit

Initial save for new orbis-lit-langchain repo
mzkrasner committed Dec 17, 2024
1 parent 16a2e5e commit a5651cd
Showing 239 changed files with 5,161 additions and 40,860 deletions.
7 changes: 5 additions & 2 deletions .example.env.local
Original file line number Diff line number Diff line change
@@ -1,3 +1,6 @@
OPENAI_API_KEY=
PINECONE_API_KEY=
PINECONE_ENVIRONMENT=
CONTEXT_ID=
TABLE_ID=
ETHEREUM_PRIVATE_KEY=
LIT_TOKEN_ID=
ORBIS_SEED=
169 changes: 139 additions & 30 deletions README.md
@@ -1,23 +1,18 @@
# Langchain, Pinecone, and GPT with Next.js - Full Stack Starter
# OrbisDB, Lit Protocol, and Langchain Starter

This is a basic starter project for building with the following tools and APIs:

- Next.js
- LangchainJS
- Pinecone Vector Database
- GPT3

When I started diving into all of this, I felt that while I understood some of the individual pieces, it was hard to put everything together into a cohesive project. I hope this project is useful for anyone looking to build with this stack who just needs something to start with.
- OrbisDB
- Lit Protocol

### What we're building

We are building an app that takes text (text files), embeds them into vectors, stores them into Pinecone, and allows semantic searching of the data.

For anyone wondering what Semantic search is, here is an overview (taken directly from ChatGPT4):
We are building an app that takes text files, embeds their contents into vectors, stores them in OrbisDB, and allows semantic searching of the data.

__Semantic search refers to a search approach that understands the user's intent and the contextual meaning of search queries, instead of merely matching keywords.__

__It uses natural language processing and machine learning to interpret the semantics, or meaning, behind queries. This results in more accurate and relevant search results. Semantic search can consider user intent, query context, synonym recognition, and natural language understanding. Its applications range from web search engines to personalized recommendation systems.__
We've also enabled data privacy using Lit Protocol, which encrypts the corresponding text for each embedding and programmatically decrypts it based on access control conditions.

## Running the app

@@ -28,48 +23,162 @@ In this section I will walk you through how to deploy and run this app.
To run this app, you need the following:

1. An [OpenAI](https://platform.openai.com/) API key
2. [Pinecone](https://app.pinecone.io/) API Key
2. A modified OrbisDB instance (setup outlined below)
3. Docker
4. A [Lit](https://www.litprotocol.com/) token ID (also shown below)

### Up and running
## Initial Setup

To run the app locally, follow these steps:

1. Clone this repo
1. Clone this repo and install the dependencies

```sh
git clone https://github.com/ceramicstudio/orbis-lit-langchain
cd orbis-lit-langchain
yarn install
```

2. In a separate terminal, clone this modified version of OrbisDB and install the dependencies

```sh
git clone git@github.com:dabit3/semantic-search-nextjs-pinecone-langchain-chatgpt.git
git clone https://github.com/mzkrasner/orbisdb
cd orbisdb
npm install
```

2. Change into the directory and install the dependencies using either NPM or Yarn
3. In your orbisdb terminal, start the database process

```sh
# Ensure that you have your Docker Daemon running in the background first
npm run dev
```

3. Copy `.example.env.local` to a new file called `.env.local` and update with your API keys and environment.
Your OrbisDB instance will need to initially be configured using the GUI running on `localhost:7008`. Navigate to this address in your browser and follow these steps:

__Be sure your environment is an actual environment given to you by Pinecone, like `us-west4-gcp-free`__
a. For "Ceramic node URL" enter the following value: `https://ceramic-orbisdb-mainnet-direct.hirenodes.io/`

4. (Optional) - Add your own custom text or markdown files into the `/documents` folder.
b. For "Ceramic Seed" simply click "generate a new one" and go to the next page

5. Run the app:
c. For "Database configuration" enter the following:

```sh
User=postgres
Database=postgres
Password=postgres
Host=localhost
Port=5432
```

Go to the next page

d. Click next on the presets page (do not select anything)

e. Connect with your MetaMask account and click "Get started". Keep the Orbis Studio UI open in your browser, as we will navigate back to it later

4. Go to your `orbis-lit-langchain` terminal and copy the example env file

```sh
cp .example.env.local .env.local
```

5. Navigate to your browser running the OrbisDB UI and create a new context. You can call this anything you want. Once saved, click into your new context and copy the value prefixed with "k" into your `.env.local` file

```sh
CONTEXT_ID="<your-context-id>"
```

6. Next, we will create an OrbisDB seed to self-authenticate onto the Ceramic Network using the Orbis SDK

```sh
yarn gen-seed
```

Copy only the array of numbers into your `.env.local` file

```sh
# enter as a string like "[2, 19, 140, 10...]"
ORBIS_SEED="your-array-here"
```

Make sure there is no trailing comma after the final number in your array
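As a quick sanity check, the seed string can be parsed the way a JSON array would be; this is an illustrative sketch (the function name is ours, not part of the Orbis SDK), and note that a trailing comma makes `JSON.parse` fail loudly:

```javascript
// Sketch: parse the ORBIS_SEED env string (e.g. "[2, 19, 140, 10]")
// into a byte array. Illustrative only; names are not from this repo.
function parseOrbisSeed(seedString) {
  const numbers = JSON.parse(seedString); // throws on a trailing comma
  if (
    !Array.isArray(numbers) ||
    numbers.some((n) => !Number.isInteger(n) || n < 0 || n > 255)
  ) {
    throw new Error("ORBIS_SEED must be a JSON array of bytes (0-255)");
  }
  return new Uint8Array(numbers);
}

console.log(parseOrbisSeed("[2, 19, 140, 10]").length); // 4
```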

7. Copy an active and funded OpenAI API key into your `.env.local` file next to `OPENAI_API_KEY`

8. Choose or create a dummy MetaMask address and claim Lit Protocol testnet tokens with that address by visiting `https://chronicle-yellowstone-faucet.getlit.dev/`

9. Navigate to `https://explorer.litprotocol.com/` in your browser and sign in with the same dummy address as the previous step. Once signed in, click "Mint a new PKP". After minting, copy the value under "Token ID" into your `.env.local` file

```sh
LIT_TOKEN_ID="<your-token-id>"
```

10. Grab the private key from your dummy MetaMask wallet (used in the two steps above) and enter it into your `.env.local` file

```sh
ETHEREUM_PRIVATE_KEY="<your-private-key>"
```
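Since a malformed key tends to fail in confusing ways later, a quick format check can help. This is an illustrative helper (not code from this repo), assuming a standard 32-byte hex private key:

```javascript
// Illustrative sanity check: a raw Ethereum private key is 32 bytes,
// i.e. 64 hex characters, usually written with a 0x prefix.
function looksLikePrivateKey(key) {
  return /^(0x)?[0-9a-fA-F]{64}$/.test(key);
}

console.log(looksLikePrivateKey("0x" + "ab".repeat(32))); // true
```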

11. Finally, deploy the OrbisDB data model we will use to store and query embeddings via vector search

```sh
yarn deploy-model
```

Copy the value prefixed with "k" into your `.env.local` file

```sh
TABLE_ID="<your-table-id>"
```
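At this point your `.env.local` should have every key from `.example.env.local` filled in. A sketch of the finished file (all values below are placeholders):

```sh
OPENAI_API_KEY="sk-..."
CONTEXT_ID="k..."
TABLE_ID="k..."
ETHEREUM_PRIVATE_KEY="0x..."
LIT_TOKEN_ID="..."
ORBIS_SEED="[2, 19, 140, 10]"
```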

## Running the Application

Now that our environment is configured, run the following to start the application from within your `orbis-lit-langchain` terminal

```sh
npm run dev
```

### Need to know
Make sure that OrbisDB is still running in your other terminal.

When creating the embeddings and the index, it can take up to 2-4 minutes for the index to fully initialize. There is a setTimeout of 180 seconds in `utils` that waits for the index to be created.
Navigate to `localhost:3000` in your browser.

If the initialization takes longer, then it will fail the first time you try to create the embeddings. If this happens, visit [the Pinecone console](https://app.pinecone.io/) to watch and wait for the status of your index being created to finish, then run the function again.
### Create embeddings

### Running a query
This repository contains a small portion of the [Ceramic Developer Docs](https://developers.ceramic.network/) (specifically information on Decentralized Identifiers) that the application will use to create encrypted embeddings. Feel free to replace this with other documentation if you wish.

__The pre-configured app data is about the [Lens protocol developer documentation](https://docs.lens.xyz/docs/overview), so it will only understand questions about it unless you replace it with your own data. Here are a couple of questions you might ask it with the default data__
Click on "Create index and embeddings" and observe your terminal logs in both your `orbisdb` and `orbis-lit-langchain` terminals.

1. What is the difference between Lens and traditional social platforms
2. What is the difference between the Lens SDK and the Lens API
3. How to query Lens data in bulk?
Once finished, your browser console will notify you that the data has been successfully created and loaded into OrbisDB.

> The base of this project was guided by [this Node.js tutorial](https://www.youtube.com/watch?v=CF5buEVrYwo), with some restructuring and ported over to Next.js. You can also follow them [here](https://twitter.com/Dev__Digest/status/1656744114409406467) on Twitter!
### Run a query

### Getting your data
Since the dataset is limited to specialized knowledge about DIDs, try the following query:

`tell me about decentralized identifiers in ceramic`

Since this knowledge is contained in the embeddings we just created, the app will retrieve them via cosine similarity search and use them as context in the LLM's response (after decrypting the values). You can watch your terminal's logs to see what decrypted context is being used.
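For intuition, cosine distance (what pgvector's `<=>` operator computes) is one minus the cosine similarity of the two vectors; lower means more similar. A minimal illustrative sketch, not code from this repo:

```javascript
// Illustrative: cosine distance between two equal-length vectors,
// as used to rank embeddings. 0 = same direction, 2 = opposite.
function cosineDistance(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return 1 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

console.log(cosineDistance([1, 0], [1, 0])); // 0
```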

**Ensure that the dummy wallet you spun up contains 0.000001 ETH or more**

## Access control

At the moment, a very simple access control condition is used, based on whether the wallet trying to read the data holds >= 0.000001 ETH (found in [utils](./utils.ts)):

```typescript
const accessControlConditions = [
{
contractAddress: "",
standardContractType: "",
chain: "ethereum",
method: "eth_getBalance",
parameters: [":userAddress", "latest"],
returnValueTest: {
comparator: ">=",
value: "1000000000000", // 0.000001 ETH
},
},
];
```
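The `value` field is denominated in wei (1 ETH = 10^18 wei). A small illustrative helper (not part of this repo) makes the conversion explicit for non-negative amounts:

```javascript
// Illustrative: convert an ETH amount string to the wei string used in
// the returnValueTest `value` field. BigInt math avoids float rounding.
function ethToWei(eth) {
  const [whole, frac = ""] = eth.split(".");
  const fracPadded = (frac + "0".repeat(18)).slice(0, 18);
  return (BigInt(whole) * 10n ** 18n + BigInt(fracPadded)).toString();
}

console.log(ethToWei("0.000001")); // "1000000000000"
```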

I recommend checking out [GPT Repository Loader](https://github.com/mpoon/gpt-repository-loader), which makes it simple to turn any GitHub repo into a text format, preserving the structure of the files and file contents, making it easy to chop up and save into Pinecone using my codebase.
There is a wide array of access control conditions you can use or create. For more information, visit [Lit's Access Control documentation](https://developer.litprotocol.com/sdk/access-control/intro).
55 changes: 39 additions & 16 deletions app/api/read/route.ts
@@ -1,21 +1,44 @@
import { NextRequest, NextResponse } from 'next/server'
import { PineconeClient } from '@pinecone-database/pinecone'
import {
queryPineconeVectorStoreAndQueryLLM,
} from '../../../utils'
import { indexName } from '../../../config'
import { NextRequest, NextResponse } from "next/server";
import { queryLLM } from "../../../utils";
import { OrbisDB } from "@useorbis/db-sdk";
import { OpenAIEmbeddings } from "langchain/embeddings/openai";
import { RecursiveCharacterTextSplitter } from "langchain/text_splitter";

export async function POST(req: NextRequest) {
const body = await req.json()
const client = new PineconeClient()
await client.init({
apiKey: process.env.PINECONE_API_KEY || '',
environment: process.env.PINECONE_ENVIRONMENT || ''
})
const body = await req.json();
console.log("body: ", body);
const db = new OrbisDB({
ceramic: {
gateway: "https://ceramic-orbisdb-mainnet-direct.hirenodes.io/",
},
nodes: [
{
gateway: "http://localhost:7008",
},
],
});

const text = await queryPineconeVectorStoreAndQueryLLM(client, indexName, body)
const textSplitter = new RecursiveCharacterTextSplitter({
chunkSize: 1000,
});
console.log("Splitting query into chunks...");
// 5. Split text into chunks (documents)
const chunks = await textSplitter.createDocuments([body]);

const array = await new OpenAIEmbeddings().embedDocuments(
chunks.map((chunk) => chunk.pageContent.replace(/\n/g, " "))
);
const formattedEmbedding = `ARRAY[${array.join(", ")}]::vector`;
const query = `
SELECT content, embedding <=> ${formattedEmbedding} AS similarity
FROM ${process.env.TABLE_ID}
ORDER BY similarity ASC
LIMIT 5;
`;
const context = await db.select().raw(query).run();
const res = await queryLLM(body, context);

return NextResponse.json({
data: text
})
}
data: res,
});
}
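The inline `ARRAY[...]::vector` string built in the new `route.ts` can be sketched as a small standalone helper (the function name is ours, not the repo's):

```javascript
// Sketch: format an embedding (an array of numbers) the way the route
// above interpolates it into the pgvector similarity query.
function toPgVectorLiteral(embedding) {
  return `ARRAY[${embedding.join(", ")}]::vector`;
}

console.log(toPgVectorLiteral([0.1, 0.2])); // "ARRAY[0.1, 0.2]::vector"
```

Note that interpolating values directly into SQL is fragile in general; the values here are numeric, but parameterized queries are usually the safer design.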
42 changes: 15 additions & 27 deletions app/api/setup/route.ts
@@ -1,38 +1,26 @@
import { NextResponse } from 'next/server'
import { PineconeClient } from '@pinecone-database/pinecone'
import { TextLoader } from 'langchain/document_loaders/fs/text'
import { DirectoryLoader } from 'langchain/document_loaders/fs/directory'
import { PDFLoader } from 'langchain/document_loaders/fs/pdf'
import {
createPineconeIndex,
updatePinecone
} from '../../../utils'
import { indexName } from '../../../config'
import { NextResponse } from "next/server";
import { TextLoader } from "langchain/document_loaders/fs/text";
import { DirectoryLoader } from "langchain/document_loaders/fs/directory";
import { PDFLoader } from "langchain/document_loaders/fs/pdf";
import { updateOrbis } from "../../../utils";

export async function POST() {
const loader = new DirectoryLoader('./documents', {
const loader = new DirectoryLoader("./documents", {
".txt": (path) => new TextLoader(path),
".md": (path) => new TextLoader(path),
".pdf": (path) => new PDFLoader(path)
})
".pdf": (path) => new PDFLoader(path),
});

const docs = await loader.load()
const vectorDimensions = 1536

const client = new PineconeClient()
await client.init({
apiKey: process.env.PINECONE_API_KEY || '',
environment: process.env.PINECONE_ENVIRONMENT || ''
})
const docs = await loader.load();

try {
await createPineconeIndex(client, indexName, vectorDimensions)
await updatePinecone(client, indexName, docs)
const { CONTEXT_ID, TABLE_ID } = process.env;
await updateOrbis(docs, CONTEXT_ID, TABLE_ID);
} catch (err) {
console.log('error: ', err)
console.log("error: ", err);
}

return NextResponse.json({
data: 'successfully created index and loaded data into pinecone...'
})
}
data: "successfully created index and loaded data into OrbisDB...",
});
}
