update docs
diptanu committed Jul 21, 2024
1 parent 09f7a5d commit 0cb379d
Showing 12 changed files with 33 additions and 126 deletions.
30 changes: 7 additions & 23 deletions examples/pdf/chunking/README.md
@@ -1,25 +1,9 @@
# PDF Chunking with Indexify and RecursiveCharacterTextSplitter

In this cookbook, we'll explore how to create a PDF chunking pipeline using Indexify, the tensorlake/marker for PDF text extraction, and the tensorlake/chunk-extractor with RecursiveCharacterTextSplitter. By the end of this document, you should have a pipeline capable of ingesting PDF documents and chunking their content for further processing or analysis.
Pipeline to extract and chunk text from a PDF. The pipeline uses -

## Table of Contents

1. [Introduction](#introduction)
2. [Prerequisites](#prerequisites)
3. [Setup](#setup)
- [Install Indexify](#install-indexify)
- [Install Required Extractors](#install-required-extractors)
4. [Creating the Extraction Graph](#creating-the-extraction-graph)
5. [Implementing the Chunking Pipeline](#implementing-the-chunking-pipeline)
6. [Running the Chunking Process](#running-the-chunking-process)
7. [Customization and Advanced Usage](#customization-and-advanced-usage)
8. [Conclusion](#conclusion)

## Introduction

The PDF chunking pipeline will be composed of two main steps:
1. PDF to Text extraction using the `tensorlake/marker` extractor.
2. Text chunking using the `tensorlake/chunk-extractor` with RecursiveCharacterTextSplitter.
1. `tensorlake/marker` for PDF text extraction
2. `tensorlake/chunk-extractor` chunks markdown from the previous step with Langchain's RecursiveCharacterTextSplitter.
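
To make the second step concrete, here is a minimal sketch of the idea behind recursive character splitting: try coarse separators first and fall back to finer ones only for pieces that are still too long. This is a simplified illustration, not Langchain's implementation (it has no chunk overlap), and the function name `recursive_split` and its defaults are assumptions made for this example.

```python
from typing import List, Sequence


def recursive_split(
    text: str,
    chunk_size: int = 1000,
    separators: Sequence[str] = ("\n\n", "\n", " ", ""),
) -> List[str]:
    """Split text into chunks of at most chunk_size characters.

    Hypothetical helper: prefers paragraph breaks, then line breaks,
    then spaces, and finally cuts at a fixed width.
    """
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard cut at fixed width.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks: List[str] = []
    current = ""
    for part in text.split(sep):
        candidate = part if not current else current + sep + part
        if len(candidate) <= chunk_size:
            # Keep merging neighbouring pieces while they fit.
            current = candidate
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(part) <= chunk_size:
            current = part
        else:
            # This piece alone is too long: retry with finer separators.
            chunks.extend(recursive_split(part, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    sample = "alpha beta\n\ngamma delta epsilon\n\nzeta"
    for chunk in recursive_split(sample, chunk_size=12):
        print(repr(chunk))
```

In the actual pipeline the splitting happens inside `tensorlake/chunk-extractor`; the sketch only shows why chunks tend to end on paragraph or word boundaries.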

## Prerequisites

@@ -91,12 +75,12 @@ client.create_extraction_graph(extraction_graph)

You can run this script to set up the pipeline:
```bash
python pdf_chunking_graph.py
python setup_graph.py
```

## Ingestion and Retreival from the Pipeline
## Ingestion and Retrieval from the Pipeline

Now that we have our extraction graph set up, we can upload files and make the pipeline generate chunks. Create a file `upload_and_retreive.py`:
Now that we have our extraction graph set up, we can upload files and make the pipeline generate chunks. Create a file `upload_and_retrieve.py`:

```python
import os
@@ -146,7 +130,7 @@ if __name__ == "__main__":

You can run the Python script to process a PDF and generate chunks:
```bash
python upload_and_retreive.py
python upload_and_retrieve.py
```
<img src="https://docs.getindexify.ai/example_code/pdf/chunking/carbon.png" width="600"/>

File renamed without changes.
26 changes: 7 additions & 19 deletions examples/pdf/image/README.md
@@ -1,22 +1,10 @@
# PDF Image Extraction with Indexify

This project demonstrates how to extract images from PDF documents using Indexify. It includes two main components: setting up an extraction graph for image extraction and a script to process PDFs and retrieve the extracted images.
Pipeline to extract images from PDF. The pipeline uses -

## Table of Contents
1. `tensorlake/pdfextractor` to extract images from PDFs.

1. [Introduction](#introduction)
2. [Prerequisites](#prerequisites)
3. [Setup](#setup)
4. [File Descriptions](#file-descriptions)
5. [Usage](#usage)
6. [Customization](#customization)
7. [Conclusion](#conclusion)

## Introduction

This project showcases the use of Indexify to create a pipeline for extracting images from PDF documents. It consists of two main parts:
- An extraction graph that defines the process of converting PDFs to images.
- A script that downloads a PDF, uploads it to Indexify, and retrieves the extracted images.
We provide a script that downloads a PDF, uploads it to Indexify, and retrieves the extracted images.

## Prerequisites

@@ -47,15 +35,15 @@ Before we begin, ensure you have the following:

## File Descriptions

1. `setup.py`: This script sets up the extraction graph for converting PDFs to images.
1. `setup_graph.py`: This script sets up the extraction graph for converting PDFs to images.

2. `upload_and_retreive.py`: This script downloads a PDF, uploads it to Indexify, and retrieves the extracted images.

## Usage

1. First, run the [`setup.py`](https://github.com/tensorlakeai/indexify/blob/main/examples/pdf/image/setup.py) script to set up the extraction graph:
1. First, run the [`setup_graph.py`](https://github.com/tensorlakeai/indexify/blob/main/examples/pdf/image/setup_graph.py) script to set up the extraction graph:
```bash
python setup.py
python setup_graph.py
```

2. Then, run the [`upload_and_retrieve.py`](https://github.com/tensorlakeai/indexify/blob/main/examples/pdf/image/upload_and_retrieve.py) script to process a PDF and extract images:
@@ -75,7 +63,7 @@ Before we begin, ensure you have the following:

## Customization

You can customize the image extraction process by modifying the `extraction_graph_spec` in `image_pipeline.py`. For example, you could add additional extraction steps or change the output format.
You can customize the image extraction process by modifying the `extraction_graph_spec` in `setup_graph.py`. For example, you could add additional extraction steps or change the output format.

In `upload_and_retrieve.py`, you can modify the `pdf_url` variable to process different PDF documents.
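
As a rough sketch of what "adding an extraction step" means, the snippet below models a graph spec as a plain Python dict and checks that every step reads from a step declared before it. The field names (`extraction_policies`, `extractor`, `name`, `content_source`) are assumed to mirror the YAML in `setup_graph.py`. The graph name, the policy names, the second step that chains `tensorlake/clip-extractor`, and the helper `dangling_sources` are hypothetical and exist only for this illustration.

```python
from typing import Any, Dict, List

# Hypothetical spec: field names are assumed to match the YAML format.
spec: Dict[str, Any] = {
    "name": "image_extractor",  # hypothetical graph name
    "extraction_policies": [
        {"extractor": "tensorlake/pdfextractor", "name": "pdf_to_image"},
        {
            # Hypothetical extra step that consumes the images from step one.
            "extractor": "tensorlake/clip-extractor",
            "name": "image_embedding",
            "content_source": "pdf_to_image",
        },
    ],
}


def dangling_sources(graph_spec: Dict[str, Any]) -> List[str]:
    """Return 'policy -> source' pairs whose source was not declared earlier."""
    seen = set()
    problems: List[str] = []
    for policy in graph_spec["extraction_policies"]:
        source = policy.get("content_source")
        if source is not None and source not in seen:
            problems.append(f"{policy['name']} -> {source}")
        seen.add(policy["name"])
    return problems


if __name__ == "__main__":
    print(dangling_sources(spec))
```

A check like this can catch a misspelled `content_source` before the spec is sent to the server.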

File renamed without changes.
23 changes: 9 additions & 14 deletions examples/pdf/indexing_and_rag/README.md
@@ -1,14 +1,9 @@
# Retrieval-Augmented Generation (RAG) with Indexify

In this cookbook, we'll explore how to create a Retrieval-Augmented Generation (RAG) system using Indexify. We'll cover two approaches: a text-based RAG system and a multimodal RAG system.

## Introduction

We'll explore two RAG pipelines:
# Retrieval-Augmented Generation (RAG) on PDFs with Indexify

We show how to create a Retrieval-Augmented Generation (RAG) system using Indexify. We'll cover two approaches: a text-based RAG system and a multi-modal RAG system.

1. A text-based RAG system using `tensorlake/pdfextractor`, `tensorlake/chunk-extractor`, and `tensorlake/minilm-l6`.
2. A multimodal RAG system that includes image processing using `tensorlake/clip-extractor` and GPT-4o mini for answer generation.
2. A multi-modal RAG system that includes image processing using `tensorlake/clip-extractor` and GPT-4o mini for answer generation.
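
Both approaches share the same retrieval idea: embed the question, rank the indexed chunks by similarity, and pass the best matches to the language model as context. The following is a minimal, library-free sketch of that step. It assumes embeddings are plain lists of floats and uses hypothetical helper names (`cosine`, `top_k`, `build_prompt`); it does not use the Indexify client or any real embedding model.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(
    query_vec: Sequence[float],
    indexed: List[Tuple[str, Sequence[float]]],
    k: int = 2,
) -> List[str]:
    """Return the texts of the k chunks most similar to the query vector."""
    ranked = sorted(indexed, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def build_prompt(question: str, contexts: List[str]) -> str:
    """Assemble a prompt that restricts the model to the retrieved context."""
    context_block = "\n".join(f"- {c}" for c in contexts)
    return f"Answer using only this context:\n{context_block}\n\nQuestion: {question}"


if __name__ == "__main__":
    # Toy two-dimensional "embeddings", for illustration only.
    index = [("chunk a", [1.0, 0.0]), ("chunk b", [0.0, 1.0]), ("chunk c", [0.9, 0.1])]
    print(build_prompt("What is in the PDF?", top_k([1.0, 0.0], index, k=2)))
```

In the pipelines described here, the embedding and indexing are handled by the extractors, so only the prompt assembly would live in your own code.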

## Prerequisites

@@ -166,9 +161,9 @@ python upload_and_retreive.py
```
<img src="https://docs.getindexify.ai/example_code/pdf/indexing_and_rag/carbon.png" width="600"/>

## Part 2: Multimodal RAG with GPT-4o mini
## Part 2: Multi-Modal RAG with GPT-4o mini

### Creating the Multimodal Extraction Graph
### Creating the Multi-Modal Extraction Graph

Create a new Python file called `mm_extraction_graph.py` and add the following code:

@@ -205,7 +200,7 @@ extraction_graph = ExtractionGraph.from_yaml(extraction_graph_spec)
client.create_extraction_graph(extraction_graph)
```

Run this script to set up the multimodal pipeline:
Run this script to set up the multi-modal pipeline:
```bash
python mm_extraction_graph.py
```
@@ -315,7 +310,7 @@ if __name__ == "__main__":

Replace `"YOUR_OPENAI_API_KEY"` with your actual OpenAI API key.

### Running the Multimodal RAG System
### Running the Multi-Modal RAG System

Reference from PDF file from which answer should be generated:

@@ -358,11 +353,11 @@ These RAG systems demonstrate the power of combining Indexify with large languag
1. **Scalability**: Indexify can process and index large numbers of PDFs efficiently, including both text and images.
2. **Flexibility**: You can easily swap out components or adjust parameters to suit your specific needs.
3. **Integration**: The systems seamlessly integrate PDF processing, embedding generation, and text generation.
4. **Multimodal Capabilities**: The second system shows how to incorporate both text and image data for more comprehensive question answering.
4. **Multi-Modal Capabilities**: The second system shows how to incorporate both text and image data for more comprehensive question answering.

## Next Steps

- Learn more about Indexify on our docs - https://docs.getindexify.ai
- Explore ways to evaluate and improve the quality of retrieved contexts and generated answers.
- Consider implementing a user interface for easier interaction with your RAG systems.
- Experiment with different multimodal models and ways of combining text and image data for more sophisticated question answering.
- Experiment with different multi-modal models and ways of combining text and image data for more sophisticated question answering.
19 changes: 1 addition & 18 deletions examples/pdf/pdf_to_markdown/README.md
@@ -1,23 +1,6 @@
# PDF Text Extraction with Indexify and Marker

This guide demonstrates how to create a PDF text extraction pipeline using Indexify and the tensorlake/marker extractor. By the end of this document, you'll have a pipeline capable of extracting text content from PDF documents for further processing or analysis.

## Table of Contents

1. [Introduction](#introduction)
2. [Prerequisites](#prerequisites)
3. [Setup](#setup)
- [Install Indexify](#install-indexify)
- [Install Required Extractor](#install-required-extractor)
4. [Creating the Extraction Graph](#creating-the-extraction-graph)
5. [Implementing the Text Extraction Pipeline](#implementing-the-text-extraction-pipeline)
6. [Running the Text Extraction Process](#running-the-text-extraction-process)
7. [Customization and Advanced Usage](#customization-and-advanced-usage)
8. [Conclusion](#conclusion)

## Introduction

The PDF text extraction pipeline will use the `tensorlake/marker` extractor to convert PDF documents into plain text.
We show how to create a pipeline capable of extracting text content from PDF documents. It uses the `tensorlake/marker` extractor to convert PDF documents into markdown.

## Prerequisites

Binary file added examples/pdf/sample_invoice.pdf
Binary file not shown.
21 changes: 2 additions & 19 deletions examples/pdf/structured_extraction/README.md
@@ -1,25 +1,8 @@
# Structured Extraction from PDFs with GPT-4

In this cookbook, we'll explore how to create a PDF schema extraction pipeline using Indexify, the Marker PDF extractor, and OpenAI's language models. By the end of the document, you should have a pipeline capable of ingesting PDF documents and extracting structured information based on a predefined schema.
Structured Extraction from PDF involves extracting specific information from documents. We show how to create a pipeline, which accepts a schema and extracts information from PDFs into the provided schema.

![Preview data](https://i.postimg.cc/XYCqNP0p/hoa.png)

## Table of Contents

1. [Introduction](#introduction)
2. [Prerequisites](#prerequisites)
3. [Setup](#setup)
- [Install Indexify](#install-indexify)
- [Install Required Extractors](#install-required-extractors)
4. [Creating the Extraction Graph](#creating-the-extraction-graph)
5. [Implementing the Schema Extraction Pipeline](#implementing-the-schema-extraction-pipeline)
6. [Running the Schema Extraction](#running-the-schema-extraction)
7. [Customization and Advanced Usage](#customization-and-advanced-usage)
8. [Conclusion](#conclusion)

## Introduction

The schema extraction pipeline will consist of two steps:
The pipeline is composed of two steps:
- PDF to Text extraction using the pre-built extractor `tensorlake/marker`.
- Schema-based information extraction using `tensorlake/schema` with OpenAI's language models.
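
To illustrate what "extracting into the provided schema" implies for downstream code, here is a small sketch that checks a model's JSON output against a schema. The schema layout follows common JSON Schema conventions, but the invoice fields, the `INVOICE_SCHEMA` name, and the `schema_errors` helper are hypothetical; the format that `tensorlake/schema` actually accepts may differ.

```python
import json
from typing import Any, Dict, List

# Hypothetical schema with two required fields.
INVOICE_SCHEMA: Dict[str, Any] = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "total": {"type": "number"},
    },
    "required": ["invoice_number", "total"],
}

# Map the schema's type names to Python types (only those used above).
_TYPES = {"string": str, "number": (int, float)}


def schema_errors(raw_json: str, schema: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in raw_json; empty means it conforms."""
    data = json.loads(raw_json)
    errors: List[str] = []
    for field in schema["required"]:
        if field not in data:
            errors.append(f"missing: {field}")
    for field, rule in schema["properties"].items():
        if field in data and not isinstance(data[field], _TYPES[rule["type"]]):
            errors.append(f"wrong type: {field}")
    return errors


if __name__ == "__main__":
    print(schema_errors('{"invoice_number": "INV-1", "total": 12.5}', INVOICE_SCHEMA))
```

Validating the output this way is optional, but it helps surface cases where the language model returns a field as text instead of a number.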

21 changes: 5 additions & 16 deletions examples/pdf/table_extraction/README.md
@@ -1,20 +1,9 @@
# Table Extraction from PDFs

This project demonstrates how to extract tables from PDF documents using Indexify. It includes two main components: setting up an extraction graph for table extraction and a script to process PDFs and retrieve the extracted tables.
We show how to extract tables from PDF documents using Indexify.

## Table of Contents
It consists of two main parts:

1. [Introduction](#introduction)
2. [Prerequisites](#prerequisites)
3. [Setup](#setup)
4. [File Descriptions](#file-descriptions)
5. [Usage](#usage)
6. [Customization](#customization)
7. [Conclusion](#conclusion)

## Introduction

This project showcases the use of Indexify to create a pipeline for extracting tables from PDF documents. It consists of two main parts:
- An extraction graph that defines the process of converting PDFs to tables.
- A script that downloads a PDF, uploads it to Indexify, and retrieves the extracted tables.

@@ -49,7 +38,7 @@ Before we begin, ensure you have the following:

1. `table_pipeline.py`: This script sets up the extraction graph for converting PDFs to tables.

2. `upload_and_retreive.py`: This script downloads a PDF, uploads it to Indexify, and retrieves the extracted tables.
2. `upload_and_retrieve.py`: This script downloads a PDF, uploads it to Indexify, and retrieves the extracted tables.

## Usage

@@ -58,7 +47,7 @@ Before we begin, ensure you have the following:
python table_pipeline.py
```

2. Then, run the [`upload_and_retreive.py`](upload_and_retreive.py) script to process a PDF and extract tables:
2. Then, run the [`upload_and_retrieve.py`](upload_and_retreive.py) script to process a PDF and extract tables:
```bash
python upload_and_retreive.py
```
@@ -74,7 +63,7 @@ Before we begin, ensure you have the following:

You can customize the table extraction process by modifying the `extraction_graph_spec` in `table_pipeline.py`. For example, you could add additional extraction steps or change the output format.

In `upload_and_retreive.py`, you can modify the `pdf_url` variable to process different PDF documents.
In `upload_and_retrieve.py`, you can modify the `pdf_url` variable to process different PDF documents.

## Conclusion

19 changes: 2 additions & 17 deletions examples/video/transcript/README.md
@@ -1,23 +1,8 @@
# Debate Topic-wise Summary Pipeline with Indexify and Mistral

In this cookbook, we'll explore how to create a debate topic-wise summary pipeline using Indexify and Mistral's large language models. By the end of this document, you'll have a pipeline capable of processing video debates, extracting audio, performing speech recognition and diarization, and generating summaries for each topic discussed.
We show how to create a pipeline capable of summarizing and performing topic extraction on videos.
The pipeline will consist of four main steps -

## Table of Contents

1. [Introduction](#introduction)
2. [Prerequisites](#prerequisites)
3. [Setup](#setup)
- [Install Indexify](#install-indexify)
- [Install Required Extractors](#install-required-extractors)
4. [Creating the Extraction Graph](#creating-the-extraction-graph)
5. [Implementing the Debate Summary Pipeline](#implementing-the-debate-summary-pipeline)
6. [Running the Summary Pipeline](#running-the-summary-pipeline)
7. [Customization and Advanced Usage](#customization-and-advanced-usage)
8. [Conclusion](#conclusion)

## Introduction

The debate summary pipeline will consist of four main steps:
1. Video to Audio extraction using `tensorlake/audio-extractor`
2. Speech recognition and diarization using `tensorlake/asrdiarization`
3. Topic extraction using `tensorlake/mistral`
