update docs

tensorlakeai · Jul 21, 2024 · 0cb379d
1 parent 09f7a5d
commit 0cb379d

Showing 12 changed files with 33 additions and 126 deletions.
diff --git a/examples/pdf/chunking/README.md b/examples/pdf/chunking/README.md
@@ -1,25 +1,9 @@
 # PDF Chunking with Indexify and RecursiveCharacterTextSplitter
 
-In this cookbook, we'll explore how to create a PDF chunking pipeline using Indexify, the tensorlake/marker for PDF text extraction, and the tensorlake/chunk-extractor with RecursiveCharacterTextSplitter. By the end of this document, you should have a pipeline capable of ingesting PDF documents and chunking their content for further processing or analysis.
+Pipeline to extract and chunk text from a PDF. The pipeline uses - 
 
-## Table of Contents
-
-1. [Introduction](#introduction)
-2. [Prerequisites](#prerequisites)
-3. [Setup](#setup)
-   - [Install Indexify](#install-indexify)
-   - [Install Required Extractors](#install-required-extractors)
-4. [Creating the Extraction Graph](#creating-the-extraction-graph)
-5. [Implementing the Chunking Pipeline](#implementing-the-chunking-pipeline)
-6. [Running the Chunking Process](#running-the-chunking-process)
-7. [Customization and Advanced Usage](#customization-and-advanced-usage)
-8. [Conclusion](#conclusion)
-
-## Introduction
-
-The PDF chunking pipeline will be composed of two main steps:
-1. PDF to Text extraction using the `tensorlake/marker` extractor.
-2. Text chunking using the `tensorlake/chunk-extractor` with RecursiveCharacterTextSplitter.
+1. `tensorlake/marker` for PDF text extraction
+2. `tensorlake/chunk-extractor` chunks markdown from the previous step with Langchain's RecursiveCharacterTextSplitter.
 
 ## Prerequisites
 
@@ -91,12 +75,12 @@ client.create_extraction_graph(extraction_graph)
 
 You can run this script to set up the pipeline:
 ```bash
-python pdf_chunking_graph.py
+python setup_graph.py
 ```
 
-## Ingestion and Retreival from the Pipeline
+## Ingestion and Retrieval from the Pipeline
 
-Now that we have our extraction graph set up, we can upload files and make the pipeline generate chunks. Create a file `upload_and_retreive.py`:
+Now that we have our extraction graph set up, we can upload files and make the pipeline generate chunks. Create a file `upload_and_retrieve.py`:
 
 ```python
 import os
@@ -146,7 +130,7 @@ if __name__ == "__main__":
 
 You can run the Python script to process a PDF and generate chunks:
 ```bash
-python upload_and_retreive.py
+python upload_and_retrieve.py
 ```
 <img src="https://docs.getindexify.ai/example_code/pdf/chunking/carbon.png" width="600"/>
 

diff --git a/examples/pdf/chunking/pdf_chunking_graph.py → examples/pdf/chunking/setup_graph.py b/examples/pdf/chunking/pdf_chunking_graph.py → examples/pdf/chunking/setup_graph.py
diff --git a/examples/pdf/chunking/upload_and_retreive.py → examples/pdf/chunking/upload_and_retrieve.py b/examples/pdf/chunking/upload_and_retreive.py → examples/pdf/chunking/upload_and_retrieve.py
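
The chunking README above describes splitting the markdown produced by `tensorlake/marker` with a recursive character splitter. As a rough illustration of that idea only, here is a minimal plain-Python sketch. It is not LangChain's `RecursiveCharacterTextSplitter` and not the code inside `tensorlake/chunk-extractor`; the function name `recursive_split`, the separator order, and the lack of chunk overlap are all assumptions made for this example.

```python
from typing import List, Sequence


def recursive_split(
    text: str,
    chunk_size: int,
    separators: Sequence[str] = ("\n\n", "\n", " ", ""),
) -> List[str]:
    """Split text into chunks of at most chunk_size characters.

    Hypothetical helper: tries coarse separators first (paragraphs),
    then finer ones (lines, words), and finally cuts by character.
    """
    if len(text) <= chunk_size:
        return [text] if text else []

    # No separator left (or the empty separator): hard cut by length.
    if not separators or separators[0] == "":
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks: List[str] = []
    current = ""
    for part in text.split(sep):
        candidate = part if not current else current + sep + part
        if len(candidate) <= chunk_size:
            # Keep merging neighbouring parts while they still fit.
            current = candidate
            continue
        if current:
            chunks.append(current)
        if len(part) <= chunk_size:
            current = part
        else:
            # This part alone is too long: retry with finer separators.
            chunks.extend(recursive_split(part, chunk_size, finer))
            current = ""
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    sample = "alpha beta\n\ngamma delta epsilon\nzeta"
    for chunk in recursive_split(sample, chunk_size=12):
        print(repr(chunk))
```

Trying paragraph breaks before line breaks and spaces tends to keep related text together, which is the usual motivation for a recursive splitter over a fixed-width cut.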
diff --git a/examples/pdf/image/README.md b/examples/pdf/image/README.md
@@ -1,22 +1,10 @@
 # PDF Image Extraction with Indexify
 
-This project demonstrates how to extract images from PDF documents using Indexify. It includes two main components: setting up an extraction graph for image extraction and a script to process PDFs and retrieve the extracted images.
+Pipeline to extract images from PDF. The pipeline uses - 
 
-## Table of Contents
+1. `tensorlake/pdfextractor` to extract images from PDFs.
 
-1. [Introduction](#introduction)
-2. [Prerequisites](#prerequisites)
-3. [Setup](#setup)
-4. [File Descriptions](#file-descriptions)
-5. [Usage](#usage)
-6. [Customization](#customization)
-7. [Conclusion](#conclusion)
-
-## Introduction
-
-This project showcases the use of Indexify to create a pipeline for extracting images from PDF documents. It consists of two main parts:
-- An extraction graph that defines the process of converting PDFs to images.
-- A script that downloads a PDF, uploads it to Indexify, and retrieves the extracted images.
+We provide a script that downloads a PDF, uploads it to Indexify, and retrieves the extracted images.
 
 ## Prerequisites
 
@@ -47,15 +35,15 @@ Before we begin, ensure you have the following:
 
 ## File Descriptions
 
-1. `setup.py`: This script sets up the extraction graph for converting PDFs to images.
+1. `setup_graph.py`: This script sets up the extraction graph for converting PDFs to images.
 
 2. `upload_and_retreive.py`: This script downloads a PDF, uploads it to Indexify, and retrieves the extracted images.
 
 ## Usage
 
-1. First, run the [`setup.py`](https://github.com/tensorlakeai/indexify/blob/main/examples/pdf/image/setup.py) script to set up the extraction graph:
+1. First, run the [`setup_graph.py`](https://github.com/tensorlakeai/indexify/blob/main/examples/pdf/image/setup_graph.py) script to set up the extraction graph:
    ```bash
-   python setup.py
+   python setup_graph.py
    ```
 
 2. Then, run the [`upload_and_retrieve.py`](https://github.com/tensorlakeai/indexify/blob/main/examples/pdf/image/upload_and_retrieve.py) script to process a PDF and extract images:
@@ -75,7 +63,7 @@ Before we begin, ensure you have the following:
 
 ## Customization
 
-You can customize the image extraction process by modifying the `extraction_graph_spec` in `image_pipeline.py`. For example, you could add additional extraction steps or change the output format.
+You can customize the image extraction process by modifying the `extraction_graph_spec` in `setup_graph.py`. For example, you could add additional extraction steps or change the output format.
 
 In `upload_and_retrieve.py`, you can modify the `pdf_url` variable to process different PDF documents.
 

diff --git a/examples/pdf/image/setup.py → examples/pdf/image/setup_graph.py b/examples/pdf/image/setup.py → examples/pdf/image/setup_graph.py
diff --git a/examples/pdf/indexing_and_rag/README.md b/examples/pdf/indexing_and_rag/README.md
@@ -1,14 +1,9 @@
-# Retrieval-Augmented Generation (RAG) with Indexify
-
-In this cookbook, we'll explore how to create a Retrieval-Augmented Generation (RAG) system using Indexify. We'll cover two approaches: a text-based RAG system and a multimodal RAG system.
-
-## Introduction
-
-We'll explore two RAG pipelines:
+# Retrieval-Augmented Generation (RAG) on PDFs with Indexify
 
+We show how to create a Retrieval-Augmented Generation (RAG) system using Indexify. We'll cover two approaches: a text-based RAG system and a multi-modal RAG system.
 
 1. A text-based RAG system using `tensorlake/pdfextractor`, `tensorlake/chunk-extractor`, and `tensorlake/minilm-l6`.
-2. A multimodal RAG system that includes image processing using `tensorlake/clip-extractor` and GPT-4o mini for answer generation.
+2. A multi-modal RAG system that includes image processing using `tensorlake/clip-extractor` and GPT-4o mini for answer generation.
 
 ## Prerequisites
 
@@ -166,9 +161,9 @@ python upload_and_retreive.py
 ```
 <img src="https://docs.getindexify.ai/example_code/pdf/indexing_and_rag/carbon.png" width="600"/>
 
-## Part 2: Multimodal RAG with GPT-4o mini
+## Part 2: Multi-Modal RAG with GPT-4o mini
 
-### Creating the Multimodal Extraction Graph
+### Creating the Multi-Modal Extraction Graph
 
 Create a new Python file called `mm_extraction_graph.py` and add the following code:
 
@@ -205,7 +200,7 @@ extraction_graph = ExtractionGraph.from_yaml(extraction_graph_spec)
 client.create_extraction_graph(extraction_graph)
 ```
 
-Run this script to set up the multimodal pipeline:
+Run this script to set up the multi-modal pipeline:
 ```bash
 python mm_extraction_graph.py
 ```
@@ -315,7 +310,7 @@ if __name__ == "__main__":
 
 Replace `"YOUR_OPENAI_API_KEY"` with your actual OpenAI API key.
 
-### Running the Multimodal RAG System
+### Running the Multi-Modal RAG System
 
 Reference from PDF file from which answer should be generated:
 
@@ -358,11 +353,11 @@ These RAG systems demonstrate the power of combining Indexify with large languag
 1. **Scalability**: Indexify can process and index large numbers of PDFs efficiently, including both text and images.
 2. **Flexibility**: You can easily swap out components or adjust parameters to suit your specific needs.
 3. **Integration**: The systems seamlessly integrate PDF processing, embedding generation, and text generation.
-4. **Multimodal Capabilities**: The second system shows how to incorporate both text and image data for more comprehensive question answering.
+4. **Multi-Modal Capabilities**: The second system shows how to incorporate both text and image data for more comprehensive question answering.
 
 ## Next Steps
 
 - Learn more about Indexify on our docs - https://docs.getindexify.ai
 - Explore ways to evaluate and improve the quality of retrieved contexts and generated answers.
 - Consider implementing a user interface for easier interaction with your RAG systems.
-- Experiment with different multimodal models and ways of combining text and image data for more sophisticated question answering.
+- Experiment with different multi-modal models and ways of combining text and image data for more sophisticated question answering.
diff --git a/examples/pdf/pdf_to_markdown/README.md b/examples/pdf/pdf_to_markdown/README.md
@@ -1,23 +1,6 @@
 # PDF Text Extraction with Indexify and Marker
 
-This guide demonstrates how to create a PDF text extraction pipeline using Indexify and the tensorlake/marker extractor. By the end of this document, you'll have a pipeline capable of extracting text content from PDF documents for further processing or analysis.
-
-## Table of Contents
-
-1. [Introduction](#introduction)
-2. [Prerequisites](#prerequisites)
-3. [Setup](#setup)
-   - [Install Indexify](#install-indexify)
-   - [Install Required Extractor](#install-required-extractor)
-4. [Creating the Extraction Graph](#creating-the-extraction-graph)
-5. [Implementing the Text Extraction Pipeline](#implementing-the-text-extraction-pipeline)
-6. [Running the Text Extraction Process](#running-the-text-extraction-process)
-7. [Customization and Advanced Usage](#customization-and-advanced-usage)
-8. [Conclusion](#conclusion)
-
-## Introduction
-
-The PDF text extraction pipeline will use the `tensorlake/marker` extractor to convert PDF documents into plain text.
+We show how to create a pipeline capable of extracting text content from PDF documents. It uses the `tensorlake/marker` extractor to convert PDF documents into markdown.
 
 ## Prerequisites
 

diff --git a/examples/pdf/sample_invoice.pdf b/examples/pdf/sample_invoice.pdf
diff --git a/examples/pdf/structured_extraction/README.md b/examples/pdf/structured_extraction/README.md
@@ -1,25 +1,8 @@
 # Structured Extraction from PDFs with GPT-4
 
-In this cookbook, we'll explore how to create a PDF schema extraction pipeline using Indexify, the Marker PDF extractor, and OpenAI's language models. By the end of the document, you should have a pipeline capable of ingesting PDF documents and extracting structured information based on a predefined schema.
+Structured Extraction from PDF involves extracting specific information from documents. We show how to create a pipeline, which accepts a schema and extracts information from PDFs into the provided schema.
 
-![Preview data](https://i.postimg.cc/XYCqNP0p/hoa.png)
-
-## Table of Contents
-
-1. [Introduction](#introduction)
-2. [Prerequisites](#prerequisites)
-3. [Setup](#setup)
-   - [Install Indexify](#install-indexify)
-   - [Install Required Extractors](#install-required-extractors)
-4. [Creating the Extraction Graph](#creating-the-extraction-graph)
-5. [Implementing the Schema Extraction Pipeline](#implementing-the-schema-extraction-pipeline)
-6. [Running the Schema Extraction](#running-the-schema-extraction)
-7. [Customization and Advanced Usage](#customization-and-advanced-usage)
-8. [Conclusion](#conclusion)
-
-## Introduction
-
-The schema extraction pipeline will consist of two steps:
+The pipeline is composed of two steps:
 - PDF to Text extraction using the pre-built extractor `tensorlake/marker`.
 - Schema-based information extraction using `tensorlake/schema` with OpenAI's language models.
 

diff --git a/examples/pdf/table_extraction/README.md b/examples/pdf/table_extraction/README.md
@@ -1,20 +1,9 @@
 # Table Extraction from PDFs
 
-This project demonstrates how to extract tables from PDF documents using Indexify. It includes two main components: setting up an extraction graph for table extraction and a script to process PDFs and retrieve the extracted tables.
+We show how to extract tables from PDF documents using Indexify. 
 
-## Table of Contents
+It consists of two main parts:
 
-1. [Introduction](#introduction)
-2. [Prerequisites](#prerequisites)
-3. [Setup](#setup)
-4. [File Descriptions](#file-descriptions)
-5. [Usage](#usage)
-6. [Customization](#customization)
-7. [Conclusion](#conclusion)
-
-## Introduction
-
-This project showcases the use of Indexify to create a pipeline for extracting tables from PDF documents. It consists of two main parts:
 - An extraction graph that defines the process of converting PDFs to tables.
 - A script that downloads a PDF, uploads it to Indexify, and retrieves the extracted tables.
 
@@ -49,7 +38,7 @@ Before we begin, ensure you have the following:
 
 1. `table_pipeline.py`: This script sets up the extraction graph for converting PDFs to tables.
 
-2. `upload_and_retreive.py`: This script downloads a PDF, uploads it to Indexify, and retrieves the extracted tables.
+2. `upload_and_retrieve.py`: This script downloads a PDF, uploads it to Indexify, and retrieves the extracted tables.
 
 ## Usage
 
@@ -58,7 +47,7 @@ Before we begin, ensure you have the following:
    python table_pipeline.py
    ```
 
-2. Then, run the [`upload_and_retreive.py`](upload_and_retreive.py) script to process a PDF and extract tables:
+2. Then, run the [`upload_and_retrieve.py`](upload_and_retrieve.py) script to process a PDF and extract tables:
    ```bash
    python upload_and_retreive.py
    ```
@@ -74,7 +63,7 @@ Before we begin, ensure you have the following:
 
 You can customize the table extraction process by modifying the `extraction_graph_spec` in `table_pipeline.py`. For example, you could add additional extraction steps or change the output format.
 
-In `upload_and_retreive.py`, you can modify the `pdf_url` variable to process different PDF documents.
+In `upload_and_retrieve.py`, you can modify the `pdf_url` variable to process different PDF documents.
 
 ## Conclusion
 

diff --git a/...f/table_extraction/upload_and_retreive.py → ...f/table_extraction/upload_and_retrieve.py b/...f/table_extraction/upload_and_retreive.py → ...f/table_extraction/upload_and_retrieve.py
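
Several scripts are renamed in this commit, yet a few unchanged README lines still use an old name, for example the `python upload_and_retreive.py` command that remains in the table_extraction README. The following is a minimal sketch of how such leftovers could be found. The helper `find_stale_names` and the sample text are hypothetical and are not part of the repository.

```python
from typing import Dict, List, Tuple


def find_stale_names(text: str, renames: Dict[str, str]) -> List[Tuple[int, str, str]]:
    """Return (line_number, old_name, new_name) for each old name still present.

    Hypothetical helper: `renames` maps an old file name to its new name.
    Line numbers are 1-based.
    """
    hits: List[Tuple[int, str, str]] = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for old, new in renames.items():
            if old in line:
                hits.append((lineno, old, new))
    return hits


if __name__ == "__main__":
    # Rename taken from this commit; the README text below is made up.
    renames = {"upload_and_retreive.py": "upload_and_retrieve.py"}
    readme = (
        "Run the script:\n"
        "python upload_and_retreive.py\n"
        "In `upload_and_retrieve.py`, edit `pdf_url`.\n"
    )
    for lineno, old, new in find_stale_names(readme, renames):
        print(f"line {lineno}: replace {old} with {new}")
```

Running a check like this over each README after a rename would flag commands and link targets that no longer match a file on disk.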
diff --git a/examples/video/transcript/README.md b/examples/video/transcript/README.md
@@ -1,23 +1,8 @@
 # Debate Topic-wise Summary Pipeline with Indexify and Mistral
 
-In this cookbook, we'll explore how to create a debate topic-wise summary pipeline using Indexify and Mistral's large language models. By the end of this document, you'll have a pipeline capable of processing video debates, extracting audio, performing speech recognition and diarization, and generating summaries for each topic discussed.
+We show how to create a pipeline capable of summarizing and performing topic extraction on videos.
+The pipeline will consist of four main steps -
 
-## Table of Contents
-
-1. [Introduction](#introduction)
-2. [Prerequisites](#prerequisites)
-3. [Setup](#setup)
-   - [Install Indexify](#install-indexify)
-   - [Install Required Extractors](#install-required-extractors)
-4. [Creating the Extraction Graph](#creating-the-extraction-graph)
-5. [Implementing the Debate Summary Pipeline](#implementing-the-debate-summary-pipeline)
-6. [Running the Summary Pipeline](#running-the-summary-pipeline)
-7. [Customization and Advanced Usage](#customization-and-advanced-usage)
-8. [Conclusion](#conclusion)
-
-## Introduction
-
-The debate summary pipeline will consist of four main steps:
 1. Video to Audio extraction using `tensorlake/audio-extractor`
 2. Speech recognition and diarization using `tensorlake/asrdiarization`
 3. Topic extraction using `tensorlake/mistral`