# Scalable Data Warehouse for LLM Finetuning: API Design for High-Throughput Data Ingestion and RAG Retrieval

## Project Overview

This project aims to enhance Natural Language Processing (NLP) capabilities for African languages, with an initial focus on Amharic, by building a comprehensive data corpus to support applications such as semantic search, content generation, chatbot support, sentiment analysis, and speech recognition.
## Table of Contents

- Project Overview
- Business Need
- Contributors
- Tech Stack
- Setup Instructions
- Usage
- Project Structure
- Contributing
- License
## Business Need

The lack of extensive, high-quality text and audio datasets for Amharic is a significant bottleneck for developing competitive NLP products. By collecting and processing a vast amount of text and audio data from diverse online sources, this project will enhance Roots Tech Solutions' ability to create innovative NLP tools for Amharic and other African languages.
## Contributors

- Abubeker Shamil
- Michael George
- Nyamusi Moraa
- Eyerusalem Admassu
## Tech Stack

- Programming Languages: Python, JavaScript (React)
- Web Scraping Tools: Selenium
- Database: PostgreSQL
- API Frameworks: Flask
- Containerization: Docker, Docker Compose
- Workflow Automation: Apache Airflow
- Annotation Tool: Prodigy
- Monitoring: Grafana
## Setup Instructions

### Prerequisites

- Python 3.x
- Docker and Docker Compose
- PostgreSQL or MongoDB (for local development)
### Steps

- **Clone the Repository**

      git clone https://github.com/your-username/your-repository.git
      cd your-repository

- **Set Up Virtual Environment**

      python3 -m venv venv
      source venv/bin/activate   # On Windows: venv\Scripts\activate

- **Install Requirements**

      pip install -r requirements.txt

- **Set Up Environment Variables**

  Create a `.env` file and add the following variables (a database connection sketch using these variables follows these steps):

      DB_USERNAME='your_username'
      DB_PASSWORD='your_password'
      DB_HOST='your_host'
      DB_PORT=port
      DB_DATABASE='db_name'

- **Run Docker Compose**

      docker-compose up --build
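For local development outside Docker, the database scripts can read these variables with `python-dotenv` and SQLAlchemy. The snippet below is a minimal sketch of what `db/connection/db_connection.py` might look like; the helper names and session setup are assumptions for illustration, not the project's actual code.

```python
# db/connection/db_connection.py (illustrative sketch, not the project's actual code)
import os

from dotenv import load_dotenv          # assumes python-dotenv is listed in requirements.txt
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

load_dotenv()  # read the DB_* variables from the .env file created above

DATABASE_URL = (
    f"postgresql://{os.getenv('DB_USERNAME')}:{os.getenv('DB_PASSWORD')}"
    f"@{os.getenv('DB_HOST')}:{os.getenv('DB_PORT')}/{os.getenv('DB_DATABASE')}"
)

engine = create_engine(DATABASE_URL, pool_pre_ping=True)
SessionLocal = sessionmaker(bind=engine, autocommit=False, autoflush=False)


def get_session():
    """Yield a database session and make sure it is closed afterwards."""
    session = SessionLocal()
    try:
        yield session
    finally:
        session.close()
```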
## Usage

- **Run Web Scraping Scripts**

      python scrapper/news_sites/alain.py

- **Normalize and Clean Text Data**

      python scripts/clean_data.py

- **Filter Data for Amharic**

      python scripts/filter_data.py

- **Run FastAPI Application**

      cd api
      fastapi dev main.py

- **Set Up Apache Airflow**

      airflow db init
      airflow webserver --port 8080
      airflow scheduler

The sketches below illustrate what each of these steps might look like.
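A minimal sketch of what a Selenium-based scraper such as `scrapper/news_sites/alain.py` could do. The target URL, CSS selectors, and output fields are placeholders, not the real site markup or the project's actual script.

```python
# Illustrative Selenium scraper sketch; URL and selectors are placeholders.
import csv

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

options = Options()
options.add_argument("--headless=new")   # run without opening a browser window

driver = webdriver.Chrome(options=options)
try:
    driver.get("https://example.com/amharic-news")                    # placeholder URL
    articles = driver.find_elements(By.CSS_SELECTOR, "article")       # placeholder selector
    rows = [
        {
            "title": a.find_element(By.CSS_SELECTOR, "h2").text,
            "body": a.find_element(By.CSS_SELECTOR, "p").text,
        }
        for a in articles
    ]
finally:
    driver.quit()

with open("data/raw/alain_news.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["title", "body"])
    writer.writeheader()
    writer.writerows(rows)
```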
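A sketch of the kind of normalization `scripts/clean_data.py` could apply: Unicode NFC normalization (important for Ethiopic characters), stripping of HTML remnants and URLs, and whitespace collapsing. The function name and rules are illustrative assumptions, not the actual implementation.

```python
# Illustrative text-cleaning sketch; the exact rules in scripts/clean_data.py may differ.
import re
import unicodedata


def clean_text(text: str) -> str:
    """Normalize Unicode, drop markup remnants, and collapse whitespace."""
    text = unicodedata.normalize("NFC", text)      # canonical form for Ethiopic characters
    text = re.sub(r"<[^>]+>", " ", text)           # strip leftover HTML tags
    text = re.sub(r"http\S+", " ", text)           # drop bare URLs
    text = re.sub(r"\s+", " ", text).strip()       # collapse runs of whitespace
    return text


if __name__ == "__main__":
    print(clean_text("  <p>ሰላም   ዓለም</p>  "))      # -> "ሰላም ዓለም"
```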
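One common way to filter for Amharic, which `scripts/filter_data.py` might use, is to keep only records whose letters mostly fall in the Ethiopic Unicode block (U+1200–U+137F). This threshold-based sketch is an assumption about the approach, not the project's actual logic.

```python
# Illustrative Amharic filter based on the Ethiopic Unicode block (U+1200-U+137F).
def is_mostly_amharic(text: str, threshold: float = 0.6) -> bool:
    """Return True if at least `threshold` of the letters are Ethiopic characters."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return False
    ethiopic = sum(1 for ch in letters if "\u1200" <= ch <= "\u137f")
    return ethiopic / len(letters) >= threshold


print(is_mostly_amharic("ሰላም ዓለም"))       # True
print(is_mostly_amharic("Hello world"))    # False
```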
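A minimal sketch of a data-ingestion endpoint as it might appear in the FastAPI application started above. The route path, schema fields, and echo-style response are placeholders; a real handler would persist records to PostgreSQL via the SQLAlchemy models.

```python
# Illustrative FastAPI ingestion sketch; routes and schemas are placeholders.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="Amharic Corpus API")


class ArticleIn(BaseModel):
    source: str
    title: str
    body: str


@app.post("/articles", status_code=201)
def ingest_article(article: ArticleIn) -> dict:
    """Accept one scraped article; a real implementation would write it to the database."""
    # Placeholder: echo back instead of persisting.
    return {"status": "queued", "title": article.title}


@app.get("/health")
def health() -> dict:
    return {"status": "ok"}
```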
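A sketch of how the scraping, cleaning, and filtering steps could be chained in an Airflow DAG. The DAG id, schedule, and commands are illustrative, and the `schedule` parameter assumes Airflow 2.4 or later.

```python
# Illustrative Airflow DAG chaining the scraping, cleaning, and filtering scripts.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="amharic_corpus_pipeline",   # placeholder DAG id
    schedule="@daily",                  # assumes Airflow 2.4+; older versions use schedule_interval
    start_date=datetime(2024, 1, 1),
    catchup=False,
) as dag:
    scrape = BashOperator(
        task_id="scrape_news",
        bash_command="python scrapper/news_sites/alain.py",
    )
    clean = BashOperator(
        task_id="clean_data",
        bash_command="python scripts/clean_data.py",
    )
    filter_amharic = BashOperator(
        task_id="filter_amharic",
        bash_command="python scripts/filter_data.py",
    )

    scrape >> clean >> filter_amharic
```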
## Project Structure

    ├── app
    │   ├── main.py              # API entry point
    │   ├── routes               # API routes
    │   ├── view_models          # Pydantic models (schemas)
    │   ├── controllers          # Business logic
    │   └── models               # SQLAlchemy models
    ├── data/raw
    │   ├── alain_news.csv       # Raw data files
    │   └── ...
    ├── schema
    │   ├── news_schema.sql      # SQL schema for news
    │   └── ...
    ├── db/connection
    │   ├── db_connection.py     # Database connection script
    │   └── ...
    ├── scrapper
    │   ├── news_sites/          # News site scraping scripts
    │   ├── telegram/            # Telegram scraping scripts
    │   ├── other/               # Other site scripts
    │   └── ...
    ├── Dockerfile               # Docker configuration
    ├── docker-compose.yml       # Docker Compose configuration
    ├── requirements.txt         # Python dependencies
    ├── README.md                # Project documentation
    └── ...
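To illustrate how the `app/models` and `app/view_models` layers relate, here is a hedged sketch of a news-article table and its matching Pydantic schema; the table, class, and field names are assumptions for illustration, not the project's actual definitions.

```python
# Illustrative pairing of a SQLAlchemy model (app/models) and a Pydantic schema (app/view_models).
from pydantic import BaseModel
from sqlalchemy import Column, Integer, String, Text
from sqlalchemy.orm import declarative_base

Base = declarative_base()


class NewsArticle(Base):
    """Database table for scraped news articles (placeholder columns)."""
    __tablename__ = "news_articles"

    id = Column(Integer, primary_key=True)
    source = Column(String(255), nullable=False)
    title = Column(Text, nullable=False)
    body = Column(Text, nullable=False)


class NewsArticleOut(BaseModel):
    """Response schema returned by the API routes."""
    id: int
    source: str
    title: str
    body: str
```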
## Contributing

Contributions are welcome! Please follow these steps:
- Fork the repository.
- Create a new branch.
- Make your changes.
- Submit a pull request.
## License

This project is licensed under the MIT License. See the LICENSE file for details.
## Next Steps

- Populate the Repository: Ensure the repository has the necessary scripts (`scrapy_spider.py`, `clean_data.py`, `filter_data.py`) and configuration files (`Dockerfile`, `docker-compose.yml`).
- Document Individual Scripts: Add comments and documentation within each script to explain its functionality.
- Set Up AWS Resources: Configure AWS resources (EC2, S3, RDS, Kinesis) as needed for your specific project requirements.
- Collaborate and Communicate: Share the repository link with your team members and collaborate using issues and pull requests for any changes or improvements.