# Scalable Data Warehouse for LLM Finetuning: API Design for High-Throughput Data Ingestion and RAG Retrieval

## Project Overview

This project aims to enhance Natural Language Processing (NLP) capabilities for African languages, with an initial focus on Amharic, by building a comprehensive data corpus to support applications such as semantic search, content generation, chatbot support, sentiment analysis, and speech recognition.
## Table of Contents

- Project Overview
- Business Need
- Contributors
- Tech Stack
- Setup Instructions
- Usage
- Project Structure
- Contributing
- License
## Business Need

The lack of extensive, high-quality text and audio datasets for Amharic is a significant bottleneck for developing competitive NLP products. By collecting and processing a vast amount of text and audio data from diverse online sources, this project will enhance Roots Tech Solutions' ability to create innovative NLP tools for Amharic and other African languages.
## Contributors

- Abubeker Shamil
- Michael George
- Nyamusi Moraa
- Eyerusalem Admassu
## Tech Stack

- Programming Languages: Python, JavaScript (React)
- Web Scraping Tools: Selenium
- Database: PostgreSQL
- API Frameworks: Flask
- Containerization: Docker, Docker Compose
- Workflow Automation: Apache Airflow
- Annotation Tool: Prodigy
- Monitoring: Grafana
## Setup Instructions

### Prerequisites

- Python 3.x
- Docker and Docker Compose
- PostgreSQL or MongoDB (for local development)
### Steps

- **Clone the Repository**

      git clone https://github.com/your-username/your-repository.git
      cd your-repository

- **Set Up Virtual Environment**

      python3 -m venv venv
      source venv/bin/activate   # On Windows: venv\Scripts\activate

- **Install Requirements**

      pip install -r requirements.txt

- **Set Up Environment Variables**

  Create a `.env` file and add the following variables (a database connection sketch using these variables follows these steps):

      DB_USERNAME='your_username'
      DB_PASSWORD='your_password'
      DB_HOST='your_host'
      DB_PORT=port
      DB_DATABASE='db_name'

- **Run Docker Compose**

      docker-compose up --build
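For local development outside Docker, the database scripts can read these variables with `python-dotenv` and SQLAlchemy. The snippet below is a minimal sketch of what `db/connection/db_connection.py` might look like; the helper names and session setup are assumptions for illustration, not the project's actual code.

```python
# db/connection/db_connection.py (illustrative sketch, not the project's actual code)
import os

from dotenv import load_dotenv          # assumes python-dotenv is listed in requirements.txt
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

load_dotenv()  # read the DB_* variables from the .env file created above

DATABASE_URL = (
    f"postgresql://{os.getenv('DB_USERNAME')}:{os.getenv('DB_PASSWORD')}"
    f"@{os.getenv('DB_HOST')}:{os.getenv('DB_PORT')}/{os.getenv('DB_DATABASE')}"
)

engine = create_engine(DATABASE_URL, pool_pre_ping=True)
SessionLocal = sessionmaker(bind=engine, autocommit=False, autoflush=False)


def get_session():
    """Yield a database session and make sure it is closed afterwards."""
    session = SessionLocal()
    try:
        yield session
    finally:
        session.close()
```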
## Usage

- **Run Web Scraping Scripts**

      python scrapper/news_sites/alain.py

- **Normalize and Clean Text Data**

      python scripts/clean_data.py

- **Filter Data for Amharic**

      python scripts/filter_data.py

- **Run FastAPI Application**

      cd api
      fastapi dev main.py

- **Set Up Apache Airflow**

      airflow db init
      airflow webserver --port 8080
      airflow scheduler

The sketches below illustrate what each of these steps might look like.
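A minimal sketch of what a Selenium-based scraper such as `scrapper/news_sites/alain.py` could do. The target URL, CSS selectors, and output fields are placeholders, not the real site markup or the project's actual script.

```python
# Illustrative Selenium scraper sketch; URL and selectors are placeholders.
import csv

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

options = Options()
options.add_argument("--headless=new")   # run without opening a browser window

driver = webdriver.Chrome(options=options)
try:
    driver.get("https://example.com/amharic-news")                    # placeholder URL
    articles = driver.find_elements(By.CSS_SELECTOR, "article")       # placeholder selector
    rows = [
        {
            "title": a.find_element(By.CSS_SELECTOR, "h2").text,
            "body": a.find_element(By.CSS_SELECTOR, "p").text,
        }
        for a in articles
    ]
finally:
    driver.quit()

with open("data/raw/alain_news.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["title", "body"])
    writer.writeheader()
    writer.writerows(rows)
```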
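A sketch of the kind of normalization `scripts/clean_data.py` could apply: Unicode NFC normalization (important for Ethiopic characters), stripping of HTML remnants and URLs, and whitespace collapsing. The function name and rules are illustrative assumptions, not the actual implementation.

```python
# Illustrative text-cleaning sketch; the exact rules in scripts/clean_data.py may differ.
import re
import unicodedata


def clean_text(text: str) -> str:
    """Normalize Unicode, drop markup remnants, and collapse whitespace."""
    text = unicodedata.normalize("NFC", text)      # canonical form for Ethiopic characters
    text = re.sub(r"<[^>]+>", " ", text)           # strip leftover HTML tags
    text = re.sub(r"http\S+", " ", text)           # drop bare URLs
    text = re.sub(r"\s+", " ", text).strip()       # collapse runs of whitespace
    return text


if __name__ == "__main__":
    print(clean_text("  <p>ሰላም   ዓለም</p>  "))      # -> "ሰላም ዓለም"
```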
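One common way to filter for Amharic, which `scripts/filter_data.py` might use, is to keep only records whose letters mostly fall in the Ethiopic Unicode block (U+1200–U+137F). This threshold-based sketch is an assumption about the approach, not the project's actual logic.

```python
# Illustrative Amharic filter based on the Ethiopic Unicode block (U+1200-U+137F).
def is_mostly_amharic(text: str, threshold: float = 0.6) -> bool:
    """Return True if at least `threshold` of the letters are Ethiopic characters."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return False
    ethiopic = sum(1 for ch in letters if "\u1200" <= ch <= "\u137f")
    return ethiopic / len(letters) >= threshold


print(is_mostly_amharic("ሰላም ዓለም"))       # True
print(is_mostly_amharic("Hello world"))    # False
```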
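A minimal sketch of a data-ingestion endpoint as it might appear in the FastAPI application started above. The route path, schema fields, and echo-style response are placeholders; a real handler would persist records to PostgreSQL via the SQLAlchemy models.

```python
# Illustrative FastAPI ingestion sketch; routes and schemas are placeholders.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="Amharic Corpus API")


class ArticleIn(BaseModel):
    source: str
    title: str
    body: str


@app.post("/articles", status_code=201)
def ingest_article(article: ArticleIn) -> dict:
    """Accept one scraped article; a real implementation would write it to the database."""
    # Placeholder: echo back instead of persisting.
    return {"status": "queued", "title": article.title}


@app.get("/health")
def health() -> dict:
    return {"status": "ok"}
```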
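A sketch of how the scraping, cleaning, and filtering steps could be chained in an Airflow DAG. The DAG id, schedule, and commands are illustrative, and the `schedule` parameter assumes Airflow 2.4 or later.

```python
# Illustrative Airflow DAG chaining the scraping, cleaning, and filtering scripts.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="amharic_corpus_pipeline",   # placeholder DAG id
    schedule="@daily",                  # assumes Airflow 2.4+; older versions use schedule_interval
    start_date=datetime(2024, 1, 1),
    catchup=False,
) as dag:
    scrape = BashOperator(
        task_id="scrape_news",
        bash_command="python scrapper/news_sites/alain.py",
    )
    clean = BashOperator(
        task_id="clean_data",
        bash_command="python scripts/clean_data.py",
    )
    filter_amharic = BashOperator(
        task_id="filter_amharic",
        bash_command="python scripts/filter_data.py",
    )

    scrape >> clean >> filter_amharic
```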
## Project Structure

    ├── app
    │   ├── main.py              # API entry point
    │   ├── routes               # API routes
    │   ├── view_models          # Pydantic models (schemas)
    │   ├── controllers          # Business logic
    │   └── models               # SQLAlchemy models
    ├── data/raw
    │   ├── alain_news.csv       # Raw data files
    │   └── ...
    ├── schema
    │   ├── news_schema.sql      # SQL schema for news
    │   └── ...
    ├── db/connection
    │   ├── db_connection.py     # Database connection script
    │   └── ...
    ├── scrapper
    │   ├── news_sites/          # News site scraping scripts
    │   ├── telegram/            # Telegram scraping scripts
    │   ├── other/               # Other site scripts
    │   └── ...
    ├── Dockerfile               # Docker configuration
    ├── docker-compose.yml       # Docker Compose configuration
    ├── requirements.txt         # Python dependencies
    ├── README.md                # Project documentation
    └── ...
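To illustrate how the `app/models` and `app/view_models` layers relate, here is a hedged sketch of a news-article table and its matching Pydantic schema; the table, class, and field names are assumptions for illustration, not the project's actual definitions.

```python
# Illustrative pairing of a SQLAlchemy model (app/models) and a Pydantic schema (app/view_models).
from pydantic import BaseModel
from sqlalchemy import Column, Integer, String, Text
from sqlalchemy.orm import declarative_base

Base = declarative_base()


class NewsArticle(Base):
    """Database table for scraped news articles (placeholder columns)."""
    __tablename__ = "news_articles"

    id = Column(Integer, primary_key=True)
    source = Column(String(255), nullable=False)
    title = Column(Text, nullable=False)
    body = Column(Text, nullable=False)


class NewsArticleOut(BaseModel):
    """Response schema returned by the API routes."""
    id: int
    source: str
    title: str
    body: str
```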
## Contributing

Contributions are welcome! Please follow these steps:
- Fork the repository.
- Create a new branch.
- Make your changes.
- Submit a pull request.
## License

This project is licensed under the MIT License. See the LICENSE file for details.
## Next Steps

- Populate the Repository: Ensure the repository has the necessary scripts (`scrapy_spider.py`, `clean_data.py`, `filter_data.py`) and configuration files (`Dockerfile`, `docker-compose.yml`).
- Document Individual Scripts: Add comments and documentation within each script to explain its functionality.
- Set Up AWS Resources: Configure AWS resources (EC2, S3, RDS, Kinesis) as needed for your specific project requirements.
- Collaborate and Communicate: Share the repository link with your team members and collaborate using issues and pull requests for any changes or improvements.