You can explore and interact with the Bio-Phenotype by accessing the app through the following link: https://dry-recipe-9383.ploomberapp.io.
This project, Phenotype RAG, was developed as the final assignment for the LLM Zoomcamp. It implements a Retrieval-Augmented Generation (RAG) system that intelligently answers questions related to phenotypes by utilizing both a knowledge base and large language models (LLMs). The system is designed to assist with queries about phenotypes in fields such as genetics, evolutionary biology, and medical diagnostics. By integrating retrieval and generation capabilities, the project provides precise and contextually accurate information, making it a powerful tool for phenotype-related research and clinical applications.
Phenotyping is essential in fields such as genetics, evolutionary biology, and medical diagnostics, enabling researchers and clinicians to analyze observable traits shaped by genetic and environmental factors. However, the sheer volume and complexity of phenotype data pose significant challenges in efficiently accessing and retrieving relevant information. This project tackles these challenges by developing an intelligent assistant designed to answer complex phenotype-related queries. Utilizing Retrieval-Augmented Generation (RAG) techniques, the system integrates the reasoning capabilities of large language models (LLMs) with the accuracy of a curated knowledge base, enhancing the accessibility and precision of phenotype information for researchers, healthcare professionals, and educators.
The Phenotype RAG project aims to achieve the following objectives:
1. Enhance Data Retrieval: Implement a Retrieval-Augmented Generation (RAG) system to efficiently access and retrieve accurate information about phenotypes from a comprehensive knowledge base.
2. Improve Query Accuracy: Utilize advanced language models to reformulate and optimize queries, ensuring that the answers provided are contextually relevant and precise.
3. Offer Educational Value: Create an accessible platform for students and professionals to learn about phenotyping, improving their grasp of complex concepts through a user-friendly interface.
4. Ensure Scalability and Flexibility: Develop a system with a flexible architecture that can integrate with various tools and adapt to different research needs.
5. Foster Collaboration: Make the project's code and documentation available to the community, encouraging collaborative development and knowledge sharing to advance the field.
- Anaconda: Used for managing dependencies and environment configurations.
- Docker: Containerizes the application for easy deployment and consistent execution across different platforms.
- Grafana: Provides monitoring and visualization dashboards to track application performance and usage metrics.
- Streamlit: Offers a user-friendly interface for interacting with the Phenotype RAG system.
- Prefect: Orchestrates data ingestion workflows to ensure smooth and automated processes.
- gemma2-9b-it: Utilized for question reformulation, optimizing queries for better understanding.
- mixtral-8x7b-32768: Powers the retrieval-augmented generation by processing large volumes of text and delivering more contextually accurate answers.
- all-MiniLM-L6-v2: Handles embedding generation and semantic search, allowing for precise query-to-answer matching.
- Groq: Provides the GroqCloud inference API through which the language models listed above (gemma2-9b-it and mixtral-8x7b-32768) are called.
- Pinecone: Manages vector indexing and provides fast, scalable retrieval of information using semantic search.
- Pytest: Ensures code reliability through unit and integration tests.
- Git: Version control for tracking changes and collaboration.
- Visual Studio Code: Integrated development environment (IDE) for writing and debugging code.
- Jupyter Notebook: Facilitates exploratory data analysis and preprocessing through interactive notebooks.
- PostgreSQL: Relational database used for storing and querying structured data.
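The way these components fit together can be sketched as a three-stage pipeline: reformulate the question, retrieve matching passages, then generate an answer from them. The snippet below is only an illustration under stated assumptions, not the project's actual code: `answer_question`, the prompt wording, and the `demo_*` stand-ins are hypothetical, and in the real application the three callables would wrap the reformulation model, the Pinecone search, and the generation model.

```python
from typing import Callable, List


def answer_question(
    question: str,
    reformulate: Callable[[str], str],
    retrieve: Callable[[str, int], List[str]],
    generate: Callable[[str], str],
    top_k: int = 3,
) -> str:
    """Run the three RAG stages: reformulate, retrieve, generate."""
    # Stage 1: rewrite the user's question into a cleaner search query.
    query = reformulate(question)
    # Stage 2: fetch the top_k most relevant passages from the knowledge base.
    passages = retrieve(query, top_k)
    # Stage 3: build a grounded prompt and hand it to the generation model.
    context = "\n".join(f"- {passage}" for passage in passages)
    prompt = (
        "Answer the question using only the context below.\n"
        f"Context:\n{context}\n"
        f"Question: {query}\n"
    )
    return generate(prompt)


# Hypothetical stand-ins so the sketch runs without any API keys.
def demo_reformulate(question: str) -> str:
    # Collapse repeated whitespace; a real model would rephrase the question.
    return " ".join(question.split())


def demo_retrieve(query: str, top_k: int) -> List[str]:
    corpus = [
        "A phenotype is the set of observable traits of an organism.",
        "Phenotypes result from genotype and environment interacting.",
    ]
    return corpus[:top_k]


def demo_generate(prompt: str) -> str:
    return f"(model output for a prompt of {len(prompt)} characters)"


if __name__ == "__main__":
    print(answer_question("What  is a phenotype?", demo_reformulate,
                          demo_retrieve, demo_generate))
```

Passing the stages in as callables keeps the flow testable without network access; each stand-in can be swapped for a real client independently.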
The project is organized into the following directories and files:
phenotype-rag/
├── bio-phenotype/ # Root folder for the main application logic
│   ├── data/ # Directory to hold project-specific datasets
│   │   └── bio-phenotype.csv # Main dataset: includes phenotype-related questions and answers
│   ├── sql/ # Directory for database management and schema scripts
│   │   ├── .env # Environment file storing sensitive credentials and database connection strings
│   │   └── create_table.py # Python script to automate the creation of tables in PostgreSQL
│   ├── tests/ # Directory for unit tests to ensure code quality and correctness
│   │   └── test.py # Python script containing test cases for core functionalities of the project
│   ├── __init__.py # Initializes the `bio-phenotype` package, making its modules importable across the project
│   ├── main.py # Streamlit application entry point; defines the UI and handles user interaction
│   ├── prefect_ingest.py # Prefect workflow script that automates data ingestion and processing tasks
│   ├── requirements.txt # Lists Python dependencies needed to run the project (for pip-based installations)
│   └── utils.py # Contains utility functions for data processing, I/O operations, and common tasks
├── data/ # Contains raw data files that can be accessed across different components
│   └── bio-phenotype.csv # Same dataset as in `bio-phenotype/data`, accessible for testing and backup
├── grafana/ # Directory for Grafana monitoring setup
│   └── monitoring/
│       ├── docker-compose.yaml # Docker Compose configuration for setting up Grafana
│       └── grafana_datasources.yaml # Configuration file that defines the data sources Grafana will connect to (PostgreSQL)
├── images/ # Directory for storing project-related images and screenshots
│   ├── app.png # Screenshot of the Streamlit app's interface
│   ├── grafana.png # Screenshot of the Grafana monitoring dashboard, displaying key metrics
│   ├── gloq.png # Screenshot of Groq AI acceleration with integrated API Keys
│   └── pinecone.png # Screenshot of Pinecone vector database powering semantic and similarity searches
├── notebook/ # Directory containing Jupyter notebooks for exploratory data analysis (EDA) and model experimentation
│   ├── .env # Environment file specifically for notebook-related configurations (API keys, credentials)
│   └── vector_Indexing_.ipynb # Notebook for vectorizing data and indexing it into the semantic search system (Pinecone)
├── docker-compose.yaml # Primary Docker Compose file to orchestrate multi-container setups, including app, database, and Grafana
├── README.md # Project documentation with detailed instructions on usage, setup, and project purpose
├── requirements.txt # Python dependencies for the entire project (ensuring the environment is consistent across machines)
└── test.py # Standalone test script covering various components, including ingestion, database interactions, and the API
The dataset used for this project contains questions and answers about phenotypes, with a focus on genetic research, evolutionary biology, and medical diagnostics. It explores how phenotypic traits relate to cognitive function, disease susceptibility, and treatment outcomes, highlighting the role of phenotyping in personalized medicine. The dataset also covers the impact of traits on aging, chronic diseases, and mental health disorders. Phenotypic trait analysis is crucial in understanding genetic predispositions, environmental adaptations, and evolutionary processes. This resource supports the development of diagnostic tools, therapeutic strategies, and health interventions by linking observable traits to genetic and environmental factors. Additionally, it is valuable for research in agricultural phenotypes, such as plant growth and disease resistance.
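As a rough illustration of how a question-and-answer CSV like this could be read before indexing, here is a minimal sketch using only the standard library. The column names `question` and `answer` are assumptions; the actual headers in `bio-phenotype.csv` may differ, and `load_qa_rows` is a hypothetical helper.

```python
import csv
import io
from typing import Dict, Iterable, List


def load_qa_rows(lines: Iterable[str]) -> List[Dict[str, str]]:
    """Read question/answer rows, skipping rows with an empty field."""
    records = []
    for row in csv.DictReader(lines):
        # Column names are assumed; adjust them to the real CSV header.
        question = (row.get("question") or "").strip()
        answer = (row.get("answer") or "").strip()
        if question and answer:
            records.append({"question": question, "answer": answer})
    return records


# In-memory sample so the sketch runs without the real dataset.
sample = io.StringIO(
    "question,answer\n"
    "What is a phenotype?,The observable traits of an organism.\n"
    "Incomplete row,\n"
)
records = load_qa_rows(sample)
print(len(records), "usable record(s)")
```

With the real file, the same function could be fed an open file handle, for example `open("data/bio-phenotype.csv", newline="", encoding="utf-8")`.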
Ensure the following are installed on your machine:
- Anaconda (latest version)
- Python (version 3.10 or later)
- PostgreSQL (latest version)
- Grafana (latest version)
- Clone the repository:
git clone https://github.com/nathadriele/biophenotype-rag.git
cd bio-phenotype
- Create and activate the virtual environment:
conda create -n bio-phenotype python=3.10
conda activate bio-phenotype
- Install dependencies:
pip install -r requirements.txt
- Start the `vector_Indexing_.ipynb` notebook with Jupyter:
jupyter notebook
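An indexing step of this kind typically embeds each record and groups the results into batches before sending them to the vector database. The sketch below shows only the batching logic and makes several assumptions: `build_batches`, the `qa-<n>` ID scheme, the field names, and the batch size are hypothetical, and `embed` stands in for whatever embedding function the notebook uses. The `id` / `values` / `metadata` record shape mirrors the dictionary format that Pinecone's upsert accepts.

```python
from typing import Callable, Dict, Iterator, List, Sequence


def build_batches(
    records: Sequence[Dict[str, str]],
    embed: Callable[[str], List[float]],
    batch_size: int = 100,
) -> Iterator[List[dict]]:
    """Yield lists of vector records, each at most batch_size long."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batch: List[dict] = []
    for i, record in enumerate(records):
        batch.append({
            "id": f"qa-{i}",                       # hypothetical ID scheme
            "values": embed(record["question"]),   # the embedding vector
            "metadata": {
                "question": record["question"],
                "answer": record["answer"],
            },
        })
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        # Emit the final, possibly shorter, batch.
        yield batch
```

Each yielded batch could then be passed to the index's upsert call; keeping batching separate from the network call makes it easy to test offline.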
To run the application, you need API keys for both GroqCloud and Pinecone, which requires an account on each platform. You will generate the keys, add them to the project's configuration, and create an index in Pinecone.
- Create or log into your GroqCloud account and navigate to API Keys > Create API Key.
- Copy the key and save it in a text editor for later use.
- On the Pinecone website, go to Indexes > Create Index.
- Configure the index as follows:
  - Default / bio
  - Dimensions: 384
  - Metric: Cosine
  - Capacity mode: Serverless
  - Cloud provider: AWS
  - Region: Virginia | us-east-1
- Complete the setup by clicking on Create Index.
Note: The region can be changed without significantly affecting the code. However, altering other configurations would require significant code adjustments.
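The dimension and metric settings matter because the index compares the query embedding with stored embeddings of the same length: all-MiniLM-L6-v2 produces 384-dimensional vectors, and the cosine metric ranks matches by the angle between vectors. The following is a minimal pure-Python sketch of that ranking, using toy 3-dimensional vectors for readability. The function names and the sample index are hypothetical, and a managed index performs this search far more efficiently than a linear scan.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors of equal dimension."""
    if len(a) != len(b):
        # A dimension mismatch is the error you would hit if the index
        # dimension did not match the embedding model's output size.
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query: Sequence[float],
    index: Dict[str, Sequence[float]],
    k: int = 2,
) -> List[Tuple[str, float]]:
    """Return the k entries most similar to the query, best first."""
    scored = [(doc_id, cosine_similarity(query, vec))
              for doc_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy index; real embeddings would have 384 components each.
toy_index = {
    "doc-a": [1.0, 0.0, 0.0],
    "doc-b": [0.0, 1.0, 0.0],
    "doc-c": [1.0, 1.0, 0.0],
}
print(top_k([1.0, 0.0, 0.0], toy_index))
```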
After completing the previous steps, add your API keys to the `.env` files in the `notebook` and `lang-bio-groq` folders, replacing the placeholders `your-pinecone-api-key` and `your-groqcloud-api-key` with the actual keys you generated earlier.
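A quick way to confirm that a `.env` file no longer contains placeholders is to parse it and check each required entry. The sketch below uses only the standard library. The variable names `PINECONE_API_KEY` and `GROQ_API_KEY` are assumptions, since the names the project expects may differ; only the placeholder strings come from the instructions above.

```python
from typing import Dict, Iterable, List

# Hypothetical variable names mapped to the placeholder values to replace.
REQUIRED = {
    "PINECONE_API_KEY": "your-pinecone-api-key",
    "GROQ_API_KEY": "your-groqcloud-api-key",
}


def parse_env(lines: Iterable[str]) -> Dict[str, str]:
    """Parse simple KEY=VALUE lines, ignoring blanks and # comments."""
    values: Dict[str, str] = {}
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def missing_keys(values: Dict[str, str]) -> List[str]:
    """List required keys that are absent, empty, or still placeholders."""
    return [
        name
        for name, placeholder in REQUIRED.items()
        if values.get(name, "") in ("", placeholder)
    ]


example = [
    "# example .env contents",
    'PINECONE_API_KEY="abc123"',
    "GROQ_API_KEY=your-groqcloud-api-key",
]
print(missing_keys(parse_env(example)))
```

For a real file, `parse_env(open(".env", encoding="utf-8"))` would supply the lines.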
To run the application locally, you may need to adjust the configurations in the `.env` file to match your environment. The same applies to the Grafana setup parameters.
- In the Anaconda Prompt, ensure you are in the `lang-bio-groq` folder and run the following command:
streamlit run main.py
Grafana is used to monitor performance. The dashboard screenshot (`images/grafana.png`) shows the following key metrics:
- Average Response Time: The current average response time, which is tracked in real-time to ensure system responsiveness.
- Record Count by Month: This chart tracks the number of records entered into the system.
- Total Conversations: A gauge showing the total number of conversations monitored; the green status indicates acceptable levels.
- Distribution of Questions and Answers: Compares the average question length with the average response length. The average response length is significantly higher, at 161 characters, which highlights the tendency for responses to be longer than the questions.
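To make the definitions of these metrics concrete, the sketch below computes them from a list of logged conversations. It is an illustration only: the record fields (`question`, `answer`, `response_time`, `timestamp`) and the function name are assumptions, and the actual dashboard presumably derives its panels from queries against PostgreSQL rather than from Python code.

```python
from collections import Counter
from datetime import datetime
from statistics import mean
from typing import Dict, List


def dashboard_metrics(conversations: List[dict]) -> Dict[str, object]:
    """Compute the four dashboard metrics from conversation records."""
    if not conversations:
        return {
            "total_conversations": 0,
            "avg_response_time": 0.0,
            "avg_question_length": 0.0,
            "avg_answer_length": 0.0,
            "records_by_month": {},
        }
    # Group records by calendar month, e.g. "2024-09".
    months = Counter(c["timestamp"].strftime("%Y-%m") for c in conversations)
    return {
        "total_conversations": len(conversations),
        "avg_response_time": mean(c["response_time"] for c in conversations),
        "avg_question_length": mean(len(c["question"]) for c in conversations),
        "avg_answer_length": mean(len(c["answer"]) for c in conversations),
        "records_by_month": dict(months),
    }


# Hypothetical log entries for demonstration.
sample_log = [
    {"question": "abcd", "answer": "abcdefgh",
     "response_time": 0.5, "timestamp": datetime(2024, 9, 1)},
    {"question": "abcdef", "answer": "abcdefghij",
     "response_time": 1.5, "timestamp": datetime(2024, 9, 15)},
]
print(dashboard_metrics(sample_log))
```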
The Phenotype RAG: Bio-Phenotype Insights Assistant enhances research and practice in genetics and medical diagnostics by integrating retrieval and generation of phenotype information. It facilitates efficient access to complex data, supports accurate diagnostics, and provides a valuable educational tool. With flexible architecture, the application improves interaction with large volumes of data and fosters innovation through a collaborative and accessible approach for the community.
This project was developed as the final assignment for the LLM Zoomcamp course.