Ashish Chouhan, Saifeldin Mandour, and Michael Gertz
Heidelberg University
Contact us at: {chouhan, gertz}@informatik.uni-heidelberg.de, [email protected]
Video demonstration: here
Exploratory search of large text corpora is essential in domains like biomedical research, where large amounts of research literature are continuously generated. This paper presents ClusterTalk, a system for interactive corpus exploration and question answering over PubMed abstracts.
Folder: backend/
- Data Collection and Storage (`1. embedding_data_storage`): PubMed abstracts from 2020–2024 were collected and stored in OpenSearch, yielding about four million abstracts with metadata.
- Topic Modeling and Clustering Information (`2. topic_modelling` and `4. cluster_information`): Abstracts are embedded with `NeuML/pubmedbert-base-embeddings`, reduced in dimensionality via UMAP, and clustered using HDBSCAN. Keywords and labels for each cluster are generated using BM25 and GPT-4o-mini, and stored in OpenSearch.
- RAG Pipeline (`3. rag_pipeline`): For question answering, abstracts are segmented into sentences, creating around 46 million sentence embeddings indexed in OpenSearch. Document-level queries retrieve contextually relevant sentence chunks, which are then processed with Mixtral-8x7B to generate precise answers with citations pointing to the respective PubMed abstracts.
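As a rough sketch of the segmentation step described above (the actual preprocessing in `3. rag_pipeline` may use a proper sentence tokenizer; the regex splitter and function name here are illustrative):

```python
import re

def split_into_sentences(abstract: str) -> list[str]:
    """Naive sentence segmentation: split on ., !, or ? followed by
    whitespace. The real pipeline likely uses a dedicated tokenizer;
    this only illustrates turning one abstract into sentence chunks."""
    parts = re.split(r"(?<=[.!?])\s+", abstract.strip())
    return [p for p in parts if p]

# Illustrative abstract text, not a real PubMed record
abstract = (
    "PubMed indexes biomedical literature. "
    "Each abstract is split into sentences. "
    "Every sentence is embedded and indexed in OpenSearch."
)
sentences = split_into_sentences(abstract)
print(len(sentences))  # 3 sentence chunks for this example
```

Each resulting chunk would then be embedded and indexed individually, which is what yields far more sentence vectors (about 46 million) than abstracts (about 4 million).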
Figure 1: Overview of the ClusterTalk interface. The interface includes four main features: 1) a chat functionality panel on the top-left for asking corpus- and document-level queries; 2) a metadata information panel on the bottom-left for displaying metadata of the selected documents; 3) a central cluster visualization map showing research topics like “Cancer Treatment” and “Genetic Disorders”; 4) a faceted search panel at the top for keyword search on Title and Abstract text.
Folder: app/
- Cluster Overview: Visualizes thematic clusters, like “Cancer Treatment,” and allows for intuitive exploration.
- Faceted Search and Filtering: Filters documents by date, keywords, and clusters, refining corpus exploration.
- Question-Answering Interface: Supports document-level and corpus-level queries, allowing users to ask both targeted and broad questions about selected clusters or the entire corpus.
Clone the repository by executing the command below:
git clone https://github.com/achouhan93/ClusterTalk.git
Navigate to the cloned repository folder:
cd ClusterTalk
Once the repository is cloned and you have navigated into the folder, execute the steps below to set up the Python environment (tested with Python 3.9.0):
- Set up a venv with `python` (or `conda`)
python -m venv .venv
- Activate venv
source .venv/bin/activate
- Install all necessary dependencies by running
pip install -r requirements.txt
- Rename `.env-example` to `.env` and populate the file with the required credentials
CLUSTER_TALK_LOG_EXE_PATH="logs/insights_execution.log"
CLUSTER_TALK_LOG_PATH="logs/"
# Required for Backend functionalities, i.e., Embedding creation and storage,
# Topic Modeling and Clustering information construction and storage,
# Retrieval Augmented Generation (RAG) or QA Pipeline to work
# Opensearch Connection Details
OPENSEARCH_USERNAME="your_opensearch_username"
OPENSEARCH_PASSWORD="your_opensearch_password"
OPENSEARCH_PORT=your_opensearch_port
CLUSTER_TALK_OPENSEARCH_HOST="your_opensearch_host_name"
CLUSTER_TALK_OPENSEARCH_SOURCE_INDEX="frameintell_pubmed"
CLUSTER_TALK_OPENSEARCH_TARGET_INDEX_COMPLETE="frameintell_pubmed_abstract_embeddings"
CLUSTER_TALK_OPENSEARCH_TARGET_INDEX_SENTENCE="frameintell_pubmed_sentence_embeddings"
CLUSTER_TALK_CLUSTER_INFORMATION_INDEX="frameintell_clustertalk_clusterinformation"
CLUSTER_TALK_DOCUMENT_INFORMATION_INDEX="frameintell_clustertalk_documentinformation"
# HuggingFace Key
HUGGINGFACE_AUTH_KEY="your-huggingface-api-key"
## Required for embedding computation for Abstract and Sentences
CLUSTER_TALK_EMBEDDING_MODEL="NeuML/pubmedbert-base-embeddings"
## Required for topic label and topic description generation
OPENAI_API_KEY="your-openai-api-key"
## Required for Answer Generation in the QA Pipeline
MODEL_CONFIGS = '{"mixtral7B": {"temperature": 0.3, "max_tokens": 100, "huggingface_model":"mistralai/Mixtral-8x7B-Instruct-v0.1", "repetition_penalty":1.2, "stop_sequences":["<|endoftext|>", "</s>"]}}'
# For storage of the BERTopic models at the intermediate stage
MODEL_PATH = "./intermediate_results/"
# Required for frontend
APP_URL="http://localhost:5173"
OPENSEARCH_NODE="https://your-opensearch-hostname:your-opensearch-port"
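`MODEL_CONFIGS` is a JSON string embedded in the `.env` file, so it must parse cleanly. A quick standard-library sanity check (the value below mirrors the example above; in the application it would normally be read via `os.environ` after loading the `.env` file):

```python
import json

# Example MODEL_CONFIGS value as given in .env above
raw = (
    '{"mixtral7B": {"temperature": 0.3, "max_tokens": 100, '
    '"huggingface_model": "mistralai/Mixtral-8x7B-Instruct-v0.1", '
    '"repetition_penalty": 1.2, '
    '"stop_sequences": ["<|endoftext|>", "</s>"]}}'
)
configs = json.loads(raw)  # raises ValueError if the .env value is malformed
print(configs["mixtral7B"]["huggingface_model"])
```

Running this after editing `.env` catches quoting mistakes in the JSON value before they surface as runtime errors in the QA pipeline.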
- Start the backend server:
cd backend/3.\ rag_pipeline
uvicorn main:app --reload --port 8100
Execute the steps below to set up the frontend:
- Navigate to the app folder:
cd app
- Install frontend dependencies:
npm install
- Start the frontend server:
npm run dev
We use the standard MIT license for code artifacts.
See `license/LICENSE.txt` for more information.
We thank the Bundesministerium für Bildung und Forschung (BMBF) for funding this research within the FrameIntell project.