
ClusterTalk: Corpus Exploration Framework using Multi-Dimensional Exploratory Search

Ashish Chouhan, Saifeldin Mandour, and Michael Gertz

Heidelberg University

Contact us at: {chouhan, gertz}@informatik.uni-heidelberg.de, [email protected]

Report Bug · Request Feature

Table of Contents
  1. About The Project
  2. Project Structure
  3. Getting Started
  4. Cite our work
  5. License
  6. Acknowledgments

About The Project

Video demonstration: here

Abstract

Exploratory search of large text corpora is essential in domains like biomedical research, where large amounts of research literature are continuously generated. This paper presents $\textit{ClusterTalk}$ (the demo video and source code are available at https://github.com/achouhan93/ClusterTalk), a framework for corpus exploration using multi-dimensional exploratory search. Our system integrates document clustering with faceted search, allowing users to interactively refine their exploration and ask corpus- and document-level queries. Compared to traditional one-dimensional search approaches like keyword search or clustering, this system improves the discoverability of information by encouraging deeper interaction with the corpus. We demonstrate the functionality of the $\textit{ClusterTalk}$ framework on four million PubMed abstracts covering a four-year time frame.

(back to top)

Project Structure

The $\textit{ClusterTalk}$ framework provides a web-based tool for exploring PubMed abstracts, utilizing backend components for document clustering and retrieval-augmented generation (RAG). It employs BERTopic and LangChain for backend processing, with Cosmograph used for interactive visualizations in the frontend. This setup supports both faceted search on abstracts and natural language query capabilities for enhanced corpus navigation.

Backend

Folder: backend/

  1. Data Collection and Storage (1.embedding_data_storage): PubMed abstracts from 2020–2024 were collected and stored in OpenSearch, yielding about four million abstracts with metadata.

  2. Topic Modeling and Clustering Information (2.topic_modelling and 4. cluster_information): Abstracts are embedded with NeuML/pubmedbert-base-embeddings, reduced in dimensionality via UMAP, and clustered using HDBSCAN. Keywords and labels for each cluster are generated using BM25 and GPT-4o-mini and stored in OpenSearch (see the clustering sketch after this list).

  3. RAG Pipeline (3.rag_pipeline): For question answering, abstracts are segmented into sentences, creating around $46$ million sentence embeddings indexed in OpenSearch. Document-level queries retrieve contextually relevant sentence chunks, which are then processed with Mixtral-8x7B to generate precise answers with citations pointing to the respective PubMed abstracts (see the retrieval sketch after this list).
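
For orientation, here is a minimal sketch of the clustering step, assuming the bertopic, umap-learn, hdbscan, and sentence-transformers packages are available; the hyperparameter values and the load_abstracts_from_opensearch helper are illustrative placeholders, not the project's actual settings.

# Minimal clustering sketch; hyperparameters are illustrative, not the project's tuned values
from sentence_transformers import SentenceTransformer
from umap import UMAP
from hdbscan import HDBSCAN
from bertopic import BERTopic

embedding_model = SentenceTransformer("NeuML/pubmedbert-base-embeddings")
umap_model = UMAP(n_components=5, n_neighbors=15, min_dist=0.0, metric="cosine")
hdbscan_model = HDBSCAN(min_cluster_size=100, metric="euclidean", prediction_data=True)

topic_model = BERTopic(
    embedding_model=embedding_model,
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
)

# Hypothetical loader returning a list of abstract strings from OpenSearch
abstracts = load_abstracts_from_opensearch()
topics, probabilities = topic_model.fit_transform(abstracts)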
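
A document-level query in the QA pipeline can then be served by a k-NN search over the sentence index. The sketch below uses opensearch-py; the vector field name "embedding" and the example question are assumptions, while the index and connection variables come from the .env file described under Getting Started.

# Retrieval sketch for the QA step; the field name "embedding" is an assumption
import os
from opensearchpy import OpenSearch
from sentence_transformers import SentenceTransformer

client = OpenSearch(
    hosts=[{"host": os.getenv("CLUSTER_TALK_OPENSEARCH_HOST"),
            "port": int(os.getenv("OPENSEARCH_PORT", "9200"))}],
    http_auth=(os.getenv("OPENSEARCH_USERNAME"), os.getenv("OPENSEARCH_PASSWORD")),
    use_ssl=True,
)

encoder = SentenceTransformer("NeuML/pubmedbert-base-embeddings")
question_vector = encoder.encode("Which biomarkers are associated with sepsis outcomes?").tolist()

response = client.search(
    index=os.getenv("CLUSTER_TALK_OPENSEARCH_TARGET_INDEX_SENTENCE"),
    body={
        "size": 5,
        "query": {"knn": {"embedding": {"vector": question_vector, "k": 5}}},
    },
)
# The retrieved sentence chunks are then passed to Mixtral-8x7B to generate a cited answer.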

Frontend


Figure 1: Overview of the ClusterTalk interface. The interface includes four main features: 1) a chat functionality panel on the top-left for asking corpus- and document-level queries; 2) a metadata panel on the bottom-left displaying metadata of the selected documents; 3) a central cluster visualization map showing research topics like “Cancer Treatment” and “Genetic Disorders”; 4) a faceted search panel at the top for keyword search on Title and Abstract text.

Folder: app/

  1. Cluster Overview: Visualizes thematic clusters, like “Cancer Treatment,” and allows for intuitive exploration.

  2. Faceted Search and Filtering: Filters documents by date, keywords, and clusters, refining corpus exploration.

  3. Question-Answering Interface: Supports document-level and corpus-level queries, allowing users to ask both targeted and broad questions about selected clusters or the entire corpus.

(back to top)

Getting Started

Clone the repository by executing the command below:

git clone https://github.com/achouhan93/ClusterTalk.git

Navigate to the cloned repository folder:

cd ClusterTalk

Once the repository is cloned and you have navigated into the folder, set up the backend and frontend as described below.

Setting up Backend

Execute the steps below to set up the Python environment (tested with Python 3.9.0):

  1. Set up a venv with Python (or conda):
python -m venv .venv
  2. Activate the venv:
source .venv/bin/activate
  3. Install all necessary dependencies by running:
pip install -r requirements.txt
  4. Rename .env-example to .env and populate the file with the required credentials:
CLUSTER_TALK_LOG_EXE_PATH="logs/insights_execution.log"
CLUSTER_TALK_LOG_PATH="logs/"

# Required for Backend functionalities, i.e., Embedding creation and storage, 
# Topic Modeling and Clustering information construction and storage,
# Retrieval Augmented Generation (RAG) or QA Pipeline to work

# Opensearch Connection Details
OPENSEARCH_USERNAME="your_opensearch_username"
OPENSEARCH_PASSWORD="your_opensearch_password"
OPENSEARCH_PORT=your_opensearch_port
CLUSTER_TALK_OPENSEARCH_HOST="your_opensearch_host_name"

CLUSTER_TALK_OPENSEARCH_SOURCE_INDEX="frameintell_pubmed"
CLUSTER_TALK_OPENSEARCH_TARGET_INDEX_COMPLETE="frameintell_pubmed_abstract_embeddings"
CLUSTER_TALK_OPENSEARCH_TARGET_INDEX_SENTENCE="frameintell_pubmed_sentence_embeddings"
CLUSTER_TALK_CLUSTER_INFORMATION_INDEX="frameintell_clustertalk_clusterinformation"
CLUSTER_TALK_DOCUMENT_INFORMATION_INDEX="frameintell_clustertalk_documentinformation"

# HuggingFace Key
HUGGINGFACE_AUTH_KEY="your-huggingface-api-key"

## Required for embedding computation for Abstract and Sentences
CLUSTER_TALK_EMBEDDING_MODEL="NeuML/pubmedbert-base-embeddings"
## Required for topic label and topic description generation
OPENAI_API_KEY="your-openai-api-key"
## Required for Answer Generation in the QA Pipeline
MODEL_CONFIGS='{"mixtral7B": {"temperature": 0.3, "max_tokens": 100, "huggingface_model":"mistralai/Mixtral-8x7B-Instruct-v0.1", "repetition_penalty":1.2, "stop_sequences":["<|endoftext|>", "</s>"]}}'

# For storage of the BERTopic models at the intermediate stage
MODEL_PATH="./intermediate_results/"

# Required for frontend
APP_URL="http://localhost:5173"
OPENSEARCH_NODE="https://your-opensearch-hostname:your-opensearch-port"
  5. Start the backend server:
    cd backend/3.\ rag_pipeline
    uvicorn main:app --reload --port 8100
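
As a quick sanity check that the credentials are picked up, the following sketch reads a few of the variables; it assumes python-dotenv, which is a common choice rather than a confirmed project dependency.

# Sanity check for the .env configuration (assumes python-dotenv is installed)
import json
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory

print(os.getenv("CLUSTER_TALK_OPENSEARCH_HOST"))
print(json.loads(os.environ["MODEL_CONFIGS"])["mixtral7B"]["huggingface_model"])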

Setting up Frontend

Execute the steps below to set up the frontend:

  1. Navigate to the app folder:

    cd app
  2. Install frontend dependencies:

    npm install
  3. Start the frontend server:

    npm run dev

(back to top)

Cite our work

No citation information is available yet.

(back to top)

License

We use the standard MIT license for code artifacts. See license/LICENSE.txt for more information.

(back to top)

Acknowledgments

We thank the Bundesministerium für Bildung und Forschung (BMBF) for funding this research within the FrameIntell project.

(back to top)