Semantic Search and Codebase Indexing Service Implementation #609

Draft: wants to merge 233 commits into main

daniel-lxs

Semantic Search Service Implementation [⚠️ Work In Progress ⚠️]

Description

This PR introduces a comprehensive semantic search service for code and text files within a workspace. The service provides:

  1. File Indexing:

    • Supports multiple programming languages (JavaScript, Python, Rust, Go, etc.)
    • Handles both code files (parsed with Tree-sitter) and plain text files
    • Implements content hashing to detect file changes and skip unchanged files
    • Enforces size limits and text file validation
  2. Search Capabilities:

    • Semantic search across indexed files using vector embeddings
    • Deduplication of results to avoid redundant matches
    • Separate handling of code and file results
    • Configurable maximum number of results
  3. Infrastructure:

    • Integration with LanceDB vector store for efficient similarity search
    • Caching mechanism for embeddings to improve performance
    • Status tracking for workspace indexing state
    • Robust initialization with error handling and retry logic
  4. Configuration:

    • Configurable maximum results
    • Model type selection (currently supports MiniLM)
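
The search path described above (embed the query, rank stored vectors by similarity, cap the result count) can be sketched roughly as follows. This is an in-memory illustration only: the PR delegates similarity search to LanceDB, and the names `IndexedChunk`, `cosineSimilarity`, and `searchTopK` are hypothetical, not taken from the code.

```typescript
// Hypothetical shape of one indexed item; the PR's real types may differ.
interface IndexedChunk {
  filePath: string;
  text: string;
  vector: number[];
}

// Cosine similarity between two equal-length vectors (0 if either is all zeros).
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error(`Dimension mismatch: ${a.length} vs ${b.length}`);
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Brute-force ranking: score every chunk, sort descending, keep the top maxResults.
function searchTopK(
  queryVector: number[],
  chunks: IndexedChunk[],
  maxResults: number,
): IndexedChunk[] {
  return chunks
    .map((chunk) => ({ chunk, score: cosineSimilarity(queryVector, chunk.vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, maxResults)
    .map((entry) => entry.chunk);
}
```

A vector store replaces the brute-force scan with an index, but the contract stays the same: a query vector goes in, and at most `maxResults` ranked items come out.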

Type of change

  • New feature

How Has This Been Tested?

The service has been tested with:

  • Various code files across supported languages
  • Large text files (up to 5 MB)
  • Different workspace configurations
  • Edge cases (empty files, binary files, etc.)
  • Error scenarios (failed initialization, missing files)
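
For readers unfamiliar with how the empty-file, binary-file, and size-limit cases are usually separated, here is a minimal sketch. The NUL-byte heuristic, the `decideIndexing` name, and the 5 MB default are assumptions for illustration; the description above only says that files up to 5 MB were tested, not what limit the service enforces.

```typescript
// Assumed limit: 5 MB matches the largest files mentioned in testing,
// but the limit actually enforced by the service is not stated in this PR.
const DEFAULT_MAX_BYTES = 5 * 1024 * 1024;

type IndexDecision = "index" | "skip-empty" | "skip-too-large" | "skip-binary";

// Heuristic binary check: a NUL byte in the first few KB rarely appears in
// UTF-8 or ASCII text, so its presence is treated as "binary".
function looksBinary(bytes: Uint8Array, sampleSize: number = 8192): boolean {
  const end = Math.min(bytes.length, sampleSize);
  for (let i = 0; i < end; i++) {
    if (bytes[i] === 0) return true;
  }
  return false;
}

// Cheap checks run first so large or empty files are rejected before scanning.
function decideIndexing(
  bytes: Uint8Array,
  maxBytes: number = DEFAULT_MAX_BYTES,
): IndexDecision {
  if (bytes.length === 0) return "skip-empty";
  if (bytes.length > maxBytes) return "skip-too-large";
  if (looksBinary(bytes)) return "skip-binary";
  return "index";
}
```

One known weakness of this heuristic is that UTF-16 text contains NUL bytes and would be classified as binary, so a production check would likely need encoding detection as well.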

Checklist:

  • My code follows the patterns of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation

Additional context

The service is designed to be extensible, with clear interfaces for:

  • Adding new embedding models
  • Supporting additional file types
  • Integrating with different vector stores
  • Integrating the vector store with external API embedding providers

…, add advanced result deduplication, improve .gitignore file indexing, update semantic search plan, refactor code parsing and indexing
…mproved result formatting, deduplication, and error handling. Update types for search results to include vector data and refine logging for better debugging.
…e-specific caching. Update cache key generation to include workspace ID, enhancing data isolation for semantic search operations.
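
As an illustration of why the workspace ID belongs in the cache key, the sketch below keeps embeddings for identical content in separate entries per workspace. The key format, the `embeddingCacheKey` helper, and the `EmbeddingCache` class are hypothetical and not taken from the PR.

```typescript
// Hypothetical key format: "<encoded workspaceId>:<contentHash>".
// Encoding the workspace ID keeps a ":" inside it from colliding with the separator.
function embeddingCacheKey(workspaceId: string, contentHash: string): string {
  return `${encodeURIComponent(workspaceId)}:${contentHash}`;
}

// Minimal in-memory cache; the real service may persist entries instead.
class EmbeddingCache {
  private readonly entries = new Map<string, number[]>();

  get(workspaceId: string, contentHash: string): number[] | undefined {
    return this.entries.get(embeddingCacheKey(workspaceId, contentHash));
  }

  set(workspaceId: string, contentHash: string, vector: number[]): void {
    this.entries.set(embeddingCacheKey(workspaceId, contentHash), vector);
  }
}
```

With this layout, clearing or reindexing one workspace cannot evict or overwrite entries that belong to another.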
…o SemanticSearchService

- Introduce relationships in CodeDefinition to enhance context for embedding generation.
- Refactor embedding methods to leverage contextual information.
- Update TreeSitterParser to extract relationship data during code parsing.
- Introduce new commands for managing the semantic search index: reindexing and deletion.
- Implement progress tracking for indexing operations, providing real-time feedback in the UI.
- Add support for filtering indexed files based on supported extensions.
- Update the SettingsView component to include controls for semantic search settings and display indexing progress.
- Refactor SemanticSearchService to handle indexing and clearing of the semantic search index more effectively.
…ne class

- Remove semantic search initialization from extension.ts
- Integrate semantic search initialization directly into Cline class
- Simplify initialization process and error handling
- Improve background initialization and error logging
- Maintain existing semantic search functionality with more cohesive implementation
- Add @lancedb/lancedb package to project dependencies
- Update esbuild configuration to include LanceDB Linux x64 GNU external dependency
- Prepare for enhanced vector storage and retrieval capabilities
…ration

- Modify semantic search indexing to index all files without extension filtering
- Update storage directory configuration to use cache directory directly
- Simplify logging for file indexing process
- Remove unnecessary file extension filtering during indexing
… and text file support

- Replace existing vector store with LanceDB implementation
- Add robust text file detection and indexing capabilities
- Implement content hash-based file change tracking
- Enhance file indexing with improved memory management and file type handling
- Add support for indexing non-code text files with semantic search
- Improve search result ranking and filtering logic
- Update MiniLM model from L6 to L12 version for improved embedding quality
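
The content hash-based change tracking mentioned above might look something like the following sketch. It is not the PR's code: `FileHashTracker` is a hypothetical name, and the FNV-1a function is a dependency-free stand-in chosen so the example runs anywhere. A real indexer would more likely use a cryptographic hash such as SHA-256.

```typescript
// Dependency-free 32-bit FNV-1a over UTF-16 code units, used only as a stand-in hash.
function fnv1a(content: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < content.length; i++) {
    hash ^= content.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return (hash >>> 0).toString(16);
}

class FileHashTracker {
  private readonly hashes = new Map<string, string>();

  // Returns true if the file is new or its content changed since the last call
  // (and records the new hash); returns false if the file can be skipped.
  needsReindex(filePath: string, content: string): boolean {
    const next = fnv1a(content);
    if (this.hashes.get(filePath) === next) return false;
    this.hashes.set(filePath, next);
    return true;
  }

  // Drops the stored hash, for example when a file is deleted from the workspace.
  forget(filePath: string): void {
    this.hashes.delete(filePath);
  }
}
```

Hashing content rather than comparing modification times means a file that is touched but not changed is still skipped.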
…t extraction

- Refactor TreeSitterParser to simplify code segment extraction logic
- Introduce a new TypeScript-specific query for more precise code parsing
- Update CodeSegment type to support more flexible segment types
- Add file hash verification to prevent unnecessary parsing
- Enhance language parser to support custom queries for different languages
…and improve error handling

- Move semantic search initialization logic to ClineProvider
- Simplify semantic search service creation and workspace indexing
- Add retry mechanism for semantic search initialization
- Improve progress reporting and error handling during indexing
- Pass semantic search service as a parameter to Cline constructor
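
A retry mechanism for initialization, as mentioned in this commit, is commonly built as a small wrapper with exponential backoff. The sketch below shows one such wrapper; `withRetry`, `RetryOptions`, and the backoff schedule are assumptions, since the commit message does not describe the actual retry policy.

```typescript
interface RetryOptions {
  maxAttempts: number; // total attempts, including the first one
  baseDelayMs: number; // delay before the second attempt; doubles each time
}

const sleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Runs `operation` until it succeeds or maxAttempts is reached,
// then rethrows the last error so the caller can report it.
async function withRetry<T>(
  operation: (attempt: number) => Promise<T>,
  options: RetryOptions,
): Promise<T> {
  if (options.maxAttempts < 1) {
    throw new RangeError("maxAttempts must be at least 1");
  }
  let lastError: unknown;
  for (let attempt = 1; attempt <= options.maxAttempts; attempt++) {
    try {
      return await operation(attempt);
    } catch (error) {
      lastError = error;
      if (attempt < options.maxAttempts) {
        // Exponential backoff: baseDelayMs, then 2x, 4x, ...
        await sleep(options.baseDelayMs * 2 ** (attempt - 1));
      }
    }
  }
  throw lastError;
}
```

A caller would wrap the service's initialization call in `withRetry` and surface the final error in the UI if every attempt fails.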
…code parsing

- Update TypeScript tree-sitter query to capture adjacent comments for code segments
- Add support for stripping comment formatting and selecting adjacent documentation
- Improve parsing of method, class, function, and variable declarations
- Refine comment extraction and association with code elements
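
To make "stripping comment formatting" concrete, here is a small sketch that reduces line, block, and JSDoc comments to plain text before they are attached to a code segment. The function name and the exact rules are hypothetical; the PR's parser may normalize comments differently.

```typescript
// Removes comment delimiters and leading "*" gutters, returning plain text lines.
function stripCommentFormatting(comment: string): string {
  const trimmed = comment.trim();
  // Block or JSDoc comment: drop the opening "/*" or "/**" and the closing "*/".
  const body = trimmed.startsWith("/*")
    ? trimmed.replace(/^\/\*+/, "").replace(/\*+\/$/, "")
    : trimmed;
  return body
    .split("\n")
    .map((line) =>
      line
        .replace(/^\s*\/\/+\s?/, "") // line comment marker: "// text"
        .replace(/^\s*\*\s?/, "") // JSDoc gutter: " * text"
        .trim(),
    )
    .filter((line) => line.length > 0)
    .join("\n");
}
```

Stripping the delimiters keeps comment syntax out of the text that is embedded, so only the documentation's wording contributes to the vector.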
…nnecessary metadata

- Remove detailed function and method metadata from CodeSegment type
- Streamline TreeSitterParser to focus on core code segment extraction
- Introduce CodeSegmentType enum for more type-safe segment classification
- Improve context extraction with hierarchical parent tracking
- Simplify import graph retrieval and parsing logic
…iable name extraction

- Introduce JavaScript tree-sitter query for code segment parsing
- Enhance variable name extraction with fallback to identifier nodes
- Update WASM directory resolution to use current module directory
- Implement import extraction for JavaScript files
- Simplify import and export segment collection
…for code parsing

- Create test cases for parsing TypeScript and JavaScript code segments
- Cover parsing of classes, functions, imports, and variables
- Implement dynamic test file generation and cleanup
- Verify code segment extraction for different language constructs
- Create index file for semantic search language queries
- Implement JavaScript tree-sitter query for parsing code segments
- Support extraction of imports, classes, methods, functions, and variables
- Align JavaScript parsing with existing TypeScript query structure
… directory support

- Add optional `wasmDir` parameter to `loadRequiredLanguageParsers` function
- Update language loading to use provided or default WASM directory
- Modify `loadLanguage` function to accept custom WASM directory path
- Improve parser initialization with dynamic file location configuration
- Update return type to provide more detailed parser and query information
…rove initialization robustness

- Introduce WorkspaceIndexStatus enum to track indexing progress
- Add methods to update and retrieve workspace indexing status
- Enhance initialization error handling and status management
- Remove deprecated memory monitoring and initialization tracking code
- Simplify initialization process with more focused error reporting
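
The status tracking described in this commit could be modeled along these lines. Only the enum name `WorkspaceIndexStatus` comes from the commit message; the member names, the `WorkspaceStatusTracker` class, and the listener callback are assumptions made for the sketch.

```typescript
// Enum name is from the commit message; the member names are assumptions.
enum WorkspaceIndexStatus {
  NotIndexed = "not_indexed",
  Indexing = "indexing",
  Indexed = "indexed",
  Error = "error",
}

type StatusListener = (workspaceId: string, status: WorkspaceIndexStatus) => void;

class WorkspaceStatusTracker {
  private readonly statuses = new Map<string, WorkspaceIndexStatus>();

  // The optional listener stands in for whatever forwards updates to the UI.
  constructor(private readonly onChange?: StatusListener) {}

  // Workspaces that were never seen are reported as NotIndexed.
  getStatus(workspaceId: string): WorkspaceIndexStatus {
    return this.statuses.get(workspaceId) ?? WorkspaceIndexStatus.NotIndexed;
  }

  // Stores the status and notifies the listener only when it actually changes.
  setStatus(workspaceId: string, status: WorkspaceIndexStatus): void {
    if (this.getStatus(workspaceId) === status) return;
    this.statuses.set(workspaceId, status);
    if (this.onChange) this.onChange(workspaceId, status);
  }
}
```

Notifying only on real transitions avoids flooding the settings view with redundant status messages during long indexing runs.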
…test infrastructure

- Delete memory monitoring classes and associated test files
- Remove in-memory and persistent vector store implementations
- Clean up deprecated memory tracking and vector storage code
- Eliminate test infrastructure for memory and vector store components
- Delete global state keys for semantic search memory and score settings
- Remove configuration handling for max memory and minimum score
- Simplify semantic search initialization with default parameters
- Add semantic search status tracking to global state
…ogic

- Modify result processing to prioritize code results while maintaining original order
- Simplify result formatting with more consistent type handling
- Remove hardcoded score thresholds and filtering logic
- Improve result deduplication and trimming to max results
- Update result type conversion to use SearchResultType enum
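
The result processing described in this commit (deduplicate, put code results first while keeping ranked order, trim to the maximum) can be sketched as below. The enum name `SearchResultType` appears in the commit message, but its members, the `RankedResult` shape, and the deduplication key are assumptions.

```typescript
// Enum name is from the commit message; members and the result shape are assumptions.
enum SearchResultType {
  Code = "code",
  File = "file",
}

interface RankedResult {
  type: SearchResultType;
  filePath: string;
  startLine?: number;
  endLine?: number;
}

// `results` is assumed to arrive already ranked, best match first.
function processResults(results: RankedResult[], maxResults: number): RankedResult[] {
  // Deduplicate on type, path, and line range; the first (best-ranked) copy wins.
  const seen = new Set<string>();
  const unique = results.filter((r) => {
    const key = `${r.type}|${r.filePath}|${r.startLine ?? ""}|${r.endLine ?? ""}`;
    if (seen.has(key)) return false;
    seen.add(key);
    return true;
  });
  // Stable partition: code results first, each group keeping its ranked order.
  const code = unique.filter((r) => r.type === SearchResultType.Code);
  const files = unique.filter((r) => r.type === SearchResultType.File);
  return [...code, ...files].slice(0, maxResults);
}
```

Because no score threshold is applied, the only cut-off in this sketch is `maxResults`, which matches the commit's removal of hardcoded score filtering.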
…ypes

- Remove detailed metadata fields from CodeDefinition
- Refactor SearchResult types to use more concise structure
- Update SearchResultType to use enum instead of string literals
- Remove vector and score properties from search result interfaces
- Streamline embedding generation by reducing metadata complexity
…pping

- Update search method to return more concise VectorSearchResult type
- Simplify result mapping by removing detailed CodeSearchResult structure
- Reduce logging and console output in search method
- Improve vector dimension calculation using vector length
- Align vector search result with recent type refactoring
…for clarity

- Update interface name from SearchResult to VectorSearchResult
- Maintain existing type structure and method signatures
- Improve type naming to better reflect vector search semantics
…n UI

- Remove memory and score configuration sliders from settings view
- Update ExtensionState and WebviewMessage to track semantic search status
- Add workspace status display in settings with color-coded status indicator
…s tracking

- Modify ClineProvider to accept semantic search service as a promise
- Add dynamic status tracking for semantic search initialization
- Implement WebView messaging for indexing progress and status updates
- Update SettingsView to request and display semantic search status
- Enhance ExtensionStateContext to manage semantic search status