
Update pytests #52

Open · wants to merge 16 commits into main
Conversation

galshubeli (Contributor) commented Dec 26, 2024

Summary by CodeRabbit

Release Notes v0.5.0

  • New Features

    • Added support for advanced testing metrics using the deepeval library.
    • Enhanced test coverage with multiple model configurations in the CI/CD pipeline.
    • Introduced a new testing class for validating Knowledge Graph outputs.
    • Added new constants to improve ontology extraction and data processing.
  • Dependencies

    • Updated Python version requirement to 3.10+.
    • Added rich dependency for improved output formatting.
    • Updated deepeval dependency to a newer version.
  • Testing

    • Introduced comprehensive metrics for evaluating language model outputs.
    • Expanded test matrix to include Gemini and GPT-4o models.
  • Removals

    • Removed previous Knowledge Graph test implementations for Gemini, Ollama, and OpenAI models.
    • Deleted the requirements.txt file and the dependency list it defined.


coderabbitai bot commented Dec 26, 2024

Walkthrough

This pull request introduces updates to the project's dependency management and testing infrastructure. The pyproject.toml file has been modified to require a newer Python version and includes updates to dependencies. Several model-specific Knowledge Graph test files have been removed, while a new testing class with enhanced functionality has been added. The GitHub Actions workflow has been updated to support matrix testing across multiple AI models, and a new metrics evaluation system has been implemented to assess the performance of language model outputs.

Changes

File Change Summary
pyproject.toml - Updated Python version constraint from "^3.9.0" to ">=3.10.0,<3.14"
- Added rich dependency with version constraint "^13.9.4"
- Updated deepeval dependency to version "^2.2.6"
tests/test_kg_gemini.py, tests/test_kg_ollama.py, tests/test_kg_openai.py - Removed test files for Gemini, Ollama, and OpenAI KG implementations
tests/test_rag.py - Added TestKGLiteLLM class
- Implemented new test method with metrics evaluation
.github/workflows/test.yml - Added matrix strategy for testing with multiple models
- Introduced TEST_MODEL and GEMINI_API_KEY environment variables
graphrag_sdk/test_metrics.py - Added CombineMetrics, GraphContextualRecall, and GraphContextualRelevancy classes
- Added GraphContextualRecallTemplate and GraphContextualRelevancyTemplate classes
graphrag_sdk/fixtures/prompts.py - Added new constants for ontology extraction and data processing prompts with enhanced instructions
graphrag_sdk/models/litellm.py - Updated exception handling in check_valid_key and check_and_pull_model methods for better error reporting
requirements.txt - Removed the requirements.txt file, which contained project dependencies

Sequence Diagram

sequenceDiagram
    participant Workflow as GitHub Actions
    participant Tests as Test Runner
    participant Metrics as Metrics Evaluator
    participant KG as Knowledge Graph

    Workflow->>Tests: Trigger tests with matrix models
    Tests->>KG: Initialize Knowledge Graph
    Tests->>Metrics: Evaluate LLM responses
    Metrics->>Metrics: Combine relevancy metrics
    Metrics-->>Tests: Return composite score
    Tests-->>Workflow: Report test results

Poem

🐰 In the realm of code, a rabbit's delight,
Metrics dance and models take flight!
Version bumped, tests refined with care,
Knowledge graphs spinning without a snare 🧠
Deepeval joins, our testing grows bright! 🌟


coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (3)
tests/test_rag.py (3)

4-7: Group the deepeval imports together.
assert_test comes from the top-level deepeval package, while LLMTestCase lives in deepeval.test_case, so the two cannot be merged into a single import statement. For readability, keep the two imports on adjacent lines:

from deepeval import assert_test
from deepeval.test_case import LLMTestCase

94-95: File path usage vs. resource management.
"tests/data/madoff.txt" is hardcoded. For cross-platform compatibility, consider using built-in path utilities or fixtures. The new usage of self.kg_gemini.process_sources(sources) is appropriate if each test run sets up data consistently.


106-113: Use of LLMTestCase for structured comparisons.
Packaging input, output, and context into LLMTestCase clarifies your test logic. Just be aware of your “expected_output” phrasing—“Over than 10 actors…” might need re-evaluation for language clarity (e.g. "More than 10 actors…").

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8617ba5 and 4b5df23.

⛔ Files ignored due to path filters (1)
  • poetry.lock is excluded by !**/*.lock
📒 Files selected for processing (5)
  • pyproject.toml (2 hunks)
  • tests/test_kg_gemini.py (0 hunks)
  • tests/test_kg_ollama.py (0 hunks)
  • tests/test_kg_openai.py (0 hunks)
  • tests/test_rag.py (2 hunks)
💤 Files with no reviewable changes (3)
  • tests/test_kg_ollama.py
  • tests/test_kg_openai.py
  • tests/test_kg_gemini.py
🔇 Additional comments (12)
tests/test_rag.py (10)

11-13: Good use of explicit metric imports.
The direct imports of individual metrics (AnswerRelevancyMetric, etc.) add clarity, enabling easy reference and usage in your test code. No issues found here.


72-72: Graph name alignment.
Renaming to "IMDB" is consistent with your usage later in the test. This clarifies the context and avoids ambiguous references to external providers.


74-75: Model selection check.
"gpt-4o" is a valid OpenAI model identifier; just ensure it is used consistently across the test suite and that it matches the model configured in the CI environment (TEST_MODEL).


81-86: Consolidating multiple Graph instances.
Creating two KnowledgeGraph instances referencing the same ontology promotes thorough comparative testing. Double-check that the correct model configurations (OpenAI vs. Gemini) are consistently applied across your suite (e.g., advanced parameters, temperature, etc.), to ensure valid coverage.


97-98: Chat session approach.
Utilizing separate chat sessions for each knowledge graph ensures test isolation. Good practice to confirm the session state is self-contained.
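
For illustration, the isolation pattern looks roughly like this (chat_session() and send_message() follow the usage quoted elsewhere in this review; the surrounding test setup is assumed):

# Each knowledge graph gets its own chat session, so conversation state from
# one provider cannot leak into the other's assertions.
input_query = "How many actors acted in a movie?"
chat_gemini = self.kg_gemini.chat_session()
chat_openai = self.kg_openai.chat_session()
answer_gemini = chat_gemini.send_message(input_query)
answer_openai = chat_openai.send_message(input_query)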


99-102: Metric-based validations.
Incorporating multiple metrics (relevancy, precision, recall) offers a robust assessment of the LLM's responses. This is a solid improvement over straightforward string comparisons.
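
For context, the metric trio referenced here is constructed with plain threshold arguments, e.g. (the 0.5 threshold mirrors the ContextualRecallMetric(threshold=0.5) call quoted later in this review; the PR's exact values may differ):

from deepeval.metrics import AnswerRelevancyMetric, ContextualPrecisionMetric, ContextualRelevancyMetric

# Three complementary deepeval metrics: answer relevancy plus precision and
# relevancy of the retrieved graph context.
answer_relevancy_metric = AnswerRelevancyMetric(threshold=0.5)
answer_cp_metric = ContextualPrecisionMetric(threshold=0.5)
answer_crelevancy_metric = ContextualRelevancyMetric(threshold=0.5)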


114-114: Seamless usage of assert_test.
Good integration of assert_test from deepeval for each test case. No issues found.


116-117: Parallel test coverage.
Repeating the same prompt and test pattern with kg_openai ensures parity. This parallels the kg_gemini flow and fosters direct comparison of model responses.


118-125: Consistent test structuring.
Reusing the same LLMTestCase fields keeps tests uniform. Ensure that each knowledge graph (OpenAI vs. Gemini) is indeed configured with the intended model to truly compare results.


127-127: Final test assertion.
Having a second call to assert_test keeps both sets of results validated under the same metrics. This final step aligns with the consolidated approach to test coverage.

pyproject.toml (2)

3-3: Incremented version to “0.4.2”.
This aligns with the new changes introduced in testing and dependencies. Make sure to tag or document this release accordingly for consistent version tracking.


25-25: Added new dependencies and extras.
Introducing deepeval and optional modeling dependencies allows flexible experimentation. Ensure that pinned ranges (like "^2.0.9") reflect tested compatibility.

Would you like a script to check for known vulnerabilities or overshadowing dependencies related to these newly added packages?

coderabbitai bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (2)
tests/test_rag.py (2)

13-13: Remove unused import ContextualRecallMetric

The ContextualRecallMetric is imported but never used in the test implementation.

-from deepeval.metrics import AnswerRelevancyMetric, ContextualPrecisionMetric, ContextualRecallMetric, ContextualRelevancyMetric
+from deepeval.metrics import AnswerRelevancyMetric, ContextualPrecisionMetric, ContextualRelevancyMetric

106-127: Refactor duplicate test case creation logic

The test case creation logic is duplicated for both OpenAI and Gemini implementations. Consider extracting this into a helper method to improve maintainability and reduce code duplication.

def create_test_case(self, kg, name):
    input = "How many actors acted in a movie?"
    chat = kg.chat_session()
    answer = chat.send_message(input)
    return LLMTestCase(
        input=input,
        actual_output=answer['response'],
        retrieval_context=[answer['context']],
        context=[answer['context']],
        name=name,
        expected_output="Over than 10 actors acted in a movie."
    )

Then use it like this:

test_case = self.create_test_case(self.kg_gemini, "gemini")
assert_test(test_case, [answer_relevancy_metric, answer_cp_metric, answer_crelevancy_metric])

test_case = self.create_test_case(self.kg_openai, "openai")
assert_test(test_case, [answer_relevancy_metric, answer_cp_metric, answer_crelevancy_metric])
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4b5df23 and faa3b78.

📒 Files selected for processing (1)
  • tests/test_rag.py (2 hunks)
🧰 Additional context used
🪛 Ruff (0.8.2)
tests/test_rag.py

101-101: Local variable answer_crecall_metric is assigned to but never used

Remove assignment to unused variable answer_crecall_metric

(F841)

🔇 Additional comments (1)
tests/test_rag.py (1)

72-85: Potential naming conflict with identical graph names

Both kg_openai and kg_gemini instances are using the same graph name "IMDB". This could lead to conflicts if they're persisting data to the same storage. Consider using distinct names like "IMDB_openai" and "IMDB_gemini".

coderabbitai bot left a comment

Actionable comments posted: 7

🧹 Nitpick comments (8)
tests/test_rag.py (2)

71-71: Extract hardcoded file path to a constant.

The test data file path is duplicated. Consider extracting it to a class-level or module-level constant to improve maintainability.

+ TEST_DATA_FILE = "tests/data/madoff.txt"

- file_path = "tests/data/madoff.txt"
+ file_path = TEST_DATA_FILE

class TestKGLiteLLM(unittest.TestCase):
    def test_kg_creation(self):
-        file_path = "tests/data/madoff.txt"
+        file_path = TEST_DATA_FILE

Also applies to: 180-180


82-94: Improve test maintainability with parameterization.

The test inputs and expected outputs are hardcoded as lists. Consider using pytest's parameterize feature for better test organization and maintenance.

Example refactor:

import pytest

TEST_CASES = [
    pytest.param(
        "How many actors acted in a movie?",
        "Over than 10 actors acted in a movie.",
        id="count_actors"
    ),
    pytest.param(
        "Which actors acted in the movie Madoff: The Monster of Wall Street?",
        "Joseph Scotto, Melony Feliciano, and Donna Pastorello acted in the movie Madoff: The Monster of Wall Street.",
        id="list_actors"
    ),
    # ... other test cases
]

@pytest.mark.parametrize("input_query,expected_output", TEST_CASES)
def test_movie_queries(input_query, expected_output):
    # Test implementation
custom_metric.py (6)

6-6: Remove unused import get_or_create_event_loop.

The function get_or_create_event_loop is imported but not used anywhere in the code. Removing this unused import will clean up the code.

Apply this diff to fix the issue:

-from deepeval.utils import get_or_create_event_loop, prettify_list
+from deepeval.utils import prettify_list
🧰 Tools
🪛 Ruff (0.8.2)

6-6: deepeval.utils.get_or_create_event_loop imported but unused

Remove unused import: deepeval.utils.get_or_create_event_loop

(F401)


16-16: Remove unused import ConversationalTestCase.

The class ConversationalTestCase is imported but not used anywhere in the code. Removing this unused import will clean up the code.

Apply this diff to fix the issue:

-from deepeval.test_case import (
-    LLMTestCaseParams,
-    ConversationalTestCase,
-)
+from deepeval.test_case import (
+    LLMTestCaseParams,
+)
🧰 Tools
🪛 Ruff (0.8.2)

16-16: deepeval.test_case.ConversationalTestCase imported but unused

Remove unused import: deepeval.test_case.ConversationalTestCase

(F401)


20-20: Remove unused import ContextualRecallTemplate.

The class ContextualRecallTemplate is imported but not used from the module deepeval.metrics.contextual_recall.template. Since you define GraphContextualRecallTemplate later in the code, this import can be removed.

Apply this diff to fix the issue:

-from deepeval.metrics.contextual_recall.template import ContextualRecallTemplate
🧰 Tools
🪛 Ruff (0.8.2)

20-20: deepeval.metrics.contextual_recall.template.ContextualRecallTemplate imported but unused

Remove unused import: deepeval.metrics.contextual_recall.template.ContextualRecallTemplate

(F401)


117-117: Consider simplifying the composite score calculation.

The current logic sets the score to zero if in strict mode and the composite score is below the threshold. This could be simplified or clarified for better readability.

Consider updating the score calculation:

self.score = 0 if self.strict_mode and composite_score < self.threshold else composite_score

145-170: Handle potential exceptions when generating verdicts or calculating scores.

In the measure method of GraphContextualRecall, exceptions during verdict generation or score calculation could cause the evaluation to fail silently. Adding error handling will improve robustness.

Consider wrapping the verdict generation and score calculation in try-except blocks to handle exceptions appropriately.
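
A minimal sketch of that guard; the helper names (_generate_verdicts, _calculate_score, _generate_reason) are assumptions about the metric's internals, not taken from the PR:

def measure(self, test_case, _show_indicator: bool = True) -> float:
    try:
        # Any LLM or JSON-parsing hiccup should surface as a metric error
        # instead of silently yielding a zero score.
        self.verdicts = self._generate_verdicts(
            test_case.expected_output, test_case.retrieval_context
        )
        self.score = self._calculate_score()
        self.reason = self._generate_reason(test_case.expected_output)
        self.success = self.score >= self.threshold
    except Exception as exc:
        self.error = str(exc)
        self.success = False
        raise
    return self.score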

🧰 Tools
🪛 Ruff (0.8.2)

153-153: ContextualRecallVerdict may be undefined, or defined from star imports

(F405)


258-293: Review multiline string formatting for prompts.

The prompt strings in GraphContextualRecallTemplate use triple quotes and f-strings, which can be error-prone. Ensure that the placeholders are correctly formatted and that the strings are displayed as intended.

Verify that the prompts render correctly and consider using dedicated template strings or raw strings if necessary.
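
One concrete pitfall worth checking: literal JSON examples inside an f-string must use doubled braces. A small illustrative snippet (not the PR's actual template):

def example_verdict_prompt(expected_output: str) -> str:
    # {{ }} renders as literal { } in the output; single braces would be
    # treated as interpolation fields and break the prompt.
    return f"""Break the expected output into verdicts and answer only with JSON:
{{"verdicts": [{{"verdict": "yes", "reason": "..."}}]}}

Expected Output:
{expected_output}
"""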

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between faa3b78 and 00ff51a.

📒 Files selected for processing (2)
  • custom_metric.py (1 hunks)
  • tests/test_rag.py (1 hunks)
🧰 Additional context used
🪛 Ruff (0.8.2)
tests/test_rag.py

191-191: Local variable answer_crecall_metric is assigned to but never used

Remove assignment to unused variable answer_crecall_metric

(F841)

custom_metric.py

4-4: Redefinition of unused Optional from line 1

Remove definition: Optional

(F811)


6-6: deepeval.utils.get_or_create_event_loop imported but unused

Remove unused import: deepeval.utils.get_or_create_event_loop

(F401)


14-14: Redefinition of unused LLMTestCase from line 3

Remove definition: LLMTestCase

(F811)


16-16: deepeval.test_case.ConversationalTestCase imported but unused

Remove unused import: deepeval.test_case.ConversationalTestCase

(F401)


18-18: Redefinition of unused BaseMetric from line 2

Remove definition: BaseMetric

(F811)


20-20: deepeval.metrics.contextual_recall.template.ContextualRecallTemplate imported but unused

Remove unused import: deepeval.metrics.contextual_recall.template.ContextualRecallTemplate

(F401)


22-22: from deepeval.metrics.contextual_recall.schema import * used; unable to detect undefined names

(F403)


153-153: ContextualRecallVerdict may be undefined, or defined from star imports

(F405)


197-197: Reason may be undefined, or defined from star imports

(F405)


197-197: Reason may be undefined, or defined from star imports

(F405)


219-219: ContextualRecallVerdict may be undefined, or defined from star imports

(F405)


228-228: ContextualRecallVerdict may be undefined, or defined from star imports

(F405)


233-233: Verdicts may be undefined, or defined from star imports

(F405)


233-233: Verdicts may be undefined, or defined from star imports

(F405)


234-234: Verdicts may be undefined, or defined from star imports

(F405)


240-240: ContextualRecallVerdict may be undefined, or defined from star imports

(F405)


250-250: Do not use bare except

(E722)

⏰ Context from checks skipped due to timeout of 90000ms (1)
  • GitHub Check: test
🔇 Additional comments (4)
tests/test_rag.py (4)

1-21: LGTM! Well-structured imports and setup.

The imports are well-organized, and the logging configuration is appropriate for testing purposes.


64-64: Verify model configuration consistency.

The model configuration differs between the global scope ("gemini/gemini-2.0-flash-exp") and the test class ("gpt-4o"). This inconsistency might lead to different behaviors in tests.

Please clarify if this difference is intentional or if one of these should be updated to match the other.

Also applies to: 169-169
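
If the difference is unintentional, one way to keep the two in sync is to read the model name once from the TEST_MODEL variable the workflow already exports (a sketch; the fallback default is an assumption):

import os

# Single source of truth for the model under test; falls back to the Gemini
# flash model when TEST_MODEL is not set, e.g. for local runs.
MODEL_NAME = os.getenv("TEST_MODEL", "gemini/gemini-2.0-flash-exp")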


79-79: Remove unused metric variable.

The answer_crecall_metric is created but never used in the assertions.

Also applies to: 79-79


178-204: 🛠️ Refactor suggestion

Enhance test coverage and reduce duplication.

The test method duplicates logic from the global scope and only tests a single scenario. Consider:

  1. Moving the common test logic to helper methods
  2. Adding more test cases for error scenarios
  3. Testing edge cases (empty source file, malformed data, etc.)

Would you like me to help generate additional test cases for error scenarios and edge cases?
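
For instance, the empty-source edge case could be sketched like this (process_sources(), chat_session(), and the answer dictionary keys follow the usage shown elsewhere in this review; the test name and expected behaviour are hypothetical):

def test_process_no_sources(self):
    # Edge case: an empty source list should leave the graph queryable and
    # must not raise during extraction.
    self.kg.process_sources([])
    chat = self.kg.chat_session()
    answer = chat.send_message("How many actors acted in a movie?")
    self.assertIn("response", answer)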

🧰 Tools
🪛 Ruff (0.8.2)

191-191: Local variable answer_crecall_metric is assigned to but never used

Remove assignment to unused variable answer_crecall_metric

(F841)

coderabbitai bot left a comment

Actionable comments posted: 8

🧹 Nitpick comments (2)
graphrag_sdk/custom_metric.py (2)

2-2: Remove unused import of ContextualRelevancyMetric

The ContextualRelevancyMetric is imported at line 2 but not used in the code. Removing unused imports helps keep the code clean and improves readability.

Apply this diff:

-from deepeval.metrics import BaseMetric, AnswerRelevancyMetric, FaithfulnessMetric, ContextualRelevancyMetric
+from deepeval.metrics import BaseMetric, AnswerRelevancyMetric, FaithfulnessMetric
🧰 Tools
🪛 Ruff (0.8.2)

2-2: deepeval.metrics.ContextualRelevancyMetric imported but unused

Remove unused import: deepeval.metrics.ContextualRelevancyMetric

(F401)


22-22: Remove unused import of ContextualRecallTemplate

The ContextualRecallTemplate is imported at line 22 but not used in the code. Removing unused imports improves code clarity.

Apply this diff:

-from deepeval.metrics.contextual_recall.template import ContextualRecallTemplate
🧰 Tools
🪛 Ruff (0.8.2)

22-22: deepeval.metrics.contextual_recall.template.ContextualRecallTemplate imported but unused

Remove unused import: deepeval.metrics.contextual_recall.template.ContextualRecallTemplate

(F401)

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 00ff51a and 2169875.

📒 Files selected for processing (2)
  • graphrag_sdk/custom_metric.py (1 hunks)
  • tests/test_rag.py (1 hunks)
🧰 Additional context used
🪛 Ruff (0.8.2)
tests/test_rag.py

191-191: Local variable answer_crecall_metric is assigned to but never used

Remove assignment to unused variable answer_crecall_metric

(F841)

graphrag_sdk/custom_metric.py

2-2: deepeval.metrics.ContextualRelevancyMetric imported but unused

Remove unused import: deepeval.metrics.ContextualRelevancyMetric

(F401)


4-4: Redefinition of unused Optional from line 1

Remove definition: Optional

(F811)


14-14: Redefinition of unused LLMTestCase from line 3

Remove definition: LLMTestCase

(F811)


19-19: from deepeval.metrics.contextual_relevancy.schema import * used; unable to detect undefined names

(F403)


20-20: Redefinition of unused BaseMetric from line 2

Remove definition: BaseMetric

(F811)


22-22: deepeval.metrics.contextual_recall.template.ContextualRecallTemplate imported but unused

Remove unused import: deepeval.metrics.contextual_recall.template.ContextualRecallTemplate

(F401)


24-24: from deepeval.metrics.contextual_recall.schema import * used; unable to detect undefined names

(F403)


162-162: ContextualRecallVerdict may be undefined, or defined from star imports

(F405)


206-206: Reason may be undefined, or defined from star imports

(F405)


206-206: Reason may be undefined, or defined from star imports

(F405)


228-228: ContextualRecallVerdict may be undefined, or defined from star imports

(F405)


237-237: ContextualRecallVerdict may be undefined, or defined from star imports

(F405)


242-242: Verdicts may be undefined, or defined from star imports

(F405)


242-242: Verdicts may be undefined, or defined from star imports

(F405)


243-243: Verdicts may be undefined, or defined from star imports

(F405)


249-249: ContextualRecallVerdict may be undefined, or defined from star imports

(F405)


259-259: Do not use bare except

(E722)


375-375: ContextualRelevancyVerdicts may be undefined, or defined from star imports

(F405)


418-418: Reason may be undefined, or defined from star imports

(F405)


418-418: Reason may be undefined, or defined from star imports

(F405)


442-442: ContextualRelevancyVerdicts may be undefined, or defined from star imports

(F405)


450-450: ContextualRelevancyVerdicts may be undefined, or defined from star imports

(F405)


454-454: ContextualRelevancyVerdicts may be undefined, or defined from star imports

(F405)


460-460: ContextualRelevancyVerdicts may be undefined, or defined from star imports

(F405)


468-468: Do not use bare except

(E722)

⏰ Context from checks skipped due to timeout of 90000ms (1)
  • GitHub Check: test
🔇 Additional comments (2)
tests/test_rag.py (2)

79-79: Remove unused variable answer_crecall_metric

The answer_crecall_metric variable is assigned but never used at lines 79 and 191. Removing unused variables helps clean up the code.

Apply this diff:

-answer_crecall_metric = ContextualRecallMetric(threshold=0.5)

Also applies to: 191-191


125-167: Refactor duplicate ontology definition

The ontology structure is duplicated in both the global scope and within the TestKGLiteLLM class. Consider extracting it into a shared function or fixture to improve maintainability.
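
A minimal sketch of the extraction, assuming a hypothetical module-level helper whose body would hold the Actor/Movie ontology construction that is currently written out twice:

from functools import lru_cache

@lru_cache(maxsize=1)
def imdb_ontology():
    """Build the shared Actor/Movie ontology exactly once."""
    # ... move the existing Ontology construction here unchanged ...
    ...

# Both the module-level setup and TestKGLiteLLM then call imdb_ontology()
# instead of repeating the entity/relation definitions inline.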

coderabbitai bot left a comment

Actionable comments posted: 4

🧹 Nitpick comments (4)
graphrag_sdk/test_metrics.py (1)

19-19: Avoid using wildcard imports

Using from module import * can lead to namespace pollution and makes it harder to track where specific classes or functions are coming from. It also causes issues with static analysis tools.

Replace wildcard imports with explicit imports of the required classes or functions. For example:

-from deepeval.metrics.contextual_relevancy.schema import *
+from deepeval.metrics.contextual_relevancy.schema import ContextualRecallVerdict, Verdicts, Reason

...

-from deepeval.metrics.contextual_recall.schema import *
+from deepeval.metrics.contextual_recall.schema import ContextualRelevancyVerdicts

Also applies to: 24-24

🧰 Tools
🪛 Ruff (0.8.2)

19-19: from deepeval.metrics.contextual_relevancy.schema import * used; unable to detect undefined names

(F403)

.github/workflows/test.yml (3)

80-80: Document the TEST_MODEL environment variable usage.

Consider adding a comment in the workflow file explaining:

  1. The expected format and valid values for TEST_MODEL
  2. How this variable is used in the test suite
  3. What happens if the variable is not set
+          # TEST_MODEL: Model identifier used by pytest to determine which LLM to use for testing
+          # Valid values: gemini-1.5-flash-001, gpt-4o
           TEST_MODEL: ${{ matrix.model }} # Pass the model as an environment variable

27-27: Remove trailing whitespace.

Fix the formatting issue by removing trailing spaces on line 27.

🧰 Tools
🪛 yamllint (1.35.1)

[error] 27-27: trailing spaces

(trailing-spaces)


Line range hint 13-81: Consider enhancing workflow robustness.

Suggested improvements for the workflow configuration:

  1. Add a timeout to prevent long-running tests from blocking the pipeline:
timeout-minutes: 30
  1. Specify an explicit Python version instead of using "3.x":
python-version: "3.11"
  1. Add verbose output to pytest for better debugging:
run: poetry run pytest -v
🧰 Tools
🪛 yamllint (1.35.1)

[error] 27-27: trailing spaces

(trailing-spaces)

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2169875 and ec40932.

⛔ Files ignored due to path filters (1)
  • poetry.lock is excluded by !**/*.lock
📒 Files selected for processing (4)
  • .github/workflows/test.yml (2 hunks)
  • graphrag_sdk/test_metrics.py (1 hunks)
  • pyproject.toml (1 hunks)
  • tests/test_rag.py (1 hunks)
🧰 Additional context used
🪛 yamllint (1.35.1)
.github/workflows/test.yml

[error] 27-27: trailing spaces

(trailing-spaces)

🪛 Ruff (0.8.2)
graphrag_sdk/test_metrics.py

2-2: deepeval.metrics.ContextualRelevancyMetric imported but unused

Remove unused import: deepeval.metrics.ContextualRelevancyMetric

(F401)


4-4: Redefinition of unused Optional from line 1

Remove definition: Optional

(F811)


6-6: deepeval.utils.get_or_create_event_loop imported but unused

Remove unused import: deepeval.utils.get_or_create_event_loop

(F401)


14-14: Redefinition of unused LLMTestCase from line 3

Remove definition: LLMTestCase

(F811)


19-19: from deepeval.metrics.contextual_relevancy.schema import * used; unable to detect undefined names

(F403)


20-20: Redefinition of unused BaseMetric from line 2

Remove definition: BaseMetric

(F811)


22-22: deepeval.metrics.contextual_recall.template.ContextualRecallTemplate imported but unused

Remove unused import: deepeval.metrics.contextual_recall.template.ContextualRecallTemplate

(F401)


24-24: from deepeval.metrics.contextual_recall.schema import * used; unable to detect undefined names

(F403)


164-164: ContextualRecallVerdict may be undefined, or defined from star imports

(F405)


208-208: Reason may be undefined, or defined from star imports

(F405)


208-208: Reason may be undefined, or defined from star imports

(F405)


230-230: ContextualRecallVerdict may be undefined, or defined from star imports

(F405)


239-239: ContextualRecallVerdict may be undefined, or defined from star imports

(F405)


244-244: Verdicts may be undefined, or defined from star imports

(F405)


244-244: Verdicts may be undefined, or defined from star imports

(F405)


245-245: Verdicts may be undefined, or defined from star imports

(F405)


251-251: ContextualRecallVerdict may be undefined, or defined from star imports

(F405)


261-261: Do not use bare except

(E722)


369-369: ContextualRelevancyVerdicts may be undefined, or defined from star imports

(F405)


412-412: Reason may be undefined, or defined from star imports

(F405)


412-412: Reason may be undefined, or defined from star imports

(F405)


436-436: ContextualRelevancyVerdicts may be undefined, or defined from star imports

(F405)


444-444: ContextualRelevancyVerdicts may be undefined, or defined from star imports

(F405)


448-448: ContextualRelevancyVerdicts may be undefined, or defined from star imports

(F405)


454-454: ContextualRelevancyVerdicts may be undefined, or defined from star imports

(F405)


462-462: Do not use bare except

(E722)

🔇 Additional comments (4)
tests/test_rag.py (1)

102-102: Correct grammatical error in expected output

The phrase "Over than 10 actors acted in a movie." should be "More than 10 actors acted in a movie." This correction improves clarity and grammatical accuracy.

Apply this diff to correct the phrase:

-    "Over than 10 actors acted in a movie.",
+    "More than 10 actors acted in a movie.",
pyproject.toml (3)

Line range hint 13-31: LGTM! Well-organized dependency structure.

The dependencies are well-organized with:

  • Clear separation between core and optional dependencies
  • Appropriate version constraints
  • Logical grouping of LLM provider dependencies as optional

25-25: Verify deepeval compatibility with existing dependencies.

The new deepeval dependency looks good with appropriate version constraints. Let's verify there are no conflicts:

✅ Verification successful

✓ deepeval dependency verified as compatible and secure

The package is actively maintained with no known vulnerabilities. Version constraints are appropriate and won't cause unexpected breaking changes.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for potential dependency conflicts and security advisories

# Check PyPI for latest versions and potential conflicts
curl -s https://pypi.org/pypi/deepeval/json | jq '.info.version'

# Check for security advisories
gh api graphql -f query='
{
  securityVulnerabilities(first: 5, ecosystem: PIP, package: "deepeval") {
    nodes {
      advisory {
        summary
        severity
        publishedAt
      }
      vulnerableVersionRange
      firstPatchedVersion {
        identifier
      }
    }
  }
}'

Length of output: 414


Line range hint 2-3: Verify if version bump follows semantic versioning.

The version bump from 0.4.2 to 0.5.0 indicates new features. Let's verify this aligns with the changes:

✅ Verification successful

Version bump to 0.5.0 follows semantic versioning correctly

The version bump from 0.4.2 to 0.5.0 is appropriate as it represents:

  • Addition of new features (deepeval integration for metrics)
  • No breaking changes (removed file was not referenced in the codebase)
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check commit history for breaking changes that might require a major version bump

# Look for breaking changes indicators in commit messages
git log --oneline | grep -iE "break|breaking|major"

# Check for removed files that might indicate breaking changes
git diff --name-status HEAD~1 | grep "^D"

Length of output: 75


Script:

#!/bin/bash
# Check for references to custom_metric in the codebase
rg -l "custom_metric" 

# Look for potential replacement files
fd "metric" --type f --changed-within 2weeks

# Check if custom_metric was imported in examples or tests
rg -l "from graphrag_sdk.custom_metric import"

Length of output: 144

Comment on lines +2 to +4
from deepeval.metrics import BaseMetric, AnswerRelevancyMetric, FaithfulnessMetric, ContextualRelevancyMetric
from deepeval.test_case import LLMTestCase
from typing import Optional, List, Union

🛠️ Refactor suggestion

Remove unused imports and resolve redefinitions

There are several unused imports and redefinitions in the code which can lead to confusion and potential conflicts. Specifically:

  • Line 2: ContextualRelevancyMetric is imported but not used.
  • Line 4: Optional is redefined from line 1.
  • Line 6: get_or_create_event_loop is imported but not used.
  • Line 14: LLMTestCase is redefined from line 3.
  • Line 20: BaseMetric is redefined from line 2.
  • Line 22: ContextualRecallTemplate is imported but not used.

Apply this diff to clean up the imports and resolve redefinitions:

 from typing import Optional
-from deepeval.metrics import BaseMetric, AnswerRelevancyMetric, FaithfulnessMetric, ContextualRelevancyMetric
+from deepeval.metrics import BaseMetric, AnswerRelevancyMetric, FaithfulnessMetric
 from deepeval.test_case import LLMTestCase
-from typing import Optional, List, Union
+from typing import List, Union

-from deepeval.utils import get_or_create_event_loop, prettify_list
+from deepeval.utils import prettify_list

-from deepeval.test_case import (
-    LLMTestCase,
-    LLMTestCaseParams,
-    ConversationalTestCase,
-)
+from deepeval.test_case import LLMTestCaseParams, ConversationalTestCase

 from deepeval.metrics.contextual_relevancy.schema import *
-from deepeval.metrics import BaseMetric
 from deepeval.models import DeepEvalBaseLLM
+from deepeval.metrics.contextual_recall.template import ContextualRecallTemplate
 from deepeval.metrics.indicator import metric_progress_indicator
 from deepeval.metrics.contextual_recall.schema import *

Also applies to: 6-6, 14-14, 20-20, 22-22

🧰 Tools
🪛 Ruff (0.8.2)

2-2: deepeval.metrics.ContextualRelevancyMetric imported but unused

Remove unused import: deepeval.metrics.ContextualRelevancyMetric

(F401)


4-4: Redefinition of unused Optional from line 1

Remove definition: Optional

(F811)


self.evaluation_cost = 0 if self.using_native_model else None
with metric_progress_indicator(self, _show_indicator=_show_indicator):
    self.verdicts: List[ContextualRecallVerdict] = (

🛠️ Refactor suggestion

Ensure classes from wildcard imports are explicitly imported

Classes such as ContextualRecallVerdict, Verdicts, Reason, and ContextualRelevancyVerdicts are used but may not be defined due to wildcard imports. This can lead to undefined names and potential runtime errors.

Explicitly import the required classes to ensure they are defined:

+from deepeval.metrics.contextual_relevancy.schema import ContextualRecallVerdict, Verdicts, Reason
+from deepeval.metrics.contextual_recall.schema import ContextualRelevancyVerdicts

 # Replace wildcard imports
-from deepeval.metrics.contextual_relevancy.schema import *
-from deepeval.metrics.contextual_recall.schema import *

Also applies to: 208-208, 230-230, 239-239, 244-244, 251-251, 369-369, 412-412, 436-436, 444-444, 448-448, 454-454

🧰 Tools
🪛 Ruff (0.8.2)

164-164: ContextualRecallVerdict may be undefined, or defined from star imports

(F405)


qodo-merge-pro bot commented Jan 29, 2025

CI Feedback 🧐

(Feedback updated until commit 2ed1b9b)

A test triggered by this PR failed. Here is an AI-generated analysis of the failure:

Action: test (openai/gpt-4o)

Failed stage: Run tests [❌]

Failed test name: test_llm

Failure summary:

The test 'test_llm' in test_rag.py failed because the mean score (0.1667) was below the required
threshold of 0.5. Specifically:

  • The test expected an average score >= 0.5
  • The actual scores were [1.0, 0.0, 0.0, 0.0, 0.0, 0.0], resulting in a mean of 0.1667
  • This suggests that only 1 out of 6 test cases passed, while the others failed completely

  • Relevant error logs:
    1:  ##[group]Operating System
    2:  Ubuntu
    ...
    
    778:  2025-02-04 14:50:08 [    INFO] 
    779:  LiteLLM completion() model= gpt-4o; provider = openai (utils.py:2820)
    780:  2025-02-04 14:50:09 [    INFO] HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK" (_client.py:1038)
    781:  2025-02-04 14:50:09 [    INFO] Wrapper: Completed Call, calling success_handler (utils.py:952)
    782:  2025-02-04 14:50:10 [    INFO] HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK" (_client.py:1038)
    783:  2025-02-04 14:50:17 [    INFO] HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK" (_client.py:1038)
    784:  2025-02-04 14:50:21 [    INFO] HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK" (_client.py:1038)
    785:  2025-02-04 14:50:30 [    INFO] HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK" (_client.py:1038)
    786:  FAILED                                                                   [100%]Running teardown with pytest sessionfinish...
    787:  =================================== FAILURES ===================================
    ...
    
    823:  )
    824:  score = answer_combined_metric.measure(test_case)
    825:  scores.append(score)
    826:  self.kg.delete()
    827:  >       assert np.mean(scores) >= 0.5
    828:  E       assert np.float64(0.16666666666666666) >= 0.5
    829:  E        +  where np.float64(0.16666666666666666) = <function mean at 0x7f7e03379870>([1.0, 0.0, 0.0, 0.0, 0.0, 0.0])
    830:  E        +    where <function mean at 0x7f7e03379870> = np.mean
    831:  tests/test_rag.py:130: AssertionError
    ...
    
    926:  LiteLLM completion() model= gpt-4o; provider = openai
    927:  �[92m14:50:08 - LiteLLM:INFO�[0m: utils.py:952 - Wrapper: Completed Call, calling success_handler
    928:  �[92m14:50:08 - LiteLLM:INFO�[0m: utils.py:2820 - 
    929:  LiteLLM completion() model= gpt-4o; provider = openai
    930:  �[92m14:50:09 - LiteLLM:INFO�[0m: utils.py:952 - Wrapper: Completed Call, calling success_handler
    931:  ------------------------------ Captured log call -------------------------------
    932:  DEBUG    graphrag_sdk.steps.extract_data_step:extract_data_step.py:126 Processing task: extract_data_step_e31bcb32-4cec-4309-93cc-d415bece8b44
    933:  DEBUG    extract_data_step_e31bcb32-4cec-4309-93cc-d415bece8b44:extract_data_step.py:127 Processing task: extract_data_step_e31bcb32-4cec-4309-93cc-d415bece8b44
    934:  DEBUG    extract_data_step_e31bcb32-4cec-4309-93cc-d415bece8b44:extract_data_step.py:145 User message:  You are tasked with extracting entities and relations from the text below, using the ontology provided.  **Output Format:**  - Provide the extracted data as a JSON object with two keys: `"entities"` and `"relations"`.  - **Entities**: Represent entities and concepts. Each entity should have a `"label"` and `"attributes"` field.  - **Relations**: Represent relations between entities or concepts. Each relation should have a `"label"`, `"source"`, `"target"`, and `"attributes"` field.  **Guidelines:** - **Extract all entities and relations**: Capture all entities and relations mentioned in the text.  - **Use Only the Provided Ontology**: Utilize only the types of entities, relations, and attributes defined in the ontology.  - **Assign IDs Where Required**: Assign textual IDs to entities and relations as specified.  - **Avoid Duplicates**: Ensure each entity and relation is unique; do not include duplicates.  - **Formatting**:   - Do not include any introduction or explanation in the response, only the JSON.      - Use double quotes for all string values.    - Properly escape any special characters.    - Dates should be in the format `"YYYY-MM-DD"`.    - Correct any spacing or formatting issues in text fields as necessary.  - **Precision**: Be concise and precise in your extraction.  - **Token Limit**: Ensure your response does not exceed **8192 tokens**.  **User Instructions**:    **Ontology**: ***'entities': [***'label': 'Actor', 'attributes': [***'name': 'name', 'type': 'string', 'unique': True, 'required': True***], 'description': ''***, ***'label': 'Movie', 'attributes': [***'name': 'title', 'type': 'string', 'unique': True, 'required': True***], 'description': ''***], 'relations': [***'label': 'ACTED_IN', 'source': ***'label': 'Actor'***, 'target': ***'label': 'Movie'***, 'attributes': [***'name': 'role', 'type': 'string', 'unique': False, 'required': True***]***]***  **Raw Text**: Madoff: The Monster of Wall Street (TV Mini Series 2023) - IMDb  MenuMoviesRelease CalendarTop 250 MoviesMost Popular MoviesBrowse Movies by GenreTop Box OfficeShowtimes & TicketsMovie NewsIndia Movie SpotlightTV ShowsWhat's on TV & StreamingTop 250 TV ShowsMost Popular TV ShowsBrowse TV Shows by GenreTV NewsWatchWhat to WatchLatest TrailersIMDb OriginalsIMDb PicksIMDb SpotlightIMDb PodcastsAwards & EventsOscarsBlack History MonthSundance Film FestivalSXSW Film FestivalSTARmeter AwardsAwards CentralFestival CentralAll EventsCelebsBorn TodayMost Popular CelebsCelebrity NewsCommunityHelp CenterContributor ZonePollsFor Industry ProfessionalsLanguageEnglish (United States)LanguageFully supportedEnglish (United States)Partially supportedFrançais (Canada)Français (France)Deutsch (Deutschland)हिंदी (भारत)Italiano (Italia)Português (Brasil)Español (España)Español (México)AllAllWatchlistSign InENFully supportedEnglish (United States)Partially supportedFrançais (Canada)Français (France)Deutsch (Deutschland)हिंदी (भारत)Italiano (Italia)Português (Brasil)Español (España)Español (México)Use app Episode guideCast & crewUser reviewsTriviaFAQIMDbProAll topicsMadoff: The Monster of Wall StreetTV Mini Series2023TV-MA1hIMDb RATING7.3/107.8KYOUR RATINGRatePlay trailer1:502 Videos12 PhotosCrime DocumentaryTrue CrimeCrimeDocumentaryIt follows the rise and fall of the American financier and ponzi schemer: Madoff.It follows the rise and fall of the American financier and ponzi schemer: Madoff.It follows the rise and fall 
of the American financier and ponzi schemer: Madoff.StarsAlex HammerliJoseph ScottoMelony FelicianoSee production info at IMDbProIMDb RATING7.3/107.8KYOUR RATINGRateStarsAlex HammerliJoseph ScottoMelony Feliciano31User reviews16Critic reviewsSee production info at IMDbPro Episodes4Browse episodesTopTop-ratedSeason2023 Videos2Trailer 1:50Official TrailerTrailer 1:42Madoff: The Monster Of Wall StreetTrailer 1:42Madoff: The Monster Of Wall StreetPhotos12Add photo+ 7 Top cast47EditAlex HammerliMadoff Employee4 eps • 20234 episodes • 2023Joseph ScottoBernie Madoff4 eps • 20234 episodes • 2023Melony FelicianoBackground Extra4 eps • 20234 episodes • 2023Donna PastorelloEleanor Squillari4 eps • 20234 episodes • 2023Sarah KuklisEllen Hales4 eps • 20234 episodes • 2023Stephanie BeauchampJodi Crupi4 eps • 20234 episodes • 2023Elijah George19th Floor Trader4 eps • 20234 episodes • 2023Howie SchaalJerry O'Hara4 eps • 20234 episodes • 2023Cris ColicchioPeter Madoff4 eps • 20234 episodes • 2023Kevin DelanoAndrew Madoff4 eps • 20234 episodes • 2023Diana B. HenriquesSelf - Author - The Wizard of Lies…4 eps • 20234 episodes • 2023Isa CamyarFrank DiPascali4 eps • 20234 episodes • 2023Alex OlsonMark Madoff4 eps • 20234 episodes • 2023Alicia ErlingerAnnette Bongiorno4 eps • 20234 episodes • 2023Robert Loftus19th Floor Trader3 eps • 20233 episodes • 2023Paul FaggioneJeffrey Tucker3 eps • 20233 episodes • 2023Marla FreemanSonja Kohn3 eps • 20233 episodes • 2023Rafael Antonio VasquezGeorge Perez3 eps • 20233 episodes • 2023All cast & crewProduction, box office & more at IMDbProMore like this7.4Madoff7.1Jeffrey Epstein: Filthy Rich6.3Eat the Rich: The GameStop Saga7.0Pepsi, Where's My Jet?7.0Waco: American Apocalypse8.1Dirty Money7.5American Manhunt: The Boston Marathon Bombing6.8Murdaugh Murders: A Southern Scandal7.4Trainwreck: Woodstock '997.4FIFA Uncovered7.0Get Gotti6.3Trust No One: The Hunt for the Crypto KingStorylineEditDid you knowEditTriviaThe French aristocrat Thierry Magon de La Villehuchet committed suicide after losing an estimated $1.4 billion of his and other aristocrat's family fortunes in Madoff's scheme. This was the second time the very wealthy "famille Magon" lost a large part of its fortune. In July 1794, banker Jean-Baptiste Magon de La Balue and 18 other members of the family were guillotined in Paris and a large part of their castles and fortunes confiscated. This happened one year after the decapitation of King Louis 16 and his wife Marie-Antoinette, and ironically, only 9 days before the decapitation of the revolutionary leader Maximilien de Robespierre.ConnectionsFeatured in Jeremy Vine: Episode #6.5 (2023)User reviews31ReviewReviewFeatured review6/10Informative but repetitive and leaving huge question marksFor anybody who just heard faint echoes about the "Madoff monster" this documentary provides a good insight into his crimes. However, starting with weak points first, it suffers from a severe case of Netflexite, aka that annoying device of starting somewhere mid-point of a story and working its way back and forward quite randomly.Second, we never get any info about what happened to the Security & Exchange Commission's inspectors who "failed" to inspect or to the SEC at large, the "commission" that showed only gross incompetence or even collusion with the monster.The story starts with Madoff arrest on 11 Dec. 2008, the terrible year of the great crisis and works its way back to Madoff's birth, youth, marriage to Ruth and origins of his firm, with excruciatingly boring details. 
There are many interviews with employees from the legit Madoff operations and none from the illegal - no surprise there.There are also many re-enactment, of which at least half is superfluous. In short, Madoff run his illegal scheme from the 17th floor of the Lipstick building and the legal from the 19th. Employees from the 19th were forbidden entrance to the 17th and even Mark and Andrew, Bernie's son were not allowed. But even if this would be considered at least bizarre and worth exploring, nobody did anything for decades.Some external connections started sniffing around Madoff as far back as 1992, when his name was mentioned to the SEC. Since then, there were six investigations on Madoff, all botched.At this stage I would have like to watch a documentary about SEC and what happened to those "inspectors" - I guess nothing, but still...Four episodes about this complex yet easy fraud are long to digest and the tragic ending comes none too soon, with Mark committing suicide, Andrew dying of cancer, Ruth being destitute and Bernie dying in the slammer, also none too soon. A tighter editing and less flourishing would have helped.dierregiFeb 22, 2023PermalinkTop picksSign in to rate and Watchlist for personalized recommendationsSign inFAQ14How many seasons does Madoff: The Monster of Wall Street have?Powered by AlexaDetailsEditRelease dateJanuary 4, 2023 (United States)Country of originUnited StatesOfficial siteNetflix SiteLanguageEnglishAlso known asМЕЙДОФФ: Монстр із Волл-стрітProduction companiesRadicalMediaThird Eye Motion Picture CompanySee more company credits at IMDbProTech specsEditRuntime1 hourColorColorSound mixDolby DigitalAspect ratio16:9 HDRelated newsContribute to this pageSuggest an edit or add missing contentIMDb Answers: Help fill gaps in our dataLearn more about contributingEdit pageAdd episode More to exploreRecently viewedYou have no recently viewed pages Get the IMDb appSign in for more accessSign in for more accessFollow IMDb on socialGet the IMDb appFor Android and iOSHelpSite IndexIMDbProBox Office MojoLicense IMDb DataPress RoomAdvertisingJobsConditions of UsePrivacy PolicyYour Ads Privacy ChoicesIMDb, an Amazon company© 1990-2025 by IMDb.com, Inc.Back to top  
    ...
    
    1203:  INFO     LiteLLM:utils.py:952 Wrapper: Completed Call, calling success_handler
    1204:  INFO     httpx:_client.py:1038 HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
    1205:  INFO     httpx:_client.py:1038 HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
    1206:  INFO     httpx:_client.py:1038 HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
    1207:  INFO     httpx:_client.py:1038 HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
    1208:  =============================== warnings summary ===============================
    1209:  ../../../.cache/pypoetry/virtualenvs/graphrag-sdk-7G9skz4b-py3.13/lib/python3.13/site-packages/pydantic/_internal/_config.py:295
    1210:  ../../../.cache/pypoetry/virtualenvs/graphrag-sdk-7G9skz4b-py3.13/lib/python3.13/site-packages/pydantic/_internal/_config.py:295
    1211:  /home/runner/.cache/pypoetry/virtualenvs/graphrag-sdk-7G9skz4b-py3.13/lib/python3.13/site-packages/pydantic/_internal/_config.py:295: PydanticDeprecatedSince20: Support for class-based `config` is deprecated, use ConfigDict instead. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.10/migration/
    1212:  warnings.warn(DEPRECATION_MESSAGE, DeprecationWarning)
    1213:  ../../../.cache/pypoetry/virtualenvs/graphrag-sdk-7G9skz4b-py3.13/lib/python3.13/site-packages/pydantic/_internal/_config.py:345
    1214:  /home/runner/.cache/pypoetry/virtualenvs/graphrag-sdk-7G9skz4b-py3.13/lib/python3.13/site-packages/pydantic/_internal/_config.py:345: UserWarning: Valid config keys have changed in V2:
    1215:  * 'fields' has been removed
    1216:  warnings.warn(message, UserWarning)
    1217:  -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
    1218:  =========================== short test summary info ============================
    1219:  FAILED tests/test_rag.py::TestKGLiteLLM::test_llm - assert np.float64(0.16666666666666666) >= 0.5
    1220:  +  where np.float64(0.16666666666666666) = <function mean at 0x7f7e03379870>([1.0, 0.0, 0.0, 0.0, 0.0, 0.0])
    1221:  +    where <function mean at 0x7f7e03379870> = np.mean
    1222:  ============= 1 failed, 19 passed, 3 warnings in 149.31s (0:02:29) =============
    1223:  ##[error]Process completed with exit code 1.
    ...
    
    1232:  [command]/usr/bin/git config --local --name-only --get-regexp core\.sshCommand
    1233:  [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'core\.sshCommand' && git config --local --unset-all 'core.sshCommand' || :"
    1234:  [command]/usr/bin/git config --local --name-only --get-regexp http\.https\:\/\/github\.com\/\.extraheader
    1235:  http.https://github.com/.extraheader
    1236:  [command]/usr/bin/git config --local --unset-all http.https://github.com/.extraheader
    1237:  [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'http\.https\:\/\/github\.com\/\.extraheader' && git config --local --unset-all 'http.https://github.com/.extraheader' || :"
    1238:  Print service container logs: 754fb1a6762045ec86a30d54f2ebd2bd_falkordbfalkordblatest_e92070
    1239:  ##[command]/usr/bin/docker logs --details 14b268800f923a65e2d78da18c4d48e1a46032ddf0a5409f1263100b8d09a3fb
    1240:  11:C 04 Feb 2025 14:42:51.879 # WARNING Memory overcommit must be enabled! Without it, a background save or replication may fail under low memory condition. Being disabled, it can also cause failures without low memory condition, see https://github.com/jemalloc/jemalloc/issues/1328. To fix this issue add 'vm.overcommit_memory = 1' to /etc/sysctl.conf and then reboot or run the command 'sysctl vm.overcommit_memory=1' for this to take effect.
    

@coderabbitai coderabbitai bot left a comment

    Actionable comments posted: 0

    ♻️ Duplicate comments (1)
    .github/workflows/test.yml (1)

    28-30: ⚠️ Potential issue

    Incorrect model name and missing matrix configuration options.

    The model name "gpt-4o" appears to be incorrect. Additionally, the matrix configuration could be enhanced for better test reliability.

    Apply these changes to fix the model name and improve the matrix configuration:

         strategy:
    +      fail-fast: false  # Continue running other models if one fails
           matrix:
    -        model: [gemini/gemini-1.5-flash-001, openai/gpt-4o] # List of models
    +        model: [gemini/gemini-1.5-flash-001, openai/gpt-4] # List of models

    Let's verify the model names in the codebase:

    #!/bin/bash
    # Search for model name references in test files and configuration
    rg -l "gemini-1.5-flash-001|gpt-4" tests/
    🧹 Nitpick comments (3)
    .github/workflows/test.yml (3)

    27-27: Remove trailing whitespace.

    There is unnecessary trailing whitespace on this line.

    -    
    +
    🧰 Tools
    🪛 YAMLlint (1.35.1)

    [error] 27-27: trailing spaces

    (trailing-spaces)


    80-80: Consider documenting the TEST_MODEL environment variable.

    The TEST_MODEL environment variable is crucial for the matrix testing strategy but lacks documentation about its purpose and expected values.

    -          TEST_MODEL: ${{ matrix.model }} # Pass the model as an environment variable
    +          TEST_MODEL: ${{ matrix.model }} # Model identifier for matrix testing (format: provider/model-name)

    Line range hint 28-80: Consider potential impacts of matrix testing on services.

    The matrix strategy will run tests for each model in parallel, which means multiple instances of tests will interact with FalkorDB and Ollama services simultaneously. This could lead to:

    1. Increased resource usage on services
    2. Potential race conditions if tests modify shared data
    3. Longer overall execution time if services become bottlenecks

    Consider:

• Adding resource limits to services (see the YAML sketch after this list)
    • Implementing test isolation mechanisms
    • Monitoring service performance during matrix runs
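
As a rough sketch of the first suggestion: service containers accept Docker create options, so per-job resource caps can be declared directly in the workflow. The option values below are illustrative, not tuned for this project.

    services:
      falkordb:
        image: falkordb/falkordb:latest
        ports:
          - 6379:6379
        # Passed through to `docker create`; caps memory and CPU so the
        # service cannot exhaust the runner while a matrix job is running.
        options: >-
          --memory=2g
          --cpus=1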
    🧰 Tools
    🪛 YAMLlint (1.35.1)

    [error] 27-27: trailing spaces

    (trailing-spaces)

    📜 Review details

    Configuration used: CodeRabbit UI
    Review profile: CHILL
    Plan: Pro

    📥 Commits

    Reviewing files that changed from the base of the PR and between ee5244a and 5bf6681.

    📒 Files selected for processing (1)
    • .github/workflows/test.yml (2 hunks)
    🧰 Additional context used
    🪛 YAMLlint (1.35.1)
    .github/workflows/test.yml

    [error] 27-27: trailing spaces

    (trailing-spaces)

    ⏰ Context from checks skipped due to timeout of 90000ms (1)
    • GitHub Check: test (openai/gpt-4o)

@coderabbitai coderabbitai bot left a comment

    Actionable comments posted: 1

    ♻️ Duplicate comments (3)
    tests/test_rag.py (3)

    32-84: 🛠️ Refactor suggestion

    Refactor duplicate ontology definition.

    The ontology structure is duplicated between the global scope and the TestKGLiteLLM class. Consider extracting this into a shared fixture or helper method to improve maintainability.


    101-108: ⚠️ Potential issue

    Correct grammatical error in expected outputs.

    The phrase "Over than 10 actors acted in a movie." should be "More than 10 actors acted in a movie." This correction improves clarity and grammatical accuracy.


    89-108: 🛠️ Refactor suggestion

    Consider data-driven test assertions based on source content.

    After reviewing the test file and source data (madoff.txt), I can confirm the original suggestions and provide specific recommendations:

    1. The hardcoded expected output "Over than 10 actors acted in a movie" should be derived from the actual data.
2. The test should validate the response structure before assertions (see the sketch after this list).
    3. The metrics being used are good for measuring response quality, but there's no validation of error cases.
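
A minimal sketch of point 2, assuming the chat session returns a dict-like object with a "response" field (adjust to the actual return type of send_message); the helper name is hypothetical:

    def assert_valid_response(response: dict) -> str:
        # Validate the shape of the answer before scoring it, so metric
        # failures are not masked by malformed or empty responses.
        assert isinstance(response, dict), f"Unexpected response type: {type(response)!r}"
        assert "response" in response, "Missing 'response' key in chat session output"
        answer = response["response"]
        assert isinstance(answer, str) and answer.strip(), "Empty answer returned"
        return answer

The returned answer can then be passed to the metrics exactly as the test does today.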
    📜 Review details

    Configuration used: CodeRabbit UI
    Review profile: CHILL
    Plan: Pro

    📥 Commits

    Reviewing files that changed from the base of the PR and between 5bf6681 and 4f599af.

    📒 Files selected for processing (3)
    • .github/workflows/test.yml (2 hunks)
    • graphrag_sdk/test_metrics.py (1 hunks)
    • tests/test_rag.py (1 hunks)
    🧰 Additional context used
    🪛 Ruff (0.8.2)
    graphrag_sdk/test_metrics.py

    2-2: deepeval.metrics.ContextualRelevancyMetric imported but unused

    Remove unused import: deepeval.metrics.ContextualRelevancyMetric

    (F401)


    4-4: Redefinition of unused Optional from line 1

    Remove definition: Optional

    (F811)


    6-6: deepeval.utils.get_or_create_event_loop imported but unused

    Remove unused import: deepeval.utils.get_or_create_event_loop

    (F401)


    14-14: Redefinition of unused LLMTestCase from line 3

    Remove definition: LLMTestCase

    (F811)


    19-19: from deepeval.metrics.contextual_relevancy.schema import * used; unable to detect undefined names

    (F403)


    20-20: Redefinition of unused BaseMetric from line 2

    Remove definition: BaseMetric

    (F811)


    22-22: deepeval.metrics.contextual_recall.template.ContextualRecallTemplate imported but unused

    Remove unused import: deepeval.metrics.contextual_recall.template.ContextualRecallTemplate

    (F401)


    24-24: from deepeval.metrics.contextual_recall.schema import * used; unable to detect undefined names

    (F403)


    164-164: ContextualRecallVerdict may be undefined, or defined from star imports

    (F405)


    208-208: Reason may be undefined, or defined from star imports

    (F405)


    208-208: Reason may be undefined, or defined from star imports

    (F405)


    230-230: ContextualRecallVerdict may be undefined, or defined from star imports

    (F405)


    239-239: ContextualRecallVerdict may be undefined, or defined from star imports

    (F405)


    244-244: Verdicts may be undefined, or defined from star imports

    (F405)


    244-244: Verdicts may be undefined, or defined from star imports

    (F405)


    245-245: Verdicts may be undefined, or defined from star imports

    (F405)


    251-251: ContextualRecallVerdict may be undefined, or defined from star imports

    (F405)


    261-261: Do not use bare except

    (E722)


    366-366: ContextualRelevancyVerdicts may be undefined, or defined from star imports

    (F405)


    409-409: Reason may be undefined, or defined from star imports

    (F405)


    409-409: Reason may be undefined, or defined from star imports

    (F405)


    433-433: ContextualRelevancyVerdicts may be undefined, or defined from star imports

    (F405)


    441-441: ContextualRelevancyVerdicts may be undefined, or defined from star imports

    (F405)


    445-445: ContextualRelevancyVerdicts may be undefined, or defined from star imports

    (F405)


    451-451: ContextualRelevancyVerdicts may be undefined, or defined from star imports

    (F405)


    459-459: Do not use bare except

    (E722)

    🪛 YAMLlint (1.35.1)
    .github/workflows/test.yml

    [error] 27-27: trailing spaces

    (trailing-spaces)

    ⏰ Context from checks skipped due to timeout of 90000ms (2)
    • GitHub Check: test (openai/gpt-4o)
    • GitHub Check: test (gemini/gemini-1.5-flash-001)
    🔇 Additional comments (2)
    .github/workflows/test.yml (1)

    28-30: Update the model name in the matrix configuration.

    The model name gpt-4o appears to be incorrect. Please verify and update the model name in the matrix configuration.

    Run the following script to verify the model names:

    #!/bin/bash
    # Description: Verify the model names in the test files.
    
    # Test: Search for model name usage in test files.
    rg -B2 -A2 "gemini-1.5-flash-001|gpt-4o" tests/
    graphrag_sdk/test_metrics.py (1)

    1-24: 🛠️ Refactor suggestion

    Clean up imports to improve code quality.

    There are several issues with the imports:

    1. Unused imports: ContextualRelevancyMetric, get_or_create_event_loop, ContextualRecallTemplate
    2. Redefined imports: Optional, LLMTestCase, BaseMetric
    3. Star imports that make it difficult to track dependencies

    Apply this diff to clean up the imports:

     from typing import Optional
    -from deepeval.metrics import BaseMetric, AnswerRelevancyMetric, FaithfulnessMetric, ContextualRelevancyMetric
    +from deepeval.metrics import BaseMetric, AnswerRelevancyMetric, FaithfulnessMetric
     from deepeval.test_case import LLMTestCase
    -from typing import Optional, List, Union
    +from typing import List, Union
    
    -from deepeval.utils import get_or_create_event_loop, prettify_list
    +from deepeval.utils import prettify_list
     from deepeval.metrics.utils import (
         construct_verbose_logs,
         trimAndLoadJson,
         check_llm_test_case_params,
         initialize_model,
     )
    -from deepeval.test_case import (
    -    LLMTestCase,
    -    LLMTestCaseParams,
    -    ConversationalTestCase,
    -)
    +from deepeval.test_case import LLMTestCaseParams, ConversationalTestCase
    
    -from deepeval.metrics.contextual_relevancy.schema import *
    -from deepeval.metrics import BaseMetric
    +from deepeval.metrics.contextual_relevancy.schema import (
    +    ContextualRecallVerdict,
    +    Verdicts,
    +    Reason,
    +    ContextualRelevancyVerdicts
    +)
     from deepeval.models import DeepEvalBaseLLM
    -from deepeval.metrics.contextual_recall.template import ContextualRecallTemplate
     from deepeval.metrics.indicator import metric_progress_indicator
    -from deepeval.metrics.contextual_recall.schema import *

    Likely invalid or redundant comment.

    🧰 Tools
    🪛 Ruff (0.8.2)

    2-2: deepeval.metrics.ContextualRelevancyMetric imported but unused

    Remove unused import: deepeval.metrics.ContextualRelevancyMetric

    (F401)


    4-4: Redefinition of unused Optional from line 1

    Remove definition: Optional

    (F811)


    6-6: deepeval.utils.get_or_create_event_loop imported but unused

    Remove unused import: deepeval.utils.get_or_create_event_loop

    (F401)


    14-14: Redefinition of unused LLMTestCase from line 3

    Remove definition: LLMTestCase

    (F811)


    19-19: from deepeval.metrics.contextual_relevancy.schema import * used; unable to detect undefined names

    (F403)


    20-20: Redefinition of unused BaseMetric from line 2

    Remove definition: BaseMetric

    (F811)


    22-22: deepeval.metrics.contextual_recall.template.ContextualRecallTemplate imported but unused

    Remove unused import: deepeval.metrics.contextual_recall.template.ContextualRecallTemplate

    (F401)


    24-24: from deepeval.metrics.contextual_recall.schema import * used; unable to detect undefined names

    (F403)

@coderabbitai coderabbitai bot left a comment

    Actionable comments posted: 1

    🧹 Nitpick comments (5)
    tests/test_rag.py (5)

    1-14: Organize imports for better readability.

    Consider organizing imports into groups: standard library, third-party packages, and local imports.

     import os
     import logging
     import unittest
     import numpy as np
    +
     from dotenv import load_dotenv
    -from graphrag_sdk.entity import Entity
    -from graphrag_sdk.source import Source
     from deepeval.test_case import LLMTestCase
    +
    +from graphrag_sdk.entity import Entity
    +from graphrag_sdk.source import Source
     from graphrag_sdk.relation import Relation
     from graphrag_sdk.ontology import Ontology
     from graphrag_sdk.models.litellm import LiteModel
     from graphrag_sdk.attribute import Attribute, AttributeType
     from graphrag_sdk import KnowledgeGraph, KnowledgeGraphModelConfig

    16-21: Consider adjusting logging configuration for tests.

    DEBUG level logging might be too verbose for tests. Consider using INFO level by default and allowing override through environment variables.

    -logging.basicConfig(level=logging.DEBUG)
    +log_level = os.getenv('TEST_LOG_LEVEL', 'INFO')
    +logging.basicConfig(level=getattr(logging, log_level))

    32-74: Consider moving ontology definition to a configuration file.

    The ontology definition is quite extensive and might be reused across different test files. Consider moving it to a separate configuration file or fixture.

    Example structure:

    # tests/fixtures/ontologies.py
    def create_movie_ontology():
        ontology = Ontology([], [])
        ontology.add_entity(
            Entity(
                label="Actor",
                attributes=[
                    Attribute(
                        name="name",
                        attr_type=AttributeType.STRING,
                        unique=True,
                        required=True,
                    ),
                ],
            )
        )
        # ... rest of the ontology definition
        return ontology

    87-88: Use path resolution for test data files.

    Hardcoded file paths might cause issues in different environments. Use os.path.join() with a base test data directory.

    -        file_path = "tests/data/madoff.txt"
    +        test_data_dir = os.path.join(os.path.dirname(__file__), "data")
    +        file_path = os.path.join(test_data_dir, "madoff.txt")

    110-111: Make metric threshold configurable.

    The combined metrics threshold and assertion threshold are hardcoded. Consider making these configurable through class attributes or environment variables.

    +    DEFAULT_METRIC_THRESHOLD = 0.5
    +    
         def test_llm(self):
    -        answer_combined_metric = CombineMetrics(threshold=0.5)
    +        threshold = float(os.getenv('TEST_METRIC_THRESHOLD', self.DEFAULT_METRIC_THRESHOLD))
    +        answer_combined_metric = CombineMetrics(threshold=threshold)
             scores = []
             # ... test implementation ...
    -        assert np.mean(scores) >= 0.5
    +        assert np.mean(scores) >= threshold

    Also applies to: 130-130

    📜 Review details

    Configuration used: CodeRabbit UI
    Review profile: CHILL
    Plan: Pro

    📥 Commits

    Reviewing files that changed from the base of the PR and between 4f599af and 478e673.

    📒 Files selected for processing (1)
    • tests/test_rag.py (1 hunks)
    ⏰ Context from checks skipped due to timeout of 90000ms (2)
    • GitHub Check: test (openai/gpt-4o)
    • GitHub Check: test (gemini/gemini-1.5-flash-001)
    🔇 Additional comments (2)
    tests/test_rag.py (2)

    101-108: Correct grammatical error in expected output.

    The phrase "Over than 10 actors" should be "More than 10 actors" for grammatical correctness.


    30-30: Verify the default model configuration.

    The default model "gemini/gemini-2.0-flash-exp" includes "exp" which might indicate experimental features. Consider using a stable version for testing.

    ✅ Verification successful

    Default model configuration is intentionally set

    The model "gemini/gemini-2.0-flash-exp" is documented as the recommended model for Google's API in the project's README. The test configuration allows override via TEST_MODEL environment variable, providing flexibility for testing with different models if needed.

    🏁 Scripts executed

    The following scripts were executed for the analysis:

    Script:

    #!/bin/bash
    # Check if there are any stability issues reported for the experimental model
    rg -l "gemini-2\.0-flash-exp.*stable" .
    

    Length of output: 41


    Script:

    #!/bin/bash
    # Search for the model name pattern and related configurations
    rg "gemini.*flash" -B 2 -A 2
    
    # Look for model configuration documentation
    fd -e md -e rst -e txt | xargs rg -l "model.*configuration|model.*setup"
    
    # Find similar test configurations
    fd "test.*\.py" | xargs rg "TEST_MODEL|model_name.*=.*getenv"
    

    Length of output: 3295

@coderabbitai coderabbitai bot left a comment

    Actionable comments posted: 2

    🧹 Nitpick comments (5)
    tests/test_rag.py (1)

    113-130: Test logic is succinct but consider data volatility.

    The test checks each prompt’s response against an expected string and computes aggregated scores. If the IMDB content changes, these checks may fail unexpectedly. Consider mocking or freezing external data for more predictable tests.
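
One way to freeze the data, assuming Source accepts a local file path the way the madoff.txt source does elsewhere in the suite; the snapshot file name is hypothetical:

    import os
    from graphrag_sdk.source import Source

    # Build the graph from a checked-in snapshot of the IMDB page rather
    # than the live URL, so the expected answers stay stable over time.
    snapshot = os.path.join(os.path.dirname(__file__), "data", "imdb_tt23732458.html")
    sources = [Source(snapshot)]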

    graphrag_sdk/models/litellm.py (1)

    77-77: Remove unused exception variable.

    You’re catching the exception but not using its variable. This can be simplified:

    -        except Exception as e:
    +        except Exception:
    🧰 Tools
    🪛 Ruff (0.8.2)

    77-77: Local variable e is assigned to but never used

    Remove assignment to unused variable e

    (F841)

    graphrag_sdk/test_metrics.py (3)

    33-130: Add type hints and improve error handling in CombineMetrics class.

    The class implementation is good but could benefit from:

    1. Type hints for class attributes
    2. Better error handling in the measure method
    3. Documentation for the class and its methods

    Apply this diff to improve the implementation:

     class CombineMetrics(BaseMetric):
    +    """A metric that combines multiple graph context metrics to provide a composite score."""
    +
         def __init__(
             self,
             threshold: float = 0.5,
             evaluation_model: Optional[str] = "gpt-4o",
             include_reason: bool = True,
             async_mode: bool = True,
             strict_mode: bool = False,
         ):
    +        self.score: float = 0.0
    +        self.reason: Optional[str] = None
    +        self.success: bool = False
    +        self.error: Optional[str] = None
             self.threshold = 1 if strict_mode else threshold
             self.evaluation_model = evaluation_model
             self.include_reason = include_reason
             self.async_mode = async_mode
             self.strict_mode = strict_mode
    
    -    def measure(self, test_case: LLMTestCase):
    +    def measure(self, test_case: LLMTestCase) -> float:
             try:
                 graph_context_recall_metric, graph_context_relevancy_metric = self.initialize_metrics()
    -            # Remember, deepeval's default metrics follow the same pattern as your custom metric!
    -            # relevancy_metric.measure(test_case)
                 graph_context_relevancy_metric.measure(test_case)
                 graph_context_recall_metric.measure(test_case)
                 
    -            # Custom logic to set score, reason, and success
                 self.set_score_reason_success(graph_context_recall_metric, graph_context_relevancy_metric)
                 return self.score
             except Exception as e:
    -            # Set and re-raise error
                 self.error = str(e)
    +            self.success = False
    +            self.score = 0.0
                 raise

    261-333: Improve organization of prompt templates.

    The prompt templates are currently defined as long f-strings within the methods. Consider:

    1. Moving the templates to a separate configuration file
    2. Using a template engine for better maintainability

    Consider using a template engine like Jinja2 to manage these templates. I can help implement this if you'd like.
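
A minimal sketch of the Jinja2 approach, with a hypothetical template directory and file name:

    from jinja2 import Environment, FileSystemLoader

    # Prompts live as .j2 files under a templates/ directory instead of
    # long f-strings inside the metric classes.
    _env = Environment(loader=FileSystemLoader("graphrag_sdk/fixtures/templates"))

    def render_recall_verdicts_prompt(expected_output, retrieval_context, cypher_query=None):
        template = _env.get_template("contextual_recall_verdicts.j2")
        return template.render(
            expected_output=expected_output,
            retrieval_context=retrieval_context,
            cypher_query=cypher_query,
        )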


    462-538: Consider consolidating template classes.

    The GraphContextualRelevancyTemplate class is very similar to GraphContextualRecallTemplate. Consider:

    1. Creating a base template class
    2. Moving templates to a configuration file
    3. Using a template engine for better maintainability

    I can help implement a consolidated template system using a template engine if you'd like.
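
A sketch of the base-class idea; the shared helpers below are guesses at what the recall and relevancy templates have in common, not the project's actual prompt text:

    class BaseGraphContextualTemplate:
        """Shared scaffolding for the graph contextual prompt templates."""

        JSON_ONLY = "Return only a JSON object, with no extra commentary."

        @staticmethod
        def cypher_section(cypher_query):
            # Both templates optionally embed the Cypher query used for retrieval.
            return f"Cypher query used for retrieval:\n{cypher_query}\n" if cypher_query else ""


    class GraphContextualRecallTemplate(BaseGraphContextualTemplate):
        @classmethod
        def generate_verdicts(cls, expected_output, retrieval_context, cypher_query=None):
            return (
                f"{cls.JSON_ONLY}\n"
                f"{cls.cypher_section(cypher_query)}"
                f"Expected output:\n{expected_output}\n\n"
                f"Retrieval context:\n{retrieval_context}\n"
            )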

    📜 Review details

    Configuration used: CodeRabbit UI
    Review profile: CHILL
    Plan: Pro

    📥 Commits

    Reviewing files that changed from the base of the PR and between 478e673 and 24e9211.

    📒 Files selected for processing (4)
    • graphrag_sdk/fixtures/prompts.py (3 hunks)
    • graphrag_sdk/models/litellm.py (1 hunks)
    • graphrag_sdk/test_metrics.py (1 hunks)
    • tests/test_rag.py (1 hunks)
    🧰 Additional context used
    🪛 Ruff (0.8.2)
    graphrag_sdk/models/litellm.py

    77-77: Local variable e is assigned to but never used

    Remove assignment to unused variable e

    (F841)

    graphrag_sdk/test_metrics.py

    2-2: deepeval.metrics.AnswerRelevancyMetric imported but unused

    Remove unused import

    (F401)


    2-2: deepeval.metrics.FaithfulnessMetric imported but unused

    Remove unused import

    (F401)


    2-2: deepeval.metrics.ContextualRelevancyMetric imported but unused

    Remove unused import

    (F401)


    4-4: Redefinition of unused Optional from line 1

    Remove definition: Optional

    (F811)


    6-6: deepeval.utils.get_or_create_event_loop imported but unused

    Remove unused import: deepeval.utils.get_or_create_event_loop

    (F401)


    14-14: Redefinition of unused LLMTestCase from line 3

    Remove definition: LLMTestCase

    (F811)


    19-19: from deepeval.metrics.contextual_relevancy.schema import * used; unable to detect undefined names

    (F403)


    20-20: Redefinition of unused BaseMetric from line 2

    Remove definition: BaseMetric

    (F811)


    22-22: deepeval.metrics.contextual_recall.template.ContextualRecallTemplate imported but unused

    Remove unused import: deepeval.metrics.contextual_recall.template.ContextualRecallTemplate

    (F401)


    24-24: from deepeval.metrics.contextual_recall.schema import * used; unable to detect undefined names

    (F403)


    156-156: ContextualRecallVerdict may be undefined, or defined from star imports

    (F405)


    200-200: Reason may be undefined, or defined from star imports

    (F405)


    200-200: Reason may be undefined, or defined from star imports

    (F405)


    222-222: ContextualRecallVerdict may be undefined, or defined from star imports

    (F405)


    231-231: ContextualRecallVerdict may be undefined, or defined from star imports

    (F405)


    236-236: Verdicts may be undefined, or defined from star imports

    (F405)


    236-236: Verdicts may be undefined, or defined from star imports

    (F405)


    237-237: Verdicts may be undefined, or defined from star imports

    (F405)


    243-243: ContextualRecallVerdict may be undefined, or defined from star imports

    (F405)


    253-253: Do not use bare except

    (E722)


    361-361: ContextualRelevancyVerdicts may be undefined, or defined from star imports

    (F405)


    404-404: Reason may be undefined, or defined from star imports

    (F405)


    404-404: Reason may be undefined, or defined from star imports

    (F405)


    428-428: ContextualRelevancyVerdicts may be undefined, or defined from star imports

    (F405)


    436-436: ContextualRelevancyVerdicts may be undefined, or defined from star imports

    (F405)


    440-440: ContextualRelevancyVerdicts may be undefined, or defined from star imports

    (F405)


    446-446: ContextualRelevancyVerdicts may be undefined, or defined from star imports

    (F405)


    454-454: Do not use bare except

    (E722)

    🔇 Additional comments (6)
    graphrag_sdk/fixtures/prompts.py (1)

    1-618: No changes detected in this file.

    There are no lines marked with ~, indicating no modifications were introduced here.

    tests/test_rag.py (4)

    1-26: Imports and class definition look good.

    The environment setup and logging configuration are appropriate, and the class docstring is clear. No issues found here.


    28-86: Ontology and setup logic appear solid.

    Defining the Actor and Movie entities along with the ACTED_IN relation is concise and correct. Ensuring the environment variable for the model name is read is a good practice.


    92-100: Input queries are reasonable.

    Providing multiple queries (e.g., “Which actors […]?”) is a good way to test various aspects of retrieval. The approach of referencing an external URL for real data may be somewhat volatile over time if the IMDB page changes, but it's acceptable for integration tests.
    [approve]


    102-109: Correct the grammatical error in “Over than 10 actors”.

    This exact issue was previously flagged; still valid.

    - "Over than 10 actors acted in a movie.",
    + "More than 10 actors acted in a movie.",
    graphrag_sdk/test_metrics.py (1)

    1-24: 🛠️ Refactor suggestion

    Clean up imports to remove unused and redefined imports.

    Several imports are either unused or redefined, which can lead to confusion and potential conflicts.

    Apply this diff to clean up the imports:

     from typing import Optional
    -from deepeval.metrics import BaseMetric, AnswerRelevancyMetric, FaithfulnessMetric, ContextualRelevancyMetric
    +from deepeval.metrics import BaseMetric
     from deepeval.test_case import LLMTestCase
    -from typing import Optional, List, Union
    +from typing import List, Union
    
    -from deepeval.utils import get_or_create_event_loop, prettify_list
    +from deepeval.utils import prettify_list
     from deepeval.metrics.utils import (
         construct_verbose_logs,
         trimAndLoadJson,
         check_llm_test_case_params,
         initialize_model,
     )
    -from deepeval.test_case import (
    -    LLMTestCase,
    -    LLMTestCaseParams,
    -    ConversationalTestCase,
    -)
    +from deepeval.test_case import LLMTestCaseParams, ConversationalTestCase
    
    -from deepeval.metrics.contextual_relevancy.schema import *
    -from deepeval.metrics import BaseMetric
    +from deepeval.metrics.contextual_relevancy.schema import (
    +    ContextualRecallVerdict,
    +    Verdicts,
    +    Reason,
    +    ContextualRelevancyVerdicts
    +)
     from deepeval.models import DeepEvalBaseLLM
    -from deepeval.metrics.contextual_recall.template import ContextualRecallTemplate
     from deepeval.metrics.indicator import metric_progress_indicator
    -from deepeval.metrics.contextual_recall.schema import *

    Likely invalid or redundant comment.

    🧰 Tools
    🪛 Ruff (0.8.2)

    2-2: deepeval.metrics.AnswerRelevancyMetric imported but unused

    Remove unused import

    (F401)


    2-2: deepeval.metrics.FaithfulnessMetric imported but unused

    Remove unused import

    (F401)


    2-2: deepeval.metrics.ContextualRelevancyMetric imported but unused

    Remove unused import

    (F401)


    4-4: Redefinition of unused Optional from line 1

    Remove definition: Optional

    (F811)


    6-6: deepeval.utils.get_or_create_event_loop imported but unused

    Remove unused import: deepeval.utils.get_or_create_event_loop

    (F401)


    14-14: Redefinition of unused LLMTestCase from line 3

    Remove definition: LLMTestCase

    (F811)


    19-19: from deepeval.metrics.contextual_relevancy.schema import * used; unable to detect undefined names

    (F403)


    20-20: Redefinition of unused BaseMetric from line 2

    Remove definition: BaseMetric

    (F811)


    22-22: deepeval.metrics.contextual_recall.template.ContextualRecallTemplate imported but unused

    Remove unused import: deepeval.metrics.contextual_recall.template.ContextualRecallTemplate

    (F401)


    24-24: from deepeval.metrics.contextual_recall.schema import * used; unable to detect undefined names

    (F403)

    Comment on lines +131 to +260
    self.score = self._calculate_score()
    self.reason = self._generate_reason(test_case.expected_output)
    self.success = self.score >= self.threshold
    self.verbose_logs = construct_verbose_logs(
    self,
    steps=[
    f"Verdicts:\n{prettify_list(self.verdicts)}",
    f"Score: {self.score}\nReason: {self.reason}",
    ],
    )

    return self.score

    def _generate_reason(self, expected_output: str):
    if self.include_reason is False:
    return None

    supportive_reasons = []
    unsupportive_reasons = []
    for verdict in self.verdicts:
    if verdict.verdict.lower() == "yes":
    supportive_reasons.append(verdict.reason)
    else:
    unsupportive_reasons.append(verdict.reason)

    prompt = GraphContextualRecallTemplate.generate_reason(
    expected_output=expected_output,
    supportive_reasons=supportive_reasons,
    unsupportive_reasons=unsupportive_reasons,
    score=format(self.score, ".2f"),
    )

    if self.using_native_model:
    res, cost = self.model.generate(prompt)
    self.evaluation_cost += cost
    data = trimAndLoadJson(res, self)
    return data["reason"]
    else:
    try:
    res: Reason = self.model.generate(prompt, schema=Reason)
    return res.reason
    except TypeError:
    res = self.model.generate(prompt)
    data = trimAndLoadJson(res, self)
    return data["reason"]

    def _calculate_score(self):
    number_of_verdicts = len(self.verdicts)
    if number_of_verdicts == 0:
    return 0

    justified_sentences = 0
    for verdict in self.verdicts:
    if verdict.verdict.lower() == "yes":
    justified_sentences += 1

    score = justified_sentences / number_of_verdicts
    return 0 if self.strict_mode and score < self.threshold else score

    def _generate_verdicts(
    self, expected_output: str, retrieval_context: List[str], cypher_query: Optional[str] = None
    ) -> List[ContextualRecallVerdict]:
    prompt = GraphContextualRecallTemplate.generate_verdicts(
    expected_output=expected_output, retrieval_context=retrieval_context, cypher_query=cypher_query
    )
    if self.using_native_model:
    res, cost = self.model.generate(prompt)
    self.evaluation_cost += cost
    data = trimAndLoadJson(res, self)
    verdicts = [
    ContextualRecallVerdict(**item) for item in data["verdicts"]
    ]
    return verdicts
    else:
    try:
    res: Verdicts = self.model.generate(prompt, schema=Verdicts)
    verdicts: Verdicts = [item for item in res.verdicts]
    return verdicts
    except TypeError:
    res = self.model.generate(prompt)
    data = trimAndLoadJson(res, self)
    verdicts = [
    ContextualRecallVerdict(**item) for item in data["verdicts"]
    ]
    return verdicts

    def is_successful(self) -> bool:
    if self.error is not None:
    self.success = False
    else:
    try:
    self.success = self.score >= self.threshold
    except:
    self.success = False
    return self.success

    @property
    def __name__(self):
    return "Graph Contextual Recall"


    🛠️ Refactor suggestion

    Fix error handling and improve type safety in GraphContextualRecall class.

    The class has several areas for improvement:

    1. Bare except clauses should catch specific exceptions
    2. Type hints are missing for class attributes
    3. Error handling could be more robust

    Apply this diff to improve the implementation:

     class GraphContextualRecall(BaseMetric):
    +    """A metric that evaluates the recall of graph context against expected output."""
    +
         def __init__(
             self,
             threshold: float = 0.5,
             model: Optional[Union[str, DeepEvalBaseLLM]] = None,
             include_reason: bool = True,
             strict_mode: bool = False,
             verbose_mode: bool = False,
         ):
    +        self.score: float = 0.0
    +        self.reason: Optional[str] = None
    +        self.success: bool = False
    +        self.error: Optional[str] = None
    +        self.verdicts: List[ContextualRecallVerdict] = []
    +        self.evaluation_cost: Optional[float] = None
    +        self.verbose_logs: Optional[str] = None
             self.threshold = 1 if strict_mode else threshold
             self.model, self.using_native_model = initialize_model(model)
             self.evaluation_model = self.model.get_model_name()
             self.include_reason = include_reason
             self.strict_mode = strict_mode
             self.verbose_mode = verbose_mode
    
         def is_successful(self) -> bool:
             if self.error is not None:
                 self.success = False
             else:
                 try:
                     self.success = self.score >= self.threshold
    -            except:
    +            except Exception as e:
    +                self.error = str(e)
                     self.success = False
             return self.success
    📝 Committable suggestion

    ‼️ IMPORTANT
    Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

    Suggested change
    class GraphContextualRecall(BaseMetric):
    def __init__(
    self,
    threshold: float = 0.5,
    model: Optional[Union[str, DeepEvalBaseLLM]] = None,
    include_reason: bool = True,
    strict_mode: bool = False,
    verbose_mode: bool = False,
    ):
    self.threshold = 1 if strict_mode else threshold
    self.model, self.using_native_model = initialize_model(model)
    self.evaluation_model = self.model.get_model_name()
    self.include_reason = include_reason
    self.strict_mode = strict_mode
    self.verbose_mode = verbose_mode
    def measure(
    self,
    test_case: LLMTestCase,
    _show_indicator: bool = True,
    ) -> float:
    check_llm_test_case_params(test_case, required_params, self)
    self.evaluation_cost = 0 if self.using_native_model else None
    with metric_progress_indicator(self, _show_indicator=_show_indicator):
    self.verdicts: List[ContextualRecallVerdict] = (
    self._generate_verdicts(
    test_case.expected_output, test_case.retrieval_context, test_case.additional_metadata
    )
    )
    self.score = self._calculate_score()
    self.reason = self._generate_reason(test_case.expected_output)
    self.success = self.score >= self.threshold
    self.verbose_logs = construct_verbose_logs(
    self,
    steps=[
    f"Verdicts:\n{prettify_list(self.verdicts)}",
    f"Score: {self.score}\nReason: {self.reason}",
    ],
    )
    return self.score
    def _generate_reason(self, expected_output: str):
    if self.include_reason is False:
    return None
    supportive_reasons = []
    unsupportive_reasons = []
    for verdict in self.verdicts:
    if verdict.verdict.lower() == "yes":
    supportive_reasons.append(verdict.reason)
    else:
    unsupportive_reasons.append(verdict.reason)
    prompt = GraphContextualRecallTemplate.generate_reason(
    expected_output=expected_output,
    supportive_reasons=supportive_reasons,
    unsupportive_reasons=unsupportive_reasons,
    score=format(self.score, ".2f"),
    )
    if self.using_native_model:
    res, cost = self.model.generate(prompt)
    self.evaluation_cost += cost
    data = trimAndLoadJson(res, self)
    return data["reason"]
    else:
    try:
    res: Reason = self.model.generate(prompt, schema=Reason)
    return res.reason
    except TypeError:
    res = self.model.generate(prompt)
    data = trimAndLoadJson(res, self)
    return data["reason"]
    def _calculate_score(self):
    number_of_verdicts = len(self.verdicts)
    if number_of_verdicts == 0:
    return 0
    justified_sentences = 0
    for verdict in self.verdicts:
    if verdict.verdict.lower() == "yes":
    justified_sentences += 1
    score = justified_sentences / number_of_verdicts
    return 0 if self.strict_mode and score < self.threshold else score
    def _generate_verdicts(
    self, expected_output: str, retrieval_context: List[str], cypher_query: Optional[str] = None
    ) -> List[ContextualRecallVerdict]:
    prompt = GraphContextualRecallTemplate.generate_verdicts(
    expected_output=expected_output, retrieval_context=retrieval_context, cypher_query=cypher_query
    )
    if self.using_native_model:
    res, cost = self.model.generate(prompt)
    self.evaluation_cost += cost
    data = trimAndLoadJson(res, self)
    verdicts = [
    ContextualRecallVerdict(**item) for item in data["verdicts"]
    ]
    return verdicts
    else:
    try:
    res: Verdicts = self.model.generate(prompt, schema=Verdicts)
    verdicts: Verdicts = [item for item in res.verdicts]
    return verdicts
    except TypeError:
    res = self.model.generate(prompt)
    data = trimAndLoadJson(res, self)
    verdicts = [
    ContextualRecallVerdict(**item) for item in data["verdicts"]
    ]
    return verdicts
    def is_successful(self) -> bool:
    if self.error is not None:
    self.success = False
    else:
    try:
    self.success = self.score >= self.threshold
    except:
    self.success = False
    return self.success
    @property
    def __name__(self):
    return "Graph Contextual Recall"
    class GraphContextualRecall(BaseMetric):
    """A metric that evaluates the recall of graph context against expected output."""
    def __init__(
    self,
    threshold: float = 0.5,
    model: Optional[Union[str, DeepEvalBaseLLM]] = None,
    include_reason: bool = True,
    strict_mode: bool = False,
    verbose_mode: bool = False,
    ):
    self.score: float = 0.0
    self.reason: Optional[str] = None
    self.success: bool = False
    self.error: Optional[str] = None
    self.verdicts: List[ContextualRecallVerdict] = []
    self.evaluation_cost: Optional[float] = None
    self.verbose_logs: Optional[str] = None
    self.threshold = 1 if strict_mode else threshold
    self.model, self.using_native_model = initialize_model(model)
    self.evaluation_model = self.model.get_model_name()
    self.include_reason = include_reason
    self.strict_mode = strict_mode
    self.verbose_mode = verbose_mode
    def measure(
    self,
    test_case: LLMTestCase,
    _show_indicator: bool = True,
    ) -> float:
    check_llm_test_case_params(test_case, required_params, self)
    self.evaluation_cost = 0 if self.using_native_model else None
    with metric_progress_indicator(self, _show_indicator=_show_indicator):
    self.verdicts: List[ContextualRecallVerdict] = (
    self._generate_verdicts(
    test_case.expected_output, test_case.retrieval_context, test_case.additional_metadata
    )
    )
    self.score = self._calculate_score()
    self.reason = self._generate_reason(test_case.expected_output)
    self.success = self.score >= self.threshold
    self.verbose_logs = construct_verbose_logs(
    self,
    steps=[
    f"Verdicts:\n{prettify_list(self.verdicts)}",
    f"Score: {self.score}\nReason: {self.reason}",
    ],
    )
    return self.score
    def _generate_reason(self, expected_output: str):
    if self.include_reason is False:
    return None
    supportive_reasons = []
    unsupportive_reasons = []
    for verdict in self.verdicts:
    if verdict.verdict.lower() == "yes":
    supportive_reasons.append(verdict.reason)
    else:
    unsupportive_reasons.append(verdict.reason)
    prompt = GraphContextualRecallTemplate.generate_reason(
    expected_output=expected_output,
    supportive_reasons=supportive_reasons,
    unsupportive_reasons=unsupportive_reasons,
    score=format(self.score, ".2f"),
    )
    if self.using_native_model:
    res, cost = self.model.generate(prompt)
    self.evaluation_cost += cost
    data = trimAndLoadJson(res, self)
    return data["reason"]
    else:
    try:
    res: Reason = self.model.generate(prompt, schema=Reason)
    return res.reason
    except TypeError:
    res = self.model.generate(prompt)
    data = trimAndLoadJson(res, self)
    return data["reason"]
    def _calculate_score(self):
    number_of_verdicts = len(self.verdicts)
    if number_of_verdicts == 0:
    return 0
    justified_sentences = 0
    for verdict in self.verdicts:
    if verdict.verdict.lower() == "yes":
    justified_sentences += 1
    score = justified_sentences / number_of_verdicts
    return 0 if self.strict_mode and score < self.threshold else score
    def _generate_verdicts(
    self, expected_output: str, retrieval_context: List[str], cypher_query: Optional[str] = None
    ) -> List[ContextualRecallVerdict]:
    prompt = GraphContextualRecallTemplate.generate_verdicts(
    expected_output=expected_output, retrieval_context=retrieval_context, cypher_query=cypher_query
    )
    if self.using_native_model:
    res, cost = self.model.generate(prompt)
    self.evaluation_cost += cost
    data = trimAndLoadJson(res, self)
    verdicts = [
    ContextualRecallVerdict(**item) for item in data["verdicts"]
    ]
    return verdicts
    else:
    try:
    res: Verdicts = self.model.generate(prompt, schema=Verdicts)
    verdicts: Verdicts = [item for item in res.verdicts]
    return verdicts
    except TypeError:
    res = self.model.generate(prompt)
    data = trimAndLoadJson(res, self)
    verdicts = [
    ContextualRecallVerdict(**item) for item in data["verdicts"]
    ]
    return verdicts
    def is_successful(self) -> bool:
    if self.error is not None:
    self.success = False
    else:
    try:
    self.success = self.score >= self.threshold
    except Exception as e:
    self.error = str(e)
    self.success = False
    return self.success
    @property
    def __name__(self):
    return "Graph Contextual Recall"
    🧰 Tools
    🪛 Ruff (0.8.2)

    156-156: ContextualRecallVerdict may be undefined, or defined from star imports

    (F405)


    200-200: Reason may be undefined, or defined from star imports

    (F405)


    200-200: Reason may be undefined, or defined from star imports

    (F405)


    222-222: ContextualRecallVerdict may be undefined, or defined from star imports

    (F405)


    231-231: ContextualRecallVerdict may be undefined, or defined from star imports

    (F405)


    236-236: Verdicts may be undefined, or defined from star imports

    (F405)


    236-236: Verdicts may be undefined, or defined from star imports

    (F405)


    237-237: Verdicts may be undefined, or defined from star imports

    (F405)


    243-243: ContextualRecallVerdict may be undefined, or defined from star imports

    (F405)


    253-253: Do not use bare except

    (E722)

    Comment on lines +334 to +461
    (self._generate_verdicts(test_case.input, context, test_case.additional_metadata))
    for context in test_case.retrieval_context
    ]
    self.score = self._calculate_score()
    self.reason = self._generate_reason(test_case.input)
    self.success = self.score >= self.threshold
    self.verbose_logs = construct_verbose_logs(
    self,
    steps=[
    f"Verdicts:\n{prettify_list(self.verdicts_list)}",
    f"Score: {self.score}\nReason: {self.reason}",
    ],
    )

    return self.score

    def _generate_reason(self, input: str):
    if self.include_reason is False:
    return None

    irrelevancies = []
    relevant_statements = []
    for verdicts in self.verdicts_list:
    for verdict in verdicts.verdicts:
    if verdict.verdict.lower() == "no":
    irrelevancies.append(verdict.reason)
    else:
    relevant_statements.append(verdict.statement)

    prompt: dict = GraphContextualRelevancyTemplate.generate_reason(
    input=input,
    irrelevancies=irrelevancies,
    relevant_statements=relevant_statements,
    score=format(self.score, ".2f"),
    )
    if self.using_native_model:
    res, cost = self.model.generate(prompt)
    self.evaluation_cost += cost
    data = trimAndLoadJson(res, self)
    return data["reason"]
    else:
    try:
    res: Reason = self.model.generate(prompt, schema=Reason)
    return res.reason
    except TypeError:
    res = self.model.generate(prompt)
    data = trimAndLoadJson(res, self)
    return data["reason"]

    def _calculate_score(self):
    total_verdicts = 0
    relevant_statements = 0
    for verdicts in self.verdicts_list:
    for verdict in verdicts.verdicts:
    total_verdicts += 1
    if verdict.verdict.lower() == "yes":
    relevant_statements += 1

    if total_verdicts == 0:
    return 0

    score = relevant_statements / total_verdicts
    return 0 if self.strict_mode and score < self.threshold else score

    def _generate_verdicts(
    self, input: str, context: str, cypher_query: Optional[str] = None
    ) -> ContextualRelevancyVerdicts:
    prompt = GraphContextualRelevancyTemplate.generate_verdicts(
    input=input, context=context, cypher_query=cypher_query
    )
    if self.using_native_model:
    res, cost = self.model.generate(prompt)
    self.evaluation_cost += cost
    data = trimAndLoadJson(res, self)
    return ContextualRelevancyVerdicts(**data)
    else:
    try:
    res = self.model.generate(
    prompt, schema=ContextualRelevancyVerdicts
    )
    return res
    except TypeError:
    res = self.model.generate(prompt)
    data = trimAndLoadJson(res, self)
    return ContextualRelevancyVerdicts(**data)

    def is_successful(self) -> bool:
    if self.error is not None:
    self.success = False
    else:
    try:
    self.success = self.score >= self.threshold
    except:
    self.success = False
    return self.success

    @property
    def __name__(self):
    return "Graph Contextual Relevancy"


    🛠️ Refactor suggestion

    Fix error handling and improve type safety in GraphContextualRelevancy class.

    Similar to GraphContextualRecall, this class needs:

    1. Proper exception handling
    2. Type hints for class attributes
    3. Better error handling in the measure method

    Apply this diff to improve the implementation:

     class GraphContextualRelevancy(BaseMetric):
    +    """A metric that evaluates the relevancy of graph context against input."""
    +
         def __init__(
             self,
             threshold: float = 0.5,
             model: Optional[Union[str, DeepEvalBaseLLM]] = None,
             include_reason: bool = True,
             strict_mode: bool = False,
             verbose_mode: bool = False,
         ):
    +        self.score: float = 0.0
    +        self.reason: Optional[str] = None
    +        self.success: bool = False
    +        self.error: Optional[str] = None
    +        self.verdicts_list: List[ContextualRelevancyVerdicts] = []
    +        self.evaluation_cost: Optional[float] = None
    +        self.verbose_logs: Optional[str] = None
             self.threshold = 1 if strict_mode else threshold
             self.model, self.using_native_model = initialize_model(model)
             self.evaluation_model = self.model.get_model_name()
             self.include_reason = include_reason
             self.strict_mode = strict_mode
             self.verbose_mode = verbose_mode
    
         def is_successful(self) -> bool:
             if self.error is not None:
                 self.success = False
             else:
                 try:
                     self.success = self.score >= self.threshold
    -            except:
    +            except Exception as e:
    +                self.error = str(e)
                     self.success = False
             return self.success
    📝 Committable suggestion

    ‼️ IMPORTANT
    Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

    Suggested change
    class GraphContextualRelevancy(BaseMetric):
    def __init__(
    self,
    threshold: float = 0.5,
    model: Optional[Union[str, DeepEvalBaseLLM]] = None,
    include_reason: bool = True,
    strict_mode: bool = False,
    verbose_mode: bool = False,
    ):
    self.threshold = 1 if strict_mode else threshold
    self.model, self.using_native_model = initialize_model(model)
    self.evaluation_model = self.model.get_model_name()
    self.include_reason = include_reason
    self.strict_mode = strict_mode
    self.verbose_mode = verbose_mode
    def measure(
    self,
    test_case: Union[LLMTestCase, ConversationalTestCase],
    _show_indicator: bool = True,
    ) -> float:
    if isinstance(test_case, ConversationalTestCase):
    test_case = test_case.turns[0]
    check_llm_test_case_params(test_case, required_params, self)
    self.evaluation_cost = 0 if self.using_native_model else None
    with metric_progress_indicator(self, _show_indicator=_show_indicator):
    self.verdicts_list: List[ContextualRelevancyVerdicts] = [
    (self._generate_verdicts(test_case.input, context, test_case.additional_metadata))
    for context in test_case.retrieval_context
    ]
    self.score = self._calculate_score()
    self.reason = self._generate_reason(test_case.input)
    self.success = self.score >= self.threshold
    self.verbose_logs = construct_verbose_logs(
    self,
    steps=[
    f"Verdicts:\n{prettify_list(self.verdicts_list)}",
    f"Score: {self.score}\nReason: {self.reason}",
    ],
    )
    return self.score
    def _generate_reason(self, input: str):
    if self.include_reason is False:
    return None
    irrelevancies = []
    relevant_statements = []
    for verdicts in self.verdicts_list:
    for verdict in verdicts.verdicts:
    if verdict.verdict.lower() == "no":
    irrelevancies.append(verdict.reason)
    else:
    relevant_statements.append(verdict.statement)
    prompt: dict = GraphContextualRelevancyTemplate.generate_reason(
    input=input,
    irrelevancies=irrelevancies,
    relevant_statements=relevant_statements,
    score=format(self.score, ".2f"),
    )
    if self.using_native_model:
    res, cost = self.model.generate(prompt)
    self.evaluation_cost += cost
    data = trimAndLoadJson(res, self)
    return data["reason"]
    else:
    try:
    res: Reason = self.model.generate(prompt, schema=Reason)
    return res.reason
    except TypeError:
    res = self.model.generate(prompt)
    data = trimAndLoadJson(res, self)
    return data["reason"]
    def _calculate_score(self):
    total_verdicts = 0
    relevant_statements = 0
    for verdicts in self.verdicts_list:
    for verdict in verdicts.verdicts:
    total_verdicts += 1
    if verdict.verdict.lower() == "yes":
    relevant_statements += 1
    if total_verdicts == 0:
    return 0
    score = relevant_statements / total_verdicts
    return 0 if self.strict_mode and score < self.threshold else score
    def _generate_verdicts(
    self, input: str, context: str, cypher_query: Optional[str] = None
    ) -> ContextualRelevancyVerdicts:
    prompt = GraphContextualRelevancyTemplate.generate_verdicts(
    input=input, context=context, cypher_query=cypher_query
    )
    if self.using_native_model:
    res, cost = self.model.generate(prompt)
    self.evaluation_cost += cost
    data = trimAndLoadJson(res, self)
    return ContextualRelevancyVerdicts(**data)
    else:
    try:
    res = self.model.generate(
    prompt, schema=ContextualRelevancyVerdicts
    )
    return res
    except TypeError:
    res = self.model.generate(prompt)
    data = trimAndLoadJson(res, self)
    return ContextualRelevancyVerdicts(**data)
    def is_successful(self) -> bool:
    if self.error is not None:
    self.success = False
    else:
    try:
    self.success = self.score >= self.threshold
    except:
    self.success = False
    return self.success
    @property
    def __name__(self):
    return "Graph Contextual Relevancy"
    class GraphContextualRelevancy(BaseMetric):
    """A metric that evaluates the relevancy of graph context against input."""
    def __init__(
    self,
    threshold: float = 0.5,
    model: Optional[Union[str, DeepEvalBaseLLM]] = None,
    include_reason: bool = True,
    strict_mode: bool = False,
    verbose_mode: bool = False,
    ):
    self.score: float = 0.0
    self.reason: Optional[str] = None
    self.success: bool = False
    self.error: Optional[str] = None
    self.verdicts_list: List[ContextualRelevancyVerdicts] = []
    self.evaluation_cost: Optional[float] = None
    self.verbose_logs: Optional[str] = None
    self.threshold = 1 if strict_mode else threshold
    self.model, self.using_native_model = initialize_model(model)
    self.evaluation_model = self.model.get_model_name()
    self.include_reason = include_reason
    self.strict_mode = strict_mode
    self.verbose_mode = verbose_mode
    def measure(
    self,
    test_case: Union[LLMTestCase, ConversationalTestCase],
    _show_indicator: bool = True,
    ) -> float:
    if isinstance(test_case, ConversationalTestCase):
    test_case = test_case.turns[0]
    check_llm_test_case_params(test_case, required_params, self)
    self.evaluation_cost = 0 if self.using_native_model else None
    with metric_progress_indicator(self, _show_indicator=_show_indicator):
    self.verdicts_list: List[ContextualRelevancyVerdicts] = [
    (self._generate_verdicts(test_case.input, context, test_case.additional_metadata))
    for context in test_case.retrieval_context
    ]
    self.score = self._calculate_score()
    self.reason = self._generate_reason(test_case.input)
    self.success = self.score >= self.threshold
    self.verbose_logs = construct_verbose_logs(
    self,
    steps=[
    f"Verdicts:\n{prettify_list(self.verdicts_list)}",
    f"Score: {self.score}\nReason: {self.reason}",
    ],
    )
    return self.score
    def _generate_reason(self, input: str):
    if self.include_reason is False:
    return None
    irrelevancies = []
    relevant_statements = []
    for verdicts in self.verdicts_list:
    for verdict in verdicts.verdicts:
    if verdict.verdict.lower() == "no":
    irrelevancies.append(verdict.reason)
    else:
    relevant_statements.append(verdict.statement)
    prompt: dict = GraphContextualRelevancyTemplate.generate_reason(
    input=input,
    irrelevancies=irrelevancies,
    relevant_statements=relevant_statements,
    score=format(self.score, ".2f"),
    )
    if self.using_native_model:
    res, cost = self.model.generate(prompt)
    self.evaluation_cost += cost
    data = trimAndLoadJson(res, self)
    return data["reason"]
    else:
    try:
    res: Reason = self.model.generate(prompt, schema=Reason)
    return res.reason
    except TypeError:
    res = self.model.generate(prompt)
    data = trimAndLoadJson(res, self)
    return data["reason"]
    def _calculate_score(self):
    total_verdicts = 0
    relevant_statements = 0
    for verdicts in self.verdicts_list:
    for verdict in verdicts.verdicts:
    total_verdicts += 1
    if verdict.verdict.lower() == "yes":
    relevant_statements += 1
    if total_verdicts == 0:
    return 0
    score = relevant_statements / total_verdicts
    return 0 if self.strict_mode and score < self.threshold else score
    def _generate_verdicts(
    self, input: str, context: str, cypher_query: Optional[str] = None
    ) -> ContextualRelevancyVerdicts:
    prompt = GraphContextualRelevancyTemplate.generate_verdicts(
    input=input, context=context, cypher_query=cypher_query
    )
    if self.using_native_model:
    res, cost = self.model.generate(prompt)
    self.evaluation_cost += cost
    data = trimAndLoadJson(res, self)
    return ContextualRelevancyVerdicts(**data)
    else:
    try:
    res = self.model.generate(
    prompt, schema=ContextualRelevancyVerdicts
    )
    return res
    except TypeError:
    res = self.model.generate(prompt)
    data = trimAndLoadJson(res, self)
    return ContextualRelevancyVerdicts(**data)
    def is_successful(self) -> bool:
    if self.error is not None:
    self.success = False
    else:
    try:
    self.success = self.score >= self.threshold
    except Exception as e:
    self.error = str(e)
    self.success = False
    return self.success
    @property
    def __name__(self):
    return "Graph Contextual Relevancy"
    🧰 Tools
    🪛 Ruff (0.8.2)

    361-361: ContextualRelevancyVerdicts may be undefined, or defined from star imports (F405)
    404-404: Reason may be undefined, or defined from star imports (F405)
    404-404: Reason may be undefined, or defined from star imports (F405)
    428-428: ContextualRelevancyVerdicts may be undefined, or defined from star imports (F405)
    436-436: ContextualRelevancyVerdicts may be undefined, or defined from star imports (F405)
    440-440: ContextualRelevancyVerdicts may be undefined, or defined from star imports (F405)
    446-446: ContextualRelevancyVerdicts may be undefined, or defined from star imports (F405)
    454-454: Do not use bare except (E722)

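    A note on the F405 findings above: they typically mean these names reach the module through a star import. One possible fix, sketched with an assumed module path (the exact location of these schema classes inside deepeval is not confirmed here), is to import them explicitly:

    # Sketch: replace the star import with explicit imports so Ruff can resolve the names.
    # The module path below is an assumption; point it at wherever deepeval defines these classes.
    from deepeval.metrics.contextual_relevancy.schema import (
        ContextualRelevancyVerdicts,
        Reason,
    )
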
    @coderabbitai bot left a comment

    Actionable comments posted: 2

    🧹 Nitpick comments (3)
    tests/test_rag.py (3)

    16-16: Consider moving telemetry setting to environment configuration.

    Instead of hardcoding the telemetry setting, consider moving it to the environment configuration for better flexibility and maintainability.

    -os.environ["DEEPEVAL_ENABLE_TELEMETRY"] = "NO"
    +telemetry_enabled = os.getenv("DEEPEVAL_ENABLE_TELEMETRY", "NO")
    +os.environ["DEEPEVAL_ENABLE_TELEMETRY"] = telemetry_enabled

    75-75: Make graph name extraction more robust.

    The current graph name extraction assumes a specific format and could fail with malformed model names.

    -        cls.graph_name = model_name.split("/")[0]
    +        # Handle cases where model name doesn't contain "/"
    +        cls.graph_name = model_name.split("/")[0] if "/" in model_name else model_name

    88-88: Make source URL configurable.

    Consider moving the hardcoded URL to a configuration file or environment variable for better maintainability and flexibility.

    -        url = "https://www.imdb.com/title/tt23732458/"
    +        url = os.getenv("TEST_SOURCE_URL", "https://www.imdb.com/title/tt23732458/")
    📜 Review details

    Configuration used: CodeRabbit UI
    Review profile: CHILL
    Plan: Pro

    📥 Commits

    Reviewing files that changed from the base of the PR and between 24e9211 and 2ed1b9b.

    📒 Files selected for processing (2)
    • requirements.txt (0 hunks)
    • tests/test_rag.py (1 hunks)
    💤 Files with no reviewable changes (1)
    • requirements.txt
    🔇 Additional comments (3)
    tests/test_rag.py (3)

    32-74: Refactor duplicate ontology definition.

    The ontology structure appears to be duplicated in the codebase. Consider extracting it into a shared fixture or helper method.
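
    One possible shape for such a shared helper is sketched below; the module and function names are hypothetical, and the body would hold the entity/relation definitions currently inlined at lines 32-74 of tests/test_rag.py rather than the placeholder comment shown here.

    # tests/fixtures/ontology.py (hypothetical module; names are illustrative only)
    from graphrag_sdk import Ontology  # assumes the SDK exposes Ontology at the top level

    def build_movie_ontology() -> Ontology:
        """Build the movie ontology once so every test module reuses a single definition."""
        ontology = Ontology()
        # Move the entity/relation definitions duplicated in tests/test_rag.py here.
        return ontology

    # In the test's setUpClass:
    #     cls.ontology = build_movie_ontology()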


    103-103: Correct grammatical error in expected output.

    The phrase "Over than 10 actors" is grammatically incorrect.


    90-92: Add source file existence check.

    Add validation to ensure the source exists before processing.

    @classmethod
    def setUpClass(cls):
        # Get the model name from the environment variable
        model_name = os.getenv("TEST_MODEL", "gemini/gemini-1.5-flash-001")

    🛠️ Refactor suggestion

    Add error handling for missing model environment variable.

    The model name retrieval should handle cases where the environment variable is not set.

    -        model_name = os.getenv("TEST_MODEL", "gemini/gemini-1.5-flash-001")
    +        model_name = os.getenv("TEST_MODEL")
    +        if not model_name:
    +            raise ValueError("TEST_MODEL environment variable must be set")
    📝 Committable suggestion

    ‼️ IMPORTANT
    Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

    Suggested change
    -        model_name = os.getenv("TEST_MODEL", "gemini/gemini-1.5-flash-001")
    +        model_name = os.getenv("TEST_MODEL")
    +        if not model_name:
    +            raise ValueError("TEST_MODEL environment variable must be set")

    Comment on lines +129 to +130

            self.kg.delete()
            assert np.mean(scores) >= 0.5

    🛠️ Refactor suggestion

    Enhance test assertions.

    The current assertion only checks the mean score. Consider adding more comprehensive assertions:

    1. Individual score thresholds
    2. Response structure validation
    3. Error case handling
             self.kg.delete()
    -        assert np.mean(scores) >= 0.5
    +        mean_score = np.mean(scores)
    +        min_score = np.min(scores)
    +        
    +        # Assert overall performance
    +        self.assertGreaterEqual(mean_score, 0.5, "Mean score below threshold")
    +        self.assertGreaterEqual(min_score, 0.3, "Individual responses below minimum threshold")
    +        
    +        # Assert all responses have required structure
    +        for score in scores:
    +            self.assertIsNotNone(score, "Score should not be None")
    📝 Committable suggestion

    ‼️ IMPORTANT
    Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

    Suggested change
             self.kg.delete()
    -        assert np.mean(scores) >= 0.5
    +        mean_score = np.mean(scores)
    +        min_score = np.min(scores)
    +
    +        # Assert overall performance
    +        self.assertGreaterEqual(mean_score, 0.5, "Mean score below threshold")
    +        self.assertGreaterEqual(min_score, 0.3, "Individual responses below minimum threshold")
    +
    +        # Assert all responses have required structure
    +        for score in scores:
    +            self.assertIsNotNone(score, "Score should not be None")

    @coderabbitai bot left a comment

    Actionable comments posted: 0

    🧹 Nitpick comments (4)
    tests/test_rag.py (4)

    30-30: Consider using a configuration file for default model name.

    The default model name "gemini/gemini-2.0-flash-exp" is hardcoded. Consider moving this to a configuration file for better maintainability.

    -        model_name = os.getenv("TEST_MODEL", "gemini/gemini-2.0-flash-exp")
    +        model_name = os.getenv("TEST_MODEL")
    +        if not model_name:
    +            model_name = config.get("default_model", "gemini/gemini-2.0-flash-exp")

    90-92: Add source validation before processing.

    Validate the source URL before processing to ensure it's accessible.

    +        import requests
    +        try:
    +            response = requests.head(url)
    +            response.raise_for_status()
    +        except requests.RequestException as e:
    +            raise ValueError(f"Invalid or inaccessible source URL: {url}. Error: {e}")
    +
             sources = [Source(url)]
             self.kg.process_sources(sources)

    93-109: Structure test cases as data-driven fixtures.

    Consider moving the test inputs and expected outputs to a test fixture file for better maintainability and readability.

    # test_data.py
    TEST_CASES = [
        {
            "input": "How many actors acted in a movie?",
            "expected_output": "Over than 10 actors acted in a movie.",
            "description": "Count of actors query"
        },
        # ... more test cases
    ]
    -        inputs = [
    -            "How many actors acted in a movie?",
    -            "Which actors acted in a movie?",
    -            # ...
    -        ]
    -
    -        expected_outputs = [
    -            "Over than 10 actors acted in a movie.",
    -            "Joseph Scotto, Melony Feliciano, and Donna Pastorello acted in a movie",
    -            # ...
    -        ]
    +        from .test_data import TEST_CASES
    +        
    +        for test_case in TEST_CASES:
    +            input_text = test_case["input"]
    +            expected_output = test_case["expected_output"]

    117-126: Add test case description for better error reporting.

    Include a descriptive name for each test case to make failures more informative.

                 test_case = LLMTestCase(
                 input=input_text,
                 actual_output=answer["response"],
                 retrieval_context=[answer["context"]],
                 context=[answer["context"]],
    -            name="kg_rag_test",
    +            name=f"kg_rag_test: {test_case['description']}",
                 expected_output=expected_output,
                 additional_metadata=answer["cypher"],
                 )
    📜 Review details

    Configuration used: CodeRabbit UI
    Review profile: CHILL
    Plan: Pro

    📥 Commits

    Reviewing files that changed from the base of the PR and between 2ed1b9b and 687cc23.

    📒 Files selected for processing (1)
    • tests/test_rag.py (1 hunks)
    ⏰ Context from checks skipped due to timeout of 90000ms (2)
    • GitHub Check: test (openai/gpt-4o)
    • GitHub Check: test (gemini/gemini-1.5-flash-001)

    @coderabbitai bot left a comment

    Actionable comments posted: 0

    🧹 Nitpick comments (1)
    .github/workflows/test.yml (1)

    27-27: Remove Trailing Whitespace.
    Line 27 contains trailing whitespace that may trigger YAML linter warnings and reduce overall readability. Please remove the extra spaces to conform with YAML style best practices.

    🧰 Tools
    🪛 YAMLlint (1.35.1)

    [error] 27-27: trailing spaces

    (trailing-spaces)

    📜 Review details

    Configuration used: CodeRabbit UI
    Review profile: CHILL
    Plan: Pro

    📥 Commits

    Reviewing files that changed from the base of the PR and between 687cc23 and 1ed80d3.

    📒 Files selected for processing (1)
    • .github/workflows/test.yml (2 hunks)
    🧰 Additional context used
    🪛 YAMLlint (1.35.1)
    .github/workflows/test.yml

    [error] 27-27: trailing spaces

    (trailing-spaces)

    ⏰ Context from checks skipped due to timeout of 90000ms (2)
    • GitHub Check: test (openai/gpt-4o)
    • GitHub Check: test (gemini/gemini-2.0-flash-exp)
    🔇 Additional comments (2)
    .github/workflows/test.yml (2)

    28-30: Verify Matrix Configuration.
    The new matrix strategy now defines models as [gemini/gemini-2.0-flash-exp, openai/gpt-4o]. Please confirm that the model identifier openai/gpt-4o is correct and matches your intentions; previous reviews flagged similar identifiers as potential typos. Also, consider whether additional matrix options (e.g., fail-fast: false) might improve the testing robustness.


    80-81: Confirm Environment Variable Additions for Testing.
    The "Run tests" step now includes GEMINI_API_KEY and sets TEST_MODEL via the matrix. Ensure that these environment variables are correctly required by your tests and that their corresponding secrets and usage in the code are appropriately documented.
