Update pytests #52
base: main
Conversation
Walkthrough
This pull request introduces updates to the project's dependency management and testing infrastructure.
Changes
Sequence Diagram
sequenceDiagram
participant Workflow as GitHub Actions
participant Tests as Test Runner
participant Metrics as Metrics Evaluator
participant KG as Knowledge Graph
Workflow->>Tests: Trigger tests with matrix models
Tests->>KG: Initialize Knowledge Graph
Tests->>Metrics: Evaluate LLM responses
Metrics->>Metrics: Combine relevancy metrics
Metrics-->>Tests: Return composite score
Tests-->>Workflow: Report test results
Actionable comments posted: 0
🧹 Nitpick comments (3)
tests/test_rag.py (3)
4-7: Enhance clarity by grouping related imports from deepeval.
You're importing assert_test and LLMTestCase across separate statements. For readability, consider grouping them into a single import statement if they originate from the same library.

-from deepeval import assert_test
...
-from deepeval.test_case import LLMTestCase
+from deepeval import assert_test, LLMTestCase
94-95: File path usage vs. resource management.
"tests/data/madoff.txt" is hardcoded. For cross-platform compatibility, consider using built-in path utilities or fixtures (see the sketch below). The new usage of self.kg_gemini.process_sources(sources) is appropriate if each test run sets up data consistently.
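A minimal sketch of the path-utilities idea using only the standard library; TEST_DATA_DIR is a hypothetical name, not something defined in this PR:

from pathlib import Path

# Resolve the data file relative to the test module so the path works
# regardless of the platform or the directory the tests are run from.
TEST_DATA_DIR = Path(__file__).parent / "data"
file_path = str(TEST_DATA_DIR / "madoff.txt")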
106-113: Use of LLMTestCase for structured comparisons.
Packaging input, output, and context into LLMTestCase clarifies your test logic. Just be aware of your "expected_output" phrasing: "Over than 10 actors…" might need re-evaluation for language clarity (e.g. "More than 10 actors…").
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
- poetry.lock is excluded by !**/*.lock
📒 Files selected for processing (5)
- pyproject.toml (2 hunks)
- tests/test_kg_gemini.py (0 hunks)
- tests/test_kg_ollama.py (0 hunks)
- tests/test_kg_openai.py (0 hunks)
- tests/test_rag.py (2 hunks)
💤 Files with no reviewable changes (3)
- tests/test_kg_ollama.py
- tests/test_kg_openai.py
- tests/test_kg_gemini.py
🔇 Additional comments (12)
tests/test_rag.py (10)
11-13: Good use of explicit metric imports.
The direct imports of individual metrics (AnswerRelevancyMetric, etc.) add clarity, enabling easy reference and usage in your test code. No issues found here.
72-72: Graph name alignment.
Renaming to "IMDB" is consistent with your usage later in the test. This clarifies the context and avoids ambiguous references to external providers.
74-75: Model selection check.
Using "gpt-4o" might be a specialized or less common model name. If it's a placeholder, ensure consistent usage across the entire test suite and watch for any potential mismatch in actual environment configurations.
81-86: Consolidating multiple Graph instances.
Creating two KnowledgeGraph instances referencing the same ontology promotes thorough comparative testing. Double-check that the correct model configurations (OpenAI vs. Gemini) are consistently applied across your suite (e.g., advanced parameters, temperature, etc.) to ensure valid coverage.
97-98: Chat session approach.
Utilizing separate chat sessions for each knowledge graph ensures test isolation. Good practice to confirm the session state is self-contained.
99-102: Metric-based validations.
Incorporating multiple metrics (relevancy, precision, recall) offers a robust assessment of the LLM's responses. This is a solid improvement over straightforward string comparisons.
114-114: Seamless usage of assert_test.
Good integration of assert_test from deepeval for each test case. No issues found.
116-117: Parallel test coverage.
Repeating the same prompt and test pattern with kg_openai ensures parity. This parallels the kg_gemini flow and fosters direct comparison of model responses.
118-125: Consistent test structuring.
Reusing the same LLMTestCase fields keeps tests uniform. Ensure that each knowledge graph (OpenAI vs. Gemini) is indeed configured with the intended model to truly compare results.
127-127: Final test assertion.
Having a second call to assert_test keeps both sets of results validated under the same metrics. This final step aligns with the consolidated approach to test coverage.
pyproject.toml (2)
3-3: Incremented version to "0.4.2".
This aligns with the new changes introduced in testing and dependencies. Make sure to tag or document this release accordingly for consistent version tracking.
25-25: Added new dependencies and extras.
Introducing deepeval and optional modeling dependencies allows flexible experimentation. Ensure that pinned ranges (like "^2.0.9") reflect tested compatibility.
Would you like a script to check for known vulnerabilities or overshadowing dependencies related to these newly added packages?
Actionable comments posted: 2
🧹 Nitpick comments (2)
tests/test_rag.py (2)
13-13: Remove unused import ContextualRecallMetric
The ContextualRecallMetric is imported but never used in the test implementation.

-from deepeval.metrics import AnswerRelevancyMetric, ContextualPrecisionMetric, ContextualRecallMetric, ContextualRelevancyMetric
+from deepeval.metrics import AnswerRelevancyMetric, ContextualPrecisionMetric, ContextualRelevancyMetric
106-127: Refactor duplicate test case creation logic
The test case creation logic is duplicated for both OpenAI and Gemini implementations. Consider extracting this into a helper method to improve maintainability and reduce code duplication.

def create_test_case(self, kg, name):
    input = "How many actors acted in a movie?"
    chat = kg.chat_session()
    answer = chat.send_message(input)
    return LLMTestCase(
        input=input,
        actual_output=answer['response'],
        retrieval_context=[answer['context']],
        context=[answer['context']],
        name=name,
        expected_output="Over than 10 actors acted in a movie."
    )

Then use it like this:

test_case = self.create_test_case(self.kg_gemini, "gemini")
assert_test(test_case, [answer_relevancy_metric, answer_cp_metric, answer_crelevancy_metric])
test_case = self.create_test_case(self.kg_openai, "openai")
assert_test(test_case, [answer_relevancy_metric, answer_cp_metric, answer_crelevancy_metric])
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- tests/test_rag.py (2 hunks)
🧰 Additional context used
🪛 Ruff (0.8.2)
tests/test_rag.py
101-101: Local variable answer_crecall_metric
is assigned to but never used
Remove assignment to unused variable answer_crecall_metric
(F841)
🔇 Additional comments (1)
tests/test_rag.py (1)
72-85: Potential naming conflict with identical graph names
Both kg_openai and kg_gemini instances are using the same graph name "IMDB". This could lead to conflicts if they're persisting data to the same storage. Consider using distinct names like "IMDB_openai" and "IMDB_gemini".
Actionable comments posted: 7
🧹 Nitpick comments (8)
tests/test_rag.py (2)
71-71: Extract hardcoded file path to a constant.
The test data file path is duplicated. Consider extracting it to a class-level or module-level constant to improve maintainability.

+TEST_DATA_FILE = "tests/data/madoff.txt"

-file_path = "tests/data/madoff.txt"
+file_path = TEST_DATA_FILE

 class TestKGLiteLLM(unittest.TestCase):
     def test_kg_creation(self):
-        file_path = "tests/data/madoff.txt"
+        file_path = TEST_DATA_FILE

Also applies to: 180-180
82-94: Improve test maintainability with parameterization.
The test inputs and expected outputs are hardcoded as lists. Consider using pytest's parametrize feature for better test organization and maintenance.
Example refactor:

import pytest

TEST_CASES = [
    pytest.param(
        "How many actors acted in a movie?",
        "Over than 10 actors acted in a movie.",
        id="count_actors"
    ),
    pytest.param(
        "Which actors acted in the movie Madoff: The Monster of Wall Street?",
        "Joseph Scotto, Melony Feliciano, and Donna Pastorello acted in the movie Madoff: The Monster of Wall Street.",
        id="list_actors"
    ),
    # ... other test cases
]

@pytest.mark.parametrize("input_query,expected_output", TEST_CASES)
def test_movie_queries(input_query, expected_output):
    # Test implementation

custom_metric.py (6)
6-6: Remove unused import get_or_create_event_loop.
The function get_or_create_event_loop is imported but not used anywhere in the code. Removing this unused import will clean up the code. Apply this diff to fix the issue:

-from deepeval.utils import get_or_create_event_loop, prettify_list
+from deepeval.utils import prettify_list

🧰 Tools
🪛 Ruff (0.8.2)
6-6: deepeval.utils.get_or_create_event_loop imported but unused
Remove unused import: deepeval.utils.get_or_create_event_loop (F401)
16-16: Remove unused import ConversationalTestCase.
The class ConversationalTestCase is imported but not used anywhere in the code. Removing this unused import will clean up the code. Apply this diff to fix the issue:

-from deepeval.test_case import (
-    LLMTestCaseParams,
-    ConversationalTestCase,
-)
+from deepeval.test_case import (
+    LLMTestCaseParams,
+)

🧰 Tools
🪛 Ruff (0.8.2)
16-16: deepeval.test_case.ConversationalTestCase imported but unused
Remove unused import: deepeval.test_case.ConversationalTestCase (F401)
20-20: Remove unused import ContextualRecallTemplate.
The class ContextualRecallTemplate is imported but not used from the module deepeval.metrics.contextual_recall.template. Since you define GraphContextualRecallTemplate later in the code, this import can be removed. Apply this diff to fix the issue:

-from deepeval.metrics.contextual_recall.template import ContextualRecallTemplate

🧰 Tools
🪛 Ruff (0.8.2)
20-20: deepeval.metrics.contextual_recall.template.ContextualRecallTemplate imported but unused
Remove unused import: deepeval.metrics.contextual_recall.template.ContextualRecallTemplate (F401)
117-117: Consider simplifying the composite score calculation.
The current logic sets the score to zero if in strict mode and the composite score is below the threshold. This could be simplified or clarified for better readability.
Consider updating the score calculation:
self.score = composite_score if not self.strict_mode else (composite_score if composite_score >= self.threshold else 0)
145-170: Handle potential exceptions when generating verdicts or calculating scores.
In the measure method of GraphContextualRecall, exceptions during verdict generation or score calculation could cause the evaluation to fail silently. Adding error handling will improve robustness. Consider wrapping the verdict generation and score calculation in try-except blocks to handle exceptions appropriately.
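A minimal sketch of that error-handling idea, reusing the attribute and helper names visible elsewhere in this class (self.error, self._generate_verdicts, self._calculate_score); it is an illustration, not the PR's actual implementation:

def measure(self, test_case):
    try:
        # Surface verdict-generation and scoring failures instead of letting
        # the evaluation fail silently.
        self.verdicts = self._generate_verdicts(
            expected_output=test_case.expected_output,
            retrieval_context=test_case.retrieval_context,
        )
        self.score = self._calculate_score()
        self.success = self.score >= self.threshold
        return self.score
    except Exception as e:
        self.error = str(e)
        self.success = False
        raise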
🧰 Tools
🪛 Ruff (0.8.2)
153-153: ContextualRecallVerdict may be undefined, or defined from star imports (F405)
258-293: Review multiline string formatting for prompts.
The prompt strings in GraphContextualRecallTemplate use triple quotes and f-strings, which can be error-prone. Ensure that the placeholders are correctly formatted and that the strings are displayed as intended. Verify that the prompts render correctly and consider using dedicated template strings or raw strings if necessary.
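As one illustration of the "dedicated template strings" idea, a sketch using the standard library's string.Template; the prompt text and placeholder names are invented for the example and are not the PR's actual prompts:

from string import Template

# $-style placeholders avoid the brace-escaping pitfalls of f-strings
# when the prompt itself contains literal braces or JSON examples.
VERDICTS_PROMPT = Template(
    "Expected output:\n$expected_output\n\n"
    "Retrieval context:\n$retrieval_context\n\n"
    "Return a JSON object with a list of verdicts."
)

prompt = VERDICTS_PROMPT.substitute(
    expected_output="More than 10 actors acted in a movie.",
    retrieval_context="(:Actor)-[:ACTED_IN]->(:Movie) rows returned from the graph",
)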
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- custom_metric.py (1 hunks)
- tests/test_rag.py (1 hunks)
🧰 Additional context used
🪛 Ruff (0.8.2)
tests/test_rag.py
191-191: Local variable answer_crecall_metric
is assigned to but never used
Remove assignment to unused variable answer_crecall_metric
(F841)
custom_metric.py
4-4: Redefinition of unused Optional
from line 1
Remove definition: Optional
(F811)
6-6: deepeval.utils.get_or_create_event_loop
imported but unused
Remove unused import: deepeval.utils.get_or_create_event_loop
(F401)
14-14: Redefinition of unused LLMTestCase
from line 3
Remove definition: LLMTestCase
(F811)
16-16: deepeval.test_case.ConversationalTestCase
imported but unused
Remove unused import: deepeval.test_case.ConversationalTestCase
(F401)
18-18: Redefinition of unused BaseMetric
from line 2
Remove definition: BaseMetric
(F811)
20-20: deepeval.metrics.contextual_recall.template.ContextualRecallTemplate
imported but unused
Remove unused import: deepeval.metrics.contextual_recall.template.ContextualRecallTemplate
(F401)
22-22: from deepeval.metrics.contextual_recall.schema import *
used; unable to detect undefined names
(F403)
153-153: ContextualRecallVerdict
may be undefined, or defined from star imports
(F405)
197-197: Reason
may be undefined, or defined from star imports
(F405)
197-197: Reason
may be undefined, or defined from star imports
(F405)
219-219: ContextualRecallVerdict
may be undefined, or defined from star imports
(F405)
228-228: ContextualRecallVerdict
may be undefined, or defined from star imports
(F405)
233-233: Verdicts
may be undefined, or defined from star imports
(F405)
233-233: Verdicts
may be undefined, or defined from star imports
(F405)
234-234: Verdicts
may be undefined, or defined from star imports
(F405)
240-240: ContextualRecallVerdict
may be undefined, or defined from star imports
(F405)
250-250: Do not use bare except
(E722)
⏰ Context from checks skipped due to timeout of 90000ms (1)
- GitHub Check: test
🔇 Additional comments (4)
tests/test_rag.py (4)
1-21: LGTM! Well-structured imports and setup.
The imports are well-organized, and the logging configuration is appropriate for testing purposes.
64-64: Verify model configuration consistency.
The model configuration differs between the global scope ("gemini/gemini-2.0-flash-exp") and the test class ("gpt-4o"). This inconsistency might lead to different behaviors in tests.
Please clarify if this difference is intentional or if one of these should be updated to match the other.
Also applies to: 169-169
79-79: Remove unused metric variable.
The answer_crecall_metric is created but never used in the assertions.
Also applies to: 79-79
178-204: 🛠️ Refactor suggestion
Enhance test coverage and reduce duplication.
The test method duplicates logic from the global scope and only tests a single scenario. Consider:
- Moving the common test logic to helper methods
- Adding more test cases for error scenarios
- Testing edge cases (empty source file, malformed data, etc.)
Would you like me to help generate additional test cases for error scenarios and edge cases?
🧰 Tools
🪛 Ruff (0.8.2)
191-191: Local variable answer_crecall_metric is assigned to but never used
Remove assignment to unused variable answer_crecall_metric (F841)
Actionable comments posted: 8
🧹 Nitpick comments (2)
graphrag_sdk/custom_metric.py (2)
2-2: Remove unused import of ContextualRelevancyMetric
The ContextualRelevancyMetric is imported at line 2 but not used in the code. Removing unused imports helps keep the code clean and improves readability. Apply this diff:

-from deepeval.metrics import BaseMetric, AnswerRelevancyMetric, FaithfulnessMetric, ContextualRelevancyMetric
+from deepeval.metrics import BaseMetric, AnswerRelevancyMetric, FaithfulnessMetric

🧰 Tools
🪛 Ruff (0.8.2)
2-2: deepeval.metrics.ContextualRelevancyMetric imported but unused
Remove unused import: deepeval.metrics.ContextualRelevancyMetric (F401)
22-22: Remove unused import of ContextualRecallTemplate
The ContextualRecallTemplate is imported at line 22 but not used in the code. Removing unused imports improves code clarity. Apply this diff:

-from deepeval.metrics.contextual_recall.template import ContextualRecallTemplate

🧰 Tools
🪛 Ruff (0.8.2)
22-22: deepeval.metrics.contextual_recall.template.ContextualRecallTemplate imported but unused
Remove unused import: deepeval.metrics.contextual_recall.template.ContextualRecallTemplate (F401)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- graphrag_sdk/custom_metric.py (1 hunks)
- tests/test_rag.py (1 hunks)
🧰 Additional context used
🪛 Ruff (0.8.2)
tests/test_rag.py
191-191: Local variable answer_crecall_metric
is assigned to but never used
Remove assignment to unused variable answer_crecall_metric
(F841)
graphrag_sdk/custom_metric.py
2-2: deepeval.metrics.ContextualRelevancyMetric
imported but unused
Remove unused import: deepeval.metrics.ContextualRelevancyMetric
(F401)
4-4: Redefinition of unused Optional
from line 1
Remove definition: Optional
(F811)
14-14: Redefinition of unused LLMTestCase
from line 3
Remove definition: LLMTestCase
(F811)
19-19: from deepeval.metrics.contextual_relevancy.schema import *
used; unable to detect undefined names
(F403)
20-20: Redefinition of unused BaseMetric
from line 2
Remove definition: BaseMetric
(F811)
22-22: deepeval.metrics.contextual_recall.template.ContextualRecallTemplate
imported but unused
Remove unused import: deepeval.metrics.contextual_recall.template.ContextualRecallTemplate
(F401)
24-24: from deepeval.metrics.contextual_recall.schema import *
used; unable to detect undefined names
(F403)
162-162: ContextualRecallVerdict
may be undefined, or defined from star imports
(F405)
206-206: Reason
may be undefined, or defined from star imports
(F405)
206-206: Reason
may be undefined, or defined from star imports
(F405)
228-228: ContextualRecallVerdict
may be undefined, or defined from star imports
(F405)
237-237: ContextualRecallVerdict
may be undefined, or defined from star imports
(F405)
242-242: Verdicts
may be undefined, or defined from star imports
(F405)
242-242: Verdicts
may be undefined, or defined from star imports
(F405)
243-243: Verdicts
may be undefined, or defined from star imports
(F405)
249-249: ContextualRecallVerdict
may be undefined, or defined from star imports
(F405)
259-259: Do not use bare except
(E722)
375-375: ContextualRelevancyVerdicts
may be undefined, or defined from star imports
(F405)
418-418: Reason
may be undefined, or defined from star imports
(F405)
418-418: Reason
may be undefined, or defined from star imports
(F405)
442-442: ContextualRelevancyVerdicts
may be undefined, or defined from star imports
(F405)
450-450: ContextualRelevancyVerdicts
may be undefined, or defined from star imports
(F405)
454-454: ContextualRelevancyVerdicts
may be undefined, or defined from star imports
(F405)
460-460: ContextualRelevancyVerdicts
may be undefined, or defined from star imports
(F405)
468-468: Do not use bare except
(E722)
⏰ Context from checks skipped due to timeout of 90000ms (1)
- GitHub Check: test
🔇 Additional comments (2)
tests/test_rag.py (2)
79-79: Remove unused variable answer_crecall_metric
The answer_crecall_metric variable is assigned but never used at lines 79 and 191. Removing unused variables helps clean up the code. Apply this diff:
-answer_crecall_metric = ContextualRecallMetric(threshold=0.5)
Also applies to: 191-191
125-167: Refactor duplicate ontology definition
The ontology structure is duplicated in both the global scope and within the TestKGLiteLLM class. Consider extracting it into a shared function or fixture to improve maintainability.
Actionable comments posted: 4
🧹 Nitpick comments (4)
graphrag_sdk/test_metrics.py (1)
19-19: Avoid using wildcard imports
Using from module import * can lead to namespace pollution and makes it harder to track where specific classes or functions are coming from. It also causes issues with static analysis tools. Replace wildcard imports with explicit imports of the required classes or functions. For example:

-from deepeval.metrics.contextual_relevancy.schema import *
+from deepeval.metrics.contextual_relevancy.schema import ContextualRecallVerdict, Verdicts, Reason
...
-from deepeval.metrics.contextual_recall.schema import *
+from deepeval.metrics.contextual_recall.schema import ContextualRelevancyVerdicts

Also applies to: 24-24

🧰 Tools
🪛 Ruff (0.8.2)
19-19: from deepeval.metrics.contextual_relevancy.schema import * used; unable to detect undefined names (F403)
.github/workflows/test.yml (3)
80-80: Document the TEST_MODEL environment variable usage.
Consider adding a comment in the workflow file explaining:
- The expected format and valid values for TEST_MODEL
- How this variable is used in the test suite
- What happens if the variable is not set
+        # TEST_MODEL: Model identifier used by pytest to determine which LLM to use for testing
+        # Valid values: gemini-1.5-flash-001, gpt-4o
         TEST_MODEL: ${{ matrix.model }} # Pass the model as an environment variable
27-27: Remove trailing whitespace.
Fix the formatting issue by removing trailing spaces on line 27.
🧰 Tools
🪛 yamllint (1.35.1)
[error] 27-27: trailing spaces
(trailing-spaces)
Line range hint 13-81: Consider enhancing workflow robustness.
Suggested improvements for the workflow configuration:
- Add a timeout to prevent long-running tests from blocking the pipeline:
timeout-minutes: 30
- Specify an explicit Python version instead of using "3.x":
python-version: "3.11"
- Add verbose output to pytest for better debugging:
run: poetry run pytest -v
🧰 Tools
🪛 yamllint (1.35.1)
[error] 27-27: trailing spaces
(trailing-spaces)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
- poetry.lock is excluded by !**/*.lock
📒 Files selected for processing (4)
- .github/workflows/test.yml (2 hunks)
- graphrag_sdk/test_metrics.py (1 hunks)
- pyproject.toml (1 hunks)
- tests/test_rag.py (1 hunks)
🧰 Additional context used
🪛 yamllint (1.35.1)
.github/workflows/test.yml
[error] 27-27: trailing spaces
(trailing-spaces)
🪛 Ruff (0.8.2)
graphrag_sdk/test_metrics.py
2-2: deepeval.metrics.ContextualRelevancyMetric
imported but unused
Remove unused import: deepeval.metrics.ContextualRelevancyMetric
(F401)
4-4: Redefinition of unused Optional
from line 1
Remove definition: Optional
(F811)
6-6: deepeval.utils.get_or_create_event_loop
imported but unused
Remove unused import: deepeval.utils.get_or_create_event_loop
(F401)
14-14: Redefinition of unused LLMTestCase
from line 3
Remove definition: LLMTestCase
(F811)
19-19: from deepeval.metrics.contextual_relevancy.schema import *
used; unable to detect undefined names
(F403)
20-20: Redefinition of unused BaseMetric
from line 2
Remove definition: BaseMetric
(F811)
22-22: deepeval.metrics.contextual_recall.template.ContextualRecallTemplate
imported but unused
Remove unused import: deepeval.metrics.contextual_recall.template.ContextualRecallTemplate
(F401)
24-24: from deepeval.metrics.contextual_recall.schema import *
used; unable to detect undefined names
(F403)
164-164: ContextualRecallVerdict
may be undefined, or defined from star imports
(F405)
208-208: Reason
may be undefined, or defined from star imports
(F405)
208-208: Reason
may be undefined, or defined from star imports
(F405)
230-230: ContextualRecallVerdict
may be undefined, or defined from star imports
(F405)
239-239: ContextualRecallVerdict
may be undefined, or defined from star imports
(F405)
244-244: Verdicts
may be undefined, or defined from star imports
(F405)
244-244: Verdicts
may be undefined, or defined from star imports
(F405)
245-245: Verdicts
may be undefined, or defined from star imports
(F405)
251-251: ContextualRecallVerdict
may be undefined, or defined from star imports
(F405)
261-261: Do not use bare except
(E722)
369-369: ContextualRelevancyVerdicts
may be undefined, or defined from star imports
(F405)
412-412: Reason
may be undefined, or defined from star imports
(F405)
412-412: Reason
may be undefined, or defined from star imports
(F405)
436-436: ContextualRelevancyVerdicts
may be undefined, or defined from star imports
(F405)
444-444: ContextualRelevancyVerdicts
may be undefined, or defined from star imports
(F405)
448-448: ContextualRelevancyVerdicts
may be undefined, or defined from star imports
(F405)
454-454: ContextualRelevancyVerdicts
may be undefined, or defined from star imports
(F405)
462-462: Do not use bare except
(E722)
🔇 Additional comments (4)
tests/test_rag.py (1)
102-102: Correct grammatical error in expected output
The phrase "Over than 10 actors acted in a movie." should be "More than 10 actors acted in a movie." This correction improves clarity and grammatical accuracy.
Apply this diff to correct the phrase:

- "Over than 10 actors acted in a movie.",
+ "More than 10 actors acted in a movie.",

pyproject.toml (3)
Line range hint 13-31: LGTM! Well-organized dependency structure.
The dependencies are well-organized with:
- Clear separation between core and optional dependencies
- Appropriate version constraints
- Logical grouping of LLM provider dependencies as optional
25-25: Verify deepeval compatibility with existing dependencies.
The new deepeval dependency looks good with appropriate version constraints. Let's verify there are no conflicts:
✅ Verification successful
✓ deepeval dependency verified as compatible and secure
The package is actively maintained with no known vulnerabilities. Version constraints are appropriate and won't cause unexpected breaking changes.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check for potential dependency conflicts and security advisories

# Check PyPI for latest versions and potential conflicts
curl -s https://pypi.org/pypi/deepeval/json | jq '.info.version'

# Check for security advisories
gh api graphql -f query='
{
  securityVulnerabilities(first: 5, ecosystem: PIP, package: "deepeval") {
    nodes {
      advisory {
        summary
        severity
        publishedAt
      }
      vulnerableVersionRange
      firstPatchedVersion {
        identifier
      }
    }
  }
}'

Length of output: 414
Line range hint 2-3: Verify if version bump follows semantic versioning.
The version bump from 0.4.2 to 0.5.0 indicates new features. Let's verify this aligns with the changes:
✅ Verification successful
Version bump to 0.5.0 follows semantic versioning correctly
The version bump from 0.4.2 to 0.5.0 is appropriate as it represents:
- Addition of new features (deepeval integration for metrics)
- No breaking changes (removed file was not referenced in the codebase)
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check commit history for breaking changes that might require a major version bump

# Look for breaking changes indicators in commit messages
git log --oneline | grep -iE "break|breaking|major"

# Check for removed files that might indicate breaking changes
git diff --name-status HEAD~1 | grep "^D"

Length of output: 75
Script:
#!/bin/bash
# Check for references to custom_metric in the codebase
rg -l "custom_metric"

# Look for potential replacement files
fd "metric" --type f --changed-within 2weeks

# Check if custom_metric was imported in examples or tests
rg -l "from graphrag_sdk.custom_metric import"

Length of output: 144
from deepeval.metrics import BaseMetric, AnswerRelevancyMetric, FaithfulnessMetric, ContextualRelevancyMetric
from deepeval.test_case import LLMTestCase
from typing import Optional, List, Union
🛠️ Refactor suggestion
Remove unused imports and resolve redefinitions
There are several unused imports and redefinitions in the code which can lead to confusion and potential conflicts. Specifically:
- Line 2: ContextualRelevancyMetric is imported but not used.
- Line 4: Optional is redefined from line 1.
- Line 6: get_or_create_event_loop is imported but not used.
- Line 14: LLMTestCase is redefined from line 3.
- Line 20: BaseMetric is redefined from line 2.
- Line 22: ContextualRecallTemplate is imported but not used.
Apply this diff to clean up the imports and resolve redefinitions:
from typing import Optional
-from deepeval.metrics import BaseMetric, AnswerRelevancyMetric, FaithfulnessMetric, ContextualRelevancyMetric
+from deepeval.metrics import BaseMetric, AnswerRelevancyMetric, FaithfulnessMetric
from deepeval.test_case import LLMTestCase
-from typing import Optional, List, Union
+from typing import List, Union
-from deepeval.utils import get_or_create_event_loop, prettify_list
+from deepeval.utils import prettify_list
-from deepeval.test_case import (
- LLMTestCase,
- LLMTestCaseParams,
- ConversationalTestCase,
-)
+from deepeval.test_case import LLMTestCaseParams, ConversationalTestCase
from deepeval.metrics.contextual_relevancy.schema import *
-from deepeval.metrics import BaseMetric
from deepeval.models import DeepEvalBaseLLM
+from deepeval.metrics.contextual_recall.template import ContextualRecallTemplate
from deepeval.metrics.indicator import metric_progress_indicator
from deepeval.metrics.contextual_recall.schema import *
Also applies to: 6-6, 14-14, 20-20, 22-22
🧰 Tools
🪛 Ruff (0.8.2)
2-2: deepeval.metrics.ContextualRelevancyMetric
imported but unused
Remove unused import: deepeval.metrics.ContextualRelevancyMetric
(F401)
4-4: Redefinition of unused Optional
from line 1
Remove definition: Optional
(F811)
        self.evaluation_cost = 0 if self.using_native_model else None
        with metric_progress_indicator(self, _show_indicator=_show_indicator):
            self.verdicts: List[ContextualRecallVerdict] = (
🛠️ Refactor suggestion
Ensure classes from wildcard imports are explicitly imported
Classes such as ContextualRecallVerdict, Verdicts, Reason, and ContextualRelevancyVerdicts are used but may not be defined due to wildcard imports. This can lead to undefined names and potential runtime errors.
Explicitly import the required classes to ensure they are defined:
+from deepeval.metrics.contextual_relevancy.schema import ContextualRecallVerdict, Verdicts, Reason
+from deepeval.metrics.contextual_recall.schema import ContextualRelevancyVerdicts
# Replace wildcard imports
-from deepeval.metrics.contextual_relevancy.schema import *
-from deepeval.metrics.contextual_recall.schema import *
Also applies to: 208-208, 230-230, 239-239, 244-244, 251-251, 369-369, 412-412, 436-436, 444-444, 448-448, 454-454
🧰 Tools
🪛 Ruff (0.8.2)
164-164: ContextualRecallVerdict
may be undefined, or defined from star imports
(F405)
CI Feedback 🧐 (Feedback updated until commit 2ed1b9b)
A test triggered by this PR failed. Here is an AI-generated analysis of the failure:
Actionable comments posted: 0
♻️ Duplicate comments (1)
.github/workflows/test.yml (1)
28-30: ⚠️ Potential issue
Incorrect model name and missing matrix configuration options.
The model name "gpt-4o" appears to be incorrect. Additionally, the matrix configuration could be enhanced for better test reliability.
Apply these changes to fix the model name and improve the matrix configuration:
  strategy:
+   fail-fast: false  # Continue running other models if one fails
    matrix:
-     model: [gemini/gemini-1.5-flash-001, openai/gpt-4o] # List of models
+     model: [gemini/gemini-1.5-flash-001, openai/gpt-4] # List of models

Let's verify the model names in the codebase:
#!/bin/bash
# Search for model name references in test files and configuration
rg -l "gemini-1.5-flash-001|gpt-4" tests/
🧹 Nitpick comments (3)
.github/workflows/test.yml (3)
27-27: Remove trailing whitespace.
There is unnecessary trailing whitespace on this line.
-
+
🧰 Tools
🪛 YAMLlint (1.35.1)
[error] 27-27: trailing spaces
(trailing-spaces)
80-80: Consider documenting the TEST_MODEL environment variable.
The TEST_MODEL environment variable is crucial for the matrix testing strategy but lacks documentation about its purpose and expected values.

- TEST_MODEL: ${{ matrix.model }} # Pass the model as an environment variable
+ TEST_MODEL: ${{ matrix.model }} # Model identifier for matrix testing (format: provider/model-name)
Line range hint 28-80: Consider potential impacts of matrix testing on services.
The matrix strategy will run tests for each model in parallel, which means multiple instances of tests will interact with FalkorDB and Ollama services simultaneously. This could lead to:
- Increased resource usage on services
- Potential race conditions if tests modify shared data
- Longer overall execution time if services become bottlenecks
Consider:
- Adding resource limits to services
- Implementing test isolation mechanisms
- Monitoring service performance during matrix runs
🧰 Tools
🪛 YAMLlint (1.35.1)
[error] 27-27: trailing spaces
(trailing-spaces)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- .github/workflows/test.yml (2 hunks)
🧰 Additional context used
🪛 YAMLlint (1.35.1)
.github/workflows/test.yml
[error] 27-27: trailing spaces
(trailing-spaces)
⏰ Context from checks skipped due to timeout of 90000ms (1)
- GitHub Check: test (openai/gpt-4o)
Actionable comments posted: 1
♻️ Duplicate comments (3)
tests/test_rag.py (3)
32-84: 🛠️ Refactor suggestion
Refactor duplicate ontology definition.
The ontology structure is duplicated between the global scope and the TestKGLiteLLM class. Consider extracting this into a shared fixture or helper method to improve maintainability.
101-108: ⚠️ Potential issue
Correct grammatical error in expected outputs.
The phrase "Over than 10 actors acted in a movie." should be "More than 10 actors acted in a movie." This correction improves clarity and grammatical accuracy.
89-108: 🛠️ Refactor suggestion
Consider data-driven test assertions based on source content.
After reviewing the test file and source data (madoff.txt), I can confirm the original suggestions and provide specific recommendations:
- The hardcoded expected output "Over than 10 actors acted in a movie" should be derived from the actual data.
- The test should validate the response structure before assertions (see the sketch below).
- The metrics being used are good for measuring response quality, but there's no validation of error cases.
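A small sketch of the structure-validation idea, assuming the chat session returns a dict with 'response' and 'context' keys as the existing test reads them; the helper name is hypothetical:

def assert_valid_answer(test, answer):
    # Validate the response shape before building LLMTestCase objects so a
    # malformed response fails clearly instead of surfacing as a metric error.
    test.assertIsInstance(answer, dict)
    test.assertIn("response", answer)
    test.assertIn("context", answer)
    test.assertTrue(str(answer["response"]).strip(), "empty response from the model")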
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- .github/workflows/test.yml (2 hunks)
- graphrag_sdk/test_metrics.py (1 hunks)
- tests/test_rag.py (1 hunks)
🧰 Additional context used
🪛 Ruff (0.8.2)
graphrag_sdk/test_metrics.py
2-2: deepeval.metrics.ContextualRelevancyMetric
imported but unused
Remove unused import: deepeval.metrics.ContextualRelevancyMetric
(F401)
4-4: Redefinition of unused Optional
from line 1
Remove definition: Optional
(F811)
6-6: deepeval.utils.get_or_create_event_loop
imported but unused
Remove unused import: deepeval.utils.get_or_create_event_loop
(F401)
14-14: Redefinition of unused LLMTestCase
from line 3
Remove definition: LLMTestCase
(F811)
19-19: from deepeval.metrics.contextual_relevancy.schema import *
used; unable to detect undefined names
(F403)
20-20: Redefinition of unused BaseMetric
from line 2
Remove definition: BaseMetric
(F811)
22-22: deepeval.metrics.contextual_recall.template.ContextualRecallTemplate
imported but unused
Remove unused import: deepeval.metrics.contextual_recall.template.ContextualRecallTemplate
(F401)
24-24: from deepeval.metrics.contextual_recall.schema import *
used; unable to detect undefined names
(F403)
164-164: ContextualRecallVerdict
may be undefined, or defined from star imports
(F405)
208-208: Reason
may be undefined, or defined from star imports
(F405)
208-208: Reason
may be undefined, or defined from star imports
(F405)
230-230: ContextualRecallVerdict
may be undefined, or defined from star imports
(F405)
239-239: ContextualRecallVerdict
may be undefined, or defined from star imports
(F405)
244-244: Verdicts
may be undefined, or defined from star imports
(F405)
244-244: Verdicts
may be undefined, or defined from star imports
(F405)
245-245: Verdicts
may be undefined, or defined from star imports
(F405)
251-251: ContextualRecallVerdict
may be undefined, or defined from star imports
(F405)
261-261: Do not use bare except
(E722)
366-366: ContextualRelevancyVerdicts
may be undefined, or defined from star imports
(F405)
409-409: Reason
may be undefined, or defined from star imports
(F405)
409-409: Reason
may be undefined, or defined from star imports
(F405)
433-433: ContextualRelevancyVerdicts
may be undefined, or defined from star imports
(F405)
441-441: ContextualRelevancyVerdicts
may be undefined, or defined from star imports
(F405)
445-445: ContextualRelevancyVerdicts
may be undefined, or defined from star imports
(F405)
451-451: ContextualRelevancyVerdicts
may be undefined, or defined from star imports
(F405)
459-459: Do not use bare except
(E722)
🪛 YAMLlint (1.35.1)
.github/workflows/test.yml
[error] 27-27: trailing spaces
(trailing-spaces)
⏰ Context from checks skipped due to timeout of 90000ms (2)
- GitHub Check: test (openai/gpt-4o)
- GitHub Check: test (gemini/gemini-1.5-flash-001)
🔇 Additional comments (2)
.github/workflows/test.yml (1)
28-30: Update the model name in the matrix configuration.
The model name gpt-4o appears to be incorrect. Please verify and update the model name in the matrix configuration. Run the following script to verify the model names:

#!/bin/bash
# Description: Verify the model names in the test files.
# Test: Search for model name usage in test files.
rg -B2 -A2 "gemini-1.5-flash-001|gpt-4o" tests/

graphrag_sdk/test_metrics.py (1)
1-24: 🛠️ Refactor suggestion
Clean up imports to improve code quality.
There are several issues with the imports:
- Unused imports: ContextualRelevancyMetric, get_or_create_event_loop, ContextualRecallTemplate
- Redefined imports: Optional, LLMTestCase, BaseMetric
- Star imports that make it difficult to track dependencies
- Star imports that make it difficult to track dependencies
Apply this diff to clean up the imports:
from typing import Optional -from deepeval.metrics import BaseMetric, AnswerRelevancyMetric, FaithfulnessMetric, ContextualRelevancyMetric +from deepeval.metrics import BaseMetric, AnswerRelevancyMetric, FaithfulnessMetric from deepeval.test_case import LLMTestCase -from typing import Optional, List, Union +from typing import List, Union -from deepeval.utils import get_or_create_event_loop, prettify_list +from deepeval.utils import prettify_list from deepeval.metrics.utils import ( construct_verbose_logs, trimAndLoadJson, check_llm_test_case_params, initialize_model, ) -from deepeval.test_case import ( - LLMTestCase, - LLMTestCaseParams, - ConversationalTestCase, -) +from deepeval.test_case import LLMTestCaseParams, ConversationalTestCase -from deepeval.metrics.contextual_relevancy.schema import * -from deepeval.metrics import BaseMetric +from deepeval.metrics.contextual_relevancy.schema import ( + ContextualRecallVerdict, + Verdicts, + Reason, + ContextualRelevancyVerdicts +) from deepeval.models import DeepEvalBaseLLM -from deepeval.metrics.contextual_recall.template import ContextualRecallTemplate from deepeval.metrics.indicator import metric_progress_indicator -from deepeval.metrics.contextual_recall.schema import *Likely invalid or redundant comment.
🧰 Tools
🪛 Ruff (0.8.2)
2-2: deepeval.metrics.ContextualRelevancyMetric imported but unused
Remove unused import: deepeval.metrics.ContextualRelevancyMetric (F401)
4-4: Redefinition of unused Optional from line 1
Remove definition: Optional (F811)
6-6: deepeval.utils.get_or_create_event_loop imported but unused
Remove unused import: deepeval.utils.get_or_create_event_loop (F401)
14-14: Redefinition of unused LLMTestCase from line 3
Remove definition: LLMTestCase (F811)
19-19: from deepeval.metrics.contextual_relevancy.schema import * used; unable to detect undefined names (F403)
20-20: Redefinition of unused BaseMetric from line 2
Remove definition: BaseMetric (F811)
22-22: deepeval.metrics.contextual_recall.template.ContextualRecallTemplate imported but unused
Remove unused import: deepeval.metrics.contextual_recall.template.ContextualRecallTemplate (F401)
24-24: from deepeval.metrics.contextual_recall.schema import * used; unable to detect undefined names (F403)
Actionable comments posted: 1
🧹 Nitpick comments (5)
tests/test_rag.py (5)
1-14: Organize imports for better readability.
Consider organizing imports into groups: standard library, third-party packages, and local imports.

 import os
 import logging
 import unittest
 import numpy as np
+
 from dotenv import load_dotenv
-from graphrag_sdk.entity import Entity
-from graphrag_sdk.source import Source
 from deepeval.test_case import LLMTestCase
+
+from graphrag_sdk.entity import Entity
+from graphrag_sdk.source import Source
 from graphrag_sdk.relation import Relation
 from graphrag_sdk.ontology import Ontology
 from graphrag_sdk.models.litellm import LiteModel
 from graphrag_sdk.attribute import Attribute, AttributeType
 from graphrag_sdk import KnowledgeGraph, KnowledgeGraphModelConfig
16-21: Consider adjusting logging configuration for tests.
DEBUG level logging might be too verbose for tests. Consider using INFO level by default and allowing override through environment variables.

-logging.basicConfig(level=logging.DEBUG)
+log_level = os.getenv('TEST_LOG_LEVEL', 'INFO')
+logging.basicConfig(level=getattr(logging, log_level))
32-74: Consider moving ontology definition to a configuration file.
The ontology definition is quite extensive and might be reused across different test files. Consider moving it to a separate configuration file or fixture.
Example structure:

# tests/fixtures/ontologies.py
def create_movie_ontology():
    ontology = Ontology([], [])
    ontology.add_entity(
        Entity(
            label="Actor",
            attributes=[
                Attribute(
                    name="name",
                    attr_type=AttributeType.STRING,
                    unique=True,
                    required=True,
                ),
            ],
        )
    )
    # ... rest of the ontology definition
    return ontology
87-88: Use path resolution for test data files.
Hardcoded file paths might cause issues in different environments. Use os.path.join() with a base test data directory.

- file_path = "tests/data/madoff.txt"
+ test_data_dir = os.path.join(os.path.dirname(__file__), "data")
+ file_path = os.path.join(test_data_dir, "madoff.txt")
110-111: Make metric threshold configurable.
The combined metrics threshold and assertion threshold are hardcoded. Consider making these configurable through class attributes or environment variables.

+    DEFAULT_METRIC_THRESHOLD = 0.5
+
     def test_llm(self):
-        answer_combined_metric = CombineMetrics(threshold=0.5)
+        threshold = float(os.getenv('TEST_METRIC_THRESHOLD', self.DEFAULT_METRIC_THRESHOLD))
+        answer_combined_metric = CombineMetrics(threshold=threshold)
         scores = []
         # ... test implementation ...
-        assert np.mean(scores) >= 0.5
+        assert np.mean(scores) >= threshold

Also applies to: 130-130
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- tests/test_rag.py (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (2)
- GitHub Check: test (openai/gpt-4o)
- GitHub Check: test (gemini/gemini-1.5-flash-001)
🔇 Additional comments (2)
tests/test_rag.py (2)
101-108: Correct grammatical error in expected output.
The phrase "Over than 10 actors" should be "More than 10 actors" for grammatical correctness.
30-30: Verify the default model configuration.
The default model "gemini/gemini-2.0-flash-exp" includes "exp" which might indicate experimental features. Consider using a stable version for testing.
✅ Verification successful
Default model configuration is intentionally set
The model "gemini/gemini-2.0-flash-exp" is documented as the recommended model for Google's API in the project's README. The test configuration allows override via TEST_MODEL environment variable, providing flexibility for testing with different models if needed.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Check if there are any stability issues reported for the experimental model
rg -l "gemini-2\.0-flash-exp.*stable" .

Length of output: 41
Script:
#!/bin/bash
# Search for the model name pattern and related configurations
rg "gemini.*flash" -B 2 -A 2

# Look for model configuration documentation
fd -e md -e rst -e txt | xargs rg -l "model.*configuration|model.*setup"

# Find similar test configurations
fd "test.*\.py" | xargs rg "TEST_MODEL|model_name.*=.*getenv"

Length of output: 3295
Actionable comments posted: 2
🧹 Nitpick comments (5)
tests/test_rag.py (1)
113-130: Test logic is succinct but consider data volatility.
The test checks each prompt's response against an expected string and computes aggregated scores. If the IMDB content changes, these checks may fail unexpectedly. Consider mocking or freezing external data for more predictable tests (see the sketch below).
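A minimal sketch of the mocking idea using unittest.mock; the patched attribute name and the frozen answer's shape follow how the test reads answer['response'] and answer['context'], but are assumptions rather than the PR's actual code:

from unittest.mock import MagicMock, patch

# A canned answer shaped like the dict the test expects from send_message().
FROZEN_ANSWER = {
    "response": "More than 10 actors acted in a movie.",
    "context": "(:Actor)-[:ACTED_IN]->(:Movie) rows from the graph",
}

def frozen_chat_session(kg):
    """Return a patcher that replaces kg.chat_session with a canned stub."""
    chat = MagicMock()
    chat.send_message.return_value = FROZEN_ANSWER
    return patch.object(kg, "chat_session", return_value=chat)

Inside a test, wrapping the body in with frozen_chat_session(self.kg): keeps the assertions independent of the live IMDB page.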
graphrag_sdk/models/litellm.py (1)
77-77: Remove unused exception variable.
You're catching the exception but not using its variable. This can be simplified:

-        except Exception as e:
+        except Exception:

🧰 Tools
🪛 Ruff (0.8.2)
77-77: Local variable e is assigned to but never used
Remove assignment to unused variable e (F841)
graphrag_sdk/test_metrics.py (3)
33-130: Add type hints and improve error handling in CombineMetrics class.
The class implementation is good but could benefit from:
- Type hints for class attributes
- Better error handling in the measure method
- Documentation for the class and its methods
Apply this diff to improve the implementation:
 class CombineMetrics(BaseMetric):
+    """A metric that combines multiple graph context metrics to provide a composite score."""
+
     def __init__(
         self,
         threshold: float = 0.5,
         evaluation_model: Optional[str] = "gpt-4o",
         include_reason: bool = True,
         async_mode: bool = True,
         strict_mode: bool = False,
     ):
+        self.score: float = 0.0
+        self.reason: Optional[str] = None
+        self.success: bool = False
+        self.error: Optional[str] = None
         self.threshold = 1 if strict_mode else threshold
         self.evaluation_model = evaluation_model
         self.include_reason = include_reason
         self.async_mode = async_mode
         self.strict_mode = strict_mode

-    def measure(self, test_case: LLMTestCase):
+    def measure(self, test_case: LLMTestCase) -> float:
         try:
             graph_context_recall_metric, graph_context_relevancy_metric = self.initialize_metrics()
-            # Remember, deepeval's default metrics follow the same pattern as your custom metric!
-            # relevancy_metric.measure(test_case)
             graph_context_relevancy_metric.measure(test_case)
             graph_context_recall_metric.measure(test_case)

-            # Custom logic to set score, reason, and success
             self.set_score_reason_success(graph_context_recall_metric, graph_context_relevancy_metric)
             return self.score
         except Exception as e:
-            # Set and re-raise error
             self.error = str(e)
+            self.success = False
+            self.score = 0.0
             raise
261-333: Improve organization of prompt templates.
The prompt templates are currently defined as long f-strings within the methods. Consider:
- Moving the templates to a separate configuration file
- Using a template engine for better maintainability
Consider using a template engine like Jinja2 to manage these templates. I can help implement this if you'd like.
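For illustration, a minimal sketch of the Jinja2 approach; the template text and variable names are invented for the example and are not the PR's actual prompts:

from jinja2 import Template

# One declarative template instead of an f-string built inside the method.
VERDICTS_TEMPLATE = Template(
    "Expected output:\n{{ expected_output }}\n\n"
    "Retrieval context:\n"
    "{% for chunk in retrieval_context %}- {{ chunk }}\n{% endfor %}"
)

prompt = VERDICTS_TEMPLATE.render(
    expected_output="More than 10 actors acted in a movie.",
    retrieval_context=["(:Actor)-[:ACTED_IN]->(:Movie) rows from the graph"],
)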
462-538: Consider consolidating template classes.
The GraphContextualRelevancyTemplate class is very similar to GraphContextualRecallTemplate. Consider:
- Creating a base template class
- Moving templates to a configuration file
- Using a template engine for better maintainability
I can help implement a consolidated template system using a template engine if you'd like.
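A rough sketch of the base-class idea; the class and method names here are hypothetical and only show how shared scaffolding could be factored out:

class BaseGraphTemplate:
    """Shared prompt scaffolding for the recall and relevancy templates."""

    header = "You are evaluating a knowledge-graph RAG answer."

    @classmethod
    def wrap(cls, body: str) -> str:
        # Common framing applied by every concrete template.
        return f"{cls.header}\n\n{body}"


class GraphRecallTemplate(BaseGraphTemplate):
    @classmethod
    def generate_verdicts(cls, expected_output: str, retrieval_context: list) -> str:
        context = "\n".join(retrieval_context)
        return cls.wrap(f"Expected output:\n{expected_output}\n\nContext:\n{context}")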
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- graphrag_sdk/fixtures/prompts.py (3 hunks)
- graphrag_sdk/models/litellm.py (1 hunks)
- graphrag_sdk/test_metrics.py (1 hunks)
- tests/test_rag.py (1 hunks)
🧰 Additional context used
🪛 Ruff (0.8.2)
graphrag_sdk/models/litellm.py
77-77: Local variable e
is assigned to but never used
Remove assignment to unused variable e
(F841)
graphrag_sdk/test_metrics.py
2-2: deepeval.metrics.AnswerRelevancyMetric
imported but unused
Remove unused import
(F401)
2-2: deepeval.metrics.FaithfulnessMetric
imported but unused
Remove unused import
(F401)
2-2: deepeval.metrics.ContextualRelevancyMetric
imported but unused
Remove unused import
(F401)
4-4: Redefinition of unused Optional
from line 1
Remove definition: Optional
(F811)
6-6: deepeval.utils.get_or_create_event_loop
imported but unused
Remove unused import: deepeval.utils.get_or_create_event_loop
(F401)
14-14: Redefinition of unused LLMTestCase
from line 3
Remove definition: LLMTestCase
(F811)
19-19: from deepeval.metrics.contextual_relevancy.schema import *
used; unable to detect undefined names
(F403)
20-20: Redefinition of unused BaseMetric
from line 2
Remove definition: BaseMetric
(F811)
22-22: deepeval.metrics.contextual_recall.template.ContextualRecallTemplate
imported but unused
Remove unused import: deepeval.metrics.contextual_recall.template.ContextualRecallTemplate
(F401)
24-24: from deepeval.metrics.contextual_recall.schema import *
used; unable to detect undefined names
(F403)
156-156: ContextualRecallVerdict
may be undefined, or defined from star imports
(F405)
200-200: Reason
may be undefined, or defined from star imports
(F405)
200-200: Reason
may be undefined, or defined from star imports
(F405)
222-222: ContextualRecallVerdict
may be undefined, or defined from star imports
(F405)
231-231: ContextualRecallVerdict
may be undefined, or defined from star imports
(F405)
236-236: Verdicts
may be undefined, or defined from star imports
(F405)
236-236: Verdicts
may be undefined, or defined from star imports
(F405)
237-237: Verdicts
may be undefined, or defined from star imports
(F405)
243-243: ContextualRecallVerdict
may be undefined, or defined from star imports
(F405)
253-253: Do not use bare except
(E722)
361-361: ContextualRelevancyVerdicts
may be undefined, or defined from star imports
(F405)
404-404: Reason
may be undefined, or defined from star imports
(F405)
404-404: Reason
may be undefined, or defined from star imports
(F405)
428-428: ContextualRelevancyVerdicts
may be undefined, or defined from star imports
(F405)
436-436: ContextualRelevancyVerdicts
may be undefined, or defined from star imports
(F405)
440-440: ContextualRelevancyVerdicts
may be undefined, or defined from star imports
(F405)
446-446: ContextualRelevancyVerdicts
may be undefined, or defined from star imports
(F405)
454-454: Do not use bare except
(E722)
🔇 Additional comments (6)
graphrag_sdk/fixtures/prompts.py (1)
1-618: No changes detected in this file.
There are no lines marked with ~, indicating no modifications were introduced here.
tests/test_rag.py (4)
1-26: Imports and class definition look good.
The environment setup and logging configuration are appropriate, and the class docstring is clear. No issues found here.
28-86: Ontology and setup logic appear solid.
Defining the Actor and Movie entities along with the ACTED_IN relation is concise and correct. Ensuring the environment variable for the model name is read is a good practice.
92-100: Input queries are reasonable.
Providing multiple queries (e.g., "Which actors […]?") is a good way to test various aspects of retrieval. The approach of referencing an external URL for real data may be somewhat volatile over time if the IMDB page changes, but it's acceptable for integration tests.
102-109: Correct the grammatical error in "Over than 10 actors".
This exact issue was previously flagged; still valid.

- "Over than 10 actors acted in a movie.",
+ "More than 10 actors acted in a movie.",

graphrag_sdk/test_metrics.py (1)
1-24: 🛠️ Refactor suggestion
Clean up imports to remove unused and redefined imports.
Several imports are either unused or redefined, which can lead to confusion and potential conflicts.
Several imports are either unused or redefined, which can lead to confusion and potential conflicts.
Apply this diff to clean up the imports:
from typing import Optional -from deepeval.metrics import BaseMetric, AnswerRelevancyMetric, FaithfulnessMetric, ContextualRelevancyMetric +from deepeval.metrics import BaseMetric from deepeval.test_case import LLMTestCase -from typing import Optional, List, Union +from typing import List, Union -from deepeval.utils import get_or_create_event_loop, prettify_list +from deepeval.utils import prettify_list from deepeval.metrics.utils import ( construct_verbose_logs, trimAndLoadJson, check_llm_test_case_params, initialize_model, ) -from deepeval.test_case import ( - LLMTestCase, - LLMTestCaseParams, - ConversationalTestCase, -) +from deepeval.test_case import LLMTestCaseParams, ConversationalTestCase -from deepeval.metrics.contextual_relevancy.schema import * -from deepeval.metrics import BaseMetric +from deepeval.metrics.contextual_relevancy.schema import ( + ContextualRecallVerdict, + Verdicts, + Reason, + ContextualRelevancyVerdicts +) from deepeval.models import DeepEvalBaseLLM -from deepeval.metrics.contextual_recall.template import ContextualRecallTemplate from deepeval.metrics.indicator import metric_progress_indicator -from deepeval.metrics.contextual_recall.schema import *Likely invalid or redundant comment.
🧰 Tools
🪛 Ruff (0.8.2)
2-2: deepeval.metrics.AnswerRelevancyMetric imported but unused
Remove unused import (F401)
2-2: deepeval.metrics.FaithfulnessMetric imported but unused
Remove unused import (F401)
2-2: deepeval.metrics.ContextualRelevancyMetric imported but unused
Remove unused import (F401)
4-4: Redefinition of unused Optional from line 1
Remove definition: Optional (F811)
6-6: deepeval.utils.get_or_create_event_loop imported but unused
Remove unused import: deepeval.utils.get_or_create_event_loop (F401)
14-14: Redefinition of unused LLMTestCase from line 3
Remove definition: LLMTestCase (F811)
19-19: from deepeval.metrics.contextual_relevancy.schema import * used; unable to detect undefined names (F403)
20-20: Redefinition of unused BaseMetric from line 2
Remove definition: BaseMetric (F811)
22-22: deepeval.metrics.contextual_recall.template.ContextualRecallTemplate imported but unused
Remove unused import: deepeval.metrics.contextual_recall.template.ContextualRecallTemplate (F401)
24-24: from deepeval.metrics.contextual_recall.schema import * used; unable to detect undefined names (F403)
            self.score = self._calculate_score()
            self.reason = self._generate_reason(test_case.expected_output)
            self.success = self.score >= self.threshold
            self.verbose_logs = construct_verbose_logs(
                self,
                steps=[
                    f"Verdicts:\n{prettify_list(self.verdicts)}",
                    f"Score: {self.score}\nReason: {self.reason}",
                ],
            )

            return self.score

    def _generate_reason(self, expected_output: str):
        if self.include_reason is False:
            return None

        supportive_reasons = []
        unsupportive_reasons = []
        for verdict in self.verdicts:
            if verdict.verdict.lower() == "yes":
                supportive_reasons.append(verdict.reason)
            else:
                unsupportive_reasons.append(verdict.reason)

        prompt = GraphContextualRecallTemplate.generate_reason(
            expected_output=expected_output,
            supportive_reasons=supportive_reasons,
            unsupportive_reasons=unsupportive_reasons,
            score=format(self.score, ".2f"),
        )

        if self.using_native_model:
            res, cost = self.model.generate(prompt)
            self.evaluation_cost += cost
            data = trimAndLoadJson(res, self)
            return data["reason"]
        else:
            try:
                res: Reason = self.model.generate(prompt, schema=Reason)
                return res.reason
            except TypeError:
                res = self.model.generate(prompt)
                data = trimAndLoadJson(res, self)
                return data["reason"]

    def _calculate_score(self):
        number_of_verdicts = len(self.verdicts)
        if number_of_verdicts == 0:
            return 0

        justified_sentences = 0
        for verdict in self.verdicts:
            if verdict.verdict.lower() == "yes":
                justified_sentences += 1

        score = justified_sentences / number_of_verdicts
        return 0 if self.strict_mode and score < self.threshold else score

    def _generate_verdicts(
        self, expected_output: str, retrieval_context: List[str], cypher_query: Optional[str] = None
    ) -> List[ContextualRecallVerdict]:
        prompt = GraphContextualRecallTemplate.generate_verdicts(
            expected_output=expected_output, retrieval_context=retrieval_context, cypher_query=cypher_query
        )
        if self.using_native_model:
            res, cost = self.model.generate(prompt)
            self.evaluation_cost += cost
            data = trimAndLoadJson(res, self)
            verdicts = [
                ContextualRecallVerdict(**item) for item in data["verdicts"]
            ]
            return verdicts
        else:
            try:
                res: Verdicts = self.model.generate(prompt, schema=Verdicts)
                verdicts: Verdicts = [item for item in res.verdicts]
                return verdicts
            except TypeError:
                res = self.model.generate(prompt)
                data = trimAndLoadJson(res, self)
                verdicts = [
                    ContextualRecallVerdict(**item) for item in data["verdicts"]
                ]
                return verdicts

    def is_successful(self) -> bool:
        if self.error is not None:
            self.success = False
        else:
            try:
                self.success = self.score >= self.threshold
            except:
                self.success = False
        return self.success

    @property
    def __name__(self):
        return "Graph Contextual Recall"
🛠️ Refactor suggestion
Fix error handling and improve type safety in GraphContextualRecall class.
The class has several areas for improvement:
- Bare except clauses should catch specific exceptions
- Type hints are missing for class attributes
- Error handling could be more robust
Apply this diff to improve the implementation:
class GraphContextualRecall(BaseMetric):
+ """A metric that evaluates the recall of graph context against expected output."""
+
def __init__(
self,
threshold: float = 0.5,
model: Optional[Union[str, DeepEvalBaseLLM]] = None,
include_reason: bool = True,
strict_mode: bool = False,
verbose_mode: bool = False,
):
+ self.score: float = 0.0
+ self.reason: Optional[str] = None
+ self.success: bool = False
+ self.error: Optional[str] = None
+ self.verdicts: List[ContextualRecallVerdict] = []
+ self.evaluation_cost: Optional[float] = None
+ self.verbose_logs: Optional[str] = None
self.threshold = 1 if strict_mode else threshold
self.model, self.using_native_model = initialize_model(model)
self.evaluation_model = self.model.get_model_name()
self.include_reason = include_reason
self.strict_mode = strict_mode
self.verbose_mode = verbose_mode
def is_successful(self) -> bool:
if self.error is not None:
self.success = False
else:
try:
self.success = self.score >= self.threshold
- except:
+ except Exception as e:
+ self.error = str(e)
self.success = False
return self.success
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
class GraphContextualRecall(BaseMetric):
    """A metric that evaluates the recall of graph context against expected output."""

    def __init__(
        self,
        threshold: float = 0.5,
        model: Optional[Union[str, DeepEvalBaseLLM]] = None,
        include_reason: bool = True,
        strict_mode: bool = False,
        verbose_mode: bool = False,
    ):
        self.score: float = 0.0
        self.reason: Optional[str] = None
        self.success: bool = False
        self.error: Optional[str] = None
        self.verdicts: List[ContextualRecallVerdict] = []
        self.evaluation_cost: Optional[float] = None
        self.verbose_logs: Optional[str] = None
        self.threshold = 1 if strict_mode else threshold
        self.model, self.using_native_model = initialize_model(model)
        self.evaluation_model = self.model.get_model_name()
        self.include_reason = include_reason
        self.strict_mode = strict_mode
        self.verbose_mode = verbose_mode

    def measure(
        self,
        test_case: LLMTestCase,
        _show_indicator: bool = True,
    ) -> float:
        check_llm_test_case_params(test_case, required_params, self)
        self.evaluation_cost = 0 if self.using_native_model else None
        with metric_progress_indicator(self, _show_indicator=_show_indicator):
            self.verdicts: List[ContextualRecallVerdict] = (
                self._generate_verdicts(
                    test_case.expected_output, test_case.retrieval_context, test_case.additional_metadata
                )
            )
            self.score = self._calculate_score()
            self.reason = self._generate_reason(test_case.expected_output)
            self.success = self.score >= self.threshold
            self.verbose_logs = construct_verbose_logs(
                self,
                steps=[
                    f"Verdicts:\n{prettify_list(self.verdicts)}",
                    f"Score: {self.score}\nReason: {self.reason}",
                ],
            )
            return self.score

    def _generate_reason(self, expected_output: str):
        if self.include_reason is False:
            return None
        supportive_reasons = []
        unsupportive_reasons = []
        for verdict in self.verdicts:
            if verdict.verdict.lower() == "yes":
                supportive_reasons.append(verdict.reason)
            else:
                unsupportive_reasons.append(verdict.reason)
        prompt = GraphContextualRecallTemplate.generate_reason(
            expected_output=expected_output,
            supportive_reasons=supportive_reasons,
            unsupportive_reasons=unsupportive_reasons,
            score=format(self.score, ".2f"),
        )
        if self.using_native_model:
            res, cost = self.model.generate(prompt)
            self.evaluation_cost += cost
            data = trimAndLoadJson(res, self)
            return data["reason"]
        else:
            try:
                res: Reason = self.model.generate(prompt, schema=Reason)
                return res.reason
            except TypeError:
                res = self.model.generate(prompt)
                data = trimAndLoadJson(res, self)
                return data["reason"]

    def _calculate_score(self):
        number_of_verdicts = len(self.verdicts)
        if number_of_verdicts == 0:
            return 0
        justified_sentences = 0
        for verdict in self.verdicts:
            if verdict.verdict.lower() == "yes":
                justified_sentences += 1
        score = justified_sentences / number_of_verdicts
        return 0 if self.strict_mode and score < self.threshold else score

    def _generate_verdicts(
        self, expected_output: str, retrieval_context: List[str], cypher_query: Optional[str] = None
    ) -> List[ContextualRecallVerdict]:
        prompt = GraphContextualRecallTemplate.generate_verdicts(
            expected_output=expected_output, retrieval_context=retrieval_context, cypher_query=cypher_query
        )
        if self.using_native_model:
            res, cost = self.model.generate(prompt)
            self.evaluation_cost += cost
            data = trimAndLoadJson(res, self)
            verdicts = [
                ContextualRecallVerdict(**item) for item in data["verdicts"]
            ]
            return verdicts
        else:
            try:
                res: Verdicts = self.model.generate(prompt, schema=Verdicts)
                verdicts: Verdicts = [item for item in res.verdicts]
                return verdicts
            except TypeError:
                res = self.model.generate(prompt)
                data = trimAndLoadJson(res, self)
                verdicts = [
                    ContextualRecallVerdict(**item) for item in data["verdicts"]
                ]
                return verdicts

    def is_successful(self) -> bool:
        if self.error is not None:
            self.success = False
        else:
            try:
                self.success = self.score >= self.threshold
            except Exception as e:
                self.error = str(e)
                self.success = False
        return self.success

    @property
    def __name__(self):
        return "Graph Contextual Recall"
🧰 Tools
🪛 Ruff (0.8.2)
156-156: ContextualRecallVerdict
may be undefined, or defined from star imports
(F405)
200-200: Reason
may be undefined, or defined from star imports
(F405)
200-200: Reason
may be undefined, or defined from star imports
(F405)
222-222: ContextualRecallVerdict
may be undefined, or defined from star imports
(F405)
231-231: ContextualRecallVerdict
may be undefined, or defined from star imports
(F405)
236-236: Verdicts
may be undefined, or defined from star imports
(F405)
236-236: Verdicts
may be undefined, or defined from star imports
(F405)
237-237: Verdicts
may be undefined, or defined from star imports
(F405)
243-243: ContextualRecallVerdict
may be undefined, or defined from star imports
(F405)
253-253: Do not use bare except
(E722)
                (self._generate_verdicts(test_case.input, context, test_case.additional_metadata))
                for context in test_case.retrieval_context
            ]
            self.score = self._calculate_score()
            self.reason = self._generate_reason(test_case.input)
            self.success = self.score >= self.threshold
            self.verbose_logs = construct_verbose_logs(
                self,
                steps=[
                    f"Verdicts:\n{prettify_list(self.verdicts_list)}",
                    f"Score: {self.score}\nReason: {self.reason}",
                ],
            )

            return self.score

    def _generate_reason(self, input: str):
        if self.include_reason is False:
            return None

        irrelevancies = []
        relevant_statements = []
        for verdicts in self.verdicts_list:
            for verdict in verdicts.verdicts:
                if verdict.verdict.lower() == "no":
                    irrelevancies.append(verdict.reason)
                else:
                    relevant_statements.append(verdict.statement)

        prompt: dict = GraphContextualRelevancyTemplate.generate_reason(
            input=input,
            irrelevancies=irrelevancies,
            relevant_statements=relevant_statements,
            score=format(self.score, ".2f"),
        )
        if self.using_native_model:
            res, cost = self.model.generate(prompt)
            self.evaluation_cost += cost
            data = trimAndLoadJson(res, self)
            return data["reason"]
        else:
            try:
                res: Reason = self.model.generate(prompt, schema=Reason)
                return res.reason
            except TypeError:
                res = self.model.generate(prompt)
                data = trimAndLoadJson(res, self)
                return data["reason"]

    def _calculate_score(self):
        total_verdicts = 0
        relevant_statements = 0
        for verdicts in self.verdicts_list:
            for verdict in verdicts.verdicts:
                total_verdicts += 1
                if verdict.verdict.lower() == "yes":
                    relevant_statements += 1

        if total_verdicts == 0:
            return 0

        score = relevant_statements / total_verdicts
        return 0 if self.strict_mode and score < self.threshold else score

    def _generate_verdicts(
        self, input: str, context: str, cypher_query: Optional[str] = None
    ) -> ContextualRelevancyVerdicts:
        prompt = GraphContextualRelevancyTemplate.generate_verdicts(
            input=input, context=context, cypher_query=cypher_query
        )
        if self.using_native_model:
            res, cost = self.model.generate(prompt)
            self.evaluation_cost += cost
            data = trimAndLoadJson(res, self)
            return ContextualRelevancyVerdicts(**data)
        else:
            try:
                res = self.model.generate(
                    prompt, schema=ContextualRelevancyVerdicts
                )
                return res
            except TypeError:
                res = self.model.generate(prompt)
                data = trimAndLoadJson(res, self)
                return ContextualRelevancyVerdicts(**data)

    def is_successful(self) -> bool:
        if self.error is not None:
            self.success = False
        else:
            try:
                self.success = self.score >= self.threshold
            except:
                self.success = False
        return self.success

    @property
    def __name__(self):
        return "Graph Contextual Relevancy"
🛠️ Refactor suggestion
Fix error handling and improve type safety in GraphContextualRelevancy class.
Similar to GraphContextualRecall, this class needs:
- Proper exception handling
- Type hints for class attributes
- Better error handling in the measure method
Apply this diff to improve the implementation:
class GraphContextualRelevancy(BaseMetric):
+ """A metric that evaluates the relevancy of graph context against input."""
+
def __init__(
self,
threshold: float = 0.5,
model: Optional[Union[str, DeepEvalBaseLLM]] = None,
include_reason: bool = True,
strict_mode: bool = False,
verbose_mode: bool = False,
):
+ self.score: float = 0.0
+ self.reason: Optional[str] = None
+ self.success: bool = False
+ self.error: Optional[str] = None
+ self.verdicts_list: List[ContextualRelevancyVerdicts] = []
+ self.evaluation_cost: Optional[float] = None
+ self.verbose_logs: Optional[str] = None
self.threshold = 1 if strict_mode else threshold
self.model, self.using_native_model = initialize_model(model)
self.evaluation_model = self.model.get_model_name()
self.include_reason = include_reason
self.strict_mode = strict_mode
self.verbose_mode = verbose_mode
def is_successful(self) -> bool:
if self.error is not None:
self.success = False
else:
try:
self.success = self.score >= self.threshold
- except:
+ except Exception as e:
+ self.error = str(e)
self.success = False
return self.success
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
class GraphContextualRelevancy(BaseMetric):
    """A metric that evaluates the relevancy of graph context against input."""

    def __init__(
        self,
        threshold: float = 0.5,
        model: Optional[Union[str, DeepEvalBaseLLM]] = None,
        include_reason: bool = True,
        strict_mode: bool = False,
        verbose_mode: bool = False,
    ):
        self.score: float = 0.0
        self.reason: Optional[str] = None
        self.success: bool = False
        self.error: Optional[str] = None
        self.verdicts_list: List[ContextualRelevancyVerdicts] = []
        self.evaluation_cost: Optional[float] = None
        self.verbose_logs: Optional[str] = None
        self.threshold = 1 if strict_mode else threshold
        self.model, self.using_native_model = initialize_model(model)
        self.evaluation_model = self.model.get_model_name()
        self.include_reason = include_reason
        self.strict_mode = strict_mode
        self.verbose_mode = verbose_mode

    def measure(
        self,
        test_case: Union[LLMTestCase, ConversationalTestCase],
        _show_indicator: bool = True,
    ) -> float:
        if isinstance(test_case, ConversationalTestCase):
            test_case = test_case.turns[0]
        check_llm_test_case_params(test_case, required_params, self)
        self.evaluation_cost = 0 if self.using_native_model else None
        with metric_progress_indicator(self, _show_indicator=_show_indicator):
            self.verdicts_list: List[ContextualRelevancyVerdicts] = [
                (self._generate_verdicts(test_case.input, context, test_case.additional_metadata))
                for context in test_case.retrieval_context
            ]
            self.score = self._calculate_score()
            self.reason = self._generate_reason(test_case.input)
            self.success = self.score >= self.threshold
            self.verbose_logs = construct_verbose_logs(
                self,
                steps=[
                    f"Verdicts:\n{prettify_list(self.verdicts_list)}",
                    f"Score: {self.score}\nReason: {self.reason}",
                ],
            )
            return self.score

    def _generate_reason(self, input: str):
        if self.include_reason is False:
            return None
        irrelevancies = []
        relevant_statements = []
        for verdicts in self.verdicts_list:
            for verdict in verdicts.verdicts:
                if verdict.verdict.lower() == "no":
                    irrelevancies.append(verdict.reason)
                else:
                    relevant_statements.append(verdict.statement)
        prompt: dict = GraphContextualRelevancyTemplate.generate_reason(
            input=input,
            irrelevancies=irrelevancies,
            relevant_statements=relevant_statements,
            score=format(self.score, ".2f"),
        )
        if self.using_native_model:
            res, cost = self.model.generate(prompt)
            self.evaluation_cost += cost
            data = trimAndLoadJson(res, self)
            return data["reason"]
        else:
            try:
                res: Reason = self.model.generate(prompt, schema=Reason)
                return res.reason
            except TypeError:
                res = self.model.generate(prompt)
                data = trimAndLoadJson(res, self)
                return data["reason"]

    def _calculate_score(self):
        total_verdicts = 0
        relevant_statements = 0
        for verdicts in self.verdicts_list:
            for verdict in verdicts.verdicts:
                total_verdicts += 1
                if verdict.verdict.lower() == "yes":
                    relevant_statements += 1
        if total_verdicts == 0:
            return 0
        score = relevant_statements / total_verdicts
        return 0 if self.strict_mode and score < self.threshold else score

    def _generate_verdicts(
        self, input: str, context: str, cypher_query: Optional[str] = None
    ) -> ContextualRelevancyVerdicts:
        prompt = GraphContextualRelevancyTemplate.generate_verdicts(
            input=input, context=context, cypher_query=cypher_query
        )
        if self.using_native_model:
            res, cost = self.model.generate(prompt)
            self.evaluation_cost += cost
            data = trimAndLoadJson(res, self)
            return ContextualRelevancyVerdicts(**data)
        else:
            try:
                res = self.model.generate(
                    prompt, schema=ContextualRelevancyVerdicts
                )
                return res
            except TypeError:
                res = self.model.generate(prompt)
                data = trimAndLoadJson(res, self)
                return ContextualRelevancyVerdicts(**data)

    def is_successful(self) -> bool:
        if self.error is not None:
            self.success = False
        else:
            try:
                self.success = self.score >= self.threshold
            except Exception as e:
                self.error = str(e)
                self.success = False
        return self.success

    @property
    def __name__(self):
        return "Graph Contextual Relevancy"
🧰 Tools
🪛 Ruff (0.8.2)
361-361: ContextualRelevancyVerdicts
may be undefined, or defined from star imports
(F405)
404-404: Reason
may be undefined, or defined from star imports
(F405)
404-404: Reason
may be undefined, or defined from star imports
(F405)
428-428: ContextualRelevancyVerdicts
may be undefined, or defined from star imports
(F405)
436-436: ContextualRelevancyVerdicts
may be undefined, or defined from star imports
(F405)
440-440: ContextualRelevancyVerdicts
may be undefined, or defined from star imports
(F405)
446-446: ContextualRelevancyVerdicts
may be undefined, or defined from star imports
(F405)
454-454: Do not use bare except
(E722)
Actionable comments posted: 2
🧹 Nitpick comments (3)
tests/test_rag.py (3)
16-16
: Consider moving telemetry setting to environment configuration.
Instead of hardcoding the telemetry setting, consider moving it to the environment configuration for better flexibility and maintainability.
-os.environ["DEEPEVAL_ENABLE_TELEMETRY"] = "NO"
+telemetry_enabled = os.getenv("DEEPEVAL_ENABLE_TELEMETRY", "NO")
+os.environ["DEEPEVAL_ENABLE_TELEMETRY"] = telemetry_enabled
75-75
: Make graph name extraction more robust.
The current graph name extraction assumes a specific format and could fail with malformed model names.
-        cls.graph_name = model_name.split("/")[0]
+        # Handle cases where model name doesn't contain "/"
+        cls.graph_name = model_name.split("/")[0] if "/" in model_name else model_name
88-88
: Make source URL configurable.
Consider moving the hardcoded URL to a configuration file or environment variable for better maintainability and flexibility.
-        url = "https://www.imdb.com/title/tt23732458/"
+        url = os.getenv("TEST_SOURCE_URL", "https://www.imdb.com/title/tt23732458/")
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
requirements.txt (0 hunks)
tests/test_rag.py (1 hunks)
💤 Files with no reviewable changes (1)
- requirements.txt
🔇 Additional comments (3)
tests/test_rag.py (3)
32-74
: Refactor duplicate ontology definition.
The ontology structure appears to be duplicated in the codebase. Consider extracting it into a shared fixture or helper method, as in the sketch below.
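A minimal sketch of such a shared fixture, assuming the ontology can be built once per session from a conftest.py helper; the module location, helper name, and the bare Ontology() construction are assumptions, not code from this PR.

# tests/conftest.py (hypothetical location for the shared fixture)
import pytest

from graphrag_sdk import Ontology  # assumed import, mirroring the tests under review


def build_movie_ontology() -> Ontology:
    # Move the duplicated Actor/Movie entity and relation definitions here,
    # exactly as they currently appear inline in the test classes.
    ontology = Ontology()
    ...  # populate entities, attributes, and relations in one place
    return ontology


@pytest.fixture(scope="session")
def movie_ontology() -> Ontology:
    # Session scope: built once, then reused by every test that requests it.
    return build_movie_ontology()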
103-103
: Correct grammatical error in expected output.
The phrase "Over than 10 actors" is grammatically incorrect.
90-92
: Add source file existence check.
Add validation to ensure the source exists before processing; a sketch follows below.
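A hedged sketch of that check; the helper name and the example file path are illustrative only.

import os


def require_source_file(path: str) -> str:
    # Fail fast with a clear message instead of letting process_sources()
    # surface a less obvious error deeper in the test run.
    if not os.path.isfile(path):
        raise FileNotFoundError(f"Test source file not found: {path}")
    return path


# Usage inside the test, before building Source objects:
source_path = require_source_file("tests/data/example_source.txt")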
tests/test_rag.py
Outdated
    @classmethod
    def setUpClass(cls):
        # Get the model name from the environment variable
        model_name = os.getenv("TEST_MODEL", "gemini/gemini-1.5-flash-001")
🛠️ Refactor suggestion
Add error handling for missing model environment variable.
The model name retrieval should handle cases where the environment variable is not set.
- model_name = os.getenv("TEST_MODEL", "gemini/gemini-1.5-flash-001")
+ model_name = os.getenv("TEST_MODEL")
+ if not model_name:
+ raise ValueError("TEST_MODEL environment variable must be set")
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
        model_name = os.getenv("TEST_MODEL")
        if not model_name:
            raise ValueError("TEST_MODEL environment variable must be set")
        self.kg.delete()
        assert np.mean(scores) >= 0.5
🛠️ Refactor suggestion
Enhance test assertions.
The current assertion only checks the mean score. Consider adding more comprehensive assertions:
- Individual score thresholds
- Response structure validation
- Error case handling
self.kg.delete()
- assert np.mean(scores) >= 0.5
+ mean_score = np.mean(scores)
+ min_score = np.min(scores)
+
+ # Assert overall performance
+ self.assertGreaterEqual(mean_score, 0.5, "Mean score below threshold")
+ self.assertGreaterEqual(min_score, 0.3, "Individual responses below minimum threshold")
+
+ # Assert all responses have required structure
+ for score in scores:
+ self.assertIsNotNone(score, "Score should not be None")
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
        self.kg.delete()

        mean_score = np.mean(scores)
        min_score = np.min(scores)

        # Assert overall performance
        self.assertGreaterEqual(mean_score, 0.5, "Mean score below threshold")
        self.assertGreaterEqual(min_score, 0.3, "Individual responses below minimum threshold")

        # Assert all responses have required structure
        for score in scores:
            self.assertIsNotNone(score, "Score should not be None")
Actionable comments posted: 0
🧹 Nitpick comments (4)
tests/test_rag.py (4)
30-30
: Consider using a configuration file for default model name.
The default model name "gemini/gemini-2.0-flash-exp" is hardcoded. Consider moving this to a configuration file for better maintainability.
-        model_name = os.getenv("TEST_MODEL", "gemini/gemini-2.0-flash-exp")
+        model_name = os.getenv("TEST_MODEL")
+        if not model_name:
+            model_name = config.get("default_model", "gemini/gemini-2.0-flash-exp")
90-92
: Add source validation before processing.
Validate the source URL before processing to ensure it's accessible.
+        import requests
+        try:
+            response = requests.head(url)
+            response.raise_for_status()
+        except requests.RequestException as e:
+            raise ValueError(f"Invalid or inaccessible source URL: {url}. Error: {e}")
+
         sources = [Source(url)]
         self.kg.process_sources(sources)
93-109
: Structure test cases as data-driven fixtures.
Consider moving the test inputs and expected outputs to a test fixture file for better maintainability and readability.
# test_data.py
TEST_CASES = [
    {
        "input": "How many actors acted in a movie?",
        "expected_output": "Over than 10 actors acted in a movie.",
        "description": "Count of actors query"
    },
    # ... more test cases
]
-        inputs = [
-            "How many actors acted in a movie?",
-            "Which actors acted in a movie?",
-            # ...
-        ]
-
-        expected_outputs = [
-            "Over than 10 actors acted in a movie.",
-            "Joseph Scotto, Melony Feliciano, and Donna Pastorello acted in a movie",
-            # ...
-        ]
+        from .test_data import TEST_CASES
+
+        for test_case in TEST_CASES:
+            input_text = test_case["input"]
+            expected_output = test_case["expected_output"]
117-126
: Add test case description for better error reporting.
Include a descriptive name for each test case to make failures more informative.
         test_case = LLMTestCase(
             input=input_text,
             actual_output=answer["response"],
             retrieval_context=[answer["context"]],
             context=[answer["context"]],
-            name="kg_rag_test",
+            name=f"kg_rag_test: {test_case['description']}",
             expected_output=expected_output,
             additional_metadata=answer["cypher"],
         )
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
tests/test_rag.py
(1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (2)
- GitHub Check: test (openai/gpt-4o)
- GitHub Check: test (gemini/gemini-1.5-flash-001)
Actionable comments posted: 0
🧹 Nitpick comments (1)
.github/workflows/test.yml (1)
27-27
: Remove Trailing Whitespace.
Line 27 contains trailing whitespace that may trigger YAML linter warnings and reduce overall readability. Please remove the extra spaces to conform with YAML style best practices.
🧰 Tools
🪛 YAMLlint (1.35.1)
[error] 27-27: trailing spaces
(trailing-spaces)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
.github/workflows/test.yml
(2 hunks)
🧰 Additional context used
🪛 YAMLlint (1.35.1)
.github/workflows/test.yml
[error] 27-27: trailing spaces
(trailing-spaces)
⏰ Context from checks skipped due to timeout of 90000ms (2)
- GitHub Check: test (openai/gpt-4o)
- GitHub Check: test (gemini/gemini-2.0-flash-exp)
🔇 Additional comments (2)
.github/workflows/test.yml (2)
28-30
: Verify Matrix Configuration.
The new matrix strategy now defines models as [gemini/gemini-2.0-flash-exp, openai/gpt-4o]. Please confirm that the model identifier openai/gpt-4o is correct and matches your intentions; previous reviews flagged similar identifiers as potential typos. Also, consider whether additional matrix options (e.g., fail-fast: false) might improve the testing robustness.
80-81
: Confirm Environment Variable Additions for Testing.
The "Run tests" step now includesGEMINI_API_KEY
and setsTEST_MODEL
via the matrix. Ensure that these environment variables are correctly required by your tests and that their corresponding secrets and usage in the code are appropriately documented.
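For illustration only, a sketch of how the test side might consume what the workflow provides; the class name, default model, and skip behaviour are assumptions, not code from this PR.

import os
import unittest


class RagTestConfig(unittest.TestCase):
    @classmethod
    def setUpClass(cls):
        # TEST_MODEL is injected by the workflow matrix entry for this job.
        cls.model_name = os.getenv("TEST_MODEL", "gemini/gemini-2.0-flash-exp")
        # GEMINI_API_KEY is expected to come from repository secrets.
        if cls.model_name.startswith("gemini/") and not os.getenv("GEMINI_API_KEY"):
            raise unittest.SkipTest("GEMINI_API_KEY is not configured in this environment")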
Summary by CodeRabbit
Release Notes v0.5.0
New Features
Dependencies
- Added the rich dependency for improved output formatting.
- Updated the deepeval dependency to a newer version.
Testing
Removals
- Removed the requirements.txt file, removing defined dependencies.