Graphrag integration #4612

Open
wants to merge 30 commits into
base: main

lspinheiro
Collaborator

@lspinheiro lspinheiro commented Dec 9, 2024

Why are these changes needed?

This PR adds an initial integration between graphrag and autogen by exposing local and global search as tools that can be used in autogen-agentchat. A user guide/cookbook will follow. I added no tests because the test data I used was fairly large, and I'm not sure we have an established way to add tests for these more complex integrations, but there is a script below that I used. The indexing needs to be done in graphrag first; the goal is to illustrate the end-to-end steps in a notebook.

Would appreciate some initial feedback, hoping to gradually extend with more flexible configuration, integration of drift search and examples.

Related issue number

Checks

@rysweet
Collaborator

rysweet commented Dec 10, 2024

Hi @lspinheiro - this is exciting. It's also marked as DRAFT in the subject line but not marked as such in the PR, so I'm marking it as a draft. Please set it back by clicking Ready to Review when you are ready.

@rysweet rysweet marked this pull request as draft December 10, 2024 17:21
@ekzhu
Collaborator

ekzhu commented Dec 12, 2024

Exciting to see this!! I love the tool idea. The tool itself can also be stateful and shared by multiple agents.
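
The idea of a stateful tool shared by several agents can be pictured with the minimal sketch below. It is not the autogen_ext implementation: `SharedSearchTool`, `ToyAgent`, and the cache are hypothetical names used only for illustration, and the "search" is a placeholder. The point it shows is that state kept on the tool object (here a query cache and a call counter) is visible to every agent holding a reference to the same instance.

```python
import asyncio


class SharedSearchTool:
    """Hypothetical stand-in for a search tool (not the autogen_ext API)."""

    def __init__(self) -> None:
        # State lives on the tool, not on any single agent.
        self._cache: dict[str, str] = {}
        self.backend_calls = 0

    async def run(self, query: str) -> str:
        if query not in self._cache:
            self.backend_calls += 1
            await asyncio.sleep(0)  # placeholder for the real, expensive search
            self._cache[query] = f"answer to: {query}"
        return self._cache[query]


class ToyAgent:
    """Hypothetical agent that forwards a question to its tool."""

    def __init__(self, name: str, tool: SharedSearchTool) -> None:
        self.name = name
        self.tool = tool

    async def ask(self, query: str) -> str:
        return await self.tool.run(query)


async def demo() -> int:
    tool = SharedSearchTool()
    first, second = ToyAgent("first", tool), ToyAgent("second", tool)
    await first.ask("Who is Dr. Becher?")
    await second.ask("Who is Dr. Becher?")  # served from the shared cache
    return tool.backend_calls


if __name__ == "__main__":
    print("backend calls:", asyncio.run(demo()))
```

Because both agents hold the same tool instance, the second question reuses the first agent's result instead of triggering another backend call.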

@ekzhu ekzhu added rag retrieve-augmented generative agents proj-extensions labels Dec 12, 2024
@lspinheiro lspinheiro requested a review from ekzhu December 17, 2024 06:10
@lspinheiro lspinheiro marked this pull request as ready for review December 17, 2024 06:17
@lspinheiro
Collaborator Author

Thanks @ekzhu and @rysweet. This should be ready for review now. It still needs the improvements mentioned in the description, but the tools can be used. I used the following test script:

import asyncio
from autogen_core import CancellationToken
from autogen_ext.models.openai import AzureOpenAIChatCompletionClient
from autogen_ext.tools.graphrag import (
    GlobalSearchTool,
    LocalSearchTool,
    GlobalDataConfig,
    LocalDataConfig,
    EmbeddingConfig,
)
from azure.identity import DefaultAzureCredential, get_bearer_token_provider


async def main():
    openai_client = AzureOpenAIChatCompletionClient(
        model="gpt-4o-mini",
        azure_endpoint="https://<resource-name>.openai.azure.com", 
        azure_deployment="gpt-4o-mini",
        api_version="2024-08-01-preview",
        azure_ad_token_provider=get_bearer_token_provider(DefaultAzureCredential(), "https://cognitiveservices.azure.com/.default")
    )

    # Global search example
    global_config = GlobalDataConfig(
        input_dir="./autogen-test/ragtest/output"
    )
    
    global_tool = GlobalSearchTool.from_config(
        openai_client=openai_client,
        data_config=global_config
    )

    global_args = {
        "query": "What does the station-master say about Dr. Becher?"
    }

    global_result = await global_tool.run_json(global_args, CancellationToken())
    print("\nGlobal Search Result:")
    print(global_result)
    
    # Local search example
    local_config = LocalDataConfig(
        input_dir="./autogen-test/ragtest/output"
    )

    embedding_config = EmbeddingConfig(
        model="text-embedding-3-small",
        api_base="https://<resource-name>.openai.azure.com", 
        deployment_name="text-embedding-3-small",
        api_version="2023-05-15",
        api_type="azure",
        azure_ad_token_provider=get_bearer_token_provider(DefaultAzureCredential(), "https://cognitiveservices.azure.com/.default"),
        max_retries=10,
        request_timeout=180.0,
    )

    local_tool = LocalSearchTool.from_config(
        openai_client=openai_client,
        data_config=local_config,
        embedding_config=embedding_config
    )

    local_args = {
        "query": "What does the station-master say about Dr. Becher?"
    }

    local_result = await local_tool.run_json(local_args, CancellationToken())
    print("\nLocal Search Result:")
    print(local_result)


if __name__ == "__main__":
    asyncio.run(main())

@lspinheiro
Collaborator Author

@jackgerrits, I had to add override-dependencies for pydantic and tenacity because the current version of pydantic is below their minimum requirement, and there is a conflict with llamaindex, which requires a lower version of tenacity, but it is a dev dependency for us. Let me know if you have any concerns with the approach.

@rysweet rysweet changed the title [DRAFT] Graphrag integration Graphrag integration Dec 17, 2024
@gagb
Collaborator

gagb commented Dec 17, 2024

Thank you! More documentation would help me review this PR. I would like to be able to build the docs page on this PR and see the example.

@gagb gagb mentioned this pull request Dec 17, 2024
@gagb
Collaborator

gagb commented Dec 19, 2024

Related #4438

@lspinheiro lspinheiro requested a review from gagb December 20, 2024 02:15
@lspinheiro
Collaborator Author

Thank you! More documentation would help me review this PR. I would like to be able to build the docs page on this PR and see the example.

@gagb, I added a sample with a README and some docstrings that should help with the review.

@lspinheiro lspinheiro requested a review from ekzhu December 30, 2024 01:26
@lspinheiro lspinheiro force-pushed the lpinheiro/feat/add-graphrag-tools branch from e4e2b52 to cac2aef Compare January 3, 2025 23:10

codecov bot commented Jan 3, 2025

Codecov Report

Attention: Patch coverage is 93.16770% with 11 lines in your changes missing coverage. Please review.

Project coverage is 68.48%. Comparing base (a6612e6) to head (4fc1fd8).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...xt/src/autogen_ext/tools/graphrag/_local_search.py | 88.88% | 6 Missing ⚠️ |
| ...t/src/autogen_ext/tools/graphrag/_global_search.py | 89.58% | 5 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4612      +/-   ##
==========================================
+ Coverage   68.09%   68.48%   +0.38%     
==========================================
  Files         162      166       +4     
  Lines       10194    10355     +161     
==========================================
+ Hits         6942     7092     +150     
- Misses       3252     3263      +11     
| Flag | Coverage Δ |
|---|---|
| unittests | 68.48% <93.16%> (+0.38%) ⬆️ |
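
The figures in the report can be cross-checked with a little arithmetic. The sketch below recomputes them from the hits and lines counts in the coverage diff. Two-decimal truncation (rather than rounding) is an inference from these particular numbers, not something the report states, and `truncated_pct` is a helper name introduced here.

```python
import math


def truncated_pct(part: float, whole: float) -> float:
    """Percentage of part/whole, truncated (not rounded) to two decimals."""
    return math.floor(part / whole * 10_000) / 100


# Hits and lines taken from the coverage diff (base = main, head = this PR).
base_hits, base_lines = 6942, 10194
head_hits, head_lines = 7092, 10355

patch_lines = head_lines - base_lines      # lines added by the PR
patch_hits = head_hits - base_hits         # added lines that tests execute
patch_missing = patch_lines - patch_hits   # added lines without coverage

base_cov = truncated_pct(base_hits, base_lines)
head_cov = truncated_pct(head_hits, head_lines)
patch_cov = truncated_pct(patch_hits, patch_lines)

# The delta is taken from the unrounded ratios, then truncated.
raw_delta = (head_hits / head_lines - base_hits / base_lines) * 100
delta = math.floor(raw_delta * 100) / 100

print(patch_lines, patch_hits, patch_missing)
print(base_cov, head_cov, patch_cov, delta)
```

The 11 uncovered lines match the per-file breakdown in the report (6 in `_local_search.py` plus 5 in `_global_search.py`).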


@ekzhu
Collaborator

ekzhu commented Jan 4, 2025

Let's add some unit tests? See the code coverage result. Is it possible to run a simple set up procedure with mini data set, perhaps generated?

@lspinheiro
Collaborator Author

Let's add some unit tests? See the code coverage result. Is it possible to run a simple set up procedure with mini data set, perhaps generated?

How much data do you think it is ok to add? I think the Sherlock Holmes book generates roughly 10 MB of data between the parquet and vector DB files. I can try to look into something smaller, but I don't know how to estimate the size of graphrag's output files from the input, so it is hard to say how much test data I would need to store in the repo.

@ekzhu
Collaborator

ekzhu commented Jan 6, 2025

How much data do you think it is ok to add?

How about a text file with 10 sentences? What is the size of the index?
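
A generated ten-sentence input of the kind suggested here could look like the sketch below. It only covers producing the input text. How large the resulting graphrag index would be is not known from this thread, and the sketch does not try to estimate it. The function name, file name, and sentence template are all hypothetical.

```python
import tempfile
from pathlib import Path


def write_mini_corpus(directory: Path, n_sentences: int = 10) -> Path:
    """Write a tiny generated corpus (one sentence per line) and return its path."""
    sentences = [
        f"Character {i} met character {i + 1} at station {i % 3}."
        for i in range(n_sentences)
    ]
    path = directory / "mini_corpus.txt"  # hypothetical file name
    path.write_text("\n".join(sentences) + "\n", encoding="utf-8")
    return path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        corpus = write_mini_corpus(Path(tmp))
        print(corpus.name, corpus.stat().st_size, "bytes")
```

Generating the text at test time keeps the input itself out of the repository; any indexed output would still have to be produced or stored separately.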

@jackgerrits
Member

Let's add some unit tests? See the code coverage result. Is it possible to run a simple set up procedure with mini data set, perhaps generated?

How much data do you think it is ok to add? I think the Sherlock Holmes book generates roughly 10 MB of data between the parquet and vector DB files. I can try to look into something smaller, but I don't know how to estimate the size of graphrag's output files from the input, so it is hard to say how much test data I would need to store in the repo.

Is there a mirror we can fetch it from instead of including it in the repo?

@lspinheiro
Collaborator Author

@ekzhu @jackgerrits, I added the data in a conftest file. Since we are mocking the LLM calls, it won't matter as much.
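
The general pattern of testing a search tool with a mocked LLM can be sketched with the standard library alone. This is not the test code from the PR: `ToySearchTool`, its `complete` method, and the sample row are hypothetical, and the real tests may be structured differently. It shows why the size of the data matters less once the model call is replaced by a mock: the test only checks how the tool assembles its prompt and passes the answer through.

```python
import asyncio
from unittest.mock import AsyncMock


class ToySearchTool:
    """Hypothetical tool: builds a prompt from indexed rows and asks an LLM."""

    def __init__(self, llm, rows: list[str]) -> None:
        self._llm = llm
        self._rows = rows

    async def run(self, query: str) -> str:
        context = "\n".join(self._rows)
        return await self._llm.complete(f"{context}\n\nQuestion: {query}")


async def check() -> tuple[str, str]:
    llm = AsyncMock()  # stands in for the real model client
    llm.complete.return_value = "mocked answer"
    tool = ToySearchTool(llm, ["Dr. Becher lives near the station."])

    result = await tool.run("Who is Dr. Becher?")

    llm.complete.assert_awaited_once()          # exactly one model call was made
    prompt = llm.complete.await_args.args[0]    # the prompt the tool built
    return result, prompt


if __name__ == "__main__":
    print(asyncio.run(check()))
```

In a pytest suite, the small data set would typically be provided by a fixture in `conftest.py` and passed to the tool in place of the hard-coded row.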

Labels
proj-extensions rag retrieve-augmented generative agents

6 participants