
High concurrency can cause the result generation to fail, and some fields in the result file may be empty. #51

Open
lycfight opened this issue Jan 21, 2025 · 5 comments

Comments

@lycfight

lycfight commented Jan 21, 2025

The --num_threads parameter has puzzled me for a long time. When it is set to 1, the results differ vastly from when it is set to 10; the output file size alone differs by several times. This happens with both the embedding and the LLM interfaces.

In the code, requests that fail due to excessive concurrency are simply written out as empty, without any error being raised. These empty responses are then combined with the partially successful ones, so the output almost completely mismatches the provided result files.
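To make concrete what I would expect instead, here is a minimal sketch (hypothetical, not the repo's actual code; `embed_fn` stands in for whatever function calls the API) that re-raises worker exceptions and treats an empty response as an error rather than a result:

```python
from concurrent.futures import ThreadPoolExecutor


def embed_all_or_fail(embed_fn, texts, num_threads=10):
    # Hypothetical sketch: instead of silently writing empty results,
    # re-raise any worker exception and treat an empty/None response
    # as a failure rather than merging it into the output.
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        futures = [pool.submit(embed_fn, t) for t in texts]
        results = []
        for fut in futures:
            out = fut.result()  # re-raises any exception from the worker
            if not out:
                raise RuntimeError("empty response from embedding backend")
            results.append(out)
    return results
```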

In particular, when calling the embedding interface, the error below still occurred even though I set --num_threads=1 and passed very large max_retries and timeout values:

embed_model = OpenAIEmbedding(model_name="text-embedding-3-small", max_retries=120, timeout=300)

https://github.com/OpenAutoCoder/Agentless/blob/5ce5888b9f149beaace393957a55ea8ee46c9f71/agentless/fl/Index.py#L260

(agentless) root@dsw-124175-7748dbc65-27jct:/mnt/2050data/yongcong.li/Agentless# python agentless/fl/retrieve.py --index_type simple \
                                --filter_type given_files \
                                --filter_file results/swe-bench-verified/file_level_irrelevant/loc_outputs.jsonl \
                                --output_folder results/swe-bench-verified/retrievel_embedding \
                                --persist_dir embedding/swe-bench_simple \
                                --num_threads 1 \
                                --dataset=princeton-nlp/SWE-bench_Verified \
                                --target_id=astropy__astropy-13398

  0%|          | 0/500 [00:00<?, ?it/s]Total number of considered files: 71
Total number of documents: 71
Embedding Tokens: 0
100%|██████████| 500/500 [00:26<00:00, 18.82it/s]
(agentless) root@dsw-124175-7748dbc65-27jct:/mnt/2050data/yongcong.li/Agentless#
(agentless) root@dsw-124175-7748dbc65-27jct:/mnt/2050data/yongcong.li/Agentless# python agentless/fl/retrieve.py --index_type simple \
                                --filter_type given_files \
                                --filter_file results/swe-bench-verified/file_level_irrelevant/loc_outputs.jsonl \
                                --output_folder results/swe-bench-verified/retrievel_embedding \
                                --persist_dir embedding/swe-bench_simple \
                                --num_threads 1 \
                                --dataset=princeton-nlp/SWE-bench_Verified
  0%|          | 0/500 [00:00<?, ?it/s]Total number of considered files: 27
Total number of documents: 27
Embedding Tokens: 0
  0%|          | 1/500 [00:27<3:44:47, 27.03s/it]Total number of considered files: 15
Total number of documents: 15
Embedding Tokens: 0
  0%|          | 2/500 [00:37<2:23:26, 17.28s/it]Total number of considered files: 31
Total number of documents: 31
Embedding Tokens: 0
  1%|          | 3/500 [00:52<2:16:01, 16.42s/it]Total number of considered files: 71
Total number of documents: 71
Embedding Tokens: 0
  1%|          | 4/500 [01:23<3:01:41, 21.98s/it]Total number of considered files: 47
Total number of documents: 47
Embedding Tokens: 0
  1%|          | 5/500 [01:41<2:49:10, 20.51s/it]Total number of considered files: 22
Total number of documents: 22
Embedding Tokens: 0
  1%|          | 6/500 [01:49<2:15:33, 16.46s/it]Total number of considered files: 61
Total number of documents: 61
Embedding Tokens: 0
  1%|          | 7/500 [02:06<2:16:03, 16.56s/it]Total number of considered files: 62
Total number of documents: 62
Embedding Tokens: 0
  2%|          | 8/500 [02:27<2:26:18, 17.84s/it]Total number of considered files: 144
Total number of documents: 144
  2%|          | 8/500 [03:01<3:06:01, 22.69s/it]
Traceback (most recent call last):
  File "/mnt/2050data/yongcong.li/Agentless/agentless/fl/retrieve.py", line 182, in <module>
    main()
  File "/mnt/2050data/yongcong.li/Agentless/agentless/fl/retrieve.py", line 178, in main
    retrieve(args)
  File "/mnt/2050data/yongcong.li/Agentless/agentless/fl/retrieve.py", line 105, in retrieve
    retrieve_locs(
  File "/mnt/2050data/yongcong.li/Agentless/agentless/fl/retrieve.py", line 74, in retrieve_locs
    file_names, meta_infos, traj = retriever.retrieve(mock=args.mock)
                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/2050data/yongcong.li/Agentless/agentless/fl/Index.py", line 261, in retrieve
    index = VectorStoreIndex.from_documents(documents, embed_model=embed_model)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/anaconda3/envs/agentless/lib/python3.11/site-packages/llama_index/core/indices/base.py", line 119, in from_documents
    return cls(
           ^^^^
  File "/root/anaconda3/envs/agentless/lib/python3.11/site-packages/llama_index/core/indices/vector_store/base.py", line 76, in __init__
    super().__init__(
  File "/root/anaconda3/envs/agentless/lib/python3.11/site-packages/llama_index/core/indices/base.py", line 77, in __init__
    index_struct = self.build_index_from_nodes(
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/anaconda3/envs/agentless/lib/python3.11/site-packages/llama_index/core/indices/vector_store/base.py", line 310, in build_index_from_nodes
    return self._build_index_from_nodes(content_nodes, **insert_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/anaconda3/envs/agentless/lib/python3.11/site-packages/llama_index/core/indices/vector_store/base.py", line 279, in _build_index_from_nodes
    self._add_nodes_to_index(
  File "/root/anaconda3/envs/agentless/lib/python3.11/site-packages/llama_index/core/indices/vector_store/base.py", line 232, in _add_nodes_to_index
    nodes_batch = self._get_node_with_embedding(nodes_batch, show_progress)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/anaconda3/envs/agentless/lib/python3.11/site-packages/llama_index/core/indices/vector_store/base.py", line 139, in _get_node_with_embedding
    id_to_embed_map = embed_nodes(
                      ^^^^^^^^^^^^
  File "/root/anaconda3/envs/agentless/lib/python3.11/site-packages/llama_index/core/indices/utils.py", line 160, in embed_nodes
    new_embeddings = embed_model.get_text_embedding_batch(
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/anaconda3/envs/agentless/lib/python3.11/site-packages/llama_index/core/instrumentation/dispatcher.py", line 321, in wrapper
    result = func(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^
  File "/root/anaconda3/envs/agentless/lib/python3.11/site-packages/llama_index/core/base/embeddings/base.py", line 335, in get_text_embedding_batch
    embeddings = self._get_text_embeddings(cur_batch)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/anaconda3/envs/agentless/lib/python3.11/site-packages/llama_index/embeddings/openai/base.py", line 465, in _get_text_embeddings
    return _retryable_get_embeddings()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/anaconda3/envs/agentless/lib/python3.11/site-packages/tenacity/__init__.py", line 336, in wrapped_f
    return copy(f, *args, **kw)
           ^^^^^^^^^^^^^^^^^^^^
  File "/root/anaconda3/envs/agentless/lib/python3.11/site-packages/tenacity/__init__.py", line 475, in __call__
    do = self.iter(retry_state=retry_state)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/anaconda3/envs/agentless/lib/python3.11/site-packages/tenacity/__init__.py", line 376, in iter
    result = action(retry_state)
             ^^^^^^^^^^^^^^^^^^^
  File "/root/anaconda3/envs/agentless/lib/python3.11/site-packages/tenacity/__init__.py", line 398, in <lambda>
    self._add_action_func(lambda rs: rs.outcome.result())
                                     ^^^^^^^^^^^^^^^^^^^
  File "/root/anaconda3/envs/agentless/lib/python3.11/concurrent/futures/_base.py", line 449, in result
    return self.__get_result()
           ^^^^^^^^^^^^^^^^^^^
  File "/root/anaconda3/envs/agentless/lib/python3.11/concurrent/futures/_base.py", line 401, in __get_result
    raise self._exception
  File "/root/anaconda3/envs/agentless/lib/python3.11/site-packages/tenacity/__init__.py", line 478, in __call__
    result = fn(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^
  File "/root/anaconda3/envs/agentless/lib/python3.11/site-packages/llama_index/embeddings/openai/base.py", line 458, in _retryable_get_embeddings
    return get_embeddings(
           ^^^^^^^^^^^^^^^
  File "/root/anaconda3/envs/agentless/lib/python3.11/site-packages/llama_index/embeddings/openai/base.py", line 169, in get_embeddings
    data = client.embeddings.create(input=list_of_text, model=engine, **kwargs).data
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/anaconda3/envs/agentless/lib/python3.11/site-packages/openai/resources/embeddings.py", line 124, in create
    return self._post(
           ^^^^^^^^^^^
  File "/root/anaconda3/envs/agentless/lib/python3.11/site-packages/openai/_base_client.py", line 1283, in post
    return cast(ResponseT, self.request(cast_to, opts, stream=stream, stream_cls=stream_cls))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/anaconda3/envs/agentless/lib/python3.11/site-packages/openai/_base_client.py", line 960, in request
    return self._request(
           ^^^^^^^^^^^^^^
  File "/root/anaconda3/envs/agentless/lib/python3.11/site-packages/openai/_base_client.py", line 1066, in _request
    return self._process_response(
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/anaconda3/envs/agentless/lib/python3.11/site-packages/openai/_base_client.py", line 1165, in _process_response
    return api_response.parse()
           ^^^^^^^^^^^^^^^^^^^^
  File "/root/anaconda3/envs/agentless/lib/python3.11/site-packages/openai/_response.py", line 325, in parse
    parsed = self._options.post_parser(parsed)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/anaconda3/envs/agentless/lib/python3.11/site-packages/openai/resources/embeddings.py", line 112, in parser
    for embedding in obj.data:
                     ^^^^^^^^
AttributeError: 'str' object has no attribute 'data'

This is caused by the embedding request failing: the response body comes back as a string rather than an object with a .data attribute, hence the AttributeError. When I hit this error, all I can do is keep retrying with ever larger max_retries and timeout values. Can you suggest a more elegant fix for this issue?
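In the meantime, the workaround I have in mind looks roughly like this (a hypothetical sketch, not the project's code; `call_api` is a placeholder for the embedding request): retry with exponential backoff plus jitter, and treat a None body as a failure rather than a result.

```python
import random
import time


def call_with_backoff(call_api, max_retries=6, base_delay=1.0, max_delay=60.0):
    # Hypothetical sketch: retry a flaky request with exponential
    # backoff plus jitter; a None response body counts as a failure.
    for attempt in range(max_retries):
        try:
            out = call_api()
            if out is None:
                raise ValueError("backend returned None")
            return out
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the error loudly
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(delay + random.uniform(0, delay / 10))
```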

@brutalsavage
Contributor

If you set --num_threads=1 there really shouldn't be any issue with request limits. Is someone else also using your OpenAI key? You might want to check your OpenAI rate limits at https://platform.openai.com/settings/organization/limits to see whether you are hitting them.

@lycfight
Author

lycfight commented Jan 21, 2025

If you set --num_threads=1 there really shouldn't be any issue with request limits. Is someone else also using your OpenAI key? You might want to check your OpenAI rate limits at https://platform.openai.com/settings/organization/limits to see whether you are hitting them.

This system is built by the company using a third-party Azure service, and I’ve noticed a curious phenomenon:

  • When --num_threads > 1, response failures don't throw errors. Instead, empty fields are written and the process continues running. I believe this is a key reason the results are inconsistent.
  • When --num_threads = 1, response failures cause the script to throw an error and interrupt execution. Based on this, I tried increasing max_retries and timeout, and found that the number of successfully completed tasks increases after each error and restart, which alleviates the issue but doesn't fully resolve it.

I suspect that the actual concurrency level of the requests (which might depend on the number of documents) exceeds a certain limit, causing the response failures. For instance, if a large number of documents need to be processed at once, the requests might surpass the concurrency limit of the interface.

Is there a way to limit the actual concurrency?

https://github.com/OpenAutoCoder/Agentless/blob/5ce5888b9f149beaace393957a55ea8ee46c9f71/agentless/fl/Index.py#L261

I also noticed in the log output that some data progress bars show as completed, but they still print Embedding Tokens: 0. What does this indicate? Why are the Embedding Tokens zero?

Total number of considered files: 251
Total number of documents: 251
Embedding Tokens: 0
 17%|█▋        | 84/500 [53:49<4:51:38, 42.06s/it]Total number of considered files: 165
Total number of documents: 165
Embedding Tokens: 0
 17%|█▋        | 85/500 [54:49<5:26:48, 47.25s/it]Total number of considered files: 79
Total number of documents: 79
Embedding Tokens: 0
 17%|█▋        | 86/500 [55:31<5:14:41, 45.61s/it]Total number of considered files: 272
Total number of documents: 272
Embedding Tokens: 0

@brutalsavage
Contributor

This system is built by the company using a third-party Azure service

I see; in that case I am not sure how it is handled, but I agree it probably depends on the concurrency limit of the Azure service. If it is not possible for you to switch to the default OpenAI service, which we used to reproduce the results, my suggestion is to process instance by instance with some timeout in between, or to implement some caching (in case even one instance is too many tokens for the interface to handle).

I also noticed in the log output that some data progress bars show as completed, but they still print Embedding Tokens: 0

Yes, the number of embedding tokens is only accurate when the number of threads is set to 1 (this is noted in the help documentation). The reason is again concurrency: the embedding token count gets overwritten once multiple threads are used. To get an accurate count, you can run with --mock and --num_threads=1 (which won't actually call the embedding model), saving cost while computing the number of embedding tokens needed.
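The overwrite problem described above is the classic lost-update race: each thread reads a shared total, adds its own count, and writes it back, so the last writer wins. A lock-protected accumulator (a hypothetical sketch, not the repo's actual counter) avoids it:

```python
import threading


class TokenCounter:
    # Hypothetical sketch: accumulate token counts under a lock so that
    # concurrent updates add up instead of overwriting each other.
    def __init__(self):
        self._lock = threading.Lock()
        self.total = 0

    def add(self, n):
        with self._lock:
            self.total += n
```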

@lycfight
Author

This system is built by the company using a third-party Azure service

I see; in that case I am not sure how it is handled, but I agree it probably depends on the concurrency limit of the Azure service. If it is not possible for you to switch to the default OpenAI service, which we used to reproduce the results, my suggestion is to process instance by instance with some timeout in between, or to implement some caching (in case even one instance is too many tokens for the interface to handle).

I also noticed in the log output that some data progress bars show as completed, but they still print Embedding Tokens: 0

Yes, the number of embedding tokens is only accurate when the number of threads is set to 1 (this is noted in the help documentation). The reason is again concurrency: the embedding token count gets overwritten once multiple threads are used. To get an accurate count, you can run with --mock and --num_threads=1 (which won't actually call the embedding model), saving cost while computing the number of embedding tokens needed.

I tried setting --num_threads=1 for all steps, to avoid failed responses being silently written as empty results, which can give the impression that a step ran successfully and produced correct output. I also set the embed_batch_size parameter of OpenAIEmbedding to 10 during retrieval. The retrieval process then completed successfully, with no empty responses in the logs. However, after running the combine step, I found that the suspicious files obtained are not exactly the same as the officially provided ones. For example, in the following three cases:

Official:
{"instance_id": "astropy__astropy-13033", "found_files": ["astropy/timeseries/core.py", "astropy/timeseries/binned.py", "astropy/timeseries/downsample.py", "astropy/timeseries/sampled.py"], "additional_artifact_loc_file": {}, "file_traj": {}}
Mine:
{"instance_id": "astropy__astropy-13033", "found_files": ["astropy/timeseries/core.py", "astropy/timeseries/binned.py", "astropy/timeseries/__init__.py", "astropy/time/core.py"], "additional_artifact_loc_file": {}, "file_traj": {}}

Official:
{"instance_id": "django__django-11087", "found_files": ["django/db/models/deletion.py", "django/db/models/query.py", "django/db/models/sql/compiler.py", "django/db/backends/mysql/operations.py", "django/db/backends/mysql/validation.py"], "additional_artifact_loc_file": {}, "file_traj": {}}
Mine:
{"instance_id": "django__django-11087", "found_files": ["django/db/models/deletion.py", "django/db/models/query.py", "django/db/backends/mysql/base.py", "django/db/backends/mysql/operations.py", "django/db/backends/mysql/validation.py"], "additional_artifact_loc_file": {}, "file_traj": {}}

Official:
{"instance_id": "sphinx-doc__sphinx-7748", "found_files": ["sphinx/ext/autodoc/__init__.py", "sphinx/ext/autodoc/directive.py", "sphinx/util/docstrings.py", "sphinx/ext/autodoc/type_comment.py", "sphinx/application.py"], "additional_artifact_loc_file": {}, "file_traj": {}}
Mine:
{"instance_id": "sphinx-doc__sphinx-7748", "found_files": ["sphinx/ext/autodoc/__init__.py", "sphinx/ext/autodoc/directive.py", "sphinx/util/docstrings.py", "sphinx/ext/autodoc/type_comment.py", "sphinx/application.py"], "additional_artifact_loc_file": {}, "file_traj": {}}

@brutalsavage
Contributor

It's definitely possible to see slight differences across runs due to non-deterministic embeddings.

Also, there doesn't seem to be any difference in the last example you posted.
