io_msg lost #1

Open
changchengx opened this issue Jun 23, 2021 · 2 comments


changchengx commented Jun 23, 2021

Describe the issue

The client writes data to the server. After the client has successfully sent 9 million io_msgs to the server, the customer sees that the client's send_io_message returns normally (ucp_tag_send_nb returns NULL), but the server does not receive the io_msg.
Code: tags/v1.11-pre3


changchengx commented Jun 23, 2021

reproduce test 1

Steps to Reproduce

  • Reproduce model in io_demo

    one server host:

    1. Start one server process (single thread). Kill this server process after 10 minutes, then restart it.
    2. Repeat step 1 four times and check whether the server finds that the received io_msg::sn is not continuous.
    3. If step 2 does not reproduce the problem, run the server process without killing it.

    two client hosts, for every client host:

    1. Start one client process (single thread) to write data to the server.
    2. Once the EP is disconnected, delete the DemoClient (ucp_context & ucp_worker) and start a new DemoClient to write data to the server.
    3. The client stalls if it finds that the replied io_msg::sn is not continuous.
  • Reproduce implementation
    Full Code: https://github.com/changchengx/ucx/tree/mlnx_tag_v1.11_pre3_dbg
    Patch for tags/v1.11-pre3 : https://github.com/changchengx/ucx/compare/64df5168..fbd739d4
    Note: this patch has been verified to work as intended when sn is not continuous in the client or server process.

  • Manually execute the steps in the reproduce model
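
The sn continuity check that the server and client perform in the model above could look roughly like the sketch below. This is only an illustration, not the code from the linked patch: `SnChecker`, `check` and `reset` are hypothetical names, and it assumes that sn increases by exactly 1 per io_msg and that the first sn seen after a (re)connect is accepted as the starting point. It needs C++17 for `std::optional`.

```cpp
#include <cstdint>
#include <optional>

// Hypothetical helper for illustration; not part of io_demo or the patch.
class SnChecker {
public:
    // Returns true if sn continues the sequence (or is the first one seen),
    // false if a gap or repeat is detected. After a mismatch the checker
    // resynchronizes on the sn it just saw.
    bool check(uint64_t sn) {
        bool ok = !m_expected.has_value() || (sn == *m_expected);
        m_expected = sn + 1;
        return ok;
    }

    // Forget the expected sn, e.g. after the EP is disconnected and a new
    // DemoClient is started (assumption: numbering may restart then).
    void reset() {
        m_expected.reset();
    }

private:
    std::optional<uint64_t> m_expected; // next sn we expect, if any
};
```

A process would call `check()` on every received io_msg and stall (or log and stop) on the first `false`, which matches the "stall on non-continuous sn" behavior described in the steps.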

Reproduce result:

Problem is not reproduced
Detail: Each client sent about 3 billion io_msgs to the server without finding a non-continuous replied io_msg from the server. The server process was manually killed 4 times. On the 5th run the server process was kept running, and no non-continuous received io_msg was found.


changchengx commented Jun 23, 2021

reproduce test 2

Steps to Reproduce

  • Reproduce model in io_demo

    one server host:

    1. Start one server process (single thread). Kill this server process after 10 minutes, then restart it.
    2. Repeat step 1 until the server finds that the received io_msg::sn is not continuous.

    two client hosts, for every client host:

    1. Start one client process (single thread) to write data to the server.
    2. Once the EP is disconnected, delete the DemoClient (ucp_context & ucp_worker) and start a new DemoClient to write data to the server.
    3. The client stalls if it finds that the replied io_msg::sn is not continuous.
  • Reproduce implementation
    Full Code: https://github.com/changchengx/ucx/tree/mlnx_tag_v1.11_pre3_dbg
    Patch for tags/v1.11-pre3 : https://github.com/changchengx/ucx/compare/64df5168..fbd739d4

  • Reproduce script:

  1. Server host: run server_run.sh and monitor_server.sh
  2. Client host: run this command
  • How to check whether a non-continuous sn was hit when using the reproduce script above:
  1. If the server process hits the problem: monitor_server.sh exits and the server process stalls
  2. If the client process hits the problem: the client process stalls

Reproduce result:

Problem is not reproduced after the server was killed and restarted 117 times

changchengx pushed a commit that referenced this issue Apr 10, 2023