SGE Tests segfault in CI #653
Comments
In #654 I've been playing around with skipping various tests and enabling them again. It seems like enabling any two of the tests results in the segfault. Enabling more than one only causes the error to appear once though.
I have a local reproducer now. Here are the steps I took to get it set up on my machine.

# Build SGE container
cd ci/sge
cp ../environment.yaml .
docker compose build
# Start SGE stack (based on ci/sge.sh)
./start-sge.sh
docker exec sge_master /bin/bash -c "chmod -R 777 /shared_space"
# Install dask-jobqueue in editable mode
docker exec sge_master conda run -n dask-jobqueue /bin/bash -c "cd /dask-jobqueue; pip install -e ."

I also installed

I then created a new test file with a single test that consistently reproduces the segfault.

# dask_jobqueue/tests/test_sge_segfault.py
from dask_jobqueue.sge import SGECluster
from dask.distributed import Client
import pytest


@pytest.mark.anyio
@pytest.mark.env("sge")
async def test_cluster():
    async with SGECluster(1, cores=1, memory="1GB", asynchronous=True) as cluster:
        async with Client(cluster, asynchronous=True):
            pass

Then you can run the test via

$ docker exec sge_master conda run -n dask-jobqueue /bin/bash -c "cd; pytest /dask-jobqueue/dask_jobqueue/tests/test_sge_segfault.py --verbose --full-trace -s -E sge"
*** Error in `/opt/anaconda/envs/dask-jobqueue/bin/python3.8': corrupted size vs. prev_size: 0x0000560d54c76aa0 ***
/bin/bash: line 1: 29477 Aborted (core dumped) pytest /dask-jobqueue/dask_jobqueue/tests/test_sge_segfault.py --verbose --full-trace -s -E sge
ERROR conda.cli.main_run:execute(125): `conda run /bin/bash -c cd; pytest /dask-jobqueue/dask_jobqueue/tests/test_sge_segfault.py --verbose --full-trace -s -E sge` failed. (See above for error)
============================= test session starts ==============================
platform linux -- Python 3.8.19, pytest-8.3.2, pluggy-1.5.0 -- /opt/anaconda/envs/dask-jobqueue/bin/python3.8
cachedir: .pytest_cache
rootdir: /dask-jobqueue
plugins: anyio-4.4.0
collecting ... collected 1 item
../dask-jobqueue/dask_jobqueue/tests/test_sge_segfault.py::test_cluster PASSED
============================== 1 passed in 1.07s ===============================
$ echo $?
134
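For anyone reading that exit status: by shell convention a process killed by signal N is reported as 128 + N, so 134 corresponds to signal 6 (SIGABRT), which matches the "Aborted (core dumped)" line in the output. A minimal sketch of that decoding, assuming a POSIX system; `describe_exit_status` is a hypothetical helper and not part of dask-jobqueue.

```python
import signal


def describe_exit_status(status: int) -> str:
    """Describe a shell-style exit status (hypothetical helper for illustration)."""
    if status > 128:
        # Shells conventionally report "killed by signal N" as 128 + N.
        try:
            return f"killed by {signal.Signals(status - 128).name}"
        except ValueError:
            return f"exit status {status} (no matching signal)"
    return f"exited with status {status}"


print(describe_exit_status(134))
```

A status above 128 can also come from a program that simply chose that exit code, so this is a heuristic rather than a guarantee.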
Since upgrading to Python 3.9 in CI this issue seems to have gone away. It's strange because I'm still able to reproduce some problems locally, but perhaps there is something cached that I'm not taking into account. Given that CI is all green and PRs and merges are passing consistently, I'm going to close this out.
Looks like a similar error happened when running CI for #660.
Perhaps it's not as resolved as I had hoped.
Still seeing this after bumping to Python 3.10.
Opening an issue to triage the segfault that seems to be happening in the SGE tests.
For some time the SGE tests have been failing. When you look at the logs of a recent run on main, they contain the following error.
I also opened #652 to bump the minimum Python version here to 3.9, and I see a similar issue happening but with a slightly different error.
Strangely, in both cases pytest reports everything has passed.
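Because the pytest summary can say PASSED while the process still aborts, the exit status of the process is the thing to check. A rough sketch of doing that from Python, assuming a POSIX system; the child command just calls `os.abort()` as a stand-in for the crashing test run, and `run_and_report` is a hypothetical name.

```python
import signal
import subprocess
import sys


def run_and_report(cmd):
    """Run cmd and return (returncode, description). Hypothetical helper."""
    proc = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    if proc.returncode < 0:
        # On POSIX, subprocess reports death by signal N as returncode -N.
        name = signal.Signals(-proc.returncode).name
        return proc.returncode, f"killed by {name}"
    return proc.returncode, f"exited with status {proc.returncode}"


# Stand-in for the real pytest invocation: a child process that aborts itself.
code, description = run_and_report([sys.executable, "-c", "import os; os.abort()"])
print(code, description)
```

Swapping the stand-in command for the actual pytest command line would flag a run whose tests pass but whose interpreter is killed by a signal.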