Azure integration tests failing #379

Open
TomAugspurger opened this issue Sep 12, 2022 · 0 comments
Labels
bug (Something isn't working), provider/azure/vm (Cluster provider for Azure Virtual Machines)

Comments

@TomAugspurger
Member

Discovered in #378, the Azure integration tests are failing:

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ Timeout +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Stack of ThreadPoolExecutor-2_0 (140540411098880) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  File "/opt/conda/lib/python3.8/threading.py", line 890, in _bootstrap
    self._bootstrap_inner()
  File "/opt/conda/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/opt/conda/lib/python3.8/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/conda/lib/python3.8/concurrent/futures/thread.py", line 78, in _worker
    work_item = work_queue.get(block=True)

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Stack of IO loop (140540401657600) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  File "/opt/conda/lib/python3.8/threading.py", line 890, in _bootstrap
    self._bootstrap_inner()
  File "/opt/conda/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/opt/conda/lib/python3.8/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/conda/lib/python3.8/site-packages/distributed/utils.py", line 499, in run_loop
    loop.start()
  File "/opt/conda/lib/python3.8/site-packages/tornado/platform/asyncio.py", line 199, in start
    self.asyncio_loop.run_forever()
  File "/opt/conda/lib/python3.8/asyncio/base_events.py", line 570, in run_forever
    self._run_once()
  File "/opt/conda/lib/python3.8/asyncio/base_events.py", line 1859, in _run_once
    handle._run()
  File "/opt/conda/lib/python3.8/asyncio/events.py", line 81, in _run
    self._context.run(self._callback, *self._args)
  File "/dask-cloudprovider/dask_cloudprovider/generic/vmcluster.py", line 339, in _start
    await super()._start()
  File "/opt/conda/lib/python3.8/site-packages/distributed/deploy/spec.py", line 309, in _start
    self.scheduler = await self.scheduler
  File "/opt/conda/lib/python3.8/site-packages/distributed/deploy/spec.py", line 64, in _
    await self.start()
  File "/dask-cloudprovider/dask_cloudprovider/generic/vmcluster.py", line 90, in start
    await self.wait_for_scheduler()
  File "/dask-cloudprovider/dask_cloudprovider/generic/vmcluster.py", line 50, in wait_for_scheduler
    while not is_socket_open(ip, port):
  File "/dask-cloudprovider/dask_cloudprovider/utils/socket.py", line 7, in is_socket_open
    connection.connect((ip, int(port)))

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ Timeout +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
FAILED
dask-cloudprovider/dask_cloudprovider/azure/tests/test_azurevm.py::test_create_rapids_cluster_sync ERROR
dask-cloudprovider/dask_cloudprovider/azure/tests/test_azurevm.py::test_render_cloud_init FAILED
dask-cloudprovider/dask_cloudprovider/azure/tests/test_azurevm.py::test_render_cloud_init ERROR

=================================================================================================================================== ERRORS ===================================================================================================================================
____________________________________________________________________________________________________________ ERROR at teardown of test_create_rapids_cluster_sync ____________________________________________________________________________________________________________

fixturedef = <FixtureDef argname='event_loop' scope='function' baseid=''>, request = <SubRequest 'event_loop' for <Function test_create_rapids_cluster_sync>>

    @pytest.hookimpl(trylast=True)
    def pytest_fixture_post_finalizer(fixturedef: FixtureDef, request: SubRequest) -> None:
        """Called after fixture teardown"""
        if fixturedef.argname == "event_loop":
            policy = asyncio.get_event_loop_policy()
            try:
                loop = policy.get_event_loop()
            except RuntimeError:
                loop = None
            if loop is not None:
                # Clean up existing loop to avoid ResourceWarnings
>               loop.close()

opt/conda/lib/python3.8/site-packages/pytest_asyncio/plugin.py:364:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
opt/conda/lib/python3.8/asyncio/unix_events.py:58: in close
    super().close()
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <_UnixSelectorEventLoop running=True closed=False debug=False>

    def close(self):
        if self.is_running():
>           raise RuntimeError("Cannot close a running event loop")
E           RuntimeError: Cannot close a running event loop

opt/conda/lib/python3.8/asyncio/selector_events.py:89: RuntimeError
________________________________________________________________________________________________________________ ERROR at teardown of test_render_cloud_init _________________________________________________________________________________________________________________

fixturedef = <FixtureDef argname='event_loop' scope='function' baseid=''>, request = <SubRequest 'event_loop' for <Function test_render_cloud_init>>

    @pytest.hookimpl(trylast=True)
    def pytest_fixture_post_finalizer(fixturedef: FixtureDef, request: SubRequest) -> None:
        """Called after fixture teardown"""
        if fixturedef.argname == "event_loop":
            policy = asyncio.get_event_loop_policy()
            try:
                loop = policy.get_event_loop()
            except RuntimeError:
                loop = None
            if loop is not None:
                # Clean up existing loop to avoid ResourceWarnings
>               loop.close()

opt/conda/lib/python3.8/site-packages/pytest_asyncio/plugin.py:364:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
opt/conda/lib/python3.8/asyncio/unix_events.py:58: in close
    super().close()
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <_UnixSelectorEventLoop running=True closed=False debug=False>

    def close(self):
        if self.is_running():
>           raise RuntimeError("Cannot close a running event loop")
E           RuntimeError: Cannot close a running event loop

opt/conda/lib/python3.8/asyncio/selector_events.py:89: RuntimeError
================================================================================================================================== FAILURES ==================================================================================================================================
______________________________________________________________________________________________________________________ test_create_rapids_cluster_sync _______________________________________________________________________________________________________________________

    @pytest.mark.asyncio
    @pytest.mark.timeout(1200)
    @skip_without_credentials
    @pytest.mark.external
    async def test_create_rapids_cluster_sync():

>       with AzureVMCluster(
            vm_size="Standard_NC12s_v3",
            docker_image="rapidsai/rapidsai:cuda11.0-runtime-ubuntu18.04-py3.8",
            worker_class="dask_cuda.CUDAWorker",
            worker_options={"rmm_pool_size": "15GB"},
        ) as cluster:

dask-cloudprovider/dask_cloudprovider/azure/tests/test_azurevm.py:88:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
dask-cloudprovider/dask_cloudprovider/azure/azurevm.py:570: in __init__
    super().__init__(debug=debug, **kwargs)
dask-cloudprovider/dask_cloudprovider/generic/vmcluster.py:297: in __init__
    super().__init__(**kwargs, security=self.security)
opt/conda/lib/python3.8/site-packages/distributed/deploy/spec.py:275: in __init__
    self.sync(self._start)
opt/conda/lib/python3.8/site-packages/distributed/utils.py:338: in sync
    return sync(
opt/conda/lib/python3.8/site-packages/distributed/utils.py:401: in sync
    wait(10)
opt/conda/lib/python3.8/site-packages/distributed/utils.py:390: in wait
    return e.wait(timeout)
opt/conda/lib/python3.8/threading.py:558: in wait
    signaled = self._cond.wait(timeout)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <Condition(<unlocked _thread.lock object at 0x7fd20dd94420>, 0)>, timeout = 10

    def wait(self, timeout=None):
        """Wait until notified or until a timeout occurs.

        If the calling thread has not acquired the lock when this method is
        called, a RuntimeError is raised.

        This method releases the underlying lock, and then blocks until it is
        awakened by a notify() or notify_all() call for the same condition
        variable in another thread, or until the optional timeout occurs. Once
        awakened or timed out, it re-acquires the lock and returns.

        When the timeout argument is present and not None, it should be a
        floating point number specifying a timeout for the operation in seconds
        (or fractions thereof).

        When the underlying lock is an RLock, it is not released using its
        release() method, since this may not actually unlock the lock when it
        was acquired multiple times recursively. Instead, an internal interface
        of the RLock class is used, which really unlocks it even when it has
        been recursively acquired several times. Another internal interface is
        then used to restore the recursion level when the lock is reacquired.

        """
        if not self._is_owned():
            raise RuntimeError("cannot wait on un-acquired lock")
        waiter = _allocate_lock()
        waiter.acquire()
        self._waiters.append(waiter)
        saved_state = self._release_save()
        gotit = False
        try:    # restore state no matter what (e.g., KeyboardInterrupt)
            if timeout is None:
                waiter.acquire()
                gotit = True
            else:
                if timeout > 0:
>                   gotit = waiter.acquire(True, timeout)
E                   Failed: Timeout >1200.0s

opt/conda/lib/python3.8/threading.py:306: Failed
___________________________________________________________________________________________________________________________ test_render_cloud_init ___________________________________________________________________________________________________________________________

args = (), kwargs = {}, coro = <coroutine object test_render_cloud_init at 0x7fd20da17640>

    @functools.wraps(func)
    def inner(*args, **kwargs):
        coro = func(*args, **kwargs)
        if not inspect.isawaitable(coro):
            pyfuncitem.warn(
                pytest.PytestWarning(
                    f"The test {pyfuncitem} is marked with '@pytest.mark.asyncio' "
                    "but it is not an async function. "
                    "Please remove asyncio marker. "
                    "If the test is not marked explicitly, "
                    "check for global markers applied via 'pytestmark'."
                )
            )
            return
>       task = asyncio.ensure_future(coro, loop=_loop)

opt/conda/lib/python3.8/site-packages/pytest_asyncio/plugin.py:452:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
opt/conda/lib/python3.8/asyncio/tasks.py:672: in ensure_future
    task = loop.create_task(coro_or_future)
opt/conda/lib/python3.8/asyncio/base_events.py:429: in create_task
    self._check_closed()
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <_UnixSelectorEventLoop running=False closed=True debug=False>

    def _check_closed(self):
        if self._closed:
>           raise RuntimeError('Event loop is closed')
E           RuntimeError: Event loop is closed

opt/conda/lib/python3.8/asyncio/base_events.py:508: RuntimeError
============================================================================================================================== warnings summary ==============================================================================================================================
dask_cloudprovider/azure/tests/test_azurevm.py::test_create_cluster
dask_cloudprovider/azure/tests/test_azurevm.py::test_create_cluster_sync
  /opt/conda/lib/python3.8/contextlib.py:120: UserWarning: Creating your cluster is taking a surprisingly long time. This is likely due to pending resources. Hang tight!
    next(self.gen)

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
========================================================================================================================== short test summary info ===========================================================================================================================
FAILED dask-cloudprovider/dask_cloudprovider/azure/tests/test_azurevm.py::test_create_rapids_cluster_sync - Failed: Timeout >1200.0s
FAILED dask-cloudprovider/dask_cloudprovider/azure/tests/test_azurevm.py::test_render_cloud_init - RuntimeError: Event loop is closed
ERROR dask-cloudprovider/dask_cloudprovider/azure/tests/test_azurevm.py::test_create_rapids_cluster_sync - RuntimeError: Cannot close a running event loop
ERROR dask-cloudprovider/dask_cloudprovider/azure/tests/test_azurevm.py::test_render_cloud_init - RuntimeError: Cannot close a running event loop
======================================================================================================= 2 failed, 3 passed, 2 warnings, 2 errors in 2228.83s (0:37:08) =======================================================================================================
sys:1: RuntimeWarning: coroutine 'test_render_cloud_init' was never awaited
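
The two `RuntimeError` messages in the summary may be knock-on effects of the timeout rather than independent bugs, though that is not confirmed by the log alone. As a minimal sketch of where each message comes from in plain `asyncio` (this is not a reproduction of the failing tests, and the helper names are made up for illustration):

```python
import asyncio


def close_running_loop_error() -> str:
    """Try to close a loop from inside a coroutine it is currently running."""
    loop = asyncio.new_event_loop()
    captured = {}

    async def main():
        try:
            loop.close()  # the loop is running at this point
        except RuntimeError as exc:
            captured["error"] = str(exc)

    loop.run_until_complete(main())
    loop.close()  # closing works once the loop has stopped
    return captured["error"]


def closed_loop_error() -> str:
    """Try to schedule a coroutine on a loop that has already been closed."""
    loop = asyncio.new_event_loop()
    loop.close()

    async def noop():
        pass

    coro = noop()
    try:
        loop.create_task(coro)
    except RuntimeError as exc:
        return str(exc)
    finally:
        coro.close()  # avoid a "coroutine was never awaited" warning
    return ""


if __name__ == "__main__":
    print(close_running_loop_error())
    print(closed_loop_error())
```

The first helper matches the message raised at fixture teardown, and the second matches the one raised when `test_render_cloud_init` tried to create its task.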

I'll look into this once I can figure out how to enable all the logs in Azure.
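
In the timeout dump, the IO loop thread is sitting in `is_socket_open`, called from the `while not is_socket_open(ip, port)` loop in `wait_for_scheduler`. For anyone debugging a similar hang, here is a hedged sketch of a port check with a per-attempt timeout and an overall deadline. The names `is_port_open` and `wait_for_port` and all default values are hypothetical; this is not the dask-cloudprovider implementation.

```python
import socket
import time


def is_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, int(port)), timeout=timeout):
            return True
    except OSError:  # covers refused connections, unreachable hosts, and timeouts
        return False


def wait_for_port(host: str, port: int, deadline: float = 60.0, interval: float = 1.0) -> bool:
    """Poll until the port accepts connections or `deadline` seconds have elapsed."""
    end = time.monotonic() + deadline
    while time.monotonic() < end:
        if is_port_open(host, port):
            return True
        time.sleep(interval)
    return False
```

Returning `False` after the deadline lets the caller raise a descriptive error instead of waiting for the test-level timeout. Inside a coroutine, `time.sleep` would block the event loop, so an async variant would need `await asyncio.sleep(interval)` instead.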

@jacobtomlinson added the bug and provider/azure/vm labels on Sep 27, 2022