Issue with pymanopt #1606

Merged · 3 commits merged into staging on Jan 12, 2022
Conversation

miguelgfierro (Collaborator) commented Jan 10, 2022:

Description

Fixes a bug in https://github.com/microsoft/recommenders/runs/4763168801?check_suite_focus=true and fixes PR #1605.

Related Issues

Checklist:

  • I have followed the contribution guidelines and code style for this project.
  • I have added tests covering my contributions.
  • I have updated the documentation accordingly.
  • This PR is being made to staging branch and not to main branch.

miguelgfierro (Collaborator, Author) commented:

Spark tests are all failing:

        with TemporaryDirectory(dir=tmp_path_factory.getbasetemp()) as td:
            config = {
                "spark.local.dir": td,
                "spark.sql.shuffle.partitions": 1,
                "spark.sql.crossJoin.enabled": "true",
            }
>           spark = start_or_get_spark(app_name=app_name, url=url, config=config)

tests/conftest.py:85: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
recommenders/utils/spark_utils.py:69: in start_or_get_spark
    return eval(".".join(spark_opts))
.tox/spark/lib/python3.7/site-packages/pyspark/sql/session.py:228: in getOrCreate
    sc = SparkContext.getOrCreate(sparkConf)
.tox/spark/lib/python3.7/site-packages/pyspark/context.py:384: in getOrCreate
    SparkContext(conf=conf or SparkConf())
.tox/spark/lib/python3.7/site-packages/pyspark/context.py:147: in __init__
    conf, jsc, profiler_cls)
.tox/spark/lib/python3.7/site-packages/pyspark/context.py:209: in _do_init
    self._jsc = jsc or self._initialize_context(self._conf._jconf)
.tox/spark/lib/python3.7/site-packages/pyspark/context.py:321: in _initialize_context
    return self._jvm.JavaSparkContext(jconf)
.tox/spark/lib/python3.7/site-packages/py4j/java_gateway.py:1569: in __call__
    answer, self._gateway_client, None, self._fqn)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

answer = 'xro21'
gateway_client = <py4j.java_gateway.GatewayClient object at 0x7fed1a1c38d0>
target_id = None, name = 'org.apache.spark.api.java.JavaSparkContext'

    def get_return_value(answer, gateway_client, target_id=None, name=None):
        """Converts an answer received from the Java gateway into a Python object.
    
        For example, string representation of integers are converted to Python
        integer, string representation of objects are converted to JavaObject
        instances, etc.
    
        :param answer: the string returned by the Java gateway
        :param gateway_client: the gateway client used to communicate with the Java
            Gateway. Only necessary if the answer is a reference (e.g., object,
            list, map)
        :param target_id: the name of the object from which the answer comes from
            (e.g., *object1* in `object1.hello()`). Optional.
        :param name: the name of the member from which the answer comes from
            (e.g., *hello* in `object1.hello()`). Optional.
        """
        if is_error(answer)[0]:
            if len(answer) > 1:
                type = answer[1]
                value = OUTPUT_CONVERTER[type](answer[2:], gateway_client)
                if answer[1] == REFERENCE_TYPE:
                    raise Py4JJavaError(
                        "An error occurred while calling {0}{1}{2}.\n".
>                       format(target_id, ".", name), value)
E                   py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
E                   : java.net.BindException: Cannot assign requested address: Service 'sparkDriver' failed after 16 retries (on a random free port)! Consider explicitly setting the appropriate binding address for the service 'sparkDriver' (for example spark.driver.bindAddress for SparkDriver) to the correct binding address.
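As a side note, the BindException above itself suggests pinning the driver bind address. A minimal sketch of what that could look like in the conftest.py fixture shown above (the 127.0.0.1 value is an assumption, not part of this PR):

    with TemporaryDirectory(dir=tmp_path_factory.getbasetemp()) as td:
        config = {
            "spark.local.dir": td,
            "spark.sql.shuffle.partitions": 1,
            "spark.sql.crossJoin.enabled": "true",
            # hypothetical workaround: bind the driver explicitly, as the error message suggests
            "spark.driver.bindAddress": "127.0.0.1",
        }
        spark = start_or_get_spark(app_name=app_name, url=url, config=config)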

@laserprec do you know if there was any change in the spark config?

laserprec (Contributor) commented:

> @laserprec do you know if there was any change in the spark config?

Hmm, not that I am aware of. I am curious why it is trying to connect to "None.org.apache.spark.api.java.JavaSparkContext".

miguelgfierro (Collaborator, Author) commented:

Bufff, another error, this time in the GPU tests:

tests/unit/recommenders/models/test_deeprec_model.py .......             [ 14%]
tests/unit/recommenders/models/test_deeprec_utils.py ....                [ 22%]
tests/unit/recommenders/models/test_ncf_singlenode.py ..............     [ 50%]
tests/unit/recommenders/models/test_newsrec_model.py ....                [ 58%]
tests/unit/recommenders/models/test_newsrec_utils.py ....                [ 66%]
tests/unit/recommenders/models/test_rbm.py ...                           [ 72%]
tests/unit/recommenders/models/test_wide_deep_utils.py ...               [ 78%]
tests/unit/recommenders/utils/test_gpu_utils.py FFs..FF                  [ 92%]
tests/unit/recommenders/utils/test_tf_utils.py ....                      [100%]

=================================== FAILURES ===================================
______________________________ test_get_gpu_info _______________________________

    @pytest.mark.gpu
    def test_get_gpu_info():
>       assert len(get_gpu_info()) >= 1
E       assert 0 >= 1
E        +  where 0 = len([])
E        +    where [] = get_gpu_info()

tests/unit/recommenders/utils/test_gpu_utils.py:24: AssertionError
------------------------------ Captured log call -------------------------------
17:17:19 ERROR Call to cuInit results in UNKNOWN_CUDA_ERROR
_____________________________ test_get_number_gpus _____________________________

    @pytest.mark.gpu
    def test_get_number_gpus():
>       assert get_number_gpus() >= 1
E       assert 0 >= 1
E        +  where 0 = get_number_gpus()

tests/unit/recommenders/utils/test_gpu_utils.py:29: AssertionError
_____________________________ test_tensorflow_gpu ______________________________

    @pytest.mark.gpu
    def test_tensorflow_gpu():
>       assert tf.test.is_gpu_available()
E       AssertionError: assert False
E        +  where False = <function is_gpu_available at 0x7f0c0d752710>()
E        +    where <function is_gpu_available at 0x7f0c0d752710> = <module 'tensorflow._api.v2.test' from '/home/runner/work/recommenders/recommenders/.tox/gpu/lib/python3.7/site-packages/tensorflow/_api/v2/test/__init__.py'>.is_gpu_available
E        +      where <module 'tensorflow._api.v2.test' from '/home/runner/work/recommenders/recommenders/.tox/gpu/lib/python3.7/site-packages/tensorflow/_api/v2/test/__init__.py'> = tf.test

tests/unit/recommenders/utils/test_gpu_utils.py:51: AssertionError
------------------------------ Captured log call -------------------------------
17:17:20 WARNING From /home/runner/work/recommenders/recommenders/tests/unit/recommenders/utils/test_gpu_utils.py:51: is_gpu_available (from tensorflow.python.framework.test_util) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.config.list_physical_devices('GPU')` instead.
_______________________________ test_pytorch_gpu _______________________________

    @pytest.mark.gpu
    def test_pytorch_gpu():
>       assert torch.cuda.is_available()
E       AssertionError: assert False
E        +  where False = <function is_available at 0x7f0ba6750950>()
E        +    where <function is_available at 0x7f0ba6750950> = <module 'torch.cuda' from '/home/runner/work/recommenders/recommenders/.tox/gpu/lib/python3.7/site-packages/torch/cuda/__init__.py'>.is_available
E        +      where <module 'torch.cuda' from '/home/runner/work/recommenders/recommenders/.tox/gpu/lib/python3.7/site-packages/torch/cuda/__init__.py'> = torch.cuda
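For reference, a minimal sketch (not part of this PR) of the same GPU visibility checks using the non-deprecated TensorFlow call recommended by the warning above, plus the PyTorch check the test already uses:

    import tensorflow as tf
    import torch

    # Recommended replacement for tf.test.is_gpu_available(); returns [] when no GPU is visible
    print(tf.config.list_physical_devices("GPU"))
    # False when CUDA cannot be initialized on the machine
    print(torch.cuda.is_available())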

miguelgfierro (Collaborator, Author) commented Jan 11, 2022:

Now the numba test that detects the GPU is failing:

tests/unit/examples/test_notebooks_gpu.py F......                        [100%]

=================================== FAILURES ===================================
_________________________________ test_gpu_vm __________________________________

    @pytest.mark.notebooks
    @pytest.mark.gpu
    def test_gpu_vm():
>       assert get_number_gpus() >= 1
E       assert 0 >= 1
E        +  where 0 = get_number_gpus()

tests/unit/examples/test_notebooks_gpu.py:18: AssertionError

The ADO test of the GPU notebooks on Python 3.6 passes: https://dev.azure.com/best-practices/recommenders/_build/results?buildId=55750&view=results
However, the GitHub Actions run with Python 3.7 is failing. Could it be related to numba?
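A minimal sketch of checking what numba itself sees on the runner (assuming get_number_gpus relies on numba's CUDA detection, as the comment above suggests):

    from numba import cuda

    # True only if numba can load the CUDA driver
    print(cuda.is_available())
    # Number of CUDA devices numba detects (0 on the failing GitHub runner)
    print(len(cuda.gpus) if cuda.is_available() else 0)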

miguelgfierro (Collaborator, Author) commented:

The test runs on ADO, and when I install a Python 3.7 environment it just works:

$ pip list | grep -E 'numpy|numba|tensorflow'
numba                        0.54.1
numpy                        1.20.3
tensorflow                   2.7.0
tensorflow-estimator         2.7.0
tensorflow-io-gcs-filesystem 0.23.1

$ pytest tests/unit/recommenders/utils/test_gpu_utils.py::test_get_number_gpus
============================================================= slowest 10 durations ==============================================================
4.81s call     tests/unit/recommenders/utils/test_gpu_utils.py::test_get_number_gpus

(2 durations < 0.005s hidden.  Use -vv to show these durations.)
========================================================= 1 passed, 1 warning in 9.42s ==========================================================

miguelgfierro (Collaborator, Author) commented:

@laserprec I have been trying to debug the code; it looks like the problem only happens in GitHub Actions (see the messages above). Any idea where the problem could be?

miguelgfierro mentioned this pull request on Jan 11, 2022.
laserprec (Contributor) commented:

> @laserprec I have been trying to debug the code; it looks like the problem only happens in GitHub Actions (see the messages above). Any idea where the problem could be?

I think we've seen this before, and it could be that the NVIDIA driver is not available on the GitHub Actions runner (perhaps because Ubuntu auto-updated the NVIDIA drivers and the machine needs a restart to apply the changes).
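A minimal sketch (an assumption, not something run in this PR) of a driver-level check that bypasses TensorFlow/PyTorch; if nvidia-smi itself fails on the runner, that points at the driver rather than the test code:

    import subprocess

    # Query the NVIDIA driver directly; a non-zero return code (e.g. with a
    # "Driver/library version mismatch" message) suggests the driver needs a reload or reboot.
    result = subprocess.run(["nvidia-smi"], capture_output=True, text=True)
    print(result.returncode)
    print(result.stdout or result.stderr)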

@@ -236,6 +236,7 @@ def test_cornac_bpr_integration(
 @pytest.mark.integration
+@pytest.mark.experimental
Contributor commented on this change:

Haven't been following some of the latest code changes, but do we have this new pytest marker defined anywhere in our configuration?

miguelgfierro (Collaborator, Author) replied:

Right now there is no pipeline for the experimental tests. As we improve the dependency installation, we will move some of these tests back into the normal pipelines (CPU, GPU, or Spark). Also, see #1606 (comment).
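For context, a minimal sketch (hypothetical; not necessarily how this repo registers its markers) of declaring the marker in conftest.py so pytest does not warn about an unknown mark:

    # conftest.py (sketch)
    def pytest_configure(config):
        config.addinivalue_line(
            "markers",
            "experimental: tests kept out of the regular cpu/gpu/spark pipelines",
        )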

laserprec (Contributor) left a review comment:

LGTM :), just a bit curious where this new pytest.mark.experimental marker is defined.

miguelgfierro (Collaborator, Author) commented:

> just a bit curious where this new pytest.mark.experimental marker is defined

This was a way to take out the dependencies that were causing conflicts in the pipeline; @anargyri can provide more context.

miguelgfierro (Collaborator, Author) commented Jan 12, 2022:


All ADO tests are passing, and the issue with the GitHub GPU test is fixed, thanks to @laserprec. The remaining Spark failure is a flaky test caused by a memory error, but the code works.

Merging

Preparing for release 🚀🚀🚀

miguelgfierro merged commit 7af6edd into staging on Jan 12, 2022.
miguelgfierro deleted the miguel/pymanopt branch on January 12, 2022 at 18:28.