3 changes: 3 additions & 0 deletions .azure-pipelines/azure-pipelines-linux.yml

(generated file; diff not rendered)

3 changes: 3 additions & 0 deletions .azure-pipelines/azure-pipelines-osx.yml

(generated file; diff not rendered)

2 changes: 1 addition & 1 deletion .ci_support/linux_64_python3.10.____cpython.yaml
@@ -19,7 +19,7 @@ pin_run_as_build:
 python:
 - 3.10.* *_cpython
 pytorch:
-- '2.0'
+- '2.1'
 target_platform:
 - linux-64
 zlib:
2 changes: 1 addition & 1 deletion .ci_support/linux_64_python3.11.____cpython.yaml
@@ -19,7 +19,7 @@ pin_run_as_build:
 python:
 - 3.11.* *_cpython
 pytorch:
-- '2.0'
+- '2.1'
 target_platform:
 - linux-64
 zlib:
2 changes: 1 addition & 1 deletion .ci_support/linux_64_python3.8.____cpython.yaml
@@ -19,7 +19,7 @@ pin_run_as_build:
 python:
 - 3.8.* *_cpython
 pytorch:
-- '2.0'
+- '2.1'
 target_platform:
 - linux-64
 zlib:
2 changes: 1 addition & 1 deletion .ci_support/linux_64_python3.9.____cpython.yaml
@@ -19,7 +19,7 @@ pin_run_as_build:
 python:
 - 3.9.* *_cpython
 pytorch:
-- '2.0'
+- '2.1'
 target_platform:
 - linux-64
 zlib:
@@ -2,6 +2,6 @@ __migrator:
   build_number: 1
   kind: version
   migration_number: 1
 aws_sdk_cpp:
 - 1.11.182
-migrator_ts: 1697695685.8141418
+migrator_ts: 1699325293.519726
 pytorch:
 - '2.1'
2 changes: 1 addition & 1 deletion .ci_support/osx_64_python3.10.____cpython.yaml
@@ -19,7 +19,7 @@ pin_run_as_build:
 python:
 - 3.10.* *_cpython
 pytorch:
-- '2.0'
+- '2.1'
 target_platform:
 - osx-64
 zlib:
2 changes: 1 addition & 1 deletion .ci_support/osx_64_python3.11.____cpython.yaml
@@ -19,7 +19,7 @@ pin_run_as_build:
 python:
 - 3.11.* *_cpython
 pytorch:
-- '2.0'
+- '2.1'
 target_platform:
 - osx-64
 zlib:
2 changes: 1 addition & 1 deletion .ci_support/osx_64_python3.8.____cpython.yaml
@@ -19,7 +19,7 @@ pin_run_as_build:
 python:
 - 3.8.* *_cpython
 pytorch:
-- '2.0'
+- '2.1'
 target_platform:
 - osx-64
 zlib:
2 changes: 1 addition & 1 deletion .ci_support/osx_64_python3.9.____cpython.yaml
@@ -19,7 +19,7 @@ pin_run_as_build:
 python:
 - 3.9.* *_cpython
 pytorch:
-- '2.0'
+- '2.1'
 target_platform:
 - osx-64
 zlib:
17 changes: 13 additions & 4 deletions .scripts/build_steps.sh

(generated file; diff not rendered)

3 changes: 3 additions & 0 deletions .scripts/run_docker_build.sh

(generated file; diff not rendered)

15 changes: 11 additions & 4 deletions .scripts/run_osx_build.sh

(generated file; diff not rendered)

12 changes: 7 additions & 5 deletions recipe/meta.yaml
@@ -1,15 +1,15 @@
-{% set version = "0.6.1" %}
+{% set version = "0.7.0" %}
 
 package:
   name: torchdata
   version: {{ version }}
 
 source:
   url: https://github.com/pytorch/data/archive/refs/tags/v{{ version }}.tar.gz
-  sha256: c596db251c5e6550db3f00e4308ee7112585cca4d6a1c82a433478fd86693257
+  sha256: 0b444719c3abc67201ed0fea92ea9c4100e7f36551ba0d19a09446cc11154eb3
 
 build:
-  number: 2
+  number: 0
   # no pytorch on windows in conda-forge, see
   # https://github.com/conda-forge/pytorch-cpu-feedstock/issues/32
   skip: true # [win]
@@ -39,6 +39,7 @@ requirements:
     - requests
     - urllib3 >=1.25
     - pytorch
+    - cryptography <40.0.2,>=3.3.2
Member

I don't like capping cryptography; that's an extremely low-level but highly security-critical package, with a very conservative evolution speed. My preference would be to patch out the upper bound completely.

And if there's a good argument for not doing that, then at least please order this sensibly, i.e.

- cryptography >=3.3.2,<40.0.2
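
A hedged sketch of what "patch out the upper bound" could look like in recipe/meta.yaml, assuming the bound comes in via a requirements file in the torchdata source tree; the patches: key is standard conda-build, but the patch file name here is hypothetical, not something from this PR:

source:
  url: https://github.com/pytorch/data/archive/refs/tags/v{{ version }}.tar.gz
  sha256: 0b444719c3abc67201ed0fea92ea9c4100e7f36551ba0d19a09446cc11154eb3
  patches:
    - patches/0001-drop-cryptography-upper-bound.patch  # hypothetical patch file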

Contributor

The cap comes from awscli, not torchdata directly. The awscli side doesn't run a pip check, so this isn't caught there.
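
For context, a mismatch like this only surfaces in feedstocks that run pip check during testing; a minimal sketch of that check, with placement assumed rather than copied from this feedstock:

test:
  requires:
    - pip
  commands:
    - pip check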

Member

Actually, should this cryptography pin be placed under test.requires instead of being listed as a runtime dependency?
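
A minimal sketch of that alternative, assuming the pin is only needed to satisfy pip check in the test environment rather than by torchdata at runtime:

test:
  requires:
    - cryptography >=3.3.2,<40.0.2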

Contributor

@weiji14 - I am not very familiar with torchdata. Feel free to take over this PR if you have any ideas how to push it forward.

Member
@weiji14 weiji14 Nov 23, 2023

Ok, let me open another PR to add myself to the recipe maintainer list (#21).


 test:
   imports:
@@ -64,11 +65,12 @@ test:
     {% set tests_to_skip = tests_to_skip + " or test_fsspec_memory_list" %}
     {% set tests_to_skip = tests_to_skip + " or test_elastic_training_dl1_backend_gloo" %}
     {% set tests_to_skip = tests_to_skip + " or test_elastic_training_dl2_backend_gloo" %}
+    # fails because fsspec is not available (AWS S3 stuff)
+    {% set tests_to_skip = tests_to_skip + " or test_fsspec_io_iterdatapipe" %}
+    {% set tests_to_skip = tests_to_skip + " or test_s3_io_iterdatapipe" %}
Comment on lines +68 to +70

Member

Well, why not add fsspec as a test dependency then?

Member

Ah, nevermind, fsspec is already there. Could you explain what you mean by "fails because fsspec is not available" then?

Contributor

I've got no idea why it fails; I was assuming it was because fsspec was not a dep.

Member
@weiji14 weiji14 Nov 26, 2023

Tried re-enabling those tests in 089dc2b. This is the traceback from https://dev.azure.com/conda-forge/feedstock-builds/_build/results?buildId=830806&view=logs&j=4f922444-fdfe-5dcf-b824-02f86439ef14&t=937c195f-508d-5135-dc9f-d4b5730df0f7&l=1292:

=================================== FAILURES ===================================
_______________ TestDataPipeRemoteIO.test_fsspec_io_iterdatapipe _______________

self = <test_remote_io.TestDataPipeRemoteIO testMethod=test_fsspec_io_iterdatapipe>

    @skipIfNoFSSpecS3
    def test_fsspec_io_iterdatapipe(self):
        input_list = [
            ["s3://ai2-public-datasets"],  # bucket without '/'
            ["s3://ai2-public-datasets/charades/"],  # bucket with '/'
            [
                "s3://ai2-public-datasets/charades/Charades_v1.zip",
                "s3://ai2-public-datasets/charades/Charades_v1_flow.tar",
                "s3://ai2-public-datasets/charades/Charades_v1_rgb.tar",
                "s3://ai2-public-datasets/charades/Charades_v1_480.zip",
            ],  # multiple files
        ]
        for urls in input_list:
            fsspec_lister_dp = FSSpecFileLister(IterableWrapper(urls), anon=True)
            self.assertEqual(
>               sum(1 for _ in fsspec_lister_dp), self.__get_s3_cnt(urls, recursive=False), f"{urls} failed"
            )

test_remote_io.py:278: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
test_remote_io.py:253: in __get_s3_cnt
    res = subprocess.run(aws_cmd, shell=True, check=True, capture_output=True)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

input = None, capture_output = True, timeout = None, check = True
popenargs = ('aws --output json s3api list-objects  --bucket ai2-public-datasets --no-sign-request --delimiter /',)
kwargs = {'shell': True, 'stderr': -1, 'stdout': -1}
process = <Popen: returncode: 255 args: 'aws --output json s3api list-objects  --bucke...>
stdout = b''
stderr = b'\n<botocore.awsrequest.AWSRequest object at 0x7fbea3642dd0>\n'
retcode = 255

    def run(*popenargs,
            input=None, capture_output=False, timeout=None, check=False, **kwargs):
        """Run command with arguments and return a CompletedProcess instance.
    
        The returned instance will have attributes args, returncode, stdout and
        stderr. By default, stdout and stderr are not captured, and those attributes
        will be None. Pass stdout=PIPE and/or stderr=PIPE in order to capture them,
        or pass capture_output=True to capture both.
    
        If check is True and the exit code was non-zero, it raises a
        CalledProcessError. The CalledProcessError object will have the return code
        in the returncode attribute, and output & stderr attributes if those streams
        were captured.
    
        If timeout is given, and the process takes too long, a TimeoutExpired
        exception will be raised.
    
        There is an optional argument "input", allowing you to
        pass bytes or a string to the subprocess's stdin.  If you use this argument
        you may not also use the Popen constructor's "stdin" argument, as
        it will be used internally.
    
        By default, all communication is in bytes, and therefore any "input" should
        be bytes, and the stdout and stderr will be bytes. If in text mode, any
        "input" should be a string, and stdout and stderr will be strings decoded
        according to locale encoding, or by "encoding" if set. Text mode is
        triggered by setting any of text, encoding, errors or universal_newlines.
    
        The other arguments are the same as for the Popen constructor.
        """
        if input is not None:
            if kwargs.get('stdin') is not None:
                raise ValueError('stdin and input arguments may not both be used.')
            kwargs['stdin'] = PIPE
    
        if capture_output:
            if kwargs.get('stdout') is not None or kwargs.get('stderr') is not None:
                raise ValueError('stdout and stderr arguments may not be used '
                                 'with capture_output.')
            kwargs['stdout'] = PIPE
            kwargs['stderr'] = PIPE
    
        with Popen(*popenargs, **kwargs) as process:
            try:
                stdout, stderr = process.communicate(input, timeout=timeout)
            except TimeoutExpired as exc:
                process.kill()
                if _mswindows:
                    # Windows accumulates the output in a single blocking
                    # read() call run on child threads, with the timeout
                    # being done in a join() on those threads.  communicate()
                    # _after_ kill() is required to collect that and add it
                    # to the exception.
                    exc.stdout, exc.stderr = process.communicate()
                else:
                    # POSIX _communicate already populated the output so
                    # far into the TimeoutExpired exception.
                    process.wait()
                raise
            except:  # Including KeyboardInterrupt, communicate handled that.
                process.kill()
                # We don't call process.wait() as .__exit__ does that for us.
                raise
            retcode = process.poll()
            if check and retcode:
>               raise CalledProcessError(retcode, process.args,
                                         output=stdout, stderr=stderr)
E               subprocess.CalledProcessError: Command 'aws --output json s3api list-objects  --bucket ai2-public-datasets --no-sign-request --delimiter /' returned non-zero exit status 255.

../../_test_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placeh/lib/python3.11/subprocess.py:571: CalledProcessError
_________________ TestDataPipeRemoteIO.test_s3_io_iterdatapipe _________________

self = <test_remote_io.TestDataPipeRemoteIO testMethod=test_s3_io_iterdatapipe>

    @skipIfNoAWS
    @unittest.skipIf(IS_M1, "PyTorch M1 CI Machine doesn't allow accessing")
    def test_s3_io_iterdatapipe(self):
        # S3FileLister: different inputs
        input_list = [
            ["s3://ai2-public-datasets"],  # bucket without '/'
            ["s3://ai2-public-datasets/"],  # bucket with '/'
            ["s3://ai2-public-datasets/charades"],  # folder without '/'
            ["s3://ai2-public-datasets/charades/"],  # folder without '/'
            ["s3://ai2-public-datasets/charad"],  # prefix
            [
                "s3://ai2-public-datasets/charades/Charades_v1",
                "s3://ai2-public-datasets/charades/Charades_vu17",
            ],  # prefixes
            ["s3://ai2-public-datasets/charades/Charades_v1.zip"],  # single file
            [
                "s3://ai2-public-datasets/charades/Charades_v1.zip",
                "s3://ai2-public-datasets/charades/Charades_v1_flow.tar",
                "s3://ai2-public-datasets/charades/Charades_v1_rgb.tar",
                "s3://ai2-public-datasets/charades/Charades_v1_480.zip",
            ],  # multiple files
            [
                "s3://ai2-public-datasets/charades/Charades_v1.zip",
                "s3://ai2-public-datasets/charades/Charades_v1_flow.tar",
                "s3://ai2-public-datasets/charades/Charades_v1_rgb.tar",
                "s3://ai2-public-datasets/charades/Charades_v1_480.zip",
                "s3://ai2-public-datasets/charades/Charades_vu17",
            ],  # files + prefixes
        ]
        for input in input_list:
            s3_lister_dp = S3FileLister(IterableWrapper(input), region="us-west-2")
>           self.assertEqual(sum(1 for _ in s3_lister_dp), self.__get_s3_cnt(input), f"{input} failed")

test_remote_io.py:341: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
test_remote_io.py:253: in __get_s3_cnt
    res = subprocess.run(aws_cmd, shell=True, check=True, capture_output=True)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

input = None, capture_output = True, timeout = None, check = True
popenargs = ('aws --output json s3api list-objects  --bucket ai2-public-datasets --no-sign-request',)
kwargs = {'shell': True, 'stderr': -1, 'stdout': -1}
process = <Popen: returncode: 255 args: 'aws --output json s3api list-objects  --bucke...>
stdout = b''
stderr = b'\n<botocore.awsrequest.AWSRequest object at 0x7f56bdd8e790>\n'
retcode = 255

    [... same stdlib subprocess.run source frames as in the first traceback above ...]
            retcode = process.poll()
            if check and retcode:
>               raise CalledProcessError(retcode, process.args,
                                         output=stdout, stderr=stderr)
E               subprocess.CalledProcessError: Command 'aws --output json s3api list-objects  --bucket ai2-public-datasets --no-sign-request' returned non-zero exit status 255.

../../_test_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placeh/lib/python3.11/subprocess.py:571: CalledProcessError

Seems to be something related to opening the S3 objects on https://registry.opendata.aws/allenai-arc/? What's strange is that these tests fail on Linux but pass on OSX-64.
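
For anyone trying to reproduce this outside pytest, here is a minimal sketch of what the failing __get_s3_cnt helper shells out to; the command is taken from the traceback above, and the sketch assumes the aws CLI is installed in the environment (it is not the exact test code):

import subprocess

# Same AWS CLI call the test runs via subprocess.run; CI reported exit status 255.
aws_cmd = (
    "aws --output json s3api list-objects "
    "--bucket ai2-public-datasets --no-sign-request --delimiter /"
)
res = subprocess.run(aws_cmd, shell=True, capture_output=True)
print("returncode:", res.returncode)   # 255 in the Linux CI logs above
print("stderr:", res.stderr.decode())  # botocore AWSRequest repr in the CI logs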

     # tend to fail due to Google Drive rate-limiting
     {% set tests_to_skip = tests_to_skip + " or test_gdrive_iterdatapipe" %}
     {% set tests_to_skip = tests_to_skip + " or test_online_iterdatapipe" %}
-    # unclear this fails only on py<=39
-    {% set tests_to_skip = tests_to_skip + " or test_fsspec_io_iterdatapipe" %} # [py<=39]
     # test_audio_examples uses an uninstalled local folder ("examples");
     # avoid test_text_examples due to cycle since torchtext depends on torchdata
     - pytest -v --ignore=test_audio_examples.py --ignore=test_text_examples.py -k "not ({{ tests_to_skip }})"