@kevingurney kevingurney commented Sep 18, 2023

Rationale for this change

To enable initial CSV I/O support, this PR adds arrow.io.csv.TableReader and arrow.io.csv.TableWriter MATLAB classes to the MATLAB interface.

What changes are included in this PR?

  1. Added a new arrow.io.csv.TableReader class
  2. Added a new arrow.io.csv.TableWriter class

Example

>> matlabTableWrite = array2table(rand(3))

matlabTableWrite =

  3×3 table

     Var1        Var2       Var3  
    _______    ________    _______

    0.91131    0.091595    0.24594
    0.51315     0.27368    0.62119
    0.42942     0.88665    0.49501

>> arrowTableWrite = arrow.table(matlabTableWrite)

arrowTableWrite = 

Var1: double
Var2: double
Var3: double
----
Var1:
  [
    [
      0.9113083542736461,
      0.5131490075412158,
      0.42942202968065213
    ]
  ]
Var2:
  [
    [
      0.09159480217154525,
      0.27367730380496647,
      0.8866478145458545
    ]
  ]
Var3:
  [
    [
      0.2459443412735529,
      0.6211893868708748,
      0.49500739584280073
    ]
  ]

>> writer = arrow.io.csv.TableWriter("example.csv")

writer = 

  TableWriter with properties:

    Filename: "example.csv"

>> writer.write(arrowTableWrite)

>> reader = arrow.io.csv.TableReader("example.csv")

reader = 

  TableReader with properties:

    Filename: "example.csv"

>> arrowTableRead = reader.read()

arrowTableRead = 

Var1: double
Var2: double
Var3: double
----
Var1:
  [
    [
      0.9113083542736461,
      0.5131490075412158,
      0.42942202968065213
    ]
  ]
Var2:
  [
    [
      0.09159480217154525,
      0.27367730380496647,
      0.8866478145458545
    ]
  ]
Var3:
  [
    [
      0.2459443412735529,
      0.6211893868708748,
      0.49500739584280073
    ]
  ]

>> matlabTableRead = table(arrowTableRead)

matlabTableRead =

  3×3 table

     Var1        Var2       Var3  
    _______    ________    _______

    0.91131    0.091595    0.24594
    0.51315     0.27368    0.62119
    0.42942     0.88665    0.49501

>> isequal(arrowTableRead, arrowTableWrite)

ans =

  logical

   1

>> isequal(matlabTableRead, matlabTableWrite)

ans =

  logical

   1

Are these changes tested?

Yes.

  1. Added new CSV I/O tests including test/arrow/io/csv/tRoundTrip.m and test/arrow/io/csv/tError.m.
  2. Both of these test classes inherit from a CSVTest superclass.
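
For illustration, a round-trip test in this style might look roughly like the sketch below. It is not the actual `tRoundTrip.m` from this PR: the class name `tRoundTripSketch`, the method name, and the file name `roundtrip.csv` are hypothetical, and it uses MATLAB's `TemporaryFolderFixture` rather than the `CSVTest` superclass, whose contents are not shown in this thread.

```matlab
classdef tRoundTripSketch < matlab.unittest.TestCase
    % Hypothetical sketch of a CSV round-trip test; not the test added in this PR.

    methods (Test)
        function NumericRoundTrip(testCase)
            import matlab.unittest.fixtures.TemporaryFolderFixture

            % Write into a temporary folder that is deleted when the test ends.
            fixture = testCase.applyFixture(TemporaryFolderFixture);
            filename = fullfile(fixture.Folder, "roundtrip.csv");

            % Build a small numeric MATLAB table and convert it to an Arrow table.
            matlabTableWrite = array2table(rand(3));
            arrowTableWrite = arrow.table(matlabTableWrite);

            % Write the Arrow table to CSV, then read it back.
            writer = arrow.io.csv.TableWriter(filename);
            writer.write(arrowTableWrite);

            reader = arrow.io.csv.TableReader(filename);
            arrowTableRead = reader.read();

            % The table read back should match the table that was written.
            testCase.verifyEqual(table(arrowTableRead), matlabTableWrite);
        end
    end
end
```

Assuming the class is saved as `tRoundTripSketch.m` on the MATLAB path, it could be run with `runtests("tRoundTripSketch")`.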

Are there any user-facing changes?

Yes.

  1. Users can now read and write CSV files using arrow.io.csv.TableReader and arrow.io.csv.TableWriter.
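
As a rough usage sketch, reading a CSV file into a MATLAB table while guarding against a failure could look like this. The file name is hypothetical, and this thread does not state which call raises an error or which error identifiers are used, so the `catch` block handles any `MException` generically.

```matlab
% Hypothetical input file; replace with a real path.
filename = "measurements.csv";

try
    % Construct the reader and read the whole file into an Arrow table.
    reader = arrow.io.csv.TableReader(filename);
    arrowTable = reader.read();

    % Convert the Arrow table to a MATLAB table, as in the example above.
    matlabTable = table(arrowTable);
    disp(head(matlabTable));
catch exception
    % The specific error identifier is not known here, so report it generically.
    fprintf("Could not read %s: %s (%s)\n", ...
        filename, exception.message, exception.identifier);
end
```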

Future Directions

  1. Expose options (https://github.com/apache/arrow/blob/main/cpp/src/arrow/csv/options.h) for controlling CSV reading and writing in MATLAB.
  2. Add more read/write tests for null value handling and other datatypes beyond numeric and string values.
  3. Add a RecordBatchReader and RecordBatchWriter for CSV.
  4. Add support for more I/O formats like Parquet, JSON, ORC, Arrow IPC, etc.

Notes

  1. Thank you @sgilmore10 for your help with this pull request!
  2. I chose to add both the TableReader and TableWriter in one pull request because it simplified testing. My apologies for the slightly lengthy pull request.

@kevingurney kevingurney requested a review from kou as a code owner September 18, 2023 15:57
@github-actions

⚠️ GitHub issue #37770 has been automatically assigned in GitHub to PR creator.

@sgilmore10 sgilmore10 left a comment

This looks good! Thanks for adding the base test class and test utilities. We can re-use these in the future for other file types!

kevingurney commented Sep 18, 2023

It looks like the ARROW_CSV component is currently not enabled in the Arrow C++ libraries ExternalProject build in the MATLAB CI workflow.

I just pushed a commit that enables this.

kevingurney commented Sep 18, 2023

Enabling building of the ARROW_CSV component caused some build failures due to RapidJSON being inaccessible when building integration tests:

https://github.com/apache/arrow/actions/runs/6225733757/job/16896790846?pr=37773#step:9:681

We could try to fix these build failures by explicitly installing RapidJSON into the GitHub Actions environments. However, since we were already considering removing GoogleTest support from the MATLAB build system, it might be better to use this opportunity to do that (the Python bindings did the same).

I have already confirmed, in a separate CI job in mathworks/arrow, that enabling ARROW_CSV without building the Arrow C++ library tests works on all platforms:

https://github.com/mathworks/arrow/actions/runs/6226385022

I'll start working on a separate PR to remove GoogleTest support from the CMake build system for the MATLAB interface. In the meantime, I will leave this PR open, and I will rebase once GoogleTest support has been removed.

@kevingurney

For reference, I am working on the changes required to remove GoogleTest support from the CMake build system of the MATLAB interface here: https://github.com/mathworks/arrow/tree/GH-37532.

@kevingurney

For reference, I've opened #37784 to remove GoogleTest support.

kevingurney added a commit that referenced this pull request Sep 19, 2023
…ke build system for the MATLAB interface (#37784)

### Rationale for this change

This pull request removes `GoogleTest` support from the CMake build system for the MATLAB interface.

1. `GoogleTest` support adds a lot of additional complexity to the CMake build system for the MATLAB interface, and we currently don't have any standalone C++ tests for the MATLAB interface code.
2. In order to use `GoogleTest` in the MATLAB CI workflows, we are currently relying on building the tests for the Arrow C++ libraries in order to "re-use" the `GoogleTest` binaries. This adds additional overhead to the MATLAB CI workflows.
3. If we want to test some internal C++ code for the MATLAB interface in the future, we can instead use a MEX function to call the code from a MATLAB test as suggested by @ kou in #37532 (comment).
4. There is [precedent for testing internal C++ code without GoogleTest for the Python bindings](#14117).
5. On a somewhat related note - removing `GoogleTest` support will help unblock #37773 as discussed in #37773 (comment).

### What changes are included in this PR?

1. Removed the `MATLAB_BUILD_TESTS` flag from the CMake build system for the MATLAB interface since there are no longer any C++ tests for the MATLAB interface to build.
2. Updated the `matlab_build.sh` CI workflow script to avoid building the tests for the Arrow C++ libraries and to no longer call `ctest`.
3. Updated the `README.md` for the MATLAB interface to no longer mention building or running C++ tests.
4. Updated the design document for the MATLAB Interface to no longer mention `GoogleTest` since we may end up testing internal C++ code using MEX function calls from MATLAB instead.
5. Removed placeholder C++ test (i.e. `placeholder_test.cc`).

### Are these changes tested?

Yes.

The MATLAB CI workflow is passing on all platforms.

### Are there any user-facing changes?

Yes.

There are no longer any C++ tests for the MATLAB interface. The `MATLAB_BUILD_TESTS` flag has been removed from the CMake build system to reflect this change. If a user supplies a value for `MATLAB_BUILD_TESTS` when building the MATLAB interface, the flag will be ignored by CMake.

### Future Directions

1. Add more developer-focused documentation on how to test C++ code via MEX function calls from MATLAB.

### Notes

1. In the future, we can consider testing internal C++ code using MEX function calls from MATLAB tests as suggested by @ kou in #37532 (comment). Currently, we don't have any C++ tests that need to be adapted to use this approach.
2. Thank you @ sgilmore10 for your help with this pull request!
* Closes: #37532

Lead-authored-by: Kevin Gurney
Co-authored-by: Sarah Gilmore
Signed-off-by: Kevin Gurney
@kevingurney

Update: the code to remove GoogleTest support from the MATLAB interface was merged in #37784.

Now that the Arrow C++ tests are no longer being built, enabling ARROW_CSV in the MATLAB build works without any issues.

At this point, these changes should be ready to merge.

@kou kou left a comment

+1

@github-actions github-actions bot added awaiting merge, awaiting changes, and awaiting change review and removed awaiting committer review, awaiting merge, awaiting changes, and awaiting change review labels Sep 20, 2023
@kevingurney

The failures for the Dev / Source Release and Merge Script CI workflow seem unrelated to this PR.

@kevingurney

CI failures are related to #37803.

@kevingurney

+1

@github-actions github-actions bot added awaiting change review and removed awaiting changes labels Sep 20, 2023
kevingurney and others added 17 commits September 20, 2023 14:14

2. Change `write` method to take in an `arrow.tabular.Table`, rather than an `arrow.tabular.RecordBatch`.
3. Add basic implementation of `arrow.io.csv.TableReader` class.

Co-authored-by: Sarah Gilmore

2. Create CSV test superclass.
3. Add tError test class.

Co-authored-by: Sarah Gilmore

…iter filename argument.

2. Add basic error tests for TableReader and TableWriter.

Co-authored-by: Sarah Gilmore

2. Add more error tests.

Co-authored-by: Sarah Gilmore
@kevingurney

#37808 has been merged.

I just rebased these changes on top of main so that the latest MATLAB CI workflows will run.

@kevingurney

The MATLAB CI workflows are passing now.

I will merge this PR.

@kevingurney

+1

@kevingurney kevingurney merged commit 2b34e37 into apache:main Sep 20, 2023
@kevingurney kevingurney deleted the GH-37770 branch September 20, 2023 18:31
@kevingurney kevingurney removed the awaiting change review label Sep 20, 2023
@conbench-apache-arrow

After merging your PR, Conbench analyzed the 5 benchmarking runs that have been run so far on merge-commit 2b34e37.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about possible false positives for unstable benchmarks that are known to sometimes produce them.

loicalleyne pushed a commit to loicalleyne/arrow that referenced this pull request Nov 13, 2023
…he CMake build system for the MATLAB interface (apache#37784)

loicalleyne pushed a commit to loicalleyne/arrow that referenced this pull request Nov 13, 2023
…AB classes (apache#37773)

### Rationale for this change

To enable initial CSV I/O support, this PR adds `arrow.io.csv.TableReader` and `arrow.io.csv.TableWriter` MATLAB classes to the MATLAB interface.

### What changes are included in this PR?

1. Added a new `arrow.io.csv.TableReader` class
2. Added a new `arrow.io.csv.TableWriter` class

**Example**
```matlab
>> matlabTableWrite = array2table(rand(3))

matlabTableWrite =

  3×3 table

     Var1        Var2       Var3  
    _______    ________    _______

    0.91131    0.091595    0.24594
    0.51315     0.27368    0.62119
    0.42942     0.88665    0.49501

>> arrowTableWrite = arrow.table(matlabTableWrite)

arrowTableWrite = 

Var1: double
Var2: double
Var3: double
----
Var1:
  [
    [
      0.9113083542736461,
      0.5131490075412158,
      0.42942202968065213
    ]
  ]
Var2:
  [
    [
      0.09159480217154525,
      0.27367730380496647,
      0.8866478145458545
    ]
  ]
Var3:
  [
    [
      0.2459443412735529,
      0.6211893868708748,
      0.49500739584280073
    ]
  ]

>> writer = arrow.io.csv.TableWriter("example.csv")

writer = 

  TableWriter with properties:

    Filename: "example.csv"

>> writer.write(arrowTableWrite)

>> reader = arrow.io.csv.TableReader("example.csv")

reader = 

  TableReader with properties:

    Filename: "example.csv"

>> arrowTableRead = reader.read()

arrowTableRead = 

Var1: double
Var2: double
Var3: double
----
Var1:
  [
    [
      0.9113083542736461,
      0.5131490075412158,
      0.42942202968065213
    ]
  ]
Var2:
  [
    [
      0.09159480217154525,
      0.27367730380496647,
      0.8866478145458545
    ]
  ]
Var3:
  [
    [
      0.2459443412735529,
      0.6211893868708748,
      0.49500739584280073
    ]
  ]

>> matlabTableRead = table(arrowTableRead)

matlabTableRead =

  3×3 table

     Var1        Var2       Var3  
    _______    ________    _______

    0.91131    0.091595    0.24594
    0.51315     0.27368    0.62119
    0.42942     0.88665    0.49501

>> isequal(arrowTableRead, arrowTableWrite)

ans =

  logical

   1

>> isequal(matlabTableRead, matlabTableWrite)

ans =

  logical

   1
```

### Are these changes tested?

Yes.

1. Added new CSV I/O tests including `test/arrow/io/csv/tRoundTrip.m` and `test/arrow/io/csv/tError.m`.
2. Both of these test classes inherit from a `CSVTest` superclass.

### Are there any user-facing changes?

Yes.

1. Users can now read and write CSV files using `arrow.io.csv.TableReader` and `arrow.io.csv.TableWriter`.

### Future Directions

1. Expose [options](https://github.com/apache/arrow/blob/main/cpp/src/arrow/csv/options.h) for controlling CSV reading and writing in MATLAB.
2. Add more read/write tests for null value handling and other datatypes beyond numeric and string values.
3. Add a `RecordBatchReader` and `RecordBatchWriter` for CSV.
4. Add support for more I/O formats like Parquet, JSON, ORC, Arrow IPC, etc.

### Notes

1. Thank you @ sgilmore10 for your help with this pull request!
2. I chose to add both the `TableReader` and `TableWriter` in one pull request because it simplified testing. My apologies for the slightly lengthy pull request.
* Closes: apache#37770

Lead-authored-by: Kevin Gurney
Co-authored-by: Sarah Gilmore
Signed-off-by: Kevin Gurney
dgreiss pushed a commit to dgreiss/arrow that referenced this pull request Feb 19, 2024
…he CMake build system for the MATLAB interface (apache#37784)

dgreiss pushed a commit to dgreiss/arrow that referenced this pull request Feb 19, 2024
…AB classes (apache#37773)

Development

Successfully merging this pull request may close these issues.

[MATLAB] Add CSV TableReader and TableWriter MATLAB classes
