Conversation

@luoyuxia commented Nov 5, 2025

What's Changed

New Features

Added Parquet Writer Support: Introduced a ParquetWriter class to write Arrow VectorSchemaRoot batches to Parquet files via JNI
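
A minimal usage sketch, modeled on the test code the author shares later in this thread (the file name and data here are illustrative, not part of the PR):

import java.io.FileOutputStream;
import java.util.Collections;
import org.apache.arrow.dataset.file.ParquetWriter;
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.IntVector;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.arrow.vector.types.pojo.ArrowType;
import org.apache.arrow.vector.types.pojo.Field;
import org.apache.arrow.vector.types.pojo.Schema;

public class ParquetWriterUsage {
    public static void main(String[] args) throws Exception {
        Schema schema = new Schema(
                Collections.singletonList(Field.nullable("id", new ArrowType.Int(32, true))));
        try (BufferAllocator allocator = new RootAllocator(Long.MAX_VALUE);
             FileOutputStream fos = new FileOutputStream("example.parquet");
             ParquetWriter writer = new ParquetWriter(fos, schema);
             VectorSchemaRoot root = VectorSchemaRoot.create(schema, allocator)) {
            root.allocateNew();
            IntVector id = (IntVector) root.getVector("id");
            for (int i = 0; i < 3; i++) {
                id.setSafe(i, i + 1);
            }
            root.setRowCount(3);
            writer.write(root); // one VectorSchemaRoot batch written to Parquet via JNI
        }
    }
}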

Implementation

C++ Side:

  • Implemented JavaOutputStreamAdapter to wrap Java OutputStream as Arrow OutputStream
  • Added JNI methods: nativeCreateParquetWriter, nativeWriteParquetBatch, nativeCloseParquetWriter
  • Implemented property builders to convert Java properties to C++ Parquet writer properties
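
On the Java side, the bridge presumably reduces to native method declarations along these lines. This is a hypothetical sketch: the three method names come from the list above, but the parameter lists are assumptions, not the PR's actual signatures.

import java.io.OutputStream;

// Hypothetical declarations; only the method names are from the PR.
final class ParquetWriterJniSketch {
    // Assumed to return an opaque handle to the native writer. The OutputStream
    // would be wrapped on the C++ side by JavaOutputStreamAdapter; the schema is
    // assumed to arrive as the address of an exported C Data Interface ArrowSchema.
    static native long nativeCreateParquetWriter(OutputStream out, long schemaAddress);

    // Assumed to take the address of an exported ArrowArray holding one record batch.
    static native void nativeWriteParquetBatch(long writerHandle, long arrayAddress);

    static native void nativeCloseParquetWriter(long writerHandle);
}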

This contains breaking changes.

Closes #735

luoyuxia force-pushed the support-write-arrow-batch branch from b815f6d to 99fea13 on November 10, 2025 06:31
luoyuxia force-pushed the support-write-arrow-batch branch from 99fea13 to 3cc6e16 on November 10, 2025 06:32
luoyuxia changed the title from "support write arrow record batch" to "GH-735: Support write arrow record batch" on Nov 10, 2025
@github-actions commented
Thank you for opening a pull request!

Please label the PR with one or more of:

  • bug-fix
  • chore
  • dependencies
  • documentation
  • enhancement

Also, add the 'breaking-change' label if appropriate.

See CONTRIBUTING.md for details.

luoyuxia marked this pull request as ready for review on November 10, 2025 06:48
@luoyuxia (Author) commented

@lidavidm Hi, could you please help review this PR when you are free?

@V-Fenil commented Nov 12, 2025

Hi @luoyuxia, I really appreciate that you've implemented this. I was trying to test the ParquetWriter implementation and was able to resolve all errors by skipping a few test cases, but it seems your current PR does not include the .dll and .so files, which gives me this error in Java:

exception: error loading native lib arrow_dataset_jni/x86_64/arrow_dataset_jni.dll FileNotFoundException

Is there any other way I can test this, or do I need to wait until it gets merged into the main branch?

@luoyuxia (Author) commented

@V-Fenil Hi, thanks for your interest. I already built the .so (for Linux) and the .dylib (for macOS), but I don't have a Windows environment, so I can't provide a .dll. To verify this PR you'll need to build from source; see https://github.com/apache/arrow-java?tab=readme-ov-file#building-from-source. That's also what I did to verify my PR.

@V-Fenil commented Nov 13, 2025

Hi @luoyuxia, I'm testing your PR on Linux. Could you share the built libarrow_dataset_jni.so file? I can build the Java side, but I need the native library (more specifically, my build succeeded but I can't find the .so file).

The total build time was 49 minutes, yet the Arrow Java C Data Interface and Arrow Java Dataset modules took only 45 seconds each, so I suspect no C++ compilation actually ran. It would be easier if you could share the file directly.

@luoyuxia (Author) commented

Of course I can share it. I can send you the libarrow_dataset_jni.so as well as the jar built with it. How should I share it, by email or some other way?

@lidavidm (Member) commented

@V-Fenil commented Nov 18, 2025

Let me try what @lidavidm suggested; if that doesn't work, I'll reach out to you separately.

@Fenil-v commented Nov 18, 2025

I'm testing PR #904 (native Parquet writer via JNI) on Ubuntu 22.04/WSL2 with Java 11
and hitting a consistent failure during ParquetWriter initialization.

Setup:

  • Downloaded jni-linux-x86_64 artifacts from CI build (run #19222857860)
  • Using Arrow Java 19.0.0-SNAPSHOT with both libarrow_dataset_jni.so and
    libarrow_cdata_jni.so loaded
  • All library dependencies resolved (ldd shows no missing libraries)

Error:
The ParquetWriter constructor fails at line 71 with a memory leak error during cleanup:

java.lang.IllegalStateException: Memory was leaked by query. Memory leaked: (128)
Allocator(ROOT) 0/128/4998/9223372036854775807 (res/actual/peak/limit)
at org.apache.arrow.dataset.file.ParquetWriter.close(ParquetWriter.java:158)
at org.apache.arrow.dataset.file.ParquetWriter.<init>(ParquetWriter.java:71)

Analysis:
Looking at the bytecode, the constructor creates a RootAllocator, then calls either
ArrowSchema.allocateNew() (line 24) or Data.exportSchema() (line 37), which throws
an exception. The constructor's cleanup calls close(), which detects the 128-byte leak
from the allocator created at line 14.
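
If that reading is right, the leaked 128 bytes are a symptom of an earlier failure rather than the root cause. A standalone sketch of the suspected pattern (illustrative only, not the actual ParquetWriter source): allocating a C Data Interface ArrowSchema and then closing the allocator before the schema is released raises the same IllegalStateException:

import org.apache.arrow.c.ArrowSchema;
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;

public class LeakPatternDemo {
    public static void main(String[] args) {
        BufferAllocator allocator = new RootAllocator();
        // Holds a small native buffer accounted against the allocator.
        ArrowSchema schema = ArrowSchema.allocateNew(allocator);
        // Suppose an exception fires here, before schema.close() can run...
        allocator.close(); // ...then close() throws "Memory was leaked by query"
    }
}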

Questions:

  1. Are there additional native libraries or system dependencies required beyond
    libarrow_dataset_jni.so and libarrow_cdata_jni.so?
  2. Is the CI build fully functional, or does it require Arrow C++ runtime libraries
    to be installed separately?
  3. What's the expected initialization sequence for ParquetWriter with these JNI
    libraries?

The Java code is simply:

FileOutputStream fos = new FileOutputStream(outputPath);
ParquetWriter writer = new ParquetWriter(fos, schema);

Any guidance would be appreciated. Thanks for this PR - looking forward to using
the native performance!

@luoyuxia (Author) commented

Hi, what's your code? I used the following on my local macOS and can't reproduce it:

import java.io.FileOutputStream;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

import org.apache.arrow.dataset.file.ParquetWriter;
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.IntVector;
import org.apache.arrow.vector.VarCharVector;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.arrow.vector.types.pojo.ArrowType;
import org.apache.arrow.vector.types.pojo.Field;
import org.apache.arrow.vector.types.pojo.Schema;

import org.junit.jupiter.api.Test;
import org.junit.jupiter.api.io.TempDir;

public class ParquetWriteTest {

    @TempDir public Path tempDir;

    @Test
    void test() throws Exception{
        String parquetFilePath =
                Paths.get("testParquetWriter.parquet").toString();
        List<Field> fields =
                Arrays.asList(
                        Field.nullable("id", new ArrowType.Int(32, true)),
                        Field.nullable("name", new ArrowType.Utf8()));
        Schema arrowSchema = new Schema(fields);

        int[] ids = new int[] {1, 2, 3, 4, 5};
        String[] names = new String[] {"Alice", "Bob", "Charlie", "Diana", "Errrrve"};

        // Write Parquet file
        try (BufferAllocator allocator = new RootAllocator(Long.MAX_VALUE);
              FileOutputStream fos = new FileOutputStream(parquetFilePath);
             ParquetWriter writer = new ParquetWriter(fos, arrowSchema);
             VectorSchemaRoot vectorSchemaRoot = createData(allocator, arrowSchema, ids, names)) {
            writer.write(vectorSchemaRoot);
        }
    }

    private static VectorSchemaRoot createData(
            BufferAllocator allocator, Schema schema, int[] ids, String[] names) {
        // Create VectorSchemaRoot from schema
        VectorSchemaRoot root = VectorSchemaRoot.create(schema, allocator);
        // Allocate space for vectors (we'll write 5 rows)
        root.allocateNew();

        // Get field vectors
        IntVector idVector = (IntVector) root.getVector("id");
        VarCharVector nameVector = (VarCharVector) root.getVector("name");

        // Write data to vectors
        for (int i = 0; i < ids.length; i++) {
            idVector.setSafe(i, ids[i]);
            nameVector.setSafe(i, names[i].getBytes());
        }

        // Set the row count
        root.setRowCount(ids.length);

        return root;
    }
}

I'll try to find time to reproduce it on Linux.

@Fenil-v commented Nov 18, 2025

@luoyuxia

Thanks for the quick response! Here's a minimal test case that reproduces the issue on
Ubuntu 22.04/WSL2:

Test Code:
import org.apache.arrow.dataset.file.ParquetWriter;
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.*;
import org.apache.arrow.vector.types.pojo.ArrowType;
import org.apache.arrow.vector.types.pojo.Field;
import org.apache.arrow.vector.types.pojo.Schema;

import java.io.File;
import java.io.FileOutputStream;
import java.util.Arrays;
import java.util.List;

public class MinimalParquetTest {
    public static void main(String[] args) throws Exception {
        // Load native libraries (required on Linux)
        System.load(new File("src/main/resources/arrow_cdata_jni/x86_64/libarrow_cdata_jni.so").getAbsolutePath());
        System.load(new File("src/main/resources/arrow_dataset_jni/x86_64/libarrow_dataset_jni.so").getAbsolutePath());

        String parquetFilePath = "/tmp/test.parquet";

        List<Field> fields = Arrays.asList(
                Field.nullable("id", new ArrowType.Int(32, true)),
                Field.nullable("name", new ArrowType.Utf8())
        );
        Schema arrowSchema = new Schema(fields);

        int[] ids = new int[] {1, 2, 3, 4, 5};
        String[] names = new String[] {"Alice", "Bob", "Charlie", "Diana", "Eve"};

        // THIS LINE FAILS - ParquetWriter constructor throws during initialization
        try (BufferAllocator allocator = new RootAllocator(Long.MAX_VALUE);
             FileOutputStream fos = new FileOutputStream(parquetFilePath);
             ParquetWriter writer = new ParquetWriter(fos, arrowSchema)) {  // ← Fails here

            // Never gets here - constructor fails
            VectorSchemaRoot root = VectorSchemaRoot.create(arrowSchema, allocator);
            root.allocateNew();

            IntVector idVector = (IntVector) root.getVector("id");
            VarCharVector nameVector = (VarCharVector) root.getVector("name");

            for (int i = 0; i < ids.length; i++) {
                idVector.setSafe(i, ids[i]);
                nameVector.setSafe(i, names[i].getBytes());
            }

            root.setRowCount(ids.length);
            writer.write(root);
        }
    }
}

Environment:

Maven Dependencies:

<dependency>
    <groupId>org.apache.arrow</groupId>
    <artifactId>arrow-dataset</artifactId>
    <version>19.0.0-SNAPSHOT</version>
</dependency>
<dependency>
    <groupId>org.apache.arrow</groupId>
    <artifactId>arrow-c-data</artifactId>
    <version>19.0.0-SNAPSHOT</version>
</dependency>
<dependency>
    <groupId>org.apache.arrow</groupId>
    <artifactId>arrow-vector</artifactId>
    <version>19.0.0-SNAPSHOT</version>
</dependency>
<dependency>
    <groupId>org.apache.arrow</groupId>
    <artifactId>arrow-memory-netty</artifactId>
    <version>19.0.0-SNAPSHOT</version>
</dependency>

Run Command:

java --add-opens=java.base/java.nio=ALL-UNNAMED \
    -Xmx4G \
    -cp "target/classes:target/lib/*" \
    MinimalParquetTest

Stack Trace:

java.lang.IllegalStateException: Memory was leaked by query. Memory leaked: (128)
Allocator(ROOT) 0/128/4998/9223372036854775807 (res/actual/peak/limit)
at org.apache.arrow.memory.BaseAllocator.close(BaseAllocator.java:504)
at org.apache.arrow.memory.RootAllocator.close(RootAllocator.java:27)
at org.apache.arrow.dataset.file.ParquetWriter.close(ParquetWriter.java:158)
at org.apache.arrow.dataset.file.ParquetWriter.<init>(ParquetWriter.java:71)

Key Differences from macOS:
On macOS, you likely have the Arrow C++ libraries installed via Homebrew. On Linux/WSL2, I'm using only the JNI libraries from the CI build. Could the JNI libraries require the system Arrow C++ libraries to be installed separately on Linux?

Things I've verified:

  • Both libarrow_dataset_jni.so and libarrow_cdata_jni.so load successfully
  • No missing library dependencies (ldd shows all resolved)
  • File permissions are correct (755 on .so files)

Would appreciate any Linux-specific setup steps I might be missing. Thanks!

luoyuxia force-pushed the support-write-arrow-batch branch from 1461231 to 03419e5 on November 19, 2025 02:58
@luoyuxia (Author) commented Nov 19, 2025

@V-Fenil Hi, I spent some time debugging in a Linux environment and did find a problem.

The .so downloaded from CI doesn't include the JNI methods I introduce in this PR, although I don't know why. You can check with nm -D libarrow_dataset_jni.so | grep nativeCreateParquetWriter.
So I built it myself with mvn generate-resources -Pgenerate-libs-jni-macos-linux -N, following the guide at https://arrow.apache.org/docs/developers/java/building.html#id3, and then it works. If you need them, I can send you the .so and the arrow-dataset jar I built.
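
Since the constructor's cleanup can mask the root cause behind the allocator-leak IllegalStateException, walking a failure's cause chain and suppressed exceptions should surface the missing JNI symbol as an UnsatisfiedLinkError. A minimal sketch (this helper is hypothetical, not part of the PR):

// Hypothetical helper: print every cause and suppressed exception so a
// missing-JNI-symbol UnsatisfiedLinkError is not hidden by cleanup errors.
static void printRootCauses(Throwable t) {
    for (Throwable c = t; c != null; c = c.getCause()) {
        System.err.println(c.getClass().getName() + ": " + c.getMessage());
        for (Throwable s : c.getSuppressed()) {
            printRootCauses(s);
        }
    }
}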

@Fenil-v commented Nov 19, 2025

@luoyuxia Sure, thanks. Could you email me the files? I'll try loading those .so files first and run my test case. [email protected]

