Support DuckDB data type ARRAY #282

Open
VWagen1989 opened this issue Dec 10, 2024 · 0 comments
Labels
compatibility Be compatible with some old fashion request

Comments

@VWagen1989
Contributor

panic: encountered unknown DuckDB type({FLOAT[4] { 0 0 0 false 0 0 [] }}). This is likely a bug - please check the duckdbDataType function for missing type mappings

I tried to import vectors in Arrow format into MyDuck Server following this instruction, but the import failed with the panic above because MyDuck Server cannot handle the DuckDB ARRAY data type (here, FLOAT[4]) for now.
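
For context on the type named in the panic: in DuckDB, a bracket suffix with a number (FLOAT[4]) denotes a fixed-size ARRAY, while an empty bracket suffix (FLOAT[]) denotes a variable-length LIST. The sketch below illustrates that distinction on type-name strings only. It is a hypothetical helper written for this issue, not the duckdbDataType function in MyDuck Server and not a DuckDB API; the names classify_duckdb_type and NestedType are assumptions.

```python
import re
from typing import NamedTuple, Optional


# Hypothetical helper for illustration; not part of MyDuck Server or DuckDB.
class NestedType(NamedTuple):
    kind: str            # "ARRAY", "LIST", or "SCALAR"
    element: str         # element type name, e.g. "FLOAT"
    size: Optional[int]  # fixed length for ARRAY, otherwise None


# Matches the outermost bracket suffix: "FLOAT[4]" or "FLOAT[]".
_SUFFIX = re.compile(r"^(?P<element>.+?)\[(?P<size>\d*)\]$")


def classify_duckdb_type(type_name: str) -> NestedType:
    """Classify a DuckDB type name by its outermost bracket suffix."""
    name = type_name.strip()
    match = _SUFFIX.match(name)
    if match is None:
        # No bracket suffix: treat the whole name as a scalar type.
        return NestedType("SCALAR", name, None)
    size = match.group("size")
    if size:
        # A number inside the brackets means a fixed-size ARRAY.
        return NestedType("ARRAY", match.group("element"), int(size))
    # Empty brackets mean a variable-length LIST.
    return NestedType("LIST", match.group("element"), None)


if __name__ == "__main__":
    for name in ("FLOAT[4]", "FLOAT[]", "TEXT"):
        print(name, "->", classify_duckdb_type(name))
```

The column in this report, embedding FLOAT[4], falls into the ARRAY case, which is the mapping the panic message reports as missing.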

  • The table schema:
CREATE TABLE IF NOT EXISTS embedded_documents (
    document TEXT,
    embedding FLOAT[4]
)
  • Data generation and COPY operation
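import io
import random
import string

import numpy as np
import pandas as pd
import pyarrow as pa

# Assumed values, not given in the original snippet: taken from the table schema above
TABLE_NAME = "embedded_documents"
VECTOR_DIMENSION = 4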

# Generate random document content
def generate_random_document(length=10):
    return ''.join(random.choice(string.ascii_letters) for _ in range(length))

# Generate a random vector of VECTOR_DIMENSION floats, returned as a Python list
def generate_random_vector(dimension=VECTOR_DIMENSION):
    vector = np.random.rand(dimension).tolist()
    return vector

# Load data using the COPY command with Arrow format
def copy_arrow_data(conn, batch_size, batch_number):
    data = [(generate_random_document(), generate_random_vector()) for _ in range(batch_size)]
    cursor = None
    try:
        # Convert data to a pandas DataFrame
        df = pd.DataFrame(data, columns=['document', 'embedding'])

        # Convert pandas DataFrame to Arrow Table
        table = pa.Table.from_pandas(df)

        # Write Arrow Table to a BytesIO stream
        output_stream = io.BytesIO()
        with pa.ipc.RecordBatchStreamWriter(output_stream, table.schema) as writer:
            writer.write_table(table)

        # Use the COPY command to insert data in Arrow format
        cursor = conn.cursor()
        with cursor.copy(f"COPY {TABLE_NAME} (document, embedding) FROM STDIN (FORMAT arrow)") as copy:
            copy.write(output_stream.getvalue())

        conn.commit()
        # Report success only after the commit has gone through
        print(f"Inserted batch {batch_number} of size {len(data)}")
    except Exception as e:
        print(f"Error inserting batch {batch_number}: {e}")
    finally:
        if cursor:
            cursor.close()
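
Because FLOAT[4] is a fixed-size type, every embedding written to that column has to contain exactly four values. A small pre-flight check on the generated rows can rule out a length mismatch as the cause before looking at the server side. This is a minimal sketch, assuming rows shaped like the (document, embedding) tuples built in copy_arrow_data; the function name check_embedding_lengths is hypothetical.

```python
from typing import Iterable, List, Sequence, Tuple

# One row as built in the snippet above: (document text, embedding values).
Row = Tuple[str, Sequence[float]]


def check_embedding_lengths(rows: Iterable[Row], dimension: int) -> List[int]:
    """Return the indexes of rows whose embedding length differs from dimension."""
    return [
        index
        for index, (_, embedding) in enumerate(rows)
        if len(embedding) != dimension
    ]


if __name__ == "__main__":
    sample = [
        ("doc-a", [0.1, 0.2, 0.3, 0.4]),  # matches FLOAT[4]
        ("doc-b", [0.1, 0.2]),            # too short for FLOAT[4]
    ]
    print(check_embedding_lengths(sample, 4))  # prints [1]
```

In the reproduction above all vectors are generated with the same dimension, so this check is expected to return an empty list; it does not address the missing type mapping itself.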
@VWagen1989 VWagen1989 added the compatibility Be compatible with some old fashion request label Dec 10, 2024
@fanyang01 fanyang01 changed the title Support DuckDB data type List Support DuckDB data type ARRAY Dec 13, 2024