Some pointers for Bigquery::Storage append_rows #19093
Open
jdelStrother opened this issue Aug 31, 2022 · 3 comments
Labels
api: bigquerystorage (Issues related to the BigQuery Storage API.)
type: question (Request for information or clarification. Not an issue.)

Comments

@jdelStrother

I've been struggling to figure out how to use append_rows in Bigquery::Storage.

The AppendRowsRequest has a proto_rows -> writer_schema -> proto_descriptor property, which I've not managed to generate properly in the Ruby SDK.
I've got a protobuf descriptor for my rows, looking something like:

  Google::Protobuf::DescriptorPool.generated_pool.build do
    add_file("listen.proto", :syntax => :proto2) do
      add_message "MyRow" do
        required :post_id, :int64, 1
        required :body, :string, 2
        required :timestamp, :string, 3
      end
    end
  end

  MyRow = ::Google::Protobuf::DescriptorPool.generated_pool.lookup("MyRow").msgclass

and MyRow.descriptor returns an instance of Google::Protobuf::Descriptor.

I've been trying a lot of variations on this:

      schema = ::Google::Cloud::Bigquery::Storage::V1::ProtoSchema.new(
        proto_descriptor: MyRow.descriptor
      )
      rows = [MyRow.new(post_id: 1, body: "aaa", timestamp: "2022-01-01").to_proto]
      append_request = {
        write_stream: "projects/#{project}/datasets/#{dataset}/tables/#{table}/streams/_default",
        proto_rows: ::Google::Cloud::Bigquery::Storage::V1::AppendRowsRequest::ProtoData.new(
          writer_schema: schema,
          rows: ::Google::Cloud::Bigquery::Storage::V1::ProtoRows.new(serialized_rows: rows)
        )
      }
      client.append_rows([append_request]).each do |response|
        pp response
      end

but that raises, e.g., "Invalid type Google::Protobuf::Descriptor to assign to submessage field 'proto_descriptor'".
I confess I'm a bit fuzzy on the distinction between Descriptor and DescriptorProto. Any chance you could give a hint or two?

@NivedhaSenthil added the type: question label Sep 7, 2022
@dazuma (Member) commented Sep 8, 2022

I'm not an expert in the internal protobuf design, but as far as I can tell, Google::Protobuf::Descriptor is a Ruby view of the internal protobuf data structures that define the message format in memory, whereas Google::Protobuf::DescriptorProto is a proto-formatted description of the message that is used to communicate the message format to other systems (e.g. to tell BigQuery Storage what the message looks like). (The format of the latter data structure is actually described here.) BigQuery requires the latter because fields of protos have to be protos (or primitives). You can't put an arbitrary Ruby object into a proto because protobuf doesn't understand it.
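
A minimal sketch of the distinction (assuming a gem version where the generated descriptor message classes are available via require "google/protobuf/descriptor_pb"):

  require "google/protobuf"
  require "google/protobuf/descriptor_pb"

  # Runtime view: a handle into the pool's in-memory message definition.
  MyRow.descriptor.class
  #=> Google::Protobuf::Descriptor

  # Wire view: DescriptorProto is itself an ordinary proto message, which is
  # why it (and not Descriptor) can be assigned to a submessage field such as
  # ProtoSchema#proto_descriptor.
  Google::Protobuf::DescriptorProto.new(name: "MyRow").class
  #=> Google::Protobuf::DescriptorProto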

One would think that there would be a way to construct the "communication" data from the "internal" data. Unfortunately, though I spent some time this morning digging into the protobuf source code, I wasn't able to find one accessible from Ruby. There might be C code in the internal C implementation of protobuf that does it, but the Ruby interface is quite well defined and does not, AFAICT, provide access to such code even if it exists. I'll try to verify internally with the protobuf team, though.

That said, one place the DescriptorProto is present is actually the DSL that is used to define a message type. That DSL (defined here) actually builds a DescriptorProto, and then passes it down into the C code which then defines the internal data structures, including the internal Descriptor object. We can grab a copy of the DescriptorProto in the middle of the DSL, like so:

  # "Declare" this local variable here, so that it's scoped globally.
  # If we don't do this, Ruby will limit the variable's scope to the DSL block
  # below, and we won't be able to access it later.
  listen_file_descriptor_proto = nil

  Google::Protobuf::DescriptorPool.generated_pool.build do
    add_file("listen.proto", :syntax => :proto2) do
      add_message "MyRow" do
        required :post_id, :int64, 1
        required :body, :string, 2
        required :timestamp, :string, 3
      end

      # Grab the FileDescriptorProto at the end of the add_file block, after the
      # messages have been populated into it. The internal build() method
      # normalizes and returns the file descriptor proto.
      listen_file_descriptor_proto = build
    end
  end

  # Retrieve the desired message DescriptorProto from the FileDescriptorProto.
  # See https://github.com/protocolbuffers/protobuf/blob/main/src/google/protobuf/descriptor.proto
  # for the structure of these objects. In this case, message_type is a repeated field containing
  # the messages in the order they were added in the DSL.
  my_row_descriptor_proto = listen_file_descriptor_proto.message_type.first

  # You should be able to pass this to BigQuery storage:
  schema = ::Google::Cloud::Bigquery::Storage::V1::ProtoSchema.new(
    proto_descriptor: my_row_descriptor_proto
  )

This is NOT an official Google-supported answer. It is a hack that I think will work for now, while we investigate better solutions.
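
For completeness, here is the recovered DescriptorProto wired into the append_rows call from the original question (an untested sketch; project, dataset, table, and client are the same placeholders as above):

  # Untested sketch: plug the DescriptorProto-based schema into the
  # append_rows request from the original question.
  schema = ::Google::Cloud::Bigquery::Storage::V1::ProtoSchema.new(
    proto_descriptor: my_row_descriptor_proto
  )
  rows = [MyRow.new(post_id: 1, body: "aaa", timestamp: "2022-01-01").to_proto]
  append_request = {
    write_stream: "projects/#{project}/datasets/#{dataset}/tables/#{table}/streams/_default",
    proto_rows: ::Google::Cloud::Bigquery::Storage::V1::AppendRowsRequest::ProtoData.new(
      writer_schema: schema,
      rows: ::Google::Cloud::Bigquery::Storage::V1::ProtoRows.new(serialized_rows: rows)
    )
  }
  client.append_rows([append_request]).each { |response| pp response }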

@jdelStrother (Author)

@dazuma amazing, thank you. That makes more sense, and I was able to upload via Storage with this.

Would love to see a more officially-supported solution, but happy for this issue to be closed unless you want to keep it open while we wait for that.

@zailleh commented Dec 5, 2024

This doesn't appear to work anymore. Do we have any other way to do this? It seems as though there's currently no way to use the Storage Write API with the Ruby library.
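
One possible workaround on newer google-protobuf gems (4.x), where the DescriptorPool#build DSL has been removed: generated _pb.rb files now register messages via DescriptorPool#add_serialized_file, so you can construct the FileDescriptorProto by hand, register it, and keep the message's DescriptorProto for the ProtoSchema. This is an unverified sketch against current gem versions (recent releases may also expose a Descriptor#to_proto method that would avoid the manual bookkeeping; worth checking):

  require "google/protobuf"
  require "google/protobuf/descriptor_pb"

  # Build the FileDescriptorProto by hand; this is the same data structure
  # the old DSL produced internally.
  file_proto = Google::Protobuf::FileDescriptorProto.new(
    name: "listen.proto",
    syntax: "proto2",
    message_type: [
      Google::Protobuf::DescriptorProto.new(
        name: "MyRow",
        field: [
          Google::Protobuf::FieldDescriptorProto.new(
            name: "post_id", number: 1, type: :TYPE_INT64, label: :LABEL_REQUIRED
          ),
          Google::Protobuf::FieldDescriptorProto.new(
            name: "body", number: 2, type: :TYPE_STRING, label: :LABEL_REQUIRED
          ),
          Google::Protobuf::FieldDescriptorProto.new(
            name: "timestamp", number: 3, type: :TYPE_STRING, label: :LABEL_REQUIRED
          )
        ]
      )
    ]
  )

  # Register it with the pool, the same way generated _pb.rb files do.
  Google::Protobuf::DescriptorPool.generated_pool.add_serialized_file(
    Google::Protobuf::FileDescriptorProto.encode(file_proto)
  )

  MyRow = Google::Protobuf::DescriptorPool.generated_pool.lookup("MyRow").msgclass

  # The DescriptorProto for the ProtoSchema is just the message built above.
  my_row_descriptor_proto = file_proto.message_type.first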
