Google BigQuery sink for Apache Flume
This project is released under the Apache License v2.
Google BigQuery Sink uses the server-to-server OAuth 2.0 flow for authentication, so you need a Google service account to be able to stream data to your project, as well as the private key downloaded from the Google Developers Console. More information can be found here: https://developers.google.com/accounts/docs/OAuth2ServiceAccount.
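For orientation, the serviceAccountId and serviceAccountPrivateKeyFromP12File settings described below map onto the standard google-api-java-client credential pattern. A minimal sketch, assuming the sink builds its client roughly this way (the class name and application name are illustrative, not the sink's actual internals):

import java.io.File;
import java.util.Collections;

import com.google.api.client.googleapis.auth.oauth2.GoogleCredential;
import com.google.api.client.googleapis.javanet.GoogleNetHttpTransport;
import com.google.api.client.http.HttpTransport;
import com.google.api.client.json.JsonFactory;
import com.google.api.client.json.jackson2.JacksonFactory;
import com.google.api.services.bigquery.Bigquery;
import com.google.api.services.bigquery.BigqueryScopes;

public class BigQueryClientSketch {
    // Server-to-server OAuth 2.0: authenticate as the service account
    // using the P12 private key, with no user interaction involved.
    public static Bigquery build(String serviceAccountId, String p12Path) throws Exception {
        HttpTransport transport = GoogleNetHttpTransport.newTrustedTransport();
        JsonFactory jsonFactory = JacksonFactory.getDefaultInstance();
        GoogleCredential credential = new GoogleCredential.Builder()
                .setTransport(transport)
                .setJsonFactory(jsonFactory)
                .setServiceAccountId(serviceAccountId)                      // the account's email address
                .setServiceAccountScopes(Collections.singleton(BigqueryScopes.BIGQUERY))
                .setServiceAccountPrivateKeyFromP12File(new File(p12Path)) // key from the Developers Console
                .build();
        return new Bigquery.Builder(transport, jsonFactory, credential)
                .setApplicationName("flume-bigquery-sink")                  // illustrative name
                .build();
    }
}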
Sink configuration example:
agent.sinks.bigquerySink.type = com.tango.flume.bigquery.sink.GoogleBigQuerySink
agent.sinks.bigquerySink.serviceAccountId = {service_account_id}
agent.sinks.bigquerySink.serviceAccountPrivateKeyFromP12File = /path/to/p12/private/key
agent.sinks.bigquerySink.projectId = %{projectId}
agent.sinks.bigquerySink.datasetId = %{datasetId}
agent.sinks.bigquerySink.tableId = %{tableId}
agent.sinks.bigquerySink.batchSize = 100
agent.sinks.bigquerySink.connectTimeoutMs = 10000
agent.sinks.bigquerySink.readTimeoutMs = 20000
agent.sinks.bigquerySink.maxTryCount = -1
agent.sinks.bigquerySink.rowFactory = com.tango.flume.bigquery.sink.InsertRequestRowsBuilderFactory
agent.sinks.bigquerySink.rowFactory.idHeader = id
agent.sinks.bigquerySink.rowFactory.includeHeaders = timestamp=event_time,event_type,event_duration
agent.sinks.bigquerySink.rowFactory.excludeHeaders =
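The %{...} values in the example are placeholders resolved from event headers at runtime (see projectId, datasetId and tableId below). As a sketch of what an upstream producer might attach, using Flume's standard EventBuilder and purely hypothetical header values:

import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

import org.apache.flume.Event;
import org.apache.flume.event.EventBuilder;

public class EventWithHeadersSketch {
    public static Event sampleEvent() {
        Map<String, String> headers = new HashMap<String, String>();
        headers.put("projectId", "my-project");    // resolved into %{projectId}
        headers.put("datasetId", "my_dataset");    // resolved into %{datasetId}
        headers.put("tableId", "events");          // resolved into %{tableId}
        headers.put("id", "evt-42");               // consumed by rowFactory.idHeader
        headers.put("timestamp", "1417000000000"); // renamed to event_time by includeHeaders
        headers.put("event_type", "login");
        headers.put("event_duration", "125");
        return EventBuilder.withBody("payload", StandardCharsets.UTF_8, headers);
    }
}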
- maxTryCount - if the specified value is greater than 0, at most maxTryCount insert attempts will be made, and if they all fail the events batch is skipped; if the specified value is less than or equal to zero, a failure rolls back the transaction and the same batch insert is retried in a new transaction, indefinitely, until it succeeds (see the sketch right after this list)
- serviceAccountId - email address for your service account (usually looks like [email protected])
- serviceAccountPrivateKeyFromP12File - path to the private key downloaded from the Google Developers Console
- projectId, datasetId, tableId - the BigQuery project id, dataset id, and table id to stream data to. Arbitrary header names are supported for these fields, so you can either specify them directly in the conf file or send them in the event headers (as in the event sketch above).
- rowFactory - fully qualified class name of an implementation of the com.tango.flume.bigquery.sink.IInsertRequestRowsBuilderFactory interface, which defines how a BigQuery table row is composed from the Flume event; the default is com.tango.flume.bigquery.sink.InsertRequestRowsBuilderFactory
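As a reading aid for the maxTryCount behaviour, here is an illustrative sketch of the retry logic described above; the class, field, and insertBatch names are hypothetical, not the sink's actual internals:

import java.io.IOException;
import java.util.List;

import org.apache.flume.EventDeliveryException;
import org.apache.flume.Sink;
import com.google.api.services.bigquery.model.TableDataInsertAllRequest;

class RetrySketch {
    private final int maxTryCount; // from the sink configuration

    RetrySketch(int maxTryCount) {
        this.maxTryCount = maxTryCount;
    }

    Sink.Status deliver(List<TableDataInsertAllRequest.Rows> batch) throws EventDeliveryException {
        for (int attempt = 1; ; attempt++) {
            try {
                insertBatch(batch);           // hypothetical insert call
                return Sink.Status.READY;     // success: the transaction commits
            } catch (IOException e) {
                if (maxTryCount > 0 && attempt >= maxTryCount) {
                    return Sink.Status.READY; // give up: the batch is skipped
                }
                if (maxTryCount <= 0) {
                    // roll back so the same batch is retried in a new transaction
                    throw new EventDeliveryException("insert failed", e);
                }
                // attempts remain: retry within this transaction
            }
        }
    }

    private void insertBatch(List<TableDataInsertAllRequest.Rows> batch) throws IOException {
        // placeholder for the actual streaming insert request
    }
}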
com.tango.flume.bigquery.sink.InsertRequestRowsBuilderFactory has additional configuration abilities:
- idHeader - the name of the Flume event header whose value will be used as the id of the BigQuery row. If not specified, no id is assigned.
- includeHeaders - comma-separated list of Flume event header names to be used as columns of the BigQuery row. Columns can be renamed by specifying the source and target names separated by an equals sign (see the sink configuration example).
- excludeHeaders - comma-separated list of Flume event header names that should not be included as columns of the BigQuery row.
- If excludeHeaders is not specified, only the headers listed in the includeHeaders property are used to create the BigQuery row.
- If excludeHeaders is specified, all headers except those listed in the excludeHeaders property are used to create the BigQuery row.
- If neither excludeHeaders nor includeHeaders is specified, all headers are used to create the BigQuery row.
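Taken together with the example configuration above (idHeader = id, includeHeaders with the timestamp=event_time rename), a factory following these rules would produce a request row shaped roughly like this sketch against the google-api-services-bigquery model classes (the class and the toRow method are hypothetical):

import java.util.Map;

import com.google.api.services.bigquery.model.TableDataInsertAllRequest;
import com.google.api.services.bigquery.model.TableRow;

class RowSketch {
    static TableDataInsertAllRequest.Rows toRow(Map<String, String> headers) {
        // includeHeaders = timestamp=event_time,event_type,event_duration
        TableRow json = new TableRow();
        json.set("event_time", headers.get("timestamp")); // renamed column
        json.set("event_type", headers.get("event_type"));
        json.set("event_duration", headers.get("event_duration"));

        TableDataInsertAllRequest.Rows row = new TableDataInsertAllRequest.Rows();
        row.setInsertId(headers.get("id"));               // idHeader = id
        row.setJson(json);
        return row;
    }
}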
This project uses Maven to build all the artefacts. You can build it with the following command: mvn clean install
This will build the following artefacts:
- flume-bigquery-dist/target/flume-bigquery-1.0.0-SNAPSHOT-dist.tar.gz - the tarball can be unpacked directly into the Apache Flume plugins.d directory.
- flume-bigquery-dist/target/rpm/tango-flume-bigquery/RPMS/noarch/tango-flume-bigquery-1.0.0-SNAPSHOT*.noarch.rpm - this package installs on top of the Apache Flume package and is ready for use right away.