Is there a way to map a topic name to a valid Hive tablenames? #155

Open
rmoff opened this issue Dec 1, 2016 · 13 comments

Comments

@rmoff

rmoff commented Dec 1, 2016

I have data coming through from GoldenGate Kafka Connector, which takes the fully-qualified Oracle table name as the topic name. From what I can see there is no way to override this.
When kafka-connect-hdfs tries to create a Hive table, it fails.

Caused by: io.confluent.connect.hdfs.errors.HiveMetaStoreException: Invalid table
        at io.confluent.connect.hdfs.hive.HiveMetaStore.createTable(HiveMetaStore.java:200)
        at io.confluent.connect.hdfs.parquet.ParquetHiveUtil.createTable(ParquetHiveUtil.java:46)
        at io.confluent.connect.hdfs.TopicPartitionWriter$1.call(TopicPartitionWriter.java:640)
        at io.confluent.connect.hdfs.TopicPartitionWriter$1.call(TopicPartitionWriter.java:637)
        ... 4 more
Caused by: InvalidObjectException(message:ORCL.SOE.ORDERS is not a valid object name)

In the Elasticsearch connector there is a topic.index.map configuration to address this exact scenario (source topic names being invalid target objects). Is there an equivalent for the HDFS connector, or another suitable workaround?

rmoff changed the title from "Is there a way to map topic names to valid Hive tables?" to "Is there a way to map a topic name to a valid Hive tablenames?" on Dec 1, 2016
@cotedm

cotedm commented Jan 10, 2017

@rmoff are . characters supported in Hive table names? I know column names can have . in them, but can you verify you can do this with Hive table names?

That said, we should be able to add an optional topic mapping index, so I'll label this as an enhancement.

@rmoff
Author

rmoff commented Jan 10, 2017

@cotedm I think the problem is that . is an object name separator in Hive. FOO.BAR would be a table called BAR in the FOO database (schema), but XYZ.FOO.BAR makes no sense in a Hive object naming context.

There is an open PR with a workaround for this issue: #137

@cotedm

cotedm commented Jan 10, 2017

Ah, thanks @rmoff for the clarification, I didn't see #137. It looks like you are looking for a manual way to map topic names to table names here, whereas #137 just maps the '.' character to '_'. Let me ask you this: would you still want a topic-table mapping feature if #137 were in place?

@rmoff
Author

rmoff commented Jan 10, 2017

@cotedm Implementation of #137 would be the bare minimum necessary for the Oracle GoldenGate -> Kafka -> Hive pipeline to work at all. But it's still inflexible; topic-table mapping would be my preferred option.

@cotedm

cotedm commented Jan 11, 2017

Thanks @rmoff for clarifying. I think this is good to keep open as an enhancement then. If you would like to implement it, please submit a PR and I'm happy to do an initial review.

@ig-michaelpearce
Contributor

ig-michaelpearce commented Jan 14, 2017

@rmoff just an FYI: #137 became #164. We had re-forked locally, so we needed to open a new PR to address the review comments.

Re supporting configuration-based table name mapping: it should be fairly straightforward to add once #164 is merged.

To resolve table names containing dots, I added a method to io.confluent.connect.hdfs.hive.HiveMetaStore that everything calls to resolve the final Hive table name, which keeps the logic in one place rather than repeating it.

Currently it's just doing a very basic replacement of dots.

public String tableNameConverter(String table){
   return table == null ? table : table.replaceAll("\\.", "_");
}

But you can easily enhance io.confluent.connect.hdfs.hive.HiveMetaStore to take a map of table name translations/mappings on construction or via config, and extend this method to also do a table name lookup in that map.
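
Editor's sketch of the idea described in the comment above: look the topic up in a configured map first, and fall back to the dot replacement otherwise. The class name MappedTableNameConverter and its constructor are hypothetical and are not part of kafka-connect-hdfs; how the map would actually be populated from connector config is left open here.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of kafka-connect-hdfs: illustrates a
// lookup-then-fallback resolver for Hive table names.
final class MappedTableNameConverter {
    private final Map<String, String> topicToTable;

    MappedTableNameConverter(Map<String, String> topicToTable) {
        // Defensive copy so later changes to the caller's map do not affect lookups.
        this.topicToTable = Collections.unmodifiableMap(new HashMap<>(topicToTable));
    }

    String tableNameConverter(String table) {
        if (table == null) {
            return null;
        }
        // An explicit mapping wins when one is configured for this topic.
        String mapped = topicToTable.get(table);
        if (mapped != null) {
            return mapped;
        }
        // Fallback: the same dot-to-underscore replacement as the quoted method.
        return table.replace('.', '_');
    }
}
```

With a map containing ORCL.SOE.ORDERS -> soe_orders, that topic would resolve to soe_orders, while an unmapped topic such as ORCL.SOE.CUSTOMERS would still fall back to ORCL_SOE_CUSTOMERS.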

@rmoff
Author

rmoff commented Jan 16, 2017

Great stuff, thanks @ig-michaelpearce

@cotedm

cotedm commented Jan 18, 2017

@rmoff #164 was merged yesterday just FYI. Do you have any interest in doing the implementation of the mapping feature here? If not, I may take a crack at it as time permits (probably won't be for a bit).

@rmoff
Author

rmoff commented Jan 20, 2017

@cotedm I have plenty of interest, but ATM little time :) So go ahead and have a crack, I'd be happy to try and validate any fix you make.

@mafc

mafc commented Apr 6, 2017

I made an implementation of a topic-to-Hive table name mapping here: https://github.com/mafc/kafka-connect-hdfs/tree/topic_map. I'm not sure if I should make a pull request from it, since it uses Java 8 features and also contains a bump of the Avro version to 1.8.1.

@ewencp
Contributor

ewencp commented Apr 6, 2017

This should also be doable via SMTs, right? I think RegexRouter would work here?
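
Editor's sketch of the kind of rename a regex-based router would perform on the topic name from this issue. It is plain java.util.regex and does not use the Kafka Connect API; the class TopicRenameSketch, the pattern, and the replacement string are all illustrative assumptions. It follows the behaviour RegexRouter is documented to have (rewrite only when the pattern matches the whole topic name, using Matcher.replaceFirst). Note that a later comment in this thread reports the SMT itself failing with this connector, so this only shows the renaming logic, not a working setup.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Stand-alone illustration of regex-based topic renaming.
// Hypothetical class; it is not the Kafka Connect RegexRouter.
final class TopicRenameSketch {
    private final Pattern regex;
    private final String replacement;

    TopicRenameSketch(String regex, String replacement) {
        this.regex = Pattern.compile(regex);
        this.replacement = replacement;
    }

    String rename(String topic) {
        Matcher m = regex.matcher(topic);
        // Rewrite only when the whole topic name matches; otherwise pass it through.
        return m.matches() ? m.replaceFirst(replacement) : topic;
    }
}
```

For example, the pattern `([^.]+)\.([^.]+)\.([^.]+)` with the replacement `$2_$3` would turn ORCL.SOE.ORDERS into SOE_ORDERS and leave a topic without dots unchanged.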

@jurgispods

+1 for this enhancement. One thing that #164 does not handle is topic names that start with an underscore. The reason is that Hive ignores (Parquet) files with a leading underscore. This is another use case where topic mapping would be desirable.

@cadl

cadl commented May 3, 2018

@rmoff @cotedm

How about adding two configs, hive.table.map.regex and hive.table.map.replacement, to achieve it? The connector would read these configs and do a regex replacement in the tableNameConverter function.

If it looks ok, I would like to create a PR to implement it.

I tried the RegexRouter SMT, but the DataWriter's TopicPartitionWriter instances are bound to SinkTaskContext.assignment(), so RegexRouter does not work (it throws an NPE at DataWriter.java#L351).
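
Editor's sketch of how the two proposed settings might be applied inside tableNameConverter. The config names hive.table.map.regex and hive.table.map.replacement come from the comment above and are a proposal, not existing connector options. The class RegexTableNameConverter is hypothetical, and keeping the dot replacement as a final step is an assumption made here so that unmatched names still become valid.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of kafka-connect-hdfs: applies a configured
// regex/replacement pair before the existing dot replacement.
final class RegexTableNameConverter {
    private final Pattern regex;       // would come from the proposed hive.table.map.regex
    private final String replacement;  // would come from the proposed hive.table.map.replacement

    RegexTableNameConverter(String regex, String replacement) {
        this.regex = Pattern.compile(regex);
        this.replacement = replacement;
    }

    String tableNameConverter(String table) {
        if (table == null) {
            return null;
        }
        // Apply the user-configured rewrite to every match in the name.
        String rewritten = regex.matcher(table).replaceAll(replacement);
        // Assumption: still replace any remaining dots, as the merged code does.
        return rewritten.replace('.', '_');
    }
}
```

With the regex `^ORCL\.` and an empty replacement, ORCL.SOE.ORDERS would become SOE_ORDERS. With `^_+` and an empty replacement, a topic such as _internal.topic would become internal_topic, which would also cover the leading-underscore case raised earlier in the thread.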
