Support for file paths in SparkCatalogs via HadoopTables #1843
Conversation
|
This is still a bit rough: it doesn't pass one test and I haven't tested [...]. One key issue: the [...]. In my mind this means the Spark3 tests that run against the session catalog are (partially) broken, as the [...]. So:
@rdblue @aokolnychyi any thoughts? |
|
I was thinking about this a bit differently: rather than providing a full catalog, we could just have the various Spark catalogs treat file types as namespaces, then have them switch to returning Hadoop tables in the load-table method. Here is an example @aokolnychyi was typing up on the create ticket. |
I was thinking this way originally as well. My first attempt at it made for weird interactions between [...]. |
I like this example too; a very useful feature. This is slightly different/more complicated though, right? That would add a SQL extension and a separate implementation to write the metadata of the new table. Or do you mean the path-based table is already an Iceberg table, and the migrate is effectively reduced to a rename from [...]? |
I re-read what you said earlier @RussellSpitzer and I realized I misread it. Your suggestion to just modify the Spark catalogs may be easier, but it still has a lot of duplication and would require a mechanism to handle [...]. Note that regardless of how we implement the change, [...]. |
|
I have added [...]. I am not a huge fan of identifying a path-based table by simply checking if there is a [...]. |
|
I'm not sure we want to allow those other operations, since I'm not sure they actually have a lot of meaning. For example, a rename on a path-based table in a filesystem without renames means a full copy of the entire dataset. So it may be best to treat file-based paths as static, or at least immovable. I think we probably could identify path-based tables based on the namespace too, like "parquet" or "file" or something. I'll take a look today. |
I ran into this and ended up skipping non-Hive tests in other suites. We still have fairly good coverage because the session catalog is simply delegating. So as long as it works for a SparkCatalog backed by Hive, it will work for other catalog types. Fixing this in Spark would be great. |
|
I agree with the approach to use [...]. For table imports, I think we have more options because we're controlling the parsed statement or the stored procedure. That procedure could use optional arguments like [...]. Looking at Spark, the identifier is immediately used to load the table, or is added to a CTAS plan. I don't think that the identifier is modified after it is returned. We could use a custom Identifier implementation, something like this:

```java
import com.google.common.base.Splitter;
import com.google.common.collect.Iterables;
import org.apache.spark.sql.connector.catalog.Identifier;

public class PathIdentifier implements Identifier {
  private final String location;
  private final String name;

  public PathIdentifier(String location) {
    this.location = location;
    this.name = Iterables.getLast(Splitter.on("/").splitToList(location));
  }

  public String location() {
    return location;
  }

  @Override
  public String[] namespace() {
    return new String[] { location };
  }

  @Override
  public String name() {
    return name;
  }
}
```

This uses the last part of the location string as the table name, so that the default subquery aliases added in Spark work. Then each method in [...]. |
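For illustration, a minimal sketch of how a catalog's load path could branch on such an identifier. The method shape and the `tables` (a HadoopTables instance), `icebergCatalog`, and `buildIdentifier` names are assumptions for illustration; HadoopTables.load(String) is the existing core API:

```java
// Hypothetical fragment inside SparkCatalog; field and helper names are
// assumptions, not the PR's confirmed code.
private Table load(Identifier ident) {
  if (ident instanceof PathIdentifier) {
    // a path identifier carries the table location directly,
    // so load from the filesystem instead of the catalog
    return tables.load(((PathIdentifier) ident).location());
  }
  return icebergCatalog.loadTable(buildIdentifier(ident));
}
```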
```java
    Map<String, String> properties) throws TableAlreadyExistsException {
  Schema icebergSchema = SparkSchemaUtil.convert(schema);
  try {
    // can't stage a hadoop table
```
I think this is okay, but if staging isn't supported then a path identifier should fail instead of going ahead without using the path. Otherwise, this will attempt to create a table in the Iceberg catalog with a crazy name, which would probably fail.
After thinking about this a little more, I think we will need to support some form of staging for Hadoop tables. Because this catalog implements the atomic operation mix-in, the staging calls will be used for all CTAS plans. Using SupportsCatalogOptions would mean that save() gets turned into a CTAS. So if we don't want the existing creates to fail, we have to support a staged table.
We can do that in a couple of ways. First, we could create a table builder based on HadoopCatalogTableBuilder that supports a location. Second, we could reuse the fake staged table from the session catalog (for non-Iceberg tables). I'd prefer to create a builder that can construct the transactions for path tables. We could add it to HadoopTables.
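For illustration, a rough sketch of what calling such a builder could look like from a CTAS code path; buildTable(location, schema) and the fluent options shown here are assumptions about the API shape, not its final form:

```java
// Hypothetical usage of a path-table builder hung off HadoopTables.
HadoopTables tables = new HadoopTables(conf);
Catalog.TableBuilder builder = tables
    .buildTable("hdfs://nn:8020/warehouse/path_table", schema)
    .withPartitionSpec(spec)
    .withProperties(properties);

// a staged CTAS would get a transaction instead of creating directly:
// data is written through the transaction's table, then committed atomically
Transaction create = builder.createTransaction();
// ... write data via create.table(), then:
create.commitTransaction();
```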
I'd prefer to create a builder that can construct the transactions for path tables. We could add it to HadoopTables.
Agreed. Added in the most recent update.
|
Just a bit of clarification on [...]: you are suggesting that the change in #1783 returns a [...]. What I am unsure about is how to handle path-based [...]. FWIW I agree that we shouldn't have a 'special' namespace for importing path-based tables; however, my approach of ignoring namespaces if the name is a path probably isn't quite right either. |
|
Sorry, I think I am pretty late to the whole conversation; I completely forgot to follow up on #1783 after my initial comment and am still trying to catch up... I also like the idea of a custom [...]. For these DDL statements, I actually like the approach that Anton mentioned in #1306 about wrapping [...]. |
|
Hey @jackye1995, thanks for the comments! The first approach in this patch was trying to tackle the problem with a new catalog. You can see it here. On the plus side it keeps all the filesystem handling in one class, but it has the downside of creating a lot of duplicate code and wasn't super nice. @RussellSpitzer suggested the approach in this PR and I think I prefer it now too; it also keeps the scope of this change very small. What I am struggling with at the moment is: (1) how to reliably identify that a given name is a path, and (2) what to do with namespace elements for path-based tables.
For the second point, I think we should reject any namespace elements and treat a path as a namespaceless table with the full table path as the table/identifier name. For the first point, I think the only reliable way is to try to do IO with the path: can we get a FileSystem from the path, is it the local FS, is it a relative path, etc. Thoughts? |
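A minimal sketch of that IO-based probe, assuming Hadoop's FileSystem API (looksLikePath is a hypothetical name; the discussion below moves to a simpler string-pattern check instead):

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical: probe the filesystem to decide whether a name is a path.
static boolean looksLikePath(String name, Configuration conf) {
  try {
    Path path = new Path(name);
    // getFileSystem throws if no filesystem handles the scheme
    FileSystem fs = path.getFileSystem(conf);
    return path.isAbsolute() && fs.exists(path);
  } catch (IOException | IllegalArgumentException e) {
    return false;
  }
}
```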
With all the given constraints, I think we have to rely on some string pattern to determine whether it is a path or a table. Use [...]. And whichever way we go for the check, I think it should live somewhere in the core package instead of being a protected method.
+1 for [...]
|
|
Thanks for the comment @jackye1995. I have added a check to reject a path table if it has a namespace. I have also moved the checks to [...]. Re path identifiers: Delta allows absolute paths only, so I don't think it's outrageous if we limit paths to absolute paths only. I am inclined to use a check similar to what they do. Thoughts? |
|
I think we should keep the scope of this small. There isn't a need to address how to embed path-based tables in SQL commands right now; however we choose to do that would probably be a different solution. Right now, we need to unblock multi-catalog support in [...]. For the larger question about [...]: in the long term, I've advocated that Spark should have some way of identifying and passing path identifiers to plugins. The problem here is that no one seems to know what the behavior of path-based tables in SQL is or should be. But I think that a similar [...]. Last, if we want to support identifiers with a catalog and a quoted path, we can do that any time by choosing when to interpret the catalog name as a quoted path. |
Considering the behavior for SQL is undefined, I would rather not add support for this right away. Delta needs this because it already supported these identifiers in v1 and has to be compatible. But I think we should have a use case that requires this before adding support. |
```java
/**
 * Check to see if the location is a potential Hadoop table by checking if it is an absolute path on some filesystem.
 */
public static boolean isHadoopTable(String location) {
```
How about isValidLocation?
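For context, a sketch of one possible body under the absolute-path constraint discussed above (an assumption, not the PR's confirmed implementation):

```java
// Hypothetical: accept absolute local paths and scheme-qualified URIs
// (hdfs://, s3a://, file://) as potential Hadoop table locations.
public static boolean isHadoopTable(String location) {
  if (location == null || location.isEmpty()) {
    return false;
  }
  return location.startsWith("/") || location.contains("://");
}
```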
|
```java
private String[] currentNamespace() {
  return SparkSession.active().sessionState().catalogManager().currentNamespace();
}
```
If this uses a session, then it should be identified when the catalog is created and stored as an instance field. Catalogs are specific to a SQL session, so they should not use the active session except for in initialization. After that, the same session should always be used.
Fixed. Just for my learning: why is it OK to get the Hadoop config from the active session but not the current namespace? I have been following the commonly used SparkSession.active().sessionState().newHadoopConf() pattern.
👍 |
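A minimal sketch of the pattern described above: capture session state once in the CatalogPlugin initialize hook rather than reading SparkSession.active() on every call (the field name is an assumption):

```java
// Hypothetical fragment: session-specific state captured at catalog creation.
private String[] defaultNamespace;

@Override
public void initialize(String name, CaseInsensitiveStringMap options) {
  this.defaultNamespace = SparkSession.active()
      .sessionState().catalogManager().currentNamespace();
}
```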
|
Thanks for the comments @rdblue, I have reduced the scope as suggested and updated the path identifier check. |
```java
@Override
public void renameTable(Identifier from, Identifier to) throws NoSuchTableException, TableAlreadyExistsException {
  try {
    // can't rename hadoop tables
```
This should throw an exception if it can't rename, right?
I think this should have a method like checkNotPathIdentifier that throws an IllegalArgumentException if a PathIdentifier is passed, so that we can ensure that they aren't passed to methods that don't support paths.
Done.
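Roughly, the guard being suggested (a sketch; the exact message and placement in the PR may differ):

```java
// Hypothetical guard: reject path identifiers in operations that can't support them.
private static void checkNotPathIdentifier(Identifier identifier, String method) {
  if (identifier instanceof PathIdentifier) {
    throw new IllegalArgumentException(String.format(
        "Cannot pass path-based identifier to %s: %s is a path", method, identifier));
  }
}
```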
|
|
```java
@Override
public Identifier[] listTables(String[] namespace) {
  // no way to identify if this is a path and we should use tables instead of catalog.
```
There should also be no way to pass a path as a namespace.
Not sure how to identify a path here. A single element with '/' in it? Seems arbitrary. The HadoopCatalog treats directory names as namespaces, so there would be no way to identify that here.
Sorry, I wasn't very clear. If anything calls this, then it went through the code path for normal identifiers. So I don't think there is a need to worry about paths here, because there isn't a way to pass a path in.
Ah I see, have removed the comment.
I don't think so. That setting is for v1. Instead, Spark needs to define how it will pass path-based tables to catalogs, and what the behavior requirements are for those tables. Right now, no one has done the work to find out what the behavior of v1 is, or to try to build consensus in the Spark community about what it should be. I think that means Iceberg should avoid Spark's path-based syntax for now and focus in this PR on the narrower case of passing identifiers for paths used in [...]. |
|
Just rebased with your caching catalog change @rdblue. Hopefully the build should be green now.
|
Looks good overall. I think I'd prefer to refactor the load and builder creation a little bit, and add that check to the Hadoop builder. Thanks @rymurr! |
|
What are the plans for a test suite on this? I think this is all going in the right direction I just want to make sure we have a good set of checks to go along with it. |
|
@RussellSpitzer I've been wondering the same. It is hard to test until #1783 is merged. I will have a think though. |
|
Thanks @rdblue! Both the check in HadoopTables and the simplification in SparkCatalog have been added. |
We can always add a suite that passes [...]. |
|
I have added a simple test to ensure the correct behaviour of the [...]. All is up to date now, and I think we are just about ready! |
```java
public Catalog.TableBuilder withLocation(String newLocation) {
  Preconditions.checkArgument(newLocation == null || location.equals(newLocation),
      String.format("Table location %s differs from the table location (%s) from the PathIdentifier",
          newLocation, location));
```
Nit: preconditions already support argument formatting, so String.format is redundant.
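That is, the %s placeholders can be handed to Preconditions directly; a sketch of the nit applied:

```java
// Preconditions.checkArgument formats %s placeholders itself and only
// builds the message if the check actually fails
Preconditions.checkArgument(newLocation == null || location.equals(newLocation),
    "Table location %s differs from the table location (%s) from the PathIdentifier",
    newLocation, location);
```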
|
Thanks for the fixes! I merged this. |
```java
metadata = ops.current().buildReplacement(schema, spec, SortOrder.unsorted(), location, properties);
} else {
  metadata = tableMetadata(schema, spec, null, properties, location);
```