Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Break datafusion-catalog code into its own crate #11182

Open
alamb opened this issue Jun 30, 2024 · 6 comments
Open

Break datafusion-catalog code into its own crate #11182

alamb opened this issue Jun 30, 2024 · 6 comments
Labels
enhancement New feature or request

Comments

@alamb
Copy link
Contributor

alamb commented Jun 30, 2024

Is your feature request related to a problem or challenge?

As we start thinking of adding new catalog support in DataFusion for example, @goldmedal tried to do this for DynamicFileCatalog in #11035 , the current code structure requires that any such implementation be in datafusion/core which is already quite large and monolithic

Also, I think long term it would be a more sustainable structure if we can move ListingTable out of the core as well

However, this move is not likely possible until we move table provider and related catalog traits out of the core.

Also, if we could split up datafusion core more, it would be easier for people to use DataFusion as a smaller embedded library (aka not bring in file format support if they didn't need it)

Describe the solution you'd like

Ideally I would like the following traits / code moved into a datafusion-catalog crate:

  • TableProvider
  • CatalogProvider
  • SchemaProvider
  • MemoryCatalogProvider
  • InformationSchemaProvider

Describe alternatives you've considered

No response

Additional context

No response

Part of #10782

@alamb alamb added the enhancement New feature or request label Jun 30, 2024
@alamb
Copy link
Contributor Author

alamb commented Jun 30, 2024

I looked into this --

One of the major challenges is that SessionState's constructor basically installs the "pre-provided" functionality (like data sources, etc

pub fn new_with_config_rt_and_catalog_list(
config: SessionConfig,
runtime: Arc<RuntimeEnv>,
catalog_list: Arc<dyn CatalogProviderList>,
) -> Self {
let session_id = Uuid::new_v4().to_string();
// Create table_factories for all default formats
let mut table_factories: HashMap<String, Arc<dyn TableProviderFactory>> =
HashMap::new();
#[cfg(feature = "parquet")]
table_factories.insert("PARQUET".into(), Arc::new(DefaultTableFactory::new()));
table_factories.insert("CSV".into(), Arc::new(DefaultTableFactory::new()));
table_factories.insert("JSON".into(), Arc::new(DefaultTableFactory::new()));
table_factories.insert("NDJSON".into(), Arc::new(DefaultTableFactory::new()));
table_factories.insert("AVRO".into(), Arc::new(DefaultTableFactory::new()));
table_factories.insert("ARROW".into(), Arc::new(DefaultTableFactory::new()));
if config.create_default_catalog_and_schema() {
let default_catalog = MemoryCatalogProvider::new();
default_catalog
.register_schema(
&config.options().catalog.default_schema,
Arc::new(MemorySchemaProvider::new()),
)
.expect("memory catalog provider can register schema");
Self::register_default_schema(
&config,
&table_factories,
&runtime,
&default_catalog,
);
catalog_list.register_catalog(
config.options().catalog.default_catalog.clone(),
Arc::new(default_catalog),
);
}
let mut new_self = SessionState {
session_id,
analyzer: Analyzer::new(),
optimizer: Optimizer::new(),
physical_optimizers: PhysicalOptimizer::new(),
query_planner: Arc::new(DefaultQueryPlanner {}),
catalog_list,
table_functions: HashMap::new(),
scalar_functions: HashMap::new(),
aggregate_functions: HashMap::new(),
window_functions: HashMap::new(),
serializer_registry: Arc::new(EmptySerializerRegistry),
file_formats: HashMap::new(),
table_options: TableOptions::default_from_session_config(config.options()),
config,
execution_props: ExecutionProps::new(),
runtime_env: runtime,
table_factories,
function_factory: None,
};
#[cfg(feature = "parquet")]
if let Err(e) =
new_self.register_file_format(Arc::new(ParquetFormatFactory::new()), false)
{
log::info!("Unable to register default ParquetFormat: {e}")
};
if let Err(e) =
new_self.register_file_format(Arc::new(JsonFormatFactory::new()), false)
{
log::info!("Unable to register default JsonFormat: {e}")
};
if let Err(e) =
new_self.register_file_format(Arc::new(CsvFormatFactory::new()), false)
{
log::info!("Unable to register default CsvFormat: {e}")
};
if let Err(e) =
new_self.register_file_format(Arc::new(ArrowFormatFactory::new()), false)
{
log::info!("Unable to register default ArrowFormat: {e}")
};
if let Err(e) =
new_self.register_file_format(Arc::new(AvroFormatFactory::new()), false)
{
log::info!("Unable to register default AvroFormat: {e}")
};
// register built in functions
functions::register_all(&mut new_self)
.expect("can not register built in functions");
// register crate of array expressions (if enabled)
#[cfg(feature = "array_expressions")]
functions_array::register_all(&mut new_self)
.expect("can not register array expressions");
functions_aggregate::register_all(&mut new_self)
.expect("can not register aggregate functions");
new_self

One way to handle this would be to allow constructing SessionState with the minimal built in features, and have a function in SessionContext like SessionContext::register_built_ins that would register things like listing table, information schema, etc.

That way it would still be easy to use DataFusion with a minimal SessionState but also easily register all the built in extensions 🤔

@alamb
Copy link
Contributor Author

alamb commented Jun 30, 2024

I made #11183 to start breaking apart the API and implementation -- there is still a ways to go

@devinjdangelo
Copy link
Contributor

@alamb I took a stab at moving parquet functionality into a datafusion-parquet crate (#11188) , and I ran into similar challenges you highlight here. I think to accomplish these goals core will need to be refactored into a number of different crates to avoid circular dependencies and allow core to still offer a batteries included experience.

@alamb
Copy link
Contributor Author

alamb commented Jul 1, 2024

I think to accomplish these goals core will need to be refactored into a number of different crates to avoid circular dependencies and allow core to still offer a batteries included experience.

I agree with this entirely

The center of the knot is SessionState I think -- figuring out how to get that out of the core is likely key to breaking things up reasonably

@alamb
Copy link
Contributor Author

alamb commented Jul 1, 2024

The more I think about this the more I think trying to make SessionState a container that doesn't have all the optional features (like parquet support) by default makes sense

So like

let session_state = SessionState::new();
// no table providers, etc
// install standard built in table providers
SessionContex::install_built_in(&mut session_state);
// now session_state has them here

@alamb
Copy link
Contributor Author

alamb commented Jul 7, 2024

The more I think about this the more I think trying to make SessionState a container that doesn't have all the optional features (like parquet support) by default makes sense

So like

let session_state = SessionState::new();
// no table providers, etc
// install standard built in table providers
SessionContex::install_built_in(&mut session_state);
// now session_state has them here

I filed #11320 to track this idea

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants