Caching datasets in context [Question] #1396

Open
a-agmon opened this issue Dec 3, 2021 · 10 comments
Labels
question Further information is requested

Comments

@a-agmon

a-agmon commented Dec 3, 2021

Hi,
Is there a way to cache a large CSV or Parquet file that has been loaded into the context (ctx), so that it is not read again each time a new process tries to access it?
What is the proper way to manage this if you want to write something like a server that answers multiple requests about the same file?

Thanks

@capkurmagati
Contributor

I think you can do something like this.

use std::sync::Arc;
use datafusion::datasource::MemTable;
use datafusion::prelude::*;

let mut ctx = ExecutionContext::new();
// read the file once
ctx.register_csv("c", "path_to_csv", CsvReadOptions::new()).await?;
let df = ctx.sql("select * from c").await?;
// materialize the query result as in-memory record batches
let partitions = df.collect().await?;
// wrap the batches in a memory table and register it with the context
let provider = MemTable::try_new(Arc::new(df.schema().into()), vec![partitions])?;
ctx.register_table("t", Arc::new(provider))?;
// later queries against "t" read from memory instead of the file
let df = ctx.sql("select * from t").await?;
df.show().await?;

@houqp
Member

houqp commented Dec 4, 2021

I think we could add a caching option in context to automatically cache full table scans between runs
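
As a rough illustration of the idea, here is a minimal sketch of a scan cache keyed by source path. It uses only the Rust standard library and is not DataFusion's API: `ScanCache`, `Batches`, and `get_or_scan` are hypothetical names, and the "scan" is a closure standing in for the real file read.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Stand-in for the batches produced by a full table scan (hypothetical type).
type Batches = Vec<Vec<i64>>;

/// Hypothetical scan cache: maps a source path to its already-loaded data.
#[derive(Default)]
struct ScanCache {
    entries: Mutex<HashMap<String, Arc<Batches>>>,
}

impl ScanCache {
    /// Returns the cached scan for `path`, running `scan` only on the first request.
    /// The lock is held while scanning, which keeps the sketch simple but
    /// serializes concurrent loads.
    fn get_or_scan<F>(&self, path: &str, scan: F) -> Arc<Batches>
    where
        F: FnOnce() -> Batches,
    {
        let mut entries = self.entries.lock().unwrap();
        if let Some(hit) = entries.get(path) {
            return Arc::clone(hit);
        }
        let loaded = Arc::new(scan());
        entries.insert(path.to_string(), Arc::clone(&loaded));
        loaded
    }
}

fn main() {
    let cache = ScanCache::default();
    let mut scans = 0;
    for _ in 0..3 {
        let data = cache.get_or_scan("path_to_csv", || {
            scans += 1; // counts how often the "file" is actually read
            vec![vec![1, 2, 3]]
        });
        assert_eq!(data[0], vec![1, 2, 3]);
    }
    println!("scans performed: {scans}"); // prints "scans performed: 1"
}
```

A real option would also need an invalidation policy (for example when the underlying file changes), which this sketch leaves out.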

@alamb
Contributor

alamb commented Dec 4, 2021

I suspect it is also possible to use CREATE TABLE AS SELECT for this purpose.

Something like

echo  "1" > /tmp/foo.csv
datafusion-cli
create external table foo(c1 int) stored as CSV location '/tmp/foo.csv';

create table bar as select * from foo;

@alamb added the question label Dec 4, 2021
@a-agmon
Author

a-agmon commented Dec 4, 2021

Thanks @alamb
But would it be saved or cached in memory for subsequent access?

@Dandandan
Contributor

At the moment it will be cached, because CREATE TABLE AS SELECT uses a MemTable to load the data.
However, I think it's likely that in the future it will store the data in a persistent location / format (Parquet) by default. At that point we would need to add an option to the SQL syntax to keep the results in memory.

@Dandandan
Contributor

Dandandan commented Dec 4, 2021

We could take some inspiration from Spark, where you can .cache() or .persist() a DataFrame.
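
To make the semantics concrete, here is a minimal sketch of an opt-in `.cache()` on a lazily evaluated frame, written against the Rust standard library only. `LazyFrame` and its methods are hypothetical and only loosely mirror Spark's behaviour. They are not an existing DataFusion API.

```rust
use std::cell::Cell;
use std::sync::OnceLock;

/// Hypothetical lazy "DataFrame": holds a plan (a closure) and, once cached, its result.
struct LazyFrame<F: Fn() -> Vec<i64>> {
    plan: F,
    cached: OnceLock<Vec<i64>>,
    use_cache: bool,
}

impl<F: Fn() -> Vec<i64>> LazyFrame<F> {
    fn new(plan: F) -> Self {
        Self { plan, cached: OnceLock::new(), use_cache: false }
    }

    /// Marks the frame as cacheable; the result is stored on the first `collect`.
    fn cache(mut self) -> Self {
        self.use_cache = true;
        self
    }

    /// Evaluates the plan, reusing the stored rows when caching is enabled.
    fn collect(&self) -> Vec<i64> {
        if self.use_cache {
            self.cached.get_or_init(|| (self.plan)()).clone()
        } else {
            (self.plan)()
        }
    }
}

fn main() {
    let runs = Cell::new(0);
    let df = LazyFrame::new(|| {
        runs.set(runs.get() + 1); // counts plan executions
        vec![1, 2, 3]
    })
    .cache();
    df.collect();
    df.collect();
    println!("plan executed {} time(s)", runs.get()); // prints "plan executed 1 time(s)"
}
```

Without the `.cache()` call, each `collect` in this sketch re-runs the plan, which corresponds to re-reading the file on every query.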

@alamb
Contributor

alamb commented Dec 5, 2021

The relevant code is here: https://github.com/apache/arrow-datafusion/blob/414c826bf06fd22e0bb52edbb497791b5fe558e0/datafusion/src/sql/planner.rs#L139-L171

(Note how CREATE TABLE AS SELECT ... gets translated into LogicalPlan::CreateMemoryTable.)

@justinrmiller

I had a question along these lines. Is there a way to load a CSV or Parquet file directly from memory into a context?

ctx.register_csv("c", "path_to_csv", CsvReadOptions::new()).await?;

This is great if the file already exists on disk, but if I'm pulling the data from, say, Redis, then I have to write the data to disk first and then read it back. Thanks!

@alamb
Contributor

alamb commented Jan 2, 2022

@justinrmiller I don't know how to do this today -- it may be possible to register a new in-memory ObjectStore source and then pass its URL into register_csv, but one would have to play around with that more to really find out.
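
For illustration only, the general shape of reading CSV from memory rather than from a path can be sketched with the standard library alone. This is not the ObjectStore approach and does not use DataFusion: `read_csv_ints` is a hypothetical helper, the parser is deliberately naive (integers only, no quoting or headers), and the payload is hard-coded where bytes fetched from Redis would go.

```rust
use std::io::{BufRead, Cursor};

/// Parses comma-separated integers from any buffered reader
/// (naive: no quoting, no header row).
fn read_csv_ints<R: BufRead>(reader: R) -> Result<Vec<Vec<i64>>, Box<dyn std::error::Error>> {
    let mut rows = Vec::new();
    for line in reader.lines() {
        let line = line?;
        if line.trim().is_empty() {
            continue; // skip blank lines
        }
        let row = line
            .split(',')
            .map(|field| field.trim().parse::<i64>())
            .collect::<Result<Vec<_>, _>>()?;
        rows.push(row);
    }
    Ok(rows)
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Bytes as they might arrive from an in-memory store (hard-coded here).
    let payload: Vec<u8> = b"1,2\n3,4\n".to_vec();
    // Cursor turns the byte buffer into a reader, so no temporary file is needed.
    let rows = read_csv_ints(Cursor::new(payload))?;
    println!("{rows:?}"); // prints [[1, 2], [3, 4]]
    Ok(())
}
```

The point of the sketch is that a reader-based entry point accepts a `Cursor` over bytes just as easily as a file, which avoids the write-to-disk round trip.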

@justinrmiller

Thanks, I'll check the source code and try to figure out a way to do so!
