Handle very large datasets (>100M single cells) #96

Open
6 tasks
shntnu opened this issue Feb 11, 2018 · 2 comments

@shntnu
Member

shntnu commented Feb 11, 2018

Come up with a solution for ingesting CSVs into a backend that is more scalable than SQLite.

Context: The package cytominer-database is a key component of image-based profiling workflows: it is a small Python-based command-line tool that ingests data generated by CellProfiler into a database. Currently, only SQLite is supported, but it would be great if we could use something more scalable. Addressing this issue would enable researchers working with single-cell imaging data to run queries and perform analyses across tens of millions of cells. This would be particularly useful for analyzing single-cell data across all plates in an experiment, or across multiple experiments.

E.g., a recent 135-plate experiment had 100M single cells, and there is currently no easy way to analyze a dataset of this size.

Here's how to get started on this issue:

  • Read the ingest documentation here
  • Test out cytominer-database by running the tests yourself
  • Read the code
  • Now focus on this part of the code, where the ingestion into the backend happens
  • Come up with ideas on how this could be done at scale and discuss it here
  • If you have any questions, please use this issue to discuss them or open a new issue
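
As one possible starting point for the discussion, the ingestion step can be framed as streaming rows from each CSV in fixed-size batches, so that memory use stays bounded no matter how many cells there are. The sketch below is an illustration only and is not cytominer-database's actual code: `iter_batches`, `ingest_csv`, the batch size, and the table name are all hypothetical, and SQLite (from Python's standard library) merely stands in for whichever backend ends up being chosen.

```python
import csv
import itertools
import sqlite3


def iter_batches(csv_path, batch_size=50_000):
    """Yield (header, rows) pairs, holding at most batch_size rows in memory."""
    with open(csv_path, newline="") as handle:
        reader = csv.reader(handle)
        header = next(reader)
        while True:
            rows = list(itertools.islice(reader, batch_size))
            if not rows:
                break
            yield header, rows


def ingest_csv(connection, csv_path, table, batch_size=50_000):
    """Append one CSV file to `table` (hypothetical name), committing once per batch."""
    total = 0
    for header, rows in iter_batches(csv_path, batch_size):
        columns = ", ".join(f'"{name}"' for name in header)
        placeholders = ", ".join("?" for _ in header)
        # Columns are created untyped, so values are stored as the text read from the CSV.
        connection.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({columns})')
        with connection:  # one transaction per batch
            connection.executemany(
                f'INSERT INTO "{table}" ({columns}) VALUES ({placeholders})', rows
            )
        total += len(rows)
    return total
```

Keeping the batching separate from the write step means only `ingest_csv` would need to change if a different backend were swapped in.
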
@gwaybio
Member

gwaybio commented Sep 7, 2019

I would just add that if there are any questions, please add them here or open a new issue. The intro steps look great 👍

@shntnu
Member Author

shntnu commented Dec 9, 2019

https://parquet.apache.org/ looks like a good option for this
