Handle very large datasets (>100M single cells) #96

shntnu · 2018-02-11T16:56:38Z

Come up with a solution for ingesting CSVs into a backend that is more scalable than SQLite.

Context: The package cytominer-database is a key component of image-based profiling workflows: it is a small Python-based command line tool that ingests data generated by CellProfiler into a database. Currently, only SQLite is supported, but it would be great if we could use something more scalable. Addressing this issue would let researchers working with single-cell imaging data run queries and perform analyses across tens of millions of cells. This would be particularly useful for analyzing single-cell data across all plates in an experiment, or across multiple experiments.

For example, a recent 135-plate experiment had 100M single cells, and there is currently no easy way to analyze it.

Here's how to get started on this issue:

Read the ingest documentation here
Test out cytominer-database by running the tests yourself
Read the code
Now focus on this part of the code, where the ingestion into the backend happens
Come up with ideas on how this could be done at scale and discuss it here
If you have any questions, please use this issue to discuss or open a new issue
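
As a starting point for the "at scale" discussion, here is a minimal sketch of one general idea: stream each CSV in fixed-size batches so that memory use stays bounded no matter how many cells a plate contains, and hand each batch to a pluggable writer. This is not how cytominer-database currently works; `iter_batches`, `ingest`, and `write_batch` are hypothetical names used only for illustration, and the column names in the demo are made up.

```python
import csv
import io
from itertools import islice
from typing import Callable, Dict, Iterable, Iterator, List

Row = Dict[str, str]


def iter_batches(handle: Iterable[str], batch_size: int) -> Iterator[List[Row]]:
    """Yield lists of at most `batch_size` rows from an open CSV handle."""
    reader = csv.DictReader(handle)
    while True:
        # islice pulls only the next `batch_size` rows from the reader,
        # so the whole file is never held in memory at once.
        batch = list(islice(reader, batch_size))
        if not batch:
            return
        yield batch


def ingest(
    handle: Iterable[str],
    write_batch: Callable[[List[Row]], None],
    batch_size: int = 50_000,
) -> int:
    """Stream a CSV into a backend via `write_batch`; return the row count.

    `write_batch` is a hypothetical hook: it could wrap a bulk insert,
    a columnar file writer, or any other backend under discussion.
    """
    total = 0
    for batch in iter_batches(handle, batch_size):
        write_batch(batch)
        total += len(batch)
    return total


if __name__ == "__main__":
    # Tiny in-memory stand-in for a per-plate CSV (made-up columns).
    demo = io.StringIO("ImageNumber,ObjectNumber,AreaShape_Area\n1,1,10\n1,2,12\n2,1,9\n")
    collected: List[Row] = []
    ingest(demo, collected.extend, batch_size=2)
```

Keeping the writer behind a callback means the batching logic does not depend on the backend, so different backends can be compared without touching the CSV reading code.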


gwaybio · 2019-09-07T11:10:05Z

I would just add that if there are any questions, please add them here or open a new issue. The intro steps look great 👍

shntnu · 2019-12-09T20:38:25Z

https://parquet.apache.org/ looks like a good option for this

shntnu added the Enhancement label Feb 11, 2018

This was referenced Feb 11, 2018

Pool multiple backend instances cytomining/cytominer#33

Closed

Normalize cell features by z-scoring - multiple plates cytomining/cytominer#11

Closed

mcquin added Feature and removed Enhancement labels Feb 20, 2018

shntnu added the Won't Fix label May 4, 2018

shntnu closed this as completed May 4, 2018

shntnu reopened this Jul 17, 2019

shntnu removed the Won't Fix label Jul 24, 2019

diskontinuum mentioned this issue Feb 10, 2020

Parquet_integration #122

Merged

gwaybio mentioned this issue May 19, 2022

Potential memory leak in SingleCell's .merge_single_cells() method cytomining/pycytominer#195

Closed
