ETL

A comparison of technologies for performing ETL.

Data

OS Open UPRN https://osdatahub.os.uk/downloads/open/OpenUPRN

full count: 41,011,955
test count: 2,000,000
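How the 2,000,000-row test set is derived from the full download is not stated here. As a minimal sketch, assuming the test set is simply the first rows of the downloaded CSV, a subset could be cut like this. The function name `write_sample` and the file paths are hypothetical and are not part of this repository.

```python
import csv
from itertools import islice


def write_sample(src_path, dst_path, n_rows):
    """Copy the header plus the first n_rows data rows of a CSV file.

    Hypothetical helper: assumes the source file has a header row and
    that taking the first rows is an acceptable way to build a test set.
    Returns the number of data rows written.
    """
    with open(src_path, newline="") as src, open(dst_path, "w", newline="") as dst:
        reader = csv.reader(src)
        writer = csv.writer(dst)
        header = next(reader, None)
        if header is None:
            return 0  # empty source file, nothing to copy
        writer.writerow(header)
        written = 0
        # islice streams rows, so the full file is never loaded into memory
        for row in islice(reader, n_rows):
            writer.writerow(row)
            written += 1
        return written
```

For example, `write_sample("full.csv", "test.csv", 2_000_000)` would produce a test file of the size quoted above, with both file names being placeholders.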

run upload:

cd data
sh initial_upload.sh

Databases

origin: postgres
target: target

create target:

createdb target

check image sizes

sudo docker images

check memory

chmod +x log_memory.sh
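The contents of `log_memory.sh` are not shown in this README. As a separate illustration, and not the repository's method, the peak memory of a single ETL run could be captured from Python using the standard `resource` module. The function name `run_and_measure` is hypothetical, and the sketch assumes a Unix-like system.

```python
import resource
import subprocess
import sys


def run_and_measure(cmd):
    """Run cmd to completion and return (exit_code, peak_rss).

    Hypothetical helper. peak_rss is the largest resident set size among
    the child processes this interpreter has waited for so far, so it is
    most meaningful when one command is measured per Python process.
    Units are platform dependent: kilobytes on Linux, bytes on macOS.
    """
    completed = subprocess.run(cmd, check=False)
    usage = resource.getrusage(resource.RUSAGE_CHILDREN)
    return completed.returncode, usage.ru_maxrss


if __name__ == "__main__":
    # Placeholder command; replace with the ETL job to be measured.
    code, peak = run_and_measure([sys.executable, "-c", "pass"])
    print(f"exit code: {code}, peak RSS: {peak}")
```

This reports only a single peak value per run. Sampling memory over time, which a logging script would typically do, needs a different approach.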

Conclusions

Sling

Great for replication, as it includes many built-in features (retries, streaming, etc.).
It has a very low memory footprint.
It is not as fast as the other solutions.

DuckDB

It is the winner in terms of execution time for both small and large datasets.
It is not distributed, so it might struggle with very large datasets.
It is mostly SQL-based, which is familiar to many but might have limitations.

Spark

It handles memory well for both small and large datasets.
It is not as fast as DuckDB.
It is distributed, so it can handle very large datasets (terabytes).
It supports SQL, Python, and Scala.
It also has machine learning and graph processing capabilities.

Polars

Very efficient compared to pandas, and for small datasets it competes well against Spark.
Its API is very similar to pandas.
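The conclusions above rest on execution-time comparisons between tools. As a minimal sketch of how such timings could be collected in a uniform way, the following wraps any callable and reports its best and mean wall-clock time. The function name `time_run`, the `repeats` parameter, and the result keys are hypothetical, and this is not necessarily how the figures in this repository were produced.

```python
import time


def time_run(label, fn, repeats=3):
    """Call fn() `repeats` times and summarise wall-clock durations.

    Hypothetical helper: returns a dict with the label, the fastest
    run, and the mean run, all in seconds.
    """
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return {
        "label": label,
        "best_s": min(timings),
        "mean_s": sum(timings) / len(timings),
    }
```

Each tool's copy job would be passed in as `fn`, for example `time_run("duckdb_copy", run_duckdb_copy)`, where `run_duckdb_copy` stands in for whatever function launches that job. Reporting the best run alongside the mean helps separate the tool's own speed from noise such as cold caches.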
