Responsible for importing a raw data set into a table in a Postgres database
We often have data in a file on our filesystem and we want to import it into a
table in a Postgres database. Other database management systems like Microsoft
SQL Server offer a graphical import wizard, but Postgres does not. The
postgres-import-wizard
(henceforth known as the wizard
) fills this gap.
The wizard
is used to copy a flat-data file such as a CSV or tab-delimited
text file into a Postgres table. Although Postgres provides a built-in
COPY
command, it
requires the table to already exist; otherwise there's nothing to copy the data
into. To circumvent this issue, the wizard
reads the header of the file you
want to load and uses those fields to define a table with text
columns in
a database you specify.
No installation is required. The wizard
is meant to be run directly from
source using Python 3.
That said, it's recommended to run this in a Python virtual
environment and install what's in
requirements.txt
(which is really just
psycopg
).
usage: postgres-import-wizard [-h] [--clean] [--delimiter {,,|, }] --file FILE
[--postgres_connection POSTGRES_CONNECTION]
[--schema SCHEMA] --table TABLE
Responsible for importing a raw data set into a table in a Postgres database
optional arguments:
-h, --help show this help message and exit
--clean Delete any table that already exists at
"SCHEMA"."TABLE" (default: False)
--delimiter {,,|, } Delimiter separating fields in FILE (default: ,)
--file FILE Path to data file to import (default: None)
--postgres_connection POSTGRES_CONNECTION
Postgres connection string for raw database (default:
postgresql://postgres@localhost:5433/raw)
--schema SCHEMA Name of schema in which to create table (default:
public)
--table TABLE Name of table to create (default: None)
Executing the wizard
will print log messages on standard out within your
shell. It will also persist messages from your most recent run in a file under
a logs/
folder in your working directory. Log settings can be configured in
log.py
, and the logs/
folder can safely be removed whenever.
In this section we provide a few examples illustrating usage of the wizard
.
All examples are run using the default Postgres connection string provided by
the utility, so be sure that if you run them verbatim you have a local Postgres
instance running on your machine, available at port 5433, with a database named
raw
. Also be sure to use a Postgres password
file to specify
credentials for your database user.
The example data sets provided under the examples/
folder are subsets of the
Private School Universe Survey (PSS) for
the 2017-18 school year provided by the National Center for Education
Statistics. The PSS is one of Niche's data
sources and provides a biennial census of
private schools in the United States of America.
Importing the example CSV data set provided under examples/
can be done by
executing
$ python main.py --file examples/data.csv --table "example1"
Issuing that will load the comma-separated data set into a table named
example1
within the default public
schema of the database.
We can reimport the data:
$ python main.py --clean --file examples/data.csv --table "example1"
If we had omitted the --clean
flag the wizard
would have thrown an error
like
CRITICAL [2019-10-04 08:56:20] Exception occurred while loading raw data: relation "example1" already exists
This example functions much the same as CSV, Public Schema above except we try to load the data into a non-public schema:
$ python main.py --file examples/data.csv --schema "sonic" --table "example1"
INFO [2019-10-04 09:12:27] Starting the wizard
...
CRITICAL [2019-10-04 09:12:27] Exception occurred while loading raw data: schema "sonic" does not exist
LINE 2: CREATE TABLE "sonic"."example1" (
^
INFO [2019-10-04 09:12:27] Exiting the wizard. Goodbye.
Oh no! We need to create the schema before we can create tables in it. After
creating a schema called sonic
we can rerun the command above and everything
should complete fine.
The only tricky thing with importing a tab-delimited data file is that we need
to specify a literal \t
as the delimiter. This can be accomplished by
escaping it in our argument list with $'\t'
, like the following
$ python main.py --delimiter $'\t' --file examples/data.txt --table "example2"