palmermbandy/socrata-py

socrata-py

Python SDK for the Socrata Data Management API

Installation

This only supports python3.

Installation is available through pip. Using a virtualenv is advised. Install the package by running

pip3 install socrata-py

The only hard dependency is requests which will be installed via pip. Pandas is not required, but creating a dataset from a Pandas dataframe is supported. See below.

Example

Try the command line example with

python -m examples.create 'Police Reports' ~/Desktop/catalog.data.gov/Seattle_Real_Time_Fire_911_Calls.csv 'pete-test.test-socrata.com' --username $SOCRATA_USERNAME --password $SOCRATA_PASSWORD

Using

Boilerplate

# Import some stuff
from socrata.authorization import Authorization
from socrata import Socrata
import os

# Boilerplate...
# Make an auth object
auth = Authorization(
  "pete-test.test-socrata.com",
  os.environ['SOCRATA_USERNAME'],
  os.environ['SOCRATA_PASSWORD']
)

Simple usage

Create a new Dataset from a csv, tsv, xls or xlsx file

To create a dataset, you can do this:

with open('cool_dataset.csv', 'rb') as file:
    # Upload + Transform step

    # revision is the *change* to the view in the catalog, which has not yet been applied.
    # output is the OutputSchema, which is a change to data which can be applied via the revision
    (revision, output) = Socrata(auth).create(
        name = "cool dataset",
        description = "a description"
    ).csv(file)

    # Transformation step
    # We want to add some metadata to our column, drop another column, and add a new column which will
    # be filled with values from another column and then transformed
    (ok, output) = output\
        .change_column_metadata('a_column', 'display_name').to('A Column!')\
        .change_column_metadata('a_column', 'description').to('Here is a description of my A Column')\
        .drop_column('b_column')\
        .add_column('a_column_squared', 'A Column, but times itself', 'to_number(`a_column`) * to_number(`a_column`)', 'this is a column squared')\
        .run()


    # Validation of the results step
    (ok, output) = output.wait_for_finish()
    # The data has been validated now, and we can access errors that happened during validation. For example, if one of the cells in `a_column` couldn't be converted to a number in the call to `to_number`, that error would be reflected in this error_count
    assert output.attributes['error_count'] == 0

    # If you want, you can get a csv stream of all the errors
    (ok, errors) = output.schema_errors_csv()
    for line in errors.iter_lines():
        print(line)

    # Update step

    # Apply the revision - this will make it public and available to make
    # visualizations from
    (ok, job) = revision.apply(output_schema = output)

    # This opens a browser window to your revision, and you will see the progress
    # of the job
    revision.open_in_browser()

    # Application is async - this will block until all the data
    # is in place and readable
    job.wait_for_finish()

Similar to the csv method are the xls, xlsx, and tsv methods, which upload those files.

There is a blob method as well, which uploads blobby data to the source. This means the data will not be parsed, and will be displayed under "Files and Documents" in the catalog once the revision is applied.

Create a new Dataset from Pandas

Datasets can also be created from Pandas DataFrames

import pandas as pd
df = pd.read_csv('socrata-py/test/fixtures/simple.csv')
# Do various Pandas-y changes and modifications, then...
(revision, output) = Socrata(auth).create(
    name = "Pandas Dataset",
    description = "Dataset made from a Pandas Dataframe"
).df(df)

# Same code as above to apply the revision.

Updating a dataset

A Socrata update is actually an upsert. Rows are updated or created based on the row identifier. If no row identifier is set, all updates are just appends to the dataset.

A replace truncates the whole dataset and then inserts the new data.
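
The difference between the two actions can be illustrated with a small, self-contained sketch. It does not call the Socrata API: `upsert_rows` and `replace_rows` are hypothetical helpers that model a dataset as a list of dicts, and `id` is an assumed row identifier column.

```python
def upsert_rows(existing, incoming, row_identifier=None):
    """Model an update: match on the row identifier, otherwise append."""
    if row_identifier is None:
        # No row identifier: every incoming row is simply appended
        return existing + incoming

    merged = {row[row_identifier]: row for row in existing}
    for row in incoming:
        # Existing identifiers are overwritten, new ones are added
        merged[row[row_identifier]] = row
    return list(merged.values())


def replace_rows(existing, incoming):
    """Model a replace: the old rows are discarded entirely."""
    return list(incoming)


existing = [{'id': 1, 'temp': 22}, {'id': 2, 'temp': 20}]
incoming = [{'id': 2, 'temp': 21}, {'id': 3, 'temp': 23}]

updated = upsert_rows(existing, incoming, row_identifier='id')
appended = upsert_rows(existing, incoming)
replaced = replace_rows(existing, incoming)
```

With a row identifier, row 2 is updated in place and row 3 is added. Without one, all four rows end up in the result, duplicates included.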

Generating a config and using it to update

# This is how we create our view initially
with open('cool_dataset.csv', 'rb') as file:
    (revision, output) = Socrata(auth).create(
        name = "cool dataset",
        description = "a description"
    ).csv(file)

    revision.apply(output_schema = output)

# This will build a configuration using the same settings (file parsing and
# data transformation rules) that we used to get our output. The action
# that we will take will be "update", though it could also be "replace"
(ok, config) = output.build_config("cool-dataset-config", "update")

# Now we need to save our configuration name and view id somewhere so we
# can update the view using our config
configuration_name = "cool-dataset-config"
view_id = revision.view_id()

# Now later, if we want to use that config to update our view, we just need the view and the configuration_name
socrata = Socrata(auth)
(ok, view) = socrata.views.lookup(view_id) # View will be the view we are updating with the new data

with open('updated-cool-dataset.csv', 'rb') as my_file:
    (revision, job) = socrata.using_config(
        configuration_name,
        view
    ).csv(my_file)
    print(job) # Our update job is now running
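
The example above says to save the configuration name and view id "somewhere". One possible approach, sketched here, is a small JSON file. The helper names `save_update_settings` and `load_update_settings`, the file layout, and the placeholder four-four are all assumptions for illustration, not part of socrata-py.

```python
import json
import os
import tempfile


def save_update_settings(path, configuration_name, view_id):
    """Write the two values needed for later updates to a JSON file."""
    with open(path, 'w') as f:
        json.dump({'configuration_name': configuration_name, 'view_id': view_id}, f)


def load_update_settings(path):
    """Read the values back; returns (configuration_name, view_id)."""
    with open(path) as f:
        settings = json.load(f)
    return settings['configuration_name'], settings['view_id']


# 'ij46-xpxe' is a placeholder four-four, not a real view
settings_path = os.path.join(tempfile.mkdtemp(), 'cool-dataset.json')
save_update_settings(settings_path, 'cool-dataset-config', 'ij46-xpxe')
configuration_name, view_id = load_update_settings(settings_path)
```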

Advanced usage

Create a revision

# This is our socrata object, using the auth variable from above
socrata = Socrata(auth)

# This will make our initial revision, on a view that doesn't yet exist
(ok, revision) = socrata.new({'name': 'cool dataset'})
assert ok

# revision is a Revision object, we can print it
print(revision)
Revision({'created_by': {'display_name': 'rozap',
                'email': '[email protected]',
                'user_id': 'tugg-ikce'},
 'fourfour': 'ij46-xpxe',
 'id': 346,
 'inserted_at': '2017-02-27T23:05:08.522796',
 'metadata': None,
 'update_seq': 285,
 'upsert_jobs': []})

# We can also access the attributes of the revision
print(revision.attributes['metadata']['name'])
'cool dataset'

Create an upload

# Using that revision, we can create an upload
(ok, upload) = revision.create_upload('foo.csv')
assert ok

# And print it
print(upload)
Source({'content_type': None,
 'created_by': {'display_name': 'rozap',
                'email': '[email protected]',
                'user_id': 'tugg-ikce'},
 'source_type': {
    'filename': 'foo.csv',
    'type': 'upload'
 },
 'finished_at': None,
 'id': 290,
 'inserted_at': '2017-02-27T23:07:18.309676',
 'schemas': []})

Upload a file

# And using that upload we just created, we can put bytes into it
with open('test/fixtures/simple.csv', 'rb') as f:
    (ok, source) = upload.csv(f)
    assert ok

Transforming your data

Transforming data consists of going from input data (data exactly as it appeared in the source) to output data (data as you want it to appear).

Transformation from input data to output data often has problems. You might, for example, have a column full of numbers, but one row in that column is actually the value hehe! which cannot be transformed into a number. Rather than failing at each datum which is dirty or wrong, transforming your data allows you to reconcile these issues.

We might have a dataset called temps.csv that looks like

date, celsius
8-24-2017, 22
8-25-2017, 20
8-26-2017, 23
8-27-2017, hehe!
8-28-2017,
8-29-2017, 21
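
To make the idea concrete, here is a plain-Python sketch that converts the celsius column cell by cell and collects failures instead of stopping at the first one. It only imitates the behaviour described above and is not how Socrata implements to_number. The `to_number` helper, the `{'ok': ...}` and `{'error': ...}` cell shapes, and the treatment of empty cells as null are assumptions for illustration.

```python
import csv
import io

TEMPS_CSV = """date, celsius
8-24-2017, 22
8-25-2017, 20
8-26-2017, 23
8-27-2017, hehe!
8-28-2017,
8-29-2017, 21
"""


def to_number(text):
    """Convert one cell; empty cells become None (assumed to model null)."""
    text = text.strip()
    if text == '':
        return {'ok': None}
    try:
        return {'ok': float(text)}
    except ValueError:
        return {'error': 'Unable to convert "%s" to a number' % text}


# skipinitialspace drops the blank after each comma in the sample file
reader = csv.DictReader(io.StringIO(TEMPS_CSV), skipinitialspace=True)
cells = [to_number(row['celsius']) for row in reader]

error_count = sum(1 for cell in cells if 'error' in cell)
values = [cell['ok'] for cell in cells if 'ok' in cell]
```

Only the hehe! cell is counted as an error; the empty cell is kept as a null value.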

Suppose we uploaded it in our previous step, like this:

with open('temps.csv', 'rb') as f:
    (ok, source) = upload.csv(f)
    assert ok
    input_schema = source.get_latest_input_schema()

Our input_schema is the input data exactly as it appeared in the CSV, with all values of type string.

Our output_schema is the output data as it was guessed by Socrata. Guessing may not always be correct, which is why we have import configs to "lock in" a schema for automation. We can get the output_schema like so:

output_schema = input_schema.get_latest_output_schema()

We can now make changes to the schema, like so

(ok, new_output_schema) = (
    output_schema
    # Change the field_name of date to the_date
    .change_column_metadata('date', 'field_name').to('the_date')
    # Change the description of the celsius column
    .change_column_metadata('celsius', 'description').to('the temperature in celsius')
    # Change the display name of the celsius column
    .change_column_metadata('celsius', 'display_name').to('Degrees (Celsius)')
    # Change the transform of the_date column to to_fixed_timestamp(`date`)
    .change_column_transform('the_date').to('to_fixed_timestamp(`date`)')
    # Make the celsius column all numbers
    .change_column_transform('celsius').to('to_number(`celsius`)')
    # Add a new column, which is computed from the `celsius` column
    .add_column('fahrenheit', 'Degrees (Fahrenheit)', '(to_number(`celsius`) * (9 / 5)) + 32', 'the temperature in fahrenheit')
    .run()
)

change_column_metadata(column_name, column_attribute) takes the field name used to identify the column and the column attribute to change (field_name, display_name, description, position)

add_column(field_name, display_name, transform_expression, description) will create a new column

We can also call drop_column('celsius') which will drop the column.

.run() will then make a request and return the new output_schema, or an error if something is invalid.

Transforms can be complex SoQL expressions. Available functions are listed here. You can do lots of stuff with them.

For example, you could change all null values into errors (which won't be imported) by doing something like

(ok, new_output_schema) = output_schema\
    .change_column_transform('celsius').to('coalesce(to_number(`celsius`), error("Celsius was null!"))')\
    .run()

Or you could add a new column that says if the day was hot or not

(ok, new_output_schema) = output_schema\
    .add_column('is_hot', 'Was the day hot?', 'to_number(`celsius`) >= 23')\
    .run()

Or you could geocode a column, given the following CSV

address,city,zip,state
10028 Ravenna Ave NE, Seattle, 98128, WA
1600 Pennsylvania Avenue, Washington DC, 20500, DC
6511 32nd Ave NW, Seattle, 98155, WA

We could transform our first output_schema into a single column dataset, where that single column is a Point of the address

(ok, output) = output\
    .add_column('location', 'Incident Location', 'geocode(`address`, `city`, `state`, `zip`)')\
    .drop_column('address')\
    .drop_column('city')\
    .drop_column('state')\
    .drop_column('zip')\
    .run()

Composing these SoQL functions into expressions will allow you to validate, shape, clean and extend your data to make it more useful to the consumer.

Wait for the transformation to finish

Transformations are async, so if you want to wait for it to finish, you can do so

(ok, output_schema) = new_output_schema.wait_for_finish()
assert ok, output_schema

Errors in a transformation

Transformations may produce errors. In the previous example, we can't convert hehe! to a number. We can see the count of errors like this:

print(output_schema.attributes['error_count'])

We can view the detailed errors like this:

(ok, errors) = output_schema.schema_errors()

We can get a CSV of the errors like this:

(ok, csv_stream) = output_schema.schema_errors_csv()

Validating rows

We can look at the rows of our schema as well

(ok, rows) = output_schema.rows(offset = 0, limit = 20)
assert ok, rows

# Each cell is wrapped in a dict; successfully transformed values are under 'ok'
assert rows == [
    {'b': {'ok': ' bfoo'}},
    {'b': {'ok': ' bfoo'}},
    {'b': {'ok': ' bfoo'}},
    {'b': {'ok': ' bfoo'}}
]
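
Because each cell comes back wrapped in a dict, it can be handy to unwrap the rows before inspecting them. The sketch below is one way to do that. `split_rows` is a hypothetical helper, and treating any cell without an `ok` key as a failed transform is an assumption, since only `ok` cells appear in the output above.

```python
def split_rows(rows):
    """Return (clean_rows, failed_cells) from rows shaped like the output above."""
    clean_rows = []
    failed_cells = []
    for index, row in enumerate(rows):
        clean = {}
        for field_name, cell in row.items():
            if 'ok' in cell:
                clean[field_name] = cell['ok']
            else:
                # Assumption: a cell without an 'ok' key failed its transform
                failed_cells.append((index, field_name, cell))
        clean_rows.append(clean)
    return clean_rows, failed_cells


# The second row is made up to exercise the failure branch
sample_rows = [
    {'b': {'ok': ' bfoo'}},
    {'b': {'error': 'made-up failure for illustration'}},
]
clean_rows, failed_cells = split_rows(sample_rows)
```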

Do the upsert!

# Now we have transformed our data into the shape we want, let's do an upsert
(ok, job) = revision.apply(output_schema = output_schema)

# This will complete the upsert behind the scenes. If we want to
# re-fetch the current state of the upsert job, we can do so
(ok, job) = job.show()

# To get the progress
print(job.attributes['log'])
[
    {'details': {'Errors': 0, 'Rows Created': 0, 'Rows Updated': 0, 'By RowIdentifier': 0, 'By SID': 0, 'Rows Deleted': 0}, 'time': '2017-02-28T20:20:59', 'stage': 'upsert_complete'},
    {'details': {'created': 1}, 'time': '2017-02-28T20:20:59', 'stage': 'columns_created'},
    {'details': {'created': 1}, 'time': '2017-02-28T20:20:59', 'stage': 'columns_created'},
    {'details': None, 'time': '2017-02-28T20:20:59', 'stage': 'started'}
]


# So maybe we just want to wait here, printing the progress, until the job is done
job.wait_for_finish(progress = lambda job: print(job.attributes['log']))

# So now if we go look at our original four-four, our data will be there

Metadata only revisions

When there is an existing Socrata view that you'd like to update the metadata of, you can do so by creating a Source which is the Socrata view.

(ok, view) = socrata.views.lookup('abba-cafe')
assert ok, view

(ok, revision) = view.revisions.create_replace_revision()
assert ok, revision
(ok, source) = revision.source_from_dataset()
assert ok, source
output_schema = source.get_latest_input_schema().get_latest_output_schema()
(ok, new_output_schema) = output_schema\
    .change_column_metadata('a', 'description').to('meh')\
    .change_column_metadata('b', 'display_name').to('bbbb')\
    .change_column_metadata('c', 'field_name').to('ccc')\
    .run()


revision.apply(output_schema = new_output_schema)

Development

Testing

Install test deps by running pip install -r requirements.txt. This will install pdoc and pandas which are required to run the tests.

Configuration is set in test/auth.py for tests. It reads the domain, username, and password from environment variables. If you want to run the tests, set those environment variables to something that will work.

If I wanted to run the tests against my local instance, I would run:

SOCRATA_DOMAIN=localhost SOCRATA_USERNAME=$SOCRATA_LOCAL_USER SOCRATA_PASSWORD=$SOCRATA_LOCAL_PASS bin/test

Generating docs

Make the docs by running

make docs

Releasing

Release to PyPI by bumping the version to something reasonable and running

python setup.py sdist upload -r pypi

Note you'll need your .pypirc file in your home directory. For help, read this

Library Docs

ArgSpec
    Args: auth

Top level publishing object.

All functions making HTTP calls return a result tuple, where the first element in the tuple is whether or not the call succeeded, and the second element is the returned object if it was a success, or a dictionary containing the error response if the call failed. 2xx responses are considered successes. 4xx and 5xx responses are considered failures. In the event of a socket hangup, an exception is raised.
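
Since every HTTP call returns an (ok, value) tuple, scripts tend to repeat the same check after each call. A minimal sketch of a helper that centralizes it follows. `unwrap` and `SocrataCallFailed` are hypothetical names that are not part of socrata-py, and the tuples below are stand-ins for real API results.

```python
class SocrataCallFailed(Exception):
    """Hypothetical exception type; not part of socrata-py."""


def unwrap(result):
    """Return the object from an (ok, value) tuple, or raise with the error body."""
    ok, value = result
    if not ok:
        raise SocrataCallFailed(value)
    return value


# Stand-ins for real API results
success = (True, {'id': 346})
failure = (False, {'message': 'made-up error for illustration'})

revision_attributes = unwrap(success)
try:
    unwrap(failure)
    failed = False
except SocrataCallFailed as error:
    failed = True
    error_body = error.args[0]
```

With a real call this could read `revision = unwrap(socrata.new({'name': 'cool dataset'}))`, which replaces the `assert ok` lines used in the README examples.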

Shortcut to create a dataset. Returns a Create object, which contains functions which will create a view, upload your file, and validate data quality in one step.

To actually place the validated data into a view, you can call .apply() on the revision

(revision, output_schema) = Socrata(auth).create(
    name = "cool dataset",
    description = "a description"
).csv(file)

(ok, job) = revision.apply(output_schema = output_schema)

Args:

   **kwargs: Arbitrary revision metadata values

Returns:

    result (Revision, OutputSchema): Returns the revision that was created and the OutputSchema created from your uploaded file

Examples:

Socrata(auth).create(
    name = "cool dataset",
    description = "a description"
).csv(open('my-file.csv', 'rb'))
ArgSpec
    Args: metadata

Create an empty revision, on a view that doesn't exist yet. The view will be created for you, and the initial revision will be returned.

Args:

    metadata (dict): Metadata to apply to the revision

Returns:

    result (bool, Revision | dict): Returns an API Result; the Revision if it was created or an API Error response

Examples:

    (ok, rev) = Socrata(auth).new({
        'name': 'hi',
        'description': 'foo!',
        'metadata': {
            'view': 'metadata',
            'anything': 'is allowed here'

        }
    })
ArgSpec
    Args: config_name, view

Update a dataset using a configuration that you previously created and whose name you saved. Takes the config_name parameter, which uniquely identifies the config, and the View object, which can be obtained from socrata.views.lookup('view-id42')

Args:

    config_name (str): The config name
    view (View): The view to update

Returns:

    result (Revision, Job): Returns the Revision and the Job, which is now running

Examples:

    with open('my-file.csv', 'rb') as my_file:
        (rev, job) = p.using_config(name, view).csv(my_file)
ArgSpec
    Args: domain, username, password, request_id_prefix
    Defaults: domain=

Manages basic authorization for accessing the socrata API. This is passed into the Socrata object once, which is the entry point for all operations.

auth = Authorization(
    "data.seattle.gov",
    os.environ['SOCRATA_USERNAME'],
    os.environ['SOCRATA_PASSWORD']
)
publishing = Socrata(auth)
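
Indexing `os.environ` directly raises a bare KeyError when a variable is unset. As a sketch, the credentials could be read through a small helper that reports what is missing. `load_credentials` is a hypothetical function, not part of socrata-py, and the values below are placeholders.

```python
import os


def load_credentials(environ=os.environ):
    """Return (username, password), raising a readable error if either is unset."""
    missing = [
        name for name in ('SOCRATA_USERNAME', 'SOCRATA_PASSWORD')
        if not environ.get(name)
    ]
    if missing:
        raise RuntimeError('Missing environment variables: ' + ', '.join(missing))
    return environ['SOCRATA_USERNAME'], environ['SOCRATA_PASSWORD']


# A plain dict stands in for os.environ so the sketch runs anywhere
username, password = load_credentials({
    'SOCRATA_USERNAME': 'someone@example.com',
    'SOCRATA_PASSWORD': 'not-a-real-password',
})
```

The returned pair can then be passed as the second and third arguments to Authorization, as in the example above.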

Disable SSL checking. Note that this should only be used while developing against a local Socrata instance.

ArgSpec
    Args: fourfour, auth
ArgSpec
    Args: metadata, permission
    Defaults: metadata={}, permission=public

Create a revision on the view, which when applied, will replace the data.

Args:

    metadata (dict): The metadata to change; these changes will be applied when the revision
        is applied
    permission (string): 'public' or 'private'

Returns:

    result (bool, dict | Revision): The new revision, or an error

Examples:

    >>> view.revisions.create_replace_revision(metadata = {'name': 'new dataset name', 'description': 'updated description'})
ArgSpec
    Args: metadata, permission
    Defaults: metadata={}, permission=public

Create a revision on the view, which when applied, will update the data rather than replacing it.

This is an upsert; if there is a rowId defined and you have duplicate ID values, those rows will be updated. Otherwise they will be appended.

Args:

    metadata (dict): The metadata to change; these changes will be applied when the revision is applied
    permission (string): 'public' or 'private'

Returns:

    result (bool, dict | Revision): The new revision, or an error

Examples:

    view.revisions.create_update_revision(metadata = {
        'name': 'new dataset name',
        'description': 'updated description'
    })
ArgSpec
    Args: config

Create a revision for the given dataset.

List all the revisions on the view

Returns:

    result (bool, dict | list[Revision])
ArgSpec
    Args: revision_seq

Lookup a revision within the view based on the sequence number

Args:

    revision_seq (int): The sequence number of the revision to lookup

Returns:

    result (bool, dict | Revision): The Revision resulting from this API call, or an error
ArgSpec
    Args: auth, response, parent

A revision is a change to a dataset

ArgSpec
    Args: output_schema

Apply the Revision to the view that it was opened on

Args:

    output_schema (OutputSchema): Optional output schema. If your revision includes
        data changes, this should be included. If it is a metadata only revision,
        then you will not have an output schema, and you do not need to pass anything
        here

Returns:

    result (bool, dict | Job): Returns the job that is being run to apply the revision

Examples:

(ok, job) = revision.apply(output_schema = my_output_schema)
ArgSpec
    Args: filename

Create an upload within this revision

Args:

    filename (str): The name of the file to upload

Returns:

    result (bool, dict | Source): The Source created by this API call, or an error

Discard this open revision.

Returns:

    result (bool, dict | Revision): The closed Revision or an error

Get a list of the operations that you can perform on this object. These map directly onto what's returned from the API in the links section of each resource

Open this revision in your browser. This will open a window.

ArgSpec
    Args: output_schema_id

Set the output schema id on the revision. This is what will get applied when the revision is applied if no output schema is explicitly supplied.

Args:

    output_schema_id (int): The output schema id

Returns:

    result (bool, dict | Revision): The updated Revision as a result of this API call, or an error

Examples:

    (ok, revision) = revision.set_output_schema(42)

Create a dataset source within this revision

ArgSpec
    Args: url

Create a URL source

Args:

    url (str): The URL to create the dataset from

Returns:

    result (bool, dict | Source): The Source created by this API call, or an error

This is the URL to the landing page in the UI for this revision

Returns:

    url (str): URL you can paste into a browser to view the revision UI
ArgSpec
    Args: body

Set the metadata to be applied to the view when this revision is applied

Args:

    body (dict): The changes to make to this revision

Returns:

    result (bool, dict | Revision): The updated Revision as a result of this API call, or an error

Examples:

    (ok, revision) = revision.update({
        'metadata': {
            'name': 'new name',
            'description': 'new description'
        }
    })
ArgSpec
    Args: auth
ArgSpec
    Args: filename

Create a new source. Takes a body param, which must contain a filename of the file.

Args:

    filename (str): The name of the file you are uploading

Returns:

    result (bool, Source | dict): Returns an API Result; the new Source or an error response

Examples:

    (ok, upload) = revision.create_upload('foo.csv')
ArgSpec
    Args: auth, response, parent

Uploads bytes into the source. Requires content_type argument be set correctly for the file handle. It's advised you don't use this method directly, instead use one of the csv, xls, xlsx, or tsv methods which will correctly set the content_type for you.

ArgSpec
    Args: revision

Associate this Source with the given revision.

ArgSpec
    Args: file_handle

Uploads a Blob dataset. A blob is a file that will not be parsed as a data file, ie: an image, video, etc.

Returns:

    result (bool, Source | dict): Returns an API Result; the new Source or an error response

Examples:

    with open('my-blob.jpg', 'rb') as f:
        (ok, upload) = upload.blob(f)
ArgSpec
    Args: name

Change a parse option on the source.

If there are not yet bytes uploaded, these parse options will be used in order to parse the file.

If there are already bytes uploaded, this will trigger a re-parsing of the file, and consequently a new InputSchema will be created. You can call source.latest_input() to get the newest one.

Parse options are:

    header_count (int): the number of rows considered a header
    column_header (int): the one-based index of the row to use to generate the header
    encoding (string): defaults to guessing the encoding, but it can be explicitly set
    column_separator (string): for CSVs, this defaults to ",", and for TSVs to a tab ("\t"), but you can use a custom separator
    quote_char (string): character used to quote values that should be escaped. Defaults to a double quote (")

Args:

    name (string): One of the options above, ie: "column_separator" or "header_count"

Returns:

    change (ParseOptionChange): implements a `.to(value)` function which you call to set the value

For our example, assume we have this dataset

This is my cool dataset
A, B, C
1, 2, 3
4, 5, 6

We want to say that the first 2 rows are headers, and the second of those 2 rows should be used to make the column header. We would do that like so:

Examples:

    (ok, source) = source\
        .change_parse_option('header_count').to(2)\
        .change_parse_option('column_header').to(2)\
        .run()
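
The effect of these two options on the sample dataset can be sketched in plain Python. This is only a model of the behaviour described above, not the Socrata parser. `apply_header_options` is a hypothetical helper, and stripping whitespace around each cell is a simplification.

```python
def apply_header_options(lines, header_count=1, column_header=1, column_separator=','):
    """Split raw lines into (column_names, data_rows) using the two header options."""
    def split(line):
        return [cell.strip() for cell in line.split(column_separator)]

    # column_header is one-based and selects which header row names the columns
    column_names = split(lines[column_header - 1])
    # Everything after the first header_count rows is treated as data
    data_rows = [split(line) for line in lines[header_count:]]
    return column_names, data_rows


raw_lines = [
    'This is my cool dataset',
    'A, B, C',
    '1, 2, 3',
    '4, 5, 6',
]
column_names, data_rows = apply_header_options(raw_lines, header_count=2, column_header=2)
```

The title row is skipped, the second row supplies the column names A, B and C, and the remaining two rows are data.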
ArgSpec
    Args: file_handle

Upload a CSV, returns the new source.

Args:

    file_handle: The file handle, as returned by the python function `open()`

Returns:

    result (bool, Source | dict): Returns an API Result; the new Source or an error response

Examples:

    with open('my-file.csv', 'rb') as f:
        (ok, upload) = upload.csv(f)
ArgSpec
    Args: dataframe

Upload a pandas DataFrame, returns the new source.

Args:

    dataframe: The pandas DataFrame to upload

Returns:

    result (bool, Source | dict): Returns an API Result; the new Source or an error response

Examples:

    import pandas
    df = pandas.read_csv('test/fixtures/simple.csv')
    (ok, upload) = upload.df(df)
ArgSpec
    Args: file_handle

Upload a geojson file, returns the new source.

Args:

    file_handle: The file handle, as returned by the python function `open()`

Returns:

    result (bool, Source | dict): Returns an API Result; the new Source or an error response

Examples:

    with open('my-geojson-file.geojson', 'rb') as f:
        (ok, upload) = upload.geojson(f)
ArgSpec
    Args: file_handle

Upload a KML file, returns the new source.

Args:

    file_handle: The file handle, as returned by the python function `open()`

Returns:

    result (bool, Source | dict): Returns an API Result; the new Source or an error response

Examples:

    with open('my-kml-file.kml', 'rb') as f:
        (ok, upload) = upload.kml(f)

Get a list of the operations that you can perform on this object. These map directly onto what's returned from the API in the links section of each resource

Open this source in your browser. This will open a window.

ArgSpec
    Args: file_handle

Upload a Shapefile, returns the new source.

Args:

    file_handle: The file handle, as returned by the python function `open()`

Returns:

    result (bool, Source | dict): Returns an API Result; the new Source or an error response

Examples:

    with open('my-shapefile-archive.zip', 'rb') as f:
        (ok, upload) = upload.shapefile(f)
ArgSpec
    Args: file_handle

Upload a TSV, returns the new source.

Args:

    file_handle: The file handle, as returned by the python function `open()`

Returns:

    result (bool, Source | dict): Returns an API Result; the new Source or an error response

Examples:

    with open('my-file.tsv', 'rb') as f:
        (ok, upload) = upload.tsv(f)

This is the URL to the landing page in the UI for the sources

Returns:

    url (str): URL you can paste into a browser to view the source UI
ArgSpec
    Args: progress, timeout, sleeptime
    Defaults: progress=<function noop at 0x7f6f42ffe7b8>, sleeptime=1

Wait for this dataset to finish transforming and validating. Accepts a progress function and a timeout.
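
The general shape of a blocking wait with a progress callback and a timeout can be sketched as follows. This is not socrata-py's implementation. `wait_until_done`, `fetch` and `is_done` are hypothetical names, the parameter names `progress`, `timeout` and `sleeptime` are borrowed from the ArgSpec above, and the fake states are for illustration only.

```python
import time


def wait_until_done(fetch, is_done, progress=lambda state: None, timeout=None, sleeptime=1):
    """Poll fetch() until is_done(state) is true; raise TimeoutError after timeout seconds."""
    started = time.monotonic()
    while True:
        state = fetch()
        # Report every observed state, including the final one
        progress(state)
        if is_done(state):
            return state
        if timeout is not None and time.monotonic() - started > timeout:
            raise TimeoutError('Gave up after %s seconds' % timeout)
        time.sleep(sleeptime)


# A fake resource that finishes on the third poll
states = iter(['started', 'columns_created', 'upsert_complete'])
seen = []
final_state = wait_until_done(
    fetch=lambda: next(states),
    is_done=lambda state: state == 'upsert_complete',
    progress=seen.append,
    sleeptime=0,
)
```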

ArgSpec
    Args: file_handle

Upload an XLS, returns the new source.

Args:

    file_handle: The file handle, as returned by the python function `open()`

Returns:

    result (bool, Source | dict): Returns an API Result; the new Source or an error response

Examples:

    with open('my-file.xls', 'rb') as f:
        (ok, upload) = upload.xls(f)
ArgSpec
    Args: file_handle

Upload an XLSX, returns the new source.

Args:

    file_handle: The file handle, as returned by the python function `open()`

Returns:

    result (bool, Source | dict): Returns an API Result; the new Source or an error response

Examples:

    with open('my-file.xlsx', 'rb') as f:
        (ok, upload) = upload.xlsx(f)
ArgSpec
    Args: auth
ArgSpec
    Args: name, data_action, parse_options, columns

Create a new ImportConfig. See http://docs.socratapublishing.apiary.io/ ImportConfig section for what is supported in data_action, parse_options, and columns.

List all the ImportConfigs on this domain

ArgSpec
    Args: name

Obtain a single ImportConfig by name

ArgSpec
    Args: auth, response, parent
ArgSpec
    Args: name

Change a parse option on the source.

If there are not yet bytes uploaded, these parse options will be used in order to parse the file.

If there are already bytes uploaded, this will trigger a re-parsing of the file, and consequently a new InputSchema will be created. You can call source.latest_input() to get the newest one.

Parse options are:

    header_count (int): the number of rows considered a header
    column_header (int): the one-based index of the row to use to generate the header
    encoding (string): defaults to guessing the encoding, but it can be explicitly set
    column_separator (string): for CSVs, this defaults to ",", and for TSVs to a tab ("\t"), but you can use a custom separator
    quote_char (string): character used to quote values that should be escaped. Defaults to a double quote (")

Args:

    name (string): One of the options above, ie: "column_separator" or "header_count"

Returns:

    change (ParseOptionChange): implements a `.to(value)` function which you call to set the value

For our example, assume we have this dataset

This is my cool dataset
A, B, C
1, 2, 3
4, 5, 6

We want to say that the first 2 rows are headers, and the second of those 2 rows should be used to make the column header. We would do that like so:

Examples:

    (ok, source) = source\
        .change_parse_option('header_count').to(2)\
        .change_parse_option('column_header').to(2)\
        .run()
ArgSpec
    Args: fourfour

Create a new Revision in the context of this ImportConfig. Sources that happen in this Revision will take on the values in this Config.

Delete this ImportConfig. Note that this cannot be undone.

Get a list of the operations that you can perform on this object. These map directly onto what's returned from the API in the links section of each resource

ArgSpec
    Args: body

Mutate this ImportConfig in place. Subsequent revisions opened against this ImportConfig will take on its new value.

ArgSpec
    Args: auth, response, parent

This represents a schema exactly as it appeared in the source

Note that this does not make an API request

Returns: output_schema (OutputSchema): Returns the latest output schema

Get the latest (most recently created) OutputSchema which descends from this InputSchema

Returns: result (bool, OutputSchema | dict): Returns an API Result; the new OutputSchema or an error response

Get a list of the operations that you can perform on this object. These map directly onto what's returned from the API in the links section of each resource

ArgSpec
    Args: body

Transform this InputSchema into an Output. Returns the new OutputSchema. Note that this call is async - the data may still be transforming even though the OutputSchema is returned. See OutputSchema.wait_for_finish to block until the

This is data as transformed from an InputSchema

ArgSpec
    Args: field_name, display_name, transform_expr, description

Add a column

Args:

    field_name (str): The column's field_name, must be unique
    display_name (str): The column's display name
    transform_expr (str): SoQL expression to evaluate and fill the column with data from
    description (str): Optional column description

Returns:

    output_schema (OutputSchema): Returns self for easy chaining

Examples:

(ok, new_output_schema) = (
    output
    # Add a new column, which is computed from the `celsius` column
    .add_column('fahrenheit', 'Degrees (Fahrenheit)', '(to_number(`celsius`) * (9 / 5)) + 32', 'the temperature in fahrenheit')
    # Add another new column, which is also computed from the `celsius` column
    .add_column('kelvin', 'Degrees (Kelvin)', 'to_number(`celsius`) + 273.15')
    .run()
)

Whether or not any transform in this output schema has failed

ArgSpec
    Args: name, data_action

Create a new ImportConfig from this OutputSchema. See the API docs for what an ImportConfig is and why they're useful

ArgSpec
    Args: field_name, attribute

Change the column metadata. This returns a ColumnChange, which implements a .to function, which takes the new value to change to

Args:

    field_name (str): The column to change
    attribute (str): The attribute of the column to change

Returns:

    change (ColumnChange): The column change, which implements the `.to` function

Examples:

    (ok, new_output_schema) = (
        output
        # Change the field_name of date to the_date
        .change_column_metadata('date', 'field_name').to('the_date')
        # Change the description of the celsius column
        .change_column_metadata('celsius', 'description').to('the temperature in celsius')
        # Change the display name of the celsius column
        .change_column_metadata('celsius', 'display_name').to('Degrees (Celsius)')
        .run()
    )
ArgSpec
    Args: field_name

Change the column transform. This returns a TransformChange, which implements a .to function, which takes a transform expression.

Args:

    field_name (str): The column to change

Returns:

    change (TransformChange): The transform change, which implements the `.to` function

Examples:

    (ok, new_output_schema) = (
        output
        # Change the transform of the_date column to to_fixed_timestamp(`date`)
        .change_column_transform('the_date').to('to_fixed_timestamp(`date`)')
        # Make the celsius column all numbers
        .change_column_transform('celsius').to('to_number(`celsius`)')
        # Add a new column, which is computed from the `celsius` column
        .add_column('fahrenheit', 'Degrees (Fahrenheit)', '(to_number(`celsius`) * (9 / 5)) + 32', 'the temperature in fahrenheit')
        .run()
    )

ArgSpec
    Args: field_name

Drop the column

Args:

    field_name (str): The column to drop

Returns:

    output_schema (OutputSchema): Returns self for easy chaining

Examples:

    (ok, new_output_schema) = output.drop_column('foo').run()

Get a list of the operations that you can perform on this object. These map directly onto what's returned from the API in the links section of each resource

ArgSpec
    Args: offset, limit
    Defaults: offset=0, limit=500

Get the rows for this OutputSchema. Accepts offset and limit params for paging through the data.

Run all adds, drops, and column changes.

Returns:

    result (bool, OutputSchema | dict): Returns an API Result; the new OutputSchema or an error response

Examples:

    (ok, new_output_schema) = (
        output
        # Change the field_name of date to the_date
        .change_column_metadata('date', 'field_name').to('the_date')
        # Change the description of the celsius column
        .change_column_metadata('celsius', 'description').to('the temperature in celsius')
        # Change the display name of the celsius column
        .change_column_metadata('celsius', 'display_name').to('Degrees (Celsius)')
        # Change the transform of the_date column to to_fixed_timestamp(`date`)
        .change_column_transform('the_date').to('to_fixed_timestamp(`date`)')
        # Make the celsius column all numbers
        .change_column_transform('celsius').to('to_number(`celsius`)')
        # Add a new column, which is computed from the `celsius` column
        .add_column('fahrenheit', 'Degrees (Fahrenheit)', '(to_number(`celsius`) * (9 / 5)) + 32', 'the temperature in fahrenheit')
        .run()
    )
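Because an API Result is an `(ok, value)` tuple, callers who prefer exceptions can wrap it. The helper below is a hypothetical convenience, not something socrata-py provides; it assumes only the tuple shape described in the Returns section.

```python
def unwrap(result):
    """Return the value of an (ok, value) API Result, or raise if it failed."""
    ok, value = result
    if not ok:
        # On failure the second element is the error response
        raise RuntimeError("API call failed: %r" % (value,))
    return value


# Demonstration with hand-built tuples instead of real API calls
good = unwrap((True, "new output schema"))

try:
    unwrap((False, {"message": "something went wrong"}))
    failed = False
except RuntimeError:
    failed = True
```

With a real OutputSchema this could be used as `new_output_schema = unwrap(output.drop_column('foo').run())`.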

ArgSpec
    Args: offset, limit
    Defaults: offset=0, limit=500

Get the errors that resulted from transforming into this output schema. Accepts offset and limit params

Get the errors that resulted from transforming into this output schema as a CSV stream.

Note that this returns an (ok, Response) tuple, where Response is a python requests Response object
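Since the second element of the tuple is a requests Response, its body can be written to disk in chunks with `iter_content`, which is part of the requests API. The sketch below uses hypothetical names (`save_stream`, `FakeResponse`) and a stand-in object so that it runs offline; whether the real response is actually streamed lazily depends on how the library issues the request, which is not stated here.

```python
import os
import tempfile


def save_stream(response, path, chunk_size=8192):
    """Write a response body to `path` chunk by chunk; return bytes written."""
    written = 0
    with open(path, "wb") as out:
        # iter_content yields the body in pieces of up to chunk_size bytes
        for chunk in response.iter_content(chunk_size=chunk_size):
            if chunk:
                out.write(chunk)
                written += len(chunk)
    return written


class FakeResponse:
    """Stand-in for a requests Response, used only for this demo."""

    def __init__(self, body):
        self._body = body

    def iter_content(self, chunk_size=1):
        for start in range(0, len(self._body), chunk_size):
            yield self._body[start:start + chunk_size]


demo_body = b"row,error\n1,bad value\n"
demo_path = os.path.join(tempfile.mkdtemp(), "errors.csv")
bytes_written = save_stream(FakeResponse(demo_body), demo_path, chunk_size=4)
```

With the real call, one would check `ok` first and pass the returned Response to `save_stream` only when the call succeeded.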

ArgSpec
    Args: field_name

Set the row id. Note you must call validate_row_id before doing this.

Args:

    field_name (str): The column to set as the row id

Returns:

    result (bool, OutputSchema | dict): Returns an API Result; the new OutputSchema or an error response

Examples:

    (ok, new_output_schema) = output.set_row_id('the_id_column')

ArgSpec
    Args: field_name

Validate that a column can be used as the row id. Call this before calling set_row_id.

Args:

    field_name (str): The column to validate as the row id

Returns:

    result (bool, dict): Returns an API Result; where the response says if it can be used as a row id
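The validate-then-set order can be captured in one helper. This is a sketch under assumptions: `set_row_id_checked`, `is_valid`, and `FakeOutput` are hypothetical names, and because the shape of the validation response is not documented here, inspecting it is left to a caller-supplied predicate.

```python
def set_row_id_checked(output, field_name, is_valid=lambda body: True):
    """Call validate_row_id first and only then set_row_id."""
    ok, body = output.validate_row_id(field_name)
    if not ok or not is_valid(body):
        # Pass the validation response back so the caller can see why
        return (False, body)
    return output.set_row_id(field_name)


class FakeOutput:
    """Stand-in for an OutputSchema, used only so the sketch runs offline."""

    def __init__(self, usable_columns):
        self.usable_columns = usable_columns
        self.row_id = None

    def validate_row_id(self, field_name):
        if field_name in self.usable_columns:
            return (True, {"field_name": field_name})
        return (False, {"error": "not usable as a row id"})

    def set_row_id(self, field_name):
        self.row_id = field_name
        return (True, self)


fake = FakeOutput({"the_id_column"})
ok_good, _ = set_row_id_checked(fake, "the_id_column")
ok_bad, bad_body = set_row_id_checked(fake, "notes")
```

The default predicate accepts any successful validation response; a stricter one can be passed once the response format is known.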

ArgSpec
    Args: progress, timeout, sleeptime
    Defaults: progress=noop, sleeptime=1

Wait for this dataset to finish transforming and validating. Accepts a progress function and a timeout.
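A wait of this kind is typically a polling loop. The sketch below shows the general pattern with the same three parameter names; it is not socrata-py's implementation, `wait_until` and `is_done` are hypothetical, and the real progress callback may receive different arguments.

```python
import time


def wait_until(is_done, progress=lambda: None, timeout=None, sleeptime=1):
    """Poll is_done() every `sleeptime` seconds until it returns True.

    Raises TimeoutError once `timeout` seconds have elapsed (None = no limit).
    """
    started = time.monotonic()
    while not is_done():
        if timeout is not None and time.monotonic() - started >= timeout:
            raise TimeoutError("gave up after %s seconds" % timeout)
        progress()  # let the caller report status between polls
        time.sleep(sleeptime)


# Demo: a fake job that reports completion on the third check
checks = {"count": 0}
ticks = []


def fake_is_done():
    checks["count"] += 1
    return checks["count"] >= 3


wait_until(fake_is_done, progress=lambda: ticks.append("tick"), sleeptime=0)
```

Keeping the timeout check inside the loop means a slow transform fails with a clear error rather than blocking forever.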

ArgSpec
    Args: auth, response, parent

Has this job finished or failed

Get a list of the operations that you can perform on this object. These map directly onto what's returned from the API in the links section of each resource

ArgSpec
    Args: progress, timeout, sleeptime
    Defaults: progress=noop, sleeptime=1

Wait for this dataset to finish transforming and validating. Accepts a progress function and a timeout.
