`kamu new`: generate snapshots that will be immediately ready to be added/worked on #888

s373r · 2024-10-09T15:10:11Z

As a user, I'd like `kamu new` to generate template datasets that can be used immediately, without first having to work out which fields need to be filled in with real data.

Root (snapshot)

 ---
 kind: DatasetSnapshot
 version: 1
 content:
   # A human-friendly alias of the dataset
   name: root
   # Root datasets are the points of entry of external data into the system
   # See: https://docs.kamu.dev/cli/ingest/
   kind: Root
   # List of metadata events that get dataset into its initial state
   # See: https://docs.kamu.dev/odf/reference/#metadataevent
   metadata:
     # Specifies the source of data that can be periodically polled to refresh the dataset
     # See: https://docs.kamu.dev/odf/reference/#setpollingsource
     - kind: SetPollingSource
       # Where to fetch the data from.
       # Includes source URL, a protocol to use, cache control
       # See: https://docs.kamu.dev/odf/reference/#fetchstep
       fetch:
         kind: Url
+        url: https://example.com/city_populations_over_time.zip
       # OPTIONAL: How to prepare the binary data
       # Includes decompression, file filtering, format conversions
       prepare:
         - kind: Decompress
           format: Zip
       # How to interpret the data.
       # Includes data format, schema to apply, error handling
       # See: https://docs.kamu.dev/odf/reference/#readstep
       read:
         kind: Csv
         header: true
         timestampFormat: yyyy-M-d
         schema:
           - "date TIMESTAMP"
           - "city STRING"
           - "population STRING"
       # OPTIONAL: Pre-processing query that shapes the data.
       # Useful for converting text data read from CSVs into strict types
       # See: https://docs.kamu.dev/odf/reference/#transform
       preprocess:
         kind: Sql
         # Use one of the supported engines and a query in its dialect
         # See: https://docs.kamu.dev/cli/supported-engines/
         engine: datafusion
         query: |
           select
             date,
             city,
             -- remove commas between thousands
             cast(replace(population, ',', '') as bigint) as population
           from input
       # How to combine data ingested in the past with the new data.
       # See: https://docs.kamu.dev/odf/reference/#mergestrategy
       merge:
         kind: Ledger
         primaryKey:
           - date
           - city
     # Lets you manipulate names of the system columns to avoid conflicts
     # or use names better suited for your data.
     # See: https://docs.kamu.dev/odf/reference/#setvocab
     - kind: SetVocab
       eventTimeColumn: date

Derivative (snapshot)

 ---
 kind: DatasetSnapshot
 version: 1
 content:
   # A human-friendly alias of the dataset
   name: 2
   # Derivative datasets produce data by transforming and combining
   # one or multiple existing datasets.
   # See: https://docs.kamu.dev/cli/transform/
   kind: Derivative
   # List of metadata events that get dataset into its initial state
   # See: https://docs.kamu.dev/odf/reference/#metadataevent
   metadata:
     # Transformation that will be applied to produce new data
     # See: https://docs.kamu.dev/odf/reference/#settransform
     - kind: SetTransform
       # References the datasets that will be used as inputs.
       # Note: We are associating inputs by name, but could also use IDs.
       inputs:
         - datasetRef: com.example.city-populations
       # Transformation steps that use one of the supported engines and query dialects
       # See: https://docs.kamu.dev/cli/supported-engines/
       transform:
         kind: Sql
         engine: datafusion
         query: |
           select
             date,
             city,
             population + 1 as population
+           from `com.example.city-populations`
     # Lets you manipulate names of the system columns to avoid
     # conflicts or use names better suited for your data.
     # See: https://docs.kamu.dev/odf/reference/#setvocab
     - kind: SetVocab
       eventTimeColumn: date
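
To make the data flow through the two templates concrete, here is a minimal sketch that runs queries of the same shape as the `preprocess` and `transform` steps. It rests on several assumptions: SQLite (from the Python standard library) is only a stand-in for DataFusion and the dialects differ, the two sample rows are made up, and the sketch uses single-quoted string literals, an explicit `as population` alias, and a double-quoted table name so that the second query can find its column.

```python
import sqlite3

# SQLite is only a stand-in here; kamu would run these queries in DataFusion.
conn = sqlite3.connect(":memory:")

# Hypothetical rows shaped like the `read` schema (population arrives as text).
conn.execute("create table input (date text, city text, population text)")
conn.executemany(
    "insert into input values (?, ?, ?)",
    [("2020-1-1", "A", "1,234,567"), ("2020-1-1", "B", "98,765")],
)

# Same shape as the root template's preprocess step:
# strip the thousands separators, then cast the text to an integer.
preprocess = """
    select
      date,
      city,
      cast(replace(population, ',', '') as bigint) as population
    from input
"""
conn.execute(f'create table "com.example.city-populations" as {preprocess}')

# Same shape as the derivative template's transform step.
transform = """
    select date, city, population + 1 as population
    from "com.example.city-populations"
    order by city
"""
rows = conn.execute(transform).fetchall()
print(rows)  # [('2020-1-1', 'A', 1234568), ('2020-1-1', 'B', 98766)]
```

The point of the sketch is only that the derivative query depends on the root output exposing a numeric column named `population`.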

zaychenko-sergei · 2024-12-20T10:24:13Z

@s373r Please provide a more technical description.
