kamu new: generate snapshots that will be immediately ready to be added/worked on #888

Open
s373r opened this issue Oct 9, 2024 · 1 comment

Comments

@s373r
Member

s373r commented Oct 9, 2024

As a user, I'd like `kamu new` to generate template datasets that can be used immediately, without having to work out which fields still need to be filled in with real data.

Root (snapshot)
 ---
 kind: DatasetSnapshot
 version: 1
 content:
   # A human-friendly alias of the dataset
   name: root
   # Root datasets are the points of entry of external data into the system
   # See: https://docs.kamu.dev/cli/ingest/
   kind: Root
   # List of metadata events that get the dataset into its initial state
   # See: https://docs.kamu.dev/odf/reference/#metadataevent
   metadata:
     # Specifies the source of data that can be periodically polled to refresh the dataset
     # See: https://docs.kamu.dev/odf/reference/#setpollingsource
     - kind: SetPollingSource
       # Where to fetch the data from.
       # Includes source URL, a protocol to use, cache control
       # See: https://docs.kamu.dev/odf/reference/#fetchstep
       fetch:
         kind: Url
+        url: https://example.com/city_populations_over_time.zip
       # OPTIONAL: How to prepare the binary data
       # Includes decompression, file filtering, format conversions
       prepare:
         - kind: Decompress
           format: Zip
       # How to interpret the data.
       # Includes data format, schema to apply, error handling
       # See: https://docs.kamu.dev/odf/reference/#readstep
       read:
         kind: Csv
         header: true
         timestampFormat: yyyy-M-d
         schema:
           - "date TIMESTAMP"
           - "city STRING"
           - "population STRING"
       # OPTIONAL: Pre-processing query that shapes the data.
       # Useful for converting text data read from CSVs into strict types
       # See: https://docs.kamu.dev/odf/reference/#transform
       preprocess:
         kind: Sql
         # Use one of the supported engines and a query in its dialect
         # See: https://docs.kamu.dev/cli/supported-engines/
         engine: datafusion
         query: |
           select
             date,
             city,
             -- remove commas between thousands
             cast(replace(population, ',', '') as bigint) as population
           from input
       # How to combine data ingested in the past with the new data.
       # See: https://docs.kamu.dev/odf/reference/#mergestrategy
       merge:
         kind: Ledger
         primaryKey:
           - date
           - city
       # Lets you manipulate names of the system columns to avoid conflicts
       # or use names better suited for your data.
       # See: https://docs.kamu.dev/odf/reference/#setvocab
     - kind: SetVocab
       eventTimeColumn: date
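The intent of the preprocess query in the Root template above (strip thousands separators, then cast to an integer) can be checked outside of kamu. The sketch below is only an illustration: it uses Python's built-in SQLite as a stand-in for DataFusion, so dialect details may differ, and the sample rows are made up. It uses single-quoted string literals, since in standard SQL double quotes denote identifiers, and it aliases the cast column so the output keeps the name `population`.

```python
import sqlite3

# Made-up sample rows shaped like the CSV the template expects.
rows = [
    ("2020-1-1", "A", "1,000"),
    ("2020-1-1", "B", "2,500,000"),
]

conn = sqlite3.connect(":memory:")
conn.execute("create table input (date text, city text, population text)")
conn.executemany("insert into input values (?, ?, ?)", rows)

# Same shape as the template's preprocess query, with single-quoted
# string literals and an explicit alias on the cast column.
query = """
select
  date,
  city,
  -- remove commas between thousands
  cast(replace(population, ',', '') as bigint) as population
from input
order by city
"""
result = conn.execute(query).fetchall()
print(result)  # prints [('2020-1-1', 'A', 1000), ('2020-1-1', 'B', 2500000)]
```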
Derivative (snapshot)
 ---
 kind: DatasetSnapshot
 version: 1
 content:
   # A human-friendly alias of the dataset
   name: 2
   # Derivative datasets produce data by transforming and combining
   # one or multiple existing datasets.
   # See: https://docs.kamu.dev/cli/transform/
   kind: Derivative
   # List of metadata events that get the dataset into its initial state
   # See: https://docs.kamu.dev/odf/reference/#metadataevent
   metadata:
     # Transformation that will be applied to produce new data
     # See: https://docs.kamu.dev/odf/reference/#settransform
     - kind: SetTransform
       # References the datasets that will be used as inputs.
       # Note: We are associating inputs by name, but could also use IDs.
       inputs:
         - datasetRef: com.example.city-populations
       # Transformation steps that use one of the supported engines and query dialects
       # See: https://docs.kamu.dev/cli/supported-engines/
       transform:
         kind: Sql
         engine: datafusion
         query: |
           select
             date,
             city,
             population + 1 as population
+           from `com.example.city-populations`
     # Lets you manipulate names of the system columns to avoid
     # conflicts or use names better suited for your data.
     # See: https://docs.kamu.dev/odf/reference/#setvocab
     - kind: SetVocab
       eventTimeColumn: date
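To make "immediately ready" more concrete: assuming the lines marked `+` in the two snapshots above are the ones the current templates lack, a readiness check could look roughly like the sketch below. `unfilled_fields` is a hypothetical helper, not part of kamu, and it works on snapshots that have already been parsed into plain dictionaries.

```python
def unfilled_fields(snapshot: dict) -> list[str]:
    """Hypothetical helper: list fields a user would still have to fill in."""
    problems = []
    for event in snapshot.get("content", {}).get("metadata", []):
        kind = event.get("kind")
        if kind == "SetPollingSource":
            fetch = event.get("fetch", {})
            # A Url fetch step without a url cannot be pulled as-is.
            if fetch.get("kind") == "Url" and not fetch.get("url"):
                problems.append("SetPollingSource.fetch.url is missing")
        elif kind == "SetTransform":
            query = event.get("transform", {}).get("query", "")
            # Crude check: the query should select from one of the inputs.
            if "from" not in query.lower().split():
                problems.append("SetTransform.transform.query has no FROM clause")
    return problems


# Root snapshot reduced to the relevant fields, without and with the url line.
root_before = {
    "kind": "DatasetSnapshot",
    "version": 1,
    "content": {
        "name": "root",
        "kind": "Root",
        "metadata": [{"kind": "SetPollingSource", "fetch": {"kind": "Url"}}],
    },
}
root_after = {
    "kind": "DatasetSnapshot",
    "version": 1,
    "content": {
        "name": "root",
        "kind": "Root",
        "metadata": [
            {
                "kind": "SetPollingSource",
                "fetch": {
                    "kind": "Url",
                    "url": "https://example.com/city_populations_over_time.zip",
                },
            }
        ],
    },
}

print(unfilled_fields(root_before))
print(unfilled_fields(root_after))
```

A generated template would count as ready when this kind of check returns an empty list for it.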
@zaychenko-sergei
Contributor

@s373r Please provide a more technical description.
