A Julia package providing functions to interact with Dataiku DSS instances
This package provides an interface to use DSS remotely and to create recipes and notebooks in Dataiku DSS.
Note: This documentation is for remote use. If you want to use the Julia language from within DSS, install the plugin from DSS UI or ask your administrator. More info on the plugin integration here: https://www.dataiku.com/dss/plugins/info/julia.html
To read a dataset into a DataFrame:
using Dataiku, DataFrames
import Dataiku: get_dataframe
df = get_dataframe(dataset"PROJECTKEY.myDataset")
Keyword arguments:
- partitions::AbstractArray : specify the partitions wanted
- infer_types::Bool=true : use the types detected by CSV.jl rather than the DSS schema
- limit::Integer : limit the number of rows returned
- ratio::AbstractFloat : limit the sample to n% of the dataset
- sampling::AbstractString="head" : sampling method; "head" returns the first rows of the dataset (incompatible with the ratio parameter), "random" returns a random sample of the dataset, "random-column" returns a random sample of the dataset (incompatible with the limit parameter)
- truestrings, falsestrings : vectors of Strings that indicate how true and false values are represented
- sampling_column::AbstractString : select the column used for "random-column" sampling
Examples:
get_dataframe(dataset"myDataset")
get_dataframe(dataset"myDataset", [:col1, :col2, :col5]; partitions=["2019-02", "2019-03"])
get_dataframe(dataset"PROJECTKEY.myDataset"; infer_types=false, limit=200, sampling="random")
To read data chunk by chunk, without loading the whole dataset into memory, use iter_dataframes. The same keyword parameters can be given.
channel = Dataiku.iter_dataframes(dataset"myDataset", 1000)
first_thousand_rows = take!(channel)
second_thousand_rows = take!(channel)
The channel can also be iterated through directly:
for chunk in channel
do_stuff(chunk)
end
Row-by-row iteration, yielding DataFrameRows or tuples, is also possible:
Dataiku.iter_rows(ds::DSSDataset, columns::AbstractArray=[]; kwargs...)
Dataiku.iter_tuples(ds::DSSDataset, columns::AbstractArray=[]; kwargs...)
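As a minimal sketch of row-wise iteration (assuming, as with iter_dataframes above, that these functions return a channel, here yielding one row at a time):
using Dataiku
rows = Dataiku.iter_rows(dataset"myDataset", [:col1, :col2])
for row in rows
    println(row[:col1]) # each value is accessed by column name
end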
To write a DataFrame to an output dataset:
Dataiku.write_with_schema(dataset"myOutputDataset", df)
Dataiku.write_dataframe(dataset"myOutputDataset", df) # does not set the schema of the dataset.
The output dataset must already exist in the project.
It is also possible to write datasets chunk by chunk to avoid loading full datasets in memory.
input = Dataiku.iter_dataframes(dataset"myInputDataset", 500)
Dataiku.write_dataframe(dataset"myOutputDataset") do chnl
for chunk in input
put!(chnl, chunk)
end
end
Keyword arguments:
- partition::AbstractString : specify the partition to write
- overwrite::Bool=true : if false, appends the data to the already existing dataset
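For example, a hedged sketch of appending a DataFrame to a single partition (assuming these keywords are accepted by the write functions above; the partition id "2019-02" is illustrative):
Dataiku.write_with_schema(dataset"myOutputDataset", df; partition="2019-02", overwrite=false)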
When using the package inside DSS (recipe or notebook), the project key doesn't need to be initialized. Otherwise, you may want to use Dataiku.set_current_project(project"MYPROJECTKEY").
If no project key is initialized, or to use objects from other projects, the project must be indicated when the object is created:
dataset"PROJECTKEY.datasetname"
DSSDataset(project"PROJECTKEY", "datasetname") # equivalent
...
To use the package outside DSS, the URL of the instance and an API key or a DKU ticket are needed. API keys can be retrieved in DSS under Administration -> Security.
There are two ways to initialize the connection.
Create this JSON file in your home path, $HOME/.dataiku/config.json:
{
"dss_instances": {
"default": {
"url": "http://localhost:XXXX/",
"api_key": "$(APIKEY secret)"
}
},
"default_instance": "default"
}
Or initialize the context directly:
Dataiku.init_context(url::AbstractString, auth::AbstractString)
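For example (the URL and API key below are placeholders, not real credentials):
Dataiku.init_context("http://localhost:11200/", "myAPIkeySecret")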
The package implements DSS types to interact with the different DSS objects:
DSSDataset
DSSFolder
DSSMLTask
DSSModelVersion
DSSProject
DSSSavedModel
DSSTrainedModel
These types are only indicators and don't store any data or metadata.
You can get details about what you can do with these types from notebooks or the Julia REPL help mode, e.g. ?DSSDataset.
For convenience, string macros exist to create most of these types:
dataset"mydataset"
is equivalent toDSSDataset("mydataset")
dataset"PROJECTKEY.mydataset"
=>DSSDataset("mydataset", DSSProject("PROJECTKEY"))
project"PROJECTKEY"
=>DSSProject("PROJECTKEY")
folder"XXXXXX"
=>DSSFolder("XXXXXX")
A running DSS instance and a config file ($HOME/.dataiku/config.json) are required to run the tests.
config.json:
{
"dss_instances": {
"default": {
"url": INSTANCE_URL,
"api_key": API_KEY_SECRET
}
},
"default_instance": "default"
}
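Assuming the package is installed under its registered name, the tests can then be run with Julia's standard Pkg mechanism:
using Pkg
Pkg.test("Dataiku")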