GitHub - stelar-eu/data-profiler

data-profiler

Overview

data-profiler is a Python library providing various functions for profiling different types of data and files.

Quick start

Please see the provided notebooks.

Documentation

Please see here.

Installation

data-profiler requires Python >=3.8 and <4.0.

Python Module - Local library

After downloading data-profiler from here, install it with:

$ cd data-profiler
$ pip install .

How to import local library

After installing data-profiler as a local library, you can import it in your Python code:

import stelardataprofiler
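
A quick way to confirm the installation can be sketched with the standard library alone. The helper below checks the interpreter against the version range stated in the Installation section and tests whether the stelardataprofiler module can be located, without importing it. The name check_environment is illustrative and not part of data-profiler.

```python
import importlib.util
import sys


def check_environment(module_name="stelardataprofiler"):
    """Return (python_ok, module_found) for the current interpreter.

    Illustrative helper; it is not part of data-profiler.
    """
    # Version range taken from the Installation section: >=3.8 and <4.0.
    python_ok = (3, 8) <= sys.version_info[:2] < (4, 0)
    # find_spec returns None when a top-level module cannot be located.
    module_found = importlib.util.find_spec(module_name) is not None
    return python_ok, module_found


if __name__ == "__main__":
    python_ok, module_found = check_environment()
    print(f"Python version supported: {python_ok}")
    print(f"stelardataprofiler importable: {module_found}")
```

If the second value is False, the most likely cause is that pip install . was run in a different environment from the one running the check.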

How to run the app

After installing data-profiler as a local library, you can run the app either by executing the stelardataprofilerapp script or by executing streamlit run inside the streamlitapp folder.

$ stelardataprofilerapp run -- <absolute-folder-path-for-app-outputs>

or

$ cd data-profiler/streamlitapp
$ streamlit run app.py -- <absolute-folder-path-for-app-outputs>

NOTE: The default output folder is '.'. In the first case the folder is created inside the installed Python package; in the second case it is created inside the data-profiler/streamlitapp folder.
The first option can be run from any directory.
Both options also accept Streamlit flags, placed before the -- separator. For example:

$ stelardataprofilerapp run --server.port 9040 -- <absolute-folder-path-for-app-outputs>

$ streamlit run app.py --server.port 9040 -- <absolute-folder-path-for-app-outputs>
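
When the app is launched from a script rather than typed by hand, the argument order shown above (Streamlit flags first, then --, then the output folder) can be assembled programmatically. This is a minimal sketch: build_app_command and its parameters are hypothetical names, and only the commands and the --server.port flag that appear above are used.

```python
import os
import shlex


def build_app_command(output_dir, port=None, use_streamlit=False):
    """Build the argument list for launching the app (illustrative helper)."""
    if use_streamlit:
        # This variant must be started from inside the streamlitapp folder.
        command = ["streamlit", "run", "app.py"]
    else:
        # Console script installed with the package; works from any folder.
        command = ["stelardataprofilerapp", "run"]
    if port is not None:
        # Streamlit flags go before the "--" separator.
        command += ["--server.port", str(port)]
    # Everything after "--" is handed to the app: the absolute output folder.
    command += ["--", os.path.abspath(output_dir)]
    return command


if __name__ == "__main__":
    # Print the command instead of running it, so it can be inspected first.
    print(shlex.join(build_app_command("app-outputs", port=9040)))
```

The resulting list can be passed to subprocess.run(command, check=True) to start the app.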

Configuration

Edit config_template.json according to the requirements of each profiler, then execute main.py to create the mapping.ttl file.

Execute profiler-mappings script (after local library installation)

$ cd data-profiler
$ profiler-mappings config_template.json

NOTE: profiler-mappings is a console script, so it can be executed from anywhere. In that case, pass the correct path to config_template.json and update the 'path' parameters inside config_template.json so that the input is read from, and the output written to, the intended locations.
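
Before running profiler-mappings from another directory, it can help to list every path-like setting in the configuration so that none is missed. The sketch below walks a nested structure and reports each string value whose key contains "path". The structure of config_template.json is not described here, so the example_config dictionary and its keys are purely hypothetical, as is the helper name find_path_entries.

```python
def find_path_entries(node, prefix=""):
    """Yield (location, value) for every string setting whose key contains 'path'."""
    if isinstance(node, dict):
        for key, value in node.items():
            location = f"{prefix}.{key}" if prefix else key
            if "path" in key.lower() and isinstance(value, str):
                yield location, value
            else:
                # Keep descending: paths may be nested under other keys.
                yield from find_path_entries(value, location)
    elif isinstance(node, list):
        for index, item in enumerate(node):
            yield from find_path_entries(item, f"{prefix}[{index}]")


# Hypothetical fragment, for illustration only; the real
# config_template.json may use different keys and nesting.
example_config = {
    "input": {"path": "/data/in/images"},
    "output": {"json_path": "/data/out/profile.json"},
}

for location, value in find_path_entries(example_config):
    print(f"{location} = {value}")
```

For the real file, load it with json.load and pass the result to the same function.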

Output

JSON

All profiling functions write their results to a JSON file and an HTML file. A brief example of the JSON output of the raster profiler, given two images as input, follows.

{
"analysis":  { "date_start": "2023-04-28 12:09:45.815132",
               "date_end": "2023-04-28 12:09:54.920661",
                ... 
             },
"table":     { "byte_size": 2925069,
               "n_of_imgs": 2,
                ...
             },
"variables": [{"name": "image_1",
               "type": "Raster",
               "crs": "EPSG:4326",
               "spatial_coverage": "POLYGON ((83 275, 183 0, 83 275))"
              }, ...]
}

In short, the analysis field contains metadata about the profiling task, such as the start and end time. The table field contains results for the dataset as a whole rather than for individual images (e.g., number of images and total size in bytes). Finally, the variables field contains per-image results, such as the CRS and spatial coverage.
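
The three fields can be read with the standard json module. The sketch below uses a trimmed copy of the example above, with the elided fields left out, and touches only keys that appear in that example. Other keys in a real profile are not assumed.

```python
import json
from datetime import datetime

# Trimmed copy of the example above; the elided ("...") fields are left out.
profile_text = """
{
  "analysis": {"date_start": "2023-04-28 12:09:45.815132",
               "date_end": "2023-04-28 12:09:54.920661"},
  "table": {"byte_size": 2925069, "n_of_imgs": 2},
  "variables": [{"name": "image_1", "type": "Raster", "crs": "EPSG:4326"}]
}
"""

profile = json.loads(profile_text)

# analysis: metadata about the profiling task itself.
start = datetime.fromisoformat(profile["analysis"]["date_start"])
end = datetime.fromisoformat(profile["analysis"]["date_end"])
duration_seconds = (end - start).total_seconds()

# table: results for the dataset as a whole.
n_images = profile["table"]["n_of_imgs"]
total_bytes = profile["table"]["byte_size"]

# variables: one entry per input image.
crs_by_image = {var["name"]: var["crs"] for var in profile["variables"]}

print(f"Profiled {n_images} images ({total_bytes} bytes) in {duration_seconds:.1f} s")
print(crs_by_image)
```

To read a profile produced by data-profiler, replace json.loads(profile_text) with json.load on the opened output file.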

A complete JSON output example can be found here.

HTML

The HTML file contains various plots that visualize the profiling results. Examples of such HTML visualizations of profiles can be found here, here and here.

Apply mappings to generate RDF graph

Predefined mappings for profiles of the various types of datasets are available and can be used to generate an RDF graph with the profiling information. Once the profiling process completes, an automatically configured mapping.ttl file is available in the same folder as the output JSON. These customized mappings are expressed in the RDF Mapping Language (RML) and can be used to transform the JSON profile into various RDF serializations, as specified by the user in a configuration. To apply the mappings, download the latest release of RML Mapper and execute the downloaded JAR with Java as follows:

java -jar <path-to-RML_Mapper.JAR> -m <output-path>/mapping.ttl -d -s <RDF-serialization> -o <path-to-output-RDF-file>

The mapping.ttl file required for this step is created in the same folder as the JSON output produced by data-profiler, as specified in the user's configuration. Options for <RDF-serialization> include: nquads (default), turtle, ntriples, trig, trix, jsonld, hdt. If the path to the output RDF file is omitted, the RDF triples are written to standard output.

NOTE: Executing this operation with the RML Mapper requires Java 11 or later.
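
When this step is scripted, the java invocation above can be assembled in Python and the serialization checked against the options listed earlier. This is a sketch under assumptions: build_rmlmapper_command is a hypothetical helper, the file names in the usage lines are placeholders, and only the flags shown in the command above (-m, -d, -s, -o) are used.

```python
import shlex

# Serialization options as listed in this section; nquads is the default.
RDF_SERIALIZATIONS = ("nquads", "turtle", "ntriples", "trig", "trix", "jsonld", "hdt")


def build_rmlmapper_command(jar_path, mapping_path, serialization="nquads", output_path=None):
    """Build the java argument list for RML Mapper (illustrative helper)."""
    if serialization not in RDF_SERIALIZATIONS:
        raise ValueError(f"Unsupported RDF serialization: {serialization!r}")
    command = [
        "java", "-jar", str(jar_path),
        "-m", str(mapping_path),
        "-d",
        "-s", serialization,
    ]
    if output_path is not None:
        # Without -o, the triples are written to standard output.
        command += ["-o", str(output_path)]
    return command


if __name__ == "__main__":
    # Placeholder file names; substitute the real JAR and output locations.
    command = build_rmlmapper_command(
        "rmlmapper.jar", "output/mapping.ttl", "turtle", "output/profile.ttl"
    )
    print(shlex.join(command))
```

The list can then be passed to subprocess.run(command, check=True), which raises an error if Java or the mapper exits with a failure code.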

License

The contents of this project are licensed under the Apache License 2.0.

