Commit e219a08

minor updates for inline and docs/ text
1 parent 773ef68 commit e219a08

File tree

9 files changed

+73
-108
lines changed

docs/backends.rst

+6-1
Original file line number | Diff line number | Diff line change
@@ -1,7 +1,12 @@
11
Backends
22
========
33

4-
Backends connect users to DSI Core middleware and backends allow DSI middleware data structures to read and write to persistent external storage. Backends are modular to support user contribution. Backend contributors are encouraged to offer custom backend abstract classes and backend implementations. A contributed backend abstract class may extend another backend to inherit the properties of the parent. In order to be compatible with DSI core middleware, backends should create an interface to Python built-in data structures or data structures from the Python ``collections`` library. Backend extensions will be accepted conditional to the extention of ``backends/tests`` to demonstrate new Backend capability. We can not accept pull requests that are not tested.
4+
Backends connect users to DSI Core middleware and allow DSI middleware data structures to be read from and written to persistent external storage.
5+
Backends are modular to support user contribution. Backend contributors are encouraged to offer custom backend abstract classes and backend implementations.
6+
A contributed backend abstract class may extend another backend to inherit the properties of the parent.
7+
In order to be compatible with DSI core middleware, backends should create an interface to Python built-in data structures or data structures from the Python ``collections`` library.
8+
Backend extensions will be accepted conditional on the extension of ``backends/tests`` to demonstrate new Backend capability.
9+
We cannot accept pull requests that are not tested.
510

611
Note that any contributed backends or extensions should include unit tests in ``backends/tests`` to demonstrate the new Backend capability.
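
As a rough illustration of the interface requirement described above, the sketch below defines a hypothetical backend abstract class and a toy in-memory implementation that exchange data as ``collections.OrderedDict`` objects. The names ``SketchBackend``, ``InMemoryBackend``, ``put_artifacts``, and ``get_artifacts`` are assumptions made for this example; they are not taken from the DSI source, and a real contribution would extend the existing DSI backend classes instead.

```python
from abc import ABC, abstractmethod
from collections import OrderedDict


class SketchBackend(ABC):
    """Hypothetical backend base class (an assumption, not the DSI API)."""

    @abstractmethod
    def put_artifacts(self, tables: OrderedDict) -> None:
        """Persist a mapping of table name -> {column name -> list of values}."""

    @abstractmethod
    def get_artifacts(self) -> OrderedDict:
        """Return everything that was stored, in the same shape."""


class InMemoryBackend(SketchBackend):
    """Toy implementation that keeps data in memory instead of external storage."""

    def __init__(self) -> None:
        self._store = OrderedDict()

    def put_artifacts(self, tables: OrderedDict) -> None:
        for name, columns in tables.items():
            # Copy the column lists so later caller-side edits do not leak in.
            self._store[name] = OrderedDict((c, list(v)) for c, v in columns.items())

    def get_artifacts(self) -> OrderedDict:
        return self._store


backend = InMemoryBackend()
backend.put_artifacts(OrderedDict(rundata=OrderedDict(a=[1, 2], b=[3, 4])))
stored = backend.get_artifacts()
```

A unit test placed in ``backends/tests`` could then assert that data written through ``put_artifacts`` comes back unchanged from ``get_artifacts``.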
712

docs/contributing_readers.rst

+5-4
@@ -2,7 +2,8 @@
22
Making a Reader for Your Application
33
====================================
44

5-
DSI readers are the primary way to transform outside data to metadata that DSI can ingest. Readers are Python classes that must include a few methods, namely ``__init__``, ``pack_header``, and ``add_rows``.
5+
DSI readers are the primary way to transform outside data to metadata that DSI can ingest.
6+
Readers are Python classes that must include a few methods, namely ``__init__``, ``pack_header``, and ``add_rows``.
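
To make the three required methods concrete, here is a minimal stand-alone sketch. It does not import DSI: the class name ``KeyValueReader``, the ``output_collector`` attribute, and the ``name=value`` input format are all assumptions chosen for illustration, and a real reader would subclass the appropriate DSI plugin class and hand its data to DSI rather than store it itself.

```python
from collections import OrderedDict


class KeyValueReader:
    """Hypothetical reader; not a DSI class, only the three-method shape."""

    def __init__(self, lines) -> None:
        # Raw input, e.g. lines of "name=value" pairs from an application log.
        self.lines = list(lines)
        self.output_collector = OrderedDict()  # column name -> list of values
        self.header_done = False

    def pack_header(self) -> None:
        # Register one column per key found in the first line.
        for pair in self.lines[0].split(","):
            key, _ = pair.split("=")
            self.output_collector[key.strip()] = []
        self.header_done = True

    def add_rows(self) -> None:
        # Build the header once, then append one value per column per line.
        if not self.header_done:
            self.pack_header()
        for line in self.lines:
            for pair in line.split(","):
                key, value = pair.split("=")
                self.output_collector[key.strip()].append(int(value))


reader = KeyValueReader(["joey=20, amy=30", "joey=21, amy=31"])
reader.add_rows()
```

After ``add_rows`` runs, ``reader.output_collector`` maps each column name to its list of values, which is the kind of built-in structure the DSI core expects to receive.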
67

78
Initializer: ``__init__(self) -> None:``
89
-------------------------------------------
@@ -65,13 +66,13 @@ Example alternate ``add_rows``: ::
6566
my_data["joey"] = 20
6667
my_data["amy"] = 30
6768

68-
self.set_schema2(my_data)
69+
self.set_schema_2(my_data)
6970

7071
Implemented Examples
7172
--------------------------------
7273
If you want to see some full reader examples in-code, some can be found in
73-
`dsi/plugins/env.py <https://github.com/lanl/dsi/blob/main/dsi/plugins/env.py>`_.
74-
``Hostname`` is an especially simple example to go off of.
74+
`dsi/plugins/file_reader.py <https://github.com/lanl/dsi/blob/main/dsi/plugins/file_reader.py>`_.
75+
``Csv`` is an especially simple example to go off of.
7576

7677
Loading Your Reader
7778
-------------------------

docs/core.rst

+8-3
Original file line numberDiff line numberDiff line change
@@ -1,12 +1,17 @@
11
Core
22
====
33

4-
The DSI Core middleware defines the Terminal concept. An instantiated Terminal is the human/machine DSI interface. The person setting up a Core Terminal only needs to know how they want to ask questions, and what metadata they want to ask questions about. If they don’t see an option to ask questions the way they like, or they don’t see the metadata they want to ask questions about, then they should ask a Backend Contributor or a Plugin Contributor, respectively.
4+
The DSI Core middleware defines the Terminal concept. An instantiated Terminal is the human/machine DSI interface.
5+
The person setting up a Core Terminal only needs to know how they want to ask questions, and what metadata they want to ask questions about.
6+
If they don’t see an option to ask questions the way they like, or they don’t see the metadata they want to ask questions about, then they should ask a Backend Contributor or a Plugin Contributor, respectively.
57

68
Core: Terminal
79
--------------
810

9-
The Terminal class is a structure through which users can interact with Plugins (Readers/Writers) and Backends as "module" objects. Each reader/writer/backend can be "loaded" to make ready for use and users can further interact with backends by ingesting, querying, or processing data as well as generating an interactive notebook with data. All relevant functions have been listed below for further clarity.
11+
The Terminal class is a structure through which users can interact with Plugins (Readers/Writers) and Backends as "module" objects.
12+
Each reader/writer/backend can be "loaded" to make ready for use and users can further interact with backends by ingesting, querying, processing, or finding data as well as generating an interactive notebook with data.
13+
14+
All relevant functions are listed below for further clarity. The Examples section displays various workflows using this Terminal class.
1015

1116
Notes for users:
1217
- All plugin writers that are loaded must be followed by a call to transload() to execute them. Readers are automatically executed upon loading.
@@ -44,7 +49,7 @@ Example 3: Querying data from a backend
4449

4550
.. literalinclude:: ../examples/core/query.py
4651

47-
Example 4: Processing data from a backend to generate an Entity Relationship diagram
52+
Example 4: Processing data from a backend to generate an Entity Relationship diagram using a Writer
4853

4954
.. literalinclude:: ../examples/core/process.py
5055

docs/examples.rst

+4-61
Original file line numberDiff line numberDiff line change
@@ -20,9 +20,9 @@ In the first step, a python script is used to parse the slurm output files and c
2020
./parse_slurm_output.py --testname leblanc
2121
2222
23-
.. literalinclude:: ../examples/pennant/parse_slurm_output.py
23+
.. .. literalinclude:: ../examples/pennant/parse_slurm_output.py
2424
25-
A second python script,
25+
In the second step, another python script,
2626

2727
.. code-block:: unixconfig
2828
@@ -31,69 +31,17 @@ A second python script,
3131
3232
reads in the CSV file and creates a database:
3333

34-
.. code-block:: python
35-
36-
"""
37-
Creates the DSI db from the csv file
38-
"""
39-
"""
40-
This script reads in the csv file created from parse_slurm_output.py.
41-
Then it creates a DSI db from the csv file and performs a query.
42-
"""
43-
44-
import argparse
45-
import sys
46-
from dsi.backends.sqlite import Sqlite, DataType
47-
import os
48-
from dsi.core import Terminal
49-
50-
isVerbose = True
51-
52-
if __name__ == "__main__":
53-
""" The testname argument is required """
54-
parser = argparse.ArgumentParser()
55-
parser.add_argument('--testname', help='the test name')
56-
args = parser.parse_args()
57-
test_name = args.testname
58-
if test_name is None:
59-
parser.print_help()
60-
sys.exit(0)
61-
62-
table_name = "rundata"
63-
csvpath = 'pennant_' + test_name + '.csv'
64-
dbpath = 'pennant_' + test_name + '.db'
65-
output_csv = "pennant_read_query.csv"
66-
67-
#read in csv
68-
core = Terminal(run_table_flag=False)
69-
core.load_module('plugin', "Csv", "reader", filenames = csvpath, table_name = table_name)
70-
71-
if os.path.exists(dbpath):
72-
os.remove(dbpath)
73-
74-
#load data into sqlite db
75-
core.load_module('backend','Sqlite','back-write', filename=dbpath)
76-
core.artifact_handler(interaction_type='put')
77-
78-
# update dsi abstraction using a query to the sqlite db
79-
query_data = core.artifact_handler(interaction_type='get', query = f"SELECT * FROM {table_name} WHERE hydro_cycle_run_time > 0.006;", dict_return = True)
80-
core.update_abstraction(table_name, query_data)
81-
82-
#export to csv
83-
core.load_module('plugin', "Csv_Writer", "writer", filename = output_csv, table_name = table_name)
84-
core.transload()
34+
.. literalinclude:: ../examples/pennant/create_and_query_dsi_db.py
8535

8636
Resulting in the output of the query:
8737

8838
.. figure:: example-pennant-output.png
8939
:alt: Screenshot of computer program output.
9040
:class: with-shadow
9141

92-
9342
The output of the PENNANT example.
9443

9544

96-
9745
Wildfire Dataset
9846
----------------
9947

@@ -109,12 +57,7 @@ To run this example, load dsi and run:
10957
11058
python3 examples/wildfire/wildfire.py
11159
112-
Within ``wildfire.py``, Sqlite is imported from the available DSI backends and DataType is the derived class for the defined (regular) schema.
113-
114-
.. code-block:: unixconfig
115-
116-
from dsi.backends.sqlite import Sqlite, DataType
117-
60+
.. literalinclude:: ../examples/wildfire/wildfire.py
11861

11962
This will generate a wildfire.cdb folder with downloaded images from the server and a data.csv file of numerical properties of interest. This cdb folder is called a `Cinema`_ database (CDB). Cinema is an ecosystem for management and analysis of high dimensional data artifacts that promotes flexible and interactive data exploration and analysis. A Cinema database consists of a CSV file where each row of the table is a data element (a run or ensemble member of a simulation or experiment, for example) and each column is a property of the data element. Any column name that starts with 'FILE' is a path to a file associated with the data element. This could be an image, a plot, a simulation mesh or other data artifact.
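
The 'FILE' column convention can be sketched with the Python standard library alone. The CSV contents below are made up for illustration (the column names ``wind_speed`` and ``burned_area`` and the image paths are assumptions, not the actual wildfire ``data.csv``); the only rule taken from the text is that a column whose name starts with 'FILE' holds a path to an artifact.

```python
import csv
import io

# Made-up contents standing in for a Cinema database's data.csv.
data_csv = io.StringIO(
    "wind_speed,burned_area,FILE\n"
    "2.0,10.5,images/run_000.png\n"
    "4.0,22.1,images/run_001.png\n"
)

# Each row is one data element; each column is a property of that element.
rows = list(csv.DictReader(data_csv))

# Columns whose names start with 'FILE' hold paths to associated artifacts.
file_columns = [name for name in rows[0] if name.startswith("FILE")]
artifacts = [row[col] for row in rows for col in file_columns]
```

Here ``artifacts`` collects the artifact path of every data element, while the remaining columns stay available as queryable properties.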
12063

docs/introduction.rst

+35-26
Original file line numberDiff line numberDiff line change
@@ -1,25 +1,29 @@
1+
Introduction
2+
============
13

4+
The goal of the Data Science Infrastructure Project (DSI) is to manage data through metadata capture and curation.
5+
DSI capabilities can be used to develop workflows to support management of simulation data, AI/ML approaches, ensemble data, and other sources of data typically found in scientific computing.
26

3-
4-
The goal of the Data Science Infrastructure Project (DSI) is to manage data through metadata capture and curation. DSI capabilities can be used to develop workflows to support management of simulation data, AI/ML approaches, ensemble data, and other sources of data typically found in scientific computing. DSI infrastructure is designed to be flexible and with these considerations in mind:
5-
6-
- Data management is subject to strict, POSIX-enforced, file security.
7-
- DSI capabilities support a wide range of common metadata queries.
8-
- DSI interfaces with multiple database technologies and archival storage options.
9-
- Query-driven data movement is supported and is transparent to the user.
10-
- The DSI API can be used to develop user-specific workflows.
7+
DSI infrastructure is designed to be flexible and with these considerations in mind:
8+
- Data management is subject to strict, POSIX-enforced, file security.
9+
- DSI capabilities support a wide range of common metadata queries.
10+
- DSI interfaces with multiple database technologies and archival storage options.
11+
- Query-driven data movement is supported and is transparent to the user.
12+
- The DSI API can be used to develop user-specific workflows.
1113

1214
.. figure:: data_lifecycle.png
1315
:alt: Figure depicting the data life cycle
1416
:class: with-shadow
1517
:scale: 50%
1618

17-
A depiction of data life cycle can be seen here. The Data Science Infrastructure API supports the user to manage the life cycle aspects of their data.
19+
A depiction of the data life cycle. The DSI API helps users manage the life cycle aspects of their data.
1820

19-
DSI system design has been driven by specific use cases, both AI/ML and more generic usage. These use cases can often be generalized to user stories and needs that can be addressed by specific features, e.g., flexible, human-readable query capabilities. DSI uses Object Oriented design principles to encourage modularity and to support contributions by the user community. The DSI API is Python-based.
21+
DSI system design has been driven by specific use cases, both AI/ML and more generic usage.
22+
These use cases can often be generalized to user stories and needs that can be addressed by specific features, e.g., flexible, human-readable query capabilities.
23+
DSI uses Object Oriented design principles to encourage modularity and to support contributions by the user community. The DSI API is Python-based.
2024

2125
Implementation Overview
22-
=======================
26+
-----------------------
2327

2428
The DSI API is broken into three main categories:
2529

@@ -28,19 +32,19 @@ The DSI API is broken into three main categories:
2832
- DSI Core: the *middleware* that contains the basic functionality to use the DSI API.
2933

3034
Plugin Abstract Classes
31-
-----------------------
35+
~~~~~~~~~~~~~~~~~~~~~~~
3236

33-
Plugins transform an arbitrary data source into a format that is compatible with the DSI core. The parsed and queryable attributes of the data are called *metadata* -- data about the data. Metadata share the same security profile as the source data.
37+
Plugins transform an arbitrary data source into a format that is compatible with the DSI core.
38+
The parsed and queryable attributes of the data are called *metadata* -- data about the data. Metadata share the same security profile as the source data.
3439

35-
Plugins can operate as data readers or data writers. A simple data reader might parse an application's output file and place it into a core-compatible data structure such as Python built-ins and members of the popular Python ``collection`` module. A simple data writer might execute an application to supplement existing data and queryable metadata, e.g., adding locations of outputs data or plots after running an analysis workflow.
40+
Plugins can operate as data readers or data writers.
41+
A simple data reader might parse an application's output file and place it into a core-compatible data structure such as Python built-ins and members of the popular Python ``collections`` module.
42+
A simple data writer might execute an application to supplement existing data and queryable metadata, e.g., adding locations of output data or plots after running an analysis workflow.
3643

3744
Plugins are defined by a base abstract class, and support child abstract classes which inherit the properties of their ancestors.
3845

3946
Currently, DSI has the following readers:
4047

41-
- CSV file reader: reads in comma separated value (CSV) files.
42-
- Bueno reader: can be used to capture performance data from `Bueno <https://github.com/lanl/bueno>`_.
43-
4448
.. figure:: PluginClassHierarchy.png
4549
:alt: Figure depicting the current plugin class hierarchy.
4650
:class: with-shadow
@@ -49,27 +53,32 @@ Currently, DSI has the following readers:
4953
Figure depicting the current DSI plugin class hierarchy.
5054

5155
Backend Abstract Classes
52-
------------------------
56+
~~~~~~~~~~~~~~~~~~~~~~~~
5357

5458
Backends are an interface between the core and a storage medium.
55-
Backends are designed to support a user-needed functionality. Given a set of user metadata captured by a DSI frontend, a typical functionality needed by DSI users is to query that metadata by SQL query. Because the files associated with the queryable metadata may be spread across filesystems and security domains, a supporting backend is required to assemble query results and present them to the DSI core for transformation and return.
59+
Backends are designed to support a user-needed functionality.
60+
Given a set of user metadata captured by a DSI frontend, a typical functionality needed by DSI users is to query that metadata by SQL query.
61+
Because the files associated with the queryable metadata may be spread across filesystems and security domains,
62+
a supporting backend is required to assemble query results and present them to the DSI core for transformation and return.
5663

5764
.. figure:: user_story.png
5865
:alt: This figure depicts a user asking a typical query on the user's metadata
5966
:class: with-shadow
6067
:scale: 50%
6168

62-
In this typical **user story**, the user has metadata about their data stored in DSI storage of some type. The user needs to extract all files with the variable **foo** above a specific threshold. DSI backends query the DSI metadata store to locate and return all such files.
69+
In this typical **user story**, the user has metadata about their data stored in DSI storage of some type.
70+
The user needs to extract all files with the variable **foo** above a specific threshold.
71+
DSI backends query the DSI metadata store to locate and return all such files.
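
The user story can be approximated with a small, self-contained query using Python's built-in ``sqlite3`` module. This is only a sketch of the idea, not how DSI backends are implemented: the table name ``metadata``, the column names ``path`` and ``foo``, the sample rows, and the threshold value are all assumptions.

```python
import sqlite3

# In-memory database standing in for a DSI metadata store.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE metadata (path TEXT, foo REAL)")
conn.executemany(
    "INSERT INTO metadata VALUES (?, ?)",
    [("run_a/out.h5", 0.2), ("run_b/out.h5", 0.9), ("run_c/out.h5", 1.4)],
)

# Locate every file whose 'foo' value exceeds the threshold.
threshold = 0.5
matches = [
    path
    for (path,) in conn.execute(
        "SELECT path FROM metadata WHERE foo > ? ORDER BY path", (threshold,)
    )
]
conn.close()
```

The query returns only the paths of the matching files, so any later data movement can be driven by the query result rather than by scanning the files themselves.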
6372

6473
Current DSI backends include:
6574

66-
- Sqlite: Python based SQL database and backend; the default DSI API backend.
67-
- GUFI: the Grand Unified File Index system `Grand Unified File-Index <https://github.com/mar-file-system/GUFI>`_ ; developed at LANL, GUFI is a fast, secure metadata search across a filesystem accessible to both privileged and unprivileged users.
75+
- SQLite: Python based SQL database and backend; the default DSI API backend.
76+
- GUFI: the `Grand Unified File Index system <https://github.com/mar-file-system/GUFI>`_, developed at LANL. GUFI provides fast, secure metadata search across a filesystem and is accessible to both privileged and unprivileged users.
6877
- Parquet: a columnar storage format for `Apache Hadoop <https://hadoop.apache.org>`_.
6978

7079
DSI Core
71-
--------
72-
73-
DSI basic functionality is contained within the middleware known as the *core*. The DSI core is focused on delivering user-queries on unified metadata which can be distributed across many files and security domains. DSI currently supports Linux, and is tested on RedHat- and Debian-based distributions. The DSI core is a home for DSI Plugins and an interface for DSI Backends.
80+
~~~~~~~~
7481

75-
Core Documentation
82+
DSI basic functionality is contained within the middleware known as the *core*.
83+
The DSI core is focused on delivering user queries on unified metadata, which can be distributed across many files and security domains.
84+
DSI currently supports Linux, and is tested on RedHat- and Debian-based distributions. The DSI core is a home for DSI Plugins and an interface for DSI Backends.

docs/plugins.rst

+3-3
Original file line numberDiff line numberDiff line change
@@ -11,7 +11,7 @@ Note that any contributed plugins or extension should include unit tests in ``p
1111
:class: with-shadow
1212
:scale: 100%
1313

14-
Figure depicts the current DSI plugin class hierarchy.
14+
Figure depicts a prominent portion of the current DSI plugin class hierarchy.
1515

1616
.. automodule:: dsi.plugins.plugin
1717
:members:
@@ -29,15 +29,15 @@ Note for users:
2929
- Plugin readers in DSI repo can/should handle data files with mismatched number of columns. Ex: file1: table1 has columns a, b, c. file2: table1 has columns a, b, d
3030

3131
- if only reading in one table, users can utilize pandas to stack multiple dataframes vertically (CSV reader)
32-
- if ingesting multiple tables at a time, users must pad tables with null data (YAML1 and TOML1 use this. YAML1 has example code at bottom of add_row())
32+
- if ingesting multiple tables at a time, users must pad tables with null data (YAML1 uses this and has example code at bottom of add_row() to implement this)
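
The null-padding approach from the note above can be sketched in plain Python for the ``file1``/``file2`` case it describes. The helper name ``merge_padded`` is hypothetical and this is not the code used by the DSI readers; it assumes every input table is non-empty and that all columns within one table have the same length.

```python
from collections import OrderedDict

# Two files contributing to the same table with mismatched columns.
file1 = OrderedDict(a=[1], b=[2], c=[3])
file2 = OrderedDict(a=[4], b=[5], d=[6])


def merge_padded(*tables):
    """Stack tables vertically, filling missing cells with None."""
    merged = OrderedDict()
    rows_so_far = 0
    for table in tables:
        n_rows = len(next(iter(table.values())))
        for col in table:
            # A column seen for the first time is back-filled for earlier rows.
            merged.setdefault(col, [None] * rows_so_far)
        for col in merged:
            # Columns absent from this table are padded for its rows.
            merged[col].extend(table.get(col, [None] * n_rows))
        rows_so_far += n_rows
    return merged


table1 = merge_padded(file1, file2)
```

Every column in ``table1`` ends up with the same number of rows, with ``None`` marking the cells that a given file did not provide.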
3333
.. automodule:: dsi.plugins.file_reader
3434
:members:
3535

3636
File Writers
3737
------------
3838

3939
Note for users:
40-
- If runTable flag is True in Terminal instantiaton, the run table is only included in ER Diagram writer if data is processed from a backend. View Example 4 in Core Examples
40+
- If runTable flag is True in Terminal instantiation, the run table is only included in ER Diagram writer if data is processed from a backend. View Example 4 in Core Examples
4141
.. automodule:: dsi.plugins.file_writer
4242
:members:
4343
