Iyer/issue141 #142

Merged
merged 15 commits into from
Mar 10, 2025
27 changes: 26 additions & 1 deletion docs/backends.rst
@@ -1,25 +1,50 @@
Backends
========

Backends connect users to DSI Core middleware and backends allow DSI middleware data structures to read and write to persistent external storage. Backends are modular to support user contribution. Backend contributors are encouraged to offer custom backend abstract classes and backend implementations. A contributed backend abstract class may extend another backend to inherit the properties of the parent. In order to be compatible with DSI core middleware, backends should create an interface to Python built-in data structures or data structures from the Python ``collections`` library. Backend extensions will be accepted conditional on the extension of ``backends/tests`` to demonstrate new Backend capability. We cannot accept pull requests that are not tested.
Backends connect users to DSI Core middleware and backends allow DSI middleware data structures to read and write to persistent external storage.
Backends are modular to support user contribution. Backend contributors are encouraged to offer custom backend abstract classes and backend implementations.
A contributed backend abstract class may extend another backend to inherit the properties of the parent.
In order to be compatible with DSI core middleware, backends should create an interface to Python built-in data structures or data structures from the Python ``collections`` library.
Backend extensions will be accepted conditional on the extension of ``backends/tests`` to demonstrate new Backend capability.
We cannot accept pull requests that are not tested.

Note that any contributed backends or extensions should include unit tests in ``backends/tests`` to demonstrate the new Backend capability.
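
As a rough illustration of the guidance above, the following sketch shows a contributed backend that extends an abstract parent, exchanges data as an ``OrderedDict`` from ``collections``, and comes with the kind of unit test that would live in ``backends/tests``. The names ``Backend``, ``InMemoryBackend``, ``put_artifacts``, and ``get_artifacts`` are hypothetical stand-ins chosen for this example and are not taken from the DSI source.

```python
from abc import ABC, abstractmethod
from collections import OrderedDict


class Backend(ABC):
    """Hypothetical abstract parent; stands in for a DSI backend base class."""

    @abstractmethod
    def put_artifacts(self, collection: OrderedDict) -> None:
        """Write a collection to storage."""

    @abstractmethod
    def get_artifacts(self) -> OrderedDict:
        """Read the stored collection back."""


class InMemoryBackend(Backend):
    """Hypothetical contributed backend that keeps data in memory, not on disk."""

    def __init__(self) -> None:
        self._store = OrderedDict()

    def put_artifacts(self, collection: OrderedDict) -> None:
        # Copy each column so later changes by the caller do not leak into storage.
        for column, values in collection.items():
            self._store[column] = list(values)

    def get_artifacts(self) -> OrderedDict:
        # Return copies for the same reason.
        return OrderedDict((column, list(values)) for column, values in self._store.items())


def test_round_trip() -> None:
    """The kind of test a contribution would add under backends/tests."""
    backend = InMemoryBackend()
    data = OrderedDict([("name", ["joey", "amy"]), ("score", [20, 30])])
    backend.put_artifacts(data)
    assert backend.get_artifacts() == data


test_round_trip()
```

Keeping the interface to plain ``OrderedDict`` values means the test needs no external storage to run.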

.. figure:: BackendClassHierarchy.png
   :alt: Figure depicting the current backend class hierarchy.
   :class: with-shadow
   :scale: 100%
   :align: center

   Figure depicts the current DSI backend class hierarchy.

.. automodule:: dsi.backends.filesystem
   :members:

SQLite
------

.. automodule:: dsi.backends.sqlite
   :members:
   :special-members: __init__

SQLAlchemy
----------

.. automodule:: dsi.backends.sqlalchemy
   :members:
   :special-members: __init__

GUFI
------

.. automodule:: dsi.backends.gufi
   :members:
   :special-members: __init__

Parquet
-------

.. automodule:: dsi.backends.parquet
   :members:
   :special-members: __init__
6 changes: 4 additions & 2 deletions docs/conf.py
@@ -5,11 +5,12 @@

# -- Project information -----------------------------------------------------
# https://www.sphinx-doc.org/en/master/usage/configuration.html#project-information
exec(open("../dsi/_version.py").read())

project = 'DSI'
copyright = '2023, Triad National Security, LLC. All rights reserved.'
copyright = '2025, Triad National Security, LLC. All rights reserved.'
author = 'The DSI Project team'
release = '0.0.0'
release = __version__

# -- General configuration ---------------------------------------------------
# https://www.sphinx-doc.org/en/master/usage/configuration.html#general-configuration
@@ -22,6 +23,7 @@
templates_path = ['_templates']
exclude_patterns = ['_build', 'Thumbs.db', '.DS_Store', 'README.rst']

rst_prolog = f".. |version_num| replace:: {__version__}"

# -- Options for HTML output -------------------------------------------------
# https://www.sphinx-doc.org/en/master/usage/configuration.html#options-for-html-output
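
The ``conf.py`` change above pulls ``__version__`` out of ``dsi/_version.py`` with ``exec`` and exposes it to every page through ``rst_prolog``. One possible alternative, sketched here under the assumption that ``_version.py`` contains a line of the form ``__version__ = "x.y.z"``, parses the file instead of executing it. The helper name ``read_version`` and the version value in the demo are illustrative only.

```python
import re
import tempfile
from pathlib import Path


def read_version(path: Path) -> str:
    """Return the string assigned to __version__ in the given file."""
    text = path.read_text(encoding="utf-8")
    match = re.search(r"""^__version__\s*=\s*['"]([^'"]+)['"]""", text, re.MULTILINE)
    if match is None:
        raise ValueError(f"no __version__ assignment found in {path}")
    return match.group(1)


# Demo with a throwaway file; in conf.py the path would be ../dsi/_version.py.
with tempfile.TemporaryDirectory() as tmp:
    version_file = Path(tmp) / "_version.py"
    version_file.write_text('__version__ = "1.2.3"\n', encoding="utf-8")
    release = read_version(version_file)

# Same substitution definition that the diff adds to conf.py.
rst_prolog = f".. |version_num| replace:: {release}"
print(rst_prolog)
```

Parsing avoids running arbitrary code at documentation build time, at the cost of supporting only a simple literal assignment.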
27 changes: 18 additions & 9 deletions docs/contributing_readers.rst
@@ -2,7 +2,8 @@
Making a Reader for Your Application
====================================

DSI readers are the primary way to transform outside data to metadata that DSI can ingest. Readers are Python classes that must include a few methods, namely ``__init__``, ``pack_header``, and ``add_rows``.
DSI readers are the primary way to transform outside data to metadata that DSI can ingest.
Readers are Python classes that must include a few methods, namely ``__init__``, ``pack_header``, and ``add_rows``.
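
To make the three required methods concrete, here is a minimal, self-contained sketch of how they might fit together. ``ReaderBase`` is a hypothetical stand-in for the DSI parent class, and the attribute ``output_collector``, the ``header_packed`` flag, and the exact method signatures are assumptions made so the example runs without DSI installed.

```python
from collections import OrderedDict


class ReaderBase:
    """Hypothetical stand-in for the DSI reader parent class."""

    def __init__(self) -> None:
        self.output_collector = OrderedDict()
        self.header_packed = False

    def add_to_output(self, row: list) -> None:
        # Append one value per column, in header order.
        for column, value in zip(self.output_collector, row):
            self.output_collector[column].append(value)


class TemperatureReader(ReaderBase):
    """Hypothetical reader that turns (sensor, reading) pairs into columns."""

    def __init__(self, samples: list) -> None:
        super().__init__()
        self.samples = samples

    def pack_header(self) -> None:
        # Declare the column names once, before any rows are added.
        for column in ("sensor", "reading"):
            self.output_collector[column] = []
        self.header_packed = True

    def add_rows(self) -> None:
        if not self.header_packed:
            self.pack_header()
        for sensor, reading in self.samples:
            self.add_to_output([sensor, reading])


reader = TemperatureReader([("a1", 20.5), ("b2", 21.0)])
reader.add_rows()
print(reader.output_collector)
```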

Initializer: ``__init__(self) -> None:``
-------------------------------------------
@@ -49,11 +50,14 @@ Example ``add_rows``: ::

self.add_to_output(my_data)

*Alternate* Add Rows: ``add_rows(self) -> None``
*Newer* Add Rows: ``add_rows(self) -> None``
------------------------------------------------
If you are confident that the data you read in ``add_rows`` is in the form of an OrderedDict (the data structure used to store all ingested data), you can bypass the use of ``pack_header`` and ``add_to_output`` with an alternate ``set_schema`` function.
If you are confident that the data you read in ``add_rows`` is in the form of an OrderedDict (the data structure used to store all ingested data),
you can bypass the use of ``pack_header`` and ``add_to_output`` with an alternate ``set_schema`` function.

This function, ``set_schema_2(self, collection, validation_model=None) -> None``, directly assigns the data you read in ``add_rows`` to the internal DSI abstraction layer, provided that the data you pass as the ``collection`` variable is an OrderedDict. This method allows you to quickly append data to the abstraction wholesale, rather than row-by-row.
This function, ``set_schema_2(self, collection, validation_model=None) -> None``, directly assigns the data you read in ``add_rows`` to the internal DSI abstraction layer,
provided that the data you pass as the ``collection`` variable is an OrderedDict.
This method allows you to quickly append data to the abstraction wholesale, rather than row-by-row.

Example alternate ``add_rows``: ::

@@ -65,20 +69,25 @@ Example alternate ``add_rows``: ::
my_data["joey"] = 20
my_data["amy"] = 30

self.set_schema2(my_data)
self.set_schema_2(my_data)
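
A runnable version of the wholesale path might look like the sketch below. Only the ``set_schema_2(self, collection, validation_model=None)`` signature comes from the text above. The ``ReaderBase`` stand-in, its ``output_collector`` attribute, and the type check inside ``set_schema_2`` are assumptions made for illustration.

```python
from collections import OrderedDict


class ReaderBase:
    """Hypothetical stand-in for the DSI reader parent class."""

    def __init__(self) -> None:
        self.output_collector = OrderedDict()

    def set_schema_2(self, collection, validation_model=None) -> None:
        # Assumed behavior: accept only an OrderedDict and store it wholesale.
        if not isinstance(collection, OrderedDict):
            raise TypeError("collection must be an OrderedDict")
        self.output_collector = collection


class BulkReader(ReaderBase):
    """Hypothetical reader that hands over all of its data in one call."""

    def add_rows(self) -> None:
        my_data = OrderedDict()
        my_data["joey"] = 20
        my_data["amy"] = 30
        # No pack_header or add_to_output calls are needed on this path.
        self.set_schema_2(my_data)


bulk_reader = BulkReader()
bulk_reader.add_rows()
print(bulk_reader.output_collector)
```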

Implemented Examples
--------------------------------
If you want to see some full reader examples in-code, some can be found in
`dsi/plugins/env.py <https://github.com/lanl/dsi/blob/main/dsi/plugins/env.py>`_.
``Hostname`` is an especially simple example to go off of.
`dsi/plugins/file_reader.py <https://github.com/lanl/dsi/blob/main/dsi/plugins/file_reader.py>`_.
``Csv`` is an especially simple example to go off of.

Loading Your Reader
-------------------------
There are two ways to load your reader, internally and externally.

- Internally: If you want your reader loadable internally with the rest of the provided implementations (in `dsi/plugins <https://github.com/lanl/dsi/tree/main/dsi/plugins>`_), it must be registered in the class variables of ``Terminal`` in `dsi/core.py <https://github.com/lanl/dsi/blob/main/dsi/core.py>`_. If this is done correctly, your reader will be loadable by the ``load_module`` method of ``Terminal``.
- Externally: If your reader is not alongside the other provided implementations, possibly somewhere else on the filesystem, your reader will be loaded externally. This is done by using the ``add_external_python_module`` method of ``Terminal``. If you load an external Python module this way (ex. ``term.add_external_python_module('plugin','my_python_file','/the/path/to/my_python_file.py')``), your reader will then be loadable by the ``load_module`` method of ``Terminal``.
- Internally: If you want your reader loadable internally with the rest of the provided implementations (in `dsi/plugins <https://github.com/lanl/dsi/tree/main/dsi/plugins>`_),
  it must be registered in the class variables of ``Terminal`` in `dsi/core.py <https://github.com/lanl/dsi/blob/main/dsi/core.py>`_.
  If this is done correctly, your reader will be loadable by the ``load_module`` method of ``Terminal``.
- Externally: If your reader is not alongside the other provided implementations, possibly somewhere else on the filesystem, your reader will be loaded externally.
  This is done by using the ``add_external_python_module`` method of ``Terminal``.
  If you load an external Python module this way (ex. ``term.add_external_python_module('plugin','plugin_class_name','/path/to/python_file.py')``),
  your reader will then be loadable by the ``load_module`` method of ``Terminal``.
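
For orientation, loading a class from an arbitrary file path can be done with the standard library's ``importlib``. The sketch below only illustrates that general mechanism. It is an assumption, not a confirmed detail, that ``add_external_python_module`` works along similar lines, and the helper ``load_class_from_path`` and the ``MyReader`` class are hypothetical.

```python
import importlib.util
import tempfile
from pathlib import Path


def load_class_from_path(module_name: str, class_name: str, file_path: str):
    """Import the file at file_path as module_name and return class_name from it."""
    spec = importlib.util.spec_from_file_location(module_name, file_path)
    if spec is None or spec.loader is None:
        raise ImportError(f"cannot load {file_path}")
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return getattr(module, class_name)


# Write a throwaway plugin file so the example is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    plugin_file = Path(tmp) / "python_file.py"
    plugin_file.write_text(
        "class MyReader:\n"
        "    def add_rows(self):\n"
        "        return 'rows added'\n",
        encoding="utf-8",
    )
    reader_cls = load_class_from_path("python_file", "MyReader", str(plugin_file))
    result = reader_cls().add_rows()

print(result)
```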


Contributing Your Reader
98 changes: 82 additions & 16 deletions docs/core.rst
@@ -1,31 +1,97 @@
Core
====

The DSI Core middleware defines the Terminal concept. An instantiated Terminal is the human/machine DSI interface. The person setting up a Core Terminal only needs to know how they want to ask questions, and what metadata they want to ask questions about. If they don’t see an option to ask questions the way they like, or they don’t see the metadata they want to ask questions about, then they should ask a Backend Contributor or a Plugin Contributor, respectively.
The DSI Core middleware defines the Terminal and Sync concepts.
An instantiated Terminal is the human/machine DSI interface that connects Reader/Writer plugins and DSI backends.
An instantiated Sync supports data movement between local and remote locations and captures metadata documentation.

A Core Terminal is a home for Plugins (Readers/Writers), and an interface for Backends. A Core Terminal is instantiated with a set of default Plugins and Backends, but they must be loaded before a user query is attempted. ``core.py`` contains examples of how you might work with DSI using an interactive Python interpreter for your data science workflows:
Core: Terminal
--------------

.. literalinclude:: ../examples/coreterminal.py
The Terminal class is a structure through which users can interact with Plugins (Readers/Writers) and Backends as "module" objects.
Each reader/writer/backend can be "loaded" to be ready for use and users can interact with backends by ingesting, querying, processing, or finding data,
as well as generating an interactive notebook of the data.

All relevant functions are listed below for clarity. The Examples section shows various workflows that use the Terminal class.

At this point, you might decide that you are ready to collect data for inspection. It is possible to utilize DSI Backends to load additional metadata to supplement your Plugin metadata, but you can also sample Plugin data and search it directly.
Notes for users:

- Every loaded plugin writer must be followed by a call to ``transload()`` to execute it. Readers are executed automatically upon loading.
- ``Terminal.load_module``: to relate tables of data from a plugin reader under the same name, use the ``target_table_prefix`` input to specify a prefix.

  - When accessing data from these tables, remember that the table names include the specified prefix. Ex: collection1__math, collection1__english

- ``Terminal.artifact_handler``: the 'notebook' interaction_type stores data from the first loaded backend, not the existing DSI abstraction, in a new notebook file.
- Terminal find functions only access the first loaded backend.
- ``Terminal.unload_module``: removes the last loaded backend of the specified mod_name. Ex: with 2 loaded Sqlite backends, the second is unloaded.
- Terminal handles errors from any loaded DSI/user-written modules (plugins/backends).
  If writing an external plugin/backend, return a caught error as a tuple (error, error_message_string). Do not print in a new class.
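
The error-handling convention in the last note could look roughly like the sketch below. ``ExternalReader`` and the file path are hypothetical. The note does not say whether the first tuple element should be the exception class or the exception instance, so this sketch assumes the class.

```python
import os
import tempfile
from collections import OrderedDict


class ExternalReader:
    """Hypothetical external plugin; class and file names are illustrative."""

    def __init__(self, filename: str) -> None:
        self.filename = filename
        self.output_collector = OrderedDict()

    def add_rows(self):
        try:
            with open(self.filename, encoding="utf-8") as handle:
                lines = handle.read().splitlines()
        except OSError as error:
            # Hand the error back to the caller instead of printing it here.
            return (type(error), f"ExternalReader could not read {self.filename}: {error}")
        self.output_collector["line"] = lines
        return None


# A path inside a directory that is assumed not to exist, to trigger the error branch.
missing = os.path.join(tempfile.gettempdir(), "dsi_example_missing_dir", "input.txt")
outcome = ExternalReader(missing).add_rows()
print(outcome[0].__name__)
```

Returning the tuple leaves the decision about how to report the failure to the caller, which matches the note that Terminal handles errors from loaded modules.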

.. autoclass:: dsi.core.Terminal
   :members:
   :special-members: __init__

Core: Sync
----------

The process of transforming a set of Plugin writers and readers into a queryable format is called transloading. A DSI Core Terminal has a ``transload()`` method which may be called to execute all Plugins at once::
The DSI Core middleware also defines data management functionality in ``Sync``.
``Sync`` provides file metadata documentation and data movement capabilities when moving data to/from local and remote locations.
Data documentation captures and archives metadata
(i.e. the location of the local file structure, access permissions, file sizes, and creation/access/modification dates)
and tracks the files' movement to the remote location for future access.
The primary functions, ``Copy``, ``Move``, and ``Get``, serve as mechanisms to copy data, move data, or retrieve data from remote locations, either by creating a DSI database in the process
or by retrieving an existing DSI database that contains the location(s) of the target data.
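
The kind of per-file metadata listed above can be gathered with the standard library alone. The following sketch is not how ``Sync`` is implemented. It only shows one way to collect location, permissions, size, and timestamps into an ``OrderedDict``, and the function name ``describe_file`` and the key names are invented for the example.

```python
import os
import stat
import tempfile
from collections import OrderedDict
from datetime import datetime, timezone


def _iso(timestamp: float) -> str:
    """Format a POSIX timestamp as an ISO 8601 string in UTC."""
    return datetime.fromtimestamp(timestamp, tz=timezone.utc).isoformat()


def describe_file(path: str) -> OrderedDict:
    """Collect location, permissions, size, and timestamps for one file."""
    info = os.stat(path)
    return OrderedDict([
        ("location", os.path.abspath(path)),
        ("permissions", stat.filemode(info.st_mode)),
        ("size_bytes", info.st_size),
        ("accessed", _iso(info.st_atime)),
        ("modified", _iso(info.st_mtime)),
        # st_ctime is creation time on Windows but last metadata change on Unix.
        ("created_or_changed", _iso(info.st_ctime)),
    ])


# Describe a throwaway file so the example is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "sample.txt")
    with open(sample, "w", encoding="utf-8") as handle:
        handle.write("hello")
    record = describe_file(sample)

print(list(record))
```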

>>> a.transload()
>>> a.active_metadata
>>> # OrderedDict([('uid', [1000]), ('effective_gid', [1000]), ('moniker', ['qwofford'])...
.. autoclass:: dsi.core.Sync
   :members:
   :special-members: __init__

Examples
--------
Before interacting with the plugins and backends, they must each be loaded.
Examples below display various ways users can incorporate DSI into their data science workflows.
They can be found and run in ``examples/core/``.

Once a Core Terminal has been transloaded, no further Plugins may be added.
Example 1: Intro use case
~~~~~~~~~~
Baseline use of DSI to list Modules

Core:Sync
---------
.. literalinclude:: ../examples/core/baseline.py

The DSI Core middleware also defines data management functionality in ``Sync``. The purpose of ``Sync`` is to provide file metadata documentation and data movement capabilities when moving data to/from local and remote locations. The purpose of data documentation is to capture and archive metadata (i.e. location of local file structure, their access permissions, file sizes, and creation/access/modification dates) and track their movement to the remote location for future access. The primary functions, ``Copy``, ``Move``, and ``Get`` serve as mechanisms to copy data, move data, or retrieve data from remote locations by creating a DSI database in the process, or retrieving an existing DSI database that contains the location(s) of the target data.
Example 2: Ingest data
~~~~~~~~~~
Ingesting data from a Reader to a backend

Core Modules and Functions
--------------------------
.. literalinclude:: ../examples/core/ingest.py

.. automodule:: dsi.core
:members:
Example 3: Query data
~~~~~~~~~~
Querying data from a backend

.. literalinclude:: ../examples/core/query.py

.. _example4_label:

Example 4: Process data
~~~~~~~~~~~~~~~~~~~~~~~
Processing data from a backend to generate an Entity Relationship diagram using a Writer

.. literalinclude:: ../examples/core/process.py

Example 5: Generate notebook
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Generating a Python notebook file (mostly Jupyter notebook) from a backend to view data interactively

.. literalinclude:: ../examples/core/notebook.py

Example 6: Find data
~~~~~~~~~~
Finding data from a backend - tables, columns, cells, or all matches

.. literalinclude:: ../examples/core/find.py

Example 7: External plugin
~~~~~~~~~~~~~~~~~~~~~~~~~~
Loading an external Python plugin reader from a separate file:

.. literalinclude:: ../examples/core/external_plugin.py

``text_file_reader``:

.. literalinclude:: ../examples/core/text_file_reader.py