Skip to content

gremid/idrovora

Repository files navigation

idrovora - A pump station for your XProc pipelines

This is work in progress!

Idrovora by Marianna57 / Wikimedia Commons / CC-BY-SA-4.0

XProc is a versatile language to describe XML-based processing pipelines. With XML Calabash and its embedded Saxon Processor exists a stable and feature-rich implementation, that allows developers to implement document processing logic in a platform-independent way using XML-based technologies.

Idrovora provides an execution context for XProc pipelines, where

  1. pipelines are run with input read from and results written to the filesystem following a common directory layout,
  2. the execution of pipelines can be triggered via requests to an embedded HTTP server or via filesystem events (hot folders),
  3. pipelines are run concurrently and asynchronously by a daemon process, thereby not incurring JVM-related startup costs for each pipeline execution.

It is designed as a backend service for systems which depend on XML-based processing logic and consequently have to incorporate a Java runtime environment needed by the aforementioned tools, but want to do so with a minimalistic interface based on HTTP and the local filesystem.

Getting Started

Prerequisites

In order to build, test and run Idrovora, you need to install

Alternatively you can run Idrovora within a Docker container in which case nothing is needed except a Docker installation.

Build Docker container

$ make

Test

$ make test

Develop and Run

$ clojure -m idrovora.cli --help
Idrovora - A pump station for your XProc pipelines
Copyright (C) 2020 Gregor Middell

See <https://github.com/gremid/idrovora> for more information.

Usage: clojure -m idrovora.cli [OPTION]...

Options:
  -x, --xpl-dir $IDROVORA_XPL_DIR           workspace/xpl   source directory with XProc pipeline definitions
  -j, --job-dir $IDROVORA_JOB_DIR           workspace/jobs  spool directory for pipeline jobs
  -p, --port $IDROVORA_HTTP_PORT            3000            HTTP port for embedded server
  -c, --cleanup $IDROVORA_CLEANUP_SCHEDULE  0 1 0 * * ?     Schedule for periodic cleanup of old jobs (cron expression)
  -a, --job-max-age $IDROVORA_JOB_MAX_AGE   PT168H          Maximum age of jobs; older jobs are removed periodically
  -h, --help
[…]

Starting Idrovora with defaults, creating a workspace/ in the current directory and running an HTTP server on port 3000, simply execute the JAR without any arguments:

$ clojure -m idrovora.cli
2020-03-09 14:52:51 [main       | INFO  | idrovora.cli        ] Starting Idrovora
2020-03-09 14:52:51 [main       | INFO  | idrovora.http       ] Starting HTTP server at 3000/tcp
2020-03-09 14:52:51 [main       | INFO  | idrovora.workspace  ] Start watching 'workspace/jobs'
2020-03-09 14:52:51 [main       | INFO  | idrovora.workspace  ] Scheduling cleanup of old jobs (0 1 0 * * ?)

Pipelines can be defined in workspace/xpl/, for instance a pipeline named test in workspace/xpl/test.xpl:

<?xml version="1.0" encoding="UTF-8"?>
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc"
                xmlns:c="http://www.w3.org/ns/xproc-step"
                version="1.0">
  <p:option name="source-dir" required="true"/>
  <p:option name="result-dir" required="true"/>
  <p:load name="read-from-input">
    <p:with-option name="href" select="concat($source-dir,'document.xml')"/>
  </p:load>
  <p:identity/>
  <p:store name="store-to-output">
    <p:with-option name="href" select="concat($result-dir,'document.xml')"/>
  </p:store>
</p:declare-step>

Jobs for this pipeline can then be created in sub-directories under workspace/jobs/test/, i. e.

workspace/jobs/
└── test
    └── 82d11012-cf02-4ec0-b3c6-f9fe004de7b0
        ├── result
        │   └── document.xml
        ├── source
        │   └── document.xml
        └── status
            ├── job-failed
            ├── result-ready
            └── source-ready

In a job-directory, here one named with the UUID 82d11012-cf02-4ec0-b3c6-f9fe004de7b0, the sub-directories source/ and result/ hold input and output data for the job; they are passed as URIs via options (source-dir and result-dir) to the pipeline. Files in the status/ sub-directory are used for controlling job execution and signaling job completion:

  • touching status/source-ready will signal to Idrovora that the sources have been written to source/ and the job can be scheduled for execution,
  • upon creation/modification of status/result-ready, a listening process can assume that the pipeline executed successfully and results can be picked up from result/,
  • the creation/modification of status/job-failed signals an error occurring while the job was run through the pipeline.

Roadmap

  • XProc engine based on XML Calabash
    • Call XProc pipelines with job context (source/result dirs) given as pipeline options
    • Improved error reporting
  • Filesystem watcher, notifying engine when jobs are ready to be run
  • HTTP interface
    • Embedded HTTP server
    • Allow jobs to be triggered via POST requests
    • Synchronously respond to job trigger requests when jobs are completed
  • User interface
    • Basic command line interface
    • Provide usage information/help
    • Make HTTP port configurable
  • Provide runtime metrics

Notes/Links

License

Copyright © 2020 Gregor Middell.

This project is licensed under the GNU General Public License v3.0.