Managing Processing Pipelines
Conductor is a Java application for managing queues of source files to be processed by sequences of procedures.
"A Conductor doesn't take a flute or a clarinet and show someone how to
play it. He tells them what they have to do and they have to find out
how."
- Vladimir Horowitz
Processing Pipelines:
Data production operations commonly involve the application of a sequence of procedures that are routinely used to generate output data products from input data sources; this is a processing pipeline. More than one procedure sequence may be employed, typically depending on the kind of input data and/or the desired type of output, but any given sequence is usually sufficiently well defined that it can be automated into a non-interactive uber-procedure that encapsulates the individual procedures of the sequence. This often takes the form of a script that uses as input the file containing the data source and executes the sequence of procedures on the file, and/or intermediate data files, to generate one or more products containing the desired output data.
The productivity advantages of automated processing scripts have resulted in using the mechanism with increasingly complex procedure sequences where the data to be processed exist in more than one file and/or the sequence of procedures involves more than a single, simple, linear logic. Procedure scripts can easily evolve to have numerous command line arguments and masses of intricate code to handle many processing options and exceptional processing conditions. Just incorporating the often overlooked business of handling possible error conditions resulting from each procedure in the sequence can cause a simple script to mushroom into a monster. And if any procedure in the sequence undergoes a significant change in its interface - the command line used to execute it or the data input/output requirements - alterations in the script can become a maintenance nightmare. Often what starts out as a simple procedure pipeline script turns out to be a prototype for a complex application program.
Managing Pipelines:
When implementing a processing pipeline as a script it is easy to overlook the mundane tasks of output logging, checking the completion status of each procedure and gracefully handling failure conditions. Conductor does this automatically. Managing all the parameters that control a processing pipeline can also be a challenge. Conductor uses a configuration file, supplemented by dynamically managed parameters, to provide parameter management. Parameters and database field values can be referenced in procedure definitions with a syntax much like that of a scripting language. However, Conductor is not intended as a substitute for scripts. Conductor manages procedures. Any of the procedures in a Conductor pipeline may be scripts as well as binary executables. Conductor is intended to keep the processing pipeline manageable even when the procedures and processing environment become complex.
Database Driven:
Conductor is designed to manage processing pipelines without requiring the pipeline implementer to write a script. Instead, the definition of the sequence of procedures in a pipeline is provided in a Pipeline_Procedures database table and the names of the source files to be processed are entered into a Pipeline_Sources database table. These two tables constitute the named Pipeline. The database server containing the pipeline tables may be located anywhere: on the system where Conductor is running or on a remote network accessible system. This enables Conductors running on many systems to share the same pipeline definitions without requiring shared filesystems or special interprocess protocols. Access information to the database server is provided in the configuration file given to Conductor. If, while Conductor is running, its connection to the database server is lost it will make repeated attempts to reconnect before giving up and reporting loss of database connectivity.
Pipeline_Sources
Each source record identifies the data source to be processed, its
processing status, and the location of the log file containing a
detailed report of all processing of the data source. Additional fields,
beyond those required by Conductor, may be present; for example, a
Last_Update
time field automatically maintained by the
database is recommended. Source records are processed in the order they
occur in the table, and new records may be added to the table at any
time.
The only field value that must be user specified to identify a data
source is the Source_Pathname that provides a pathname to a file.
Conductor will confirm read access to each source file; files without
read access will not be processed. The Source_Pathname need not be
unique; sources may be repeatedly processed. A Source_Number field that
contains a unique integer value, usually maintained by the database
server as an auto-increment field, is required. In addition a Source_ID
may be provided by the user; otherwise Conductor will set this to the
filename portion of the Source_Pathname, with any extension removed.
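The default Source_ID rule described above can be sketched as follows. This is only an illustration, not Conductor's own code: the class and method names are hypothetical, and it assumes that "any extension" means the text from the last dot onward.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

final class SourceIdSketch {
    /** Returns the filename portion of a pathname with its extension removed. */
    static String defaultSourceId(String sourcePathname) {
        Path fileName = Paths.get(sourcePathname).getFileName();
        String name = (fileName == null) ? "" : fileName.toString();
        // Assumption: the extension is everything from the last '.' onward;
        // a leading dot (as in ".profile") is not treated as an extension.
        int dot = name.lastIndexOf('.');
        return (dot > 0) ? name.substring(0, dot) : name;
    }

    public static void main(String[] args) {
        // prints image_0042
        System.out.println(defaultSourceId("/data/incoming/image_0042.img"));
    }
}
```

Under this assumption a file named archive.tar.gz would yield archive.tar; a rule that strips every extension would behave differently.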
Conductor will only process a source record that indicates unprocessed
status. Conductor acquires a source record by setting the Conductor_ID
field to the processing hostname in a way that guarantees only one
Conductor will acquire each record, thus allowing any number of
Conductor processes to be simultaneously working on a pipeline.
Conductor maintains the Status field of the source record with a list
of Status Indicators, one for each procedure that has been applied to
the source, in procedure sequence order. The value indicates a
procedure in progress or the final status of the procedure: success,
failure or timeout. A source record that already has a Status field
value when it is acquired will be reprocessed beginning with the next
unrecorded procedure in the sequence, but only if the last recorded
procedure was successful.
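The idea behind exclusive record acquisition can be illustrated with an in-memory analogy. Conductor performs the acquisition against the database; how it does so is not described here, so the sketch below only models the principle of an atomic "claim if unclaimed" step. The class, its map and its method names are all hypothetical.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

final class AcquireSketch {
    // Hypothetical stand-in for the Conductor_ID column, keyed by Source_Number.
    private final ConcurrentMap<Integer, String> conductorIds = new ConcurrentHashMap<>();

    /** Returns true only for the single caller that claims the record. */
    boolean acquire(int sourceNumber, String hostname) {
        // putIfAbsent is atomic: it plays the role of a conditional update
        // that sets Conductor_ID only when the field is still empty.
        return conductorIds.putIfAbsent(sourceNumber, hostname) == null;
    }

    /** Returns the hostname that claimed the record, or null if unclaimed. */
    String owner(int sourceNumber) {
        return conductorIds.get(sourceNumber);
    }

    public static void main(String[] args) {
        AcquireSketch table = new AcquireSketch();
        System.out.println(table.acquire(1, "hostA")); // prints true
        System.out.println(table.acquire(1, "hostB")); // prints false
    }
}
```

Because the check and the assignment happen as one step, two competing claimants can never both succeed, which is the property the text relies on.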
Conductor will always write a detailed report of all source record
processing to the file listed in the Log_Pathname
field.
The user may specify the value of this field or let Conductor determine
the filename based on the pipeline name, Source_ID
and
Source_Number
values. The log file will be created in
Conductor's current working directory unless a
Log_Directory
configuration parameter has been provided. If
Conductor finds that the Log_Pathname
field has a value
then its report will be appended to an existing file, which assures that
source record reprocessing will be reported to the same log file. The
log file contains details about the pipeline Conductor is processing,
the host system in use, the source data identification, timestamps for
the beginning and end of each procedure, and a copy of all normal
(stdout) and error (stderr) listings from each procedure. The report is
marked in a way to facilitate automated data extraction.
Pipeline_Procedures
Each procedure record specifies the order in which procedures are to be applied to a data source, a primary command line, completion success conditions, and a branch command to be used should the primary command not complete successfully. Additional fields, beyond those required by Conductor, may be present; in particular a Description field, if present, will be included in the log reporting. Procedure definitions may be safely modified while Conductor is running.
The Sequence
field determines the order in which
procedures are run. The value of this field is a real number to enable
new procedure records to be inserted in the table in any effective
sequence location without requiring procedure record reordering or
renumbering.
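As a small illustration of why a real-valued Sequence is convenient, the sketch below orders a hypothetical in-memory list of procedure records. The Procedure class and the command names are invented for the example; only the ordering rule comes from the description above.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class SequenceSketch {
    /** A hypothetical in-memory stand-in for one Pipeline_Procedures record. */
    static final class Procedure {
        final double sequence;
        final String commandLine;

        Procedure(double sequence, String commandLine) {
            this.sequence = sequence;
            this.commandLine = commandLine;
        }
    }

    /** Returns the command lines in the order they would be run. */
    static List<String> runOrder(List<Procedure> table) {
        List<Procedure> sorted = new ArrayList<>(table);
        // List.sort is stable, so records with equal Sequence values
        // keep their original table order.
        sorted.sort(Comparator.comparingDouble((Procedure p) -> p.sequence));
        List<String> commands = new ArrayList<>();
        for (Procedure p : sorted) {
            commands.add(p.commandLine);
        }
        return commands;
    }

    public static void main(String[] args) {
        List<Procedure> table = new ArrayList<>();
        table.add(new Procedure(1.0, "calibrate"));
        table.add(new Procedure(2.0, "archive"));
        // Added later, between the two existing records, without renumbering.
        table.add(new Procedure(1.5, "validate"));
        System.out.println(runOrder(table)); // prints [calibrate, validate, archive]
    }
}
```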
The Command_Line
specifies the procedure to be executed.
The command line specification may contain embedded references to be
resolved by configuration parameter or database field values. The
Conductor configuration file parameters, defined with simple Parameter
Value Language syntax, are automatically supplemented with all the
environment variables available to Conductor plus a set of parameters
identifying the pipeline and its database. Conductor also maintains a
set of dynamic parameters that identify the current source record,
procedure record sequence number and the completion status value of the
last procedure run. Both parameter and field references may be nested.
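A minimal sketch of nested reference resolution follows, using the ${Name} parameter reference form. It is not Conductor's resolver: the class name is hypothetical, it handles only a flat map of parameter values (not database field references), and it assumes that nesting means the innermost reference is resolved first to build the name of the outer one.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ReferenceSketch {
    // Matches an innermost reference: ${...} with no '$', '{' or '}' inside.
    private static final Pattern INNERMOST = Pattern.compile("\\$\\{([^${}]+)\\}");
    // Guard against values that refer back to themselves.
    private static final int MAX_SUBSTITUTIONS = 1000;

    static String resolve(String text, Map<String, String> values) {
        int substitutions = 0;
        Matcher m = INNERMOST.matcher(text);
        while (m.find()) {
            if (++substitutions > MAX_SUBSTITUTIONS) {
                throw new IllegalStateException("Too many substitutions; circular reference?");
            }
            String value = values.get(m.group(1));
            if (value == null) {
                throw new IllegalArgumentException("Unresolved reference: " + m.group());
            }
            // Replace this one reference, then rescan the updated text.
            text = text.substring(0, m.start()) + value + text.substring(m.end());
            m = INNERMOST.matcher(text);
        }
        return text;
    }

    public static void main(String[] args) {
        Map<String, String> values = new HashMap<>();
        values.put("Source_Pathname", "/data/incoming/image_0042.img");
        values.put("Kind", "Source");
        System.out.println(resolve("test_procedure ${Source_Pathname}", values));
        System.out.println(resolve("test_procedure ${${Kind}_Pathname}", values));
    }
}
```

Both calls in main produce the same command line, since ${Kind} resolves to Source and the outer reference then becomes ${Source_Pathname}.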
A procedure will only be allowed to run the amount of time specified
by the Time_Limit
value to avoid "hung" or "runaway"
processes. The specification is reference resolved and also evaluated as
a potential mathematical expression.
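The effect of a time limit can be sketched with the standard Java API. Note the assumption: this uses Process.waitFor(long, TimeUnit), which exists only in Java 8 and later, so it is not how Conductor itself enforces the limit on the Java versions discussed in this document. The class and method names are hypothetical, and the example commands assume a Unix system.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.concurrent.TimeUnit;

final class TimeLimitSketch {
    /** Returns the exit status, or null if the time limit expired. */
    static Integer runWithLimit(long limitSeconds, String... command) {
        try {
            // inheritIO sends the procedure's stdout and stderr to this process.
            Process process = new ProcessBuilder(command).inheritIO().start();
            if (process.waitFor(limitSeconds, TimeUnit.SECONDS)) {
                return process.exitValue();
            }
            // The limit expired: stop the "hung" or "runaway" process.
            process.destroyForcibly();
            process.waitFor();
            return null;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting", e);
        }
    }

    public static void main(String[] args) {
        Integer status = runWithLimit(1, "sleep", "5");
        System.out.println(status == null ? "timeout" : "exit status " + status);
    }
}
```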
Once a procedure has completed, its exit status is compared to the
Success_Status value to determine if the procedure completed
successfully. In addition to being reference resolved and evaluated as a
potential mathematical expression, the Success_Status is also evaluated
as a potential logical expression to determine the success condition of
the procedure exit status. As an alternative to the Success_Status, a
Success_Message (reference resolved) may be specified; it is used as a
regular expression match condition on the normal and error output from
the procedure. Conductor may be configured to use the conventional zero
Success_Status by default or to always assume the procedure completed
successfully.
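The success decision can be outlined in simplified form. The sketch below is an assumption-laden reduction: it treats Success_Status as a plain integer (it does not evaluate mathematical or logical expressions), treats a Success_Message match as "the pattern occurs somewhere in the output", and falls back to assuming success when neither is given. The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

final class SuccessSketch {
    /**
     * Decides whether a procedure succeeded.
     * successStatus and successMessage may be null when not specified.
     */
    static boolean succeeded(int exitStatus, Integer successStatus,
                             String successMessage, String output) {
        if (successStatus != null) {
            // Simplification: compare against a literal integer only.
            return exitStatus == successStatus;
        }
        if (successMessage != null) {
            // Assumption: success if the pattern is found anywhere in the
            // combined normal and error output.
            return Pattern.compile(successMessage).matcher(output).find();
        }
        // Assumption: the "always assume success" default is in effect.
        return true;
    }

    public static void main(String[] args) {
        System.out.println(succeeded(0, 0, null, ""));                     // prints true
        System.out.println(succeeded(2, 0, null, ""));                     // prints false
        System.out.println(succeeded(1, null, "Done\\.", "Step 1\nDone.")); // prints true
    }
}
```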
If Conductor determines that the procedure completed successfully,
then the next procedure is applied or, if the last procedure in the
sequence completed, the next source record is acquired and the sequence
of procedure processing begins again. However, if the procedure failed,
or timed out, the On_Failure
command line (reference
resolved, of course) is run instead. This procedure is allowed to run to
completion (without any time limit) before the next source record is
acquired and the sequence begins again. The Conductor distribution
includes a Notify
procedure that is typically used as an
On_Failure
procedure to send an email notification of
these, or other, special events.
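The control flow for a single source record, as described above, might be outlined like this. It is a sketch under stated assumptions: the Runner interface, the Procedure class and the command names are hypothetical, and running a command is reduced to a callback that reports success or failure.

```java
import java.util.ArrayList;
import java.util.List;

final class PipelineSketch {
    /** Hypothetical hook that runs a command line and reports success. */
    interface Runner {
        boolean run(String commandLine);
    }

    /** Hypothetical stand-in for one procedure record. */
    static final class Procedure {
        final String commandLine;
        final String onFailure; // may be null

        Procedure(String commandLine, String onFailure) {
            this.commandLine = commandLine;
            this.onFailure = onFailure;
        }
    }

    /** Applies the procedures to one source; returns the commands that were run. */
    static List<String> processSource(List<Procedure> procedures, Runner runner) {
        List<String> executed = new ArrayList<>();
        for (Procedure p : procedures) {
            executed.add(p.commandLine);
            if (!runner.run(p.commandLine)) {
                // Failure or timeout: run the branch command, then stop
                // processing this source record.
                if (p.onFailure != null) {
                    executed.add(p.onFailure);
                    runner.run(p.onFailure);
                }
                break;
            }
        }
        return executed;
    }

    public static void main(String[] args) {
        List<Procedure> procedures = new ArrayList<>();
        procedures.add(new Procedure("A", "notifyA"));
        procedures.add(new Procedure("B", "notifyB"));
        procedures.add(new Procedure("C", null));
        // Pretend that only procedure B fails.
        System.out.println(processSource(procedures, cmd -> !cmd.equals("B")));
        // prints [A, B, notifyB]
    }
}
```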
Distributed Processing:
Conductor is designed for use in a distributed processing systems
environment. Each pipeline may be managed by multiple Conductor
processes simultaneously; each source record is guaranteed to be
processed by only one Conductor regardless of the number of Conductors
working on the pipeline. Typically each Conductor runs on a different
processing engine to maximize throughput. Multiple Conductors may be run
on one or more host systems with each processing a separate pipeline
"segment". A segment is a pipeline in which the final procedure, or any
branch (On_Failure) procedure, makes a source entry in another pipeline
considered to be "chained" to the first segment. Networks of pipeline
segments may be constructed in this way. The Conductor distribution
includes a Pipeline_Source procedure for entering a source file record
into a pipeline; it may also be used for bulk additions of lists of
source files into a pipeline. Also, Conductor itself may be a procedure
in a pipeline to enable networks of dynamic, adaptive pipelines to be
implemented.
Operating Modes:
Conductor may be run in monitor, batch or daemon mode. In
monitor mode a graphical user interface is provided to control and
monitor pipeline processing; the entire log report for each source
record is displayed as it is being produced, along with other constantly
updated information about the procedures that are being run. In batch
mode Conductor processes all unprocessed source records in the pipeline
and then quits. In daemon mode Conductor runs in the background and
continuously polls, at a configurable interval, for unprocessed source
records. In this mode Conductor can be run unattended and it will
automatically process new source records as they appear in the pipeline
queue.
Installation
System Requirements:
- A Java Runtime Environment (JRE).
Java version 1.4 or above is required. Enter the command "java -version" to find the version number. Go to the Sun Java web site for no-cost Java distributions and documentation. To install the Process patch or to build the class files from a source code distribution, the Java Development Kit (JDK; a.k.a. Java Software Development Kit or SDK) will be needed.
Note: The Java Process class and its UNIXProcess implementation must be patched as described below in the Process patch section. For this reason Conductor is currently only supported on Unix operating systems. The patch kit is available from the PIRL distribution site in the Process.patch.tar.gz tarball. It has been successfully tested on Solaris, Mac OS X, FreeBSD and Linux. Installing the patch requires the use of a Java jar utility, which is included in the Java Software Development Kit.
- The PIRL Java Packages.
Various forms of this distribution are available from the PIRL distribution site:
Conductor.jar
- A complete distribution of the PIRL Java Packages, plus the necessary support packages (below), in a Java jar archive of class files configured to be run directly as the Conductor application. This is a simple, ready-to-run solution.
PIRL.jar
- The PIRL Java Packages distributed as class files. Installing this file in the Java extensions directory for the system or user, or naming the file in the CLASSPATH environment variable, is sufficient to make all of the classes available. The location of the system's Java extensions directory for the typical Unix system is $JRE_HOME/lib/ext (JRE_HOME may be equivalent to $JAVA_HOME/jre on some systems; the systems administrator should know where Java is installed). For Mac OS X the system's Java extensions directory is probably /Library/Java/Extensions (or Library/Java/Extensions under a user's home directory for personal use).
PIRL.tar.gz
- The PIRL Java Packages distributed as source code files. This requires the additional support packages (below). This is the best solution for a site that would like to take advantage of access to the source code.
- A MySQL database server.
The server may be running on the local host or on any accessible remote host. Conductor has been successfully used with MySQL versions 3.23.38 and above (to 5.0.37). See the MySQL web site to obtain an appropriate distribution for your system. Note: Conductor does not itself implement database access; this is done by the PIRL Database package where support for additional database server implementations can be provided. The Lunar Reconnaissance Orbiter Camera (LROC) Project has successfully implemented support for the PostgreSQL database server for use with Conductor; this will be included in an upcoming Database package distribution.
- The MySQL Connector/J JDBC driver for MySQL.
This is the official MySQL JDBC driver so the MySQL web site will have the latest version. A distribution of the driver is available from the PIRL distribution site in the mysql-connector.tar.gz tarball. A jar archive distribution is in the mysql-connector.jar file.
- The Java Components For Mathematics.
A project at Hobart and William Smith Colleges that includes a Java package with mathematical expression evaluation. A tarball of the complete version 1.0 distribution is provided (jcm-1.0.tar.gz). However, this is a large tarball and only the edu.hws.jcm.data package is used by the PIRL software, so a jar file for just this package is provided (jcm_data-1.0.jar, a.k.a. jcm.jar).
Process patch:
The interface for the Sun Java Process class, as of Java 1.4/5, has an
unfortunate shortcoming. The user can neither access the process
identification (PID) of the executed process nor provide a timeout
argument to the Object.wait method used by the waitFor method. These
capabilities are required by Conductor. The Process.patch kit,
contained in the Conductor.tar.gz tarball or available separately as
the Process.patch.tar.gz tarball, will correct this deficit. Detailed
installation instructions for this trivial fix are in the Process.patch
README file. The patch must be applied before building or using
Conductor.
Note: This simple little patch has absolutely no effect, direct or indirect, on any existing Java functionality.
Unpacking tarballs:
When building the PIRL Java Packages from the tarball distributions
choose an installation root directory and unpack the tarballs there.
Many sites, and users, collect all of their Java package installations
under a single directory. This is not necessary, but usually makes
managing the installations easier, so we'll proceed under this
assumption. If you are using GNU tar (named gtar on many systems, but
also often named tar; if you are unsure whether your tar utility is GNU
tar, enter the command "tar --version", which will produce a version
listing for GNU tar and an error message otherwise) unpack your
tarballs using:
gtar xzf TARBALL_FILE
Otherwise use:
gunzip -c TARBALL_FILE | tar xf -
The PIRL packages will all be unpacked into a subdirectory named
PIRL-N.N.N, where N.N.N is the current release version number. A link
named PIRL will also be produced that points to the current release
subdirectory.
Java requires that all packages be located in directories that are
named the same as their package names; thus the PIRL
link name satisfies this requirement.
The support distribution tarballs will be unpacked into their own
subdirectories: the Connector/J JDBC driver for MySQL into
mysql-connector-N.N.N, with a mysql-connector link; and the Java
Components For Mathematics into jcm-N.N, with a jcm link.
Note: The jar file distributions do not need to be unpacked.
Unpacked tarball distributions may be used in conjunction with jar
files, or vice versa. For example, it would be quite suitable to use
the jar files for the support distributions, which are much smaller than
their tarballs, along with an unpacked and compiled PIRL tarball
distribution. If you are using the Conductor.jar or the PIRL.jar with
the support jar files there is nothing to unpack.
Compiling the Code:
The PIRL Java Packages source code distribution includes a Build
subdirectory with a GNU make compatible Makefile which will invoke the
Makefile in each package directory. Before compiling the PIRL Java
Packages the JCM package must be installed. The Database driver support
package is not needed for compiling the code. Simply installing the jar
files for these packages in the Java extensions directory is
sufficient. Otherwise each tarball contains a ready to use set of class
files. Each PIRL Java Package Makefile that needs the JCM specifies its
location with the JCM macro. This location is /opt/java/jcm by default.
The JCM environment variable may be set to the JCM location (which may
be a directory or jar file pathname). Then all the PIRL Java Packages
class files can be built from the Build directory by simply entering
the make (or gmake) command.
Note: If the PIRL.jar
file is installed in the Java
extensions directory compiling the code is not needed. Installing the jar
files in the Java extensions directory is the easiest way to get started.
Documentation:
On-line reference manuals for the PIRL Java Packages are available in javadoc form.
The source code distribution includes the javadoc files in the docs
directory. To build fresh javadoc files from the unpacked PIRL.tar.gz
distribution source code files use the make docs command in the Build
subdirectory. The directory where the documentation files will be
written is controlled by the PIRL_DOCS_DIR variable in the Makefile; by
default this is the PIRL/docs directory.
The JAVA_DOCS_DIR
variable specifies the location of the
core Java documentation (where the api
subdirectory with
its package-list
file is found) that is used to provide
links from the PIRL Java Packages documentation to Java Foundation
Classes (JFC) documentation. By default this is set to /usr/java/docs;
set it to the location for your site or to a URL where the JFC
documentation will be found.
Use
Configuration File:
The Conductor configuration file provides the information necessary to connect to the database server. A sample Conductor.conf has been provided in the PIRL/Conductor directory and is also available from the PIRL distribution site. It should be copied to the user's home directory or project directory and the permissions set to be read-only by the file owner (while not strictly required, the user's MySQL password must be entered into this file and so it should be kept private). Edit the parameters in the localhost group as appropriate for access to your MySQL database. Alternatively, add another database Server specification using the PIRL server specification as a model.
The sample configuration file also provides numerous parameters used by Conductor with their default values. The value for the Catalog is not a default - it is used with the sample setup in the tests subdirectory - and should be set to the name of the MySQL catalog (MySQL calls this a "database") where your pipeline tables will be found. The default for the Log_Directory will place log files in the current working directory. This can be changed to any existing, writable directory or specified in the Log_Pathname field of the Sources table.
Database Tables:
The queue of source files and sequence of pipeline procedures that Conductor uses are provided in a pair of database tables located on a MySQL database server. Each record of the table of Sources is used to specify the pathname of the file to be processed. Each record of the table of Procedures is used to define a procedure to be executed. These tables must be available to Conductor. The Conductor documentation provides details for the definition of these tables.
The PIRL/Conductor/tests directory contains *.SQL script files for
creating a Test pipeline in a Proc_Test catalog from the supplied
*.table files. The creator of the tables must, of course, have
appropriate permissions on the MySQL database server. The Procedures
table employs the Perl test_procedure that is provided, and the Sources
table simply lists the test_source file that is provided. A noop (no
operation) binary procedure is also used. Trivial C-language code to
build this program is included (noop.c; usually just entering "make
noop" in the same directory will build this program). Alternatively, a
link named noop referring to a suitable program (e.g. /bin/echo or
/bin/true) may be used as a substitute.
The sample Test_Procedures table provides a model that can be used with the documentation of the Conductor package to build your own pipelines. The only required entry is the Command_Line specification. Without the Sequence number all procedures will have the same precedence and will be executed in the top-to-bottom order they occur in the table. The Command_Line specification may just be the name of a command to execute, but usually at least the pathname of the source file is provided with the ${Source_Pathname} parameter reference. The Success_Status defaults to 0, which is conventional for Unix programs.
As the sample Test_Sources table demonstrates, only the Source_Pathname is required. The Source_ID is not required and will default to the filename portion of the Source_Pathname.
Running Conductor:
To run Conductor from its jar file just use:
java -jar Conductor.jar arguments
Otherwise the Java classpath must be set to include the
parent directory of the PIRL directory, the mysql-connector directory or
jar file and the jcm directory or jar file. This can be specified with
the -cp
argument when running java or in the
CLASSPATH
environment variable. For example, if you have
unpacked all the tarballs in /usr/local/java and are using a C shell (csh or
tcsh) then:
setenv CLASSPATH /usr/local/java:/usr/local/java/mysql-connector:/usr/local/java/jcm
The equivalent for the Bourne shell (sh) is:
CLASSPATH=/usr/local/java:/usr/local/java/mysql-connector:/usr/local/java/jcm; export CLASSPATH
If the support packages are in their jar files:
setenv CLASSPATH /usr/local/java:/usr/local/java/mysql-connector.jar:/usr/local/java/jcm.jar
Adding these commands to your .login (for C shell users) or .profile
(for Bourne shell users) file in your home directory (some systems use
different environment setup files; check with your systems administrator)
will ensure that the CLASSPATH
is automatically set on login.
Note: If the PIRL.jar file has been installed in the Java extensions
directory then the classpath does not need to be specified since the
Java Runtime Environment will find the appropriate classes
automatically. Also, providing a wrapper script that sets the
appropriate environment and executes the appropriate command for your
system will make life much easier. The "Conductor" wrapper used at
PIRL, which is included in the tarball distribution, provides an
example.
Then:
java PIRL.Conductor.Conductor arguments
If you have created the Test pipeline and adjusted the
Conductor.conf
file in your home directory as necessary
then the command:
java PIRL.Conductor.Conductor -m Test
should start Conductor, make a connection to the database server specified in the configuration file, display a monitor window and wait for you to hit the Start button. Select the Exit menu item to exit from Conductor.
It is a good idea to initially run Conductor in monitor mode to directly confirm the correct operation of your pipeline. Problems, such as malformed references or missing source files, will be listed in red with a description of the source of the problem. Once your pipeline executes successfully it might be more appropriate to run Conductor silently, as a daemon that constantly checks your Sources table for new sources to process. The log files for the processed sources will still contain the same complete record that would appear in the monitor display.
Status:
As each source record is processed through the pipeline its
Status
field is updated with the comma separated list of
status indicators for each procedure that is executed. This field is
kept current as each procedure is executed and may be safely monitored
by any appropriate database client (e.g. Data_View that is provided with
the PIRL Java Packages). The interpretation of status indicators is
provided in the Conductor.Status_Indicator(String) method description.
Basically, a first value of 0 means success and the parenthesized value
is the actual procedure exit status. The status indicators are, of
course, listed in Sequence number order.
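A client that monitors the Status field could interpret it along these lines. The authoritative description is the Conductor.Status_Indicator(String) method documentation; the sketch below only assumes what is stated above, namely a comma separated list in which each indicator has a first value (0 meaning success) optionally followed by a parenthesized exit status, for example 0(0),0(0),1(2). The class name and the example values are hypothetical, and the meaning of non-zero first values is not interpreted.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class StatusSketch {
    // Assumed indicator form: a first value, optionally followed by "(exit status)".
    private static final Pattern INDICATOR =
            Pattern.compile("\\s*(-?\\d+)\\s*(?:\\(\\s*(-?\\d+)\\s*\\))?\\s*");

    /** Returns one flag per recorded procedure: true if its first value is 0. */
    static List<Boolean> successes(String statusField) {
        List<Boolean> result = new ArrayList<>();
        if (statusField == null || statusField.trim().isEmpty()) {
            return result; // no procedures recorded yet
        }
        for (String indicator : statusField.split(",")) {
            Matcher m = INDICATOR.matcher(indicator);
            if (!m.matches()) {
                throw new IllegalArgumentException("Unrecognized indicator: " + indicator);
            }
            result.add(Integer.parseInt(m.group(1)) == 0);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(successes("0(0),0(0),1(2)")); // prints [true, true, false]
    }
}
```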
Copyright
The PIRL Java Packages are Copyright (C) 2003-2007 Arizona Board of
Regents on behalf of the Planetary Image Research Laboratory, Lunar and
Planetary Laboratory at the University of Arizona. They are distributed
under the terms of the
GNU General Public License, version 2, as published by the Free
Software Foundation. A copy of this license should be included with the
distributed files.
Contact
In general, when operational problems occur, an Error dialog appears describing the problem and identifying the software classes involved. Problem reports should include all this information. Comments and suggestions that could improve this application's usefulness are welcome.
Bradford Castalia | |
Senior Systems Analyst | Castalia@Arizona.edu |
Planetary Image Research Laboratory | 520-621-4824 |
Department of Planetary Sciences | 1541 E. University Blvd. |
University of Arizona | Tucson, Arizona 85721-0063 |
"Build an image in your mind, fit yourself into it."