Conductor: Managing Processing Pipelines

Conductor is a Java application for managing queues of source files to be processed by sequences of procedures.

"A conductor doesn't take a flute or a clarinet and show someone how to play it. He tells them what they have to do and they have to find out how."
    - Vladimir Horowitz

Processing Pipelines:

Data production operations commonly apply a sequence of procedures to generate output data products from input data sources: this is a processing pipeline. More than one procedure sequence may be employed, typically depending on the kind of input data and/or the desired type of output, but any given sequence is usually sufficiently well defined that it can be automated into a non-interactive uber-procedure that encapsulates the individual procedures of the sequence. This often takes the form of a script that takes as input the file containing the data source and executes the sequence of procedures on the file, and/or intermediate data files, to generate one or more products containing the desired output data.

The productivity advantages of automated processing scripts have resulted in the mechanism being used with increasingly complex procedure sequences, where the data to be processed exist in more than one file and/or the sequence of procedures involves more than a single, simple, linear logic. Procedure scripts can easily evolve to have numerous command line arguments and masses of intricate code to handle many processing options and exceptional processing conditions. Just incorporating the often overlooked business of handling possible error conditions resulting from each procedure in the sequence can cause a simple script to mushroom into a monster. And if any procedure in the sequence undergoes a significant change in its interface - the command line used to execute it or the data input/output requirements - alterations in the script can become a maintenance nightmare. Often what starts out as a simple procedure pipeline script turns out to be a prototype for a complex application program.

Managing Pipelines:

When implementing a processing pipeline as a script it is easy to overlook the mundane tasks of output logging, checking the completion status of each procedure and gracefully handling failure conditions. Conductor does this automatically. Managing all the parameters that control a processing pipeline can also be a challenge. Conductor uses a configuration file, supplemented by dynamically managed parameters, to provide parameter management. Parameters and database field values can be referenced by procedure definitions with a syntax much like that of a scripting language. However, Conductor is not intended as a substitute for scripts: Conductor manages procedures, and any of the procedures in a Conductor pipeline may be scripts as well as binary executables. Conductor is intended to keep the processing pipeline manageable even when the procedures and processing environment become complex.

Database Driven:

Conductor is designed to manage processing pipelines without requiring the pipeline implementer to write a script. Instead, the definition of the sequence of procedures in a pipeline is provided in a Pipeline_Procedures database table and the names of the source files to be processed are entered into a Pipeline_Sources database table. These two tables constitute the named Pipeline. The database server containing the pipeline tables may be located anywhere: on the system where Conductor is running or on a remote network accessible system. This enables Conductors running on many systems to share the same pipeline definitions without requiring shared filesystems or special interprocess protocols. Access information to the database server is provided in the configuration file given to Conductor. If, while Conductor is running, its connection to the database server is lost it will make repeated attempts to reconnect before giving up and reporting loss of database connectivity.

Pipeline_Sources

Each source record identifies the data source to be processed, its processing status, and the location of the log file containing a detailed report of all processing of the data source. Additional fields, beyond those required by Conductor, may be present; for example, a Last_Update time field automatically maintained by the database is recommended. Source records are processed in the order they occur in the table, and new records may be added to the table at any time.

The only field value that must be user specified to identify a data source is the Source_Pathname that provides a pathname to a file. Conductor will confirm read access to each source file; files without read access will not be processed. The Source_Pathname need not be unique; sources may be repeatedly processed. A Source_Number field that contains a unique integer value, usually maintained by the database server as an auto-increment field, is required. In addition a Source_ID may be provided by the user; otherwise Conductor will set this to the filename portion of the Source_Pathname, with any extension removed.

Conductor will only process a source record that indicates unprocessed status. Conductor acquires a source record by setting the Conductor_ID field to the processing hostname in a way that guarantees only one Conductor will acquire each record, thus allowing any number of Conductor processes to be simultaneously working on a pipeline. Conductor maintains the Status field of the source record with a list of Status Indicators, one for each procedure that has been applied to the source, in procedure sequence order. The value indicates a procedure in progress or the final status of the procedure - success, failure or timeout. A source record that already has a Status field value when it is acquired will be reprocessed beginning with the next unrecorded procedure in the sequence, but only if the last procedure was successful.
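The acquisition guarantee described above is the classic atomic conditional-update idiom. The following Java sketch illustrates the idea only; it is not Conductor's actual SQL, and the method and value names are hypothetical (the table and field names follow this document):

```java
// Illustration of the record-acquisition idiom (assumed, not Conductor's
// actual implementation): a conditional UPDATE claims the record only if
// no other Conductor has already done so. The database reports an
// affected-row count of 1 to the winner and 0 to everyone else.
public class AcquireSketch {
    static String acquireSql(String pipeline, String conductorID, long sourceNumber) {
        return "UPDATE " + pipeline + "_Sources"
             + " SET Conductor_ID = '" + conductorID + "'"
             + " WHERE Source_Number = " + sourceNumber
             + " AND Conductor_ID IS NULL";
    }

    public static void main(String[] args) {
        System.out.println(acquireSql("Test", "host.example.edu", 42));
    }
}
```

Because the WHERE clause requires Conductor_ID to be unset, two Conductors issuing this statement concurrently cannot both succeed, regardless of which host they run on.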

Conductor will always write a detailed report of all source record processing to the file listed in the Log_Pathname field. The user may specify the value of this field or let Conductor determine the filename based on the pipeline name, Source_ID and Source_Number values. The log file will be created in Conductor's current working directory unless a Log_Directory configuration parameter has been provided. If Conductor finds that the Log_Pathname field has a value then its report will be appended to an existing file, which assures that source record reprocessing will be reported to the same log file. The log file contains details about the pipeline Conductor is processing, the host system in use, the source data identification, timestamps for the beginning and end of each procedure, and a copy of all normal (stdout) and error (stderr) listings from each procedure. The report is marked in a way to facilitate automated data extraction.

Pipeline_Procedures

Each procedure record specifies the order in which procedures are to be applied to a data source, a primary command line, completion success conditions, and a branch command to be used should the primary command not complete successfully. Additional fields, beyond those required by Conductor, may be present; in particular a Description field, if present, will be included in the log reporting. Procedure definitions may be safely modified while Conductor is running.

The Sequence field determines the order in which procedures are run. The value of this field is a real number to enable new procedure records to be inserted in the table in any effective sequence location without requiring procedure record reordering or renumbering.
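The effect of real-number Sequence values can be sketched in a few lines of Java (the procedure names are hypothetical; this is not Conductor code): a record inserted with Sequence 1.5 runs between the records at 1.0 and 2.0 with no renumbering of existing records.

```java
import java.util.TreeMap;

// Sketch: real-number Sequence values order procedures automatically,
// so a new record slots between existing ones without renumbering.
public class SequenceSketch {
    public static void main(String[] args) {
        TreeMap<Double, String> procedures = new TreeMap<>();
        procedures.put(1.0, "calibrate");
        procedures.put(2.0, "map_project");
        // Insert a new step between the existing two; no renumbering needed.
        procedures.put(1.5, "despike");
        System.out.println(procedures.values());  // [calibrate, despike, map_project]
    }
}
```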

The Command_Line specifies the procedure to be executed. The command line specification may contain embedded references to be resolved by configuration parameter or database field values. The Conductor configuration file parameters, defined with simple Parameter Value Language syntax, are automatically supplemented with all the environment variables available to Conductor plus a set of parameters identifying the pipeline and its database. Conductor also maintains a set of dynamic parameters that identify the current source record, procedure record sequence number and the completion status value of the last procedure run. Both parameter and field references may be nested.
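As a concrete illustration of nested reference resolution (a minimal sketch, not Conductor's actual resolver; the parameter names are hypothetical), the following Java code substitutes ${name} references innermost first, so a nested reference like ${Source_${Kind}} resolves in two passes:

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch of ${name} reference resolution with nesting. Innermost
// references are resolved first: ${Source_${Kind}} first becomes
// ${Source_Pathname}, and then the file pathname.
public class ResolveSketch {
    // Matches a reference that contains no nested reference characters.
    static final Pattern INNERMOST = Pattern.compile("\\$\\{([^${}]+)\\}");

    static String resolve(String text, Map<String, String> parameters) {
        Matcher matcher = INNERMOST.matcher(text);
        while (matcher.find()) {
            String value = parameters.getOrDefault(matcher.group(1), "");
            text = text.substring(0, matcher.start()) + value + text.substring(matcher.end());
            matcher = INNERMOST.matcher(text);
        }
        return text;
    }

    public static void main(String[] args) {
        Map<String, String> parameters = Map.of(
            "Kind", "Pathname",
            "Source_Pathname", "/data/image.dat");
        System.out.println(resolve("process ${Source_${Kind}}", parameters));
        // prints: process /data/image.dat
    }
}
```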

To avoid "hung" or "runaway" processes, a procedure will only be allowed to run for the amount of time specified by the Time_Limit value. The specification is reference resolved and also evaluated as a potential mathematical expression.

Once a procedure has completed its exit status is compared to the Success_Status value to determine if the procedure completed successfully. In addition to being reference resolved and evaluated as a potential mathematical expression, the Success_Status is also evaluated as a potential logical expression to determine the success condition of the procedure exit status. As an alternative to the Success_Status, a Success_Message (reference resolved) may be specified; it is used as a regular expression match condition on the normal and error output from the procedure. Conductor may be configured to use by default the conventional zero Success_Status or to always assume the procedure completed successfully.
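The two success tests described above can be sketched as follows (an illustration only, not Conductor's implementation; the method names and example values are hypothetical): an exit-status comparison and, alternatively, a regular-expression match against the procedure's output.

```java
import java.util.regex.Pattern;

// Sketch of the two success determinations: comparing the procedure
// exit status to a Success_Status value, or matching a Success_Message
// regular expression against the procedure's stdout/stderr text.
public class SuccessSketch {
    static boolean statusSuccess(int exitStatus, int successStatus) {
        return exitStatus == successStatus;
    }

    static boolean messageSuccess(String output, String successMessage) {
        return Pattern.compile(successMessage).matcher(output).find();
    }

    public static void main(String[] args) {
        System.out.println(statusSuccess(0, 0));  // conventional zero success status
        System.out.println(messageSuccess("Wrote 3 products", "Wrote \\d+ products"));
    }
}
```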

If Conductor determines that the procedure completed successfully, then the next procedure is applied or, if the last procedure in the sequence completed, the next source record is acquired and the sequence of procedure processing begins again. However, if the procedure failed, or timed out, the On_Failure command line (reference resolved, of course) is run instead. This procedure is allowed to run to completion (without any time limit) before the next source record is acquired and the sequence begins again. The Conductor distribution includes a Notify procedure that is typically used as an On_Failure procedure to send an email notification of these, or other, special events.

Distributed Processing:

Conductor is designed for use in a distributed processing environment. Each pipeline may be managed by multiple Conductor processes simultaneously; each source record is guaranteed to be processed by only one Conductor regardless of the number of Conductors working on the pipeline. Typically each Conductor runs on a different processing engine to maximize throughput. Multiple Conductors may be run on one or more host systems with each processing a separate pipeline "segment". A segment is a pipeline in which the final procedure, or any branch (On_Failure) procedure, makes a source entry in another pipeline considered to be "chained" to the first segment. Networks of pipeline segments may be constructed in this way. The Conductor distribution includes a Pipeline_Source procedure for entering a source file record into a pipeline; it may also be used for bulk addition of source file lists into a pipeline. Also, Conductor itself may be a procedure in a pipeline, enabling networks of dynamic, adaptive pipelines to be implemented.

Operating Modes:

Conductor may be run in monitor, batch or daemon mode. In monitor mode a graphical user interface is provided to control and monitor pipeline processing; the entire log report for each source record is displayed as it is being produced, along with other constantly updated information about the procedures that are being run. In batch mode Conductor processes all unprocessed source records in the pipeline and then quits. In daemon mode Conductor runs in the background and continuously polls, at a configurable interval, for unprocessed source records. In this mode Conductor can be run unattended and it will automatically process new source records as they appear in the pipeline queue.

Installation

System Requirements:

Process Patch:

The interface for the Sun Java Process class, as of Java 1.4/5, has an unfortunate shortcoming. The user can neither access the process identification (PID) of the executed process nor provide the timeout argument to the Object.wait method used by the waitFor method. These capabilities are required by Conductor. The Process.patch kit - contained in the Conductor.tar.gz tarball or available separately as the Process.patch.tar.gz tarball - will correct this deficit. Detailed installation instructions for this trivial fix are in the Process.patch README file. The patch must be applied before building or using Conductor.

Note: This simple little patch has absolutely no effect, direct or indirect, on any existing Java functionality.

Unpacking tarballs:

When building the PIRL Java Packages from the tarball distributions, choose an installation root directory and unpack the tarballs there. Many sites, and users, collect all of their Java package installations under a single directory. This is not necessary, but it usually makes managing the installations easier, so we'll proceed under this assumption. If you are using GNU tar (named gtar on many systems, but also often named tar; if you are unsure, the command "tar --version" will produce a version listing for GNU tar and an error message otherwise) unpack your tarballs using:

gtar xzf TARBALL_FILE

Otherwise use:

gunzip -c TARBALL_FILE | tar xf -

The PIRL packages will all be unpacked into a subdirectory named PIRL-N.N.N, where N.N.N is the current release version number. A link named PIRL will also be produced that points to the current release subdirectory. Java requires that all packages be located in directories that are named the same as their package names; thus the PIRL link name satisfies this requirement.

The support distribution tarballs will be unpacked into their own subdirectories: The Connector/J JDBC driver for MySQL into mysql-connector-N.N.N, with a mysql-connector link; and the Java Components For Mathematics into jcm-N.N, with a jcm link.

Note: The jar file distributions do not need to be unpacked. Unpacked tarball distributions may be used in conjunction with jar files, or vice versa. For example, it would be quite suitable to use the jar files for the support distributions, which are much smaller than their tarballs, along with an unpacked and compiled PIRL tarball distribution. If you are using the Conductor.jar or the PIRL.jar with the support jar files there is nothing to unpack.

Compiling the Code:

The PIRL Java Packages source code distribution includes a Build subdirectory with a GNU-make-compatible Makefile which will invoke the Makefile in each package directory. Before compiling the PIRL Java Packages the JCM package must be installed; simply installing its jar file in the Java extensions directory is sufficient. The database driver support package is not needed for compiling the code. Note also that each tarball already contains a ready-to-use set of class files. Each PIRL Java Package Makefile that needs the JCM specifies its location with the JCM macro; this location is /opt/java/jcm by default. Alternatively, the JCM environment variable may be set to the JCM location (which may be a directory or jar file pathname). Then all the PIRL Java Packages class files can be built from the Build directory by simply entering the make (or gmake) command.

Note: If the PIRL.jar file is installed in the Java extensions directory compiling the code is not needed. Installing the jar files in the Java extensions directory is the easiest way to get started.

Documentation:

On-line reference manuals for the PIRL Java Packages are available in javadoc form.

The source code distribution includes the javadocs files in the docs directory. To build fresh javadoc files from the unpacked PIRL.tar.gz distribution source code files use the make docs command in the Build subdirectory. The directory where the documentation files will be written is controlled by the PIRL_DOCS_DIR variable in the Makefile; by default this is the PIRL/docs directory.

The JAVA_DOCS_DIR variable specifies the location of the core Java documentation (where the api subdirectory with its package-list file is found) that is used to provide links from the PIRL Java Packages documentation to Java Foundation Classes (JFC) documentation. By default this is set to /usr/java/docs; set it to the location for your site or to a URL where the JFC documentation will be found.

Use

Configuration File:

The Conductor configuration file provides the information necessary to connect to the database server. A sample Conductor.conf has been provided in the PIRL/Conductor directory and is also available from the PIRL distribution site. It should be copied to the user's home directory or project directory and the permissions set to be read-only by the file owner (while not strictly required, the user's MySQL password must be entered into this file and so it should be kept private). Edit the parameters in the localhost group as appropriate for access to your MySQL database. Alternatively, add another database Server specification using the PIRL server specification as a model.

The sample configuration file also provides numerous parameters used by Conductor with their default values. The value for the Catalog is not a default - it is used with the sample setup in the tests subdirectory - and should be set to the name of the MySQL catalog (MySQL calls this a "database") where your pipeline tables will be found. The default for the Log_Directory will place log files in the current working directory. This can be changed to any existing, writable directory or specified in the Log_Pathname field of the Sources table.
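As a rough illustration of the configuration file's shape, here is a hypothetical Conductor.conf fragment in PVL syntax. Only the Catalog and Log_Directory parameters and the localhost group are named in this document; every other parameter name and value below is an assumption modeled on typical database access settings, so consult the sample Conductor.conf for the real names.

```
/* Hypothetical Conductor.conf sketch (PVL syntax). Only Catalog,
   Log_Directory and the localhost group appear in this document;
   the remaining names and values are illustrative assumptions. */
Group = localhost
    Host     = localhost
    User     = pipeline_user
    Password = pipeline_password
End_Group

Catalog       = Proc_Test
Log_Directory = /home/pipeline/logs
End
```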

Database Tables:

The queue of source files and sequence of pipeline procedures that Conductor uses are provided in a pair of database tables located on a MySQL database server. Each record of the table of Sources is used to specify the pathname of the file to be processed. Each record of the table of Procedures is used to define a procedure to be executed. These tables must be available to Conductor. The Conductor documentation provides details for the definition of these tables.

The PIRL/Conductor/tests directory contains *.SQL script files for creating a Test pipeline in a Proc_Test catalog from the supplied *.table files. The creator of the tables must, of course, have appropriate permissions on the MySQL database server. The Procedures table employs the Perl test_procedure that is provided, and the Sources table simply lists the test_source file that is provided. A noop (no operation) binary procedure is also used. Trivial C-language code to build this program is included (noop.c; usually just entering "make noop" in the same directory will build this program). Alternatively, a link named noop referring to a suitable program (e.g. /bin/echo or /bin/true) may be used as a substitute.

The sample Test_Procedures table provides a model that can be used with the documentation of the Conductor package to build your own pipelines. The only required entry is the Command_Line specification. Without the Sequence number all procedures will have the same precedence and will be executed in the top-to-bottom order they occur in the table. The Command_Line specification may just be the name of a command to execute, but usually at least the pathname of the source file is provided with the ${Source_Pathname} parameter reference. The Success_Status defaults to 0, which is conventional for Unix programs.

As the sample Test_Sources table demonstrates, only the Source_Pathname is required. The Source_ID is not required and will default to the filename portion of the Source_Pathname.

Running Conductor:

To run Conductor from its jar file just use:

java -jar Conductor.jar arguments

Otherwise the Java classpath must be set to include the parent directory of the PIRL directory, the mysql-connector directory or jar file and the jcm directory or jar file. This can be specified with the -cp argument when running java or in the CLASSPATH environment variable. For example, if you have unpacked all the tarballs in /usr/local/java and are using a C shell (csh or tcsh) then:

setenv CLASSPATH /usr/local/java:/usr/local/java/mysql-connector:/usr/local/java/jcm

The equivalent for the Bourne shell (sh) is:

CLASSPATH=/usr/local/java:/usr/local/java/mysql-connector:/usr/local/java/jcm; export CLASSPATH

If the support packages are in their jar files:

setenv CLASSPATH /usr/local/java:/usr/local/java/mysql-connector.jar:/usr/local/java/jcm.jar

Adding these commands to your .login (for C shell users) or .profile (for Bourne shell users) file in your home directory (some systems use different environment setup files; check with your systems administrator) will ensure that the CLASSPATH is automatically set on login.

Note: If the PIRL.jar file has been installed in the Java extensions directory then the classpath does not need to be specified since the Java Runtime Environment will find the appropriate classes automatically. Also, providing a wrapper script that sets the appropriate environment and executes the appropriate command for your system will make life much easier. The "Conductor" wrapper used at PIRL, which is included in the tarball distribution, provides an example.

Then:

java PIRL.Conductor.Conductor arguments

If you have created the Test pipeline and adjusted the Conductor.conf file in your home directory as necessary then the command:

java PIRL.Conductor.Conductor -m Test

should start Conductor, make a connection to the database server specified in the configuration file, display a monitor window and wait for you to hit the Start button. Select the Exit menu item to exit from Conductor.

It is a good idea to initially run Conductor in monitor mode to directly confirm the correct operation of your pipeline. Problems, such as malformed references or missing source files, will be listed in red with a description of the source of the problem. Once your pipeline executes successfully it might be more appropriate to run Conductor silently, as a daemon that constantly checks your Sources table for new sources to process. The log files for the processed sources will still contain the same complete record that would appear in the monitor display.

Status:

As each source record is processed through the pipeline its Status field is updated with the comma separated list of status indicators for each procedure that is executed. This field is kept current as each procedure is executed and may be safely monitored by any appropriate database client (e.g. Data_View that is provided with the PIRL Java Packages). The interpretation of status indicators is provided in the Conductor.Status_Indicator(String) method description. Basically, a first value of 0 means success and the parenthesized value is the actual procedure exit status. The status indicators are, of course, listed in Sequence number order.
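Assuming, for illustration only, that each indicator takes the "0 (0)" form implied above (a success flag followed by the parenthesized actual exit status - the authoritative format is defined by the Conductor.Status_Indicator(String) method, not this sketch), a Status field could be checked like this:

```java
// Illustration only: scan a Status field of comma-separated indicators,
// assuming each looks like "0 (0)" (success flag, then the procedure's
// actual exit status in parentheses). The exact indicator format is an
// assumption here; see Conductor.Status_Indicator(String).
public class StatusSketch {
    static boolean allSucceeded(String status) {
        for (String indicator : status.split(",")) {
            // A leading value of 0 indicates the procedure succeeded.
            if (!indicator.trim().startsWith("0")) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(allSucceeded("0 (0), 0 (0)"));    // true
        System.out.println(allSucceeded("0 (0), 1 (255)"));  // false
    }
}
```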


Copyright

The PIRL Java Packages are Copyright (C) 2003-2007 Arizona Board of Regents on behalf of the Planetary Image Research Laboratory, Lunar and Planetary Laboratory at the University of Arizona. They are distributed under the terms of the GNU General Public License, version 2, as published by the Free Software Foundation. A copy of this license should be included with the distributed files.

Contact

In general, when operational problems occur, an Error dialog appears describing the problem and identifying the software classes involved. Problem reports should include all this information. Comments and suggestions that could improve this application's usefulness are welcome.

Bradford Castalia
Senior Systems Analyst
Planetary Image Research Laboratory
Department of Planetary Sciences
University of Arizona
1541 E. University Blvd.
Tucson, Arizona 85721-0063
Castalia@Arizona.edu
520-621-4824

"Build an image in your mind, fit yourself into it."