Pipes Software Documentation

D. Paige

8/12/09

1. Introduction

The "pipes" software was developed in the late 1980s as a way to
process unformatted data efficiently. It was originally developed in the
VAX/VMS environment by David Paige, Carol Chang and Mark Sullivan. The
unix pipes are an elegant system that lets the user string together a
set of simple tools to create a custom pipeline process without the need
to create messy intermediate files. The standard unix pipes normally
operate on character data. Our pipes software works primarily on
unformatted binary data. This document provides an overview of how the
pipes work, and describes some of the main pipes tools. The pipes
software is currently implemented in g77 and gcc. By convention, all
pipes tools start with the letter p. The 'master' source code for the
pipes is in /u/paige/dap/pipesv3. You can read the comments in the
source code for detailed descriptions of the pipes programs and parameters.

2. Data Format

The current version of the pipes data (Version 3) represents all data as
real*8 (Fortran) or double (C). This allows for the representation of
high precision numbers such as dates, as well as the precise
representation of most integers. There is some storage inefficiency in
representing all data in this manner, but the speed of processing is
much faster than with a mixed dataset. Currently, there is no way to
represent or deal with character data in the pipes system.

Pipes datasets are basically flat files with a fixed number of columns,
and an arbitrary number of rows. The data are read through the pipes in
records and processed as a group. A nice feature of the pipes data
representation is that you don't necessarily have to use the pipes tools
to process the data, since they can be easily read in binary format
by other applications.
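
As a rough illustration of reading pipes data outside the pipes tools,
here is a minimal Python sketch. It assumes (this is not stated above)
that the file is a plain stream of 8-byte doubles in the machine's
native byte order, with no descriptor prepended and no Fortran record
markers. The function name read_rows and its ncols argument are
hypothetical and not part of the pipes software.

```python
import io
import struct


def read_rows(stream, ncols):
    """Yield rows of ncols floats from a raw stream of 8-byte doubles.

    Assumption: native byte order, no descriptor, no record markers.
    """
    row_size = 8 * ncols
    unpack = struct.Struct("=%dd" % ncols).unpack
    while True:
        chunk = stream.read(row_size)
        if len(chunk) < row_size:
            break  # end of data; a trailing partial row is ignored
        yield unpack(chunk)


if __name__ == "__main__":
    # Build a small in-memory dataset with 4 columns and 2 rows.
    raw = struct.pack("=8d", 1, 50, 200, 4, 44, 24, 35, 16)
    for row in read_rows(io.BytesIO(raw), ncols=4):
        print(row)
```

To read a real file, open it in binary mode and pass the number of
columns given by the matching descriptor file.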

3. Descriptor Files

In order to let the pipes tools "know" what they are processing, pipes
data must be preceded by a descriptor file that describes the data
format. The format of the ASCII descriptor file is as follows:

'Descriptor file title'
'column 1' 'description of column 1'
'column 2' 'description of column 2'
....

Where 'column 1' is a short descriptive title of the contents of column
1, such as 'lat', and 'description of column 1' is a more detailed
description of column 1, such as 'latitude (degrees)'. The short
title is required for every column. The detailed description can
be left blank, but it must still be enclosed in quotes.
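
The layout above can be parsed with a few lines of Python. This is only
a sketch: the function name parse_descriptor is hypothetical, and it
assumes that titles and descriptions never contain an apostrophe, which
the pipes tools themselves may handle differently.

```python
import re

# Matches one single-quoted field and captures its contents.
QUOTED = re.compile(r"'([^']*)'")


def parse_descriptor(text):
    """Return (title, [(name, description), ...]) from descriptor text."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    title = QUOTED.findall(lines[0])[0]
    columns = []
    for ln in lines[1:]:
        fields = QUOTED.findall(ln)
        name = fields[0]
        # A blank description is written as an empty pair of quotes.
        description = fields[1] if len(fields) > 1 else ""
        columns.append((name, description))
    return title, columns


if __name__ == "__main__":
    example = "'demo descriptor'\n'lat' 'latitude (degrees)'\n'lon' ''\n"
    title, columns = parse_descriptor(example)
    print(title)
    for name, description in columns:
        print(name, "|", description)
```

The number of (name, description) pairs gives the number of columns in
each row of the dataset.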

Some pipes programs add columns to or remove columns from the dataset,
in which case they must alter the descriptor file accordingly.

By convention, descriptor files on disk have a .des extension.

By convention, raw pipes data on disk have the extension of the
appropriate descriptor file, so for instance, a dataset that follows the
div38.des format will have a .div38 extension.

By convention, a dataset on disk that has an ascii descriptor file
prepended has a .pipe extension.

4. Pipes Front Ends

Use divdata to funnel Diviner RDR data into pipes programs, instead of using the "cat" method below.

There are two ways to get data into the pipes software. If the data are
already on disk in pipes binary format, then it's sufficient to just
read in the data using the unix < standard input redirection, or to use
the unix cat tool with shell wildcards. Using cat is particularly
powerful because you can use it to get multiple files into the pipes, e.g.

cat /d1/marks/div38/c/*.c9.div38 | ...

will get you a lot of data.

The second way to get data into the pipes is to read it from an ascii
flat file using the pipes tool pread. To read in the following data
from the file mydata.txt into the pipes:

1 50 200 4
44 24 35 16
52 33 108 16

you would first construct a descriptor file called data.des:

'data.des descriptor file'
'lat' 'latitude'
'lon' 'longitude'
'tb' 'brightness temperature'
'qual' 'data quality'

Then, you would use pread to read in the data:

pread des=data.des < mydata.txt | ....

pread by default expects you to supply a descriptor file that tells
pread what to read. pread uses Fortran list-directed input, which is
fairly powerful and fast. ASCII files can be tab, space or comma
delimited. You can optionally supply a format statement for fixed format
ASCII data that may improve reading speed. pread will bomb if it
encounters unexpected character or binary data.
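
To make the conversion concrete, here is a simplified Python sketch of
turning delimited text rows into packed doubles. It is not pread's
implementation: the function name text_to_binary is hypothetical, the
output layout (native byte order, no descriptor) is an assumption, and
Fortran list-directed features such as repeat counts, null values
between consecutive commas, and d exponents are not handled.

```python
import re
import struct


def text_to_binary(lines, ncols):
    """Convert tab, space or comma delimited text rows to packed doubles."""
    out = bytearray()
    for line in lines:
        # Split on any run of commas and whitespace; drop empty fields.
        fields = [f for f in re.split(r"[,\s]+", line.strip()) if f]
        if not fields:
            continue  # skip blank lines
        if len(fields) != ncols:
            raise ValueError("expected %d values, got %d" % (ncols, len(fields)))
        # float() raises ValueError on non-numeric text.
        out += struct.pack("=%dd" % ncols, *(float(f) for f in fields))
    return bytes(out)


if __name__ == "__main__":
    rows = ["1 50 200 4", "44,24,35,16", "52\t33\t108\t16"]
    data = text_to_binary(rows, ncols=4)
    print(len(data), "bytes")
```

Each row of 4 values becomes 32 bytes, so the three rows of mydata.txt
would occupy 96 bytes under these assumptions.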

5. Pipes processing tools

We have created an array of simple but useful pipes processing tools. A
good example is pcons, which constrains the data going through the
pipes, passing only rows whose values fall within specified ranges. To
get a subset of the data from the previous example, we would say:

pread des=data.des < mydata.txt | pcons lat=-20.,50 tb=200,250 | ...

Two important points about pcons. First, the lower boundary is inclusive
(greater than or equal to) and the upper boundary is exclusive (less
than). Second, you can do a band reject constraint by making the lower
boundary higher than the upper boundary.
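
The boundary rules can be written out as a short Python sketch. The
names in_range and constrain are hypothetical. The normal case follows
the rule stated above; the exact edge behavior in the band reject case
is an assumption (here a value is kept if it is at or above the lower
boundary, or below the upper boundary).

```python
def in_range(value, lower, upper):
    """Return True if value passes a pcons-style constraint."""
    if lower <= upper:
        # Normal case: lower boundary inclusive, upper boundary exclusive.
        return lower <= value < upper
    # Band reject (assumed edge behavior): drop values in [upper, lower).
    return value >= lower or value < upper


def constrain(rows, names, limits):
    """Yield only the rows that satisfy every constraint in limits.

    names  -- column titles in descriptor order
    limits -- dict mapping a column title to (lower, upper)
    """
    idx = {name: names.index(name) for name in limits}
    for row in rows:
        if all(in_range(row[idx[n]], lo, hi) for n, (lo, hi) in limits.items()):
            yield row


if __name__ == "__main__":
    names = ["lat", "lon", "tb", "qual"]
    rows = [(1, 50, 200, 4), (44, 24, 35, 16), (52, 33, 108, 16)]
    for row in constrain(rows, names, {"lat": (-20.0, 50.0), "tb": (200.0, 250.0)}):
        print(row)
```

With the three rows of mydata.txt, only the first row passes: its tb of
200 sits exactly on the inclusive lower boundary, while the second row
fails the tb constraint and the third fails the lat constraint.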

Another useful tool is pextract, which reduces the number of columns in
a pipeline. Using the previous example, if we wanted just a printout of
tb and lon, we would do:

pread des=data.des < mydata.txt | pcons lat=-20.,50 tb=200,250 \
| pextract extract=tb,lon | ...

and then we'd only have tb and lon going down the pipe in that order.
Note that the \ is the unix shell line-continuation character, which
makes editing pipes processing pipelines a little easier.
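
The column selection and reordering can be sketched in Python as
follows. The function name extract is hypothetical, and this only
mirrors the behavior described above rather than the actual pextract
source.

```python
def extract(rows, names, wanted):
    """Yield rows reduced to the wanted columns, in the order requested.

    names  -- column titles of the incoming rows, in descriptor order
    wanted -- column titles to keep, in the desired output order
    """
    idx = [names.index(w) for w in wanted]
    for row in rows:
        yield tuple(row[i] for i in idx)


if __name__ == "__main__":
    names = ["lat", "lon", "tb", "qual"]
    rows = [(1, 50, 200, 4), (44, 24, 35, 16)]
    for row in extract(rows, names, ["tb", "lon"]):
        print(row)
```

The output descriptor would then list only tb and lon, in that order.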

Check in the /u/paige/dap/pipesv3 directory for other tools. We hope to
provide a good Fortran template that will allow users to create their
own custom pipes programs.

6. Terminal pipes

Pipe data can be output in binary format from all pipes tools. If we
wanted to output the results of the previous example in binary form, we
would do:

pread des=data.des < mydata.txt | pcons lat=-20.,50 tb=200,250 nodes \
> mydata.data

The optional nodes argument tells pcons not to prepend the descriptor
file to the output pipe stream. For pipes that change the descriptor
file, you can say newdes=newdescriptorfile.des and it will write out
newdescriptorfile.des for future use.

Another good terminal pipe tool is pprint, which provides formatted
ascii printout. From the previous example, if we wanted to print out the
results to output.txt, we would do:

pread des=data.des < mydata.txt | pcons lat=-20.,50 tb=200,250 \
| pprint > output.txt

pprint can also provide column titles:

pread des=data.des < mydata.txt | pcons lat=-20.,50 tb=200,250 \
| pprint titles=0 > output.txt

This will print out the column titles at the top of the printout. You
can print column titles every nth row using titles=n.

pprint has the option to provide formatted output, but currently there's
a limitation in that mixed format statements don't work properly.
Homogeneous format statements do work. For Diviner, we've found that
pprint format='(35f20.8)' works well.
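
For readers unfamiliar with Fortran format statements, f20.8 means a
fixed-point field 20 characters wide with 8 digits after the decimal
point, and the leading 35 is a repeat count. The following Python sketch
produces roughly the same layout for one row. The function name
format_row is hypothetical, and Fortran-specific behavior (asterisks on
field overflow, starting a new line after 35 values) is not reproduced.

```python
def format_row(row, width=20, decimals=8):
    """Format each value right-justified in a fixed-width field."""
    return "".join("%*.*f" % (width, decimals, v) for v in row)


if __name__ == "__main__":
    print(format_row((1.0, 50.0, 200.0, 4.0)))
```

A row of 4 values formatted this way is 80 characters long.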