Pipes Software Documentation

D. Paige

8/12/09

1. Introduction

The "pipes" software was developed in the late 1980's as an approach to 
efficiently process unformatted data. It was originally developed in the 
VAX/VMS environment by David Paige, Carol Chang and Mark Sullivan. The 
unix pipes are an elegant mechanism that lets the user string together 
a set of simple tools into a custom processing pipeline without 
creating messy intermediate files. The standard unix pipes normally 
operate on character data. Our pipes software works primarily on 
unformatted binary data. This document provides an overview of how the 
pipes work, and describes some of the main pipes tools. The pipes 
software is currently implemented in g77 and gcc. By convention, all 
pipes tools start with the letter p. The 'master' source code for the 
pipes is in /u/paige/dap/pipesv3. You can read the comments in the 
source code for detailed descriptions of the pipes programs and parameters.

2. Data Format

The current version of the pipes data (Version 3) represents all data as 
real*8 (Fortran) or double (C). This allows for the representation of 
high precision numbers such as dates, as well as the precise 
representation of most integers. There is some storage inefficiency in 
representing all data in this manner, but the speed of processing is 
much faster than with a mixed dataset. Currently, there is no way to 
represent or deal with character data in the pipes system.

Pipes datasets are basically flat files with a fixed number of columns, 
and an arbitrary number of rows. The data are read through the pipes in 
records and processed as a group. A nice feature of the pipes data 
representation is that you don't necessarily have to use the pipes tools 
to process the data, since they can easily be read in binary format 
by other applications.
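
As an illustration, a raw pipes data file (one with no descriptor 
prepended; see Section 3) could be read and printed by a short 
stand-alone C program along the following lines. This is only a sketch: 
NCOL must be set to the actual number of columns listed in the 
dataset's descriptor file, and the byte order is assumed to match the 
machine that wrote the data.

/* readpipes.c -- sketch of reading raw pipes binary data from an
 * application outside the pipes system.  Assumes no descriptor is
 * prepended to the stream and that NCOL matches the number of columns
 * in the dataset's descriptor file. */
#include <stdio.h>

#define NCOL 4                      /* assumed number of columns */

int main(void)
{
    double rec[NCOL];
    int i;

    /* each record is simply NCOL consecutive real*8 / double values */
    while (fread(rec, sizeof(double), NCOL, stdin) == NCOL) {
        for (i = 0; i < NCOL; i++)
            printf("%20.8f ", rec[i]);
        printf("\n");
    }
    return 0;
}

Compile it with gcc and feed it a raw pipes file on standard input.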

3. Descriptor Files

In order to let the pipes tools "know" what they are processing, pipes 
data must be preceded by a descriptor file that describes the data 
format. The format of the ASCII descriptor file is as follows:

'Descriptor file title'
'column 1' 'description of column 1'
'column 2' 'description of column 2'
....

Where 'column 1' is a short descriptive title of the contents of column 
1, such as 'lat', and 'description of column 1' is a more detailed 
description of column 1, such as 'latitude (degrees)'. The short title 
is required for every column; the detailed description can be left 
blank, but it must still be enclosed in quotes.
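
For example, a descriptor line whose detailed description is left blank 
would simply end with an empty pair of quotes:

'qual' ''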

Some pipes programs add or remove columns from the dataset, in which 
case they must alter the descriptor file appropriately.

By convention, descriptor files on disk have a .des extension.

By convention, raw pipes data on disk have the extension of the 
appropriate descriptor file, so for instance, a dataset that follows the 
div38.des format will have a .div38 extension.

By convention, a dataset on disk that has an ascii descriptor file 
prepended has a .pipe extension.
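
For example, redirecting a pipeline to disk without the nodes option 
(see Section 6), so that the descriptor stays prepended to the binary 
output, is one way to produce such a file (data.des and mydata.txt are 
the example inputs used in Section 4):

pread des=data.des < mydata.txt | pcons lat=-20.,50 > mydata.pipe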

4. Pipes Front Ends

Use divdata to funnel Diviner RDR data into pipes programs instead of using the "cat" method described below.

There are two ways to get data into the pipes software. If the data are 
already on disk in pipes binary format, then it's sufficient to read 
them in using unix standard input redirection (<), or to use the unix 
cat tool to read and concatenate files. Using cat is particularly 
powerful because you can use it to get multiple files into the pipes, e.g.

cat /d1/marks/div38/c/*.c9.div38 | ...

will get you a lot of data.

The second way to get data into the pipes is to read it from an ascii 
flat file using the pipes tool pread. To read the following data from 
the file mydata.txt into the pipes:

1 	50 	200 	4
44	24	35	16
52	33	108	16

you would first construct a descriptor file called data.des:

'data.des descriptor file'
'lat'	'latitude'
'lon'	'longitude'
'tb'	'brightness temperature'
'qual'	'data quality'

Then, you would use pread to read in the data:

pread des=data.des < mydata.txt | ....

pread by default expects you to supply a descriptor file that tells it 
what to read. pread uses fortran list-directed input, which is fairly 
powerful and fast. Ascii files can be tab, space or comma delimited. You 
can optionally supply a format statement for fixed-format ascii data, 
which may improve reading speed. pread will bomb if it encounters 
unexpected character or binary data.
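
If pread accepts a format statement in the same style as pprint (see 
Section 6), a fixed-format read might look like the hypothetical 
example below (fixeddata.txt is a made-up file name); check the pread 
source comments in /u/paige/dap/pipesv3 for the exact parameter name.

pread des=data.des format='(4f10.2)' < fixeddata.txt | ...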

5. Pipes processing tools

We have created an array of simple but useful pipes processing tools. A 
good example is pcons, which constrains the data going through the 
pipes. If we wanted to get a subset of the data from the previous 
example, we would say:

pread des=data.des < mydata.txt | pcons lat=-20.,50 tb=200,250 | ...

Two important points about pcons: first, the lower boundary is 
inclusive (greater than or equal to) and the upper boundary is 
exclusive (less than); second, you can do a band-reject constraint by 
making the lower boundary higher than the upper boundary.
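
For example, reversing the tb boundaries from the previous command 
keeps only the data outside the 200 to 250 band:

pread des=data.des < mydata.txt | pcons lat=-20.,50 tb=250,200 | ...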

Another useful tool is pextract, which reduces the number of columns in 
a pipeline. Using the previous example, if we wanted to keep just tb 
and lon, we would do:

pread des=data.des < mydata.txt | pcons lat=-20.,50 tb=200,250 \
| pextract extract=tb,lon | ...

and then we'd only have tb and lon going down the pipe in that order. 
Note that the \ is the unix shell line-continuation character, which 
makes editing pipes processing pipelines a little easier.

Check the /u/paige/dap/pipesv3 directory for other tools. We hope to 
provide a good fortran template that will allow users to create their 
own custom pipe programs.
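
In the meantime, the sketch below shows the basic structure such a 
program has to have: read records of real*8/double values from standard 
input, operate on them, and write the surviving records to standard 
output. It is only an illustration, not the official template; it 
assumes the descriptor has already been stripped from the stream (see 
the nodes option in Section 6) and that the number of columns is known 
in advance.

/* pkeep.c -- sketch of a custom pipe filter (not the official
 * template).  Assumes a headerless stream of NCOL-column records of
 * real*8 / double data on standard input. */
#include <stdio.h>

#define NCOL 4                      /* assumed number of columns */

int main(void)
{
    double rec[NCOL];

    while (fread(rec, sizeof(double), NCOL, stdin) == NCOL) {
        /* example operation: keep only records whose first column
         * (e.g. lat in the data.des example) is non-negative */
        if (rec[0] >= 0.0)
            fwrite(rec, sizeof(double), NCOL, stdout);
    }
    return 0;
}

Such a filter, compiled with gcc, could then be inserted into a 
pipeline like any other tool.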

6. Terminal pipes

Pipe data can be output in binary format from all pipes tools. If we 
wanted to output the results of the previous example in binary form, we 
would do:

pread des=data.des < mydata.txt | pcons lat=-20.,50 tb=200,250 nodes \
 > mydata.data

The optional nodes argument tells pcons not to prepend the descriptor 
file to the output pipe stream. For pipes that change the descriptor 
file, you can say newdes=newdescriptorfile.des and the tool will write 
out newdescriptorfile.des for future use.
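
For example, since pextract changes the number of columns, a run that 
writes out both the reduced binary data and its new descriptor might 
look like the following, assuming pextract accepts the same nodes and 
newdes arguments (tblon.des and mydata.tblon are hypothetical names 
that follow the conventions of Section 3):

pread des=data.des < mydata.txt | pcons lat=-20.,50 tb=200,250 \
| pextract extract=tb,lon newdes=tblon.des nodes > mydata.tblon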

Another good terminal pipe tool is pprint, which provides formatted 
ascii printout. From the previous example, if we wanted to print the 
results to output.txt, we would do:

pread des=data.des < mydata.txt | pcons lat=-20.,50 tb=200,250 \
| pprint > output.txt

pprint can also provide column titles:

pread des=data.des < mydata.txt | pcons lat=-20.,50 tb=200,250 \
| pprint titles=0 > output.txt

This will print out the column titles at the top of the printout. You 
can also print the column titles every nth row using titles=n.

pprint has the option to provide formatted output, but currently there's 
a limitation in that mixed format statements don't work properly. 
Homogeneous format statements do work. For Diviner, we've found that
pprint format='(35f20.8)' works well.