MULCH - a filter to do MULtiple CHanges


Freeware by David Mitchell - dave@zenonic.demon.co.uk

Function

This filter acts just like an editor given a succession of global change commands. It is designed to be a fast way of making several changes to a file in one go.

MULCH allows changes to be specified as character strings, but provides a number of metacharacters to make it easy to restrict changes to the starts or ends of lines, to change carriage returns or escapes into printable characters, and so on.

Usage

MULCH expects a file of change requests to tell it what changes to make. You invoke it like this:

      MULCH <infile >outfile changefile

Note that MULCH is a filter and you have to use redirection to get it to read from or write to files.

The changefile can contain up to 50 lines, each containing a change request in the form:

   /from/to/ comments
where 'from' and 'to' are character strings and / is any character (except *) which does not appear in either 'from' or 'to'.

Note that the first character on the line is taken as the delimiter. The 'to' string can be null, but the 'from' string may not be. The maximum length of a line is 255 characters and each line must be terminated with a carriage return/line feed pair.

Lines beginning with '*' are treated as comments, and are merely displayed by MULCH.
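
The rules above can be sketched in a few lines of Python. This is only an illustration of the line format, not MULCH's own code: the function name parse_change_line is made up, the error texts are borrowed from the Error Messages section, and treating an empty line as an error is an assumption. Metacharacters are left undecoded here.

```python
def parse_change_line(line):
    """Split one changefile line into (from, to, comment).

    Returns None for a comment line (one that begins with '*').
    Hypothetical helper: it mirrors the rules in this document only.
    """
    line = line.rstrip("\r\n")
    if line.startswith("*"):
        return None
    if not line:
        # Assumption: an empty line has no delimiter, so it is rejected.
        raise ValueError("Error in change command")
    delimiter = line[0]  # the first character on the line is the delimiter
    parts = line[1:].split(delimiter, 2)
    if len(parts) < 3:
        # Fewer than three delimiters were found on the line.
        raise ValueError("Error in change command")
    source, target, comment = parts
    if source == "":
        raise ValueError("Source string must not be null")
    return source, target, comment.strip()


print(parse_change_line("/this/that/ comments"))
print(parse_change_line(" this  that"))
```

The second call uses a blank as the delimiter, so 'to' comes back as a null string and 'that' ends up in the comment field.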

The following special 'metacharacters' can appear in either string:

       @c = carriage return (X'0D')
       @l = line feed       (X'0A')
       @t = tab             (X'09')
       @e = escape          (X'1B')
       @z = control-z (EOF) (X'1A')
       @@ = @
       @hnn = character x'nn'

Metacharacters can be in upper or lower case.
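
As a rough sketch of how the table above could be interpreted, here is a small Python decoder. The names SIMPLE_METACHARS and decode_metachars are invented for this example, the error texts come from the Error Messages section, and rejecting a lone '@' at the end of a string is an assumption.

```python
# Two-character metacharacters from the table above, keyed by lower-case letter.
SIMPLE_METACHARS = {
    "c": "\r",    # carriage return, X'0D'
    "l": "\n",    # line feed, X'0A'
    "t": "\t",    # tab, X'09'
    "e": "\x1b",  # escape, X'1B'
    "z": "\x1a",  # control-z, X'1A'
    "@": "@",
}
HEX_DIGITS = "0123456789abcdefABCDEF"


def decode_metachars(text):
    """Expand @c, @l, @t, @e, @z, @@ and @hnn in a 'from' or 'to' string."""
    out = []
    i = 0
    while i < len(text):
        if text[i] != "@":
            out.append(text[i])
            i += 1
            continue
        if i + 1 >= len(text):
            # Assumption: a trailing '@' with nothing after it is an error.
            raise ValueError("Error in metacharacter")
        code = text[i + 1].lower()  # upper or lower case is accepted
        if code == "h":
            digits = text[i + 2:i + 4]
            if len(digits) != 2 or any(d not in HEX_DIGITS for d in digits):
                raise ValueError("Error in hex string")
            out.append(chr(int(digits, 16)))
            i += 4
        elif code in SIMPLE_METACHARS:
            out.append(SIMPLE_METACHARS[code])
            i += 2
        else:
            raise ValueError("Error in metacharacter")
    return "".join(out)


print(repr(decode_metachars("th@h61t")))
print(repr(decode_metachars("@C@L")))
```

The sketch works on Python text strings for readability; a faithful tool would work on raw bytes.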

Just typing MULCH will produce some on-line help.

Examples

Here are some sample changefile lines.

   /this/that/             - changes each occurrence of 'this' into 'that'

   ?this?that? comment     - as above, the comment is ignored

    this  that             - deletes 'this' since the delimiter is a blank
                             and there are two blanks between 'this' and
                             'that'. The second of these is thus the third
                             delimiter and ends the change request. The
                             word 'that' is treated as a comment

   /th@h61t/this/          - changes 'that' back into 'this', since X'61'
                             is 'a'

   / @c/@c/                - removes a single trailing blank from all lines
                             that have one

   /@c@l@c@l/@c@l/         - deletes all null lines, that is situations
                             where one carriage return/line feed sequence is
                             immediately followed by another

   /@l@c//                 - also deletes null lines, more simply

   /@l@t/@l /              - turns leading tab character into single blank
                             (except on the very first line, which won't
                             have a preceding line feed)

   /:ol@c/:ol.@c/          - 'legalises' a GML ordered list tag, turning an
                             ':ol' on a line by itself into ':ol.'

The last example was the reason I wrote MULCH. I had to make sure that all the GML list tags in a set of large files were strictly legal. Thus an ':ol' on a line by itself had to be changed to ':ol.', while an ':ol.' was to be left untouched. It was taking too long to do the work using a text editor (I had over 600K of text in lots of files to process) and none of the other tools I tried would do the job without a lot of fiddling about. With MULCH and a simple file of change requests the job was almost trivial.

Note that, with one exception described under Warning below, MULCH tries to work just like an editor would if you entered the change commands one after the other. If you wanted to change the word 'the' to 'THE' without also changing 'then' to 'THEn' or 'rather' to 'raTHEr' etc, here's the change file you'd need:

   / the / THE /         deal with all occurrences within a line
   /@lthe /@lTHE /       deal with all occurrences at the start of a line
   / the@c/ THE@c/       deal with all occurrences at the end of a line

This will almost do - it won't change 'the' if it occurs at the very beginning of the file, and it won't deal with 'the' preceded or followed by punctuation (commas etc) or tab characters.
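
To see what those three requests do and what they miss, here is a small Python imitation of an editor applying global changes one after the other. It is only a sketch: apply_changes is an invented name, the requests are written with their metacharacters already expanded (@l as "\n", @c as "\r"), and Python's str.replace stands in for whatever matching MULCH really does.

```python
def apply_changes(text, changes):
    """Apply each (from, to) pair in turn, like successive global changes."""
    for source, target in changes:
        text = text.replace(source, target)
    return text


# The three change requests above, with @l and @c already expanded.
changes = [
    (" the ", " THE "),    # within a line
    ("\nthe ", "\nTHE "),  # at the start of a line
    (" the\r", " THE\r"),  # at the end of a line
]

sample = "the cat saw the dog\r\nthe end, then the\r\n"
result = apply_changes(sample, changes)
print(repr(result))
# prints 'the cat saw THE dog\r\nTHE end, then THE\r\n'
```

The 'the' at the very beginning of the sample is left alone, as is 'then', which matches the caveats just described.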

The MULCH Algorithm

Any program that tries to handle multiple global changes has to handle some awkward issues. The MULCH algorithm is a compromise between speed, simplicity and avoidance of infinite loops. Basically MULCH:

The effect of this algorithm is that not much time is spent shuffling data about, I/O is generally fast, spill files and infinite loops are avoided and most situations are properly handled.

Warning

There is one situation in which MULCH doesn't do quite what it should. Imagine that just at the end of a chunk the following string occurs:

     ....ABCDEFGHI....

where the E is the last character in a chunk. Now if the change request file looks like this:

     /FGH/XYZ/
     /EF/12/

Then you'd expect the result to be:

     ....ABCDEXYZI....

since the first change should be applied first. But this won't happen. On the first pass, MULCH will stop at the E, since no match is in effect at that point. Thus it won't see the 'F' yet. On the second pass through the same data a match will be in effect at the end, and the next byte, 'F', will be read and 'EF' will be changed to '12'. So the result will actually be:

     ....ABCD12GHI....

I can't see any simple way round this that won't seriously impact MULCH's performance. As you can see, the problem is confined to certain types of overlapping changes, so the best course is to avoid these - doing them as entirely separate runs. I can see a complicated way of solving the problem but I'd be grateful for suggestions.
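
The two outcomes can be reproduced with a short Python sketch. It does not model MULCH's chunking at all; it just applies the two requests in each order to show that the chunk boundary case behaves as though the second request had been applied first. The name apply_changes is invented for the example.

```python
def apply_changes(text, changes):
    """Apply each (from, to) pair in turn, like successive global changes."""
    for source, target in changes:
        text = text.replace(source, target)
    return text


data = "....ABCDEFGHI...."
requests = [("FGH", "XYZ"), ("EF", "12")]

# The order in the change request file: what you would expect to get.
expected = apply_changes(data, requests)
print(expected)  # prints ....ABCDEXYZI....

# The reverse order: what comes out when 'E' is the last byte of a chunk.
at_boundary = apply_changes(data, list(reversed(requests)))
print(at_boundary)  # prints ....ABCD12GHI....
```

Doing the overlapping changes as entirely separate runs, as suggested above, fixes the order for good: run the first request over the whole file, then the second over the result.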

Error Messages

The following error messages may appear:

     >> Missing Changefile name
     This occurs if the command line has nothing but blanks

     >> Error reading input file
     >> Error opening changefile
     >> Error reading changefile
     >> Error writing output file
     These occur if DOS reports I/O errors

     >> Source string must not be null
     A request of the form //this/ does not make any sense to MULCH.

     >> Error in hex string
     If @h is not followed by a pair of valid hex digits this message
     will appear. Note that upper or lower case digits are allowed.

     >> Error in change command
     This appears if fewer than three delimiters are found on a line.

     >> Error in metacharacter
     This appears if @ is followed by a character other than c, l, t,
     e, z, @ or h.

     >> Too many change commands
     MULCH is limited to 50 change requests - and the change buffer is
     limited to 4K in size.

     >> Changes have increased text beyond buffer size
          MULCH cannot continue
     As explained above, MULCH reads a file in 20K chunks. If a chunk
     expands beyond 28K in size during the application of the set of
     change requests, processing is terminated.

     >> Insufficient memory - MULCH needs 64K
     MULCH uses a full 64K of memory - it has two 28K buffers for data
     and a 4K buffer for change requests. The code is about 2K in size.
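
The buffer figures quoted in the messages above leave each 20K chunk room to grow by 8K. The Python sketch below shows that check under some assumptions: 1K is taken as 1024 bytes, the size is tested after each request (MULCH may well test at a different point), and the names CHUNK_SIZE, BUFFER_LIMIT and apply_to_chunk are invented for the example.

```python
CHUNK_SIZE = 20 * 1024    # input is read in 20K chunks (assuming 1K = 1024)
BUFFER_LIMIT = 28 * 1024  # each data buffer holds 28K


def apply_to_chunk(chunk, changes):
    """Apply (from, to) byte-string pairs to one chunk, watching its size."""
    for source, target in changes:
        chunk = chunk.replace(source, target)
        if len(chunk) > BUFFER_LIMIT:
            # Assumption: checked after each request; MULCH may differ.
            raise OverflowError("Changes have increased text beyond buffer size")
    return chunk


chunk = b"a" * CHUNK_SIZE

# Same-length replacement: the chunk stays at 20K and fits.
print(len(apply_to_chunk(chunk, [(b"a", b"b")])))  # prints 20480

# Doubling every byte gives 40K, which is over the 28K limit.
try:
    apply_to_chunk(chunk, [(b"a", b"bb")])
except OverflowError as exc:
    print(exc)
```

In other words, a set of changes that makes a chunk more than 40% bigger will stop the run.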