By William Park
In a previous article, I presented shell functions which emulate the C functions strcat(3), strcpy(3), strlen(3), and strcmp(3). Since the shell's main job is to parse text, a pure shell solution was possible for the string operations found in <string.h>. Such cases are the exception, however: it's not always possible to emulate C in shell, especially for accessing low-level system libraries and third-party applications. Even if it were possible, you would be re-inventing the wheel by ignoring the work that has gone into the C libraries. In addition, shell scripts are, without exception, orders of magnitude slower. But, shell has the advantage of rapid development and easy maintenance, because it's easier to write and read.
What is needed, then, is the ability to write a shell wrapper with a binding to C routines. The shell mechanism which allows you to write a C extension is called a builtin, e.g. read, echo, and printf. When certain features require changes in the way the shell interprets an expression, then modifications must be made to the shell's parsing code. When you need speed, a C extension is a must.
My patch to the Bash-3.0 shell is available from http://home.eol.ca/~parkw/bashdiff/ and comes in two parts:
bashdiff-core-1.11.diff is for features that will be compiled into the shell statically. It adds new features by modifying the Bash parsing code. It's 100% backward compatible, in that no existing meaning is changed; so, what works in your old shell will also work in the new shell. For example, it adds the '<<+' here-document operator, extended brace expansion, new '${var|...}' parameter expansions, 'regex' patterns and 'then'/'else' sections in 'case' statements, multi-variable 'for' loops, 'then'/'else' sections for loops, and a 'try'/'raise' exception mechanism, all described below.
bashdiff-william-1.11.diff is for dynamically loadable builtins (loadables) which are loaded separately into your shell session. It adds new commands to interface with system and application libraries and to provide fast wrappers for common operations. For example, it adds the 33 builtins listed later in this article, such as strcat, sscanf, tonumber, and vplot.
Before being introduced to the patched shell, you have to know how to compile from source, since the patch is against the source tree. Here are the steps required to download and compile the standard Bash-3.0 shell:
    wget ftp://ftp.gnu.org/pub/gnu/bash/bash-3.0.tar.gz
    tar -xzf bash-3.0.tar.gz
    cd bash-3.0
    ./configure
    make

You now have a binary executable bash which is just like your current shell, usually /bin/bash. You can try it out, like
    ./bash              # using freshly compiled Bash-3.0
    date
    ls
    exit                # back to your old shell session
To compile my patched shell, the steps are essentially the same as above. You download a tarball, apply my patch to the source tree (from the above steps), and compile. bashdiff.tar.gz will always point to the latest patch, which at the moment is bashdiff-1.10.tar.gz.
    wget http://home.eol.ca/~parkw/bashdiff/bashdiff-1.10.tar.gz
    tar -xzf bashdiff-1.10.tar.gz
    mv bash-3.0 bash                # it's no longer standard Bash-3.0
    cd bash
    make distclean
    patch -p1 < ../bashdiff-core-1.10.diff
    patch -p1 < ../bashdiff-william-1.10.diff
    autoconf
    ./configure
    make
    make install                    # as root
    cd examples/loadables/william
    make
    make install                    # as root
    ldconfig                        # as root

Now, you have
bash which is the main shell just like before, and it will be installed as /usr/local/bin/bash, and
william.so which is a shared object containing loadables, and it will be installed as /usr/local/lib/libwilliam.so with a symbolic link to /usr/local/lib/william.so. There are 33 loadable builtins in version 1.10, namely
Lsql Msql Psql gdbm xml array arraymap arrayzip arrayunzip basp match vplot pp_append pp_collapse pp_flip pp_overwrite pp_pop pp_push pp_rotateleft pp_rotateright pp_set pp_sort pp_swap pp_transpose sscanf strcat strcpy strlen strcmp tonumber tostring chnumber isnumber
If your shell has 'enable -[fd]', then you can load/unload builtin commands dynamically, hence the name. Usage is simple. For example,
    enable -f william.so vplot

will load the vplot command from the shared library william.so which you just compiled and installed. Use './william.so' if you haven't installed it yet. Once loaded, you can use it just like the standard builtin commands which are statically linked into the shell. So,
    help vplot
    help -s vplot

will print the long and short help file for the command,
    x=( `seq -100 100` )
    y=( `for i in ${x[*]}; do echo $((i*i)); done` )        # y = x^2
    vplot x y

will print an x-y character plot of a parabolic curve on your terminal. To unload,
    enable -d vplot
Loadables are convenient if you just want to load the builtins you need and don't want to or can't change your login shell. Also, loadables are easier to compile incrementally, which is important since new builtins are added or updated more often than the main parsing code of the shell.
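For example, a script that relies on these loadables can enable just what it needs near the top. Here is a minimal sketch, assuming the patched shell and libwilliam.so have been installed as described above (the variable names are only for illustration):

    #!/usr/local/bin/bash
    # load only the builtins this script actually uses
    enable -f william.so strcat strlen

    msg=hello
    strcat msg ", world"
    strlen "$msg"               # 12
    echo "$msg"                 # hello, world

    # optionally unload them when no longer needed
    enable -d strcat strlen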
However, you may want to compile and link everything into a single executable, say on Windows, for example. To compile an "all-in-one" binary, you have to type a bit more. You still have to generate the default bash binary, because you need all those .h and .o files.
    cd bash
    make bash
    make bash+william           # all in one
    make install-bin            # installs only 'bash', 'bashbug', 'bash+william'

Here, bash+william is like bash, but with all builtins linked statically into it. I recommend the single bash+william binary for newbies, because you don't have to remember what to load and unload. Everything is at your fingertips.
In a previous article, you've seen strcat(3), strcpy(3), strlen(3), and strcmp(3) as shell functions. Now, shell versions of those C functions are also available as builtins.
    enable -f william.so strcat strcpy strlen strcmp
    help strcat strcpy strlen strcmp

You will discover that command usage is the same as for the shell functions, except for the '-i' option in strcmp for case-insensitive comparison, i.e.
    strcpy a abc
    strcat a 123
    echo $a                             # abc123
    strcmp $a abc123
    strlen abc123 0123456789            # 6 10
If you have both a shell function and a shell builtin with the same name, then the shell function will take priority. To find out what is what,
    type strcat strcpy strlen strcmp

and to delete shell functions,
    unset -f strcat strcpy strlen strcmp
To compare their speed,
    . string.sh
    a=; time for i in `seq 10000`; do builtin strcat a "$i "; done
    b=; time for i in `seq 10000`; do strcat b "$i "; done
    strlen "$a" "$b"            # 48894 48894
    strcmp "$a" "$b"

You'll find that the shell function is only about 5x slower, which is pretty good since we're talking about shell script vs. C. But, if you use substring options,
    a=; time for i in `seq 10000`; do builtin strcat a "$i " 1:-1; done
    b=; time for i in `seq 10000`; do strcat b "$i " 1:-1; done
    strlen "$a" "$b"            # 28894 28894
    strcmp "$a" "$b"

there will be a 25x difference.
Although string operations are easy in shell, it's generally difficult to examine and manipulate individual characters of a string. Also, printing the full range of ASCII chars (0-127) and high-bit chars (128-255) is difficult, because you have to use octal, hex, or backslash-escaped characters if they are not printable. Capitalizing a word, for example, is unbelievably verbose in regular shell,
    word=abc123
    first=`echo ${word:0:1} | tr 'a-z' 'A-Z'`
    rest=`echo ${word:1} | tr 'A-Z' 'a-z'`
    echo $first$rest

which only works in English locales, because of the explicit [a-z] and [A-Z] ranges. In C, however, this is a simple matter of calling isupper(3), islower(3), toupper(3), and tolower(3), which work in all locales that the C library supports.
What we need are shell wrappers for all those standard C functions: toupper(3), tolower(3), toascii(3), toctrl(), isalnum(3), isalpha(3), isascii(3), isblank(3), iscntrl(3), isdigit(3), isgraph(3), islower(3), isprint(3), ispunct(3), isspace(3), isupper(3), isxdigit(3), and isword(). Most of these are defined in <ctype.h>, so that character operations can be done simply and efficiently.
I decided to follow 'od', and convert strings into sequences of ASCII numbers (0-255). tonumber prints the ASCII number of each char in 'string', much like 'od -A n -t dC'. These are now whitespace-delimited fields which are much easier to work with in shell. In reverse, tostring converts each 'number' to an ASCII character. If the -v option is specified, the output will be saved in the 'var' variable. E.g.
    tonumber ABC                # 65 66 67
    tostring 65 66 67           # ABC
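The -v option works the same way; here is a small sketch, assuming -v simply stores the output in the named shell variable instead of printing it (the variable names are only for illustration):

    tostring -v str 65 66 67
    echo $str                   # ABC
    tonumber -v nums ABC
    echo $nums                  # 65 66 67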
One notable feature about tostring is that it can handle the null byte (\0), so you can write shell scripts to handle binary data, like
    tostring 65 00 97 | od -c           # A \0 a
chnumber { toupper | tolower | toascii | toctrl } [number...]
Shell versions of toupper(3), tolower(3), and others in <ctype.h>. These read numbers and print converted numbers according to options which have the same names as the corresponding C functions, e.g.
    tonumber aA                 # 97 65
    chnumber toupper 97 65      # convert to uppercase: 65 65
    chnumber tolower 97 65      # convert to lowercase: 97 97
isnumber { alnum | alpha | ascii | blank | cntrl | digit | graph | lower | print | punct | space | upper | xdigit | word } [number...]
Shell versions of isupper(3), islower(3), and others in <ctype.h>. They read numbers and return success or failure, according to options which are the names of the corresponding C functions without the 'is' prefix. E.g.
    isnumber upper 65                   # is 'A' uppercase?
    isnumber upper 97                   # is 'a' uppercase?
    isnumber alnum 97 98 99 49 50 51    # is 'abc123' alphanumeric?
So, the above example of capitalizing a word becomes
    set -- `tonumber abc123`
    set -- `chnumber toupper $1; shift; chnumber tolower $*`
    tostring $* 10              # add \n for terminal

which is much more efficient and understandable.
Now that Bash has pretty good coverage of <string.h> and <ctype.h>, you can do string and character operations in shell script much the same way as in C code. Both text and binary data are handled with ease and consistency. This alone represents a vast improvement over the standard shell.
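For instance, combining the builtins above, a locale-aware test for an all-digit word becomes a one-liner. A minimal sketch (the variable name is just for illustration):

    word=20041101
    if isnumber digit `tonumber $word`; then
        echo "$word is all digits"
    fi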
One of the first things you learn in any language is reading and printing. In C, you use printf(3), scanf(3), and others defined in <stdio.h>. For printing in the shell, you use the echo and printf builtins. Curiously, though, a shell version of scanf(3) is missing. For example, to parse the 4 numbers out of 11.22.33.44, you can do
    IFS=. read a b c d <<< 11.22.33.44

However, if the field you want is not nicely delimited as above, then it gets complicated.
I've added a shell version of the C function sscanf(3):
    sscanf 11.22.33.44 '%[0-9].%[0-9].%[0-9].%[0-9]' a b c d
    declare -p a b c d          # a=11 b=22 c=33 d=44
    sscanf 'abc 123 45xy' '%s %s %[0-9]%[a-z]' a b c d
    declare -p a b c d          # a=abc b=123 c=45 d=xy
From time to time, you have to print and read DOS lines which end with \r\n (CR/NL). Although you can print \r explicitly, the automatic insertion of \r just before \n is difficult in shell. For reading, you need to explicitly remove the trailing \r.
I've patched standard echo and read builtins to read and print DOS lines:
    echo abc | od -c                    # a b c \n
    echo -D abc | od -c                 # a b c \r \n
    read a b <<< $'11 22 \r'            # a=11 b=$'22 \r'
    read -D a b <<< $'11 22 \r'         # a=11 b=22
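For example, converting a Unix text file to DOS line endings becomes a simple loop. A sketch, with hypothetical file names:

    while read line; do
        echo -D "$line"                 # append \r before \n
    done < unixfile.txt > dosfile.txt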
Often, you need to parse lines and work with Awk-style variables like NF, NR, $1, $2, ..., $NF. However, when you use Awk, it's difficult to bring those variables back into shell; you have to write them to a temporary file in shell syntax and then source it. Because of this, it's a hassle to jump back and forth between shell and Awk.
I've patched the standard read builtin to provide simple Awk emulation, creating NF and NR variables and assigning the fields to $1, $2, ..., $NF.
    IFS=. read -A <<< 11.22.33.44
    echo $#: $*                 # 4: 11 22 33 44
    declare -p NF NR

And, just like Awk, each call to read -A will increment NR.
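As a sketch of the Awk-style workflow this enables, you can total a column without leaving the shell (data.txt is a hypothetical file with a number in its first field):

    sum=0
    while read -A; do
        sum=$(( sum + $1 ))             # $1 is the first field, as in Awk
    done < data.txt
    echo "$NR lines, total $sum"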
'<<' is the here-document redirection operator, where standard input is taken from actual text in the script source. '<<' will preserve the leading whitespace, and '<<-' will remove all leading tabs. The problem with '<<-' is that relative indentation is lost.
I've added a new operator '<<+' which preserves tab indentation of the here-document relative to the first line. This is available directly from the shell (i. e. ./bash or /usr/local/bin/bash), because it's patched into the main parsing code. So,
    cat <<+ EOF                 # the here-document body is indented with tabs
            first line
                    second line
            EOF

will print
    first line
            second line
Bash-3.0 (like Zsh) has the '{a..b}' expression, which generates an integer sequence as part of brace expansion, but you can't use variable substitution because the '{a..b}' expression must contain explicit integers.
My patch extends the brace expansion to include variable, parameter, and array substitution, as well as a single letter sequence generator. For example,
    a=1 b=10 x=a y=b
    echo {1..10}
    echo {a..b}
    echo {!x..!y}               # use 'set +H' to suppress ! expansion
    set -- `seq 10`
    echo {**}
    echo {##}
    echo {1..#}
    z=( `seq 10` )
    echo {^z}

all produce the same result, i.e. 1 2 ... 10. More details are available from the help file:
    help '{a..b}'
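One immediate payoff is that loop limits no longer have to be hard-coded. A small sketch along the lines of the {a..b} example above (the variable names are only for illustration):

    first=1 last=5
    for i in {first..last}; do
        echo -n "$i "
    done
    echo                        # 1 2 3 4 5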
One useful application might be in downloading a bunch of images from a website. There are so many family-oriented sites on the Web, it's difficult to recommend one. When you find one chock full of educational content, you can try
    wget -x http://your.favourite.site/conception/pic{001..200}.jpeg

so that you can continue your private study (as allowed by the Copyright Act of your country) later when you have more time.
In addition to integers, you can also generate a sequence of single letters using the '{a--b}' variation, where 'a' and 'b' are explicit letters as recognized by isalpha(3) in <ctype.h>. E.g.
    echo {A--z}                 # A B C ... z

skipping any non-letters (if any exist) between the end points.
This is called list comprehension in Python and other functional languages. Essentially, it's a way of generating a list from another list. For each list element, you can change the content or choose not to include it at all.
${var|command}

By default, the output of the command substitution `command var` is used in the parameter expansion, instead of the original string. If the stdout is empty, then the item is removed from the expansion. Here, 'var' can be anything that can appear in other parameter expansions, i.e. ${var:...}, ${var#...}, ${var%...}, and ${var/...}. 'command' is anything you can type on your command line, i.e. an alias, shell function, builtin command, external command, or shell script. So,
    b=( `date` )
    func () { tr 'a-zA-Z' 'A-Za-z' <<< "$1"; }
    echo ${b[*]|func}           # switch case of letters

    set -- `date`
    func () { [[ $1 == *[!0-9]* ]] || echo $(( $1 + 1 )); }
    echo ${*|func}              # only numbers, and incremented by 1

This is similar to what's available in functional languages, except it's implemented in the shell framework. Unfortunately, command substitution doesn't preserve whitespace, because it captures stdout.
${var|?command}
When 'command' follows immediately after '?', then the original string is included in the parameter expansion only if 'command var' returns success (0). If not, then it's removed from the expansion. The content is not changed, but you can decide whether or not to include it. Therefore, ${var|?true} will be equivalent to ${var}, since 'true' always returns success (0). E.g.
    b=( `date` )
    func () { [[ $1 == [A-Z]* ]]; }
    echo ${b[*]|?func}          # only capitalized words

    set -- `date`
    func () { [[ $1 == *[!0-9]* ]]; }
    echo ${*|?func}             # only non-numbers
As a special case of filtering, you can specify a glob(7) or regex(7) pattern to be matched against items in the variable: ${var|=glob} and ${var|/regex} will include the string only if there is a match; conversely, ${var|!glob} and ${var|~regex} will include the string only if there is no match. The above examples can be rewritten as
    b=( `date` )
    echo ${b[*]|=[A-Z]*}        # only capitalized words

    set -- `date`
    echo ${*|=*[!0-9]*}         # only non-numbers
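The regex and negated forms work the same way. A sketch, assuming the same 'date' output as above (in an interactive shell you may need 'set +H' because of the '!'):

    set -- `date`
    echo ${*|/^[0-9]+$}         # keep only the purely numeric fields (regex match)
    echo ${*|![A-Z]*}           # drop the capitalized words (no glob match)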
${var|:a:b}
You can extract a Python-style [a:b] range using ${var|:a:b}, which is similar to the standard shell syntax ${var:a:n}. If 'var' is a string, then it will be a substring; if 'var' is a list, then it will be a sublist. E.g.
    a=0123456789
    echo ${a|::3} ${a|:-3:} ${a|:1:-1}          # 012 789 12345678
    set -- {a--z}
    echo ${*|::3} ${*|:-3:} ${*|:1:-1}

will print the first 3, the last 3, and all except the first and the last chars or list elements, respectively.
${var|*n}
When you need to duplicate a string or a list, ${var|*n} will copy the string or list 'n' times. E.g.
    a=abc123
    echo ${a|*3}                # 3 times
    set -- a b c
    echo ${*|*2+3}              # 5 times
The syntax of the standard 'case' statement is
    case WORD in
        glob [| glob]...) COMMANDS ;;
        ...
    esac

I have extended the syntax to
    case WORD in
        glob [| glob]...) COMMANDS ;;
        regex [| regex]...)) COMMANDS ;;
        ...
    esac

so that the pattern list will be interpreted as 'regex' if it's terminated by double parenthesis '))'. Other than that, it works like before. Although Bash-3.0 has [[ string =~ regex ]], a case statement is still better syntax for two or more patterns, or if you need to test for both 'glob' and 'regex' in the same context.
Whereas 'glob' matches the entire string in order to return success, 'regex' can match a substring. If there is a match, then array variable SUBMATCH will contain the matching substring in SUBMATCH[0] and any parenthesized groups in 'regex' pattern in SUBMATCH[1], SUBMATCH[2], etc. For example,
    case .abc123. in
        '([a-z]+)([0-9]+)' )) echo yes ;;
    esac
    declare -p SUBMATCH

will match successfully, with the matching substring 'abc123' in SUBMATCH[0] and the parenthesized groups 'abc' and '123' in SUBMATCH[1] and SUBMATCH[2].
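As a more practical sketch (the variable and patterns are hypothetical), the captured groups can be used right after the match:

    file=backup.tar.gz
    case $file in
        '(.+)\.(tar\.gz|tgz)' )) echo "base=${SUBMATCH[1]} ext=${SUBMATCH[2]}" ;;
    esac

which prints 'base=backup ext=tar.gz'.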
When you terminate a command list with ';&', as in

    case WORD in
        pattern1) command1 ;&
        pattern2) command2 ;;
        ...
    esac

'command1' will run if 'pattern1' matches. After that, execution will continue on to 'command2' and the subsequent command lists, until it encounters a double semi-colon. Now, Bash can do it too.
In addition, when you terminate a command list with ';;&',
    case WORD in
        pattern1) command1 ;;&
        pattern2) command2 ;;
        ...
    esac

'command1' will run if 'pattern1' matches. After that, execution will continue on to testing 'pattern2' instead of exiting the case statement. Therefore, it will test all of the patterns, whether or not there was a successful match. Zsh and Ksh don't have this feature. :-)
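A concrete sketch makes the difference clearer; with ';;&', every pattern below is tested:

    case linux in
        l*)   echo "starts with l" ;;&
        *x)   echo "ends with x"   ;;&
        *u*)  echo "contains u"    ;;
    esac

which prints all three lines.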
Often, you need to know the exit condition of a 'case' statement. You can use '*)' as a default pattern, but it's not straightforward to find out if there was a match as you're coming out of the 'case' statement. With my patch, you can add an optional 'then' and 'else' section at the end of the 'case' statement, right after 'esac', and treat the 'case' statement as a big 'if' statement. The new syntax goes something like
    case ... in
        ...
    esac then
        COMMANDS
    else
        COMMANDS
    fi

    case ... in
        ...
    esac then
        COMMANDS
    fi

    case ... in
        ...
    esac else
        COMMANDS
    fi
For example,
    case abc123 in
        [A-Z]*) echo matched ;;
    esac then
        echo yes
    else
        echo no                 # no match
    fi

will print 'no', but
    case Xabc123 in
        [A-Z]*) echo matched ;;         # match
    esac then
        echo yes                        # match
    else
        echo no
    fi

will print 'matched' and 'yes'.
In standard shell, you can only use one variable in a 'for' loop. I added multi-variable syntax, so that
    for a,b,c in {1..10}; do
        echo $a $b $c
    done

will print
    1 2 3
    4 5 6
    7 8 9
    10

as you expect. Here, the variables must be separated by commas. If there is a shortage of items to assign in the last iteration, the leftover variables will be assigned the empty (null) string.
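The multi-variable form is also handy for walking paired data. A small sketch (the names and values are just for illustration):

    for name,value in one 1 two 2 three 3; do
        echo "$name=$value"
    done

which prints 'one=1', 'two=2', and 'three=3'.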
Just like the 'case' statement, you often need to know if you exited the loop normally or through the use of 'break'. With my patch, you can add optional 'then' and 'else' sections at the end of 'for', 'while', and 'until' loops right after 'done'. The new syntax goes something like
    [for|while|until] ...; do
        ...
    done then
        COMMANDS
    else
        COMMANDS
    fi

    [for|while|until] ...; do
        ...
    done then
        COMMANDS
    fi

    [for|while|until] ...; do
        ...
    done else
        COMMANDS
    fi
For example,
    for i in 1 2 3; do
        echo $i
        break
    done then
        echo normal
    else
        echo used break         # 1
    fi

will print '1' only for the first iteration, then it will break out of the loop. But,
    for i in 1 2 3; do
        echo $i
    done then
        echo normal             # 1 2 3
    else
        echo used break
    fi

will print all items '1 2 3', and the exit condition will be normal. The same applies to 'while' and 'until' loops.
The ability to test the exit condition improves the readability of shell scripts, because you don't have to use a variable as a flag. Python has a similar mechanism for testing the exit condition of a loop, but it uses the return value of the test. So, a 'while' loop exits when the test fails, and Python uses 'else' for the normal exit condition, which is a bit confusing.
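For instance, the usual 'search' loop no longer needs a flag variable. A sketch, with hypothetical file names:

    for f in *.txt; do
        [[ $f == notes.txt ]] && break
    done then
        echo "notes.txt not found"      # loop ran to completion
    else
        echo "found notes.txt"          # loop exited via break
    fi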
Practically every modern language has the ability to raise an exception to break out of deeply nested code, to handle errors, or to do multi-point jumps. I added a new 'try' block to Bash which will catch integer exceptions raised by a new 'raise' builtin.
    try
        COMMANDS
    done in
        NUMBER [| NUMBER]... ) COMMANDS ;;
        ...
    esac

where 'done in' cannot be separated by ';' or newlines. Also, the patterns in the case-like statement must be explicit integer numbers.
This combines elements of loops, the break builtin and the case statement. Within a try-block, the 'raise' builtin can be used to raise an integer exception. Then, the execution will break out of the try block, just like 'break'ing out of for/until/while loops. You can use an optional case-like statement to catch the exception. If the exception is caught, then it will be reset and execution will continue following the try-block. If the exception is not caught, then execution will break out upward until it is caught or until there are no more try-blocks.
For example,
    try
        echo a
        while true; do          # infinite loop
            echo aa
            raise
            echo bb
        done
        echo b
    done

will print 'a aa', and
    try
        echo a
        raise 2
        echo b
    done in
        0) echo normal ;;
        1) echo raised one ;;
        2) echo raised two ;;           # raise 2
    esac

will print 'a' followed by 'raised two', since the exception raised is 2.
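Exceptions also propagate out of nested try-blocks until something catches them, as described above. A sketch:

    try
        try
            raise 3
        done in
            1) echo "caught 1 in the inner block" ;;
        esac
        echo "not reached"
    done in
        3) echo "caught 3 in the outer block" ;;
    esac

This prints only 'caught 3 in the outer block'.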
In the next article, I'll cover dynamically-loadable builtins related to arrays, regex splitting, interfacing to external libraries like an SQL database and an XML parser, and some interesting applications like HTML templates and a POP3 spam checker.
I learned Unix using the original Bourne shell. And, after my
journey through language wilderness, I have come full-circle
back to shell. Recently, I've been patching features into Bash,
giving other scripting languages a run for their money.
Slackware has been my primary distribution since the beginning,
because I can type. In my toolbox, I have Vim, Bash, Mutt, Tin,
TeX/LaTeX, Python, Awk, Sed. Even my shell command line is in
Vi-mode.