By William Park
In a previous article, I presented shell functions which emulate the C functions strcat(3), strcpy(3), strlen(3), and strcmp(3). Since the shell's main job is to parse text, a pure shell solution was possible for the string operations found in <string.h>. Such cases are the exception, however: it's not always possible to emulate C in shell, especially for accessing low-level system libraries and third-party applications. Even if it were possible, you would be re-inventing the wheel by ignoring the work that has gone into the C libraries. In addition, shell scripts are, without exception, orders of magnitude slower. But, shell has the advantage of rapid development and easy maintenance, because it's easier to write and read.
What is needed, then, is the ability to write a shell wrapper with a binding to C routines. The shell mechanism which allows you to write a C extension is called a builtin, e.g. read, echo, and printf. When certain features require changes in the way the shell interprets an expression, then modifications must be made to the shell's parsing code. When you need speed, a C extension is a must.
My patch to the Bash-3.0 shell is available from http://home.eol.ca/~parkw/bashdiff/ and comes in two parts:
bashdiff-core-1.11.diff is for features that will be compiled into the shell statically. It adds new features by modifying the Bash parsing code. It's 100% backward compatible, in that no existing meaning is changed; so, what works in your old shell will also work in the new shell. For example, it adds the '<<+' here-document operator, extended brace expansion, new '${var|...}' parameter expansions, 'regex' patterns and 'then'/'else' sections in 'case' statements, multi-variable 'for' loops, 'then'/'else' sections for loops, and a 'try'/'raise' exception mechanism, all described below.
bashdiff-william-1.11.diff is for dynamically loadable builtins (loadables) which are loaded separately into your shell session. It adds new commands to interface with system and application libraries and to provide fast wrappers for common operations. For example, it adds the 33 builtins listed later in this article, such as strcat, sscanf, tonumber, and vplot.
Before being introduced to the patched shell, you have to know how to compile from source, since the patch is against the source tree. Here are the steps required to download and compile the standard Bash-3.0 shell:
    wget ftp://ftp.gnu.org/pub/gnu/bash/bash-3.0.tar.gz
    tar -xzf bash-3.0.tar.gz
    cd bash-3.0
    ./configure
    make

You now have a binary executable bash which is just like your current shell, usually /bin/bash. You can try it out, like
    ./bash              # using freshly compiled Bash-3.0
    date
    ls
    exit                # back to your old shell session
To compile my patched shell, the steps are essentially the same as above. You download a tarball, apply my patch to the source tree (from the above steps), and compile. bashdiff.tar.gz will always point to the latest patch, which at the moment is bashdiff-1.10.tar.gz.
    wget http://home.eol.ca/~parkw/bashdiff/bashdiff-1.10.tar.gz
    tar -xzf bashdiff-1.10.tar.gz
    mv bash-3.0 bash                # it's no longer standard Bash-3.0
    cd bash
    make distclean
    patch -p1 < ../bashdiff-core-1.10.diff
    patch -p1 < ../bashdiff-william-1.10.diff
    autoconf
    ./configure
    make
    make install                    # as root
    cd examples/loadables/william
    make
    make install                    # as root
    ldconfig                        # as root

Now, you have
bash which is the main shell just like before, and it will be installed as /usr/local/bin/bash, and
william.so which is a shared object containing loadables, and it will be installed as /usr/local/lib/libwilliam.so with a symbolic link to /usr/local/lib/william.so. There are 33 loadable builtins in version 1.10, namely
Lsql Msql Psql gdbm xml array arraymap arrayzip arrayunzip basp match vplot pp_append pp_collapse pp_flip pp_overwrite pp_pop pp_push pp_rotateleft pp_rotateright pp_set pp_sort pp_swap pp_transpose sscanf strcat strcpy strlen strcmp tonumber tostring chnumber isnumber
If your shell has 'enable -[fd]', then you can load/unload builtin commands dynamically, hence the name. Usage is simple. For example,
    enable -f william.so vplot

will load the vplot command from the shared library william.so which you just compiled and installed. Use './william.so' if you haven't installed it yet. Once loaded, you can use it just like the standard builtin commands which are statically linked into the shell. So,
    help vplot
    help -s vplot

will print the long and short help file for the command,
    x=( `seq -100 100` )
    y=( `for i in ${x[*]}; do echo $((i*i)); done` )        # y = x^2
    vplot x y

will print an x-y character plot of a parabolic curve on your terminal. To unload,
    enable -d vplot
Loadables are convenient if you just want to load the builtins you need and don't want to or can't change your login shell. Also, loadables are easier to compile incrementally, which is important since new builtins are added or updated more often than the main parsing code of the shell.
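For example, a script that relies on these loadables can enable just what it needs near the top. Here is a minimal sketch, assuming the patched shell and libwilliam.so have been installed as described above (the variable names are only for illustration):

    #!/usr/local/bin/bash
    # load only the builtins this script actually uses
    enable -f william.so strcat strlen

    msg=hello
    strcat msg ", world"
    strlen "$msg"               # 12
    echo "$msg"                 # hello, world

    # optionally unload them when no longer needed
    enable -d strcat strlen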
However, you may want to compile and link everything into a single executable, say on Windows, for example. To compile an "all-in-one" binary, you have to type a bit more. You still have to generate the default bash binary, because you need all those .h and .o files.
    cd bash
    make bash
    make bash+william           # all in one
    make install-bin            # installs only 'bash', 'bashbug', 'bash+william'

Here, bash+william is like bash, but with all builtins linked statically into it. I recommend the single bash+william binary for newbies, because you don't have to remember what to load and unload. Everything is at your fingertips.
In a previous article, you've seen strcat(3), strcpy(3), strlen(3), and strcmp(3) as shell functions. Now, shell versions of those C functions are also available as builtins.
    enable -f william.so strcat strcpy strlen strcmp
    help strcat strcpy strlen strcmp

You will discover that command usage is the same as for the shell functions, except for the '-i' option in strcmp for case-insensitive comparison, i.e.
    strcpy a abc
    strcat a 123
    echo $a                             # abc123
    strcmp $a abc123
    strlen abc123 0123456789            # 6 10
If you have both a shell function and a shell builtin with the same name, then the shell function will take priority. To find out what is what,
    type strcat strcpy strlen strcmp

and to delete shell functions,
    unset -f strcat strcpy strlen strcmp
To compare their speed,
    . string.sh
    a=; time for i in `seq 10000`; do builtin strcat a "$i "; done
    b=; time for i in `seq 10000`; do strcat b "$i "; done
    strlen "$a" "$b"            # 48894 48894
    strcmp "$a" "$b"

You'll find that the shell function is only about 5x slower, which is pretty good since we're talking about shell script vs. C. But, if you use substring options,
    a=; time for i in `seq 10000`; do builtin strcat a "$i " 1:-1; done
    b=; time for i in `seq 10000`; do strcat b "$i " 1:-1; done
    strlen "$a" "$b"            # 28894 28894
    strcmp "$a" "$b"

there will be a 25x difference.
Although string operations are easy in shell, it's generally difficult to examine and manipulate individual characters of a string. Also, printing the full range of ASCII chars (0-127) and high-bit chars (128-255) is difficult, because you have to use octal, hex, or backslash-escaped characters if they are not printable. Capitalizing a word, for example, is unbelievably verbose in regular shell,
    word=abc123
    first=`echo ${word:0:1} | tr 'a-z' 'A-Z'`
    rest=`echo ${word:1} | tr 'A-Z' 'a-z'`
    echo $first$rest

which only works in English locales, because of the explicit [a-z] and [A-Z] ranges. In C, however, this is a simple matter of calling isupper(3), islower(3), toupper(3), and tolower(3), which work in all locales that the C library supports.
What we need are shell wrappers for all those standard C functions: toupper(3), tolower(3), toascii(3), toctrl(), isalnum(3), isalpha(3), isascii(3), isblank(3), iscntrl(3), isdigit(3), isgraph(3), islower(3), isprint(3), ispunct(3), isspace(3), isupper(3), isxdigit(3), and isword(). Most of these are defined in <ctype.h>, so that character operations can be done simply and efficiently.
I decided to follow 'od', and convert strings into sequences of ASCII numbers (0-255). tonumber prints the ASCII number of each char in 'string', much like 'od -A n -t dC'. These are now whitespace-delimited fields which are much easier to work with in shell. In reverse, tostring converts each 'number' to an ASCII character. If the -v option is specified, the output will be saved in the 'var' variable. E.g.
    tonumber ABC                # 65 66 67
    tostring 65 66 67           # ABC
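The -v option works the same way; here is a small sketch, assuming -v simply stores the output in the named shell variable instead of printing it (the variable names are only for illustration):

    tostring -v str 65 66 67
    echo $str                   # ABC
    tonumber -v nums ABC
    echo $nums                  # 65 66 67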
One notable feature about tostring is that it can handle the null byte (\0), so you can write shell scripts to handle binary data, like
    tostring 65 00 97 | od -c           # A \0 a
chnumber { toupper | tolower | toascii | toctrl } [number...]
Shell versions of toupper(3), tolower(3), and others in <ctype.h>. These read numbers and print converted numbers according to options which have the same names as the corresponding C functions, e.g.
    tonumber aA                 # 97 65
    chnumber toupper 97 65      # convert to uppercase: 65 65
    chnumber tolower 97 65      # convert to lowercase: 97 97
isnumber { alnum | alpha | ascii | blank | cntrl | digit | graph | lower | print | punct | space | upper | xdigit | word } [number...]
Shell versions of isupper(3), islower(3), and others in <ctype.h>. They read numbers and return success or failure, according to options which are the names of the corresponding C functions without the 'is' prefix. E.g.
    isnumber upper 65                   # is 'A' uppercase?
    isnumber upper 97                   # is 'a' uppercase?
    isnumber alnum 97 98 99 49 50 51    # is 'abc123' alphanumeric?
So, the above example of capitalizing a word becomes
    set -- `tonumber abc123`
    set -- `chnumber toupper $1; shift; chnumber tolower $*`
    tostring $* 10              # add \n for terminal

which is much more efficient and understandable.
Now that Bash has pretty good coverage of <string.h> and <ctype.h>, you can do string and character operations in shell script much the same way as in C code. Both text and binary data are handled with ease and consistency. This alone represents a vast improvement over the standard shell.
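For instance, combining the builtins above, a locale-aware test for an all-digit word becomes a one-liner. A minimal sketch (the variable name is just for illustration):

    word=20041101
    if isnumber digit `tonumber $word`; then
        echo "$word is all digits"
    fi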
One of the first things you learn in any language is reading and printing. In C, you use printf(3), scanf(3), and others defined in <stdio.h>. For printing in the shell, you use the echo and printf builtins. Curiously, though, a shell version of scanf(3) is missing. For example, to parse the 4 numbers out of 11.22.33.44, you can do
    IFS=. read a b c d <<< 11.22.33.44

However, if the field you want is not nicely delimited as above, then it gets complicated.
I've added a shell version of the C function sscanf(3):
    sscanf 11.22.33.44 '%[0-9].%[0-9].%[0-9].%[0-9]' a b c d
    declare -p a b c d          # a=11 b=22 c=33 d=44
    sscanf 'abc 123 45xy' '%s %s %[0-9]%[a-z]' a b c d
    declare -p a b c d          # a=abc b=123 c=45 d=xy
From time to time, you have to print and read DOS lines which end with \r\n (CR/NL). Although you can print \r explicitly, the automatic insertion of \r just before \n is difficult in shell. For reading, you need to explicitly remove the trailing \r.
I've patched standard echo and read builtins to read and print DOS lines:
    echo abc | od -c                    # a b c \n
    echo -D abc | od -c                 # a b c \r \n
    read a b <<< $'11 22 \r'            # a=11 b=$'22 \r'
    read -D a b <<< $'11 22 \r'         # a=11 b=22
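For example, converting a Unix text file to DOS line endings becomes a simple loop. A sketch, with hypothetical file names:

    while read line; do
        echo -D "$line"                 # append \r before \n
    done < unixfile.txt > dosfile.txt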
Often, you need to parse lines and work with Awk-style variables like NF, NR, $1, $2, ..., $NF. However, when you use Awk, it's difficult to bring those variables back into shell; you have to write them to a temporary file in shell syntax and then source it. Because of this, it's a hassle to jump back and forth between shell and Awk.
I've patched the standard read builtin to provide simple Awk emulation, creating NF and NR variables and assigning the fields to $1, $2, ..., $NF.
    IFS=. read -A <<< 11.22.33.44
    echo $#: $*                 # 4: 11 22 33 44
    declare -p NF NR

And, just like Awk, each call to read -A will increment NR.
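As a sketch of the Awk-style workflow this enables, you can total a column without leaving the shell (data.txt is a hypothetical file with a number in its first field):

    sum=0
    while read -A; do
        sum=$(( sum + $1 ))             # $1 is the first field, as in Awk
    done < data.txt
    echo "$NR lines, total $sum"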
'<<' is the here-document redirection operator, where standard input is taken from actual text in the script source. '<<' will preserve the leading whitespace, and '<<-' will remove all leading tabs. The problem with '<<-' is that relative indentation is lost.
I've added a new operator '<<+' which preserves tab indentation of the here-document relative to the first line. This is available directly from the shell (i. e. ./bash or /usr/local/bin/bash), because it's patched into the main parsing code. So,
    cat <<+ EOF                 # the here-document body is indented with tabs
            first line
                    second line
            EOF

will print
    first line
            second line
Bash-3.0 (like Zsh) has the '{a..b}' expression, which generates an integer sequence as part of brace expansion, but you can't use variable substitution because the '{a..b}' expression must contain explicit integers.
My patch extends the brace expansion to include variable, parameter, and array substitution, as well as a single letter sequence generator. For example,
    a=1 b=10 x=a y=b
    echo {1..10}
    echo {a..b}
    echo {!x..!y}               # use 'set +H' to suppress ! expansion
    set -- `seq 10`
    echo {**}
    echo {##}
    echo {1..#}
    z=( `seq 10` )
    echo {^z}

all produce the same result, i.e. 1 2 ... 10. More details are available from the help file:
    help '{a..b}'
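One immediate payoff is that loop limits no longer have to be hard-coded. A small sketch along the lines of the {a..b} example above (the variable names are only for illustration):

    first=1 last=5
    for i in {first..last}; do
        echo -n "$i "
    done
    echo                        # 1 2 3 4 5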
One useful application might be in downloading a bunch of images from a website. There are so many family-oriented sites on the Web, it's difficult to recommend one. When you find one chock full of educational content, you can try
    wget -x http://your.favourite.site/conception/pic{001..200}.jpeg

so that you can continue your private study (as allowed by the Copyright Act of your country) later when you have more time.
In addition to integers, you can also generate a sequence of single letters using the '{a--b}' variation, where 'a' and 'b' are explicit letters as recognized by isalpha(3) in <ctype.h>. E.g.
    echo {A--z}                 # A B C ... z

skipping any non-letters (if any exist) between the end points.
This is called list comprehension in Python and other functional languages. Essentially, it's a way of generating a list from another list. For each list element, you can change the content or choose not to include it at all.
${var|command}

By default, the output of the command substitution `command var` is used in the parameter expansion, instead of the original string. If the stdout is empty, then the item is removed from the expansion. Here, 'var' can be anything that can appear in other parameter expansions, i.e. ${var:...}, ${var#...}, ${var%...}, and ${var/...}. 'command' is anything you can type on your command line, i.e. an alias, shell function, builtin command, external command, or shell script. So,
    b=( `date` )
    func () { tr 'a-zA-Z' 'A-Za-z' <<< "$1"; }
    echo ${b[*]|func}           # switch case of letters

    set -- `date`
    func () { [[ $1 == *[!0-9]* ]] || echo $(( $1 + 1 )); }
    echo ${*|func}              # only numbers, and incremented by 1

This is similar to what's available in functional languages, except it's implemented in the shell framework. Unfortunately, command substitution doesn't preserve whitespace, because it captures stdout.
${var|?command}
When 'command' follows immediately after '?', then the original string is included in the parameter expansion only if 'command var' returns success (0). If not, then it's removed from the expansion. The content is not changed, but you can decide whether or not to include it. Therefore, ${var|?true} will be equivalent to ${var}, since 'true' always returns success (0). E.g.
    b=( `date` )
    func () { [[ $1 == [A-Z]* ]]; }
    echo ${b[*]|?func}          # only capitalized words

    set -- `date`
    func () { [[ $1 == *[!0-9]* ]]; }
    echo ${*|?func}             # only non-numbers
As a special case of filtering, you can specify a glob(7) or regex(7) pattern to be matched against items in the variable: ${var|=glob} and ${var|/regex} will include the string only if there is a match; conversely, ${var|!glob} and ${var|~regex} will include the string only if there is no match. The above examples can be rewritten as
    b=( `date` )
    echo ${b[*]|=[A-Z]*}        # only capitalized words

    set -- `date`
    echo ${*|=*[!0-9]*}         # only non-numbers
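The regex and negated forms work the same way. A sketch, assuming the same 'date' output as above (in an interactive shell you may need 'set +H' because of the '!'):

    set -- `date`
    echo ${*|/^[0-9]+$}         # keep only the purely numeric fields (regex match)
    echo ${*|![A-Z]*}           # drop the capitalized words (no glob match)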
${var|:a:b}
You can extract a Python-style [a:b] range using ${var|:a:b}, which is similar to the standard shell syntax ${var:a:n}. If 'var' is a string, then it will be a substring; if 'var' is a list, then it will be a sublist. E.g.
    a=0123456789
    echo ${a|::3} ${a|:-3:} ${a|:1:-1}          # 012 789 12345678
    set -- {a--z}
    echo ${*|::3} ${*|:-3:} ${*|:1:-1}

will print the first 3, the last 3, and all except the first and the last chars or list elements, respectively.
${var|*n}
When you need to duplicate a string or a list, ${var|*n} will copy the string or list 'n' times. E.g.
    a=abc123
    echo ${a|*3}                # 3 times
    set -- a b c
    echo ${*|*2+3}              # 5 times
The syntax of the standard 'case' statement is
    case WORD in
        glob [| glob]...) COMMANDS ;;
        ...
    esac

I have extended the syntax to
    case WORD in
        glob [| glob]...) COMMANDS ;;
        regex [| regex]...)) COMMANDS ;;
        ...
    esac

so that the pattern list will be interpreted as 'regex' if it's terminated by double parenthesis '))'. Other than that, it works like before. Although Bash-3.0 has [[ string =~ regex ]], a case statement is still better syntax for two or more patterns, or if you need to test for both 'glob' and 'regex' in the same context.
Whereas 'glob' matches the entire string in order to return success, 'regex' can match a substring. If there is a match, then array variable SUBMATCH will contain the matching substring in SUBMATCH[0] and any parenthesized groups in 'regex' pattern in SUBMATCH[1], SUBMATCH[2], etc. For example,
    case .abc123. in
        '([a-z]+)([0-9]+)' )) echo yes ;;
    esac
    declare -p SUBMATCH

will match successfully, with the matching substring 'abc123' in SUBMATCH[0] and the parenthesized groups 'abc' and '123' in SUBMATCH[1] and SUBMATCH[2].
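As a more practical sketch (the variable and patterns are hypothetical), the captured groups can be used right after the match:

    file=backup.tar.gz
    case $file in
        '(.+)\.(tar\.gz|tgz)' )) echo "base=${SUBMATCH[1]} ext=${SUBMATCH[2]}" ;;
    esac

which prints 'base=backup ext=tar.gz'.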
When you terminate a command list with ';&', as in

    case WORD in
        pattern1) command1 ;&
        pattern2) command2 ;;
        ...
    esac

'command1' will run if 'pattern1' matches. After that, execution will continue on to 'command2' and the subsequent command lists, until it encounters a double semi-colon. Now, Bash can do it too.
In addition, when you terminate a command list with ';;&',
    case WORD in
        pattern1) command1 ;;&
        pattern2) command2 ;;
        ...
    esac

'command1' will run if 'pattern1' matches. After that, execution will continue on to testing 'pattern2' instead of exiting the case statement. Therefore, it will test all of the patterns, whether or not there was a successful match. Zsh and Ksh don't have this feature. :-)
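A concrete sketch makes the difference clearer; with ';;&', every pattern below is tested:

    case linux in
        l*)   echo "starts with l" ;;&
        *x)   echo "ends with x"   ;;&
        *u*)  echo "contains u"    ;;
    esac

which prints all three lines.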
Often, you need to know the exit condition of a 'case' statement. You can use '*)' as a default pattern, but it's not straightforward to find out if there was a match as you're coming out of the 'case' statement. With my patch, you can add an optional 'then' and 'else' section at the end of the 'case' statement, right after 'esac', and treat the 'case' statement as a big 'if' statement. The new syntax goes something like
    case ... in
        ...
    esac then
        COMMANDS
    else
        COMMANDS
    fi

    case ... in
        ...
    esac then
        COMMANDS
    fi

    case ... in
        ...
    esac else
        COMMANDS
    fi
For example,
    case abc123 in
        [A-Z]*) echo matched ;;
    esac then
        echo yes
    else
        echo no                 # no match
    fi

will print 'no', but
    case Xabc123 in
        [A-Z]*) echo matched ;;         # match
    esac then
        echo yes                        # match
    else
        echo no
    fi

will print 'matched' and 'yes'.
In standard shell, you can only use one variable in a 'for' loop. I added multi-variable syntax, so that
    for a,b,c in {1..10}; do
        echo $a $b $c
    done

will print
    1 2 3
    4 5 6
    7 8 9
    10

as you expect. Here, the variables must be separated by commas. If there is a shortage of items to assign in the last iteration, the leftover variables will be assigned the empty (null) string.
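The multi-variable form is also handy for walking paired data. A small sketch (the names and values are just for illustration):

    for name,value in one 1 two 2 three 3; do
        echo "$name=$value"
    done

which prints 'one=1', 'two=2', and 'three=3'.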
Just like the 'case' statement, you often need to know if you exited the loop normally or through the use of 'break'. With my patch, you can add optional 'then' and 'else' sections at the end of 'for', 'while', and 'until' loops right after 'done'. The new syntax goes something like
    [for|while|until] ...; do
        ...
    done then
        COMMANDS
    else
        COMMANDS
    fi

    [for|while|until] ...; do
        ...
    done then
        COMMANDS
    fi

    [for|while|until] ...; do
        ...
    done else
        COMMANDS
    fi
For example,
    for i in 1 2 3; do
        echo $i
        break
    done then
        echo normal
    else
        echo used break         # 1
    fi

will print '1' only for the first iteration, then it will break out of the loop. But,
    for i in 1 2 3; do
        echo $i
    done then
        echo normal             # 1 2 3
    else
        echo used break
    fi

will print all items '1 2 3', and the exit condition will be normal. The same applies to 'while' and 'until' loops.
The ability to test the exit condition improves the readability of shell scripts, because you don't have to use a variable as a flag. Python has a similar mechanism for testing the exit condition of a loop, but it uses the return value of the test. So, a 'while' loop exits when the test fails, and Python uses 'else' for the normal exit condition, which is a bit confusing.
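For instance, the usual 'search' loop no longer needs a flag variable. A sketch, with hypothetical file names:

    for f in *.txt; do
        [[ $f == notes.txt ]] && break
    done then
        echo "notes.txt not found"      # loop ran to completion
    else
        echo "found notes.txt"          # loop exited via break
    fi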
Practically every modern language has the ability to raise an exception to break out of deeply nested code, to handle errors, or to do multi-point jumps. I added a new 'try' block to Bash which will catch integer exceptions raised by a new 'raise' builtin.
    try
        COMMANDS
    done in
        NUMBER [| NUMBER]... ) COMMANDS ;;
        ...
    esac

where 'done in' cannot be separated by ';' or newlines. Also, the patterns in the case-like statement must be explicit integer numbers.
This combines elements of loops, the break builtin and the case statement. Within a try-block, the 'raise' builtin can be used to raise an integer exception. Then, the execution will break out of the try block, just like 'break'ing out of for/until/while loops. You can use an optional case-like statement to catch the exception. If the exception is caught, then it will be reset and execution will continue following the try-block. If the exception is not caught, then execution will break out upward until it is caught or until there are no more try-blocks.
For example,
    try
        echo a
        while true; do          # infinite loop
            echo aa
            raise
            echo bb
        done
        echo b
    done

will print 'a aa', and
    try
        echo a
        raise 2
        echo b
    done in
        0) echo normal ;;
        1) echo raised one ;;
        2) echo raised two ;;           # raise 2
    esac

will print 'a' followed by 'raised two', since the exception raised is 2.
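Exceptions also propagate out of nested try-blocks until something catches them, as described above. A sketch:

    try
        try
            raise 3
        done in
            1) echo "caught 1 in the inner block" ;;
        esac
        echo "not reached"
    done in
        3) echo "caught 3 in the outer block" ;;
    esac

This prints only 'caught 3 in the outer block'.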
In the next article, I'll cover dynamically-loadable builtins related to arrays, regex splitting, interfacing to external libraries like an SQL database and an XML parser, and some interesting applications like HTML templates and a POP3 spam checker.
I learned Unix using the original Bourne shell. And, after my
journey through language wilderness, I have come full-circle
back to shell. Recently, I've been patching features into Bash,
giving other scripting languages a run for their money.
Slackware has been my primary distribution since the beginning,
because I can type. In my toolbox, I have Vim, Bash, Mutt, Tin,
TeX/LaTeX, Python, Awk, Sed. Even my shell command line is in
Vi-mode.