Audiobooks Galore

This article is a follow-up to Maxin B. John's article, which introduced us to the Festival text-to-speech synthesizer and some possible applications. Here, we will push it a bit further and see how we can convert ebooks from the most common formats like HTML, CHM, PS and PDF into audiobooks ready to send to your portable player.

The Why

With cheap, small portable MP3 players so widely available these days, it has become very convenient to listen to books and articles anywhere, at times when you would not necessarily be able to read them. Audiobooks usually require very low bit-rates, and hence very small files; as a consequence, they are the most suitable content for cheap, small-capacity MP3 players (128 MB or less).

There are lots of websites out there catering to audiobook needs, with a wide range of choices. However, you may really want to read an article or book that you found on the web as PDF or HTML, and there is probably no audio version of it available (yet). I will provide you with some scripts that will enable you to convert all your favorite texts into compressed audio files ready to upload and enjoy on your portable player. Here we go!

The Tools

archmage (CHM); it also requires Python

ps2ascii (PS and PDF) from the ghostscript-library package

Lynx (HTML)

text2wave from Festival

the lame MP3 encoder (we'll encode to MP3 since this is the most widely-supported format in hardware players)
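
Before going further, it can be handy to check which of these commands are actually on your PATH. The following is only a minimal sketch: the function name missing_tools is made up for this example, and the command names are simply the ones listed above.

```shell
#!/bin/sh
# Print the name of every command given as argument that is NOT found on $PATH.
# missing_tools is a hypothetical helper, not part of any of the packages above.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

missing=$(missing_tools archmage ps2ascii lynx text2wave lame)
if [ -n "$missing" ]; then
    # $missing is left unquoted on purpose so the names print on one line
    echo "Missing tools:" $missing
else
    echo "All tools found."
fi
```

Anything reported as missing should be installable through your distribution's package manager.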

Most of these tools are packaged in the main Linux distributions. Once you have all of the above installed, we can start the fun. We will begin with one of the most common formats for ebooks: Adobe PDF.

PostScript / Adobe PDF to MP3

#!/bin/sh -

chunks=200

if [ "$#" -eq 0 ]; then
        echo "Usage:  $0 [-a author] [-t title] [-l lines] <ps or pdf file>"
        exit 1
fi

while getopts "a:t:l:" option
do
case "$option" in
        a)author="$OPTARG";;
        t)title="$OPTARG";;
        l)chunks="$OPTARG";;
esac
done
shift $((OPTIND-1))

# Convert to plain text and split it into pieces of $chunks lines each
ps2ascii "$@" | split -l "$chunks" - tmpsplit
count=1
for i in tmpsplit*
do
        # Synthesize each piece and pipe the audio straight into lame
        text2wave "$i" | lame --ta "${author:-psmp3}" --tt "$count ${title:-psmp3}" \
		--tl "${title:-psmp3}" --tn "$count" --tg Speech --preset mw-us   \
		- "abook${count}.mp3"
        count=$((count + 1))
done
rm tmpsplit*

How it works

First, 'ps2ascii' converts the PDF or PostScript file to plain text. That text is then split into chunks of $chunks lines; you might have to tweak that value, since splitting the book into more than 255 files might cause trouble in some players (the ID3v1 track number tag can only go up to 255). After that, each chunk is processed by 'text2wave' and the resulting audio stream is sent directly to 'lame' through a pipe. The encoding is performed with the mw-us preset, which is mono ABR at an average of 40 kbps, sampled at 16 kHz. That should be enough, since Festival outputs a voice sampled at 16 kHz by default. You can leave it as it is, unless you are using a voice synthesizer with a different sampling rate. Refer to lame --preset help for optimum settings for different sampling rates.
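
To stay under the 255-file limit mentioned above, the chunk size can be worked out from the length of the text instead of guessed. This is just a sketch of the arithmetic: lines_per_chunk is a hypothetical helper, not part of ps2mp3, and it assumes you already know the total number of lines.

```shell
#!/bin/sh
# Smallest number of lines per chunk that keeps the chunk count at or below
# max_files (255 by default). Uses ceiling division:
#   (total + max_files - 1) / max_files
# lines_per_chunk is a hypothetical helper introduced for this example.
lines_per_chunk() {
    total=$1
    max_files=${2:-255}
    chunk=$(( (total + max_files - 1) / max_files ))
    if [ "$chunk" -lt 1 ]; then
        chunk=1
    fi
    echo "$chunk"
}

# A 60000-line text needs at least 236 lines per chunk to fit in 255 files.
lines_per_chunk 60000
```

The total can be obtained with something like ps2ascii my.pdf | wc -l, and the result passed to the script through its -l option.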

When you input the author or title, do not forget to quote the string if it includes spaces; for example:

ps2mp3 -a "This is the author" -t "This is the title" my.pdf

Next, we are going to see how to convert the most common format of all, HTML, into audio files.

HTML to MP3

#!/bin/sh -
#requires lynx, festival and lame

if [ "$#" -eq 0 ]; then
        echo "Usage: $0 [-a author] [-t title] <html file1> <html file2> ..."
        exit 1
fi

while getopts "a:t:" option
do
case "$option" in
        a)author="$OPTARG";;
        t)title="$OPTARG";;
esac
done
shift $((OPTIND-1))

count=1
for htmlfile in "$@"
do
        section=`expr match "${htmlfile##*/}" '\(.*\)\.htm'`
        lynx -dump -nolist "$htmlfile" | text2wave - | lame --ta "${author:-html2mp3}" \
		--tt "$count. ${section:-html2mp3}" --tl "${title:-html2mp3}"        \
		--tn "$count" --tg Speech --preset mw-us - "${section}.mp3"
        #rm /tmp/est_*
        count=$((count + 1))
done

How it works

The first part of the script, up to line 16, extracts the optional parameters from the command line. From line 19 on, we loop over the list of HTML files, that is, the remaining arguments given on the command line. On line 21, "${htmlfile##*/}" strips out everything up to and including the last "/" character, which is useful when we are dealing with URLs or directory paths, so only the filename remains. Then the '\(.*\)\.htm' regular expression takes care of the extension, so the variable 'section' holds only the stem of the filename. It is used to tag and name the resulting MP3 files.
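
The stem extraction can be tried on its own. The sketch below wraps it in a function called stem_of, a name made up for this example. Since 'expr match' is a GNU extension, it uses the equivalent POSIX form 'expr string : regex', which should behave the same way here; the paths and URL are only sample inputs.

```shell
#!/bin/sh
# Return the stem of an HTML file name: drop the directory (or URL) part,
# then everything from the last ".htm" onwards.
# stem_of is a hypothetical helper introduced for this example.
stem_of() {
    name=${1##*/}                  # strip up to and including the last "/"
    expr "$name" : '\(.*\)\.htm'   # print the part captured before ".htm"
}

stem_of /home/user/book/chapter1.html        # prints chapter1
stem_of http://example.com/docs/intro.htm    # prints intro
```

Note that when the name contains no ".htm" at all, expr prints an empty line, which is why the script falls back to a default value in the title tag.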

Line 22 is really the heart of the script: first, 'lynx' takes an HTML file as input and dumps its text to stdout. That output is piped to 'text2wave' and converted into a WAV-encoded stream, which is then piped to 'lame' to be encoded with the mw-us preset and ID3-tagged with the author, the title and the Speech genre.

Note that the script can also take URLs as arguments, since they are directly sent to lynx.

This html2mp3 script is going to be very useful for our next step, which is converting from CHM to MP3.

CHM files are a proprietary format developed by Microsoft, but basically they are just compiled HTML files with an index and a table of contents in one file. Their use as an ebook format is certainly not as widespread as HTML or PDF, but as you will see, it is pretty straightforward to convert them to audio files once you have the right tools.

CHM to MP3

#!/bin/sh -
#requires archmage and html2mp3

if [ "$#" -eq 0 ]; then
echo "Usage:"
echo "        $0 [-a author] [-t title] <chm file>"
exit 1
fi

while getopts "a:t:" o
do
case "$o" in
        a)author="$OPTARG";;
        t)title="$OPTARG";;
esac
done
shift $((OPTIND-1))

archmage "$1" tmpchm
find tmpchm -name "*.htm*" -exec html2mp3 -a "$author" -t "$title" {} \;

rm -fr tmpchm

How it works

archmage is a Python-based script that extracts HTML files from CHM. You will need to have Python installed to get it to run.

Unlike 'ps2mp3', 'chm2mp3' does not require an arbitrary decision on where to split the book: every page compiled into the CHM file becomes its own audio file. All we need to do is extract these pages with 'archmage' and convert them with 'html2mp3'.

We are using the find command to recursively search for HTML files in the CHM book that we extracted, since sometimes the HTML files are stored in subdirectories inside the CHM. Then, for each HTML file found, we call 'html2mp3'.
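
To see which files the find command would pick up without running a full conversion, it can be tried on a throwaway directory tree. In this sketch, echo stands in for 'html2mp3', and the directory layout is invented purely for illustration; it also assumes that mktemp is available.

```shell
#!/bin/sh
# Build a small tree that mimics an extracted CHM (layout is made up).
tmp=$(mktemp -d)
mkdir -p "$tmp/ch1" "$tmp/images"
: > "$tmp/index.html"
: > "$tmp/ch1/intro.htm"
: > "$tmp/images/cover.png"

# Same find expression as in chm2mp3, with echo in place of html2mp3.
# sort is added only to make the listing order predictable.
found=$(find "$tmp" -name "*.htm*" -exec echo {} \; | sort)
echo "$found"

rm -rf "$tmp"
```

Both HTML files are listed, including the one in the subdirectory, while the image is ignored.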

Timing

Remember that it can take a while to encode several dozen pages of text to speech and then to MP3. But you do not need to encode a full book to start uploading and enjoying it on your portable player.
