PostgreSQL - import of large texts and dynamic content
Introduction
PostgreSQL is an enterprise-level database. It is open-source software that competes on a number of features with products from Oracle, so it is no surprise that more and more projects delegate important duties like data mining and data handling to PostgreSQL. What's more, its architectural design is extremely powerful and follows the KISS principle closely, so PostgreSQL internal programming, as well as day-to-day maintenance, is genuinely enjoyable.
The task
One of the projects I take care of is based on PostgreSQL and Liferay Portal, which in turn runs on Tomcat. It has a classic 3-tier architecture, where Tomcat works as the web server, i.e. the front end that serves requests from around the world. Liferay's web pages can be (a) plain HTML, (b) JSP (Java Server Pages), or (c) programmed as servlets (portlets). The latter two options require an IDE (Integrated Development Environment) with JSP, portlet and JDBC bindings deployed. In essence, the JSP or portlet contains code that fetches the actual SQL data from a database instance (for example, news_lportal) and renders a lovely HTML page showing today's weather forecast or currency rates.

However, you might want to generate the same page without the time-consuming effort of downloading and installing an IDE, programming a new servlet and deploying it. How can that be done? Simply execute the necessary SQL requests in the back yard, i.e. within the operating system where the Tomcat and PostgreSQL servers reside. You can program it in 10 minutes in bash, Python or any other scripting language. In my case I generate an HTML page consisting of a thousand lines of text and push it back into Liferay's CMS database (news_lportal), so that the HTML content of this page is displayed by Liferay itself. I also scheduled via cron how often this information page is regenerated, so Liferay always shows up-to-date news, rates, and so on.
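Just to illustrate the idea before going into details (psql, the PostgreSQL console client, is covered in the next section), such a generator can be a few lines of shell. The table and column names below are made up for the example and are not part of Liferay's schema:

#!/bin/sh
# Hypothetical sketch: fetch rows and wrap them into an HTML table fragment.
# "currency_rates" and its columns are placeholders for whatever data you publish.
psql -t -A -F'</td><td>' -h 127.0.0.1 news_lportal -U root \
     -c "select currency, rate from currency_rates" |
while read row; do
    echo "<tr><td>${row}</td></tr>"
done > rates.html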
Data manipulation with psql
PostgreSQL ships with a native console client called psql. Although psql is a console-only application, it has essentially the same capabilities as its GUI counterpart, the GTK-based PgAdmin. If it is not installed on your system, run aptitude (on Debian):
# aptitude install postgresql-client
Reading package lists... Done
Building dependency tree
Reading state information... Done
Reading extended state information
Initializing package states... Done
Reading task descriptions... Done
The following NEW packages will be installed:
  postgresql-client postgresql-client-8.3{a} postgresql-client-common{a}
0 packages upgraded, 3 newly installed, 0 to remove and 6 not upgraded.
Need to get 2028kB of archives. After unpacking 5276kB will be used.
Do you want to continue? [Y/n/?]
Listing 1. The postgresql-client package provides psql as well as other auxiliary utilities
It will install psql, as well as pg* utilities (pg_dump, pg_restore and others).
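To make sure the client tools are really in place and can reach the server, a couple of quick checks are enough (using the same host and account as in the examples below):

$ psql --version
$ psql -h 127.0.0.1 -U root -l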
Of course, you can also install a GUI application for performing complex tasks, like data analysis:
# aptitude install pgadmin3
Figure 1. PgAdmin - graphical application for handling SQL-queries
With the help of psql you can quite easily run any SQL-statement, like this:
psql -q -n -h 127.0.0.1 news_lportal -U root -c "select userid, emailaddress from user_"
where:
 - the host to connect to: 127.0.0.1
 - the desired database within the PostgreSQL pool: news_lportal
 - the username granted to execute the SQL command: root
 - the SQL command itself: select userid, emailaddress from user_
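The same invocation also works nicely inside shell scripts; a minimal sketch (same database and credentials assumed) that captures the query result in a variable looks like this:

#!/bin/sh
# -t suppresses headers and footers, -A turns off column alignment,
# so the variable holds just the raw values, one per line
emails=$(psql -t -A -h 127.0.0.1 news_lportal -U root \
         -c "select emailaddress from user_")
echo "$emails"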
Alternatively, you can run psql with an UPDATE statement, like this:
psql -q -n -h 127.0.0.1 news_lportal -U root -c "update journalarticle set content = '<H1>Hello, World!</H1>' where id_ = 24326"
Here 24326 is the ID of my HTML document, previously created by the CMS engine on top of Liferay and stored inside the PostgreSQL database news_lportal.
In this way you can refresh any information stored in the journalarticle table. The only thing you need to remember is the correct ID of your article.
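If you are not sure which id_ belongs to your article, the journalarticle table also carries columns such as articleid, version and title (the exact set varies between Liferay versions), so a query along these lines helps to locate the right row:

psql -q -n -h 127.0.0.1 news_lportal -U root -c "select id_, articleid, version, title from journalarticle order by id_"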
However, in real life this update trick does not work as smoothly as it should. I prepared an update script (import_table.sh) that is supposed to upload the contents of the file table_news.html into PostgreSQL:
#!/bin/sh

ct=`cat table_news.html`
psql -t -l -q -n -h 127.0.0.1 news_lportal -U root -c "update journalarticle set content = '$ct' where id_ = 24326;"
Listing 2. Very simple import script (import_table.sh), first version
Argh! It didn't work out - the PostgreSQL client refused to run the update command:
$ ./import_table.sh
./import_table.sh: line 4: /usr/bin/psql: Argument list too long
At first glance, the file table_news.html looks perfectly fine. A closer look reveals the real catch - the file is a bit too large, about 400 KB in size.
$ file table_news.html
table_news.html: UTF-8 Unicode text, with very long lines
$ cat table_news.html | wc
    617    2505  408460
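The "Argument list too long" message comes from the kernel rather than from psql: the expanded command line has to fit within the limits enforced by execve(), and on typical Linux kernels a single argument may not exceed roughly 128 KB, which a 400 KB string clearly does. You can compare the overall limit with the file size yourself:

$ getconf ARG_MAX
$ wc -c table_news.html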
Is there a mechanism to load a text file that is too large to pass on the command line into the database? Yes! Luckily, PostgreSQL provides server-side file access functions that make this kind of file I/O easy. Let's declare our own procedure get_text_document_portal() that will load any text file into the database.
-- Function: get_text_document_portal(character varying)
-- DROP FUNCTION get_text_document_portal(character varying);

CREATE OR REPLACE FUNCTION get_text_document_portal(p_filename character varying)
  RETURNS text AS
$BODY$
  SELECT CAST(pg_read_file(E'liferay_import/' || $1, 0, 100000000) AS TEXT);
$BODY$
  LANGUAGE sql VOLATILE SECURITY DEFINER
  COST 100;
ALTER FUNCTION get_text_document_portal(character varying) OWNER TO postgres;
Listing 3. Our new procedure calls the pg_read_file() function to read a text file from disk
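Before wiring the function into any script, it can be exercised straight from psql; assuming a copy of table.text is already sitting in the liferay_import directory, the call below should return the file's length in characters:

psql -h 127.0.0.1 news_lportal -U root -c "select length(get_text_document_portal('table.text'))"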
In order to load a text file into the database news_lportal, I have written the script below (import_table_2.sh). It passes a filename - in this example, table.text - as the parameter to the get_text_document_portal() procedure and places its contents into the corresponding field of the journalarticle table.
#!/bin/sh
psql -q -n -h 127.0.0.1 news_lportal -U root -c "update journalarticle set content = get_text_document_portal('table.text') where id_ = 24326;"
Listing 4. Import script (import_table_2.sh) that triggers our new pgsql-procedure
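After the script has run, a quick sanity check (my own habit, not strictly required) is to confirm that the stored article really changed, for instance by looking at the length of its content field:

psql -t -h 127.0.0.1 news_lportal -U root -c "select length(content) from journalarticle where id_ = 24326"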
All you need to do is change the source HTML file, named table.text, and run import_table_2.sh. Pay attention to the location where the imported file must be placed - the liferay_import subdirectory under the /var/lib/postgresql/8.3/main/ tree.
$ ls -l /var/lib/postgresql/8.3/main/
total 48
-rw-------  1 postgres postgres    4 Nov  9 10:20 PG_VERSION
drwx------ 10 postgres postgres 4096 Nov 10 11:16 base
drwx------  2 postgres postgres 4096 Mar  4 16:44 global
drwx------  2 postgres postgres 4096 Dec  3 18:27 liferay_import
drwx------  2 postgres postgres 4096 Nov  9 10:20 pg_clog
drwx------  4 postgres postgres 4096 Nov  9 10:20 pg_multixact
drwx------  2 postgres postgres 4096 Mar  1 13:29 pg_subtrans
drwx------  2 postgres postgres 4096 Nov  9 10:20 pg_tblspc
drwx------  2 postgres postgres 4096 Nov  9 10:20 pg_twophase
drwx------  3 postgres postgres 4096 Mar  4 12:43 pg_xlog
-rw-------  1 postgres postgres  133 Feb 11 22:09 postmaster.opts
-rw-------  1 postgres postgres   53 Feb 11 22:09 postmaster.pid
lrwxrwxrwx  1 root     root       31 Nov  9 10:20 root.crt -> /etc/postgresql-common/root.crt
lrwxrwxrwx  1 root     root       36 Nov  9 10:20 server.crt -> /etc/ssl/certs/ssl-cert-snakeoil.pem
lrwxrwxrwx  1 root     root       38 Nov  9 10:20 server.key -> /etc/ssl/private/ssl-cert-snakeoil.key
Listing 5. Ownership information for the PostgreSQL data directory
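If the liferay_import subdirectory does not yet exist on your installation, it has to be created inside the data directory with postgres ownership first; something along these lines (run as root, path adjusted to your PostgreSQL version) will do:

# install -d -o postgres -g postgres -m 700 /var/lib/postgresql/8.3/main/liferay_import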
The liferay_import directory is owned by postgres and can be written to only by that user (or by root). Of course, you could add an entry to root's crontab, but good practice is to split jobs between different accounts: assign database jobs to postgres only, and entrust every other task to, for instance, the tomcat account. So how can the tomcat user write to the liferay_import directory when its access bits allow only the postgres owner? By making a link - a symlink doesn't work here, but a hard link does!
# ln /var/lib/postgresql/8.3/main/liferay_import/table.text /home/tomcat/db/table.text
Listing 6. A hard link gets around the ownership restrictions where a symlink would not
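A hard link is simply a second directory entry pointing at the same inode, so both paths report the same inode number and a link count of 2 - easy to verify:

$ ls -li /var/lib/postgresql/8.3/main/liferay_import/table.text /home/tomcat/db/table.text

Note that the script below overwrites table.text with plain > redirection, which truncates the file in place and therefore keeps the inode - and with it the hard link - intact; replacing the file by renaming a temporary copy over it would silently break the link.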
#!/bin/sh
/home/tomcat/db/prepare_table_news.sh > /home/tomcat/db/table.text
/home/tomcat/db/import_table_2.sh
Listing 7. Script (mk_db.sh) that prepares an arbitrary HTML document and loads it into the database
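A slightly more defensive variant (my own suggestion, not part of the original setup) aborts when the generation step fails, so that a half-written page is never imported:

#!/bin/sh
# stop at the first failing command
set -e
# build the page into a temporary file first
/home/tomcat/db/prepare_table_news.sh > /home/tomcat/db/table.text.tmp
# copy over the existing file with cat, which truncates it in place
# and preserves the inode shared with liferay_import (the hard link)
cat /home/tomcat/db/table.text.tmp > /home/tomcat/db/table.text
rm /home/tomcat/db/table.text.tmp
/home/tomcat/db/import_table_2.sh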
Hooray! Now I can place an entry in tomcat's crontab and have the news information updated every hour - all from the tomcat account. Really nice.
$ crontab -l
# m h  dom mon dow   command
0 * * * * /home/tomcat/db/mk_db.sh > /dev/null
Listing 8. A single entry in tomcat's crontab, executed every hour to update the news
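If the regeneration ever takes longer than an hour, two runs could overlap; a common precaution (optional, and not part of the setup above) is to serialise the job with flock and to keep errors in a log file instead of discarding them:

0 * * * * flock -n /tmp/mk_db.lock /home/tomcat/db/mk_db.sh >> /home/tomcat/db/mk_db.log 2>&1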
Conclusion
There are different approaches to providing up-to-date information when you deal with Liferay Portal and portlet technology. The first requires a dedicated development environment to be preinstalled (NetBeans IDE with portlet bindings), while the other needs only basic shell-scripting knowledge and the ability to construct correct SQL queries. Of course, the better way is a good full-time developer, hands on with the IDE and the JSR 168 / JSR 286 portlet standards, who can program whatever web application you need, including an HTML page with dynamically changing information. However, you can achieve much the same result far more quickly by relying on everyday Linux console tools.
Resources
[1] http://www.postgresql.org/
[2] "Practical PostgreSQL" - Joshua D. Drake, John C. Worsley, O'Reilly Media
Anton jumped into the Linux world in 1997, when he first tried the tiny muLinux distribution, run from a single floppy. Later on, Red Hat and Slackware became his favourite choices. Nowadays, Anton designs Linux-oriented applications and middleware, and prefers to work with hardware labeled as "Powered by Linux".