NAME

mod_perl notes -- A brief introduction to mod_perl


REVISIONS


Disclaimer

I'm no mod_perl (or Apache) expert and there are bound to be errors here. Let me know what they are so I can make the modifications. Also, if you have any questions please let me know or contact the mod_perl mailing list.


What is mod_perl?

mod_perl is an Apache module that embeds a Perl interpreter within the Apache web server. (Detailed discussion of Perl, Apache or web servers is generally beyond the scope of this document. Browse LINKS for general information.)

Embedding the interpreter allows you to write Apache handlers or modules entirely in Perl. You can even configure the server using Perl code. And all existing modules on CPAN (or elsewhere) are available to you during this process.

mod_perl can also greatly speed up pages created dynamically by CGI scripts or other means, because the Perl interpreter stays loaded in the server instead of being started afresh for every request.


The Apache web server

The core of the Apache web server is actually very minimal. Nearly all of its functionality is provided via modules. This makes it fairly painless (if you know what you're doing) to extend the web server to accomplish just about anything you can imagine.

For instance, a current module (mod_dav -- Apache modules are traditionally known as mod_<module name>) implements an IETF specification of Distributed Authoring and Versioning. (See LINKS for the specification.) DAV in its current version is a sort of web-enabled CVS, allowing many people to work on documents at the same time and have a centralized server manage changes. Microsoft has chosen to use DAV to allow Office 2000 users to create a sort of groupware. While their intention was to tie Office 2000 closely to their Exchange server product, mod_dav will enable Apache to serve them as well.

Other modules exist to enable Java servlets, authenticate via any number of methods (Samba, Kerberos, LDAP, RADIUS, MySQL database...), log requests to external databases, implement a version of the Cold Fusion markup language, utilize server-side Javascript and much more. See LINKS for a link to the complete listing in the module registry.


Apache modules

Apache implements modules through its API. (Find a link to the API in the LINKS section.) While the documentation is probably more easily understood if you program in C, the idea behind it is fairly straightforward.

The Apache API defines a series of phases that every request goes through. You can write a handler for any of these phases and have your program take care of that phase.

Here is a list of phases. This list is paraphrased from the document Apache API Notes by Robert Thau, which you can get from the URL in LINKS.

Note for the curious: the actual C module declaration looks like this:

  module cgi_module = {
     STANDARD_MODULE_STUFF,
     NULL,                     /* initializer */
     NULL,                     /* dir config creator */
     NULL,                     /* dir merger --- default is to override */
     make_cgi_server_config,   /* server config */
     merge_cgi_server_config,  /* merge server config */
     cgi_cmds,                 /* command table */
     cgi_handlers,             /* handlers */
     translate_scriptalias,    /* filename translation */
     NULL,                     /* check_user_id */
     NULL,                     /* check auth */
     NULL,                     /* check access */
     type_scriptalias,         /* type_checker */
     NULL,                     /* fixups */
     NULL,                     /* logger */
     NULL                      /* header parser */
  };

the important stuff being the comments to the right. The stuff I've skipped I haven't yet got :)

So if you wanted to log all requests for files in the /data directory to a special file, you'd write a handler that stepped in at the logging phase. You'd check the request directory and if it matched up to /data you'd write the request to your special file.
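As a rough illustration of that logging example, here is a minimal sketch in Perl. The package name Apache::DataLog, the log file path and the FakeRequest class are all made up for this sketch. FakeRequest only stands in for the request object that mod_perl would hand to the handler, and the two constants copy the values that Apache::Constants exports, so the code runs without a server.

```perl
#!/usr/bin/perl
use strict;
use warnings;

package Apache::DataLog;    # hypothetical name, not a CPAN module

# Under mod_perl these would come from Apache::Constants; they are
# defined here (with Apache 1.3's values) so the sketch runs standalone.
use constant OK       => 0;
use constant DECLINED => -1;

# Assumed location of the special log file.
our $LOG_FILE = '/tmp/data_requests.log';

# True if the URI is /data itself or anything beneath it.
sub wants_logging {
    my $uri = shift;
    return $uri =~ m{^/data(?:/|\z)} ? 1 : 0;
}

# Called at the logging phase with the request object.
sub handler {
    my $r   = shift;
    my $uri = $r->uri;
    return DECLINED unless wants_logging( $uri );

    # Append one line per matching request to the special file.
    open( my $fh, '>>', $LOG_FILE ) or return DECLINED;
    print {$fh} scalar( localtime ), " $uri\n";
    close $fh;
    return OK;
}

package FakeRequest;        # stand-in for mod_perl's request object

sub new { my ( $class, $uri ) = @_; return bless { uri => $uri }, $class }
sub uri { return $_[0]->{uri} }

package main;

# Show which URIs the handler would log, without touching the file.
for my $uri ( '/data/report.html', '/index.html' ) {
    my $verdict = Apache::DataLog::wants_logging( $uri ) ? 'logged' : 'skipped';
    print "$uri => $verdict\n";
}
```

Run on its own, this prints "/data/report.html => logged" and "/index.html => skipped". Under mod_perl the constants would be imported from Apache::Constants and the module would be hooked in with a PerlLogHandler line.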

Want to see what modules are compiled into your Apache? Type: httpd -l (or ./httpd -l as necessary). Doing so on my development system gives me:

  Compiled-in modules:
    http_core.c
    mod_env.c
    mod_log_config.c
    mod_mime.c
    mod_negotiation.c
    mod_include.c
    mod_autoindex.c
    mod_dir.c
    mod_cgi.c
    mod_asis.c
    mod_imap.c
    mod_actions.c
    mod_userdir.c
    mod_alias.c
    mod_access.c
    mod_auth.c
    mod_setenvif.c
    mod_auth_mysql.c
    mod_perl.c

Of this listing, only mod_auth_mysql (a module to allow authentication from a MySQL database) and mod_perl were added by me. The rest of them were included with Apache itself.

Note that dynamic modules, a feature of Apache 1.3 on both Unix and Win32 systems, will not appear in this listing since they are not compiled into the program itself.


Parents v. children

The Unix version of Apache uses a pre-forking model to handle many web requests at once. What this means is that on startup, the httpd program (known as the parent) opens the web server port (normally 80) and spawns a number of children, each of which is a fully functioning web server. The parent does not serve requests itself; it keeps the pool of children at the right size while the children accept requests on the shared port.

An important note is that each child is distinct from the other children -- sharing information directly among the children is generally a no-no. This becomes important when you think of a user hitting a website multiple times, say for browsing a database. The user may not get the same child process for each request. Therefore, we cannot store state information in a child process, so using a separate data store (such as some form of database) is necessary.
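To see why state kept in one child is invisible to the others, here is a small sketch using plain fork() on a Unix system. It has nothing to do with Apache itself; the variable name and the number of children are arbitrary choices for the demonstration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $counter = 0;    # each process gets its own copy of this after fork

for my $n ( 1 .. 2 ) {
    my $pid = fork();
    die "fork failed: $!" unless defined $pid;

    if ( $pid == 0 ) {
        # In the child: the change below never leaves this process.
        $counter++;
        print "child $n sees counter = $counter\n";
        exit 0;
    }

    # In the parent: wait for the child before starting the next one.
    waitpid( $pid, 0 );
}

print "parent still sees counter = $counter\n";
```

Both children report a counter of 1 and the parent still reports 0, because every process works on its own copy of the variable. The same thing happens to anything you stash in a Perl variable inside an httpd child.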


Installing mod_perl

You can find mod_perl on CPAN. Check the

  modules/by-module/Apache

subdirectory. You'll also find lots of other modules to use with mod_perl there.

The latest version is 1.16_02, although 1.17 will be out in the next few weeks. Apache 1.3 is strongly recommended for use with mod_perl, although I believe it will still work with 1.2.

mod_perl must be compiled along with Apache. (Kind souls compile Win32 versions from time to time.) Be sure to read the INSTALL file when you unpack the module. It gives you very detailed instructions on how to install mod_perl.

If you're just playing around and experimenting, I recommend you install support for all the phases listed above. To do so, start the process with:

  perl Makefile.PL EVERYTHING=1

Once you're done, you'll have a new httpd binary in the Apache source tree. Note the size: for a normal httpd you'd expect 400K or so; the size of a mod_perl httpd can run over 1MB.

Run your new httpd just as you would a normal Apache binary.

Note that the size of the new httpd while running can grow depending on the modules you're loading into memory. I routinely see httpd children in excess of 5 MB. Keeping 10-15 children around means that 50-75+ MB of memory is necessary just for the web server. If you run out of physical memory and start swapping to disk, you might as well kill the server and allocate fewer children because the performance will be awful. Installing more memory generally takes care of severe performance problems.


Working with mod_perl

mod_perl follows the same modular line of reasoning as Apache, allowing you to write handlers entirely in Perl. Its list of handlers seems to be longer than that for Apache on the whole, primarily because it allows you to run code when a child starts up and exits.

Note that the name of the mod_perl handler used when configuring the server is in (parentheses).


Modifying your .conf files

mod_perl is mostly controlled from your .conf files. I prefer to put everything in one file, httpd.conf, so I'll work from there.


Preloading modules

mod_perl allows you to load modules at parent startup so subsequent calls to load a module happen instantaneously since the module is already compiled and in memory.

 # Modules to load at startup.
  PerlModule    Apache::DBI
  PerlModule    Apache::AuthenDBI
  PerlModule    Apache::AuthzDBI
  PerlModule    CGI

You can load up to 10 modules this way. If you need more than 10, use the PerlRequire directive to include a file which itself uses the modules.
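A file for PerlRequire might look something like the sketch below. The file name startup.pl, its path and the preload() helper are my own assumptions; the module list is simply the one from the PerlModule lines above. Each module is loaded inside an eval so that a missing module produces a warning instead of stopping the whole file.

```perl
# startup.pl -- hypothetical file, pulled in with a line such as
#   PerlRequire /usr/local/httpd/conf/startup.pl
use strict;
use warnings;

# Try to load each named module once. Returns the names of the
# modules that loaded successfully.
sub preload {
    my @loaded;
    for my $module ( @_ ) {
        # Turn Some::Module into Some/Module.pm for require.
        ( my $file = "$module.pm" ) =~ s{::}{/}g;
        if ( eval { require $file; 1 } ) {
            push @loaded, $module;
        }
        else {
            warn "Could not preload $module: $@";
        }
    }
    return @loaded;
}

preload( qw( Apache::DBI Apache::AuthenDBI Apache::AuthzDBI CGI ) );

1;    # a required file must return a true value
```

A plain list of use statements would do the same job; the helper just makes it easy to add an eleventh or twelfth module on one line.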


Registering Handlers

You can set up a handler to step in at a particular stage of the request like this:

 # Takes care of URL rewriting
  PerlTransHandler      Apache::MySite_Redirect

The directive PerlTransHandler tells mod_perl that we want the package Apache::MySite_Redirect to handle URL rewriting. You can also specify a subroutine name:

 # Takes care of URL rewriting
  PerlTransHandler      Apache::MySite_Redirect::url_modify
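For a feel of what such a subroutine might contain, here is a minimal sketch. The /old/ to /new/ rule is invented for the example, FakeRequest is a stand-in for the request object mod_perl passes in, and DECLINED is defined locally with the value Apache::Constants uses so that the code runs without a server.

```perl
#!/usr/bin/perl
use strict;
use warnings;

package Apache::MySite_Redirect;

# Stand-in for the DECLINED constant from Apache::Constants
# (-1 in Apache 1.3), so the sketch runs without a server.
use constant DECLINED => -1;

# Hypothetical rule: anything under /old/ now lives under /new/.
sub url_modify {
    my $r   = shift;
    my $uri = $r->uri;
    if ( $uri =~ s{^/old/}{/new/} ) {
        $r->uri( $uri );    # store the rewritten URI in the request
    }
    # DECLINED lets Apache carry on and map the URI to a file.
    return DECLINED;
}

package FakeRequest;        # stand-in for mod_perl's request object

sub new { my ( $class, $uri ) = @_; return bless { uri => $uri }, $class }

# Getter and setter in one, like the uri method of the real object.
sub uri {
    my $self = shift;
    $self->{uri} = shift if @_;
    return $self->{uri};
}

package main;

my $r = FakeRequest->new( '/old/page.html' );
Apache::MySite_Redirect::url_modify( $r );
print $r->uri, "\n";
```

This prints /new/page.html. Returning DECLINED after changing the URI is the usual choice here, since the handler has only adjusted the request and still wants Apache to do the real filename translation.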

Here's another set of directives where we restrict the directive to a particular location.

 # Where we keep all the scripts to make up each
 # page. Apache::Registry should cache them, 
 # making them go lickety-split!
 <Location /page>
   SetHandler   perl-script
   PerlHandler  Apache::Registry
   Options      +ExecCGI
 </Location>

(We'll discuss Apache::Registry below.)


Other stuff

You can also place pieces of your configuration within <Perl>...</Perl> tags and mod_perl will execute the Perl code between them. I have not yet dived into this area so I can't say much.


Apache::Registry

Apache::Registry is a replacement for CGI that compiles your CGI scripts once and caches them in memory, making them run extremely fast, often approaching the speed of a static page request.

To enable Apache::Registry, put the following lines of code in your .conf file:

 Alias /cgi-bin /usr/local/httpd/cgi-bin/mysite
 <Location /cgi-bin>
   SetHandler   perl-script
   PerlHandler  Apache::Registry
   Options      +ExecCGI
 </Location>

mod_perl will then cache your CGI scripts in memory as it encounters them. This can yield a huge performance increase, but there are also a number of traps. CGI scripting can encourage messy programming -- since your program will only be around for one request, why bother using strict and similar checks? However, with mod_perl your program can be around for some time, so you can run into problems with incorrectly initialized variables, data structures that hang around past their lifetime, and so forth. The mod_perl documentation has some help on this issue.
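The variable trap is easy to reproduce without a web server. In the sketch below an anonymous subroutine plays the part of a cached script: it is compiled once and then called once per "request". The variable names are invented for the demonstration, and this only imitates the caching; it is not how Apache::Registry is actually written.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Stand-in for a cached script: compiled once, then called per request.
my $cached_script = sub {
    our $hits;          # package global: keeps its value between calls
    my  $rows = 0;      # lexical: initialised afresh on every call
    $hits++;
    $rows++;
    return "hits=$hits rows=$rows";
};

# Three "requests" served by the same compiled code.
print $cached_script->(), "\n" for 1 .. 3;
```

The output is hits=1 rows=1, then hits=2 rows=1, then hits=3 rows=1. As a one-shot CGI script the global would start empty on every run; once the code stays in memory, it quietly carries its old value into the next request unless you initialise it yourself.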


Using DBI with mod_perl

A module Apache::DBI exists to cache database connections on a per-child basis. As mentioned earlier, sharing information (including a database connection) among the children can be difficult, to say the least.

So upon child process startup, this module will register itself with mod_perl. Any successive calls to DBI's connect method will get re-routed to Apache::DBI, which maintains a series of database connections. Each connection is distinguished by its connect arguments (the data source name, or DSN -- generally the driver name combined with the database you're connecting to -- together with the username and password), so when a call comes in with the same arguments Apache::DBI doesn't bother making the actual connection but instead hands off the already established connection.

Everything else should work exactly the same. You should ensure that on busy websites your database can handle the number of connections this can generate.
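The caching idea itself can be sketched in a few lines. The ConnectionCache package below is hypothetical and is not Apache::DBI's real code; it fakes the expensive connection with a counter, and I am assuming the cache key is built from the DSN, username and password. It leaves out refinements such as checking that a cached connection is still alive.

```perl
#!/usr/bin/perl
use strict;
use warnings;

package ConnectionCache;    # hypothetical; not Apache::DBI's real code

my %handles;                # one cache per process, so one per child
my $real_connects = 0;

# Stand-in for an expensive DBI->connect call.
sub _real_connect {
    my ( $dsn, $user ) = @_;
    $real_connects++;
    return { dsn => $dsn, user => $user, id => $real_connects };
}

# Return the cached handle for these arguments, connecting only
# the first time they are seen.
sub connect {
    my ( $class, $dsn, $user, $password ) = @_;
    my $key = join "\0", $dsn, $user, $password;
    $handles{$key} ||= _real_connect( $dsn, $user );
    return $handles{$key};
}

# How many times the expensive connection was really made.
sub real_connects { return $real_connects }

package main;

ConnectionCache->connect( 'dbi:mysql:CTAA', 'myuser', 'mypass' ) for 1 .. 5;
ConnectionCache->connect( 'dbi:mysql:WebStuff', 'myuser', 'mypass' );
print ConnectionCache::real_connects(), " real connections for 6 calls\n";
```

This prints "2 real connections for 6 calls". Since the cache lives in ordinary Perl variables, every httpd child builds its own, which is why the total number of open database connections grows with the number of children.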


Small Example

One of the sites we host is template driven, with the different pieces of a page determined by codes placed into the HTML.

It's not anything earth-shattering, but will hopefully give you an idea of what mod_perl can do.

You can view the site at:

   http://www.ctaa.org/

Every page uses server-side includes. A server-side include is a snippet of HTML which the server parses and replaces with other information -- the user only sees the text the server puts in place of the SSI directive. Examples include a last-modified date, a common item of HTML included in many pages (e.g., navigation bar) or a hit counting program.


The Guts

A number of modules exist so I can call the routines either from a CGI script or from an HTML page. Most routines are in a module called CTAA::PagePieces. The SSI directives used to call the routines look like this:

  <!--#perl sub="Apache::Include" arg="/page/page_side_menu.pl"-->

The .pl files in the /page location are just stubs to parse through the environment variables and call the routines in CTAA::PagePieces. I included the option to get the menu, area and URI from elsewhere for testing purposes.

 #!/usr/bin/perl
 use strict;
 use CTAA::PagePieces;
 { 
  my $current_menu = lc $ENV{CTAA_MENU} || shift @ARGV;
  my $current_area = lc $ENV{CTAA_LOC}  || shift @ARGV;
  my $current_uri  = lc $ENV{DOCUMENT_URI} || shift @ARGV;
  print CTAA::PagePieces::show_side_menu( $current_menu, 
                                          $current_area, 
                                          $current_uri );
 }


Authentication and authorization

We've set up several directories that require authentication. We use the modules Apache::AuthenDBI and Apache::AuthzDBI to authenticate and authorize users from a MySQL database.

The .conf code looks like this:

Simple Authentication

 # Authentication for CTAA Services
 <Location /cgi-bin/valid>
   AuthName "CTAA Services"
   AuthType Basic
   PerlAuthenHandler    Apache::AuthenDBI
   PerlSetVar Auth_DBI_data_source      'dbi:mysql:CTAA'
   PerlSetVar Auth_DBI_username         'myuser'
   PerlSetVar Auth_DBI_password         'mypass'
   PerlSetVar Auth_DBI_pwd_table        'Users'
   PerlSetVar Auth_DBI_uid_field        'Username'
   PerlSetVar Auth_DBI_pwd_field        'Password'
   require valid-user

 # If they get an authorization required error,
 # direct users to the User Registration page.
   ErrorDocument 401 /cgi-bin/users.cgi?Action=BadLogin
 </Location>

Authentication with Authorization

 # Same authentication as above, but
 # we add an AuthzHandler which ensures
 # that the user is a member of one or
 # more groups who are able to access
 # the CTAA Admin stuff.
 #
 # Note that this should match both the /cgi-bin/admin
 # and /admin URLs (as well as /ct/admin , /ntrc/admin ,
 # etc.)
 <LocationMatch "admin">
   Options +ExecCGI
   DirectoryIndex home.shtml home.html home.htm
   AuthName "CTAA Administration"
   AuthType Basic
   PerlAuthenHandler    Apache::AuthenDBI
   PerlAuthzHandler     Apache::AuthzDBI
   PerlSetVar Auth_DBI_data_source      'dbi:mysql:CTAA'
   PerlSetVar Auth_DBI_username         'myuser'
   PerlSetVar Auth_DBI_password         'mypass'
   PerlSetVar Auth_DBI_pwd_table        'Users'
   PerlSetVar Auth_DBI_uid_field        'Username'
   PerlSetVar Auth_DBI_pwd_field        'Password'
   PerlSetVar Auth_DBI_grp_table        'UsersGroups'
   PerlSetVar Auth_DBI_grp_field        'Groupname'
   require group webadmin
   ErrorDocument 401 /admin_only.shtml
 </LocationMatch>

The documentation for the authentication/authorization modules tells you which variables you need to set via the PerlSetVar directive.


Logging

Every virtual host has configuration lines like this:

 # Log our html files to the database (neat!)
  PerlSetVar     INTES_VHOST   'www.ctaa.org'
  PerlLogHandler               Apache::INTES_LogDBI

The logging routine reads the variable INTES_VHOST and modifies its entry to the database accordingly. Here's the actual module -- shamelessly swiped from Lincoln Stein:

  package Apache::INTES_LogDBI;
  use Apache::Constants ':common'; 
  use strict;
  use vars qw/ 
    $dbh $sth
  /;
  use DBI;
  use POSIX 'strftime';
  my $DSN       = 'DBI:mysql:WebStuff';
  my $db_user   = 'myuser';
  my $db_passwd = 'mypass';
  my $log_table = 'WebLogs';
  $dbh = DBI->connect( $DSN, $db_user, $db_passwd );
  my $sql = qq/
   INSERT INTO $log_table
   VALUES ( ?,?,?,?,?,
            ?,?,?,?,? )
  /;
  $sth = $dbh->prepare( $sql );
 
  # Runs at the logging phase, once per request.
  sub handler {
   my $r = shift;
   my $url     = $r->uri;
   # Only log HTML pages; decline everything else.
   return DECLINED if ( $url !~ /htm(l)?$/ );
   my $date    = strftime( '%Y-%m-%d %H:%M:%S', localtime );
   my $host    = $r->get_remote_host;
   my $method  = $r->method;
   my $user    = $r->connection->user;
   my $referer = $r->header_in( 'Referer' );
   my $browser = $r->header_in( 'User-agent' );
   my $status  = $r->status;
   my $bytes   = $r->bytes_sent;
   # Set per virtual host with PerlSetVar in the .conf file.
   my $vhost   = $r->dir_config( 'INTES_VHOST' );
   # The values must be in the same order as the columns of the log table.
   $sth->execute( $date, $host, $method, $url, $user,
                  $browser, $referer, $status, $bytes, $vhost );
   return OK;
  }
  1;


LINKS

Here is some online information to help out.


AUTHOR

   Chris Winters
   cwinters@intes.net