Regular Expressions and Special Variables

Special Variables

$_ is the “default input and pattern matching” variable; the default input is often the current line of a file

@_ is the list of incoming parameters to a subroutine

See Well House Consultants

$. is the current input line number (for the filehandle most recently read)

$$ is the current process ID

$^O is the operating system (that’s an “Oh”)

$#_ is the index number of the last parameter
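
A minimal sketch of these variables in action. The sub name count_lines and the in-memory filehandle are assumptions made for illustration; they are not part of the notes above.

```perl
use strict;
use warnings;

# count_lines is a hypothetical helper: it reads text through an
# in-memory filehandle and returns the last value $. reached.
sub count_lines {
    my ($text) = @_;                 # @_ holds the incoming parameters
    open(my $fh, '<', \$text) or die "Cannot open in-memory handle: $!";
    my $last_line = 0;
    while (<$fh>) {                  # each line lands in $_
        $last_line = $.;             # $. is the current input line number
    }
    close $fh;
    return $last_line;
}

print "Lines read: ", count_lines("one\ntwo\nthree\n"), "\n";
print "Process ID: ", $$, "\n";      # $$ is the current process ID
print "Running on: ", $^O, "\n";     # $^O names the operating system
```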

A basic pattern-matching loop

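while ($my_var = <MY_FILE_HANDLE>) {

    if ($my_var =~ /search_pattern/) {

        # Notice that =~
        # It's the "search/match" operator

        # The / / characters mark the start and end of the pattern.
        # MY_OUTPUT_HANDLE is a second handle, assumed to be already
        # open for writing; you can't print to the handle you read from.
        print MY_OUTPUT_HANDLE $my_var;
        # We just printed the matching line to the output file

    }
    # Alternately, just dump the line:
    print $my_var;

}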

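If you run this loop without naming a loop variable, each line is read into $_, and both the match operator and print use $_ by default:

while (<MY_FILE_HANDLE>) {

    # MY_OUTPUT_HANDLE is assumed to be a separate handle open for writing
    /search_pattern/ and print MY_OUTPUT_HANDLE ;

    print ;
}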

@_ # The array of incoming parameters supplied to a subroutine.

@_
# The whole array

$_[0]
# The first element of the array (a single element takes the $ sigil)

$_[1]
# The second element

$#_
# This one's odd: it's the index of the last element (which is not quite the same as the count, because this is a zero-based array).

sub call_me {
    print "Element zero is " . $_[0] . "\n";
    print "There were " , $#_+1 , " parameters\n";
}

use English;

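English is a core module (strictly speaking a module rather than a pragma) that gives the punctuation variables readable names.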

Allows addressing @_ as @ARG

Allows addressing $$ as $PID
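
A short sketch of the English names in use. The sub name first_arg is hypothetical, and the -no_match_vars import is an optional choice made here to skip the match-variable aliases, not something the notes above call for.

```perl
use strict;
use warnings;
use English qw(-no_match_vars);

# first_arg is a hypothetical sub, named here only for illustration.
sub first_arg {
    return $ARG[0];                  # @ARG is @_, so this is $_[0]
}

print "First argument: ", first_arg('alpha', 'beta'), "\n";
print "Process ID: ", $PID, "\n";    # $PID is $$
```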

 

Regular Expressions

Regular expressions are why we learned about =~.

$my_string = "I'm hard at work.\n";

if ($my_string =~ /work/) {
    print "He's working.\n";
}

Metacharacters

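\n
# matches a newline

\t
# matches a tab

\d
# matches any single digit

\w
# matches any single letter, digit or the underscore

\s
# matches any single whitespace character: space, tab, \n, \r

Capitalize \d, \w or \s to invert its meaning: \D matches any non-digit, \W any non-word character and \S any non-whitespace character.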
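
A small sketch that puts the three classes and their capitalized opposites to work. The helper name classify and the sample characters are made up for illustration.

```perl
use strict;
use warnings;

# classify is a hypothetical helper that names the class of one character.
sub classify {
    my ($char) = @_;
    return 'digit'      if $char =~ /\d/;
    return 'word'       if $char =~ /\w/;   # letter or underscore; digits were caught above
    return 'whitespace' if $char =~ /\s/;
    return 'other';
}

for my $char ('7', 'x', '_', "\t", '-') {
    my $label = $char eq "\t" ? '\t' : $char;   # show the tab visibly
    printf "%-3s is %s\n", $label, classify($char);
}

# The capitalized forms invert the match
print "'abc' starts with a non-digit\n" if 'abc' =~ /^\D/;
print "'a b' contains a non-word character\n" if 'a b' =~ /\W/;
```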

^
# Beginning of line or string
/^string/

$
# End of line or string
/string$/

.
# Generic wildcard character: matches any ONE character
# so /x.z/ matches x1z, xSz, x-z, etc.
/str.ng/

*
# Preceding character match: matches the preceding character ZERO OR MORE TIMES
# so /s*ring/ matches ring, sring, ssring, etc.
/s*ring/

One use of * is with the dot character, when any number of any characters could appear at that position:
/this.*/
Matches “this followed by anything.”

+
# Preceding character match: matches the preceding character ONE OR MORE TIMES
/xy+z/

?
# Preceding character match: matches the preceding character ZERO OR ONE TIMES
/xy?z/

Create groups of optional characters with parentheses:

/Fred(die)?/
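
The quantifiers above can be checked against a few strings. This is only a sketch: the sample strings are made up, and the ^ and $ anchors on the Fred pattern are added so the optional group is tested against the whole string.

```perl
use strict;
use warnings;

# Each entry is [pattern, candidate string].
my @checks = (
    [ qr/xy+z/,         'xz'      ],   # no match: + needs at least one y
    [ qr/xy+z/,         'xyyyz'   ],   # match
    [ qr/xy?z/,         'xz'      ],   # match: ? allows zero y
    [ qr/xy?z/,         'xyyz'    ],   # no match: ? allows at most one y
    [ qr/^Fred(die)?$/, 'Freddie' ],   # match, using the optional group
    [ qr/^Fred(die)?$/, 'Fredd'   ],   # no match: the group is all or nothing
);

for my $check (@checks) {
    my ($pattern, $string) = @$check;
    my $result = $string =~ $pattern ? 'matches' : 'no match';
    printf "%-8s %s\n", $string, $result;
}
```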

 

Combining expressions

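/^http:.+html?/
# Matches a string that starts with "http:", then one or more of any character, then "htm" or "html"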

 

Character Classes

/[qwerty]/
# matches any of q, w, e, r, t or y

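/[^qwerty]/
# matches any ONE character that is NOT q, w, e, r, t or y
# Be darn careful where that ^ is: it negates the class only as the
# first character inside the brackets. Outside the brackets it anchors
# the match to the beginning of the line.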

 

Flags

/string/i
# case-insensitive

s/match_string/replacement_string/g
# search; replace; global (every occurrence, not just the first)
# On a plain match (m//g) in scalar context, g also tells Perl to
# resume from its last position in the string on the next iteration
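
A sketch of the "last position" behavior of g on a plain match, using Perl's built-in pos() function. The sample strings are made up.

```perl
use strict;
use warnings;

my $text = "cat, bobcat, catalog";
my @positions;

# In scalar context, each m//g match resumes where the previous one ended.
while ($text =~ /cat/g) {
    push @positions, pos($text);     # pos() is the offset just past the match
}
print "Matches end at offsets: @positions\n";

# i ignores case; with g in list context the match returns every hit.
my @found = ("Cat cAT dog" =~ /cat/gi);
print "Found ", scalar(@found), " case-insensitive matches\n";
```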

 

Subexpressions

if ($_ =~ /(heck|darn|dang|fooey)/) {

    print "This mild cussing is present in this line: $1.\n";
}

The parentheses create a subexpression, and the $1 variable holds the text that the first subexpression matched. If the match was “heck” then $1 = “heck”. Without the parentheses, $1 is not set. If you have two subexpressions, you’ll have $1 and $2, and so forth:

$singer = "Wendy Wall";
$singer =~ /(\w+) (\w+)/;
# $1 holds "Wendy" and $2 holds "Wall"
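
Captures are only meaningful after a successful match, so it helps to test the match before reading $1. A minimal sketch, with the hypothetical helpers find_cuss and split_name:

```perl
use strict;
use warnings;

# find_cuss is a hypothetical helper: it returns the captured word,
# or undef when nothing matched.
sub find_cuss {
    my ($line) = @_;
    # The parentheses turn the alternation into a capture group for $1.
    return $line =~ /(heck|darn|dang|fooey)/ ? $1 : undef;
}

# split_name is also hypothetical: in list context a match returns
# its captures directly, so $1 and $2 need not be read by hand.
sub split_name {
    my ($full_name) = @_;
    my ($first, $last) = $full_name =~ /^(\w+) (\w+)$/;
    return ($first, $last);
}

my ($first, $last) = split_name("Wendy Wall");
print "First: $first, last: $last\n";
print "Found: ", find_cuss("Well, darn it"), "\n";
```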

 

Search and Replace

$sentence = "This is the usual cat and dog example. It mentions two cats.";
$sentence =~ s/cat/dog/g;
print $sentence;
# Prints: This is the usual dog and dog example. It mentions two dogs.
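
Two related details, sketched under the assumption of Perl 5.14 or later for the r flag: s/// returns the number of replacements it made, and r returns a changed copy while leaving the original alone. The string "cat nap" is a made-up sample.

```perl
use strict;
use warnings;

my $sentence = "This is the usual cat and dog example. It mentions two cats.";

# s///g returns the number of substitutions it made
my $count = ($sentence =~ s/cat/dog/g);
print "Replaced $count occurrences\n";

# The r flag (Perl 5.14 and later) returns a changed copy instead
my $original = "cat nap";
my $changed  = $original =~ s/cat/dog/r;
print "$original -> $changed\n";
```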

 

Resources

The formal Perldoc – http://perldoc.perl.org/perlre.html#Regular-Expressions

Perl Matching With Regular Expressions – a long page with very good detail – http://work.lauralemay.com/samples/perl.html

Troubleshooters.com – with many good examples – http://www.troubleshooters.com/codecorn/littperl/perlreg.htm

Ringofsaturn.com – with more, and detailed, examples – http://networking.ringofsaturn.com/Unix/regex.php

Regular Expression Reference – useful, concise and highly recommended – http://www.regular-expressions.info/reference.html

 

A line from an Apache log file looks like this:

 132.62.20.9 - - [01/Nov/2000:00:00:19 -0400] "GET /news/home/index.htm HTTP/1.1" 200 2285

So let’s hack on this analyzer, called analyze.pl:

#!/usr/bin/perl
use strict;
use warnings;

# Counters filled in by analyze() and printed by report()
my (%type_requests, %hour_requests, %url_requests, %status_requests);
my $bytes = 0;

# We have to supply the log name as the first command argument
my $logfile = $ARGV[0];

unless ($logfile) { die "Usage: analyze.pl <httpd log file>\n"; }

analyze($logfile);
report();

sub analyze {
    my ($logfile) = @_;

    open(my $log, '<', $logfile) or die "Could not open log $logfile - $!";

    while (my $line = <$log>) {
        my @fields = split(/\s/, $line);

        # Skip lines that don't have all ten fields
        next unless @fields >= 10;

        # Make /about/ and /about/index.html the same URL.
        $fields[6] =~ s{/$}{/index.html};

        # Log successful requests by file type. URLs without an extension
        # are assumed to be text files.
        if ($fields[8] eq '200') {
            if ($fields[6] =~ /\.([a-z]+)$/i) {
                $type_requests{$1}++;
            } else {
                $type_requests{'txt'}++;
            }
        }

        # Log the hour of this request. Check that the match succeeded
        # so that a stale $1 is never counted.
        if ($fields[3] =~ /:(\d{2}):/) {
            $hour_requests{$1}++;
        }

        # Log the URL requested
        $url_requests{$fields[6]}++;

        # Log status code
        $status_requests{$fields[8]}++;

        # Log bytes; Apache writes "-" when no bytes were sent
        if ($fields[9] ne "-") {
            $bytes += $fields[9];
        }
    }

    close $log;
}

sub report {
    print "Total bytes requested: ", $bytes, "\n";

    print "\n";

    report_section("URL requests:", %url_requests);
    report_section("Status code results:", %status_requests);
    report_section("Requests by hour:", %hour_requests);
    report_section("Requests by file type:", %type_requests);
}

sub report_section {
    # The hash arrives flattened in @_ and is rebuilt as %type
    my ($header, %type) = @_;

    print $header, "\n";
    for my $i (sort keys %type) {
        print $i, ": ", $type{$i}, "\n";
    }

    print "\n";
}
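
As an alternative to split, the same fields can be pulled out with a single pattern and subexpressions. This is only a sketch: parse_log_line is a hypothetical helper that is not part of analyze.pl, and it uses the x flag (not covered above), which lets a pattern contain whitespace and comments.

```perl
use strict;
use warnings;

# parse_log_line is a hypothetical helper: it returns a hash reference
# of captured fields, or nothing when the line does not match.
sub parse_log_line {
    my ($line) = @_;
    my ($host, $hour, $url, $status, $bytes) = $line =~ m{
        ^\s*(\S+)                 # client address
        \s+\S+\s+\S+              # identity and user, usually "-"
        \s+\[[^:]+:(\d{2}):       # date, then capture the hour
        [^\]]+\]                  # rest of the timestamp
        \s+"\S+\s+(\S+)[^"]*"     # request method, then capture the URL
        \s+(\d{3})                # status code
        \s+(\d+|-)                # byte count, or "-" when none
    }x or return;
    return { host => $host, hour => $hour, url => $url,
             status => $status, bytes => $bytes };
}

my $sample = '132.62.20.9 - - [01/Nov/2000:00:00:19 -0400] '
           . '"GET /news/home/index.htm HTTP/1.1" 200 2285';
my $entry = parse_log_line($sample);
print "$entry->{host} asked for $entry->{url} at hour $entry->{hour}\n";
```

Unlike the split version, a line that does not fit the expected layout is rejected outright rather than producing misaligned fields.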