Illustrations:

[1] the NewAssignment.pl script

[2] the mprimescripts README file

[3] the mprime.sh script

[4] the output produced by lynx -dump http://www.mersenne.org/range.htm

[5] my first complete results file

[6] an archival results file generated by mprime.sh

[7] George Woltman's mprime readme.txt file

[8] Chris K. Caldwell's top ten list

-------------- BEGIN DRAFT

Perl makes writing a secure metacomputing assignment client easy

David Nicol

Until I became acquainted with GIMPS, the Great Internet Mersenne Prime Search, home page at http://www.mersenne.org, the idle cycles on my workstation were spent entirely in wait states. But no longer. My computer now runs a program in the background which is breaking genuinely new mathematical ground, even in the time between processing my keystrokes as I write this. It's factoring large unfactored numbers, as part of an internet-wide coordinated effort to find another Highest Prime Number.

Recently, the world has seen two large problems successfully attacked with radically distributed systems, an approach also known as "metacomputing." The DES Challenge, which teamed up computers around the world to decipher a message encrypted with a 56-bit RSA secret key, succeeded on June 18 1997, and the GIMPS project, which succeeded November 13 1996 in demonstrating that 2^1398269 -1 is a prime number, both were computer programs that ran with low priority on thousands of computers volunteered around the world.

Prime numbers are integers greater than one which can be evenly divided by no numbers other than themselves and one. They are considered by number theorists to be basic building blocks of arithmetic, as all whole numbers have exactly one set of "prime factors" which multiply together to produce them. General truths about primes, such as that there are infinitely many of them, have been available to mathematicians since Euclid wrote Elements 2300 years ago. Computers give us the ability to do math with a precision formerly impossible, and all branches of mathematics have benefited from their advent. Where millennia ago thinkers could only posit the existence of numbers larger than humanly conceivable, at best giving rough estimates, today we work with them on a regular basis.

What difference does it make, one might ask, that a new Largest Prime Number has been found?

An answer to that question might be that the search for high prime numbers is largely symbolic, like mountain climbing or space exploration. Just as new technologies are used to explore previously unknown places, as new approaches to calculation become available, they are put to the test of charting the reaches of mathematical knowledge. Before mechanical calculation, Edouard Lucas's 2^127-1 was the largest known prime number, even though people had known forever (or at least since Euclid) that there are infinitely many of them. This record stood from 1876 to 1951, which was a very busy year. First the record was beaten by a man with a mechanical calculator, then a team with an electrical computer found a dozen higher ones. The fact that the first post-Lucas largest prime was found with a mechanical rather than electrical calculator is interesting, but as Chris Caldwell puts it on his web page at http://www.utm.edu/research/primes/notes/by_year.html, by the beginning of 1952 the age of electronic math had most definitely begun. So what difference can finding a new highest prime number make? Outside of the prestige of having your name appear in the math almanacs and the full-employment implications of obsoleting textbook editions, not a whole lot. George Woltman's GIMPS project is just for fun. Other metacomputing projects, in which thousands of computers work independently and simultaneously on different ranges of a large problem, may have bearing on less entertaining issues, such as weather modeling. The DESCHALL project proved that 56-bit RSA keys are not secure, while my personal recurring daydream, a distributed computerized Go opponent, would also be just for fun.

Mersenne numbers, named after the French monk Marin Mersenne (1588-1648), are numbers of the form 2^p-1, where p is a prime number. Except for a short time in 1951, since 1588 the highest known prime number has always been either a Mersenne prime or a factor of a Mersenne composite. The relative ease of the Lucas method of verifying primality is what made Lucas's prime, 2^127-1, reign as the highest known prime number before mechanical computing devices. It's not that Mersenne numbers tend to be prime, it's just vastly easier to test them. Where for most numbers one needs to do exhaustive factoring to prove a number prime, it is possible to test a Mersenne number for primality through rotation and addition only, using a method called the "Lucas-Lehmer test."

Currently 35 Mersenne numbers are known to be prime, and more are being tested, even as you are reading these words, by software running with a low priority on otherwise idle machines. The most convenient software to do this is a package by George Woltman called mprime, which is optimized for the i386 architecture. Mprime survives sudden shutdown or termination by keeping its state written out to a file, so not all work is lost in case of interruptions. For other architectures, a more generic version distributed as C source is mers by Will Edgington, at ftp://ftp.delta.com:/pub/users/mersenne. Rewriting the mprimescripts package to use the mers program instead of mprime would be fairly simple. Mprime.sh would have to invoke mers instead of mprime, and NewAssignment.pl would need different values assigned to the %Flags hash. In fact, the name of the program to be invoked could be moved out of the mprime.sh script and into the %Flags hash, in the event that different kinds of tasks required different programs to be run, rather than different command line switches presented to the same program. In the words of Larry Wall, "There's more than one way to do it."

The procedure for enrolling your computer in the GIMPS project is as follows:
1: Download Lucas-Lehmer and factoring software and the associated database of factors
2: Reserve a range of numbers to factor or test, to avoid duplication of effort
3: Configure your machine's startup scripts to start your software with the chosen task
4: Reboot your machine, or start your software without rebooting your machine.

The tasks can take months, and therefore these steps are quite comfortably done by hand. After going through the cycle once, I realized it is possible to automate them with a script, and the mprime.sh and NewAssignment.pl scripts automate steps two, three and four.

Automating step 1 requires creating a .netrc file in the mprime user's home directory



machine ftp.mersenne.org
login anonymous
password mprimescripts_user
macdef init
binary
get database.gz
! gunzip --force database.gz
quit

and setting up a cron job (type man crontab at your unix shell prompt) to periodically ftp ftp.mersenne.org.

# periodically download a new database
51 3 25 * * ftp ftp.mersenne.org
The five numbers at the beginning of crontab entries are, in order, minute, hour, day of month, month of year, and day of week. The above entry will download a new database every month, on the 25th, at 3:51 am. Please choose different numbers, so as to avoid a network bottleneck, or better yet, uncomment the line in mprime.sh after you have the .netrc file working right. People installing mprime on machines without permanent network connections will of course get network inaccessible errors unless their dialup connections are brought up before running ftp.

When the cron job runs, it invokes the ftp program with a command line argument of the name of the machine. The ftp program opens a File Transfer Protocol session with the machine with the given name, and reads the .netrc file to see if it has instructions pertaining to this connection. In the case of ftp.mersenne.org, it does. Ftp will automatically log in as user anonymous, give mprimescripts_user for a password, change the mode of the ftp connection to binary, and download the database.gz file which can (at the time of this writing) be found there without changing directories. After the file is downloaded, the line starting with a bang (exclamation point) is passed to a shell, and expands the new file, overwriting the old one (that's what the --force is for.)

Although the entire functionality of the package, in handling steps two, three and four, could be included in a slightly more complex perl program, the vast majority of the mprimescripts' runtime is spent waiting for the mprime program to complete, and the best solution in terms of swap page management might be some kind of "on-completion-execute" parameter that the mprime program could take. Since there isn't one, a shell script is the winner of the contest to see what can use the least amount of pages of swap space while holding its place in a process during an extended sleep. In this scenario, Perl isn't used to do everything, just what it is best suited for: practically extracting data from a report. The report is the "Reserve a range page" (figure 4) and the data is the command line arguments to be given to the mprime program.

Mprime.sh, a shell script, (figure 3) invokes the mprime program with command line arguments read from a file called assignment, stands patiently by while it runs, then invokes NewAssignment.pl when mprime has finished its assignment and needs a new one. To preserve mprime's "killability," mprime.sh checks the "result" file created by mprime to see if mprime finished its task or was terminated by a signal.

Unless the words "Please send the results file to" appear in the results file, it is assumed that mprime was killed by a signal and it is time for the mprime.sh script to quit. When the results file has the "Please send the results..." line, the grep statement



        # quit the loop unless "Please send" appears in the results file
        grep "^Please send the results file to" results || break

returns a true value, meaning that the words were found in the file, and the shell ignores everything after the double-bar or, continuing with the loop, archiving the results file, emailing George Woltman the results file, removing the results file, choosing a new assignment, and starting over. When the words do not appear, the grep returns a false value, causing the shell to execute the statement after the double-bar or and exit the mprime.sh script's main loop. To keep mprime's administrator apprised of the situation, a message is e-mailed to them

echo mprime $ASSIGNMENT terminated by error or signal: results not needed sent. | mail mprime

before the script exits. Before running mprime.sh, mprime needs to be added to the machine's /etc/aliases file, pointing to the machine's administrator, as discussed in (figure 2) the mprimescripts README file.

Choosing a new assignment is too complex a task for a shell script, but the powerful regular-expression-parsing capabilities of Perl made writing NewAssignment.pl a straightforward piece of work. The approach it takes could easily be adapted to any other metacomputing project involving a strongly dividable large problem, as a wrapper providing command line parameters to a highly optimized program that is going to do the actual work.

The first step in choosing a new GIMPS assignment is obtaining the list of available assignments, which is done by pointing a web browser at George Woltman's Choose a Range page, http://www.mersenne.org/range.htm. It happens that the lynx text-based web browser can take a command line switch, -dump, which makes it print the web page it is pointed at out to its standard output stream. Invoking this behavior inside backquotes


        # load the "reserve a range" web page into a variable:
        $ReserveARangePage = `lynx -dump http://www.mersenne.org/range.htm`;
gives us a page such as the one seen in figure 4. Inspection reveals some structure to this page, and Mr. Woltman, although he hasn't (as far as I know) automated its maintenance, has agreed not to alter the format in such a way as to break this deconstruction.

There are three sets of ranges. Each range begins with a description of what the task is, followed by the words "a range of" and the minimum recommended range size, followed by more words, the last of which is "from," followed by a comma-separated list of pairs of the form N to N, where N are both natural numbers, terminated with a period.

This structure can be represented by a Perl regular expression:


m/a range of (\d+) .*? from ([\s0-9to,]+)\./s

Here is a quick review of the regular expression atoms used above:
the m starts the regular expression
the text a range of matches exactly those words, wherever they may appear
the parentheses remember their contents, in this case \d+, which will be the first and lower of the suggested range sizes.
\d matches digits 0123456789
the plus matches all the numbers that are there
dot matches anything
star matches all the anything that is there
question mark after a star or plus means the expression should match as little as possible , until the next string, in this case from
square brackets set off enumerated lists of matching characters. The second set of parentheses will remember the list of range pairs, of the form N to N, separated by commas, with whitespace, matched by \s, interspersed throughout.
\. is a literal dot, the period at the end of the list of ranges.
the s after the closing slash means that the expression may span lines of text.

The part of this regexp that matches the range, in the NewAssignment.pl script is called $RangeMatcher.


$RangeMatcher='.*a range of (\d+) .* from ([\s0-9to,]+)\.';
$RangeMatcher is defined as a single-quoted string so that the special regexp characters are not translated. If it was in doublequotes, the backslashes in it would be translated away. Coupled with another word to localize the search to a part of the file, it remembers the number of numbers we want to test, and the comma-separated list of ranges, and assigns them to a list:

($PortionSize, $RangeListing) = ($ReserveARangePage =~ m/$PreText$RangeMatcher/s);

In the event that the format of the page changes, or when adapting NewAssignment.pl to a different idle-time metacomputing project, the text used to locate the $RangeMatcher regexp to the proper place on the web page depending on the choice of assignment type A, B, or C is stored in the scalars $PreTextA, $PreTextB, and $PreTextC.

The other part of the RegExp that will match the list of ranges is called $PreText, because it is the text before the range. There are three possible kinds of assignments on the Reserve a Range page, represented internally in the NewAssignment script as A, B and C. Assignments A and B are testing mersenne numbers for primality using the Lucas-Lehmer test, while assignment C is factoring known composites, which determines previously unknown large prime numbers which are added to the central database. As discussed in the mprime readme.txt file (figure 7), factoring a range is done by invoking the mprime program with a -f command line switch, which is taken care of by prefixing the range, in the assignment file, with another variable called $Flags{$Suffix} which is an empty string for assignment types A and B but -f for type C.

I named the scalar $Suffix $Suffix because it is appended to the token 'PreText' to come up with the name of the variable that is assigned to $PreText.


 $PreText = ${'PreText'.$Suffix};

So, depending on whether $Suffix is A, B or C, $PreText is set to equal $PreTextA, $PreTextB, or $PreTextC. The advantage to doing it that way rather than having an associative array %PreText from which $PreText would be selected was that fewer curly braces were required; it also serves as an example of indirect naming, a feature available in few other programming languages.

Once we've decided what kind of thing we want to do, by randomly selecting A, B or C, the script invokes the lynx text-only web browser to read the Reserve A Range web page: $ReserveARangePage = `lynx -dump http://www.mersenne.org/range.htm`; lynx -dump prints the contents of the given web page to its standard output, which is captured by the backquotes and stored in the scalar $ReserveARangePage. Just in case the network is down, or www.mersenne.org is down, this is done inside a loop. The RegExp $PreText$RangeMatcher is applied to the dumped web page, and if all is going well it matches and sets $PortionSize and $RangeListing appropriately to the type of assignment we have chosen. The other possibilities are that we have made a mistake in our RegExp or that there is a network problem. The testing section, which will run when NewAssignment.pl is invoked with any command line arguments, such as "test," dumps out the first hundred characters of the web page, to assure that lynx was able to find the page, and the expression $PreText$RangeMatcher, to show what exactly it was that the program was unable to find on the page. Debugging sections such as this one are essential to software development. Without them, especially when working with an unfamiliar programming language, programming can be like stumbling in the dark through someone else's messy office, and it was through using this testing section that I was able to realize that the $RangeMatcher needed to be set equal to a string in single quotes rather than double.

For users whose computers are connected to the internet via dialup PPP connections, there is a place ready to add the ppp-up and ppp-down scripts so that the connection does not stay up while waiting a polite three hours to try the www.mersenne.org server again.

Making it past the GETLISTING loop means that $PortionSize and $RangeListing have been set. $PortionSize is already an integer, and the only transformation it might need is if we're running mprimescripts on an exceptionally speedy machine which can take larger than normal portions, so this case is provided for, but commented out. Massaging $RangeListing takes a few more steps. The script strips the whitespace out, then converts the word "to" to a space, then splits the result on its commas into a list, @Ranges. $MyRange is randomly chosen, using the same method used above to randomly choose $Suffix. $MyRange has two numbers in it, separated with a space, which is a form that is split into two more scalars, $High and $Low. Except for the case in which $High and $Low are already closer than the recommended spread, $High is redefined as $Low plus $PortionSize.

$Assignment, which is what mprime.sh will put on mprime's command line, is defined to be -f, when required for assignment type C, and the endpoints of the range. All that remains to do is to write $Assignment into the assignment file, email you and George Woltman with the selected range, and exit. All e-mails sent from mprimescripts have subject lines which could be filtered by software such as procmail and directed to another Perl script which could fully automate the assignment server as well, when designing a metacomputing project around this model.

In order to adapt mprimescripts for use in a different distributed idle-time project, there would need to be a central authority responsible for declaring what assignments are available and a program invokable from a shell script that takes its assignment on the command line. Given a project sharing these two features with GIMPS, editing mprimescripts to have different text in the mail messages, to invoke the new program instead, to select the numbers for the assignments from parts of the available tasks page landmarked by different identifying text, to change the command line switches associated with the different invoked program, in short, rewriting them appropriately to the new project, will involve a minimum of effort.

GIMPS uses a web page, but replacing the "lynx -dump" with opening any other kind of shared network file would be straightforward. In fact, the pace of GIMPS may be better suited to human administration than machine. Mathematicians who have theories as to where the next Mersenne prime might lie can reserve that range to test, like roulette players with a system, in hopes of getting their names in the history books. NewAssignment.pl chooses its assignment randomly to minimize the possibility of the administrative hassle of a collision: random selection from a long list of choices is a straightforward way to lower the frequency of duplicated selections. In fully automated systems, designed from the beginning to be distributed projects, having the distributed programs query a central server for their next assignment when they're done, or even the local centurion in a network of central servers, after they have finished their assignments, may be more efficient. The DESCHALL project, for instance, was distributed as source code as well as binaries for various platforms, and each client program in the "brute force search" of the 56-bit RSA key space checked into a central "key server" regularly to get a new range of numbers to try. Had the DESCHALL software not been distributed as uncompiled source code, this would have presented a substantial security risk. Allowing strange programs access to your computers network layer is risky, without assurance that the metacomputing client is only doing what it is supposed to be doing.

Mprimescripts serves well as a secure network interface for a precompiled metacomputing client binary. All the metacomputing program's contact with the outside world is through the scripts, and the program itself does not need network socket privileges. Perl doesn't need socket privs either, for that matter. As mprimescripts is written, the mail and lynx programs can be setuid programs runnable by crippled users, such as the mprime user, minimizing the risk that a future revision of mprime is going to give an evil "man-in-the-middle" hacker posing as George Woltman a backdoor login shell on your computer.

David Nicol (dnconsult@tipjar.com) does unix system administration in Kansas City. He can often be found logged on to the Internet Go Server as kcspirit.
--------------END DRAFT