Simulating Symmetric Multi-Processing with fork()

This example shows how to use excl.osi:fork on non-Windows platforms to use all available CPUs for Lisp computations. It assumes that your application has compute-bound parts that can be run in parallel. In the source code linked at the end of this page, the "work" is simulated with a loop calling expt.

This technique is not appropriate for problems where the granularity of parallelism is very fine, because the per-call overhead would be too large. With this framework, around 11,500 calls per second can be made on a 1.8GHz x86_64 machine.
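One rough way to judge whether a task is coarse enough is to time an empty fork-and-wait round trip and compare it with the task's own running time. The sketch below is in Python rather than Lisp, uses os.fork as a stand-in for excl.osi:fork, and times a bare fork per call, which is not necessarily how the article's framework dispatches work. The names fork_calls_per_second and overhead_fraction are hypothetical.

```python
import os
import time


def fork_calls_per_second(calls=200):
    """Time `calls` empty fork-and-wait round trips and return the rate."""
    start = time.perf_counter()
    for _ in range(calls):
        pid = os.fork()
        if pid == 0:
            os._exit(0)      # child: no work at all, exit immediately
        os.waitpid(pid, 0)   # parent: wait so the timing covers the round trip
    return calls / (time.perf_counter() - start)


def overhead_fraction(task_seconds, calls_per_second):
    """Fraction of each call spent on overhead rather than on the task."""
    overhead = 1.0 / calls_per_second
    return overhead / (overhead + task_seconds)
```

If the 11,500 calls per second quoted above is read as sequential per-call overhead, it corresponds to about 87 microseconds per call. A task that itself runs for 87 microseconds would then spend about half its time on overhead, while a task that runs for several seconds would lose a negligible fraction.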

The first part of the example code is a framework for executing the work units on different processors. The second part is a specific example using this framework.

This example does not use anything fancy to pass information between the parent and child processes, just the Lisp printer and reader. The less information passed, the less overhead there will be.
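The fork-then-print-then-read pattern described here can be sketched as follows. This is a minimal illustration in Python rather than Lisp, not the article's framework: os.fork stands in for excl.osi:fork, repr and ast.literal_eval stand in for the Lisp printer and reader, and the names start_task, finish_task, run_parallel, and work are hypothetical.

```python
import ast
import os


def start_task(func, *args):
    """Fork a child that runs func(*args) and prints the result to a pipe."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: compute, write the printed form of the result, then exit
        # without running any of the parent's cleanup code.
        status = 1
        try:
            os.close(read_fd)
            with os.fdopen(write_fd, "w") as out:
                out.write(repr(func(*args)))
            status = 0
        finally:
            os._exit(status)
    # Parent: keep only the read end of the pipe.
    os.close(write_fd)
    return pid, read_fd


def finish_task(pid, read_fd):
    """Read the child's printed result back and wait for the child to exit."""
    with os.fdopen(read_fd) as inp:
        text = inp.read()
    os.waitpid(pid, 0)
    return ast.literal_eval(text)


def run_parallel(tasks):
    """Start every task first, then collect the results in order."""
    started = [start_task(func, *args) for func, *args in tasks]
    return [finish_task(pid, fd) for pid, fd in started]


def work(base, n):
    """Stand-in for a compute-bound task (an assumption, like the expt loop)."""
    total = 0
    for i in range(n):
        total += pow(base, i, 1_000_003)
    return total
```

For example, run_parallel([(work, 3, 1000), (work, 5, 1000)]) returns the same list as [work(3, 1000), work(5, 1000)], but with each call evaluated in its own child process. Only the printed result crosses the pipe, which keeps the amount of information passed small. Error handling is left out of this sketch: a child that fails writes nothing, so ast.literal_eval raises in the parent.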

Terminology:

  • task: a unit of work, or in Lisp terms an expression that can be evaluated
  • CPU: an actual hardware processor
  • processor: an entity which can do work, or in terms of this example a Lisp subprocess which performs a task on a CPU

There can (and often will) be more processors than CPUs, though if there are many more processors than CPUs then tasks might take much longer than expected to complete.

The "Example" section contains an example that is run with a varying number of processors. For each run, there is an estimate of the time it would take a single processor to complete the work. This is labeled "WORK" in the test runs below. In theory, if there were 100 seconds of work to be done, 4 CPUs, and tasks that were the right size and independent of each other, you might get the work done in 25 seconds of real time.

Here is an example run on a 4-CPU Opteron system. Each CPU runs at 1.8GHz.


cl-user(2): (run)
Detected 4 CPUs
Iterations 40, processors 2, WORK: 8.0 seconds, REAL TIME: 4.164
Iterations 40, processors 3, WORK: 12.0 seconds, REAL TIME: 4.207
Iterations 40, processors 4, WORK: 16.0 seconds, REAL TIME: 4.18
Iterations 40, processors 5, WORK: 20.0 seconds, REAL TIME: 5.905
Iterations 40, processors 6, WORK: 24.0 seconds, REAL TIME: 7.075
Iterations 40, processors 7, WORK: 28.0 seconds, REAL TIME: 7.811
Iterations 40, processors 8, WORK: 32.0 seconds, REAL TIME: 8.717
Iterations 40, processors 9, WORK: 36.0 seconds, REAL TIME: 9.549
Iterations 40, processors 10, WORK: 40.0 seconds, REAL TIME: 10.525
Iterations 40, processors 11, WORK: 44.0 seconds, REAL TIME: 11.529
Iterations 40, processors 12, WORK: 48.0 seconds, REAL TIME: 12.535
nil
cl-user(3): 

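We can see the work is well distributed over the actual CPUs. With up to 4 processors, each processor gets its own CPU and the real time stays at about 4 seconds. With more processors than CPUs, the real time to complete the work is roughly work / CPUs.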
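The relationship between WORK and REAL TIME in the run above can be checked with a little arithmetic. The sketch below, in Python rather than Lisp, assumes a simple best-case model in which the work is spread evenly and real time is work divided by the smaller of the processor count and the CPU count. The names ideal_real_time and compare are hypothetical, and the figures are copied from the Opteron run.

```python
def ideal_real_time(work_seconds, processors, cpus):
    """Best-case wall-clock time when work is split evenly across processors."""
    return work_seconds / min(processors, cpus)


# (processors, WORK, REAL TIME) copied from the Opteron run above.
OPTERON_RUNS = [
    (2, 8.0, 4.164), (3, 12.0, 4.207), (4, 16.0, 4.18),
    (5, 20.0, 5.905), (6, 24.0, 7.075), (7, 28.0, 7.811),
    (8, 32.0, 8.717), (9, 36.0, 9.549), (10, 40.0, 10.525),
    (11, 44.0, 11.529), (12, 48.0, 12.535),
]


def compare(runs, cpus):
    """Pair each measured real time with the best-case estimate."""
    return [(p, ideal_real_time(w, p, cpus), real) for p, w, real in runs]


if __name__ == "__main__":
    for p, ideal, real in compare(OPTERON_RUNS, cpus=4):
        print(f"processors {p:2d}: ideal {ideal:6.3f}s, measured {real:6.3f}s")
```

Under this model the measured times are never better than the estimate, and the gap is largest when the processor count is just above the CPU count (5 and 6 processors on 4 CPUs).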

Now, let's look at a dual 2.4GHz Xeon system. Due to hyperthreading, the Linux kernel believes there are 4 CPUs on this system.


cl-user(2): (run)
Detected 4 CPUs
Iterations 40, processors 2, WORK: 8.0 seconds, REAL TIME: 4.318
Iterations 40, processors 3, WORK: 12.0 seconds, REAL TIME: 5.452
Iterations 40, processors 4, WORK: 16.0 seconds, REAL TIME: 12.303
Iterations 40, processors 5, WORK: 20.0 seconds, REAL TIME: 14.495
Iterations 40, processors 6, WORK: 24.0 seconds, REAL TIME: 11.512
Iterations 40, processors 7, WORK: 28.0 seconds, REAL TIME: 17.971
Iterations 40, processors 8, WORK: 32.0 seconds, REAL TIME: 19.951
Iterations 40, processors 9, WORK: 36.0 seconds, REAL TIME: 26.855
Iterations 40, processors 10, WORK: 40.0 seconds, REAL TIME: 30.471
Iterations 40, processors 11, WORK: 44.0 seconds, REAL TIME: 27.072
Iterations 40, processors 12, WORK: 48.0 seconds, REAL TIME: 33.404
nil
cl-user(3): 

These results are not nearly as good as those from the first system, which, as it happens, costs about 10 times as much. With 12 processors, the Opteron system was close to 3 times as fast (12.535 seconds versus 33.404 seconds).
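To compare the two systems at every processor count, the real times from the two runs can be divided pairwise. This is a small Python sketch, not part of the article's Lisp source. The name speed_ratios is hypothetical, and the figures are copied from the two runs above.

```python
# REAL TIME in seconds, keyed by processor count, copied from the runs above.
OPTERON = {2: 4.164, 3: 4.207, 4: 4.18, 5: 5.905, 6: 7.075, 7: 7.811,
           8: 8.717, 9: 9.549, 10: 10.525, 11: 11.529, 12: 12.535}
XEON = {2: 4.318, 3: 5.452, 4: 12.303, 5: 14.495, 6: 11.512, 7: 17.971,
        8: 19.951, 9: 26.855, 10: 30.471, 11: 27.072, 12: 33.404}


def speed_ratios(fast, slow):
    """How many times longer the slow system took at each processor count."""
    return {n: slow[n] / fast[n] for n in sorted(fast) if n in slow}


if __name__ == "__main__":
    for n, ratio in speed_ratios(OPTERON, XEON).items():
        print(f"processors {n:2d}: Xeon took {ratio:.2f}x as long")
```

With 2 processors the two systems are nearly even, and the gap opens up once the processor count reaches the number of CPUs the Xeon system reports.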

As was said at the outset, this approach isn't for every application, but it can be very useful without a lot of trouble. The main benefit is that, in the presence of multiple CPUs, an application can increase efficiency without a lot of work. The downside is that debugging is more difficult using this approach, though this can be overcome with good programming techniques for error recovery. This is not that big a deal, since any serious server application will need to employ these same error recovery techniques anyway.
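One basic error recovery technique for forked workers is to have the parent inspect each child's exit status, so that a failure in a subprocess is reported rather than lost. The sketch below shows the idea in Python rather than Lisp, using os.fork and os.waitpid. It is an assumption about what such recovery could look like, not the article's code, and the name run_checked is hypothetical.

```python
import os
import signal


def run_checked(func):
    """Run func() in a forked child and report how the child ended."""
    pid = os.fork()
    if pid == 0:
        # Child: exit with 0 on success, 1 if func raised an exception.
        status = 1
        try:
            func()
            status = 0
        finally:
            os._exit(status)
    # Parent: wait for the child and decode its raw wait status.
    _, raw = os.waitpid(pid, 0)
    if os.WIFSIGNALED(raw):
        return ("killed", os.WTERMSIG(raw))
    return ("exited", os.WEXITSTATUS(raw))
```

A parent can use the returned pair to decide whether to retry the task, log it, or give up. For instance, run_checked(lambda: 1 / 0) reports an exit code of 1, and a child terminated by SIGKILL is reported as killed by that signal.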

Source Code

View or download.

Copyright © 2023 Franz Inc., All Rights Reserved