|
Allegro CL version 11.0 |
The runtime analyzer is a tool to make Lisp programs more efficient. Runtime analyzer information tells you which parts of the program run slowly or use a lot of space (thus increasing garbage collection). Once you know this, you can improve efficiency by optimizing (see compiling.html) or by recoding.
The Runtime Analyzer works on all platforms. Note the runtime analyzer produces profiles of running code. We generally use the terms runtime analysis and profile interchangeably and many symbols associated with the runtime analyzer have profile in their names.
Allegro CL provides four runtime analysis tools for assisting users in determining where time is spent or space is used. Most powerful are the space and time profilers which statistically sample a running program to determine which functions use the most time or which use the most space. The call-counting profiler keeps a count of the number of times a compiled function is actually called while a program is running. The real time profiler says where time is spent sampling every quantum of real (clock) time. Its results depend on what else is happening on the machine (in contract to the other profilers which generally do not depend on that), so results can change, perhaps very significantly from run to run.
The Allegro CL Integrated Development Environment in Windows provides a graphical display of runtime analyzer data and various menu commands for starting and stopping analysis. See the description on the Runtime Analyzer Results dialog in the IDE help file. The IDE additionally provides a Runtime Analyzer Control dialog which provides graphic tools for starting, suspending, and stopping profiles and for controlling what is profiled. The Runtime Analysis Graph dialog is described with the Runtime Analyzer Results dialog.
We describe the four types of runtime analyzer below. Later sections describe how to use the runtime analyzer and how to interpret the results. Please note the following:
The runtime analyzer code is included in most supplied images but it can be loaded with the following form:
(require :prof)
The CPU time analyzer collects data by interrupting execution of the Lisp periodically and recording the functions on the execution stack. The variable *granularity* controls how often execution is interrupted. Its value is the integer number of machine clock ticks between samples.
A machine clock tick occurs approximately every 0.01 seconds (the exact value is machine dependent). In that time, the machine will typically execute tens of thousands of machine instructions. The statistical nature of the time analyzer should therefore be evident. However, under reasonable assumptions concerning lack of interaction between execution of your program and machine clock ticks and with sufficient samples, the information gathered by the sampling process will closely mirror the actual behavior of the code which is executed.
The reasonable assumptions state that (1) there is no cycle in the program harmonic with the program interrupting time so that the interrupting times do provide a random sample of program execution and (2) if your program does several different things, enough time is spent on each to provide statistically significant results. Several thousand samples are statistically sufficient. Of course, functions where your code spends very little time may be missed altogether by the time analyzer. However, the purpose of the runtime analyzer is to assist you in making your code run faster, and functions which use little time are not important in achieving that result.
The real time analyzer (not avialble in version 9.0) collects data by sampling periodically and recording the functions on the execution stack. The variable *granularity* controls how often execution is interrupted. Its value is the integer number of machine clock ticks between samples. Data is collected whether or not Lisp itself is executing.
A machine clock tick occurs approximately every 0.01 seconds (the exact value is machine dependent). In that time, the machine will typically execute tens of thousands of machine instructions. The statistical nature of the time analyzer should therefore be evident. However, under reasonable assumptions concerning lack of interaction between execution of your program and machine clock ticks and with sufficient samples, the information gathered by the sampling process will closely mirror the actual behavior of the code which is executed.
A real time analysis is done when :real-time
is specified as the profile type to prof:start-profiler or to prof:with-profiling.
The space analyzer tries to identify the functions which use the most space. Like the time analyzer, the space analyzer is statistical but the method for sampling is quite different.
The time analyzer stops execution regularly and examines the stack. The space analyzer stops execution whenever a page of memory is allocated. The statistical nature of the space analyzer arises from the fact that the function which calls for a page of memory will not (typically) use it all. After the function which called for the page has completed, there may be sufficient room left on the page for the space requirements of the next function. Therefore, only the first function is charged with needing space even though both functions used the space.
The assumption of the space analyzer is that the requirement for new pages of memory occurs randomly and in proportion to the amount of space actually required by a function. So given a reasonably large sample, the functions that use large amounts of space will have proportionally large amounts of space attributed to them by the analyzer.
In Allegro CL, memory is organized into three kinds of objects: symbols, conses, and other. Both symbols and conses have a specific size and so a known number can fit on a page. Other Lisp objects require variable amounts of memory (depending on what information they contain). All objects other than symbols and conses are stored on other pages.
The space analyzer thus records information whenever one of the functions which allocate a new page of memory is requested. The request can come from the internal function _page_from_new (for symbols and conses) or from the function new_other (for other objects). These functions are deep in the system and are not typically user visible. On machines where this is possible, during analysis, hooks associated with those two functions stop execution when they are called and record information about what is on the stack. Note that because of the way stack linking is done on certain machines, you may not see the actual _page_from_new calls, but rather Lisp functions (such as cons) which trigger the calls. These analyses are harder to interpret, but the stack linking requirements make it impossible to list these calls with more accuracy. (The examples in the document are from platforms that do show _page_from_new.)
Note that the time macro also provides information on space used by the form evaluated. This information is quite accurate (in some cases, the total space used is more accurately reported by the time macro than by the space analyzer). Time does not provide a breakdown of what within a form caused space usage and that is what the space analyzer is designed to do.
The call-counting profiler is integrated with the space and time analyzers. No special compilation is necessary for calls to a function to be counted, but you must tell the system to keep counts by specifying true values for the :count or :count-list keyword arguments when the runtime analyzer is started. (However, see verify-funcalls-switch. funcall'ed functions in certain cases in compiled code will not be counted if the call was compiled with that compiler switch false.) Any function can be call-counted. The count is exact, not statistical. The only calls that will be missed are where a function is compiled inline and the inline code is executed, and, on certain architectures, calls triggered by funcall. Note that the overhead for counting calls can be significant, and that overhead is recorded by the time analyzer. There is no space overhead, so the space analyzer results are unaffected. Therefore, the reported time usage is skewed by the call-counting overhead when a time analysis also does call-counting. The time results are still useful, but must be interpreted with care and a time-only analysis should not be compared with a time and call-counting analysis.
Unless otherwise indicated, the symbols described in this section are in the profiler (nicknamed prof) package.
In order to do a runtime analysis, you need to start it and stop it. You can start it with start-profiler and stop it with stop-profiler or you can use the macro with-profiling to cause the runtime analyzer to be run while a specific body of code is executed. The variable *maxsamples* controls the maximum number of samples that will be taken. Data collection will stop when that number of samples has been recorded even if stop-profiler is not called or the code specified to with-profiling has not completed.
Note that it is important to distinguish between starting and stopping data collection and suspending data collection. start-profiler and stop-profiler are the beginning and end of a data collection run. No data is collected before start-profiler is called and none can be collected after stop-profiler is called. Temporary suspension of data collection is done with the functions start-sampling and stop-sampling. On Windows using the Integrated Development Environment, commands on the Run menu can be used to start and stop data collection and sampling.
The macro with-profiling can be used to collect data on the evaluation of specific forms. The macro with-sampling similarly can be used to sample only as specific forms are evaluated.
Please note the following about time and space data collection.
prof::sizefile-template
and prof::samplefile-template
, whose values are either "/usr/tmp/sizes" and "/usr/tmp/sample" or "/tmp/sizes" and "/tmp/sample". (The actual data files have filenames longer than sizes and sample.) You may change the values of these variables if the wrong or an inappropriate temporary directory is specified. The values must be strings like "You may see this message when collecting data. This section explains what it means. The reading of the C symbol table into lisp form has been enhanced to not only include symbols from the executable itself (which has itself shrunk to a very small size) but also any shared-libraries that the lisp uses, including the heart of the lisp itself. The function list-all-foreign-libraries lists those libraries whose symbols Allegro attempts to read in.
Sometimes, the shared-libraries do not have the necessary headers and sections that are needed for proper building of a C symbol table in lisp. In these cases, a profiler-error is given, for example:
USER(2): (prof:show-flat-profile)
profiler error: couldn't find .text section
profiler error: couldn't find .text section
In this case (on a Sparc Solaris 2.6) two libraries did not have .text sections, so no symbols are recorded for those two libraries.
The failure to find symbols is not fatal. The debugging tools that use the libraries are simply not able to give as complete information as if the symbols had been available. The same soft failure occurs when a symbol in a .so had been declared static (or in the case of Windows, any C symbol not explicitly DllExport-ed). The exact failure mode depends on the arrangement of the functions in memory; a function at one address whose symbol was not properly recorded will appear to be part of the next-lower-addressed function whose name was recorded.
The status of the data collecting can be determined at any time using the function profiler-status. Called with no arguments, it returns the status of the current profile. The status can be one of the following:
:inactive
-- data collection has not yet started:suspended
-- data collection has started, but sampling is currently not being done:sampling
-- data collection has started and sampling is occurring:saved
-- data collection has been stopped, but as yet no analysis of saved data has occurred:analyzed
-- all data has been analyzedA finalized analysis (profiler-status returns :analyzed
) can be saved using the function save-named-profile which causes profile data to be written to a profile object which is given the name specified. The name can be used as the value of the profile keyword argument to various runtime analyzer functions. The function find-named-profile takes a name and returns the associated profile object, which can also be used as the value of the profile argument. Once you are done with a profile, you can delete the profile object with delete-named-profile.
Saving an analysis with save-named-profile (save-current-profile in versions 9.0 and 10.0) allows you to analyze data at your leisure, running other analyses in the meantime.
The function cancel-current-profile will stop the current profile and delete all associated data. It is unusual to call this function unless the system stopped collecting useful data, because, for example, an error occurred while running the code you wish to profile.
Fairly fine control is provided over the collection of data while the data is being collected. Sampling can be started and stopped with the functions start-sampling and stop-sampling. The macro with-sampling can be used to sample just while specified forms are evaluated. Note that data collection started with the function start-profiler may start sampling immediately or may delay sampling (depending on the value of the keyword :start-sampling-p
).
The variables *maxsamples* and *granularity* also control the behavior of the runtime analyzer. The first puts a limit on the number of samples collected. Its initial value is most-positive-fixnum (which in 64-bit Lisps means, with default *granularity*, time sampling will continue for roughly 350 million years). The second specifies the time interval between time samples. The units are clock ticks, an internal value usually about 1/100 of a second (during which time thousands of machine instructions are executed). The space analyzer, which is not statistical in the same way, ignores the value of *granularity*.
When data is being collected, everything done in Lisp is sampled, including the runtime analyzer itself. The runtime analyzer overhead (for either space or time) is small. The runtime analyzer can be turned on and off (that is suspended and resumed) with the functions start-sampling and stop-sampling.
After data collection completes, the data is stored within Lisp. If you wish to analyze the data, you must do so before sampling for another analysis is started. (You can, however, begin an analysis of the collected data and save it for later detailed analysis). If after data is collected, collecting data for a new runtime analysis is started, the data collected by the previous runtime analysis will be lost.
As an example of how to use these functions, we show how to collect data on a single function foo that never calls itself embedded in a larger piece of code. Suppose we want to analyze foo alone, the steps are as follows:
(def-fwrapper timer-wrap (&rest args)
(prog2
(prof:start-sampling)
(call-next-fwrapper)
(prof:stop-sampling)))
(fwrap 'foo 'timer-fw1 'timer-wrap)
Start the runtime analyzer with prof:start-profiler with nil
as the value of the keyword :start-sampling-p
. This prepares the runtime analyzer to collect data without actually beginning sampling.
Run the code containing calls to foo.
Stop data collection when done.
It should be clear why foo cannot call itself: the fwrapper code placed around foo will cause data collection to stop when the embedded call to foo returns even though the original call is still being evaluated. If a function does call itself (or calls another function which calls it back), use the following code:
(defparameter *prof-depth* 0)
(def-fwrapper timer-wrap-2 (&rest args)
(prog2
(progn (incf *prof-depth*) (if (= 1 *prof-depth*)
(prof:start-sampling)))
(call-next-fwrapper)
(decf *prof-depth*)
(if (zerop *prof-depth*) (prof:stop-sampling))))
(fwrap 'foo 'timer-fw2 'timer-wrap-2)
Collecting data on more than one function is more complex, because of embedded calls. Unless you are sure that the functions you are interested in never call each other, you must use global variables as flags or semaphores as above to be sure that data collection is done exactly when you want.
This section describes the interaction between the runtime analyzer and the multiprocessing capabilities of Allegro Common Lisp. The multiprocessing facility is described in multiprocessing.html. The runtime analyzer has the ability to collect data on all processes or just a selected set.
Data collection on processes can be controlled on a process by process basis with the function mp:profile-process-p which returns the value of the profile-process-p flag of an argument process object. If that flag is nil
, samples are not collected when the process is running. Setf can be used to change the value of that flag on a process.
The operators prof:start-profiler and prof:with-profiling accept the keyword argument :except-processes
or :except-stackgroups
(non-OS-threads paltforms) which allows another form of control over which processes have data collected on them. By giving a list of processes as the value of that argument, you can prohibit the runtime analyzer from collecting samples for the given processes. The Process Browser in Allegro Composer (on UNIX) provides a way to set this flag with the mouse.
The runtime-analyzer has been enhanced to separate closure calls. prof:start-profiler and prof:with-profiling have a new keyword called :interpret-closures
. When :interpret-closures
is nil
(the default) the closures are not saved and analyzed separately so the behavior is the same as in earlier releases.When true, closures are saved and analyzed as separate entries, even though they may use the same template function.
A function spec called closure
has been added (see implementation.html for information on function specs). This is valid whenever analyzed data exists and identifies the closures which the runtime analyzer has seen. The function spec is (prof:closure n)
where n is a small integer enumerating the closures in the data. This allows the user to grab the closure object and to do a disassemble on it without having to know what symbol holds its fdefinition
(which is sometimes very hard).
When flat analyses and call-graphs are reported, closures are identified by (prof:closure n)
, followed by the printed representation of the closure object itself.
Space and time analysis results can be displayed graphically in the IDE (see cgide.html but can only be printed to the terminal if you are not. Graphical display in the IDE uses the Runtime Analyzer Results dialog. The system will display space or time data automatically according to what kind of data was collected.
See also Profiles in a function disassembly for how to display profile data in a disassembly.
Note that you should not collect data for additional analyses once you have started to display results on the terminal. Only do another runtime analyzer run when your analysis is complete. Note too that you can save the results with save-current-profile for how to display profile data in a disassembly.
You can look at the data in two ways: a flat analysis and a call-graph. A flat analysis shows the time or space information sorted by function from largest consumer of time or space to least consumer.
The call-graph format shows information about the callers and callees of each function grouped with the information about the function.
The two functions show-call-counts and list-call-counts report call-count results. The first prints the results and the second returns them in a list. (See the definitions of the functions for information on specifying an analysis other than the current one and controlling the number of functions about which information is printed.)
The function show-flat-profile prints a flat analysis to standard-output
(or another stream if you desire). It shows how much time is spent or space is used by functions calls when code is executed. There is an example of a flat analysis output below.
The function show-call-graph allows you to examine which functions call and are called by other functions, and how time spent or space usage is distributed among the callees and callers.
The function show-outline-profile shows another view of the information, printing an outline with indentations showing what calls what and displaying the fraction of time spent. It also provides sequence numbers for calls which allows them to be more easily referenced.
There are examples of each type of output below.
Here is how show-call-graph and show-outline-profile work. First show-call-graph.
For each function, F, a call-graph shows the functions that called it (parents of F) and the functions that it called (children of F). Roughly speaking, a call-graph is the flat analysis display expanded to show parents and children. The display for function F is surrounded by dashed lines. F's display shows first information about the parents of F, then information about F itself, then information about F's children. (See the example later in this section.)
The following algorithm is used to find the parents of a function F. Each sample is examined and if F appears in the call stack, the function calling F is recorded as the parent. If F appears more than once in the call stack, all the different parents (callers) are recorded, but if the same caller calls F several times in the stack, only the oldest instance is recorded. The time spent in (or the space used by) F and its descendents due to calls by a particular parent is proportional to the ratio of the number of samples where this particular parent of F was found to the total number of samples.
The algorithm for finding children is as follows. Each sample is examined and if F is in the call stack, the function it calls is recorded as its child. If F appears more than once in the stack, each different child is recorded but if the same child appears more than once, only the oldest instance is recorded. The time spent in (or the space used by) a particular child and its descendents due to calls by F is proportional to the ratio of the number of samples where this particular child of the function F was found to the number of samples.
If there is a recursive entry in a call-graph, then the local % entry for that function will have an r after it. If there is a recursive entry, then time attributed to this entry may be counted twice because of recursion. For example, if bar calls baz calls foo calls baz, then foo and bar are parents of baz, and there will be two entries for parents of baz, corresponding to only 1 sample. So the report shows two lines (one for foo and one for bar), but the time is double counted. Similarly for a child case: bar calls foo calls bar calls baz. Then foo and baz are both children of bar (reported on 2 separate lines in the report) corresponding to only 1 sample.
The output of show-outline-profile display much the same information as show-call-graph -- displaying which functions are called by which functions, but in a more compact format using indentation to indicate calls. The functions called by function F are displayed below F and indented. When F calls multiple functions, there are listed in the order they are called by F and equally indented. The displayed lines have sequence numbers which can be used to refer to a particular call.
The output of a flat analysis and particularly of a call-graph and an outline may be voluminous. In order to reduce the amount of data to a manageable amount, the default display of either a flat analysis or a call-graph/outline suppresses some internal functions and packages. When a package is hidden, functions named by symbols in the package are not displayed. The functions and packages suppressed are in the lists *hidden-functions* and *hidden-packages*, respectively. You can suppress display of extra functions by adding them to the list that is the value of *hidden-functions*. You can suppress the display of extra packages (that is, of functions named by symbols in the packages) by adding them to the list that is the value of *hidden-packages*.
You can specify a maximum indentation for outlines with the variable *outline-max-indentation*.
When functions are suppressed from the display, the runtime analyzer attributes any effects these functions might have had to the function that called them. So, assuming the calling function is itself not suppressed from the display, the calling function in the runtime analyzer display can be thought of as a kind of composite, made up of itself and the suppressed function(s) it called. Functions which call hidden functions are indicated in the runtime analyzer display by prepending "..." to the function name.
In a call-graph and an outline all directly called undisplayed functions are attributed to the caller. Such undisplayed functions may eventually directly or indirectly call displayed functions, which will be shown in the call graph/outline as children of the original caller. I.e., if foo calls bar calls baz, and bar is suppressed, then foo will be displayed as "... foo" and baz will be shown as a child of foo. foo gets dots because it can be thought of as a composite of foo and bar (as we explained above).
Sometimes a called function and all its descendants may be suppressed (e.g., none of them are displayable), in which case the runtime analyzer display just uses "..." to indicate that there are completely suppressed cases.
You can further control the amount of analysis you see with the variables *significance-threshold* and *fractional-significance-threshold*.
The variable *hidden-space-functions* lists functions (mostly kernel functions named with strings rather than symbols) whose suppression makes the space analyzer output more understandable. The functions mostly actually allocate space requested (directly or implicitly) by other functions presumably of more interest.
Note some things about the display control variables we have just discussed.
They only affect the display of runtime analyzer data, not the collection of that data. Therefore, with the results of a analysis available, you can display the data with various values for these variables.
Setting *hidden-packages*,a *hidden-functions*, and *hidden-space-functions* all to nil
and setting both *significance-threshold* and *fractional-significance-threshold* to 0.0 results in all available detail from the run being displayed.
You can always reset the value of the five display control variables, *hidden-packages*, *hidden-functions*, *hidden-space-functions*, *significance-threshold*, and *fractional-significance-threshold*, with the function reset-profiler-display-defaults. That function does not reset the value of *outline-max-indentation*.
The function show-all sets the five display control variables to nil
or 0.0 (as appropriate) so that all resuts are displayed. show-all does not modify the value of *outline-max-indentation*.
In earlier releases, the function disassemble-profile was used to display a diassembly with profile information. Use of this function is now deprecated and instead users should call cl:disassemble with the new profile keyword argument as we now describe.
The function disassemble has been extended in Allegro CL to accept a profile keyword argument. When that argument is specified non-nil
, a disassembly of a specific function will be printed which indicates the estimated time spent for each instruction. Calls to other functions are identified as jumps to subroutines (using hardware specific terminology). We remind you that the time data is statistical and so the time spent in particular instructions are estimates. The less time spent in a function overall, the worse the estimate.
There is an example of a profile in a disassembly in A disassembly with profiling example below.
The runtime analyzer emits headers similar to the following (the numbers are different of course) in the show-flat-profile or show-call-graph output:
Sampling stopped after 1181 samples taken for the time profiler.
Sample represents 1.2 seconds of user+system CPU time
(out of a total of 5.5).
Sampling is done on the basis of accumulated processor time in threads known to the Allegro CL implementation. The value "1181" is the number of times a Lisp thread accumulated sufficient processor time to qualify as a sample. The value "1.2" is a total sampling time calculated by multiplying the number of samples by the sampling granularity. The value "5.5" is the total user and system time used by the Lisp process during the sampling period. When this value is the same as the total sampling time, it is not included in the message.
The total user+system time can be larger than the total sampling time when multiple processors are available. In Virtual Threads Lisps (ie non-os-threads, non-SMP) only one Lisp thread may be running at any moment, but there may be system threads and foreign threads running and accumulating time in parallel with the Lisp thread. In SMP Lisps the total user+system time will often be larger than the total sampling time since several Lisp threads can run in parallel.
Note that when a Lisp thread is waiting, it accumulates no time at all and thus does not appear in the profiler output.
When the profiler's low-level mechanism sees a function on the stack, it records the fact that it's there, but it doesn't use the address of the funtion to identify it; instead it uses a hash code that has been generated when the function object was created. Every function object gets a hash code from a single counter when the object is created. Because of this, uniqueness is mostly guaranteed, except when the counter wraps around (and this is rare because of the sheer size of the function's hash code field).
When it comes time to identify the function whose hash code was recorded, however, that function may have disappeared, i.e. the garbage collector may have noticed that there are no references to that function object and thus collects the object. In response the profiler creates an ephemeral function, which serves as a placeholder for the real object that disappeared. The name of the ephemeral object is (:internal <ephemeral-name> <hash-code>)
where <ephemeral-name>
is literally prof::<ephemeral>
and <hash-code>
is the hash code that was recorded with the function object.
It my be hard to figure out which function went away, but the fact that hash-codes are assigned incrementally might help the user to figure out what function was removed, based on other functions allocated at around the same time.
This section gives examples of using the runtime analyzer. These examples were run using Allegro CL 64-bit for Linux. The results might be quite different if run on a different platform. Further, optimization settings during compilation dramatically affect the results. You should not expect to see the results included here if you try the examples on your system.
The first example uses the time runtime analyzer. We will profile the following functions, which we define and compile.
(defun fact (x)
(if (= x 1) 1
(* x (fact (- x 1)))))
(defun bar (x)
(dotimes (i 2000)
(make-array x)))
(defun foo (x)
(fact x)
(bar x))
(compile 'fact)
(compile 'foo)
(compile 'bar)
First we collect the data by evaluating the following form calling with-profiling:
(prof:with-profiling (:type :time) (foo 20000))
We display the flat analysis by evaluating the following form calling show-flat-profile:
(prof:show-flat-profile)
That causes the following to be printed (the results you see may be different, of course):
Sampling stopped after 94 samples taken for the time profiler.
Sample represents .9 seconds of user+system CPU time (out of a total of 1.0).
Times below 1.0% will be suppressed.
% % self total self total Function
Time Cum. secs secs calls ms/call ms/call name
47.9 47.9 0.4 0.4 "sigprocmask"
37.2 85.1 0.3 0.8 "bignum_multiply"
12.8 97.9 0.1 0.1 "new_simple_vector"
1.1 98.9 0.0 0.0 "block_asynch_signals"
1.1 100.0 0.0 0.0 "mode_line_gc_end"
The first question you might have is where are fact, foo, and bar? (fact, foo, and bar are the functions we defined at the beginning of the example.) A flat analysis displays entries ordered by the (estimated) time a function is at the top of the stack, not on the stack. The three functions we defined call the low-level functions listed to do their work and so themselves spend little time at the top of the stack, so little that it falls below the default value of *significance-threshold* (0.01, equivalent to 1%) for fact and foo. If you printed this display with the value of *significance-threshold* 0.0, you would see fact and foo (and quite a few more functions) listed.
We can see from the output that most of the time is spent in the Lisp runtime-system functions bignum_multiply and make_svector_common. (We know they are runtime-system functions because they are named by strings.)
Now let us look at the call graph. We display it by evaluating the form calling show-call-graph:
(prof:show-call-graph)
Here is the call graph that is printed (the results you see may be different, of course):
Time profile of sampled pc values by function, children, and parents.
Total times below 1.0% will be suppressed.
Parent and child times less 2.0% of the node will be suppressed.
Sampling stopped after 94 samples taken for the time profiler.
Sample represents .9 seconds of user+system CPU time (out of a total of 1.0).
% % Parent
self total total local Function
Time Time secs % Child
0.0 100.0 0.9 0.0 ==> ... "start"
0.9 100.0 ... "start_reborn_lisp"
-----------------------------------------------------
0.9 100.0 ... "start"
0.0 100.0 0.9 0.0 ==> "start_reborn_lisp"
0.9 100.0 "first_lisp_thread"
-----------------------------------------------------
0.9 100.0 "start_reborn_lisp"
0.0 100.0 0.9 0.0 ==> "first_lisp_thread"
0.9 100.0 excl::lisp-thread-start
-----------------------------------------------------
0.9 100.0 "first_lisp_thread"
0.0 100.0 0.9 0.0 ==> excl::lisp-thread-start
0.9 100.0 excl::run-initial-thread
-----------------------------------------------------
0.9 100.0 excl::lisp-thread-start
0.0 100.0 0.9 0.0 ==> ... excl::run-initial-thread
0.9 100.0 ... excl::thread-reset-catcher
-----------------------------------------------------
0.9 100.0 ... excl::run-initial-thread
0.0 100.0 0.9 0.0 ==> excl::thread-reset-catcher
0.9 100.0 excl::thread-bind-and-call
-----------------------------------------------------
0.9 100.0 excl::thread-reset-catcher
0.0 100.0 0.9 0.0 ==> ... excl::thread-bind-and-call
0.9 100.0 ... eval
-----------------------------------------------------
0.9 100.0 ... excl::thread-bind-and-call
0.0 100.0 0.9 0.0 ==> eval
0.9 100.0 (:anonymous-lambda 13)
-----------------------------------------------------
0.9 100.0 eval
0.0 100.0 0.9 0.0 ==> (:anonymous-lambda 13)
0.9 100.0 prof::prof-a-funcall
-----------------------------------------------------
0.9 100.0 (:anonymous-lambda 13)
0.0 100.0 0.9 0.0 ==> prof::prof-a-funcall
0.9 100.0 (:internal (:anonymous-lambda 13) 0)
-----------------------------------------------------
0.9 100.0 prof::prof-a-funcall
0.0 100.0 0.9 0.0 ==> (:internal (:anonymous-lambda 13) 0)
0.9 100.0 foo
-----------------------------------------------------
0.9 100.0 (:internal (:anonymous-lambda 13) 0)
0.0 100.0 0.9 0.0 ==> foo
0.8 84.0 fact
0.2 16.0 bar
-----------------------------------------------------
0.8 100.0r fact
0.8 100.0r foo
0.0 84.0 0.8 0.0 ==> fact
0.8 100.0r "integer_multiply"
0.8 100.0r fact
-----------------------------------------------------
0.8 100.0 fact
0.0 84.0 0.8 0.0 ==> "integer_multiply"
0.8 100.0 "bignum_multiply"
-----------------------------------------------------
0.8 100.0 "integer_multiply"
37.2 84.0 0.8 44.3 ==> "bignum_multiply"
0.4 54.4 "sigprocmask"
-----------------------------------------------------
0.4 95.6 "bignum_multiply"
0.0 4.4 bar
47.9 47.9 0.4 100.0 ==> "sigprocmask"
-----------------------------------------------------
0.2 100.0 foo
0.0 16.0 0.2 0.0 ==> bar
0.1 80.0 "new_simple_vector"
0.0 13.3 "sigprocmask"
0.0 6.7 "mode_line_gc_end"
-----------------------------------------------------
0.1 100.0 bar
12.8 12.8 0.1 100.0 ==> "new_simple_vector"
-----------------------------------------------------
0.0 100.0 bar
1.1 1.1 0.0 100.0 ==> "mode_line_gc_end"
-----------------------------------------------------
0.0 100.0 "bignum_multiply"
1.1 1.1 0.0 100.0 ==> "block_asynch_signals"
-----------------------------------------------------
Look at the entry for foo (12th down). It tells us that foo spends much more time in fact (84%) than it does in bar (16%). Looking at the entry for fact, the "r" next to the local percent numbers indicate that this function has been called recursively. (The suspension points ...
that appear in some entries mean that hidden functions are being called.)
To show call-counting, we call for a time analysis again but specify :count 100 and :count-list (list #'fact #'bar #'make-array)
, so the call to with-profiling is
(prof:with-profiling
(:type :time :count 100 :count-list (list #'fact #'bar #'make-array))
(foo 20000))
Here is the flat analysis produced by show-flat-profile and the printout of show-call-counts (somewhat abbreviated).
cl-user(23): (prof:show-flat-profile)
Sampling stopped after 87 samples taken for the time profiler.
Sample represents .9 seconds of user+system CPU time.
Times below 1.0% will be suppressed.
% % self total self total Function
Time Cum. secs secs calls ms/call ms/call name
43.7 43.7 0.3 0.3 "sigprocmask"
40.0 83.7 0.3 0.6 "bignum_multiply"
15.0 98.7 0.1 0.1 "new_simple_vector"
1.2 100.0 0.0 0.0 "getrusage"
cl-user(18): (prof:show-call-counts)
21,999 #<Function *_2op>
2,000 #<Function calc-array-size>
216 #<Function fspec-candidate-p>
216 #<Function function-cons-name-p>
216 #<Function function-name-p>
216 #<Function instance-read-1>
108 #<Function arglist>
108 #<Function autoloadp>
108 #<Function class-of>
108 #<standard-generic-function aclmop:class-precedence-list>
108 #<Function clean-up-scheduled-finalizations>
108 #<Function closure-has-arglist-p>
108 #<Closure Template Function default-gc-after-hook @ #x1000002de22>
108 #<Function encapsulated-basic-definition>
108 #<Function gc-after>
108 #<Function generic-function-p>
108 #<Function internal-class-of>
108 #<Function length>
108 #<Function pop-finalization>
108 #<Function process-finalizations>
108 #<Function release-scheduled-finalization-list>
108 #<Function resident-function>
108 #<Function try-reserve-scheduled-finalization-list>
108 #<Function typep-class>
2 #<standard-generic-function profiler::actual-profile-type>
2 #<Function structure-ref>
1 #<standard-generic-function profiler::actual-profile-status>
1 #<Function bar>
1 #<Function fact>
1 #<Function foo>
1 #<Function increment-call-counts>
1 #<Function instance-write-1>
[...]
cl-user(19):
Note that the call counts are by function object, not function name, that the call-counting overhead in this case is not very significant -- see the comparable result without call-counting above -- (but in other cases it might be), and that runtime-system functions (named by strings in the flat analysis) are not counted.
We next show an example of a space analysis. We use the following code, which (1) uses a lot of space and (2) uses different types of space. Consider the following functions:
(defun foo (n)
(let ((array (bar n)))
(dotimes (i (* n 8000))
(setf (aref array i) (cons nil nil)))))
(defun bar (n)
(make-array (* n 8000)))
(compile 'foo)
(compile 'bar)
We assume these functions have been defined and compiled within Lisp. Note that bar creates an array (which will trigger calls to new_other or new_simple_vector) and foo does a lot of consing, which will trigger calls to _page_from_new or lisp_cons.
We call these functions in a with-profiling form:
(prof:with-profiling (:type :space) (foo 1))
Before looking at the results, we set *hidden-functions* to nil
, so we can see what system functions are actually allocating the space.
(setf prof:*hidden-functions* nil)
We print the results with show-flat-profile.
(prof:show-flat-profile)
At first, we see foo and bar but no internal functions (the suspension points ...
mean that functions called by foo and bar are using the space). Then we set *hidden-space-functions* to nil
and we see the runtime-system space functions actually called.
Sampling stopped after 23 samples taken for the space profiler.
Sample represents 184.0 Kbytes of space allocated.
Mem.s below 1.0% will be suppressed.
% % self total self total Function
Mem. Cum. Kbyte Kbyte calls by/call by/call name
65.2 65.2 120 184 ... foo
34.8 100.0 64 64 ... bar
cl-user(32): (setq prof:*hidden-space-functions* nil)
nil
cl-user(33): (prof:show-flat-profile)
Sampling stopped after 23 samples taken for the space profiler.
Sample represents 184.0 Kbytes of space allocated.
Mem.s below 1.0% will be suppressed.
% % self total self total Function
Mem. Cum. Kbyte Kbyte calls by/call by/call name
65.2 65.2 120 120 "lisp_cons"
34.8 100.0 64 64 "new_simple_vector"
cl-user(34):
The first thing to notice is that "lisp_cons" and "new_simple_vector" use all the space. This is not surprising since they actually allocate the space. No other functions appear since all the space is allocated by these functions.
Now we look at a call graph by calling show-call-graph.
cl-user(34): (prof:show-call-graph)
Space profile of sampled pc values by function, children, and parents.
Total mem.s below 1.0% will be suppressed.
Parent and child mem.s less 2.0% of the node will be suppressed.
Sampling stopped after 23 samples taken for the space profiler.
Sample represents 184.0 Kbytes of space allocated.
% % Parent
self total total local Function
Mem. Mem. Kbyte % Child
0.0 100.0 184 0.0 ==> "start"
184 100.0 "cont_setstack"
-----------------------------------------------------
184 100.0 "start"
0.0 100.0 184 0.0 ==> "cont_setstack"
184 100.0 "startup_lisp"
-----------------------------------------------------
184 100.0 "cont_setstack"
0.0 100.0 184 0.0 ==> "startup_lisp"
184 100.0 "start_reborn_lisp"
-----------------------------------------------------
184 100.0 "startup_lisp"
0.0 100.0 184 0.0 ==> "start_reborn_lisp"
184 100.0 "first_lisp_thread"
-----------------------------------------------------
184 100.0 "start_reborn_lisp"
0.0 100.0 184 0.0 ==> "first_lisp_thread"
184 100.0 excl::lisp-thread-start
-----------------------------------------------------
184 100.0 "first_lisp_thread"
0.0 100.0 184 0.0 ==> excl::lisp-thread-start
184 100.0 excl::run-initial-thread
-----------------------------------------------------
184 100.0 excl::lisp-thread-start
0.0 100.0 184 0.0 ==> excl::run-initial-thread
184 100.0 excl::setup-required-thread-bindings
-----------------------------------------------------
184 100.0 excl::run-initial-thread
0.0 100.0 184 0.0 ==> excl::setup-required-thread-bindings
184 100.0 excl::thread-reset-catcher
-----------------------------------------------------
184 100.0 excl::setup-required-thread-bindings
0.0 100.0 184 0.0 ==> excl::thread-reset-catcher
184 100.0 excl::thread-bind-and-call
-----------------------------------------------------
184 100.0 excl::thread-reset-catcher
0.0 100.0 184 0.0 ==> excl::thread-bind-and-call
184 100.0 #<Runtime Function Object (:runsys "lisp_apply")>
-----------------------------------------------------
184 100.0r tpl:start-interactive-top-level
184 100.0r excl::thread-bind-and-call
0.0 100.0 184 0.0 ==> #<Runtime Function Object (:runsys "lisp_apply")>
184 100.0r tpl:top-level-read-eval-print-loop
184 100.0r excl::start-lisp-execution
-----------------------------------------------------
184 100.0 #<Runtime Function Object (:runsys "lisp_apply")>
0.0 100.0 184 0.0 ==> excl::start-lisp-execution
184 100.0 tpl:start-interactive-top-level
-----------------------------------------------------
184 100.0 excl::start-lisp-execution
0.0 100.0 184 0.0 ==> tpl:start-interactive-top-level
184 100.0 #<Runtime Function Object (:runsys "lisp_apply")>
-----------------------------------------------------
184 100.0 #<Runtime Function Object (:runsys "lisp_apply")>
0.0 100.0 184 0.0 ==> tpl:top-level-read-eval-print-loop
184 100.0 tpl::top-level-read-eval-print-loop1
-----------------------------------------------------
184 100.0 tpl:top-level-read-eval-print-loop
0.0 100.0 184 0.0 ==> tpl::top-level-read-eval-print-loop1
184 100.0 excl::read-eval-print-loop
-----------------------------------------------------
184 100.0 tpl::top-level-read-eval-print-loop1
0.0 100.0 184 0.0 ==> excl::read-eval-print-loop
184 100.0 tpl::read-eval-print-one-command
-----------------------------------------------------
184 100.0 excl::read-eval-print-loop
0.0 100.0 184 0.0 ==> tpl::read-eval-print-one-command
184 100.0 eval
-----------------------------------------------------
184 100.0 tpl::read-eval-print-one-command
0.0 100.0 184 0.0 ==> eval
184 100.0 (:anonymous-lambda 15)
-----------------------------------------------------
184 100.0 eval
0.0 100.0 184 0.0 ==> (:anonymous-lambda 15)
184 100.0 prof::prof-a-funcall
-----------------------------------------------------
184 100.0 (:anonymous-lambda 15)
0.0 100.0 184 0.0 ==> prof::prof-a-funcall
184 100.0 (:internal (:anonymous-lambda 15) 0)
-----------------------------------------------------
184 100.0 prof::prof-a-funcall
0.0 100.0 184 0.0 ==> (:internal (:anonymous-lambda 15) 0)
184 100.0 foo
-----------------------------------------------------
184 100.0 (:internal (:anonymous-lambda 15) 0)
0.0 100.0 184 0.0 ==> foo
120 65.2 "lisp_cons"
64 34.8 bar
-----------------------------------------------------
120 100.0 foo
65.2 65.2 120 100.0 ==> "lisp_cons"
-----------------------------------------------------
64 100.0 foo
0.0 34.8 64 0.0 ==> bar
64 100.0 "new_simple_vector"
-----------------------------------------------------
64 100.0 bar
34.8 34.8 64 100.0 ==> "new_simple_vector"
-----------------------------------------------------
cl-user(35):
Here is an example using prof:show-outline-profile. We first define new functions.
cl-user(1): (shell "cat calls.cl")
(defun foo (n)
(declare (optimize speed (safety 0))
(fixnum n))
(dotimes (i (the fixnum (* n 10)))
(apply 'bar i nil))
(dotimes (j (the fixnum (* n 20)))
(bas j))
(dotimes (k (the fixnum (* n 30)))
(bam k)))
(defun bar (x)
(* x 2))
(defun bas (x)
(* x 2))
(defun bam (x)
(* x 2))
0
cl-user(2): :cf calls
;;; Compiling file calls.cl
;;; Writing fasl file calls.fasl
;;; Fasl write complete
cl-user(3): :ld calls
; Fast loading /user/home/calls.fasl
cl-user(4): (prof:with-profiling () (foo 10000000))
nil
cl-user(5): (prof:show-flat-profile)
Sampling stopped after 1329 samples taken for the time profiler.
Sample represents 13.3 seconds of user+system CPU time.
Times below 1.0% will be suppressed.
% % self total self total Function
Time Cum. secs secs calls ms/call ms/call name
36.2 36.2 4.8 4.8 excl::*_2op
32.5 68.7 4.3 13.3 ... foo
19.3 88.0 2.6 4.9 bam
8.7 96.8 1.2 2.7 bas
3.2 100.0 0.4 1.3 bar
cl-user(6): (prof:show-outline-profile)
100.00 "start" 0
100.00 "cont_setstack" 1
100.00 "startup_lisp" 2
100.00 "start_reborn_lisp" 3
100.00 "first_lisp_thread" 4
100.00 excl::lisp-thread-start 5
100.00 excl::run-initial-thread 6
100.00 excl::setup-required-thread-bindings 7
100.00 excl::thread-reset-catcher 8
100.00 excl::thread-bind-and-call 9
100.00 #<Runtime Function Object (:runsys "lisp_apply")> 10
100.00 excl::start-lisp-execution 11
100.00 tpl:start-interactive-top-level 12
100.00 #<Runtime Function Object (:runsys "lisp_apply")> 13
100.00 tpl:top-level-read-eval-print-loop 14
100.00 tpl::top-level-read-eval-print-loop1 15
100.00 excl::read-eval-print-loop 16
100.00 tpl::read-eval-print-one-command 17
100.00 eval 18
100.00 (:anonymous-lambda 17) 19
100.00 prof::prof-a-funcall 20
100.00 (:internal (:anonymous-lambda 17) 0) 21
100.00 foo 22
/37.10 bam 23
| 17.76 excl::*_2op 24
+27.69 #<Runtime Function Object (:runsys "lisp_apply")> 25
|/9.78 bar 26
|| 6.55 excl::*_2op 27
|\1.35 "funcall_check" 28
\20.62 bas 29
11.89 excl::*_2op 30
cl-user(7): (disassemble #'bas :profile :current)
;; disassembly of #<Function bas> collected by #<time profile current>
;; formals: x
;; code start: #x10001798fa0:
39 (20%) 0: 48 83 ec 68 sub rsp,$104
4: 4c 89 74 24 08 movq [rsp+8],r14
9: 48 83 f8 08 cmp rax,$8
13: 74 01 jz 16
15: 06 (push es) ; sys::trap-argerr
9 ( 4%) 16: 41 80 7f a7 00 cmpb [r15-89],$0 ; sys::c_interrupt-pending
21: 74 01 jz 24
23: 17 (pop ss) ; sys::trap-signal-hit
7 ( 3%) 24: 49 8b af 6f fb movq rbp,[r15-1169] ; excl::*_2op
ff ff
31: 48 c7 c6 10 00 movq rsi,$16 ; 2
00 00
38: ff 53 d0 call *[rbx-48] ; sys::tramp-two
79 (41%) 41: 4c 8b 74 24 78 movq r14,[rsp+120]
46: 4c 89 7c 24 18 movq [rsp+24],r15
51: 48 8d 64 24 68 leaq rsp,[rsp+104]
56: c3 ret
57: 90 nop
number of matched-hits/total-hits 134/190
Addresses seen in arg-saving symbol trampoline:
17 ( 8%) 0: b0 10 movb al,$16
2: 48 83 e0 7f and rax,$127
6: 4c 8b 75 fd movq r14,[rbp-3]
^^^^^ caller ^^^^^
vvvvv callee vvvvv
10: 48 3b a3 98 00 cmpq rsp,[rbx+152]
00 00
13 ( 6%) 17: 48 89 44 24 20 movq [rsp+32],rax
22: 76 27 jbe 63
24: 48 89 7c 24 38 movq [rsp+56],rdi
7 ( 3%) 29: 48 89 74 24 40 movq [rsp+64],rsi
34: 48 89 54 24 48 movq [rsp+72],rdx
39: 48 89 4c 24 50 movq [rsp+80],rcx
44: 4c 89 44 24 58 movq [rsp+88],r8
7 ( 3%) 49: 4c 89 4c 24 60 movq [rsp+96],r9
2 ( 1%) 54: 48 f7 5c 24 20 negq [rsp+32]
10 ( 5%) 59: 41 ff 66 f6 jmp *[r14-10]
63: 1e (push ds) ; sys::trap-stack-ovfl
cl-user(8): (prof:show-outline-profile :name 22) ;; sequence 22 is function foo
100.00 foo 0
/37.10 bam 1
| 17.76 excl::*_2op 2
+27.69 #<Runtime Function Object (:runsys "lisp_apply")> 3
|/9.78 bar 4
|| 6.55 excl::*_2op 5
|\1.35 "funcall_check" 6
\20.62 bas 7
11.89 excl::*_2op 8
cl-user(9): (prof:show-outline-profile :name 22)
Error: The name spec 22 is not in the time profile result.
;; The latest outline printed provides the sequence number.
;; 22 was foo is the first outline but the outline just above
;; only has 8 sequence numbers so 22 does not work.
[1] cl-user(9): :res
cl-user(10): (prof:show-outline-profile :name 4)
9.78 bar 0
6.55 excl::*_2op 1
cl-user(11):
As said in Profiles in a function disassembly above, the function disassemble has been extended to accept a profile keyword which, when specified with a saved profile name, will display profile information in the disassembly.
In the disassembly, each instruction is annotated with actual hits on that instruction, plus a percentage of the total hits in the function. Since the hits are from a statistical sampling, an estimate can be made as to where in the function time is being spent (the more hits, the better the estimate).
A hit that occurs just after (we define just after for various platforms below) an instruction that does a call probably means that the function was not itself executing but had called another function at the time of the hit. A hit that occurs any other place means the function was actually executing at the time.
By just after above, we mean different things for different architectures:
On Windows, the hit is on the instruction after the jsr, call, or bctrl instruction, respectively.
On Sparcs, the hit occurs exactly at the jmpl instruction.
Here is an example following on the examples above.
We rerun the code we profiled above (just in case you are starting here or have done something else since running that example):
defun fact (x)
(if (= x 1) 1
(* x (fact (- x 1)))))
(defun bar (x)
(dotimes (i 2000)
(make-array x)))
(defun foo (x)
(fact x)
(bar x))
(compile 'fact)
(compile 'foo)
(compile 'bar)
(prof:with-profiling (:type :time) (foo 20000))
Now we show a flat profile and give the profile a name with save-named-profile. The name which will be used in the call to cl:disassemble.
cl-user(21): (prof:show-flat-profile)
Sampling stopped after 88 samples taken for the time profiler.
Sample represents .9 seconds of user+system CPU time.
Times below 1.0% will be suppressed.
% % self total self total Function
Time Cum. secs secs calls ms/call ms/call name
42.0 42.0 0.4 0.7 "bignum_multiply"
42.0 84.1 0.4 0.4 "sigprocmask"
14.8 98.9 0.1 0.1 "new_simple_vector"
1.1 100.0 0.0 0.0 "write"
cl-user(22): (prof:save-named-profile 'foo-prof)
#<time profile foo-prof>
cl-user(23):
Now we call disassemble on the function object #'foo specifying :profile 'foo-prof
:
cl-user(23): (disassemble #'foo :profile 'foo-prof)
;; disassembly of #<Function foo> collected by #<time profile foo-prof>
;; formals: x
;; constant vector:
0: fact
1: bar
;; code start: #x10000c48b60:
0: 48 83 ec 78 sub rsp,$120
4: 4c 89 74 24 08 movq [rsp+8],r14
9: 48 83 f8 08 cmp rax,$8
13: 74 01 jz 16
15: 06 (push es) ; sys::trap-argerr
16: 41 80 7f a7 00 cmpb [r15-89],$0 ; sys::c_interrupt-pending
21: 74 01 jz 24
23: 17 (pop ss) ; sys::trap-signal-hit
24: 48 89 7c 24 68 movq [rsp+104],rdi ; x
29: 49 8b 6e 36 movq rbp,[r14+54] ; fact
33: b0 08 movb al,$8
35: ff d3 call *rbx
9 (90%) 37: 48 8b 7c 24 68 movq rdi,[rsp+104] ; x
42: 49 8b 6e 3e movq rbp,[r14+62] ; bar
46: b0 08 movb al,$8
48: ff d3 call *rbx
1 (10%) 50: 4c 8b b4 24 88 movq r14,[rsp+136]
00 00 00
58: 4c 89 7c 24 18 movq [rsp+24],r15
63: 48 8d 64 24 78 leaq rsp,[rsp+120]
68: c3 ret
69: 90 nop
number of matched-hits/total-hits 10/10
cl-user(26):
The time is spent in the instructions just after the calls to fact and bar.
Copyright (c) 2023, Franz Inc. Lafayette, CA., USA. All rights reserved.
|
Allegro CL version 11.0 |