Metrics

Last updated: November 27, 2002

Understanding the dynamic behavior of programs is one important aspect in developing effective new strategies for optimizing compilers and runtime systems. Research papers in these areas often say that a particular technique is aimed at programs that are numeric, loop-intensive, pointer-intensive, memory-intensive, object-oriented, concurrent, and so on. However, there appears to be no well-established standard way of determining if a program fits into any of these categories. An important goal of our work was to develop dynamic metrics that can be used to measure relevant runtime properties of programs, with the ultimate goal of establishing some standard metrics that could be used for quantitative analysis of benchmark programs in compiler research.

To be useful, dynamic metrics should provide a concise, yet informative, summary of different aspects of the dynamic behavior of programs. By concise, we mean that a small number of numeric values should be enough to summarize the behavior. For example, a complete profile of a program would not be a concise metric, whereas the fact that 90% of program execution is accounted for by 5% of the methods is a concise metric. By informative, we mean that the metric must measure some program characteristic of relevance to compiler developers, and the metric should differentiate between programs with different behaviors. For example, computing the ratio of number of lines of code to number of lines of comments does not capture anything about program behavior, and is not an informative metric, for our purposes.

In order to get a good overview of program characteristics of interest to compiler and runtime systems developers, we first studied the typical dynamic characteristics reported in papers presented over the last several years in the key conferences in the area. Although we did find many interesting experiments that suggest potential dynamic metrics, we also found that many summaries of benchmarks and results focused on static program measurements (#lines of code, # of methods, # loops transformed, # methods that are inlinable, # number of locations pointed to at dereference sites, and so on). Based on our own experiences, we suspect that this focus on static, rather than dynamic, metrics is at least partly because the static metrics are much easier to compute. Thus, one important goal of our work to provide both a methodology to compute the dynamic metrics and a database of dynamic metrics for commonly used benchmarks.

In our development of dynamic metrics we also discovered that different program behaviors need to be summarized in different ways. For example, if one is measuring the behavior of virtual method calls in an object-oriented language like Java, one summary might be the average number of target methods per virtual call site. However, compiler developers are usually most interested in the special cases where the virtual call is monomorphic (good for inlining) or perhaps corresponds to a very small number of target methods (good for specialization). Thus, a much more relevant metric might be what percentage of virtual calls correspond to 1, 2, or more than 2, targets. Thus, we define three basic ways of reporting dynamic metrics, values, bins, and percentiles, and we suggest which type is most appropriate for each situation.

Requirements for Dynamic Metrics

Dynamic metrics need to exhibit a number of characteristics in order to render clear and comparable numbers for any kind of program. The following is a non-exhaustive list of desirable qualities.

Unambiguous: one lesson learned from static metric literature is that ambiguous definitions lead to unusable metrics. For instance, the most widely used metric for program size is lines of code (LOC). LOC is sufficient to give a ball park measure of program size. However, without further specification it is virtually useless to compare two programs. Are comments and blank lines counted? What is the effect of indentation? How do you compare two programs from different languages? Within a given language, the LOC of a pretty-printed version of a program with comments and blank lines removed would give an unambiguous measurement that can be used to compare two programs.

Dynamic: obviously a dynamic metric needs to be dynamic. In other words, the metric should measure an aspect of a program that can only be obtained by actually running the program. While this usually requires more work than a static measurement, the resulting numbers will be more meaningful since they will not change by adding unexecuted code to the program. Dead code should not influence the measurement.

Robust: the other side of the coin of using dynamic measurements is the possibility that those numbers are heavily influenced by the program input. Where static numbers may be meaningless because non-executed parts contribute to the numbers, dynamic numbers may be meaningless because the particular execution that is measured may not reflect the common program behavior. Unfortunately, we simply cannot guarantee that a program's input is representative. However, we can take care to define metrics that are robust with respect to the input. In other words, a small change in a program's input should cause a correspondingly small change in the resulting metric. In particular, a robust metric should not be overly sensitive to the size of a program's input. Total number of instructions executed is not a robust metric, since a bubblesort, for example, will execute four times as many instructions if the input size is increased by a factor of two. Number of different instructions executed is a robust metric since the size of the input will not drastically change the size of the part of the program that is executed.

Machine-independent: since the metrics pertain to program behavior, they should not change if the measurement takes place on a different platform (including virtual machine implementation). For example, number of objects 3 allocated per second is a platform-dependent metric which disallows comparisons between measurements from different studies, because it is virtually impossible to guarantee that they all use identical platforms. On the other hand, number of objects allocated per 1000 executed bytecode instructions (kbc), is a platform-independent metric. In general, metrics defined around the byte code as a unit of measurement are machine-independent for Java programs.

Kinds of Dynamic Metrics

While there are many possible metrics one could gather, we have found that the most commonly described metrics, and the ones which seem most useful compiler optimization, tend to be belong to just a few basic categories. This includes the ubiquitous single value metrics such as average, more detailed continuous expansions of single value metrics, hot spot detection metrics, and metrics based on discrete categorization. It is of course possible to design and use a metric that does not fit into these categories; these initial metric kinds, however, enable us to at least begin to explore the various potential metrics by considering whether an appropriate metric exists in each of our categories.
Value Metrics
The first kind of metric we present is a standard, usually one value answer. Many data gatherers, for instance, will present a statistic like average or maximum as a rough indicator of some quantity; the idea being that a single value is sufficiently accurate. Typically this is intended to allow one to observe differences in behaviour before and after some optimization, smoothing out unimportant variations. For example, a value such as running time is perhaps best presented as an average over several executions. Value metrics appear in almost every compiler research article that presents dynamic measurements.
Continuous Value Metrics
Continuous value metrics are value metrics that have a continuous analogue, meant to be an expanded presentation of the value metric. E.g., a running average computed over time, or a count of file accesses presented over time. Motivation for continuous value metrics arises from the inherent inaccuracy of a single value metric in many situations: a horizontal line in a graph can have the same overall average as a diagonal line, but clearly indicates very different behaviour. Additional descriptive values like standard deviation can be included in order to allow further refinement; unfortunately, secondary metrics are themselves often inadequate to really describe the difference in behaviour, requiring further tertiary metrics, and so on. Specific knowledge of other aspects of the metric space may also be required; correct use of standard deviation, for example, requires understanding the underlying distribution space of result values. Analysis situations in compiler optimization design may or may not result in simple normal distributions; certainly few if any compiler researchers verify or even argue that property. In order to present a better, less error-prone metric for situations where a single value is potentially inaccurate, a straightforward solution is to present a graph of the value over a continuous domain (like time). Biased interpretations based on a single value are thus avoided, and an astute reader can judge the relative accuracy or appropriateness of the single value metric themselves. Continous value metrics can then be seen as an extension to value metrics, giving a more refined view of the genesis of a particular value.
Percentiles
Frequently in compiler optimization it is important to know whether the relative contributions of aspects of a program to a metric are evenly or unevenly distributed among the program elements. If a few elements dominate, then those can be considered hot, and therefore worthy of further examination or optimization. Knowing, for example, that 2% of allocation sites are responsible for 90% of allocated bytes indicates that those top 2% of allocation sites are of particular interest. For comparison, a program where 50% of allocation sites contribute 90% of allocated bytes indicates a program that has a more even use of allocation, and so intensive optimization of a few areas will be less fruitful. Similar metrics can be found in compiler optimization literature; e.g., the top x% of most frequently-executed methods.
Bins
Compiler optimization is often based on identifying specific categories of measurements, with the goal of applying different optimization strategies to different cases. A call-site optimization, for instance, may use one approach for monomorphic sites, a more complex system for polymorphic sites of degree 2, and may be unable to handle sites with a higher degree of polymorphism. In such a situation single value metrics do not measure the situation well, e.g., computing an average number of types per call site may not give a good impression of the optimization opportunities. An appropriate metric for this example would be to give a relative or absolute value for each of the categories of interest, 1, 2, or 3 target types. We refer to these kinds of metrics as bins, since the measurement task is to appropriately divide elements of the sample space into a few categories or bins. There are many examples of bins in the literature; e.g., categorizing runtime safety checks according to type (null, array, type), the % of loops requiring less than x registers.

Naming Metrics

We have developed a new naming scheme for our metrics which includes the metric category, the metric name and its kind, as follows:
[category].[name].[kind]
For example, memory.averageHeapSize.value and concurrency.threadDensity.bin are valid names. This new naming scheme is more informative than the traditional way of naming metrics, i.e. using a single label to represent the metric. This allows us to have both concurrency.threadDensity.value and concurrency.threadDensity.bin, which makes it clear that these metrics measure the same general aspect of the program behaviour (thread density) while doing so in different two ways (one computes an average of the number of concurrent active threads, the other one measures the percentage of the total execution that is happens at various levels of concurency).