Dynamic Metrics
Sable Research Group, McGill University
Last updated: November 27, 2002


Understanding the dynamic behavior of programs is one important aspect in developing effective new strategies for optimizing compilers and runtime systems. Research papers in these areas often say that a particular technique is aimed at programs that are numeric, loop-intensive, pointer-intensive, memory-intensive, object-oriented, concurrent, and so on. However, there appears to be no well-established standard way of determining if a program fits into any of these categories. An important goal of our work was to develop dynamic metrics that can be used to measure relevant runtime properties of programs, with the ultimate goal of establishing some standard metrics that could be used for quantitative analysis of benchmark programs in compiler research.

To be useful, dynamic metrics should provide a concise, yet informative, summary of different aspects of the dynamic behavior of programs. By concise, we mean that a small number of numeric values should be enough to summarize the behavior. For example, a complete profile of a program would not be a concise metric, whereas the fact that 90% of program execution is accounted for by 5% of the methods is a concise metric. By informative, we mean that the metric must measure some program characteristic of relevance to compiler developers, and the metric should differentiate between programs with different behaviors. For example, computing the ratio of number of lines of code to number of lines of comments does not capture anything about program behavior, and is not an informative metric, for our purposes.

In order to get a good overview of program characteristics of interest to compiler and runtime systems developers, we first studied the typical dynamic characteristics reported in papers presented over the last several years in the key conferences in the area. Although we did find many interesting experiments that suggest potential dynamic metrics, we also found that many summaries of benchmarks and results focused on static program measurements (# lines of code, # of methods, # of loops transformed, # of methods that are inlinable, # of locations pointed to at dereference sites, and so on). Based on our own experience, we suspect that this focus on static, rather than dynamic, metrics is at least partly because the static metrics are much easier to compute. Thus, one important goal of our work is to provide both a methodology for computing the dynamic metrics and a database of dynamic metrics for commonly used benchmarks.

In our development of dynamic metrics we also discovered that different program behaviors need to be summarized in different ways. For example, if one is measuring the behavior of virtual method calls in an object-oriented language like Java, one summary might be the average number of target methods per virtual call site. However, compiler developers are usually most interested in the special cases where the virtual call is monomorphic (good for inlining) or perhaps corresponds to a very small number of target methods (good for specialization). A much more relevant metric might therefore be what percentage of virtual calls correspond to 1, 2, or more than 2 targets. Thus, we define three basic ways of reporting dynamic metrics: values, bins, and percentiles, and we suggest which type is most appropriate for each situation.



Requirements for Dynamic Metrics

Dynamic metrics need to exhibit a number of characteristics in order to render clear and comparable numbers for any kind of program. The following is a non-exhaustive list of desirable qualities.




Kinds of Dynamic Metrics

While there are many possible metrics one could gather, we have found that the most commonly described metrics, and the ones which seem most useful for compiler optimization, tend to belong to just a few basic categories. These include the ubiquitous single value metrics such as averages, more detailed continuous expansions of single value metrics, hot spot detection metrics, and metrics based on discrete categorization. It is of course possible to design and use a metric that does not fit into these categories; these initial metric kinds, however, enable us to at least begin to explore the various potential metrics by considering whether an appropriate metric exists in each of our categories.

Value Metrics

The first kind of metric we present is a standard, usually single-value, answer. Many data gatherers, for instance, will present a statistic such as an average or maximum as a rough indicator of some quantity, the idea being that a single value is sufficiently accurate. Typically this is intended to allow one to observe differences in behaviour before and after some optimization, smoothing out unimportant variations. For example, a value such as running time is perhaps best presented as an average over several executions. Value metrics appear in almost every compiler research article that presents dynamic measurements.
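
As a minimal sketch of a value metric, the following Java fragment summarizes a handful of running times by their average and maximum; the timings here are hypothetical placeholders, and a real benchmarking harness would supply measured values.

import java.util.Arrays;

public class ValueMetric {
    public static void main(String[] args) {
        // Hypothetical running times (in seconds) of five executions of the
        // same benchmark; a real harness would supply measured values.
        double[] runningTimes = {12.4, 11.9, 12.7, 12.1, 12.3};
        double average = Arrays.stream(runningTimes).average().orElse(0.0);
        double maximum = Arrays.stream(runningTimes).max().orElse(0.0);
        System.out.printf("average running time: %.2f s%n", average);
        System.out.printf("maximum running time: %.2f s%n", maximum);
    }
}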

Continuous Value Metrics

Continuous value metrics are value metrics that have a continuous analogue, meant to be an expanded presentation of the value metric: for example, a running average computed over time, or a count of file accesses presented over time. Motivation for continuous value metrics arises from the inherent inaccuracy of a single value metric in many situations: a horizontal line in a graph can have the same overall average as a diagonal line, but clearly indicates very different behaviour. Additional descriptive values like standard deviation can be included in order to allow further refinement; unfortunately, such secondary metrics are themselves often inadequate to really describe the difference in behaviour, requiring further tertiary metrics, and so on. Specific knowledge of other aspects of the metric space may also be required; correct use of standard deviation, for example, requires understanding the underlying distribution of result values. Analysis situations in compiler optimization design may or may not result in simple normal distributions; certainly few if any compiler researchers verify or even argue that property. In order to present a better, less error-prone metric for situations where a single value is potentially inaccurate, a straightforward solution is to present a graph of the value over a continuous domain (such as time). Biased interpretations based on a single value are thus avoided, and an astute reader can judge the relative accuracy or appropriateness of the single value metric themselves. Continuous value metrics can thus be seen as an extension of value metrics, giving a more refined view of the genesis of a particular value.
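
A minimal sketch of this idea, using simulated rather than measured data: the loop below stands in for a profiler sampling an allocation count at regular intervals, and periodically emits the running average, producing the kind of time series that can be plotted alongside the single-value average.

import java.util.Random;

public class ContinuousValueMetric {
    public static void main(String[] args) {
        Random rng = new Random(42);
        double sum = 0.0;
        // Simulated per-interval allocation counts standing in for a real
        // profile; every tenth interval the running average is emitted so
        // the metric can be plotted against time.
        for (int tick = 1; tick <= 100; tick++) {
            double allocationsThisTick = 50 + 2 * tick + rng.nextGaussian() * 10;
            sum += allocationsThisTick;
            if (tick % 10 == 0) {
                System.out.printf("t=%3d  running average=%.1f%n", tick, sum / tick);
            }
        }
    }
}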

Percentiles

Frequently in compiler optimization it is important to know whether the relative contributions of different program elements to a metric are evenly or unevenly distributed. If a few elements dominate, then those can be considered hot, and therefore worthy of further examination or optimization. Knowing, for example, that 2% of allocation sites are responsible for 90% of allocated bytes indicates that those top 2% of allocation sites are of particular interest. By comparison, a program where 50% of allocation sites contribute 90% of allocated bytes makes more even use of allocation, and so intensive optimization of a few areas will be less fruitful. Similar metrics can be found in the compiler optimization literature; e.g., the top x% of most frequently-executed methods.
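
One way such a percentile metric might be computed is sketched below, assuming a profiler has already produced a byte count per allocation site; the bytesPerSite numbers are purely illustrative.

import java.util.Arrays;

public class PercentileMetric {
    // Returns the percentage of sites (hottest first) needed to account for
    // the given fraction of all allocated bytes.
    static double hotSitePercentage(long[] bytesPerSite, double fraction) {
        long[] sorted = bytesPerSite.clone();
        Arrays.sort(sorted);                          // ascending order
        long total = Arrays.stream(sorted).sum();
        long threshold = (long) (total * fraction);
        long running = 0;
        int sitesUsed = 0;
        for (int i = sorted.length - 1; i >= 0; i--) {  // walk from hottest down
            running += sorted[i];
            sitesUsed++;
            if (running >= threshold) break;
        }
        return 100.0 * sitesUsed / sorted.length;
    }

    public static void main(String[] args) {
        // Hypothetical bytes allocated at each of ten allocation sites.
        long[] bytesPerSite = {900_000, 50_000, 20_000, 10_000, 5_000,
                               4_000, 3_000, 2_000, 1_000, 500};
        System.out.printf("%.0f%% of sites account for 90%% of allocated bytes%n",
                          hotSitePercentage(bytesPerSite, 0.90));
    }
}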

Bins

Compiler optimization is often based on identifying specific categories of measurements, with the goal of applying different optimization strategies to different cases. A call-site optimization, for instance, may use one approach for monomorphic sites, a more complex system for polymorphic sites of degree 2, and may be unable to handle sites with a higher degree of polymorphism. In such a situation single value metrics do not measure the situation well; e.g., computing an average number of types per call site may not give a good impression of the optimization opportunities. An appropriate metric for this example would give a relative or absolute value for each of the categories of interest: 1, 2, or more than 2 target types. We refer to these kinds of metrics as bins, since the measurement task is to appropriately divide elements of the sample space into a few categories or bins. There are many examples of bins in the literature; e.g., categorizing runtime safety checks according to type (null, array, type), or the % of loops requiring fewer than x registers.
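
A minimal sketch of a bin metric for the call-site example, assuming the number of distinct target types observed at each virtual call site is already available from a profile (the typesPerCallSite array below is invented):

public class BinMetric {
    public static void main(String[] args) {
        // Hypothetical profile: number of distinct target types observed
        // at each virtual call site during a run.
        int[] typesPerCallSite = {1, 1, 1, 2, 1, 3, 1, 2, 5, 1};

        int[] bins = new int[3];                  // [1 type, 2 types, >2 types]
        for (int targets : typesPerCallSite) {
            if (targets == 1)      bins[0]++;
            else if (targets == 2) bins[1]++;
            else                   bins[2]++;
        }
        int total = typesPerCallSite.length;
        System.out.printf("1 target type:   %d%% of call sites%n", 100 * bins[0] / total);
        System.out.printf("2 target types:  %d%% of call sites%n", 100 * bins[1] / total);
        System.out.printf(">2 target types: %d%% of call sites%n", 100 * bins[2] / total);
    }
}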




Naming Metrics

We have developed a new naming scheme for our metrics which includes the metric category, the metric name and its kind, as follows:
[category].[name].[kind]

For example, memory.averageHeapSize.value and concurrency.threadDensity.bin are valid names. This new naming scheme is more informative than the traditional way of naming metrics, i.e. using a single label to represent the metric. It allows us to have both concurrency.threadDensity.value and concurrency.threadDensity.bin, which makes it clear that these metrics measure the same general aspect of program behaviour (thread density) while doing so in two different ways (one computes an average of the number of concurrent active threads, the other measures the percentage of the total execution that happens at various levels of concurrency).
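
Metric names under this scheme can be assembled mechanically from their three components, as in the small sketch below; the suffix spellings used here for the percentile and continuous kinds are illustrative assumptions, since only the value and bin suffixes appear in the examples above.

public class MetricName {
    // Kinds of metric reporting discussed above; the PERCENTILE and
    // CONTINUOUS suffix spellings are assumptions for illustration.
    enum Kind { VALUE, BIN, PERCENTILE, CONTINUOUS }

    static String metricName(String category, String name, Kind kind) {
        return category + "." + name + "." + kind.name().toLowerCase();
    }

    public static void main(String[] args) {
        System.out.println(metricName("memory", "averageHeapSize", Kind.VALUE));
        System.out.println(metricName("concurrency", "threadDensity", Kind.BIN));
    }
}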

