Re: [abc] PLDI paper -99 cflow and steamloom

From: Pavel Avgustinov <pavel.avgustinov@magdalen.oxford.ac.uk>
Date: Sat Jan 07 2006 - 20:02:12 GMT

The paper does seem a bit weak (the questions it answers seemed
self-evident to me when they were first introduced, and indeed the
conclusion is merely a re-statement of "common knowledge" a la "stack
walking is infeasible, counters are good").

Having said that, abc does look pretty good in their measurements, so in
a way it would feel good to see it published.. ;-)

I'm a bit worried about their claim that abc used interprocedural
analyses -- if it wasn't for the '3 minutes compilation time' towards
the end of the paper, I would have been inclined to think that they were
just running it with default optimisations. In any case, I think
including measurements for abc without interproc analyses (especially
since that would take care of their problem with the closed-world
assumption) would make the measurements rather more interesting.

As such, I think the paper could be published provided they do a better
job of explaining what they did with abc and, time permitting, include
both interproc and intraproc abc measurements.

Your comments below seem extremely helpful and reasonable.

- P

Professor Laurie Hendren wrote:

>I have been reviewing paper 99 for PLDI, which is on the topic
>of how the steamloom guys implemented cflow in their VM. This
>is the only "aspectjy" paper that seems to have any chance.
>However, there are certain problems with the paper. I would at
>least like to clarify the discussion of abc in this
>paper, even if it gets rejected - assuming that this paper will
>find a home somewhere.
>
>I have prepared the attached discussion of abc, and used the
>microbenchmark given in the paper to clarify the issue. I have
>attached what I plan to add to my review (abcnotes.txt).
>
>I would really like to have the abc'ers' opinions on this paper. Do
>we want to push it a bit - at least it shows that cflow is
>a problem and abc does well ... or is it too weak a paper
>overall? Do you agree with my discussion of abc/ajc and how
>it relates to this paper?
>
>Cheers, Laurie
>
>+-----------------------------------------------------------------
>| Laurie Hendren --- laurie.hendren@mcgill.ca
>| Associate Dean (Academic), Faculty of Science,
>| Dawson Hall, McGill University, 853 Sherbrooke St W,
>| Montreal QC H3A 2T6 Canada, 514-398-7179, fax 514-398-1774
>+----------------------------------------------------------------
>| For contact and home page info as Professor, Computer Science:
>| http://www.sable.mcgill.ca/~hendren --- hendren@cs.mcgill.ca
>| Research: http://www.sable.mcgill.ca http://aspectbench.org
>+----------------------------------------------------------------
>
>------------------------------------------------------------------------
>
>It is important to clearly understand and differentiate the
>static implementations of cflow used by ajc (called AspectJ in the
>paper) and by abc.
>
>The high overheads of cflow were first analyzed in the
>OOPSLA 2004 paper entitled "Measuring the Dynamic Behaviour of AspectJ
>Programs", which instrumented the bytecode generated by the ajc 1.1.1 compiler
>and showed very high overheads, with surprisingly high memory allocation,
>for the implementation of cflow. That paper went on to suggest
>that many of the overheads were due to the heavy-weight cflow stack
>implementation, which in many common cases could be replaced by a
>much cheaper counter implementation, and it demonstrated this by
>modifying ajc 1.1.1 to use counters. Subsequently, the ajc
>group picked up this optimization, and a simple form of it has been
>in ajc since version 1.2.1.
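>
>To make the stack-versus-counter distinction concrete, here is a small,
>purely illustrative sketch (not the actual ajc or abc runtime classes;
>in reality there is one such object per thread) of the two ways of
>representing cflow state:
>
> class CflowStack {
>   // Stack-based state: each entry to the cflow'd code pushes a frame
>   // (needed when the pointcut binds values through cflow(..)), and each
>   // exit pops it. The per-entry allocation is a large part of the cost.
>   private final java.util.Stack<Object> frames = new java.util.Stack<Object>();
>   void push(Object frame) { frames.push(frame); }
>   void pop()              { frames.pop(); }
>   boolean isValid()       { return !frames.isEmpty(); }
> }
>
> class CflowCounter {
>   // Counter-based state: when the advice needs no values from the cflow,
>   // a plain depth counter answers "are we in the cflow?" with no allocation.
>   private int depth;
>   void inc()        { depth++; }
>   void dec()        { depth--; }
>   boolean isValid() { return depth > 0; }
> }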
>
>The abc compiler was designed around efficient code generation
>strategies, optimizations and an efficient runtime library.
>As presented in the PLDI 2005 paper "Optimising AspectJ", cflow
>was optimized in many different ways, both intraprocedurally and
>interprocedurally.
>
>Intraprocedural Optimizations (applied by default, same as -O1 flag)
>-----------------------------
>
>First, there are INTRAPROCEDURAL optimizations (section 4.1 of the PLDI05
>paper) that do the following:
> 1. share cflow counters/stacks for equivalent cflows
> 2. use counters instead of stacks whenever possible
> 3. reuse the same ThreadLocal cflow counter/stack within a method
>
>I think that 2. and 3. apply to the microbenchmarks used in this paper;
>the benchmarks are not complex enough to need 1.
>
>In addition to these cflow-specific optimizations, abc also has general
>optimizations and code generation strategies for minimizing the overhead
>of calling advice. The first reuses the aspectOf instance when it is
>used multiple times within the same method. The second inlines the
>advice body when it is small. Both of these optimizations apply to the
>microbenchmarks in this paper.
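>
>Purely as an illustration (the names here are made up; the real abc output
>for m() is shown at the end of these notes), this is roughly what these two
>optimizations change in the woven code:
>
> // Hypothetical stand-in for the aspect class generated by the compiler.
> class ExampleAspect {
>   private static final ExampleAspect INSTANCE = new ExampleAspect();
>   static ExampleAspect aspectOf() { return INSTANCE; }
>   int ctr;
>   void beforeAdvice() { ctr++; }   // a small advice body
> }
>
> class WovenExample {
>   // Without the optimizations: fetch the aspect and call the advice
>   // at every matched join point.
>   void unoptimized() {
>     ExampleAspect.aspectOf().beforeAdvice();
>     ExampleAspect.aspectOf().beforeAdvice();
>   }
>
>   // With aspectOf reuse and inlining of the small advice body:
>   void optimized() {
>     ExampleAspect a = ExampleAspect.aspectOf();  // fetched once per method
>     a.ctr++;                                     // advice body inlined
>     a.ctr++;
>   }
> }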
>
>Finally, abc uses a different runtime library than ajc, one which has been
>built to reduce the cost of performing operations on cflow stacks/counters
>and to minimize the cost of retrieving the ThreadLocal cflow stacks/counters
>(by caching the last used one and reusing it if the current thread is the
>same as the last thread).
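>
>The real classes live in org.aspectbench.runtime.internal; the following is
>just a rough sketch of that thread-caching idea (Counter and
>getThreadCounter() appear in the decompiled code at the end of these notes,
>but the factory class name and the internals here are my guess at what the
>library does):
>
> class Counter {
>   public int count;   // exposed so woven code can inc/dec it inline
> }
>
> class CflowCounterFactory {
>   private Thread lastThread;     // thread seen on the previous call
>   private Counter lastCounter;   // its counter, reused on the fast path
>   private final java.util.Map<Thread, Counter> perThread =
>       new java.util.WeakHashMap<Thread, Counter>();
>
>   public synchronized Counter getThreadCounter() {
>     Thread t = Thread.currentThread();
>     if (t == lastThread) {
>       return lastCounter;        // same thread as last time: no map lookup
>     }
>     Counter c = perThread.get(t);
>     if (c == null) {
>       c = new Counter();
>       perThread.put(t, c);
>     }
>     lastThread = t;
>     lastCounter = c;
>     return c;
>   }
> }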
>
>Interprocedural Optimizations (applied when the -O3 flag is used)
>------------------------------
>
>These optimizations do a whole program analysis and eliminate cflow
>operations when they can be shown to be unnecessary. For the example
>program in the paper, the analysis can determine that the before advice
>always applies to the calls to x() within the body of m(), that it never
>applies to calls to x() within the body of m1(), and that it MAY apply to
>the call to x() within the body of foo(). Since there is at least one
>MAY occurrence, the counter must still be maintained for the execution of m().
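>
>For reference, here is a rough reconstruction of the shape of that program
>(the real Listing 2 is in the paper; the class name C, the loop structure
>and the exact pointcut are guesses based on the decompiled code shown at
>the end of these notes):
>
> aspect CflowAspect {
>   int ctr = 0;
>   before(): call(void C.x()) && cflow(execution(void C.m(int))) {
>     ctr++;   // small advice body; abc inlines it at the matched call sites
>   }
> }
>
> class C {
>   int entries;
>   void m(int n)  { entries++; while (n-- != 0) { foo(); x(); } } // x() here: ALWAYS in cflow(m)
>   void m1(int n) { while (n-- != 0) { x(); } }                   // x() here: NEVER in cflow(m)
>   void foo() { x(); }  // presumably also called from outside m(): MAY be in cflow(m)
>   void x()   { }
> }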
>
>--------------------------------
>
>Based on this discussion, there are several weaknesses in the
>experimental evaluation in the paper being reviewed, as follows.
>
>1) The paper does not say which abc optimizations it uses in the experiments.
> It says that abc must do a whole program analysis, but this is only true
> when the -O3 flag, which enables interprocedural analysis, is used.
> A substantial number of intraprocedural optimizations and efficient code
> generation strategies are used by default (-O1), and none of these
> need whole program analysis. Furthermore, on the microbenchmark used
> in the paper, the interprocedural optimizations aren't necessary (see
> data below).
>
>2) Given that abc/ajc/steamloom are all very different systems, it is very
> hard to determine, from performance results, what is causing better
> performance, and whether it is due to the treatment of cflow or to other
> factors. See below for an interesting set of results for ajc and abc
> using the microbenchmark from the paper. It would be interesting
> to expand this to include the other systems presented in the paper.
>
>----------------------------------
>
>Using the microbenchmark given in the paper (Listing 2), on one thread,
>using 10000 repetitions, I found the following results; times are in milliseconds.
>(Note that these are not well-polished results, and variance is somewhat
> high between different runs. However, the main points hold.)
>
>                                     (5,100)  (100,5)    (5,5)  (100,100)
>
>ajc1.2                                   344     5485     5479        277
>ajc1.2.1                                 166     1669     1661         83
>
>abc                                       62      600      516         26
>abc -O3                                   60      516      517         26
>
>abc -before-after-inlining:off            63      623      530         26
>abc -cflow-use-counters:off               69      602      513         26
>abc -cflow-share-thread-locals:off        81      956      960         44
>
>The main difference between ajc1.2 and ajc1.2.1 is that ajc1.2.1 uses cflow
>counters instead of stacks.
>
>The difference in performance between ajc1.2.1 and abc is due to many
>factors. Both of them use counters instead of stacks, but abc has a
>more tuned runtime library, generates more efficient code, and applies
>all of the intraproc opts explained earlier.
>
>The difference between abc and abc -O3 is that the latter also
>applies the interprocedural cflow analysis.
>Note that the interprocedural analysis does not give very much improvement
>over the intraprocedural optimizations for this benchmark. This is
>because most of the cflow overhead cannot be eliminated due to "may" cflow.
>Thus, for this benchmark, there is no benefit in doing a whole program
>analysis.
>
>To see which abc intraproc optimizations are important, we tried turning
>off the three most likely candidates.
>
>before-after-inlining:off
>-------------------------
>It seems that advice inlining is not so important; probably the VM does
>the same inlining.
>
>cflow-use-counters:off
>----------------------
>Surprisingly, for this benchmark, the use of counters (instead of stacks)
>was not so important. This is probably because abc uses a fast
>implementation of the stacks.
>
>cflow-share-thread-locals:off
>-----------------------------
>It appears that sharing of ThreadLocal counters is very important for
>this benchmark. When this sharing is disabled, there is an obvious
>slowdown. Thus, it is likely that steamloom's VM implementation of the
>ThreadLocal counters is helpful, since retrieving these counters is
>expensive.
>
>--------------------------------------------------------------
>
>Just to show the difference between the code generated by ajc1.2.1 and abc,
>here are some decompiled snippets of the method m(), after weaving. (Note
>that the duplication of the catch block is due to the decompilation
>process.) You can see that the two compilers take
>different approaches to compiling the counters. The ajc compiler
>uses calls to a runtime library for all the operations, whereas
>the abc compiler compiles the instructions directly into the
>code. Furthermore, abc reuses the ThreadLocal counter, only fetching
>it once per execution of the method, reuses the aspectOf() instance,
>and inlines the advice.
>
>By studying the code generated by abc, we can see why it does not
>leave much room for improvement by implementing cflow in a VM, as
>steamloom does. First, the operations on the counters have been
>compiled into fairly simple operations that any VM should be able
>to execute efficiently. Second, various
>optimizations have been done to reduce the number of times ThreadLocal
>counters need to be fetched. So, even though a direct VM implementation
>of the ThreadLocal counters could be faster, the abc-generated code only
>performs the ThreadLocal fetch once per method call.
>
>ajc1.2.1
>--------
> public void m(int i0) throws java.lang.Throwable
> { int $i3;
> CflowAspect.ajc$cflowCounter$0.inc();
> try
> { entries++; }
> catch (Throwable $r7)
> { CflowAspect.ajc$cflowCounter$0.dec();
> throw $r7;
> }
> label_0:
> while (true)
> { try
> { $i3 = i0;
> i0--;
> if ($i3 != 0)
> { this.foo();
> if (CflowAspect.ajc$cflowCounter$0.isValid())
> { CflowAspect.aspectOf().ajc$before$CflowAspect$1$e8317405();
> }
> this.x();
> continue label_0;
> }
> }
> catch (Throwable $r7)
> { CflowAspect.ajc$cflowCounter$0.dec();
> throw $r7;
> }
> CflowAspect.ajc$cflowCounter$0.dec();
> return;
> }
> }
>
>------------------------------------------------------------------------------
>abc
>---
>
>public void m(int i0) throws java.lang.Throwable
> {
> CflowAspect r1;
> org.aspectbench.runtime.internal.cflowinternal.Counter r2;
> int i1, i2, $i5, i9, i10, i11, i12;
>
> r1 = null;
> r2 = CflowAspect.abc$cflowCounter$0.getThreadCounter();
> i1 = r2.count;
> i2 = i1 + 1;
> r2.count = i2;
> try
> { entries++; }
> catch (Throwable $r4)
> { i9 = r2.count;
> i10 = i9 + -1;
> r2.count = i10;
> throw $r4;
> }
> label_0:
> while (true)
> { try
> { $i5 = i0;
> i0--;
> if ($i5 != 0)
> { this.foo();
> if (r2.count > 0)
> { if (r1 == null)
> { r1 = CflowAspect.aspectOf();
> }
> r1.ctr++;
> }
> this.x();
> continue label_0;
> }
> }
> catch (Throwable $r4)
> { i9 = r2.count;
> i10 = i9 + -1;
> r2.count = i10;
> throw $r4;
> }
> i11 = r2.count;
> i12 = i11 + -1;
> r2.count = i12;
> return;
> }
> }
>-------------------------------------------------------------------
>
Received on Sat Jan 7 20:02:22 2006
