It is important to clearly understand and differentiate the different static
implementations of cflow used in ajc (called AspectJ in the paper) and abc.

The high overheads of cflow were first analyzed in the OOPSLA 2004 paper
"Measuring the Dynamic Behaviour of AspectJ Programs", which instrumented the
bytecode generated by the ajc 1.1.1 compiler and showed very high overheads,
with surprisingly high memory allocation, for the implementation of cflow. That
paper also suggested that many of the overheads were due to the heavy-weight
cflow stack implementation, which in many common cases can be replaced by a
much cheaper counter implementation, and it demonstrated this by modifying
ajc 1.1.1 to use counters. Subsequently, the ajc group picked up this
optimization, and a simple form of it has been in ajc since ajc 1.2.1.

The abc compiler was designed to implement efficient code generation
strategies, optimizations and an efficient run-time library. As presented in
the PLDI 2005 paper "Optimising AspectJ", cflow was optimized in many different
ways, both intraprocedurally and interprocedurally.

Intraprocedural Optimizations (applied by default, same as -O1 flag)
--------------------------------------------------------------------
First, there are INTRAPROCEDURAL optimizations (section 4.1 of the PLDI05
paper) that do the following:

1. share cflow counters/stacks for equivalent cflow pointcuts
2. use counters instead of stacks whenever possible
3. reuse the same ThreadLocal cflow counter/stack within a method

I think that 2. and 3. apply to the microbenchmarks used in this paper; the
benchmarks are not complex enough to need 1.

In addition to these cflow-specific optimizations, abc also has general
optimizations and code generation strategies for minimizing the overhead of
calling advice. The first reuses the aspectOf instance when it is used multiple
times within the same method. The second inlines the advice body in cases where
the advice body is small. Both of these apply to the microbenchmarks in this
paper.

Finally, abc uses a different runtime library than ajc, which has been built to
reduce the cost of performing operations on cflow stacks/counters and to
minimize the cost of retrieving the ThreadLocal cflow stacks/counters (by
caching the last used one and reusing it if the current thread is the same as
the last thread).

Interprocedural Optimizations (applied when the -O3 flag is used)
-----------------------------------------------------------------
These optimizations do a whole-program analysis and eliminate cflow operations
when they can be shown to be unnecessary. For the example program in the paper,
the analysis can determine that the before advice always applies to the calls
to x() within the body of m(), that it never applies to calls to x() within the
body of m1(), and that it MAY apply to the call to x() within the body of
foo(). Since there is at least one MAY occurrence, the counter must still be
maintained for the execution of m().
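To make the always/never/MAY distinction concrete, here is a small sketch of
the kind of program the -O3 analysis reasons about. This is my own
reconstruction from the description above, not the paper's actual listing; the
names m(), m1(), foo() and x() come from the discussion, everything else is
assumed.

public aspect CflowAspect {
    before() : call(* x()) && cflow(execution(* m(..))) {
        // advice body
    }
}

class Example {
    void m(int n) {
        foo();
        x();    // always within cflow(execution(m)): advice statically known to apply
    }
    void m1() {
        foo();
        x();    // never within cflow(execution(m)): the dynamic check can be removed
    }
    void foo() {
        x();    // MAY be within cflow(execution(m)), depending on whether foo() was
                // called from m() or from m1(); the dynamic check stays, so the
                // counter for m() must still be maintained
    }
}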
--------------------------------
Based on this discussion, there are several weaknesses in the experimental
evaluation in the paper being reviewed, as follows.

1) The paper does not say which abc optimizations it uses in the experiments.
It says that abc must do a whole-program analysis, but this is only true when
the -O3 flag, which enables the interprocedural analysis, is used. A
substantial number of intraprocedural optimizations and efficient code
generation strategies are applied by default (-O1), and none of these need
whole-program analysis. Furthermore, on the microbenchmark used in the paper,
the interprocedural optimizations are not even necessary (see the data below).

2) Given that abc, ajc and steamloom are all very different systems, it is very
hard to determine from performance results alone what is causing better
performance, and whether it is due to the treatment of cflow or to other
factors. See below for an interesting set of results for ajc and abc using the
microbenchmark from the paper. It would be interesting to expand this to
include the other systems presented in the paper.

----------------------------------
Using the microbenchmark given in the paper (Listing 2), on one thread, using
10000 repetitions, I found the following results, with times in milliseconds.
(Note that these are not well-polished results, and variance is somewhat high
between different runs. However, the main points hold.)

                                    (5,100)  (100,5)   (5,5)  (100,100)
ajc1.2                                  344     5485    5479        277
ajc1.2.1                                166     1669    1661         83
abc                                      62      600     516         26
abc -O3                                  60      516     517         26
abc -before-after-inlining:off           63      623     530         26
abc -cflow-use-counters:off              69      602     513         26
abc -cflow-share-thread-locals:off       81      956     960         44

The main difference between ajc1.2 and ajc1.2.1 is that ajc1.2.1 uses cflow
counters instead of stacks (a conceptual sketch of the two representations is
given below, after the flag-by-flag discussion).

The difference in performance between ajc1.2.1 and abc is due to many factors.
Both of them use counters instead of stacks, but abc has a more tuned runtime
library, generates more efficient code, and applies all of the intraprocedural
optimizations explained earlier.

The difference between abc and abc -O3 is that the interprocedural cflow
analysis is applied. Note that the interprocedural analysis does not give much
improvement over the intraprocedural optimizations for this benchmark. This is
because most of the cflow overhead cannot be eliminated, due to "may" cflow.
Thus, for this benchmark, there is no benefit in doing a whole-program
analysis.

To see which abc intraprocedural optimization matters most, we tried turning
off the three most likely candidates.

before-after-inlining:off
-------------------------
It seems that advice inlining is not so important; probably the VM does the
same inlining itself.

cflow-use-counters:off
----------------------
Surprisingly, for this benchmark, the use of counters (instead of stacks) was
not so important. This is probably because the abc compiler uses a fast
implementation of the stacks.

cflow-share-thread-locals:off
-----------------------------
It appears that sharing of ThreadLocal counters is very important for this
benchmark. When this sharing is disabled, there is an obvious slowdown. Thus,
it is likely that steamloom's VM implementation of the ThreadLocal counters is
helpful, since retrieving these counters is expensive.
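For reference, here is a minimal conceptual sketch of the two cflow state
representations discussed above. This is my own simplification, not the actual
ajc or abc runtime classes: a cflow stack has to push and pop an entry around
every execution of m(), which typically costs an allocation per entry, whereas
a counter only increments and decrements an int.

final class CflowStackSketch {
    private final ThreadLocal<java.util.ArrayDeque<Object[]>> stack =
        ThreadLocal.withInitial(() -> new java.util.ArrayDeque<Object[]>());

    void push(Object[] boundState) { stack.get().push(boundState); } // one entry (and allocation) per enclosing execution
    void pop()                     { stack.get().pop(); }
    boolean isValid()              { return !stack.get().isEmpty(); }
}

final class CflowCounterSketch {
    private final ThreadLocal<int[]> count = ThreadLocal.withInitial(() -> new int[1]);

    void inc()        { count.get()[0]++; }   // no allocation on the fast path
    void dec()        { count.get()[0]--; }
    boolean isValid() { return count.get()[0] > 0; }
}

This is the shape of the inc()/dec()/isValid() calls visible in the ajc1.2.1
woven code below. When the cflow pointcut has to expose bound context, a stack
is still required, which is why the counter optimization only applies
"whenever possible".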
--------------------------------------------------------------
Just to show the difference between the code generated by ajc1.2.1 and abc,
here are some decompiled snippets of the method m(), after weaving. (Note that
the duplication of the catch block is due to the decompilation process.) You
can see that the two compilers take different approaches to compiling the
counters. The ajc compiler uses calls to a runtime library for all the
operations, whereas the abc compiler compiles the instructions directly into
the code. Furthermore, abc reuses the ThreadLocal counter, fetching it only
once per execution of the method, reuses the aspect object returned by
aspectOf(), and inlines the advice.

By studying the code generated by abc, we can see why it does not leave much
room for improvement by implementing cflow in a VM, as steamloom does. First,
the operations on the counters have been compiled into fairly simple operations
that any VM should be able to execute efficiently. Second, various
optimizations have been done to reduce the number of times ThreadLocal counters
need to be fetched. So, even though a direct VM implementation of the
ThreadLocal counters could be faster, the abc-generated code only performs the
ThreadLocal fetch once per method call.

ajc1.2.1
--------
public void m(int i0) throws java.lang.Throwable {
    int $i3;

    CflowAspect.ajc$cflowCounter$0.inc();
    try {
        entries++;
    } catch (Throwable $r7) {
        CflowAspect.ajc$cflowCounter$0.dec();
        throw $r7;
    }

    label_0:
    while (true) {
        try {
            $i3 = i0;
            i0--;
            if ($i3 != 0) {
                this.foo();
                if (CflowAspect.ajc$cflowCounter$0.isValid()) {
                    CflowAspect.aspectOf().ajc$before$CflowAspect$1$e8317405();
                }
                this.x();
                continue label_0;
            }
        } catch (Throwable $r7) {
            CflowAspect.ajc$cflowCounter$0.dec();
            throw $r7;
        }

        CflowAspect.ajc$cflowCounter$0.dec();
        return;
    }
}

------------------------------------------------------------------------------

abc
---
public void m(int i0) throws java.lang.Throwable {
    CflowAspect r1;
    org.aspectbench.runtime.internal.cflowinternal.Counter r2;
    int i1, i2, $i5, i9, i10, i11, i12;

    r1 = null;
    r2 = CflowAspect.abc$cflowCounter$0.getThreadCounter();
    i1 = r2.count;
    i2 = i1 + 1;
    r2.count = i2;
    try {
        entries++;
    } catch (Throwable $r4) {
        i9 = r2.count;
        i10 = i9 + -1;
        r2.count = i10;
        throw $r4;
    }

    label_0:
    while (true) {
        try {
            $i5 = i0;
            i0--;
            if ($i5 != 0) {
                this.foo();
                if (r2.count > 0) {
                    if (r1 == null) {
                        r1 = CflowAspect.aspectOf();
                    }
                    r1.ctr++;
                }
                this.x();
                continue label_0;
            }
        } catch (Throwable $r4) {
            i9 = r2.count;
            i10 = i9 + -1;
            r2.count = i10;
            throw $r4;
        }

        i11 = r2.count;
        i12 = i11 + -1;
        r2.count = i12;
        return;
    }
}
-------------------------------------------------------------------
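Finally, to illustrate the runtime-library point made earlier, here is a
minimal sketch of the "cache the counter of the most recently seen thread"
idea. This is my own illustration, not the actual abc runtime source: in the
common single-threaded case getThreadCounter() can avoid the full ThreadLocal
lookup, and, as the woven code above shows, abc additionally calls it only once
per execution of m().

final class CflowCounterHolderSketch {
    static final class Counter {
        int count;                        // mirrors the public 'count' field used in the woven abc code
    }

    // Immutable (thread, counter) pair, so the cache can be read without locking.
    private static final class Cached {
        final Thread thread;
        final Counter counter;
        Cached(Thread t, Counter c) { thread = t; counter = c; }
    }

    private final ThreadLocal<Counter> counters = ThreadLocal.withInitial(Counter::new);
    private volatile Cached cache;        // the last thread that asked, and its counter

    Counter getThreadCounter() {
        Cached c = cache;
        if (c == null || c.thread != Thread.currentThread()) {
            // Slow path: do the real ThreadLocal lookup and remember it.
            c = new Cached(Thread.currentThread(), counters.get());
            cache = c;
        }
        // Fast path: same thread as last time, no ThreadLocal lookup needed.
        return c.counter;
    }
}
-------------------------------------------------------------------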