It is important to clearly understand and differentiate the different static
implementations of cflow used in ajc (called AspectJ in the paper) and abc.

The high overheads of cflow were first analyzed in the OOPSLA 2004 paper
"Measuring the Dynamic Behaviour of AspectJ Programs", which instrumented the
bytecode generated by the ajc 1.1.1 compiler and showed very high overheads,
with surprisingly high memory allocation, for the implementation of cflow. That
paper also suggested that many of the overheads were due to the heavy-weight
cflow stack implementation, which in many common cases can be replaced by a
much cheaper counter implementation, and it demonstrated this by modifying
ajc 1.1.1 to use counters. Subsequently, the ajc group picked up this
optimization, and a simple form of it has been in ajc since ajc 1.2.1.

The abc compiler was designed to implement efficient code generation
strategies, optimizations and an efficient run-time library. As presented in
the PLDI 2005 paper "Optimising AspectJ", cflow was optimized in many different
ways, both intraprocedurally and interprocedurally.

Intraprocedural Optimizations (applied by default, same as -O1 flag)
--------------------------------------------------------------------
First, there are INTRAPROCEDURAL optimizations (section 4.1 of the PLDI05
paper) that do the following:

1. share cflow counters/stacks for equivalent cflow pointcuts
2. use counters instead of stacks whenever possible
3. reuse the same ThreadLocal cflow counter/stack within a method

I think that 2. and 3. apply to the microbenchmarks used in this paper; the
benchmarks are not complex enough to need 1.

In addition to these cflow-specific optimizations, abc also has general
optimizations and code generation strategies for minimizing the overhead of
calling advice. The first reuses the aspectOf instance when it is used multiple
times within the same method. The second inlines the advice body in cases where
the advice body is small. Both of these apply to the microbenchmarks in this
paper.

Finally, abc uses a different runtime library than ajc, which has been built to
reduce the cost of performing operations on cflow stacks/counters and to
minimize the cost of retrieving the ThreadLocal cflow stacks/counters (by
caching the last used one and reusing it if the current thread is the same as
the last thread).

Interprocedural Optimizations (applied when the -O3 flag is used)
-----------------------------------------------------------------
These optimizations do a whole-program analysis and eliminate cflow operations
when they can be shown to be unnecessary. For the example program in the paper,
the analysis can determine that the before advice always applies to the calls
to x() within the body of m(), that it never applies to calls to x() within the
body of m1(), and that it MAY apply to the call to x() within the body of
foo(). Since there is at least one MAY occurrence, the counter must still be
maintained for the execution of m().
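To make the always/never/MAY distinction concrete, here is a small sketch of
the kind of program the -O3 analysis reasons about. This is my own
reconstruction from the description above, not the paper's actual listing; the
names m(), m1(), foo() and x() come from the discussion, everything else is
assumed.

public aspect CflowAspect {
    before() : call(* x()) && cflow(execution(* m(..))) {
        // advice body
    }
}

class Example {
    void m(int n) {
        foo();
        x();    // always within cflow(execution(m)): advice statically known to apply
    }
    void m1() {
        foo();
        x();    // never within cflow(execution(m)): the dynamic check can be removed
    }
    void foo() {
        x();    // MAY be within cflow(execution(m)), depending on whether foo() was
                // called from m() or from m1(); the dynamic check stays, so the
                // counter for m() must still be maintained
    }
}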
--------------------------------
Based on this discussion, there are several weaknesses in the experimental
evaluation in the paper being reviewed, as follows.

1) The paper does not say which abc optimizations it uses in the experiments.
It says that abc must do a whole-program analysis, but this is only true when
the -O3 flag, which enables the interprocedural analysis, is used. A
substantial number of intraprocedural optimizations and efficient code
generation strategies are applied by default (-O1), and none of these need
whole-program analysis. Furthermore, on the microbenchmark used in the paper,
the interprocedural optimizations are not even necessary (see the data below).

2) Given that abc, ajc and steamloom are all very different systems, it is very
hard to determine from performance results alone what is causing better
performance, and whether it is due to the treatment of cflow or to other
factors. See below for an interesting set of results for ajc and abc using the
microbenchmark from the paper. It would be interesting to expand this to
include the other systems presented in the paper.

----------------------------------
Using the microbenchmark given in the paper (Listing 2), on one thread, using
10000 repetitions, I found the following results, with times in milliseconds.
(Note that these are not well-polished results, and variance is somewhat high
between different runs. However, the main points hold.)

                                    (5,100)  (100,5)   (5,5)  (100,100)
ajc1.2                                  344     5485    5479        277
ajc1.2.1                                166     1669    1661         83
abc                                      62      600     516         26
abc -O3                                  60      516     517         26
abc -before-after-inlining:off           63      623     530         26
abc -cflow-use-counters:off              69      602     513         26
abc -cflow-share-thread-locals:off       81      956     960         44

The main difference between ajc1.2 and ajc1.2.1 is that ajc1.2.1 uses cflow
counters instead of stacks (a conceptual sketch of the two representations is
given below, after the flag-by-flag discussion).

The difference in performance between ajc1.2.1 and abc is due to many factors.
Both of them use counters instead of stacks, but abc has a more tuned runtime
library, generates more efficient code, and applies all of the intraprocedural
optimizations explained earlier.

The difference between abc and abc -O3 is that the interprocedural cflow
analysis is applied. Note that the interprocedural analysis does not give much
improvement over the intraprocedural optimizations for this benchmark. This is
because most of the cflow overhead cannot be eliminated, due to "may" cflow.
Thus, for this benchmark, there is no benefit in doing a whole-program
analysis.

To see which abc intraprocedural optimization matters most, we tried turning
off the three most likely candidates.

before-after-inlining:off
-------------------------
It seems that advice inlining is not so important; probably the VM does the
same inlining itself.

cflow-use-counters:off
----------------------
Surprisingly, for this benchmark, the use of counters (instead of stacks) was
not so important. This is probably because the abc compiler uses a fast
implementation of the stacks.

cflow-share-thread-locals:off
-----------------------------
It appears that sharing of ThreadLocal counters is very important for this
benchmark. When this sharing is disabled, there is an obvious slowdown. Thus,
it is likely that steamloom's VM implementation of the ThreadLocal counters is
helpful, since retrieving these counters is expensive.
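For reference, here is a minimal conceptual sketch of the two cflow state
representations discussed above. This is my own simplification, not the actual
ajc or abc runtime classes: a cflow stack has to push and pop an entry around
every execution of m(), which typically costs an allocation per entry, whereas
a counter only increments and decrements an int.

final class CflowStackSketch {
    private final ThreadLocal<java.util.ArrayDeque<Object[]>> stack =
        ThreadLocal.withInitial(() -> new java.util.ArrayDeque<Object[]>());

    void push(Object[] boundState) { stack.get().push(boundState); } // one entry (and allocation) per enclosing execution
    void pop()                     { stack.get().pop(); }
    boolean isValid()              { return !stack.get().isEmpty(); }
}

final class CflowCounterSketch {
    private final ThreadLocal<int[]> count = ThreadLocal.withInitial(() -> new int[1]);

    void inc()        { count.get()[0]++; }   // no allocation on the fast path
    void dec()        { count.get()[0]--; }
    boolean isValid() { return count.get()[0] > 0; }
}

This is the shape of the inc()/dec()/isValid() calls visible in the ajc1.2.1
woven code below. When the cflow pointcut has to expose bound context, a stack
is still required, which is why the counter optimization only applies
"whenever possible".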
--------------------------------------------------------------
Just to show the difference between the code generated by ajc1.2.1 and abc,
here are some decompiled snippets of the method m(), after weaving. (Note that
the duplication of the catch block is due to the decompilation process.) You
can see that the two compilers take different approaches to compiling the
counters. The ajc compiler uses calls to a runtime library for all the
operations, whereas the abc compiler compiles the instructions directly into
the code. Furthermore, abc reuses the ThreadLocal counter, fetching it only
once per execution of the method, reuses the aspect object returned by
aspectOf(), and inlines the advice.

By studying the code generated by abc, we can see why it does not leave much
room for improvement by implementing cflow in a VM, as steamloom does. First,
the operations on the counters have been compiled into fairly simple operations
that any VM should be able to execute efficiently. Second, various
optimizations have been done to reduce the number of times ThreadLocal counters
need to be fetched. So, even though a direct VM implementation of the
ThreadLocal counters could be faster, the abc-generated code only performs the
ThreadLocal fetch once per method call.

ajc1.2.1
--------
public void m(int i0) throws java.lang.Throwable {
    int $i3;

    CflowAspect.ajc$cflowCounter$0.inc();
    try {
        entries++;
    } catch (Throwable $r7) {
        CflowAspect.ajc$cflowCounter$0.dec();
        throw $r7;
    }

    label_0:
    while (true) {
        try {
            $i3 = i0;
            i0--;
            if ($i3 != 0) {
                this.foo();
                if (CflowAspect.ajc$cflowCounter$0.isValid()) {
                    CflowAspect.aspectOf().ajc$before$CflowAspect$1$e8317405();
                }
                this.x();
                continue label_0;
            }
        } catch (Throwable $r7) {
            CflowAspect.ajc$cflowCounter$0.dec();
            throw $r7;
        }

        CflowAspect.ajc$cflowCounter$0.dec();
        return;
    }
}

------------------------------------------------------------------------------

abc
---
public void m(int i0) throws java.lang.Throwable {
    CflowAspect r1;
    org.aspectbench.runtime.internal.cflowinternal.Counter r2;
    int i1, i2, $i5, i9, i10, i11, i12;

    r1 = null;
    r2 = CflowAspect.abc$cflowCounter$0.getThreadCounter();
    i1 = r2.count;
    i2 = i1 + 1;
    r2.count = i2;
    try {
        entries++;
    } catch (Throwable $r4) {
        i9 = r2.count;
        i10 = i9 + -1;
        r2.count = i10;
        throw $r4;
    }

    label_0:
    while (true) {
        try {
            $i5 = i0;
            i0--;
            if ($i5 != 0) {
                this.foo();
                if (r2.count > 0) {
                    if (r1 == null) {
                        r1 = CflowAspect.aspectOf();
                    }
                    r1.ctr++;
                }
                this.x();
                continue label_0;
            }
        } catch (Throwable $r4) {
            i9 = r2.count;
            i10 = i9 + -1;
            r2.count = i10;
            throw $r4;
        }

        i11 = r2.count;
        i12 = i11 + -1;
        r2.count = i12;
        return;
    }
}
-------------------------------------------------------------------
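Finally, to illustrate the runtime-library point made earlier, here is a
minimal sketch of the "cache the counter of the most recently seen thread"
idea. This is my own illustration, not the actual abc runtime source: in the
common single-threaded case getThreadCounter() can avoid the full ThreadLocal
lookup, and, as the woven code above shows, abc additionally calls it only once
per execution of m().

final class CflowCounterHolderSketch {
    static final class Counter {
        int count;                        // mirrors the public 'count' field used in the woven abc code
    }

    // Immutable (thread, counter) pair, so the cache can be read without locking.
    private static final class Cached {
        final Thread thread;
        final Counter counter;
        Cached(Thread t, Counter c) { thread = t; counter = c; }
    }

    private final ThreadLocal<Counter> counters = ThreadLocal.withInitial(Counter::new);
    private volatile Cached cache;        // the last thread that asked, and its counter

    Counter getThreadCounter() {
        Cached c = cache;
        if (c == null || c.thread != Thread.currentThread()) {
            // Slow path: do the real ThreadLocal lookup and remember it.
            c = new Cached(Thread.currentThread(), counters.get());
            cache = c;
        }
        // Fast path: same thread as last time, no ThreadLocal lookup needed.
        return c.counter;
    }
}
-------------------------------------------------------------------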