I've just had a prolonged look at the code after reading the SAS paper.
It seems to me that the merging approach taken in TypeVariable.java is
singularly ineffective, and we're probably pretty close to its
worst-case input.
There is, however, what seems like a reasonably easy way to
significantly reduce the time and memory requirements.
The type inference algorithm constructs a directed constraint graph over
the static types and the (initially untyped) local variables. It then
collapses cycles in this graph by repeatedly merging two adjacent nodes
of a cycle until the cycle has shrunk to a single node (at least if I
understand the code correctly - I haven't looked at it in *too* much
detail). It would seem more reasonable to replace the whole cycle at
once: by the time we start merging neighbours, we must already have
detected the cycle.
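To make that concrete, here is a rough sketch of what collapsing a whole
detected cycle in one pass might look like. The Node class and its fields
are hypothetical stand-ins, not the actual TypeVariable API:

  import java.util.LinkedHashSet;
  import java.util.List;
  import java.util.Set;

  // Hypothetical graph node - a stand-in for TypeVariable, not the real API.
  final class Node {
      final Set<Node> parents = new LinkedHashSet<Node>();
      final Set<Node> children = new LinkedHashSet<Node>();

      // Collapse an entire detected cycle into its first node in one pass,
      // instead of merging adjacent pairs one at a time.
      static Node collapseCycle(List<Node> cycle) {
          Node rep = cycle.get(0);
          Set<Node> dead = new LinkedHashSet<Node>(cycle);
          dead.remove(rep);
          for (Node n : dead) {
              rep.parents.addAll(n.parents);
              rep.children.addAll(n.children);
          }
          // Drop edges internal to the cycle and redirect external edges to rep.
          rep.parents.removeAll(dead);  rep.parents.remove(rep);
          rep.children.removeAll(dead); rep.children.remove(rep);
          for (Node p : rep.parents)  { p.children.removeAll(dead); p.children.add(rep); }
          for (Node c : rep.children) { c.parents.removeAll(dead);  c.parents.add(rep); }
          return rep;
      }
  }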
It's the algorithm for merging two TypeVariable objects that seems
particularly clumsy. It builds a new TreeSet containing all of the
elements of its "parents" list, adds all the elements of the "parents"
list of the TypeVariable it is being merged with, then extracts a list
from the TreeSet and keeps that as its new "parents" list. A lot of time
and memory is wasted here: the TreeSet is constructed only to remove
duplicates and is discarded immediately, and the old "parents" list is
thrown away so that a new one can be built in its place. The same
procedure is applied to the "children" lists of the two objects.
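In other words, each merge appears to do roughly the following (my
paraphrase of what I read, with the element type simplified - not the
literal source):

  import java.util.*;

  // Paraphrase of the merge as I read it - not the literal TypeVariable code.
  class MergeAsWritten {
      List<Integer> parents = Collections.unmodifiableList(new LinkedList<Integer>());

      void mergeParentsWith(MergeAsWritten other) {
          // A TreeSet is built only to remove duplicates, then thrown away, and a
          // brand-new LinkedList replaces the old unmodifiable one - on every merge.
          Set<Integer> union = new TreeSet<Integer>(this.parents);
          union.addAll(other.parents);
          this.parents = Collections.unmodifiableList(new LinkedList<Integer>(union));
          // ...and the same thing happens again for the two "children" lists.
      }
  }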
This results in nodes accumulating 6000+ parents and 3500+ children,
1-3 at a time, which are then collapsed back down to single-digit
counts, again 1-3 at a time. The algorithm seems to fold all nodes into
a single node, repeatedly merging it with nodes that have 0-1 children
and 1 parent, and each such merge constructs a fresh TreeSet and a fresh
List for the full sets of parents and children.
Part of the reason for this repeated list construction is that both the
parents and children lists are kept as "unmodifiableList"s, in what
seems to be a misplaced attempt at a functional approach to these data
structures. Perhaps I'm missing something, but I don't see why this
should be the case: if we really want to protect the lists from improper
write access, we can expose read-only accessor methods and keep the rest
package-private. It would clearly be beneficial to be able to update the
lists destructively from within the toolkits.typing package.
In fact, I have a feeling that the best data structure for these two
lists is either a HashSet or a LinkedHashSet. Does anybody know why
they're implemented as unmodifiable LinkedLists?
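For what it's worth, here is a minimal sketch of the shape I have in
mind - package-private mutable sets, a destructive merge, and read-only
views for everyone else. The names are made up, not the existing API:

  import java.util.Collections;
  import java.util.LinkedHashSet;
  import java.util.Set;

  // Hypothetical replacement shape - not the existing TypeVariable API.
  final class TypeNode {
      // Package-private and mutable, so merges inside toolkits.typing
      // can be destructive.
      final Set<TypeNode> parents = new LinkedHashSet<TypeNode>();
      final Set<TypeNode> children = new LinkedHashSet<TypeNode>();

      // Read-only views for callers outside the package.
      public Set<TypeNode> parentsView()  { return Collections.unmodifiableSet(parents); }
      public Set<TypeNode> childrenView() { return Collections.unmodifiableSet(children); }

      // Destructive merge: no temporary set, no list rebuilt; the hash sets
      // deduplicate as elements are added.
      void absorb(TypeNode other) {
          parents.addAll(other.parents);
          children.addAll(other.children);
          parents.remove(this);
          children.remove(this);
      }
  }

(I lean towards LinkedHashSet only because it keeps iteration order
deterministic; a plain HashSet would do the deduplication just as well.)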
(sorry for the length of this, apologies for any confusion I may have
caused by unclear formulations.)
- P
Ondrej Lhotak wrote:
>On Thu, Sep 16, 2004 at 03:13:19PM -0600, John Jorgensen wrote:
>
>
>>>>>>>"hendren" == Prof Laurie HENDREN <hendren@sable.mcgill.ca> writes:
>>>>>>>
>>>>>>>
>> hendren> The problem is that it spends a
>> hendren> huge proportion of its time (while compiling itself)
>> hendren> doing the type assignment in one class which has a big
>> hendren> monster method implementing a parser.
>>
>>Let me guess without even looking at the example: does this
>>method have hundreds of local variables? The type assigner does
>>not scale well as the number of local variables increases.
>>
>>When it merges two type nodes, the type assigner uses temporary
>>TreeSets to remove the duplicates from the merged sets of parent
>>nodes and child nodes (grep suggests that this happens in
>>soot.jimple.toolkits.typing.TypeVariable). When the number of
>>locals gets high, you spend a lot of time creating TreeSets that
>>you immediately discard.
>>
>>
>
>Ah, yes, I had the exact same problem in Spark. When nodes were merged,
>all edges into and out of them needed to also be merged. I had no idea
>that the same thing happened in the type assigner also. It's a bit more
>complicated in Spark, in that merging green nodes also requires merging
>any associated red nodes.
>
>I played around with it quite a bit in Spark, and I didn't come up with
>any silver bullet solution. I managed to get reasonable performance with
>the following things. First, no Java data structures. They're too slow.
>It was arrays, and hand-coded. Well, I see I did end up using HashSets,
>but in a hybrid way, so that they were only used when the sets to be
>merged were big (so the O(1) time of hashing was a big enough win to
>overcome the big constant involved). Second, lazy merging. I didn't
>bother merging until someone actually wanted to see the edges. Since
>this meant they had to be traversed anyway, the extra cost wasn't too
>bad. Again, merging in a hybrid way. It's quite a mess.
>
>In the end, this was such a big pain (and node merging ended up being
>such a small improvement) that in Paddle, I ripped out all the merging
>stuff. Looking at the Spark code now, it makes me shudder.
>
>Since it uses TreeSets, the type assigner could probably be sped up
>significantly using something like the stuff in Spark, but at the cost
>of making the code terrible. And, if the type assigner reads and merges
>those edges more often than Spark did (likely), it will still be rather
>slow.
>
>Does anyone have an idea of the typical distribution of the in/out-degrees of
>the nodes in the type assigner? In Spark, there were some very high, and
>many very low, which made all these tricks necessary. If that's not the
>case in the type assigner, it might be easier.
>
>Also, how many nodes might there be in total? If it's limited to 32K,
>just using a bit vector might be quick enough. In Spark, 320K was
>typical, so that wasn't an option.
>
>I agree that the size of such an undertaking is about appropriate for a
>621 project. I wonder a bit whether it's compilery enough, since it's
>mainly implementation, but I guess most of the projects are like that,
>so it's probably fine.
>
>Ondrej
>
>
>
>
>> hendren> I remember that you did have some solutions for
>> hendren> speeding up the type assigner, but I don't remember
>> hendren> if we ever integrated them or not. What is your
>> hendren> memory on this? Any chance you could take a look?
>>
>>No, I didn't have a solution; I had a hack which confirmed the
>>nature of the problem (since it did improve times on the tests
>>where it worked) but which is not correct (it doesn't work for
>>arbitrary input).
>>
>>The hack was to keep the parent and child nodes as sorted lists.
>>That lets you eliminate duplicates in a merged list in a single
>>pass, without creating any TreeSet. The difficulty with this, if
>>I remember correctly, is that there is no guarantee that the list
>>which you sort at time N will still be sorted any more at time N+m,
>>even if you don't move the elements, since the keys may change
>>(in the event that any of the nodes in the list are themselves
>>merged with some other node).
>>
>>Dealing with this properly probably requires re-engineering the
>>data structures used for typing. The job might even be big
>>enough for a 623 (?) project.
>>
>>
>>
>>
>>
>>
>
>
>
>
>