Re: [abc] type assigner complexity in Soot

From: Ondrej Lhotak <olhotak@sable.mcgill.ca>
Date: Fri Sep 17 2004 - 01:15:02 BST

On Thu, Sep 16, 2004 at 03:13:19PM -0600, John Jorgensen wrote:
> >>>>> "hendren" == Prof Laurie HENDREN <hendren@sable.mcgill.ca> writes:
>
> hendren> The problem is that it spends a
> hendren> huge proportion of its time (while compiling itself)
> hendren> in the type assignment of one class which has a big
> hendren> monster method implementing a parser.
>
> Let me guess without even looking at the example: does this
> method have hundreds of local variables? The type assigner does
> not scale well as the number of local variables increases.
>
> When it merges two type nodes, the type assigner uses temporary
> TreeSets to remove the duplicates from the merged sets of parent
> nodes and child nodes (grep suggests that this happens in
> soot.jimple.toolkits.typing.TypeVariable). When the number of
> locals gets high, you spend a lot of time creating TreeSets that
> you immediately discard.

Ah, yes, I had the exact same problem in Spark. When nodes were merged,
all edges into and out of them needed to also be merged. I had no idea
that the same thing happened in the type assigner also. It's a bit more
complicated in Spark, in that merging green nodes also requires merging
any associated red nodes.

I played around with it quite a bit in Spark, and I didn't come up with
any silver-bullet solution. I managed to get reasonable performance with
the following things. First, no general-purpose Java collections; they're
too slow. It was all arrays and hand-coded data structures. Well, I see I
did end up using HashSets, but in a hybrid way, so that they were only
used when the sets to be merged were big (so the O(1) time of hashing was
a big enough win to overcome the large constant involved). Second, lazy
merging: I didn't bother merging edge sets until someone actually wanted
to see the edges. Since that meant the edges had to be traversed anyway,
the extra cost wasn't too bad. Again, the merging was done in a hybrid
way. It's quite a mess.
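
To make the "hybrid" idea a bit more concrete, here is a rough sketch
(made-up names, not the actual Spark code) of the kind of set I mean: a
plain array while the set is small, and a HashSet only once it grows
past some threshold:

    import java.util.HashSet;

    // Sketch only: a hybrid edge set that stays a plain array while
    // small and switches to a HashSet once it grows past THRESHOLD.
    class HybridNodeSet {
        private static final int THRESHOLD = 16;

        private Object[] small = new Object[THRESHOLD];
        private int size = 0;
        private HashSet<Object> large = null;  // created only when needed

        // Returns true if the node was not already in the set.
        boolean add(Object node) {
            if (large != null) return large.add(node);
            for (int i = 0; i < size; i++)
                if (node.equals(small[i])) return false;  // linear scan is fine while small
            if (size < THRESHOLD) {
                small[size++] = node;
                return true;
            }
            // Too big for the array: switch to hashing, so the hash
            // constant is only paid when it is actually a win.
            large = new HashSet<>();
            for (int i = 0; i < size; i++) large.add(small[i]);
            small = null;
            return large.add(node);
        }

        // Merge another set into this one, e.g. when two nodes are unified.
        void addAll(HybridNodeSet other) {
            if (other.large != null) {
                for (Object n : other.large) add(n);
            } else {
                for (int i = 0; i < other.size; i++) add(other.small[i]);
            }
        }
    }

The point is that merging two small sets is just a couple of linear scans
over tiny arrays, and you only pay for hashing once the sets are big
enough that it actually helps.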

In the end, this was such a big pain (and node merging ended up being
such a small improvement) that in Paddle, I ripped out all the merging
stuff. Looking at the Spark code now, it makes me shudder.

Since it uses TreeSets, the type assigner could probably be sped up
significantly using something like the stuff in Spark, but at the cost
of making the code terrible. And, if the type assigner reads and merges
those edges more often than Spark did (likely), it will still be rather
slow.
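
For example, the temporary TreeSets could be avoided with something like
the sorted-list trick John describes (quoted below). Roughly this --
illustrative only; TypeNode and id() are stand-ins, not the real
TypeVariable interface:

    // Stand-in for whatever the typing code calls its nodes.
    interface TypeNode { int id(); }

    // Merge two parent lists that are kept sorted by id, dropping
    // duplicates in a single pass, with no temporary TreeSet.
    // Assumes each input array is itself sorted and duplicate-free.
    static TypeNode[] mergeSorted(TypeNode[] a, TypeNode[] b) {
        TypeNode[] out = new TypeNode[a.length + b.length];
        int i = 0, j = 0, k = 0;
        while (i < a.length && j < b.length) {
            int cmp = Integer.compare(a[i].id(), b[j].id());
            if (cmp < 0)      out[k++] = a[i++];
            else if (cmp > 0) out[k++] = b[j++];
            else { out[k++] = a[i++]; j++; }  // same node in both: keep one
        }
        while (i < a.length) out[k++] = a[i++];
        while (j < b.length) out[k++] = b[j++];
        return java.util.Arrays.copyOf(out, k);
    }

Of course, this is only correct as long as the sort keys can't change
underneath you, which is exactly the problem John points out below.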

Does anyone have an idea of the typical distribution of the in/out-degrees
of the nodes in the type assigner? In Spark, a few nodes had very high
degrees and many had very low ones, which is what made all these tricks
necessary. If that's not the case in the type assigner, it might be easier.

Also, how many nodes might there be in total? If it's limited to around
32K, just using a bit vector per node might be quick enough. In Spark,
320K was typical, so that wasn't an option.
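
To spell the bit vector idea out (just a sketch; it assumes every node
gets a dense integer id, which I haven't checked is true of the typing
code):

    import java.util.BitSet;

    // Sketch only: one bit per possible parent node.
    class ParentBits {
        private final BitSet parents = new BitSet();

        void addParent(int nodeId)         { parents.set(nodeId); }
        boolean hasParent(int nodeId)      { return parents.get(nodeId); }
        // Merging two nodes' parent sets is a word-wise OR; duplicates
        // disappear for free, with no temporary set at all.
        void mergeFrom(ParentBits other)   { parents.or(other.parents); }
    }

With 32K nodes each set is at most 4KB of bits, so a merge is a few
hundred word-sized ORs. With 320K nodes it's about 40KB per node, which
over hundreds of thousands of nodes runs into gigabytes -- hence not an
option in Spark.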

I agree that the size of such an undertaking is about appropriate for a
621 project. I wonder a bit whether it's compilery enough, since it's
mainly implementation, but I guess most of the projects are like that,
so it's probably fine.

Ondrej

>
> hendren> I remember that you did have some solutions for
> hendren> speeding up the type assigner, but I don't remember
> hendren> if we ever integrated them or not. What is your
> hendren> memory on this? Any chance you could take a look?
>
> No, I didn't have a solution; I had a hack which confirmed the
> nature of the problem (since it did improve times on the tests
> where it worked) but which is not correct (it doesn't work for
> arbitrary input).
>
> The hack was to keep the parent and child nodes as sorted lists.
> That lets you eliminate duplicates in a merged list in a single
> pass, without creating any TreeSet. The difficulty with this, if
> I remember correctly, is that there is no guarantee that the list
> which you sort at time N will still be sorted at time N+m,
> even if you don't move the elements, since the keys may change
> (in the event that any of the nodes in the list are themselves
> merged with some other node).
>
> Dealing with this properly probably requires re-engineering the
> data structures used for typing. The job might even be big
> enough for a 623 (?) project.
Received on Fri Sep 17 01:19:47 2004

This archive was generated by hypermail 2.1.8 : Fri Sep 17 2004 - 19:10:02 BST