Abstract
MATLAB is a popular dynamic array-based language commonly used by students, scientists, and engineers who appreciate the interactive development style, the rich set of array operators, the extensive built-in library, and the fact that they do not have to declare static types. Even though these users like to program in MATLAB, their computations are often very compute-intensive and are better suited for emerging high performance computing systems. This thesis reports on MIX10, a source-to-source compiler that automatically translates MATLAB programs to X10, a language designed for “Performance and Productivity at Scale”, thus helping scientific programmers make better use of high performance computing systems.
There is a large semantic gap between the array-based, dynamically-typed nature of MATLAB and the object-oriented, statically-typed, and high-level array abstractions of X10. This thesis addresses the major challenges that must be overcome to produce sequential X10 code that is competitive with state-of-the-art static compilers for MATLAB which target more conventional imperative languages such as C and Fortran. Given that efficient basis, the thesis then provides a translation for the MATLAB parfor construct that leverages the powerful concurrency constructs in X10.
The MIX10 compiler has been implemented using the McLab compiler tools, is open source, and is available both to compiler researchers and to end-user MATLAB programmers. We have used the implementation to perform many empirical measurements on a set of 17 MATLAB benchmarks. We show that our best MIX10-generated code is significantly faster than the de facto Mathworks' MATLAB system, and that our results are competitive with state-of-the-art static compilers that target C and Fortran. We also show the importance of finding the correct approach to representing the arrays in X10, and the necessity of an IntegerOkay analysis that determines which double variables can be safely represented as integers. Finally, we show that our X10-based handling of the MATLAB parfor greatly outperforms the de facto MATLAB implementation.
Résumé
MATLAB is a dynamic, array-oriented programming language commonly used by students, scientists, and engineers who appreciate its interactive development style, its rich set of array operators, its impressive library of built-in functions, and the fact that variable types need not be declared statically. Although these users appreciate MATLAB, their programs often require significant computing resources, which are offered by emerging high performance systems. This thesis reports on MIX10, a source-to-source compiler that automatically translates MATLAB programs to X10, a language built for “Performance and Productivity at Scale”. MIX10 thus helps scientific programmers make better use of high performance computing systems.
There is a large semantic gap between MATLAB's dynamic typing and array focus and X10's object-oriented approach, static typing, and high-level array abstractions. This thesis discusses the main challenges that must be overcome in order to produce sequential X10 code that is competitive with the best static compilers for MATLAB, which translate to more conventional imperative languages such as C and Fortran. Building on this efficient foundation, the thesis then describes the translation of MATLAB's parfor construct so as to exploit X10's sophisticated concurrency operations.
The MIX10 compiler has been implemented using the McLab tool suite, an open source project available both to the research community and to MATLAB users. We used our implementation to perform empirical performance measurements on a set of 17 MATLAB programs. We demonstrate that the code generated by MIX10 is considerably faster than Mathworks' MATLAB system, and that our results are competitive with the best static compilers that produce C and Fortran code. We also show the importance of an appropriate representation of arrays in X10, and the necessity of an IntegerOkay analysis that determines which variables of real type (double) can be safely represented as integers (int). Finally, we show that our handling of the parfor construct in X10 achieves considerably better execution speeds than in MATLAB.
Acknowledgements
I am thankful to my supervisor, Laurie Hendren, whose constant encouragement and guidance not only helped me to understand compilers better, but also helped me become a better writer and a better presenter.
I would like to give a special thanks to David Grove for suggesting this line of research and for helping us understand key parts of the X10 language and implementation.
I am also thankful to the McLAB team for building an amazing framework for compiler research on MATLAB. In particular, I am thankful to Anton Dubrau (M.Sc.) for helping me understand the MCSAF and Tamer frameworks, and Xu Li (M.Sc.) for all the interesting discussions on how to statically compile MATLAB.
Finally, I would like to thank my parents and my brother, who made it possible for me to pursue my M.Sc. degree.
Apart from all the wonderful people around me, I am thankful to my computers, Antriksh, Heisenberg, and cougar.cs.mcgill.ca for tirelessly and reliably crunching numbers day in and day out.
This work was supported, in part, by the Natural Sciences and Engineering Research Council of Canada (NSERC).
Table of Contents
List of Figures
List of Tables
Introduction
___________________________________________________________________________________
MATLAB is a popular numeric programming language, used by millions of scientists, engineers, and students worldwide [Mol]. MATLAB programmers appreciate the high-level matrix operators, the fact that variables and types do not need to be declared, the large number of library and built-in functions available, and the interactive style of program development available through the IDE and the interpreter-style read-eval-print loop. However, even though MATLAB programmers appreciate all of the features that enable rapid prototyping, their applications are often quite compute-intensive and time-consuming. These applications could perform much more efficiently if they could be easily ported to a high performance computing system.
X10 [IBM12], on the other hand, is an object-oriented and statically-typed language which provides Cilk-style arrays indexed by Point objects as well as rail-backed multidimensional arrays, and has been designed with well-defined semantics and high performance computing in mind. The X10 compiler can generate C++ or Java code and supports various communication interfaces, including sockets and MPI, for communication between nodes on a parallel computing system.
In this thesis we present MIX10, a source-to-source compiler that helps to bridge the gap between MATLAB, a language familiar to scientists, and X10, a language designed for high performance computing systems. MIX10 statically compiles MATLAB programs to X10 and thus allows scientists and engineers to write programs in MATLAB (or use old programs already written in MATLAB) and still get the benefits of high performance computing, without having to learn a new language. Also, systems that use MATLAB for prototyping and C++ or Java for production can benefit from MIX10 by quickly converting MATLAB prototypes to C++ or Java programs via X10.
On one hand, all the aforementioned characteristics of MATLAB make it a very user-friendly, and thus popular, tool for software development among a non-programmer community. On the other hand, these same characteristics make MATLAB a difficult language to compile statically. Even the de facto standard, Mathworks' implementation of MATLAB, is essentially an interpreter with a JIT accelerator [The02], which is generally slower than statically compiled languages. GNU Octave, a popular open source alternative to MATLAB that is mostly compatible with it, introduced JIT compilation only in its most recent release, 3.8, in March 2014; before that it was implemented only as an interpreter [Oct]. The lack of a formal language specification, unconventional semantics, and a closed-source implementation make it even harder to write a compiler for MATLAB. These are some of the challenges that MIX10 shares with other static compilers which convert MATLAB to C or Fortran. However, targeting X10 raises some significant new challenges. For example, X10 is an object-oriented, heap-based language with varying levels of high-level abstractions for arrays and iterators over arrays. An open question was whether or not it was possible to generate X10 code that both maintains the MATLAB semantics and is as efficient as the code produced by state-of-the-art translators that target C and Fortran. Finding an effective and efficient translation from MATLAB to X10 was not obvious, and this thesis reports on the key problems we encountered and the solutions that we designed and implemented in our MIX10 system. By demonstrating that we can generate sequential X10 code that is as efficient as generated C or Fortran code, we enabled the possibility of leveraging the high performance nature of X10's parallel constructs. To demonstrate this, we exposed scalable concurrency in MATLAB via X10 and examined how to use X10 features to provide a good implementation of MATLAB's parfor construct.
Built on top of the McLAB static analysis framework [Doh11, DH12b], MIX10, together with its set of reusable static analyses for performance optimization and its extended support for MATLAB features, ultimately aims to provide MATLAB's ease of use, sequential performance comparable to that of state-of-the-art compilers targeting C and Fortran, support for parallel constructs like parfor, and scalable concurrency.
The major contributions of this thesis are as follows:
We determined that a key remaining problem was overhead due to casting doubles to integers. This casting overhead was quite high, and was substantially higher for the C++ backend than for the Java backend (since the casting instructions are handled efficiently by the Java VM). The casting problem arises because the default data type for MATLAB variables is double: even variables used to iterate through arrays and to index into arrays typically have double type, even though they hold integral values. To tackle this issue we developed an IntegerOkay analysis, which determines which double variables can be safely converted to integer variables in the generated X10 code, hence greatly reducing the number of casting operations needed.
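As a rough illustration of the idea (not the actual MIX10 analysis, whose details appear in Chapter 8), an IntegerOkay-style check can be sketched as an optimistic fixed point over assignments: every variable starts as integer-okay and is demoted when some value flowing into it may be fractional. All names in this Python sketch are our own.

```python
# Toy flow-insensitive IntegerOkay sketch: a variable may be declared as an
# integer only if every assignment to it yields an integral value.
def integer_okay(assignments):
    # assignments: list of (target, kind, operands) tuples, where
    #   kind 'const': operands is a numeric constant
    #   kind 'copy' : operands is a source variable name
    #   kind 'add'  : operands is a pair of variables; integral if both are
    #   kind 'div'  : operands is a pair of variables; may be fractional
    ok = {t for t, _, _ in assignments}          # start optimistic
    changed = True
    while changed:                               # iterate to a fixed point
        changed = False
        for t, kind, ops in assignments:
            if t not in ok:
                continue
            if kind == 'const':
                good = float(ops).is_integer()
            elif kind == 'copy':
                good = ops in ok
            elif kind == 'add':
                good = all(o in ok for o in ops)
            else:                                # 'div' and anything unknown
                good = False
            if not good:
                ok.discard(t)                    # demote and re-propagate
                changed = True
    return ok
```

Here a loop counter assigned only integral constants stays integer-okay, while anything produced by a division is demoted, along with every variable it flows into.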
This thesis is divided into 11 chapters, including this one, and is structured as follows.
Chapter 2 provides an introduction to the X10 language and describes how it compares to MATLAB from the point of view of language design. Chapter 3 gives a description of the various existing MATLAB compiler tools upon which MIX10 is implemented, presents a high-level design of MIX10, and explains the design of, and need for, the MIX10 IR. In Chapter 4 we compare the two kinds of arrays provided by X10, identify when each of them can be used in the generated code, and identify and address the challenges involved in efficient compilation of MATLAB arrays. Chapter 5 describes the builtin handling framework used by MIX10 to generate specialized code for the MATLAB builtins used in the source program. Chapter 6 gives a description of efficient and effective code generation strategies for the core sequential constructs of MATLAB. A description of code generation for the MATLAB parfor construct is provided in Chapter 7, which also describes our strategy for introducing concurrency constructs in MATLAB. In Chapter 8 we provide a description of the IntegerOkay analysis, which identifies variables that are safe to declare as Long type. Chapter 9 provides performance results for code generated using MIX10 for a suite of benchmarks. It gives a comparison between the performance achieved by MIX10-generated code and that generated by the MATLAB Coder and MC2FOR compilers. Chapter 10 provides an overview of related work, and Chapter 11 concludes and outlines possible future work.
Chapter 2
Introduction to the X10 Programming Language
___________________________________________________________________________________
In this chapter, we describe key X10 semantics and features to help readers unfamiliar with X10 to have a better understanding of the MIX10 compiler.^{1}
X10 is an award-winning, open-source programming language developed by IBM Research. The goal of the X10 project is to provide a productive and scalable programming model for modern high performance computing architectures, ranging from multicore processors to clusters and supercomputers [IBM12].
X10, like Java, is a class-based, strongly-typed, garbage-collected, object-oriented language. It uses the Asynchronous Partitioned Global Address Space (APGAS) model to support concurrency and distribution [SBP^{+}12]. The X10 compiler has a native backend that compiles X10 programs to C++ and a managed backend that compiles X10 programs to Java.
X10's sequential core is a container-based object-oriented language that is very similar to that of Java or C++ [SBP^{+}12]. An X10 program consists of a collection of classes, structs, or interfaces, which are the top-level compilation units. X10's sequential constructs, like if-else statements, for loops, while loops, switch statements, and the exception handling constructs throw and try…catch, are also the same as those in Java. X10 provides both implicit coercions and explicit conversions on types, and both can be defined on user-defined types. The as operator is used to perform explicit type conversions; for example, x as Long{self != 0} converts x to type Long and throws a runtime exception if its value is zero. Multidimensional arrays in X10 are provided as user-defined abstractions on top of x10.lang.Rail, an intrinsic one-dimensional array analogous to one-dimensional arrays in languages like C or Java. Two families of multidimensional array abstractions are provided: simple arrays, which provide a restricted but efficient implementation, and region arrays, which provide a flexible and dynamic implementation but are not as efficient as simple arrays. Listing 2.1 shows a sequential X10 program that calculates the value of π using the Monte Carlo method.^{2} It highlights important sequential and object-oriented features of X10 detailed in the following subsections. It reads an argument from the console as an intrinsic one-dimensional array (Rail) of String and converts it to a Long value N. It then generates two random numbers and performs computations on them in a loop from 1 to N. Finally, it calculates the value of π and displays the result on the console.
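Listing 2.1 itself is not reproduced here, but the computation it performs can be sketched in Python as follows (an illustrative re-implementation under our own names, not the X10 listing):

```python
import random

def monte_carlo_pi(n, seed=0):
    """Estimate pi by sampling n points in the unit square and counting
    how many fall inside the quarter circle of radius 1."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(n):                    # the X10 listing loops over 1..N
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return 4.0 * inside / n               # the hit ratio approximates pi/4
```

The X10 version differs mainly in how it obtains N: it is parsed from the Rail[String] of command-line arguments and converted to a Long.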
X10's syntax is similar to that of Java. The object-oriented semantics are also similar to those of Java. Mutable and immutable variables are denoted by the keywords var and val respectively. for loops can iterate over a range of values; for example, in Listing 2.1, the for loop iterates over a LongRange of 1 to N. A LongRange represents a range of Long type integers.
A program consists of a collection of top-level units, where a unit is either a class, a struct, or an interface. A program can contain multiple units; however, only one unit can be made public, and its name must be the same as that of the program file. Similar to Java, access to these top-level units is controlled by packages. Below is a description of the core object-oriented constructs in X10:
Methods may be inherited. Methods defined in the superclass are available in the subclasses, unless overridden by another method with the same signature. Method overloading allows programmers to define multiple methods with the same name, as long as they have different signatures. Methods can be access-controlled to be private, protected, or public; private methods can only be accessed by other methods in the same class, protected methods can be accessed in the same class or its subclasses, and public methods can be accessed from any code. By default, all methods are package protected, which means they can be accessed from any code in the same package.
Methods with the same name as the containing class are called constructors; they are used to instantiate a class.
X10 provides all the standard statements, similar to Java. Assignment, if-else, and while loop statements are identical to those in Java.
for loops in X10 are more advanced: apart from the standard C-like for loop, X10 provides three different kinds of for loops:
In order to understand the challenges of translating MATLAB to X10, one must understand the different flavours and functionality of X10 arrays.
At the lowest level of abstraction, X10 provides an intrinsic one-dimensional fixed-size array called Rail, which is indexed by a Long type value starting at 0. This is the X10 counterpart of built-in arrays in languages like C or Java. In addition, X10 provides two types of more sophisticated array abstractions in the packages x10.array and x10.regionarray.
Both types of arrays also support distribution across places. A place is one of the central innovations in X10, which permits the programmer to deal with notions of locality. Places are discussed in more detail in Section 2.2.4.
X10 is a statically type-checked language: every variable and expression has a type that is known at compile time, and the compiler checks that the operations performed on an expression are permitted by the type of that expression. The name c of a class or an interface is the most basic form of type in X10. There are no primitive types.
X10 also supports type definitions, which allow a simple name to be supplied for a complicated type, and allow type aliases to be defined. For example, a type definition like public static type bool(b:Boolean) = Boolean{self=b} allows the use of the expression bool(true) as a shorthand for the type Boolean{self=true}.
X10 types are available at runtime, unlike in Java (which erases them).
X10 is a high performance language that aims to provide productivity to the programmer. To achieve that goal, it provides a simple yet powerful concurrency model with four concurrency constructs that abstract away the low-level details of parallel programming, without compromising on performance. X10's concurrency model is based on the Asynchronous Partitioned Global Address Space (APGAS) model [IBM13a]. The APGAS model has a concept of a global address space that allows a task in X10 to refer to any object (local or remote). However, a task may operate only on an object that resides in its partition of the address space (local memory). Each task, called an activity, runs asynchronously, in parallel with other activities. A logical processing unit in X10 is called a place. Each place can run multiple activities. The four concurrency constructs provided by X10 [IBM13b] are described in the following sections:
The fundamental concurrency construct in X10 is async. The statement async S creates a new activity to execute S and returns immediately. The current activity and the “forked” activity execute asynchronously, in parallel with each other, and have access to the same heap of objects. They communicate with each other by reading and writing shared variables. There is no restriction on the statement S, in the sense that it can contain any other constructs (including async). S is also permitted to refer to any immutable variable defined in the lexically enclosing scope.
An activity is the fundamental unit of execution in X10. It may be thought of as a very lightweight thread of execution. Each activity has its own control stack and may invoke recursive method calls. Unlike Java threads, activities in X10 are unnamed. Activities cannot be aborted or interrupted once they are in flight; they must proceed to completion, either finishing correctly or by throwing an exception. An activity created by async S is said to be locally terminated if S has terminated. It is said to be globally terminated if it has terminated locally and all activities spawned by it, recursively, have themselves globally terminated.
Global termination of an activity can be converted to local termination by using the finish construct. This is necessary when the programmer needs to be sure that a statement S, and all the activities spawned transitively by S, have terminated before execution of the next statement begins. For instance, in Listing 2.5 below, the use of finish ensures that the Console.OUT.println("a(1) = " + a(1)); statement is executed only after all the asynchronously executing operations (async a(i) *= 2;) have completed.
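As a loose analogy (not X10 itself), the fork-all-then-wait shape of finish { … async … } resembles a fork/join sketch such as the following Python fragment, where waiting on every future plays the role of finish:

```python
from concurrent.futures import ThreadPoolExecutor

def double_all(a):
    # Rough analogue of: finish { for (i in 0..(a.size-1)) async a(i) *= 2; }
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(lambda i=i: a.__setitem__(i, a[i] * 2))
                   for i in range(len(a))]
        for f in futures:
            f.result()   # the 'finish': wait for (and re-raise from) each task
    return a             # safe to read: all forked work has terminated
```

Only after every future has completed does the final read of the array happen, just as finish guarantees global termination before the next statement.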
The statement atomic S ensures that the statement S is executed in a single step with respect to all other activities in the system. When S is being executed in one activity, all other activities containing S are suspended. However, the atomic statement S must be sequential (it should not contain an async or finish statement), non-blocking (it should not contain a blocking construct like when, explained below), and local (in this context, local means place-local; that is, it must not contain an at construct, explained in Section 2.2.4). Consider the code fragment in Listing 2.6. It asynchronously adds Long values to a linked list list and simultaneously maintains the size of the list in a variable size. The use of atomic guarantees that no other operation, in any activity, is executed in between (or simultaneously with) these two operations, which is necessary to ensure correctness of the program.
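The invariant in Listing 2.6 — the list and its size counter are updated as one indivisible step — can be mimicked in Python with a lock (only an analogy; X10's atomic is a language construct, not a user-level lock, and all names here are our own):

```python
import threading

class CountedList:
    """Sketch of Listing 2.6's invariant: 'size' always matches the list,
    because append-and-count executes as one indivisible step."""
    def __init__(self):
        self._lock = threading.Lock()
        self.items = []
        self.size = 0

    def add(self, v):
        with self._lock:          # plays the role of X10's 'atomic { ... }'
            self.items.append(v)
            self.size += 1

def add_many(cl, values, nthreads=4):
    """Add values concurrently from several threads (the 'async' part)."""
    chunks = [values[i::nthreads] for i in range(nthreads)]
    threads = [threading.Thread(target=lambda c=c: [cl.add(v) for v in c])
               for c in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

Without the lock, a thread could observe a list whose length disagrees with size; with it, the pair of updates is indivisible, which is exactly what atomic guarantees in X10.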
Note that atomic is syntactic sugar for the construct when (c) S, the conditional atomic statement based on a boolean condition c. The statement when (c) S executes statement S atomically only when c evaluates to true; if c is false, execution blocks, waiting for c to become true. The condition c must be sequential, non-blocking, and local.
A place in X10 is the fundamental processing unit. It is a collection of data and the activities that operate on that data. A program runs on a fixed number of places. The binding of places to hardware resources (e.g., nodes in a cluster, accelerators) is provided externally by a configuration file, independent of the program.
The at construct provides a place-shifting operation that is used to force execution of a statement or an expression at a particular place. An activity executing at (p) S suspends execution at the current place; the object graph G at the current place, whose roots are all the variables V used in S, is serialized and transmitted to place p, then deserialized (creating a graph G′ isomorphic to G); an environment is created with the variables V bound to the corresponding roots in G′, and S is executed at p in this environment. On local termination of S, computation resumes after at (p) S in the original location. The object graph is not automatically transferred back to the originating place when S terminates: any updates made to objects copied by an at will not be reflected in the original object graph.
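The copy semantics of at can be approximated with a deep copy: the captured object graph is cloned for the destination place, and mutations there never reach the original. This Python sketch (our own names; a loose model, not X10's actual serialization protocol) makes the point:

```python
import copy

def run_at_other_place(env, body):
    """Simulate 'at (p) S': serialize the captured environment (modeled as a
    deep copy), run the body against the copy, and do NOT copy anything back."""
    shipped = copy.deepcopy(env)   # graph G -> isomorphic graph G'
    body(shipped)                  # S executes "at place p" on G'
    return shipped                 # the original env is untouched

env = {'a': [1, 2, 3]}
remote = run_at_other_place(env, lambda e: e['a'].append(4))
```

After the call, remote holds the mutated copy while env is unchanged, mirroring the rule that updates made inside an at are not reflected in the original object graph.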
In order to understand the compilation flow of the MIX10 compiler and enhancements made to the X10 compiler for efficient use of X10 as a target language for MATLAB, it is important to understand the design of the X10 compiler and its runtime environment.
X10 is implemented as a source-to-source compiler that translates X10 programs to either C++ or Java. This allows X10 to achieve critical portability, performance, and interoperability objectives. The generated C++ or Java program is, in turn, compiled by the platform C++ compiler to an executable, or by a Java compiler to class files. The C++ backend is referred to as Native X10 and the Java backend is called Managed X10.
The source-to-source compilation approach in X10 provides three main advantages: (1) it makes X10 available for a wide range of platforms; (2) it takes advantage of the underlying classical and platform-specific optimizations in C++ or Java compilers, while the X10 implementation includes only X10-specific optimizations; and (3) it allows programmers to take advantage of existing C++ and Java libraries.
Figure 2.1 shows the overall architecture of the X10 compiler [IBM13b].
Figure 2.2 shows the major components of the X10 runtime and their relative hierarchy [IBM13b].
The runtime bridges the gap between the application program and the low-level network transport system and the operating system. X10RT, the lowest layer of the X10 runtime, provides abstraction and unification of the functionalities provided by various network layers.
The X10 Language Native Runtime provides the implementation of the sequential core of the language. It is implemented in C++ for Native X10 and in Java for Managed X10.
The XRX Runtime (the X10 runtime in X10) is the core of the X10 runtime system. It provides the implementation of the primitive X10 constructs for concurrency and distribution (async, finish, atomic, and at). It is primarily written in X10 over a set of low-level APIs that provide a platform-independent view of processes, threads, synchronization mechanisms, and interprocess communication.
At the top of the X10 runtime system is a set of core class libraries that provide fundamental data types, basic collections, and key APIs for concurrency and distribution.
Background and High-Level Design of MIX10
___________________________________________________________________________________
MIX10 is implemented on top of several existing MATLAB compiler tools. The overall structure is given in Figure 3.1, where the new parts are indicated by the shaded boxes, and future work is indicated by dashed boxes.
As illustrated at the top of the figure, a MATLAB programmer only needs to provide an entry-point MATLAB function (called myprog.m in this example), plus a collection of other MATLAB functions and libraries (directories of functions) which may be called, directly or indirectly, from the entry point. The programmer may also specify the types and/or shapes of the input parameters to the entry-point function. As shown at the bottom of the figure, our MIX10 compiler automatically produces a collection of X10 output files which contain the generated X10 code for all reachable MATLAB functions, plus one X10 file called mix10.x10 which contains generated and specialized X10 code for the required builtin MATLAB functions. Thus, from the MATLAB programmer's point of view, the MIX10 compiler is quite simple to use.
MATLAB is actually quite a complicated language to compile, starting with its rather unusual syntax, which cannot be parsed with standard LALR techniques. There are several issues that must be dealt with, including distinguishing places where white space and newline characters have syntactic meaning, and filling in missing end keywords, which are sometimes optional. The McLAB frontend handles the parsing of MATLAB through a two-step process. A preprocessing step translates MATLAB programs to a cleaner subset, called Natlab, which has a grammar that can be expressed cleanly for an LALR parser. The McLAB frontend delivers a high-level AST based on this cleaner grammar.
After parsing, the next major phase of MIX10 uses the MCSAF framework [DH12a, Doh11] to disambiguate identifiers using kind analysis [DHR11], which determines whether an identifier refers to a variable or a named function. This is required because the syntax of MATLAB does not distinguish between variables and functions. For example, the expression a(i) could refer to four different computations: a could be an array or a function, and i could refer to the built-in function for the imaginary value i, or to a variable i. The MCSAF framework also simplifies the AST, producing a lower-level AST which is more amenable to subsequent analysis.
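At its core, the disambiguation can be pictured as a lookup against the set of identifiers known to be assigned as variables; everything else resolves to a function if one is visible. The real kind analysis in MCSAF is flow-sensitive and far more involved; this toy Python sketch (our own names) only conveys the flavour:

```python
def kinds(assigned_vars, used_ids, known_functions):
    """Classify each used identifier as 'var' or 'fn' ('unknown' otherwise)."""
    out = {}
    for name in used_ids:
        if name in assigned_vars:         # an assignment reaches => variable
            out[name] = 'var'
        elif name in known_functions:     # e.g. the builtin 'i' for sqrt(-1)
            out[name] = 'fn'
        else:
            out[name] = 'unknown'
    return out
```

So in a function where a is assigned but i never is, a(i) is classified as an array indexed by the builtin i, resolving the four-way ambiguity described above.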
The next major phase is the Tamer [DH12b], which is a key component of any tool which statically compiles MATLAB. The Tamer generates an even more refined AST, called the Tamer IR, and performs key interprocedural analyses to determine both the call graph and an estimate of the base type and shape of each variable at each program point. The call graph is needed to determine which files (functions) need to be compiled, and the type and shape information is very important for generating reasonable code when the target language is statically typed, as is the case for X10.
The Tamer also provides an extensible interprocedural value analysis and an interprocedural analysis framework that extends the intraprocedural framework provided by MCSAF. Any static backend will use the standard results of the Tamer, but is also likely to implement some target-language-specific analyses which estimate properties useful for generating code in the specific target language. Currently, we have implemented two analyses: (1) an analysis for determining whether a MATLAB variable is real or complex, to enable support for complex numbers in MIX10 and other MATLAB compilers based on McLAB; and (2) the IntegerOkay analysis, to identify which variables can be safely declared to be of an integer type (Int or Long) instead of the default type Double.
For the purposes of MIX10, the output of the Tamer is a low-level, well-structured AST, along with key analysis information about the call graph, the types and shapes of variables, and X10-specific information. These Tamer outputs are provided to the code generator, which generates X10 code and which is the main focus of this thesis.
The X10 source code generator actually gets inputs from two places. It uses the Tamer IR it receives from the Tamer to drive the code generation, but for expressions referring to builtin MATLAB functions it interacts with the builtin handler, which uses the builtin template file we provide. We describe the functioning of the builtin handler in Chapter 5.
Chapter 4 concentrates on generating efficient code for MATLAB arrays. Chapter 6 describes the code generation strategy for the sequential core of MATLAB, while Chapter 7 describes our strategy for generating parallel X10 code for the MATLAB parfor construct, and for introducing X10-like concurrency constructs in MATLAB. The focus of this thesis is to address the challenges in generating efficient X10 code whose performance is comparable to that of state-of-the-art tools that target more traditional imperative languages like C and Fortran.
The MIX10 code generator is the key component which makes the translation from the Tamer IR, which is based on MATLAB programming constructs and semantics, to X10. The overall structure of the MIX10 code generator is given in Figure 3.2.
The input to the MIX10 compiler is the call graph generated by the Tamer and the Tamer IR annotated with the necessary analysis information, such as shape, iscomplex, type, and IntegerOkay. Rather than doing direct code generation to X10 source code, MIX10 translates the Tamer IR to the MIX10 IR, a general and extensible IR, designed by us, to represent X10. This translation is done by the X10 IR generator module of MIX10. Finally, with the inputs from the builtin handler and the X10 IR generator (after the X10-specific analyses and transformations have been done), the X10 source code generator generates the resultant X10 source code.
The MIX10 IR is a low-level, three-address-like intermediate representation that is similar in design to the Tamer IR; it abstracts the X10 constructs while capturing the static information required to generate the X10 source code. We have implemented the IR using JastAdd [EH04, jas], which allows us to easily add new AST nodes by simply extending the JastAdd specification grammar.
There are three important reasons to use an IR instead of directly generating the X10 code:
Appendix B provides the JastAdd implementation of the MIX10 IR grammar.
Chapter 4
Techniques for Efficient Compilation of MATLAB Arrays
___________________________________________________________________________________
Arrays are at the core of the MATLAB programming language. Every value in MATLAB is a matrix and has an associated array shape; even scalar values are represented as 1×1 arrays. Most data read and write operations involve accessing individual array elements or sets of them. Given the central role of arrays in MATLAB, it is of utmost importance for our MIX10 compiler to find effective and efficient translations to X10 arrays.
Given the shape information provided by the shape analysis engine [LH14] built into the McLab analysis framework [DH12a, Doh11, DH12b], it was not hard to compile MATLAB arrays to X10. However, generating X10 code whose performance is competitive with the generated C code (via MATLAB Coder) and the generated Fortran code (via the MC2FOR compiler) was not straightforward, and required a deeper understanding of the X10 array system and careful handling of several features of MATLAB arrays.
As described in Section 2.1.3, X10 provides two higher-level abstractions for arrays: simple arrays, a high-performance but rigid abstraction, and region arrays, a flexible abstraction that is not as efficient as the simple arrays. To achieve the best efficiency, our strategy is to use simple arrays whenever possible and to fall back to region arrays when necessary. Note that it is possible to force the MIX10 compiler to use region arrays via a switch, for experimentation purposes.
In dealing with the simple rail-backed arrays, there were two important challenges. First, we needed to determine when it is safe to use the simple rail-backed arrays, and second, we needed an implementation of simple rail-backed arrays that handles the column-major layout, 1-based indexing, and linearization operations required by MATLAB.

When to use simple rail-backed arrays: After the shape analysis of the source MATLAB program, MIX10 generates X10 code that uses simple arrays if the shapes of all the arrays in the program: (1) are known statically; (2) are supported by the X10 implementation of simple arrays; and (3) have a dimensionality that remains the same at all points in the program.

Column-major indexing: To make X10 simple arrays more compatible with MATLAB, we modified the implementation of the Array_2 and Array_3 classes in the x10.array package to use column-major ordering instead of the default row-major ordering when linearizing multidimensional arrays to the backing Rail storage (our modified version is available at http://www.sable.mcgill.ca/mclab/mix10/x10_update/). Since MATLAB uses column-major ordering to linearize arrays, this modification also makes it trivial to support MATLAB's linear indexing operations (see http://www.mathworks.com/help/matlab/math/matrixindexing.html). MATLAB naturally supports linear indexing for individual element access: if the number of subscripts in an array access is less than the number of dimensions of the array, the last subscript is linearly indexed over the remaining dimensions in column-major fashion. Our use of column-major ordering for the backing Rail makes linear indexing easy and efficient to support, by allowing direct access to the underlying Rail at the calculated linear offset.
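As a concrete illustration of the layout just described, here is a minimal Java sketch (not the actual x10.array implementation) of how a column-major, 1-based subscript pair maps to a linear offset in the backing storage, and how MATLAB-style linear indexing falls out of that choice:

```java
// Sketch of column-major linearization with 1-based subscripts, as a
// modified Array_2 would compute it (hypothetical helper, not the real
// x10.array code).
class ColMajor {
    // Linear offset into the backing storage for element (i, j) of an
    // m x n array, with 1-based subscripts and column-major order.
    // (n is not needed for the 2-D offset; it is kept for clarity.)
    static int offset2(int i, int j, int m, int n) {
        return (j - 1) * m + (i - 1);
    }

    // MATLAB-style linear indexing: a single 1-based subscript k walks
    // down the columns, so it maps directly to offset k - 1.
    static int linear(int k) {
        return k - 1;
    }
}
```

With this layout, a linear subscript and the equivalent (row, column) pair reach the same storage cell, which is why the modification makes linear indexing trivial.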
Given that we can determine when it is safe to use the simple rail-backed arrays, and given our improved X10 implementation of them, we then designed the appropriate translations from MATLAB to X10 for array construction and for array accesses, both of individual elements and of ranges. Given the number of dimensions and the size of each dimension, it is easy to construct a simple array. For example, a two-dimensional array A of type T and shape m×n can be constructed using a statement like val A:Array_2[T] = new Array_2[T](m,n);. Additional arguments can be passed to the constructor to initialize the array. Another important thing to note is that MATLAB allows the use of the keyword end, or an expression involving end (like end-1), as a subscript; end denotes the highest index in that dimension. If the highest index is not known statically, the numElems_i property of the simple arrays is used to get the number of elements in the ith dimension of the array.
Given MATLAB's dynamic nature and unconventional semantics, it is not always possible to statically determine the shape of an array accurately. Luckily, with some thought to a proper translation, X10's region arrays are flexible enough to support MATLAB's "wild" arrays. Also, since Point objects can be sets of arbitrary integers, there is no restriction on the starting index of the arrays; region arrays can easily use one-based indexing.

Array construction: Array construction for region arrays involves creating a region over a set of points (or index objects) and assigning it to an array. Regions of arbitrary ranks can be created dynamically. For example, consider the following MATLAB code snippet:
In this code, the number of dimensions of array t, and hence of array x, cannot be determined statically at compile time. In such cases it is not possible to generate X10 code that uses simple arrays; however, the program can still be compiled to the following X10 code for function foo().
In this generated X10 code, t is an array of type Double which can be created by copying from another array returned by bar(a), without knowing the shape of the returned array.

Array access: Subscripting operations that access individual elements are mapped to X10's region array subscripting operation. If the rank of the array is 4 or less, it is subscripted directly by integers corresponding to the subscripts in MATLAB; otherwise we create a Point object from these integer values and use it to subscript the array. If an expression involving end is used for indexing and the complete shape information is not available, the method max(Long i) provided by the Region class is used, which allows the highest index for a particular dimension to be determined at runtime.

Rank specialization: Although region arrays can be used with minimal compile-time information, providing additional static information can improve the performance of the resultant code by eliminating runtime checks on the provided information. One of the key specializations that we introduced for region arrays is to specify the rank of an array in its declaration whenever it is known statically. For example, if the rank of an array A of type T is known to be two, it can be declared as val A:Array[T](2);. This specialization provides better performance compared to unspecialized code, as shown in Section 9.6.
MATLAB allows the use of an expression such as a:b (or colon(a,b)) to create a vector of integers [a, a+1, a+2, ... b]. In another form, an expression like a:i:b can be used to specify a stride i between consecutive elements of the resulting vector. Using a colon expression for array subscripting selects all the elements of the array whose subscript in a particular dimension lies in the vector created by the colon expression for that dimension. Consider the following MATLAB code:
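The semantics of the colon expression can be sketched as follows in Java (a hypothetical helper, not MIX10's actual X10 builtin):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of MATLAB's colon operator a:step:b (hypothetical helper;
// MIX10's actual builtin implementation is written in X10).
class Colon {
    // Returns [a, a+step, a+2*step, ...], stopping at b; an empty vector
    // when the range is empty (e.g. 5:1, or a zero step).
    static List<Integer> colon(int a, int step, int b) {
        List<Integer> v = new ArrayList<>();
        if (step > 0) for (int x = a; x <= b; x += step) v.add(x);
        else if (step < 0) for (int x = a; x >= b; x += step) v.add(x);
        return v;
    }

    // a:b is shorthand for a:1:b.
    static List<Integer> colon(int a, int b) {
        return colon(a, 1, b);
    }
}
```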
Here y is a three-dimensional array of shape 3×4×5 and x is a subarray of y of shape 1×2×5. Such array accesses are handled by simply calling the getSubArray[T]() method that we have implemented in a Helper class provided with the generated code. The generated X10 code with simple arrays for this example is as follows:
The colon operator can also be used on the left-hand side, for an array set operation that updates multiple values of the array. For example, in the MATLAB statement x(:,4) = y;, all the values of the fourth column of x are set to y if y is a scalar, or to the corresponding values of y if y is a column vector with length equal to the size of the first dimension of x. To handle this kind of operation we have implemented another helper method, setSubArray(). This method takes as input the target array, the bounds on each dimension, and the source array. x(:,4) = y; will be translated by MIX10 to x = Helper.setSubArray(x, 1, x.numElems_1, 4, 4, y);.
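A minimal Java sketch of what such a setSubArray helper could look like for the 2-D, column-major case follows; the flat-array representation and parameter layout are illustrative assumptions, not the actual Helper code:

```java
// Sketch of a setSubArray helper for the 2-D case, on a flat
// column-major array (hypothetical Java stand-in for the X10 Helper).
class SubArray {
    // Writes src into x(lo1..hi1, lo2..hi2); src is either a single value
    // (scalar broadcast) or exactly one value per selected element,
    // ordered column-major. Subscripts are 1-based; m is x's row count.
    static double[] setSubArray(double[] x, int m,
                                int lo1, int hi1, int lo2, int hi2,
                                double[] src) {
        boolean scalar = src.length == 1;
        int k = 0;
        for (int j = lo2; j <= hi2; j++)
            for (int i = lo1; i <= hi1; i++)
                x[(j - 1) * m + (i - 1)] = scalar ? src[0] : src[k++];
        return x;
    }
}
```

Under this sketch, x(:,4) = y; on a 3×4 array corresponds to setSubArray(x, 3, 1, 3, 4, 4, y), matching the call shape shown above.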
We have implemented overloaded versions of the getSubArray() and the setSubArray() methods for arrays of different dimensions. For region arrays, we provide the same methods that operate on region arrays in a different version of the Helper class. MIX10 provides the correct version of the Helper class, based on what kind of arrays are used.
MATLAB allows explicit array growth at runtime via the horzcat() and vertcat() builtin functions for array concatenation. In MIX10 this feature is supported for simple arrays as long as the array growth does not change the number of dimensions of the array; for region arrays, it is supported in full. For simple arrays, X10 allows a variable declared to be an array of rank i to hold any array value of the same rank. For example, consider the following set of statements:
Here, x is defined to be of type Array_2[Long] and can hold arrays of different sizes at different points in the program.
Region arrays, being more dynamic, also support array growth even if it changes the rank of the array. For example, the following set of statements is valid in an X10 program that uses region arrays:
Here x is a 2dimensional array and y is a 3dimensional array.
Section 9.6 discusses the performance results obtained by using different kinds of arrays and provides a comparison of them, thus showing the efficiency of our approach for compiling MATLAB arrays to X10.
Handling MATLAB Builtins
___________________________________________________________________________________
MATLAB builtin methods are at the core of the language and are one of the features that make it popular among scientists: they provide a huge set of commonly used numerical functions. All the operators, including the standard binary operators (+, -, *, /), the comparison operators (<, >, <=, >=, ==) and the logical operators (&, &&, |, ||), are merely syntactic sugar for corresponding builtin methods that take the operands as arguments. For example, an expression like a+b is actually implemented as plus(a,b). An important thing to note is that, unlike in most programming languages, all the MATLAB builtin methods by default operate on matrix values as a whole. For example, a*b or mtimes(a,b) actually performs matrix multiplication on the matrix values a and b. However, most of the builtin methods also accept one or more scalar, or more accurately 1×1 matrix, arguments. Builtin methods are overloaded to accept almost all possible shapes of arguments. Thus mtimes(a,b) can take both a and b as matrix arguments (including 1×1 matrices), with the number of columns in a equal to the number of rows in b, in which case the result is the matrix product of a and b; or one of them can be a 1×1 matrix and the other a matrix of any size, in which case the result is a matrix containing each element of the non-scalar argument times the scalar argument. Wherever possible, MATLAB builtins also support complex numerical values. X10, on the other hand, like most programming languages, operates on scalar values by default.
Because X10 is still new and evolving, it has a very limited set of libraries, especially for supporting the large set of available MATLAB builtin methods. The X10 Global Matrix Library (GML) supports double-precision matrix operations; however, it is still not as extensive as MATLAB's set of operations and it poses some restrictions:
Due to the above restrictions, the X10 Global Matrix Library is useful in some situations, for example when there is a matrix multiplication over very large data, but it cannot be used, or is not a good choice, for a large number of operations.
For a language with open-sourced libraries, it would be possible to actually compile the library methods to X10. However, many MATLAB libraries are closed source, and thus it is not possible to translate them to X10.
We decided to write our own X10 implementations of the commonly used MATLAB builtin methods; currently we have implemented only those methods that are used in our benchmarks. In this thesis, we concentrate on how these methods are included in the generated X10 code with minimal loss of readability and performance, rather than on their actual implementation.
The code below shows the X10 code for the MATLAB builtin method plus(a,b) involving two-dimensional simple arrays.
This X10 code contains four overloaded versions (and it still does not contain the methods needed to support complex values, variables of types other than Double, simple arrays of other dimensions, and region arrays), based on whether each argument is a scalar or an Array_2 and on its position in the argument list.
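Since the X10 listing itself is not reproduced here, the following Java approximation sketches the structure of those four overloads for the Double case (illustrative only; the generated code is X10, not Java):

```java
// Java approximation of the four overloaded plus versions described in
// the text, for 2-D arrays of doubles (a sketch of the structure, not
// the thesis's actual X10 listing).
class Mix10Plus {
    // array + array, element-wise
    static double[][] plus(double[][] a, double[][] b) {
        double[][] r = new double[a.length][a[0].length];
        for (int i = 0; i < a.length; i++)
            for (int j = 0; j < a[0].length; j++)
                r[i][j] = a[i][j] + b[i][j];
        return r;
    }
    // array + scalar: the scalar is broadcast over every element
    static double[][] plus(double[][] a, double b) {
        double[][] r = new double[a.length][a[0].length];
        for (int i = 0; i < a.length; i++)
            for (int j = 0; j < a[0].length; j++)
                r[i][j] = a[i][j] + b;
        return r;
    }
    // scalar + array: addition commutes, so reuse the previous version
    static double[][] plus(double a, double[][] b) {
        return plus(b, a);
    }
    // scalar + scalar
    static double plus(double a, double b) {
        return a + b;
    }
}
```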
Including all the overloaded versions, together with versions specialized for arrays of different dimensions or for region arrays, in the generated X10 code would result in a lot of lookup overhead, would require producing redundant code (versions of methods with arguments of similar shape but different types would have the same algorithm), and would generate large, less readable code. Instead, we designed a specialization technique that selects the appropriate versions of only the methods used in the source MATLAB program. Note that using generic types to handle arguments of different types is not always a good idea, since several builtin implementations involve calls to X10 library functions that are not defined on generic types. For example, functions in the x10.lang.Math library, like floor(Double a) and max(Double a, Double b), do not take generic type arguments.
After studying numerous builtin methods, we categorized them into the following five categories:
Each of these categories, except Type 5, uses a similar code template for different types of values. Note that due to the three-address-code-like structure of the Tame IR, a call to a builtin almost always contains zero, one, or two arguments. For builtin calls like horzcat and vertcat, which may take a variable number of arguments, MIX10 packs all the arguments into a Rail and passes a single argument of type Rail. Accordingly, these builtins are implemented in the Type 2 category and receive a single argument of type Rail.
We built an XML file that contains the method bodies in each category for every builtin method that we support. The XML file also contains specialized implementations of every builtin for the different kinds of arrays. We implement the following strategy to select and generate the correct and required methods. First, we make a pass through the AST to collect a list of all the builtin methods used in the source MATLAB program. Next, we parse the XML file once and read in the X10 code templates for all the categories of the builtin methods collected in the first step. Then, whenever a call to a builtin method is made, based on the results of the value analysis we: (1) identify the required specialization for the method (simple array or region array); and (2) generate the correct method header and select the corresponding builtin template in the required specialization for that method. The generated methods are finally written to an X10 class file named Mix10.x10. In the code generated for the actual MATLAB program, a call to a builtin method is simply replaced by a call to the corresponding method in the Mix10 class; for example, the MATLAB expression plus(a,b) is translated to the X10 expression Mix10.plus(a,b). Appendix A demonstrates the structure of the builtin XML file with an example implementation of the builtin plus.
This approach not only improves the readability of the generated code, but also allows for future extensibility, better maintenance, and more specialization. Specializations that we plan to add in the future are: (1) the ability to use the Global Matrix Library for the methods available in it, whenever the data size is large enough; and (2) concurrent versions of the relevant builtins to support the parallel execution of vector instructions, as described in Section 7.3. We also encourage advanced users to modify the generated Mix10.x10 file to enhance or add builtin implementations for higher performance of the generated code.
Code Generation for the Sequential Core
___________________________________________________________________________________
MATLAB is a programming language designed specifically for numerical computations. Every value is a matrix and has an associated array shape; even scalar values are 1×1 matrices, and vectors are 1×n or n×1 matrices. All values are by default of type double. MATLAB naturally supports imaginary components for all numerical values, and almost all operators and library functions support complex inputs. In the rest of this chapter we describe some of the key features of MATLAB that demonstrate what makes MATLAB different and challenging to compile statically, and the techniques used by MIX10 to translate these "wild" features to X10. We provide the necessary details on the various MATLAB constructs as we discuss them; however, readers who are totally unfamiliar with MATLAB may wish to read Chapter 2 of the M.Sc. thesis on the Tamer framework [Dub12].
A function definition in MATLAB takes one or more input arguments and returns one or more values. A typical MATLAB function looks as follows:
This function has two input arguments, a and b, which can be of any type and any shape, and returns two values, x and y, of the same shapes as a and b respectively, with types determined by MATLAB's type conversion rules. The Tame IR provides a list of input arguments and a list of return values for a function. The interprocedural value analysis identifies the types and shapes of all the arguments and return values, and whether they are complex numerical values.
MATLAB functions are mapped to X10 methods. For the entry function, the type of the input argument is specified by the user (the Tame IR requires an entry function, or driver function, with one argument; this function may call other functions with any number of input arguments). For other functions, the parameter types are computed by the value analysis performed by the Tamer on the Tame IR. The computed type information includes the type of the value, its shape, and whether it is a complex value. The other statements in the function block are processed recursively and corresponding nodes are created in the X10 IR. Finally, if there are any return values, as determined by the Tame IR, a return statement is inserted in the X10 IR at the end of the method. If the function returns only one value, say x, then the inserted statement is simply return x;, but if the function returns more than one value (which is quite common in MATLAB) then we return a Rail of type Any whose elements are the returned values. So, for the above example, the return statement is return [x as Any, y as Any];. Note that the use of the short syntactic form for Rail construction improves the readability of the generated code. Below is the generated code for the simple example above.
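The multiple-return convention can be illustrated with a small Java sketch, using Object[] as a stand-in for X10's Rail[Any] (the function minMax below is a made-up example, not from the thesis):

```java
// Sketch of the multiple-return-value convention: a function returning
// [x, y] in the generated X10 returns a Rail of Any; the Java analogue
// below uses Object[] (hypothetical stand-in).
class MultiReturn {
    // A function returning two values, x and y.
    static Object[] minMax(double a, double b) {
        double x = Math.min(a, b);
        double y = Math.max(a, b);
        return new Object[] { x, y };   // cf. return [x as Any, y as Any];
    }
}
```

On the caller's side, each element is read back out of the array and cast to its expected type, mirroring how the generated X10 unpacks the Rail.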
Also note that the variables mc_t0 and mc_t1 are introduced by the Tamer in the Tame IR.
MATLAB provides the following basic types:
These basic types are mapped naturally to X10 base types as follows: double and single floating-point values are mapped to Double and Float respectively; unsigned integers are mapped to UByte, UShort, UInt, and ULong; signed integer values are mapped to Byte, Short, Int, and Long; logical is mapped to Boolean; and char is mapped to Char (a vector of chars is mapped to the String type). If the shape of an identifier of type T is greater than 1×1, it is mapped to Array[T]. MATLAB's type conversion rules are quite different from those of standard languages; for example, an operation involving a double and an int32 results in a value of type int32 (the type rules are explained in detail in the Tamer documents, www.sable.mcgill.ca/mclab/tamer.html). MIX10 inserts an explicit typecast wherever required.
All the MATLAB operators are designed to work on matrix values and are provided as syntactic sugar for the corresponding builtin methods that take the operands as arguments. Operators are overloaded to support different semantics for 1×1 matrices (scalar values). MATLAB provides two types of operators: matrix operators and array operators. Matrix operators work on whole matrix values; these include matrix multiplication (*) and matrix division (\, /). Array operators always operate in an element-wise manner; for example, the array multiply operator .* performs element-wise multiplication. As described in Chapter 5, MIX10 implements all operators as builtins.
MATLAB is a dynamically typed language, which means that variables need not be declared and can take on any value that is assigned to them. X10, however, is statically typed and requires variables to be declared before being assigned to. MIX10 maintains a list of all the declared variables, starting with an empty list. Whenever an identifier appears on the left-hand side of an assignment statement, if it is not already present in the list, a declaration statement is added to the X10 IR and the variable (with its associated type and value information) is added to the list, followed by an assignment statement corresponding to the statement in MATLAB. If the identifier is already present in the list, the assignment statement is added to the X10 IR and the associated type and value information is updated. If the MATLAB assignment statement is inside a loop and needs a declaration, the declaration statement (without any assignment) is added to the method block outside any loop or conditional scope, and the assignment statement is added in the scope where it appears in the MATLAB code. If the identifier on the LHS is an array, then the declaration creates a new array with a shape corresponding to the shape of the RHS array. For example, a MATLAB statement like a=b;, where the shape of a is, say, 3×3 and its type is double, will be translated (outside the scope of any loops or conditionals) to a:Array_2[Double]=new Array_2[Double](b); for simple arrays and to a:Array[Double]=new Array[Double](b); for region arrays.
Loops in MATLAB are fairly intuitive, except for one semantic difference from most other languages: in a for loop, if the loop index variable is redefined inside the body of the loop, its new value persists only within that particular iteration and does not affect the number of loop iterations. For example, consider the following listing.
Note that inside every iteration, the value of the loop index variable i is 3, but the loop still terminates after ten iterations. The above code is translated to the following X10 code:
To handle these somewhat unusual semantics, we introduce a new loop index variable and assign it to the original loop index variable at the beginning of the loop body; the rest of the loop body is translated by the standard rules. Note that the new loop index variable is introduced only if the actual loop index variable is redefined inside the loop body.
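The translation can be sketched in Java as follows (using a Java for loop in place of X10's loop over a LongRange; the variable names are illustrative):

```java
// Sketch of the loop-index translation: MATLAB's
//   for i = 1:10, i = 3; end
// still runs ten iterations, so the generated loop iterates a fresh
// index and copies it into the user's variable at the top of each body.
class LoopIndex {
    static int run() {
        int iterations = 0;
        long i;                                       // user's loop variable
        for (long i_x10 = 1; i_x10 <= 10; i_x10++) {  // fresh compiler index
            i = i_x10;                                // copy at top of body
            i = 3;   // user's redefinition: affects only this iteration
            iterations++;
        }
        return iterations;                            // matches MATLAB: 10
    }
}
```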
In MATLAB, conditionals are expressed using the if-elseif-else construct and do not have any wild semantics. MATLAB also allows switch statements, which are converted to equivalent if-else statements by the Tamer. The Tamer also recursively converts a statement like if (B1) S1 elseif (B2) S2 else S3 into a series of if-else clauses like if (B1) S1 else { if (B2) S2 else S3 }. This if-else construct maps intuitively to the if-else construct in X10.
Function calls in MATLAB are similar to those in other programming languages if the called function returns nothing or returns only one value. However, MATLAB allows a function to return multiple values. Whenever a call is made to such a function, the returned values are received in a list, in the order specified by the function definition. For example, in the statement [x,n] = bubble(a);, a call is made to the function bubble, which returns two values that are read into x and n respectively. This statement is compiled to the following code in X10.
The key idea here is to create a Rail of type Any and read the returned values from it. Recall that MIX10 packs the multiple return values of a method into a Rail of type Any and returns it; the individual elements of the list simply read the values from this array. If the function call is inside a loop, all the declarations are moved out of the loop and only the assignments remain inside the loop.
Cell arrays in MATLAB are arrays of data containers called cells, and each cell can contain data of any type. For example, fooCell = {'x',10,'I like',ones(3,3)}; creates a cell array containing values of type char, double, char array, and double array. For the translation to X10, the elements of the cell array are packed into a Rail of type Any; when an element is accessed, it is type cast back to its original type. Consider the following MATLAB listing:
It creates a cell array containing two arrays, and is translated to the X10 code below:
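The same packing-and-casting scheme can be sketched in Java, with Object[] standing in for Rail[Any] (the cell contents below are illustrative, not the thesis's listing):

```java
// Sketch of the cell-array translation: elements are packed into an
// array of Object (standing in for X10's Rail[Any]) and cast back to
// their original types on access.
class CellArray {
    static double demo() {
        Object[] fooCell = { 'x', 10.0, "I like", new double[3][3] };
        // Accessing an element requires a cast back to its static type:
        double n = (Double) fooCell[1];
        String s = (String) fooCell[2];
        return n + s.length();   // 10.0 plus the 6 characters of "I like"
    }
}
```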
Code Generation for Concurrency in MATLAB
___________________________________________________________________________________
MATLAB programmers often recognize the parallel nature of the computations involved in their programs but cannot express it, due to the lack of fine-grained concurrency controls in MATLAB. Some concurrency can be achieved using controls like parfor and other tools in Mathworks' parallel computing toolbox, but this has several drawbacks: (1) the parallel toolbox is limited in terms of scaling (MATLAB currently supports only up to 12 worker processes to execute applications on a multicore processor [Mat13]); (2) the parallel toolbox must be purchased separately, so not all licensed MATLAB users will have it available; and (3) MATLAB's concurrency is often slower than X10's concurrency controls (as shown in Section 9.8). Vectorization (http://www.mathworks.com/help/matlab/matlab_prog/vectorization.html) is a technique to convert loop-based scalar operations into vector operations, for which MATLAB is optimized. So, another way of exposing parallelism in MATLAB is to optimize these instructions to perform the computations concurrently on the elements of the vector.
Sec. 2.2 gives an introduction to the concurrency controls in X10. Readers not familiar with X10 may find it useful to read it before continuing with this chapter.
The MATLAB parfor construct is an important feature of MATLAB, provided by Mathworks' parallel computing toolbox [Mat13]. It allows the iterations of a for loop to be executed in parallel whenever it is safe to do so, thus greatly enhancing the performance of the loop's execution. Other static MATLAB compilers, like MATLAB Coder and MC2FOR, do not support the parfor loop due to the lack of builtin concurrency features in their target languages, C and Fortran. However, X10, being a parallel programming language, naturally provides concurrency control features. The MIX10 compiler supports parallel code generation for the MATLAB parfor construct and provides significantly better performance than the MATLAB code with parfor, and also than the sequential version of the X10 code generated for the same program.
The parfor (parallel for) loop is a key parallelization control provided by the MATLAB parallel computing toolbox; it can be used to execute the iterations of a for loop in parallel with each other. The challenge was to implement it with X10's concurrency controls while maintaining its complex semantics and aiming for better performance than that provided by the parallel computing toolbox. There are three important semantic characteristics of MATLAB's parfor loop:
Here x = x+i; is a reduction assignment [Matb] statement. The value of d is local to each iteration, and its value before the loop is retained after the loop; note that the value of d outside the loop is invisible inside the loop. For the statement A(i) = d;, each iteration modifies a unique element accessible only to it, hence the final value of A is independent of the order of execution, and its value is updated after the loop.
The MIX10 compiler uses the following strategy to translate parfor loops to X10:
Equivalent X10 code for the above example MATLAB code is given on the right side of Figure 7.1. The use of finish and async ensures that the iterations are executed in parallel and that the statement after the for loop is blocked until all the iterations have finished executing. Note that the for loop iterates over a LongRange to ensure that the declaration of the loop variable i is local to each iteration. The statement x = x+i is a reduction statement, since its order of execution does not affect the value of x at the end of the loop; it is declared to be atomic to ensure that the addition and the assignment in the statement are executed as a whole, without interference from its execution in other iterations. Since the variable d is also defined outside the loop, it is replaced by a local variable d_local inside the loop. Finally, since each array element A(i) is unique, the assignment to it is executed normally in each iteration.
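The overall scheme can be sketched in Java, with threads standing in for async, join standing in for finish, and a lock standing in for atomic (a rough analogy only; X10's constructs are lighter-weight than raw Java threads, and the value 5 for d_local is an illustrative assumption):

```java
// Sketch of the parfor translation: x accumulates under a lock (atomic),
// d gets a per-iteration local copy, and A(i) is written without
// synchronization because each iteration owns a distinct element.
class ParForSketch {
    static long x;                          // reduction variable
    static final Object lock = new Object();

    static long[] run(int n) {
        long[] A = new long[n + 1];         // 1-based, like MATLAB
        x = 0;
        Thread[] ts = new Thread[n];
        for (int i = 1; i <= n; i++) {      // finish { for (i in 1..n) async {
            final int ii = i;
            ts[i - 1] = new Thread(() -> {
                long d_local = 5;           // loop-local copy of d
                synchronized (lock) {       // atomic { x = x + i; }
                    x = x + ii;
                }
                A[ii] = d_local;            // each iteration owns A(i)
            });
            ts[i - 1].start();
        }
        for (Thread t : ts) {               // } } -- finish waits for all asyncs
            try { t.join(); } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return A;
    }
}
```

Whatever order the iterations run in, the locked reduction yields the same final x, which is exactly the property that makes x = x+i safe to parallelize.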
To conclude, we can translate MATLAB's parfor to semantically equivalent code in X10; and since X10 can handle massive scaling, we can obtain significantly better performance with X10 than with MATLAB, as shown by our experimental results in Section 9.8.
To enable MATLAB to be compiled for high-performance computing, it is important to let programmers exploit fine-grained concurrency in their MATLAB programs. Due to the lack of fine-grained concurrency controls in traditional MATLAB, we decided to introduce controls into MATLAB that can be translated by our MIX10 compiler to the analogous concurrency controls in X10. However, it was important that the introduction of such controls should not have any side effects when the program is compiled by the traditional Mathworks' MATLAB system, so we introduced them as structured special comments.
We introduced the following concurrency constructs in MATLAB: (1) %!async, (2) %!finish, (3) %!atomic, (4) %!when(condition), where condition is a boolean expression, and (5) %!at(p), where p is an integer value denoting a place in X10. Programmers can write these constructs before the statements that they want to control and mark the end of a control by using %!end after the statements. Note that, because of the preceding % sign, these constructs are treated as comments by other MATLAB compilers and do not cause any unwanted effects. It is important to note that a MATLAB programmer using these constructs must ensure that the parallel execution they cause does not change the behavior of the program from that of its sequential execution; in short, it is the responsibility of the programmer to ensure the safety of the program when using these constructs.
Figure 7.2 shows an example of how to use these controls in MATLAB, followed by the generated X10 code for it.
The use of vectorized instructions is another optimization technique used by MATLAB to speed up an operation on multiple scalar values, by combining the scalar values into a vector and executing the operation on the vector. Such single-instruction, multiple-data style operations are good candidates for parallelization. However, the efficiency of parallelizing such operations depends on the size of the vector, the complexity of the operation involved, and the executing hardware. Thus, in order to make it most effective, we wanted to provide full support for the parallelization of vector instructions and give the programmer the ability to control, based on the size of the vector, when a vector operation is executed concurrently.
We implemented concurrent versions of the relevant builtin operations that can operate in parallel on vectors of arbitrary sizes. (Currently, these concurrent builtin implementations are not integrated into the builtin handling framework; however, due to the extensible nature of the framework, it should be straightforward to add a specialization for the concurrent version of a builtin, and we plan to do so as future work.) We also introduced a compiler switch for MIX10 that lets programmers specify a vector-length threshold, for all builtins or for a specific builtin, above which the concurrent version of the builtin is executed. For example, if the user wants an operation sin(A) to be executed concurrently only if A is a vector of length greater than, say, 1000, then when invoking the MIX10 compiler she can specify the threshold by using the switch vec_par_length sin=1000. MIX10 will generate a call to the concurrent version of sin() if the length of A is greater than 1000, and a call to the sequential version otherwise. Using the vec_par_length switch, the programmer can specify thresholds for one, several, or all builtin methods; for example, vec_par_length all=500 sin=1000 cos=1000 sets the threshold for sin() and cos() to 1000 and to 500 for all other builtins.
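The dispatch that such a call site performs can be sketched in Java (the method names and the use of a parallel stream are illustrative assumptions; MIX10 emits X10, and its concurrent builtins are implemented differently):

```java
import java.util.Arrays;

// Sketch of the vec_par_length dispatch: the generated call site picks
// the concurrent or sequential builtin based on the vector length and
// the per-builtin threshold given on the command line.
class VecDispatch {
    // Sequential version of the sin builtin.
    static double[] sinSeq(double[] a) {
        double[] r = new double[a.length];
        for (int i = 0; i < a.length; i++) r[i] = Math.sin(a[i]);
        return r;
    }
    // Concurrent version, here via a parallel stream.
    static double[] sinPar(double[] a) {
        return Arrays.stream(a).parallel().map(Math::sin).toArray();
    }
    // Call site as it could be emitted for vec_par_length sin=1000:
    // the threshold decides which version runs.
    static double[] sin(double[] a, int threshold) {
        return a.length > threshold ? sinPar(a) : sinSeq(a);
    }
}
```

Both paths compute the same element-wise result; only the execution strategy changes with the vector length.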
As future work, we plan to evaluate concurrent execution of vectorized instructions extensively. It would be interesting to study the performance variations of concurrently executed vector instructions with varying threshold vector lengths and on different parallel architectures.
The IntegerOkay Analysis
___________________________________________________________________________________
In this chapter we present the IntegerOkay analysis, which identifies which variables in the source MATLAB program can be safely declared to be of an integer type instead of the default double type. In MATLAB, all variables holding a numerical value are by default of type Double, which means that, by default, in the X10 code generated from MATLAB all variables are statically declared to be of type Double. However, in languages like X10, Java and C++, certain program operations require the variables used to be of an integer type. A prominent example of such an operation is an array access: the variables used to index into the array must be of an integer type. For example, in a statement like x = A(i,j), the variables i and j are required to be of integer type, and an error results otherwise.
A simple solution to handle this problem in the code generated from MATLAB is to explicitly cast a variable from Double to Long whenever it is required to be used as an integer. However, our experiments showed this approach to be very inefficient. With this approach, we observed that the C++ programs generated by the X10 compiler's C++ backend were slow, and often even slower than the Java code generated by the X10 Java backend for the same program (which was somewhat surprising). The reason for the added slowness in the C++ code was that each typecast from Double to Long involved an explicit check that the value of the Double variable lies in the 64-bit range supported by Long, whereas the cast in Java is handled by a primitive bytecode cast instruction. However, even in Java, extraneous casts clearly hurt performance.
To solve this problem, we designed and implemented the IntegerOkay analysis that identifies variables that can be safely declared to be of Long type, thus eliminating the need for costly typecasting on these variables.
To understand the effect of typecasting on performance, consider the simple example of X10 code shown in listing 8.1, which just loops over a 2-dimensional array and sets each element A(i,j) to A(i-1,j-1) + A(i+1,j+1). In this example, the index variables i and j are declared to be of type Double and are typecast to Long when used for indexing into the array. This example reflects the type of X10 code that we would generate if we did not have the IntegerOkay analysis.
Listing 8.2 shows the same example, but with i and j declared to be Long, and thus not requiring an explicit typecast. This example reflects the code that we would be able to generate with a good IntegerOkay analysis.
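The contrast between the two listings can be illustrated with a Java analogue (the thesis listings are X10; the method names useDoubles and useLongs here are illustrative, matching the row labels of Table 8.1):

```java
public class CastCost {
    // Listing 8.1 style: index variables kept as double, cast at every array access.
    static void useDoubles(double[][] a) {
        for (double i = 1; i < a.length - 1; i++)
            for (double j = 1; j < a[0].length - 1; j++)
                a[(int) i][(int) j] = a[(int) (i - 1)][(int) (j - 1)]
                                    + a[(int) (i + 1)][(int) (j + 1)];
    }

    // Listing 8.2 style: index variables declared integral, so no casts are needed.
    static void useLongs(double[][] a) {
        for (int i = 1; i < a.length - 1; i++)
            for (int j = 1; j < a[0].length - 1; j++)
                a[i][j] = a[i - 1][j - 1] + a[i + 1][j + 1];
    }
}
```

Both methods compute identical results; the only difference is the per-access casting, which is exactly the cost the IntegerOkay analysis eliminates.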
  input args: 100, 200000  input args: 10000, 20
  Java  C++  Java  C++
useDoubles  6.9  33.7  7.6  35.2
useLongs  3.4  1.5  3.7  2.0
Table 8.1 shows running times (in seconds) for these two examples for different values of the input arguments. For listing 8.1, the C++ code generated by the X10 compiler is nearly 5 times slower than the Java code generated from X10 for the same example, and almost 20 times slower than the C++ code for listing 8.2. The Java code for listing 8.1 is nearly 2 times slower than the Java code for listing 8.2. For the C++ backend, since C++ does not provide checks for the Double to Long typecast, they are implemented in the X10 C++ backend; for the Java backend, X10 relies on the checks provided by the JVM. The more efficient implementation of these checks in the JVM, compared to that in the X10 C++ backend, explains the comparatively lower slowdowns for the Java code. Section 9.7 gives a detailed evaluation of the performance benefits obtained by using the IntegerOkay analysis on our benchmark set.
In this section we give a high-level overview of how the IntegerOkay analysis works. A detailed algorithm is provided in the next section (section 8.4). The basic idea behind the IntegerOkay analysis is that, for each variable x, if at every use and every definition of x in the given MATLAB function, x can safely be assumed to be an integer, i.e. declaring it as an integer does not change the result of the program, then it can be declared as an integer. Thus, the problem boils down to answering, for each use and each definition of x, whether x can safely be assumed to be an integer there.
There are three possible answers to this question:
- IntegerOkay: at this use or definition, x can safely be an integer, independent of any other variable.
- Not IntegerOkay: at this use or definition, x cannot safely be an integer.
- Conditionally IntegerOkay: at this use or definition, x can safely be an integer, provided that all the variables in an associated dependency list can themselves be integers.
In our MIX10 compiler we solve the IntegerOkay problem using a simple fixed-point computation. For each variable use and definition, the algorithm initially associates it with one of the three abstract values above. We then compute the fixed point by iteratively refining the dependency lists of the conditional variables. Consider a variable x: if every use and definition of x has been determined to be IntegerOkay, then x is removed from the dependency lists of all the variables that are Conditionally IntegerOkay and depend on x. Once the dependency list for a particular use or definition of a variable becomes empty, that use or definition is upgraded to IntegerOkay.
If a variable is Not IntegerOkay at some point in the function, or its dependency list does not become empty at some point in the function (say, due to a circular dependency), it cannot be declared as an integer. Since every time we declare a variable to be an integer, one or more Conditionally IntegerOkay variables may be upgraded to IntegerOkay, we iteratively repeat the process of finding variables that are IntegerOkay at all points in the function until we reach a fixed point. Note that since we never downgrade a variable to Not IntegerOkay or Conditionally IntegerOkay, our iterative algorithm always terminates.
The IntegerOkay analysis is an intraprocedural, flow-insensitive and path-insensitive analysis. The basic idea behind identifying whether a variable can be declared as an integer is that if a variable can safely be an integer at every definition and every use in the function, independent of any other variable's type, then it can be declared as an integer for the entire function. It is important that any variable identified as IntegerOkay must be safe to declare as an integer; thus the analysis takes a conservative approach and identifies a variable as IntegerOkay only when it is completely unambiguous. This eliminates any false positives (a variable identified as IntegerOkay when in fact it is not) but may lead to some false negatives (a variable identified as not IntegerOkay when in fact it could be IntegerOkay). For example, if a variable i is used as an array index variable and is also used as an argument to an rdivide() builtin call somewhere in the function, the analysis identifies it as not IntegerOkay, even though, under the assumption that the function executes without any runtime errors, i would be IntegerOkay (use of a non-integral value as an array index results in a runtime error).
The input to the analysis is a set of all the double variables in the function, a set of definitions for each of the variables in this set, and a set of all the uses for each of the variables in this variable set. The aim is to output a set of variables that can be safely declared as integers.
As mentioned in Sec. 8.3, the analysis initializes each definition and each use of every variable to one of the three abstract values and assigns a set of dependency variables if the assigned abstract value is conditionally IntegerOkay.
The analysis represents these three abstract values by the following state values: IntOk, NotIntOk, and CondIntOk for IntegerOkay, not IntegerOkay, and conditionally IntegerOkay respectively. Furthermore, let d.state be the state of variable v for definition d, and u.state be the state of variable v for use u. Also, let d.deps be the set of variables on which the CondIntOk state of variable v for definition d depends; similarly, let u.deps be the set of variables on which the CondIntOk state of variable v for use u depends. The analysis starts with the set of all double variables V in the MATLAB function, where each variable v has an associated set of definitions D and uses U. Every d.state and u.state is initialized to NotIntOk, and every d.deps and u.deps is initially set to ∅. Algorithm 1 gives the algorithm for the initialization step of the analysis. It checks every definition and every use of each variable and associates an IntegerOkay state with each use and definition. It also associates a dependency list of variables whose states affect the state of the variable in question.
Once the initial state has been assigned for each use and each definition of every variable in STEP 1, the next step is to find the variables that can be of integer type across the function. The fixed point solver iteratively finds these variables until a fixed point is reached and no more variables are safe to be defined as integers.
The input to the fixed point solver is the set of all double variables V with state and dependency list initialized for each definition and each use of every variable v in V . The output of the fixed point solver is V′, the set of variables that can be safely defined as integers. The fixed point solver also defines the variable v.State that stores the final state value for the variable v in the function, independent of any use or definition. For each variable v, v.State is assigned an initial empty value ~. V′ is initially set to ∅.
Algorithm 2 provides an algorithm for this fixed point solver. Note that the algorithm uses a merge operator to combine the state values of the various definitions and uses of a variable to obtain its final state. Table 8.2 defines the merge operator used by this analysis. In the table, CondIntOk(s,s = ∅) is the case when d.deps∪u.deps = ∅ for a variable v, and CondIntOk(s,s≠∅) is the case when d.deps∪u.deps≠∅.
IntOk  CondIntOk(s,s = ∅)  CondIntOk(s,s≠∅)  NotIntOk  ~  
IntOk  IntOk  IntOk  NotIntOk  NotIntOk  IntOk 
CondIntOk(s,s = ∅)  IntOk  IntOk  NotIntOk  NotIntOk  CondIntOk(s,s = ∅) 
CondIntOk(s,s≠∅)  NotIntOk  NotIntOk  NotIntOk  NotIntOk  CondIntOk(s,s≠∅) 
NotIntOk  NotIntOk  NotIntOk  NotIntOk  NotIntOk  NotIntOk 
~  IntOk  CondIntOk(s,s = ∅)  CondIntOk(s,s≠ ∅)  NotIntOk  ~ 
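The merge operator of Table 8.2 can be encoded directly; the following is a sketch with our own naming (the enum constants map one-to-one onto the table's headings, with the empty initial value ~ rendered as BOTTOM), not MIX10's actual implementation.

```java
public class MergeOp {
    public enum State { INT_OK, COND_INT_OK_EMPTY, COND_INT_OK_NONEMPTY, NOT_INT_OK, BOTTOM }

    public static State merge(State a, State b) {
        if (a == State.BOTTOM) return b;   // ~ row: the result is the other operand
        if (b == State.BOTTOM) return a;   // ~ column, by symmetry
        if (a == State.NOT_INT_OK || b == State.NOT_INT_OK)
            return State.NOT_INT_OK;       // NotIntOk poisons the merge
        if (a == State.COND_INT_OK_NONEMPTY || b == State.COND_INT_OK_NONEMPTY)
            return State.NOT_INT_OK;       // unresolved dependencies also poison it
        return State.INT_OK;               // IntOk / CondIntOk(s, s=∅) merge to IntOk
    }
}
```

Note that per the table, a CondIntOk state with a non-empty dependency list merged with anything other than ~ yields NotIntOk: by the time states are merged, any dependency that could have been resolved already has been.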
Consider the following pseudocode as an example:
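A Java rendering of the five-line example, consistent with the description in the paragraph that follows; the non-integral constant on line 2 is illustrative (any value that is not a real integer would do).

```java
public class Example {
    public static double[] run() {
        double x = 3.0;                   // line 1: x is IntegerOkay (3.0 is a real integer)
        double y = 2.5;                   // line 2: y is Not IntegerOkay
        double z = x + y;                 // line 3: z is Conditionally IntegerOkay on {x, y}
        for (double i = 0; i < 5; i++) {  // line 4: i is IntegerOkay throughout
            y = y + i;                    // line 5: this def of y is Conditionally IntegerOkay on {i}
        }
        return new double[] { x, y, z };
    }
}
```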
In this example, the initialization step proceeds as follows. On line 1, x is IntegerOkay since 3.0 is a real integer. On line 2, y is Not IntegerOkay. On line 3, z is Conditionally IntegerOkay and depends on x and y, whereas x and y are IntegerOkay in their uses in the expression x+y. On line 4, i is IntegerOkay in its definition i = 0, in its use in the expression i < 5, and also in the definition i++. On line 5, y is Conditionally IntegerOkay, depending on i, in its definition, and is IntegerOkay in its use in y+i; i is IntegerOkay in its use in y+i. Note that on line 5 we do not include y in its own dependency list: even if y were said to depend on itself, it would still be safe to declare y as an integer as long as it has no other dependencies and is IntegerOkay everywhere else in the function.
The fixed-point solver for this example proceeds as follows. We look for variables that are IntegerOkay at every point in the function; x and i are two such variables, and we can declare them to be integers. We also remove x from the dependency list of the definition of z on line 3, and i from the dependency list of the definition of y on line 5. Next, we search again and find that y is IntegerOkay in its uses on lines 3 and 5, and also in its definition on line 5; however, it is Not IntegerOkay in its definition on line 2, and thus it cannot be declared as an integer. z on line 3 depends on y and thus cannot be declared as an integer either. At this point we have reached a fixed point, since there are no more upgrades. Finally, we declare x and i as integers, and y and z as doubles.
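The whole fixed-point procedure can be put together in a short runnable sketch, applied below to the worked example; the class, field, and state names are ours, not MIX10's actual implementation.

```java
import java.util.*;

public class IntegerOkaySolver {
    // One def or use of a variable: an abstract state plus a dependency list.
    public static class Occ {
        final String state;      // "IntOk", "NotIntOk", or "Cond"
        final Set<String> deps;  // non-empty only while conditionally IntegerOkay
        public Occ(String state, String... deps) {
            this.state = state;
            this.deps = new HashSet<>(Arrays.asList(deps));
        }
        boolean isIntOk() {
            return state.equals("IntOk") || (state.equals("Cond") && deps.isEmpty());
        }
    }

    // occs maps each double variable to the states of all its defs and uses.
    public static Set<String> solve(Map<String, List<Occ>> occs) {
        Set<String> intVars = new HashSet<>();
        boolean changed = true;
        while (changed) {                       // iterate until a fixed point is reached
            changed = false;
            for (Map.Entry<String, List<Occ>> e : occs.entrySet()) {
                if (intVars.contains(e.getKey())) continue;
                boolean allOk = true;
                for (Occ o : e.getValue()) allOk &= o.isIntOk();
                if (allOk) {                    // every def and use is IntegerOkay
                    intVars.add(e.getKey());
                    changed = true;
                    for (List<Occ> os : occs.values())   // remove it from all dependency lists
                        for (Occ o : os) o.deps.remove(e.getKey());
                }
            }
        }
        return intVars;                         // variables safe to declare as Long
    }
}
```

Running this on the occurrence states from the worked example yields exactly the result derived above: x and i become integers, while y and z stay doubles.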
Evaluation
___________________________________________________________________________________
In this chapter we evaluate the performance of our compiler. In this research our main aim was to generate X10 code for MATLAB such that its sequential performance would be comparable to the performance provided by state-of-the-art tools that translate MATLAB to more traditional imperative languages such as C and Fortran. To demonstrate our results, we compiled a set of 17 MATLAB programs to X10 via the MIX10 compiler and compared their performance with that of the original MATLAB programs, the C programs generated for our benchmarks via MATLAB coder, and the Fortran programs generated by the MC2FOR compiler.^{1} ^{1}We also compared our results to Octave, a widely used open source alternative to MATLAB. However, since Octave involves an interpreter, it performed slower than the standard MATLAB implementation (with a mean slowdown of 66.67 times) over all of our benchmarks, so in this section we do not concentrate on comparing our results with Octave. In addition to showing our best overall sequential performance, we also demonstrate the results of compiling the generated X10 code to Java compared to C++, the effects on performance of the various efficiency-enhancing techniques discussed in this thesis, and finally the performance of the parallel X10 code generated for MATLAB parfor loops.
The set of benchmarks used for our experiments comes from various sources; most are from related projects like FALCON [RP99] and OTTER [QMSZ98a], Chalmers University of Technology^{2} ^{2}http://www.elmagn.chalmers.se/courses/CEM/ , Mathworks' central file exchange^{3} ^{3}http://www.mathworks.com/matlabcentral/fileexchange , and the presentation on parallel programming in MATLAB by Burkardt and Cliff^{4} . This set of benchmarks covers commonly used MATLAB features such as arrays of different dimensions, loops, numerical functions like random number generation and trigonometric operations, and array operations like transpose and matrix multiplication. Table 9.1 gives a description of all the benchmarks we used and shows their special features.
Benchmark  Source  Description  Key features 
bbai  MATLAB file exchange  Implementation of the Babai estimation algorithm  2D arrays, random number generation 
bubl  McLab  Bubble sort  1D array, nested loops 
capr  Chalmers University  Computes the capacitance of a transmission line using finite difference and GaussSeidel method  Array operations on 2D arrays, nested loops 
clos  Otter project  Calculates the transitive closure of a directed graph  Matrix multiplication, 2D arrays 
crni  Falcon project  CrankNicholson solution to the heat equation  read/write operations on a very large 2D array 
dich  Falcon project  Dirichlet solution to Laplace’s equation  Array operations on 2D arrays, nested loops 
diff  MATLAB file exchange  Calculates the diffraction pattern of monochromatic light  2D arrays, Concatenation operations, complex numbers 
edit  MATLAB file exchange  Calculates the edit distance between two strings  many 1D arrays of characters 
fiff  Falcon project  Computes the finite difference solution to the wave equation  Array operations on 2D arrays, nested loops 
lgdr  Calculates derivatives of Legendre polynomials  Array transpose on row vectors 
mbrt  MCFOR project  Computes Mandelbrot sets  Complex numbers, parfor loop 
nb1d  Otter project  Simulates the 1dimensional nbody problem  Columnvectors, nested loops, parfor loop 
matmul  McLab  naive matrix multiplication  2D arrays, nested loops, parfor loop 
mcpi  McLab  Calculates π by the Monte Carlo method  Scalar values, Random number generation, parfor loop 
numprime  Burkardt and Cliff  Simulates the sieve of Eratosthenes for calculating number of prime numbers less than a given number  Scalar values, nested loops, parfor loop 
optstop  Burkardt and Cliff  Solution to the optimal stopping problem  Row vectors, random number generation, parfor loop 
quadrature  Burkardt and Cliff  Simulates the quadrature approach for calculating integral of a function  Scalar values, parfor loop 
We used Mathworks' MATLAB release R2013a to execute our benchmarks in MATLAB and MATLAB coder. We also executed them using GNU Octave version 3.2.4. We compiled our benchmarks to Fortran using the MC2FOR compiler and compiled the generated Fortran code using the GCC 4.6.3 GFortran compiler with optimization level -O3. To compile the X10 code generated by our MIX10 compiler, we used X10 version 2.4.0. We used OpenJDK Java 1.7.0_51 to compile and run the Java code generated by the X10 compiler, and the GCC 4.6.4 g++ compiler to compile the C++ code generated by the X10 compiler. All the experiments were run on a machine with an Intel(R) Core(TM) i7-3820 CPU @ 3.60GHz and 16 GB memory, running GNU/Linux (3.8.0-35-generic #52-Ubuntu). For each benchmark, we used an input size that makes the program run for approximately 20 seconds on the de facto MATLAB system. We used the same input sizes when compiling and running the benchmarks via the other compilers. We collected the execution times (averaged over five runs) for each benchmark and compared their speedups over Mathworks' MATLAB runtimes (normalized to one).
The MIX10 compiler compiles the source MATLAB code to X10 code, which is then compiled by the X10 compiler. The X10 compiler is also a source-to-source compiler that provides two backends: a C++ backend that generates C++ code and a Java backend that generates Java code, which are then compiled by their respective compilers to executable code. Both backends provide a NO_CHECKS switch that generates C++/Java code without the dynamic array bounds checks that are otherwise included by default. As described in section 4.1.1, we altered the X10 compiler to use column-major array indexing. We always used the -O optimization flag for the X10 compiler for both backends, with notable exceptions where the X10 optimizer generated code that interacted extremely negatively with the Java JIT, as discussed in section 9.5.1. For all of our experiments, we used our IntegerOkay analysis, except for the experiment that investigates the performance impact of this analysis. Our best results were obtained by compiling the generated X10 code with the C++ backend with NO_CHECKS enabled, where the X10 code itself was generated by the MIX10 compiler with simple arrays and our IntegerOkay analysis enabled.
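The column-major alteration amounts to changing the linearization formula used for multi-dimensional array accesses; a minimal sketch of the two formulas (our own helper, not the X10 compiler's code):

```java
public class Indexing {
    // Linearized index of element (i, j), 0-based, in an m-by-n array.
    // Row-major is X10's default layout.
    static int rowMajor(int i, int j, int m, int n) { return i * n + j; }
    // Column-major matches MATLAB's storage order, so MATLAB-style
    // column traversals stay contiguous in memory.
    static int colMajor(int i, int j, int m, int n) { return j * m + i; }
}
```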
We compared the performance of the generated X10 code with that of the original MATLAB code run on Mathworks’ implementation of MATLAB. To compare against the state of the art static compilers, we also compared the performance of the MIX10 generated X10 code with the C code generated by MATLAB coder and the Fortran code generated by the MC2FOR compiler.
Figure 9.1 shows the speedups and slowdowns of the code generated for our benchmarks by the different compilers. For MIX10 we have included results for the X10 code compiled by the X10 C++ backend, once with NO_CHECKS enabled and once with NO_CHECKS disabled. For Fortran we included results for the code generated without bounds checks. The C code from MATLAB coder was generated with default settings and includes bounds checks. We also calculated the geometric mean of the speedups (slowdowns) over all the benchmarks for each compiler.
Overall, one can see that the static compilers all provide excellent speedups, often at least an order of magnitude. Thus, for the kinds of benchmarks in our benchmark set, it would seem that tools like MATLAB coder, MC2FOR, and our MIX10 tool are very useful. MIX10 outperforms MATLAB coder in 9 out of 17 benchmarks when compared with the X10 version compiled with bounds checks, and 10 out of 17 benchmarks when compared with the version with no bounds checks. Against Fortran, the generated X10 does better in 7 out of 17 benchmarks with no bounds checks, and 6 out of 17 benchmarks with bounds checks enabled. Note that MATLAB coder was not able to compile one of our benchmarks (diff) due to the dynamic array growth involved in it; MIX10 supports dynamic array growth.
We achieved a mean speedup of 4.8 over MATLAB for the X10 code with bounds checks, and 6.8 for the X10 code with no bounds checks. In comparison, MATLAB coder gave a mean speedup of 6.3 and MC2FOR gave a mean speedup of 10.2. However, we noticed that our mean result was skewed by two benchmarks, clos and lgdr, for which the generated X10 performed very poorly compared to the generated C code. In the following paragraphs we explain the reasons for their poor performance. If we do not consider these two benchmarks, we get a mean speedup of 6.7 for the X10 code with bounds checks, compared to 5.2 for the C code. For the X10 code compiled with no bounds checks we get a mean speedup of 9.3, compared to 11.6 for Fortran.
clos involves repeated calls to the builtin matrix multiplication operation on 2-dimensional matrices. The C code generated by MATLAB coder uses highly optimized matrix multiplication libraries, compared to the naive matrix multiplication implementation used by MIX10. Thus, MIX10 gets a speedup of 0.02 compared to 1.05 for C. Note that the generated Fortran code is also slowed down (speedup of 0.04) for the same reason. As future work, we plan to replace our matrix multiplication implementation with calls to an optimized library function.
lgdr involves repeated transposes of a row vector to a column vector. MATLAB and Fortran, both being array languages, are highly optimized for transpose operations. MIX10 currently uses a naive, unoptimized transpose algorithm. We did achieve a speedup of over 10 times compared to MATLAB, but it is not as good as the speedups achieved by C (505.0) and Fortran (336.7). For the transpose operation too, we plan to replace our current implementation with an optimized implementation or a call to an optimized library function.
Both of these examples show that in addition to generating good code, another important task is developing optimized X10 library routines for key array computations.
Other interesting numbers are shown by optstop, fiff, nb1d, and quadrature. optstop gave a speedup of just 1.5 even without bounds checks. It involves repeated random number generation, which our experiments showed to be slow for the X10 C++ backend compared to Fortran and even the Java backend. This problem is worse with C, due to which the C code from MATLAB coder gives a speedup of a mere 0.6 (a slowdown); Fortran performs better, with a speedup of 4.3. bbai shows a similar pattern for the same reason. fiff is characterized by stencil operations in a loop on a 2-dimensional array; these operations are also optimized by array-based languages like Fortran and MATLAB. For nb1d, Fortran performs better due to the use of column vectors in the benchmark, which are represented as 2-dimensional arrays in X10 but as optimized 1-dimensional arrays in Fortran; 2-dimensional arrays are not as fast in X10 as 1-dimensional arrays. quadrature involves repeated arithmetic calculations on a range of numbers. We achieve a speedup of about 200 times compared to MATLAB; however, it is slow compared to the speedups of 3348 and 502 achieved by C and Fortran respectively. We believe that MATLAB coder leverages partial evaluation to optimize implementations of numerical methods.
For most of the other benchmarks, we perform better than or nearly equal to the C and Fortran code. Despite the facts that (1) the sequential core of X10 is a high-level object-oriented language that is not specialized for array-based operations, and (2) generating the executable binaries via MIX10 involves two levels of source-to-source compilation (MATLAB → X10 → C++), we have achieved performance comparable to C, the state of the art in statically compiled languages, and Fortran, a statically compiled language highly specialized for arrays.
The X10 compiler provides two backends, a C++ backend that compiles the X10 code to native binary via C++, and a Java backend that compiles the X10 code to JVM code via Java. We were interested to see how well these backends perform when used to compile the MIX10 generated code. Even though we did not expect Java code to perform as well as the C++ code, our aim was to make sure that we achieved good performance, significantly better than MATLAB, for the X10 Java backend. This would enable MATLAB programmers to use our MIX10 compiler to generate code that could be integrated into Java applications.
In this section we present the performance comparison of the MIX10-generated X10 code compiled by the X10 C++ backend with that compiled by the X10 Java backend. Columns 3 and 4 of Table 9.2 show speedups for our benchmarks compiled with the X10 C++ backend without bounds checks, and with bounds checks, respectively. Columns 5 and 6 show these values for compilation with the X10 Java backend without bounds checks, and with bounds checks, respectively. We also show the geometric mean of the speedups for all four cases.
Benchmark  Matlab  C++ (no checks)  C++ (checks)  Java (no checks)  Java (checks) 
bbai  1  4.9  4.9  11.3  10.7 
bubl  1  15.8  10.8  7.5  7.5 
capr  1  13.5  12.7  11.1  6.3 
clos  1  0.03  0.02  0.02  0.002 
crni  1  6.5  5.3  5.6  4 
dich  1  7.6  7.3  7  1.6 
diff  1  4.6  4.4  0.3  0.3 
edit  1  29.7  19.4  22.1  20 
fiff  1  9.8  3.5  2.1  1.4 
lgdr  1  13.5  11.9  10.6  10.6 
mbrt  1  6.5  6.5  0.3  0.3 
nb1d  1  6.2  4.8  5.5  4.1 
matmul  1  14.7  2.5  1.1  0.8 
mcpi  1  3.3  2.5  2.9  3 
numprime  1  7.6  6.8  6.5  6.4 
optstop  1  1.5  1.2  1.8  1.4 
quadrature  1  200.9  182.6  167.4  154.5 
Geometric mean  1  6.8  4.8  3.4  2.4 
The mean speedups for the C++ backend are 6.8 and 4.8 respectively for the version with bounds checks switched off, and the version with bounds checks switched on, whereas for Java backend these values are 3.4 and 2.4 respectively. This is expected, given that C++ is compiled to native binary while Java is JIT compiled.
bbai and optstop are the two exceptions, where Java performs better than C++. For bbai, both Java versions gave a speedup of over 10, whereas for C++ the speedup is under 5 for both versions. For optstop, the difference is not large, with a C++ speedup of 1.5 (1.2 for the bounds checks version) compared to 1.8 (1.4 for the bounds checks version) for the Java backend. bbai is slower with the X10 C++ backend because it includes repeated calls to X10's Random.nextDouble() function to generate random numbers, which we found to be significantly slower in the C++ backend compared to the Java backend. We have reported our findings to the X10 development team and they have validated them. optstop is slower for the same reason: it also involves repeated random number generation. Note that for these two benchmarks, even the C code generated via MATLAB coder is slower than the C++ code, with speedups of 4.2 and 0.6 respectively for bbai and optstop.
Other interesting results are for the benchmarks diff, fiff, mbrt and matmul, for which the Java backend is significantly slower than the C++ backend. diff and mbrt involve operations on complex numbers. In the X10 C++ backend, complex numbers are stored as structs and kept on the stack, whereas in the Java backend they are stored as objects and reside on the heap. fiff and matmul are characterized by repeated array accesses and read/write operations on 2-dimensional arrays. For these benchmarks, the Java backend is significantly slower than the C++ backend with no bounds checks (2.1 vs. 9.8 for fiff and 1.1 vs. 14.7 for matmul); however, compared to the C++ backend with bounds checks, it is not as slow (2.1 vs. 3.5 for fiff and 1.1 vs. 2.5 for matmul). The reason is that even with bounds checks turned off in the X10-to-Java compiler, the JVM still performs its own array bounds checks by default. These checks have a significant effect on performance for 2-dimensional array operations.
One of the most surprising results in this set of experiments was the fact that we sometimes had to disable the X10 -O optimizer switch when using the X10 Java backend. For the benchmarks capr and dich, in the case when X10 bounds checks are switched on, we found very pathological performance, with slowdowns of over two orders of magnitude, when the X10 compiler's optimization switch (-O) was used.
We recorded running times of 785.3 seconds for capr compared to 3.2 seconds without the optimization, and 1558.9 seconds for dich compared to 12.7 seconds without the optimization. With the help of the X10 development team we determined that switching on the optimization triggered code inlining for the array bounds check code, which then caused the resultant Java program to be too large to be handled by the JIT compiler. In fact, the Java JIT effectively gives up on this code and reverts to the interpreter.
Thus, it would seem that the X10 optimizer needs to be improved in order to apply aggressive inlining only when it does not have a negative impact on code size, and that different inlining strategies are needed for the C++ and Java backends.
One of the key optimizations used by MIX10 is to use simple arrays, wherever possible, for higher performance. In this section we discuss the performance gains obtained by using simple arrays over region arrays and the specialized region arrays. A description of these three kinds of arrays provided by X10 was given in Sec. 2.1.3. Table 9.3 shows the relative speedups and slowdowns for our benchmarks compiled to use different kinds of X10 arrays for the C++ backend and the Java backend.
X10 C++ backend  X10 Java backend  
Benchmark  Matlab  Simple arrays  Region arrays  Sp. region arrays  Simple arrays  Region arrays  Sp. region arrays 
bbai  1.0  4.9  2.7  2.7  11.3  6.4  6.6 
bubl  1.0  15.8  11.4  11.6  7.5  3.6  3.7 
capr  1.0  13.5  11.6  12.4  11.1  0.02  10.5 
clos  1.0  0.03  0.01  0.01  0.02  0.01  0.01 
crni  1.0  6.5  3.6  3.9  5.6  4  4.1 
dich  1.0  7.6  6.7  7.0  7.0  0.01  0.02 
diff  1.0  4.6  4.2  4.3  0.3  0.3  0.3 
edit  1.0  29.7  4.3  3.6  22.1  9.4  9.4 
fiff  1.0  9.8  2.0  2.8  2.1  1.4  1.4 
lgdr  1.0  13.5  1.6  1.5  10.6  5.4  5.4 
mbrt  1.0  6.5  6.3  6.5  0.3  5.1  5.1 
nb1d  1.0  6.2  0.3  0.3  5.5  1.4  1.4 
matmul  1.0  14.7  1.3  1.4  1.1  0.5  0.5 
mcpi  1.0  3.3  2.8  2.7  2.9  3.0  3.0 
numprime  1.0  7.6  5.7  5.7  6.5  6.3  6.3 
optstop  1.0  1.5  0.4  0.4  1.8  1.2  1.3 
quadrature  1.0  200.9  200.9  200.9  167.4  143.5  154.5 
Geometric mean  1.0  6.8  2.7  2.8  3.4  1.3  1.9 
For the C++ backend we obtained a mean speedup of 6.8 for simple arrays, compared to 2.7 for region arrays and 2.8 for specialized region arrays. For the Java backend, we obtained speedups of 3.4, 1.3 and 1.9 for simple arrays, region arrays, and specialized region arrays respectively. These results match the expectations set out in Sec. 4.1. For the C++ backend, the most noticeable performance differences between simple arrays and region arrays are for edit, fiff, lgdr, nb1d, matmul and optstop. All of these benchmarks are characterized by a large number of array accesses and read/write operations on 2-dimensional arrays, except optstop and edit, which have multiple large 1-dimensional arrays. The performance difference is most noticeable for nb1d, where the region arrays are about 20 times slower than the simple arrays. This is because nb1d involves simple operations on a large column vector. With simple arrays, since the compiler knows that it is a column vector rather than a general 2-dimensional matrix, even though it is declared as a 2-dimensional array, performance can be optimized to match that of the underlying Rail. However, this is not possible for region arrays, where the size of each dimension is not known statically. For the C++ backend, we do not observe significant performance differences between the region arrays and the specialized region arrays.
For the Java backend we observed a larger difference in mean performance between the simple arrays and the region arrays. The mean speedup for region arrays is 1.3, whereas for simple arrays it is nearly 3 times higher, at 3.4. There is also a significant difference between the performance of specialized region arrays and region arrays: the speedup for specialized region arrays is 1.9. As for the C++ backend, the most noticeable performance gains for simple arrays are for benchmarks involving a large number of array accesses and read/writes. capr and dich merit special consideration. For capr, even with the X10 compiler's bounds checks turned off, the region array version is more than 500 times slower than the simple array version and even the specialized region array version. This, again, is because the region arrays, with their dynamic shape checks, generated more code than the JIT compiler could handle. For dich, the slowdown due to region arrays was about 700 times compared to simple arrays; even the specialization on region arrays was not enough to reduce the code size sufficiently for it to be JIT compiled.
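To illustrate why statically known shapes matter, the following Java sketch (illustrative only; the method names are our own, not MIX10's generated code) contrasts the two access patterns: a "simple" access where the shape is a compile-time formula over a flat backing array, versus a "region-style" access where the offset must be computed from runtime stride data for an arbitrary rank.

```java
// Illustrative sketch: cost difference between shape-known and shape-unknown
// array indexing over the same flat backing storage.
public class ArrayAccess {
    // Simple-array style: column count known statically, so the offset is a
    // single multiply-add that the compiler can optimize aggressively.
    static double getSimple(double[] flat, int cols, int i, int j) {
        return flat[i * cols + j];
    }

    // Region-array style: rank and strides only known at run time, so every
    // access loops over the dimensions to compute the linear offset.
    static double getRegion(double[] flat, int[] strides, int[] idx) {
        int off = 0;
        for (int d = 0; d < idx.length; d++)
            off += idx[d] * strides[d];
        return flat[off];
    }

    public static void main(String[] args) {
        double[] a = {1, 2, 3, 4, 5, 6};   // a 2 x 3 matrix stored row-major
        // Both calls fetch element (row 1, col 2); the region version pays
        // extra per-access work for its generality.
        System.out.println(getSimple(a, 3, 1, 2));                           // 6.0
        System.out.println(getRegion(a, new int[]{3, 1}, new int[]{1, 2})); // 6.0
    }
}
```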
In this section we present an overview of the performance improvements achieved by MIX10 through the IntegerOkay analysis. Table 9.4 compares the speedups obtained with and without the IntegerOkay analysis. For this experiment we used results with X10 optimizations turned on and bounds checks turned off.
                        X10 C++ backend            X10 Java backend
Benchmark       MATLAB  IntegerOkay  All Doubles   IntegerOkay  All Doubles
bbai            1.0     4.9          3.8           11.3         8.2
bubl            1.0     15.8         2.2           7.5          4.2
capr            1.0     13.5         1.7           11.1         9.7
clos            1.0     0.03         0.02          0.02         0.01
crni            1.0     6.5          2.8           5.6          5.5
dich            1.0     7.6          1.0           7.0          5.8
diff            1.0     4.6          4.5           0.3          0.3
edit            1.0     29.7         13.5          22.1         20.0
fiff            1.0     9.8          1.0           2.1          1.8
lgdr            1.0     13.5         10.1          10.6         11.0
mbrt            1.0     6.5          6.1           0.3          0.3
nb1d            1.0     6.2          5.2           5.5          5.5
matmul          1.0     14.7         14.5          1.1          1.2
mcpi            1.0     3.3          3.3           2.9          2.9
numprime        1.0     7.6          7.6           6.5          7.1
optstop         1.0     1.5          1.0           1.8          1.0
quadrature      1.0     200.9        182.6         167.4        143.5
Geometric mean  1.0     6.8          3.4           3.4          2.9
For the C++ backend, we observed a mean speedup of 6.8, twice the mean speedup of 3.4 obtained without the IntegerOkay analysis. We observed a significant performance gain from the IntegerOkay analysis for the benchmarks that involve a significant number of array indexing operations; bubl, capr, crni, dich, and fiff show the most significant gains. The reason is that X10 requires all array indices to be of type Long; thus, if the variables used as array indices are declared to be of type Double (the default in MATLAB), they must be cast to Long. A Double-to-Long cast is very time consuming because every cast involves a check on the value of the Double variable to ensure that it can safely fit into a Long.
For the Java backend, with the IntegerOkay analysis, we get a mean speedup of 3.4 compared to 2.9 without it. The difference is smaller than for the C++ backend because, for the Java backend, the X10 compiler does not generate the value checks for Double values; instead it relies on the JVM to make these checks, resulting in better performance. Also note that, without the IntegerOkay analysis, the C++ backend results are slower than the Java backend results for 9 out of 17 benchmarks.
To conclude, the IntegerOkay analysis is very important for the performance of code involving arrays, especially for the X10 C++ backend.
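The per-access cost that IntegerOkay eliminates can be sketched in Java (an illustrative sketch, not the MIX10 runtime; `toIndex` is our own name): every index held in a double must pass a value check before it can be used as an integral array index.

```java
// Sketch of the checked double-to-long conversion that an "All Doubles"
// compilation must perform on every array index.
public class CheckedCast {
    static long toIndex(double d) {
        long l = (long) d;
        // The cast is only safe if the double holds an integral value that
        // round-trips exactly; otherwise the index is invalid.
        if ((double) l != d)
            throw new ArithmeticException("double value " + d + " is not a valid index");
        return l;
    }

    public static void main(String[] args) {
        double[] a = new double[10];
        double i = 3.0;                // MATLAB variables default to double
        // Without IntegerOkay, every access pays this check; with it, `i`
        // would simply be declared as an integer type and indexed directly.
        a[(int) toIndex(i)] = 42.0;
        System.out.println(a[3]);      // 42.0
    }
}
```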
Given that we have established that we can generate competitive sequential code, we also wanted to do a preliminary study to see if we could get additional benefits from the high-performance nature of X10. In this section we present preliminary results for the compilation of the MATLAB parfor construct to parallel X10 code. 7 out of our 17 benchmarks could be safely modified to use the parfor loop. We compare the performance of the generated parallel X10 programs for these benchmarks to that of MATLAB code using parfor, and to their sequential X10 versions. For this experiment, we used the sequential versions of the generated X10 programs with optimizations and no bounds checks. For the parallel versions we used both variants: with optimizations and bounds checks, and with optimizations and without bounds checks. It is important to note that the Intel(R) Core(TM) i7-3820 CPU used for this experiment has 4 CPU cores and 8 hardware threads, bounding the X10 parallelization speedup at 8. Table 9.5 shows the results for both the X10 C++ and the X10 Java backends.
                                       X10 C++ backend                X10 Java backend
                                       Sequential  Parallel           Sequential  Parallel
Benchmark       MATLAB  MATLAB parfor  No checks   No checks  Checks  No checks   No checks  Checks
mbrt            1.0     1.7            6.5         25.3       25.3    0.3         1.3        1.3
nb1d            1.0     0.4            6.2         7.3        7.3     5.5         19.0       15.1
matmul          1.0     1.3            14.7        68.7       3.9     1.1         1.1        0.8
mcpi            1.0     4.6            3.3         15.7       15.9    2.9         18.1       18.2
numprime        1.0     5.3            7.6         30.0       30.9    6.5         24.8       26.4
optstop         1.0     4.1            1.4         2.0        2.1     1.8         10.7       9.7
quadrature      1.0     3.8            200.9       11.3       11.2    167.4       13.0       13.0
Geometric mean  1.0     2.3            8.9         14.5       9.8     3.7         7.8        7.1
For the X10 C++ backend we achieved a mean speedup of 14.5 for the generated parallel X10 code without bounds checks, around 7 times the speedup of 2.3 for the MATLAB code with parfor. Compared to the sequential X10 code, which has a mean speedup of 8.9, it is more than 1.5 times as fast. Even the parallel X10 code with bounds checks achieved a speedup of 9.8, over 4 times better than the MATLAB parfor code. The optstop benchmark is an interesting exception: it is actually slower (at a speedup of 2.1) than the MATLAB parfor version (at a speedup of 4.1). The sequential version of optstop is only slightly faster than the sequential MATLAB version, with a speedup of just 1.5 (due to the reasons explained earlier in Sec. 9.4), and the overhead of creating and managing the parallel activities nearly cancels out the gains from executing the iterations in parallel. We see a similar trend for the matmul benchmark on the X10 Java backend: it shows a speedup of just 0.8 (version with bounds checks) compared to 1.3 for the MATLAB parfor version. For the parallel version of matmul on the C++ backend, the speedup with checks turned on is just 3.9 compared to 14.7 for the sequential version with no checks; however, it is 1.6 times faster than the sequential version with checks, which has a speedup of just 2.5 (shown in Figure 9.1). The quadrature benchmark shows very interesting numbers, with slowdowns of about 20 times and 13 times for the parallel X10 code compared to the sequential X10 code, for the C++ and Java backends respectively. The quadrature benchmark consists of a loop with a large number of iterations (4,600,000 for our experiment), where each iteration performs an insignificant amount of computation: just 21 simple arithmetic operations on scalar values.
A sequential speedup of over 200 times compared to the sequential MATLAB code indicates how little computation time each iteration takes. In the parallel X10 code, each iteration is executed as a separate activity, and the time taken to manage the large number of activities overshadows the sequential speedup. Also, each activity contains an atomic statement, which is a blocking construct (while one activity is executing the atomic statement, all other activities are blocked) and adds further overhead to the activity management system. The performance numbers for quadrature confirm that not all parallel MATLAB programs are good candidates for a straightforward compilation to parallel X10, especially those with a large number of parallel loop iterations where each iteration itself does an insignificant amount of work. As future work, we plan to handle such programs with more sophisticated parallelization techniques. For example, one way to make a benchmark like quadrature more efficient would be to break the loop into two nested loops, executing the outer loop in parallel and the inner loop sequentially, so that each concurrent activity does a substantial amount of work. The challenge is to determine the right number of iterations for the inner sequential loop to make each activity significant enough to take advantage of the parallelization.
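The loop-splitting strategy just described can be sketched as follows in Java (a hedged sketch: the loop body, method names, and the chunk size of 10,000 are our own assumptions, not MIX10 output). The outer loop over chunks runs in parallel while each chunk's iterations run sequentially, and per-chunk partial results replace the blocking atomic update.

```java
import java.util.stream.IntStream;

// Sketch of chunked parallelization for a fine-grained loop: each parallel
// task now does enough work to amortize its scheduling overhead.
public class ChunkedLoop {
    static double chunkedSum(int n, int chunk) {
        int chunks = (n + chunk - 1) / chunk;
        double[] partial = new double[chunks];
        // Outer loop: parallel over chunks.
        IntStream.range(0, chunks).parallel().forEach(c -> {
            double s = 0.0;
            int hi = Math.min(n, (c + 1) * chunk);
            // Inner loop: sequential within a chunk.
            for (int i = c * chunk; i < hi; i++)
                s += 1.0 / (i + 1);      // stand-in for the real loop body
            partial[c] = s;              // one write per chunk; no atomic needed
        });
        double total = 0.0;              // cheap sequential reduction of partials
        for (double p : partial) total += p;
        return total;
    }

    public static void main(String[] args) {
        // Iteration count from the quadrature experiment; chunk size is a guess.
        System.out.println(chunkedSum(4_600_000, 10_000));
    }
}
```

As the text notes, choosing the chunk size well is the hard part: too small and scheduling overhead dominates again, too large and load balance suffers.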
Overall, for the Java backend, we obtained a mean speedup of 7.8 for the X10 code with no bounds checks and 7.1 for the code with bounds checks, about 3.5 times better than the MATLAB parfor version. The parallel X10 code is over 2 times faster than the sequential X10 code (speedup of 3.7). For both the C++ and Java backends, the sequential X10 code is also substantially faster than the parallel MATLAB code.
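As background for these numbers, the parfor translation targets X10's finish/async idiom; in the Java backend's terms this corresponds roughly to the following fork/join pattern (an illustrative sketch, not MIX10's actual emitted code; `squares` and the thread-pool setup are our own):

```java
import java.util.concurrent.*;

// Java analogue of the X10 pattern `finish { for (i in ...) async S; }`
// that a parfor loop translates to: fork one activity per iteration,
// then join them all before continuing.
public class ParforSketch {
    static double[] squares(int n) throws Exception {
        double[] out = new double[n];
        ExecutorService pool = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors());
        Future<?>[] fs = new Future<?>[n];
        for (int i = 0; i < n; i++) {
            final int idx = i;                               // iterations are independent
            fs[i] = pool.submit(() -> out[idx] = idx * idx); // `async S`
        }
        for (Future<?> f : fs) f.get();                      // end of `finish`: wait for all
        pool.shutdown();
        return out;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(java.util.Arrays.toString(squares(4))); // [0.0, 1.0, 4.0, 9.0]
    }
}
```

The sketch also makes the quadrature pathology visible: with one task per iteration, a loop of millions of trivial iterations spends its time in task management rather than computation.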
To conclude, the parallel X10 code provides much higher performance than the MATLAB parfor code and even the sequential X10 code, which is itself usually faster than the parallel MATLAB code.
We showed that the MIX10 compiler, with its efficient handling of array operations and optimizations like IntegerOkay, can generate X10 code whose performance is comparable to the native code generated by state-of-the-art compilers targeting languages like C, which is fairly low-level, and Fortran, which is itself an array-based language. As future work, we plan to use more efficient implementations of the built-in functions, which we believe would further improve the performance of the code generated by MIX10.
With MIX10, we also took the first steps toward compiling MATLAB to parallel X10 code to take full advantage of the high-performance features of X10. Our preliminary results are very encouraging, and we plan to continue in this direction.
In addition to demonstrating that our approach leads to good code, these experiments have also been quite valuable for the X10 development team. Our generated code has stressed the X10 system more than the handwritten X10 benchmarks and has exposed places where further improvements can be made.
Related Work
___________________________________________________________________________________
The work in this thesis presents a research compiler for MATLAB. We provide a source-to-source static compiler that targets the high-level, object-oriented language X10. Our approach is open and extensible, and generates sequential X10 code that is competitive in performance with the C and Fortran code generated by other state-of-the-art MATLAB compilers. We also generated high-performance code for MATLAB's parfor loops, which showed substantial performance improvements.
Although our focus is on handling MATLAB itself, there are notable open-source alternatives to MATLAB such as Scilab [INR09], Julia [BKSE12], NumPy [Sci], FreeMat [DKB], and Octave [Oct]. Octave is an open-source implementation of the MATLAB language and supports most MATLAB programs; however, until the recently released version 3.8.1, Octave was solely an interpreted language, and our experiments showed it to be orders of magnitude slower than MATLAB. We would like to evaluate its performance on the latest version, which comes with a JIT compiler. Scilab and FreeMat are similar in syntax to MATLAB but have significant differences that prevent a significant portion of MATLAB programs from running on these systems. NumPy is a Python package for scientific computing, and Julia is a technical computing language. Julia also provides a fairly advanced distributed parallel execution environment. However, these alternative systems concentrate on providing open library support and have not tackled the problems of static compilation. We are investigating whether there is any way of sharing some of their library support with MIX10. A comparative evaluation of MATLAB, Octave, FreeMat, Scilab, R, and IDL on a cluster computing system is presented in a technical report by Coman et al. [CBP^{+}12]. The MEGHA project [PAG11] provides an interesting approach to map MATLAB array operations to CPUs and GPUs, but supports only a very small subset of MATLAB.
There have been previous research projects on static compilation of MATLAB, which focused particularly on the array-based subset of MATLAB and developed advanced static analyses for determining the shapes and sizes of arrays. For example, FALCON [RP99] is a MATLAB-to-FORTRAN 90 translator with sophisticated type inference algorithms. The McLab group has previously implemented a prototype Fortran 95 generator [Li09], and has more recently developed the next-generation Fortran generator MC2FOR [LH14] in parallel with the MIX10 project. Some of the solutions are shared between the projects, especially the parts which extend the Tamer. MATLAB Coder is a commercial compiler by MathWorks [Mata], which produces C code for a subset of MATLAB. We also compared MIX10's performance to MATLAB Coder. Below are more details about these and a few other notable MATLAB compiler research projects:
A major difference between MIX10 and these compilers is that MIX10 targets an object-oriented language with high-level array abstractions, which is itself source-to-source compiled, whereas all of the above-mentioned projects target either JIT compilation or static compilation to Fortran, which is itself an array-based language and in fact was the precursor of MATLAB, or to C, which is a fairly low-level language with advanced native compilers. Even though some of the techniques used by these compilers were helpful in generating X10 code for simple MATLAB constructs, the novelty of MIX10 lies in the fact that it identified the major challenges involved in efficiently compiling MATLAB arrays and array-based operations to X10, and achieved performance comparable to that of compilers targeting C and Fortran.
In terms of source-to-source compilers for X10, we are aware of two other projects. StreamX10 is a stream programming framework based on X10 [WTLY12]. StreamX10 consists of a stream programming language, COStream, and a compiler which translates programs in COStream to parallel X10 code. Soh discusses the design of a Ruby-based DSL for parallel programming that is compiled to X10 [Soh11]. Both of these projects focus on compiling a parallel programming language to X10, a more efficient and generic parallel programming language. In the future, we would like to study these projects in more depth and hope to gain useful insights into efficiently using X10 as a target language for parallelism in MATLAB.
Chapter 11
Conclusions and Future Work
___________________________________________________________________________________
This thesis is about providing a bridge between two communities: on one side, the scientists, engineers, and students who like to program in MATLAB; on the other, the programming language and compiler community, who have designed elegant languages and powerful compiler techniques.
The X10 language has been designed to provide highlevel array and concurrency abstractions, and our main goal was to develop a tool that would allow programmers to automatically convert their MATLAB code to efficient X10 code. In this way programmers can port their existing MATLAB code to X10, or continue to develop in MATLAB and use our MIX10 compiler as a backend to generate X10 code. Since X10 is publicly available under the Eclipse Public License (x10lang.org/home/x10license.html), users could have efficient highperformance code that they could freely distribute. Further, X10 itself can compile the code to either Java or C++, so our tool could be used in a tool chain to convert MATLAB to those languages as well.
Our tool is part of the McLab project, which is entirely open source. Thus, we are providing an infrastructure for other compiler researchers to build upon this work, or to use some of our approaches to handle other popular languages such as R.
In this thesis we demonstrated the end-to-end organization of the MIX10 tool, and we identified the correct handling of arrays, the minimization of casting by safely mapping MATLAB double variables to X10 integer variables via the IntegerOkay analysis, and the handling of concurrency features as the key challenges. We developed a custom version of X10's rail-backed simple arrays, and identified where and how this could be used for generating efficient X10 code. For cases where precise static array shape and type information is not available, we showed how we can use the very flexible region-based arrays in X10, and our experiments demonstrated that it is very important to use the simple rail-backed arrays, for both the Java and C++ backends of X10.
Our experiments show that our generated X10 code is competitive with other state-of-the-art code generators which target more traditional languages such as C and Fortran. The C++ backend of X10 produces faster code than the Java backend, but good performance is achieved in both cases.
One of the main motivations for choosing X10 as the target language is that it supports high-performance computing, which is often desirable for the computation-intensive applications developed by engineers and scientists. To demonstrate how to take advantage of X10's concurrency, we presented an effective translation of the MATLAB parfor construct to semantically equivalent X10.
Our experiments showed significant performance gains for our generated parallel X10 code, as compared to MATLAB’s parallel toolbox. This confirms that compiling MATLAB to a modern highperformance language can lead to significant performance improvements.
In the process of developing MIX10, we also stressed the X10 compiler with machine-generated code and found it to be mostly robust and efficient. We found that the X10 Java backend's optimizations may in some cases generate Java code that is too large to be JIT compiled. We also found the random number generation for the C++ backend to be slower than that for the Java backend. We reported our feedback to the X10 development team and hope that our findings will help make X10 even better. Overall, we found that, with the right techniques, X10 can be used as an effective and efficient target language for scientific and array-based languages, for both the sequential core and the parallel constructs.
Based on our positive experiences to date, we plan to continue improving the MIX10 tool. There are three major areas in which we plan to advance the development of MIX10, with the aim of making it more complete and more efficient:
Our tool is open source, and we hope that other research teams will use our infrastructure, as well as learn from our experiences of generating effective and efficient X10 code.
Bibliography
___________________________________________________________________________________
[AP02] George Almási and David Padua. MaJIC: compiling MATLAB for speed
and responsiveness. In PLDI ’02: Proceedings of the ACM SIGPLAN 2002
Conference on Programming language design and implementation, Berlin,
Germany, 2002, pages 294–303. ACM.
[BKSE12] Jeff Bezanson, Stefan Karpinski, Viral B. Shah, and Alan Edelman. Julia:
A Fast Dynamic Language for Technical Computing. CoRR, abs/1209.5145,
2012.
[Bud83] Timothy A. Budd. An APL compiler for the UNIX timesharing system. In
APL ’83: Proceedings of the international conference on APL, Washington,
D.C., United States, 1983, pages 205–209. ACM, New York, NY, USA.
[CB98] Stéphane Chauveau and François Bodin. Menhir: An Environment for High
Performance Matlab. In LCR ’98: Selected Papers from the 4th International
Workshop on Languages, Compilers, and Run-Time Systems for Scalable
Computers, 1998, pages 27–40. Springer-Verlag, London, UK.
[CBP^{+}12] E. Coman, M. Brewster, S. Popuri, A. Raim, and M. Gobbert. A Comparative
Evaluation of Matlab, Octave, FreeMat, Scilab, R, and IDL on Tara. Technical
report, Baltimore County, US, 2012.
[CDD^{+}95] J. Choi, J. Demmel, I. Dhillon, J. Dongarra, S. Ostrouchov, A. Petitet,
K. Stanley, D. Walker, and R. C. Whaley. LAPACK Working Note 95:
ScaLAPACK: A Portable Linear Algebra Library for Distributed Memory
Computers – Design Issues and Performance. Technical report, Knoxville,
TN, USA, 1995.
[DH12a] Jesse Doherty and Laurie Hendren. McSAF: A static analysis framework for
MATLAB. In Proceedings of ECOOP 2012, 2012, pages 132–155.
[DH12b] Anton Dubrau and Laurie Hendren. Taming MATLAB. In Proceedings of
OOPSLA 2012, 2012, pages 503–522.
[DHR11] Jesse Doherty, Laurie Hendren, and Soroush Radpour. Kind analysis for
MATLAB. In Proceedings of OOPSLA 2011, 2011, pages 99–118.
[DKB] Eugene Ingerman, Demetrios Kyriakis, and Samit Basu. FreeMat.
http://freemat.sourceforge.net/.
[Doh11] Jesse Doherty. McSAF: An Extensible Static Analysis Framework for the
MATLAB Language. Master’s thesis, McGill University, December 2011.
[Dub12] Anton Dubrau. Taming MATLAB. Master’s thesis, April 2012.
[EH04] Torbjörn Ekman and Görel Hedin. Reusable language specifications
in JastAdd II. In Thomas Cleenewerck, editor, Evolution and Reuse
of Language Specifications for DSLs (ERLS), 2004. Available from:
http://prog.vub.ac.be/~thomas/ERLS/Ekman.pdf.
[IBM12] IBM. X10 programming language. http://x10lang.org, February
2012.
[IBM13a] IBM. APGAS programming in X10 2.4, 2013.
http://x10lang.org/documentation/tutorials/apgasprogramminginx1024.html.
[IBM13b] IBM. An introduction to X10, 2013.
http://x10.sourceforge.net/documentation/intro/latest/html/node4.html.
[INR09] INRIA. Scilab, 2009. http://www.scilab.org/platform/.
[jas] JastAdd. http://jastadd.org/.
[JB01] Pramod G. Joisha and Prithviraj Banerjee. Correctly detecting intrinsic type
errors in typeless languages such as MATLAB. In APL ’01: Proceedings of
the 2001 conference on APL, New Haven, Connecticut, 2001, pages 7–21.
ACM, New York, NY, USA.
[Joi03] Pramod G. Joisha. A MATLAB-to-C translator, 2003.
http://www.ece.northwestern.edu/cpdc/pjoisha/mat2c/ .
[LH14] Xu Li and Laurie Hendren. MC2FOR: a tool for automatically translating
MATLAB to Fortran 95. In Proceedings of CSMR-WCRE 2014, 2014, pages
234–243.
[Li09] Jun Li. McFor: A MATLAB to FORTRAN 95 Compiler. Master’s thesis,
McGill University, August 2009.
[Mata] MathWorks. MATLAB Coder.
http://www.mathworks.com/products/matlabcoder/.
[Matb] MathWorks. Reduction variables. http://www.mathworks.com/help/distcomp/advancedtopics.html#bq_of7_3.
[Mat13] MathWorks. Parallel computing toolbox, 2013.
http://www.mathworks.com/products/parallelcomputing/.
[Mol] Cleve Moler. The Growth of MATLAB
and The MathWorks over Two Decades.
http://www.mathworks.com/company/newsletters/news_notes/clevescorner/jan06.pdf.
[Oct] GNU Octave.
http://www.gnu.org/software/octave/index.html.
[PAG11] Ashwin Prasad, Jayvant Anantpur, and R. Govindarajan. Automatic
compilation of MATLAB programs for synergistic execution on heterogeneous
processors. In Proceedings of the 32nd ACM SIGPLAN Conference on
Programming Language Design and Implementation, San Jose, California,
USA, 2011, PLDI ’11, pages 152–163. ACM, New York, NY, USA.
[QMSZ98a] Michael J. Quinn, Alexey Malishevsky, Nagajagadeswar Seelam, and
Yan Zhao. Preliminary Results from a Parallel MATLAB Compiler. In
Proceedings of the International Parallel Processing Symposium (IPPS),
1998, pages 81–87.
[QMSZ98b] Michael J. Quinn, Alexey Malishevsky, Nagajagadeswar Seelam, and
Yan Zhao. Preliminary Results from a Parallel MATLAB Compiler. In Proc.
Int. Parallel Processing Symp. (IPPS), 1998, pages 81–87. IEEE CS Press.
[RP99] Luiz De Rose and David Padua. Techniques for the translation of MATLAB
programs into Fortran 90. ACM Trans. Program. Lang. Syst., 21(2):286–323,
1999.
[SBP^{+}12] Vijay Saraswat, Bard Bloom, Igor Peshansky, Olivier Tardieu, and David
Grove. X10 Language Specification, Version 2.2. January 2012.
[Sch75] J. T. Schwartz. Automatic data structure choice in a language of very high
level. Commun. ACM, 18(12):722–728, 1975.
[Sci] Scipy.org. Numpy. http://www.numpy.org/.
[Soh11] Tetsu Soh. Design and implementation of a DSL based on Ruby for parallel
programming. Technical report, The University of Tokyo, January 2011.
[The02] The MathWorks. Technology Backgrounder:
Accelerating MATLAB, September 2002.
http://www.mathworks.com/company/newsletters/digest/sept02/accel_matlab.pdf.
[WS81] Zvi Weiss and Harry J. Saal. Compile time syntax analysis of APL programs.
In APL ’81: Proceedings of the international conference on APL, San
Francisco, California, United States, 1981, pages 313–320. ACM, New York,
NY, USA.
[WTLY12] Haitao Wei, Hong Tan, Xiaoxian Liu, and Junqing Yu. StreamX10: a stream
programming framework on X10. In Proceedings of the 2012 ACM SIGPLAN
X10 Workshop, Beijing, China, 2012, X10 ’12, pages 1:1–1:6. ACM, New
York, NY, USA.
Appendix A
XML structure for builtin framework
___________________________________________________________________________________
The listing below gives the XML structure for the built-in operation plus. It contains two upper-level nodes, one each for real and complex numbers. Each of these has specialization nodes for the simple arrays of different dimensions, and for the region arrays. Each specialization has 5 types of implementations.
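To orient the reader before the full listing, the structure just described can be sketched as follows (an illustrative sketch only: the element and attribute names here are our assumptions, not necessarily those used by the MIX10 builtin framework):

```xml
<!-- Illustrative sketch of the described structure; actual names may differ. -->
<builtin name="plus">
  <real>
    <!-- one specialization node per simple-array dimensionality -->
    <specialization kind="simple" dimensions="1">
      <!-- each specialization carries 5 types of implementations -->
    </specialization>
    <specialization kind="simple" dimensions="2"/>
    <specialization kind="region"/>
  </real>
  <complex>
    <!-- same specialization nodes, for complex numbers -->
  </complex>
</builtin>
```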
MIX10 IR Grammar
___________________________________________________________________________________
The listing below gives the JastAdd implementation of the MIX10 IR grammar.