ARRAY '15 Accepted Papers with Abstracts

Paul Cockshott, Susanne Oehler and Youssef Gdura. Tool Description: Array programming in Pascal
Abstract: A review of previous array Pascals leads on to a description of the
Glasgow Pascal compiler. The compiler is an ISO-Pascal superset
with semantic extensions to translate data parallel statements to
run on multiple SIMD cores. An appendix is given which includes
demonstrations of the tool.
Joao Bispo, Luís Reis and Joao M. P. Cardoso. Techniques for Efficient MATLAB-to-C Compilation
Abstract: The translation of MATLAB to C is an important step to raise the
overall abstraction level when mapping computations to embedded
systems, and thus to increase productivity and to achieve a fully automated model-driven design flow. This paper describes recent
work developed in MATISSE, a MATLAB-to-C compiler targeting
embedded systems. We introduce several techniques to allow the
generation of efficient C code, such as weak types, primitives
and pointer views. We evaluate the compiler with a set of 9 publicly
available benchmarks, targeting both embedded systems and a
desktop system. We compare the execution time of the generated C
code with the original code running on MATLAB, achieving a geometric mean speedup of 8.1×, and qualitatively compare with the
performance of related work.
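Among the techniques named above, "pointer views" suggests reinterpreting one allocation under several shapes without copying. As a hedged sketch only (the `View` class below is our invention for illustration, not MATISSE's actual mechanism), the idea can be mimicked with shape/stride arithmetic:

```python
# Illustrative sketch (not MATISSE's implementation): a "pointer view"
# lets generated code reinterpret one flat buffer under several shapes
# without copying. Here we mimic that idea with shape/stride math.

class View:
    """A shaped, strided window over a shared flat buffer."""
    def __init__(self, buf, shape, strides, offset=0):
        self.buf, self.shape, self.strides, self.offset = buf, shape, strides, offset

    def __getitem__(self, idx):
        flat = self.offset + sum(i * s for i, s in zip(idx, self.strides))
        return self.buf[flat]

buf = list(range(12))                # one flat allocation
as_3x4 = View(buf, (3, 4), (4, 1))   # row-major 3x4 view
as_4x3_t = View(buf, (4, 3), (1, 4)) # transposed view, still no copy

print(as_3x4[(1, 2)])    # element at row 1, col 2 of the 3x4 view -> 6
print(as_4x3_t[(2, 1)])  # same underlying element via the transposed view -> 6
```

Generated C code could realize the same idea with pointer casts and stride arithmetic over a single allocated buffer.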
Robert Bernecky and Sven-Bodo Scholz. Abstract Expressionism for Parallel Performance
Abstract: Programming with abstract, mathematical expressions offers significant benefits
to users, including terser programs, easier communication of algorithms, ability to
prove theorems about algorithms, and improved programming productivity.
It is commonly believed that higher levels of abstraction imply a larger semantic gap
between the user and computer and, therefore, typically slower execution,
be it sequential or parallel in nature.
In recent years, it has been demonstrated that domain-specific languages can
close this gap by means of sophisticated optimizations that benefit from
domain-specific knowledge.

In this paper, we demonstrate that the semantic gap can be closed
for non-domain-specific array languages,
without requiring the embedding of language-specific semantic knowledge
into the compiler tool chain. We present a simple example
of an APL-style SaC program that is compiled into C code that outperforms
the equivalent C program, in both sequential and parallel (OpenMP) environments.

We give some insights into abstract expressionism in programming by comparing
the characteristics and performance of a numerical relaxation benchmark written
in C99, C99 with OpenMP directives, scheduling code, and pragmas, and in SaC,
a functional array language. We compare three algorithmic styles:
if/then/else, hand-optimized loop splitting, and an abstract, functional style that has
its roots in APL.

We show that all of our serial SaC algorithms outperform serial C, and that
the hand-optimized and abstract styles outperform serial C by a factor of nearly two.
Furthermore, the parallel SaC variants outperform the OpenMP C programs by
a factor of two, without requiring any source code modifications.
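As a small, hedged illustration of the stylistic contrast described above (the functions below are our own one-dimensional toy, not the paper's benchmark code), the same relaxation step can be written with boundary tests inside the loop or with the boundary cases peeled off:

```python
# Toy 1-D relaxation step in the two explicit styles the abstract contrasts.

def relax_branchy(v):
    """if/then/else style: one loop, boundary tests inside the body."""
    out = []
    for i in range(len(v)):
        if i == 0 or i == len(v) - 1:
            out.append(v[i])                       # boundaries are copied
        else:
            out.append((v[i - 1] + v[i + 1]) / 2)  # interior is averaged
    return out

def relax_split(v):
    """hand-split style: boundary cases peeled out of the hot loop."""
    out = [v[0]]                                   # left boundary
    out += [(v[i - 1] + v[i + 1]) / 2 for i in range(1, len(v) - 1)]
    out.append(v[-1])                              # right boundary
    return out

data = [0.0, 1.0, 4.0, 9.0, 16.0]
assert relax_branchy(data) == relax_split(data)
```

The abstract functional style would instead express the whole update as one shifted-array expression, leaving the boundary peeling to the compiler.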
Youngsung Kim, Pavol Cerny and John Dennis. Performance Search Engine Driven by Prior Knowledge of Optimization
Abstract: For scientific array-based programs, optimization for a particular target platform is a hard problem. There are many optimization techniques such as (semantics-preserving) source code transformations, compiler directives, environment variables, and compiler flags that influence performance. Moreover, the performance impact of (combinations of) these factors is unpredictable. This paper focuses on providing a platform for automatically searching through a search space consisting of such optimization techniques. We provide (i) a search-space description language, which enables the user to describe the optimization options to be used; (ii) a search engine that tests the performance impact of optimization options by executing optimized programs and checking their results; and (iii) an interface for implementing various search algorithms. We evaluate our platform using two simple search algorithms - a random search and a case-tree search that heuristically learns from the already examined parts of the search space. We show that such algorithms are easily implementable in our platform, and we empirically find that the framework can be used to find useful optimized algorithms.
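The random-search baseline mentioned above can be sketched as follows; the option names, the search space, and the stand-in cost model are all invented for illustration and are not taken from the paper:

```python
import random

# Invented option space: each key is an optimization knob, each list its values.
search_space = {
    "unroll":  [1, 2, 4, 8],
    "tile":    [16, 32, 64],
    "flag_O3": [False, True],
}

def measure(cfg):
    # Stand-in for compiling, running, and timing the program under cfg;
    # a real search engine would also check the program's results.
    return abs(cfg["unroll"] - 4) + abs(cfg["tile"] - 32) + (0 if cfg["flag_O3"] else 5)

def random_search(trials, seed=0):
    """Sample configurations uniformly and keep the best-measured one."""
    rng = random.Random(seed)
    best_cfg, best_cost = None, float("inf")
    for _ in range(trials):
        cfg = {k: rng.choice(vals) for k, vals in search_space.items()}
        cost = measure(cfg)
        if cost < best_cost:
            best_cfg, best_cost = cfg, cost
    return best_cfg, best_cost

best_cfg, best_cost = random_search(trials=50)
```

A case-tree search, by contrast, would replace the uniform `rng.choice` with choices biased by the costs already observed.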
Mathias Bourgoin and Emmanuel Chailloux. High-Level Accelerated Array Programming in the Web Browser
Abstract: Client-side web programming currently means using
technologies embedded in web browsers to run computations on the client computer.
Most solutions imply using JavaScript, which allows one to describe
computations and modify the DOM displayed by the browser.
However, JavaScript limits static checking, as everything (types, names, etc.)
is checked at runtime. Moreover, its concurrency model does not take advantage
of multi-core or GPU architectures.

In this paper we present WebSPOC, an adapted version of the SPOC
library for web applications. SPOC is an OCaml GPGPU library focusing
on abstracting memory transfers and handling GPGPU computations in a
strongly, statically typed context. SPOC provides a specific language,
called Sarek, to express kernels and different parallel skeletons to
compose them. To run SPOC programs on the Web client side, its OCaml
part is compiled to JavaScript code and its Sarek part to kernels
running on GPUs or multi-core CPUs.
Mahesh Ravishankar, Paulius Micikevicius and Vinod Grover. Fusing Convolution Kernels through Tiling
Abstract: Image processing pipelines are continuously being developed to deduce more
information about objects captured in images. To facilitate the development of
such pipelines several Domain Specific Languages (DSLs) have been developed that
provide constructs for easy specification of such computations. It is then up to
the DSL compiler to generate code to efficiently execute the pipeline on
multiple hardware architectures. While such compilers are getting ever more
sophisticated, to achieve large-scale adoption these DSLs have to beat or at
least match the performance that can be achieved by a skilled programmer.

Many of these pipelines use a sequence of convolution kernels that are memory
bandwidth bound. One way to address this bottleneck is through use of tiling. In
this paper we describe an approach to tiling within the context of a DSL called
Forma. Using the high-level specification of the pipeline in this DSL, we
describe a code generation algorithm that fuses multiple stages of the pipeline
through the use of tiling to reduce the memory bandwidth requirements on both
GPU and CPU. This technique improves the performance of pipelines like
Canny Edge Detection by 58% on NVIDIA GPUs, and of the Harris Corner
Detection pipeline by 71%.
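The bandwidth argument above can be made concrete with a toy one-dimensional version (our own sketch, not Forma's generated code): two blur stages are fused so that each output tile recomputes a one-element halo of the intermediate stage locally instead of materializing the whole intermediate array:

```python
# Hedged sketch of fusing two stencil stages through tiling: trade a little
# redundant arithmetic at tile edges for never writing the full intermediate.

def blur(v, i):
    """3-point box blur with clamped boundaries."""
    n = len(v)
    return (v[max(i - 1, 0)] + v[i] + v[min(i + 1, n - 1)]) / 3

def two_stage_unfused(v):
    tmp = [blur(v, i) for i in range(len(v))]      # full intermediate array
    return [blur(tmp, i) for i in range(len(tmp))]

def two_stage_fused(v, tile=4):
    n, out = len(v), []
    for start in range(0, n, tile):
        stop = min(start + tile, n)
        lo, hi = max(start - 1, 0), min(stop + 1, n)  # halo of 1 for stage 2
        tmp = {i: blur(v, i) for i in range(lo, hi)}  # tile-local intermediate
        def blur_tmp(i):
            return (tmp[max(i - 1, lo)] + tmp[i] + tmp[min(i + 1, hi - 1)]) / 3
        out += [blur_tmp(i) for i in range(start, stop)]
    return out

data = [float((i * i) % 7) for i in range(10)]
assert two_stage_fused(data) == two_stage_unfused(data)
```

In the 2-D GPU setting the same halo trick lets the fused intermediate live in shared memory or registers rather than DRAM, which is where the bandwidth saving comes from.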
Michael Budde, Martin Dybdal and Martin Elsman. Compiling APL to Accelerate Through a Typed Array Intermediate Language
Abstract: In this paper, we present a compiler that compiles a rich subset of APL into programs that can be executed on GPGPUs. The compiler is based on the APLTAIL compiler, which compiles APL programs into a typed array intermediate language, called TAIL. The central part of the compiler compiles TAIL programs into Haskell programs that make use of the Accelerate library, an array language embedded in Haskell for high-performance computation on GPGPUs. We demonstrate the feasibility of the approach by presenting some encouraging results for a number of smaller benchmarks. We also outline some problems that we need to overcome in order for the approach to result in competitive code for larger benchmarks.
Rahul Garg, Sameer Jagdale and Laurie Hendren. Velociraptor: A compiler toolkit for array-based languages targeting CPUs and GPUs
Abstract: We present a toolkit called Velociraptor that can be used by compiler writers to quickly build compilers and other tools for array-based languages. Velociraptor provides its own unique intermediate representation (IR) designed for array-based
languages, novel analyses and transformations such as region detection and specialization, as well as a dynamic back-end with CPU and GPU code generation.
We discuss the components of the toolkit and also present case-studies illustrating the use of the toolkit.
Aaron Hsu. Accelerating Information Experts through Compiler Design
Abstract: Dyalog APL is a tool of thought for information experts, enabling rapid development of domain-centric software without the costly software engineering feedback loop often required. The Dyalog APL interpreter introduces performance constraints that hinder the analysis of large data sets, especially on highly-parallel computing architectures. The Co-dfns compiler project aims to reduce the overheads involved in creating high-performance code in APL. It focuses on integrating with the APL environment and compiles a familiar subset of the language, delivering significant performance and platform independence to information experts without requiring code rewrites and conversion into other languages.

The design of the Co-dfns compiler, itself an APL program, possesses a unique architecture that permits implementation without branching, recursion, or other complex forms of control flow. By integrating specific optimizations, the generated code competes with hand-written C code in the domain of financial simulations, exceeding it when integrated into the environment. Preliminary results demonstrate platform-independent performance across CPUs and GPUs without modification of the source. Work continues to improve the performance of both the architecture and the generated code. Eventually, the project hopes to convincingly demonstrate a wider range of techniques that extend the suitable domain for effective array programming.
Andreas Kloeckner. Loo.py: From Fortran to performance via transformation and substitution rules
Abstract: A large amount of numerically oriented code has been, and is still being, written in legacy languages. Much of this code could, in principle, make good use of data-parallel, throughput-oriented computer architectures. Loo.py, a transformation-based programming system targeted at GPUs and general data-parallel architectures, provides a mechanism for user-controlled transformation of array programs. This transformation capability is designed to apply not just to programs written specifically for Loo.py, but also to those imported from other languages such as Fortran. It eases the trade-off between achieving high performance, portability, and programmability by allowing the user to apply a large and growing family of transformations to an input program. These transformations are expressed in and used from Python and may be applied from a variety of settings, including, in a pragma-like manner, from other languages.
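The flavor of user-controlled loop transformation can be suggested with a toy model; the names `make_kernel` and `split_iname` echo Loo.py's vocabulary, but the code below is an invented illustration, not the real library (which operates on polyhedral iteration domains and generates OpenCL):

```python
# Toy model of transformation-based programming: a "kernel" is a loop
# description plus a body, and transformations rewrite the loop structure
# without touching the body.
import itertools

def make_kernel(n, body):
    """A toy kernel: run body(i) for 0 <= i < n over a single index i."""
    return {"loops": [("i", n)], "body": body}

def split_iname(knl, inner_size):
    """Transformation: split loop i into an outer/inner pair, i = io*s + ii."""
    (name, n), = knl["loops"]
    assert n % inner_size == 0, "toy version handles evenly divisible sizes only"
    def split_body(io, ii):
        knl["body"](io * inner_size + ii)
    return {"loops": [(name + "_outer", n // inner_size),
                      (name + "_inner", inner_size)],
            "body": split_body}

def run(knl):
    """Interpreter standing in for code generation: execute the loop nest."""
    for idx in itertools.product(*(range(n) for _, n in knl["loops"])):
        knl["body"](*idx)

out = [0] * 8
knl = make_kernel(8, lambda i: out.__setitem__(i, 2 * i))
run(split_iname(knl, inner_size=4))  # transformed kernel, same result as knl
```

In the real system the transformed kernel is compiled rather than interpreted, and a split index can subsequently be mapped to hardware parallelism; substitution rules play the analogous role for rewriting the kernel body itself.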