Are you building a compiler for an array-based language, targeting CPUs and GPUs? Then you will be interested in the tools we are building. They are designed to help you build your compiler quickly, they work across hardware vendors, and they will (soon) be open-source under a very liberal license.

These tools are being built by Rahul Garg as part of his ongoing PhD thesis on compiling array-based languages to hybrid CPU/GPU systems. The project is supervised by Prof. Laurie Hendren at the SABLE lab and is part of the McLab project.

Velociraptor

Velociraptor is a compiler toolkit for array-based languages that is not tied to any single source language. You can use it to build compilers and other tools for such languages. Velociraptor takes as input a high-level IR called Velociraptor IR (VRIR for short), which is described below. Velociraptor provides multiple components that can be used to build compilers. The first component, libVRIR, is an analysis and transformation infrastructure that can be used by both static and dynamic tools. The second component, VRdino, is a dynamic compiler that generates LLVM IR for CPUs and OpenCL for GPUs from VRIR.

VRIR

VRIR is a typed, AST-based IR used by Velociraptor. VRIR is designed to be easy to generate from array-based languages. It has built-in operators for many high-level array operations (such as matrix multiplication), flexible array indexing schemes, and support for various array layouts. VRIR also includes high-level constructs for parallelism and GPU acceleration, and has a textual representation inspired by Lisp S-expressions. An illustrative example follows the links below.
Overview: A high-level overview of VRIR can be found here.
Grammar: The grammar for the VRIR textual representation can be found here.
Samples: Samples of VRIR generated from example programs can be found here.
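To give a concrete feel for the kinds of programs VRIR models, here is a small hypothetical NumPy-style function of the sort that maps naturally onto VRIR's constructs: the matrix product corresponds to the built-in matrix multiplication operator, and the element-wise arithmetic to vector operations. This example is illustrative only; it is not one of the VRIR samples linked above.

    import numpy as np

    # Hypothetical example program; the comments indicate which VRIR
    # constructs each operation would map to.
    def scale_product(a, b, alpha):
        c = np.dot(a, b)         # VRIR's built-in matrix multiplication operator
        return alpha * c + 1.0   # element-wise vector operations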

libVRIR

libVRIR is an analysis and transformation infrastructure for VRIR that can be used in both static and dynamic tools. If you are a compiler writer interested in VRIR, this is your first stop. libVRIR provides many standard passes such as simplification, inlining, and liveness analysis. It also provides a new analysis called region detection, which identifies regions of code (such as loops and vector operations) that may be specialized at runtime using information such as the shapes of variables. This information can be used by dynamic compilers or by static tools such as code instrumentation tools. libVRIR can be downloaded here.
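To illustrate the idea behind region detection, here is a toy sketch over a made-up AST. The node classes and the find_regions entry point are hypothetical and are not the libVRIR API; the sketch only shows how loops and vector operations can be collected as candidate regions for later specialization.

    from dataclasses import dataclass, field

    @dataclass
    class Node:
        kind: str                     # e.g. "for", "vecop", "call", "assign"
        children: list = field(default_factory=list)

    REGION_KINDS = {"for", "vecop"}   # loops and vector operations

    def find_regions(node, regions=None):
        """Collect maximal subtrees rooted at a loop or vector operation."""
        if regions is None:
            regions = []
        if node.kind in REGION_KINDS:
            regions.append(node)      # the whole subtree is one region
            return regions            # do not split a region further
        for child in node.children:
            find_regions(child, regions)
        return regions

    # A function body containing a loop, a vector operation, and a call.
    body = Node("block", [
        Node("for", [Node("assign")]),
        Node("vecop"),
        Node("call"),
    ])
    print([r.kind for r in find_regions(body)])   # prints ['for', 'vecop']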

VRdino

VRdino is the dynamic backend for Velociraptor. It generates LLVM IR for CPUs and OpenCL for GPUs, making it portable across many different systems. It performs some interesting optimizations, such as runtime specialization of the regions identified by region detection analysis. VRdino is accompanied by an intelligent runtime system designed to maximize throughput on hybrid systems by executing operations in parallel and by minimizing redundant data transfers. VRdino depends upon libVRIR, LLVM, OpenCL, BLAS, and RaijinCL.
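The core of runtime specialization can be summarized in a few lines. The toy sketch below caches one compiled version of a region per combination of argument shapes; the compile_region and specialize names are hypothetical, and a real implementation such as VRdino would emit LLVM IR or OpenCL rather than returning a Python closure.

    import numpy as np

    def compile_region(shapes):
        # Stand-in for the real compiler: pretend we emitted code
        # specialized for exactly these argument shapes.
        print("compiling region for shapes", shapes)
        return lambda a, b: np.dot(a, b)

    _cache = {}

    def specialize(a, b):
        key = (a.shape, b.shape)
        if key not in _cache:         # compile once per shape combination
            _cache[key] = compile_region(key)
        return _cache[key](a, b)

    x = np.ones((4, 8)); y = np.ones((8, 2))
    specialize(x, y)                  # triggers compilation
    specialize(x, y)                  # reuses the cached specialization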

Case Studies

We are demonstrating the tools described above in two different projects. First, we are using them to enable the use of GPUs in McVM, a virtual machine for the MATLAB language built at our lab. This work will be open-sourced soon.

Second, we have a proof-of-concept system for compiling annotated Python code (specifically the numerical subset using NumPy) in which both CPU and GPU code generation is handled by Velociraptor. You can find the Python-to-VRIR compiler here.
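The following hypothetical snippet shows the general style of such annotated code; the decorator name is made up and is not the actual interface of the Python-to-VRIR compiler linked above.

    import numpy as np

    def offload(func):
        # Stand-in annotation: the real system would translate the
        # function to VRIR and hand it to Velociraptor for CPU/GPU
        # code generation.
        return func

    @offload
    def saxpy(alpha, x, y):
        return alpha * x + y          # element-wise kernel

    print(saxpy(2.0, np.arange(4.0), np.ones(4)))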

RaijinCL

RaijinCL is a high-performance, portable library implementing many matrix operations on the GPU, such as matrix multiplication and reductions. It is difficult to write a single OpenCL kernel that performs well everywhere. Therefore, RaijinCL uses autotuning: it tests hundreds of different versions of important functions and records the best one found on a user's machine.
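Conceptually, the autotuning loop is simple: enumerate candidate configurations, benchmark each, and record the winner. The toy Python sketch below searches over tile sizes for a tiled matrix multiply; it is a stand-in for what RaijinCL does with real OpenCL kernels, and the parameter space shown is illustrative.

    import itertools, time
    import numpy as np

    def run_variant(tile_m, tile_n, a, b):
        # Proxy workload: compute the product in (tile_m x tile_n) output tiles.
        m = a.shape[0]
        n = b.shape[1]
        c = np.zeros((m, n))
        for i in range(0, m, tile_m):
            for j in range(0, n, tile_n):
                c[i:i+tile_m, j:j+tile_n] = np.dot(a[i:i+tile_m], b[:, j:j+tile_n])
        return c

    def autotune(a, b, tiles=(8, 16, 32, 64)):
        best, best_time = None, float("inf")
        for tm, tn in itertools.product(tiles, tiles):
            start = time.perf_counter()
            run_variant(tm, tn, a, b)
            elapsed = time.perf_counter() - start
            if elapsed < best_time:
                best, best_time = (tm, tn), elapsed
        return best                   # this configuration would be recorded on disk

    a = np.random.rand(256, 256)
    b = np.random.rand(256, 256)
    print("best tile sizes:", autotune(a, b))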

On GEMM (general matrix multiply) routines, we are able to match the performance of vendor libraries on most systems, and deliver good performance everywhere. We have also developed a prototype extension of our autotuning approach for single-chip CPU+GPU systems that utilizes both the CPU and the GPU. We found that on some systems, GPU kernels optimized for GPU-only workloads are not optimal for hybrid CPU+GPU workloads. Depending on the machine, our autotuner outperforms CPU-only, GPU-only, and naive CPU+GPU autotuning strategies.
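The hybrid idea can also be sketched in a few lines: split the GEMM by rows according to a tuned ratio, compute the two parts on different devices, and combine the results. In this toy version both halves run on the CPU and the ratio is fixed; in the real system the ratio would come from the autotuner and the halves would execute concurrently on the CPU and GPU.

    import numpy as np

    def hybrid_gemm(a, b, ratio=0.6):
        split = int(a.shape[0] * ratio)   # rows assigned to the GPU
        c_top = np.dot(a[:split], b)      # would be the GPU kernel
        c_bottom = np.dot(a[split:], b)   # would be the CPU BLAS call
        return np.vstack([c_top, c_bottom])

    a = np.random.rand(100, 50)
    b = np.random.rand(50, 30)
    assert np.allclose(hybrid_gemm(a, b), np.dot(a, b))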

RaijinCL downloads and support can be found here.

Publications and Presentations