OpenMP extensions for FPGA Accelerators

Daniel Cabrera\textsuperscript{1,2}, Xavier Martorell\textsuperscript{1,2}, Georgi Gaydadjiev\textsuperscript{3}, Eduard Ayguade\textsuperscript{1,2}, Daniel Jiménez-González\textsuperscript{1,2}

\textsuperscript{1}Barcelona Supercomputing Center
c/Jordi Girona 31,
Torre Girona,
E-08034 Barcelona, Spain

\textsuperscript{2}Universitat Politecnica de Catalunya
c/Jordi Girona 1-3,
Campus Nord-UPC, Modul C6,
E-08034 Barcelona, Spain
{dcabrera, xavim, eduard, djimenez}@ac.upc.edu

\textsuperscript{3}Delft University of Technology
Mekelweg 4,
2628 CD Delft,
The Netherlands
g.n.gaydadjiev@its.tudelft.nl

Abstract—Reconfigurable computing is one of the paths to explore towards low-power supercomputing. However, programming these reconfigurable devices is not an easy task and still requires significant research and development efforts to make it really productive. In addition, the use of these devices as accelerators in multicore, SMPs and ccNUMA architectures adds an additional level of programming complexity in order to specify the offloading of tasks to reconfigurable devices and the interoperability with current shared-memory programming paradigms such as OpenMP. This paper presents extensions to OpenMP 3.0 that try to address this second challenge and an implementation in a prototype runtime system. With these extensions the programmer can easily express the offloading of an already existing reconfigurable binary code (bitstream) hiding all the complexities related with device configuration, bitstream loading, data arrangement and movement to the device memory. Our current prototype implementation targets the SGI Altix systems with RASC blades (based on the Virtex 4 FPGA). We analyze the overheads introduced in this implementation and propose a hybrid host/device operational mode to hide some of these overheads, significantly improving the performance of the applications. A complete evaluation of the system is done with a matrix multiplication kernel, including an estimation considering different FPGA frequencies.

I. INTRODUCTION

The gigahertz race to which we were used to in the last decade has stopped due to power dissipation problems. The extra transistors that are available for new designs are not used to increase the complexity of superscalar architectures, out of order or multithreaded. The technological increase in transistor count is used to include more that one core in the same chip (homogeneous multicore) and/or to incorporate accelerators (heterogeneous multicore) well suited for certain application domains, such as for example GPU units in [1] or vector units in the Cell/B.E.[2]. For these accelerators, the exploitation of the potential parallelism available is not an easy task, and relies on the use of specific SDKs.

The use of specialized devices designed to compute some specific function (ASIC circuits) is another alternative to benefit a specific kind of applications. For example an ASIC to compute fast Fourier transforms can clearly eliminate the computation bottlenecks found in some bioinformatics applications. Field Programmable Gate Arrays (FPGA) are accelerators whose specific functionality can be retargetted to different domains at runtime. However, efficiently programming these specific functionalities requires the use of low-level hardware description languages (HDL), such as Verilog or VHDL, to which general-purpose programmers are not used to.

The productive parallelization of applications for heterogeneous multicore architectures that include one or more of such accelerators requires programming models able to express the proper ofoading of tasks and the data that is needed to perform the computation. This is the purpose of this paper, and in particular, to show a proposal that extends OpenMP 3.0 tasking [3] to target heterogeneous architectures with FPGA-based accelerators. OpenMP 3.0 task pragmas completely fits with the idea of using one or more FPGAs as accelerators. In this paper we are assuming that the bitstreams that corresponds to the computations to be offloaded in tasks are either existing IP cores or are generated using other compilation tools. This may impose some restrictions in the behavior of the tasks to be offloaded, such as for example on the use of synchronization constructs.

In order to motivate our extensions to OpenMP 3.0 and their implementation in the runtime system, Figure 1 shows part of the code that is necessary to ofload the execution of a matrix multiplication bitstream matmul\_fpga to one of the FPGAs available on the SGI RASC architecture [4], using the SGI RASCLib library [5]. In addition to this, the programmer needs to change the memory association of data in the host when transfers to/from the FPGA device.
(pack/unpack), which may also require the use of blocking in order to fit the requirements of the accelerator bitstream and memory. Our proposed extensions and runtime implementation try to hide all these complexities, making the parallelization and of tasking of tasks in accelerators more productive.

The rest of the paper is organized as follows: Section II presents the extensions proposed for OpenMP. Section III shows implementation details of the runtime system. Sections IV details the experimental setup. Sections V and VI show experimental evaluation. Section VII presents related work, and we conclude with Section VIII.

II. PROPOSED OPENMP EXTENSIONS FOR FPGA

Tasks are the most important new feature of OpenMP 3.0. A programmer can define deferrable units of work, called tasks, and later ensure that all the tasks defined up to some point have finished.

```c
#pragma omp task [clause-list] structured-block
```

Clauses can be used to specify data scoping (shared, private or firstprivate) and conditional execution as a task (if), mainly. The runtime system launches the execution of the code in `structured-block` in the scope of the parent task, following the data scoping attributes indicated. The current execution model assumes that a thread in the current team of threads will execute the task. The proposal in [6] extended the `task` construct with some additional clauses to derive dependencies among tasks at runtime

- `input(data-reference-list)`
- `output(data-reference-list)`
- `inout(data-reference-list)`

The information provided in these clauses will be used by the runtime system to analyze the dependencies among tasks and guarantee the proper order execution of them as proposed in [6]. Although in some cases the compiler can analyze the code and determine the input and output data sets, we provided these additional clauses to modify or augment the compiler analysis.

In an heterogeneous multicore architecture, we need some additional information in order to appropriately assign the execution of the task to any of the available devices, a GPU, a vector unit, FPGA device, ... Our proposals leverages on previous proposals that allow the specification of dependencies between tasks [6], loop blocking transformations specified at the pragma level [7], and the use of accelerators [8], all of them in the scope of OpenMP 3.0. In the following subsections we comment each one of the new pragmas that we use and/or extend in order to consider FPGA-based accelerator architectures.

Figure 2 shows the full version using our new pragmas in OpenMP 3.0 (pragmas details in Section II). Task of loading to `fpga` device is specified in the header function `matmul_fpga` (line 8 in Figure 2), and the appropriate blocking and packing/unpacking of data is expressed through the `block` pragma (line 17) and the specification of the direction of the arguments (line 7, 19, 20). The runtime system will take care of efficiently implementing them and hiding their possible overheads.

```c
void matrix_multiplication(float A[BS][BS],
float B[BS][BS],
float C[BS][BS]) {

/* Configure device */
res = rasclib_resource_configure("matmul_fpga",
num_devices, NULL);
algorithm_id = rasclib_algorithm_open("matmul_fpga",
RASCLIB_BUFFERED_RO);
/* Send inputs */
res = rasclib_algorithm_send(algorithm_id, "A",
A, sizeof(A));
res = rasclib_algorithm_send(algorithm_id, "B",
B, sizeof(B));
/* Execute */
rasclib_algorithm_go(algorithm_id);
/* Receive outputs */
res = rasclib_algorithm_receive(algorithm_id, "C",
C, sizeof(C));
/* Commit commands and wait */
rasclib_algorithm_wait(algorithm_id);
rasclib_resource_return("matmul_fpga", num_devices);
}
```

Fig. 1. Basic example programmed using RASClib. Error check has been omitted.

```c
float A[DIM_SIZE][DIM_SIZE];
float B[DIM_SIZE][DIM_SIZE];
float C[DIM_SIZE][DIM_SIZE];

#pragma omp target device(fpga) 
implement(block_matmul) 
copy_in(A,B) copy_input(C)
extern void matmul_fpga(float A[BS][BS],
float B[BS][BS], float C[BS][BS]);

for (i = 0; i < DIM_SIZE; i++) {
for (j = 0; j < DIM_SIZE; j++) {
    #pragma omp block nest(3) factor(BS,BS,BS) 
    #pragma omp task label(block_matmul) 
    input(A[i][k],B[k][j]) 
    output(C[i][j]) 
    C[i][j] += A[i][k] * B[k][j];
}
}
```

Fig. 2. Matrix multiplication using the proposed OpenMP extensions.
A. Target device pragma

The following pragma [8] may precede any existing pragma task or function declaration or header

```c
#pragma omp target device(device-name-list) 
 {pragma-task
 function-definition
 function-header}
```

It is used to specify that the execution of the task could be of oad to any of the devices specified in device-name-list. The names used in this list are vendor specific (i.e. cell, cuda, ...). In the case of using FPGA accelerators, the device-name should be fpga.

Then, when a task is ready for execution (i.e. it has no dependencies with other previously generated tasks) the runtime can choose among the different available targets to decide in which device to execute the task. If no resource is available (or not configured yet), the runtime could execute the default implementation on the host or stall until one of the resources becomes available.

Some additional clauses can be used with this pragma device:

- `copy_in(data-reference-list)
  copy_out(data-reference-list)
  copy_inout(data-reference-list)`

The first three clauses, which are ignored for the shared-memory architectures, specify data movement for the shared variables used inside the task. Copy_in will move variables in data-reference-list from host to device memory. Copy_out will move variable in data-reference-list back from device to host memory. Copy_inout will do both. Once the task is ready for execution, the runtime system will move variables in the copy_in or copy_inout lists. Once the task finishes execution, the runtime will move variables in the copy_out or copy_inout lists, if necessary.

Clause `implements(function-name or label-name)` is used to specify an alternative implementation for a function that is invoked as a task. For instance, in the following code excerpt:

```c
#pragma omp task input(A[BS][BS], B[BS][BS])
  output(C[BS][BS])
extern void matmul(float A[BSIZE][BSIZE],
  float B[BSIZE][BSIZE],
  float C[BSIZE][BSIZE]);
#pragma omp target device(fpga) \  
  implements(matmul) \  
extern void matmul_fpga(float A[BS][BS],
  float B[BS][BS],
  float C[BS][BS]);
```

The programmer specifies that `matmul_fpga` is an alternative implementation of `matmul` when of oad the execution of that function to an FPGA device. In addition the programmer is specifying a change in memory association of the blocks used in the host implementation (`matmul`) and in the of oad implementation (`matmul_fpga`). Notice that the accelerator version uses blocks of BSxBS contiguous elements, while in the host the block of BSxBS elements is part of a larger matrix of BSIZExBSIZE elements.

B. label-name clause

In order to allow the specification of a alternative implementations for structured code blocks, we extend the task pragma with an additional label-name clause. The input, output and inout clauses in task together with the copy_in, copy_out and copy_inout clauses in target are used to match variables used in the structured code block with arguments in the function used to implement it.

For the example shown in Figure 2, the programmer specifies that `matmul_fpga` bitstream will be used to of oad the execution of the structured code block `block_matmul` to the FPGA device.

In the case of function calls, it is not necessary to specify the label clause since the function name is used as label to identify alternative implementations.

C. Block pragma

When of oad the execution of tasks to accelerators, it is necessary to fit the problem to the limitations (for example memory) of those accelerators. In the case of FPGA accelerators, the specific hardware implementation of the task code in the FPGA can introduce additional constraints. In this paper we propose the use of pragmas to drive program transformations that are necessary to fit the task into the accelerator device. In particular the use of loop blocking, a well-known compiler technique used to optimize the exploitation of data locality.

The block pragma is introduced to block perfectly nested loops whose body is to be of oad into the accelerator.

```c
#pragma omp block nest(block-dimension) 
factor(block-size-list)  
task-code
```

The nest(block-dimension) clause specifies that block-dimension consecutive loops are affected by the blocking, being the inner loop the one that contains the block pragma. On the other hand, the block-size-list on the factor clause specifies the blocking size that should be used for each loop and induction variable. In addition to the loops, the pragma also transforms all the references in the data-reference-lists that are included in input, output and inout clauses.
To illustrate the effect of this pragma, Figure 3 shows how a source-to-source restructurer would transform the block pragma in Figure 2. The code transformation shown in Figure 3 should be manually applied by the programmer if the proposed block clause would not exist. This block pragma makes programming for such accelerators much more productive, avoiding the writing of the bounds of the block of elements that have to be moved to/from the accelerator.

Other more complex blocking strategies will have to be manually introduced by the programmer. However, the simple one proposed here is widely applied.

III. RUNTIME SYSTEM IMPLEMENTATION

The runtime system should provide support to ofoad the execution of bitstreams and the required data transfers for the SGI RASC technology following the OpenMP 3.0 pragma extensions described in Section II. In addition to this support, an implementation should also include the following main features:

- Bitstream cache and support for hybrid computation.
- Transparent change of memory association, providing data packing and unpacking when transferring data between host and FPGA device.
- Multithreaded FPGA library interface.

In the following subsections we detail a possible implementation of these features.

A. Bitstream cache and hybrid computation

As we will analyze in Section V, the time required to configure a bitstream in the FPGA can be significantly high.

To avoid unnecessary configurations, the runtime system implements a fully associative cache structure to keep information about the bitstreams currently loaded in the FPGA devices. When a task pragma is found, the runtime checks if the bitstream that implements the function or structured code block associated with the task is already configured. A hit in the bitstream cache produces the effective ofoad of the task execution. If the runtime detects a miss in the bitstream cache, it will apply a least frequently used (LFU) replacement policy, initializes the FPGA device associated to replace the bitstream and configure the FPGA device with the new bitstream. The number of entries of the bitstream cache is the number of FPGA devices we have. In case of having partial reconfigurations, the bitstream cache should take them into account.

In order to hide the FPGA configuration time that happens on a miss, the runtime applies what we call a hybrid computation policy. In this hybrid computation mode, when a miss occurs the runtime checks if the configuration for that task is already in progress; if it is not in progress, the runtime will launch the configuration of the bitstream. In both cases, the runtime will execute the task in the host processor overlapping with the configuration process. Once the bitstream is configured, the runtime will detect a hit in the bitstream cache and ofoad the execution of future instances of that task to the FPGA device.

B. Memory association changes: pack/unpack

The data transfer bandwidth between the host and the FPGA device can be a bottleneck depending on the application characteristics, and the system you are running on. Our runtime system deals with this issue doing data packing and unpacking, as indicated by the different memory associations detected by the source-to-source compiler.

C. Multithreaded FPGA library interface

The FPGA library interface should be implemented using threads in order to avoid the application to be blocked in FPGA management operations. For instance, a thread can be configuring the FPGA device meanwhile another one is doing useful work in the host side (hybrid computation). There are one master and one worker threads. The master thread queues operations in a consumer-producer structure that the worker reads. The master thread will be blocked if the queued operation is synchronous, otherwise it will continue. The worker thread is in charge of performing the queued operations and will unblock the master thread when necessary.

IV. EXPERIMENTAL SETUP

Our experiments have been run on a SGI Altix 4700 system equipped with 128 Itanium processors. The system we used also includes a RASC RC100 FPGA blade (with two Xilinx Virtex 4 LX200). The prototype runtime system
targets the SGI RASC library 2.2 and Core Services [5]. The source-to-source code transformation process has been implemented as a new pass in the Mercurium infrastructure (version 1.2.1) developed at BSC [9]. We use gcc 4.1.2 as backend for compilation to binary code. For all the compilations we have used -O3 as optimization level for the software part. Also, we have utilized all the Makefiles that SGI provides to compile the HDL codes and generate the configuration files. Those Makefiles use Xilinx ISE 9.1 to generate the bitstream. All timing results are obtained using the gettimeofday() system call.

In our current prototype implementation, an abstract layer, that is easily adaptable to other FPGA based architectures, has been implemented to manage the FPGA devices. Currently, such a layer does not consider neither multiple FPGA accelerators, nor partial reconfigurations. Finally, our current runtime implementation is not dealing with task dependencies.

V. IMPLEMENTATION ANALYSIS

We have tested our extensions and runtime with the RASC library examples (the data ow which is a pattern matching program and the simple algorithm which performs logic manipulations) to evaluate the communication and the programming model cost. Furthermore, we perform a detailed analysis of our implementation using a matrix multiplication kernel.

A. Communication Costs

Bitstream loading cost: We define the bitstream loading cost as the accumulated time of reading the bitstream file and the configuration of the reconfigurable device. For the 4MB size of the matrix multiplication bitstream this time is 1 second. In fact, this is a cost that you can save if you keep information of which bitstream is already configured in the FPGA device. Our runtime system keeps a bitstream cache with that information, reducing that time to 4ms (since there is some work to be done in any case).

In any case, the first time that the FPGA device has to be configured we will have to pay that large amount of time. Therefore, our runtime system performs an hybrid computation of the task that has to be of oad to an FPGA device, in the sense, that if there is a bitstream cache miss, the runtime system will spawn a thread to configure the bitstream in the FPGA device meanwhile executing the same task in an available device target.

Host-SRAM Communication Bandwidth: In order to evaluate the DMA transfer bandwidth from host to the SRAM of the FPGA board in the SGI Altix System, we have done a modification of one of the examples of the RASClib library, the simple algorithm which performs logic manipulation. These measures have been taken using the RASC driver in BUFFERED_IO mode. The modified bitstream reads the last element of a large chunk of data that has to be written in the memory of the FPGA and, when that value is valid, the algorithm finalizes, signaling the host part of the program. Using this bitstream we have a very good approximation of the real DMA transfer bandwidth between host and the SRAM of the FPGA: 0.8GB/s.

We have also evaluated the bandwidth of the non-DMA transfers measuring their execution time. The RASClib allows us to send a command to the FPGA using a register. The function that performs this operation finishes when the data has been sent. The non-DMA data transfer bandwidth is 630KB/s.

B. Programming Model Costs

In Table I we can see the operations used by our runtime system. Last column specifies the number of times that this operation has to be performed in the application: 1 for configuration/deconfiguration operations and n for application dependent operations, where n is equal or larger than 1. Results are for the pattern matching example that comes with the RASClib library. As we can see, the most time consuming operations are: initialization of the FPGA device, spawn of the bitstream to the FPGA, and waiting for the FPGA to finish. The initialization time is not significant since it is only done once, and the waiting for the FPGA depends on the application. Finally, the spawn to the FPGA time shown in the Table is the bitstream loading cost commented above, when there is a hit in the bitstream cache implemented in our runtime system.

In the case of using the blocking technique, we may have to pay a cost of packing/unpacking blocks of consecutive memory to send/receive them to/from the FPGA device. Figure 4 shows the packing/unpacking process of a data block of matrix C. That cost is host architecture dependend. In the case of our matrix multiplication, using blocking, we have analyzed which is the overall application cost of packing/unpacking when doing a matrix multiplication of 256x256 elements in the SGI Altix 4700 machine. Table II shows the overall execution times of packing/unpacking of the data varying the block size from 32x32 upto 128x128, for a 256x256 matrix multiplication. The larger the block,
Fig. 4. Packing/Unpacking process for a data block of matrix C.

<table>
<thead>
<tr>
<th>Block size</th>
<th>Pack/Unpack (ms)</th>
</tr>
</thead>
<tbody>
<tr>
<td>32x32</td>
<td>7.9</td>
</tr>
<tr>
<td>64x64</td>
<td>2.8</td>
</tr>
<tr>
<td>128x128</td>
<td>2.1</td>
</tr>
</tbody>
</table>

TABLE II
OVERALL EXECUTION TIME OF PACKING/UNPACKING BLOCKS IN THE MATRIX MULTIPLICATION OF 256X256 ELEMENTS

the better the performance we achieve since we save block data management overheads, which is very significant for 32x32 blocks. Other memory transfer strategies to process all the blocks of whole matrix may help to improve the performance of the matrix multiplication.

VI. CASE OF STUDY: MxM
The objective of this section is not a performance evaluation of the FPGA in the SGI Altix 4700, neither a comparison between two different architectures, but to show a real example using our proposed extensions.

We have used a matrix multiplication core for up to 128x128 double precision elements, which works at 50 MHz in the FPGA. This is a version of the MxM in [10] adapted to the RASC interface. We use this core in order to perform a hardware blocked matrix multiplication using our blocking directives as shown in Section II.

The blocking directives let us use the matrix multiplication core as a block matrix multiplication, being able to multiply matrices larger than 128x128 elements (also smaller). Figure 2 shows the code we use in the evaluation.

Figure 5 shows execution time for three different versions of a 256x256 matrix multiplication: software version (256 sw), software version using blocking (256 sw-blocking), and blocking FPGA version (256 hw). In the case of the blocking version, block sizes of 32x32, 64x64 and 128x128 have been tested. The blocking software versions show better results than hardware version. Some of the reasons are: the extra overhead to execute in the FPGA, and the frequency of the matrix multiplication bitstream (50MHz). However, the execution time for the case of 32x32 blocks is very large and we analyzed it in detail. We figured out that the rasclib_algorithm_open function, used in the spawn of the thread to the FPGA to start executing the code, increases its execution time in each invocation. We cannot give any reason why this function is wasting that time since we do not have access to the SGI RASClib code. Table III shows the overhead produced by the rasclib_algorithm_open function. However, although any mechanism to reduce the number of times rasclib_algorithm_open is called will improve the overall performance of the matrix multiplication, the packing/unpacking overhead cost (shown in Table II), and other overhead costs per block, will still exist. Therefore, the best performance is achieved for the largest block size tested.

Finally, we wanted to evaluate the performance of the blocked hardware version if we have had a matrix multiplication with higher frequency than 50MHz. There, we have estimated the execution times of the matrix multiplication using 100MHz and 200MHz frequencies, properly scaling the execution time of the FPGA execution part. Figure 6 shows the execution time of the hardware version when the frequency is 50MHz (the real frequency of our matrix multiplication core), 100MHz and 200MHz. Results show that hardware version with larger frequencies will overcome the software version (256 sw-blocking in the Figure).
C extensions. There are many of them: new models have appeared. Most of these models are programming paradigm etc. In these models, the programmer uses C language on those sections of code that should be of loaded the FPGA as a task. Usually, there are two strategies to accelerated/of loaded sections of code. In the first strategy, the section of code to be of loaded to the FPGA is translated from C to VHDL. This strategy is followed by Streams C, Impulse C and ROCCC. In all of these models we have a software/hardware solution where the developers only have to use a C subset and the API to do the communications. More in detail, Streams C and Impulse C are focused on the communications between the core running on the FPGA and the application. In both models, the authors provide us an API for communications, allowing the developers to program communications like C streams. ROCCC is oriented to HPC, focusing on loop optimizations and how to exploit the parallelism of them using reconfigurable computing.

The second strategy is to map a soft processor into the FPGA, and translate the source code to be executed to the code that this soft processor understands. This strategy is followed by Mitrion C, where the compiler generates code for a Mitrion Virtual Processor (a customizable processor created by Mitrionics). In both strategies the compiler manages communications to improve transparency in programming. In our work, we do not generate code from C to any HDL or any assembler for a soft core. We use the OpenMP standard to use existing bitstreams in the applications.

In this paper, we have presented a runtime system implementation of the OpenMP 3.0 extensions to of loaded tasks to different target devices. In particular, we have implemented those extensions inside the Mercurium compiler framework in a SGI Altix architecture to use FPGA devices. Those OpenMP extensions help programmability and reduce development cost.

We have seen some bottlenecks on the SGI Altix architecture and environment, like the DMA transfer bandwidth.
and the \texttt{rasclib\_algorithm\_open} function, that penalize the overall performance of the application. We will improve our bandwidth using using \texttt{DIRECT\_IO} mode in the RASC driver in a future release. Another time consuming function of RASClib library is \texttt{rasclib\_resource\_con\_gure()}. This function has to read the bitstream from the hard disk, load it in the main memory, send it through the NUMA link connection and configure the Virtex 4 FPGA. In particular, we have observed that most of the time is spent reading from the hard disk and configuring the FPGA. Our runtime system hides this time using hybrid computation and a bitstream cache. This hybrid computation allows to execute the host (software) version of the code to be of oadable meanwhile the bitstream is configured. That has been done using the bitstream cache in the runtime system to avoid unnecessary configurations. That process is transparent to the programmer and significantly improves the performance of the application. Our experiments show that hybrid computation of the task can significantly improve the use of target devices whose initialization cost is large.

Finally, we have done an evaluation of the bottlenecks of SGI Altix architecture and programming model implementation using several kernels, showing some results for a blocked matrix multiplication on the SGI Altix System.

### A. Current and Future work

We are currently working on supporting several FPGAs in the SGI Altix System. Therefore, future implementations will consider multiple FPGA accelerators, or multiple accelerators in an FPGA and then, the runtime system will use those parameters to properly schedule the task executions.

As future work, we plan to get involved on the definition of the standard CPU-FPGA interface (CoreLib [20]), to connect bitstreams to the code generated by our compiler. Our idea is to use a toolchain from C to bitstream for those parts of code that will be mapped to the FPGA device.

Other future lines are the optimization of the data movements so that the runtime system can advance them (prefetching), new packing/unpacking techniques, runtime partial reconfiguration, OpenMP extensions to synchronize CPU and FPGA threads, etc.

### ACKNOWLEDGMENTS

The researchers at BSC-UPC were supported by the Spanish Ministry of Science and Innovation (no. TIN2007-60625, CSD2007-00505 and TIN2006-27664-E), the European Commission in the context of the SARC project (no. 27648) and the HiPEAC Network of Excellence (no. IST-004408), and the MareIIncognito project under the BSC-IBM collaboration agreement. We also want to thank Mihai Lefter, from Delft University, for providing us the matrix multiplication bitstream used in some of our evaluations. Also, special thanks to Roger Ferrer for providing us Mercurium compiler and for solving all the questions about the compiler.

### REFERENCES


