4. Accelerate design performance with Direct Memory Access (DMA) and estimate hardware co-processor timing#

4.1. Required files#

SpaceStudio Project

4.2. Introduction#

In data-centric applications with high bandwidth requirements, processing the data stream at real-time speed can be challenging. Memories are often used to buffer the stream for processing by algorithmic blocks, but such memory accesses can become a bottleneck.

Regarding computation timing, a system designer needs to estimate the computational budget for algorithm blocks, as failing to do so may yield a system that does not meet performance requirements.

This tutorial is divided into two sections. In the first section, the attendee will explore efficient memory data transfers by leveraging ready-made SpaceStudio DMA capabilities. In the second section, the tutorial presents a methodology to estimate the computational budget of a hardware co-processor IP block.

4.2.1. Efficient Memory Data Transfers#

General-purpose processors can perform memory transfers, but this implies a performance hit. Hardware IPs can access memories as bus masters using their own custom logic, but this incurs a development and maintenance effort that is not part of the IP’s core functionality.

A better solution is to use Direct Memory Access (DMA). A DMA IP can transfer large amounts of data with minimal interaction with the software. In a nutshell, the software indicates to the DMA the source and destination addresses along with the number of bytes to be transferred. The main advantage is offloading the actual data transfer to the DMA, freeing the processor to work on other parts of the algorithm. DMA IPs use various techniques to achieve high throughput such as burst bus operations, wide data paths, higher frequency, circular mode, scatter and gather, etc.

There are three types of DMA operations: 1) memory to memory, 2) memory to stream and 3) stream to memory. System designers enable DMA transfers in SpaceStudio using one of two approaches: 1) explicit or 2) implicit DMA transfer.

In both cases, a DMA IP is configured by a controller module that needs to be executed on a processor. The controller module can be dedicated to controlling the DMA IP or can be part of the application (i.e., also executing a task of the application).

SpaceStudio automatically instantiates a DMA IP, correctly connects it to all relevant interconnects and generates an optimized software driver for the given API call.

4.2.1.1. Implicit DMA transfer#

In an implicit DMA transfer, the controller is part of the application and thus has access to the data; the DMA IP synchronizes with the application's memory. For example, the controller computes an array that needs to be sent to a processing block. In such a case, the controller necessarily holds a C-like array on which the computation is performed.

SpaceStudio refers to this as an implicit DMA transfer because the communication API calls used are not dedicated to DMA transfers. In fact, SpaceStudio reuses the API calls ModuleRead and ModuleWrite, and system designers indicate to SpaceStudio which communication channel (pair of matching ModuleRead and ModuleWrite) needs to be realized using a DMA IP.

Memory to memory DMA operation for implicit DMA transfer is not yet supported. Memory to stream is achieved when the controller uses the API call ModuleWrite and the destination is a co-processor. In the same manner, stream to memory is achieved when the controller uses the API call ModuleRead and the source is a co-processor.

The token SPACE_BLOCKING indicates the caller will return from the API call only when the transfer is complete. This behaviour is achieved using an interrupt that SpaceStudio automatically handles. If we do not want to wait for the transfer to complete before returning from the API call, the token SPACE_NON_BLOCKING should be used; in that case, no interrupt is generated.

4.2.1.2. Explicit DMA transfer#

In the explicit DMA transfer fashion, the controller controls the data flow of the processing blocks and never accesses the data itself. For example, a controller instructs the last processing block to send its data (e.g., a processed image) to the video memory. In such a case, the controller is not interested in the data but rather wants to feed a memory-centric device, the video controller.

SpaceStudio supports all three DMA operation types using, respectively, the API calls: 1) Memory2Memory, 2) Memory2Stream and 3) Stream2Memory. SpaceStudio refers to an explicit DMA transfer when using such API calls because these calls are dedicated to instantiating a DMA transfer.

For example, if 1MB needs to be transferred from memory with id SOURCE_ID at offset 0x0 to memory id DESTINATION_ID at offset 0x0, the following API call does the trick:

Memory2Memory(SOURCE_ID, 0x0, DESTINATION_ID, 0x0, SPACE_BLOCKING, 0x100000);

In the previous example, the token SPACE_BLOCKING indicates the caller will return from the API call only when the transfer is complete. This behaviour is achieved using an interrupt that SpaceStudio automatically handles. If we do not want to wait for the transfer to complete before returning from the API call, the token SPACE_NON_BLOCKING should be used; in that case, no interrupt is generated.
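For example, the same 1MB transfer can be started without waiting for completion; the sketch below only swaps the token in the call shown above, and the caller is then responsible for not reusing the source or destination regions before the DMA has finished:

// Start the transfer and return immediately; no completion interrupt is generated.
Memory2Memory(SOURCE_ID, 0x0, DESTINATION_ID, 0x0, SPACE_NON_BLOCKING, 0x100000);
// ... the processor can work on other parts of the algorithm while the DMA moves the data ...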

If the targeted operating system (OS) is Linux, explicit DMA transfer is faster than implicit DMA transfer because the kernel driver does not need to copy the user space memory to kernel space.

4.2.2. Computation budget#

During the design exploration loop, untimed models need to be refined to take into account the computation requirements. According to the Vivado HLS user guide, the loop latency is the number of cycles needed to execute all iterations of the loop.

The performance (computation timing) of an IP is defined in terms of cycles and can be determined by analysing the loop latency. Based on the timing reported by the HLS tool, system designers determine the appropriate timing to be back-annotated into the simulation model via the API call hw_compute_latency(x), where x represents the number of cycles.

The input algorithm code can be slightly modified to help the HLS tool report timing per loop. For example, add a label to each for-loop as shown in Listing 4.1 .

Listing 4.1 Example for adding labels to for loops#
L1: for(int i=0;i<MATRIX_ROWS;i++)
    L2: for(int j=0;j<MATRIX_COLUMNS;j++)
        L3: for(int k=0;k<MATRIX_COLUMNS;k++)

4.3. Manipulation#

4.3.1. Create a new module#

In this first section, the attendee will create a new module called matrix_mult that multiplies matrices using the general approach as presented below:

\[
A = \begin{bmatrix} a_{1,1} & \dots & a_{1,n} \\ \vdots & \ddots & \vdots \\ a_{m,1} & \dots & a_{m,n} \end{bmatrix}
\qquad
B = \begin{bmatrix} b_{1,1} & \dots & b_{1,p} \\ \vdots & \ddots & \vdots \\ b_{n,1} & \dots & b_{n,p} \end{bmatrix}
\]

\[
A \times B = \begin{bmatrix} a_{1,1} b_{1,1} + \dots + a_{1,n} b_{n,1} & \dots & a_{1,1} b_{1,p} + \dots + a_{1,n} b_{n,p} \\ \vdots & \ddots & \vdots \\ a_{m,1} b_{1,1} + \dots + a_{m,n} b_{n,1} & \dots & a_{m,1} b_{1,p} + \dots + a_{m,n} b_{n,p} \end{bmatrix}
\]
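Equivalently, each element of the product is a multiply-and-accumulate (MAC) over the shared dimension, which is exactly what the inner loop of the implementation below computes:

\[
(A \times B)_{i,j} = \sum_{k=1}^{n} a_{i,k}\, b_{k,j}, \qquad 1 \le i \le m, \quad 1 \le j \le p
\]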

As a starting point, refer to Listing 4.2 for the module's implementation.

Listing 4.2 matrix_mult implementation#
#include "matrix_mult.h"
#include "spacecomp_pre_cpp.h"

matrix_mult::matrix_mult(SPACECOMP_CTOR_PARAMS(matrix_mult, INDEX))
    : abstract_module(SPACECOMP_MBASE_ARGS(matrix_mult, INDEX))
    , SPACECOMP_START_ILIST(matrix_mult, INDEX)
{
    SPACECOMP_THREAD(matrix_mult, INDEX, thread);
    // The stack must hold the three MATRIX_ROWS x MATRIX_COLUMNS arrays of 4-byte elements declared in thread()
    set_stack_size(0x16000+(MATRIX_ROWS*MATRIX_COLUMNS*4*3));
}

void matrix_mult::thread(SPACECOMP_THREAD_PARAMS(matrix_mult, INDEX, thread)) {
    spacecomp_thread_initialize();
    uint32_t matrix1[MATRIX_ROWS*MATRIX_COLUMNS];
    uint32_t matrix2[MATRIX_ROWS*MATRIX_COLUMNS];
    uint32_t result[MATRIX_ROWS*MATRIX_COLUMNS];
    while (1) {
        const bool initializing = spacecomp_thread_loop_start();
        if (initializing) {}
        read_matrix(matrix1);
        read_matrix(matrix2);
        multiply(matrix1, matrix2, result);
        send_result(result);
    }
}

// Naive row-major matrix multiplication; the matrices are square, so MATRIX_ROWS is used as the row stride.
void matrix_mult::multiply(uint32_t* in_1, uint32_t* in_2, uint32_t* out) {
    for(int i=0;i<MATRIX_ROWS;i++)
        for(int j=0;j<MATRIX_COLUMNS;j++) {
            out[i*MATRIX_ROWS+j]=0;
            for(int k=0;k<MATRIX_COLUMNS;k++) {
                out[i*MATRIX_ROWS+j]+=in_1[i*MATRIX_ROWS+k]*in_2[k*MATRIX_ROWS+j];
            }
        }
}

...

#include "spacecomp_post_cpp.h"

The matrices will be sent (or dispatched) by the controller to the matrix_mult module. In all cases (hardware FIFO, implicit and explicit DMA transfer), the matrix_mult module reads and writes matrices using the API calls ModuleRead and ModuleWrite, respectively. Only the controller module uses a different API depending on the targeted communication protocol.

The controller module and the command_generator.cpp have been modified to support the new operator z. This operator represents a matrix multiplication where both operands are an index (range [0-9]) of pre-generated matrices. The pre-generated matrices are located in %PROJECT_ROOT%/import/matrix. The tutorial comes with 5 sets of different dimensions (10x10, 50x50, 300x300, 500x500 and 1000x1000), where each set contains 10 square matrices. Table 4.21 describes the files in a given set.

Table 4.21 Pre-generated matrix files#

File name      Description
-------------  --------------------------------------------------------------------------------
matrix.bin     Binary file containing 10 matrices (used to preload a memory)
matrix.c       Implementation file containing 10 matrices (to be compiled with application code)
matrix.h       Interface of matrix.c's opaque implementation
matrix_def.h   Definitions

Importing files into SpaceStudio is done by following these steps:

  1. Right-click on a component under the node Application components (e.g. controller)

  2. In the popup menu, click Import File…

  3. Navigate to the folder where the files are located and select them. Click Open, then OK.

In this tutorial, the attendee will be asked to use different sets of matrices of various sizes. A quick way to do this is to use a different solution for each set:

  1. Click on Solution

  2. From the dropdown menu, click on New Solution…

  3. Give a name for the new solution (e.g. matrix_50)

  4. Check the Based on existing solution box and select the previous solution

  5. Click OK

The new solution is an exact copy of the previous one. Now change the imported files to those found in %PROJECT_ROOT%/import/matrix/50x50:

  1. Expand the Application components node in Project explorer

  2. For the controller and matrix_mult :

    1. Expand the Import node, if applicable

    2. For each imported file, remove it if it is inside the %PROJECT_ROOT%/import/matrix/10x10 directory, then import the equivalent file found in %PROJECT_ROOT%/import/matrix/50x50

4.3.2. Communication performance#

In this section, we will compare different communication APIs and how they perform in terms of simulation time. Table 4.22 will be filled out throughout the tutorial in order to keep track of the obtained simulation times.

The command_generator.cpp (import of the input_reader module) is configured to loop through all operations, starting with matrix multiplication, and all operands. To keep simulation time low, do not fill the N/A fields in Table 4.22. Also, have matrix_mult stop the simulation after a single matrix multiplication. To do so, call sc_stop() after send_result() in matrix_mult.cpp, as shown in the sketch below.
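A minimal sketch of this change in the thread loop of Listing 4.2 (sc_stop() is the standard SystemC call that ends the simulation):

        read_matrix(matrix1);
        read_matrix(matrix2);
        multiply(matrix1, matrix2, result);
        send_result(result);
        sc_stop(); // stop the simulation after a single matrix multiplication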

Table 4.22 Simulation time for different communication APIs#

Matrix dimension    Hardware FIFO    Implicit DMA    Explicit DMA
------------------  ---------------  --------------  --------------
10x10
50x50
300x300             N/A
500x500             N/A
1000x1000           N/A

4.3.2.1. Hardware FIFO#

For the hardware FIFO, the controller module needs matrix.c, matrix.h and matrix_def.h, while matrix_mult needs matrix_def.h (refer to the import steps above).

The controller sends matrices using the API call ModuleWrite. The location of the matrices is defined by the interface (matrix.h). The result of the multiplication is read using the API call ModuleRead and stored in a member variable called m_matrix_result.

Note that the ModuleWrite and ModuleRead APIs need the number of elements (not bytes) to be transferred.
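As an illustrative sketch only, the controller-side calls could look like the following; the parameter order is an assumption (refer to the SpaceStudio API for the exact ModuleWrite and ModuleRead signatures), MATRIX_MULT0_ID is a hypothetical module identifier, matrix_a and matrix_b stand for the pre-generated operand matrices exposed by matrix.h, and m_matrix_result is the member variable mentioned above:

// Send both operand matrices to matrix_mult, then read back the result.
// Sizes are expressed in elements, not bytes; the parameter order is assumed.
ModuleWrite(MATRIX_MULT0_ID, SPACE_BLOCKING, matrix_a, MATRIX_ROWS*MATRIX_COLUMNS);
ModuleWrite(MATRIX_MULT0_ID, SPACE_BLOCKING, matrix_b, MATRIX_ROWS*MATRIX_COLUMNS);
ModuleRead(MATRIX_MULT0_ID, SPACE_BLOCKING, m_matrix_result, MATRIX_ROWS*MATRIX_COLUMNS);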

4.3.2.2. Implicit DMA transfer#

Implicit DMA transfer and the hardware FIFO share the same communication API (ModuleRead and ModuleWrite) and use the same imported files.

The only difference resides in the system designer’s decision to realize the communication channel using a DMA IP rather than a hardware fifo:

  1. Click on the search icon to manage communications

  2. Click the column Channel type for the first communication channel where the writer is controller and the reader is matrix_mult

  3. This will reveal a combo box that contains a list of possible items. From this list, select DMA

  4. Repeat for the second communication channel where the writer is matrix_mult and the reader is controller

4.3.2.3. Explicit DMA transfer#

Explicit DMA transfer differs from implicit DMA transfer with regard to the location of the data. In fact, the data is stored in an external memory and is not part of the application code. In this section, an external memory will be used to store the matrices, and a new communication API will be used to stream these matrices to the appropriate module.

For explicit DMA transfer, the controller and matrix_mult modules need matrix_def.h (refer to above steps).

  1. Open the diagram.

  2. Select the matrix_ram

  3. In the Properties view, make sure the matrix_ram has the following properties:

    Table 4.23 Matrix RAM properties#

    Property                 Value
    -----------------------  ------------------------------------------------
    Memory size              64MB
    Memory initialization    %PROJECT_ROOT%/import/matrix/10x10/ram_init.json

When the simulation starts, the matrix_ram will be initialized with the matrix.bin file, as dictated by ram_init.json.

In the controller module, replace the ModuleWrite to the matrix_mult module with Memory2Stream. Conversely, replace the ModuleRead from the matrix_mult module with Stream2Memory. For simplicity, we added a user-defined parameter to the controller module which allows you to easily switch from one API to another. To modify this user-defined parameter, select the controller instance, then head over to the Properties view and select the Parameter tab.
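The sketch below only illustrates the intent; the argument orders are assumptions modeled on the Memory2Memory example above, and MATRIX_RAM_ID, MATRIX_MULT0_ID and the offsets are hypothetical names (refer to the SpaceStudio API for the exact Memory2Stream and Stream2Memory signatures):

// Stream both operand matrices from matrix_ram to matrix_mult (memory to stream),
// then write the result back into matrix_ram (stream to memory).
// Sizes are in bytes, as in the Memory2Memory example; argument order is assumed.
const uint32_t matrix_bytes = MATRIX_ROWS*MATRIX_COLUMNS*sizeof(uint32_t);
Memory2Stream(MATRIX_RAM_ID, offset_matrix_a, MATRIX_MULT0_ID, SPACE_BLOCKING, matrix_bytes);
Memory2Stream(MATRIX_RAM_ID, offset_matrix_b, MATRIX_MULT0_ID, SPACE_BLOCKING, matrix_bytes);
Stream2Memory(MATRIX_MULT0_ID, MATRIX_RAM_ID, offset_result, SPACE_BLOCKING, matrix_bytes);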

4.3.3. Computation budget#

In the previous section, we focused on the different communication mechanisms available to the system designer, but we neglected the computation time required to perform the multiplication itself. In this section, we add computation timing to the matrix_mult module to better estimate the overall simulation time and appropriately perform the design space exploration loop.

Vitis HLS will be used to estimate the required timing budget. First, we need to add labels to the for-loops; refer to Listing 4.1 and Listing 4.2 .

Once that is done, we must perform the high-level synthesis.

Important

A valid Vitis HLS license is required to estimate the computation budget.

4.3.3.1. Configure Architecture Implementation#

To perform the high-level synthesis, we must ensure the Electronic Design Automation (EDA) tool and High-Level Synthesis (HLS) tool we will use are configured correctly.

Note

This configuration only needs to be done once per EDA/HLS toolchain after having installed SpaceStudio. This will enable the EDA and HLS tools and specify to SpaceStudio where they are installed.

Note

In this tutorial, the HLS tool used is Vitis HLS, and the EDA tool used is Vivado, both developed by Xilinx. We will not be using Vivado yet, however this configuration will be necessary in the Architecture Implementation tutorial.

To do so:

  1. Click on Tools from the menu item

  2. In the drop-down menu, click on Preferences…

  3. Expand the path SpaceStudio > EDA > Xilinx - Vivado 2025.2

  4. Make sure that the EDA checkbox (which enables the use of Vivado) and HLS checkbox (which enables the use of Vitis HLS) are checked, and that the Xilinx Vivado installation directory is correctly configured to the path the Xilinx tools were installed in.

  5. Click Apply and Close

4.3.3.2. Run High-Level Synthesis#

Now that the HLS and EDA tools are configured, we may run high-level synthesis.

  1. Click on Tools

  2. From the dropdown menu, click on Architecture Implementation…

  3. In the popup window, fill in these details:

    1. Project directory: The target directory you want the Vitis HLS project to be created in. This will be referred to as %target_dir%.

    2. Electronic Design Automation (EDA) tool: Xilinx - Vivado 2025.2

    3. Board: ZedBoard Zynq Evaluation and Development Kit

    4. High-level synthesis: Vitis HLS

  4. In the Modules to synthesize list, uncheck all modules except matrix_mult0.

  5. Click OK.

Architecture implementation starts by exporting the virtual platform to Vivado and Vitis HLS. The first step is the high-level synthesis of each previously selected module before creating the hardware platform. When it is done, this message will appear in the Build console:

High-level synthesis [Completed successfully.]

At this point, the stop button can be pressed since we don’t want to complete the architecture implementation here.

When the high-level synthesis is done, navigate to the folder %target_dir%/hls/matrix_mult0/matrix_mult0/solution1/syn/report, open the file matrix_mult0_thread_csynth.rpt, then inspect the sections named Performance Estimates and Area Estimates.

The section Performance Estimates presents the estimated performance. Vitis HLS names the loops according to the labels we added in Listing 4.1 . The Multiplication And Accumulation (MAC) operation of the inner loop (L3) takes 9 cycles (2 cycles to read both inputs, 1 cycle to write the output and 6 cycles to compute the operation), independently of the matrix dimension.

Back-annotating the timing in SpaceStudio's model is done using the function call hw_compute_latency(x). To avoid several SystemC context switches, it is recommended to perform only one call to hw_compute_latency(x). Using the 300x300 matrix set, without optimization, one MAC operation takes 9 cycles while the whole function takes 243180600 cycles (loop latency of L1). Instead of performing several hw_compute_latency(9) calls inside the for-loop L3, add a single hw_compute_latency(243180600) before returning from multiply().
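A minimal sketch of this back-annotation in multiply() follows; the 243180600-cycle figure is the L1 loop latency reported by Vitis HLS for the 300x300 case and is consistent with 300 x (300 x (300 x 9 + 2) + 2) cycles, i.e., 9 cycles per MAC plus roughly 2 cycles of loop overhead per nesting level (the exact breakdown may differ in your report):

void matrix_mult::multiply(uint32_t* in_1, uint32_t* in_2, uint32_t* out) {
    L1: for(int i=0;i<MATRIX_ROWS;i++)
        L2: for(int j=0;j<MATRIX_COLUMNS;j++) {
            out[i*MATRIX_ROWS+j]=0;
            L3: for(int k=0;k<MATRIX_COLUMNS;k++) {
                out[i*MATRIX_ROWS+j]+=in_1[i*MATRIX_ROWS+k]*in_2[k*MATRIX_ROWS+j];
            }
        }
    // A single back-annotation for the whole computation avoids repeated
    // SystemC context switches inside the inner loop.
    hw_compute_latency(243180600); // L1 loop latency reported by Vitis HLS (300x300, no optimization)
}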

The section Area Estimates presents the estimated resource utilization (LUT, flip-flop, DSP, etc.). Area usage is an important metric when evaluating different types of optimization because optimizing the IP normally requires more area.

The following sections present two concepts that exploit the parallelism between loop iterations.

4.3.3.3. Loop unrolling#

Vitis HLS supports loop unrolling via the #pragma HLS unroll directive. The inner loop (L3) can be performed in parallel by adding the pragma inside the for-loop L3. SpaceStudio passes the defined #pragma to the selected HLS tool.
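A sketch of where the directive goes, assuming the loop labels from Listing 4.1 are already in place (Vitis HLS expects the pragma inside the body of the loop to be unrolled):

L3: for(int k=0;k<MATRIX_COLUMNS;k++) {
    #pragma HLS unroll
    out[i*MATRIX_ROWS+j]+=in_1[i*MATRIX_ROWS+k]*in_2[k*MATRIX_ROWS+j];
}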

Repeat the steps presented above and compare the results in the Performance Estimates and Area Estimates sections with those obtained earlier (no optimization).

4.3.3.4. Loop pipelining#

In loop pipelining, the term initiation interval (II) describes the number of clock cycles required before the next iteration of the loop can start to process new data. Vitis HLS supports loop pipelining via #pragma HLS PIPELINE II=X, where X is the number of clock cycles before starting to process new data.

In the previous section, the MAC operation required 2 cycles (to read both inputs) before starting the computation; this corresponds to the initiation interval (II) value.
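A sketch of the pipelined inner loop follows; the II=2 target shown here is only an illustration based on the 2-cycle figure above (the achievable initiation interval is reported by Vitis HLS):

L3: for(int k=0;k<MATRIX_COLUMNS;k++) {
    #pragma HLS PIPELINE II=2
    out[i*MATRIX_ROWS+j]+=in_1[i*MATRIX_ROWS+k]*in_2[k*MATRIX_ROWS+j];
}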

Repeat the steps presented above and compare the results in the Performance Estimates and Area Estimates sections with those obtained earlier (no optimization).

4.4. Result files#

SpaceStudio Project