2. Software acceleration with hardware co-processors
2.1. Required files
2.2. Goal
The aim of this tutorial is to accelerate an application with hardware co-processors using the SpaceStudio development environment. In this tutorial, the user will optimize a single-threaded algorithm with low performance into a multithreaded algorithm with higher performance.
2.3. Single-threaded application specification
Motion JPEG (MJPEG) is a video format composed of a series of JPEG images. Figure 2.25 outlines the MJPEG decoder application’s task. The MJPEG decoder reads the video stream from the input memory and writes the decoded images, in RGB format, to the video controller memory. To keep things simple, an external subsystem initializes the input memory with an MJPEG video.
Figure 2.25 MJPEG Application
2.4. Profiling the software application
Application profiling is performed to determine the time consumed per function. SpaceStudio’s monitoring feature (described in Tutorial 1) was used to determine the execution time of the application’s functions. Figure 2.26 and Table 2.1 show the share of execution time spent in each function.
Figure 2.26 MJPEG functions execution time

| Function | Execution time (%) |
|---|---|
| | 18.93% |
| | 17.15% |
| | 9.35% |
| | 8.46% |
| | 6.9% |
| | 6.01% |
| | 5.12% |
| Other functions | 28.08% |
2.5. Single-threaded MJPEG
As a base reference, we need to determine how the single-threaded algorithm performs on a specific architecture. The supplied project comes with a virtual platform composed of a MicroBlaze processor executing the MJPEG decoder algorithm, as presented in Figure 2.25. Follow these instructions to determine the decoding performance:
Open the SpaceStudio project
Open the microblaze architecture
Execute the simulation
What is the simulation time for processing two frames?
2.6. Multithreaded MJPEG
We would like to parallelize the algorithm to achieve better performance. Based on the results presented by Figure 2.26 and Table 2.1, we propose to accelerate two functions: operation_IDCT and calculate_output_pixel.
Hint
The function operation_IDCT is easier to parallelize.
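Before choosing what to offload, it can help to bound the gain that any single function can deliver. The following is a minimal sketch, assuming Amdahl's law applies: the offloaded function runs `s` times faster and everything else, including communication overhead, is unchanged. The helper names `amdahl_speedup` and `amdahl_limit` are illustrative and are not part of SpaceStudio.

```cpp
#include <cassert>
#include <cmath>

// Amdahl's law: overall speedup when a fraction `f` of the run time
// is accelerated by a factor `s` and the remaining time is unchanged.
// Assumption: communication overhead is ignored.
double amdahl_speedup(double f, double s) {
    return 1.0 / ((1.0 - f) + f / s);
}

// Upper bound on the speedup when the accelerated fraction costs nothing
// (the limit of amdahl_speedup as s grows without bound).
double amdahl_limit(double f) {
    return 1.0 / (1.0 - f);
}
```

For example, the largest single entry in Table 2.1 is 18.93%, so `amdahl_limit(0.1893)` gives about 1.23: under these assumptions, offloading any one function cannot speed up the decoder by more than roughly 23%. Real results depend on the communication cost, which this estimate leaves out.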
Suggested steps:
Create a new solution based on the current one. This will allow keeping the original code intact for further comparisons.
Create a new module (e.g., idct or calculate_output_pixel) in the new solution.
Add the new module to the validation architecture.
Modify the MJPEG and implement the new module (move the function’s body to the module’s thread)
Adapt the communication (inputs and outputs) of the function using the communication API
Compile, run and debug!
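The pattern behind the steps above (move a function body into its own thread and replace the direct call with a send and a receive) can be sketched in plain C++. This is only an illustration under stated assumptions: `Channel`, `process_block`, `module_thread`, `offload` and `run_demo` are hypothetical names, `Channel` is a stand-in built on the C++ standard library and is not the SpaceStudio communication API, and `process_block` is a placeholder that doubles each sample instead of performing the real IDCT.

```cpp
#include <array>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>

// Hypothetical blocking channel; NOT the SpaceStudio communication API.
template <typename T>
class Channel {
public:
    void write(const T& value) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            queue_.push(value);
        }
        ready_.notify_one();
    }
    T read() {  // blocks until a value is available
        std::unique_lock<std::mutex> lock(mutex_);
        ready_.wait(lock, [this] { return !queue_.empty(); });
        T value = queue_.front();
        queue_.pop();
        return value;
    }
private:
    std::mutex mutex_;
    std::condition_variable ready_;
    std::queue<T> queue_;
};

using Block = std::array<int, 64>;  // one 8x8 block of samples

// Placeholder for the body moved out of the decoder (e.g. operation_IDCT).
// The real transform is not reproduced; this just doubles each sample.
Block process_block(const Block& in) {
    Block out{};
    for (std::size_t i = 0; i < in.size(); ++i) out[i] = in[i] * 2;
    return out;
}

// Thread body of the new module: read a block, process it, write it back.
void module_thread(Channel<Block>& in, Channel<Block>& out, int block_count) {
    for (int i = 0; i < block_count; ++i) out.write(process_block(in.read()));
}

// Decoder side: the former direct call becomes a write followed by a read.
Block offload(Channel<Block>& to_module, Channel<Block>& from_module, const Block& b) {
    to_module.write(b);
    return from_module.read();
}

// Sends `block_count` blocks through the module and returns the last result.
Block run_demo(int block_count) {
    Channel<Block> to_module, from_module;
    std::thread worker(module_thread, std::ref(to_module), std::ref(from_module), block_count);
    Block last{};
    for (int n = 0; n < block_count; ++n) {
        Block b;
        b.fill(n + 1);
        last = offload(to_module, from_module, b);
    }
    worker.join();
    return last;
}
```

In this sketch the decoder waits for each result before sending the next block, so the two threads never overlap. That keeps functional verification simple, and it is also one reason a first co-processor version may show a smaller gain than expected.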
Now that you have completed a functional verification in the validation architecture, modify the microblaze architecture in the new solution (add the co-processor) and compare the simulation time.
What is the speedup?
How can we simulate the hardware co-processor’s execution time?
What can be done to accelerate the algorithm even more?
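For the speedup question, the usual definition is the ratio of the two simulation times measured for the same workload. The symbols below are notation introduced here for illustration: $T_{\text{single}}$ is the simulation time of the single-threaded solution on the microblaze architecture, and $T_{\text{coproc}}$ is the simulation time of the new solution with the co-processor, both for the same two frames.

```latex
S = \frac{T_{\text{single}}}{T_{\text{coproc}}}
```

A value of $S > 1$ means the co-processor version is faster. A value below 1 suggests that communication overhead outweighs the gain from offloading.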