The content presented in this post was developed during the winter class given at the Federal University of Rio Grande do Norte, with professors Carlos Valderrama and Samuel Xavier. My group consisted of Wagner Wesner and me.
Our group's task was to use Vivado HLS to implement accelerator blocks for the PYNQ-Z1 board. The PYNQ is a board with some peripherals and a ZYNQ chip. The ZYNQ combines a Central Processing Unit (CPU) and a Field-Programmable Gate Array (FPGA), which makes it possible to test the blocks synthesized in Vivado. Vivado outputs, namely a bitstream and a Tcl file, are used to create a PYNQ overlay. The overlay is then used to connect the generated blocks to the PYNQ Python interface.
High-Level Synthesis (HLS) is very useful for transforming complex algorithms into Hardware Description Language (HDL) code. Many algorithms take considerable CPU processing time; they can be translated into a hardware description and implemented on an FPGA. In short, once the circuit is configured on the FPGA, the algorithm's time-demanding tasks can run in parallel, which increases performance and brings other potential benefits.
The PYNQ overlay creation starts in Vivado HLS with a custom block.
This tutorial presents only a summary of the steps needed, to keep the post from getting too long, and assumes that the reader is somewhat familiar with the basic Vivado HLS steps. The version used is 2017.2.
When creating a new project, choose the PYNQ ZYNQ board part “xc7z020clg400-1”.
The accelerator block modeling is achieved by writing C++ code. The Vivado HLS interface needs both a source file indicating the block behavior and a test bench to observe the block outputs. A simple 32-bit adder block is constructed and tested as shown below.
After testing with “Run C Simulation,” select “Solution->C Synthesis->Active Solution” to synthesize the block for the first time and enable the directives tab for your code.
At this step, the directives tab should list the ports of the C function. All inputs and outputs must be set to the “s_axilite” interface, which makes communication between the CPU and the custom block easier once the block is imported into Vivado. For the presented block, the directives should be as shown in the picture below:
The picture below shows the “s_axilite” interface set to the port “a”:
After rerunning the synthesis, the block is ready to be exported to Vivado. To do so, select “Solution->Export RTL.” Vivado HLS will create a folder (“Explorer->Solution1->impl->ip”) containing all the data needed to import the block into Vivado.
Now it’s time to open Vivado and create a new project. Select “RTL project” and choose the same board part mentioned before, “xc7z020clg400-1”.
From the “Project Manager” tab, select “IP Catalog”; the IP Catalog window will open. Right-click inside the window and select “Add Repository.”
Browse to where the Vivado HLS block was synthesized and select the “ip” folder under “solution1->impl”. The block should now be available in the Vivado IP catalog.
On the “IP INTEGRATOR” tab, create a new block design by selecting “Create Block Design.” A new blank window will open. Right-click inside it and select “Add IP…”, then search for the generated block's name in the window that opens, as shown below:
After the block is instantiated, this is how it should look:
The next step is instantiating the Zynq processor system:
With both blocks placed, select “Run Block Automation” and “Run Connection Automation,” which appear at the top.
After the routing process, the schematic should look like this:
Note: I have added another block, “sub_hls”, to show how multiple blocks are handled. This other block implements a subtraction.
The next step is creating an HDL wrapper for the design. On the “Sources” tab next to the “Diagram” window, right-click on the “design_1” entry with “design_1.bd” attached (or something similar) and select “Create HDL wrapper.” Keep the option “Let Vivado manage wrapper and auto-update” selected. After the wrapper is created, select “Generate Bitstream” for the final processing step in Vivado.
The overlay files are ready to be exported. In the Tcl Console, type “write_bd_tcl filename” and the Tcl script will be generated. To export the bitstream, select “File->Export->Export Bitstream File…”. Give both files the same name so they can be used as a PYNQ overlay.
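Since the two exported files have to share a base name, a quick check before copying them to the board can catch a mismatch early. The following is a minimal sketch using only the Python standard library; the helper name `check_overlay_pair` and the placeholder file names are hypothetical and only serve the illustration.

```python
from pathlib import Path
import tempfile


def check_overlay_pair(bit_path):
    """Return the .tcl path next to bit_path, or raise if either file is missing."""
    bit = Path(bit_path)
    tcl = bit.with_suffix(".tcl")  # same base name, different extension
    missing = [str(p) for p in (bit, tcl) if not p.is_file()]
    if missing:
        raise FileNotFoundError("missing overlay file(s): " + ", ".join(missing))
    return tcl


# Demonstration with empty placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    bit_file = Path(tmp) / "add_sub.bit"
    bit_file.touch()
    (Path(tmp) / "add_sub.tcl").touch()
    pair_name = check_overlay_pair(bit_file).name

print("matching Tcl file:", pair_name)
```

On a real export, the function would be pointed at the bitstream written by Vivado instead of the placeholder files.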
Some information must be collected before running the overlay on the PYNQ board. On the “Address Editor” tab, “processing_system7_0” can be expanded to show the address assigned to each placed block. The overlay driver needs to know each block’s address, as in the example below:
Another important piece of information is the address of each signal in the block generated by Vivado HLS. To find the file with all the information needed, go to “Sources->Design Sources->design_1_wrapper->…” and go down the hierarchy until you find a file ending with “…io_s_axi” as shown below:
Double-click on it and look for the block of commented lines indicating each port's address:
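Reading those comments by eye works for a small block, but the offsets can also be pulled out with a few lines of Python. The sketch below assumes the comment layout Vivado HLS typically writes (`// 0x10 : Data signal of a`); the sample text is an illustrative excerpt written to match the offsets used later in this post, not a copy of the generated file, and `parse_offsets` is a hypothetical helper.

```python
import re

# Illustrative excerpt only: written to match the offsets used in this post,
# not copied from the generated "..io_s_axi" file.
SAMPLE_HEADER = """\
// 0x00 : Control signals
//        bit 0  - ap_start (Read/Write/COH)
//        bit 1  - ap_done (Read/COR)
// 0x10 : Data signal of a
//        bit 31~0 - a[31:0] (Read/Write)
// 0x18 : Data signal of b
//        bit 31~0 - b[31:0] (Read/Write)
// 0x20 : Data signal of y
//        bit 31~0 - y[31:0] (Read)
"""

# Matches comment lines that start with a hex offset followed by a colon.
OFFSET_LINE = re.compile(r"^//\s*(0x[0-9a-fA-F]+)\s*:\s*(.+?)\s*$")


def parse_offsets(text):
    """Map each register offset (int) to its one-line description."""
    offsets = {}
    for line in text.splitlines():
        match = OFFSET_LINE.match(line)
        if match:
            offsets[int(match.group(1), 16)] = match.group(2)
    return offsets


registers = parse_offsets(SAMPLE_HEADER)
for offset, description in sorted(registers.items()):
    print(hex(offset), description)
```

The per-bit lines are skipped on purpose, since only the register offsets are needed for the reads and writes on the board.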
With all files and information at hand, transfer the overlay files (‘.bit’ and ‘.tcl’) to the PYNQ board and open the board in a browser to access the Jupyter notebooks.
The Overlay class is used to download the created files to the PYNQ FPGA. The MMIO class is used to write and read data on the blocks of the design.
from pynq import Overlay
from pynq import MMIO
Indicate the path of the overlay files and download the overlay:
ol = Overlay("/home/xilinx/jupyter_notebooks/add_sub_overlay/add_sub.bit")
ol.download()
Instantiate both IPs by indicating their offset address and their size in memory (64K = 0x10000). Both values come from the Vivado Address Editor, as shown before.
add_ip = MMIO(0x43C00000, 0x10000)
sub_ip = MMIO(0x43C10000, 0x10000)
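The two address windows can be sanity-checked before touching the hardware. This is a minimal sketch in plain Python, assuming the base addresses and 64K ranges used in this post; `windows_overlap` and `check_offset` are hypothetical helper names, not part of PYNQ.

```python
# (base address, length) pairs taken from the MMIO calls in this post.
BLOCKS = {
    "add_ip": (0x43C00000, 0x10000),
    "sub_ip": (0x43C10000, 0x10000),
}


def windows_overlap(a, b):
    """True if two (base, length) windows share at least one address."""
    a_base, a_len = a
    b_base, b_len = b
    return a_base < b_base + b_len and b_base < a_base + a_len


def check_offset(length, offset):
    """Return offset if it is 32-bit aligned and inside a window of this length."""
    if offset % 4 != 0:
        raise ValueError("offset 0x%X is not 32-bit aligned" % offset)
    if not 0 <= offset <= length - 4:
        raise ValueError("offset 0x%X is outside the 0x%X-byte window" % (offset, length))
    return offset


overlap = windows_overlap(BLOCKS["add_ip"], BLOCKS["sub_ip"])
last_add_address = BLOCKS["add_ip"][0] + BLOCKS["add_ip"][1] - 1
print("windows overlap:", overlap)
print("add_ip ends at", hex(last_add_address))
```

The adder's window ends one byte before the subtractor's begins, so the two blocks sit back to back without sharing addresses.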
Write some values to the block ports, passing the port address (taken from the address info file shown before) and the value as parameters:
# port a
add_ip.write(0x10, 7)
print("add a:", add_ip.read(0x10))
# port b
add_ip.write(0x18, 12)
print("add b:", add_ip.read(0x18))
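The ports hold raw 32-bit words, so a negative operand or a negative result (for example from the subtractor) needs converting to and from the two's complement form. The sketch below assumes the HLS ports are signed 32-bit integers and that a register read returns the raw word as a non-negative integer; neither is confirmed by this post, and both helper names are hypothetical.

```python
def to_unsigned32(value):
    """Encode a Python int as a raw 32-bit two's complement word."""
    return value & 0xFFFFFFFF


def to_signed32(word):
    """Interpret a raw 32-bit word as a signed two's complement integer."""
    word &= 0xFFFFFFFF
    return word - 0x100000000 if word & 0x80000000 else word


# What a 32-bit subtractor would hold for 7 - 12, and how to read it back.
raw = to_unsigned32(7 - 12)
print("raw word:", hex(raw))
print("as signed:", to_signed32(raw))
```

Positive values pass through both helpers unchanged, so they can be applied to every read and write without special cases.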
It is important to look at the generated block's control signals. The “start” bit must be set to a logical ‘1’ so the block starts its calculation:
# ap_start bit
add_ip.write(0x00, 1)
The adder IP finishes its job in just one clock cycle, so there is no need to poll the “done” or “ready” bit to see whether it has finished.
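For a block that takes longer than the adder, the “done” bit would have to be polled after setting “start”. Here is a minimal sketch of such a loop, assuming the usual Vivado HLS control register layout (ap_start in bit 0 and ap_done in bit 1 of offset 0x00); the check should be made against the address info of the actual block. `run_and_wait` and `FakeSlowBlock` are hypothetical names, and the fake class only stands in for an MMIO object so the loop can be tried off the board.

```python
import time

AP_CTRL = 0x00     # control register offset used in this post
AP_START = 1 << 0
AP_DONE = 1 << 1   # assumption: standard Vivado HLS control layout


def run_and_wait(mmio, timeout_s=1.0, poll_s=0.001):
    """Set ap_start, then poll ap_done until it is set or the timeout expires."""
    mmio.write(AP_CTRL, AP_START)
    deadline = time.monotonic() + timeout_s
    while True:
        # Read once per iteration and act on that value.
        if mmio.read(AP_CTRL) & AP_DONE:
            return
        if time.monotonic() >= deadline:
            raise TimeoutError("block did not signal ap_done")
        time.sleep(poll_s)


class FakeSlowBlock:
    """Stand-in for an MMIO object: reports ap_done on the third status read."""

    def __init__(self):
        self.reads = 0

    def write(self, offset, value):
        self.reads = 0  # a new start resets the fake computation

    def read(self, offset):
        self.reads += 1
        return AP_DONE if self.reads >= 3 else 0


block = FakeSlowBlock()
run_and_wait(block)
polls = block.reads
print("status reads needed:", polls)
```

On the board, the same function would be called with the real MMIO instance in place of the fake one. The timeout keeps a notebook cell from hanging if the block never finishes.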
To get the result, read the output port at its address:
# port y
print("add y:", add_ip.read(0x20))
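The write, start, and read steps above can be bundled into a small driver class so the notebook does not repeat raw offsets. This is a sketch rather than the original notebook code: `AdderDriver` and `FakeAdder` are hypothetical names, the offsets are the ones used in this post, and the fake object only mimics the adder's registers so the example runs without the board.

```python
class AdderDriver:
    """Thin wrapper around any object offering write(offset, value) and read(offset)."""

    CTRL, PORT_A, PORT_B, PORT_Y = 0x00, 0x10, 0x18, 0x20

    def __init__(self, mmio):
        self.mmio = mmio

    def add(self, a, b):
        self.mmio.write(self.PORT_A, a)
        self.mmio.write(self.PORT_B, b)
        self.mmio.write(self.CTRL, 1)  # set the ap_start bit
        return self.mmio.read(self.PORT_Y)


class FakeAdder:
    """Off-board stand-in that mimics the adder's register behaviour."""

    def __init__(self):
        self.regs = {0x10: 0, 0x18: 0, 0x20: 0}

    def write(self, offset, value):
        if offset == 0x00 and value & 1:
            # Starting the block computes y = a + b, truncated to 32 bits.
            self.regs[0x20] = (self.regs[0x10] + self.regs[0x18]) & 0xFFFFFFFF
        else:
            self.regs[offset] = value & 0xFFFFFFFF

    def read(self, offset):
        return self.regs.get(offset, 0)


adder = AdderDriver(FakeAdder())
result = adder.add(7, 12)
print("add y:", result)
```

On the PYNQ, the driver would wrap the real object instead, for example `AdderDriver(add_ip)`.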
The results for both IPs are shown below in a Jupyter notebook on the PYNQ:
The presented methodology can be incorporated into the PYNQ base overlay as an extension of it.
The source files can be found in:
For another project implementation, check out this prime number calculator made by Wagner: