This post is an extension of “Creating a simple overlay for PYNQ-Z1 board from Vivado HLx“, which presented an Overlay creation methodology for an accelerator block. The implemented block only communicates with Zynq Processing System (PS) and does not explain the PYNQ peripherals management. This work was developed with the help of Wagner Wesner.
The “base” Overlay found inside the PYNQ package is composed of the basic structures needed to handle PYNQ functionalities. The Vivado project (built on Vivado 2016.1) used to develop the “base” Overlay can be reconstructed from the Tcl file and observed:
The content presented in this post was developed during the winter class given at Federal University of Rio Grande do Norte, with professors Carlos Valderrama and Samuel Xavier. My group was composed by Wagner Wesner and me.
Our group task was targeting Vivado HLS to implement accelerator blocks for the PYNQ-Z1 board. The PYNQ consists of a board with some peripherals and a ZYNQ chip, the ZYNQ has a cluster with a Central Processing Unit (CPU) and a Field-Programmable Gate Array (FPGA) which enables the test of the synthesized blocks on Vivado. Vivado outputs such as a bitstream and a Tcl file are used to create a PYNQ overlay. The overlay is further used to communicate the generated blocks with the PYNQ python interface.
The High-Level Synthesis (HLS) is very useful to transform complex algorithms into Hardware Description Language (HDL) code. There is a variety of algorithms which takes considerable CPU processing time, those algorithms can be translated to a hardware description which can be implemented on an FPGA. Once the circuit is configured on the FPGA, the algorithm time demanding tasks are parallelized (summing up), which increases performance and brings other potential benefits.
The Vivavo HLS software starts the PYNQ overlay creation with a custom block.
A Viterbi decoder uses the Viterbi algorithm for decoding a bitstream that was generated by a convolutional encoder, finding the most-likely sequence of hidden states from a sequence of observed events, in the context of hidden Markov models. In other words, in a communication system, for example, the transceiver encodes the desired bits to be transferred, encrypting and at the same time preparing the encoded bitstream for an unfortunate data change caused by the channel noise. The receiver decoder knows the state machine created by the encoder hardware, which can find the most-likely transmitted bits based on the most-likely path.
A convolutional encoder is built the same as presented in the video: https://www.youtube.com/watch?v=oI0MwwXJ8dE&t=640s
The input bits shifts at different times (clocks) on the three registers shown in the figure above. Depending on the bits available inside those registers, the system will deliver a pair of bits produced (in this case) by two xor gates connected like so. The constraint length (k) is the number of registers an input bit can influence on the encoded bits, in the presented case k=3. The code rate (R) is a relation upon the number of bits that enter the system and the number of bits leaving, leading to R=1/2.
The utilization of a Vector File makes controlling multiple input and output signals independently possible. Its implementation is important for a variety of circuits which need different pulses at specific times, relieving the effort to manage PWL voltage sources at schematic.
Note: comments begins with “;” character
radix 1 1 4 4
; radix specifies the number of ports and number of bits for each port,
; the base number is also specified ( from binary '1' to hexadecimal '4' )
io i i i i
; io defines the state of each port, as input "i", output "o"
; or bidirectional "b"
vname Primeiro_Sinal Segundo_Sinal Terceiro_Sinal[0:3] Quarto_Sinal,Quinto_Sinal,Sexto_Sinal,Setimo_Sinal
; vname sets the name of each port
; tunit configures the time unit of simulation
; trise defines the rise time of each vector
; tfall defines fall time of each vector
; vih defines logical '1' voltage
; vil defines logical '0' voltage
; period determines the time interval between each vector step, the
; number is multiplied by the previously time unit set
idelay 5 0 1 1 0
; idelay inserts a delay at the selected ports multiplied by the time unit
; The next lines will define the states of each port as the
; transient simulation happens, each line represents a time step
; The binary number can be represented with 0 or 1, and the
; hexadecimals with 0-F
0 0 0 0 ; All signals initially set to 0
1 0 F F ; First signal with a logical '1', second with a logical '0' , the other ones set to a logical '1'
0 1 C C ; And goes on
1 0 A A
1 1 3 3
The ports declaration and manipulation are indented by its position from left to right.
The Static Random Access Memory is widely used among digital circuits for its good trade-off between memory architectures. Being not large like register files, at the same time not slow as DRAM or hard disk management. It is also compatible with standard CMOS process.
The standard SRAM cell architecture is shown above, a pair of pMOS and nMOS transistors create cross-coupled inverters that can hold its state without needing to have its input driven by an extern signal.
The word line “WL” enables the bit lines “BL and “!BL” to drive the net “Q” to a logical value (either 0 or 1) and at the same time, opening the cell data to be read by a special circuitry.
Sizing those transistors correctly is required. If the inverter transistors “M1 to M4” are considerably stronger than “M5” and “M6” access transistors, the current passing through the transistors “M5 and M6” will not be enough to flip the state of the memory cell. The stable state created by the cross-coupled inverters needs to be overpowered to change its logical level. Moreover, the opposite situation with unbalanced strength (size) causes the logical state to be changed from bit lines noise even with the word lines disabled. The same problem can happen during the read operation which is going to be presented further.