PYNQ-Z1 peripherals control with an Overlay created from Vivado

This post is an extension of “Creating a simple overlay for PYNQ-Z1 board from Vivado HLx“, which presented an Overlay creation methodology for an accelerator block. The implemented block only communicates with Zynq Processing System (PS) and does not explain the PYNQ peripherals management. This work was developed with the help of Wagner Wesner.

Continue reading

Creating a simple overlay for PYNQ-Z1 board from Vivado HLx

The content presented in this post was developed during the winter class given at Federal University of Rio Grande do Norte, with professors Carlos Valderrama and Samuel Xavier. My group was composed by Wagner Wesner and me.

Our group task was targeting Vivado HLS to implement accelerator blocks for the PYNQ-Z1 board. The PYNQ consists of a board with some peripherals and a ZYNQ chip, the ZYNQ has a cluster with a Central Processing Unit (CPU) and a Field-Programmable Gate Array (FPGA) which enables the test of the synthesized blocks on Vivado. Vivado outputs such as a bitstream and a Tcl file are used to create a PYNQ overlay. The overlay is further used to communicate the generated blocks with the PYNQ python interface.

The High-Level Synthesis (HLS) is very useful to transform complex algorithms into Hardware Description Language (HDL) code. There is a variety of algorithms which takes considerable CPU processing time, those algorithms can be translated to a hardware description which can be implemented on an FPGA. Once the circuit is configured on the FPGA, the algorithm time demanding tasks are parallelized (summing up), which increases performance and brings other potential benefits.

The Vivavo HLS software starts the PYNQ overlay creation with a custom block.

vivado_hls_wel.png

This tutorial will present the sum of the steps needed to not create a really extended post, also assuming that the reader is a little familiarized with the basic Vivado HLS steps. The version used is 2017.2.

When creating a new project, choose the PYNQ ZYNQ board part “xc7z020clg400-1”.

The accelerator block modeling is achieved by writing C++ code. The Vivado HLS interface needs both a source file indicating the block behavior and a test bench to observe the block outputs. A simple 32-bit adder block is constructed and tested as shown below.

add_source_tb.png

After testing with “run C simulation,” do “Solution->C Synthesis->Active Solution” to create the first synthesis blocks and enable the directives tab on your code.

At this step, the directives should list the ports from the C function. It is needed to set all inputs and outputs to a “s_axilite” interface, this is required to make the communication between CPU and the custom block easier when it is imported at Vivado. For the presented block, the directives should be as shown in the picture below:

add_directive

The picture below shows the “s_axilite” interface set to the port “a”:

bundle_io

By rerunning the block synthesis, the block should be ready to be exported for Vivado. To do so, “Solution->Export RTL.” Vivado HLS will create a folder (“Explorer->Solution1->impl->ip”) which will contain all the data needed for the block to be imported at Vivado.

vivado_hlx

Now it’s time to open Vivado and create a new project. Select “RTL project” and choose the same board mentioned before “xc7z020clg400-1”.

From the “Project Manager” tab, select “IP Catalog,” the IP Catalog window will open. Right-click inside the window and select “Add Repository.”

repositories.png

Search where the Vivado HLS block was synthesized and select the “IP” folder under “solution1->impl”. The block now should be available on the Vivado IP catalog.

On the “IP INTEGRATOR” tab, create a new block design by selecting “Create Block Design.” It will open a new blank window. Right click inside and select “Add IP…”, search for the generated block name inside a window that will open as shown below:

Test_add

After the block is instantiated, this is how it should look:

test_add_block

The next step is instantiating the Zynq processor system:

ZYNQ_search

zynq_block

With both blocks placed, just select “Run Block Automation” and “Run Connection Automation” which appears on the top.

block_automation

After the routing process, the schematic should look like this:

block_beautiful.png

Note: I have added another block “sub_hls”, to show multiple blocks managing. The other block implements a subtraction.

The next step is creating an HDL wrapper to the design. On the “Sources” tab viewed from the “Diagram” window, right-click on the “design_1” with “design_1.bd” attached (or something similar) and select “Create HDL wrapper.” Keep the option “Let Vivado manage wrapper and auto-update” selected. After the wrapper is created, select “Generate Bitstream” for the final processing step on Vivado.

The overlay files are ready to be exported. On the Tcl Console type “write_bd_tcl filename” and the Tcl script will be generated. To export the bitstream, do “File->Export->Export Bitstream File…”, put the same name for both files to be used as a PYNQ overlay.

add_sub_overlay

There is some data needed to be acquired before running the overlay on the PYNQ board. On the “Address Editor” tab, the “processing_system7_0” can be expanded to show the address attributed to each placed block. It is necessary for the overlay driver to know the block’s address. As the example below:

address_editor

Another important information to be listed is the address for each signal on the Vivado HLS generated block. To check the files with all the information needed, do “Sources->Design Sources->design_1_wrapper->…” and go down the hierarchy until you find a file ending with “…io_s_axi” as shown below:

axi_file.png

Double click on it and search for a bunch of commented lines indicating each port address:

ports_info

With all files and information required, transfer the overlay data (‘.bit’ and ‘.tcl’) to the PYNQ board and open it on a browser to visualize the Jupyter notebooks.

The Overlay class is used to download the created files on the PYNQ FPGA. MMIO class is used to write and read data on the blocks of the schematic.

from pynq import Overlay
from pynq import MMIO

Indicate the path of the Overlay files and download it:


ol = Overlay("/home/xilinx/jupyter_notebooks/add_sub_overlay/add_sub.bit")
ol.download()

Instantiate both IPs by indicating their Offset address and their size on memory (64k = 0x10000). Both data got from Vivado interface as shown before.


add_ip = MMIO(0x43C00000,0x10000)

sub_ip = MMIO(0x43C10000,0x10000)

Write some value on the block ports, passing the address (got from the Address info file shown before) and the value as parameters:


#port a
add_ip.write(0x10,7)
print("add a:",add_ip.read(0x10))
#port b
add_ip.write(0x18,12)
print("add b:",add_ip.read(0x18))

It is important to analyze the created block signals. The “start” bit needs to be activated with a logical ‘1’ so the block can start its calculation:


#ap_start bit
add_ip.write(0x00,1)

The adder IP finishes its job with just one clock cycle, so there is no need to keep checking the “done” or “ready” bit to see if it has finished.

To get the result, read the output port with the current address:


#port y
print("add y:",add_ip.read(0x20))

The result for both IPS are listed below on a Jupyter notebook on PYNQ:

result

The presented methodology can be incorporated on the base overlay from PYNQ to be an extension of it.

The source files can be found in:

https://github.com/YangTavares/PYNQ/tree/master/custom_overlays/add_sub_overlay

For another project implementation, check this prime number calculator made from Wagner:

https://github.com/wagnerrn/pynq-primenumber

 

 

 

A Viterbi Decoder python implementation

A Viterbi decoder uses the Viterbi algorithm for decoding a bitstream that was generated by a convolutional encoder, finding the most-likely sequence of hidden states from a sequence of observed events, in the context of hidden Markov models. In other words, in a communication system, for example, the transceiver encodes the desired bits to be transferred, encrypting and at the same time preparing the encoded bitstream for an unfortunate data change caused by the channel noise. The receiver decoder knows the state machine created by the encoder hardware, which can find the most-likely transmitted bits based on the most-likely path.

A convolutional encoder is built the same as presented in the video: https://www.youtube.com/watch?v=oI0MwwXJ8dE&t=640s

viterbi_encoder

The input bits shifts at different times (clocks) on the three registers shown in the figure above. Depending on the bits available inside those registers, the system will deliver a pair of bits produced (in this case) by two xor gates connected like so. The constraint length (k) is the number of registers an input bit can influence on the encoded bits, in the presented case k=3. The code rate (R) is a relation upon the number of bits that enter the system and the number of bits leaving, leading to R=1/2.

A state machine can be built to describe the encoding system. The red boxes represent the states; a state transition happens when an input bit shifts on the registers. When a bit is shifted, not only the state changes but also another encoded bits are outputted.

st_machine

The above state machine represents the snapshots of the described system. Each arrow has a pair of numbers “x/yy” where x is the bit inserted, and yy is the encoded bits.

inputs = ("1","0","0","1","1")
def v_xor(bit0,bit1):
    if(bit0==bit1):
        return "0"
    else:
        return "1"
def viterbi_encoder(inputs):
    #shift register encoder
    s_reg = ["0","0","0"]
    obs = []
    for t in range (0,len(inputs)):
        #shifting the bits to right
        s_reg[2]=s_reg[1]
        s_reg[1]=s_reg[0]
        #inserting input
        s_reg[0]= inputs[t]
        state = s_reg[0]+ s_reg[1]
        obs.append([])
        #encoded bits
        obs[t] = v_xor(v_xor(s_reg[0],s_reg[1]),s_reg[2])+\
            v_xor(s_reg[0],s_reg[2])
        print s_reg,state
    print obs

viterbi_encoder(inputs)

The above code represents the implementation of a convolutional encoder following the previous logic. The system receives input bits with:

inputs = ("1","0","0","1","1")
viterbi_encoder(inputs)

Shift inserted bit for each time step:

for t in range (0,len(inputs)):
    #shifting the bits to right
    s_reg[2]=s_reg[1]
    s_reg[1]=s_reg[0]
    #inserting input
    s_reg[0]= inputs[t]

At the same time step, calculate encoded bits with a representation of xor gate:

def v_xor(bit0,bit1):
    if(bit0==bit1):
        return "0"
    else:
        return "1"

And their interconnection at the shift registers outputs:

obs[t] = v_xor(v_xor(s_reg[0],s_reg[1]),s_reg[2])+\
    v_xor(s_reg[0],s_reg[2])

Code output:

# shift_register / current state
[‘1’, ‘0’, ‘0’] 10
[‘0’, ‘1’, ‘0’] 01
[‘0’, ‘0’, ‘1’] 00
[‘1’, ‘0’, ‘0’] 10
[‘1’, ‘1’, ‘0’] 11

# encoded bits
[’11’, ’10’, ’11’, ’11’, ’01’]

The Viterbi algorithm is applied to decode the generated vector of bits. The bits transmitted are decoded based on the state machine analysis. Each machine state is expected to have a direction represented by arrows in the second figure, for example, the state “00” can’t jump to state “11”, as states “10” or “01” are not able to stay for two time steps. A Trellis diagram can be used to represent the possible paths between each state for each time step. This representation helps to visualize the code better.

The Trellis diagram below is based in: https://www.youtube.com/watch?v=iPh-HKZyWN4

trellis

Its structure consists of nodes referring to our machine states, each node interconnects with two possible branches representing possible state machine transitions, for each time step a calculation is performed based on the number of different bits between the observed encoded bits and branch expected bits. The result is a branch metric that points which is the most-likely path to follow.

trellis_ex.png

The above example shows the decoding operation for the pairs “11” and “10”. The blue number represents each node metric, at t=0 they are all set to 0. When t=1, each node verifies upcoming branches and their metrics, branches with bigger metric between two possibilities are eliminated. The remaining branch metric is incorporated to the node for further calculations. A branch is chosen randomly If two have the same metric. For t=2, same calculations are performed.

Since there are only two observations, the Viterbi algorithm trace-back begins. Firstly choosing the node with smaller metric, circled by yellow. Then, following the branches that aren’t eliminated and retrieving their input bit attribute also circled by yellow. For these two observations, the most-likely input bits are “10”. This approach brings an issue related to the trace-back mechanism because reading backward will deliver the input bits with inverted indexes. In this case, for example, the first output got is “01”. Also, all metrics need to be calculated before having an output bitstream.

obs = ("11","10","11","11","01","01","11")
start_metric = {'zero':0,'one': 0, 'two': 0,'three':0}
state_machine = {
    #current state, possible branches, branch information
    'zero': {'b1': {'out_b':"11",'prev_st': 'one','input_b':0},
             'b2': {'out_b':"00",'prev_st': 'zero','input_b':0}},
    'one': {'b1': {'out_b': "01", 'prev_st': 'three', 'input_b': 0},
             'b2': {'out_b': "10", 'prev_st': 'two', 'input_b': 0}},
    'two': {'b1': {'out_b': "11", 'prev_st': 'zero', 'input_b': 1},
             'b2': {'out_b': "00", 'prev_st': 'one', 'input_b': 1}},
    'three': {'b1': {'out_b': "10", 'prev_st': 'three', 'input_b': 1},
             'b2': {'out_b': "01", 'prev_st': 'two', 'input_b': 1}},

}

def bits_diff_num(num_1,num_2):
    count=0;
    for i in range(0,len(num_1),1):
        if num_1[i]!=num_2[i]:
            count=count+1
    return count

def viterbi(obs, start_metric, state_machine):
    #Trellis structure
    V = [{}]
    for st in state_machine:
        # Calculating the probability of both initial possibilities for the first observation
        V[0][st] = {"metric": start_metric[st]}
    #for t>0
    for t in range(1, len(obs)+1):
        V.append({})
        for st in state_machine:
            #Check for smallest bit difference from possible previous paths, adding with previous metric
            prev_st = state_machine[st]['b1']['prev_st']
            first_b_metric = V[(t-1)][prev_st]["metric"] + bits_diff_num(state_machine[st]['b1']['out_b'], obs[t - 1])
            prev_st = state_machine[st]['b2']['prev_st']
            second_b_metric = V[(t - 1)][prev_st]["metric"] + bits_diff_num(state_machine[st]['b2']['out_b'], obs[t - 1])
            #print(state_machine[st]['b1']['out_b'],obs[t - 1],t)
            if first_b_metric > second_b_metric:
                V[t][st] = {"metric" : second_b_metric,"branch":'b2'}
            else:
                V[t][st] = {"metric": first_b_metric, "branch": 'b1'}

    #print trellis nodes metric:
    for st in state_machine:
        for t in range(0,len(V)):
            print V[t][st]["metric"],
        print ""
    print ""

    smaller = min(V[t][st]["metric"] for st in state_machine)
    #traceback the path on smaller metric on last trellis column
    for st in state_machine:
        if V[len(obs)-1][st]["metric"] == smaller:
            source_state = st
            for t in range(len(obs),0,-1):
                branch = V[t][source_state]["branch"]
                print(state_machine[source_state][branch]['input_b']),
                source_state = state_machine[source_state][branch]['prev_st']
            print (source_state+"\n")
    print("Finish")

viterbi(obs,
        start_metric,
        state_machine)

The algorithm is implemented with Python as shown above. The Viterbi decoder takes as input the encoder state machine, start metric for t=0 and the observations:

obs = ("11","10","11","11","01","01","11")
start_metric = {'zero':0,'one': 0, 'two': 0,'three':0}
state_machine = {
    #current state, possible branches, branch information
    'zero': {'b1': {'out_b':"11",'prev_st': 'one','input_b':0},
             'b2': {'out_b':"00",'prev_st': 'zero','input_b':0}},
    'one': {'b1': {'out_b': "01", 'prev_st': 'three', 'input_b': 0},
             'b2': {'out_b': "10", 'prev_st': 'two', 'input_b': 0}},
    'two': {'b1': {'out_b': "11", 'prev_st': 'zero', 'input_b': 1},
             'b2': {'out_b': "00", 'prev_st': 'one', 'input_b': 1}},
    'three': {'b1': {'out_b': "10", 'prev_st': 'three', 'input_b': 1},
             'b2': {'out_b': "01", 'prev_st': 'two', 'input_b': 1}},

}

Starting with all metrics = 0  means the system has equal probability to have begun with any of four states. If the application specifies that the first state always starts at zero (“00”) for example, a short approach is raising the other state metrics. By knowing the beginning state, the decoder has less probability of getting a wrong bitstream from the noisy channel.

The vector “V” represents the Trellis graph, which loads primary metrics for t=0:

V = [{}]
for st in state_machine:
    # Calculating the probability of both initial possibilities for the first observation
    V[0][st] = {"metric": start_metric[st]}

After, the Trellis metrics and optimal branches are calculated:

#for t>0
for t in range(1, len(obs)+1):
    V.append({})
    for st in state_machine:
        #Check for smallest bit difference from possible previous paths, adding with previous metric
        prev_st = state_machine[st]['b1']['prev_st']
        first_b_metric = V[(t-1)][prev_st]["metric"] + bits_diff_num(state_machine[st]['b1']['out_b'], obs[t - 1])
        prev_st = state_machine[st]['b2']['prev_st']
        second_b_metric = V[(t - 1)][prev_st]["metric"] + bits_diff_num(state_machine[st]['b2']['out_b'], obs[t - 1])
        #print(state_machine[st]['b1']['out_b'],obs[t - 1],t)
        if first_b_metric > second_b_metric:
            V[t][st] = {"metric" : second_b_metric,"branch":'b2'}
        else:
            V[t][st] = {"metric": first_b_metric, "branch": 'b1'}

With all paths and metric set, the trace-back takes place from the smaller metric node:

smaller = min(V[t][st]["metric"] for st in state_machine)
#traceback the path on smaller metric on last trellis column
for st in state_machine:
    if V[len(obs)-1][st]["metric"] == smaller:
        source_state = st
        for t in range(len(obs),0,-1):
            branch = V[t][source_state]["branch"]
            print(state_machine[source_state][branch]['input_b']),
            source_state = state_machine[source_state][branch]['prev_st']
        print (source_state+"\n")

Code output:

#Trellis nodes metric, t=0, t=1 …
0 0 1 0 2 3 3 0
0 1 1 2 2 0 2 3
0 0 1 1 0 3 3 2
0 1 0 2 2 2 0 3

#Inverted input bits encoded and decoded
0 1 1 1 0 0 1 zero

Finish