About yangtavares

I am doing Bachelor of Science in Computer Engineering at the Federal University of Rio Grande do Norte (Brazil) in cooperation with the California State University, East Bay, where I participated in Science without Borders government exchange program. Currently participating on scientific initiation by the Instrumentation and Microelectronics Laboratory (LIME) at Center for Research and Innovation in Information Technology.

PYNQ-Z1 peripherals control with an Overlay created from Vivado

This post is an extension of “Creating a simple overlay for PYNQ-Z1 board from Vivado HLx“, which presented an Overlay creation methodology for an accelerator block. The implemented block only communicates with Zynq Processing System (PS) and does not explain the PYNQ peripherals management. This work was developed with the help of Wagner Wesner.

Continue reading

Creating a simple overlay for PYNQ-Z1 board from Vivado HLx

The content presented in this post was developed during the winter class given at Federal University of Rio Grande do Norte, with professors Carlos Valderrama and Samuel Xavier. My group was composed by Wagner Wesner and me.

Our group task was targeting Vivado HLS to implement accelerator blocks for the PYNQ-Z1 board. The PYNQ consists of a board with some peripherals and a ZYNQ chip, the ZYNQ has a cluster with a Central Processing Unit (CPU) and a Field-Programmable Gate Array (FPGA) which enables the test of the synthesized blocks on Vivado. Vivado outputs such as a bitstream and a Tcl file are used to create a PYNQ overlay. The overlay is further used to communicate the generated blocks with the PYNQ python interface.

The High-Level Synthesis (HLS) is very useful to transform complex algorithms into Hardware Description Language (HDL) code. There is a variety of algorithms which takes considerable CPU processing time, those algorithms can be translated to a hardware description which can be implemented on an FPGA. Once the circuit is configured on the FPGA, the algorithm time demanding tasks are parallelized (summing up), which increases performance and brings other potential benefits.

The Vivavo HLS software starts the PYNQ overlay creation with a custom block.

vivado_hls_wel.png

This tutorial will present the sum of the steps needed to not create a really extended post, also assuming that the reader is a little familiarized with the basic Vivado HLS steps. The version used is 2017.2.

When creating a new project, choose the PYNQ ZYNQ board part “xc7z020clg400-1”.

The accelerator block modeling is achieved by writing C++ code. The Vivado HLS interface needs both a source file indicating the block behavior and a test bench to observe the block outputs. A simple 32-bit adder block is constructed and tested as shown below.

add_source_tb.png

After testing with “run C simulation,” do “Solution->C Synthesis->Active Solution” to create the first synthesis blocks and enable the directives tab on your code.

At this step, the directives should list the ports from the C function. It is needed to set all inputs and outputs to a “s_axilite” interface, this is required to make the communication between CPU and the custom block easier when it is imported at Vivado. For the presented block, the directives should be as shown in the picture below:

add_directive

The picture below shows the “s_axilite” interface set to the port “a”:

bundle_io

By rerunning the block synthesis, the block should be ready to be exported for Vivado. To do so, “Solution->Export RTL.” Vivado HLS will create a folder (“Explorer->Solution1->impl->ip”) which will contain all the data needed for the block to be imported at Vivado.

vivado_hlx

Now it’s time to open Vivado and create a new project. Select “RTL project” and choose the same board mentioned before “xc7z020clg400-1”.

From the “Project Manager” tab, select “IP Catalog,” the IP Catalog window will open. Right-click inside the window and select “Add Repository.”

repositories.png

Search where the Vivado HLS block was synthesized and select the “IP” folder under “solution1->impl”. The block now should be available on the Vivado IP catalog.

On the “IP INTEGRATOR” tab, create a new block design by selecting “Create Block Design.” It will open a new blank window. Right click inside and select “Add IP…”, search for the generated block name inside a window that will open as shown below:

Test_add

After the block is instantiated, this is how it should look:

test_add_block

The next step is instantiating the Zynq processor system:

ZYNQ_search

zynq_block

With both blocks placed, just select “Run Block Automation” and “Run Connection Automation” which appears on the top.

block_automation

After the routing process, the schematic should look like this:

block_beautiful.png

Note: I have added another block “sub_hls”, to show multiple blocks managing. The other block implements a subtraction.

The next step is creating an HDL wrapper to the design. On the “Sources” tab viewed from the “Diagram” window, right-click on the “design_1” with “design_1.bd” attached (or something similar) and select “Create HDL wrapper.” Keep the option “Let Vivado manage wrapper and auto-update” selected. After the wrapper is created, select “Generate Bitstream” for the final processing step on Vivado.

The overlay files are ready to be exported. On the Tcl Console type “write_bd_tcl filename” and the Tcl script will be generated. To export the bitstream, do “File->Export->Export Bitstream File…”, put the same name for both files to be used as a PYNQ overlay.

add_sub_overlay

There is some data needed to be acquired before running the overlay on the PYNQ board. On the “Address Editor” tab, the “processing_system7_0” can be expanded to show the address attributed to each placed block. It is necessary for the overlay driver to know the block’s address. As the example below:

address_editor

Another important information to be listed is the address for each signal on the Vivado HLS generated block. To check the files with all the information needed, do “Sources->Design Sources->design_1_wrapper->…” and go down the hierarchy until you find a file ending with “…io_s_axi” as shown below:

axi_file.png

Double click on it and search for a bunch of commented lines indicating each port address:

ports_info

With all files and information required, transfer the overlay data (‘.bit’ and ‘.tcl’) to the PYNQ board and open it on a browser to visualize the Jupyter notebooks.

The Overlay class is used to download the created files on the PYNQ FPGA. MMIO class is used to write and read data on the blocks of the schematic.

from pynq import Overlay
from pynq import MMIO

Indicate the path of the Overlay files and download it:


ol = Overlay("/home/xilinx/jupyter_notebooks/add_sub_overlay/add_sub.bit")
ol.download()

Instantiate both IPs by indicating their Offset address and their size on memory (64k = 0x10000). Both data got from Vivado interface as shown before.


add_ip = MMIO(0x43C00000,0x10000)

sub_ip = MMIO(0x43C10000,0x10000)

Write some value on the block ports, passing the address (got from the Address info file shown before) and the value as parameters:


#port a
add_ip.write(0x10,7)
print("add a:",add_ip.read(0x10))
#port b
add_ip.write(0x18,12)
print("add b:",add_ip.read(0x18))

It is important to analyze the created block signals. The “start” bit needs to be activated with a logical ‘1’ so the block can start its calculation:


#ap_start bit
add_ip.write(0x00,1)

The adder IP finishes its job with just one clock cycle, so there is no need to keep checking the “done” or “ready” bit to see if it has finished.

To get the result, read the output port with the current address:


#port y
print("add y:",add_ip.read(0x20))

The result for both IPS are listed below on a Jupyter notebook on PYNQ:

result

The presented methodology can be incorporated on the base overlay from PYNQ to be an extension of it.

The source files can be found in:

https://github.com/YangTavares/PYNQ/tree/master/custom_overlays/add_sub_overlay

For another project implementation, check this prime number calculator made from Wagner:

https://github.com/wagnerrn/pynq-primenumber

 

 

 

A Viterbi Decoder python implementation

A Viterbi decoder uses the Viterbi algorithm for decoding a bitstream that was generated by a convolutional encoder, finding the most-likely sequence of hidden states from a sequence of observed events, in the context of hidden Markov models. In other words, in a communication system, for example, the transceiver encodes the desired bits to be transferred, encrypting and at the same time preparing the encoded bitstream for an unfortunate data change caused by the channel noise. The receiver decoder knows the state machine created by the encoder hardware, which can find the most-likely transmitted bits based on the most-likely path.

A convolutional encoder is built the same as presented in the video: https://www.youtube.com/watch?v=oI0MwwXJ8dE&t=640s

viterbi_encoder

The input bits shifts at different times (clocks) on the three registers shown in the figure above. Depending on the bits available inside those registers, the system will deliver a pair of bits produced (in this case) by two xor gates connected like so. The constraint length (k) is the number of registers an input bit can influence on the encoded bits, in the presented case k=3. The code rate (R) is a relation upon the number of bits that enter the system and the number of bits leaving, leading to R=1/2.

A state machine can be built to describe the encoding system. The red boxes represent the states; a state transition happens when an input bit shifts on the registers. When a bit is shifted, not only the state changes but also another encoded bits are outputted.

st_machine

The above state machine represents the snapshots of the described system. Each arrow has a pair of numbers “x/yy” where x is the bit inserted, and yy is the encoded bits.

inputs = ("1","0","0","1","1")
def v_xor(bit0,bit1):
    if(bit0==bit1):
        return "0"
    else:
        return "1"
def viterbi_encoder(inputs):
    #shift register encoder
    s_reg = ["0","0","0"]
    obs = []
    for t in range (0,len(inputs)):
        #shifting the bits to right
        s_reg[2]=s_reg[1]
        s_reg[1]=s_reg[0]
        #inserting input
        s_reg[0]= inputs[t]
        state = s_reg[0]+ s_reg[1]
        obs.append([])
        #encoded bits
        obs[t] = v_xor(v_xor(s_reg[0],s_reg[1]),s_reg[2])+\
            v_xor(s_reg[0],s_reg[2])
        print s_reg,state
    print obs

viterbi_encoder(inputs)

The above code represents the implementation of a convolutional encoder following the previous logic. The system receives input bits with:

inputs = ("1","0","0","1","1")
viterbi_encoder(inputs)

Shift inserted bit for each time step:

for t in range (0,len(inputs)):
    #shifting the bits to right
    s_reg[2]=s_reg[1]
    s_reg[1]=s_reg[0]
    #inserting input
    s_reg[0]= inputs[t]

At the same time step, calculate encoded bits with a representation of xor gate:

def v_xor(bit0,bit1):
    if(bit0==bit1):
        return "0"
    else:
        return "1"

And their interconnection at the shift registers outputs:

obs[t] = v_xor(v_xor(s_reg[0],s_reg[1]),s_reg[2])+\
    v_xor(s_reg[0],s_reg[2])

Code output:

# shift_register / current state
[‘1’, ‘0’, ‘0’] 10
[‘0’, ‘1’, ‘0’] 01
[‘0’, ‘0’, ‘1’] 00
[‘1’, ‘0’, ‘0’] 10
[‘1’, ‘1’, ‘0’] 11

# encoded bits
[’11’, ’10’, ’11’, ’11’, ’01’]

The Viterbi algorithm is applied to decode the generated vector of bits. The bits transmitted are decoded based on the state machine analysis. Each machine state is expected to have a direction represented by arrows in the second figure, for example, the state “00” can’t jump to state “11”, as states “10” or “01” are not able to stay for two time steps. A Trellis diagram can be used to represent the possible paths between each state for each time step. This representation helps to visualize the code better.

The Trellis diagram below is based in: https://www.youtube.com/watch?v=iPh-HKZyWN4

trellis

Its structure consists of nodes referring to our machine states, each node interconnects with two possible branches representing possible state machine transitions, for each time step a calculation is performed based on the number of different bits between the observed encoded bits and branch expected bits. The result is a branch metric that points which is the most-likely path to follow.

trellis_ex.png

The above example shows the decoding operation for the pairs “11” and “10”. The blue number represents each node metric, at t=0 they are all set to 0. When t=1, each node verifies upcoming branches and their metrics, branches with bigger metric between two possibilities are eliminated. The remaining branch metric is incorporated to the node for further calculations. A branch is chosen randomly If two have the same metric. For t=2, same calculations are performed.

Since there are only two observations, the Viterbi algorithm trace-back begins. Firstly choosing the node with smaller metric, circled by yellow. Then, following the branches that aren’t eliminated and retrieving their input bit attribute also circled by yellow. For these two observations, the most-likely input bits are “10”. This approach brings an issue related to the trace-back mechanism because reading backward will deliver the input bits with inverted indexes. In this case, for example, the first output got is “01”. Also, all metrics need to be calculated before having an output bitstream.

obs = ("11","10","11","11","01","01","11")
start_metric = {'zero':0,'one': 0, 'two': 0,'three':0}
state_machine = {
    #current state, possible branches, branch information
    'zero': {'b1': {'out_b':"11",'prev_st': 'one','input_b':0},
             'b2': {'out_b':"00",'prev_st': 'zero','input_b':0}},
    'one': {'b1': {'out_b': "01", 'prev_st': 'three', 'input_b': 0},
             'b2': {'out_b': "10", 'prev_st': 'two', 'input_b': 0}},
    'two': {'b1': {'out_b': "11", 'prev_st': 'zero', 'input_b': 1},
             'b2': {'out_b': "00", 'prev_st': 'one', 'input_b': 1}},
    'three': {'b1': {'out_b': "10", 'prev_st': 'three', 'input_b': 1},
             'b2': {'out_b': "01", 'prev_st': 'two', 'input_b': 1}},

}

def bits_diff_num(num_1,num_2):
    count=0;
    for i in range(0,len(num_1),1):
        if num_1[i]!=num_2[i]:
            count=count+1
    return count

def viterbi(obs, start_metric, state_machine):
    #Trellis structure
    V = [{}]
    for st in state_machine:
        # Calculating the probability of both initial possibilities for the first observation
        V[0][st] = {"metric": start_metric[st]}
    #for t>0
    for t in range(1, len(obs)+1):
        V.append({})
        for st in state_machine:
            #Check for smallest bit difference from possible previous paths, adding with previous metric
            prev_st = state_machine[st]['b1']['prev_st']
            first_b_metric = V[(t-1)][prev_st]["metric"] + bits_diff_num(state_machine[st]['b1']['out_b'], obs[t - 1])
            prev_st = state_machine[st]['b2']['prev_st']
            second_b_metric = V[(t - 1)][prev_st]["metric"] + bits_diff_num(state_machine[st]['b2']['out_b'], obs[t - 1])
            #print(state_machine[st]['b1']['out_b'],obs[t - 1],t)
            if first_b_metric > second_b_metric:
                V[t][st] = {"metric" : second_b_metric,"branch":'b2'}
            else:
                V[t][st] = {"metric": first_b_metric, "branch": 'b1'}

    #print trellis nodes metric:
    for st in state_machine:
        for t in range(0,len(V)):
            print V[t][st]["metric"],
        print ""
    print ""

    smaller = min(V[t][st]["metric"] for st in state_machine)
    #traceback the path on smaller metric on last trellis column
    for st in state_machine:
        if V[len(obs)-1][st]["metric"] == smaller:
            source_state = st
            for t in range(len(obs),0,-1):
                branch = V[t][source_state]["branch"]
                print(state_machine[source_state][branch]['input_b']),
                source_state = state_machine[source_state][branch]['prev_st']
            print (source_state+"\n")
    print("Finish")

viterbi(obs,
        start_metric,
        state_machine)

The algorithm is implemented with Python as shown above. The Viterbi decoder takes as input the encoder state machine, start metric for t=0 and the observations:

obs = ("11","10","11","11","01","01","11")
start_metric = {'zero':0,'one': 0, 'two': 0,'three':0}
state_machine = {
    #current state, possible branches, branch information
    'zero': {'b1': {'out_b':"11",'prev_st': 'one','input_b':0},
             'b2': {'out_b':"00",'prev_st': 'zero','input_b':0}},
    'one': {'b1': {'out_b': "01", 'prev_st': 'three', 'input_b': 0},
             'b2': {'out_b': "10", 'prev_st': 'two', 'input_b': 0}},
    'two': {'b1': {'out_b': "11", 'prev_st': 'zero', 'input_b': 1},
             'b2': {'out_b': "00", 'prev_st': 'one', 'input_b': 1}},
    'three': {'b1': {'out_b': "10", 'prev_st': 'three', 'input_b': 1},
             'b2': {'out_b': "01", 'prev_st': 'two', 'input_b': 1}},

}

Starting with all metrics = 0  means the system has equal probability to have begun with any of four states. If the application specifies that the first state always starts at zero (“00”) for example, a short approach is raising the other state metrics. By knowing the beginning state, the decoder has less probability of getting a wrong bitstream from the noisy channel.

The vector “V” represents the Trellis graph, which loads primary metrics for t=0:

V = [{}]
for st in state_machine:
    # Calculating the probability of both initial possibilities for the first observation
    V[0][st] = {"metric": start_metric[st]}

After, the Trellis metrics and optimal branches are calculated:

#for t>0
for t in range(1, len(obs)+1):
    V.append({})
    for st in state_machine:
        #Check for smallest bit difference from possible previous paths, adding with previous metric
        prev_st = state_machine[st]['b1']['prev_st']
        first_b_metric = V[(t-1)][prev_st]["metric"] + bits_diff_num(state_machine[st]['b1']['out_b'], obs[t - 1])
        prev_st = state_machine[st]['b2']['prev_st']
        second_b_metric = V[(t - 1)][prev_st]["metric"] + bits_diff_num(state_machine[st]['b2']['out_b'], obs[t - 1])
        #print(state_machine[st]['b1']['out_b'],obs[t - 1],t)
        if first_b_metric > second_b_metric:
            V[t][st] = {"metric" : second_b_metric,"branch":'b2'}
        else:
            V[t][st] = {"metric": first_b_metric, "branch": 'b1'}

With all paths and metric set, the trace-back takes place from the smaller metric node:

smaller = min(V[t][st]["metric"] for st in state_machine)
#traceback the path on smaller metric on last trellis column
for st in state_machine:
    if V[len(obs)-1][st]["metric"] == smaller:
        source_state = st
        for t in range(len(obs),0,-1):
            branch = V[t][source_state]["branch"]
            print(state_machine[source_state][branch]['input_b']),
            source_state = state_machine[source_state][branch]['prev_st']
        print (source_state+"\n")

Code output:

#Trellis nodes metric, t=0, t=1 …
0 0 1 0 2 3 3 0
0 1 1 2 2 0 2 3
0 0 1 1 0 3 3 2
0 1 0 2 2 2 0 3

#Inverted input bits encoded and decoded
0 1 1 1 0 0 1 zero

Finish

Controlling Virtuoso schematic inputs with Digital Vector File

The utilization of a Vector File makes controlling multiple input and output signals independently possible. Its implementation is important for a variety of circuits which need different pulses at specific times, relieving the effort to manage PWL voltage sources at schematic.

Note: comments begins with “;” character

radix 1 1 4 4
; radix specifies the number of ports and number of bits for each port,
; the base number is also specified ( from binary '1' to hexadecimal '4' )

io i i i i

; io defines the state of each port, as input "i", output "o"
; or bidirectional "b"
vname Primeiro_Sinal Segundo_Sinal Terceiro_Sinal[0:3] Quarto_Sinal,Quinto_Sinal,Sexto_Sinal,Setimo_Sinal

; vname sets the name of each port

tunit 1ns

; tunit configures the time unit of simulation

trise 1ps
; trise defines the rise time of each vector

tfall 1ps
; tfall defines fall time of each vector

vih 5.0
; vih defines logical '1' voltage

vil 0.0
; vil defines logical '0' voltage
period 10.0

; period determines the time interval between each vector step, the
; number is multiplied by the previously time unit set

idelay 5 0 1 1 0

; idelay inserts a delay at the selected ports multiplied by the time unit

; The next lines will define the states of each port as the
; transient simulation happens, each line represents a time step

; The binary number can be represented with 0 or 1, and the
; hexadecimals with 0-F

0 0 0 0 ; All signals initially set to 0
1 0 F F ; First signal with a logical '1', second with a logical '0' , the other ones set to a logical '1'
0 1 C C ; And goes on
1 0 A A
1 1 3 3

The ports declaration and manipulation are indented by its position from left to right.

To the visualization of the developed signals, it is necessary to open a schematic on Virtuoso and open: “Analog Environment->Setup->Simulation Files”, inserting the file path on the “Vector File” field.

After the simulation, the signals can be seen on results browser

primeiros_sinais

quarto_sinal

After the simulation, the signals can be seen on results browser
radix 1 1

io i i

vname A B ; schematic input ports 

tunit 1ns

trise 1ps

tfall 1ps

vih 5.0

vil 0.0

period 10.0

0 0
0 1
1 0
1 1
1 0
0 1
0 0 ; 70ns in total

 

and_2_schem

and_2_out

Reference: http://www.bioee.ee.columbia.edu/courses/cad/html/vector_file.pdf

SRAM write and read basics (1/2)

The Static Random Access Memory is widely used among digital circuits for its good trade-off between memory architectures. Being not large like register files, at the same time not slow as DRAM or hard disk management. It is also compatible with standard CMOS process.

sram_t

The standard SRAM cell architecture is shown above, a pair of pMOS and nMOS transistors create cross-coupled inverters that can hold its state without needing to have its input driven by an extern signal.

The word line “WL” enables the bit lines “BL and “!BL” to drive the net “Q” to a logical value (either 0 or 1) and at the same time, opening the cell data to be read by a special circuitry.

Sizing those transistors correctly is required. If the inverter transistors “M1 to M4” are considerably stronger than “M5” and “M6” access transistors, the current passing through the transistors “M5 and M6” will not be enough to flip the state of the memory cell. The stable state created by the cross-coupled inverters needs to be overpowered to change its logical level. Moreover, the opposite situation with unbalanced strength (size) causes the logical state to be changed from bit lines noise even with the word lines disabled. The same problem can happen during the read operation which is going to be presented further.

sram_tb_schem

The circuit structure is built on Cadence Virtuoso Schematic Editor, to show the write operation on a standard SRAM cell. The schematic contains two power sources for Vdd and GND and two pulse generators to drive the word line and bitlines.

write_good

The plot acquired above is based upon a well sized SRAM cell, when the “WORD” trace is high, enables the “BIT” from bit lines to drive the “DATA” of the cell. The book “CMOS VLSI Design: A circuit and Systems Perspective” tells that a good sizing can be achieved with the pull-down transistors 8/2 λ, pull-up 3/3 λ and access transistors with 4/2 λ.

write_bad1

With oversized pull-up and pull-down transistors, the access transistors will not have enough power to flip the logical cell level as shown above.

An array of SRAM cells is constructed by sharing the word line for multiple cells. Multiple arrays may have different words that become the programming address. The output of a decoder chooses specific rows or columns. A matrix can also be built with a similar cell schematic and two decoders controlling row and column to select a particular cell. As the memory size gets bigger, more capacitance (intrinsic from transistors) is connected to the shared cells bit lines. This capacitance brings the need of special circuitry to read the data of the cell. The access transistors are not designed to quickly drive the bit lines to a determined value and increasing the cells would reduce the memory size capability.

As the memory size gets bigger, more capacitance (intrinsic from transistors) is connected to the shared cells bit lines. This capacitance brings the need of special circuitry to read the data of the cell. The access transistors are not designed to quickly drive the bit lines to a determined value and increasing the cells would reduce the memory size capability.

sense_amp

(Circuit took from the book “CMOS Circuit Design, Layout, and Simulation”)

The sense amplifier circuit (above) is used to measure which bit line is higher and then amplify it. Both bit lines (“BL” and “!BL”) must be set to a logical ‘1’ before the sensing begins. The same logical level is necessary since the SRAM access transistors slowly drive the bit lines when a SRAM cell is selected to be read. Similar to a SRAM cell, the schematic consists of a pair of weak cross-coupled inverters, access transistors, and another transistor to control the “floating” state.

sense_schem

A test bench is set to observe the output signals of the sense amplifier. A pulse source is used to control the “SE” (Sense Enable) signal. Two ramp functions are generated from a PWL source to simulate the slowly bit line transition provoked by a selected SRAM cell.

sense_simu

At the transient simulation presented above, when the “SE” signal is set to a logical ‘0’ (pMOS control), the bottom nMOS transistor cuts the cross-coupled inverters connection to “GND” global net, making them to be in a floating state. At the same time, the access transistors are enabled, letting “Q and !Q” nets to be charged by the bit lines. Sequentially, raising “SE” to a logical ‘1’  disables the cross-coupled inverters connection to the bitlines and let the cell to be in a stable state with “GND”. A race condition happens which makes the “most charged” node to overpower the other one, showing a solid logic state at its outputs.