Module axi_master

This document contains technical documentation for the axi_master module.

An AXI read/write master is a convenient tool for performing memory operations from your FPGA design. It handles the complexity of performing raw AXI transactions and instead presents a very simple interface to the user. This type of module is often referred to as a “data mover” as well.

An AXI write master has a job interface where the user specifies a burst length as well as a target memory address. Parallel to this is a data interface where the data that shall be saved in memory is streamed using a ready / valid handshake interface. The AXI write master will adapt the jobs internally to make sure that all operations adhere to the AXI standard.

The AXI read master works analogously, with a job interface and a data interface. In this case, the data interface streams data from memory to the user.

Module structure

This module has two top levels that are an integration of the sub-entities. See axi_read_master and axi_write_master for detailed documentation and block diagrams.

Performance

This design achieves 100% utilization of the data channels (R and W). This is done by having full separation of the address and data/response channels. For axi_write_master this can be controlled via a generic, since the logic footprint is a little higher when full throughput must be supported.

Limitations

  1. These AXI4 signals are not included in the interfaces, and are assumed to be constant:

    • Lock type: AxLOCK

    • Memory type: AxCACHE

    • Protection type: AxPROT

    • Quality of service: AxQOS

    • Region identifier: AxREGION

    • User-defined signaling: AxUSER and xUSER

  2. The AXI standard requires that there be no combinatorial paths between input and output handshake signals (ready and valid). This rule is not honored in this module, since honoring it would increase logic footprint and is not necessary to reach timing.

  3. The module does not have any reset functionality. The design targets modern SRAM-based FPGAs, where initial values can be used and there is no need for reset.

Resource utilization

The top-levels and sub-entities of this module feature generics for data width, address width and ID width. For these generics, a higher value will result in greater logic footprint. Special care should be taken to specify exactly the address and ID width that is actually needed. In most use cases the ID is not used, so the ID width can be set to zero.

Specifically for axi_write_master, the ID width has a large impact on resource utilization. Setting it to zero is very beneficial.

Handshake interface

This module uses handshaking for data qualification on the job and data interfaces.

Using AXI4-Stream-like handshake interfaces (ready and valid to qualify data transactions) is very common in FPGA designs. It enables a backpressure situation where the slave, i.e. the receiver of data, can indicate when it is ready to receive the data.

Below are some rules governing how these handshake signals interact. They are adapted from the AMBA 4 AXI4-Stream Protocol Specification, ARM IHI 0051A (ID030610).

  1. A transaction occurs on the positive edge of the clock when both ready and valid are high. The graph below shows some typical transactions.

    ../../_images/wavedrom-02a10e69-2cb6-4e5d-9d1d-2547323ca829.svg
  2. The ready signal may fall without a transaction having occurred:

    ../../_images/wavedrom-cb609a42-1f9e-4281-9577-647fb575e9f6.svg
  3. The valid signal may NOT fall without a transaction having occurred:

    ../../_images/wavedrom-ac80b549-803c-4f17-af42-8f9711c75a83.svg
  4. Once valid is asserted, the associated data may NOT be changed until a transaction has occurred.

    ../../_images/wavedrom-5850af98-11e8-41b9-89f9-3dce95d28b99.svg

    This applies to any auxiliary signals associated with the bus as well, e.g. a last indicator.

    Note also that this restriction on data not changing only applies when valid is asserted. When it is not, the data may be changed freely.

  5. In order to avoid deadlock situations, the master may NOT wait for the slave to assert ready before asserting valid. The slave however may wait for valid before asserting ready.
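
The rules on valid and data stability can be checked mechanically on a recorded trace. The following is a minimal behavioral sketch in Python, not part of this module: the names check_handshake and count_transactions are hypothetical, and each trace entry is assumed to be a (ready, valid, data) tuple sampled on the positive clock edge. The deadlock rule concerns the causal dependency between the two sides and can not be verified from a trace alone, so it is not covered here.

```python
from typing import List, Tuple

# One clock cycle of a handshake bus, sampled at the positive edge: (ready, valid, data).
Sample = Tuple[bool, bool, int]


def check_handshake(trace: List[Sample]) -> List[str]:
    """Return a list of rule violations found in the trace (empty if there are none)."""
    violations = []
    for cycle in range(1, len(trace)):
        prev_ready, prev_valid, prev_data = trace[cycle - 1]
        _, valid, data = trace[cycle]

        # Only a cycle where 'valid' was high without a transaction constrains the next cycle.
        stalled = prev_valid and not prev_ready
        if not stalled:
            continue

        if not valid:
            violations.append(f"cycle {cycle}: valid fell without a transaction")
        elif data != prev_data:
            violations.append(f"cycle {cycle}: data changed while valid was stalled")

    return violations


def count_transactions(trace: List[Sample]) -> int:
    """A transaction occurs in every cycle where both ready and valid are high."""
    return sum(1 for ready, valid, _ in trace if ready and valid)
```

Note that ready is never checked against the previous cycle, which reflects the rule that ready may fall freely.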

axi_master_pkg.vhd

Package with types and utility functions for the AXI master ecosystem.

axi_read_master.vhd

component axi_read_master is
  generic (
    address_width : positive;
    id_width : natural;
    data_width : positive;
    -- Typically 256 or 16.
    max_axi_burst_length_beats : positive;
    -- Set to 'false' if job/data interfaces are in different clock domains than the AXI interface
    clocks_are_the_same : boolean;
    -- Depth of the job FIFO. Can be set to zero to disable FIFO.
    -- Must be non-zero if 'clocks_are_the_same' is false.
    job_fifo_depth : natural;
    -- Depth of the data FIFO. Can be set to zero to disable FIFO.
    -- Must be non-zero if 'clocks_are_the_same' is false.
    data_fifo_depth : natural;
    -- Set to 'true' to support the case where job.length_bytes is not a multiple of data_width / 8.
    support_unaligned_length : boolean;
    -- Setting to true increases logic footprint
    remove_zero_length_input_jobs : boolean;
    -- If there is a known limitation on what lengths are set on the input jobs, resources can
    -- be saved by specifying a lower value here.
    max_job_length_bytes : integer range 1 to axi_master_job_length_bytes_max_value
  );
  port (
    -- Clock for the job and data interfaces.
    clk : in std_ulogic;
    -- Clock for the AXI interface. Shall be assigned to same clock signal as 'clk'
    -- if 'clocks_are_the_same' is set to true.
    axi_clk : in std_ulogic;
    --# {{}}
    job_ready : out std_ulogic;
    job_valid : in std_ulogic;
    job : in axi_master_job_t;
    --# {{}}
    data_ready : in std_ulogic;
    data_valid : out std_ulogic;
    data : out std_ulogic_vector;
    data_id : out u_unsigned;
    data_resp : out std_ulogic_vector;
    data_last : out std_ulogic;
    data_strb : out std_ulogic_vector;
    --# {{}}
    axi_read_m2s : out axi_read_m2s_t;
    axi_read_s2m : in axi_read_s2m_t
  );
end component;

Top level for AXI read master that instantiates a job_partitioner and an axi_read_master_core. It also features optional FIFOs to provide buffering, and any clock domain crossing that is needed.

digraph my_graph {
graph [ dpi = 300 ];
rankdir="LR";
splines=ortho;

input_job [ label="job" shape=none ];
output_data [ label="data" shape=none ];

job_fifo [ label="" shape=none image="fifo.png"];
data_fifo [ label="" shape=none image="fifo.png"];

{
  rank=same;job_fifo;data_fifo;
}

input_job -> job_fifo;
output_data -> data_fifo [ dir="back" ];

job_partitioner [ label="job_partitioner" shape=box];
job_fifo -> job_partitioner;

axi_read_master_core [ label="axi_read_master_core" shape=box height=1.3 ];

job_partitioner -> axi_read_master_core;
data_fifo -> axi_read_master_core [ dir="back" ];

axi [ label="AXI" shape=box height=2 ];

# The rendering is royally messed up here. The labels are switched.
# Visually verify result when changing anything in this graph.
axi_read_master_core -> axi [ label="AR" dir="back" ];
axi_read_master_core -> axi [ label="R" ];
}

By setting job_fifo_depth and data_fifo_depth the amount of buffering can be controlled. If either of the values is set to zero, that FIFO is omitted. If the clocks_are_the_same generic is set to false the FIFOs will be asynchronous, which provides the necessary clock crossing.
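
The interaction between the depth generics and clocks_are_the_same can be summarized as a small decision function. This is only an illustrative Python sketch of the rules stated above and in the generic comments; the function name fifo_kind and the returned strings are hypothetical and do not exist in the module.

```python
def fifo_kind(depth: int, clocks_are_the_same: bool) -> str:
    """Describe which kind of FIFO a given generic configuration results in."""
    if depth < 0:
        raise ValueError("FIFO depth is of type 'natural' and can not be negative")

    if not clocks_are_the_same:
        # The FIFO provides the clock domain crossing, so it can not be omitted.
        if depth == 0:
            raise ValueError("FIFO depth must be non-zero when the clocks differ")
        return "asynchronous"

    return "omitted" if depth == 0 else "synchronous"
```

The same function applies to both job_fifo_depth and data_fifo_depth, since the two FIFOs follow the same rule.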

Apart from the generics discussed above, the remaining generics are the same as those of the job_partitioner and axi_read_master_core. See those entities for further documentation.

Resource utilization

This entity has netlist builds set up with automatic size checkers in module_axi_master.py. The following table lists the resource utilization for the entity, depending on generic configuration.

Resource utilization for axi_read_master.vhd netlist builds.

Generics                                              Total LUTs   FFs   RAMB36   RAMB18   DSP Blocks

(Using wrapper axi_read_master_netlist_wrapper.vhd)   233          338   4        0        0

axi_read_master_core.vhd

component axi_read_master_core is
  generic (
    address_width : positive;
    id_width : natural;
    data_width : positive;
    -- Typically 256 or 16.
    max_burst_length_beats : positive;
    -- Set to 'true' to support the case where job.length_bytes is not a multiple of data_width / 8.
    support_unaligned_length : boolean
  );
  port (
    clk : in std_ulogic;
    --# {{}}
    job_ready : out std_ulogic;
    job_valid : in std_ulogic;
    job : in axi_master_job_t;
    --# {{}}
    data_ready : in std_ulogic;
    data_valid : out std_ulogic;
    data : out std_ulogic_vector;
    data_id : out u_unsigned;
    data_resp : out std_ulogic_vector;
    data_last : out std_ulogic;
    data_strb : out std_ulogic_vector;
    --# {{}}
    axi_read_m2s : out axi_read_m2s_t;
    axi_read_s2m : in axi_read_s2m_t
  );
end component;

Create AXI AR and R transactions from a stream of jobs.

The design is pipelined and has full separation between the AR and R channels. This achieves 100% utilization of the R channel, with no cycles wasted. The AR channel can have at most a 50% utilization, which means that this entity can accept a new job every second cycle.

Warning

This entity assumes that the incoming jobs are valid in an AXI sense:

  1. The jobs must not be of length zero.

  2. The jobs must not cross a 4k address boundary.

  3. The jobs must not be longer than max_burst_length_beats.

In cases where this is not guaranteed, a job_partitioner can be used to adapt the jobs before being sent to this entity. This is always done in axi_read_master.

This entity also assumes that job.address is aligned with data_width.

The generic support_unaligned_length shall be set based on the job length characteristics. If job.length_bytes is always a multiple of data_width / 8 then it can be set to false. If the length does not fulfill this condition in all cases however, the generic must be set to true. Enabling the generic does increase the logic footprint marginally (~10 LUT).
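
The assumptions above can be expressed as a simple check on a job before it is handed to the core. The following Python sketch is a behavioral illustration only; the function names are hypothetical, and "aligned with data_width" is interpreted here as the address being a multiple of data_width / 8 bytes, which is an assumption.

```python
from typing import List

BOUNDARY_4K_BYTES = 4096


def core_job_problems(
    address: int, length_bytes: int, data_width: int, max_burst_length_beats: int
) -> List[str]:
    """Return the reasons (if any) why a job may not be sent directly to the core."""
    bytes_per_beat = data_width // 8
    problems = []

    if address % bytes_per_beat != 0:
        problems.append("address is not aligned with data_width")

    if length_bytes == 0:
        # The remaining checks are meaningless for an empty job.
        problems.append("length is zero")
        return problems

    last_byte_address = address + length_bytes - 1
    if address // BOUNDARY_4K_BYTES != last_byte_address // BOUNDARY_4K_BYTES:
        problems.append("crosses a 4k address boundary")

    # Round up, since a partially filled last beat still occupies a full beat.
    length_beats = (length_bytes + bytes_per_beat - 1) // bytes_per_beat
    if length_beats > max_burst_length_beats:
        problems.append("longer than max_burst_length_beats")

    return problems


def needs_unaligned_length_support(job_lengths_bytes: List[int], data_width: int) -> bool:
    """True if any job length is not a multiple of data_width / 8."""
    bytes_per_beat = data_width // 8
    return any(length % bytes_per_beat != 0 for length in job_lengths_bytes)
```

If the set of job lengths that the application can produce is known, needs_unaligned_length_support gives the value that support_unaligned_length would have to take.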

Resource utilization

This entity has netlist builds set up with automatic size checkers in module_axi_master.py. The following table lists the resource utilization for the entity, depending on generic configuration.

Resource utilization for axi_read_master_core.vhd netlist builds.

Generics                             Total LUTs   Logic LUTs   FFs   DSP Blocks

data_width = 32                      31           29           62    0
address_width = 29
id_width = 8
max_burst_length_beats = 256
support_unaligned_length = False

data_width = 32                      38           34           66    0
address_width = 29
id_width = 8
max_burst_length_beats = 256
support_unaligned_length = True

axi_write_master.vhd

component axi_write_master is
  generic (
    address_width : positive;
    id_width : natural;
    data_width : positive;
    -- Typically 256 or 16.
    max_axi_burst_length_beats : positive;
    clocks_are_the_same : boolean;
    -- Depth of the job FIFO. Can be set to zero to disable FIFO.
    -- Must be non-zero if 'clocks_are_the_same' is false.
    job_fifo_depth : natural;
    -- Depth of the data FIFO. Can be set to zero to disable FIFO.
    -- Must be non-zero if 'clocks_are_the_same' is false.
    data_fifo_depth : natural;
    -- Set to 'true' to support the case where job.length_bytes is not a multiple of data_width / 8.
    support_unaligned_length : boolean;
    -- Setting to true increases logic footprint
    remove_zero_length_input_jobs : boolean;
    -- If there is a known limitation on what lengths are set on the input jobs, resources can
    -- be saved by specifying a lower value here.
    max_job_length_bytes : integer range 1 to axi_master_job_length_bytes_max_value;
    -- This enables full separation of the AW/W channels, which improves throughput.
    enable_full_throughput : boolean;
    -- For AXI3 operation this must be set to true in order for the WID field to be set correctly.
    set_wid : boolean
  );
  port (
    -- Clock for the job, data and job_done interfaces.
    clk : in std_ulogic;
    -- Clock for the AXI interface. Shall be assigned to same clock signal as 'clk'
    -- if 'clocks_are_the_same' is set to true.
    axi_clk : in std_ulogic;
    --# {{}}
    job_ready : out std_ulogic;
    job_valid : in std_ulogic;
    job : in axi_master_job_t;
    --# {{}}
    data_ready : out std_ulogic;
    data_valid : in std_ulogic;
    data : in std_ulogic_vector;
    --# {{}}
    job_done_valid : out std_ulogic;
    job_done_id : out u_unsigned;
    --# {{}}
    axi_write_m2s : out axi_write_m2s_t;
    axi_write_s2m : in axi_write_s2m_t
  );
end component;

Top level for AXI write master that instantiates a job_partitioner and an AXI write master core.

digraph my_graph {
graph [ dpi = 300 ];
rankdir="LR";
splines=ortho;

input_job [ label="job" shape=none ];
input_data [ label="data" shape=none ];

job_fifo [ label="" shape=none image="fifo.png"];
data_fifo [ label="" shape=none image="fifo.png"];

{
  rank=same;job_fifo;data_fifo;
}

input_job -> job_fifo;
input_data -> data_fifo;

job_partitioner [ label="job_partitioner" shape=box];
job_fifo -> job_partitioner;

axi_write_master_core [ label="axi_write_master_core" shape=box height=1.3 ];

job_partitioner -> axi_write_master_core;
data_fifo -> axi_write_master_core;

axi [ label="AXI" shape=box height=2 ];

# The rendering is royally messed up here. The labels are switched.
# Verify result when changing anything in this graph.
axi_write_master_core -> axi [ label="AW" dir="back" ];
axi_write_master_core -> axi [ label="W" ];
axi_write_master_core -> axi [ label="B" ];
}

By setting job_fifo_depth and data_fifo_depth the amount of buffering can be controlled. If either of the values is set to zero, that FIFO is omitted. If the clocks_are_the_same generic is set to false the FIFOs will be asynchronous, which provides the necessary clock crossing.

There is a generic enable_full_throughput which controls a tradeoff between performance and logic footprint. If it is set to true an axi_write_master_core_full_throughput will be instantiated that has full separation of AW/W/B channels which enables 100% utilization of the W channel. If the generic is set to false an axi_write_master_core will be instantiated that has at least two cycles of overhead per burst.
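
The cost of the per-burst overhead can be estimated with simple arithmetic. The Python sketch below is an illustration under stated assumptions, not a measurement: it assumes exactly two overhead cycles per burst (the text says "at least two", so the result is an upper bound) and that neither the data source nor the AXI slave stalls. The function name is hypothetical.

```python
from fractions import Fraction


def w_channel_utilization_bound(burst_length_beats: int, overhead_cycles: int = 2) -> Fraction:
    """Upper bound on W channel utilization for back-to-back bursts of the given length."""
    if burst_length_beats < 1:
        raise ValueError("A burst must contain at least one beat")

    # Each burst occupies its data beats plus the overhead cycles on the W channel.
    return Fraction(burst_length_beats, burst_length_beats + overhead_cycles)
```

With bursts of 256 beats the bound is 128/129, so the overhead is negligible. With bursts of 16 beats it is 8/9, and the full throughput variant becomes more attractive.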

Apart from the generics discussed above, the remaining generics are the same as those of the job_partitioner and axi_write_master_core. See those entities for further documentation.

Resource utilization

This entity has netlist builds set up with automatic size checkers in module_axi_master.py. The following table lists the resource utilization for the entity, depending on generic configuration.

Resource utilization for axi_write_master.vhd netlist builds.

Generics                                  Total LUTs   FFs   RAMB36   RAMB18   DSP Blocks

address_width = 32                        291          405   4        0        0
id_width = 8
data_width = 64
max_axi_burst_length_beats = 256
clocks_are_the_same = False
job_fifo_depth = 16
data_fifo_depth = 2048
support_unaligned_length = False
remove_zero_length_input_jobs = False
max_job_length_bytes = 65535
enable_full_throughput = False
set_wid = False

axi_write_master_core.vhd

component axi_write_master_core is
  generic (
    address_width : positive;
    id_width : natural;
    data_width : positive;
    -- Typically 256 or 16.
    max_burst_length_beats : positive;
    -- Set to 'true' to support the case where job.length_bytes is not a multiple of data_width / 8.
    support_unaligned_length : boolean;
    -- For AXI3. Will increase logic footprint.
    set_wid : boolean;
    -- Control the metadata FIFO, which is written when an AW transaction occurs, and popped when
    -- a B transaction occurs. This indirectly controls the maximum number of
    -- outstanding transactions.
    -- The default value is somewhat arbitrarily chosen, but should not be limiting in any realistic
    -- situation. Also going down to e.g. 8 does not save any considerable amount of FIFO logic.
    metadata_fifo_depth : positive
  );
  port (
    clk : in std_ulogic;
    --# {{}}
    job_ready : out std_ulogic;
    job_valid : in std_ulogic;
    job : in axi_master_job_t;
    --# {{}}
    data_ready : out std_ulogic;
    data_valid : in std_ulogic;
    data : in std_ulogic_vector;
    --# {{}}
    -- Leaving job_done_ready at its default value will minimize logic utilization.
    job_done_ready : in std_ulogic;
    job_done_valid : out std_ulogic;
    job_done_id : out u_unsigned;
    --# {{}}
    axi_write_m2s : out axi_write_m2s_t;
    axi_write_s2m : in axi_write_s2m_t
  );
end component;

Create AXI AW and W transactions from a stream of jobs and data. AXI3 compliance can be enabled via the set_wid generic.

Note that this design does not achieve full utilization of the W channel. The AW channel will not perform a new transaction until the W channel has finished its previous burst. Assuming there is a FIFO on the AW channel, this still leaves two cycles overhead per job.

This could be amended if the AW and W channels are separated by having a FIFO that holds awlen and last_beat_strb. When this information is saved in the FIFO, the state machine could pop a new job and perform a new AW transaction. This adds a small amount of LUTs and flip-flops to the design.

Warning

This entity assumes that the incoming jobs are valid in an AXI sense:

  1. The jobs must not be of length zero.

  2. The jobs must not cross a 4k address boundary.

  3. The jobs must not be longer than max_burst_length_beats.

In cases where this is not guaranteed, a job_partitioner can be used to adapt the jobs before being sent to this entity. This is always done in axi_write_master.

This entity also assumes that job.address is aligned with data_width.

The generic support_unaligned_length shall be set based on the job length characteristics. If job.length_bytes is always a multiple of data_width / 8 then it can be set to false. If the length does not fulfill this condition in all cases however, the generic must be set to true. Enabling the generic does increase the logic footprint marginally (~10 LUT).
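
To illustrate what an unaligned length means for the burst, the Python sketch below derives the AWLEN value and a byte strobe for the last beat from a job length. It is a behavioral illustration, not the entity's implementation. AWLEN being the number of beats minus one follows from the AXI encoding. That the valid bytes of the last beat occupy the lowest byte lanes is an assumption, based on job.address being aligned. The function name is hypothetical.

```python
from typing import Tuple


def burst_parameters(length_bytes: int, data_width: int) -> Tuple[int, int]:
    """Return (awlen, last_beat_strb) for a job that is valid in an AXI sense."""
    if length_bytes < 1:
        raise ValueError("The job must not be of length zero")

    bytes_per_beat = data_width // 8

    # Round up, since a partially filled last beat still occupies a full beat.
    num_beats = (length_bytes + bytes_per_beat - 1) // bytes_per_beat
    bytes_in_last_beat = length_bytes - (num_beats - 1) * bytes_per_beat

    # One strobe bit per valid byte, assumed to start from the lowest byte lane.
    last_beat_strb = (1 << bytes_in_last_beat) - 1

    return num_beats - 1, last_beat_strb
```

For a 32-bit bus, a 16-byte job gives a fully set strobe in the last beat, while a 10-byte job gives a last beat where only the two lowest strobe bits are set. The second case is the one that requires support_unaligned_length to be true.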

Resource utilization

This entity has netlist builds set up with automatic size checkers in module_axi_master.py. The following table lists the resource utilization for the entity, depending on generic configuration.

Resource utilization for axi_write_master_core.vhd netlist builds.

Generics                             Total LUTs   Logic LUTs   FFs   DSP Blocks

data_width = 32                      51           49           71    0
address_width = 29
id_width = 8
max_burst_length_beats = 256
support_unaligned_length = False
set_wid = False

data_width = 32                      54           52           74    0
address_width = 29
id_width = 8
max_burst_length_beats = 256
support_unaligned_length = True
set_wid = False

axi_write_master_core_full_throughput.vhd

component axi_write_master_core_full_throughput is
  generic (
    address_width : positive;
    id_width : natural;
    data_width : positive;
    -- Typically 256 or 16.
    max_burst_length_beats : positive;
    -- Set to 'true' to support the case where job.length_bytes is not a multiple of data_width / 8.
    support_unaligned_length : boolean;
    -- For AXI3. Will increase logic footprint and decrease throughput.
    set_wid : boolean;
    -- Control the metadata FIFOs, which are written when an AW transaction occurs. One of them is
    -- popped when a WLAST occurs, and the other is popped when a B transaction occurs.
    -- This indirectly controls the maximum number of outstanding transactions.
    -- The default value is somewhat arbitrarily chosen, but should not be limiting in any realistic
    -- situation. Also going down to e.g. 8 does not save any considerable amount of FIFO logic.
    metadata_fifo_depth : positive
  );
  port (
    clk : in std_ulogic;
    --# {{}}
    job_ready : out std_ulogic;
    job_valid : in std_ulogic;
    job : in axi_master_job_t;
    --# {{}}
    data_ready : out std_ulogic;
    data_valid : in std_ulogic;
    data : in std_ulogic_vector;
    --# {{}}
    -- Leaving job_done_ready at its default value will minimize logic utilization.
    job_done_ready : in std_ulogic;
    job_done_valid : out std_ulogic;
    job_done_id : out u_unsigned;
    --# {{}}
    axi_write_m2s : out axi_write_m2s_t;
    axi_write_s2m : in axi_write_s2m_t
  );
end component;

Like axi_write_master_core but has full separation of AW and W channels, which achieves 100% utilization of the W channel.

Resource utilization

This entity has netlist builds set up with automatic size checkers in module_axi_master.py. The following table lists the resource utilization for the entity, depending on generic configuration.

Resource utilization for axi_write_master_core_full_throughput.vhd netlist builds.

Generics                             Total LUTs   Logic LUTs   FFs   DSP Blocks

data_width = 32                      74           64           92    0
address_width = 29
id_width = 8
max_burst_length_beats = 256
support_unaligned_length = True
set_wid = False

job_fifo.vhd

component job_fifo is
  generic (
    -- Depth of the FIFO. Can be set to zero to disable FIFO.
    depth : natural;
    --
    asynchronous : boolean;
    address_width : positive;
    id_width : natural;
    -- If there is a known limitation on what lengths are set on the jobs, resources can
    -- be saved by specifying a lower value here.
    max_job_length_bytes : integer range 1 to axi_master_job_length_bytes_max_value
  );
  port (
    -- Clock for the 'job' interface.
    job_clk : in std_ulogic;
    -- Clock for the 'buffered_job' interface. Shall be assigned to same clock signal as 'job_clk'
    -- if 'asynchronous' is set to false.
    buffered_job_clk : in std_ulogic;
    --# {{}}
    job_ready : out std_ulogic;
    job_valid : in std_ulogic;
    job : in axi_master_job_t;
    --# {{}}
    buffered_job_ready : in std_ulogic;
    buffered_job_valid : out std_ulogic;
    buffered_job : out axi_master_job_t
  );
end component;

FIFO wrapper for AXI master jobs. Can be synchronous or asynchronous. Can also be omitted by setting depth to zero.

job_partitioner.vhd

component job_partitioner is
  generic (
    address_width : positive;
    -- Setting to true increases logic footprint
    remove_zero_length_input_jobs : boolean;
    -- Typically limited by AXI SIZE and LEN
    max_output_job_length_bytes : integer range 1 to boundary_4k_bytes;
    -- If there is a known limitation on what lengths are set on the input jobs, resources can
    -- be saved by specifying a lower value here.
    max_input_job_length_bytes : integer range 1 to axi_master_job_length_bytes_max_value
  );
  port (
    clk : in std_ulogic;
    --# {{}}
    input_ready : out std_ulogic;
    input_valid : in std_ulogic;
    input_job : in axi_master_job_t;
    --# {{}}
    output_ready : in std_ulogic;
    output_valid : out std_ulogic;
    output_job : out axi_master_job_t;
    --# {{}}
    too_long_input_job_error : out std_ulogic;
    -- Is always zero when remove_zero_length_input_jobs is set to true
    zero_length_input_job_error : out std_ulogic
  );
end component;

This entity makes sure that, based on unconstrained input jobs, the output jobs

  1. (Optional) Are not of length zero.

  2. Are not longer than max_output_job_length_bytes.

  3. Do not cross 4k address boundaries.

The first option is enabled by setting the remove_zero_length_input_jobs generic. It can be useful when this entity is used in conjunction with e.g. an axi_write_master_core, which cannot handle jobs of length zero. This is only necessary, however, when there is a risk of input jobs sent to this entity being of length zero (i.e. null jobs). If it is known beforehand that input jobs always have non-zero length, the generic can be disabled to save some resources.

In order to fulfill the last two constraints, the input jobs are split into smaller jobs (unless already compliant).

The first output job will be a (potentially) short job that aligns the address with max_output_job_length_bytes. After this, upcoming jobs can be sent out with the maximum length of max_output_job_length_bytes. The last job is shorter, unless it happens to line up exactly. Using this pattern, there is no need to monitor for 4k boundary crossings. This works because max_output_job_length_bytes is a power of two that is less than or equal to 4k.

An alternative, and probably the most intuitive, approach would be to use the maximum length max_output_job_length_bytes already from the first job, and then shorten only the last job. This does imply that we have to monitor for 4k boundary crossings.

From experimentation it has been found that the approach in this entity results in smaller logic footprint than the intuitive approach.

The intuitive approach can result in fewer output jobs in some scenarios though. Consider an example where max_output_job_length_bytes is, say, 256, input_job.length_bytes is 256 and input_job.address is 128. In this case, our method will result in two jobs (128 bytes up to the alignment boundary at address 256, then the remaining 128 bytes), while the intuitive approach will result in only one.

It is considered worth it to use this method, which is cheaper in terms of area. The increased number of jobs only occurs in a few cases, and is not expected to be significantly detrimental to memory throughput.
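
The two splitting strategies discussed above can be compared with a small behavioral model. The Python sketch below is an illustration of the described patterns, not a translation of the VHDL: the function names are hypothetical, jobs are modeled as (address, length_bytes) tuples, and a zero-length input simply yields no output jobs.

```python
from typing import List, Tuple

BOUNDARY_4K_BYTES = 4096

# A job is modeled as (address, length_bytes).
Job = Tuple[int, int]


def partition_aligning(address: int, length_bytes: int, max_length: int) -> List[Job]:
    """Split as described for this entity: first align the address, then use full length."""
    is_power_of_two = max_length > 0 and (max_length & (max_length - 1)) == 0
    if not is_power_of_two or max_length > BOUNDARY_4K_BYTES:
        raise ValueError("max_length must be a power of two that is at most 4k")

    jobs = []
    while length_bytes > 0:
        # Never go past the next multiple of max_length. Since 4096 is a multiple of
        # max_length, this implies that no job crosses a 4k boundary.
        bytes_to_alignment = max_length - (address % max_length)
        job_length = min(length_bytes, bytes_to_alignment)
        jobs.append((address, job_length))
        address += job_length
        length_bytes -= job_length
    return jobs


def partition_intuitive(address: int, length_bytes: int, max_length: int) -> List[Job]:
    """Alternative: full length from the first job, with explicit 4k boundary monitoring."""
    jobs = []
    while length_bytes > 0:
        bytes_to_4k = BOUNDARY_4K_BYTES - (address % BOUNDARY_4K_BYTES)
        job_length = min(length_bytes, max_length, bytes_to_4k)
        jobs.append((address, job_length))
        address += job_length
        length_bytes -= job_length
    return jobs
```

Calling both functions with address 128, length 256 and a maximum length of 256 gives two jobs for the aligning method and one job for the intuitive method. For an input that starts aligned, such as address 0 and length 600, both methods give the same three jobs.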

Resource utilization

This entity has netlist builds set up with automatic size checkers in module_axi_master.py. The following table lists the resource utilization for the entity, depending on generic configuration.

Resource utilization for job_partitioner.vhd netlist builds.

Generics                                              Total LUTs   Logic LUTs   FFs   DSP Blocks

address_width = 29                                    112          112          82    0
remove_zero_length_input_jobs = True
max_output_job_length_bytes = 2048
(Using wrapper job_partitioner_netlist_wrapper.vhd)

address_width = 29                                    80           80           71    0
remove_zero_length_input_jobs = True
max_output_job_length_bytes = 2048
max_input_job_length_bytes = 10240
(Using wrapper job_partitioner_netlist_wrapper.vhd)

address_width = 29                                    73           73           71    0
remove_zero_length_input_jobs = False
max_output_job_length_bytes = 2048
max_input_job_length_bytes = 10240
(Using wrapper job_partitioner_netlist_wrapper.vhd)