Module axi_master
This document contains technical documentation for the axi_master module.
An AXI read/write master is a convenient tool for performing memory operations from your FPGA design. It handles the complexity of performing raw AXI transactions and instead presents a very simple interface to the user. This type of module is often referred to as a “data mover” as well.
An AXI write master has a job interface where the user specifies a burst length as well as a target memory address. Parallel to this is a data interface where the data that shall be saved in memory is streamed using a ready/valid handshake interface. The AXI write master will adapt the jobs internally to make sure that all operations adhere to the AXI standard.
The AXI read master works analogously, with a job interface and a data interface. In this case, the data interface streams data from memory to the user.
Module structure
This module has two top levels that are an integration of the sub-entities. See axi_read_master and axi_write_master for detailed documentation and block diagrams.
Performance
This design achieves 100% utilization of the data channels (R and W). This is done by having full separation of the address and data/response channels. For axi_write_master this can be controlled via a generic, since the logic footprint is a little higher when full throughput must be supported.
Limitations
These AXI4 signals are not included in the interfaces, and are assumed to be constant:
Lock type: AxLOCK
Memory type: AxCACHE
Protection type: AxPROT
Quality of service: AxQOS
Region identifier: AxREGION
User-defined signaling: AxUSER and xUSER
The AXI standard demands that there be no combinatorial paths between input and output handshake signals (ready and valid). This rule is not honored in this module, since it increases logic footprint and is not necessary to reach timing.
The module does not have any reset functionality. The design targets modern SRAM-based FPGAs, where initial values can be used and there is no need for reset.
Resource utilization
The top-levels and sub-entities of this module feature generics for data width, address width and ID width. For these generics, a higher value will result in greater logic footprint. Special care should be taken to specify exactly the address and ID width that is actually needed. In most use cases the ID is not used, so the ID width can be set to zero.
Specifically for axi_write_master, the ID width has a large impact. Setting it to zero is very beneficial.
Handshake interface
This module uses handshaking for data qualification on the job and data interfaces. Using AXI4-Stream-like handshake interfaces (ready and valid to qualify data transactions) is very common in FPGA designs. It enables a backpressure situation where the slave, i.e. the receiver of data, can indicate when it is ready to receive the data.
Below are some rules governing how these handshake signals interact. They are adapted from the AMBA 4 AXI4-Stream Protocol Specification, ARM IHI 0051A (ID030610).
A transaction occurs on the positive edge of the clock when both ready and valid are high. The graph below shows some typical transactions.
The ready signal may fall without a transaction having occurred.
The valid signal may NOT fall without a transaction having occurred.
Once valid is asserted, the associated data may NOT be changed unless a transaction has occurred.
This applies to any auxiliary signals associated with the bus as well, e.g. a last indicator.
Note also that this restriction on data not changing only applies when valid is asserted. When it is not, the data may be changed freely.
In order to avoid deadlock situations, the master may NOT wait for the slave to assert ready before asserting valid. The slave however may wait for valid before asserting ready.
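The rules above can be illustrated with a small behavioral model. The following Python sketch is not part of the module; the Cycle type and the check_handshake and count_transactions functions are hypothetical names introduced only for illustration. It assumes a trace where each entry holds the signal values sampled at one positive clock edge, and it flags the two things the master must not do: dropping valid, or changing data, before a transaction has occurred.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Cycle:
    """Signal values sampled at one positive clock edge (hypothetical model)."""

    ready: bool
    valid: bool
    data: Optional[int] = None  # Only meaningful while 'valid' is high.


def check_handshake(trace: List[Cycle]) -> List[str]:
    """Return a list of rule violations found in the trace (empty if none)."""
    errors = []
    for index in range(1, len(trace)):
        previous, current = trace[index - 1], trace[index]
        # A transaction occurred on the previous edge if both signals were high.
        transaction = previous.ready and previous.valid
        if previous.valid and not transaction:
            if not current.valid:
                errors.append(f"cycle {index}: valid fell without a transaction")
            elif current.data != previous.data:
                errors.append(f"cycle {index}: data changed while valid was pending")
    return errors


def count_transactions(trace: List[Cycle]) -> int:
    """Count the clock edges where both ready and valid are high."""
    return sum(1 for cycle in trace if cycle.ready and cycle.valid)


# The master holds valid and data steady until the slave asserts ready.
good_trace = [
    Cycle(ready=False, valid=True, data=5),
    Cycle(ready=False, valid=True, data=5),
    Cycle(ready=True, valid=True, data=5),
    Cycle(ready=True, valid=False),
    Cycle(ready=False, valid=False),
]

# The master gives up before the slave has accepted the data.
bad_trace = [
    Cycle(ready=False, valid=True, data=1),
    Cycle(ready=False, valid=False),
]
```

Note that the checker deliberately says nothing about ready falling: the slave is free to deassert ready at any time, so only the valid and data rules can be violated.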
axi_master_pkg.vhd
Package with types and utility functions for the AXI master ecosystem.
axi_read_master.vhd
Top level for AXI read master that instantiates a job_partitioner and an axi_read_master_core. It also features optional FIFOs to provide buffering, and any clock domain crossing that is needed.
By setting job_fifo_depth and data_fifo_depth the amount of buffering can be controlled. If either of the values is set to zero, that FIFO is omitted. If the clocks_are_the_same generic is set to false the FIFOs will be asynchronous, which provides the necessary clock crossing.
Apart from the generics discussed above, the remaining generics are the same as those of the job_partitioner and axi_read_master_core. See those entities for further documentation.
Resource utilization
This entity has netlist builds set up with automatic size checkers in module_axi_master.py. The following table lists the resource utilization for the entity, depending on generic configuration.
Generics | Total LUTs | FFs | RAMB36 | RAMB18 | DSP Blocks |
---|---|---|---|---|---|
(Using wrapper axi_read_master_netlist_wrapper.vhd) | 233 | 338 | 4 | 0 | 0 |
axi_read_master_core.vhd
Create AXI AR and R transactions from a stream of jobs. The design is pipelined and has full separation between the AR and R channels. This achieves 100% utilization of the R channel, with no cycles wasted. The AR channel can have at most a 50% utilization, which means that this entity can accept a new job every second cycle.
Warning
This entity assumes that the incoming jobs are valid in an AXI sense:
The jobs must not be of length zero.
The jobs must not cross a 4k address boundary.
The jobs must not be longer than max_burst_length_beats.
In cases where this is not guaranteed, a job_partitioner can be used to adapt the jobs before being sent to this entity. This is always done in axi_read_master.
This entity also assumes that job.address is aligned with data_width.
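The assumptions listed above can be summarized as a simple predicate. The Python sketch below is only an illustration and not part of the module; the function name job_is_valid is hypothetical. It assumes that data_width is given in bits, that "4k" means 4096 bytes, and that "aligned with data_width" means the address is a multiple of data_width / 8 bytes.

```python
def job_is_valid(
    address: int,
    length_bytes: int,
    data_width: int,
    max_burst_length_beats: int,
) -> bool:
    """Check the job assumptions of the core (hypothetical helper, not from the module)."""
    bytes_per_beat = data_width // 8

    # The job must not be of length zero.
    if length_bytes == 0:
        return False

    # The first and the last byte must lie within the same 4096-byte page.
    if address // 4096 != (address + length_bytes - 1) // 4096:
        return False

    # Number of data beats, rounded up so a partial last beat is counted.
    num_beats = -(-length_bytes // bytes_per_beat)
    if num_beats > max_burst_length_beats:
        return False

    # Assumed reading of "aligned with data_width": a whole number of beats.
    return address % bytes_per_beat == 0
```

A predicate like this could, for example, be used in a testbench or in host-side software to decide whether jobs can be sent straight to the core or need to pass through a job_partitioner first.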
The generic support_unaligned_length shall be set based on the job length characteristics. If job.length_bytes is always a multiple of data_width / 8 then it can be set to false. If the length does not fulfill this condition in all cases however, the generic must be set to true. Enabling the generic does increase the logic footprint marginally (~10 LUT).
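As a minimal sketch of the condition above, the following Python helper decides the value of the generic from a set of job lengths that are known beforehand. The function name is hypothetical and data_width is assumed to be given in bits.

```python
from typing import Iterable


def needs_unaligned_length_support(job_lengths_bytes: Iterable[int], data_width: int) -> bool:
    """Return True if any job length is not a multiple of data_width / 8 bytes."""
    bytes_per_beat = data_width // 8
    return any(length % bytes_per_beat != 0 for length in job_lengths_bytes)
```

If the set of possible job lengths is not known at build time, the safe choice is to set the generic to true and accept the small increase in logic footprint.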
Resource utilization
This entity has netlist builds set up with automatic size checkers in module_axi_master.py. The following table lists the resource utilization for the entity, depending on generic configuration.
Generics | Total LUTs | Logic LUTs | FFs | DSP Blocks |
---|---|---|---|---|
data_width = 32, address_width = 29, id_width = 8, max_burst_length_beats = 256, support_unaligned_length = False | 31 | 29 | 62 | 0 |
data_width = 32, address_width = 29, id_width = 8, max_burst_length_beats = 256, support_unaligned_length = True | 38 | 34 | 66 | 0 |
axi_write_master.vhd
Top level for AXI write master that instantiates a job_partitioner and an AXI write master core.
By setting job_fifo_depth and data_fifo_depth the amount of buffering can be controlled. If either of the values is set to zero, that FIFO is omitted. If the clocks_are_the_same generic is set to false the FIFOs will be asynchronous, which provides the necessary clock crossing.
There is a generic enable_full_throughput which controls a tradeoff between performance and logic footprint. If it is set to true, an axi_write_master_core_full_throughput will be instantiated, which has full separation of the AW/W/B channels and thereby enables 100% utilization of the W channel. If the generic is set to false, an axi_write_master_core will be instantiated, which has at least two cycles of overhead per burst.
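To get a feeling for what the per-burst overhead costs, the following Python sketch computes an upper bound on the W channel utilization. It is an idealized model under stated assumptions, not a measurement: it assumes exactly the minimum overhead of two cycles per burst, that data is always available, and that the AXI slave never stalls. The function name is hypothetical.

```python
def max_w_utilization(burst_length_beats: int, overhead_cycles: int = 2) -> float:
    """Upper bound on W utilization when each burst costs 'overhead_cycles' idle cycles."""
    if burst_length_beats < 1:
        raise ValueError("A burst must contain at least one beat")
    return burst_length_beats / (burst_length_beats + overhead_cycles)


# Print the bound for a few burst lengths.
for beats in (1, 16, 256):
    print(f"{beats:3d} beats: at most {100 * max_w_utilization(beats):.1f}% utilization")
```

In this model the overhead is small for long bursts but significant for short ones, which suggests that enable_full_throughput matters most when many short jobs are expected.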
Apart from the generics discussed above, the remaining generics are the same as those of the job_partitioner and the instantiated write master core (axi_write_master_core or axi_write_master_core_full_throughput). See those entities for further documentation.
Resource utilization
This entity has netlist builds set up with automatic size checkers in module_axi_master.py. The following table lists the resource utilization for the entity, depending on generic configuration.
Generics | Total LUTs | FFs | RAMB36 | RAMB18 | DSP Blocks |
---|---|---|---|---|---|
address_width = 32, id_width = 8, data_width = 64, max_axi_burst_length_beats = 256, clocks_are_the_same = False, job_fifo_depth = 16, data_fifo_depth = 2048, support_unaligned_length = False, remove_zero_length_input_jobs = False, max_job_length_bytes = 65535, enable_full_throughput = False, set_wid = False | 291 | 405 | 4 | 0 | 0 |
axi_write_master_core.vhd
Create AXI AW and W transactions from a stream of jobs and data. AXI3 compliance can be enabled via the set_wid generic.
Note that this design does not achieve full utilization of the W channel. The AW channel will not perform a new transaction until the W channel has finished its previous burst. Assuming there is a FIFO on the AW channel, this still leaves two cycles of overhead per job.
This could be amended if the AW and W channels are separated by having a FIFO that holds awlen and last_beat_strb. When this information is saved in the FIFO, the state machine could pop a new job and perform a new AW transaction. This would add a small amount of LUTs and flip-flops to the design.
Warning
This entity assumes that the incoming jobs are valid in an AXI sense:
The jobs must not be of length zero.
The jobs must not cross a 4k address boundary.
The jobs must not be longer than max_burst_length_beats.
In cases where this is not guaranteed, a job_partitioner can be used to adapt the jobs before being sent to this entity. This is always done in axi_write_master.
This entity also assumes that job.address is aligned with data_width.
The generic support_unaligned_length shall be set based on the job length characteristics. If job.length_bytes is always a multiple of data_width / 8 then it can be set to false. If the length does not fulfill this condition in all cases however, the generic must be set to true. Enabling the generic does increase the logic footprint marginally (~10 LUT).
Resource utilization
This entity has netlist builds set up with automatic size checkers in module_axi_master.py. The following table lists the resource utilization for the entity, depending on generic configuration.
Generics | Total LUTs | Logic LUTs | FFs | DSP Blocks |
---|---|---|---|---|
data_width = 32, address_width = 29, id_width = 8, max_burst_length_beats = 256, support_unaligned_length = False, set_wid = False | 51 | 49 | 71 | 0 |
data_width = 32, address_width = 29, id_width = 8, max_burst_length_beats = 256, support_unaligned_length = True, set_wid = False | 54 | 52 | 74 | 0 |
axi_write_master_core_full_throughput.vhd
Like axi_write_master_core but has full separation of AW and W channels, which achieves 100% utilization of the W channel.
Resource utilization
This entity has netlist builds set up with automatic size checkers in module_axi_master.py. The following table lists the resource utilization for the entity, depending on generic configuration.
Generics | Total LUTs | Logic LUTs | FFs | DSP Blocks |
---|---|---|---|---|
data_width = 32, address_width = 29, id_width = 8, max_burst_length_beats = 256, support_unaligned_length = True, set_wid = False | 74 | 64 | 92 | 0 |
job_fifo.vhd
FIFO wrapper for AXI master jobs. Can be synchronous or asynchronous. Can also be omitted by setting depth to zero.
job_partitioner.vhd
This entity makes sure that, based on unconstrained input jobs, the output jobs:
(Optional) Are not of length zero.
Are not longer than max_output_job_length_bytes.
Do not cross 4k address boundaries.
The first option is enabled by setting the remove_zero_length_input_jobs generic. It can be useful when this entity is used in conjunction with e.g. an axi_write_master_core, which can not handle jobs of length zero. This is only necessary, however, when there is a risk of input jobs sent to this entity being of length zero (i.e. null jobs). If it is known beforehand that input jobs always have non-zero length, the generic can be disabled to save some resources.
In order to fulfill the last two constraints, the input jobs are split into smaller jobs (unless already compliant). The first output job will be a (potentially) short job that aligns the address with max_output_job_length_bytes. After this, upcoming jobs can be sent out with the maximum length of max_output_job_length_bytes. The last job is shorter, unless it happens to line up exactly. Using this pattern, there is no need to monitor for 4k boundary crossings. This works based on the fact that max_output_job_length_bytes is a power of two that is less than or equal to 4k.
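The splitting pattern described above can be modeled in a few lines of Python. This is a behavioral sketch only, not the VHDL implementation; the function name partition_job and the representation of a job as an (address, length_bytes) tuple are assumptions made for illustration. In this sketch, a zero-length input job simply yields no output jobs.

```python
from typing import List, Tuple


def partition_job(
    address: int, length_bytes: int, max_output_job_length_bytes: int
) -> List[Tuple[int, int]]:
    """Split one input job into (address, length_bytes) output jobs, aligning first."""
    max_length = max_output_job_length_bytes
    # The pattern relies on the maximum being a power of two that is at most 4k.
    assert max_length > 0 and max_length & (max_length - 1) == 0
    assert max_length <= 4096

    jobs = []
    while length_bytes > 0:
        # Bytes left until the next multiple of the maximum length. When the
        # address is already aligned, this equals the maximum length itself.
        bytes_to_boundary = max_length - address % max_length
        job_length = min(length_bytes, bytes_to_boundary)
        jobs.append((address, job_length))
        address += job_length
        length_bytes -= job_length
    return jobs


def crosses_4k(address: int, length_bytes: int) -> bool:
    """True if the first and last byte of the job are in different 4096-byte pages."""
    return address // 4096 != (address + length_bytes - 1) // 4096
```

Because every output job ends at or before a multiple of max_output_job_length_bytes, and 4096 is itself such a multiple, no output job can cross a 4k boundary. For example, partition_job(4000, 200, 2048) gives a 96-byte job ending exactly at address 4096, followed by a 104-byte job starting there.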
An alternative, and probably the most intuitive, approach would be to use the maximum length max_output_job_length_bytes already from the first job, and then shorten only the last job. This does imply that we have to monitor for 4k boundary crossings.
From experimentation it has been found that the approach in this entity results in smaller logic footprint than the intuitive approach. The intuitive approach can result in fewer output jobs in some scenarios though. Consider the example: input_job.length_bytes is 256, while input_job.address is 128. In this case, our method will result in two jobs, while the intuitive approach will result in only one.
It is considered worth it to use this method, that is cheaper in terms of area. The increased number of jobs only happens in a few cases, and is not estimated to be significantly detrimental to memory throughput.
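The difference in job count can be checked with a small Python model of the intuitive approach. This is a sketch for comparison only; it is not what the entity implements, and the function names are hypothetical. The example in the text does not state the maximum length, so the sketch assumes max_output_job_length_bytes = 256, which is the value for which the stated job counts hold.

```python
from typing import List, Tuple


def partition_job_intuitive(
    address: int, length_bytes: int, max_output_job_length_bytes: int
) -> List[Tuple[int, int]]:
    """Use the maximum length from the first job, but never cross a 4k boundary."""
    jobs = []
    while length_bytes > 0:
        # This approach must explicitly monitor the distance to the next 4k boundary.
        bytes_to_4k = 4096 - address % 4096
        job_length = min(length_bytes, max_output_job_length_bytes, bytes_to_4k)
        jobs.append((address, job_length))
        address += job_length
        length_bytes -= job_length
    return jobs


def count_jobs_aligned_first(
    address: int, length_bytes: int, max_output_job_length_bytes: int
) -> int:
    """Number of output jobs with the align-first pattern (non-zero length assumed).

    Each output job covers one window of max_output_job_length_bytes aligned bytes,
    so the count is the number of such windows that the input job touches.
    """
    first_window = address // max_output_job_length_bytes
    last_window = (address + length_bytes - 1) // max_output_job_length_bytes
    return last_window - first_window + 1


# The example from the text, with the assumed maximum length of 256 bytes.
intuitive_jobs = partition_job_intuitive(128, 256, 256)
aligned_first_count = count_jobs_aligned_first(128, 256, 256)
```

With these inputs the intuitive approach yields the single job (128, 256), while the align-first pattern touches two 256-byte windows and therefore yields two jobs.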
Resource utilization
This entity has netlist builds set up with automatic size checkers in module_axi_master.py. The following table lists the resource utilization for the entity, depending on generic configuration.
Generics | Total LUTs | Logic LUTs | FFs | DSP Blocks |
---|---|---|---|---|
address_width = 29, remove_zero_length_input_jobs = True, max_output_job_length_bytes = 2048 (Using wrapper job_partitioner_netlist_wrapper.vhd) | 112 | 112 | 82 | 0 |
address_width = 29, remove_zero_length_input_jobs = True, max_output_job_length_bytes = 2048, max_input_job_length_bytes = 10240 (Using wrapper job_partitioner_netlist_wrapper.vhd) | 80 | 80 | 71 | 0 |
address_width = 29, remove_zero_length_input_jobs = False, max_output_job_length_bytes = 2048, max_input_job_length_bytes = 10240 (Using wrapper job_partitioner_netlist_wrapper.vhd) | 73 | 73 | 71 | 0 |