.. _module_axi_master:

Module axi_master
=================

This document contains technical documentation for the ``axi_master`` module.

An AXI read/write master is a convenient tool for performing memory operations from your
FPGA design.
It handles the complexity of performing raw AXI transactions and instead presents a very simple
interface to the user.
This type of module is often referred to as a "data mover" as well.

An AXI write master has a ``job`` interface where the user specifies a burst length as well as a
target memory address.
Parallel to this is a ``data`` interface where the data that shall be saved in memory is streamed
using a ``ready`` / ``valid`` handshake interface.
The AXI write master will adapt the jobs internally to make sure that all operations adhere to
the AXI standard.

The AXI read master works analogously, with a ``job`` interface and a ``data`` interface.
In this case, the ``data`` interface streams data from memory to the user.


Module structure
----------------

This module has two top levels that are an integration of the sub-entities.
See :ref:`axi_read_master <axi_master.axi_read_master>` and
:ref:`axi_write_master <axi_master.axi_write_master>` for detailed documentation
and block diagrams.


Performance
-----------

This design achieves 100% utilization of the data channels (``R`` and ``W``).
This is done by having full separation of the address and data/response channels.
For :ref:`axi_write_master <axi_master.axi_write_master>` this can be controlled via a generic,
since the logic footprint is a little higher when full throughput must be supported.


Limitations
-----------

1. These AXI4 signals are not included in the interfaces, and are assumed to be constant:

   - Lock type: ``AxLOCK``
   - Memory type: ``AxCACHE``
   - Protection type: ``AxPROT``
   - Quality of service: ``AxQOS``
   - Region identifier: ``AxREGION``
   - User-defined signaling: ``AxUSER`` and ``xUSER``

2. The AXI standard demands that there be no combinatorial paths between input and output
   handshake signals (``ready`` and ``valid``).
   This rule is not honored in this module, since honoring it increases logic footprint and is
   not necessary to reach timing.

3. The module does not have any reset functionality.
   The design targets modern SRAM-based FPGAs, where initial values can be used and there is no
   need for reset.


Resource utilization
--------------------

The top levels and sub-entities of this module feature generics for data width, address width
and ID width.
For these generics, a higher value will result in greater logic footprint.
Special care should be taken to specify exactly the address and ID width that is actually needed.

In most use cases the ID is not used, so the ID width can be set to zero.
Specifically for :ref:`axi_write_master <axi_master.axi_write_master>`, the ID width has a
large impact.
Setting it to zero is very beneficial.


Handshake interface
-------------------

This module uses handshaking for data qualification on the ``job`` and ``data`` interfaces.

.. Note that this file, which is in REPO_ROOT/fpga/doc, is copied into RST build directory by
  documentation script.
  Needs to have a .txt extension for a technical reason listed in the
  copy_files_needed_by_sphinx_build() Python function.

.. include:: ../../fpga/doc/axi_stream_handshake_rules.rst.txt


.. _axi_master.axi_master_pkg:

axi_master_pkg.vhd
------------------

Package with types and utility functions for the AXI master ecosystem.


.. _axi_master.axi_read_master:

axi_read_master.vhd
-------------------

.. symbolator::

  component axi_read_master is
    generic (
      address_width : positive;
      id_width : natural;
      data_width : positive;
      -- Typically 256 or 16.
      max_axi_burst_length_beats : positive;
      -- Set to 'false' if job/data interfaces are in different clock domains than the AXI interface
      clocks_are_the_same : boolean;
      -- Depth of the job FIFO. Can be set to zero to disable FIFO.
      -- Must be non-zero if 'clocks_are_the_same' is false.
      job_fifo_depth : natural;
      -- Depth of the data FIFO. Can be set to zero to disable FIFO.
      -- Must be non-zero if 'clocks_are_the_same' is false.
      data_fifo_depth : natural;
      -- Set to 'true' to support the case where job.length_bytes is not a multiple of data_width / 8.
      support_unaligned_length : boolean;
      -- Setting to true increases logic footprint
      remove_zero_length_input_jobs : boolean;
      -- If there is a known limitation on what lengths are set on the input jobs, resources can
      -- be saved by specifying a lower value here.
      max_job_length_bytes : integer range 1 to axi_master_job_length_bytes_max_value
    );
    port (
      -- Clock for the job and data interfaces.
      clk : in std_ulogic;
      -- Clock for the AXI interface. Shall be assigned to same clock signal as 'clk'
      -- if 'clocks_are_the_same' is set to true.
      axi_clk : in std_ulogic;
      --# {{}}
      job_ready : out std_ulogic;
      job_valid : in std_ulogic;
      job : in axi_master_job_t;
      --# {{}}
      data_ready : in std_ulogic;
      data_valid : out std_ulogic;
      data : out std_ulogic_vector;
      data_id : out u_unsigned;
      data_resp : out std_ulogic_vector;
      data_last : out std_ulogic;
      data_strb : out std_ulogic_vector;
      --# {{}}
      axi_read_m2s : out axi_read_m2s_t;
      axi_read_s2m : in axi_read_s2m_t
    );
  end component;

Top level for AXI read master that instantiates a
:ref:`job_partitioner <axi_master.job_partitioner>` and an
:ref:`axi_read_master_core <axi_master.axi_read_master_core>`.
It also features optional FIFOs to provide buffering, and any clock domain crossing that is needed.

.. digraph:: my_graph

  graph [ dpi = 300 ];
  rankdir="LR";
  splines=ortho;

  input_job [ label="job" shape=none ];
  output_data [ label="data" shape=none ];

  job_fifo [ label="" shape=none image="fifo.png"];
  data_fifo [ label="" shape=none image="fifo.png"];
  { rank=same;job_fifo;data_fifo; }

  input_job -> job_fifo;
  output_data -> data_fifo [ dir="back" ];

  job_partitioner [ label="job_partitioner" shape=box];
  job_fifo -> job_partitioner;

  axi_read_master_core [ label="axi_read_master_core" shape=box height=1.3 ];
  job_partitioner -> axi_read_master_core;
  data_fifo -> axi_read_master_core [ dir="back" ];

  axi [ label="AXI" shape=box height=2 ];

  # The rendering is royally messed up here. The labels are switched.
  # Visually verify result when changing anything in this graph.
  axi_read_master_core -> axi [ label="AR" dir="back" ];
  axi_read_master_core -> axi [ label="R" ];

By setting ``job_fifo_depth`` and ``data_fifo_depth`` the amount of buffering can be controlled.
If either of the values is set to zero, that FIFO is omitted.
If the ``clocks_are_the_same`` generic is set to ``false`` the FIFOs will be asynchronous, which
provides the necessary clock crossing.

Apart from the generics discussed above, the remaining generics are the same as for the
:ref:`job_partitioner <axi_master.job_partitioner>` and
:ref:`axi_read_master_core <axi_master.axi_read_master_core>`.
See those entities for further documentation.

Resource utilization
____________________

This entity has `netlist builds `__ set up with `automatic size checkers `__
in ``module_axi_master.py``.
The following table lists the resource utilization for the entity, depending on
generic configuration.

.. list-table:: Resource utilization for axi_read_master.vhd netlist builds.
  :header-rows: 1

  * - Generics
    - Total LUTs
    - FFs
    - RAMB36
    - RAMB18
    - DSP Blocks
  * - (Using wrapper axi_read_master_netlist_wrapper.vhd)
    - 233
    - 338
    - 4
    - 0
    - 0


.. _axi_master.axi_read_master_core:

axi_read_master_core.vhd
------------------------

.. symbolator::

  component axi_read_master_core is
    generic (
      address_width : positive;
      id_width : natural;
      data_width : positive;
      -- Typically 256 or 16.
      max_burst_length_beats : positive;
      -- Set to 'true' to support the case where job.length_bytes is not a multiple of data_width / 8.
      support_unaligned_length : boolean
    );
    port (
      clk : in std_ulogic;
      --# {{}}
      job_ready : out std_ulogic;
      job_valid : in std_ulogic;
      job : in axi_master_job_t;
      --# {{}}
      data_ready : in std_ulogic;
      data_valid : out std_ulogic;
      data : out std_ulogic_vector;
      data_id : out u_unsigned;
      data_resp : out std_ulogic_vector;
      data_last : out std_ulogic;
      data_strb : out std_ulogic_vector;
      --# {{}}
      axi_read_m2s : out axi_read_m2s_t;
      axi_read_s2m : in axi_read_s2m_t
    );
  end component;

Create AXI ``AR`` and ``R`` transactions from a stream of ``job`` s.

The design is pipelined and has full separation between the ``AR`` and ``R`` channels.
This achieves 100% utilization of the ``R`` channel, with no cycles wasted.
The ``AR`` channel can have at most a 50% utilization, which means that this entity can accept a
new ``job`` every second cycle.

.. warning::
  This entity assumes that the incoming jobs are valid in an AXI sense:

  1. The jobs must not be of length zero.
  2. The jobs must not cross a 4k address boundary.
  3. The jobs must not be longer than ``max_burst_length_beats``.

  In cases where this is not guaranteed, a :ref:`job_partitioner <axi_master.job_partitioner>`
  can be used to adapt the jobs before being sent to this entity.
  This is always done in :ref:`axi_read_master <axi_master.axi_read_master>`.

  This entity also assumes that ``job.address`` is aligned with ``data_width``.

The generic ``support_unaligned_length`` shall be set based on the ``job`` length characteristics.
If ``job.length_bytes`` is always a multiple of ``data_width / 8`` then it can be set to ``false``.
If the length does not fulfill this condition in all cases however, the generic must be set
to ``true``.
Enabling the generic does increase the logic footprint marginally (~10 LUT).

Resource utilization
____________________

This entity has `netlist builds `__ set up with `automatic size checkers `__
in ``module_axi_master.py``.
The following table lists the resource utilization for the entity, depending on
generic configuration.

.. list-table:: Resource utilization for axi_read_master_core.vhd netlist builds.
  :header-rows: 1

  * - Generics
    - Total LUTs
    - Logic LUTs
    - FFs
    - DSP Blocks
  * - data_width = 32

      address_width = 29

      id_width = 8

      max_burst_length_beats = 256

      support_unaligned_length = False
    - 31
    - 29
    - 62
    - 0
  * - data_width = 32

      address_width = 29

      id_width = 8

      max_burst_length_beats = 256

      support_unaligned_length = True
    - 38
    - 34
    - 66
    - 0


.. _axi_master.axi_write_master:

axi_write_master.vhd
--------------------

.. symbolator::

  component axi_write_master is
    generic (
      address_width : positive;
      id_width : natural;
      data_width : positive;
      -- Typically 256 or 16.
      max_axi_burst_length_beats : positive;
      clocks_are_the_same : boolean;
      -- Depth of the job FIFO. Can be set to zero to disable FIFO.
      -- Must be non-zero if 'clocks_are_the_same' is false.
      job_fifo_depth : natural;
      -- Depth of the data FIFO. Can be set to zero to disable FIFO.
      -- Must be non-zero if 'clocks_are_the_same' is false.
      data_fifo_depth : natural;
      -- Set to 'true' to support the case where job.length_bytes is not a multiple of data_width / 8.
      support_unaligned_length : boolean;
      -- Setting to true increases logic footprint
      remove_zero_length_input_jobs : boolean;
      -- If there is a known limitation on what lengths are set on the input jobs, resources can
      -- be saved by specifying a lower value here.
      max_job_length_bytes : integer range 1 to axi_master_job_length_bytes_max_value;
      -- This enables full separation of the AW/W channels, which improves throughput.
      enable_full_throughput : boolean;
      -- For AXI3 operation this must be set to true in order for the WID field to be set correctly.
      set_wid : boolean
    );
    port (
      -- Clock for the job, data and job_done interfaces.
      clk : in std_ulogic;
      -- Clock for the AXI interface. Shall be assigned to same clock signal as 'clk'
      -- if 'clocks_are_the_same' is set to true.
      axi_clk : in std_ulogic;
      --# {{}}
      job_ready : out std_ulogic;
      job_valid : in std_ulogic;
      job : in axi_master_job_t;
      --# {{}}
      data_ready : out std_ulogic;
      data_valid : in std_ulogic;
      data : in std_ulogic_vector;
      --# {{}}
      job_done_valid : out std_ulogic;
      job_done_id : out u_unsigned;
      --# {{}}
      axi_write_m2s : out axi_write_m2s_t;
      axi_write_s2m : in axi_write_s2m_t
    );
  end component;

Top level for AXI write master that instantiates a
:ref:`job_partitioner <axi_master.job_partitioner>` and an AXI write master core.

.. digraph:: my_graph

  graph [ dpi = 300 ];
  rankdir="LR";
  splines=ortho;

  input_job [ label="job" shape=none ];
  input_data [ label="data" shape=none ];

  job_fifo [ label="" shape=none image="fifo.png"];
  data_fifo [ label="" shape=none image="fifo.png"];
  { rank=same;job_fifo;data_fifo; }

  input_job -> job_fifo;
  input_data -> data_fifo;

  job_partitioner [ label="job_partitioner" shape=box];
  job_fifo -> job_partitioner;

  axi_write_master_core [ label="axi_write_master_core" shape=box height=1.3 ];
  job_partitioner -> axi_write_master_core;
  data_fifo -> axi_write_master_core;

  axi [ label="AXI" shape=box height=2 ];

  # The rendering is royally messed up here. The labels are switched.
  # Verify result when changing anything in this graph.
  axi_write_master_core -> axi [ label="AW" dir="back" ];
  axi_write_master_core -> axi [ label="W" ];
  axi_write_master_core -> axi [ label="B" ];

By setting ``job_fifo_depth`` and ``data_fifo_depth`` the amount of buffering can be controlled.
If either of the values is set to zero, that FIFO is omitted.
If the ``clocks_are_the_same`` generic is set to ``false`` the FIFOs will be asynchronous, which
provides the necessary clock crossing.

There is a generic ``enable_full_throughput`` which controls a tradeoff between performance and
logic footprint.
If it is set to ``true`` a
:ref:`axi_write_master_core_full_throughput <axi_master.axi_write_master_core_full_throughput>`
will be instantiated that has full separation of ``AW``/``W``/``B`` channels which enables 100%
utilization of the ``W`` channel.
If the generic is set to ``false`` a
:ref:`axi_write_master_core <axi_master.axi_write_master_core>` will be instantiated that has at
least two cycles of overhead per burst.

Apart from the generics discussed above, the remaining generics are the same as for the
:ref:`job_partitioner <axi_master.job_partitioner>` and
:ref:`axi_read_master_core <axi_master.axi_read_master_core>`.
See those entities for further documentation.

Resource utilization
____________________

This entity has `netlist builds `__ set up with `automatic size checkers `__
in ``module_axi_master.py``.
The following table lists the resource utilization for the entity, depending on
generic configuration.

.. list-table:: Resource utilization for axi_write_master.vhd netlist builds.
  :header-rows: 1

  * - Generics
    - Total LUTs
    - FFs
    - RAMB36
    - RAMB18
    - DSP Blocks
  * - address_width = 32

      id_width = 8

      data_width = 64

      max_axi_burst_length_beats = 256

      clocks_are_the_same = False

      job_fifo_depth = 16

      data_fifo_depth = 2048

      support_unaligned_length = False

      remove_zero_length_input_jobs = False

      max_job_length_bytes = 65535

      enable_full_throughput = False

      set_wid = False
    - 291
    - 405
    - 4
    - 0
    - 0


.. _axi_master.axi_write_master_core:

axi_write_master_core.vhd
-------------------------

.. symbolator::

  component axi_write_master_core is
    generic (
      address_width : positive;
      id_width : natural;
      data_width : positive;
      -- Typically 256 or 16.
      max_burst_length_beats : positive;
      -- Set to 'true' to support the case where job.length_bytes is not a multiple of data_width / 8.
      support_unaligned_length : boolean;
      -- For AXI3. Will increase logic footprint.
      set_wid : boolean;
      -- Control the metadata FIFO, which is written when an AW transaction occurs, and popped when
      -- a B transaction occurs. This indirectly controls the maximum number of
      -- outstanding transactions.
      -- The default value is somewhat arbitrarily chosen, but should not be limiting in any realistic
      -- situation. Also going down to e.g. 8 does not save any considerable amount of FIFO logic.
      metadata_fifo_depth : positive
    );
    port (
      clk : in std_ulogic;
      --# {{}}
      job_ready : out std_ulogic;
      job_valid : in std_ulogic;
      job : in axi_master_job_t;
      --# {{}}
      data_ready : out std_ulogic;
      data_valid : in std_ulogic;
      data : in std_ulogic_vector;
      --# {{}}
      -- Leaving job_done_ready at its default value will minimize logic utilization.
      job_done_ready : in std_ulogic;
      job_done_valid : out std_ulogic;
      job_done_id : out u_unsigned;
      --# {{}}
      axi_write_m2s : out axi_write_m2s_t;
      axi_write_s2m : in axi_write_s2m_t
    );
  end component;

Create AXI ``AW`` and ``W`` transactions from a stream of ``job`` s and ``data``.
AXI3 compliance can be enabled via the ``set_wid`` generic.

Note that this design does not achieve full utilization of the ``W`` channel.
The ``AW`` channel will not perform a new transaction until the ``W`` channel has finished its
previous burst.
Assuming there is a FIFO on the ``AW`` channel, this still leaves two cycles overhead
per ``job``.

This could be amended if the ``AW`` and ``W`` channels are separated by having a FIFO that holds
``awlen`` and ``last_beat_strb``.
When this information is saved in the FIFO, the state machine could pop a new ``job`` and perform
a new ``AW`` transaction.
This adds a small amount of LUTs and flip-flops to the design.

.. warning::
  This entity assumes that the incoming jobs are valid in an AXI sense:

  1. The jobs must not be of length zero.
  2. The jobs must not cross a 4k address boundary.
  3. The jobs must not be longer than ``max_burst_length_beats``.

  In cases where this is not guaranteed, a :ref:`job_partitioner <axi_master.job_partitioner>`
  can be used to adapt the jobs before being sent to this entity.
  This is always done in :ref:`axi_write_master <axi_master.axi_write_master>`.

  This entity also assumes that ``job.address`` is aligned with ``data_width``.

The generic ``support_unaligned_length`` shall be set based on the ``job`` length characteristics.
If ``job.length_bytes`` is always a multiple of ``data_width / 8`` then it can be set to ``false``.
If the length does not fulfill this condition in all cases however, the generic must be set
to ``true``.
Enabling the generic does increase the logic footprint marginally (~10 LUT).

Resource utilization
____________________

This entity has `netlist builds `__ set up with `automatic size checkers `__
in ``module_axi_master.py``.
The following table lists the resource utilization for the entity, depending on
generic configuration.

.. list-table:: Resource utilization for axi_write_master_core.vhd netlist builds.
  :header-rows: 1

  * - Generics
    - Total LUTs
    - Logic LUTs
    - FFs
    - DSP Blocks
  * - data_width = 32

      address_width = 29

      id_width = 8

      max_burst_length_beats = 256

      support_unaligned_length = False

      set_wid = False
    - 51
    - 49
    - 71
    - 0
  * - data_width = 32

      address_width = 29

      id_width = 8

      max_burst_length_beats = 256

      support_unaligned_length = True

      set_wid = False
    - 54
    - 52
    - 74
    - 0


.. _axi_master.axi_write_master_core_full_throughput:

axi_write_master_core_full_throughput.vhd
-----------------------------------------

.. symbolator::

  component axi_write_master_core_full_throughput is
    generic (
      address_width : positive;
      id_width : natural;
      data_width : positive;
      -- Typically 256 or 16.
      max_burst_length_beats : positive;
      -- Set to 'true' to support the case where job.length_bytes is not a multiple of data_width / 8.
      support_unaligned_length : boolean;
      -- For AXI3. Will increase logic footprint and decrease throughput.
      set_wid : boolean;
      -- Control the metadata FIFOs, which are written when an AW transaction occurs. One of them is
      -- popped when a WLAST occurs, and the other is popped when a B transaction occurs.
      -- This indirectly controls the maximum number of outstanding transactions.
      -- The default value is somewhat arbitrarily chosen, but should not be limiting in any realistic
      -- situation. Also going down to e.g. 8 does not save any considerable amount of FIFO logic.
      metadata_fifo_depth : positive
    );
    port (
      clk : in std_ulogic;
      --# {{}}
      job_ready : out std_ulogic;
      job_valid : in std_ulogic;
      job : in axi_master_job_t;
      --# {{}}
      data_ready : out std_ulogic;
      data_valid : in std_ulogic;
      data : in std_ulogic_vector;
      --# {{}}
      -- Leaving job_done_ready at its default value will minimize logic utilization.
      job_done_ready : in std_ulogic;
      job_done_valid : out std_ulogic;
      job_done_id : out u_unsigned;
      --# {{}}
      axi_write_m2s : out axi_write_m2s_t;
      axi_write_s2m : in axi_write_s2m_t
    );
  end component;

Like :ref:`axi_write_master_core <axi_master.axi_write_master_core>` but has full separation of
``AW`` and ``W`` channels, which achieves 100% utilization of the ``W`` channel.

Resource utilization
____________________

This entity has `netlist builds `__ set up with `automatic size checkers `__
in ``module_axi_master.py``.
The following table lists the resource utilization for the entity, depending on
generic configuration.

.. list-table:: Resource utilization for axi_write_master_core_full_throughput.vhd netlist builds.
  :header-rows: 1

  * - Generics
    - Total LUTs
    - Logic LUTs
    - FFs
    - DSP Blocks
  * - data_width = 32

      address_width = 29

      id_width = 8

      max_burst_length_beats = 256

      support_unaligned_length = True

      set_wid = False
    - 74
    - 64
    - 92
    - 0


.. _axi_master.job_fifo:

job_fifo.vhd
------------

.. symbolator::

  component job_fifo is
    generic (
      -- Depth of the FIFO. Can be set to zero to disable FIFO.
      depth : natural;
      --
      asynchronous : boolean;
      address_width : positive;
      id_width : natural;
      -- If there is a known limitation on what lengths are set on the jobs, resources can
      -- be saved by specifying a lower value here.
      max_job_length_bytes : integer range 1 to axi_master_job_length_bytes_max_value
    );
    port (
      -- Clock for the 'job' interface.
      job_clk : in std_ulogic;
      -- Clock for the 'buffered_job' interface. Shall be assigned to same clock signal as 'job_clk'
      -- if 'asynchronous' is set to false.
      buffered_job_clk : in std_ulogic;
      --# {{}}
      job_ready : out std_ulogic;
      job_valid : in std_ulogic;
      job : in axi_master_job_t;
      --# {{}}
      buffered_job_ready : in std_ulogic;
      buffered_job_valid : out std_ulogic;
      buffered_job : out axi_master_job_t
    );
  end component;

FIFO wrapper for AXI master jobs.
Can be synchronous or asynchronous.
Can also be omitted by setting ``depth`` to zero.


.. _axi_master.job_partitioner:

job_partitioner.vhd
-------------------

.. symbolator::

  component job_partitioner is
    generic (
      address_width : positive;
      -- Setting to true increases logic footprint
      remove_zero_length_input_jobs : boolean;
      -- Typically limited by AXI SIZE and LEN
      max_output_job_length_bytes : integer range 1 to boundary_4k_bytes;
      -- If there is a known limitation on what lengths are set on the input jobs, resources can
      -- be saved by specifying a lower value here.
      max_input_job_length_bytes : integer range 1 to axi_master_job_length_bytes_max_value
    );
    port (
      clk : in std_ulogic;
      --# {{}}
      input_ready : out std_ulogic;
      input_valid : in std_ulogic;
      input_job : in axi_master_job_t;
      --# {{}}
      output_ready : in std_ulogic;
      output_valid : out std_ulogic;
      output_job : out axi_master_job_t;
      --# {{}}
      too_long_input_job_error : out std_ulogic;
      -- Is always zero when remove_zero_length_input_jobs is set to true
      zero_length_input_job_error : out std_ulogic
    );
  end component;

This entity makes sure that, based on unconstrained input jobs, the output jobs

1. (Optional) Are not of length zero.
2. Are not longer than ``max_output_job_length_bytes``.
3. Do not cross 4k address boundaries.

The first option is enabled by setting the ``remove_zero_length_input_jobs`` generic.
It can be useful when this entity is used in conjunction with e.g. a
:ref:`axi_write_master_core <axi_master.axi_write_master_core>`, which cannot handle jobs of
length zero.
This is only necessary, however, when there is a risk of input jobs sent to this entity being of
length zero (i.e. null jobs).
If it is known beforehand that input jobs always have non-zero length, the generic can be disabled
to save some resources.

In order to fulfill the latter two constraints, the input jobs are split into smaller jobs
(unless already compliant).
The first output job will be a (potentially) short job that aligns the address with
``max_output_job_length_bytes``.
After this, upcoming jobs can be sent out with the maximum length
of ``max_output_job_length_bytes``.
The last job is shorter, unless it happens to line up exactly.
Using this pattern, there is no need to monitor for 4k boundary crossings.
This works based on the fact that ``max_output_job_length_bytes`` is a power of two that is less
than or equal to 4k.

An alternative, and probably the most intuitive, approach would be to use the maximum length
``max_output_job_length_bytes`` already from the first job, and then shorten only the last job.
This does imply that we have to monitor for 4k boundary crossings.
From experimentation it has been found that the approach in this entity results in smaller logic
footprint than the intuitive approach.

The intuitive approach can result in fewer output jobs in some scenarios though.
Consider the example: ``input_job.length_bytes`` is 256, while ``input_job.address`` is 128.
In this case, our method will result in two jobs, while the intuitive approach will result in
only one.

It is considered worth it to use this method, which is cheaper in terms of area.
The increased number of jobs only happens in a few cases, and is not estimated to be significantly
detrimental to memory throughput.

Resource utilization
____________________

This entity has `netlist builds `__ set up with `automatic size checkers `__
in ``module_axi_master.py``.
The following table lists the resource utilization for the entity, depending on
generic configuration.

.. list-table:: Resource utilization for job_partitioner.vhd netlist builds.
  :header-rows: 1

  * - Generics
    - Total LUTs
    - Logic LUTs
    - FFs
    - DSP Blocks
  * - address_width = 29

      remove_zero_length_input_jobs = True

      max_output_job_length_bytes = 2048

      (Using wrapper job_partitioner_netlist_wrapper.vhd)
    - 112
    - 112
    - 82
    - 0
  * - address_width = 29

      remove_zero_length_input_jobs = True

      max_output_job_length_bytes = 2048

      max_input_job_length_bytes = 10240

      (Using wrapper job_partitioner_netlist_wrapper.vhd)
    - 80
    - 80
    - 71
    - 0
  * - address_width = 29

      remove_zero_length_input_jobs = False

      max_output_job_length_bytes = 2048

      max_input_job_length_bytes = 10240

      (Using wrapper job_partitioner_netlist_wrapper.vhd)
    - 73
    - 73
    - 71
    - 0
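
Behavioral sketch of the partitioning scheme
____________________________________________

The splitting pattern described for ``job_partitioner`` above can be illustrated with a small
behavioral model in Python.
This is only a sketch under stated assumptions, not the VHDL implementation: the function names
``partition_job`` and ``partition_job_intuitive`` are hypothetical, jobs are modeled as
``(address, length_bytes)`` tuples, and zero-length input jobs are simply dropped, as when
``remove_zero_length_input_jobs`` is enabled.
The comparison at the end assumes ``max_output_job_length_bytes`` is 256, which is the value for
which the example with address 128 and length 256 gives two jobs versus one.

```python
from typing import List, Tuple

BOUNDARY_4K_BYTES = 4096

# A job is modeled as (address, length_bytes). This is an illustration only.
Job = Tuple[int, int]


def _check_max_length(max_output_job_length_bytes: int) -> None:
    """The scheme relies on the maximum length being a power of two that is <= 4k."""
    is_power_of_two = max_output_job_length_bytes > 0 and (
        max_output_job_length_bytes & (max_output_job_length_bytes - 1) == 0
    )
    if not is_power_of_two or max_output_job_length_bytes > BOUNDARY_4K_BYTES:
        raise ValueError("max_output_job_length_bytes must be a power of two <= 4096")


def partition_job(
    address: int, length_bytes: int, max_output_job_length_bytes: int
) -> List[Job]:
    """Align-first approach: never cross a multiple of the maximum length."""
    _check_max_length(max_output_job_length_bytes)

    jobs = []
    remaining = length_bytes
    while remaining > 0:
        # Bytes left until the next multiple of the maximum length.
        # For an already aligned address, this equals the maximum length.
        bytes_to_alignment = max_output_job_length_bytes - (
            address % max_output_job_length_bytes
        )
        chunk = min(remaining, bytes_to_alignment)
        jobs.append((address, chunk))
        address += chunk
        remaining -= chunk

    return jobs


def partition_job_intuitive(
    address: int, length_bytes: int, max_output_job_length_bytes: int
) -> List[Job]:
    """Alternative approach: maximum length from the start, with explicit 4k monitoring."""
    _check_max_length(max_output_job_length_bytes)

    jobs = []
    remaining = length_bytes
    while remaining > 0:
        bytes_to_4k = BOUNDARY_4K_BYTES - (address % BOUNDARY_4K_BYTES)
        chunk = min(remaining, max_output_job_length_bytes, bytes_to_4k)
        jobs.append((address, chunk))
        address += chunk
        remaining -= chunk

    return jobs


if __name__ == "__main__":
    print(partition_job(128, 256, 256))
    print(partition_job_intuitive(128, 256, 256))
```

Because every multiple of 4k is also a multiple of any power of two up to 4k, a job that never
crosses a multiple of ``max_output_job_length_bytes`` cannot cross a 4k boundary either.
This is why the first function needs no 4k comparison, while the second one does.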