The intended audience comprises hardware or software developers who require an Accelerator Function (AF) to buffer data locally in memory connected to the Altera FPGA device.
Table 1. Document Conventions
#: Precedes a command to indicate that the command is to be entered as root.
$: Precedes a command to indicate that the command is to be entered as a user.
Filenames, commands, and keywords are printed in this font. Long command lines are printed in this font. Although long command lines may wrap to the next line, the return is not part of the command; do not press enter.
Indicates the placeholder text that appears between the angle brackets must be replaced with an appropriate value. Do not enter the angle brackets.
Table 2. Acceleration Stack for Intel® Xeon® CPU with FPGAs Glossary
Acceleration Stack for Intel® Xeon® CPU with FPGAs
A collection of software, firmware, and tools that provides performance-optimized connectivity between an Intel® FPGA and an Intel® Xeon® processor.
Intel® Programmable Acceleration Card with Intel® Arria® 10 GX FPGA
Intel® PAC with Arria® 10
PCIe accelerator card with an Intel® Arria® 10 FPGA. Programmable Acceleration Card is abbreviated PAC.
Contains an FPGA Interface Manager (FIM) that pairs with an Intel® Xeon® processor over the PCIe bus.
Intel® Xeon® Processor with Integrated FPGA
Integrated FPGA Platform
Intel® Xeon® plus FPGA platform with the Intel® Xeon® processor and an FPGA in a single package, sharing a coherent view of memory via Quick Path Interconnect (QPI).
Table 3. Acronyms
Hardware accelerator implemented in FPGA logic that accelerates or intends to accelerate an application.
Accelerator Functional Unit
The supplied implementation of an accelerator, typically in HDL.
Application Programming Interface
A set of subroutine definitions, protocols, and tools for building software applications.
Advanced Programmable Interrupt Controller
The set of advanced programmable interrupt controller features which may be implemented in a stand-alone controller, part of a system chipset, or in a microprocessor.
AFU Simulation Environment
Co-simulation environment that allows you to use the same host application and AF in a simulation environment. ASE is part of the .
Core Cache Interface
CCI-P is the hardware-side signaling interface between the AFU and the FPGA Interface Unit (FIU).
64-byte cache line
Device Feature Header
Creates a linked list of feature headers to provide an extensible way of adding features.
Device Status Memory
A memory page to share control and status information between the software and hardware.
FPGA Interface Manager (FIM)
The compiled bitstream containing the FPGA Interface Unit (FIU) and other interfaces such as external SDRAM.
FPGA Interface Unit (FIU)
The FIU connects the host and the AFU.
Memory Properties Factory
Optimizes CCI-P traffic before it reaches the FIU.
Message - a control notification
The NLB performs reads and writes to the CCI-P link to test connectivity and throughput.
Read Line Invalid
Memory Read Request, with FPGA cache hint set to invalid. The line is not cached in the FPGA, but may cause FPGA cache pollution.
Note: The cache tag tracks the request status for all outstanding requests on Intel® Ultra Path Interconnect (Intel® UPI). Therefore, even though RdLine_I is marked invalid upon completion, it consumes the cache tag temporarily to track the request status over UPI. This action may result in the eviction of a cache line, resulting in cache pollution. The advantage of using RdLine_I is that it is not tracked by the CPU directory, so it prevents snooping from the CPU.
Read Line Shared
Memory read request with FPGA cache hint set to shared. An attempt is made to keep it in the FPGA cache in a shared state.
Write Line Invalid
Memory Write Request, with FPGA cache hint set to Invalid. The FIU writes the data with no intention of keeping the data in FPGA cache.
Write Line Modified
Memory Write Request, with the FPGA cache hint set to Modified. The FIU writes the data and leaves it in the FPGA cache in a modified state.
DMA AFU Description
The Direct Memory Access (DMA) AFU example shows how to manage memory transfers between the host processor and the FPGA. You can integrate the DMA AFU into your design to move data between the host memory and the FPGA local memory. Connecting a local memory directly to the FPGA should improve performance significantly for applications that frequently access FPGA memory.
The DMA AFU comprises the following submodules:
The DMA basic building block (BBB)
The Core Cache Interface (CCI-P) to the Avalon® Memory-Mapped (Avalon®-MM) bridge
The memory properties factory
The asynchronous shim BBB
These submodules are described in more detail in the DMA AFU Hardware Components topic below.
The Acceleration Stack for Intel® Xeon® CPU with FPGAs package file, *.tar.gz, includes the DMA AFU example. This example provides a user space driver. The host application uses this driver to program the DMA to move data between host and FPGA memory. The hardware binaries, sources, and the user space driver are available in the following directory: <installation_path>/hw/samples/dma_afu.
Before experimenting with the DMA AFU, you must install the Open Programmable Acceleration Engine (OPAE) software package. Refer to Installing the OPAE Software Package in the Altera Acceleration Stack for Intel® Xeon® CPU with FPGAs Getting Started Guide for installation instructions. That guide also includes basic information about OPAE and about configuring an AFU.
After you install the OPAE software package, a sample host application and the DMA AFU are available in the following directory: <installation_path>/hw/samples/dma_afu/sw. A sample application, fpga_dma_test, implements the DMA AFU user space driver.
The DMA AFU interfaces with the FPGA Interface Unit (FIU) and two banks of local DDR4-SDRAM. The total memory addressable on the device is 8 gigabytes (8 GB). The memory comprises two 4 GB banks.
The local memory interfaces operate at one-quarter of the SDRAM clock frequency, which is 1067 MHz. Consequently, the local memory interfaces operate at 267 MHz.
Note: The currently available hardware dictates this memory configuration. Future hardware may support different memory configurations.
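The quarter-rate relationship between the SDRAM clock and the local memory interface clock can be checked with a few lines of C. This is only an illustrative sketch; the function name is hypothetical and is not part of the DMA AFU sources.

```c
#include <stdint.h>

/* Hypothetical helper (not part of the DMA AFU sources): local memory
 * interface clock in MHz, computed as one-quarter of the SDRAM clock and
 * rounded to the nearest integer. */
static inline uint32_t quarter_rate_mhz(uint32_t sdram_mhz)
{
    return (sdram_mhz + 2u) / 4u;
}
```

For the 1067 MHz SDRAM clock, 1067 / 4 = 266.75, which rounds to the 267 MHz interface clock quoted above.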
You can use the DMA AFU to copy data between the following source and destination locations:
The host to device local memory
Device local memory to the host
Device local memory to device local memory
A system, <installation_path>/hw/samples/dma_afu/hw/rtl/qsys/dma_test_system.qsys, implements most of the DMA AFU.
Figure 1. DMA AFU Hardware Block Diagram
The DMA AFU includes the following internal modules to interface with the FPGA Interface Unit (FIU):
Async Shim Basic Building Block (BBB): This module transfers the CCI-P transactions from the pCLK 400 MHz clock domain to the pCLKDiv2 200 MHz clock domain.
Memory Properties Factory (MPF): This module ensures that read responses from the DMA return in the order that they were issued. The Avalon®-MM protocol requires read responses to return in the correct order.
CCI-P to Avalon®-MM bridge: This module translates between CCI-P and Avalon®-MM transactions, as follows:
The CCI-P to Avalon®-MM path (MMIO path): This path translates CCI-P transactions to Avalon®-MM reads and writes.
Note: MMIO accesses do not support backpressure. As a result, the CCI-P to Avalon®-MM bridge cannot support the waitrequest signal. Altera recommends that you add an Avalon®-MM Clock Crossing Bridge, available in the IP Catalog, between the CCI-P to Avalon® MMIO Adapter master port and the DMA Test System Avalon®-MM slave port.
The Avalon®-MM to CCI-P path: This path creates separate read and write channels for the DMA to access host memory.
The CCI-P interface to the Avalon®-MM write slave includes an extra, high-order bit to implement write fences. When the high-order bit is set to 1’b1, the CCI-P bridge first issues a write fence. Then, the CCI-P bridge writes the data to the host physical address space with the high-order bit set to 1’b0. This operation allows the DMA to synchronize writes to host memory. The DMA BBB is not capable of receiving bus responses from the host.
DMA Test System
The DMA test system tests the DMA AFU.
Figure 2. DMA Test System Block Diagram
This block diagram shows the internals of the DMA Test System. The DMA Test System is shown as a monolithic block in Figure 1.
The DMA test system includes the following internal modules:
AFUID: This component stores the 64-bit Device Feature Header (DFH) and also includes the universally unique identifier (UUID). The AFU_ID_L register stores the lower 32 bits of the AFU ID. The AFU_ID_H register stores the upper 32 bits of the AFU ID. A software driver scans the DMA Test System, finds the AFU ID, and identifies the DMA BBB as the DMA component.
DMA Basic Building Block (BBB): This component moves data between the host and local device memory spaces. The DMA BBB interrupt connects to the IRQ 0 signal. The IRQ 0 signal is an input to the CCI-P to Avalon® bridge. The CCI-P to Avalon® bridge forwards the interrupt to the host.
Pipeline Bridge: The Pipeline Bridge inserts pipeline stages between memory-mapped IP cores. The default configuration optimizes for low latency. Consequently, the Pipeline Bridges improve the system FMAX at the expense of latency.
Clock Crossing Bridge: The Clock Crossing Bridge isolates Avalon®-MM masters and slaves that are in different clock domains. Because the Clock Crossing Bridge includes clock-crossing logic, it adds FIFOs that have a greater latency than the standard Pipeline Bridge.
The DMA BBB subsystem transfers data from source to destination addresses using memory-mapped transactions. The DMA AFU accesses control and status registers in the DMA BBB subsystem. The DMA BBB comprises five IP cores available in the IP Catalog as shown in the following figure.
Figure 3. DMA BBB Block Diagram
This block diagram excludes some internal Pipeline Bridge IP cores.
The components in the DMA BBB implement the following functions:
Modular Scatter-Gather DMA (MSGDMA): This IP core performs memory mapped transfers between source and destination addresses. The MSGDMA transfers 64 bytes per clock cycle. The data must be aligned to 64-byte boundaries. The transfer length must be a multiple of 64 bytes. The MSGDMA supports 50-bit addressing and can transfer up to 16,777,152 bytes per descriptor. In this implementation, the driver limits the transfer size to 1,047,552 bytes per descriptor.
Address Span Extender: This IP core implements memory transfers that are not aligned on a 64-byte boundary. The host uses it to perform MMIO accesses to FPGA device memory that are not aligned on a 64-byte boundary. The Address Span Extender accesses a 4 kilobyte (4 KB) window into the local device memory. The control port sets the base address of the 4 KB window. The base address must be aligned to a 4 KB boundary so that the window is aligned to the window size. For example, to access FPGA memory address 0xF340, set the window address to 0xF000 and then access offset 0x0340 within the address span extender data window.
BBB ID: This component stores the 64-bit Device Feature Header (DFH) and the UUID. The BBB_ID_L register stores the lower 32 bits of the BBB ID. The BBB_ID_H register stores the upper 32 bits of the BBB ID. A software driver scans the BBB ID to identify the functionality of this DMA subsystem.
Magic Number ROM: This IP core contains a single, read-only 64-byte value. The DMA uses this value to create a write fence in host memory. This ROM is only visible to the MSGDMA. The host cannot access it.
Pipeline Bridge: The Pipeline Bridge inserts pipeline stages to improve the system FMAX at the expense of latency.
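The per-descriptor limit that the driver applies to the MSGDMA means that a large transfer has to be split across several descriptors. The following C sketch shows one way to compute that split. It is not taken from fpga_dma.c; the macro and function names are hypothetical, and only the 1,047,552-byte limit comes from the description above.

```c
#include <stdint.h>

/* Per-descriptor limit applied by the driver, as stated above. */
#define DMA_MAX_DESC_BYTES 1047552ULL

/* Hypothetical helper: number of descriptors needed for len bytes. */
static inline uint64_t msgdma_descriptor_count(uint64_t len)
{
    return (len + DMA_MAX_DESC_BYTES - 1) / DMA_MAX_DESC_BYTES;
}

/* Hypothetical helper: size in bytes of descriptor idx (0-based) for a
 * transfer of len bytes. Returns 0 when idx is past the last descriptor. */
static inline uint64_t msgdma_descriptor_bytes(uint64_t len, uint64_t idx)
{
    uint64_t offset = idx * DMA_MAX_DESC_BYTES;
    if (offset >= len)
        return 0;
    uint64_t remaining = len - offset;
    return remaining < DMA_MAX_DESC_BYTES ? remaining : DMA_MAX_DESC_BYTES;
}
```

Because 1,047,552 is itself a multiple of 64, every chunk of a transfer whose length is a multiple of 64 bytes also remains a multiple of 64 bytes.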
Building the DMA AFU
Complete the following steps to build the DMA AFU. To be successful, the PATH environment variable must include the Intel® Quartus® Prime Pro Edition executable.
$ export PATH=$PATH:$DCP_LOC/bin
$ cd $DCP_LOC/hw/samples/dma_afu
Upon completion, the AF, dma_afu.gbs, is available in $DCP_LOC/hw/samples/dma_afu. Compilation can take up to 45 minutes on a high-performance host.
The clean.sh script removes the files generated during hardware compilation. To clean a previously built DMA AFU, repeat Steps 1 and 2, then run the clean.sh script in place of Step 3.
Register Map and Address Spaces
The DMA AFU supports a 50-bit address space. The lower half of the address map is the device memory space. The upper half is host memory space. The device memory space includes all the registers for the following IP cores:
The DMA BBB
The DMA AFU
The two local memory banks
The Magic Number ROM
The MMIO registers in the DMA BBB and AFU support 32- and 64-bit accesses. The DMA AFU does not support 512-bit accesses. Accesses to the MSGDMA registers inside the DMA BBB must be 32 bits.
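The access-width rules above can be captured in a small check. This is a sketch only; the enum and function names are hypothetical and do not appear in the driver sources.

```c
/* Hypothetical region tags for this sketch. */
enum mmio_region {
    MMIO_REGION_AFU_OR_BBB, /* general DMA AFU and DMA BBB registers */
    MMIO_REGION_MSGDMA      /* MSGDMA registers inside the DMA BBB */
};

/* Returns 1 if an MMIO access of width_bits is allowed for the region. */
static inline int mmio_width_supported(enum mmio_region region,
                                       unsigned width_bits)
{
    if (region == MMIO_REGION_MSGDMA)
        return width_bits == 32;                  /* 32-bit accesses only */
    return width_bits == 32 || width_bits == 64;  /* 512-bit is unsupported */
}
```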
DMA AFU Register Map
The DMA register map provides the absolute addresses of all the locations within the unit.
Moves the address window that the data port accesses.
Address Span Extender Data
Maps a 4 KB window to a local device memory.
Table 7. DMA BBB DFH Encoding
Feature ID. Set to 0.
AFU major revision number. Set to 0.
Next DFH byte offset / DFH region size. Set to 8192.
End of DFH list. When set, the DFH is at the end of the list. The default value is 0.
AFU minor revision number. Set to 0.
Feature type. Set to 2 (BBB).
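As an illustration of how software might decode these fields, the following C sketch extracts them from a 64-bit DFH value. The bit positions are an assumption: the table above does not list them, and the sketch uses the field layout commonly given for CCI-P device feature headers. Verify the positions against the CCI-P reference before relying on them. All function names are hypothetical.

```c
#include <stdint.h>

/* Extract a field of the given width starting at bit lsb. */
static inline uint64_t dfh_field(uint64_t dfh, unsigned lsb, unsigned width)
{
    return (dfh >> lsb) & ((1ULL << width) - 1);
}

/* ASSUMED bit positions (not stated in the table above). */
static inline uint64_t dfh_feature_id(uint64_t d)   { return dfh_field(d, 0, 12); }
static inline uint64_t dfh_major_rev(uint64_t d)    { return dfh_field(d, 12, 4); }
static inline uint64_t dfh_next_offset(uint64_t d)  { return dfh_field(d, 16, 24); }
static inline uint64_t dfh_end_of_list(uint64_t d)  { return dfh_field(d, 40, 1); }
static inline uint64_t dfh_minor_rev(uint64_t d)    { return dfh_field(d, 48, 4); }
static inline uint64_t dfh_feature_type(uint64_t d) { return dfh_field(d, 60, 4); }

/* DMA BBB DFH built from the values in the table: feature type 2 (BBB),
 * next DFH byte offset 8192, all other fields 0. */
static inline uint64_t dma_bbb_dfh(void)
{
    return (2ULL << 60) | (8192ULL << 16);
}
```

Under the assumed layout, the values in the table encode to 0x2000000020000000.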
DMA AFU Address Space
The host can access the registers listed in Table 1 and Table 3. Host accesses to FPGA local memory must use the Address Span Extender IP core included in the DMA BBB subsystem.
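The Address Span Extender exposes a 4 KB window whose base must be 4 KB aligned, so a host access to a local memory address splits into a window base and an offset inside the window. A minimal sketch of that arithmetic follows; the macro and function names are hypothetical and are not part of the driver.

```c
#include <stdint.h>

#define ASE_WINDOW_BYTES 0x1000ULL /* 4 KB Address Span Extender window */

/* Window base to program through the control port (4 KB aligned). */
static inline uint64_t ase_window_base(uint64_t addr)
{
    return addr & ~(ASE_WINDOW_BYTES - 1);
}

/* Offset to access within the data window. */
static inline uint64_t ase_window_offset(uint64_t addr)
{
    return addr & (ASE_WINDOW_BYTES - 1);
}
```

For FPGA memory address 0xF340, this gives a window base of 0xF000 and an offset of 0x340.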
The MSGDMA in the DMA BBB subsystem has access to the full 50-bit address space. The lower half of this address space includes the local memories and the Magic Number ROM. The upper half of this address space includes host memory.
The following figure shows the host and MSGDMA views of memory.
Figure 4. The DMA AFU and Host Views of Memory
Note: The Address Span Extender can only access the EMIF A and EMIF B address spaces shown in the figure above.
Note: The write fence aliased host memory, addresses 0x3_0000_0000_0000-0x3_FFFF_FFFF_FFFF, aliases to the host memory spanning 0x2_0000_0000_0000-0x2_FFFF_FFFF_FFFF. The write fence aliased host memory span is write only. Reads to this address space are undefined. Writes to the write fence aliased host memory cause a write fence to be issued followed by the write data accompanying it. This address space should only be written to infrequently to send write fences to synchronize with the host.
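The address ranges in this note imply simple bit tests on the MSGDMA's 50-bit address: bit 49 selects host memory, and bit 48 within the host half selects the write fence alias. The following C sketch derives those tests from the ranges quoted above. The names are hypothetical, and the sketch is not the driver's implementation.

```c
#include <stdint.h>

#define DMA_HOST_BASE 0x2000000000000ULL /* 0x2_0000_0000_0000 */
#define DMA_FENCE_BIT 0x1000000000000ULL /* bit 48: write fence alias */

/* Nonzero if addr lies in the upper (host) half of the 50-bit space. */
static inline int dma_is_host_addr(uint64_t addr)
{
    return (addr >> 49) == 1;
}

/* Nonzero if addr lies in the write fence aliased host span. */
static inline int dma_is_fence_alias(uint64_t addr)
{
    return (addr >> 48) == 3;
}

/* Write-only alias of a host address; a write here issues a fence first. */
static inline uint64_t dma_fence_alias(uint64_t host_addr)
{
    return host_addr | DMA_FENCE_BIT;
}

/* Host address that a fence-aliased address maps to. */
static inline uint64_t dma_fence_target(uint64_t alias_addr)
{
    return alias_addr & ~DMA_FENCE_BIT;
}
```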
Software Programming Model
The DMA AFU includes a software driver that you can use in your own host application. The fpga_dma.c and fpga_dma.h files located in the <installation_path>/hw/samples/dma_afu/sw directory implement the software driver.
This driver supports the following functions:
fpgaDmaOpen(): Opens a handle to the DMA BBB.
fpgaDmaTransferSync(): Transfers data from a source location to a destination location. The source and destination can be located in host or device memory.
fpgaDmaClose(): Closes the DMA BBB handle previously allocated.
fpgaDmaOpen() scans the device feature chain to locate the DMA BBB and then creates a handle for the DMA BBB.
dma: Input specifying the DMA handle obtained from fpgaDmaOpen().
dst: Input specifying the destination byte address of the transfer. To maximize performance, make dst a multiple of 64 bytes.
src: Input specifying the source byte address of the transfer. To maximize performance, make src a multiple of 64 bytes.
count: Input specifying the length of the transfer in bytes. To maximize performance, make count a multiple of 64 bytes.
type: Input specifying the type of transfer. type has the following valid values: HOST_TO_FPGA_MM, FPGA_TO_HOST_MM, or FPGA_TO_FPGA_MM.
FPGA_OK on success; otherwise error return code.
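Because the transfer performs best when the destination, source, and length are all multiples of 64 bytes, a host application may want to check its arguments before starting a transfer. The helpers below are a hedged sketch: they are hypothetical, they are not part of fpga_dma.h, and the argument types are assumptions.

```c
#include <stddef.h>
#include <stdint.h>

#define DMA_ALIGN_BYTES 64u /* alignment recommended for dst, src, and count */

/* Returns 1 if dst, src, and count are all multiples of 64 bytes. */
static inline int dma_args_are_aligned(uint64_t dst, uint64_t src, size_t count)
{
    /* The OR of the values has its low 6 bits clear only if each value does. */
    return ((dst | src | (uint64_t)count) % DMA_ALIGN_BYTES) == 0;
}

/* Rounds count up to the next multiple of 64 bytes, for sizing buffers. */
static inline size_t dma_round_up_count(size_t count)
{
    return (count + DMA_ALIGN_BYTES - 1) / DMA_ALIGN_BYTES * DMA_ALIGN_BYTES;
}
```

An application could allocate its buffers with a length of dma_round_up_count(n) so that the whole transfer stays on the 64-byte aligned path.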
fpgaDmaClose() closes the previously allocated DMA BBB handle.
fpga_result fpgaDmaClose(fpga_dma_handle dma)
dma: Input containing the DMA handle obtained from fpgaDmaOpen().
FPGA_OK on success; otherwise error return code.
The software driver included in this package does not support the following features:
Asynchronous transfers: Initiate all transfers with the blocking fpgaDmaTransferSync() API.
Running DMA AFU Example
Before running this example, you should be familiar with the examples in the Intel Acceleration Stack Quick Start Guide for Intel Programmable Acceleration Card with Intel Arria 10 GX FPGA. The DCP_LOC and OPAE_LOC environment variables must be set.
Complete the following steps to download the DMA AF bitstream and build and run the example software:
If you have not already done so, configure the system to allocate the 20 hugepages of 2 MB each that this utility requires. This command requires root privileges:
# sudo sh -c "echo 20 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages"