DMA Accelerator Functional Unit (AFU) User Guide
Published: 2018/3/17
About this Document
Intended Audience
The intended audience comprises hardware or software developers that require an Accelerator Function (AF) to buffer data locally in memory connected to the Altera FPGA device.
Conventions
Convention | Description |
---|---|
# | Precedes a command that indicates the command is to be entered as root. |
$ | Indicates a command is to be entered as a user. |
This font | Filenames, commands, and keywords are printed in this font. Long command lines are printed in this font. Although long command lines may wrap to the next line, the return is not part of the command; do not press enter. |
<variable_name> | Indicates the placeholder text that appears between the angle brackets must be replaced with an appropriate value. Do not enter the angle brackets. |
Acceleration Glossary
Term | Abbreviation | Description |
---|---|---|
Acceleration Stack for Intel® Xeon® CPU with FPGAs | Acceleration Stack | A collection of software, firmware, and tools that provides performance-optimized connectivity between an Intel® FPGA and an Intel® Xeon® processor. |
Intel® Programmable Acceleration Card with Intel® Arria® 10 GX FPGA | Intel® PAC with Arria® 10 | PCIe accelerator card with an Intel® Arria® 10 FPGA. Programmable Acceleration Card is abbreviated PAC. Contains an FPGA Interface Manager (FIM) that pairs with an Intel® Xeon® processor over the PCIe bus. |
Intel® Xeon® Processor with Integrated FPGA | Integrated FPGA Platform | Intel® Xeon® plus FPGA platform with the Intel® Xeon® processor and an FPGA in a single package, sharing a coherent view of memory via Quick Path Interconnect (QPI). |
Acronyms
Acronyms | Expansion | Description |
---|---|---|
AF | Accelerator Function | Hardware accelerator implemented in FPGA logic that accelerates or intends to accelerate an application. |
AFU | Accelerator Functional Unit | The supplied implementation of an accelerator, typically in HDL. |
API | Application Programming Interface | A set of subroutine definitions, protocols, and tools for building software applications. |
APIC | Advanced Programmable Interrupt Controller | The set of advanced programmable interrupt controller features which may be implemented in a stand-alone controller, part of a system chipset, or in a microprocessor. |
ASE | AFU Simulation Environment | Co-simulation environment that allows you to use the same host application and AF in a simulation environment. ASE is part of the . |
CCI-P | Core Cache Interface | CCI-P is the hardware-side signaling interface between the AFU and the FPGA Interface Unit (FIU). |
CL | Cache Line | 64-byte cache line |
DFH | Device Feature Header | Creates a linked list of feature headers to provide an extensible way of adding features. |
DSM | Device Status Memory | A memory page to share control and status information between the software and hardware. |
FIM | FPGA Interface Manager (FIM) | The compiled bitstream containing the FPGA Interface Unit (FIU) and other interfaces such as external SDRAM. |
FIU | FPGA Interface Unit (FIU) | The FIU connects the host and the AFU. |
MPF | Memory Properties Factory | Optimizes CCI-P traffic before it reaches the FIU. |
Msg | Message | Message - a control notification |
NLB | Native Loopback | The NLB performs reads and writes to the CCI-P link to test connectivity and throughput. |
RdLine_I | Read Line Invalid | Memory read request with the FPGA cache hint set to invalid. The line is not cached in the FPGA, but may cause FPGA cache pollution. Note: The cache tag tracks the request status for all outstanding requests on Intel® Ultra Path Interconnect (Intel® UPI). Therefore, even though RdLine_I is marked invalid upon completion, it consumes the cache tag temporarily to track the request status over UPI. This action may result in the eviction of a cache line, resulting in cache pollution. The advantage of using RdLine_I is that it is not tracked by the CPU directory, which prevents snooping from the CPU. |
RdLine_S | Read Line Shared | Memory read request with the FPGA cache hint set to shared. An attempt is made to keep the line in the FPGA cache in a shared state. |
WrLine_I | Write Line Invalid | Memory Write Request, with FPGA cache hint set to Invalid. The FIU writes the data with no intention of keeping the data in FPGA cache. |
WrLine_M | Write Line Modified | Memory Write Request, with the FPGA cache hint set to Modified. The FIU writes the data and leaves it in the FPGA cache in a modified state. |
DMA AFU Description
Introduction
The Direct Memory Access (DMA) AFU example shows how to manage memory transfers between the host processor and the FPGA. You can integrate the DMA AFU into your design to move data between the host memory and the FPGA local memory. Connecting a local memory directly to the FPGA should improve performance significantly for applications that frequently access FPGA memory.
The DMA AFU comprises the following submodules:
- The DMA basic building block (BBB)
- The Core Cache Interface (CCI-P) to Avalon® Memory-Mapped (Avalon®-MM) bridge
- The memory properties factory
- The asynchronous shim BBB
These submodules are described in more detail in the DMA AFU Hardware Components topic below.
The DMA AFU Software Package
The Acceleration Stack for Intel® Xeon® CPU with FPGAs package file, *.tar.gz, includes the DMA AFU example. This example provides a user space driver. The host application uses this driver to move data between host and FPGA memory with the DMA. The hardware binaries, sources, and the user space driver are available in the following directory: <installation_path>/hw/samples/dma_afu.
Before experimenting with the DMA AFU, you must install the Open Programmable Acceleration Engine (OPAE) software package. Refer to Installing the OPAE Software Package in the Altera Acceleration Stack for Intel® Xeon® CPU with FPGAs Getting Started Guide for installation instructions. That guide also includes basic information about OPAE and configuring an AFU.
After you install the OPAE software package, a sample host application and the DMA AFU are available in the following directory: <installation_path>/hw/samples/dma_afu/sw. A sample application, fpga_dma_test, implements the DMA AFU user space driver.
The DMA AFU Hardware Components
The DMA AFU interfaces with the FPGA Interface Unit (FIU) and two banks of local DDR4-SDRAM. The total memory addressable on the device is 8 gigabytes (8 GB), organized as two 4 GB banks.
The local memory interfaces operate at one-quarter of the SDRAM clock frequency, which is 1067 MHz. Consequently, the local memory interface operates at 267 MHz.
You can use the DMA AFU to copy data between the following source and destination locations:
- The host to device local memory
- Device local memory to the host
- Device local memory to device local memory
A system, <installation_path>/hw/samples/dma_afu/hw/rtl/qsys/dma_test_system.qsys, implements most of the DMA AFU.
The DMA AFU includes the following internal modules to interface with the FPGA Interface Unit (FIU):
- Async Shim Basic Building Block (BBB): This module transfers the CCI-P transactions from the pCLK 400 MHz clock domain to the pCLKDiv2 200 MHz clock domain.
- Memory Properties Factory (MPF): This module ensures that read responses from the DMA return in the order that they were issued. The Avalon®-MM protocol requires read responses to return in the correct order.
- CCI-P to Avalon®-MM bridge: This module translates between CCI-P and Avalon®-MM transactions, as follows:
  - The CCI-P to Avalon®-MM path (MMIO path): This path translates CCI-P transactions to Avalon®-MM reads and writes. Note: MMIO accesses do not support backpressure. As a result, the CCI-P to Avalon®-MM bridge cannot support the waitrequest signal. Altera recommends that you add an Avalon®-MM Clock Crossing Bridge, available in the IP Catalog, between the CCI-P to Avalon® MMIO Adapter master port and the DMA Test System Avalon®-MM slave port.
  - The Avalon®-MM to CCI-P path: This path creates separate read and write channels for the DMA to access host memory.
    The CCI-P interface to the Avalon®-MM write slave includes an extra, high-order bit to implement write fences. When the high-order bit is set to 1’b1, the CCI-P bridge first issues a write fence. Then, the CCI-P bridge writes data to the host physical address space with the high-order bit set to 1’b0. This operation allows the DMA to synchronize writes to host memory. The DMA BBB is not capable of receiving bus responses from the host.
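The write-fence convention described above can be modeled in a few lines of C. This is only a sketch: the guide does not state how wide the write-slave address is, so `HOST_ADDR_BITS` (48 here) and all function names are assumptions made for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* ASSUMPTION: width of the host address carried by the write slave.
 * The guide only says the fence flag is "an extra, high-order bit". */
#define HOST_ADDR_BITS 48u

#define FENCE_BIT (UINT64_C(1) << HOST_ADDR_BITS)
#define ADDR_MASK (FENCE_BIT - 1u)

/* Address a master would present to request a fence before the write. */
static uint64_t with_write_fence(uint64_t host_addr)
{
    return (host_addr & ADDR_MASK) | FENCE_BIT;
}

/* Bridge-side decode: is the extra high-order bit set? */
static bool fence_requested(uint64_t slave_addr)
{
    return (slave_addr & FENCE_BIT) != 0;
}

/* Bridge-side decode: host address with the fence bit cleared (1'b0). */
static uint64_t host_address(uint64_t slave_addr)
{
    return slave_addr & ADDR_MASK;
}
```

Under this model, a write to `with_write_fence(a)` is treated as "fence first, then write to `a`", while a write to plain `a` is an ordinary write.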
DMA Test System
The DMA test system includes the following internal modules:
- AFUID: This component stores the 64-bit Device Feature Header (DFH) and also includes the universally unique identifier (UUID). The AFU_ID_L register stores the lower 64 bits of the AFU ID. The AFU_ID_H register stores the upper 64 bits of the AFU ID. A software driver scans the DMA Test System, finds the AFU ID, and identifies the DMA BBB as the DMA component.
- DMA Basic Building Block (BBB): This component moves data between the host and local device memory spaces. The DMA BBB interrupt connects to the IRQ 0 signal. The IRQ 0 signal is an input to the CCI-P to Avalon® bridge. The CCI-P to Avalon® bridge forwards the interrupt to the host.
- Pipeline Bridge: The Pipeline Bridge inserts pipeline stages between memory-mapped IP cores. The default configuration optimizes for low latency; adding Pipeline Bridges improves the system FMAX at the expense of latency.
- Clock Crossing Bridge: The Clock Crossing Bridge isolates Avalon®-MM masters and slaves that are in different clock domains. Because the Clock Crossing Bridge includes clock-crossing logic, it adds FIFOs that have a greater latency than the standard Pipeline Bridge.
DMA BBB
The DMA BBB subsystem transfers data from source to destination addresses using memory-mapped transactions. The DMA AFU accesses control and status registers in the DMA BBB subsystem. The DMA BBB comprises five IP cores available in the IP Catalog as shown in the following figure.
The components in the DMA BBB implement the following functions:
- Modular Scatter-Gather DMA (MSGDMA): This IP core performs memory mapped transfers between source and destination addresses. The MSGDMA transfers 64 bytes per clock cycle. The data must be aligned to 64-byte boundaries. The transfer length must be a multiple of 64 bytes. The MSGDMA supports 50-bit addressing and can transfer up to 16,777,152 bytes per descriptor. In this implementation, the driver limits the transfer size to 1,047,552 bytes per descriptor.
- Address Span Extender: This IP core implements memory transfers that are not aligned on a 64-byte boundary. The host uses it to perform MMIO accesses to FPGA device memory that are not aligned on a 64-byte boundary. The Address Span Extender accesses a 4 kilobyte (4 KB) window into the local device memory. The control port sets the base address of the 4 KB window. The base address must be aligned to a 4 KB boundary so that the window is aligned to the window size. For example, to access FPGA memory address 0xF340, set the window address to 0xF000 and then access offset 0x0340 within the address span extender data window.
- BBB ID: This component stores the 64-bit Device Feature Header (DFH) and the UUID. The BBB_ID_L register stores the lower 64 bits of the BBB ID. The BBB_ID_H register stores the upper 64 bits of the BBB ID. A software driver scans the BBB ID to identify the functionality of this DMA subsystem.
- Magic Number ROM: This IP core contains a single, read-only 64-byte value. The DMA uses this value to create a write fence in host memory. This ROM is only visible to the MSGDMA. The host cannot access it.
- Pipeline Bridge: The Pipeline Bridge inserts pipeline stages to improve the system FMAX at the expense of latency.
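The MSGDMA constraints listed above (64-byte alignment, length a multiple of 64 bytes, and the driver's limit of 1,047,552 bytes per descriptor) imply that a large transfer must be split across several descriptors. The following C sketch shows one way to compute that split. The function and macro names are hypothetical, and the shipped driver may partition transfers differently.

```c
#include <stdbool.h>
#include <stdint.h>

#define DMA_ALIGN_BYTES    UINT64_C(64)       /* MSGDMA alignment and length granule */
#define DMA_MAX_DESC_BYTES UINT64_C(1047552)  /* per-descriptor limit applied by the driver */

/* True when both the address and the length satisfy the 64-byte rule. */
static bool dma_is_aligned(uint64_t addr, uint64_t len)
{
    return (addr % DMA_ALIGN_BYTES == 0) && (len % DMA_ALIGN_BYTES == 0);
}

/* Number of descriptors needed for len bytes (ceiling division). */
static uint64_t dma_descriptor_count(uint64_t len)
{
    return (len + DMA_MAX_DESC_BYTES - 1) / DMA_MAX_DESC_BYTES;
}

/* Byte count carried by descriptor `index` (0-based); 0 when past the end. */
static uint64_t dma_descriptor_bytes(uint64_t len, uint64_t index)
{
    uint64_t start = index * DMA_MAX_DESC_BYTES;
    if (start >= len)
        return 0;
    uint64_t left = len - start;
    return left < DMA_MAX_DESC_BYTES ? left : DMA_MAX_DESC_BYTES;
}
```

Because 1,047,552 is itself a multiple of 64, every descriptor after the first starts on a 64-byte boundary whenever the overall transfer does. For example, a 2 MiB (2,097,152-byte) transfer needs three descriptors: two of 1,047,552 bytes and a final one of 2,048 bytes.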
Building the DMA AFU
- $ export PATH=$PATH:$DCP_LOC/bin
- $ cd $DCP_LOC/hw/samples/dma_afu
- $ run.sh
Upon completion, the AF, dma_afu.gbs, is available in $DCP_LOC/hw/samples/dma_afu. Compilation can take up to 45 minutes on a high-performance host.
Register Map and Address Spaces
- The DMA BBB
- The DMA AFU
- The two local memory banks
- The Magic Number ROM
DMA AFU Register Map
The DMA register map provides the absolute addresses of all the locations within the unit.
Byte Address | Name | Span in Bytes | Description |
---|---|---|---|
0x0_0000 | AFU DFH | 8 | Refer to Table 2 for the bit fields. |
0x0_0008 | AFU ID_L | 8 | Set to 0x9081F88B8F655CAA for the DMA AFU. |
0x0_0010 | AFU ID_H | 8 | Set to 0x331DB30C988541EA for the DMA AFU. |
0x0_2000 | MPF DFH | 240 | Specifies IDs, feature list, and control and status registers. The MPF decodes this information. This information is not available inside the DMA system. |
0x2_0000 | DMA BBB | 8192 | The DMA BBB memory map. Refer to Table 3 for the register offsets. |
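The two 64-bit ID registers together hold one 128-bit UUID, with ID_H supplying the upper half. As an illustration, the C sketch below formats a pair of register values in the conventional 8-4-4-4-12 UUID layout and compares the result with an expected string. The helper names are hypothetical and are not part of the supplied driver.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define UUID_STR_LEN 37  /* 36 characters plus the terminating NUL */

/* Format ID_H:ID_L as an uppercase UUID string. */
static void format_uuid(uint64_t id_h, uint64_t id_l, char out[UUID_STR_LEN])
{
    snprintf(out, UUID_STR_LEN,
             "%08" PRIX32 "-%04" PRIX32 "-%04" PRIX32 "-%04" PRIX32 "-%012" PRIX64,
             (uint32_t)(id_h >> 32),
             (uint32_t)((id_h >> 16) & 0xFFFF),
             (uint32_t)(id_h & 0xFFFF),
             (uint32_t)(id_l >> 48),
             id_l & UINT64_C(0xFFFFFFFFFFFF));
}

/* Returns nonzero when the registers encode the expected (uppercase) UUID. */
static int uuid_matches(uint64_t id_h, uint64_t id_l, const char *expected)
{
    char text[UUID_STR_LEN];
    format_uuid(id_h, id_l, text);
    return strcmp(text, expected) == 0;
}
```

With the values from the table, `format_uuid(0x331DB30C988541EA, 0x9081F88B8F655CAA, buf)` yields `331DB30C-9885-41EA-9081-F88B8F655CAA`.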
Bit Field | Description |
---|---|
[11:0] | Feature ID. Set to 0. |
[15:12] | AFU major revision number. Set to 0. |
[39:16] | Next DFH byte offset/DFH region size. Set to 8192. |
[40] | End of DFH list. When set, the DFH is at the end of the list. The default value is 0. |
[47:41] | Reserved. |
[51:48] | AFU minor revision number. Set to 0. |
[59:52] | Reserved. |
[63:60] | Feature type. Set to 1 (AFU). |
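The bit fields above can be unpacked from a raw 64-bit DFH value with shifts and masks. The following C sketch assumes exactly the layout in the table; the struct and function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint16_t feature_id;    /* bits [11:0]  */
    uint8_t  major_rev;     /* bits [15:12] */
    uint32_t next_offset;   /* bits [39:16], byte offset to the next DFH */
    bool     end_of_list;   /* bit  [40]    */
    uint8_t  minor_rev;     /* bits [51:48] */
    uint8_t  feature_type;  /* bits [63:60], 1 = AFU */
} dfh_fields;

/* Extract bits [hi:lo] of v; used here only for fields narrower than 64 bits. */
static uint64_t dfh_bits(uint64_t v, unsigned hi, unsigned lo)
{
    return (v >> lo) & ((UINT64_C(1) << (hi - lo + 1)) - 1);
}

/* Decode every documented field; reserved bits are ignored. */
static dfh_fields dfh_decode(uint64_t raw)
{
    dfh_fields f;
    f.feature_id   = (uint16_t)dfh_bits(raw, 11, 0);
    f.major_rev    = (uint8_t)dfh_bits(raw, 15, 12);
    f.next_offset  = (uint32_t)dfh_bits(raw, 39, 16);
    f.end_of_list  = dfh_bits(raw, 40, 40) != 0;
    f.minor_rev    = (uint8_t)dfh_bits(raw, 51, 48);
    f.feature_type = (uint8_t)dfh_bits(raw, 63, 60);
    return f;
}
```

A DFH built from the documented settings (feature type 1, next offset 8192, all other fields 0) decodes to `feature_type == 1`, `next_offset == 8192`, and `end_of_list == false`.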
Byte Address | Name | Span in Bytes | Description |
---|---|---|---|
0x0000 | BBB DFH | 8 | Refer to Table 4 for the bit fields. |
0x0008 | BBB ID_L | 8 | Set to 0xA9149A35BACE01EA for the DMA BBB. |
0x0010 | BBB ID_H | 8 | Set to 0xEF82DEF7F6EC40FC for the DMA BBB. |
0x0040 | MSGDMA CSR | 32 | Controls the DMA. |
0x0060 | MSGDMA Descriptor | 32 | Receives DMA descriptors. |
0x0200 | Address Span Extender Control | 8 | Moves the address window that the data port accesses. |
0x1000 | Address Span Extender Data | 4096 | Maps a 4 KB window to a local device memory. |
Bit Field | Description |
---|---|
[11:0] | Feature ID. Set to 0. |
[15:12] | AFU major revision number. Set to 0. |
[39:16] | Next DFH byte offset / DFH region size. Set to 8192. |
[40] | End of DFH list. When set, the DFH is at the end of the list. The default value is 0. |
[47:41] | Reserved. |
[51:48] | AFU minor revision number. Set to 0. |
[59:52] | Reserved. |
[63:60] | Feature type. Set to 2 (BBB). |
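Because each DFH carries the byte offset of the next one, software can walk the list until it finds a BBB whose ID registers match. The C sketch below performs that walk over an in-memory model of the register space. `mmio_read64` and the `regs` array are stand-ins invented for this example; on real hardware the reads would go through an MMIO accessor such as OPAE's `fpgaReadMMIO64()`, and the supplied driver may scan differently.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DFH_TYPE_BBB 2u

/* Stand-in for an MMIO read: regs models the register space as 64-bit words. */
static uint64_t mmio_read64(const uint64_t *regs, uint64_t byte_off)
{
    assert(byte_off % 8 == 0);
    return regs[byte_off / 8];
}

/* Walk the DFH list from offset 0. Returns the byte offset of the first
 * BBB whose ID_L/ID_H match, or -1 if the list ends without a match. */
static int64_t find_bbb(const uint64_t *regs, size_t size_bytes,
                        uint64_t id_l, uint64_t id_h)
{
    uint64_t off = 0;
    while (off + 24 <= size_bytes) {            /* DFH + ID_L + ID_H */
        uint64_t dfh  = mmio_read64(regs, off);
        unsigned type = (unsigned)((dfh >> 60) & 0xF);      /* bits [63:60] */
        uint64_t next = (dfh >> 16) & 0xFFFFFF;             /* bits [39:16] */
        int      eol  = (int)((dfh >> 40) & 1);             /* bit  [40]    */

        if (type == DFH_TYPE_BBB &&
            mmio_read64(regs, off + 8) == id_l &&
            mmio_read64(regs, off + 16) == id_h)
            return (int64_t)off;

        if (eol || next == 0)   /* end of list, or a malformed link */
            break;
        off += next;
    }
    return -1;
}
```

The `next == 0` check guards against a header that would otherwise loop forever. With the register map in this guide, the first hop from the AFU DFH at 0x0_0000 advances 8192 bytes to the MPF DFH at 0x0_2000.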
DMA AFU Address Space
The host can access the registers listed in Table 1 and Table 3. Host accesses to FPGA local memory must use the Address Span Extender IP core included in the DMA BBB subsystem.
The MSGDMA in the DMA BBB subsystem has access to the full 50-bit address space. The lower half of this address space includes the local memories and the Magic Number ROM. The upper half of this address space includes host memory.
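To make the Address Span Extender path concrete, the C sketch below turns a device memory address into the two MMIO accesses the host would perform: a write of the 4 KB-aligned window base to the control register, followed by an access inside the data window. The offsets come from the register maps in this guide (DMA BBB at 0x2_0000, control at 0x0200, data at 0x1000); the struct and function names are hypothetical.

```c
#include <stdint.h>

#define DMA_BBB_BASE     UINT64_C(0x20000)  /* DMA BBB in the AFU register map */
#define ASE_CNTL_OFFSET  UINT64_C(0x0200)   /* Address Span Extender Control */
#define ASE_DATA_OFFSET  UINT64_C(0x1000)   /* Address Span Extender Data */
#define ASE_WINDOW_BYTES UINT64_C(0x1000)   /* 4 KB window */

typedef struct {
    uint64_t window_base;  /* value to write to the control register */
    uint64_t cntl_mmio;    /* MMIO byte address of the control register */
    uint64_t data_mmio;    /* MMIO byte address to read or write afterwards */
} ase_access;

/* Split a device address into a window base and an offset in the window. */
static ase_access ase_plan(uint64_t dev_addr)
{
    ase_access a;
    uint64_t offset = dev_addr & (ASE_WINDOW_BYTES - 1);
    a.window_base = dev_addr & ~(ASE_WINDOW_BYTES - 1);
    a.cntl_mmio   = DMA_BBB_BASE + ASE_CNTL_OFFSET;
    a.data_mmio   = DMA_BBB_BASE + ASE_DATA_OFFSET + offset;
    return a;
}
```

For device address 0xF340, the plan is to write 0xF000 to MMIO address 0x2_0200 and then access MMIO address 0x2_1340, which corresponds to offset 0x0340 within the data window.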
Software Programming Model
The DMA AFU includes a software driver that you can use in your own host application. The fpga_dma.c and fpga_dma.h files located in the <installation_path>/hw/samples/dma_afu/sw directory implement the software driver.
This driver supports the following functions:
API | Description |
---|---|
fpgaDmaOpen() | Opens a handle to the DMA BBB. |
fpgaDmaTransferSync() | Transfers data from a source location to a destination location. The source and destination can be located in host or device memory. |
fpgaDmaClose() | Closes a previously opened DMA BBB handle. |
Software APIs
fpgaDmaOpen()
Prototype | fpga_result fpgaDmaOpen(fpga_handle fpga, fpga_dma_handle *dma) | |
---|---|---|
Arguments | fpga | Input containing the FPGA object handle from fpgaOpen(). |
Arguments | dma | Output containing the handle to the DMA BBB. |
Returns | FPGA_OK on success; otherwise an error return code. | |
fpgaDmaTransferSync()
Prototype | fpga_result fpgaDmaTransferSync(fpga_dma_handle dma, uint64_t dst, uint64_t src, size_t count, fpga_dma_transfer_t type) | |
---|---|---|
Arguments | dma | Input specifying the DMA handle obtained from fpgaDmaOpen(). |
Arguments | dst | Input specifying the destination byte address of the transfer. To maximize performance, make dst a multiple of 64 bytes. |
Arguments | src | Input specifying the source byte address of the transfer. To maximize performance, make src a multiple of 64 bytes. |
Arguments | count | Input specifying the length of the transfer in bytes. To maximize performance, make count a multiple of 64 bytes. |
Arguments | type | Input specifying the type of transfer. Valid values are HOST_TO_FPGA_MM, FPGA_TO_HOST_MM, and FPGA_TO_FPGA_MM. |
Returns | FPGA_OK on success; otherwise an error return code. | |
fpgaDmaClose()
Prototype | fpga_result fpgaDmaClose(fpga_dma_handle dma) | |
---|---|---|
Arguments | dma | Input containing the DMA handle obtained from fpgaDmaOpen(). |
Returns | FPGA_OK on success; otherwise an error return code. | |
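The three calls are typically used as an open, transfer, close sequence. The C sketch below shows that sequence in `dma_round_trip()`, which copies a host buffer to device memory and back. So that the example compiles without the OPAE headers, it defines small stand-in types and mock implementations of the three functions backed by a 4 KB array; these mocks are assumptions for illustration only, and a real application would include the OPAE headers and fpga_dma.h instead. Passing host pointers cast to uint64_t is also an assumption about how the driver expects host addresses.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* ---- Stand-ins (NOT the real OPAE or driver definitions) ---- */
#define MOCK_DDR_BYTES 4096u
typedef enum { FPGA_OK = 0, FPGA_INVALID_PARAM } fpga_result;
typedef enum { HOST_TO_FPGA_MM, FPGA_TO_HOST_MM, FPGA_TO_FPGA_MM } fpga_dma_transfer_t;
typedef void *fpga_handle;
typedef struct mock_dma { uint8_t ddr[MOCK_DDR_BYTES]; } *fpga_dma_handle;

static struct mock_dma mock_dev;  /* models FPGA local memory */

static int in_ddr(uint64_t addr, size_t count)
{
    return addr <= MOCK_DDR_BYTES && count <= MOCK_DDR_BYTES - addr;
}

static fpga_result fpgaDmaOpen(fpga_handle fpga, fpga_dma_handle *dma)
{
    (void)fpga;
    if (dma == NULL) return FPGA_INVALID_PARAM;
    *dma = &mock_dev;
    return FPGA_OK;
}

static fpga_result fpgaDmaTransferSync(fpga_dma_handle dma, uint64_t dst, uint64_t src,
                                       size_t count, fpga_dma_transfer_t type)
{
    if (dma == NULL) return FPGA_INVALID_PARAM;
    if (type == HOST_TO_FPGA_MM && in_ddr(dst, count)) {
        memcpy(dma->ddr + dst, (const void *)(uintptr_t)src, count);
        return FPGA_OK;
    }
    if (type == FPGA_TO_HOST_MM && in_ddr(src, count)) {
        memcpy((void *)(uintptr_t)dst, dma->ddr + src, count);
        return FPGA_OK;
    }
    if (type == FPGA_TO_FPGA_MM && in_ddr(src, count) && in_ddr(dst, count)) {
        memmove(dma->ddr + dst, dma->ddr + src, count);
        return FPGA_OK;
    }
    return FPGA_INVALID_PARAM;
}

static fpga_result fpgaDmaClose(fpga_dma_handle dma)
{
    return dma != NULL ? FPGA_OK : FPGA_INVALID_PARAM;
}

/* ---- The usage pattern itself ---- */
/* Copy `count` bytes from `in` to device address `dev_addr`, then back to `out`. */
static fpga_result dma_round_trip(fpga_handle fpga, const void *in, void *out,
                                  uint64_t dev_addr, size_t count)
{
    fpga_dma_handle dma;
    fpga_result res = fpgaDmaOpen(fpga, &dma);
    if (res != FPGA_OK)
        return res;

    res = fpgaDmaTransferSync(dma, dev_addr, (uint64_t)(uintptr_t)in,
                              count, HOST_TO_FPGA_MM);
    if (res == FPGA_OK)
        res = fpgaDmaTransferSync(dma, (uint64_t)(uintptr_t)out, dev_addr,
                                  count, FPGA_TO_HOST_MM);

    /* Always close the handle; report the first error encountered. */
    fpga_result close_res = fpgaDmaClose(dma);
    return res != FPGA_OK ? res : close_res;
}
```

Each `fpgaDmaTransferSync()` call blocks until the copy completes, so the second transfer can safely read back what the first one wrote. Choosing `dev_addr` and `count` as multiples of 64 follows the performance guidance in the argument tables.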
Driver Limitations
The software driver included in this package does not support the following features:
- Asynchronous transfers: Initiate all transfers with the blocking fpgaDmaTransferSync() API.
Running DMA AFU Example
Before running this example, you should be familiar with the examples in the Intel Acceleration Stack Quick Start Guide for Intel Programmable Acceleration Card with Intel Arria 10 GX FPGA. The DCP_LOC and OPAE_LOC environment variables must be set.
Complete the following steps to download the DMA AF bitstream and build and run the example software:
- If you have not already done so, configure the system to allocate the 20 hugepages of 2 MB each that this utility requires. This command requires root privileges:
  # sudo sh -c "echo 20 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages"
- # sudo fpgaconf $DCP_LOC/hw/samples/dma_afu/bin/dma_afu.gbs
- $ cd $DCP_LOC/hw/samples/dma_afu/sw
- $ make
- $ sudo LD_LIBRARY_PATH=<OPAE_library_path>:$LD_LIBRARY_PATH ./fpga_dma_test 0
The DMA software takes approximately a minute to populate test buffers and verify the results. The software prints the following messages during a successful run:
Running test in HW mode
Buffer Verification Success!
Buffer Verification Success!
Running DDR sweep test
Allocated test buffer
Fill test buffer
DDR Sweep Host to FPGA
Measured bandwidth = 6710.886400 Megabytes/sec
Clear buffer
DDR Sweep FPGA to Host
Measured bandwidth = 6927.366606 Megabytes/sec
Verifying buffer..
Buffer Verification Success!
Document Revision History for the DMA AFU User Guide
Date | Version | Changes |
---|---|---|
December 2017 | 2017.12.22 | Initial release. |