DMA Accelerator Functional Unit (AFU) User Guide

Published: 2018/3/17

About this Document

Intended Audience

This document is intended for hardware or software developers who need an Accelerator Function (AF) to buffer data locally in memory connected to the Altera FPGA device.

Conventions

Table 1.  Document Conventions
Convention | Description
# | Precedes a command that indicates the command is to be entered as root.
$ | Indicates a command is to be entered as a user.
This font | Filenames, commands, and keywords are printed in this font. Long command lines are printed in this font. Although long command lines may wrap to the next line, the return is not part of the command; do not press enter.
<variable_name> | Indicates the placeholder text that appears between the angle brackets must be replaced with an appropriate value. Do not enter the angle brackets.

Acceleration Glossary

Table 2.  Acceleration Stack for Intel® Xeon® CPU with FPGAs Glossary
Term | Abbreviation | Description
Acceleration Stack for Intel® Xeon® CPU with FPGAs | Acceleration Stack | A collection of software, firmware, and tools that provides performance-optimized connectivity between an Intel® FPGA and an Intel® Xeon® processor.
Intel® Programmable Acceleration Card with Intel® Arria® 10 GX FPGA | Intel® PAC with Arria® 10 | PCIe accelerator card with an Intel® Arria® 10 FPGA. Programmable Acceleration Card is abbreviated PAC. Contains an FPGA Interface Manager (FIM) that pairs with an Intel® Xeon® processor over the PCIe bus.
Intel® Xeon® Processor with Integrated FPGA | Integrated FPGA Platform | Intel® Xeon® plus FPGA platform with the Intel® Xeon® processor and an FPGA in a single package, sharing a coherent view of memory via Quick Path Interconnect (QPI).

Acronyms

Table 3.  Acronyms
Acronym | Expansion | Description
AF | Accelerator Function | Hardware accelerator implemented in FPGA logic that accelerates or intends to accelerate an application.
AFU | Accelerator Functional Unit | The supplied implementation of an accelerator, typically in HDL.
API | Application Programming Interface | A set of subroutine definitions, protocols, and tools for building software applications.
APIC | Advanced Programmable Interrupt Controller | The set of advanced programmable interrupt controller features which may be implemented in a stand-alone controller, part of a system chipset, or in a microprocessor.
ASE | AFU Simulation Environment | Co-simulation environment that allows you to use the same host application and AF in a simulation environment. ASE is part of the .
CCI-P | Core Cache Interface | CCI-P is the hardware-side signaling interface between the AFU and the FPGA Interface Unit (FIU).
CL | Cache Line | 64-byte cache line.
DFH | Device Feature Header | Creates a linked list of feature headers to provide an extensible way of adding features.
DSM | Device Status Memory | A memory page to share control and status information between the software and hardware.
FIM | FPGA Interface Manager | The compiled bitstream containing the FPGA Interface Unit (FIU) and other interfaces such as external SDRAM.
FIU | FPGA Interface Unit | The FIU connects the host and the AFU.
MPF | Memory Properties Factory | Optimizes CCI-P traffic before it reaches the FIU.
Msg | Message | A control notification.
NLB | Native Loopback | The NLB performs reads and writes to the CCI-P link to test connectivity and throughput.
RdLine_I | Read Line Invalid | Memory read request with the FPGA cache hint set to Invalid. The line is not cached in the FPGA, but may cause FPGA cache pollution. Note: The cache tag tracks the request status for all outstanding requests on Intel® Ultra Path Interconnect (Intel® UPI). Therefore, even though RdLine_I is marked invalid upon completion, it consumes the cache tag temporarily to track the request status over UPI. This action may result in the eviction of a cache line, resulting in cache pollution. The advantage of using RdLine_I is that it is not tracked by the CPU directory; thus it prevents snooping from the CPU.
RdLine_S | Read Line Shared | Memory read request with the FPGA cache hint set to Shared. An attempt is made to keep the line in the FPGA cache in a shared state.
WrLine_I | Write Line Invalid | Memory write request with the FPGA cache hint set to Invalid. The FIU writes the data with no intention of keeping the data in the FPGA cache.
WrLine_M | Write Line Modified | Memory write request with the FPGA cache hint set to Modified. The FIU writes the data and leaves it in the FPGA cache in a modified state.

DMA AFU Description

Introduction

The Direct Memory Access (DMA) AFU example shows how to manage memory transfers between the host processor and the FPGA. You can integrate the DMA AFU into your design to move data between the host memory and the FPGA local memory. Connecting a local memory directly to the FPGA should improve performance significantly for applications that frequently access FPGA memory.

The DMA AFU comprises the following submodules:

  • The DMA basic building block (BBB)
  • The Core Cache Interface (CCI-P) to Avalon® Memory-Mapped (Avalon®-MM) bridge
  • The memory properties factory
  • The asynchronous shim BBB

These submodules are described in more detail in the DMA AFU Hardware Components topic below.

The DMA AFU Software Package

The Acceleration Stack for Intel® Xeon® CPU with FPGAs package file, *.tar.gz, includes the DMA AFU example. This example provides a user space driver that a host application uses to program the DMA to move data between host and FPGA memory. The hardware binaries, sources, and the user space driver are available in the following directory: <installation_path>/hw/samples/dma_afu.

Before experimenting with the DMA AFU, you must install the Open Programmable Acceleration Engine (OPAE) software package. Refer to Installing the OPAE Software Package in the Altera Acceleration Stack for Intel® Xeon® CPU with FPGAs Getting Started Guide for installation instructions. That guide also includes basic information about OPAE and configuring an AFU.

After installing the OPAE software package, a sample host application and the DMA AFU are available in the following directory: <installation_path>/hw/samples/dma_afu/sw. The sample application, fpga_dma_test, implements the DMA AFU user space driver.

The DMA AFU Hardware Components

The DMA AFU interfaces with the FPGA Interface Unit (FIU) and two banks of local DDR4 SDRAM. The device provides 8 GB of addressable memory, organized as two 4 GB banks.

The SDRAM clock frequency is 1067 MHz. The local memory interfaces operate at one-quarter of that frequency, which is approximately 267 MHz.

Note: The currently available hardware dictates this memory configuration. Future hardware may support different memory configurations.

You can use the DMA AFU to copy data between the following source and destination locations:

  • The host to device local memory
  • Device local memory to the host
  • Device local memory to device local memory

The Qsys system, <installation_path>/hw/samples/dma_afu/hw/rtl/qsys/dma_test_system.qsys, implements most of the DMA AFU.

Figure 1. DMA AFU Hardware Block Diagram

The DMA AFU includes the following internal modules to interface with the FPGA Interface Unit (FIU):

  • Async Shim Basic Building Block (BBB): This module transfers the CCI-P transactions from the pCLK 400 MHz clock domain to the pCLKDiv2 200 MHz clock domain.
  • Memory Properties Factory (MPF): This module ensures that read responses from the DMA return in the order that they were issued. The Avalon®-MM protocol requires read responses to return in the correct order.
  • CCI-P to Avalon®-MM bridge: This module translates between CCI-P and Avalon®-MM transactions, as follows:
    • The CCI-P to Avalon®-MM path (MMIO path): This path translates CCI-P transactions to Avalon®-MM reads and writes.
      Note: MMIO accesses do not support backpressure. As a result, the CCI-P to Avalon®-MM bridge cannot support the waitrequest signal. Altera recommends that you add an Avalon®-MM Clock Crossing Bridge, available in the IP Catalog, between the CCI-P to Avalon® MMIO Adapter master port and the DMA Test System Avalon®-MM slave port.
    • The Avalon®-MM to CCI-P path: This path creates separate read and write channels for the DMA to access host memory.

    The CCI-P interface to the Avalon®-MM write slave includes an extra, high-order bit to implement write fences. When the high-order bit is set to 1'b1, the CCI-P bridge first issues a write fence. Then, the CCI-P bridge writes the data to the host physical address space with the high-order bit set to 1'b0. This operation allows the DMA to synchronize writes to host memory. The DMA BBB is not capable of receiving bus responses from the host.

DMA Test System

The DMA test system tests the DMA AFU.
Figure 2. DMA Test System Block Diagram
This block diagram shows the internals of the DMA Test System. The DMA Test System is shown as a monolithic block in Figure 1.

The DMA test system includes the following internal modules:

  • AFUID: This component stores the 64-bit Device Feature Header (DFH) and also includes the universally unique identifier (UUID). The AFU_ID_L register stores the lower 64 bits of the AFU ID. The AFU_ID_H register stores the upper 64 bits of the AFU ID. A software driver scans the DMA Test System, finds the AFU ID, and identifies the DMA BBB as the DMA component.
  • DMA Basic Building Block (BBB): This component moves data between the host and local device memory spaces. The DMA BBB interrupt connects to the IRQ 0 signal. The IRQ 0 signal is an input to the CCI-P to Avalon® bridge. The CCI-P to Avalon® bridge forwards the interrupt to the host.
  • Pipeline Bridge: The Pipeline Bridge inserts pipeline stages between memory-mapped IP cores. By default, the system interconnect is optimized for low latency. The Pipeline Bridges improve the system FMAX at the expense of latency.
  • Clock Crossing Bridge: The Clock Crossing Bridge isolates Avalon®-MM masters and slaves that are in different clock domains. Because the Clock Crossing Bridge includes clock-crossing logic, it adds FIFOs that have a greater latency than the standard Pipeline Bridge.

DMA BBB

The DMA BBB subsystem transfers data from source to destination addresses using memory-mapped transactions. The DMA AFU accesses control and status registers in the DMA BBB subsystem. The DMA BBB comprises five IP cores available in the IP Catalog as shown in the following figure.

Figure 3. DMA BBB Block Diagram
This block diagram excludes some internal Pipeline Bridge IP cores.

The components in the DMA BBB implement the following functions:

  • Modular Scatter-Gather DMA (MSGDMA): This IP core performs memory mapped transfers between source and destination addresses. The MSGDMA transfers 64 bytes per clock cycle. The data must be aligned to 64-byte boundaries. The transfer length must be a multiple of 64 bytes. The MSGDMA supports 50-bit addressing and can transfer up to 16,777,152 bytes per descriptor. In this implementation, the driver limits the transfer size to 1,047,552 bytes per descriptor.
  • Address Span Extender: This IP core implements memory transfers that are not aligned on a 64-byte boundary. The host uses it to perform MMIO accesses to FPGA device memory that are not aligned on a 64-byte boundary. The Address Span Extender accesses a 4 kilobyte (4 KB) window into the local device memory. The control port sets the base address of the (4 KB) window. The base address must be aligned to a 4 KB boundary so that the window is aligned to the window size. For example, to access FPGA memory address 0xF340, set the window address to 0xF000 and then access offset 0x0340 within the address span extender data window.
  • BBB ID: This component stores the 64-bit Device Feature Header (DFH) and the UUID. The BBB_ID_L register stores the lower 64 bits of the BBB ID. The BBB_ID_H register stores the upper 64 bits of the BBB ID. A software driver scans the BBB ID to identify the functionality of this DMA subsystem.
  • Magic Number ROM: This IP core contains a single, read-only 64-byte value. The DMA uses this value to create a write fence in host memory. This ROM is only visible to the MSGDMA. The host cannot access it.
  • Pipeline Bridge: The Pipeline Bridge inserts pipeline stages to improve the system FMAX at the expense of latency.
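
The MSGDMA alignment rule and the per-descriptor size limit described above can be sketched in C as follows. This is a minimal sketch and not the supplied driver's code: the macro and function names are hypothetical, and only the 64-byte alignment and the 1,047,552-byte driver limit are taken from the text.

```c
#include <assert.h>
#include <stdint.h>

/* Values from the MSGDMA description; the names are illustrative only. */
#define DMA_ALIGN_BYTES     64u       /* required address and length alignment */
#define DMA_MAX_DESC_BYTES  1047552u  /* per-descriptor limit applied by the driver */

/* Returns 1 when both the address and the length are 64-byte aligned. */
int dma_is_aligned(uint64_t addr, uint64_t len)
{
    return (addr % DMA_ALIGN_BYTES == 0) && (len % DMA_ALIGN_BYTES == 0);
}

/* Number of descriptors needed to move len bytes (len must be aligned). */
uint64_t dma_descriptor_count(uint64_t len)
{
    assert(len % DMA_ALIGN_BYTES == 0);
    return (len + DMA_MAX_DESC_BYTES - 1) / DMA_MAX_DESC_BYTES;
}

/* Size of the next descriptor when `remaining` bytes are still to be moved. */
uint64_t dma_next_chunk(uint64_t remaining)
{
    return remaining < DMA_MAX_DESC_BYTES ? remaining : DMA_MAX_DESC_BYTES;
}
```

Under these assumptions, a 2,097,152-byte transfer needs three descriptors: two of 1,047,552 bytes and one of 2,048 bytes. Because the limit is itself a multiple of 64, every chunk of an aligned transfer stays aligned.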

Building the DMA AFU

Complete the following steps to build the DMA AFU. The PATH environment variable must include the Intel® Quartus® Prime Pro Edition executable.
  1. $ export PATH=$PATH:$DCP_LOC/bin
  2. $ cd $DCP_LOC/hw/samples/dma_afu
  3. $ run.sh
    Upon completion, the AF, dma_afu.gbs, is available in $DCP_LOC/hw/samples/dma_afu. Compilation can take up to 45 minutes on a high-performance host.
The clean.sh script removes the files generated during hardware compilation. To clean a previously built DMA AFU, repeat Steps 1 and 2, and then run the clean.sh script instead of Step 3.

Register Map and Address Spaces

The DMA AFU supports a 50-bit address space. The lower half of the address map is the device memory space. The upper half is the host memory space. The device memory space includes all the registers for the following IP cores:
  • The DMA BBB
  • The DMA AFU
  • The two local memory banks
  • The Magic Number ROM
The MMIO registers in the DMA BBB and AFU support 32- and 64-bit accesses. The DMA AFU does not support 512-bit accesses. Accesses to the MSGDMA registers inside the DMA BBB must be 32 bits.

DMA AFU Register Map

The DMA register map provides the absolute addresses of all the locations within the unit.

Table 4.  DMA AFU Memory Map
Byte Address | Name | Span in Bytes | Description
0x0_0000 | AFU DFH | 8 | Refer to Table 5 for the bit fields.
0x0_0008 | AFU ID_L | 8 | Set to 0x9081F88B8F655CAA for the DMA AFU.
0x0_0010 | AFU ID_H | 8 | Set to 0x331DB30C988541EA for the DMA AFU.
0x0_2000 | MPF DFH | 240 | Specifies IDs, feature list, and control and status registers. The MPF decodes this information. This information is not available inside the DMA system.
0x2_0000 | DMA BBB | 8192 | The DMA BBB memory map. Refer to Table 6 for the register offsets.

Table 5.  DMA AFU DFH Encoding
Bit Field | Description
[11:0] | Feature ID. Set to 0.
[15:12] | AFU major revision number. Set to 0.
[39:16] | Next DFH byte offset/DFH region size. Set to 8192.
[40] | End of DFH list. When set, the DFH is at the end of the list. The default value is 0.
[47:41] | Reserved.
[51:48] | AFU minor revision number. Set to 0.
[59:52] | Reserved.
[63:60] | Feature type. Set to 1 (AFU).

Table 6.  DMA BBB Memory Map
Add the byte addresses below to the DMA BBB base address, 0x2_0000.
Byte Address | Name | Span in Bytes | Description
0x0000 | BBB DFH | 8 | Refer to Table 7 for the bit fields.
0x0008 | BBB ID_L | 8 | Set to 0xA9149A35BACE01EA for the DMA BBB.
0x0010 | BBB ID_H | 8 | Set to 0xEF82DEF7F6EC40FC for the DMA BBB.
0x0040 | MSGDMA CSR | 32 | Controls the DMA.
0x0060 | MSGDMA Descriptor | 32 | Receives DMA descriptors.
0x0200 | Address Span Extender Control | 8 | Moves the address window that the data port accesses.
0x1000 | Address Span Extender Data | 4096 | Maps a 4 KB window to local device memory.

Table 7.  DMA BBB DFH Encoding
Bit Field | Description
[11:0] | Feature ID. Set to 0.
[15:12] | AFU major revision number. Set to 0.
[39:16] | Next DFH byte offset/DFH region size. Set to 8192.
[40] | End of DFH list. When set, the DFH is at the end of the list. The default value is 0.
[47:41] | Reserved.
[51:48] | AFU minor revision number. Set to 0.
[59:52] | Reserved.
[63:60] | Feature type. Set to 2 (BBB).
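
The DFH bit fields listed in the two encoding tables can be unpacked as sketched below. This is an illustrative C sketch, not code from the supplied driver: the `dfh_fields` structure and the `dfh_bits` and `dfh_decode` names are hypothetical, and only the bit positions come from the tables.

```c
#include <assert.h>
#include <stdint.h>

/* Decoded view of a 64-bit DFH; field positions follow the encoding tables. */
typedef struct {
    uint16_t feature_id;    /* bits [11:0]  */
    uint8_t  major_rev;     /* bits [15:12] */
    uint32_t next_offset;   /* bits [39:16], in bytes */
    uint8_t  end_of_list;   /* bit  [40]    */
    uint8_t  minor_rev;     /* bits [51:48] */
    uint8_t  feature_type;  /* bits [63:60]: 1 = AFU, 2 = BBB */
} dfh_fields;

/* Extract `width` bits of `value` starting at bit `lsb` (width in 1..63). */
static uint64_t dfh_bits(uint64_t value, unsigned lsb, unsigned width)
{
    assert(width >= 1 && width <= 63);
    return (value >> lsb) & ((UINT64_C(1) << width) - 1);
}

/* Split a raw 64-bit DFH register value into its documented fields. */
dfh_fields dfh_decode(uint64_t raw)
{
    dfh_fields f;
    f.feature_id   = (uint16_t)dfh_bits(raw, 0, 12);
    f.major_rev    = (uint8_t)dfh_bits(raw, 12, 4);
    f.next_offset  = (uint32_t)dfh_bits(raw, 16, 24);
    f.end_of_list  = (uint8_t)dfh_bits(raw, 40, 1);
    f.minor_rev    = (uint8_t)dfh_bits(raw, 48, 4);
    f.feature_type = (uint8_t)dfh_bits(raw, 60, 4);
    return f;
}
```

For example, the value 0x1000000020000000, assembled here from the table settings rather than read from hardware, decodes to feature type 1 (AFU) with a next-DFH offset of 8192 and the end-of-list bit clear.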

DMA AFU Address Space

The host can access the registers listed in Table 4 and Table 6. Host accesses to FPGA local memory must use the Address Span Extender IP core included in the DMA BBB subsystem.
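
The window arithmetic for the Address Span Extender can be sketched in C as follows. The function and macro names are hypothetical and are not part of the supplied driver. The base address 0x2_0000 and the offsets 0x0200 and 0x1000 come from the DMA AFU and DMA BBB memory maps; that the control register takes the window base exactly as computed here is an assumption.

```c
#include <assert.h>
#include <stdint.h>

#define DMA_BBB_BASE        0x20000u  /* DMA BBB base address in the AFU map */
#define ASE_CONTROL_OFFSET  0x0200u   /* Address Span Extender Control */
#define ASE_DATA_OFFSET     0x1000u   /* Address Span Extender Data */
#define ASE_WINDOW_BYTES    4096u     /* 4 KB window into local memory */

/* 4 KB-aligned window base that covers the device address. */
uint64_t ase_window_base(uint64_t dev_addr)
{
    return dev_addr & ~(uint64_t)(ASE_WINDOW_BYTES - 1);
}

/* Offset of the device address inside the 4 KB data window. */
uint64_t ase_window_offset(uint64_t dev_addr)
{
    return dev_addr & (uint64_t)(ASE_WINDOW_BYTES - 1);
}

/* MMIO byte address of the control register in the AFU register map. */
uint64_t ase_mmio_control_addr(void)
{
    return (uint64_t)DMA_BBB_BASE + ASE_CONTROL_OFFSET;
}

/* MMIO byte address at which dev_addr is visible once the window is set. */
uint64_t ase_mmio_data_addr(uint64_t dev_addr)
{
    return (uint64_t)DMA_BBB_BASE + ASE_DATA_OFFSET + ase_window_offset(dev_addr);
}
```

For the device address 0xF340 used earlier in this guide, the sketch yields a window base of 0xF000 and a window offset of 0x0340, which places the access at MMIO byte address 0x2_1340.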

The MSGDMA in the DMA BBB subsystem has access to the full 50-bit address space. The lower half of this address space includes the local memories and the Magic Number ROM. The upper half of this address space includes host memory.

The following figure shows the host and MSGDMA views of memory.
Figure 4. The DMA AFU and Host Views of Memory
Note: The Address Span Extender can only access the EMIF A and EMIF B address spaces shown in the figure above.
Note: The write fence aliased host memory, addresses 0x3_0000_0000_0000-0x3_FFFF_FFFF_FFFF, aliases to the host memory spanning 0x2_0000_0000_0000-0x2_FFFF_FFFF_FFFF. The write fence aliased host memory span is write only; reads from this address space are undefined. A write to the write fence aliased host memory issues a write fence, followed by the accompanying write data. Write to this address space only infrequently, when you need a write fence to synchronize with the host.
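
The relationship between the two host memory ranges in the note above can be expressed as simple bit arithmetic. The following C sketch uses hypothetical names and assumes, based on the address ranges given, that bit 49 selects host memory, bit 48 selects the write fence alias, and the host physical address occupies the low 48 bits.

```c
#include <assert.h>
#include <stdint.h>

#define DMA_HOST_BASE        UINT64_C(0x2000000000000)  /* bit 49: host half */
#define DMA_WRITE_FENCE_BIT  UINT64_C(0x1000000000000)  /* bit 48: fence alias */
#define DMA_HOST_ADDR_MASK   UINT64_C(0x0FFFFFFFFFFFF)  /* low 48 bits */

/* DMA-view address of a host physical address (no write fence). */
uint64_t dma_host_addr(uint64_t host_phys)
{
    assert((host_phys & ~DMA_HOST_ADDR_MASK) == 0);
    return DMA_HOST_BASE | host_phys;
}

/* DMA-view address that issues a write fence before the write. */
uint64_t dma_host_fence_addr(uint64_t host_phys)
{
    return dma_host_addr(host_phys) | DMA_WRITE_FENCE_BIT;
}

/* Returns 1 when a DMA-view address lies in the write fence alias range. */
int dma_is_fence_alias(uint64_t dma_addr)
{
    return (dma_addr >> 48) == 0x3;
}
```

With these definitions, host physical address 0x1000 appears to the DMA at 0x2_0000_0000_1000, and its write fence alias is 0x3_0000_0000_1000.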

Software Programming Model

The DMA AFU includes a software driver that you can use in your own host application. The fpga_dma.c and fpga_dma.h files located in the <installation_path>/hw/samples/dma_afu/sw directory implement the software driver.

This driver supports the following functions:
API | Description
fpgaDmaOpen() | Opens a handle to the DMA BBB.
fpgaDmaTransferSync() | Transfers data from a source location to a destination location. The source and destination can be located in host or device memory.
fpgaDmaClose() | Closes the previously allocated DMA BBB handle.

Software APIs

fpgaDmaOpen()

fpgaDmaOpen() scans the device feature chain to locate the DMA BBB and then creates a handle for the DMA BBB.
Prototype: fpga_result fpgaDmaOpen(fpga_handle fpga, fpga_dma_handle *dma)
Arguments:
  • fpga: Input containing the FPGA object handle from fpgaOpen().
  • dma: Output containing the handle to the DMA BBB.
Returns: FPGA_OK on success; otherwise an error return code.

fpgaDmaTransferSync()

Prototype: fpga_result fpgaDmaTransferSync(fpga_dma_handle dma, uint64_t dst, uint64_t src, size_t count, fpga_dma_transfer_t type)
Arguments:
  • dma: Input specifying the DMA handle obtained from fpgaDmaOpen().
  • dst: Input specifying the destination byte address of the transfer. To maximize performance, make dst a multiple of 64 bytes.
  • src: Input specifying the source byte address of the transfer. To maximize performance, make src a multiple of 64 bytes.
  • count: Input specifying the length of the transfer in bytes. To maximize performance, make count a multiple of 64 bytes.
  • type: Input specifying the type of transfer. The valid values are HOST_TO_FPGA_MM, FPGA_TO_HOST_MM, and FPGA_TO_FPGA_MM.
Returns: FPGA_OK on success; otherwise an error return code.

fpgaDmaClose()

fpgaDmaClose() closes the previously allocated DMA BBB handle.
Prototype: fpga_result fpgaDmaClose(fpga_dma_handle dma)
Arguments:
  • dma: Input containing the DMA handle obtained from fpgaDmaOpen().
Returns: FPGA_OK on success; otherwise an error return code.
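
The call sequence implied by these three functions (open a handle, run blocking transfers, close the handle) can be sketched as follows. This sketch is hypothetical in several respects: so that it builds without the driver or hardware, it replaces fpga_dma.h with local stand-in types and stub function bodies that copy to and from a small array. Only the prototypes, FPGA_OK, and the transfer type names follow the text above; the other result codes, the STUB_DEV_BYTES size, the in_dev() helper, and the convention of passing host buffers as pointers cast to uint64_t are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Stand-ins for fpga_dma.h: illustrative only, not the real driver. */
#define STUB_DEV_BYTES 4096u
typedef enum { FPGA_OK = 0, FPGA_INVALID_PARAM, FPGA_NO_MEMORY } fpga_result;
typedef enum { HOST_TO_FPGA_MM, FPGA_TO_HOST_MM, FPGA_TO_FPGA_MM } fpga_dma_transfer_t;
typedef void *fpga_handle;
typedef struct stub_dma { uint8_t dev_mem[STUB_DEV_BYTES]; } *fpga_dma_handle;

fpga_result fpgaDmaOpen(fpga_handle fpga, fpga_dma_handle *dma)
{
    (void)fpga;                     /* the stub does not scan a real device */
    if (dma == NULL) return FPGA_INVALID_PARAM;
    *dma = calloc(1, sizeof(struct stub_dma));
    return *dma ? FPGA_OK : FPGA_NO_MEMORY;
}

/* Returns 1 when [addr, addr + count) fits in the stub device memory. */
static int in_dev(uint64_t addr, size_t count)
{
    return count <= STUB_DEV_BYTES && addr <= STUB_DEV_BYTES - count;
}

fpga_result fpgaDmaTransferSync(fpga_dma_handle dma, uint64_t dst, uint64_t src,
                                size_t count, fpga_dma_transfer_t type)
{
    if (dma == NULL) return FPGA_INVALID_PARAM;
    switch (type) {
    case HOST_TO_FPGA_MM:           /* src: host pointer, dst: device offset */
        if (!in_dev(dst, count)) return FPGA_INVALID_PARAM;
        memcpy(dma->dev_mem + dst, (const void *)(uintptr_t)src, count);
        return FPGA_OK;
    case FPGA_TO_HOST_MM:           /* src: device offset, dst: host pointer */
        if (!in_dev(src, count)) return FPGA_INVALID_PARAM;
        memcpy((void *)(uintptr_t)dst, dma->dev_mem + src, count);
        return FPGA_OK;
    case FPGA_TO_FPGA_MM:           /* both are device offsets */
        if (!in_dev(src, count) || !in_dev(dst, count)) return FPGA_INVALID_PARAM;
        memmove(dma->dev_mem + dst, dma->dev_mem + src, count);
        return FPGA_OK;
    }
    return FPGA_INVALID_PARAM;
}

fpga_result fpgaDmaClose(fpga_dma_handle dma)
{
    free(dma);
    return FPGA_OK;
}

/* Open, copy host -> device -> host, verify, close. Returns 0 on success. */
int dma_round_trip(void)
{
    uint8_t out[256], in[256];      /* 256 bytes: a multiple of 64 */
    fpga_dma_handle dma = NULL;
    int ok;

    for (size_t i = 0; i < sizeof out; i++) out[i] = (uint8_t)i;
    memset(in, 0, sizeof in);

    if (fpgaDmaOpen(NULL, &dma) != FPGA_OK) return -1;
    ok = fpgaDmaTransferSync(dma, 0, (uint64_t)(uintptr_t)out, sizeof out,
                             HOST_TO_FPGA_MM) == FPGA_OK
      && fpgaDmaTransferSync(dma, (uint64_t)(uintptr_t)in, 0, sizeof in,
                             FPGA_TO_HOST_MM) == FPGA_OK
      && memcmp(out, in, sizeof out) == 0;
    fpgaDmaClose(dma);
    return ok ? 0 : -1;
}
```

In a real host application, the stand-in declarations would be replaced by including fpga_dma.h, and the first argument to fpgaDmaOpen() would be the handle returned by fpgaOpen().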

Driver Limitations

The software driver included in this package does not support the following features:

  • Asynchronous transfers: Initiate all transfers with the blocking fpgaDmaTransferSync() API.

Running the DMA AFU Example

Before running this example, you should be familiar with the examples in the Intel Acceleration Stack Quick Start Guide for Intel Programmable Acceleration Card with Intel Arria 10 GX FPGA. The DCP_LOC and OPAE_LOC environment variables must be set.

Complete the following steps to download the DMA AF bitstream and build and run the example software:

  1. If you have not already done so, configure the system to allocate the 20 hugepages of 2 MB each that this utility requires. This command requires root privileges:
     $ sudo sh -c "echo 20 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages"
  2. $ sudo fpgaconf $DCP_LOC/hw/samples/dma_afu/bin/dma_afu.gbs
  3. $ cd $DCP_LOC/hw/samples/dma_afu/sw
  4. $ make
  5. $ sudo LD_LIBRARY_PATH=`pwd`:$LD_LIBRARY_PATH ./fpga_dma_test 0
    The DMA software takes approximately a minute to populate test buffers and verify the results. The software prints the following messages during a successful run:
    Running test in HW mode
    Buffer Verification Success!
    Buffer Verification Success!
    Running DDR sweep test
    Allocated test buffer
    Fill test buffer
    DDR Sweep Host to FPGA
    Measured bandwidth = 6710.886400 Megabytes/sec
    Clear buffer
    DDR Sweep FPGA to Host
    Measured bandwidth = 6927.366606 Megabytes/sec
    Verifying buffer..
    Buffer Verification Success!

Document Revision History for the DMA AFU User Guide

Date | Version | Changes
December 2017 | 2017.12.22 | Initial release.