HumanISE
This service focuses on information systems applied to local government, industry, commerce, health, telecommunications, and central and regional administration.
Browsing HumanISE by Author "473"
-
Item: A Binary Translation Framework for Automated Hardware Generation (2021). Nuno Miguel Paulino; João Bispo; João Canas Ferreira; João Paiva Cardoso; 6527; 5550; 473; 5802
-
Item: Dynamic Partial Reconfiguration of Customized Single-Row Accelerators (2019). Nuno Miguel Paulino; João Canas Ferreira; João Paiva Cardoso; 5550; 473; 5802
-
Item: Executing ARMv8 Loop Traces on Reconfigurable Accelerator via Binary Translation Framework (2020). Nuno Miguel Paulino; João Canas Ferreira; João Bispo; João Paiva Cardoso; 5550; 6527; 473; 5802
-
Item: Generation of Customized Accelerators for Loop Pipelining of Binary Instruction Traces (2017). Nuno Miguel Paulino; João Canas Ferreira; João Paiva Cardoso; 5550; 5802; 473
Many embedded applications process large amounts of data using regular computational kernels, which are amenable to acceleration by specialized hardware coprocessors. To reduce the significant design effort, the dedicated hardware may be generated automatically, usually starting from the application's source or binary code. This paper presents a modulo-scheduled loop accelerator capable of executing multiple loops, together with a supporting toolchain. A generation/scheduling procedure, which relies entirely on MicroBlaze instruction traces, produces accelerator instances customized in terms of functional units and interconnections. The accelerators support integer and single-precision floating-point arithmetic, and exploit instruction-level parallelism, loop pipelining, and memory access parallelism via two read/write ports. A complete implementation of the proposed architecture is evaluated on a Virtex-7 device. Augmenting a MicroBlaze processor with a tailored accelerator achieves a geometric mean speedup, over software-only execution, of 6.61x for 13 floating-point kernels from the Livermore Loops set, and of 4.08x for 11 integer kernels from Texas Instruments' IMGLIB. The proposed customized accelerators are compared with ALU-based ones. The average specialized accelerator requires only 0.47x the number of field-programmable gate array slices of an accelerator with four ALUs. A geometric mean speedup of 1.78x over a four-issue very long instruction word processor (without floating-point support) was obtained for the integer kernels.
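The abstract above reports its results as geometric mean speedups. As a minimal sketch of how such a figure is typically computed, the following assumes per-kernel execution times are available for the software-only and accelerated runs. The function name `geometric_mean_speedup` and the sample timings are hypothetical and are not taken from the paper.

```python
import math
from typing import Sequence


def geometric_mean_speedup(baseline: Sequence[float], accelerated: Sequence[float]) -> float:
    """Geometric mean of per-kernel speedups (baseline time / accelerated time)."""
    if len(baseline) != len(accelerated) or not baseline:
        raise ValueError("need two non-empty sequences of equal length")
    # Per-kernel speedup is the ratio of execution times.
    ratios = [b / a for b, a in zip(baseline, accelerated)]
    # Averaging the logarithms avoids overflow when many ratios are multiplied.
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


# Hypothetical timings for two kernels (illustrative only, not measured data).
software_only = [8.0, 2.0]
with_accelerator = [1.0, 1.0]
print(geometric_mean_speedup(software_only, with_accelerator))  # approximately 4.0
```

The geometric mean is the usual choice for summarizing ratios such as speedups, because a single very large per-kernel speedup does not dominate the summary the way it would in an arithmetic mean.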
-
Item: Optimizing OpenCL Code for Performance on FPGA: k-Means Case Study With Integer Data Sets (2020). Nuno Miguel Paulino; João Canas Ferreira; João Paiva Cardoso; 5550; 473; 5802
-
Item: A Reconfigurable Architecture for Binary Acceleration of Loops with Memory Accesses (2015). Nuno Miguel Paulino; João Canas Ferreira; João Paiva Cardoso; 5550; 5802; 473
This article presents a reconfigurable hardware/software architecture for binary acceleration of embedded applications. A Reconfigurable Processing Unit (RPU) is used as a coprocessor of the General Purpose Processor (GPP) to accelerate the execution of repetitive instruction sequences called Megablocks. A toolchain detects Megablocks from instruction traces and generates customized RPU implementations. The implementation of Megablocks with memory accesses uses a memory-sharing mechanism to support concurrent accesses to the entire address space of the GPP's data memory. The scheduling of load/store operations and the handling of memory accesses have been optimized to minimize the latency introduced by memory accesses. The system is able to dynamically switch execution between the GPP and the RPU when executing the original binaries of the input application. Our proof-of-concept prototype achieved geometric mean speedups of 1.60x and 1.18x for, respectively, a set of 37 benchmarks and a subset of the 9 most complex benchmarks. With respect to a previous version of our approach, we improved the geometric mean speedup from 1.22x to 1.53x for the 10 benchmarks previously used.
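The abstract above describes detecting repetitive instruction sequences (Megablocks) from instruction traces. The sketch below is not the authors' detection algorithm. It only illustrates the general idea under a simplifying assumption: the trace is a list of instruction addresses, and a candidate is any address pattern that repeats back to back a minimum number of times. The function name `find_repeating_pattern` and its parameters are hypothetical.

```python
from typing import List, Optional, Tuple


def find_repeating_pattern(
    trace: List[int], max_len: int = 32, min_repeats: int = 3
) -> Optional[Tuple[int, int, int]]:
    """Return (start, length, repeats) for the first address pattern that
    repeats consecutively at least min_repeats times, or None if none exists."""
    n = len(trace)
    for start in range(n):
        for length in range(1, max_len + 1):
            # Not enough trace left for min_repeats copies of this length.
            if start + length * min_repeats > n:
                break
            pattern = trace[start:start + length]
            repeats = 1
            pos = start + length
            # Count how many times the pattern follows itself immediately.
            while trace[pos:pos + length] == pattern:
                repeats += 1
                pos += length
            if repeats >= min_repeats:
                return start, length, repeats
    return None


# Hypothetical trace: a two-instruction prologue, a three-instruction loop
# body executed three times, then an exit instruction.
trace = [0x100, 0x104] + [0x200, 0x204, 0x208] * 3 + [0x300]
print(find_repeating_pattern(trace))  # (2, 3, 3)
```

This brute-force scan is quadratic in the worst case and ignores control flow inside the loop body, so it is suitable only for illustrating the concept on short traces.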