Architecture exploration of recent GPUs to analyze the efficiency of hardware resources

ABSTRACT


INTRODUCTION
Multithreaded hardware, which has been extensively developed in the last decade, has become an attractive computing platform for software developers to speed up the solution of computational problems. Graphics processing units (GPUs) form one of the typical classes of multithreaded hardware. Modern GPU generations provide parallel programming models such as CUDA [1] and OpenCL [2,3], which make GPUs accessible for non-graphics applications as well as graphics applications. The increasing demand from developers for more powerful GPUs along with flexible development platforms requires GPU manufacturers to pack in more memory and execution units with higher speeds, as well as to improve programmability. Most recent GPU architectures have adapted to this trend; however, hardware resources in GPUs are not yet fully utilized, which limits the performance improvement.
A typical GPU comprises multiple streaming multiprocessors (SMs), or compute units (CUs) as AMD commonly refers to them, also known as shader cores [4,5]. SMs are laid out in groups called clusters that share a common port to the interconnection network. Each SM contains a certain number of single-instruction, multiple-thread (SIMT) cores or CUDA cores, several load/store units (LSUs), and special function units (SFUs). The core employs integer arithmetic logic units (ALUs) and floating-point units (FPUs) supporting diverse instructions including boolean, shift, move, compare, convert, bit-field extract, bit-reverse insert, and population count. The LSU is responsible for memory operations, whereas the SFU is dedicated to transcendental instructions such as sine, cosine, reciprocal, and square root [6][7][8].
A general-purpose computing on GPUs (GPGPU) application comprises several kernels, each comprising a massive number of threads (instances of the program kernel). A group of threads forms a thread block, termed a cooperative thread array (CTA) [6,9]. The size of a CTA is determined by the number of threads that the programmer intends to launch at a time. Determining the CTA size wisely allows a balanced workload distribution across SMs, which enables hardware resources to be utilized efficiently. A global CTA scheduler assigns one or more CTAs to each SM until the resources inside the SM are saturated. A new CTA is assigned to an SM whenever it completes a previous CTA. Within the SM, CUDA cores and other execution units execute the threads in groups of 32 threads known as warps. The warps stored in a warp pool in the SM are scheduled for execution in an order determined by the policy of the warp scheduler. On every cycle, the scheduler selects one of the ready warps to issue next. Besides the conventional loose round-robin (LRR) warp scheduler, there have been several efforts to arrange the order of warps effectively, such as greedy-then-oldest (GTO) [10], two-level warp scheduling [11], cache-conscious wavefront scheduling [10], criticality-aware warp scheduling [12], and synchronization-aware warp scheduling [13]. In modern GPU architectures, two or more warp schedulers work together in the SM to better utilize the hardware resources.
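The scheduling policies above differ mainly in the order in which ready warps are considered each cycle. A minimal illustrative sketch of that ordering (the warp representation is hypothetical; this is not GPGPU-Sim code):

```python
# Toy model of warp selection order under two common policies.
# Warps are (id, age) tuples; a smaller "age" means an older warp.

def lrr_order(warps, last_issued):
    """Loose round-robin: resume from the warp after the last issued one."""
    n = len(warps)
    start = (last_issued + 1) % n
    return warps[start:] + warps[:start]

def gto_order(warps, current):
    """Greedy-then-oldest: keep issuing from the current warp until it
    stalls, then fall back to the oldest ready warp first."""
    rest = sorted((w for w in warps if w != current), key=lambda w: w[1])
    return [current] + rest

warps = [(0, 5), (1, 2), (2, 9), (3, 1)]
print(lrr_order(warps, last_issued=1))   # continues from warp index 2
print(gto_order(warps, current=(2, 9)))  # warp 2 first, then oldest-first
```

Both policies see the same ready warps; only the priority order changes, which is why GTO tends to keep one warp's locality while LRR spreads issue slots evenly.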
To support the memory pipeline, there are a few levels of memory units in the SM. A substantial portion of each SM is devoted to the register file. L1 caches help handle register spills and serve various purposes (e.g., read-only constant cache, texture data, and irregular data access) [6,[14][15][16]. At a similar level of the hierarchy as the L1 cache, shared memory allows threads to cooperate with each other, which is the key feature for reusing on-chip data. Shared memory also helps to reduce traffic to the lower-level caches. Applications that make use of shared memory often need to perform barrier synchronization, which is supported by CUDA. A synchronization barrier is used to synchronize shared data between threads within a CTA [17]. Barriers are also used for correctness and for maintaining performance after highly branching threads. When the warps of a CTA hit the barrier, they become stalled and have to wait for all the other warps to arrive at the barrier before making any further progress. Returning to the memory pipeline, if memory requests miss in the L1 cache, they are recorded in miss status holding registers (MSHRs) before proceeding to the L2 cache via an interconnection network. The L2 cache can be accessed across the entire kernel. It is split into multiple banks attached to the memory controllers. The controller is responsible for forwarding requests that miss in the L2 cache to the dynamic random access memory (DRAM) module. Figure 1 represents the general placement of the internal hardware components in recent GPUs. There have been several ideas and techniques for improving the performance of GPUs. However, most of them were obtained on a specific GPU architecture, so their compatibility with various generations is questionable. In fact, whenever a new GPU architecture is released, studies are conducted to evaluate the capability and power of the new GPUs.
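The MSHR behavior described above can be sketched as follows; this is an illustrative model only (the entry count and merging policy are simplified assumptions), showing how requests to a cache line with an already-pending miss are merged rather than re-sent to L2:

```python
# Illustrative sketch of MSHR behavior on L1 misses.

class MSHR:
    def __init__(self, num_entries):
        self.num_entries = num_entries
        self.entries = {}  # line address -> list of waiting warp ids

    def on_l1_miss(self, line_addr, warp_id):
        """Returns True if a new request must be sent to L2."""
        if line_addr in self.entries:         # merge with the pending miss
            self.entries[line_addr].append(warp_id)
            return False
        if len(self.entries) >= self.num_entries:
            raise RuntimeError("MSHRs full: memory pipeline stalls")
        self.entries[line_addr] = [warp_id]   # allocate a new entry
        return True

    def on_fill(self, line_addr):
        """Data returned from L2/DRAM: wake all merged requesters."""
        return self.entries.pop(line_addr, [])

m = MSHR(num_entries=32)
print(m.on_l1_miss(0x100, warp_id=0))  # True: new miss, goes to L2
print(m.on_l1_miss(0x100, warp_id=1))  # False: merged, no extra traffic
print(m.on_fill(0x100))                # [0, 1]
```

When the MSHRs fill up, no further misses can be tracked, which is one way the memory pipeline stalls the SM.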
Although these studies do mention the advancement compared to the previous architecture, there is a lack of follow-up from generation to generation to measure how far the architecture has developed. This is what our study tries to analyze. We not only evaluate the performance improvement over the predecessor architectures by applying the benchmarks but also analyze the efficiency of the hardware upgrades from the former Fermi architecture to the recent Pascal architecture in the storage and computation domains. Understanding hardware efficiency across many GPU architectures is important for hardware developers to improve future GPU architectures and for software developers to optimize their applications on various GPU generations. The remainder of this paper is organized as follows: Section 2 presents a comparison of the main improvements of Pascal to those of the Fermi architecture. Section 3 presents a discussion of our experimental method and Section 4 presents the simulation results. Finally, Section 5 provides conclusions. Table 1 lists some key improvements over the GPU development period of NVIDIA. The major revision of the Tesla architecture (G80 and GT200) set the philosophy at NVIDIA of improving both capability and programmability. The Fermi architecture maintained this orientation and advanced further. NVIDIA learned from prior processors and improved several features to enhance GPU computation capability as follows: a. Third-generation SM [6]: a fourfold increase in the number of CUDA cores (32 compared to 8 in GT200), 16 load/store units that set the address calculation speed at 16 threads per clock, four SFUs per SM, an integer ALU that supports full 32-bit precision for all instructions (GT200 is limited to 24-bit precision for multiplication operations), and dual warp schedulers which enable two warps to be issued and executed simultaneously. b.
Memory subsystem improvement [6]: a true cache hierarchy with configurable L1/shared memory and a unified L2 cache. This configurability allows programmers to tune the memory behavior to fit their needs. Fermi was also the first GPU generation with error-correcting code support from the register files down to DRAM.

OLD ARCHITECTURE VERSUS RECENT ARCHITECTURE
Since the great success of Fermi, NVIDIA has further improved its architecture through the subsequent Kepler and Maxwell generations. Pascal maintained this trend and became the most advanced NVIDIA architecture built to date [18]. Among the Pascal GPUs, GP100 is a very powerful and architecturally complex GPU. Some of its key features are as follows: a. NVLink provides a high-speed bidirectional interface for GPU-to-GPU transfers. Multiple-GPU systems benefit substantially from this feature. b. Second-generation high-bandwidth memory (HBM2). This type of memory achieves a significant bandwidth increase. c. Sixth-generation SM: a full GP100 delivers a total of 60 SMs, and each SM comprises 64 single-precision (FP32) CUDA cores, 32 double-precision (FP64) CUDA cores, and four texture units. Notably, FP32 CUDA cores can process both 16-bit and 32-bit precision instructions and data. This feature significantly benefits deep learning algorithms wherein high levels of precision are not strictly required; FP16 reduces memory usage, thus allowing larger training networks.
While Fermi and Kepler allow a programmer to configure the split between shared memory and L1 cache, Maxwell and Pascal adopt a different approach. Each SM has its own 64 KB of dedicated shared memory, and the L1 cache can also serve as a texture cache depending on the workload. Pascal GP100 also features a large L2 cache, which reduces DRAM access latency.

EXPERIMENTAL METHOD
In order to evaluate how well recent GPU architectures convert their hardware upgrades into performance gains, we use nine applications (HS: Hotspot, BFS: breadth-first search, LUD: LU decomposition, BP: Backpropagation, Dwt2d: discrete wavelet transform 2D, BT: B+ tree, PF: PathFinder, NW: Needleman-Wunsch, SR: Srad_v2) from the Rodinia benchmark suite [19] on a cycle-level GPU simulator called GPGPU-Sim [20]. The simulator models the whole GPU, allowing researchers to modify the architecture and run GPGPU applications without using a real GPU. To the best of our knowledge, most other researchers in the GPU community have used GPGPU-Sim (v3.x) for their evaluations because the Fermi architecture, which is supported by this version, is one of the simplest architecture configurations. However, we use GPGPU-Sim version 4.0 [21], since this version implements many updates that improve evaluation accuracy and it supports many recent GPU architectures. The Rodinia benchmark suite is a collection of popular benchmarks written in CUDA and OpenCL for evaluating the power and efficiency of GPU architectures.
We select the popular Fermi GPU architecture to compare with the recent Pascal GPU architecture. We apply the default architecture configuration of the GTX 480 (Fermi) because it is commonly used in the research community. For the recent architectures, GPGPU-Sim 4.0 supports a Titan X configuration as a representative of the Pascal family. Although the Titan X does not use the most powerful Pascal chipset, it can deliver comparable performance, as it uses the GP102 chipset, which inherits most GP100 features except the FP64 CUDA cores and replaces the HBM2 memory interface with GDDR5X. The Titan X is also more practical because it is a commercial GPU, while the GP100 is mainly available in high-performance computing systems for scientific projects. We modified some parameters of the Titan X to sufficiently approximate the real GPU. Table 2 describes the simulation parameters for the two compared architectures. We divide the parameter set into two groups: core-related parameters and memory-related parameters. One prominent feature in modern GPU architectures is hardware resource management by controlling the number of warps issued to the execution units. The SM consists of a large number of CUDA cores that require an appropriate number of warp schedulers to orchestrate the activities of the warps. Fermi employs dual warp schedulers, while Kepler and Maxwell use quad warp schedulers to utilize the massive number of CUDA cores in their architectures. Recent GPU architectures like Pascal adopt dual warp schedulers, as more than two are unnecessary: the Pascal architecture reduces the number of CUDA cores per SM while increasing the number of SMs. The key point is that the number of warp schedulers per SM should be matched to the hardware resources in the SM. If the number of warp schedulers is too small for the available CUDA cores, most CUDA cores are left unused.
In the opposite case, redundant warp schedulers may harm GPU performance. To analyze the efficiency of hardware resources in the SM, we vary the number of warp schedulers in the SM. The number of warp schedulers not only reflects the ability to utilize the SM's resources but also affects the time warps spend waiting at synchronization barriers. Synchronization overhead can be a limitation to achieving good performance on the GPU [19], which is why we also analyze the impact of barrier synchronization on overall performance. In this paper, we focus on intra-CTA synchronization, which is entirely managed by hardware [22,23]. Another type is global synchronization, which is achieved by allowing the current kernel to complete and starting a new kernel, or by atomic operations [13]; these incur a significant overhead compared to intra-CTA synchronization. CUDA provides __syncthreads() to perform a barrier statement [24].
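The semantics that __syncthreads() provides within a CTA can be sketched with a software barrier. This is a conceptual model only (warps are modeled as host threads, with unequal work standing in for different arrival times): no warp makes progress past the barrier until every warp in the CTA has arrived.

```python
# Conceptual model of a CTA-level barrier: early-arriving warps stall
# until the last warp of the CTA reaches the barrier.

import threading

WARPS_PER_CTA = 4
barrier = threading.Barrier(WARPS_PER_CTA)
order = []  # event log: ("arrive", warp) / ("resume", warp)

def warp(wid, work_cycles):
    for _ in range(work_cycles):  # unequal "work" before the barrier
        pass
    order.append(("arrive", wid))
    barrier.wait()                # stall until all warps have arrived
    order.append(("resume", wid))

threads = [threading.Thread(target=warp, args=(w, w * 10000))
           for w in range(WARPS_PER_CTA)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Every arrival is logged before any warp resumes past the barrier.
assert all(e[0] == "arrive" for e in order[:WARPS_PER_CTA])
```

The gap between the first "arrive" and the release of the barrier corresponds to the barrier waiting time measured in the next section.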
To measure the barrier waiting time, we trigger a cycle counter for every warp within a CTA when it hits the barrier and keep counting until the final warp of that CTA reaches the barrier, at which point all warps are released and cross the synchronization barrier. The average waiting time over all CTAs and SMs is calculated for each benchmark. We compare the waiting duration for single, dual, and quad warp schedulers on the Fermi and Pascal architectures using the most synchronization-active kernel of each benchmark. Table 3 presents the performance of the benchmarks running on the Fermi and Pascal architectures, normalized to Fermi's performance. As shown in the table, all the benchmarks show better performance on the modern Pascal architecture (273% performance improvement on average). However, the massive hardware upgrade in Pascal does not consistently translate into significant performance gains for every benchmark. Understanding this inefficiency is necessary for architecture researchers to propose improvements for future GPU architecture generations. Based on our simulation results, the divergence in improvement depends heavily on the characteristics of the applications.

EVALUATION RESULTS
Hotspot (HS) is a relatively compute-intensive benchmark [19]. Therefore, the increase in the number of SMs and CUDA cores in Pascal is dominantly beneficial to its performance. BT (B+ tree) comprises several parallel regions (a region between two consecutive barrier instructions) [12]. It also shows over a 4-times speedup on Pascal over Fermi owing to the improved parallelism of modern GPU architectures. LUD (LU decomposition) shows similar instructions per cycle (IPC) on both architectures. The LUD algorithm, which involves matrices, exhibits significant inter-thread sharing as well as row and column dependencies [25], so LUD would strongly benefit from shared memory. However, the shared memory available to a CTA remains 48 KB in the recent architecture, the same as in the past architecture, which explains why GPU performance remains unchanged across both architectures for LUD.
To gain further insight into the influence of the computing and memory parameters in recent GPU architectures, we define two parameter sets. By maintaining the core-related parameters of the Fermi configuration and replacing the memory parameters with those of the Pascal architecture (a configuration named FermiCore_PascalMem), we first evaluate the benefits of Pascal's memory-related resources. Via the inverse procedure (named PascalCore_FermiMem), we can analyze the contribution of the computing resources of the Pascal architecture. Figure 2 illustrates the performance comparison according to the improvement of memory and computing resources from Fermi to Pascal. As shown in Figure 2, HS and BT are the two benchmarks that benefit most from the core-related parameters of the recent GPU architecture (273% performance improvement on average). These core-related configurations push performance to more than 3 times the Fermi baseline. Digging into the simulation results of these benchmarks, when Pascal's core-related parameters are applied, HS can schedule 4 more CTAs onto one SM and BT can schedule 3 more CTAs, resulting in enhanced parallelism. These two benchmarks benefit from the doubling of the number of registers per SM in the Pascal architecture. On the other hand, NW (Needleman-Wunsch) and BFS (breadth-first search) aggressively make use of the memory-related parameters of the new architecture. BFS with the Pascal memory configuration achieves 2.57 times the performance of Fermi. In the case of NW, Figure 2 shows an improvement of 1.21 times contributed by the memory parameters, compared to 1.33 times with the full set of Pascal parameters. NW has only 16 threads (corresponding to one warp) in a CTA [19], which makes it difficult to gain a large improvement on the new architecture because SM occupancy is throttled by the small number of threads per CTA, although smaller CTAs and reduced occupancy can be beneficial in some cases.
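The register-file effect described above can be sketched with a simple occupancy calculation. All numbers here are hypothetical (they are not the Table 2 parameters), chosen only to illustrate how doubling the per-SM register file can admit more concurrent CTAs for a register-hungry kernel:

```python
# Illustrative occupancy sketch: CTAs per SM is the minimum over the
# limits imposed by registers, shared memory, and the thread count.

def ctas_per_sm(regs_per_sm, smem_per_sm, max_threads,
                threads_per_cta, regs_per_thread, smem_per_cta):
    by_regs = regs_per_sm // (threads_per_cta * regs_per_thread)
    by_smem = smem_per_sm // smem_per_cta
    by_threads = max_threads // threads_per_cta
    return min(by_regs, by_smem, by_threads)

# Hypothetical kernel: 256 threads/CTA, 40 registers/thread, 4 KB smem/CTA.
kernel = dict(threads_per_cta=256, regs_per_thread=40, smem_per_cta=4096)

small_rf = ctas_per_sm(32768, 48 * 1024, 1536, **kernel)
big_rf = ctas_per_sm(65536, 48 * 1024, 2048, **kernel)
print(small_rf, big_rf)  # register-limited: the larger RF admits more CTAs
```

For this kernel the register limit dominates, so doubling the register file doubles the resident CTAs, mirroring the extra CTAs that HS and BT can schedule under Pascal's core-related parameters.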
To conduct further analysis of the benchmark characteristics as well as to identify effective hardware parameters, we divide the benchmarks into four groups. Group 1: As shown in Figure 2, Dwt2d and SR show minimal dependency on the increase of computing and memory resources. When varying either type of resource alone, there is only a small performance gain contributed by each. In both benchmarks, the core-related parameters have a bigger impact than the memory-related parameters. Interestingly, applying the full set of Pascal parameters provides an obvious performance improvement: 1.93 times and 2.93 times faster for Dwt2d and SR, respectively. These benchmarks are well programmed to be scalable with the additional resources offered by the recent architecture. For example, SR is a relatively compute-intensive benchmark that uses a matrix-like structure and exposes massive data parallelism. It also requires significant CPU-GPU communication, which needs high memory bandwidth [19]. These reasons explain why SR shows a noticeable performance improvement when both computing and memory resources are increased.
Group 2: Only the LUD benchmark falls in this group because it shows no performance improvement on the new GPU architecture. As mentioned above, this benchmark relies mostly on shared memory utilization. Since the amount of shared memory is unchanged in the Pascal architecture, increasing the memory channels and cache size in Pascal does not provide any advantage for LUD.
Group 3: HS (Hotspot), BP (Backpropagation), BT (B+ tree), and PF (PathFinder) are the benchmarks whose performance strongly depends on computing resources. To analyze the effects of computing resources, we evaluate GPU performance while changing the number of processing cores and monitoring how the benchmarks react. Figure 3 indicates that the performance gain is 222% on average if we simply double the number of processing cores. In detail, the performance of HS and BT is proportional to the number of shader cores. There is a sharp improvement (approximately 100%) when the number of cores increases from 16 to 36. The normalized IPC of these two benchmarks keeps increasing as more cores are added, although the rate of increase is not steep. Based on these results, we can see that there is still scope for further performance improvement in future GPU architectures. In the case of the BP (Backpropagation) benchmark, the IPC drops slightly when the number of cores reaches 76, which implies that BP achieves its peak performance before the number of cores is increased to 76. PF (PathFinder) shows an abnormal behavior: its IPC does not consistently improve as more cores are added. One of the reasons is that PF is a highly branch-divergent benchmark; therefore, under-utilization of computing resources may occur. Group 4: NW and BFS show similar behavior in this group, which is highly influenced by memory resources. We maintain the computing resources of the Fermi configuration in combination with Pascal's memory configuration while varying the number of memory channels (memory controllers). Increasing the number of memory channels is more practical than increasing the cache size. Figure 4 presents our simulation results for this experiment. Performance increases by 206% if we increase the number of memory channels by 3 times. Both the BFS and NW benchmarks are limited by the GPU off-chip bandwidth.
However, only BFS is responsive when the number of memory channels increases. NW shows an unconventional memory access pattern and is designed for extensive use of shared memory; therefore, adding more memory channels is not an effective way to improve performance for this kind of benchmark.
Figure 4. Performance impact of memory resources.
In the following evaluation, we observe the performance difference while changing the number of warp schedulers in the SM. We also measure the barrier waiting time for the evaluated cases. The warp schedulers play a critical role in maintaining the activity of all the memory and execution units within the SM. Multiple warp schedulers can issue many warps into the execution pipeline, leading to reduced waiting time in the warp pool. In some cases, this also increases the possibility that the remaining warps within one CTA hit and clear the synchronization barrier faster. In other words, multiple warp schedulers encourage the CTA to finish earlier, enabling more CTAs to be launched. Hence, it is necessary to evaluate the impact of the number of warp schedulers along with the barrier waiting time. Figure 5 shows the IPC when varying the number of warp schedulers in the Fermi GPU architecture. The warp scheduler selects a ready warp and issues one instruction from it to a group of 16 CUDA cores, 16 load/store units, or a group of four SFUs [6]. The warp schedulers only issue instructions subject to the availability of hardware resources. Benchmarks composed of a mix of memory and computational instructions have a high probability of issuing more warps to the SM's resources. Our experimental results show that four benchmarks (HS, BP, PF, SR) out of nine are clearly empowered by increasing the number of warp schedulers. HS and SR are compute-intensive benchmarks; therefore, they can utilize two groups of 16 CUDA cores with the help of multiple warp schedulers.
BT is a memory-intensive benchmark that does not benefit from multiple schedulers because there is only one group of 16 load/store units in the SM. Similarly, NW and BFS are categorized as memory-oriented benchmarks and show a similar pattern. Figure 6 shows the IPC with different numbers of warp schedulers in the Pascal architecture. BT, Dwt2d, and LUD show no IPC difference between dual and quad warp schedulers; with quad schedulers, more time is spent waiting at the synchronization barrier. In contrast, HS and BP take full advantage of multiple warp schedulers, resulting in a performance improvement of more than 1.7 times compared to a single warp scheduler.
Figure 6. Impact of the number of warp schedulers on GPU performance in the Pascal architecture.
Table 4 shows the barrier waiting time with multiple (dual and quad) warp schedulers, normalized to the barrier waiting time with a single warp scheduler, on the Fermi and Pascal architectures. Barrier waiting times for the NW and BFS benchmarks are not provided in this table. The reason is that NW has only one warp (16 threads) in a CTA; hence, a warp does not have to wait for any other warp to arrive at and cross the barrier. In the case of BFS, we cannot measure the barrier waiting time because there is no synchronization instruction in the benchmark. For BT, the barrier waiting time with multiple warp schedulers is longer than with a single warp scheduler because synchronization instructions are performed frequently. This means that multiple warp schedulers can increase the barrier waiting time, leading to reduced performance. In the benchmarks where multiple warp schedulers show a shorter barrier waiting time than a single warp scheduler, multiple warp schedulers improve GPU performance more efficiently.
From the results in Table 4, we can also see that using four warp schedulers is not always better than using two when two warp schedulers are sufficient to utilize the hardware resources in the SM. Therefore, the number of warp schedulers in the SM should be determined after careful evaluation of hardware utilization and barrier waiting time.
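The trade-off above can be sketched with a toy issue model; the unit grouping and warp mix are illustrative assumptions (not the Table 2 configuration), showing why extra schedulers help a mixed instruction stream but not a purely memory-bound one limited by a single LSU group:

```python
# Toy per-cycle issue model: each scheduler can issue one ready warp,
# but an issued warp also needs a free execution-unit group.

def issued_per_cycle(num_schedulers, ready_warps, unit_groups):
    """ready_warps: list of unit names ('sp' core group, 'lsu' group)
    needed by each ready warp; unit_groups: free groups this cycle."""
    free = dict(unit_groups)
    issued = 0
    for needed_unit in ready_warps[:num_schedulers]:
        if free.get(needed_unit, 0) > 0:  # issue only if a group is free
            free[needed_unit] -= 1
            issued += 1
    return issued

mixed = ["sp", "lsu", "sp", "sp"]       # compute/memory instruction mix
mem_bound = ["lsu", "lsu", "lsu", "lsu"]
units = {"sp": 2, "lsu": 1}             # two core groups, one LSU group

print(issued_per_cycle(1, mixed, units))      # 1
print(issued_per_cycle(2, mixed, units))      # 2: compute + memory overlap
print(issued_per_cycle(4, mem_bound, units))  # 1: single LSU group limits issue
```

In this model, a second scheduler doubles issue throughput for the mixed workload, while for the memory-bound mix any scheduler beyond the first is idle, consistent with the BT, NW, and BFS observations above.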

CONCLUSION
In this paper, we explored the advantages of recent GPU architectures over former architectures with respect to different computing and memory resources. We focused on performance analysis with two representative GPU architectures: Fermi for the old architecture and Pascal for the modern one. When the benchmarks are executed on the Pascal architecture, performance improves by 273% on average, and by up to 429% for Hotspot, compared to the Fermi architecture. To analyze the efficiency of hardware resources in the GPU, we conducted various simulations to evaluate the performance improvement under many combinations of computing and memory parameters, such as the number of shader cores, the number of memory channels, and the number of warp schedulers. Our simulation results indicate that shader cores and memory channels are critical factors in improving the performance of modern GPU architectures. We observed that bandwidth-consuming benchmarks improve performance by 206% if the number of memory channels is tripled, and computation-sensitive benchmarks show a speedup of 222% on average if the number of cores is doubled. For most of the evaluated benchmarks, utilizing computing and memory hardware efficiently provides a noticeable speedup. The experiments also pointed out some cases where applying the new architecture does not improve performance. For this reason, enhancing the hardware resources in the GPU should be considered together with the characteristics of the applications. We also analyzed the efficiency of employing multiple warp schedulers in the SM. Our experiments showed that the number of warp schedulers in the SM is well chosen in recent GPU architectures, where multiple warp schedulers increase hardware utilization and reduce the barrier waiting time for synchronization. However, excessive warp schedulers may cause resource under-utilization, leading to reduced performance.
Therefore, adding more warp schedulers needs to be considered carefully if future GPU architectures increase the hardware resources in the SM to solve more complex problems.