Hardware Acceleration of EDA Algorithms- P11

Shared by: Cong Thanh | Date: | File type: PDF | Pages: 7


Hardware Acceleration of EDA Algorithms- P11: Single-threaded software applications have ceased to see significant gains in performance on a general-purpose CPU, even with further scaling in very large scale integration (VLSI) technology. This is a significant problem for electronic design automation (EDA) applications, since the design complexity of VLSI integrated circuits (ICs) is continuously growing. In this research monograph, we evaluate custom ICs, field-programmable gate arrays (FPGAs), and graphics processors as platforms for accelerating EDA algorithms, instead of the general-purpose single-threaded CPU....


Text content: Hardware Acceleration of EDA Algorithms- P11

12 Conclusions

[Fig. 12.2 Larrabee architecture from Intel. Block labels: multi-threaded wide SIMD cores, each with I$ and D$; L2 cache; memory controllers; fixed function texture logic; display interface; system interface.]

[Fig. 12.3 Fermi architecture from NVIDIA. Block labels: shared multiprocessor; core; L2; DRAM I/F; HOST I/F; Giga Thread.]

multiprocessor (SM). The block diagram of a single SM is shown in Fig. 12.4 and the block diagram of a core within an SM is shown in Fig. 12.5.

[Fig. 12.4 Block diagram of a single shared multiprocessor (SM) in Fermi. Block labels: instruction cache; two schedulers, each with a dispatch unit; register file; array of cores; load/store units x 16; special function units x 4; interconnect network; 64K configurable cache/shared memory; uniform cache.]

With these upcoming architectures, newer approaches for hardware acceleration of algorithms would become viable. These approaches could exploit the more general computing paradigm offered by the newer architectures. For example, the close coupling between the GPU and the CPU (which reside on the same die) would reduce the communication cost. Also, in these upcoming architectures the instruction dispatch unit is distributed, and the instruction set is more general purpose. These enhancements would enable a more general computing paradigm (in comparison to the SIMD paradigm for current GPUs), which in turn would enable acceleration opportunities for more EDA applications.

The approaches presented in this monograph collectively aim to contribute toward enabling the CAD community to accelerate EDA algorithms on modern hardware platforms. Our work demonstrates techniques to rearchitect several EDA algorithms to maximally harness their performance on the alternative platforms under consideration.
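The remark that placing the GPU and CPU on the same die would reduce communication cost can be made concrete with a small cost model. The sketch below is an illustration only and is not taken from the monograph: the function name `effective_speedup`, the additive model (kernel time plus transfer time), and all numeric values are assumptions chosen to show the trend.

```python
def effective_speedup(cpu_time_s, kernel_speedup, transfer_time_s):
    """Speedup over a CPU-only run once data transfers are counted.

    Assumed model (not from the source): the accelerated run costs the
    CPU time divided by the raw kernel speedup, plus a fixed transfer time.
    """
    if cpu_time_s <= 0 or kernel_speedup <= 0 or transfer_time_s < 0:
        raise ValueError("times and speedup must be positive; transfer time non-negative")
    accelerated_time_s = cpu_time_s / kernel_speedup + transfer_time_s
    return cpu_time_s / accelerated_time_s


# Hypothetical workload: 10 s on the CPU, kernel itself runs 50x faster.
CPU_TIME_S = 10.0
KERNEL_SPEEDUP = 50.0

# Hypothetical transfer costs: a loosely coupled accelerator versus one
# sharing the die with the CPU.
loosely_coupled = effective_speedup(CPU_TIME_S, KERNEL_SPEEDUP, transfer_time_s=0.8)
same_die = effective_speedup(CPU_TIME_S, KERNEL_SPEEDUP, transfer_time_s=0.05)

print(f"loosely coupled: {loosely_coupled:.1f}x, same die: {same_die:.1f}x")
```

With these made-up numbers the same 50x kernel yields roughly 10x end to end when transfers dominate and roughly 40x when they are cheap, which is the kind of gain the text attributes to closer CPU-GPU coupling.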
[Fig. 12.5 Block diagram of a single processor (core) in SM. Block labels: CUDA core; dispatch port; operand collector; FP unit; INT unit; result queue.]

References

1. The MiniSAT Page. http://www.cs.chalmers.se/cs/research/formalmethods/minisat/main.html
2. NVIDIA Tesla GPU Computing Processor. http://www.nvidia.com/object/IO_43499.html
3. OmegaSim Mixed-Signal Fast-SPICE Simulator. http://www.nascentric.com/product.html
4. Lee, H.K., Ha, D.S.: An efficient, forward fault simulation algorithm based on the parallel pattern single fault propagation. In: Proceedings of the IEEE International Test Conference on Test, pp. 946–955. IEEE Computer Society, Washington, DC (1991)
5. Seiler, L., Carmean, D., Sprangle, E., Forsyth, T., Abrash, M., Dubey, P., Junkins, S., Lake, A., Sugerman, J., Cavin, R., Espasa, R., Grochowski, E., Juan, T., Hanrahan, P.: Larrabee: A many-core x86 architecture for visual computing. ACM Transactions on Graphics 27(3), 1–15 (2008)
6. Silva, M., Sakallah, J.: GRASP-a new search algorithm for satisfiability. In: Proceedings of the International Conference on Computer-Aided Design (ICCAD), pp. 220–7 (1996)
Index

A
Accelerators, 9
ACML-GPU, 15
Activity, 93
Algorithm parallel, 120, 121, 134
Amdahl’s Law, 158, 170
Application specific, 64
Arrival time, 110
Assignment, 31, 37, 40

B
Backtracking, 32
Bandwidth, 13
Bandwidth minimization, 52
Bank conflict, 27
BCP, 32, 37, 40
Bias
  survey propagation, 89
Bins, 64
Bin packing, 52, 70
Bin utilization, 74
Bit parallel, 135, 146
Block, 28
Block-based
  SSTA, 108
Board test, 15
Boolean Constant Propagation, see BCP
Boolean Satisfiability, see SAT
Box-Muller, 101
BRAM, 11, 14, 32, 63, 66, 72, 78
Brook+, 15
BSIM3
  SPICE, 158
BSIM4
  SPICE, 158
Bulldog Fortran, 171

C
Capacity, 31, 35
CDFG, 160
Clause, 31
Clock speed, 11
CNF, 31, 34
Co-processors, 9
Compilers, 16
Complete
  SAT, 83, 85
Conflict, 37, 40, 42, 44, 71
Conflict clause, 31
Conflict clause generation, 33, 64
Conjunctive Normal Form, 34
Constant Memory, 26, 161
Control and dataflow graph, 173
Control dominated
  EDA, 3
Control plus data parallel
  EDA, 3
Core, 185
Critical line
  critical path tracing, 138
Critical path tracing, 138
CUBLAS, 15
CUDA, 15, 24
CUFFT, 15
Cumulative detectability, 138
Custom IC, 7, 10, 33

D
Data parallel, 28, 106, 120, 122, 134
Debuggers, 16
Decision engine, 37, 39, 49, 70
Decision level, 39, 67
Decisions
  SAT, 32
Detectability, 138
DFF, 11
DIMACS, 45
Dimblock, 29
Dimensionality, 29
Dimgrid, 29
Divide, 12
Dominator, 138
DPLL, 85
DRAM, 14, 66, 184
Dropped
  fault table, 134
Dynamic
  power, 10
Dynamic bulk modulation, 10

E
EDA, 3
Embedded processor, 10

F
Factor Graph, 87
Fault detection, 134
Fault diagnosis, 134
Fault dropping, 134
Fault grading, 102, 120
Fault injection, 135
Fault parallel
  data parallel, 120
Fault simulation, 4, 119
Fault table, 4, 134
Fermi, 184
Fingerprinting, 19
FPGA, 3, 7, 10, 32
Function
  Factor Graph, 87

G
Global Memory, 13, 27, 110, 159
GPGPU, 3
Graphics Processors, see GPU
GRASP, 35, 64, 85
Grid, see dimgrid
GridSAT, 87
GSAT, 85

H
Hardware
  IP cores, 15
HDL, 10, 14, 19
Hybrid
  SAT, 85

I
Immediate dominator
  dominator, 138
Implication, 37, 40, 44
Implication graph, 31, 33, 37, 50, 64
Infringement
  security, 19
Input vector control, 10
Instance specific, 64
Inter-bin
  non-chronological, 32
Intra-bin
  non-chronological, 32
IP cores, 15

K
Kernel, 28, 167, 184

L
Larrabee, 184
Latency, 11, 13
Leakage
  power, 10
Levelize, 112
Literal, 37
  free literal, 41
Local memory, 12, 27
Logic analyzers, 15
Lookup table, 11, 106, 120
LUT, 12

M
Memory bandwidth, 1, 13
Memory wall, 1
Mersenne Twister, 101, 106, 112
MIMD, 171
Minimum unsatisfiable core, 31, 33, 53
MiniSAT, 85
MNA
  SPICE, 154
Model evaluation, 154
Model parallel, 122, 134
Monte Carlo, 4
  SSTA, 101, 106
Moore’s Law, 24
MOPs, 17
MOPs per watt
  MOPs, 17
Multi-GPU, 16
Multi-port
  memory, 20
Multiprocessor, 12, 24
MUX, 11

N
Newton-Raphson, 154
NMOS
  passgates, 11
Non-chronological backtrack, 32, 43, 45, 64, 68, 85
Non-recurring engineering, 10, 18
Non-volatile
  memory, 20

O
Off-chip, 14
On-chip, 14
OPB, 67, 72

P
Paging, 12
Parafrase, 170
Parallel
  SAT, 85
Partition, 32, 35, 63, 78
Pass/fail fault dictionary, 134
Path-based
  SSTA, 108
Pattern parallel
  data parallel, 120
PCI, 15
PCI-X
  PCI, 15
Pipeline, 11
Piracy
  security, 19
PLB, 67, 72
PLB2OPB bridge, 72
Power, 10, 56
  average power, 58
Power delay product, 18
Power gating, 10
Power wall, 1
PowerPC, 32
Precharged, 39
Predischarged, 39
Process variations, 106
Processor, 24
Profiling
  code, 16
Programmable, 12
Prototyping, 16

Q
QuickPath Interconnect, 18

R
Random
  variations, 106
Re-programmability, 19
Reconfigurable logic
  FPGA, 11
Reconfigure, 12
Reduced OR, 144
Register, 26, 172
Resolution, 36
Reuse-based design, 19

S
Sample parallelism, 106
SAT, 4, 31, 33, 34, 36
  3-SAT, 36
Scalability, 15, 31, 35, 66
Scattered reads, 29
SEE, 18, 114
Self-test, 15
Sensitive input, 138
Shared Memory, 26, 27, 110
Shared multiprocessor, 185
SIMD, 3, 18, 29
Software
  IP cores, 15
Span, 69
Speedup, 31
SPICE, 31, 153
Square root, 12
SRAM, 11
SSTA, 4, 101, 106
STA, 101, 106
  SPICE, 154
Stem, 137
Stem region, 138, 143
Stochastic
  SAT, 83, 85
Subroutine, 167
Subsumption
  resolution, 56
Successive chord, 156
Supply voltage, 10
Survey propagation, 84
Surveys
  survey propagation, 88
Synchronization points, 29
Synchronize, 28
System test, 15
Systematic variations, 106

T
Termination cell, 40
Texture fetching
  Texture Memory, 27
Texture Memory, 26, 110, 155, 160
Thread, 28, 146
Thread block, 28
Thread parallel, 135
Thread scheduler, 29
Throughput, 11
Time slicing, 29
Tree
  Factor Graph, 87

U
Unate covering, 134

V
Variable, 31, 37
  Factor Graph, 87
Variable ordering
  SAT, 32
Variable Vt, 10
Variations, 106
Virtual memory, 12
VLIW, 171
VLSI, 106
VSIDS, 93

W
WalkSAT, 85, 90, 96
Warp size, 29
Warps, 29
Watermarking, 19

X
XC2VP30
  FPGA, 32

K. Gulati, S.P. Khatri, Hardware Acceleration of EDA Algorithms, DOI 10.1007/978-1-4419-0944-2, © Springer Science+Business Media, LLC 2010