
Lecture: Computer Architecture, Part VII

Shared by: Codon_06 | Date: | File type: PPT | Pages: 67


The lecture Computer Architecture, Part VII: Advanced Architectures focuses on the road to higher performance; vector and array processing; shared-memory multiprocessing; and related topics.


Text content: Computer Architecture lecture, Part VII

1. Part VII: Advanced Architectures
2. About This Presentation
This presentation is intended to support the use of the textbook Computer Architecture: From Microprocessors to Supercomputers, Oxford University Press, 2005, ISBN 0-19-515455-X. It is updated regularly by the author as part of his teaching of the upper-division course ECE 154, Introduction to Computer Architecture, at the University of California, Santa Barbara. Instructors can use these slides freely in classroom teaching and for other educational purposes. Any other use is strictly prohibited. © Behrooz Parhami
Edition: First | Released: July 2003 | Revised: July 2004, July 2005
3. VII Advanced Architectures
Performance enhancement beyond what we have seen:
• What else can we do at the instruction execution level?
• Data parallelism: vector and array processing
• Control parallelism: parallel and distributed processing
Topics in This Part
Chapter 25 Road to Higher Performance
Chapter 26 Vector and Array Processing
Chapter 27 Shared-Memory Multiprocessing
Chapter 28 Distributed Multicomputing
4. 25 Road to Higher Performance
Review past, current, and future architectural trends:
• General-purpose and special-purpose acceleration
• Introduction to data and control parallelism
Topics in This Chapter
25.1 Past and Current Performance Trends
25.2 Performance-Driven ISA Extensions
25.3 Instruction-Level Parallelism
25.4 Speculation and Value Prediction
25.5 Special-Purpose Hardware Accelerators
25.6 Vector, Array, and Parallel Processing
5. 25.1 Past and Current Performance Trends
Computer performance grew by a factor of about 10,000 between 1980 and 2000: 100 due to faster technology and 100 due to better architecture.
Available computing power ca. 2000: GFLOPS on desktop, TFLOPS in supercomputer center, PFLOPS on drawing board.
Architectural method | Improvement factor
Established methods (previously discussed):
1. Pipelining (and superpipelining) | 3-8
2. Cache memory, 2-3 levels | 2-5
3. RISC and related ideas | 2-3
4. Multiple instruction issue (superscalar) | 2-3
5. ISA extensions (e.g., for multimedia) | 1-3
Newer methods (covered in Part VII):
6. Multithreading (super-, hyper-) | 2-5
7. Speculation and value prediction | 2-3
8. Hardware acceleration | 2-10
9. Vector and array processing | 2-10
10. Parallel/distributed computing | 2-1000s
6. Peak Performance of Supercomputers
[Figure: peak supercomputer performance, 1980-2010, rising roughly 10x every 5 years from GFLOPS toward PFLOPS; machines shown include Cray X-MP, Cray 2, TMC CM-2, TMC CM-5, Cray T3D, ASCI Red, ASCI White Pacific, and the Earth Simulator.]
Dongarra, J., "Trends in High Performance Computing," Computer J., Vol. 47, No. 4, pp. 399-403, 2004. [Dong04]
7. Energy Consumption Is Getting Out of Hand
Figure 25.1 Trend in energy consumption for each MIPS of computational power in general-purpose processors and DSPs. [Plot, 1980-2010: absolute processor performance (kIPS to TIPS) together with DSP performance per watt and general-purpose processor performance per watt.]
8. 25.2 Performance-Driven ISA Extensions
Adding instructions that do more work per cycle:
Shift-add: replace two instructions with one (e.g., multiply by 5)
Multiply-add: replace two instructions with one (x := c + a × b)
Multiply-accumulate: reduce round-off error (s := s + a × b)
Conditional copy: to avoid some branches (e.g., in if-then-else)
Subword parallelism (for multimedia applications):
Intel MMX: multimedia extension; 64-bit registers can hold multiple integer operands
Intel SSE: streaming SIMD extension; 128-bit registers can hold several floating-point operands
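To make the first few extensions concrete, here is a minimal C sketch (the helper names are illustrative, not from the book) of the scalar operations they collapse into single instructions: a shift-add for multiplication by 5, a fused multiply-add, and a conditional copy that avoids a branch.

    #include <stdint.h>

    /* Multiply by 5 with a single shift-add: 5*x = (x << 2) + x, so one
       shift-add instruction replaces a separate shift and add (or a
       general multiply). */
    static inline uint32_t times5(uint32_t x) {
        return (x << 2) + x;
    }

    /* Multiply-add: x := c + a * b as one fused operation on ISAs that
       provide it; written here as the plain C expression a compiler may
       map onto such an instruction. */
    static inline int32_t multiply_add(int32_t c, int32_t a, int32_t b) {
        return c + a * b;
    }

    /* Conditional copy: r receives a only when cond holds, replacing an
       if-then-else branch with a branch-free select. */
    static inline int32_t conditional_copy(int cond, int32_t r, int32_t a) {
        return cond ? a : r;
    }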
9. Table 25.1 Intel MMX ISA extension (Class | Instruction | Vector | Op type | Function or results)
Copy | Register copy | 32 bits | | Integer register to/from MMX register
Copy | Parallel pack | 4, 2 | Saturate | Convert to narrower elements
Copy | Parallel unpack low | 8, 4, 2 | | Merge lower halves of 2 vectors
Copy | Parallel unpack high | 8, 4, 2 | | Merge upper halves of 2 vectors
Arithmetic | Parallel add | 8, 4, 2 | Wrap/Saturate# | Add; inhibit carry at boundaries
Arithmetic | Parallel subtract | 8, 4, 2 | Wrap/Saturate# | Subtract with carry inhibition
Arithmetic | Parallel multiply low | 4 | | Multiply, keep the 4 low halves
Arithmetic | Parallel multiply high | 4 | | Multiply, keep the 4 high halves
Arithmetic | Parallel multiply-add | 4 | | Multiply, add adjacent products*
Arithmetic | Parallel compare equal | 8, 4, 2 | | All 1s where equal, else all 0s
Arithmetic | Parallel compare greater | 8, 4, 2 | | All 1s where greater, else all 0s
10. MMX Multiplication and Multiply-Add
Figure 25.2 Parallel multiplication and multiply-add in MMX: (a) parallel multiply low; (b) parallel multiply-add.
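As a rough scalar model of Figure 25.2 (a sketch assuming four signed 16-bit lanes per 64-bit MMX operand, not actual intrinsics), the two operations behave as follows: parallel multiply low keeps the low half of each lane-wise product, and parallel multiply-add forms full products and sums adjacent pairs into two 32-bit results.

    #include <stdint.h>

    /* Parallel multiply low: lane-wise 16x16 multiply, keeping only the
       low 16 bits of each product (the behavior sketched in Fig. 25.2a). */
    void parallel_multiply_low(const int16_t a[4], const int16_t b[4],
                               int16_t out[4]) {
        for (int i = 0; i < 4; i++)
            out[i] = (int16_t)(a[i] * b[i]);   /* truncate to the low half */
    }

    /* Parallel multiply-add: full 16x16 products, with adjacent pairs
       added into two 32-bit results (the behavior sketched in Fig. 25.2b). */
    void parallel_multiply_add(const int16_t a[4], const int16_t b[4],
                               int32_t out[2]) {
        for (int i = 0; i < 2; i++)
            out[i] = (int32_t)a[2*i] * b[2*i] + (int32_t)a[2*i+1] * b[2*i+1];
    }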
11. MMX Parallel Comparisons
Figure 25.3 Parallel comparisons in MMX: (a) parallel compare equal; (b) parallel compare greater. Each result element is all 1s (65,535 for 16-bit elements, 255 for 8-bit elements) where the condition holds and all 0s elsewhere.
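A scalar sketch of the compare semantics on 16-bit lanes (illustrative only): each output lane becomes an all-ones mask when the comparison holds, which makes the result directly usable for later masking.

    #include <stdint.h>

    /* Parallel compare equal on four 16-bit lanes: 0xFFFF where equal, 0 otherwise. */
    void parallel_compare_equal(const uint16_t a[4], const uint16_t b[4],
                                uint16_t out[4]) {
        for (int i = 0; i < 4; i++)
            out[i] = (a[i] == b[i]) ? 0xFFFF : 0x0000;
    }

    /* Parallel compare greater on four signed 16-bit lanes: 0xFFFF where a > b. */
    void parallel_compare_greater(const int16_t a[4], const int16_t b[4],
                                  uint16_t out[4]) {
        for (int i = 0; i < 4; i++)
            out[i] = (a[i] > b[i]) ? 0xFFFF : 0x0000;
    }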
12. 25.3 Instruction-Level Parallelism
Figure 25.4 Available instruction-level parallelism and the speedup due to multiple instruction issue in superscalar processors [John91]: (a) fraction of cycles vs. issuable instructions per cycle; (b) speedup attained vs. instruction issue width.
13. Instruction-Level Parallelism
Figure 25.5 A computation with inherent instruction-level parallelism.
14. VLIW and EPIC Architectures
VLIW: very long instruction word architecture. EPIC: explicitly parallel instruction computing.
Figure 25.6 Hardware organization for IA-64. General and floating-point registers are 64 bits wide; predicates are single-bit registers. [Diagram: 128 general registers and 128 floating-point registers feeding two rows of execution units, 64 predicate registers, and memory.]
15. 25.4 Speculation and Value Prediction
Figure 25.7 Examples of software speculation in IA-64: (a) control speculation; (b) data speculation. In each case an ordinary load is replaced by an earlier speculative load ("spec load"), with a "check load" left at the original position to validate the speculation.
16. Value Prediction
Figure 25.8 Value prediction for multiplication or division via a memo table. [Diagram: the inputs feed both a memo table and the multiply/divide unit; a multiplexer selects the memoized output on a hit and the unit's output on a miss, under control signals for "inputs ready", "output ready", and "done".]
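The idea behind Figure 25.8 can be sketched in software as a small memoization table for division (an illustrative analogy only; the table size, hash, and software divide are assumptions, whereas the real mechanism is a hardware table consulted in parallel with the arithmetic unit):

    #include <stdint.h>

    #define MEMO_ENTRIES 256   /* illustrative table size */

    typedef struct {
        int      valid;
        uint32_t a, b;         /* operand tag */
        uint32_t quotient;     /* previously computed result */
    } MemoEntry;

    static MemoEntry memo[MEMO_ENTRIES];

    /* Divide with memoization: on a table hit the long-latency divide is
       skipped and the stored result is reused; on a miss the divide runs
       and its result is recorded. Assumes b != 0. */
    static uint32_t divide_memoized(uint32_t a, uint32_t b) {
        unsigned idx = (a ^ (b * 2654435761u)) % MEMO_ENTRIES;  /* simple hash */
        MemoEntry *e = &memo[idx];
        if (e->valid && e->a == a && e->b == b)
            return e->quotient;           /* hit: reuse memoized output */
        uint32_t q = a / b;               /* miss: perform the operation */
        e->valid = 1; e->a = a; e->b = b; e->quotient = q;
        return q;
    }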
17. 25.5 Special-Purpose Hardware Accelerators
Figure 25.9 General structure of a processor with configurable hardware accelerators. [Diagram: a CPU with data/program memory and configuration memory attached to an FPGA-like unit on which accelerators (Accel. 1, 2, 3) are formed by loading configuration registers; unused resources remain available.]
18. Graphic Processors, Network Processors, etc.
Figure 25.10 Simplified block diagram of Toaster2, Cisco Systems' network processor. [Diagram: a 4x4 grid of processing elements (PE0-PE15) between input and output buffers, with a column memory per column and a feedback path.]
19. 25.6 Vector, Array, and Parallel Processing
Figure 25.11 The Flynn-Johnson classification of computer systems.
Flynn's categories (instruction streams x data streams):
SISD (single instruction, single data): uniprocessors
SIMD (single instruction, multiple data): array or vector processors
MISD (multiple instruction, single data): rarely used
MIMD (multiple instruction, multiple data): multiprocessors or multicomputers
Johnson's expansion of MIMD (memory organization x communication):
GMSV (global memory, shared variables): shared-memory multiprocessors
GMMP (global memory, message passing): rarely used
DMSV (distributed memory, shared variables): distributed shared memory
DMMP (distributed memory, message passing): distributed-memory multicomputers
20. SIMD Architectures
Data parallelism: executing one operation on multiple data streams.
Concurrency in time: vector processing. Concurrency in space: array processing.
Example to provide context: multiplying a coefficient vector by a data vector (e.g., in filtering), y[i] := c[i] × x[i], 0 ≤ i < n.
Sources of performance improvement in vector processing (details in the first half of Chapter 26):
One instruction is fetched and decoded for the entire operation.
The multiplications are known to be independent (no checking).
Pipelining/concurrency in memory access as well as in arithmetic.
Array processing is similar (details in the second half of Chapter 26).
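The filtering example above is exactly the kind of loop a vector or SIMD machine accelerates; a plain C version (a sketch, with float elements assumed) is shown below. On a vector processor the whole loop maps to a few vector instructions: one fetch/decode covers all n multiplications, the iterations are known to be independent, and memory accesses pipeline alongside the arithmetic.

    #include <stddef.h>

    /* y[i] := c[i] * x[i] for 0 <= i < n: a coefficient vector times a
       data vector, as in filtering. Every iteration is independent, so
       the loop is a natural target for vector/SIMD execution (or for
       compiler auto-vectorization on a scalar machine). */
    void vector_multiply(const float *c, const float *x, float *y, size_t n) {
        for (size_t i = 0; i < n; i++)
            y[i] = c[i] * x[i];
    }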