Reading Notes on 深入理解并行编程
1 How To Use This Book
1.1 Roadmap
1.2 Quick Quizzes
1.3 Alternatives to This Book
1.4 Sample Source Code
1.5 Whose Book Is This?
2 Introduction
2.1 Historic Parallel Programming Difficulties
2.2 Parallel Programming Goals
2.2.1 Performance
2.2.2 Productivity
2.2.3 Generality
2.3 Alternatives to Parallel Programming
2.3.1 Multiple Instances of a Sequential Application
2.3.2 Use Existing Parallel Software
2.3.3 Performance Optimization
2.4 What Makes Parallel Programming Hard?
2.4.1 Work Partitioning
2.4.2 Parallel Access Control
2.4.3 Resource Partitioning and Replication
2.4.4 Interacting With Hardware
2.4.5 Composite Capabilities
2.4.6 How Do Languages and Environments Assist With These Tasks?
2.5 Discussion
3 Hardware and its Habits
3.1 Overview
3.1.1 Pipelined CPUs
3.1.2 Memory References
3.1.3 Atomic Operations
3.1.4 Memory Barriers
3.1.5 Cache Misses
3.1.6 I/O Operations
3.2 Overheads
3.2.1 Hardware System Architecture
3.2.2 Costs of Operations
3.2.3 Hardware Optimizations
3.3 Hardware Free Lunch?
3.3.1 3D Integration
3.3.2 Novel Materials and Processes
3.3.3 Light, Not Electrons
3.3.4 Special-Purpose Accelerators
3.3.5 Existing Parallel Software
3.4 Software Design Implications
4 Tools of the Trade
4.1 Scripting Languages
4.2 POSIX Multiprocessing
4.2.1 POSIX Process Creation and Destruction
4.2.2 POSIX Thread Creation and Destruction
4.2.3 POSIX Locking
4.2.4 POSIX Reader-Writer Locking
4.2.5 Atomic Operations (GCC Classic)
4.2.6 Atomic Operations (C11)
4.2.7 Atomic Operations (Modern GCC)
4.2.8 Per-Thread Variables
4.3 Alternatives to POSIX Operations
4.3.1 Organization and Initialization
4.3.2 Thread Creation, Destruction, and Control
4.3.3 Locking
4.3.4 Accessing Shared Variables
4.3.5 Atomic Operations
4.3.6 Per-CPU Variables
4.4 The Right Tool for the Job: How to Choose?
5 Counting
5.1 Why Isn't Concurrent Counting Trivial?
5.2 Statistical Counters
5.2.1 Design
5.2.2 Array-Based Implementation
5.2.3 Per-Thread-Variable-Based Implementation
5.2.4 Eventually Consistent Implementation
5.2.5 Discussion
5.3 Approximate Limit Counters
5.3.1 Design
5.3.2 Simple Limit Counter Implementation
5.3.3 Simple Limit Counter Discussion
5.3.4 Approximate Limit Counter Implementation
5.3.5 Approximate Limit Counter Discussion
5.4 Exact Limit Counters
5.4.1 Atomic Limit Counter Implementation
5.4.2 Atomic Limit Counter Discussion
5.4.3 Signal-Theft Limit Counter Design
5.4.4 Signal-Theft Limit Counter Implementation
5.4.5 Signal-Theft Limit Counter Discussion
5.4.6 Applying Exact Limit Counters
5.5 Parallel Counting Discussion
5.5.1 Parallel Counting Validation
5.5.2 Parallel Counting Performance
5.5.3 Parallel Counting Specializations
5.5.4 Parallel Counting Lessons
6 Partitioning and Synchronization Design
6.1 Partitioning Exercises
6.1.1 Dining Philosophers Problem
6.1.2 Double-Ended Queue
6.1.3 Partitioning Example Discussion
6.2 Design Criteria
6.3 Synchronization Granularity
6.3.1 Sequential Program
6.3.2 Code Locking
6.3.3 Data Locking
6.3.4 Data Ownership
6.3.5 Locking Granularity and Performance
6.4 Parallel Fastpath
6.4.1 Reader/Writer Locking
6.4.2 Hierarchical Locking
6.4.3 Resource Allocator Caches
6.5 Beyond Partitioning
6.5.1 Work-Queue Parallel Maze Solver
6.5.2 Alternative Parallel Maze Solver
6.5.3 Maze Validation
6.5.4 Performance Comparison I
6.5.5 Alternative Sequential Maze Solver
6.5.6 Performance Comparison II
6.5.7 Future Directions and Conclusions
6.6 Partitioning, Parallelism, and Optimization
7 Locking
7.1 Staying Alive
7.1.1 Deadlock
7.1.2 Livelock and Starvation
7.1.3 Unfairness
7.1.4 Inefficiency
7.2 Types of Locks
7.2.1 Exclusive Locks
7.2.2 Reader-Writer Locks
7.2.3 Beyond Reader-Writer Locks
7.2.4 Scoped Locking
7.3 Locking Implementation Issues
7.3.1 Sample Exclusive-Locking Implementation Based on Atomic Exchange
7.3.2 Other Exclusive-Locking Implementations
7.4 Lock-Based Existence Guarantees
7.5 Locking: Hero or Villain?
7.5.1 Locking For Applications: Hero!
7.5.2 Locking For Parallel Libraries: Just Another Tool
7.5.3 Locking For Parallelizing Sequential Libraries: Villain!
7.6 Summary
8 Data Ownership
8.1 Multiple Processes
8.2 Partial Data Ownership and pthreads
8.3 Function Shipping
8.4 Designated Thread
8.5 Privatization
8.6 Other Uses of Data Ownership
9 Deferred Processing
9.1 Running Example
9.2 Reference Counting
9.3 Hazard Pointers
9.4 Sequence Locks
9.5 Read-Copy Update (RCU)
9.5.1 Introduction to RCU
9.5.2 RCU Fundamentals
9.5.3 RCU Linux-Kernel API
9.5.4 RCU Usage
9.5.5 RCU Related Work
9.6 Which to Choose?
9.6.1 Which to Choose? (Overview)
9.6.2 Which to Choose? (Details)
9.6.3 Which to Choose? (Production Use)
9.7 What About Updates?
10 Data Structures
10.1 Motivating Application
10.2 Partitionable Data Structures
10.2.1 Hash-Table Design
10.2.2 Hash-Table Implementation
10.2.3 Hash-Table Performance
10.3 Read-Mostly Data Structures
10.3.1 RCU-Protected Hash Table Implementation
10.3.2 RCU-Protected Hash Table Validation
10.3.3 RCU-Protected Hash Table Performance
10.3.4 RCU-Protected Hash Table Discussion
10.4 Non-Partitionable Data Structures
10.4.1 Resizable Hash Table Design
10.4.2 Resizable Hash Table Implementation
10.4.3 Resizable Hash Table Discussion
10.4.4 Other Resizable Hash Tables
10.5 Other Data Structures
10.6 Micro-Optimization
10.6.1 Specialization
10.6.2 Bits and Bytes
10.6.3 Hardware Considerations
10.7 Summary
11 Validation
11.1 Introduction
11.1.1 Where Do Bugs Come From?
11.1.2 Required Mindset
11.1.3 When Should Validation Start?
11.1.4 The Open Source Way
11.2 Tracing
11.3 Assertions
11.4 Static Analysis
11.5 Code Review
11.5.1 Inspection
11.5.2 Walkthroughs
11.5.3 Self-Inspection
11.6 Probability and Heisenbugs
11.6.1 Statistics for Discrete Testing
11.6.2 Statistics Abuse for Discrete Testing
11.6.3 Statistics for Continuous Testing
11.6.4 Hunting Heisenbugs
11.7 Performance Estimation
11.7.1 Benchmarking
11.7.2 Profiling
11.7.3 Differential Profiling
11.7.4 Microbenchmarking
11.7.5 Isolation
11.7.6 Detecting Interference
11.8 Summary
12 Formal Verification
12.1 State-Space Search
12.1.1 Promela and Spin
12.1.2 How to Use Promela
12.1.3 Promela Example: Locking
12.1.4 Promela Example: QRCU
12.1.5 Promela Parable: dynticks and Preemptible RCU
12.1.6 Validating Preemptible RCU and dynticks
12.2 Special-Purpose State-Space Search
12.2.1 Anatomy of a Litmus Test
12.2.2 What Does This Litmus Test Mean?
12.2.3 Running a Litmus Test
12.2.4 PPCMEM Discussion
12.3 Axiomatic Approaches
12.3.1 Axiomatic Approaches and Locking
12.3.2 Axiomatic Approaches and RCU
12.4 SAT Solvers
12.5 Stateless Model Checkers
12.6 Summary
12.7 Choosing a Validation Plan
13 Putting It All Together
13.1 Counter Conundrums
13.1.1 Counting Updates
13.1.2 Counting Lookups
13.2 Refurbish Reference Counting
13.2.1 Implementation of Reference-Counting Categories
13.2.2 Counter Optimizations
13.3 Hazard-Pointer Helpers
13.3.1 Scalable Reference Count
13.3.2 Long-Duration Accesses
13.4 Sequence-Locking Specials
13.4.1 Dueling Sequence Locks
13.4.2 Correlated Data Elements
13.4.3 Atomic Move
13.4.4 Upgrade to Writer
13.5 RCU Rescues
13.5.1 RCU and Per-Thread-Variable-Based Statistical Counters
13.5.2 RCU and Counters for Removable I/O Devices
13.5.3 Array and Length
13.5.4 Correlated Fields
13.5.5 Update-Friendly Traversal
13.5.6 Scalable Reference Count Two
13.5.7 Retriggered Grace Periods
13.5.8 Long-Duration Accesses Two
14 Advanced Synchronization
14.1 Avoiding Locks
14.2 Non-Blocking Synchronization
14.2.1 Simple NBS
14.2.2 Applicability of NBS Benefits
14.2.3 NBS Discussion
14.3 Parallel Real-Time Computing
14.3.1 What is Real-Time Computing?
14.3.2 Who Needs Real-Time?
14.3.3 Who Needs Parallel Real-Time?
14.3.4 Implementing Parallel Real-Time Systems
14.3.5 Implementing Parallel Real-Time Operating Systems
14.3.6 Implementing Parallel Real-Time Applications
14.3.7 Real Time vs. Real Fast: How to Choose?
15 Advanced Synchronization: Memory Ordering
15.1 Ordering: Why and How?
15.1.1 Why Hardware Misordering?
15.1.2 How to Force Ordering?
15.1.3 Basic Rules of Thumb
15.2 Tricks and Traps
15.2.1 Variables With Multiple Values
15.2.2 Memory-Reference Reordering
15.2.3 Address Dependencies
15.2.4 Data Dependencies
15.2.5 Control Dependencies
15.2.6 Cache Coherence
15.2.7 Multicopy Atomicity
15.3 Compile-Time Consternation
15.3.1 Memory-Reference Restrictions
15.3.2 Address- and Data-Dependency Difficulties
15.3.3 Control-Dependency Calamities
15.4 Higher-Level Primitives
15.4.1 Memory Allocation
15.4.2 Locking
15.4.3 RCU
15.5 Hardware Specifics
15.5.1 Alpha
15.5.2 Armv7-A/R
15.5.3 Armv8
15.5.4 Itanium
15.5.5 MIPS
15.5.6 POWER / PowerPC
15.5.7 SPARC TSO
15.5.8 x86
15.5.9 z Systems
15.6 Where is Memory Ordering Needed?
16 Ease of Use
16.1 What is Easy?
16.2 Rusty Scale for API Design
16.3 Shaving the Mandelbrot Set
17 Conflicting Visions of the Future
17.1 The Future of CPU Technology Ain't What it Used to Be
17.1.1 Uniprocessor Über Alles
17.1.2 Multithreaded Mania
17.1.3 More of the Same
17.1.4 Crash Dummies Slamming into the Memory Wall
17.1.5 Astounding Accelerators
17.2 Transactional Memory
17.2.1 Outside World
17.2.2 Process Modification
17.2.3 Synchronization
17.2.4 Discussion
17.3 Hardware Transactional Memory
17.3.1 HTM Benefits WRT Locking
17.3.2 HTM Weaknesses WRT Locking
17.3.3 HTM Weaknesses WRT Locking When Augmented
17.3.4 Where Does HTM Best Fit In?
17.3.5 Potential Game Changers
17.3.6 Conclusions
17.4 Formal Regression Testing?
17.4.1 Automatic Translation
17.4.2 Environment
17.4.3 Overhead
17.4.4 Locate Bugs
17.4.5 Minimal Scaffolding
17.4.6 Relevant Bugs
17.4.7 Formal Regression Scorecard
17.5 Functional Programming for Parallelism
17.6 Summary
18 Looking Forward and Back
A Important Questions
A.1 Why Aren't Parallel Programs Always Faster?
A.2 Why Not Remove Locking?
A.3 What Time Is It?
A.4 What Does "After" Mean?
A.5 How Much Ordering Is Needed?
A.5.1 Where is the Defining Data?
A.5.2 Consistent Data Used Consistently?
A.5.3 Is the Problem Partitionable?
A.5.4 None of the Above?
A.6 What is the Difference Between "Concurrent" and "Parallel"?
A.7 Why Is Software Buggy?
All articles in this blog are licensed under CC BY-NC-SA 4.0 unless otherwise stated.