1 How To Use This Book

1.1 Roadmap

1.2 Quick Quizzes

1.3 Alternatives to This Book

1.4 Sample Source Code

1.5 Whose Book Is This?

2 Introduction

2.1 Historic Parallel Programming Difficulties

2.2 Parallel Programming Goals

2.2.1 Performance

2.2.2 Productivity

2.2.3 Generality

2.3 Alternatives to Parallel Programming

2.3.1 Multiple Instances of a Sequential Application

2.3.2 Use Existing Parallel Software

2.3.3 Performance Optimization

2.4 What Makes Parallel Programming Hard?

2.4.1 Work Partitioning

2.4.2 Parallel Access Control

2.4.3 Resource Partitioning and Replication

2.4.4 Interacting With Hardware

2.4.5 Composite Capabilities

2.4.6 How Do Languages and Environments Assist With These Tasks?

2.5 Discussion

3 Hardware and its Habits

3.1 Overview

3.1.1 Pipelined CPUs

3.1.2 Memory References

3.1.3 Atomic Operations

3.1.4 Memory Barriers

3.1.5 Cache Misses

3.1.6 I/O Operations

3.2 Overheads

3.2.1 Hardware System Architecture

3.2.2 Costs of Operations

3.2.3 Hardware Optimizations

3.3 Hardware Free Lunch?

3.3.1 3D Integration

3.3.2 Novel Materials and Processes

3.3.3 Light, Not Electrons

3.3.4 Special-Purpose Accelerators

3.3.5 Existing Parallel Software

3.4 Software Design Implications

4 Tools of the Trade

4.1 Scripting Languages

4.2 POSIX Multiprocessing

4.2.1 POSIX Process Creation and Destruction

4.2.2 POSIX Thread Creation and Destruction

4.2.3 POSIX Locking

4.2.4 POSIX Reader-Writer Locking

4.2.5 Atomic Operations (GCC Classic)

4.2.6 Atomic Operations (C11)

4.2.7 Atomic Operations (Modern GCC)

4.2.8 Per-Thread Variables

4.3 Alternatives to POSIX Operations

4.3.1 Organization and Initialization

4.3.2 Thread Creation, Destruction, and Control

4.3.3 Locking

4.3.4 Accessing Shared Variables

4.3.5 Atomic Operations

4.3.6 Per-CPU Variables

4.4 The Right Tool for the Job: How to Choose?

5 Counting

5.1 Why Isn't Concurrent Counting Trivial?

5.2 Statistical Counters

5.2.1 Design

5.2.2 Array-Based Implementation

5.2.3 Per-Thread-Variable-Based Implementation

5.2.4 Eventually Consistent Implementation

5.2.5 Discussion

5.3 Approximate Limit Counters

5.3.1 Design

5.3.2 Simple Limit Counter Implementation

5.3.3 Simple Limit Counter Discussion

5.3.4 Approximate Limit Counter Implementation

5.3.5 Approximate Limit Counter Discussion

5.4 Exact Limit Counters

5.4.1 Atomic Limit Counter Implementation

5.4.2 Atomic Limit Counter Discussion

5.4.3 Signal-Theft Limit Counter Design

5.4.4 Signal-Theft Limit Counter Implementation

5.4.5 Signal-Theft Limit Counter Discussion

5.4.6 Applying Exact Limit Counters

5.5 Parallel Counting Discussion

5.5.1 Parallel Counting Validation

5.5.2 Parallel Counting Performance

5.5.3 Parallel Counting Specializations

5.5.4 Parallel Counting Lessons

6 Partitioning and Synchronization Design

6.1 Partitioning Exercises

6.1.1 Dining Philosophers Problem

6.1.2 Double-Ended Queue

6.1.3 Partitioning Example Discussion

6.2 Design Criteria

6.3 Synchronization Granularity

6.3.1 Sequential Program

6.3.2 Code Locking

6.3.3 Data Locking

6.3.4 Data Ownership

6.3.5 Locking Granularity and Performance

6.4 Parallel Fastpath

6.4.1 Reader/Writer Locking

6.4.2 Hierarchical Locking

6.4.3 Resource Allocator Caches

6.5 Beyond Partitioning

6.5.1 Work-Queue Parallel Maze Solver

6.5.2 Alternative Parallel Maze Solver

6.5.3 Maze Validation

6.5.4 Performance Comparison I

6.5.5 Alternative Sequential Maze Solver

6.5.6 Performance Comparison II

6.5.7 Future Directions and Conclusions

6.6 Partitioning, Parallelism, and Optimization

7 Locking

7.1 Staying Alive

7.1.1 Deadlock

7.1.2 Livelock and Starvation

7.1.3 Unfairness

7.1.4 Inefficiency

7.2 Types of Locks

7.2.1 Exclusive Locks

7.2.2 Reader-Writer Locks

7.2.3 Beyond Reader-Writer Locks

7.2.4 Scoped Locking

7.3 Locking Implementation Issues

7.3.1 Sample Exclusive-Locking Implementation Based on Atomic Exchange

7.3.2 Other Exclusive-Locking Implementations

7.4 Lock-Based Existence Guarantees

7.5 Locking: Hero or Villain?

7.5.1 Locking For Applications: Hero!

7.5.2 Locking For Parallel Libraries: Just Another Tool

7.5.3 Locking For Parallelizing Sequential Libraries: Villain!

7.6 Summary

8 Data Ownership

8.1 Multiple Processes

8.2 Partial Data Ownership and pthreads

8.3 Function Shipping

8.4 Designated Thread

8.5 Privatization

8.6 Other Uses of Data Ownership

9 Deferred Processing

9.1 Running Example

9.2 Reference Counting

9.3 Hazard Pointers

9.4 Sequence Locks

9.5 Read-Copy Update (RCU)

9.5.1 Introduction to RCU

9.5.2 RCU Fundamentals

9.5.3 RCU Linux-Kernel API

9.5.4 RCU Usage

9.6 Which to Choose?

9.6.1 Which to Choose? (Overview)

9.6.2 Which to Choose? (Details)

9.6.3 Which to Choose? (Production Use)

9.7 What About Updates?

10 Data Structures

10.1 Motivating Application

10.2 Partitionable Data Structures

10.2.1 Hash-Table Design

10.2.2 Hash-Table Implementation

10.2.3 Hash-Table Performance

10.3 Read-Mostly Data Structures

10.3.1 RCU-Protected Hash Table Implementation

10.3.2 RCU-Protected Hash Table Validation

10.3.3 RCU-Protected Hash Table Performance

10.3.4 RCU-Protected Hash Table Discussion

10.4 Non-Partitionable Data Structures

10.4.1 Resizable Hash Table Design

10.4.2 Resizable Hash Table Implementation

10.4.3 Resizable Hash Table Discussion

10.4.4 Other Resizable Hash Tables

10.5 Other Data Structures

10.6 Micro-Optimization

10.6.1 Specialization

10.6.2 Bits and Bytes

10.6.3 Hardware Considerations

10.7 Summary

11 Validation

11.1 Introduction

11.1.1 Where Do Bugs Come From?

11.1.2 Required Mindset

11.1.3 When Should Validation Start?

11.1.4 The Open Source Way

11.2 Tracing

11.3 Assertions

11.4 Static Analysis

11.5 Code Review

11.5.1 Inspection

11.5.2 Walkthroughs

11.5.3 Self-Inspection

11.6 Probability and Heisenbugs

11.6.1 Statistics for Discrete Testing

11.6.2 Statistics Abuse for Discrete Testing

11.6.3 Statistics for Continuous Testing

11.6.4 Hunting Heisenbugs

11.7 Performance Estimation

11.7.1 Benchmarking

11.7.2 Profiling

11.7.3 Differential Profiling

11.7.4 Microbenchmarking

11.7.5 Isolation

11.7.6 Detecting Interference

11.8 Summary

12 Formal Verification

12.1 State-Space Search

12.1.1 Promela and Spin

12.1.2 How to Use Promela

12.1.3 Promela Example: Locking

12.1.4 Promela Example: QRCU

12.1.5 Promela Parable: dynticks and Preemptible RCU

12.1.6 Validating Preemptible RCU and dynticks

12.2 Special-Purpose State-Space Search

12.2.1 Anatomy of a Litmus Test

12.2.2 What Does This Litmus Test Mean?

12.2.3 Running a Litmus Test

12.2.4 PPCMEM Discussion

12.3 Axiomatic Approaches

12.3.1 Axiomatic Approaches and Locking

12.3.2 Axiomatic Approaches and RCU

12.4 SAT Solvers

12.5 Stateless Model Checkers

12.6 Summary

12.7 Choosing a Validation Plan

13 Putting It All Together

13.1 Counter Conundrums

13.1.1 Counting Updates

13.1.2 Counting Lookups

13.2 Refurbish Reference Counting

13.2.1 Implementation of Reference-Counting Categories

13.2.2 Counter Optimizations

13.3 Hazard-Pointer Helpers

13.3.1 Scalable Reference Count

13.3.2 Long-Duration Accesses

13.4 Sequence-Locking Specials

13.4.1 Dueling Sequence Locks

13.4.2 Correlated Data Elements

13.4.3 Atomic Move

13.4.4 Upgrade to Writer

13.5 RCU Rescues

13.5.1 RCU and Per-Thread-Variable-Based Statistical Counters

13.5.2 RCU and Counters for Removable I/O Devices

13.5.3 Array and Length

13.5.4 Correlated Fields

13.5.5 Update-Friendly Traversal

13.5.6 Scalable Reference Count Two

13.5.7 Retriggered Grace Periods

13.5.8 Long-Duration Accesses Two

14 Advanced Synchronization

14.1 Avoiding Locks

14.2 Non-Blocking Synchronization

14.2.1 Simple NBS

14.2.2 Applicability of NBS Benefits

14.2.3 NBS Discussion

14.3 Parallel Real-Time Computing

14.3.1 What is Real-Time Computing?

14.3.2 Who Needs Real-Time?

14.3.3 Who Needs Parallel Real-Time?

14.3.4 Implementing Parallel Real-Time Systems

14.3.5 Implementing Parallel Real-Time Operating Systems

14.3.6 Implementing Parallel Real-Time Applications

14.3.7 Real Time vs. Real Fast: How to Choose?

15 Advanced Synchronization: Memory Ordering

15.1 Ordering: Why and How?

15.1.1 Why Hardware Misordering?

15.1.2 How to Force Ordering?

15.1.3 Basic Rules of Thumb

15.2 Tricks and Traps

15.2.1 Variables With Multiple Values

15.2.2 Memory-Reference Reordering

15.2.3 Address Dependencies

15.2.4 Data Dependencies

15.2.5 Control Dependencies

15.2.6 Cache Coherence

15.2.7 Multicopy Atomicity

15.3 Compile-Time Consternation

15.3.1 Memory-Reference Restrictions

15.3.2 Address- and Data-Dependency Difficulties

15.3.3 Control-Dependency Calamities

15.4 Higher-Level Primitives

15.4.1 Memory Allocation

15.4.2 Locking

15.4.3 RCU

15.5 Hardware Specifics

15.5.1 Alpha

15.5.2 Armv7-A/R

15.5.3 Armv8

15.5.4 Itanium

15.5.5 MIPS

15.5.6 POWER / PowerPC

15.5.7 SPARC TSO

15.5.8 x86

15.5.9 z Systems

15.6 Where is Memory Ordering Needed?

16 Ease of Use

16.1 What is Easy?

16.2 Rusty Scale for API Design

16.3 Shaving the Mandelbrot Set

17 Conflicting Visions of the Future

17.1 The Future of CPU Technology Ain't What it Used to Be

17.1.1 Uniprocessor Über Alles

17.1.2 Multithreaded Mania

17.1.3 More of the Same

17.1.4 Crash Dummies Slamming into the Memory Wall

17.1.5 Astounding Accelerators

17.2 Transactional Memory

17.2.1 Outside World

17.2.2 Process Modification

17.2.3 Synchronization

17.2.4 Discussion

17.3 Hardware Transactional Memory

17.3.1 HTM Benefits WRT Locking

17.3.2 HTM Weaknesses WRT Locking

17.3.3 HTM Weaknesses WRT Locking When Augmented

17.3.4 Where Does HTM Best Fit In?

17.3.5 Potential Game Changers

17.3.6 Conclusions

17.4 Formal Regression Testing?

17.4.1 Automatic Translation

17.4.2 Environment

17.4.3 Overhead

17.4.4 Locate Bugs

17.4.5 Minimal Scaffolding

17.4.6 Relevant Bugs

17.4.7 Formal Regression Scorecard

17.5 Functional Programming for Parallelism

17.6 Summary

18 Looking Forward and Back

A Important Questions

A.1 Why Aren't Parallel Programs Always Faster?

A.2 Why Not Remove Locking?

A.3 What Time Is It?

A.4 What Does "After" Mean?

A.5 How Much Ordering Is Needed?

A.5.1 Where is the Defining Data?

A.5.2 Consistent Data Used Consistently?

A.5.3 Is the Problem Partitionable?

A.5.4 None of the Above?

A.6 What is the Difference Between "Concurrent" and "Parallel"?

A.7 Why Is Software Buggy?