A well-known and often-dreaded problem in concurrent systems is the
"Heisenbug" problem --- subtle interference between threads can crash
systems that have otherwise been running reliably for months. These bugs
are hard to reproduce and the mere fact of observing them, such as adding
debug statements to the program, can sometimes make the bug go away.
CHESS is a tool for finding and reproducing concurrency bugs. When
attached to a program, CHESS takes complete control over the scheduling of
threads. This allows CHESS to systematically drive different runs of the
program along different thread interleavings. Moreover, when a run results
in an error, CHESS can reproduce the erroneous interleaving for ease of
debugging. CHESS is used by several product teams at Microsoft and has
found and helped debug many concurrency errors.
In this talk, I will describe the architecture of CHESS and the various
algorithms CHESS uses to effectively search the astronomically large space
of thread
interleavings. See http://research.microsoft.com/CHESS
for more details on CHESS.