A few weeks ago, I spent more than a week debugging parts of a very large software system at work. I will not go into detail about the software, but I want to write down what I learned from it so that I will remember it better, and share my conclusions with you.
Actually, I started thinking about similar problems months earlier when I was writing new code that gradually became much more complicated than I had expected. Why did I end up in a morass of special cases and conditions?
Both problems involved updating the state of several different data structures that had to be kept synchronized. The basic problem was that changes to the data were spread out over many different functions and that early changes could invalidate later changes or affect the control flow in an undesired way by changing the outcome of a conditional branch.
When I was working on the first problem I was writing new code and had free hands when it came to design. After realizing I would have to scrap my first design to make further progress, I decided I would partition the problem into two distinct parts: One would figure out which changes needed to be made and encode them in a class called a ChangeSet; The other would apply the changes. The idea was that it would be easier to debug and understand the algorithm if you could print out the ChangeSet and check invariants before it was applied. This worked well and simplified the code enough that I was able to make progress again.
The other problem had another important ingredient: randomness. The iteration order of several central datastructures (such as sets) is random in our system (a side effect of our generational garbage collector’s decisions) and caused the (very large) algorithm to update the data structures in a different order from one run to another. Most of the time everything worked fine, but every now and then things failed with random symptoms.
It has slowly dawned on me that these two problems were essentially the same: Many changes had to be made to many different data structures, and those very changes could cause the algorithm to trip itself up.
My solution to the first problem (using a ChangeSet) worked well not only because it made following the changes easier, but also because it concentrated all state changes to one phase, letting the first part of the algorithm discover what needed to be done in a temporarily immutable world. I think there is an important and general principle here: Changes to mutable state should be confined to as few points in both space and time as possible. This goes for any language with mutable state. Object-oriented principles do not seem to cover this problem, since hiding the state inside a class does not make the state go away; It is just changed via a method instead.
An interesting parallel is that the solution I described above is similar to what database systems do with transactions, and databases are specialized on managing mutable state in unpredictable environments.