When debugging an opaque problem in a system, a nice strategy is to just focus on adding permanent (rather than temporary) metrics or logging to make the problem obvious. Often makes it easy to solve not just the specific issue but also a whole class of potential related issues.
Debugging Systems: Adding Permanent Metrics and Logging
By
–