Weeding Out Distributed System Bugs

As we've learned more and more about distributed systems, we've seen the many ways that things can go wrong. More specifically, we've seen that there are just so many possible scenarios in which a portion of a larger system can fail.

Since failure is inevitable in a system of any size, we ought to do ourselves a favor and better understand it. So far, we've been talking about different kinds of faults and failures in a fairly abstract sense. It's time for us to get a little more concrete, however. Sure, we vaguely understand the different flavors and profiles of how things can go wrong within a system, and why they are problematic (yet inevitable). But how can we start to understand these flaws in a more tangible form?

In order to do that, we need to think more deeply about the different ways in which failures present themselves to us in the systems that we all deal with every day, as both consumers and creators of software. For most of us, failures in our systems present themselves as bugs. The ways in which a bug might appear to us, however, can make all the difference in how we are able to understand it in a more concrete way.

So what kinds of bugs do we deal with, exactly, and how do they impact the failures of a distributed system? Well, that's the mystery we're about to solve!

(Hardware) failures of the past

Before we dive into identifying the bugs of today, let's take a quick detour into the past. It's easy to get overwhelmed when thinking about failures and faults in a system, so before we get too deep into the trenches, we ought to take a moment to see how the landscape of bugs in computing has changed over the years.

What is hardware reliability?

Until the 1980s, the major focus of computing was around hardware. More specifically, much of the field was focused on how to make hardware just generally better. This was simply because hardware was the major limiting factor in many ways. For example, if we wanted to make a machine faster and more performant back then, we needed bigger hardware; that was the only way to have enough space for all the circuits we needed! And if we wanted more circuits, each of which was already pretty large and lofty in size, then we also needed to be prepared for them to use a lot of power and give off a lot of heat.

These issues begin to highlight some of the clear possible faults that could have popped up within a system just a few decades ago. As we already know, hardware faults, such as a circuit overheating or a network wiring issue causing a widespread outage, are what lead to hardware failures. If any aspect of the hardware fails, then that failure will likely cause some form of downtime in a system, which we know makes a system less reliable. Until the 80s, hardware faults were very much a real and common problem.

But these days, the story is a little different. We experience far less downtime due to hardware faults than we did just forty years ago, thanks to many years of concentrated effort to improve the hardware we all rely upon on a daily basis. In the past three decades, the size of circuits has decreased, allowing us to pack more circuits into a smaller space, and those circuits produce less heat and use much less energy. Circuits have also become easier and cheaper to produce. This has also allowed us to create smaller devices, like laptop computers, tablets, and smartphones, to mention just a few.

This doesn't mean that there are no hardware problems whatsoever, though! Even small devices with smaller circuits inside of them will experience hardware failures. Network issues that cause downtime are still likely to happen, even though the frequency with which they occur has definitely dropped off. Hardware disks are still prone to failures, which makes it tricky to read data from (much less write data to) them. And, of course, just because hardware has improved doesn't mean that it doesn't require maintenance and upgrades; since these are still requirements, they will still result in planned downtime.

How has hardware changed?

Overall, however, the changes we've seen in hardware have been quite a net positive for computing. So, if hardware has improved, what else could be a contributing factor to failures in a distributed system? Why, our dear friend software, of course!

Even in the most well-tested systems, software failures are responsible for a significant amount of downtime. We know these failures by another name: bugs.

Improvements in hardware notwithstanding, it is bugs in the software of distributed systems that result in unexpected and unplanned downtime. Many studies estimate that 25 to 35% of the downtime in a system is caused by software bugs.

Software failures as a major pain point in a distributed system

The interesting aspect of this story is the fact that, even within systems that are fairly well-established and have rigorous testing practices in place, studies have found that the actual percentage of software-related downtime never really drops below that 25% threshold! There are just some bugs that still seem to exist, even with well thought-out tests and quality control.

(Software) problems of the present

The bugs that still exist in more mature systems, those that have rigorous testing, for example, are also known as residual bugs, and they can be classified into two separate categories:

The two main kinds of residual software bugs

  1. Bohrbugs, which are named for Niels Bohr and Ernest Rutherford's model of the atomic nucleus, and
  2. Heisenbugs, which are named as a pun on Werner Heisenberg's uncertainty principle.

These two bugs have been researched by many different computer scientists; three of the most notable ones include Jim Gray, Bruce Lindsey, and Andrea (Anita) Borr, and we'll read about the fruits of their labor in a little bit. Between these two different kinds of residual bugs, one is definitely way easier to wrap our heads around than the other. So let's start with that one first!

The Bohrbug is a bug that most (all?) programmers will encounter while tinkering with software. A Bohrbug is a bug that can reliably be reproduced, given the right conditions. For example, if we noticed that a bug occurred in a piece of software and closely observed the situation in which it happened, and if it was a Bohrbug, then we would be able to reproduce it by re-creating the same situation.

Bohrbug: a quick definition!

A Bohrbug is pretty easy to localize and pinpoint to a certain part of a codebase. As developers, this is a huge boon, since it means that we can reliably find and then fix the Bohrbug, as annoying as it might be!
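
To make that a bit more concrete, here's a tiny, hypothetical sketch in Python (the function name and scenario are invented for illustration) of what a Bohrbug looks like in practice: the same conditions produce the same crash every single time, so a simple test can reproduce it on demand.

```python
# A hypothetical Bohrbug: the failure is deterministic, so re-creating the
# same conditions reproduces it on every run.

def average_latency(samples):
    # Bug: we never guard against an empty list, so any caller that passes
    # no samples always triggers a ZeroDivisionError. Same input, same crash.
    return sum(samples) / len(samples)

if __name__ == "__main__":
    try:
        average_latency([])  # re-creating the situation reproduces the bug
    except ZeroDivisionError:
        print("Reproduced the Bohrbug: an empty sample list divides by zero.")
```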

Interestingly, when Jim Gray and Bruce Lindsey were researching Bohrbugs in more mature systems, they posited that the frequency of these reproducible little bugs actually reduced as a system grew older and more stable.

Bohrbugs in a system, over time

However, Anita Borr's research added a bit more nuance to this. She found that the percentage of Bohrbugs didn't strictly continue to drop as a system grew more stable; rather, her research found that, with each new upgrade or scheduled maintenance that was introduced into the system, there was also a slight uptick in Bohrbugs, since significant changes in the system were still very capable of introducing reproducible bugs.

Thankfully, even though new Bohrbugs might be introduced with these system-wide changes, at least they can be reproduced (and hopefully, fixed!). But things aren't always that simple in the world of software (of course). Some bugs don't always behave the same; in fact, some of them seem to behave differently when we try to investigate them!

Dealing with difficult, distributed bugs

There is one species of bug that is particularly relevant to distributed systems, and it's finally time for us to come face-to-face with it in this series. I'm talking, of course, about the Heisenbug!

Heisenbug: a quick definition!

A Heisenbug can be super frustrating to deal with as a programmer. This is a bug that actually behaves differently when it is observed closely. As one begins to investigate a Heisenbug, it may change how it manifests itself. In some cases, when a Heisenbug is, well, being debugged, it disappears completely. And in some situations, when certain conditions are recreated in an effort to reproduce the bug, the bug just won't appear! Pretty frustrating, right?

What makes Heisenbugs so hard?

For example, a bug such as a data structure running out of space, or a portion of a program overflowing some allocated memory in a production environment, might not be easy to reproduce locally or in a test; however, this exact bug could cause a system to crash, which is pretty fatal!

This is part of what makes Heisenbugs so difficult to deal with. They are incredibly hard to reason about, because it's hard to localize them and pin them down. And, because they're hard to reproduce reliably, they're hard to identify and, thus, difficult to actually solve!
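
Here's a small, hypothetical sketch (not from any particular system, just an illustration in Python) of one classic way a Heisenbug shows up: a read-modify-write race between threads. The final total often comes up short by a different amount on every run, and adding print statements or pausing in a debugger changes the thread timing enough that the lost updates may vanish while we're looking for them.

```python
# A hypothetical Heisenbug: a read-modify-write race between threads.
# Observing it (extra logging, a debugger, slower machines) changes the
# interleaving, so the bug may stop appearing while it is being studied.
import threading

counter = 0

def increment_many(times=1_000_000):
    global counter
    for _ in range(times):
        current = counter       # read the shared value
        counter = current + 1   # write it back; another thread may have
                                # incremented counter in between (lost update)

threads = [threading.Thread(target=increment_many) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# The expected total is 4,000,000; the actual total often comes up short,
# and by a different amount on each run.
print("expected:", 4 * 1_000_000, "actual:", counter)
```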

Heisenbugs are especially relevant to distributed systems because they're more likely to occur in a distributed system than in a localized, central one. These kinds of bugs are actually indicators of problems and failures in the system that occurred much earlier than when the bug manifested itself.

Heisenbugs are much more common in distributed systems.

A Heisenbug is usually a red flag that something else went wrong in the system a while ago; it just so happens to be surfacing now, in the form of this bug.

In actuality, a Heisenbug is just a long-delayed side effect of a much earlier problem.

In more mature distributed systems, it is Heisenbugs, not Bohrbugs, that cause major failures and system crashes. If we think about it more deeply, this starts to make sense; there are many moving parts in a distributed system, and many dependencies and nodes that rely upon one another. A failure that appears to be coming from one node in the system might actually be three nodes removed from a failure that originated elsewhere, but propagated throughout the system. While a Bohrbug might be easy to reproduce, localize, and reason about, a Heisenbug is much trickier to think about and, thus, to fix.
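
To illustrate that idea, here's one more hypothetical sketch (the node names and values are invented): one "node" writes a malformed value, another replicates it untouched, and a third crashes much later when it finally uses it, far from where the fault was introduced.

```python
# A hypothetical long-delayed side effect: the real fault is in node_a, but
# the crash only surfaces in node_c, two hops away.

def node_a(store):
    # The actual mistake happens here: a timestamp is stored as a string
    # when the rest of the system expects a number.
    store["last_seen"] = "1559737860"

def node_b(source, replica):
    # Replication just copies the value along; the bad data propagates silently.
    replica.update(source)

def node_c(replica):
    # The failure only appears here, far from where the bug was introduced.
    return 1559741460 - replica["last_seen"]   # TypeError: int minus str

store, replica = {}, {}
node_a(store)
node_b(store, replica)
try:
    node_c(replica)
except TypeError:
    print("node_c crashed, even though node_a introduced the problem")
```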

Heisenbugs in a system, over time

Anita Borr, my new favorite researcher on residual bugs, actually found in her research that many engineers have a hard time reasoning about Heisenbugs, and that attempts to fix a Heisenbug can actually cause more problems than there were to begin with! So if you've been feeling like Heisenbugs are tricky little creatures and hard to contend with, don't worry; the research agrees with you!

Resources

Software failures in distributed systems are pretty cool to learn about, and you can learn a whole lot more about them. There is a lot of interesting research and writing on how to deal with and guard against Heisenbugs in your system. Check out some of the resources below if you're curious to learn more!

  1. Reliable Distributed Systems: Technologies, Web Services, and Applications, Kenneth Birman
  2. Why Do Computers Stop and What Can Be Done About It?, Jim Gray
  3. Introduction to Distributed Systems, University of Washington
  4. Heisenbugs and Bohrbugs: Why are they different?, Richard Martin (?)
  5. Protecting Applications Against Heisenbugs, Chris Hobbs

Original Link: https://dev.to/vaidehijoshi/weeding-out-distributed-system-bugs-9c8
