The art of debugging

Programming is all about the flow of data. Our director of engineering, Frank Gu, puts it this way:

Every software company is just a shell for a data company.

A function takes data as arguments, either transforms it or generates new data from it, and sends it somewhere else. The same applies to services and to endpoints that take in parameters. We start with data from SOMEWHERE and do SOMETHING with it.

A multiplayer video game is, in essence, just packets of data being sent to your device and eventually converted into light and sound. You look at that light and sound, produce data, and send it off again.

Think of development as doing plumbing for data (obviously there’s a lot more nuance to this; it is not a perfect analogy).

At my place, I have a water pipe and a sewer pipe. Water needs to get to me somehow from the reservoir, and my sewage needs to reach the wastewater treatment plant. There are some intermediate stages along the way: processing (filters, pumps), storage (water tower), and junctions.

If there was a resident complaining to the municipal water management that their water was all weird and tasted funny 🤢, where should we start to look?

Well, we can eliminate problems with the sewer pipe since it’s the water coming in that is wrong. It’s got to be somewhere between the reservoir and their tap, or maybe even the reservoir itself.

It’s not very easy to look at all the intermediate stages if I don’t have knowledge of what they are, and where they are. But it’s always easy to take a measurement at the reservoir and treatment plant to ensure that is not the problem - it’s a sanity check that the sewer system isn’t broken and polluting the reservoir too much for the treatment to handle. Let’s say things are good on that end.

It’s now important to check the water quality at the various stages, from the treatment plant to the tap. The problem MUST be somewhere here.

If you know absolutely nothing, the best strategy is a binary search.

  1. Start at the middle point of the section
  2. If it is dirty, check the section before that middle point. Go to Step 1
  3. If it is clean, check the section after that middle point. Go to Step 1

Keep repeating until you find the exact point that is causing the problem. This is powerful because we halve the section we need to check every time. It’s logarithmically powerful: O(log N)!
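The steps above can be sketched in a few lines of Python. This is only an illustration under some assumptions: the stage names and the `is_clean` check are made up, the stages are listed from upstream to downstream, the last stage is known to be dirty, and once the water turns dirty it stays dirty the rest of the way.

```python
from typing import Callable, Sequence


def find_first_bad_stage(
    stages: Sequence[str], is_clean: Callable[[str], bool]
) -> tuple[str, int]:
    """Return the first stage with dirty output and the number of checks made."""
    lo, hi = 0, len(stages) - 1
    checks = 0
    while lo < hi:
        mid = (lo + hi) // 2
        checks += 1
        if is_clean(stages[mid]):
            # Clean at the middle: the problem is after this point.
            lo = mid + 1
        else:
            # Dirty at the middle: the problem is here or before this point.
            hi = mid
    return stages[lo], checks


# Hypothetical pipeline, ordered from upstream to downstream.
stages = [
    "reservoir",
    "treatment plant",
    "pump station",
    "water tower",
    "neighbourhood junction",
    "street main",
    "house inlet",
    "kitchen tap",
]

# Pretend the contamination starts at the street main.
fault_index = stages.index("street main")


def is_clean(stage: str) -> bool:
    return stages.index(stage) < fault_index


culprit, checks = find_first_bad_stage(stages, is_clean)
print(culprit, checks)
```

With eight stages, the search needs three measurements instead of up to eight. In real debugging, `is_clean` would be whatever check you can run at that point: a log line, a query against the database, or a breakpoint.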

Logs sound great and all, but what if there’s a method absurdly better than that?

It’s called an educated guess! This comes with the experience that you, the humble (data) plumber, accumulate as you fix problems.

Maybe you know that the house was built before 1950, so it could have corroded lead piping that is flaking off. So now you go and directly check the water going into the house. It is clean, so the problem must be in the house’s internal plumbing. We just skipped checking the entire section between the treatment plant and the house.

Maybe this is the only resident that complained about funny water, so it's more likely this house. Or if only a neighbourhood group complains, it’s the junction going into that neighbourhood we want to check.

Maybe we know that a certain section was worked on or changed recently, or is known to be problematic, so we can start there first.

This educated guess allows us to zoom in and narrow down the problem, potentially in constant time (if our guesses are right most of the time).
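One way to picture the difference is to count measurements. The sketch below is hypothetical: it assumes a long pipeline of made-up segment names, a single point where the water turns dirty, and a last segment that is known to be dirty. It tests a hunch first and falls back to a binary search over whatever the hunch could not rule out.

```python
from typing import Callable, Optional, Sequence


def locate_fault(
    stages: Sequence[str],
    is_clean: Callable[[str], bool],
    hunch: Optional[str] = None,
) -> tuple[str, int]:
    """Return the first dirty stage and the number of checks made."""
    lo, hi = 0, len(stages) - 1
    checks = 0

    if hunch is not None:
        i = stages.index(hunch)
        checks += 1
        if is_clean(stages[i]):
            # Wrong guess, but everything up to here is now ruled out.
            lo = i + 1
        elif i == 0:
            return stages[0], checks
        else:
            checks += 1
            if is_clean(stages[i - 1]):
                # Clean going in, dirty coming out: the guess was right.
                return stages[i], checks
            # Already dirty upstream of the guess: look before it.
            hi = i - 1

    if lo > hi:
        raise ValueError("no dirty stage found; is the last stage really dirty?")

    # Fall back to binary search over what is left.
    while lo < hi:
        mid = (lo + hi) // 2
        checks += 1
        if is_clean(stages[mid]):
            lo = mid + 1
        else:
            hi = mid
    return stages[lo], checks


# Hypothetical pipeline of 1024 segments with the fault at segment 700.
stages = [f"segment-{n}" for n in range(1024)]
fault_index = 700


def is_clean(stage: str) -> bool:
    return stages.index(stage) < fault_index


no_hunch = locate_fault(stages, is_clean)
good_hunch = locate_fault(stages, is_clean, hunch="segment-700")
bad_hunch = locate_fault(stages, is_clean, hunch="segment-100")
print(no_hunch, good_hunch, bad_hunch)
```

In this setup, a correct hunch confirms the culprit in two checks no matter how long the pipeline is, while the plain binary search needs ten. A wrong hunch costs an extra check or two but still shrinks the section left to search.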

Let’s step back from the analogy for a second. You might’ve already picked up what each term corresponds to. Water is data. The reservoir is your database. Treatment plants and stages could be the various backend services that modify and feed the data downstream. The house could be the front-end client. Data operates in cycles “to” and “from” the database - so the database is always a good starting point to see if the issue is “to” or “from” and in which direction to look.

Just like how a traditional plumber might work mostly with a house’s internal plumbing and other professionals work on other stages of water management, we have the frontend, backend, full stack developers, data engineers, etc. You should at least be able to clarify that the issue lies outside your domain - that it’s not the house’s (front-end client’s) issue so others can look into it. But understanding the overall system can give you the opportunity to inform the right person or fix it yourself.

This experience leading to an educated guess can make all the difference between a wild goose chase of a slog and apex bug-killing precision.

Often if you’re just building out new pipes (building new features) and never fixing or maintaining existing ones, you won’t gain this experience.

This is all to say, while it is exciting to build the latest, greatest, shiniest new pipe in a brand new neighbourhood, it can be very valuable to go into that nasty old existing system, even if it isn’t appealing, find problems, and fix them up once in a while.

Understanding why an old system sucks and why it breaks helps us design a better new system.

