July 11, 2022 10:47 am GMT

The Cost of Production Blindness

When I speak at conferences, I often fall back to the fact that just a couple of decades ago we'd observe production by kicking the server. This is obviously no longer practical. We can't see our production. It's an amorphous cloud that we can't touch or feel. A power that we read about but don't fully grasp.

In this case, we have physical evidence that the cloud is there.

A part of this major shift in our industry is a change to our fundamental roles as engineers. DevOps and SRE are roles that didn't exist back then. Yet today, they're often essential for major businesses. They brought with them tremendous advancements to the reliability of production, but they also brought with them a cost: distance.

Production is in the amorphous cloud, which is accessible everywhere. Yet it's never been further away from the people who wrote the software powering it. We no longer have the fundamental insight we took for granted a bit over a decade ago.

Is That So Bad?

Yes, and no. We gave up some insight and control and got a lot in return:

  • Stability
  • Simplicity
  • Security

These are pretty incredible benefits. We don't want to give them up. But we also lost some insight: debugging became harder and complexity rose. We discussed these problems before, but today I want to talk about one impact only:

Cost

This is a form of blindness.

I wrote a lot about the impact of this situation on the reliability of our cloud deployments. But today I want to talk about the financial and environmental costs. Initially, the cloud was billed as a cost-saving measure, and there was some truth to that. The agility of deployment let us cut down on hardware costs, consolidate and simplify.

But as we got used to the cloud, our appetite for scale/reliability grew. We ended up simplifying deployment to such an extent that launching a container can be accomplished seamlessly, with no interaction on our part. This is enormous progress but also troubling. We slowly lose grip on our costs and end up paying more for less.

So what's the solution?

APMs are a category that rose to prominence specifically around this problem. Today, they are more important than ever. They help us get a sense of the Pareto principle (80/20 rule) so we can focus optimizations on the specific areas that cost the most.

This is a powerful and important tool that DevOps use every day, but it's also a very limited one.

Before we proceed, I'd like to take a moment to discuss the concept of cost. The most obvious impact is on our monthly cloud provider bill. This is money that might fund a department. But in my humble opinion, there's a more important cost: the environmental cost. We tend to ignore the electricity spend because it's a very amorphous spend. But this cost is severe. For example, the environmental cost of a single cloud instance over one year can be the equivalent of a transatlantic flight.

We don't see the underlying hardware, but it's there, and it carries a carbon footprint. By optimizing, we can affect both costs significantly.

Observing Production Effectively

APMs are great for measuring performance at a high level. But they provide very little detail about the dynamic inner workings of the application and the cost-cutting measures we can take inside. I often liken them to the bat signal or check engine light. They notify us of a problem but leave us without the tool to inspect the details.

That's where developer observability tools can fill the gap. These tools provide low-level, actionable insights into the application. They verify assumptions and give developers the means to understand production in depth.

Instead of discussing the theory, let's look at some examples of actions you can take today with developer observability tools to reduce the costs of your production.

Reduce Logs

Log ingestion is probably the most expensive feature in your application. Removing a single line of log code can end up saving thousands of dollars in ingestion and storage costs. We tend to overlog since the alternative is production issues that we can't trace to their root cause.

We need a middle ground. We want the ability to follow an issue through without overlogging. Developer observability lets you add logs dynamically as needed into production. This frees you from the need to overlog and lets you focus on logging a reasonable amount. You can also raise the log level to keep the logs down. I wrote about this in depth in a separate post.
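To make the log-level point concrete, here is a minimal sketch in plain Python using the standard logging module. It is not the API of any developer observability tool. The logger name `checkout`, the `CountingHandler` class, and the `process_order` function are hypothetical, introduced only to show how raising the level at runtime reduces the number of records that would be sent for ingestion.

```python
import logging


class CountingHandler(logging.Handler):
    """Hypothetical handler that counts records instead of shipping them."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def emit(self, record):
        self.count += 1


logger = logging.getLogger("checkout")  # hypothetical logger name
logger.propagate = False  # keep the demo records away from the root logger
handler = CountingHandler()
logger.addHandler(handler)


def process_order(order_id):
    """Hypothetical request handler with typical over-logging."""
    logger.debug("loading order %s", order_id)
    logger.debug("validating order %s", order_id)
    logger.warning("order %s took the slow path", order_id)


# Verbose level: every call emits all three records.
logger.setLevel(logging.DEBUG)
process_order(1)
verbose_count = handler.count

# Raised level: only the warning passes, so fewer records reach ingestion.
logger.setLevel(logging.WARNING)
handler.count = 0
process_order(2)
quiet_count = handler.count

print(verbose_count, quiet_count)
```

The debug statements stay in the code, so the detail is still available if the level is lowered again while chasing a specific issue.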

Caching

My top three tips for performance have always been:

  1. Caching
  2. Caching
  3. Caching

There's really nothing else. It all boils down to that. Unfortunately, cache misses are notoriously hard to tune and detect. This is an even bigger issue in production, where we need to account for the changing landscape. For example, we cache up to 10 friends of a user on a social network, but in production the growth team encourages friendships and users end up with more friends.

You'd have cache misses more often and you wouldn't even know.

Placing conditional breakpoints or temporary conditional logs on cache misses and inspecting them can go a long way to detect subtle issues like that. This can make an order-of-magnitude difference to performance when done right.
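The friends example can be sketched roughly as follows. This is an illustration in plain Python, not a developer observability tool: the `FriendCache` class, the `load_friends` callback, and the sample data are all hypothetical. The conditional log fires only for users whose friend list exceeds the cache limit, which is the case that would otherwise go unnoticed.

```python
import logging
from collections import Counter

logger = logging.getLogger("friend_cache")  # hypothetical logger name

MAX_CACHED_FRIENDS = 10  # the limit from the example in the text


class FriendCache:
    """Hypothetical cache that stores at most MAX_CACHED_FRIENDS per user."""

    def __init__(self, load_friends):
        self._load_friends = load_friends  # slow path, e.g. a database call
        self._cache = {}
        self.stats = Counter()

    def get_friends(self, user_id):
        friends = self._cache.get(user_id)
        if friends is not None:
            self.stats["hit"] += 1
            return friends

        self.stats["miss"] += 1
        friends = self._load_friends(user_id)
        if len(friends) <= MAX_CACHED_FRIENDS:
            self._cache[user_id] = friends
        else:
            # Conditional log: only users who can never be cached reach it.
            self.stats["uncacheable"] += 1
            logger.warning(
                "user %s has %d friends, over the cache limit of %d",
                user_id, len(friends), MAX_CACHED_FRIENDS,
            )
        return friends


# Made-up data: alice fits in the cache, bob does not.
friend_counts = {"alice": 3, "bob": 25}


def load_friends(user_id):
    return list(range(friend_counts[user_id]))


cache = FriendCache(load_friends)
for _ in range(3):
    cache.get_friends("alice")
    cache.get_friends("bob")

print(dict(cache.stats))
```

In this run every lookup for the user with 25 friends falls through to the slow path, and the `uncacheable` counter makes that visible without logging every cache access.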

However, there's a bigger payout here. Many developers just ignore L2 caches entirely. This is understandable. They are hard to maintain and debug, especially in production. All it takes is a single cache corruption or a value that's out of sync, and you end up with a major bug. The problem is that debugging these things in production environments is essential. Cache behaves radically differently in production because of its distributed nature.

We built developer observability solutions to debug these exact types of problems. By placing snapshots and logs over cache population/invalidation, we can narrow down the point of corruption and fix cache-related issues. Deploying these solutions to your production server can reduce overhead significantly!

Micro Benchmarks

APMs provide us with high-level numbers on performance and a general direction. They don't provide the lines of code we need to address. That's left up to our guesswork. If the system behaves identically when it's running locally, this should be fine. Unfortunately, this is rarely the case. E.g. a database query can have a significantly different impact when running in production. Based on local profiling results, you might waste your energy on the wrong optimization.

Developer observability tools provide the ability to narrow down the performance overhead of a code snippet. This lets us follow through the web service stack and narrow down the actual lines that are taking the most CPU time. We can accomplish this by adding a tictoc metric that measures the time between the tic line and the toc line.

We can mark a block of code and get statistics about its execution time. As in the common case of a specific query taking longer in production, we can quickly prove that this is the cause of the performance problem using this tool. The impact of many small issues like this can be significant in a large system and can easily mean the difference between scaling and a bottleneck.
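The idea behind a tictoc measurement can be approximated in plain Python. The sketch below is only an illustration under stated assumptions: the `tictoc` context manager, the `summary` helper, and the metric names are hypothetical, and it measures wall-clock time with `time.perf_counter` rather than CPU time. A real developer observability tool would inject the measurement into running code instead of requiring a code change.

```python
import statistics
import time
from collections import defaultdict
from contextlib import contextmanager

_samples = defaultdict(list)  # metric name -> elapsed seconds per run


@contextmanager
def tictoc(name):
    """Time the wrapped block: 'tic' on entry, 'toc' on exit."""
    tic = time.perf_counter()
    try:
        yield
    finally:
        _samples[name].append(time.perf_counter() - tic)


def summary(name):
    """Return basic statistics for one named block."""
    runs = _samples[name]
    return {
        "count": len(runs),
        "mean": statistics.mean(runs),
        "max": max(runs),
    }


def fetch_report():
    """Hypothetical request handler with two measured blocks."""
    with tictoc("report_query"):
        time.sleep(0.01)  # stand-in for a slow database query
    with tictoc("render"):
        sum(range(1000))  # stand-in for cheap local work


for _ in range(5):
    fetch_report()

print(summary("report_query"))
print(summary("render"))
```

Comparing the two summaries shows which block dominates the request, which is the kind of evidence needed before choosing what to optimize.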

Verification and Dead Code

A common problem is underutilized resources. APMs expose some of those problems but don't expose them all. When we have dead code, its impact on our bottom line can be significant.

How many times did you refactor code or stop yourself from refactoring because of a legacy mess you didn't want to touch?

Yes, that legacy mess is used in your code so you don't want to risk it. If you end up changing the code, you need to walk on eggshells and the entire operation can take an order of magnitude longer. This maps to cost, since our time is valuable and could be spent optimizing instead. In most cases, it also blocks some major optimizations.

But what if that block of code isn't used by anyone in production?

What if it's used by very few people?

That's exactly what the counter metric answers. It counts the number of times a line was reached. It can tell us which methods are important to us and how frequently they're reached. You wouldn't be as concerned about a refactor if only three people reach that line of code.

Finally

I could carry on with the discussion of these techniques, but the gist is simple: we need to see what's going on. As developers, we're given a task to build a product. But the tools that let us peer into production aren't as capable as our local tools. The results we get from production can be very misleading.

As we scale production deployments, we need to use a new class of tools that exposes our code in this way. I can classify modern production with one word:

DREAD.

It's that deep, binding fear we all feel when we push a major change into production. People lose their jobs by pushing bad stuff to production. That's scary!

What do we do when facing such dread?

We keep going, but carefully. We step lightly and don't take big risks. Is our code wasteful?

Maybe, but the risk of bringing down production is far scarier than the benefit of shaving some expenses for the company.

Developer observability is the light within this darkness. When you shine a light in the dark, you take away some of the fear and make production more approachable. We can measure, test, and move fast. We also have a better sense of the risks we'll be facing with the upcoming changes. The tooling also gives us a sense of the upside. How much can we save? Imagine saving the cost of your entire department in cloud expenses. That's job security right there. It's the best way to fight that fear of risky changes.


Original Link: https://dev.to/codenameone/the-cost-of-production-blindness-48aj
