Million Dollar Lines of Code - An Engineering Perspective on Cloud Cost Optimization


Key Takeaways

  • Developers must recognize their code’s financial impact, underscoring how seemingly minor decisions can lead to significant costs.
  • Engineers are crucial contributors to an organization’s financial strategy, with their coding choices directly influencing the outcomes.
  • A balance is required between leveraging cloud scalability and managing financial limitations.
  • Cloud costs must be treated as a critical, non-functional engineering requirement that influences cloud service choices.
  • Metrics such as the “cloud efficiency rate” (CER) give organizations a practical way to baseline spending at different stages of development and measure cloud costs against revenue.

There has never been a better time to be a software developer, and there has also never been a time when a single engineer can wield so much power. It takes only one line of code to determine an organization’s financial trajectory. Like many of you, I’ve long been passionate about creating efficient software. Yet, in our cloud-centric world, efficiency is no longer just about performance. The on-demand computing and infrastructure choices we make now all cost real money, and neglecting this in the cloud can be quite perilous.

I shared an engineering perspective on cloud cost optimization at a recent QCon San Francisco conference. This article is a deeper dive into that presentation. I encourage you to watch the original presentation here.

Every Engineering Decision Is a Buying Decision

Every engineering decision is a buying decision. More scrutiny will be directed at the money you’re spending on dinner or lunch today than how much money is getting spent in the cloud. Somebody somewhere in accounting will be staring at that $50 lunch; nobody is staring at the $10,000 your engineers are spending on cloud computing. It’s hard to fathom this because CTOs, CIOs, and CFOs used to oversee the procurement and purchasing processes. Today, a junior engineer has more purchasing autonomy than anyone in your company.

The world today is having its cloud cost moment. After years of massive growth, the conversation has shifted from growth at all costs to efficient, profitable growth. Some people are wondering if maybe the cloud was a mistake. Maybe it’s a scam. What happens if, when we move to the cloud, we discover that we need to get out quickly? The fear of lock-in has caused many to build with one foot still stuck in the data center, and many are discovering this approach is expensive. Many have discovered that lift and shift, which might have been the greatest lie ever told about the cloud, is super expensive. What is going on? Was the cloud a scam? Unfortunately, it’s exactly this fear of the cloud and all its costs that has created this self-fulfilling prophecy of cloud waste.

Burn the Boats

The cloud is not just someone else’s computer; it is an operating system, a new platform entirely. Like Cortés coming to the New World, we have to set fire to the boats and forget about ever returning home if we are going to be successful. Yet many are still writing code for yesterday’s mainframe, not realizing that we need to rewrite for the cloud if we are going to take maximum advantage of it. Before the DevOps movement started, we would throw code over the wall to ops and move on to the next problem. Now, we write code and throw it over the wall for finance to worry about. All of us living in this economy, today and tomorrow, need to understand that to build great software, that software has to be profitable.

A lot of software is still waiting to be deployed to the cloud. By one estimate, there is $4.6 trillion in IT spending still running in the data center. The cloud, despite its growth, is still in its early days.

We still have a lot to figure out. If we are going to move to the cloud, it’s got to make strong economic sense. Some people are convinced that this isn’t possible. Some people are convinced that it was all a mistake. I happen to know these people are wrong. I live in the cloud, after all. I want to stay there. But I’d also like to see that migration happen in my lifetime, and unfortunately, even with the cloud growing at 50% YoY, none of us will likely live long enough to see it unless we start to build differently.

This conversation is happening because, as we sit down to build things, it hasn’t been abundantly clear to us that the cloud makes strong economic sense.

We have to change that. I’ve checked the numbers, and I can tell you it makes strong economic sense. I’ve seen it, but you’ll have to build differently, write your code differently, and think about systems design differently. You cannot take what worked in the data center, lift and shift it into the cloud, and expect a good outcome. You’ve got to think differently about it.

The Engineer’s Role in Software Profitability

Today, cost efficiency often reflects system quality. A well-architected system is a cost-effective system. One line of code can determine whether or not the company you’re working for is profitable.

We have a challenge we must figure out together: what is the best way to measure cost efficiency? To do that, I want to get into the code. I’m an engineer at heart, so I’ve collected a few examples of million-dollar lines of code, in some cases multi-million-dollar lines of code, to show how easy it is to spend money. It has all been anonymized and translated into Python to protect the innocent. Each of these resulted in people spending way more money than they should have with just a few lines of code.

Example 1: Death by Debug (Even DevOps Costs Money)

In this example, an AWS Lambda function has an average monthly cost of $628. CloudWatch has an average monthly cost of $31,000. What’s happening there? Unfortunately, it is all too common for AWS CloudWatch to cost far more than the actual invocation of a Lambda function. I don’t know how many people have experienced this, but it feels like it eventually happens to everyone building serverless systems in AWS.

For this example, the total annualized cost of this system is $1.1 million just to write log data. What was the cause? A combination of two things: code that shouldn’t have left the building, and a well-intentioned debug line that once did something important. When the ops team turned on debug logging and didn’t think much about it, that line sent massive amounts of data into CloudWatch. Because ops teams are sometimes disconnected from dev (I know we’d all love to think that DevOps is always working together, but it’s not always true), in this case they assumed it was supposed to work that way. It ran for a long time: $1.1 million.

As an aside, think about how much more expensive this could have been if they had also used Datadog to collect their logs. With that in mind, $1.1 million is a bargain, but either way, it’s also tragic and unnecessary.

What’s the fix for this? This one’s pretty simple: get rid of the debug statements; we know that’s the problem. Verbose debug logging is great while a developer is writing and testing the code on a desktop, but it shouldn’t ship to production. Left in place, it’s like a vulnerability, a time bomb waiting to go off. The fix is to delete it.
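The original function isn’t reproduced here, but the pattern looks roughly like this minimal sketch (the event shape and the environment variable are illustrative assumptions, not the original code):

```python
import logging
import os

# Drive the log level from configuration and default to INFO, so a stray
# debug line cannot silently flood CloudWatch in production.
logger = logging.getLogger()
logger.setLevel(os.environ.get("LOG_LEVEL", "INFO"))

def handler(event, context):
    records = event.get("Records", [])
    for record in records:
        # The time bomb: dumping the full payload for every record.
        # Cheap on a laptop; at millions of invokes with debug enabled,
        # this one line can dwarf the cost of the function itself.
        logger.debug("processing record: %s", record)
        # ... the actual processing would happen here ...
    return {"processed": len(records)}
```

Deleting the dump, or at least keeping debug output off by default as above, removes the time bomb without losing the ability to troubleshoot locally.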

Example 2: The API Costs Money

In this example, we’ve got an MVP that found its way into production. Years later, the product is creating billions of API requests to S3, and the volume has grown so slowly that nobody noticed. The total cost of this code over just one year is $1.3 million.

There are many challenges in this code. It worked perfectly as an MVP. That’s great: get an idea, put it on paper, make it happen, deliver it. But why are these calls inside the for loop? Why are we calling out to the S3 APIs on every iteration? We could pull all of this out of the loop and cache or capture the information once. The problem is that this code works.

When it was deployed, it worked just fine. It wasn’t until years later, when it was running at scale, that it started to cost that $1.3 million. There’s also a little detail here: maybe I shouldn’t pass these file references on to the next function for further processing. What’s the fix for this one? Pull the S3 calls outside of the for loop. Calculate or download the data in advance; do it once instead of the millions of times we run through this function. And instead of passing pointers to files that have to be looked up again later, pass the actual data and use it once. Simple stuff. Again, we’ve all done this: we got the code working, it worked as a prototype, then it snuck out the door and we never thought about it again. API calls cost money. Sometimes, in S3, the API calls might cost more than the storage.
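A minimal before-and-after sketch of the shape described above, assuming shared objects fetched with boto3 (the bucket, keys, and per-item handling are illustrative, not the original code):

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "example-bucket"  # illustrative

# Before: the MVP shape. Every item triggers fresh S3 API calls, so
# billions of items mean billions of billable GET requests.
def process_items_mvp(items):
    results = []
    for item in items:
        config = s3.get_object(Bucket=BUCKET, Key="config.json")["Body"].read()
        lookup = s3.get_object(Bucket=BUCKET, Key="lookup.csv")["Body"].read()
        results.append(handle(item, config, lookup))
    return results

# After: fetch the shared objects once, outside the loop, and pass the
# data itself downstream instead of references to re-fetch later.
def process_items(items):
    config = s3.get_object(Bucket=BUCKET, Key="config.json")["Body"].read()
    lookup = s3.get_object(Bucket=BUCKET, Key="lookup.csv")["Body"].read()
    return [handle(item, config, lookup) for item in items]

def handle(item, config, lookup):
    # Stand-in for the real per-item work.
    return {"item": item, "bytes_used": len(config) + len(lookup)}
```

The same items get processed either way; the difference is two S3 requests per run instead of two per item.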

Example 3: How To 2x DynamoDB Write Costs with a Few Bytes

This is an example of a developer being asked to add something straightforward. The record we’re writing to DynamoDB doesn’t have a timestamp, and we’d like to know when it was written, so why not add that field? It should be super easy: a code change that takes a second; somebody tested it, deployed it, and it’s up and running.

Look at the bill shortly after that: DynamoDB costs have just doubled. This one’s a little harder to spot. Does anybody see why adding that single timestamp line made DynamoDB cost twice as much as before? DynamoDB charges for writes in 1 KB increments. The item was about 1,000 bytes, but we added an attribute named “timestamp” (9 bytes) with a value in ISO format (32 bytes). At roughly 1,041 bytes, the item now crosses the 1 KB boundary, so every write consumes two write units instead of one: 2x the cost, from just one line of code.

It’s pretty hard to spot that one. We have to think differently about how the data flows across the wire and, more importantly, how that affects our costs. What’s the fix for this one? We should do two things. First, shrink the attribute name, because DynamoDB counts attribute names toward the item size: make it “ts” instead of “timestamp” and shave off a few bytes (it may feel more aerodynamic on the wire, but the real savings is in the item-size accounting). Second, reformat the timestamp so the value is down to 20 bytes. Good news: at roughly 1,022 bytes we have 2 bytes to spare, we’re back under the boundary, and we’re back where we needed to be. One line of code, half the cost.
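A rough sketch of the arithmetic, assuming a simplified size calculation of attribute names plus values (DynamoDB’s real size rules have more nuance, and the payload here is an illustrative stand-in):

```python
from datetime import datetime, timezone

WRITE_UNIT = 1024  # DynamoDB bills each write in 1 KB increments

def write_units(item):
    # Approximate item size: attribute names plus stringified values.
    size = sum(len(k) + len(str(v)) for k, v in item.items())
    return size, -(-size // WRITE_UNIT)  # ceiling division

payload = "x" * 988  # stand-in for an item that was roughly 1,000 bytes

# Before: a 9-byte attribute name plus a 32-byte ISO-8601 value tips the
# item just over 1 KB, so every write now costs two write units.
expensive = {"id": "abc123", "data": payload,
             "timestamp": datetime.now(timezone.utc).isoformat()}

# After: a 2-byte name and a 20-byte value keep it under the boundary.
cheap = {"id": "abc123", "data": payload,
         "ts": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")}

print(write_units(expensive))  # (1041, 2): double the write cost
print(write_units(cheap))      # (1022, 1): back to a single write unit
```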

Example 4: Leaking Infra-as-Code (Terraform Edition)

Let’s not forget about infrastructure as code, such as Terraform and CloudFormation; it’s guilty here too. In this next example, we have a Terraform template that creates autoscaling groups that can scale clusters up to hundreds or even thousands of EC2 instances. Someone designed the system to recycle the instances every 24 hours; maybe they had a memory leak and thought that was an excellent way to fix it. Unfortunately, somebody in security was worried that the data might be needed, so the option to delete the EBS volumes was removed. This system ran for about a year, slowly leaving behind a growing pile of EBS volumes. At the end of that year, $1.1 million had gone out the door. The example is a bit verbose, as most infrastructure as code is, but the problem comes down to just two lines, in two separate files.

The combination of these two lines creates unattached EBS volumes every 24 hours, one for each EC2 instance created. The first, delete_on_termination, is set to false, which prevents the EBS volumes from being deleted. The second, max_instance_lifetime, is the recycle time. Because these are in two different files, this is easy to miss.

Those two lines meant that every time an EC2 instance spun up, it created an EBS volume that would never be deleted (unless done manually). With the autoscaling group having a max size of 1,000 (and, in this example, 300 to 600 EC2 instances running at any given moment), the number of unattached EBS volumes added up quickly. Over a year, it added up to just over a million dollars. The fix for this one, however, is a little more complicated: you have to change your processes.

You have to think a little about how your team works when they put these things in place. If you create resources, you should always ask how you will eliminate them. This applies to many things in the cloud, not just cost. Many of us have spent the last couple of years thinking about how to scale up, but we don’t think enough about how to scale down. Scaling down is way harder, and it’s way more important. It can also save your business: for travel companies during COVID, the ones that survived were the ones that knew how to scale down. I have heard amazing things about the team at Expedia, for example, but not everyone was so lucky. Beware of well-intentioned infrastructure as code, particularly if you’ve got requirements coming from different teams.
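One process change that would have surfaced this leak early is a routine sweep for unattached volumes. Here is a minimal sketch using boto3 (the region, and whether you alert or clean up, are assumptions to adapt):

```python
import boto3

# Periodic sweep for leaked storage. Run it on a schedule and alert
# (or clean up, once you trust it) when the pile keeps growing.
ec2 = boto3.client("ec2", region_name="us-east-1")  # illustrative region

def find_unattached_volumes():
    paginator = ec2.get_paginator("describe_volumes")
    # "available" status means the volume exists but is attached to nothing.
    pages = paginator.paginate(Filters=[{"Name": "status", "Values": ["available"]}])
    volumes = [v for page in pages for v in page["Volumes"]]
    total_gib = sum(v["Size"] for v in volumes)
    return volumes, total_gib

if __name__ == "__main__":
    volumes, total_gib = find_unattached_volumes()
    print(f"{len(volumes)} unattached EBS volumes totaling {total_gib} GiB")
```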

Example 5: Cost Delivery Network

In the fifth example, I have saved the best for last. We love content delivery networks (CDNs); they make everything go faster by moving content closer to our customers. In this last example, a company with 2.3 million devices deployed worldwide made a small change, and that small change was deployed to all of those devices.

After about 14 hours, the impact of that change started to become an issue. It reached a steady state of about $4,500 an hour before it was corrected, and while it ended up costing several hundred thousand dollars, that was nowhere near what the real impact could have been. If the change had run for a year without anyone noticing, and I’ll explain why they might not have seen it, it would have been a whopping $39 million line of code.

Somebody somewhere would have been begging for that money back, I’m sure. Hopefully, somebody in finance would have seen it after the first month, but by then it would already have cost $648,000. That’s a pretty painful bug. Thankfully, this issue was discovered after six days, making it something of a success story when usually it would have taken months. Unfortunately, that’s a pretty messed-up barometer for a success story.

What was the code? It looked something like this.

In this code is a well-intentioned update function, probably written by an intern a long time ago. It used to be called once a day to download and compare a large file, which seemed like a bad idea, so someone decided it would be more efficient to download metadata instead. Ironically, the change was actually designed to lower costs. They rolled it out and expected everything to move in the right direction, and they weren’t quite sure what had happened when they discovered it wasn’t working as expected.

How many people can spot the bug in this code?

It’s just one single character, and that single-character typo meant the code flipped to the more expensive path. At the same time, the function was moved up from being called once a day to once an hour. The thing is, CloudFront is perfectly happy to serve up content. It did a great job, scaled up, and delivered that content. No systems were harmed, and no errors were detected. Everybody was happy the data was flowing.
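The original code isn’t reproduced here, but as a purely hypothetical sketch of how a one-character flip can route every call down the expensive path (the URLs, names, and structure are illustrative, not the actual code):

```python
import urllib.request

CONTENT_URL = "https://cdn.example.com/payload.bin"    # large file, illustrative
METADATA_URL = "https://cdn.example.com/payload.etag"  # tiny file, illustrative

def fetch(url):
    with urllib.request.urlopen(url) as resp:
        return resp.read()

def update(cached_etag):
    # Intended behavior: check the tiny metadata file and only pull the
    # large payload through the CDN when the content has actually changed.
    remote_etag = fetch(METADATA_URL).decode().strip()
    if remote_etag == cached_etag:  # the one-character bug: should be "!="
        return fetch(CONTENT_URL), remote_etag
    return None, remote_etag
```

With the comparison inverted, the full payload is downloaded precisely when nothing has changed, which is almost every call, and the function is now being called every hour across millions of devices.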

Their customers might have wondered why all this extra data was flowing around their home networks, but because everything continued to work just fine, this was a hard issue to spot. The operations teams and the monitoring tools behind the scenes saw no errors, so there was nothing to alert them. Datadog wasn’t telling them anything was wrong because CloudFront could happily handle the traffic. The only indicator was that they were now spending $4,500 an hour they weren’t spending before, all because of a single character.

The real fix for this one is to go back and rethink the whole thing. Sure, there is a quick fix here, but this was a pretty important aspect of the product, and it is not straightforward to put tests in place to catch these sorts of bugs; many of these issues are impossible to notice in test environments that don’t operate at production scale. Left unnoticed, this painful single-character bug would have resulted in a $39 million bill.

Lessons Learned

What have we learned from all this? Storage is still cheap, and we can keep treating it that way. Calling APIs costs money, and it always will. In fact, you should accept that anything you do in the cloud costs money. It might not be a lot; it might be a few pennies, or a few fractions of a penny, but it costs money. Consider that before you call an API. The cloud has given us practically infinite scale; however, I have not yet found an infinite wallet.

We have a system design constraint that no one seems to be focusing on during design, development, and deployment. What’s the important takeaway from this? Should we now layer one more thing on top of what it means to be a software developer in the cloud these days? I’ve been thinking about this for a long time, but the idea of adding one more thing to worry about sounds pretty painful. Do we want all of our engineers agonizing over the cost of their code? Even in this new cloud world, the following quote from Donald Knuth is as true as ever.

“Premature optimization is the root of all evil” – Donald Knuth

The first thing that we have to figure out as engineers is, will this damn thing even work? Can I even solve this problem?

After all, none of the examples I’ve shared with you are problems until you get to scale. They’re not actually problems unless you’re successful. They’re not issues you should care about unless you might be onto something with the product or service you’re building.

Cloud engineers should consider cost, but they should do so iteratively over time and not all at once.

First, answer that question: Can it even be done? Then remember, you work on a team. Is this the right way to do it as a team? How will others maintain my code? Next, what happens if this thing becomes popular? That’s the point where you should start thinking about how much money this should cost to run.

When I started building some of my first systems in the cloud, I had no idea how much this stuff should cost. When I went to my CFO and said I wanted to use AWS for a project, he said, “Erik, you can do whatever you want, but you have a budget of $3,000. Don’t go spend it all at once.” This was a long time ago, when the cloud was very new. I knew that if I could somehow keep my project under $3K, I would get to play with the cloud. So I became suddenly obsessed with maximizing my return on that investment, and it paid off: I stayed under budget, and I’ve been playing with the cloud ever since. Is that for everyone? Do we want to give engineers a budget? I would argue that we want to give engineers something more powerful than a raw number; telling a team their service costs the equivalent of buying a Lamborghini every day is abstract and bizarre. Instead, I’d like everyone to focus on efficiency.

Cloud Efficiency Rate

For that, I want to introduce a concept called the cloud efficiency rate. It’s designed to be simple, and it can guide you to the right time to start optimizing your costs without doing it prematurely. You calculate the Cloud Efficiency Rate (CER) by taking your revenue minus your cloud costs, divided by your revenue, which gives you a percentage. For example, say you’re making $100 million a year as a company and your cloud costs are $20 million. You’re spending 20 cents per dollar of revenue, and your cloud efficiency rate is 80%. That’s awesome, but you don’t necessarily need to get there right out of the gate.
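As a trivial sketch, the calculation is just:

```python
def cloud_efficiency_rate(revenue, cloud_costs):
    # CER = (revenue - cloud costs) / revenue, expressed as a percentage.
    return (revenue - cloud_costs) / revenue * 100

# The example above: $100M in revenue with $20M in cloud costs.
print(cloud_efficiency_rate(100_000_000, 20_000_000))  # 80.0, an 80% CER
```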

You should think of your cloud efficiency rate as a way of rationalizing cost as a non-functional requirement. For any cloud project, have your product or business people define the desired cloud efficiency rate at the start of the project, along with the points in the application’s lifecycle at which it should change.

Remember, during R&D you are just trying to figure out if it works. Your CER could be negative; it probably should be negative, because something is wrong if you’re making money before shipping the product. Once you get to MVP, try to break even. It’s fine to have a low CER at this point; zero to 25% is plenty while you are just trying to get to product-market fit. Once people are telling you they want to buy your product and you are starting to wonder what happens if this thing becomes popular, it’s time to get to 25% to 50%.

Once you are scaling and people are running around buying your stuff, you need a demonstrable path to healthy margins; you want to get to between 50% and 70%. Finally, the last stage is steady state: if you want a healthy business and want to be a profit engineer for your organization, work on getting to 80%. Use your cloud efficiency rate goals to transform dollar costs that no one really understands into a target you can use to guide your cloud initiatives. It can apply across your entire cloud platform or be specific to an individual customer, a feature, a service, or anything new. As a rule of thumb, I recommend targeting a CER of about 80%.

Conclusion

When I was doing the homework for this article, I stumbled by chance onto Sir Tony Hoare, whom some credit with the premature optimization quote. What I discovered blew me away: Sir Tony, it turns out, had beaten me to this concept by years. In his 2009 talk at QCon London, he described his billion-dollar mistake: inventing the null reference in 1965. Because of this early sin, he estimates he has probably cost the world economy billions of dollars. He’s probably right.

One thing we all know as engineers is that, over time, our code takes on a life of its own. It falls into other people’s hands; it moves on, and we lose sight of it. Something that cost a few pennies while we were writing it and testing it on a laptop may now be deployed and running somewhere, costing $1 million a year. The next time you return to your computer and peek at some of your code, think about that. Hopefully, you won’t make a negative discovery.

In closing, I will leave you with this simple statement:

“Every engineering decision is a buying decision.”

– Erik Peterson, Co-founder and CTO of CloudZero

If you write code today, you are making the buying decisions. You have a very powerful role in your organization, in this economy, and in this cloud-driven world. I hope you will use that power wisely.




