How We Decreased Cloud Costs and Increased System Resilience

Whether you work at a tech startup or at a factory, your company is likely reliant on the cloud for multiple essential business functions. From “simple” cloud-based applications like Google Docs to advanced cloud computing functions, the cloud has been adopted by consumers and businesses alike. Cumulus is no exception.  
 
For us, the cloud is a critical part of our platform. It’s what allows us to have a global footprint, delivering comprehensive digital work records for our customers to access whenever and wherever they like, for visualization, reporting, and analytics tasks.  
 
However, rising cloud costs and our growing customer base resulted in sky-high monthly bills. And we’re not alone. According to VentureBeat, 81% of IT teams have been directed by their C-Suite to reduce or halt additional cloud spending. 

After seeing our monthly bill increase by more than 400%, it was clear that we had to completely rethink how we were using the cloud. The ensuing journey resulted in something even better than mere cost reduction: a more resilient system. With data backed up with point-in-time recovery and faults better contained within the affected service, our technology has never been more reliable for our customers.

Because of our strong (and in some ways unexpected) results from this process, we want to share our strategy and approach. We hope this information is helpful to other businesses that are looking to increase system resilience while reducing cloud costs.

 

What was causing our high cloud costs? 

This was the first question that our team had to answer. It turns out that our company history played a big part in this story. Cumulus began as a spin-out from a large Fortune 100 company, where we inherited a typical on-prem enterprise software stack.

Accordingly, our software was essentially lifted and shifted onto a cloud-hosted platform. This was convenient at the time, but it left us with a lot of cost inefficiencies, driven by three factors:

  1. The legacy technologies used in the original system were priced at a premium by service providers.  
  2. Our customers have “bursty” usage patterns. 
  3. We had to anticipate and provision for peak usage levels since scaling our services took too much time. 
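
To see why bursty usage and peak provisioning are an expensive combination, consider the back-of-the-envelope sketch below. Every number in it (the hourly request counts, the capacity per instance, and both prices) is an invented assumption chosen for illustration, not a Cumulus figure or a real provider's price list.

```python
# Illustrative only: every number below is a made-up assumption, not Cumulus data.
import math

# Hypothetical hourly request counts for one day: mostly quiet, with one burst.
hourly_requests = [200] * 20 + [5000] * 4

REQUESTS_PER_INSTANCE_HOUR = 1000  # assumed capacity of one always-on instance
INSTANCE_PRICE_PER_HOUR = 0.50     # assumed price of one instance-hour
PRICE_PER_REQUEST = 0.0002         # assumed pay-per-use price per request


def peak_provisioned_cost(requests):
    """Cost when enough instances for the busiest hour run all day."""
    instances = math.ceil(max(requests) / REQUESTS_PER_INSTANCE_HOUR)
    return instances * INSTANCE_PRICE_PER_HOUR * len(requests)


def pay_per_use_cost(requests):
    """Cost when each request is billed individually."""
    return sum(requests) * PRICE_PER_REQUEST


provisioned = peak_provisioned_cost(hourly_requests)
on_demand = pay_per_use_cost(hourly_requests)
print(f"provisioned for peak: ${provisioned:.2f}")
print(f"pay per use:          ${on_demand:.2f}")
```

Under these assumed numbers, capacity sized for the four busy hours sits mostly idle for the other twenty, which is where the gap between the two totals comes from. With steadier traffic or different prices, the comparison could easily go the other way.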

It became apparent that we needed to transition to an architecture that scales to serve many customers. With limited resources, we had to focus on building features that help our customers succeed, while carefully controlling the work required to move to the new architecture. Serverless technologies became a natural fit for Cumulus. 

 

How did we reduce our cloud costs? 

After monitoring our platform to determine the fault lines causing cost tremors, our team had the knowledge we needed to take action.

The process of lowering our high cloud costs spanned four broad themes: 

1. Architecture 

Leveraging serverless technologies, we decided to evolve our system from a monolithic structure to serverless microservices using a strangler fig approach, which gradually replaces functionality in the old architecture with new services. This let us tailor the rollout of the new architecture to our customers’ needs without interrupting our product roadmap. The new architecture proved robust against bursts of intense customer usage, which had been challenging to accommodate with the monolithic architecture.
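
For readers unfamiliar with the strangler fig pattern, here is a minimal Python sketch of the routing idea behind it. It is not our production code: the facade class, the path prefixes, and both handlers are hypothetical stand-ins, and a real system would typically do this routing in an API gateway or similar layer.

```python
# Sketch of a strangler fig routing facade. All names are hypothetical.

def legacy_monolith(path, payload):
    """Stand-in for the original monolithic application."""
    return {"handled_by": "monolith", "path": path}


def reports_service(path, payload):
    """Stand-in for a new serverless microservice."""
    return {"handled_by": "reports-service", "path": path}


class StranglerFacade:
    """Routes a request to a migrated service if one exists, else to the monolith."""

    def __init__(self, legacy):
        self._legacy = legacy
        self._migrated = {}  # path prefix -> new service handler

    def migrate(self, prefix, handler):
        """Register a new service as the owner of a path prefix."""
        self._migrated[prefix] = handler

    def handle(self, path, payload=None):
        # Longest matching prefix wins, so more specific routes take priority.
        for prefix in sorted(self._migrated, key=len, reverse=True):
            if path.startswith(prefix):
                return self._migrated[prefix](path, payload)
        return self._legacy(path, payload)


facade = StranglerFacade(legacy_monolith)
before = facade.handle("/reports/daily")     # still served by the monolith
facade.migrate("/reports", reports_service)  # one slice of functionality moves
after = facade.handle("/reports/daily")      # now served by the new service
untouched = facade.handle("/records/42")     # everything else is unchanged
```

Because callers only ever talk to the facade, each slice of functionality can move on its own schedule, and the monolith shrinks until nothing routes to it.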
 

2. Cloud Technologies

We shifted our primary storage from a relational database to DynamoDB tables. We also made the following changes:  

  • Replaced shared-secret authentication with granular IAM policies for internal service-to-service authentication 
  • Used AWS Backup to enable point-in-time recovery (PITR)
  • Eliminated the AppSync cache by improving backend resources
  • Reduced verbose data logging 

The result was a system that made more efficient use of cloud services while achieving better performance and resilience.
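
As a rough illustration of what "granular" can mean for IAM policies, the sketch below builds a least-privilege policy document that lets one internal service read from a single DynamoDB table and nothing else. The region, account ID, and table name are placeholders invented for this example, and the policy is not one of our actual policies.

```python
import json

# Hypothetical identifiers: the region, account ID, and table name are placeholders.
TABLE_ARN = "arn:aws:dynamodb:us-east-1:123456789012:table/WorkRecords"


def read_only_table_policy(table_arn):
    """Build an IAM policy document that allows only reads on one table."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["dynamodb:GetItem", "dynamodb:Query"],
                "Resource": table_arn,
            }
        ],
    }


policy = read_only_table_policy(TABLE_ARN)
print(json.dumps(policy, indent=2))
```

Compared with a shared secret, which tends to grant the same access to every service that holds it, a policy like this is scoped to one caller, one resource, and a short list of actions.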
 

3. Data Strategy

While migrating our system, we realigned our data strategy with the application design. New reporting features had led to a significant increase in costs. We analyzed our reporting queries and developed a better understanding of our reporting technology.

We made small changes to our query design and application schema to make generating the reports more cost-effective. With a good schema design, we could transport data and run queries with improved efficiency and reduced cost.
 

4. Software Optimization

Adopting serverless technologies afforded us the opportunity to reassess some assumptions and business logic. We redesigned our backend services using a Function-as-a-Service (FaaS) framework. The routines were optimized for execution on stateless containers in the cloud. Every function was custom-designed and optimized for a specific task.
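
To give a feel for what a single-purpose, stateless function looks like, here is a minimal Python sketch in the common `handler(event, context)` style. The function name, the event fields, and the response shape are assumptions made up for this example; they do not describe our actual services.

```python
import json


def summarize_work_record(event, context=None):
    """Handle one invocation using only the data carried in `event`.

    The function keeps no state between calls, so any container
    can serve any request.
    """
    tasks = event.get("tasks", [])
    completed = sum(1 for task in tasks if task.get("status") == "done")
    body = {
        "record_id": event.get("record_id"),
        "total": len(tasks),
        "completed": completed,
    }
    return {"statusCode": 200, "body": json.dumps(body)}


# Hypothetical event, shaped for this example only.
sample_event = {
    "record_id": "rec-001",
    "tasks": [{"status": "done"}, {"status": "open"}, {"status": "done"}],
}
response = summarize_work_record(sample_event)
print(response["body"])
```

Since the output depends only on the input event, the platform is free to start, stop, and multiply containers as usage rises and falls.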

 

What were the results? 

After transitioning customers to the new architecture, our cloud costs were significantly reduced. Overall, we lowered our monthly bill by 71% between our peak in September 2021 and our new average cost as of August 2022.  

This decrease was driven by a reduction in maintenance costs: both the effort required to keep the system running and the support cost charged each month.

Perhaps most importantly, the improvements we made yielded a highly resilient system. Data is backed up with point-in-time recovery, enabling swift recovery from service outages. Further, dead-letter queues (DLQs) capture failed messages that can be re-driven when failures are resolved.  
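
The dead-letter queue idea can be sketched in a few lines of Python. This is an in-memory toy, not our implementation: the queue structures, the retry limit of three, and the simulated outage are all assumptions made for illustration, and a managed queue service would handle these mechanics in practice.

```python
from collections import deque


def process_queue(queue, handler, dead_letters, max_attempts=3):
    """Deliver each message; after max_attempts failures, park it in the DLQ."""
    processed = []
    while queue:
        message = queue.popleft()
        try:
            handler(message["body"])
            processed.append(message["body"])
        except Exception:
            message["attempts"] += 1
            if message["attempts"] >= max_attempts:
                dead_letters.append(message)  # stop retrying, keep the message
            else:
                queue.append(message)  # retry later
    return processed


def redrive(dead_letters, queue):
    """Move dead-lettered messages back to the main queue with a fresh retry budget."""
    while dead_letters:
        message = dead_letters.popleft()
        message["attempts"] = 0
        queue.append(message)


service_status = {"healthy": False}  # simulated downstream outage


def handler(body):
    if body == "needs-downstream" and not service_status["healthy"]:
        raise RuntimeError("downstream service unavailable")


queue = deque({"body": b, "attempts": 0} for b in ["ok-1", "needs-downstream", "ok-2"])
dlq = deque()

first_pass = process_queue(queue, handler, dlq)
parked_after_outage = len(dlq)

service_status["healthy"] = True  # the failure has been resolved
redrive(dlq, queue)
second_pass = process_queue(queue, handler, dlq)
```

The point of the pattern is that a failing message neither blocks the healthy ones nor gets lost: it waits in the DLQ until the underlying problem is fixed.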

The new system also replicates services across multiple availability zones within each region. Availability zones are partitioned and isolated, which provides both independence and redundancy. Lastly, any fault that occurs in one part of the system is largely contained within the affected service. This creates better overall availability of our technology to our customers. 

 

Conclusion 

Because the cloud has been so widely adopted by companies ranging from railroad construction to telemedicine, we’re all reliant on the benefits that it provides. Increasing costs are hitting us all hard.

The work we did shows that it’s possible to simultaneously keep costs under control and improve the integrity and resilience of your system.

For more helpful content, subscribe to our LinkedIn Newsletter, Work Done Right, or check out our YouTube channel. 
