by OpsMx | last updated on November 17, 2022

OpsMx works with a company that provides a well-known destination site, founded in the early 2000s, with more than 200M unique monthly viewers worldwide. It has received more than 300M user reviews and is ranked as a leader in its segment.

Website Use Case: Challenges Solution Results


Challenge
Accelerate Innovation by Reducing Software Delivery Time

Faced with stiff competition, the organization needed to increase the speed of delivering enhancements to their end customers while simultaneously reducing production failures caused by defective updates. 

The primary bottleneck they faced was a lengthy manual approval process to move updates from staging to production.  However, shortening the approval process had been proven to increase problems in production. 

Their IT architecture is complex, aggravating the problem. They deploy a broad range of microservices-based applications on Kubernetes, as well as a large number of monolithic applications. Their CI/CD system was built using Jenkins, plugins, and custom scripts. 

Moving more quickly was a key goal. They could process only 50 to 100 updates per month, and their goal was many hundreds of updates per month. Of course, like all companies, they are also under pressure to reduce costs. They estimated verifying updates to have annual direct costs of more than $1M. 

They evaluate every significant deployment as they move to production. The analysis requires 3 expert engineers, including at least one technical lead and one product engineer, and it takes an hour or more to decide whether to move the deployment forward. 
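To make the scaling problem concrete, the figures above can be turned into a rough lower bound on review effort. This is only an illustrative sketch: it assumes every update receives the full three-person, one-hour review (the text says only that significant deployments do), and the 500 updates per month is an assumed stand-in for "many hundreds".

```python
# Illustrative arithmetic only. Inputs are the figures quoted in the text,
# plus the assumption that every update gets the full manual review.
ENGINEERS_PER_REVIEW = 3   # from the text: three expert engineers
HOURS_PER_REVIEW = 1.0     # from the text: "an hour or more", so a lower bound


def monthly_review_hours(updates_per_month: int) -> float:
    """Lower bound on engineer-hours spent on manual approvals per month."""
    return updates_per_month * ENGINEERS_PER_REVIEW * HOURS_PER_REVIEW


# 50 and 100 come from the text; 500 is an assumed value for "many hundreds".
for updates in (50, 100, 500):
    hours = monthly_review_hours(updates)
    print(f"{updates:>4} updates/month -> at least {hours:.0f} engineer-hours")
```

Under these assumptions, review effort grows linearly with update volume, which is why the manual approval step becomes the limiting factor as the team tries to ship more often.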

The analysis process for every update was time-consuming because of the mountains of data generated. With hundreds of thousands of concurrent users, a tremendous volume of metrics and logs is created. Consistently finding the “needle in the haystack” that reveals a potential problem is difficult, even for experienced engineers. As the team’s frequency of updates increased, the severity of the problem grew until they were nearly at a breaking point.

The analysis and decision process is a bottleneck

Gathering and filtering the metrics and logs was monotonous and time-consuming. The data analysis was slow and problematic because it depended on subtle differences that were hard to find. And many times, the final approval decision was hard because conflicting data would point to both promoting and rejecting the update.

The leader of the developer productivity team, who improved this situation, said it best. “Even though most updates should be moved to production, you can’t assume that everything works all the time. We strive for uniformly fast AND reliable releases.” 

Too many errors

Any impact on the availability and performance of the customer-facing applications has a direct impact on company revenue and a large indirect impact on the image and reputation of the brand. Too many updates were being approved incorrectly. Worse still, errors that should have been caught – because they had occurred before – continued to be made.

Requires Expensive Experts

In order to reduce the chance of an incorrect decision, the most senior engineers conducted most reviews and analyses. It was challenging to train inexperienced engineers because of the time pressure, the complexity and subtlety of the analysis, and the limited number of people who were qualified to train. 

Solution
Autopilot: A Layer of Intelligence for Deploying with Jenkins

The best way to increase the speed and reliability of a process is to automate it using machine intelligence. The team had the vision to improve productivity: use ML to automate the verification and approval process in the deployment cycle. 

To enable this vision, the solution needed to be as accurate and consistent as a human team of experienced experts. Any errors – either rejecting an update that should be approved or accepting an update that was later determined to be faulty – would have large consequences, so any solution needed to perform better than the human experts.

“Autopilot is our layer of intelligence that makes continuous delivery effective.” – Director of Developer Productivity

After a thorough evaluation of potential solutions, including trying to build the solution on their own, they worked with OpsMx and implemented OpsMx Autopilot. Autopilot is an intelligence layer for software delivery that integrates with any CI/CD platform. It uses AI/ML to automate verification (see the screenshot below) and approvals, provide continual governance, and create visibility and insights into operations and best practices.

Continuous Verification in CI/CD pipeline via Autopilot evaluating logs in Elasticsearch and metrics from Datadog


Here, Autopilot gathers and evaluates logs stored in Elasticsearch and metrics from Datadog and others. Using natural language processing, statistical analysis, and machine learning algorithms, Autopilot analyzes every deployment and assigns a confidence score. The pipelines are configured to automatically promote updates to production when they are very likely to be successful, and reject them and return them for rework if the confidence score is too low.
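The promote-or-reject behavior described above can be sketched as a simple threshold gate. This is a minimal illustration, not Autopilot's actual API or configuration: the names `Decision`, `GatePolicy`, and `gate`, the 0 to 100 score scale, and both threshold values are assumptions made for the example. The middle band reflects the later remark that some decisions still need human review.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    PROMOTE = "promote"        # push the update to production automatically
    REVIEW = "manual_review"   # hand the decision to an engineer
    REJECT = "reject"          # return the update for rework


@dataclass(frozen=True)
class GatePolicy:
    # Hypothetical thresholds on an assumed 0-100 scale; the source does not
    # state the real scale or cut-off values.
    promote_at: float = 90.0
    reject_below: float = 60.0


def gate(confidence_score: float, policy: GatePolicy = GatePolicy()) -> Decision:
    """Map a deployment's confidence score to a pipeline decision."""
    if not 0.0 <= confidence_score <= 100.0:
        raise ValueError(f"score out of range: {confidence_score}")
    if confidence_score >= policy.promote_at:
        return Decision.PROMOTE
    if confidence_score < policy.reject_below:
        return Decision.REJECT
    return Decision.REVIEW


for score in (97.5, 72.0, 41.0):
    print(score, gate(score).value)
```

Keeping the thresholds in a separate policy object means a mission-critical application could, in principle, be given a stricter gate than a low-risk one without changing the decision logic.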

Results
Faster and More Reliable CI/CD Pipelines

Since the deployment of Autopilot, this company has seen significant improvements in software delivery velocity. Most production approvals now require zero time from an engineer; even decisions that need to be reviewed are completed more quickly because the data is gathered and initial analysis is completed automatically. The history of similar errors is automatically retrieved, along with the corrective action, speeding the resolution of issues. 

With Autopilot, the number of updates has increased from 100 per month to over 1000, and errors in production have decreased as well.

The system has also improved the quality of the approval decision, both approving acceptable updates more quickly and rejecting more errors before they reach production. This improvement in accuracy is especially important in their most mission-critical applications – some applications run the Autopilot verification and approval process over five times a day.

Overall, the increase in update velocity is due in large part to the shorter approval cycle, which enables them to respond to their customers more quickly.

The leader of the developer productivity team says, “Autopilot has really helped us by automating the analysis of our deployments. It is very reliable in finding potential issues and has proven itself to be better than our experts at evaluating risk. Because it is automated, it is very consistent – we don’t worry that it will have a bad day and miss an issue.”

Because the system learns continually, expert engineers can train Autopilot. Over time, this dramatically reduces the time they spend analyzing updates and frees them to work on higher-value activities.

“Autopilot is more effective than our experts at evaluating updates.”  – Director of Developer Productivity

Overall, the new system has improved production reliability and has enabled faster development of new capabilities, adding the equivalent of over six full-time senior engineers to the team. 

The deployment of Autopilot is now moving to its second phase: automatic policy checking. For example, to more easily meet SOX regulatory compliance, the person implementing any given change cannot approve moving the change into production. Similarly, a QA manager must approve all significant updates. These policies and many others can be validated before an update is considered for promotion to production, saving even more time.
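The two example policies above can be expressed as pre-promotion checks. The sketch below is hypothetical and does not show how Autopilot defines policies: `ChangeRequest`, `policy_violations`, and the field names are invented for illustration, and only the two rules mentioned in the text are encoded.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ChangeRequest:
    # Hypothetical record of a change awaiting promotion to production.
    author: str
    approvers: frozenset[str]
    significant: bool = False


def policy_violations(change: ChangeRequest, qa_managers: frozenset[str]) -> list[str]:
    """Return the policy violations that should block promotion (empty if none)."""
    violations: list[str] = []
    # Separation of duties: the implementer cannot approve their own change.
    if change.author in change.approvers:
        violations.append("author approved their own change")
    # Significant updates need sign-off from at least one QA manager.
    if change.significant and not (change.approvers & qa_managers):
        violations.append("significant update lacks QA manager approval")
    return violations


qa = frozenset({"qa_lead"})
ok = ChangeRequest(author="alice", approvers=frozenset({"bob", "qa_lead"}), significant=True)
bad = ChangeRequest(author="alice", approvers=frozenset({"alice"}), significant=True)
print(policy_violations(ok, qa))
print(policy_violations(bad, qa))
```

Running checks like these before the verification step means an update that cannot be promoted for compliance reasons is stopped early, without consuming analysis time.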

These policy checks are expected to pay off in faster releases and better compliance, which together lead to higher-quality releases. The productivity team leader concluded, “We’re glad to be partnering with OpsMx and believe that Autopilot is the layer of intelligence that makes our continuous software delivery system effective.”

Read more Autopilot user stories:

  1. Telecom Leader Accelerates Time to Market with OpsMx
  2. Networking Leader Automates Build Analysis with OpsMx Autopilot
  3. How Customers Improve CI/CD Velocity Using Autopilot

If you want to know more about Autopilot or request a demonstration, please book a meeting with us for an Autopilot demo.


OpsMx is a leading provider of a Continuous Delivery platform that helps enterprises safely deliver software at scale and without any human intervention. We help engineering teams take the risk and manual effort out of releasing innovations at the speed of modern business. For additional information, contact us.

OpsMx

Founded with the vision of “delivering software without human intervention,” OpsMx enables customers to transform and automate their software delivery processes. OpsMx builds on open-source Spinnaker and Argo with services and software that help DevOps teams SHIP BETTER SOFTWARE FASTER.
