Understanding DevOps with Real-Life IT Scenarios

Reading Time: 8 mins

Overview:

DevOps may sound like a buzzword, yet it is a quiet pillar behind successful and influential companies such as Google, Amazon, Facebook, Etsy and Netflix. It is a prudent practice for any company that wishes to survive and succeed in a competitive market. Its reach is not limited to the software industry; it can also be extended to the hardware industry. Delivering a quality product demands scalability, reliability, stability, security and a seamless user experience. Meeting all of these demands within one application or service is by no means easy, and it requires a company whose development and internal operations are structured and well organized. Tuning a company (any company, for that matter) in this way takes rigorous and consistent practice.

In this article, you will see why DevOps is needed, what it changes and what outcomes it brings, with the help of real-life IT scenarios.

What it’s like to be ‘before DevOps’:

Consider the journey of Target, one of the largest retailers in the US. Before its DevOps transformation, accessing data was tedious because everything was locked up in legacy systems that held multiple sources of truth. Under those circumstances, an integration for a customer took almost six months to build and another six months to test manually, because the team had to manage interactions and hand-offs with thirty different teams, along with their dependencies, to get the data the integration needed.

Clearly, there was no continuous communication or collaboration among the Development, IT Operations, Auditing, Testing, Networking and Server teams, as they worked in silos.

The result was a very long lead time, because development work sat in a queue waiting for data to be retrieved.

Another company, CSG International, which provides Business Support Systems primarily for the telecommunications industry, aimed to improve the stability and predictability of its product. To do so, the Operations team practised frequent deployments in the staging environment. However, the staging environment was far from replicating the production environment, as it lacked assets such as the SAN, firewalls, security and load balancers, so they could not achieve what they had set out to do.

Interestingly, they overcame this issue by creating a Shared Operations Team (SOT), combining developers and operations, to perform daily deployments in development and testing and a production deployment every fourteen weeks. This routine of frequent deployment across development, testing and production equipped them to detect and fix defects early. It also gave them the confidence to automate error-prone manual steps and to fix defects that kept recurring.

In the above case, CSG adopted one of the DevOps practices: making a habit of daily deployment in development and testing, with regular deployments to production. By creating the Shared Operations Team (SOT), they also encouraged and practised collaboration and coordination between teams.

Likewise, before the DevOps concept emerged, many common practices posed major threats and constraints to delivering a quality product and achieving business goals. They include:

Communication took place only through weekly project management meetings, conferences, email, the ticketing system and an ITIL-compliant change management tool.
To meet deadlines, teams often skipped scheduled meetings and failed to keep track of them.
The old ticketing system and similar tools did not help, as they were complicated and consumed time on menial tasks.
There was no room for proper documentation of changes, as teams did not want to miss the deadline.
As a result, keystrokes were not logged and terminal sessions were not recorded.
The development team often used up the entire schedule, leaving no time for testing and operations deployment.
The testing and operations teams then took shortcuts to hit the date, which resulted in an unstable and unusable software product.
Version control was insufficient, and version numbers had to be tracked for the entire release.
Releases were often sent as single files rather than as a complete package, leading to many moving parts that were difficult to replicate.
It was difficult to distribute the load across servers; at times, teams had to cannibalize servers in production.

These practices put organizations in grave danger of depending on a single key resource (be it a person, a machine or a piece of equipment) to handle escalations. This contradicts the very nature of teamwork: everything relies on one resource’s ability, which in turn makes that resource irreplaceable. The business and its internal projects cannot safely rely on a single key resource. Because there is no replacement for the constraint (the key resource), work piles up inside Operations, and Work In Progress (WIP) becomes the everyday status.

So how does DevOps fix these problems?

Gene Kim, one of the pioneers of the DevOps concept and co-author of “The Phoenix Project: A Novel About IT, DevOps, and Helping Your Business Win”, highlighted three ways as the crux of DevOps practice.

First Way: To create a fast flow of work as it moves from Development into IT Operations, because that’s what is between the business and the customer.
Second Way: To shorten and amplify feedback loops, so as to fix quality at the source and thereby, eliminate rework.
Third Way: To create a culture that simultaneously fosters experimentation, learning from failure, and understanding that repetition and practice are the prerequisites to mastery.

To put it simply, the adoption of practices, techniques and tools that aim to fulfil business goals via these three ways can be seen as the DevOps approach. Yet DevOps is by no means limited to these three ways; it is more than just an IT capability. To understand it better, let us look at three case studies, one for each of the three ways.

Scenario 1: Fast flow of Value: Blue-Green Deployment by Dixons Retail

Dixons Retail, one of the largest consumer electronics retailers in the UK, operates thousands of Point-Of-Sale (POS) systems for different customer brands. Everything runs smoothly until the POS systems need to be upgraded. This is critical because the POS systems reside in hundreds of retail stores, and even a small disruption during the upgrade can cause mayhem for the customers who rely on them. The difficulty was that the POS clients and the centralized server had to be upgraded simultaneously, which in turn required a week’s downtime and massive network bandwidth.

Dan North and Dave Farley (co-author of “Continuous Delivery”) decided to overcome this issue using the blue-green deployment pattern: running two identical production environments, with one (blue) live and the other (green) idle. Although blue-green deployment is mostly associated with online web services, they used this approach to reduce the risk and changeover time of POS upgrades. North and Farley created two versions of the centralized server software so that both the old and new versions of the POS clients could be supported. The idea was to roll out the new POS client installers to the retail stores over the slow network links and deploy them in an inactive state a week before the POS upgrade.

Meanwhile, the old versions kept running as usual. Once the POS clients and server had been staged for the upgrade and tested successfully, store managers were given the authority to decide when to release the new version. Whether they chose to release it immediately or later (based on their business needs), leaving the choice to them made a dramatic difference in achieving business goals.

By using the blue-green deployment pattern, North and Farley achieved a smoother and faster release (the flow of value) with significantly less disruption to store operations. They also made the upgrade and deployment process much safer and more reliable, which demonstrates that DevOps practices can be applied universally, even to thick-client applications.
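The core of the blue-green pattern described above can be sketched in a few lines of Python. This is only an illustration, not the Dixons implementation: the Environment and BlueGreenRouter classes, the version strings and the smoke test are all hypothetical names chosen for the example. The sketch shows the three steps that matter: stage the new release on the idle environment, verify it there, and switch over only when someone decides to release.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Environment:
    """One of two identical production environments (hypothetical model)."""
    name: str
    version: str
    verified: bool = True


@dataclass
class BlueGreenRouter:
    """Sends all traffic to `live`; `idle` receives no traffic."""
    live: Environment
    idle: Environment

    def stage(self, new_version: str) -> None:
        # Install the new release on the idle environment only.
        # The live environment keeps serving the old version.
        self.idle.version = new_version
        self.idle.verified = False  # not trusted until tests pass

    def verify(self, smoke_test: Callable[[Environment], bool]) -> bool:
        # Run a caller-supplied check against the idle environment.
        self.idle.verified = bool(smoke_test(self.idle))
        return self.idle.verified

    def release(self) -> None:
        # The switch is a separate, deliberate step (in the Dixons case,
        # a decision left to the store manager).
        if not self.idle.verified:
            raise RuntimeError("idle environment has not passed verification")
        self.live, self.idle = self.idle, self.live

    def rollback(self) -> None:
        # The previous release is still installed on the idle side,
        # so rolling back is just another swap.
        self.live, self.idle = self.idle, self.live


router = BlueGreenRouter(
    live=Environment("blue", "1.0"),
    idle=Environment("green", "1.0"),
)
router.stage("2.0")
router.verify(lambda env: env.version == "2.0")
router.release()
print(router.live.name, router.live.version)
```

Keeping release() separate from stage() is the design choice that matters here: the slow, risky part (copying and testing the new version) happens while the old version is still serving, and the changeover itself is reduced to a quick swap that can be reversed.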

Scenario 2: Amplifying Feedback: Launch and Hand-off Readiness Review at Google

The Launch Readiness Review (LRR) and Hand-off Readiness Review (HRR) safety checks created at Google are a good example of amplifying feedback (between developers and operations) and enabling safe deployment. Google gives its Ops engineers a functional orientation under the title “Site Reliability Engineer” (SRE), a term coined by Ben Treynor Sloss, who defined the role as “what happens when a software engineer is tasked with what used to be called operations”. SREs come to the aid of product teams only for services of the highest importance, and even then those services must have a low operational burden and must have been self-managed by their developers for at least six months. Only services that meet these criteria are eligible to be assigned an SRE team; the rest remain in the developer-managed state.

To bring the collective experience of the SREs to self-managed product teams, Google created two sets of safety checks for releasing critical services: the Launch Readiness Review (LRR) and the Hand-off Readiness Review (HRR). The LRR must be performed before a product is released to receive public traffic, whereas the HRR is performed when a product or service is transitioned to the Ops-managed state. The HRR is the more stringent of the two and has a higher acceptance standard, while the LRR can be self-reported by the product team.

The LRR and HRR safety checks make Google’s organizational memory visible to everyone. Each time a release is performed, people who are not directly involved in the launch gain less first-hand experience; with the reviews in place, everyone can benefit from the collective experience of previous releases, whether those launches succeeded or failed.

The reviews also make the process more transparent and predictable by enabling product teams to self-manage their products and services, which in turn teaches them about the downstream work centre.

Scenario 3: Injecting experimentation and learning: Static Security Testing at Twitter

One of the best case studies for experimentation and learning from failure is the static security testing performed at Twitter. Justin Collins, Alex Smolen and Neil Matatall described the transformation of Twitter’s information security at the AppSec USA conference in 2012. At the peak of its hyper-growth, Twitter faced many challenges and setbacks. Between January and March 2009, the number of active Twitter users went from 2.5 million to 10 million. It was during this period that the famous Fail Whale error page appeared everywhere, because Twitter could not keep up with demand.

During the same period (early 2009), two critical security breaches occurred: in January, the @BarackObama account was hacked, and in April, Twitter administrative accounts were compromised through a brute-force dictionary attack. These incidents led the Federal Trade Commission (FTC) to issue a consent order requiring Twitter to comply within sixty days by instituting a set of procedures that would be enforced for the following twenty years. The procedures include:

To assign a group of employees to be in charge of Twitter’s information security plan.
To identify reasonably foreseeable threats from both internal and external sources and to implement plans to address them.
To maintain the privacy and integrity of user information against both outside and internal sources.

The role of the assigned group was to take care of the information security plan by integrating security into the daily work of Dev and Ops, thereby closing all possible security holes. According to Collins, Smolen and Matatall, a number of aspects had to be addressed in the process: taking a holistic approach, preventing the repetition of security breaches, integrating security objectives into the existing DevOps tools, and keeping a fast flow through infosec (via automation). The team overcame these challenges by integrating static code analysis into the Twitter build process using Brakeman, which scans Ruby on Rails applications for potential vulnerabilities. The result was impressive: by showing developers the vulnerabilities as soon as they wrote insecure code, Brakeman reduced the rate of vulnerabilities found by 60%.
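To make the idea of a security check inside the build more concrete, here is a minimal Python sketch of a build gate. It is not Twitter’s pipeline and it does not reproduce Brakeman’s actual report format: the Finding fields, the severity levels and the baseline mechanism are assumptions made for illustration. The gate fails the build only when the scanner reports a finding that is both new (not in an accepted baseline) and at or above a chosen severity, so developers get feedback on the code they just wrote.

```python
from dataclasses import dataclass

# Hypothetical severity scale; a real scanner defines its own levels.
SEVERITY_RANK = {"low": 0, "medium": 1, "high": 2}


@dataclass(frozen=True)
class Finding:
    """One warning from a static analysis tool (illustrative fields)."""
    rule: str
    path: str
    line: int
    severity: str  # "low", "medium" or "high"


def new_blocking_findings(findings, baseline, threshold="medium"):
    """Return findings that are not in the baseline and meet the threshold."""
    minimum = SEVERITY_RANK[threshold]
    known = set(baseline)
    return [
        f for f in findings
        if f not in known and SEVERITY_RANK[f.severity] >= minimum
    ]


def build_passes(findings, baseline, threshold="medium"):
    """The build passes only if no new blocking finding was introduced."""
    return not new_blocking_findings(findings, baseline, threshold)


# Example: one known issue in the baseline, one new high-severity issue.
old = Finding("SQL Injection", "app/models/user.rb", 42, "high")
new = Finding("Cross-Site Scripting", "app/views/posts/show.html.erb", 7, "high")
minor = Finding("Weak Hash", "lib/util.rb", 3, "low")

for f in new_blocking_findings([old, new, minor], baseline=[old]):
    print(f"{f.path}:{f.line} {f.rule} ({f.severity})")
```

Comparing against a baseline is one way to keep the feedback loop short: existing issues can be worked off separately, while anything newly introduced is reported to the developer who introduced it.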

The Twitter case shows the value of learning through experimentation and failure. If not for the two critical security breaches, the organization might not have gone through a transformation that was vital for its information security. In addition, integrating security into the daily work and tools of Dev and Ops teaches developers to write more secure code and to mitigate vulnerabilities effectively in the future.

Final Thoughts:

The above case studies of leading organisations such as Target, CSG International, Dixons Retail, Google and Twitter do not merely record their success stories; they also show DevOps’ capability to scale in and out according to one’s business objectives and differing technologies. Further, the three ways proposed by Gene Kim and his co-authors help to distil a hard-to-define practice into a few inclusive points. Besides these case studies, there are numerous success stories in the history of DevOps adoption and implementation, which secure its position in the current software-centric environment.