Minimum Viable Configuration (MVC)

In my PuppetConf talk, I discussed a concept I call “Minimum Viable Configuration”, or MVC. This concept is similar to that of the Minimum Viable Product (MVP), in which you develop and deploy just the core features required to determine whether there’s a market fit for your anticipated customer base. The MVC, however, is targeted at your developers, and is the minimum amount of customization required for developers to be productive with the languages and tools your organization uses. This can include everything from having preferred IDEs available to language plugins, build tools, and more.

A Minimum Viable Configuration may not appear necessary to many, especially those who have been customizing their own environments for years or decades. The MVC is really targeted at your team, or at the organization as a whole. You may have a great customized IDE setup for writing Puppet or PowerShell code, but others on your team may just be starting. The MVC allows the organization to share that accumulated wealth, making full use of the tens or hundreds of years of combined experience on the team. A novice developer can sit down and be productive with any language or tool covered by the MVC by standing on the shoulders of their teammates.

The MVC truly is the minimum customization required to get started – for instance, a .vimrc file that sets the tabstop to 2 characters and provides enhanced color coding and syntax checking for various languages – but it still allows users to add their own customizations. If you enforce the minimum but don’t limit further customization, new hires can not only check their email on day one, they can actually delve through the codebase and start making changes on day one. You can also tie it into any Vagrant images you might maintain.

Your MVC will change over time, of course. Use your configuration management tool, like Puppet, to manage the MVC. When the baseline is updated, all the laptops and shared nodes can be updated quickly to the new standard. You can see an example of a Minimum Viable Configuration for Linux in PuppetInABox’s role::build and the related profiles (build, rcfiles::vim, rcfiles::bash). You can easily develop similar roles and profiles for other languages or operating systems.
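
To make this concrete, here is a minimal sketch of what the vim piece of an MVC role and profile might look like. The class names, file path, and .vimrc contents below are simplified stand-ins for illustration, not the actual PuppetInABox code:

    # Illustrative only: a simplified stand-in for PuppetInABox's
    # role::build and profile::rcfiles::vim.
    class profile::mvc::vim {
      # Seed every new user account with the team's baseline .vimrc.
      # Users remain free to customize their own copy afterwards.
      file { '/etc/skel/.vimrc':
        ensure  => file,
        owner   => 'root',
        group   => 'root',
        mode    => '0644',
        content => "syntax on\nset tabstop=2\nset shiftwidth=2\nset expandtab\n",
      }
    }

    # A role is simply the collection of profiles that make up the baseline.
    class role::build {
      include profile::mvc::vim
    }

Assign the role to developer workstations, shared build nodes, and any Vagrant images you maintain, and every node picks up the same baseline while each user can still layer personal customizations on top.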

I feel the MVC can be a very powerful tool for teams that work with an evolving variety of tools and languages, that hire novices and grow expertise internally, and especially for organizations that are exposing Operations teams to development strategies (i.e., DevOps). What do you think about the MVC? Are you using something similar now, or is there another way to address the issue?

Ravello and SimSpace: Security in the cloud

Ravello and SimSpace’s On-Demand Cyber Ranges

Last year, many of us were introduced to Ravello Systems and their nested virtualization product. Their hypervisor, HVX, and their network and storage overlay technologies allow you to run any VM from your enterprise on a cloud – specifically Amazon AWS and Google Compute Engine. You can sign up for a free trial and migrate your VMs into the cloud instantly.

Many in the #vExpert community have used Ravello to augment or replace their home lab. We’ve also seen some pretty interesting uses of Ravello over the last year – AutoLab in the cloud, Ravello/vCloud Air DR setups, and numerous blueprints (pre-defined multi-node system designs) such as Puppet and OpenStack on AWS.

Yesterday, I had the pleasure of speaking with SimSpace Corporation, a security company focused on cyber assessments, training, and testing. SimSpace has a history of working with and testing next generation cyber-security tools and helping their clients to rapidly build network models, called Cyber Ranges, using these tools at scale. Today, SimSpace and Ravello announced a partnership to expand this functionality and allow users to create their own cyber ranges in the cloud in a product called SimSpace VCN (press release). A VCN is a virtual clone network that is self-contained and isolated from the internet. VCN instances can be spun up and down on demand. This is a pretty awesome use of Ravello that goes a bit beyond what I’ve seen before.

Virtual Clone Networks and Use Cases

Each VCN starts as a blueprint, and multiple instances can be deployed using Ravello’s hypervisor in the target cloud. You can deploy multiple DMZs, layer on additional networking like VLANs and port mirroring, and add just about anything else you want to replicate from your production environment. The network will contain not only the server OS VMs but also a plethora of network and security devices from vendors such as Cisco, Checkpoint, Fortinet, and Palo Alto Networks. Existing policy settings (firewall, threat, etc.) can then be deployed on the appropriate VCN components. Each instance is completely isolated, allowing the user to treat each VCN as if it were production, but without the negative side effects if something goes wrong. SimSpace’s traditional clientele would then run cyber defense simulations in the VCN to identify faults, train new users, and test the behavior of modifications such as replacing a firewall of one type with another or modifying policies. SimSpace’s product has an attack framework with the ability to inject common network attacks and even simulate “zero day” attacks.

I see a number of other use cases for SimSpace’s VCN product. The ability to replace a blueprint node or set of nodes can be used to test how different vendors’ products behave and whether they are suitable for the environment. Even in a virtualized data center, lab testing is often not representative of production behavior, but making the change in production is highly risky and expensive. Testing in a VCN can provide production-like scale that a lab cannot, at greatly reduced cost and risk.

Another potential use case is disaster recovery’s awkward sibling, business continuity (BC). Disaster recovery typically involves an online site where some portion of the system is always hot, at least to receive data replication from the primary environment. Business continuity, on the other hand, tends to involve cold and sometimes non-existent datacenters that are built from scratch to provide a minimum level of service during crisis times. Most BC exercises involve numerous runbooks and often end with some level of failure, as runbooks tend to get out of date quickly. A VCN, however, can be generated rapidly from production documentation and deployed in less than an hour (more details below), without the expense of standby hardware or a business continuity contract.

Finally, auditing for compliance is always tricky. For example, the latest version of the PCI-DSS standard requires penetration testing, which introduces the risk that some tests could cause outages or destroy data. Giving the auditor access to the VCN replica of production allows you and the auditor to map out the likely impact of penetration testing in a controlled manner with zero risk, enumerating the most likely outage scenarios and avoiding surprises. When the real penetration testing occurs in production, the risk can be reduced to an acceptable level for the business.

Product Offerings

SimSpace’s product will be offered in two flavors. A number of pre-defined blueprints exist for users whose production environments closely match them or who do not need a higher level of fidelity. These users can be up and running with their first VCN in about an hour, including signup time.

Customers who desire a higher level of fidelity, or whose environments do not match the pre-defined blueprints, can engage SimSpace about a customized VCN blueprint. SimSpace is developing a number of tools, the most promising of which works with Visio-like network diagrams that can be exported as a blueprint. The tool aims to be as simple as adding some metadata (IP, hostname, OS, etc.) to an existing diagram, which should result in rapid turnarounds. If the VCN’s blueprint is updated, only the changes need to be deployed to the instance, so deployment times remain low.

How It Works

SimSpace has shared some under-the-covers details with me. Each VM has at least two vNICs, one connected to a management network. All the management traffic is segregated from the production network to ensure management has no effect on the security testing results. Puppet is used to manage much of the node configuration, including networking and any user-provided software deployments. Just upload your software to the provided repository and assign the correct version to each node, and Puppet does the rest. (I mention this for no particular reason, of course!) Spinning up a VCN instance with ~40 nodes takes less than 10 minutes for Ravello to deploy and 10 minutes for SimSpace to populate and configure, or about 20 minutes for an average configuration. The minimum network size is about 20 nodes and the current maximum is around 80 nodes. Their developers are pushing that to 150 nodes in tests now and will continue to increase that number.
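
SimSpace has not published its manifests, so the following is only a sketch of the general pattern that “assign the correct version to each node” suggests: a Puppet profile that pins a package from the uploaded repository to a version supplied per node (via Hiera, for example). The class, package, and version names are hypothetical:

    # Hypothetical illustration of the pattern, not SimSpace's actual code.
    class profile::vcn_app (
      # Set per node, e.g. through Hiera, to select the desired build.
      String $app_version = '1.2.3',
    ) {
      # Install the uploaded software at exactly the requested version.
      package { 'customer-app':
        ensure => $app_version,
      }

      # Restart the service whenever the package version changes.
      service { 'customer-app':
        ensure    => running,
        enable    => true,
        subscribe => Package['customer-app'],
      }
    }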

In addition to replicating your production environment, SimSpace has an “internet bubble” component that can be added to any blueprint to provide a fake internet. A few VMs with thousands of IPs replicate some level of core routing, root DNS, and fake versions of Facebook, Google, and other popular websites, to help simulate the isolated VCN communicating with the greater internet. I imagine this is helpful if you want to test some watering hole exploits or DNS amplification attacks.

Pricing for the service has not yet been announced. The target model is a monthly subscription with additional fees for cloud usage and commercial licenses used in the VCN. Commercial licenses for products in each VCN instance will be handled by SimSpace, so there’s no need for users to worry about vendor management with SimSpace VCN. An early access program will be starting in the next week or two, and general availability is expected in the 4th quarter of 2015. If you’re interested in the early access program, you can contact SimSpace directly.

All in all, I am very excited about SimSpace VCN. The amount of functionality it enables and the risk it reduces should have value to many individuals and businesses, and the reduction in cost of test environments is nearly limitless. Technologically, it’s also a really novel and powerful use of Ravello’s nested virtualization technology. I cannot wait to see SimSpace VCN in action and see its promise realized.

Why I Blog

I’ve wanted to write about why I blog for a while, and I was recently encouraged to stop procrastinating by Mattias Geniar.

Much is said, and frequently, about why you should blog. As I find most such articles to be impersonal, I thought I might share the reasons and rewards that have driven me to blog and keep me going at it. So, why do I blog?

  • To express myself. Sometimes this means artistically – being creative and showing it off – but other times it simply means organizing my thoughts and presenting them to other human beings. This forces me to clarify my thoughts, construct an actual hypothesis, and begin to test it. The end result is a refined idea that can actually be consumed by myself and others. This is especially helpful if I will be presenting the idea to my boss or coworkers, even when that is done in a different format or medium.
  • To improve at writing. Communication is vital in any relationship, personal or business, and the written word can be tricky to wield effectively. I write emails every day, but I had not written a long-form article since college (15+ years ago, at the time!), and never on deeply technical subjects. I like to think this has been paying off for me, even with non-written communication, as I’ve become more methodical and self-aware of how I communicate in all forms.
  • For community. I consume a lot from a number of different communities – security, virtualization, automation, etc. – and I feel that a good citizen contributes back when possible. Maybe I only help one other person, but I hope that I enable or inspire that person to do something awesome – like get home an hour earlier to spend more time with their family that evening.
  • As a portfolio of work. We all need to keep a portfolio, resume, C.V., etc. A blog is part of that – even if I don’t view it as a portfolio, others may, so it’s in my best interest to treat it as such. I keep this in mind before hitting publish – is this something that I want other people to see? Is it of high enough quality? Does it say something worthwhile? Does it send a positive message? Will someone else want to read this, and would they be satisfied if they did? Set your bar high and make sure you’re hitting it every time you publish something.
  • For recognition. This isn’t a very altruistic reason, but it has contributed to my efforts. A desire to write well enough to have a popular blog used by people every day isn’t a bad thing to aim for, is it? Page views also give feedback on who your audience actually is, not who you think they are, and help you see how they react to various article types and formats. Stats drive my morale and motivation. I like seeing that my page views went up 10% for a week; it makes me more eager to blog again. If page views go down for a few weeks, I want to know why and do better. Use it as a healthy feedback loop for your writing.

The last two reasons may seem a bit selfish, but I think that blogging as an independent is in many ways inherently self-serving. Improving my writing probably benefits me even more than building a portfolio or gaining recognition. Regardless, we all have egos and by acknowledging how they drive us, we can harness our drive rather than be controlled by it.

However, the most rewarding reason I blog, by far, is:

  • For my future self. I’ve referenced my own blog numerous times and even had it come up as a Google result when I forgot that I had already solved a problem. Writing, reading, and applying my own article is a great feedback loop. Do something, write about it, do it again based on the article, rewrite the article, repeat until accurate. All the assumed knowledge is discovered and added to the article, bit by bit, so that anyone can follow the process. This is a practice you can apply to general documentation, as well. I also follow my own blog articles to replicate the results of my lab work in my work environment (e.g. everything Puppet-related). This is critical to me, as I can prove to myself that I really have gained an understanding of the subject matter.

If you’re looking at blogging anytime soon, think about what it is you intend to get out of it. It can be extremely rewarding, but only if you go into it with some awareness. Have fun!

Visible Ops Phase Four: Enable Continual Improvement

The final phase of Visible Ops is Enable Continual Improvement. To really succeed with our efforts, we need to make sure that the resources we have are allocated optimally toward our business goals. With most of our fires put out and significant effort invested in avoiding future ones, we have the time available to do this right. To determine where our resources should be allocated, we need to look at metrics.

Visible Ops Phase Three: Create A Repeatable Build Library

Phase three of Visible Ops is Create a Repeatable Build Library. This phase’s focus is to define build mechanisms, create system images, and establish documentation that together describe how to build our desired infrastructure from “bare metal” (I’ll continue to use “bare metal” throughout for consistency, but “bare VM” may be more appropriate in today’s virtualized IT). This allows us to treat our infrastructure like fuses. When a fuse pops, it is discarded instead of repaired and a new fuse is inserted in its place; likewise, when a system fails, it is removed from service and a new system is provisioned in its place. All high-performing IT organizations, not just the unicorns of IT, use this technique. This chapter focuses on how to achieve that goal.

Visible Ops Phase Two: Catch And Release and Find Fragile Artifacts

In the second phase of Visible Ops implementation, our goal is to Catch & Release and Find Fragile Artifacts. This phase focuses on creating and maintaining an accurate inventory of assets and highlighting those that generate the most unplanned work. The various configurations in use are also collected in order to start reducing the unique configuration counts, the focus of Phase Three. This is the shortest chapter in the book at 6 pages, though it may take significant time to complete the work it describes.

The Issues and Indicators section lays out the issues being tackled, including moving from “individual knowledge” to “tribal knowledge” (a shared knowledgebase the entire organization can access, rather than knowledge locked away in people’s heads) and preventing the “special snowflake” syndrome, where every node in the network, even those in clusters or farms, is similar but still unique.

Visible Ops Phase One: Stabilize The Patient and Modify First Response

The first phase of implementing Visible Ops is Stabilize The Patient and Modify First Response. This first order of business aims to reduce our unplanned work significantly, to 25% or less, allowing our organization to focus on more than just firefighting. When the environment is stabilized, efforts can instead be spent on proactive work that stops fires before they start.

The chapter, like all chapters in Visible Ops, begins with an Issues and Indicators section, describing the issues with both a formal statement and a narrative example. The issues are familiar to anyone in IT, from how self-inflicted problems create unplanned work to the inconvenient, but predictable, timing of system failures during times of urgency. The narrative example provided helps to relate the formal statement to the experiences we all share.

Book Review: The Visible Ops Handbook

A few months ago, I purchased The Visible Ops Handbook: Implementing ITIL in 4 Practical and Auditable Steps, by Kevin Behr, Gene Kim, and George Spafford and published by the IT Process Institute (ITPI). Those names may sound familiar, as these are the same authors of The Phoenix Project. Where The Phoenix Project is a novel that teaches us general concepts through storytelling, The Visible Ops Handbook is a set of instructions to help us implement the changes that are now so vital to the DevOps movement – it’s all about executing! The Visible Ops methodology it teaches is one of many precursors to DevOps, and it’s a wonderful foundation that I feel should be explored by all.

Visible Ops has a nice structure. There are five sections describing the four steps of Visible Ops (70 pages), followed by a few appendices (30 pages). The content is a result of studies done by the ITPI (contemporary equivalents of today’s State of DevOps reports) and the implementation of Visible Ops, before it went to print, by IP Services. In those 100 pages, the lessons learned from the studies are codified using ITIL terminology. The writing is very accessible, but also very dense, and is worth referencing repeatedly as you work through the included steps. The target audience is decidedly technical, but everyone in an IT organization can benefit from the material.

Opinionated Test Driven Development

There are many technical reasons to choose (or not to choose!) Test Driven Development. In my opinion, one of the most important benefits isn’t technical at all – it’s that it forces you to have an opinion! When you decide what you are going to test, you are forming an opinion on what you want your code to do. It might be processing an input, transforming an input, or providing output based on current state. You have an opinion on the end state but leave the implementation to be determined.

I liken this to having an outline for writing. If you start writing without an outline, the end result never seems to match what you thought you’d end up with. If you follow an outline, however sparse, the result matches what you thought you’d end up with – if not, you didn’t follow the outline. Sure, it requires some revisions here and there to make it all tidy, but at least your first draft is legible.

When considering whether you should employ Test Driven Development, keep this in mind as a non-technical benefit!

Distributed systems and high availability. Also, 787s.

We often talk about distributed systems at scale. A system with thousands of nodes in dozens of cities on three or more continents is clearly distributed. Sometimes we even discuss some of the difficult problems, like whether there is such a thing as ‘now’. What often gets lost is how smaller scale systems are distributed as well, especially systems that we describe as Highly Available. Even though these systems are often very close in proximity – sometimes as close as two line cards in the same chassis – we still must apply distributed systems principles to them.

This week was a good reminder of that. Perhaps you’ve heard of an issue with Boeing’s 787 and all four generator control units (GCUs) encountering an integer overflow after 248 days of continuous power. All four. At once. Well, maybe… Will all four GCUs really shut off at the same time? They will, on one condition: that all four GCUs were powered on at the same time. It is clearly stated as a condition in the FAA directive:

If the four main GCUs (associated with the engine mounted generators) were powered up at the same time, after 248 days of continuous power, all four GCUs will go into failsafe mode at the same time, resulting in a loss of all AC electrical power regardless of flight phase.

There’s a very simple solution here: don’t power all four GCUs up at the same time. Let’s pretend that GCUs take 10 minutes to go from off to available on average. The GCU power-up sequence should introduce a delay of 10 minutes * X between units, where X is some factor representing the desired safety margin for the longest possible time to availability – say 1.5, giving a 15-minute stagger. We’d have a timeline like this:

  • 0:00 – GCU 1 powered on
  • 0:15 – GCU 2 powered on
  • 0:30 – GCU 3 powered on
  • 0:45 – GCU 4 powered on
  • 1:00 – All GCUs available

By following this schedule, if the 787 is not power-cycled before 248d 14h rolls around, the plane won’t drop out of the sky. Three GCUs will be available at all times.

I’m not going to say that these kinds of bugs are excusable. They simply aren’t – we’ve known for decades that 32-bit counters roll over relatively quickly, and we’ve had software and hardware capable of handling 64-bit or larger counters for quite some time. However, we have dealt with distributed systems that run in high-availability modes for decades. Whatever your highly available, distributed systems are – servers, firewalls, routers, databases, etc. – you should use staggered startup times to avoid this well-known set of problems. This leaves you more time to focus on the lesser-known problems.
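
Configuration management makes this easy to enforce on the server side. As a sketch (in Puppet, to stay consistent with the rest of this blog), a per-node offset from fqdn_rand() can stagger a recurring restart across cluster members, which both resets long-running counters and guarantees the members never bounce at the same moment. The service name here is hypothetical:

    # A sketch of staggering scheduled restarts across HA cluster members.
    # fqdn_rand() returns a deterministic per-node value, so each node gets
    # a different minute that stays stable between Puppet runs.
    class profile::staggered_restart {
      cron { 'weekly-appliance-restart':
        # Hypothetical service; substitute whatever component needs cycling.
        command => '/usr/sbin/service example-appliance restart',
        weekday => 0,
        hour    => 3,
        minute  => fqdn_rand(60, 'weekly-appliance-restart'),
      }
    }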