Questioning Assumptions with Intelligence

“Question everything!” You’ve heard this a million times. You probably try to do it sometimes, too. The underlying tenets of The Goal, the Theory of Constraints, Lean, and other methodologies rely on questioning assumptions. It’s important, but what exactly does it mean, and what do you do afterward?

First off, it’s not a license to literally ask questions about every business decision at every opportunity. Many questions can be answered in your own head before you open your mouth, so there’s no need to bother others with those questions. For the rest, go back to the theory of constraints and ask yourself if it’s a bottleneck first. If not, the answer might not matter. Above all, always be courteous and understanding of the situation before speaking. If you do literally question everything, you will be treated like an a-hole of the first degree and your message will be lost. There’s a time and place for everything. Continuing on…

In the right context, “Hey, wait a minute, why exactly are we doing that?” is a good question. Sometimes there is a good answer, but other times the answer is simply, “because.” That’s not a good answer. For example, someone who lives in SoCal suggested I salt my car’s tires in the winter. Though I have lived in the north, I had never heard of doing that. I asked where they learned to do that. Many years ago, the person went to college in Pittsburgh and saw buckets of salt near parking areas. They saw someone else pour salt around their car’s tires, so they assumed that is what it was intended for. Turns out it was for the sidewalks.

You might find what that person did humorous, but before you snicker, look around your business – are you sure you’re not doing something simply because your colleague or predecessor did it? A long time ago, I found out I had been swapping backup tapes every morning on a system that had been decommissioned but not powered off. Whoops! This is cargo cult behavior, and we all participate in it at some point in our lives. Businesses do it A LOT. The important thing is that we come to understand what we are doing and correct the behavior.

When you do find some broken assumption, you must be smart in how you address it. Again, make sure it’s a constraint. A little salt around the tires won’t really hurt anything, but putting salt in the gas tank certainly would. Focus the efforts on the constraints. Figure out what is wrong with the assumption and how to make it right. When you find these broken assumptions, there’s no need to blame or ridicule someone. You fixed a problem, everyone should be happy! Once you make some correction, take a look at the other assumptions in your system and see if they were affected. Decisions in the fundamental parts of the system tend to have cascading effects further down the line.

This is an iterative process. If you question an assumption this year and there’s a good reason for it, you will eventually want to revisit it, maybe next year or in 5 years. Change is perpetual and you should embrace it, not flee from it.

Fortigate user permissions peculiarities

While working with a customer on their Fortigate firewalls, I was introduced to a peculiarity of how FortiOS interprets a user’s diag commands. I suspect this affects multiple versions, but I don’t have the ability to test this.

  • FortiOS: 4.2.x
  • User: wild-card (TACACS)
  • Profile: super_admin_readonly

TACACS users whose permissions elevate them to the super_admin profile are unaffected. They can run diag commands unrestricted as they have full access.

TACACS users whose permissions remain at super_admin_readonly were finding that they could not run diag commands that accessed an interface, such as diag sniff packet any "icmp". Upon further investigation, the issue was related to the IP the user connected to and the interface (“any” in the example) used in the command. For a readonly user, the any interface is off-limits; only the interfaces configured for the VDOM that the user connected to are available.

In other words, if a firewall had two VDOMs, Common and DMZ, and the user connected to an IP on any interface in the Common VDOM, only that VDOM’s interfaces would be usable. For instance, diag sniff packet common-outside "icmp" would work, as would the same command against common-inside. Interfaces connected to other VDOMs are off-limits, so diag sniff packet dmz-outside "icmp" would fail. Once we gave the end user a list of the IP addresses and interface names, and the VDOM they belonged to, the user was able to perform all required diagnostic commands.
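
The rule can be written down as a small lookup, which may help when preparing that list for users. The sketch below is plain shell meant for a workstation, not FortiOS syntax; the function names vdom_of and can_sniff are hypothetical, and the interface-to-VDOM table simply reuses the example names above.

```shell
# Hypothetical table: which VDOM owns each interface
# (names taken from the example above, not from a real firewall).
vdom_of() {
  case "$1" in
    common-outside|common-inside) echo "Common" ;;
    dmz-outside)                  echo "DMZ" ;;
    *)                            echo "unknown" ;;
  esac
}

# A readonly user connected to VDOM $1 can sniff interface $2 only when
# that interface belongs to the same VDOM; "any" is never allowed.
can_sniff() {
  [ "$2" != "any" ] && [ "$(vdom_of "$2")" = "$1" ]
}

# Usage:
#   can_sniff Common common-inside && echo "allowed"
#   can_sniff Common dmz-outside   || echo "refused"
```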

I hope this is fixed in more recent versions, but at least there’s a workaround that makes some logical sense.

Thinking TechX

A word we hear too much of these days is ‘disrupt’. When it’s not overused, it means that you’re trying to change the way you do things in some dramatic fashion. Instead of doing things by hand, you use some tool to automate some or all of it. Or you switch from Linux everywhere to Windows everywhere, or vice versa. Whatever the change is, the point is that you’re changing how you do things.

Something that frequently appears to be forgotten during disruption is to change how you think about doing things. When you were doing things on Windows, you probably did a lot of mouse clicking and typing. Now you’ve moved to Linux. Was the change really about the OS? Probably not. The change was about not having to click the mouse and type. So stop it! Start “thinking Linux”, or whatever technology you’re using.

This has two advantages. First, it becomes really disruptive, because it was the thought process holding you back the whole time. If you only change the technology, you’ve just hidden the problem for a while. That buys you a bit of runway but no real solution. Applying an entirely new thought process will help you get out of the rut of “the way we’ve always done it.”

Second, if you are using idiomatic patterns of the chosen technology – such as using camelCase in PowerShell but snake_case in Ruby – you’re going to find it much easier to attract and retain coworkers who already think that way. If your Ruby code looks like PowerShell, most Ruby devs will just run away. Even if your team has low turnover, it will make everyone on the team better able to receive new team members and allow the team to better contribute back to the community, especially via open source projects.

Take the time to approach your problems in a new manner from top to bottom and you’ll reap the benefits.

The Goal: Throughput and Efficiency

One of the most important concepts of The Goal is to increase throughput. Throughput is the rate at which the system generates money through sales. That is, when your company takes raw materials, processes them into a finished good, and sells it, the measured rate of that activity is your throughput. Heavy emphasis on sales. Throughput is not the same as efficiency. Today, we will look at throughput vs. efficiency and how these concepts apply to IT.

Though we are focusing on throughput, we must state the descriptions of the two other measurements. Inventory is all the money that the system has invested in purchasing things which it intends to sell. Operational expense is all the money the system spends in order to turn inventory into throughput. I list the three definitions together because the definitions are precise and interconnected. Changing even a single word in one requires the other two be adjusted as well.

Another important concept in throughput is that it measures the entire system, not a locality. Whether you work in your garage or in a giant auto plant, you cannot measure throughput locally; it must be measured over the entire system. This conflicts with most companies’ measurements of local efficiency. Employers naturally want to keep all their employees busy and employees like to see their coworkers pull their own weight. Why should Jane get to twiddle her thumbs at the Fob machine when Jill is busy pushing pallets of Fob parts around the floor? Is it fair to George to watch Jeff read the newspaper while he has to investigate hundreds of parts for quality control? And shouldn’t Jane and Jeff be worried that they might be reprimanded or fired for not being efficient, or draw the ire of their coworkers?

Sometimes We Break Things

Today’s a no-deploy Friday for me, like it is for many. However, also like many others, here I am deploying things. Small, minor things, but it would ruin my weekend if they broke anyway. Sometimes the worst does happen and we break things. Don’t worry, we’re professionals!

So, what happens if you do break something? First, don’t panic. Everyone’s broken something before, and that includes everyone above you in the food chain. The second step is to notify those above you according to your internal processes. In most cases, that means stopping what you are doing and giving your boss a paragraph summary of the issue, what it affects, and what you’re doing about it, then getting back to work. Third, don’t panic! I know I already said that, but since you’ve now gone and told your boss, they may have induced some panic – let it pass. The only way you’ll recover is if you don’t panic. Breathe.

Fourth, fix it! Use your mind to decide what was supposed to happen, what you did, and where things went wrong. Identify the steps required to either back things out or repair the situation so you can proceed. Document the steps and follow them. If you have a maintenance window you are operating under, put some time estimates down and set an alarm for when you need to make the go/no-go call. Though the situation is urgent, taking a few moments now to prepare will make you more efficient as you proceed. Give your management chain short updates throughout the event until it is cleared, and don’t let rising panic get to you.

Don’t Disable SELinux, Part 2

Yesterday I warned everyone not to disable SELinux because the fix is almost always a quick one. But what do you do if there is no selboolean that fixes your problem with a simple one-liner?

After yesterday’s article, Tim Meusel shared a message he receives in his audit log when running nginx on his puppet master with SELinux in enforcing mode:

type=AVC msg=audit(1415871389.171:787): avc:  denied  { name_connect }
 for  pid=2228 comm="nginx" dest=8080
 scontext=system_u:system_r:httpd_t:s0
 tcontext=system_u:object_r:http_cache_port_t:s0 tclass=tcp_socket
type=SYSCALL msg=audit(1415871389.171:787): arch=c000003e syscall=42
 success=no exit=-13 a0=19 a1=259e2b0 a2=10 a3=7fffdac559d0 items=0
 ppid=2227 pid=2228 auid=4294967295 uid=996 gid=995 euid=996 suid=996
 fsuid=996 egid=995 sgid=995 fsgid=995 tty=(none) ses=4294967295
 comm="nginx" exe="/usr/sbin/nginx" subj=system_u:system_r:httpd_t:s0
 key=(null)

That’s... that’s ugly. The important parts are denied, { name_connect }, comm="nginx", the scontext and tcontext values, and tclass=tcp_socket. Nginx cannot talk to the tcp_socket at /var/run/puppet/puppetmaster_unicorn.sock. There doesn’t appear to be a selboolean that matches the issue. You could try flipping semi-relevant booleans for hours till you stumble upon some combination that may work, undoubtedly with side effects, and possibly never find the right combination. That could end up being a LOT of time wasted without any guarantee of success.

Instead, use audit2allow. By providing the tool with portions of an audit log, it will build an SELinux policy that will allow everything marked as “denied”. Here’s an example of generating a policy for review, then generating and applying that policy:

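# Generate a policy from the nginx denials and save it for review
grep nginx /var/log/audit/audit.log | audit2allow > nginx.te
more nginx.te
# Build a loadable module (nginx.pp) from the same denials
grep nginx /var/log/audit/audit.log | audit2allow -M nginx
# Install the module (as root)
semodule -i nginx.pp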

You can find more detail on the tool on the web, particularly this article where another nginx user is struggling with SELinux. You may have to repeat this process a few times – nginx stopped running when it failed to attach to the socket, so there could be other SELinux permission issues it would encounter if it had not failed. You won’t see those in the audit.log until it gets past the socket. Keep at it until audit2allow is building the same policy file on consecutive runs, at which point there are no new failures to discover. Your application should be fully working now and encounter no more SELinux permission issues.
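
The “same policy file on consecutive runs” check is easy to script. This is only a sketch under assumptions: policy_unchanged is a hypothetical helper name, and the file names in the usage comment are placeholders for wherever you keep the previous and current audit2allow output.

```shell
# Hypothetical helper: succeed when two generated policy files are
# byte-identical, i.e. the latest run turned up no new denials.
policy_unchanged() {
  cmp -s "$1" "$2"
}

# Usage on an SELinux host (placeholder file names):
#   grep nginx /var/log/audit/audit.log | audit2allow > nginx.te.new
#   if policy_unchanged nginx.te nginx.te.new; then
#     echo "no new denials"
#   else
#     mv nginx.te.new nginx.te   # review, rebuild, install, test again
#   fi
```

cmp -s compares the files silently and reports the result only through its exit status, which is all the loop needs.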

Update: Tim continued to struggle after he performed the above steps until he moved the unicorn socket out of /var/run (which is admittedly not the recommended location!) even though he wasn’t seeing any more failures in the audit log. This command forces SELinux to log all failure events and then the new failures showed up and were processed by audit2allow:

semodule --disable_dontaudit --build

See Tim’s blog for more info.

You can apply the policy via puppet using the selmodule type, plus a file resource to put the .pp file in the correct location.

While this takes a lot longer to resolve than touching some selbooleans, you should only have to do it once. This ensures you still have the protections of SELinux and a well-defined policy state for your application. Only if this doesn’t resolve your issue should you even entertain the thought of disabling SELinux, and then only as a temporary resolution until a permanent solution is found.

Don’t Disable SELinux

When developing new web-based solutions on modern Linux distros, inevitably you’ll run into a fun issue – like your webserver throwing database errors when there’s not even any traffic making it out of the server toward the database – and bang your head against the desk a bit. If you google for the error, you’ll run into the worst advice possible: “If your problem is not solved then disable your SELinux.” That’s right, just disable the whole thing because one part bothers you. The only positive part of this advice is that you may not have even thought to look at SELinux before that.

You can verify that SELinux is the issue by taking a look at the audit log (tail -f /var/log/audit/audit.log) and using your web application. You’ll see a ton of crap that is simply undecipherable to human beings. What you’re looking for is the word denied and the application, file, or user that is having an issue. Here’s a deny for the application httpd when trying to talk to that remote database:

type=AVC msg=audit(1415813628.801:628): avc:  denied  { name_connect } for  pid=11911 comm="httpd"
 dest=3306 scontext=unconfined_u:system_r:httpd_t:s0 tcontext=system_u:object_r:mysqld_port_t:s0
 tclass=tcp_socket
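
To cut through the noise, a small filter can reduce each denial to the fields that matter. The following is a minimal sketch, not a standard tool: the function name summarize_denials is made up for illustration, and it assumes each AVC record sits on a single line, as it does in the raw audit.log (the example above is wrapped for display).

```shell
# Hypothetical helper: read audit log lines on stdin and print
# "<command> <target type>" once for each distinct AVC denial.
summarize_denials() {
  grep 'avc:.*denied' |
    sed -n 's/.*comm="\([^"]*\)".*tcontext=[^:]*:[^:]*:\([^:]*\):.*/\1 \2/p' |
    sort -u
}

# Usage on a live system (needs read access to the audit log):
#   summarize_denials < /var/log/audit/audit.log
```

Fed the denial above, this prints httpd mysqld_port_t, which is usually enough of a hint to go looking for a boolean that mentions both httpd and the database.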

The next step is to narrow the issue down. There are a large number of settings for SELinux, known as SELinux Booleans, that may be affecting your application. Take a quick gander at them, find the most likely boolean, set the value to on, and try your application again. If it doesn’t work, set it to off and try another. Here’s a Tips and Trick page that describes the process in more detail and provides a pretty thorough list of booleans. Can’t access files on an NFS share via httpd? Set httpd_use_nfs to true. Talking to a remote database as above? That’s httpd_can_network_connect_db. This is just as simple and more beneficial than disabling SELinux altogether.
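
On the command line, the trial-and-error loop looks roughly like the sketch below. getsebool and setsebool are the standard SELinux utilities; the wrapper name try_boolean is only an illustration, and the commands need root on an SELinux-enabled host.

```shell
# Hypothetical wrapper: turn one boolean on for the running system
# (no -P, so the change does not survive a reboot) and show its new value.
try_boolean() {
  setsebool "$1" on && getsebool "$1"
}

# Example session, as root:
#   getsebool -a | grep httpd                     # list candidate booleans
#   try_boolean httpd_can_network_connect_db      # flip one and retest the app
#   setsebool httpd_can_network_connect_db off    # revert if it did not help
```

Leaving off -P during the experiment keeps a wrong guess from becoming permanent; the persistent setting belongs in your configuration management.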

Of course, I’d be remiss if I just told you to use setsebool as root. You need to include this setting in your application definition. For example, integrate the setting into your puppet manifests with the selboolean type. Set the value to on and persistent to true. Apply your manifest and getsebool will show the new value. Here’s an example of a manifest I built for the phpMyAdmin application, specifically lines 25-28 where the selbooleans are set. If you’re using a different configuration management tool, you’ll have to do this part yourself; the important part is that you capture the setting.

Take a few minutes to learn how to use SELinux, so you’re aware of when you’re barking up the wrong tree and how to resolve issues, and integrate your findings into your application’s state definition. You’ll benefit by leaving the protection in place.

A Call to Comments

In Greg Ferro’s call to arms for the 30 blog posts in 30 days challenge, Greg was encouraging us to use blogging as social media, rather than Twitter, Facebook, etc. His challenge includes this statement:

Make sure you leave comments on other peoples blogs so they know someone read it. Just like you would on Twitter, Facebook , leave a comment saying “Like” or “Favourite”.

I’m not sure what Greg’s reasoning is for this specifically, but I think it’s a great one. Too often we see a blog post announced on Twitter, followed by some great comments that add to the value of the post. Someone who finds the post via RSS or a search engine – or someone who saw the original tweet before the valuable tweets came in – doesn’t see any of those comments. If the comments really add to the article, then something valuable was lost. It doesn’t take very long to add a comment to most blogs, so please, take a few minutes to drop a comment on blog posts when you have one. Even the “Great post!” comments are good to have around; they let the author know the content was valuable to at least one person.

I’m going to make this effort myself, since I’m very guilty of leaving my comments on Twitter instead. Please, take the time to make permanent comments to blog posts for those who follow us.

Vacation Scheduling is Important

All of us get some amount of vacation time, and many of us never use it, especially in America. There are a variety of reasons given for not taking it. Regardless of the reason, at the end of the year, it goes unused and both individuals and businesses suffer for it.

Today’s DailyWTF is a perfect reminder of why it’s important to schedule vacation. When someone leaves the company, on their terms or otherwise, their coworkers and management are often very shocked at what they were doing on a daily basis to keep things running. It might be as simple as pushing a button once a day, or it could be hand-massaging data between two systems in a complex manner that isn’t a documented process. Businesses do like to say that everyone is replaceable, and technically they are, but the amount of pain the business suffers until that person is replaced can be extraordinarily high.

That’s why The Practice of System and Network Administration* suggests that everyone be forced to take at least one serious, one-week (contiguous!) vacation per year (pg 810). This may include removing that person’s access to remote email and VPN, to ensure they’re really not doing anything in that time unless they’re called for assistance. This will illuminate what needs to be turned into a documented process and whether your coworkers’ cross-training has been successful. When everyone on a team takes a vacation, all of the major gaps can be identified on a yearly basis.

Of course, this requires management support. When someone disappears for a week, the button isn’t pushed, and you start a causal time loop, management should support you and your team as you document the gap and prevent it in the future. If your coworkers need more cross-training, management can help you find the time to make it happen. If you’re a manager reading this, ensure that discovering a gap is seen as an improvement rather than punishment.

Keep these lessons in mind as we approach the end of the year. If you and your team haven’t scheduled vacation time through Jan 1, set up a meeting this week and have everyone lay out their plans. You don’t want to find out on Dec 15 that no one will be around between Christmas and New Year’s. By discovering this early, your team can adjust plans so that everyone is happy with minimal impact on travel plans and family visits.

* The authors of The Practice recently released the second edition of The Practice of Cloud System Administration, which may be more appealing to the modern System Administrator, but I haven’t had time to read it yet.

Documented Processes

The other day, I pointed out on Twitter the redundancy of “documented processes.” If it’s not documented, it’s not a process, it’s just a thing you sometimes do! By documenting it, you turn a “checklist inside your head” into a repeatable process that anyone can follow.

In spite of making the jest, or perhaps because I made the jest, it was shortly made apparent to me at work how poorly I’m doing at this. We have build guides for all our nodes (even if it is just a hostname/ip and “run puppet”), but there’s no “generic” build guide, or a document on how to create a build guide. We also have lots of technology standards (use Windows 2008 R2/CentOS 6.5, if you deploy Windows 2003 you will be given 40 lashes with a wet noodle, etc.) and those weren’t documented well, either. With a small team, this hasn’t been a huge obstacle, but if we are experiencing issues at a small size, I can’t imagine how poorly it could go at scale without writing these things down.

Take the time to invest in documenting your processes. Next time you do something – even something you do all the time – check and see if there’s a document that accurately represents the process. If not, update what you have or create a new one. The time you spend on it now will pay for itself as you continue to approach a state of following well-documented, repeatable, accurate processes and eliminating the errors of performing ad hoc tasks.