Planet SysAdmin

December 19, 2014

Chris Siebenmann

Our likely long road to working 10G-T on OmniOS

I wrote earlier about our problems with Intel 10G-T on our OmniOS fileservers and how we've had to fall back to 1G networking. Obviously we'd like to change that and go back to 10G-T. The obvious option was another sort of 10G-T chipset besides Intel's. Unfortunately, as far as we can see Intel's chipsets are the best supported option and eg Broadcom seems even less likely to work well (or at all, and we later had problems with even a Broadcom 1G chipset under OmniOS). So we've scratched that idea; at this point it's Intel or bust.

We really want to reproduce our issues outside of production. While we've set up a test environment and put load on it, we've so far been unable to make it fall over in any clearly networking related way (OmniOS did lock up once under extreme load, but that might not be related at all). We're going to have to keep trying in the new year; I don't know what we'll do if we can't reproduce things.

(We also aren't currently trying to reproduce the dual port card issue. We may switch to this at some point.)

As I said in the earlier entry, we no longer feel that we can trust the current OmniOS ixgbe driver in production. That means going back to production needs an updated driver. At the moment I don't think anyone in the Illumos community is actively working on this (which I can't blame them for), although I believe there's some interest in doing a driver update at some point.

It's possible that we could find some money to sponsor work on updating the ixgbe driver to the current upstream Intel version, and so get it done that way (assuming that this sort of work can be sponsored for what we can afford, which may be dubious). Unfortunately our constrained budget situation means that I can't argue very persuasively for exploring this until we have some confidence that the current upstream Intel driver would fix our issues. This is hard to get without at least some sort of reproduction of the problem.

(What this says to me is that I should start trying to match up driver versions and read driver changelogs. My guess is that the current Linux driver is basically what we'd get if the OmniOS driver was resynchronized, so I can also look at it for changes in the areas that I already know are problems, such as the 20msec stall while fondling the X540-AT2 ports.)

While I don't want to call it 'ideal', I would settle for a way to reproduce the dual card issue simply with artificial TCP network traffic. We could then change the server from OmniOS to an up to date Linux to see if the current Linux driver avoids the problem under the same load, then use this as evidence that commissioning an OmniOS driver update would get us something worthwhile.

None of this seems likely to be very fast. At this point, getting 10G-T back in six months seems extremely optimistic.

(The pessimistic view of when we might get our new fileserver environment back to 10G-T is obviously 'never'. That has its own long-term consequences that I don't want to think about right now.)

Sidebar: the crazy option

The crazy option is to try to learn enough about building and working on OmniOS so that I can build new ixgbe driver versions myself and so attempt either spot code modifications or my own hack testing on a larger scale driver resynchronization. While there is a part of me that finds this idea both nifty and attractive, my realistic side argues strongly that it would take far too much of my time for too little reward. Becoming a vaguely competent Illumos kernel coder doesn't seem like it's exactly going to be a small job, among other issues.

(But if there is an easy way to build new OmniOS kernel components, it'd be useful to learn at least that much. I've looked into this a bit but not very much.)

by cks at December 19, 2014 06:02 AM

System Administration Advent Calendar

Day 20 - Infosec Basics: Reason behind Madness

Written by: Jan Schaumann (@jschauma)
Edited by: Ben Cotton (@funnelfiasco)

Sysadmins are a stereotypically grumpy bunch. Oh wait, no, that was infosec people. Or was it infosec sysadmins? The two jobs intersect at the corner of cynicism and experience, and while any senior system administrator worth their salt has all the information security basics down, we still find the two camps at loggerheads all too frequently.

Information Security frequently covers not only the general aspects of applying sound principles, but also the often ridiculed area of “compliance”, where rules too frequently seem blindly imposed without a full understanding of the practical implications or even their effectiveness. To overcome this divide, it is necessary for both camps to better understand one another’s daily routine, practices, and the reasons behind them.

Information Security professionals would do well to reach out and sit with the operational staff for extended periods of time, to work with them and get an understanding of how the performance, stability, and security requirements are imposed and met in the so-called real-world.

Similarly, System Administrators need to understand the reasons behind any requirements imposed or suggested by an organization’s Security team(s). In an attempt to bring the two camps a little bit closer, this post will present some of the general information security principles to show that there’s reason behind what may at times seem madness.

The astute reader will be amused to find occasionally conflicting requirements, statements, or recommendations. It is worthwhile to remember Sturgeon’s Law. (No, not his revelation, although that certainly holds true in information security just as well as in software engineering or internet infrastructure.)

Nothing is always absolutely so.

Understanding this law and knowing when to apply it, being able to decide when an exception to the rules is warranted, is what makes a senior engineer. But before we go making exceptions, let’s first begin by understanding the concepts.

Defense in Depth

Security is like an onion: the more layers you peel away, the more it stinks. Within this analogy lies one of the most fundamental concepts applied over and over to protect your systems, your users and their data: the principle of defense in depth. In simple terms, this means that you must secure your assets against any and all threats – both from the inside (of your organization or network) as well as from the outside. One layer is not enough.

Having a firewall that blocks all traffic from the Big Bad Internet except port 443 does not mean that once you’re on the web server, you should be able to connect to any other system in the network. But this goes further: your organization’s employees connect to your network over a password protected wireless network or perhaps a VPN, but being able to get on the internal network should not grant you access to all other systems, nor to view data flying by across the network. Instead, we want to secure our endpoints and data even against adversaries who already are on a trusted network.

As you will see, defense in depth relates to many of the other concepts we discuss here. For now, keep in mind that you should never rely on a separate protection outside of your control.

Your biggest threat comes from the inside

Internal services are often used by large numbers of internal users; sometimes they need to be available to all internal users. Even experienced system administrators may question why it is necessary to secure and authenticate a resource that is supposed to be available to “everybody”. But defense in depth requires us to, as it hints at an uncomfortable belief held by your infosec colleagues: your organization either already has been compromised and you just don’t know it, or it will be compromised in the very near future. Always assume that the attacker is already on the inside.

While this may seem paranoid, experience has shown time and again that the majority of attacks occur or are aided from within the trusted network. This is necessarily so: attackers can seldom gather all the information or gain all the access required to achieve their goals purely from the outside (DDoS attacks may count as the obligatory exception to this rule – see above re Sturgeon’s Law). Instead, they usually follow a general process in which they first gain access to a system within the network and then elevate their privileges from there.

This is one of the reasons why it is important to secure internal resources to the same degree as services accessible from the outside. Traffic on the internal network should be encrypted in transit to prevent an adversary on your network being able to pull it off the wire (or the airwaves, as the case may be); it should require authentication to confirm (and log) the party accessing the data and deny anonymous use.

This can be inconvenient, especially when you have to secure a service that has been available without authentication and around which other tools have been built. Which brings us to the next point…

You can’t just rub some crypto on it

Once the Genie’s out of the bottle, it’s very, very difficult to get it back in. Granting people access or privileges is easy, taking them away is near impossible. That means that securing an existing service after it has been in use is an uphill battle, and one of the reasons why System Administrators and Information Security engineers need to work closely in the design, development and deployment of any new service.

To many junior operations people, “security” and “encryption” are near equivalent, and using “crypto” (perhaps even: ‘military grade cryptography’!) is seen as robitussin for your systems: rub some on it and walk it off. You’re gonna be fine.

But encryption is only one aspect of (information) security, and it can only help mitigate some threats. Given our desire for defense in depth, we are looking to implement end-to-end encryption of data in transit, but that alone is not sufficient. In order to improve our security posture, we also require authentication and authorization of our services’ consumers (both human and software alike).

Authentication != authorization

Authentication and authorization are two core concepts in information security which are confused or equated all too often. The reason for this is that in many areas the two are practically conflated. Consider, for example, the Unix system: by logging into the system, you are authenticating yourself, proving that you are who you claim to be, for example by offering proof of access to a given private ssh key. Once you are logged in, your actions are authorized, most commonly, by standard Unix access controls: the kernel decides whether or not you are allowed to read a file by looking at the bits in an inode’s st_mode, your uid and your group membership.
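As a rough illustration of that split, here is a minimal shell sketch. The scratch file and the mode_string helper are made up for this example. Logging in was the authentication step; everything shown below is the authorization data the kernel consults afterwards.

```shell
#!/bin/sh
# Print the permission string that ls derives from a file's st_mode.
# (mode_string is a helper invented for this sketch.)
mode_string() {
    ls -ld "$1" | cut -c1-10
}

f=$(mktemp)        # scratch file owned by the current user
chmod 640 "$f"     # owner may read and write, group may read, others nothing

mode_string "$f"   # prints -rw-r-----
id -u              # the uid the kernel compares against the file's owner
id -G              # the group ids compared against the file's group
rm -f "$f"
```

Nothing here asks who you are again; the kernel simply matches the uid and groups established at login against the mode bits on each access.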

Many internal web services, however, perform authentication and authorization (often referred to as “authN” and “authZ” respectively) simultaneously: if you are allowed to log in, you are allowed to use the service. In many cases, this makes sense – however, we should be careful about accepting this as a default. Authentication to a service should, generally, not imply access to all resources therein, yet all too often we transpose this model even to our trusty old Unix systems, where being able to log in implies having access to all world-readable files.

Principle of least privilege

Applying the concept of defense in depth to authorization brings us to the principle of least privilege. As noted above, we want to avoid having authentication imply authorization, and so we need to establish more fine grained access controls. In particular, we want to make sure that every user has exactly the privileges and permissions they require, but no more. This concept spans all systems and all access – it applies equally to human users requiring access to, say, your HR database as well as to system accounts running services, trying to access your user data… and everything in between.

Perhaps most importantly (and most directly applicable to system administrators), this precaution to only grant the minimal required access also needs to be considered in the context of super-user privileges, where it demands fine-grained access control lists and/or detailed sudoers(5) rules. Especially in environments where more and more developers, site reliability engineers, or operational staff require the ability to deploy, restart, or troubleshoot complex systems is it important to clearly define who can do what.

Extended filesystem Access Control Lists are a surprisingly underutilized tool: coarse division of privileges by generic groups (“admins”, “all-sudo”, or “wheel”, perhaps) is all too frequently the norm, and sudo(8) privileges are granted almost always in an all-or-nothing approach.
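To make the alternative to all-or-nothing sudo concrete, here is a minimal sketch of a narrowly scoped sudoers(5) drop-in, wrapped in shell so it can be generated and syntax-checked before installation. The group name (deployers), the service name (webapp), and both helper function names are hypothetical.

```shell
#!/bin/sh
# Emit a narrowly scoped sudoers drop-in on stdout.
# "deployers" and "webapp.service" are placeholder names for this sketch.
sudoers_fragment() {
    cat <<'EOF'
# Members of "deployers" may restart or reload one service, nothing else.
Cmnd_Alias WEBAPP_CTL = /usr/bin/systemctl restart webapp.service, \
                        /usr/bin/systemctl reload webapp.service
%deployers ALL = (root) WEBAPP_CTL
EOF
}

# Syntax-check a candidate file before installing it. visudo -c -f only
# parses the named file; it does not touch the live configuration.
check_fragment() {
    visudo -c -f "$1"
}
```

One possible workflow would be to write the output of sudoers_fragment to a scratch file, run check_fragment on it, and only then install it under /etc/sudoers.d. The sketch deliberately grants only commands that do not open an editor or pager.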

On the flip side, it is important for information security engineers to understand that trying to restrict users in their effort to get their job done is a futile endeavor: users will always find a way around restrictions that get in their way, often times in ways that further compromise overall security (“ssh tunnels” are an immediate red flag here, as they frequently are used to circumvent firewall restrictions and in the process may unintentionally create a backdoor into production systems). Borrowing a bit from the Zen of Python, it is almost always better to explicitly grant permissions than to implicitly assume they are denied (and then find that they are worked around).

Perfect as the enemy of the Good

Information security professionals and System Administrators alike have a tendency to strive for perfect solutions. System Administrators, however, often have enough practical experience to know that those rarely exist, and that deploying a reasonable, but not perfect, solution that can be iterated upon in the future is almost always preferable.

Herein lies a frequent fallacy however, which many an engineer has derived: if a given restriction can be circumvented, then it is useless. If we cannot secure a resource 100%, then trying to do so is pointless, and may in fact be harmful.

A common scenario might be sudo(8) privileges: many of the commands we may grant developers to run using elevated privileges can be abused or exploited to gain a full root shell (prime example: anything that invokes an editor that allows you to run commands, such as via vi(1)’s “!command” mechanism). Would it not be better to simply grant the user full sudo(8) access to begin with?

Generally: no. The principle of least privilege requires us to be explicit and restrict access where we can. Knowing that the rules in place may be circumvented by a hostile user lets us circle back to the important concept of defense in depth, but we don’t have to make it easier for the attackers. (The audit log provided by requiring specific sudo(8) invocations is another beneficial side-effect.)

We mustn’t let “perfect” be the enemy of the “good” and give up when we cannot solve 100% of the problems. At the same time, though, it is also worth noting that we equally mustn’t let “good enough” become the enemy of the “good”: a half-assed solution that “stops the bleeding” will all too quickly become the new permanent basis for a larger system. As all sysadmins know too well, there is no such thing as a temporary solution.

If these demands seem conflicting to you… you’re right. Striking the right balance here is what is most difficult, and senior engineers of both camps will distinguish themselves by understanding the benefits and drawbacks of either approach.

Understanding your threat model

As we’ve seen above, and as you no doubt will experience yourself, we constantly have to make trade-offs. We want defense in depth, but we do not want to make our systems unusable; we require encryption for data in transit even on trusted systems, because, well, we don’t actually trust these systems; we require authentication and authorization, and desire to have sufficient fine-grained control to abide by the principle of least privilege, yet we can’t let “perfect” be the enemy of the “good”.

Deciding which trade-offs to make, which security mechanisms to employ, and when “good enough” is actually that, and not an excuse to avoid difficult work… all of this, infosec engineers will sing in unison, depends on your threat model.

But defining a “threat model” requires a deep understanding of the systems at hand, which is why System Administrators and their expertise are so valued. We need to be aware of what is being protected from what threat. We need to know who our adversaries are, and what their motivations and capabilities are, before we can determine the methods with which we might mitigate the risks.

Do as DevOps Does

As system administrators, it is important to understand the thought process and concepts behind security requirements. As a by-and-large self-taught profession, we rely on collaboration to learn from others.

As you encounter rules, regulations, demands, or suggestions made by your security team, keep the principles outlined in this post in mind, and then engage them and try to understand not only what exactly they’re asking of you, but also why they’re asking. Make sure to bring your junior staff along, to allow them to pick up these concepts and apply them in the so-called real world, in the process developing solid security habits.

Just like you, your information security colleagues, too, get up every morning and come to work with the desire to do the best job possible, not to ruin your day. Invite them to your team’s meetings; ask them to sit with you and learn about your processes, your users, your requirements.

Do as DevOps does, and ignite the SecOps spark in your organization.

Further reading:

There are far too many details that this already lengthy post could not possibly cover in adequate depth. Consider the following a list of recommended reading for those who want to learn more:

Security through obscurity is terrible; that does not mean that obscurity cannot still provide some (additional) security.

Be aware of the differences between active and passive attacks. Active attacks may be easier to detect, as they are actively changing things in your environment; passive attacks, like wire tapping or traffic analysis, are much harder to detect. These types of attacks have a different threat model.

Don’t assume your tools are not going to be in the critical path.

Another example of why defense in depth is needed is the fact that often times seemingly minor or unimportant issues can be combined to become a critical issue.

The “Attacker Life Cycle”, frequently used within the context of so-called “Advanced Persistent Threats”, may help you understand more completely an adversary’s process, and thus develop your threat model.

This old essay by Bruce Schneier is well worth a read and covers similar ground as this posting. It includes this valuable lesson: When in doubt, fail closed. “When an ATM fails, it shuts down; it doesn’t spew money out its slot.”

by Christopher Webber at December 19, 2014 12:00 AM

December 18, 2014

Chris Siebenmann

How I made IPSec IKE work for a point to point GRE tunnel on Fedora 20

The basic overview of my IPSec needs is that I want to make my home machine (with an outside address) appear as an inside IP address on the same subnet as my work machine is on. Because of Linux proxy ARP limitations, the core mechanics of this involve a GRE tunnel, which must be encrypted and authenticated by IPSec. Previously I was doing this with a static IPSec configuration created by direct use of setkey, which had the drawback that it didn't automatically change encryption keys or notice if something went wrong with the IPSec stuff. The normal solution to these drawbacks is to use an IKE daemon to automatically negotiate IPSec (and time it out if the other end stops), but unfortunately this is not a configuration that IKE daemons such as Fedora 20's Pluto support directly. I can't really blame them; anything involving proxy ARP is at least reasonably peculiar and most sane people either use routing on subnets or NAT the remote machines.

My first step to a working configuration came about after I fixed my configuration to block unprotected GRE traffic. Afterwards I realized this meant that I could completely ignore managing GRE in my IKE configuration and only have it deal with IPSec stuff; I'd just leave the GRE tunnel up all the time and if IPSec was down, the iptables rules would stop traffic. After I gritted my teeth and read through the libreswan ipsec.conf manpage, this turned out to be a reasonably simple configuration. The core of it is this:

conn cksgre
    left=<work IP alias>
    leftsourceip=<work IP alias>
    right=<home public IP>
    # what you want for always-up IPSec
    # I only want to use IPSec on GRE traffic

    # authentication is:

The two IP addresses used here are the two endpoints of my GRE tunnel (the 'remote' and 'local' addresses in 'ip tunnel <...>'). Note that this configuration has absolutely no reference to the local and peer IP addresses that you set on the inside of the tunnel; in my setup IPSec is completely indifferent to them.

I initially attempted to do authentication via PSK aka a (pre) shared secret. This caused my setup of the Fedora 20 version of Pluto to dump core with an assertion failure (for what seems to be somewhat coincidental reasons), which turned out to be lucky because there's a better way. Pluto supports what it calls 'RSA signature authentication', which people who use SSH also know as 'public key authentication'; just as with SSH, you give each end its own keypair and then list the public key(s) in your configuration and you're done. How to create the necessary RSA keypairs and set everything up is not well documented in the Fedora 20 manpages; in fact, I didn't even realize it was possible. Fortunately I stumbled over this invaluable blog entry on setting up a basic IPSec connection which covered the magic required.

This got the basic setup working, but after a while the novelty wore off and my urge to fiddle with things got the better of me so I decided to link the GRE tunnel to the IKE connection, so it would be torn down if the connection died (and brought up when the connection was made). You get your commands run on such connection events through the leftupdown="..." or rightupdown="..." configuration setting; your command gets information about what's going on through a pile of environment variables (which are documented in the ipsec_pluto manpage). For me this is a script that inspects $PLUTO_VERB to find out what's going on and runs one of my existing scripts to set up or tear down things on up-host and down-host actions. As far as I can tell, my configuration does not need to run the default 'ipsec _updown' command.

(My existing scripts used to do both GRE setup and IPSec setup, but of course now they only do the GRE setup and the IPSec stuff is commented out.)
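A minimal sketch of such a dispatcher might look like the following. The two script paths and the updown_action helper are hypothetical stand-ins for my existing GRE setup and teardown scripts; only the up-host and down-host verbs are handled, and everything else is ignored.

```shell
#!/bin/sh
# Hypothetical updown helper, e.g. leftupdown="/usr/local/sbin/gre-updown".
# GRE_UP and GRE_DOWN stand in for existing GRE setup/teardown scripts;
# both paths are made up for this sketch.
GRE_UP=${GRE_UP:-/usr/local/sbin/gre-tunnel-up}
GRE_DOWN=${GRE_DOWN:-/usr/local/sbin/gre-tunnel-down}

# Map a Pluto verb to the script to run; print nothing for ignored verbs.
updown_action() {
    case "$1" in
        up-host)   echo "$GRE_UP" ;;
        down-host) echo "$GRE_DOWN" ;;
        *)         : ;;
    esac
}

# Pluto exports PLUTO_VERB before running the updown command.
action=$(updown_action "${PLUTO_VERB:-}")
if [ -n "$action" ]; then
    "$action"
fi
```

Keeping the verb-to-script mapping in its own function makes it easy to check by hand what a given verb would run without actually touching the tunnel.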

This left IPSec connection initiation (and termination) itself. On my home machine I used to bring up and tear down the entire IPSec and GRE stuff when my PPPoE DSL link came up or went down. In theory one could now leave this up to a running Pluto based on its normal keying retries and timeouts; in practice this doesn't really work well and I wound up needing to do manual steps. Manual control of Pluto is done through 'ipsec whack' and if everything is running smoothly doing the following on DSL link up or down is enough:

ipsec whack --initiate|--terminate --name cksgre >/dev/null 2>&1

Unfortunately this is not always sufficient. Pluto does not notice dynamically appearing and disappearing network links and addresses, so if it's (re)started while my DSL link is down (for example on boot) it can't find either IP address associated with the cksgre connection and then refuses to try to do anything even if you explicitly ask it to initiate the connection. To make Pluto re-check the system's IP addresses and thus become willing to activate the IPSec connection, I need to do:

ipsec whack --listen

Even though the IPSec connection is set to autostart, Pluto does not actually autostart it when --listen causes it to notice that the necessary IP address now exists; instead I have to explicitly initiate it with 'ipsec whack --initiate --name cksgre'. My current setup wraps this all up in a script and runs it from /etc/ppp/ip-up.local and ip-down.local (in the same place where I previously invoked my own IPSec and GRE setup and stop scripts).

So far merely poking Pluto with --listen has been sufficient to get it to behave, but I haven't extensively tested this. My script currently has a fallback that will do a 'systemctl restart ipsec' if nothing else works.
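Put together, the wrapper script could be sketched roughly as follows. This is not my actual script: the run helper, the DRY_RUN switch, and the up/down arguments are inventions for illustration, and treating a non-zero exit status from 'ipsec whack --initiate' as the signal to fall back to a restart is an assumption.

```shell
#!/bin/sh
# Sketch of a wrapper called from /etc/ppp/ip-up.local and ip-down.local.
CONN=cksgre

# Run a command quietly, or just print it when DRY_RUN is set
# (DRY_RUN is a convenience invented for this sketch).
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@" >/dev/null 2>&1
    fi
}

link_up() {
    # Make Pluto re-check the system's IP addresses...
    run ipsec whack --listen
    # ...then explicitly initiate; restart the service as a last resort.
    run ipsec whack --initiate --name "$CONN" || run systemctl restart ipsec
}

link_down() {
    run ipsec whack --terminate --name "$CONN"
}

# Hypothetical calling convention: "up" from ip-up.local, "down" from ip-down.local.
case "${1:-}" in
    up)   link_up ;;
    down) link_down ;;
esac
```

Running it with DRY_RUN=1 prints the commands instead of executing them, which is a cheap way to check the sequence before wiring it into the PPP hooks.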

PS: Note that taking down the GRE tunnel on IPSec failure has some potential security implications in my environment. I think I'm okay with them, but that's really something for another entry.

Sidebar: What ipsec _updown is and does

On Fedora 20 this is /usr/libexec/ipsec/_updown, which runs one of the _updown.* scripts in that directory depending on what the kernel protocol is; on most Linux machines (and certainly on Fedora 20) this is NETKEY, so _updown.netkey is what gets run in the end. What these scripts can do for you and maybe do do for you is neither clear nor documented and they make me nervous. They certainly seem to have the potential to do any number of things, some of them interesting and some of them alarming.

Having now scanned _updown.netkey, it appears that the only thing it might possibly be doing for me is mangling my /etc/resolv.conf. So, uh, no thanks.

by cks at December 18, 2014 07:23 PM

RISKS Digest


WARNING: The following packages cannot be authenticated!

We run several (read: hundreds of) servers that are still running Debian 6 (Squeeze). A few months ago, we started seeing the following errors coming from the daily apt cronjob: "WARNING: The following packages cannot be authenticated!" When running apt-get update the following errors dump out:

W: GPG error: squeeze-backports Release: The following signatures couldn't be verified because the public key is not available: NO_PUBKEY 8B48AD6246925553
W: GPG error: squeeze-lts Release: The following signatures couldn't be verified because the public key is not available: NO_PUBKEY 8B48AD6246925553
W: GPG error: squeeze-updates Release: The following signatures couldn't be verified because the public key is not available: NO_PUBKEY 8B48AD6246925553

There are two ways to solve the problem:

apt-get install debian-archive-keyring will install all the keys you need.

If you want to install a specific key, then apt-key adv --keyserver <keyserver> --recv-keys 8B48AD6246925553 will do what you need, with <keyserver> replaced by the keyserver you use. Obviously, adjust the key accordingly.
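When you have hundreds of servers, it can help to pull the missing key IDs out of the warnings automatically rather than reading them by eye. A minimal sketch, assuming the warning format shown above (the missing_keys function name is made up):

```shell
#!/bin/sh
# Read apt-get output on stdin and print each missing key ID once.
missing_keys() {
    grep -o 'NO_PUBKEY [0-9A-F]*' | awk '{print $2}' | sort -u
}
```

It could be used as apt-get update 2>&1 | missing_keys, which for the three warnings above would print the single ID 8B48AD6246925553.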

by Scott Hebert at December 18, 2014 05:33 PM

Racker Hacker

Eight years at Rackspace

Rackspace Datapoint office sign

Saying farewell to the Datapoint office location in 2011. That’s where it all started for me in 2006.

Today marks my eight year anniversary at Rackspace and I’m truly honored to work for such a rapidly evolving company that takes the art of customer service to the next level. I continue to learn so much from the community of Rackers around me and I’m glad to have the opportunity to teach them something new as well.

by Major Hayden at December 18, 2014 02:00 PM

Chris Siebenmann

The potential end of public clients at the university?

Recently, another department asked our campus-wide sysadmin mailing list for ideas on how to deal with keyloggers, after having found one. They soon clarified that they meant physical keyloggers, because that's what they'd found. As I read the ensuing discussion I had an increasing sinking feeling that the answer was basically 'you can't' (which was pretty much the consensus answer; no one had really good ideas and several people knew things that looked attractive but didn't fully work). And that makes me pretty unhappy, because it means that I'm not sure public clients are viable any more.

Here at the university there's long been a tradition and habit of various sorts of public client machines, ranging from workstations in computer labs in various departments to terminals in libraries. All of these uses depend crucially on the machines being at least non-malicious, where we can assure users that using the machine in front of them is not going to give them massive problems like compromised passwords and everything that ensues from that.

(A machine being non-malicious is different from it being secure, although secure machines are usually non-malicious as well. A secure machine is doing only what you think it should be, while a non-malicious machine is at least not screwing its user. A machine that does what the user wants instead of what you want is insecure but hopefully not malicious (and if it is malicious, well, the user did it to themselves, which is admittedly not a great comfort).)

Keyloggers, whether software or physical, are one way to create malicious machines. Once upon a time they were hard to get, expensive, and limited. These days, well, not so much, based on some hardware projects I've heard of; I'm pretty sure you could build a relatively transparent USB keylogger with tens of megabytes of logging capacity as an undergrad final project with inexpensive off the shelf parts. Probably you can already buy fully functional ones for cheap on eBay. What was once a pretty rare and exclusive preserve is now available to anyone who is bored and sufficiently nasty to go fishing. As this incident illustrates, some number of our users probably will do so (and it's only going to get worse as this stuff gets easier to get and use).

If we can't feasibly keep public machines from being made malicious, it's hard to see how we can keep offering and operating them at all. I'm now far from convinced that this is possible in most settings. Pessimistically, it seems like we may have reached the era where it's much safer to tell people to bring their own laptops, tablets, or phones (which they often will anyways, and will prefer using).

(I'm not even convinced it's a good idea to have university provided machines in graduate student offices, many of which are shared and in practice are often open for people who look like they belong to stroll through and fiddle briefly with a desktop.)

PS: Note that keyloggers are on the easy end of the scale of damage you can do with nasty USB hardware. There's much worse possible, but of course people really want to be able to plug their own USB sticks and so on into your public machines.

Sidebar: Possible versus feasible here

I'm pretty sure that you could build a kiosk style hardware enclosure that would make a desktop's actual USB ports and so on completely inaccessible, so that people couldn't unplug the keyboard and plug in their keylogger. I'm equally confident that this would be a relatively costly piece of custom design and construction that would also consume a bunch of extra physical space (and the physical space needed for public machines is often a big limiting factor on how many seats you can fit in).

by cks at December 18, 2014 04:44 AM

System Administration Advent Calendar

Day 18 - Adding Context to Alerts with nagios-herald

Written by: Katherine Daniels (@beerops)
Edited by: Jennifer Davis (@sigje)

3am Pages Suck!

As sysadmins, we all know the pain that comes from getting paged at 3am because some computery thing somewhere has caught on fire. It’s dark, you were having a perfectly pleasant dream about saving the world from killer robots, or cake, or something, when all of a sudden your phone starts making a noise like a car alarm. It’s your good friend Nagios, disturbing your slumber once again with word of a problem and very little else.

We might hate it for being the bearer of bad news, but Nagios is a well-known and time-tested monitoring and alerting tool. It does its job well: it runs the checks we tell it to, when we tell it to, and it dutifully whines when those checks fail. The problem with its whining, however, is that by default there is very little context around it.

Adding Context to 3am

As an example, let’s take a look at everyone’s favorite thing to get woken up by, the disk space check.

Disk Space Alert Without Context

We know that the disk space has just crossed the warning threshold. We know the amount and percentage of free space on this volume. We know what volume is having this issue, and what time the notification was sent. But this doesn’t tell us anything more. Was this volume gradually getting close to the threshold and just happened to go over it during the night? If so, we probably don’t care in the middle of the night - a nice slow increase means that it won’t explode during the night and can be fixed in the morning instead. On the other hand, was there a sudden drastic increase in disk usage? That’s another matter entirely, and something that someone probably should get out of bed for.

This kind of additional context provides really valuable information as to how actionable this alert is. And when we get disk space alerts, one of the first things we do is to check how quickly the disk has been filling up. But in the middle of the night, that’s asking an awful lot - getting out of bed to find a laptop, maybe arguing with a VPN, finding the right graphite or ganglia graph - who wants to do all that when what we really want to do is go back to sleep?

With nagios-herald, the computers can do all of that work for us.

Disk Space Alert With Context

Here we have a bunch of the most relevant context added into the alert for us. We start with a visual indicator of the problematic volume and how full it is, so eyes bleary from sleep can easily grok the severity of the situation. Next is a ganglia graph of the volume over the past day, to give an idea of how fast it has been filling up (and if there was a sudden jump, when it happened, which can often help in tracking down the source of a problem). The threshold is there as well, so we can tell if a critical alert is just barely over the threshold or OH HEY THIS IS REALLY SUPER SERIOUSLY CRITICAL GET UP AND PAY ATTENTION TO IT. Finally, we have alert frequency, to know easily if this is a box that frequently cries wolf or one that might require more attention.

Introducing Formatters

All this is done by way of formatters used by nagios-herald. nagios-herald is itself just a Nagios notification script, but these formatters can be used to do the heavy lifting of adding as much context to an alert as can be dreamt up (or at least automated). The Formatter::Base class defines a variety of methods that make up the core of nagios-herald’s formatting. More information on these methods can be found in their documentation, but to name a few:

  • add_text can be used to add any block of plain text to an alert - this could be used to add information such as which team to contact if this alert fires, whether or not the service is customer-impacting, or anything else that might assist the on-call person who receives the alert.

  • add_html can add any arbitrary HTML - this could be a link to a run-book with more detailed troubleshooting or resolution information, it could add an image (maybe a graph, or just a funny cat picture), or just turn the alert text different colors for added emphasis.

  • ack_info can be used to format information about who acknowledged the alert and when, which can be especially useful on larger or distributed teams where other people might be working on an issue (maybe that lets you know that somebody else is so on top of things that you can go back to sleep and wait until morning!)

All of the methods in the formatter base class can be overridden in any subclass that inherits from it, so the only limit is your imagination. For example, we have several checks that look at graphite graphs and alert (or not) based on their value. Those checks use the check_graphite_graph formatter, which overrides the additional_info base formatter method to add the relevant graph to the Nagios alert:

def additional_info
    section = __method__
    output = get_nagios_var("NAGIOS_#{@state_type}OUTPUT")
    add_text(section, "Additional Info:\n #{unescape_text(output)}\n\n") if output
    # Pull the current value and both thresholds out of the check output.
    output_match = output.match(/Current value: (?<current_value>[^,]*), warn threshold: (?<warn_threshold>[^,]*), crit threshold: (?<crit_threshold>[^,]*)/)
    if output_match
      add_html(section, "Current value: <b><font color='red'>#{output_match['current_value']}</font></b>, warn threshold: <b>#{output_match['warn_threshold']}</b>, crit threshold: <b><font color='red'>#{output_match['crit_threshold']}</font></b><br><br>")
    else
      add_html(section, "<b>Additional Info</b>:<br> #{output}<br><br>") if output
    end

    # The graph URL is the last '!'-separated argument of the check command.
    service_check_command = get_nagios_var("NAGIOS_SERVICECHECKCOMMAND")
    url = service_check_command.split(/!/)[-1].gsub(/'/, '')
    graphite_graphs = get_graphite_graphs(url)
    from_match = url.match(/from=(?<from>[^&]*)/)
    if from_match
      add_html(section, "<b>View from '#{from_match['from']}' ago</b><br>")
    else
      add_html(section, "<b>View from the time of the Nagios check</b><br>")
    end
    add_attachment graphite_graphs[0]    # The original graph.
    add_html(section, %Q(<img src="#{graphite_graphs[0]}" alt="graphite_graph" /><br><br>))
    add_html(section, '<b>24-hour View</b><br>')
    add_attachment graphite_graphs[1]    # The 24-hour graph.
    add_html(section, %Q(<img src="#{graphite_graphs[1]}" alt="graphite_graph" /><br><br>))
end

This method calls other methods from the base formatter class, such as add_html and add_attachment, to add all the relevant information we wanted for these graphite-based checks.
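The parsing step in that method can be tried out on its own, outside of nagios-herald. The following is a minimal sketch that reuses the same regular expression; the helper name parse_check_output and the sample output line are assumptions made up for illustration, not part of nagios-herald.

```ruby
# Same pattern as in additional_info: three named captures, each running
# up to the next comma (or the end of the line).
OUTPUT_PATTERN = /Current value: (?<current_value>[^,]*), warn threshold: (?<warn_threshold>[^,]*), crit threshold: (?<crit_threshold>[^,]*)/

# Hypothetical helper: returns a hash of the three values as strings,
# or nil when the check output does not have the expected shape.
def parse_check_output(output)
  match = output.to_s.match(OUTPUT_PATTERN)
  return nil unless match

  {
    current: match['current_value'],
    warn: match['warn_threshold'],
    crit: match['crit_threshold']
  }
end

# Made-up sample line in the format the regex expects.
sample = 'Current value: 18094.25, warn threshold: 100.0, crit threshold: 1000.0'
parsed = parse_check_output(sample)
puts parsed.inspect

# Output in any other format simply yields nil, which a formatter can use
# to fall back to printing the raw text.
puts parse_check_output('DISK OK').inspect
```

Converting the input with to_s means a missing Nagios variable (nil) is treated like unparseable output instead of raising an error, which may or may not be what a real formatter wants.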

Now What?

If you’re using Nagios and wish its alerts were a little more helpful, go ahead and install nagios-herald and give it a try! From there, you can start customizing your own alerts by writing your own formatters - and we love feedback and pull requests. You’ll have to wrangle some Ruby, but it’s totally worth it for how much more useful your alerts will be. Getting paged in the middle of the night still won’t be particularly fun, but with nagios-herald, at least you can know that the computers are pulling their weight as well. And really, if they’re going to be so demanding and interrupt our sleep, shouldn’t they at least do a little bit of work for us when they do?

by Christopher Webber at December 18, 2014 12:00 AM

December 17, 2014

Debian Administration

A brief introduction to publish-subscribe queuing with redis

In this brief article we'll demonstrate using Redis for a publish/subscribe system. There are dedicated publish-subscribe platforms out there, but Redis is reasonably performant and essentially configuration-free.

by Steve at December 17, 2014 04:46 PM

Racker Hacker

Try out LXC with an Ansible playbook

The world of containers has been evolving constantly of late. The latest turn of events came when the CoreOS developers announced Rocket as an alternative to Docker. However, LXC still lingers as a very simple path to begin using containers.

When I talk to people about LXC, I often hear how difficult it is to get started with it. After all, Docker provides an easy-to-use image downloading function that allows you to spin up multiple different operating systems in Docker containers within a few minutes. It also comes with a daemon to help you manage your images and your containers.

Managing LXC containers using the basic LXC tools isn’t terribly easy — I’ll give you that. However, managing LXC through libvirt makes the process much easier. I wrote a little about this earlier in the year.

I decided to turn the LXC container deployment process into an Ansible playbook that you can use to automatically spawn an LXC container on any server or virtual machine. At the moment, only Fedora 20 and 21 are supported. I plan to add CentOS 7 and Debian support soon.

Clone the repository to get started:

git clone
cd ansible-lxc

If you’re running the playbook on the actual server or virtual machine where you want to run LXC, there’s no need to alter the hosts file. You will need to adjust it if you’re running your playbook from a remote machine.

As the playbook runs, it will install all of the necessary packages and begin assembling a Fedora 21 chroot. It will register the container with libvirt and do some basic configuration of the chroot so that it will work as a container. You’ll end up with a running Fedora 21 LXC container that is using the built-in default NAT network created by libvirt. The playbook will print out the IP address of the container at the end. The default password for root is fedora. I wouldn’t recommend leaving that for a production use container. ;)

All of the normal virsh commands should work on the container. For example:

# Stop the container gracefully
virsh shutdown fedora21
# Start the container
virsh start fedora21

Feel free to install the virt-manager tool and manage everything via a GUI locally or via X forwarding:

yum -y install virt-manager dejavu* xorg-x11-xauth
# OPTIONAL: For a better looking virt-manager interface, install these, too
yum -y install gnome-icon-theme gnome-themes-standard


by Major Hayden at December 17, 2014 01:50 PM

Chris Siebenmann

Does having a separate daemon manager help system resilience?

One of the reasons usually put forward for having a separate daemon manager process (instead of having PID 1 do this work) is that doing so increases overall system resilience. As the theory goes, PID 1 can be made minimal and extremely unlikely to crash (unlike a more complex PID 1), while if the more complicated daemon manager does crash it can be restarted.

Well, maybe. The problem is the question of how well you can actually take over from a crashed daemon manager. Usually this won't be an orderly takeover and you can't necessarily trust anything in any auxiliary database that the daemon manager has left behind (since it could well have been corrupted before or during the crash). You need to have the new manager process step in and somehow figure out what was (and is) running and what isn't, then synchronize the state of the system back to what it's supposed to be, then pick up monitoring everything.

The simple case is a passive init system. Since the init system does not explicitly track daemon state, there is no state to recover on a daemon manager restart and resynchronization can be done simply by trying to start everything that should be started (based on runlevel and so on). We can blithely assume that the 'start' action for everything will do nothing if the particular service is already started. Of course this is not very realistic, as passive init systems generally don't have daemon manager processes that can crash in the first place.
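The resynchronization described above can be sketched in a few lines. This is only an illustration under the same optimistic assumption the paragraph makes, namely that every 'start' action is idempotent; the Service class and the service names are hypothetical stand-ins, not a real init system API.

```ruby
# Hypothetical stand-in for a service whose 'start' action is idempotent.
class Service
  attr_reader :name, :start_count

  def initialize(name, running: false)
    @name = name
    @running = running
    @start_count = 0
  end

  def running?
    @running
  end

  # Does nothing (and returns false) if the service is already running;
  # otherwise marks it running and returns true.
  def start
    return false if @running

    @running = true
    @start_count += 1
    true
  end
end

# Resynchronize without any saved state: blindly try to start everything
# that should be up. Returns the names of services that actually started.
def resync(services)
  services.select(&:start).map(&:name)
end

services = [
  Service.new('sshd', running: true),
  Service.new('cron', running: false),
  Service.new('syslog', running: true)
]

started = resync(services)
puts "started: #{started.join(', ')}"
```

Running resync a second time starts nothing, which is exactly the property that lets a stateless restart get away with not knowing what was running before.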

For an active daemon manager, I think that at a minimum what you need is some sort of persistent and stable identifier for groups of processes that can be introspected and monitored from an arbitrary process. The daemon manager starts processes for all services under an identifier determined from their service name; then when it crashes and you have to start a new one, the new one can introspect the identifiers for all of the groups to determine what services are (probably) running. Unfortunately there are lots of complications here, including that this doesn't capture the state of 'one-shot' services without persistent processes. This is of course not a standard Unix facility, so no fully portable daemon manager can do this.

It's certainly the case that a straightforward, simple daemon manager will not be able to take over from a crashed instance of itself. Being able to do real takeover requires both system-specific features and a relatively complex design and series of steps on startup, and still leaves you with uncertain or open issues. In short, having a separate daemon manager does not automatically make the system any more resilient under real circumstances. A crashing daemon manager is likely to force a system reboot just as much as a crashing PID 1 does.

However, I think it's fair to say that under normal circumstances a separate daemon manager process crashing (instead of PID 1 crashing) will buy you more time to schedule a system outage. If the only thing that needs the daemon manager running is starting or stopping services and you already have all normal services started up, your system may be able to run for days before you need to reboot it. If your daemon manager is more involved in system operation or is routinely required to restart services, well, you're going to have (much) less time depending on the exact details.

by cks at December 17, 2014 04:54 AM

System Administration Advent Calendar

Day 17 - DevOps for Horses: Moving an Enterprise Application to the Cloud

Written by: Eric Shamow (@eshamow)
Edited by: Michelle Carroll (@miiiiiche)

As an engineer, when you first start thinking about on-demand provisioning, CD, containers, or any of the myriad techniques and technologies floating across the headlines, there is a point when you realize with a cold sweat that this is going to be a bigger job than you thought. As you watch folks talking at various conferences about the way they are deploying and scaling applications, you realize that your applications won’t work if you deployed them this way.

Most of the glamorous or really interesting, thought-provoking discussions around deployment methodologies work because the corresponding applications were built to be deployed into those environments in a true virtuous cycle between development and operations teams. Sometimes the lines between those teams disappear entirely.

In some cases, this is because Operations is outsourced entirely — consider PaaS environments like Heroku or Google App Engine, where applications can be deployed with tremendous ease, due to a very restricted set of conditions defining how code is structured and what features are available. Similarly, on-premises PaaS infrastructures, such as Cloud Foundry or OpenShift, allow for organizations to create a more flexible and customized environment while leveraging the same kind of automation and tight controls around application delivery.

If you can leverage these tools, you should. I advise teams to try and build out an internal PaaS capability — whether they are using Cloud Foundry or bootstrapping their own, or even several to allow for multiple application patterns. The Twelve-Factor App pattern is a good checklist of conditions to start with for understanding what’s necessary to get to a Heroku-like level of automation. If your app meets all these conditions, congratulations — you are probably ready to go PaaS.

My App Isn’t Ready For PaaS

Unless you’re a startup or have a well-funded team effort to move, your application won’t work as it stands in a PaaS. You are perhaps ready for IaaS (or are evaluating it) and wondering: where do I start? If you can’t do much with the application design, how can you begin to get ready for a cloud move with the legacy infrastructure and code you have?

Getting Your Bearings

Start by collecting data. A few critical pieces of information I like to gather before drawing up a strategy:

  • What are the components of the application? Can you draw a graph of their dependencies?

  • If the components are separated from one another, can they tolerate the partition or does the app crash or freeze? Are any components a single point of failure?

  • How long does it take for the application to recover from a failure?

  • Can the application recover from a typical failure automatically? If not, what manual intervention is involved?

  • How is the application deployed? If the server on which the application is running dies, what is the process/procedure for bringing it back to life?

  • Can you easily replicate the state of your app in any environment? Are your developers looking at code in an environment that looks as close as possible to production? Can your QA team adequately simulate the conditions of an outage when testing a new release?

  • How do you scale the application? Can you add additional worker systems and scale the system horizontally, or do you need to move the system to bigger and more powerful servers as the service grows?

  • What does the Development/QA cycle look like? Is Operations involved in deploying applications into QA? How long does it take for developers to get a new release into and through the testing cycle?

  • How does operations take delivery of code from development? What is the definition of a deliverable? Is it consistent, or does it change from version to version?

  • How do you know that your application was successfully installed?

I’m not going to tackle all of them, but will rather focus on some of the key themes we’re looking for in examining our apps and environment.


One of the key underpinnings of modern application design is the understanding that failure is inevitable — it’s not a question of if a component of your application will fail, but when. The critical metric for an application is not necessarily how often it fails (although an app that fails regularly is clearly a problem) but how well its components tolerate the failure of other components. As your app scales out — and particularly if you are planning to move to public cloud — you can expect that data will no longer flow evenly between components. This is not just a problem of high latency, but variable latency — sudden network congestion can cause traffic between components to be bursty.

If one component of your application depends on another component to be functional, or your app requires synchronous and low-latency communication at all times between components, you have identified tight coupling. These tight couplings are death for applications in the cloud (and they’re the services that make upgrades and migration to new locations the most difficult as well). Tight couplings are amongst the most difficult problems to address — often they relate to application design and are tightly tied to the business logic and implementation of the application. A good overview of the problem and some potential remedies can be found in Martin Fowler’s 2001 article “Reducing Coupling” (warning: PDF).

For now, we need to identify these tight couplings and pay extra attention to them: monitor heavily around communications, add checks to ensure that data is flowing smoothly, and in general treat these parts of our architecture as the fragile breakpoints that they are. If you cannot work around or eliminate these couplings, you may be able to automate processes for detection and remediation. Ultimately, the couplings between your apps will determine your pattern for upgrades, migrations and scaling, so understanding how your components communicate and which depend on each other is essential to building a working and automated process.


If you can’t reinstall the app without human intervention, you have a problem. We can expect that a server will eventually fail and that application updates will happen on a regular cadence. Humans screw up things we do repetitively — repeat even a simple process often enough and you will eventually do it wrong. Computers are exceptionally good at repetitive tasks. If you have your sysadmins doing regular installs of your applications — or worse if your sysadmins have to call in developers and they must pair to slowly work through every install — you are not taking advantage of the computers. And you’re overtaxing humans who are much better at — and happier — doing other things.

Many organizations maintain either an installation wiki, a set of install scripts, or both. These sources of information frequently vary and operators need to hop from one to the other to assemble and install. With this type of ad-hoc assembly of a process, it’s likely that one administrator will not follow the process perfectly each time, but certain that different administrators will follow the process in different ways. Asking people to “fix the wiki” will not fix the discrepancy. The wiki will always lag the current state of your systems. Instead, treat your installation scripts like “executable documentation.” They should be the single source of truth for the process used to deploy the app.

While you will want your automation to use good, known frameworks, the reality is that a BASH script is a good start if you have nothing in place. Is BASH the way to go for your system automation? As a former employer put it, “SSH in a for loop isn’t enough” — and it’s not. But writing a script to deploy a system in a language you already know is a good way to identify if you can automate the deploy, as well as the decisions you need to make during the install. This information informs your later choice of automation framework, and enables you to identify which parts of your configuration change from install to install. As a bonus, you’ve taken a first pass at automating your process, which will speed up your deploys and help you select an automation framework that best fits your use case. For an exploration of this topic and an introduction to taking it a step further into early Configuration Management, check out my former colleague Mike Stahnke’s dead-on 2013 presentation “Getting Started With Puppet.”

Environment Parity and Configuration Management

We’ve all been on some side of the environment parity issue. Code makes it into production that didn’t take into account some critical element of the production environment — a firewall, different networking configuration, different system version, and so on. The invariable response from Operations is, “Developers don’t understand real operating environments.” The colloquial version of this is, “It works on my laptop!”

The more common truth is that Operations didn’t provide Development with an environment that looked anything like production, or even with the tools to know about or understand what the production environment looks like. As an Operations team, if you don’t offer Development a prod-like environment to deploy into and test with, you cede your right to complain about code they produce that doesn’t match prod.

Since it is often not possible to give developers an exact copy of production, it’s important for the Operations team to abstract away as many differences between environments as possible. Dev, QA, Prod and all other environments should be running the same OS versions and patch sets, with the same dependencies and same system configuration across the board. The most sensible way to do this is with Configuration Management. Configure all of your environments using the same tools and, most critically, with the same configuration management scripts. The differences between your environments should be a set of variables that inform that code.
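As a rough illustration of "the same code, informed by variables", here is a minimal sketch using Ruby's standard ERB templating rather than any particular Configuration Management tool. The setting names, hostnames and values are all made up for the example; a real setup would keep them in whatever data store your tooling provides.

```ruby
require 'erb'

# One template shared by every environment (hypothetical app settings).
TEMPLATE = <<~CONF
  listen_port = <%= port %>
  db_host = <%= db_host %>
  log_level = <%= log_level %>
CONF

# The only thing that differs between environments is this data.
# All names and values here are invented for illustration.
ENVIRONMENTS = {
  'dev'  => { port: 8080, db_host: 'db.dev.example.com',  log_level: 'debug' },
  'qa'   => { port: 8080, db_host: 'db.qa.example.com',   log_level: 'info' },
  'prod' => { port: 8080, db_host: 'db.prod.example.com', log_level: 'warn' }
}.freeze

# Render the shared template with one environment's variables.
# Raises KeyError for an environment that has not been defined.
def render_config(env_name)
  vars = ENVIRONMENTS.fetch(env_name)
  ERB.new(TEMPLATE).result_with_hash(vars)
end

puts render_config('qa')
```

Because every environment goes through the same template, a setting that exists in production cannot silently be missing from QA; only its value can differ.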

If you can’t reduce the differences between your environments to code informed by variables, you’ve identified some hard problems your developers and operations teams are going to have to bridge together. At the very least, if you can make your environments more similar, you can significantly reduce the number of factors that must be taken into account when an app fails in one environment when it succeeded in another.

Get Operations out of the Dev/QA Cycle

The notion of Operations being required to install applications into a QA/Testing environment always baffled me. I was in favor of Development not doing the install themselves, but I also understood that opening a ticket with Operations and waiting for an install is a time-intensive process, and that debugging/troubleshooting is a highly interactive one. These two needs are at odds. By slowing down the Dev/QA feedback loop, Operations not only causes Development to become less efficient, it also encourages developers to do larger chunks of work and submit them for testing less frequently.

The flip side of this is that allowing developers full root access on QA servers is potentially dangerous. Developers may inadvertently make changes that cause the QA servers to behave differently from production. Similarly, if developers are installing directly into QA, operations doesn’t get to look at the deployable until it reaches production. When they install the application for the first time, it’s in the most critical environment.

There’s a three-part fix for this:

  • Developers are responsible for deliverables in a consistent format. Whether that’s a package, a tarball, or a tagged git checkout, the deliverable must look the same from release to release.

  • QA is managed via Configuration Management, and applications are installed into QA using the same automation tools/scripts used in production.

  • Operations’ SLA for QA is that it will flatten and re-provision the environment when needed. If a deployment screws up the server, Ops will provide a new, clean server.

Using these policies, the application is installed into QA and any subsequent environments with the same scripts. If we’ve learned anything from the Lean movement, it’s that accuracy can be improved by reducing batch sizes, increasing the speed of processing and baking QA into the process. With these changes, the deployment scripts and artifacts are tested dozens, hundreds or thousands of times before they are ever used in production. This can help find deployment problems and iron out scripts long before code ever reaches user-facing systems.

The benefits for both teams are clear: Development gets a fast turnaround time for QA, Operations gets a clean deliverable that can be deployed via its own scripts.

Functional Testing

While there will always be the need for manual testing of certain functionality, establishing an automated testing regimen can provide quick feedback about whether an app is functioning as intended.

While an overview of testing strategies is beyond the scope of this article (Chapter 4 of Jez Humble and David Farley’s book Continuous Delivery provides an excellent overview), I’d argue for prioritizing a combination of functional and integration tests. You want to confirm that the app does what is intended. Simple smoke tests to verify that a server is configured properly and that an application is installed and running are a good first pass at a testing regimen.
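A first-pass smoke test does not need a framework. The sketch below shows one possible shape in plain Ruby: a list of named checks, each a block that returns true on success. The SmokeCheck struct, the run_smoke_checks helper and the two placeholder checks are assumptions for illustration; real checks would look at your own packages, ports and processes.

```ruby
require 'tmpdir'

# A named check; probe is a callable that returns true on success.
SmokeCheck = Struct.new(:name, :probe)

# Runs every check and returns a hash of name => true/false.
# A probe that raises an error is recorded as a failure instead of
# aborting the whole run.
def run_smoke_checks(checks)
  checks.map do |check|
    ok = begin
      check.probe.call ? true : false
    rescue StandardError
      false
    end
    [check.name, ok]
  end.to_h
end

# Placeholder checks that work on any machine; swap in real ones.
checks = [
  SmokeCheck.new('temp directory exists', -> { File.directory?(Dir.tmpdir) }),
  SmokeCheck.new('temp directory is writable', -> { File.writable?(Dir.tmpdir) })
]

results = run_smoke_checks(checks)
results.each { |name, ok| puts "#{ok ? 'PASS' : 'FAIL'}  #{name}" }
puts 'smoke tests failed' unless results.values.all?
```

Once checks like these exist, tools such as serverspec give you a richer vocabulary for the same idea.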

Once you get comfortable writing tests, you should begin doing more involved testing of application and server behavior and performance. Every time you make a change that alters the behavior of the application or underlying system, add a test. Down the line you may want to consider TDD or BDD, but start small — having imperfect tests is better than having no tests at all.

At the application level, your development team likely has a testing language or suite for unit and integration tests. There are a number of frameworks you can use for doing this at the server/Configuration Management level. I have used both serverspec and Beaker with success in the past.

The first time you run a proposed configuration management change through tests and discover that it would break your application is a revelation. Similarly, the first time you prevent a regression by adding a check for something that “always” breaks will be the last time somebody accidentally breaks it.

Wrapping Up

We’ve just scratched the surface of what can be done with an existing environment, but as you can hopefully see, there’s plenty you can do right now to get your environment ready for IaaS (and eventually, PaaS) without touching your application’s code.

Remember that this process should be iterative — unless you have the budget to build a greenfield environment tomorrow, you are going to be tackling this one piece at a time. Don’t feel ashamed because your environments aren’t automated enough or you don’t have comprehensive enough tests for your application. Rather, focus on making things better. If you don’t have enough automation, build more. If there aren’t enough good tests, write just one. Then re-examine your environment, see what most needs improvement, and iterate there.

There’s no way to completely move an app without touching the code, but there’s plenty of work to do before you get there in preparation of scalable, loosely coupled code. Don’t wait for the perfect application to start doing the right thing.

by Christopher Webber at December 17, 2014 12:00 AM

December 16, 2014

Chris Siebenmann

How a Firefox update just damaged practical security

Recently, Mozilla pushed out Firefox 34 as one of their periodic regular Firefox updates. Unfortunately this shipped with a known incompatible change that broke several extensions, including the popular Flashblock extension. Mozilla had known about this problem for months before the release; in fact the bug report was essentially filed immediately after the change in question landed in the tree, and the breakage was known when the change was proposed. Mozilla people didn't care enough to do anything in particular about this beyond (I think) blacklisting the extension as non-functional in Firefox 34.

I'm sure that this made sense internally in Mozilla and was justified at the time. But in practice this was a terrible decision, one that's undoubtedly damaged pragmatic Firefox security for some time to come. Given that addons create a new browser, the practical effect of this decision is that Firefox's automatic update to Firefox 34 broke people's browsers. When your automatic update breaks people's browsers, congratulations, you have just trained them to turn your updates off. And turning automatic updates off has very serious security impacts.

The real world effect of Mozilla's decision is that Mozilla has now trained some number of users that if they let Mozilla update Firefox, things break. Since users hate having things break, they're going to stop allowing those updates to happen, which will leave them exposed to real Firefox security vulnerabilities that future updates would fix (and we can be confident that there will be such updates). Mozilla did this damage not for a security critical change but for a long term cleanup that they decided was nice to have.

(Note that Mozilla could have taken any of a number of approaches to fix the popular extensions that were known to be broken by this change, since the actual change required to extensions is extremely minimal.)

I don't blame Mozilla for making the initial change; trying to make this change was sensible. I do blame Mozilla's release process for allowing this release to happen knowing that it broke popular extensions and doing nothing significant about it, because Mozilla's release process certainly should care about the security impact of Mozilla's decisions.

by cks at December 16, 2014 03:15 AM

System Administration Advent Calendar

Day 16 - How to Interview Systems Administrators

Written by: Corey Quinn (@quinnypig)
Edited by: Justin Garrison (@rothgar)

There are many blog posts, articles, and even books[0] written on how to effectively interview software engineers. Hiring systems administrators[1] is a bit more prickly of a topic, for a few reasons.

  • You generally hire fewer of them than you do developers[2].
  • A systems administrator likely has root in production. Mistakes will show more readily, and in many environments “peer review” is an aspiration rather than the current state of things.
  • It’s extremely easy to let your systems administration team become “the department of no.” This can have an echo effect that pumps toxicity into your organization. It’s important to hire someone who isn’t going to add overwhelming negativity.

Every job interview since the beginning of time is built around asking candidates three questions. They’ll take different forms, and you’ll dress them up differently each time, but they can be distilled down as follows.

  1. Can you do the job?
  2. Will you like doing the job?
  3. Can we stand working with you?

Doing the Job

This is where the barrage of technical questions comes in. Be careful when selecting what technical areas you want to cover, and how you cover them. Going into stupendous depth on SAN management when you don’t have centralized storage at all is something of a waste of time.

Additionally, many shops equate trivia with mastery of a subject. “Which format specifier to date(1) will spit out the seconds since the Unix epoch began?” The correct answer is of course “man date” unless they, for some reason, have %s memorized– but what does a right answer really tell you past a single bit of data? Being able to successfully memorize trivia doesn’t really speak to someone’s ability to successfully perform in an operational role.

Instead, it probably makes more sense for you to ask open ended questions about things you care about. “So, we have a lot of web servers here. What’s your experience with managing them? What other technology have you worked with in conjunction with serving data over http/https?” This gleans a lot more data than asking trivia questions about configuring virtual hosts in Apache’s httpd. Be aware that some folks will try to talk around the question; politely returning to specific scenarios can help refocus them.

Liking the Job

Hiring people, training them, and the rest of the onboarding process are expensive. Having to replace someone who left due to poor fit, a skills mismatch, or other reasons two months into the job is awful. It’s important to suss out whether or not the candidate is likely to enjoy their work. That said, it’s sometimes difficult to ascertain whether or not the candidate is just telling you what you want to hear. To that end, ask the candidate for specific stories regarding their current and past work. “Tell me about a time you had to deal with a difficult situation.” Push for specific details– you don’t want to hear “the right answer,” you want to know what actually happened.

This questioning technique leads well into the third question…

Not Being a Jerk

If you think back across your career, you can probably think of a systems administrator you’ve met who could easily be named Surly McBastard. You really, really, really don’t want to hire that person. It’s very easy for the sysadmin group to gain the reputation as “the department of no” just due to their job function alone– remember, their goal is stability above all else. Your engineering group (presuming a separate and distinct team from the operations group) is trying to roll new features out. This gives way to a natural tension in most organizations. There’s no need to exacerbate this by hiring someone who’s difficult to work with.

A key indicator here is fanaticism. We all have our favorite pet technologies, but most of us are able to put personal preferences aside in favor of the prevailing consensus. A subset of technologists are unable to do this. “You use Redis? Why?! It’s a steaming pile of crap!” is a great example of what you might not want to hear. A better way for a candidate to frame this sentiment might be “Oh, you’re a Redis shop? That’s interesting– I’ve run into some challenges with it in the past. I’d be very curious to hear how you’ve overcome some challenges…”

Remember, the successful candidate is going to have to deal with other groups of people, and that’s a very challenging thing to interview for. It also helps to remember that interviewing is an inexact science, and everyone approaches it with a number of biases.

For this reason, I strongly recommend having multiple interviewers speak to each candidate, and then compare notes afterwards. It’s entirely possible that one person will pick up on a red flag that others will miss.

Ultimately, interviewing is a challenge on both sides of the table. The best way to improve is to practice– take notes on what works, what doesn’t, and adjust accordingly. Remember that every hire you make shifts your team; ideally you want that to be trending upwards with each successive hire.

[0] I’m partial to myself.
[1] For purposes of this article, “systems administrators” can be expanded to include operations engineers, devops unicorns, network engineers, database wizards, storage gurus, infrastructure perverts, NOC technicians, and other similar roles.
[2] For purposes of this article, “developers” can be expanded to include… you get the idea.

by Christopher Webber at December 16, 2014 12:08 AM

December 15, 2014

Chris Siebenmann

Why your 64-bit Go programs may have a huge virtual size

For various reasons, I build (and rebuild) my copy of the core Go system from the latest development source on a regular basis, and periodically rebuild the Go programs I use from that build. Recently I was looking at the memory use of one of my programs with ps and noticed that it had an absolutely huge virtual size (Linux ps's VSZ field) of around 138 GB, although it had only a moderate resident set size. This nearly gave me a heart attack, since a huge virtual size with a relatively tiny resident set size is one classical sign of a memory leak.

(Builds with earlier versions of Go tended to have much more modest virtual set sizes on the order of 32 MB to 128 MB depending on how long it had been running.)

Fortunately this was not a memory leak. In fact, experimentation soon demonstrated that even a basic 'hello world' program had that huge a virtual size. Inspection of the process's /proc/<pid>/smaps file showed that basically all of the virtual space used was coming from two inaccessible mappings, one roughly 8 GB long and one roughly 128 GB. These mappings had no access permissions (they disallowed reading, writing, and executing) so all they did was reserve address space (without ever using any actual RAM). A lot of address space.

It turns out that this is how Go's current low-level memory management likes to work on 64-bit systems. Simplified somewhat, Go does low level allocations in 8 KB pages taken from a (theoretically) contiguous arena; which pages are free versus allocated is stored in a giant bitmap. On 64-bit machines, Go simply pre-reserves the entire memory address space for both the bitmap and the arena itself. As the runtime and your Go code start to actually use memory, pieces of the arena bitmap and the memory arena will be changed from simple address space reservations into memory that is actually backed by RAM and being used for something.

(Mechanically, the bitmap and arena are initially mmap()'d with PROT_NONE. As memory is used, it is remapped with PROT_READ|PROT_WRITE. I'm not confident that I understand what happens when it's freed up, so I'm not going to say anything there.)
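One way to see how much of a process's virtual size is pure address space reservation is to add up the mappings in smaps that have no permissions at all. What follows is only a minimal Ruby sketch, assuming the usual Linux smaps layout (a header line per mapping followed by a 'Size:' line in kB); the reserved_kb helper and the sample text are made up for illustration and are not output from a real process.

```ruby
# Sum the sizes (in kB) of mappings in smaps-format text that allow no
# reading, writing, or executing (permissions shown as "---p" or "---s").
def reserved_kb(smaps_text)
  total = 0
  inaccessible = false
  smaps_text.each_line do |line|
    if line =~ /\A\h+-\h+\s+(\S{4})\s/
      # Header line of a mapping: "start-end perms offset dev inode [path]"
      inaccessible = $1.start_with?("---")
    elsif inaccessible && line =~ /\ASize:\s+(\d+)\s+kB/
      total += $1.to_i
    end
  end
  total
end

# Illustrative (invented) smaps excerpt with one PROT_NONE reservation.
SAMPLE = <<~EOS
  00400000-00480000 r-xp 00000000 08:01 123 /tmp/hello
  Size:                512 kB
  Rss:                 300 kB
  c000000000-c000100000 rw-p 00000000 00:00 0
  Size:               1024 kB
  Rss:                  64 kB
  c000100000-e000000000 ---p 00000000 00:00 0
  Size:          134216704 kB
  Rss:                   0 kB
EOS

puts "reserved but inaccessible: #{reserved_kb(SAMPLE)} kB"
```

For a live process you would feed it File.read("/proc/#{pid}/smaps") instead of the sample text.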

All of this is the case for the current post Go 1.4 development version of Go. Go 1.4 and earlier behave differently with much lower virtual sizes for running 64-bit programs, although in reading the Go 1.4 source code I'm not sure I understand why.

As far as I can tell, one of the interesting consequences of this is that 64-bit Go programs can use at most 128 GB of memory for most of their allocations (perhaps all of them that go through the runtime, I'm not sure).

For more details on this, see the comments in src/runtime/malloc2.go and in mallocinit() in src/runtime/malloc1.go.

I have to say that this turned out to be more interesting and educational than I initially expected, even if it means that watching ps is no longer a good way to detect memory leaks in your Go programs (mind you, I'm not sure it ever was). As a result, the best way to check this sort of memory usage is probably some combination of runtime.ReadMemStats() (perhaps exposed through net/http/pprof) and Linux's smem program or the like to obtain detailed information on meaningful memory address space usage.

PS: Unixes are generally smart enough to understand that PROT_NONE mappings will never use up any memory and so shouldn't count against things like system memory overcommit limits. However they generally will count against a per-process limit on total address space, which likely means that you can't really combine such limits with post-1.4 Go programs. Since total address space limits are rarely used, this is unlikely to be an issue.

Sidebar: How this works on 32-bit systems

The full story is in the mallocinit() comment. The short version is that the runtime reserves a large enough bitmap to handle 2 GB of memory (which 'only' takes 256 MB) but only reserves 512 MB of address space for the arena out of the 2 GB it could theoretically use. If the runtime later needs more memory, it asks the OS for another block of address space and hopes that it is in the remaining 1.5 GB of address space that the bitmap covers. Under many circumstances the odds are good that the runtime will get what it needs.

by cks at December 15, 2014 06:18 AM

System Administration Advent Calendar

Day 15 - Cook your own packages: Getting more out of fpm

Written by: Mathias Lafeldt (@mlafeldt)
Edited by: Joseph Kern (@josephkern)


When it comes to building packages, there is one particular tool that has grown in popularity over the last few years: fpm. fpm’s honorable goal is to make it as simple as possible to create native packages for multiple platforms, all without having to learn the intricacies of each distribution’s packaging format (.deb, .rpm, etc.) and tooling.

With a single command, fpm can build packages from a variety of sources including Ruby gems, Python modules, tarballs, and plain directories. Here’s a quick example showing you how to use the tool to create a Debian package of the AWS SDK for Ruby:

$ fpm -s gem -t deb aws-sdk
Created package {:path=>"rubygem-aws-sdk_1.59.0_all.deb"}

It is this simplicity that makes fpm so popular. Developers are able to easily distribute their software via platform-native packages. Businesses can manage their infrastructure on their own terms, independent of upstream vendors and their policies. All of this has been possible before, but never with this little effort.

In practice, however, things are often more complicated than the one-liner shown above. While it is absolutely possible to provision production systems with packages created by fpm, it will take some work to get there. The tool can only help you so far.

In this post we’ll take a look at several best practices covering: dependency resolution, reproducible builds, and infrastructure as code. All examples will be specific to Debian and Ruby, but the same lessons apply to other platforms/languages as well.

Resolving dependencies

Let’s get back to the AWS SDK package from the introduction. With a single command, fpm converts the aws-sdk Ruby gem to a Debian package named rubygem-aws-sdk. This is what happens when we actually try to install the package on a Debian system:

$ sudo dpkg --install rubygem-aws-sdk_1.59.0_all.deb
dpkg: dependency problems prevent configuration of rubygem-aws-sdk:
 rubygem-aws-sdk depends on rubygem-aws-sdk-v1 (= 1.59.0); however:
  Package rubygem-aws-sdk-v1 is not installed.

As we can see, our package can’t be installed due to a missing dependency (rubygem-aws-sdk-v1). Let’s take a closer look at the generated .deb file:

$ dpkg --info rubygem-aws-sdk_1.59.0_all.deb
 Package: rubygem-aws-sdk
 Version: 1.59.0
 License: Apache 2.0
 Vendor: Amazon Web Services
 Architecture: all
 Maintainer: <vagrant@wheezy-buildbox>
 Installed-Size: 5
 Depends: rubygem-aws-sdk-v1 (= 1.59.0)
 Provides: rubygem-aws-sdk
 Section: Languages/Development/Ruby
 Priority: extra
 Description: Version 1 of the AWS SDK for Ruby. Available as both `aws-sdk` and `aws-sdk-v1`.
  Use `aws-sdk-v1` if you want to load v1 and v2 of the Ruby SDK in the same

fpm did a great job at populating metadata fields such as package name, version, license, and description. It also made sure that the Depends field contains all required dependencies that have to be installed for our package to work properly. Here, there’s only one direct dependency – the one we’re missing.

While fpm goes to great lengths to provide proper dependency information – and this is not limited to Ruby gems – it does not automatically build those dependencies. That’s our job. We need to find a set of compatible dependencies and then tell fpm to build them for us.

Let’s build the missing rubygem-aws-sdk-v1 package with the exact version required and then observe the next dependency in the chain:

$ fpm -s gem -t deb -v 1.59.0 aws-sdk-v1
Created package {:path=>"rubygem-aws-sdk-v1_1.59.0_all.deb"}

$ dpkg --info rubygem-aws-sdk-v1_1.59.0_all.deb | grep Depends
 Depends: rubygem-nokogiri (>= 1.4.4), rubygem-json (>= 1.4), rubygem-json (<< 2.0)

Two more packages to take care of: rubygem-nokogiri and rubygem-json. By now, it should be clear that resolving package dependencies like this is no fun. There must be a better way.

In the Ruby world, Bundler is the tool of choice for managing and resolving gem dependencies. So let’s ask Bundler for the dependencies we need. For this, we create a Gemfile with the following content:

# Gemfile
source ""
gem "aws-sdk", "= 1.59.0"
gem "nokogiri", "~> 1.5.0" # use older version of Nokogiri

We then instruct Bundler to resolve all dependencies and store the resulting .gem files into a local folder:

$ bundle package
Updating files in vendor/cache
  * json-1.8.1.gem
  * nokogiri-1.5.11.gem
  * aws-sdk-v1-1.59.0.gem
  * aws-sdk-1.59.0.gem

We specifically asked Bundler to create .gem files because fpm can convert them into Debian packages in a matter of seconds:

$ find vendor/cache -name '*.gem' | xargs -n1 fpm -s gem -t deb
Created package {:path=>"rubygem-aws-sdk-v1_1.59.0_all.deb"}
Created package {:path=>"rubygem-aws-sdk_1.59.0_all.deb"}
Created package {:path=>"rubygem-json_1.8.1_amd64.deb"}
Created package {:path=>"rubygem-nokogiri_1.5.11_amd64.deb"}
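If you would rather drive this step from Ruby than from find and xargs, something like the following sketch could do the same job. The fpm flags are the ones used above; the fpm_commands helper is a hypothetical name introduced here, and the sketch assumes fpm is on the PATH and that Bundler has already filled vendor/cache.

```ruby
# Build one fpm command line per .gem file, mirroring `xargs -n1`.
def fpm_commands(gem_paths)
  gem_paths.sort.map { |path| ["fpm", "-s", "gem", "-t", "deb", path] }
end

if __FILE__ == $PROGRAM_NAME
  # Like `find vendor/cache -name '*.gem'`, this also looks in subdirectories.
  gems = Dir.glob("vendor/cache/**/*.gem")
  fpm_commands(gems).each do |cmd|
    # Passing the arguments separately avoids any shell quoting issues.
    system(*cmd) or abort("failed: #{cmd.join(' ')}")
  end
end
```

Stopping at the first failure is a design choice of this sketch; the xargs pipeline above keeps going and reports failure through its exit status instead.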

As a final test, let’s install those packages…

$ sudo dpkg -i *.deb
Setting up rubygem-json (1.8.1) ...
Setting up rubygem-nokogiri (1.5.11) ...
Setting up rubygem-aws-sdk-v1 (1.59.0) ...
Setting up rubygem-aws-sdk (1.59.0) ...

…and verify that the AWS SDK actually can be used by Ruby:

$ ruby -e "require 'aws-sdk'; puts AWS::VERSION"


The purpose of this little exercise was to demonstrate one effective approach to resolving package dependencies for fpm. By using Bundler – the best tool for the job – we get fine control over all dependencies, including transitive ones (like Nokogiri, see Gemfile). Other languages provide similar dependency tools. We should make use of language specific tools whenever we can.
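To make the version constraints from the Depends line and the Gemfile a bit more concrete, here is a small sketch that checks candidate versions using RubyGems' own Gem::Requirement and Gem::Version classes. The CONSTRAINTS table and the acceptable? helper are names made up for this illustration, and this is not how fpm or Bundler resolve dependencies internally.

```ruby
require "rubygems"

# Version constraints copied from the Depends line and the Gemfile above.
# Debian's "<<" (strictly earlier than) corresponds to "<" in RubyGems syntax.
CONSTRAINTS = {
  "json"     => Gem::Requirement.new(">= 1.4", "< 2.0"),
  "nokogiri" => Gem::Requirement.new(">= 1.4.4", "~> 1.5.0"),
}

# True if the given version string satisfies every constraint for the gem.
def acceptable?(gem_name, version)
  CONSTRAINTS.fetch(gem_name).satisfied_by?(Gem::Version.new(version))
end

puts acceptable?("json", "1.8.1")       # the version Bundler picked
puts acceptable?("nokogiri", "1.5.11")  # the version Bundler picked
puts acceptable?("nokogiri", "1.6.0")   # outside the "~> 1.5.0" range
```

The pessimistic operator "~> 1.5.0" means at least 1.5.0 but below 1.6, which is why the Gemfile ends up with Nokogiri 1.5.11 rather than a newer release.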

Build infrastructure

After learning how to build all packages that make up a piece of software, let’s consider how to integrate fpm into our build infrastructure. These days, with the rise of the DevOps movement, many teams have started to manage their own infrastructure. Even though each team is likely to have unique requirements, it still makes sense to share a company-wide build infrastructure, as opposed to reinventing the wheel each time someone wants to automate packaging.

Packaging is often only a small step in a longer series of build steps. In many cases, we first have to build the software itself. While fpm supports multiple source formats, it doesn’t know how to build the source code or determine dependencies required by the package. Again, that’s our job.

Creating a consistent build and release process for different projects across multiple teams is hard. Fortunately, there’s another tool that does most of the work for us: fpm-cookery. fpm-cookery sits on top of fpm and provides the missing pieces to create a reusable build infrastructure. Inspired by projects like Homebrew, fpm-cookery builds packages based on simple recipes written in Ruby.

Let’s turn our attention back to the AWS SDK. Remember how we initially converted the gem to a Debian package? As a warm up, let’s do the same with fpm-cookery. First, we have to create a recipe.rb file:

# recipe.rb
class AwsSdkGem < FPM::Cookery::RubyGemRecipe
  name    "aws-sdk"
  version "1.59.0"
end

Next, we pass the recipe to fpm-cook, the command-line tool that comes with fpm-cookery, and let it build the package for us:

$ fpm-cook package recipe.rb
===> Starting package creation for aws-sdk-1.59.0 (debian, deb)
===> Verifying build_depends and depends with Puppet
===> All build_depends and depends packages installed
===> [FPM] Trying to download {"gem":"aws-sdk","version":"1.59.0"}
===> Created package: /home/vagrant/pkg/rubygem-aws-sdk_1.59.0_all.deb

To complete the exercise, we also need to write a recipe for each remaining gem dependency. This is what the final recipes look like:

# recipe.rb
class AwsSdkGem < FPM::Cookery::RubyGemRecipe
  name       "aws-sdk"
  version    "1.59.0"
  maintainer "Mathias Lafeldt <>"

  chain_package true
  chain_recipes ["aws-sdk-v1", "json", "nokogiri"]
end

# aws-sdk-v1.rb
class AwsSdkV1Gem < FPM::Cookery::RubyGemRecipe
  name       "aws-sdk-v1"
  version    "1.59.0"
  maintainer "Mathias Lafeldt <>"
end

# json.rb
class JsonGem < FPM::Cookery::RubyGemRecipe
  name       "json"
  version    "1.8.1"
  maintainer "Mathias Lafeldt <>"
end

# nokogiri.rb
class NokogiriGem < FPM::Cookery::RubyGemRecipe
  name       "nokogiri"
  version    "1.5.11"
  maintainer "Mathias Lafeldt <>"

  build_depends ["libxml2-dev", "libxslt1-dev"]
  depends       ["libxml2", "libxslt1.1"]
end

Running fpm-cook again will produce Debian packages that can be added to an APT repository and are ready for use in production.

Three things worth highlighting:

  • fpm-cookery is able to build multiple dependent packages in a row (configured by chain_* attributes), allowing us to build everything with a single invocation of fpm-cook.
  • We can use the attributes build_depends and depends to specify a package’s build and runtime dependencies. When running fpm-cook as root, the tool will automatically install missing dependencies for us.
  • I deliberately set the maintainer attribute in all recipes. It’s important to take responsibility for the work that we do. We should make it as easy as possible for others to identify the person or team responsible for a package.

fpm-cookery provides many more attributes to configure all aspects of the build process. Among other things, it can download source code from GitHub before running custom build instructions (e.g. make install). The fpm-recipes repository is an excellent place to study some working examples. This final example, a recipe for chruby, is a foretaste of what fpm-cookery can actually do:

# recipe.rb
class Chruby < FPM::Cookery::Recipe
  description "Changes the current Ruby"

  name     "chruby"
  version  "0.3.8"
  homepage ""
  source   "{version}.tar.gz"
  sha256   "d980872cf2cd047bc9dba78c4b72684c046e246c0fca5ea6509cae7b1ada63be"

  maintainer "Jan Brauer <>"

  section "development"

  config_files "/etc/profile.d/"

  def build
    # nothing to do here
  end

  def install
    make :install, "PREFIX" => prefix
    etc("profile.d").install workdir("")
  end
end

source /usr/share/chruby/

Wrapping up

fpm has changed the way we build packages. We can get even more out of fpm by using it in combination with other tools. Dedicated programs like Bundler can help us with resolving package dependencies, which is something fpm won’t do for us. fpm-cookery adds another missing piece: it allows us to describe our packages using simple recipes, which can be kept under version control, giving us the benefits of infrastructure as code: repeatability, automation, rollbacks, code reviews, etc.

Last but not least, it’s a good idea to pair fpm-cookery with Docker or Vagrant for fast, isolated package builds. This, however, is outside the scope of this article and left as an exercise for the reader.

by Christopher Webber at December 15, 2014 12:00 AM

December 14, 2014

Chris Siebenmann

How init wound up as Unix's daemon manager

If you think about it, it's at least a little bit odd that PID 1 wound up as the de facto daemon manager for Unix. While I believe that the role itself is part of the init system as a whole, this is not the same thing as having PID 1 do the job and in many ways you'd kind of expect it to be done in another process. As with many things about Unix, I think that this can be attributed to the historical evolution Unix has gone through.

As I see the evolution of this, things start in V7 Unix (or maybe earlier) when Research Unix grew some system daemons, things like crond. Something had to start these, so V7 had init run /etc/rc on boot as the minimal approach. Adding networking to Unix in BSD Unix increased the number of daemons to start (and was one of several changes that complicated the whole startup process a lot). Sun added even more daemons with NFS and YP and so on and either created or elaborated interdependencies among them. Finally System V came along and made everything systematic with rcN.d and so on, which was just in time for yet more daemons.

(Modern developments have extended this even further to actively monitoring and restarting daemons if you ask them to. System V init could technically do this if you wanted, but people generally didn't use inittab for this.)

At no point in this process was it obvious to anyone that Unix was going through a major sea change. It's not as if Unix went in one step from no daemons to a whole bunch of daemons; instead there was a slow but steady growth in both the number of daemons and the complexity of system startup in general, and much of this happened on relatively resource-constrained machines where extra processes were a bad idea. Had there been a single giant step, maybe people would have sat down and asked themselves if PID 1 and a pile of shell scripts were the right approach and said 'no, it should be a separate process'. But that moment never happened; instead Unix basically drifted into the current situation.

(Technically speaking you can argue that System V init actually does do daemon 'management' in another process. System V init doesn't directly start daemons; instead they're started several layers of shell scripts away from PID 1. I call it part of PID 1 because there is no separate process that really has this responsibility, unlike the situation in eg Solaris SMF.)

by cks at December 14, 2014 05:56 AM

System Administration Advent Calendar

Day 14 - Using Chef Provisioning to Build Chef Server

Or, Yo Dawg, I heard you like Chef.

Written by: Joshua Timberman (@jtimberman)
Edited by: Paul Graydon (@twirrim)

This post is dedicated to Ezra Zygmuntowicz. Without Ezra, we wouldn’t have had Merb for the original Chef server, chef-solo, and maybe not even Chef itself. His contributions to the Ruby, Rails, and Chef communities are immense. Thanks, Ezra, RIP.

In this post, I will walk through a use case for Chef Provisioning used at Chef Software, Inc.: building a new Hosted Chef infrastructure with Chef Server 12 on Amazon EC2. This isn’t an in-depth how-to guide, but I will illustrate the important components and discuss what is required to set up Chef Provisioning, with a real world example. Think of it as a whirlwind tour of Chef Provisioning and Chef Server 12.


If you have used Chef for a while, you may recall the wiki page “Bootstrap Chef RubyGems Installation” - the installation guide that uses cookbooks with chef-solo to install all the components required to run an open source Chef Server. This idea was a natural fit in the omnibus packages for Enterprise Chef (née Private Chef) in the form of private-chef-ctl reconfigure: that command kicks off a chef-solo run that configures and starts all the Chef Server services.

It should be no surprise that at CHEF we build Hosted Chef using Chef. Yes, it’s turtles and yo-dawg jokes all the way down. As the CHEF CTO Adam described when talking about one Chef Server codebase, we want to bring our internal deployment and development practices in line with what we’re shipping to customers, and we want to unify our approach so we can provide better support.

Chef Server 12

As announced recently, Chef Server 12 is generally available. For purposes of the example discussed below, we’ll provision three machines: one backend, one frontend (with Chef Manage and Chef Reporting), and one running Chef Analytics. While Chef Server 12 has the capability to install add-ons, we have a special cookbook with a resource to manage the installation of “Chef Server Ingredients.” This is so we can also install the chef-server-core package used by both the API frontend nodes and the backend nodes.

Chef Provisioning

Chef Provisioning is a new capability for Chef, where users can define “machines” as Chef resources in recipes, and then converge those recipes on a node. This means that new machines are created using a variety of possible providers (AWS, OpenStack, or Docker, to name a few), and they can have recipes applied from other cookbooks available on the Chef Server.

Chef Provisioning “runs” on a provisioner node. This is often a local workstation, but it could be a specially designated node in a data center or cloud provider. It is simply a recipe run by chef-client (or chef-solo). When using chef-client, any Chef Server will do, including Hosted Chef. Of course, the idea here is we don’t have a Chef Server yet. In my examples in this post, I’ll use my OS X laptop as the provisioner, and Chef Zero as the server.

Assemble the Pieces

The cookbook that does the work using Chef Provisioning is chef-server-cluster. Note that this cookbook is under active development, and the code it contains may differ from the code in this post. As such, I’ll post relevant portions to show the use of Chef Provisioning, and the supporting local setup required to make it go. Refer to the cookbook itself for the most recent information on how to use it.

Amazon Web Services EC2

The first thing we need is an AWS account for the EC2 instances. Once we have that, we need an IAM user that has privileges to manage EC2, and an SSH keypair to log into the instances. It is outside the scope of this post to provide details on how to assemble those pieces. However, once those are acquired, do the following:

Put the access key and secret access key configuration in ~/.aws/config. This is automatically used by chef-provisioning’s AWS provider. The SSH keys will be used in a data bag item (JSON) that is described later. You will then want to choose an AWS region to use. For sake of example, my keypair is named hc-metal-provisioner in the us-west-2 region.

Chef Provisioning needs to know about the SSH keys in three places:

  1. In the .chef/knife.rb, the private_keys and public_keys configuration settings.
  2. In the machine_options that is used to configure the (AWS) driver so it can connect to the machine instances.
  3. In a recipe.

This is described in more detail below.

Chef Repository

We use a Chef Repository to store all the pieces and parts for the Hosted Chef infrastructure. For example purposes I’ll use a brand new repository. I’ll use ChefDK’s chef generate command:

% chef generate repo sysadvent-chef-cluster

This repository will have a Policyfile.rb, a .chef/knife.rb config file, and a couple of data bags. The latest implementation specifics can be found in the chef-server-cluster cookbook.

Chef Zero and Knife Config

As mentioned above, Chef Zero will be the Chef Server for this example, and it will run on a specific port (7799). I started it up in a separate terminal with:

% chef-zero -l debug -p 7799

The knife config file will serve two purposes. First, it will be used to load all the artifacts into Chef Zero. Second, it will provide essential configuration to use with chef-client. Let’s look at the required configuration.

This portion tells chef, knife, and chef-client to use the chef-zero instance started earlier.

chef_server_url 'http://localhost:7799'
node_name       'chef-provisioner'

In the next section, I’ll discuss the policyfile feature in more detail. These configuration settings tell chef-client to use policyfiles, and which deployment group the client should use.

use_policyfile   true
deployment_group 'sysadvent-demo-provisioner'

As mentioned above, these are the configuration options that tell Chef Provisioning where the keys are located. The key files must exist on the provisioning node somewhere.

First here’s the knife config:

private_keys     'hc-metal-provisioner' => '/tmp/ssh/id_rsa'
public_keys      'hc-metal-provisioner' => '/tmp/ssh/'

Then the recipe - this is from the current version of chef-server-cluster::setup-ssh-keys.

fog_key_pair node['chef-server-cluster']['aws']['machine_options']['bootstrap_options']['key_name'] do
  private_key_path '/tmp/ssh/id_rsa'
  public_key_path '/tmp/ssh/'
end

The attribute here is part of the driver options set using the with_machine_options method for Chef Provisioning in chef-server-cluster::setup-provisioner. For further reading about machine options, see Chef Provisioning configuration documentation. While the machine options will automatically use keys stored in ~/.chef/keys or ~/.ssh, we do this to avoid strange conflicts on local development systems used for test provisioning. An issue has been opened to revisit this.


Beware, gentle reader! This is an experimental new feature that may (will) change. However, I wanted to try it out, as it made sense for the workflow when I was assembling this post. Read more about Policyfiles in the ChefDK repository. In particular, read the “Motivation and FAQ” section. Also, Chef (client) 12 is required, which is included in the ChefDK package I have installed on my provisioning system.

The general idea behind Policyfiles is to assemble a node’s run list as an artifact, including all the roles and recipes needed to fulfill its job in the infrastructure. Each Policyfile.rb contains at least the following.

  • name: the name of the policy
  • run_list: the run list for nodes that use this policy
  • default_source: the source where cookbooks should be downloaded (e.g., Supermarket)
  • cookbook: define the cookbooks required to fulfill this policy

As an example, here is the Policyfile.rb I’m using, at the toplevel of the repository:

name            'sysadvent-demo'
run_list        'chef-server-cluster::cluster-provision'
default_source  :community
cookbook        'chef-server-ingredient', '>= 0.0.0',
                :github => 'opscode-cookbooks/chef-server-ingredient'
cookbook        'chef-server-cluster', '>= 0.0.0',
                :github => 'opscode-cookbooks/chef-server-cluster'

Once the Policyfile.rb is written, it needs to be compiled to a lock file (Policyfile.lock.json) with chef install. Installing the policy does the following.

  • Build the policy
  • “Install” the cookbooks to the cookbook store (~/.chefdk/cache/cookbooks)
  • Write the lockfile

This doesn’t put the cookbooks (or the policy) on the Chef Server. We’ll do that in the upload section with chef push.

Data Bags

At CHEF, we prefer to move configurable data and secrets to data bags. For secrets, we generally use Chef Vault, though for the purpose of this example we’re going to skip that here. The chef-server-cluster cookbook has a few data bag items that are required before we can run Chef Client.

Under data_bags, I have these directories/files.

  • secrets/hc-metal-provisioner-chef-aws-us-west-2.json: the name hc-metal-provisioner-chef-aws-us-west-2 is an attribute in the chef-server-cluster::setup-ssh-keys recipe to load the correct item; the private and public SSH keys for the AWS keypair are written out to /tmp/ssh on the provisioner node
  • secrets/private-chef-secrets-_default.json: the complete set of secrets for the Chef Server systems, written to /etc/opscode/private-chef-secrets.json
  • chef_server/topology.json: the topology and configuration of the Chef Server. Currently this doesn’t do much but will be expanded in future to inform /etc/opscode/chef-server.rb with more configuration options

See the chef-server-cluster cookbook for the latest details about the data bag items required.

Note: At this time, chef-vault is not used for secrets, but that will change in the future.

Upload the Repository

Now that we’ve assembled all the required components to converge the provisioner node and start up the Chef Server cluster, let’s get everything loaded on the Chef Server.

Ensure the policyfile is compiled and installed, then push it as the provisioner deployment group. The group name is combined with the policy name in the config that we saw earlier in knife.rb. The chef push command uploads the cookbooks, and also creates a data bag item that stores the policyfile’s rendered JSON.

% chef install
% chef push provisioner

Next, upload the data bags.

% knife upload data_bags

We can now use knife to confirm that everything we need is on the Chef Server:

% knife data bag list
% knife cookbook list
apt                      11131342171167261.63923027125258247.235168191861173
chef-server-cluster      2285060862094129.64629594500995644.198889591798187
chef-server-ingredient   37684361341419357.41541897591682737.246865540583454
chef-vault               11505292086701548.4466613666701158.13536425383812

What’s with those crazy versions? That is what the policyfile feature does. The human-readable versions are no longer used; instead, cookbook versions are locked using unique, automatically generated version strings, so we know the precise cookbook dependency graph for any given policy. When Chef runs on the provisioner node, it will use the versions in its policy. When Chef runs on the machine instances, since they’re not using Policyfiles, it will use the latest version. In the future we’ll have policies for each of the nodes that are managed with Chef Provisioning.
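As a rough illustration of how a content-derived version string like the ones above could be produced, here is a minimal Python sketch. It is not ChefDK's code: the hashing scheme, the 14/14/12 split of the hex digest, and the function names are all assumptions made for this example. The point is only that identical cookbook content always maps to the same version, and any change to the content produces a different one.

```python
import hashlib


def content_identifier(files):
    """Return a 40-character hex digest over sorted (path, content) pairs.

    `files` maps a relative path to the file's bytes. This layout is an
    assumption for illustration, not the real lock file algorithm.
    """
    digest = hashlib.sha1()
    for path in sorted(files):
        digest.update(path.encode("utf-8"))
        digest.update(b"\0")
        digest.update(hashlib.sha1(files[path]).hexdigest().encode("ascii"))
        digest.update(b"\n")
    return digest.hexdigest()


def dotted_version(identifier):
    """Turn a 40-character hex identifier into three dotted decimal fields."""
    parts = (identifier[0:14], identifier[14:28], identifier[28:40])
    return ".".join(str(int(part, 16)) for part in parts)


# Hypothetical cookbook content, only for demonstration.
cookbook = {"metadata.rb": b"name 'apt'\n", "recipes/default.rb": b"# noop\n"}
print(dotted_version(content_identifier(cookbook)))
```

Because the version is a pure function of the content, two people who compile the same policy against the same cookbook files end up with the same locked version.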


At this point, we have:

  • ChefDK installed on the local provisioning node (laptop) with Chef client version 12
  • AWS IAM user credentials in ~/.aws/config for managing EC2 instances
  • A running Chef Server using chef-zero on the local node
  • The chef-server-cluster cookbook and its dependencies
  • The data bag items required to use chef-server-cluster’s recipes, including the SSH keys Chef Provisioning will use to log into the EC2 instances
  • A knife.rb config file that points chef-client at the chef-zero server and tells it to use policyfiles

Chef Client

Finally, the moment (or several moments…) we have been waiting for! It’s time to run chef-client on the provisioning node.

% chef-client -c .chef/knife.rb

While that runs, let’s talk about what’s going on here.

Normally when chef-client runs, it reads configuration from /etc/chef/client.rb. As I mentioned, I’m using my laptop, which has its own run list and configuration, so I need to specify the knife.rb discussed earlier. This will use the chef-zero Chef Server running on port 7799, and the policyfile deployment group.

In the output, we’ll see Chef get its run list from the policy file, which looks like this:

resolving cookbooks for run list: ["chef-server-cluster::cluster-provision@0.0.7 (081e403)"]
Synchronizing Cookbooks:
  - chef-server-ingredient
  - chef-server-cluster
  - apt
  - chef-vault

The rest of the output should be familiar to Chef users, but let’s talk about some of the things Chef Provisioning is doing. First, the following resource is in the chef-server-cluster::cluster-provision recipe:

machine 'bootstrap-backend' do
  recipe 'chef-server-cluster::bootstrap'
  ohai_hints 'ec2' => '{}'
  action :converge
  converge true
end

The first system that we build in a Chef Server cluster is a backend node that “bootstraps” the data store that will be used by the other nodes. This includes the postgresql database, the RabbitMQ queues, etc. Here’s the output of Chef Provisioning creating this machine resource.

Recipe: chef-server-cluster::cluster-provision
  * machine[bootstrap-backend] action converge
    - creating machine bootstrap-backend on fog:AWS:862552916454:us-west-2
    -   key_name: "hc-metal-provisioner"
    -   image_id: "ami-b99ed989"
    -   flavor_id: "m3.medium"
    - machine bootstrap-backend created as i-14dec01b on fog:AWS:862552916454:us-west-2
    - Update tags for bootstrap-backend on fog:AWS:862552916454:us-west-2
    -   Add Name = "bootstrap-backend"
    -   Add BootstrapId = "http://localhost:7799/nodes/bootstrap-backend"
    -   Add BootstrapHost = "champagne.local"
    -   Add BootstrapUser = "jtimberman"
    - create node bootstrap-backend at http://localhost:7799
    -   add normal.tags = nil
    -   add normal.chef_provisioning = {"location"=>{"driver_url"=>"fog:AWS:XXXXXXXXXXXX:us-west-2", "driver_version"=>"0.11", "server_id"=>"i-14dec01b", "creator"=>"user/IAMUSERNAME", "allocated_at"=>1417385355, "key_name"=>"hc-metal-provisioner", "ssh_username"=>"ubuntu"}}
    -   update run_list from [] to ["recipe[chef-server-cluster::bootstrap]"]
    - waiting for bootstrap-backend (i-14dec01b on fog:AWS:XXXXXXXXXXXX:us-west-2) to be ready ...
    - bootstrap-backend is now ready
    - waiting for bootstrap-backend (i-14dec01b on fog:AWS:XXXXXXXXXXXX:us-west-2) to be connectable (transport up and running) ...
    - bootstrap-backend is now connectable
    - generate private key (2048 bits)
    - create directory /etc/chef on bootstrap-backend
    - write file /etc/chef/client.pem on bootstrap-backend
    - create client bootstrap-backend at clients
    -   add public_key = "-----BEGIN PUBLIC KEY-----\n..."
    - create directory /etc/chef/ohai/hints on bootstrap-backend
    - write file /etc/chef/ohai/hints/ec2.json on bootstrap-backend
    - write file /etc/chef/client.rb on bootstrap-backend
    - write file /tmp/ on bootstrap-backend
    - run 'bash -c ' bash /tmp/'' on bootstrap-backend

From here, Chef Provisioning kicks off a chef-client run on the machine it just created. This script is the one that uses CHEF’s omnitruck service. It will install the current released version of Chef, which is 11.16.4 at the time of writing. Note that this is not version 12, so that’s another reason we can’t use Policyfiles on the machines. The chef-client run is started on the backend instance using the run list specified in the machine resource.

Starting Chef Client, version 11.16.4
 resolving cookbooks for run list: ["chef-server-cluster::bootstrap"]
 Synchronizing Cookbooks:
   - chef-server-cluster
   - chef-server-ingredient
   - chef-vault
   - apt

In the output, we see this recipe and resource:

Recipe: chef-server-cluster::default
  * chef_server_ingredient[chef-server-core] action reconfigure
    * execute[chef-server-core-reconfigure] action run
      - execute chef-server-ctl reconfigure

An “ingredient” is a Chef Server component, either the core package (above), or one of the Chef Server add-ons like Chef Manage or Chef Reporting. In normal installation instructions for each of the add-ons, their appropriate ctl reconfigure is run, which is all handled by the chef_server_ingredient resource. The reconfigure actually runs Chef Solo, so we’re running chef-solo in a chef-client run started inside a chef-client run.

The bootstrap-backend node generates some files that we need on other nodes. To make those available using Chef Provisioning, we use machine_file resources.

%w{ actions-source.json webui_priv.pem }.each do |analytics_file|
  machine_file "/etc/opscode-analytics/#{analytics_file}" do
    local_path "/tmp/stash/#{analytics_file}"
    machine 'bootstrap-backend'
    action :download
  end
end

machine_file '/etc/opscode/webui_pub.pem' do
  local_path '/tmp/stash/webui_pub.pem'
  machine 'bootstrap-backend'
  action :download
end

These are “stashed” on the local node - the provisioner. They’re used for Chef Manage webui, and the Chef Analytics node. When the recipe runs on the provisioner, we see this output:

  * machine_file[/etc/opscode-analytics/actions-source.json] action download
    - download file /etc/opscode-analytics/actions-source.json on bootstrap-backend to /tmp/stash/actions-source.json
  * machine_file[/etc/opscode-analytics/webui_priv.pem] action download
    - download file /etc/opscode-analytics/webui_priv.pem on bootstrap-backend to /tmp/stash/webui_priv.pem
  * machine_file[/etc/opscode/webui_pub.pem] action download
    - download file /etc/opscode/webui_pub.pem on bootstrap-backend to /tmp/stash/webui_pub.pem

They are uploaded to the frontend and analytics machines with the files resource attribute. Files are specified as a hash. The key is the target file to upload to the machine, and the value is the source file from the provisioning node.

machine 'frontend' do
  recipe 'chef-server-cluster::frontend'
  files(
        '/etc/opscode/webui_priv.pem' => '/tmp/stash/webui_priv.pem',
        '/etc/opscode/webui_pub.pem' => '/tmp/stash/webui_pub.pem'
       )
end

machine 'analytics' do
  recipe 'chef-server-cluster::analytics'
  files(
        '/etc/opscode-analytics/actions-source.json' => '/tmp/stash/actions-source.json',
        '/etc/opscode-analytics/webui_priv.pem' => '/tmp/stash/webui_priv.pem'
       )
end

Note: These files are transferred using SSH, so they’re not passed around in the clear.

The provisioner will converge the frontend next, followed by the analytics node. We’ll skip the bulk of the output since we saw it earlier with the backend.

  * machine[frontend] action converge
  ... SNIP
    - upload file /tmp/stash/webui_priv.pem to /etc/opscode/webui_priv.pem on frontend
    - upload file /tmp/stash/webui_pub.pem to /etc/opscode/webui_pub.pem on frontend

Here is where the files are uploaded to the frontend, so the webui will work (it’s an API client itself, like knife, or chef-client).

When the frontend runs chef-client, not only does it install the chef-server-core and run chef-server-ctl reconfigure via the ingredient resource, it also gets the manage and reporting addons:

  * chef_server_ingredient[opscode-manage] action install
    * package[opscode-manage] action install
      - install version 1.6.2-1 of package opscode-manage
  * chef_server_ingredient[opscode-reporting] action install
    * package[opscode-reporting] action install
      - install version 1.2.1-1 of package opscode-reporting
Recipe: chef-server-cluster::frontend
  * chef_server_ingredient[opscode-manage] action reconfigure
    * execute[opscode-manage-reconfigure] action run
      - execute opscode-manage-ctl reconfigure
  * chef_server_ingredient[opscode-reporting] action reconfigure
    * execute[opscode-reporting-reconfigure] action run
      - execute opscode-reporting-ctl reconfigure

Similar to the frontend above, the analytics node will be created as an EC2 instance, and we’ll see the files uploaded:

    - upload file /tmp/stash/actions-source.json to /etc/opscode-analytics/actions-source.json on analytics
    - upload file /tmp/stash/webui_priv.pem to /etc/opscode-analytics/webui_priv.pem on analytics

Then, the analytics package is installed as an ingredient, and reconfigured:

* chef_server_ingredient[opscode-analytics] action install
  * package[opscode-analytics] action install
    - install version 1.0.4-1 of package opscode-analytics
* chef_server_ingredient[opscode-analytics] action reconfigure
  * execute[opscode-analytics-reconfigure] action run
    - execute opscode-analytics-ctl reconfigure
Chef Client finished, 10/15 resources updated in 1108.3078 seconds

This will be the last thing in the chef-client run on the provisioner, so let’s take a look at what we have.

Results and Verification

We now have three nodes running as EC2 instances for the backend, frontend, and analytics systems in the Chef Server. We can view the node objects on our chef-zero server:

% knife node list

We can use search:

% knife search node 'ec2:*' -r
3 items found

  run_list: recipe[chef-server-cluster::analytics]

  run_list: recipe[chef-server-cluster::bootstrap]

  run_list: recipe[chef-server-cluster::frontend]

% knife search node 'ec2:*' -a ipaddress
3 items found




If we navigate to the frontend IP, we can sign up using the Chef Server management console, then download a starter kit and use that to bootstrap new nodes against the freshly built Chef Server.

% unzip
  inflating: chef-repo/.chef/sysadvent-demo.pem
  inflating: chef-repo/.chef/sysadvent-demo-validator.pem
% cd chef-repo
% knife client list
% knife node create sysadvent-node1 -d
Created node[sysadvent-node1]

If we navigate to the analytics IP, we can sign in with the user we just created, and view the events from downloading the starter kit: the validator client key was regenerated, and the node was created.

Next Steps

For those following at home, this is now a fully functional Chef Server. It does have premium features (manage, reporting, analytics), but those are free up to 25 nodes. We can also destroy the cluster, using the cleanup recipe. That can be applied by disabling policyfile in .chef/knife.rb:

% grep policyfile .chef/knife.rb
# use_policyfile   true
% chef-client -c .chef/knife.rb -o chef-server-cluster::cluster-clean
Recipe: chef-server-cluster::cluster-clean
  * machine[analytics] action destroy
    - destroy machine analytics (i-5cdac453 at fog:AWS:XXXXXXXXXXXX:us-west-2)
    - delete node analytics at http://localhost:7799
    - delete client analytics at clients
  * machine[frontend] action destroy
    - destroy machine frontend (i-68dfc167 at fog:AWS:XXXXXXXXXXXX:us-west-2)
    - delete node frontend at http://localhost:7799
    - delete client frontend at clients
  * machine[bootstrap-backend] action destroy
    - destroy machine bootstrap-backend (i-14dec01b at fog:AWS:XXXXXXXXXXXXX:us-west-2)
    - delete node bootstrap-backend at http://localhost:7799
    - delete client bootstrap-backend at clients
  * directory[/tmp/ssh] action delete
    - delete existing directory /tmp/ssh
  * directory[/tmp/stash] action delete
    - delete existing directory /tmp/stash

As you can see, the Chef Provisioning capability is powerful, and gives us a lot of flexibility for running a Chef Server 12 cluster. Over time as we rebuild Hosted Chef with it, we’ll add more capability to the cookbook, including HA, scaled out frontends, and splitting up frontend services onto separate nodes.

by Christopher Webber at December 14, 2014 12:00 AM

December 13, 2014

Chris Siebenmann

There are two parts to making your code work with Python 3

In my not terribly extensive experience so far, in the general case porting your code to Python 3 is really two steps in one, not a single process. First, you need to revise your code so that it runs on Python 3 at all; it uses print(), it imports modules under their new names, and so on. Some amount of this can be automated by 2to3 and similar tools, although not all of it. As I discovered, a great deal of this is basically synonymous with modernizing your code to the current best practice for Python 2.7. I believe that almost all of the necessary changes will still work on Python 2.7 without hacks (certainly things like print() will with the right imports from __future__).
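As a small sketch of this first step, here is one way to cope with a renamed module so that the same file runs on both Python 2.7 and Python 3. The helper name host_of is made up for this example; the two module names are the real ones from each standard library.

```python
# Step one: code that runs unchanged on Python 2.7 and Python 3.
try:
    from urllib.parse import urlsplit  # Python 3 module name
except ImportError:
    from urlparse import urlsplit  # Python 2 module name


def host_of(url):
    """Return the lowercased host part of a URL, or None if there is none."""
    return urlsplit(url).hostname


# print() with a single argument behaves the same on both versions.
print(host_of("http://Example.org:8080/index.html"))
```

Nothing here touches Unicode handling at all, which is why this sort of change is mostly mechanical.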

Once your code will theoretically run at all, you need to revise it so that it handles strings as Unicode, which is why calling this process 'porting' is not really a good label. The moment you deal with Unicode you need to consider both character encoding conversion points and what you do on errors. Dealing with Unicode is extra work and confronting it may well require at least a thorough exploration of your code and perhaps a deep rethink of your design. This is not at all like the effort to revise your code to Python 3 idioms.
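To make 'conversion points' and 'what you do on errors' concrete, here is a minimal Python 3 sketch. The function name decode_input and the sample byte strings are assumptions for illustration; the error handler names ('strict', 'replace', 'surrogateescape') are standard ones.

```python
def decode_input(raw, encoding="utf-8", errors="strict"):
    """Decode bytes at one explicit conversion point with a chosen error policy."""
    return raw.decode(encoding, errors)


good = b"caf\xc3\xa9"  # valid UTF-8
bad = b"caf\xe9"       # Latin-1 bytes, not valid UTF-8

text = decode_input(good)

# Policy 1: refuse bad input outright.
try:
    decode_input(bad)
except UnicodeDecodeError as exc:
    print("rejected:", exc.reason)

# Policy 2: keep going, but lose the undecodable byte.
lossy = decode_input(bad, errors="replace")

# Policy 3: smuggle the raw byte through so it can be written back unchanged.
smuggled = decode_input(bad, errors="surrogateescape")
restored = smuggled.encode("utf-8", "surrogateescape")
```

Which of the three policies is right depends on the program, and that is a design decision rather than a mechanical edit.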

(And some people will have serious problems, although future Python 3 versions are dealing with some of the problems.)

Code that has already been written to the latest Python 2.7 idioms will need relatively little revision for Python 3's basic requirements, although I think it always needs some just to cope with renamed modules. Code that was already dealing very carefully with Unicode on Python 2.7 will need little or no revision to deal with Python 3's more forced Unicode model, because it's already effectively operating in that model anyways (although possibly imperfectly in ways that were camouflaged by Python 2.7's handling of this issue).

The direct corollary is that both the amount and type of work you need to do to get your code running under Python 3 depends very much on what it does today with strings and Unicode on Python 2. 'Clean' code that already lives in a Unicode world will have one experience; 'sloppy' code will have an entirely different one. This means that the process and experience of making code work on Python 3 is not at all monolithic. Different people with different code bases will have very different experiences, depending on what their code does (and on how much they need to consider corner cases and encoding errors).

(I think that Python 3 basically just works for almost all string handling if your system locale is a UTF-8 one and you never deal with any input that isn't UTF-8 and so never are confronted with decoding errors. Since this describes a great many people's environments and assumptions, simplistic Python 3 code can get very far. If you're in such a simple environment, the second step of Python 3 porting also disappears; your code works on Python 3 the moment it runs, possibly better than it did on Python 2.)

by cks at December 13, 2014 06:13 AM

System Administration Advent Calendar

Day 13 - Managing Repositories with Pulp

Written by: Justin Garrison (@rothgar)
Edited by: Corey Quinn (@quinnypig)

Your infrastructure and build systems shouldn’t rely on Red Hat, Ubuntu, or Docker’s repositories being available. Your work doesn’t stop when they have scheduled maintenance. You also have no control over when they make updates available. Have you ever run an update and crossed your fingers that unknown package versions wouldn’t break production? How do you manage content and repos between environments? Are you still running reposync or wget in a cron job? It’s time to take control of your system dependencies and use something that is scalable, flexible and repeatable.


That’s where Pulp comes in. Pulp is a platform for managing repositories of content and pushing that content out to large numbers of consumers. Pulp can sync and manage more than just RPMs. Do you want to create your own Docker repository? How about syncing Puppet modules from the forge, or an easy place to host your installation media? Would you like to sync Debian* repositories or have a local mirror of pip*? Pulp can do all of that and is built to scale to your needs.

*Note that some importers are still a work in progress and not fully functional. Pull requests welcome.

How Pulp Works

The first step is to use an importer to get content into a Pulp repository. Importers are plugins, which makes them extremely flexible when dealing with content. With a little bit of work you can build an importer for any content source. Want local gems, maven or CPAN repositories? You can write your own importers and have them working with Pulp repos for just about anything. The content can be synced from external sources, uploaded from local files, or imported between repos. The importer validates content, applies metadata, and removes old content when it syncs. No more guessing if your packages synced or if your cron job failed.

After you have content in a repo, you then use a distributor to publish the content. Like importers, distributors are pluggable and content can be published to multiple locations. A single repo can publish to http(s), ISO, rsync, or any other exporter available. Publishing and syncing can also be scheduled for one or multiple times so you don’t have to worry about your content getting out of date.

Scaling Pulp

Pulp has different components that can be scaled according to your needs. The main components can be broken up into

  • httpd - frontend for API and http(s) published repos
  • pulp_workers - process for long running tasks like repo syncing and publishing
  • pulp_celerybeat - maintains workers and task cancellation
  • pulp_resource_manager - job assigner for tasks
  • mongodb - repo and content metadata value store
  • Qpid/RabbitMQ - message bus for job assigning
  • pulp-admin - cli tool for managing content and consumers
  • consumer - optional agent installed on node to subscribe to repos/content

Here’s how the components interact for a Pulp server.

Pulp components

Because of this modular layout you can scale out each component individually. Need more resources for large files hosted on http? You can scale httpd easily with a load balancer and a couple shared folders. If your syncs and publishes are taking a long time you can add more pulp_workers to get more concurrent tasks running inside Pulp.

If you have multiple datacenters you can mirror your Pulp repos to child nodes, which can replicate all or part of your parent server to each child.

Node topologies

Getting Started

The best architected software in the world is frivolous if you’re not going to use it. With Pulp, hopefully you can find a few use cases. Let’s say you just want better control over what repositories your servers pull content from. If you’re using Puppet you can quickly set up a server using the provided Puppet manifests and then you can mirror EPEL with just a few lines.

class pulp::repo::epel_6 {
  pulp_repo { 'epel-6-x86_64':
    feed       => '',
    serve_http => true,
  }
}

Want to set up dedicated repos for dev, test, and prod? Just create repos for each and schedule content syncing between environment repos. You’ll finally take control over what content gets pushed to each environment. Because Pulp is intelligent with its storage you can make sure you only ever store a needed package once.
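The idea of storing a package once while several environment repos refer to it can be sketched in a few lines of Python. This is a toy model and not Pulp's storage code or API: the class UnitStore, its methods, and the sample file name are all hypothetical.

```python
import hashlib


class UnitStore:
    """Toy content-addressed store: one stored copy per unique payload."""

    def __init__(self):
        self.blobs = {}  # checksum -> payload bytes
        self.repos = {}  # repo name -> {filename: checksum}

    def add(self, repo, filename, payload):
        """Store the payload once and record a reference to it in the repo."""
        checksum = hashlib.sha256(payload).hexdigest()
        self.blobs.setdefault(checksum, payload)
        self.repos.setdefault(repo, {})[filename] = checksum
        return checksum

    def promote(self, source, target):
        """Copy references (not payloads) from one repo to another."""
        self.repos.setdefault(target, {}).update(self.repos.get(source, {}))


store = UnitStore()
store.add("dev", "example-1.0-1.noarch.rpm", b"fake package bytes")
store.promote("dev", "test")
store.promote("test", "prod")
print(len(store.blobs), "stored copy for", len(store.repos), "repos")
```

Promoting content between environments then only moves references around, so dev, test, and prod can differ in what they expose without multiplying disk usage.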

Want to create an internal Docker registry? How about hosting it in Pulp deployed with Docker containers? You can deploy it with a single line in bash. Check out the infrastructure diagram below and learn how to do it in the quickstart documentation.

Getting content to consumers can be as easy as relying on system tools to pull the content like it normally does via http publishing, or you can install the consumer agent and get real time status about what is installed on each node, push content immediately when it is available, or rollback managed content if you find a broken package.


Not only can managing your own repositories greatly improve your control and visibility into your systems, but moving data closer to the nodes can speed up your deployments and simplify your infrastructure. If I haven’t convinced you yet that you should manage the content that goes onto your servers you must either be very trusting or have one doozy of a cron job!

by Christopher Webber at December 13, 2014 12:00 AM

December 12, 2014

Racker Hacker

Install sysstat on Fedora 21

One of the first tools I learned about after working with Red Hat was sysstat. It can write down historical records about your server at regular intervals. This can help you diagnose CPU usage, RAM usage, or network usage problems. In addition, sysstat also provides some handy command line utilities like mpstat, iostat, and pidstat that give you a live view of what your system is doing.

On Debian-based systems (including Ubuntu), you install the sysstat package and enable it with a quick edit to /etc/default/sysstat and the cron job takes it from there. CentOS and Fedora systems call the collector process using a cron job in /etc/cron.d and it’s enabled by default.

Fedora 21 comes with sysstat 11 and there are now systemd unit files to control the collection and management of stats. You can find the unit files by listing the files in the sysstat RPM:

$ rpm -ql sysstat | grep systemd

These services and timers aren’t enabled by default in Fedora 21. If you run sar after installing sysstat, you’ll see something like this:

# sar
Cannot open /var/log/sa/sa12: No such file or directory
Please check if data collecting is enabled

All you need to do is enable and start the main sysstat service:

systemctl enable sysstat
systemctl start sysstat

From there, systemd will automatically call for collection and management of the statistics using its internal timers. Opening up /usr/lib/systemd/system/sysstat-collect.timer reveals the following:

# /usr/lib/systemd/system/sysstat-collect.timer
# (C) 2014 Tomasz Torcz <>
# sysstat-11.0.0 systemd unit file:
#        Activates activity collector every 10 minutes
Description=Run system activity accounting tool every 10 minutes

The timer unit file ensures that the sysstat-collect.service is called every 10 minutes based on the real time provided by the system clock. (There are other options to set timers based on relative time of when the server booted or when a user logged into the system). The familiar sa1 command appears in /usr/lib/systemd/system/sysstat-collect.service:

# /usr/lib/systemd/system/sysstat-collect.service
# (C) 2014 Tomasz Torcz <>
# sysstat-11.0.0 systemd unit file:
#        Collects system activity data
#        Activated by sysstat-collect.timer unit
Description=system activity accounting tool
ExecStart=/usr/lib64/sa/sa1 1 1
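To show what "every 10 minutes based on the real time provided by the system clock" means in practice, here is a small Python sketch that computes the next wall-clock slot. It is not how systemd implements timers; the function name next_collection is made up, and it assumes the interval divides evenly into an hour.

```python
from datetime import datetime, timedelta


def next_collection(now, interval_minutes=10):
    """Return the next wall-clock slot strictly after `now`.

    Slots are aligned to the top of the hour (:00, :10, :20, ...), not to
    when the machine booted. Assumes interval_minutes divides 60 evenly.
    """
    top_of_hour = now.replace(minute=0, second=0, microsecond=0)
    step = timedelta(minutes=interval_minutes)
    slots_passed = (now - top_of_hour) // step
    return top_of_hour + step * (slots_passed + 1)


print(next_collection(datetime(2014, 12, 12, 17, 55, 30)))
```

A timer based on relative time would instead add ten minutes to whenever the machine booted or the unit last ran, so its runs would drift away from the round clock times.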


by Major Hayden at December 12, 2014 05:55 PM

Chris Siebenmann

The bad side of systemd: two recent systemd failures

In the past I've written a number of favorable entries about systemd. In the interests of balance, among other things, I now feel that I should rake it over the coals for today's bad experiences that I ran into in the course of trying to do a yum upgrade of one system from Fedora 20 to Fedora 21, which did not go well.

The first and worst failure is that I've consistently had systemd's master process (ie, PID 1, the true init) segfault during the upgrade process on this particular machine. I can say it's a consistent thing because this is a virtual machine and I snapshotted the disk image before starting the upgrade; I've rolled it back and retried the upgrade with variations several times and it's always segfaulted. This issue is apparently Fedora bug #1167044 (and I know of at least one other person it's happened to). Needless to say this has put somewhat of a cramp in my plans to upgrade my office and home machines to Fedora 21.

(Note that this is a real segfault and not an assertion failure. In fact this looks like a fairly bad code bug somewhere, with some form of memory scrambling involved.)

The slightly good news is that PID 1 segfaulting does not reboot the machine on the spot. I'm not sure if PID 1 is completely stopped afterwards or if it's just badly damaged, but the bad news is that a remarkably large number of things stop working after this happens. Everything trying to talk to systemd fails and usually times out after a long wait, for example attempts to do 'systemctl daemon-reload' from postinstall scripts. Attempts to log in or to su to root from an existing login either fail or hang. A plain reboot will try to talk to systemd and thus fails, although you can force a reboot in various ways (including 'reboot -f').

The merely bad experience is that as a result of this I had occasion to use journalctl (I normally don't). More specifically, I had occasion to use 'journalctl -l', because of course if you're going to make a bug report you want to give full messages. Unfortunately, 'journalctl -l' does not actually show you the full message. Not if you just run it by itself. Oh, the full message is available, all right, but journalctl specifically and deliberately invokes the pager in a mode where you have to scroll sideways to see long lines. Under no circumstance is all of a long line visible on screen at once so that you may, for example, copy it into a bug report.

This is not a useful decision. In fact it is a screamingly frustrating decision, one that is about the complete reverse of what I think most people would expect -l to do. In the grand systemd tradition, there is no option to control this; all you can do is force journalctl to not use a pager or work out how to change things inside the pager to not do this.

(Oh, and journalctl goes out of its way to set up this behavior. Not by passing command line arguments to less, because that would be too obvious (you might spot it in a ps listing, for example); instead it mangles $LESS to effectively add the '-S' option, among other things.)
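For what it's worth, the 'force journalctl to not use a pager' workaround is easy to wrap up if you want full lines in a script. This is only a sketch: the helper names are made up, and the only journalctl options it relies on are -l, --no-pager, and -u.

```python
import subprocess


def journalctl_command(unit=None):
    """Build a journalctl command line that prints full lines without a pager."""
    cmd = ["journalctl", "-l", "--no-pager"]
    if unit is not None:
        cmd += ["-u", unit]
    return cmd


def read_journal(unit=None):
    """Run journalctl and return its output as text (needs a systemd host)."""
    return subprocess.check_output(journalctl_command(unit), universal_newlines=True)


print(" ".join(journalctl_command("sshd.service")))
```

With no pager involved, nothing gets a chance to chop long lines, so the output can be pasted straight into a bug report.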

While I'm here, let me mention that journalctl's default behavior of 'show all messages since the beginning of time in forward chronological order' is about the most useless default I can imagine. Doing it is robot logic, not human logic. Unfortunately the systemd journal is unlikely to change its course in any significant way so I expect we'll get to live with this for years.

(I suppose what I need to do next is find out wherever abrt et al puts core dumps from root processes so that I can run gdb on my systemd core to poke around. Oh wait, I think it's in the systemd journal now. This is my unhappy face, especially since I am having to deal with a crash in systemd itself.)

by cks at December 12, 2014 06:52 AM

System Administration Advent Calendar

Day 12 - Ops and Development Teams: Finding a Harmony

Written by: Nell Shamrell (@nellshamrell)
Edited by: Ben Cotton (@funnelfiasco)

I used to think the term "DevOps" meant infrastructure automation tools like Chef, Puppet, and Ansible. But DevOps is more than that, even more than technology itself. DevOps is a culture, an attitude, a way of doing things where Dev and Ops work together rather than against each other. When you work in IT it's easy to get lost in technical skills. It's what we recruit for, it's what certifications test for. However, I've found that in any professional environment - no matter what the technology - soft skills like communication and taking responsibility are just as important as hard skills like sysadmin or coding.

I’ve worked in software development for 8 years now. In that time I have worked on teams with the classic Operations/Development divide - features and bug fixes were “thrown over the wall” to operations to deploy. I’ve also worked at a company where developers were responsible for their own hand crafted QA, Staging, and Production systems (often crafted by developers without expertise in SysAdmin) with little help from Operations. In working in both these extremes, I’ve found our projects were repeatedly delayed not by technical problems, but by failures in knowing responsibilities, actually taking responsibility, and communication.

Not knowing who is responsible for what kills IT projects. Are the devs just supposed to focus on the application code? Is the Ops team the one who should be woken up when a deploy goes wrong? When we don’t know the answers - or don’t agree on the answers - our project, company, and therefore ourselves suffer for it. Responsibilities will vary from team to team and project to project, but I’d like to at least give you a place to start. Here are 10 general guidelines to what the Ops team, Dev team, and both teams should be responsible for.

Ops Responsibilities

  1. Provide production-like environments for developers to test their code on. Why is this the Ops team’s responsibility, rather than the devs? Because Ops knows the production system better than anyone else and are tasked with protecting the stability of the system. In order to protect it, you must replicate it for developers to use in a safe way to test their features and bug fixes before they go live. Then everyone will sleep better at night.
  2. Provide multiple production-like environments. There is little more frustrating to a developer than to have a feature ready to test on a production-like environment, but be blocked because someone else is already using the environment. One QA and one Staging server are not enough. Prevent it from becoming a bottleneck which delays features and bug fixes, which makes everyone unhappy. If resources are limited, at the very least provide an easily replicated virtual box (kept in sync with production!) that developers can run locally.
  3. Automate and Document procedures for building production-like environments. This will prevent developers from needing to ask you to build a system for them whenever they need to test something. Empower developers to do this themselves by providing automated infrastructure and documentation they can use. Then the developers will be able to get bug fixes and features out to the customers faster, not have to bother you in the process, and everybody wins.

Dev Responsibilities

  1. Test code in a production-like environment before ever declaring a feature or bug fix to be done. “Works on my machine!” is never the definition of done. Gene Kim altered the agile manifesto definition of done to illustrate this: "At the end of each sprint, we must have working and shippable code...demonstrated in an environment that resembles production."
  2. Take full responsibility for deployed code. If a deploy of developer code starts causing havoc in a system and preventing people from getting work done, it is the developer’s fault. Not Ops’, not QA’s: the ultimate responsibility for what code does in production belongs to the developer who wrote the code. If something goes wrong in the middle of the night, the developer should be woken up first, then take the responsibility to wake up further team members if needed.
  3. Read documentation first and try the procedures it suggests before asking the Ops team for help. Respect the Ops team’s time by using any resources they’ve provided first, then ask for help if the problem remains.

Both Teams

  1. Respect how the other team does things - although we have similar goals, often dev and ops have different ways of getting work done. If an Ops person needs to request something of the dev team - or to contribute some code - check for any contributing documents and follow them. The same goes for a Dev person who needs something in the Ops team’s domain. Avoid “just this once” exceptions - those often multiply and turn into cruft which will bring down a system.
  2. Write down procedures and who is responsible for what. As stated earlier, not knowing (or not caring) who is responsible for what aspects of an application will kill that application. Knowing how things get done or are intended to be done is too vital to be tribal knowledge. Write it down, follow it, and refer to it whenever needed. If someone repeatedly refuses to consult the documentation, default to sending them a link to the documentation (or even better the section of the documentation) where they will find their answer. This may seem like an aggressive stance, but time is too scarce in the IT world to repeatedly solve the same problem over and over because someone refuses to look at available documentation.
  3. Never use the phrase “Why don’t you just-”. This comes across as extremely condescending. Teams must respect their teammates’ intelligence and realize that if it were “just” that simple, they probably would have already done it. “Have you considered...” is a good way of rephrasing.
  4. When you don’t know, help find the answer. When someone has a question and you don’t know the answer it's ok to say "I don't know." But your responsibility doesn’t end there. In a professional IT environment, it is your duty to point the questioner to where they might find their answer - whether that's a person, a web resource, or just "Here, let’s try googling that together and see what we find." Help a questioner move forward, rather than stopping them in their tracks.

I am relieved to now work at a company that embraces these principles of taking responsibility and communication (I learned many of these from working there!). Projects are completed faster, both teams are happier, far fewer people are woken in the middle of the night, and the business benefits tremendously. The key to making this work has been clearly establishing who is responsible for what and, when any confusion or blockers come up, communicating immediately. We may have different areas of expertise, but everyone is equally accountable to communicate and take responsibility for a project’s success. And it does work.

As IT professionals we shape how the world works now. It’s not just how people spend money - our work is now vital to how humans travel, how they communicate, how they access utilities like lighting and water, how laws are passed and implemented, and (as IT becomes more integrated into health care) how they physically survive and thrive. The stakes are much too high to let a lack of communication or failure to establish and take responsibility kill an IT project. The weight of the world rests on our shoulders now - we have that great power. It’s time to not only meet but embrace that responsibility!

by Christopher Webber at December 12, 2014 12:05 AM

December 11, 2014

That grumpy BSD guy

The Password? You Changed It, Right?

Right at this moment, there's a swarm of little password guessing robots trying for your router's admin accounts. Do yourself a favor and do some log checking right away. Also, our passwords are certainly worth a series of conferences of their own.

As my Twitter followers may be aware, I spent the first part of this week at the Passwords14 conference in Trondheim, Norway. More about that later, suffice for now to say that the conference was an excellent one, and my own refreshed Hail Mary Cloud plus more recent history talk was fairly well received.

But the world has a way of moving on even while you're not looking, and of course when I finally found a few moments to catch up on my various backlogs while waiting to board the plane for the first hop on the way back from the conference, a particular sequence stood out in the log extracts from one of the Internet-reachable machines in my care:

Dec  9 19:00:24 delilah sshd[21296]: Failed password for invalid user ftpuser from port 37404 ssh2
Dec 9 19:00:25 delilah sshd[6320]: Failed password for invalid user admin from port 38041 ssh2
Dec 9 19:00:26 delilah sshd[10100]: Failed password for invalid user D-Link from port 38259 ssh2
Dec 9 19:03:53 delilah sshd[26709]: Failed password for invalid user ftpuser from port 43261 ssh2
Dec 9 19:03:55 delilah sshd[23796]: Failed password for invalid user admin from port 43575 ssh2
Dec 9 19:03:56 delilah sshd[12810]: Failed password for invalid user D-Link from port 43833 ssh2
Dec 9 19:06:36 delilah sshd[14572]: Failed password for invalid user ftpuser from port 52436 ssh2
Dec 9 19:06:37 delilah sshd[427]: Failed password for invalid user admin from port 53127 ssh2
Dec 9 19:06:38 delilah sshd[28547]: Failed password for invalid user D-Link from port 53393 ssh2
Dec 9 19:14:44 delilah sshd[31640]: Failed password for invalid user ftpuser from port 35760 ssh2

Yes, you read that right. Several different hosts from widely dispersed networks, trying to guess passwords for the accounts they assume exist on your system. One of the user names is close enough to the name of a fairly well known supplier of consumer and SOHO grade network gear that it's entirely possible that it's a special account on equipment from that supplier.

Some catching up on sleep and attending to some high priority tasks later, I found that activity matching the same pattern turned up in a second system on the same network.

By this afternoon (2014-12-11), it seems that all told a little more than 700 machines have come looking for mostly what looks like various manufacturers' names and a few other usual suspects. The data can be found here, with roughly the same file names as in earlier episodes. Full list of attempts on both hosts here, with the rather tedious root only sequences removed here, hosts sorted by number of attempts here, users sorted by number of attempts here, a CSV file with hosts by number of attempts with first seen and last seen dates and times, and finally hosts by number of attempts with listing of each host's attempts. Expect updates to all of these at quasi-random intervals.
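
For readers who want to produce the same kind of per-user and per-host tallies from their own logs, here is a minimal sketch. It assumes the standard OpenSSH "Failed password" line format; the function name `tally` and the sample lines are made up for illustration (the addresses come from documentation ranges), and the address field is treated as optional because some extracts have it removed.

```python
import re
from collections import Counter

# Matches OpenSSH "Failed password" lines. The address is optional so that
# redacted extracts (no address between "from" and "port") still count.
FAILED = re.compile(
    r"Failed password for (?:invalid user )?(?P<user>\S+) from "
    r"(?:(?P<host>\S+) )?port \d+"
)

def tally(lines):
    """Return (attempts per user name, attempts per source host) as Counters."""
    users, hosts = Counter(), Counter()
    for line in lines:
        match = FAILED.search(line)
        if not match:
            continue  # not a failed-password line
        users[match.group("user")] += 1
        if match.group("host"):
            hosts[match.group("host")] += 1
    return users, hosts

# Made-up sample input; 192.0.2.0/24 and 198.51.100.0/24 are documentation ranges.
SAMPLE = [
    "Dec  9 19:00:24 delilah sshd[21296]: Failed password for invalid user ftpuser from 192.0.2.10 port 37404 ssh2",
    "Dec  9 19:00:25 delilah sshd[6320]: Failed password for invalid user admin from 192.0.2.10 port 38041 ssh2",
    "Dec  9 19:00:26 delilah sshd[10100]: Failed password for invalid user D-Link from 198.51.100.7 port 38259 ssh2",
    "Dec  9 19:03:55 delilah sshd[23796]: Failed password for invalid user admin from 198.51.100.7 port 43575 ssh2",
    "Dec  9 19:03:56 delilah sshd[12810]: Failed password for invalid user D-Link from port 43833 ssh2",
]

if __name__ == "__main__":
    users, hosts = tally(SAMPLE)
    for name, count in users.most_common():
        print(count, name)
    for host, count in hosts.most_common():
        print(count, host)
```

Feeding it the lines of an authentication log instead of `SAMPLE` gives the two sorted lists; first-seen and last-seen times would need the timestamp parsed as well, which this sketch leaves out.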

The pattern we see here is quite a bit less stealthy than the classic Hail Mary Cloud pattern. In this sequence we see most of the hosts trying all the desired user names only a few seconds apart, and of course the number of user IDs is very small compared to the earlier attempts. But there seems to be some level of coordination - the attackers move on to the next target in their list, and at least some of them come back for a second try after a while.

Taken together, it's likely that what we're seeing is an attempt to target the default settings on equipment from a few popular brands of networking equipment. It's likely that the plan is to use the captured hosts to form botnets for purposes such as DDoSing. There is at least one publicly known incident that has several important attributes in common with what we're seeing: Norwegian ISP and cable TV supplier GET found themselves forced to implement some ad hoc countermeasures recently (article in Norwegian, but you will find robots) in a timeframe that fits with the earliest attempts we've seen here. I assume similar stories will emerge over the next days or weeks, possibly with more detail than what's available in the article.

If you're seeing something similar in your network and you are in a position to share data for analysis similar to what you see in the files referenced above, I would like to hear from you.

A conference dedicated to passwords and their potential replacements.

Yes, such a thing exists. All aspects of passwords and their potential replacements have been the topics of a series of conferences going back to 2011. This year I finally had a chance to attend the European one, Passwords14 in Trondheim, Norway December 8-10.

The conference has concluded, but you can find the program up still here, and the video from the live stream is archived here (likely to disappear for a few days soon, only to reappear edited into more manageable chunks of sessions or individual talks). You'll find me in the material from the first day, in a slightly breathless presentation (58 slides to 30 minutes talking time), and my slides with links to data and other material are available here.

Even if you're not in a position to go to Europe, there is hope: there will be a Passwords15 conference for the Europe-challenged in Las Vegas, NV, USA some time during the summer of 2015, and the organizers are currently looking for a suitable venue and time for the 2015 European one. I would strongly recommend attending the next Passwords conference; both the formal talks and the hallway track are bound to supply enlightening insights and interesting ideas for any reasonably security oriented geek.

Now go change some passwords!

I'll be at at least some of the BSD themed conferences in 2015, and I hope to see you there.

by Peter N. M. Hansteen at December 11, 2014 09:39 PM

Steve Kemp's Blog

An anniversary and a retirement

On this day last year we got married.

This morning my wife cooked me breakfast in bed for the second time in her life, the first being this time last year. In thanks I will cook a three course meal this evening.


In unrelated news the BlogSpam service will be retiring the XML/RPC API come 1st January 2015.

This means that any/all plugins which have not been updated to use the JSON API will start to fail.

Fingers crossed nobody will hate me too much..

December 11, 2014 10:56 AM

Chris Siebenmann

What good kernel messages should be about and be like

Linux is unfortunately a haven of terrible kernel messages and terrible kernel message handling, as I have brought up before. In a spirit of shouting at the sea, today I feel like writing down my principles of good kernel messages.

The first and most important rule of kernel messages is that any kernel message that is emitted by default should be aimed at system administrators, not kernel developers. There are very few kernel developers and they do not look at very many systems, so it's pretty much guaranteed that most kernel messages are read by sysadmins. If a kernel message is for developers, it's useless for almost everyone reading it (and potentially confusing). Ergo it should not be generated by default settings; developers who need it for debugging can turn it on in various ways (including kernel command line parameters). This core rule guides basically all of the rest of my rules.

The direct consequence of this is that all messages should be clear, without in-jokes or cleverness that is only really comprehensible to kernel developers (especially only subsystem developers). In other words, no yama-style messages. If sysadmins looking at your message have no idea what it might refer to, no lead on what kernel subsystem it came from, and no clue where to look for further information, your message is bad.

Comprehensible messages are only half of the issue, though; the other half is only emitting useful messages. To be useful, my view is that a kernel message should be one of two things: it should either be what they call actionable or it should be necessary in order to reconstruct system state (one example is hardware appearing or disappearing, another is log messages that explain why memory allocations failed). An actionable message should cause sysadmins to do something and really it should mean that sysadmins need to do something.

It follows that generally other systems should not be able to cause the kernel to log messages by throwing outside traffic at it (these days that generally means network traffic), because outsiders should not be able to harm your kernel to the degree where you need to do anything; if this is the case, they are not actionable for the sysadmin of the local machine. And yes, I bang on this particular drum a fair bit; that's because it keeps happening.

Finally, almost all messages should be strongly ratelimited. Unfortunately I've come around to the view that this is essentially impossible to do at a purely textual level (at least with acceptable impact for kernel code), so it needs to be considered everywhere kernel code can generate a message. This very definitely includes things like messages about hardware coming and going, because sooner or later someone is going to have a flaky USB adapter or SATA HD that starts disappearing and then reappearing once or twice a second.
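
To make the per-call-site idea concrete, here is a small sketch of an interval-and-burst limiter, written in Python purely for illustration (kernel code would be C, and Linux's own `printk_ratelimited()` is the real-world counterpart in spirit). The class name `RateLimit`, the helper `replay`, the message text and the numbers are all made up; the clock is passed in so the behaviour is easy to follow.

```python
class RateLimit:
    """Allow at most `burst` messages per `interval` seconds for one call site."""

    def __init__(self, interval, burst):
        self.interval = interval
        self.burst = burst
        self.window_start = None
        self.emitted = 0
        self.suppressed = 0  # how many messages were dropped in total

    def allow(self, now):
        # Open a fresh window once the previous one has expired.
        if self.window_start is None or now - self.window_start >= self.interval:
            self.window_start = now
            self.emitted = 0
        if self.emitted < self.burst:
            self.emitted += 1
            return True
        self.suppressed += 1
        return False

def replay(events, limiter):
    """Feed (timestamp, message) pairs through the limiter; return what is logged."""
    return [message for timestamp, message in events if limiter.allow(timestamp)]

# A flaky adapter that disappears and reappears twice a second for ten seconds.
events = [(i * 0.5, "usb adapter disconnected") for i in range(20)]
limiter = RateLimit(interval=5.0, burst=2)
logged = replay(events, limiter)

if __name__ == "__main__":
    print(len(logged), "logged,", limiter.suppressed, "suppressed")
```

Each place that can emit a message gets its own limiter state, which is what makes this a code-level decision rather than something that can be bolted on by filtering text afterwards.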

To say this more compactly, everything in your kernel messages should be important to you. Kernel messages should not be a random swamp that you go wading in after problems happen in order to see if you can spot any clues amidst the mud; they should be something that you can watch live to see if there are problems emerging.

by cks at December 11, 2014 03:53 AM

System Administration Advent Calendar

Day 11 - Turning off the Pacemaker: Load Balancing Across Layer 3

Written by: Jan Ivar Beddari (@beddari)
Edited by: Joseph Kern (@josephkern)


Traditional load balancing usually brings to mind a dedicated array of Layer 2 devices connected to a server farm, with all of the devices preferably coming from the same vendor. But the latest techniques in load balancing are being implemented as open source software and standards driven Layer 3 protocols. Building new load balancing stacks away from the traditional (often vendor controlled) Layer 2 technologies opens up the network edge and creates a flexible multi-vendor approach to systems design that many small organizations are embracing and leaves many larger organizations wondering why they should care.

Layer 2 is deprecated!

The data center network as a concept is changing. The traditional three layer design model - access, aggregation and core - is being challenged by simpler, more cost effective models where internet proven technology is reused inside the data center. Today, building or redesigning a data center network for modern workloads would likely include running Layer 3 and routing protocols all the way down to the top-of-rack (ToR) switches. Once that is done, there is no need for Layer 2 between more than two adjacent devices. As a result, each and every ToR interface would be a point-to-point IP subnet with its own minimal broadcast domain. Conceptually it could look something like this:

[Figure: Layer 3-only network]

Removing Layer 2 from the design process and accepting Layer 3 protocols and routing appears to be the future for networks and service design. This can be a hard transition if you work in a more traditional environment. Security, deployment practices, management and monitoring tools, and a lot of small details need to change when your design process removes Layer 2. One of those design details that need special consideration is load balancing.

Debating the HAProxy single point of failure

Every team or project that has deployed HAProxy has had a conversation about load balancing and resiliency. These conversations often start with the high ideal of eliminating single points of failure (SPoF) and end with an odd feeling that we might be trading one SPoF for another. A note: I’m not a purist, and I tend to casually mix the concept of load balancing with that of achieving basic network resilience. Apologies in advance for my lack of formality; practical experience suggests that dealing with these concepts separately does less to actually solve problems. How then do we deploy HAProxy for maximum value with the least amount of effort in this new Layer 3 environment?

The simplest solution, and possibly my favorite one, would be to not bother with any failover for HAProxy at all. HAProxy is an incredible software engineering effort and given stable hardware it will just run, run and run. HAProxy will work and there will be no magic happening: if you reboot the node where it runs or have any kind of downtime, your services will be down. As expected. That’s the point: you know what to expect and you will get exactly that. I think we sometimes underestimate the importance of making critical pieces of infrastructure as simple as possible. If you know why and at what cost, just accepting that your HAProxy is and will be a SPoF can be your best bet.

Good design practice: Always question situations where a service must run without a transparent failover mechanism. Is this appropriate? Do I understand the risk? Have the people that depend on this service understood and accepted this risk?

But providing failover for a HAProxy service isn’t trivial. Is it even worth implementing? Maybe using Pacemaker or keepalived to cluster the HAProxy will work? Or might there be better alternatives that have been created while you are reading this post?

Let’s say that for the longest time you did run your HAProxy as a SPoF and it worked very well: it was never overloaded, and whatever downtime you experienced wasn’t related to the load balancer. But then someone decides that all parts and components in your service have to be designed with a realtime failover capability. With a background in development or operations, I think most people would default to building a solution on proven software like Pacemaker or keepalived. Pacemaker is a widely used cluster resource management tool that covers a wide array of use cases around running and operating software clusters. keepalived has a simpler design with fewer features, relying on the Virtual Router Redundancy Protocol (VRRP) for IP-based failover. Given how services are evolving towards Layer 3 mechanisms, using either of these tools might not be the best decision. Pacemaker and keepalived in their default configurations rely on moving a virtual IP address (VIP) inside a single subnet. They just will not work in a modern data center without providing legacy Layer 2 broadcast domains as exceptions to the Layer 3 design.

But the Layer 2 broadcast domain requirement of Pacemaker and keepalived is a limitation that can be ignored or worked around. Ignoring it would involve doing things like placing VIP resources inside an “availability-stretched” overlay network, e.g. inside an OpenStack tenant network or a subnet inside a single Amazon availability zone. This is a horrible idea. Not only does this build a critical path on top of services not really designed for that, it would also not achieve anything beyond the capabilities of a single HAProxy instance. When it comes to workarounds, keepalived could allow VRRP unicast to be routed, thus “escaping” the single subnet limitation. Pacemaker uses a VIPArip resource that allows management of IP aliases across different subnets. I don’t think these designs would make enough sense (i.e. be simple enough) to design a solution around. Working around a single broadcast domain limitation by definition would involve modifying your existing Layer 3 routing; better value exists elsewhere.

Solving the HAProxy SPoF problem

Now, if you have a little more than just basic networking skills, or are lucky enough to work with people that do, you might be aware of a solution that is both elegant and scalable. Using routing protocols it is possible to split the traffic to a VIP across upstream routers and have multiple HAProxy instances process the flows. The reason this can work is that most modern routers are able to do load balancing per flow so that each TCP session consistently gets the same route - this means it will also get the same HAProxy instance. This is not a new practice and has been done for years in organizations that operate large scale services. In the wider operations community though, there doesn’t seem to be much discussion. Technically, it is not hard or complicated, but it requires skills and experience that are less common.

Knowing the basics, there are multiple ways of accomplishing this. CloudFlare uses Anycast to solve this problem, and a blog post by Allan Feid at Shutterstock explains how you could run ExaBGP to announce or withdraw routes to a service. In short, if HAProxy is up serving connections, use ExaBGP to announce to the upstream router that the route is available. In case of failure, do the opposite: tell the router that this route is no longer available for traffic.
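
As a rough sketch of the announce/withdraw idea: ExaBGP can launch a helper process and read routing commands from its standard output. The command wording below (`announce route ... next-hop self` and the matching `withdraw`) follows ExaBGP's text API as I understand it, so check it against the documentation for your version. The service prefix, the function names and the list of health samples are all hypothetical.

```python
import sys

SERVICE_PREFIX = "192.0.2.1/32"  # hypothetical service IP (documentation range)

def route_command(healthy, prefix=SERVICE_PREFIX):
    """Build the command line for the current health state."""
    verb = "announce" if healthy else "withdraw"
    return "{} route {} next-hop self".format(verb, prefix)

def transitions(samples, prefix=SERVICE_PREFIX):
    """Yield a command only when health changes, starting from 'not announced'."""
    announced = False
    for healthy in samples:
        if healthy != announced:
            announced = healthy
            yield route_command(healthy, prefix)

def emit(samples, out=sys.stdout):
    """Write each command on its own line and flush so the reader sees it at once."""
    for command in transitions(samples, SERVICE_PREFIX):
        out.write(command + "\n")
        out.flush()

if __name__ == "__main__":
    # Stand-in for a real periodic HAProxy check: up, up, down, down, up.
    emit([True, True, False, False, True])
```

Only state changes produce output, so a healthy load balancer announces once and then stays quiet until something actually breaks.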

I’m going to describe a solution that is similar but expands on it a bit. I hope you begin to see your services and datacenter a little differently.

[Figure: HAProxy ECMP]

In this scenario there are two routers, r1 and r2, both announcing a route to the service IP across the network. This announcement is done using routing protocols like BGP or OSPF. It does not matter which one is used, for our use-case, they are functionally very close. Depending on how the larger network around r1 and r2 is designed they might not be receiving equal amounts of the traffic towards the service IP. If so, it is possible to have the routers balance the workload across n0 before routing the traffic to the service.

These routers (r1 and r2) are connected to both load balancers across different link networks (n1 through n4) and have two equal cost routes set up to the service IP. They know they can reach the service IP across both links and must make a routing decision about which one to use.

[Figure: HAProxy ECMP hashing]

The routers then use a hashing algorithm on the packets in the flow to make that decision. A typical algorithm looks at Layer 3 and Layer 4 information as a tuple, e.g. source IP, destination IP, source port and destination port, and then calculates a hash. If configured correctly, both routers will calculate the same hash and consequently install the same route, routing traffic to the same load balancer instance. Configuring hashing algorithms on the routers is what I’d consider the hardest part of getting a working solution. Not all routers are able to do it and trying to find documentation about it is hard.

Another approach is to not use hardware routers at all and rely only on the Linux kernel and a software routing daemon like BIRD or Quagga. These would then be serving as dedicated routing servers in the setup, replacing the r1 and r2 hardware devices.
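
The flow hashing described above can be sketched in a few lines. This is an illustration of the principle only: real routers use their own (often vendor-specific) hash functions in hardware, and SHA-256 is simply a convenient stand-in here. The function name `pick_next_hop`, the load balancer names and the addresses are made up.

```python
import hashlib

def pick_next_hop(src_ip, dst_ip, src_port, dst_port, next_hops):
    """Map a flow's 4-tuple to one of the equal-cost next hops."""
    key = "{}|{}|{}|{}".format(src_ip, dst_ip, src_port, dst_port).encode()
    digest = hashlib.sha256(key).digest()
    # Use the first four bytes of the digest as an integer, then pick a slot.
    index = int.from_bytes(digest[:4], "big") % len(next_hops)
    return next_hops[index]

# Both routers hold the same ordered list of equal-cost next hops.
LOAD_BALANCERS = ["lb1", "lb2"]  # hypothetical HAProxy instances

if __name__ == "__main__":
    flow = ("198.51.100.20", "192.0.2.1", 40123, 443)
    on_r1 = pick_next_hop(*flow, next_hops=LOAD_BALANCERS)
    on_r2 = pick_next_hop(*flow, next_hops=LOAD_BALANCERS)
    print(on_r1, on_r2)  # the same instance both times
```

Because the choice depends only on the tuple and the ordered list of next hops, every packet of a TCP session lands on the same HAProxy instance regardless of which router it arrives at, provided both routers are configured with the same hash and the same list.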

Regardless of using hardware or software routers, what makes this setup effective is that you do not interrupt any traffic when network changes take place. If r1 is administratively taken offline, routing information in the network will be updated so that the peering routers only use r2 as a destination for traffic towards the service IP. As for HAProxy it does not need to know that this is happening. Existing sessions (flows) won’t be interrupted and will be drained off r1.

[Figure: HAProxy ECMP failure]

For minimizing unplanned downtime, optimizing the configuration on r1 and r2 for fast convergence (quick recovery from an unknown state) is essential. Rather than adjusting routing protocol timers I’d recommend using other forms of convergence optimization, like Bidirectional Forwarding Detection (BFD). The BFD failure detection timers have much shorter time limits than the failure detection mechanisms in the routing protocols, so they provide faster detection. This means recovery can be fast, even sub-second, and data loss minimized.
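
To put a number on "sub-second": in BFD's asynchronous mode (RFC 5880), the local detection time is the detect multiplier received from the peer multiplied by the agreed transmit interval, which is the larger of what the local side is willing to receive and what the peer wants to send. The timer values below are illustrative assumptions, not recommendations, and the function name is made up.

```python
def detection_time_ms(local_required_min_rx_ms, remote_desired_min_tx_ms,
                      remote_detect_mult):
    """Detection time in milliseconds for BFD asynchronous mode."""
    # The peer may not send faster than the local side agrees to receive.
    agreed_interval_ms = max(local_required_min_rx_ms, remote_desired_min_tx_ms)
    return remote_detect_mult * agreed_interval_ms

if __name__ == "__main__":
    # Assumed example: 300 ms intervals on both sides, multiplier of 3.
    print(detection_time_ms(300, 300, 3), "ms until the session is declared down")
```

With those assumed values a dead peer is detected after 900 ms, whereas a routing protocol relying on its own hello and dead timers commonly takes tens of seconds (OSPF's usual default dead interval is 40 seconds).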

Automated health checks (mis)using Bidirectional Forwarding Detection

Now we need to define how to communicate with the routers from the HAProxy instances. They need to be able to communicate with the routing layer to start or stop sending traffic in their direction. In practice that means signalling the routers to add or withdraw one of the routes to the service IP address. Again, there are multiple ways of achieving this but simplicity is our goal. For this solution I’ll again focus on BFD. My contacts over at UNINETT in Trondheim have had success using OpenBFDD, an open source implementation of BFD, to initiate these routing updates. BFD is a network protocol used to detect faults between devices connected by a link. Standards-wise it is at the RFC stage according to Wikipedia. It is low-overhead and fast, so it’s perfect for our simple functional needs. While both Quagga and BIRD have support for BFD, OpenBFDD can be used as a standalone mechanism, removing the need for running a full routing daemon on your load balancer.

To set this up, you would run bfdd-beacon on your HAProxy nodes, and then send it commands from its control utility bfdd-control. Of course this is something you’d want to automate in your HAProxy health status checks. As an example, this is a simple Python daemon that will run in the background, check HAProxy status and interfaces every second, and signal the upstream routers about state changes:

#!/usr/bin/env python
import os.path
import requests
import time
import subprocess
import logging
import logging.handlers
import argparse
from daemonize import Daemonize

APP = "project_bfd_monitor"
DESCRIPTION = "Check that everything is ok, and signal link down via bfd otherwise"
ADMIN_DOWN_MARKER = "/etc/admin_down"
HAPROXY_CHECK_URL = "http://localhost:1936/haproxy_up"
LOG_FORMAT = '%(asctime)s %(name)s %(levelname)s %(message)s'

def check_state(interface):
    # HAProxy must answer its check URL within one second.
    try:
        response = requests.get(HAPROXY_CHECK_URL, timeout=1)
        response.raise_for_status()
    except requests.RequestException:
        return "down"
    # The monitored interface must be operationally up according to sysfs.
    ifstate_filename = "/sys/class/net/{}/operstate".format(interface)
    if not os.path.exists(ifstate_filename):
        return "down"
    with open(ifstate_filename) as ifstate_file:
        if ifstate_file.read().strip() != "up":
            return "down"
    # An operator can take the node out of service by creating the marker file.
    if os.path.exists(ADMIN_DOWN_MARKER):
        return "admin"
    return "up"

def set_state(new_state):
    if new_state not in ("up", "down", "admin"):
        raise ValueError("Invalid new state: {}".format(new_state))
    # Signal the new state to the routers through all BFD sessions.
    subprocess.call(["/usr/local/bin/bfdd-control", "session", "all", "state", new_state])

def main(logfile, interface):
    logging.basicConfig(level=logging.INFO, format=LOG_FORMAT)
    if logfile:
        handler = logging.handlers.RotatingFileHandler(logfile,
                                                       maxBytes=10*1024**3, backupCount=5)
        handler.setFormatter(logging.Formatter(LOG_FORMAT))
        logging.getLogger("").addHandler(handler)
    state = check_state(interface)
    set_state(state)
    logging.info("bfd-check starting, initial state: %s", state)
    while True:
        # Re-check once a second and only signal the routers on a change.
        time.sleep(1)
        new_state = check_state(interface)
        if new_state != state:
            set_state(new_state)
            logging.info("state changed from %s to %s", state, new_state)
            state = new_state

def parse_args():
    parser = argparse.ArgumentParser(description=DESCRIPTION)
    parser.add_argument('-d', '--daemonize', default=False, action='store_true',
                        help="Run as daemon")
    parser.add_argument('--pidfile', type=str, default="/var/run/{}.pid".format(APP),
                        help="pidfile when run as daemon")
    parser.add_argument('--logfile', default='/var/log/{}.log'.format(APP),
                        help="logfile to use")
    parser.add_argument('--interface', help="Downstream interface to monitor status of")

    return parser.parse_args()

if __name__ == '__main__':
    args = parse_args()
    if args.daemonize:
        daemon_main = lambda: main(args.logfile, args.interface)
        daemon = Daemonize(app=APP, pid=args.pidfile, action=daemon_main)
        daemon.start()
    else:
        # In the foreground, log to the console only.
        main(None, args.interface)

There are three main functions. check_state() requests the HAProxy check URL and checks the status of the monitored network interface through sysfs. main() runs a while True loop that calls check_state() every second. If the state has changed, set_state() will be called and a bfdd-control subprocess run to signal the new state through the protocol to the listening routers. One interesting thing to note about the admin down state - it is actually part of the BFD protocol standard as defined by the RFC. As a consequence, taking a load balancer out of service is as simple as marking it down, then waiting for its sessions to drain.

When designing the HAProxy health checks to run there is an important factor to remember. You don’t want complicated health checks, and there is no point in exposing any application or service-related checks through to the routing layer. As long as HAProxy is up serving connections, we want to continue receiving traffic and stay up.


Standards driven network design with core services implemented using open source software is currently gaining acceptance. Traditionally, developer and operations teams have had little knowledge and ownership of this part of infrastructure. Shared knowledge of Layer 3 protocols and routing will be crucial for any organization building or operating IT services going forward. Load balancing is a problem space that needs multi-domain knowledge, and greatly benefits from integrated teams. Integrating knowledge of IP networks and routing with application service delivery allows the creation of flexible load balancing systems while staying vendor-neutral.

Interesting connections are most often made when people of diverse backgrounds meet and form a new relationship. If you know a network engineer, the next time you talk, ask about Bidirectional Forwarding Detection and convergence times. It could become an interesting conversation!

Thank you to Sigmund Augdal at UNINETT for sharing the Python code in this article. His current version is available on their Gitlab service.

by Christopher Webber at December 11, 2014 12:00 AM

December 10, 2014

Standalone Sysadmin

Leaving the LOPSA Board

It’s with some amount of sorrow and trepidation that I begin this blog entry.

One of the things that I often need to be reminded of is my own limitations. I think we all can forget that we’ve got human limits and sometimes we take on more than we can deal with. I am a chronic “joiner”. I like people, I like to build communities and organizations, and I like to put forth effort to make things happen.

By itself, this is fine, but in the macro, I try to do too much - certainly more than I can accomplish. My work suffers across the board from my lack of attention in any one area. It’s like the old problem of task switching, but when the tasks are completely unrelated to each other, it’s like context switching my entire brain out, and when I do it too often, I lose because of how inefficient it is. Worse than that, the tasks suffer.

For a long time, I was able to not let that be a massive problem, because I worked hard to keep myself out of the “critical path”, so that when I was concentrating on task B, task A could comfortably wait. But that’s not the case anymore. The quality of my work has been suffering, and it’s to the point where not only is everything I’ve been doing mediocre, those organizations where I’m in the critical path have suffered, and I’m no longer willing to make other people suffer because of my problem of taking on too much.

Effective today at noon, I’m resigning as a Director of LOPSA. This might be surprising given how much I wanted to actively work and lead the change that I believe the organization needs, and I can tell you that no one is sorrier than I am that I’m stepping down. This isn’t me “breaking up” with the organization. I still believe that the organization has a lot to offer and its community of IT Admins is a potent force capable of a lot of good. But I’m not going to serve as a sea anchor to slow it down just when it needs to be more agile.

I’m really fond of the “golf ball an hour” analogy, and I’m going to start spending my golf balls on my family, and improving my IT skills. I remember when I was a good sysadmin. I don’t feel like that anymore. It’s not impostor syndrome in this case. It’s that I haven’t spent the time honing my skills and keeping up. So I’m going to try to fix that. And maybe I’ll be able to get some blog entries written about what I learn along the way.

So anyway, I’m going back to being a community member rather than a community leader, and I’m fine with that. The other LOPSA Board members have been very supportive of my decision, and I thank them for that, and I thank my many friends who have done the same.

If you were one of the many people who voted for me in the LOPSA Board election when I ran, thank you. You can take heart in the fact that I believe I was able to make some significant changes in the 18 months I served, and I really think that the organization is more aware of what its possibilities are than it ever has been. I’m glad I had the chance to serve and contribute. Thank you for giving me that opportunity.

by Matt Simmons at December 10, 2014 04:00 PM

Everything Sysadmin

Interview by Win Treese in InformIT

Win Treese interviewed me and my co-authors about the book.

An Interview with the authors of "The Practice of Cloud System Administration" on DevOps and Data Security

We discussed DevOps in the enterprise, trends in system administration, and at the end I got riled up and ranted about how terrible computer security has become.

December 10, 2014 03:00 PM

Security Monkey

Keurig 2.0: Hacked: Documented DRM Bypass!

It's been well documented in all of my case files that I love coffee. I practically live off of the stuff, intertwined with occasional cups of green tea. Since I'm constantly on the go, there are days when I must forego the art of the French press...

December 10, 2014 09:15 AM

System Administration Advent Calendar

Day 10 - DevOps View of Cryptocurrency

Written by: Bob Feldbauer (@bobfeldbauer)
Edited by: Hugh Brown (@saintaardvark)

Since Bitcoin was introduced in 2009, cryptocurrencies have steadily increased in popularity and thousands of digital currencies have been created. The current market capitalization of the top ~500 digital currencies, including Bitcoin and other coins (often called “altcoins”), has reached $5.9 billion [1]. In 2014, more than $300 million in venture capital was invested in startups focused on digital currencies [2]. As popularity and use cases for digital currencies grow, gaining a solid understanding of the core concepts and best practices involved is increasingly important for devops professionals.

Before we dive into practical implementation details, management, and risks, let’s review the basic concepts of the blockchain, mining, and wallets.

Mining the Blockchain

The blockchain [3] is widely considered to be the core innovation of Bitcoin, and it acts as a permanent public ledger of transactions. Digital currencies use the principles of public-key (or asymmetric) cryptography [4], relying on a public and private key to transfer coins from one address to another (which is called a “transaction”). Transactions are grouped together and encoded into blocks approximately every 10 minutes, although it is worth noting that’s just an average. Mining is the process by which blocks are created, and the fundamental principle of digital currency networks is that mining power acts in a distributed, P2P fashion to verify and “vote” on the validity of new blocks being created.

When a new valid block is agreed on by a majority of miners, it is added to the blockchain. Each new block references a hash of the previous block, which is how the blockchain is formed. Miners commonly generate blocks in one of two ways. The most common method, called “proof-of-work” [5] (or “POW”), is used in Bitcoin, Litecoin, etc. The second method, called “proof-of-stake” [6] (or “POS”), is used by Peercoin, Blackcoin, and many newer altcoins. New blocks with proof-of-stake are generated by holding coins within a wallet.
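The way each block references a hash of the previous block can be illustrated with a toy example. This is only a sketch of the linking idea: the block layout, the use of JSON, and the function names are assumptions made for illustration, and it is not Bitcoin's real block format (there is no mining or proof-of-work here at all).

```python
import hashlib
import json


def block_hash(block):
    """Hash a block's contents deterministically with SHA-256."""
    encoded = json.dumps(block, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def append_block(chain, transactions):
    """Add a block that references the hash of the previous block."""
    # The very first block has no predecessor, so it points at all zeroes.
    previous = block_hash(chain[-1]) if chain else "0" * 64
    chain.append({"previous_hash": previous, "transactions": transactions})
    return chain


def chain_is_valid(chain):
    """Check that every block points at the hash of the one before it."""
    return all(
        chain[i]["previous_hash"] == block_hash(chain[i - 1])
        for i in range(1, len(chain))
    )


# Made-up transactions, purely for illustration.
chain = []
append_block(chain, ["alice pays bob 1"])
append_block(chain, ["bob pays carol 0.5"])
append_block(chain, ["carol pays dave 0.2"])
print(chain_is_valid(chain))
```

Because every block's hash covers the previous block's hash, changing an old transaction changes the hash that all later blocks depend on, which is what makes the ledger hard to rewrite quietly.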

For in-depth details about the technical structure of Bitcoin blocks and transactions, see Bitcoin for Mere Mortals and Bitcoins the Hard Way.


From Dogecoin to Bitcoin, each digital currency has its own form of wallet software but they tend to provide most of the same core functions. Wallets primarily exist to store your public and private keys, and interact with the blockchain by creating transactions. Most coins have similar basic functionality and you usually interact with them in pretty standard ways – both from a devops perspective and as a user.

Interestingly, many of the innovations from altcoins have been primarily wallet-based, with new features including things like anonymous transactions, traditional point-of-sale functionality, smart contracts, encrypted messaging services, etc.

Another job that wallets can perform, for coins that utilize proof-of-stake, is to generate new blocks and verify transactions on the blockchain based on the coins held in the wallet instead of traditional computationally intensive mining power.

Service Implementation

From a devops perspective, interacting with most wallets is accomplished by using a JSON RPC API that the wallet can be configured to expose. You can do that the hard way by writing your own JSON calls, or use one of the many popular libraries available in most languages.

To get started, here’s a list of API calls and the API reference with examples in various languages. Most altcoins will use the same core API calls, although they may extend the API with their own functionality.

There are numerous libraries out there for interacting with the JSON RPC API in most languages. Some of the more popular ones include bitcoin-pythonlib, bitcoin-ruby, and bitcoin-php.
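For a sense of what "the hard way" looks like, here is a minimal sketch using only the Python standard library. It assumes the wallet accepts HTTP basic authentication and a JSON body with method and params fields, as Bitcoin's wallet does; the URL, credentials, and helper names are placeholders invented for this example, and a real service would probably prefer one of the libraries above.

```python
import base64
import json
import urllib.request


def build_rpc_request(url, user, password, method, params=None, request_id=1):
    """Build (but do not send) an HTTP request for a wallet's JSON RPC API."""
    payload = {
        "jsonrpc": "1.0",
        "id": request_id,
        "method": method,
        "params": params or [],
    }
    credentials = base64.b64encode(
        "{}:{}".format(user, password).encode("utf-8")
    ).decode("ascii")
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": "Basic " + credentials,
        },
    )


def call_rpc(request, timeout=10):
    """Send the request and return the "result" field of the reply."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        reply = json.loads(response.read().decode("utf-8"))
    # Some failures arrive as HTTP error statuses instead, which urlopen
    # raises as urllib.error.HTTPError before this check is reached.
    if reply.get("error"):
        raise RuntimeError(reply["error"])
    return reply["result"]


# Hypothetical local wallet; the URL and credentials are placeholders.
request = build_rpc_request(
    "http://127.0.0.1:8332/", "rpcuser", "rpcpassword", "getinfo"
)
```

Calling call_rpc(request) against a running wallet would return the decoded result; building the request separately from sending it keeps the payload easy to inspect and log.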

Unless you’re talking about an extreme volume of transactions, you won’t likely need to build your own custom wallet implementation. Some tuning of wallet configuration settings, hardware, etc. may be required to improve performance though.

Wallet Configuration

Beyond setting the usual options like user, password, and port, if you’re building a service that interacts with the wallet you’ll also want to restrict access to the RPC interface, increase the number of connections, and bump up the thread count.

Coin wallets normally use a simple configuration file for their settings. Here’s an example configuration to get you started, which should work for Bitcoin and the vast majority of altcoins:


The first three options (rpcuser, rpcpassword, and rpcport) should be obvious enough - they define the user access control for the RPC interface. The next five (server, listen, gen, testnet, and daemon) are just to tell the wallet you want to start a daemon, not try to uselessly mine new blocks with your server’s CPU, and operate on the main blockchain for the coin instead of the test network.

Setting rpcthreads and maxconnections is required to improve performance when you’re building services so you don’t get bottlenecked by the default limits. And finally, rpcallowip is set to restrict the wallet’s JSON RPC interface from being accessed by any outside address. Do note that securing your wallets via normal firewall rules, appropriate user permissions, etc. is still important and security will be discussed a bit more later.
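As a rough illustration of how the options described above fit together, this sketch renders them into the simple key=value format that wallet configuration files use. The option names come from the text; every value here (the credentials, port, thread count, and connection limit) is a placeholder assumption rather than a recommendation, and render_wallet_config is a hypothetical helper.

```python
def render_wallet_config(settings):
    """Render settings as the key=value lines a coin wallet config file uses."""
    return "\n".join(
        "{}={}".format(key, value) for key, value in settings.items()
    ) + "\n"


# All values are placeholders chosen for illustration; tune them for your service.
example_settings = {
    "rpcuser": "walletrpc",
    "rpcpassword": "change-me-to-a-long-random-string",
    "rpcport": 8332,
    "server": 1,               # accept JSON RPC commands
    "listen": 1,               # accept connections from other nodes
    "gen": 0,                  # do not mine with the server's CPU
    "testnet": 0,              # use the main blockchain, not the test network
    "daemon": 1,               # run in the background
    "rpcthreads": 100,         # assumed value, raised above the default
    "maxconnections": 200,     # assumed value, raised above the default
    "rpcallowip": "127.0.0.1"  # only allow RPC from the local machine
}

config_text = render_wallet_config(example_settings)
print(config_text)
```

Generating the file from a dictionary like this makes it easy to template per coin in a configuration management tool, since most altcoins accept the same core options.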

*Note for Proof-of-Stake Coins

If you’re building services that interact with a coin that uses proof-of-stake to generate blocks, you’ll want to disable staking to avoid complications. Some proof-of-stake coins’ wallets won’t respect one setting or the other to disable staking, so you’ll want to set both of these options:


(Here, reservedbalance is a number larger than the number of coins you’ll have in the wallet at any time.)

Efficient Transactions

When sending transactions, you’ll normally need to pay a transaction fee to get miners to include your transaction in a block. To improve performance and reduce the amount you spend on transaction fees, you’ll want to avoid sending a separate transaction for each payment your service requires, if possible. You can typically reduce your transaction fees by at least 60–80% by batching payments using the sendmany RPC call to send coins to multiple addresses in a single transaction, instead of using the standard sendtoaddress call.
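A minimal sketch of the batching idea follows. It only builds the JSON body for a sendmany call and does not send anything; the addresses are made-up placeholders rather than valid coin addresses, build_sendmany_payload is a hypothetical helper, and the empty account name is an assumption about how the wallet is used.

```python
import json
from decimal import Decimal


def build_sendmany_payload(payments, from_account="", request_id=1):
    """Combine many (address, amount) payments into one sendmany request body."""
    totals = {}
    for address, amount in payments:
        # Several payments to one address collapse into a single output.
        totals[address] = totals.get(address, Decimal("0")) + Decimal(amount)
    return {
        "jsonrpc": "1.0",
        "id": request_id,
        "method": "sendmany",
        # Amounts are summed as Decimal and converted only for JSON encoding.
        "params": [
            from_account,
            {address: float(total) for address, total in totals.items()},
        ],
    }


# Placeholder addresses and amounts, purely for illustration.
payments = [("addr-alice", "0.1"), ("addr-bob", "0.25"), ("addr-alice", "0.2")]
payload = build_sendmany_payload(payments)
print(json.dumps(payload, indent=2))
```

Three queued payments become one transaction with two outputs here, so the service pays one fee instead of three.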


Don’t rely on built-in wallet “accounts” for your accounting, although they may initially appear useful. Balances reported by the built-in account system should not be considered reliable [7] or accurate, and the “accounts” functionality is being deprecated entirely in new versions of Bitcoin. Track incoming and outgoing transactions by transaction id, make sure to verify the number of confirmations meets your requirements, and handle your own accounting in a database of your choice.

Also, be sure to regularly audit your expected balance against the balance the wallet reports in a getinfo RPC call. If you spot a discrepancy, your wallet may be corrupted and you’ll need to work on repairing it.
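The audit described above might look something like this sketch. The ledger layout (txid, amount, confirmations), the function names, and the sample figures are all assumptions for illustration; in practice the ledger would come from your own database and the wallet balance from a getinfo call. The confirmation threshold is a policy choice and needs to match how the wallet itself counts its balance.

```python
from decimal import Decimal


def confirmed_balance(ledger, min_confirmations=1):
    """Sum ledger entries whose confirmation count meets the requirement."""
    return sum(
        (
            Decimal(entry["amount"])
            for entry in ledger
            if entry["confirmations"] >= min_confirmations
        ),
        Decimal("0"),
    )


def audit(ledger, wallet_balance, min_confirmations=1):
    """Return the wallet's reported balance minus what our own books expect."""
    return Decimal(wallet_balance) - confirmed_balance(ledger, min_confirmations)


# Made-up ledger rows; outgoing transactions are recorded as negative amounts.
ledger = [
    {"txid": "tx-1", "amount": "2.5", "confirmations": 12},
    {"txid": "tx-2", "amount": "-1.0", "confirmations": 8},
    {"txid": "tx-3", "amount": "0.75", "confirmations": 0},
]

print(audit(ledger, "1.5"))  # zero means the books and the wallet agree
```

Amounts are kept as Decimal values built from strings so that repeated additions do not accumulate floating point error, which matters when the whole point is spotting small discrepancies.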

Repairing a Corrupted Wallet

Like anything else, backups are important. Be sure to make regular backups and test restores regularly, before you have to deal with a corrupted wallet. Hopefully you already have good backups of your wallet’s .dat file (normally just called wallet.dat) that you could restore, but the basic process for wallet repair under most coins looks like this:

  1. Make a note of any pending transactions that you’ve sent but that haven’t yet reached at least 1 confirmation on the blockchain. This is accomplished with a “listtransactions” commandline or JSON RPC call and checking the number of confirmations listed for recent transactions.

  2. Stop the wallet, then make a backup of the entire wallet directory.

  3. Run the wallet from the commandline with -rescan to see if your wallet has simply missed some transaction[s]. If that fixes your balance discrepancy, you’re done.

  4. If -rescan didn’t help, run -salvagewallet. If that doesn’t help, you’re stuck trying to recover the private keys and importing them into a new wallet…

  5. To accomplish the export/import, without a previous good backup, you’ll likely end up using pywallet.
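Step 1 of the list above can be sketched as a small filter over decoded listtransactions output. The sample entries are invented, and the sketch assumes each entry carries the category, confirmations, and txid fields that Bitcoin's wallet reports; other coins may differ.

```python
def pending_sends(transactions):
    """Return txids of sent transactions that have no confirmations yet."""
    return [
        tx["txid"]
        for tx in transactions
        if tx.get("category") == "send" and tx.get("confirmations", 0) < 1
    ]


# Invented sample of decoded listtransactions output, for illustration only.
sample = [
    {"txid": "tx-a", "category": "send", "confirmations": 0, "amount": -1.0},
    {"txid": "tx-b", "category": "receive", "confirmations": 0, "amount": 2.0},
    {"txid": "tx-c", "category": "send", "confirmations": 3, "amount": -0.5},
]

print(pending_sends(sample))
```

Writing the resulting txids down before stopping the wallet gives you something to check against once the repaired or restored wallet is running again.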


You can duplicate your wallet and have it running on multiple servers at the same time. Active-passive and active-active setups are both possible. From an infrastructure perspective, the key point is that each running wallet functions as an independent node on the coin network. Running the same wallet on multiple servers means that the public/private keys are present and transactions can be generated anywhere.


Always keep in mind that your wallet holds your public/private keys, which are all that is needed to spend coins. Proper systems security is beyond the scope of this post, but there are a few cryptocurrency-specific items to be aware of which we’ll discuss.


The first layer of protection on a wallet is to use the wallet’s built in encryption with a strong password. When you spend coins, you’ll need to enter the password, which adds at least a reasonable basic layer of protection.


As with any form of valuable data, you’ll want to keep multiple encrypted, secured backups of your digital currency wallets. If you lose your wallet’s private keys, you’ve lost the ability to spend your coins.

“Hot” and “Cold” wallets

These are just fancy terms for what most of us would define as “online” and “offline”. Hot wallets are where you store frequently accessed funds or a sufficient supply for normal transaction volume if you’re running a service. Having large amounts of coins in a hot wallet is unwise because of the risk of loss should your wallet be compromised.

Cold wallets store your coins offline so they can’t be accessed without physical action, whether that’s in the form of a physical coin with a private key attached, a piece of paper in a safe, an encrypted hardware wallet, a USB drive, or otherwise. See for more information on cold storage.

For additional security on your cold storage, you can use multi-signature protocols to ensure a single compromised source couldn’t spend your coins. “Multisig” [8] transactions work by requiring multiple private keys to sign a transaction before the coins can be spent.


As Bitcoin and altcoins grow in popularity, DevOps professionals will need to understand the issues involved in developing, managing, and securing digital currencies. Although this article is not a comprehensive guide, hopefully you’ve learned enough to get started. Further reading can be found in the “Additional Resources” section.

Additional Resources

  1. Mastering Bitcoin
  2. Bitcoin Developer Guide
  3. Bitcoin Reading List
  4. Developers Introduction to Bitcoin


  1. Bitcoin Venture Capital
  2. CoinMarketCap
  3. Public-key Cryptography
  4. Block chain
  5. Proof-of-work
  6. Proof-of-stake
  7. Accounts explained
  8. Multisig: The future of Bitcoin

by Christopher Webber at December 10, 2014 08:23 AM

Chris Siebenmann

How to delay your fileserver replacement project by six months or so

This is not exactly an embarrassing confession, because I think we made the right decisions for the long term, but it is at least an illustration of how a project can get significantly delayed one little bit at a time. The story starts back in early January, when we had basically finalized the broad details of our new fileserver environment; we had the hardware picked out and we knew we'd run OmniOS on the fileservers and our current iSCSI target software on some distribution of Linux. But what Linux?

At first the obvious answer was CentOS 6, since that would get us a nice long support period and RHEL 5 had been trouble-free on our existing iSCSI backends. Then I really didn't like RHEL/CentOS 6 and didn't want to use it here for something we'd have to deal with for four or five years to come (especially since it was already long in the tooth). So we switched our plans to Ubuntu, since we already run it everywhere else, and in relatively short order I had a version of our iSCSI backend setup running on Ubuntu 12.04. This was probably complete some time in late February, based on circumstantial evidence.

Eliding some rationale, Ubuntu 12.04 was an awkward thing to settle on in March or so of this year because Ubuntu 14.04 was just around the corner. Given that we hadn't built and fully tested the production installation, we might actually have wound up in the position of deploying 12.04 iSCSI backends after 14.04 had actually come out. Since we didn't feel in a big rush at the time, we decided it was worthwhile to wait for 14.04 to be released and for us to spin up the 14.04 version of our local install system, which we expected to have done by not too long after the 14.04 release. As it happened it was June before I picked the new fileserver project up again and I turned out to dislike Ubuntu 14.04 too.

By the time we knew we didn't really want to use Ubuntu 14.04, RHEL 7 was out (it came out June 10th). While we couldn't use it directly for local reasons, we thought that CentOS 7 was probably going to be released soon and that we could at least wait a few weeks to see. CentOS 7 was released on July 7th and I immediately got to work, finally getting us back on track to where we probably could have been at the end of January if we'd stuck with CentOS 6.

(Part of the reason that we were willing to wait for CentOS 7 was that I actually built a RHEL 7 test install and everything worked. That not only proved that CentOS 7 was viable, it meant that we had an emergency fallback if CentOS 7 was delayed too long; we could go into at least initial production with RHEL 7 instead. I believe I did builds with CentOS 7 beta spins as well.)

Each of these decisions was locally sensible and delayed things only a moderate bit, but the cumulative effects delayed us by five or six months. I don't have any great lesson to point out here, but I do think I'm going to try to remember this in the future.

by cks at December 10, 2014 05:14 AM

December 09, 2014

Chris Siebenmann

Why I do unit tests from inside my modules, not outside them

In reading about how to do unit testing, one of the divisions I've run into is between people who believe that you should unit test your code strictly through its external API boundaries and people who will unit test code 'inside' the module itself, taking advantage of internal features and so on. The usual arguments I've seen for doing unit tests from outside the module are that your API working is what people really care about and this avoids coupling your tests too closely to your implementation, so that you don't have the friction of needing to revise tests if you revise the internals. I don't follow this view; I write my unit tests inside my modules, although of course I test the public API as much as possible.

The primary reason why I want to test from the inside is that this gives me much richer and more direct access to the internal operation of my code. To me, a good set of unit tests involves strongly testing hypotheses about how the code behaves. It is not enough to show that it works for some cases and then call it a day; I want to also poke the dark corners and the error cases. The problem with going through the public API for this is that it is an indirect way of testing things down in the depths of my code. In order to reach down far enough, I must put together a carefully contrived scenario that I know reaches through the public API to reach the actual code I want to test (and in the specific way I want to test it). This is extra work, it's often hard and requires extremely artificial setups, and it still leaves my tests closely coupled to the actual implementation of my module code. Forcing myself to work through the API alone is basically testing theater.

(It's also somewhat dangerous because the coupling of my tests to the module's implementation is now far less obvious. If I change the module implementation without changing the tests, the tests may well still keep passing but they'll no longer be testing what I think they are. Oops.)

Testing from inside the module avoids all of this. I can directly test that internal components of my code work correctly without having to contrive peculiar and fragile scenarios that reach them through the public API. Direct testing of components also lets me immediately zero in on the problem if one of them fails a test, instead of forcing me to work backwards from a cascade of high level API test failures to find the common factor and realize that oh, yeah, a low level routine probably isn't working right. If I change the implementation and my tests break, that's okay; in a way I want them to break so that I can revise them to test what's important about the new implementation.

(I also believe that directly testing internal components is likely to lead to cleaner module code due to needing fewer magic testing interfaces exposed or semi-exposed in my APIs. If this leads to dirtier testing code, that's fine with me. I strongly believe that my module's public API should not have anything that is primarily there to let me test the code.)

by cks at December 09, 2014 05:35 AM

System Administration Advent Calendar

Day 9 - D3 for SysAdmins

Written by: Anthony Elizondo (@complex)
Edited by: Shaun Mouton (@sdmouton)

In this post we will take an introductory look at D3.js and using it on relatively "raw" datasets such as those in CSV or JSON.

D3.js is a powerful Javascript library that allows you to build SVG images and animations, implemented with HTML5 and CSS.

There are some who will tell you D3.js has a steep learning curve. I think the reason for this is it lives at the center of a Venn diagram where the circles are Javascript, CSS, math, and graphic design. Not the most common skill set.

But don’t let that scare you. You’re awesome! You can do it!

Visualizing your Data

Are you collecting metrics? A recent survey by James Turnbull of Kickstarter says that 90% of you are. (If not, why not?) Ideally these metrics are fed into a permanent installation of a tool that can be used for measuring performance, alarming on errors, or just general trending. And that is great.

But perhaps you just want to whip up something on an ad-hoc basis. D3.js is superb at this. With simply a CSV file and a web server (even something as simple as "python -m SimpleHTTPServer 8000", see William Bowers' list) you can create something that works on any modern browser, mobile included.

Data Type and Format

Before we can do anything, we have to think about what type of data we have. More precisely, what is the story we want to tell about it? We might have a collection of ohai output (JSON) from 10,000 servers. This type of data would be nicely visualized with a force-directed graph, perhaps even sorted by type. Maybe you have logs of concurrent sessions on HA Proxy or your F5 load balancer. Then this heat map would work well. Your data might be as simple as a single metric, but you want to chart it over time. A simple line chart would be sufficient.

Fundamentals of D3.js

D3.js operates on the DOM of a web page. You can boil most of its operation down to three phases. First, it creates SVGs and adds ("appends") them to the page. Next it reads in data embedded in the page itself, from a separate file, or from another online source via AJAX. Then, it performs transforms to the SVG elements based on this data, or perhaps based on user input.

Simple Example

Now we’ll try to create a simple example. Assume our data, after some awk mangling, is in a TSV. The first column is the date in YYYY-MM-DD format, the second is a scalar indicating how many servers we have.
date   servers
2013-12-08  4343
2013-12-07  4328
2013-12-06  4325
First, include D3 in our HTML file,
<script src=""></script>
Now create the SVG.
var svg = d3.select("body").append("svg")
    .attr("width", 860)
    .attr("height", 500)
  .append("g")
    .attr("transform", "translate(" + margin.left + "," + margin.top + ")");
Time to load our data.
d3.tsv("vmcount.tsv", function(error, data) {
  data.forEach(function(d) {
    // Parse the date column and coerce the server count to a number.
    d.date = parseDate(d.date);
    d.servers = +d.servers;
  });
And draw it.
  svg.append("path")
      .datum(data)
      .attr("class", "line")
      .attr("d", line);
});

The .append("g") and .attr("d", line) calls are not magic; they map directly onto SVG. "g" is an SVG element that groups everything inside it together as one. "d" is an attribute of the path element that defines the path to follow. The three bits of Javascript above are the core of the work, but they alone are insufficient. There is a bit more required to bring it to life and make it look nice. That includes defining axes, scales, labels and domains (in the geometric sense). The full working example can be found here. It adds all the features mentioned above, plus a fancy hover to show precise values.

On Your Way

Hopefully I’ve demystified D3.js for you a bit with this short introduction. To dive deeper I suggest browsing some of Mike Bostock’s simpler examples, and don’t be afraid to ask your friendly neighborhood frontend developer for help! :) The full D3.js API can be found here.


by Shaun Mouton at December 09, 2014 12:24 AM

December 08, 2014

Everything Sysadmin

Book Excerpt: Capacity Planning has published an excerpt from our book "The Practice of Cloud System Administration: Designing and Operating Large Distributed Systems Vol 2".

The article has a title that implies it is about capacity planning for data centers but it's really about capacity planning for any system or service.

Room to grow: Tips for data center capacity planning

If you like it, there are 547 more pages of good stuff like that in the book.

December 08, 2014 05:00 PM