Planet SysAdmin

October 22, 2014


Getting stuck in Siberia

I went on a bit of a twitter rant recently.

Good question, since that's a very different problem than the one I was ranting about. How do you deal with that?

I hate to break it to you, but if you're in the position where your manager is actively avoiding you, it's all on you to fix it. There are cases where it isn't up to you, such as when a lot of people are being avoided and it's affecting the manager's work performance, but that's a systemic problem. No, the case I'm talking about is the one where you are being avoided and your fellow direct reports are not. It's personal, not systemic.

No, it's not fair. But you still have to deal with it.

You have a question to ask yourself:

Do I want to change myself to keep the job, or do I want to change my manager by getting a new job?

Because this shunning activity is done by managers who would really rather fire your ass, but can't or won't for some reason. Perhaps they don't have firing authority. Perhaps the paperwork is too much to bother with firing someone. Perhaps they're the conflict-avoidant type and pretending you don't exist is preferable to making you Very Angry by firing you.

You've been non-verbally invited to Go Away. You get to decide if that's what you want to do.

Going Away

Start job-hunting, and good riddance. They may even overlook job-hunt activities on the job, but don't push it.

Staying and Escalating

They can't/won't get rid of you, but you're still there. It's quite tempting to stick around and intimidate your way into their presence and force them to react. They're avoiding you for a reason, so hit those buttons harder. This is not the adult way to respond to the situation, but they started it.

I shouldn't have to say that, but this makes for a toxic work environment for everyone else so... don't do that.

Staying and Reforming

Perhaps the job itself is otherwise awesome-sauce, or maybe getting another job will involve moving and you're not ready for that. Time to change yourself.

Step 1: Figure out why the manager is hiding from you.
Step 2: Stop doing that.
Step 3: See if your peace-offering is accepted.

Figure out why they're hiding

This is key to the whole thing. Maybe they see you as too aggressive. Maybe you keep saying no and they hate that. Maybe you never give an unqualified answer and they want definites. Maybe you always say, 'that will never work,' to anything put before you. Maybe you talk politics in the office and they don't agree with you. Maybe you don't go paintballing on weekends. Whatever it is...

Stop doing that.

It's not always easy to know why someone is avoiding you. That whole avoidant thing makes it hard. Sometimes you can get intelligence from coworkers about what the manager has been saying when you're not around or what happens when your name comes up. Ask around; at the very least it'll show you're aware of the problem.

And then... stop doing whatever it is. Calm down. Say yes more often. Start qualifying answers only in your head instead of out loud. Say, "I'll see what I can do" instead of "that'll never work." Stop talking politics in the office. Go paintballing on weekends. Whatever it is, start establishing a new set of behaviors.

And wait.

Maybe they'll notice and warm up. It'll be hard, but you probably need the practice to change your habits.

See if your peace-offering is accepted

After your new leaf is turned over, it might pay off to draw their attention to it. This step definitely depends on the manager and the source of the problem, but demonstrating a new way of behaving before saying you've been behaving better can be the key to get back into the communications stream. It also hangs a hat on the fact that you noticed you were in bad graces and took effort to change.

What if it's not accepted?

Then learn to live in Siberia and work through proxies, or lump it and get another job.

by SysAdmin1138 at October 22, 2014 08:00 PM

Everything Sysadmin

Katherine Daniels (@beerops) interviews Tom Limoncelli

Katherine Daniels (known as @beerops on Twitter) interviewed me about the presentations I'll be doing at the upcoming Usenix LISA '14 conference. Check it out:

Register soon! Seating in my tutorials is limited!

October 22, 2014 03:00 PM

Steve Kemp's Blog

On writing test-cases and testsuites.

Last night I mostly patched my local copy of less to build and link against the PCRE regular expression library.

I've wanted to do that for a while, and reading Raymond Chen's blog post last night made me try it out.

The patch was small and pretty neat, and I'm familiar with GNU less having patched it in the past. But it doesn't contain tests.

Test cases are hard. Many programs, such as less, are used interactively which makes writing a scaffold hard. Other programs suffer from a similar fate - I'm not sure how you'd even test a web browser such as Firefox these days - mangleme would catch some things, eventually, but the interactive stuff? No clue.

In the past MySQL had a free set of test cases, but my memory is that Oracle locked them up. SQLite is famous for its decent test coverage. But off the top of my head I can't think of other things.

As a topical example there don't seem to be decent test-cases for either bash or openssl. If it compiles it works, more or less.

I did start writing some HTTP-server test cases a while back, but that was just to automate security attacks. e.g. Firing requests like:

GET /../../../etc/passwd HTTP/1.0
GET //....//....//....//etc/passwd HTTP/1.0

(It's amazing how many of the toy HTTP-server components included in projects and products aren't decent HTTP servers.)

I could imagine that being vaguely useful, especially because it is testing the protocol-handling rather than a project-specific codebase.
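A protocol-level test of this kind could be sketched in a few lines of Python. This is only an illustration: the helper names `build_request` and `looks_leaked` are hypothetical, the probe list is just the two requests quoted above, and the "leak" check is a crude heuristic that assumes a Unix-style passwd file.

```python
# Hypothetical sketch: build raw HTTP/1.0 traversal probes and judge a reply.
# Nothing here talks to the network; sending the bytes is left to the caller.

PROBE_PATHS = [
    "/../../../etc/passwd",
    "//....//....//....//etc/passwd",
]


def build_request(path):
    """Return the raw bytes of a minimal HTTP/1.0 GET request for path."""
    return ("GET %s HTTP/1.0\r\n\r\n" % path).encode("ascii")


def looks_leaked(response_body):
    """Heuristic: a leaked passwd file normally has a line starting 'root:'."""
    return any(line.startswith(b"root:") for line in response_body.splitlines())
```

Because the probes are plain bytes, the same list can be replayed against any server regardless of which project it is embedded in.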

Anyway, I'm thinking writing test cases for things is good, but struggling to think of a decent place to start. The project has to be:

  • Non-interactive.
  • Open source.
  • Widely used - to make it a useful contribution.
  • Not written in some fancy language.
  • Open to receiving submissions.

Comments welcome; but better yet why not think about the test-coverage of any of your own packages and projects...?

October 22, 2014 09:21 AM

Chris Siebenmann

Exim's (log) identifiers are basically unique on a given machine

Exim gives each incoming email message an identifier; these look like '1XgWdJ-00020d-7g'. Among other things, this identifier is used for all log messages about the particular email message. Since Exim normally splits information about each message across multiple lines, you routinely need to reassemble or at least match multiple lines for a single message. As a result of this need to aggregate multiple lines, I've quietly wondered for a long time just how unique these log identifiers were. Clearly they weren't going to repeat over the short term, but if I gathered tens or hundreds of days of logs for a particular system, would I find repeats?

The answer turns out to be no. Under normal circumstances Exim's message IDs will be permanently unique on a single machine, but you can't count on global uniqueness across multiple machines (though the odds are pretty good). The details of how these message IDs are formed are in the Exim documentation's chapter 3.4. On most Unixes and with most Exim configurations they are a per-second timestamp, the process PID, and a final subsecond timestamp, and Exim takes care to guarantee that the timestamps will be different for the next possible message with the same PID.

(Thus a cross-machine collision would require the same message time down to the subsecond component plus the same PID on both machines. This is fairly unlikely but not impossible. Exim has a setting that can force more cross-machine uniqueness.)
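The three-part layout can be unpacked with a little Python. This is a minimal sketch, assuming the 6-6-2 character ID format shown above and a base-62 alphabet of digits, then upper case, then lower case; platforms where Exim uses a different base would need a different alphabet. The names `base62_decode` and `split_exim_id` are made up for illustration, and the final field is left as a raw number because its exact units depend on configuration.

```python
import re

# Assumed alphabet: digits, then upper case, then lower case.
BASE62 = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
ID_RE = re.compile(r"^([0-9A-Za-z]{6})-([0-9A-Za-z]{6})-([0-9A-Za-z]{2})$")


def base62_decode(text):
    """Decode a base-62 string into an integer."""
    value = 0
    for ch in text:
        value = value * 62 + BASE62.index(ch)
    return value


def split_exim_id(msgid):
    """Split an ID into its timestamp, PID and raw subsecond components."""
    match = ID_RE.match(msgid)
    if match is None:
        raise ValueError("not a 6-6-2 Exim message ID: %r" % msgid)
    seconds, pid, subsec = (base62_decode(part) for part in match.groups())
    return {"epoch_seconds": seconds, "pid": pid, "subsecond_raw": subsec}
```

Under these assumptions, the first field of the example ID quoted earlier decodes to a Unix timestamp that falls on October 21, 2014 (UTC).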

This means that aggregation of multi-line logs can be done with simple brute force approaches that rely on ID uniqueness. Heck, to group all the log lines for a given message together you can just sort on the ID field, assuming you do a stable sort so that things stay in timestamp order when the IDs match.
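The sort-based grouping can be sketched in Python, whose `sorted()` is guaranteed to be stable. The regular expression assumes the 6-6-2 ID format shown earlier, and `message_id` and `group_by_message` are hypothetical helper names. Lines with no recognisable ID get an empty key and so sort to the front.

```python
import re

# Assumed ID shape: six, six and two alphanumeric characters joined by dashes.
ID_FIELD = re.compile(r"\b[0-9A-Za-z]{6}-[0-9A-Za-z]{6}-[0-9A-Za-z]{2}\b")


def message_id(line):
    """Return the first message ID found in a log line, or '' if none."""
    match = ID_FIELD.search(line)
    return match.group(0) if match else ""


def group_by_message(lines):
    """Group lines by message ID; the stable sort keeps each group in log order."""
    return sorted(lines, key=message_id)
```

Feeding the function lines that are already in timestamp order is what makes the per-message order come out right; the sort itself never looks at the timestamps.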

(As they say, this is relevant to my interests and I finally wound up looking it up today. Writing it down here ensures I don't have to try to remember where I found it in the Exim documentation the next time I need it.)

PS: like many other uses of Unix timestamps, all of this uniqueness potentially goes out the window if you allow time on your machine to actually go backwards. On a moderate volume machine you'd still have to be pretty unlucky to have a collision, though.

by cks at October 22, 2014 04:21 AM

October 21, 2014

Chris Siebenmann

Some numbers on our inbound and outbound TLS usage in SMTP

As a result of POODLE, it's suddenly rather interesting to find out the volume of SSLv3 usage that you're seeing. Fortunately for us, Exim directly logs the SSL/TLS protocol version in a relatively easy to search for format; it's recorded as the 'X=...' parameter for both inbound and outbound email. So here's some statistics, first from our external MX gateway for inbound messages and then from our other servers for external deliveries.

Over the past 90 days, we've received roughly 1.17 million external email messages. 389,000 of them were received with some version of SSL/TLS. Unfortunately our external mail gateway currently only supports up to TLS 1.0, so the only split I can report is that only 130 of these messages were received using SSLv3 instead of TLS 1.0. 130 messages is low enough for me to examine the sources by hand; the only particularly interesting and eyebrow-raising ones were a couple of servers at a US university and a .nl ISP.

(I'm a little bit surprised that our Exim doesn't support higher TLS versions, to be honest. We're using Exim on Ubuntu 12.04, which I would have thought would support something more than just TLS 1.0.)

On our user mail submission machine, we've delivered to 167,000 remote addresses over the past 90 days. Almost all of them, 158,000, were done with SSL/TLS. Only three of them used SSLv3 and they were all to the same destination; everything else was TLS 1.0.

(It turns out that very few of our user submitted messages were received with TLS, only 0.9%. This rather surprises me, but maybe many IMAP programs default to not using TLS even if the submission server offers it. All of this small number of submissions used TLS 1.0, as I'd hope.)

Given that our Exim version only supports TLS 1.0, these numbers are more boring than I was hoping they'd be when I started writing this entry. That's how it goes sometimes; the research process can be disappointing as well as educating.

(I did verify that our SMTP servers really only do support up to TLS 1.0 and it's not just that no one asked for a higher version than that.)
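Counts like the ones above can be pulled out of the 'X=...' parameter with a short tally. This is a sketch only: it assumes the protocol version is the part of the X= value before the first colon, which may not hold for every Exim build, and `tls_versions` is a hypothetical helper name. It counts every line it is given, so the caller should first filter the log down to message arrival or delivery lines.

```python
import re
from collections import Counter

# Assumed shape: X=<protocol>:<cipher>:<bits>; only <protocol> is captured.
TLS_FIELD = re.compile(r"\bX=([^\s:]+)")


def tls_versions(lines):
    """Tally the protocol part of X=...; lines without X= count as 'none'."""
    counts = Counter()
    for line in lines:
        match = TLS_FIELD.search(line)
        counts[match.group(1) if match else "none"] += 1
    return counts
```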

One set of numbers I'd like to get for our inbound email is how TLS usage correlates with spam score. Unfortunately our inbound mail setup makes it basically impossible to correlate the bits together, as spam scoring is done well after TLS information is readily available.

Sidebar: these numbers don't quite mean what you might think

I've talked about inbound message deliveries and outbound destination addresses here because that's what Exim logs information about, but of course what is really encrypted is connections. One (encrypted) connection may deliver multiple inbound messages and certainly may be handed multiple RCPT TO addresses in the same conversation. I've also made no attempt to aggregate this by source or destination, so very popular sources or destinations (like, say, Gmail) will influence these numbers quite a lot.

All of this means that these sorts of numbers can't be taken as an indication of how many sources or destinations do TLS with us. All I can talk about is message flows.

(I can't even talk about how many outgoing messages are completely protected by TLS, because to do that I'd have to work out how many messages had no non-TLS deliveries. This is probably possible with Exim logs, but it's more work than I'm interested in doing right now. Clearly what I need is some sort of easy to use Exim log aggregator that will group all log messages for a given email message together and then let me do relatively sophisticated queries on the result.)

by cks at October 21, 2014 03:28 AM

October 20, 2014

Everything Sysadmin

See you tomorrow evening at the Denver DevOps Meetup!

Hey Denver folks! Don't forget that tomorrow evening (Tue, Oct 21) I'll be speaking at the Denver DevOps Meetup. It starts at 6:30pm! Hope to see you there!

October 20, 2014 05:00 PM

The Tech Teapot

New Aviosys IP Power 9858 Box Opening

A series of box opening photos of the new Aviosys IP Power 9858 4 port network power switch. This model will in due course replace the Aviosys IP Power 9258 series of power switches. The 9258 series is still available in the meantime though, so don’t worry.

The new model supports WiFi (802.11n-b/g and WPS for easy WiFi setup), auto reboot on ping failure, time of day scheduler and internal temperature sensor. Aviosys have also built apps for iOS and Android, so you can now manage your power switch on the move. Together with the 8 port Aviosys IP Power 9820 they provide very handy tools for remote power management of devices. Say goodbye to travelling to a remote site just to reboot a broadband router.

[Photos of the Aviosys IP Power 9858DX: closed box, open box, front with WiFi aerial, front panel, rear panel, rear close-up #2]


The post New Aviosys IP Power 9858 Box Opening appeared first on Openxtra Tech Teapot.

by Jack Hughes at October 20, 2014 07:00 AM

Chris Siebenmann

Revisiting Python's string concatenation optimization

Back in Python 2.4, CPython introduced an optimization for string concatenation that was designed to reduce memory churn in this operation and I got curious enough about this to examine it in some detail. Python 2.4 is a long time ago and I recently was prompted to wonder what had changed since then, if anything, in both Python 2 and Python 3.

To quickly summarize my earlier entry, CPython only optimizes string concatenations by attempting to grow the left side in place instead of making a new string and copying everything. It can only do this if the left side string only has (or clearly will have) a reference count of one, because otherwise it's breaking the promise that strings are immutable. Generally this requires code of the form 'avar = avar + ...' or 'avar += ...'.

As of Python 2.7.8, things have changed only slightly. In particular concatenation of Unicode strings is still not optimized; this remains a byte string only optimization. For byte strings there are two cases. Strings under somewhat less than 512 bytes can sometimes be grown in place by a few bytes, depending on their exact sizes. Strings over that can be grown if the system realloc() can find empty space after them.

(As a trivial case, CPython also optimizes concatenating an empty string to something by just returning the other string with its reference count increased.)

In Python 3, things are more complicated but the good news is that this optimization does work on Unicode strings. Python 3.3+ has a complex implementation of (Unicode) strings, but it does attempt to do in-place resizing on them under appropriate circumstances. The first complication is that internally Python 3 has a hierarchy of Unicode string storage and you can't do an in-place concatenation of a more complex sort of Unicode string into a less complex one. Once you have compatible strings in this sense, in terms of byte sizes the relevant sizes are the same as for Python 2.7.8; Unicode string objects that are less than 512 bytes can sometimes be grown by a few bytes while ones larger than that are at the mercy of the system realloc(). However, how many bytes a Unicode string takes up depends on what sort of string storage it is using, which I think mostly depends on how big your Unicode characters are (see this section of the Python 3.3 release notes and PEP 393 for the gory details).

So my overall conclusion remains as before; this optimization is chancy and should not be counted on. If you are doing repeated concatenation you're almost certainly better off using .join() on a list; if you think you have a situation that's otherwise, you should benchmark it.
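A benchmark of this sort might look like the sketch below. The piece size and counts are arbitrary assumptions, the function names are made up, and the timings will vary with the Python version, the string sizes and the system allocator, which is exactly why it's worth measuring your own case.

```python
import timeit


def build_with_concat(pieces):
    """Repeated 'avar += ...', the form the optimization can apply to."""
    out = ""
    for piece in pieces:
        out += piece
    return out


def build_with_join(pieces):
    """Collect the pieces and join them once at the end."""
    return "".join(pieces)


if __name__ == "__main__":
    pieces = ["x" * 10] * 10000  # arbitrary workload; adjust to match your own
    for fn in (build_with_concat, build_with_join):
        seconds = timeit.timeit(lambda: fn(pieces), number=100)
        print(fn.__name__, round(seconds, 4))
```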

(In Python 3, the place to start is PyUnicode_Append() in Objects/unicodeobject.c. You'll probably also want to read Include/unicodeobject.h and PEP 393 to understand this, and then see Objects/obmalloc.c for the small object allocator.)

Sidebar: What the funny 512 byte breakpoint is about

Current versions of CPython 2 and 3 allocate 'small' objects using an internal allocator that I think is basically a slab allocator. This allocator is used for all objects whose overall size is 512 bytes or less, and it rounds the object size up to the next 8-byte boundary. This means that if you ask for, say, a 41-byte object you actually get one that can hold up to 48 bytes and thus can be 'grown' in place up to this size.
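The rounding arithmetic can be written out as a tiny sketch. The constants are the ones quoted above (512 bytes and 8-byte boundaries); the function names are hypothetical, and this models only the size calculation, not the allocator itself.

```python
ALIGNMENT = 8      # small objects are rounded up to the next 8-byte boundary
SMALL_LIMIT = 512  # larger requests go to the system allocator instead


def allocated_size(requested):
    """Bytes actually reserved for a small object, or None if not 'small'."""
    if requested > SMALL_LIMIT:
        return None
    return (requested + ALIGNMENT - 1) // ALIGNMENT * ALIGNMENT


def growth_room(requested):
    """How many bytes an object could grow in place within its slot."""
    size = allocated_size(requested)
    return None if size is None else size - requested
```

For the 41-byte example this gives a 48-byte slot, so there are 7 bytes of room before a resize has to move the object.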

by cks at October 20, 2014 04:37 AM

October 19, 2014

Chris Siebenmann

Vegeta, a tool for web server stress testing

Standard stress testing tools like siege (or the venerable ab, which you shouldn't use) are all systems that do N concurrent requests at once and see how your website stands up to this. This model is a fine one for putting a consistent load on your website for a stress test, but it's not actually representative of how the real world acts. In the real world you generally don't have, say, 50 clients all trying to repeatedly make and re-make one request to you as fast as they can; instead you'll have 50 new clients (and requests) show up every second.

(I wrote about this difference at length back in this old entry.)

Vegeta is an HTTP load and stress testing tool that I stumbled over at some point. What really attracted my attention is that it uses an 'N requests a second' model instead of the concurrent request model. As a bonus it reports not just average performance but also outliers, in the form of the 90th and 99th percentiles. It's written in Go, which some of my readers may find annoying but which I rather like.

I gave it a try recently and, well, it works. It does what it says it does, which means that it's now become my default load and stress testing tool; 'N new requests a second' is a more realistic and thus interesting test than 'N concurrent requests' for my software (especially here, for obvious reasons).

(I may still do N concurrent requests tests as well, but it'll probably mostly be to see if there are issues that come up under some degree of consistent load and if I have any obvious concurrency race problems.)

Note that as with any HTTP stress tester, testing with high load levels may require a fast system (or systems) with plenty of CPUs, memory, and good networking if applicable. And as always you should validate that vegeta is actually delivering the degree of load that it should be, although this is actually reasonably easy to verify for an 'N new requests per second' tester.

(Barring errors, N new requests a second over an M second test run should result in N*M requests made and thus appearing in your server logs. I suppose the next time I run a test with vegeta I should verify this myself in my test environment. In my usage so far I just took it on trust that vegeta was working right, which in light of my ab experience may be a little bit optimistic.)
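That N*M check is easy to script against a server log. The sketch below is an assumption-laden illustration: it supposes the test requests can be recognised by some marker string in each log line (a distinctive URL or User-Agent, say), and the function names and the 1% tolerance are made up.

```python
def expected_requests(rate_per_second, duration_seconds):
    """N new requests a second over an M second run should total N*M."""
    return rate_per_second * duration_seconds


def count_hits(log_lines, marker):
    """Count log lines that contain the marker identifying test requests."""
    return sum(1 for line in log_lines if marker in line)


def load_delivered(log_lines, marker, rate, duration, tolerance=0.01):
    """True if the logged request count is within tolerance of N*M."""
    want = expected_requests(rate, duration)
    got = count_hits(log_lines, marker)
    return abs(got - want) <= want * tolerance
```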

by cks at October 19, 2014 06:04 AM

October 18, 2014

Steve Kemp's Blog

On the names we use in email

Yesterday I received a small rush of SPAM mails, all of which were 419 scams, and all of them sent by "Mrs Elizabeth PETERSEN".

It struck me that I can't think of ever receiving a legitimate mail from a "Mrs XXX [YYY]", but I was too busy to check.

Today I've done so. Of the 38,553 emails I've received during the month of October 2014 I've got a hell of a lot of mails with a From address including a "Mrs" prefix:

"Mrs.Clanzo Amaki" <>
"Mrs Sarah Mamadou"<>
"Mrs Abia Abrahim" <>
"Mrs. Josie Wilson" <>
"Mrs. Theresa Luis"<>

There are thousands more. Not a single one of them was legitimate.

I have one false-positive when repeating the search for a Mr-prefix. I have one friend who has set his sender-address to "Mr Bob Smith", which always reads weirdly to me, but every single other email with a Mr-prefix was SPAM.

I'm not going to use this in any way, since I'm happy with my mail-filtering setup, but it was an interesting observation.
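For anyone who wants to repeat the search on their own mail, the check could be sketched like this. It is not how the numbers above were produced; it simply assumes that looking at the display name of the From header is enough, and `honorific_prefix` is a hypothetical helper.

```python
import re
from email.utils import parseaddr

# Matches a leading "Mr" or "Mrs", with or without a trailing dot.
HONORIFIC = re.compile(r"^(mrs?)\b\.?\s*", re.IGNORECASE)


def honorific_prefix(from_header):
    """Return 'mr' or 'mrs' if the display name starts with one, else None."""
    display_name, _address = parseaddr(from_header)
    match = HONORIFIC.match(display_name.strip())
    return match.group(1).lower() if match else None
```

Counting the non-None results over a mailbox would give the same kind of tally, false positives like "Mr Bob Smith" included.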

Names are funny. My wife changed her surname post-marriage, but that was done largely on the basis that introducing herself as "Doctor Kemp" was simpler than "Doctor Foreign-Name". She'd certainly never introduce herself as Mrs Kemp.

Trivia: In Finnish the word for "Man" and "Husband" is the same (mies), but the word for "Woman" (nainen) is different than the word for "Wife" (vaimo).

October 18, 2014 11:03 PM


For other Movable Type blogs out there

If you're wondering why comments aren't working, as I was, and are on shared hosting, as I am, and get to looking at your error_log file and see something like this in it:

[Sun Oct 12 12:34:56 2014] [error] [client] 
ModSecurity: Access denied with code 406 (phase 2).
Match of "beginsWith http://%{SERVER_NAME}/" against "MATCHED_VAR" required.
[file "/etc/httpd/modsecurity.d/10_asl_rules.conf"] [line "1425"] [id "340503"] [rev "1"]
[msg "Remote File Injection attempt in ARGS (/cgi-bin/mt4/mt-comments.cgi)"]
[severity "CRITICAL"]
[hostname ""]
[uri "/cgi-bin/mt/mt-comments.cgi"]
[unique_id "PIMENTOCAKE"]

It's not just you.

It seems that some webhosts have a mod_security rule in place that bans submitting anything through "mt-comments.cgi". As this is the main way MT submits comments, this kind of breaks things. Happily, working around a rule like this is dead easy.

  1. Rename your mt-comments.cgi file to something else
  2. Add "CommentScript ${renamed file}" to your mt-config.cgi file

And suddenly comments start working again!

Except for Google, since they're deprecating OpenID support.

by SysAdmin1138 at October 18, 2014 09:46 PM

Chris Siebenmann

During your crisis, remember to look for anomalies

This is a war story.

Today I had one of those valuable learning experiences for a system administrator. What happened is that one of our old fileservers locked up mysteriously, so we power cycled it. Then it locked up again. And again (and an attempt to get a crash dump failed). We thought it might be hardware related, so we transplanted the system disks into an entirely new chassis (with more memory, because there was some indications that it might be running out of memory somehow). It still locked up. Each lockup took maybe ten or fifteen minutes from the reboot, and things were all the more alarming and mysterious because this particular old fileserver only had a handful of production filesystems still on it; almost all of them had been migrated to one of our new fileservers. After one more lockup we gave up and went with our panic plan: we disabled NFS and set up to do an emergency migration of the remaining filesystems to the appropriate new fileserver.

Only as we started the first filesystem migration did we notice that one of the ZFS pools was completely full (so full it could not make a ZFS snapshot). As we were freeing up some space in the pool, a little light came on in the back of my mind; I remembered reading something about how full ZFS pools on our ancient version of Solaris could be very bad news, and I was pretty sure that earlier I'd seen a bunch of NFS write IO at least being attempted against the pool. Rather than migrate the filesystem after the pool had some free space, we selectively re-enabled NFS fileservice. The fileserver stayed up. We enabled more NFS fileservice. And things stayed happy. At this point we're pretty sure that we found the actual cause of all of our fileserver problems today.

(Afterwards I discovered that we had run into something like this before.)

What this has taught me is that during an inexplicable crisis, I should try to take a bit of time to look for anomalies. Not specific anomalies, but general ones; things about the state of the system that aren't right or don't seem right.

(There is a certain amount of hindsight bias in this advice, but I want to mull that over a bit before I write more about it. The more I think about it the more complicated real crisis response becomes.)

by cks at October 18, 2014 04:55 AM

October 17, 2014

Everything Sysadmin

Usenix LISA early registration discount expires soon!

Register by Mon, October 20 and take advantage of the early bird pricing.

I'll be teaching tutorials on managing oncall, team-driven sysadmin tools, upgrading live services and more. Please register soon and save!

October 17, 2014 06:20 PM

Standalone Sysadmin

VM Creation Day - PowerShell and VMware Automation

I should have ordered balloons and streamers, because Monday was VM creation day on my VMware cluster.

In addition to a 3-node production-licensed vSphere cluster, I run a 10-node cluster specifically for academic purposes. One of those purposes is building and maintaining classroom environments. A lot of professors maintain a server or two for their courses, but our Information Assurance program here goes above and beyond in terms of VM utilization. Every semester, I've got to deal with the added load, so I figured if I'm going to document it, I might as well get a blog entry while I'm at it.

Conceptually, the purpose of this process is to allow an instructor to create a set of virtual machines (typically between 1 and 4 of them), collectively referred to as a 'pod', which will serve as a lab for students. Once this set of VMs is configured exactly as the professor wants, and they have signed off on them, those VMs become the 'Gold Images', and then each student gets their own instance of these VMs. A class can have between 10 and 70 students, so this quickly becomes a real headache to deal with, hence the automation.

Additionally, because these classes are Information Assurance courses, it's not uncommon for the VMs to be configured in an insecure manner (on purpose) and to be attacked by other VMs, and to generally behave in a manner unbecoming a good network denizen, so each class is cordoned off onto its own VLAN, with its own PFsense box guarding the entryway and doing NAT for the several hundred VMs behind the wall. The script needs to automate the creation of the relevant PFsense configs, too, so that comes at the end.

I've written a relatively involved PowerShell script to do my dirty work for me, but it's still a long series of things to go from zero to working classroom environment. I figured I would spend a little time to talk about what I do to make this happen. I'm not saying it's the best solution, but it's the one I use, and it works for me. I'm interested in hearing if you've got a similar solution going on. Make sure to comment and let everyone know what you're using for these kinds of things.

The process is mostly automated hard parts separated by manual staging, because I want to verify sanity at each step. This kind of thing happens infrequently enough that I'm not completely trusting of the process yet, mostly due to my own ignorance of all of the edge cases that can cause failures. To the right, you'll see a diagram of the process.

In the script, the first thing I do is include functions that I stole from an awesome post on Subnet Math with PowerShell from Indented!, a software blog by Chris Dent. Because I'm going to be dealing with the DHCP config, it'll be very helpful to be able to have functions that understand what subnet boundaries are, and how to properly increment IP addresses.
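To show the kind of subnet math involved, here is a minimal sketch using Python's standard `ipaddress` module rather than the PowerShell functions the script actually borrows. The function name, the example block and the number of addresses skipped at the start of the range are all assumptions made for illustration.

```python
import ipaddress
from itertools import islice


def pod_addresses(block, mask, count, skip=10):
    """Return `count` usable host addresses from the block, after skipping some.

    `skip` reserves the first few hosts (for a gateway and the like); the
    default of 10 is an arbitrary assumption.
    """
    network = ipaddress.ip_network("%s/%s" % (block, mask))
    # hosts() leaves out the network and broadcast addresses.
    picked = [str(ip) for ip in islice(network.hosts(), skip, skip + count)]
    if len(picked) < count:
        raise ValueError("subnet too small for %d addresses" % count)
    return picked


def next_address(address):
    """Increment an IP address, carrying across octet boundaries."""
    return str(ipaddress.ip_address(address) + 1)
```

The point of leaning on a library is that incrementing past .255 and respecting the subnet boundary are handled for you rather than by string manipulation.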

When this PowerShell script runs, I need to make sure that the VMware PowerCLI cmdlets are actually loaded. We can do that like this:

# Load the PowerCLI snap-in only if it isn't already loaded
if ( ( Get-PSSnapin -name VMware.VimAutomation.Core -ErrorAction SilentlyContinue ) -eq $null ) {
    Add-PSSnapin VMware.VimAutomation.Core
}

For the class itself, this whole process consists of functions to do what needs to be done (or "do the needful" if you use that particular phrase), and it's fairly linear, and each step requires the prior to be completed. What I've done is to create an object that represents the course as a whole, and then add the appropriate properties and methods. I don't actually need a lot of the power of OOP, but it provides a convenient way to keep everything together. Here's an example of the initial class setup:

$IA = New-Object psobject

# Let's add some initial values
Add-Member -InputObject $IA -MemberType NoteProperty -Name ClassCode -Value ""
Add-Member -InputObject $IA -MemberType NoteProperty -Name Semester -Value ""
Add-Member -InputObject $IA -MemberType NoteProperty -Name Datastore -Value "FASTDATASTORENAME"
Add-Member -InputObject $IA -MemberType NoteProperty -Name Cluster -Value "IA Program"
Add-Member -InputObject $IA -MemberType NoteProperty -Name VIServer -Value "VSPHERE-SERVER"
Add-Member -InputObject $IA -MemberType NoteProperty -Name IPBlock -Value ""
Add-Member -InputObject $IA -MemberType NoteProperty -Name SubnetMask -Value ""
Add-Member -InputObject $IA -MemberType NoteProperty -Name Connected -Value $false
Add-Member -InputObject $IA -MemberType NoteProperty -Name ResourcePool -Value ""
Add-Member -InputObject $IA -MemberType NoteProperty -Name PodCount -Value ""
Add-Member -InputObject $IA -MemberType NoteProperty -Name GoldMasters -Value ""
Add-Member -InputObject $IA -MemberType NoteProperty -Name Folder -Value ""
Add-Member -InputObject $IA -MemberType NoteProperty -Name MACPrefix -Value ""
Add-Member -InputObject $IA -MemberType NoteProperty -Name ConfigDir -Value ""
Add-Member -InputObject $IA -MemberType NoteProperty -Name VMarray -Value @()

These are just the values that almost never change. Since we're using NAT, and we're not routing to that network, and every class has its own dedicated VLAN, we can use the same IP block every time without running into a problem. The blank values are there just as placeholders, and they will be filled in as the class methods are invoked.

At the bottom of the script, which is where I spend most of my time, I set per-class settings:

$IA.ClassCode = "ia1234"
$IA.Semester = "Fall-2014"
$IA.PodCount = 35
# One hashtable per gold image: source VM name, short OS code, forwarded ports
$IA.GoldMasters = @(
    @{
        vmname = "ia1234-win7-gold-20141014"
        osname = "win7"
        tcp = 3389
        udp = ""
    },
    @{
        vmname = "ia1234-centos-gold-20141014"
        osname = "centos"
        tcp = ""
        udp = ""
    },
    @{
        vmname = "ia1234-kali-gold-20141014"
        osname = "kali"
        tcp = "22"
        udp = ""
    }
)

We set the class code, semester, and pod count simply. These will be used to create the VM names, the folders, and resource groups that the VMs live in. The GoldMasters array is a data structure that has an entry for each of the gold images that the professor has created. Each entry contains the name of the gold image, a short code that will be used to name the VM instances coming from it, and placeholders for the tcp and udp ports which need to be forwarded from the outside to allow internal access. I don't currently have the code in place that allows me to specify multiple port forwards, but that's going to be added, because I had a professor request 7(!) forwarded ports per VM in one of their classes this semester.

As you can see in the diagram, I'm using Linked Clones to spin up the students' pods. This has the advantage of saving diskspace and of completing quickly. Linked clones operate on a snapshot of the original disk image. Rather than actually have the VMs operate on the gold images, I do a full clone of the VM over to a faster datastore than the Ol' Reliable NetApp.

We add a method to the $IA object like this:

Add-Member -InputObject $IA -MemberType ScriptMethod -Name createLCMASTERs -Value {
# This is the code that converts the gold images into LCMASTERs
# Because you need to put a template somewhere, it makes sense to put it
# into the folder that the VMs will eventually live in themselves (thus saving
# yourself the effort of locating the right folder twice).
Process {
... stuff goes here
}
}

The core of this method is the following block, which actually performs the clone:

if ( ! (Get-VM -Name $LCMASTERName) ) {
try {
$presnap = New-snapshot -Name ("Autosnap: " + $(Get-Date).toString("yyyMMdd")) -VM $GoldVM -confirm:$false

$cloneSpec = new-object VMware.Vim.VirtualMachineCloneSpec
$cloneSpec.Location = New-Object VMware.Vim.VirtualMachineRelocateSpec
$cloneSpec.Location.Pool = ($IA.ResourcePool | Get-View).MoRef
$cloneSpec.Location.Host = ($vm | Get-VMHost).MoRef
$cloneSpec.Location.Datastore = ($IA.Datastore | Get-View).MoRef
$cloneSpec.Location.DiskMoveType = [VMware.Vim.VirtualMachineRelocateDiskMoveOptions]::createNewChildDiskBacking
$cloneSpec.Snapshot = ($GoldVM | Get-View).Snapshot.CurrentSnapshot
$cloneSpec.PowerOn = $false

($GoldVM | Get-View).cloneVM( $LCMasterFolder.MoRef, $LCMASTERName, $cloneSpec)

Remove-snapshot -Snapshot $presnap -confirm:$false
} catch [Exception] {
Write-Host "Error: " $_.Exception.Message
}
} else {
Write-Host "Template found with name $LCMasterName - not recreating"
}

(apologies for the lack of indentation)

If you're interested in doing this kind of thing, make sure you check out the docs for the createNewChildDiskBacking setting.

After the Linked Clone Masters have been created, then it's a simple matter of creating the VMs from each of them (using the $IA.PodCount value to figure out how many we need). They end up getting named something like $IA.ClassCode-$IA.Semester-$IA.GoldMasters[#].osname-pod$podcount which makes it easy to figure out what goes where when I have several classes running at once.

After the VMs have been created, we can start dealing with the network portion. I used to spin up all of the VMs, then loop through them and pull the MAC addresses to use with the DHCP config, but there were problems with that method. I found that a lot of the time, I'll need to rerun this script a few times per class, either because I've screwed something up or the instructor needs to make changes to the pod. When that happens, EACH TIME I had to re-generate the DHCP config (which is easy) and then manually insert it into PFsense (which is super-annoying).

Rather than do that every time, I eventually realized that it's much easier just to dictate what the MAC address for each machine is, and then it doesn't matter how often I rerun the script, the DHCP config doesn't change. (And yes, I'm using DHCP, but with static leases, which is necessary because of the port forwarding).

Here's what I do:

Add-Member -InputObject $IA -MemberType ScriptMethod -Name assignMACs -Value {
Process {
$StaticPrefix = "00:50:56"
if ( $IA.MACPrefix -eq "" ) {
# Since there isn't already a prefix set, it's cool to make one randomly.
# Get-Random's -Maximum is exclusive, so 64 gives a fourth octet of 00 through 3F.
$IA.MACPrefix = $StaticPrefix + ":" + ("{0:X2}" -f (Get-Random -Minimum 0 -Maximum 64) )
}
$machineCount = 0
$IA.VMarray | ForEach-Object {
$machineAddr = $IA.MACPrefix + ":" + ("{0:X4}" -f $machineCount).Insert(2,":")

# Assumes each VMarray entry carries the VM's name in a "name" property
$vm = Get-VM -name $_.name
$networkAdapter = Get-NetworkAdapter -VM $vm
Write-Host "Setting $vm to $machineAddr"
Set-NetworkAdapter -NetworkAdapter $networkAdapter -MacAddress $machineAddr -Confirm:$false
$IA.VMarray[$machineCount].MAC = $machineAddr
$IA.VMarray[$machineCount].index = $machineCount
$machineCount++
}
}
}


As you can see, this randomly assigns a MAC address in the vSphere range. Sort of. The fourth octet is randomly selected between 00 and 3F, and then the last two octets are incremented starting from 00. Optionally, the fourth octet can be specified, which is useful in a re-run of the script so that the DHCP config doesn't need to be re-generated.
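
The address layout described above can also be sketched outside PowerCLI. What follows is a minimal Python illustration, not part of the original script; the function name pod_mac and its parameters are made up for this example. It assumes only what the paragraph states: a fixed 00:50:56 prefix, a fourth octet between 00 and 3F, and a counter spread over the last two octets.

```python
def pod_mac(fourth_octet, machine_index):
    """Build a MAC address of the form 00:50:56:XX:YY:ZZ.

    fourth_octet  -- per-class value, 0x00 through 0x3F (hypothetical parameter)
    machine_index -- position of the VM in the class, 0 through 0xFFFF
    """
    if not 0 <= fourth_octet <= 0x3F:
        raise ValueError("fourth octet must be between 0x00 and 0x3F")
    if not 0 <= machine_index <= 0xFFFF:
        raise ValueError("machine index must fit in two octets")
    # Split the 16-bit counter into its high and low bytes.
    high, low = machine_index >> 8, machine_index & 0xFF
    return "00:50:56:{:02X}:{:02X}:{:02X}".format(fourth_octet, high, low)


if __name__ == "__main__":
    for index in range(3):
        print(pod_mac(0x2A, index))
```

Because the result depends only on the fourth octet and the machine's position in the list, rerunning with the same inputs produces the same addresses, which is the property that keeps the DHCP static leases stable across reruns.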

After the MAC addresses are assigned, the IPs can be determined using the network math:

Add-Member -InputObject $IA -MemberType ScriptMethod -Name assignIPs -Value {
# This method really only assigns the IP to the object.
Process {
# It was tempting to assign a sane IP block to this network, but given the
# tendency to shove God-only-knows how many people into a class at a time,
# let's not be bounded by reasonable or sane. /16 it is.
# First 50 IPs are reserved for gateway plus potential gold images.
$currentIP = Get-NextIP $IA.IPBlock 2
$IA.VMarray | ForEach-Object {
$_.IPAddr = $currentIP
$currentIP = Get-NextIP $currentIP 2
}
}
}


This is done by naively giving every other IP to a machine, leaving the odd IP addresses between them open. I've had to massage this before, where a large pod of 5-6 VMs all needed sequential IPs, with addresses skipped between pods, but I've done those mostly as one-offs. I don't think I need to build in a lot of flexibility because those are relatively rare cases, but it wouldn't be that hard to develop a scheme for it if you needed one.
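
For readers who want to see the every-other-address idea in isolation, here is a rough Python sketch using the standard ipaddress module. It is not the author's Get-NextIP helper, whose implementation isn't shown; the function name assign_ips, the example block 10.0.0.0/16, and the starting offset are assumptions made for illustration.

```python
import ipaddress


def assign_ips(block, vm_count, step=2):
    """Return vm_count addresses from block, spaced `step` apart.

    Starts `step` addresses past the network address, loosely mirroring
    the "Get-NextIP $IA.IPBlock 2" call in the script (an assumption about
    what that helper does).
    """
    network = ipaddress.ip_network(block)
    first = network.network_address + step
    addresses = []
    for i in range(vm_count):
        candidate = first + i * step
        # Refuse to hand out anything at or beyond the broadcast address.
        if candidate >= network.broadcast_address:
            raise ValueError("block %s is too small for %d VMs" % (block, vm_count))
        addresses.append(str(candidate))
    return addresses


if __name__ == "__main__":
    print(assign_ips("10.0.0.0/16", 5))
```

Pods that need consecutive addresses could be handled by calling the function once per pod with step=1 and a different starting block, though that is only one possible scheme.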

After the IPs are assigned, you can create the DHCP config. Right now, I'm using an ugly hack, where I basically just print out the top of the DHCP config, then loop through the VMs outputting XML the whole way. It's ugly, and I'm not going to paste it here, but if you download a DHCPD XML file from PFsense, then you can basically see what I'm doing. I then do the same thing with the NAT config.
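
As a hedged alternative to printing XML by hand, the same loop can build the entries with an XML library. The sketch below is Python, not the author's code, and the element names (staticmap, mac, ipaddr, hostname) are my assumption about what a pfSense DHCP export looks like; compare them against a file downloaded from your own firewall before relying on them. The input shape (dicts with name, mac, and ip keys) is also hypothetical.

```python
import xml.etree.ElementTree as ET


def static_maps(vms):
    """Render one static-lease element per VM.

    vms -- iterable of dicts with 'name', 'mac' and 'ip' keys (hypothetical shape)
    """
    entries = []
    for vm in vms:
        entry = ET.Element("staticmap")  # element names are assumed, see lead-in
        ET.SubElement(entry, "mac").text = vm["mac"]
        ET.SubElement(entry, "ipaddr").text = vm["ip"]
        ET.SubElement(entry, "hostname").text = vm["name"]
        entries.append(ET.tostring(entry, encoding="unicode"))
    return "\n".join(entries)


if __name__ == "__main__":
    print(static_maps([
        {"name": "ia1234-Fall-2014-win7-pod1", "mac": "00:50:56:2A:00:00", "ip": "10.0.0.2"},
    ]))
```

Letting the library serialize the elements means hostnames containing characters such as & are escaped correctly, which hand-printed XML tends to get wrong.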

Because I'm still running these functions manually, I have these XML-creation methods printing output, but it's easy to see how you could have them redirect output to a text file. And if you were super-cool, you could use something like this example from MSDN, where you spin up an instance of IE:

$ie = new-object -com "InternetExplorer.Application"
... and so on

Anyway, I've spun up probably thousands of VMs using this script (or previous instances of it). It's saved me a lot of time, and if you have to manage bulk-VMs using vSphere, and you're not automating it (using PowerCLI, or vCloud Director, or something else), you really should be. And if you DO, what do you do? Comment below and let me know!

Thanks for reading all the way through!

by Matt Simmons at October 17, 2014 03:16 PM

Chris Siebenmann

My experience doing relatively low level X stuff in Go

Today I wound up needing a program that spoke the current Firefox remote control protocol instead of the old -remote based protocol that Firefox Nightly just removed. I had my choice between either adding a bunch of buffer mangling to a very old C program that already did basically all of the X stuff necessary or trying to do low-level X things from a Go program. The latter seemed much more interesting and so it's what I did.

(The old protocol was pretty simple but the new one involves a bunch of annoying buffer packing.)

Remote controlling Firefox is done through X properties, which is a relatively low level part of the X protocol (well below the usual level of GUIs and toolkits like GTK and Qt). You aren't making windows or drawing anything; instead you're grubbing around in window trees and getting obscure events from other people's windows. Fortunately Go has low level bindings for X in the form of Andrew Gallant's X Go Binding and his xgbutil packages for them (note that the XGB documentation you really want to read is for xgb/xproto). Use of these can be a little bit obscure so it very much helped me to read several examples (for both xgb and xgbutil).

All told the whole experience was pretty painless. Most of the stumbling blocks I ran into were because I don't really know X programming and because I was effectively translating from an older X API (Xlib) that my original C program was using to XCB, which is what XGB's API is based on. This involved a certain amount of working out what old functions that the old code was calling actually did and then figuring out how to translate them into XGB and xgbutil stuff (mostly the latter, because xgbutil puts a nice veneer over a lot of painstaking protocol bits).

(I was especially pleased that my Go code for the annoying buffer packing worked the first time. It was also pretty easy and obvious to write.)

One of the nice little things about using Go for this is that XGB turns out to be a pure Go binding, which means it can be freely cross compiled. So now I can theoretically do Firefox remote control from essentially any machine I remotely log into around here. Someday I may have a use for this, perhaps for some annoying system management program that insists on spawning something to show me links.

(Cross machine remote control matters to me because I read my email on a remote machine with a graphical program, and of course I want to click on links there and have them open in my workstation's main Firefox.)

Interested parties who want either a functional and reasonably commented example of doing this sort of stuff in Go or a program to do lightweight remote control of Unix Firefox can take a look at the ffox-remote repo. As a bonus I have written down in comments what I now know about the actual Firefox remote control protocol itself.

by cks at October 17, 2014 04:55 AM

October 16, 2014

LZone - Sysadmin

Getting rid of Bash Ctrl+R

Today was a good day, as I stumbled over this post hinting at the following bash key bindings:

bind '"\e[A":history-search-backward'
bind '"\e[B":history-search-forward'

It changes the behaviour of the up and down cursor keys so that they no longer go blindly through the history but only through items matching what is already typed at the prompt. The disadvantage, of course, is having to clear the line to go through the full history. But as this can be achieved by a Ctrl-C at any time it is still preferable to Ctrl+R Ctrl+R Ctrl+R Ctrl+R Ctrl+R Ctrl+R Ctrl+R Ctrl+R Ctrl+R Ctrl+R Ctrl+R Ctrl+R Ctrl+R Ctrl+R Ctrl+R Ctrl+R Ctrl+R Ctrl+R Ctrl+R ....

by Lars Windolf at October 16, 2014 06:42 PM


The new economy and systems administration

"Over the next few decades demand in the top layer of the labor market may well centre on individuals with high abstract reasoning, creative, and interpersonal skills that are beyond most workers, including graduates."
-Economist, vol413/num8907, Oct 4, 2014, "Special Report: The Third Great Wave. Productivity: Technology isn't Working"

The rest of the Special Report lays out a convincing argument that people who have automation-creation as part of their primary job duties are in for quite a bit of growth and that people in industries subject to automation are going to have a hard time of it. This has a direct impact on sysadminly career direction.

In the past decade Systems Administration has been moving away from mechanics who deploy hardware, install software and fix problems and towards Engineers who are able to build automation for provisioning new computing instances, installing application frameworks, and know how to troubleshoot problems with all of that. In many ways we're a specialized niche of Software Engineering now, and that means we can ride the rocket with them. If you want to continue to have a good job in the new industrial revolution, keep plugging along and don't become the dragon in the datacenter people don't talk to.

Abstract Reasoning

Being able to comprehend how a complex system works is a prime example of abstract reasoning. Systems Administration is more than just knowing the arcana of init, grub, or WMI; we need to know how systems interact with each other. This is a skill that has been a pre-requisite for Senior Sysadmins for several decades now, so isn't new. It's already on our skill-path. This is where System Engineers make their names, and sometimes become Systems Architects.


Creativity

This has been less on our skill-path, but is definitely something we've been focusing on in the past decade or so. Building large automation systems, even with frameworks such as Puppet or Chef, takes a fair amount of both abstract reasoning and creativity. If you're good at this, you've got 'creative' down.

This has impacts for the lower rungs of the sysadmin skill-ladder. Brand new sysadmins are going to be doing less racking-and-stacking and more parsing and patching ruby or ruby-like DSLs.

Interpersonal Skills

This is where sysadmins tend to fall down. A lot of us got into this gig because we didn't have to talk to people who weren't other sysadmins. Technology made sense, people didn't.

This skill is more a reflection of the service-oriented economy, and sysadmins are only sort of that, but our role in product creation and maintenance is ever more social these days. If you're one of two sysadmin-types in a company with 15 software engineers, you're going to have to learn how to have a good relationship with software engineers. In olden days, only very senior sysadmins had to have the Speaker to Management skill, now even mid-levels need to be able to speak coherently to technical and non-technical management.

It is no coincidence that many of the tutorials at conferences like LISA are aimed at building business and social skills in sysadmins. It's worth your time to attend them, since your career advancement depends on it.

Yes, we're well positioned to do well in the new economy. We just have to make a few changes we've known about for a while now.

by SysAdmin1138 at October 16, 2014 05:10 PM

Everything Sysadmin

Results of the PuppetConf 2014 Raffle

If you recall, the fine folks at Puppet Labs gave me a free ticket to PuppetConf 2014 to give away to a reader of this blog. Here's a report from our lucky winner!

Conference Report: PuppetConf 2014

by Anastasiia Zhenevskaia

You never know when you will be lucky enough to win a ticket to PuppetConf, one of the greatest conferences of this year. My "moment" happened just 3 weeks before the conference and let me dive into things I'd never thought about.

Being a person who has worked mostly with front-end development, I was always a little bit scared and puzzled by more complicated things. Fortunately, the Conference helped me to understand how important and simple all these processes could be. I was so impressed by the personalities of all the speakers. Their eyes were full of passion, their presentations were clear, informative and breath-taking. Their attitude towards the things they're working on was exceptional. Those are people you might want to work with, share ideas with and create amazing things with.

I'm so glad that I got this opportunity and wish that everybody could get this chance and taste the atmosphere of Puppet!

October 16, 2014 03:00 PM

Chris Siebenmann

Don't use dd as a quick version of disk mirroring

Suppose, not entirely hypothetically, that you initially set up a server with one system disk but have come to wish that it had a mirrored pair of them. The server is in production and in-place migration to software RAID requires a downtime or two, so as a cheap 'in case of emergency' measure you stick in a second disk and then clone your current system disk to it with dd (remember to fsck the root filesystem afterwards).

(This has a number of problems if you ever actually need to boot from the second disk, but let's set them aside for now.)

Unfortunately, on a modern Linux machine you have just armed a time bomb that is aimed at your foot. It may never go off, or it may go off more than a year and a half later (when you've forgotten all about this), or it may go off the next time you reboot the machine. The problem is that modern Linux systems identify their root filesystem by its UUID, not its disk location, and because you cloned the disk with dd you now have two different filesystems with the same UUID.

(Unless you do something to manually change the UUID on the cloned copy, which you can. But you have to remember that step. On extN filesystems, it's done with tune2fs's -U argument; you probably want '-U random'.)

Most of the time, the kernel and initramfs will probably see your first disk first and inventory the UUID on its root partition first and so on, and thus boot from the right filesystem on the first disk. But this is not guaranteed. Someday the kernel may get around to looking at sdb1 before it looks at sda1, find the UUID it's looking for, and mount your cloned copy as the root filesystem instead of the real thing. If you're lucky, the cloned copy is so out of date that things fail explosively and you notice immediately (although figuring out what's going on may take a bit of time and in the mean time life can be quite exciting). If you're unlucky, the cloned copy is close enough to the real root filesystem that things mostly work and you might only have a few little anomalies, like missing log files or mysteriously reverted package versions or the like. You might not even really notice.
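
One way to catch this situation before a reboot does is to look for filesystems that share a UUID. The sketch below is a Python illustration added for this purpose, not something from the original entry; it assumes blkid's usual one-line-per-device output of the form /dev/sda1: UUID="..." TYPE="...", and the function name and sample UUIDs are made up.

```python
import re
from collections import defaultdict

# \b keeps this from matching the tail of PARTUUID="..."
UUID_RE = re.compile(r'\bUUID="([^"]+)"')


def duplicate_uuids(blkid_output):
    """Map each UUID that appears on more than one device to those devices."""
    devices_by_uuid = defaultdict(list)
    for line in blkid_output.splitlines():
        device, separator, attributes = line.partition(":")
        match = UUID_RE.search(attributes)
        if separator and match:
            devices_by_uuid[match.group(1)].append(device.strip())
    return {uuid: devs for uuid, devs in devices_by_uuid.items() if len(devs) > 1}


if __name__ == "__main__":
    # Sample text standing in for real blkid output (UUIDs are invented).
    sample = (
        '/dev/sda1: UUID="1111-aaaa" TYPE="ext4" PARTUUID="0001-01"\n'
        '/dev/sdb1: UUID="1111-aaaa" TYPE="ext4" PARTUUID="0002-01"\n'
        '/dev/sda2: UUID="2222-bbbb" TYPE="swap"\n'
    )
    print(duplicate_uuids(sample))
```

On a real machine the input would come from running blkid as root; any UUID the function reports is a candidate for the tune2fs change mentioned above.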

(This is the background behind my recent tweet.)

by cks at October 16, 2014 06:14 AM

October 15, 2014

Everything Sysadmin

Tutorial: Evil Genius 101

I'm teaching a tutorial at Usenix LISA called "Evil Genius 101: Subversive Ways to Promote DevOps and Other Big Changes".

Whether you are trying to bring "devops culture" to your workplace, or just get approval to purchase a new machine, convincing and influencing people is a big part of a system administrator's time.

For the last few years I've been teaching this class called "Evil Genius 101" where I reveal my tricks for understanding people and swaying their opinion. None of these are actually evil, nor do I teach negotiating techniques. I simply list 3-4 techniques I've found successful for each of these situations: talking to executives, talking to managers, talking to coworkers, and talking to users.

Seating is limited. Register now!

Evil Genius 101: Subversive Ways to Promote DevOps and Other Big Changes

Who should attend:

Sysadmins and managers looking to influence the technology and culture of your organization.


Monday, 10-Nov, 1:30pm-5pm at Usenix LISA


You want to innovate: deploy new technologies such as configuration management, kanban, a wiki, or standardized configurations. Your coworkers don't want change: they like the way things are. Therefore, they consider you evil. However you aren't evil, you just want to make things better. Learn how to talk your team, managers and executives into adopting DevOps techniques and culture.

Take back to work:

  • Help your coworkers understand and agree with your awesome ideas
  • Convince your manager about anything. Really.
  • Get others to trust you so they are more easily convinced
  • Decide which projects to do when you have more projects than time
  • Turn the most stubborn user into your biggest fan
  • Make decisions based on data and evidence

Topics include:

  • DevOps "value mapping" exercise: Understand how your work relates to business needs.
  • So much to do! What should you do first?
  • How to sell ideas to executives, management, co-workers, and users.
  • Simple ways to display data to get your point across better.

Register today for Usenix LISA 2014!

October 15, 2014 03:00 PM

Chris Siebenmann

Why system administrators hate security researchers every so often

So in the wake of the Bash vulnerability I was reading this Errata Security entry on Bash's code (via an @0xabad1dea retweet) and I came across this:

So now that we know what's wrong, how do we fix it? The answer is to clean up the technical debt, to go through the code and make systematic changes to bring it up to 2014 standards.

This will fix a lot of bugs, but it will break existing shell-scripts that depend upon those bugs. That's not a problem -- that's what upping the major version number is for. [...]

I cannot put this gently, so here it goes: FAIL.

The likely effect of any significant amount of observable Bash behavior changes (for behavior that is not itself a security bug) will be to leave security people feeling smug and the problem completely unsolved. Sure, the resulting Bash will be more secure. A powered off computer in a vault is more secure too. What it is not is useful, and the exact same thing is true of cavalierly breaking things in the name of security.

Bash's current behavior is relied on by a great many scripts written by a great many people. If you change any significant observable part of that behavior, so that scripts start breaking, you have broken the overall system that Bash is a part of. Your change is not useful. It doesn't matter if you change Bash's version number because changing the version number does nothing to magically fix those broken scripts.

Fortunately (for sysadmins), the Bash maintainers are extremely unlikely to take changes that will cause significant breakage in scripts. Even if the Bash maintainers take them, many distribution maintainers will not take them. In fact the distributions who are most likely to not take the fixes are the distributions that most need them, ie the distributions that have Bash as /bin/sh and thus where the breakage will cause the most pain (and Bashisms in such scripts are not necessarily bugs). Hence such a version of Bash, if one is ever developed by someone, is highly likely to leave security researchers feeling smug about having fixed the problem even if people are too obstinate to pick up their fix and to leave systems no more secure than before.

But then, this is no surprise. Security researchers have been ignoring the human side of their nominal field for a long time.

(As always, social problems are the real problems. If your proposed technical solution to a security issue is not feasible in practice, you have not actually fixed the problem. As a corollary, calling for such fixes is much the same as hoping magical elves will fix the problem.)

by cks at October 15, 2014 05:13 AM

October 14, 2014

Warren Guy

Regenerating an RSA private key with Python

This is an exercise in regenerating an RSA private key while possessing only the public key. You might also find this useful if you happen to know all of the parameters of a private key (modulus, public exponent, and either the private exponent or prime factors), and want to reconstruct a key from them (skip to the end). This covers only the practical steps required without detailed explanation.

The example used here is a 256-bit RSA key, which can be factored on my laptop in less than three minutes. You won't (I hope) find any 256-bit RSA keys in the real world, however you could likely factor a 512-bit key (which sadly do exist in the wild) with modern hardware in a matter of days.
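
To make the "know all of the parameters" case concrete, here is a small Python sketch of the arithmetic involved once the prime factors are known. It is not taken from the full post; the function names are invented, the primes are textbook-sized toys, and factoring a real modulus and serializing the result into a key file are both left out.

```python
def modular_inverse(a, m):
    """Return x with (a * x) % m == 1, using the extended Euclidean algorithm."""
    old_r, r = a, m
    old_s, s = 1, 0
    while r != 0:
        quotient = old_r // r
        old_r, r = r, old_r - quotient * r
        old_s, s = s, old_s - quotient * s
    if old_r != 1:
        raise ValueError("a and m are not coprime, so no inverse exists")
    return old_s % m


def rsa_private_parameters(p, q, e):
    """Derive the private-key numbers from the prime factors and public exponent."""
    phi = (p - 1) * (q - 1)
    d = modular_inverse(e, phi)
    return {
        "n": p * q,                      # modulus
        "e": e,                          # public exponent
        "d": d,                          # private exponent
        "p": p,
        "q": q,
        "dp": d % (p - 1),               # CRT exponent for p
        "dq": d % (q - 1),               # CRT exponent for q
        "qinv": modular_inverse(q, p),   # CRT coefficient
    }


if __name__ == "__main__":
    key = rsa_private_parameters(61, 53, 17)   # toy primes, illustration only
    ciphertext = pow(65, key["e"], key["n"])
    print(key["d"], pow(ciphertext, key["d"], key["n"]))
```

The hard part of the exercise is the factoring step that produces p and q; everything after that is the few lines of modular arithmetic shown here.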

Read full post

October 14, 2014 08:55 PM

Everything Sysadmin

Come hear me speak in Denver next week!

On Tuesday, Oct 21st, I'll be speaking at the Denver DevOps Meetup. It is short notice, but if you happen to be in the area, please come! I'll be talking about the new book and how DevOps principles can make the world a better place. I'll have a copy or two to give away, and special discount codes for everyone.

The meeting is at the Craftsy Offices, 999 18th St., Suite 240, Denver, CO. For more information and to RSVP, please go to

October 14, 2014 05:30 PM

Tutorial: How To Not Get Paged

Step 1: turn off your pager. Step 2: disable the monitoring system. Or.... you can run oncall using modern methodologies that constantly improve the reliability of your system.

I'm teaching a tutorial at Usenix LISA called "How To Not Get Paged: Managing Oncall to Reduce Outages".

I'm excited about this class because I'm going to explain a lot of the things I learned at Google about how to turn oncall from a PITA to a productive use of time that improves the reliability of the systems you run. Most of the material is from our new book, The Practice of Cloud System Administration, but the Q&A always leads me to say things I couldn't put in print.

Seating is limited. Register now!

How To Not Get Paged: Managing Oncall to Reduce Outages

Who should attend:

Anyone with an oncall responsibility (or their manager).


Tuesday, 11-Nov, 1:30pm-5pm at Usenix LISA


People think of "oncall" as responding to a pager that beeps because of an outage. In this class you will learn how to use oncall as a vehicle to improve system reliability so that you get paged less often.

Take back to work:

  • How to monitor more accurately so you get paged less
  • How to design an oncall schedule so that it is more fair and less stressful
  • How to ensure preventative work and long-term solutions get done between oncall shifts
  • How to conduct "Fire Drills" and "Game Day Exercises" to create antifragile systems
  • How to write a good Post-mortem document that communicates better and prevents future problems

Topics include:

  • Why your monitoring strategy is broken and how to fix it
  • Building a more fair oncall schedule
  • Monitoring to detect outages vs. monitoring to improve reliability
  • Alert review strategies
  • Conducting "Fire Drills" and "Game Day Exercises"
  • "Blameless Post-mortem documents"

Register today for Usenix LISA 2014!

October 14, 2014 03:00 PM

Debian Administration

Setting up your own graphical git-server with gitbucket

This article documents the process of configuring a git host, using gitbucket, which will give you a graphical interface to a collection of git repositories, accessible via any browser, along with support for groups, issues, and forks.

by Steve at October 14, 2014 09:18 AM

Chris Siebenmann

Bashisms in #!/bin/sh scripts are not necessarily bugs

In the wake of Shellshock, any number of people have cropped up in any number of places to say that you should always be able to change a system's /bin/sh to something other than Bash because Bashisms in scripts that are specified to use #!/bin/sh are a bug. It is my heretical view that these people are wrong in general (although potentially right in specific situations).

First, let us get a trivial root out of the way: a Unix distribution is fully entitled to assume that you have not changed non-adjustable things. If a distribution ships with /bin/sh as Bash and does not have a supported way to change it to some other shell, then the distribution is fully entitled to write its own #!/bin/sh shell scripts so that they use Bashisms. This may be an unwise choice on the distribution's part, but it's not a bug unless they have an official policy that all of their shell scripts should be POSIX-only.

(Of course the distribution may act on RFEs that their #!/bin/sh scripts not use Bashisms. But that's different from it being a bug.)

Next, let's talk about user scripts. On a system where /bin/sh is always officially Bash, ordinary people are equally entitled to assume that your systems have not been manually mangled into unofficial states. As a result they are also entitled to write their #!/bin/sh scripts with Bashisms in them, because these scripts work properly on all officially supported system configurations. As with distributions, this may not be a wise choice (since it may cause pain if and when they ever move those scripts to another Unix system) but it is not a bug. The only case when it even approaches being a bug is when the distribution has officially included large warnings saying '/bin/sh is currently Bash but it may be something else someday, you should write all /bin/sh shell scripts to POSIX only, and here is a tool to help with that'.

There are some systems where this is the case and has historically been the case, and on those systems you can say that people using Bashisms in #!/bin/sh scripts clearly have a bug by the system's official policy. There are also quite a number of systems where this is not or has not been the case, where the official /bin/sh is Bash and always has been. On those systems, Bashisms in #!/bin/sh scripts are not a bug.

(By the way, only relatively recently have you been able to count on /bin/sh being POSIX compatible; see here. Often it's had very few guarantees.)

By the way, as a pragmatic matter a system with only Bash as /bin/sh is likely to have plenty of /bin/sh shell scripts with Bashisms in them even if the official policy is that you should only use POSIX features in such scripts. This is a straightforward application of one of my aphorisms of system administration (and perhaps also this one). These scripts have a nominal bug, but of course people are not going to be happy if you break them.

by cks at October 14, 2014 06:07 AM

Check and Fix SSL servers for SSLv3 connections or the Poodle CVE-2014-3566 bug

The POODLE CVE-2014-3566 bug is a new bug discovered by Google in the SSLv3 protocol. The fix is easy: disable support for SSLv3. The original post links to a good list of SSL ciphers. You can use this check from the shell to check your servers. This command can easily be automated with other shell scripts. It also allows you to check your services without exposing them to an external checking website.

October 14, 2014 12:00 AM

October 13, 2014

Everything Sysadmin

Interview on Demystifying DevOps with Tom Limoncelli

Holly from SpiceWorks interviewed me while I was in Austin for the SpiceWorld '14 conference. We talked about DevOps from the SMB "IT guy" perspective, Lord of the Rings, Chef vs. Puppet, and my secret desire to start a podcast that would be "the Stephen Colbert of DevOps."

The interview has been published on their community website:

Demystifying DevOps: Q&A with Tom Limoncelli


October 13, 2014 03:21 PM

Tutorial: Live Upgrades on Running Systems

I'm teaching a tutorial at Usenix LISA called "Live Upgrades on Running Systems: 8 Ways to Upgrade a Running Service With Zero Downtime".

Ever notice that Google, Facebook and other websites aren't down periodically for software upgrades? That's because they're upgrading software on their service while it is live. As a result, they can push new features continuously. In this tutorial I'll describe 8 techniques they use... and so can you. Oh, and here's a secret: I'll have a 9th way to upgrade software... but it requires down-time. That said, it might not require down-time that is visible to users!

I'm excited about this tutorial because it covers a lot of the unique topics we cover in The Practice of Cloud System Administration that I haven't talked about publicly before.

Seating is limited. Register now!

Live Upgrades on Running Systems: 8 Ways to Upgrade a Running Service With Zero Downtime

Who should attend:

Sysadmins that run web-based services, or services that involve many machines.


Friday, 14-Nov, 9am-10:30am at Usenix LISA


How do you upgrade your service while it is running? This class covers nine techniques from the new book by Limoncelli/Chalup/Hogan, "The Practice of Cloud System Administration"... eight of which don't require downtime. Learn best practices from Google, Facebook, and other successful companies and apply them to your environment. Techniques include: The Google "Canary" process, Facebook "Dark Launches", proportional shedding, feature toggles, Erlang live-code upgrades, and live SQL and NoSQL schema changes.

Take back to work:

  • 8 ways to upgrade live systems without downtime
  • Techniques for cautious upgrades you may not have thought of
  • How to change SQL schemas without requiring downtime
  • Continuous Integration as a stepping stone to Continuous Deployment

Topics include:

  • Upgrade while the system is down (not viable for live upgrades)
  • Rolling upgrades
  • Google's "canary" upgrade system
  • Proportional Shedding
  • Feature Toggles
  • Facebook's Dark Launch system
  • Upgrades that involve SQL and NoSQL schema changes.
  • Languages that support live code upgrades
  • Continuous Deployment

Register today for Usenix LISA 2014!

October 13, 2014 03:00 PM


A Quick and Practical Reference for tcpdump

When it comes to tcpdump most admins fall into two categories: they either know tcpdump and all of its flags like the back of their hand, or they kind of know it but need to use a reference for anything outside of the basic usage. This is because tcpdump is a pretty advanced command and it is pretty easy to get into the depths of how networking works when using it.

For today's article I wanted to create a quick but practical reference for tcpdump. I will cover the basics as well as some of the more advanced usage. I am sure I will most likely leave out some cool commands so if you want to add anything please feel free to drop it into the comments section.

Before we get too far into the weeds, it is probably best to cover what tcpdump is used for. The command tcpdump is used to create "dumps" or "traces" of network traffic. It allows you to look at what is happening on the network and really can be useful for troubleshooting many types of issues including issues that aren't due to network communications. Outside of network issues I use tcpdump to troubleshoot application issues all the time; if you ever have two applications that don't seem to be working well together, tcpdump is a great way to see what is happening. This is especially true if the traffic is not encrypted as tcpdump can be used to capture and read packet data as well.

The Basics

The first thing to cover with tcpdump is what flags to use. In this section I am going to cover the most basic flags that can be used in most situations.

Don't translate hostnames, ports, etc

# tcpdump -n

By default tcpdump will try to look up and translate hostnames and ports.

# tcpdump
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 65535 bytes
16:15:05.051896 IP blog.ssh > Flags [P.], seq 2546456553:2546456749, ack 1824683693, win 355, options [nop,nop,TS val 620879437 ecr 620879348], length 196

You can turn this off by using the -n flag. Personally, I always use this flag, as the hostname and port translation usually annoys me because I tend to work from IP addresses rather than hostnames. However, it is useful to know that tcpdump can do the translation either way, as there are times when knowing which server the traffic is coming from is important.

# tcpdump -n
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 65535 bytes
16:23:47.934665 IP > Flags [P.], seq 2546457621:2546457817, ack 1824684201, win 355, options [nop,nop,TS val 621010158 ecr 621010055], length 196

Adding verbosity

# tcpdump -v

By adding a simple -v the output will start including a bit more, such as the ttl, total length and options in the IP packets.

# tcpdump
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 65535 bytes
16:15:05.051896 IP blog.ssh > Flags [P.], seq 2546456553:2546456749, ack 1824683693, win 355, options [nop,nop,TS val 620879437 ecr 620879348], length 196

tcpdump has three verbosity levels; you can add more verbosity by adding additional v's to the command line flags. In general, whenever I am using tcpdump I tend to use the highest verbosity, as I like having everything visible just in case I need it.

# tcpdump -vvv -c 1
tcpdump: listening on eth0, link-type EN10MB (Ethernet), capture size 65535 bytes
16:36:13.873456 IP (tos 0x10, ttl 64, id 121, offset 0, flags [DF], proto TCP (6), length 184)
    blog.ssh > Flags [P.], cksum 0x1ba1 (incorrect -> 0x0dfd), seq 2546458841:2546458973, ack 1824684869, win 355, options [nop,nop,TS val 621196643 ecr 621196379], length 132
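If you ever need to pull those extra header fields out of saved output, the parenthesised part of a verbose line can be split on commas. The following is only a rough Python sketch and not part of tcpdump itself; parse_ip_header is a name made up for this illustration, and it assumes the header summary sits on its own line as in the -vvv output above.

```python
def parse_ip_header(line):
    """Split the "(tos ..., ttl ..., ...)" summary of a verbose line into a dict."""
    start = line.index("IP (") + len("IP (")
    end = line.rindex(")")  # the last ")" closes the header summary
    fields = {}
    for item in line[start:end].split(", "):
        key, _, value = item.partition(" ")
        fields[key] = value
    return fields


sample = ("16:36:13.873456 IP (tos 0x10, ttl 64, id 121, offset 0, "
          "flags [DF], proto TCP (6), length 184)")
header = parse_ip_header(sample)
print(header["ttl"], header["proto"], header["length"])
```

All values are kept as strings, so convert them (for example int(header["ttl"])) before comparing numbers.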

Specifying an Interface

# tcpdump -i eth0

By default, when you run tcpdump without specifying an interface it will choose the lowest numbered interface. Usually this is eth0; however, that is not guaranteed on all systems.

# tcpdump
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 65535 bytes
16:15:05.051896 IP blog.ssh > Flags [P.], seq 2546456553:2546456749, ack 1824683693, win 355, options [nop,nop,TS val 620879437 ecr 620879348], length 196

You can specify the interface by using the -i flag followed by the interface name. On most Linux systems a special interface name of any can be used to tell tcpdump to listen on all interfaces. I find this extremely useful when troubleshooting servers with multiple interfaces, especially when there are routing issues involved.

# tcpdump -i any
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on any, link-type LINUX_SLL (Linux cooked), capture size 65535 bytes
16:45:59.312046 IP blog.ssh > Flags [P.], seq 2547763641:2547763837, ack 1824693949, win 355, options [nop,nop,TS val 621343002 ecr 621342962], length 196

Writing to a file

# tcpdump -w /path/to/file

When you just run tcpdump by itself it will output to your screen.

# tcpdump
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 65535 bytes
16:15:05.051896 IP blog.ssh > Flags [P.], seq 2546456553:2546456749, ack 1824683693, win 355, options [nop,nop,TS val 620879437 ecr 620879348], length 196

There are many times when you may want to save the tcpdump data to a file; the easiest way to do this is to use the -w flag. This is useful for situations where you need to review the network dump later. One benefit of saving the data to a file is that you can read the dump file multiple times and apply other flags or filters (which we will cover below) to that snapshot of network traffic.

# tcpdump -w /var/tmp/tcpdata.pcap
tcpdump: listening on eth0, link-type EN10MB (Ethernet), capture size 65535 bytes
1 packet captured
2 packets received by filter
0 packets dropped by kernel

By default the data is buffered and will not usually be written to the file until you CTRL+C out of the running tcpdump command.

Reading from a file

# tcpdump -r /path/to/file

Once you have saved the output to a file you will need a way to read that file. To do this you can simply use the -r flag followed by the path to the file.

# tcpdump -r /var/tmp/tcpdata.pcap 
reading from file /var/tmp/tcpdata.pcap, link-type EN10MB (Ethernet)
16:56:01.610473 IP blog.ssh > Flags [P.], seq 2547766673:2547766805, ack 1824696181, win 355, options [nop,nop,TS val 621493577 ecr 621493478], length 132

As a quick note, files saved by tcpdump can also be read by most other network troubleshooting tools, such as wireshark, if you are more familiar with those.
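If you are curious what -w actually writes, the sketch below decodes the fixed 24-byte header that starts a classic pcap file. It is only an illustration in Python and assumes the classic libpcap file format (not pcapng); read_pcap_header and the LINKTYPES table are names made up here, and the sample bytes are built by hand rather than read from a real capture.

```python
import struct

PCAP_MAGIC = 0xA1B2C3D4
# Only the two link types that appear in this article.
LINKTYPES = {1: "EN10MB (Ethernet)", 113: "LINUX_SLL (Linux cooked)"}


def read_pcap_header(data):
    """Decode the 24-byte global header of a classic pcap file."""
    if len(data) < 24:
        raise ValueError("need at least 24 bytes")
    # The magic number reveals the byte order the writer used.
    (magic,) = struct.unpack("<I", data[:4])
    if magic == PCAP_MAGIC:
        order = "<"
    elif magic == 0xD4C3B2A1:
        order = ">"
    else:
        raise ValueError("not a classic pcap file")
    major, minor, _zone, _sigfigs, snaplen, linktype = struct.unpack(
        order + "HHiIII", data[4:24])
    return {
        "version": (major, minor),
        "snaplen": snaplen,
        "linktype": LINKTYPES.get(linktype, str(linktype)),
    }


# Hand-built header: version 2.4, snaplen 65535, Linux cooked capture.
sample = struct.pack("<IHHiIII", PCAP_MAGIC, 2, 4, 0, 0, 65535, 113)
info = read_pcap_header(sample)
print(info)
```

To try it against a real capture you would pass in the first 24 bytes of the file, for example from the /var/tmp/tcpdata.pcap written earlier.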

Specifying the capture size of each packet

# tcpdump -s 100

By default most newer implementations of tcpdump will capture 65535 bytes of each packet; however, in some situations you may not want to capture the default packet length. You can use -s to specify the "snaplen" or "snapshot length" that you want tcpdump to capture.
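The practical effect of a snaplen is simply that each packet is cut off after that many bytes, while the original length is still recorded. A tiny Python sketch of the idea, with a made-up function name (apply_snaplen) and a made-up 499-byte frame:

```python
def apply_snaplen(frame, snaplen):
    """Return (captured_bytes, original_length) for one frame under a snaplen limit."""
    return frame[:snaplen], len(frame)


# A hypothetical 499-byte frame captured with the equivalent of "-s 100".
frame = bytes(499)
captured, original_length = apply_snaplen(frame, 100)
print(len(captured), original_length)  # 100 499
```

Anything past the snaplen is not stored, so payload data beyond that point cannot be printed later when the file is read back.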

Specifying the number of packets to capture

# tcpdump -c 10

When you run tcpdump by itself it will keep running until you hit CTRL+C to quit.

# tcpdump host
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 65535 bytes
0 packets captured
4 packets received by filter
0 packets dropped by kernel

You can tell tcpdump to stop capturing after a certain number of packets by using the -c flag followed by the number of packets to capture. This is useful for situations where you do not want tcpdump to spew output to your screen so fast you can't read it; however, it is generally more useful when you are using filters to grab specific traffic.

Pulling the basics together

# tcpdump -nvvv -i any -c 100 -s 100

All of the basic flags that were covered above can also be combined to allow you to specify exactly what you want tcpdump to provide.

# tcpdump -w /var/tmp/tcpdata.pcap -i any -c 10 -vvv
tcpdump: listening on any, link-type LINUX_SLL (Linux cooked), capture size 65535 bytes
10 packets captured
10 packets received by filter
0 packets dropped by kernel
# tcpdump -r /var/tmp/tcpdata.pcap -nvvv -c 5
reading from file /var/tmp/tcpdata.pcap, link-type LINUX_SLL (Linux cooked)
17:35:14.465902 IP (tos 0x10, ttl 64, id 5436, offset 0, flags [DF], proto TCP (6), length 104) > Flags [P.], cksum 0x1b51 (incorrect -> 0x72bc), seq 2547781277:2547781329, ack 1824703573, win 355, options [nop,nop,TS val 622081791 ecr 622081775], length 52
17:35:14.466007 IP (tos 0x10, ttl 64, id 52193, offset 0, flags [DF], proto TCP (6), length 52) > Flags [.], cksum 0x1b1d (incorrect -> 0x4950), seq 1, ack 52, win 541, options [nop,nop,TS val 622081791 ecr 622081791], length 0
17:35:14.470239 IP (tos 0x10, ttl 64, id 5437, offset 0, flags [DF], proto TCP (6), length 168) > Flags [P.], cksum 0x1b91 (incorrect -> 0x98c3), seq 52:168, ack 1, win 355, options [nop,nop,TS val 622081792 ecr 622081791], length 116
17:35:14.470370 IP (tos 0x10, ttl 64, id 52194, offset 0, flags [DF], proto TCP (6), length 52) > Flags [.], cksum 0x1b1d (incorrect -> 0x48da), seq 1, ack 168, win 541, options [nop,nop,TS val 622081792 ecr 622081792], length 0
17:35:15.464575 IP (tos 0x10, ttl 64, id 5438, offset 0, flags [DF], proto TCP (6), length 104) > Flags [P.], cksum 0x1b51 (incorrect -> 0xc3ba), seq 168:220, ack 1, win 355, options [nop,nop,TS val 622082040 ecr 622081792], length 52


Now that we have covered some of the basic flags we should cover filtering. tcpdump has the ability to filter the capture or output based on a variety of expressions; in this article I am only going to cover a few quick examples to give you an idea of the syntax. For a full list you can check out the pcap-filter section of the tcpdump manpage.

Searching for traffic to and from a specific host

# tcpdump -nvvv -i any -c 3 host

The above command will run a tcpdump and send the output to the screen like we saw with the flags before; however, it will only show packets whose source or destination IP address matches the address given to the host filter. Essentially, by adding host we are asking tcpdump to filter out anything that is not to or from that address.

# tcpdump -nvvv -i any -c 3 host
tcpdump: listening on any, link-type LINUX_SLL (Linux cooked), capture size 65535 bytes
17:54:15.067496 IP (tos 0x10, ttl 64, id 5502, offset 0, flags [DF], proto TCP (6), length 184) > Flags [P.], cksum 0x1ba1 (incorrect -> 0x9f75), seq 2547785621:2547785753, ack 1824705637, win 355, options [nop,nop,TS val 622366941 ecr 622366923], length 132
17:54:15.067613 IP (tos 0x10, ttl 64, id 52315, offset 0, flags [DF], proto TCP (6), length 52) > Flags [.], cksum 0x1b1d (incorrect -> 0x7c34), seq 1, ack 132, win 540, options [nop,nop,TS val 622366941 ecr 622366941], length 0
17:54:15.075230 IP (tos 0x10, ttl 64, id 5503, offset 0, flags [DF], proto TCP (6), length 648) > Flags [P.], cksum 0x1d71 (incorrect -> 0x3443), seq 132:728, ack 1, win 355, options [nop,nop,TS val 622366943 ecr 622366941], length 596

Only show traffic where the source is a specific host

# tcpdump -nvvv -i any -c 3 src host

Where the previous example showed traffic to and from a host, the above command will only show traffic where that host is the source of the packet. This is accomplished by adding src in front of the host filter, which tells tcpdump to match on the "source" only. This can be reversed by using the dst filter, which specifies the "destination".

# tcpdump -nvvv -i any -c 3 src host
tcpdump: listening on any, link-type LINUX_SLL (Linux cooked), capture size 65535 bytes
17:57:12.194902 IP (tos 0x10, ttl 64, id 52357, offset 0, flags [DF], proto TCP (6), length 52) > Flags [.], cksum 0x1b1d (incorrect -> 0x1707), seq 1824706545, ack 2547787717, win 540, options [nop,nop,TS val 622411223 ecr 622411223], length 0
17:57:12.196288 IP (tos 0x10, ttl 64, id 52358, offset 0, flags [DF], proto TCP (6), length 52) > Flags [.], cksum 0x1b1d (incorrect -> 0x15c5), seq 0, ack 325, win 538, options [nop,nop,TS val 622411223 ecr 622411223], length 0
17:57:12.197677 IP (tos 0x10, ttl 64, id 52359, offset 0, flags [DF], proto TCP (6), length 52) > Flags [.], cksum 0x1b1d (incorrect -> 0x1491), seq 0, ack 633, win 536, options [nop,nop,TS val 622411224 ecr 622411224], length 0
# tcpdump -nvvv -i any -c 3 dst host
tcpdump: listening on any, link-type LINUX_SLL (Linux cooked), capture size 65535 bytes
17:59:37.266838 IP (tos 0x10, ttl 64, id 5552, offset 0, flags [DF], proto TCP (6), length 184) > Flags [P.], cksum 0x1ba1 (incorrect -> 0x586d), seq 2547789725:2547789857, ack 1824707577, win 355, options [nop,nop,TS val 622447491 ecr 622447471], length 132
17:59:37.267850 IP (tos 0x10, ttl 64, id 5553, offset 0, flags [DF], proto TCP (6), length 392) > Flags [P.], cksum 0x1c71 (incorrect -> 0x462e), seq 132:472, ack 1, win 355, options [nop,nop,TS val 622447491 ecr 622447491], length 340
17:59:37.268606 IP (tos 0x10, ttl 64, id 5554, offset 0, flags [DF], proto TCP (6), length 360) > Flags [P.], cksum 0x1c51 (incorrect -> 0xf469), seq 472:780, ack 1, win 355, options [nop,nop,TS val 622447491 ecr 622447491], length 308

Filtering source and destination ports

# tcpdump -nvvv -i any -c 3 port 22 and port 60738

You can build some rather complicated filtering statements with tcpdump when you start using operators like and. You can think of this as something similar to if statements. In this example we are using the and operator to tell tcpdump to only output packets that involve both port 22 and port 60738. This allows us to narrow down the packets to a specific session, which can be extremely useful when troubleshooting network issues.

# tcpdump -nvvv -i any -c 3 port 22 and port 60738
tcpdump: listening on any, link-type LINUX_SLL (Linux cooked), capture size 65535 bytes
18:05:54.069403 IP (tos 0x10, ttl 64, id 64401, offset 0, flags [DF], proto TCP (6), length 104) > Flags [P.], cksum 0x1b51 (incorrect -> 0x5b3c), seq 917414532:917414584, ack 1550997318, win 353, options [nop,nop,TS val 622541691 ecr 622538903], length 52
18:05:54.072963 IP (tos 0x10, ttl 64, id 13601, offset 0, flags [DF], proto TCP (6), length 184) > Flags [P.], cksum 0x1ba1 (incorrect -> 0xb0b1), seq 1:133, ack 52, win 355, options [nop,nop,TS val 622541692 ecr 622541691], length 132
18:05:54.073080 IP (tos 0x10, ttl 64, id 64402, offset 0, flags [DF], proto TCP (6), length 52) > Flags [.], cksum 0x1b1d (incorrect -> 0x1e3b), seq 52, ack 133, win 353, options [nop,nop,TS val 622541692 ecr 622541692], length 0

You can express the and operator in a couple of different ways: you can use and or &&. Personally, I tend to use them both. It is important to remember that if you are going to use &&, you should enclose the filter expression in single or double quotes, because in BASH an unquoted && means "run one command and, if it succeeds, run a second". In general it is best to simply wrap filter expressions in quotes; this will prevent any unexpected results, as filters can contain quite a few characters that are special to the shell.

# tcpdump -nvvv -i any -c 3 'port 22 && port 60738'
tcpdump: listening on any, link-type LINUX_SLL (Linux cooked), capture size 65535 bytes
18:06:16.062818 IP (tos 0x10, ttl 64, id 64405, offset 0, flags [DF], proto TCP (6), length 88) > Flags [P.], cksum 0x1b41 (incorrect -> 0x776c), seq 917414636:917414672, ack 1550997518, win 353, options [nop,nop,TS val 622547190 ecr 622541776], length 36
18:06:16.065567 IP (tos 0x10, ttl 64, id 13603, offset 0, flags [DF], proto TCP (6), length 120) > Flags [P.], cksum 0x1b61 (incorrect -> 0xaf2d), seq 1:69, ack 36, win 355, options [nop,nop,TS val 622547191 ecr 622547190], length 68
18:06:16.065696 IP (tos 0x10, ttl 64, id 64406, offset 0, flags [DF], proto TCP (6), length 52) > Flags [.], cksum 0x1b1d (incorrect -> 0xf264), seq 36, ack 69, win 353, options [nop,nop,TS val 622547191 ecr 622547191], length 0
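The same quoting concern applies if you launch tcpdump from a script. As a rough Python sketch (tcpdump_argv is a made-up helper name, and only flags already covered above are used), the filter can be kept as a single argument so the shell never gets a chance to interpret the &&:

```python
import shlex


def tcpdump_argv(filter_expr, count=3, interface="any"):
    """Build an argument list in which the whole filter is one argument."""
    return ["tcpdump", "-nvvv", "-i", interface, "-c", str(count), filter_expr]


argv = tcpdump_argv("port 22 && port 60738")
# shlex.join shows how the list would have to be quoted on a shell command line.
print(shlex.join(argv))
# tcpdump -nvvv -i any -c 3 'port 22 && port 60738'
```

A list like this can be handed to subprocess.run without shell=True, in which case no shell quoting is needed at all.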

Searching for traffic on one port or another

# tcpdump -nvvv -i any -c 20 'port 80 or port 443'

You can also use the or or || operator to filter tcpdump results. In this example we are using the or operator to capture traffic to and from port 80 or port 443. This example is especially useful as webservers generally have two ports open, 80 for http traffic and 443 for https.

# tcpdump -nvvv -i any -c 20 'port 80 or port 443'
tcpdump: listening on any, link-type LINUX_SLL (Linux cooked), capture size 65535 bytes
18:24:28.817940 IP (tos 0x0, ttl 64, id 39930, offset 0, flags [DF], proto TCP (6), length 60) > Flags [S], cksum 0x1b25 (incorrect -> 0x8611), seq 3836995553, win 29200, options [mss 1460,sackOK,TS val 622820379 ecr 0,nop,wscale 7], length 0
18:24:28.818052 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 40) > Flags [R.], cksum 0x012c (correct), seq 0, ack 3836995554, win 0, length 0
18:24:32.721330 IP (tos 0x0, ttl 64, id 48510, offset 0, flags [DF], proto TCP (6), length 475) > Flags [P.], cksum 0x1cc4 (incorrect -> 0x3a4e), seq 580573019:580573442, ack 1982754038, win 237, options [nop,nop,TS val 622821354 ecr 622815632], length 423
18:24:32.721465 IP (tos 0x0, ttl 64, id 1266, offset 0, flags [DF], proto TCP (6), length 52) > Flags [.], cksum 0x1b1d (incorrect -> 0x45d7), seq 1, ack 423, win 243, options [nop,nop,TS val 622821355 ecr 622821354], length 0
18:24:32.722098 IP (tos 0x0, ttl 64, id 1267, offset 0, flags [DF], proto TCP (6), length 241) > Flags [P.], cksum 0x1bda (incorrect -> 0x855c), seq 1:190, ack 423, win 243, options [nop,nop,TS val 622821355 ecr 622821354], length 189
18:24:32.722232 IP (tos 0x0, ttl 64, id 48511, offset 0, flags [DF], proto TCP (6), length 52) > Flags [.], cksum 0x1b1d (incorrect -> 0x4517), seq 423, ack 190, win 245, options [nop,nop,TS val 622821355 ecr 622821355], length 0

Searching for traffic on two specific ports and from a specific host

# tcpdump -nvvv -i any -c 20 '(port 80 or port 443) and host'

While the previous example is great for looking at issues with a multiport protocol, what if this is a very high traffic webserver? The output from tcpdump may get a bit confusing. We can narrow down the results even further by adding a host filter. To do this while maintaining our or expression we can simply wrap the or statement in parentheses.

# tcpdump -nvvv -i any -c 20 '(port 80 or port 443) and host'
tcpdump: listening on any, link-type LINUX_SLL (Linux cooked), capture size 65535 bytes
18:38:05.551194 IP (tos 0x0, ttl 64, id 63169, offset 0, flags [DF], proto TCP (6), length 60) > Flags [S], cksum 0x1bcd (incorrect -> 0x0d96), seq 4173164403, win 29200, options [mss 1460,sackOK,TS val 623024562 ecr 0,nop,wscale 7], length 0
18:38:05.551310 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 40) > Flags [R.], cksum 0xa64a (correct), seq 0, ack 4173164404, win 0, length 0
18:38:05.717130 IP (tos 0x0, ttl 64, id 51574, offset 0, flags [DF], proto TCP (6), length 60) > Flags [S], cksum 0x1bcd (incorrect -> 0xdf7c), seq 1068257453, win 29200, options [mss 1460,sackOK,TS val 623024603 ecr 0,nop,wscale 7], length 0
18:38:05.717255 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 60) > Flags [S.], cksum 0x1bcd (incorrect -> 0xed80), seq 2992472447, ack 1068257454, win 28960, options [mss 1460,sackOK,TS val 623024603 ecr 623024603,nop,wscale 7], length 0
18:38:05.717474 IP (tos 0x0, ttl 64, id 51575, offset 0, flags [DF], proto TCP (6), length 52) > Flags [.], cksum 0x1bc5 (incorrect -> 0x8c87), seq 1, ack 1, win 229, options [nop,nop,TS val 623024604 ecr 623024603], length 0

You can use parentheses multiple times in a single filter. For example, the below command will limit the capture to packets that are to or from port 80 or port 443, involve either of two hosts, and are destined for a specific host.

# tcpdump -nvvv -i any -c 20 '((port 80 or port 443) and (host or host and dst host'
tcpdump: listening on any, link-type LINUX_SLL (Linux cooked), capture size 65535 bytes
18:53:30.349306 IP (tos 0x0, ttl 64, id 52641, offset 0, flags [DF], proto TCP (6), length 60) > Flags [S], cksum 0x1b25 (incorrect -> 0x4890), seq 3026316656, win 29200, options [mss 1460,sackOK,TS val 623255761 ecr 0,nop,wscale 7], length 0
18:53:30.349558 IP (tos 0x0, ttl 64, id 52642, offset 0, flags [DF], proto TCP (6), length 52) > Flags [.], cksum 0x1b1d (incorrect -> 0x3454), seq 3026316657, ack 3657995297, win 229, options [nop,nop,TS val 623255762 ecr 623255762], length 0
18:53:30.354056 IP (tos 0x0, ttl 64, id 52643, offset 0, flags [DF], proto TCP (6), length 475) > Flags [P.], cksum 0x1cc4 (incorrect -> 0x10c2), seq 0:423, ack 1, win 229, options [nop,nop,TS val 623255763 ecr 623255762], length 423
18:53:30.354682 IP (tos 0x0, ttl 64, id 52644, offset 0, flags [DF], proto TCP (6), length 52) > Flags [.], cksum 0x1b1d (incorrect -> 0x31e6), seq 423, ack 190, win 237, options [nop,nop,TS val 623255763 ecr 623255763], length 0
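When filters get this nested it can be easier to assemble them from pieces than to count parentheses by hand. The helpers below are a hypothetical Python sketch (any_of and all_of are made-up names, and the 192.0.2.x addresses are documentation placeholders, not the hosts from the capture above):

```python
def any_of(*terms):
    """Join filter terms with "or" and wrap the group in parentheses."""
    return "(" + " or ".join(terms) + ")"


def all_of(*terms):
    """Join filter terms with "and" and wrap the group in parentheses."""
    return "(" + " and ".join(terms) + ")"


web_ports = any_of("port 80", "port 443")
clients = any_of("host 192.0.2.10", "host 192.0.2.11")
expression = all_of(web_ports, clients, "dst host 192.0.2.20")
print(expression)
# ((port 80 or port 443) and (host 192.0.2.10 or host 192.0.2.11) and dst host 192.0.2.20)
```

The resulting string would be passed to tcpdump as one quoted argument, just like the hand-written filters above.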

Understanding the output

Capturing network traffic with tcpdump is hard enough with all of the options, but once you have that data you have to decipher it. In this section we are going to cover how to identify the source/destination IP, source/destination port and the type of packet for the TCP protocol. These are very basic items and far from everything you can identify from tcpdump output; however, this article is meant to be quick and dirty so we will keep it to the basics. For more information on tcpdump and what is being listed I suggest checking out the manpages.

Identifying the source and destination

Identifying the source and destination addresses and ports is actually fairly easy.

> Flags [S], cksum 0xcf28 (incorrect -> 0x0388), seq 682725222, win 29200, options [mss 1460,sackOK,TS val 619989005 ecr 0,nop,wscale 7], length 0

Given the above output we can see that the source port is 56894 and the destination port is 22, each appended to its IP address. This is pretty easy to identify once you understand the format of tcpdump. If you haven't guessed the format yet, you can break it down as follows: src-ip.src-port > dest-ip.dest-port: Flags [S]. The source is in front of the > and the destination is behind it. You can think of the > as an arrow pointing to the destination.
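The src-ip.src-port > dest-ip.dest-port layout is regular enough to split mechanically. Here is a minimal Python sketch of that; parse_endpoints and split_endpoint are made-up names, the addresses in the sample line are documentation placeholders, and it assumes IPv4 output.

```python
def split_endpoint(endpoint):
    """Split "address.port" on the last dot."""
    address, _, port = endpoint.rpartition(".")
    return address, port


def parse_endpoints(line):
    """Return ((src_ip, src_port), (dst_ip, dst_port)) from one output line."""
    left, _, right = line.partition(" > ")
    source = left.split()[-1]            # last word before the ">"
    destination = right.split(":", 1)[0]  # everything up to the first ":"
    return split_endpoint(source), split_endpoint(destination)


sample = ("IP 192.0.2.10.56894 > 192.0.2.20.22: Flags [S], "
          "seq 682725222, win 29200, length 0")
src, dst = parse_endpoints(sample)
print(src, dst)  # ('192.0.2.10', '56894') ('192.0.2.20', '22')
```

Without the -n flag the port may be a service name such as ssh, which is why the port is left as a string here.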

Identifying the type of packet

> Flags [S], cksum 0xcf28 (incorrect -> 0x0388), seq 682725222, win 29200, options [mss 1460,sackOK,TS val 619989005 ecr 0,nop,wscale 7], length 0

From the sample above we can tell that the packet is a single SYN packet. We can identify this by the Flags [S] section of the tcpdump output; different types of packets have different flags. Without going too deep into what types of packets exist within TCP, you can use the below as a cheat sheet for identifying packet types.

  • [S] - SYN (Start Connection)
  • [.] - ACK (Acknowledgment)
  • [P] - PSH (Push Data)
  • [F] - FIN (Finish Connection)
  • [R] - RST (Reset Connection)

The flags can be combined, so you will also see values such as [S.], which indicates a SYN-ACK packet, and [P.], which is a PSH with ACK. A packet with no flags set is shown as [none].
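If you are post-processing tcpdump output, the flag field can be decoded with a small lookup table. This is only a Python sketch with made-up names (FLAG_NAMES, decode_flags); it follows the tcpdump man page, where a "." in the flags field stands for ACK, and it only covers the flags listed above.

```python
FLAG_NAMES = {"S": "SYN", ".": "ACK", "P": "PSH", "F": "FIN", "R": "RST"}


def decode_flags(field):
    """Turn a flags field such as "[P.]" into a list of flag names."""
    letters = field.strip("[]")
    return [FLAG_NAMES.get(letter, "unknown(" + letter + ")") for letter in letters]


print(decode_flags("[S]"))   # ['SYN']
print(decode_flags("[S.]"))  # ['SYN', 'ACK']
print(decode_flags("[P.]"))  # ['PSH', 'ACK']
```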

An unhealthy example

15:15:43.323412 IP (tos 0x0, ttl 64, id 51051, offset 0, flags [DF], proto TCP (6), length 60) > Flags [S], cksum 0xcf28 (incorrect -> 0x0388), seq 682725222, win 29200, options [mss 1460,sackOK,TS val 619989005 ecr 0,nop,wscale 7], length 0
15:15:44.321444 IP (tos 0x0, ttl 64, id 51052, offset 0, flags [DF], proto TCP (6), length 60) > Flags [S], cksum 0xcf28 (incorrect -> 0x028e), seq 682725222, win 29200, options [mss 1460,sackOK,TS val 619989255 ecr 0,nop,wscale 7], length 0
15:15:46.321610 IP (tos 0x0, ttl 64, id 51053, offset 0, flags [DF], proto TCP (6), length 60) > Flags [S], cksum 0xcf28 (incorrect -> 0x009a), seq 682725222, win 29200, options [mss 1460,sackOK,TS val 619989755 ecr 0,nop,wscale 7], length 0

The above sampling shows an example of an unhealthy exchange, and for this example unhealthy means no exchange at all. We can see the source host sending a SYN packet to the destination host three times, each with the same sequence number (682725222); however, we never see a response from the destination.
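One quick way to confirm that repeated SYNs like these are retransmissions is to look at the spacing between them. The Python sketch below (retransmit_gaps is a made-up name) uses the three timestamps from the capture above; gaps of roughly one and then two seconds are consistent with a sender backing off between retries.

```python
from datetime import datetime


def retransmit_gaps(timestamps):
    """Return the seconds between consecutive tcpdump timestamps."""
    times = [datetime.strptime(stamp, "%H:%M:%S.%f") for stamp in timestamps]
    return [round((later - earlier).total_seconds(), 3)
            for earlier, later in zip(times, times[1:])]


syn_times = ["15:15:43.323412", "15:15:44.321444", "15:15:46.321610"]
gaps = retransmit_gaps(syn_times)
print(gaps)  # [0.998, 2.0]
```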

A healthy example

15:18:25.716453 IP (tos 0x10, ttl 64, id 53344, offset 0, flags [DF], proto TCP (6), length 60) > Flags [S], cksum 0xcf3a (incorrect -> 0xc838), seq 1943877315, win 29200, options [mss 1460,sackOK,TS val 620029603 ecr 0,nop,wscale 7], length 0
15:18:25.716777 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto TCP (6), length 60) > Flags [S.], cksum 0x594a (correct), seq 4001145915, ack 1943877316, win 5792, options [mss 1460,sackOK,TS val 18495104 ecr 620029603,nop,wscale 2], length 0
15:18:25.716899 IP (tos 0x10, ttl 64, id 53345, offset 0, flags [DF], proto TCP (6), length 52) > Flags [.], cksum 0xcf32 (incorrect -> 0x9dcc), ack 1, win 229, options [nop,nop,TS val 620029603 ecr 18495104], length 0

A healthy example would look like the above, where we can see a standard TCP 3-way handshake. The first packet is a SYN from the client to the server, and the second packet is a SYN-ACK from the server acknowledging the SYN (note that its ack value, 1943877316, is the client's sequence number plus one). The final packet is an ACK, or rather a SYN-ACK-ACK, from the client acknowledging that it has received the SYN-ACK. From this point on there is an established TCP/IP connection.
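The handshake pattern can be checked mechanically as well. This is a hedged Python sketch using the flags and sequence numbers from the three packets above; is_three_way_handshake and the dictionary layout are made up for the illustration, and only the first acknowledgment is verified because tcpdump prints the later one as a relative number.

```python
def is_three_way_handshake(syn, syn_ack, ack):
    """Check flags, and that the SYN-ACK acknowledges the SYN's sequence number + 1."""
    return (
        syn["flags"] == "S"
        and syn_ack["flags"] == "S."
        and syn_ack["ack"] == syn["seq"] + 1
        and ack["flags"] == "."
    )


syn = {"flags": "S", "seq": 1943877315}
syn_ack = {"flags": "S.", "seq": 4001145915, "ack": 1943877316}
ack = {"flags": ".", "ack": 1}  # relative acknowledgment number as printed
print(is_three_way_handshake(syn, syn_ack, ack))  # True
```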

Packet Inspection

Printing packet data in Hex and ASCII

# tcpdump -nvvv -i any -c 1 -XX 'port 80 and host'

A common method of troubleshooting application issues over the network is to use tcpdump with the -XX flag to print the packet data in hex and ASCII. This is a pretty helpful command, as it allows you to look at the source, destination, type of packet and the packet contents themselves. However, I am not a fan of this output; I think it is a bit hard to read.

# tcpdump -nvvv -i any -c 1 -XX 'port 80 and host'
tcpdump: listening on any, link-type LINUX_SLL (Linux cooked), capture size 65535 bytes
19:51:15.697640 IP (tos 0x0, ttl 64, id 54313, offset 0, flags [DF], proto TCP (6), length 483) > Flags [P.], cksum 0x1ccc (incorrect -> 0x2ce8), seq 3920159713:3920160144, ack 969855140, win 245, options [nop,nop,TS val 624122099 ecr 624117334], length 431
        0x0000:  0000 0001 0006 fe0a e2d1 8785 0000 0800  ................
        0x0010:  4500 01e3 d429 4000 4006 49f5 0a00 0301  E....)@.@.I.....
        0x0020:  0a00 03f6 b2a4 0050 e9a8 e3e1 39ce d0a4  .......P....9...
        0x0030:  8018 00f5 1ccc 0000 0101 080a 2533 58f3  ............%3X.
        0x0040:  2533 4656 4745 5420 2f73 6f6d 6570 6167  %3FVGET./somepag
        0x0050:  6520 4854 5450 2f31 2e31 0d0a 486f 7374  e.HTTP/1.1..Host
        0x0060:  3a20 3130 2e30 2e33 2e32 3436 0d0a 436f  :.10.0.3.246..Co
        0x0070:  6e6e 6563 7469 6f6e 3a20 6b65 6570 2d61  nnection:.keep-a
        0x0080:  6c69 7665 0d0a 4361 6368 652d 436f 6e74  live..Cache-Cont
        0x0090:  726f 6c3a 206d 6178 2d61 6765 3d30 0d0a  rol:.max-age=0..
        0x00a0:  4163 6365 7074 3a20 7465 7874 2f68 746d  Accept:.text/htm
        0x00b0:  6c2c 6170 706c 6963 6174 696f 6e2f 7868  l,application/xh
        0x00c0:  746d 6c2b 786d 6c2c 6170 706c 6963 6174  tml+xml,applicat
        0x00d0:  696f 6e2f 786d 6c3b 713d 302e 392c 696d  ion/xml;q=0.9,im
        0x00e0:  6167 652f 7765 6270 2c2a 2f2a 3b71 3d30  age/webp,*/*;q=0
        0x00f0:  2e38 0d0a 5573 6572 2d41 6765 6e74 3a20  .8..User-Agent:.
        0x0100:  4d6f 7a69 6c6c 612f 352e 3020 284d 6163  Mozilla/5.0.(Mac
        0x0110:  696e 746f 7368 3b20 496e 7465 6c20 4d61  intosh;.Intel.Ma
        0x0120:  6320 4f53 2058 2031 305f 395f 3529 2041  c.OS.X.10_9_5).A
        0x0130:  7070 6c65 5765 624b 6974 2f35 3337 2e33  ppleWebKit/537.3
        0x0140:  3620 284b 4854 4d4c 2c20 6c69 6b65 2047  6.(KHTML,.like.G
        0x0150:  6563 6b6f 2920 4368 726f 6d65 2f33 382e  ecko).Chrome/38.
        0x0160:  302e 3231 3235 2e31 3031 2053 6166 6172  0.2125.101.Safar
        0x0170:  692f 3533 372e 3336 0d0a 4163 6365 7074  i/537.36..Accept
        0x0180:  2d45 6e63 6f64 696e 673a 2067 7a69 702c  -Encoding:.gzip,
        0x0190:  6465 666c 6174 652c 7364 6368 0d0a 4163  deflate,sdch..Ac
        0x01a0:  6365 7074 2d4c 616e 6775 6167 653a 2065  cept-Language:.e
        0x01b0:  6e2d 5553 2c65 6e3b 713d 302e 380d 0a49  n-US,en;q=0.8..I
        0x01c0:  662d 4d6f 6469 6669 6564 2d53 696e 6365  f-Modified-Since
        0x01d0:  3a20 5375 6e2c 2031 3220 4f63 7420 3230  :.Sun,.12.Oct.20
        0x01e0:  3134 2031 393a 3430 3a32 3020 474d 540d  14.19:40:20.GMT.
        0x01f0:  0a0d 0a                                  ...
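Hard to read or not, the hex column contains everything tcpdump printed in the summary line. As a rough Python sketch, the first three rows of the dump above can be unpacked by hand; it assumes the 16-byte Linux cooked header that comes with -i any and a 20-byte IPv4 header with no options, and the names used (HEX_ROWS, ascii_column) are made up for the illustration.

```python
import socket
import struct

HEX_ROWS = [
    "0000 0001 0006 fe0a e2d1 8785 0000 0800",  # 16-byte Linux cooked header
    "4500 01e3 d429 4000 4006 49f5 0a00 0301",  # IPv4 header starts here
    "0a00 03f6 b2a4 0050 e9a8 e3e1 39ce d0a4",  # end of IPv4 header, start of TCP
]
frame = bytes.fromhex(" ".join(HEX_ROWS))

SLL_HEADER_LEN = 16
ip_header = frame[SLL_HEADER_LEN:SLL_HEADER_LEN + 20]
(ver_ihl, tos, total_length, ident, flags_frag,
 ttl, proto, checksum, src, dst) = struct.unpack("!BBHHHBBH4s4s", ip_header)
# The TCP header follows the IP header; its first four bytes are the two ports.
src_port, dst_port = struct.unpack("!HH", frame[36:40])


def ascii_column(data):
    """Render bytes the way the right-hand column does: printable or "."."""
    return "".join(chr(b) if 0x21 <= b <= 0x7E else "." for b in data)


print(socket.inet_ntoa(src), src_port, ">", socket.inet_ntoa(dst), dst_port)
# 10.0.3.1 45732 > 10.0.3.246 80
print(total_length, ident, ttl, proto)  # 483 54313 64 6
print(ascii_column(frame[16:32]))       # E....)@.@.I.....
```

The length 483, id 54313, ttl 64 and proto 6 match the values in the summary line of the capture above.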

Printing packet data in ASCII only

# tcpdump -nvvv -i any -c 1 -A 'port 80 and host'

I tend to prefer to print only the ASCII data; this helps me to quickly identify what is being sent and what is correct or not correct about the packet's data. To print packet data in ASCII only you can use the -A flag.

# tcpdump -nvvv -i any -c 1 -A 'port 80 and host'
tcpdump: listening on any, link-type LINUX_SLL (Linux cooked), capture size 65535 bytes
19:59:52.011337 IP (tos 0x0, ttl 64, id 53757, offset 0, flags [DF], proto TCP (6), length 406) > Flags [P.], cksum 0x1c7f (incorrect -> 0xead1), seq 1552520173:1552520527, ack 428165415, win 237, options [nop,nop,TS val 624251177 ecr 624247749], length 354
%5Q)%5C.GET /newpage HTTP/1.1
Connection: keep-alive
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.101 Safari/537.36
Accept-Encoding: gzip,deflate,sdch
Accept-Language: en-US,en;q=0.8

As you can see from the output above we have successfully captured an http GET request. Being able to print the packet data in a human readable format is very useful when troubleshooting application issues where the traffic is not encrypted. If you are troubleshooting encrypted traffic then printing packet data is not very useful. However, if you have the certificates and private keys in use you could use commands such as ssldump or even wireshark.
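Once the payload is readable text, picking the request apart is straightforward. The following Python sketch is only an illustration (parse_http_request and REQUEST_LINE are made-up names); it searches for the method rather than assuming it starts the line, because -A output can carry a few stray characters before the request, as seen above.

```python
import re

REQUEST_LINE = re.compile(
    r"(GET|HEAD|POST|PUT|DELETE|OPTIONS|PATCH) (\S+) (HTTP/\d\.\d)")


def parse_http_request(text):
    """Return (method, path, version, headers) from -A style request text."""
    lines = text.splitlines()
    match = REQUEST_LINE.search(lines[0])
    if match is None:
        raise ValueError("no HTTP request line found")
    headers = {}
    for line in lines[1:]:
        name, separator, value = line.partition(": ")
        if separator:
            headers[name] = value
    return match.group(1), match.group(2), match.group(3), headers


sample = (
    "%5Q)%5C.GET /newpage HTTP/1.1\n"
    "Connection: keep-alive\n"
    "Accept-Encoding: gzip,deflate,sdch\n"
    "Accept-Language: en-US,en;q=0.8\n"
)
method, path, version, headers = parse_http_request(sample)
print(method, path, version)  # GET /newpage HTTP/1.1
```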

Non-TCP Traffic

While the majority of this article covered TCP based traffic, tcpdump can capture much more than TCP. It can also be used to capture ICMP, UDP, and ARP packets to name a few. Below are a few quick examples of non-TCP packets captured by tcpdump.

ICMP packets

# tcpdump -nvvv -i any -c 2 icmp
tcpdump: listening on any, link-type LINUX_SLL (Linux cooked), capture size 65535 bytes
20:11:24.627824 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto ICMP (1), length 84) > ICMP echo request, id 15683, seq 1, length 64
20:11:24.627926 IP (tos 0x0, ttl 64, id 31312, offset 0, flags [none], proto ICMP (1), length 84) > ICMP echo reply, id 15683, seq 1, length 64

UDP packets

# tcpdump -nvvv -i any -c 2 udp
tcpdump: listening on any, link-type LINUX_SLL (Linux cooked), capture size 65535 bytes
20:12:41.726355 IP (tos 0xc0, ttl 64, id 0, offset 0, flags [DF], proto UDP (17), length 76) > [bad udp cksum 0x43a9 -> 0x7043!] NTPv4, length 48
        Client, Leap indicator: clock unsynchronized (192), Stratum 2 (secondary reference), poll 6 (64s), precision -22
        Root Delay: 0.085678, Root dispersion: 57.141830, Reference-ID:
          Reference Timestamp:  3622133515.811991035 (2014/10/12 20:11:55)
          Originator Timestamp: 3622133553.828614115 (2014/10/12 20:12:33)
          Receive Timestamp:    3622133496.748308420 (2014/10/12 20:11:36)
          Transmit Timestamp:   3622133561.726278364 (2014/10/12 20:12:41)
            Originator - Receive Timestamp:  -57.080305658
            Originator - Transmit Timestamp: +7.897664248
20:12:41.748948 IP (tos 0x0, ttl 54, id 9285, offset 0, flags [none], proto UDP (17), length 76) > [udp sum ok] NTPv4, length 48
        Server, Leap indicator:  (0), Stratum 3 (secondary reference), poll 6 (64s), precision -20
        Root Delay: 0.054077, Root dispersion: 0.058944, Reference-ID:
          Reference Timestamp:  3622132887.136984840 (2014/10/12 20:01:27)
          Originator Timestamp: 3622133561.726278364 (2014/10/12 20:12:41)
          Receive Timestamp:    3622133618.830113530 (2014/10/12 20:13:38)
          Transmit Timestamp:   3622133618.830129086 (2014/10/12 20:13:38)
            Originator - Receive Timestamp:  +57.103835195
            Originator - Transmit Timestamp: +57.103850722
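The large numbers in the NTP output are seconds counted from 1 January 1900, which is why tcpdump prints a readable date next to each one. A short Python sketch of that conversion (ntp_to_datetime is a made-up name, and it assumes the dates above were printed in UTC, which the matching result suggests):

```python
from datetime import datetime, timezone

# Seconds between the NTP epoch (1900-01-01) and the Unix epoch (1970-01-01).
NTP_UNIX_OFFSET = 2208988800


def ntp_to_datetime(ntp_seconds):
    """Convert whole NTP seconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(ntp_seconds - NTP_UNIX_OFFSET, tz=timezone.utc)


reference = ntp_to_datetime(3622133515)
print(reference.strftime("%Y/%m/%d %H:%M:%S"))  # 2014/10/12 20:11:55
```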

If you have an awesome tcpdump command example that you think should be added to this article feel free to post it in the comments section.

by Benjamin Cane at October 13, 2014 07:50 AM

Chris Siebenmann

System metrics need to be documented, not just to exist

As a system administrator, I love systems that expose metrics (performance, health, status, whatever they are). But there's a big caveat to that, which is that metrics don't really exist until they're meaningfully documented. Sadly, documenting your metrics is much less common than simply exposing them, perhaps because it takes much more work.

At the best of times this forces system administrators and other bystanders to reverse engineer your metrics from your system's source code or from programs that you or other people write to report on them. At the worst this makes your metrics effectively useless; sysadmins can see the numbers and see them change, but they have very little idea of what they mean.

(Maybe sysadmins can dump them into a stats tracking system and look for correlations.)

Forcing people to reverse engineer the meaning of your stats has two bad effects. The obvious one is that people almost always wind up duplicating this work, which is just wasted effort. The subtle one is that it is terribly easy for a mistake about what the metrics mean to become, essentially, superstition that everyone knows and spreads. Because people are reverse engineering things in the first place, it's very easy for mistakes and misunderstandings to happen; then people write the mistake down or embody it in a useful program and pretty soon it is being passed around the Internet since it's one of the few resources on the stats that exist. One mistake will be propagated into dozens of useful programs, various blog posts, and so on, and through the magic of the Internet many of these secondary sources will come off as unhesitatingly authoritative. At that point, good luck getting any sort of correction out into the Internet (if you even notice that people are misinterpreting your stats).

At this point some people will suggest that sysadmins should avoid doing anything with stats that they reverse engineer unless they are absolutely, utterly sure that they're correct. I'm sorry, life doesn't work this way. Very few sysadmins reverse engineer stats for fun; instead, we're doing it to solve problems. If our reverse engineering solves our problems and appears sane, many sysadmins are going to share their tools and what they've learned. It's what people do these days; we write blog posts, we answer questions on Stack Overflow, we put up GitHub repos with 'here, these are the tools that worked for me'. And all of those things flow around the Internet.

(Also, the suggestion that people should not write tools or write up documentation unless they are absolutely sure that they are correct is essentially equivalent to asking people not to do this at all. To be absolutely sure that you're right about a statistic, you generally need to fully understand the code. That's what they call rather uncommon.)

by cks at October 13, 2014 05:25 AM

October 12, 2014

Chris Siebenmann

Phish spammers are apparently exploiting mailing list software

One of the interesting things I've observed recently through my sinkhole SMTP server is a small number of phish spams that have been sent to me by what is clearly mailing list software; the latest instance was sent by a Mailman installation, for example. Although I initially thought all three of the emails I've spotted had one root cause, it turns out that there are several different things apparently going on.

In one case, the phish spammer clearly seems to have compromised a legitimate machine with mailing list software and then used that software to make themselves a phish spamming mailing list. It's easy to see the attraction of this; it makes the phish spammer much more efficient in that it takes them less time to send stuff to more people. In an interesting twist, the Received headers of the email I got say that the spammer initially sent it with the envelope address of (which matched their From:) and then the mailing list software rewrote the envelope sender.

In the most clear-cut case, the phish spammer seems to have sent out their spam through a commercial site that advertises itself as (hosted) 'Bulk Email Marketing Software'. This suggests that the phish spammer was willing to spend some money on their spamming, or at least burned a stolen credit card (the website advertises fast signups, which mean that credit cards mean basically nothing). I'm actually surprised that this doesn't happen more often, given that my impression is that the spam world is increasingly commercialized and phish spammers now often buy access to compromised machines instead of compromising the machines themselves. If you're going to spend money one way or another and you can safely just buy use of a commercial spam operation, well, why not?

(I say 'seems to' because the domain I got it from is not quite the same as the commercial site's main domain, although there are various indications tying it to them. If the phish spammer is trying to frame this commercial site, they went to an unusually large amount of work to do so.)

The third case is the most interesting to me. It uses a domain that was registered two days before it sent the phish spam and that domain was registered by an organization called 'InstantBulkSMTP'. The sending IP,, was apparently also assigned on the same day. The domain has now disappeared but the sending IP now has DNS that claims it is '' and the website for that domain is the control panel for something called 'Interspire Email Marketer'. So my operating theory is that it's somewhat like the second case; a phish spammer found a company that sets up this sort of stuff and paid them some money (or gave them a bad credit card) for a customized service. The domain name they used was probably picked to be useful for the phish spam target.

(The domain was '' and the phish target was Google Translate claims that 'titolari' translates to 'holders'.)

PS: All of this shows the hazards of looking closely at spam. Until I started writing this entry, I had thought that all three cases were the same and were like the first one, ie phish spammers exploiting compromised machines with mailing list managers. Then things turned out to be more complicated and my nice simple short blog entry disappeared in a puff of smoke.

by cks at October 12, 2014 05:37 AM

Configserver Firewall & Security (CSF/LFD)

This page covers my notes about csf and lfd. csf is an easy SPI iptables firewall suite. lfd is the login failure daemon, which scans log files for failed authentication and blocks the IPs doing that. This page covers installation, popular command line options and popular config file options.

October 12, 2014 12:00 AM

October 11, 2014

Everything Sysadmin

Tutorial: Work Like a Team, not a group of individuals

I'm teaching a tutorial at Usenix LISA called "Work Like a Team: Best Practices for Team Coordination and Collaborations So You Aren't Acting Like a Group of Individuals".

I'm excited about this class because I'm going to demo a lot of the Google Apps tricks I've accumulated over the years, and combine them with stories about successes (and failures) related to bringing teams together to work on projects. I also get to explain a lot of DevOps culture in ways that make sense to non-DevOps shops (mostly stuff I've been advocating for since before "devops" was a thing). A lot of the material will overlap with our new book, The Practice of Cloud System Administration.

Seating is limited. Register now!

Work Like a Team: Best Practices for Team Coordination and Collaborations So You Aren't Acting Like a Group of Individuals

Who should attend:

System administrators and managers that work on a team of 3 or more.


Sunday, 9-Nov, 9am-12:30pm at Usenix LISA


System Administration is a team sport. How can we better collaborate and work as a team? Techniques will include many uses of Google Docs, wikis and other shared document systems, as well as strategies and methods that create a culture of cooperation.

Take back to work:

  • Behavior that builds team cohesion
  • 3 uses of Google docs you had not previously considered
  • How to organize team projects to improve teamwork
  • Track projects using Kanban boards.
  • How to divide big projects among team members
  • Collaborating via the "Tom Sawyer Fence Painting" technique
  • How to criticize the work of teammates constructively
  • How to get agreement on big plans

Topics include:

  • Meetings: How to make them more effective, shorter, and more democratic
  • How to create accountability, stop re-visiting past decisions, improve involvement
  • Strategy for leaving "fire-fighting" mode, be more "project focused".
  • Project Work: Using "design docs" to get consensus on big and small designs before they are committed to code.
  • Service Docs: How to document services so any team member can cover for any other.
  • Kanban: How to manage work that needs to be done.
  • Chatroom effectiveness: How to make everyone feel included, not lose important decisions.
  • Playbooks: How to get consistent results across the team, train new-hires, make delegation easier.
  • Send more effective email: How to write email that gets read.

Register today for Usenix LISA 2014!

October 11, 2014 03:00 PM

Chris Siebenmann

Thinking about how to create flexible aggregations from causes

Every so often I have programming puzzles that I find frustrating, not so much because I can't solve them as such but because I feel that there must already be a solution for them if I could both formulate the problem right and then use that formulation to search existing collections of algorithms and such. Today's issue is a concrete problem I am running into with NFS activity monitoring.

Suppose that you have a collection of specific event counters, where you know that userid W on machine X working against filesystem Y (in ZFS pool Z) did N NFS operations per second. My goal is to show aggregate information about the top sources of operations on the server, where a source might be one machine, one user, one filesystem, one pool, or some combination of these. This gives me two problems.

The first problem is efficiently going 'upwards' to sum together various specific event counters into more general categories (with the most general one being 'all NFS operations'). This feels like I want some sort of clever tree or inverted tree data structure, but I could just do it by brute force since I will probably not be dealing with too many specific event counters at any one time (from combinations we can see that each 4-element specific initial event maps to 16 categories; this is amenable to brute force on modern machines).
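A minimal sketch of the brute force approach, assuming each specific counter is keyed by a (user, machine, fs, pool) tuple. The field names, the `aggregate` function, and the sample numbers are illustrative assumptions, not anything from an existing tool. Every subset of the four fields is one category, which is where the 16 comes from (2**4), with the empty subset standing for 'all NFS operations'.

```python
from collections import Counter
from itertools import combinations

# Hypothetical field names for the four parts of a specific event counter.
FIELDS = ("user", "machine", "fs", "pool")


def aggregate(counters):
    """Sum specific counters into every more general category.

    counters maps a (user, machine, fs, pool) tuple to an operation count.
    The result maps a frozenset of (field, value) pairs to the total count;
    the empty frozenset is the 'all NFS operations' category.
    """
    totals = Counter()
    for key, count in counters.items():
        pairs = tuple(zip(FIELDS, key))
        # Every subset of the 4 fields is a category: 2**4 = 16 per event.
        for size in range(len(pairs) + 1):
            for subset in combinations(pairs, size):
                totals[frozenset(subset)] += count
    return totals


# Made-up sample data for illustration only.
sample = {
    ("W", "X", "Y", "Z"): 1000,
    ("V", "X", "Q", "Z"): 250,
}
totals = aggregate(sample)
print("all operations:", totals[frozenset()])
print("from machine X:", totals[frozenset({("machine", "X")})])
print("by user W:", totals[frozenset({("user", "W")})])
```

This does 16 dictionary updates per specific counter, so the cost grows linearly with the number of counters; whether that is fast enough depends on how many counters show up per reporting interval.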

The second problem is going back 'down' from a category sum to the most specific cause possible for it so that we can report only that. The easiest way to explain this is with an example; if we have (user W, machine X, fs Y, pool Z) with 1000 operations and W was the only user to do things from that machine or on that filesystem, we don't want a report that lists every permutation of the machine and filesystem (eg '1000 from X', '1000 against Y', '1000 from X against Y', etc). Instead we want to report only that 1000 events came from user W on machine X doing things to filesystem Y.

If I wind up with a real tree, this smells like a case of replacing nodes that have only one child with their child (with some special cases around the edges). If I wind up with some other data structure, well, I'll have to figure it out then. And a good approach for this might well influence what data structure I want to use for the first problem.
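One possible way to go back 'down' without a real tree, sketched under the assumption that category totals are kept in a dictionary keyed by sets of (field, value) pairs: drop any category for which a strictly more specific category has the same total. The function names and sample data here are hypothetical, and the pairwise comparison is deliberately naive.

```python
from collections import Counter
from itertools import combinations

# Hypothetical field names for the four parts of a specific event counter.
FIELDS = ("user", "machine", "fs", "pool")


def category_totals(counters):
    """Brute-force sum of each specific counter into all 16 of its categories."""
    totals = Counter()
    for key, count in counters.items():
        pairs = tuple(zip(FIELDS, key))
        for size in range(len(pairs) + 1):
            for subset in combinations(pairs, size):
                totals[frozenset(subset)] += count
    return totals


def most_specific(totals):
    """Keep a category only if no strictly more specific one has the same total."""
    kept = {}
    for cat, total in totals.items():
        # 'cat < other' is the proper-subset test: other names every field
        # that cat does, plus at least one more.
        redundant = any(
            cat < other and total == other_total
            for other, other_total in totals.items()
        )
        if not redundant:
            kept[cat] = total
    return kept


def describe(cat):
    """Render a category as text, e.g. 'machine=X, pool=Z'."""
    return ", ".join(f"{f}={v}" for f, v in sorted(cat)) or "all operations"


# Made-up sample data for illustration only.
sample = {
    ("W", "X", "Y", "Z"): 1000,
    ("V", "X", "Q", "Z"): 250,
}
report = most_specific(category_totals(sample))
for cat, total in sorted(report.items(), key=lambda item: -item[1]):
    print(total, describe(cat))
```

With this sample the report shrinks from 28 categories to three: the combined machine X and pool Z total, plus the two fully specific events. Note that this sketch still names the pool alongside the filesystem, so it does not yet handle the case where one field always implies another.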

If all of this sounds like I haven't even started trying to write some code to explore this problem, that would be the correct impression. One of my coding hangups is that I like to have at least some idea of how to solve a problem before I start trying to tackle it; this is especially the case if my choice of language isn't settled and I might want to use a different solution depending on the language I wind up in.

(There are at least three candidate languages for what I want to do here, including Go if I need raw speed to make a brute force approach feasible.)

by cks at October 11, 2014 06:02 AM

October 10, 2014

Everything Sysadmin

Is TPOCSA a DevOps book?

Quoting from a community forum post on SpiceWorks:

It doesn't have "DevOps" in the name, but the new The Practice of Cloud System Administration ... covers a lot of the same concepts, more as "here's some things that have emerged as best practices in the modern world of system administration." Textbook-thick but destined to be a classic like his previous The Practice of System and Network Administration.

Thanks to Ernest Mueller for the kind words!

October 10, 2014 04:00 PM

Calling all students and women!

Apply now for a grant to attend LISA14. Submissions are due by Monday, October 13.

Are you a student? There are grants available for the general conference and the tutorial program.

Are you a woman? As part of its ongoing commitment to encourage women to excel in this field, Usenix is pleased to announce the return of the Google Grants for Women to support female computer scientists interested in attending the LISA14 conference. All female computer scientists from academia or industry are encouraged to apply.

Applications are due by October 13.

October 10, 2014 03:11 PM

Chris Siebenmann

Where your memory can be going with ZFS on Linux

If you're running ZFS on Linux, its memory use is probably at least a concern. At a high level, there are at least three different places that your RAM may be being used or held down with ZoL.

First, it may be in ZFS's ARC, which is the ZFS equivalent of the buffer cache. A full discussion of what is included in the ARC and how you measure it and so on is well beyond the scope of this entry, but the short summary is that the ARC includes data from disk, metadata from disk, and several sorts of bookkeeping data. ZoL reports information about it in /proc/spl/kstat/zfs/arcstats, which is exactly the standard ZFS ARC kstats. What ZFS considers to be the total current (RAM) size of the ARC is size. ZFS on Linux normally limits the maximum ARC size to roughly half of memory (this is c_max).

(Some sources will tell you that the ARC size in kstats is c. This is wrong. c is the target size; it's often but not always the same as the actual size.)
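As a rough illustration of reading these kstats, here is a small sketch that assumes the usual kstat text layout of two header lines followed by 'name type data' rows. The sample text and its numbers are made up for the example; on a real ZoL machine you would pass in the contents of /proc/spl/kstat/zfs/arcstats instead.

```python
def parse_arcstats(text):
    """Parse the 'name type data' rows of an arcstats kstat dump into a dict."""
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        # Data rows have exactly three columns with an integer value;
        # this skips the two header lines.
        if len(fields) == 3 and fields[2].isdigit():
            stats[fields[0]] = int(fields[2])
    return stats


def read_arcstats(path="/proc/spl/kstat/zfs/arcstats"):
    """Read and parse the live kstats (only works on a machine running ZoL)."""
    with open(path) as f:
        return parse_arcstats(f.read())


# Made-up sample in the assumed layout, for illustration only.
SAMPLE = """\
5 1 0x01 85 4080 1000 2000
name                            type data
hits                            4    123456
c                               4    5368709120
c_max                           4    8589934592
size                            4    5476083302
"""

stats = parse_arcstats(SAMPLE)
GIB = 2 ** 30
print(f"size  (actual ARC size): {stats['size'] / GIB:.2f} GiB")
print(f"c     (target ARC size): {stats['c'] / GIB:.2f} GiB")
print(f"c_max (maximum size):    {stats['c_max'] / GIB:.2f} GiB")
```

Printing size and c side by side makes it easy to see when the actual size and the target size have drifted apart.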

Next, RAM can be in slab allocated ZFS objects and data structures that are not counted as part of the ARC for one reason or another. It used to be that ZoL handled all slab allocation itself and so all ZFS slab things were listed in /proc/spl/kmem/slab, but the current ZoL development version now lets the native kernel slab allocator handle most slabs for objects that aren't bigger than spl_kmem_cache_slab_limit bytes, which is normally 16K by default. Such native kernel slabs are theoretically listed in /proc/slabinfo but are unfortunately normally subject to SLUB slab merging, which often means that they get merged with other slabs and you can't actually see how much memory they're using.

As far as slab objects that aren't in the ARC, I believe that zfs_znode_cache slab objects (which are znode_ts) are not reflected in the ARC size. On some machines active znode_t objects may be a not insignificant amount of memory. I don't know this for sure, though, and I'm somewhat reasoning from behavior we saw on Solaris.

Third, RAM can be trapped in unused objects and space in slabs. One way that unused objects use up space (sometimes a lot of it) is that slabs are allocated and freed in relatively large chunks (at least one 4KB page of memory and often bigger in ZoL), so if only a few objects in a chunk are in use the entire chunk stays alive and can't be freed. We've seen serious issues with slab fragmentation on Solaris and I'm sure ZoL can have this too. It's possible to see the level of wastage and fragmentation for any slab that you can get accurate numbers for (ie, not any that have vanished into SLUB slab merging).

(ZFS on Linux may also allocate some memory outside of its slab allocations, although I can't spot anything large and obvious in the kernel code.)
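For native kernel slabs, a rough estimate of the wastage described above can be computed from /proc/slabinfo. This sketch assumes the version 2.1 slabinfo layout (name, active objects, total objects, object size, and so on); the sample rows and their numbers are invented for illustration. It only counts allocated-but-unused objects, so it ignores any per-slab overhead, and it does not cover the different format of /proc/spl/kmem/slab.

```python
def slab_waste(slabinfo_text):
    """Return {slab name: (active bytes, unused-object bytes)} from slabinfo text."""
    result = {}
    for line in slabinfo_text.splitlines():
        # Skip the version line, the column header, and blank lines.
        if not line.strip() or line.startswith(("slabinfo", "#")):
            continue
        fields = line.split()
        name = fields[0]
        active_objs, num_objs, objsize = (int(x) for x in fields[1:4])
        active_bytes = active_objs * objsize
        unused_bytes = (num_objs - active_objs) * objsize
        result[name] = (active_bytes, unused_bytes)
    return result


# Made-up sample rows in the assumed slabinfo 2.1 layout.
SAMPLE = """\
slabinfo - version: 2.1
# name            <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
zio_cache            900   1200   1024   16    4 : tunables    0    0    0 : slabdata     75     75      0
dnode_t             5000   5040    896   18    4 : tunables    0    0    0 : slabdata    280    280      0
"""

waste = slab_waste(SAMPLE)
for name, (active, unused) in sorted(waste.items(), key=lambda kv: -kv[1][1]):
    print(f"{name:12s} active {active / 1024:8.0f} KB  unused {unused / 1024:8.0f} KB")
```

On a live system the text would come from reading /proc/slabinfo (which normally needs root), and the numbers are only meaningful for slabs that have not been folded together by SLUB slab merging.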

All of this sounds really abstract, so let me give you an example. On one of my machines with 16 GB and actively used ZFS pools, things are currently reporting the following numbers:

  • the ARC is 5.1 GB, which is decent. Most of that is not actual file data, though; file data is reported as 0.27 GB, then there's 1.87 GB of ZFS metadata from disk and a bunch of other stuff.

  • 7.55 GB of RAM is used in active slab objects. 2.37 GB of that is reported in /proc/spl/kmem/slab; the remainder is in native Linux slabs in /proc/slabinfo. The znode_t slab is most of the SPL slab report, at 2 GB used.

    (This machine is using a hack to avoid the SLUB slab merging for native kernel ZoL slabs, because I wanted to look at memory usage in detail.)

  • 7.81 GB of RAM has been allocated to ZoL slabs in total. This means that there are a few hundred MB of space wasted at the moment.

If znode_t objects are not in the ARC, the ARC and active znode_t objects account for almost all of the slab space between the two of them; 7.1 GB out of 7.55 GB.

I have seen total ZoL slab allocated space be as high as 10 GB (on this 16 GB machine) despite the ARC only reporting a 5 GB size. As you can see, this stuff can fluctuate back and forth during normal usage.

Sidebar: Accurately tracking ZoL slab memory usage

To accurately track ZoL memory usage you must defeat SLUB slab merging somehow. You can turn it off entirely with the slub_nomerge kernel parameter or hack the spl ZoL kernel module to defeat it (see the sidebar here).

Because you can set spl_kmem_cache_slab_limit as a module parameter for the spl ZoL kernel module, I believe that you can set it to zero to avoid having any ZoL slabs be native kernel slabs. This avoids SLUB slab merging entirely and also makes it so that all ZoL slabs appear in /proc/spl/kmem/slab. It may be somewhat less efficient.

by cks at October 10, 2014 05:25 AM

RISKS Digest

October 09, 2014


And if there isn't a stipend...

Sysadmin-types, we kind of have to have a phone. It's what the monitoring system makes vibrate when our attention is needed, and we also tend to be "always on-call", even if it's tier 4 emergency last resort on-call. But sometimes we're the kind of on-call where we have to pay attention any time an alert comes in, regardless of hour, and that's when things get real.

So what if you're in that kind of job, or applying for one, and it turns out that your employer doesn't provide a cell phone and doesn't provide reimbursement? Some Bring Your Own Device policies are written this way. Or maybe your employer moves to a BYOD policy and the company-paid telecoms are going away.

Can they do that?

Yes they can, but.

As with all labor laws, the rules vary based on where you are in the world. However, in August 2014 (a month and a half ago!) Schwan's Home Service, Inc. lost an appeal in California Appellate court. This is important because California contains Silicon Valley and what happens there tends to percolate out to the rest of the tech industry. This ruling held that employees who do company business on personal phones are entitled to reimbursement.

The ruling didn't provide a legal framework for how much reimbursement is required, just that some is.

This thing is so new that the ripples haven't been felt everywhere yet. No-reimbursement policies are not legal, that much is clear, but beyond that, not much is. For non-California based companies such as those in tech hot-spots like Seattle, New York, or the DC area this is merely a warning that the legal basis for such no-reimbursement policies is not firm. As the California-based companies revise policies in light of this ruling, accepted-practice in the tech field will shift without legal action elsewhere.

My legal google-fu is too weak to tell if this thing can be appealed to the state Supreme Court, though it looks like it might have already toured through there.

Until then...

I strongly recommend against using your personal phone for both work and private use. Having two phones, even phones you pay for, provides an affirmative separation between your work identity subject to corporate policies and liability, and your private identity. This is more expensive than just getting an unlimited voice/text plan with lots of data and dual-homing, but you face fewer risks to yourself that way. No-reimbursement BYOD policies are unfair to tech-workers in the same way that requiring a uniform without providing a uniform allowance is unfair; for some of us, that phone is essential to our ability to do our jobs and should be expensed to the employer. Laws and precedent always take a while to catch up to business reality, and BYOD is getting caught up.

by SysAdmin1138 at October 09, 2014 01:21 PM

The general pointlessness of high-CPU alarms

When it comes to things to send alarming emails about, CPU, RAM, Swap, and Disk are the four everyone thinks of. If something seems slow, check one or all of those four to see if it really is slow. This sets up a causal chain...

It was slow, and CPU was high. Therefore, when CPU is high it is slow. QED.

We will now alarm on high CPU.

It may be true in that one case, but high CPU is not always a sign of bad. In fact, high CPU is a perfectly normal occurrence in some systems.

  1. Render farms are supposed to run that high all the time.
  2. Build servers are supposed to be running that high a lot of the time.
  3. Databases chewing on long-running queries.
  4. Big-data analytics that can run for hours.
  5. QE systems grinding on builds.
  6. Test-environment systems being ground on by QE.

Of course, not all CPU checks are created equal. Percent-CPU is one thing, Load Average is another. If Percent-CPU is 100% and your load-average matches the number of cores in the system, you're probably fine. If Percent-CPU is 100% and your load-average is 6x the number of cores in the system, you're probably not fine. If your monitoring system only grabs Percent-CPU, you won't be able to tell what kind of 100% event it is.
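The distinction between the two kinds of 100% event can be sketched as a small check. The thresholds here (95% counts as busy, and a load average above twice the core count counts as overloaded) are assumptions picked for illustration, not recommended values; the function names are hypothetical too.

```python
import os


def classify_cpu_event(percent_cpu, load_average, cores,
                       busy_pct=95.0, overload_factor=2.0):
    """Classify a CPU reading using both Percent-CPU and load average.

    busy_pct and overload_factor are illustrative assumptions; tune them
    for the system being monitored.
    """
    if percent_cpu < busy_pct:
        return "not busy"
    if load_average <= cores * overload_factor:
        # CPUs are saturated, but the run queue roughly matches the cores.
        return "busy but keeping up"
    # CPUs are saturated and far more work is queued than there are cores.
    return "overloaded"


def current_load_and_cores():
    """Return the 1-minute load average and core count (Unix-like systems only)."""
    load1, _load5, _load15 = os.getloadavg()
    return load1, os.cpu_count() or 1


print(classify_cpu_event(100.0, load_average=8.0, cores=8))
print(classify_cpu_event(100.0, load_average=48.0, cores=8))
print(classify_cpu_event(40.0, load_average=48.0, cores=8))
```

A monitoring system that only records Percent-CPU can never reach the third branch, which is the one worth knowing about.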

As a generic, apply-it-to-everything alarm, High-CPU is a really poor thing to pick. It's easy to monitor, which is why it gets selected for alarming. But, don't do that.

Cases where a High-CPU alarm won't actually tell you that something is going wrong:

  • Everything in the previous list.
  • If your app is single-threaded, the actual high-CPU event for that app on a multi-core system is going to be WELL below 100%. It may even be as low as 12.5%.
  • If it's a single web-server in a load-balanced pool of them, it won't be a BOTHER HUMANS RIGHT NOW event.
  • During routine patching. It should be snoozed on a maintenance window anyway, but sometimes it doesn't happen.
  • Initializing a big application. Some things normally chew lots of CPU when spinning up for the first time.
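The single-threaded case in the list above is simple arithmetic, assuming Percent-CPU is reported system-wide with all cores together counting as 100%. The function name is hypothetical; the 12.5% figure is what you get on an 8-core system.

```python
def single_thread_ceiling(cores):
    """Highest system-wide Percent-CPU one fully busy single thread can produce."""
    return 100.0 / cores


# One pegged single-threaded app on machines of various sizes.
for cores in (2, 4, 8, 16):
    print(f"{cores:2d} cores: {single_thread_ceiling(cores):.2f}%")
```

An alarm threshold of, say, 90% system-wide would never fire for such an app, no matter how stuck it is.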

CPU/Load Average is something you probably should monitor, since there is value in retroactive analysis and aggregate analysis. Analyzing CPU trends can tell you it's time to buy more hardware, or turn up the max-instances value in your auto-scaling group. These are all the kinds of thing you look at in retrospect; they're not things that you want waking you up at 2:38am.

Only turn on CPU alarms if you know that high CPU is an error condition worthy of waking up a human. Turning them on for everything just in case is a great way to train yourself into ignoring high-CPU alarms, which means you'll miss the ones you actually care about. Human factors, they're part of everything.

by SysAdmin1138 at October 09, 2014 01:11 PM

Redundancy in the Cloud

Strange as it might be to contemplate, imagine what would happen if AWS went into receivership and was shut down to liquidate assets. What would that mean for your infrastructure? Project? Or even startup?

It would be pretty bad.

Startups have been deploying preferentially on AWS or other Cloud services for some time now, in part due to venture-capitalist push to not have physical infrastructure to liquidate should the startup go *pop* and to scale fast should a much desired rocket-launch happen. If AWS shut down fully for, say, a week, the impact to pretty much everything would be tremendous.

Or what if it was Azure? Fully debilitating for those that are on it, but the wide impacts would be less.

Cloud vendors are big things. In the old physical days we used to deal with the all-our-eggs-in-one-basket problem by putting eggs in multiple places. If you're on AWS, Amazon is very big about making sure you deploy across multiple Availability Zones and helping you become multi-region in the process if that's important to you. See? More than one basket for your eggs. I have to presume Azure and the others are similar, since I haven't used them.

Do you put your product on multiple cloud-vendors as your more-than-one-basket approach?

It isn't as easy as it was with datacenters, that's for sure.

This approach can work if you treat the Cloud vendors as nothing but Virtualization and block-storage vendors. The multiple-datacenter approach worked in large part because colos sell only a few things that impact the technology (power, space, network connectivity, physical access controls), though pricing and policies may differ wildly. Cloud vendors are not like that, they differentiate in areas that are technically relevant.

Do you deploy your own MySQL servers, or do you use RDS?
Do you deploy your own MongoDB servers, or do you use DynamoDB?
Do you deploy your own CDN, or do you use CloudFront?
Do you deploy your own Redis group, or do you use SQS?
Do you deploy your own Chef, or do you use OpsWorks?

The deeper down the hole of Managed Services you dive, and Amazon is very invested in pushing people to use them, the harder it is to take your toys and go elsewhere. Or run your toys on multiple Cloud infrastructures. Azure and the other vendors are building up their own managed service offerings because AWS is successfully differentiating from everyone else by having the widest offering. The end-game here is to have enough managed services offerings that virtual private servers don't need to be used at all.

Deploying your product on multiple cloud vendors requires either eschewing managed-services entirely, or accepting greater management overhead due to very significant differences in how certain parts of your stack are managed. Cloud vendors are very much Infrastructure-as-Code, and deploying on both AWS and Azure is like deploying the same application in Java and .NET; it takes a lot of work, the dialect differences can be insurmountable, and the expertise required means different people are going to be working on each environment which creates organizational challenges. Deploying on multiple cloud-vendors is far harder than deploying in multiple physical datacenters, and this is very much intentional.

It can be done, it just takes drive.

  • New features will be deployed on one infrastructure before the others, and the others will follow on as the integration teams figure out how to port it.
  • Some features may only ever live on one infrastructure as they're not deemed important enough to go to all of the effort to port to another infrastructure. Even if policy says everything must be multi-infrastructure, because that's how people work.
  • The extra overhead of running in multiple infrastructures is guaranteed to become a target during cost-cutting drives.

The ChannelRegister article's assertion that AWS is now in "too big to fail" territory, and would thus require governmental support to prevent wide-spread industry collapse, is a reasonable one. It just plain costs too much to plan for that kind of disaster in corporate disaster-response planning.

by SysAdmin1138 at October 09, 2014 01:10 PM

The alerting problem

4100 emails.

That's the approximate number of alert emails that got auto-deleted while I was away on vacation. That number will rise further before I officially come back from vacation, but it's still a big number. The sad part is, 98% of those emails are for:

  • Problems I don't care about.
  • Unsnoozable known issues.
  • Repeated alarms for the first two points (puppet, I'm looking at you)

We've made great efforts in our attempt to cut down our monitoring fatigue problem, but we're not there yet. In part this is because the old, verbose monitoring system is still up and running, in part this is due to limitations in the alerting systems we have access to, and in part due to organizational habits that over-notify for alarms under the theory of, "if we tell everyone, someone will notice."

A couple weeks ago, PagerDuty had a nice blog-post about tackling alert fatigue, and had a lot of good points to consider. I want to spend some time on point 6:

Make sure the right people are getting alerts.

How many of you have a mailing list you dump random auto-generated crap like cron errors and backup failure notices to?

This pattern is very common in sysadmin teams, especially teams that began as one or a very few people. It just doesn't scale. Also, you learn to just ignore a bunch of things like backup "failures" for always-open files. You don't build an effective alerting system with the assumption that alerts can be ignored; if you find yourself telling new hires, "oh ignore those, they don't mean anything," you have a problem.

The failure mode of tell-everyone is that everyone can assume someone else saw it first and is working on it. And no one works on it.

I've seen exactly this failure mode many times. I've even perpetrated it, since I know certain coworkers are always on top of certain kinds of alerts so I can safely ignore actually-critical alerts. It breaks down if those people have a baby and are out of the office for four weeks. Or were on the Interstate for three hours and not checking mail at that moment.

When this happens and big stuff gets dropped, technical management gets kind of cranky. Which leads to hypervigilance and...

The failure mode of tell-everyone is that everyone will pile into the problem at the same time and make things worse.

I've seen this one too. A major-critical alarm is sent to a big distribution list, six admins immediately VPN in and start doing low-impact diagnostics. Diagnostics that aren't low impact if six people are doing them at the same time. Diagnostics that aren't meant to be run in parallel and can return non-deterministic results if run that way, which tells six admins different stories about what's actually wrong, sending them in six different directions to solve not-actually-a-problem issues.

This is the Thundering Herd problem as it applies to sysadmins.

The usual fix for this is to build in a culture of, "I've got this," emails and to look for those messages before working on a problem.

The usual fix for this fails if admins do a little "verify the problem is actually a problem" work before sending the email and stomp on each other's toes in the process.

The usual fix for that is to build a culture of, "I'm looking into it," emails.

Which breaks down if a sysadmin is reasonably sure they're the only one who saw the alert and works on it anyway. Oops.

Really, these are all examples of telling the right people about the problem, but you really do need to go into more detail than "the right people". You need, "the right person". You need an on-call schedule that will notify one or two of the Right People about problems. Build that with the expectation that if you're in the hot seat you will answer ALL alerts, and build a rotation so no one is in the hot seat long enough to start ignoring alarms, and you have a far more reliable alerting system.

PagerDuty sells such a scheduling system. But what if you can't afford X-dollars a seat for something like that? You have some options. Here is one:

An on-call distribution-list and scheduler tasks
This recipe will provide an on-call rotation using nothing but free tools. It won't work with all environments. Scripting or API access to the email system is required.


    • 1 on-call distribution list.
    • A list of names of people who can go into the DL.
    • A task scheduler such as cron or Windows Task Scheduler.
    • A database of who is supposed to be on-call when (a flat file can substitute if needed).
    • A scripting language that can talk to both email system management and database.


Build a script that can query the database (or flat-file) to determine who is supposed to be on-call right now, and can update the distribution-list with that name. PowerShell can do all of this for full MS-stack environments. For non-MS environments more creativity may be needed.

Populate the database (or flat-file) with the times and names of who is to be on-call.

Schedule execution of the script using a task scheduler.

Configure your alert-emailing system to send mail to the on-call distribution list.
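
As a rough illustration of the script step, here is a minimal Python sketch. It assumes a flat-file schedule in CSV form with made-up column names (start, end, name) and ISO-style timestamps. The actual distribution-list update depends entirely on your email system, so update_distribution_list() is a hypothetical placeholder that only prints what the list should contain. The example addresses are invented.

```python
import csv
import io
from datetime import datetime


def load_schedule(fileobj):
    """Read rows of start,end,name into a list of (start, end, name) tuples."""
    shifts = []
    for row in csv.DictReader(fileobj):
        shifts.append((
            datetime.fromisoformat(row["start"]),
            datetime.fromisoformat(row["end"]),
            row["name"],
        ))
    return shifts


def on_call_now(shifts, now):
    """Return the names whose shift covers `now` (start inclusive, end exclusive)."""
    return [name for start, end, name in shifts if start <= now < end]


def desired_members(shifts, now, always_notify=()):
    """Whoever is on call right now, plus the people who always want the alerts."""
    return sorted(set(on_call_now(shifts, now)) | set(always_notify))


def update_distribution_list(members):
    """Hypothetical placeholder: swap in whatever your email system offers."""
    print("on-call list should contain:", ", ".join(members))


# Invented example schedule; a real one would live in a file or database.
EXAMPLE_SCHEDULE = """start,end,name
2014-10-06T09:00,2014-10-13T09:00,alice@example.com
2014-10-13T09:00,2014-10-20T09:00,bob@example.com
"""

if __name__ == "__main__":
    shifts = load_schedule(io.StringIO(EXAMPLE_SCHEDULE))
    members = desired_members(shifts, datetime.now(), ["manager@example.com"])
    update_distribution_list(members)
```

Run something like this from the task scheduler at least as often as shifts can change; running it hourly is harmless because it only restates who should be on the list.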

Nice and free! You don't get a GUI to manage the schedule, and handling on-call shift swaps will be fully manual, but at least you are now sending alerts to people who know they need to respond to alarms. You can even build the on-call list so that it always includes certain names that want to know whenever something happens, such as managers. The thundering herd and circle-of-not-me problems are abated.

This system doesn't handle escalations at all; that's going to cost you either money or internal development time. You kind of do get what you pay for, after all.

How long should on-call shifts be?

That depends on your alert-frequency, how long it takes to remediate an alert, and the response time required.

Alert Frequency and Remediation:

  • Faster than once per 30 minutes:
    • They're a professional fire-fighter now. This is their full-time job, schedule them accordingly.
  • One every 30 minutes to an hour:
    • If remediation takes longer than 1 minute on average, the watch-stander can't do much of anything else but wait for alerts to show up. 8-12 hours is probably the longest shift over which you can expect reasonable performance.
    • If remediation takes less than a minute, 16 hours is the most you can expect because this frequency ensures no sleep will be had by the watch-stander.
  • One every 1-2 hours:
    • If remediation takes longer than 10 minutes on average, the watch-stander probably can't sleep on their shift. 16 hours is probably the maximum shift length.
    • If remediation takes less than 10 minutes, sleep is more possible. However, if your watch-standers are the kind of people who don't fall asleep fast, you can't rely on that. 1 day for people who sleep at the drop of a hat, 16 hours for the rest of us.
  • One every 2-4 hours:
    • Sleep will be significantly disrupted by the watch. 2-4 days for people who sleep at the drop of a hat. 1 day for the rest of us.
  • One every 4-6 hours:
    • If remediation takes longer than an hour, 1 week for people who sleep at the drop of a hat. 2-4 days for the rest of us.
  • Slower than one every 6 hours:
    • 1 week
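
The list above can be read as a lookup table. The Python sketch below is one way to encode it, under some assumptions the list itself does not settle: where two buckets share a boundary (exactly 1, 2 or 4 hours between alerts) the value goes to the slower bucket, and a remediation time exactly on a threshold counts as the shorter case. The function name and parameters are made up for illustration.

```python
def max_shift(minutes_between_alerts, remediation_minutes, sleeps_easily=False):
    """Return the suggested maximum shift length from the list, as text.

    Returns None for the one case the list leaves open (an alert every
    4-6 hours with remediation of an hour or less).
    """
    gap = minutes_between_alerts
    if gap < 30:
        return "full-time job; schedule as regular shifts"
    if gap < 60:
        return "8-12 hours" if remediation_minutes > 1 else "16 hours"
    if gap < 120:
        if remediation_minutes > 10:
            return "16 hours"
        return "1 day" if sleeps_easily else "16 hours"
    if gap < 240:
        return "2-4 days" if sleeps_easily else "1 day"
    if gap <= 360:
        if remediation_minutes > 60:
            return "1 week" if sleeps_easily else "2-4 days"
        return None  # the list gives no figure for shorter remediation here
    return "1 week"


if __name__ == "__main__":
    # An alert roughly every 90 minutes, 5 minutes to fix, light sleeper.
    print(max_shift(90, 5))
```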

Response Time:

This is a fuzzy one, since it's about work/life balance. If all alerts need to be responded to within 5 minutes of their arrival, the watch-stander needs to be able to respond in 5 minutes. This means no driving or doing anything that requires not paying attention to the phone, such as kids' performances or after-work meetups. For a watch-stander who drives to work, their on-call shift can't overlap their commute.

For 30 minute response, things are easier. Short drives are fine, and so are longer ones so long as the watch-stander pulls over to check each alert as it arrives. Kid performances are still problematic, and longer commutes just as much.

And then there is the curve-ball known as, "define 'response'". If Response is acking the alert, that's one thing and much less disruptive to off-hours life. If Response is defined as "starts working on the problem," that's much more disruptive since the watch-stander has to have a laptop and bandwidth at all times.

The answers here will determine what a reasonable on-call shift looks like. A week of 5 minute time-to-work is going to cause the watch-stander to be house-bound for that entire week and that sucks a lot; there better be on-call pay associated with a schedule like that or you're going to get turnover as sysadmins go work for someone less annoying.

It's more than just making sure the right people are getting alerts; it's building a system of notifying the Right People in such a way that the alerts will get responded to and handled.

This will build a better alerting system overall.

by SysAdmin1138 at October 09, 2014 01:10 PM

Chris Siebenmann

How /proc/slabinfo is not quite telling you what it looks like

The Linux kernel does a lot (although not all) of its interesting internal memory allocations through a slab allocator. For quite a while it's exposed per-type details of this process in /proc/slabinfo; this is very handy to get an idea of just what in your kernel is using up a bunch of memory. Today I was exploring this because I wanted to look into ZFS on Linux's memory usage and wound up finding out that on modern Linuxes it's a little bit misleading.

(By 'found out' I mean that DeHackEd on the #zfsonlinux IRC channel explained it to me.)

Specifically, on modern Linux the names shown in slabinfo are basically a hint because the current slab allocator in the kernel merges multiple slab types together if they are sufficiently similar. If five different subsystems all want to allocate (different) 128-byte objects with no special properties, they don't each get separate slab types with separate slabinfo entries; instead they are all merged into one slab type and thus one slabinfo entry. That slabinfo entry normally shows the name of one of them, probably the first to be set up, with no direct hint that it also includes the usage of all the others.

(The others don't appear in slabinfo at all.)
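
For a rough per-entry view of what slabinfo reports, a small Python sketch like the one below can help. It assumes the usual version 2.1 column layout (name, active_objs, num_objs, objsize, ...) and estimates memory as num_objs times objsize, which ignores per-slab overhead. Reading /proc/slabinfo typically needs root. Keep in mind that, because of the merging described above, each name may stand for several merged slab types.

```python
def parse_slabinfo(text):
    """Parse /proc/slabinfo text into dicts, largest estimated usage first."""
    entries = []
    for line in text.splitlines():
        # Skip the version line, the column-header comment, and blank lines.
        if line.startswith("slabinfo") or line.startswith("#") or not line.strip():
            continue
        fields = line.split()
        num_objs = int(fields[2])
        objsize = int(fields[3])
        entries.append({
            "name": fields[0],
            "num_objs": num_objs,
            "objsize": objsize,
            "bytes": num_objs * objsize,  # estimate only, ignores slab overhead
        })
    return sorted(entries, key=lambda e: e["bytes"], reverse=True)


if __name__ == "__main__":
    try:
        with open("/proc/slabinfo") as f:
            for entry in parse_slabinfo(f.read())[:10]:
                print("%-30s %12d bytes" % (entry["name"], entry["bytes"]))
    except OSError as exc:
        print("cannot read /proc/slabinfo:", exc)
```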

Most of the time this is a perfectly good optimization that cuts down on the number of slab types and enables better memory sharing and reduced fragmentation. But it does mean that you can't tell the memory used by, say, btree_node apart from ip_mrt_cache (on my machine, both are among a lot of slab types that are actually all mapped to the generic 128-byte object). It can also leave you wondering where your slab types actually went, if you're inspecting code that creates a certain slab type but you can't find it in slabinfo (which is what happened to me).

The easiest way to see this mapping is to look at /sys/kernel/slab; all those symlinks are slab types that may be the same thing. You can decode what is what by hand, but if you're going to do this regularly you should get a copy of tools/vm/slabinfo.c from the kernel source and compile it; see the kernel SLUB documentation for details. You want 'slabinfo -a' to report the mappings.

(Sadly slabinfo is underdocumented. I wish it had a manpage or at least a README.)
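
If you only want a quick look without compiling the tool, decoding the /sys/kernel/slab symlinks by hand can be sketched in a few lines of Python. This is an illustration and not a replacement for 'slabinfo -a': it simply groups every symlink in the directory by the entry it points to, on the assumption that names sharing a target have been merged. The function name is made up.

```python
import os
from collections import defaultdict


def merged_slabs(slab_dir="/sys/kernel/slab"):
    """Map each symlink target in slab_dir to the sorted names pointing at it."""
    groups = defaultdict(list)
    for name in os.listdir(slab_dir):
        path = os.path.join(slab_dir, name)
        if os.path.islink(path):
            # Names that resolve to the same directory share one real slab.
            target = os.path.basename(os.path.realpath(path))
            groups[target].append(name)
    return {target: sorted(names) for target, names in groups.items()}


if __name__ == "__main__":
    try:
        for target, names in sorted(merged_slabs().items()):
            print(target, "<-", ", ".join(names))
    except OSError as exc:
        print("cannot read /sys/kernel/slab:", exc)
```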

If you need to track the memory usage of specific slab types, perhaps because you really want to know the memory usage of one subsystem, the easiest way is apparently to boot with the slub_nomerge kernel command line argument. Per the kernel parameter documentation this turns off all slab merging, which may result in you having a lot more slabs than usual.

(On my workstation, slab merging condenses 110 different slabs into 14 actual slabs. On a random server, 170 slabs turn into 35 and a bunch of the pre-merger slabs are probably completely unused.)

Sidebar: disabling this merging in kernel code

The SLUB allocator does not directly expose a way of disabling this merging when you call kmem_cache_create(), in that there's no 'do not merge, really' flag to the call. However, it turns out that supplying at least one of a number of SLUB debugging flags will disable this merging. On a kernel built without CONFIG_DEBUG_KMEMLEAK, using SLAB_NOLEAKTRACE appears to have absolutely no other effects from what I can tell. Both Fedora 20 and Ubuntu 14.04 build their kernels without this option.

(I believe that most Linux distributions put a copy of the kernel build config in /boot when they install kernels.)
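
Checking a build config for CONFIG_DEBUG_KMEMLEAK can be sketched in Python as below. It assumes the config lives at /boot/config-<kernel release>, which is a common convention but not guaranteed on every distribution, and that unset options appear only as '# CONFIG_... is not set' comments. The function name is made up.

```python
import platform


def kernel_config_value(config_text, option):
    """Return the value of option (e.g. 'y' or 'm'), or None if it is not set."""
    prefix = option + "="
    for line in config_text.splitlines():
        if line.startswith(prefix):
            return line[len(prefix):]
    return None


if __name__ == "__main__":
    # Assumed location; adjust if your distribution keeps the config elsewhere.
    path = "/boot/config-" + platform.release()
    try:
        with open(path) as f:
            value = kernel_config_value(f.read(), "CONFIG_DEBUG_KMEMLEAK")
        print("CONFIG_DEBUG_KMEMLEAK:", value if value is not None else "not set")
    except OSError as exc:
        print("cannot read", path, exc)
```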

This may be handy if you have some additional kernel modules that you want to be able to track memory use for specifically even though a number of their slabs would normally get merged away, and you're compiling from source and willing to make some little modifications to it.

You can see the full set of flags that force never merging in the #define for SLUB_NEVER_MERGE in mm/slub.c. On a quick look, none of the others are either harmless or always defined as a non-zero value. It's possible that SLAB_DEBUG_FREE also does nothing these days; if used it will make your slabs only mergeable with other slabs that also specify it (which no slabs in the main kernel source do). That would cause slabs from your code to potentially be merged together but they wouldn't merge with anyone else's slabs, so at least you could track your subsystem's memory usage.

Disclaimer: these ideas have been at most compile-tested, not run live.

by cks at October 09, 2014 04:20 AM

October 08, 2014

Everything Sysadmin

Concerning PICC

Today, Wednesday, October 8, 2014, we, Matt Simmons and Thomas Limoncelli, resigned from the board of Professional IT Community Conferences, Inc. also known as "PICC". PICC is the New Jersey non-profit business entity that has backed LOPSA-East and Cascadia since 2011. Those two conferences should be unaffected as it was already agreed that they would find new organization(s) to work with for their 2015 conferences.

As of June 10, 2014, PICC, Inc. had voted to and was in the process of being dissolved. However we feel this process has become impossible due to the remaining board member's foot-dragging and at times outright deceptive actions. We can not be on a board of an organization that conducts business in that way. We feel that the community deserves better and should request transparency from PICC, Inc. during its dissolution process.

We look forward to the future success of the organizations and events with which PICC has been affiliated.

October 08, 2014 10:00 PM

Standalone Sysadmin

Concerning PICC

Today, Wednesday, October 8, 2014, we, Matt Simmons and Thomas Limoncelli,  resigned from the board of Professional IT Community Conferences, Inc. also known as “PICC”.  PICC is the New Jersey non-profit business entity that has backed LOPSA-East and Cascadia since 2011.  Those two conferences should be unaffected as it was already agreed that they would find new organization(s) to work with for their 2015 conferences.


As of June 10, 2014, PICC, Inc. had voted to and was in the process of being dissolved.  However we feel this process has become impossible due to the remaining board member’s foot-dragging and at times outright deceptive actions.  We can not be on a board of an organization that conducts business in that way.  We feel that the community deserves better and should request transparency from PICC, Inc. during its dissolution process.


We look forward to the future success of the organizations and events with which PICC has been affiliated.


by Matt Simmons at October 08, 2014 09:00 PM

Steve Kemp's Blog

Writing your own e-books is useful

Before our recent trip to Poland I took the time to create my own e-book, containing the names/addresses of people to whom we wanted to send postcards.

Authoring ebooks is simple, and this was a handy use for one. (Ordinarily I'd have my contacts on my phone, but I deliberately left it at home.)

I did mean to copy and paste some notes from wikipedia about transport, tourist destinations, etc, into a brief guide. But I forgot.

In other news the toy virtual machine I hacked together got a decent series of updates, allowing you to embed it and add your own custom opcode(s) easily. That was neat, and fell out naturally from the switch to using function-pointers for the opcode implementation.

October 08, 2014 07:03 PM

Everything Sysadmin

I'm coming to Europe in November!

I'm honored to be a keynote at NLUUG's Autumn Conference, 20-Nov-2014, in The Netherlands. I don't get to Europe often, so this may be the last chance to see me there for a while. I'm also trying to arrange a book-signing while I'm there.

For more info, visit

Register now! Registration is limited!

Even though the registration page is in Dutch, the talk will be in English. Google translate is your friend.

October 08, 2014 11:00 AM