Planet SysAdmin




April 17, 2014

Racker Hacker

DevOps and enterprise inertia

As I wait in the airport to fly back home from this year’s Red Hat Summit, I’m thinking back over the many conversations I had over breakfast, over lunch, and during the events. One common theme that kept cropping up was around bringing DevOps to the enterprise. I stumbled upon Mathias Meyer’s post, The Developer is Dead, Long Live the Developer, and I was inspired to write my own.

Before I go any further, here’s my definition of DevOps: it’s a mindset shift where everyone is responsible for the success of the customer experience. The success (and failure) of the project rests on everyone involved. If it goes well, everyone celebrates and looks for ways to highlight what worked well. If it fails, everyone gets involved to bring it back on track. Doing this correctly means that your usage of “us” and “them” should decrease sharply.

The issue at hand

One of the conference attendees told me that he and his technical colleagues are curious about trying DevOps but their organization isn’t set up in a way to make it work. On top of that, very few members of the teams knew about the concept of continuous delivery and only one or two people knew about tools that are commonly used to practice it.

I dug deeper and discovered that they have outages just like any other company and that they treat outages primarily as an operations problem. Operations teams don’t get much sleep and they get frustrated with poorly written code that is difficult to deploy, upgrade, and maintain. Feedback loops with the development teams are relatively non-existent since the development teams report into a different portion of the business. His manager knows that something needs to change but isn’t sure how to change it.

His company certainly isn’t unique. My advice for him was to start a three-step process:

Step 1: Start a conversation around responsibility.

Leaders need to understand that the customer experience is key and that experience depends on much more than just uptime. This applies to products and systems that support internal users within your company and those that support your external customers.

Imagine if you called for pizza delivery and received a pizza without any cheese. You drive back to the pizza place to show the manager the partial pizza you received. The manager turns to the employees and they point to the person assigned to putting toppings on the pizza. They might say: “It’s his fault, I did my part and put it in the oven.” The delivery driver might say: “Hey, I did what I was supposed to and I delivered the pizza. It’s not my fault.”

All this time, you, the customer, are stuck holding a half made pizza. Your experience is awful.

Looking back, the person who put the pizza in the oven should have asked why it was only partially made. The delivery driver should have asked about it when it was going into the box. Most important of all, the manager should have turned to the employees and put the responsibility on all of them to make it right.

Step 2: Foster collaboration via cross-training.

Once responsibility is shared, everyone within the group needs some knowledge of what other members of the group do. This is most obvious with developers and operations teams. Operations teams need to understand what the applications do and where their weak points are. Developers need to understand resource constraints and how to deploy their software. They don’t need to become experts but they need enough overlapping knowledge to build a strong, healthy feedback loop.

This cross-training must include product managers, project managers, and leaders. Feedback loops between these groups will only be successful if they can speak some of the language of the other groups.

Step 3: Don’t force tooling.

Use the tools that make the most sense to the groups that need to use them. Just because a particular software tool helps another company collaborate or deploy software more reliably doesn’t mean it will have a positive impact on your company.

Watch out for the “sunk cost” fallacy as well. Neal Ford talked about this at the Red Hat Summit and explained how it can really stunt the growth of a high-performing team.

Summary

The big takeaway from this post is that making the mindset shift is the first and most critical step if you want to use the DevOps model in a large organization. The first results you’ll see will be in morale and camaraderie. That builds momentum faster than anything else and will carry teams into the idea of shared responsibility and ownership.


by Major Hayden at April 17, 2014 05:46 PM

Chris Siebenmann

Partly getting around NFS's concurrent write problem

In a comment on my entry about NFS's problem with concurrent writes, a commentator asked this very good question:

So if A writes a file to an NFS directory and B needs to read it "immediately" as the file appears, is the only workaround to use low values of actimeo? Or should A and B be communicating directly with some simple mechanism instead of setting, say, actimeo=1?

(Let's assume that we've got 'close to open' consistency to start with, where A fully writes the file before B processes it.)

If I was faced with this problem and I had a free hand with A and B, I would make A create the file with some non-repeating name and then send an explicit message to B with 'look at file <X>' (using eg a TCP connection between the two). A should probably fsync() the file before it sends this message to make sure that the file's on the server. The goal of this approach is to avoid B's kernel having any cached information about whether or not file <X> might exist (or what the contents of the directory are). With no cached information, B's kernel must go ask the NFS fileserver and thus get accurate information back. I'd want to test this with my actual NFS server and client just to be sure (actual NFS implementations can be endlessly crazy) but I'd expect it to work reliably.
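
To make that concrete, here is a minimal sketch of A's side of this approach, in Python purely as an illustration; the directory, host, and port are invented and real code would want error handling:

  import os
  import socket
  import uuid

  EXPORT_DIR = "/nfs/shared/incoming"           # hypothetical NFS-mounted directory
  B_HOST, B_PORT = "reader.example.com", 5000   # hypothetical address where B listens

  def hand_off(data):
      # Use a non-repeating name so B's kernel can't possibly have stale
      # cached information about an earlier file with the same name.
      name = "job-%s" % uuid.uuid4().hex
      path = os.path.join(EXPORT_DIR, name)
      fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o644)
      try:
          os.write(fd, data)
          # fsync() before telling B, so the data is on the NFS server
          # (not merely in A's local page cache) by the time B goes looking.
          os.fsync(fd)
      finally:
          os.close(fd)
      # Explicitly tell B which file to look at, over a plain TCP connection.
      conn = socket.create_connection((B_HOST, B_PORT))
      try:
          conn.sendall(("look at file %s\n" % name).encode("ascii"))
      finally:
          conn.close()

B's side just accepts the connection, reads the filename, and only then opens the file; since its kernel has never heard of that name before, it has to go ask the NFS fileserver about it.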

Note that it's important to not reuse filenames. If A ever reuses a filename, B's kernel may have stale information about the old version of the file cached; at the best this will get B a stale filehandle error and at the worst B will read old information from the old version of the file.

If you can't communicate between A and B directly and B operates by scanning the directory to look for new files, you have a moderate caching problem. B's kernel will normally cache information about the contents of the directory for a while and this caching can delay B noticing that there is a new file in the directory. Your only option is to force B's kernel to cache as little as possible. Note that if B is scanning it will presumably only be scanning, say, once a second and so there's always going to be at least a little processing lag (and this processing lag would happen even if A and B were on the same machine); if you really want 'immediately', you need A to explicitly poke B in some way no matter what.

(I don't think it matters what A's kernel caches about the directory, unless there's communication that runs the other way such as B removing files when it's done with them and A needing to know about this.)

Disclaimer: this is partly theoretical because I've never been trapped in this situation myself. The closest I've come is safely updating files that are read over NFS. See also.

by cks at April 17, 2014 04:11 AM


April 16, 2014

The Lone Sysadmin

The Eternal Wait For Vendor Software Updates

There’s been a fair amount of commentary & impatience from IT staff as we wait for vendors to patch their products for the OpenSSL Heartbleed vulnerability. Why don’t they hurry up? They’ve had 10 days now, what’s taking so long? How big of a deal is it to change a few libraries?

Perhaps, to understand this, we need to consider how software development works.

The Software Development Life Cycle

(Image: Software Development Life Cycle diagram, courtesy of the Wikimedia Commons.)

To understand why vendors take a while to do their thing we need to understand how they work. In short, there are a few different phases they work through when designing a new system or responding to bug reports.

Requirement Analysis is where someone figures out precisely what the customer wants and what the constraints are, like budget. It’s a lot of back & forth between stakeholders, end users, and the project staff. In the case of a bug report, like “OMFG OPENSSL LEAKING DATA INTERNET HOLY CRAP” the requirements are often fairly clear. Bugs aren’t always clear, though, which is why you sometimes get a lot of questions from support guys.

Design is where the technical details of implementation show up. The project team takes the customer requirements and turns them into a technical design. In the case of a bug the team figures out how to fix the problem without breaking other stuff. That’s sometimes a real art. Read bugs filed against the kernel in Red Hat’s Bugzilla if you want to see guys try very hard to fix problems without breaking other things.

Implementation is where someone sits down and codes whatever was designed, or implements the agreed-upon fix.

The testing phase can be a variety of things. For new code it’s often full system testing, integration testing, and end-user acceptance testing. But if this is a bug, the testing is often Quality Assurance. Basically a QA team is trying to make sure that whoever coded a fix didn’t introduce more problems along the way. If they find a problem, called a regression, they work with the Engineering team to get it resolved before it ships.

Evolution is basically just deploying what was built. For software vendors there’s a release cycle, and then the process starts again.

So what? Why can’t they just fix the OpenSSL problem?

(Image: Git branching model, borrowed from Maescool’s Git Branching Model Tutorial.)

The problem is that in an organization with a lot of coders, a sudden need for an unplanned release really messes with a lot of things, short-circuiting the requirements, design, and implementation phases and wreaking havoc in testing.

Using this fine graphic I’ve borrowed from a Git developer we can get an idea of how this happens. In this case there’s a “master” branch of the code that customer releases are done from. Feeding that, there’s a branch called “release” that is likely owned by the QA guys. When the developers think they’re ready for a release they merge “develop” up into “release” and QA tests it. If it is good it moves on to “master.”

Developers who are adding features and fixing bugs create their own branches (“feature/xxx” etc.) where they can work, and then merge into “develop.” At each level there’s usually senior coders and project managers acting as gatekeepers, doing review and managing the flow of updates. On big code bases there are sometimes hundreds of branches open at any given time.

So now imagine that you’re a company like VMware, and you’ve just done a big software release, like VMware vSphere 5.5 Update 1, that has huge new functionality in it (VSAN).[0] There’s a lot of coding activity against your code base because you’re fixing new bugs that are coming in. You’re probably also adding features, and you’re doing all this against multiple major versions of the product. You might have had a plan for a maintenance release in a couple of months, but suddenly this OpenSSL thing pops up. It’s such a basic system library that it affects everything, so everybody will need to get involved at some level.

On top of that, the QA team is in hell because it isn’t just the OpenSSL fix that needs testing. A ton of other stuff was checked in and is in the queue to be released, but all of that needs testing first. And if they find a regression they might not even be able to jettison the problem code, because it’ll be intertwined with other code in the version control system. So they need to sort it out, and test more, and sort more out, and test again, until it works like it should. The best way out is through, but the particular OpenSSL fix can’t get released until everything else is ready.

This all takes time, to communicate and resolve problems and coordinate hundreds of people. We need to give them that time. While the problem is urgent, we don’t really want software developers doing poor work because they’re burnt out. We also don’t want QA to miss steps or burn out, either, because this is code that we need to work in our production environments. Everybody is going to run this code, because they have to. If something is wrong it’ll create a nightmare for customers and support, bad publicity, and ill will.

So let’s not complain about the pace of vendor-supplied software updates appearing, without at least recognizing our hypocrisy. Let’s encourage them to fix the problem correctly, doing solid QA and remediation so the problem doesn’t get worse. Cut them some slack for a few more days while we remember that this is why we have mitigating controls, and defense-in-depth. Because sometimes one of the controls fails, for an uncomfortably long time, and it’s completely out of our control.

—–

[0] This is 100% speculative; while I have experience with development teams, I have no insight into VMware or IBM or any of the other companies I’m waiting for patches from.



by Bob Plankers at April 16, 2014 06:00 PM

Chris Siebenmann

Where I feel that btrfs went wrong

I recently finished reading this LWN series on btrfs, which was the most in-depth exposure to the details of using btrfs that I've had so far. While I'm sure that LWN intended the series to make people enthused about btrfs, I came away with a rather different reaction; I've wound up feeling that btrfs has made a significant misstep along its way that's resulted in a number of design mistakes. To explain why I feel this way I need to contrast it with ZFS.

Btrfs and ZFS are each a volume manager and a filesystem merged together. One of the fundamental interface differences between them is that ZFS has decided that it is a volume manager first and a filesystem second, while btrfs has decided that it is a filesystem first and a volume manager second. This is what I see as btrfs's core mistake.

(Overall I've been left with the strong impression that btrfs basically considers volume management to be icky and tries to have as little to do with it as possible. If correct, this is a terrible mistake.)

Since it's a volume manager first, ZFS places volume management front and center in operation. Before you do anything ZFS-related, you need to create a ZFS volume (which ZFS calls a pool); only once this is done do you really start dealing with ZFS filesystems. ZFS even puts the two jobs in two different commands (zpool for pool management, zfs for filesystem management). Because it's firmly made this split, ZFS is free to have filesystem level things such as df present a logical, filesystem based view of things like free space and device usage. If you want the actual physical details you go to the volume management commands.

Because btrfs puts the filesystem first it wedges volume creation in as a side effect of filesystem creation, not a separate activity, and then it carries a series of lies and uselessly physical details through to filesystem level operations like df. Consider the discussion of what df shows for a RAID1 btrfs filesystem here, which has both a lie (that the filesystem uses only a single physical device) and a needlessly physical view (of the physical block usage and space free on a RAID 1 mirror pair). That btrfs refuses to expose itself as a first class volume manager and pretends that you're dealing with real devices forces it into utterly awkward things like mounting a multi-device btrfs filesystem with 'mount /dev/adevice /mnt'.

I think that this also leads to the asinine design decision that subvolumes have magic flat numeric IDs instead of useful names. Something that's willing to admit it's a volume manager, such as LVM or ZFS, has a name for the volume and can then hang sub-names off that name in a sensible way, even if where those sub-objects appear in the filesystem hierarchy (and under what names) gets shuffled around. But btrfs has no name for the volume to start with and there you go (the filesystem-volume has a mount point, but that's a different thing).

All of this really matters for how easily you can manage and keep track of things. df on ZFS filesystems does not lie to me; it tells me where the filesystem comes from (what pool and what object path within the pool), how much logical space the filesystem is using (more or less), and roughly how much more I can write to it. Since they have full names, ZFS objects such as snapshots can be more or less self documenting if you name them well. With an object hierarchy, ZFS has a natural way to inherit various things from parent object to sub-objects. And so on.

Btrfs's 'I am not a volume manager' approach also leads it to drastically limit the physical shape of a btrfs RAID array in a way that is actually painfully limiting. In ZFS, a pool stripes its data over a number of vdevs and each vdev can be any RAID type with any number of devices. Because ZFS allows multi-way mirrors this creates a straightforward way to create a three-way or four-way RAID 10 array; you just make all of the vdevs be three or four way mirrors. You can also change the mirror count on the fly, which is handy for all sorts of operations. In btrfs, the shape 'raid10' is a top level property of the overall btrfs 'filesystem' and, well, that's all you get. There is no easy place to put in multi-way mirroring; because of btrfs's model of not being a volume manager it would require changes in any number of places.

(And while I'm here, that btrfs requires you to specify both your data and your metadata RAID levels is crazy and gives people a great way to accidentally blow their own foot off.)

As a side note, I believe that btrfs's lack of allocation guarantees in a raid10 setup makes it impossible to create a btrfs filesystem split evenly across two controllers that is guaranteed to survive the loss of one entire controller. In ZFS this is trivial because of the explicit structure of vdevs in the pool.

PS: ZFS is too permissive in how you can assemble vdevs, because there is almost no point to a pool with, say, a mirror vdev plus a RAID-6 vdev. That configuration is all but guaranteed to be a mistake in some way.

by cks at April 16, 2014 05:28 AM

April 15, 2014

The Tech Teapot

Stack Overflow Driven Development

The rise of Stack Overflow has certainly changed how many programmers go about their trade.

I have recently been learning some new client side web skills because I need them for a new project. I have noticed that the way I go about learning is quite different from the way I used to learn pre-web.

I used to have a standard technique. I’d go through back issues of magazines I’d bought (I used to have hundreds of back issues) and read any articles related to the new technology. Then I’d purchase a book about the topic, read it and start a simple starter project. Whilst doing the starter project, I’d likely pick up a couple of extra books and skim them to find techniques I needed for the project. This method worked pretty well; I’d be working idiomatically, without a manual, in anywhere from a month to three months.

Using the old method, if I got stuck on something, I’d have to figure it out on my own. I remember it took three days to get a simple window to display when I was learning Windows programming in 1991. Without the internet, there was nobody you could ask when you got stuck. If you didn’t own the reference materials you needed, then you were stuck.

Fast forward twenty years and things are rather different. For starters, I don’t have a bunch of magazines sitting around. I don’t even read tech magazines any more, either in print or digitally. None of my favourite magazines survived the transition to digital.

Now when I want to learn a new tech, I head to Wikipedia first to get a basic idea. Then I start trawling Google for simple tutorials. I then read one of the new generation of short introductory books on my Kindle.

I then start my project safe in the knowledge that Google will always be there. And, of course, Google returns an awful lot of Stack Overflow pages. Whilst I would have felt very uncomfortable starting a project without a full grasp of a technology twenty years ago, now I think it would be odd not to. The main purpose of the initial reading is to get a basic understanding of the technology and, most importantly, the vocabulary. You can’t search properly if you don’t know what to search for.

Using my new approach, I’ve cut my learning time from one to three months down to one to three weeks.

The main downside to my approach is that, at the beginning at least, I may not write idiomatic code. But, whilst that is a problem, software is very malleable and you can always re-write parts later on if the project is a success. The biggest challenge now seems to be getting to the point when you know a project has legs as quickly as possible. Fully understanding a tech before starting a project just delays the start and I doubt you’ll get that time back later in increased productivity.

Of course, by far the quickest approach is to use a tech stack you already know. Unfortunately, in my case that wasn’t possible because I don’t know a suitable client side tech. It is a testament to the designers of Angular.js, SignalR and NancyFX that I have found it pretty easy to get started. I wish everything was so well designed.


by Jack Hughes at April 15, 2014 12:53 PM

Chris Siebenmann

Chasing SSL certificate chains to build a chain file

Suppose that you have some shiny new SSL certificates for some reason. These new certificates need a chain of intermediate certificates in order to work with everything, but for some reason you don't have the right set. In ideal circumstances you'll be able to easily find the right intermediate certificates on your SSL CA's website and won't need the rest of this entry.

Okay, let's assume that your SSL CA's website is an unhelpful swamp pit. Fortunately all is not lost, because these days at least some SSL certificates come with the information needed to find the intermediate certificates. First we need to dump out our certificate, following my OpenSSL basics:

openssl x509 -text -noout -in WHAT.crt

This will print out a bunch of information. If you're in luck (or possibly always), down at the bottom there will be an 'Authority Information Access' section with a 'CA Issuers - URI' bit. That is the URL of the next certificate up the chain, so we fetch it:

wget <SOME-URL>.crt

(In case it's not obvious: for this purpose you don't have to worry if this URL is being fetched over HTTP instead of HTTPS. Either your certificate is signed by this public key or it isn't.)

Generally or perhaps always this will not be a plain text file like your certificate is, but instead a binary blob. The plain text format is called PEM; your fetched binary blob of a certificate is probably in the binary DER encoding. To convert from DER to PEM we do:

openssl x509 -inform DER -in <WGOT-FILE>.crt -outform PEM -out intermediate-01.crt

Now you can inspect intermediate-01.crt in the same way to see if it needs a further intermediate certificate; if it does, iterate this process. When you have a suitable collection of PEM format intermediate certificates, simply concatenate them together in order (from the first you fetched to the last, per here) to create your chain file.
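
If you wind up iterating this a lot, the loop is easy enough to script. Here is a rough sketch in Python that shells out to openssl and wget just as above; it assumes the 'CA Issuers - URI' line appears in the text output as described and that every fetched certificate is DER encoded, neither of which is guaranteed:

  import re
  import subprocess

  def issuer_url(pem_file):
      # Dump the certificate as text and pull out the AIA 'CA Issuers' URI, if any.
      text = subprocess.check_output(
          ["openssl", "x509", "-text", "-noout", "-in", pem_file]).decode("ascii", "replace")
      m = re.search(r"CA Issuers - URI:(\S+)", text)
      return m.group(1) if m else None

  chain = []
  current = "WHAT.crt"
  n = 1
  while True:
      url = issuer_url(current)
      if url is None:
          break                      # no AIA information, so we're at the top
      der = "fetched-%02d.der" % n
      current = "intermediate-%02d.crt" % n
      subprocess.check_call(["wget", "-q", "-O", der, url])
      subprocess.check_call(["openssl", "x509", "-inform", "DER", "-in", der,
                             "-outform", "PEM", "-out", current])
      chain.append(current)
      n += 1

  # Concatenate in the order fetched (first to last) to make the chain file.
  with open("chain.crt", "w") as out:
      for name in chain:
          with open(name) as f:
              out.write(f.read())

The Qualys test mentioned below will tell you if the resulting chain includes more certificates than it needs.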

PS: The Qualys SSL Server Test is a good way to see how correct your certificate chain is. If it reports that it had to download any certificates, your chain of intermediate certificates is not complete. Similarly it may report that some entries in your chain are not necessary, although in practice this rarely hurts.

Sidebar: Browsers and certificate chains

As you might guess, some but not all browsers appear to use this embedded intermediate certificate URL to automatically fetch any necessary intermediate certificates during certificate validation (as mentioned eg here). Relatedly, browsers will probably not tell you about unnecessary intermediate certificates they received from your website. The upshot of this can be an HTTPS website that works in some browsers but fails in others, and in the failing browser it may appear that you sent no additional certificates as part of a certificate chain. Always test with a tool that will tell you the low-level details.

(Doing otherwise can cause a great deal of head scratching and frustration. Don't ask how I came to know this.)

by cks at April 15, 2014 02:03 AM

April 14, 2014

Steve Kemp's Blog

Is lumail a stepping stone?

I'm pondering a rewrite of my console-based mail-client.

While it is "popular" it is not popular.

I suspect "console-based" is the killer.

I like console, and I ssh to a remote server to use it, but having different front-ends would be neat.

In the world of mailpipe, etc, is there room for a graphic console client? Possibly.

The limiting factor would be the lack of POP3/IMAP.

Reworking things such that there is a daemon to which a GUI, or a console client, could connect seems simple. The hard part would obviously be working the IPC and writing the GUI. Any toolkit selected would rule out 40% of the audience.

In other news I'm stalling on replying to emails. Irony.

April 14, 2014 11:21 PM

Everything Sysadmin

Time Management training at SpiceWorld Austin, 2014

I'll be doing a time management class at SpiceWorld.

Read about my talk and the conference at their website.

If you register, use code "LIMONCELLI20" to save 20%.

See you there!

April 14, 2014 03:00 PM

Interview with LOPSA-East Keynote: Vish Ishaya

Vish Ishaya will be giving the opening keynote at LOPSA-East this year. I caught up with him to talk about his keynote, OpenStack, and how he got his start in tech. The conference is May 2-3, 2014 in New Brunswick, NJ. If you haven't registered, do it now!

Tom Limoncelli: Tell us about your keynote. What should people expect / expect to learn?

Vish Ishaya: The keynote will be about OpenStack as well as the unique challenges of running a cloud in the datacenter. Cloud development methodologies mean different approaches to problems. These approaches bring with them a new set of concerns. By the end of the session people should understand where OpenStack came from, know why businesses are clamoring for it, and have strategies for bringing it into the datacenter effectively.

TL: How did you get started in tech?

VI: I started coding in 7th grade, when I saw someone "doing machine language" on a computer at school (he was programming in QBasic). I started copying programs from books and I was hooked.

TL: If an attendee wanted to learn OpenStack, what's the smallest installation they can build to be able to experiment? How quickly could they go from bare metal to a working demo?

VI: The easiest way to get started experimenting with OpenStack is to run DevStack (http://devstack.org) on a base Ubuntu or Fedora OS. It works on a single node and is generally running in just a few minutes.

TL: What are the early-adopters using OpenStack for? What do you see the next tier of customers using it for?

VI: OpenStack is a cloud toolkit, so the early-adopters are building clouds. These tend to be service providers and large enterprises. The next tier of customers are smaller businesses that just want access to a private cloud. These are the ones that are already solving interesting business problems using public clouds and want that same flexibility on their own infrastructure.

TL: Suppose a company had a big investment in AWS and wanted to bring it in-house and on-premise. What is the compatibility overlap between OpenStack and AWS?

VI: We've spent quite a bit of time analyzing this at Nebula, because it is a big use-case for our customers. It really depends on what features in AWS one is using. If just the basics are being used, the transition is very easy. If you're using a bunch of the more esoteric services, finding an open source analog can be tricky.

TL: OpenStack was founded by Rackspace Hosting and NASA. Does OpenStack run well in zero-G environments? Would you go into space if NASA needed an OpenStack deployment on the moon?

VI: When I was working on the Nebula project at NASA (where the OpenStack compute project gestated), everyone always asked if I had been to space. I haven't yet, but I would surely volunteer.

Thanks to Vish for taking the time to do this interview! See you at LOPSA-East!

April 14, 2014 02:41 PM

Chris Siebenmann

My reactions to Python's warnings module

A commentator on my entry on the warnings problem pointed out the existence of the warnings module as a possible solution to my issue. I've now played around with it and I don't think it fits my needs here, for two somewhat related reasons.

The first reason is that it simply makes me nervous to use or even take over the same infrastructure that Python itself uses for things like deprecation warnings. Warnings produced about Python code and warnings that my code produces are completely separate things and I don't like mingling them together, partly because they have significantly different needs.

The second reason is that the default formatting that the warnings module uses is completely wrong for the 'warnings produced from my program' case. I want my program warnings to produce standard Unix format (warning) messages and to, for example, not include the Python code snippet that generated them. Based on playing around with the warnings module briefly it's fairly clear that I would have to significantly reformat standard warnings to do what I want. At that point I'm not getting much out of the warnings module itself.
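
To illustrate what 'significantly reformat' means in practice, here is a minimal sketch of the override you wind up writing; the program name handling and the warning text are invented:

  import os
  import sys
  import warnings

  def unix_showwarning(message, category, filename, lineno, file=None, line=None):
      # Throw away the filename, line number, and code snippet entirely and
      # emit a plain Unix-style 'program: warning: ...' message instead.
      prog = os.path.basename(sys.argv[0]) or "prog"
      (file or sys.stderr).write("%s: warning: %s\n" % (prog, message))

  # Replacing showwarning is the supported hook for changing how warnings are printed.
  warnings.showwarning = unix_showwarning
  warnings.warn("frobnicator configuration looks dubious")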

All of this is a sign of a fundamental decision in the warnings module: the warnings module is only designed to produce warnings about Python code. This core design purpose is reflected in many ways throughout the module, such as in the various sorts of filtering it offers and how you can't actually change the output format as far as I can see. I think that this makes it a bad fit for anything except that core purpose.

In short, if I want to log warnings I'm better off using general logging and general log filtering to control what warnings get printed. What features I want there are another entry.

by cks at April 14, 2014 05:20 AM

April 13, 2014

Security Monkey

Amazing Write-Up on BillGates Botnet - With Monitoring Tools Source!

Just stumbled upon this amazing write-up by ValdikSS on not only his discovery of the "BillGates" botnet, but of some source code he's developed that you…

April 13, 2014 09:32 PM

Chris Siebenmann

A problem: handling warnings generated at low levels in your code

Python has a well honed approach for handling errors that happen at a low level in your code; you raise a specific exception and let it bubble up through your program. There's even a pattern for adding more context as you go up through the call stack, where you catch the exception, add more context to it (through one of various ways), and then propagate the exception onwards.

(You can also use things like phase tracking to make error messages more specific. And you may want to catch and re-raise exceptions for other reasons, such as wrapping foreign exceptions.)
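
As a small invented illustration of that context-adding pattern:

  def parse_line(line):
      key, sep, value = line.partition("=")
      if not sep:
          raise ValueError("no '=' in %r" % line.strip())
      return key.strip(), value.strip()

  def load_config(fname):
      results = []
      with open(fname) as f:
          for lineno, line in enumerate(f, 1):
              try:
                  results.append(parse_line(line))
              except ValueError as e:
                  # Catch, add context (the file name and line number), and
                  # re-raise so the error bubbles up with more information.
                  raise ValueError("%s line %d: %s" % (fname, lineno, e))
      return results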

All of this is great when it's an error. But what about warnings? I recently ran into a case where I wanted to 'raise' (in the abstract) a warning at a very low level in my code, and that left me completely stymied about what the best way to do it was. The disconnect between errors and warnings is that in most cases errors immediately stop further processing while warnings don't, so you can't deal with warnings by raising an exception; you need to somehow both 'raise' the warning and continue further processing.

I can think of several ways of handling this, all of which I've sort of used in code in the past:

  • Explicitly return warnings as part of the function's output. This is the most straightforward but also sprays warnings through your APIs, which can be a problem if you find that you need to add warnings to existing code.

  • Have functions accumulate warnings on some global or relatively global object (perhaps hidden through 'record a warning' function calls). Then at the end of processing, high-level code will go through the accumulated warnings and do whatever is desired with them.

  • Log the warnings immediately through a general logging system that you're using for all program messages (ranging from simple to very complex). This has the benefit that both warnings and errors will be produced in the correct order.

The second and third approaches have the problem that it's hard for intermediate layers to add context to warning messages; they'll wind up wanting or needing to pass the context down to the low level routines that generate the warnings. The third approach can have the general options problem when it comes to controlling what warnings are and aren't produced, or you can try to control this by having the high level code configure the logging system to discard some messages.

I don't have any answers here, but I can't help thinking that I'm missing a way of doing this that would make it all easy. Probably logging is the best general approach for this and I should just give in, learn a Python logging system, and use it for everything in the future.
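
As a rough sketch of what that third, logging-based approach looks like (the module name and messages here are invented):

  import logging
  import sys

  # Low-level code records warnings on a module-level logger and carries on;
  # it never decides how, or whether, they get shown.
  log = logging.getLogger("frobber")

  def low_level_parse(line):
      if not line.strip():
          log.warning("skipping blank line")
          return None
      return line.split()

  def main():
      # The top level decides the output format and which messages appear at
      # all, here in a Unix-ish 'prog: level: message' style on stderr.
      logging.basicConfig(stream=sys.stderr,
                          format="frobber: %(levelname)s: %(message)s",
                          level=logging.WARNING)
      for line in ["a b c", "", "d e f"]:
          low_level_parse(line)

  if __name__ == "__main__":
      main()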

(In the incident that sparked this entry, I wound up punting and just printing out a message with sys.stderr.write() because I wasn't in a mood to significantly restructure the code just because I now wanted to emit a warning.)

by cks at April 13, 2014 06:15 AM

Pragmatic reactions to a possible SSL private key compromise

In light of the fact that the OpenSSL 'heartbleed' issue may have resulted in someone getting a copy of your private keys, there are at least three possible reactions that people and organizations can take:

  • Do an explicit certificate revocation through your SSL CA and get a new certificate, paying whatever extra certificate revocation cost the CA requires for this (some do it for free, some normally charge extra).

  • Simply get new SSL certificates from whatever certificate vendor you prefer or can deal with and switch to them. Don't bother to explicitly revoke your old keys.

  • Don't revoke or replace SSL keys at all, based on an assessment that the actual risk that your keys were compromised is very low.

These are listed in declining order of theoretical goodness and also possibly declining order of cost.

Obviously the completely cautious approach is to assume that your private keys have been compromised and also that you should explicitly revoke them so that people might be protected from an attacker trying man in the middle attacks with your old certificates and private keys (if revocation actually works this time). The pragmatic issue is that this course of action probably costs the most money (if it doesn't, well, then there's no problem). If your organization has a lot riding on the security of your SSL certificates (in terms of money or other things) then this extra expense is easy to justify, and in many places the actual cost is small or trivial compared to other budget items.

But, as they say, there are places where this is not so true, where the extra cost of certificate revocations will hurt to some degree or require a fight to get. Given that certificate revocation may not actually do much in practice, there is a real question of whether you're actually getting anything worthwhile for your money (especially since you're probably doing this as merely a precaution against potential key compromise). If certificate revocation is an almost certainly pointless expense that's going to hurt, the pragmatics push people away from paying for it and towards one of the other two alternatives.

(If you want more depressing reading on browser revocation checking, see Adam Langley (via).)

Getting new certificates is the intermediate caution option (especially if you believe that certificate revocation is ineffective in practice), since it closes off future risks that you can actually do something about yourself. But it still probably costs you some money (how much money depends on how many certificates you have or need).

Doing nothing with your SSL keys is the cheapest and easiest approach and is therefore very attractive for people on a budget, and there are a number of arguments towards a low risk assessment (or at least away from a high one). People will say that this position is obviously stupid, which is itself obviously stupid; all security is a question of risk versus cost and thus requires an assessment of both risk and cost. If people feel that the pragmatic risk is low (and at this point we do not have evidence that it isn't for a random SSL site) or cannot convince decision makers that it is not low and the cost is perceived as high, well, there you go. Regardless of what you think, the resulting decision is rational.

(Note that there is at least one Certificate Authority that offers SSL certificates for free but normally charges a not insignificant cost for revoking and reissuing certificates, which can swing the various costs involved. When certificates are free it's easy to wind up with a lot of them to either revoke or replace.)

In fact, as a late-breaking update as I write this, Neel Mehta (the person who found the bug) has said that private key exposure is unlikely, although of course unlikely is nowhere near the same thing as 'impossible'. See also Thomas Ptacek's followup comment.
Update: But see Tomas Rzepka's success report on FreeBSD for bad news.

Update April 12: It's now clear from the results of the CloudFlare challenge and other testing by people that SSL private keys can definitely be extracted from servers that are vulnerable to Heartbleed.

My prediction is that pragmatics are going to push quite a lot of people towards at least the second option and probably the third. Sure, if revoking and reissuing certificates is free a lot of people will take advantage of it (assuming that the message reaches them, which I would not count on), but if it costs money there will be a lot of pragmatic pressure towards cheap options.

(Remember the real purpose of SSL certificates.)

Sidebar: Paths to high cost perceptions

Some people are busy saying that the cost of new SSL certificates is low (or sometimes free), so why not get new ones? There are at least three answers:

  • The use of SSL is for a hobby thing or personal project and the person involved doesn't feel like spending any more money on it than they already have or are.

  • There are a significant number of SSL certificates involved, for example for semi-internal hosts, and there's no clear justification for replacing only a few of their keys (except 'to save money', and if that's the justification you save even more money by not replacing any of them).

  • The people who must authorize the money will be called on to defend the expense in front of higher powers or to prioritize it against other costs in a fixed budget or both.

These answers can combine with each other.

by cks at April 13, 2014 12:31 AM

April 12, 2014

Geek and Artist - Tech

(something something) Big Data!

I recently wrote about how I’d historically been using Pig for some daily and some ad-hoc data analysis, and how I’d found Hive to be a much friendlier tool for my purposes. As I mentioned then, I’m not a data analyst by any stretch of the imagination, but have occasional need to use these kinds of tools to get my job done. The title of this post (while originally a placeholder for something more accurate) is a representation of the feeling I have for these topics – only a vague idea of what is going on, but I know it has to do with Big Data (proper noun).

Since writing that post, and attempting and failing to find a simple way of introducing Hive usage at work (it’s yet another tool and set of data representations to maintain and support), I’ve also been doing a bit of reading on comparable tools, and frankly Hive only scratches the surface. Because a mostly SQL-compliant interface is so attractive, there is a lot of competition in this space (and this blog post from Cloudera sums up the issue very well). SQL as an interface to big data operations is desirable for the same reasons I found it useful, but it also introduces some performance implications that are not suited to traditional MapReduce-style jobs, which tend to have completion times in the tens of minutes to hours rather than seconds.

Cloudera’s Impala and a few other competitors in this problem space are attempting to address this problem by combining large-scale data processing that is traditionally MapReduce’s strong-point, with very low latencies when generating results. Just a few seconds is not unusual. I haven’t investigated any of these in-depth, but I feel as a sometimes-user of Hadoop via Pig and Hive it is just as important to be abreast of these technologies as the “power users”, so that when we do have occasion to need such data analysis, it can be done with as low a barrier to entry as possible and with the maximum performance.

Spark

http://spark.apache.org/

Spark is now an Apache project but originated in the AMPLab at UC Berkeley. My impression is that it is fairly similar to Apache Hadoop – its own parallel-computing cluster, with which you interact via native language APIs (in this case Java, Scala or Python). I’m sure it offers superior performance to Hadoop’s batch processing model, but unless you are already heavily integrating from these languages with Hadoop libraries it doesn’t offer a drastically different method of interaction.

On the other hand, there are already components built on top of the Spark framework which do allow this, for example, Shark (also from Berkeley). In this case, Shark even offers HiveQL compatibility, so if you are already using Hive there is a clear upgrade path. I haven’t tried it, but it sounds promising although being outside of the Cloudera distribution and not having first-class support on Amazon EMR makes it slightly harder to get at (although guides are available).

Impala

http://www.cloudera.com/content/cloudera/en/products-and-services/cdh/impala.html

As already suggested, Impala was the first alternative I discovered and also is incorporated in Cloudera’s CDH distribution and available on Amazon EMR, which makes it more tempting to me for use both inside and outside of EMR. It supports ANSI SQL-92 rather than HiveQL, but coming from Pig or other non-SQL tools this may not matter to you.

PrestoDB

http://prestodb.io/

Developed by Facebook, and can either use HDFS data without any additional metadata, or with the Hive metadata store using a plugin. For that reason I see it as somewhat closer to Impala, although it also lacks the wider support in MapReduce deployments like CDH and Amazon EMR just like Shark/Spark.

AWS Redshift

http://aws.amazon.com/redshift/

Not really an open source tool like the others above, but deserves a mention as it really fits in the same category. If you want to just get something up and running immediately, this is probably the easiest option.

Summary

I haven’t even begun to scratch the surface of tooling available in this part of the Big Data space, and the above are only the easiest to find amongst further open source and commercial varieties. Personally I am looking forward to the next occasion I have to analyse some data where I can really pit some of these solutions against each other and find the most efficient and easy framework for my ad-hoc data analysis needs.

by oliver at April 12, 2014 08:32 PM

Chris Siebenmann

The relationship between SSH, SSL, and the Heartbleed bug

I will lead with the summary: since the Heartbleed bug is a bug in OpenSSL's implementation of a part of the TLS protocol, no version or implementation of SSH is affected by Heartbleed because the SSH protocol is not built on top of TLS.

So, there's four things involved here:

  • SSL aka TLS is the underlying network encryption protocol used for HTTPS and a bunch of other SSL/TLS things. Heartbleed is an error in implementing the 'TLS heartbeat' protocol extension to the TLS protocol. A number of other secure protocols are built partially or completely on top of TLS, such as OpenVPN.

  • SSH is the protocol used for, well, SSH connections. It's completely separate from TLS and is not layered on top of it in any way. However, TLS and SSH both use a common set of cryptography primitives such as Diffie-Hellman key exchange, AES, and SHA1.

    (Anyone sane who's designing a secure protocol reuses these primitives instead of trying to invent their own.)

  • OpenSSL is an implementation of SSL/TLS in the form of a large cryptography library. It also exports a whole bunch of functions and so on that do various cryptography primitives and other lower-level operations that are useful for things doing cryptography in general.

  • OpenSSH is one implementation of the SSH protocol. It uses various functions exported by OpenSSL for a lot of cryptography related things such as generating randomness, but it doesn't use the SSL/TLS portions of OpenSSL because SSH (the protocol) doesn't involve TLS (the protocol).

Low level flaws in OpenSSL such as Debian breaking its randomness can affect OpenSSH when OpenSSH uses something that's affected by the low level flaw. In the case of the Debian issue, OpenSSH gets its random numbers from OpenSSL and so was affected in a number of ways.

High level flaws in OpenSSL's implementation of TLS itself will never affect OpenSSH because OpenSSH simply doesn't use those bits of OpenSSL. For instance, if OpenSSL turns out to have an SSL certificate verification bug (which happened recently with other SSL implementations) it won't affect OpenSSH's SSH user and host key verification.

As a corollary, OpenSSH (and all SSH implementations) aren't directly affected by TLS protocol attacks such as BEAST or Lucky Thirteen, although people may be able to develop similar attacks against SSH using the same general principles.

by cks at April 12, 2014 03:44 AM

April 11, 2014


Everything Sysadmin

Replace Kathleen Sebelius with a sysadmin!

Scientists complain that there are only 2 scientists in Congress and that they find it difficult to explain basic science to their peers. What about system administrators? How many people in Congress or in the president's cabinet have ever had the root or administrator password to systems that other people depend on?

Health and Human Services Secretary Kathleen Sebelius announced her resignation, and the media coverage has been a mix of claims that she's leaving in disgrace after the failed ACA website launch and counter-claims that she stuck it out until it was a success, which redeems her.

The truth is, folks, how many of you have launched a website and had it work perfectly the first day? Zero. Either you've never been faced with such a task, or you have and it didn't go well. Very few people can say they've launched a big site and had it be perfect the first day.

Let me quote from a draft of the new book I'm working on with Strata and Christine ("The Practice of Cloud Administration", due out this autumn):

[Some companies] declare that all outages are unacceptable and only accept perfection. Any time there is an outage, therefore, it must be someone's fault and that person, being imperfect, is fired. By repeating this process eventually the company will only employ perfect people. While this is laughable, impossible, and unrealistic it is the methodology we have observed in many organizations. Perfect people don't exist, yet organizations often adopt strategies that assume they do.

Firing someone "to prove a point" makes for exciting press coverage but terrible IT. Quoting Allspaw, "an engineer who thinks they're going to be reprimanded are disincentivized to give the details necessary to get an understanding of the mechanism, pathology, and operation of the failure. This lack of understanding of how the accident occurred all but guarantees that it will repeat. If not with the original engineer, another one in the future." (link)

HHS wasn't doing the modern IT practices (DevOps) that Google, Facebook, and other companies use to have successful launches. However most companies today aren't either. The government is slower to adopt new practices and this is one area where that bites us all.

All the problems the site had were classic "old world IT thinking" leading to cascading failures that happen in business all the time. One of the major goals of DevOps is to eliminate this kind of problem.

Could you imagine a CEO today that didn't know what accounting is? No. They might not be experts at it, but at least they know it exists and why it is important. Can you imagine a CEO that doesn't understand what DevOps is and why small batches, blameless postmortems, and continuous delivery are important? Yes.. but not for long.

Obama did the right thing by not accepting her resignation until the system was up and running. It would have been disruptive and delayed the entire process. It would have also disincentivized engineers and managers to do the right thing in the future. [Yesterday I saw a quote from Obama where he basically paraphrased Allspaw's quote but I can't find it again. Links anyone?]

Healthcare is 5% "medical services" and 95% information management. Anyone in the industry can tell you that.

The next HHS Secretary needs to be a sysadmin. A DevOps-trained operations expert.

What government official has learned the most about doing IT right in the last year? Probably Sebelius. It's a shame she's leaving.


You can read about how DevOps techniques and getting rid of a lot of "old world IT thinking" saved the Obamacare website in this article at the Time Magazine website. (Login required.)

April 11, 2014 03:00 PM

Steve Kemp's Blog

Putting the finishing touches to a nodejs library

For the past few years I've been running a simple service to block blog/comment-spam, which is (currently) implemented as a simple JSON API over HTTP, with a minimal core and all the logic in a series of plugins.

One obvious thing I wasn't doing until today was paying attention to the anchor-text used in hyperlinks, for example:

  <a href="http://fdsf.example.com/">buy viagra</a>

Blocking on the anchor-text is less prone to false positives than blocking on keywords in the comment/message bodies.

Unfortunately there seem to exist no simple nodejs modules for extracting all the links, and associated anchors, from a random Javascript string. So I had to write such a module, but .. given how small it is there seems little point in sharing it. So I guess this is one of the reasons why there are often large gaps in the module ecosystem.
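
For what it's worth, the kind of extraction involved is small enough to sketch. This is not my module (which is nodejs); it's just an illustration of the idea using Python 3's standard html.parser:

  from html.parser import HTMLParser

  class AnchorExtractor(HTMLParser):
      def __init__(self):
          HTMLParser.__init__(self)
          self.links = []          # list of (href, anchor text) pairs
          self._href = None
          self._text = []

      def handle_starttag(self, tag, attrs):
          if tag == "a":
              self._href = dict(attrs).get("href")
              self._text = []

      def handle_data(self, data):
          if self._href is not None:
              self._text.append(data)

      def handle_endtag(self, tag):
          if tag == "a" and self._href is not None:
              self.links.append((self._href, "".join(self._text).strip()))
              self._href = None

  p = AnchorExtractor()
  p.feed('<a href="http://fdsf.example.com/">buy viagra</a>')
  print(p.links)   # [('http://fdsf.example.com/', 'buy viagra')]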

(Equally some modules are essentially applications; great that the authors shared, but virtually unusable, unless you 100% match their problem domain.)

I've written about this before when I had to construct, and publish, my own cidr-matching module.

Anyway, expect an upload soon; currently I "parse" HTML and BBCode. Possibly markdown to follow, since I have an interest in markdown.

April 11, 2014 02:14 PM

TaoSecurity

Bejtlich Teaching at Black Hat USA 2014

I'm pleased to announce that I will be teaching one class at Black Hat USA 2014, on 2-3 and 4-5 August 2014, in Las Vegas, Nevada. The class is Network Security Monitoring 101. I've taught this class in Las Vegas in July 2013 and Seattle in December 2013. I posted Feedback from Network Security Monitoring 101 Classes last year as a sample of the student commentary I received.

This class is the perfect jumpstart for anyone who wants to begin a network security monitoring program at their organization. You may enter with no NSM knowledge, but when you leave you'll be able to understand, deploy, and use NSM to detect and respond to intruders, using open source software and repurposed hardware.

The first discounted registration deadline is 11:59 pm EDT June 2nd. The second discounted registration deadline (more expensive than the first but cheaper than later) ends 11:59 pm EDT July 26th. You can register here.

Please note: I have no plans to teach this class again in the United States. I haven't decided yet whether I will teach the class at Black Hat Europe 2014 in Amsterdam in October.

Since starting my current Black Hat teaching run in 2007, I've completely replaced each course every other year. In 2007-2008 I taught TCP/IP Weapons School version 1. In 2009-2010 I taught TCP/IP Weapons School version 2. In 2011-2012 I taught TCP/IP Weapons School version 3. In 2013-2014 I taught Network Security Monitoring 101. This fall I would need to design a brand new course to continue this trend.

I have no plans to design a new course for 2015 and beyond. If you want to see me teach Network Security Monitoring and related subjects, Black Hat USA is your best option.

Please sign up soon, for two reasons. First, if not enough people sign up early, Black Hat might cancel the class. Second, if many people sign up, you risk losing a seat. With so many classes taught in Las Vegas, the conference lacks the large rooms necessary to support big classes.

Several students asked for a more complete class outline. So, in addition to the outline posted currently by Black Hat, I present the following that shows what sort of material I cover in my new class.

OVERVIEW

Is your network safe from intruders? Do you know how to find out? Do you know what to do when you learn the truth? If you are a beginner, and need answers to these questions, Network Security Monitoring 101 (NSM101) is the newest Black Hat course for you. This vendor-neutral, open source software-friendly, reality-driven two-day event will teach students the investigative mindset not found in classes that focus solely on tools. NSM101 is hands-on, lab-centric, and grounded in the latest strategies and tactics that work against adversaries like organized criminals, opportunistic intruders, and advanced persistent threats. Best of all, this class is designed *for beginners*: all you need is a desire to learn and a laptop ready to run a virtual machine. Instructor Richard Bejtlich has taught over 1,000 Black Hat students since 2002, and this brand new, 101-level course will guide you into the world of Network Security Monitoring.

CLASS OUTLINE

Day One

0900-1030
·         Introduction
·         Enterprise Security Cycle
·         State of South Carolina case study
·         Difference between NSM and Continuous Monitoring
·         Blocking, filtering, and denying mechanisms
·         Why does NSM work?
·         When NSM won’t work
·         Is NSM legal?
·         How does one protect privacy during NSM operations?
·         NSM data types
·         Where can I buy NSM?

1030-1045
·         Break

1045-1230
·         SPAN ports and taps
·         Making visibility decisions
·         Traffic flow
·         Lab 1: Visibility in ten sample networks
·         Security Onion introduction
·         Stand-alone vs server plus sensors
·         Core Security Onion tools
·         Lab 2: Security Onion installation

1230-1400
·         Lunch

1400-1600
·         Guided review of Capinfos, Tcpdump, Tshark, and Argus
·         Lab 3: Using Capinfos, Tcpdump, Tshark, and Argus

1600-1615
·         Break

1615-1800
·         Guided review of Wireshark, Bro, and Snort
·         Lab 4: Using Wireshark, Bro, and Snort
·         Using Tcpreplay with NSM consoles
·         Guided review of process management, key directories, and disk usage
·         Lab 5: Process management, key directories, and disk usage

Day Two

0900-1030
·         Computer incident detection and response process
·         Intrusion Kill Chain
·         Incident categories
·         CIRT roles
·         Communication
·         Containment techniques
·         Waves and campaigns
·         Remediation
·         Server-side attack pattern
·         Client-side attack pattern

1030-1045
·         Break

1045-1230
·         Guided review of Sguil
·         Lab 6: Using Sguil
·         Guided review of ELSA
·         Lab 7: Using ELSA

1230-1400
·         Lunch

1400-1600
·         Lab 8. Intrusion Part 1 Forensic Analysis
·         Lab 9. Intrusion Part 1 Console Analysis

1600-1615
·         Break

1615-1800
·         Lab 10. Intrusion Part 2 Forensic Analysis
·         Lab 11. Intrusion Part 2 Console Analysis

REQUIREMENTS

Students must be comfortable using command line tools in a non-Windows environment such as Linux or FreeBSD. Basic familiarity with TCP/IP networking and packet analysis is a plus.

WHAT STUDENTS NEED TO BRING

NSM101 is a LAB-DRIVEN course. Students MUST bring a laptop with at least 8 GB RAM and at least 20 GB free on the hard drive. The laptop MUST be able to run a virtualization product that can CREATE VMs from an .iso, such as VMware Workstation (minimum version 8, 9 or 10 is preferred); VMware Player (minimum version 5 -- older versions do not support VM creation); VMware Fusion (minimum version 5, for Mac); or Oracle VM VirtualBox (minimum version 4.2). A laptop with access to an internal or external DVD drive is preferred, but not mandatory.

Students SHOULD test the open source Security Onion (http://securityonion.blogspot.com) NSM distro prior to class. The students should try booting the latest version of the 12.04 64 bit Security Onion distribution into live mode. Students MUST ensure their laptops can run a 64 bit virtual machine. For help with this requirement, see the VMware knowledgebase article “Ensuring Virtualization Technology is enabled on your VMware host (1003944)” (http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1003944). Students MUST have the BIOS password for their laptop in the event that they need to enable virtualization support in class. Students MUST also have administrator-level access to their laptop to install software, in the event they need to reconfigure their laptop in class.
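As a quick sanity check on a Linux host (a convenience sketch only, not part of the official class requirements), you can look for the hardware virtualization CPU flags before class:

  egrep -c '(vmx|svm)' /proc/cpuinfo

A result of 0 usually means VT-x/AMD-V is absent or disabled in the BIOS; see the VMware knowledgebase article above for enabling it.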

WHAT STUDENTS WILL RECEIVE

Students will receive a paper class handbook with printed slides, a lab workbook, and the teacher’s guide for the lab questions. Students will also receive a DVD with a recent version of the Security Onion NSM distribution.

TRAINERS

Richard Bejtlich is Chief Security Strategist at FireEye, and was Mandiant's Chief Security Officer when FireEye acquired Mandiant in 2013. He is a nonresident senior fellow at the Brookings Institution, a board member at the Open Information Security Foundation, and an advisor to Threat Stack. He was previously Director of Incident Response for General Electric, where he built and led the 40-member GE Computer Incident Response Team (GE-CIRT). Richard began his digital security career as a military intelligence officer in 1997 at the Air Force Computer Emergency Response Team (AFCERT), Air Force Information Warfare Center (AFIWC), and Air Intelligence Agency (AIA). Richard is a graduate of Harvard University and the United States Air Force Academy. His fourth book is "The Practice of Network Security Monitoring" (nostarch.com/nsm). He also writes for his blog (taosecurity.blogspot.com) and Twitter (@taosecurity), and teaches for Black Hat.

by Richard Bejtlich (noreply@blogger.com) at April 11, 2014 09:13 AM

Chris Siebenmann

What sort of kernel command line arguments Fedora 20's dracut seems to want

Recently I upgraded the kernel on my Fedora 20 office workstation, rebooted the machine, and had it hang in early boot (the first two are routine, the last is not). Forcing a reboot back to the earlier kernel brought things back to life. After a bunch of investigation I discovered that this was not actually due to the new kernel, it was due to an earlier dracut update. So this is the first thing to learn: if a dracut update breaks something in the boot process, you'll probably only discover this the next time you upgrade the kernel and the (new) dracut builds a (new and not working) initramfs for it.

The second thing I discovered in the process of this is that the Fedora boot process will wait for a really long time for your root filesystem to appear before giving up, printing messages about it, and giving you an emergency shell, where by a really long time I mean 'many minutes' (I think at least five). It turned out that my boot process had not locked up but instead was sitting around waiting for my root filesystem to appear. Of course this wait was silent, with no warnings or status notes reported on the console, so I thought that things had hung. The reason the boot process couldn't find my root filesystem was that my root filesystem is on software RAID and the new dracut has stopped assembling such things for a bunch of people.

(Fedora apparently considers this new dracut state to be 'working as designed', based on bug reports I've skimmed.)

I don't know exactly what changed between the old dracut and the new dracut, but what I do know is that the new dracut really wants you to explicitly tell it what software RAID devices, LVM devices, or other things to bring up on boot through arguments added to the kernel command line. dracut.cmdline(7) will tell you all about the many options, but the really useful thing to know is that you can get dracut itself to tell you what it wants via 'dracut --print-cmdline'.

For me on my machine, this prints out (and booting wants):

  • three rd.md.uuid=<UUID> settings for the software RAID arrays of my root filesystem, the swap partition, and /boot. I'm not sure why dracut includes /boot but I left it in. The kernel command line is already absurdly over-long on a modern Fedora machine, so whatever.

    (There are similar options for LVM volumes, LUKS, and so on.)

  • a 'root=UUID=<UUID>' stanza to specify the UUID of the root filesystem. It's possible that my old 'root=/dev/mdXX' would have worked (the root's RAID array is assembled with the right name), but I didn't feel like finding out the hard way.

  • rootflags=... and rootfstype=ext4 for more information about mounting the root filesystem.

  • resume=UUID=<UUID>, which points to my swap area. I omitted this in the kernel command line I set in grub.cfg because I never suspend my workstation. Nothing has exploded yet.

The simplest approach to fixing up your machine in a situation like this is probably to just update grub.cfg to add everything dracut wants to the new kernel's command line (removing any existing conflicting options, eg an old root=/dev/XXX setting). I looked into just what the arguments were and omitted one for no particularly good reason.
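To make that concrete, with made-up placeholder UUIDs the additions to the kernel command line wind up looking something like this (all on one line in grub.cfg, with one rd.md.uuid= entry per array):

  root=UUID=11111111-2222-3333-4444-555555555555 rootfstype=ext4 rootflags=rw,relatime rd.md.uuid=aaaaaaaa:bbbbbbbb:cccccccc:dddddddd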

(I won't say that Dracut is magic, because I'm sure it could all be read up on and decoded if I wanted to. I just think that doing so is not worth bothering with for most people. Modern Linux booting is functionally a black box, partly because it's so complex and partly because it almost always just works.)

by cks at April 11, 2014 06:12 AM

April 10, 2014

Steve Kemp's Blog

A small assortment of content

Today I took down my KVM-host machine, rebooting it and restarting all of my guests. It has been a while since I'd done so and I was a little nervous; as it turned out, this nervousness was prophetic.

I'd forgotten to hardwire the use of proxy_arp so my guests were all broken when the systems came back online.
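For the record the fix itself is a one-liner; a sketch, assuming eth0 is the interface the guests are routed over:

  echo 'net.ipv4.conf.eth0.proxy_arp = 1' >> /etc/sysctl.conf
  sysctl -p

(The point of putting it in /etc/sysctl.conf is that it survives the next reboot, unlike setting it by hand.)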

If you're curious this is what my incoming graph of email SPAM looks like:

I think it is obvious where the downtime occurred, right?

In other news I'm awaiting news from the system administration job I applied for here in Edinburgh, if that doesn't work out I'll need to hunt for another position..

Finally I've started hacking on my console based mail-client some more. It is a modal client which means you're always in one of three states/modes:

  • maildir - Viewing a list of maildir folders.
  • index - Viewing a list of messages.
  • message - Viewing a single message.

As a result of a lot of hacking there is now a fourth mode/state, "text-mode", which allows you to view arbitrary text: for example scrolling up and down a file on-disk to read the manual, or viewing messages in interesting ways.

Support is still basic at the moment, but both of these work:

  --
  -- Show a single file
  --
  show_file_contents( "/etc/passwd" )
  global_mode( "text" )

Or:

function x()
   txt = { "${colour:red}Steve",
           "${colour:blue}Kemp",
           "${bold}Has",
           "${underline}Definitely",
           "Made this work" }
   show_text( txt )
   global_mode( "text")
end

x()

There will be a new release within the week, I guess; I just need to wire up a few more primitives, write more of a manual, and close some more bugs.

Happy Thursday, or as we say in this house, Hyvää torstai!

April 10, 2014 03:34 PM

Everything Sysadmin

LISA CFP Deadline Extended to Fri, 4/18!

Whether you are submitting a talk proposal, workshop, tutorial, or research paper, the call for participation submission deadline has been extended to Friday, 4/18!

Submit today!

April 10, 2014 03:00 PM

SysAdmin1138

Password sprawl and human brain-meats

The number one piece of password advice is:

Only memorize a single complex password, use a password manager for everything else.

Gone is the time when you could plan on memorizing complex strings of characters using shift keys, letter substitution and all of that. The threats surrounding passwords, and the sheer number of things that require them, mean that human fragility is security's greatest enemy. The use of prosthetic memory is now required.

It could be a notebook you keep with you everywhere you go.
It could be a text file on a USB stick you carry around.
It could be a text file you keep in Dropbox and reference on all of your devices.
It could be an actual password manager like 1Password or LastPass that installs in all of your browsers.

There are certain accounts that act as keys to other accounts. The first accounts you need to protect like Fort Knox are the email accounts that receive activation messages for everything else you use, since that vector can be used to gain access to those other accounts through the 'Forgotten Password' links.

The second set of accounts you need to protect like Fort Knox are the identity services used by other sites so they don't have to bother with user account management; that would be all those "Log in with Twitter/Facebook/Google/Yahoo/Wordpress" buttons you see everywhere.

The problem with prosthetic memory is that to beat out memorization it needs to be everywhere you ever need to log into anything. Your laptop, phone and tablet can all use the same manager, but the same isn't true when you go to a friend's house and get on their living-room machine to log into Hulu-Plus real quick, since you have an account, they don't, and they have the awesome AV setup.

It's a hard problem. Your brain is always there, and it's hard to beat that for convenience. But it's time to offload that particular bit of memorization to something else; your digital life and reputation depend on it.

by SysAdmin1138 at April 10, 2014 02:26 PM

Racker Hacker

Upgrade OpenSSL, then upgrade WordPress

The internet has been buzzing about the heartbleed OpenSSL vulnerability but another critical update came out this week: WordPress 3.8.2. The update fixes two CVEs and a few other security issues.

eWeek has an informative article with additional details on the update.

by Major Hayden at April 10, 2014 01:06 PM

Chris Siebenmann

My current choice of a performance metrics system and why I picked it

In response to my previous entries on gathering OS level performance metrics, people have left a number of comments recommending various systems for doing this. So now it's time to explain my current decision about this.

The short version: I'm planning to use graphite combined with some stats-gathering frontend, probably collectd. We may wind up wanting something more sophisticated as the web interface; we'll see.

This decision is not made from a full and careful comparison of all of the available tools with respect to what we need, partly because I don't know enough to make that comparison. Instead it's made in large part based on what seems to be popular among relatively prominent and leading edge organizations today. Put bluntly, graphite appears to be the current DevOps hotness as far as metrics goes.

That it's the popular and apparent default choice means two good things. First, given that it's used by much bigger environments than ours, I can probably make it work for us, and given that the world is not full of angry muttering about how annoying and/or terrible it is, it's probably not going to be particularly bad. Second, it's much more likely that such a popular tool will have a good ecology around it, that there will be people writing howtos and 'how I did this' articles for it, add-on tools, and so on. And indeed this seems to be the case based on my trawling of the Internet so far; I've tripped over far more stuff about graphite than about anything else and there seem to be any number of ways of collecting stats and feeding it data.

(That graphite's the popular choice also means that it's likely to be kept up to date, developed further, possibly packaged for me, and so on.)

A side benefit of this reading is that it's shown me that people are pushing metrics into a graphite-based system at relatively high rates. This is exactly what I want to do, given that averages lie and the shorter the period you take them over, the better you avoid some of those lies.

(I'm aware that we may run into things like disk IO limits. I'll have to see, but gathering metrics say every five or ten seconds is certainly my goal.)

Many of the alternatives are probably perfectly good and would do decently well for us. They're just somewhat more risky choices than the current big popular thing and as a result they leave me with various concerns and qualms.

by cks at April 10, 2014 05:02 AM

April 09, 2014

The Lone Sysadmin

8 Practical Notes about Heartbleed (CVE-2014-0160)

I see a lot of misinformation floating around about the OpenSSL Heartbleed bug. In case you’ve been living under a rock, OpenSSL versions 1.0.1 through 1.0.1f are vulnerable to a condition where a particular feature will leak the contents of memory. This is bad, because memory often contains things like the private half of public-key cryptographic exchanges (which should always stay private), protected information, parts of your email, instant messenger conversations, and other information such as logins and passwords for things like web applications.

This problem is bad, but freaking out about it, and talking out of our duffs about it, adds to the problem.

You can test if you’re vulnerable with http://filippo.io/Heartbleed/ – just specify a host and a port, or with http://s3.jspenguin.org/ssltest.py from the command line with Python.

1. Not all versions of OpenSSL are vulnerable. Only fairly recent ones, and given the way enterprises patch you might be just fine. Verify the problem before you start scheduling remediations.

2. Heartbleed doesn’t leak all system memory. It only leaks information from the affected process, like a web server running with a flawed version of OpenSSL. A modern operating system prevents one process from accessing another’s memory space. The big problem is for things like IMAP servers and web applications that process authentication data, where that authentication information will be present in the memory space of the web server. That’s why this is bad, but it doesn’t automatically mean that things like your SSH-based logins to a host are compromised, nor just anything stored on a vulnerable server.

Of course, it’s always a good idea to change your passwords on a regular basis.

3. People are focusing on web servers being vulnerable, but many services can be, including your email servers (imapd, sendmail, etc.), databases (MySQL), snmpd, etc. and some of these services have sensitive authentication information, too. There’s lots of email that I wouldn’t want others to gain access to, like password reset tokens, what my wife calls me, etc.

4. A good way, under Linux, to see what’s running and using the crypto libraries is the lsof command:

$ sudo lsof | egrep "libssl|libcrypto" | cut -f 1 -d " " | sort | uniq
cupsd
dovecot
dsmc
httpd
imap-logi
java
mysqld
named
nmbd
ntpd
sendmail
smbd
snmpd
snmptrapd
spamd
squid
ssh
sshd
sudo
tuned
vsftpd

This does not list things that aren’t running that depend on the OpenSSL libraries. For that you might try mashing up find with ldd, mixing in -perm and -type a bit.
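Something along these lines works as a rough sketch (the paths to search are just examples):

  # list binaries that link against libssl or libcrypto, per ldd
  for f in $(find /usr/bin /usr/sbin -type f -perm -u+x); do
    ldd "$f" 2>/dev/null | egrep -q 'libssl|libcrypto' && echo "$f"
  done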

5. Just because you patched doesn’t mean that the applications using those libraries are safe. Applications load a copy of the library into memory when they start, so replacing the files on disk means almost nothing unless you restart the applications, too. In the lsof output above, all of those processes have a copy of libcrypto or libssl, and all would need to restart to load the fixed version.

Furthermore, some OSes, like AIX, maintain a shared library cache, so it’s not even enough to replace it on disk. In AIX’s case you need to run /usr/sbin/slibclean as well to purge the flawed library from the cache and reread it from disk.

In most cases so far I have chosen to reboot the OSes rather than try to find and restart everything. Nuke it from orbit, it’s the only way to be sure.
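One way to spot stragglers after patching, as a sketch: on Linux, lsof marks memory-mapped files that have been deleted (i.e. replaced on disk) with DEL, so processes still holding the old library stand out:

  $ sudo lsof -n | egrep 'libssl|libcrypto' | grep DEL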

6. Patching the system libraries is one thing, but many applications deliver libraries as part of their installations. You should probably use a command like find to search for them:

$ sudo find / -name libssl\*; sudo find / -name libcrypto\*
/opt/tivoli/tsm/client/ba/bin/libssl.so.0.9.8
/opt/tivoli/tsm/client/api/bin64/libssl.so.0.9.8
/home/plankers/pfs/openssl-1.0.1e/libssl.a
/home/plankers/pfs/openssl-1.0.1e/libssl.pc
/usr/lib/libssl.so.10
/usr/lib/libssl.so.1.0.1e
/usr/lib64/libssl.so.10
/usr/lib64/libssl3.so
/usr/lib64/libssl.so
/usr/lib64/pkgconfig/libssl.pc
/usr/lib64/libssl.so.1.0.1e
/opt/tivoli/tsm/client/ba/bin/libcrypto.so.0.9.8
/opt/tivoli/tsm/client/api/bin64/libcrypto.so.0.9.8
/home/plankers/pfs/openssl-1.0.1e/libcrypto.a
/home/plankers/pfs/openssl-1.0.1e/libcrypto.pc
/usr/lib/libcrypto.so.1.0.1e
/usr/lib/libcrypto.so.10
/usr/lib64/libcrypto.so.1.0.1e
/usr/lib64/libcrypto.so.10
/usr/lib64/libcrypto.so
/usr/lib64/pkgconfig/libcrypto.pc

In this example you can see that the Tivoli Storage Manager client has its own copy of OpenSSL, version 0.9.8, which isn’t vulnerable. I’ve got a vulnerable copy of OpenSSL 1.0.1e in my home directory from when I was messing around with Perfect Forward Secrecy. The rest looks like OpenSSL 1.0.1e but I know that it’s a patched copy from Red Hat. I will delete the vulnerable copy so there is no chance something could link against it.

7. If you were running a vulnerable web, email, or other server application you should change your SSL keys, because the whole point is that nobody but you should know your private keys. If someone knows your private keys they’ll be able to decrypt your traffic, NSA-style, or conduct a man-in-the-middle attack where they insert themselves between your server and a client and pretend to be you. Man-in-the-middle is difficult to achieve, but remember that this vulnerability has been around for about two years (April 19, 2012) so we don’t know who else knew about it. The good assumption is that some bad guys did. So change your keys. Remember that lots of things have SSL keys, mail servers, web servers, Jabber servers, etc.
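Generating a replacement key and CSR is the usual openssl dance; a sketch, with the filenames and key size as examples only:

  $ openssl req -new -newkey rsa:2048 -nodes -keyout www.example.com.key -out www.example.com.csr

Send the new CSR to your CA for a reissued certificate, install it, and revoke the old certificate.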

8. While you’re messing with all your SSL certs, step up your SSL security in general. A great testing tool I use is the Qualys SSL Labs Server Test, and they link to best practices from the results page.

Good luck.


This post was written by Bob Plankers for The Lone Sysadmin - Virtualization, System Administration, and Technology. Licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License and copyrighted © 2005-2013. All rights reserved.

by Bob Plankers at April 09, 2014 08:55 PM

April 08, 2014

Debian Administration

OpenSSL Heartbeat, a.k.a. Heartbleed Bug

A serious security flaw has come to light in the OpenSSL package used in many Linux distributions including Debian. It is considered very serious and all administrators should patch their systems at once and restart any services that rely on OpenSSL.

by ajt at April 08, 2014 10:24 PM

Everything Sysadmin

Interview with LOPSA-East Keynote: Elizabeth Krumbach Joseph

Elizabeth Krumbach Joseph will be giving the closing keynote at LOPSA-East this year. I caught up with her to talk about her keynote, source code management, and Star Wars. The conference is May 2-3, 2014 in New Brunswick, NJ. If you haven't registered, do it now! (We'll have an interview with the opening keynote, Vish Ishaya, soon.)

Tom Limoncelli: Tell us about your keynote. What should people expect / expect to learn?

Elizabeth Krumbach Joseph: Over the past few years there have been a number of high profile incidents and news stories around the subject of women in technology. In my keynote I'll be giving some solid advice for how the technology industry, and each of us, can do a better job of attracting and keeping talent. I will focus on women, but the changes are ones that will help all of us and make the industry a better place for everyone.

As a sneak peek: It would be great if we could all have real flex time (particularly since my pager may go off at 2 AM) and give more opportunities to junior systems administrators.

TL: What do you do for HP and OpenStack?

EKJ: I'm a systems administrator working on the OpenStack project infrastructure, so the vast majority of my day-to-day work is done directly on an open source project. Internally at HP I also pitch in with teams using the same upstream infrastructure tools and sometimes help out teams who are seeking to open source their projects, offering best practice advice.

TL: You are also giving a talk called "Code Review for Sys Admins". Tell us more about code reviews and how they benefit system administrators?

EKJ: A code review is my favorite thing! In software development it's a review of the code you submit, typically before it's merged.

The team I work on in OpenStack has brought this practice to our systems administration work. For each change we submit to the systems, it goes through a review system that does a few automated checks (i.e. running "puppet parser validate" on Puppet changes and pep8 checks on our Python scripts) and then is reviewed and approved by peers on our team. It's led to one of the best working environments of systems administrators I've ever worked in and has been a valuable tool for our geographically distributed team. Plus, the whole thing is open source, and so is all of our work.
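As a rough sketch, the automated half of such a gate boils down to something like the following, with the repository layout assumed purely for illustration:

  find modules -name '*.pp' -print0 | xargs -0 -n1 puppet parser validate
  find . -name '*.py' -print0 | xargs -0 pep8

The peer review and approval step then happens on top of those checks.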

TL: This question is forwarded from two of the LOPSA-East committee members; one has a newborn daughter and the other has a 7-year-old granddaughter. What can they do now so that their granddaughter/daughter grows up to be an engineer?

EKJ: Great question!

I was very fortunate to grow up in a family of all girls with a geek for a father. He was always encouraging us to learn and build things. My parents also encouraged interests early on like jigsaw puzzles. This kind of supportive environment helped develop the curiosity and interest in engineering that I've built my career upon.

I'm also really excited to see companies like Goldie Blox (http://www.goldieblox.com/) come on the scene with toys designed for girls to foster an interest in engineering. But you don't actually need specially designed interlocking blocks; lacking the funds for expensive LEGOs, my parents kept us stocked with plain wooden blocks that I'd build zoos and other creations with. [See picture.]

Today there are many organizations that offer computer-specific programs for young people, like http://coderdojo.com

And others that are specifically tailored to girls and under-served demographics, like GirlDevelopIt.com, BlackGirlsCode.org, and a Girl Scouts program. Oh, and programs with robots! www.robogals.org

This is by no means an exhaustive list, only ones I've casually come across lately. More are popping up all the time, many just serving their regional area or school districts.

TL: You recently moved from Philly to California. I hope you are surviving the good weather and healthy living. When will we see you back in the Philly Linux community?

EKJ: I love San Francisco, but there's no place like Philly. I come back about twice a year to visit family and friends. If I'm in town during a PLUG (phillylinux.org) meeting I'll typically drop by, sometimes even give a presentation about some of my latest work. I also spoke at Fosscon (fosscon.org) in Philadelphia last August and hope to again this year.

TL: Your domain is princessleia.com so I have to ask... Which of Chapter 4, 5, or 6 is your favorite?

EKJ: A New Hope (Episode 4) will always be my favorite. Self-contained, not too complicated, and so endearing!

Thanks to Elizabeth for taking the time to do this interview! See you at LOPSA-East!

April 08, 2014 07:25 PM

TechRepublic Network Administrator

Intel and Cloudera: Why we're better together for Hadoop

Cloudera's CEO and Intel's GM of datacenter software explain what Intel's $740m investment in Cloudera means for the future of the big-data analytics platform.

by Nick Heath at April 08, 2014 03:59 PM

Standalone Sysadmin

Fun lesson on VRRP on Nexus

I'm in the middle of migrating our upstream links from the old 6500 core to the new Nexus switches, and I discovered something fun today that I thought I'd share.

Before, since I only had a single switch, each of my subnets had a VLAN interface which had the IP address of the gateway applied, such as this:

interface Vlan100
 description VLAN 100 -- Foo and Bar Stuff
 ip address 192.168.100.1 255.255.255.0
 no ip redirects
 ip dhcp relay information trusted
 ip route-cache flow
end

Pretty simple. But in the new regime, there are two switches, and theoretically, each switch should be fully capable of acting as the gateway. This is clearly a case for VRRP.

On Core01, the configuration looks like this:

interface Vlan100
  description Foo and Bar Stuff
  no shutdown
  ip address 192.168.100.2/24
  management
  vrrp 100
    authentication text MyVrrpPassword
    track 1 decrement 50
    address 192.168.100.1
    no shutdown

and on Core02, it looks like this:

interface Vlan100
  description Foo and Bar Stuff
  no shutdown
  ip address 192.168.100.3/24
  management
  vrrp 100
    authentication text MyVrrpPassword
    track 1 decrement 50
    address 192.168.100.1
    no shutdown

The only difference is the IP address assigned to the interface.

(Incidentally, I'm tracking the upstream interface. If the link goes dead, the track object decrements the priority by 50 points to make sure that VRRP fails the virtual IP over to the other switch.)

When both switches were configured this way, "show vrrp" reported that both switches were set to Master:

core01(config-if)# sho vrrp
      Interface  VR IpVersion Pri   Time Pre State   VR IP addr
---------------------------------------------------------------
        Vlan100 100   IPV4     100    1 s  Y  Master 192.168.100.1

core02(config-if)# sho vrrp
      Interface  VR IpVersion Pri   Time Pre State   VR IP addr
---------------------------------------------------------------
        Vlan100 100   IPV4     100    1 s  Y  Master 192.168.100.1

That's clearly not good. I verified that both switches were actually sending traffic:

2014 Apr 8 10:58:20.968603 vrrp-eng: Vlan100[Grp 100]: Sent packet for VR 100, intf 0x9010073

When digging in, what I found was that the switches were getting traffic from duplicate IPs:

Apr 8 11:06:33 core01 %ARP-3-DUP_VADDR_SRC_IP: arp [3573] Source address of packet received from 0000.5e00.0173 on Vlan100(port-channel1) is duplicate of local virtual ip, 192.168.100.1

On a whim, the other admin I was working with at the upstream provider had me disable the "management" flag. And of course, that made things start working immediately.

Apparently, setting the management flag (which ostensibly allows you to manage the switch in-band by using the address assigned to the interface) ALSO aggressively uses the VIP as its source address. I don't know why. It seems like a bug to me, but I'm going to get in touch with the TAC today and see if it's a known thing or not.

I thought you might be interested in knowing about this, anyway. Thanks for reading! (and if you have more information as to why this happens, please comment!)

by Matt Simmons at April 08, 2014 11:30 AM

Chris Siebenmann

My goals for gathering performance metrics and statistics

I've written before that one of my projects is putting together something to gather OS level performance metrics. Today I want to write down what my goals for this are. First off I should mention that this is purely for monitoring, not for alerting; we have a completely separate system for that.

The most important thing is to get visibility into what's going on with our fileservers and their iSCSI backends, because this is the center of our environment. We want at least IO performance numbers on the backends, network utilization and error counts on the backends and the fileservers, perceived IO performance for the iSCSI disks on the fileservers, ZFS level stats on the fileservers, CPU utilization information everywhere, and as many NFS level stats as we can conveniently get (in a first iteration this may amount to 'none'). I'd like to have both a very long history (half a year or more would be great) and relatively fine-grained measurements, but in practice we're unlikely to need fine-grained measurements very far into the past. To put it one way, we're unlikely to try to troubleshoot in detail a performance issue that's more than a week or so old. At the same time it's important to be able to look back and say 'were things as bad as this N months ago or did they quietly get worse on us?', because we have totally had that happen. Long term stats are also a good way to notice a disk that starts to quietly decay.

(In general I expect us to look more at history than at live data. In a live incident we'll probably go directly to iostat, DTrace, and so on.)

Next most important is OS performance information for a few crucial Ubuntu NFS clients such as our IMAP servers and our Samba servers (things like local IO, NFS IO, network performance, and oh sure CPU and memory stats too). These are very 'hot' machines, used by a lot of people, so if they have performance problems we want to know about it and have a good shot at tracking things down. Also, this sort of information is probably going to help for capacity planning, which means that we probably also want to track some application level stats if possible (eg the number of active IMAP connections). As with fileservers a long history is useful here.

Beyond that it would be nice to get the same performance stats from basically all of our Ubuntu NFS clients. If nothing else this could be used to answer questions like 'do people ever use our compute servers for IO intensive jobs' and to notice any servers with surprisingly high network IO that might be priorities for moving from 1G to 10G networking. Our general Ubuntu machines can presumably reuse much or all of the code and configuration from the crucial Ubuntu machines, so this should be relatively easy.

In terms of displaying the results, I think that the most important thing will be an easy way of doing ad-hoc graphs and queries. We're unlikely to wind up with any particular fixed dashboard that we look at to check for problems; as mentioned, alerting is another system entirely. I expect us to use this metrics system more to answer questions like 'what sort of peak and sustained IO rates do we typically see during nightly backups' or 'is any backend disk running visibly slower than the others'.

I understand that some systems can ingest various sorts of logs, such as syslog and Apache logs. This isn't something that we'd do initially (just getting a performance metrics system off the ground will be a big enough project by itself). The most useful thing to have for problem correlation purposes would be markers for when client kernels report NFS problems, and setting up an entire log ingestion system for that seems a bit overkill.

(There are a lot of neat things we could do with smart log processing if we had enough time and energy, but my guess is that a lot of them aren't really related to gathering and looking at performance metrics.)

Note that all of this is relatively backwards from how you would do it in many environments, where you'd start from application level metrics and drill downwards from there because what's ultimately important is how the application performs. Because we're basically just a provider of vague general computing services to the department, we work from the bottom up and have relatively little 'application' level metrics we can monitor.

(With that said, it certainly would be nice to have some sort of metrics on how responsive and fast the IMAP and Samba servers were for users and so on. I just don't know if we can do very much about that, especially in an initial project.)

PS: There are of course a lot of other things we could gather metrics for and then throw into the system. I'm focusing here on what I want to do first and for the likely biggest payoff. Hopefully this will help me get over the scariness of uncertainty and actually get somewhere on this.

by cks at April 08, 2014 04:46 AM

Racker Hacker

openssl heartbleed updates for Fedora 19 and 20

The openssl heartbleed bug has made the rounds today and there are two new testing builds of openssl out for Fedora 19 and 20:

Both builds are making their way from the updates-testing repository over into stable thanks to some quick testing and karma from the Fedora community.

If the stable updates haven’t made it into your favorite mirror yet, you can live on the edge and grab the koji builds:

For Fedora 19 x86_64:

yum -y install koji
koji download-build --arch=x86_64 openssl-1.0.1e-37.fc19.1
yum localinstall openssl-1.0.1e-37.fc19.1.x86_64.rpm

For Fedora 20 x86_64:

yum -y install koji
koji download-build --arch=x86_64 openssl-1.0.1e-37.fc20.1
yum localinstall openssl-1.0.1e-37.fc20.1.x86_64.rpm

Be sure to replace x86_64 with i686 for 32-bit systems or armv7hl for ARM systems (Fedora 20 only). If your system has openssl-libs or other openssl packages installed, be sure to download and install those with yum as well.

Kudos to Dennis Gilmore for the hard work and to the Fedora community for the quick tests.

by Major Hayden at April 08, 2014 01:18 AM

April 07, 2014

CiscoZine

March 2014: nine Cisco vulnerabilities

The Cisco Product Security Incident Response Team (PSIRT) has published nine important vulnerability advisories:

  • Cisco IOS Software SSL VPN Denial of Service Vulnerability
  • Cisco IOS Software Session Initiation Protocol Denial of Service Vulnerability
  • Cisco IOS Software Internet Key Exchange Version 2 Denial of Service Vulnerability
  • Cisco IOS Software Crafted IPv6 Packet Denial of Service Vulnerability
  • Cisco 7600 Series Route Switch Processor 720 with 10 Gigabit Ethernet Uplinks Denial of Service Vulnerability
  • Cisco IOS Software Network Address Translation Vulnerabilities
  • Cisco AsyncOS Software Code Execution Vulnerability
  • Cisco Small Business Router Password Disclosure Vulnerability
  • Multiple Vulnerabilities in Cisco Wireless LAN Controllers

Cisco IOS […]

by Fabio Semperboni at April 07, 2014 07:50 PM

SysAdmin1138

The different kinds of money

Joseph Kern posted this gem to Twitter yesterday.

It's one of those things I never thought about since I kind of instinctively learned what it is, but I'm sure there are those out there who don't know the difference between a Capital Expenditure and an Operational Expenditure, and what that means when it comes time to convince the fiduciary Powers That Be to fork over money to upgrade/install something that there is a crying need for.

Capital Expenditures

In short, these are (usually) one-time payments for things you buy once:

  • Server hardware.
  • Large storage arrays.
  • Perpetual licenses.
  • HVAC units.
  • UPS systems (but not batteries, see below).

Operational Expenditure

These are things that come with an ongoing cost of some kind. Could be monthly, could be annual.

  • Your AWS bill.
  • The Power Company bill for your datacenter.
  • Salaries and benefits for staff.
  • Consumables for your hardware (UPS batteries, disk-drives)
  • Support contract costs.
  • Annual renewal licenses.

Savvy vendors have figured out a fundamental truth to budgeting:

OpEx ends up in the 'base-budget' and doesn't have to be justified every year, so is easier to sell.
CapEx has to be fought for every time you go to the well.

This is part of why perpetual licenses are going away.


But you, the sysadmin with a major problem on your hands, have found a solution for it. It is expensive, which means you need to get approval before you go buy it. It is very important that you know how your organization views these two expense categories. Once you know that, you can vet solutions for their likelihood of acceptance by cost-sensitive upper management. Different companies handle things differently.

Take a scrappy, bootstrapped startup. This is a company that does not have a deep bank-account, likely lives month to month on revenue, and a few bad months in a row can be really bad news. This is a company that is very sensitive to costs right now. Large purchases can be planned for and saved for (just like you do with cars). Increases in OpEx can make a month in the black become one in the red, and we all know what happens after too many red months. For companies like these, pitch towards CapEx. A few very good months means more cash, cash that can be spread on infrastructure upgrades.

Take a VC fueled startup. They have a large pile of money somewhere and are living off of it until they can reach profitability. Stable OpEx means calculating runway is easier, something investors and prospective employees like to know. Increased non-people CapEx means more assets to dissolve when the startup goes bust (as most do). OpEx (that AWS bill) is an easier pitch.

Take a civil-service job much like one of my old ones. This is big and plugged into the public finance system. CapEx costs over a certain line go before review (or worse, an RFP process), and really big ones may have to go before law-makers for approval. Departmental budget managers know many ways to... massage... things to get projects approved with minimal overhead. One of those ways is increasing OpEx, which becomes part of the annually approved budget. OpEx is treated differently than CapEx, and is often a lot easier to get approved... so long as costs are predictable 12 months in advance.


by SysAdmin1138 at April 07, 2014 05:26 PM

Chris Siebenmann

Giving in: pragmatic If-Modified-Since handling for Tiny Tiny RSS

I wrote yesterday about how Tiny Tiny RSS drastically mishandles generating If-Modified-Since headers for conditional GETs, but I didn't say anything about what my response to it is. DWiki insists on strict equality checking between If-Modified-Since and the Last-Modified timestamp (for good reasons), so Tiny Tiny RSS was basically doing unconditional GETs all the time.

I could have left the situation like that, and I actually considered it. Given the conditional GET irony I was never saving any CPU time on successful conditional GETs, only bandwidth, and I'm not particularly bandwidth constrained (either here or potentially elsewhere; 'small' bandwidth allocations on VPSes seem to be in the multiple TBs a month range by now). On the other hand, these requests were using up quite a lot of bandwidth because my feeds are big and Tiny Tiny RSS is quite popular, and that unnecessary bandwidth usage irritated me.

(Most of the bandwidth that Wandering Thoughts normally uses is in feed requests, eg today 87% of the bandwidth was for feeds.)

So I decided to give in and be pragmatic. Tiny Tiny RSS expects you to be doing timestamp comparisons for If-Modified-Since, so I added a very special hack that does just that if and only if the user agent claims to be some version of Tiny Tiny RSS (and various other conditions apply, such as no If-None-Match header being supplied). Looking at my logs this appears to have roughly halved the bandwidth usage for serving feeds, so I'm calling it worth it at least for now.
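As an aside, you can check this sort of behaviour by hand with curl (the URL and date here are placeholders); a 304 response means the conditional GET matched and no feed body was sent:

  curl -s -o /dev/null -w '%{http_code}\n' \
    -H 'If-Modified-Since: Mon, 07 Apr 2014 00:00:00 GMT' \
    http://example.com/blog/?atom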

I don't like putting hacks like this into my code (and it doesn't fully solve Tiny Tiny RSS's problems with over-fetching feeds either), but I'm probably going to keep it. The modern web is a world full of pragmatic tradeoffs and is notably lacking in high-minded purity of implementation.

by cks at April 07, 2014 05:07 AM

April 06, 2014

Steve Kemp's Blog

So that distribution I'm not-building?

The other week I was toying with using GNU stow to build an NFS-share, which would allow remote machines to boot from it.

It worked. It worked well. (Standard stuff, PXE booting with an NFS-root.)

Then I started wondering about distributions, since in one sense what I'd built was a minimal distribution.

On that basis yesterday I started hacking something more minimal:

  • I compiled a monolithic GNU/Linux kernel.
  • I created a minimal initrd image, using busybox.
  • I built a static version of the tcc compiler.
  • I got the thing booting, via KVM.

Unfortunately here is where I ran out of patience. Using tcc and the static C library I can compile code. But I can't link it.

$ cat > t.c <<EOF
int main ( int argc, char *argv[] )
{
        printf("OK\n" );
        return 1;
}
EOF
$ /opt/tcc/bin/tcc t.c
tcc: error: file 'crt1.o' not found
tcc: error: file 'crti.o' not found
..

Attempting to fix this up resulted in nothing much better:

$ /opt/tcc/bin/tcc t.c -I/opt/musl/include -L/opt/musl/lib/

And because I don't have a full system I cannot compile t.c to t.o and use ld to link (because I have no ld.)

I had a brief flirt with the portable c-compiler, pcc, but didn't get any further with that.

I suspect the real solution here is to install gcc onto my host system, with something like --prefix=/opt/gcc, and then rsync that into my (suddenly huge) initramfs image. Then I have all the toys.
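In outline it'd be something like this, heavily simplified (a real gcc build needs gmp/mpfr/mpc and friends, and "initramfs-root" here is just a placeholder for wherever the unpacked initramfs tree lives):

  ./configure --prefix=/opt/gcc && make && make install
  rsync -av /opt/gcc/ initramfs-root/opt/gcc/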

April 06, 2014 02:35 PM