Planet SysAdmin

January 17, 2018

Chris Siebenmann

My new Ryzen desktop is causing Linux to hang (and it's frustrating)

Normally I try to stick to a sunny tone here. Today is an unfortunate exception, since it's a problem that I'm very close to and that I have no solution for.

Last Friday, I assembled the hardware for my new office workstation, updated the BIOS, and let it sit powered on over the weekend with a scratch Fedora install, periodically running burn-in tests like mprime. Everything went fine. Normally I would probably have let the assembled machine sit for a while and run additional burn-in tests before doing anything more, but over the weekend my current (old) workstation showed worrying signs of increased flakiness, so on Monday I swapped my disks over to the new Ryzen-based hardware. Everything came up quite easily and it all looked good (and clearly faster), right up until the machine started locking up. At first I thought I had a culprit in the amdgpu kernel driver used by the new machine's Radeon RX 550 based graphics card, and I turned up a Fedora bug with a workaround. Unfortunately that doesn't appear to be a complete fix, because the machine has hung several times since then. For the latest hangs I've had netconsole enabled, and I've actually gotten output; unfortunately this has just made things more frustrating, because it is just a steady stream of 'watchdog: BUG: soft lockup - CPU#4 stuck for 23s!' reports.

(These reports are interesting in a way, because apparently the system is not so stuck that it cannot increment the timer. However, it is so stuck that it doesn't respond to the network or to the console, and it doesn't seem to notice a console Magic SysRq.)

In both sets of netconsole traces I've collected so far, I see call traces that run through cross-CPU communication, often TLB work in general and native_flush_tlb_others specifically. For example:

Call Trace:
 ? __vma_rb_erase+0x1f1/0x270

This is interesting because of an old reddit post that blames this on 'Core C6 State', and there's also this bug report. On the other hand, the machine sat idle all weekend and didn't hang; in fact, it would have been more idle on the weekend than it was when it hung recently. However, I'm grasping at straws here.

(There's also this Ubuntu bug report, which has a long discussion of tangled and complicated workarounds, and this Fedora one. Probably there are others.)

The sensible thing to do right now is probably to swap my disks back into my old hardware until I have more time to deal with the problem (I have to get things stabilized tomorrow). But it is tempting to grasp at a number of straws:

  • swap the RX 550 card out for the very basic card in my old machine. This should completely eliminate both amdgpu and the new GPU hardware itself as a source of issues.

  • switch back to a kernel before CONFIG_RETPOLINE, because I use a number of out-of-tree modules and I've noticed their build process muttering about my gcc not having the needed support for this. I'm using the latest gcc that Fedora has released, and you'd hope that would be good enough, but I have no idea what's going on.

  • go through the BIOS to turn off 'Core C6 State' and any other fancy tuning options (and verify that it hasn't decided to silently turn on some theoretically mild and harmless automatic overclocking options). It's possible that the BIOS is deciding to do things that Linux objects to, although I don't know why it would have started to fail only once I swapped disks around. (The paranoid person wonders about UEFI versus MBR booting, but I'm not sure I'm that paranoid.)

(If I did all of this and the machine hung anyway, well, I'd be able to swap my disks back into my old desktop with no regrets.)

In the longer term, troubleshooting this and reporting any issues is probably going to be quite complicated. One of the problems is that I absolutely have to have one out-of-tree kernel module (ZFS on Linux) and I very much want another one (WireGuard). I suspect that the presence of these will cause any bug reports to be rejected more or less out of hand. In an ideal world this problem would reproduce itself on a scratch Fedora install with a stock kernel environment that's doing things like running graphics stress programs, but I'm not going to hold my breath. It seems quite possible that it will only happen if I'm actually using the machine, which has all sorts of problems.

(I have one idea, but it is complicated and rather annoying.)

The whole thing is frustrating and puzzling. We have a stable Ubuntu machine with a Ryzen 1800X and the same motherboard (but a different GPU), and this machine itself seemed fine right up until I swapped in my existing disks. Even post-swap it was perfectly fine with a six-plus hour mprime -t run overnight. But if I use it, it hangs sooner or later, and it now seems to be hanging even when I don't use it.

(And it appears that this motherboard doesn't have a hardware watchdog timer that's currently supported by Linux. I tried enabling the software watchdog, but it didn't trigger for literally hours, and when it finally did, it apparently didn't manage to actually reboot the system, which is perhaps not too surprising under the circumstances.)

PPS: This does put a rather large crimp in my Ryzen temptation, especially if this is something systemic and widespread.

Sidebar: It's possible that I've had multiple issues

I may have hit both an amdgpu issue with Radeon RX 550s, which I've now mitigated, and some sort of issue with the BIOS putting a chunk of the machine to sleep and then Linux not being able to wake it up again. My initial hangs definitely happened while I was in front of the machine actively using it, but I believe that the hangs since I set amdgpu.dpm=0 this morning have been when I wasn't around the machine and it was at least partially idle. These are the only hangs that I have netconsole logs for, too, and they show that the machine is partially alive instead of totally hung.

by cks at January 17, 2018 07:20 AM

Colin Percival

Some thoughts on Spectre and Meltdown

By now I imagine that all of my regular readers, and a large proportion of the rest of the world, have heard of the security issues dubbed "Spectre" and "Meltdown". While there have been some excellent technical explanations of these issues from several sources — I particularly recommend the Project Zero blog post — I have yet to see anyone really put these into a broader perspective; nor have I seen anyone make a serious attempt to explain these at a level suited for a wide audience. While I have not been involved with handling these issues directly, I think it's time for me to step up and provide both a wider context and a more broadly understandable explanation.

January 17, 2018 02:40 AM

January 16, 2018

Mozilla Firefox to require HTTPS in order to use latest features


This is a pretty bold move! All new JavaScript or CSS features are only going to be exposed on "Secure Contexts", aka HTTPS sites.

If your site isn't on HTTPS, you can't use the latest features in Firefox.

Requiring secure contexts for all new features

Effective immediately, all new features that are web-exposed are to be restricted to secure contexts. Web-exposed means that the feature is observable from a web page or server, whether through JavaScript, CSS, HTTP, media formats, etc. A feature can be anything from an extension of an existing IDL-defined object, a new CSS property, a new HTTP response header, to bigger features such as WebVR. In contrast, a new CSS color keyword would likely not be restricted to secure contexts.

Source: Secure Contexts Everywhere | Mozilla Security Blog


by Mattias Geniar at January 16, 2018 08:50 PM

Cryptography Engineering

iCloud in China

Last week Apple made an announcement describing changes to the iCloud service for users residing in mainland China. Beginning on February 28th, all users who have specified China as their country/region will have their iCloud data transferred to the GCBD cloud services operator in Guizhou, China.

Chinese news sources optimistically describe the move as a way to offer improved network performance to Chinese users, while Apple admits that the change was mandated by new Chinese regulations on cloud services. Both explanations are almost certainly true. But neither answers the following question: regardless of where it’s stored, how secure is this data?

Apple offers the following:

“Apple has strong data privacy and security protections in place and no backdoors will be created into any of our systems”

That sounds nice. But what, precisely, does it mean? If Apple is storing user data on Chinese services, we have to at least accept the possibility that the Chinese government might wish to access it — and possibly without Apple’s permission. Is Apple saying that this is technically impossible?

This is a question, as you may have guessed, that boils down to encryption.

Does Apple encrypt your iCloud backups?

Unfortunately there are many different answers to this question, depending on which part of iCloud you’re talking about, and — ugh — which definition you use for “encrypt”. The dumb answer is the one given in the chart on the right: all iCloud data probably is encrypted. But that’s the wrong question. The right question is: who holds the key(s)?

This kind of thing is Not Helpful.

There’s a pretty simple thought experiment you can use to figure out whether you (or a provider) control your encryption keys. I call it the “mud puddle test”. It goes like this:

Imagine you slip in a mud puddle, in the process (1) destroying your phone, and (2) developing temporary amnesia that causes you to forget your password. Can you still get your iCloud data back? If you can (with the help of Apple Support), then you don’t control the key.

With one major exception — iCloud Keychain, which I’ll discuss below — iCloud fails the mud puddle test. That’s because most Apple files are not end-to-end encrypted. In fact, Apple’s iOS security guide is clear that it sends the keys for encrypted files out to iCloud.

However, there is a wrinkle. You see, iCloud isn’t entirely an Apple service, not even here in the good-old U.S.A. In fact, the vast majority of iCloud data isn’t actually stored by Apple at all. Every time you back up your phone, your (encrypted) data is transmitted directly to a variety of third-party cloud service providers including Amazon, Google and Microsoft.

(A list of HTTPS requests made during an iCloud backup from an iPhone shows the bottom two addresses belonging to Amazon and Google Cloud Services “blob” stores.)

And this is, from a privacy perspective, mostly** fine! Those services act merely as “blob stores”, storing unreadable encrypted data files uploaded by Apple’s customers. At least in principle, Apple controls the encryption keys for that data, ideally on a server located in a dedicated Apple datacenter.*

So what exactly is Apple storing in China?

Good question!

You see, it’s entirely possible that the new Chinese cloud stores will perform the same task that Amazon AWS, Google, or Microsoft do in the U.S. That is, they’re storing encrypted blobs of data that can’t be decrypted without first contacting the iCloud mothership back in the U.S. That would at least be one straightforward reading of Apple’s announcement, and it would also be the most straightforward mapping from iCloud’s current architecture and whatever it is Apple is doing in China.

Of course, this interpretation seems hard to swallow. In part this is due to the fact that some of the new Chinese regulations appear to include guidelines for user monitoring. I’m no lawyer, and certainly not an expert in Chinese law — so I can’t tell you if those would apply to backups. But it’s at least reasonable to ask whether Chinese law enforcement agencies would accept the total inability to access this data without phoning home to Cupertino, not to mention that this would give Apple the ability to instantly wipe all Chinese accounts. Solving these problems (for China) would require Apple to store keys as well as data in Chinese datacenters.

The critical point is that these two interpretations are not compatible. One implies that Apple is simply doing business as usual. The other implies that they may have substantially weakened the security protections of their system — at least for Chinese users.

And here’s my problem. If Apple needs to fundamentally rearchitect iCloud to comply with Chinese regulations, that’s certainly an option. But they should say explicitly and unambiguously what they’ve done. If they don’t make things explicit, then it raises the possibility that they could make the same changes for any other portion of the iCloud infrastructure without announcing it.

It seems like it would be a good idea for Apple just to clear this up a bit.

You said there was an exception. What about iCloud Keychain?

I said above that there’s one place where iCloud passes the mud puddle test. This is Apple’s Cloud Key Vault, which is currently used to implement iCloud Keychain. This is a special service that stores passwords and keys for applications, using a much stronger protection level than is used in the rest of iCloud. It’s a good model for how the rest of iCloud could one day be implemented.

For a description, see here. Briefly, the Cloud Key Vault uses a specialized piece of hardware called a Hardware Security Module (HSM) to store encryption keys. This HSM is a physical box located on Apple property. Users can access their own keys if and only if they know their iCloud Keychain password — which is typically the same as the PIN/password on your iOS device. However, if anyone attempts to guess this PIN too many times, the HSM will wipe that user’s stored keys.

The critical thing is that the “anyone” mentioned above includes even Apple themselves. In short: Apple has designed a key vault that even they can’t be forced to open. Only customers can get their own keys.
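Apple has not published the vault's internals, but the guess-limit-then-wipe behavior described above can be sketched as a toy model. Everything here (the class name, the attempt limit, the API) is illustrative, not Apple's actual design; a real HSM enforces this logic in tamper-resistant hardware rather than software:

```python
import os

class ToyKeyVault:
    """Toy model of a guess-limited key vault.

    Illustrative only: a real HSM enforces the counter and the wipe in
    tamper-resistant hardware, so even the operator cannot bypass it.
    """

    MAX_ATTEMPTS = 10  # hypothetical limit; Apple's real limit is not public

    def __init__(self, pin: str):
        self.pin = pin
        self.stored_key = os.urandom(32)  # the user's escrowed key material
        self.failures = 0

    def retrieve_key(self, pin_guess: str):
        if self.stored_key is None:
            raise RuntimeError("keys wiped")
        if pin_guess != self.pin:
            self.failures += 1
            if self.failures >= self.MAX_ATTEMPTS:
                # Wipe: after this, nobody -- including the operator -- can recover the key.
                self.stored_key = None
            return None
        self.failures = 0
        return self.stored_key
```

The point of the model is the last branch: once the counter trips, the key material is gone for everyone, operator included.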

What’s strange about the recent Apple announcement is that users in China will apparently still have access to iCloud Keychain. This means that either (1) at least some data will be totally inaccessible to the Chinese government, or (2) Apple has somehow weakened the version of Cloud Key Vault deployed to Chinese users. The latter would be extremely unfortunate, and it would raise even deeper questions about the integrity of Apple’s systems.

Probably there’s nothing funny going on, but this is an example of how Apple’s vague (and imprecise) explanations make it harder to trust their infrastructure around the world.

So what should Apple do?

Unfortunately, the problem with Apple’s disclosure of its China news is, well, really just a version of the same problem that has existed with Apple’s entire approach to iCloud.

Where Apple provides overwhelming detail about their best security systems (file encryption, iOS, iMessage), they provide distressingly little technical detail about the weaker links like iCloud encryption. We know that Apple can access and even hand over iCloud backups to law enforcement. But what about Apple’s partners? What about keychain data? How is this information protected? Who knows.

This vague approach to security might make it easier for Apple to brush off the security impact of changes like the recent China news (“look, no backdoors!”). But it also muddies the picture, and calls into doubt any technical security improvements that Apple might be planning. For example, this article from 2016 claims that Apple is planning stronger overall encryption for iCloud. Are those plans scrapped? And if not, will they fly in the new Chinese version of iCloud? Will there be two technically different versions of iCloud? Who even knows?

And at the end of the day, if Apple can’t trust us enough to explain how their systems work, then maybe we shouldn’t trust them either.


* This is actually just a guess. Apple could also outsource their key storage to a third-party provider, even though this would be dumb.

** A big caveat here is that some iCloud backup systems use convergent encryption, also known as “message locked encryption”. The idea in these systems is that file encryption keys are derived by hashing the file itself. Even if a cloud storage provider does not possess encryption keys, it might be able to test if a user has a copy of a specific file. This could be problematic. However, it’s not really clear from Apple’s documentation if this attack is feasible. (Thanks to RPW for pointing this out.)
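To make that caveat concrete, here is a sketch of convergent (message-locked) encryption. The hash-derived keystream is purely illustrative — real systems would use a proper cipher such as AES keyed by the file hash — but it shows the defining property: the key comes from the plaintext itself, so encryption is deterministic.

```python
import hashlib

def convergent_encrypt(plaintext: bytes) -> bytes:
    """Convergent-encryption sketch: the key is derived by hashing the
    plaintext, so the same file always yields the same ciphertext.
    (Toy hash-based keystream for illustration; not a real cipher.)"""
    key = hashlib.sha256(plaintext).digest()
    keystream = bytearray()
    counter = 0
    while len(keystream) < len(plaintext):
        keystream.extend(hashlib.sha256(key + counter.to_bytes(8, "big")).digest())
        counter += 1
    return bytes(p ^ k for p, k in zip(plaintext, keystream))

# Two users uploading the same file produce identical ciphertexts, so a
# storage provider holding the ciphertext of a known file can test whether
# a user has a copy of that file -- the privacy caveat described above.
```

Determinism is what enables deduplication, and it is also exactly what enables the "does this user have file X?" test.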

by Matthew Green at January 16, 2018 07:44 PM


Addressing Innumeracy in Reporting

Anyone involved in cybersecurity reporting needs a strong sense of numeracy, or mathematical literacy. I see two sorts of examples of innumeracy repeatedly in the media.

The first involves the time value of money. Recently CNN claimed Amazon CEO Jeff Bezos was the "richest person in history" and Recode said Bezos was "now worth more than Bill Gates ever was." Thankfully both Richard Steinnon and Noah Kirsch recognized the foolishness of these reports, correctly noting that Bezos would only rank number 17 on a list where wealth was adjusted for inflation.

This failure to recognize the time value of money is pervasive. Just today I heard the host of a podcast claim that the 1998 Jackie Chan movie Rush Hour was "the top grossing martial arts film of all time." According to Box Office Mojo, Rush Hour earned $244,386,864 worldwide. Adjusting for inflation, in 2017 dollars that's $367,509,865.67 -- impressive!

For comparison, I researched the box office returns for Bruce Lee's Enter the Dragon. Box Office Mojo lacked data, but I found a 2017 article stating his 1973 movie earned "$25 million in the U.S. and $90 million worldwide, excluding Hong Kong." If I adjust the worldwide figure of $90 million for inflation, in 2017 dollars that's $496,864,864.86 -- making Enter the Dragon easily more successful than Rush Hour.

If you're wondering about Crouching Tiger, Hidden Dragon, that 2000 movie earned $213,525,736 worldwide. That movie earned less than Rush Hour, and arrived two years later, so it's not worth doing the inflation math.

The take-away is that any time you are comparing dollars from different time periods, you must adjust for inflation for your comparisons to have any meaning whatsoever.
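The adjustment itself is just a ratio of price indexes. A quick sketch in Python, using approximate annual-average CPI-U values (the exact result depends on which index and month you pick, so treat the figures as illustrative):

```python
def adjust_for_inflation(amount, cpi_then, cpi_now):
    """Convert a historical dollar amount into current dollars by
    scaling with the ratio of consumer price indexes."""
    return amount * (cpi_now / cpi_then)

# Approximate annual-average CPI-U values (illustrative):
CPI = {1973: 44.4, 1998: 163.0, 2017: 245.1}

# Rush Hour (1998) vs. Enter the Dragon (1973), both in 2017 dollars:
rush_hour_2017 = adjust_for_inflation(244_386_864, CPI[1998], CPI[2017])
enter_dragon_2017 = adjust_for_inflation(90_000_000, CPI[1973], CPI[2017])
# Enter the Dragon comes out well ahead once both are in 2017 dollars.
```

With these index values the results land near the $367 million and $497 million figures above.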

Chart by @CanadianFlags
The second sort of innumeracy I'd like to highlight today also involves money, but in a slightly different way. This involves changes in values over time.

For example, a company may grow revenue from 2015 to 2016, with 2015 revenue being $100,000 and 2016 being $200,000. That's a 100% gain.

If the company grows another $100,000 from 2016 to 2017, from $200,000 to $300,000, the growth rate has declined to 50%. To have maintained a 100% growth rate, the company needed to make $400,000 in 2017.

That same $100,000 increase isn't so great when compared to the new base value.
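The growth-rate arithmetic above can be checked in a couple of lines:

```python
def growth_rate(old, new):
    """Percentage growth from old to new, relative to the old base."""
    return (new - old) / old * 100

# The same absolute $100,000 increase shrinks as a percentage
# because the base keeps growing:
assert growth_rate(100_000, 200_000) == 100.0  # 2015 -> 2016
assert growth_rate(200_000, 300_000) == 50.0   # 2016 -> 2017
assert growth_rate(200_000, 400_000) == 100.0  # what 100% growth would need
```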

We see the same dynamic at play when tracking the growth of individual stocks or market indices over time.

CNN wrote a story about the 1,000 point rise in the Dow Jones Industrial Average over a period of 7 days, from 25,000 to 26,000. One person Tweeted the chart at the above right, asking "is that healthy?" My answer -- you need a proper chart!

My second reaction was "that's a jump, but it's only 1,000/25,000 = 4.0%." Yes, 4% in 7 days is a lot, but that doesn't even rate in the top 20 one-day percentage gains or losses over the life of the index.

If the DJIA had gained 1,000 points in 7 days 5 years ago, when the market was at 13,649, a rise to 14,649 would have been a 7.3% gain. 20 years ago the market was roughly 3,310, so a 1,000 point rise to 4,310 would have been a massive 30.2% gain.

A better way to depict the growth in the DJIA would be to use a logarithmic chart. The charts below show a linear version on the top and a logarithmic version below it.

I drew the last 30 years of the DJIA at the top using a linear Y axis, meaning there is equal distance between 2,000 and 4,000, 4,000 and 6,000, and so on. The blue line shows the slope of the growth.

I then drew the same period using a logarithmic Y axis, meaning the percentage gains from one line to another are equal. For example, a 100% increase from 1,000 to 2,000 occupies the same distance as the 100% increase from 5,000 to 10,000. The green line shows the slope of the growth.
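That equal-distance property of a log axis is easy to verify numerically:

```python
import math

def log_distance(lo, hi):
    """Vertical distance between two values on a log10 axis."""
    return math.log10(hi) - math.log10(lo)

# A 100% gain covers the same log-axis distance wherever it occurs:
d1 = log_distance(1_000, 2_000)    # 1,000 -> 2,000
d2 = log_distance(5_000, 10_000)   # 5,000 -> 10,000
assert abs(d1 - d2) < 1e-12

# ...whereas on a linear axis the second move looks five times bigger:
assert (10_000 - 5_000) == 5 * (2_000 - 1_000)
```

This is why equal percentage moves plot as equal vertical steps on the log chart, taming the apparent explosion in the linear one.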

I put the blue and green lines on both charts to permit comparison of the slopes. As you can see, the growth, when properly indicated using the log chart and the green line, is less dramatic than the exaggeration introduced by the linear chart's blue line.

There is indeed an upturn recently in the log chart, but the growth is probably on trend over time.

While we're talking about the market, let's take one minute to smack down the old trope that "what comes up, must come down." There is no "law of gravity" in investing, at least for the US market, as a whole.

The best example I have seen of the reality of the situation is this 2017 article titled The Dow’s tumultuous 120-year history, in one chart. Here is the chart:

Chart by Chris Kacher, managing director of MoKa Investors

What an amazing story. The title of the article should not be gloomy. It should be triumphant. Despite two World Wars, a Cold War, wars in Korea, Vietnam, the Middle East, and elsewhere, assassinations of world leaders, market depressions and recessions, and so on, the trend line is up, and up in a big way. While the DJIA doesn't represent the entire US market, it captures enough of it to be representative. This is why I do not bet against the US market over the long term. (And yes I recognize that the market and the economy are different.)

Individual companies may disappear, and the DJIA has indeed been changed many times over the years. However, those changes were made so that the index roughly reflected the makeup of the economy. Is it perfect? No. Does it capture the overall directional trend line since 1896? Yes.

Please keep in mind these two sorts of innumeracy -- the time value of money, and the importance of percentage changes over time -- when dealing with numbers and time.

by Richard Bejtlich at January 16, 2018 05:31 PM

Chris Siebenmann

You could say that Linux is AT&T's fault

Recently on Twitter, I gave in to temptation. It went like this:

@thatcks: Blog post: Linux's glibc monoculture is not a bad thing (tl;dr: it's not a forced monoculture, it's mostly people naturally not needlessly duplicating effort)

@tux0r: Linux is duplicate work (ref.: BSD) and they still don't stop making new ones. :(

@oclsc: But their license isn't restrictive enough to be free! We HAVE to build our own wheel!

@thatcks: I believe you can direct your ire here to AT&T, given the origins and early history of Linux. (Or I suppose you could criticize the x86 BSDs.)

My tweet deserves some elaboration (and it turns out to be a bit exaggerated because I mis-remembered the timing a bit).

If you're looking at how we have multiple free Unixes today, with some descended from 4.x BSD and one written from scratch, it's tempting and easy to say that the people who created Linux should have redirected their efforts to helping develop the 4.x BSDs. Setting aside the licensing issues, this view is ahistorical, because Linux was pretty much there first. If you want to argue that someone was duplicating work, you have a decent claim that it's the BSDs who should have thrown their development effort in with Linux instead of vice versa. And beyond that, there's a decent case to be made that Linux's rise is ultimately AT&T's fault.

The short version of the history is that at the start of the 1990s, it became clear that you could make x86 PCs into acceptable inexpensive Unix machines. However, you needed a Unix OS in order to make this work, and there was no good inexpensive (or free) option in 1991. So, famously, Linus Torvalds wrote his own Unix kernel in mid 1991. This predated the initial releases of 386BSD, which came in 1992. Since 386BSD came from the 4.3BSD Net/2 release, it's likely that it was more functional than the initial versions of Linux. If things had proceeded unimpeded, perhaps it would have taken the lead from Linux and become the clear winner.

Unfortunately this is where AT&T comes in. At the same time as 386BSD was coming out, BSDI, a commercial company, was selling their own Unix derived from 4.3BSD Net/2 without having a license from AT&T (on the grounds that Net/2 didn't contain any code with AT&T copyrights). BSDI was in fact being somewhat cheeky about it; their 1-800 sales number was '1-800-ITS-UNIX', for example. So AT&T sued them, later extending the lawsuit to UCB itself over the distribution of Net/2. Since the lawsuit alleged that 4.3BSD Net/2 contained AT&T proprietary code, it cast an obvious cloud over everything derived from Net/2, 386BSD included.

The lawsuit was famous (and infamous) in the Unix community at the time, and there was real uncertainty over how it would be resolved for several crucial years. The Wikipedia page is careful to note that 386BSD was never a party to the lawsuit, but I'm pretty sure this was only because AT&T didn't feel the need to drag them in. Had AT&T won, I have no doubt that there would have been some cease & desist letters going to 386BSD and that would have been that.

(While Dr Dobb's Journal published 386BSD Release 1.0 in 1994, they did so after the lawsuit was settled.)

I don't know for sure if the AT&T lawsuit deterred people from working on 386BSD and tilted them toward working on Linux (and putting together various early distributions). There were a number of things going on at the time beyond the lawsuit, including politics in 386BSD itself (see eg the FreeBSD early history). Perhaps 386BSD would have lost out to Linux even without the shadow of the lawsuit looming over it, simply because it was just enough behind Linux's development and excitement. But I do think that you can say AT&T caused Linux and have a decent case.

(AT&T didn't literally cause Linux to be written, because the lawsuit was only filed in 1992, after Torvalds had written the first version of his kernel. You can imagine what-if scenarios about an earlier release of Net/2, but given the very early history of Linux I'm not sure it would have made much of a difference.)

by cks at January 16, 2018 05:08 AM

January 15, 2018

Michael Biven

Brain Dump

Ever wonder why you have so many ideas when you’re in the bathroom taking a shower, or a bath, brushing your teeth or that other thing we use the room for?

Ever notice that there are no screens in there to pull or hold our attention?

Take a minute and count how many screens are around you right now. How many different TVs, phones, tablets, computers, e-readers, portable video game consoles, smartwatches, or VR headsets are within your sight? How many of these are in the bathroom?

p.s. please don’t admit to owning a pair of AR glasses.

While having a discussion with my wife, who had a MacBook on her lap, an iPad closed on the ottoman next to her feet, and her iPhone within reach, I said…

“the nuance is blurred” and then I had no reply as she was still reading whatever it was she was reading.

I waited a minute and then asked her “you know I just asked you a question right?”

She replied “yes, something about being blurred.”

I repeated what I had said “yeah the nuance is blurred” as I started to walk to the bathroom I added “Blur as in the band, as in song number two, as in what I’m going to do.”

Which I did and then noticed there are no screens in the bathroom and wondered maybe this is why we have flashes of clarity and creativity in here. Which reminded me we recently joked about getting an Amazon Echo in here so we could ask Alexa to take notes so we don’t forget things.

So as I step back into the living room, even before I’m through the bathroom door I’m calling out to her “Don’t say anything! I had an idea that I need to write down before I forget!” She was looking at me as I stepped out and I was left with the impression she was holding onto something to tell me.

I grab my laptop, sit down and start writing. At this point I’ve read everything that you’ve read so far to her and she laughs.

We talk about the fact there’s a screen in every room in the house except for the bathroom. We go over the semantics of whether there really are screens in the bedroom. Because you know the iPads we both have can follow us around. She uses hers throughout the day and I leave mine on the bedside table to watch something when I go to bed. Which isn’t really going to bed as it’s just lying down and watching TV. No wonder I get so little sleep.

We talk about the screens even in our car. Hell, the car even emails and sends us text messages when it gets low on windshield wiper fluid. Our car has a drinking problem and it likes to let us know about it.

Used to be the only screen in the house was the one television set the family had and if you were well-off your parents had a second set in their bedroom to watch Johnny Carson together.

Instead we watch different things on our own screen that we hold out in front of us or sit on our bellies. Recently my wife asked if there’s an app so we can both watch the same thing synced up on our individual iPads. We laughed when we realized that yes there is and it’s called a television.

We seem to miss the opportunities we have to build something new and we underestimate the value of what we lost. Instead we build things that place each of us in our own individual world. Walled off with wireless headsets and a microphone, where you can be talking with someone who is ignoring the people around them or maybe you’re just listening to music that a machine picked to play for you.

Yes, there is usually at least one screen in the bathroom: a mirror. It’s passive and only reflects back what we bring to it. That’s the key difference. That screen is passive and is used as a tool so we can brush our teeth or have a moment of self-reflection.

Anyways, I just wanted to capture this before I lost it.

I never did find out what she was holding on to tell me.

January 15, 2018 07:01 PM

Chris Siebenmann

Meltdown and the temptation of switching to Ryzen for my new home machine

Back in November, I put together a parts list for my still hypothetical new home Linux machine. At the time I picked an Intel CPU because Intel is still the top in single-core performance, especially when you factor in TDP; the i7-8700 is clearly superior to the Ryzen 7 1700, which is the last (or first) 65W TDP Ryzen. Then two things happened. The first is that my new office workstation turned out to be Ryzen-based, and it appears to work fine, runs cool (actually cooler than my current machines), and seems quiet from limited testing. The second is Meltdown and, to a lesser extent, Spectre.

Mitigating Meltdown on Intel CPUs costs a variable and potentially significant amount of performance, depending on what your system is doing; a CPU bound program is only minorly affected, but something that interacts with the OS a lot has a problem. AMD CPUs are unaffected. AMD Zen-based CPUs, including Ryzens, are also partly immune to the branch predictor version of Spectre (from here) and so don't take a performance hit from mitigations for them.

(Currently, current Intel CPUs also cause heartburn for the retpoline Spectre mitigation, because they'll speculate through return instructions. This will apparently be changed in a microcode update, which will likely cost some performance.)

Almost the entire reason I was selecting an Intel CPU over a Ryzen was the better single-core performance; with more cores, everyone agrees that Ryzens are ahead on workloads that parallelize well. But it seems likely that Meltdown will throw away at least part of that advantage on at least some of the workloads that I care about, and anyway things like Firefox are becoming increasingly multi-threaded (although not for a while for me). There still are areas where Intel CPUs are superior to Ryzens, but then Ryzens have advantages themselves, such as supporting ECC (at least to some degree).

All of that is fine and rational, but if I'm being honest I have to admit that it's not the only reason. Another reason is that I plain don't like Intel's behavior. For years, Intel has taken advantage of lack of real competition to do things like not offer ECC in desktop CPUs or limit desktop CPUs to only four cores (it's remarkable how the moment AMD came along with real competition, Intel was able to crank that up to six cores and may go higher in the next generation). Meltdown provides a convenient reason or at least justification to spit in Intel's eye.

With all of that said, I don't know if I'm actually going to go through with this idea. A hypothetical Ryzen build is somewhat more expensive and somewhat more irritating than an Intel one, since it needs a graphics card and has more RAM restrictions, and it's at least possible that Intel will soon come out with new CPUs that do better in the face of Meltdown and Spectre (and have more cores). For the moment I'm probably just going to sit on my hands (again) and see how I like my new work desktop (when I turn the new machine into my work desktop).

(My home machine hasn't started exploding yet, so the path of least resistance and least effort is to do nothing. I'm very good at doing nothing.)

by cks at January 15, 2018 06:59 AM

January 14, 2018

Sarah Allen

more little rules for working life

It helps me to create little rules that provide default decisions for common and unusual situations. A couple of years ago, I wrote down my little rules for working life. Since then, I’ve collected a few more…


  • Speak the unspoken.
  • Have difficult conversations.
  • Find something remarkable, and remark on it, every day.
  • Not everything needs to be said.


  • Be intentional: for me, it takes reflection and constant conscious effort for my words and actions to reflect my values.
  • Consider your influencers (the people who influence you), and choose them as intentionally as you can.
  • Have a plan. Learn something. Change the plan.
  • Play the long game. Sometimes we have to do stuff that we don’t care about in the short-term, in order to meet expectations from people who decide if we get paid or if we get privileges. Even while we do the stupid short-term things, we can sometimes set ourselves up for some potentially awesome, or at least potentially meaningful future.
  • Focus on the outcome. Imagine what happens after you reach your goal. Then what? Often the real goal is the next thing, or the thing after that.

Getting Unstuck

  • When you hit a wall, step back and learn. Learn more about the problem. Who else sees it as a problem? Who made this wall anyhow? It’s probably there for a reason and the problem might be an unintended consequence.
  • Get to know the people. The system is made of people, and usually those aren’t the same people who made the system.
  • Write stuff down. Sometimes what you think you heard wasn’t the same thing other people heard.
  • Wait a week and ask again. Or a month. Or just listen for the moment when someone else raises the same problem, and chime in.

“Curiosity is the most under utilized tool of leaders” — Amy Edmondson
“Don’t fight stupid, make more awesome” — Jesse Robbins
“Make new mistakes.” — Esther Dyson, 2008 post

by sarah at January 14, 2018 07:35 PM


Remembering When APT Became Public

Last week I Tweeted the following on the 8th anniversary of Google's blog post about its compromise by Chinese threat actors:

This intrusion made the term APT mainstream. I was the first to associate it with Aurora, in this post

My first APT post was a careful reference in 2007, when we all feared being accused of "leaking classified" re China:

I should have added the term "publicly" to my original Tweet. There were consultants with years of APT experience involved in the Google incident response, and they recognized the work of APT17 at that company and others. Those consultants honored their NDAs and have stayed quiet.

I wrote my original Tweet as a reminder that "APT" was not a popular, recognized term until the Google announcement on 12 January 2010. In my Google v China blog post I wrote:

Welcome to the party, Google. You can use the term "advanced persistent threat" (APT) if you want to give this adversary its proper name.

I also Tweeted a similar statement on the same day:

This is horrifying: Google admits intellectual property theft from China; it's called Advanced Persistent Threat, GOOG

I made the explicit link of China and APT because no one had done that publicly.

This slide from a 2011 briefing I did in Hawaii captures a few historical points:

The Google incident was a watershed, for reasons I blogged on 16 January 2010. I remember the SANS DFIR 2008 event as effectively "APTCon," but beyond Mandiant, Northrop Grumman, and NetWitness, no one was really talking publicly about the APT until after Google.

As I noted in the July 2009 blog post, You Down With APT? (ugh):

Aside from Northrop Grumman, Mandiant, and a few vendors (like NetWitness, one of the full capture vendors out there) mentioning APT, there's not much else available. A Google search for "advanced persistent threat" -netwitness -mandiant -Northrop yields 34 results (prior to this blog post). (emphasis added)

Today that search yields 244,000 results.

I would argue we're "past APT." APT was the buzzword for RSA and other vendor-centric events from, say, 2011-2015, with 2013 being the peak following Mandiant's APT1 report.

The threat hasn't disappeared, but it has changed. I wrote my Tweet to mark a milestone and to note that I played a small part in it.

All my APT posts here are reachable by this APT tag. Also see my 2010 article for Information Security Magazine titled What APT Is, and What It Isn't.

by Richard Bejtlich at January 14, 2018 07:08 PM

January 13, 2018

I’m taking a break from cron.weekly


A little over 2 years ago I started a weekly newsletter for Linux & open source users, called cron.weekly. Today, I'm sending the last issue in what is probably going to be a pretty long time. I need a break.

Here's why.

tl;dr: I've got a wife, 2 kids, a (more than) full time job, 2 other side projects and a Netflix subscription. For now, cron.weekly doesn't fit in that list anymore.

The good :-)

I started cron.weekly out of a need. A need to read more technical content that I couldn't seem to find in a convenient form. So I started reading news & blogs more intensely and bookmarking whatever I found fascinating. Every week, that turned into a newsletter.

It was good timing for me, too. A few years ago my role at Nucleus, my employer, shifted from a purely technical one to a management role. It meant I was losing touch with open source, projects, new releases, ... as they were no longer a core part of my role.

Writing cron.weekly forced me, on a weekly basis, to keep up with all the news, to read about new releases, to find new projects. It forced me to stay up-to-date, even if my job didn't directly require or allow it.

The bad :-|

What started as a hobby project quickly grew. At first, a handful of subscribers. After 2 years, a whopping 8,000 monthly newsletter readers, and a couple thousand more that read it via the web or the Reddit posts. I'm proud of that reach!

But my initial mistake became worse by the week: I called it a weekly newsletter that I send every Sunday.

That was my tagline: "cron.weekly is a weekly newsletter, delivered to you every Sunday, with news & tools tailored to Linux sysadmins.".

Weekly implies a never-ending-commitment and Sunday implies a weekly deadline. In the weekend.

In short, in the last 2 years I've spent at least one evening per weekend -- without a break -- writing a cron.weekly issue. At first because I loved it, but towards the end more because I had to. I had to, because I also found a way to monetize my newsletter: sponsors.

I won't lie, running cron.weekly has been my most profitable side business to date. A factor of 10x more than all the others. But it's no passive income: it requires a newsletter issue every week, on the clock. And it's a lot of writing & thinking, not a 10-minute write-up every week.

Having sponsors meant I had money coming in, justifying my time. But having sponsors also meant I had schedules, deals, commitments, ... that need to be upheld. Some sponsors want to time their new software launch with a big campaign (of which cron.weekly would be one aspect), so I can't just shift them around on a weekly basis. Sponsors -- rightfully so -- want to know when they get featured.

Adding sponsors turned it from a hobby to a job. At least, that's how it feels. It's no longer a spontaneous non-committal newsletter, it's now a business.

The ugly :-(

In all honesty, I'm burned out from writing cron.weekly. Not from Linux or open source in general, nor my day job, but I'm tired of writing cron.weekly. I'm just tired, in general. I had to force myself to write it. Toward the end, I dreaded it.

If I couldn't get it done on Friday evening, I would spend the rest of the weekend worrying that I couldn't get it done in time. It would keep haunting me in the back of my head "you need to write cron.weekly".

I did this to myself, it's my own bloody fault. It should have been cron.random or cron.monthly. A weekly newsletter is intense & requires a lot of commitment, something I can't give at the moment.

So here we are ...

Among my other side gigs/hobbies are DNS Spy, Oh Dear, a newly found love for cryptocurrencies, ... and cron.weekly just doesn't fit at the moment.

As a result, I'm going to stop cron.weekly. For now. I don't want to say I completely quit, because I might pick it back up again.

But for now, I need a mental break from the weekly deadlines and to be able to enjoy my weekends, once again. Life's busy enough already.

If cron.weekly returns, it will give me the ability to rethink the newsletter, the timings & my commitments. Taking a break will allow me to re-launch it in a way that would fit in my life, in my family and in my hobbies.

I hope you enjoyed cron.weekly in the last 2 years. Who knows, you might receive a new surprise issue in a couple of months if I start again!

PS: I will never sell cron.weekly, nor the email userlist behind it. I appreciate the faith you had in me by giving me your e-mail address; that information remains closed and guarded. You won't be spammed.


by Mattias Geniar at January 13, 2018 07:00 PM

January 11, 2018

Sean's IT Blog

Getting Started with VMware UEM

One of the most important aspects of any end-user computing environment is user experience, and a big part of user experience is managing the user’s Windows and application preferences.  This is especially true in non-persistent environments and published application environments where the user may not log into the same machine each time.

So why is this important?  A big part of a user’s experience on any desktop is maintaining their customizations.  Users invest time into personalizing their environment by setting a desktop background, creating an Outlook signature, or configuring applications to connect to the correct datasets, and the ability to retain these settings makes users more productive because they don’t have to recreate them every time they log in or open the application.

User settings portability is nothing new.  Microsoft Roaming Profiles have been around for a long time.  But Roaming Profiles also have limitations, such as casting a large net by moving the entire profile (or the App Data roaming folder on newer versions of Windows) or being tied to specific versions of Windows.

VMware User Environment Manager, or UEM for short, is one of a few 3rd-party user environment management tools that can provide a lighter-weight solution than Roaming Profiles.  UEM can manage both the user’s personalization of the environment by capturing Windows and application settings as well as apply settings to the desktop or RDSH session based on the user’s context.  This can include things like setting up network drives and printers, Horizon Smart Policies to control various Horizon features, and acting as a Group Policy replacement for per-user settings.

UEM Components

There are four main components for VMware UEM.  The components are:

  • UEM Management Console – The central console for managing the UEM configuration
  • UEM Agent – The local agent installed on the virtual desktop, RDSH server, or physical machine
  • Configuration File Share – Network File Share where UEM configuration data is stored
  • User Data File Share – Network File Share where user data is stored.  Depending on the environment and the options used, this can be multiple file shares.

The UEM Console is the central management tool for UEM.  The console does not require a database; anything that is configured in the console is saved as a text file on the configuration file share.  The agent consumes these configuration files from the configuration share during logon and logoff.  It saves the application or Windows settings when an application is closed or when the user logs off, storing them on the user data share as a ZIP file.
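To illustrate the flush-and-import cycle the agent performs (this is not UEM's actual on-disk format; the paths and function names here are hypothetical), the core idea is "zip the captured settings at logoff, unzip them back at logon":

```python
import zipfile
from pathlib import Path

def archive_settings(settings_dir: str, archive_path: str) -> None:
    """Logoff step: zip everything under settings_dir into one archive
    that can be dropped on the user data share."""
    root = Path(settings_dir)
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for f in root.rglob("*"):
            if f.is_file():
                # store paths relative to the settings root
                zf.write(f, f.relative_to(root))

def restore_settings(archive_path: str, settings_dir: str) -> None:
    """Logon step: unpack the archive back into the profile."""
    with zipfile.ZipFile(archive_path) as zf:
        zf.extractall(settings_dir)
```

The real agent is far more selective (it captures only the registry keys and folders named in the application's UEM profile), but the shape of the round trip is the same.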

The UEM Agent also includes a few other optional tools.  These are a Self-Service Tool, which allows users to restore application configurations from a backup, and an Application Migration Tool.  The Application Migration Tool allows UEM to convert settings from one version of an application to another when the vendor uses different registry keys and AppData folders for different versions.  Microsoft Office is the primary use case for this feature, although other applications may require it as well.

UEM also includes a couple of additional tools to assist administrators with maintaining the environment.  The first of these is the Application Profiler Tool.  This tool runs on a desktop or an RDSH server in lieu of the UEM Agent.  Administrators can use it to create UEM profiles for applications, which it does by running the application and tracking where the application writes its settings.  It can also be used to create default settings that are applied to an application when a user launches it, which can reduce the amount of time it takes to get users’ applications configured for the first time.

The other is the Helpdesk support tool, which allows helpdesk agents or other IT support staff to restore a backup of a user’s settings archive.

Planning for a UEM Deployment

There are a couple of questions you need to ask when deploying UEM.

  1. How many configuration shares will I have, and where will they be placed? – In multisite environments, I may need multiple configuration shares so the configs are placed near the desktop environments.
  2. How many user data shares will I need, and where will they be placed?  – This is another factor in multi-site environments.  It is also a factor in how I design my overall user data file structure if I’m using other features like folder redirection.  Do I want to keep all my user data together to make it easier to manage and back up, or do I want to place it on multiple file shares?
  3. Will I be using file replication technology? What replication technology will be used? – A third consideration for multi-site environments.  How am I replicating my data between sites?
  4. What URL/Name will be used to access the shares? – Will some sort of global namespace, like a DFS Namespace, be used to provide a single name for accessing the shares?  Or will each server be accessed individually?  This can have some implications around configuring Group Policy and how users are referred to the nearest file server.
  5. Where will I run the management console?  Who will have access to it?
  6. Will I configure UEM to create backup copies of user settings?  How many backup copies will be created?

These are the main questions that come up from an infrastructure and architecture perspective, and they influence how the UEM file shares and Group Policy objects will be configured.

Since UEM does not require a database, and it does not actively use files on a network share, planning for multisite deployments is relatively straightforward.

In the next post, I’ll talk about deploying the UEM supporting infrastructure.

by seanpmassey at January 11, 2018 01:55 PM

January 10, 2018


OpenSSL Wins the Levchin Prize

Today I have had great pleasure in attending the Real World Crypto 2018 conference in Zürich in order to receive the Levchin prize on behalf of the OpenSSL team.

The Levchin prize for Real World Cryptography recognises up to two groups or individuals each year who have made significant advances in the practice of cryptography and its use in real-world systems. This year one of the two recipients is the OpenSSL team. The other recipient is Hugo Krawczyk.

The team were selected by the selection committee “for dramatic improvements to the code quality of OpenSSL”. You can read the press release here.

We have worked very hard over the last few years to build an active and engaged community around the project. I am very proud of what that community has collectively achieved. Although this prize names specific individuals in the OpenSSL team, I consider ourselves to just be the custodians of the project. In a very real way this prize is for the whole community. It is fantastic to be recognised in this way.

The job is not done though. There is still much work we need to do. I am confident though that our community will work together to achieve what needs to be done.

January 10, 2018 07:00 PM

Cryptography Engineering

Attack of the Week: Group Messaging in WhatsApp and Signal

If you’ve read this blog before, you know that secure messaging is one of my favorite topics. However, recently I’ve been a bit disappointed. My sadness comes from the fact that lately these systems have been getting too damned good. That is, I was starting to believe that most of the interesting problems had finally been solved.

If nothing else, today’s post helped disabuse me of that notion.

This result comes from a new paper by Rösler, Mainka and Schwenk from Ruhr-Universität Bochum (affectionately known as “RUB”). The RUB paper takes a close look at the problem of group messaging, and finds that while messengers may be doing fine with normal (pairwise) messaging, group messaging is still kind of a hack.

If all you want is the TL;DR, here’s the headline finding: due to flaws in both Signal and WhatsApp (which I single out because I use them), it’s theoretically possible for strangers to add themselves to an encrypted group chat. However, the caveat is that these attacks are extremely difficult to pull off in practice, so nobody needs to panic. But both issues are very avoidable, and tend to undermine the logic of having an end-to-end encryption protocol in the first place. (Wired also has a good article.)

First, some background.

How do end-to-end encryption and group chats work?

In recent years we’ve seen plenty of evidence that centralized messaging servers aren’t a very good place to store confidential information. The good news is: we’re not stuck with them. One of the most promising advances in the area of secure communications has been the recent widespread deployment of end-to-end (e2e) encrypted messaging protocols. 

At a high level, e2e messaging protocols are simple: rather than sending plaintext to a server — where it can be stolen or read — the individual endpoints (typically smartphones) encrypt all of the data using keys that the server doesn’t possess. The server has a much more limited role, moving and storing only meaningless ciphertext. With plenty of caveats, this means a corrupt server shouldn’t be able to eavesdrop on the communications.

In pairwise communications (i.e., Alice communicates with only Bob) this encryption is conducted using a mix of public-key and symmetric key algorithms. One of the most popular mechanisms is the Signal protocol, which is used by Signal and WhatsApp (notable for having 1.3 billion users!) I won’t discuss the details of the Signal protocol here, except to say that it’s complicated, but it works pretty well.

A fly in the ointment is that the standard Signal protocol doesn’t work quite as well for group messaging, primarily because it’s not optimized for broadcasting messages to many users.

To handle that popular case, both WhatsApp and Signal use a small hack. It works like this: each group member generates a single “group key” that this member will use to encrypt all of her messages to everyone else in the group. When a new member joins, everyone who is already in the group needs to send a copy of their group key to the new member (using the normal Signal pairwise encryption protocol). This greatly simplifies the operation of group chats, while ensuring that they’re still end-to-end encrypted.
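The key-distribution hack can be sketched in a few lines (a toy model; the class and function names are mine, not either app's code):

```python
import secrets

class Member:
    def __init__(self, name: str):
        self.name = name
        # Each member's own "group key", used to encrypt all of her
        # messages to the rest of the group.
        self.group_key = secrets.token_bytes(32)
        self.known_keys = {name: self.group_key}

def add_member(group: list, newcomer: Member) -> None:
    """Everyone already in the group sends the newcomer a copy of their
    group key over the pairwise Signal channel, and learns the
    newcomer's key the same way."""
    for m in group:
        newcomer.known_keys[m.name] = m.group_key
        m.known_keys[newcomer.name] = newcomer.group_key
    group.append(newcomer)
```

After `add_member`, every member holds every other member's sender key, so a broadcast only has to be encrypted once per recipient-set rather than once per pairwise session.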

How do members know when to add a new user to their chat?

Here is where things get problematic.

From a UX perspective, the idea is that only one person actually initiates the adding of a new group member. This person is called the “administrator”. This administrator is the only human being who should actually do anything — yet, her one click must cause some automated action on the part of every other group member’s devices. That is, in response to the administrator’s trigger, all devices in the group chat must send their keys to the new group member.

Notification messages in WhatsApp.

(In Signal, every group member is an administrator. In WhatsApp it’s just a subset of the members.)

The trigger is implemented using a special kind of message called (unimaginatively) a “group management message”. When I, as an administrator, add Tom to a group, my phone sends a group management message to all the existing group members. This instructs them to send their keys to Tom — and to notify the members visually so that they know Tom is now part of the group. Obviously this should only happen if I really did add Tom, and not if some outsider (like that sneaky bastard Tom himself!) tries to add Tom.

And this is where things get problematic.

Ok, what’s the problem?

According to the RUB paper, both Signal and WhatsApp fail to properly authenticate group management messages.

The upshot is that, at least in theory, this makes it possible for an unauthorized person — not a group administrator, possibly not even a member of the group — to add someone to your group chat.

The issues here are slightly different between Signal and WhatsApp. To paraphrase Tolstoy, every working implementation is alike, but every broken one is broken in its own way. And WhatsApp’s implementation is somewhat worse than Signal’s. Here I’ll break them down.

Signal. Signal takes a pragmatic (and reasonable) approach to group management. In Signal, every group member is considered an administrator — which means that any member can add a new member. Thus if I’m a member of a group, I can add a new member by sending a group management message to every other member. These messages are sent encrypted via the normal (pairwise) Signal protocol.

The group management message contains the “group ID” (a long, unpredictable number), along with the identity of the person I’m adding. Because messages are sent using the Signal (pairwise) protocol, they should be implicitly authenticated as coming from me — because authenticity is a property that the pairwise Signal protocol already offers. So far, this all sounds pretty good.

The problem the RUB researchers discovered through testing is that while the Signal protocol does authenticate that the group management message comes from me, it doesn’t actually check that I am a member of the group — and thus authorized to add the new user!

In short, if this finding is correct, it turns out that any random Signal user in the world can send you a message of the form “Add Mallory to the Group 8374294372934722942947”, and (if you happen to belong to that group) your app will go ahead and try to do it.

The good news is that in Signal the attack is very difficult to execute. The reason is that in order to add someone to your group, I need to know the group ID. Since the group ID is a random 128-bit number (and is never revealed to non-group-members or even the server**) that pretty much blocks the attack. The main exception to this is former group members, who already know the group ID — and can now add themselves back to the group with impunity.

(And for the record, while the group ID may block the attack, it really seems like a lucky break — like falling out of a building and landing on a street awning. There’s no reason the app should process group management messages from random strangers.)
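The flaw and its obvious countermeasure can be caricatured in a few lines (a toy model, not Signal's actual code; the group table and handler names are invented):

```python
# Toy state: group ID -> membership set. In real Signal the group ID
# is a random 128-bit value known only to (current and former) members.
GROUPS = {"8374294372934722942947": {"members": {"alice", "bob"}}}

def handle_gm_broken(group_id, sender, new_member):
    """What the paper describes: the pairwise channel proves WHO sent
    the message, but nobody checks that the sender is in the group."""
    group = GROUPS.get(group_id)
    if group is None:
        return False
    group["members"].add(new_member)  # anyone who knows group_id wins
    return True

def handle_gm_fixed(group_id, sender, new_member):
    """The countermeasure: only act on management messages from a
    current member of the group."""
    group = GROUPS.get(group_id)
    if group is None or sender not in group["members"]:
        return False
    group["members"].add(new_member)
    return True
```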

So that’s the good news. The bad news is that WhatsApp is a bit worse.

WhatsApp. WhatsApp uses a slightly different approach for its group chat. Unlike Signal, the WhatsApp server plays a significant role in group management, which means that it determines who is an administrator and thus authorized to send group management messages.

Additionally, group management messages are not end-to-end encrypted or signed. They’re sent to and from the WhatsApp server using transport encryption, but not the actual Signal protocol.

When an administrator wishes to add a member to a group, it sends a message to the server identifying the group and the member to add. The server then checks that the user is authorized to administer that group, and (if so), it sends a message to every member of the group indicating that they should add that user.

The flaw here is obvious: since the group management messages are not signed by the administrator, a malicious WhatsApp server can add any user it wants into the group. This means the privacy of your end-to-end encrypted group chat is only guaranteed if you actually trust the WhatsApp server.

This undermines the entire purpose of end-to-end encryption.

But this is silly. Don’t we trust the WhatsApp server? And what about visual notifications?

One perfectly reasonable response is that exploiting this vulnerability requires a compromise of the WhatsApp server (or legal compulsion, perhaps). This seems fairly unlikely.

And yet, the entire point of end-to-end encryption is to remove the server from the trusted computing base. We haven’t entirely achieved this yet, thanks to things like key servers. But we are making progress. This bug is a step back, and it’s one a sophisticated attacker potentially could exploit.

A second obvious objection to these issues is that adding a new group member results in a visual notification to each group member. However, it’s not entirely clear that these messages are very effective. In general they’re relatively easy to miss. So these are meaningful bugs, and things that should be fixed.

How do you fix this?

The great thing about these bugs is that they’re both eminently fixable.

The RUB paper points out some obvious countermeasures. In Signal, just make sure that the group management messages come from a legitimate member of the group. In WhatsApp, make sure that the group management messages are signed by an administrator.*
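For the WhatsApp side, "signed by an administrator" might look roughly like the following sketch, where an HMAC under a key the server never holds stands in for a real asymmetric signature (the actual fix would use the administrator's identity key inside the end-to-end channel; this is purely illustrative):

```python
import hashlib
import hmac

def sign_gm(admin_key: bytes, group_id: str, new_member: str) -> bytes:
    """The admin attaches a tag over the management message contents;
    the server only relays the message and can't forge the tag."""
    msg = f"{group_id}:add:{new_member}".encode()
    return hmac.new(admin_key, msg, hashlib.sha256).digest()

def verify_gm(admin_key: bytes, group_id: str,
              new_member: str, tag: bytes) -> bool:
    """Each member checks the tag before acting on the add request."""
    expected = sign_gm(admin_key, group_id, new_member)
    return hmac.compare_digest(expected, tag)
```

With a check like this in every client, a compromised server that tries to inject "add Mallory" fails verification and the message is simply dropped.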

Obviously fixes like this are a bit complex to roll out, but none of these should be killers.

Is there anything else in the paper?

Oh yes, there’s quite a bit more. But none of it is quite as dramatic. For one thing, it’s possible for attackers to block message acknowledgements in group chats, which means that different group members could potentially see very different versions of the chat. There are also several cases where forward secrecy can be interrupted. There’s also some nice analysis of Threema, if you’re interested.

I need a lesson. What’s the moral of this story?

The biggest lesson is that protocol specifications are never enough. Both WhatsApp and Signal (to an extent) have detailed protocol specifications that talk quite a bit about the cryptography used in their systems. And yet the issues reported in the RUB paper are not obvious from reading these summaries. I certainly didn’t know about them.

In practice, these problems were only found through testing.


So the main lesson here is: test, test, test. This is a strong argument in favor of open-source applications and frameworks that can interact with walled-garden services like Signal and WhatsApp. It lets us see what the systems are getting right and getting wrong.

The second lesson — and a very old one — is that cryptography is only half the battle. There’s no point in building the most secure encryption protocol in the world if someone can simply instruct your client to send your keys to Mallory. The greatest lesson of all time is that real cryptosystems are always broken this way — and almost never through the fancy cryptographic attacks we love to write about.


* The challenge here is that since WhatsApp itself determines who the administrators are, this isn’t quite so simple. But at the very least you can ensure that someone in the group was responsible for the addition.

** According to the paper, the Signal group IDs are always sent encrypted between group members and are never revealed to the Signal server. Indeed, group chat messages look exactly like pairwise chats, as far as the server is concerned. This means only current or former group members should know the group ID.

by Matthew Green at January 10, 2018 02:01 PM

January 09, 2018


The Ultimate Apollo Guidance Computer Talk [video]

This is the video recording of “The Ultimate Apollo Guidance Computer Talk” at 34C3. If you think it is too fast, try watching it at 0.75x speed.

I will post the slides in Apple Keynote format later.

If you enjoyed this, you might also like my talks

by Michael Steil at January 09, 2018 04:05 PM


Replicating NATS Streams between clusters

I’ve mentioned NATS before – the fast and lightweight message broker from – but I haven’t yet covered the sister product NATS Streaming, so first some intro.

NATS Streaming is in the same space as Kafka: it’s a stream processing system, and like NATS it’s super lightweight, delivered as a single binary, and you do not need anything like ZooKeeper. It uses normal NATS for communication and on top of that builds streaming semantics. Like NATS – and because it uses NATS – it is not well suited to running over long cluster links, so you end up with LAN-local clusters only.

This presents a challenge since very often you wish to move data out of your LAN. I wrote a Replicator tool for NATS Streaming which I’ll introduce here.


First I guess it’s worth covering what Streaming is. I should preface this by saying that I am quite new to using stream processing tools, so I am not about to give you some kind of official answer, just what it means to me.

In a traditional queue like ActiveMQ or RabbitMQ, which I covered in my Common Messaging Patterns posts, you do have message storage, persistence etc., but those who consume a specific queue are effectively a single group of consumers, and messages either go to all of them or are load-shared, all at the same pace. You can’t really go back and forth over the message store independently as a client. A message gets ack’d once, and once it’s been ack’d it’s done being processed.

In a Stream your clients each have their own view over the Stream, they all have their unique progress and point in the Stream they are consuming and they can move backward and forward – and indeed join a cluster of readers if they so wish and then have load balancing with the other group members. A single message can be ack’d many times but once ack’d a specific consumer will not get it again.

This is to me the main difference between a Stream processing system and just a middleware. It’s a huge deal. Without it you will find it hard to build very different business tools centred around the same stream of data since in effect every message can be processed and ack’d many many times vs just once.
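That difference can be sketched with a toy append-only log where every consumer carries its own offset (illustrative only; this has nothing to do with NATS Streaming's real implementation):

```python
class Stream:
    """Append-only log; each consumer tracks its own position."""

    def __init__(self):
        self.log = []
        self.offsets = {}  # consumer name -> next index to read

    def publish(self, msg) -> None:
        self.log.append(msg)

    def consume(self, consumer: str):
        """Return the next message for this consumer, or None.
        Ack'ing only advances THIS consumer's offset; other consumers
        still see the message."""
        pos = self.offsets.get(consumer, 0)
        if pos >= len(self.log):
            return None
        self.offsets[consumer] = pos + 1
        return self.log[pos]

    def rewind(self, consumer: str, to: int = 0) -> None:
        """Move a consumer backward to replay old messages, something a
        classic queue can't offer since ack'd messages are gone."""
        self.offsets[consumer] = to
```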

Additionally, Streams tend to have well-defined ordering behaviours and message delivery guarantees, and they support clustering etc. much like normal middleware does. There’s a lot of similarity between streams and middleware, so it’s sometimes a bit hard to see why you wouldn’t just use your existing queueing infrastructure.

Replicating a NATS Stream

I am busy building a system that will move Choria registration data from regional data centres to a global store. The new Go based Choria daemon has a concept of a Protocol Adapter which can receive messages on the traditional NATS side of Choria and transform them into Stream messages and publish them.

This gets me my data – from the high frequency, high concurrency updates of the Choria daemons – into a Stream, but the Stream is local to the DC. In the DC I do want to process these messages to build a metadata store there, but I also want to process them for replication upward to my central location(s).

Hence the importance of the properties of Streams that I highlighted above – multiple consumers with multiple views of the Stream.

There are basically 2 options available:

  1. Pick a message from a topic, replicate it, pick the next one, one after the other in a single worker
  2. Have a pool of workers form a queue group and let them share the replication load

At the basic level the first option will retain ordering of the messages – the order in the source queue will be the order in the target queue. NATS Streaming will try to redeliver a message whose delivery timed out, and it won’t move on till that message is handled, thus ordering is safe.

With the second option, since you have multiple workers, you have no way to retain the ordering of the messages: workers will go at different rates and retries can happen in any order. It will be much faster, though.

I can envision a third option where I have multiple workers replicating data into a temporary store, and on the other side I inject the messages into the queue in order, but this seems super prone to failure, so I only support these two methods for now.

Limiting the rate of replication

There is one last concern in this scenario, I might have 10s of data centres all with 10s of thousands of nodes. At the DC level I can handle the rate of messages but at the central location where I might have 10s of DCs x 10s of thousands of machines if I had to replicate ALL the data at near real time speed I would overwhelm the central repository pretty quickly.

Now in the case of machine metadata you probably want the first piece of metadata immediately, but from then on it’ll be a lot of duplicated data with only small deltas over time. You could be clever and only publish deltas, but then you have the problem that should a delta publish go missing you end up with an inconsistent state – this is something that will happen in distributed systems.

So instead I let the replicator inspect your JSON: if your JSON has something like fqdn in it, the replicator can look at that field, track it, and only publish data for any single matching sender once every hour – or whatever you configure.

This has the effect that this kind of highly duplicated data is handled continuously at the edge, but only gets a snapshot replicated upwards once an hour for any given node. This solves the problem neatly for me without any risk of deltas being lost, and it’s also a lot simpler to implement.

Choria Stream Replicator

So finally I present the Choria Stream Replicator. It does all that was described above with a YAML configuration file, something like this:

debug: false                     # default
verbose: false                   # default
logfile: "/path/to/logfile"      # STDOUT default
state_dir: "/path/to/statedir"   # optional
topics:
    cmdb:
        topic: acme.cmdb
        source_url: nats://source1:4222,nats://source2:4222
        source_cluster_id: dc1
        target_url: nats://target1:4222,nats://target2:4222
        target_cluster_id: dc2
        workers: 10              # optional
        queued: true             # optional
        queue_group: cmdb        # optional
        inspect: host            # optional
        age: 1h                  # optional
        monitor: 10000           # optional
        name: cmdb_replicator    # optional

Please review the README document for full configuration details.

I’ve been running this in a test DC with 1k nodes for a week or so and I am really happy with the results, but be aware this is new software so due care should be given. It’s available as RPMs, has a Puppet module, and I’ll upload some binaries on the next release.

by R.I. Pienaar at January 09, 2018 08:04 AM

January 08, 2018


Happy 15th Birthday TaoSecurity Blog

Today, 8 January 2018, is the 15th birthday of TaoSecurity Blog! This is also my 3,020th blog post.

I wrote my first post on 8 January 2003 while working as an incident response consultant for Foundstone.

I don't believe I've released statistics for the blog before, so here are a few. Blogger started providing statistics in May 2010, so these apply to roughly the past 8 years only!

As of today, the blog has had nearly 7.7 million page views since May 2010.

Here are the most popular posts as of today:

Twitter continues to play a role in the way I communicate. When I last reported on a blog birthday two years ago, I said that I had nearly 36,000 Twitter followers for @taosecurity, with roughly 16,000 Tweets. Today I have nearly 49,000 followers with less than 18,000 Tweets. As with most people on social media, blogging has taken a back seat to more instant forms of communication.

These days I am active on Instagram as @taosecurity as well. That account is a departure from my social media practice. On Twitter I have separate accounts for cybersecurity and intelligence (@taosecurity), martial arts (@rejoiningthetao), and other purposes. My Instagram @taosecurity account is a unified account, meaning I talk about whatever I feel like. 

During the last two years I also started another blog to which I regularly contribute -- Rejoining the Tao. I write about my martial arts journey there, usually once a week.

Once in a while I post to LinkedIn, but it's usually news of a blog post like this, or other LinkedIn content of interest.

What's ahead? You may remember I was working on a PhD and I had left FireEye. I decided to abandon my PhD in the fall of 2016. I realized I was not an academic, although I had written four books.

I have also changed all the goals I named in my post-FireEye announcement.

For the last year I have been doing limited security consulting, but that has been increasing in recent months. I continue to be involved in martial arts, but I no longer plan to be a Krav Maga instructor nor to open my own school.

For several months I've been working with a co-author and subject matter expert on a new book with martial arts applicability. I've been responsible for editing and publishing. I'll say more about that at Rejoining the Tao when the time is right.

Thank you to everyone who has been part of this blog's journey since 2003!

by Richard Bejtlich at January 08, 2018 07:49 PM

January 05, 2018

Steve Kemp's Blog

More ESP8266 projects, radio and epaper

I finally got the radio-project I've been talking about for the past while working. To recap:

  • I started with an RDA5807M module, but that was too small, and too badly-performing.
  • I moved on to using an Si4703-based integrated "evaluation" board. That was fine for headphones, but little else.
  • I finally got a TEA5767-based integrated "evaluation" board, which works just fine.
    • Although it is missing RDS (the system that lets you pull the name of the station off the transmission).
    • It also has no (digital) volume-control, so you have to adjust the volume physically, like a savage.

The project works well, despite the limitations, so I have a small set of speakers and the radio wired up. I can control the station via my web-browser and have an alarm to make it turn on/off at different times of day - cheating at that by using the software-MUTE facility.

All in all I can say that when it comes to IoT, the "S stands for Simplicity", given that I had to buy three different boards to get the damn thing working the way I wanted. That said, the total cost is in the region of €5, probably about the same as I'd pay for a "normal" hand-held radio. Oops.

The writeup is here:

The second project I've been working on recently was controlling a piece of ePaper via an ESP8266 device. This started largely by accident as I discovered you can buy a piece of ePaper (400x300 pixels) for €25 which is just cheap enough that it's worth experimenting with.

I had the intention that I'd display the day's calendar upon it, weather forecast, etc. My initial vision was a dashboard-like view with borders, images, and text. I figured rather than messing around with some fancy code-based grid-layout I should instead just generate a single JPG/PNG on a remote host, then program the board to download and display it.

Unfortunately the ESP8266 device I'm using has so little RAM that decoding and displaying a JPG/PNG from a remote URL is hard. Too hard. In the end I had to drop the use of SSL, and simplify the problem to get a working solution.

I wrote a perl script (what else?) to take an arbitrary JPG/PNG image of the correct dimensions and process it row-by-row. It keeps track of the number of contiguous white/black pixels and outputs a series of "draw line" statements.

The ESP8266 downloads this simple data-file, and draws each line one at a time, ultimately displaying the image whilst keeping some memory free.

I documented the hell out of my setup here:

And here is a sample image being displayed:

January 05, 2018 10:00 PM

January 04, 2018

Intel’s CPU vulnerabilities: Meltdown and Spectre

The short version: every Intel CPU since 1995 has a vulnerability that could compromise a server or a hypervisor.

Meltdown and Spectre exploit critical vulnerabilities in modern processors. These hardware bugs allow programs to steal data which is currently processed on the computer. While programs are typically not permitted to read data from other programs, a malicious program can exploit Meltdown and Spectre to get hold of secrets stored in the memory of other running programs.

This might include your passwords stored in a password manager or browser, your personal photos, emails, instant messages and even business-critical documents. Meltdown and Spectre work on personal computers, mobile devices, and in the cloud.

Depending on the cloud provider's infrastructure, it might be possible to steal data from other customers.

Source: Meltdown and Spectre

by Mattias Geniar at January 04, 2018 09:58 AM

Errata Security

Some notes on Meltdown/Spectre

I thought I'd write up some notes.

You don't have to worry if you patch. If you download the latest update from Microsoft, Apple, or Linux, then the problem is fixed for you and you don't have to worry. If you aren't up to date, then there's a lot of other nasties out there you should probably also be worrying about. I mention this because while this bug is big in the news, it's probably not news the average consumer needs to concern themselves with.

This will force a redesign of CPUs and operating systems. While not a big news item for consumers, it's huge in the geek world. We'll need to redesign operating systems and how CPUs are made.

Don't worry about the performance hit. Some, especially avid gamers, are concerned about the claims of "30%" performance reduction when applying the patch. That's only in some rare cases, so you shouldn't worry too much about it. As far as I can tell, 3D games aren't likely to see more than 1% performance degradation. If you imagine your game is suddenly slower after the patch, then something else broke it.

This wasn't foreseeable. A common cliche is that such bugs happen because people don't take security seriously, or that they are taking "shortcuts". That's not the case here. Speculative execution and timing issues with caches are inherent issues with CPU hardware. "Fixing" this would make CPUs run ten times slower. Thus, while we can tweak hardware going forward, the larger change will be in software.

There's no good way to disclose this. The cybersecurity industry has a process for coordinating the release of such bugs, which appears to have broken down. In truth, it didn't. Once Linus announced a security patch that would degrade performance of the Linux kernel, we knew the coming bug was going to be Big. Looking at the Linux patch, tracking backwards to the bug was only a matter of time. Hence, the release of this information was a bit sooner than some wanted. This is to be expected, and is nothing to be upset about.

It helps to have a name. Many are offended by the crassness of naming vulnerabilities and giving them logos. On the other hand, we are going to be talking about these bugs for the next decade. Having a recognizable name, rather than a hard-to-remember number, is useful.

Should I stop buying Intel? Intel has the worst of the bugs here. On the other hand, ARM and AMD alternatives have their own problems. Many want to deploy ARM servers in their data centers, but these are likely to expose bugs you don't see on x86 servers. The software fix, "page table isolation", seems to work, so there might not be anything to worry about. On the other hand, holding up purchases because of "fear" of this bug is a good way to squeeze price reductions out of your vendor. Conversely, later generation CPUs, "Haswell" and even "Skylake" seem to have the least performance degradation, so it might be time to upgrade older servers to newer processors.

Intel misleads. Intel has a press release that implies they are not impacted any worse than others. This is wrong: the "Meltdown" issue appears to apply only to Intel CPUs. I don't like such marketing crap, so I mention it.

Statements from companies:

by Robert Graham at January 04, 2018 07:29 AM

Why Meltdown exists

So I thought I'd answer this question. I'm not a "chipmaker", but I've been optimizing low-level x86 assembly language for a couple of decades.

The tl;dr version is this: the CPUs have no bug. The results are correct, it's just that the timing is different. CPU designers will never fix the general problem of undetermined timing.

CPUs are deterministic in the results they produce. If you add 5+6, you always get 11 -- always. On the other hand, the amount of time they take is non-deterministic. Run a benchmark on your computer. Now run it again. The amount of time it took varies, for a lot of reasons.

That CPUs take an unknown amount of time is an inherent problem in CPU design. Even if you do everything right, "interrupts" from clock timers and network cards will still cause undefined timing problems. Therefore, CPU designers have thrown the concept of "deterministic time" out the window.

The biggest source of non-deterministic behavior is the high-speed memory cache on the chip. When a piece of data is in the cache, the CPU accesses it immediately. When it isn't, the CPU has to stop and wait for slow main memory. Other things happening in the system impact the cache, unexpectedly evicting recently used data for one purpose in favor of data for another.

Hackers love "non-deterministic", because while such things are unknowable in theory, they are often knowable in practice.

That's the case with the granddaddy of all hacker exploits, the "buffer overflow". From the programmer's perspective, the bug just results in the software crashing for undefinable reasons. From the hacker's perspective, they reverse engineer what's going on underneath, then carefully craft buffer contents so the program doesn't crash but instead continues on to run the code the hacker supplies within the buffer. Buffer overflows are undefined in theory, well-defined in practice.

Hackers have already been exploiting these definable/undefinable timing problems with the cache for a long time. An example is cache timing attacks on AES. AES reads a matrix from memory as it encrypts things. By playing with the cache – evicting things, timing things – you can figure out the pattern of memory accesses, and hence the secret key.

Such cache timing attacks have been around since the beginning, really, and it's simply an unsolvable problem. Instead, we have workarounds, such as changing our crypto algorithms to not depend upon cache, or better yet, implement them directly in the CPU (such as the Intel AES specialized instructions).

What's happened today with Meltdown is that incompletely executed instructions, which discard their results, do affect the cache. We can then recover those partial/temporary/discarded results by measuring the cache timing. This has been known for a while, but nobody could figure out how to successfully exploit it, as this paper from Anders Fogh reports. Hackers have now fixed that, making it practically exploitable.

As a CPU designer, Intel has few good options.

Fixing cache timing attacks is an impossibility. They can do some tricks, such as allowing some software to reserve part of the cache for private use, for special crypto operations, but the general problem is unsolvable.

Stopping the "incomplete results" problem from affecting the cache is also difficult. Intel has the fastest CPUs, and the reason is exactly this speculative execution. The other CPU designers have the same problem: fixing the three problems identified today would cause massive performance issues. They'll come up with improvements, probably, but not complete solutions.

Instead, the fix is within the operating system. Frankly, it's a needed change that should've been done a decade ago. They've just been putting it off because of the performance hit. Now that the change has been forced to happen, CPU designers will probably figure out ways to mitigate the performance cost.

Thus, the Intel CPU you buy a year from now will have some partial fixes for exactly these problems, without addressing the larger security concerns. They will also have performance enhancements to make the operating system patches faster.

But the underlying theoretical problem will never be solved, and is essentially unsolvable.

by Robert Graham at January 04, 2018 03:45 AM

January 03, 2018

Errata Security

Let's see if I've got Meltdown right

I thought I'd write down the proof-of-concept to see if I got it right.

So the Meltdown paper lists the following steps:

 ; flush cache
 ; rcx = kernel address
 ; rbx = probe array
retry:
 mov al, byte [rcx]
 shl rax, 0xc
 jz retry
 mov rbx, qword [rbx + rax]
 ; measure which of 256 cachelines were accessed

So the first step is to flush the cache, so that none of the 256 possible cache lines in our "probe array" are in the cache. There are many ways this can be done.

Now pick a byte of secret kernel memory to read. Presumably, we'll just read all of memory, one byte at a time. The address of this byte is in rcx.

Now execute the instruction:
    mov al, byte [rcx]
This line of code will crash (raise an exception). That's because [rcx] points to secret kernel memory which we don't have permission to read. The value of the real al (the low-order byte of rax) will never actually change.

But fear not! Intel is massively out-of-order. That means before the exception happens, it will provisionally and partially execute the following instructions. While Intel has only 16 visible registers, it actually has 100 real registers. It'll stick the result in a pseudo-rax register. Only at the end of the long execution chain, if nothing bad happens, will the pseudo-rax register become the visible rax register.

But in the meantime, we can continue (with speculative execution) to operate on pseudo-rax. Right now it contains a byte, so we need to make it bigger, so that instead of referencing which byte it now references which cache-line. (This instruction multiplies by 4096 instead of just 64, to prevent the prefetcher from loading multiple adjacent cache-lines.)
 shl rax, 0xc

Now we use pseudo-rax to provisionally load the indicated bytes.
 mov rbx, qword [rbx + rax]

Since we already crashed up top on the first instruction, these results will never be committed to rax and rbx. However, the cache has changed: Intel will have provisionally loaded that cache-line into the cache.

At this point, it's simply a matter of stepping through all 256 cache-lines in order to find the one that's fast (already in the cache) where all the others are slow.

by Robert Graham at January 03, 2018 11:10 PM

The Lone Sysadmin

Should We Panic About the KPTI/KAISER Intel CPU Design Flaw?

As a followup to yesterday’s post, I’ve been asked: should we panic about the KPTI/KAISER/F*CKWIT Intel CPU design flaw? My answer was: it depends on a lot of unknowns. There are NDAs around a lot of the fixes so it’s hard to know the scope and effect. We also don’t know how much this will affect […]

The post Should We Panic About the KPTI/KAISER Intel CPU Design Flaw? appeared first on The Lone Sysadmin. Head over to the source to read the full post!

by Bob Plankers at January 03, 2018 10:14 PM

Anton Chuvakin - Security Warrior

Annual Blog Round-Up – 2017

Here is my annual "Security Warrior" blog round-up of top 10 popular posts in 2017. Note that my current Gartner blog is where you go for my recent blogging (example), all of the content below predates 2011!

  1. “New SIEM Whitepaper on Use Cases In-Depth OUT!” (dated 2010) presents a whitepaper on select SIEM use cases described in depth with rules and reports [using a now-defunct SIEM product]; also see this SIEM use case in depth and this for a more current list of popular SIEM use cases. Finally, see our 2016 research on developing security monitoring use cases here!
  2. “Why No Open Source SIEM, EVER?” contains some of my SIEM thinking from 2009. Is it relevant now? You be the judge. Succeeding with SIEM requires a lot of work, whether you paid for the software or not.
  3. “Simple Log Review Checklist Released!” is often at the top of this list – the checklist is still a very useful tool for many people. “On Free Log Management Tools” is a companion to the checklist (updated version).
  4. My classic PCI DSS Log Review series is always hot! The series of 18 posts covers a comprehensive log review approach (OK for PCI DSS 3+ in 2017 as well), useful for building log review processes and procedures, whether regulatory or not. It is also described in more detail in our Log Management book and mentioned in our PCI book (out in its 4th edition!).
  5. “SIEM Resourcing or How Much the Friggin’ Thing Would REALLY Cost Me?” is a quick framework for assessing the costs of a SIEM project (well, a program, really) at an organization (a lot more details on this in this paper).
  6. “SIEM Bloggables” is a very old post, more like a mini-paper on some key aspects of SIEM – use cases, scenarios, etc. – as well as two types of SIEM users. Still very relevant, if not truly modern.
  7. “Top 10 Criteria for a SIEM?” came from one of the last projects I did when running my SIEM consulting firm in 2009-2011 (for my recent work on evaluating SIEM tools, see this document).
  8. Another old checklist, “Log Management Tool Selection Checklist Out!”, holds a top spot – it can be used to compare log management tools during the tool selection process or even a formal RFP process. But let me warn you – this is from 2010.
  9. “Updated With Community Feedback SANS Top 7 Essential Log Reports DRAFT2” is about the top log reports project of 2008-2013.
  10. “A Myth of An Expert Generalist” is a fun rant on what I think it means to be “a security expert” today; it argues that you must specialize within security to really be called an expert.

Total pageviews: 33,231 in 2017.

Disclaimer: all this content was written before I joined Gartner on August 1, 2011 and is solely my personal view at the time of writing.  For my current security blogging, go here.

Also see my past monthly and annual “Top Posts” – 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016.

by Anton Chuvakin at January 03, 2018 07:11 PM

January 02, 2018

Everything Sysadmin

DevOpsDays New York City 2018: Register now!

DevOpsDays NYC is only a few weeks away: Jan 18-19, 2018!

Please register asap. We could sell out this year. With this awesome line-up of speakers, tickets are going fast.

by Tom Limoncelli at January 02, 2018 11:50 PM

The Lone Sysadmin

Intel CPU Design Flaw, Performance Degradation, Security Updates

I was just taking a break and reading some tech news and I saw a wonderfully detailed post from El Reg (link below) about an Intel CPU design flaw and impending crisis-level security updates to fix it. As if that wasn’t bad enough, the fix for the problem is estimated to decrease performance by 5% […]

The post Intel CPU Design Flaw, Performance Degradation, Security Updates appeared first on The Lone Sysadmin. Head over to the source to read the full post!

by Bob Plankers at January 02, 2018 09:26 PM

Anton Chuvakin - Security Warrior

Monthly Blog Round-Up – December 2017

Here is my next monthly "Security Warrior" blog round-up of top 5 popular posts based on last month’s visitor data (excluding other monthly or annual round-ups):
  1. “Why No Open Source SIEM, EVER?” contains some of my SIEM thinking from 2009 (oh, wow, ancient history!). Is it relevant now? You be the judge. Succeeding with SIEM requires a lot of work, whether you paid for the software or not. BTW, this post has an amazing “staying power” that is hard to explain – I suspect it has to do with people wanting “free stuff” and googling for “open source SIEM” …
  2. “New SIEM Whitepaper on Use Cases In-Depth OUT!” (dated 2010) presents a whitepaper on select SIEM use cases described in depth with rules and reports [using a now-defunct SIEM product]; also see this SIEM use case in depth and this for a more current list of popular SIEM use cases. Finally, see our 2016 research on developing security monitoring use cases here – and we are updating it now.
  3. Again, my classic PCI DSS Log Review series is extra popular! The series of 18 posts covers a comprehensive log review approach (OK for PCI DSS 3+ even though it predates it), useful for building log review processes and procedures, whether regulatory or not. It is also described in more detail in our Log Management book and mentioned in our PCI book (now in its 4th edition!) – note that this series is mentioned in some PCI Council materials.
  4. “Simple Log Review Checklist Released!” is often at the top of this list – this rapidly aging checklist is still a very useful tool for many people. “On Free Log Management Tools” (also aged a bit by now) is a companion to the checklist (updated version).
  5. “SIEM Bloggables” is a very old post, more like a mini-paper on some key aspects of SIEM – use cases, scenarios, etc. – as well as two types of SIEM users. Still very relevant, if not truly modern.
In addition, I’d like to draw your attention to a few recent posts from my Gartner blog [which, BTW, now has more than 5X the traffic of this blog]:

A critical reference post (!):
Upcoming research on testing security:

Upcoming research on threat detection “starter kit”
Current research on SOAR:

Miscellaneous fun posts:

(see all my published Gartner research here)
Also see my past monthly and annual “Top Popular Blog Posts” – 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016.

Disclaimer: most content at SecurityWarrior blog was written before I joined Gartner on August 1, 2011 and is solely my personal view at the time of writing. For my current security blogging, go here.

Other posts in this endless series:

by Anton Chuvakin at January 02, 2018 07:06 PM

Michael Biven

Attrition from The Small Things

If you find yourself thinking about what work lies ahead in 2018, consider the following. It doesn’t matter if you’re only thinking of the first month, the first quarter, or the entire year. How many changes can your team handle before it accumulates to the point where they’re no longer capable of performing their normal operations? How about for yourself?

While you’re considering your answer, count as change anything from work on existing or new projects, ongoing work supporting existing products, incidents, pivots in priorities or business models, new regulations, and reorgs, to just plain interruptions.

Every individual and team has a point where their capabilities become ineffective. We have different tools and methodologies that track time spent on tasks. We track complexity (story points) delivered in a given timeframe (sprints). Burn-up charts show tasks added across time, but this is a very narrow view of change. A holistic view of the impact of change on a team is missing: one that can show how change wears down the people we’re responsible for, and ourselves.

Reduction of Capability

Before we started needing data for most decisions, we placed trust in individuals to do their job. When people were being pushed too far too fast, they might push back. This still happens, but the early signs of it are often drowned out by data or a mantra of “stick with the process”. It’s developed into a narrow focus that has eroded trust in experience to drive us towards our goals. This has damaged some of the basic leadership skills needed, and it has focused our industry on efficiency over effectiveness. I’m also starting to think this is creating a tendency for people to second-guess their own abilities due to the inabilities of others.

This reinforces a culture where leaders stop trusting the opinions of the people doing the work or those close to it. When people push back, the leaders have a choice: either listen and take the feedback into account, or double down on the data and methods used. This contributed to creating the environments where the labels “10x”, “Rock Stars” and “Ninjas” started being applied to engineers, designers, and developers.

heroics — “behavior or talk that is bold or dramatic, especially excessively or unexpectedly so: the makeshift team performed heroics.” — New Oxford American Dictionary

Ever think about why we apply the label heroics or hero when teams or people are able to pull through in the end? If the output of work and the frequency of changes were plotted, I’d bet you’d find that the point where sustaining normal operations was impracticable or improbable was passed before these labels were used.

The train in last month’s fatal Amtrak derailment, which killed three people, was traveling at more than twice the speed limit (80 mph in a 30 mph zone). The automated system (positive train control) designed to prevent these types of conditions was installed but not activated. Was this fatal accident on the inaugural run of a new Amtrak route an example of normal operations no longer being possible? Is it any different from the fatal collisions involving US Navy ships last year due to over-burdened personnel and equipment?

For the derailment, it looks like a combination of failing to use available safety systems and failing to follow safety guidelines contributed to the accident. There’s also the question of whether the crew was given training to build awareness of the new route. The Navy collisions look to be the result of the strain of trying to do too much with too few people and resources: individuals working too many hours, a reduction in training, failure to verify readiness, and a backlog of maintenance on the equipment, aircraft, and ships.

The cadence of change was greater than what these organizations were capable of supporting.

For most of us working as engineers, designers, developers, product managers, or in online support, we wouldn’t consider ourselves to be in a high-risk occupation. But the work we do impacts people’s lives in small to massive ways. These examples are something we should be learning from. We should also acknowledge that we’re not good at being aware of the negative impact of the tempo of change on our people.

There’s a phrase and image that illustrates the dependencies between people, processes, and systems. It’s called the “Swiss Cheese Model,” and it shows that when shortcomings in each of them line up, a problem can slip through. It also shows how the strengths of each layer can cover the weaknesses of the others.

Swiss Cheese Model of Accident Causation

Illustration by David Mack CC BY-SA 3.0.

We have runbooks, playbooks, incident management processes, and things to help us understand what is happening in our products and systems. Remember that these things are not absolute and they’re fallible. The systems and processes we put into place are never final, they’re ideas maintained as long as they stay relevant and then removed when they are no longer necessary. This requires awareness and diligence.

In every postmortem I’ve participated in or read through, there were early signs that conditions were unusual. Often people fail to recognize the difference between what is happening and what is expected to happen. That is the point where a difference can start to develop into a problem if we ignore it. If you think you see something that doesn’t seem right, you need to speak up.

After the Apollo 1 fire, Gene Kranz gave a speech to his team at NASA that is known as the Kranz Dictum. He begins by stating they work in a field that cannot tolerate incompetence. He then immediately holds himself and every part of the Apollo program accountable for their failure to prevent the deaths of Gus Grissom, Ed White, and Roger Chaffee.

From this day forward, Flight Control will be known by two words: “Tough” and “Competent.” Tough means we are forever accountable for what we do or what we fail to do. We will never again compromise our responsibilities. Every time we walk into Mission Control we will know what we stand for. Competent means we will never take anything for granted. We will never be found short in our knowledge and in our skills. — Gene Kranz

I take this as doing the work to protect the people involved. For us, that should include ourselves, the people in our organizations, and our customers. Protection is gained when we are thorough and accountable, when sufficient training and resources are given, when communication is concise and assertive, and when we have an awareness of what is happening.

When I compare the derailment and collisions, what Kranz was speaking to, any emergency I responded to as a fire fighter, or any incident I worked as an engineer, there are similarities. They’re the results of the attrition of little things that continued unabated.

Andon Cord for People

Alerting, availability, continuous integration/deployment, error rates, logging, metrics, monitoring, MTBF, MTTF, MVP, observability, reliability, resiliency, SLA, SLI, SLO, telemetry, throughput and uptime.

We build tools and we have all kinds of words and acronyms to help us frame our thoughts around the planning, building, maintaining and supporting of products. We even allow machines to bug us to fix them, including waking us up in the middle of the night. Why don’t we have the same level of response when people break?

One of the many things that came out of the Toyota Production System is Andon. It gives individuals the ability to stop the production line when a problem is found and call for help.

We talk about rapid feedback loops and iterative workflows, but we don’t talk about feedback from the people close to the work as a way of continuous improvement. We should give people the ability to pull the cord when there is an issue that impacts their ability, or a teammate’s, to perform. And that doesn’t mean only technical issues.

What would happen if your on-call staff had a horrible time and were spent after their first night? Imagine if we gave our people the same level of support that we give our machines. Give them an andon cord to pull (i.e., a page) that gets them the help they need.

As you’re planning, don’t forget about your people. Could you track the frequency of changes hitting your team, then plot the impact of that against the work completed? Think about providing an andon cord for them. How could you build a culture where people feel responsible for speaking up when they see something that doesn’t line up with what’s expected?

“People, ideas and technology. In that order!” — John Boyd.

Too many times we think a problem or its solution is technical. More often than not it’s a breakdown of communication, and sometimes a failure to have the right people or to protect them.

The ideas from Boyd are a good example of how our industry fails to fully understand a concept before using it. If you’ve heard the phrase OODA Loop you’ve probably seen a circular image with points for Observe, Orient, Decide and Act. The thing is he never drew just a single loop. He gave a way to frame an environment and a process to help guide us through the unknowns. And it puts the people first by using their experience so when they recognize something for what it is they can act on it immediately. It was always more than a loop. It was a focus on the people and organizations.

January 02, 2018 11:20 AM


My Favourite Books in 2017

Following the very ambitious and successful 2016 challenge, I decided to keep the goal at the same level of 36 books for 2017 to prove to myself that it is sustainable and wasn’t a one-off success. Surprising myself, I crushed the goal and finished 39 books this year. Below is a summary of the best of those books.

Business, Management and Leadership

After changing my job at the beginning of 2017 and returning to Swiftype to focus on Technical Operations team leadership, I continued working on improving my skills in this area and read a number of truly awesome books:

  • “The Effective Executive: The Definitive Guide to Getting the Right Things Done” by Peter F. Drucker — this classic has immediately become one of my favourite leadership books of all time. There are many useful lessons I learned from it (like the notion that all knowledge workers should consider themselves executives in some sense), but the most powerful was the part on executive time management.
  • “Hatching Twitter: A True Story of Money, Power, Friendship, and Betrayal” by Nick Bilton — a truly horrifying “Game of Thrones”-like story behind the early years of Twitter. I didn’t think shit like that actually happened in real life… I guess the book made me grow up a little and realize that simply doing your best to push your company forward is not always enough. I’d highly recommend this book to anybody working in a fast growing company or thinking about starting a VC-backed business.
  • “Shoe Dog: A Memoir by the Creator of NIKE” by Phil Knight — a great story of a great company built by regular people striving for quality results. It heavily reinforces the notion that to be an entrepreneur you need to be a bit crazy and slightly masochistic. Overall, a very fascinating tale of the multi-decade development of a company — a strong contrast with all the modern stories about internet businesses. A must read for people thinking about starting a business.

Health, Medicine and Mortality

I have always been fascinated by the history of medicine, medical stories and the inner workings of the modern medical system. Unfortunately, this year I’ve had to interact with it a lot and that made me seriously consider the fact of our mortality. This has led me upon a quest to learn more about the topics of medicine, mortality and philosophy.

  • “When Breath Becomes Air” by Paul Kalanithi — a fantastic memoir! A terrifying, depressing, beautifully described story of a young neurosurgeon, his cancer diagnosis, and his battle with the horrible disease, up to the very end of his life. I found Paul’s story very relatable, and just like Atul Gawande’s book that I read last year, it brought forth very important questions about how we should deal with our own mortality. Paul gave us a great example of one way we may choose to spend our last days — the same way we may want to spend our lives: “You can’t reach perfection, but you can believe in an asymptote toward which you are ceaselessly striving”.
  • “The Emperor of All Maladies” by Siddhartha Mukherjee — probably the best book on cancer out there (based on my limited research). The author takes us on a long, very interesting and terrifying trip through the dark ages of humanity’s war against cancer, and explains why, after so much time, we are still only starting to understand how to deal with it and why there is still a long road ahead. Highly recommended to anybody interested in the history of medicine, or who wants to understand more about a malady that kills more than 8 million people each year.
  • “Complications: A Surgeon’s Notes on an Imperfect Science” by Atul Gawande — once again, one of my favourite authors manages to explain the hard problem of complications in healthcare and gives us a sobering look at the limits and fallibilities of modern medicine.
  • Bonus: “On The Shortness Of Life” by Seneca — it is amazing how something written 2,000 years ago can have such profound relevance today. I found this short book really inspiring, and it has led me to start adapting some Stoic techniques, including mindfulness and meditation.


Few more books I found very interesting:

  • “Born a Crime: Stories From a South African Childhood” by Trevor Noah — listened to this book on Audible and absolutely loved it! Hearing Noah’s own voice describe his crazy childhood in South Africa, mixing funny and absolutely horrifying details of his life there and the struggles he had to endure as a coloured kid under and right after Apartheid, made it all the more powerful.
    Even though it was never as scary as what Noah describes in his book, I found in his stories a lot of things I could relate to from my childhood in the late USSR and then in 1990s Ukraine, which was going through an economic meltdown with all of the usual attributes, like crime and crazy unemployment.
  • I Can’t Make This Up: Life Lessons” by Kevin Hart — I have never been a particular fan of Kevin Hart. Not that I disliked him, just didn’t really follow his career. This book (I absolutely recommend the audiobook version!) ended up being one of the biggest literary surprises ever for me: it is the funniest inspirational read and the most inspiring comic memoir I’ve ever read (or, in this case, listened to). Kevin’s dedication to his craft, his work ethic and perseverance are truly inspiring and his success is absolutely well-earned.
  • “Kingpin: How One Hacker Took Over the Billion-Dollar Cybercrime Underground” by Kevin Poulsen — a terrifying read… I never realized how close the early years of my career as a systems administrator and developer took me to the crazy world of underground computer crime that was unfolding around us.
    I’ve spent a few weeks wondering whether doing what Max and other people in this story did is the result of an innate personality trait or just a set of coincidences, a bad hand life deals a computer specialist, turning them into a criminal. For many people working in this industry, it is always about the craft, the challenge of building systems (just as the bind hack was for Max), and I am not sure there is a point in one’s career when you make a conscious decision to become a criminal. Unfortunately, even after finishing the book I don’t have an answer to this question.
    The book is a fascinating primer on the effects of bad security, and the need for good security, in today’s computerized society, and I’d highly recommend it to everybody working with computers on a daily basis.
  • “Modern Romance” by Aziz Ansari — a very interesting insight into the crazy modern world of dating and romance. It made me really appreciate the fact that I have already found the love of my life, and I hope I will never need to participate in the technology-driven culture today’s singles have to deal with. I really recommend listening to the audiobook; Aziz is very funny even when he’s talking about a serious topic like this.
  • “The Year of Living Danishly: My Twelve Months Unearthing the Secrets of the World’s Happiest Country” by Helen Russell — really liked this book. It offers a glimpse into a society surprisingly different from what many modern North Americans would consider normal. Reading about all kinds of Danish customs, I would think back to the time I grew up in the USSR and realize that modern Danish life is very close to what the party promised back then. The only difference: they’ve managed to make it work long-term.
    Even though not many of us could, or would want to, relocate to Denmark or affect our own government’s policies, there is a lot in this book that many of us could apply in our lives: trusting people more, striving for a better work-life balance, exercising more, surrounding ourselves with beautiful things, etc.

I hope you enjoyed this overview of the best books I’ve read in 2017. Let me know if you liked it!

by Oleksiy Kovyrin at January 02, 2018 02:09 AM

January 01, 2018

toolsmith #130 - OSINT with Buscador

First off, Happy New Year! I hope you have a productive and successful 2018. I thought I'd kick off the new year with another exploration of OSINT. In addition to my work as an information security leader and practitioner at Microsoft, I am privileged to serve in Washington's military as a J-2, which means I'm part of the intelligence directorate of a joint staff. Intelligence duties in a guard-unit context are commonly focused on situational awareness for mission readiness. Additionally, in my unit we combine part of J-6 (the command, control, communications, and computer systems directorate of a joint staff) with J-2, making Cyber Network Operations a J-2/6 function. Open source intelligence (OSINT) gathering is quite useful in developing indicators specific to adversaries, as well as in identifying targets of opportunity for red team and vulnerability assessments. We've discussed numerous OSINT offerings in toolsmiths past, and there's no better time than our 130th edition to discuss an OSINT platform inclusive of previous topics such as Recon-ng, Spiderfoot, Maltego, and Datasploit. Buscador is just such a platform, and it comes from genuine OSINT experts Michael Bazzell and David Wescott. Buscador is "a Linux Virtual Machine that is pre-configured for online investigators." Michael is the author of Open Source Intelligence Techniques (5th edition) and Hiding from the Internet (3rd edition). I had a quick conversation with him and learned that they will have a new release (1.2) in January, which will address many issues and add new features; it will also revamp Firefox, given the release of version 57. You can download Buscador as an OVA bundle for a variety of virtualization options, or as an ISO for USB boot devices or host operating systems. I had Buscador 1.1 up and running on Hyper-V in a matter of minutes after pulling the VMDK out of the OVA and converting it with QEMU.
Buscador 1.1 includes numerous tools; in addition to the above-mentioned standard bearers, you can expect the following, among others:
  • Creepy
  • Metagoofil
  • MediaInfo
  • ExifTool
  • EmailHarvester
  • theHarvester
  • Wayback Exporter
  • HTTrack Cloner
  • Web Snapper
  • Knock Pages
  • SubBrute
  • Twitter Exporter
  • Tinfoleak 
  • InstaLooter 
  • BleachBit 
Tools are conveniently offered via the menu bar on the UI's left, or can easily be found via Show Applications.
To put Buscador through its paces, using myself as a target of opportunity, I tested a few of the tools I'd not previously utilized. Starting with Creepy, the geolocation OSINT tool, I configured the Twitter plugin, one of the four available (Flickr, Google+, Instagram, Twitter), and searched for holisticinfosec, as seen in Figure 1.
Figure 1:  Creepy configuration

The results, as seen in Figure 2, include some good details, but no immediate location data.

Figure 2: Creepy results
Had I configured the other plugins, or were I even a user of Flickr or Google+, better results would have been likely. I have location turned off for my Tweets, but my profile does include Seattle. Creepy is quite good for assessing targets who utilize social media heavily, but if you wish to dig more deeply into Twitter usage, check out Tinfoleak, which also uses geo information available in Tweets and uploaded images. The report for holisticinfosec is seen in Figure 3.

Figure 3: Tinfoleak
If you're looking for domain enumeration options, you can start with Knock. It's as easy as handing it a domain; I did so, as seen in Figure 4, with results in Figure 5.
Figure 4: Knock run
Figure 5: Knock results
Other classics include HTTrack for web site cloning, and ExifTool for pulling all available metadata from images. HTTrack worked instantly, as expected. I used Instalooter, "a program that can download any picture or video associated from an Instagram profile, without any API access", to grab sample images, then ran pyExifToolGui against them. As a simple experiment, I ran Instalooter against the infosec.memes Instagram account, followed by pyExifToolGui against all the downloaded images, then exported the Exif metadata to HTML. If I were analyzing images for associated hashtags, the export capability might be useful for an artifacts list.
Finally, one of my absolute favorites is Metagoofil, "an information gathering tool designed for extracting metadata of public documents." I did a quick run against my domain with the doc retrieval parameter set at 50, then reviewed the full.txt results (Figure 6), included in the output directory (home/Metagoofil) along with authors.csv, companies.csv, and modified.csv.

Figure 6: Metagoofil results

Metagoofil is extremely useful for gathering target data; I consider it a red team recon requirement. It's a faster, currently maintained offering that shares some capabilities with Foca. It should also serve as a reminder of just how much information is available in public-facing documents; consider stripping the metadata before publishing.

It's fantastic having all these capabilities ready and functional in one distribution; it keeps the OSINT discipline close at hand for those who practice it regularly. I'm really looking forward to the Buscador 1.2 release, and better still, I have it on good authority that there is another book on the horizon from Michael. This is a simple platform with which to explore OSINT; remember to be a good citizen, though, as there is an awful lot that can be learned via these passive means.
Cheers...until next time.

by Russ McRee at January 01, 2018 11:28 PM

December 31, 2017

Sarah Allen

event-driven architectural patterns

Martin Fowler’s talk “The Many Meanings of Event-Driven Architecture” at GOTO2017 provides a good overview of different patterns that are all described as “event-driven” systems. At the end of the talk, he references an earlier event-driven article, which offers a good prose description of these different patterns that folks are calling event-driven programming. In the talk, he covers specific examples that illustrate the patterns, grounding them in specific applications.

Event Notification

Person -> CRM -> Insurance Quoting -> Communications

For example: address changed

Scenario: CRM system stores information about people. An insurance quoting system generates insurance rates based on demographics and address. When someone’s address changes, we need to calculate a new value for the cost of insurance.

We often don’t want these systems to be coupled; instead we want a reversal of dependencies. This pattern is used in relatively large-scale systems, and it’s also a long-established client-side pattern for separating GUIs from the rest of your code.

The change becomes a first class notion. We bundle the notification + data about the change.
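This reversed dependency can be sketched as a minimal in-process event bus. The names here (EventBus, AddressChanged, the person id) are illustrative, not from the talk; the point is that the CRM side publishes a first-class change event and never references its consumers:

```python
class EventBus:
    """Minimal publish/subscribe hub: publishers never see subscribers."""

    def __init__(self):
        self._subscribers = []

    def subscribe(self, handler):
        self._subscribers.append(handler)

    def publish(self, event):
        for handler in self._subscribers:
            handler(event)


class AddressChanged:
    """The notification bundled with data about the change."""

    def __init__(self, person_id, new_address):
        self.person_id = person_id
        self.new_address = new_address


bus = EventBus()
quotes_to_recalculate = []

# The quoting system subscribes; the CRM has no dependency on it.
bus.subscribe(lambda e: quotes_to_recalculate.append((e.person_id, e.new_address)))

# CRM side: record the change, then publish the event.
bus.publish(AddressChanged(42, "10 Main St"))
print(quotes_to_recalculate)  # → [(42, '10 Main St')]
```

Adding another downstream system is just another `subscribe` call, which is the "easy to add systems without modifying the original system" property noted above.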

Events OR Commands
* Commands enforce the coupling; a command is very different from an event, because it conveys intent
* Naming makes all the difference

Additional property → easy to add systems without modifying the original system

Martin notes “the dark side of event notification,” where your system quickly becomes hard to reason about because there is no statement of overall behavior.

Event-Carried State Transfer

Named in contrast to REST (Representational State Transfer), the event carries ALL of the data needed about the event, which completely de-couples the target system from the system that originates the event.

Of course, this introduces data replication and eventual consistency (which can be good for some use cases); however, this is a less common pattern since this lack of consistency can actually make the system more complex.
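A sketch of the idea, with hypothetical names: the event carries the complete record, so the consumer maintains its own (eventually consistent) copy and never calls back to the originating system:

```python
class CustomerUpdated:
    """Event carrying the FULL customer state, not just a change notice."""

    def __init__(self, customer_id, full_state):
        self.customer_id = customer_id
        self.full_state = full_state  # complete snapshot, not a diff


# The quoting system's local replica, updated only from events.
local_customers = {}


def on_customer_updated(event):
    # Replace our copy wholesale; no callback to the source system needed.
    local_customers[event.customer_id] = event.full_state


on_customer_updated(CustomerUpdated(
    7, {"name": "Ada", "address": "Seattle", "age": 36}))

# Quotes can now be computed entirely from local data.
print(local_customers[7]["address"])  # → Seattle
```

The replication is visible here: `local_customers` is a second copy of the data, which is exactly the consistency cost the paragraph above describes.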

Event Sourcing

This is one of my favorite patterns which Martin explains nicely in the talk with two examples:

  • Version control is an event source system for code.
  • Accounting ledgers track every credit or debit, which are the source records (events), and the balance is calculated from those records.


Benefits:

  • auditing: natural part of the system
  • debugging: easy to replay a subset of events locally
  • historic state: time travel!
  • alternative state: branching, correcting errors
  • memory image: application state can be volatile (persistence is achieved with the event log, so processing can happen quickly in memory, regenerating state from recent snapshots plus the events since)


Drawbacks:

  • unfamiliar
  • external systems: everything needs to be an event
  • event schema: what happens when your data types change?
  • identifiers: how to create identifiers to reliably replay events
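The accounting-ledger example above can be sketched in a few lines; the dict-based event shape and function names are illustrative, not from the talk. Every credit or debit is an immutable event, and the balance is derived by replaying them:

```python
# Append-only event log: the source records, never mutated.
events = []


def record(kind, amount):
    """Store a credit or debit as an event rather than updating a balance."""
    events.append({"kind": kind, "amount": amount})


def balance(upto=None):
    """Replay events (optionally only the first `upto`) to derive state.

    Passing `upto` gives "historic state": the balance as of any point
    in the log, the time-travel benefit listed above.
    """
    total = 0
    for e in (events if upto is None else events[:upto]):
        total += e["amount"] if e["kind"] == "credit" else -e["amount"]
    return total


record("credit", 100)
record("debit", 30)
record("credit", 5)

print(balance())        # → 75
print(balance(upto=2))  # balance after the first two events → 70
```

Auditing falls out for free (the log is the audit trail), and replaying a slice of `events` locally is exactly the debugging benefit noted above.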

Common (Related) Challenges

  • asynchronous processing can be hard to reason about. This isn’t required for an event sourcing system, yet it is easy to add and needed in most server-side systems. Useful to remember that this is distinct from the core event sourcing pattern.
  • versioning is another option that is quite useful, yet it also adds complexity. Greg Young’s advice: don’t have any business logic between the event and the internal representation of a record.

Martin talks about the difference between the input event (the intention) and the output event (the effect). In deciding what to store, think about how you would fix a bug. The key thing is to be clear about what you are storing; most of the time you probably want to store both.


CQRS

Coined by Greg Young, Command Query Responsibility Segregation (CQRS) is where your write model is different from your read model: two software components, one for updating the current model (the command component) and one for reading the state (the query component).

Martin suggests we be wary of this approach: a good pattern when used appropriately (which you could say of any pattern). But isn’t Event Sourcing just a special case of it? Maybe the special case is what provides a structure that makes it easier to reason about.
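A toy sketch of the split, with illustrative names (a stock-keeping domain, not from the talk): commands go through one component that validates and records changes, while queries read from a separately maintained model and never touch the write path:

```python
write_store = []   # command side: append-only record of accepted changes
read_model = {}    # query side: denormalized for fast reads


def handle_command(product, qty):
    """Command component: validate, record, then project into the read model."""
    if qty <= 0:
        raise ValueError("quantity must be positive")
    write_store.append((product, qty))
    read_model[product] = read_model.get(product, 0) + qty


def query_stock(product):
    """Query component: reads only the read model, never the write store."""
    return read_model.get(product, 0)


handle_command("widget", 3)
handle_command("widget", 2)
print(query_stock("widget"))  # → 5
```

With an event log as the write store, the read model becomes a projection of events, which is why Event Sourcing looks like a special case of this pattern.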

by sarah at December 31, 2017 12:50 AM

December 29, 2017

The Lone Sysadmin

Apple Deserves What It Gets From This Battery Fiasco

Yesterday Apple issued an apology for the intentional slowing of iPhones because of aging in the iPhone battery. As part of that they announced a number of changes, like a $29 battery replacement and actually giving people information and choices about how their device functions. This says a few things to me. First, it says […]

The post Apple Deserves What It Gets From This Battery Fiasco appeared first on The Lone Sysadmin. Head over to the source to read the full post!

by Bob Plankers at December 29, 2017 06:22 PM


In defense of job titles

I've noticed that the startup-flavored tech industry has certain preferences when it comes to your job-title. They like them flat. A job tree can look like this:

  1. Intern (write software as a student)
  2. Software Engineer (write software as a full time salaried employee)
  3. Lead Software Engineer (does manager things in addition to software things)
  4. Manager (mostly does manager things; if they used to be a Software Engineer, maybe some of that if there is time)

Short and to the point. The argument in favor of this is pretty well put by:

A flat hierarchy keeps us from having to rank everyone against some arbitrary rules. What, really, is the quantifiable difference between a 'junior' and a 'senior' engineer? We are all engineers. If you do manager things, you're a lead. When you put Eclipse/Vim/VisualStudio behind you, then you're a manager.

No need to judge some engineers as better than other engineers. Easy. Simple. Understandable.

Over in the part of the tech industry that isn't dominated by startups but by, say, US Federal contracting rules, you have a very different hierarchy.

  1. Associate Systems Engineer
  2. Junior Systems Engineer
  3. Systems Engineer
  4. Senior Systems Engineer
  5. Lead Systems Engineer (may do some managery things, may not)
  6. Principal Systems Engineer (the top title for technical stuff)

Because civil service is like that, each of those has a defined job title, with responsibilities and skill requirements. Such job-reqs read something like:

Diagnoses and troubleshoots problems involving multiple interconnected systems. Proposes complete systems and integrates them. Works highly independently, and is effective in coordinating work with other separate systems teams. May assume a team-lead role.

Or for a more junior role:

Diagnoses and troubleshoots problems for a single system in an interconnected ecosystem. Proposes changes to specific systems and integrates them. Follows direction when implementing new systems. Works somewhat independently, guided by senior engineers.

Given the different incentives (winning US government contracts versus not having to judge engineers as better or worse than each other), having multiple classes of 'systems engineer' makes sense for the non-startup case.

I'm arguing that the startup-stance (flat) is more unfair. Yes, you don't have to judge people as 'better-than'.

On the job-title, at least.

Salaries are another story. Those work very much like Enterprise Pricing Agreements, where no two agreements look the same. List-price is only the opening bid of a protracted negotiation, after all. This makes sense, as hiring a tech-person is a 6-figure annual recurring cost in most large US job-markets (after you factor in fringe benefits, employer-side taxes etc). That's an Enterprise contract right there, no wonder each one is a unique snowflake of specialness.

I guarantee that whoever decides a potential hire's salary will consider time in the field, experience with the given technologies, ability to operate in a fast-paced and changing environment, and ability to effect change as the factors in the initial offer. All things that were involved in the job-req examples I posted above. Certain unconscious biases, such as race and gender, also factor in.

By the time a new Software Engineer walks in the door for their first day they've already been judged better/worse than their peers. Just, no one knows it because it isn't in the job title.

If the company is one that bases annual compensation improvements on the previous year's performance, this judgment happens every year and compounds. Which is how you can get a hypothetical 7 person team that looks like this:

  1. Lead Software Engineer, $185,000/yr
  2. Software Engineer, $122,000/yr
  3. Software Engineer, $105,000/yr
  4. Software Engineer, $170,000/yr
  5. Software Engineer, $150,000/yr
  6. Software Engineer, $135,000/yr
  7. Software Engineer, $130,000/yr

Why is Engineer 4 paid so much more? Probably because they were the second hire after the Lead, meaning they have more years of increases under their belt, and possibly got a guilt-raise when Engineer 1 was picked for Lead over them after the 3rd hire happened and the team suddenly needed a Lead.

One job-title, a $65,000 spread in annual compensation. Obviously, no one has been judged better or worse than anyone else.


Then something like #TalkPay happens. Engineer number 4 says in Slack, "I'm making 170K. #TalkPay". Engineer number 3 chokes on her coffee. Suddenly, five engineers are hammering on doors for raises because they had no idea the company was willing to pay that much for a non-Lead.

Now, if that same series were done but with a Fed-style job series?

  1. Lead Software Engineer, $185,000/yr
  2. Junior Software Engineer, $122,000/yr
  3. Associate Software Engineer, $105,000/yr
  4. Senior Software Engineer, $170,000/yr
  5. Senior Software Engineer, $150,000/yr
  6. Software Engineer, $135,000/yr
  7. Software Engineer, $130,000/yr

Only one person will be banging on doors, Engineer number 5. Having a job-series allows you to have overt pay disparity without having to pretend everyone is equal to everyone else. It makes overt the judgment that is already being made, which makes the system more fair.

Is this the best of all possible worlds?

Heck no. Balancing unconscious-bias mitigation (rigid salary schedules and titles) against compensating your high performers (individualized salary negotiations) is a fundamentally hard problem, with unhappy people no matter what you pick. But not pretending we're all the same keeps things somewhat more transparent. It also makes it more obvious when certain kinds of people aren't getting promotions, or are getting half the annual raises of everyone else.

by SysAdmin1138 at December 29, 2017 06:09 PM

December 28, 2017

Vincent Bernat

(Micro)benchmarking Linux kernel functions

Usually, the performance of a Linux subsystem is measured through an external (local or remote) process stressing it. Depending on the input point used, a large portion of code may be involved. To benchmark a single function, one solution is to write a kernel module.

Minimal kernel module

Let’s suppose we want to benchmark the IPv4 route lookup function, fib_lookup(). The following kernel function executes 1,000 lookups for a given destination and returns the average value.1 It uses the get_cycles() function to compute the execution “time.”

/* Execute a benchmark on fib_lookup() and put
   result into the provided buffer `buf`. */
static int do_bench(char *buf)
{
    unsigned long long t1, t2;
    unsigned long long total = 0;
    unsigned long i;
    unsigned count = 1000;
    int err = 0;
    struct fib_result res;
    struct flowi4 fl4;

    memset(&fl4, 0, sizeof(fl4));
    fl4.daddr = in_aton("");

    for (i = 0; i < count; i++) {
        t1 = get_cycles();
        err |= fib_lookup(&init_net, &fl4, &res, 0);
        t2 = get_cycles();
        total += t2 - t1;
    }
    if (err != 0)
        return scnprintf(buf, PAGE_SIZE, "err=%d msg=\"lookup error\"\n", err);
    return scnprintf(buf, PAGE_SIZE, "avg=%llu\n", total / count);
}

Now, we need to embed this function in a kernel module. The following code registers a sysfs directory containing a pseudo-file run. When a user queries this file, the module runs the benchmark function and returns the result as content.

#define pr_fmt(fmt) "kbench: " fmt

#include <linux/kernel.h>
#include <linux/version.h>
#include <linux/module.h>
#include <linux/inet.h>
#include <linux/timex.h>
#include <net/ip_fib.h>

/* When a user fetches the content of the "run" file, execute the
   benchmark function. */
static ssize_t run_show(struct kobject *kobj,
                        struct kobj_attribute *attr,
                        char *buf)
{
    return do_bench(buf);
}

static struct kobj_attribute run_attr = __ATTR_RO(run);
static struct attribute *bench_attributes[] = {
    &run_attr.attr,
    NULL
};
static struct attribute_group bench_attr_group = {
    .attrs = bench_attributes,
};
static struct kobject *bench_kobj;

int init_module(void)
{
    int rc;
    /* ❶ Create a simple kobject named "kbench" in /sys/kernel. */
    bench_kobj = kobject_create_and_add("kbench", kernel_kobj);
    if (!bench_kobj)
        return -ENOMEM;

    /* ❷ Create the files associated with this kobject. */
    rc = sysfs_create_group(bench_kobj, &bench_attr_group);
    if (rc) {
        kobject_put(bench_kobj);
        return rc;
    }
    return 0;
}

void cleanup_module(void)
{
    kobject_put(bench_kobj);
}

/* Metadata about this module */
MODULE_LICENSE("GPL");
MODULE_DESCRIPTION("Microbenchmark for fib_lookup()");

In ❶, kobject_create_and_add() creates a new kobject named kbench. A kobject is the abstraction behind the sysfs filesystem. This new kobject is visible as the /sys/kernel/kbench/ directory.

In ❷, sysfs_create_group() attaches a set of attributes to our kobject. These attributes are materialized as files inside /sys/kernel/kbench/. Currently, we declare only one of them, run, with the __ATTR_RO macro. The attribute is therefore read-only (0444); when a user tries to fetch the content of the file, the run_show() function is invoked with a buffer of PAGE_SIZE bytes as its last argument and is expected to return the number of bytes written.

For more details, you can look at the documentation in the kernel and the associated example. Beware, random posts found on the web (including this one) may be outdated.2

The following Makefile will compile this example:

# Kernel module compilation
KDIR = /lib/modules/$(shell uname -r)/build
obj-m += kbench_mod.o
kbench_mod.ko: kbench_mod.c
    make -C $(KDIR) M=$(PWD) modules

After executing make, you should get a kbench_mod.ko file:

$ modinfo kbench_mod.ko
filename:       /home/bernat/code/…/kbench_mod.ko
description:    Microbenchmark for fib_lookup()
license:        GPL
name:           kbench_mod
vermagic:       4.14.0-1-amd64 SMP mod_unload modversions

You can load it and execute the benchmark:

$ insmod ./kbench_mod.ko
$ ls -l /sys/kernel/kbench/run
-r--r--r-- 1 root root 4096 déc.  10 16:05 /sys/kernel/kbench/run
$ cat /sys/kernel/kbench/run

The result is a number of cycles. You can get an approximate time in nanoseconds if you divide it by the frequency of your processor in gigahertz (25 ns if you have a 3 GHz processor).3

Configurable parameters

The module hard-codes two constants: the number of loops and the destination address to test. We can make these parameters user-configurable by exposing them as attributes of our kobject and defining a pair of functions to read/write them:

static unsigned long loop_count      = 5000;
static u32           flow_dst_ipaddr = 0x08080808;

/* A mutex is used to ensure we are thread-safe when altering attributes. */
static DEFINE_MUTEX(kb_lock);

/* Show the current value for loop count. */
static ssize_t loop_count_show(struct kobject *kobj,
                               struct kobj_attribute *attr,
                               char *buf)
{
    ssize_t res;
    mutex_lock(&kb_lock);
    res = scnprintf(buf, PAGE_SIZE, "%lu\n", loop_count);
    mutex_unlock(&kb_lock);
    return res;
}

/* Store a new value for loop count. */
static ssize_t loop_count_store(struct kobject *kobj,
                                struct kobj_attribute *attr,
                                const char *buf,
                                size_t count)
{
    unsigned long val;
    int err = kstrtoul(buf, 0, &val);
    if (err < 0)
        return err;
    if (val < 1)
        return -EINVAL;
    mutex_lock(&kb_lock);
    loop_count = val;
    mutex_unlock(&kb_lock);
    return count;
}

/* Show the current value for destination address. */
static ssize_t flow_dst_ipaddr_show(struct kobject *kobj,
                                    struct kobj_attribute *attr,
                                    char *buf)
{
    ssize_t res;
    mutex_lock(&kb_lock);
    res = scnprintf(buf, PAGE_SIZE, "%pI4\n", &flow_dst_ipaddr);
    mutex_unlock(&kb_lock);
    return res;
}

/* Store a new value for destination address. */
static ssize_t flow_dst_ipaddr_store(struct kobject *kobj,
                                     struct kobj_attribute *attr,
                                     const char *buf,
                                     size_t count)
{
    mutex_lock(&kb_lock);
    flow_dst_ipaddr = in_aton(buf);
    mutex_unlock(&kb_lock);
    return count;
}

/* Define the new set of attributes. They are read/write attributes. */
static struct kobj_attribute loop_count_attr      = __ATTR_RW(loop_count);
static struct kobj_attribute flow_dst_ipaddr_attr = __ATTR_RW(flow_dst_ipaddr);
static struct kobj_attribute run_attr             = __ATTR_RO(run);
static struct attribute *bench_attributes[] = {
    &loop_count_attr.attr,
    &flow_dst_ipaddr_attr.attr,
    &run_attr.attr,
    NULL
};

The IPv4 address is stored as a 32-bit integer but displayed and parsed using the dotted quad notation. The kernel provides the appropriate helpers for this task.

After this change, we have two new files in /sys/kernel/kbench. We can read the current values and modify them:

# cd /sys/kernel/kbench
# ls -l
-rw-r--r-- 1 root root 4096 déc.  10 19:10 flow_dst_ipaddr
-rw-r--r-- 1 root root 4096 déc.  10 19:10 loop_count
-r--r--r-- 1 root root 4096 déc.  10 19:10 run
# cat loop_count
# cat flow_dst_ipaddr
# echo > flow_dst_ipaddr
# cat flow_dst_ipaddr

We still need to alter the do_bench() function to make use of these parameters:

static int do_bench(char *buf)
{
    /* … */
    mutex_lock(&kb_lock);
    count = loop_count;
    fl4.daddr = flow_dst_ipaddr;
    mutex_unlock(&kb_lock);

    for (i = 0; i < count; i++) {
        /* … */

Meaningful statistics

Currently, we only compute the average lookup time. This value is usually inadequate:

  • A small number of outliers can raise this value quite significantly. An outlier can happen because we were preempted off the CPU while executing the benchmarked function. This doesn’t happen often if the function execution time is short (less than a millisecond), but when it does, the outliers can be off by several milliseconds, which is enough to make the average inadequate when most values are several orders of magnitude smaller. For this reason, the median usually gives a better view.

  • The distribution may be asymmetrical or have several local maxima. It’s better to keep several percentiles or even a distribution graph.

To be able to extract meaningful statistics, we store the results in an array.

static int do_bench(char *buf)
{
    unsigned long long *results;
    /* … */

    results = kmalloc(sizeof(*results) * count, GFP_KERNEL);
    if (!results)
        return scnprintf(buf, PAGE_SIZE, "msg=\"no memory\"\n");

    for (i = 0; i < count; i++) {
        t1 = get_cycles();
        err |= fib_lookup(&init_net, &fl4, &res, 0);
        t2 = get_cycles();
        results[i] = t2 - t1;
    }

    if (err != 0) {
        kfree(results);
        return scnprintf(buf, PAGE_SIZE, "err=%d msg=\"lookup error\"\n", err);
    }

    /* Compute and display statistics */
    display_statistics(buf, results, count);
    kfree(results);
    return strnlen(buf, PAGE_SIZE);
}

Then we need a helper function to compute percentiles:

static unsigned long long percentile(int p,
                                     unsigned long long *sorted,
                                     unsigned count)
{
    int index = p * count / 100;
    int index2 = index + 1;
    if (p * count % 100 == 0)
        return sorted[index];
    if (index2 >= count)
        index2 = index - 1;
    if (index2 < 0)
        index2 = index;
    return (sorted[index] + sorted[index2]) / 2;
}

This function needs a sorted array as input. The kernel provides a heapsort function, sort(), for this purpose. Another useful value to have is the deviation from the median. Here is a function to compute the median absolute deviation:4

/* Comparison function required by sort(). */
static int compare_ull(const void *a, const void *b)
{
    const unsigned long long x = *(const unsigned long long *)a;
    const unsigned long long y = *(const unsigned long long *)b;
    if (x < y) return -1;
    if (x > y) return 1;
    return 0;
}

static unsigned long long mad(unsigned long long *sorted,
                              unsigned long long median,
                              unsigned count)
{
    unsigned long long *dmedian = kmalloc(sizeof(unsigned long long) * count,
                                          GFP_KERNEL);
    unsigned long long res;
    unsigned i;

    if (!dmedian) return 0;
    for (i = 0; i < count; i++) {
        if (sorted[i] > median)
            dmedian[i] = sorted[i] - median;
        else
            dmedian[i] = median - sorted[i];
    }
    sort(dmedian, count, sizeof(unsigned long long), compare_ull, NULL);
    res = percentile(50, dmedian, count);
    kfree(dmedian);
    return res;
}

With these two functions, we can provide additional statistics:

static void display_statistics(char *buf,
                               unsigned long long *results,
                               unsigned long count)
{
    unsigned long long p95, p90, p50;

    if (count == 0) {
        scnprintf(buf, PAGE_SIZE, "msg=\"no match\"\n");
        return;
    }

    sort(results, count, sizeof(*results), compare_ull, NULL);
    p95 = percentile(95, results, count);
    p90 = percentile(90, results, count);
    p50 = percentile(50, results, count);
    scnprintf(buf, PAGE_SIZE,
          "min=%llu max=%llu count=%lu 95th=%llu 90th=%llu 50th=%llu mad=%llu\n",
          results[0],
          results[count - 1],
          count,
          p95, p90, p50,
          mad(results, p50, count));
}

We can also append a graph of the distribution function (and of the cumulative distribution function):

min=72 max=33364 count=100000 95th=154 90th=142 50th=112 mad=6
    value │                      ┊                         count
       72 │                                                   51
       77 │▒                                                3548
       82 │▒▒░░                                             4773
       87 │▒▒░░░░░                                          5918
       92 │░░░░░░░                                          1207
       97 │░░░░░░░                                           437
      102 │▒▒▒▒▒▒░░░░░░░░                                  12164
      107 │▒▒▒▒▒▒▒░░░░░░░░░░░░░░                           15508
      112 │▒▒▒▒▒▒▒▒▒▒▒░░░░░░░░░░░░░░░░░░░░░░               23014
      117 │▒▒▒░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░             6297
      122 │░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░              905
      127 │▒░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░           3845
      132 │▒▒▒░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░       6687
      137 │▒▒░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░     4884
      142 │▒▒░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░   4133
      147 │░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░  1015
      152 │░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░  1123

Benchmark validity

While the benchmark produces some figures, we may question their validity. There are several traps when writing a microbenchmark:

dead code
The compiler may optimize away our benchmark because its result is not used. In our example, we make sure to combine the result into a variable (with err |=) to avoid this.
warmup phase
One-time initializations may negatively affect the benchmark. This is less likely to happen with C code since there is no JIT. Nonetheless, you may want to add a small warmup phase.
too small dataset
If the benchmark runs with the same input parameters over and over, the input data may fit entirely in the L1 cache, which positively skews the benchmark. Therefore, it is important to iterate over a large dataset.
too regular dataset
A regular dataset may still positively skew the benchmark despite its size. While the whole dataset will not fit into the L1/L2 cache, the previous run may have loaded most of the data needed for the current run. In the route lookup example, as route entries are organized in a tree, it’s important not to scan the address space linearly. The address space could be explored randomly (a simple linear congruential generator brings reproducible randomness).
large overhead
If the benchmarked function runs in a few nanoseconds, the overhead of the benchmark infrastructure may be too high. Typically, the overhead of the method presented here is around 5 nanoseconds. get_cycles() is a thin wrapper around the RDTSC instruction: it returns the number of cycles for the current processor since its last reset. It is also virtualized with low overhead in case you run the benchmark in a virtual machine. If you want to measure a function with greater precision, you need to wrap it in a loop. However, the loop itself adds overhead, notably if you need to compute a large input set (in that case, the input can be prepared ahead of time). Compilers also like to mess with loops. Finally, a loop hides the result distribution.
preemption
While the benchmark is running, the thread executing it can be preempted (or, when running in a virtual machine, the whole virtual machine can be preempted by the host). When the function takes less than a millisecond to execute, one can assume preemption is rare enough to be filtered out by using a percentile function.
noise
When running the benchmark, noise from unrelated processes (or sibling hosts when benchmarking in a virtual machine) needs to be avoided, as it may change from one run to another. Therefore, it is not a good idea to benchmark in a public cloud. On the other hand, adding controlled noise to the benchmark may lead to less artificial results: in our example, route lookup is only a small part of routing a packet, and measuring it alone in a tight loop positively skews the benchmark.
syncing parallel benchmarks
While it is possible (and safe) to run several benchmarks in parallel, it may be difficult to ensure they really run in parallel: some invocations may work in better conditions because other threads are not running yet, skewing the result. Ideally, each run should execute bogus iterations and start measuring only when all runs are present. This does not seem to be a trivial addition.

As a conclusion, the benchmark module presented here is quite primitive (notably compared to a framework like JMH for Java) but, with care, can deliver some conclusive results like in these posts: “IPv4 route lookup on Linux” and “IPv6 route lookup on Linux.”


Use of a tracing tool is an alternative approach. For example, if we want to benchmark IPv4 route lookup times, we can use the following process:

while true; do
  ip route get $((RANDOM%100)).$((RANDOM%100)).$((RANDOM%100)).5
  sleep 0.1
done

Then, we instrument the __fib_lookup() function with eBPF (through BCC):

$ sudo funclatency-bpfcc __fib_lookup
Tracing 1 functions for "__fib_lookup"... Hit Ctrl-C to end.
     nsecs               : count     distribution
         0 -> 1          : 0        |                    |
         2 -> 3          : 0        |                    |
         4 -> 7          : 0        |                    |
         8 -> 15         : 0        |                    |
        16 -> 31         : 0        |                    |
        32 -> 63         : 0        |                    |
        64 -> 127        : 0        |                    |
       128 -> 255        : 0        |                    |
       256 -> 511        : 3        |*                   |
       512 -> 1023       : 1        |                    |
      1024 -> 2047       : 2        |*                   |
      2048 -> 4095       : 13       |******              |
      4096 -> 8191       : 42       |********************|

Currently, the overhead is quite high, as a route lookup on an empty routing table takes less than 100 ns. Once Linux supports inter-event tracing, the overhead of this solution may be reduced enough to be usable for such microbenchmarks.

  1. In this simple case, it may be more accurate to use:

    t1 = get_cycles();
    for (i = 0; i < count; i++) {
        err |= fib_lookup(&init_net, &fl4, &res, 0);
    }
    t2 = get_cycles();
    total = t2 - t1;

    However, this prevents us from computing more statistics. Moreover, when you need to provide a non-constant input to the fib_lookup() function, the first way is likely to be more accurate. 

  2. In-kernel API backward compatibility is a non-goal of the Linux kernel. 

  3. You can get the current frequency with cpupower frequency-info. As the frequency may vary (even when using the performance governor), this may not be accurate, but it still provides an easier representation (comparable results should use the same frequency). 

  4. Only integer arithmetic is available in the kernel. While it is possible to approximate a standard deviation using only integers, the median absolute deviation just reuses the percentile() function defined above. 

by Vincent Bernat at December 28, 2017 09:27 AM

December 27, 2017

Carl Chenet

Testing Ansible Playbooks With Vagrant

I use Ansible to automate the deployments of my websites (, Journal du hacker) and my applications (Feed2toot, Feed2tweet). In this blog post, I’ll describe my setup for testing my Ansible Playbooks locally on my laptop.

Why testing the Ansible Playbooks

I need a simple and fast way to test the deployments of my Ansible Playbooks locally on my laptop, especially at the beginning of writing a new Playbook, because deploying directly on the production server is both reeeeally slow… and risky for my services in production.

Instead of deploying on a remote server, I deploy my Playbooks on a VirtualBox VM using Vagrant. This allows me to quickly get the result of a new modification, iterating and fixing as fast as possible.

Disclaimer: I am not a professional programmer. Better solutions might exist; I’m only describing one way of testing Ansible Playbooks that I find both easy and efficient for my own use cases.

My process

  1. Begin writing the new Ansible Playbook
  2. Launch a fresh virtual machine (VM) and deploy the playbook on this VM using Vagrant
  3. Fix the issues, either in the playbook or in the application deployed by Ansible itself
  4. Relaunch the deployment on the VM
  5. If more errors appear, go back to step 3. Otherwise destroy the VM, recreate it, and deploy one last time to test with a fresh install
  6. If no error remains, tag the version of your Ansible Playbook and you’re ready to deploy in production

What you need

First, you need VirtualBox. If you use the Debian distribution, this link describes how to install it, either from the Debian repositories or from upstream.

Second, you need Vagrant. Why Vagrant? Because it’s a kind of middleware between your development environment and your virtual machine, allowing programmatically reproducible operations and easy linking between your deployments and the virtual machine. Install it with the following command:

# apt install vagrant

Setting up Vagrant

Everything about Vagrant lies in the file Vagrantfile. Here is mine:

Vagrant.require_version ">= 2.0.0"
Vagrant.configure("2") do |config| = "debian/stretch64"
  config.vm.provision "shell", inline: "apt install --yes git python3-pip"
  config.vm.provision "ansible" do |ansible|
    ansible.verbose = "v"
    ansible.playbook = "site.yml"
    ansible.vault_password_file = "vault_password_file"
  end
end

  1. The 1st line defines which versions of Vagrant may execute your Vagrantfile.
  2. The Vagrant.configure block opens the configuration; inside it you could define operations for as many virtual machines as you wish (here just one).
  3. The 3rd line defines the official Vagrant image we’ll use for the virtual machine.
  4. The 4th line is really important: these are the applications missing on the VM. Here we install git and python3-pip with apt.
  5. The next line indicates the start of the Ansible configuration.
  6. On the 6th line, we ask for verbose output from Ansible.
  7. On the 7th line, we define the entry point of our Ansible Playbook.
  8. On the 8th line, if you use Ansible Vault to encrypt some files, just define here the file with your Ansible Vault passphrase.

When Vagrant launches Ansible, it’s going to launch something like:

$  ansible-playbook --inventory-file=/home/me/ansible/test-ansible-playbook/.vagrant/provisioners/ansible/inventory -v --vault-password-file=vault_password_file site.yml

Executing Vagrant

After writing your Vagrantfile, you need to launch your VM. It’s as simple as using the following command:

$ vagrant up

That’s a slow operation, because the VM will be launched, the additional apps you defined in the Vagrantfile will be installed, and finally your Playbook will be deployed on it. You should use it sparingly.

Ok, now we’re really ready to iterate fast. Between your different modifications, in order to test your deployments fast and on a regular basis, just use the following command:

$ vagrant provision

Once your Ansible Playbook is finally ready, usually after lots of iterations (at least that’s my case), you should test it on a fresh install, because your different iterations may have modified your virtual machine and could trigger unexpected results.

In order to test it from a fresh install, use the following command:

$ vagrant destroy && vagrant up

That’s again a slow operation. You should use it when you’re pretty sure your Ansible Playbook is almost finished. After testing your deployment on a fresh VM, you’re now ready to deploy in production. Or at least you’re better prepared :p

Possible improvements? Let me know

I find the setup described in this blog post quite useful for my use cases. I can iterate quite fast especially when I begin writing a new playbook, not only on the playbook but sometimes on my own latest apps, not yet ready to be deployed in production. Deploying on a remote server would be both slow and dangerous for my services in production.

I could use a continuous integration (CI) server, but that’s not the topic of this blog post. As said before, the goal is to iterate as fast as possible at the beginning of writing a new Ansible Playbook.


Committing, pushing to your Git repository and waiting for the execution of your CI tests is overkill at the beginning of an Ansible Playbook, when it’s full of errors waiting to be debugged one by one. I think CI is more useful later in the life of an Ansible Playbook, especially when different people work on it and you have a set of code quality rules to enforce. That’s only my opinion and it’s open to discussion; one more time, I’m not a professional programmer.

If you have better solutions to test Ansible Playbooks, or ways to improve the one described here, let me know by writing a comment or by contacting me through my accounts on the social networks below. I’ll be delighted to hear about your improvements.

About Me

Carl Chenet, Free Software Indie Hacker, Founder of, a job board for Free and Open Source Jobs in France.

Follow Me On Social Networks


by Carl Chenet at December 27, 2017 11:00 PM

Steve Kemp's Blog

Translating my website to Finnish

I've now been living in Finland for two years, and I'm pondering a small project to translate my main website into Finnish.

Obviously if my content is solely in Finnish it will be of little interest to most of the world - if my vanity lets me even pretend it is useful at the moment!

The traditional way to do this, with Apache, is to render pages in multiple languages and let the client(s) request their preferred version with Accept-Language:. It seems that many clients are terrible at this, though, and the whole approach is a mess. Pretending it works, we render pages such as:


Then "magic happens", such that the right content is served. I can then do extra-things, like add links to "English" or "Finnish" in the header/footers to let users choose.

Unfortunately I have an immediate problem! I host a bunch of websites on a single machine and I don't want to allow a single site compromise to affect other sites. To do that I run each website under its own Unix user. For example I have the website "" running as the "s-fi" user, and my blog runs as "s-blog", or "s-blogfi":

root@www ~ # ps -ef | egrep '(s-blog|s-fi)'
s-blogfi /usr/sbin/lighttpd -f /srv/ -D
s-blog   /usr/sbin/lighttpd -f /srv/ -D
s-fi     /usr/sbin/lighttpd -f /srv/ -D

There you can see the Unix user, and the per-user instance of lighttpd which hosts the website. Each instance binds to a high-port on localhost, and I have a reverse proxy listening on the public IP address to route incoming connections to the appropriate back-end instance.

I used to use thttpd but switched to lighttpd to allow CGI scripts to be used - some of my sites are slightly/mostly dynamic.

Unfortunately lighttpd doesn't support multiviews without some Lua hacks, which will require rewriting - the supplied example only handles Accept rather than the Accept-Language header I want.

It seems my simplest solution is to switch from having lighttpd on the back-end to running apache2 instead, but I've not yet decided which way to jump.

Food for thought, anyway.

hyvää joulua!

December 27, 2017 10:00 PM

LZone - Sysadmin

Helm Error: cannot connect to Tiller

Today I ran "helm" and got the following error:
$ helm status
Error: could not find tiller
It took me a few minutes to find the root cause. My first thought was that the tiller installation was gone or broken, but it turned out to be fine. The root cause was that the helm client didn't select the correct namespace and instead stayed in the current namespace (where tiller isn't located).

This is due to the use of an environment variable $TILLER_NAMESPACE (as suggested in the setup docs) which I forgot to persist in my shell.

So running
$ TILLER_NAMESPACE=tiller helm status
solved the issue.

December 27, 2017 11:01 AM

December 25, 2017

Evaggelos Balaskas

2FA SSH aka OpenSSH OATH, Two-Factor Authentication

2FA SSH aka OpenSSH OATH, Two-Factor Authentication


Good security is based on layers of protection. At some point, usability gets in the way. My threat model for accessing systems is to create a separate ssh key pair (private/public) and only use it, instead of a login password. I try to keep my digital assets separated and not put all of them in the same basket. My laptop is encrypted and I don’t run any services on it, but even then a bad actor can always find a way.

Back in the day, I was looking at Barada/Gort. Barada is an implementation of HOTP (an HMAC-based one-time password algorithm), and Gort is the Android app you can install on your mobile to connect to Barada. Neither of these applications has been updated since 2013/2014, and Gort has even been removed from F-Droid!

Talking with friends on our upcoming trip to 34C3 and discussing some security subjects, I thought it was time to review my previous inquiry into ssh 2FA. Most of my friends are using YubiKeys. I would love to try some, but at this time I don’t have the time to order/test/apply them to my machines. To reduce any risk, the idea of combining a bastion/jump-host server with 2FA seemed to be an easy and affordable solution.

OpenSSH with OATH

As ssh login is based on PAM (Pluggable Authentication Modules), we can use the GNU OATH Toolkit. OATH stands for Open AuTHentication and it is an open standard. In a nutshell, we add a new authorization step through which we verify our login via our mobile device.

Below are my personal notes on how to set up oath-toolkit and pam_oath, and how to synchronize them with your Android device. These notes are based on CentOS 6.9.


We need to install the epel repository:

# yum -y install epel-release

Searching packages

Searching for oath

# yum search oath

the results are similar to these:

liboath.x86_64       : Library for OATH handling
liboath-devel.x86_64 : Development files for liboath
liboath-doc.noarch   : Documentation files for liboath

pam_oath.x86_64      : A PAM module for pluggable login authentication for OATH
gen-oath-safe.noarch : Script for generating HOTP/TOTP keys (and QR code)
oathtool.x86_64      : A command line tool for generating and validating OTPs

Installing packages

We need to install three packages:

  • pam_oath is the PAM for OATH
  • oathtool is the gnu oath-toolkit
  • gen-oath-safe is the program that we will use to sync our mobile device with our system

# yum -y install pam_oath oathtool gen-oath-safe


Before we continue with our setup, I believe now is the time to install FreeOTP


FreeOTP looks like:



Now, it is time to generate and sync our 2FA, using HOTP


You should replace username with your USER_NAME !

# gen-oath-safe username HOTP



and scan the QR with FreeOTP


You can see in the top a new entry!



Do not forget to save your HOTP key (hex) to the gnu oath-toolkit user file.


Key in Hex: e9379dd63ec367ee5c378a7c6515af01cf650c89

# echo "HOTP username - e9379dd63ec367ee5c378a7c6515af01cf650c89" > /etc/liboath/oathuserfile


# cat /etc/liboath/oathuserfile

HOTP username - e9379dd63ec367ee5c378a7c6515af01cf650c89


The penultimate step is to set up our ssh login with the PAM oath library.

Verify PAM

# ls -l /usr/lib64/security/

-rwxr-xr-x 1 root root 11304 Nov 11  2014 /usr/lib64/security/


# cat /etc/pam.d/sshd

In modern systems, the sshd PAM configuration file looks like this:

auth       required
auth       include      password-auth
account    required
account    include      password-auth
password   include      password-auth
# close should be the first session rule
session    required close
session    required
# open should only be followed by sessions to be executed in the user context
session    required open env_params
session    required
session    optional force revoke
session    include      password-auth

We need to add one line at the top of the file.

I use something like this:

auth       sufficient /usr/lib64/security/ debug usersfile=/etc/liboath/oathuserfile window=5 digits=6

Depending on your policy and threat model, you can switch sufficient to requisite, and you can remove the debug option. In the third field, you can try typing just without the full path, and you can change the window to something else:


auth requisite usersfile=/etc/liboath/oathuserfile window=10 digits=6

Restart sshd

In every change/test remember to restart your ssh daemon:

# service sshd restart

Stopping sshd:                                             [  OK  ]
Starting sshd:                                             [  OK  ]


If you are getting some weird messages, try changing the status of SELinux to permissive and try again. If SELinux is the issue, you have to review the SELinux audit logs and add/fix any SELinux policies/modules so that your system can work properly.

# getenforce

# setenforce 0

# getenforce


The last and most important thing, is to test it !



Post Scriptum

The idea of using OATH & FreeOTP can also be applied to logging into your laptop, as PAM is the basic authentication framework on a Linux machine. You can use OATH in every service that can authenticate itself through PAM.

Tag(s): SSH, FreeOTP, HOTP

December 25, 2017 11:17 AM

System Administration Advent Calendar

Day 25 - How to choose a data store for the new shiny thing

By: Silvia Botros (@dbsmasher)

Edited By: Kirstin Slevin (@andersonkirstin)

Databases can be hard. You know what’s harder? Choosing one in the first place. This is challenging whether you are in a new company that is still finding its product/market fit or in a company that has found its audience and is simply expanding the product offering. When building a new thing, one of the very first parts of the design process is deciding which data stores we should use, and whether that should be singular or plural. Should we use relational stores or do we need to pick a key-value store? What about time-series options? Should we also sprinkle in some distributed log replay? So. Many. Options…

I will try in this article to describe a process that will hopefully guide that decision and, where applicable, explain how the size and maturity of your organization can impact this decision.

Baseline requirements

Data is the lifeblood of any product. Even if we’re planning in the design to use more bleeding edge technology to store the application state (because MySQL or Postgres aren’t “cool” anymore), whatever we choose is still a data store and hence requires that we apply rigor when making our selection. The important thing to remember is that nothing is for free. All data stores come with compromises and if you are not being explicit about what compromises you are taking as a business risk, you will be taking unknown risk that will show itself at the worst possible time.

Your product manager is unlikely to know, or even need to care, what you use for your data store, but they will drive the needs that shrink the option list. Sometimes even that needs some nudging from the development team, though. Here is a list of things you need to ask the product team to help narrow your options:

  • Growth rate - How is the data itself or the access to it expected to change over time?
  • How will the billing team use this new data?
  • How will the ETL team use this data?
  • What accuracy/consistency requirements are expected for this new feature?
    • What time span for that consistency is acceptable? Is post-processing correction acceptable?

Find the context that is not said

The choice of the data store is not a choice reserved for the DBA or the Ops team or even just the engineer writing the code. For an already mature organization with a known addressable market, the requirements that feed this decision need to come from across the organization. If the requirements from the product team fit a dozen data stores, how do you determine the requirements that were not explicitly called out? You need to surface unspoken requirements as soon as possible, because leaving them implicit is the road to failed expectations down the line. Many kinds of implied requirements can land you in this ‘too many choices’ trap, including but not limited to:

  • Incomplete feature lists
  • Performance requirements that are not explicitly listed
  • Consistency needs that are assumed
  • Growth rate that is not specified
  • Billing or ETL query needs that aren’t yet available/known

Any of these can leave an engineering team spinning their wheels too long, vetting a long list of data store choices, simply because the explicit criteria they are working with are too permissive or incomplete.

For more ‘greenfield’ products, as I mentioned before, your goal is flexibility. A more general-purpose, known-quality data store will help you get closer to a deliverable, with the knowledge that down the line you may need to move to a data store that is more amenable to your new scale.

Make your list

It is time to filter potential solutions by your list of requirements. The resulting list should be no more than a handful of possible data stores. If the list of potential databases is longer than that, your requirements are too permissive and you need to go back and gather more information.

For younger, less mature companies, data store requirements are the area with the most unknowns. You are possibly building a new thing that no one offers just yet, so things like total addressable market size and growth rate may be relatively unknown and hard to quantify. In this case, you need to avoid constraining yourself too early in the lifetime of your new company with a one-trick-pony data store. Yes, at some point your data will grow in new and unexpected ways, but what you need right now is flexibility as you try to find your market niche and learn what the growth of your data will look like and which specific scalability features will become crucial to your growth.

If you are a larger company with a growing number of paying customers, your task here is to shrink the option list, preferably to data stores you already have and maintain. When you already have a lot of paying customers, the risk of adding new data stores that your team is not familiar with becomes higher and, depending on the context of the data, simply unacceptable. Another thing to keep in mind is what tooling already exists for your current data stores and what adopting a new one would mean in up-front work for your team: configuration management, backup scripts, data recovery scripts, new monitoring checks, new dashboards to build and get familiar with. The operational cost of a new data store, risk aside, is not trivial.

Choose your poison

So here is a badly kept secret that DBAs hold on to: databases are all terrible at something. There is even a whole theorem about that (CAP). Not just databases in the traditional sense; any tech that stores state will be horrible in a way unique to how you use it. That is just a fact of life that you had better internalize now. No, I am not saying you should avoid using any of these technologies. I am saying: keep your expectations sane, and know that you and only you and your team ultimately own delivering on the promises you make.

What does this mean in non abstract terms? Once you have a solid idea what data stores are going to be part of what you are building, you should start by knowing the weaknesses of these data stores. These weaknesses include but are not limited to:

  • Does this datastore work well under scan queries?
  • Does this datastore rely on a gossip protocol for data replication? If so, how does it handle network partitions? How much data is involved in that gossip?
  • Does this datastore have a single point of failure?
  • How mature are the community drivers for talking to it, or do you need to roll your own?
  • This list can be huge

Thinking through the weaknesses of the potential solutions still on your list should knock more options off the list. This is now reality meeting the lofty promises of tech.

Spreadsheet and Bake off!

Once your list of choices is down to a small handful, it is time to put them all in a spreadsheet and start digging a little deeper. You need a pros column and a cons column, and at this point you will need to spend some time in each database's documentation to find out the nitty-gritty details of how to do certain tasks. If this is data you expect to have a large growth rate, you need to know which of these options is easier to scale out. If this is a feature that does a lot of fuzzy search, you need to know which data store can handle scans or searching through a large number of rows better, and with what design. The target at this stage is to whittle the list down to ideally 2 or 3 options via documentation alone, because if this new feature is critical enough to the company's success, you will have to benchmark all of them.

Why benchmark, you say? Because no two companies use the same data store the same way. Because documentation sometimes implies caveats that only get exposed in other people’s war stories. Because no one owns the stability, the reliability and the predictability of this data store but you.

Design your benchmark in advance. Ideally, set up a full instance of each data store on your list with production-level specifications, and produce test data that is large enough that load testing is meaningful. Make sure to benchmark not only ‘normal load’ but also some failure scenarios. The hope is that the benchmark surfaces any caveats severe enough to make you revisit the option list now, instead of later when all the code is written and you are in the fire-drill phase with a lot of time and effort committed to the choice you made.

Document your choice

No matter what you do, you must document and broadcast internally the method by which you reached your choice and the alternatives that were investigated on the route to that decision. Presuming there is an overarching architecture blueprint of how this new feature and all its components will be created, make sure to create a section dedicated to the data store powering this new feature, with links to all the benchmarks done to reach the decision the team came to. This is not just for the benefit of future new hires but also for your team’s benefit in the present. A document that people can asynchronously read and develop opinions on keeps the decision process transparent, grows a sense of best intent among team members, and can bring in criticism from perspectives you didn’t foresee.

Wrap up

These steps will not only lead to data-informed decisions as you grow the business offering, but also to a more robust infrastructure and a more disciplined approach to when and where you use an ever-growing field of technologies to provide value to your paying customers.

by Christopher Webber ( at December 25, 2017 05:38 AM

December 24, 2017

System Administration Advent Calendar

Day 24 - On-premise Kubernetes with dynamic load balancing using rke, Helm and NGINX

By: Sebastiaan van Steenis (@svsteenis)

Edited By: Mike Ciavarella (@mxcia)

Containers are a great solution for consistent software deployments. When you start using containers in your environment, you and your team will quickly realise that you need a system which allows you to automate container operations. You need a system to keep your containers running when stuff breaks (which always happens, expect failure!), be able to scale up and down, and which is also extensible, so you can interact with it or build upon it to get the functionality you need. The most popular system for deploying, scaling and managing containerized applications is Kubernetes.

Kubernetes is a great piece of software. It includes all the functionality you'll initially need to deploy, scale and operate your containers, as well as more advanced options for customising exactly how your containers are managed. A number of companies provide managed Kubernetes-as-a-service, but there are still plenty of use-cases that need to run on bare metal to meet regulatory requirements, use existing investments in hardware, or for other reasons. In this post we will use Rancher Kubernetes Engine (rke) to deploy a Kubernetes cluster on any machine you prefer, install the NGINX ingress controller, and setup dynamic load balancing across containers, using that NGINX ingress controller.

Setting up Kubernetes

Let's briefly go through the Kubernetes components before we deploy them. You can use the picture below for visualisation. Thanks to Lucas Käldström (@kubernetesonarm) for creating it, as used in his KubeCon presentation.

Using rke, we can define 3 roles for our hosts:

  • control (controlplane)

    The controlplane consists of all the master components. In rke the etcd role is specified separately but can be placed on the same host as the controlplane. The API server is the frontend to your cluster, handling the API requests you run (for example, through the Kubernetes CLI client kubectl which we talk about later). The controlplane also runs the controller manager, which is responsible for running controllers that execute routine tasks.

  • etcd

    The key-value store and the only component which has state, hence the term SSOT in the picture (Single Source of Truth). etcd needs quorum to operate; you can calculate quorum as (n/2)+1 (integer division), where n is the number of members (which are usually hosts). This means that for a production deployment you would deploy at least 3 hosts with the etcd role. etcd will continue to function as long as it has quorum, so with 3 etcd hosts you can have one host fail before you get into real trouble. Also make sure you have a backup plan for etcd.

  • worker

    A host with the worker role will be used to run the actual workloads. It will run the kubelet, which is basically the Kubernetes agent on a host. As one of its activities, kubelet will process the requested workload(s) for that host. Each worker will also run kube-proxy which is responsible for the networking rules and port forwarding. The container runtime we are using is Docker, and for this setup we'll be using the Flannel CNI plugin to handle the networking between all the deployed services on your cluster. Flannel will create an overlay network between the hosts, so that deployed containers can talk to each other.
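The quorum arithmetic mentioned for etcd above is easy to illustrate with a small loop (this is not part of rke, just integer math):

```shell
# Quorum for n etcd members is (n/2)+1 using integer division;
# the cluster survives (n - quorum) member failures.
for n in 1 3 5 7; do
  quorum=$(( n / 2 + 1 ))
  echo "members=$n quorum=$quorum tolerated_failures=$(( n - quorum ))"
done
```

With 3 members the quorum is 2, so one host can fail. Note that 4 members tolerate no more failures than 3 do, which is why odd member counts are preferred.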

For more information on Kubernetes components, see the Kubernetes documentation.

For this setup we'll be using 3 hosts: 1 host will be used as the controlplane (master) and etcd (persistent data store) node, 1 as a worker for running containers, and 1 as both a worker and the loadbalancer entrypoint for your cluster.


You need at least OpenSSH server 7 installed on your hosts, so rke can use it to tunnel to the Docker socket. Please note that there is a known issue when connecting as the root user on RHEL/CentOS based systems; you should use another user on these systems.

SSH key authentication will be used to set up an SSH tunnel to the Docker socket, to launch the containers needed for Kubernetes to function. Tutorials on how to set this up can be found for Linux and Windows.

Make sure you either have swap disabled on the host, or configure the following for the kubelet in cluster.yml (we will generate cluster.yml in the next step):

services:
  kubelet:
    image: rancher/k8s:v1.8.3-rancher2
    extra_args: {"fail-swap-on": "false"}

The hosts need to run Linux and use Docker version 1.12.6, 1.13.1 or 17.03.2. These are the Docker versions validated for Kubernetes 1.8, which we will be deploying. For easy installation of Docker, Rancher provides shell scripts to install a specific Docker version. For this setup we will be using 17.03.2, which you can install using the script below (Rancher provides scripts for the other versions as well):

curl | sudo sh

If you are not using the root user to connect to the host, make sure the user you are using can access the Docker socket (/var/run/docker.sock) on the host. This can be achieved by adding the user to the docker group (e.g. by using sudo usermod -aG docker your_username). For complete instructions, see the Docker documentation.


The network ports that will be used by rke are port 22 (to all hosts, for SSH) and port 6443 (to the master node, Kubernetes API).


Note: in the examples we are using rke_darwin-amd64, which is the binary for macOS. If you are using Linux, replace it with rke_linux-amd64.
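If you script the download, you can pick the right binary name from uname (a small sketch; the version suffix matches the release used in this post):

```shell
# Select the rke binary name for the current OS (Darwin = macOS).
case "$(uname -s)" in
  Darwin) RKE_BIN=rke_darwin-amd64 ;;
  Linux)  RKE_BIN=rke_linux-amd64 ;;
  *)      echo "unsupported OS for this example" >&2; exit 1 ;;
esac
echo "Using $RKE_BIN"
```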

Before we can use rke, we need to get the latest rke release; at the moment this is v0.0.8-dev. Download rke v0.0.8-dev from the GitHub release page, and place it in a rke directory. We will be using this directory to create the cluster configuration file cluster.yml. Open a terminal, make sure you are in your rke directory (or that rke_darwin-amd64 is in your $PATH), and run ./rke_darwin-amd64 config. Pay close attention to specifying the correct SSH Private Key Path and SSH User for each host:

$ ./rke_darwin-amd64 config
Cluster Level SSH Private Key Path [~/.ssh/id_rsa]:
Number of Hosts [3]: 3
SSH Address of host (1) [none]: IP_MASTER_HOST
SSH Private Key Path of host (IP_MASTER_HOST) [none]:
SSH Private Key of host (IP_MASTER_HOST) [none]:
SSH User of host (IP_MASTER_HOST) [ubuntu]: root
Is host (IP_MASTER_HOST) a control host (y/n)? [y]: y
Is host (IP_MASTER_HOST) a worker host (y/n)? [n]: n
Is host (IP_MASTER_HOST) an Etcd host (y/n)? [n]: y
Override Hostname of host (IP_MASTER_HOST) [none]:
Internal IP of host (IP_MASTER_HOST) [none]:
Docker socket path on host (IP_MASTER_HOST) [/var/run/docker.sock]:
SSH Address of host (2) [none]: IP_WORKER_HOST
SSH Private Key Path of host (IP_WORKER_HOST) [none]:
SSH Private Key of host (IP_WORKER_HOST) [none]:
SSH User of host (IP_WORKER_HOST) [ubuntu]: root
Is host (IP_WORKER_HOST) a control host (y/n)? [y]: n
Is host (IP_WORKER_HOST) a worker host (y/n)? [n]: y
Is host (IP_WORKER_HOST) an Etcd host (y/n)? [n]: n
Override Hostname of host (IP_WORKER_HOST) [none]:
Internal IP of host (IP_WORKER_HOST) [none]:
Docker socket path on host (IP_WORKER_HOST) [/var/run/docker.sock]:
SSH Address of host (3) [none]: IP_WORKER_LB_HOST
SSH Private Key Path of host (IP_WORKER_LB_HOST) [none]:
SSH Private Key of host (IP_WORKER_LB_HOST) [none]:
SSH User of host (IP_WORKER_LB_HOST) [ubuntu]: root
Is host (IP_WORKER_LB_HOST) a control host (y/n)? [y]: n
Is host (IP_WORKER_LB_HOST) a worker host (y/n)? [n]: y
Is host (IP_WORKER_LB_HOST) an Etcd host (y/n)? [n]: n
Override Hostname of host (IP_WORKER_LB_HOST) [none]:
Internal IP of host (IP_WORKER_LB_HOST) [none]:
Docker socket path on host (IP_WORKER_LB_HOST) [/var/run/docker.sock]:
Network Plugin Type [flannel]:
Authentication Strategy [x509]:
Etcd Docker Image []:
Kubernetes Docker image [rancher/k8s:v1.8.3-rancher2]:
Cluster domain [cluster.local]:
Service Cluster IP Range []:
Cluster Network CIDR []:
Cluster DNS Service IP []:
Infra Container image []:

This will generate a cluster.yml file, which rke can use to set up the cluster. By default, Flannel is used as the CNI network plugin. To secure the Kubernetes components, rke generates certificates and configures the components to use them.

You can always check or edit the file (cluster.yml) if you made a typo or used the wrong IP address somewhere.
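For reference, a node entry in the generated cluster.yml looks roughly like this (an abridged, illustrative excerpt; your addresses, users and key paths will differ):

```yaml
nodes:
- address: IP_MASTER_HOST
  user: root
  role: [controlplane, etcd]
  ssh_key_path: ~/.ssh/id_rsa
- address: IP_WORKER_HOST
  user: root
  role: [worker]
```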

We are now ready to let rke create the cluster for us (specifying --config is only necessary when cluster.yml is not present in the directory where you are running the rke command):

$ ./rke_darwin-amd64 up --config cluster.yml
INFO[0000] Building Kubernetes cluster
INFO[0000] [ssh] Setup tunnel for host [IP_MASTER_HOST]
INFO[0000] [ssh] Setup tunnel for host [IP_MASTER_HOST]
INFO[0000] [ssh] Setup tunnel for host [IP_WORKER_HOST]
INFO[0001] [ssh] Setup tunnel for host [IP_WORKER_LB_HOST]
INFO[0001] [certificates] Generating kubernetes certificates
INFO[0001] [certificates] Generating CA kubernetes certificates
INFO[0002] [certificates] Generating Kubernetes API server certificates
INFO[0002] [certificates] Generating Kube Controller certificates
INFO[0002] [certificates] Generating Kube Scheduler certificates
INFO[0002] [certificates] Generating Kube Proxy certificates
INFO[0003] [certificates] Generating Node certificate
INFO[0003] [certificates] Generating admin certificates and kubeconfig
INFO[0003] [reconcile] Reconciling cluster state
INFO[0003] [reconcile] This is newly generated cluster
INFO[0263] Finished building Kubernetes cluster successfully

All done! Your Kubernetes cluster is up and running in under 5 minutes, and most of that time was spent on pulling the needed Docker images.


The most common way to interact with Kubernetes is using kubectl. After the cluster has been set up, rke generates a ready-to-use configuration file called .kube_config_cluster.yml, which you can use with kubectl. Before we can use this file, you will need to install kubectl. Please refer to the Kubernetes documentation on how to do this for your operating system.

Note: the Kubernetes documentation helps you place the downloaded binary in a directory in your $PATH. The following commands assume kubectl is in your $PATH.

When you have kubectl installed, make sure you execute the command in the rke directory (because we point to .kube_config_cluster.yml in that directory).

Now you can check the cluster by getting the node status:

$ kubectl --kubeconfig .kube_config_cluster.yml get nodes --show-labels
NAME              STATUS    ROLES         AGE       VERSION           LABELS
IP_MASTER_HOST   Ready     etcd,master   5m       v1.8.3-rancher1,,,,
IP_WORKER_HOST     Ready     worker        5m       v1.8.3-rancher1,,,
IP_WORKER_LB_HOST     Ready     worker        5m       v1.8.3-rancher1,,,

Note: as a reference to each node, we will be using IP_MASTER_HOST, IP_WORKER_HOST and IP_WORKER_LB_HOST to identify, respectively, the master, the worker, and the worker functioning as entrypoint (loadbalancer).

That's a three-node cluster ready to run some containers. In the beginning I noted that we are going to use one worker node as loadbalancer, but at this point we can't differentiate between the two worker nodes; both just have the role worker. Let's make the distinction possible by adding a label to the designated node:

$ kubectl --kubeconfig .kube_config_cluster.yml \
  label nodes IP_WORKER_LB_HOST role=loadbalancer
node "IP_WORKER_LB_HOST" labeled

Great, let's check if it was applied correctly:

$ kubectl --kubeconfig .kube_config_cluster.yml get nodes --show-labels
NAME              STATUS    ROLES         AGE       VERSION           LABELS
IP_MASTER_HOST   Ready     etcd,master   6m       v1.8.3-rancher1,,,,
IP_WORKER_HOST     Ready     worker        6m       v1.8.3-rancher1,,,
IP_WORKER_LB_HOST     Ready     worker        6m       v1.8.3-rancher1,,,,role=loadbalancer

Note: if you mistakenly applied the label to the wrong host, you can remove it by appending a minus to the label key (e.g. kubectl --kubeconfig .kube_config_cluster.yml label nodes IP_WORKER_LB_HOST role-)

Install and configure NGINX ingress controller


Helm is the package manager for Kubernetes and allows you to easily install applications on your cluster. Helm uses charts to deploy applications; a chart is a collection of files that describe a related set of Kubernetes resources. Helm has two components: a client (helm) and a server (tiller). Helm binaries are provided for all major platforms; download one and make sure it's available on your command line (move it to a location in your $PATH). When installed correctly, you should be able to run helm help from the command line.

We bootstrap Helm by using the helm client to install tiller to the cluster. The helm command can use the same Kubernetes configuration file generated by rke. We tell helm which configuration to use by setting the KUBECONFIG environment variable as shown below:

$ cd rke
$ KUBECONFIG=.kube_config_cluster.yml helm init
Creating /homedirectory/username/.helm
Creating /homedirectory/username/.helm/repository
Creating /homedirectory/username/.helm/repository/cache
Creating /homedirectory/username/.helm/repository/local
Creating /homedirectory/username/.helm/plugins
Creating /homedirectory/username/.helm/starters
Creating /homedirectory/username/.helm/cache/archive
Creating /homedirectory/username/.helm/repository/repositories.yaml
Adding stable repo with URL:
Adding local repo with URL:
$HELM_HOME has been configured at /homedirectory/username/.helm.

Tiller (the Helm server-side component) has been installed into your Kubernetes Cluster.
Happy Helming!

Assuming all went well, we can now check whether Tiller is running by asking for the running version. The Server line should return a version here, as this queries the server-side component (Tiller). It may take a minute for Tiller to start.

$ KUBECONFIG=.kube_config_cluster.yml helm version    
Client: &version.Version{SemVer:"v2.7.2", GitCommit:"8478fb4fc723885b155c924d1c8c410b7a9444e6", GitTreeState:"clean"}
Server: &version.Version{SemVer:"v2.7.2", GitCommit:"8478fb4fc723885b155c924d1c8c410b7a9444e6", GitTreeState:"clean"}

A little bit on Pods, Services and Service Types

Services enable service discovery within a Kubernetes cluster, providing an abstraction over one or more pods. What is a pod? A pod is a set of one or more containers (usually Docker containers) with shared networking and storage. If you just run pods in your cluster, you usually end up with two problems:

  • Scale: when running a single pod, you don't have any redundancy. You want a mechanism which ensures that a given number of pods is running, and which can scale when needed. We will talk more about this when we deploy our demo application later on.
  • Accessibility: which pods do you need to reach? (One static pod on one host is reachable, but what about pods that scale up and down, or get rescheduled?) And what IP address or name do you use to access the pod(s)?

By default, a Service has the service type ClusterIP, which means it gets an internally accessible IP that you can use to reach your pods. The Service knows which pods to target by using a Label Selector, which tells it which labels on the pods to match.

Other service types are:

  • NodePort: exposes the service on every host's IP, on a selected port or one randomly selected from the configured NodePort range (default: 30000-32767)
  • LoadBalancer: if a cloud provider is configured, this requests a loadbalancer from that cloud provider and configures it as the entrypoint. Cloud providers include AWS, Azure and GCE, among others.
  • ExternalName: this makes it possible to route a service to a predefined name outside the cluster by using a CNAME record in DNS.
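As a sketch of the NodePort type, a definition might look like this (the names example-svc and app: example are made up for illustration):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: example-svc
spec:
  type: NodePort
  selector:
    app: example
  ports:
  - port: 80          # port on the service's cluster IP
    targetPort: 8080  # port on the pod
    nodePort: 30080   # must fall within the configured NodePort range
```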

Installing NGINX ingress controller

As the NGINX ingress controller meets all of the technical requirements, it resides in the stable directory of Helm charts. As noted before, we labeled one node as our point of entry by applying the role=loadbalancer label to it. We'll pass that label to the Helm chart so the NGINX ingress controller gets placed on the correct node. By default, the NGINX ingress controller is created with service type LoadBalancer. Because we are assuming that you are running on-premise, this will not work: the LoadBalancer service type provisions a loadbalancer from the configured cloud provider, which we didn't configure and which is usually not available in an on-premise setup. Because of this, we will set the service type to ClusterIP using the --set controller.service.type=ClusterIP argument. Secondly, because we don't have an external loadbalancer to provide access to the services, we will configure the controller to use host networking; this way, the NGINX ingress controller will be reachable on the IP of the host. You can do so by setting controller.hostNetwork to true.

NOTE: another option is to use NodePort, which will use a port from the cluster-defined range (30000-32767). You can use an external loadbalancer to loadbalance to this port on the node. For the simplicity of this post, I went for hostNetwork.

$ KUBECONFIG=.kube_config_cluster.yml helm install stable/nginx-ingress \
--name nginx-ingress --set controller.nodeSelector."role"=loadbalancer --set controller.service.type=ClusterIP --set controller.hostNetwork=true

Run the following command to check if the deployment was successful; we should see the deployment roll out:

$ kubectl --kubeconfig .kube_config_cluster.yml rollout \
  status deploy/nginx-ingress-nginx-ingress-controller
deployment "nginx-ingress-nginx-ingress-controller" successfully rolled out

By default, the NGINX ingress controller chart also deploys a default backend, which returns default backend - 404 when no hostname matches a service. Let's test if the default backend was deployed successfully:

# First we get the loadbalancer IP (IP of the host running the NGINX ingress controller) and save it to variable $LOADBALANCERIP
$ LOADBALANCERIP=`kubectl --kubeconfig .kube_config_cluster.yml get node -l role=loadbalancer -o jsonpath={.items[*].status.addresses[?\(@.type==\"InternalIP\"\)].address}`
# Now we can curl that IP to see if we get the correct response
$ curl $LOADBALANCERIP
default backend - 404

Excellent, we reached the NGINX ingress controller. As there are no services defined, we get routed to the default backend which returns a 404.

Setup wildcard DNS entry

For this post, I decided to make a single host the entrypoint to the cluster. We applied the label role=loadbalancer to this host, and used it to schedule the deployment of the NGINX ingress controller. Now you can point a wildcard DNS record (*.kubernetes under your domain, for example) to this IP. This makes sure that the hostname we will use for our demo application ends up on the host running the NGINX ingress controller (our designated entrypoint). In DNS zone-file terms, the record would look like this (with the loadbalancer host's IP, the value we stored in $LOADBALANCERIP earlier, as the address):

*.kubernetes IN A

With this configured, you can try reaching the default backend by running curl against a hostname which resides under this wildcard record, i.e.

$ curl
default backend - 404

Running and accessing the demo application

A little bit on ReplicaSet, Deployment and Ingress

Before we deploy our demo application, some explanation of the terminology is needed. Earlier, we talked about Services and service types to provide access to your pod or group of pods, and noted that running bare pods is not a failure-tolerant way of running your workload. To improve this, we can use a ReplicaSet. The basic functionality of a ReplicaSet is to run a specified number of pods, which solves our problem of running single pods.

From the Kubernetes documentation:

While ReplicaSets can be used independently, today it’s mainly used by Deployments as a mechanism to orchestrate pod creation, deletion and updates. When you use Deployments you don’t have to worry about managing the ReplicaSets that they create. Deployments own and manage their ReplicaSets.

Deployments give us some other nice benefits, like checking the rollout status using kubectl rollout status. We will be using this when we deploy our demo application.

Last but not least, the Ingress. Usually, the components in your cluster will be for internal use. The components need to reach each other (web application, key value store, database) using the cluster network. But sometimes you want to reach cluster services from the outside (like our demo application). To make this possible, you need to deploy an Ingress definition. Deploying an Ingress definition without an Ingress controller gives you limited functionality; that's why we deployed the NGINX ingress controller. By adding the following key-value pair under annotations, we make sure the NGINX ingress controller picks up our Ingress definition: "nginx"

Deploy demo application

For this post, we are using a simple web application. When you visit this web application, the UI will show you every container serving requests for this web application.

Let's create the files necessary to deploy our application. We'll be using a Deployment to create a ReplicaSet with 2 replicas, and a Service to link our Ingress to. Save the following as docker-demo.yml in the rke directory.

apiVersion: apps/v1beta2
kind: Deployment
metadata:
  name: docker-demo-deployment
spec:
  selector:
    matchLabels:
      app: docker-demo
  replicas: 2
  template:
    metadata:
      labels:
        app: docker-demo
    spec:
      containers:
      - name: docker-demo
        image: ehazlett/docker-demo
        ports:
        - containerPort: 8080
---
apiVersion: v1
kind: Service
metadata:
  name: docker-demo-svc
spec:
  ports:
  - port: 8080
    targetPort: 8080
    protocol: TCP
  selector:
    app: docker-demo

Let's deploy this using kubectl:

$ kubectl --kubeconfig .kube_config_cluster.yml create -f docker-demo.yml
deployment "docker-demo-deployment" created
service "docker-demo-svc" created

Again, like in the previous deployment, we can query the deployment for its rollout status:

$ kubectl --kubeconfig .kube_config_cluster.yml rollout \
  status deploy/docker-demo-deployment
deployment "docker-demo-deployment" successfully rolled out

With this running, the web application is now accessible within the cluster. This is great when you need to connect web applications with backends like key-value stores, databases, et cetera. For now, we just want this web application to be available through our loadbalancer. As we've already deployed the NGINX ingress controller, we can now make our application accessible by using an Ingress resource. Let's create the ingress.yml file:

apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: docker-demo-ingress
  annotations: "nginx"
spec:
  rules:
  - host:
    http:
      paths:
      - path: /
        backend:
          serviceName: docker-demo-svc
          servicePort: 8080

This is a fairly standard Ingress definition: we define a name and rules to access an application. We define the host that should be matched, and which path should route to which backend service on which port. The annotation "nginx" tells the NGINX ingress controller that this Ingress resource should be processed. When this Ingress is created, the NGINX ingress controller will see it and process the rules (in this case, create a "vhost" and point the backend/upstream to the pods in the ReplicaSet created by the Deployment). This means that, after creation, you should be able to reach the web application on the hostname we configured. Let's launch the ingress and find out:

$ kubectl --kubeconfig .kube_config_cluster.yml create -f ingress.yml
ingress "docker-demo-ingress" created

Check out the web application on a hostname under the wildcard record you configured.

Wrapping up

Is this a full-blown production setup? No. Keep in mind that it will need some work, but hopefully you have gained a basic understanding of some of the core parts of Kubernetes, how to deploy Kubernetes using rke, how to use Helm, and the basics of the NGINX ingress controller. Let me give you some resources to continue the journey:

  • Try to scale your deployment to show more containers in the web application (kubectl scale -h)

  • rke supports HA, you can (and should) deploy multiple hosts with the controlplane and/or the etcd role.

  • Take a look at all the options of the NGINX ingress controller, see if it suits your needs

  • Explore how easy it is to use Let's Encrypt certificates on your ingresses by setting an extra annotation using kube-lego.

  • The NGINX ingress controller is currently a single point of failure (SPOF); explore how you can eliminate this. Most companies use some kind of external loadbalancer, which you could use for this.

  • Keep an eye on the Kubernetes Incubator Project external-dns, which can automatically create DNS records in supported providers.

  • To gain a deeper understanding of all the Kubernetes components, check out Kubernetes The Hard Way.
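As a starting point for the first suggestion above, scaling is a one-liner. A sketch, assuming the Deployment from this tutorial is named docker-demo (adjust to whatever kubectl get deployments reports):

```shell
# Scale the demo Deployment to 3 replicas; the extra pods are picked up
# automatically by the Service, and therefore by the Ingress.
kubectl --kubeconfig .kube_config_cluster.yml scale deployment docker-demo --replicas=3
```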

by Christopher Webber at December 24, 2017 03:24 PM

December 23, 2017

System Administration Advent Calendar

Day 23 - Open Source Licensing in the Real World

By: Carl Perry (@edolnx)
Edited By: Amy Tobey (@AlTobey)

Before we get started, I need to say a couple of things. I am not a lawyer. What I am sharing should not be considered legal advice. The idea of this article is to share my experiences to date and provide you resources you should use to start conversations with your lawyer, your boss, and your company’s legal team. You should also know that my experiences are primarily based on laws in the United States of America, the State of Texas, and the State of California. Your governmental structure may be different, but most of what I am going to talk about should be at least partially applicable no matter where you are. Now, with that out of the way, let’s get started!

Open Source vs Free Software

I don’t want to spend a lot of time on a philosophical debate about Free Software vs Open Source, so I’ll define how I use those terms in this article. By Free Software I mean things that either come from the Free Software Foundation (FSF) or use an FSF license. By Open Source I mean anything where the source code is made publicly available. Yes, this means that to me FSF projects are Open Source, and so are things like “shared source”. But the important part is the license, which is what we are here to talk about.

The mixing of two worlds that don’t always see eye to eye

Capitalism and Open Source Software don’t always mix well. There are exceptions, but most companies see software as no different from a brick: cheap, replaceable, frequently made by an outside provider, durable, and invisible (until it fails or goes missing). We know this is not the case, and that is where the problems begin. Many corporate lawyers treat software identically to any other corporate purchase: a thing with a perpetual license that places all liability on the provider. We know that’s not true either, especially with Open Source. That means education is super important.

Everything starts with copyright

Copyright has been around for a long time; the Berne Convention (which the United States joined in 1989) really forms the basis for the copyright we know today. Many countries ratified the treaty, and others are still bound to follow it through various other treaties. The basics are that any work is automatically copyrighted by its “author” for a minimum of 50 years, with no filing required. When something is copyrighted, all rights are reserved by the “author”. This is the crux of the problem: it means you cannot copy, reuse, integrate, or otherwise manipulate the work at all. This is where licensing comes in.

Licensing to the rescue?

Licensing a work is a contract that allows the “author” to grant rights to other parties, typically with restrictions. Many commercial software licenses simply allow use and redistribution of the work within an organization, in binary form only, in exchange for disclaiming warranty and liability. Open source and free software licenses work quite a bit differently. Back in the early days of computing, even commercial software included the source code; failing to include it usually meant you had something to hide, until some companies decided that source code was intellectual property not to be shared. But enough history for now; let’s get into open source and free software licenses and what they do.

Free Software licenses

The best-known Free Software licenses are the GNU General Public License and its derivatives (GPLv2, GPLv3, LGPLv2, LGPLv3, AGPLv3, et cetera). These are all based on a very clever legal hack called “copyleft”: the idea is to use a license to enforce the exact opposite of all rights reserved. Instead, the license guarantees access to the source, the ability to modify it, and the ability to redistribute it, while forbidding use in closed source projects and forbidding charging money for the software itself. There are exceptions, in that you can charge for distribution costs for example, but that’s pretty much it. It’s important to realize that this is a one-way street; things can start under some other license, but once code becomes GPL it cannot go back.

How businesses see the Free Software licenses

Typically, businesses do not have a favorable view of the GPL and its derivatives. Part of the reason is patents (the clauses are vaguely worded at best), but the biggest reason is the perception of “losing IP” and “competitive advantage”. Because you have to release all your changes, it can make life difficult. Great examples of this conflict are GPU and SoC drivers: there is a lot of proprietary tech in those, yet you have to make all the code that controls it open to the public.

There are bounding boxes, however. Any code that is derived from a Free Software based product must use the same or a compatible license. However, if your code is not based on a Free Software base, vendors often choose to make a Free Software shim that converts calls from the Free Software project (like the Linux kernel) to vendor proprietary binaries or code under a different license. This is how ZFS and the binary NVIDIA drivers work with the kernel. This is not always an option: the Linux kernel is a unique case, because its LICENSE file draws a very clear line of demarcation, stating that the public APIs are just that, public, and not subject to the GPLv2. Any project lacking that kind of clarification leaves you with the poorly defined “linking clause” to deal with.

“Ah, but the LGPL doesn’t have those restrictions!” someone is shouting at their keyboard. That’s not entirely true. The LGPL lacks the “linking clause”, which can help with adoption, but that’s it; the license is otherwise identical to the GPL. What does this mean in practice? Great question: that is left up to the “author”. The libssh project is an excellent example of how to do it well; it clearly spells out that you can use libssh in a commercial product as long as you do not modify the library itself (see their features page under the “License Explained” section). But if a project does not do this in its documentation, things can be ambiguous.

The Academic licenses

This is a term I use to lump the BSD, MIT, CERN, X11, and their cousin licenses in one place. These licenses are super simple, typically less than a paragraph long. They also don’t do much except limit liability and allow for royalty free redistribution.

How businesses see the Academic licenses

Most companies do not have a problem with these licenses. They are simple, and allow packaging into commercial products without issue. The only major sticking point is patents: many of these licenses do not cover them at all, which can be a problem for projects sponsored by a company. In general, though, using code under one of the Academic licenses tends to be a non-issue.

The Common Open Source licenses

This is another term I use, for things like the Apache v2 License, the Artistic License (used by Perl), the LLVM License, the Eclipse Public License, the Mozilla Public License, and a few others that are widely used and well understood. These are typically very much like the Academic licenses but were written by corporate lawyers and are thus verbose. That’s not a bad thing: their verbosity means they are very clear about what they are trying to do. Most also have provisions for patents.

How businesses see Common Open Source licenses

Much like the Academic licenses, these are typically not an issue. There are some options available for patent protection and/or indemnification under different licenses. Typically projects that are under some form of corporate sponsorship tend to use Apache v2 for these reasons over Academic licenses.

Creative Commons

So far, everything we have talked about has been for software generally, and source code specifically. But what about things like blog articles, artwork, sound files, videos, schematics, 3D objects, basically anything that isn’t code? That’s what Creative Commons is for. They have an excellent license chooser on their site: answer three questions and you are done. They have plain-text descriptions of what each license does, and well-structured legal text to back up those descriptions to keep the lawyers happy.

But why not just use one of the above licenses for things like 3D objects, PCB Gerbers, and schematics? Aren’t they just source code that uses a special compiler? Sort of, but for non-code items the base components are frequently remixed through other applications, so it’s not so cut and dried. CC licenses help with this immensely. There is also CC0, a “public domain work-alike” license (since it’s not always clear how to place something in the public domain, no matter how hard you try).

Every other license

There are far too many here to go over, but I’ll give some highlights: CDDL, Shared Source, the JSON License (yes, really), et cetera. My biggest lessons to pass along here are twofold: don’t create your own license, and if you stumble across one of these, you will need to start having conversations with lawyers. The CDDL was made so that the wonderful work in Sun Microsystems’ Solaris could not wind up in the Linux kernel; I’ll leave it to the reader to figure out how that worked out for them. The JSON License holds a special place in my heart: “The software shall be used for good, not evil.” Funny, right? Not so much. The maintainers of tools under the JSON License are routinely hounded by IBM, because IBM wants to use their tools but cannot guarantee that IBM or its clients will only use the software for good. Literally, people get calls from IBM lawyers every quarter about this. Not so funny now. Don’t invent licenses, and don’t be cute in them. For everyone’s sake.

Why do you always say “author” in quotes like that?

Let’s take a quick segue to talk about when the person who wrote something is not the “author”. If you are writing code as part of your job, you are likely working under a “work for hire” contract; you need to check this and what its bounds are. In Texas, companies can get away with just about anything in this realm, as the state does not have strong employee protection laws. California, on the other hand, has very strong laws about this: work done on your own time without using any company resources (like a work laptop or AWS account) cannot be claimed by the company as part of their work for hire. Texas is a lot less clear, and your local jurisdiction may vary. As another example, work created for the United States Government cannot be copyrighted by a company or individual. So you may not be the author of the code you write. It’s important to understand that, because of what we are about to talk about next….

Sometimes the license is not enough

Many commercially sponsored projects have additional clarifications put in place to deal with shortcomings of the license, or to assign copyright back to the sponsoring organization to simplify distribution issues. These are typically done using a Contributor License Agreement (CLA), usually managed by an out-of-band process and enforced using some form of gating system. But there are problems here as well: many projects want contributing to be as lightweight as possible, so they implement a technical solution (an example is the Developer Certificate of Origin). This is great for the expediency of individual contributors, but can be a real pain for corporate contributors.

Protecting yourself

OK, so now that I’ve probably scared you let’s talk about risk and how to mitigate it. Step one is get a lawyer. It’s a hard step, but if/when you need a lawyer it’s better to know who to call instead of getting someone who just deals with traffic tickets. It’s also important to point out that, in almost all cases, your company lawyer/legal team does not have your best interests in mind. They work for the company, not you. If you need help finding a lawyer that can understand this field, EFF is a great resource.

Second: understand your employment contract. You have one, even if you don’t think you do. Many companies have you sign a piece of paper that says you agree to the Employee Manual (or the like) when you join: congratulations, that makes the employee manual your employment contract. Understand if you are a work for hire employee or treated as a contractor. This has huge ramifications on what you can contribute to outside Open Source projects and who is the “author” in those cases.

Third: it’s likely your contract will not cover things like this. Fix that. Talk to your boss/manager and get an understanding of what the company’s expectations are, and what their expectations of you as an Open Source contributor are. It’s best to do this as part of your negotiations when being hired, but either way you need to have those conversations. It’s important to start with your boss/manager instead of legal, because the last thing you want to do is confuse or annoy legal. If you work in a large company, expect this process to take a while, and while it does, do not contribute to open source projects on company time or using company resources. I cannot stress that enough: if the company doesn’t want you contributing and you do, then you are on the hook, legally speaking. Be up front and transparent. If you have already contributed, start discussions now and stop contributing while you do. Hiding information is worse than making an honest mistake.

Fourth: Understand licenses used for projects you are using and/or contributing to. Also understand that if there is no license, you legally cannot use it. This includes Stack Overflow, Github, and random things found in search results. It’s much safer to find something that is properly licensed or to use what you find as a reference and reimplement the concepts yourself.

Fifth: Leave an audit trail. Did you get a chunk of code from somewhere? Link to it in a comment. Note where you are getting libraries and support applications from, and the licenses they use. If you need to use an open source piece of code but modify it (like a Chef cookbook or an Ansible playbook), then add a file noting the source (including the version, or better yet an immutable link such as a GitHub URL with the revision SHA), what you changed (just a list of files), and why, so that if it needs to be upgraded later it’s easier to understand what your past self was thinking.

Sixth: If you are coding in an organization, find out what their open source policy is (or help build one). At several places I have worked, the company didn’t care about open source licenses we used, but we occasionally had contracts with customers/vendors who did. One in particular had a “No GPLv3 code would be delivered” clause in our contract and that caused some issues. It’s important to make this known to avoid surprises later.

Protecting your organization

Protecting your organization doesn’t just mean protecting the company you work for, it’s just as important to protect the open source communities you participate in and contribute to. A lot of what I suggested in the last section works wonders for both. But there are a couple of other steps that can be effective depending on your level of involvement, or contractual obligations:

Dependency Audit

These two words tend to strike a sense of horror and dread into most developers. Don’t let them. There are some great tools out there to help in the popular languages, and you may discover some interesting surprises when you dive all the way down your dependency tree. Use of language-native package management (pip, npm, rubygems, et cetera) can make this easier; with other languages (like Java and C#) you may have a harder time and need to do a lot more manual work. You may also find that there are libraries in there you do not want due to license concerns, but don’t fret too much, as there are usually replacements. A great example of this is libreadline, which is used frequently in tools with an interactive CLI. Good old libreadline is GPL (2 or 3, depending on the version), but there exists the equivalent libedit, which is API compatible and BSD licensed. Things like that are a somewhat easy fix, though you may need to add more dependencies to your build pipeline to get the license coverage you desire. Some may be harder.

Legal Fiction

Do you run an open source project? You may want to build, or have it become part of, an organization to help protect it and you from a legal and financial standpoint. Examples of larger groups are the Free Software Foundation, Software in the Public Interest, the Linux Foundation, and the Apache Software Foundation. Others build their own, like the Blender Foundation and VideoLAN (VLC). The idea is that the legal fiction (read: company) can absorb most or all of the legal risk from the individual contributors. It’s not perfect, as the recent unpleasantness with netfilter shows, but it can help. Larger groups can even provide legal support without you going out on your own.

Contributor License Agreements

Your organization may want to implement a CLA for external (and internal) contributors, to do things like assign copyright to the organization or grant royalty-free patent licenses. If you are going to do something like that, think about your users. If you expect or want contributions from corporations, think really hard about making something easy for corporate legal teams to work with, rather than placing the onus solely on the contributor. As much of a pain as it can be, consider a Corporate Contributor License Agreement (CCLA), handled out of band from your other process, to allow contributions from an organization. This lets your lawyers and their lawyers work it out, and can be beneficial for larger orgs that don’t yet get Open Source.

Wrapping Up

This stuff is important, and it’s complicated, but it can be surmounted by anyone. To quote Lawrence Lessig’s excellent book title, “Code is Law”, and, using the transitive property, “Law is Code”. All these licenses are written in legal code, and like any other language they can (and should) be read. If you are inexperienced, start reading licenses; when you get confused, ask for help (preferably from a lawyer). The more we all understand this, the better the world will be. Thanks for your time, and feel free to reach out if you have questions!

- Carl (@edolnx on Twitter and on the HangOps Slack). There is a discussion of this on my blog:

by Christopher Webber at December 23, 2017 05:19 AM

December 22, 2017

LZone - Sysadmin

Using Linux keyring secrets from your scripts

When you write scripts that need to perform remote authentication, you don't want to include passwords in plain text in the script itself. And if the credentials are personal, you cannot ship them with the script anyway.


Since 2008 the Secret Service API has been standardized via freedesktop.org and is implemented by GNOME Keyring and KSecretService. Effectively, there is a standard interface for accessing secrets on Linux desktops.

Sadly the CLI tools are rarely installed by default, so you have to add them manually. On Debian:
apt install libsecret-tools

Using secret-tool

There are two important modes:

Fetching passwords

The "lookup" command prints the password to STDOUT
/usr/bin/secret-tool lookup <key> <name>

Storing passwords

Note that with "store" you do not pass the password on the command line; a dialog is raised to prompt for it.
/usr/bin/secret-tool store <key> <name>

Scripting with secret-tool

Here is a simple example Bash script to automatically ask, store and use a secret:

ST=/usr/bin/secret-tool
LOGIN="my-login"          # Unique id for your login
LABEL="My special login"  # Human readable label

get_password() {
    $ST lookup "$LOGIN" "$USER"
}

password=$( get_password )
if [ "$password" = "" ]; then
    $ST store --label "$LABEL" "$LOGIN" "$USER"
    password=$( get_password )
fi

if [ "$password" = "" ]; then
    echo "ERROR: Failed to fetch password!"
else
    echo "Credentials: user=$USER password=$password"
fi

Note that the secret will appear in the "Login" keyring. On GNOME you can check the secret with "seahorse".
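For scripts that must also run on headless machines (where no Secret Service daemon is available), a small fallback wrapper can help. This is only a sketch; the SECRET_OVERRIDE variable is an invention of this example, not part of secret-tool:

```shell
#!/bin/bash
# A lookup helper that prefers an explicit override variable, handy for
# CI or headless hosts where no keyring daemon is running.
lookup_secret() {
    local key="$1" name="$2"
    if [ -n "${SECRET_OVERRIDE:-}" ]; then
        # Explicit override set by the environment: use it directly.
        printf '%s' "$SECRET_OVERRIDE"
    elif command -v secret-tool >/dev/null 2>&1; then
        # Normal desktop case: ask the keyring.
        secret-tool lookup "$key" "$name"
    fi
}
```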

December 22, 2017 03:25 PM

December 21, 2017

Everything Sysadmin

Save 40-60% on the 3rd edition of TPOSANA

The 3rd edition of "Vol 1: The Practice of System and Network Administration" was nominated as a "2017 Community Favorite". To celebrate, you can get it 40-60% off between now and Jan 8, 2018.

Click this link and use code "FAVE"

See all the favorites here:

By the way... there haven't been many reviews of this book on Amazon, and none that mention the new content in Sections I, II, and III. If you've read the new edition and would like to post a review, we'd love to know your opinion (good or bad).

by Tom Limoncelli at December 21, 2017 07:46 PM

December 20, 2017

LZone - Sysadmin

How to install Helm on Openshift

This is a short summary of things to consider when installing Helm on Openshift.

What is Helm?

Before going into details: Helm is a self-proclaimed "Kubernetes Package Manager". While that is not entirely false, in my opinion it is three things
  • a package manager for chart repos
  • a Kubernetes API automation tool
  • a Kubernetes resource abstraction helper
On closer inspection it does much of what automation tools like Puppet, Chef, and Ansible do.

Current Installation Issues

Since Kubernetes v1.6.1, which introduced RBAC (role-based access control), it has become harder to properly install Helm. The simple installation suggested on the homepage
# Download and...
helm init
seems to work, but as soon as you run commands like
helm list
you get permission errors, caused by the tighter access control now in place. Sadly, even with Kubernetes now at 1.8, Helm still hasn't been updated to set up the proper permissions.

Openshift to the rescue...

As Red Hat somewhat pioneered RBAC in Openshift with their namespace-based "projects" concept, they are also the ones with a good solution for the Helm RBAC troubles.

Setting up Helm on Openshift

Client installation (helm)

curl -s | tar xz
sudo mv linux-amd64/helm /usr/local/bin
sudo chmod a+x /usr/local/bin/helm

helm init --client-only

Server installation (tiller)

With helm being only the client, Helm needs a server-side agent named "tiller" on the Kubernetes cluster. Therefore we create a project (namespace) for this agent and install it with "oc create":
export TILLER_NAMESPACE=tiller
oc new-project tiller
oc project tiller
oc process -f -p TILLER_NAMESPACE="${TILLER_NAMESPACE}" | oc create -f -
oc rollout status deployment tiller
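At this point it is worth a quick sanity check that the client can reach the agent (assuming the rollout above succeeded):

```shell
# helm version reports both the client and the server (tiller) version;
# if the server line is missing, tiller is not reachable.
export TILLER_NAMESPACE=tiller
helm version
```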

Preparing your projects (namespaces)

Finally you have to give tiller access to each of the namespaces you want someone to manage using helm:
export TILLER_NAMESPACE=tiller
oc project 
oc policy add-role-to-user edit "system:serviceaccount:${TILLER_NAMESPACE}:tiller"
After you did this you can deploy your first service, e.g.
helm install stable/redis --namespace 

See also

December 20, 2017 08:48 PM

December 19, 2017

Cryptography Engineering

The strange story of “Extended Random”

Yesterday, David Benjamin posted a pretty esoteric note on the IETF’s TLS mailing list. At a superficial level, the post describes some seizure-inducingly boring flaws in older Canon printers. To most people that was a complete snooze. To me and some of my colleagues, however, it was like that scene in X-Files where Mulder and Scully finally learn that aliens are real.

Those fossilized printers confirmed a theory we’d developed in 2014, but had been unable to prove: namely, the existence of a specific feature in RSA’s BSAFE TLS library called “Extended Random” — one that we believe to be evidence of a concerted effort by the NSA to backdoor U.S. cryptographic technology.

Before I get to the details, I want to caveat this post in two different ways. First, I’ve written about the topic of cryptographic backdoors way too much. In 2013, the Snowden revelations revealed the existence of a campaign to sabotage U.S. encryption systems. Since that time, cryptographers have spent thousands of hours identifying, documenting, and trying to convince people to care about these backdoors. We’re tired and we want to do more useful things.

The second caveat covers a problem with any discussion of cryptographic backdoors. Specifically, you never really get absolute proof. There’s always some innocent or coincidental explanation that could sort of fit the evidence — maybe it was all a stupid mistake. So you look for patterns of unlikely coincidences, and use Occam’s razor a lot. You don’t get a Snowden every day.

With all that said, let’s talk about Extended Random, and what this tells us about the NSA. First some background.


To understand the context of this discovery, you need to know about a standard called Dual EC DRBG. This was a proposed random number generator that the NSA developed in the early 2000s. It was standardized by NIST in 2007, and later deployed in some important cryptographic products — though we didn’t know it at the time.

Dual EC has a major problem, which is that it likely contains a backdoor. This was pointed out in 2007 by Shumow and Ferguson, and effectively confirmed by the Snowden leaks in 2013. Drama ensued. NIST responded by pulling the standard. (For an explainer on the Dual EC backdoor, see here.)

Somewhere around this time the world learned that RSA Security had made Dual EC the default random number generator in their popular cryptographic library, which was called BSAFE. RSA hadn’t exactly kept this a secret, but it was such a bonkers thing to do that nobody (in the cryptographic community) had known. So for years RSA shipped their library with this crazy algorithm, which made its way into all sorts of commercial devices.

The RSA drama didn’t quite end there, however. In late 2013, Reuters reported that RSA had taken $10 million to backdoor their software. RSA sort of denies this. Or something. It’s not really clear.

Regardless of the intention, it’s known that RSA BSAFE did incorporate Dual EC. This could have been an innocent decision, of course, since Dual EC was a NIST standard. To shed some light on that question, in 2014 my colleagues and I decided to reverse-engineer the BSAFE library to see if the alleged backdoor in Dual EC was actually exploitable by an attacker like the NSA. We figured that specific engineering decisions made by the library designers could be informative in tipping the scales one way or the other.

It turns out they were.

Extended Random

In the course of reverse engineering the Java version of BSAFE, we discovered a funny inclusion. Specifically, we found that BSAFE supports a non-standard extension to the TLS protocol called “Extended Random”.

The Extended Random extension is an IETF Draft proposed by an NSA employee named Margaret Salter (at some point the head of NSA’s Information Assurance Directorate, which worked on “defensive” crypto for DoD) along with Eric Rescorla as a contractor. (Eric was very clearly hired to develop a decent proposal that wouldn’t hurt TLS, and would primarily be used on government machines. The NSA did not share their motivations with him.)

It’s important to note that Extended Random by itself does not introduce any cryptographic vulnerabilities. All it does is increase the amount of random data (“nonces”) used in a TLS protocol connection. This shouldn’t hurt TLS at all, and besides it was largely intended for U.S. government machines.

The only thing that’s interesting about Extended Random is what happens when that random data is generated using the Dual EC algorithm. Specifically, this extra data acts as “rocket fuel”, significantly increasing the efficiency of exploiting the Dual EC backdoor to decrypt TLS connections.

In short, if you’re an agency like the NSA that’s trying to use Dual EC as a backdoor to intercept communications, you’re much better off with a system that uses both Dual EC DRBG and Extended Random. Since Extended Random was never standardized by the IETF, it shouldn’t be in any systems. In fact, to the best of our knowledge, BSAFE is the only system in the world that implements it.

In addition to Extended Random, we discovered a variety of features that, combined with the Dual EC backdoor, could make RSA BSAFE fairly easy to exploit. But Extended Random is by far the strangest and hardest to justify.

So where did this standard come from? For those who like technical mysteries, it turns out that Extended Random isn’t the only funny-smelling proposal the NSA made. It’s actually one of four failed IETF proposals made by NSA employees, or contractors who work closely with the NSA, all of which try to boost the amount of randomness in TLS. Thomas Ptacek has a mind-numbingly detailed discussion of these proposals and his view of their motivation in this post.

Oh my god I never thought spies could be so boring. What’s the new development?

Despite the fact that we found Extended Random in RSA BSAFE (a free version we downloaded from the Internet), a fly in the ointment was that it didn’t actually seem to be enabled. That is: the code was there but the switches to enable it were hard-coded to “off”.

This kind of put a wrench in our theory that RSA might have included Extended Random to make BSAFE connections more exploitable by the NSA. There might be some commercial version of BSAFE out there with this code active, but we were never able to find it or prove it existed. And even worse, it might appear only in some special “U.S. government only” version of BSAFE, which would tend to undermine the theory that there was something intentional about including this code — after all, why would the government spy on itself?

Which finally brings us to the news that appeared on the TLS mailing list the other day. It turns out that certain Canon printers are failing to respond properly to connections made using the new version of TLS (which is called 1.3), because they seem to have implemented an unauthorized TLS extension using the same number as an extension that TLS 1.3 needs in order to operate correctly. Here’s the relevant section of David’s post:

The web interface on some Canon printers breaks with 1.3-capable
ClientHello messages. We have purchased one and confirmed this with a
PIXMA MX492. User reports suggest that it also affects PIXMA MG3650
and MX495 models. It potentially affects a wide range of Canon

These printers use the RSA BSAFE library to implement TLS and this
library implements the extended_random extension and assigns it number
40. This collides with the key_share extension and causes 1.3-capable
handshakes to fail.

So in short, this news appears to demonstrate that commercial (non-free) versions of RSA BSAFE did deploy the Extended Random extension, and made it active within third-party commercial products. Moreover, they deployed it to ordinary machines, off-the-shelf commercial printers, that don’t seem to be reserved for any kind of special government use.

(If these turn out to be special Department of Defense printers, I will eat my words.)

Ironically, the printers are now the only thing that still exhibits the features of this (now deprecated) version of BSAFE. This is not because the NSA was targeting printers. Whatever devices they were targeting are probably gone by now. It’s because printer firmware tends to be obsolete and yet highly persistent. It’s like a remote pool buried beneath the arctic circle that preserves software species that would otherwise vanish from the Internet.

Which brings us to the moral of the story: not only are cryptographic backdoors a terrible idea, but they totally screw up the assigned numbering system for future versions of your protocol.

Actually no, that’s a pretty useless moral. Instead, let’s just say that you can deploy a cryptographic backdoor, but it’s awfully hard to control where it will end up.

by Matthew Green at December 19, 2017 08:22 PM

December 16, 2017

Steve Kemp's Blog

IoT radio: Still in-progress ..

So back in September I was talking about building an IoT radio, and after that I switched to talking about tracking aircraft via software-defined radio. Perhaps it's time for a follow-up.

So my initial attempt at an IoT radio was designed around the RDA5807M module. Frustratingly the damn thing was too small to solder easily! Once I did get it working, though, I found that either the specs lied to me or I'd misunderstood them: it wouldn't drive headphones, and performance was poor. (Though amusingly, the first time I got it working I managed to tune to Helsinki's rock station, and the first thing I heard was Rammstein's Amerika.)

I made another attempt with an Si4703-based "evaluation board". This was a board which had most of the stuff wired in, so all you had to do was connect an MCU to it, and do the necessary software dancing. There was a headphone-socket for output, and no need to fiddle with the chip itself, it was all pretty neat.

Unfortunately the evaluation board was perfect for basic use, but not at all suitable for real use. The board did successfully output audio to a pair of headphones, but unfortunately it required the use of headphones, as the cable would be treated as an antenna. As soon as I fed the output of the headphone-jack to an op-amp to drive some speakers I was beset with the kind of noise that makes old people reminisce about how music was better back in their day.

So I'm now up to round 3. I have a TEA5767-based project in the works, which should hopefully resolve my problems:

  • There are explicit output and aerial connections.
  • I know I'll need an amplifier.
  • The hardware is easy to control via arduino/esp8266 MCUs.
    • Numerous well-documented projects exist using this chip.

The only downside I can see is that I have to use the op-amp for volume control too - the TEA5767-chip allows you to mute/unmute via software but doesn't allow you to set the volume. Probably for the best.

In unrelated news I've got some e-paper which is ESP8266/arduino controlled. I have no killer-app for it, but it's pretty great. I should write that up sometime.

December 16, 2017 10:00 PM

December 12, 2017


Digital Doorknobs

Doorknobs are entering the Internet of (unsecured) Things.

However, they've been there for quite some time already. As anyone who has been in a modern hotel at any time in the last 30 years knows, metal keys are very much a thing of the past. The hotel industry made this move for a lot of reasons, a big one being that a plastic card is a lot easier to replace than an actual key.

They've also been there for office access for probably longer, as anyone who has ever had to wave their butt or purse at a scan-pad beside a door knows. Modern versions are beginning to get smartphone hookups, allowing an expensive (but employee-owned) smartphone with an app on it and Bluetooth enabled to replace that cheap company-owned prox-pass.

They're now moving into residences, and I'm not a fan of this trend. Most of my objection comes from being in Operations for as long as I have. The convenience argument for internet-enabling your doorknob is easy to make:

  • Need emergency maintenance when you're on vacation? Allow the maintenance crew in from your phone!
  • Assign digital keys to family members you can revoke when they piss you off!
  • Kid get their phone stolen? Revoke the stolen key and don't bother with a locksmith to change the locks!
  • Want the door to unlock just by walking up to it? Enable Bluetooth on your phone, and the door will unlock itself when you get close!

This is why these systems are selling.


I'm actually mostly OK with the security model on these things. The internals I've looked at involved PKI and client certificates. When a device like a phone gets a key, that signed client cert is allowed to access a thingy. If that phone gets stolen, revoke the cert at the CA and the entire thing is toast. The conversation between the device and the mothership is done over a TLS connection using client-certificate authentication, which is actually more secure than most banks' website logins.

The handshake over Bluetooth is similarly cryptoed, making it less vulnerable to replay attacks.

Where we run into problems is the intersection of life-safety and the flaky nature of most residential internet connections. These things need to be able to let people in the door even when CenturyLink is doing that thing it does. If you err on the side of getting in the door, you end up caching valid certs on the lock devices themselves, opening them up to offline attacks if you can jam their ability to phone home. If you err on the side of security, an internet outage is a denial-of-access attack.

The Real Objection

It comes down to the differences in the hardware and software replacement cycles, as well as certain rare but significant events like a change of ownership. The unpowered deadbolt in your front door could be 20 years old. It may be vulnerable to things like bump-keys, but you can give the pointy bits of metal (keys) to the next residents on your way to your new place and never have to worry about it. The replacement cycle on the whole deadbolt is probably the same as the replacement cycle of the owners, which is to say 'many years'. The pin settings inside the deadbolt may get changed more often, but the whole thing doesn't get changed much at all.

Contrast this with the modern software ecosystem, where if your security product hasn't had an update in 6 months it's considered horribly out of date. At the same time, due to the iterative nature of most SaaS providers and the APIs they maintain, an API version may get 5 years of support before getting shut down. Build a hardware fleet based on that API, and you have a hardware fleet that ages at the rate of software. Suddenly, that deadbolt needs a complete replacement every 5 years, and costs about 4x what the unpowered one did.

Most folks aren't used to that. In fact, they'll complain about it. A lot.

There is another argument to make about embedded systems (like that smart deadbolt) and their ability to keep up with ever more computationally expensive cryptography, not to mention changing radio specs like Bluetooth and WiFi that will render old doorknobs unable to speak to the newest iPhone. Which is to say, definitely expect Google and Apple to put out doorknobs in the not too distant future. Amazon is already trying.

All of this makes doorknob makers salivate, since it means more doorknobs will be sold per year. And the analytics over how people use their doors? Priceless. Capitalism!

It also means that doorknob operators, like homeowners, are going to be in for a lot more maintenance work to keep them running. Work that didn't used to be there before. Losing a phone is pretty clear, but what happens when you sell your house?

You can't exactly 'turn over the keys' if they're 100% digital and locked into your Google or Apple identities. Doorknob makers are going to have to have voluntary ownership-transfer protocols.

Involuntary transfer protocols are going to be a big thing. If the old owners didn't transfer, you could be locked out of the house. That could mean a locksmith coming in to break in to your house, and having to replace every deadbolt in the place with brand new. Or it could mean arguing with Google over who owns your home and how to prove it.

Doing it wrong has nasty side-effects. If you've pissed off the wrong people on the internet, you could have griefers coming after your doorknob provider, and you could find yourself completely locked out of your house. The more paranoid will have to get Enterprise contracts and manage their doorknobs themselves so they have full control over the authentication and auth-bypass routes.

Personally, I don't like that added risk-exposure. I don't want my front door able to be socially engineered out of my control. I'll be sticking with direct-interaction token based authentication methods instead of digitally mediated digital token auth methods.

by SysAdmin1138 at December 12, 2017 10:36 PM

Everything Sysadmin

DevOpsDays New York City 2018 Speakers Announced!

Exciting news from the D-O-D-NYC committee!

  • Speakers announced. Wow! I've never seen such an amazing lineup of speakers!
  • The best of the best. The committee this year was flooded with so many amazing proposals but sadly it is a 2-day conference so they had to be very selective. Who benefits? You!
  • Early bird discount ends on Friday. Register soon and save!

DevOpsDays-NYC 2018 is Thu/Fri January 18-19, 2018 in midtown Manhattan. Easy to get to via all forms of public transportation.

For more information:

by Tom Limoncelli at December 12, 2017 07:49 PM

December 05, 2017

Ben's Practical Admin Blog

Monitoring Windows Server Interactive Logins

I’m sure many of you realise that for systems with high value in terms of information held or impact to business due to outage or data breach, you would probably want to crank up the monitoring of such systems. Best practices say you should pretty much monitor all activity associated with local users and groups, but today I want to focus on interactive logins to servers.

This has mainly come about from my own need recently to provide the ability to notify on any interactive login to a particular server, be it using remote desktop or a console session.

My first thought was to create a SCOM rule that would report on Security Log EventID 4624 and, if the Logon Type was 2 (interactive/console logon) or 10 (RemoteInteractive/RDP logon), send an email. As it turned out, this was much harder than I expected: I found that the Logon Type was not being consistently passed as a parameter, and doing a text search on the entire message is not good practice.

My next trick was to go back to the Windows event log itself. From Windows Server 2008 onwards, you can attach a scheduled task that is triggered by an EventID. However, support for sending email directly from the task was deprecated in Windows Server 2012, to the point where you cannot use it. This is not as big a deal as it sounds, as you can just use the Send-MailMessage cmdlet to achieve the same thing in a script, and attach the script as the triggered task.

However, EventID 4624 can be quite verbose. There is a lot of information in the event message, and PowerShell would need to parse it and turn it into something readable. This, my friends, is where things, while definitely possible, could have turned very messy.

It was while I was poking around in all the other available logs that ship with Windows Server that I came across Microsoft-Windows-TerminalServices-LocalSessionManager, and from there discovered EventID 21, which had a message that looked something like this:

Remote Desktop Services: Session logon succeeded:

User: domain\user
Session ID: 3
Source Network Address:

Perfect! Perhaps I can get SCOM to simply monitor this log for this EventID, and I won’t have to then filter based on parameter or details in the message! Oh if only…

The version of SCOM currently in use where I work has no support for Windows event logs outside of the usual Application, Security and System. So back to PowerShell scripts and EventID-triggered scheduled tasks we go!

The Powershell looks something like this:

$logentry = get-winevent -filterhashtable @{ Logname = 'Microsoft-Windows-TerminalServices-LocalSessionManager/Operational'; ID = 21} -MaxEvents 1
$logArray = $logentry.Message.Split("`n")

[string]$emailSubject = ("Local Login to CA - " +($logarray |select -Index 2)).Trim()
$emailBody = $logentry.message
$emailFrom = ""
$emailTo = ""
$smtp = ""

Send-MailMessage -To $emailTo -From $emailFrom -Subject $emailSubject -Body $emailBody -SmtpServer $smtp

As you can see, there is still a little bit of manipulation of the event message, mainly to split each line into an array to format the subject. Get-WinEvent is used to access the log as, like SCOM, Get-Eventlog only deals with the 3 main log files.

From here, it is just a matter of selecting the task in the event viewer, and choosing to “attach task to this event…” from the action pane on the right:

A wizard will then guide you through the steps.

I’ll be the first to admit there are probably better ways to do this – 3rd-party tools; heck, SCOM 2016 may well support these log names now. This however works within the constraints I have. It also is not something that would scale, and I acknowledge that.

If there are better solutions, why not leave a comment below to discuss – I’d love to hear from you!


by Ben at December 05, 2017 09:40 PM

December 04, 2017

Evaggelos Balaskas

Install Signal Desktop to Archlinux

How to install Signal Desktop on Arch Linux

Download Signal Desktop

eg. latest version v1.0.41

$ curl -s \
    -o /tmp/signal-desktop_1.0.41_amd64.deb

Verify Package

There is a way to manually verify the integrity of the package, by checking the hash value of the file against a gpg signed file. To do that we need to add a few extra steps in our procedure.

Download Key from the repository

$ wget -c

--2017-12-11 22:13:34--
Loaded CA certificate '/etc/ssl/certs/ca-certificates.crt'
Connecting to connected.
Proxy request sent, awaiting response... 200 OK
Length: 3090 (3.0K) [application/pgp-signature]
Saving to: ‘keys.asc’

keys.asc                          100%[============================================================>]   3.02K  --.-KB/s    in 0s      

2017-12-11 22:13:35 (160 MB/s) - ‘keys.asc’ saved [3090/3090]

Import the key to your gpg keyring

$ gpg2 --import keys.asc

gpg: key D980A17457F6FB06: public key "Open Whisper Systems <>" imported
gpg: Total number processed: 1
gpg:               imported: 1

you can also verify/get public key from a known key server

$ gpg2 --verbose --keyserver --recv-keys 0xD980A17457F6FB06

gpg: data source:
gpg: armor header: Version: SKS 1.1.6
gpg: armor header: Comment: Hostname:
gpg: pub  rsa4096/D980A17457F6FB06 2017-04-05  Open Whisper Systems <>
gpg: key D980A17457F6FB06: "Open Whisper Systems <>" not changed
gpg: Total number processed: 1
gpg:              unchanged: 1

Here the key is already in place, so no changes.

Download Release files

$ wget -c

$ wget -c

Verify Release files

$ gpg2 --no-default-keyring --verify Release.gpg Release

gpg: Signature made Sat 09 Dec 2017 04:11:06 AM EET
gpg:                using RSA key D980A17457F6FB06
gpg: Good signature from "Open Whisper Systems <>" [unknown]
gpg: WARNING: This key is not certified with a trusted signature!
gpg:          There is no indication that the signature belongs to the owner.
Primary key fingerprint: DBA3 6B51 81D0 C816 F630  E889 D980 A174 57F6 FB06

That means that the Release file is signed by Whisper Systems and the integrity of the file has not been changed/compromised.

Download Package File

We need one more file and that is the Package file that contains the hash values of the deb packages.

$ wget -c

But is this file compromised?
Let’s check it against the Release file:

$ sha256sum Packages

ec74860e656db892ab38831dc5f274d54a10347934c140e2a3e637f34c402b78  Packages

$ grep ec74860e656db892ab38831dc5f274d54a10347934c140e2a3e637f34c402b78 Release

 ec74860e656db892ab38831dc5f274d54a10347934c140e2a3e637f34c402b78     1713 main/binary-amd64/Packages

yeay !

Verify deb Package

Finally we are now ready to manually verify the integrity of the deb package:

$ sha256sum signal-desktop_1.0.41_amd64.deb

9cf87647e21bbe0c1b81e66f88832fe2ec7e868bf594413eb96f0bf3633a3f25  signal-desktop_1.0.41_amd64.deb

$ egrep 9cf87647e21bbe0c1b81e66f88832fe2ec7e868bf594413eb96f0bf3633a3f25 Packages

SHA256: 9cf87647e21bbe0c1b81e66f88832fe2ec7e868bf594413eb96f0bf3633a3f25

Perfect, we are now ready to continue

Extract under tmp filesystem

$ cd /tmp/

$ ar vx signal-desktop_1.0.41_amd64.deb

x - debian-binary
x - control.tar.gz
x - data.tar.xz

Extract data under tmp filesystem

$ tar xf data.tar.xz

Move Signal-Desktop under root filesystem

$ sudo mv opt/Signal/ /opt/Signal/


Actually, that’s it!


Run signal-desktop as a regular user:

$ /opt/Signal/signal-desktop

Signal Desktop



Define your proxy settings on your environment:

declare -x ftp_proxy=""
declare -x http_proxy=""
declare -x https_proxy=""



Tag(s): signal, archlinux

December 04, 2017 10:41 PM

Perl to go

I have been using Perl for more than 20 years now; I have seen Perl 4 bow out and Perl 5 come in, and have developed in that fantastic language, which has helped me uncountable times in my professional life. During those years I've also considered learning another language, but I was unable to take a stand for a long time.

And then came Go and the hype around Go, just like years ago there was a lot of hype around Java. But while whatever Java software I came across was a big, heavy and slow memory eater, most of the tools I came across that were written in Go were actually good stuff — OK, still a bit bloated in size, but they actually worked. The opportunity came, and I finally gave Go a shot.

A false start

The opportunity arose when early this year it was announced that an introductory course in Go was being organised in our company. I immediately confirmed my interest and bought plane tickets when the date was confirmed.

When the day finally arrived I flew to Trondheim, only to discover that the target of the course had been silently changed and the course was way more specialized than expected. Yes, you can say I was annoyed. In an attempt not to waste the trip and time, I started the Tour of Go there and then and went through the first few chapters and exercises. It looked great. For a while.

A tour of go

The Tour of Go aims to be an introductory course, and to a point it actually is. But when you get far enough you'll notice that some exercises assume more knowledge of the language than was explained in the previous chapters: whoever wrote the more advanced lessons was disconnected, either from the previous part of the course or from reality. I vented my frustrations in a tweet:

I liked neither Golang's syntax nor the tour; frustration was growing, and my learning experience was in danger. If I wanted to keep going I needed to find another way, and for my particular mindset the best way is often a good book. It was time to find one.

Introducing Go

I did some research and finally settled on Doxsey’s “Introducing Go” from O’Reilly, a simple book, so thin that Posten delivered it straight into my mailbox (instead of the usual pick-up notice)! The first chapters were simple indeed, and I already knew most of their content from the Tour of Go, so I just skimmed through. Later chapters were still simple, but also informative and to the point, with exercises at the end that were well in line with the content.

I got to the end of the book reasonably quickly, and it was now time for a “final project”. Those who know me, know that I don’t like “Hello world!”-class exercises. To check my learning I want something that is easy enough to be possible for a beginner, but challenging enough to put my new knowledge to the test. I considered a few options and decided to reproduce hENC in Go.

hENC: a recap

For those who don’t know or don’t remember, hENC was a project of mine, the  radically simple hierarchical External Node Classifier (ENC) for CFEngine. It’s a very simple script written in pure Perl. In 79 lines of code (47 of actual code) it reads a list of files in a format similar to CFEngine’s module protocol and merges their content in a hierarchical fashion. The output is then used by CFEngine to classify the node and set variables according to the node’s role in the infrastructure.

hENC reads files, finds data through pattern matching, applies some logic, fills up data structures and prints out the result: no rocket science but still a real-world program. Reproducing hENC in Go had all that it takes for a decent exercise.

NB: hENC is silent on conditions that would otherwise be errors (e.g. missing input files). That is by design. Any output from the program (including errors!) is piped into CFEngine: if you don’t want to fill your CFEngine logs with tons of messages about non-compliant messages in the module’s output you need to make it as silent as possible. That’s why you don’t find any error logging in the Perl version, and all of the messages via the log package are commented out in the Go version.

hENC in Go

Perl and Go are built around very different philosophies: where Perl is redundant and allows for dozens of ways to express the same thing, Go chooses to allow only one or very few (think loops, think control structures…); where many frequently used functions are part of the core language in Perl (e.g. print() and file handling functions like open(), close() and so forth), Go has them in packages that, although part of the core, you have to import explicitly into the program. It’s a difference that is very visible by comparing the sources of the programs and from the very start: the Perl version starts by using two pragmata and then goes straight to the data structures:

use strict ;
use warnings ;

my %class ;    # classes container
my %variable ; # variables container

The Go version spends some lines to import before it gets to the same point:

package main

import "fmt"    // to print something...
import "os"     // to read command-line args, opening files...
import "bufio"  // to read files line by line, see
// import "log"
import "regexp"

var class    = make(map[string]int)     // classes container
var variable = make(map[string]string)  // variables container

Perl takes advantage of the diamond construct to read from the files as if they were a single one, without even explicitly opening them:

while (my $line = <>) {

Not so in Go:

var encfiles = os.Args[1:]

func main() {
	// prepare the regex for matching the ENC setting
	settingRe := regexp.MustCompile(`^\s*([=\@%+-/_!])(.+)\s*$`)

	// prepare the regex for matching a variable assignment
	varRe := regexp.MustCompile(`^(.+?)=`)

	// iterate over files
File:
	for _, filename := range encfiles {
		// try to open, fail silently if it doesn't exist
		file, err := os.Open(filename)
		if err != nil {
			// error opening this file, skip and...
			continue File
		}
		defer file.Close()

		// Read file line by line.
		// Dammit Go, isn't this something that one does often
		// enough to deserve the simplest way to do it???
		// Anyway, here we go with what one can find in
		scanner := bufio.NewScanner(file)
	Line:
		for scanner.Scan() {
			err := scanner.Err()
			if err != nil {
				// log.Printf("Error reading file %s: %s", filename, err)
				break Line
			}

			// no need to "chomp()" here, the newline is already gone
			line := scanner.Text()
The previous code snippet also includes the preparation of two regular expression patterns that will be used later in the game. This is a notable difference from Perl: Perl is on the minimalist side (a single instruction does pattern matching, sub-match extraction and so forth), while Go introduces a number of functions and methods to do the same job. Regular expressions are an already complicated subject and definitely don't need any additional mess: like many other languages, Go should take some lessons from Perl on this subject.

An area where Go tends to be cleaner than Perl is where you can use the built-in switch/case construct instead of if:

			case `+`:
				// add a class, assume id is a class name
				class[id] = 1

			case `-`:
				// undefine a class, assume id is a class name
				class[id] = -1

Perl’s equivalent given/when construct is still experimental; a Switch module is provided on CPAN and was in the core distribution in the past, but its use is discouraged in favour of the experimental given/when construct… uhm…

Switch/case aside, the Perl version of hENC was designed to run on any recent and not-so-recent Perl, so it uses the plain old if construct:

    # define a class
    if ($setting eq '+') {
	# $id is a class name, or should be.
	$class{$id} = 1 ;
    }

    # undefine a class
    if ($setting eq '-') {
	# $id is a class name, or should be.
	$class{$id} = -1 ;
    }

though there are still places where Perl is a bit clearer and more concise than Go:

    # reset the status of a class
    if ($setting eq '_') {
	# $id is a class name, or should be.
	delete $class{$id} if exists $class{$id} ;
    }


			case `_`:
				// reset the class, if it's there
				_,ok := class[id]
				if ok {
					delete(class,id)
				}
You can find the full source of gohENC at the end of the post. By the way, the gohENC source is 140 lines with extensive comments (80 lines of actual code, nearly twice as many as the Perl version).


I got gohENC completed through a few short sessions and it was time to test if it really worked. That was an easy task, since hENC comes with a test suite. All I had to do was to compile the Go source, replace the Perl version with the binary and run the tests:

$ prove --exec "sudo cf-agent -KC -f" ./
./ .. ok   
All tests successful.
Files=1, Tests=8,  0 wallclock secs ( 0.05 usr  0.01 sys +  0.08 cusr  0.00 csys =  0.14 CPU)
Result: PASS

Success! gohENC is live!

Why not Java?

I have often considered Java, but never got to love it. I felt the language, the tools and the ecosystem were unnecessarily complicated and the Java programs I have used in the past didn’t make me love the language either.

Why not Python?

I have always been surrounded by “pythonists” and considered Python, too, but was kind of discouraged by the fact that Python 3’s popularity wasn’t really taking off, while learning Python 2.7 seemed like a waste of time because its successor was already there.

Why not Ruby?

The only time I touched Ruby was when I tried to write some Puppet facts: the code I saw at the time didn’t impress me and I tried to stay away from Ruby ever since.

Why not JavaScript?

Because I was unsure about how much I could use it to help me with my job, and outside of web pages anyway.

Why not PHP?


Why not Perl6?

Perl 6, the new kid on the block, seems great and powerful, but not really something that would add an edge in my CV unfortunately.

Source code for gohENC

package main

import "fmt"    // to print something...
import "os"     // to read command-line args, opening files...
import "bufio"  // to read files line by line, see
// import "log"
import "regexp"

var class    = make(map[string]int)     // classes container
var variable = make(map[string]string)  // variables container

var encfiles = os.Args[1:]

func main() {
	// prepare the regex for matching the ENC setting
	settingRe := regexp.MustCompile(`^\s*([=\@%+-/_!])(.+)\s*$`)

	// prepare the regex for matching a variable assignment
	varRe := regexp.MustCompile(`^(.+?)=`)

	// iterate over files
File:
	for _, filename := range encfiles {
		// try to open, fail silently if it doesn't exist
		file, err := os.Open(filename)
		if err != nil {
			// error opening this file, skip and...
			continue File
		}
		defer file.Close()

		// Read file line by line.
		// Dammit Go, isn't this something that one does often
		// enough to deserve the simplest way to do it???
		// Anyway, here we go with what one can find in
		scanner := bufio.NewScanner(file)
	Line:
		for scanner.Scan() {
			err := scanner.Err()
			if err != nil {
				// log.Printf("Error reading file %s: %s", filename, err)
				break Line
			}

			// no need to "chomp()" here, the newline is already gone
			line := scanner.Text()

			// Dear Go, regular expression are already
			// complicated, there is absolutely NO need for you to
			// make them even more fucked up...
			// Sixteen functions to do pattern matching... so much
			// for your fucking minimalism!
			match := settingRe.FindStringSubmatch(line)
			if match == nil {
				// not a setting: discard the line
				continue Line
			}

			setting, id := match[1], match[2]
			// log.Printf("setting: %s, value: %s", setting, id)

			switch setting {
			case `!`:
				// take a command
				switch id {
				case `RESET_ALL_CLASSES`:
					// flush the class cache
					// ...which means: kill all key/values
					// recorded in the classes map.
					// In Go, you're better off overwriting it
					// with a new map, so...
					class = make(map[string]int)

				case `RESET_ACTIVE_CLASSES`:
					// remove active classes from the cache
					for k, v := range class {
						if v > 0 {
							delete(class, k)
						}
					}

				case `RESET_CANCELLED_CLASSES`:
					// remove cancelled classes from the cache
					for k, v := range class {
						if v < 0 {
							delete(class, k)
						}
					}
				} // switch id

			case `+`:
				// add a class, assume id is a class name
				class[id] = 1

			case `-`:
				// undefine a class, assume id is a class name
				class[id] = -1

			case `_`:
				// reset the class, if it's there
				_, ok := class[id]
				if ok {
					delete(class, id)
				}

			case `=`, `@`, `%`:
				// define a variable/list
				match := varRe.FindStringSubmatch(id)
				varname := match[1] // not necessary, just clearer
				variable[varname] = line

			case `/`:
				// reset a variable/list
				_, ok := variable[id]
				if ok {
					delete(variable, id)
				}
			} // switch setting
			// discard the rest
		}
	}

	// print out classes
	class[`henc_classification_completed`] = 1
	for classname, value := range class {
		switch {
		case value > 0:
			fmt.Println("+" + classname)
		case value < 0:
			fmt.Println("-" + classname)
		}
	}

	// print variable/list assignments, the last one wins
	for _, assignment := range variable {
		fmt.Println(assignment)
	}
}

Tagged: Configuration management, golang, henc, Perl, programming

by bronto at December 04, 2017 08:00 AM

November 29, 2017

Feeding the Cloud

Proxy ACME challenges to a single machine

The Libravatar mirrors are set up using DNS round-robin, which makes it a little challenging to automatically provision Let's Encrypt certificates.

In order to be able to use Certbot's webroot plugin, I need to be able to simultaneously host a randomly-named file in the webroot of each mirror. The reason is that there's no way to know which of the DNS entries the verifier will hit. I could copy the file over to all of the mirrors, but that would be annoying since some of the mirrors are run by volunteers and I don't have direct access to them.

Thankfully, Scott Helme has shared his elegant solution: proxy the .well-known/acme-challenge/ directory from all of the mirrors to a single validation host. Here's the exact configuration I ended up with.

DNS Configuration

In order to serve the certbot validation files separately from the main service, I created a new hostname,, pointing to the main Libravatar server:

CNAME acme

Mirror Configuration

On each mirror, I created a new Apache vhost on port 80 to proxy the acme challenge files by putting the following in the existing port 443 vhost config (/etc/apache2/sites-available/libravatar-seccdn.conf):

<VirtualHost *:80>
    ServerAdmin __WEBMASTEREMAIL__

    ProxyPass /.well-known/acme-challenge/
    ProxyPassReverse /.well-known/acme-challenge/
</VirtualHost>

Then I enabled the right modules and restarted Apache:

a2enmod proxy
a2enmod proxy_http
systemctl restart apache2.service

Finally, I added a cronjob in /etc/cron.daily/commit-new-seccdn-cert to commit the new cert to etckeeper automatically:

cd /etc/libravatar
/usr/bin/git commit --quiet -m "New seccdn cert" seccdn.crt seccdn.pem seccdn-chain.pem > /dev/null || true

Main Configuration

On the main server, I created a new webroot:

mkdir -p /var/www/acme/.well-known

and a new vhost in /etc/apache2/sites-available/acme.conf:

<VirtualHost *:80>
    DocumentRoot /var/www/acme
    <Directory /var/www/acme>
        Options -Indexes
    </Directory>
</VirtualHost>

before enabling it and restarting Apache:

a2ensite acme
systemctl restart apache2.service

Registering a new TLS certificate

With all of this in place, I was able to register the cert easily using the webroot plugin on the main server:

certbot certonly --webroot -w /var/www/acme -d

The resulting certificate will then be automatically renewed before it expires.
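On Debian-style systems the certbot package ships its own renewal cron job or systemd timer; if you ever need to wire it up by hand, a crontab entry along these lines works (the schedule and the deploy hook are my assumptions, not part of the original setup):

```
# /etc/crontab fragment (illustrative): attempt renewal twice a day and
# restart Apache only when a certificate was actually renewed.
17 3,15 * * * root certbot renew --quiet --deploy-hook "systemctl restart apache2.service"
```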

November 29, 2017 06:10 AM

November 27, 2017


Terraforming in prod

Terraform from HashiCorp is something we've been using in prod for a while now. Simply ages in terraform-years, which means we have some experience with it.

It also means we have some seriously embedded legacy problems, even though the project is less than two years old. That's the problem with rapidly iterating infrastructure projects that don't build in production use cases from the very start. You see, a screwdriver is useful in production! It turns screws and occasionally opens paint cans. You'd think that would be enough. But production screwdrivers conform to external standards for screwdrivers, are checked in and out of central tooling because quality control saves everyone from screwdriver-related injuries, and have support for either outright replacement or refacing of the tool-face. Charity Majors has a great screed on this you should read. But I wanted to share how we use it, and our pains.

In the beginning

I did our initial Terraform work in the TF6 (0.6) days. The move to production happened about when TF7 came out. You can see how painfully long ago that was (or wasn't). It was a different product back then.

Terraform Modules were pretty new back when I did the initial build. I tried them, but couldn't get them to work right. At the time I told my coworkers:

They seem to work like puppet includes, not puppet defines. I need them to be defines, so I'm not using them.

I don't know if I had a fundamental misunderstanding back then or if that's how they really worked. But they're defines now, and all correctly formatted TF infrastructures use them or are seen as terribly unstylish. Not using them means there is a lot of repeating-myself in our infrastructure.

Because we already had an AMI baking pipeline that worked pretty well, we never bothered with Terraform Provisioners. We built ours entirely around making the AWS assets versionable. We tried with CloudFormation, but gave that up due to the terrible horrible no good very bad edge cases that break iterations. Really, if you have to write Support to unfuck an infrastructure because CF can't figure out the backout plan (and that backout is obvious to you), then you have a broken product. When Terraform gets stuck, it just throws up its hands and says HALP! Which is fine by us.

Charity asked a question in that blog-post:

Lots of people seem to eventually end up wrapping terraform with a script.  Why?

I wrote it for two big, big reasons.

  1. In TF6, there was zero support for sharing a Terraform statefile between multiple people (without paying for Atlas), and that critically needs to be done. So my wrapper implemented the sharing layer. Terraform now supports several methods for this out of the box; it didn't use to.
  2. Terraform is a fucking foot-gun with a flimsy safety on the commit-career-suicide button. It's called 'terraform destroy' and has no business being enabled for any reason in a production environment, ever. My wrapper makes getting at this deadly command require a minute or two of intentionally circumventing the safety mechanism. Which is a damned sight better than the routine "Are you sure? y/n" prompt we're all conditioned to just click past. Of course I'm in the right directory! Yes!
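As an illustration of the second point, here is a stripped-down sketch of such a safety. This is not the actual wrapper; the function name and the confirmation scheme are invented for the example:

```shell
#!/bin/sh
# Hypothetical destroy-guard: refuse `terraform destroy` unless the operator
# types the environment name (from TF_ENV_NAME) back at the prompt.
# All other subcommands pass straight through to terraform.
tf_guard() {
    if [ "$1" = "destroy" ]; then
        printf 'Type the environment name (%s) to confirm destroy: ' "$TF_ENV_NAME" >&2
        read -r answer
        if [ "$answer" != "$TF_ENV_NAME" ]; then
            echo "Confirmation failed; refusing to run destroy." >&2
            return 1
        fi
    fi
    terraform "$@"
}
```

Invoked as `tf_guard plan` or `tf_guard apply` it is transparent; `tf_guard destroy` demands a deliberate, typed confirmation instead of a reflexive y/n.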

And then there was legacy.

We're still using that wrapper-script. Partly because reimplementing it for the built-in statefile sharing is, like, work and what we have is working. But also because I need those welded on fire-stops on the foot-gun.

But we're not using modules, and we really should be. Integrating them, however, is a long and laborious process whose benefits we haven't yet seen enough of to outweigh the risk. To explain why, I need to explain a bit about how Terraform works.

You define resources in Terraform, like a security group with rules. When you do an 'apply', Terraform checks the statefile to see if the resource has been created yet and what state it was last seen in. It then compares that last known state with the current state of the infrastructure to determine what changes need to be made. Pretty simple. The name of a resource in the statefile follows a clear format. For non-module resources the string is "resource_type.resource_name", so our security group example would be "aws_security_group.prod_gitlab". For module resources it gets longer, along the lines of "module_name.resource_type.resource_name" (the exact format is definitely not bulk-sed friendly, but this works for the example I'm about to share). If you change the name of a resource, Terraform's diff shows the old resource disappearing and a brand new one appearing, and it treats it as such. Sometimes this is what you want. Other times, like when they're your production load-balancers and delete-and-recreate means a multi-minute outage, you don't.

To do a module conversion, this is the general workflow.

  1. Import the module and make your changes, but don't apply them yet.
  2. Use 'terraform state list' to get a list of the resources in your statefile, and note the names of the resources to be moved into modules.
  3. Use 'terraform state rm' to remove the old resources.
  4. Use 'terraform import' to import the existing resources into the statefile under their now module-based names.
  5. Use 'terraform plan' to make sure there are zero changes.
  6. Commit your changes to the terraform repo and apply.
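Concretely, for a single security group the surgery might look like this. The resource address, module name, and security group id are all illustrative, not from our infrastructure; the actual CLI subcommands are `terraform state list`, `terraform state rm`, and `terraform import`:

```shell
# Hypothetical state surgery for one resource being moved into a module.
# Run only while holding the statefile lock.
convert_one() {
    # confirm the old address exists in the statefile
    terraform state list | grep 'aws_security_group.prod_gitlab'
    # drop the old address, then re-import under the module-based address
    terraform state rm aws_security_group.prod_gitlab
    terraform import module.gitlab.aws_security_group.prod_gitlab sg-0123456789abcdef0
    # exit code 0 from -detailed-exitcode means "no changes pending"
    terraform plan -detailed-exitcode
}
```

Multiply that by every resource the module touches and you can see where the hours go.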

Seems easy, except.

  • You need to lock the statefile so no one else can make changes when you're doing this. Critically important if you have automation that does terraform actions.
  • This lock could last a couple of hours depending on how many resources need to be modified.
  • This assumes you know how terraform statefiles work with resource naming, so you need someone experienced with terraform to do this work.
  • Your modules may do some settings subtly differently than you did before, so it may not be a complete null-change.
  • Some resources, like Application Load Balancers, require a heartbreaking number of resources to define, which makes for a lot of import work.
  • Not all resources even have an import developed. Those resources will have to be deleted and recreated.
  • Step 1 is much larger than you think, due to dependencies from other resources that will need updating for the new names. Which means you may visit step 1 a few times by the time you get a passing step 5.
  • This requires a local working Terraform setup, outside of your wrapper scripts. If your wrapper is a chatbot and no one has a local TF setup, this will need to be done on the chatbot instance. The fact of the matter is that you'll have to point the foot-gun at your feet for a while when you do this.
  • This is not a change that can be done through Terraform's new change-management-friendly way of packaging changes, so it will be a 'comprehensive' change-request when it comes.

Try coding that into a change-request that will pass muster with the auditors. In theory it is possible to code up a bash-script that will perform the needed statefile changes automatically, but it would be incredibly fragile in the face of other changes to the statefile as the CR works its way through the process. This is why we haven't converted to a more stylish infrastructure; the intellectual purity of being stylish doesn't yet outweigh the need to not break prod.

What it's good for

Charity's opinion is close to my own:

Terraform is fantastic for defining the bones of your infrastructure.  Your networking, your NAT, autoscaling groups, the bits that are robust and rarely change.  Or spinning up replicas of production on every changeset via Travis-CI or Jenkins -- yay!  Do that!

But I would not feel safe making TF changes to production every day.  And you should delegate any kind of reactive scaling to ASGs or containers+scheduler or whatever.  I would never want terraform to interfere with those decisions on some arbitrary future run.

Yes. Terraform is best used in cases where doing an apply won't cause immediate outages or instabilities. Even using it the way we are, without provisioners, means following some rules:

  • Only define 'aws_instance' resources if we're fine with those suddenly disappearing and not coming back for a couple of minutes. Because if you change the AMI, or the userdata, or any number of other details, Terraform will terminate the existing one and make a new one.
    • Instead, use autoscaling-groups and a process outside of Terraform to manage the instance rotations.
  • It's fine to encode scheduled-scaling events on autoscaling groups, and even dynamic-scaling triggers on them.
  • Rotating instances in an autoscaling-group is best done in automation outside of terraform.
  • Playing pass-the-IP for Elastic-IP addresses is buggy and may require a few 'applies' before they fully move to the new instances.
  • Cache-invalidation on the global Internet's DNS caches is still buggy as fuck, though getting better. Plan around that.
  • Making some changes may require multiple phases. That's fine, plan for that.
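For the instance-rotation rule, here is a minimal sketch of doing it outside Terraform with the aws-cli. The ASG name and the crude sleep are illustrative; real tooling should poll until the replacement instance is InService before moving on:

```shell
# Hypothetical rolling replacement: terminate ASG members one at a time and
# let the autoscaling group rebuild them from the current launch config.
rotate_asg() {
    asg=$1
    for id in $(aws autoscaling describe-auto-scaling-instances \
          --query "AutoScalingInstances[?AutoScalingGroupName=='$asg'].InstanceId" \
          --output text); do
        aws autoscaling terminate-instance-in-auto-scaling-group \
            --instance-id "$id" --no-should-decrement-desired-capacity
        sleep 60   # crude pause; poll the ASG for InService in real tooling
    done
}
```

Because the desired capacity is not decremented, the ASG notices the missing instance and launches a replacement on the new configuration.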

The biggest strength of Terraform is that it looks a lot like Puppet, but for your AWS config. Our auditors immediately grasped that concept and embraced it like they've known about Terraform since forever. Because if some engineer cowboys in a change outside of the CR process, Terraform will back that out the next time someone does an apply, much the way puppet will back out a change to a file it manages. That's incredibly powerful, and something CloudFormation only sort of does.

The next biggest strength is that it is being very actively maintained and tracks AWS API changes pretty closely. When Amazon announces a new service, Terraform will generally have support for it within a month (not always, but most of the time). If the aws-cli can do it, Terraform will also be able to do it; if not now, then very soon.

While there are some patterns it won't let you do, like having two security-groups point to each other in their ingress/egress lists (that's a dependency loop), there is huge scope in what it will let you do.

This is a good tool and I plan to keep using it. Eventually we'll do a module conversion somewhere, but that may wait until they have a better workflow for it. Which may be in a month, or half a year. This project is moving fast.

by SysAdmin1138 at November 27, 2017 04:42 PM


The Ultimate Apollo Guidance Computer Talk @ 34C3

After The Ultimate Commodore 64 Talk (2008) and The Ultimate Game Boy Talk (2016), my third talk from the “Ultimate” series will take place at the 34th Chaos Communication Congress at Leipzig (27-30 Dec 2017):

The Apollo Guidance Computer (“AGC”) was used onboard the Apollo spacecraft to support the Apollo moon landings between 1969 and 1972. This talk explains “everything about the AGC”, including its quirky but clever hardware design, its revolutionary OS, and how its software allowed humans to reach and explore the moon. 

The talk will be presented by me (Michael Steil) and hessi. Date and time are subject to final scheduling. I will post updates as well as further information about the Apollo Guidance Computer on this blog in the next weeks.

You can read a more detailed abstract and vote on the talks on the 34C3 scheduling page.

by Michael Steil at November 27, 2017 02:01 PM

November 24, 2017

Sarah Allen

exploring ghostscript API in C

Ghostscript lets you do all sorts of PDF and PostScript transformations. It’s got a command-line tool which is great for page-level operations.

Installation on a mac is pretty easy with homebrew:

brew install ghostscript

The syntax for the commands is not very memorable, but it's easy once you know it. To get a PNG from a PDF:

gs -sDEVICE=pngalpha -o output.png input.pdf
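Other conversions follow the same pattern. For instance, pulling a single page out into its own PDF; I've wrapped the command in a small helper here (the function name is my own, the flags are standard Ghostscript options):

```shell
# extract_page N in.pdf out.pdf -- copy page N of in.pdf to out.pdf
# using the pdfwrite device (assumes gs is on the PATH).
extract_page() {
    gs -sDEVICE=pdfwrite -dFirstPage="$1" -dLastPage="$1" -o "$3" "$2"
}
```

For example, `extract_page 1 input.pdf page1.pdf` writes the first page out as a standalone PDF.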

We can also use the API to call the engine from C code. Note: to use this from a program we need to publish our source code (or purchase a license from the nice folks who create and maintain Ghostscript), which seems fair to me. See license details.

To do the exact same thing as above, I created a small program based on the example in the API doc, which I’ve posted on github.

What I really want to do is to replace text in a PDF (using one PDF as a template to create another). It seems like all the code I need is available in the Ghostscript library, but maybe not exposed in usable form:

  • Projects Seeking Developers Driver Architecture was the only place in the docs that I learned that we can’t add a driver without modifying the source code: “Currently, drivers must be linked into the executable.” Might be nice for these to be filed as bugs so interested developers might discuss options here. Of course, not sure that making a driver is a good solution to my problem at all.
  • There’s an option -dFILTERTEXT that removes all text from a PDF that I thought might provide a clue. I found the implementation in gdevoflt.c with a comment that it was derived from gdevflp.c.
  • gdevflp: This device is the first ‘subclassing’ device; the intention of subclassing is to allow us to develop a ‘chain’ or ‘pipeline’ of devices, each of which can process some aspect of the graphics methods before passing them on to the next device in the chain.

So, this appears to require diving into the inner workings of Ghostscript, yet the code seems to be structured so that it is easily modifiable for exactly this kind of thing. It seems like it would be possible to add a filter that modifies text, rather than just deleting it, as long as the layout is unaffected. This implies setting up to build and debug Ghostscript from source, and the potential investment of staying current with the codebase, which might not work for my very intermittent attention.

by sarah at November 24, 2017 02:56 PM

November 20, 2017

toolsmith #129 - DFIR Redefined: Deeper Functionality for Investigators with R - Part 2

You can have data without information, but you cannot have information without data. ~Daniel Keys Moran

Here we resume our discussion of DFIR Redefined: Deeper Functionality for Investigators with R as begun in Part 1.
First, now that my presentation season has wrapped up, I've posted the related material on the Github for this content. I've specifically posted the most recent version as presented at SecureWorld Seattle, which included Eric Kapfhammer's contributions and a bit of his forward thinking for next steps in this approach.
When we left off last month I parted company with you in the middle of an explanation of analysis of emotional valence, or "the intrinsic attractiveness (positive valence) or averseness (negative valence) of an event, object, or situation", using R and the Twitter API. It's probably worth your time to go back and refresh with the end of Part 1. Our last discussion point was specific to the popularity of negative tweets versus positive tweets with a cluster of emotionally neutral retweets, two positive retweets, and a load of negative retweets. This type of analysis can quickly give us a better understanding of an attacker collective's sentiment, particularly where the collective is vocal via social media. Teeing off the popularity of negative versus positive sentiment, we can assess the actual words fueling such sentiment analysis. It doesn't take much R code to achieve our goal using the apply family of functions. The likes of apply, lapply, and sapply allow you to manipulate slices of data from matrices, arrays, lists and data frames in a repetitive way without having to use loops. The code here comes directly from Michael Levy, Social Scientist, and his Playing with Twitter Data post.

polWordTables = 
  sapply(pol, function(p) {
    words = c(positiveWords = paste(p[[1]]$pos.words[[1]], collapse = ' '), 
              negativeWords = paste(p[[1]]$neg.words[[1]], collapse = ' '))
    gsub('-', '', words)  # Get rid of nothing found's "-"
  }) %>%
  apply(1, paste, collapse = ' ') %>% 
  stripWhitespace() %>% 
  strsplit(' ') %>%
  sapply(table)

par(mfrow = c(1, 2))
lapply(1:2, function(i) {
  dotchart(sort(polWordTables[[i]]), cex = .5)
  mtext(names(polWordTables)[i])
})

The result is a tidy visual representation of exactly what we learned at the end of Part 1, results as noted in Figure 1.

Figure 1: Positive vs negative words
Content including words such as killed, dangerous, infected, and attacks is definitely more interesting to readers than words such as good and clean. Sentiment like this could definitely be used to assess potential attacker outcomes and behaviors just prior to, or in the midst of, an attack, particularly in DDoS scenarios. Couple sentiment analysis with the ability to visualize networks of retweets and mentions, and you could zoom in on potential leaders or organizers. The larger the network node, the more retweets, as seen in Figure 2.

Figure 2: Who is retweeting who?
Remember our initial premise, as described in Part 1, was that attacker groups often use associated hashtags and handles, and the minions that want to be "part of" often retweet and use the hashtag(s). Individual attackers either freely give themselves away, or often become easily identifiable or associated, via Twitter. Note that our dominant retweets are for @joe4security, @HackRead,  @defendmalware (not actual attackers, but bloggers talking about attacks, used here for example's sake). Figure 3 shows us who is mentioning who.

Figure 3: Who is mentioning who?
Note that @defendmalware mentions @HackRead. If these were actual attackers it would not be unreasonable to imagine a possible relationship between Twitter accounts that are actively retweeting and mentioning each other before or during an attack. Now let's assume @HackRead might be a possible suspect and you'd like to learn a bit more about possible additional suspects. In reality @HackRead HQ is in Milan, Italy. Perhaps Milan then might be a location for other attackers. I can feed in Twitter handles from my retweet and mentions network above, query the Twitter API with a very specific geocode, and lock it within five miles of the center of Milan.
The results are immediate per Figure 4.

Figure 4: GeoLocation code and results
Obviously, as these Twitter accounts aren't actual attackers, their retweets aren't actually pertinent to our presumed attack scenario, but they definitely retweeted @computerweekly (seen in retweets and mentions) from within five miles of the center of Milan. If @HackRead were the leader of an organization, and we believed that associates were assumed to be within geographical proximity, geolocation via the Twitter API could be quite useful. Again, these are all used as thematic examples, no actual attacks should be related to any of these accounts in any way.

Fast Frugal Trees (decision trees) for prioritizing criticality

With the abundance of data, and often subjective or biased analysis, there are occasions where a quick, authoritative decision can be quite beneficial. Fast-and-frugal trees (FFTs) to the rescue. FFTs are simple algorithms that facilitate efficient and accurate decisions based on limited information.
Nathaniel D. Phillips, PhD created FFTrees for R to allow anyone to easily create, visualize and evaluate FFTs. Malcolm Gladwell has said that "we are suspicious of rapid cognition. We live in a world that assumes that the quality of a decision is directly related to the time and effort that went into making it.” FFTs, and decision trees at large, counter that premise and aid in the timely, efficient processing of data with the intent of a quick but sound decision. As with so much of information security, there is often a direct correlation with medical, psychological, and social sciences, and the use of FFTs is no different. Often, predictive analysis is conducted with logistic regression, used to "describe data and to explain the relationship between one dependent binary variable and one or more nominal, ordinal, interval or ratio-level independent variables." Would you prefer logistic regression or FFTs?

Figure 5: Thanks, I'll take FFTs
Here's a textbook information security scenario, often rife with subjectivity and bias. After a breach, and a subsequent third-party risk assessment that generated a ton of CVSS data, make a fast decision about what treatments to apply first. Because everyone loves CVSS.

Figure 6: CVSS meh
Nothing like a massive table, scored by base, impact, exploitability, temporal, environmental, modified impact, and overall scores, all assessed by a third party assessor who may not fully understand the complexities or nuances of your environment. Let's say our esteemed assessor has decided that there are 683 total findings, of which 444 are non-critical and 239 are critical. Will FFTrees agree? Nay! First, a wee bit of R code.

cvss <- read.csv("C:/coding/R/cvss.csv")  # adjust path as needed
cvss.fft <- FFTrees(formula = critical ~ ., data = cvss)
plot(cvss.fft, what = "cues",
     main = "CVSS FFT",
     decision.names = c("Non-Critical", "Critical"))

Guess what, the model landed right on impact and exploitability as the most important inputs, and not just because it's logically so, but because of their position when assessed for where they fall in the area under the curve (AUC), where the specific curve is the receiver operating characteristic (ROC). The ROC is a "graphical plot that illustrates the diagnostic ability of a binary classifier system as its discrimination threshold is varied." As for the AUC, accuracy is measured by the area under the ROC curve where an area of 1 represents a perfect test and an area of .5 represents a worthless test. Simply, the closer to 1, the better. For this model and data, impact and exploitability are the most accurate as seen in Figure 7.

Figure 7: Cue rankings prefer impact and exploitability
The fast-and-frugal tree made its decision as seen in Figure 8: findings with impact and exploitability scores of 2 or less are non-critical, and those with exploitability greater than 2 are labeled critical.

Figure 8: The FFT decides
Ah hah! Our FFT sees things differently than our assessor. With a 93% average for performance fitting (this is good), our tree, making decisions on impact and exploitability, decides that there are 444 non-critical findings and 222 critical findings, a 17 point differential from our assessor. Can we all agree that mitigating and remediating critical findings can be an expensive proposition? If you, with just a modicum of data science, can make an authoritative decision that saves you time and money without adversely impacting your security posture, would you count it as a win? Yes, that was rhetorical.

Note that the FFTrees function automatically builds several versions of the same general tree that make different error trade-offs with variations in performance fitting and false positives. This gives you the option to test variables and make potentially even more informed decisions within the construct of one model. Ultimately, fast frugal trees make very fast decisions on 1 to 5 pieces of information and ignore all other information. In other words, "FFTrees are noncompensatory, once they make a decision based on a few pieces of information, no additional information changes the decision."

Finally, let's take a look at monitoring user logon anomalies in high volume environments with Time Series Regression (TSR). Much of this work comes courtesy of Eric Kapfhammer, our lead data scientist on our Microsoft Windows and Devices Group Blue Team. The ideal Windows Event ID for such activity is clearly 4624: an account was successfully logged on. This event is typically one of the top 5 events in terms of volume in most environments, and has multiple type codes including Network, Service, and RemoteInteractive.
User accounts will begin to show patterns over time, in aggregate, including:
  • Seasonality: day of week, patch cycles, 
  • Trend: volume of logons increasing/decreasing over time
  • Noise: randomness
You could look at 4624 with a Z-score model, which sets a threshold based on the number of standard deviations away from an average count over a given period of time; the higher the value, the greater the degree of “anomalousness”. But this is a fairly simple model.
Preferably, via Time Series Regression (TSR), your feature set is more rich:
  • Statistical method for predicting a future response based on the response history (known as autoregressive dynamics) and the transfer of dynamics from relevant predictors
  • Understand and predict the behavior of dynamic systems from experimental or observational data
  • Commonly used for modeling and forecasting of economic, financial and biological systems
How to spot the anomaly in a sea of logon data?
Let's imagine our user, DARPA-549521, in the SUPERSECURE domain, with 90 days of aggregate 4624 Type 10 events by day.

Figure 9: User logon data
With 210 lines of R, including comments, log read, file output, and graphing, we can visualize and alert on DARPA-549521's data as seen in Figure 10.

Figure 10: User behavior outside the confidence interval
We can detect when a user's account exhibits changes in its seasonality as it relates to a confidence interval established (learned) over time. In this case, on 27 AUG 2017, the user topped her threshold of 19 logons, thus triggering an exception. Now imagine using this model to spot anomalous user behavior across all users and you get a good feel for the model's power.
Eric points out that there are, of course, additional options for modeling including:
  • Seasonal and Trend Decomposition using Loess (STL)
    • Handles any type of seasonality ~ can change over time
    • Smoothness of the trend-cycle can also be controlled by the user
    • Robust to outliers
  • Classification and Regression Trees (CART)
    • Supervised learning approach: teach trees to classify anomaly / non-anomaly
    • Unsupervised learning approach: focus on top-day hold-out and error check
  • Neural Networks
    • LSTM / Multiple time series in combination
These are powerful next steps in your capabilities. I want you to be brave, be creative: go forth and add elements of data science and visualization to your practice. R and Python are well supported and broadly used for this mission and can definitely help you detect attackers faster, contain incidents more rapidly, and enhance your in-house detection and remediation mechanisms.
All the code that I can share is here; sorry, I can only share the TSR example without the source.
All the best in your endeavors!
Cheers...until next time.

by Russ McRee at November 20, 2017 12:27 AM

November 17, 2017

Colin Percival

FreeBSD/EC2 on C5 instances

Last week, Amazon released the "C5" family of EC2 instances, continuing their trend of improving performance by both providing better hardware and reducing the overhead associated with virtualization. Due to the significant changes in this new instance family, Amazon gave me advance notice of their impending arrival several months ago, and starting in August I had access to (early versions of) these instances so that I could test FreeBSD on them. Unfortunately the final launch date took me slightly by surprise — I was expecting it to be later in the month — so there are still a few kinks which need to be worked out for FreeBSD to run smoothly on C5 instances. I strongly recommend that you read the rest of this blog post before you use FreeBSD on EC2 C5 instances. (Or possibly skip to the end if you're not interested in learning about any of the underlying details.)

November 17, 2017 01:45 AM

November 11, 2017

OpenVPN with Private Internet Access and port forwarding


This post will show my setup using PIA (Private Internet Access) with OpenVPN on a Linux machine: specifically, one where only certain applications utilize the VPN and the rest of the traffic goes out the ISP's normal default route. It will also show how to access the PIA API via a shell script to open a forwarding port for inbound traffic. Lastly, I will show how to take all of the OpenVPN and PIA information and feed it to programs like aria2c or curl. The examples below were done on Ubuntu 16.04.

Packages and PIA Setup

Go sign up for a PIA account.

# Install the packages you need, example uses apt-get
sudo apt-get install openvpn curl unzip

# make dir for PIA files and scripts
sudo mkdir -p /etc/openvpn/pia
cd /etc/openvpn/pia

# grab PIA openvpn files and unzip
sudo curl -o $url && sudo unzip

OpenVPN password file

Now that we have PIA login info, let's make a password file so we don't have to type the password every time we start OpenVPN. We just need a file with the PIA username on the first line and the PIA password on the second line, created with your favorite text editor. The file must be called "pass" and placed in the "/etc/openvpn/pia" directory; the scripts used later depend on this exact name and location. An example of what the file looks like is below.


Change the permissions on this file so only root can read it:

sudo chmod 600 /etc/openvpn/pia/pass
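Since the two-line format is easy to get wrong, here's a tiny sketch that writes it. The credentials shown are placeholders and the helper function is my own invention; run it as root against /etc/openvpn/pia/pass:

```shell
# make_pass FILE -- write the PIA username (line 1) and password (line 2)
# and lock the permissions down. Replace the placeholder values.
make_pass() {
    printf '%s\n%s\n' 'p1234567' 'MySecretPassword' > "$1"
    chmod 600 "$1"
}
```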

OpenVPN config file

This is the OpenVPN config file that works with PIA and that also utilizes the scripts discussed further down the page. Use your favorite editor to copy and paste the following into a file called "pia.conf" in the "/etc/openvpn/pia" directory.

# PIA OpenVPN client config file 
dev tun

# make sure the correct protocol is used
proto udp

# use the vpn server of your choice
# only use one server at a time
# the ip addresses can change, so use dns names not ip's
# find more server names in .ovpn files
# only certain gateways support port forwarding
#remote 1198
#remote 1198
#remote 1198
#remote 1198
remote 1198

resolv-retry infinite
cipher aes-128-cbc
auth sha1

# ca.crt and pem files from downloaded from pia
ca /etc/openvpn/pia/ca.rsa.2048.crt
crl-verify /etc/openvpn/pia/crl.rsa.2048.pem

remote-cert-tls server

# path to password file so you don't have to input pass on startup
# file format is username on one line password on second line
# make it only readable by root with: chmod 600 pass
auth-user-pass /etc/openvpn/pia/pass

# this suppresses the caching of the password and user name
auth-nocache

verb 1
reneg-sec 0

# allows the ability to run user-defined script
script-security 2

# Don't add or remove routes automatically, pass env vars to route-up
route-noexec

# run our script to make routes
route-up "/etc/openvpn/pia/ up"

OpenVPN route script

This is the script that the OpenVPN client will run at the end of startup. The magic happens in this script. Without it, OpenVPN will make the VPN connection the default route for the box. If you want that, go into the pia.conf file and comment out the "script-security 2", "route-noexec", and "route-up ..." lines, fire up the client with "sudo openvpn --config /etc/openvpn/pia/pia.conf", and you're done.

If you don't want the VPN to take over your default route, let's keep going. With those lines left in the pia.conf file, the following script will be run when the client starts; it sets up a route that does not take over the default gateway, but instead adds a secondary VPN gateway for programs to use. Open your favorite text editor and copy the script below into the file "/etc/openvpn/pia/".

# script used by OpenVPN to setup a route on Linux.
# used in conjunction with OpenVPN config file options
# script-security 2, route-noexec, route-up 
# script also requires route table rt2
# sudo bash -c 'echo "1 rt2" >> /etc/iproute2/rt_tables

# openvpn variables passed in via env vars

if [ -z $int ] || [ -z $iplocal ] || [ -z $ipremote ] || [ -z $gw ]; then
  echo "No env vars found. Use this script with an OpenVPN config file "
  exit 1

help() {
  echo "For setting OpenVPN routes on Linux."
  echo "Usage: $0 up or down"
}

down() {
  # delete vpn route if found
  ip route flush table $rtname
  if [ $? -eq 0 ]; then
    echo "Successfully flushed route table $rtname"
  else
    echo "Failed to flush route table $rtname"
  fi
}

up() {
  # using OpenVPN env vars that get set when it starts, see man page
  echo "Tunnel on interface $int. File /tmp/vpnint"
  echo "$int" > /tmp/vpnint
  echo "Local IP is         $iplocal. File /tmp/vpnip"
  echo "$iplocal" > /tmp/vpnip
  echo "Remote IP is        $ipremote"
  echo "Gateway is          $gw"

  down # remove any old routes

  ip route add default via "$gw" dev "$int" table $rtname
  if [ $? -eq 0 ]; then
    echo "Successfully added default route $gw"
  else
    echo "Failed to add default route for gateway $gw"
  fi

  ip rule add from "$iplocal"/32 table $rtname
  if [ $? -eq 0 ]; then
    echo "Successfully added local interface 'from' rule for $iplocal"
  else
    echo "Failed to add local interface 'from' rule for $iplocal"
  fi

  ip rule add to "$gw"/32 table $rtname
  if [ $? -eq 0 ]; then
    echo "Successfully added local interface 'to' rule for $gw"
  else
    echo "Failed to add local interface 'to' rule for $gw"
  fi

  # PIA port forwarding, only works with certain gateways
  # No US locations; closest to the US are Toronto and Montreal
  # no network traffic works during exec of this script, so
  # things like curl hang if not backgrounded
  $ovpnpia/ &
}

case $1 in
  "up") up;;
  "down") down;;
  *) help;;
esac

# always flush route cache 
ip route flush cache

Now run some final commands to get the script ready to work

# make the new script executable
sudo chmod 755 /etc/openvpn/pia/

# make a new route table rt2 in linux for the script to use
# this only has to be run once before you connect the first time
sudo bash -c 'echo "1 rt2" >> /etc/iproute2/rt_tables'
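That echo appends blindly, so running it a second time would leave a duplicate entry in rt_tables. A slightly more careful variant (same file, same entry) appends only when the entry is missing:

```shell
# add the rt2 table entry only if it is not already present
grep -q '[[:space:]]rt2$' /etc/iproute2/rt_tables 2>/dev/null || \
  sudo bash -c 'echo "1 rt2" >> /etc/iproute2/rt_tables'
```

The pattern anchors on a whitespace-then-rt2 at end of line, so it won't be fooled by table names that merely contain "rt2".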

PIA port forward script

The following script is run by the route-up script above. It contacts a PIA server and tells it to open a port for incoming traffic on your vpn connection, so that people on the internet can reach your machine through the vpn. One important note: currently only certain PIA gateways support port forwarding; see the PIA support article on this for more info. Now open your favorite text editor and copy the script below into the file "/etc/openvpn/pia/".

# Get forward port info from PIA server
# ($url below is PIA's port-forwarding API endpoint)

client_id=$(head -n 100 /dev/urandom | sha256sum | tr -d " -")

echo "Making port forward request..."

curl --interface $(cat /tmp/vpnint) $url 2>/dev/null > /tmp/vpnportfwhttp

if [ $? -eq 0 ]; then
  port_fw=$(grep -o '[0-9]\+' /tmp/vpnportfwhttp)
  [ -f /tmp/vpnportfw ] && rm /tmp/vpnportfw
  echo $port_fw > /tmp/vpnportfw
  echo "Forwarded port is $port_fw"
  echo "Forwarded port is in file /tmp/vpnportfw"
else
  echo "Curl failed to get forwarded PIA port in some way"
fi

# make the new script executable
sudo chmod 755 /etc/openvpn/pia/

Starting OpenVPN

Finally we can start OpenVPN to connect with PIA. To do this, run the following command. It keeps the connection in the foreground so you can watch the output.

sudo openvpn --config /etc/openvpn/pia/pia.conf

During startup the OpenVPN client and both of the scripts we made will print information about the connection, and any errors, to the screen. The output will look like the following example.

Fri Nov 10 19:40:50 2017 OpenVPN 2.3.10 x86_64-pc-linux-gnu [SSL (OpenSSL)] [LZO] [EPOLL] [PKCS11] [MH] [IPv6] built on Jun 22 2017
Fri Nov 10 19:40:50 2017 library versions: OpenSSL 1.0.2g  1 Mar 2016, LZO 2.08
Fri Nov 10 19:40:50 2017 NOTE: the current --script-security setting may allow this configuration to call user-defined scripts
Fri Nov 10 19:40:50 2017 UDPv4 link local: [undef]
Fri Nov 10 19:40:50 2017 UDPv4 link remote: [AF_INET]
Fri Nov 10 19:40:50 2017 [dbacd7b38d135021a698ed95e8fec612] Peer Connection Initiated with [AF_INET]
Fri Nov 10 19:40:53 2017 TUN/TAP device tun0 opened
Fri Nov 10 19:40:53 2017 do_ifconfig, tt->ipv6=0, tt->did_ifconfig_ipv6_setup=0
Fri Nov 10 19:40:53 2017 /sbin/ip link set dev tun0 up mtu 1500
Fri Nov 10 19:40:53 2017 /sbin/ip addr add dev tun0 local peer
Tunnel on interface tun0. File /tmp/vpnint
Local IP is File /tmp/vpnip
Remote IP is
Gateway is
Successfully flushed route table rt2
Successfully added default route
Successfully added local interface 'from' rule for
Successfully added local interface 'to' rule for
Fri Nov 10 19:40:53 2017 Initialization Sequence Completed
Making port forward request...
Forwarded port is 40074
Forwarded port is in file /tmp/vpnportfw

Using the vpn connection

When the vpn started it dropped some files in /tmp. These files have the ip and port info we need to give to different programs when they start up. The scripts created the following files.

  • /tmp/vpnip - ip address of the vpn
  • /tmp/vpnportfw - incoming port being forwarded from the internet to the vpn
  • /tmp/vpnint - interface of the vpn

Now you can use this info when you start certain programs. Here are some examples.

# get vpn incoming port
pt=$(cat /tmp/vpnportfw)

# get vpn ip
ip=$(cat /tmp/vpnip)

# get vpn interface
int=$(cat /tmp/vpnint)

# wget a file via vpn
wget --bind-address=$ip

# curl a file via vpn
curl --interface $int

# ssh to server via vpn
ssh -b $ip 

# rtorrent 
/usr/bin/rtorrent -b $ip -p $pt-$pt -d /tmp

# start aria2c and background. use aria2 WebUI to connect download files
aria2c --interface=$ip --listen-port=$pt --dht-listen-port=$pt > /dev/null 2>&1 &

Final notes and warnings

If you start any programs without specifically binding them to the vpn interface or its ip address, their traffic will go out the machine's default interface. Please remember that this setup only sends specific traffic through the vpn, so things like DNS requests still go through the non-vpn default gateway.
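To see which path a connection would actually take, `ip route get` asks the kernel for its routing decision. A quick check, assuming the vpn is up and the /tmp files exist (8.8.8.8 is just an arbitrary destination):

```shell
# which way would traffic to an arbitrary destination leave the box?
# without a source address it follows the machine's normal default route
ip route get 8.8.8.8

# same destination, but sourced from the vpn's local address: with the
# policy rules in place this should resolve via the tun interface instead
ip route get 8.8.8.8 from "$(cat /tmp/vpnip)"
```

If the second command does not show the tun interface, the rules or the rt2 table did not get set up.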

Remember, only certain PIA gateways support port forwarding, so if it is not working, try another PIA gateway.

PIA also has a Linux vpn client that you can download and use if you are into GUIs.

by at November 11, 2017 03:50 AM

November 10, 2017

Lies, damn lies, and spammers in disguise

Everyone gets so many unsolicited commercial emails these days that, at best, you just become blind to them. Sometimes they are clearly, expressly commercial. Other times they try to slip past your attention and your spam checker by disguising themselves as legitimate emails. I have a little story about that.

A couple of weeks ago I got yet another spammy mail. It was evidently sent through a mass mailing and, as such, included an unsubscribe link; however, the guy was trying to legitimize his spam by saying that he approached me specifically because a colleague had referred me to him. In addition, I felt that some keywords were added to his message only to make it sound "prettier", or more legitimate.

I usually don’t spend time on spammers, but when I do, I try to do it well. On this occasion I had a little time to spend, and I did.

On October 30th I was walking to my eye doctor’s when I saw an email notification on my phone. The email said the following (note: the highlights are mine):

I was sent your way as the person who is responsible for application security testing and understand XXXXXX would be of interest to Telenor. To what lies within our next generation penetration testing abilities, XXXXXX can bring to you:
  • Periodic pen tests in uncovering serious vulnerabilities and (lack of) continuous application testing to support dynamic DevOps SDLC.
  • An ROI of 53% or higher with the ability to Identify, locate and resolve all vulnerabilities
  • Averaging 35-40% critical vulnerabilities found overall, others average 28%.
  • Over 500 international top class security researchers deployed in large teams of 40-60 ethical hackers.
We have pioneered a disruptive, ethical hacker powered application testing approach deploying large teams of senior security researchers with hacker mimicking mindsets to rapidly uncover exploits in your applications.
I would like to find some time for a conversation during November, to give you the insight into why Microsoft and Hewlett-Packard Enterprise have invested in us and why SAP is our top global reseller.
Many thanks,
Geoffrey XXXXXX
Business Development Manager

Now, we have our mailboxes on GMail, and my usual procedure for such spammy crap from clueless salespeople is:

  • block the sender
  • unsubscribe
  • report the mail as spam

But this time it was different: first, the guy was probably lying to legitimize sending me shit I had not requested; second, despite the “personal” approach, this was clearly a mass mailing, so the whole story was a lie; third, I was going to sit and wait at the doctor’s for some time anyway, and I could invest that time in running the guy down. My reply:

Kindly let me know who sent you my way, so that I can double check that. Then, maybe, we can talk.

The guy replies soon after. New mail, new lie:

Hi Marco,

Apologies for not including that in the email and I am more than happy to say how I got to you.

I have spoken to Ragnar Harper around how this may become beneficial to Telenor and he has mentioned your name during the conversations. As I have been unable to get back in contact with Ragnar I thought it best that I could gain your input into this moving forward by having some time aligned in our diaries to discuss more?

[…additional crap redacted…]

Not only had I never met or talked with any Ragnar Harper: the guy was not in our Company’s ERP either. In my eyes, Mr. Geoffrey was lying right down the line. Time to get rid of him:

There is no Ragnar Harper here

Your emails will hereby be blocked and reported as Spam

— MM

You may think he stopped there. Well, he didn’t:

Apologies but I just phoned the switchboard and he left last month.

Above is his linkedin profile as well.
Sorry for the confusion.

So he had this beautiful relationship with Ragnar Harper, to the point that they talked about Geoff’s products and Ragnar pointed him to me as the “person responsible for application security” (which I am not…), and yet he didn’t even know that Ragnar had left the company. But there is more: to find that out, he didn’t call Ragnar; he had to call the switchboard and check LinkedIn. Geoff was clearly clutching at straws. My reply:

You just picked up a random name. You are trying to sell me something for security, which should entail a certain amount of trust in your company, and you start the interaction with a lie: not bad for someone moved by an “ethical hacking” spirit.

There can’t be any trust between me and your company. Your messages are already landing in my spam folder and that’s where they shall be. Just go and try your luck with someone else, “Geoffrey”.

The king is naked, they say. Now he’s got to stop, you think. He didn’t:

Thank you for your time Marco and I am sorry that you feel that way.

If you believe that it has started on a lie then that may well be your choice as I have been in contact with Ragnar since my days at YYYYYY before joining XXXXXX and we have yet to catch up this month since he has now departed. I shall head elsewhere as requested as that is not the kind of reputation that we uphold here.

Best wishes,


Uh, wait a minute. Did I beat up on the guy for no reason? Could it be that he actually knew this ex-colleague Ragnar Harper and I was assuming too much? Was it all a misunderstanding? If so, I wanted to know and apologise. As I said, I had never met or talked to Ragnar Harper, but I could still try to contact him through LinkedIn:

Question from an ex colleague

Hei Ragnar. My name is Marco Marongiu, I am the Head of IT in Telenor Digital. I apologize for bothering you, it’s just a short question: have you referred me to a Sales person called Geoffrey XXXXXX from a Security company called XXXXXX?

This person approached me via email. Long story short, he says he knows you personally and that you sent him my way (to use his words). Is it something that you can confirm?

We two have never met in Telenor so that sounded strange and I handled him as a spammer. But if that is actually true I do want to send the guy my apologies. Thanks in any case

Med vennlig hilsen

— Marco

I am grateful that Ragnar took the time to reply and confirm my suspicion: he had never known or met the guy. I thanked Ragnar and stopped talking to “Geoffrey”. At the same time, I thought it was a good story to tell, so here we go.



Tagged: spammers

by bronto at November 10, 2017 07:00 PM

October 29, 2017

ncdu - for troubleshooting diskspace and inode issues

In my box of sysadmin tools there are multiple gems I use for troubleshooting servers. Since I work at a cloud provider, I sometimes have to fix servers that are not mine. One of those tools is `ncdu`. It's a very useful tool when a server has a full disk, whether it is out of space or out of inodes. This article covers ncdu and shows the process of finding the culprit when you're out of disk space or inodes.
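ncdu is interactive; when you only have a plain shell, or want to script the hunt, roughly the same questions can be answered with du and find. A sketch, using /var as an example target:

```shell
# five largest directories by disk usage, staying on one filesystem (-x),
# sizes in kilobytes; largest last
du -xk /var 2>/dev/null | sort -n | tail -n 5

# five directories containing the most files, for inode exhaustion hunts:
# list files, strip the filename to get the directory, count per directory
find /var -xdev -type f 2>/dev/null | sed 's|/[^/]*$||' \
  | sort | uniq -c | sort -n | tail -n 5
```

The -x/-xdev flags matter on a full root filesystem: they stop the scan from wandering into other mounts and blaming the wrong disk.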

October 29, 2017 12:00 AM