Planet SysAdmin

August 22, 2019

Chris Siebenmann

Making sense of OpenBSD 'pfctl -ss' output for firewall state tables

Suppose, not entirely hypothetically, that you have some OpenBSD firewalls and every so often you wind up looking at the state table listing that's produced by 'pfctl -ss'. On first impression, this output looks sort of understandable, with entries like:

all tcp <- 128.100.3.X:46392       ESTABLISHED:ESTABLISHED
all tcp 128.100.3.X:46392 ->       ESTABLISHED:ESTABLISHED

I won't say that appearances are deceptive here, but things are not as straightforward as they look once you start wanting to know what this is really telling you. For instance, there is no documentation on what that 'all' actually means. Since I've been digging into this, here's what I've learned.

The general form of a state table entry as printed by 'pfctl -ss' is:

IF PROTO LEFT-ADDRESS DIR RIGHT-ADDRESS       LEFT-STATE:RIGHT-STATE
At least for our firewalls, the interface is generally 'all'. The protocol can be any number of things, including tcp, udp, icmp, esp, ospf, and pfsync. For TCP connections, the listed states are the TCP states (and you can get all of the weird and wonderful conditions where the two directions of the connection are in different states, such as half-closed connections). For other protocols there's a smaller list; see the description of 'set timeout' in pf.conf's OPTIONS section for a discussion of most of them. There's also a NO_TRAFFIC state for when no traffic has happened in one direction.
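If you want a quick overview of what's in the state table, a little awk over the output works; here is a hedged sketch that tallies entries by protocol (the second field). The sample lines below stand in for real output; on an actual firewall you would pipe 'pfctl -ss' straight into awk.

```shell
# Tally state table entries by protocol (field 2 of 'pfctl -ss' output).
printf '%s\n' \
  'all tcp 128.100.3.X:46392 ->       ESTABLISHED:ESTABLISHED' \
  'all tcp 128.100.3.X:46392 <-       ESTABLISHED:ESTABLISHED' \
  'all udp 128.100.3.X:53 ->       MULTIPLE:SINGLE' |
awk '{ counts[$2]++ } END { for (p in counts) print p, counts[p] }'
```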

So let's talk about directions, the field which I have called DIR and which will always be either '<-' or '->', which mean in and out respectively. By that I mean PF_IN and PF_OUT (plus PF_FWD for forwarded packets), not 'inside' and 'outside'. OpenBSD PF doesn't have any notion of inside and outside interfaces, but it does have a notion of incoming traffic and outgoing traffic, and that is what ultimately determines the direction. If a packet is matched or handled during input and that creates a state table entry, that will be an in entry; similarly, matching or passing it during output will create an out entry. Sometimes this is through explicit 'pass in' and 'pass out' rules, but other times you have a bidirectional rule (eg 'match on <IF> ... binat-to ...') and then the direction depends on packet flow.

The first thing to know is that contrary to what I believed when I started writing this entry, all state table entries are created by rules. As far as I can tell, there are no explicit state table entries that get added to handle replies; the existing 'forward' state table entries are just used in reverse to match the return traffic. The reason that state table entries usually come in pairs (at least for us) is that we have both 'pass in' and 'pass out' rules that apply to almost all packets, and so both rules create a corresponding state table entry for a specific connection. An active, permitted connection will thus have two state table entries, one for the 'pass in' rule that allows it in and one for the 'pass out' rule that allows it out.
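As a sketch, a rule pair like the following is the sort of thing that produces those paired entries (a hypothetical pf.conf fragment; the interface and network macros are made up). Both rules keep state by default, so one permitted connection through the firewall gets two state table entries:

```
# Hypothetical pf.conf fragment.  'keep state' is the default, so each
# of these rules creates its own state table entry for a connection
# that traverses the firewall.
pass in  on $int_if proto tcp from $inside_net to any port ssh
pass out on $ext_if proto tcp from $inside_net to any port ssh
```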

The meaning of the left and the right address changes depending on the direction. For an out state table entry, the left address is the (packet or connection) source address and the right address is the destination address; for an in state table entry it's reversed, with the left address the destination and the right address the source. The LEFT-STATE and RIGHT-STATE fields are associated with the left and the right addresses respectively, whatever they are, and for paired up state table entries I believe they're always going to be mirrors of each other.
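As a concrete illustration of those left/right rules, here is a hedged awk sketch (assuming the simple no-NAT field layout from above) that recovers the connection's source and destination from an entry's direction:

```shell
# For an out ('->') entry the left address is the source; for an in
# ('<-') entry the left address is the destination.  The sample line is
# illustrative, not from a real firewall.
printf '%s\n' 'all tcp 128.100.3.X:22 <- 128.100.A.B:39304 ESTABLISHED:ESTABLISHED' |
awk '$4 == "->" { print "src=" $3, "dst=" $5 }
     $4 == "<-" { print "src=" $5, "dst=" $3 }'
# -> src=128.100.A.B:39304 dst=128.100.3.X:22
```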

(I believe that the corollary of this is that the NO_TRAFFIC state can only appear on the destination side, ie the side that didn't originate the packet flow. This means that for an out state NO_TRAFFIC will always be the right state, and on an in state it will always be the left one.)

So far I have shown a pair of state table entries from a simple firewall without any sort of NAT'ing going on (which includes 'rdr-to' rules). If you have some sort of NAT in effect, the output changes and generally that change will be asymmetric between the pair of state table entries. Here is an example:

all tcp 128.100.X.X:22 <-       ESTABLISHED:ESTABLISHED
all tcp 128.100.3.Y:60689 ( -> 128.100.X.X:22       ESTABLISHED:ESTABLISHED

This machine has made an outgoing SSH connection that was first matched by a 'pass in' rule and then NAT'd on output. Inbound NAT creates a different set of state table entries:

all tcp 10.X.X.X:22 (128.100.20.X:22) <-       ESTABLISHED:ESTABLISHED
all tcp -> 10.X.X.X:22       ESTABLISHED:ESTABLISHED

The rule is that the pre-translation address is in () and the post translation address is not. On outbound NAT, the pre-translation address is the internal address and the post-translation one is the public IP; on inbound NAT it's the reverse. Notice that this time the NAT was applied on input, not on output, and of course there was a 'pass in' rule that matched.
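Since the pre-translation address is the parenthesized one, you can fish it out of a state entry mechanically; here is a small sed sketch using the inbound-NAT example line from above:

```shell
# Extract the pre-translation (parenthesized) address from a NAT'd
# state entry.  The sample line is the inbound NAT example from above.
printf '%s\n' 'all tcp 10.X.X.X:22 (128.100.20.X:22) <- ESTABLISHED:ESTABLISHED' |
sed -n 's/.*(\(.*\)).*/\1/p'
# -> 128.100.20.X:22
```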

(If you have binat-to machines they can have both sorts of entries at once, with some connections coming in from outside and some connections going outside from the machine.)

If you do your NAT through bidirectional rules (such as 'match on <IF> ...'), where NAT is applied is determined by what interface you specify in the rule combined with packet flow. This is our case; all of our NAT rules are applied on our perimeter firewall's external interface. If we applied them to the internal interface, we could create situations where the right address had the NAT mapping instead of the left one. The resulting state table entries would look like this (for an inbound connection that was RDR'd):

all tcp 128.100.3.X:25 <- 128.100.A.B:39304       ESTABLISHED:ESTABLISHED
all tcp 128.100.A.B:39304 -> 128.100.3.YYY:25 (128.100.3.X:25)       ESTABLISHED:ESTABLISHED

This still follows the rule that the pre-translation address is in the () and the post-translation address is not.

In general, given only a set of state table entries, you don't know what is internal and what is external. This is true even when NAT is in effect, because you don't necessarily know where NAT is being applied (as shown here; all NAT'd addresses are internal ones, but they can show up almost anywhere). If you know certain things about your rules, you can know more from your state table entries (without having to do things like parse IP addresses and match network ranges). Given how and where we apply NAT, it's always going to appear in our left addresses, and if it appears on an in state table entry it's an external machine making an inbound connection instead of an internal machine making an outgoing one.

PS: According to the pfctl code, you may sometimes see extra text in the left or right address that looks like '{ <IP address> }'. I believe this appears only if you use af-to to do NAT translation between IPv4 and IPv6 addresses. I'm not sure if it lists the translated address or the original.

PPS: Since I just tested this, the state of an attempted TCP connection in progress to something that isn't responding is SYN_SENT for the source paired with CLOSED for the destination. An attempted TCP connection that has been refused by the destination with a RST has a TIME_WAIT:TIME_WAIT state. Both of these are explicitly set in the relevant pf.c code; see pf_create_state and pf_tcp_track_full (for the RST handling). Probably those are what you'd expect from the TCP state transitions in general.

Sidebar: At least three ways to get singleton state table entries

I mentioned that state table entries usually come in pairs. There are at least three exceptions. The first is state table entries for traffic to the firewall itself, including both pings and things like SSH connections; these are accepted in 'pass in' rules but are never sent out to anywhere, so they never get a second entry. The second is traffic that is accepted by 'pass in' rules but then matches some 'block out' rule so that it's not actually sent out. The third and most obvious exception is if you match in one direction with 'no state' but use state in the other, perhaps by accident or omission.

(Blocked traffic tends to have NO_TRAFFIC as the state for one side, but not all NO_TRAFFIC states are because of blocks; sometimes they're just because you're sending traffic to something that doesn't respond.)

I was going to say things about the relative number of in and out states as a consequence and corollary of this, but now that I've looked at our actual data I'm afraid I have no idea what's going on.

(I think that part of it is that for TCP connections, you can have closed down or inactive connections where one state table entry expires before the other. This may apply to non-TCP connections too, but my head hurts. For that matter, I'm not certain that 'pfctl -ss' is guaranteed to report a coherent copy of the state table. Pfctl does get it from the kernel in a single ioctl(), but the kernel may be mutating the table during the process.)

by cks at August 22, 2019 12:53 AM

August 21, 2019

Chris Siebenmann

A gotcha with Fedora 30's switch of Grub to BootLoaderSpec based configuration

I upgraded my office workstation from Fedora 29 to Fedora 30 yesterday. In the past, such upgrades have been problem free, but this time around things went fairly badly, with the first and largest problem being that after the upgrade, booting any kernel gave me a brief burst of kernel messages, then a blank screen and after a few minutes a return to the BIOS and Grub main menu. To get my desktop to boot at all, I had to add 'nomodeset' to the kernel command line; among other consequences, this made my desktop a single display machine instead of a dual display one.

(It was remarkably disorienting to have my screen mirrored across both displays. I kept trying to change to the 'other' display and having things not work.)

The short version of the root cause is that my grub.cfg was rebuilt using outdated kernel command line arguments that came from /etc/default/grub, instead of the current command line arguments that had previously been used in my original grub.cfg. Because of how the Fedora 30 grub.cfg is implemented, these wrong command line arguments were then remarkably sticky and it wasn't clear how to change them.

In Fedora 29 and earlier, your grub.cfg is probably being maintained through grubby, Fedora's program for this. When grubby adds a menu entry for a new kernel, it more or less copies the kernel command line arguments from your current one. While there is a GRUB_CMDLINE_LINUX setting in /etc/default/grub, its contents are ignored until and unless you rebuild your grub.cfg from scratch, and there's nothing that tries to update it from what your current kernels in your current grub.cfg are actually using. This means that your /etc/default/grub version can wind up being very different from what you're currently using and actually need to make your kernels work.

One of the things that usually happens by default when you upgrade to Fedora 30 is that Fedora switches how grub.cfg is created and updated from the old way of doing it itself via grubby to using a Boot Loader Specification (BLS) based scheme; you can read about this switch in the Fedora wiki. This switch regenerates your grub.cfg using a shell script called (in Fedora) grub2-switch-to-blscfg, and this shell script of course uses /etc/default/grub's GRUB_CMDLINE_LINUX as the source of the kernel arguments.

(This is controlled by whether GRUB_ENABLE_BLSCFG is set to true or false in your /etc/default/grub. If it's not set at all, grub2-switch-to-blscfg adds a 'GRUB_ENABLE_BLSCFG=true' setting to /etc/default/grub for you, and of course goes on to regenerate your grub.cfg. grub2-switch-to-blscfg itself is run from the Fedora 30 grub2-tools RPM posttrans scriptlet if GRUB_ENABLE_BLSCFG is not already set to something in your /etc/default/grub.)

A regenerated grub.cfg has a default_kernelopts setting, and that looks like it should be what you want to change. However, it is not. The real kernel command line for normal BLS entries is actually in the Grub2 $kernelopts environment variable, which is loaded from the grubenv file, normally /boot/grub2/grubenv (which may be a symlink to /boot/efi/EFI/fedora/grubenv, even if you're not actually using EFI boot). The best way to change this is to use 'grub2-editenv - list' and 'grub2-editenv - set kernelopts="..."'. I assume that default_kernelopts is magically used by the blscfg Grub2 module if $kernelopts is unset, and possibly gets written back to grubenv by Grub2 in that case.

(You can check that your kernels are using $kernelopts by inspecting an entry in /boot/loader/entries and seeing that it has 'options $kernelopts' instead of anything else. You can manually change that for a specific entry if you want to.)
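For illustration, a BLS entry file under /boot/loader/entries looks roughly like this (a sketch only; the title, version, and file paths here are invented, not from a real system):

```
# Hypothetical /boot/loader/entries/<machine-id>-5.2.x.conf
title Fedora (5.2.x) 30 (Workstation Edition)
version 5.2.x
linux /vmlinuz-5.2.x
initrd /initramfs-5.2.x.img
options $kernelopts
```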

This is going to make it more interesting (by which I mean annoying) if and when I need to change my standard kernel options. I think I'm going to have to change all of /etc/default/grub, the kernelopts in grubenv, and the default_kernelopts in grub.cfg, just to be sure. If I was happy with the auto-generated grub.cfg, I could just change /etc/default/grub and force a regeneration, but I'm not and I have not yet worked out how to make its handling of the video modes and the menus agree with what I want (which is a basic text experience).

(While I was initially tempted to leave my system as a non-BLS system, I changed my mind because of long term issues. Fedora will probably drop support for grubby based setups sooner or later, so I might as well get on the BLS train now.)

To give credit where it's due, one (lucky) reason that I was able to eventually work out all of this is that I'd already heard about problems with the BLS transition in Fedora 30 in things like Fedora 30: When grub2-mkconfig Doesn’t Work, and My experiences upgrading to Fedora 30. Without that initial awareness of the existence of the BLS transition in Fedora 30 (and the problems it caused people), I might have been flailing around for even longer than I was.

PS: As a result of all of this, I've discovered that you no longer need to specify the root device in the kernel command line arguments. I assume the necessary information for that is in the dracut-built initramfs. As far as the blank screen and kernel panics go, I suspect that the cause is either or both of 'amdgpu.dpm=0' and 'logo.nologo', which were still present in the /etc/default/grub arguments but which I'd long since removed from my actual kernel command lines.

(I could conduct more experiments to try to find out which kernel argument is the fatal one, but my interest in more reboots is rather low.)

Update, August 21st: I needed to reboot my machine to apply a Fedora kernel update, so I did some experiments and the fatal kernel command line argument is amdgpu.dpm=0, which I needed when the machine was new but had turned off since then.

by cks at August 21, 2019 05:32 PM

Saying goodbye to Flash (in Firefox, and in my web experience)

Today, for no specific reason, I finally got around to removing the official Adobe-provided Linux Flash plugin packages from my office workstation. I was going to say that I did it on both my home and my office machine, but it turns out that I apparently removed it on my home machine some time ago; the flash-plugin package was only lingering on my work machine. This won't make any difference to my experience of the web in Firefox, because some time ago I disabled Flash in Firefox itself, setting the plugin to never activate. Until I walked away from Chrome, Chrome and its bundled version of Flash was what I reached for when I needed Flash.

I kept the plugin around for so long partly because for a long time, getting Flash to work was one of the painful bits of browsing the web on Linux. Adobe's Linux version of Flash was behind the times (and still is), for a long time it was 32-bit only, and over the years it required a variety of hacks to get it connected to Firefox. Then, once I had Flash working, I needed more things to turn it off when I didn't want it to play. After all of this, having an officially supplied 64-bit Adobe Flash package that just worked (more or less) seemed like sort of a miracle, so I let it sit around even well after I'd stopped using it.

Now, though, the web has moved on. The last website that I cared about that used Flash moved to HTML5 video more than a year ago, and as mentioned I haven't used Flash in Firefox for far longer than that. Actively saying goodbye by removing the flash-plugin package seemed about time, and after all of the hassles Flash has put me through over the years, I'm not sad about it.

(Flash's hassles haven't just been in the plugin. I've had to use a few Flash-heavy websites over the years, including one that I at least remember as being implemented entirely as a Flash application, and the experience was generally not a positive one. I'm sure you can do equally terrible things in HTML5 with JavaScript and so on, but I think you probably have to do more work and that hopefully makes people less likely to do it.)

Flash is, unfortunately, not the last terrible thing that I sort of need in my browsers. Some of our servers have IPMI BMCs that require Java for their KVM over IP stuff, specifically Java Web Start. I actually keep around a Java 7 install just for them, although the SSL ciphers they support are getting increasingly ancient and hard to talk to with modern browsers.

(I normally say TLS instead of SSL, but these are so old that I feel I should call what they use 'SSL'.)

PS: I'm aware that there is (or was) good web content done in Flash and much of that content is now in the process of being lost, and I do think that that is sad. But for me it's kind of an abstract sadness, since I never really interacted with that corner of the web, and also I'm acclimatized to good things disappearing from the web in general.

by cks at August 21, 2019 03:54 AM

August 18, 2019

Errata Security

Censorship vs. the memes

The most annoying thing in any conversation is when people drop a meme bomb, some simple concept they've heard elsewhere in a nice package that they really haven't thought through, which takes time and nuance to rebut. These memes are often bankrupt of any meaning.

When discussing censorship, which is wildly popular these days, people keep repeating these same memes to justify it:
  • you can't yell fire in a crowded movie theater
  • but this speech is harmful
  • Karl Popper's Paradox of Tolerance
  • censorship/free-speech don't apply to private organizations
  • Twitter blocks and free speech
This post takes some time to discuss these memes, so I can refer back to it later, instead of repeating the argument every time some new person repeats the same old meme.

You can't yell fire in a crowded movie theater

This phrase was first used in the Supreme Court decision Schenck v. United States to justify outlawing protests against the draft. Unless you also believe the government can jail you for protesting the draft, then the phrase is bankrupt of all meaning.

In other words, how can it be used to justify the thing you are trying to censor and yet be an invalid justification for censoring those things (like draft protests) you don't want censored?

What this phrase actually means is that because it's okay to suppress one type of speech, it justifies censoring any speech you want. Which means all censorship is valid. If that's what you believe, just come out and say "all censorship is valid".

But this speech is harmful or invalid

That's what everyone says. In the history of censorship, nobody has ever wanted to censor good speech, only speech they claimed was objectively bad, invalid, unreasonable, malicious, or otherwise harmful.

It's just that everybody has different definitions of what actually is bad, harmful, or invalid. It's like the movie theater quote. For example, China's constitution proclaims freedom of speech, yet the government blocks all mention of the Tiananmen Square massacre because it's harmful. Its "Great Firewall of China" is famous for blocking most of the content of the Internet that the government claims harms its citizens.

At least in case of movie theaters, the harm of shouting "fire" is immediate and direct. In all these other cases, the harm is many steps removed. Many want to censor anti-vaxxers, because their speech kills children. But the speech doesn't, the virus does. By extension, those not getting vaccinations may harm people by getting infected and passing the disease on. But the speech itself is many steps removed from this, and there's plenty of opportunity to counter this bad speech with good speech.

Thus, this argument becomes that all speech can be censored, because I can also argue that some harm will come from it.

Karl Popper's Paradox of Tolerance

This is just a logical fallacy, using different definitions of "tolerance". The word means "putting up with those who disagree with you". The "paradox" comes from allowing people free-speech who want to restrict free-speech.

But people are shifting the definition of "tolerance" to refer to white-supremacists, homophobes, and misogynists. That's also intolerance, of people different than you, but it's not the same intolerance Popper is talking about. It's not a paradox allowing the free-speech of homophobes, because they aren't trying to restrict anybody else's free-speech.

Today's white-supremacists in the United States don't oppose free-speech, quite the opposite. They champion free-speech, and complain the most about restrictions on their speech. Popper's Paradox doesn't apply to them. Sure, the old Nazis in Germany also restricted free-speech, but that's distinct from their racism, and not what modern neo-Nazis are championing.

Ironically, the intolerant people Popper refers to in his Paradox are precisely the ones quoting it with the goal of restricting speech. Sure, you may be tolerant in every other respect (foreigners, other races, other religions, gays, etc.), but if you want to censor free-speech, you are intolerant of people who disagree with you. Popper wasn't an advocate of censorship, his paradox wasn't an excuse to censor people. He believed that "diversity of opinions must never be interfered with".

Censorship doesn't apply to private organizations

Free speech rights, as enumerated by the First Amendment, only apply to government. Therefore, it's wrong to claim the First Amendment protects your Twitter or Facebook post, because those are private organizations. The First Amendment doesn't apply to private organizations. Indeed, the First Amendment means that government can't force Twitter or Facebook to stop censoring you.

But "free speech" doesn't always mean "First Amendment rights". Censorship by private organizations is still objectionable on "free speech" grounds. Private censorship by social media isn't suddenly acceptable simply because government isn't involved.

Our rights derive from underlying values of tolerance and pluralism. We value the fact that even those who disagree with us can speak freely. The word "censorship" applies both to government and private organizations, because both can impact those values, both can restrict our ability to speak.

Private organizations can moderate content without it being "censorship". On the same page where Wikipedia states that it won't censor even "exceedingly objectionable/offensive" content, it also says:
Wikipedia is free and open, but restricts both freedom and openness where they interfere with creating an encyclopedia. 
In other words, it will delete content that doesn't fit its goals of creating an encyclopedia, but won't delete good encyclopedic content just because it's objectionable. The first isn't censorship, the second is. It's not "censorship" when the private organization is trying to meet its goals, whatever they are. It's "censorship" when outsiders pressure/coerce the organization into removing content they object to that otherwise meets the organization's goals.

Another way of describing the difference is the recent demonetization of Steven Crowder's channel by YouTube. People claim YouTube should've acted earlier, but didn't because they are greedy. This argument demonstrates their intolerance. They aren't arguing that YouTube should remove content in order to achieve its goals of making money. They are arguing that YouTube should remove content they object to, despite hurting the goal of making money. The first wouldn't be censorship, the second most definitely is.

So let's say you are a podcaster. Certainly, don't invite somebody like Crowder on your show, for whatever reason you want. That's not censorship. Let's say you do invite him on your show, and then people complain. That's also not censorship, because people should speak out against things they don't like. But now let's say that people pressure/coerce you into removing Crowder, who aren't listeners to your show anyway, just because they don't want anybody to hear what Crowder has to say. That's censorship.

That's what happened recently with Will Hurd, a congressman from Texas who has sponsored cybersecurity legislation, who was invited to speak at Black Hat, a cybersecurity conference. Many people who disliked his non-cybersecurity politics objected and pressured Black Hat into dis-inviting him. That's censorship. It's one side who refuse to tolerate a politician of the opposing side.

All these arguments about public vs. private censorship are repeats of those made for decades. You can see them here in this TV show (WKRP in Cincinnati) about Christian groups trying to censor obscene song lyrics, which was a big thing in the 1980s.

This section has so far been about social media, but the same applies to private individuals. When terrorists (private individuals) killed half the staff at Charlie Hebdo for making cartoons featuring Muhammad, everyone agreed this was a freedom of speech issue. When South Park was censored due to threats from Islamic terrorists, people likewise claimed it was a free speech issue.

In Russia, the police rarely arrest journalists. Instead, youth groups and thugs beat them up. Russia has one of the worst track records on freedom of speech, but it's mostly private individuals who are responsible, not their government.

These days in America, people justify Antifa's militancy, which tries to restrict the free speech of those they label as "fascists", because it's not government restrictions. It's just private individuals attacking other private individuals. It's no more justified than any of these other violent attacks on speech.

Twitter blocks and free speech

The previous parts are old memes. There's a new meme, that somehow Twitter "blocks" are related to free-speech.

That's nonsense. If I block you on Twitter, then the only speech I'm preventing you from seeing is my own. It also prevents me from seeing some (but not all) stuff you post, but again, the only one affected by this block is me. It doesn't stop others from seeing your content. Censorship is about stopping others from hearing speech that I object to. If there's no others involved, it's not censorship. In particular, while you are free to speak anything you want, I'm likewise free to ignore you.

Sure, there are separate concerns when the President simultaneously uses his Twitter account for official business and also blocks people. That's a can of worms that I don't want to get into. But it doesn't apply to us individuals.


The pro-censorship arguments people are making today are the same arguments people have been making for thousands of years, such as when ancient Rome had the office of "censor" who (among other duties) was tasked with restricting harmful speech. Those arguing for censorship of speech they don't like believe that somehow their arguments are different. They aren't. It's the same bankrupt memes made over and over.

by Robert Graham at August 18, 2019 11:45 PM


Recession is coming

A recession is coming. As someone whose career has endured two big ones, I want to prepare you, the person who hasn't lived through one yet, for what will come.

General Expectations

There are three broad trends that will impact the tech-industry pretty hard.

  • VC will stop, or become very dear.
  • Customers won't have money either, so they'll buy less of your stuff.
  • Bond issues (what companies do when VC can't fill a need) will become very expensive.

If you're aware of how your company makes and spends money this should scare the pants off of you. You don't need to read the rest of this.

But for the rest of you...

Do we know what kind of recession it'll be?

Not yet. The crystal-ball at this stage is suggesting two different kinds of recession:

  • A stock-market panic. Sharp, deep, but ultimately short since the fundamentals are still OKish. Will topple the most leveraged companies.
  • The Trump trade-wars trigger a global sell-off. Long, but shallow. Will affect the whole industry, but won't be a rerun of the 2008-2010 disaster.

If the market is jittery about the trade-wars, a Trump re-election might be the trigger.

Beware of October. While 2008 was a summer recession, the dot-com bust and the two before it were triggered by crashes in October.

What about twitter? They don't make money. Will they fail?

No, their service is way too valuable. What will happen is that new investment will be incredibly hard to get, making improving shareholder value really hard, which in turn will incentivise them to cut costs. That means layoffs. That means they'll invest even less in safety tools.

It means they may enter and leave bankruptcy several times. Once the corner of the recession is turned, they will be bought by a company that does make money. Possibly an old-guard tech corp like Microsoft or Cisco.

This pattern will follow for any company that depends on continuous investment to remain operational.

What about Uber? They don't make money.

Uber's business model is to break the taxi industry with the support of the stock-market. When the money supply dries up, they'll need to turn a profit. This means layoffs. It also means they'll pull out of their especially unprofitable cities. It will mean rate increases. If they fall as far as bankruptcy, expect big rate increases.

People will learn what Uber/Lyft's true costs are by the end of the recession.

Will my private company fail?

The biggest change will be that living on the runway will become extremely expensive. Any investment will cost larger portions of the company, which means employees will get less in a merger or in an IPO.

If your company can actually get into the black (not on the runway), you're in much better shape. You won't be hiring as fast or at all, but you're much more likely to continue to have a job.

But if you can't get off the runway, and can't earn investment? In the early stages of the recession this will be bankruptcies, layoffs, no-notice shutterings, and sadness. In later stages, when profitable companies start bargain hunting, it will mean buy-outs.

What about my stock-options (private company)?

If your company makes it through the recession, expect them to come out highly diluted.

If your company doesn't make it through the recession, you'll only get paid for the shares you exercise. And even then, you may not get much at all.

What about my RSUs (public company)?

Expect the stock price to tank. That $100K grant, vested over four years? May only be worth $60K by the time it has fully vested. This is exactly the risk you take by agreeing to be compensated through stock. Your company doesn't owe you the difference between promised and actual value.

The bottom line: start budgeting your year on your salary only, not your stock exercises.

What about my bonuses?

Most bonus programs I've seen have some component tied to company performance. Expect that to be negative for a few years, which means coming in under, to well under, your bonus targets.

That is, if the bonus program isn't suspended entirely. This can happen, especially if the company is getting shareholder pressure to cut costs.

What about my benefits?

They'll stay the same for at least the first year. If the recession drags on, you may see some changes. The first to go will be 'wellness' benefits like gym memberships. Next will be continuing-education benefits; expect to self-finance conferences that work previously paid for. Expect work-paid travel to be cut way down. In the second or third year, expect to have to cover more of your health-care premium.

There are whole benefits that were standard during the first dot-com boom, and that went away after the crash when shareholders demanded more value. It took years for some of them to come back. It happened after the 2008 crash, and it will happen again this time. This will impact even the FAANG companies.

What does bankruptcy really mean?

It means someone outside of your company is holding your company to certain cost targets. Depending on the chapter, this may have power to break employment agreements; capitalism is about creditors getting paid.

Sometimes the court overseer can force the sale of the company to another company. The rules for this are very different from free-market sales; it could be that the only thing you get out of it is to still have a job. And maybe not that.

Bankruptcies are no fun.

What if I get fired?

You'll get some kind of severance. How much depends on a lot, like whether or not your now-ex company is in bankruptcy. In the early stages, it'll be some cash and an offer to cover your medical expenses for a few months. In a bankruptcy termination, your severance will be the last paycheck and nothing else.

Start building your emergency fund now while the money is still good. If you end up coming to work one morning and can't get past the lobby? You'll be so much happier if you did.

What if I don't work in the bay area? Am I completely screwed?

This will matter less than you think -- hiring in the Bay Area will slump or stop -- but it will hit areas outside the Bay Area, NYC, and the DC metro area even harder. All areas will have very illiquid, if not frozen, hiring markets. For the first time in many tech workers' experience, the tech hiring market may become a buyer's market. Our salaries are the way they are now because it's been a seller's market for over a decade.

This means that offers will be lower than you'd expect, and your annual raise may not happen. Tech salaries dropped during the dot-com crash. They dropped, for a few quarters, during the Great Recession. The Bureau of Labor Statistics tracks this stuff. It can happen again.

by SysAdmin1138 at August 18, 2019 09:28 PM

August 17, 2019

Sarah Allen

graph in rust using petgraph

Getting started with Rust and the petgraph crate, I made a little program to write a graph in “dot” file format. Below is the Rust code and the command-line steps to turn its output into a PNG.

visual graph representation with circles and arrows

use petgraph::Graph;
use petgraph::dot::{Dot, Config};
use std::fs::File;
use std::io::Write;

fn main() {
    println!("hello graph!");
    let mut graph = Graph::<_, i32>::new();
    // nodes get indices 0..=3, matching the edge tuples below
    graph.add_node("A");
    graph.add_node("B");
    graph.add_node("C");
    graph.add_node("D");
    graph.extend_with_edges(&[
        (0, 1), (0, 2), (0, 3),
        (1, 2), (1, 3),
        (2, 3),
    ]);
    println!("{:?}", Dot::with_config(&graph, &[Config::EdgeNoLabel]));
    let output = format!("{}", Dot::with_config(&graph, &[Config::EdgeNoLabel]));
    let mut f = File::create("").unwrap(); // output filename lost from the original post
    f.write_all(&output.as_bytes()).expect("could not write file");
}

output of cargo run:

hello graph!
digraph {
    0 [label="\"A\""]
    1 [label="\"B\""]
    2 [label="\"C\""]
    3 [label="\"D\""]
    0 -> 1
    0 -> 2
    0 -> 3
    1 -> 2
    1 -> 3
    2 -> 3
}

Generate PNG from DOT file

The dot command is part of graphviz, which I installed with brew install graphviz.

The following command creates a PNG file from the .dot file generated by the Rust code above:

dot -T png -O

The resulting PNG is displayed at the top-right of this post (next to the Rust code).

Special thanks to:
* mcarton’s help on stackoverflow for enlightening me on a bit of Rust nuance as I experimented with petgraph.
* rudifa’s post graphviz-on-the-mac

by sarah at August 17, 2019 11:36 PM

August 16, 2019

Racker Hacker

August 14, 2019

Steve Kemp's Blog

That time I didn't find a kernel bug, or did I?

Recently I saw a post to the linux kernel mailing-list containing a simple fix for a use-after-free bug. The code in question originally read:

    hdr->pkcs7_msg = pkcs7_parse_message(buf + buf_len, sig_len);
    if (IS_ERR(hdr->pkcs7_msg)) {
        kfree(hdr);
        return PTR_ERR(hdr->pkcs7_msg);
    }

Here the bug is obvious once it has been pointed out:

  • A structure is freed.
  • But then it is dereferenced, to provide a return value.

This is the kind of bug that would probably have been obvious to me if I'd happened to read the code myself. However, the patch was submitted, so job done? I had some free time, so I figured I'd scan for similar bugs. Writing a trivial Perl script to look for similar things didn't take too long, though it is a bit shoddy:

  • Open each file.
  • If we find a line containing "free(.*)" record the line and the thing that was freed.
  • The next time we find a return look to see if the return value uses the thing that was free'd.
    • If so that's a possible bug. Report it.
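The same hunt can be sketched in Python (a rough approximation of the idea only; the author's actual script was Perl, and the regexes here are illustrative):

```python
import re

def find_suspect_returns(source: str) -> list:
    """Flag 'return ...;' lines that use a value passed to a *free() call
    on an earlier line.  A crude, line-based heuristic -- exactly the kind
    of shoddy scan described above, so expect false positives."""
    suspects = []
    freed = None  # the argument of the most recent free-like call
    for lineno, line in enumerate(source.splitlines(), start=1):
        m = re.search(r"\w*free\w*\(\s*([^)]+?)\s*\)", line)
        if m:
            freed = m.group(1)
            continue
        m = re.search(r"\breturn\b(.*);", line)
        if m and freed and freed in m.group(1):
            suspects.append((lineno, line.strip()))
            freed = None  # report each freed value only once
    return suspects

sample = """\
    kfree(hdr);
    return PTR_ERR(hdr->pkcs7_msg);"""
print(find_suspect_returns(sample))  # -> [(2, 'return PTR_ERR(hdr->pkcs7_msg);')]
```

Like the Perl original, this only pairs a free with the very next return and knows nothing about scope, so every hit needs human review.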

Of course my code is nasty, but it looked like it immediately paid off. I found this snippet of code in linux-5.2.8/drivers/media/pci/tw68/tw68-video.c:

    if (hdl->error) {
        v4l2_ctrl_handler_free(hdl);
        return hdl->error;
    }

That looks promising:

  • The structure hdl is freed, via a dedicated freeing-function.
  • But then we return the member error from it.

Chasing down the code, I found that linux-5.2.8/drivers/media/v4l2-core/v4l2-ctrls.c contains the code for the v4l2_ctrl_handler_free call, and while it doesn't actually free the structure - just some members - it does reset hdl->error to zero.

Ahah! The code I found checks for an error, but because the freeing function resets hdl->error first, it returns zero and the error is lost. I can fix it by changing it to this:

    if (hdl->error) {
        int err = hdl->error;
        v4l2_ctrl_handler_free(hdl);
        return err;
    }

I did that, then looked more closely to see if I was missing something. The code I'd found lives in the function tw68_video_init1; that function is called only once, and its return value is ignored!

So, that's the story of how I scanned the Linux kernel for use-after-free bugs and contributed nothing to anybody.

Still fun though.

I'll go over my list more carefully later, but nothing else jumped out as being immediately bad.

There is a weird case I spotted in ./drivers/media/platform/s3c-camif/camif-capture.c with a similar pattern. In that case the function involved is s3c_camif_create_subdev which is invoked by ./drivers/media/platform/s3c-camif/camif-core.c:

        ret = s3c_camif_create_subdev(camif);
        if (ret < 0)
                goto err_sd;

So I suspect there is something odd there:

  • If there's an error in s3c_camif_create_subdev
    • Then handler->error will be reset to zero.
    • Which means that return handler->error will return 0.
    • Which means that the s3c_camif_create_subdev call should have returned an error, but won't be recognized as having done so.
    • i.e. "0 < 0" is false.

Of course the error-value is only set if this code is hit:

    hdl->buckets = kvmalloc_array(hdl->nr_of_buckets,
                      sizeof(hdl->buckets[0]),
                      GFP_KERNEL | __GFP_ZERO);
    hdl->error = hdl->buckets ? 0 : -ENOMEM;

Which means that the registration of the sub-device fails if there is no memory, and at that point what can you even do?

It's a bug, but it isn't a security bug.

August 14, 2019 10:01 AM

August 13, 2019

Time for Change: Going Independent

After 12 intense years at Nucleus, it's time for something new: as of September 2019 I'll stop my activities at Nucleus and continue to work as an independent, focussing on Oh Dear!, DNS Spy & Syscast.

The road to change

Why change? Why give up a steady income, health- & hospital insurance, a company car, paid holidays, fun colleagues, exciting tech challenges, ... ?

I think it's best explained by showing what an average day looked like in 2016-2017, at the peak of building DNS Spy.



Back when I had the idea to create a DNS monitoring service, the only way I could make it work was to code on it at crazy hours. Before the kids woke up and after they went to bed. Before and after the more-than-full-time-job.

This worked for a surprisingly long time, but eventually I had to drop the morning hours and get some more sleep in.

Because of my responsibilities at Nucleus (for context: a 24/7 managed hosting provider), I was often woken during the night for troubleshooting/interventions. This, on top of the early hours, made it impossible to keep up.

After a while, the new rhythm became similar, but without the morning routine.



Notice anything missing in that schedule? Household chores? Some quality family time? Some personal me-time to relax? Yeah, that wasn't really there.

There comes a point where you have to make a choice: either continue on this path and end up wealthy (probably) but without a family, or choose to prioritize the family first.

As of September 2019, I'll focus on a whole new time schedule instead.



A radical (at least for me) change of plans, where less time is spent working, more time is spent with the kids, my wife, the cats, the garden, ...

I'm even introducing a bit of whatever-the-fuck-i-want-time in there!

What I'll be working on

In a way I'm lucky.

I'm lucky that I spent the previous 10+ years working like a madman, building profitable side businesses and making a name for myself in both the open source/linux and PHP development world. It allows me to enter September 2019 without a job, but with a reasonable assurance that I'll make enough money to support my family.


For starters, I'll have more time & energy to further build on DNS Spy & Oh Dear!. These 2 side businesses will from now on be called "businesses", as they'll be my main source of income. It isn't enough to live on, mind you, so there's work to be done. But at least there's something there to build on.

Next to that, my current plan is to revive and start building on Syscast. The idea formed in 2016 (the "workaholic" phase, pre-DNS Spy) and was actually pretty fleshed out already. Making online courses, building upon the 10+ years of sysadmin & developer knowledge.

Syscast didn't happen in 2016 and pivoted to a podcast that featured impressive names like Daniel Stenberg (curl & libcurl), Seth Vargo (Hashicorp Vault), Matt Holt (Caddy) and many others instead.

I've always enjoyed giving presentations, explaining complicated technologies in easy terms and guiding people to learn new things. Syscast fits that bill and would make for a logical project to work on.

Looking back at an amazing time

A change like this isn't taken lightly. Believe me when I say I've been debating this for some time.

I'm grateful to both founders of Nucleus, Wouter & David, that they've given me a chance in 2007. I dropped out of college, no degree, just a positive attitude and some rookie PHP knowledge. I stumbled upon the job by accident, just googling for a PHP job. Back then, there weren't that many. It was either Nucleus or a career writing PHP for a bank. I think this is where I got lucky.

I've learned to write PHP, manage Linux & Windows servers, do customer support, how to do marketing, keep basic accounting and the value of underpromise and overdeliver. I'll be forever grateful to both of them for the opportunity and the lessons learned.

It was also an opportunity to work with my best friend, Jan, for the last 9 years. Next to existing friends, I'm proud to call many of my colleagues friends too and I hope we can stay in touch over the years. I find relationships form especially tight in intense jobs, when you heavily rely on each other to get the job done.

Open to new challenges

In true LinkedIn parlance: I'm open to new challenges. That might be a couple of days of consultancy on Linux, software architecture, PHP troubleshooting, scalability advice, a Varnish training, ...

I'm not looking for a full-time role anywhere (see the time tables above), but if there's an interesting challenge to work on, I'll definitely consider it. After all, there are mouths to feed at home. ;-)

If you want to chat, have a coffee, exchange ideas, brainstorm or revolutionize the next set of electric cars, feel free to reach out (my contact page has all the details).

But first, a break

However, before I can start doing any of that, I need a time-out.

In September, my kids will go to school and things will be a bit more quiet around the house. After living in a 24/7 work-phase for the last 10 years, I need to cool down first. Maybe I'll work on the businesses, maybe I won't. I have no idea how hard that hammer will hit come September when I suddenly have my hands free.

Maybe I'll even do something entirely different. Either way, I'll have more time to think about it.

by Mattias Geniar at August 13, 2019 07:45 AM

Racker Hacker

buildah error: vfs driver does not support overlay.mountopt options

Storage bins

Buildah and podman make a great pair for building, managing and running containers on a Linux system. You can even use them with GitLab CI with a few small adjustments, namely the switch from the overlayfs to vfs storage driver.

I have some regularly scheduled GitLab CI jobs that attempt to build fresh containers each morning and I use these to get the latest packages and find out early when something is broken in the build process. A failed build appeared in my inbox earlier this week with the following error:

+ buildah bud -f builds/builder-fedora30 -t builder-fedora30 .
vfs driver does not support overlay.mountopt options

My container build script is fairly basic, but it does include a change to use the vfs storage driver:

# Use vfs with buildah. Docker offers overlayfs as a default, but buildah
# cannot stack overlayfs on top of another overlayfs filesystem.

The script doesn’t change any mount options during the build process. A quick glance at /etc/containers/storage.conf revealed a possible problem:

# Storage options to be passed to underlying storage drivers

# mountopt specifies comma separated list of extra mount options
mountopt = "nodev,metacopy=on"

These mount options make sense when used with an overlayfs filesystem, but they are not used with vfs. I commented out the mountopt option, saved the file, and ran a test build locally. Success!

Fixing the build script involved a small change to storage.conf just before building the container:

# Use vfs with buildah. Docker offers overlayfs as a default, but buildah
# cannot stack overlayfs on top of another overlayfs filesystem.

# Newer versions of podman/buildah try to set overlayfs mount options when
# using the vfs driver, and this causes errors.
sed -i '/^mountopt =.*/d' /etc/containers/storage.conf

My containers are happily building again in GitLab.

by Major Hayden at August 13, 2019 12:00 AM

August 10, 2019

Simon Lyall

Audiobooks – July 2019

The Return of the King by J.R.R Tolkien. Narrated by Rob Inglis. Excellent although I should probably listen slower next time. 10/10

Why Superman Doesn’t Take Over the World: What Superheroes Can Tell Us About Economics by J. Brian O’Roark

A good idea for a theme, but the author didn’t quite nail it. Further let down in audiobook format when the narrator talked to invisible diagrams. 6/10

A Fabulous Creation: How the LP Saved Our Lives by David Hepworth

Covers the years 1967 (Sgt Peppers) to 1982 (Thriller) when the LP dominated music. Lots of information, all delivered in the author’s great style. 8/10

The Front Runner by Matt Bai

Nominally a story about the downfall of Democratic presidential front-runner Gary Hart in 1987. Much of the book is devoted to how the norms of political coverage changed at that moment due to changes in technology & culture. 8/10

A race like no other: 26.2 Miles Through the Streets of New York by Liz Robbins

Covering the 2007 New York marathon, it follows the race with several top & amateur racers. Lots of digressions into the history of the race and the runners. Worked well. 8/10

1983: Reagan, Andropov, and a World on the Brink by Taylor Downing

An account of how escalations in the Cold War in 1983 nearly led to nuclear war, with the Americans largely being unaware of the danger. Superb. 9/10

The High cost of Free Parking (2011 edition) by Donald Shoup.

One of the must-read books in the field, although not a revelation for today’s readers. Found it a little repetitive (23 hours), and talking to diagrams and equations doesn’t work in audiobook format. 6/10


by simon at August 10, 2019 11:34 PM


NES and SNES Controllers on a 6502 (like the C64)

NES and SNES controllers support 8 to 12 buttons with only three data pins (plus VCC/GND). Let’s attach them to a C64 – or any 6502-based system!

NES Connector

The NES controller needs to be connected to +5V, GND and three GPIOs.

| 5  6  7   \  
| 4  3  2  1 |  
Pin Description
1 GND
2 CLK
3 LATCH
4 DATA
7 +5V

SNES Connector

The SNES controller’s pins are just like the NES controller’s, but with a different connector.

| 7  6  5 | 4  3  2  1 |  
Pin Description
1 +5V
2 CLK
3 LATCH
4 DATA
7 GND

User Port

The C64 User Port exposes, among other lines, +5V, GND and 8 GPIOs (CIA#2 Port B):

 1 | 2  3  4  5  6  7  8  9  10 | 11 12  
--- ---------------------------- -------  
 A | B  C  D  E  F  H  J  K  L  | M  N

(viewed towards the C64 edge connector)

Pin Description
1 GND
2 +5V
F PB3
H PB4
J PB5
K PB6


Let’s semi-arbitrarily map the signals like this:

GPIO Description
PB3 LATCH (for both controllers)
PB4 DATA (controller 1)
PB5 CLK (for both controllers)
PB6 DATA (controller 2)

The latch and clock outputs go to both controllers. There is a data line for each controller.

So the connection diagram for two NES controllers looks like this:

Description User Port Pin NES #1 Pin NES #2 Pin Color
GND    1 1 1 black
+5V    2 7 7 red
LATCH  F 3 3 blue
DATA#1 H 4 – green
CLK    J 2 2 white
DATA#2 K – 4 yellow

And this is the same diagram for two SNES controllers:

Description User Port Pin SNES #1 Pin SNES #2 Pin Color
GND    1 7 7 black
+5V    2 1 1 red
LATCH  F 3 3 blue
DATA#1 H 4 – green
CLK    J 2 2 white
DATA#2 K – 4 yellow

In fact, you can attach an NES and an SNES connector in parallel for each of the two slots, as long as only one controller is ever plugged in per slot at a time.

This is the user port connector with wires attached for two controllers, using the color scheme above:

This is an NES connector attached as the first controller (green data line):

And this is an SNES connector attached as the second controller (yellow data line):

The Code

The code to read both controllers at a time is pretty simple:

; C64 CIA#2 PB  
nes_data = $dd01  
nes_ddr  = $dd03  
bit_latch = $08 ; PB3 (user port pin F): LATCH (both controllers)  
bit_data1 = $10 ; PB4 (user port pin H): DATA  (controller #1)  
bit_clk   = $20 ; PB5 (user port pin J): CLK   (both controllers)  
bit_data2 = $40 ; PB6 (user port pin K): DATA  (controller #2)

; zero page  
controller1 = $e0 ; 3 bytes  
controller2 = $f0 ; 3 bytes

    lda #$ff-bit_data1-bit_data2  
    sta nes_ddr  
    lda #$00  
    sta nes_data

query_controllers:
    ; pulse latch
    lda #bit_latch
    sta nes_data
    lda #0
    sta nes_data

    ; read 3x 8 bits
    ldx #0
l2: ldy #8
l1: lda nes_data
    cmp #bit_data2    ; C = DATA line of controller #2
    rol controller2,x
    and #bit_data1
    cmp #bit_data1    ; C = DATA line of controller #1
    rol controller1,x
    lda #bit_clk
    sta nes_data
    lda #0
    sta nes_data
    dey
    bne l1
    inx
    cpx #3
    bne l2
    rts

After calling query_controllers, three bytes each at controller1 and controller2 will contain the state:

; byte 0:      | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |  
;         NES  | A | B |SEL|STA|UP |DN |LT |RT |  
;         SNES | B | Y |SEL|STA|UP |DN |LT |RT |  
; byte 1:      | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |  
;         NES  | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |  
;         SNES | A | X | L | R | 1 | 1 | 1 | 1 |  
; byte 2:  
;         $00 = controller present  
;         $FF = controller not present

A 0 bit means the button is pressed, 1 means it is released.

The code pulses LATCH once, which makes the controllers sample their button states and transmit the first bit on their respective DATA lines. Pulsing CLK 15 more times makes the controllers send the remaining bits. Since SNES controllers send 16 bits of data, but NES controllers only send 8 bits, the type of controller can be detected through the lowermost nybble of byte 1. Similarly, the presence of a controller is detected by continuing to read bits: if a controller is attached, it will send 0 bits.
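The detection rules can be modeled in a few lines of Python (an illustrative sketch of the logic just described, not code from the post):

```python
def classify_controller(byte0: int, byte1: int, byte2: int) -> str:
    """Classify one controller port from the three bytes read per port.

    Bits are active-low (0 = pressed); an empty port reads all 1 bits.
    byte0 carries the first 8 buttons and isn't needed for classification.
    """
    if byte2 != 0x00:            # an attached controller keeps sending 0 bits
        return "not present"
    if byte1 & 0x0F == 0x0F:     # SNES pads send 1s in bits 12-15
        return "SNES"
    return "NES"                 # NES pads send only 0s after their 8 buttons

print(classify_controller(0xFF, 0x00, 0x00))  # -> NES (no buttons pressed)
```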


The driver code, together with demo code for the C64 can be found at


For each additional controller, only a single extra GPIO is needed. This way, six controllers would be possible on a single 8-bit I/O port, with only slightly modified code.

by Michael Steil at August 10, 2019 10:52 PM

Errata Security

Hacker Jeopardy, Wrong Answers Only Edition

Among the evening entertainments at DEF CON is "Hacker Jeopardy", like the TV show Jeopardy, but with hacking tech/culture questions. In today's blog post, we are going to play the "Wrong Answers Only" version, in which I die upon the hill defending the wrong answer.

The problem posed is:
Apparently, people gave 21, 22, and 25 as the responses. The correct response, according to RFC assignments of well-known ports, is 23.

But the real correct response is port 21. The problem posed wasn't about which port was assigned to Telnet (port 23), but what you normally see these days. 

Port 21 is assigned to FTP, the file transfer protocol. A little-known fact about FTP is that it uses Telnet for its command channel on port 21. In other words, FTP isn't a plain text-based protocol like SMTP, HTTP, POP3, and so on. Instead, it's layered on top of Telnet. It says so right in RFC 959:

When we look at the popular FTP implementations, we see that they do indeed respond to Telnet control codes on port 21. There are a ton of FTP implementations, of course, so some don't respond to Telnet (treating the command channel as a straight text protocol). But the vast majority of what's out there are implementations that implement Telnet as defined.
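To make "FTP is layered on top of Telnet" concrete, here is a Python sketch of stripping Telnet IAC command sequences from an FTP command-channel stream, roughly what a protocol parser must do before treating the channel as text (simplified: it handles escapes and option negotiation, not subnegotiation):

```python
IAC = 0xFF
# Telnet option-negotiation verbs: WILL, WONT, DO, DONT
NEGOTIATION = {0xFB, 0xFC, 0xFD, 0xFE}

def strip_telnet(data: bytes) -> bytes:
    """Remove Telnet command sequences from an FTP command-channel stream.

    Simplified sketch: handles IAC IAC escapes, 3-byte negotiation
    (IAC WILL/WONT/DO/DONT <option>), and 2-byte commands; a full parser
    would also handle subnegotiation (IAC SB ... IAC SE).
    """
    out = bytearray()
    i = 0
    while i < len(data):
        b = data[i]
        if b != IAC:
            out.append(b)
            i += 1
        elif i + 1 < len(data) and data[i + 1] == IAC:
            out.append(IAC)     # IAC IAC is an escaped literal 0xFF
            i += 2
        elif i + 1 < len(data) and data[i + 1] in NEGOTIATION:
            i += 3              # IAC, verb, option byte
        else:
            i += 2              # IAC plus a single command byte
    return bytes(out)

# A server banner preceded by "IAC DO <option 1>" negotiation:
print(strip_telnet(b"\xff\xfd\x01220 FTP ready\r\n"))  # -> b'220 FTP ready\r\n'
```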

Consider network intrusion detection systems. When they decode FTP, they do so with their Telnet protocol parsers. You can see this in the Snort source code, for example.

The question is "normally seen". Well, Telnet on port 23 has largely been replaced by SSH on port 22, so you don't normally see it on port 23. However, FTP is still popular. While I don't have a hard study to point to, in my experience, the amount of traffic seen on port 21 is vastly higher than that seen on port 23. QED: the port where Telnet is normally seen is port 21.

But the original problem wasn't so much "traffic" seen, but "available". That's a problem we can study with port scanners -- especially mass port scans of the entire Internet. Rapid7 has their yearly Internet Exposure Report. According to that report, port 21 is three times as available on the public Internet as port 23.

So the correct response to the posed problem is port 21! Whoever answered that at Hacker Jeopardy needs to have their score updated to reflect that they gave the right response.

Prove me wrong. 

by Robert Graham at August 10, 2019 10:48 PM

August 09, 2019


Immutable servers and containers

The immutable server pattern has been around for a long time. The whole idea is that you simplify your change-management procedures by making it very hard, if not impossible, to change system state after it is launched. It makes that famous auditor question...

If an engineer just Leeroy Jenkins!!!-es in a change, how do you ensure your approved state is maintained?

...easier to answer.

They're not allowed to log in at all. That should never happen.

No having to explain your puppet/chef application interval and making sure your config-management platform covers everything an engineer can change on the system.

No having to go over your detective controls to make sure sudo usage is monitored and tracked back to change-requests.

No setting up alarms for detecting changes going unapproved.

Turn off SSH on the box, you just can't do it.

Now, this definitely makes the compliance/regulation processes much easier to endure.

It also has benefits that the Docker side of the tech industry has been talking about for years now.

The same rule applies to the immutable-server pattern: if it carries state, it shouldn't be immutable. Because I'm well known for running Logstash, I'll use ElasticSearch as an example of how it could be used with either immutable pattern here (server or Docker).

  • client-only nodes: Dockerize/immutable that thing. Carry no state, negotiate with the rest of the cluster to direct traffic. Go for it.
  • ingestion-only nodes: These carry local state, so losing one can lose some transactions. But it isn't persistent across more than a few seconds, or a minute. Safe to immutable-ize.
  • master-only nodes: These do maintain state, but in the sense of being the final judge of what gets committed to the cluster's state. These are safe-ish to immutable-ize, but care must be taken to ensure a minimum number of them are up and joined to the cluster. This makes them more complicated to integrate into an immutable framework (min: 3 isn't quite enough assurance).
  • data-only nodes: Filled to the gills with state. So much state, it can take hours for the cluster to fully recover from a loss. Subject these to your full change-management, detective-controls config.

Databases of any kind should be exempted from immutable-ness or containerization.

It turns out much of my career has been in maintaining state-containing machines. It's only recently that I've started working in architectures that even have logic-only nodes. I don't have the reflexes for it yet, but I do know what shouldn't go into a container.

These are good patterns to follow, but know where they won't apply cleanly and accept that.

by SysAdmin1138 at August 09, 2019 02:04 PM

August 08, 2019


The lifecycle of infrastructure at a standard-pattern cloudy startup

There seems to be a consistent pattern of infrastructure usage for cloud-based startups running on the Silicon Valley pattern of startup growth.

Bootstrapped (no venture capital) Startup:

Money is a precious, precious thing for this kind of startup. Every penny is accounted for, and cost-saving measures like delete-fests and delaying cost-bearing updates happen a lot. Time is cheap, money is not.

In companies like these, cloud expenses will be purely compute. Why pay extra for Amazon/Microsoft to manage the database, when we can run it ourselves on the same sized instance for much less money?

Early Stage Venture-Capitalist Funded Startup:

This company has runway. Money is a concern, but a distant one thanks to VC. Money is cheap, time is not.

In companies like these, cloud expenses run the board. Why spend time managing a database and replication chains when Amazon/Microsoft does it for you? Why bother running container frameworks, when the cloud vendor has one of their own that works well enough?

  • SaaS it where possible.
  • If it carries state, PaaS it.
  • Offload as much ops-work as you can.
  • Focus on what the company is good at, and don't bother to reinvent management frameworks for state-containing services.

Mid Stage, Starting To See Scaling Issues Startup Company

This company has been around a while, and their infrastructure is starting to run into scaling problems. Maybe the container framework isn't flexible enough. Or the database offering can't handle multi-region failover well, or at all. Money is always there, time is budgeted, but complexity is not cheap.

Companies that reach this stage are starting to feel the corners of the box that the cloud-provider puts folk into. This is when companies start in-sourcing some of the previously cloud-provider offerings.

  • Implement a novel datastore that isn't offered by the provider, because the new datastore solves more problems than in-sourcing causes.
  • Implement a RDBMS replication framework too complex for the provider.
  • In-source container frameworks because scaling-bugs in the provider are that annoying.
  • Many other things.

Global Company

Can't really be called a startup anymore, much as people would like to. Instead, they get the name Unicorn because of how rare it is for a startup to get to this stage without failing or getting eaten by another Unicorn. They're profitable (for SV values of profit), always hiring, and have been managing complexity for years.

Companies that reach this stage have enough compute going on that the question of "do we need to build our own datacenters to save money?" becomes a real concern. They have a long history of cloud-provider usage, but that relationship has proven to be a bit too... training-wheels lately.

  • Keep stuff in the cloud-provider that the cloud-provider is good at (S3 buckets, for instance)
  • Put into your own infrastructure things that are central to the business, and core to the offering.
  • Maintain cloud-provider relationships for peripheral products, development work, and business-automation work (that isn't SaaSed already).
  • Open-source the frameworks and homebrew products that have been used for years internally (spinnaker, kafka, kubernetes...)

by SysAdmin1138 at August 08, 2019 07:29 PM

August 07, 2019

Racker Hacker

Fedora 30 on Google Compute Engine

Google building

Fedora 30 is my primary operating system for desktops and servers, so I usually try to take it everywhere I go. I was recently doing some benchmarking for kernel compiles on different cloud platforms and I noticed that Fedora isn’t included in Google Compute Engine’s default list of operating system images.

(Note: Fedora does include links to quick-start an Amazon EC2 instance with their pre-built AMIs. They are superb!)

First try

Fedora does offer cloud images in raw and qcow2 formats, so I decided to give that a try. Start by downloading the image, decompressing it, and then repackaging the image into a tarball.

$ wget
$ xz -d Fedora-Cloud-Base-30-1.2.x86_64.raw.xz
$ mv Fedora-Cloud-Base-30-1.2.x86_64.raw disk.raw
$ tar cvzf fedora-30-google-cloud.tar.gz disk.raw

Once that’s done, create a bucket on Google storage and upload the tarball.

$ gsutil mb gs://fedora-cloud-image-30
$ gsutil cp fedora-30-google-cloud.tar.gz gs://fedora-cloud-image-30/

Uploading 300MB on my 10mbit/sec uplink was a slow process. When that’s done, tell Google Compute Engine that we want a new image made from this raw disk we uploaded:

$ gcloud compute images create fedora-30-google-cloud \
    --source-uri gs://fedora-cloud-image-30/fedora-30-google-cloud.tar.gz
After a few minutes, a new custom image called fedora-30-google-cloud will appear in the list of images in Google Compute Engine.

$ gcloud compute images list | grep -i fedora
fedora-30-google-cloud   major-hayden-20150520    PENDING
$ gcloud compute images list | grep -i fedora
fedora-30-google-cloud   major-hayden-20150520    PENDING
$ gcloud compute images list | grep -i fedora
fedora-30-google-cloud   major-hayden-20150520    READY

I opened a browser, ventured to the Google Compute Engine console, and built a new VM with my image.

Problems abound

However, there are problems when the instance starts up. The serial console has plenty of errors:

[WARNING]: address "" is not resolvable

Obviously something is wrong with DNS. It’s apparent that cloud-init is stuck in a bad loop:

[WARNING]: Calling '' failed [87/120s]: bad status code [404]
[WARNING]: Calling '' failed [93/120s]: bad status code [404]
[WARNING]: Calling '' failed [99/120s]: bad status code [404]
[WARNING]: Calling '' failed [105/120s]: bad status code [404]
[WARNING]: Calling '' failed [112/120s]: bad status code [404]
[WARNING]: Calling '' failed [119/120s]: unexpected error [Attempted to set connect timeout to 0.0, but the timeout cannot be set to a value less than or equal to 0.]
[CRITICAL]: Giving up on md from [''] after 126 seconds

Those are EC2-type metadata queries and they won’t work here. The instance also has no idea how to set up networking:

Cloud-init v. 17.1 running 'init' at Wed, 07 Aug 2019 18:27:07 +0000. Up 17.50 seconds.
ci-info: +++++++++++++++++++++++++++Net device info++++++++++++++++++++++++++++
ci-info: +--------+-------+-----------+-----------+-------+-------------------+
ci-info: | Device |   Up  |  Address  |    Mask   | Scope |     Hw-Address    |
ci-info: +--------+-------+-----------+-----------+-------+-------------------+
ci-info: | eth0:  | False |     .     |     .     |   .   | 42:01:0a:f0:00:5f |
ci-info: |  lo:   |  True | | |   .   |         .         |
ci-info: |  lo:   |  True |     .     |     .     |   d   |         .         |
ci-info: +--------+-------+-----------+-----------+-------+-------------------+

This image is set up well for Amazon, but it needs some adjustments before it will work on Google Compute Engine.

Fixing up the image

Go back to the disk.raw that we made in the first step of the blog post. We need to mount that disk, mount some additional filesystems, and chroot into the Fedora 30 installation on the raw disk.

Start by making a loop device for the raw disk and enumerating its partitions:

$ sudo losetup /dev/loop0 disk.raw
$ sudo kpartx -a /dev/loop0

Make a mountpoint and mount the first partition on that mountpoint:

$ sudo mkdir /mnt/disk
$ sudo mount /dev/mapper/loop0p1 /mnt/disk

We need some extra filesystems mounted before we can run certain commands in the chroot:

$ sudo mount --bind /dev /mnt/disk/dev
$ sudo mount --bind /sys /mnt/disk/sys
$ sudo mount --bind /proc /mnt/disk/proc

Now we can hop into the chroot:

$ sudo chroot /mnt/disk

From inside the chroot, remove cloud-init and install google-compute-engine-tools to help with Google cloud:

$ dnf -y remove cloud-init
$ dnf -y install google-compute-engine-tools
$ dnf clean all

The google-compute-engine-tools package has lots of services that help with running on Google cloud. We need to enable each one to run at boot time:

$ systemctl enable google-accounts-daemon google-clock-skew-daemon \
    google-instance-setup google-network-daemon \
    google-shutdown-scripts google-startup-scripts

To learn more about these daemons and what they do, head on over to the GitHub page for the package.

Exit the chroot and get back to your main system. Now that we have this image just like we want it, it’s time to unmount the image and send it to the cloud:

$ sudo umount /mnt/disk/dev /mnt/disk/sys /mnt/disk/proc
$ sudo umount /mnt/disk
$ sudo losetup -d /dev/loop0
$ tar cvzf fedora-30-google-cloud-fixed.tar.gz disk.raw
$ gsutil cp fedora-30-google-cloud-fixed.tar.gz gs://fedora-cloud-image-30/
$ gcloud compute images create fedora-30-google-cloud-fixed --source-uri \
    gs://fedora-cloud-image-30/fedora-30-google-cloud-fixed.tar.gz

Start a new instance with this fixed image and watch it boot in the serial console:

[   10.379253] RAPL PMU: API unit is 2^-32 Joules, 3 fixed counters, 10737418240 ms ovfl timer
[   10.381350] RAPL PMU: hw unit of domain pp0-core 2^-0 Joules
[   10.382487] RAPL PMU: hw unit of domain package 2^-0 Joules
[   10.383415] RAPL PMU: hw unit of domain dram 2^-16 Joules
[   10.503233] EDAC sbridge:  Ver: 1.1.2

Fedora 30 (Cloud Edition)
Kernel 5.1.20-300.fc30.x86_64 on an x86_64 (ttyS0)

instance-2 login:

Yes! A ten second boot with networking is exactly what I needed.

by Major Hayden at August 07, 2019 12:00 AM

August 06, 2019

Everything Sysadmin

Demo Data as Code

My newest article for acmQueue magazine is called Demo Data as Code:

by Tom Limoncelli at August 06, 2019 07:43 PM

August 05, 2019

Errata Security

Securing devices for DEFCON

There's been much debate whether you should get burner devices for hacking conventions like DEF CON (phones or laptops). A better discussion would be to list those things you should do to secure yourself before going, just in case.

These are the things I worry about:
  • backup before you go
  • update before you go
  • correctly locking your devices with full disk encryption
  • correctly configuring WiFi
  • Bluetooth devices
  • Mobile phone vs. Stingrays
  • USB

Backup before you go

Traveling means a higher chance of losing your device. In my review of crime statistics, theft in Las Vegas seems less of a threat than in whatever city you are coming from. My guess is that while thieves may want to target tourists, the police want even more to target gangs of thieves, to protect the cash cow that is the tourist industry. But you are still more likely to accidentally leave a phone in a taxi or have your laptop crushed in the overhead bin. If you haven't recently backed up your device, now would be an extra useful time to do this.

Anything I want backed up on my laptop is already in Microsoft's OneDrive, so I don't pay attention to this. However, I have a lot of pictures on my iPhone that I don't have in iCloud, so I copy those off before I go.


Update before you go

Like most of you, I put off updates unless they are really important, updating every few months rather than every month. Now is a great time to make sure you have the latest updates.

Backup before you update, but then, I already mentioned that above.

Full disk encryption

This is enabled by default on phones, but not the default for laptops. It means that if you lose your device, adversaries can't read any data from it.

You are at risk if you have a simple unlock code, like a predictable pattern or a 4-digit code. The longer and less predictable your unlock code, the more secure you are.

I use iPhone's "face id" on my phone so that people looking over my shoulder can't figure out my passcode when I need to unlock the phone. However, because this enables the police to easily unlock my phone, by putting it in front of my face, I also remember how to quickly disable face id (by holding the buttons on both sides for 2 seconds).

As for laptops, it's usually easy to enable full disk encryption. However there are some gotchas. Microsoft requires a TPM for its BitLocker full disk encryption, which your laptop might not support. I don't know why all laptops don't just have TPMs, but they don't. You may be able to use some tricks to get around this. There are also third party full disk encryption products that use simple passwords.

If you don't have a TPM, then hackers can brute-force crack your password, trying billions per second. This applies to my MacBook Air, which is the 2017 model before Apple started adding their "T2" chip to all their laptops. Therefore, I need a strong login password.
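To put "billions per second" in perspective, here is a back-of-the-envelope Python sketch; the guess rate is a hypothetical round number, not a measurement of any particular cracking rig:

```python
# Hypothetical offline guess rate (no TPM to rate-limit the attacker).
RATE = 1_000_000_000  # guesses per second

def crack_seconds(alphabet: int, length: int) -> float:
    """Worst-case time to exhaust the keyspace at RATE guesses/second."""
    return alphabet ** length / RATE

print(f"4-digit PIN:       {crack_seconds(10, 4):.6f} s")
print(f"8 lowercase chars: {crack_seconds(26, 8):.0f} s")
print(f"12 mixed chars:    {crack_seconds(72, 12) / (86400 * 365):,.0f} years")
```

A 4-digit PIN falls instantly and 8 lowercase letters in minutes, while 12 characters drawn from a large alphabet pushes the worst case into geological time; that's the whole argument for a long login password on a TPM-less laptop.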

I deal with this on my MacBook by having two accounts. When I power on the device, I log into an account using a long/complicated password. I then switch to a second account with a simpler password for going in/out of sleep mode. This second account can't be used to decrypt the drive.

On Linux, my password to decrypt the drive is similarly long, while the user account password is pretty short.

I ignore the "evil maid" threat, because my devices are always with me rather than in the hotel room.

Configuring WiFi

Now would be a good time to clear out your saved WiFi lists, on both your laptop and phone. You should do this regularly anyway. Anything that doesn't include a certificate should be removed. Your device will try to connect to known access-points, and hackers will setup access points with those names trying to entrap you.

If you want to use the official DEF CON WiFi, they provide a certificate which you can grab and install on your device. It wasn't available when I first wrote this, but it's available now. The certificate authenticates the network, so that you won't be tricked into connecting to fake/evil-twin access points.

You shouldn't connect via WiFi to anything for which you don't have a certificate while in Vegas. There will be fake access points all over the place. I monitor the WiFi spectrum every DEF CON and there's always shenanigans going on. I'm not sure exactly what attacks they are attempting, I just know there's a lot of nonsense going on.

I also reset the WiFi MAC address in my laptop. When you connect to WiFi, your MAC address is exposed. This can reveal your identity to anybody tracking you, so it's good to change it. Doing so on notebooks is easy, though I don't know how to do this on phones (so I don't bother).
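Generating a replacement MAC is simple. Here is a Python sketch; the interface name in the comments is hypothetical and the apply step is Linux-specific:

```python
import random

# Generate a random locally-administered MAC address. Setting the 0x02
# bit in the first octet marks the address "locally administered", so it
# cannot collide with a real vendor OUI.
def random_mac() -> str:
    octets = [0x02] + [random.randint(0, 255) for _ in range(5)]
    return ":".join(f"{o:02x}" for o in octets)

print(random_mac())

# Applying it is OS-specific; on Linux with iproute2 it would look like
# (as root, with a hypothetical interface name):
#   ip link set dev wlan0 down
#   ip link set dev wlan0 address <new mac>
#   ip link set dev wlan0 up
```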

Bluetooth trackers

Like with WiFi MAC addresses, people can track you with your Bluetooth devices. The problem is chronic with devices like headphones, fitness trackers, and those "Tile" devices that are designed to be easily tracked.

Your phone itself probably randomizes its MAC address to avoid easy tracking, so that's less of a concern. According to my measurements, though, my MacBook exposes its MAC address pretty readily via Bluetooth.

Instead of merely tracking you, hackers may hack into the devices. While phones and laptops are pretty secure against this threat (with the latest updates applied), all the other Bluetooth devices I play with seem to have gaping holes just waiting to be hacked. Your fitness tracker is likely safe walking around your neighborhood, but people at DEFCON may be playing tricks on it.

Personally, I'm bringing my fitness tracker in the hope that somebody will hack it. The biggest threat is loss of the device, or being tracked. It's not that they'll be able to hack into my bank account or something.

Mobile phone vs. Stingrays

While the DEF CON WiFi is protected against impersonation, the mobile network isn't. Anybody can setup evil twin cell towers and intercept your phone traffic. The insecurity of the mobile phone network is pretty astonishing; you can't protect yourself against it.

But at least there's no reason to believe you are under any worse threat at DEF CON. Any attempt to setup interception devices by attendees will quickly bring down the Feds (unless, of course, they do it in the 900 MHz range).

I install apps on my phone designed to track these things. I'm not diligent at it, but I've never seen such devices ("Stingrays" or "IMSI Catchers") at DEF CON, operated either by attendees or the Feds.


USB

Mousejacking is still a threat, where wireless mouse/keyboard dongles can be hijacked. So don't bring those.

Malicious USB devices that people connect to your computer are a threat. A good example is the "USB Rubber Ducky" device. Some people disable USB entirely. Others use software to "whitelist" which devices can be plugged in. I largely ignore this threat.

Note that a quick google of "disable USB" leads to the wrong solutions: they are focused on controlling thumbdrives. That's not really the threat. Instead, the threat is things like network adapters that will redirect network traffic to/from the device, enabling attacks that you think you are immune to because you aren't connected to a network.


Summary

I've probably forgotten things on this list. Maybe I'll update this later when people point out the things I missed.

If you pay attention to WiFi, Bluetooth, and full disk encryption, you are likely fine.

You are still in danger from other minor shenanigans, like people tracking you.

There are still some chronic problems, like mobile network or USB security, but at the same time, they aren't big enough threats for me to worry about.

by Robert Graham ( at August 05, 2019 07:16 AM

August 04, 2019


The Apollo Guidance Computer [video + slides]

I re-did an updated version of my part of the Ultimate Apollo Guidance Computer Talk at VCF West 2019.

It was followed by Frank O’Brien‘s talk on the role of the AGC in the Apollo missions.

Here is the video:

And here are my slides in PDF format:

AGC_CHM.pdf (43 MB)

by Michael Steil at August 04, 2019 09:28 PM

August 02, 2019

Vincent Bernat

Securing BGP on the host with origin validation

An increasingly popular design for a datacenter network is BGP on the host: each host ships with a BGP daemon to advertise the IPs it handles and receives the routes to its fellow servers. Compared to a L2-based design, it is very scalable, resilient, cross-vendor and safe to operate.1 Take a look at “L3 routing to the hypervisor with BGP” for a usage example.

Spine-leaf fabric with two spine routers, six leaf routers and nine physical hosts. All links have a BGP session established over them. Some of the servers have a speech balloon spelling out the IP prefix they want to handle.
BGP on the host with a spine-leaf IP fabric. A BGP session is established over each link and each host advertises its own IP prefixes.

While routing on the host eliminates the security problems related to Ethernet networks, a server may announce any IP prefix. In the above picture, two of them are announcing 2001:db8:cc::/64. This could be a legit use of anycast or a prefix hijack. BGP offers several solutions to improve this aspect and one of them is to reuse the features around the RPKI.

Short introduction to the RPKI

On the Internet, BGP is mostly relying on trust. This contributes to various incidents due to operator errors, like the one that affected Cloudflare a few months ago, or to malicious attackers, like the hijack of Amazon DNS to steal cryptocurrency wallets. RFC 7454 explains the best practices to avoid such issues.

IP addresses are allocated by five Regional Internet Registries (RIR). Each of them maintains a database of the assigned Internet resources, notably the IP addresses and the associated AS numbers. These databases may not be totally reliable but are widely used to build ACLs to ensure peers only announce the prefixes they are expected to. Here is an example of ACLs generated by bgpq3 when peering directly with Apple:2

$ bgpq3 -l v6-IMPORT-APPLE -6 -R 48 -m 48 -A -J -E AS-APPLE
policy-options {
 policy-statement v6-IMPORT-APPLE {
  from {
    route-filter 2403:300::/32 upto /48;
    route-filter 2620:0:1b00::/47 prefix-length-range /48-/48;
    route-filter 2620:0:1b02::/48 exact;
    route-filter 2620:0:1b04::/47 prefix-length-range /48-/48;
    route-filter 2620:149::/32 upto /48;
    route-filter 2a01:b740::/32 upto /48;
    route-filter 2a01:b747::/32 upto /48;

The RPKI (RFC 6480) adds public-key cryptography on top of it to sign the authorization for an AS to be the origin of an IP prefix. Such record is a Route Origination Authorization (ROA). You can browse the databases of these ROAs through the RIPE’s RPKI Validator instance:

Screenshot from an instance of RPKI validator showing the validity of a prefix for AS 64476
RPKI validator showing one ROA for a prefix of AS 64476

BGP daemons do not have to download the databases or to check digital signatures to validate the received prefixes. Instead, they offload these tasks to a local RPKI validator implementing the “RPKI-to-Router Protocol” (RTR, RFC 6810).

For more details, have a look at “RPKI and BGP: our path to securing Internet Routing.”

Using origin validation in the datacenter

While it is possible to create our own RPKI for use inside the datacenter, we can take a shortcut and use a validator implementing RTR, like GoRTR, and accepting another source of truth. Let’s work on the following topology:

Spine-leaf fabric two spine routers, six leaf routers and nine physical hosts. All links have a BGP session established over them. Three of the physical hosts are validators and RTR sessions are established between them and the top-of-the-rack routers—except their own top-of-the-racks.
BGP on the host with prefix validation using RTR. Each server has its own AS number. The leaf routers establish RTR sessions to the validators.

Let's assume we have a place to maintain a mapping between the private AS numbers used by each host and the allowed prefixes:3

ASN Allowed prefixes
AS 65005 2001:db8:aa::/64
AS 65006 2001:db8:bb::/64, 2001:db8:11::/64
AS 65007 2001:db8:cc::/64
AS 65008 2001:db8:dd::/64
AS 65009 2001:db8:ee::/64, 2001:db8:11::/64
AS 65010 2001:db8:ff::/64

From this table, we build a JSON file for GoRTR, assuming each host can announce the provided prefixes or longer ones (like 2001:db8:aa::­42:d9ff:­fefc:287a/128 for AS 65005):

  "roas": [
      "prefix": "2001:db8:aa::/64",
      "maxLength": 128,
      "asn": "AS65005"
    }, {
      "…": "…"
    }, {
      "prefix": "2001:db8:ff::/64",
      "maxLength": 128,
      "asn": "AS65010"
    }, {
      "prefix": "2001:db8:11::/64",
      "maxLength": 128,
      "asn": "AS65006"
    }, {
      "prefix": "2001:db8:11::/64",
      "maxLength": 128,
      "asn": "AS65009"

This file is deployed to all validators and served by a web server. GoRTR is configured to fetch it and update it every 10 minutes:

$ gortr -refresh=600 \
        -verify=false -checktime=false
INFO[0000] New update (7 uniques, 8 total prefixes). 0 bytes. Updating sha256 hash  -> 68a1d3b52db8d654bd8263788319f08e3f5384ae54064a7034e9dbaee236ce96
INFO[0000] Updated added, new serial 1

The refresh time could be lowered but GoRTR can be notified of an update using the SIGHUP signal. Clients are immediately notified of the change.

The next step is to configure the leaf routers to validate the received prefixes using the farm of validators. Most vendors support RTR:

Platform Over TCP? Over SSH?
Juniper JunOS ✔️
Cisco IOS XR ✔️ ✔️
Cisco IOS XE ✔️
Cisco IOS ✔️
Arista EOS
BIRD ✔️ ✔️
FRR ✔️ ✔️
GoBGP ✔️

Configuring JunOS

JunOS only supports plain-text TCP. First, let’s configure the connections to the validation servers:

routing-options {
    validation {
        group RPKI {
            session validator1 {
                hold-time 60;         # session is considered down after 1 minute
                record-lifetime 3600; # cache is kept for 1 hour
                refresh-time 30;      # cache is refreshed every 30 seconds
                port 8282;
            }
            session validator2 { /* OMITTED */ }
            session validator3 { /* OMITTED */ }
        }
    }
}

By default, at most two sessions are randomly established at the same time. This provides a good way to load-balance them among the validators while maintaining good availability. The second step is to define the policy for route validation:

policy-options {
    policy-statement ACCEPT-VALID {
        term valid {
            from {
                protocol bgp;
                validation-database valid;
            }
            then {
                validation-state valid;
                accept;
            }
        }
        term invalid {
            from {
                protocol bgp;
                validation-database invalid;
            }
            then {
                validation-state invalid;
                reject;
            }
        }
    }
    policy-statement REJECT-ALL {
        then reject;
    }
}

The policy statement ACCEPT-VALID turns the validation state of a prefix from unknown to valid if the ROA database says it is valid. It also accepts the route. If the prefix is invalid, the prefix is marked as such and rejected. We have also prepared a REJECT-ALL statement to reject everything else, notably unknown prefixes.

A ROA only certifies the origin of a prefix. A malicious actor can therefore prepend the expected AS number to the AS path to circumvent the validation. For example, AS 65007 could announce 2001:db8:dd::/64, a prefix allocated to AS 65008, by advertising it with the AS path 65007 65008. To avoid that, we define an additional policy statement to reject AS paths with more than one AS:

policy-options {
    as-path EXACTLY-ONE-ASN "^.$";
    policy-statement ONLY-DIRECTLY-CONNECTED {
        term exactly-one-asn {
            from {
                protocol bgp;
                as-path EXACTLY-ONE-ASN;
            }
            then next policy;
        }
        then reject;
    }
}
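To see concretely why origin validation alone isn't enough, here is a small illustrative Python sketch (not router code; the ROA mirrors one row of the table above) showing that a forged AS path ending in the legitimate origin passes a naive origin-only check:

```python
import ipaddress

# One ROA from the table above: (prefix, maxLength, origin AS).
roas = [("2001:db8:dd::/64", 128, 65008)]

def origin_valid(prefix: str, origin_as: int) -> bool:
    """Naive ROA check: only looks at the claimed origin AS."""
    net = ipaddress.ip_network(prefix)
    return any(
        asn == origin_as
        and net.subnet_of(ipaddress.ip_network(roa_prefix))
        and net.prefixlen <= max_len
        for roa_prefix, max_len, asn in roas
    )

# AS 65007 forges the AS path "65007 65008" for AS 65008's prefix.
forged_path = [65007, 65008]
print(origin_valid("2001:db8:dd::42/128", forged_path[-1]))  # True: origin check passes
print(origin_valid("2001:db8:dd::42/128", forged_path[0]))   # False: the real peer AS fails
```

This is exactly why the EXACTLY-ONE-ASN filter above (and enforce-first-as) matters: with a single-AS path, the origin AS and the peer AS are necessarily the same.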

The last step is to configure the BGP sessions:

protocols {
    bgp {
        group HOSTS {
            local-as 65100;
            type external;
            # export [ … ];
            import [ ONLY-DIRECTLY-CONNECTED ACCEPT-VALID REJECT-ALL ];
            enforce-first-as;
            neighbor 2001:db8:42::a10 {
                peer-as 65005;
            }
            neighbor 2001:db8:42::a12 {
                peer-as 65006;
            }
            neighbor 2001:db8:42::a14 {
                peer-as 65007;
            }
        }
    }
}

The import policy rejects any AS path longer than one AS, accepts any validated prefix and rejects everything else. The enforce-first-as directive is also pretty important: it ensures the first (and, here, only) AS in the AS path matches the peer AS. Without it, a malicious neighbor could inject a prefix using an AS different than its own, defeating our purpose.4

Let’s check the state of the RTR sessions and the database:

> show validation session
Session                                  State   Flaps     Uptime #IPv4/IPv6 records
2001:db8:4242::10                        Up          0   00:16:09 0/9
2001:db8:4242::11                        Up          0   00:16:07 0/9
2001:db8:4242::12                        Connect     0            0/0

> show validation database
RV database for instance master

Prefix                 Origin-AS Session                                 State   Mismatch
2001:db8:11::/64-128       65006 2001:db8:4242::10                       valid
2001:db8:11::/64-128       65006 2001:db8:4242::11                       valid
2001:db8:11::/64-128       65009 2001:db8:4242::10                       valid
2001:db8:11::/64-128       65009 2001:db8:4242::11                       valid
2001:db8:aa::/64-128       65005 2001:db8:4242::10                       valid
2001:db8:aa::/64-128       65005 2001:db8:4242::11                       valid
2001:db8:bb::/64-128       65006 2001:db8:4242::10                       valid
2001:db8:bb::/64-128       65006 2001:db8:4242::11                       valid
2001:db8:cc::/64-128       65007 2001:db8:4242::10                       valid
2001:db8:cc::/64-128       65007 2001:db8:4242::11                       valid
2001:db8:dd::/64-128       65008 2001:db8:4242::10                       valid
2001:db8:dd::/64-128       65008 2001:db8:4242::11                       valid
2001:db8:ee::/64-128       65009 2001:db8:4242::10                       valid
2001:db8:ee::/64-128       65009 2001:db8:4242::11                       valid
2001:db8:ff::/64-128       65010 2001:db8:4242::10                       valid
2001:db8:ff::/64-128       65010 2001:db8:4242::11                       valid

  IPv4 records: 0
  IPv6 records: 18

Here is an example of accepted route:

> show route protocol bgp table inet6 extensive all
inet6.0: 11 destinations, 11 routes (8 active, 0 holddown, 3 hidden)
2001:db8:bb::42/128 (1 entry, 0 announced)
        *BGP    Preference: 170/-101
                Next hop type: Router, Next hop index: 0
                Address: 0xd050470
                Next-hop reference count: 4
                Source: 2001:db8:42::a12
                Next hop: 2001:db8:42::a12 via em1.0, selected
                Session Id: 0x0
                State: <Active NotInstall Ext>
                Local AS: 65006 Peer AS: 65000
                Age: 12:11
                Validation State: valid
                Task: BGP_65000.2001:db8:42::a12+179
                AS path: 65006 I
                Localpref: 100
                Router ID:

A rejected route would be similar with the reason “rejected by import policy” shown in the details and the validation state would be invalid.

Configuring BIRD

BIRD supports both plain-text TCP and SSH. Let’s configure it to use SSH. We need to generate keypairs for both the leaf router and the validators (they can all share the same keypair). We also have to create a known_hosts file for BIRD:

(validatorX)$ ssh-keygen -qN "" -t rsa -f /etc/gortr/ssh_key
(validatorX)$ echo -n "validatorX:8283 " ; \
              cat /etc/gortr/
validatorX:8283 ssh-rsa AAAAB3[…]Rk5TW0=
(leaf1)$ ssh-keygen -qN "" -t rsa -f /etc/bird/ssh_key
(leaf1)$ echo 'validator1:8283 ssh-rsa AAAAB3[…]Rk5TW0=' >> /etc/bird/known_hosts
(leaf1)$ echo 'validator2:8283 ssh-rsa AAAAB3[…]Rk5TW0=' >> /etc/bird/known_hosts
(leaf1)$ cat /etc/bird/
ssh-rsa AAAAB3[…]byQ7s=
(validatorX)$ echo 'ssh-rsa AAAAB3[…]byQ7s=' >> /etc/gortr/authorized_keys

GoRTR needs additional flags to allow connections over SSH:

$ gortr -refresh=600 -verify=false -checktime=false \
      -cache= \
      -ssh.bind=:8283 \
      -ssh.key=/etc/gortr/ssh_key \
      -ssh.method.key=true \
      -ssh.auth.user=rpki \
INFO[0000] Enabling ssh with the following authentications: password=false, key=true
INFO[0000] New update (7 uniques, 8 total prefixes). 0 bytes. Updating sha256 hash  -> 68a1d3b52db8d654bd8263788319f08e3f5384ae54064a7034e9dbaee236ce96
INFO[0000] Updated added, new serial 1

Then, we can configure BIRD to use these RTR servers:

roa6 table ROA6;
template rpki VALIDATOR {
   roa6 { table ROA6; };
   transport ssh {
     user "rpki";
     remote public key "/etc/bird/known_hosts";
     bird private key "/etc/bird/ssh_key";
   };
   refresh keep 30;
   retry keep 30;
   expire keep 3600;
}
protocol rpki VALIDATOR1 from VALIDATOR {
   remote validator1 port 8283;
}
protocol rpki VALIDATOR2 from VALIDATOR {
   remote validator2 port 8283;
}

Unlike JunOS, BIRD doesn’t have a feature to only use a subset of validators. Therefore, we only configure two of them. As a safety measure, if both connections become unavailable, BIRD will keep the ROAs for one hour.

We can query the state of the RTR sessions and the database:

> show protocols all VALIDATOR1
Name       Proto      Table      State  Since         Info
VALIDATOR1 RPKI       ---        up     17:28:56.321  Established
  Cache server:     rpki@validator1:8283
  Status:           Established
  Transport:        SSHv2
  Protocol version: 1
  Session ID:       0
  Serial number:    1
  Last update:      before 25.212 s
  Refresh timer   : 4.787/30
  Retry timer     : ---
  Expire timer    : 3574.787/3600
  No roa4 channel
  Channel roa6
    State:          UP
    Table:          ROA6
    Preference:     100
    Input filter:   ACCEPT
    Output filter:  REJECT
    Routes:         9 imported, 0 exported, 9 preferred
    Route change stats:     received   rejected   filtered    ignored   accepted
      Import updates:              9          0          0          0          9
      Import withdraws:            0          0        ---          0          0
      Export updates:              0          0          0        ---          0
      Export withdraws:            0        ---        ---        ---          0

> show route table ROA6
Table ROA6:
    2001:db8:11::/64-128 AS65006  [VALIDATOR1 17:28:56.333] * (100)
                                  [VALIDATOR2 17:28:56.414] (100)
    2001:db8:11::/64-128 AS65009  [VALIDATOR1 17:28:56.333] * (100)
                                  [VALIDATOR2 17:28:56.414] (100)
    2001:db8:aa::/64-128 AS65005  [VALIDATOR1 17:28:56.333] * (100)
                                  [VALIDATOR2 17:28:56.414] (100)
    2001:db8:bb::/64-128 AS65006  [VALIDATOR1 17:28:56.333] * (100)
                                  [VALIDATOR2 17:28:56.414] (100)
    2001:db8:cc::/64-128 AS65007  [VALIDATOR1 17:28:56.333] * (100)
                                  [VALIDATOR2 17:28:56.414] (100)
    2001:db8:dd::/64-128 AS65008  [VALIDATOR1 17:28:56.333] * (100)
                                  [VALIDATOR2 17:28:56.414] (100)
    2001:db8:ee::/64-128 AS65009  [VALIDATOR1 17:28:56.333] * (100)
                                  [VALIDATOR2 17:28:56.414] (100)
    2001:db8:ff::/64-128 AS65010  [VALIDATOR1 17:28:56.333] * (100)
                                  [VALIDATOR2 17:28:56.414] (100)

Like for the JunOS case, a malicious actor could try to work around the validation by building an AS path where the last AS number is the legitimate one. BIRD is flexible enough to allow us to use any AS to check the IP prefix. Instead of checking the origin AS, we ask it to check the peer AS with this function, without looking at the AS path:

function validated(int peeras) {
   if (roa_check(ROA6, net, peeras) != ROA_VALID) then {
      print "Ignore invalid ROA ", net, " for ASN ", peeras;
      reject;
   }
   accept;
}

The BGP instance is then configured to use the above function as the import policy:

protocol bgp PEER1 {
   local as 65100;
   neighbor 2001:db8:42::a10 as 65005;
   ipv6 {
      import keep filtered;
      import where validated(65005);
      # export …;
   };
}

You can view the rejected routes with show route filtered, but BIRD does not store information about the validation state in the routes. You can also watch the logs:

2019-07-31 17:29:08.491 <INFO> Ignore invalid ROA 2001:db8:bb::40/126 for ASN 65005

Currently, BIRD does not reevaluate the prefixes when the ROAs are updated. There is work in progress to fix this. If this feature is important to you, have a look at FRR instead: it also supports the RTR protocol and triggers a soft reconfiguration of the BGP sessions when ROAs are updated.

  1. Notably, the data flow and the control plane are separated. A node can remove itself by notifying its peers without losing a single packet. ↩︎

  2. People often use AS sets, like AS-APPLE in this example, as they are convenient if you have multiple AS numbers or customers. However, there is currently nothing preventing a rogue actor from adding arbitrary AS numbers to their AS set. ↩︎

  3. We are using 16-bit AS numbers for readability. Because we need to assign a different AS number for each host in the datacenter, in an actual deployment, we would use 32-bit AS numbers. ↩︎

  4. Cisco routers and FRR enforce the first AS by default. It is a tunable value to allow the use of route servers: they distribute prefixes on behalf of other routers. ↩︎

by Vincent Bernat at August 02, 2019 09:16 AM

August 01, 2019

The Geekess

Outreachy Progress 2019-07

The main focus of this month was gearing up for the application period for the December 2019 round. We’ve made some changes to the application process, which required both changes to the website and communication to stakeholders (applicants, coordinators, and mentors).

We aren’t forgetting the May 2019 interns though! Their internships are still active until August 20. We hosted two internship chats this month, and found a contractor to provide resume review for the interns.

Organizer work:

  • Coordinated with a new contractor to add Outreachy career advice services, including career chats and resume review
  • Reviewed mid-point feedback and facilitated conversations with mentors and interns
  • Coordinated with volunteers for the Outreachy Tapia booth about travel
  • Ordered Outreachy promotional materials for Tapia
  • Provided guidance to new communities thinking about participating in the December 2019 round

Documentation work:

  • Wrote a blog post explaining the process changes for applicants and mentors in the December 2019 application round
  • Updated Outreachy promotional materials with the new deadline changes
  • Wrote an Outreachy Applicant Guide
  • Started writing an Outreachy Internship Guide
  • Updated our mentor FAQ to mirror the process changes
  • Wrote an email to interns explaining how informational interviews work, and giving resume guidelines

Development work:

  • Deployed code to separate out the Outreachy initial application period and the contribution period
  • Deployed code to hide Outreachy internship project details until the contribution period opens
  • Deployed code to hide pending/approved status until the contribution period opens
  • Wrote 46 new tests for the deployed code, increasing the total number of tests to 77.

by Sage at August 01, 2019 09:06 PM

Steve Kemp's Blog

Building a computer - part 3

This is part three in my slow journey towards creating a home-brew Z80-based computer. My previous post demonstrated writing some simple code, and getting it running under an emulator. It also described my planned approach:

  • Hookup a Z80 processor to an Arduino Mega.
  • Run code on the Arduino to emulate RAM reads/writes and I/O.
  • Profit, via the learning process.

I expect I'll have to get my hands dirty with a breadboard and naked chips in the near future, but for the moment I decided to start with the least effort. Erturk Kocalar has a website where he sells "shields" (read: expansion-boards) which contain a Z80, and which is designed to plug into an Arduino Mega with no fuss. It's a simple design; I've seen a bunch of people demonstrate how to wire one up by hand, for example in this post.

Anyway, I figured I'd order one of those and get started on the easy part: the software. There was some sample code available from Erturk, but it wasn't ideal from my point of view because it mixed driving the Z80 with doing "other stuff". So I abstracted the core code required to interface with the Z80 and packaged it as a simple library.

The end result is that I have a z80 retroshield library which uses an Arduino mega to drive a Z80 with something as simple as this:

#include <z80retroshield.h>

// Our program, as hex.
unsigned char rom[32] =
{
    0x3e, 0x48, 0xd3, 0x01, 0x3e, 0x65, 0xd3, 0x01, 0x3e, 0x6c, 0xd3, 0x01,
    0xd3, 0x01, 0x3e, 0x6f, 0xd3, 0x01, 0x3e, 0x0a, 0xd3, 0x01, 0xc3, 0x16,
};

// Our helper-object
Z80RetroShield cpu;

// RAM I/O function handler.
char ram_read(int address)
{
    return (rom[address]);
}

// I/O function handler.
void io_write(int address, char byte)
{
    if (address == 1)
        Serial.write(byte);
}

// Setup routine: Called once.
void setup()
{
    Serial.begin(115200);

    // Setup callbacks.
    // We have to setup a RAM-read callback, otherwise the program
    // won't be fetched from RAM and executed.
    cpu.set_ram_read(ram_read);

    // Then we setup a callback to be executed every time an "out (x),y"
    // instruction is encountered.
    cpu.set_io_write(io_write);

    // Configured.
    Serial.println("Z80 configured; launching program.");
}

// Loop function: Called forever.
void loop()
{
    // Step the CPU.
    cpu.Tick();
}
All the logic of the program is contained in the Arduino sketch, and all the use of pins/RAM/IO is hidden away. As a recap, the Z80 makes requests for memory contents in order to fetch the instructions it wants to execute. For general-purpose input/output there are two instructions:

IN A, (1)   ; Read a character from STDIN, store in A-register.
OUT (1), A  ; Write the character in A-register to STDOUT

Here 1 is the I/O address, which is an 8-bit number. At the moment I've just configured the callback such that any write to I/O address 1 is dumped to the serial console.
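To make the ROM bytes above less magical, here is a small Python sketch (purely an illustration, not the Arduino library) that interprets the three opcodes the demo program uses and collects everything written to I/O port 1:

```python
# Minimal interpreter for the three Z80 opcodes the demo ROM uses:
# LD A,n (0x3e), OUT (n),A (0xd3) and JP nn (0xc3).
rom = [0x3e, 0x48, 0xd3, 0x01, 0x3e, 0x65, 0xd3, 0x01, 0x3e, 0x6c, 0xd3, 0x01,
       0xd3, 0x01, 0x3e, 0x6f, 0xd3, 0x01, 0x3e, 0x0a, 0xd3, 0x01, 0xc3, 0x16,
       ] + [0] * 8   # remaining bytes default to zero, as in the C array

def run(rom, max_steps=100):
    out, pc, a = [], 0, 0
    for _ in range(max_steps):
        op = rom[pc]
        if op == 0x3e:              # LD A, n
            a, pc = rom[pc + 1], pc + 2
        elif op == 0xd3:            # OUT (n), A
            if rom[pc + 1] == 1:    # port 1 -> "serial console"
                out.append(a)
            pc += 2
        elif op == 0xc3:            # JP nn (little-endian address)
            target = rom[pc + 1] | (rom[pc + 2] << 8)
            if target == pc:        # jump-to-self: program is done
                break
            pc = target
        else:
            break
    return bytes(out).decode()

print(run(rom), end="")   # prints "Hello" and a newline
```

The final `jp` targets its own address, so the real chip would spin there forever after printing; the sketch just stops instead.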

Anyway I put together a couple of examples of increasing complexity, allowing me to prove that RAM read/writes work, and that I/O reads and writes work.

I guess the next part is where I jump in complexity:

  • I need to wire a physical Z80 to a board.
  • I need to wire a PROM to it.
    • This will contain the program to be executed - hardcoded.
  • I need to provide power, and a clock to make the processor tick.

With a bunch of LEDs I'll have a Z80 system running, but it'll be isolated and hard to program, since I'll need to reflash the RAM/ROM chip.

The next step would be getting it hooked up to a serial-console of some sort. And at that point I'll have a genuinely programmable standalone Z80 system.

August 01, 2019 10:01 AM

July 31, 2019


Notes on Self-Publishing a Book

In this post I would like to share a few thoughts on self-publishing a book, in case anyone is considering that option.

As I mentioned in my post on burnout, one of my goals was to publish a book on a subject other than cyber security. A friend from my Krav Maga school, Anna Wonsley, learned that I had published several books, and asked if we might collaborate on a book about stretching. The timing was right, so I agreed.

I published my first book with Pearson and Addison-Wesley in 2004, and my last with No Starch in 2013. 14 years is an eternity in the publishing world, and even in the last 5 years the economics and structure of book publishing have changed quite a bit.

To better understand the changes, I had dinner with one of the finest technical authors around, Michael W. Lucas. We met prior to my interest in this book, because I had wondered about publishing books on my own. MWL started in traditional publishing like me, but has since become a full-time author and independent publisher. He explained the pros and cons of going it alone, which I carefully considered.

By the end of 2017, Anna and I were ready to begin work on the book. I believe our first "commits" occurred in December 2017.

For this stretching book project, I knew my strengths included organization, project management, writing to express another person's message, editing, and access to a skilled lead photographer. I learned that my co-author's strengths included subject matter expertise, a willingness to be photographed for the book's many pictures, and friends who would also be willing to be photographed.

None of us was very familiar with the process of transforming a raw manuscript and photos into a finished product. When I had published with Pearson and No Starch, they took care of that process, as well as copy-editing.

Beyond turning manuscript and photos into a book, I also had to identify a publication platform. Early on we decided to self-publish using one of the many newer companies offering that service. We wanted a company that could get our book into Amazon, and possibly physical book stores as well. We did not want to try working with a traditional publisher, as we felt that we could manage most aspects of the publishing process ourselves, and augment with specialized help where needed.

After a lot of research we chose Blurb. One of the most attractive aspects of Blurb was their expert ecosystem. We decided that we would hire one of these experts to handle the interior layout process. We contacted Jennifer Linney, who happened to be local and had experience publishing books to Amazon. We met in person, discussed the project, and agreed to move forward together.

I designed the structure of the book. As a former Air Force officer, I was comfortable with the "rule of threes," and brought some recent writing experience from my abandoned PhD thesis.

I designed the book to have an introduction, the main content, and a conclusion. Within the main content, the book featured an introduction and physical assessment, three main sections, and a conclusion. The three main sections consisted of a fundamental stretching routine, an advanced stretching routine, and a performance enhancement section -- something with Indian clubs, or kettle bells, or another supplement to stretching.

Anna designed all of the stretching routines and provided the vast majority of the content. She decided to focus on three physical problem areas -- tight hips, shoulders/back, and hamstrings. We encouraged the reader to "reach three goals" -- open your hips, expand your shoulders, and touch your toes. Anna designed exercises that worked in a progression through the body, incorporating her expertise as a certified trainer and professional martial arts instructor.

Initially we tried a process whereby she would write section drafts, and I would edit them, all using Google Docs. This did not work as well as we had hoped, and we spent a lot of time stalled in virtual collaboration.

By the spring of 2018 we decided to try meeting in person on a regular basis. Anna would explain her desired content for a section, and we would take draft photographs using iPhones to serve as placeholders and to test the feasibility of real content. We made a lot more progress using these methods, although we stalled again mid-year due to schedule conflicts.

By October our text was ready enough to try taking book-ready photographs. We bought photography lights from Amazon and used my renovated basement game room as a studio. We took pictures over three sessions, with Anna and her friend Josh as subjects. I spent several days editing the photos to prepare for publication, then handed the bundled manuscript and photographs to Jennifer for a light copy-edit and layout during November.

Our goal was to have the book published before the end of the year, and we met that goal. We decided to offer two versions. The first is a "collector's edition" featuring all color photographs, available exclusively via Blurb as Reach Your Goal: Collector's Edition. The second will be available at Amazon in January, and will feature black and white photographs.

While we were able to set the price of the book directly via Blurb, we could basically only suggest a price to Ingram and hence to Amazon. Ingram is the distributor that feeds Amazon and physical book stores. I am curious to see how the book will appear in those retail locations, and how much it will cost readers. We tried to price it competitively with older stretching books of similar size. (Ours is 176 pages with over 200 photographs.)

Without revealing too much of the economic structure, I can say that it's much cheaper to sell directly from Blurb. Their cost structure allows us to price the full color edition competitively. However, one of our goals was to provide our book through Amazon, and to keep the price reasonable we had to sell the black and white edition outside of Blurb.

Overall I am very pleased with the writing process, and exceptionally happy with the book itself. The color edition is gorgeous and the black and white version is awesome too.

The only change I would have made to the writing process would have been to start the in-person collaboration from the beginning. Working together in person accelerated the transfer of ideas to paper and played to our individual strengths of Anna as subject matter expert and me as a writer.

In general, I would not recommend self-publishing if you are not a strong writer. If writing is not your forte, then I highly suggest you work with a traditional publisher, or contract with an editor. I have seen too many self-published books that read terribly. This usually happens when the author is a subject matter expert, but has trouble expressing ideas in written form.

The bottom line is that it's never been easier to make your dream of writing a book come true. There are options for everyone, and you can leverage them to create wonderful products that scale with demand and can really help your audience reach their goals!

If you want to start the new year with better flexibility and fitness, consider taking a look at our book on Blurb! When the Amazon edition is available I will update this post with a link.

Update: Here is the Amazon listing.

Cross-posted from Rejoining the Tao Blog.

by Richard Bejtlich ( at July 31, 2019 06:52 PM

The Origin of the Term Indicators of Compromise (IOCs)

I am an historian. I practice digital security, but I earned a Bachelor of Science degree in history from the United States Air Force Academy. (1)

Historians create products by analyzing artifacts, among which the most significant is the written word.

In my last post, I talked about IOCs, or indicators of compromise. Do you know the origin of the term? I thought I did, but I wanted to rely on my historian's methodology to invalidate or confirm my understanding.

I became aware of the term "indicator" as an element of indications and warning (I&W), when I attended Air Force Intelligence Officer's school in 1996-1997. I will return to this shortly, but I did not encounter the term "indicator" in a digital security context until I encountered the work of Kevin Mandia.

In August 2001, shortly after its publication, I read Incident Response: Investigating Computer Crime, by Kevin Mandia, Chris Prosise, and Matt Pepe (Osborne/McGraw-Hill). I was so impressed by this work that I managed to secure a job with their company, Foundstone, by April 2002. I joined the Foundstone incident response team, which was led by Kevin and consisted of Matt Pepe, Keith Jones, Julie Darmstadt, and me.

I Tweeted earlier today that Kevin invented the term "indicator" (in the IR context) in that 2001 edition, but a quick review of the hard copy in my library does not show its usage, at least not prominently. I believe we were using the term in the office but that it had not appeared in the 2001 book. Documentation would seem to confirm that, as Kevin was working on the second edition of the IR book (to which I contributed), and that version, published in 2003, features the term "indicator" in multiple locations.

In fact, the earliest use of the term "indicators of compromise" in print in a digital security context appears on page 280 of Incident Response & Computer Forensics, 2nd Edition.

From other uses of the term "indicators" in that IR book, you can observe that IOC wasn't a formal, independent concept at this point, in 2003. In the same excerpt above you see "indicators of attack" mentioned.

The first citation of the term "indicators" in the 2003 book shows it is meant as an investigative lead or tip:

Did I just give up my search at this point? Of course not.

If you do time-limited Google searches for "indicators of compromise," after weeding out patent filings that reference later work (from FireEye, in 2013), you might find this document, which concludes with this statement:

Indicators of compromise are from Fischer, Lynn, "Looking for the Unexpected," Security Awareness Bulletin, 3-96, 1996. Richmond, VA: DoD Security Institute.

Here the context is the compromise of a person with a security clearance.

In the same spirit, the earliest reference to "indicator" in a security-specific, detection-oriented context appears in the patent Method and system for reducing the rate of infection of a communications network by a software worm (6 Dec 2002). Stuart Staniford is the lead author; he was later chief scientist at FireEye, although he left before FireEye acquired Mandiant (and me).

While Kevin, et al were publishing the second edition of their IR book in 2003, I was writing my first book, The Tao of Network Security Monitoring. I began chapter two with a discussion of indicators, inspired by my Air Force intelligence officer training in I&W and Kevin's use of the term at Foundstone.

You can find chapter two in its entirety online. In the chapter I also used the term "indicators of compromise," in the spirit Kevin used it; but again, it was not yet a formal, independent term.

My book was published in 2004, followed by two more in rapid succession.

The term "indicators" didn't really make a splash until 2009, when Mike Cloppert published a series on threat intelligence and the cyber kill chain. The most impactful in my opinion was Security Intelligence: Attacking the Cyber Kill Chain. Mike wrote:

I remember very much enjoying these posts, but the Cyber Kill Chain was the aspect that had the biggest impact on the security community. Mike does not say "IOC" in the post. Where he does say "compromise," he's using it to describe a victimized computer.

The stage is now set for seeing indicators of compromise in a modern context. Drum roll, please!

The first documented appearance of the term indicators of compromise, or IOCs, in the modern context appears in basically two places simultaneously, with ultimate credit going to the same organization: Mandiant.

The first Mandiant M-Trends report, published on 25 Jan 2010, provides the following description of IOCs on page 9:

The next day, 26 Jan 2010, Matt Frazier published Combat the APT by Sharing Indicators of Compromise to the Mandiant blog. Matt wrote to introduce an XML-based instantiation of IOCs, which could be read and created using free Mandiant tools.

Note how complicated Matt's IOC example is. It's not a file hash (alone), or a file name (alone), or an IP address, etc. It's a Boolean expression of many elements. You can read in the text that this original IOC definition rejects what some commonly consider "IOCs" to be. Matt wrote:

Historically, compromise data has been exchanged in CSV or PDFs laden with tables of "known bad" malware information - name, size, MD5 hash values and paragraphs of imprecise descriptions... (emphasis added)

On a related note, I looked for early citations of work on defining IOCs, and found a paper by Simson Garfinkel, well-respected forensic analyst. He gave credit to Matt Frazier and Mandiant, writing in 2011:

Frazier (2010) of MANDIANT developed Indicators of Compromise (IOCs), an XML-based language designed to express signatures of malware such as files with a particular MD5 hash value, file length, or the existence of particular registry entries. There is a free editor for manipulating the XML. MANDIANT has a tool that can use these IOCs to scan for malware and the so-called “Advanced Persistent Threat.”

Starting in 2010, the debate was initially about the format for IOCs, and how to produce and consume them. We can see in this written evidence from 2010, however, a definition of indicators of compromise and IOCs that contains all the elements that would be recognized in current usage.

tl;dr Mandiant invented the term indicators of compromise, or IOCs, in 2010, building off the term "indicator," introduced widely in a detection context by Kevin Mandia, no later than his 2003 incident response book.

(1) Yes, a BS, not a BA -- thank you USAFA for 14 mandatory STEM classes.

by Richard Bejtlich ( at July 31, 2019 06:52 PM

Twenty Years of Network Security Monitoring: From the AFCERT to Corelight

I am really fired up to join Corelight. I’ve had to keep my involvement with the team a secret since officially starting on July 20th. Why was I so excited about this company? Let me step backwards to help explain my present situation, and forecast the future.

Twenty years ago this month I joined the Air Force Computer Emergency Response Team (AFCERT) at then-Kelly Air Force Base, located in hot but lovely San Antonio, Texas. I was a brand new captain who thought he knew about computers and hacking based on experiences from my teenage years and more recent information operations and traditional intelligence work within the Air Intelligence Agency. I was desperate to join any part of the then-five-year-old Information Warfare Center (AFIWC) because I sensed it was the most exciting unit on “Security Hill.”

I had misjudged my presumed level of “hacking” knowledge, but I was not mistaken about the exciting life of an AFCERT intrusion detector! I quickly learned the tenets of network security monitoring, enabled by the custom software watching and logging network traffic at every Air Force base. I soon heard there were three organizations that intruders knew to be wary of in the late 1990s: the Fort, i.e. the National Security Agency; the Air Force, thanks to our Automated Security Incident Measurement (ASIM) operation; and the University of California, Berkeley, because of a professor named Vern Paxson and his Bro network security monitoring software.

When I wrote my first book in 2003-2004, The Tao of Network Security Monitoring, I enlisted the help of Christopher Jay Manders to write about Bro 0.8. Bro had the reputation of being very powerful but difficult to stand up. In 2007 I decided to try installing Bro myself, thanks to the introduction of the “brolite” scripts shipped with Bro 1.2.1. That made Bro easier to use, but I didn’t do much analysis with it until I attended the 2009 Bro hands-on workshop. There I met Vern, Robin Sommer, Seth Hall, Christian Kreibich, and other Bro users and developers. I was lost most of the class, saved only by my knowledge of standard Unix command line tools like sed, awk, and grep! I was able to integrate Bro traffic analysis and logs into my TCP/IP Weapons School 2.0 class, and subsequent versions, which I taught mainly to Black Hat students. By the time I wrote my last book, The Practice of Network Security Monitoring, in 2013, I was heavily relying on Bro logs to demonstrate many sorts of network activity, thanks to the high-fidelity nature of Bro data.

In July of this year, Seth Hall emailed to ask if I might be interested in keynoting the upcoming Bro users conference in Washington, D.C., on October 10-12. I was in a bad mood due to being unhappy with the job I had at that time, and I told him I was useless as a keynote speaker. I followed up with another message shortly after, explained my depressed mindset, and asked how he liked working at Corelight. That led to interviews with the Corelight team and a job offer. The opportunity to work with people who really understood the need for network security monitoring, and were writing the world’s most powerful software to generate NSM data, was so appealing! Now that I’m on the team, I can share how I view Corelight’s contribution to the security challenges we face.

For me, Corelight solves the problems I encountered all those years ago when I first looked at Bro. The Corelight embodiment of Bro is ready to go when you deploy it. It’s developed and maintained by the people who write the code. Furthermore, Bro is front and center, not buried behind someone else’s logo. Why buy this amazing capability from another company when you can work with those who actually conceptualize, develop, and publish the code?

It’s also not just Bro; it’s Bro at ridiculous speeds, ingesting and making sense of complex network traffic. We regularly encounter open source Bro users who spend weeks or months struggling to get their open source deployments to run at the speeds they need, typically in the tens or hundreds of Gbps. Corelight’s offering is optimized at the hardware level to deliver the highest performance, and our team works with customers who want to push Bro to even greater levels.

Finally, working at Corelight gives me the chance to take NSM in many exciting new directions. For years we NSM practitioners have worried about challenges to network-centric approaches, such as encryption, cloud environments, and alert fatigue. At Corelight we are working on answers for all of these, beyond the usual approaches — SSL termination, cloud gateways, and SIEM/SOAR solutions. We will have more to say about this in the future, I’m happy to say!

What challenges do you hope Corelight can solve? Leave a comment or let me know via Twitter to @corelight_inc or @taosecurity.

by Richard Bejtlich ( at July 31, 2019 06:52 PM

July 24, 2019

Sec-tools v0.3: HTTP Security Headers

The latest version of my sec-tools project includes a new tool, "sec-gather-http-headers". It scans one or more URLs for HTTP security headers. As usual, you can use sec-diff to generate alerts about changes in the output and sec-report to generate a matrix overview of the headers for each URL.

The JSON output looks like this:

$ sec-gather-http-headers
{
    "http_headers": {
        "": {
            "Expect-CT": "max-age=2592000, report-uri=\"\"",
            "Feature-Policy": null,
            "Access-Control-Allow-Origin": null,
            "X-Frame-Options": "deny",
            "Referrer-Policy": "origin-when-cross-origin, strict-origin-when-cross-origin",
            "Access-Control-Allow-Headers": null,
            "X-XSS-Protection": "1; mode=block",
            "Strict-Transport-Security": "max-age=31536000; includeSubdomains; preload",
            "Public-key-pins": null,
            "Content-Security-Policy": "default-src 'none'; base-uri 'self'; block-all-mixed-content; connect-src 'self' wss://; font-src; form-action 'self'; frame-ancestors 'none'; frame-src; img-src 'self' data: *; manifest-src 'self'; media-src 'none'; script-src; style-src 'unsafe-inline'",
            "X-Content-Type-Options": "nosniff",
            "Access-Control-Allow-Methods": null
        },
        "": {
            "Expect-CT": null,
            "Feature-Policy": null,
            "Access-Control-Allow-Origin": null,
            "X-Frame-Options": null,
            "Referrer-Policy": null,
            "Access-Control-Allow-Headers": null,
            "X-XSS-Protection": "1; mode=block",
            "Strict-Transport-Security": "max-age=31536000; includeSubdomains",
            "Public-key-pins": null,
            "Content-Security-Policy": "frame-ancestors 'self';",
            "X-Content-Type-Options": "nosniff",
            "Access-Control-Allow-Methods": null
        }
    }
}

An example PDF output with a matrix overview:


by admin at July 24, 2019 04:15 AM

July 21, 2019

Vincent Bernat

A Makefile for your Go project (2019)

My most loathed feature of Go was the mandatory use of GOPATH: I do not want to put my own code next to its dependencies. I was not alone and people devised tools or crafted their own Makefile to avoid organizing their code around GOPATH.

Fortunately, since Go 1.11, it is possible to use Go’s modules to manage dependencies without relying on GOPATH. First, you need to convert your project to a module:1

$ go mod init hellogopher
go: creating new go.mod: module hellogopher
$ cat go.mod
module hellogopher

Then, you can invoke the usual commands, like go build or go test. The go command resolves imports by using versions listed in go.mod. When it runs into an import of a package not present in go.mod, it automatically looks up the module containing that package using the latest version and adds it.

$ go test ./...
go: finding v0.0.5
go: downloading v0.0.5
?       hellogopher     [no test files]
?       hellogopher/cmd [no test files]
ok      hellogopher/hello       0.001s
$ cat go.mod
module hellogopher

require v0.0.5

If you want a specific version, you can either edit go.mod or invoke go get:

$ go get
go: finding v0.0.4
go: downloading v0.0.4
$ cat go.mod
module hellogopher

require v0.0.4

Add go.mod to your version control system. Optionally, you can also add go.sum as a safety net against overridden tags. If you really want to vendor the dependencies, you can invoke go mod vendor and add the vendor/ directory to your version control system.

Thanks to the modules, in my opinion, Go’s dependency management is now on a par with other languages, like Ruby. While it is possible to run day-to-day operations—building and testing—with only the go command, a Makefile can still be useful to organize common tasks, a bit like Python’s or Ruby’s Rakefile. Let me describe mine.

Using third-party tools

Most projects need some third-party tools for testing or building. We can either expect them to be already installed or compile them on the fly. For example, here is how code linting is done with Golint:

BIN = $(CURDIR)/bin
$(BIN):
    @mkdir -p $@
$(BIN)/%: | $(BIN)
    @tmp=$$(mktemp -d); \
       env GO111MODULE=off GOPATH=$$tmp GOBIN=$(BIN) go get $(PACKAGE) \
        || ret=$$?; \
       rm -rf $$tmp ; exit $$ret


$(BIN)/golint: PACKAGE=golang.org/x/lint/golint
GOLINT = $(BIN)/golint
lint: | $(GOLINT)
    $(GOLINT) -set_exit_status ./...

The first block defines how a third-party tool is built: go get is invoked with the package name matching the tool we want to install. We do not want to pollute our dependency management, so we work in an empty GOPATH. The generated binaries are put in bin/.

The second block extends the pattern rule defined in the first block by providing the package name for golint. Additional tools can be added by just adding another line like this.

The last block defines the recipe to lint the code. The default linting tool is the golint built using the first block, but it can be overridden with make GOLINT=/usr/bin/golint.


Here are some rules to help running tests:

PKGS     = $(or $(PKG),$(shell env GO111MODULE=on $(GO) list ./...))
TESTPKGS = $(shell env GO111MODULE=on $(GO) list -f \
            '{{ if or .TestGoFiles .XTestGoFiles }}{{ .ImportPath }}{{ end }}' \
            $(PKGS))
TEST_TARGETS := test-default test-bench test-short test-verbose test-race
test-bench:   ARGS=-run=__absolutelynothing__ -bench=.
test-short:   ARGS=-short
test-verbose: ARGS=-v
test-race:    ARGS=-race
$(TEST_TARGETS): test
check test tests: fmt lint
    go test -timeout $(TIMEOUT)s $(ARGS) $(TESTPKGS)

A user can invoke tests in different ways:

  • make test runs all tests;
  • make test TIMEOUT=10 runs all tests with a timeout of 10 seconds;
  • make test PKG=hellogopher/cmd only runs tests for the cmd package;
  • make test ARGS="-v -short" runs tests with the specified arguments;
  • make test-race runs tests with race detector enabled.

go test includes a test coverage tool. Unfortunately, it only handles one package at a time and you have to explicitly list the packages to be instrumented; otherwise, the instrumentation is limited to the currently tested package. If you provide too many packages, the compilation time will skyrocket. Moreover, if you want an output compatible with Jenkins, you need some additional tools.

COVERAGE_MODE    = atomic
COVERAGE_PROFILE = $(COVERAGE_DIR)/profile.out
COVERAGE_HTML    = $(COVERAGE_DIR)/index.html
COVERAGE_XML     = $(COVERAGE_DIR)/coverage.xml
test-coverage-tools: | $(GOCOVMERGE) $(GOCOV) $(GOCOVXML) # ❶
test-coverage: COVERAGE_DIR := $(CURDIR)/test/coverage.$(shell date -u +"%Y-%m-%dT%H:%M:%SZ")
test-coverage: fmt lint test-coverage-tools
    @mkdir -p $(COVERAGE_DIR)/coverage
    @for pkg in $(TESTPKGS); do \ # ❷
        go test \
            -coverpkg=$$(go list -f '{{ join .Deps "\n" }}' $$pkg | \
                    grep '^$(MODULE)/' | \
                    tr '\n' ',')$$pkg \
            -covermode=$(COVERAGE_MODE) \
            -coverprofile="$(COVERAGE_DIR)/coverage/`echo $$pkg | tr "/" "-"`.cover" $$pkg ;\
    done
    @$(GOCOVMERGE) $(COVERAGE_DIR)/coverage/*.cover > $(COVERAGE_PROFILE)
    @go tool cover -html=$(COVERAGE_PROFILE) -o $(COVERAGE_HTML)
    @$(GOCOV) convert $(COVERAGE_PROFILE) | $(GOCOVXML) > $(COVERAGE_XML)

First, we define some variables to let the user override them. In ❶, we require the following tools—built like golint previously:

  • gocovmerge merges profiles from different runs into a single one;
  • gocov-xml converts a coverage profile to the Cobertura format, for Jenkins;
  • gocov is needed to convert a coverage profile to a format handled by gocov-xml.

In ❷, for each package to test, we run go test with the -coverprofile argument. We also explicitly provide the list of packages to instrument to -coverpkg by using go list to get a list of dependencies for the tested package and keeping only our own.


Another useful recipe is to build the program. While this could be done with just go build, it is not uncommon to have to specify build tags, additional flags, or to execute supplementary build steps. In the following example, the version is extracted from Git tags. It will replace the value of the Version variable in the hellogopher/cmd package.

VERSION ?= $(shell git describe --tags --always --dirty --match=v* 2> /dev/null || \
            echo v0)
all: fmt lint | $(BIN)
    go build \
        -tags release \
        -ldflags '-X hellogopher/cmd.Version=$(VERSION)' \
        -o $(BIN)/hellogopher main.go

The recipe also runs code formatting and linting.

The excerpts provided in this post are a bit simplified. Have a look at the final result for more perks, including fancy output and integrated help!

  1. For an application not meant to be used as a library, I prefer to use a short name instead of a name derived from a URL. It makes it easier to read import sections:

    import (


by Vincent Bernat at July 21, 2019 07:20 PM

July 20, 2019

Evaggelos Balaskas

A Dead Simple VPN

DSVPN is designed to address the most common use case for using a VPN

Works with TCP, blocks IPv6 leaks, redirect-gateway out-of-the-box!




Notes on the latest ubuntu:18.04 docker image:

# git clone
Cloning into 'dsvpn'...
remote: Enumerating objects: 88, done.
remote: Counting objects: 100% (88/88), done.
remote: Compressing objects: 100% (59/59), done.
remote: Total 478 (delta 47), reused 65 (delta 29), pack-reused 390
Receiving objects: 100% (478/478), 93.24 KiB | 593.00 KiB/s, done.
Resolving deltas: 100% (311/311), done.

# cd dsvpn

# ls
LICENSE  Makefile  include  logo.png  src

# make
cc -march=native -Ofast -Wall -W -Wshadow -Wmissing-prototypes -Iinclude -o dsvpn src/dsvpn.c src/charm.c src/os.c
strip dsvpn

# ldd dsvpn (0x00007ffd409ba000) => /lib/x86_64-linux-gnu/ (0x00007fd78480b000)
/lib64/ (0x00007fd784e03000)

# ls -l dsvpn
-rwxr-xr-x 1 root root 26840 Jul 20 15:51 dsvpn

Just copy the dsvpn binary to your machines.


Symmetric Key

dsvpn uses symmetric-key cryptography, which means both machines use the same encryption key.


dd if=/dev/urandom of=vpn.key count=1 bs=32
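For the curious, the dd invocation above just writes 32 random bytes to a file. A purely illustrative Python equivalent (not part of dsvpn):

```python
import os

# Write a 32-byte random key, like:
#   dd if=/dev/urandom of=vpn.key count=1 bs=32
with open("vpn.key", "wb") as f:
    f.write(os.urandom(32))
```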

Copy the key to both machines using a secure medium, such as ssh.



It is very easy to run dsvpn in server mode:


dsvpn server vpn.key auto

Interface: [tun0]
net.ipv4.ip_forward = 1
Listening to *:443

ip addr show tun0

4: tun0: <POINTOPOINT,MULTICAST,NOARP,UP,LOWER_UP> mtu 9000 qdisc fq_codel state UNKNOWN group default qlen 500
    inet peer scope global tun0
       valid_lft forever preferred_lft forever

I prefer to use CIDR in my VPNs, so in my VPN setup:

dsvpn server /root/vpn.key auto 443 auto

Using as the VPN Server IP.

systemd service unit - server

I’ve created a simple systemd script dsvpn_server.service

or you can copy it from here:


[Unit]
Description=Dead Simple VPN - Server

[Service]
ExecStart=/usr/local/bin/dsvpn server /root/vpn.key auto 443 auto

[Install]


and then:

systemctl enable dsvpn.service
systemctl  start dsvpn.service


It is also easy to run dsvpn in client mode:


dsvpn client vpn.key

# dsvpn client vpn.key
Interface: [tun0]
Trying to reconnect
Connecting to
net.ipv4.tcp_congestion_control = bbr

ip addr show tun0

4: tun0: <POINTOPOINT,MULTICAST,NOARP,UP,LOWER_UP> mtu 9000 qdisc fq_codel state UNKNOWN group default qlen 500
    inet peer scope global tun0
       valid_lft forever preferred_lft forever

dsvpn works in redirect-gateway mode,
so it will apply routing rules to pass all the network traffic through the VPN.

ip route list via dev tun0
default via dev eth0 proto static via dev eth0 via dev tun0 dev eth0 proto kernel scope link src  dev tun0 proto kernel scope link src

As I mentioned above, I prefer to use CIDR in my VPNs, so in my VPN client:

dsvpn client /root/vpn.key 443 auto

Using as the VPN Client IP.

ip addr show tun0

11: tun0: <POINTOPOINT,MULTICAST,NOARP,UP,LOWER_UP> mtu 9000 qdisc fq_codel state UNKNOWN group default qlen 500
    inet peer scope global tun0
       valid_lft forever preferred_lft forever

systemd service unit - client

I’ve also created a simple systemd script for the client dsvpn_client.service

or you can copy it from here:


[Unit]
Description=Dead Simple VPN - Client

[Service]
ExecStart=/usr/local/bin/dsvpn client /root/vpn.key 443 auto

[Install]


and then:

systemctl enable dsvpn.service
systemctl  start dsvpn.service

and here is an MTR from the client:



Enjoy !


Tag(s): vpn, dsvpn

July 20, 2019 07:00 PM

Vincent Bernat

Writing sustainable Python scripts

Python is a great language to write a standalone script. Getting to the result can be a matter of a dozen to a few hundred lines of code and, moments later, you can forget about it and focus on your next task.

Six months later, a co-worker asks you why the script fails and you don’t have a clue: no documentation, hard-coded parameters, nothing logged during the execution and no sensible tests to figure out what may go wrong.

Turning a “quick-and-dirty” Python script into a sustainable version, which will be easy to use, understand and support by your co-workers and your future self, only takes some moderate effort. As an illustration, let’s start from the following script solving the infamous Fizz-Buzz test:

import sys
for n in range(int(sys.argv[1]), int(sys.argv[2])):
    if n % 3 == 0 and n % 5 == 0:
        print("fizzbuzz")
    elif n % 3 == 0:
        print("fizz")
    elif n % 5 == 0:
        print("buzz")
    else:
        print(n)


I find it useful to write documentation before coding: it makes the design easier and it ensures I will not postpone this task indefinitely. The documentation can be embedded at the top of the script:

#!/usr/bin/env python3

"""Simple fizzbuzz generator.

This script prints out a sequence of numbers from a provided range
with the following restrictions:

 - if the number is divisible by 3, then print out "fizz",
 - if the number is divisible by 5, then print out "buzz",
 - if the number is divisible by 3 and 5, then print out "fizzbuzz".
"""

The first line is a short summary of the script's purpose. The remaining paragraphs contain additional details on its action.

Command-line arguments

The second task is to turn hard-coded parameters into documented and configurable values through command-line arguments, using the argparse module. In our example, we ask the user to specify a range and allow them to modify the modulo values for “fizz” and “buzz”.

import argparse
import sys

class CustomFormatter(argparse.RawDescriptionHelpFormatter,
                      argparse.ArgumentDefaultsHelpFormatter):
    pass

def parse_args(args=sys.argv[1:]):
    """Parse arguments."""
    parser = argparse.ArgumentParser(
        description=sys.modules[__name__].__doc__,
        formatter_class=CustomFormatter)

    g = parser.add_argument_group("fizzbuzz settings")
    g.add_argument("--fizz", metavar="N", default=3, type=int,
                   help="Modulo value for fizz")
    g.add_argument("--buzz", metavar="N", default=5, type=int,
                   help="Modulo value for buzz")

    parser.add_argument("start", type=int, help="Start value")
    parser.add_argument("end", type=int, help="End value")

    return parser.parse_args(args)

options = parse_args()
for n in range(options.start, options.end + 1):
    # ...

The added value of this modification is tremendous: parameters are now properly documented and are discoverable through the --help flag. Moreover, the documentation we wrote in the previous section is also displayed:

$ ./ --help
usage: [-h] [--fizz N] [--buzz N] start end

Simple fizzbuzz generator.

This script prints out a sequence of numbers from a provided range
with the following restrictions:

 - if the number is divisible by 3, then print out "fizz",
 - if the number is divisible by 5, then print out "buzz",
 - if the number is divisible by 3 and 5, then print out "fizzbuzz".

positional arguments:
  start         Start value
  end           End value

optional arguments:
  -h, --help    show this help message and exit

fizzbuzz settings:
  --fizz N      Modulo value for fizz (default: 3)
  --buzz N      Modulo value for buzz (default: 5)

The argparse module is quite powerful. If you are not familiar with it, skimming through the documentation is helpful. I like to use the ability to define sub-commands and argument groups.
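To illustrate the last point, here is a small, hypothetical sketch (not from the original post) combining a sub-command with an argument group; the `tool`, `run`, and `--retries` names are invented:

```python
import argparse

parser = argparse.ArgumentParser(prog="tool")
subparsers = parser.add_subparsers(dest="command")

# A "run" sub-command with its own grouped options.
run = subparsers.add_parser("run", help="run the task")
g = run.add_argument_group("tuning")  # shown as a separate section in --help
g.add_argument("--retries", metavar="N", default=3, type=int,
               help="number of attempts")

options = parser.parse_args(["run", "--retries", "5"])
print(options.command, options.retries)  # run 5
```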


A nice addition to a script is to display information during its execution. The logging module is a good fit for this purpose. First, we define the logger:

import logging
import logging.handlers
import os
import sys

logger = logging.getLogger(os.path.splitext(os.path.basename(sys.argv[0]))[0])

Then, we make its verbosity configurable: logger.debug() should output something only when a user runs our script with --debug and --silent should mute the logs unless an exceptional condition occurs. For this purpose, we add the following code in parse_args():

# In parse_args()
g = parser.add_mutually_exclusive_group()
g.add_argument("--debug", "-d", action="store_true",
               help="enable debugging")
g.add_argument("--silent", "-s", action="store_true",
               help="don't log to console")

We add this function to configure logging:

def setup_logging(options):
    """Configure logging."""
    root = logging.getLogger("")
    root.setLevel(logging.WARNING)
    logger.setLevel(options.debug and logging.DEBUG or logging.INFO)
    if not options.silent:
        ch = logging.StreamHandler()
        ch.setFormatter(logging.Formatter(
            "%(levelname)s[%(name)s] %(message)s"))
        root.addHandler(ch)

The main body of our script becomes this:

if __name__ == "__main__":
    options = parse_args()
    setup_logging(options)

    try:
        logger.debug("compute fizzbuzz from {} to {}".format(options.start,
                                                             options.end))
        for n in range(options.start, options.end + 1):
            # ...
    except Exception as e:
        logger.exception("%s", e)
        sys.exit(1)
    sys.exit(0)

If the script may run unattended (e.g. from a crontab), we can make it log to syslog:

def setup_logging(options):
    """Configure logging."""
    root = logging.getLogger("")
    root.setLevel(logging.WARNING)
    logger.setLevel(options.debug and logging.DEBUG or logging.INFO)
    if not options.silent:
        if not sys.stderr.isatty():
            facility = logging.handlers.SysLogHandler.LOG_DAEMON
            sh = logging.handlers.SysLogHandler(address='/dev/log',
                                                facility=facility)
            sh.setFormatter(logging.Formatter(
                "{0}[{1}]: %(message)s".format(
                    logger.name, os.getpid())))
            root.addHandler(sh)
        else:
            ch = logging.StreamHandler()
            ch.setFormatter(logging.Formatter(
                "%(levelname)s[%(name)s] %(message)s"))
            root.addHandler(ch)

For this example, this is a lot of code just to use logger.debug() once, but in a real script, this will come in handy to help users understand how the task is completed.

$ ./ --debug 1 3
DEBUG[fizzbuzz] compute fizzbuzz from 1 to 3


Unit tests are very useful to ensure an application behaves as intended. It is not common to use them in scripts, but writing a few of them greatly improves their reliability. Let’s turn the code in the inner “for” loop into a function, with some interactive examples of use in its documentation:

def fizzbuzz(n, fizz, buzz):
    """Compute fizzbuzz nth item given modulo values for fizz and buzz.

    >>> fizzbuzz(5, fizz=3, buzz=5)
    'buzz'
    >>> fizzbuzz(3, fizz=3, buzz=5)
    'fizz'
    >>> fizzbuzz(15, fizz=3, buzz=5)
    'fizzbuzz'
    >>> fizzbuzz(4, fizz=3, buzz=5)
    4
    >>> fizzbuzz(4, fizz=4, buzz=6)
    'fizz'

    """
    if n % fizz == 0 and n % buzz == 0:
        return "fizzbuzz"
    if n % fizz == 0:
        return "fizz"
    if n % buzz == 0:
        return "buzz"
    return n

pytest can ensure the results are correct:1

$ python3 -m pytest -v --doctest-modules ./
============================ test session starts =============================
platform linux -- Python 3.7.4, pytest-3.10.1, py-1.8.0, pluggy-0.8.0 -- /usr/bin/python3
cachedir: .pytest_cache
rootdir: /home/bernat/code/perso/python-script, inifile:
plugins: xdist-1.26.1, timeout-1.3.3, forked-1.0.2, cov-2.6.0
collected 1 item PASSED                                  [100%]

========================== 1 passed in 0.05 seconds ==========================

In case of an error, pytest displays a message describing the location and the nature of the failure:

$ python3 -m pytest -v --doctest-modules ./ -k fizzbuzz.fizzbuzz
============================ test session starts =============================
platform linux -- Python 3.7.4, pytest-3.10.1, py-1.8.0, pluggy-0.8.0 -- /usr/bin/python3
cachedir: .pytest_cache
rootdir: /home/bernat/code/perso/python-script, inifile:
plugins: xdist-1.26.1, timeout-1.3.3, forked-1.0.2, cov-2.6.0
collected 1 item FAILED                                  [100%]

================================== FAILURES ==================================
________________________ [doctest] fizzbuzz.fizzbuzz _________________________
101     >>> fizzbuzz(5, fizz=3, buzz=5)
102     'buzz'
103     >>> fizzbuzz(3, fizz=3, buzz=5)
104     'fizz'
105     >>> fizzbuzz(15, fizz=3, buzz=5)
106     'fizzbuzz'
107     >>> fizzbuzz(4, fizz=3, buzz=5)
108     4
109     >>> fizzbuzz(4, fizz=4, buzz=6)

/home/bernat/code/perso/python-script/ DocTestFailure
========================== 1 failed in 0.02 seconds ==========================

We can also write unit tests as code. Let’s suppose we want to test the following function:

def main(options):
    """Compute a fizzbuzz set of strings and return them as an array."""
    logger.debug("compute fizzbuzz from {} to {}".format(options.start,
                                                         options.end))
    return [str(fizzbuzz(i, options.fizz, options.buzz))
            for i in range(options.start, options.end+1)]

At the end of the script,2 we add the following unit tests, leveraging pytest’s parametrized test functions:

# Unit tests
import pytest                   # noqa: E402
import shlex                    # noqa: E402

@pytest.mark.parametrize("args, expected", [
    ("0 0", ["fizzbuzz"]),
    ("3 5", ["fizz", "4", "buzz"]),
    ("9 12", ["fizz", "buzz", "11", "fizz"]),
    ("14 17", ["14", "fizzbuzz", "16", "17"]),
    ("14 17 --fizz=2", ["fizz", "buzz", "fizz", "17"]),
    ("17 20 --buzz=10", ["17", "fizz", "19", "buzz"]),
])
def test_main(args, expected):
    options = parse_args(shlex.split(args))
    options.debug = True
    options.silent = True
    assert main(options) == expected

The test function runs once for each of the provided parameters. The args part is used as input for the parse_args() function to get the appropriate options we need to pass to the main() function. The expected part is compared to the result of the main() function. When everything works as expected, pytest says:

python3 -m pytest -v --doctest-modules ./
============================ test session starts =============================
platform linux -- Python 3.7.4, pytest-3.10.1, py-1.8.0, pluggy-0.8.0 -- /usr/bin/python3
cachedir: .pytest_cache
rootdir: /home/bernat/code/perso/python-script, inifile:
plugins: xdist-1.26.1, timeout-1.3.3, forked-1.0.2, cov-2.6.0
collected 7 items

fizzbuzz.fizzbuzz PASSED                                  [ 14%]
test_main[0 0-expected0] PASSED                           [ 28%]
test_main[3 5-expected1] PASSED                           [ 42%]
test_main[9 12-expected2] PASSED                          [ 57%]
test_main[14 17-expected3] PASSED                         [ 71%]
test_main[14 17 --fizz=2-expected4] PASSED                [ 85%]
test_main[17 20 --buzz=10-expected5] PASSED               [100%]

========================== 7 passed in 0.03 seconds ==========================

When an error occurs, pytest provides a useful assessment of the situation:

$ python3 -m pytest -v --doctest-modules ./
================================== FAILURES ==================================
__________________________ test_main[0 0-expected0] __________________________

args = '0 0', expected = ['0']

    @pytest.mark.parametrize("args, expected", [
        ("0 0", ["0"]),
        ("3 5", ["fizz", "4", "buzz"]),
        ("9 12", ["fizz", "buzz", "11", "fizz"]),
        ("14 17", ["14", "fizzbuzz", "16", "17"]),
        ("14 17 --fizz=2", ["fizz", "buzz", "fizz", "17"]),
        ("17 20 --buzz=10", ["17", "fizz", "19", "buzz"]),
    ])
    def test_main(args, expected):
        options = parse_args(shlex.split(args))
        options.debug = True
        options.silent = True
>       assert main(options) == expected
E       AssertionError: assert ['fizzbuzz'] == ['0']
E         At index 0 diff: 'fizzbuzz' != '0'
E         Full diff:
E         - ['fizzbuzz']
E         + ['0']
AssertionError
----------------------------- Captured log call ------------------------------
DEBUG    compute fizzbuzz from 0 to 0
===================== 1 failed, 6 passed in 0.05 seconds =====================

The call to logger.debug() is included in the output. This is another good reason to use the logging feature! If you want to know more about the wonderful features of pytest, have a look at “Testing network software with pytest and Linux namespaces.”

To sum up, enhancing a Python script to make it more sustainable can be done in four steps:

  1. add documentation at the top,
  2. use the argparse module to document the different parameters,
  3. use the logging module to log details about progress, and
  4. add some unit tests.

You can find the complete example on GitHub and use it as a template!

Update (2019.06)

There are some interesting threads about this article on Lobsters and Reddit. While the addition of documentation and command-line arguments seems to be well-received, logs and tests are sometimes reported as too verbose. Dan Connolly wrote “Practical production python scripts” as an answer to this post.

  1. This requires the script name to end with .py. I dislike appending an extension to a script name: the language is a technical detail that shouldn’t be exposed to the user. However, it seems to be the easiest way to let test runners, like pytest, discover the enclosed tests. ↩︎

  2. Because the script ends with a call to sys.exit(), when invoked normally, the additional code for tests will not be executed. This ensures pytest is not needed to run the script. ↩︎

by Vincent Bernat at July 20, 2019 03:04 PM

July 18, 2019

Evaggelos Balaskas

slack-desktop and xdg-open

Notes from archlinux

xdg-open - opens a file or URL in the user’s preferred application

When you are trying to authenticate to a new workspace (with 2fa) using the slack-desktop, it will open your default browser and after the authentication your browser will re-direct you to the slack-desktop again using something like this


This is a MIME query!

$ xdg-mime query default x-scheme-handler/slack

$ locate slack.desktop
$ more /usr/share/applications/slack.desktop

[Desktop Entry]
Comment=Slack Desktop
GenericName=Slack Client for Linux
Exec=/usr/bin/slack --disable-gpu %U

I had to change the Exec entry above to point to my slack-desktop binary

Tag(s): slack, xdg

July 18, 2019 09:20 PM

Steve Kemp's Blog

Building a computer - part 2

My previous post on the subject of building a Z80-based computer briefly explained my motivation, and the approach I was going to take.

This post describes my progress so far:

  • On the hardware side, zero progress.
  • On the software-side, lots of fun.

To recap, I expect to wire a Z80 microprocessor to an Arduino (Mega). The Arduino will generate a clock-signal which will make the processor "tick". It will also react to read/write attempts that the processor makes to access RAM, and I/O devices.

The Z80 has a neat system for requesting I/O, via the use of the IN and OUT instructions which allow the processor to read/write a single byte to one of 256 connected devices (ports 0-255).
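To make that concrete, here is a hedged sketch (in Python rather than the project's C, with invented names) of how an emulator might dispatch IN/OUT instructions to per-port handler functions:

```python
class PortBus:
    """Dispatch Z80-style port I/O to per-port handler functions."""

    def __init__(self):
        self.readers = {}  # port number -> zero-argument callable returning a byte
        self.writers = {}  # port number -> callable accepting a byte

    def port_in(self, port):
        # Models IN A, (port); unconnected ports read as 0xFF here.
        return self.readers.get(port, lambda: 0xFF)()

    def port_out(self, port, value):
        # Models OUT (port), A; writes to unconnected ports are ignored.
        self.writers.get(port, lambda v: None)(value)

bus = PortBus()
written = []
bus.readers[1] = lambda: ord("x")                          # port 1: read a character
bus.writers[1] = lambda value: written.append(chr(value))  # port 1: write a character

bus.port_out(1, bus.port_in(1))  # echo one character, like the sample program
print(written)  # ['x']
```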

To experiment, and for a memory recap I found a Z80 assembler, and a Z80 disassembler, both packaged for Debian. I also found a Z80 emulator, which I forked and lightly-modified.

With the appropriate tools available I could write some simple code. I implemented two I/O routines in the emulator, one to read a character from STDIN, and one to write to STDOUT:

IN A, (1)   ; Read a character from STDIN, store in A-register.
OUT (1), A  ; Write the character in A-register to STDOUT

With those primitives implemented I wrote a simple script:

;  Simple program to upper-case a string
org 0
start:
   ; show a prompt.
   ld a, '>'
   out (1), a
   ; read a character
   in a,(1)
   ; eof?
   cp -1
   jp z, quit
   ; is it lower-case?  If not just output it
   cp 'a'
   jp c,output
   cp 'z'+1
   jp nc, output
   ; convert from lower-case to upper-case.  yeah.  math.
   sub a, 32
output:
   ; output the character
   out (1), a
   ; repeat forever.
   jr start
quit:
   ; terminate
   halt

With that written it could be compiled:

 $ z80asm ./sample.z80 -o ./sample.bin

Then I could execute it:

 $ echo "Hello, world" | ./z80emulator ./sample.bin
 Testing "./sample.bin"...

 1150 cycle(s) emulated.

And that's where I'll leave it for now. When I have the real hardware I'll hookup some fake-RAM containing this program, and code a similar I/O handler to allow reading/writing to the arduino's serial-console. That will allow the same code to run, unchanged. That'd be nice.

I've got a simple Z80-manager written, but since I don't have the chips yet I can only compile-test it. We'll see how well I did soon enough.

July 18, 2019 10:01 AM

July 14, 2019

Evaggelos Balaskas

kubernetes with minikube - Intro Notes

Notes based on Ubuntu 18.04 LTS

My notes for this k8s blog post are based upon an Ubuntu 18.04 LTS KVM virtual machine. The idea is to use nested KVM to run minikube inside a VM; minikube will then create a KVM node.

minikube builds a local kubernetes cluster on a single node with a set of small resources to run a small kubernetes deployment.

Archlinux -> VM Ubuntu 18.04 LTS runs minikube/kubectl -> KVM minikube node



Nested kvm



$ grep ^NAME /etc/os-release
NAME="Arch Linux"

Check that nested-kvm is already supported:

$ cat /sys/module/kvm_intel/parameters/nested

If the output is N (No) then remove & enable kernel module again:

$ sudo modprobe -r kvm_intel
$ sudo modprobe kvm_intel nested=1

Check that nested-kvm is now enabled:

$ cat /sys/module/kvm_intel/parameters/nested



Inside the virtual machine:

$ grep NAME /etc/os-release
PRETTY_NAME="Ubuntu 18.04.2 LTS"
$ egrep -o 'vmx|svm|0xc0f' /proc/cpuinfo

$ kvm-ok
INFO: /dev/kvm exists
KVM acceleration can be used



If the above step fails, try to edit the xml libvirtd configuration file in your host:

# virsh edit ubuntu_18.04

and change the cpu mode from:

  <cpu mode='custom' match='exact' check='partial'>
    <model fallback='allow'>Nehalem</model>
  </cpu>

to passthrough:

  <cpu mode='host-passthrough' check='none'/>


Install Virtualization Tools

Inside the VM


sudo apt -y install


We need to be included in the libvirt group

sudo usermod -a -G libvirt $(whoami)
newgrp libvirt



kubectl is a command line interface for running commands against Kubernetes clusters.

size: ~41M

$ export VERSION=$(curl -sL
$ curl -LO$VERSION/bin/linux/amd64/kubectl

$ chmod +x kubectl
$ sudo mv kubectl /usr/local/bin/kubectl

$ kubectl completion bash | sudo tee -a /etc/bash_completion.d/kubectl
$ kubectl version

If you want to use bash autocompletion without logging out and in again, use this:

source <(kubectl completion bash)

What the json output of kubectl version looks like:

$ kubectl version -o json | jq .
The connection to the server localhost:8080 was refused - did you specify the right host or port?
{
  "clientVersion": {
    "major": "1",
    "minor": "15",
    "gitVersion": "v1.15.0",
    "gitCommit": "e8462b5b5dc2584fdcd18e6bcfe9f1e4d970a529",
    "gitTreeState": "clean",
    "buildDate": "2019-06-19T16:40:16Z",
    "goVersion": "go1.12.5",
    "compiler": "gc",
    "platform": "linux/amd64"
  }
}


The connection to the server localhost:8080 was refused - did you specify the right host or port?

It’s okay; minikube hasn’t started yet.



size: ~40M

$ curl -sLO

$ chmod +x minikube-linux-amd64

$ sudo mv minikube-linux-amd64 /usr/local/bin/minikube

$ minikube version
minikube version: v1.2.0

$ minikube update-check
CurrentVersion: v1.2.0
LatestVersion: v1.2.0

$ minikube completion bash | sudo tee -a /etc/bash_completion.d/minikube 

To include bash completion without login/logout:

source <(minikube completion bash)


KVM2 driver

We need a driver so that minikube can build a kvm image/node for our kubernetes cluster.

size: ~36M

$ curl -sLO

$ chmod +x docker-machine-driver-kvm2

$ mv docker-machine-driver-kvm2 /usr/local/bin/


Start minikube

$ minikube start --vm-driver kvm2

* minikube v1.2.0 on linux (amd64)
* Downloading Minikube ISO ...
 129.33 MB / 129.33 MB [============================================] 100.00% 0s
* Creating kvm2 VM (CPUs=2, Memory=2048MB, Disk=20000MB) ...
* Configuring environment for Kubernetes v1.15.0 on Docker 18.09.6
* Downloading kubeadm v1.15.0
* Downloading kubelet v1.15.0
* Pulling images ...
* Launching Kubernetes ...
* Verifying: apiserver proxy etcd scheduler controller dns
* Done! kubectl is now configured to use "minikube"

Check via libvirt; you will find a new VM named minikube:

$ virsh list
 Id    Name                           State
 1     minikube                       running


Something went wrong?

Just delete the VM and configuration directories and start again:

$ minikube delete
$ rm -rf ~/.minikube/ ~/.kube

kubectl version

Now let’s run kubectl version again

$ kubectl version -o json | jq .
{
  "clientVersion": {
    "major": "1",
    "minor": "15",
    "gitVersion": "v1.15.0",
    "gitCommit": "e8462b5b5dc2584fdcd18e6bcfe9f1e4d970a529",
    "gitTreeState": "clean",
    "buildDate": "2019-06-19T16:40:16Z",
    "goVersion": "go1.12.5",
    "compiler": "gc",
    "platform": "linux/amd64"
  },
  "serverVersion": {
    "major": "1",
    "minor": "15",
    "gitVersion": "v1.15.0",
    "gitCommit": "e8462b5b5dc2584fdcd18e6bcfe9f1e4d970a529",
    "gitTreeState": "clean",
    "buildDate": "2019-06-19T16:32:14Z",
    "goVersion": "go1.12.5",
    "compiler": "gc",
    "platform": "linux/amd64"
  }
}



Start kubernetes dashboard

$ kubectl proxy --address --accept-hosts '.*'
Starting to serve on [::]:8001



July 14, 2019 06:41 PM

Bash bits: find has a -delete flag

Bash Bits are small examples, tips and tutorials for the bash shell. This bash bit shows you that find has a -delete option. I recently found this out; before that I would always use -exec rm {} \;. The -delete flag is shorter, performs better and is easier to remember.
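As a side note (my sketch, not from the original post), the same recursive delete is easy to express in Python with pathlib, which makes the semantics of find -delete visible:

```python
import pathlib
import tempfile

# Build a scratch tree with one matching and one non-matching file.
root = pathlib.Path(tempfile.mkdtemp())
(root / "old.log").write_text("stale")
(root / "keep.txt").write_text("keep")

# Roughly equivalent to: find "$root" -name '*.log' -delete
for path in root.rglob("*.log"):
    path.unlink()

print(sorted(p.name for p in root.iterdir()))  # ['keep.txt']
```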

July 14, 2019 12:00 AM

July 12, 2019

Only zero is false, everything else is true in C++

When using numbers in a boolean context (implicit conversion), remember that only zero evaluates to false. Anything else, including negative numbers, will evaluate to true. This snippet talks about the rules for implicit conversion in C++ when using booleans. For seasoned programmers it's nothing new, but I found it interesting.

July 12, 2019 12:00 AM

July 10, 2019

Trigger an on demand uptime & broken links check after a deploy with the Oh Dear! API

The post Trigger an on demand uptime & broken links check after a deploy with the Oh Dear! API appeared first on

You can use our API to trigger an on-demand run of both the uptime check and the broken links checker. If you add this to, say, your deploy script, you can have near-instant validation that your deploy succeeded and didn't break any links & pages.

Source: Trigger an ondemand uptime & broken links check after a deploy -- Oh Dear! blog


by Mattias Geniar at July 10, 2019 12:51 PM

July 09, 2019

There’s more than one way to write an IP address

The post There’s more than one way to write an IP address appeared first on

Most of us write our IP addresses the way we've been taught, a long time ago:,, ... but that gets boring after a while, doesn't it?

Luckily, there's a couple of ways to write an IP address, so you can mess with coworkers, clients or use it as a security measure to bypass certain (input) filters.

Not all behaviour is equal

I first learned about the different ways of writing an IP address by this little trick.

On Linux:

$ ping 0
PING 0 ( 56(84) bytes of data.
64 bytes from icmp_seq=1 ttl=64 time=0.053 ms
64 bytes from icmp_seq=2 ttl=64 time=0.037 ms

This translates the 0 to However, on a Mac:

$ ping 0
PING 0 ( 56 data bytes
ping: sendto: No route to host
ping: sendto: No route to host

Here, it translates 0 to, a null-route.

Zeroes are optional

Just like in IPv6 addresses, some zeroes (0) are optional in the IP address.

$ ping 127.1
PING 127.1 ( 56 data bytes
64 bytes from icmp_seq=0 ttl=64 time=0.033 ms
64 bytes from icmp_seq=1 ttl=64 time=0.085 ms

Note though, a computer can't just "guess" where it needs to fill in the zeroes. Take this one for example:

$ ping 10.50.1
PING 10.50.1 ( 56 data bytes
Request timeout for icmp_seq 0

It translates 10.50.1 to, adding the necessary zeroes before the last digit.

Overflowing the IP address

Here's another neat trick. You can overflow a digit.

For instance:

$ ping 10.0.513
PING 10.0.513 ( 56 data bytes
64 bytes from icmp_seq=0 ttl=61 time=10.189 ms
64 bytes from icmp_seq=1 ttl=61 time=58.119 ms

We ping 10.0.513, which translates to The last digit can be interpreted as 2 × 256 + 1. It shifts the values to the left.
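The carry arithmetic is easy to verify; this short check (mine, not from the post) splits the overflowed last octet of 10.0.513 into the two bytes the resolver ends up using:

```python
# 513 does not fit in one byte, so it spills into the last two
# bytes of the address: 513 = 2 * 256 + 1, hence .2.1.
high, low = divmod(513, 256)
print(high, low)  # 2 1

# Reassembling the full 32-bit value confirms the translation.
print(10 * 256**3 + 0 * 256**2 + high * 256 + low)  # 167772673
```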

Decimal IP notation

We can use a decimal representation of our IP address.

$ ping 167772673
PING 167772673 ( 56 data bytes
64 bytes from icmp_seq=0 ttl=61 time=15.441 ms
64 bytes from icmp_seq=1 ttl=61 time=4.627 ms

This translates 167772673 to

Hex IP notation

Well, if decimal notation worked, HEX should work too -- right? Of course it does!

$ ping 0xA000201
PING 0xA000201 ( 56 data bytes
64 bytes from icmp_seq=0 ttl=61 time=7.329 ms
64 bytes from icmp_seq=1 ttl=61 time=18.350 ms

The hex value A000201 translates to By prefixing the value with 0x, we indicate that what follows should be interpreted as a hexadecimal value.

Octal IP notation

Take this one for example.

$ ping
PING ( 56 data bytes

Notice how that last .010 octet gets translated to .8?

Using sipcalc to find these values

There's a useful command line IP calculator called sipcalc you can use for the decimal & hex conversions.


by Mattias Geniar at July 09, 2019 11:31 AM

July 08, 2019

Anton Chuvakin - Security Warrior

This Blog Is Here For Historical Interest Only!

This is a reminder/disclaimer: this blog is here for historical interest only. Its history really ended on August 1, 2011 with this post.

Naturally, you already know all this if you follow me on Twitter ...

There you have it!

by Anton Chuvakin ( at July 08, 2019 05:53 PM

July 06, 2019

GNUplot tips for nice looking charts from a CSV file

Recently I had to do some testing which resulted in a lot of log data. Looking at a bunch of text is not the same as seeing things graphically; this particular log data was perfect to put in a graph. My go-to tool for graphs and charts is gnuplot. Not only is it very extensible, it is also reproducible. Give me a config file and a command over 'do this, then this and then such and such' in Excel to get a consistent result. In this article I'll give tips for using gnuplot, which include parsing a CSV file, a second axis, environment variables, A4 PDFs and a ton of styling options for a nice looking chart.

July 06, 2019 12:00 AM

July 04, 2019

The Geekess

Outreachy progress 2019-06

Outreachy organizer admin tasks:

  • Run Outreachy biweekly internship chats
  • Read and follow up on initial feedback for Outreachy interns
  • Handle issues with intern time commitments
  • Remind Outreachy interns if they haven’t created a blog
  • Communicate with potential community and sponsors for the December 2019 round
  • Put out a call for Outreachy booth volunteers for the Tapia conference
  • Coordinate with potential contractors who can offer career advice and interviewing workshops to Outreachy interns.

Development tasks:

  • Work is progressing on the Outreachy website changes to separate out the initial application period and the contribution period.
  • Most of the template and view changes have been made.
  • Thorough tests need to be written next month to reveal any bugs.

by Sage at July 04, 2019 05:09 PM

June 19, 2019

Sean's IT Blog

Using Amazon RDS with Horizon 7 on VMware Cloud on AWS

Since I joined VMware back in November, I’ve spent a lot of time working with VMware Cloud on AWS – particularly around deploying Horizon 7 on VMC in my team’s lab.  One thing I hadn’t tried until recently was utilizing Amazon RDS with Horizon.

No, we’re not talking about the traditional Remote Desktop Session Host role. This is the Amazon Relational Database Service, and it will be used as the Event Database for Horizon 7.

After building out a multisite Horizon 7.8 deployment in our team lab, we needed a database server for the Horizon Events Database.  Rather than deploy and maintain a SQL Server in each lab, I decided to take advantage of one of the benefits of VMware Cloud on AWS and use Amazon RDS as my database tier.

This isn’t the first time I’ve used native Amazon services with Horizon 7.  I’ve previously written about using Amazon Route 53 with Horizon 7 on VMC.

Before we begin, I want to call out that this might not be 100% supported.  I can’t find anything in the documentation, KB58539, or the readme files that explicitly state that RDS is a supported database platform.  RDS is also not listed in the Product Interoperability Matrix.  However, SQL Server 2017 Express is supported, and there are minimal operational impacts if this database experiences an outage.

What Does a VDI Solution Need With A Database Server?

VMware Horizon 7 utilizes a SQL Server database for tracking user session data such as logins and logouts and auditing administrator activities that are performed in the Horizon Administrator console. Unlike on-premises environments where there are usually existing database servers that can host this database, deploying Horizon 7 on VMware Cloud on AWS would require a new database server for this service.

Amazon RDS is a database-as-a-service offering built on the AWS platform. It provides highly scalable and performant database services for multiple database engines including Postgres, Microsoft SQL Server and Oracle.

Using Amazon RDS for the Horizon 7 Events Database

There are a couple of steps required to prepare our VMware Cloud on AWS infrastructure to utilize native AWS services. While the initial deployment includes connectivity to a VPC that we define, there is still some networking that needs to be put into place to allow these services to communicate. We’ll break this work down into three parts:

  1. Preparing the VMC environment
  2. Preparing the AWS VPC environment
  3. Deploying and Configuring RDS and Horizon

Preparing the VMC Environment

The first step is to prepare the VMware Cloud on AWS environment to utilize native AWS services. This work takes place in the VMware Cloud on AWS management console and consists of two main tasks. The first is to document the availability zone that our VMC environment is deployed in; native Amazon services should be deployed in the same availability zone to reduce any networking costs. The second is to configure firewall rules on the VMC Compute Gateway to allow traffic to pass to the VPC.

The steps for preparing the VMC environment are:

  1. Log into
  2. Click Console
  3. In the My Services section, select VMware Cloud on AWS
  4. In the Software-Defined Data Centers section, find the VMware Cloud on AWS environment that you are going to manage and click View Details.
  5. Click the Networking and Security tab.
  6. In the System menu, click Connected VPC. This will display information about the Amazon account that is connected to the environment.
  7. Find the VPC subnet. This will tell you what AWS Availability Zone the VMC environment is deployed in. Record this information as we will need it later.

Now that we know which Availability Zone we will be deploying our database into, we will need to create our firewall rules. The firewall rules will allow our Connection Servers and other VMs to connect to any native Amazon services that we deploy into our connected VPC.

This next section picks up from the previous steps, so you should be in the Networking and Security tab of the VMC console. The steps for configuring our firewall rules are:

  1. In the Security Section, click on Gateway Firewall.
  2. Click Compute Gateway
  3. Click Add New Rule
  4. Create the new firewall rule by filling in the following fields:
    1. In the Name field, provide a descriptive name for the firewall rule.
    2. In the Source field, click Select Source. Select the networks or groups and click Save.
      Note: If you do not have any groups, or you don’t see the network you want to add to the firewall, you can click Create New Group to create a new Inventory Group.
    3. In the Destination field, click Select Destination. Select the Connected VPC Prefixes option and click Save.
    4. In the Services field, click Select Services. Select Any option and click Save.
    5. In the Applied To field, remove the All Interfaces option and select VPC Interfaces.
  5. Click Publish to save and apply the firewall rule.

There are two reasons that the VMC firewall rule is configured this way. First, Amazon assigns IP addresses at service creation, so the exact destination addresses are not known in advance. Second, this firewall rule can be reused for other AWS services, and access to those services can be controlled using AWS Security Groups instead.

The VMC gateway firewall does allow for more granular rule sets. They are just not going to be utilized in this walkthrough.

Preparing the AWS Environment

Now that the VMC environment is configured, the RDS service needs to be provisioned. There are a couple of steps to this process.

First, we need to configure a security group that will be used for the service.

  1. Log into your Amazon Console.
  2. Change to the region where your VMC environment is deployed.
  3. Go into the VPC management interface. This is done by going to Services and selecting VPC.
  4. Select Security Groups
  5. Click Create Security Group
  6. Give the security group a name and description.
  7. Select the VPC where the RDS Services will be deployed.
  8. Click Create.
  9. Click Close.
  10. Select the new Security Group.
  11. Click the Inbound Rules tab.
  12. Click Edit Rules
  13. Click Add Rule
  14. Fill in the following details:
    1. Type – Select MS SQL
    2. Source – Select Custom and enter the IP Address or Range of the Connection Servers in the next field
    3. Description – Description of the server or network
    4. Repeat as Necessary
  15. Click Save Rules

This security group will allow our connection servers to access the database services that are being hosted in RDS.

Once the security group is created, the RDS instance can be deployed. The steps for deploying the RDS instance are:

  1. Log into your Amazon Console.
  2. Change to the region where your VMC environment is deployed.
  3. Go into the RDS management interface. This is done by going to Services and selecting RDS.
  4. Click Create Database.
  5. Select Microsoft SQL Server.
  6. Select the version of SQL Server that will be deployed. For this walkthrough, SQL Server Express will be used.

    Note: There is a SQL Server Free Tier offering that can be used if this database will only be used for the Events Database. The Free Tier offering is only available with SQL Server Express. If you only want to use the Free Tier offering, select the Only enable options eligible for RDS Free Tier Usage checkbox.

  7. Click Next.
  8. Specify the details for the RDS Instance.
    1. Select License Model, DB Engine Version, DB instance class, Time Zone, and Storage.
      Note: Not all options are available if RDS Free Tier is being utilized.
    2. Provide a DB Instance Identifier. This must be unique for all RDS instances you own in the region.
    3. Provide a master username. This will be used for logging into the SQL Server instance with SA rights.
    4. Provide and confirm the master username password.
    5. Click Next.
  9. Configure the Networking and Security Options for the RDS Instance.
      1. Select the VPC that is attached to your VMC instance.
      2. Select No under Public Accessibility.
        Note: This refers to access to the RDS instance via a public IP address. You can still access the RDS instance from VMC since routing rules and firewall rules will allow the communication.
      3. Select the Availability Zone that the VMC tenant is deployed in.
      4. Select Choose Existing VPC Security Groups
      5. Remove the default security group by clicking the X.
      6. Select the security group that was created for accessing the RDS instance.

  10. Select Disable Performance Insights.
  11. Select Disable Auto Minor Version Upgrade.
  12. Click Create Database.

Once Create Database is clicked, the deployment process starts. This takes a few minutes to provision. After provisioning completes, the Endpoint URL for accessing the instance will be available in the RDS Management Console. It’s also important to validate that the instance was deployed in the correct availability zone. While testing this process, some database instances were created in an availability zone that was different from the one selected during the provisioning process.

Make sure you copy your Endpoint URL. You will need this in the next step to configure the database and Horizon.

Creating the Horizon Events Database

The RDS instance that was provisioned in the last step is an empty SQL Server instance. There are no databases or SQL Server user accounts, and these will need to be created in order to use this server with Horizon. A tool like SQL Server Management Studio is required to complete these steps, and we will be using SSMS for this walkthrough. The instance must be accessible from the machine that has the database management tools installed.

The Horizon Events Database does not utilize Windows Authentication, so a SQL Server user will be required along with the database that we will be setting up. This account also requires db_owner rights on that database so Horizon can provision the tables when we configure it in Horizon the first time.

The steps for configuring the database server are:

  1. Log into the new RDS instance with SQL Server Management Studio using the Master Username and Password.
  2. Right Click on Databases
  3. Select New Database
  4. Enter HorizonEventsDB in the Database Name Field.
  5. Click OK.
  6. Expand Security.
  7. Right click on Logins and select New Login.
  8. Enter a username for the database.
  9. Select SQL Server Authentication
  10. Enter a password.
  11. Uncheck Enforce Password Policy
  12. Change the Default Database to HorizonEventsDB
  13. In the Select A Page section, select User Mapping
  14. Check the box next to HorizonEventsDB
  15. In the Database Role Membership section, select db_owner
  16. Click OK
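The SSMS steps above can also be scripted. The sketch below is not part of the original walkthrough: the login name `horizon_events`, the connection details, and the ODBC driver version are assumptions you would adjust for your environment, and it presumes the `pyodbc` package is installed.

```python
# Hedged sketch: the SSMS clicks above expressed as T-SQL, driven from
# Python. The login name "horizon_events" and all connection details are
# placeholders, not values from the walkthrough.
DB_NAME = "HorizonEventsDB"
LOGIN = "horizon_events"   # hypothetical SQL login for Horizon

def setup_statements(login_password):
    """T-SQL equivalent of the SSMS steps: database, login, user, db_owner."""
    return [
        f"CREATE DATABASE [{DB_NAME}];",
        # SQL Server authentication, password policy off, default DB set
        (f"CREATE LOGIN [{LOGIN}] WITH PASSWORD = '{login_password}', "
         f"CHECK_POLICY = OFF, DEFAULT_DATABASE = [{DB_NAME}];"),
        f"USE [{DB_NAME}];",
        f"CREATE USER [{LOGIN}] FOR LOGIN [{LOGIN}];",
        # db_owner lets Horizon create its tables on first connect
        f"ALTER ROLE db_owner ADD MEMBER [{LOGIN}];",
    ]

def apply(endpoint, master_user, master_password, login_password):
    """Run the statements against the RDS endpoint (requires pyodbc)."""
    import pyodbc  # pip install pyodbc
    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 17 for SQL Server};"
        f"SERVER={endpoint},1433;UID={master_user};PWD={master_password}",
        autocommit=True,  # CREATE DATABASE cannot run inside a transaction
    )
    cur = conn.cursor()
    for stmt in setup_statements(login_password):
        cur.execute(stmt)
```

Note the `autocommit=True`: SQL Server refuses to run CREATE DATABASE inside a transaction, which pyodbc otherwise opens implicitly.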

Configuring Horizon to Utilize RDS for the Events Database

Now that the RDS instance has been set up and configured, Horizon can be configured to use it for the Events Database. The steps for configuring this are:

  1. Log into Horizon Administrator.
  2. Expand View Configuration
  3. Click on Event Configuration
  4. Click Edit
  5. Enter the Database Server, Database Name, Username, and Password and click OK.

Benefits of Using RDS with Horizon 7

Combining VMware Horizon 7 with Amazon RDS is just one example of how you can utilize native Amazon services with VMware Cloud on AWS. This allows organizations to get the best of both worlds – easily consumed cloud services backing enterprise applications on a platform that requires few changes to the applications themselves or to operational processes.

Utilizing native AWS services like RDS has additional benefits for EUC environments. When deploying Horizon 7 on VMware Cloud on AWS, the management infrastructure is typically deployed in the Software Defined Datacenter alongside the desktops. By utilizing native AWS services, resources that would otherwise be reserved for and consumed by servers can now be utilized for desktops.


by seanpmassey at June 19, 2019 01:00 PM

June 18, 2019

Sarah Allen

debugging openssl shared library

I’m debugging an issue where my app uses a library that requires me to dynamically link with an openssl library. What’s more, I’m debugging it on an old Linux version. Sigh.

gdb to the rescue!

After figuring out how to build openssl from source, I stumbled upon a gdb trick… suppose you are using a fairly standard open source library (like openssl) and you want to debug something that uses it (some other library that doesn’t work over ssl). gdb will let you know if there’s an easy way to download the symbols! Just type gdb + library name.

Here’s an example

gdb openssl
GNU gdb (GDB) Red Hat Enterprise Linux (7.2-92.el6)
Copyright (C) 2010 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
Reading symbols from /usr/bin/openssl...(no debugging symbols found)...done.
Missing separate debuginfos, use: debuginfo-install openssl-1.0.1e-57.el6.x86_64
(gdb) quit

Now I can use this command to install the debug symbols for the specific version of openssl that is installed on this system:

debuginfo-install openssl-1.0.1e-57.el6.x86_64

then I can debug my app looking at how it calls openssl. In the gdb session below, I first set a breakpoint in main, and run to that point…

(gdb) b main
(gdb) run
Starting program: /home/builder/src/app 
warning: no loadable sections found in added symbol-file system-supplied DSO at 0x7ffff7ffa000
[Thread debugging using libthread_db enabled]

Breakpoint 1, main (argc=4, argv=0x7fffffffe698) at sample.cpp:226
226         LOG("Here I am in main!")

now the openssl library is loaded and I can set a breakpoint in it:

(gdb) b SSL_CTX_set_verify
Breakpoint 2 at 0x7ffff7734bb0: file ssl_lib.c, line 2040.
(gdb) c
Creating connection object
[New Thread 0x7ffff4bd6700 (LWP 53)]
Connecting to server/app URL: rtmps://

Breakpoint 2, SSL_CTX_set_verify (ctx=0x62efe0, mode=1, cb=0x7ffff7acb6c0 <SecuredConnectionIO::VerifyCallback(int, x509_store_ctx_st*)>)
    at ssl_lib.c:2040
2040        ctx->verify_mode=mode;

I can look at variables or all of the function arguments:

(gdb) p mode
$1 = 1
(gdb) info args
ctx = 0x62efe0
mode = 1
cb = 0x7ffff7acb6c0 <SecuredConnectionIO::VerifyCallback(int, x509_store_ctx_st*)>

How cool is that?

by sarah at June 18, 2019 04:49 AM

June 16, 2019

Sarah Allen

digital identity: how to verify trust?

How can we communicate with each other on the Internet so that we know each other when we want to be known, yet can have privacy or anonymity when appropriate? My brief notes from April 2018 Internet Identity Workshop (below) still feel relevant a year later.

If we believe that a particular person is trust-worthy, to trust their digital representation, we somehow need to identify that some bits that travel across wires or air actually originate from that person.

In today’s Web, we have a network of trusted authorities, typically my social network or email provider creates a relationship with me and I prove my identity with a password. The challenge is that they also house all of my personal data — could there be a way for me to identify myself without making myself vulnerable to the whims or errors of these companies? New models are emerging.

  • Mobile Drivers License: British Columbia and U.S. Commerce Department’s National Institute of Standards and Technology (NIST) have funded development of a new kind of digital ID. Folks are working on ways to validate the identity and “claims” of an individual. This is not just for fraud detection. It also potentially protects the privacy of an individual, in contrast to a traditional drivers license, where I must reveal my home address while proving that I’m over 21.

  • Decentralized Identifier (DID): a standard way for individuals and organizations to create permanent, globally unique, cryptographically verifiable identifiers entirely under the identity owner’s control. Sovrin Foundation Whitepaper

  • With blockchains, every public key can now have its own address, enabling a decentralized self-service registry for public keys.

  • Trust without shared secrets. In cryptography we typically share a secret which allows us to decrypt future messages. But the best way to keep a secret is not to tell anyone. We can actually verify a secret without knowing it. Zero-knowledge proof

  • Object capabilities. In the real world we have physical objects that we can transfer for very specific authorization (e.g. a key to your car), whereas digital keys must be kept secret to avoid replication — what if authorization were coupled with objects in the digital world? Some basic examples illustrate the framework, discussed further in false dichotomy of control vs sharing.

Full notes from IIW 26: PDF Proceedings, wiki

More about IIW

The Internet Identity Workshop (IIW) gathers experts across the industry to solve this particular question. People share their understanding of the problem and potential solutions in this unique unconference twice a year. I always learn unexpected and useful technical solutions, and more importantly gain a deeper understanding of this challenging problem of identity.

by sarah at June 16, 2019 12:56 PM

June 05, 2019

Cryptography Engineering

How does Apple (privately) find your offline devices?

At Monday’s WWDC conference, Apple announced a cool new feature called “Find My”. Unlike Apple’s “Find my iPhone“, which uses cellular communication and the lost device’s own GPS to identify the location of a missing phone, “Find My” also lets you find devices that don’t have cellular support or internal GPS — things like laptops, or (and Apple has hinted at this only broadly) even “dumb” location tags that you can attach to your non-electronic physical belongings.

The idea of the new system is to turn Apple’s existing network of iPhones into a massive crowdsourced location tracking system. Every active iPhone will continuously monitor for BLE beacon messages that might be coming from a lost device. When it picks up one of these signals, the participating phone tags the data with its own current GPS location; then it sends the whole package up to Apple’s servers. This will be great for people like me, who are constantly losing their stuff: if I leave my backpack on a tour bus in China (or, more likely, in my office), sooner or later someone else will stumble on its signal and I’ll instantly know where to find it.

(It’s worth mentioning that Apple didn’t invent this idea. In fact, companies like Tile have been doing this for quite a while. And yes, they should probably be worried.)

If you haven’t already been inspired by the description above, let me phrase the question you ought to be asking: how is this system going to avoid being a massive privacy nightmare?

Let me count the concerns:

  • If your device is constantly emitting a BLE signal that uniquely identifies it, the whole world is going to have (yet another) way to track you. Marketers already use WiFi and Bluetooth MAC addresses to do this: Find My could create yet another tracking channel.
  • It also exposes the phones that are doing the tracking. These people are now going to be sending their current location to Apple (which they may or may not already be doing). Now they’ll also be potentially sharing this information with strangers who “lose” their devices. That could go badly.
  • Scammers might also run active attacks in which they fake the location of your device. While this seems unlikely, people will always surprise you.

The good news is that Apple claims that their system actually does provide strong privacy, and that it accomplishes this using clever cryptography. But as is typical, they’ve declined to give out the details of how they’re going to do it. Andy Greenberg talked me through an incomplete technical description that Apple provided to Wired, so that provides many hints. Unfortunately, what Apple provided still leaves huge gaps, and it’s those gaps that I’m going to fill with my best guess for what Apple is actually doing.

A big caveat: much of this could be totally wrong. I’ll update it relentlessly when Apple tells us more.

Some quick problem-setting

To lay out our scenario, we need to bring several devices into the picture. For inspiration, we’ll draw from the 1950s television series “Lassie”.

A first device, which we’ll call Timmy, is “lost”. Timmy has a BLE radio but no GPS or connection to the Internet. Fortunately, he’s been previously paired with a second device called Ruth, who wants to find him. Our protagonist is Lassie: she’s a random (and unknowing) stranger’s iPhone, and we’ll  assume that she has at least an occasional Internet connection and solid GPS. She is also a very good girl. The networked devices communicate via Apple’s iCloud servers, as shown below:

(Since Timmy and Ruth have to be paired ahead of time, it’s likely they’ll both be devices owned by the same person. Did I mention that you’ll need to buy two Apple devices to make this system work? That’s also just fine for Apple.)

Since this is a security system, the first question you should ask is: who’s the bad guy? The answer in this setting is unfortunate: everyone is potentially a bad guy. That’s what makes this problem so exciting.

Keeping Timmy anonymous

The most critical aspect of this system is that we need to keep unauthorized third parties from tracking Timmy, especially when he’s not lost. This precludes some pretty obvious solutions, like having the Timmy device simply shout “Hi my name is Timmy, please call my mom Ruth and let her know I’m lost.” It also precludes just about any unchanging static identifier, even an opaque and random-looking one.

This last requirement is inspired by the development of services that abuse static identifiers broadcast by your devices (e.g., your WiFi MAC address) to track devices as you walk around. Apple has been fighting this — with mixed success — by randomizing things like MAC addresses. If Apple added a static tracking identifier to support the “Find My” system, all of these problems could get much worse.

This requirement means that any messages broadcast by Timmy have to be opaque — and moreover, the contents of these messages must change, relatively frequently, to new values that can’t be linked to the old ones. One obvious way to realize this is to have Timmy and Ruth agree on a long list of random “pseudonyms” for Timmy, and have Timmy pick a different one each time.

This helps a lot. Each time Lassie sees some (unknown) device broadcasting an identifier, she won’t know if it belongs to Timmy: but she can send it up to Apple’s servers along with her own GPS location. In the event that Timmy ever gets lost, Ruth can ask Apple to search for every single one of Timmy‘s possible pseudonyms. Since nobody outside of Apple ever learns this list, and even Apple only learns it after someone gets lost, this approach prevents most tracking.

A slightly more efficient way to implement this idea is to use a cryptographic function (like a MAC or hash function) in order to generate the list of pseudonyms from a single short “seed” that both Timmy and Ruth will keep a copy of. This is nice because the data stored by each party will be very small. However, to find Timmy, Ruth must still send all of the pseudonyms — or her “seed” — up to Apple, who will have to search its database for each one.
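As a concrete sketch of that seed-based idea: Timmy and Ruth could derive the pseudonym list by applying a MAC to a counter. To be clear, Apple has not published its construction; the use of HMAC-SHA256, the 16-byte pseudonym length, and the counter scheme below are all assumptions for illustration only.

```python
# Speculative sketch of the seed-based pseudonym scheme described above.
# HMAC-SHA256 over a counter is an assumption, not Apple's published design.
import hashlib
import hmac

def pseudonym(seed: bytes, counter: int) -> bytes:
    """Derive the counter-th pseudonym from the shared seed."""
    msg = counter.to_bytes(8, "big")
    return hmac.new(seed, msg, hashlib.sha256).digest()[:16]

# Timmy broadcasts pseudonym(seed, i), rotating i on a schedule. Ruth,
# holding the same short seed, can regenerate the whole list and ask
# Apple to search for each value; to anyone else the broadcasts look
# like unrelated random strings.
seed = b"shared-secret-seed"
broadcasts = [pseudonym(seed, i) for i in range(3)]
assert len(set(broadcasts)) == 3            # each broadcast is distinct
assert pseudonym(seed, 1) == broadcasts[1]  # but Ruth can reproduce them
```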

Hiding Lassie’s location

The pseudonym approach described above should work well to keep Timmy‘s identity hidden from Lassie, and even from Apple (up until the point that Ruth searches for it.) However, it’s got a big drawback: it doesn’t hide Lassie‘s GPS coordinates.

This is bad for at least a couple of reasons. Each time Lassie detects some device broadcasting a message, she needs to transmit her current position (along with the pseudonym she sees) to Apple’s servers. This means Lassie is constantly telling Apple where she is. And moreover, even if Apple promises not to store Lassie‘s identity, the result of all these messages is a huge centralized database that shows every GPS location where some Apple device has been detected.

Note that this data, in the aggregate, can be pretty revealing. Yes, the identifiers of the devices might be pseudonyms — but that doesn’t make the information useless. For example: a record showing that some Apple device is broadcasting from my home address at certain hours of the day would probably reveal when I’m in my house.

An obvious way to prevent this data from being revealed to Apple is to encrypt it — so that only parties who actually need to know the location of a device can see this information. If Lassie picks up a broadcast from Timmy, then the only person who actually needs to know Lassie‘s GPS location is Ruth. To keep this information private, Lassie should encrypt her coordinates under Ruth’s encryption key.

This, of course, raises a problem: how does Lassie get Ruth‘s key? An obvious solution is for Timmy to shout out Ruth’s public key as part of every broadcast he makes. Of course, this would produce a static identifier that would make Timmy‘s broadcasts linkable again.

To solve that problem, we need Ruth to have many unlinkable public keys, so that Timmy can give out a different one with each broadcast. One way to do this is have Ruth and Timmy generate many different shared keypairs (or generate many from some shared seed). But this is annoying and involves Ruth storing many secret keys. And in fact, the identifiers we mentioned in the previous section can be derived by hashing each public key.

A slightly better approach (that Apple may not employ) makes use of key randomization. This is a feature provided by cryptosystems like Elgamal: it allows any party to randomize a public key, so that the randomized key is completely unlinkable to the original. The best part of this feature is that Ruth can use a single secret key regardless of which randomized version of her public key was used to encrypt.
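Here is a toy demonstration of that randomization property, using classic Elgamal over a multiplicative group modulo a prime. A real deployment would presumably use elliptic curves; the modulus, generator, and integer "GPS coordinates" below are purely illustrative choices of mine.

```python
# Toy Elgamal with key randomization. Illustrative parameters only:
# real systems would use an elliptic-curve group, not this.
import secrets

P = 2**127 - 1   # a Mersenne prime, fine for a demo
G = 3

def keygen():
    x = secrets.randbelow(P - 2) + 1
    return x, (G, pow(G, x, P))        # secret x, public (g, y = g^x)

def randomize(pk):
    """Anyone can rerandomize (g, y) -> (g^r, y^r), unlinkable to (g, y)."""
    g, y = pk
    r = secrets.randbelow(P - 2) + 1
    return pow(g, r, P), pow(y, r, P)

def encrypt(pk, m):
    g, y = pk
    k = secrets.randbelow(P - 2) + 1
    return pow(g, k, P), (m * pow(y, k, P)) % P

def decrypt(x, ct):
    c1, c2 = ct
    return (c2 * pow(c1, P - 1 - x, P)) % P   # c2 / c1^x via Fermat

x, pk = keygen()            # Ruth's single secret key and base public key
pk2 = randomize(pk)         # the randomized copy Timmy broadcasts
m = 424242                  # stand-in for encoded GPS coordinates
assert decrypt(x, encrypt(pk2, m)) == m
```

The point is the last assertion: Lassie encrypts under a randomized key that was never published anywhere, and Ruth still decrypts with her one unchanging secret.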


All of this leads to a final protocol idea. Each time Timmy broadcasts, he uses a fresh pseudonym and a randomized copy of Ruth‘s public key. When Lassie receives a broadcast, she encrypts her GPS coordinates under the public key, and sends the encrypted message to Apple. Ruth can send in Timmy‘s pseudonyms to Apple’s servers, and if Apple finds a match, she can obtain and decrypt the GPS coordinates.

Does this solve all the problems?

The nasty thing about this problem setting is that, with many weird edge cases, there just isn’t a perfect solution. For example, what if Timmy is evil and wants to make Lassie reveal her location to Apple? What if Old Man Smithers tries to kidnap Lassie?

At a certain point, the answer to these question is just to say that we’ve done our best: any remaining problems will have to be outside the threat model. Sometimes even Lassie knows when to quit.

by Matthew Green at June 05, 2019 07:29 PM

June 04, 2019

Everything Sysadmin

Next nycdevops meetup: Kubernetes Informers (Wed, June 19)

Robert Ross (a.k.a. Bobby Tables) will be the speaker at the next nycdevops meetup on Wed, June 19, 2019.

Full details and RSVP info:

NOTE: Different day and location!

  • Title: Staying Informed with Kubernetes Informers
  • Speaker: Robert Ross (Bobby Tables) from FireHydrant
  • Date: Wed, June 19, 2019
  • Location: Compass, 90 Fifth Ave, New York, NY 10011

Kubernetes state is changing all the time. Pods are being created. Deployments are adding more replicas. Load balancers are being created from services. All of these things can happen without anyone noticing. Sometimes, however, we need to notice so that we can react to these events. What if we need to push the change to an audit log? What if we want to inform a Slack room about a new deployment? In Kubernetes, this is possible with the informers that are baked into the API and Go client. In this talk we'll learn how informers work, and how to receive updates when resources change using a simple Go application.


Bobby is the founder of, previously worked as a staff software engineer at Namely, and also built things at DigitalOcean. He likes bleeding edge tech and making software that helps teams build better systems. From deploying Spinnaker, Istio, and Kubernetes, he has cursed at a lack of docs, spelunked through code, and loves telling the war stories about them.

Full details and RSVP info:

by Tom Limoncelli at June 04, 2019 02:53 PM

June 01, 2019

The Geekess

Outreachy Progress 2019-05

Outreachy organizer admin tasks:

  • Announce the accepted Outreachy interns
  • Handle situations where interns were unable to accept the internship
  • Update the intern contract dates (we need to automate this in the website code!)
  • Send signed intern contracts to Conservancy
  • Send an invoice request to Conservancy for Outreachy sponsors
  • Follow all interns on Twitter, retweet their first blog posts if @outreachy is tagged
  • Run the first intern Q&A session

Development tasks:

  • Add website code to allow mentors to invite co-mentors and have co-mentors sign the mentorship agreement for any selected interns
  • Create page for organizers to see contact info for mentors who selected an intern (so we can easily subscribe them to the mentor mailing list)


Outreachy at PyCon U.S. Sprints

We successfully tested the Outreachy website developer’s documentation at the PyCon U.S. sprints. This is the first step towards limiting the “bus factor” and ensuring that many people can work on the Outreachy website.

13 people participated in the sprints. Most were running Linux, but there was one Windows user who followed our installation guide and successfully made their first contribution.

Half of the participants were unfamiliar with the Django web development framework that the Outreachy website uses. There were several people who had never made a contribution to free software before. We’re proud that they could make their first impact on the free software world with Outreachy!

Overall, 8 pull requests were merged, with 7 more pull requests waiting for review. The pull requests included improvements like clarifying our documentation, clarifying application questions, ensuring links on our opportunities page were valid, and improving the layout of our pages for past rounds.

Blog Post Prompts

During the Outreachy internship, interns are required to blog every two weeks. The Outreachy organizers found that interns often didn’t know what to blog about, so we started creating a series of blog post prompts. The prompts are highly relevant to the intern experience as the internship progresses.

Our first blog post prompts normalize the fact that all interns struggle during the first few weeks. The mid-point blog post prompt asks interns to reflect on their original project timeline, and how unexpected complexity means projects often have to be scaled back. We wanted to do a full series of blog post prompts last round, but we ran out of time before the next application period kicked off.

This round, we’re finishing out the last two blog post prompts for weeks 9 and 11 of the internship. They will focus on the next steps after the Outreachy internship, namely how interns can start a career in tech or free software.

The week 9 blog post prompt is for interns to write about what direction they would like to take their career. Some Outreachy interns are still in school, so we ask them to provide what time frame they want to take their next steps in.

The week 11 email will prompt the interns to work on their resumes, and then post them on their blog.

Career Development

Outreachy organizers are also in discussions with some contractors who may be able to provide some career advice to Outreachy interns. We’ve long wanted to provide more career services to interns, but haven’t been able to allocate organizer time to this. We’re still in negotiations, but we hope this round we can finally offer this.

by Sage at June 01, 2019 01:28 AM

May 29, 2019


Visualizing Commodore 1541 Disk Contents – Part 2: Errors

We have previously visualized the physical layout of C64/1541 disks. In order to understand the encoding and potential read errors, it is more useful to visualize the disks sector-by-sector.

This animation shows 12 regular disks. (Click for original size.)

  • Every pack of 17-21 lines is a track, numbered 1-41.
  • Every line within a pack is one sector.
  • The raw sector contents are drawn from left to right.
  • The cyan part is the header, the green part the data.
  • Black is 0, cyan/green is 1.
  • White represents missing header or sector sections.

The tool to generate these images from .G64 files is available at
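The geometry described above (17 to 21 sectors per track, depending on the track) follows the 1541 speed zones, which can be sketched as a small lookup. Tracks 36-41 lie outside the standard 35-track DOS format but share the innermost zone:

```python
# 1541 speed zones: outer tracks pass more media under the head per
# revolution, so they hold more sectors.
def sectors_per_track(track: int) -> int:
    if not 1 <= track <= 41:
        raise ValueError("1541 tracks are numbered 1-41")
    if track <= 17:
        return 21
    if track <= 24:
        return 19
    if track <= 30:
        return 18
    return 17   # tracks 31-35 (and the nonstandard 36-41)

assert sectors_per_track(1) == 21
assert sectors_per_track(18) == 19
# A standard 35-track disk holds 683 blocks in total:
assert sum(sectors_per_track(t) for t in range(1, 36)) == 683
```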

Let us look at the header first. Here is an animation of the header of tracks 23 to 25 of a few disks:

The header contains six data bytes encoded using the 4-to-5 GCR scheme:

  • H: Header Code: Every header starts with 0x08 so it can be distinguished from the sector data (0x07). You can see that it is the same bit pattern on all sectors on all tracks.
  • T: Track: This is the track number (1-35). It is the same for all sectors on the same track.
  • S: Sector: This is the sector number (0-16/17/18/20, depending on the track). You can see the same bit patterns on each of the three tracks shown.
  • ID: ID: The two-byte ID should be unique per disk and is used to detect disk changes. It is the same for all sectors on the same disk.
  • C: Checksum: The checksum is the XOR of T, S and the two ID bytes, so you can see it behaves kind of randomly in the animation above.
  • GAP: The gap does not contain usable data, but separates the header from the sector data.
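As a sketch of the fields above, here are the header checksum and the 4-to-5 GCR encoding in Python. The GCR table is the standard Commodore code; the byte order shown (code, checksum, sector, track, ID2, ID1) is the commonly documented 1541 header layout.

```python
# Sketch of the header fields described above. GCR maps each nibble to a
# 5-bit code (the standard Commodore table).
GCR = [0x0A, 0x0B, 0x12, 0x13, 0x0E, 0x0F, 0x16, 0x17,
       0x09, 0x19, 0x1A, 0x1B, 0x0D, 0x1D, 0x1E, 0x15]

def header_bytes(track, sector, id1, id2):
    """The six data bytes of a sector header; C = T xor S xor ID."""
    checksum = track ^ sector ^ id1 ^ id2
    return [0x08, checksum, sector, track, id2, id1]

def gcr_encode(data):
    """Encode bytes as a bit string, one 5-bit GCR code per nibble."""
    return "".join(f"{GCR[b >> 4]:05b}{GCR[b & 0x0F]:05b}" for b in data)

# Every header starts with the GCR encoding of the 0x08 code byte,
# which is why that bit pattern is identical on all sectors:
assert gcr_encode([0x08]) == "0101001001"
# GCR never produces more than two consecutive zero bits (needed for
# clock recovery), and runs of ones stay short of the 10-bit SYNC mark:
assert "000" not in gcr_encode(range(256))
```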


The first byte of the data section is the code 0x07 to distinguish it from a header. The rest is the 256 data bytes. The Commodore DOS filesystem uses two link bytes (next track, next sector) at the beginning of every sector, which is why the next two bytes in this visualization look more regular between disks.

Good Disks

First, let’s look at some error-free disks in practice.

Empty Disk

An empty disk looks quite uniform: (Click for original size.)

Only sectors 0 and 1 of track 18 look different:

18/0 contains the block allocation map (BAM) and the name of the disk, and 18/1 contains the first 8 directory entries.

Illegal Bits

The last few bits of the end of the header are also different for sectors 0 and 1. The headers of all sectors directly after formatting the disk end in a pattern of alternating zeros and ones, but as soon as a sector has been written to, this is no longer the case for the last few bits. It gets even more interesting when visualizing these bits across several reads of the same disk:

These are unstable illegal bits on disk, bits that sometimes read back as zeros and sometimes as ones.

When writing a sector, the software in the disk drive waits for the correct sector header to pass by the read head, then waits until the exact end of the header gap area, and starts writing the new data. When switching to write mode, the magnetization written onto the media for the first ~15 µs (4-5 bits) will be analog values between logical 0 and 1. These are technically illegal values, and will be unstable when read.

These illegal bits are benign, because they do not encode anything. But it is important to note that two reads of a good disk may not be identical.


Now, let’s look at what common errors look like:

Logical Errors

Some read errors are caused by buggy software writing to disk: (Click for original size.)

This disk is missing some sector headers:

This is one of the faulty GEOS boot disks: Some sectors were written with the wrong speed zone setting, so they spilled into the next header, overwriting it. The sector contents are still there, but the missing header makes them unreadable with the original Commodore DOS. (The visualization tool will guess the sector number in case of a sector with a missing header.)


Here is an example of a dropout:

There is one sector that starts reading back as all 0 bits (black) at some point, and the next sector data is completely missing (white), which is because the SYNC mark of the next sector was not readable.

This can be caused by demagnetization, or more likely, by dirt on the disk’s surface. Sometimes, the dirt will scrape off if we just read the disk often enough. The following animation shows the result of subsequent reads of the same disk:

Here, you can see how the faulty sector on the third track in the picture goes from all-black (with the next sector missing), through unstable data reads, to finally stable and correct reads. You can also see how the previous two sectors are impacted by the same speck of dirt, just not as much.

(The fact that it is different sector numbers that are impacted on the different tracks is due to the fact that on 1541 disks, there is no sector alignment between tracks, i.e. whether sector 0 on one track touches sector 0..20 on the next track is practically random.)

If the dropout persists across retries, it might still be dirt, just of the more resilient kind. Just clean the disk and try again.

Weak Bit Data Errors

This is a disk with multiple checksum errors in the data section across several tracks. Cleaning the disk did not help, so these are weak bits:

Weak bits do not result in flipped bits, but in missing or duplicate bits. As you can see in the animation, sections of the data move back and forth between retries. It is clearer when looking at a single sector. This excerpt contains four weak bits:

The read logic of Commodore disk drives measures the time between zero-to-one and one-to-zero edges and then decides how many zeros or ones were on disk. Therefore, weak bits lead to duplicate or missing bits.
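Numerically, the effect can be sketched like this (the ~3.25 µs bit cell and the sample intervals are illustrative values, not measurements):

```python
def cells_between_edges(interval_us, bitcell_us=3.25):
    """The drive recovers data by counting how many bit cells fit between
    two flux edges: one '1' bit followed by N-1 '0' bits."""
    return round(interval_us / bitcell_us)

# A clean edge-to-edge gap of 3 bit cells reads back as '100':
print(cells_between_edges(9.75))   # -> 3

# A weak bit stretches the gap to ~3.5 cells; tiny noise now decides
# whether the drive sees '100' or '1000' - a missing or duplicate bit:
print(cells_between_edges(11.2))   # -> 3
print(cells_between_edges(11.6))   # -> 4
```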

A single weak bit can be recovered from by retrying: If it reads back correctly every now and then, it is as easy as retrying until the checksum is correct. But with multiple weak bits in a single sector, this gets exponentially less likely.


All that current tools can do with faulty disks is retry until the checksum is correct. By analyzing the weak bits, it should be possible to create tools for data recovery on a lower level. Multiple reads will reveal where the weak bits are, so a tool could try out different sequences around the weak bits and verify the checksum.
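As a sketch of what such a tool could do, assuming the weak bit positions have already been identified from multiple reads (the toy 4-byte sector and plain XOR checksum stand in for a real 256-byte sector):

```python
from itertools import product

def xor_checksum(data):
    """Sector data checksum: XOR of all data bytes, as on the 1541."""
    c = 0
    for b in data:
        c ^= b
    return c

def recover(data, checksum, weak_bits):
    """Try every 0/1 combination at the known weak bit positions
    (byte_index, bit_index) and keep candidates whose checksum matches."""
    candidates = []
    for combo in product([0, 1], repeat=len(weak_bits)):
        d = bytearray(data)
        for (i, bit), val in zip(weak_bits, combo):
            d[i] = (d[i] & ~(1 << bit)) | (val << bit)
        if xor_checksum(d) == checksum:
            candidates.append(bytes(d))
    return candidates

good = bytes([0x41, 0x42, 0x43, 0x44])
cs = xor_checksum(good)                  # checksum as written to disk
bad = bytes([0x41, 0x42, 0x47, 0x44])    # two weak bits read back wrong
print(recover(bad, cs, [(2, 0), (2, 2)]))  # -> [b'ABCD']
```

With a single XOR checksum, several candidates may survive for larger sectors, so in practice one would also cross-check the surviving candidates against the GCR constraints or against further reads.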

by Michael Steil at May 29, 2019 07:17 AM

May 23, 2019


Face to Face: Committer's Day

At the Face to Face meeting held on the occasion of the ICMC19 Conference in Vancouver, a novelty was introduced: For the last day of the meeting all committers were invited to participate, either personally or remotely via video conference.

Three of us committers (Paul, Nicola, Matthias) came to Vancouver for the conference, so we were able to participate in person, Bernd (DE) and Shane (AU) joined us remotely.

Paul Yang, Matthias St. Pierre, Tim Hudson, Matt Caswell, Richard Levitte, Nicola Tuveri (from left to right)

The group was completed by the OMC members Paul Dale (AU), Kurt Roeckx (BE), Mark Cox (UK).

While Paul Yang had already met the team members during their visit to China in 2017, for Nicola and me it was the first personal encounter. We were both curious and a little bit anxious to find out how we would be received by the long-time team members. But it turned out that our worries were unfounded: after passing Tim Hudson’s vegemite survival test (see below), we were both cordially accepted by the team.

Matt started the meeting with a detailed introduction and a status report about the architectural changes to the library and the ongoing FIPS development. After that, we embarked on a lively and fruitful discussion. The outcome of the meeting is a different story and will be reported elsewhere. For us committers, it was an interesting and instructive experience to see how these OMC F2F meetings took place and how actions were planned and decisions made.

But even more important than anything else was the fact that we were able to get to know each other and that we all had a great time together. In this respect the meeting was so successful that we decided to have online video conference meetings on a regular basis in order to improve our team interaction and collaboration.

May 23, 2019 05:15 PM

May 21, 2019

Michael Biven

The Law of Diminishing Clouds

The Law of Diminishing Clouds — when the number of computers outside of your control or visibility has a greater negative impact than the benefits of using the cloud. It can also refer to when the cost savings from the cloud are less than the internal effort and/or cost of figuring out what the cloud is doing.

May 21, 2019 07:42 AM

May 20, 2019


New Committers

Following on from our additions to the committers last year, the OpenSSL Management Committee has now added four new Committers.

The latest additions to the committers are:

What this means for the OpenSSL community is that there is now a larger group of active developers who have the ability to review and commit code to our source code repository. These new committers will help the existing committers with our review process (i.e., their reviews count towards approval) which we anticipate will help us keep on top of the github issues and pull request queues.

As always, we welcome comments on submissions and technical reviews of pull requests from anyone in the community.

Note: All submissions must be reviewed and approved by at least two committers, one of whom must also be an OMC member, as described in the project bylaws.

May 20, 2019 12:00 PM

May 05, 2019

LZone - Sysadmin

Nagios Check Plugin for nofile Limit

Following the recent post on how to investigate limit-related issues, which gave instructions on what to check if you suspect a system limit is being hit, I want to share this Nagios check covering the open file descriptor limit. Note that existing Nagios plugins like this only check the global limit, check just one application, or do not output all problems. So here is my solution, which does the following:
  1. Check the global file descriptor limit
  2. Check each process against its "nofile" hard limit using lsof
It has two simple parameters -w and -c to specify a percentage threshold. An example call:
./ -w 70 -c 85
could result in the following output indicating two problematic processes:
WARNING memcached (PID 2398) 75% of 1024 used
CRITICAL apache (PID 2392) 94% of 4096 used
Here is the check script doing this:

# MIT License
#
# Copyright (c) 2017 Lars Windolf
#
# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the "Software"), to deal
# in the Software without restriction, including without limitation the rights
# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
# copies of the Software, and to permit persons to whom the Software is
# furnished to do so, subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included in all
# copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
# SOFTWARE.

# Check "nofile" limit for all running processes using lsof

MIN_COUNT=0             # default "nofile" limit is usually 1024, so no checking for
                        # processes with much less open fds needed

WARN_THRESHOLD=80       # default warning: 80% of file limit used
CRITICAL_THRESHOLD=90   # default critical: 90% of file limit used

while getopts "hw:c:" option; do
    case $option in
        w) WARN_THRESHOLD=$OPTARG;;
        c) CRITICAL_THRESHOLD=$OPTARG;;
        h) echo "Syntax: $0 [-w <warning percentage>] [-c <critical percentage>]"; exit 1;;
    esac
done

results=$(
    # Check global limit
    global_max=$(cat /proc/sys/fs/file-nr 2>&1 | cut -f 3)
    global_cur=$(cat /proc/sys/fs/file-nr 2>&1 | cut -f 1)
    ratio=$(( $global_cur * 100 / $global_max ))

    if [ $ratio -ge $CRITICAL_THRESHOLD ]; then
        echo "CRITICAL global file usage $ratio% of $global_max used"
    elif [ $ratio -ge $WARN_THRESHOLD ]; then
        echo "WARNING global file usage $ratio% of $global_max used"
    fi

    # We use the following lsof options:
    #
    # -n    to avoid resolving network names
    # -b    to avoid kernel locks
    # -w    to avoid warnings caused by -b
    # +c15  to get somewhat longer process names
    #
    lsof -wbn +c15 2>/dev/null | awk '{print $1,$2}' | sort | uniq -c |\
    while read count name pid remainder; do
        # Never check anything below a sane minimum
        if [ $count -gt $MIN_COUNT ]; then
            # Extract the hard limit from /proc
            limit=$(cat /proc/$pid/limits 2>/dev/null | grep 'open files' | awk '{print $5}')

            # Check if we got something, if not the process must have terminated
            if [ "$limit" != "" ]; then
                ratio=$(( $count * 100 / $limit ))
                if [ $ratio -ge $CRITICAL_THRESHOLD ]; then
                    echo "CRITICAL $name (PID $pid) $ratio% of $limit used"
                elif [ $ratio -ge $WARN_THRESHOLD ]; then
                    echo "WARNING $name (PID $pid) $ratio% of $limit used"
                fi
            fi
        fi
    done
)

if echo $results | grep CRITICAL; then
    exit 2
fi
if echo $results | grep WARNING; then
    exit 1
fi

echo "All processes are fine."
Use the script with caution! At the moment it has no protection against a hanging lsof. So the script might mess up your system if it hangs for some reason. If you have ideas how to improve it please share them in the comments!
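One possible safeguard against a hanging lsof (an untested sketch; timeout(1) here is the GNU coreutils utility): run lsof under a hard time limit and report a Nagios UNKNOWN state if it had to be killed:

```shell
# Run lsof under a 10 second hard limit; timeout(1) exits with
# status 124 if the command had to be killed.
lsof_output=$(timeout 10 lsof -wbn +c15 2>/dev/null)
if [ $? -eq 124 ]; then
    echo "UNKNOWN lsof timed out"
    exit 3   # Nagios UNKNOWN
fi
# ...then feed "$lsof_output" into the awk/sort/uniq pipeline as before
```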

May 05, 2019 08:19 PM

April 13, 2019

Feeding the Cloud

Secure ssh-agent usage

ssh-agent was in the news recently due to the compromise. The main takeaways from that incident were to avoid the ForwardAgent (or -A) functionality when ProxyCommand will do, and to consider multi-factor authentication on the server side, for example using libpam-google-authenticator or libpam-yubico.

That said, there are also two options to ssh-add that can help reduce the risk of someone else with elevated privileges hijacking your agent to make use of your ssh credentials.

Prompt before each use of a key

The first option is -c which will require you to confirm each use of your ssh key by pressing Enter when a graphical prompt shows up.

Simply install an ssh-askpass frontend like ssh-askpass-gnome:

apt install ssh-askpass-gnome

and then use this option when adding your key to the agent:

ssh-add -c ~/.ssh/key

Automatically removing keys after a timeout

ssh-add -D will remove all identities (i.e. keys) from your ssh agent, but requires that you remember to run it manually once you're done.

That's where the second option comes in. Specifying -t when adding a key will automatically remove that key from the agent after a while.

For example, I have found that this setting works well at work:

ssh-add -t 10h ~/.ssh/key

where I don't want to have to type my ssh password every time I push a git branch.

At home on the other hand, my use of ssh is more sporadic and so I don't mind a shorter timeout:

ssh-add -t 4h ~/.ssh/key

Making these options the default

I couldn't find a configuration file to make these settings the default and so I ended up putting the following line in my ~/.bash_aliases:

alias ssh-add='ssh-add -c -t 4h'

so that I can continue to use ssh-add as normal and don't have to remember to include these extra options.

April 13, 2019 01:45 PM

April 09, 2019

Everything Sysadmin

April NYC DevOps Meetup: Building a tamper-evident CI/CD system

The April nycdevops Meetup is Thursday, April 18. Doors open at 6:30pm!

NOTE: The meetings are now on THURSDAY.

  • Title: How to build a tamper-evident CI/CD system
  • Speaker: Trishank Karthik Kuppusamy, Datadog, Inc

TALK DESCRIPTION: CI/CD is critical to any DevOps operation today, but when attackers compromise it, they get to distribute malicious software to millions of unsuspecting users. We present how Datadog used TUF and in-toto to develop, to the best of our knowledge, the industry's first end-to-end verified pipeline that automatically builds integrations for the Datadog agent. That is, even if this pipeline is compromised, users should not be able to install malware. We will show a demonstration of our pipeline in production being used to protect users of the Datadog agent, and describe how you can use TUF + in-toto to secure your own pipeline.

SPEAKER BIO: Trishank Karthik Kuppusamy is a security engineer at Datadog, Inc. Previously, he led the research and development of The Update Framework (TUF) and Uptane at the NYU Tandon School of Engineering. He is also a member of the IEEE-ISTO Uptane standardization alliance, and an Editor of in-toto Enhancements.

Space is limited. Please RSVP soon!

by Tom Limoncelli at April 09, 2019 05:10 PM

March 11, 2019

WordPress update hangs after “Unpacking the update”

If your WordPress installation update just stops after showing the message “Unpacking the update”, try increasing the memory limit of PHP. Unzipping the update takes quite a bit of memory. Newer versions of WordPress keep getting larger and larger, requiring more memory to unpack. So it can suddenly break, as it did for me.
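If you cannot raise the limit globally in php.ini, WordPress can also be given its own limits from wp-config.php; the values below are just generous guesses, adjust them to your hosting plan:

```php
/* In wp-config.php, above the "That's all, stop editing!" line.
   256M/512M are example values, not requirements. */
define( 'WP_MEMORY_LIMIT', '256M' );      // limit for front-end requests
define( 'WP_MAX_MEMORY_LIMIT', '512M' );  // limit for admin tasks such as updates
```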

You may want to check the actual limit PHP is using by creating a small “php info” PHP page in your webroot and opening that in your browser. For example:

<?php phpinfo();
Name it something like “phpinfo_52349602384.php”. The random name is so that if you forget to remove the file, automated vulnerability scanners won’t find it. Open that file in the browser and the memory limit should be mentioned somewhere under “memory_limit”.

by admin at March 11, 2019 02:27 PM

March 08, 2019

The Lone Sysadmin

Easy Dell PowerEdge Firmware Updates, 2019 Edition

I’ve become quite the minimalist in my environments, mostly because I’ve been doing a lot of compliance & security work. Speaking generally, most hardware management tools don’t & won’t pass any form of compliance audit and in that context are way more trouble than they’re worth (negative ROI, see my post “Free, Like a Puppy“). […]

The post Easy Dell PowerEdge Firmware Updates, 2019 Edition appeared first on The Lone Sysadmin. Head over to the source to read the full post!

by Bob Plankers at March 08, 2019 05:14 PM

March 03, 2019

A multi-platform API client in Go

In the last few months I have been working, together with a colleague, on an API client for several well-known systems and cloud providers. When we started, I was a novice in the Go programming language, I had no experience in programming API clients, and I trusted the makers of the APIs enough to have great expectations of them.

Today, a few months, several hours of programming and many lines of code later, I am a better novice Go programmer, I have some experience in interfacing with APIs, and my level of trust in the API makers is well beneath my feet.

This article will be a not-so-short summary of the reasons why we started this journey and all the unexpected bad surprises we got along the way. Unfortunately, I will be able to show only snippets of the code we have written because I didn’t get the authorisation to make it public. I will make the effort to ensure that the snippets are good enough to help you get a better understanding of the topics.

OK, enough preface. It’s time to tell the story.

The life of a cloud services administrator

We use several cloud services in Telenor Digital. We have G Suite. We have a Github Enterprise instance in house, but many repositories are still in where we have several organisations. We have Slack. And we have our share of Atlassian products with their user database managed in Atlassian Crowd. And then some more, but these are enough for this post.

With all these systems, each one with its own user database, keeping things in check is a real pain. It takes a systematic approach and discipline when creating accounts, and even more when off-boarding accounts, to ensure that no rogue access is kept by ex-employees. A mistake in on-boarding an account can easily be fixed along the way; a mistake in off-boarding can go unnoticed, compromise the company’s intellectual property and, ultimately, the company’s business, and must be taken seriously.

All these systems’ web interfaces are designed to make it easy to on-board and off-board single accounts (with the notable exception of G Suite, which allows you to add and change accounts in batches by uploading a CSV file). When the turnaround of accounts is large and you have several on-boardings and off-boardings every month, these web interfaces are no longer nice and practical, and you find yourself buried in an orgy of clicks, text fields and sub-windows. You can have the best manual procedures and all the discipline in the world, but a mistake is just a blink away.

The problem with inconsistency

Another problem with multiple account directories is inconsistency. Separate systems lead to duplicated information, and keeping duplicated information in sync is hard. So you may have an account for John Smith in G Suite initially created as, and registered in Crowd as jo with the same address, and under the display name of “Jo”; no “John” nor “Smith” in there. Later on, becomes a big company and that email address sounds terribly unprofessional: an alias is added in G Suite as, John is mandated to use the new address for any communication and his business cards, and the old “jo” is progressively forgotten.

John gets increasingly fed up by how the nice start-up he once worked for is looking more and more like a classic dinosaur corporate, and one fine day he leaves the company to become an entrepreneur. Bob, the service administrator, gets an off-boarding request for John. Bob joined the company at a late stage, when John Smith was for everyone and the old alias was long forgotten. He quickly finds John’s account in G Suite and disables it, but he finds no match in Crowd for either John’s full name or the address.

Bob doesn’t think that he could search in Crowd for the aliases John had in G Suite and after some more research he concludes that this Johnny guy didn’t have an account in Crowd after all. And here we are, with a rogue account that could be exploited by John Smith himself (if he wasn’t the nice guy he is), or worse: by an external attacker who found John’s password in a password database shared by an Elbonian hacker.

The API client project

My colleague and I were discussing these problems when our then-boss suggested that we could use our Fun Fridays to develop an API client to automate most of these operations. The idea was interesting and challenging at the same time, as neither of us had ever worked with APIs. But we decided to give it a try anyway.

We debated for some time about what services we wanted to interface with, which operations we wanted to automate, and which functionality we should implement first. Deciding which operations we wanted to automate was easy: on-boarding and off-boarding. Deciding what we wanted to tackle first and on which services was slightly more complicated and we took a systematic approach.

We made a list of the services we wanted to automate, and for each of the services we set the priority for the implementation of on-boarding and off-boarding. The scoring system was very simple, with just three values: A (must be done now), B (must be done one day), and C (it would be nice to have).

Once we had set the priorities for each service and each operation, we started discussing what prerequisites we must have in place to make the ‘A’s happen for either on-boarding or off-boarding. It turned out that, in order to implement the on-boarding, there was a lot of simplification required in the account directories if we wanted to keep the client complexity at a sensible level, while the prerequisites for doing the off-boarding part were not that many. So we decided to focus on the off-boarding.

With the priorities set and the focus on off-boarding, the choice of the services was easy. We would start with our ‘A’s, that is: Google and Crowd, and then we would focus on the ‘B’s, that is: and Slack. The rest would be done at a later stage, when the most urgent functionality is ready for A’s and B’s.

Given the services and the priorities, it was time to decide on what the actual functionality of the client should be. Of course, the client should suspend an account given a unique ID for it, but what else?

Here my precious co-worker Miha came up with a great idea: given an email address, we query Google; if an account matches that email address, we fetch all of the addresses associated with the account and suspend the user; we then use the list of email addresses we got from Google to search for accounts registered in any other system that match one of those addresses, and we disable those accounts, too.

And finally, what language should we use to write the client? Both Miha and I had a good Perl 5 background and we were also learning Go. In the end, it’s not about which language you master most, as much as how well that language supports the APIs you want to use. We would check Google API support for Perl and Go and then decide.

Finally, it was time to start coding.

Google home page

The home of Google’s Directory API has a “Getting Started” section with examples in many different languages. But no Perl 5. No recent unofficial support for the Google API was available for Perl 5 either. That was a deal breaker, and we decided to use Go.

Google API console


Following the instructions in the quick start guide, we met the Google API console. The console got some improvements while we worked on our project, but overall it remained an ugly, user-unfriendly interface, where the only safe way to find anything you are looking for in a reasonable time is to learn the path by heart: labels and menus won’t help much.

To work with the Google API in Go you need to install two packages: the Directory API Go client library and the OAuth2 package. The documentation of both is rather overwhelming for a novice Go programmer, and the underlying “philosophy” also needs some consideration. It took some time to understand which steps were required to get to the point where we could interact with the API. Here is a short summary:

  • you read a configuration stored in a JSON file using the ConfigFromJSON function, and you get a *oauth2.Config in return;
  • you use that Config to get an OAuth2 token (or to validate one you already have) that hopefully provides all the scopes (the “permissions”) you requested;
  • with the right Config and the right token, you can finally get a Client object, that is: something you can use to interact with the API; but you are not quite there yet;
  • you can use the Client to get a Service struct; the struct contains pointers to objects that give you access to all the services provided by the Google API;
  • in our case, since we’ll be working on the users directory, we would “extract” the Users service and use just that one.

We factored out in separate packages the parts that we knew we would reuse for all other services, and this whole process is compactly written in our code as:

// Initialise the Google Directory API interface
// credential information will go in the google.config.json, or anything set through the -gc option
// the authentication token goes in google.token.json, or anything set through the -gt option
googleClient := goauth.GetClient(goauth.GetConfig(googleCredentials, googleAdmin.AdminDirectoryUserScope), googleToken)

// with this client, we can get a Service struct, and a User service out of it
googleServices, err := googleAdmin.New(googleClient)
if err != nil {
    log.Fatalf("Unable to retrieve directory Client %v", err)
}

googleUserSvc := googleServices.Users

When it comes to extracting information, the Go package uses a “chaining” style of calls that is very handy to use once you wrap your head around it, but a bit hard to understand if all you have is the package documentation (and, in addition, you are a Go novice). An example is worth a thousand words:

    r, err := srv.Users.List().Customer("my_customer").MaxResults(10).OrderBy("email").Do()

Basically, any method call returns an object that you can do more calls on, which in turn returns another object and so on, until you call Do(): that will put together all the parameters you have specified in the chain, query the API and, God willing, return the desired information.
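The pattern itself is easy to imitate in plain Go. The following stripped-down sketch (ListCall, Customer, MaxResults and Do are modeled after the real library, but everything here is toy code) shows why each setter returns the receiver:

```go
package main

import "fmt"

// ListCall accumulates query parameters; every setter returns the
// receiver so calls can be chained, and nothing happens until Do().
type ListCall struct {
	params map[string]string
}

func (c *ListCall) Customer(id string) *ListCall {
	c.params["customer"] = id
	return c
}

func (c *ListCall) MaxResults(n int) *ListCall {
	c.params["maxResults"] = fmt.Sprint(n)
	return c
}

// Do is where the accumulated parameters would be turned into an
// actual HTTP request; here it just reports what would be sent.
func (c *ListCall) Do() string {
	return fmt.Sprintf("GET /users?customer=%s&maxResults=%s",
		c.params["customer"], c.params["maxResults"])
}

func main() {
	call := &ListCall{params: map[string]string{}}
	fmt.Println(call.Customer("my_customer").MaxResults(10).Do())
}
```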

These Go packages implement any feature we could wish for, including server-side filtering of results. Filtering is more important than it may seem at first glance. A G Suite account contains a lot of information, which results in fairly large data structures if one pulls the whole thing. There is no point in pulling out so much information you don’t need or won’t use: it will bloat memory usage and make your calls to the API slower. Let’s make an example.

Say the variable usersvc contains the Users service we got from the client: if addr was one of a user’s email addresses, you could fetch all the information by doing:

googleUser, err := usersvc.Get(addr).Do()

A user account contains, among many other things, a person’s first and last name, street address, phone number, other email addresses, the address of one’s boss… but you don’t need all that stuff if all you are trying to do is select a user for off-boarding: you just want to know if the account is there, if it’s active and, maybe, a few more details. Getting only the information you need is easy: you just add a call to Fields() in the chain before you call Do():

googleUser, err := usersvc.Get(addr).Fields("suspended", "emails").Do()

Much better, easy to read, simple, faster.

We played a lot with the code in quickstart.go until we got a reasonable understanding of what we were doing. We didn’t, however, have a clear understanding of OAuth2 yet. More on this later.

And, finally, the “suspend” functionality for G Suite accounts was ready and functional.

To summarise what we have seen so far: the worst part of the Google API is the console; the API itself is full of useful features and the Go libraries match the API well.

Atlassian Crowd

Atlassian Crowd REST APIs developer's page


With the expectations set by the Google API, there was quite some disappointment when we tried to do something sensible with Crowd. The API looked rather primitive: XML, not JSON, is the default output format; JSON is supported, but to get JSON output you must ask for it explicitly, pretty please; there is no real Quick Start document and the documentation is fragmented; and to search for information in Crowd, you must use the abstruse Crowd Query Language (CQL). To add some more, partial changes are not implemented: if, say, you have a user and you want to set it to inactive, you cannot simply make a request that changes that tiny boolean attribute: you have to fetch the whole set of information about the user, change the attribute and post the whole blob back. If you fail to do so, all values but the one you set (and a few fundamental others) will be nullified.
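This fetch-modify-post dance can be illustrated with a self-contained sketch; it uses only net/http and httptest, and the /user endpoint and JSON shape are invented stand-ins, not Crowd’s actual REST API:

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// user mimics the kind of full record the server expects back on update.
type user struct {
	Name   string `json:"name"`
	Email  string `json:"email"`
	Active bool   `json:"active"`
}

// run performs the fetch-modify-put dance against a fake server and
// returns what the server ended up storing.
func run() string {
	stored := user{Name: "jo", Email: "jo@example.com", Active: true}
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		switch r.Method {
		case http.MethodGet:
			json.NewEncoder(w).Encode(stored)
		case http.MethodPut:
			// The server replaces the WHOLE record: any field the
			// client leaves out is effectively nullified.
			stored = user{}
			json.NewDecoder(r.Body).Decode(&stored)
		}
	}))
	defer srv.Close()

	// 1. Fetch the full user object...
	var u user
	resp, _ := http.Get(srv.URL + "/user/jo")
	json.NewDecoder(resp.Body).Decode(&u)
	resp.Body.Close()

	// 2. ...flip the one attribute we care about...
	u.Active = false

	// 3. ...and PUT the whole blob back.
	body, _ := json.Marshal(u)
	req, _ := http.NewRequest(http.MethodPut, srv.URL+"/user/jo", bytes.NewReader(body))
	http.DefaultClient.Do(req)

	return fmt.Sprintf("%s active=%v email=%s", stored.Name, stored.Active, stored.Email)
}

func main() {
	fmt.Println(run())
}
```

Skipping step 1 and PUTting only `{"active": false}` would wipe the name and email in this model, which is exactly the failure mode described above.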

We couldn’t find a decent Go library for Crowd: those we found were either a bit “immature”, or they didn’t implement all the functionality we needed. In the end, we had to write something ourselves. We iterated several times, and failed a lot, before we built a library with enough functionality to allow us to deactivate users in Crowd. The library we wrote is not complete either, more parts will be added when required.

Armed with that library, we can create a “Server” object (which would actually be called a “Client” in all other API packages; we’ll have to fix this one day…)

// For this to work, you must have the crowd credentials in a JSON file from crowdCredentials (default:
// crowd.config.json in the directory where you launch this command from.)
crowdClient, err := crowd.NewServerFromFile(crowdCredentials)
if err != nil {
    log.Fatalf("Cannot instanciate a Crowd \"server\": %v", err)
}

… and then, for example:

    users, err := crowdClient.FetchUsersByEmail([]string{email})
    if err != nil {
        log.Printf("WARNING: while fetching %s from Crowd: %v\n", email, err)
    }

    for _, user := range users {
        if user.Active {
            user, err := crowdClient.DeactivateUser(user.Name)
            if err != nil {
                log.Printf("WARNING: Cannot deactivate Crowd user %s: %v\n", user.Name, err)
            } else {
                fmt.Printf("Crowd user %s deactivated\n", user.Name)
            }
        }
    }
With that functionality implemented, we could finally put the pieces together and use the information we get from Google. In fact, we started implementing the main features in our list:

  • suspend a Google account given an email address
  • deactivate a Crowd user given the ID
  • deactivate a Crowd user given the email address
  • get the list of email addresses associated to a Google account; suspend the Google account; suspend any Crowd account that matches one of the email addresses we got from Google

The first version of the client was not something we were really proud of; it even had authentication credentials in the code (not that we ever committed credentials to our repository, but having pieces of configuration in the code itself is a horrible thing to do no matter what). But we got it to work, we factored out some code into separate libraries for reuse, and we even wrote the documentation for the code. It felt good, and gave us the boost to kick those credentials out of the code and into JSON files 🙂

As you will see further on, for all services we have tried to match the same “scheme” used in the Google library to create a client object. However, Crowd was different enough from G Suite that keeping the same approach didn’t make much sense. In fact, you don’t have OAuth2 in Crowd; you don’t use tokens, only basic HTTP authentication. Instead of forcing ourselves into a GetClient(GetConfig) dance, we made the code much simpler:

// now that we have a functional user service, we are ready to query information from Google;
// let's get ready to query Crowd, as well.
// For this to work, you must have the crowd credentials in a JSON file from crowdCredentials (default:
// crowd.config.json in the directory where you launch this command from.)
crowdClient, err := crowd.NewServerFromFile(crowdCredentials)
if err != nil {
    log.Fatalf("Cannot instanciate a Crowd \"server\": %v", err)
}

Note that the call to create a client object is NewServerFromFile. In fact, our Crowd library represents the Crowd directory server, in a way, and we extract information from that directory by means of method calls. Still, it feels weird, an anomaly compared to the rest, and it may be that in the future we will introduce a backward-incompatible change and name the functions more consistently. But not now.

To summarise: we wrote a lot of code to be able to interact with Crowd, and being novices we had to rewrite it several times. When the code was functional and complete enough, it had an inconsistent interface. We’ll redo it once again, eventually.


It happens sometimes that, when you thought you could not be more disappointed, there comes another bad surprise, worse than any of the previous ones. Well, to us Github was a bag of unpleasant surprises. But let’s go in order.


Our company has several organisations in Github enterprise, and our team is in charge of managing the membership in two of them. People often join the organisations with their pre-existing personal github account, and that’s an important fact. A github account is in no way “yours”: your company doesn’t own it. It’s not like G Suite, where your company has an instance of the services and only sees the accounts managed there. Github is a flat space, and when you do a search for accounts you are actually searching the whole of github. People may decide to keep some information private, and the fact that they are in one of your organisations doesn’t change anything in that respect: you cannot see or search what is not public, neither the email address nor the real name. This was one of the bad surprises that we found, but not the first in chronological order.

No OAuth “token page”

When you are working with the Google API and your API client starts the OAuth flow for authorisation, you are first sent to a Google page where you must confirm that the application is authorised to do what it’s requesting. When you complete the authorisation, you are redirected to another Google page that shows you the token, which you can then use in your client configuration. For example, the client code shown in the quick start guide will show you the link to the authorisation page and wait for you to input the token; once you have followed the link, authorised the client and visualised the token, you must copy it and paste it back to the client: the client will start using it and also save it for later use.

Well, surprise: Github of course has the first page but doesn’t provide a second one; it expects you to provide a link to the second page (the “Redirection URL”). Of course we didn’t have one. In addition, we weren’t really sure how OAuth was supposed to work, so we stopped for a while and took an online training on OAuth. That didn’t make us super-experts in the field, but it gave us a much better idea of what we were looking at and how it was supposed to work. Summarising the whole training in a few lines is impossible, but if you set out to work with APIs where OAuth is involved I kindly recommend that you get acquainted with the basics at the very least.

Let’s go back to the problem with the redirection URL, that is: the URL you are redirected to after the authorisation phase. In the beginning, we thought of having an HTTP server somewhere in Amazon just to take those calls and show the token, but that seemed a bit overkill: it added a dependency, and would put the burden on us of keeping the server secure. Then we thought we could use a lambda function instead, and we looked into those for a while. Until we had an idea: the client could spawn a goroutine running an HTTP listener on the local host, which would then be contacted in the redirection phase; since the token is added as a query string to the redirection URL, the listener can decode the parameter and pass it back to the client, eliminating the intermediate manual step of copy-pasting the token into the script. Miha worked on it and we made it work in the end. That also turned out useful for Slack, as it shares the same s*ttiness as github.

API v3 versus v4

If you head to the GitHub developers site, you’ll immediately notice that they push you very gently towards the GitHub API v4. This is the next generation of the GitHub API and it’s fundamentally different from v3. For one, queries are built using GraphQL, a query language expressed in JSON notation. If you want to use v4 you first have to grasp the query language itself; then, if you want to make queries through the API in Go, you must get into all the complications of mapping GraphQL’s JSON data onto Go structs in a precise way. When you are through with all these ceremonies and boilerplate, you can finally query the data you were longing for.

According to GitHub, GraphQL gives you “The ability to define precisely the data you want—and only the data you want”. I am afraid it’s not the case. In fact, we found ourselves in a situation where we were not really getting what we wanted out of our calls to the APIs and we wrote to GitHub support. Our support request included, among others, this question:

Is there a way to make a single call with the Github API v4 that will return a list of users whose login matches a given string exactly and are in certain organizations?

and after a sizeable while they replied:

There isn’t a way to limit the search for users in certain organizations. However, it’s possible to use the GraphQL API to search for users.

Well, if there wasn’t a way to search for users through the API then they could well fold the whole company and grow vegetables instead, right? Anyway, v4 was clearly not covering our use cases. Besides, it turned out that the two APIs were not feature-equivalent.

But the funniest thing of all: the way the developers web site, and its home page in particular, is structured made us think that v4 is a stable API (and if it isn’t, maybe it should be called “preview”, or “beta”, or something along those lines, shouldn’t it?). But no, it’s not. If you look at the changelog, the specs are changing all the time, and often in a backward-incompatible way.

So, guess what? After having invested quite some effort in using v4, we switched to v3. v3 is a bit primitive and inefficient. For example, if you want the details for each of the users in a certain organisation, it’s not enough to fetch the user list for the organisation, because the user objects returned will only have the login attribute set: you must then iterate over the user objects and fetch the details for each, one by one. OK for small organisations, but not so much for bigger ones, since the amount of queries that you can run on the API is limited.

Calling github with v3

Once we settled on v3, the rest of the plan unrolled as gracefully as possible. We created an OAuth app in github with all the necessary scopes and access, and then we started to code. We tried to be consistent and created GetClient and GetConfig calls that would match the ones we made for the other services as much as possible. So in the code we have:

// Get the Github API client config (to be used immediately after in a couple of places, otherwise we could just
// chain GetClient/GetConfig like we did elsewhere)
githubClientConfig := ghoauth.GetConfig(githubCredentials, "admin:org read:user user:email")

// Initialise the Github API interface
githubClient := ghoauth.GetClient(githubClientConfig, githubToken)

What was different from Google’s case however is that our github client configuration is now a struct of structs, so that we can have access to both the information we got from the configuration file and the OAuth configuration object that we need to create a client:

  // Type Config is a representation of the whole configuration of a Github API client. It holds both the OAuth2 part (the OAuth2
  // field, which maps to a *oauth2.Config), and the client specific information (the Client field, which holds a
  // *ghoauth.ClientInfo). This type is used in GetConfig and GetClient, which makes it pretty much fundamental for anything in
  // this package.
  type Config struct {
      Client *ClientInfo
      OAuth2 *oauth2.Config
  }

So, when we need the organisations service we can do something similar to what we do with a Google API client:

// get to the OrganizationsService in the client
githubOrgSvc := githubClient.Organizations

and if we need to have the list of the names of organisations we manage, we can still access that (without parsing the configuration file a second time):

// organizations in Github that we manage
githubOrgs := githubClientConfig.Client.Organizations

Now we can happily use these two pieces of information to do, for example, this:

// Remove any given github accounts from our organizations
for _, githubID := range suspendFromGithub {
    for _, org := range githubOrgs {
        err := RemoveGithubUserIdFromOrg(githubOrgSvc, githubID, org, apply)
        errorutil.LogWarning("%v", err)
    }
}

In summary: we had to invest a lot of time before we could actually write the code to remove users from github organizations; the prerequisites included learning about OAuth and finding out about the shortcomings of using the API v4 instead of v3.


Of these four APIs that we worked with, the Slack API was the most disappointing. Proper user management functionality in the API seems to be available only through SCIM, and only for the most expensive plans.

With a more “normal” paid plan like the one we have, the user management functionality is ridiculously narrow. In particular, there is no API endpoint to create a new user or to reactivate a suspended user. There is no official API endpoint to suspend a user either, but there is an unofficial and undocumented one. We needed this functionality dearly, so we started to search for Go packages that implemented the Slack API, to see what functionality they could offer.

The most widely adopted Go implementations of the Slack API appear to be nlopes’ and lestrrat-go’s. The lestrrat package is reportedly based on nlopes’; it is nicely done and quite similar to Google’s in the structure of the methods but, alas, it doesn’t implement setInactive. I didn’t fancy the idea of forking the software and then having to maintain a fork, so I took a look at the other package.

The nlopes package implements the setInactive call. Other than that, it’s well behind the lestrrat package: the methods are not as nicely structured as its competitor’s and the documentation is ridiculous. I gave it a try for some time, and then decided I couldn’t take it any more. I went back to the lestrrat package and got a hint from the documentation about how I could implement an additional endpoint for setInactive. Then I forked the repo and started hammering at it until I got it working; in fact, it was not quite as simple as the documentation hinted, but still possible.

Once I got my fork of the package working, we could finally get to implementing account suspension in our client. And that wasn’t a clean path either because… Slack is stupid.

Web interface to assign Scopes to a Slack API application


The first thing that we needed was to create an application for our client on the Slack side and assign it the proper scopes. Slack has a nice web interface to do that, so far so good. Now, in order to manipulate certain attributes of a user, you need a certain set of scopes, namely: users:read, users.profile:read, users.profile:write. However, in order to call the setInactive endpoint, you need the client scope. Funny fact: you just can’t select client together with the other scopes.

But, as I just said, Slack is stupid. So if you first request some scopes for an application from a client, and then make a request for a different scope from the same client, you’ll be asked to confirm that it’s OK and the new scope will be added to the existing ones. And there comes the ugly code:

// The Slack API is rather stupid, and so is the implementation of OAuth2, so we have to work around it.
// To make the lookups work, we need the following scopes:
// users:read users.profile:read users.profile:write
// To make setInactive work, we need the client scope
// These two sets of scopes cannot be requested together, so we have to request them separately and
// authorise the client twice, so that the associated user gets all the scopes that he needs.
// This is crazy, but... that's how it is.
slackClient := slackoauth.GetClient(slackoauth.GetConfig(slackCredentials, `users:read users.profile:read users.profile:write`), slackToken)
slackClient = slackoauth.GetClient(slackoauth.GetConfig(slackCredentials, `client`), slackToken)

Notice how we create slackClient the first time only to authorise the first set of scopes, and then we just overwrite it one line later in order to authorise the second set of scopes. Isn’t that great?

You may have also noticed that we implemented GetConfig and GetClient here, too, and that their arguments are as consistent as they can be with their counterparts in Google and github.

Once we have the clients, we can extract the handle for the Users service (used for lookups) and the handle for the UsersAdmin service (used to set accounts to inactive):

// get to the Slack users and usersadmin services
slackUserSvc := slackClient.Users()
slackUsersAdminSvc := slackClient.UsersAdmin()

And then with those handles you can do things like, e.g.:

// Deactivate Slack accounts given by user ID
for _, slackId := range suspendFromSlack {
    // Try to fetch a user with the requested ID, see if we get an error
    user, err := slackUserSvc.Info(slackId).Do(slackCtx)

    if err != nil {
        // notify that a user with the given ID doesn't exist, and iterate
        fmt.Printf("Cannot look up a Slack user with ID %s: %v\n", slackId, err)
        continue
    }

    err = DeactivateSlackUser(slackUsersAdminSvc, user, slackCtx, apply)
    errorutil.LogWarning("%v", err)
}

This is less simple than it looks and deserves some explanation. When you think of a Slack ID you may think of the username, a handle like, e.g., @alice or @bob. Well, it’s not: that handle has no value to Slack. The user ID, however, is a string that you have to dig for, and I am pretty sure that none of the readers know theirs: it’s in your profile, in the ‘…’ menu, and it’s something like UXXXXXXXX (an 8-character ASCII alphanumeric string prefixed with a “U”). All API calls that can manipulate users require the user ID, and since you rarely know it you have to resort to some other information that you can map back to the ID. So, rather than using code like the one above, you’ll more often do:

// Deactivate any Slack account that matches the given email addresses
for _, slackEmail := range suspendEmailFromSlack {
    err = DeactivateSlackEmail(slackUserSvc, slackUsersAdminSvc, slackEmail, slackCtx, apply)
    errorutil.LogWarning("%v", err)
}

and in that function you will first look up the user by email:

user, err := usersvc.LookupByEmail(email).Do(ctx)

If you get a user object, it will have the user ID in the ID attribute, so you can call the DeactivateSlackUser we mentioned earlier. It would really take some more code to show the rest; sorry I can’t show more.

In summary: Slack’s “standard” APIs suck when it comes to user management; the web interface sucks when it comes to defining client authorisations; it has the same problem as github when it comes to the redirection URL for OAuth; the lestrrat-go slack library was our choice: although incomplete, it was still preferable to the messy and badly documented nlopes implementation.

Inventory of an experience

At the end of this long ride we got:

  • a nice API client to suspend users from Google G Suite, Crowd, and Slack, and to remove them from organizations in Github;
  • a set of Go packages to manage OAuth2 authorization to G Suite, Slack and Github;
  • an embryonic Go package for the Crowd API, which we’ll grow over time;
  • some additional tooling, reused in the packages mentioned above, for managing OAuth tokens, errors and command-line flags;
  • more knowledge about OAuth;
  • more Go skills;
  • less trust in API providers in general;
  • an experience to tell.

I hope you enjoyed this. We did… to a point 🙂

by bronto at March 03, 2019 07:05 PM

February 16, 2019

Colin Percival

FreeBSD ZFS AMIs Now Available

Earlier today I sent an email to the freebsd-cloud mailing list:
Hi EC2 users,

FreeBSD 12.0-RELEASE is now available as AMIs with ZFS root disks in all 16
publicly available EC2 regions:

[List elided; see the GPG-signed email for AMI IDs]

The ZFS configuration (zpool named "zroot", mount point on /, /tmp,
/usr/{home,ports,src}, /var/{audit,crash,log,mail,tmp}) should match what
you get by installing FreeBSD 12.0-RELEASE from the published install media
and selecting the defaults for a ZFS install.  Other system configuration
matches the FreeBSD 12.0-RELEASE AMIs published by the release engineering
team.

I had to make one substantive change to 12.0-RELEASE, namely merging r343918
(which teaches /etc/rc.d/growfs how to grow ZFS disks, matching the behaviour
of the UFS AMIs in supporting larger-than-default root disks); I've MFCed
this to stable/12 so it will be present in 12.1 and later releases.

If you find these AMIs useful, please let me know, and consider donating to
support my work on FreeBSD/EC2.  If there's enough interest I'll work with
the release engineering team to add ZFS AMIs to what they publish.

February 16, 2019 09:50 PM

February 15, 2019

The Lone Sysadmin

End Anonymous Conference Feedback

There’s a lot of talk lately about the terrible, horrible, sexist, racist, misogynist, and generally unconstructive feedback that presenters get at conferences: This is on top of feedback where an attendee gives a presenter one star because the food wasn’t what they wanted, or they sat next to someone smelly, the room was cold, an […]

The post End Anonymous Conference Feedback appeared first on The Lone Sysadmin. Head over to the source to read the full post!

by Bob Plankers at February 15, 2019 05:08 PM

February 13, 2019


OpenSSL 3.0 and FIPS Update

As mentioned in a previous blog post, OpenSSL team members met with various representatives of the FIPS sponsor organisations back in September last year to discuss design and planning for the new FIPS module development project.

Since then there has been much design work taking place and we are now able to publish the draft design documentation. You can read about how we see the longer term architecture of OpenSSL changing in the future here and you can read about our specific plans for OpenSSL 3.0 (our next release which will include a FIPS validated module) here.

OpenSSL 3.0 is a major release and will be a significant change to the internal architecture of OpenSSL. We plan to keep impacts on existing end user applications to an absolute minimum with the intention that the vast majority of existing well-behaved applications will just need to be recompiled. No deprecated APIs will be removed in this release.

The biggest change will be the introduction of a new concept known as Providers. These can be seen as a replacement for the existing ENGINE interface and will enable much more flexibility for implementors. libcrypto gives applications access to a set of cryptographic algorithms, while different Providers may have different implementations of those algorithms.

Out-of-the-box OpenSSL will come with a set of Providers available. For example the “default” Provider will implement all of the most commonly used algorithms available in OpenSSL today. There will be a “legacy” Provider that implements legacy cryptographic algorithms and a FIPS Provider that implements FIPS validated algorithms. Existing engines will still work (after a recompile) and will be made available via both the old ENGINE APIs as well as a Provider compatibility layer.

The new design incorporates the FIPS module into main line OpenSSL. It will no longer be a separate download and support periods will also be aligned. It will of course be possible to build OpenSSL with or without the FIPS module depending on your own individual circumstances and requirements.

The FIPS module version number will be aligned with the main OpenSSL version number. OpenSSL 3.0.0 will incorporate the 3.0.0 FIPS module. Not every release of OpenSSL will necessarily lead to an update in the FIPS module version number so there may be “gaps”. For example OpenSSL 3.0.1 might still provide and work with the 3.0.0 module.

New APIs will be introduced to give applications greater flexibility in the selection of algorithm implementations. Of course support will be maintained for existing APIs so applications don’t need to use the new APIs if they don’t want to. For example applications will be able to set different algorithm selection criteria for different SSL_CTXs. This might be used to enforce selection of FIPS validated algorithms for one SSL_CTX, while allowing another SSL_CTX to use default implementations.

There is much still to be done to make this new OpenSSL design a reality. However with the publication of these design documents we are encouraged to see that pull requests are already starting to come in to make the necessary changes to the code. We expect the coming months to be very active amongst the development community as we head towards alpha and beta releases later on this year.

February 13, 2019 10:30 AM

February 11, 2019

Cryptography Engineering

Attack of the week: searchable encryption and the ever-expanding leakage function

A few days ago I had the pleasure of hosting Kenny Paterson, who braved snow and historic cold (by Baltimore standards) to come talk to us about encrypted databases.

Kenny’s newest result is with first authors Paul Grubbs, Marie-Sarah Lacharité and Brice Minaud (let’s call it GLMP). It isn’t so much about building encrypted databases, as it is about the risks of building them badly. And — for reasons I will get into shortly — there have been a lot of badly-constructed encrypted database schemes going around. What GLMP point out is that this weakness isn’t so much a knock against the authors of those schemes, but rather, an indication that they may just be trying to do the impossible.

Hopefully this is a good enough start to get you drawn in. Which is excellent, because I’m going to need to give you a lot of background.

What’s an “encrypted” database, and why are they a problem?

Databases (both relational and otherwise) are a pretty important part of the computing experience. Modern systems make vast use of databases and their accompanying query technology in order to power just about every software application we depend on.

Because these databases often contain sensitive information, there has been a strong push to secure that data. A key goal is to encrypt the contents of the database, so that a malicious database operator (or a hacker) can’t get access to it if they compromise a single machine. If we lived in a world where security was all that mattered, the encryption part would be pretty easy: database records are, after all, just blobs of data — and we know how to encrypt those. So we could generate a cryptographic key on our local machine, encrypt the data before we upload it to a vulnerable database server, and just keep that key locally on our client computer.

Voila: we’re safe against a database hack!

The problem with this approach is that encrypting the database records leaves us with a database full of opaque, unreadable encrypted junk. Since we have the decryption key on our client, we can decrypt and read those records after we’ve downloaded them. But this approach completely disables one of the most useful features of modern databases: the ability for the database server itself to search (or query) the database for specific records, so that the client doesn’t have to.

Unfortunately, standard encryption borks search capability pretty badly. If I want to search a database for, say, employees whose salary is between $50,000 and $100,000, my database is helpless: all it sees is row after row of encrypted gibberish. In the worst case, the client will have to download all of the data rows and search them itself — yuck.

This has led to much wailing and gnashing of teeth in the database community. As a result, many cryptographers (and a distressing number of non-cryptographers) have tried to fix the problem with “fancier” crypto. This has not gone very well.

It would take me a hundred years to detail all of the various solutions that have been put forward. But let me just hit a few of the high points:

  • Some proposals have suggested using deterministic encryption to encrypt database records. Deterministic encryption ensures that a given plaintext will always encrypt to a single ciphertext value, at least for a given key. This enables exact-match queries: a client can simply encrypt the exact value (“John Smith”) that it’s searching for, and ask the database to identify encrypted rows that match it.
  • Of course, exact-match queries don’t support more powerful features. Most databases also need to support range queries. One approach to this is something called order revealing encryption (or its weaker sibling, order preserving encryption). These do exactly what they say they do: they allow the database to compare two encrypted records to determine which plaintext is greater than the other.
  • Some people have proposed to use trusted hardware to solve these problems in a “simpler” way, but as we like to say in cryptography: if we actually had trusted hardware, nobody would pay our salaries. And, speaking more seriously, even hardware might not stop the leakage-based attacks discussed below.

This summary barely scratches the surface of this problem, and frankly you don’t need to know all the details for the purpose of this blog post.

What you do need to know is that each of the above proposals entails has some degree of “leakage”. Namely, if I’m an attacker who is able to compromise the database, both to see its contents and to see how it responds when you (a legitimate user) makes a query, then I can learn something about the data being queried.

What are some examples of leakage, and what’s a leakage function?

Leakage is a (nearly) unavoidable byproduct of an encrypted database that supports queries. It can happen when the attacker simply looks at the encrypted data, as she might if she was able to dump the contents of your database and post them on the dark web. But a more powerful type of leakage occurs when the attacker is able to compromise your database server and observe the query interaction between legitimate client(s) and your database.

Take deterministic encryption, for instance.

Deterministic encryption has the very useful, but also unpleasant, feature that the same plaintext will always encrypt to the same ciphertext. This leads to very obvious types of leakage, in the sense that an attacker can see repeated records in the dataset itself. Extending this to the active setting, if a legitimate client queries on a specific encrypted value, the attacker can see exactly which records match that encrypted value. She can see how often each value occurs, which gives an indication of what value it might be (e.g., the last name “Smith” is more common than “Azriel”.) All of these vectors leak valuable information to an attacker.

Other systems leak more. Order-preserving encryption leaks the exact order of a list of underlying records, because it causes the resulting ciphertexts to have the same order. This is great for searching and sorting, but unfortunately it leaks tons of useful information to an attacker. Indeed, researchers have shown that, in real datasets, an ordering can be combined with knowledge about the record distribution in order to (approximately) reconstruct the contents of an encrypted database.

Fancier order-revealing encryption schemes aren’t quite so careless with your confidentiality: they enable the legitimate client to perform range queries, but without leaking the full ordering so trivially. This approach can leak less information: but a persistent attacker will still learn some data from observing a query and its response — at a minimum, she will learn which rows constitute the response to a query, since the database must pack up the matching records and send them over to the client.

If you’re having trouble visualizing what this last type of leakage might look like, here’s a picture that shows what an attacker might see when a user queries an unencrypted database vs. what the attacker might see with a really “good” encrypted database that supports range queries:


So the TL;DR here is that many encrypted database schemes have some sort of “leakage”, and this leakage can potentially reveal information about (a) what a client is querying on, and (b) what data is in the actual database.

But surely cryptographers don’t build leaky schemes?

Sometimes the perfect is the enemy of the good.

Cryptographers could spend a million years stressing themselves to death over the practical impact of different types of leakage. They could also try to do things perfectly using expensive techniques like fully-homomorphic encryption and oblivious RAM, but the results would be highly inefficient. So a common view in the field is that researchers should do the very best they can, and then carefully explain to users what the risks are.

For example, a real database system might provide the following guarantee:

“Records are opaque. If the user queries for all records BETWEEN some hidden values X AND Y then all the database will learn is the row numbers of the records that match this range, and nothing else.”

This is a pretty awesome guarantee, particularly if you can formalize it and prove that a scheme achieves it. And indeed, this is something that researchers have tried to do. The formalized description is typically achieved by defining something called a leakage function. It might not be possible to prove that a scheme is absolutely private, but we can prove that it only leaks as much as the leakage function allows.

Now, I may be overdoing this slightly, but I want to be very clear about this next part:

Proving your encrypted database protocol is secure with respect to a specific leakage function does not mean it is safe to use in practice. What it means is that you are punting that question to the application developer, who is presumed to know how this leakage will affect their dataset and their security needs. Your leakage function and proof simply tell the app developer what information your scheme is (provably) going to protect, and what it won’t.

The obvious problem with this approach is that application developers probably don’t have any idea what’s safe to use either. Helping them to figure this out is one goal of this new GLMP paper and its related work.

So what leaks from these schemes?

GLMP don’t look at a specific encryption scheme. Rather, they ask a more general question: let’s imagine that we can only see that a legitimate user has made a range query — but not what the actual queried range values are. Further, let’s assume we can also see which records the database returns for that query, but not their actual values.

How much does just this information tell us about the contents of the database?

You can see that this is a very limited amount of leakage. Indeed, it is possibly the least amount of leakage you could imagine for any system that supports range queries and is also efficient. So in one sense, you could say the authors are asking a different and much more important question: are any of these encrypted databases actually secure?

The answer is somewhat worrying.

Can you give me a simple, illuminating example?

Let’s say I’m an attacker who has compromised a database, and observes the following two range queries/results from a legitimate client:

Query 1: SELECT * FROM Salaries BETWEEN ⚙ and 🕹    Result 1: (rows 1, 3, 5)
Query 2: SELECT * FROM Salaries BETWEEN 😨 and 🎱    Result 2: (rows 1, 43, 3, 5)

Here I’m using the emoji to illustrate that an attacker can’t see the actual values submitted within the range queries — those are protected by the scheme — nor can she see the actual values of the result rows, since the fancy encryption scheme hides all this stuff. All the attacker sees is that a range query came in, and some specific rows were scooped up off disk after running the fancy search protocol.

So what can the attacker learn from the above queries? Surprisingly: quite a bit.

At very minimum, the attacker learns that Query 2 returned all of the same records as Query 1, so the range of the latter query clearly overlaps the range of the former. There is one additional record (row 43) that is not within the range of Query 1, which tells us that row 43 must hold either the “next” greater or smaller value relative to the records (1, 3, 5). That’s useful information.
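This inference can be sketched as simple set arithmetic over the row identifiers, which are the only thing the attacker sees in this setting (the IDs below are the ones from the example above):

```python
# Attacker's view: just the row IDs returned by each query, never the
# range endpoints or the row values themselves.
result_1 = {1, 3, 5}
result_2 = {1, 3, 5, 43}

# Query 2's results contain all of Query 1's, so the two query ranges
# must overlap.
assert result_1 <= result_2

# The single extra record must hold a value immediately adjacent
# (just above or just below) to the values held by rows 1, 3 and 5.
extra = result_2 - result_1
print(sorted(extra))  # [43]
```

Each observation like this yields only a small ordering constraint, but the constraints accumulate as more queries are observed.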

Get enough useful information and, it turns out, it starts to add up. In 2016, Kellaris, Kollios, Nissim and O’Neill showed that if you know the distribution of the query range endpoints — for example, if you assume that they are uniformly random — then you can get more than just the order of records. You can reconstruct the exact value of every record in the database.

This result is statistical in nature. If I know that the queries are uniformly random, then I can model how often a given value (say, Age=34 out of a range 1-100) should be responsive to a given random query. By counting the actual occurrences of a specific row across many such queries, I can guess which rows correspond to which record values. The more queries I see, the more certain I can be. The Kellaris et al. paper shows that this takes O(N^4 log N) queries, where N is the number of possible values your data can take on (e.g., the ages of your employees, ranging between 1 and 100, would give N=100). This assumes an arbitrary dataset; the results get much better if the database is “dense”, meaning that every possible value occurs at least once.

In practice the Kellaris et al. results mean that database fields with small domains (like ages) could be quickly reconstructed after observing a reasonable number of queries from a legitimate user, albeit one who likes to query everything randomly.
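As a toy illustration of this style of frequency attack — not the actual Kellaris et al. algorithm, and with made-up record values on a deliberately tiny domain — one can simulate uniformly random range queries and match each record's observed response rate against the rate predicted for each candidate value:

```python
import random

random.seed(2019)
N = 20                                        # tiny domain: values 1..N
secret = {"row1": 4, "row2": 6, "row3": 13}   # values the attacker wants

# Probability that value v falls inside a range [a, b] drawn uniformly
# from all N*(N+1)/2 ranges: there are v * (N - v + 1) ranges with
# a <= v <= b.
def hit_prob(v, n=N):
    return v * (n - v + 1) / (n * (n + 1) / 2)

# Observe many random range queries, counting how often each row responds.
ranges = [(a, b) for a in range(1, N + 1) for b in range(a, N + 1)]
queries = 50_000
hits = dict.fromkeys(secret, 0)
for _ in range(queries):
    a, b = random.choice(ranges)
    for row, v in secret.items():
        if a <= v <= b:
            hits[row] += 1

# Guess each value by matching observed rate to predicted rate. Since
# hit_prob(v) == hit_prob(N + 1 - v), every guess comes out as a
# reflection-ambiguous pair of candidates.
for row, v in secret.items():
    rate = hits[row] / queries
    g = min(range(1, N + 1), key=lambda c: abs(hit_prob(c) - rate))
    print(row, sorted({g, N + 1 - g}))  # true value should land in this pair
```

With 50,000 observed queries on this 20-value domain the guesses converge; on realistic domains the exact-recovery version needs the O(N^4 log N) queries quoted above, which is what the approximate attacks below improve on.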

So that’s really bad!

The main bright spot in this research — at least up until recently — was that many types of data have much larger domains. If you’re dealing with salary data ranging from, say, $1 to $200,000, then N=200,000 and the dominant N^4 term tends to make Kellaris et al. attacks impractical, simply because they’ll take too long. Similarly, data like employee last names (encoded in a form that can be sorted and range-queried) gives you even vaster domains like N=26^{12}, say, and so perhaps we could pleasantly ignore these results and spend our time on more amusing engagements.

I bet we can’t ignore these results, can we?

Indeed, it seems that we can’t. The reason we can’t sit on our laurels and hope that attackers recovering large-domain data sets will die of old age first is something called approximate database reconstruction, or \epsilon-ADR.

The setting here is the same: an attacker sits and watches a legitimate user make (uniformly random) range queries. The critical difference is that this attacker isn’t trying to get every database record back at its exact value: she’s willing to tolerate some degree of error, up to an additive \epsilon N. For example, if I’m trying to recover employee salaries, I don’t need them to be exact: getting them within 1% or 5% is probably good enough for my purposes. Similarly, reconstructing nearly all of the letters in your last name probably lets me guess the rest, especially if I know the distribution of common last names.

Which finally brings us to this new GLMP paper, which puts \epsilon-ADR on steroids. What it shows is that in the same setting, if one is willing to “sacrifice” a few of the highest and lowest values in the database, an attacker can reconstruct nearly the full database in a much smaller (asymptotic) number of queries, specifically O(\epsilon^{-2} log \epsilon^{-1}) queries, where \epsilon is the error parameter.

The important thing to notice about these results is that the value N has dropped out of the equation. The only term that’s left is the error term \epsilon. That means these results are “scale-free”: asymptotically, at least, they work just as well for small values of N as for large ones, and for large databases as for small ones. This is really remarkable.
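Plugging a few error targets into that bound gives a feel for what “scale-free” means. The hidden constants make the absolute numbers below meaningless; only the growth pattern, and the total absence of N, matter:

```python
import math

# Relative query cost from the O(eps^-2 * log(1/eps)) bound. N appears
# nowhere: tightening the error target is the only thing that drives
# up the cost, whatever the domain size.
for eps in (0.10, 0.05, 0.01):
    cost = eps ** -2 * math.log(1 / eps)
    print(f"eps={eps:.2f}: relative cost ~{cost:,.0f}")
```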

Big-O notation doesn’t do anything for me: what does this even mean?

Big-O notation is beloved by computer scientists, but potentially meaningless in practice. There could be huge constants in these terms that render these attacks completely impractical. Besides, weird equations involving epsilon characters are impossible for humans to understand.

Sometimes the easiest way to understand a theoretical result is to plug some actual numbers in and see what happens. GLMP were kind enough to do this for us, by first generating several random databases — each containing 1,000 records, for different values of N. They then ran their recovery algorithm against a simulated batch of random range queries to see what the actual error rate looked like as the query count increased.

Here are their results:

Experimental results (Figure 2) from Grubbs et al. (GLMP, 2019). The Y-axis represents the measured error between the reconstructed database and the actual dataset (smaller is better). The X-axis represents the number of queries. Each database contains 1,000 records, but four different values of N are tested here. Notice that the biggest error occurs around the very largest and smallest values in the dataset, so the results are much better if one is willing to “sacrifice” these values.

Even after just 100 queries, the error in the dataset has been hugely reduced, and after 500 queries the contents of the database — excluding the tails — can be recovered with only about a 1-2% error rate.

Moreover, these experimental results illustrate the fact that recovery works at many scales: that is, they work nearly as well for very different values of N, ranging from 100 to 100,000. This means that the only variable you really need to think about as an attacker is: how close do I need my reconstruction to be? This is probably not very good news for any real data set.

How do these techniques actually work?

The answer is both straightforward and deeply complex. The straightforward part can be sketched in a few paragraphs; the complex part requires an understanding of Vapnik-Chervonenkis learning theory (VC-theory), which is beyond the scope of this blog post but is explained in the paper.

At the very highest level the recovery approach is similar to what’s been done in the past: using response probabilities to obtain record values. This paper does it much more efficiently and approximately, using some fancy learning theory results while making a few assumptions.

At the highest level: we are going to assume that the range queries are made on random endpoints ranging from 1 to N. This is a big assumption, and more on it later! Yet with just this knowledge in hand, we learn quite a bit. For example: we can compute the probability that a potential record value (say, the specific salary $34,234) is going to be sent back in response to a random query, provided we know that all values lie in the range 1 to N (say, we know all salaries are between $1 and $200,000).

If we draw the resulting probability curve freehand, it might look something like the chart below. This isn’t actually to scale or (probably) even accurate, but it illustrates a key point: by the nature of (random) range queries, records near the center are going to have a higher overall chance of being responsive to any given query, since the “center” values are more frequently covered by random ranges, while records near the extreme high and low values will be chosen less frequently.

I drew this graph freehand to mimic a picture in Kenny’s slides. Not a real plot!
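Under the uniform-endpoint assumption, this curve doesn't actually have to be guessed at: it can be computed exactly, since a range [a, b] covers a value v precisely when a <= v <= b. A small sketch (my own illustration, not code from the paper):

```python
# Probability that a value v in 1..N is covered by a range [a, b] drawn
# uniformly from all N*(N+1)/2 possible ranges: a range covers v exactly
# when a <= v and b >= v, and there are v * (N - v + 1) such ranges.
def coverage_prob(v, n):
    return v * (n - v + 1) / (n * (n + 1) / 2)

N = 100
for v in (1, 25, 50, 76, 100):
    print(f"v={v:3d}: p={coverage_prob(v, N):.4f}")

# Highest in the middle, lowest at the extremes, and symmetric around
# the midpoint: v and N + 1 - v respond at identical rates.
assert coverage_prob(50, N) > coverage_prob(25, N) > coverage_prob(1, N)
assert coverage_prob(25, N) == coverage_prob(76, N)
```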

The high-level goal of database reconstruction is to match the observed response rate for a given row (say, row 41) to the number of responses we’d expect to see for different specific concrete values in the range. Clearly the accuracy of this approach is going to depend on the number of queries you, the attacker, can observe — more is better. And since the response rates are lower at the highest and lowest values, it will take more queries to pin down outlying data values.

You might also notice that there is one major pitfall here. Since the graph above is symmetric around its midpoint, the expected response rate will be the same for a record at .25*N and a record at .75*N — that is, a $50,000 salary will be responsive to random queries at precisely the same rate as a $150,000 salary. So even if you peg every database row precisely to its response rate, your results might still be “flipped” horizontally around the midpoint. Usually this isn’t the end of the world, because databases aren’t normally full of unstructured random data — high salaries will be less common than low salaries in most organizations, for example, so you can probably figure out the ordering based on that assumption. But this last “bit” of information is technically not guaranteed to come back, absent some assumptions about the data set.
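This flip ambiguity is easy to check numerically under the same uniform-range assumption used throughout (my own sketch, not code from the paper; note the exact mirror image of v is N + 1 - v, so $150,000 is only an approximate reflection of $50,000 at this scale):

```python
import math

def response_rate(v, n):
    # Fraction of the n*(n+1)/2 possible ranges [a, b] that cover v:
    # a range covers v exactly when a <= v and b >= v.
    return v * (n - v + 1) / (n * (n + 1) / 2)

N = 200_000
# The exact mirror image of v is N + 1 - v...
assert response_rate(50_000, N) == response_rate(150_001, N)
# ...so a $50,000 salary and a $150,000 salary respond at rates that
# are indistinguishable in practice.
assert math.isclose(response_rate(50_000, N),
                    response_rate(150_000, N), rel_tol=1e-4)
```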

Thus, the recovery algorithm breaks down into two steps. First, observe the response rate for each record as random range queries arrive. Second, for each record that responds to such queries, try to solve for a concrete value that minimizes the difference between the expected response rate at that value and the observed rate. The probability estimation can be made more efficient (eliminating a quadratic term) by assuming that there is at least one record in the database within the range .2N-.3N (or .7N-.8N, due to symmetry). Using this “anchor” record requires a mild assumption about the database contents.

What remains is to show that the resulting attack is efficient. You can do this by simply implementing it — as illustrated by the charts above. Or you can prove that it’s efficient. The GLMP paper uses some very heavy statistical machinery to do the latter. Specifically, they make use of a result from Vapnik-Chervonenkis learning theory (VC-theory), which shows that the bound can be derived from something called the VC-dimension (which is a small number, in this case) and is unrelated to the actual value of N. That proof forms the bulk of the result, but the empirical results are also pretty good.

Is there anything else in the paper?

Yes. It gets worse. There’s so much in this paper that I cannot possibly include it all here without risking carpal tunnel and boredom, and all of it is bad news for the field of encrypted databases.

The biggest additional result is one that shows that if all you want is an approximate ordering of the database rows, then you can do this efficiently using something called a PQ tree. Asymptotically, this requires O(\epsilon^{-1} log~\epsilon^{-1}) queries, and experimentally the results are again even better than one would expect.

What’s even more important about this ordering result is that it works independently of the query distribution. That is: we do not need to have random range queries in order for this to work: it works reasonably well regardless of how the client puts its queries together (up to a point).

Even better, the authors show that this ordering, along with some knowledge of the underlying database distribution — for example, let’s say we know that it consists of U.S. citizen last names — can also be used to obtain approximate database reconstruction. Oy vey!

And there’s still even more:

  • The authors show how to obtain even more efficient database recovery in a setting where the query range values are known to the attacker, using PAC learning. This is a more generous setting than previous work, but it could be realistic in some cases.
  • Finally, they extend this result to prefix and suffix queries, as well as range queries, and show that they can run their attacks on a dataset from the Fraternal Order of Police, obtaining record recovery in a few hundred queries.

In short: this is all really bad for the field of encrypted databases.

So what do we do about this?

I don’t know. Ignore these results? Fake our own deaths and move into a submarine?

In all seriousness: database encryption has been a controversial subject in our field. I wish I could say that there’s been an actual debate, but it’s more that different researchers have fallen into different camps, and nobody has really had the data to make their case in a compelling way. There have actually been some very personal arguments made about it.

The schools of thought are as follows:

The first holds that any kind of database encryption is better than storing records in plaintext and we should stop demanding things be perfect, when the alternative is a world of constant data breaches and sadness.

To me this is a supportable position, given that the current attack model for plaintext databases is something like “copy the database files, or just run a local SELECT * query”, and the threat model for an encrypted database is “gain persistence on the server and run sophisticated statistical attacks.” Most attackers are pretty lazy, so even a weak system is probably better than nothing.

The countervailing school of thought has two points. First, sometimes the good is much worse than the perfect, particularly if it gives application developers an outsized degree of confidence in the security that their encryption system is going to provide them.

Second, if even the best encryption protocol is only throwing a tiny roadblock in the attacker’s way, why risk it at all? Just let the database community come up with some kind of ROT13 encryption that everyone knows to be crap, and stop throwing good research time into a problem that has no good solution.

I don’t really know who is right in this debate. I’m just glad to see we’re getting closer to having it.


by Matthew Green at February 11, 2019 04:41 PM

February 03, 2019

The Lone Sysadmin

Out-of-Office Messages are a Security Risk

Every once in a while I get asked why I don’t have an out-of-office message for my email or voice mail. Truth is, I’ll often monitor my email even when I’m out, though I often practice good operations discipline by not responding. Just as intermittent problems with computer systems are hard to deal with, a […]

The post Out-of-Office Messages are a Security Risk appeared first on The Lone Sysadmin. Head over to the source to read the full post!

by Bob Plankers at February 03, 2019 07:31 PM