SRE as a Lifestyle Choice
Can you apply SRE thinking to things that don’t involve computers?
The last project I worked on for the government was a pilot program to fix Title 5, otherwise known as civil service competitive hiring. By 2018 I was ready to leave government and told USDS so. I was moved over to short-term projects, mainly assisting and advising other leads. The competitive hiring project was by far my favorite.
It was the sort of interesting problem I’m really attracted to: how do you train an agency to conduct a technical hiring process more closely mapped to the private sector? That means software engineers interviewing software engineers, hiring committees, structured technical interviews, detailed written feedback, the works.
In theory the vast majority of hiring the government does should be through a competitive process governed by Title 5. In reality that system is shockingly and profoundly broken. Something like 60% of positions that open through competitive hiring close without an offer put out to a single qualified candidate. One could write an entire thesis on how broken it is or why it is broken, but here are some highlights to give you the lay of the land. They represent a typical experience with competitive hiring:
- Resumes are thrown out unless they contain exact matches for critical keywords and all critical keywords at that. This can get pretty ridiculous when the positions are scientific or technical in nature. White House HR once tried to reject a candidate whose previous employer was Github because he didn’t specifically list experience with version control on his resume.
- Since candidates have no idea which keywords on a job posting will be critical they produce 30 to 40 page resumes to cover as much ground as possible. Those long resumes are then scored higher even if huge portions of the work experience are not relevant to the position because they have more keywords in them.
- Candidates are given a survey asking them to self-assess how their skills meet the requirements of the position. If you don’t give yourself the highest score on every metric you get rejected. Even if everyone else lies in the most horrible, blatant fashion and you are truly the most qualified candidate, because you didn’t give yourself 5 out of 5 mastery on every requirement, your application is dead.
- Veterans get bonus points added to their score before candidates are ranked for hiring lists, but since everyone in the system is cheating by this point and no one has interviewed any of these candidates to verify their abilities yet, veterans with no experience are frequently ranked above qualified candidates. If enough veterans apply, qualified candidates might not even make the hiring list.
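To make the keyword problem in the bullets above concrete, here is a toy sketch of an exact-match screen. The keyword list, function names and sample resumes are all hypothetical; the actual screening process is not specified here, so treat this as an illustration of the failure mode rather than a description of any real system.

```python
# Toy illustration only. The keyword list and resumes are made up.
CRITICAL_KEYWORDS = {"version control", "python", "linux"}


def keyword_hits(resume_text: str, keywords: set[str]) -> int:
    """Count how many keywords appear verbatim in the resume text."""
    text = resume_text.lower()
    return sum(1 for kw in keywords if kw in text)


def passes_screen(resume_text: str, keywords: set[str]) -> bool:
    """Reject the resume unless every critical keyword is an exact match."""
    return keyword_hits(resume_text, keywords) == len(keywords)


# A focused resume that implies version control without naming it.
focused = "Five years at GitHub building Git hosting in Python on Linux."

# The same resume padded with extra phrases to cover more keywords.
padded = focused + " Also: version control, spreadsheets, typing, filing."

print(passes_screen(focused, CRITICAL_KEYWORDS))  # the focused resume is rejected
print(passes_screen(padded, CRITICAL_KEYWORDS))   # the padded resume passes
```

Under these assumptions, the only way for a candidate to be safe is to add more text, which is the incentive that produces 30 to 40 page resumes.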
Of course what’s interesting about this problem is that there’s absolutely nothing in the law requiring it to work this way. And yet this was the way it worked at Every. Single. Agency. The HR reps for each agency have the discretion to run a legitimate hiring pipeline, but they don’t because they are afraid of getting sued. It’s hard to say how rational those fears are. Since 1994 the Office of Personnel Management (basically government HR) has only been sued over hiring issues about 100 times (source). For context, there were about 7,000 employment and discrimination cases in the private sector during fiscal year 2016 alone (source). I used to joke that the less likely a given risk was, the more likely the government was to hyperventilate over it. This was one of those places where the joke cut too close to reality. Government HR reps were obsessed with minimizing the risks of a competitive hire getting challenged in court.
And yet every good student of resiliency and security understands that minimizing risk only improves a system up to a point. Once past that sweet spot continuing to try to perfect the system will often lead to counterintuitive and destructive outcomes.
Part of the reason why I liked this particular project was that I am fascinated by system failure. Technical system failure, organizational system failure, it doesn’t really matter. In a technical context, the solution to this problem of how and when to optimize a system without over optimizing and making the whole thing worse is the domain of Site Reliability Engineering. So as we thought about how we could train one agency to build a hiring pipeline, I got a bit distracted by the question of whether one could apply SRE procedures to something that does not involve any actual software engineering.
By this point I had worked in government for three years and four months, which is not very long but is enough to watch even the simplest kinds of policy guidance get misinterpreted and weaponized. One agency took guidance on agile contracting to mean they could not establish any specific requirements ever. Another tried to rebuild a custom equivalent of AWS Elastic Load Balancer to satisfy FIPS 140-2. Still another thought DISA’s guidance on not developing in production meant that they couldn’t sync their git repositories. No matter how thoughtfully written or well researched a policy is, someone, somewhere in government will create failure from enforcing it. It is inevitable.
These situations are fascinating to me — although I generally prefer to encounter them as a spectator. It seemed clear that if we successfully taught one agency to use a different hiring system, our recommendations and structures would be reproduced by other agencies. As they spread the odds that they would be put in the wrong hands and misapplied increased dramatically. The longer this project survived the more failure became inevitable.
For engineers in charge of maintaining service reliability this is a familiar concept. At a certain scale failure becomes inevitable and you must change your approach from preventing it to responding and adapting to it. It is the foundation of resilience engineering.
But could you build a hiring system that detects failures and course corrects the same way an SRE team might detect technical failures and course correct?
Principles of Reliability
The first step was to define how exactly SRE works beyond the technical description of tools. We came up with three focus areas:
- Monitoring: How do we know when something is broken?
- Budgeting: What level of performance do we need from this system? What are we optimizing for and how much failure can we withstand before we lose value?
- Escalation/Empowerment: If on-the-ground workers realize the system is producing a subpar result, are they empowered to fix it?
The last bullet seemed the most problematic. The government prefers to standardize things to the point of stripping individuals of their agency. Individual variance means risk, it means bias, it means the fallibility of one person’s judgment over the predictable results of an institution. But it’s also necessary for resilience. As the old proverb goes: “You cannot step into the same river twice.” In complex systems, environments and conditions are always changing. If people are not given some ability to exercise their own judgment, the system will be consistent and predictable only in its failure.
As we examined different types of setups from the private sector or elsewhere in government, I found myself constantly highlighting places where discretion played a key role in the outcome. From time to time hiring committees overrule or disregard the scores of interviewers if the score seems inconsistent with the written feedback. How should that be handled? What happens when an interviewer runs out of time on a structured interview? Is the interview invalid because it is not identical to the interview everyone else got or is the interviewer allowed to show discretion and rate the candidate anyway? How strictly should justifications on resume reviews be mapped to formal job requirements? Too strict and you force candidates to copy and paste exact language onto their resume. Too loose and resume reviewers add criteria on the fly in order to support biases against candidates based on age, gender and race.
A technical recruiter I knew once explained to me a whole social engineering strategy she had for her company’s hiring committee: “I need to give them a weak candidate to review first so that they can get the ‘No’ out of their system. If I don’t do that they will judge every candidate by stricter and stricter standards. Then I follow the weak candidate up with a really strong one so that they get excited about hiring great people again. After that they’re ready to have normal conversations about candidates.”
Does our planned system allow for the same level of discretion? Is it even appropriate to allow for that kind of discretion? More to the point, does our planned system have ways of protecting necessary discretion if an oversight body tries to remove it in order to increase “fairness”?
With that in mind, I decided to approach the project design by running a series of thought experiments that focused not on how we would prevent a given problem, but how we would identify it developing in progress. We sat down, looked at our proposed pipeline structure and brainstormed a bunch of potential vulnerabilities mapped to each stage of the process.
- Agency identifies someone as a subject matter expert (SME) for the purpose of reviewing and interviewing candidates who is not qualified.
Developing Job Listings
- Agency uses language that scares off qualified candidates, particularly underrepresented groups
- Agency fails to separate “requirements” from “nice to haves”, forcing qualified candidates out of the pipeline because they “don’t have all the keywords”
- Bias: Reviewers approve/reject candidates based on gut feeling, inventing additional criteria to justify the decision. Candidates rejected for listing older languages on their resume, working at certain companies, attending college later in life or not at all.
- Irrational escalation of standards: reviewers judge candidates stricter or not based on time of day, the candidate they reviewed before, perceived robustness of pipeline (Note: is it better to fill an open position ASAP or leave the position open for a bit to find a better candidate?)
- Certain interviewers are scheduled for more than their fair share of interviews, burning them out faster.
- Interviewers write questions that do not test skills actually relevant to the position
- Multiple interviews overrepresent some competencies and underrepresent (or fail to assess at all) other competencies
- Interviewers fail to accurately transcribe Q&A, making it difficult for the hiring manager to assess the interviewer’s feedback
- Hiring Manager Discretion: How and why to overrule interview feedback?
- Interviewer scores an interview one way but the tone of their written feedback suggests something else
- Candidate scores high in technical interview but demonstrates behavioral issues (i.e., what to do with brilliant assholes).
This is not an exhaustive list, nor was it intended to be. In resilience engineering you must expect that you will never be able to anticipate all the potential failure points.
For each one of these issues, rather than developing a solution to incorporate back into our process, we asked ourselves: “If this were happening, how would we know it? What would we expect to see?” In order to incorporate SRE into our process we had to be able to figure out what the agency should be monitoring.
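As one example of turning a vulnerability into something observable, take the interviewer burnout item from the list: if some interviewers were carrying more than their fair share, we would expect to see their names dominate the schedule. The sketch below is hypothetical; the schedule format, the names and the 1.5x tolerance are assumptions for illustration, not something we actually built.

```python
from collections import Counter


def overloaded_interviewers(schedule, tolerance=1.5):
    """Return interviewers whose load exceeds `tolerance` times the mean load.

    `schedule` is a list of (interviewer, candidate) pairs. Both the format
    and the default tolerance are illustrative assumptions.
    """
    counts = Counter(interviewer for interviewer, _candidate in schedule)
    if not counts:
        return []
    mean_load = sum(counts.values()) / len(counts)
    return sorted(name for name, n in counts.items() if n > tolerance * mean_load)


# Hypothetical schedule: one interviewer is doing most of the work.
schedule = [
    ("ana", "c1"), ("ana", "c2"), ("ana", "c3"), ("ana", "c4"),
    ("ben", "c5"),
    ("cy", "c6"),
]

print(overloaded_interviewers(schedule))
```

The point is not the threshold itself but that the question "how would we know?" has an answer that can be checked without redesigning the pipeline.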
CAP Theorem for Hiring
There were lots of challenges around developing a monitoring strategy. Most of our problems wrapping our heads around things in the thought experiment stage came down to a simple truth: success is easy to define when running a website. Keep the damn thing online and available as often as it takes for it to make money. Others might word that concept more politely, with lip service to value to customers, but at most for-profit organizations value to customers is only appreciated to the degree that it can encourage customers to keep being customers. If competition is non-existent, SLOs get a lot more generous.
But hiring is not as clear cut. As Machiavellian as saying availability needs are determined by customer tolerance for taking their business elsewhere might be, it is at least straightforward. What is a positive outcome in a hiring system? Is it a lot of hires? Is it snatching up the best hires? Is it a system that can be demonstrated to successfully avoid institutional discrimination?
I got made fun of a little for suggesting that one cannot optimize for a large number of hires, diversity and highly qualified candidates all at once, the same way one cannot optimize for consistency, availability and partition tolerance in distributed systems. (Or if you prefer the Project Management Triangle: cheap, fast, good.) But it seemed very clear to me that this was the case. If you optimize for a large number of hires who are also highly qualified, underrepresented groups will naturally be... well, underrepresented. You can have a large number of diverse hires if you are willing to hire less qualified people and train them. Or you can hire a small number of diverse and highly qualified candidates.
You cannot have all three.
Which two an organization chooses naturally affects what our error budget is. If we’ve accepted that we will never be able to build a perfect system, we must then decide how imperfect it is allowed to be before we care. If we care about diversity and quality, then we might still track data on number of hires but overall we’ll be more tolerant of failures there.
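For readers who have not worked with error budgets, here is a minimal sketch of the arithmetic, assuming a hiring objective can be expressed as "some fraction of events must be good". The `HiringSLO` class, the objective name and the 50% target are all hypothetical; we never formalized numbers like these.

```python
from dataclasses import dataclass


@dataclass
class HiringSLO:
    """A hypothetical objective: `target` is the fraction of events that must be good."""
    name: str
    target: float

    def budget_remaining(self, good: int, total: int) -> float:
        """Fraction of the error budget still unspent (negative means overspent)."""
        if total == 0:
            return 1.0  # nothing has happened yet, so nothing has been spent
        allowed_bad = (1 - self.target) * total
        actual_bad = total - good
        if allowed_bad == 0:
            return 1.0 if actual_bad == 0 else float("-inf")
        return 1 - actual_bad / allowed_bad


# Hypothetical objective: at least half of postings close with an offer
# to a qualified candidate.
offers = HiringSLO(name="postings closing with an offer", target=0.5)

# 40 postings, 30 closed with an offer: 10 failures against 20 allowed.
print(offers.budget_remaining(good=30, total=40))
```

An organization that prioritizes diversity and quality could set a looser target on the hire-count objective, which is what "more tolerant of failures there" looks like in these terms.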
What’s tricky about applying error budgets to things like quality of hires and diversity is that these vectors may not be as easily quantifiable as uptime and latency are. To begin with the diversity side relies on self reporting, which candidates are understandably suspicious of. You can only read so many stories about how white sounding names do better on resumes than brown sounding names, or how androgynous diminutives give female candidates an edge by making hiring managers assume they are men (and therefore more qualified) before you start to think only a sucker would check those optional boxes about gender, race and disability status on an application.
So it all came down to: what could we monitor accurately and what couldn’t we?
Monitoring used to be an easy thing to explain. For the longest time it felt self-evident: you are watching a system. Lately I’ve had my doubts. The closer I look at it, the more there seem to be schools of thought.
There’s the observability crowd. People who want to be able to track the health of the system by tracing each stage of any single transaction from beginning to end. This almost always seems to involve a lot — and I do mean A LOT — of dashboards. It is governed by the fundamental belief that exploring a healthy system is just as important as exploring a broken one.
Then there’s the alerting crowd. These people don’t care much about the behavior of the system when it’s working well. They care about states that violate predefined conditions. The assumption is we know what success looks like and the only time monitoring becomes important is when we have violated the requirements of that success.
Each approach has its merits. Observability acknowledges the possibility that there are failure conditions we aren’t considering or have missed. Alerting concedes that we do not have infinite attention and if we can’t prioritize how we focus we end up missing big things. Most organizations do a mix of both approaches, but they do solve for different problems and make different demands upfront.
Observability, unfortunately, requires a lot of maturity around logging. You need to both define how the current state of things is represented and have the infrastructure to collect that state, store it and aggregate it… to say nothing of analyzing it. We would have loved to take an observability approach to this problem but it was beyond what we could ever produce.
It was also too easy to abuse. There’s nothing bureaucrats love more than being able to dissect every aspect of both success and failure. An observability approach seemed likely to backfire by making it possible to call out tiny inconsistencies in otherwise functioning hiring exercises.
SLOs and error budgets are so effective in traditional SRE because they tell you when to stop. A huge number of failures in my experience comes from optimizing for the sake of optimizing without any discernible benefit. There was a really interesting talk at Strange Loop this year about how quirks in code layout fool programmers into chasing their tails thinking they’re improving performance when they’re not. Overoptimization and premature optimization are big problems in technology, but they can also be big problems in other fields. Title 5 got itself in the position it did because scores of people decided to eliminate inconsistency to the point where it crippled functionality.
In our case, we got much farther by defining antipatterns than we did by creating metrics. As hard as it was to define metrics, it would be even harder convincing the keepers of the system to think about things in terms of budgets and not drive the process off a cliff trying to optimize on those metrics. Some of this was a product of our position as advisors without formal authority or ownership. Were we at a traditional org we would have been able to set a range — no lower than this no higher than that — for metrics that might have headed off the destructive impulse to make things just a tiny bit better.
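The range idea is simple enough to sketch. This is one reading of "no lower than this, no higher than that", with a hypothetical function name and made-up thresholds; nothing like it was deployed.

```python
def band_status(value: float, floor: float, ceiling: float) -> str:
    """Classify a metric against a band instead of a single target.

    Below the floor is a problem worth investigating. Above the ceiling is a
    sign that someone is optimizing past the point of benefit. Anything in
    between should be left alone.
    """
    if floor > ceiling:
        raise ValueError("floor must not exceed ceiling")
    if value < floor:
        return "below floor: investigate"
    if value > ceiling:
        return "above ceiling: stop optimizing"
    return "in range: leave it alone"


# Hypothetical band for the share of resume reviews with a written justification.
print(band_status(0.85, floor=0.70, ceiling=0.95))
```

The ceiling is the part that matters for this story: it gives the keepers of the system an explicit signal that the metric is good enough and further tightening is a cost, not an improvement.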
The most important anti-pattern was also the biggest elephant in the room around competitive hiring: veterans. Our goals weren’t to discourage or eliminate veterans. If they were qualified we wanted them just as badly as qualified technical people from any background. But the broken system had been so badly gamed by unqualified veterans that it impacted people’s perception of what a solution should look like. We had to work hard to ensure that tracking the number of veterans in the pipeline was defined as an anti-pattern. The team lead spent what felt like a ridiculous amount of time telling people over and over that we could not track data on veterans where subject matter experts making cuts could access it AND that we were not required to put veterans who had been eliminated because they failed an interview back up for consideration solely because they were veterans. We were monitoring to make sure that veteran status wasn’t tied to qualification either positively or negatively.
Normally the monitoring of a system is relatively easy and the empowerment of the people is hard. With this system the monitoring was hard and the empowerment ended up being easy. Agencies were desperate to hire good people and they hated the current process. We thought we were going to have to hold their hands through the change but all we had to do was get OPM to admit that our proposed pipeline was legal. One CIO looked at our documentation and told our team lead “No offense but I don’t need a pilot, I’m just going to do it.”
We got asked if our proposed process could be used for other skills too? Justice wanted to use it to hire paralegals. Interior wanted to use it for everything.
Government being government (i.e., slow), by the time the pilots really got moving I was ready to roll off and start work at Auth0. It wasn’t until several months later that I heard how all our marathon whiteboarding sessions had played out. The pilots (which tossed out self-assessments, had software engineers review resumes of software engineers and put candidates through a couple rounds of interviewing before ranking them) had produced a hiring list with which agencies had hired 12 people. TWELVE! In a system where hiring 0 people was the normal outcome. Women had also scored on average higher than men, which helped alleviate concerns that deviating from the normal procedure would invite discrimination lawsuits. In the end the Office of Performance and Personnel Management secured funding to expand the program federal-wide and hired the USDS team lead full time to oversee it.
Will these results survive? Can they survive? After months of thinking about this I’m still not sure. But if nothing else the experience highlighted for me why I’m less interested in the technically focused challenges of SRE and more interested in the people-focused processes that resiliency tries to tackle.