How Facebook Does IT Automation
Running the operations of Facebook is an enormous task. First, there’s the sheer scale of its global networks and the absolute priority it places on reliable service and satisfying user experiences. Then there’s the fact that just maintaining the current offerings isn’t enough: operations is expected to constantly create new flexibility and capacity for the business to pursue its broader innovation agenda. Looking forward, Facebook’s ambitious initiatives include its Connectivity Lab (a plan to bring internet access to those around the world who lack it), AI and deep learning, and virtual reality as a next-generation computing platform.
I sat down with Jay Parikh, the company’s Vice President of Engineering, to discover what his priorities might suggest more broadly about the future of operations. An edited version of our conversation follows.
HBR: When a given company thinks about “the future of operations,” it thinks mainly about how its own operations will change. Often that vision features automating a lot of work that humans do today. Is that true for Facebook?
Parikh: We use a lot of automation because the team has a lot of complexity to manage. We have an infrastructure that spans hundreds and hundreds of thousands of computers all around the world, we’re serving 1.44 billion people on the main app and hundreds of millions of people on the other apps, and there are thousands of engineers writing software that’s getting deployed all the time – changing features, making things more optimized, providing some new service so we can launch a new feature in Messenger or Instagram. So we have really had to focus a lot on orchestration and automation so that we can keep up with the scale and with the pace of product development that we want.
Could you give an example of work that was manually done, and is now automated?
We built a system we call FBAR, for Facebook Auto Remediation, to do a very basic set of hardware remediation tasks. Before, if a server had a hard drive failure or some hardware error, an alarm would go off and some human would have to log in, or walk to the computer, and try to debug or fix it. You’d do some things to try to fix it in software, you’d reboot the machine, you might try to reimage it. A lot of that software remediation and debugging is all automated now. No person has to be involved with that. The system will detect the error (it could be a disk drive, it could be a CPU, it could be a networking card or power failure) and can go ahead and do a bunch of things that it knows how to do.
But these are really dumb things – things that are really, really trivial to automate. And thereby, it allows us to take that engineer we really worked hard to recruit, and trained to deal with higher-level things, away from doing something pretty remedial – work that is not fun, and they’re not growing or learning anything, yet it’s time consuming – and say: “Hey, help us figure out how to architect this new service. Go figure out how to make this thing run faster, or help build this new automation to tackle the challenges we have in mobile applications. Help us design our new data center.”
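The escalation Parikh describes (try a software fix, then reboot, then reimage, and only then involve a person) can be sketched in a few lines of Python. This is an editorial illustration, not FBAR itself: the `Server` class, the step names, and the `fixed_by` field that simulates which step clears a fault are all assumptions invented for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Server:
    name: str
    fault: str                 # e.g. "disk", "cpu", "nic", "power"
    fixed_by: str = ""         # hypothetical: the step that clears the fault in this simulation
    log: List[str] = field(default_factory=list)


def make_step(step_name: str) -> Callable[[Server], bool]:
    """Build a simulated remediation step that records itself and reports success."""
    def step(server: Server) -> bool:
        server.log.append(step_name)
        return server.fixed_by == step_name
    return step


# Cheapest, least disruptive actions first (order taken from the interview).
ESCALATION: List[Tuple[str, Callable[[Server], bool]]] = [
    (name, make_step(name)) for name in ("soft_fix", "reboot", "reimage")
]


def remediate(server: Server) -> str:
    """Run each step in order; open a ticket for a human only if all of them fail."""
    for step_name, step in ESCALATION:
        if step(server):
            return step_name
    server.log.append("ticket")
    return "ticket"
```

Calling `remediate(Server("web001", "disk", fixed_by="reboot"))` tries the software fix, then the reboot, and stops there. A server that no step can repair ends with a `"ticket"` entry, which is the only point at which a technician gets involved.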
Do you have a disciplined process for spotting the next thing to automate, and constantly moving that human work upward?
We do. We generally watch a bunch of metrics, like machine counts and failures, to figure out the next thing to automate. And we’ve been on a steady march throughout the years. Another example has to do with our clusters – a cluster is just some number of servers doing a certain type of function in our infrastructure – and the process of “turning up,” or getting the thing ready to go. There’s a lot of configuration involved, installing software, and making sure the right things are talking to each other, and when I got to Facebook in 2009, it was all done manually. We literally would just write on a whiteboard: this is assigned to Jay, this is assigned to Bob, this is assigned to Sally, this is assigned to Phil — and then realize, “Oh wait, Phil can’t do his thing until Bob is done doing his thing,” and draw an arrow to show the dependency. It was time consuming, but more important, it was error-prone. And what if Bob was out and George substituted for him? Later we’d be scratching our heads: “Why does this thing behave a little bit differently?”
You don’t have to worry about that with automation, because work consistently gets done in the same way every time. And cluster turn-ups which used to take three or four months now can be done in the course of a week – sometimes in just a couple of days.
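The whiteboard arrows in that story ("Phil can't start until Bob is done") amount to a dependency graph, and ordering such a graph is a standard topological sort. The sketch below uses Python's standard `graphlib` module (Python 3.9+). The task names and the `plan_turn_up` helper are hypothetical, chosen to mirror the anecdote, and say nothing about how Facebook's own tooling is built.

```python
from graphlib import CycleError, TopologicalSorter
from typing import Dict, List, Set

# Each task maps to the set of tasks that must finish before it can start.
TURN_UP_TASKS: Dict[str, Set[str]] = {
    "jay_task": set(),
    "bob_task": set(),
    "sally_task": set(),
    "phil_task": {"bob_task"},   # the arrow drawn on the whiteboard
}


def plan_turn_up(tasks: Dict[str, Set[str]]) -> List[str]:
    """Return one valid execution order, or raise ValueError on circular dependencies."""
    try:
        return list(TopologicalSorter(tasks).static_order())
    except CycleError as exc:
        # graphlib reports the offending cycle as the second element of args.
        raise ValueError(f"circular dependency: {exc.args[1]}") from exc
```

`plan_turn_up(TURN_UP_TASKS)` always places `bob_task` before `phil_task`, whoever runs it and however often. That repeatability is the consistency benefit the answer above points to.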
And, I presume, with fewer people.
On that point, a third example I’ll give has to do with “break/fixes.” We have this fleet of servers that is pretty large, and things do break in a data center. So we have done a lot of work on the software automation side of this, to the point that we only need one technician in the data center for every 25,000 servers. That is a ratio that is basically unheard of. Most IT shops have ratios of one to 200, or one to 500. But the point of all of this automation is that it allows us to take very simple but time-consuming tasks and move them off the plates of our really smart people. We’d rather have them thinking about the next two years coming than the last two years of stuff we’ve already built.
So this notion of augmenting your people and enabling them to take on more interesting work – that is really what drives the automation agenda?
The reason why the automation is so important is that it does free up these teams to go think about and do things for the future. If you think about it, most tech companies in the world today are spending an incredible amount of time competing for the best of the best in the world. For a smart person in tech, there are just a lot of fun companies with fun problems to go work on. So once you’ve worked so hard to get people into the company and to ramp them up and to understand your environment, you want to keep them. You want to have them be engaged. You want them to grow and develop their careers and stay with you for this ride that is scale over a really long period of time.
I think one of the major ways you do this is by continuing to keep them out of their comfort zone. If they end up doing “humdrum” work for a long time, they’re not learning anything, yet they’re spending a lot of time doing it. That leads to burnout and unhappiness, and then they’re going to go somewhere else. So, I think, if growing your company depends on keeping these brilliant technologists engaged, the necessity is that you have to keep automating and rearchitecting your systems so that things don’t become boring, monotonous, and repetitive. Automation serves your talent objectives.
Usually engineering and operations leaders are more obsessed with efficiency – not so focused on maximizing people’s happiness.
A lot of companies also get very caught up in structures around operations. They say to one team: you have the keys to do this – but if you want to do this, you have to go to another team to do it. My belief is that you’ve just got to hire good people, and then break down those traditional organizational barriers to getting things done. If the people are really good, you can base your approach on the assumption that they’re all trying to do what’s best for the company. Put them together, give them some goals, and then just let them go.
Otherwise, you put them in a position where one team, in order to hit its goals, always wants to make changes, and another team only wants to make everything stable and cost-effective. Those two are completely opposed to each other. So you’re setting up tens, dozens, hundreds of people just to be constantly butting heads. Yet that is a very common org structure: Here’s the product teams, here’s the middleware and back-end engineering teams, here’s an operations team, here’s a security team, here’s an IT team. And it used to be that way at Facebook. When I showed up at Facebook in 2009, this is what I saw and I thought it was perfectly reasonable. It was this way when I was at Akamai, it was this way when I got to Ning.
I’m sensing you changed that.
We realized we had to demolish it all, because it was causing inefficiencies in our operations. It was slowing us down. We weren’t making the best decisions possible. Sometimes we were making decisions that were short-term cost based, when we should have been basing them on what we would need to be ready for a year from now. In other places, the focus on automation wasn’t there, because it wasn’t a team’s own responsibility. Questions about automation were being tossed to a small team of people who just weren’t going to keep up with this swarm of engineers that we were hiring. We had to rethink the entire organizational structure, and the types of people we were hiring, to break up all these different sorts of walls.
Do you still separate the teams working on innovations from the teams maintaining today’s core business?
No, we come up with what we call big bets as a team, planning the investments in technology that will take one or two or three years to build. For example, we built a new compiler that runs the front end of our site. It took us a couple of years to build and that R&D effort was done in parallel, in the same team that was maintaining the existing run time that was running the live Facebook. It absolutely wasn’t that there was one team separate, working on the new thing, and another team stuck behind with the old thing.
Most management gurus tell you to separate that team, to give the innovation a chance to take off. That must have been hard to manage, having the two things going on at once.
It is hard to manage. There are interruptions coming in every day that want to take you away from that long-term goal. You really have to make sure you have the right team, with all the skill sets and the diversity you need. You have to be a little flexible on deadlines, and build in a little bit of wiggle room in case the interrupts don’t play in your favor.
But meanwhile there are so many benefits — starting with the fact that it’s done in a much more open way. No one likes it when there’s a team working in some secure undisclosed location “doing something really cool that is going to replace what you’ve got.” It’s a tough thing to manage on both ends. The innovation team worries: “We’re not making any impact now – they’ve just stuffed us in a corner and told us go deliver something great.” And the core business team you’re depending on to deal with problems and customer support issues and all the urgent things that come up, is saying, “When do we get to work on something cool? Are we second-class citizens or something?”
That emphasis on openness seems appropriate for Facebook.
I think in younger tech companies like Facebook, the culture is more open. There are very few things that are done in secrecy. Mark [Zuckerberg] has always set up the company to be a very transparent and communicative team or environment. For example, he does a Q&A with the entire company every Friday, and encourages us to ask hard questions; he really wants to reinforce that we’re open, we’re one team. And of course it’s consistent with the mission of the company: to connect everyone on the planet. That has been there, consistent, for a long time and we really believe in it. So everything we talk about and do in operations needs to connect to how we’re going to get that done. It all needs to tie in.