“But we have a QA team, right? Why aren’t they writing the end-to-end tests for this?”
“We have a mobile team, right? Why don’t they build the mobile version of this?”
“We have a performance team, right? Why don’t they write the load tests for this?”
Excellent questions! Unfortunately, by asking me these questions, you’ve stepped into a trap.
Perhaps you were expecting “because I said so” as an answer, but I’m not your dad (let’s assume, because that would be quite the tangent).
No, I don’t like brushing important questions off like that. I’d like to give some context on “why the QA team can’t just do QA for us.” So yeah, sorry, you’re on the hook for a multi-week, episodic drama entitled “The Road to Platform Teams.”
But no worries, we’ll laugh (a bit), we’ll cry (most of the time), it will be an experience like no other. And all this for the low-low price of an estimated eight minutes of your time (for this first one, estimated by Silver Bullet).
And at the end of it, perhaps you’ll have learned to stop asking me questions.
How do you structure an engineering organization once it gets too big to be just #oneteam? There are many ways to do this, and it’s usually a bit of a journey for a company to find a model that works for them.
Let’s relive such a journey in the compressed time of just a few weeks. Let’s see where various setups succeed and fail, and iterate from there. Then, ultimately, we’ll hopefully have a better understanding of why Mattermost is structured the way it is — and perhaps, why this would make sense for you too. While not perfect, it makes sense for us right now. We can always reorg tomorrow (but no spoilers).
Are you ready?
The most obvious first iteration in any organization is “let’s go horizontal!”
At the architectural level, we often slice our product in horizontal slices: we’ve got a frontend layer, a backend layer, a cloud/operational layer, and QA to make sure it all kinda works. That makes sense, so let’s split our teams up that way as well! People in these layers all have the same job anyway, more or less, so why not? YOLO!
So, we’re going to put web frontend engineers in a web team, all mobile people in a mobile team, all server people in a backend team, and all QA people in a QA team.
There is plenty of upside to this. It brings people doing the same job together. Working together daily, they share knowledge, can mentor each other, and can talk about the problems with their technology stack and fix them. Their manager is likely an expert in their field, so they fully understand their everyday problems. The team recruits its own members, so it’s all nicely self-contained. Nice!
Let’s do it.
Alright, now we have to ship a feature. This one’s a bit bigger: it doesn’t just involve moving a button from one side of the page to the other (as is usually the case), it also requires some new endpoints in the backend. It also needs to be implemented in the mobile app, and because it’s critical, we have to make sure that it works properly end-to-end — so we’ll need end-to-end tests, QA team!
This will require a bit of coordination, so in comes 🥁 management.
Management says: “This looks fun, we’re going to turn this into a project!”
We’re going to bring together the leads and architects of all teams and we’re going to carefully plan and spec this whole thing out. Since communication and coordination between teams is rather expensive, we are going to have to nail down these specs quite precisely. Then, we’re going to create a dependency map and do some planning. Have you heard about the magic of the Gantt chart?
Both frontend teams (web and mobile) rely on backend endpoints being present. So in the first week, the backend team is going to do its magic. Then the next week, this will be ready for the frontend teams. They will work on it in parallel (efficient!) for the next two weeks, and then finally, the QA team jumps in to test this thing and automate that testing, which surely cannot take more than a week.
And voila (as the French Canadians say), in four weeks we’ll have ourselves a feature!
Off we go!
Sadly, things go wrong pretty much immediately. There is a major incident in cloud, requiring the attention of half the backend team for pretty much a week. There’s still progress, but because we also have five other projects running at the same time, some senior (code for old) manager person needs to step in to decide which of these takes priority. Sadly, our feature is low on the list, so we’re delayed by a week. The web, mobile and QA teams, who had planned their work based on the assumption that the backend team would “do their job,” now have to overhaul their planning. Those backend people, they just cannot seem to estimate their way out of a paper bag...
The next week, luckily, the backend team delivers its endpoints. Yay! The web and mobile teams are unblocked and can go, go, go (ironically not using Go)! Sadly, as these teams build out their frontends, they discover some problems with the endpoints.
“But it’s exactly done according to the spec!” the backend team yells across the team fence with some delay. “You agreed to this!”
“Yeah, well, we didn’t catch this until we actually got to use it!” the frontend teams yell back.
“Oh my,” the backend people think, “these frontend people cannot anticipate their way out of a paper bag. How hard can this stuff possibly be with just a single user interacting with their code?”
“OK, fine, next week we’ll have time to make these changes. We’ve got actual hard scaling problems to solve; we can’t context switch all the time.”
The mobile team decides not to wait and, for now, mocks the endpoints as they will (hopefully) work down the line. The web team decides to just pause development and move to other projects for now.
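The mocking idea itself is simple enough. Here’s a hypothetical TypeScript sketch (every name in it — the type, the function, the endpoint path — is invented for illustration, not real Mattermost API surface): the client codes against the shape the spec promises, with canned data standing in for the endpoint that doesn’t exist yet.

```typescript
// Hypothetical sketch of the mobile team's stopgap mock.
type Channel = { id: string; name: string };

const USE_MOCK = true; // flip to false once the backend actually ships

async function getFavoriteChannels(): Promise<Channel[]> {
  if (USE_MOCK) {
    // Canned data, shaped the way the spec says the endpoint *will* respond.
    return [{ id: "ch1", name: "town-square" }];
  }
  // The real call, e.g. fetch("/api/v4/channels/favorites"), would go here
  // once the endpoint exists.
  throw new Error("backend endpoint not shipped yet");
}
```

Simple, but note the hidden cost: this mock encodes one team’s guess about the spec, and every team that builds its own mock encodes its own guess.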
Another week, more incidents. The backend team doesn’t have time to implement the endpoint changes. More replanning is happening across all other teams. The web team learns about the mobile team’s mocking approach and starts to implement their own mock using a different framework that’s more “webby” than the one the mobile team used.
Another week, and the backend team delivers the updated endpoints. The mobile team removes the mocks to make sure everything works as expected. It doesn’t quite, but with a few days of work they manage to fix everything. The mobile version is ready to go.
The web team had to schedule other projects in the meantime, so they won’t get to it this week. QA is preparing to write tests, but will likely start with just covering mobile, because web isn’t nearly ready yet.
Another week, the web team throws away the half-finished mock implementation they started to work on, because the actual implementation is now ready. They implement the feature and they’re done!
Another week, and now it’s the QA team’s moment to shine. The feature is ready, and other teams have moved to other things. But guess what, the QA team uncovers issues. Indeed, there are bugs. Both in the web and mobile clients. They create tickets for both the mobile and web teams.
The next week, the mobile and web teams have their triage and planning session. They notice the reported bugs. “This is probably a server issue!” they say, and reassign them to the backend team.
The backend team has their planning session. “What was this project about again? Ah right, I remember this. But we implemented this according to the spec, so it must be a client issue. Let me rerun our test suite. Yep, all good. Let’s assign it back.”
The next week, management (in one of their various fancy JIRA dashboards) notices the bouncing back and forth of the bug tickets and sets up a meeting with all involved to resolve the issue. They resolve it. Fixes are scheduled for the next week.
The next week, the fixes are implemented.
The week after that, QA retests everything and all seems OK.
And voila (as the French Canadians say) after twelve weeks, we have ourselves a feature!
While a lot of what happened here (and this greatly simplifies what happens in real software development) can be attributed to the inherent unpredictability of software development (external factors, naive estimations), the glaring issue is communication and coordination cost. We have four teams, each with their own planning cycles, priorities, habits and mini-cultures. They cannot deliver a feature without heavy coordination. Because they are organizationally distant, they communicate through lossy and rigid “interfaces” (specs, JIRA tickets), interpreted through often cynical lenses. Obviously, if they’d all been using Mattermost, none of this would have happened, but we cannot all be so lucky (wink wink).
There must be a better way!
Indeed, there may be. So next week, let’s flip this whole structure on its side, and go vertical. Sounds kinda random? This is management, we just throw stuff at the wall and see what sticks. YOLO! No seriously, there’s good reason to believe this may actually work, you’ll see.
How’s that for a cliffhanger 🤯