Your SaaS's most important trait is Evolvability

In the world of commercial SaaS, your technology is always on a trajectory to become generic. Competition catches up. Broader trends change the way software is meant to look, feel, and be used. The longer your product stays static, the less it stands out.

What this means for your technology organization is that good software architecture is not some fixed set of principles from a textbook, but the practice of building adaptive strategies into your software, your infrastructure, and your team.

There's an O'Reilly book, Building Evolutionary Architectures. Its basic idea was hugely influential on me, but the text of the book missed the mark because it ended up deep-diving on side topics rather than discussing the core idea in depth. The essence of building a system that can adapt to market changes, company growth, and competition is to identify fitness functions, each describing an attribute you can optimize to keep your product healthy.

If you measure how well your software is doing, and how it and its development organization are responding to changes, you will always have a picture of whether your architecture is healthy or ailing. You'll also always have a picture of whether your software is, well, not done because it never is, but done for now.
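To make "fitness function" concrete before we go further: in practice it can be as small as a named, automated check of one measurable attribute against a threshold. Here's a minimal sketch in Python; the metric, the placeholder query, and the threshold are illustrative (they anticipate the calendar-sync example later in this post), not prescriptive.

```python
# A minimal sketch of a fitness function: one measurable attribute, one threshold.
from dataclasses import dataclass
from typing import Callable

@dataclass
class FitnessFunction:
    name: str
    measure: Callable[[], float]  # pulls the current value from your metrics store
    limit: float                  # the boundary that defines "healthy"

    def healthy(self) -> bool:
        return self.measure() <= self.limit

def p95_sync_latency_seconds() -> float:
    # Placeholder: in a real system this would query your monitoring backend.
    return 42.0

sync_latency = FitnessFunction(
    name="p95 appointment sync latency (seconds)",
    measure=p95_sync_latency_seconds,
    limit=60.0,
)

print(sync_latency.name, "healthy?", sync_latency.healthy())
```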

The tip of the iceberg

According to Wikipedia, Net Promoter Score, or NPS, is "a market research metric that is based on a single survey question asking respondents to rate the likelihood that they would recommend a company, product, or a service to a friend or colleague." Most companies use it in some form to determine how well their products are doing in the market. Sometimes you even get NPS for individual features or facets of your product offering. There are other metrics your company might care about as well: ARR, net retention, gross retention, churn, out-of-cycle churn, etc.

These can all be characterized as derivatives of your core software architecture fitness functions – not in the mathematical sense, but in the stock market sense. They're affected by what you want to optimize. They're also trailing metrics, and to optimize your engineering and software architecture you want leading metrics.

If all this seems abstract and you don't know how your software architecture affects net retention, don't worry, I'll get there. What you want, both to communicate your department's effectiveness to people in and out of engineering and to improve it, is to be able to infer a reasonable connection from the functions you define to these iceberg metrics. The most important corollary to "measure everything" is "measure it right."

An Example: Calendar Sync

Calendars describe a scarce, conflict-prone resource: a person's or facility's time. Because it's a conflict-prone resource and because most people aren't going to use your calendar as their only calendar, you decide you want a way to do a two-way sync with outside calendars. That way they can schedule without checking multiple calendars by eye for conflicts.

This is a great feature to use as an example because in the ideal case a user won't even notice it. It's not just not flashy. Its ideal state is invisible. So the only thing you can really tune is a customer's perception of its reliability.

As a user, when I create a calendar event on either calendar, I expect to see it show up on both. That's the core purpose of a sync. I can tell you easily with a yes-no whether I'm achieving that, but how do I know if I'm achieving it well, and how do I know what potential improvements are impactful?
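One way to get past anecdotes here is a synthetic probe: periodically create a marker event on one side and measure how long the sync takes to surface it on the other. A rough sketch, assuming hypothetical client wrappers with create_event and has_event methods rather than any real Google or Outlook SDK:

```python
# Synthetic round-trip probe: create a marker event on one calendar and time
# how long until it appears on the other. `source` and `target` are assumed to
# be thin wrappers you own, not a specific vendor SDK.
import time
import uuid

POLL_SECONDS = 5
GIVE_UP_SECONDS = 30 * 60  # past this, record a failure instead of a latency

def measure_sync_latency(source, target) -> float | None:
    marker = f"sync-probe-{uuid.uuid4()}"
    source.create_event(title=marker)
    started = time.monotonic()
    while time.monotonic() - started < GIVE_UP_SECONDS:
        if target.has_event(title=marker):
            return time.monotonic() - started
        time.sleep(POLL_SECONDS)
    return None  # the event never arrived; that's an incident, not a data point
```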

Finding the core attributes to optimize

For that, let's brainstorm some attributes that I can build fitness functions around.

  • How long can the sync take for a single appointment before the user becomes frustrated?
  • How long can the sync take before the probability of accidentally scheduling a conflict rises?
  • How long can it take for a whole calendar to sync in bulk for the first time?
  • How far ahead and behind does the user expect to be able to see?
  • How long can a sync problem last before you need to proactively report it to a customer? And how quickly and exactly can you detect and classify a problem?
  • What are the common pitfalls (like, say, accidentally syncing duplicate copies of appointments) that you need to be able to detect and mitigate? These may be caused by bugs, but they're the kinds of bugs that live in vendors' or partners' systems, or that are easy to introduce and intermittent enough to be hard to catch in QA.

If these also look like feature requirements, there's a good reason for that. You're coming up with the measurements for core attributes or features of the system that determine its fitness for purpose. And, by derivation, you affect the end user's satisfaction and their likelihood to tell others, "Yeah, this calendar's the one you need."

Deriving the fitness functions

Finding the hard numbers and friction points above requires product research. This is why market-engaged product engineers are so powerful in your organization relative to "pure techie" engineers or contractors. This is engineering-focused product research: it requires knowing enough about the guts of the system you're building and maintaining, and enough about who's using the product and why, to ask the right questions.

You can't just assume you know the number. You'll either frustrate customers or you'll over-optimize and waste time and money. So you should have your engineers work directly with customers or market research data to come up with the right answers. For the purposes of continuing our illustration, here are some made-up answers to these questions. Yours would vary by market:

  • Users will wait 15 minutes for their appointment to sync before they get frustrated enough to call support.
  • But they'd ideally like to see it in < 60 seconds, and they stop noticing improvements at 15 seconds.
  • Duplicate appointments are very bad, especially with recurrences. No user should see more than 5 duplicated appointments in a given month before we alert support of a problem.
  • If Outlook or Google goes down for a customer or globally for more than 2 minutes, we should reach out to support and let them know. Additionally, we need to be able to estimate recovery time once it comes back up.
  • Outlook releases feature or security updates we have to account for every 6 months on average.
  • Google does the same every 3-4 months.
  • The cost of operating and maintaining the service should not exceed 5% of our overall COGS budget.

Now we have numbers. They're affected by the number of users who opt in to calendar sync, and the rate of change, outage, and bugs of the external systems we connect to, and the rate of uncaught bugs we introduce. And they all ultimately feed into NPS, CSAT, and our ability to renew and retain users.
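To keep the illustration going, here's a sketch of a few of those made-up thresholds expressed as automated checks. The get_* functions are stand-ins for queries against whatever metrics store you actually use.

```python
# The example thresholds above, expressed as checks you could run on a schedule.
# The get_* functions are placeholders for real queries against your metrics store.

def get_p95_sync_latency_seconds() -> float:
    return 42.0  # placeholder value

def get_duplicate_appointments_this_month(user_id: str) -> int:
    return 0  # placeholder value

def get_provider_downtime_seconds(provider: str) -> float:
    return 0.0  # placeholder value

def sync_latency_in_spec() -> bool:
    # Target is under 60 seconds; 15 minutes is the point users call support.
    return get_p95_sync_latency_seconds() <= 60

def duplicates_in_spec(user_id: str) -> bool:
    # No user should see more than 5 duplicated appointments in a month.
    return get_duplicate_appointments_this_month(user_id) <= 5

def provider_outage_needs_escalation(provider: str) -> bool:
    # Outlook or Google down for more than 2 minutes means proactively telling support.
    return get_provider_downtime_seconds(provider) > 120
```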

Over time you learn things like "every 10,000 users we add to the system means another cycle of optimizing queries and queues," and "our logs and alerts are becoming hard to monitor and we need to change strategies in about a quarter," and "every time we change this aspect of the system it takes us a month to get it right."

How does this affect software architecture?

You learn which things about your systems you have to modify the most, and if your engineers are doing their jobs well, your software architecture bends to make those things more modifiable and to give you longer lead times to anticipate change.

Taking engineering spend and lead time into account on a new product, your queueing system for your calendar sync probably started in Postgres or MongoDB. It works well to start with, but as you hit 100,000 calendars or so, you start to see that the queues aren't keeping up. You need to change queueing systems or find some way to scale the data store.
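For what "started in Postgres" often looks like in practice: a jobs table that workers poll with SELECT ... FOR UPDATE SKIP LOCKED. It's a perfectly good starting point, and it's also exactly the kind of thing that stops keeping up at scale. A sketch, assuming a hypothetical sync_jobs table and psycopg:

```python
# A common minimal Postgres-backed queue: workers atomically claim one pending
# job at a time. Assumes a sync_jobs(id, payload, status, claimed_at) table.
import psycopg

CLAIM_NEXT_JOB = """
    UPDATE sync_jobs
       SET status = 'in_progress', claimed_at = now()
     WHERE id = (
           SELECT id
             FROM sync_jobs
            WHERE status = 'pending'
            ORDER BY id
            LIMIT 1
              FOR UPDATE SKIP LOCKED
     )
 RETURNING id, payload;
"""

def claim_next_job(conn: psycopg.Connection):
    """Claim the next pending sync job, or return None if the queue is idle."""
    with conn.cursor() as cur:
        cur.execute(CLAIM_NEXT_JOB)
        row = cur.fetchone()
    conn.commit()
    return row
```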

Now if you've defined your fitness functions and used them to inform your monitoring, then you have an idea of how long you have before you'll drop out of spec on that 60-second appointment sync time. The longer the lead time, the better and more long-term a decision you can make architecturally.
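A crude but useful way to get that lead-time number out of your monitoring is to fit a trend to the latency metric and extrapolate forward to the 60-second spec. A sketch, assuming you can export a daily series of p95 sync latencies; the linear fit is deliberately naive:

```python
# Naive lead-time estimate: fit a line to recent daily p95 sync latency and
# extrapolate forward to the 60-second spec. Real growth is rarely linear,
# but even a rough number tells you whether you have months, weeks, or days.
from statistics import linear_regression  # Python 3.10+

SPEC_SECONDS = 60.0

def days_until_out_of_spec(daily_p95_seconds: list[float]) -> float | None:
    days = list(range(len(daily_p95_seconds)))
    slope, intercept = linear_regression(days, daily_p95_seconds)
    if slope <= 0:
        return None  # flat or improving; no projected breach
    breach_day = (SPEC_SECONDS - intercept) / slope
    return max(0.0, breach_day - days[-1])

# Example: p95 creeping up ~1.5s/day from a 30s baseline leaves about 11 days.
print(days_until_out_of_spec([30 + 1.5 * d for d in range(10)]))
```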

If you don't define your functions and wait until you have days left, then you're probably doubling the size of your database instance and hoping for the best. Some engineer hacks in a way to use the read replica you weren't using and buys another few months before you drop out of spec again, and you're never sure when that will be. Whereas if you had two months of lead time, you could have switched to SQS or Kafka and been good for a year or two before you needed to strategize about scale again.

The same goes for changes to Outlook or Google Calendar. They introduce some change and an API parameter you're using is going away. If you haven't worked to make that interface with the outside world easy to modify, then you have a pair of engineers working back-to-back 60-hour weeks to change the implementation and QA it, and they still release that bug that adds duplicate appointments to the calendar, fixing one crisis and causing another, compounding the hacks that now exist in the system and make it harder to modify.
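Making that interface with the outside world easy to modify usually means keeping provider-specific translation behind one narrow seam, so a deprecated Outlook or Google parameter means changing one adapter instead of the whole sync pipeline. A minimal sketch of that shape; the class and method names are illustrative, not any real SDK:

```python
# Keep each vendor's API details behind one narrow interface so an upstream
# change is confined to the matching adapter. Names here are illustrative.
from dataclasses import dataclass
from typing import Protocol

@dataclass
class CalendarEvent:
    external_id: str
    title: str
    start_utc: str
    end_utc: str

class CalendarProvider(Protocol):
    def list_events(self, since_utc: str) -> list[CalendarEvent]: ...
    def upsert_event(self, event: CalendarEvent) -> None: ...

class GoogleCalendarAdapter:
    """All Google-specific parameters and quirks live here, so a deprecated
    field means editing one class, not the whole pipeline."""

    def list_events(self, since_utc: str) -> list[CalendarEvent]:
        raise NotImplementedError("call the Google API and map its response here")

    def upsert_event(self, event: CalendarEvent) -> None:
        raise NotImplementedError("translate CalendarEvent into a Google API call here")

def sync_one_way(source: CalendarProvider, target: CalendarProvider, since_utc: str) -> None:
    # The pipeline only ever sees the narrow interface, never a vendor SDK.
    for event in source.list_events(since_utc):
        target.upsert_event(event)
```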

Operating under crisis leads to tech debt. Your system will become a stovepipe of individual hacks the longer you operate without a clear idea of how the system is scaling to environmental changes and user growth. And eventually those hacks will compound to the point that you can't modify the system, support tickets take forever, and customers crush your retention numbers in a fit of rage.

Saving time and money by defining fitness well

Lastly, having your fitness functions defined means you know when you're done for now. And you also know what you don't have to do.

Your calendar sync service only ever does one thing. Sure, there's a remote chance that a new calendar product will come along and you'll need to add sync for that, but really the likelihood is that your customers are using one or another major calendar system. Therefore you don't have to spend time planning for undue growth of the codebase. You can derive tests and QA for Google and Outlook and ignore Apple and other lesser calendar products. You can move engineers onto other projects until it's clear you're a month or so out from one of your core metrics going out of spec.

And you can listen to a junior engineer's excited ramblings about making a deep change to the system that will "make everything better!" When you do, you'll be able to tell yourself whether that's really likely or even necessary, and then redirect that engineer to something more useful using the standard that all of engineering has already agreed to.

Wrapping up

The point is this: understanding the ecosystem your software operates in, its current state, and how it typically changes leads to understanding your software's fitness for purpose. Putting concrete numbers, classifiers, and functions around that understanding allows you to set standards for engineering to aspire to, and shapes the evolution of the software around changes to that ecosystem.

By aligning architecture with environmental change, rather than with general software trends your engineers want to adopt or assumptions they make about a market and its users, you get clear start and stop points for modifying a system and points of likely change you can make more flexible as you build them.

Evolutionary architecture is a powerful way of thinking about building software writ large, and it gives you a sound set of principles to lead a department with, whether it's 5 people or 150.