Microservice Foundations: Commands

For me, all of software architecture, and in fact product-development organization, exists in an ecosystem. The parameters of this ecosystem change over time with respect to your products, related products, and the tech landscape overall. And the great challenge of a tech organization writ large is to evolve.

I'm writing my method to develop an evolutionary software architecture as a series of blog posts instead of a library or framework, because I think that you should extend the principles for your environment rather than try to shoehorn my exact framework into your ecosystem.

In this first post, I'm going to introduce you to my current iteration of a Command base class and how it serves the purpose of being a core building block of a massively scalable microservice system.

Motivation

I want a core idea of a microservice that I can build into anything. I want to be able to hire someone, promote them, move them across projects or split or merge projects without them having to relearn everything. I want to be able to incorporate a new company's tech as quickly and cleanly as possible in the event of an M&A.

There's a technical practice and infrastructure to doing that. There's a method to follow that allows you to do this over and over and over again and reproduce good results. One of the core ideas in my arsenal is my approach to microservice design.

During my tenure at Teamworks as Principal Engineer, then VP and then Chief Architect, I helped the department grow from 12 engineers to hundreds. We refined the fundamental principles of microservice design into a bit of an art. I've been using these abstractions for years, although I daresay we've gotten better at them.

I should preface all this by saying that I think of the unit of "microservice" as a small collection of tightly related cogs focused on facilitating a core idea. I don't subscribe to the orthodoxy that says every function is a microservice, or that it all has to be serverless all the way down. I go with the casual notion that my microservice should avoid scope creep, and that I should be able to describe what it does to another technical person with a single-sentence thesis and maybe a paragraph of descriptive matter.

For my money, the best microservice "system" is Command Query Responsibility Segregation (CQRS). I don't think you need to go whole-hog into event sourcing or use separate databases for queries vs. mutations from the get-go. Sometimes you do. Sometimes you don't. Often you eventually need one of these things if your microservice becomes a critical piece of kit, and you don't want to write code up front that makes more work for you when that day comes.

Starting with the idea of commands (or mutations) and queries as your base unit of architecture affords you the ability to flex into these things. It also allows you to conveniently break up your microservice if it's starting to feel less "micro." We'll talk about Queries in another post. This is about Commands and how to build them.

💡
A quick aside about my stack: I write a lot of Python. I've been writing it since 1997 and I can do it in my sleep. But while these examples are in Python, and specifically use FastAPI and Pydantic because I prefer them, they are not the only way to achieve this. I'll try to explain what's going on under the hood, and then you can take that into your own framework or use mine.

The Command base class

Before I get into the code, let me tell you what I'm looking for in a realized Command:

  • Well-defined. The command needs to have a well-defined schema.
  • Serializable. Every command needs to serialize to and deserialize from its schema in whatever format (JSON, Protobuf).
  • Orderable. Every command needs to be able to be sequenced in the system.
  • Versioned. Having the ability to version commands means you can version an API, and if at some point you do forensics of "how did this data get into this state anyway?" having versioned commands means you can replay a command log reliably.
  • Composable. You should strive to write commands so they can be composed into macros without losing throughput for the user or the coder.

If your command base class has all these, it is also:

  • Auditable. You probably don't need to start with an audit trail for every command, but if your command class supports adding one, you get a lot of other useful things for free.

Now as promised, it's time for code. This is a Pydantic model meant to be used inside of a FastAPI-based microservice. First I'll show you the class, then I'll talk about how we achieve the above with it.

import abc
from datetime import datetime, timezone
from functools import cached_property

import pydantic as pyd
from fastapi import BackgroundTasks
from pydantic import computed_field
from uuid_utils import uuid7  # this has timestamp built in

# see https://jeffersonheard.ghost.io/production-python-setting-up-fastapi-to-talk-to-a-real-database/
from myapp.resources import Context

class Command(pyd.BaseModel, abc.ABC):

    # update this when you update the class
    version: str = "1.0"

    # this will automatically be generated when the class is instantiated
    # in the request flow.
    event_id: str = pyd.Field(default_factory=lambda: uuid7().hex)

    # for orderability. the first 48 bits (12 hex digits) of a UUIDv7 are
    # its unix timestamp in milliseconds.
    @computed_field
    @cached_property
    def origination_ts(self) -> datetime:
        return datetime.fromtimestamp(int(self.event_id[:12], 16) / 1000, tz=timezone.utc)

    # for log streams
    @computed_field
    @cached_property
    def fq_event_id(self) -> str:
        return f"{self.typename}:{self.event_id}"

    # this serializes the type name when it's written out for audit.
    @computed_field
    @property
    def typename(self) -> str:
        return self.__class__.__name__

    async def check_perm(self, ctx: Context, raise_on_fail: bool = True) -> bool:
        return True

    @abc.abstractmethod
    async def cmd(self, ctx: Context, bkg: BackgroundTasks):
        pass
Note, for the definition of Context, see my previous post on databases.

"But this is so simple, Jeff!" Yes, it is. A lot of the interesting functionality is covered by Pydantic, but here we have a command that is treated like an event. It has a versioned schema, a total ordering, and can be serialized in a log stream to JSON or protobuf or msgpack. Then it can be deserialized and executed the same way each time regardless of whether the code has changed (assuming you have the need to do that, and thus are doing the required upkeep of storing old versions) and regardless of the number of log streams you're collating.

For many microservices, this is as far as you have to go. You don't really need the log stream or event sourcing to start with, and when you're doing something new it's almost always more setup than it's worth. But this command class affords you a ton of things you definitely do need starting out. Even using it as a traditional API endpoint, you get:

  • Good logging metadata with typename, origination_ts, and fq_event_id.
  • Complete schema, including API docs for the endpoint (this comes from FastAPI and Pydantic working together)
  • Good refactorability, both from being able to compose command classes and because, if your units of design are commands and queries, you can break a few selected commands and queries out and create a new microservice with a minimum of rewritten code and boilerplate.
  • Good evolvability. You can collect related commands into a single file; you can break them up into single files in a package; and you can split a package off into a new microservice with almost no extra work. When my team and I build services this way, it's really easy to get the lay of the land when you hop from one microservice to another.
💡
Although we have an abstract cmd method, if you want to switch to stream-processing-style execution at some point, you can just factor that out into its own processing function or write a bunch of non-cmd stream processors to read the event log. It requires more work to monitor things written that way, but there are good reasons to go that way, such as when your single Command message is ubiquitous across a lot of different asynchronous processes.
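
To make that concrete, here's a rough sketch of that style. The handler and the stream source are stand-ins for your own registry and log consumer:

import json

# hypothetical sketch: non-cmd stream processors keyed by the serialized
# typename. handle_alter_dashboard and the stream argument are stand-ins.
async def handle_alter_dashboard(record: dict) -> None:
    print("saw", record["fq_event_id"])

HANDLERS = {"AlterDashboard": handle_alter_dashboard}

async def process_stream(stream) -> None:
    # stream is any async iterable of serialized commands (Kafka, Kinesis, ...)
    async for raw in stream:
        record = json.loads(raw)
        if handler := HANDLERS.get(record["typename"]):
            await handler(record)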

The one thing that isn't well-defined in the abstract base class is composability, because it depends more on practice than framework. For that, we want to see what a concrete Command class will look like.

Writing a real command

This is a simplified version of a real command in Tableroq, my new solopreneur project. I've left out the validators and gotten rid of some of the fields to condense it, but the essence of what a command is comes through.

I realize it's a relatively simple CRUD operation, but that's because most mutations still are that. Even if you're married to CRUD, you will eventually stray from the spirit of it for one thing or another – like, say, mass-altering permissions. Then you either fit the proverbial square peg into the round CRUD hole (ouch!), or you write some RPC-like API endpoint and feel bad about it because it doesn't look like all your other endpoints. With this abstraction, everything's just a Command.

from typing import Any

from fastapi import BackgroundTasks, HTTPException
from pydantic import Field
from sqlalchemy import update

# app-specific models and helpers; these module paths are illustrative
from trq.commands import Command
from trq.data import Context, alt, fetch_current_object_or_404, fresh_id
from trq.models import Dashboard, DashboardDefinition, DatasetDefinition

class AlterDashboard(Command):
    "Creates a new version of an existing dashboard with changes."

    name: str | None = Field(default=None, description="...")
    description: str | None = Field(default=None, description="...")
    num_cols_xs: int | None = Field(default=None, description="...")
    num_cols_sm: int | None = Field(default=None, description="...")
    num_cols_md: int | None = Field(default=None, description="...")
    num_cols_lg: int | None = Field(default=None, description="...")
    num_cols_xl: int | None = Field(default=None, description="...")
    aspect_ratio: float | None = Field(default=None, description="...")

    dataset_definitions: dict[str, DatasetDefinition]
    charts: list[dict[str, Any]]
    app_metadata: dict | None = None

    async def check_perm(self, ctx: Context, raise_on_fail=True) -> bool:
        if not ctx.perms.check_permission(
          "dashboard", ctx.object_id, "can_modify", "user", ctx.user_id
        ):
            if raise_on_fail:
                raise HTTPException(403, "Forbidden")
            else:
                return False
        return True

    async def target(self, ctx: Context) -> Dashboard:
        return await fetch_current_object_or_404(ctx, Dashboard)

    async def cmd(self, ctx: Context, bkg: BackgroundTasks):
        dash0 = await self.target(ctx)

        # demote existing versions; the new row below becomes version 0
        await ctx.writer.execute(
            update(Dashboard)
            .where(
                Dashboard.slug == dash0.slug,
                Dashboard.customer_id == ctx.customer_id,
            )
            .values(version=Dashboard.version - 1)
        )

        # create new version
        frm = Dashboard(
            id=fresh_id(Dashboard, ctx.customer_id),
            customer_id=ctx.customer_id,
            name=self.name or dash0.name,
            slug=dash0.slug,
            definition=DashboardDefinition(
                dataset_definitions=self.dataset_definitions,
                chart_definitions=self.charts,
                num_cols_xs=alt(self, dash0.definition, 'num_cols_xs'),
                num_cols_sm=alt(self, dash0.definition, 'num_cols_sm'),
                num_cols_md=alt(self, dash0.definition, 'num_cols_md'),
                num_cols_lg=alt(self, dash0.definition, 'num_cols_lg'),
                num_cols_xl=alt(self, dash0.definition, 'num_cols_xl'),
                aspect_ratio=alt(self, dash0.definition, 'aspect_ratio'),
            ),
            app_metadata=alt(self, dash0, 'app_metadata'),
            created_by=dash0.created_by,
            updated_by=ctx.user_id,
            created_at=dash0.created_at,
            updated_at=self.origination_ts,
            version=0
        )

        # note we don't commit; this becomes important.
        ctx.writer.add(frm)
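
One thing I use above but haven't shown is alt. Inferring from the call sites, it prefers the value supplied on the command and falls back to the existing object's; a minimal version of such a helper might be:

# a minimal sketch of an alt-style helper, inferred from the call sites above:
# take the command's value if one was supplied, else keep the existing one.
def alt(cmd, existing, field: str):
    value = getattr(cmd, field, None)
    return value if value is not None else getattr(existing, field, None)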

I keep talking about composability. This is where practice comes in. The first thing I do is separate a command from the request and resource context in which it was made.

I particularly avoid committing in atomic commands. I leave session creation and transaction management to the API endpoint or process, or to a "macro" command; there's a sketch of one after the list below. This way I can string commands together and compose them into macros without having API calls that involve, say, 12-100 database transactions. Imagine that we have a SaveUpdatesAndPublish command at some point. It's a Create or an Alter, followed by a Publish, followed by a NotifySubscribers. If I've written each command to use a separate transaction:

  • I've needlessly replicated the most expensive part of the command three times.
  • I've also added a "partial commit" scenario that could turn into a hard-to-track-down error later if a bug in one of the commands or a network problem causes the sequence to fail partway through. You just have to write so much more "error correcting" code if the transaction doesn't fail in one go.
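
Written under this discipline instead, the macro is just another Command whose cmd runs its children inside the one transaction its caller opened. Publish and NotifySubscribers here are hypothetical commands; only the shape matters:

# hypothetical macro command: composes atomic commands inside the single
# transaction managed by the caller (the endpoint's async with block).
class SaveUpdatesAndPublish(Command):
    alter: AlterDashboard
    publish: Publish              # hypothetical atomic commands
    notify: NotifySubscribers

    async def check_perm(self, ctx: Context, raise_on_fail=True) -> bool:
        for sub in (self.alter, self.publish, self.notify):
            if not await sub.check_perm(ctx, raise_on_fail):
                return False
        return True

    async def cmd(self, ctx: Context, bkg: BackgroundTasks):
        # no commits in here: one transaction covers all three,
        # so they all land or none do
        await self.alter.cmd(ctx, bkg)
        await self.publish.cmd(ctx, bkg)
        await self.notify.cmd(ctx, bkg)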

Putting it together in FastAPI

import logging
from typing import Annotated

from fastapi import BackgroundTasks, Depends

from trq.data import Context
from trq.commands.dashboards import AlterDashboard

from .auth import get_user_context
from .routes import app_router

log = logging.getLogger(__name__)

# two flavors of dependency: Ctx resolves a bare Context; TokenCtx also runs
# auth via get_user_context
Ctx = Annotated[Context, Depends(Context())]
TokenCtx = Annotated[Context, Depends(get_user_context)]

@app_router.api_route(
    "/dash.alter/{object_id:path}",
    methods=["HEAD", "POST", "OPTIONS"],
)
async def dashboard(
    cmd: AlterDashboard,
    ctx0: TokenCtx,
    bkg: BackgroundTasks,
):
    log.info(
        "%s from API. Altering dashboard %s for customer %s by user %s",
        cmd.fq_event_id,
        ctx0.object_id,
        ctx0.customer_id,
        ctx0.user_id,
    )
    # entering the context begins the transaction; a clean exit commits it
    async with ctx0 as ctx:
        await cmd.check_perm(ctx)
        await cmd.cmd(ctx, bkg)
    log.info("%s completed successfully.", cmd.fq_event_id)

So there's a lot going on here in a small space. We've taken our command, yes, and we're executing it, but let's spend a bit of time on what surrounds it. Context in particular does a lot of work, as do Annotated and Depends.

Effectively, what's happening here is that:

  • FastAPI constructs the request
  • FastAPI's dependency injection helps construct the Context given the object_id in the path and the data about who the user is and what customer they're acting on behalf of from get_user_context (OAuth, JWT, whatever auth happens, happens inside of there; there's a sketch of such a dependency after this list).
  • Because FastAPI knows what Pydantic classes are, it automatically validates and constructs our AlterDashboard command for us from the JSON post body.
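
For illustration only, since the real Context comes from my databases post and your auth will differ, a token-based dependency can be as small as this. decode_token is a hypothetical stand-in:

# hypothetical sketch of a token-based context dependency. decode_token and
# the Context constructor arguments are stand-ins for your own auth and the
# Context from the databases post.
from fastapi import HTTPException, Request

from trq.data import Context

async def get_user_context(request: Request, object_id: str | None = None) -> Context:
    token = request.headers.get("Authorization")
    if token is None:
        raise HTTPException(401, "Unauthorized")
    claims = decode_token(token)  # hypothetical: whatever auth applies
    return Context(
        user_id=claims["sub"],
        customer_id=claims["org"],
        object_id=object_id,  # FastAPI binds this from the path parameter
    )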

So validation and unauthorized conditions are automatically handled for us, as is acquiring the resources needed to perform our alter, like the database connection. Now we just need to:

  • Begin a transaction.
  • Check object existence.
  • Check the user's permission to alter the object.
  • Run our command.

To begin a transaction, we use async with ctx0 as ctx. To check existence, see the target method in AlterDashboard; if that fails, it raises a 404. To check permissions, we hit the check_perm method; if that fails, it raises a 403. Then we await our cmd.cmd(ctx, bkg).
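
If you want to see the whole path exercised from the outside, here's a sketch using FastAPI's test client. The app module, token, and IDs are made up:

# hitting the endpoint end to end with FastAPI's test client. trq.main and
# the request values are illustrative.
from fastapi.testclient import TestClient

from trq.main import app  # hypothetical: wherever app_router gets mounted

client = TestClient(app)
resp = client.post(
    "/dash.alter/dash_123",
    headers={"Authorization": "Bearer ..."},
    json={"name": "Quarterly KPIs", "dataset_definitions": {}, "charts": []},
)
assert resp.status_code == 200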

And that, my friends, is our first command from soup to nuts.

Wrapping it up

I realize all this has been in the weeds and technical, so let me take a moment to bring it back.

The point of all this is to balance productivity, simplicity, flexibility, and power for the developer. The most expensive and hardest-to-plan-for resource in your technical organization is never infrastructure. It's not subscriptions. It's not the five extra pods you run in EKS to get the same response times you'd have if you'd written clever code in your favorite framework. It's people.

You will, as an architect or a manager or an executive or a team leader, spend more time, money, and effort getting people to row their boat in the same direction than you ever will writing code. Starting with a strong, clear foundation based on good principles will pay dividends throughout the life of the software you steward. It will evolve faster and more reliably, and people will be happier when you value clear over clever.

And that's the goal of this series: to provide a framework that makes the engineering clear. I hope this gives you a good taste of what's to come.