Repositories and their true purpose

Lately, posts and tweets about the Repository pattern have resurfaced yet again. It's seemingly impossible to predict when, where or why such "spicy topics" will rear their heads. However, the spark that ignites them is almost always the following question (or something similar):

How many times have you replaced the underlying database implementation because of your use of the Repository pattern?

— Random Techfluencer

That's why, in this blog post, I'd like to provide some further clarity regarding this totally misunderstood software design pattern and why the #1 argument (the question above) against its use is actually insignificant and almost irrelevant.

Defining a Repository

First and foremost, let's define what a Repository actually is. Martin Fowler's Patterns of Enterprise Application Architecture (PoEAA) defines the Repository pattern as follows:

Mediates between the domain and data mapping layers using a collection-like interface for accessing domain objects.

It is of paramount importance that we establish the facts below before moving on to the other sections.

(...) accessing domain objects

Domain objects are actors in the domain layer that possess an authoritative set of business capabilities for carrying out certain tasks. These capabilities or behaviors are exposed as public methods on said actors in order to make consistent state changes. Domain objects are also known as write models, entities or as aggregates in DDD lingo.

Note: Aggregates and their exact purpose are out of scope for this blog post. However, I can recommend this short, concise write-up by Shawn McCool if you'd like to learn more about them.

You've probably heard the term "business logic" numerous times by now. Well, these models are the ones that actually determine what that "business logic" should entail.

(...) collection-like interface (...)

In an ideal world, a persistence layer for the entities would not be needed as everything can be added and removed from an in-memory collection. For example:

final class Users
{
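    private array $users = [];

    private function __construct(User ...$users)
    {
        // Key every user by its id so that find() and remove() can look it up.
        foreach ($users as $user) {
            $this->add($user);
        }
    }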

    public static function empty(): self
    {
        return new self();
    }

    public function add(User $user): void
    {
        $this->users[$user->id()->asString()] = $user;
    }

    public function find(UserId $id): User
    {
        return $this->users[$id->asString()] 
            ?? throw CouldNotFindUser::becauseItIsMissing();
    }

    public function remove(User $user): void
    {
        unset($this->users[$user->id()->asString()]);
    }
}

Unfortunately, the real world is often rather different from the ideal world. PHP has its (in)famous request-response lifecycle which results in the loss of every bit of relevant context once an incoming request has been handled and a response has been sent to the client. A Repository assists us in approximating this ideal world by giving the illusion that we can perform our operations on in-memory collections that seemingly live forever. An example Repository could be:

interface UserRepository
{
    public function find(UserId $id): User;
    public function save(User $user): void;
    public function remove(User $user): void;
}

Please take note of the minimal signature of this interface.

Collection-oriented vs. persistence-oriented

Vaughn Vernon, the author of The Big Red Book (iDDD), mentions collection-oriented and persistence-oriented Repository implementations in Chapter 12. I'd like to briefly mention this fact, because this is the reason why you might see different "flavors" of Repository implementations in the wild. The difference lies primarily in the semantics.

A collection-oriented design can be considered a traditional design because of adherence to an in-memory collection's standard interface.

$users->add($user);

A persistence-oriented design is also known as a save-based Repository.

$users->save($user);

Personally, I prefer the persistence orientation due to PHP's ephemeral nature.
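To make the semantic difference a bit more tangible, here is a minimal sketch. Everything in it is an assumption made for illustration and is not taken from iDDD: the interface names `CollectionOrientedUsers` and `PersistenceOrientedUsers`, the stripped-down `User` class with plain string ids, and the in-memory stand-in that stores copies so that unsaved changes get lost, roughly like they would at the end of a PHP request.

```php
<?php

declare(strict_types=1);

// Hypothetical minimal entity, only for illustration.
final class User
{
    public function __construct(
        public readonly string $id,
        public string $name,
    ) {}
}

// Collection-oriented: mimics a set. An entity is added once and later
// changes are expected to be picked up implicitly (e.g. by change tracking).
interface CollectionOrientedUsers
{
    public function add(User $user): void;
    public function find(string $id): ?User;
    public function remove(User $user): void;
}

// Persistence-oriented: every new or changed entity is saved explicitly.
interface PersistenceOrientedUsers
{
    public function save(User $user): void;
    public function find(string $id): ?User;
    public function remove(User $user): void;
}

// In-memory stand-in that keeps copies, so only saved state is remembered.
final class InMemoryPersistenceOrientedUsers implements PersistenceOrientedUsers
{
    /** @var array<string, User> */
    private array $users = [];

    public function save(User $user): void
    {
        $this->users[$user->id] = clone $user;
    }

    public function find(string $id): ?User
    {
        return isset($this->users[$id]) ? clone $this->users[$id] : null;
    }

    public function remove(User $user): void
    {
        unset($this->users[$user->id]);
    }
}

$users = new InMemoryPersistenceOrientedUsers();
$user = new User('1', 'Alice');
$users->save($user);

$user->name = 'Bob';                       // changed, but not saved yet
$nameBeforeSave = $users->find('1')->name; // the repository still knows the old state

$users->save($user);                       // explicit save
$nameAfterSave = $users->find('1')->name;  // now the change is remembered
```

With a collection-oriented implementation, the second `save()` call would not exist: the implementation itself would be responsible for noticing that the entity has changed.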

Authoritative collection

A Repository is the authoritative collection for interacting with a specific type of entity. It can be used to store, filter, retrieve and remove entities based on the application's needs. In other words, we delegate the task of remembering the existence of a certain entity to the Repository.

Explained by example: publishing a post

Let's take a look at a use case to solidify our understanding.

final readonly class PublishPostHandler
{
    public function __construct(
        private PostRepository $posts,
    ) {}

    public function handle(PublishPost $command): void
    {
        $post = $this->posts->find($command->id);
        
        $post->publish();
        
        $this->posts->save($post);
    }
}

This use case assumes that a Post entity must already exist in order to publish it. Since the PostRepository is the authoritative collection for dealing with these Post entities, we can ask it to provide us with a Post entity for the given PostId:

$post = $this->posts->find($command->id);

Once we've received a Post instance, we carry on with the task we were supposed to carry out in the first place:

$post->publish();

The publish method exposes the behavior that is responsible for actually "publishing a blog post". If we dig a little deeper, we can see that it is also enforcing crucial invariants:

public function publish(): void
{
    if ($this->isPublished()) {
        throw CouldNotPublish::becauseAlreadyPublished();
    } elseif ($this->summary->isEmpty()) {
        throw CouldNotPublish::becauseSummaryIsMissing();
    } elseif ($this->body->isEmpty()) {
        throw CouldNotPublish::becauseBodyIsMissing();
    } elseif ($this->tags->isEmpty()) {
        throw CouldNotPublish::becauseTagsAreMissing();
    }

    // omitted for brevity
}

If everything goes well, we move on and tell the PostRepository to remember the Post in its current state:

$this->posts->save($post);

Next time we interact with the PostRepository and ask for the exact same entity, we can expect to receive the Post in this state. The PostRepository will ensure that this condition is met at all times. This is a Repository's single most important responsibility, after all. The PostRepository also clearly defines a boundary around the application service, which yields further benefits such as isolated testability and a (core) domain that stays oblivious to its surroundings.

Persistence agnosticity

Let's quickly recall Random Techfluencer's original statement:

How many times have you replaced the underlying database implementation because of your use of the Repository pattern?

Random Techfluencer is actually discouraging the use of the Repository because "how many times are you going to swap out data sources?".

Now, please let me make something absolutely clear. The swapping of the data source is a puny argument whether you use it to promote or obstruct the use of the Repository. It does not matter which camp (pro / contra) you belong to. Do you really want to think about swapping out data sources as you are designing the domain? This kind of thinking is - in my humble opinion - flawed.

Ad hoc persistence swapping

The fact that you can easily swap out data sources later on is nothing but a bonus that you are awarded by carefully placing boundaries around your application. This is "boundaries in software design 101" and trying to use this as the main selling point every single time does no one any good.

You can start out your application by using simple JSON files on disk and gradually evolve towards "beefier" solutions as different needs emerge.

final class UserRepositoryUsingJsonFilesOnDisk implements UserRepository
{
    public function save(User $user): void
    {
        // save a user
    }

    public function find(UserId $id): User
    {
        // find a user
    }

    public function remove(User $user): void
    {
        // remove a user... you get the point
    }
}

Different features can evolve independently of each other and infrastructural costs can be kept to a minimum. Why use an expensive cloud-hosted solution for everything if 90% of the features are well-suited for a storage mechanism like SQLite? Why keep using MySQL for every single feature if 10% of the features are better suited for Elastic or Riak?
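For the curious, a JSON-files-on-disk implementation could be sketched roughly as follows. Treat it as an illustration under stated assumptions rather than a production-ready driver: the `User` entity is reduced to an id and an email address, one file is written per user, the file name is a hash of the id, and the target directory is assumed to exist and be writable.

```php
<?php

declare(strict_types=1);

// Hypothetical, heavily simplified domain types.
final class UserId
{
    public function __construct(private string $value) {}
    public function asString(): string { return $this->value; }
}

final class User
{
    public function __construct(private UserId $id, private string $email) {}
    public function id(): UserId { return $this->id; }
    public function email(): string { return $this->email; }
}

final class CouldNotFindUser extends RuntimeException
{
    public static function becauseItIsMissing(): self
    {
        return new self('User is missing.');
    }
}

interface UserRepository
{
    public function find(UserId $id): User;
    public function save(User $user): void;
    public function remove(User $user): void;
}

// One JSON file per user inside $directory.
final class UserRepositoryUsingJsonFilesOnDisk implements UserRepository
{
    public function __construct(private string $directory) {}

    public function find(UserId $id): User
    {
        $path = $this->pathFor($id);
        if (!is_file($path)) {
            throw CouldNotFindUser::becauseItIsMissing();
        }

        $data = json_decode(file_get_contents($path), true, 512, JSON_THROW_ON_ERROR);

        return new User(new UserId($data['id']), $data['email']);
    }

    public function save(User $user): void
    {
        $json = json_encode(
            ['id' => $user->id()->asString(), 'email' => $user->email()],
            JSON_THROW_ON_ERROR,
        );
        file_put_contents($this->pathFor($user->id()), $json, LOCK_EX);
    }

    public function remove(User $user): void
    {
        $path = $this->pathFor($user->id());
        if (is_file($path)) {
            unlink($path);
        }
    }

    private function pathFor(UserId $id): string
    {
        // Hash the id so that arbitrary id strings cannot escape the directory.
        return $this->directory . '/' . sha1($id->asString()) . '.json';
    }
}

// Usage: a throwaway directory inside the system's temp folder.
$directory = sys_get_temp_dir() . '/users-' . bin2hex(random_bytes(4));
mkdir($directory);

$users = new UserRepositoryUsingJsonFilesOnDisk($directory);
$users->save(new User(new UserId('42'), 'jane@example.com'));
$found = $users->find(new UserId('42'));
```

Any code that type-hints `UserRepository` stays untouched when this class is later replaced by, say, a SQLite-backed implementation.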

Testability

In a similar vein to persistence agnosticity, testability is another bonus we are awarded by carefully placing boundaries around our application. The real thing can keep using a DoctrinePostRepository, while the tests can use an InMemoryPostRepository, which keeps them lightning fast.

The test for the "publishing a blog post" use case, that was mentioned previously, might look as follows:

// Arrange
$post = $this->aPost(['id' => PostId::fromInt($id = 123)]); // draft
$repository = $this->aPostRepository([$post]); // in-memory repository
$handler = new PublishPostHandler($repository);

// Act
$handler->handle(new PublishPost($id));

// Assert
$this->assertTrue($repository->wasSaved($post));
$this->assertTrue($post->isPublished());

In this example, we're testing the application service represented by the command handler. We don't need to test that the repository stored the data in the database or wherever else. We need to test the specific behavior of the handler, which is to publish the Post object and pass it into the repository to preserve its state.
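The in-memory repository behind such a test could look roughly like the sketch below. The `wasSaved()` helper only exists for the benefit of tests and is not part of the `PostRepository` interface. To keep the example short, I'm assuming plain integer ids instead of a `PostId` value object and a `Post` that has no invariants to enforce.

```php
<?php

declare(strict_types=1);

// Hypothetical, stripped-down entity.
final class Post
{
    private bool $published = false;

    public function __construct(private int $id) {}
    public function id(): int { return $this->id; }
    public function publish(): void { $this->published = true; }
    public function isPublished(): bool { return $this->published; }
}

interface PostRepository
{
    public function find(int $id): Post;
    public function save(Post $post): void;
}

final class InMemoryPostRepository implements PostRepository
{
    /** @var array<int, Post> */
    private array $posts = [];

    /** @var array<int, true> ids of the posts that were passed to save() */
    private array $saved = [];

    public function __construct(Post ...$posts)
    {
        foreach ($posts as $post) {
            $this->posts[$post->id()] = $post;
        }
    }

    public function find(int $id): Post
    {
        return $this->posts[$id]
            ?? throw new RuntimeException("Post {$id} is missing.");
    }

    public function save(Post $post): void
    {
        $this->posts[$post->id()] = $post;
        $this->saved[$post->id()] = true;
    }

    // Test-only helper: was save() called for this post?
    public function wasSaved(Post $post): bool
    {
        return isset($this->saved[$post->id()]);
    }
}

// What a handler would do with it:
$post = new Post(123);
$repository = new InMemoryPostRepository($post);

$found = $repository->find(123);
$found->publish();
$repository->save($found);
```

Seeding the repository through the constructor does not count as a save, so `wasSaved()` only reports what the code under test actually did.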

"This is not a big deal", you might rightfully say, "I can just hit persistence during my tests every single time". I'm not sure if you know someone who's worked on a project whose test suite was completely shut down because it just took way too long to go through the entire thing? I do know someone and that person is unfortunately me. Integration and System / E2E tests definitely have their place, but the sheer velocity and the fast feedback loop of unit tests is still highly desirable.

Alleviating performance issues

Performance is another reason as to why a Repository is often employed. It's not an uncommon scenario to have millions of instances of a certain entity type so we are kind of forced to offload this to an external data store.

Assume the following excerpt from an imaginary User entity:

public function changeEmail(Email $newEmail, Users $allUsers): void
{
    if ($allUsers->findByEmail($newEmail)) {
        throw CannotChangeEmail::becauseEmailIsAlreadyTaken();
    }
    
    $this->email = $newEmail;
}

The changeEmail behavior depends on a Users collection to determine whether the new email address can be used. The (imaginary) domain experts told us that an email change may not happen as long as there is another user in possession of that new email address.

This code will work just fine until we hit a certain number of users. The collection's sheer size will become a bottleneck for the lookups that must be performed in order to enforce invariants. We could fix this problem by injecting a UserRepository instead of passing every single User in existence via an in-memory Users collection.

public function changeEmail(Email $newEmail, UserRepository $users): void
{
    if ($users->findByEmail($newEmail)) {
        throw CannotChangeEmail::becauseEmailIsAlreadyTaken();
    }
    
    $this->email = $newEmail;
}

This way, the domain model will still be responsible for enforcing the invariants; but we had to trade the domain model's purity off against performance. Nonetheless, this is most definitely an acceptable trade-off.
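Here is a self-contained sketch of that idea. A few things in it are my own assumptions: email addresses and ids are plain strings, `findByEmail()` returns `null` when nobody owns the address, and a user re-submitting their own address is not treated as a conflict. The in-memory repository scans linearly; a database-backed implementation would typically answer the same question with an indexed lookup.

```php
<?php

declare(strict_types=1);

final class CannotChangeEmail extends DomainException
{
    public static function becauseEmailIsAlreadyTaken(): self
    {
        return new self('Email is already taken.');
    }
}

interface UserRepository
{
    public function findByEmail(string $email): ?User;
}

final class User
{
    public function __construct(private string $id, private string $email) {}
    public function id(): string { return $this->id; }
    public function email(): string { return $this->email; }

    public function changeEmail(string $newEmail, UserRepository $users): void
    {
        $owner = $users->findByEmail($newEmail);

        // Only another user owning the address blocks the change.
        if ($owner !== null && $owner->id() !== $this->id) {
            throw CannotChangeEmail::becauseEmailIsAlreadyTaken();
        }

        $this->email = $newEmail;
    }
}

final class InMemoryUserRepository implements UserRepository
{
    /** @var array<int, User> */
    private array $users;

    public function __construct(User ...$users)
    {
        $this->users = $users;
    }

    public function findByEmail(string $email): ?User
    {
        foreach ($this->users as $user) {
            if ($user->email() === $email) {
                return $user;
            }
        }

        return null;
    }
}

$alice = new User('1', 'alice@example.com');
$bob = new User('2', 'bob@example.com');
$users = new InMemoryUserRepository($alice, $bob);

$alice->changeEmail('alice@new.example', $users); // free address: accepted

$rejected = false;
try {
    $bob->changeEmail('alice@new.example', $users); // now owned by Alice
} catch (CannotChangeEmail $e) {
    $rejected = true;
}
```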

Command Query Responsibility Segregation

"I thought this blog post was about the Repository pattern? What's the deal with CQRS all of a sudden..?" Please let me explain.

Write models (commands)

Until now, we've seen how the Repository helps us with dealing with the lifecycle of domain objects. We established the fact that these domain objects are also known as write models / entities / aggregates that are responsible for performing state changes in a consistent manner. In other words, the aggregates represent a consistency boundary that must follow the business rules and apply them at all times in order to stay consistent. Naturally, these state changes always occur as a result of a command entering an application.

Read models (queries)

We need to ask ourselves whether we actually need to perform state changes or just need some data. Why would we "just need some data"? Well... you guessed it right: for queries. CQRS is a dead simple pattern for separating the logical models for read and write concerns—that's it. It has nothing to do with event sourcing / eventual consistency / separated data stores etc. These buzzwords are often thrown into the mix by people who don't really know what they're talking about. Use cases that involve queries will benefit from better optimized, dedicated read models.

Explained by example: displaying a table of invoices

Let's take a look at a use case to solidify our understanding.

final readonly class ViewInvoicesController
{
    public function __construct(
        private GetMyInvoices $query,
        private Factory $view,
    ) {}

    public function __invoke(Request $request): View
    {
        $invoices = $this->query->get();

        return $this->view->make('view-invoices', [
            'invoices' => $invoices,
        ]);
    }
}

This use case is responsible for displaying a table of invoices to the user. All of the magic happens on this line:

$invoices = $this->query->get();

The query handler GetMyInvoices provides us with a collection of InvoiceSummary read models dedicated to this purpose. A single InvoiceSummary instance might look as follows:

final readonly class InvoiceSummary
{
    public function __construct(
        public int $amountOfDiscountsApplied,
        public string $paymentTerms,
        public string $recipient,
        public int $totalAmountInCents,
    ) {}
}

Eagle-eyed readers may already have noticed that this is in fact a Data Transfer Object. DTOs typically contain only data and no behavior. However, this is exactly what we want: a read model dedicated to the purpose of displaying some relevant data to the user. You may already have noticed that this model doesn't contain any information regarding the individual invoice line items; and this is totally on purpose! A table view cannot display individual invoice line items. Thus, our read model is optimized and carefully crafted for this exact use case.

The write model might look like this (courtesy of Shawn McCool):

final readonly class LineItem
{
    public function __construct(private bool $isDiscount) {}

    public function isDiscount(): bool
    {
        return $this->isDiscount;
    }
}

final class Invoice
{
    private RecipientName $recipientName;
	
    private LineItems $lineItems;

    public function __construct(
        RecipientName $recipientName
    ) {
        $this->recipientName = $recipientName;
        $this->lineItems = LineItems::empty();
    }

    public function addLineItem(LineItem $item): void
    {
        if (
            $item->isDiscount()
            && $this->lineItems->hasDiscountedItem()
        ) {
            throw CannotAddLineItem::multipleDiscountsForbidden($item);
        }

        $this->lineItems->add($item);
    }
}

So to be more precise, we went directly to the data source itself instead of trying to shoe-horn a use case into an Invoice write model that is totally not designed to fulfill a specialized query-based, read use case. Why carry the burden of instantiating this complex write model in order to fulfill a use case that won't even need any of the line items that are defined within this write model? The write model requires all of the line items in order to keep its state consistent, but the read model does not.
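"Going directly to the data source" could be sketched like this with plain PDO. The table and column names (`invoices`, `line_items`, `payment_terms`, `is_discount` and so on) are made up for this example, the `pdo_sqlite` extension is assumed to be available, and the "my" part of GetMyInvoices (filtering by the current user) is left out to keep things short.

```php
<?php

declare(strict_types=1);

final class InvoiceSummary
{
    public function __construct(
        public readonly int $amountOfDiscountsApplied,
        public readonly string $paymentTerms,
        public readonly string $recipient,
        public readonly int $totalAmountInCents,
    ) {}
}

final class GetMyInvoices
{
    public function __construct(private PDO $connection) {}

    /** @return array<int, InvoiceSummary> */
    public function get(): array
    {
        // The database aggregates the line items; no Invoice write model
        // is ever instantiated for this query.
        $rows = $this->connection->query(
            'SELECT i.payment_terms, i.recipient,
                    COALESCE(SUM(l.is_discount), 0) AS discounts,
                    COALESCE(SUM(l.amount_in_cents), 0) AS total
             FROM invoices i
             LEFT JOIN line_items l ON l.invoice_id = i.id
             GROUP BY i.id, i.payment_terms, i.recipient
             ORDER BY i.id'
        )->fetchAll(PDO::FETCH_ASSOC);

        return array_map(
            static fn (array $row): InvoiceSummary => new InvoiceSummary(
                (int) $row['discounts'],
                $row['payment_terms'],
                $row['recipient'],
                (int) $row['total'],
            ),
            $rows,
        );
    }
}

// Usage against an in-memory SQLite database with a hypothetical schema.
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->exec('CREATE TABLE invoices (
    id INTEGER PRIMARY KEY,
    payment_terms TEXT NOT NULL,
    recipient TEXT NOT NULL
)');
$pdo->exec('CREATE TABLE line_items (
    invoice_id INTEGER NOT NULL,
    amount_in_cents INTEGER NOT NULL,
    is_discount INTEGER NOT NULL
)');
$pdo->exec("INSERT INTO invoices VALUES (1, 'NET30', 'Acme Inc.')");
$pdo->exec('INSERT INTO line_items VALUES (1, 10000, 0)');
$pdo->exec('INSERT INTO line_items VALUES (1, -1500, 1)');

$invoices = (new GetMyInvoices($pdo))->get();
```

The query returns exactly the four values the table view needs, and the individual line items never leave the database.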

Where does a Repository belong: application or domain layer?

We can consider the application layer as the specific layer within a multi-layered architecture that handles the implementation details unique to the application, such as database persistence, internet protocol knowledge (sending emails, API interactions), and more. Now, let's establish the domain layer as the layer in a multi-layered architecture that primarily deals with business rules and business logic.

Given these definitions, where exactly do our repositories fit into the picture? Let's revisit a variation of a source code example we discussed earlier:

final class InMemoryUserRepository implements UserRepository
{
    private array $users = [];

    public function find(UserId $id): User
    {
        return $this->users[$id->asString()]
            ?? throw CouldNotFindUser::becauseItIsMissing();
    }

    public function remove(User $user): void
    {
        unset($this->users[$user->id()->asString()]);
    }

    public function save(User $user): void
    {
        $this->users[$user->id()->asString()] = $user;
    }
}

I'm observing numerous implementation details that can be regarded as "noise". Therefore, these implementation details belong in the application layer. Let's remove this noise and see what we are left with:

final class InMemoryUserRepository implements UserRepository
{
    private array $users = [];

    public function find(UserId $id): User
    {
    }

    public function remove(User $user): void
    {
    }

    public function save(User $user): void
    {
    }
}

Does this actually remind you of something? Perhaps this?

interface UserRepository
{
    public function find(UserId $id): User;
    public function save(User $user): void;
    public function remove(User $user): void;
}

Placing an interface at layer boundaries entails the following implication: While the interface itself can encompass domain-specific concepts, its implementation should not. In the context of repository interfaces, they belong to the domain layer. The implementation of repositories belongs to the application layer. Consequently, we can freely utilize type-hinting for repositories within the domain layer, without any need for dependencies on the application layer.

Various other benefits

Below is a non-exhaustive list of various other benefits a Repository can bring along with it:

Access to the decorator pattern to add additional concerns without having to modify the domain e.g. to employ something like hashids for YouTube-like identifiers.
The ability to implement the transactional outbox pattern for mission-critical, event-driven systems.
Centralizing access / persistence logic if the application relies on data models primarily and you'd like to migrate away.
Automatically adding audit information alongside the persisted entity.

...
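As an illustration of the decorator and audit points above, a decorating Repository could be sketched as follows. The `AuditingUserRepository` name and its in-memory log are assumptions made for this example; a real implementation would more likely write the audit information to a table or a log channel.

```php
<?php

declare(strict_types=1);

// Hypothetical, stripped-down entity.
final class User
{
    public function __construct(private string $id) {}
    public function id(): string { return $this->id; }
}

interface UserRepository
{
    public function save(User $user): void;
    public function remove(User $user): void;
}

final class InMemoryUserRepository implements UserRepository
{
    /** @var array<string, User> */
    private array $users = [];

    public function save(User $user): void
    {
        $this->users[$user->id()] = $user;
    }

    public function remove(User $user): void
    {
        unset($this->users[$user->id()]);
    }
}

// Decorator: implements the same interface, wraps any other UserRepository
// and records an audit line after each successful write.
final class AuditingUserRepository implements UserRepository
{
    /** @var array<int, string> */
    private array $log = [];

    public function __construct(private UserRepository $inner) {}

    public function save(User $user): void
    {
        $this->inner->save($user);
        $this->log[] = 'saved ' . $user->id();
    }

    public function remove(User $user): void
    {
        $this->inner->remove($user);
        $this->log[] = 'removed ' . $user->id();
    }

    /** @return array<int, string> */
    public function log(): array
    {
        return $this->log;
    }
}

$users = new AuditingUserRepository(new InMemoryUserRepository());
$user = new User('7');
$users->save($user);
$users->remove($user);
```

Neither the `User` entity nor the code that depends on `UserRepository` has to change in order to gain the extra behavior.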

Wrap-up

That was a lot to go through! Thanks for sticking around until the end.

Basically, if we were to enumerate all of the benefits for using a Repository, persistence agnosticity would definitely come last or at the very least be close to being last. Therefore, I hope that we can stop taking concepts at face value and actually examine them a little deeper to unearth the actual use cases and the contexts in which they're supposed to be used.

A Repository is the authoritative actor for safely collecting and preserving entities and managing their lifecycle
The ability to swap the underlying persistence driver is a mere bonus
The ability to easily test without an actual persistence driver is a mere bonus
Do use a Repository for your write models
Don't use a Repository for your read models: go to the data source instead

Join the discussion on X (formerly Twitter)! I'd love to know what you thought about this blog post.

Thanks for reading!