June 20, 2023 — 5 minutes to read — Tags: systems, programming

Choosing an appropriate data store and schema design is an important early decision for web development projects. A common convention is to start with a relational database (like PostgreSQL or MySQL) and a normalized schema that minimizes redundant data (formally, third normal form or 3NF).

I’d like to encourage an alternative access-oriented approach to designing schemas that prioritizes scalable, extensible data access patterns over normalization and ad hoc queries. This approach is heavily influenced by Rick Houlihan’s amazing presentation at re:Invent 2018: Advanced Design Patterns for DynamoDB.

Challenges with normalized SQL schemas

In my experience, a major challenge with normalized schemas for web apps is that important DB queries that were initially fast get slower as the data volume grows. Specifically, queries that join, sort, and filter unindexed data perform progressively worse in OLTP databases.

It’s certainly possible to design schemas and write queries that scale gracefully with more data. But when developers start from a mindset of a maximally normalized schema, it’s easy to end up with a few gnarly SQL queries behind important features.

This can be particularly challenging for lean startups. Features built with inefficient schemas are often fast with small data sets, such as test environments and early days on production. But as usage grows, performance degrades. This scaling bottleneck causes developers to significantly rework the schema or data architecture just as the product gains popularity. It’s a particularly toxic form of tech debt that’s due just as people are paying attention.

As an example, let’s consider a social network app with a personalized newsfeed of updates from people you follow.

A normalized (3NF) schema might look like this: a posts table with an author_id that’s a foreign key to a users table, and a followers table with follower_id and followed_id columns that are both foreign keys to the users table.
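In PostgreSQL, for example, a minimal sketch of that schema might look like the following (column types, and any columns beyond those named above, are illustrative assumptions):

CREATE TABLE users (
  id BIGSERIAL PRIMARY KEY
);

CREATE TABLE posts (
  id              BIGSERIAL PRIMARY KEY,
  author_id       BIGINT NOT NULL REFERENCES users (id),
  relevance_score NUMERIC NOT NULL,
  body            TEXT
);

CREATE TABLE followers (
  follower_id BIGINT NOT NULL REFERENCES users (id),
  followed_id BIGINT NOT NULL REFERENCES users (id),
  PRIMARY KEY (follower_id, followed_id)
);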

To get some relevant posts for my homepage newsfeed, I’d generate a query like:

SELECT posts.* FROM posts
INNER JOIN followers ON followers.followed_id = posts.author_id
WHERE followers.follower_id = 'my_user_id'
ORDER BY posts.relevance_score DESC
LIMIT 50;

Let’s also assume we have the appropriate indexes on follower_id, followed_id, relevance_score, and author_id.
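Assuming the sketch above, those might be declared like so (index names are illustrative; the composite primary key on followers already covers lookups by follower_id):

CREATE INDEX index_followers_on_followed_id ON followers (followed_id);
CREATE INDEX index_posts_on_author_id ON posts (author_id);
CREATE INDEX index_posts_on_relevance_score ON posts (relevance_score DESC);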

This query still becomes slower as users follow more people, since the DB must read from a different segment of the posts index for each user you follow. Rather than growing gracefully with your product, query performance becomes a headwind to product growth.

When faced with slow production queries, many developers quickly implement caching. Or to put it more generally, they look to denormalize the data to better accommodate important access patterns.

A snappy user experience quickly takes priority over data normalization.

And because of the urgent need to scale a popular feature, devs often implement caching hastily, which can incur even more tech debt and system complexity.

If we’re so quick to give up normalization for scalability, is it truly so valuable to begin with? Can we start with scalability, and reduce the complexity of caching? I think so.

Access-oriented schema design

Let’s instead design our storage based on how we plan to query the data. In our homepage newsfeed example, we want the query that displays personalized, relevant news items to remain quick as the number of users and posts grows.

So we might create a feed_items table with user_id, relevance_score, and post_id fields. When a new post is inserted, our application fans out to write the appropriate entries to the feed_items table.
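Here’s a sketch of that table and the fan-out write, again building on the illustrative schema above (the post id, author id, and score are placeholders that the application would supply):

CREATE TABLE feed_items (
  user_id         BIGINT NOT NULL REFERENCES users (id),
  post_id         BIGINT NOT NULL REFERENCES posts (id),
  relevance_score NUMERIC NOT NULL,
  PRIMARY KEY (user_id, post_id)
);

-- On each new post, write one feed_items row per follower of the author.
INSERT INTO feed_items (user_id, post_id, relevance_score)
SELECT follower_id, 123, 0.87
FROM followers
WHERE followed_id = 456;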

Now the query for the news feed is:

SELECT posts.*, feed_items.relevance_score FROM posts
INNER JOIN feed_items ON feed_items.post_id = posts.id
WHERE feed_items.user_id = 'my_user_id'
ORDER BY feed_items.relevance_score DESC
LIMIT 50;

This query scales more easily with a compound index on feed_items.user_id and relevance_score: it lets us fetch the appropriate posts from a single, contiguous segment of the index. We also defer the need to add a caching tier like Redis or Memcached.
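That compound index might be declared like so (a sketch with an illustrative name):

CREATE INDEX index_feed_items_on_user_id_and_relevance_score
  ON feed_items (user_id, relevance_score DESC);

With this index, the database can walk one user’s entries in relevance order and stop after 50 rows instead of sorting that user’s entire feed.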

The point is not simply to fan out on write rather than on read—it’s that planning ahead to make your important access patterns fast and scalable can save significant rework later. It’s become one of my favorite topics to dive into during system design reviews: does the engineer have a reasonable understanding of how their data access performance will scale with usage?

Developer affordances and NoSQL

One advantage of NoSQL wide-column data stores like AWS DynamoDB and Apache Cassandra is that they force developers to write scalable queries: queries require a partition key, and joins aren’t supported.

DynamoDB drives home this developer affordance with distinct API methods. Use GetItem for scalable, performance-sensitive queries. And in the rare cases you need to operate over an arbitrarily large portion of your data set, use Scan.

In contrast, SQL does not have such conspicuous affordances for developers to know if their query is scalable. The massive performance difference between the two queries above is easy to miss. Developers would need to have an intuitive understanding of their data volume, or run EXPLAIN commands as an extra step. Arguably, SQL’s big advantage is that it lets authors create complex ad hoc queries. This is fantastic for analytics, reporting, and data exploration where users are less sensitive to latency. But it can confound developers who want their queries to scale gracefully.
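For example, in PostgreSQL you could check the plan for the feed query above (using the illustrative schema from earlier) and confirm it reads the compound index rather than sorting:

EXPLAIN ANALYZE
SELECT posts.*, feed_items.relevance_score FROM posts
INNER JOIN feed_items ON feed_items.post_id = posts.id
WHERE feed_items.user_id = 'my_user_id'
ORDER BY feed_items.relevance_score DESC
LIMIT 50;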

In summary

  1. When designing a new system, consider an access-oriented DB design that ensures important queries will scale gracefully with data volume and concurrent usage.
  2. Consider whether a schema change will let you keep using your primary data store without needing the architectural and operational complexity of a secondary cache.
  3. Consider how you can build affordances in your API to encourage fast, scalable queries, and discourage slow, complex ones.
