Chapter 4: Scale, Systems and New Clients (2021)

2021 started with a product idea.

After observing what Discord server owners actually needed, I built a high-frequency event ingestion system. The concept was simple: server events would trigger processing tasks. The reality was harder. Each task required a minimum of 1024 MB of RAM and 5-10 minutes to complete. A single VPS processing tasks sequentially created backlogs immediately.

We moved to AWS Lambda. The architecture was event-driven: Discord events invoked Lambda functions with parameters, with different functions handling different event types. Fire and forget. AWS handled the scaling. Failures were acceptable but monitored: each task was retried up to 3 times before being marked as failed. At peak we were processing 50,000 events per day across 3-4 long-term clients. The system solved a problem that had no off-the-shelf solution.
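The retry rule can be sketched in a few lines of Go. This is an illustration only, not the original Lambda code: `runWithRetry`, `Status`, and `errTransient` are hypothetical names, and "up to 3 times" is read here as three attempts in total.

```go
package main

import (
	"errors"
	"fmt"
)

// Status is the final outcome recorded for a task.
type Status string

const (
	StatusDone   Status = "done"
	StatusFailed Status = "failed"
)

// maxAttempts is an assumption: "up to 3 times" read as 3 attempts total.
const maxAttempts = 3

// errTransient is a placeholder error used by the example task.
var errTransient = errors.New("transient failure")

// runWithRetry calls task until it succeeds or maxAttempts is reached.
// It returns the final status and the number of attempts made.
func runWithRetry(task func() error) (Status, int) {
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err := task(); err == nil {
			return StatusDone, attempt
		}
	}
	return StatusFailed, maxAttempts
}

func main() {
	calls := 0
	// Hypothetical task that fails twice, then succeeds.
	flaky := func() error {
		calls++
		if calls < 3 {
			return errTransient
		}
		return nil
	}
	status, attempts := runWithRetry(flaky)
	fmt.Println(status, attempts)
}
```

A production version would likely wait between attempts and record the failure somewhere that monitoring could see it.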

Then came April Fools' Day.

The server had grown to 200-300k members. Most Discord servers mark April Fools' by changing their icon. We wanted something different: a mass nickname change for every member at once.

The problem was Discord rate limits. Sequential processing of 300,000 users wasn’t viable.

I built a boss and worker system: a single monolithic project running in separate containers. The boss class held a workers property storing an instance of each worker. When the start command was issued, the boss distributed work using Discord’s sharding formula adapted for our use case: the remainder of each user ID divided by the worker count determined which worker handled that user. No two workers touched the same user. No wasted API calls.
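The assignment rule can be sketched in Go. It is read here as the remainder of the user ID divided by the worker count. Discord's own shard formula also shifts the ID right by 22 bits before taking the remainder; this sketch leaves that out as a simplifying assumption. `workerFor` and `partition` are hypothetical names, and the IDs are made up.

```go
package main

import "fmt"

// workerFor returns the index of the worker responsible for a user.
// Assumption: a plain modulo on the ID, without Discord's 22-bit shift.
func workerFor(userID uint64, workerCount int) int {
	return int(userID % uint64(workerCount))
}

// partition groups user IDs by the worker that owns them, so every ID
// lands in exactly one bucket.
func partition(userIDs []uint64, workerCount int) [][]uint64 {
	buckets := make([][]uint64, workerCount)
	for _, id := range userIDs {
		w := workerFor(id, workerCount)
		buckets[w] = append(buckets[w], id)
	}
	return buckets
}

func main() {
	// Made-up IDs for illustration only.
	ids := []uint64{100, 101, 102, 103, 104, 105, 106}
	for w, bucket := range partition(ids, 3) {
		fmt.Printf("worker %d handles %v\n", w, bucket)
	}
}
```

Because the mapping depends only on the ID and the worker count, each worker can decide independently whether a user is its responsibility, with no shared state between workers.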

It took 12 hours to change 300,000 nicknames. It worked exactly as intended. Members got the joke. The same system restored everything afterwards. A fellow developer in the community asked directly how the task coordination and rate limiting had been handled. The approach wasn’t obvious, and the execution was clean enough to prompt the question.

Not a true distributed system. But the first time I had built coordinated multi-process task distribution. The mental model that would become proper distributed systems architecture started here.

Around this time I started learning Go. Parts of the task automation system moved to Go, specifically the components where we could avoid heavy external executables entirely and build using only Go’s standard library. Zero external dependencies. Lightweight when idle. Tasks that previously required 1024 MB of RAM and 5-10 minutes could now run in 32 MB and 3 minutes. These Go services ran on unused cluster capacity, effectively reducing that portion of infrastructure cost to near zero.

A new long-term client came in through the same network. Someone who had seen the task automation product and wanted to work directly with the person behind it.

The first projects were practical utilities, such as auction systems and birthday notification bots, built in Kotlin while I was learning it. Then I wrote a CSRF middleware library for Ktor, based on the GoFiber implementation: github.com/CRZA5/ktorcsrf. My first open source library built intentionally for others.

Then came the hardest project of the year.

The same client had another server where the existing developer was leaving. Everything was custom (moderation, welcome systems, economy, games, levels), all built in Java with extended libraries and an in-memory cache treated as the source of truth, synced to the database by background jobs.

Over 100 files. The largest codebase I had seen.

My job was to rewrite the entire thing in Kotlin and migrate from MySQL to MongoDB.

It took a few weeks. Deployed it. Serving 200-300k members.

A week later economy balances and levels started resetting and decreasing with no obvious cause.

I had never encountered a race condition before. Node.js’s single-threaded model meant they rarely surfaced in my previous work. This was different: multiple processes accessing and modifying the in-memory cache simultaneously without coordination. The cache was the source of truth. When two processes read and wrote the same entry concurrently, one update could overwrite the other, and the results were unpredictable.

A week of debugging to identify it. Learned about locks. Fixed it. After that the system ran stable with no further major issues.
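The kind of fix that locks provide can be sketched as follows. This is not the original Kotlin code: it is a Go stand-in that uses goroutines in one process as the concurrent writers, and `BalanceCache` and `creditConcurrently` are hypothetical names. The mutex makes each read-modify-write a single step, so no update is lost.

```go
package main

import (
	"fmt"
	"sync"
)

// BalanceCache is an in-memory store guarded by a mutex.
type BalanceCache struct {
	mu       sync.Mutex
	balances map[string]int
}

// NewBalanceCache returns an empty cache ready for concurrent use.
func NewBalanceCache() *BalanceCache {
	return &BalanceCache{balances: make(map[string]int)}
}

// Add applies a delta as one atomic read-modify-write.
func (c *BalanceCache) Add(userID string, delta int) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.balances[userID] += delta
}

// Get returns the current balance for a user.
func (c *BalanceCache) Get(userID string) int {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.balances[userID]
}

// creditConcurrently starts `writers` goroutines that each add 1 to the
// same user's balance `perWriter` times, then waits for all of them.
func creditConcurrently(c *BalanceCache, userID string, writers, perWriter int) {
	var wg sync.WaitGroup
	for i := 0; i < writers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < perWriter; j++ {
				c.Add(userID, 1)
			}
		}()
	}
	wg.Wait()
}

func main() {
	cache := NewBalanceCache()
	creditConcurrently(cache, "user-1", 100, 10)
	fmt.Println(cache.Get("user-1")) // 1000: every increment is kept
}
```

Without the lock in `Add`, two writers could both read the same old balance and each write back old+1, silently dropping one credit. That matches the symptom of balances drifting with no obvious cause.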

I also wrote PostgreSQL backup scripts following the same pattern as the MongoDB ones: pg_dump packaged in Docker, running as a Kubernetes CronJob, and uploading to Wasabi via the AWS CLI. Wrote a Discord.js button-based pagination library when Discord introduced buttons: github.com/CRZA5/discordjs-button-embed-pagination. Built a YouTube notification bot that polls YouTube’s XML feed: github.com/CRZA5/youtube-notifications-bot-ts.
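The feed-polling idea behind the notification bot can be sketched in Go. The actual bot is written in TypeScript, so this is an illustration of the technique and not its code. `newVideos`, the struct names, and `sampleFeed` are hypothetical. The sample is a trimmed-down, made-up Atom document, and the HTTP fetch and the polling timer are left out.

```go
package main

import (
	"encoding/xml"
	"fmt"
)

// sampleFeed is a made-up, minimal stand-in for a channel's Atom feed.
const sampleFeed = `<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom" xmlns:yt="http://www.youtube.com/xml/schemas/2015">
  <entry><yt:videoId>abc123</yt:videoId><title>First video</title></entry>
  <entry><yt:videoId>def456</yt:videoId><title>Second video</title></entry>
</feed>`

// feed models only the fields this sketch needs.
type feed struct {
	Entries []entry `xml:"entry"`
}

type entry struct {
	VideoID string `xml:"videoId"` // matched by local name, any namespace
	Title   string `xml:"title"`
}

// newVideos parses a feed body and returns the entries whose IDs are not
// yet in seen, recording them so the next poll skips them.
func newVideos(body []byte, seen map[string]bool) ([]entry, error) {
	var f feed
	if err := xml.Unmarshal(body, &f); err != nil {
		return nil, err
	}
	var fresh []entry
	for _, e := range f.Entries {
		if !seen[e.VideoID] {
			seen[e.VideoID] = true
			fresh = append(fresh, e)
		}
	}
	return fresh, nil
}

func main() {
	seen := map[string]bool{}
	fresh, err := newVideos([]byte(sampleFeed), seen)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	for _, e := range fresh {
		fmt.Println("new video:", e.VideoID, e.Title)
	}
}
```

Calling `newVideos` again with the same `seen` map returns nothing, which is what keeps a poller from announcing the same video twice.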

By the end of 2021 the stack had expanded significantly. Go, Kotlin, TypeScript, Python, each chosen for specific reasons. The infrastructure was more mature. The problems were getting harder.

And the hardest ones were still ahead.