Not every system gets months of careful planning; sometimes the business needs something shipped in about a week. My team recently went through exactly that, and in this post I'll walk through how we pulled it off: the requirements, the design, the implementation, and the tradeoffs we made along the way (tech choices, skipped tests, cut features, bla bla bla). I hope our experience helps you the next time you find yourself in a similar crunch.

A team of three of us built this in about a week, splitting the work across backend, frontend, and infrastructure/deployment.

This post is broadly divided into the following parts.

1. Requirement: understanding what needs to be built, along with constraints like budget, timeline, team size, and expected scale.
2. High-level design: estimating the scale, identifying the hotspots, and making the key architectural decisions.
3. Implementation and beyond: building the system, deploying it, monitoring it, and retrospecting on what we could have done better.

Chapter 1: Requirement

The requirement came in during an ongoing T20 cricket tournament: we had to build a live prediction quiz that users could play along with the match, and we had roughly a week to ship it.

The use case was that at the start of an over, users would be asked to predict a scenario, for which they would have 20 to 30 seconds. At the end of the over, a moderator would submit what actually happened among all the predicted scenarios. Scoring would be based on who answered correctly and how quickly they answered.
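The article doesn't spell out the scoring formula, so the sketch below is purely hypothetical: correct answers earn a base score plus a bonus that decays with the time taken to answer, which captures the "correct and fast" rule described above.

```javascript
// Hypothetical scoring sketch (the real formula isn't given in the article).
// A correct answer earns a base score plus a speed bonus that shrinks as
// more of the answer window is used up; wrong answers earn nothing.
function score({ correct, secondsTaken, windowSec = 30, base = 100 }) {
  if (!correct) return 0;
  const speedBonus = Math.round(base * (1 - secondsTaken / windowSec));
  return base + Math.max(0, speedBonus);
}

console.log(score({ correct: true, secondsTaken: 3 }));   // fast answer → 190
console.log(score({ correct: true, secondsTaken: 30 }));  // slowest correct answer → 100
console.log(score({ correct: false, secondsTaken: 3 }));  // wrong answer → 0
```

Any monotonically decreasing bonus would do here; the point is just that correctness gates the score and speed differentiates ties.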

So immediately after getting the requirement, we did an effort estimation, and it clearly did not fit our budget, so we cut the features we could skip in v0. This is important because on a tight deadline you might wanna ship fewer features rather than ship half-baked ones.

This is going to be a primarily backend-focused post, so sorry, I won't be able to explain how those beautiful UIs were made.

Chapter 2: Design

Next was obviously to design the system, a phase popularly known as high-level design. This was the most challenging part as well as the most fun part for me. We did not do extensive low-level design and took all the modeling calls on the fly while implementing.

Step 1: Scale estimation

Before starting anything we did a rough estimation of what sort of traffic was gonna hit us. Given the scale of our system, we estimated that maybe 50k people would engage with the quiz, so we targeted building for 100k users. Most people would probably answer in the first 10 seconds of a 30-second window, which roughly gives us a target QPS of 20k.

My rule of thumb for a new system is to always design for twice the number of anticipated users, so that if you end up with more users, the system does not fail to scale; and since users generally grow over time, you don't have to keep changing the system constantly. That does not mean you provision twice the infrastructure that's actually required, though. I mostly have autoscaling enabled in my systems, so whenever there are traffic spikes, the system scales itself.

(We did other estimations as well, such as the storage needed on the cache, but I've skipped them to keep the article short.)
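The back-of-envelope arithmetic above can be sketched in a few lines. The extra 2x headroom factor at the end is my own assumption to land at the 20k figure quoted above (100k users over a 10-second window is 10k QPS on its own):

```javascript
// Back-of-envelope traffic estimation. Numbers come from the article;
// the final `headroom` multiplier is an assumption on my part.
function estimateTargetQps({ expectedUsers, growthFactor, answerWindowSec, headroom }) {
  const designUsers = expectedUsers * growthFactor;  // design for 2x anticipated users
  const baseQps = designUsers / answerWindowSec;     // most answers land in this window
  return { designUsers, targetQps: baseQps * headroom };
}

const est = estimateTargetQps({
  expectedUsers: 50_000,  // ~50k people expected to engage with the quiz
  growthFactor: 2,        // rule of thumb: design for twice the users
  answerWindowSec: 10,    // most users answer in the first 10 seconds
  headroom: 2,            // extra safety margin for uneven arrival (assumption)
});
console.log(est);         // designUsers: 100000, targetQps: 20000
```

The exact factors matter less than doing the exercise at all: a ten-minute estimate like this tells you whether you are building for hundreds or tens of thousands of requests per second.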

Step 2: Identifying high scale APIs

The next step was to identify which APIs would receive the bulk of the traffic, because those were the ones we had to design carefully for scale. The remaining low-traffic APIs could be built in the usual way without much ceremony.

These were the high-scale APIs we identified.

1. Answer submission for the quiz
2. Leaderboard fetch
3. Score fetch (we served this API out of the cache)
4. Rank fetch for a user

Step 3: Key design decisions

We locked in two key decisions at this stage.

1. We are going to need asynchronous and distributed processing for score and leaderboard calculation, for the most obvious reasons. If you are planning to run score calculation synchronously on a single node for 1 million users or beyond, good luck to you. The idea is simple: we split a big task into smaller tasks, run them on different nodes, and let them proceed at their own pace. We also want this process to be fault-tolerant.
2. The score and leaderboard APIs were going to be polled constantly, generating a huge number of reads. Serving all of those reads from the primary DB would not scale. So instead of reading from Postgres, we would serve them out of an in-memory store such as Redis, Memcached, or Hazelcast.

Step 4: Tech choices

While taking the design decisions this time, I chose to go with tech that I knew and did not want to venture into finding what could be the best tech for this use case. For example, I had familiarity with Redis, so for the cache it was the obvious choice. Also, whenever I hear "leaderboard", it automatically translates to Redis sorted sets in my mind. And for the async processing, Kafka still remains my number-one choice. Given proper time I would probably do a bit more exploration, but this time I was not going to wander into the wildlands of finding the best tech, because I did not have TIME!!!!

First up, the answer submission API. This API was gonna let our users submit answers for the quiz. The main challenge here was that it would generate a lot of concurrent write operations, which would increase the load on our database a lot. So we had to do two things:

1. Decrease load on the DB
2. Generate some sort of back-pressure or let the DB operators work at their own pace, so that the DB is not overwhelmed

My go-to solution for decreasing write load is batching. So if I had to do 10 write operations, I would batch them into a single query and run one DB write operation.
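Concretely, batching 10 inserts means building one multi-row INSERT instead of 10 single-row ones. A minimal sketch, assuming Postgres-style `$n` placeholders; the table and column names here are made up for illustration:

```javascript
// Collapse N single-row INSERTs into one multi-row INSERT statement.
// Returns the query text plus a flat params array, in the shape that a
// driver like node-postgres would accept.
function buildBatchInsert(table, columns, rows) {
  const params = [];
  const tuples = rows.map((row) => {
    const placeholders = columns.map((col) => {
      params.push(row[col]);
      return `$${params.length}`;   // Postgres-style $1, $2, ... placeholders
    });
    return `(${placeholders.join(', ')})`;
  });
  return {
    text: `INSERT INTO ${table} (${columns.join(', ')}) VALUES ${tuples.join(', ')}`,
    params,
  };
}

const { text, params } = buildBatchInsert('answers', ['user_id', 'option_id'], [
  { user_id: 1, option_id: 'a' },
  { user_id: 2, option_id: 'c' },
]);
// text   → INSERT INTO answers (user_id, option_id) VALUES ($1, $2), ($3, $4)
// params → [1, 'a', 2, 'c']
```

One round trip and one statement parse instead of N is usually a large win on write-heavy paths, at the cost of slightly more complex error handling when a single row in the batch is bad.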

And back-pressure almost shouts "message queues".

So combining both, the solution we came up with is this…

Don’t freak out, allow me to explain what’s happening

1. When a user submits an answer, we do not write it to the DB synchronously. Instead, the API simply pushes the answer onto the message queue and immediately returns a 202 HTTP response. This keeps the submission API extremely fast and light, because it is not doing any heavy lifting, and the user experience stays snappy even under heavy load.
2. A consumer (call it a worker/processor/listener) then pulls the answer messages off the queue. Since Kafka consumers are pull-based, a sudden traffic spike can never overwhelm them; the messages simply sit in the queue and wait their turn while the consumers work at their own pace. And even if the DB or a downstream component goes down, nothing is lost, because the messages are retained safely in Kafka.
3. The batch creator then groups the incoming messages into batches. For example, it collects 10 answers at a time and passes them along as a single batch.
4. The DB writer picks up the batches and makes insert queries into the DB. The back-pressure is mainly introduced here: since the message consumer is pull-based, the DB writer picks up messages at its own pace, and thus we prevent DB overload. And since it works on batches, instead of running 100 DB queries we ran only 10.
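Steps 2 to 4 above boil down to a small batching loop. Here is a simplified in-memory stand-in (in the real system the messages arrive from a Kafka consumer and `flush` runs the batched insert; those parts are stubbed out):

```javascript
// Minimal batcher sketch: accumulate messages and flush them in groups,
// so N incoming messages turn into N / batchSize downstream writes.
function makeBatcher(batchSize, flush) {
  let buffer = [];
  return {
    add(message) {
      buffer.push(message);
      if (buffer.length >= batchSize) this.drain();
    },
    drain() {
      if (buffer.length === 0) return;
      flush(buffer);   // one batched DB write instead of batchSize single writes
      buffer = [];
    },
  };
}

// Usage: 100 answer messages with a batch size of 10 → only 10 DB writes.
let dbWrites = 0;
const batcher = makeBatcher(10, (batch) => { dbWrites += 1; });
for (let i = 0; i < 100; i++) batcher.add({ userId: i, optionId: 'a' });
batcher.drain();        // flush any partial batch left at the end
console.log(dbWrites);  // → 10
```

A production version would also flush on a timer, so a half-full batch doesn't sit forever when traffic goes quiet, but the core idea is exactly this.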

This solves 30% of our problem, let’s jump on to the next one.

Next up: score calculation and the leaderboard. Calculating the score for every user who answered is a heavy operation, and running it for a huge number of users would hammer the DB; it simply cannot be done in a request-response fashion. So what do we do? We go back to our friend message queues for asynchronous processing again. Cool, so we can calculate the scores asynchronously, but what about the leaderboard? That needs to be available all the time, right? And what about rank? Until and unless scores for all the users have been calculated, you can't really assign ranks, right? And calculating rank specifically could be a hard problem.

Now, who will save us from this? Worry not, my friend; remember I mentioned Redis briefly while talking about cache? It has a beautiful thing called a sorted set (what an amazing creation, thank you Redis Labs). In a sorted set, you can add keys with a score, and Redis keeps them ordered accordingly, with inserts in O(log(N)). That will sort out our ranking problem. It also allows us to run range queries, such as "give me the top 5", or "get me the rank for a specific key", and all of this happens in O(log(N)). That is exactly what we need here.

The elements are added to a hash table mapping Redis objects to scores. At the same time the elements are added to a skip list mapping scores to Redis objects (so objects are sorted by scores in this “view”) — Sorted set internals
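To make the three operations we lean on concrete, here is a tiny in-memory model of sorted-set semantics. This is only an illustration of what the calls do, not how Redis implements them (Redis uses the hash-table-plus-skip-list structure quoted above); the real commands are noted inline:

```javascript
// In-memory illustration of the sorted-set operations used for the leaderboard.
const scores = new Map();

function zadd(member, score) {           // Redis: ZADD leaderboard score member
  scores.set(member, score);
}

function zrevrange(start, stop) {        // Redis: ZREVRANGE leaderboard start stop
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])         // highest score first
    .slice(start, stop + 1)              // stop is inclusive, like Redis
    .map(([member]) => member);
}

function zrevrank(member) {              // Redis: ZREVRANK leaderboard member
  const ordered = zrevrange(0, scores.size - 1);
  const idx = ordered.indexOf(member);
  return idx === -1 ? null : idx;        // 0-based rank, like Redis
}

zadd('alice', 120); zadd('bob', 90); zadd('carol', 150);
console.log(zrevrange(0, 1));  // top 2 → ['carol', 'alice']
console.log(zrevrank('bob'));  // → 2 (third place, ranks are 0-based)
```

The in-memory version re-sorts on every read, which is O(N log N); the whole point of Redis's skip-list implementation is that it keeps these same operations at O(log(N)).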

Bammmmmm the leaderboard problem is also solved.

Alright alright, it’s not easy, I was just happy that I was able to get a workable solution quickly. Now let’s jump back to our drawing board.

Looks a bit intimidating, no? Allow me to explain.

1. As soon as the moderator submits the correct answer, a score-calculation trigger message is pushed, which kicks off the entire processing pipeline.
2. The batcher receives the trigger message and generates {offset, limit} objects based on how many people answered correctly. For example, if 10 people answered correctly and the batch size is 5, it would generate two objects (batches): batch 1 {offset: 0, limit: 5}, batch 2 {offset: 5, limit: 5}. Why do we do this? So that we can run batch processing via paginated DB queries and never call the DB without a limit. If I had to fetch 1 million records from the DB in one shot, it would introduce a lot of problems in many places. So we break that down into smaller pieces and run multiple smaller queries, each returning a smaller number of rows.
3. The user batch processor now receives these batch messages and runs DB queries accordingly. The processor that receives the {offset: 0, limit: 5} message will fetch the first 5 user IDs from the DB (it does another batch operation as well, but that's a bit hard to explain here, so I'm skipping it). And this is where we say goodbye to batch processing and switch to stream processing, because the batch processor now puts the 5 user IDs onto the queue to be processed by the next processor.
4. The score processors receive the user IDs, fetch each user's submitted answer, and calculate the score. Each processor then writes the score to the DB and also adds it to the Redis sorted set that backs the leaderboard. Because of this, the score and rank APIs that users keep polling can be served entirely out of Redis instead of the DB, which is both faster and lighter. And since multiple processors work on different user IDs in parallel, the whole pipeline scales horizontally.
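The batch creator from step 2 above is simple enough to sketch in full. Given the number of correct answers and a batch size, it emits the {offset, limit} pages so that no downstream query ever runs unbounded:

```javascript
// Split N rows of work into {offset, limit} pages for paginated DB queries.
// This is the "batcher" from step 2 of the pipeline.
function makeBatches(totalRows, batchSize) {
  const batches = [];
  for (let offset = 0; offset < totalRows; offset += batchSize) {
    batches.push({ offset, limit: batchSize });
  }
  return batches;
}

console.log(makeBatches(10, 5));
// → [ { offset: 0, limit: 5 }, { offset: 5, limit: 5 } ]
```

Each emitted object becomes one queue message, so the batch size directly controls both the DB page size and the unit of parallelism across batch processors.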

And since the scores and ranks now live in a Redis sorted set, those polling reads never touch the DB at all, which takes a huge amount of load off it. A nice bonus.

This concludes the primary design phase. There was also DB design, API design, and other stuff, which I am skipping here.

Chapter 4: Implementation

Completing the design solved 70% of our problems, and we knew we could crack this, so we jumped into development quickly.

In an ideal world, I would go and create a new service and try out a new language such as Go, for better and faster parallel processing and other stuff. But given the timeline, it was not the right thing to do. We stuck to our core NodeJS service and put everything there. Microservice purists might lose it after seeing this statement, but in a race against time, your principles gotta take the back seat sometimes. Among the many tradeoffs, this was one of the major calls we took.

Apart from this, we also had to cut down on unit and integration tests, the guilt of which is still haunting us. But we are backfilling the tests gradually post-release.

I guess after seeing the detailed design posted above you should be able to implement your own, hence I am not gonna do a code deep dive this time.

Chapter 5: Deployment and Monitoring

After a few bug fixes and QA sign-off, we were good to go. Job done, right? No, my friend, we still had to set up heavy monitoring for this piece. Since it was developed in a very short time, I at least was a little underconfident. We had tracing enabled on this service by default via LightStep. So apart from traces, I set up dedicated monitoring: traffic, error-rate, and latency dashboards, plus alerts for all the APIs. And post go-live, my teammates and I got on a call and monitored the system inside and out for at least an hour, from RAM and CPU usage to error logs. So always give equal weight to observability and monitoring. There were small issues on the production system, and we were only able to catch them early because of the monitoring.

Chapter 6: Retrospective

The system works, but after a breather, it's important to do a retrospective, identify the things we missed, and work on them. I am sure we missed a ton of stuff, cut a lot of corners, and have huge scope for improvement. For example, here are a couple of…

Things we could do better

1. We used the already existing Postgres DB for this, since the driver, the ORM, and the supporting infrastructure were already there. Given more time, I would probably explore other database options.
2. NodeJS is awesome, but I feel Go would be a better fitting solution for this. We could have explored this.
3. I tried writing a query to calculate the score in the DB only and failed miserably. I could probably write that and could do batch processing for score calculation as well, reducing the DB operations even further.
4. We could not run extensive load and performance tests, which is a must.
5. We could have written two different stages for DB score update and Redis sorted set update, that would be a cleaner implementation.

Parting notes

We did the best we could in that short time. Even the implementation diverged slightly from the original design. But that's okay; trade-offs are always going to be there. And even though we completed the core functionality in approximately a week, there were small patches we had to ship after that.

I hope you were able to learn a thing or two about system design and software development today.

Credits

Shout out to my awesome teammates Akash Raj and Aashirwad Kashyap, we all worked together to build this in just about a week.

I am Aritra Das, I am a Developer, and I really enjoy building complex Distributed Systems. Feel free to reach out to me on Linkedin or Twitter for anything related to tech.

Happy learning…
