Google Interview Question
Software Engineer
Country: United States
Interview Type: In-Person
For me, at a high level, there would be several application servers handling the requests (a load balancer can distribute requests among them). Each server keeps in-memory queues for "up" and "down" votes. At any time, one queue is being filled with video IDs for likes and another for dislikes. When a queue fills up, another queue becomes active, and the full one signals a background process.
The background process updates the database with the aggregated counts, tagged with the version of the master DB it last synchronized with. This table only records how many likes and dislikes each video received during that session on that server.
When the master DB synchronizes, it checks the last updated version and adds or subtracts the counts. Once the master DB has the information, it starts propagating the updated counts to the other databases. We can use a primary/secondary setup: after the master updates a secondary DB, that secondary is promoted to primary (which requires a table lock), the application servers start pointing at the new primary, and the old primary is then updated as the new secondary.
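The queue-and-flush idea above can be sketched as follows. This is a minimal single-process sketch; `VoteBuffer`, `record_vote`, and `flush_fn` are illustrative names I made up, not part of any real system:

```python
from collections import defaultdict
from queue import Queue


class VoteBuffer:
    """Buffers (video_id, delta) votes in memory and hands a batch to
    flush_fn once the active queue fills up -- standing in for the
    background process that writes aggregated counts to the DB."""

    def __init__(self, batch_size, flush_fn):
        self.batch_size = batch_size
        self.flush_fn = flush_fn  # e.g. a function that runs one DB update per video
        self.queue = Queue()

    def record_vote(self, video_id, delta):
        # delta is +1 for a like, -1 for a dislike
        self.queue.put((video_id, delta))
        if self.queue.qsize() >= self.batch_size:
            self.flush()

    def flush(self):
        # Aggregate per-video deltas so the DB sees one update per video.
        totals = defaultdict(int)
        while not self.queue.empty():
            video_id, delta = self.queue.get()
            totals[video_id] += delta
        if totals:
            self.flush_fn(dict(totals))
```

In the real design there would be two active queues per server (one filling while the other drains) and the flush would carry the master-DB version; this sketch only shows the batching itself.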
Requirements
* For each video - keep a counter for happy (like) and a counter for non-happy (dislike)
* A user can either like or dislike a video, and can cancel their like or dislike
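The like/dislike/cancel rules can be captured in one small transition function. A sketch; the function name and the string encoding of votes are my own:

```python
def apply_vote(current, action):
    """Given a user's existing vote ('like', 'dislike', or None) and the
    vote they just cast, return (new_state, like_delta, dislike_delta)
    so the video's two counters can be adjusted."""
    if action == current:
        # Casting the same vote again cancels it.
        return (None,
                -1 if action == "like" else 0,
                -1 if action == "dislike" else 0)
    # Otherwise: set the new vote, and undo the old one if there was one.
    like_delta = (1 if action == "like" else 0) - (1 if current == "like" else 0)
    dislike_delta = (1 if action == "dislike" else 0) - (1 if current == "dislike" else 0)
    return action, like_delta, dislike_delta
```

Note that switching from like to dislike touches both counters in one step, which matters later when updates are batched.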
One machine design
* Users table
* Video table
* User-to-video table - this table has a third column for like/dislike, since a user can only choose one option.
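The three tables could look like this in SQLite (a sketch; the table and column names are my own, and real IDs would not be plain integers):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users  (user_id  INTEGER PRIMARY KEY, name  TEXT);
CREATE TABLE videos (video_id INTEGER PRIMARY KEY, title TEXT);
-- One row per (user, video) pair; vote is 1 for like, -1 for dislike.
-- The composite primary key enforces at most one vote per user per video.
CREATE TABLE user_video_votes (
    user_id  INTEGER REFERENCES users(user_id),
    video_id INTEGER REFERENCES videos(video_id),
    vote     INTEGER NOT NULL CHECK (vote IN (1, -1)),
    PRIMARY KEY (user_id, video_id)
);
""")
```

On a single machine, a video's counts are just `SELECT COUNT(*) ... WHERE vote = 1` over this table; the capacity section below is about why that stops working at scale.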
Capacity in large systems
* 1 billion users (100 million active users)
* Videos 1 billion per day => 365 billion in one year. 5 * 365 = 1,825 billion in 5 years
* Average likes per video: 100 (many videos have no likes, but some have a lot). So in 5 years the like table holds 1,825 billion × 100 = 182,500 billion records.
* If we keep all the relationship between user and movie in a table:
* One record: 64 bytes for the video ID + 64 bytes for the user ID + 1 byte for the like/dislike ≈ 130 bytes per record. Total size: 182,500 billion × 130 bytes ≈ 24,000 billion KB ≈ 24,000 TB ≈ 24 petabytes.
* So we will do something else: for each movie, keep the counts of likes and dislikes, and for each user, keep the list of movies they liked.
* For each movie, 2 counters (4-byte ints, 8 bytes total): 8 bytes × 1,825 billion = 14,600 billion bytes ≈ 14.6 TB
* For each user, the list of likes: assuming up to 1,000 likes per user, that is 1,000 billion entries × 64 bytes = 64,000 billion bytes = 64 TB
* Keep the movie counters in the DB, and also keep the counters for recent movies in memory
* Keep user to movie data in DB
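The estimates above can be checked with quick arithmetic (these are the rough figures from this thread, all approximate):

```python
# 5 years of videos (1,825 billion) x 100 likes each
records = 1_825 * 10**9 * 100

# ~130 bytes per record: 64-byte video id + 64-byte user id + 1 vote byte
bytes_per_record = 64 + 64 + 1

relation_table_bytes = records * bytes_per_record   # full user-video table: ~24 PB
counter_bytes = 1_825 * 10**9 * 8                   # 8 bytes of counters per video: ~14.6 TB
user_list_bytes = 10**9 * 1000 * 64                 # 1B users x 1,000 likes x 64-byte id: 64 TB
```

The point of the numbers: the full relationship table is petabytes, while the two summaries (per-movie counters, per-user like lists) are only tens of terabytes.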
Handle concurrency
* What happens when 2 users update the count of a movie at the same time? The server receives requests of the form "user X, movie Y, +1". Use the database's capabilities to make the read-modify-write atomic, for example "select for update".
* Make sure to update the memory cache each time the counter is updated in the DB.
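A minimal sketch of the atomic increment, using SQLite for illustration. SQLite serializes writes, so a single `UPDATE ... SET likes = likes + 1` cannot lose a concurrent "+1"; in databases with row locking (e.g. PostgreSQL), `SELECT ... FOR UPDATE` is the explicit form of the same protection. Table and function names are my own:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE movie_counts (movie_id INTEGER PRIMARY KEY,"
    " likes INTEGER, dislikes INTEGER)"
)
db.execute("INSERT INTO movie_counts VALUES (7, 0, 0)")


def add_like(movie_id):
    # One UPDATE statement reads and writes the counter atomically,
    # so two concurrent "+1" requests cannot overwrite each other.
    # (Reading the value into the app and writing it back would race.)
    db.execute(
        "UPDATE movie_counts SET likes = likes + 1 WHERE movie_id = ?",
        (movie_id,),
    )
    db.commit()


add_like(7)
add_like(7)
```

After a successful commit, the same "+1" would also be applied to the in-memory cache entry for that movie.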
How many machines?
* Let's say 50% of the videos are seen daily, so we need ~7 TB of memory for the hot counters. At 70 GB of RAM per server, that means 100 servers, plus ~10 more for handling machine failures.
* Database: assuming a 4 TB SSD per machine, 16 machines for the user likes (64 TB) and 4 machines for the movie counts (~14.6 TB). If each machine has a replica, that is 40 machines.
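The machine counts fall out of the storage figures above (rough arithmetic, using the thread's assumptions of 70 GB usable RAM and 4 TB SSD per machine):

```python
hot_tb = 7                               # ~50% of the ~14.6 TB of counters is "hot"
cache_servers = hot_tb * 1000 // 70      # 70 GB RAM each -> 100 (plus ~10 spares)

like_db_machines = 64 // 4               # 64 TB of user likes / 4 TB SSD each -> 16
count_db_machines = -(-15 // 4)          # ~15 TB of counters / 4 TB, rounded up -> 4
total_db_machines = (like_db_machines + count_db_machines) * 2  # each with a replica
```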
How is the data spread across machines? It should be according to location, for both the user data and the movie data.
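Whatever the placement policy, each record needs a stable mapping to a machine. A common sketch is hash-based sharding (`shard_for` is my own name; a location-aware scheme would additionally map shard ranges to regional data centers):

```python
import hashlib


def shard_for(key, n_shards):
    # A stable hash of the key picks the shard, so the same user or
    # movie always lands on the same machine regardless of which
    # application server handles the request.
    digest = hashlib.md5(str(key).encode()).hexdigest()
    return int(digest, 16) % n_shards
```

Note that changing `n_shards` remaps most keys; consistent hashing is the usual refinement when machines are added or removed.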
At a high level, I would suppose it consists of new machines/virtual machines being spun up (with an initial counter of 0) by other machines whenever their load threshold is reached. A load balancer equally distributes the load among all the machines. Each machine knows about all the other machines; whenever one is idle, it finds another machine to merge its count into, and then shuts down, until finally only one machine is left with the total counter.
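The merge-and-shutdown idea above reduces to folding per-machine counts into one total. A toy sketch, with each machine represented just by its local count:

```python
def merge_counters(machines):
    """Repeatedly fold an idle machine's count into a peer until only
    one machine remains holding the grand total."""
    machines = list(machines)
    while len(machines) > 1:
        idle = machines.pop()   # an idle machine picks a peer...
        machines[-1] += idle    # ...merges its count into it, then shuts down
    return machines[0]
```

Because addition is commutative and associative, the merge order does not matter, which is what lets machines merge opportunistically whenever they go idle.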
Use an eventually consistent data store, something like NoSQL, with the data distributed, of course. For likes and similar data, strong consistency would hamper performance.
- Noobie January 30, 2015