Groupon Interview Question
Senior Software Development Engineer
Country: United States
Interview Type: Phone Interview
Preprocess the DB and create files with a title -> movieInfo mapping: calculate the hash of each title and append that movie's record to the corresponding file. Given a query, calculate the hash and load the corresponding file; if the file is larger than 250 MB, load it in parts and search the contents.
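The bucketing step described above can be sketched in Python. This is a minimal illustration, assuming the DB can be iterated as (title, movieInfo) pairs; the bucket count and file naming are illustrative choices, not part of the original answer.

```python
import hashlib
import os

NUM_BUCKETS = 64  # chosen so each bucket file stays well under the 250 MB RAM limit

def bucket_for(title: str) -> int:
    """Map a title to a bucket number via its MD5 hash."""
    digest = hashlib.md5(title.encode("utf-8")).digest()
    return digest[0] % NUM_BUCKETS

def preprocess(records, out_dir="buckets"):
    """Append each title -> movieInfo record to its hash bucket file."""
    os.makedirs(out_dir, exist_ok=True)
    for title, info in records:
        path = os.path.join(out_dir, f"bucket_{bucket_for(title)}.txt")
        with open(path, "a", encoding="utf-8") as f:
            f.write(f"{title}\t{info}\n")

def lookup(title, out_dir="buckets"):
    """Load only the matching bucket file and scan it for the title."""
    path = os.path.join(out_dir, f"bucket_{bucket_for(title)}.txt")
    if not os.path.exists(path):
        return None
    with open(path, encoding="utf-8") as f:
        for line in f:
            t, _, info = line.rstrip("\n").partition("\t")
            if t == title:
                return info
    return None
```

A query then touches only one bucket file (roughly 10 GB / 64 ≈ 160 MB each, under these assumptions) instead of the whole dataset.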
Stats:
10GB data
250 MB RAM
How many movies are there in the world? Perhaps 0.5 million?
Amount of information available per movie = 10 GB / 0.5 M = 20 KB. So it is most likely textual information about each movie, not the movies themselves.
How much can 250 MB of RAM hold? About 12.5K movies at a time, which means up to 40 loads of the data per query when searching for a title. (Whether you scan linearly or binary-search, you still have to load each section of data just to confirm it is not the one you are looking for!)
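The brute-force fallback above, scanning the 10 GB file in chunks that fit in 250 MB of RAM, can be sketched as follows. The one-record-per-line "title<TAB>info" layout is an assumption for illustration.

```python
def chunked_search(path, title, chunk_bytes=250 * 1024 * 1024):
    """Scan the data file chunk by chunk, never holding more than one chunk in RAM."""
    needle = title + "\t"
    leftover = ""  # partial line carried over from the previous chunk
    with open(path, encoding="utf-8") as f:
        while True:
            chunk = f.read(chunk_bytes)
            if not chunk:
                break
            buf = leftover + chunk
            lines = buf.split("\n")
            leftover = lines.pop()  # last line may be cut mid-record
            for line in lines:
                if line.startswith(needle):
                    return line[len(needle):]
    if leftover.startswith(needle):  # file may not end with a newline
        return leftover[len(needle):]
    return None
```

In the worst case this reads all 40 chunks before concluding the title is absent, which is exactly the cost the answer is trying to avoid.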
Alternative approach: MD5-hash the movie title (16 bytes, 128 bits), plus up to ~200 bytes to store the movie's location on disk, roughly 250 bytes per movie in a hash map. So about 1 M movies can be looked up entirely from RAM!
So, when a query comes in: MD5 the movie title, look it up in the hash map, go to disk (one I/O) and fetch the description to display.
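The in-memory index plus single disk read can be sketched like this, again assuming the "title<TAB>info" line layout. Keys are raw 16-byte MD5 digests, matching the 16-bytes-per-key estimate above; the (offset, length) pair stands in for the "movie location on disk".

```python
import hashlib

def build_index(path):
    """One pass over the data file, recording where each movie's record lives."""
    index = {}
    with open(path, "rb") as f:
        offset = 0
        for line in f:
            title, _, _ = line.partition(b"\t")
            key = hashlib.md5(title).digest()  # 16 bytes per key
            index[key] = (offset, len(line))
            offset += len(line)
    return index

def fetch(path, index, title):
    """Single disk read per query: seek to the stored offset and return the info."""
    entry = index.get(hashlib.md5(title.encode("utf-8")).digest())
    if entry is None:
        return None
    offset, length = entry
    with open(path, "rb") as f:
        f.seek(offset)
        line = f.read(length)
    return line.split(b"\t", 1)[1].rstrip(b"\n").decode("utf-8")
```

Building the index still requires one full 10 GB pass, but every query afterwards is a hash lookup plus one seek-and-read.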
- Multi-disk arrays? Store the disk location in the hash map.
- RAID for greater IOPS? (Prefer read-friendly RAID levels for cheaper cost.)
Use multi-core machine to parallelize request processing.
Some stats on the given data:
- venkatvi November 28, 2015

RAM: 250 MB
Data size: 10 GB. We don't know the nature of this data: is it just text information about each movie, or does it include the movies themselves? I am presuming the former.
How many movies are there in the world? Let's put an upper bound on it: 0.5 million?
Around 20 KB of information is present for each of the 0.5 million movies in the given 10 GB of data.
If one were to load these 20 KB records into RAM, we could hold only about 12.5K movies at a time. In that case, loading partial data from disk and searching it for the movie of choice is one way to do the job; one would need to load data up to 40 times to finish searching for a single movie.
Alternatively, the movie title can be hashed: an MD5 hash is 128 bits, i.e. 16 bytes, or 32 hex characters (2 hex chars per byte).
The movie info's location on disk can be represented by a path of approximately 20 words (each word taking ~10 bytes). So in all, each key (title hash) : value (location on disk) pair would take at most about 250 bytes.
Given 250 MB of RAM, we can store up to about 1 million movies, which is above our original guess at the number of movies.
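The back-of-envelope numbers above can be checked directly with Python's hashlib: an MD5 digest really is 16 bytes (32 hex characters), and at ~250 bytes per entry a 250 MB hash map covers about a million titles.

```python
import hashlib

digest = hashlib.md5(b"The Godfather").digest()
assert len(digest) == 16            # 128 bits
assert len(digest.hex()) == 32      # 2 hex chars per byte

bytes_per_entry = 16 + 200          # digest + a generous ~200-byte disk path
ram = 250 * 1024 * 1024
print(ram // 250)                   # entries at 250 B each: just over a million
```

So the 250-byte estimate leaves some slack (16 + 200 = 216 bytes of payload) for hash-map overhead.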
So, when a user searches for a title: MD5-hash the title (MD5 throughput is on the order of 400 MB per second, so hashing short titles can easily sustain thousands of requests per second), look it up in the in-memory hash map, and do one disk I/O to fetch the description and show it to the user.
The trouble is disk I/O under concurrent requests.
The nature of this application is read-heavy rather than write-heavy, so using multi-disk arrays or read-friendly RAID levels could help deliver higher IOPS.
One could also use a multi-core machine to parallelize request processing.
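Parallelizing the request path could be sketched with a thread pool, since each query is dominated by I/O. This is one simple option, not the only one; `handle_query` is a placeholder standing in for the hash-then-seek lookup described above.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

def handle_query(title: str) -> str:
    """Placeholder for: hash the title, consult the in-memory map, read from disk."""
    return hashlib.md5(title.encode("utf-8")).hexdigest()

titles = ["Heat", "Alien", "Casablanca", "Vertigo"]
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(handle_query, titles))
```

Threads work here because the workers spend most of their time blocked on disk reads; with multiple disks or a RAID array, several reads can genuinely proceed at once.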