Google Interview Question
Software Engineer / DevelopersCountry: United States
Interview Type: In-Person
if we are dealing with boolean search query we can store different sets of documents on different machines, and do the intersect of the results from each machine.
for example if we shard our documents to 1 machine having a-h, 2 machine i-o , 3 machine p-z
and we get a search query like "number of employees at google today"
employees, at google -> machine 1
number, of -> machine 2
today -> machine 3
if we store the documents on inverted index for each word, for example
employees -> doc1, doc3, doc5
at -> doc1, doc2, doc3, doc4, doc 5
google -> doc1 doc5
do machine 1 will return doc1, doc5 and intersect it with the result of the other machines.
if we are dealing with boolean search query we can store different sets of documents on different machines, and do the intersect of the results from each machine.
for example if we shard our documents to 1 machine having a-h, 2 machine i-o , 3 machine p-z
and we get a search query like "number of employees at google today"
employees, at google -> machine 1
number, of -> machine 2
today -> machine 3
if we store the documents on inverted index for each word, for example
employees -> doc1, doc3, doc5
at -> doc1, doc2, doc3, doc4, doc 5
google -> doc1 doc5
do machine 1 will return doc1, doc5 and intersect it with the result of the other machines.
First, talk with the interviewer, "Why split a search query across multiple machines?". It is for high performance or completeness or both. Or "What is on the multiple machines?"
Second, Split Query Index and Document.
Third, if the index is slitted into K parts kept on K machines.
then the query must be split into K data-based sub-queries.
(It depends on the split method)
Forth, if the index is duplicated into M parts on M machines.
If the query is time costly, one way is to split the query into M logic-based sub-query.
What is data-based sub-query and logic-based sub-query?
data-based sub-query: each sub-query scans different parts of the Index using same keywords.
logic-based sub-query: each sub-query scans the same Index using different keywords.
logic-based sub-query (Divide and Merge) is more complex.
Finally, when the target index is complete. Get the target documents on Document Server.
The same method is used on duplication servers and partition servers.
if we are dealing with boolean search query we can store different sets of documents on different machines, and do the intersect of the results from each machine.
for example if we shard our documents to 1 machine having a-h, 2 machine i-o , 3 machine p-z
and we get a search query like "number of employees at google today"
employees, at google -> machine 1
number, of -> machine 2
today -> machine 3
if we store the documents on inverted index for each word, for example
employees -> doc1, doc3, doc5
at -> doc1, doc2, doc3, doc4, doc 5
google -> doc1 doc5
do machine 1 will return doc1, doc5 and intersect it with the result of the other machines.
if we are dealing with boolean search query we can store different sets of documents on different machines, and do the intersect of the results from each machine.
for example if we shard our documents to 1 machine having a-h, 2 machine i-o , 3 machine p-z
and we get a search query like "number of employees at google today"
employees, at google -> machine 1
number, of -> machine 2
today -> machine 3
if we store the documents on inverted index for each word, for example
employees -> doc1, doc3, doc5
at -> doc1, doc2, doc3, doc4, doc 5
google -> doc1 doc5
do machine 1 will return doc1, doc5 and intersect it with the result of the other machines.
if we are dealing with boolean search query we can store different sets of documents on different machines, and do the intersect of the results from each machine.
for example if we shard our documents to 1 machine having a-h, 2 machine i-o , 3 machine p-z
and we get a search query like "number of employees at google today"
employees, at google -> machine 1
number, of -> machine 2
today -> machine 3
if we store the documents on inverted index for each word, for example
employees -> doc1, doc3, doc5
at -> doc1, doc2, doc3, doc4, doc 5
google -> doc1 doc5
do machine 1 will return doc1, doc5 and intersect it with the result of the other machines.
Strategy of splitting a search query across multiple machines would depend on the type of distributed index maintained by the machines.
- Saurabh March 13, 2014Typically for large scaled index, the architecture be, there will be set of master index servers 1 to n that index documents from the crawler. Index is maintained across all these servers.
For every index master server, there will be multiple (1 to m) slave servers with replicated indexes. These slave servers will handle queries.
Now typically a query analyzer will receive query fired from the browser. Query analyzer will split the query into multiple sub queries as follows:
So if the query is like "Google Mountain view jobs"
Sub queries:
1. complete query: "Google Mountain view jobs"
2. boolean query: "Google AND Mountain AND view AND jobs"
3. Distance / Proximity query: (Google Mountain) AND (Mountain view) AND (View jobs) AND (Google Mountain View) AND (Mountain View Jobs)
4 .Single word queries: Google, Mountain, view, Jobs
Typically there will be another fuzzy query analyzer which will re-issue single word queries as fuzzy queries
5. Single word fuzzy queries: Google~, Mountain~, view~, jobs~
6. Range Query: if the search term includes a date range, regular expressions will pickup the date range and create a date range query.
Once all the queries are finalized, the queries are fired across multiple shards (slave servers) in parallel.
Now this could be done using Publish Subscribe Messaging model or by directly firing query across multiple slave server using software or hardware load balanced way.
All the servers will receive one query each and will come up with the matching document result with the type weight (vector like structure) and count weight (no of word counts matched in the document).
Based on their individual weights, a dot product will be calculated to compute an IR score.Also a page rank score is calculated based on the search result page weight. (example: if the search resulted in wikipedia.org page then this result will have heigher value than mamapapa.com page which happens to have same searched terms)
Multiple results will be combined together based on their weight rank and page rank and first K results will be sent back to the server caching the complete result in the cache servers so that if the user clicks on page 2 of the result, the result can be sent back form the cache server than to compute the result all over again.