How would you split a search q

Google Interview Question for Software Engineer / Developers

-5

of 7 votes

9
Answers
How would you split a search query across multiple machines?
- Guy February 10, 2014 in United States | Report Duplicate | Flag | PURGE
Google Software Engineer / Developer System Design

Email me when people comment.

An error occurred in subscribing you.

Country: United States
Interview Type: In-Person

Email me when people comment.

An error occurred in subscribing you.

Comment hidden because of low score. Click to expand.

of 13 vote

Strategy of splitting a search query across multiple machines would depend on the type of distributed index maintained by the machines.

Typically for large scaled index, the architecture be, there will be set of master index servers 1 to n that index documents from the crawler. Index is maintained across all these servers.

For every index master server, there will be multiple (1 to m) slave servers with replicated indexes. These slave servers will handle queries.

Now typically a query analyzer will receive query fired from the browser. Query analyzer will split the query into multiple sub queries as follows:
So if the query is like "Google Mountain view jobs"
Sub queries:
1. complete query: "Google Mountain view jobs"
2. boolean query: "Google AND Mountain AND view AND jobs"
3. Distance / Proximity query: (Google Mountain) AND (Mountain view) AND (View jobs) AND (Google Mountain View) AND (Mountain View Jobs)
4 .Single word queries: Google, Mountain, view, Jobs

Typically there will be another fuzzy query analyzer which will re-issue single word queries as fuzzy queries
5. Single word fuzzy queries: Google~, Mountain~, view~, jobs~
6. Range Query: if the search term includes a date range, regular expressions will pickup the date range and create a date range query.

Once all the queries are finalized, the queries are fired across multiple shards (slave servers) in parallel.

Now this could be done using Publish Subscribe Messaging model or by directly firing query across multiple slave server using software or hardware load balanced way.

All the servers will receive one query each and will come up with the matching document result with the type weight (vector like structure) and count weight (no of word counts matched in the document).

Based on their individual weights, a dot product will be calculated to compute an IR score.Also a page rank score is calculated based on the search result page weight. (example: if the search resulted in wikipedia.org page then this result will have heigher value than mamapapa.com page which happens to have same searched terms)

Multiple results will be combined together based on their weight rank and page rank and first K results will be sent back to the server caching the complete result in the cache servers so that if the user clicks on page 2 of the result, the result can be sent back form the cache server than to compute the result all over again.

- Saurabh March 13, 2014 | Flag Reply

Comment hidden because of low score. Click to expand.

of 0 vote

if we are dealing with boolean search query we can store different sets of documents on different machines, and do the intersect of the results from each machine.

for example if we shard our documents to 1 machine having a-h, 2 machine i-o , 3 machine p-z

and we get a search query like "number of employees at google today"
employees, at google -> machine 1
number, of -> machine 2
today -> machine 3

if we store the documents on inverted index for each word, for example
employees -> doc1, doc3, doc5
at -> doc1, doc2, doc3, doc4, doc 5
google -> doc1 doc5
do machine 1 will return doc1, doc5 and intersect it with the result of the other machines.

- abe February 11, 2014 | Flag Reply

Comment hidden because of low score. Click to expand.

of 0 vote

- abe February 11, 2014 | Flag Reply

Comment hidden because of low score. Click to expand.

of 0 vote

First, talk with the interviewer, "Why split a search query across multiple machines?". It is for high performance or completeness or both. Or "What is on the multiple machines?"

Second, Split Query Index and Document.

Third, if the index is slitted into K parts kept on K machines.
then the query must be split into K data-based sub-queries.
(It depends on the split method)

Forth, if the index is duplicated into M parts on M machines.
If the query is time costly, one way is to split the query into M logic-based sub-query.

What is data-based sub-query and logic-based sub-query?
data-based sub-query: each sub-query scans different parts of the Index using same keywords.
logic-based sub-query: each sub-query scans the same Index using different keywords.
logic-based sub-query (Divide and Merge) is more complex.

Finally, when the target index is complete. Get the target documents on Document Server.
The same method is used on duplication servers and partition servers.

CareerCup

Google Interview Question for Software Engineer / Developers

Books

Videos

Resume Review

Mock Interviews