Interview Question for SDE1s


Country: United States
Interview Type: In-Person





Assumptions:
1. We cannot load all the files into memory (the input is 1 TB).
2. "Word" means a whole word (the text "killer" contains just one word; it does not contain the word "kill").

1. Load several source files into memory (for example 100 files, about 1 GB).
2. Build a balanced tree (e.g., an RB-tree) where the key is a word and the value is its number of occurrences.
3. Walk the tree in order and save it to a new temp file, one line per word in the format "word number" (the file is therefore sorted by word).
4. Repeat steps 1-3 until all files are processed.
5. Open 2 temp files and apply an external merge sort: read a line from each file; when the same word appears in both, merge the two records into one with the summed count, and write the result to a new temp file. Maintain the most frequent word(s) seen so far (see the sketch after this list).
6. Delete the processed temp files.
7. Repeat until only one file remains.
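
A minimal Java sketch of the merge in step 5, assuming each temp file is sorted by word with one "word count" pair per line (file and method names here are illustrative, not from the original answer):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.PrintWriter;

public class MergeTempFiles {
    // Merge two word-sorted temp files, summing the counts of equal words,
    // and track the most frequent word seen (meaningful on the final pass).
    static String mergeAndTrackMax(String fileA, String fileB, String outFile) throws Exception {
        try (BufferedReader a = new BufferedReader(new FileReader(fileA));
             BufferedReader b = new BufferedReader(new FileReader(fileB));
             PrintWriter out = new PrintWriter(outFile)) {

            String la = a.readLine(), lb = b.readLine();
            String bestWord = null;
            long bestCount = -1;

            while (la != null || lb != null) {
                // Pick the lexicographically smaller word, as in any merge step.
                int cmp = (la == null) ? 1 : (lb == null) ? -1
                        : la.split(" ")[0].compareTo(lb.split(" ")[0]);
                String word;
                long count;
                if (cmp < 0) {                       // word present only in file A
                    word = la.split(" ")[0];
                    count = Long.parseLong(la.split(" ")[1]);
                    la = a.readLine();
                } else if (cmp > 0) {                // word present only in file B
                    word = lb.split(" ")[0];
                    count = Long.parseLong(lb.split(" ")[1]);
                    lb = b.readLine();
                } else {                             // same word in both: sum counts
                    word = la.split(" ")[0];
                    count = Long.parseLong(la.split(" ")[1]) + Long.parseLong(lb.split(" ")[1]);
                    la = a.readLine();
                    lb = b.readLine();
                }
                out.println(word + " " + count);
                if (count > bestCount) {             // running maximum
                    bestCount = count;
                    bestWord = word;
                }
            }
            return bestWord;
        }
    }
}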

- Ivan April 22, 2015

A trie can be used for this problem.

N - number of words
L[n] - number of characters in word n

1. Iterate through the files and read the next word. Complexity: O(L[n]) per word.
2. Build a trie along the way, inserting each word as it is found; terminal nodes hold the word's occurrence count. Complexity: O(L[n]) per word.
3. Keep a counter (or an array of the top 5) for the most frequent word(s).

Overall complexity: O(L[1] + ... + L[N]), i.e., linear in the total number of characters (see the sketch below).
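
A minimal trie sketch in Java, assuming lowercase a-z words (the class and field names are illustrative):

public class WordTrie {
    private static final int ALPHABET = 26;   // assumes lowercase a-z

    private static final class Node {
        Node[] child = new Node[ALPHABET];
        int count;                            // > 0 only at a word's terminal node
    }

    private final Node root = new Node();
    private String bestWord = null;
    private int bestCount = 0;

    // Insert one word and bump its count: O(L[n]) for a word of length L[n].
    public void add(String word) {
        Node cur = root;
        for (int i = 0; i < word.length(); i++) {
            int c = word.charAt(i) - 'a';
            if (cur.child[c] == null)
                cur.child[c] = new Node();
            cur = cur.child[c];
        }
        cur.count++;
        if (cur.count > bestCount) {          // running most-frequent word
            bestCount = cur.count;
            bestWord = word;
        }
    }

    public String mostFrequent() {
        return bestWord;
    }
}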

- Marcello Ghali April 21, 2015

It seems you don't have enough memory to hold such a trie in memory (the input is 1 TB).

- Ivan April 22, 2015

I'm using a HashMap to build the frequency counts and a PriorityQueue (used as a min-heap of size N) to maintain the N most frequent words.
Inserting or removing a record in the PriorityQueue is O(log N), i.e., log of 5 here.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.Reader;
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;

/**
 * Created by sky on 21/04/15.
 */
public class MostFrequentWords {
    private static final class Counter {
        private String word;
        private int counter;

        public Counter(String word, int initial) {
            this.word = word;
            this.counter = initial;
        }

        public void increment() {
            counter++;
        }

        @Override
        public String toString() {
            return "[" + word + ", " + counter + "]";
        }
    }

    public static void main(String[] args) throws Exception {
        Reader in = new InputStreamReader(System.in);
        BufferedReader br = new BufferedReader(in);

        Map<String, Counter> freq = new HashMap<>();

        // Number of top words to keep; the thread discusses the top 5 ("log N of 5").
        int N = 5;


        // Min-heap on count: the root is the least frequent of the words kept so far.
        PriorityQueue<Counter> priorityQueue = new PriorityQueue<>(N + 1, new Comparator<Counter>() {
            @Override
            public int compare(Counter o1, Counter o2) {
                return o1.counter - o2.counter;
            }
        });

        String line;
        // Assumes the input arrives one word per line on stdin.
        while ((line = br.readLine()) != null) {
            Counter counter = freq.get(line);

            if (counter == null) {
                counter = new Counter(line, 1);
                freq.put(line, counter);
                priorityQueue.add(counter);
            } else {
                // Remove and re-insert so the heap reorders after the count changes;
                // mutating the element in place (or re-adding it without removing)
                // would corrupt the heap with stale or duplicate entries.
                priorityQueue.remove(counter);
                counter.increment();
                priorityQueue.add(counter);
            }

            // Evict the least frequent candidate once we hold more than N.
            if (priorityQueue.size() > N)
                priorityQueue.remove();
        }

        br.close();
        in.close();

        // The min-heap drains from least to most frequent.
        while (!priorityQueue.isEmpty()) {
            Counter max = priorityQueue.remove();
            System.out.println(max);
        }

    }
}

- Felipe Cerqueira April 21, 2015

Use MapReduce.
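
For concreteness, a minimal Hadoop word-count sketch in Java (the canonical mapper/reducer pair; finding the top 5 would be a small second pass over the reducer output):

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {
    // Map: emit (word, 1) for every word in the input split.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: sum the partial counts for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values)
                sum += v.get();
            context.write(key, new IntWritable(sum));
        }
    }
}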

- inucoder April 21, 2015

Assuming we have 1 GB of RAM:
1. Process 100 files simultaneously.
2. To count occurrences, keep a HashMap<Word, Occurrence> for all the words.
3. To get the top 5 elements, also maintain a min-heap of the current top words.

- Sushant April 25, 2015

Instead of building a suffix tree, we can use a map<string, integer>. Load all the file names into a queue, and start 100 threads at a time (depending on memory). Each thread reads one file, builds a map, and also keeps a list of the top 5 words (keys) alongside it. When a thread finishes, it writes all its data to a file and takes the next file name from the queue. At the end we have 100 files; we can then merge them and find the top 5 elements. To speed this up, we can keep these 100 files as memory-mapped files (see the sketch below).
A suffix tree is not a good choice here: when we check whether a word exists, a hash map answers in one lookup, while a suffix tree has to traverse its nodes.
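
A minimal Java sketch of the worker-pool part, assuming a shared queue of file names and a private count map per file (names are illustrative; the final merge of the *.counts files is omitted):

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class PerFileCounter {
    public static void main(String[] args) throws Exception {
        // Queue of input file names; each worker repeatedly takes the next one.
        ConcurrentLinkedQueue<String> files =
                new ConcurrentLinkedQueue<>(java.util.Arrays.asList(args));

        ExecutorService pool = Executors.newFixedThreadPool(100);
        for (int t = 0; t < 100; t++) {
            pool.submit(() -> {
                String name;
                while ((name = files.poll()) != null) {
                    try {
                        // Count words in this one file with a thread-private map.
                        // (readAllLines is fine for modest files; very large files
                        // would need a streaming reader instead.)
                        Map<String, Long> counts = new HashMap<>();
                        for (String fileLine : Files.readAllLines(Paths.get(name)))
                            for (String word : fileLine.split("\\s+"))
                                if (!word.isEmpty())
                                    counts.merge(word, 1L, Long::sum);

                        // Write "word count" lines to a per-file result.
                        StringBuilder sb = new StringBuilder();
                        counts.forEach((w, c) -> sb.append(w).append(' ').append(c).append('\n'));
                        Files.write(Paths.get(name + ".counts"), sb.toString().getBytes());
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
        // A final pass would merge the *.counts files and pick the top 5.
    }
}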

- Phoniex July 27, 2015

Map/reduce plus hashing seems like the right approach, whether we do it on a single computer or across multiple machines.

- interviewbuster August 09, 2016

An even simpler approach: divide the files across a Hadoop cluster. Each box runs a shell script (e.g., built on the Unix tr command) to extract its most frequently used words and dumps the results to a file. Then aggregate all the result files to find the overall top words.

- Anonymous June 22, 2017

I would go with the following solution:
Store all words in a Map<String, Integer>, where the key is a word and the Integer is that word's counter.
Then running a selection algorithm over the values gives you as many of the most frequent words as you need (see the sketch below).
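
A minimal selection sketch in Java, using quickselect over the count values (names are illustrative; to recover the words, partition Map.Entry objects by value instead of a bare int[]):

import java.util.Random;

public class SelectTopK {
    private static final Random RND = new Random();

    // Quickselect: afterwards counts[0..k-1] hold the k largest values
    // (in no particular order). Average O(n), worst case O(n^2).
    static void selectTopK(int[] counts, int k) {
        int lo = 0, hi = counts.length - 1;
        while (lo < hi) {
            int p = partitionDesc(counts, lo, hi);
            if (p == k - 1) break;
            if (p < k - 1) lo = p + 1;
            else hi = p - 1;
        }
    }

    // Partition in descending order around a random pivot; returns the pivot's final index.
    private static int partitionDesc(int[] a, int lo, int hi) {
        swap(a, lo + RND.nextInt(hi - lo + 1), hi);
        int pivot = a[hi], store = lo;
        for (int i = lo; i < hi; i++)
            if (a[i] > pivot)
                swap(a, i, store++);
        swap(a, store, hi);
        return store;
    }

    private static void swap(int[] a, int i, int j) {
        int t = a[i]; a[i] = a[j]; a[j] = t;
    }
}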

- xpoh April 21, 2015

Consider the words "key" and "keychain": a dictionary needs 2 separate entries, but since they share the common prefix "key", they can be compacted using, say, a prefix tree.

- Marcello Ghali April 21, 2015

