Amazon Interview Question
For SDE-1s
Country: India
Interview Type: Phone Interview
A trie works well here: store each word's count in its terminal node, and keep the current maximum count and word in a pair of variables, updating them whenever a word's new count exceeds the maximum.
If the N most frequent words have to be found, use a heap ordered by frequency to keep track of the words.
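A minimal sketch of the top-N idea, assuming the word counts have already been collected into a map (the class and method names here are illustrative). Instead of popping N times from a max-heap over all words, it keeps a size-N min-heap and evicts the smallest count, which bounds memory at O(N):

```java
import java.util.*;

public class TopNWords {
    // Returns the n most frequent words from the given counts,
    // most frequent first. The min-heap never grows past n entries.
    static List<String> topN(Map<String, Integer> counts, int n) {
        PriorityQueue<Map.Entry<String, Integer>> heap =
                new PriorityQueue<>(Map.Entry.comparingByValue());
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            heap.offer(e);
            if (heap.size() > n) heap.poll(); // evict current minimum
        }
        List<String> result = new ArrayList<>();
        while (!heap.isEmpty()) result.add(heap.poll().getKey());
        Collections.reverse(result); // heap pops smallest first
        return result;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new HashMap<>();
        counts.put("the", 5);
        counts.put("cat", 2);
        counts.put("sat", 1);
        System.out.println(topN(counts, 2)); // [the, cat]
    }
}
```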
A simple HashMap-based solution with O(n) complexity:
package amazon.test.practice;

import java.util.*;
import java.io.*;

public class Wordcount {
    public static void main(String[] args) throws FileNotFoundException {
        Map<String, Integer> hm = new HashMap<>();
        Scanner input = new Scanner(new File("c:\\1.txt"));
        String max = "none"; // sentinel, replaced by the first real word
        hm.put(max, 0);
        while (input.hasNext()) {
            String word = input.next();
            // Increment the word's count, inserting it with count 1 if new.
            if (hm.containsKey(word)) {
                hm.put(word, hm.get(word) + 1);
            } else {
                hm.put(word, 1);
            }
            // Track the running maximum so no second pass is needed.
            if (hm.get(word) > hm.get(max)) {
                max = word;
            }
        }
        System.out.println(max + " ------- " + hm.get(max));
    }
}
In my opinion, it depends on the environment you have, how frequently you have to do this, and what the expected turnaround time is.
If the requirement is to do this often and you have a client/server environment with multiple workers, some sort of MapReduce might be an alternative. A server creates chunks (splits) of the given file and passes each chunk's boundaries (from, to) to the workers; some workers do the "map" and others do the "reduce".
In the map phase, each worker builds a mapping of words to occurrences for its chunk, then writes it to an intermediate store (e.g. disk).
In the reduce phase, workers take the map results produced by the different workers and combine them into a final result.
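The two phases above can be sketched in a single process; this is only an illustration of the map/reduce split, assuming the chunks have already been read from the file (the class name and sample chunks are made up):

```java
import java.util.*;
import java.util.stream.*;

public class MapReduceWordCount {
    // Map phase: each chunk of words becomes its own word->count map.
    static Map<String, Long> mapChunk(List<String> chunk) {
        return chunk.stream()
                .collect(Collectors.groupingBy(w -> w, Collectors.counting()));
    }

    // Reduce phase: merge the per-chunk partial maps into one result.
    static Map<String, Long> reduce(List<Map<String, Long>> partials) {
        Map<String, Long> total = new HashMap<>();
        for (Map<String, Long> p : partials)
            p.forEach((w, c) -> total.merge(w, c, Long::sum));
        return total;
    }

    public static void main(String[] args) {
        // Two "workers", each handling one chunk of the file.
        List<List<String>> chunks = List.of(
                List.of("a", "b", "a"),
                List.of("b", "a"));
        List<Map<String, Long>> partials = chunks.parallelStream()
                .map(MapReduceWordCount::mapChunk)
                .collect(Collectors.toList());
        System.out.println(reduce(partials)); // {a=3, b=2}
    }
}
```

In a real deployment the partial maps would travel through the intermediate store rather than a local list, but the merge logic is the same.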
Iterate through the file reading one word at a time and record each word in a dictionary.
Case 1: the word is already present in the dictionary - increment its value.
Case 2: the word is not present - add the word as a key with value 1.
C# code:
// Assumes 'file' is an open StreamReader over the input file.
Dictionary<string, int> dicWordCount = new Dictionary<string, int>();
string w;
while ((w = file.ReadLine()) != null)
{
    if (dicWordCount.ContainsKey(w))
    {
        dicWordCount[w]++; // increment the value by one
    }
    else
    {
        dicWordCount.Add(w, 1);
    }
}

// Now iterate through the entries and find the maximum value.
int maxOccurrence = 0;
string keyWord = "";
foreach (string s in dicWordCount.Keys)
{
    int count = dicWordCount[s];
    if (maxOccurrence < count)
    {
        maxOccurrence = count;
        keyWord = s;
    }
}
If we say huge file, then it can grow arbitrarily large, and even the number of unique elements can exceed what fits in memory (whether you use a trie or a hash table). So I think:
- Ajeet May 17, 2014

The obvious answer is of course to keep a hash map/trie and store a counter of the occurrence of elements as you move through the file. This is (in terms of time complexity) the optimal solution.
If, however, your space requirements are tight, you can perform an external sort on the file and then find the longest consecutive run of equal elements. This has a constant memory footprint beyond the sort itself and can be done in Θ(n log n).
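The second pass over the sorted output can be sketched as follows. After an external sort, equal words are adjacent, so the most frequent word is the longest consecutive run; this scan keeps only the current and best runs in memory (the class name and the in-memory iterator standing in for the sorted file are illustrative):

```java
import java.util.*;

public class LongestRun {
    // One pass over sorted input, O(1) memory: track the current run
    // and remember the longest run seen so far.
    static String mostFrequent(Iterator<String> sortedWords) {
        String best = null, current = null;
        int bestLen = 0, runLen = 0;
        while (sortedWords.hasNext()) {
            String w = sortedWords.next();
            if (w.equals(current)) {
                runLen++; // run continues
            } else {
                current = w; // new run starts
                runLen = 1;
            }
            if (runLen > bestLen) {
                bestLen = runLen;
                best = current;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Stands in for the stream produced by an external merge sort.
        Iterator<String> it =
                List.of("apple", "bee", "bee", "bee", "cat", "cat").iterator();
        System.out.println(mostFrequent(it)); // bee
    }
}
```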