Citigroup Interview Question
Financial Software Developers
Country: United States
a) Keep a hashmap of <string, count> and a min heap of the current top words. While reading each word, increment its count in the hashmap and compare it against the head of the min heap. If the head's count is less than the current word's count, remove the head and insert the current word.
b) Keep a hashmap of <letter, count>.
How do you check that the word you add to the heap after removing the min does not already exist in the heap? For example, if the heap holds only the two words {ab,6} and {ac,3}, the min heap head is {ac,3}. If you then encounter ab, you increment it to {ab,7}; comparing it with {ac,3}, which is smaller, you remove the head {ac} and insert {ab}, so the heap now contains the same string twice.
a) Build a heap keyed on word count.
b) Take an array of 256 entries; while reading every word, increment the count of each of its characters.
A heap isn't an adequate data structure for counting occurrences. How will you efficiently update occurrence counts as you go through the file? Note that your data structure will need to be much bigger than 10 elements, since you will also need to keep track of words that are not in the top 10 in case later occurrences of those words bring them into the top 10.
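One way to sidestep both objections above (duplicate entries in the heap, and losing track of words outside the top 10) is to finish counting everything in a hash map first, and only then make a single top-k pass with a min-heap. A minimal sketch, with hypothetical names (`TopK`, `topK`) and an in-memory word list standing in for the file:

```java
import java.util.*;

public class TopK {
    // Count every word, then select the top k in one pass with a min-heap.
    // Because the heap is built only after counting finishes, no word can
    // appear in it twice, and no word's later occurrences are lost.
    public static List<Map.Entry<String, Integer>> topK(List<String> words, int k) {
        Map<String, Integer> counts = new HashMap<>();
        for (String w : words) {
            counts.merge(w, 1, Integer::sum);  // increment, inserting 1 if absent
        }
        // Min-heap ordered by count: the root is the smallest of the current top k.
        PriorityQueue<Map.Entry<String, Integer>> heap =
                new PriorityQueue<>(Map.Entry.comparingByValue());
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            heap.offer(e);
            if (heap.size() > k) {
                heap.poll();                   // evict the current minimum
            }
        }
        List<Map.Entry<String, Integer>> result = new ArrayList<>(heap);
        result.sort(Map.Entry.<String, Integer>comparingByValue().reversed());
        return result;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("ab", "ab", "ac", "ab", "ac", "ad");
        System.out.println(topK(words, 2));    // [ab=3, ac=2]
    }
}
```

This keeps the heap at most k+1 entries at any time, so for n distinct words the selection pass costs O(n log k) instead of sorting the whole map.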
I will read character by character, storing the character counts in an array and the word counts in a map.
I have written code for a file containing only the letters a, b, and c. Here is the code:
package fileio;

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

public class WordCount {

    public static void processFile() {
        Map<String, Integer> m = new HashMap<String, Integer>();
        int[] chars = new int[27];              // chars[1]=a, chars[2]=b, chars[3]=c
        StringBuilder sb = new StringBuilder(); // the word currently being assembled
        BufferedInputStream bis = null;
        try {
            bis = new BufferedInputStream(new FileInputStream("E:/app.txt"));
            int ch;
            while ((ch = bis.read()) != -1) {
                switch (ch) {
                    case 'a':
                    case 'A':
                        chars[1]++;
                        sb.append((char) ch);
                        break;
                    case 'b':
                    case 'B':
                        chars[2]++;
                        sb.append((char) ch);
                        break;
                    case 'c':
                    case 'C':
                        chars[3]++;
                        sb.append((char) ch);
                        break;
                    case ' ':
                        countWord(m, sb);       // word boundary: flush the buffer
                        break;
                    default:
                        sb.setLength(0);        // any other character discards the word
                }
            }
            countWord(m, sb);                   // don't lose the last word at EOF
        } catch (IOException e) {
            System.out.println(e);
        } finally {
            if (bis != null) {
                try { bis.close(); } catch (IOException ignored) { }
            }
        }

        System.out.println("A[" + chars[1] + "]");
        System.out.println("B[" + chars[2] + "]");
        System.out.println("C[" + chars[3] + "]");
        System.out.println(m);

        // Repeatedly extract the current maximum to collect the top 3 words.
        Map<String, Integer> topwords = new LinkedHashMap<String, Integer>();
        for (int i = 0; i < 3 && !m.isEmpty(); i++) {
            String maxWord = null;
            int maxCount = 0;
            for (Map.Entry<String, Integer> e : m.entrySet()) {
                if (e.getValue() > maxCount) {
                    maxWord = e.getKey();
                    maxCount = e.getValue();
                }
            }
            topwords.put(maxWord, maxCount);
            m.remove(maxWord);
        }
        System.out.println(topwords);
    }

    // Counts the word currently in sb (if any) and resets the buffer.
    // Fixes a bug in the original: on consecutive spaces the previous word
    // was counted again because the stale buffer contents were reused, and
    // the final word of the file was never counted at all.
    private static void countWord(Map<String, Integer> m, StringBuilder sb) {
        if (sb.length() == 0) {
            return;
        }
        String word = sb.toString();
        Integer count = m.get(word);
        m.put(word, count == null ? 1 : count + 1);
        sb.setLength(0);
    }

    public static void main(String[] args) {
        processFile();
    }
}
As the file is very big, say more than 2 GB, I feel the code above is not very efficient and will take a long time. Instead of reading a single file, I would split the file into n smaller files based on line count, then run the processing logic in a multithreaded environment, using a fixed thread pool instead of creating many threads. I would use a ConcurrentHashMap instead of HashMap<String, Integer> to store the words and counts.
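A minimal sketch of that idea, assuming the file has already been split: the class name `ConcurrentCount`, the pool size of 4, and the in-memory `chunks` (standing in for the n smaller files) are all hypothetical.

```java
import java.util.*;
import java.util.concurrent.*;

public class ConcurrentCount {
    // Each task counts the words of one chunk into a shared ConcurrentHashMap.
    // merge() is atomic per key, so no explicit locking is needed.
    static void countChunk(List<String> lines, ConcurrentMap<String, Integer> counts) {
        for (String line : lines) {
            for (String word : line.split("\\s+")) {
                if (!word.isEmpty()) {
                    counts.merge(word, 1, Integer::sum);
                }
            }
        }
    }

    public static Map<String, Integer> countAll(List<List<String>> chunks)
            throws InterruptedException {
        ConcurrentMap<String, Integer> counts = new ConcurrentHashMap<>();
        ExecutorService pool = Executors.newFixedThreadPool(4);  // fixed pool, as suggested
        for (List<String> chunk : chunks) {
            pool.execute(() -> countChunk(chunk, counts));
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);              // wait for all tasks
        return counts;
    }

    public static void main(String[] args) throws InterruptedException {
        List<List<String>> chunks = Arrays.asList(
                Arrays.asList("a b a", "c"),
                Arrays.asList("a c c"));
        System.out.println(new TreeMap<>(countAll(chunks)));     // {a=3, b=1, c=3}
    }
}
```

Note that for this workload the threads mostly contend on the same map; whether it actually beats a single-threaded pass depends on how much of the time is I/O versus hashing.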
Partition the file using a uniform hash into N/M files (where N is the file size and M the size of main memory), so each partition fits in memory.
Read each file, and count the words using a hash table. Iterate on the hash-table and find the top k entries by using a min-heap (replace the min item in the heap if the current word count is higher).
Repeat for all the partitions and keep updating the heap to keep track of the top ranked words.
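The key property of the partitioning step is that every occurrence of a given word lands in the same partition, so each partition can be counted independently. A small sketch of the hash assignment, with hypothetical names (`Partitioner`, `partitionOf`) and an arbitrary partition count of 8:

```java
import java.util.*;

public class Partitioner {
    // All occurrences of the same word hash to the same partition file, so
    // per-partition counts are exact and the partial top-k results can be
    // merged into one heap at the end.
    static int partitionOf(String word, int numPartitions) {
        return Math.floorMod(word.hashCode(), numPartitions);  // floorMod keeps the index non-negative
    }

    public static void main(String[] args) {
        int n = 8;  // assumed: chosen so each partition fits in main memory
        for (String w : Arrays.asList("ab", "ac", "ab")) {
            System.out.println(w + " -> partition " + partitionOf(w, n));
        }
    }
}
```

`Math.floorMod` is used instead of `%` because `String.hashCode()` can be negative.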
If the size is very big, then we have to use the NIO package with a channel and buffer, and read the file in chunks, where the chunk size is the buffer size.
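A minimal sketch of that chunked NIO read, assuming hypothetical names (`NioRead`, `countBytes`) and using a temporary file so the example is self-contained; here each chunk is just measured, but the word tokenizing would go in the same place:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class NioRead {
    // Read a file in fixed-size chunks through a FileChannel rather than
    // pulling the whole file into memory at once.
    public static long countBytes(Path path, int bufferSize) throws IOException {
        long total = 0;
        try (FileChannel ch = FileChannel.open(path, StandardOpenOption.READ)) {
            ByteBuffer buf = ByteBuffer.allocate(bufferSize);  // chunk size == buffer size
            while (ch.read(buf) != -1) {
                buf.flip();                    // switch buffer from writing to reading
                total += buf.remaining();      // process the chunk here (e.g. tokenize)
                buf.clear();                   // ready for the next chunk
            }
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("niodemo", ".txt");
        Files.write(tmp, "hello nio world".getBytes(StandardCharsets.UTF_8));
        System.out.println(countBytes(tmp, 4));  // 15
        Files.delete(tmp);
    }
}
```

One thing chunked reading adds over line-at-a-time reading: a word can straddle a chunk boundary, so the tokenizer must carry the trailing partial word over to the next chunk.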
package com.first;

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.StringTokenizer;

public class FrequencyWord {

    public static void main(String[] args) {
        Map<String, Integer> hm = new HashMap<String, Integer>();
        BufferedReader br = null;
        try {
            br = new BufferedReader(new FileReader("E:\\java Preparation\\click me\\Rescued.txt"));
            String currentLine;
            while ((currentLine = br.readLine()) != null) {
                StringTokenizer st = new StringTokenizer(currentLine, " ");
                while (st.hasMoreTokens()) {
                    String word = st.nextToken();
                    Integer count = hm.get(word);
                    hm.put(word, count == null ? 1 : count + 1);
                }
            }
            System.out.println(hm.size());

            // Single pass over the map to find the most frequent word.
            // Starting maxFreq at 0 (not 1, as in the original) ensures a word
            // is still reported when every word occurs exactly once.
            String maxWord = null;
            int maxFreq = 0;
            for (Map.Entry<String, Integer> e : hm.entrySet()) {
                if (e.getValue() > maxFreq) {
                    maxFreq = e.getValue();
                    maxWord = e.getKey();
                }
            }
            System.out.println(maxWord + " : " + maxFreq);
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            if (br != null) {
                try { br.close(); } catch (IOException ignored) { }
            }
        }
    }
}
I have tested it with a file of 50,000 words and the output is "the" as the most frequent word.
Create two hash tables. The first has the 26 letters of the English alphabet a-z as keys, with the value field counting the number of occurrences of each letter in the file. Now we start reading the file: for every letter we read, we update the letter hash table, and for every word, we update the second hash table. If the word is already present, we increment its value; if not, we create a new entry for the word with a value of one.
Step 1: Declare a min heap, say minheap, of 10 string items, where each node carries a count variable used to maintain the heap property. Declare a hashmap, say hmap, with the string as its key and the string's count as its value. Declare a 256-entry int array, say chCount, if the strings contain only ASCII characters; otherwise declare an array sized for 16-bit characters.
Step 2: Scan each character from the file and update chCount for it. When a word ends, insert it into hmap (or update it if it is already there), incrementing its value by one. Then compare this word's count with the top element of minheap. If the top element's count is less than the word's value in hmap and the word is not already in minheap, replace the min element with the word. If the word is already in minheap, just increment its count and re-adjust the heap.
Finally, minheap holds the 10 most frequent words and chCount holds the character counts.
- BVarghese April 21, 2013