Google Interview Question
Software Engineer / Developer (Country: India)
Interview Type: Phone Interview
It has to be done iteratively.
Pick some lines at random from the "hello" file, for example in the following pattern: take the first line of the "hello" file, then move the cursor to the line at the 100 MB offset, then to the 200 MB offset, and so on. Compare the sampled lines against each and every target file; whichever file has the most "found" entries is the one to focus on, then move to the next file.
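The sampling step described above can be sketched as follows. This is a minimal sketch, not the interviewer's reference solution: `LineSampler` and the step size are illustrative names, and the code seeks to each byte offset and then skips forward to the next full line so it never returns a partial line.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

class LineSampler
{
    // Reads one full line at each stepBytes offset (e.g. 100 MB) in the file.
    // After seeking, the partial line at the seek position is discarded so
    // only complete lines are returned.
    public static List<string> SampleLines(string path, long stepBytes)
    {
        var samples = new List<string>();
        using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read))
        using (var reader = new StreamReader(fs))
        {
            for (long offset = 0; offset < fs.Length; offset += stepBytes)
            {
                fs.Seek(offset, SeekOrigin.Begin);
                reader.DiscardBufferedData();        // resync reader with stream
                if (offset > 0) reader.ReadLine();   // drop the partial line
                string line = reader.ReadLine();
                if (line != null) samples.Add(line);
            }
        }
        return samples;
    }
}
```

Each sampled line can then be searched for in every target file, counting the "found" entries per file.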
There are two approaches to solve this problem:
Solution 1: This is an expensive approach, since it compares each word of the source file against each word of every target file.
1. Compare each word from "Hello" file with each word from each of the "target files".
2. Set a match index based on the comparison. Duplicate words found during comparison can optionally be excluded from the count.
3. The maximum of the ten match indexes identifies the file with the closest match.
Solution 2:
1. Let's treat the source file "Hello" as a dictionary of words.
2. Compare each target file against the dictionary of words to get the target match indexes.
3. Based on whichever match index is closest, we can decide which file is the closest one.
eg: Source Match Index = 45
Target 1 Match Index = 10
Target 2 Match Index = 20
Target 3 Match Index = 30
Target 4 Match Index = 40
Target 5 Match Index = 50, etc.
if (source file size < target file size)
then Target 5 would be the closest file, because most of the source file's words could be present in the target;
else Target 4 would be the closest option.
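The decision rule in the example above can be sketched as follows. `ClosestIndex` is a hypothetical helper, not part of the solution code below; it picks the target match index nearest to the source's, breaking ties toward the larger index when the source file is smaller than the target (matching the Target 5 vs. Target 4 reasoning above).

```csharp
using System;

class ClosestIndexDemo
{
    // Returns the position of the target match index nearest to sourceIndex.
    // On a tie, prefers the larger index when the source file is smaller
    // than the target file (per the example's reasoning).
    public static int ClosestIndex(int sourceIndex, int[] targetIndexes,
                                   bool sourceSmallerThanTarget)
    {
        int best = 0;
        for (int i = 1; i < targetIndexes.Length; i++)
        {
            int d = Math.Abs(targetIndexes[i] - sourceIndex);
            int bd = Math.Abs(targetIndexes[best] - sourceIndex);
            if (d < bd ||
                (d == bd && sourceSmallerThanTarget &&
                 targetIndexes[i] > targetIndexes[best]))
            {
                best = i;
            }
        }
        return best;
    }
}
```

With the example values (source = 45, targets = 10, 20, 30, 40, 50), this selects Target 5 when the source is smaller than the target and Target 4 otherwise.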
Below is the code for the problem. I used three files instead of ten as an example.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.IO;
namespace ClosestFileMatching
{
public class ClosestFileMatch
{
    public string[] targetfilename = new string[] { "C:\\Test\\one.txt", "C:\\Test\\two.txt", "C:\\Test\\three.txt" };
    public int[] MatchIndexArray = new int[3];
    public string sourcefilename;
    Dictionary<string, string> sourceDict = new Dictionary<string, string>();

    /// <summary>
    /// Compares the words of each target file against the source dictionary
    /// and returns the array of match indexes, one per target file.
    /// </summary>
    public int[] CompareWords()
    {
        for (int i = 0; i < targetfilename.Length; i++)
        {
            int matchindex = 0;
            using (StreamReader tr = new StreamReader(targetfilename[i]))
            {
                // Read every line of the target file, not just the first one.
                string line;
                while ((line = tr.ReadLine()) != null)
                {
                    foreach (string targetstr in line.Split(' '))
                    {
                        if (sourceDict.ContainsKey(targetstr))
                        {
                            matchindex++;
                        }
                    }
                }
            }
            MatchIndexArray[i] = matchindex;
        }
        return MatchIndexArray;
    }

    /// <summary>
    /// Splits the source file into words and adds each distinct word to the dictionary.
    /// </summary>
    public void SplitWordsAddDict()
    {
        using (StreamReader sr = new StreamReader(sourcefilename))
        {
            string line;
            while ((line = sr.ReadLine()) != null)
            {
                foreach (string str in line.Split(' '))
                {
                    // Dictionary.Add throws on duplicate keys, so skip words already seen.
                    if (!sourceDict.ContainsKey(str))
                    {
                        sourceDict.Add(str, str);
                    }
                }
            }
        }
    }

    static void Main(string[] args)
    {
        ClosestFileMatch filematch = new ClosestFileMatch();
        filematch.sourcefilename = "C:\\Test\\SourceFile.txt";
        filematch.SplitWordsAddDict();
        filematch.CompareWords();
        int max = filematch.MatchIndexArray.Max();
        Console.WriteLine(max);
        // Loop over the whole array; stopping at Length - 1 would skip the last file.
        for (int i = 0; i < filematch.MatchIndexArray.Length; i++)
        {
            if (filematch.MatchIndexArray[i] == max)
            {
                Console.WriteLine(filematch.targetfilename[i]);
            }
        }
    }
}
}
There are many file-similarity algorithms for this kind of problem, such as the TTTD (Two Threshold Two Divisor) algorithm, Rabin fingerprinting, etc. The file sizes here are very large, so the idea is to generate fingerprints that represent each file, using hash functions such as simHash or minHash.
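As a rough illustration of the minHash idea mentioned above (a minimal sketch with a hypothetical mixing function, not a production fingerprinting scheme): hash every word of a file with k seeded hash functions, keep the minimum per seed as the file's signature, and estimate similarity as the fraction of signature slots that agree.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class MinHashSketch
{
    // Builds a k-slot minHash signature from a file's word set. Each slot
    // keeps the minimum of a differently-seeded hash over all words.
    public static int[] Signature(IEnumerable<string> words, int k)
    {
        var sig = Enumerable.Repeat(int.MaxValue, k).ToArray();
        foreach (string w in words)
        {
            for (int i = 0; i < k; i++)
            {
                int h = Mix(w.GetHashCode(), i);
                if (h < sig[i]) sig[i] = h;
            }
        }
        return sig;
    }

    // Simple seeded integer mixer (illustrative, not cryptographic).
    static int Mix(int h, int seed)
    {
        unchecked
        {
            uint x = (uint)h ^ ((uint)seed * 0x9E3779B9u);
            x ^= x >> 16;
            x *= 0x85EBCA6Bu;
            x ^= x >> 13;
            return (int)(x & 0x7FFFFFFFu);
        }
    }

    // Estimated Jaccard similarity: fraction of slots where the minima agree.
    public static double Similarity(int[] a, int[] b)
    {
        int same = 0;
        for (int i = 0; i < a.Length; i++)
            if (a[i] == b[i]) same++;
        return (double)same / a.Length;
    }
}
```

Because each file is reduced to k integers, very large files can be compared without rereading them; the signature with the highest estimated similarity to the source's marks the closest file.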
Seems to be a problem of information retrieval and text mining.
- Sagar, June 02, 2014

Cosine similarity seems to be the trick... We need to construct feature vectors using tf-idf weights and then compute their similarity.
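The cosine-similarity suggestion could be sketched like this. It is a minimal illustration using raw term frequencies only; the commenter's full proposal would additionally weight each term by its idf before taking the cosine.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class CosineSimilarity
{
    // Term-frequency vector of a document (word -> count).
    // A tf-idf variant would multiply each count by log(N / df(word)).
    public static Dictionary<string, double> TermFreq(IEnumerable<string> words)
    {
        var tf = new Dictionary<string, double>();
        foreach (string w in words)
            tf[w] = tf.TryGetValue(w, out double c) ? c + 1 : 1;
        return tf;
    }

    // Cosine of the angle between two sparse vectors: dot(a,b) / (|a| * |b|).
    public static double Cosine(Dictionary<string, double> a,
                                Dictionary<string, double> b)
    {
        double dot = 0;
        foreach (var kv in a)
            if (b.TryGetValue(kv.Key, out double v)) dot += kv.Value * v;
        double na = Math.Sqrt(a.Values.Sum(x => x * x));
        double nb = Math.Sqrt(b.Values.Sum(x => x * x));
        return (na == 0 || nb == 0) ? 0 : dot / (na * nb);
    }
}
```

The source file and each target file become vectors; the target whose vector has the highest cosine with the source's is the closest match.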