Google Interview Question
Software Engineer / Developer
Country: United States
Interview Type: In-Person
"A second pass would be needed to detect false positives, but by that point the dataset should be very pruned."
I don't think a second pass can detect false positives. If an element is flagged as a duplicate, it may be a real duplicate or a false positive, so we cannot delete it as soon as it is flagged; that is, the dataset cannot be pruned. I haven't found a paper online that proposes a way to make a Bloom filter 100% accurate. If someone knows of one, I'd be glad to hear about it.
If duplicates are detected, we compare them word by word to make sure they are not false positives.
We can use a hashing + hashing + hashing approach.
The first hash will be based on the number of characters in the line.
The second hash will be the number of words in the line.
The third will be a hash of the words, and this is where we store the line.
When a new line comes in, it passes through all three hashes; if it lands in the same bucket as an existing line, we compare the lines directly.
Using these three hashes (or filters), we minimize the chance of clashing lines.
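A minimal in-memory sketch of this cascade (the function name and the use of Python's built-in `hash` are illustrative): the three cheap keys route lines into buckets, and a full string comparison happens only within a bucket.

```python
def dedupe_with_cascade(lines):
    """Drop duplicate lines using a cascade of three cheap keys:
    character count, word count, and a hash of the words.
    Full comparison happens only against lines in the same bucket."""
    buckets = {}  # (char_count, word_count, words_hash) -> lines seen
    unique = []
    for line in lines:
        words = tuple(line.split())
        key = (len(line), len(words), hash(words))
        candidates = buckets.setdefault(key, [])
        # Only a hash-bucket collision forces a full comparison.
        if all(line != seen for seen in candidates):
            candidates.append(line)
            unique.append(line)
    return unique
```

The first two keys are trivially cheap to compute, so most non-duplicate pairs are separated without ever hashing or comparing full strings.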
What if we used another file and a hash map? For each line read from the original file, hash it and check whether it exists in the hash map; if it does, it's a duplicate, so you just don't write it to the new file. This requires only one pass, but it also requires doubling the storage space, plus the extra memory for the hash map.
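A sketch of that one-pass idea (the paths and function name are placeholders; storing a SHA-256 digest per line instead of the line itself keeps the map small, assuming digest collisions are negligible):

```python
import hashlib

def copy_without_duplicates(src_path, dst_path):
    """One pass over src: write each line to dst only the first time
    its SHA-256 digest is seen. Trades the full hash-map-of-lines for
    a set of fixed-size digests."""
    seen = set()
    with open(src_path) as src, open(dst_path, "w") as dst:
        for line in src:
            digest = hashlib.sha256(line.encode()).digest()
            if digest not in seen:
                seen.add(digest)
                dst.write(line)
```

This preserves the original order of first occurrences, which the sort-based approaches below do not.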
That sounds good. If the file is too large, break it into smaller chunks, but retain a single hash table for the whole operation. Smaller chunks will produce smaller output files; merge those files to get the final output.
1. Have a TreeSet to store each line parsed.
2. Split the input file into chunks.
3. Create another file handle for a new output file.
4. Parse each chunk line by line:
if the line already exists in the set, skip it;
else add it to the set and write it to the output file.
5. Once done, delete the input file and rename the output file to the input file's name.
I would still go with the Unix command mentioned by nilkn.
Apply a modified version of external sort.
1. Split the large file into smaller chunks and sort each chunk.
2. Apply a pair-wise merge of these chunks, dropping duplicate lines as they are found.
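Step 2 can be sketched with `heapq.merge`, assuming each chunk is already sorted: in the merged stream, duplicates are adjacent, so remembering only the last emitted line is enough to drop them.

```python
import heapq

def merge_dropping_duplicates(sorted_chunks):
    """Merge pre-sorted chunks into one deduplicated sorted stream.
    Because the merged stream is sorted, equal lines are adjacent,
    so a single 'last emitted' value suffices to skip duplicates."""
    last = None
    for line in heapq.merge(*sorted_chunks):
        if line != last:
            yield line
            last = line
```

In a real external sort the chunks would be file iterators rather than lists; `heapq.merge` streams them without loading everything into memory. Note that, unlike the hash-map approaches, this loses the original line order.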
# Alternative approach to the above: create a dict, hashing the lines
# (or a digest of them, in the case of very long lines).
# Obviously this *assumes* a good hash with no collisions.
# [Is untested.]
hash_dict = {}
line_number = 1
with open("filename", "r") as f:  # read mode, not "wb"
    for line in f:
        h = generate_hash(line)  # generate_hash is left undefined here
        if h not in hash_dict:
            hash_dict[h] = line_number
        else:
            delete_line_from_file_inplace(f)
        line_number += 1  # advance for every line, duplicate or not
# delete_line_from_file_inplace can be done in different ways; Google can help
-----------------------------------
Just a thought, off the top of my head: the order of the lines can be maintained, in a convoluted way, for the first method mentioned by Alex. Before sorting, record each line's number by appending it to the beginning of the line (n); ignore this prefix while sorting (n log n); make a pass over the file to remove the duplicates (n); sort again by the line numbers (n log n); and finally remove the line numbers (n). But this seems expensive :\
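An in-memory sketch of that decorate/sort/undecorate scheme (the function name is mine): tag each line with its number, sort by content so duplicates become adjacent, keep the first (lowest-numbered) copy of each, then sort back by number.

```python
def dedupe_preserving_order(lines):
    """Remove duplicates while keeping the original order of first
    occurrences, via two sorts instead of a hash map."""
    tagged = list(enumerate(lines))       # (line_number, line)
    tagged.sort(key=lambda t: t[1])       # sort by content, ignore the tag
    kept, prev = [], None
    for n, line in tagged:                # duplicates are now adjacent;
        if line != prev:                  # stable sort => first kept copy
            kept.append((n, line))        # has the smallest line number
            prev = line
    kept.sort()                           # sort back by line number
    return [line for _, line in kept]
```

Python's sort is stable, which is what guarantees the surviving copy of each duplicate group is the earliest occurrence.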
Have a MapReduce job:
1. Use the line number as the key and the line as the value, and split the input file into several smaller files.
2. The mapper simply reverses key and value, so now we have the line as the key and the line number as the value.
3. The reducer receives each line together with a list of line numbers as values (the shuffle, sort, and combine steps bring all identical lines together as the reducer's input).
4. The reducer takes min(line numbers) and outputs (line, minLineNumber).
5. The output contains all distinct lines; merge it ordered by line number, so the original order is maintained.
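The five steps can be simulated locally (all names here are illustrative; a real job would run the mapper and reducer across a cluster, with the framework doing the shuffle):

```python
from collections import defaultdict

def mapper(line_number, line):
    # Step 2: emit the line as the key, its number as the value.
    yield line, line_number

def reducer(line, line_numbers):
    # Step 4: keep only the first occurrence of each line.
    yield min(line_numbers), line

def simulate_mapreduce(lines):
    """Local simulation: group mapper output by key (the shuffle),
    reduce each group, then merge sorted by line number (step 5)
    so the original order is restored."""
    groups = defaultdict(list)
    for n, line in enumerate(lines):
        for key, value in mapper(n, line):
            groups[key].append(value)
    results = []
    for line, numbers in groups.items():
        results.extend(reducer(line, numbers))
    return [line for _, line in sorted(results)]
```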
Here's what I came up with. View the complete solution with comments, considerations, and test cases here - https://github.com/johndifini/javaAlgos/blob/master/Stripper.java
/**
* Naming Convention - ExocticEngineer was too long for a class name :-)
*/
public class Stripper {
    /**
     * By putting the unique lines into fixed-sized chunks, this approach
     * will use significantly less mem if the file contains a lot of dupes.
     * If the file doesn't contain any dupes, we'll waste some processing time.
     */
    public static void removeDupes(String filename, int chunkSize) throws IOException {
        //@todo Validate user input. Filename can't be null; chunk size has to be >= 1
        // Note - BufferedReader is NOT loading the entire file into mem
        BufferedReader reader = new BufferedReader(new FileReader(filename));
        Set<String> uniqueLines = new HashSet<String>(chunkSize);
        List<Set<String>> chunks = new ArrayList<Set<String>>();
        String line;
        while ((line = reader.readLine()) != null) {
            uniqueLines.add(line);
            if (uniqueLines.size() == chunkSize) {
                // Save a copy of the lines using the copy constructor
                chunks.add(new HashSet<String>(uniqueLines));
                uniqueLines.clear();
            }
        }
        reader.close();
        // If uniqueLines is non-empty, the last partial chunk hasn't been saved yet
        if (!uniqueLines.isEmpty()) {
            chunks.add(new HashSet<String>(uniqueLines));
            uniqueLines.clear();
        }
        // Although the final list could be smaller than chunkSize,
        // it's a good approximation of the minimum size of the final set
        Set<String> consolidated = new HashSet<String>(chunkSize);
        for (Set<String> buff : chunks) {
            consolidated.addAll(buff);
        }
        for (String congo : consolidated) {
            System.out.println("<" + congo + ">");
        }
    }
}
While reading the file, keep these data structures:
1) A hash table of the lines read. The key is the hashes of the words XORed with each other; the value is a structure consisting of the line offset, the word count, and the number of characters. The equality function can then avoid re-reading the line at its offset on every hash collision, since the length or word count may already differ.
2) A hash set of the duplicate lines' offsets, processed in increasing offset order. When a line is found equal via the hash table, push its current offset into the hash set.
3) Both data structures (the hash table and the hash set) live in memory.
4) After reading the file, get the number of lines it contains, then get the size of the hash set; the latter is the number of lines that need to be purged.
5) Iterate over the hash set, reading lines starting from offset K near the end of the file (length of file minus the space taken by the duplicates): for each entry in the hash set, read the line at K (provided offset K is not itself in the hash set), overwrite the duplicate's offset with it, and remove the hash set entry.
6) A loop may be needed until the hash set is empty, because a line read from the tail of the file may itself be at an offset contained in the hash set.
or,
Use MapReduce:
1) Map lines to line numbers: produce the hash of the line as the key and the pair <offset, line> as the value.
2) The reducer gets the key-value pairs sorted by key (the hash value) and dumps just the offset into the output file as (offset, 0); in case of duplicates, the offset kept is that of the first occurrence of the line, so the offsets of duplicate lines are never output.
3) The last job merges the (offset, 0) pairs and writes the corresponding lines into the output file.
Use a trie to find duplicate lines. The following algorithm shows how to do an in-place edit of the file: one traversal to register the unique lines to keep, and another traversal reading only the unique lines and copying them over to their correct locations in the file. This algorithm works only on Unix-based systems, but minor tweaks would make it work everywhere.
#include <iostream>
#include <fstream>
#include <string>
#include <vector>
#include <map>
#include <stdexcept>
#define _XOPEN_SOURCE_EXTENDED 1
#include <unistd.h>
#include <stdlib.h>
#include "Trie.h"

void overwrite_with_unique_lines(char * filepath)
{
    std::fstream fs(filepath);
    if (!fs.is_open())
    {
        throw std::runtime_error("File not found");
    }
    Trie<char, bool, 256> trie;
    std::vector<long> transitionLineNumbers;
    std::vector<off_t> lineOffsets;
    bool previousDuplicate = true;
    std::string line;
    long lineNum = 0;
    off_t offset = 0;
    while (std::getline(fs, line))
    {
        bool * present = trie.find(line.begin(), line.end());
        bool duplicate = present != NULL;
        if (duplicate)
        {
            *present = true;
        }
        else
        {
            trie.update(line.begin(), line.end(), false);
        }
        if (duplicate != previousDuplicate)
        {   // on toggle
            transitionLineNumbers.push_back(lineNum);
            previousDuplicate = duplicate;
        }
        // Store the line offset
        lineOffsets.push_back(offset);
        offset = fs.tellg();
        ++lineNum;
    }
    fs.close();
    // reopen and print out unique lines
    fs.clear(); // clear the EOF state (needed on pre-C++11 libraries)
    fs.open(filepath);
    if (!fs.is_open())
    {
        throw std::runtime_error("File not found");
    }
    bool print = false;
    std::size_t transitionIndex = 0;
    offset = 0;
    for (long i = 0; i < lineNum; ++i)
    {
        // bounds check prevents reading past the last recorded toggle
        if (transitionIndex < transitionLineNumbers.size()
            && i == transitionLineNumbers[transitionIndex])
        {
            print = !print;
            ++transitionIndex;
        }
        if (print)
        {
            fs.seekg(lineOffsets[i], fs.beg);
            std::getline(fs, line);
            fs.seekp(offset, fs.beg);
            fs << line << "\n";
            std::cout << "Copied '" << line << "' with a \\n from offset: "
                      << lineOffsets[i] << " to " << offset << std::endl;
            offset += (line.size() + 1);
        }
    }
    fs.close();
    truncate(filepath, offset);
}

int main(int argc, char ** argv)
{
    if (argc < 2)
    {
        std::cerr << "Require one argument: filepath" << std::endl;
        exit(1);
    }
    char * filepath = argv[1];
    overwrite_with_unique_lines(filepath);
    return 0;
}
Plenty of ways to do this in Unix, like sort file.txt | uniq -u | tee out.txt. Obviously sorting isn't asymptotically optimal, though. A faster approach would be to build a histogram (as a hash map).
- nilkn January 31, 2014
If the file is so large that it can't be held in memory at once, an obvious extension is to break it into chunks, build a histogram for each chunk, and merge the histograms.
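A sketch of that chunked-histogram extension (the helper name is mine): merge per-chunk `Counter`s. The keys are the distinct lines; entries with count 1 are the lines `uniq -u` would print.

```python
from collections import Counter

def merge_chunk_histograms(chunks):
    """Build a per-chunk line-count histogram and merge them into one.
    Each chunk is an iterable of lines; only one chunk's lines need
    to be in memory at a time, plus the merged counts."""
    total = Counter()
    for chunk in chunks:
        total.update(chunk)   # adds this chunk's counts to the running total
    return total
```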
Possibly a more practical approach would be to first construct a bloom filter. This can be made highly accurate, though it will still produce the occasional false positive. A second pass would be needed to detect false positives, but by that point the dataset should be very pruned.
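A toy sketch of that two-pass scheme (the Bloom filter here is hand-rolled and purely illustrative): the first pass flags candidate duplicates, which may include false positives; the second pass counts only those candidates exactly, so the exact check runs over a pruned set.

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter with k hash functions derived from salted
    SHA-256. May report false positives, never false negatives."""
    def __init__(self, num_bits=1 << 20, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, item):
        for salt in range(self.num_hashes):
            digest = hashlib.sha256(f"{salt}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))

def find_true_duplicates(lines):
    """Pass 1: the Bloom filter flags candidate duplicates (possibly
    false positives). Pass 2: count only the flagged candidates
    exactly, so the exact check sees a pruned dataset."""
    bloom = BloomFilter()
    candidates = set()
    for line in lines:
        if bloom.might_contain(line):
            candidates.add(line)   # maybe a duplicate, maybe a false positive
        bloom.add(line)
    counts = {c: 0 for c in candidates}
    for line in lines:
        if line in counts:
            counts[line] += 1
    return {c for c, n in counts.items() if n > 1}
```

With sensible sizing the candidate set stays close to the true duplicate set, which is what makes the second pass cheap.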