Google Interview Question for Software Engineer / Developers


Country: United States
Interview Type: In-Person




2 of 2 votes

Plenty of ways to do this in Unix, like sort file.txt | uniq | tee out.txt (uniq -u would instead drop every copy of a repeated line, keeping none). Obviously sorting isn't asymptotically optimal, though. A faster approach would be to build a histogram (as a hash map).

If the file is so large that it can't be held in memory at once, an obvious extension is to break it into chunks, build a histogram for each chunk, and merge the histograms.

Possibly a more practical approach would be to first construct a bloom filter. This can be made highly accurate, though it will still produce the occasional false positive. A second pass would be needed to detect false positives, but by that point the dataset should be very pruned.
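A rough Python sketch of the chunked histogram idea (the file paths and chunk size are illustrative, and the merged histogram still has to fit in memory, as noted above):

from collections import Counter
from itertools import islice

def dedupe(src_path, dst_path, chunk_lines=1_000_000):
    seen = Counter()                       # merged histogram across all chunks
    with open(src_path) as src, open(dst_path, "w") as dst:
        while True:
            chunk = list(islice(src, chunk_lines))
            if not chunk:
                break
            local = Counter(chunk)         # per-chunk histogram
            for line, count in local.items():
                if line not in seen:       # first time this line appears anywhere
                    dst.write(line)
                seen[line] += count        # merge the chunk histogram into the global one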

- nilkn January 31, 2014 | Flag Reply
0 of 0 votes

"A second pass would be needed to detect false positives, but by that point the dataset should be very pruned."
I don't think a second pass can detect false positives. If an element is flagged as a duplicate, it may be a real duplicate or a false positive, so we cannot delete it the moment it is flagged, which means the dataset cannot be pruned. I haven't found a paper online that proposes a method to make a Bloom filter 100% accurate. If someone knows of one, I'd be glad to hear about it.

- qgbian February 13, 2014 | Flag
-1 of 1 vote

If duplicates are detected, we compare the lines word by word to make sure it is not a false positive.

- scorpionnishme May 12, 2014 | Flag
0 of 0 votes

During the first pass, which uses the Bloom filter, store every line the filter flags as a possible duplicate in a hash table with a count of 0. In the second pass, look each line of the file up in that hash table and add one to its count on a match. At the end, a flagged line whose count is 1 occurred only once (a Bloom filter false positive), while a count of 2 or more confirms a real duplicate, so its extra copies can be deleted.
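A rough Python sketch of that two-pass scheme, using a toy Bloom filter built on hashlib (the sizes, names, and the count >= 2 rule are assumptions for illustration; the final rewrite of the file is left out):

import hashlib

class TinyBloom:
    # Toy Bloom filter: m bits, k hash functions derived from blake2b with different salts.
    def __init__(self, m=8_000_000, k=4):
        self.m, self.k = m, k
        self.bits = bytearray(m // 8 + 1)

    def _positions(self, line):
        for i in range(self.k):
            digest = hashlib.blake2b(line.encode(), salt=str(i).encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, line):
        for p in self._positions(line):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, line):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(line))

def suspected_duplicates(path):
    # Pass 1: lines the filter claims to have seen before start with a count of 0.
    bloom, flagged = TinyBloom(), {}
    with open(path) as f:
        for line in f:
            if bloom.might_contain(line):
                flagged[line] = 0
            else:
                bloom.add(line)
    return flagged

def confirmed_duplicates(path, flagged):
    # Pass 2: count real occurrences; a flagged line seen only once was a false positive.
    with open(path) as f:
        for line in f:
            if line in flagged:
                flagged[line] += 1
    return {line for line, n in flagged.items() if n >= 2}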

- alibaba September 07, 2014 | Flag
2 of 2 votes

We can use a hashing + hashing + hashing approach.
The first hash is based on the number of characters in the line.
The second hash is the number of words in the line.
The third is a hash of the words themselves, and this is where we store the line.

When a new line comes in, it passes through all three hashes, and only if it lands in the same bucket at every level do we compare the actual lines.
Using these three hashes (or filters) we minimize the chance of comparing lines that cannot match.
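A compact Python sketch of that layered key (here the three hashes are folded into one tuple key; the names are illustrative):

def layered_key(line):
    # (character count, word count, hash of the words): all three components
    # must match before we ever compare full lines.
    words = line.split()
    return (len(line), len(words), hash(tuple(words)))

def dedupe(lines):
    buckets = {}      # layered key -> distinct lines that landed in that bucket
    for line in lines:
        bucket = buckets.setdefault(layered_key(line), [])
        if line not in bucket:            # full comparison only on a bucket hit
            bucket.append(line)
            yield line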

- SK January 31, 2014 | Flag Reply
1 of 1 vote

What if we used another file and a hash map? For each line read from the original file, hash it and check whether it already exists in the hash map; if it does, it's a duplicate, so you just don't write it to the new file. This requires only one pass, but it also requires doubling the storage space plus the extra memory for the hash map.
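That single pass is only a few lines in Python (paths are placeholders; storing digests instead of whole lines would cut the memory further at a tiny risk of collisions):

def copy_without_duplicates(src_path, dst_path):
    seen = set()                          # lines (or their hashes) already written
    with open(src_path) as src, open(dst_path, "w") as dst:
        for line in src:
            if line not in seen:          # first occurrence: keep it
                seen.add(line)
                dst.write(line)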

- Alin January 31, 2014 | Flag Reply
0 of 0 votes

Sounds good. If the file is too large, break it into smaller chunks, but retain a single hash table for the whole operation. The smaller chunks produce smaller output files; merge those files to get the final output.

- Psycho February 01, 2014 | Flag
1 of 1 vote

If you use a hash table for the whole file, it is going to be larger than the file itself, so you cannot keep one hash table for everything. A Bloom filter is what you want, followed by a second pass.

- kingpotter1990 September 07, 2014 | Flag
1 of 1 vote

1. Keep a TreeSet that stores each line parsed so far.
2. Split the input file into chunks.
3. Create a file handle for a new output file.
4. Parse each chunk line by line: if the line already exists in the set, skip it; otherwise add it to the set and write it to the output file.
5. Once done, delete the input file and rename the output file to the input file's name (see the sketch below).
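A short Python sketch of the write-then-rename part (a plain set stands in for the TreeSet, and the temporary path is made up):

import os

def dedupe_in_place(path, tmp_path=None):
    tmp_path = tmp_path or path + ".dedup"
    seen = set()                                  # stands in for the TreeSet
    with open(path) as src, open(tmp_path, "w") as dst:
        for line in src:
            if line not in seen:
                seen.add(line)
                dst.write(line)
    os.replace(tmp_path, path)                    # the output file takes the input file's name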

I would still go with unix command as mentioned by nilkn.

- AlgoAlgae April 24, 2014 | Flag Reply
0 of 0 votes

Apply a modified version of external sort.
1. Split the large file into smaller chunks and sort these chunks.
2. Apply pair-wise merge of these chunks, ignoring duplicate lines as they are found.
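A Python sketch of the merge step, dropping duplicates as they stream past (the chunk files are assumed to already be sorted; heapq.merge performs the k-way merge lazily):

import heapq

def merge_sorted_chunks(chunk_paths, out_path):
    files = [open(p) for p in chunk_paths]
    try:
        with open(out_path, "w") as out:
            previous = None
            for line in heapq.merge(*files):      # streams the sorted chunks in order
                if line != previous:              # duplicates arrive adjacent, so one
                    out.write(line)               # comparison per line removes them
                    previous = line
    finally:
        for f in files:
            f.close()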

- Alex January 30, 2014 | Flag Reply
1 of 1 vote

But it will change the order of lines in the file.

- RG January 31, 2014 | Flag
0 of 0 votes

This is what I came up with too, but the order will change to the order in which the lines get hashed.

- Guy January 31, 2014 | Flag
0 of 0 votes

# Alternative approach to the above would be to create a dict: hash the lines
# (or their digests, in the case of very long lines).
# Obviously this *assumes* a good hash with no collisions.
# [Is untested.]

hash_dict = {}
line_number = 1
with open("filename", "r") as file:       # open for reading
    for line in file:
        line_hash = generate_hash(line)
        if line_hash not in hash_dict:
            hash_dict[line_hash] = line_number
            line_number += 1
        else:
            delete_line_from_file_inplace(file)

# delete_line_from_file_inplace (and generate_hash) can be done in different ways, google can help
-----------------------------------
Just a thought, off the top of my head: the order of lines can be maintained, in a convoluted way, for the first method mentioned by Alex. Before sorting, record each line's number by prepending it to the line (n); ignore this prefix while sorting (n log n); make a pass over the file to remove the duplicates (n); sort again, this time by the line numbers (n log n); and finally remove the line numbers (n). But this seems expensive :\
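An in-memory Python illustration of that decorate/sort/dedupe/restore pipeline (a real version would replace both sorting steps with external sorts):

def dedupe_keep_order(lines):
    numbered = list(enumerate(lines))                    # prepend line numbers       (n)
    by_content = sorted(numbered, key=lambda p: p[1])    # sort, ignoring the number  (n log n)
    kept, previous = [], None
    for number, line in by_content:                      # drop adjacent duplicates   (n)
        if line != previous:
            kept.append((number, line))
            previous = line
    kept.sort()                                          # sort back by line number   (n log n)
    return [line for _, line in kept]                    # strip the numbers          (n)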

- 4rk January 31, 2014 | Flag Reply
0 of 0 votes

We can first calculate a fingerprint for each line (MD5, SimHash, etc.); this reduces the data size quite a lot. Then use whatever you like, hash maps / Hamming distance / a bit vector, to remove the duplicates.
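A minimal Python sketch of the fingerprint idea using MD5 (16-byte digests stand in for whole lines; a digest collision is possible in principle, just extremely unlikely):

import hashlib

def copy_unique_by_fingerprint(src_path, dst_path):
    seen = set()                            # 16-byte fingerprints instead of full lines
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        for line in src:
            fingerprint = hashlib.md5(line).digest()
            if fingerprint not in seen:
                seen.add(fingerprint)
                dst.write(line)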

- elendil February 12, 2015 | Flag Reply
0 of 0 votes

Have a MapReduce job:
1. Split the input file into several smaller files, with the line number as the key and the line as the value.
2. The mapper just reverses key and value, so now the line is the key and the line number is the value.
3. The reducer receives each line together with the list of its line numbers (the shuffle, sort, and combine steps group all identical lines together).
4. The reducer takes min(line numbers) and outputs the line with minLineNumber.
5. The output now contains only distinct lines; merge it ordered by those line numbers, so the original order is maintained.
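A tiny in-process simulation of that map/shuffle/reduce flow (a real job would run on a MapReduce framework; this only shows how the keys and values move):

from collections import defaultdict

def map_phase(numbered_lines):
    # Step 2: emit (line, line_number); the line itself becomes the key.
    for number, line in numbered_lines:
        yield line, number

def shuffle(pairs):
    # Shuffle/sort: group all line numbers under their line.
    grouped = defaultdict(list)
    for line, number in pairs:
        grouped[line].append(number)
    return grouped

def reduce_phase(grouped):
    # Steps 4-5: keep min(line numbers) per line, then restore the original order.
    survivors = [(min(numbers), line) for line, numbers in grouped.items()]
    return [line for _, line in sorted(survivors)]

# usage: print("".join(reduce_phase(shuffle(map_phase(enumerate(open("input.txt")))))))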

- dhirendra.sinha December 28, 2015 | Flag Reply
0 of 0 votes

Here's what I came up with. View the complete solution with comments, considerations, and test cases here - https://github.com/johndifini/javaAlgos/blob/master/Stripper.java

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/**
 * Naming Convention - ExocticEngineer was too long for a class name :-)
 */
public class Stripper {
  /**
   * By putting the unique lines into fixed-sized chunks, this approach
   * will use significantly less mem if the file contains a lot of dupes.
   * If the file doesn't contain any dupes, we'll waste some processing time.
   */
  public static void removeDupes(String filename, int chunkSize) throws IOException {
    //@todo Validate user input. Filename can't be null; Chunk size has to be >=1
    
    // Note - BufferedReader is NOT loading the entire file into mem
    BufferedReader reader = new BufferedReader(new FileReader(filename));
    Set<String> uniqueLines = new HashSet<String>(chunkSize);
    List<Set<String>> chunks = new ArrayList<Set<String>>();
    String line;
    while((line = reader.readLine()) != null) {
      uniqueLines.add(line);
      if(uniqueLines.size() == chunkSize) {
        // Save a copy of the lines using copy constructor
        chunks.add(new HashSet<String>(uniqueLines));
        uniqueLines.clear();
      }
    }
    reader.close();

    // Any leftover lines (fewer than chunkSize) haven't been saved yet
    if(!uniqueLines.isEmpty()) {
      chunks.add(new HashSet<String>(uniqueLines));
      uniqueLines.clear();
    }

    // Although the final list could be smaller than chunkSize,
    // it's a good approximation of the minimum size of the final set
    Set<String> consolidated = new HashSet<String>(chunkSize);
    for(Set<String> buff : chunks) {
      consolidated.addAll(buff);
    }

    for(String congo : consolidated) {
      System.out.println("<"+congo+">");
    }
  }
}

- johndifini October 30, 2016 | Flag Reply
0 of 0 votes

While reading the file, keep these data structures:
1) A hash table of the lines read. The key is the hash of the line's words XORed together; the value is a structure holding the line's offset, its word count, and its character count. The equality check can then avoid re-reading the line at that offset for most hash collisions, since the length or word count will usually differ.
2) A hash set of the offsets of duplicate lines, kept in increasing offset order: whenever the hash table shows a line equal to one already seen, push the current offset into this hash set.
3) Both structures (the hash table and the hash set) are held in memory.
4) Once the file has been read, take its total number of lines and the size of the hash set; the latter is the number of lines that need to be purged.
5) Iterate over the hash set, reading from some offset K near the end of the file (roughly the line count minus the size of the hash set): for each entry, take the line at K (provided offset K is not itself in the hash set), overwrite the duplicate's offset with it, and remove that entry from the hash set.
6) This may need to loop until the hash set is empty, because a line taken from the end of the file may itself be in the hash set.


or,
Use MapReduce:
1) Map each line to its position: the mapper emits the hash of the line as the key and the pair <offset, line> as the value.
2) The reducer receives the key/value pairs sorted by key (the hash value) and dumps just the offset into the output file as (offset, 0); for duplicates, that offset is the first occurrence of the line, so the offsets of the duplicate copies are never output.
3) The last job merges those (offset, 0) records and writes the corresponding lines into the output file.

- suhaib.realtor July 04, 2017 | Flag Reply
-1 of 1 vote

Use a trie to find duplicate lines. The following algorithm shows how to do an in-place edit of the file. One traversal to register unique lines to keep, and another traversal reading only the unique lines and copying them over to their correct location in the file. This algorithm works only on Unix based systems, but one could make minor tweaks to make this work everywhere.

#include <iostream>
#include <fstream>
#include <string>
#include <vector>
#include <map>
#include <stdexcept>

#define _XOPEN_SOURCE_EXTENDED 1
#include <unistd.h>
#include <stdlib.h>

#include "Trie.h"

void overwrite_with_unique_lines(char * filepath)
{
    std::fstream fs(filepath);
    if (!fs.is_open())
    {
        throw std::runtime_error("File not found");       
    }

    Trie<char, bool, 256> trie;
    std::vector<long> transitionLineNumbers;
    std::vector<off_t> lineOffsets;

    bool previousDuplicate = true;
    std::string line;
    long lineNum = 0;
    off_t offset = 0;
    while (std::getline(fs, line))
    {
        bool * present = trie.find(line.begin(), line.end());
        bool duplicate = present != NULL;
        if (duplicate)
        {
            *present = true;
        }
        else
        {
            trie.update(line.begin(), line.end(), false);
        }

        if (duplicate != previousDuplicate)
        {// on toggle
            transitionLineNumbers.push_back(lineNum);
            previousDuplicate = duplicate;
        }

        // Store the line offset
        lineOffsets.push_back(offset);
        offset = fs.tellg();

        ++lineNum;
    }

    fs.close();

    // reopen and print out unique lines
    fs.open(filepath);
    if (!fs.is_open())
    {
        throw std::runtime_error("File not found");       
    }

    bool print  = false;
    int transitionIndex = 0;
    offset = 0;
    for (long i = 0; i < lineNum; ++i)
    {
        // Guard against indexing past the end of the recorded transitions
        if (transitionIndex < (int)transitionLineNumbers.size()
            && i == transitionLineNumbers[transitionIndex])
        {
            print  = !print;
            ++transitionIndex;
        }

        if (print)
        {
            fs.seekg(lineOffsets[i], fs.beg);
            std::getline(fs, line);

            fs.seekg(offset, fs.beg);
            fs << line << "\n";
            std::cout << "Copied '" << line << "' with a \\n from offset: " << lineOffsets[i] << " to " << offset << std::endl;
            
            offset += (line.size() + 1);
        }
    }

    fs.close();
    truncate(filepath, offset);
}

int main(int argc, char ** argv)
{
    if (argc < 2)
    {
        std::cerr << "Require one argument: filepath" << std::endl;
        exit(1);
    }

    char * filepath = argv[1];
    overwrite_with_unique_lines(filepath);

    return 0;
}

- Gokul Subramanian July 18, 2014 | Flag Reply
0 of 0 votes

What if the file contains billions of lines and each line is 100 characters? Then the trie needs on the order of one node per input character, and with a 256-way child array in every node that is far more memory than the file itself.

- kingpotter1990 September 07, 2014 | Flag

