Google Interview Question
Senior Software Development EngineersCountry: United States
Interview Type: In-Person
in general good solution.
Collections.sort(col) is O(n*log n).
But Query has running time of 2*n*k, complexity is O(n*k) in all cases (best and worst) because of set.contains (n - number of entries, k - number of tags in query).
set.contains has linear time, and you call it in every iteration.
But this is a possible trade-off for such a task.
@EPavlova & @zr.roman
If instead of a HashSet a SortedSet is used, then the complexity would be O(n*lg n). Because the lookup can then be done via binary search in lg n time.
The question did not say what happens when there is a partial overlap.
For example. (2,4,0) and (3,5,1). So when we query "0,1".
My Initial stab, might be wrong. And i could do only o(n^2). Thinking about a better one.
Algorithm,
1) I call an entry [start,end,tag] as an element.
2) Take each element and go over the rest of n-1 elements.
3) If there is a complete overlap by which i mean, one interval is a complete subset of another, update the tag of the smaller interval to add tag of the larger interval.
Once this is done,
Its pretty straight forward to query and return them.
Please comment and state better solutions guys, !!
yes, the drawback of this solution is that you perform approx. n^2 iterations for every query.
I offer a solution, where queries are processed in at most n iterations.
partial overlap cannot be treated the same as "include" relation.
Let's have query (0,1).
Case #1 (include): given [23, 72, 0], [34, 53, 1]. The answer to query is [34, 53].
Case #2 (partial): given [23, 72, 0], [20, 53, 1]. What will be the answer to query (0,1)?
[23, 72] ?
[20, 53] ?
or both?
How to determine answer?
Could you clarify that a little more in detail.
@zr.roman, as said earlier, "the overlapped part matches both tags". Added another example to clarify. Thanks!
please, one more detail:
in case we have input [23, 72, 0] and [20, 23, 1] what should be the answer to query (0,1) ? [23, 23] ? right?
here we need data structure called "forest", i.e. set of trees.
Steps:
1) We are given 2D array with 3 columns [startIndex, endIndex, tag].
We take this array, and build forest.
Example: given: [23, 72, 0], [34, 53, 1], [100, 128, 0], [75, 80, 2], [55, 60, 3], [40, 50, 2].
The forest is:
#1 tree: 2 [75, 80]
#2 tree: 0 [23, 72] -> ( 1 [34, 53] -> 2 [40, 50] ) and 3 [55, 50]
#3 tree: 0 [100, 128]
The complexity is O(n* log n): O(log n) for inserting a node into a tree performed n times.
2) Processing queries.
the task is starting from roots find all paths in all trees, containing all the tags from query, and take the last nodes in those paths.
(See examples below in comment).
Complexity of query processing is O(n * log n), n - total number of all nodes in all trees.
c# implementation with DFS. Full testable solution.
O(log n) - insert 1 entry to forest.
O(n) - processing 1 query.
using System;
using System.Collections.Generic;
using System.Linq;
namespace TaggedSubstrs {
public class TaggedPiecesContainer {
// public members
public TaggedPiecesContainer( IEnumerable<int[]> tags ) {
BuildForest( tags );
}
public void AddEntry( int[] entry ) {
// try to add entry to any tree from forest
for (int i = 0; i < _forest.Count; i++) {
if ( _checkTagEntryAgainstNode( entry, _forest[ i ] ) ) {
AddToNode( _forest[ i ], entry );
return;
}
}
// if not added to any tree, then create a new tree
_forest.Add( _addNode( entry) );
}
public IEnumerable<int[]> Query( int[] tags ) {
_resList.Clear();
for ( int i = 0; i < _forest.Count; i++ ) {
_tagsVisited.Clear();
DFS( _forest[ i ], new HashSet<int>( tags ) );
}
foreach ( var node in _resList ) {
yield return node.Value;
}
}
// private members
private void BuildForest( IEnumerable<int[]> input ) {
var inputList = input as IList<int[]> ?? input.ToList();
for ( int i = 0; i < inputList.Count; i++ ) {
AddEntry( inputList[ i ] );
}
}
private void AddToNode( Node node, int[] entry ) {
foreach ( var child in node.Children ) {
if ( _checkTagEntryAgainstNode( entry, child ) ) {
AddToNode( child, entry );
return;
}
}
node.Children.Add( _addNode( entry) );
}
private void DFS( Node node, HashSet<int> tags ) {
_tagsVisited.Add( node.Tag );
// count if path tags match input tags,
// the rule of match specified in task
if ( _tagsVisited.Count >= tags.Count ) {
int cnt = GetTagsMatchesCnt( tags, _tagsVisited );
// we found the target path, so take the node and return
if ( cnt == tags.Count ) {
_resList.Add( node );
return;
}
}
foreach ( var child in node.Children ) {
DFS( child, tags );
// when we go 1 level up, remove the last visited tag
if ( _tagsVisited.Count > 0 ) {
_tagsVisited.RemoveAt( _tagsVisited.Count - 1 );
}
}
}
private int GetTagsMatchesCnt( IEnumerable<int> tags, IEnumerable<int> tagsVisited) {
int cnt = 0;
foreach ( var tagVisited in tagsVisited ) {
if ( tags.Contains( tagVisited ) ) {
cnt++;
}
}
return cnt;
}
private readonly List<Node> _forest = new List<Node>();
private readonly List<int> _tagsVisited = new List<int>();
private readonly List<Node> _resList = new List<Node>();
private readonly Func<int[], Node> _addNode = tagEntry => new Node {
Value = new[] { tagEntry[ 0 ], tagEntry[ 1 ] }, Tag = tagEntry[ 2 ]
};
private readonly Func<int[], Node, bool> _checkTagEntryAgainstNode = (tagEntry, node) => {
return node.Value[ 0 ] <= tagEntry[ 0 ] && node.Value[ 1 ] >= tagEntry[ 1 ];
};
}
public class Node {
public int[] Value;
public int Tag;
public readonly List<Node> Children = new List<Node>();
}
}
Examples.
Example1: query: 0,1,2.
We have such path in tree#2: 0 -> 1-> 2. The last node in the path is [40, 50]. This is the answer.
Example2: query: 0, 2. The answer is same as in Example1 ( [40, 50] ), 'cause only 1 path contains 0 and 2 tags.
Example3: query: 1,3. The answer is null. Because in all the trees there are no paths containing both 1 and 3.
Example4: query: 0. The answer is [23, 72] and [100, 128], because 2 trees possess paths containing tag 0, and the last nodes in those paths are [23, 72] and [100, 128].
Example5: query: 2. The answer is [40, 50] and [75, 80], because 2 trees possess path containing tag 2, and the last nodes in those paths are [40, 50] and [75, 80].
1) Put tags into a hash map such that k = tag# and v = [[startIdx0,endIdx0],[startIdx1,endIdx1],...] each set of start, end indexes present an entry of a tag. For example for the tag set above this would be k = 0, v = [[23,72],[100,128]] . Note that, i'd prefer to have value as a list of arrays or a list of lists for easier manipulation later on.
2) On query look up the values from hash table
3) Loop through the values obtained from the hashtable lookup; set the first value list for the first tag as the selected interval. Starting from second iteration, check every interval in the value list for and find overlapping intervals with the selected interval and update the selected interval to the new overlapping parts. If there's no overlap with the selected interval at any iteration of the loop -> return empty.
This should run in 1st step O(n), 2nd step O(k), 3rd step ~ O(k) so the whole thing would be O(n). (k = query size, n = # of tag entries)
Clarification on 3rd step: every interval in the value list would be checked against every interval in the selected interval. This could be carried out in many ways; one way could be to check the startIdx of every interval to the selected intervals, if it's bigger check it against the endIdx of the same interval, if it is smaller than that; we have an overlap! then find the minimum of the endIndices and max of te startIndices of these 2 intervals and create a new interval with found value which is the overlap. this is repeated until selected interval is empty or end of the loop is reached(checked all values we looked up from the hashtable).
time complexity of 3 step is O(n^2), 'cause you have to perform at most n(n-1)/2 iterations to perform the task.
Imagine you have 5 tag entries in hashtable: 0,1,2,3,4.
If you receive query 0,1,2,3,4, you will have to compare all with all, this is O(n^2).
I'm sorry but you're mistaken, 3rd step is not O(n^2) even if you receive all tags as query it'd be O(n) worst case, since you take the first one as the selected and update this selected as you compare it.
So in your example of receiving 0,1,2,3,4;
in the worst case comparisons are as follows:
0 with 1 => selected
selected with 2 => update selected
selected with 3 => update selected
selected with 4 => update selected
sorry too, but let's see.
Assume we have tags query 0,1,2,3,4.
Each tag has 2 elements. Assume there are no overlaps at all.
0 with 1 => 12 iterations:
As follows:
1) 00 contains 10
2) 00 contains 11
3) 01 contains 10
4) 01 contains 11
5) 10 contains 00
6) 10 contains 01
7) 11 contains 00
8) 11 contains 01
So, we have 4 elements.
Go to next check: 4 elements with tag 2:
I counted, you will need 14*2 = 28 iterations. (count yourself to verify).
The rest 3 checks even worse.
You have 4 elements in first 2 tags and 8 iterations to check their mutual overlaps.
When you have 6 elements, you need 28 iterations to check their mutual overlaps.
Can you count number of iterations at further checks? It is possible, but it is a combinatorial task.
You have combinatorial complexity, looks like.
First of all n is the number of entries not number of tags(if you have 2 elements in one tag that'd mean you have n-1 tags). Second is that there wouldn't be 12 iterations on comparing 2 tags with 2 elements each there'd be 4! Which is exactly n in this case because for 4 entries n is 4. The checks 7-12 you made are redundant checks which are not done. Checks are only done forward not backwards. Also there's no need to perform if 1 interval contains both elements if you individually check them anyways(so you can eliminate #3 and #6 from your list too).
As follows :
1) 00 contains 10
2) 00 contains 11
3) 01 contains 10
4) 01 contains 11
That said, i have noticed a different flaw in my algorithm such that it is possible that the intervals might be divided to pieces. When one tag has r elements and the other has t elements, it is possible to end up with r*t elements in the selected interval. Which could yield (n-r-t)*r*t operations which is O(n^3). Sorry about that.
thanks for description, but I don't understand why you eliminate the mentioned items from the list of iterations. (NB: I corrected the post a little).
Let's see closer and assume the following situation:
1) 00 contains 10 ? NO.
2) 00 contains 11 ? NO.
Here you say that I can eliminate next iteration, which is:
3) 10 contains 00 ?
But how can I eliminate it if the check will detect YES? But you eliminated it and never got to know that 10 contains 00. In fact before the check you really do not know whether 10 contains 00 or not! We know that 00 does not contain 10 (#1), but this says nothing about the reverse situation. So, we cannot skip it. The same is about the rest iterations from my list (in total 8 iterations).
Let's take the example right from task.
{34, 53} contains {23, 72} ? NO.
And then you skip reverse check: {23, 72} contains {34, 53} ?
The check result is YES, but you never got to know that, you've skipped it.
I apologise for the ambiguity, i think we're thinking of different checks. You're thinking about a "contains" check whereas i am trying an "overlaps" checks. For example :
{34, 53} overlaps with {23, 72} ? True.
and the reverse is always the same as well.
This check would return false if and only if we're talking about completely separate intervals such as {15, 25} and {42,56}
My mistake i should have been more clear.
now I see your idea, thank you.
But in case of checking only general "overlaps", we gain new problems, for example:
{34, 53, 1} overlaps with {23, 72, 0} ? True.
But what then to give as an answer in case of query (0,1) ?
{34, 53} ? or {23, 72} ? or both ?
No information. We need in worst case 2 more checks to determine which of them contains another one.
So, worst case:
#1: {34, 53} contains {23, 72} ? NO.
#2: {23, 72} contains {34, 53} ? YES.
So, the answer to query (0,1) is {34, 53}.
(But keep in mind we made 2 extra "contains" checks in addition to 1 "overlaps" check to obtain the result. So, we have to take this into consideration, and overall complexity will increase dramatically, n^2 or n^3 or so, need to count).
What's the output for this case:
[1,10, 0]
[2, 7, 1]
[3, 5, 2]
[6, 8, 3]
Should we get [3,5], [6,7] ?
import java.util.ArrayList;
import java.util.HashMap;
public class Question1
{
public ArrayList<String> getMatchingRanges(HashMap <String, ArrayList<String>> tags, String[] qTags)
{
if (qTags==null)
return null;
ArrayList<String> result = (ArrayList<String>) tags.get(qTags[0]).clone();
for(int i=1; i<qTags.length; i++)
{
ArrayList<String> newRes = new ArrayList<String>();
ArrayList<String> temp = tags.get(qTags[i]);
for(int j=0; j<temp.size();j++)
{
int newRange1 = Integer.parseInt(temp.get(j).split(":")[0]);
int newRange2 = Integer.parseInt(temp.get(j).split(":")[1]);
for(int k=0; k<result.size();k++)
{
int range1 = Integer.parseInt(result.get(k).split(":")[0]);
int range2 = Integer.parseInt(result.get(k).split(":")[1]);
if ((newRange1>=range1 && newRange1<=range2)
|| (newRange2>=range1 && newRange2<=range2))
{
String val = Math.max(range1,newRange1) + ":" + Math.min(range2,newRange2);
newRes.add(val);
}
}
}
result=newRes;
if (result.size()==0)
return null;
}
return result;
}
public static void main(String[] args)
{
HashMap <String,ArrayList<String>> tags = new HashMap <String,ArrayList<String>>();
ArrayList<String> zero = new ArrayList<String>();
zero.add("23:72");
zero.add("100:128");
ArrayList<String> one = new ArrayList<String>();
one.add("10:53");
tags.put("0",zero);
tags.put("1",one);
ArrayList<String> result = new Question1().getMatchingRanges(tags, new String[]{"0","1"});
if (result!=null)
{
for(String s : result)
System.out.println(s);
}
result = new Question1().getMatchingRanges(tags, new String[]{"0"});
if (result!=null)
{
for(String s : result)
System.out.println(s);
}
}
}
class tagOverlapping:
def __init__(self, tagList):
self.tagList = tagList
self.startIndex = dict()
self.tagDict = dict()
for t in self.tagList:
s = t[0]
e = t[1]
t_id = t[2]
if t_id in self.tagDict:
self.tagDict[t_id].add((s,e))
else:
self.tagDict[t_id] = set([(s,e)])
sTagList = sorted(self.tagList, key = lambda x: x[0])
tags = set(self.tagDict.keys())
for st in sTagList:
k_id = st[2]
exTags = tags.difference(set([k_id]))
for t in exTags:
vals = self.tagDict[t]
for v in vals:
if v[0] >= st[0] and v[1] <= st[1]:
self.tagDict[k_id].add(v)
print(self.tagDict)
def queryTags(self, qt):
print('Looking for tags in : ', qt)
resultSet = set(self.tagDict[qt[0]])
qt = qt[1::]
for t in qt:
list1 = set(self.tagDict[t])
resultSet = resultSet.intersection(list1)
print(resultSet)
tList = [ [10, 23,0], [50, 70, 1], [9, 15, 2], [6,17,0], [11, 45, 2], [100, 300, 3], [120, 145, 1], [45, 79, 3], [105, 299,4], [7,15, 4], [2, 14, 1]]
print(tList)
myT = tagOverlapping(tList)
myT.queryTags([1,0])
myT.queryTags([0])
myT.queryTags([4,3])
class tagOverlapping:
def __init__(self, tagList):
self.tagList = tagList
self.startIndex = dict()
self.tagDict = dict()
for t in self.tagList:
s = t[0]
e = t[1]
t_id = t[2]
if t_id in self.tagDict:
self.tagDict[t_id].add((s,e))
else:
self.tagDict[t_id] = set([(s,e)])
sTagList = sorted(self.tagList, key = lambda x: x[0])
tags = set(self.tagDict.keys())
for st in sTagList:
k_id = st[2]
exTags = tags.difference(set([k_id]))
for t in exTags:
vals = self.tagDict[t]
for v in vals:
if v[0] >= st[0] and v[1] <= st[1]:
self.tagDict[k_id].add(v)
print(self.tagDict)
def queryTags(self, qt):
print('Looking for tags in : ', qt)
resultSet = set(self.tagDict[qt[0]])
qt = qt[1::]
for t in qt:
list1 = set(self.tagDict[t])
resultSet = resultSet.intersection(list1)
print(resultSet)
tList = [ [10, 23,0], [50, 70, 1], [9, 15, 2], [6,17,0], [11, 45, 2], [100, 300, 3], [120, 145, 1], [45, 79, 3], [105, 299,4], [7,15, 4], [2, 14, 1]]
print(tList)
myT = tagOverlapping(tList)
myT.queryTags([1,0])
myT.queryTags([0])
myT.queryTags([4,3])
the idea is: compute the overlapping of segments of one tag and that of its previous tag in the query list, until all tags have been considered.
algorithm works as following:
1. push all segments of first tags into queue
2. push a canary into the queue, i.e. -1
3. for each tag of the rest in the query, do the following:
a. copy the queue
b. for each segment [low, high] of this tag, compute the overlapping with each segment in the queue and push them into the queue until hit the canary
b. move the canary to the end of the queue
class Solution {
public:
queue<int> seekTags(vector<vector<int>>& tags, vector<int>& query)
{
queue<int> myq;
if (query.size() == 0) return myq;
for (int i = 0; i < tags.size(); i++) {
if (query[0] == tags[i][2]) {
myq.push(tags[i][0]);
myq.push(tags[i][1]);
}
}
if (query.size() == 1) return myq;
myq.push(-1); // canary
for (int t = 1; t < query.size(); t++) {
queue<int> copy = myq;
for (int i = 0; i < tags.size(); i++) {
if (tags[i][2] != t) continue;
/* if tags sorted, tags[i][1] < low to stop (optimize) */
queue<int> tmpq = copy;
while (!tmpq.empty() && tmpq.front() != -1 && tmpq.size() > 2) {
int low = tmpq.front(); tmpq.pop();
int high = tmpq.front(); tmpq.pop();
if (tags[i][0] < high && tags[i][1] > low) {
myq.push(max(low, tags[i][0]));
myq.push(min(high, tags[i][1]));
}
}
}
while (myq.front() != -1) myq.pop();
myq.pop(); // pop canary
if (myq.size() == 0) break; // no more interset
myq.push(-1);
}
return myq;
}
void printQueue(queue<int>& ans) {
std::cout << "Queue: " << std::endl;
while (!ans.empty()) {
std::cout << ans.front() << std::endl;
ans.pop();
}
}
};
Step 1: Add all intervals to a hash table based on tag as the key
Step 2: Build interval tree using intervals from buckets for tags in search
Step 3: For each item in interval bucks, Search for all intersections for a given tag
Create the minimal overlapping region, add to output (if not present)
#include <iostream>
#include "vector"
#include "map"
using namespace std;
struct Interval
{
int low, high;
int tag;
bool operator==(const Interval &other) const
{
if (this->low == other.low && this->high == other.high) return true;
return false;
}
};
typedef vector<Interval> IntervalList;
typedef struct
{
Interval *i;
IntervalList intersection_list;
Interval intersecting_interval;
} IntervalObject;
typedef vector<IntervalObject *> IntervalObjects;
typedef map<int, IntervalObjects> IntervalHashTable;
struct ITNode
{
Interval *i;
ITNode *left, *right;
int max;
};
ITNode * newNode(Interval *i)
{
ITNode *node = new ITNode;
node->max = i->high;
node->left = node->right = NULL;
node->i = i;
return node;
};
// insert node into interval tree
ITNode *insert(ITNode *root, Interval *i)
{
// empty tree, create new node
if (root == NULL)
return newNode(i);
// If new node's low value is smaller, then new node goes to the left
if (i->low < root->i->low)
root->left = insert(root->left, i);
// new node goes to the right
else
root->right = insert(root->right, i);
// update max, if needed
if (root->max < i->high)
root->max = i->high;
return root;
}
// build intersection_list list
void intervalSearch(ITNode *root, IntervalObject *object)
{
// Base Case, tree is empty
if (root == NULL)
return;
// If given interval intersection_list with root, add to results, keep digging
if (root->i->low <= object->i->high && object->i->low <= root->i->high)
{
if (root->i->low != object->i->low || root->i->high != object->i->high) // skip self
{
object->intersection_list.push_back(*root->i);
}
}
// If left child and max of left child >= intervals then potential overlap left
if (root->left != NULL && root->left->max >= object->i->low)
{
intervalSearch(root->left, object);
}
intervalSearch(root->right, object);
}
void intervalReduce(IntervalObject *object)
{
for (auto i : object->intersection_list)
{
if (i.low > object->intersecting_interval.low && i.low <= object->intersecting_interval.high)
{
object->intersecting_interval.low = i.low;
}
if (i.high < object->intersecting_interval.high && i.high >= object->intersecting_interval.low)
{
object->intersecting_interval.high = i.high;
}
}
}
void addOutput(IntervalObject *object, vector<Interval> &output)
{
Interval my = { 1,2,3 };
if (find(output.begin(), output.end(), object->intersecting_interval) == output.end())
{
output.push_back(object->intersecting_interval);
cout << "Interval: " << object->intersecting_interval.low << "," << object->intersecting_interval.high << endl;
}
else
{
cout << "Duplicate: " << object->intersecting_interval.low << "," << object->intersecting_interval.high << endl;
}
}
int main(int argc, char **argv)
{
IntervalHashTable ihash;
Interval ints[] = { { 10,20,0 },{ 5,15,0 },{ 15,35,1 },{ 0,3,2 }, { 2,7,1 },{ 14,17,0 }, {30,40,0} };
int n = sizeof(ints) / sizeof(ints[0]);
ITNode *root = NULL;
// Create hashtable tag=key, list=intervals
for (auto i : ints)
{
Interval *ii = new Interval(i);
IntervalObject *object = new IntervalObject();
object->i = ii;
object->intersecting_interval = i; // entire scope
ihash[i.tag].push_back(object);
}
int tags[] = { 0, 1 };
// Using hash and tags build interval tree
for (auto tag : tags)
{
for (auto object : ihash[tag])
{
root = insert(root, object->i);
}
}
vector<Interval> output;
// using hash and tags, create intersection list, reduce and create output
for (auto tag : tags)
{
for (auto object : ihash[tag])
{
intervalSearch(root, object);
cout << "Parent: " << object->intersecting_interval.low << "," << object->intersecting_interval.high << " List: ";
for (auto i : object->intersection_list)
{
cout << i.low << "," << i.high << " ";
}
cout << endl;
intervalReduce(object);
addOutput(object, output);
}
}
}
Input: { 10,20,0 },{ 5,15,0 },{ 15,35,1 },{ 0,3,2 }, { 2,7,1 },{ 14,17,0 }, {30,40,0}
Search: {0,1}
Output: { 15,15 }, {30, 35}, {5, 7}
Input: { 23,72,0 },{ 34,53,1 },{ 100,128,0 },{ 75,80,2 },{ 55,60,3 },{ 40,50,2 }
Search: {0,1,2}
Output: {40,50}, {100,128}, {75,80}
Hello,
First my initial tought was to traverse through the string and to find intersections but the problem clearly states that string is too long (a lot of time and memory consuption) and decided to restrict myself only to a solution taking into account entries.
I would like to propose the following solution with complexity O(mlogm) where m is the number of entries. Let's sketch the algorithm :
First I generate two points from each entry - start point and end point and sort all point by their string index (start or end index). Then I traverse point array and store in a set all tags which are part of the query tags and still have't finish (their end time has not been come) yet. In case I have ther is an intersection of all querry tags ( set.size() == tags.size()) then produce the result.
Here is some code I wrote :
- EPavlova December 11, 2015