Chris
A new species has been discovered and we are trying to understand their form of communication. So, we start by identifying their alphabet. Tough, but Mr. Brown managed to steal a vocabulary that has its words in ascending order, and a team has already identified the letters. So we now have the ordered words as sequences of numbers; print every letter once, in ascending order.
- Chris in United States
[3,2]
[4,3,2]
[4,1,1]
[1, 2]
[1, 1]
alphabet: [3, 4, 2, 1]
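A minimal sketch of one way to derive such an alphabet (my assumption: compare adjacent words; the first differing letter yields a precedence edge, then sort topologically; all names are mine):

from collections import defaultdict

def alien_alphabet(words):
    letters = {l for w in words for l in w}
    succ = defaultdict(set)  # letter -> letters known to come after it
    pred = defaultdict(int)  # letter -> number of known predecessors
    for w1, w2 in zip(words, words[1:]):
        for a, b in zip(w1, w2):
            if a != b:  # the first differing letter decides the order
                if b not in succ[a]:
                    succ[a].add(b)
                    pred[b] += 1
                break
    # Kahn's topological sort
    result = sorted(l for l in letters if pred[l] == 0)
    for l in result:
        for m in succ[l]:
            pred[m] -= 1
            if pred[m] == 0:
                result.append(m)
    return result

print(alien_alphabet([[3, 2], [4, 3, 2], [4, 1, 1], [1, 2], [1, 1]]))
# one valid order, e.g. [2, 3, 4, 1]; the sample answer [3, 4, 2, 1] is another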
@hprem991: I understand; maybe you could code that O(n) solution that uses only one pass to create the sum and distribute it to all nodes, assuming we had a minimum spanning tree. Or just with an array: calculate the sum of the array and set each element to that value. Are you sure you aren't creating an O(n^2) solution that just looks like O(n) on the surface?
- Chris August 14, 2017
I assume only positive values > 0.
I fully agree with your conclusion: it means you need to visit each node to build the sum in one node, and then visit each node again to copy that value to all other nodes.
I guess the question aims at which path you should take to achieve 2*(number of nodes) time units. I think you are looking at a minimum spanning tree to calculate and distribute the sum (I think there are two algorithms one should really know here, but I'd actually need to look them up as well).
Another point of view: to calculate the sum, you need to see each value; there is no way to build a sum over elements without having seen every element's value at least once. Then, since you need to copy it to all other nodes, there is no way around visiting each node again. So 2*n is the best conceivable; I don't think you can do better either.
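A minimal sketch of the array variant mentioned above (my wording of it; two passes over the data, so 2*n visits):

def distribute_sum(values):
    # pass 1: visit every element once to build the sum
    total = 0
    for v in values:
        total += v
    # pass 2: visit every element again to copy the sum into it
    for i in range(len(values)):
        values[i] = total
    return values

print(distribute_sum([1, 2, 3]))  # [6, 6, 6]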
@NoOne:
this article's essence is: "to find out if two people are blood-related, only consider parent relationships, because then your graph grows only by 2^generations, whereas if you also navigated the children, it would grow by (2+#children)^generations ..."
when I solved the problem, I drew the graph on paper and it was clear that only the parents are relevant, so I didn't even mention it ;-) nor did I have child relationships in the graph... of course, navigate the parents only... because it limits the growth of the frontier
I think one could even extend the algorithm by considering the birth date and only growing the youngest nodes upwards (using a priority queue instead of a queue), so the frontier covers roughly the same birth years, assuming the children of our common ancestor are not too far apart (between 2 and 20 years; extremes may have 60 years, but only with common fathers ... :-)) ...
cool, thanks for inspiring me to think that way, not so much for the article's content ;-)
is the expression an AST? meaning for example
  *
 / \
a   +
   / \
  1   2
was originally a * (1+2)
is the order of the expressions given in such a way that a solution exists without the need to reorder, e.g.
x = 2
y = 2
return x * y
vs.
return x * y
x = 2
y = 3
if so, then I would suggest treating assignment expressions in a way that the left node provides the variable name and the right node is the expression being evaluated recursively (of course this can and should be done in a more OO way)
variables[assignment.left] = evaluate(assignment.right, variables)
def evaluate(root, variables):
    if isinstance(root, BinaryOp):
        if root.operator == '+':
            return evaluate(root.left, variables) + evaluate(root.right, variables)
        elif root.operator == '*':
            return evaluate(root.left, variables) * evaluate(root.right, variables)
    elif isinstance(root, VariableRef):
        return variables[root.variableName]
    elif isinstance(root, Number):
        return root.value
    raise ValueError("unsupported expression node")
and a return expression as
return evaluate(return_statement.expression, variables)
of course things like types and other semantic rules need to be defined...
- Chris August 13, 2017
I don't understand how the game works:
- is k a given number, e.g. one can take 100-k stones; does the first who exceeds this range lose - or does the last who stays within the range win?
- or is it that the person who can't take the last k <= 15 stones wins?
- or something else?
edited (mixed up terms)
I would try to approach the question in a structured way: where do these queries come from, what do we expect to be asked tomorrow, so we don't have to change the structure all the time. Anyway, for the exact question:
page(s) visited by 100 users:
- I would store, for each number of unique users per day, the pages
- as a B-tree with the #unique users as key and the page id and number of total visits as value (I would lose which user it was with this aggregation), but I'd have fast statistics for a page ranking based on user visits.
- to ask which page was visited by exactly 100 users, I'd have an O(lg(m)+k) query, where m is the number of distinct page frequencies per day and k the number of pages with exactly 100 users per day. Instead of a B-tree I could use a hash table with the frequency as index, so it would be O(k); however, I'd expect queries like > 100, > 1000 or > 100 and < 200, etc...
pages visited by 1 user 15 times:
- same as above: I would get all pages visited by 1 unique user and ask which one had exactly 15 visits in total. If that's not good enough, because there are many pages with 1 unique user a day, I would use a different index on the B-tree of the first query.
that index would consist of two integers, the #unique users and the #total page visits (ordered primarily on unique users, secondarily on total page visits). So I could ask both questions on the same index with logarithmic upper bounds (depending on what I want exactly, e.g. one result or multiple results)
page visited by u3 more than 20 times (may be more than one... the first? all > 20? any, if one exists?):
- I would assume the vector describing #visits per page for a single user and day is sparse.
- hash table with the user id as key and a vector as value
- the vector contains tuples (page-id, #visits)
- to find: O(1) to get to the vector and a linear scan on the vector. If I sort the vector, I'd have O(lg(n)) to find the min(>20), O(1) to find any or none, or O(n) to find all > 20
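A minimal sketch of that last index (names are mine; a per-user sparse vector kept sorted by visit count):

from bisect import bisect_right

# hypothetical index: user_id -> (sorted visit counts, page ids in the same order)
visits_by_user = {
    'u3': ([3, 18, 25, 40], ['p9', 'p2', 'p7', 'p1']),
}

def pages_visited_more_than(user_id, threshold):
    counts, pages = visits_by_user.get(user_id, ([], []))
    i = bisect_right(counts, threshold)  # first index with count > threshold
    return pages[i:]

print(pages_visited_more_than('u3', 20))  # ['p7', 'p1']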
is the file on disk? then the bottleneck is disk I/O
is the file in memory? then the bottleneck is memory bandwidth
is the file on multiple machines? good, we can work in parallel: assign blocks of x GB to m machines and make sure the blocks overlap, so at least one machine sees the whole pattern ...
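A minimal sketch of the overlapping-blocks idea (a hypothetical helper; blocks overlap by pattern_len - 1 bytes, so a match straddling a block boundary is fully contained in at least one block):

def chunk_ranges(file_size, chunk_size, pattern_len):
    # split [0, file_size) into blocks for the workers
    ranges = []
    start = 0
    while start < file_size:
        end = min(start + chunk_size + pattern_len - 1, file_size)
        ranges.append((start, end))
        start += chunk_size
    return ranges

print(chunk_ranges(10, 4, 3))  # [(0, 6), (4, 10), (8, 10)]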
@Vijay.B
thanks, that's correct, I didn't check for already used distances. This should work now (code is ugly)
from math import sqrt
from collections import defaultdict

def zero_if_none(a):
    return 0 if a is None else a

def find_gas_station_pos(dist):
    def aux(i, j, c):
        # check if a gas station at pos x would work
        def verify(x):
            for r in [range(0, i), range(j + 1, station_count)]:
                for t in r:
                    if zero_if_none(distances.get(abs(x - pos[t]))) <= 0:
                        return False
            return True

        # add / remove distances based on delta
        def add(x, delta):
            for r in [range(0, i), range(j + 1, station_count)]:
                for t in r:
                    distances[abs(x - pos[t])] += delta

        # base case
        if i > j: return True

        # try from right
        x = pos[-1] - dist[-c]
        if zero_if_none(distances.get(x)) > 0:
            if verify(x):
                add(x, -1)
                pos[i] = x
                if aux(i + 1, j, c + 1): return True  # found solution
                add(x, 1)  # backtrack
        else:
            if aux(i, j, c + 1): return True

        # try from left
        x = dist[-c]
        if zero_if_none(distances.get(x)) > 0:
            if verify(x):
                add(x, -1)
                pos[j] = x
                if aux(i, j - 1, c + 1): return True  # found solution
                add(x, 1)  # backtrack
            return False  # backtrack
        else:
            return aux(i, j, c + 1)

    station_count = int((sqrt(8 * len(dist) + 1) + 1) / 2)  # len(dist) = (station_count^2 - station_count) / 2
    pos = [0 for i in range(station_count)]
    distances = defaultdict(int)
    dist.sort()
    for i in range(0, len(dist) - 1):
        distances[dist[i]] += 1
    if station_count <= 1: return pos
    pos[-1] = dist[-1]
    if aux(1, station_count - 2, 2):
        return pos
    else:
        return []  # no solution found
Solution:
1) Brute force approach:
- store all students' positions and, when adding one, choose the two with the greatest distance between each other and place the new student in the middle
- if students are kept in sorted order in an array, it's O(n) time complexity
- accordingly, removal has time complexity O(n)
- space complexity is O(n)
it requires special care for the first two, seating the first at position 0 and the second at position n-1
2) Tree as ordered set for distances:
- have a tree that orders the students according to the free distance on their right (if equal, by position of the student)
- have a hash table that translates from student id to tree node
a) add:
- deal with the special case of fewer than two students (see below)
- pick the student with the largest distance on their right and insert the new student in between
- update the nodes accordingly
- time complexity: O(lg(n))
b) remove:
- find the right node (O(1))
- patch the nodes to the left and right
- update the nodes (O(lg(n)))
- total time complexity: O(lg(n))
Since we are talking about a row of students, I assume 20 or 30 people is the upper bound for that row. In this case the O(lg(n)) solution is outperformed by the O(n) solution, due to cache effects and lower constants.
Anyway, the O(lg(n)) solution is more fun to code:
- to solve the issue with the empty row, I place two sentinel nodes at positions -1 and n; the left one's left pointer links back to itself, the right one's right pointer links back to itself
- I provide a distance function (getDist), which looks like right.pos - this.pos and, thanks to the sentinels, always works
- I create a compare function that uses this distance function and on equality prefers the lower position (as the problem states)
- I have an add function which adds a tree node and an entry for it in the index (hash table)
- I have an update function which reorders the tree when a node's key changes (with the STL, I remove and re-insert)
- I have a Student structure with pos_, left_, right_ and id_ attributes (id_ is informative)
- students_ is the ordered set, ordered according to the description above
addStudent looks like:
int addStudent(int studentId)
{
    Student* left = *students_.begin(); // the one with the largest space on the right
    Student* right = left->right_;
    if (right->pos_ - left->pos_ == 1) return -1; // no space
    Student* newStudent = new Student(studentId, -1, left, right); // pos = -1
    if (left->left_ == left) newStudent->pos_ = 0; // right of left sentinel
    else if (right->right_ == right) newStudent->pos_ = right->pos_ - 1; // left of right sentinel
    else newStudent->pos_ = (right->pos_ + left->pos_) / 2; // between two students
    left->right_ = newStudent;
    right->left_ = newStudent;
    update(left); // update left node in tree, right node's distance and pos didn't change
    add(newStudent); // add new node to tree and index
    return newStudent->pos_;
}
removeStudent:
- studentsIdx_ is the hashtable
void removeStudent(int studentId)
{
    auto itRemove = studentsIdx_.find(studentId);
    Student* remove = *itRemove->second; // student to remove
    Student* left = remove->left_; // left of student to remove
    Student* right = remove->right_; // right of student to remove
    left->right_ = right;
    right->left_ = left;
    update(left); // left node's distance changed, right node's distance and pos didn't
    students_.erase(remove); // remove pointer from set
    studentsIdx_.erase(itRemove); // remove from hash table
    delete remove; // free memory
}
the full code (C++ 11)
#include <cassert>
#include <iostream>
#include <set>
#include <unordered_map>

using namespace std;

struct Student
{
    int id_;         // the id of the student
    int pos_;        // the position of the student
    Student* left_;  // student on the left
    Student* right_; // student on the right

    int getDist() const { return right_->pos_ - pos_; }

    Student(int id, int pos, Student* left, Student* right)
        : id_(id), pos_(pos), left_(left == nullptr ? this : left),
          right_(right == nullptr ? this : right) {
    }
};

struct StudentOrderComp
{
    bool operator() (const Student* left, const Student* right) {
        if (left->getDist() > right->getDist()) return true; // first the greater distances
        if (left->getDist() < right->getDist()) return false;
        return left->pos_ < right->pos_; // if equal distance, first the lower positions
    }
};

class ClassRoom
{
private:
    int seatCount_; // number of seats
    set<Student*, StudentOrderComp> students_; // ordered set with students
    unordered_map<int, decltype(students_)::const_iterator> studentsIdx_; // student id -> Student

public:
    ClassRoom(int rowLength)
    {
        seatCount_ = rowLength;
        Student *leftSentinel = new Student(-1, -1, nullptr, nullptr);
        Student *rightSentinel = new Student(-2, seatCount_, leftSentinel, nullptr);
        leftSentinel->right_ = rightSentinel;
        add(leftSentinel);
        add(rightSentinel);
    }

    ~ClassRoom()
    {
        for (Student* s : students_) delete s; // clean up
    }

    int addStudent(int studentId)
    {
        assert(studentId >= 0); // ids -1 and -2 are reserved for the left and right sentinels
        Student* left = *students_.begin(); // the one with the largest space on the right
        Student* right = left->right_; // the student on the right of left
        if (right->pos_ - left->pos_ == 1) return -1; // no space
        Student* newStudent = new Student(studentId, -1, left, right);
        if (left->left_ == left) newStudent->pos_ = 0; // left sentinel
        else if (right->right_ == right) newStudent->pos_ = right->pos_ - 1; // right sentinel
        else newStudent->pos_ = (right->pos_ + left->pos_) / 2; // in between two students
        left->right_ = newStudent; // patch existing left's right pointer
        right->left_ = newStudent; // patch existing right's left pointer
        update(left); // update left node in tree, right node's distance and pos didn't change
        add(newStudent); // add new node to tree and index
        return newStudent->pos_;
    }

    void removeStudent(int studentId)
    {
        assert(studentId >= 0); // ids -1 and -2 are reserved for the left and right sentinels
        auto itRemove = studentsIdx_.find(studentId);
        assert(itRemove != studentsIdx_.end()); // must exist
        Student* remove = *itRemove->second; // student to remove
        Student* left = remove->left_; // left of student to remove
        Student* right = remove->right_; // right of student to remove
        left->right_ = right; // patch left's right
        right->left_ = left; // patch right's left
        update(left); // left node's distance changed, right node's distance and pos didn't
        students_.erase(remove); // remove pointer from set
        studentsIdx_.erase(itRemove); // remove from index
        delete remove; // free memory
    }

private:
    // insert into index and ordered set
    void add(Student* student)
    {
        studentsIdx_[student->id_] = students_.insert(student).first;
    }

    // used to update the tree node in the set (remove and re-insert)
    void update(Student* student)
    {
        auto it = studentsIdx_.find(student->id_);
        assert(it != studentsIdx_.end()); // must exist
        students_.erase(it->second); // remove it
        add(student); // add it again
    }
};

int main()
{
    ClassRoom r(10);
    cout << "add(1): " << r.addStudent(1) << endl; // 0
    cout << "add(2): " << r.addStudent(2) << endl; // 9
    cout << "add(3): " << r.addStudent(3) << endl; // 4
    cout << "add(4): " << r.addStudent(4) << endl; // 6
    cout << "remove(2)" << endl;
    r.removeStudent(2);
    cout << "add(5): " << r.addStudent(5) << endl; // 2 (it's the first 3-space)
    cout << "add(6): " << r.addStudent(6) << endl; // 9 (now only 3 spaces)
    cout << "remove(1)" << endl; // creates a 2-space
    r.removeStudent(1);
    cout << "add(7): " << r.addStudent(7) << endl; // 0 (first 2-space)
    cout << "add(8): " << r.addStudent(8) << endl; // 7
    cout << "add(9): " << r.addStudent(9) << endl; // 1 (first 1-space)
    cout << "add(10): " << r.addStudent(10) << endl; // 3
    cout << "add(11): " << r.addStudent(11) << endl; // 5
    cout << "add(12): " << r.addStudent(12) << endl; // 8 (last empty space)
    cout << "add(13): " << r.addStudent(13) << endl; // -1, full
    cout << "remove(3)" << endl;
    r.removeStudent(3);
    cout << "add(14): " << r.addStudent(14) << endl; // 4 (only empty space)
}
Find if two people in a family tree are blood-related.
Solution:
- do a BFS from the two sources, using one queue for both
- limit the search depth to max_generations to prevent exploring too many people
the last point is a common problem with social network graphs and k'th degree relationships.
the frontier of the first generation is 2^1, of the 2nd generation 2^2; if we theoretically explored 20 generations, we would visit about a million people.
from collections import deque

def blood_related(a, b, max_generations=5):
    generation = 0
    queue = deque([(a, 0), (b, 1), (None, None)])
    visited = [set(), set()]
    while queue:
        cur, src = queue.popleft()
        if cur is None:  # use None to indicate the next generation in the queue
            generation += 1
            if not queue or generation > max_generations:
                break
            queue.append((None, None))
        elif cur in visited[src ^ 1]:
            return True
        else:
            visited[src].add(cur)
            for parent in cur.parents:
                if parent is not None and parent not in visited[src]:
                    queue.append((parent, src))
    return False

class Person:
    def __init__(self, name, mother=None, father=None):
        self.parents = [mother, father]
        self.name = name

g = Person('g', Person('h'), Person('i'))
f = Person('f')
d = Person('d')
e = Person('e')
c = Person('c', e, d)
a = Person('a', g, c)
b = Person('b', e, f)
print(blood_related(a, b))  # True, common ancestor e
print(blood_related(a, g))  # True, g is a parent of a
print(blood_related(b, g))  # False
or, looking at the samples, it prints the running path sums (prefix sums of the powers), which then is:
s[0] = 0, s[i] = s[i-1] + 2^k for some k >= 0 with non-decreasing exponents, and the last element equals N
def sum_paths(N):
    def aux(path, sum, j):
        if sum == N:
            res.append(path)
        if sum < N:
            for i in range(j, int.bit_length(N)):
                aux(path + [sum + (1 << i)], sum + (1 << i), i)
    res = []
    aux([0], 0, 0)
    return res
I assume the question is:
print sequences s where sum(s[i]) = N and s[i-1] <= s[i], s[0] = 0, and each s[i] is a power of two (s[i] = 2^k for some k >= 0)
that would be:
def sum_paths(N):
    def aux(path, sum, j):
        if sum == N:
            res.append(path)
        elif sum < N:
            for i in range(j, int.bit_length(N)):
                aux(path + [1 << i], sum + (1 << i), i)
    res = []
    aux([0], 0, 0)
    return res
@Anwar: doesn't seem to work in all cases.
input [1, 9, 10, 10, 11, 20] returns [0, 1, 10, 20], which would create distances of [1, 9, 10, 10, 19, 20] (similar, but not the same)
[0, 10, 11, 20] or [0, 9, 10, 20] would be valid solutions
I checked because I didn't really understand the removal of duplicates, and
result[1] = allPairDist[0]
can't work, because the first isn't necessarily the nearest point (as the example above shows, distance 1 is between 10 and 11, somewhere in the middle)
anyway, I'd like to explore your idea more, it seems interesting; maybe you could explain the ideas
looks like a tough one...
Clarification / Definition:
- the positions of the gas stations are one dimensional (x-position), further denoted as pos[i], where i is the gas station, 0 <= i < n. I assume pos[i-1] < pos[i], which will later result from the solution
- the given distances dist[j] are in random order; the only thing known is that all distances between all pairs are in this vector and the distances are integers with no error
- dist[i] != 0 for all i
Observations:
- the longest distance is pos[n-1] - pos[0]
- then, for each remaining point, its longest remaining distance is to either pos[0] or pos[n-1]
Algorithm (with backtracking):
- sort the distance array
- put all distances into a hash table with the distance as key and the #occurrences as value
- start with:
pos[0] = 0
pos[n-1] = longest distance
i = 1, j = n-2, c = 2
- recursion(i, j, c):
if i > j: return true
d = the c-th longest distance
assume pos[i] = pos[n-1] - d ("from the right") and verify whether the required distances are still available; if yes, recurse on (i+1, j, c+1); if the recursion returns true, return true, otherwise try the 2nd option:
assume pos[j] = d ("from the left") and verify whether the distances to the already placed gas stations exist; if yes, recurse on (i, j-1, c+1), if not, return false
runtime analysis: that's tough. The quick answer is O(2^n * n), because at each recursion level there are two choices and verifying a choice takes about n steps on average. However, one can't easily construct a sample that actually explores that many paths.
I assume a far more elegant solution exists
from math import sqrt
from collections import defaultdict

def find_gas_station_pos(dist):
    def aux(i, j, c):
        # check if a gas station at pos x would work
        def verify(x):
            for r in [range(0, i), range(j + 1, station_count)]:
                for t in r:
                    count = distances.get(abs(x - pos[t]))
                    if count is None or count <= 0:
                        return False
            return True

        # add / remove distances based on delta
        def add(x, delta):
            for r in [range(0, i), range(j + 1, station_count)]:
                for t in r:
                    distances[abs(x - pos[t])] += delta

        # base case
        if i > j: return True

        # try from right
        x = pos[-1] - dist[-c]
        if verify(x):
            add(x, -1)
            pos[i] = x
            if aux(i + 1, j, c + 1): return True  # found solution
            add(x, 1)  # backtrack

        # try from left
        x = dist[-c]
        if verify(x):
            add(x, -1)
            pos[j] = x
            if aux(i, j - 1, c + 1): return True  # found solution
            add(x, 1)  # backtrack
        return False  # backtrack

    station_count = int((sqrt(8 * len(dist) + 1) + 1) / 2)  # len(dist) = (station_count^2 - station_count) / 2
    pos = [0 for i in range(station_count)]
    distances = defaultdict(int)
    dist.sort()
    for i in range(0, len(dist) - 1):
        distances[dist[i]] += 1
    if station_count <= 1: return pos
    pos[-1] = dist[-1]
    if aux(1, station_count - 2, 2):
        return pos
    else:
        return []  # no solution found
print(find_gas_station_pos([10, 20, 70, 80, 30, 20, 100, 70, 50, 90]))
@sheva: it's correct; assume the areas are all equal to one
- probability to pick the 1st is 1
- next iteration, probability to pick the 2nd is 0.5, so 0.5 for the 1st and 0.5 for the 2nd; 1st and 2nd have the same probability (uniform)
- next iteration, probability to pick the 3rd is 1/3, so 1st and 2nd together have 2/3 and the 3rd has 1/3, still uniform
...
it's an elegant algorithm for such cases; a similar one is used when shuffling a deck of cards
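A minimal sketch of that pick-the-i-th-with-probability-1/i scheme (reservoir sampling for a single element; names are mine):

import random

def pick_uniform(items):
    chosen = None
    for i, item in enumerate(items, start=1):
        # replace the current choice with probability 1/i; after step i,
        # every item seen so far has been kept with probability 1/i
        if random.randrange(i) == 0:
            chosen = item
    return chosen

print(pick_uniform(['a', 'b', 'c']))  # each with probability 1/3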
I think it's important to define the problem first:
- what does slow mean? high latency? for which percentile of requests? all, or a few, e.g. 5%? where is it measured? (it could be that it's not the server but too much traffic, so the server receives the request already late and delivery to the client takes too long, ...) best is to draw a picture with the server(s) and the infrastructure around them, up to the point of measurement... worst case, it's the end customer telling you this with no additional information.
- are there any service level objectives (SLOs) defined one can compare against?
- is there historical data? did the latency increase slowly?
- when did it change? was it due to a new release, or were there no modifications?
- what diagnostics are available?
Potential issues are:
- the server itself is overloaded (e.g. too much disk usage)
- the server contacts other services and waits for them (e.g. a slow service in the back end)
- the CPU is overloaded because some CPU-bound processes eat away CPU power
- too many services use too much memory, and the system keeps swapping in and out
- the data grew, and the current sharding scheme may not be effective anymore
- an OS update, a new network stack or driver?
- etc. etc.
I think a systematic approach could be:
- verify / assume objectives
- narrow down the service(s) causing the delay (DB cluster, app server cluster, which services specifically)
start at the front-end service: what's the delay between receiving a request and serving it? does this include the delay causing problems? if yes, go down: check the call graphs to other services; is there any call causing the delay? if yes, move to that server and repeat there
- look into the server(s) to see what is causing the delay (disk, CPU, memory (e.g. page faults))
- verify historical data if available (sudden change vs. continuous process)
Assumption:
- 0 <= e < n for all e in ex (this should be mentioned and not just assumed)
Alternative approach:
- while all other approaches are perfectly valid, I think the non-deterministic version is worth mentioning:
- generate a number in [0..n) and check whether it's in the excluded list
- if not, great; if it is, repeat the step above
public int random(int n, Set<Integer> ex) {
    if (n <= ex.size()) return -1; // otherwise no valid number exists
    Random rnd = new Random();
    int num;
    do {
        num = rnd.nextInt(n);
    } while (ex.contains(num));
    return num;
}
This is very good on average when n is significantly larger than the excluded list. It would be an infinite loop if the excluded list size were n (hence the check above).
With only one number not in the excluded list, it takes about n iterations on average (a geometric distribution); e.g. with n = 2 and len(ex) = 1 you have P(at least one valid num after 2 trials) = 75%. However, there is theoretically an infinitely long tail, which is the problem with this approach. Maybe do the math or write a Monte Carlo simulation; it's interesting.
Another alternative is to combine the two approaches in one loop, and whichever leads to a result first wins. This is arguably the best approach, because it removes the non-determinism in the worst case and keeps the very good average time behavior.
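A minimal sketch of that combination (my interpretation: bound the number of random trials and fall back to a deterministic pick, so the worst case stays O(n) while the average stays fast):

import random

def random_excluding(n, ex):
    if n <= len(ex):
        return -1
    for _ in range(32):  # bounded number of random trials
        num = random.randrange(n)
        if num not in ex:
            return num
    # deterministic fallback: pick uniformly among the valid numbers
    return random.choice([x for x in range(n) if x not in ex])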
@ali.kheam:
my thoughts:
- there is no method that sorts in O(k), k being the fixed cache size, except variations of radix sort (if the values have a limited range)
- if k is e.g. 3, a hash table is overkill; maybe that was the intention
- if the ranking function has a low range (e.g. 0-254) you could construct a radix-sort-based DS with nested hash tables etc. It's possible, but it's not very nice... that would have O(1) on all operations. So maybe one should have asked: what is the range of the ranking function, what is the size of the cache, is it constant as well, etc...
@koustav.adorable: I know, it's the geeksforgeeks solution. It runs in O(n*m) best and worst case. Its strategy is difficult to understand without a description, if the "histogram" idea doesn't jump right out at you ;-)
If it's about finding a single rectangle, solutions with a significantly better average runtime are thinkable that keep the same worst-case behavior (depending on the rectangle dimensions). Sometimes you would approach a problem differently if you didn't know about the "well known problems" ... I think that's the point of these interviews: how does one solve a problem that is new.
@NoOne: I don't understand your last statement; what do you need to specify about the ranking function's properties? It's a ranking function: it returns a rank value, and we are told that the rank value of a single item changes on get (not all of them, no complete re-shuffling). We can assume it's an integer or whatever; in the end what we need is the top k of the rank values and the associated items in the cache (the discussion about equal rank values is a detail).
- Chris July 30, 2017
depends what you mean exactly by "merge", e.g.
a: [1..7]
b: [5..10]
c: [9..13]
and you want [1..13] as a merged interval of a,b,c
you might want to sort by start times if you want to merge starting from the start. You walk through the sorted intervals, pick the first, and while the next one overlaps you take the maximum of the ends, until no further overlap with the maxed end occurs... (a sketch follows below)
If you like thinking backwards, just do the same from the end with the minimized start. It's symmetric... so it seems to matter neither for the result nor for the runtime.
If you want to work a bit more with your intervals, it may make sense to put them into an interval tree; you may google it.
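A minimal sketch of the sort-by-start merge described above:

def merge_intervals(intervals):
    intervals = sorted(intervals)  # sort by start time
    merged = []
    for start, end in intervals:
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)  # overlap: max the end
        else:
            merged.append([start, end])  # no overlap: start a new interval
    return merged

print(merge_intervals([(1, 7), (5, 10), (9, 13)]))  # [[1, 13]]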
@koustav: [1,2,0,2,2,4] returns 3; it should return 4, as 4,2,0,1. I think Fernando's has a similar issue
Assumptions:
- S is a sequence with unique values, because it has an order (or an ordered set)
- the set can be empty, which is only the case if A[i] >= n or A[i] < 0 for all A[i], 0 <= i < n
- I therefore assume the values within A have no constraint other than being integers
[- a neat simplification would be to state 0 <= A[i] < n for all A[i], 0 <= i < n, which would make the thing easier to code.]
Solution:
- insight comes from drawing it as a graph. I think the key is to notice it's a graph problem; that makes it easier to draw and argue.
- the first approach is to start from each node until closing a cycle or until encountering a node with a value < 0 or >= n. This is inefficient, as it recomputes common subproblems. It's O(n^2) worst-case time, e.g. for the array [n-1, 0, 1, 2, ..., n-2]
- the better approach is to start with node i and keep an array that stores, per node, the length of the already discovered path one can visit starting from that node. That is, how long the path is until it closes a cycle or encounters a node with a value beyond the array range.
this has amortized O(n) time complexity and O(n) space complexity
I get the impression there is a more elegant approach, but I couldn't find it in 10 minutes of thinking.
#include <vector>
#include <algorithm>
#include <iostream>

using namespace std;

int max_component_size(const vector<int>& input) {
    size_t n = input.size();
    vector<int> component_size(n, 0);
    int max_size = 0;
    for (size_t i = 0; i < n; ++i) {
        if (component_size[i] == 0) {
            // 1. discover until
            //    - a cycle
            //    - an already found component
            //    - a value "pointing" beyond the array range
            size_t j = i;
            int disc_size = 0;
            while (component_size[j] == 0) {
                component_size[j] = -1;
                j = static_cast<size_t>(input[j]); // negative input values will create values > n
                if (j < n) disc_size++;
                else break;
            }
            if (j >= n || component_size[j] != -1) {
                // it's a path entering an already discovered
                // path, or it's a path to nowhere (j >= n)
                if (j < n) disc_size += component_size[j];
                max_size = max(disc_size, max_size);
                // backtrack and store the discovered component size
                j = i;
                while (j < n && component_size[j] == -1) {
                    component_size[j] = disc_size;
                    j = input[j];
                    disc_size--;
                }
            } else {
                // it's a cycle
                max_size = max(disc_size, max_size);
                while (component_size[j] == -1) {
                    component_size[j] = disc_size;
                    j = input[j];
                }
            }
        }
    }
    return max_size;
}

int main()
{
    cout << max_component_size({ 5, 4, 0, 3, 1, 6, 2 }) << endl; // should be 4: sample given
    cout << max_component_size({ 1, 2, 0, 2, 2, 4 }) << endl; // should be 5: 5,4,2,1,0
    cout << max_component_size({ 5, 0, 1, 2, 3, 4 }) << endl; // should be 6: e.g. 0,5,4,3,2,1
    cout << max_component_size({ 1, 2, 3, 4, 5, 0 }) << endl; // should be 6: e.g. 0,1,2,3,4,5
    cout << max_component_size({ -1, 9, 2 }) << endl; // should be 1: 9
    cout << max_component_size({ -1, 9, -2 }) << endl; // should be 0
    cout << max_component_size({ 1, 0, -1, 2, 3, -3, 5, 8, 4 }) << endl; // 4: 8,4,3,2
}
@NoOne: LRU is significantly easier, because you needn't maintain a sorted order. Here you have a ranking function that gives you an arbitrary order. So you can't do it with a hash table pointing to linked-list elements, unless you accept O(n) time to change the position of the element that gets hit and receives a new arbitrary rank value.
With LRU, the new rank value would always be the biggest, so you can maintain the order without comparing anything...
It's similar, but not the same. The question is therefore actually a bit less annoying ;-)
Assumptions:
- coordinates are integers
- no overflow can occur due to the range of values
Implementation:
- for each pair of points a,b find a' and b' such that a,b,b',a' forms a square. a' and b' are determined by a and b (two options: a',b' or a'',b''):
dx = ax - bx
dy = ay - by
a'x = ax - dy, b'x = bx - dy
a'y = ay + dx, b'y = by + dx
a''x = ax + dy, b''x = bx + dy
a''y = ay - dx, b''y = by - dx
find a' and b', or a'' and b'', in the array
the find can be done in O(1) by using a hash table
- overall time complexity O(n^2), and O(n) space
if working with floating point this won't work as straightforwardly, but you can actually do almost the same: instead of using the hash table to find the point pairs, use the hash table as a spatial store dividing the space into small squares; then you put each point into the 8 surrounding squares and the one it lies on (etc. etc.), find the 9 squares around a point, and check against all points found within these squares.
If you divide space finely enough, you might even get away with just checking whether there is a point in the square... but get the error discussion right.
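A minimal sketch of such a spatial hash (my names; points are bucketed by grid cell once, and a lookup collects the 9 cells around a query point):

from collections import defaultdict

def build_spatial_hash(points, cell):
    grid = defaultdict(list)
    for (x, y) in points:
        grid[(int(x // cell), int(y // cell))].append((x, y))
    return grid

def candidates_near(grid, x, y, cell):
    # collect the points in the 9 cells around (x, y); the exact
    # epsilon comparison still has to run against these candidates
    cx, cy = int(x // cell), int(y // cell)
    out = []
    for dx in (-1, 0, 1):
        for dy in (-1, 0, 1):
            out.extend(grid.get((cx + dx, cy + dy), []))
    return out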
#include <iostream>
#include <unordered_set>
#include <vector>
#include <functional>

using namespace std;

bool find_square_in_table(const vector<pair<int, int>>& points) {
    // hash function for a set of coordinates
    auto pair_hash = [](const pair<int, int>& p) {
        return hash<decltype(p.first)>()(p.first) * 9315 + hash<decltype(p.second)>()(p.second);
    };
    // build a hash table with all coordinates
    unordered_set<pair<int, int>, decltype(pair_hash)> ht(10, pair_hash);
    for (const auto& point : points) {
        ht.emplace(point);
    }
    // build pairs and look for points forming a square
    for (size_t i = 0; i + 1 < points.size(); ++i) {
        for (size_t j = i + 1; j < points.size(); ++j) {
            int dx = points[i].first - points[j].first;
            int dy = points[i].second - points[j].second;
            size_t ij[] = { i, j };
            for (int s = -1; s < 2; s += 2) {
                bool found = true;
                for (size_t idx : ij) {
                    pair<int, int> ptf = pair<int, int>(points[idx].first + s * dy, points[idx].second - s * dx);
                    found &= ht.find(ptf) != ht.end();
                }
                if (found) {
                    return true;
                }
            }
        }
    }
    return false;
}

int main()
{
    cout << find_square_in_table({ { 5, 5 }, { 10, 5 }, { 10, 10 }, { 5, 10 } }) << endl; // true
    cout << find_square_in_table({ { 5, 5 }, { 10, 5 }, { 10, 10 }, { 5, 11 } }) << endl; // false
    cout << find_square_in_table({ { 14, 5 }, { 2, 9 }, { 7, 3 }, { 12, 12 }, { 5, 10 }, { 3, 9 } }) << endl; // true: (7,3),(5,10),(12,12),(14,5)
    return 0;
}
since the problem was changed to no parent pointers, here is the described approach without parent pointers
# solution without parent pointer
from collections import deque

# searches the tree from the root to the node in O(n) time
# and returns the path to the node as a dictionary (key=node, value=parent)
def find_path_to_node(root, node):
    path = {}
    stack = [[False, root, None]]
    while stack:
        visited, root, parent = stack[-1]
        if visited:
            stack.pop()
            del path[root]
        else:
            stack[-1][0] = True
            path[root] = parent
            if root == node:
                break  # found node
            for child in [root.left, root.right]:
                if child is not None:
                    stack.append([False, child, root])
    return path

# does the find in O(n) time
def find_nearest_leaf(root, node):
    parents = find_path_to_node(root, node)
    queue = deque()
    queue.append((0, None, node))
    while queue:
        dist, prev, cur = queue.popleft()
        # if it is a leaf and not the node I started at, we are finished
        if dist > 0 and cur.is_leaf():
            return dist
        for adj in [parents.get(cur), cur.left, cur.right]:
            if adj is not None and adj != prev:
                queue.append((dist + 1, cur, adj))

class Node:
    def __init__(self, value, left=None, right=None):
        self.left = left
        self.right = right
        self.value = value

    def is_leaf(self):
        return self.right is None and self.left is None

node_h = Node('h', Node('c', Node('b'), Node('e')), Node('K', None, Node('m', Node('p'), None)))
root = Node('p', node_h, Node('q'))
print(find_nearest_leaf(root, node_h))  # returns 2
@tnutty2k8:
because of operator precedence... you may want to write down the recursion tree for array = [a,b,c]; this should give you the insight.
right, these are all integers, so I suppose the only advantage you get from division is in a situation like 1, -100, 50... in order to get rid of the -100 one could do 1/-100+50... but actually 1--100+50 is still better...
but x/0 could lead to positive infinity, which would be neat... ;-)
The thing is full of special cases one needs to consider in order to get an optimal solution. However, since the question got re-posted, I assume somebody has a hint ready...
I'll try to rephrase what I understood:
- create an associative cache (a lookup, a key-value store, ...)
- in the cache there are items, I assume some objects with attributes (e.g. an Item)
- items have a "rank"; I assume high-ranked items should be kept in the cache over low-ranked items
- the cache has a size limit, e.g. maxSize_
- get(key) method: gets an element from the cache, or nullptr if not there
- put(item) method: places the element in the cache and, if the cache would grow over max_size, removes the lowest-ranked element (which might be the one just added)
Assumptions:
- there are no two elements with the same rank; if there were, one could evict either of the two elements if their rank score is the lowest
- rank is an integer value
in your example there is a db_read_entry(key) method, which I assume is from a data access component. I would leave DB aspects out, since they are not relevant for the cache functionality
The first question I would solve similarly to you: get is trivial, and put would be:
void put(const shared_ptr<Item>& item) {
    if (ht_.find(item->getId()) == ht_.end()) {
        ht_.emplace(item->getId(), item);
        queue_.emplace(pair<int, int>(item->getRank(), item->getId()));
    }
    // clean up queue
    if (queue_.size() > maxSize_) {
        auto qitem = queue_.top(); // get top element
        queue_.pop(); // remove top element
        ht_.erase(qitem.second); // remove by id
    }
}
the only thing to mention is that it might be slightly wasteful if the element being added has a lower rank than the top of the queue, because it is added and then removed again.
The follow-up question is a bit tricky, because you actually want to change the priority of an item in the queue. Most standard containers have a problem with this, because you have no random access to the item.
One way around it is to just add another item to the queue with the new priority. Since you need to keep the HT size constant, the queue will eventually be cleaned, but there is a risk that the queue grows significantly larger than the HT. It is not optimal.
If you implement a binary heap yourself, you can solve this issue and place pointers to the binary-heap elements into the HT as values. If you change the priority of an item, you just bubble it up or down in the heap, and you get to the element in O(1) coming from the HT. I once implemented such a binary heap as an exercise; I maintained the index into the heap vector on the "iterators" pointing to the heap elements. That turned out to be quite optimal. However, I never found anybody doing it the same way; e.g. when you look at Dijkstra implementations, people usually just add queue items with lower priority and accept the queue growing larger than needed (I think it's not relevant there).
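A minimal sketch of such an indexed heap (my names; a min-heap of (rank, item) plus an item -> heap-position map, so changing an item's rank is O(lg n)):

class IndexedMinHeap:
    def __init__(self):
        self.heap = []  # list of (rank, item)
        self.pos = {}   # item -> index into self.heap

    def _swap(self, i, j):
        self.heap[i], self.heap[j] = self.heap[j], self.heap[i]
        self.pos[self.heap[i][1]] = i
        self.pos[self.heap[j][1]] = j

    def _sift_up(self, i):
        while i > 0 and self.heap[i][0] < self.heap[(i - 1) // 2][0]:
            self._swap(i, (i - 1) // 2)
            i = (i - 1) // 2

    def _sift_down(self, i):
        n = len(self.heap)
        while True:
            smallest = i
            for c in (2 * i + 1, 2 * i + 2):  # the two children
                if c < n and self.heap[c][0] < self.heap[smallest][0]:
                    smallest = c
            if smallest == i:
                return
            self._swap(i, smallest)
            i = smallest

    def push(self, rank, item):
        self.heap.append((rank, item))
        self.pos[item] = len(self.heap) - 1
        self._sift_up(len(self.heap) - 1)

    def change_rank(self, item, new_rank):
        i = self.pos[item]  # O(1) via the position map
        old_rank = self.heap[i][0]
        self.heap[i] = (new_rank, item)
        if new_rank < old_rank:
            self._sift_up(i)   # bubble up
        else:
            self._sift_down(i) # bubble down

    def pop_min(self):
        self._swap(0, len(self.heap) - 1)
        rank, item = self.heap.pop()
        del self.pos[item]
        if self.heap:
            self._sift_down(0)
        return rank, item

h = IndexedMinHeap()
h.push(5, 'a'); h.push(3, 'b'); h.push(7, 'c')
h.change_rank('a', 1)
print(h.pop_min())  # (1, 'a')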
your statement "create an associated binary tree (std::set) with the hashmap (map of entry as key, and the iterator entry in the std::set of entries)" I didn't understand.
Do you mean a hash table with the item id as key and the tree iterator as value? In the tree (you will need a map, not a set) you would have the rank as key and the item as value, like:
unordered_map<int, map<int, shared_ptr<Item>>::iterator> item_id_to_treeItem;
map<int, shared_ptr<Item>> rank_to_item;
This works as well: you take the tree's min element, from there you get the item's id, then you can remove it from the HT. This solution has the same O(lg(n)) properties for get() and put() as the heap, due to the maintenance of the tree, but the tree needs to rebalance etc., which is heavier on constants than a binary heap; the heap can be implemented in a vector, which is much more heap- and cache-friendly than trees.
the idea is to do a special kind of BFS:
1) I assume I have a parent pointer for each node; if I don't, I can easily get a dictionary with the necessary parent pointers, derived from the path from the tree root to the given node (I then need the root, which defines the tree I can look in). Further, I assume that if I get passed a leaf, I should return the nearest other leaf and not the passed-in node (leaf) itself.
2) instead of maintaining a visited array, I can simply not go back where I came from, because there are no cycles and no forward edges in a regular binary tree
3) I just start at the node and go to the parent, left and right
...
Sample tree
       p
      / \
     h   q
    / \
   c   k
  / \   \
 b   e   m
        /
       p
complete implementation, using parent pointers for simplicity, in Python 3:
from collections import deque

def find_nearest_leaf(node):
    queue = deque()
    queue.append((0, None, node))
    while queue:
        dist, prev, cur = queue.popleft()
        if dist > 0 and cur.is_leaf():
            return (cur, dist)
        for adj in [cur.parent, cur.left, cur.right]:
            if adj is not None and adj != prev:
                queue.append((dist + 1, cur, adj))
    return (None, None)

class Node:
    def __init__(self, value, left=None, right=None):
        self.parent = None
        self.left = left
        self.right = right
        self.value = value
        if self.right is not None:
            self.right.parent = self
        if self.left is not None:
            self.left.parent = self

    def is_leaf(self):
        return self.right is None and self.left is None

node_h = Node('h', Node('c', Node('b'), Node('e')), Node('K', None, Node('m', Node('p'), None)))
Node('p', node_h, Node('q'))
print(find_nearest_leaf(node_h))  # finds a leaf at distance 2
If you have a parent pointer it's really simple: from a node, go up, left and right (put those nodes into a queue as pairs of (coming-from, node), e.g. enqueue((self, self.left)), ...).
When you dequeue, only visit nodes that are not equal to the first element of the pair taken from the queue (don't go back where you came from).
Then the first leaf encountered is it...
If you have no parent pointer, you first create a parent-pointer list from the root to the node you start from; all other parent pointers are not needed.
Code follows...
1. per document: extract the document text, remove all tags, formatting etc.
2. per document: perform word stemming and remove filler words (like "a", "I", etc...)
3. create word ids, where each word gets an id
4. per document: create a vector of words, where you have, per word id, the number of times it occurs in the document, e.g. {(1, 1), (8, 10), (14, 2)}
5. perform k-means, that is, find k partitions where k is the desired number of categories. Per category, maybe take the top most used words to describe it. A small sketch follows below.
k-means works as follows:
1) pick k random categories, then check for each document which one it has the most similarity to, and assign it to this category
2) now, for each category, re-calculate its center by adding up the vectors
3) check for each document which category it's closest to and re-assign it to this category
4) stop if no re-assignment happened in 3), otherwise continue with 2)
5) for similarity you can pick anything, like cosine similarity or other similarity functions
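A minimal sketch of that loop (my simplifications: dense vectors instead of sparse ones, cosine similarity, documents given as lists of floats):

import random
from math import sqrt

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def kmeans(docs, k, iterations=100):
    centers = random.sample(docs, k)  # 1) pick k random centers
    assignment = [-1] * len(docs)
    for _ in range(iterations):
        # 1)/3) assign each document to the most similar center
        new_assignment = [max(range(k), key=lambda c: cosine(doc, centers[c]))
                          for doc in docs]
        if new_assignment == assignment:  # 4) stop if nothing was re-assigned
            break
        assignment = new_assignment
        # 2) re-calculate each center by adding up its documents' vectors
        for c in range(k):
            members = [d for d, a in zip(docs, assignment) if a == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment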
assumptions:
1) the timestamp is an integer with 1 ms resolution, large enough that it won't overflow
2) if there are more than two messages per second per user id, e.g. with timestamps 100, 101, 102, it's sufficient to call too_fast(100, 101) and too_fast(101, 102) and leave out too_fast(100, 102)
approach 1:
for each message, get the unique user id and put the message into a hash table (user_id as key, message as value). Before putting it in, check whether there is already a message stored for this user; if so, compare the times, and if the two are within 1 second, call too_fast.
approach 2:
approach 1 is fine and works, but the HT will eventually grow large, consuming a lot of memory, whereas we actually only need the messages of unique user_ids from the last second, which, I assume, is considerably less than the set of users over a longer period (it needn't be, but for most systems it's a good assumption). Therefore I would keep a queue which stores user ids and times, and I would constantly check whether I can remove an element from the queue because it has "expired"; if so, I would check whether I can remove it from the HT, because there may already be a younger message in the HT for the same user...
in C++:
/*
  assumes Messages is an iterable type that reads from a stream.
  Memory management should use smart pointers, which is omitted here.
*/
void analyzeMessagesForShortRepeatedLogin(const Messages& messages)
{
    queue<Message> msgQueue;
    unordered_map<int64_t, Message> msgPerUser;
    for (const auto& msg : messages) {
        auto oldMsgIter = msgPerUser.find(msg.getUserId());
        if (oldMsgIter != msgPerUser.end()) {
            if (msg.getTime() - oldMsgIter->second.getTime() < 1000) {
                too_fast(oldMsgIter->second, msg);
            }
        }
        while (!msgQueue.empty() && msg.getTime() - msgQueue.front().getTime() >= 1000) {
            auto iter = msgPerUser.find(msgQueue.front().getUserId());
            if (iter != msgPerUser.end() && msg.getTime() - iter->second.getTime() >= 1000) {
                msgPerUser.erase(iter);
            }
            msgQueue.pop();
        }
        msgQueue.push(msg); // EDITED, thanks to funk
        msgPerUser[msg.getUserId()] = msg;
    }
}
@dizhu2016: I think you run into numerical issues; there are three constraints:
1) distribute exactly B machines, no more, no less
2) each cluster must have at least one machine
3) when you have one remaining machine, choose the cluster with the highest ceiling of tasks per machine
these conditions are not easily satisfied with a closed formula, e.g.
a[] = {1, 1000}
B = 2
N = 2
d[0] = 1 * 2 // 1001 = 0
d[1] = 1000 * 2 // 1001 = 1
// denotes integer division, i.e. floor(a/b) ("next smaller integer")
Issues with the above:
a) only one machine is distributed
b) even with proper rounding you would get d[0] = 0, d[1] = 2, which violates the condition of at least one machine per cluster
c) you can fix it and distribute the remaining machines after giving each cluster one machine, but whether the intuition behind this is right or wrong is not so straightforward, given constraint 3); see the sketch below
d) besides, there will be numerical issues: with any kind of rounding it's not guaranteed that you distribute exactly B machines; it could well be B+1 or B-1 due to rounding.
it's like distributing 10 candies to 20 people where you can only give 0 or 1 to a person: if you do this with doubles and round, you either distribute 0 candies or 20 candies...
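A minimal sketch of point c) under those constraints (my names; one machine per cluster first, then a largest-remainder distribution of the rest, so exactly B machines are handed out):

def distribute_machines(tasks, B):
    n = len(tasks)
    assert B >= n, "need at least one machine per cluster"
    d = [1] * n  # constraint 2: one machine per cluster
    remaining = B - n
    total = sum(tasks)
    shares = [t * remaining / total for t in tasks]
    d = [base + int(s) for base, s in zip(d, shares)]
    remaining -= sum(int(s) for s in shares)
    # constraint 1: hand out the leftovers by largest fractional remainder
    order = sorted(range(n), key=lambda i: shares[i] - int(shares[i]), reverse=True)
    for i in order[:remaining]:
        d[i] += 1
    return d

print(distribute_machines([1, 1000], 2))  # [1, 1]
print(distribute_machines([1, 1000], 4))  # [1, 3]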
@krupen.ghetiya
the question asks for the *median* of all elements of a dataset where you have random access to the elements. Your answer is for the *running median* problem, where the median can be queried at any point in time over the elements seen so far. Typically, for a running median the data arrives from a stream, which means no random access.
As a result, the two-heaps *running median* is not linear in time; it's O(lg(N)) per element. ...
Distribution is a different beast, too, because it is safe to assume the data won't fit into the memory of a single machine, and the two-heaps solution would need some major tricks/assumptions/modifications to work in most cases
basically you need a tokenizer where ";" is a token. it's probably enough to have chunks of tokens that do not contain ';' and are not string literals ('abc', "efg"); don't forget escapes like "\"" ... and comments like --.*\n and /* ... */
so, a somewhat limited tokenizer will do it. it can be done using a regular expression like (comments not taken into account in the sample below)
(\"([^\"])*\"|\'([^\\\'])*\'|([^\'\";])*)*;
instead of coding it, use one of the many online regex testers, like regex101.com
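A minimal sketch of that idea in Python (my adaptation of the pattern above, without escape handling; comments are still not handled):

import re

# one statement = quoted strings, or anything that is neither a quote nor ';',
# repeated, terminated by ';'
STATEMENT = re.compile(r"(\"[^\"]*\"|'[^']*'|[^'\";])*;")

sql = 'INSERT INTO t VALUES ("a;b"); SELECT * FROM t;'
for m in STATEMENT.finditer(sql):
    print(m.group().strip())
# INSERT INTO t VALUES ("a;b");
# SELECT * FROM t;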
- Chris July 22, 2017
- the common approach is to use the quicksort partition method.
partition picks a guess of a median and places smaller elements on the left side of the array, equal elements in the middle part and greater elements on the right side, and returns the start and end indices of the middle part.
This is applied repeatedly until the array is separated into two equal-sized parts (considering the even-length case). It has the challenge of selecting a good pivot element...
One needs to consider odd- and even-length arrays as well, see the code below.
- how to distribute:
one approach could be to guess the median by just determining the median on one machine. This works if the distribution across machines is uniform and a statistical error is accepted.
- alternatively, one could use the median-of-medians method, where every machine calculates a median and at the end the median of those medians is calculated. This will be a pretty good approximation, but it's still an approximation.
- the first machine guesses a good median, e.g. by using median of medians, then it promotes this median to the other machines; each machine then sends back how many elements are left and right of that element, and this information is used to repeat the process (look left or right).
the problem here is the communication among the machines: the slowest or potentially a dead machine will slow the progress, and machines don't work until the intermediate step has been completed by all others.
- a combination of those steps would be good: every machine calculates its median +/- 10 elements and sends them to a coordinator. The coordinator then determines the median of medians and checks whether it's within the elements provided; if so, it can calculate the perfect median, if not, it comes back with the best approximation...
- what should be considered is that in a real-life application there is probably no point in time where the data won't change (e.g. grow). So, if the perfect median needs to be found and the algorithm is supposed to terminate under all circumstances, some sort of snapshot mechanism is required.
anyway, here is the median based on partition:
def partition(data, begin, end):
    pivot_idx = (begin + end) // 2  # middle; better: randomize or median of medians
    pivot = data[pivot_idx]
    data[end], data[pivot_idx] = data[pivot_idx], data[end]  # move pivot to the end
    piv_begin = 0  # [begin..piv_begin) < pivot
    piv_end = 0    # [piv_begin..piv_end) = pivot
    idx = 0        # [piv_end..idx) > pivot
    while idx <= end:
        if data[idx] < pivot:
            data[piv_end], data[idx] = data[idx], data[piv_end]  # swap
            data[piv_begin], data[piv_end] = data[piv_end], data[piv_begin]  # swap
            piv_end += 1
            piv_begin += 1
        elif data[idx] == pivot:
            data[piv_end], data[idx] = data[idx], data[piv_end]  # swap
            piv_end += 1
        idx += 1
    return (piv_begin, piv_end - 1)

def median(data):
    n = len(data)
    if n == 0: return None
    if n == 1: return data[0]
    begin = 0
    end = n - 1
    while begin < end:
        m_beg, m_end = partition(data, begin, end)
        if m_beg > (n - 1) // 2:
            end = m_beg
        elif m_end < (n - 1) // 2:
            begin = m_end
        else:
            begin = (n - 1) // 2
            end = (n - 1) // 2
    if n % 2 == 1: return data[begin]  # odd case, one element
    return (data[begin] + min(data[begin + 1:])) / 2  # even case: average of the two middle elements

print(median([1, 2, 3]))           # 2
print(median([3, 2, 1, 4]))        # 2.5
print(median([2, 1]))              # 1.5
print(median([3]))                 # 3
print(median([9, 9, 9, 1, 1, 1]))  # 5.0
@hprem991:
- Chris August 14, 2017
"each node has a function called updateOnWrite()": that means when I write, each node gets called. I write n times, so I call each node n times... code it, believe me, what you are talking about doesn't exist. I know these patterns quite well ;-)