Chris
A new species has been discovered and we are trying to understand their form of communication. So, we start by identifying their alphabet. Tough, but Mr. Brown managed to steal a vocabulary that has its words in ascending order, and a team has already identified the letters. So we now have the ordered words as sequences of numbers; print every letter once, in ascending order.
- Chris in United States
[3,2]
[4,3,2]
[4,1,1]
[1, 2]
[1, 1]
alphabet: [3, 4, 2, 1]
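A minimal sketch of one way to derive such an alphabet (my assumption: compare adjacent words; the first differing letter yields a precedence edge, then sort topologically; all names are mine):

from collections import defaultdict

def alien_alphabet(words):
    letters = {l for w in words for l in w}
    succ = defaultdict(set)  # letter -> letters known to come after it
    pred = defaultdict(int)  # letter -> number of known predecessors
    for w1, w2 in zip(words, words[1:]):
        for a, b in zip(w1, w2):
            if a != b:  # the first differing letter decides the order
                if b not in succ[a]:
                    succ[a].add(b)
                    pred[b] += 1
                break
    # Kahn's topological sort
    result = sorted(l for l in letters if pred[l] == 0)
    for l in result:
        for m in succ[l]:
            pred[m] -= 1
            if pred[m] == 0:
                result.append(m)
    return result

print(alien_alphabet([[3, 2], [4, 3, 2], [4, 1, 1], [1, 2], [1, 1]]))
# one valid order, e.g. [2, 3, 4, 1]; the sample answer [3, 4, 2, 1] is another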
@hprem991: I understand; maybe you could code that O(n) solution that uses only one pass to create the sum and distribute it to all nodes, assuming we had a minimum spanning tree. Or just with an array: calculate the sum of the array and set each element to that value. Are you sure you aren't creating an O(n^2) solution that just looks like O(n) on the surface?
- Chris August 14, 2017
I assume only positive values > 0.
I fully agree with your conclusion: it means you need to visit each node to build the sum in one node, and then visit each node again to copy that value to all other nodes.
I guess the question aims at which path you should take to achieve 2*(number of nodes) time units. I think you are looking at a minimum spanning tree to calculate and distribute the sum (I think there are two algorithms one should really know here, but I'd actually need to look them up as well).
Another point of view: to calculate the sum, you need to see each value; there is no way to build a sum over elements without having seen every element's value at least once. Then, since you need to copy it to all other nodes, there is no way around visiting each node again. So 2*n is the best conceivable; I don't think you can do better either.
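A minimal sketch of the array variant mentioned above (my wording of it; two passes over the data, so 2*n visits):

def distribute_sum(values):
    # pass 1: visit every element once to build the sum
    total = 0
    for v in values:
        total += v
    # pass 2: visit every element again to copy the sum into it
    for i in range(len(values)):
        values[i] = total
    return values

print(distribute_sum([1, 2, 3]))  # [6, 6, 6]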
@NoOne:
this article's essence is: "to find out if two people are blood-related, only consider parent relationships, because then your graph grows only by 2^generations, whereas if you also navigated the children, it would grow by (2+#children)^generations ..."
when I solved the problem, I drew the graph on paper and it was clear that only the parents are relevant, so I didn't even mention it ;-) nor did I have child relationships in the graph... of course, navigate the parents only... because it limits the growth of the frontier
I think one could even extend the algorithm by considering the birth date and only growing the youngest nodes upwards (using a priority queue instead of a queue), so the frontier covers roughly the same birth years, assuming the children of our common ancestor are not too far apart (between 2 and 20 years; extremes may have 60 years, but only with common fathers ... :-)) ...
cool, thanks for inspiring me to think that way, not so much for the article's content ;-)
is the expression an AST? meaning for example
  *
 / \
a   +
   / \
  1   2
was originally a * (1+2)
is the order of the expressions given in such a way that a solution exists without the need to reorder, e.g.
x = 2
y = 2
return x * y
vs.
return x * y
x = 2
y = 3
if so, then I would suggest treating assignment expressions in a way that the left node provides the variable name and the right node is the expression being evaluated recursively (of course this can and should be done in a more OO way)
variables[assignment.left] = evaluate(assignment.right, variables)
def evaluate(root, variables):
    if isinstance(root, BinaryOp):
        if root.operator == '+':
            return evaluate(root.left, variables) + evaluate(root.right, variables)
        elif root.operator == '*':
            return evaluate(root.left, variables) * evaluate(root.right, variables)
    elif isinstance(root, VariableRef):
        return variables[root.variableName]
    elif isinstance(root, Number):
        return root.value
    raise ValueError("unsupported expression node")
and a return expression as
return evaluate(return_statement.expression, variables)
of course things like types and other semantic rules need to be defined...
- Chris August 13, 2017
I don't understand how the game works:
- is k a given number, e.g. one can take 100-k stones; does the first who exceeds this range lose - or does the last who stays within the range win?
- or is it that the person who can't take the last k <= 15 stones wins?
- or something else?
edited (mixed up terms)
I would try to approach the question in a structured way: where do these queries come from, what do we expect to be asked tomorrow, so we don't have to change the structure all the time. Anyway, for the exact question:
page(s) visited by 100 users:
- I would store, for each number of unique users per day, the pages
- as a B-tree with the #unique users as key and the page id and number of total visits as value (I would lose which user it was with this aggregation), but I'd have fast statistics for a page ranking based on user visits.
- to ask which page was visited by exactly 100 users, I'd have an O(lg(m)+k) query, where m is the number of distinct page frequencies per day and k the number of pages with exactly 100 users per day. Instead of a B-tree I could use a hash table with the frequency as index, so it would be O(k); however, I'd expect queries like > 100, > 1000 or > 100 and < 200, etc...
pages visited by 1 user 15 times:
- same as above: I would get all pages visited by 1 unique user and ask which one had exactly 15 visits in total. If that's not good enough, because there are many pages with 1 unique user a day, I would use a different index on the B-tree of the first query.
that index would consist of two integers, the #unique users and the #total page visits (ordered primarily on unique users, secondarily on total page visits). So I could ask both questions on the same index with logarithmic upper bounds (depending on what I want exactly, e.g. one result or multiple results)
page visited by u3 more than 20 times (may be more than one... the first? all > 20? any, if one exists?):
- I would assume the vector describing #visits per page for a single user and day is sparse.
- hash table with the user id as key and a vector as value
- the vector contains tuples (page-id, #visits)
- to find: O(1) to get to the vector and a linear scan on the vector. If I sort the vector, I'd have O(lg(n)) to find the min(>20), O(1) to find any or none, or O(n) to find all > 20
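A minimal sketch of that last index (names are mine; a per-user sparse vector kept sorted by visit count):

from bisect import bisect_right

# hypothetical index: user_id -> (sorted visit counts, page ids in the same order)
visits_by_user = {
    'u3': ([3, 18, 25, 40], ['p9', 'p2', 'p7', 'p1']),
}

def pages_visited_more_than(user_id, threshold):
    counts, pages = visits_by_user.get(user_id, ([], []))
    i = bisect_right(counts, threshold)  # first index with count > threshold
    return pages[i:]

print(pages_visited_more_than('u3', 20))  # ['p7', 'p1']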
is the file on disk? then the bottleneck is disk I/O
is the file in memory? then the bottleneck is memory bandwidth
is the file on multiple machines? good, we can work in parallel: assign blocks of x GB to m machines and make sure the blocks overlap, so at least one machine sees the whole pattern ...
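A minimal sketch of the overlapping-blocks idea (a hypothetical helper; blocks overlap by pattern_len - 1 bytes, so a match straddling a block boundary is fully contained in at least one block):

def chunk_ranges(file_size, chunk_size, pattern_len):
    # split [0, file_size) into blocks for the workers
    ranges = []
    start = 0
    while start < file_size:
        end = min(start + chunk_size + pattern_len - 1, file_size)
        ranges.append((start, end))
        start += chunk_size
    return ranges

print(chunk_ranges(10, 4, 3))  # [(0, 6), (4, 10), (8, 10)]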
@Vijay.B
thanks, that's correct, I didn't check for already used distances. This should work now (code is ugly)
from math import sqrt
from collections import defaultdict

def zero_if_none(a):
    return 0 if a is None else a

def find_gas_station_pos(dist):
    def aux(i, j, c):
        # check if a gas station at pos x would work
        def verify(x):
            for r in [range(0, i), range(j + 1, station_count)]:
                for t in r:
                    if zero_if_none(distances.get(abs(x - pos[t]))) <= 0:
                        return False
            return True

        # add / remove distances based on delta
        def add(x, delta):
            for r in [range(0, i), range(j + 1, station_count)]:
                for t in r:
                    distances[abs(x - pos[t])] += delta

        # base case
        if i > j: return True

        # try from right
        x = pos[-1] - dist[-c]
        if zero_if_none(distances.get(x)) > 0:
            if verify(x):
                add(x, -1)
                pos[i] = x
                if aux(i + 1, j, c + 1): return True  # found solution
                add(x, 1)  # backtrack
        else:
            if aux(i, j, c + 1): return True

        # try from left
        x = dist[-c]
        if zero_if_none(distances.get(x)) > 0:
            if verify(x):
                add(x, -1)
                pos[j] = x
                if aux(i, j - 1, c + 1): return True  # found solution
                add(x, 1)  # backtrack
            return False  # backtrack
        else:
            return aux(i, j, c + 1)

    station_count = int((sqrt(8 * len(dist) + 1) + 1) / 2)  # len(dist) = (station_count^2 - station_count) / 2
    pos = [0 for i in range(station_count)]
    distances = defaultdict(int)
    dist.sort()
    for i in range(0, len(dist) - 1):
        distances[dist[i]] += 1
    if station_count <= 1: return pos
    pos[-1] = dist[-1]
    if aux(1, station_count - 2, 2):
        return pos
    else:
        return []  # no solution found
Solution:
1) Brute force approach:
- store all students' positions and, when adding one, choose the two with the greatest distance between each other and place the new student in the middle
- if students are kept in sorted order in an array, it's O(n) time complexity
- accordingly, removal has time complexity O(n)
- space complexity is O(n)
it requires special care for the first two, seating the first at position 0 and the second at position n-1
2) Tree as ordered set for distances:
- have a tree that orders the students according to the free distance on their right (if equal, by position of the student)
- have a hash table that translates from student id to tree node
a) add:
- deal with the special case of fewer than two students (see below)
- pick the student with the largest distance on their right and insert the new student in between
- update the nodes accordingly
- time complexity: O(lg(n))
b) remove:
- find the right node (O(1))
- patch the nodes to the left and right
- update the nodes (O(lg(n)))
- total time complexity: O(lg(n))
Since we are talking about a row of students, I assume 20 or 30 people is the upper bound for that row. In this case the O(lg(n)) solution is outperformed by the O(n) solution, due to cache effects and lower constants.
Anyway, the O(lg(n)) solution is more fun to code:
- to solve the issue with the empty row, I place two sentinel nodes at positions -1 and n; the left one's left pointer links back to itself, the right one's right pointer links back to itself
- I provide a distance function (getDist), which looks like right.pos - this.pos and, thanks to the sentinels, always works
- I create a compare function that uses this distance function and on equality prefers the lower position (as the problem states)
- I have an add function which adds a tree node and an entry for it in the index (hash table)
- I have an update function which reorders the tree when a node's key changes (with the STL, I remove and re-insert)
- I have a Student structure with pos_, left_, right_ and id_ attributes (id_ is informative)
- students_ is the ordered set, ordered according to the description above
addStudent looks like:
int addStudent(int studentId)
{
    Student* left = *students_.begin(); // the one with the largest space on the right
    Student* right = left->right_;
    if (right->pos_ - left->pos_ == 1) return -1; // no space
    Student* newStudent = new Student(studentId, -1, left, right); // pos = -1
    if (left->left_ == left) newStudent->pos_ = 0; // right of left sentinel
    else if (right->right_ == right) newStudent->pos_ = right->pos_ - 1; // left of right sentinel
    else newStudent->pos_ = (right->pos_ + left->pos_) / 2; // between two students
    left->right_ = newStudent;
    right->left_ = newStudent;
    update(left); // update left node in tree, right node's distance and pos didn't change
    add(newStudent); // add new node to tree and index
    return newStudent->pos_;
}
removeStudent:
- studentsIdx_ is the hashtable
void removeStudent(int studentId)
{
    auto itRemove = studentsIdx_.find(studentId);
    Student* remove = *itRemove->second; // student to remove
    Student* left = remove->left_; // left of student to remove
    Student* right = remove->right_; // right of student to remove
    left->right_ = right;
    right->left_ = left;
    update(left); // left node's distance changed, right node's distance and pos didn't
    students_.erase(remove); // remove pointer from set
    studentsIdx_.erase(itRemove); // remove from hash table
    delete remove; // free memory
}
the full code (C++ 11)
#include <cassert>
#include <iostream>
#include <set>
#include <unordered_map>

using namespace std;

struct Student
{
    int id_;         // the id of the student
    int pos_;        // the position of the student
    Student* left_;  // student on the left
    Student* right_; // student on the right

    int getDist() const { return right_->pos_ - pos_; }

    Student(int id, int pos, Student* left, Student* right)
        : id_(id), pos_(pos), left_(left == nullptr ? this : left),
          right_(right == nullptr ? this : right) {
    }
};

struct StudentOrderComp
{
    bool operator() (const Student* left, const Student* right) {
        if (left->getDist() > right->getDist()) return true; // first the greater distances
        if (left->getDist() < right->getDist()) return false;
        return left->pos_ < right->pos_; // if equal distance, first the lower positions
    }
};

class ClassRoom
{
private:
    int seatCount_; // number of seats
    set<Student*, StudentOrderComp> students_; // ordered set with students
    unordered_map<int, decltype(students_)::const_iterator> studentsIdx_; // student id -> Student

public:
    ClassRoom(int rowLength)
    {
        seatCount_ = rowLength;
        Student *leftSentinel = new Student(-1, -1, nullptr, nullptr);
        Student *rightSentinel = new Student(-2, seatCount_, leftSentinel, nullptr);
        leftSentinel->right_ = rightSentinel;
        add(leftSentinel);
        add(rightSentinel);
    }

    ~ClassRoom()
    {
        for (Student* s : students_) delete s; // clean up
    }

    int addStudent(int studentId)
    {
        assert(studentId >= 0); // ids -1 and -2 are reserved for the left and right sentinels
        Student* left = *students_.begin(); // the one with the largest space on the right
        Student* right = left->right_; // the student on the right of left
        if (right->pos_ - left->pos_ == 1) return -1; // no space
        Student* newStudent = new Student(studentId, -1, left, right);
        if (left->left_ == left) newStudent->pos_ = 0; // left sentinel
        else if (right->right_ == right) newStudent->pos_ = right->pos_ - 1; // right sentinel
        else newStudent->pos_ = (right->pos_ + left->pos_) / 2; // in between two students
        left->right_ = newStudent; // patch existing left's right pointer
        right->left_ = newStudent; // patch existing right's left pointer
        update(left); // update left node in tree, right node's distance and pos didn't change
        add(newStudent); // add new node to tree and index
        return newStudent->pos_;
    }

    void removeStudent(int studentId)
    {
        assert(studentId >= 0); // ids -1 and -2 are reserved for the left and right sentinels
        auto itRemove = studentsIdx_.find(studentId);
        assert(itRemove != studentsIdx_.end()); // must exist
        Student* remove = *itRemove->second; // student to remove
        Student* left = remove->left_; // left of student to remove
        Student* right = remove->right_; // right of student to remove
        left->right_ = right; // patch left's right
        right->left_ = left; // patch right's left
        update(left); // left node's distance changed, right node's distance and pos didn't
        students_.erase(remove); // remove pointer from set
        studentsIdx_.erase(itRemove); // remove from index
        delete remove; // free memory
    }

private:
    // insert into index and ordered set
    void add(Student* student)
    {
        studentsIdx_[student->id_] = students_.insert(student).first;
    }

    // used to update the tree node in the set (remove and re-insert)
    void update(Student* student)
    {
        auto it = studentsIdx_.find(student->id_);
        assert(it != studentsIdx_.end()); // must exist
        students_.erase(it->second); // remove it
        add(student); // add it again
    }
};

int main()
{
    ClassRoom r(10);
    cout << "add(1): " << r.addStudent(1) << endl; // 0
    cout << "add(2): " << r.addStudent(2) << endl; // 9
    cout << "add(3): " << r.addStudent(3) << endl; // 4
    cout << "add(4): " << r.addStudent(4) << endl; // 6
    cout << "remove(2)" << endl;
    r.removeStudent(2);
    cout << "add(5): " << r.addStudent(5) << endl; // 2 (it's the first 3-space)
    cout << "add(6): " << r.addStudent(6) << endl; // 9 (now only 3 spaces)
    cout << "remove(1)" << endl; // creates a 2-space
    r.removeStudent(1);
    cout << "add(7): " << r.addStudent(7) << endl; // 0 (first 2-space)
    cout << "add(8): " << r.addStudent(8) << endl; // 7
    cout << "add(9): " << r.addStudent(9) << endl; // 1 (first 1-space)
    cout << "add(10): " << r.addStudent(10) << endl; // 3
    cout << "add(11): " << r.addStudent(11) << endl; // 5
    cout << "add(12): " << r.addStudent(12) << endl; // 8 (last empty space)
    cout << "add(13): " << r.addStudent(13) << endl; // -1, full
    cout << "remove(3)" << endl;
    r.removeStudent(3);
    cout << "add(14): " << r.addStudent(14) << endl; // 4 (only empty space)
}
Find if two people in a family tree are blood-related.
Solution:
- do a BFS from the two sources, using one queue for both
- limit the search depth to max_generations to prevent exploring too many people
the last point is a common problem with social network graphs and k'th degree relationships.
the frontier of the first generation is 2^1, of the 2nd generation 2^2; if we theoretically explored 20 generations, we would visit about a million people.
from collections import deque

def blood_related(a, b, max_generations=5):
    generation = 0
    queue = deque([(a, 0), (b, 1), (None, None)])
    visited = [set(), set()]
    while queue:
        cur, src = queue.popleft()
        if cur is None:  # use None to indicate the next generation in the queue
            generation += 1
            if not queue or generation > max_generations:
                break
            queue.append((None, None))
        elif cur in visited[src ^ 1]:
            return True
        else:
            visited[src].add(cur)
            for parent in cur.parents:
                if parent is not None and parent not in visited[src]:
                    queue.append((parent, src))
    return False

class Person:
    def __init__(self, name, mother=None, father=None):
        self.parents = [mother, father]
        self.name = name

g = Person('g', Person('h'), Person('i'))
f = Person('f')
d = Person('d')
e = Person('e')
c = Person('c', e, d)
a = Person('a', g, c)
b = Person('b', e, f)
print(blood_related(a, b))  # True, common ancestor e
print(blood_related(a, g))  # True, g is a parent of a
print(blood_related(b, g))  # False
or, looking at the samples, it prints the running path sums (prefix sums of the powers), which then is:
s[0] = 0, s[i] = s[i-1] + 2^k for some k >= 0 with non-decreasing exponents, and the last element equals N
def sum_paths(N):
    def aux(path, sum, j):
        if sum == N:
            res.append(path)
        if sum < N:
            for i in range(j, int.bit_length(N)):
                aux(path + [sum + (1 << i)], sum + (1 << i), i)
    res = []
    aux([0], 0, 0)
    return res
I assume the question is:
print sequences s where sum(s[i]) = N and s[i-1] <= s[i], s[0] = 0, and each s[i] is a power of two (s[i] = 2^k for some k >= 0)
that would be:
def sum_paths(N):
    def aux(path, sum, j):
        if sum == N:
            res.append(path)
        elif sum < N:
            for i in range(j, int.bit_length(N)):
                aux(path + [1 << i], sum + (1 << i), i)
    res = []
    aux([0], 0, 0)
    return res
@Anwar: doesn't seem to work in all cases.
input [1, 9, 10, 10, 11, 20] returns [0, 1, 10, 20], which would create distances of [1, 9, 10, 10, 19, 20] (similar, but not the same)
[0, 10, 11, 20] or [0, 9, 10, 20] would be valid solutions
I checked because I didn't really understand the removal of duplicates, and
result[1] = allPairDist[0]
can't work, because the first isn't necessarily the nearest point (as the example above shows, distance 1 is between 10 and 11, somewhere in the middle)
anyway, I'd like to explore your idea more, it seems interesting; maybe you could explain the ideas
looks like a tough one...
Clarification / Definition:
- the positions of the gas stations are one dimensional (x-position), further denoted as pos[i], where i is the gas station, 0 <= i < n. I assume pos[i-1] < pos[i], which will later result from the solution
- the given distances dist[j] are in random order; the only thing known is that all distances between all pairs are in this vector and the distances are integers with no error
- dist[i] != 0 for all i
Observations:
- the longest distance is pos[n-1] - pos[0]
- then, for each remaining point, its longest remaining distance is to either pos[0] or pos[n-1]
Algorithm (with backtracking):
- sort the distance array
- put all distances into a hash table with the distance as key and the #occurrences as value
- start with:
pos[0] = 0
pos[n-1] = longest distance
i = 1, j = n-2, c = 2
- recursion(i, j, c):
if i > j: return true
d = the c-th longest distance
assume pos[i] = pos[n-1] - d ("from the right") and verify whether the required distances are still available; if yes, recurse on (i+1, j, c+1); if the recursion returns true, return true, otherwise try the 2nd option:
assume pos[j] = d ("from the left") and verify whether the distances to the already placed gas stations exist; if yes, recurse on (i, j-1, c+1), if not, return false
runtime analysis: that's tough. The quick answer is O(2^n * n), because at each recursion level there are two choices and verifying a choice takes about n steps on average. However, one can't easily construct a sample that actually explores that many paths.
I assume a far more elegant solution exists
from math import sqrt
from collections import defaultdict

def find_gas_station_pos(dist):
    def aux(i, j, c):
        # check if a gas station at pos x would work
        def verify(x):
            for r in [range(0, i), range(j + 1, station_count)]:
                for t in r:
                    count = distances.get(abs(x - pos[t]))
                    if count is None or count <= 0:
                        return False
            return True

        # add / remove distances based on delta
        def add(x, delta):
            for r in [range(0, i), range(j + 1, station_count)]:
                for t in r:
                    distances[abs(x - pos[t])] += delta

        # base case
        if i > j: return True

        # try from right
        x = pos[-1] - dist[-c]
        if verify(x):
            add(x, -1)
            pos[i] = x
            if aux(i + 1, j, c + 1): return True  # found solution
            add(x, 1)  # backtrack

        # try from left
        x = dist[-c]
        if verify(x):
            add(x, -1)
            pos[j] = x
            if aux(i, j - 1, c + 1): return True  # found solution
            add(x, 1)  # backtrack
        return False  # backtrack

    station_count = int((sqrt(8 * len(dist) + 1) + 1) / 2)  # len(dist) = (station_count^2 - station_count) / 2
    pos = [0 for i in range(station_count)]
    distances = defaultdict(int)
    dist.sort()
    for i in range(0, len(dist) - 1):
        distances[dist[i]] += 1
    if station_count <= 1: return pos
    pos[-1] = dist[-1]
    if aux(1, station_count - 2, 2):
        return pos
    else:
        return []  # no solution found
print(find_gas_station_pos([10, 20, 70, 80, 30, 20, 100, 70, 50, 90]))
@sheva: it's correct; assume the areas are all equal to one
- probability to pick the 1st is 1
- next iteration, probability to pick the 2nd is 0.5, so 0.5 for the 1st and 0.5 for the 2nd; 1st and 2nd have the same probability (uniform)
- next iteration, probability to pick the 3rd is 1/3, so 1st and 2nd together have 2/3 and the 3rd has 1/3, still uniform
...
it's an elegant algorithm for such cases; a similar one is used when shuffling a deck of cards
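A minimal sketch of that pick-the-i-th-with-probability-1/i scheme (reservoir sampling for a single element; names are mine):

import random

def pick_uniform(items):
    chosen = None
    for i, item in enumerate(items, start=1):
        # replace the current choice with probability 1/i; after step i,
        # every item seen so far has been kept with probability 1/i
        if random.randrange(i) == 0:
            chosen = item
    return chosen

print(pick_uniform(['a', 'b', 'c']))  # each with probability 1/3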
I think it's important to define the problem first:
- what does slow mean? high latency? for which percentile of requests? all, or a few, e.g. 5%? where is it measured? (it could be that it's not the server but too much traffic, so the server receives the request already late and delivery to the client takes too long, ...) best is to draw a picture with the server(s) and the infrastructure around them, up to the point of measurement... worst case, it's the end customer telling you this with no additional information.
- are there any service level objectives (SLOs) defined one can compare against?
- is there historical data? did the latency increase slowly?
- when did it change? was it due to a new release, or were there no modifications?
- what diagnostics are available?
Potential issues are:
- the server itself is overloaded (e.g. too much disk usage)
- the server contacts other services and waits for them (e.g. a slow service in the back end)
- the CPU is overloaded because some CPU-bound processes eat away CPU power
- too many services use too much memory, and the system keeps swapping in and out
- the data grew, and the current sharding scheme may not be effective anymore
- an OS update, a new network stack or driver?
- etc. etc.
I think a systematic approach could be:
- verify / assume objectives
- narrow down the service(s) causing the delay (DB cluster, app server cluster, which services specifically)
start at the front-end service: what's the delay between receiving a request and serving it? does this include the delay causing problems? if yes, go down: check the call graphs to other services; is there any call causing the delay? if yes, move to that server and repeat there
- look into the server(s) to see what is causing the delay (disk, CPU, memory (e.g. page faults))
- verify historical data if available (sudden change vs. continuous process)
Assumption:
- 0 <= e < n for all e in ex (this should be mentioned and not just assumed)
Alternative approach:
- while all other approaches are perfectly valid, I think the non-deterministic version is worth mentioning:
- generate a number in [0..n) and check whether it's in the excluded list
- if not, great; if it is, repeat the step above
public int random(int n, Set<Integer> ex) {
    if (n <= ex.size()) return -1; // otherwise no valid number exists
    Random rnd = new Random();
    int num;
    do {
        num = rnd.nextInt(n);
    } while (ex.contains(num));
    return num;
}
This is very good on average when n is significantly larger than the excluded list. It would be an infinite loop if the excluded list size were n (hence the check above).
With only one number not in the excluded list, it takes about n iterations on average (a geometric distribution); e.g. with n = 2 and len(ex) = 1 you have P(at least one valid num after 2 trials) = 75%. However, there is theoretically an infinitely long tail, which is the problem with this approach. Maybe do the math or write a Monte Carlo simulation; it's interesting.
Another alternative is to combine the two approaches in one loop, and whichever leads to a result first wins. This is arguably the best approach, because it removes the non-determinism in the worst case and keeps the very good average time behavior.
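A minimal sketch of that combination (my interpretation: bound the number of random trials and fall back to a deterministic pick, so the worst case stays O(n) while the average stays fast):

import random

def random_excluding(n, ex):
    if n <= len(ex):
        return -1
    for _ in range(32):  # bounded number of random trials
        num = random.randrange(n)
        if num not in ex:
            return num
    # deterministic fallback: pick uniformly among the valid numbers
    return random.choice([x for x in range(n) if x not in ex])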
@ali.kheam:
my thoughts:
- there is no method that sorts in O(k), k being the fixed cache size, except variations of radix sort (if the values have a limited range)
- if k is e.g. 3, a hash table is overkill; maybe that was the intention
- if the ranking function has a low range (e.g. 0-254) you could construct a radix-sort-based DS with nested hash tables etc. It's possible, but it's not very nice... that would have O(1) on all operations. So maybe one should have asked: what is the range of the ranking function, what is the size of the cache, is it constant as well, etc...
@koustav.adorable: I know, it's the geeksforgeeks solution. It runs in O(n*m) best and worst case. Its strategy is difficult to understand without a description, if the "histogram" idea doesn't jump right out at you ;-)
If it's about finding a single rectangle, solutions with a significantly better average runtime are thinkable that keep the same worst-case behavior (depending on the rectangle dimensions). Sometimes you would approach a problem differently if you didn't know about the "well known problems" ... I think that's the point of these interviews: how does one solve a problem that is new.
@NoOne: I don't understand your last statement; what do you need to specify about the ranking function's properties? It's a ranking function: it returns a rank value, and we are told that the rank value of a single item changes on get (not all of them, no complete re-shuffling). We can assume it's an integer or whatever; in the end what we need is the top k of the rank values and the associated items in the cache (the discussion about equal rank values is a detail).
- Chris July 30, 2017
depends what you mean exactly by "merge", e.g.
a: [1..7]
b: [5..10]
c: [9..13]
and you want [1..13] as a merged interval of a,b,c
you might want to sort by start times if you want to merge starting from the start. You walk through the sorted intervals, pick the first, and while the next one overlaps you take the maximum of the ends, until no further overlap with the maxed end occurs... (a sketch follows below)
If you like thinking backwards, just do the same from the end with the minimized start. It's symmetric... so it seems to matter neither for the result nor for the runtime.
If you want to work a bit more with your intervals, it may make sense to put them into an interval tree; you may google it.
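A minimal sketch of the sort-by-start merge described above:

def merge_intervals(intervals):
    intervals = sorted(intervals)  # sort by start time
    merged = []
    for start, end in intervals:
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)  # overlap: max the end
        else:
            merged.append([start, end])  # no overlap: start a new interval
    return merged

print(merge_intervals([(1, 7), (5, 10), (9, 13)]))  # [[1, 13]]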
@koustav: [1,2,0,2,2,4] returns 3; it should return 4, as 4,2,0,1. I think Fernando's has a similar issue
Assumptions:
- S is a sequence with unique values, because it has an order (or an ordered set)
- the set can be empty, which is only the case if A[i] >= n or A[i] < 0 for all A[i], 0 <= i < n
- I therefore assume the values within A have no constraint other than being integers
[- a neat simplification would be to state 0 <= A[i] < n for all A[i], 0 <= i < n, which would make the thing easier to code.]
Solution:
- insight comes from drawing it as a graph. I think the key is to notice it's a graph problem; that makes it easier to draw and argue.
- the first approach is to start from each node until closing a cycle or until encountering a node with a value < 0 or >= n. This is inefficient, as it recomputes common subproblems. It's O(n^2) worst-case time, e.g. for the array [n-1, 0, 1, 2, ..., n-2]
- the better approach is to start with node i and keep an array that stores, per node, the length of the already discovered path one can visit starting from that node. That is, how long the path is until it closes a cycle or encounters a node with a value beyond the array range.
this has amortized O(n) time complexity and O(n) space complexity
I get the impression there is a more elegant approach, but I couldn't find it in 10 minutes of thinking.
#include <vector>
#include <algorithm>
#include <iostream>

using namespace std;

int max_component_size(const vector<int>& input) {
    size_t n = input.size();
    vector<int> component_size(n, 0);
    int max_size = 0;
    for (size_t i = 0; i < n; ++i) {
        if (component_size[i] == 0) {
            // 1. discover until
            //    - a cycle
            //    - an already found component
            //    - a value "pointing" beyond the array range
            size_t j = i;
            int disc_size = 0;
            while (component_size[j] == 0) {
                component_size[j] = -1;
                j = static_cast<size_t>(input[j]); // negative input values will create values > n
                if (j < n) disc_size++;
                else break;
            }
            if (j >= n || component_size[j] != -1) {
                // it's a path entering an already discovered
                // path, or it's a path to nowhere (j >= n)
                if (j < n) disc_size += component_size[j];
                max_size = max(disc_size, max_size);
                // backtrack and store the discovered component size
                j = i;
                while (j < n && component_size[j] == -1) {
                    component_size[j] = disc_size;
                    j = input[j];
                    disc_size--;
                }
            } else {
                // it's a cycle
                max_size = max(disc_size, max_size);
                while (component_size[j] == -1) {
                    component_size[j] = disc_size;
                    j = input[j];
                }
            }
        }
    }
    return max_size;
}

int main()
{
    cout << max_component_size({ 5, 4, 0, 3, 1, 6, 2 }) << endl; // should be 4: sample given
    cout << max_component_size({ 1, 2, 0, 2, 2, 4 }) << endl; // should be 5: 5,4,2,1,0
    cout << max_component_size({ 5, 0, 1, 2, 3, 4 }) << endl; // should be 6: e.g. 0,5,4,3,2,1
    cout << max_component_size({ 1, 2, 3, 4, 5, 0 }) << endl; // should be 6: e.g. 0,1,2,3,4,5
    cout << max_component_size({ -1, 9, 2 }) << endl; // should be 1: 9
    cout << max_component_size({ -1, 9, -2 }) << endl; // should be 0
    cout << max_component_size({ 1, 0, -1, 2, 3, -3, 5, 8, 4 }) << endl; // 4: 8,4,3,2
}
@NoOne: LRU is significantly easier, because you needn't maintain a sorted order. Here you have a ranking function that gives you an arbitrary order. So you can't do it with a hash table pointing to linked-list elements, unless you accept O(n) time to change the position of the element that gets hit and receives a new arbitrary rank value.
With LRU, the new rank value would always be the biggest, so you can maintain the order without comparing anything...
It's similar, but not the same. The question is therefore actually a bit less annoying ;-)
Assumptions:
- coordinates are integers
- no overflow can occur due to the range of values
Implementation:
- for each pair of points a,b find a' and b' such that a,b,b',a' forms a square. a' and b' are determined by a and b (two options: a',b' or a'',b''):
dx = ax - bx
dy = ay - by
a'x = ax - dy, b'x = bx - dy
a'y = ay + dx, b'y = by + dx
a''x = ax + dy, b''x = bx + dy
a''y = ay - dx, b''y = by - dx
find a' and b', or a'' and b'', in the array
the find can be done in O(1) by using a hash table
- overall time complexity O(n^2), and O(n) space
if working with floating point this won't work as straightforwardly, but you can actually do almost the same: instead of using the hash table to find the point pairs, use the hash table as a spatial store dividing the space into small squares; then you put each point into the 8 surrounding squares and the one it lies on (etc. etc.), find the 9 squares around a point, and check against all points found within these squares.
If you divide space finely enough, you might even get away with just checking whether there is a point in the square... but get the error discussion right.
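A minimal sketch of such a spatial hash (my names; points are bucketed by grid cell once, and a lookup collects the 9 cells around a query point):

from collections import defaultdict

def build_spatial_hash(points, cell):
    grid = defaultdict(list)
    for (x, y) in points:
        grid[(int(x // cell), int(y // cell))].append((x, y))
    return grid

def candidates_near(grid, x, y, cell):
    # collect the points in the 9 cells around (x, y); the exact
    # epsilon comparison still has to run against these candidates
    cx, cy = int(x // cell), int(y // cell)
    out = []
    for dx in (-1, 0, 1):
        for dy in (-1, 0, 1):
            out.extend(grid.get((cx + dx, cy + dy), []))
    return out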
#include <iostream>
#include <unordered_set>
#include <vector>
#include <functional>

using namespace std;

bool find_square_in_table(const vector<pair<int, int>>& points) {
    // hash function for a set of coordinates
    auto pair_hash = [](const pair<int, int>& p) {
        return hash<decltype(p.first)>()(p.first) * 9315 + hash<decltype(p.second)>()(p.second);
    };
    // build a hash table with all coordinates
    unordered_set<pair<int, int>, decltype(pair_hash)> ht(10, pair_hash);
    for (const auto& point : points) {
        ht.emplace(point);
    }
    // build pairs and look for points forming a square
    for (size_t i = 0; i + 1 < points.size(); ++i) {
        for (size_t j = i + 1; j < points.size(); ++j) {
            int dx = points[i].first - points[j].first;
            int dy = points[i].second - points[j].second;
            size_t ij[] = { i, j };
            for (int s = -1; s < 2; s += 2) {
                bool found = true;
                for (size_t idx : ij) {
                    pair<int, int> ptf = pair<int, int>(points[idx].first + s * dy, points[idx].second - s * dx);
                    found &= ht.find(ptf) != ht.end();
                }
                if (found) {
                    return true;
                }
            }
        }
    }
    return false;
}

int main()
{
    cout << find_square_in_table({ { 5, 5 }, { 10, 5 }, { 10, 10 }, { 5, 10 } }) << endl; // true
    cout << find_square_in_table({ { 5, 5 }, { 10, 5 }, { 10, 10 }, { 5, 11 } }) << endl; // false
    cout << find_square_in_table({ { 14, 5 }, { 2, 9 }, { 7, 3 }, { 12, 12 }, { 5, 10 }, { 3, 9 } }) << endl; // true: (7,3),(5,10),(12,12),(14,5)
    return 0;
}
since the problem was changed to no parent pointers, here is the described approach without parent pointers
# solution without parent pointer
from collections import deque

# searches the tree from the root to the node in O(n) time
# and returns the path to the node as a dictionary (key=node, value=parent)
def find_path_to_node(root, node):
    path = {}
    stack = [[False, root, None]]
    while stack:
        visited, root, parent = stack[-1]
        if visited:
            stack.pop()
            del path[root]
        else:
            stack[-1][0] = True
            path[root] = parent
            if root == node:
                break  # found node
            for child in [root.left, root.right]:
                if child is not None:
                    stack.append([False, child, root])
    return path

# does the find in O(n) time
def find_nearest_leaf(root, node):
    parents = find_path_to_node(root, node)
    queue = deque()
    queue.append((0, None, node))
    while queue:
        dist, prev, cur = queue.popleft()
        # if it is a leaf and not the node I started at, we are finished
        if dist > 0 and cur.is_leaf():
            return dist
        for adj in [parents.get(cur), cur.left, cur.right]:
            if adj is not None and adj != prev:
                queue.append((dist + 1, cur, adj))

class Node:
    def __init__(self, value, left=None, right=None):
        self.left = left
        self.right = right
        self.value = value

    def is_leaf(self):
        return self.right is None and self.left is None

node_h = Node('h', Node('c', Node('b'), Node('e')), Node('K', None, Node('m', Node('p'), None)))
root = Node('p', node_h, Node('q'))
print(find_nearest_leaf(root, node_h))  # returns 2
@tnutty2k8:
because of operator precedence... you may want to write down the recursion tree for array = [a,b,c]; this should give you the insight.
right, these are all integers, so I suppose the only advantage you get from division is in a situation like 1, -100, 50... in order to get rid of the -100 one could do 1/-100+50... but actually 1--100+50 is still better...
but x/0 could lead to positive infinity, which would be neat... ;-)
The thing is full of special cases one needs to consider in order to get an optimal solution. However, since the question got re-posted, I assume somebody has a hint ready...
I'll try to rephrase what I understood:
- create an associative cache (a lookup, a key-value store, ...)
- in the cache there are items, I assume some objects with attributes (e.g. an Item)
- items have a "rank"; I assume high-ranked items should be kept in the cache over low-ranked items
- the cache has a size limit, e.g. maxSize_
- get(key) method: gets an element from the cache, or nullptr if not there
- put(item) method: places the element in the cache and, if the cache would grow over max_size, removes the lowest-ranked element (which might be the one just added)
Assumptions:
- there are no two elements with the same rank; if there were, one could evict either of the two elements if their rank score is the lowest
- rank is an integer value
in your example there is a db_read_entry(key) method, which I assume is from a data access component. I would leave DB aspects out, since they are not relevant for the cache functionality
The first question I would solve similarly to you: get is trivial, and put would be:
void put(const shared_ptr<Item>& item) {
    if (ht_.find(item->getId()) == ht_.end()) {
        ht_.emplace(item->getId(), item);
        queue_.emplace(pair<int, int>(item->getRank(), item->getId()));
    }
    // clean up queue
    if (queue_.size() > maxSize_) {
        auto qitem = queue_.top(); // get top element
        queue_.pop(); // remove top element
        ht_.erase(qitem.second); // remove by id
    }
}
the only thing to mention is that it might be slightly wasteful if the element being added has a lower rank than the top of the queue, because it is added and then removed again.
The follow-up question is a bit tricky, because you actually want to change the priority of an item in the queue. Most standard containers have a problem with this, because you have no random access to the item.
One way around it is to just add another item to the queue with the new priority. Since you need to keep the HT size constant, the queue will eventually be cleaned, but there is a risk that the queue grows significantly larger than the HT. It is not optimal.
If you implement a binary heap yourself, you can solve this issue and place pointers to the binary-heap elements into the HT as values. If you change the priority of an item, you just bubble it up or down in the heap, and you get to the element in O(1) coming from the HT. I once implemented such a binary heap as an exercise; I maintained the index into the heap vector on the "iterators" pointing to the heap elements. That turned out to be quite optimal. However, I never found anybody doing it the same way; e.g. when you look at Dijkstra implementations, people usually just add queue items with lower priority and accept the queue growing larger than needed (I think it's not relevant there).
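A minimal sketch of such an indexed heap (my names; a min-heap of (rank, item) plus an item -> heap-position map, so changing an item's rank is O(lg n)):

class IndexedMinHeap:
    def __init__(self):
        self.heap = []  # list of (rank, item)
        self.pos = {}   # item -> index into self.heap

    def _swap(self, i, j):
        self.heap[i], self.heap[j] = self.heap[j], self.heap[i]
        self.pos[self.heap[i][1]] = i
        self.pos[self.heap[j][1]] = j

    def _sift_up(self, i):
        while i > 0 and self.heap[i][0] < self.heap[(i - 1) // 2][0]:
            self._swap(i, (i - 1) // 2)
            i = (i - 1) // 2

    def _sift_down(self, i):
        n = len(self.heap)
        while True:
            smallest = i
            for c in (2 * i + 1, 2 * i + 2):  # the two children
                if c < n and self.heap[c][0] < self.heap[smallest][0]:
                    smallest = c
            if smallest == i:
                return
            self._swap(i, smallest)
            i = smallest

    def push(self, rank, item):
        self.heap.append((rank, item))
        self.pos[item] = len(self.heap) - 1
        self._sift_up(len(self.heap) - 1)

    def change_rank(self, item, new_rank):
        i = self.pos[item]  # O(1) via the position map
        old_rank = self.heap[i][0]
        self.heap[i] = (new_rank, item)
        if new_rank < old_rank:
            self._sift_up(i)   # bubble up
        else:
            self._sift_down(i) # bubble down

    def pop_min(self):
        self._swap(0, len(self.heap) - 1)
        rank, item = self.heap.pop()
        del self.pos[item]
        if self.heap:
            self._sift_down(0)
        return rank, item

h = IndexedMinHeap()
h.push(5, 'a'); h.push(3, 'b'); h.push(7, 'c')
h.change_rank('a', 1)
print(h.pop_min())  # (1, 'a')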
your statement "create an associated binary tree (std::set) with the hashmap (map of entry as key, and the iterator entry in the std::set of entries)" I didn't understand.
Do you mean a hash table with the item id as key and the tree iterator as value? In the tree (you will need a map, not a set) you would have the rank as key and the item as value, like:
unordered_map<int, map<int, shared_ptr<Item>>::iterator> item_id_to_treeItem;
map<int, shared_ptr<Item>> rank_to_item;
This works as well: you take the tree's min element, from there you get the item's id, then you can remove it from the HT. This solution has the same O(lg(n)) properties for get() and put() as the heap, due to the maintenance of the tree, but the tree needs to rebalance etc., which is heavier on constants than a binary heap; the heap can be implemented in a vector, which is much more heap- and cache-friendly than trees.
the idea is to do a special kind of BFS:
1) I assume I have a parent pointer for each node; if I don't, I can easily get a dictionary with the necessary parent pointers, derived from the path from the tree root to the given node (I then need the root, which defines the tree I can look in). Further, I assume that if I get passed a leaf, I should return the nearest other leaf and not the passed-in node (leaf) itself.
2) instead of maintaining a visited array, I can simply not go back where I came from, because there are no cycles and no forward edges in a regular binary tree
3) I just start at the node and go to the parent, left and right
...
Sample tree
       p
      / \
     h   q
    / \
   c   k
  / \   \
 b   e   m
        /
       p
complete implementation, using parent pointers for simplicity, in Python 3:
from collections import deque

def find_nearest_leaf(node):
    queue = deque()
    queue.append((0, None, node))
    while queue:
        dist, prev, cur = queue.popleft()
        if dist > 0 and cur.is_leaf():
            return (cur, dist)
        for adj in [cur.parent, cur.left, cur.right]:
            if adj is not None and adj != prev:
                queue.append((dist + 1, cur, adj))
    return (None, None)

class Node:
    def __init__(self, value, left=None, right=None):
        self.parent = None
        self.left = left
        self.right = right
        self.value = value
        if self.right is not None:
            self.right.parent = self
        if self.left is not None:
            self.left.parent = self

    def is_leaf(self):
        return self.right is None and self.left is None

node_h = Node('h', Node('c', Node('b'), Node('e')), Node('K', None, Node('m', Node('p'), None)))
Node('p', node_h, Node('q'))
print(find_nearest_leaf(node_h))  # finds a leaf at distance 2
If you have a parent pointer it's really simple: from a node, go up, left and right (put those nodes into a queue as pairs of (coming-from, node), e.g. enqueue((self, self.left)), ...).
When you dequeue, only visit nodes that are not equal to the first element of the pair taken from the queue (don't go back where you came from).
Then the first leaf encountered is it...
If you have no parent pointer, you first create a parent-pointer list from the root to the node you start from; all other parent pointers are not needed.
Code follows...
1. per document: extract the document text, remove all tags, formatting etc.
2. per document: perform word stemming and remove filler words (like "a", "I", etc...)
3. create word ids, where each word gets an id
4. per document: create a vector of words, where you have, per word id, the number of times it occurs in the document, e.g. {(1, 1), (8, 10), (14, 2)}
5. perform k-means, that is, find k partitions where k is the desired number of categories. Per category, maybe take the top most used words to describe it. A small sketch follows below.
k-means works as follows:
1) pick k random categories, then check for each document which one it has the most similarity to, and assign it to this category
2) now, for each category, re-calculate its center by adding up the vectors
3) check for each document which category it's closest to and re-assign it to this category
4) stop if no re-assignment happened in 3), otherwise continue with 2)
5) for similarity you can pick anything, like cosine similarity or other similarity functions
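A minimal sketch of that loop (my simplifications: dense vectors instead of sparse ones, cosine similarity, documents given as lists of floats):

import random
from math import sqrt

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def kmeans(docs, k, iterations=100):
    centers = random.sample(docs, k)  # 1) pick k random centers
    assignment = [-1] * len(docs)
    for _ in range(iterations):
        # 1)/3) assign each document to the most similar center
        new_assignment = [max(range(k), key=lambda c: cosine(doc, centers[c]))
                          for doc in docs]
        if new_assignment == assignment:  # 4) stop if nothing was re-assigned
            break
        assignment = new_assignment
        # 2) re-calculate each center by adding up its documents' vectors
        for c in range(k):
            members = [d for d, a in zip(docs, assignment) if a == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment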
assumptions:
1) the timestamp is an integer with 1 ms resolution, large enough that it won't overflow
2) if there are more than two messages per second per user id, e.g. with timestamps 100, 101, 102, it's sufficient to call too_fast(100, 101) and too_fast(101, 102) and leave out too_fast(100, 102)
approach 1:
for each message, get the unique user id and put the message into a hash table (user_id as key, message as value). Before putting it in, check whether there is already a message stored for this user; if so, compare the times, and if the two are within 1 second, call too_fast.
approach 2:
approach 1 is fine and works, but the HT will eventually grow large, consuming a lot of memory, whereas we actually only need the messages of unique user_ids from the last second, which, I assume, is considerably less than the set of users over a longer period (it needn't be, but for most systems it's a good assumption). Therefore I would keep a queue which stores user ids and times, and I would constantly check whether I can remove an element from the queue because it has "expired"; if so, I would check whether I can remove it from the HT, because there may already be a younger message in the HT for the same user...
in C++:
/*
  assumes Messages is an iterable type that reads from a stream.
  Memory management should use smart pointers, which is omitted here.
*/
void analyzeMessagesForShortRepeatedLogin(const Messages& messages)
{
    queue<Message> msgQueue;
    unordered_map<int64_t, Message> msgPerUser;
    for (const auto& msg : messages) {
        auto oldMsgIter = msgPerUser.find(msg.getUserId());
        if (oldMsgIter != msgPerUser.end()) {
            if (msg.getTime() - oldMsgIter->second.getTime() < 1000) {
                too_fast(oldMsgIter->second, msg);
            }
        }
        while (!msgQueue.empty() && msg.getTime() - msgQueue.front().getTime() >= 1000) {
            auto iter = msgPerUser.find(msgQueue.front().getUserId());
            if (iter != msgPerUser.end() && msg.getTime() - iter->second.getTime() >= 1000) {
                msgPerUser.erase(iter);
            }
            msgQueue.pop();
        }
        msgQueue.push(msg); // EDITED, thanks to funk
        msgPerUser[msg.getUserId()] = msg;
    }
}
@dizhu2016: I think you run into numerical issues; there are three constraints:
1) distribute exactly B machines, no more, no less
2) each cluster must have at least one machine
3) when you have one remaining machine, choose the cluster with the highest ceiling of tasks per machine
these conditions are not easily satisfied with a closed formula, e.g.
a[] = {1, 1000}
B = 2
N = 2
d[0] = 1 * 2 // 1001 = 0
d[1] = 1000 * 2 // 1001 = 1
// denotes integer division, i.e. floor(a/b) ("next smaller integer")
Issues with the above:
a) only one machine is distributed
b) even with proper rounding you would get d[0] = 0, d[1] = 2, which violates the condition of at least one machine per cluster
c) you can fix it and distribute the remaining machines after giving each cluster one machine, but whether the intuition behind this is right or wrong is not so straightforward, given constraint 3); see the sketch below
d) besides, there will be numerical issues: with any kind of rounding it's not guaranteed that you distribute exactly B machines; it could well be B+1 or B-1 due to rounding.
it's like distributing 10 candies to 20 people where you can only give 0 or 1 to a person: if you do this with doubles and round, you either distribute 0 candies or 20 candies...
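A minimal sketch of point c) under those constraints (my names; one machine per cluster first, then a largest-remainder distribution of the rest, so exactly B machines are handed out):

def distribute_machines(tasks, B):
    n = len(tasks)
    assert B >= n, "need at least one machine per cluster"
    d = [1] * n  # constraint 2: one machine per cluster
    remaining = B - n
    total = sum(tasks)
    shares = [t * remaining / total for t in tasks]
    d = [base + int(s) for base, s in zip(d, shares)]
    remaining -= sum(int(s) for s in shares)
    # constraint 1: hand out the leftovers by largest fractional remainder
    order = sorted(range(n), key=lambda i: shares[i] - int(shares[i]), reverse=True)
    for i in order[:remaining]:
        d[i] += 1
    return d

print(distribute_machines([1, 1000], 2))  # [1, 1]
print(distribute_machines([1, 1000], 4))  # [1, 3]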
@krupen.ghetiya
the question asks for the *median* of all elements of a dataset where you have random access to the elements. Your answer is for the *running median* problem, where the median can be queried at any point in time over the elements seen so far. Typically, for a running median the data arrives from a stream, which means no random access.
As a result, the two-heaps *running median* is not linear in time; it's O(lg(N)) per element. ...
Distribution is a different beast, too, because it is safe to assume the data won't fit into the memory of a single machine, and the two-heaps solution would need some major tricks/assumptions/modifications to work in most cases
basically you need a tokenizer where ";" is a token. it's probably enough to have chunks of tokens that do not contain ';' and are not string literals ('abc', "efg"); don't forget escapes like "\"" ... and comments like --.*\n and /* ... */
so, a somewhat limited tokenizer will do it. it can be done using a regular expression like (comments not taken into account in the sample below)
(\"([^\"])*\"|\'([^\\\'])*\'|([^\'\";])*)*;
instead of coding it, use one of the many online regex testers, like regex101.com
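A minimal sketch of that idea in Python (my adaptation of the pattern above, without escape handling; comments are still not handled):

import re

# one statement = quoted strings, or anything that is neither a quote nor ';',
# repeated, terminated by ';'
STATEMENT = re.compile(r"(\"[^\"]*\"|'[^']*'|[^'\";])*;")

sql = 'INSERT INTO t VALUES ("a;b"); SELECT * FROM t;'
for m in STATEMENT.finditer(sql):
    print(m.group().strip())
# INSERT INTO t VALUES ("a;b");
# SELECT * FROM t;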
- Chris July 22, 2017
- the common approach is to use the quicksort partition method.
partition picks a guess of a median and places smaller elements on the left side of the array, equal elements in the middle part and greater elements on the right side, and returns the start and end indices of the middle part.
This is applied repeatedly until the array is separated into two equal-sized parts (considering the even-length case). It has the challenge of selecting a good pivot element...
One needs to consider odd- and even-length arrays as well, see the code below.
- how to distribute:
one approach could be to guess the median by just determining the median on one machine. This works if the distribution across machines is uniform and a statistical error is accepted.
- alternatively, one could use the median-of-medians method, where every machine calculates a median and at the end the median of those medians is calculated. This will be a pretty good approximation, but it's still an approximation.
- the first machine guesses a good median, e.g. by using median of medians, then it promotes this median to the other machines; each machine then sends back how many elements are left and right of that element, and this information is used to repeat the process (look left or right).
the problem here is the communication among the machines: the slowest or potentially a dead machine will slow the progress, and machines don't work until the intermediate step has been completed by all others.
- a combination of those steps would be good: every machine calculates its median +/- 10 elements and sends them to a coordinator. The coordinator then determines the median of medians and checks whether it's within the elements provided; if so, it can calculate the perfect median, if not, it comes back with the best approximation...
- what should be considered is that in a real-life application there is probably no point in time where the data won't change (e.g. grow). So, if the perfect median needs to be found and the algorithm is supposed to terminate under all circumstances, some sort of snapshot mechanism is required.
anyway, here is the median based on partition:
def partition(data, begin, end):
    pivot_idx = (begin + end) // 2  # middle; better: randomize or median of medians
    pivot = data[pivot_idx]
    data[end], data[pivot_idx] = data[pivot_idx], data[end]  # move pivot to the end
    piv_begin = 0  # [begin..piv_begin) < pivot
    piv_end = 0    # [piv_begin..piv_end) = pivot
    idx = 0        # [piv_end..idx) > pivot
    while idx <= end:
        if data[idx] < pivot:
            data[piv_end], data[idx] = data[idx], data[piv_end]  # swap
            data[piv_begin], data[piv_end] = data[piv_end], data[piv_begin]  # swap
            piv_end += 1
            piv_begin += 1
        elif data[idx] == pivot:
            data[piv_end], data[idx] = data[idx], data[piv_end]  # swap
            piv_end += 1
        idx += 1
    return (piv_begin, piv_end - 1)

def median(data):
    n = len(data)
    if n == 0: return None
    if n == 1: return data[0]
    begin = 0
    end = n - 1
    while begin < end:
        m_beg, m_end = partition(data, begin, end)
        if m_beg > (n - 1) // 2:
            end = m_beg
        elif m_end < (n - 1) // 2:
            begin = m_end
        else:
            begin = (n - 1) // 2
            end = (n - 1) // 2
    if n % 2 == 1: return data[begin]  # odd case, one element
    return (data[begin] + min(data[begin + 1:])) / 2  # even case: average of the two middle elements

print(median([1, 2, 3]))           # 2
print(median([3, 2, 1, 4]))        # 2.5
print(median([2, 1]))              # 1.5
print(median([3]))                 # 3
print(median([9, 9, 9, 1, 1, 1]))  # 5.0
@hprem991:
- Chris August 14, 2017
"each node has a function called updateOnWrite()": that means when I write, each node gets called. I write n times, so I call each node n times... code it, believe me, what you are talking about doesn't exist. I know these patterns quite well ;-)