Amazon Interview Question
Software Engineer / Developer, Country: United States
Interview Type: Phone Interview
The target is to find the closest 100 stars.
One way to approach this is with the k-means clustering algorithm (the steps below describe k-means, not K-NN).
Solution Steps
1) Assume K is 10 here. The selection of K can be tuned using RMSE; that is a bigger topic, so let's not dive deep into it.
2) We are selecting 10 centroids, not 1. Why 10? Because star clusters can occur at more than one location.
3) Let's take an analogy: assume the US is a space (3-dimensional) and people are stars. We have to find where the population density is highest. NY is one cluster, LA is another, the Bay Area is a third, and similarly there are more clusters. Out of these we have to find the most densely populated location. The obvious answer is NY city, but since we don't know the answer in advance, how do we calculate it?
4) Read the first 10 rows from the file (because we assume our K is 10) and make them the initial centroids.
5) Read rows 1 to 100000 and for each row find the closest centroid.
for each row i in 1 to 100000
    for each cluster j in 1 to 10
        closestCluster = min(closestCluster, distance(centroid[j], row[i]))
    end
end
6) Assume we randomly select centroids to find the densely populated location. Say we selected 10 locations in Kansas (because it is in the middle of the US); we calculate the distance from each person to all 10 locations, find the minimum distance, and assign that person to that centroid.
7) Once we have assigned all the points/people to centroids, recalculate the centroids.
8) Each centroid has some points/people assigned to it. Recalculate each centroid coordinate-wise for x, y, z using (x1 + x2 + x3 + ... + xn) / n.
This is the new centroid. After this step the centroids move in different directions and spread out.
9) Repeat steps 5-8 until the centroids converge.
10) By "converge" I mean that each centroid moves only an insignificant distance from the previous iteration to the current one.
11) While finding the closest centroid for each point/person, maintain a min-heap structure per cluster that tracks the 100 closest points to that cluster.
12) For each cluster, find the distance from the centroid to each of its points and sum them up.
int[] distances;
for each cluster i
    sum = 0
    for each point p in the 100 points of cluster i
        sum = sum + distance(p, centroid[i])
    end
    distances[i] = sum
end
13) After this step, find the minimum distance in the array; all the points belonging to that centroid/cluster are the closest points.
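Steps 12-13 can be sketched as follows (a minimal sketch; the `densest_cluster` name and the shape of `closest_points`, the per-cluster lists kept by the heaps of step 11, are assumptions, not from the original post):

```python
import math

def densest_cluster(centroids, closest_points):
    """For each cluster, sum the distances from its centroid to its
    retained closest points (step 12), then return the index of the
    cluster with the smallest total (step 13)."""
    totals = []
    for centroid, pts in zip(centroids, closest_points):
        totals.append(sum(math.dist(centroid, p) for p in pts))
    return min(range(len(totals)), key=totals.__getitem__)
```

The points assigned to the returned cluster are then reported as the 100 closest stars.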
Note: In some places I used data structures generously. I was concentrating more on finding a solution and an implementation than on the best implementation. If you explain the above, it takes more than 60 minutes, and the interviewer would be happy with the solution.
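For reference, the k-means loop of steps 4-10 above can be sketched in Python (a minimal sketch; the `kmeans` name and its parameters are illustrative, not from the original post):

```python
import math

def kmeans(points, k=10, max_iters=100, tol=1e-6):
    """Plain k-means on 3-D points, following steps 4-10: centroids are
    seeded from the first k rows, each point is assigned to its nearest
    centroid, centroids are recomputed as the mean of their members, and
    we stop once the centroids barely move."""
    centroids = [list(p) for p in points[:k]]
    clusters = []
    for _ in range(max_iters):
        # Steps 5-6: assign every point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[j].append(p)
        # Steps 7-8: recompute each centroid coordinate-wise as a mean.
        new_centroids = []
        for c, members in enumerate(clusters):
            if members:
                n = len(members)
                new_centroids.append([sum(dim) / n for dim in zip(*members)])
            else:
                new_centroids.append(centroids[c])  # leave empty clusters alone
        # Steps 9-10: converged when no centroid moved more than tol.
        shift = max(math.dist(a, b) for a, b in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift < tol:
            break
    return centroids, clusters
```

Note that the result depends on the initial seeding, which is why the post's choice of the first K rows is only one possible heuristic.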
Find the centroid of the X, Y, Z points. Then find the closest 100 points to the centroid using a max-heap of size 100.
@Anonymous: What would be the formula for centroid calculation here in 3 dimensional space?
@anon: the selection algorithm finds the kth max/min in O(n), but we have to find all k such stars, so it won't remain linear. I wonder if we can do it in O(N)? Let me know if anyone has an idea.
Guys, use a max-heap here; basically that interviewer was fixated on a particular answer. You can have a look at more applications of max-heaps at the link below.
Here is the algorithm for the above:
create a max-heap of k = 100 points, keyed on distance: time O(k)
let the total number of stars be n; for each of the remaining n - k stars, compare it with the root of the max-heap; if it is closer than the root, make it the root and call heapify
time required: O(n - k) * O(log k)
Finally, the max-heap holds the k = 100 closest elements, and its root is the kth closest element.
So the total complexity is O(n - k) * O(log k) + O(k).
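That max-heap procedure can be sketched in Python, where the standard `heapq` module (a min-heap) is turned into a max-heap by negating distances. The `origin` reference point is an assumption, since the question doesn't say which point the stars should be closest to:

```python
import heapq
import math

def closest_k_stars(stars, origin=(0.0, 0.0, 0.0), k=100):
    """Keep a size-k max-heap keyed on distance from `origin` (heapq is a
    min-heap, so distances are stored negated). A star closer than the
    current farthest kept star replaces the root. O(k + (n - k) log k)."""
    heap = []  # entries: (-distance, star)
    for s in stars:
        d = math.dist(s, origin)
        if len(heap) < k:
            heapq.heappush(heap, (-d, s))
        elif d < -heap[0][0]:  # closer than the farthest star we kept
            heapq.heapreplace(heap, (-d, s))
    return [s for _, s in sorted(heap, reverse=True)]  # nearest first
```

Because only k entries are ever held in memory, this also works when the n stars are streamed from a file too large to load at once.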
Correct me if anything is wrong.
Thanks
Shashank
visit us shashank7s dot blogspot dot com
www dot facebook dot com/LestCode
Wrong. The task is to find the nearest points, not the smallest values.
Simply copying the solution for "find the minimum K numbers out of N numbers" doesn't work here.
Were you asked to code this? If so, I couldn't help, as I don't know how to calculate the distance between two 3-dimensional points. If not, you would do it like this:
Run through each coordinate, calculating the distance and storing it in a priority queue where a smaller distance has a higher priority. Then pop the first 100 values and do whatever you like with them (print to screen, store in an array, etc.).
Where n is the number of star coordinates, the complexity is O(n+100).
I messed up the complexity. It would not be O(n) because generating a priority queue involves traversing a heap. I assume it would be O(nlogn), but it's early, so forgive me if I'm wrong.
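The priority-queue idea can be sketched as follows (assuming distances are measured from an `origin` such as Earth, which the post leaves implicit; for two 3-D points the distance is sqrt((x1-x2)^2 + (y1-y2)^2 + (z1-z2)^2), which `math.dist` computes). Building the heap with `heapify` is O(n) and the 100 pops cost O(100 log n), so it is cheaper than a full sort:

```python
import heapq
import math

def closest_100_via_heap(stars, origin=(0.0, 0.0, 0.0), k=100):
    """Push every star with its distance from `origin` as the priority,
    then pop the k smallest distances off the heap, nearest first."""
    heap = [(math.dist(s, origin), s) for s in stars]
    heapq.heapify(heap)
    return [heapq.heappop(heap)[1] for _ in range(min(k, len(heap)))]
```

This holds all n distances in memory at once, unlike the bounded-size max-heap approach suggested elsewhere in the thread.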
I thought of this. Please let me know what you think.
1- Sort the points based on their distance.
2- Set the minimal distance to the distance between the 1st and the 100th points. Then iterate through the points, calculating the distance between the 2nd and 101st, the 3rd and 102nd, ..., and update the minimal distance accordingly.
I assumed the set of 100 closest points is defined as the set whose two farthest points are closer together than those of any other possible set of 100 points.
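Under that interpretation, the sliding-window idea might look like this (a hedged sketch: the reference point for the initial sort is an assumption, and consecutive distances along one radial ordering do not bound the true 3-D spread of a window, so this is a heuristic rather than an exact solution):

```python
import math

def tightest_window(points, k=100, origin=(0.0, 0.0, 0.0)):
    """Sort points by distance from `origin`, then slide a window of k
    consecutive points and keep the window whose first and last points
    are closest together, as described in the steps above."""
    ranked = sorted(points, key=lambda p: math.dist(p, origin))
    best_start, best_spread = 0, float("inf")
    for i in range(len(ranked) - k + 1):
        spread = math.dist(ranked[i], ranked[i + k - 1])
        if spread < best_spread:
            best_start, best_spread = i, spread
    return ranked[best_start:best_start + k]
```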
This is my solution
Put the first 100 star coordinates in an array and calculate the bounding sphere (Wikipedia: Smallest_enclosing_sphere) of the points. While doing so, also mark the nodes in the array that lie on the surface of the bounding sphere.
From then on, keep picking the rest of the points from the file. If a point falls outside (the distance from the center of the current bounding sphere to this coordinate is greater than the sphere's radius), ignore it. Otherwise, find the marked (surface) point that is farthest from this point, replace it, recalculate the bounding sphere, and proceed. The time complexity should be O(k*n), k = 100.
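A rough sketch of this idea follows, with an important caveat: it approximates the smallest enclosing sphere by the centroid of the kept points plus the distance to the farthest of them, rather than computing the exact sphere (which, e.g. via Welzl's algorithm, is considerably more involved). The function name and eviction rule are illustrative assumptions:

```python
import math

def closest_100_bounding(points_iter, k=100):
    """Keep k points; for each new point, approximate the bounding sphere
    of the kept set (center = centroid, radius = distance to the farthest
    kept point). A new point strictly inside the sphere evicts the
    current farthest point; points outside are ignored."""
    kept = []
    for p in points_iter:
        if len(kept) < k:
            kept.append(p)
            continue
        center = [sum(dim) / len(kept) for dim in zip(*kept)]
        far_i = max(range(len(kept)), key=lambda i: math.dist(kept[i], center))
        radius = math.dist(kept[far_i], center)
        if math.dist(p, center) < radius:
            kept[far_i] = p  # p tightens the cluster: swap out the farthest point
    return kept
```

Like the original proposal, this reads the file in a single pass with O(k) memory.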
The problem is similar to the kth-smallest-of-N-items problem, so it can be solved using a selection algorithm, which runs in O(N) time with O(1) extra space.
Here is the pseudocode of the selection algorithm, taken from Wikipedia:
function select(list, left, right, k)
    if left = right // If the list contains only one element
        return list[left] // Return that element
    select pivotIndex between left and right
    pivotNewIndex := partition(list, left, right, pivotIndex)
    pivotDist := pivotNewIndex - left + 1
    // The pivot is in its final sorted position,
    // so pivotDist reflects its 1-based position if list were sorted
    if pivotDist = k
        return list[pivotNewIndex]
    else if k < pivotDist
        return select(list, left, pivotNewIndex - 1, k)
    else
        return select(list, pivotNewIndex + 1, right, k - pivotDist)
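Applied to this problem, the selection routine above might look as follows in Python (a sketch assuming distances are measured from a fixed `origin`; note that the 100 points come back unordered, and that the partitioning below is an iterative Hoare-style variant of the recursive pseudocode):

```python
import math
import random

def select_k_closest(stars, origin=(0.0, 0.0, 0.0), k=100):
    """Quickselect on distances: partition in place until the k smallest
    distances occupy the first k slots. Expected O(n) time, O(1) extra
    space beyond the distance array."""
    dists = [(math.dist(s, origin), s) for s in stars]
    left, right = 0, len(dists) - 1
    while left < right:
        pivot = dists[random.randint(left, right)][0]
        i, j = left, right
        while i <= j:
            while dists[i][0] < pivot:
                i += 1
            while dists[j][0] > pivot:
                j -= 1
            if i <= j:
                dists[i], dists[j] = dists[j], dists[i]
                i += 1
                j -= 1
        if k - 1 <= j:       # target index lies in the left partition
            right = j
        elif k - 1 >= i:     # target index lies in the right partition
            left = i
        else:                # target index sits between j and i: done
            break
    return [s for _, s in dists[:k]]
```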
Sorry mate, this does not work at all in this case.
We are not measuring closeness to a specific point, but closeness to each other. (If we measure against a specific point, your algorithm fails anyway; a good approach in that case is to order the points by Manhattan distance, take the top candidates, and then apply the real distance.)
So this is pretty much a clustering problem, and a selection algorithm does not help.
- analog76 February 11, 2012