dr.house
40 responses and no one has mentioned heap?
Build a max-heap of size n from the digits. Pop the maximum element from the heap and place it at the left-most position (then shift the left-most pointer one step to the right). Repeat k times. Deleting an element from the heap also rebalances it, which is an O(log(heap_size)) operation.
Time complexity: O(n + k log n); space complexity: O(n). Note that any comparison (and swap) based algorithm CANNOT sort an array of size n in less than O(n log n). If k = n, the time complexity becomes O(n log n).
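A minimal Java sketch of the heap step (the class and method names are my own, and I assume the digits arrive as a string; the original question isn't shown here). Note that adding elements one by one costs O(n log n) here; a linear-time heapify is possible but `PriorityQueue` doesn't expose it directly:

```java
import java.util.Collections;
import java.util.PriorityQueue;

class TopKDigits {
    // Builds a max-heap of all digits, then pops the maximum k times,
    // writing each popped digit into the next left-most position.
    static String placeTopK(String digits, int k) {
        PriorityQueue<Character> heap = new PriorityQueue<>(Collections.reverseOrder());
        for (char c : digits.toCharArray()) heap.add(c);
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < k; i++) sb.append(heap.poll()); // left-most pointer moves right
        return sb.toString(); // the first k positions of the final arrangement
    }
}
```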
Another attempt in Java. Note that only a leaf node can ever be the deepest node.
class NodeHolder {
    Node node;
    int maxDepth = -1; // so a lone root node (depth 0) is still recorded
}

Node traverse(Node root) {
    NodeHolder holder = new NodeHolder();
    _traverse(root, 0, holder);
    return holder.node;
}

void _traverse(Node root, int depth, NodeHolder deepest) {
    if (root == null) return;
    if (root.left == null && root.right == null) {
        if (depth > deepest.maxDepth) {
            deepest.node = root;
            deepest.maxDepth = depth;
        }
    }
    _traverse(root.left, depth + 1, deepest);
    _traverse(root.right, depth + 1, deepest);
}
I think the question is simpler. Perform one traversal in which you compute the sum of every sub-tree (including the sub-tree's root) and store it on the root of that sub-tree. The value stored at the root of the tree is then the sum of all the nodes in the tree (call it globalSum).
In the second traversal, visit every node and consider disconnecting the link to its parent. If you are at node p, then:
- the sum of the sub-tree rooted at p is stored at p (call it S)
- if you disconnect p from its parent, the sum of the rest of the tree would be globalSum - S
- maximize the difference between S and (globalSum - S) and choose that node p
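A sketch of the two traversals in Java (names are mine; instead of literally storing S on each node, the second pass recomputes subtree sums while scoring every possible cut against globalSum):

```java
class TreeSplit {
    static class Node { int val; Node left, right; Node(int v) { val = v; } }
    private static int globalSum;
    private static int bestDiff;

    // First pass computes globalSum; second pass scores cutting the edge
    // above every non-root node. Returns Integer.MIN_VALUE for a lone root
    // (there is no edge to cut).
    static int maxDifference(Node root) {
        globalSum = subtreeSum(root, false);
        bestDiff = Integer.MIN_VALUE;
        subtreeSum(root.left, true);
        subtreeSum(root.right, true);
        return bestDiff;
    }

    // Returns the subtree sum S; if scoring, evaluates |S - (globalSum - S)|
    // for the cut above this node.
    private static int subtreeSum(Node n, boolean scoring) {
        if (n == null) return 0;
        int s = n.val + subtreeSum(n.left, scoring) + subtreeSum(n.right, scoring);
        if (scoring) bestDiff = Math.max(bestDiff, Math.abs(s - (globalSum - s)));
        return s;
    }
}
```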
Devin,
No. The question is to reduce the data set into something that can be easily queried. The challenge is to develop an algorithm on commodity hardware (limited memory, limited disk size), but with numerous such machines available. To find the average frequency, you can iterate over the entire index and compute it.
Assume you have access to a hard disk of unlimited size (we'll get rid of this assumption later). Read the numbers from file and set up a hash map of numbers to their frequency. Keep building the hash map until you run out of main memory. Now you would need to flush this "index" onto disk. One strategy is to modulo the numbers by 1024 (say) and store the record in an appropriately named file.
index_0 : all records of type <number, frequency> where number % 1024 = 0
index_1 : all records of type <number, frequency> where number % 1024 = 1
index_2 : all records of type <number, frequency> where number % 1024 = 2
index_3 : all records of type <number, frequency> where number % 1024 = 3
...
index_1023 : all records of type <number, frequency> where number % 1024 = 1023
Keep building the hash map from the original data and keep flushing it to disk. Next, sort every index file by number. If an index file is too large to load in memory for sorting, use an external sorting algorithm. Then make another pass over each file to merge records for the same number, summing their frequencies.
You now have multiple sorted arrays and you need to produce a single sorted array - this is the standard n-way merge algorithm! Once you have such a consolidated index file, you can answer any frequency query you want.
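The n-way merge itself can be sketched with a min-heap in Java (merging in-memory arrays here for brevity; a real implementation would stream <number, frequency> records from the index files instead):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

class NWayMerge {
    // Merges k sorted int arrays into one sorted list using a min-heap.
    // N total elements across k runs: O(N log k) time.
    static List<Integer> merge(int[][] runs) {
        // Heap entries are {value, runIndex, positionInRun}, ordered by value.
        PriorityQueue<int[]> heap = new PriorityQueue<>((a, b) -> Integer.compare(a[0], b[0]));
        for (int i = 0; i < runs.length; i++)
            if (runs[i].length > 0) heap.add(new int[]{runs[i][0], i, 0});
        List<Integer> out = new ArrayList<>();
        while (!heap.isEmpty()) {
            int[] top = heap.poll();
            out.add(top[0]);
            int run = top[1], pos = top[2] + 1;
            if (pos < runs[run].length)          // advance within the same run
                heap.add(new int[]{runs[run][pos], run, pos});
        }
        return out;
    }
}
```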
How do you speed up this algorithm? Let's assume you have access to a cluster of 100 hosts. Split the 1TB file into 100 chunks and let every host run this algorithm locally. We have now reduced the problem to merging 100 sorted arrays, albeit on different computers. Let 100 arrays reduce to 50 by pairing up hosts. Then reduce to 25, then 13, then 7, then 4, then 2, and finally the master host receives the single sorted array.
What if the host does not have enough disk space to accommodate the index? Quite simple really! Find the capacity on disk that can be made available. If 1 GB of space is available for the index, then chunk the data set into 1024 parts and run it on a cluster of 1024 hosts (worst case being that the frequency of every number is 1, which means the index will be roughly equal to the original data set). Instead of merging the indices in one file (because one host does not have that much space), rearrange the files in such a way:
host 0 has index_0 files from all the computers.
host 1 has index_1 files from all the computers.
host 2 has index_2 files from all the computers.
host 3 has index_3 files from all the computers.
...
host 1023 has index_1023 files from all the computers.
Merge the index files on each computer (standard merge algorithm). Query every host for the topmost frequency and then find the winner of winners.
This algorithm scales horizontally - throw in more hosts and you'll get results faster.
You need a periodic reconciliation algorithm. Along with every entity record, store a timestamp (say in UTC). When you run the algorithm, you will see a number of records for the same data with different timestamps. Pick the one with the latest timestamp and mark it as the most recently reconciled data. This algorithm will be run on both the databases and will require a periodic sync.
- dr.house November 11, 2012

I don't really understand the upvoted solution, but here's an extension to the idea:
S1: Starts from index = 0, and grows towards the middle
S2: Starts from index = n-1, and grows towards the middle (by decrementing the index)
S3: Starts from the middle (mid = n/2) and flip-flops between both sides: mid+1, mid-1, mid+2, mid-2, and so on. This can be easily implemented by keeping track of a counter and the direction (essentially +1 or -1). In case the index where you want to "push" the element is already occupied, try finding an index on the other side, meaning that:
if mid + k is occupied, try mid - (k + 1); if that's also occupied, it means both S1 and S2 are exerting pressure on the middle, and pushing an element onto S3 is not possible.
This solution works well when all three stacks are of equal size, guaranteeing fairness to each stack. It takes care of the situation when one stack dominates the other two.
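A compact Java sketch of the layout (class and method names are mine; for simplicity push3 rescans outward from the middle each time instead of maintaining the counter/direction state described above, and S3 remembers its slots explicitly so pops come back in order):

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;

class TriStack {
    private final int[] arr;
    private int t1 = -1;            // top of S1; S1 occupies 0..t1
    private int t2;                 // top of S2; S2 occupies t2..n-1
    private final int mid;
    private final Deque<Integer> pos3 = new ArrayDeque<>(); // S3's slots, push order
    private final Set<Integer> used3 = new HashSet<>();

    TriStack(int n) { arr = new int[n]; t2 = n; mid = n / 2; }

    private boolean free(int i) {
        return i >= 0 && i < arr.length && i > t1 && i < t2 && !used3.contains(i);
    }

    boolean push1(int v) { if (!free(t1 + 1)) return false; arr[++t1] = v; return true; }
    boolean push2(int v) { if (!free(t2 - 1)) return false; arr[--t2] = v; return true; }

    // S3 flip-flops around the middle: mid, mid+1, mid-1, mid+2, mid-2, ...
    boolean push3(int v) {
        for (int k = 0; mid + k < arr.length || mid - k >= 0; k++) {
            if (free(mid + k)) { arr[mid + k] = v; pos3.push(mid + k); used3.add(mid + k); return true; }
            if (free(mid - k)) { arr[mid - k] = v; pos3.push(mid - k); used3.add(mid - k); return true; }
        }
        return false; // S1 and S2 have squeezed S3 out
    }

    int pop1() { return arr[t1--]; }
    int pop2() { return arr[t2++]; }
    int pop3() { int i = pos3.pop(); used3.remove(i); return arr[i]; }
}
```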
Building a trie out of the contents of the dictionary seems to solve the problem. Since the question states that a binary search tree needs to be utilized, it seems we will need to build a balanced binary search tree out of dictionary data. The dictionary is typically presented in sorted order (e.g. look at /usr/dict/words or /usr/share/dict/words in some systems). The BST can be built recursively by looking at the middle element and then building the left and right halves.
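The recursive construction can be sketched in Java as follows (names are mine; the words are assumed to be already sorted, as in /usr/share/dict/words):

```java
class DictBST {
    static class Node { String word; Node left, right; Node(String w) { word = w; } }

    static Node build(String[] sorted) { return build(sorted, 0, sorted.length - 1); }

    // Builds a balanced BST from a sorted slice in O(n) time:
    // the middle element becomes the root, halves recurse.
    private static Node build(String[] sorted, int lo, int hi) {
        if (lo > hi) return null;
        int mid = (lo + hi) / 2;
        Node root = new Node(sorted[mid]);
        root.left = build(sorted, lo, mid - 1);
        root.right = build(sorted, mid + 1, hi);
        return root;
    }
}
```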
- dr.house October 17, 2012

Assuming every node contains 26 pointers, a node can be represented using 4 bytes (the first 6 bits carrying the node information and the remaining 26 bits indicating the presence of each child pointer)[1].
Do a breadth-first traversal of the trie. For each node, write a '0' flag for an internal node or '1' for a leaf node inside its 4 bytes, and keep writing these 4-byte blocks one after the other. You will end up taking 4n bytes to represent the tree.
While deserializing, split the stream into blocks of 4 bytes each and put them in a queue. Dequeue the first block and make it the root. Then dequeue k blocks, where k is the number of children the root declares, and attach them appropriately. Move to the next node in the queue that has a non-zero number of children and repeat the process.
[1] Assumption for ease of implementation. There may be other ways.
Nice problem.
Assume M = [m1, m2, m3, ..., my] and N = [n1, n2, n3, ..., ny]; the product of the two vectors is m1*n1 + m2*n2 + m3*n3 + ... + my*ny.
Here N is the vector B, of size y, and is composed of constants (meaning that the vector is given to you). M, on the other hand, is composed of the elements of another vector A (of size x) plus zeroes (y - x of them). The other constraint is that the elements of A must appear in the same order (read: insert zeroes between the elements). Minimizing the product means minimizing the expression above. n1, n2, n3, ... are constants; you need to determine m1, m2, ....
Look at the last term of the expression. We can place there either the last element of A or a 0, whichever minimizes the value of the expression.
Let F(i, j, k) be the minimum value of the expression over positions 0..j, with i zeros still to place and A[k] the next element of A to place (filling positions from the right). We need to find F(y-x, y-1, x-1). The optimal substructure thus becomes:
F(i, j, k) = min { F(i-1, j-1, k), F(i, j-1, k-1) + A[k]*B[j] }
F(i-1, j-1, k) indicating that you placed a zero in the jth position and F(i, j-1, k-1) indicating that you placed the kth element of A in jth position and you still have i zeros to fill up.
The solution can be found by filling up this 3D matrix, leading to a time complexity of O(y^2 * x).
What are the base conditions of this recursion? I'll leave it to the astute reader.
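A memoized sketch in Java, with one choice of base cases spelled out (names are mine; x <= y is assumed, and positions are filled from the right, so the invariant k = j - i holds throughout):

```java
import java.util.HashMap;
import java.util.Map;

class MinDotProduct {
    // Minimizes the dot product of B with A padded by (y - x) zeros, order preserved.
    static long solve(int[] A, int[] B) {
        int x = A.length, y = B.length; // assumes x <= y
        return f(y - x, y - 1, x - 1, A, B, new HashMap<>());
    }

    private static long f(int i, int j, int k, int[] A, int[] B, Map<Long, Long> memo) {
        if (j < 0) return 0;  // everything placed
        if (k < 0) return 0;  // only zeros remain; they contribute nothing
        if (i == 0) {         // no zeros left: remaining elements are forced
            long s = 0;
            for (int t = 0; t <= k; t++) s += (long) A[t] * B[j - k + t];
            return s;
        }
        long key = ((long) i << 40) | ((long) j << 20) | k;
        Long cached = memo.get(key);
        if (cached != null) return cached;
        long best = Math.min(
            f(i - 1, j - 1, k, A, B, memo),                       // put a zero at position j
            f(i, j - 1, k - 1, A, B, memo) + (long) A[k] * B[j]); // put A[k] at position j
        memo.put(key, best);
        return best;
    }
}
```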
So I'm going to assume that the solution is of practical significance. The cluster of hosts acts as a single task executor - the task being finding the median of numbers on each host. Periodically, every host determines itself to be the leader (they can all vie for a global lock, which expires every 90 seconds or so, e.g.). The leader sends out control messages to each host in the cluster. The leader also gets registered in some sort of a dashboard which the user interacts with to start the task. The leader, before starting the activity, queries for all the peers available to participate in the activity. In case less than P hosts are available, it may choose to error out. Also, every task has a taskId (an incrementing variable)
The leader, L, does the following:
size = 0;
// get the size of the combined array and have every host sort their data
for (each host i in 1...P (excluding myself)) {
    ignore = request_sync(i, { "task" : taskId, "control" : Signal.StartMedianTask });
    reply = request_sync(i, { "task" : taskId, "control" : Signal.Sort }); // sort your numbers
    size = size + reply;
}
size = size + myData.size;
myData.sort(); // sort the data on this host
arr = []; // initialize an array of structures to be used as a heap
counter = 0;
for (each host i in 1...P (excluding myself)) {
    n = request_sync(i, { "task" : taskId, "control" : Signal.Pop });
    arr[counter++] = { "host" : i, "n" : n };
}
arr[counter++] = { "host" : me, "n" : myData[0] };
size = size / 2;
heapify(arr); // arr now becomes a min-heap, ordered by the member n
counter = 1;
while (size != 0) {
    min = arr[0];
    if (min.host == me && counter < myData.size) {
        n = myData[counter++];
    } else {
        n = request_sync(min.host, { "task" : taskId, "control" : Signal.Pop });
    }
    if (n == null) {
        n = INT_MAX; // this host has run out of numbers
    }
    arr[0] = { "host" : min.host, "n" : n };
    heapify(arr);
    size--;
}
// arr[0].n now holds the median
for (each host i in 1...P (excluding myself)) {
    ignore = request_sync(i, { "task" : taskId, "control" : Signal.EndMedianTask });
}
Of course, if the leader dies, someone else will acquire the lock and become the new leader. The new leader will generate a new taskId. If the follower hosts see a different taskId in the middle of a task, they will assume that a new leader has started operation and ignore all previous taskIds.
Every host sorts its local data and answers up to O(N/2) pop queries in the worst case; the leader additionally performs O(log P) heap work per pop.
Note: There is an edge case missing in the algorithm for the astute reader
Google loves such probability computations based on dynamic programming
If P(n, x) is the probability that n dice generate a sum greater than or equal to x, then P(n, x) follows this recursive definition:
P(n, x) = (1/m) * sum { P(n-1, x-k), 1 <= k <= m }
For each value k of the nth die (each occurring with probability 1/m), the remaining n-1 dice must produce a sum >= x-k. Note that P(n-1, x-k) already includes the outcomes where the first n-1 dice reach x on their own, so no separate term is needed for them.
Base cases:
P(0, x) = { 1 if x <= 0, 0 otherwise }
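The recurrence above translates almost directly into Java (plain recursion for clarity; memoizing on (n, x) would make it fast enough for large n):

```java
class DiceProbability {
    // P(n, x): probability that n m-sided dice sum to at least x.
    static double prob(int n, int x, int m) {
        if (n == 0) return x <= 0 ? 1.0 : 0.0; // base case P(0, x)
        double p = 0;
        // Condition on the value k of the nth die, each with probability 1/m.
        for (int k = 1; k <= m; k++) p += prob(n - 1, x - k, m) / m;
        return p;
    }
}
```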
I'm assuming there is no ordering between tasks. To ensure high throughput, you need a queue to hold pending tasks. Every VM runs an application that polls the queue for work, executes it, reports the result in some fashion, and loops forever. Of course, you would need someone to enqueue the tasks in the queue. You would also want some concept of batching identical tasks together so that resource utilization is maximized. The good thing about a pool of executors is that you can scale up or down with the load very easily.
If the tasks have priority, the simplest solution is to use multiple queues, one for each priority band. The task executors look at the highest priority queue first, then the second and so on. There is a risk of starvation though.
The queue is interesting. It is required because there is no guarantee how long a particular task will take; there is an inherent impedance mismatch between the producers and the consumers. There will definitely be situations where the producer generates tasks faster than all the consumers combined can handle. The queue needs to be fault tolerant as well, and needs a defined delivery guarantee. Typically such applications use queue implementations such as Amazon SQS. SQS guarantees at-least-once delivery, which means a task can be delivered twice; the executors need to ensure it is not executed twice (look up idempotency in distributed systems).
eugene.yarovoi:
Actually, it doesn't matter whether the VMs reside on the same host or not. Even if they did, you would just have a higher probability of correlated failure (a host going down means more than one VM is out of service). You need to design for reliability anyway.
I'm not sure how you would find the maximum using two stacks. Here's a solution using a double ended queue. Let:
- D : be a queue of elements
- M : be a deque. The maximum of elements in D will always be the front element of this queue (that's our invariant).
We always enqueue and dequeue elements in D as requested. We don't always keep every element in M: before enqueuing x into M, we remove from M's rear every element smaller than x, since none of them can be the maximum while x is in the queue. When dequeuing from D, if the front of M equals the element being dequeued, we dequeue it from M as well; the new front of M is then the maximum of the remaining elements. Seen from the front, M is always in decreasing order, and its front is the current maximum of D.
function enqueue (x):
D.enqueue(x);
while (!M.isEmpty() && x > M.rear()) {
M.dequeue_rear();
}
M.enqueue(x);
The deque is assumed to support the following operations:
-enqueue: enqueue at the rear of the queue
-dequeue: dequeue from the front of the queue
-dequeue_rear : dequeue from the rear of the queue
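Putting the enqueue, dequeue, and max operations together as a Java sketch (class name is mine; callers are assumed not to dequeue from or query an empty queue):

```java
import java.util.ArrayDeque;
import java.util.Deque;

class MaxQueue {
    private final Deque<Integer> D = new ArrayDeque<>(); // the actual queue
    private final Deque<Integer> M = new ArrayDeque<>(); // decreasing; front = current max

    void enqueue(int x) {
        D.addLast(x);
        // Drop everything smaller than x from M's rear; they can never be the max again.
        while (!M.isEmpty() && x > M.peekLast()) M.pollLast();
        M.addLast(x);
    }

    int dequeue() {
        int x = D.pollFirst();
        if (x == M.peekFirst()) M.pollFirst(); // the departing element was the max
        return x;
    }

    int max() { return M.peekFirst(); } // O(1)
}
```

All three operations are amortized O(1): each element enters and leaves M at most once.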
(stack of stacks solution)
- dr.house December 14, 2017

Use a stack but limit its size to a permissible limit, say 4 MB, or about 4.19 million characters. Once the stack grows beyond this limit, serialize the entire stack onto disk and start a new stack in memory. If the current in-memory stack becomes empty, load the last serialized stack from disk into memory and keep working.
There is an edge case where stacks are written to and read from disk after almost every operation (pushes and pops oscillating around the size limit). The problem is very similar to thrashing in virtual memory.