Miguel Oliveira
I enjoy algorithms, puzzles and programming competitions :)
Interview "red flags"
Hi,
 Miguel Oliveira September 09, 2013
At the beginning of the CtCI book, Gayle talks about the case of a student whom she referred but who had too many "red flags".
Could you elaborate on these "red flags"? Maybe with a few more examples of the kinds of mistakes we should avoid in an interview?
OK, I didn't pick the right words. There are other ways, but they get much more complex than using a hashmap.
The hashmap approach is easy to code and yields an O(n) time, O(n) extra space solution. To the best of my knowledge, the other approaches also take at least O(n) time and O(n) extra space, unless we can change the array in place, which gets tricky.
Hence, I don't see any reason not to follow the hashing approach.
Mithun, you made a mistake in your example and I think you misunderstood the suggested approach.
M/C A=> 1:100 2:96 7:98 , so 1 and 7 would be the top 2.
Anyway, even if we change it to
M/C A=> 1:100 2:96 7:94
M/C B=> 3:99 5:97 7:4
M/C C=> 4:98 6:95 7:4
Where 7 is the most visited site as you intended, the merging in 5th step implies that "For all URLs in the merged set calculate correct values by asking
all servers for their scores for each URL", so we would get 102 as the count for url 7.
That's less efficient than hashing.
 Miguel Oliveira September 30, 2013If there's only 1 number appearing once, it's actually easier: just use XOR. The duplicate values cancel out, since their XOR is 0.
int unique_number(const vector<int>& a) {
    int ans = 0;
    for (size_t i = 0; i < a.size(); i++)
        ans ^= a[i];
    return ans;
}
PS: you're using i in both nested loops
 Miguel Oliveira September 30, 2013If there's more than one number appearing once, I think we need to use hashing.
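A minimal sketch of that hashing idea (my own illustration, not from the original thread; the name unique_numbers is hypothetical): count each value's frequency in a hashmap, then collect the values that occur exactly once.

```cpp
#include <unordered_map>
#include <vector>
using namespace std;

// Count frequencies, then keep the values that appear exactly once.
vector<int> unique_numbers(const vector<int>& a) {
    unordered_map<int, int> count;
    for (int x : a)
        count[x]++;
    vector<int> result;
    for (int x : a) // iterate over a again to preserve the input order
        if (count[x] == 1)
            result.push_back(x);
    return result;
}
```

Like the XOR trick, this is O(n) time, but it needs O(n) extra space for the map.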
 Miguel Oliveira September 30, 2013We are given a Directed Graph about clue derivations.
The text "To keep his mind trouble free he maintains the links such that no two clues can be derived clues of each other." means that this is actually a Directed Acyclic Graph (DAG).
As seen in the example, we can remove a link (1,4) if there are links (1,2) and (2,4). Hence, for each clue, we can run a DFS (depth-first search) on its subtree of clue derivations. In this process, if we reach an edge A->B where B was already visited, then B was already derived through other derivation edges, and so A->B can be discarded.
This algorithm takes O(N^2) time with O(N) space. We can turn this into O(N+M) time with O(N^2) space, which is roughly equivalent to me, by saving the results of each DFS per node in memory.
#include <iostream>
#include <vector>
#include <string.h>
using namespace std;

const int MAX = 1001;
char visited[MAX];
vector<int> adj[MAX];

void Dfs(int v) { // depth-first search this subtree
    visited[v] = 1;
    for (size_t j = 0; j < adj[v].size(); j++)
        if (!visited[adj[v][j]])
            Dfs(adj[v][j]);
}

int Solve(int N, int E) {
    for (int i = 1; i <= N; i++) {
        memset(visited, 0, sizeof visited);
        visited[i] = 1;
        for (size_t j = 0; j < adj[i].size(); j++)
            if (visited[adj[i][j]])
                E--; // this edge is already covered by another derivation
            else
                Dfs(adj[i][j]);
    }
    return E;
}

int main() {
    int t, a, b, i, N, E;
    cin >> t;
    while (t--) {
        cin >> N >> E;
        for (i = 1; i <= N; i++)
            adj[i].clear();
        for (i = 0; i < E; i++) {
            cin >> a >> b;
            adj[a].push_back(b);
        }
        cout << Solve(N, E) << endl;
    }
    return 0;
}

Miguel Oliveira
September 29, 2013 I suggest you don't use "l" as a variable name; it's quite easy to read it as "1" (I did at first there). That should work. I prefer using erase() as described in my answer because the code is simpler.
 Miguel Oliveira September 29, 2013This is incorrect. Note that for input "ZABCSC" and substring "ABC", the string becomes "ZSC".
 Miguel Oliveira September 29, 2013Since the question says "use very simple code and concept(ALGORITHM)", I suppose they don't want anything sophisticated like the KMP algorithm to find substring occurrences.
// Removes the first occurrence of sub from text.
// Returns 0 on success, 1 if sub does not occur in text.
int RemoveSubstring(string& text, const string& sub) {
    size_t pos = text.find(sub);
    if (pos == string::npos)
        return 1;
    text.erase(pos, sub.size());
    return 0;
}

Miguel Oliveira
September 29, 2013 The idea is to have a fixed set of N points and then answer multiple queries with different values of X. So that won't help.
 Miguel Oliveira September 29, 2013I think we need a better definition of "month". Is it 30 days?
 Miguel Oliveira September 29, 2013The aggregated top 10 is not necessarily in the top 10 of any individual server, so it won't guarantee correctness.
 Miguel Oliveira September 28, 2013I think they would expect a solution that spreads the work more uniformly among the different servers, which is hard with these constraints.
However, I think you have the merit of getting the right result. +1
I think that violates "the maps are too large to transmit over the network (especially sending all of them to a central server)".
There are too many visits to centralize a counter in one machine.
"the maps are too large to transmit over the network (especially sending all of them to a central server (..)"
We can't go that way.
That's quite similar to the approach I posted before. It does not guarantee 100% correctness: the aggregate top 10 does not need to be in the top 10 of any individual server.
 Miguel Oliveira September 28, 2013Here's a counter example where the top 3 does not appear in the top 3 of any of the 3 servers:
s1= { a:10, b:9, c:8, d:7, e: 6, f: 5 },
s2= { g:10, h:9, i:8, d:7, e: 6, f: 5 },
s3= { j:10, k:9, l:8, d:7, e: 6, f: 5 }
The top 3 is {d: 21, e: 18, f: 15}, so that approach also does not guarantee 100% correctness.
It will not. Check anon and tony's posts above.
 Miguel Oliveira September 28, 2013Why use Hashtable<Box, Boolean>? (You actually mean HashMap.) You don't need the boolean value because it's just exists/doesn't exist. A HashSet is enough.
Or in your pseudocode, Hashtable<Box>.
You can avoid that kind of hashing. Check my answer below.
The current solutions given here are too complex. You can do something simpler and smarter.
Note that we're given 10 integers, each from 0 to 9. If we concatenate the 10 numbers we'll get at most 9,999,999,999 which fits in a 64 bit integer.
Also note that dominoes (0,2) and (2,0) are the same, just as boxes [(0,2); (9,1); (3,3); (7,4); (5,6)] and [(0,2); (3,3); (5,6); (7,4); (9,1)] are the same: the order does not matter.
#include <algorithm>
#include <vector>
#include <unordered_set> // hashtable in C++11
using namespace std;

typedef pair<int,int> Domino;

class DominoChecker {
    unordered_set<long long> hash;
    vector<vector<Domino> > boxes;
public:
    bool addBox(const vector<int>& box) {
        vector<Domino> v;
        for (int i = 0; i < 5; i++) {
            Domino d(box[2*i], box[2*i+1]);
            if (d.first > d.second)
                swap(d.first, d.second); // order inside a domino does not matter
            v.push_back(d);
        }
        sort(v.begin(), v.end()); // order of the dominoes does not matter either
        long long hash_value = 0;
        for (int i = 0; i < 5; i++)
            hash_value = hash_value * 100 + v[i].first*10 + v[i].second;
        if (hash.find(hash_value) != hash.end())
            return false;
        hash.insert(hash_value);
        boxes.push_back(v); // I suppose we want to store them
        return true;
    }
};

Miguel Oliveira
September 28, 2013 First we go through all cells in the matrix. If cell[i][j] is zero, we set cell[i][0] and cell[0][j] to zero. This runs in O(n^2).
Then we process the first row. If cell[0][j] is 0, nullifyColumn(j) is called, which sets all values in that column to 0. nullifyColumn runs in O(n) and there are n columns, so this takes at most O(n^2).
Likewise, we do the same for the first column and nullify the rows. This also takes at most O(n^2).
O(n^2 + n^2 + n^2) is O(n^2) time. We use the first row and first column to store the zeros, so we need only O(1) extra space.
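The steps above can be sketched as follows (a hypothetical implementation of the description for an n x n matrix of ints; the two boolean flags, which the summary above glosses over, remember whether the first row/column themselves contained a zero):

```cpp
#include <vector>
using namespace std;

void NullifyZeros(vector<vector<int>>& m) {
    int n = m.size();
    if (n == 0) return;
    // Remember whether the first row/column themselves must be zeroed.
    bool zero_row0 = false, zero_col0 = false;
    for (int j = 0; j < n; j++) if (m[0][j] == 0) zero_row0 = true;
    for (int i = 0; i < n; i++) if (m[i][0] == 0) zero_col0 = true;
    // Step 1: store the zeros in the first row and first column.
    for (int i = 1; i < n; i++)
        for (int j = 1; j < n; j++)
            if (m[i][j] == 0)
                m[i][0] = m[0][j] = 0;
    // Steps 2-3: nullify the marked columns and rows.
    for (int i = 1; i < n; i++)
        for (int j = 1; j < n; j++)
            if (m[i][0] == 0 || m[0][j] == 0)
                m[i][j] = 0;
    if (zero_row0) for (int j = 0; j < n; j++) m[0][j] = 0;
    if (zero_col0) for (int i = 0; i < n; i++) m[i][0] = 0;
}
```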
In those cases, we need to use something like a bloom filter.
It is essentially a hash table that uses less memory. The trade-off is that we can get false positives when checking whether a character was seen before, but never false negatives (it will never say a character didn't occur when it did).
Despite the false positives, there are many reports of very successful uses of Bloom filters; check quora.com/What-are-the-best-applications-of-Bloom-filters
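A minimal sketch of the idea (my own toy version with just two hash functions over a fixed-size bit array, nothing like a production filter): membership answers can be wrong only in the "maybe present" direction.

```cpp
#include <bitset>
#include <functional>
#include <string>
using namespace std;

// Toy Bloom filter: set two bits per key, report "maybe present" only if
// both bits are set. False positives are possible; false negatives are not.
class BloomFilter {
    static const size_t M = 1 << 16; // number of bits
    bitset<M> bits;
    size_t h1(const string& s) const { return hash<string>()(s) % M; }
    size_t h2(const string& s) const { return hash<string>()(s + "#") % M; }
public:
    void add(const string& s) {
        bits.set(h1(s));
        bits.set(h2(s));
    }
    bool maybeContains(const string& s) const {
        return bits.test(h1(s)) && bits.test(h2(s));
    }
};
```

Real implementations derive k independent hash functions and size the bit array from the expected number of keys and the acceptable false-positive rate.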
I forgot to add that this is much faster than checking all numbers between 45 and 4578
 Miguel Oliveira September 28, 2013Due to "while (left + EPS < right)":
once we find an interval [left, right] whose length is less than the epsilon, we have reached that precision.
I would use a HashMap. The keys are the characters. The value is the first position where the character occurs, or -1 if it is a duplicate. The time complexity is O(n), n being the number of characters in the stream.
If we know the characters are in the ASCII range, we can just use an array instead. If the characters can be any Unicode char, then a hash map is better.
// hasNext() and getNext() are already defined.
char getUniqueCharacter() {
    HashMap<Character, Integer> hash = new HashMap<Character, Integer>();
    int pos = 0;
    while (hasNext()) {
        pos++;
        char c = getNext();
        if (hash.containsKey(c))
            hash.put(c, -1); // duplicate
        else
            hash.put(c, pos);
    }
    char result = '#';
    pos++; // one past the last position, used as a sentinel
    for (Map.Entry<Character, Integer> entry : hash.entrySet()) {
        int p = entry.getValue();
        if (p != -1 && p < pos) {
            pos = p;
            result = entry.getKey();
        }
    }
    return result;
}

Miguel Oliveira
September 28, 2013 Yeah, Anon gave an example in his answer. These types of questions require some discussion with the interviewer. I don't see a way to do this very efficiently given the constraints; maybe the constraints are not as strict as they seem.
 Miguel Oliveira September 26, 2013It seems that many people don't read the comments. Maybe you could edit the question to say explicitly that we have many queries and want better than O(n) per query.
 Miguel Oliveira September 26, 2013The "why" was already answered before. Check the other posts.
 Miguel Oliveira September 26, 2013Since the capital letters remain unchanged, we can pick the positions of the lowercase letters and use the STL's next_permutation on them. The other characters stay unchanged.
#include <algorithm>
#include <cctype>
#include <iostream>
#include <string>
#include <vector>
using namespace std;

void PrintAllLowerCasePermutations(string s) { // by value: we modify a copy
    size_t i;
    vector<int> pos;
    string lower;
    for (i = 0; i < s.size(); i++)
        if (islower(s[i])) {
            pos.push_back(i);
            lower += s[i];
        }
    sort(lower.begin(), lower.end()); // start from the smallest permutation
    do {
        for (i = 0; i < pos.size(); i++)
            s[pos[i]] = lower[i];
        cout << s << endl;
    } while (next_permutation(lower.begin(), lower.end()));
}

Miguel Oliveira
September 26, 2013 edit: this does not guarantee 100% correctness.
The question says we can't use MapReduce directly because the amount of data is too large. However, the overall top 10 sites are *expected* to be in the top 10 of at least one of the computers.
Hence, we can sort the logs of each computer and emit the top 10 of each computer in the mapping phase. Then the reduce phase aggregates the number of visits for each site. The result will be the top 10 of the aggregate.
Right, that's why a greedy approach will most likely fail. I think we need to go for dynamic programming; there are only 13 cards.
 Miguel Oliveira September 26, 2013Such a greedy approach is incorrect. See Pooja's example.
 Miguel Oliveira September 26, 2013That's not what the question asks.
 Miguel Oliveira September 26, 2013We can treat this as a shortest path problem.
Going from the chest to clue A, then to clue B, and returning to the chest does not affect the cost of picking all clues except A and B.
So we can use dynamic programming with bitmasks (handy that N <= 24). We process the bitmasks in ascending order from 1 to 2^N - 1. A bitmask has bit i set to 1 if we still need to pick clue i. The minimum cost of collecting a set of clues is the minimum of:
- the cost of picking one of the clues with its bit set to 1, plus the cost of picking all the other clues
- the cost of picking 2 of the clues with bits set to 1, plus the cost of picking all the other clues
#include <stdio.h>
#include <string.h>
#include <limits.h>
#include <algorithm>
using namespace std;

const int MAX = 24;
int dp[1<<MAX], N, sx, sy, x[MAX], y[MAX], dist[MAX][MAX], chest_dist[MAX];

int main() {
    int nt, t, i, j, all_mask, mask, cost;
    scanf("%d", &nt);
    for (t = 1; t <= nt; t++) {
        scanf("%d %d %d", &sx, &sy, &N);
        for (i = 0; i < N; i++) {
            scanf("%d %d", &x[i], &y[i]);
            chest_dist[i] = (x[i]-sx)*(x[i]-sx) + (y[i]-sy)*(y[i]-sy);
        }
        for (i = 0; i < N; i++)
            for (j = i+1; j < N; j++)
                dist[i][j] = (x[i]-x[j])*(x[i]-x[j]) + (y[i]-y[j])*(y[i]-y[j]);
        dp[0] = 0;
        all_mask = (1 << N) - 1;
        for (mask = 1; mask <= all_mask; mask++) {
            // mask has bit i set to 1 if we still need to pick clue i
            dp[mask] = INT_MAX;
            for (i = 0; i < N; i++)
                if (mask & (1 << i)) {
                    // pick clue i and come back to the chest
                    dp[mask] = min(dp[mask],
                                   chest_dist[i]*2 + dp[mask ^ (1<<i)]);
                    for (j = i+1; j < N; j++)
                        if (mask & (1 << j)) {
                            // pick clues i and j, come back to the chest
                            cost = chest_dist[i] + chest_dist[j] + dist[i][j] +
                                   dp[mask ^ (1<<i) ^ (1<<j)];
                            dp[mask] = min(dp[mask], cost);
                        }
                }
        }
        printf("%d\n", dp[all_mask]);
    }
    return 0;
}

Miguel Oliveira
September 26, 2013 This is a shortest path problem. Your greedy approach is incorrect. Counterexamples:
2
1 2
3
3 2
1 1
1 1
0 0
3
1 0
2 1
1 0
In case 1, the best is to take clue at 1, 1 and come back. Then clue at 1, 1 go to clue 3, 2 and come back. This costs 60. Your program returns 78.
The answer to test 2 is 10, but your program gives 12.
If you see my reply to your post above, I think we'll end up following almost all paths. Thus it won't be much better than the O(n) linear scan.
 Miguel Oliveira September 26, 2013While the constant factor might be big for certain data structures, it's still quite worth it due to O(log(n)) vs O(n) time.
In this case, the data structure does not need to change between queries, so I don't see any reason for the performance to degrade.
Itanium, you're missing the point of the question. We have a fixed set of points, but we want to do several queries with different points X. The trivial approach takes O(Q*N), but we would like O(Q*log(N)), or at least O(Q*sqrt(N)).
Now I also see that I misread the question: I thought we were talking about 2D or 3D points, not just an array of floats. I still don't see a better approach that avoids sorting.
I'm not sure the trie approach ends up better than a linear scan through the N points. The trie stores the most significant digits first, and we don't know the order of magnitude of the points represented in the children of a node. We can have numbers with 3, 4, or 10 digits, so we end up traversing all points (we have little potential to prune the search).
The sorting constraint is a bit weird to me. I would like to clarify it a bit better with the interviewer.
If we completely ignore sorting, maybe we should go for a sort of Locality Sensitive Hashing. The goal of the data structure would be to group the points in buckets of points close to each other.
At this point, I would also ask the interviewer if we need 100% correctness or if an approximate answer is ok.
Let's say we group the points in O(sqrt(n)) buckets. To answer a query, we could pick a random sample of each bucket and find the bucket closest to point X. The 100% correctness part is relevant here. Since we group closest points together, we expect that the selected bucket contains the closest point (although it can be wrong a few times).
Then, we could brute force the elements of this bucket to find the closest point to X, yielding O(sqrt(n)) per query.
We can try to have O(log(n)) buckets but the sorting constraint makes it hard to have a good performance checking the n/log(n) elements of a bucket.
Still, I think that grouping the points into properly balanced buckets can be quite hard. In an interview I would go for a straightforward approach: first go through each point and select the ~sqrt(n) available points closest to it, or binary search the maximum distance between points in a bucket and check how many buckets a given threshold yields.
What do you think?
Like you said, that's the very straightforward way to solve the question, answering each query of a point X in O(N) time.
This kind of question ("goal is to efficiently find the number that is closest to 'x' from the fixed set", "what data structure will you actually use for storing the fixed set of floating point numbers to achieve this?") implies that we need to do better than the trivial approach.
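A sketch of what "better than trivial" could look like for an array of floats (my own illustration, assuming we may sort the fixed set once up front): sort in O(N log N), then answer each query with a binary search in O(log N).

```cpp
#include <algorithm>
#include <vector>
using namespace std;

// Given the fixed set already sorted (and non-empty), each query is O(log N):
// lower_bound finds the first point >= x, and the answer is either that
// point or its left neighbor, whichever is nearer.
double closest(const vector<double>& sorted_pts, double x) {
    auto it = lower_bound(sorted_pts.begin(), sorted_pts.end(), x);
    if (it == sorted_pts.begin()) return *it;       // x is below all points
    if (it == sorted_pts.end()) return *(it - 1);   // x is above all points
    double above = *it, below = *(it - 1);
    return (above - x < x - below) ? above : below; // pick the nearer side
}
```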
It's 200.
You probably meant "a=200/300*300;"
There are many possible options - en.wikipedia.org/wiki/Nearest_neighbor_search
I think we should go for a Space Partitioning method.
I would pick a k-d tree (where k is the number of dimensions of the points). This would allow us to answer each query point X in O(log N) time, where N is the number of points in the fixed set.
Please edit your post then, adding that information. But that "sort" constraint is weird.
 Miguel Oliveira September 25, 2013I think you're supposed to consider the whole file as one string, not check palindromes line by line.
 Miguel Oliveira September 25, 2013You missed the point of the question. Given N, you should output the corresponding column name.
N = 3 -> "C"
N = 27 -> "AA"
N = 18278 -> "ZZZ"
N = 18279 -> "AAAA"
etc
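A sketch of the conversion (my own; the name columnName is hypothetical), using the bijective base-26 trick of subtracting 1 before each division so that 26 maps to "Z" rather than rolling over:

```cpp
#include <string>
using namespace std;

// Convert a 1-based column number to its spreadsheet name ("A".."Z", "AA", ...).
string columnName(long long n) {
    string name;
    while (n > 0) {
        n--; // shift to 0-based for this digit: 1->A, ..., 26->Z
        name = char('A' + n % 26) + name;
        n /= 26;
    }
    return name;
}
```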
The second method has a bug. It won't answer correctly for the array {1,5,8}, K = 10: it would say 5,5 but there is only one 5.
for (i = 0; i < n; i++)
{
    Complement = K - a[i];
    if (SearchHashTable(Complement) != -1)
        cout << "Indices:\t" << HashTable[Complement] << "\t" << i << endl;
    HashTable[a[i]] = i;
}

Miguel Oliveira
September 25, 2013 That only works for a small number of rows and columns
 Miguel Oliveira September 25, 2013First, we should read the file into a string and find the longest palindrome in that string.
The best method is Manacher's algorithm, which runs in O(N) time with O(N) extra space. I don't think an interviewer can possibly expect this in a 45-minute interview; I suppose a standard O(N^2) solution is expected, checking all possible centers.
My implementation of Manacher's algorithm is the following:
// More details on akalin.cx/longest-palindrome-linear-time
const int MAXLEN = 1000000;
int p[2*MAXLEN+10], np; // sizes of palindromes with even and odd length

void ManacherAlgorithm(char* str) {
    int i, s, e, j, d, pal_len, len = strlen(str);
    np = 0;
    i = 0;
    pal_len = 0;
    // Loop invariant: str[(i - pal_len)..i] is a palindrome.
    // Loop invariant: np >= 2 * i - pal_len. The code path that
    // increments pal_len skips the p-filling inner loop.
    // Loop invariant: np < 2 * i + 1. Any code path that increments
    // i past len - 1 exits the loop early and so skips
    // the p-filling inner loop.
    while (i < len) {
        // First, see if we can extend the current palindrome.
        // Note that the center of the palindrome remains fixed.
        if (i > pal_len && str[i - pal_len - 1] == str[i]) {
            pal_len += 2;
            ++i;
            continue;
        }
        // The current palindrome is as large as it gets, so we append it.
        p[np++] = pal_len;
        // Now, to make further progress, we look for a smaller palindrome sharing
        // the right edge with the current palindrome. If we find one, we can try
        // to expand it and see where that takes us. At the same time, we can fill
        // the values for p that we neglected during the loop above. We make use of
        // our knowledge of the length of the previous palindrome (pal_len) and the
        // fact that the values of p for positions on the right half of the
        // palindrome are closely related to the values of the corresponding
        // positions on the left half of the palindrome.
        //
        // Traverse backwards starting from the second-to-last index up to the
        // edge of the last palindrome.
        s = np - 2;
        e = s - pal_len;
        for (j = s; j > e; j--) {
            d = j - e - 1;
            // We check whether the palindrome at p[j] shares a left edge
            // with the last palindrome. If so, the corresponding palindrome
            // on the right half must share the right edge with the last
            // palindrome, and so we have a new value for pal_len.
            if (p[j] == d) {
                pal_len = d;
                break;
            }
            p[np++] = min(d, p[j]);
        }
        if (j == e) {
            i++;
            pal_len = 1;
        }
    }
    p[np++] = pal_len;
    // Traverse backwards starting from the second-to-last index until we
    // get p to size 2 * len + 1.
    // We can deduce from the loop invariants that we have enough elements.
    s = np - 2;
    e = s - (2*len + 1 - np);
    for (j = s; j > e; j--) {
        // The d here uses the same formula as the d in the inner loop above
        // (it computes the distance to the left edge of the last palindrome).
        d = j - e - 1;
        // We bound p[j] with min for the same reason as in the inner loop above.
        p[np++] = min(d, p[j]);
    }
    pal_len = -1;
    for (i = 0; i < np; i++)
        if (p[i] > pal_len) {
            pal_len = p[i];
            j = i;
        }
    s = j / 2 - pal_len / 2;
    e = s + pal_len - 1;
    printf("max palindrome size = %d, indices [%d, %d]\n", pal_len, s, e);
}

Miguel Oliveira
September 25, 2013 For an NxN matrix, this takes O(N^2) time.
 Miguel Oliveira September 25, 2013
@Kiarash, that sounds intuitive, but it's not true, because the time is equal to the square of the Euclidean distance between the points. Being squared makes the difference (recall the a^2 + b^2 = c^2 equation and what happens with angles bigger than 90 degrees).
 Miguel Oliveira October 01, 2013
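A concrete instance of the point above (the numbers are my own): with squared Euclidean distances the triangle inequality no longer holds, so a detour through an intermediate point can cost less than the direct hop, which is why the nearest-first intuition fails.

```cpp
// Squared Euclidean distance between two integer points, as used as the
// travel cost in the clue problem.
int sq_dist(int x1, int y1, int x2, int y2) {
    return (x1 - x2) * (x1 - x2) + (y1 - y2) * (y1 - y2);
}
```

For example, going straight from (0,0) to (2,0) costs 4, but going through the midpoint (1,0) costs only 1 + 1 = 2.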