Interview Question for Developer Program Engineers


Country: United States
Interview Type: Phone Interview





Here is my attempt at this problem. I think it's important to ask whether you need to optimize for speed or for space. My implementation optimizes for space; you could use a hash table to optimize for speed and sacrifice more space.

To conserve space I used a doubly-linked list. A linked list uses only the minimum required space and makes it easy to add elements, and the double links make it easy to keep the list sorted on the fly.

The list is kept sorted by the frequency of each number: the more frequent the number, the closer its node is to the front.

The average is computed the hard way at the end...

#include <stdlib.h>
#include <stdio.h>

struct node{
	struct node* next;
	struct node* prev;

	int num;
	int f;
}*first, *last;


////
// Given a set of numbers, store them so that you can answer questions
// about each number's frequency:
// What number has average frequency?
// What number has the 2nd highest frequency?
//
// Implementation using a linked list that is ordered by frequency,
// where the most frequent numbers are at the front.
////



void init_node(struct node** n)
{
	*n = (struct node*)malloc(sizeof(struct node));
	(*n)->next = NULL;
	(*n)->prev = NULL;
	(*n)->num = 0;
	(*n)->f = 0;
}

struct node* find(int i)
{
	struct node *s;
	s = first;
	while (s != NULL)
	{
		if (s->num == i)
			return s;
		s = s->next;
	}
	return NULL;
}

// inserts a number
// if it already exists, increment the frequency counter and reorder the nodes based on frequency
// if it doesn't exist, add it at the end
void insert(int i)
{
	printf("inserting %d...\n", i);
	if (first == NULL)
	{
		struct node* n;
		init_node(&n);
		n->num = i;
		n->f = 1;
		first = last = n;
	}
	else
	{
		struct node* s = find(i);
		if (s != NULL)
		{
			s->f++;
			while (s->prev != NULL && s->prev->f < s->f)
			{
				int num = s->num, f = s->f;
				s->num = s->prev->num;
				s->f = s->prev->f;
				s->prev->num = num;
				s->prev->f = f;

				s = s->prev;
			}
		}
		else
		{
			init_node(&s);
			s->num = i;
			s->f = 1;
			s->prev = last;
			last->next = s;
			last = s;

		}
	}
}

void find_average()
{
	int count = 0, sum = 0;
	struct node* n = first;

	while (n != NULL)
	{
		count++;
		sum += n->f;
		n = n->next;
	}
	int average = sum/count;
	printf("average %d\n", average);

	n = first;
	while (n != NULL)
	{
		if (n->f == average)
			printf("num: %7d %7d times\n", n->num, n->f);

		n = n->next;
	}
}

int main()
{

	first = last = NULL;

	int i;
	for (i = 0; i < 100000; i++)
		insert(rand()%10000);

	struct node* n = first;
	for (i = 0; i < 10; i++)
	{
		printf("%2d. %7d %7d times\n", i+1, n->num, n->f);
		n = n->next;
	}

	find_average();

}

Sample output...

 1.    1822      27 times
 2.    1959      25 times
 3.     364      24 times
 4.    8952      23 times
 5.    5935      22 times
 6.    8034      22 times
 7.    7403      21 times
 8.    5779      21 times
 9.    1624      21 times
10.    6679      21 times

average 10
num:    7798      10 times
num:    7396      10 times
num:    1270      10 times
num:    8039      10 times
num:    1018      10 times
num:    3623      10 times
...

- Adrian November 18, 2012

You cannot have a linked list containing more than a TB of nodes. Even if the total is only in the GBs, you would need a highly parallel supercomputer with a huge amount of memory to do that. We need some disk-based approach to achieve this.

- beginner November 18, 2012

I think we are hoping that a lot of the numbers repeat, so that we don't actually store a TB of data in our data structure.

- Adrian November 18, 2012

While reading the file I would build a hash table mapping (num => freq).
To get them sorted, I'd build a BST or a heap with nodes containing (freq, num), sorted by num.
According to the question many of the numbers repeat, so both the hash table and the heap should be rather small.

- Grr1967 November 18, 2012

Oops... sorted by freq, not num. Sorry.

- Grr1967 November 18, 2012

Why do you need a heap/BST? Why do you want to sort the data?

- Epic_coder November 18, 2012

To be able to list frequencies in descending order: highest 1st, 2nd, etc.
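
A minimal sketch of that idea in C, in the same style as the code above (a small chained hash table plus qsort by frequency stands in for the heap/BST; the 1024-bucket size and all names here are just illustrative):

#include <stdio.h>
#include <stdlib.h>

#define BUCKETS 1024

// one (number, frequency) record, chained per hash bucket
struct entry
{
	int num;
	long freq;
	struct entry* next;
};

struct entry* table[BUCKETS];

// find or create the entry for num, then bump its count
void count_num(int num)
{
	unsigned b = (unsigned)num % BUCKETS;
	struct entry* e;
	for (e = table[b]; e != NULL; e = e->next)
	{
		if (e->num == num)
		{
			e->freq++;
			return;
		}
	}
	e = (struct entry*)malloc(sizeof(struct entry));
	e->num = num;
	e->freq = 1;
	e->next = table[b];
	table[b] = e;
}

// descending by frequency, for qsort
int by_freq_desc(const void* a, const void* b)
{
	const struct entry* x = *(struct entry* const*)a;
	const struct entry* y = *(struct entry* const*)b;
	if (x->freq != y->freq)
		return (y->freq > x->freq) ? 1 : -1;
	return 0;
}

int main()
{
	int i;
	for (i = 0; i < 100000; i++)		// stand-in for reading the file
		count_num(rand() % 10000);

	// collect the entries, then sort by frequency
	size_t n = 0, cap = 1024;
	struct entry** all = (struct entry**)malloc(cap * sizeof(struct entry*));
	for (i = 0; i < BUCKETS; i++)
	{
		struct entry* e;
		for (e = table[i]; e != NULL; e = e->next)
		{
			if (n == cap)
			{
				cap *= 2;
				all = (struct entry**)realloc(all, cap * sizeof(struct entry*));
			}
			all[n++] = e;
		}
	}
	qsort(all, n, sizeof(struct entry*), by_freq_desc);

	printf("highest:     %d (%ld times)\n", all[0]->num, all[0]->freq);
	if (n > 1)
		printf("2nd highest: %d (%ld times)\n", all[1]->num, all[1]->freq);

	free(all);
	return 0;
}

Counting with the hash table is O(1) per number on average, and the one-time sort (or heap build) over the distinct numbers is what gives you the 1st, 2nd, etc. frequencies.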

- Grr1967 November 18, 2012

So in your design, you'd have O(1) for the hash table and O(log n) for the heap. The design at the top (with the linked list) is O(n*n). Looks like yours is more efficient.

I did think of using a hash table, but had a hard time finding a way to translate the frequency into a hash.

- Adrian November 18, 2012

Sorry, you have O(1) + O(n*log n).

After thinking about this further, wouldn't a hash table be bad in this scenario, as it would waste a lot of space?

- Adrian November 18, 2012

I believe the key factor here is the clue that much of the data in the file is duplicated. Assuming that, the hash table (with one key per distinct number) won't swallow tons of space.

- Grr1967 November 18, 2012

Assume you have access to a hard disk of unlimited size (we'll get rid of this assumption later). Read the numbers from the file and build a hash map of numbers to their frequencies. Keep building the hash map until you run out of main memory. Now you need to flush this "index" onto disk. One strategy is to take each number modulo 1024 (say) and store the record in an appropriately named file.

index_0 : all records of type <number, frequency> where number % 1024 = 0
index_1 : all records of type <number, frequency> where number % 1024 = 1
index_2 : all records of type <number, frequency> where number % 1024 = 2
index_3 : all records of type <number, frequency> where number % 1024 = 3
...
index_1023 : all records of type <number, frequency> where number % 1024 = 1023

Keep building the hash map from the original data and keep flushing it to disk this way. Next, sort every index file by number; if an index file is too large to load into memory for sorting, use an external sorting algorithm. Then do another pass over each file to merge records with the same number.
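
A rough sketch of the flush step, assuming the in-memory map has already been dumped into an array of (number, frequency) records; the struct, the file names and the text record format are made up for illustration, and a real implementation would keep the bucket files open (or batch records by bucket) rather than reopening a file per record:

#include <stdio.h>

#define PARTITIONS 1024

struct record { long long num; long long freq; };

// Append every in-memory (number, frequency) record to the bucket file
// chosen by number % PARTITIONS; afterwards the map can be cleared and
// refilled from the next slice of the input.
int flush_records(const struct record* recs, size_t n)
{
	size_t i;
	for (i = 0; i < n; i++)
	{
		char path[64];
		FILE* f;

		snprintf(path, sizeof path, "index_%lld", recs[i].num % PARTITIONS);
		f = fopen(path, "ab");		// append, one text record per line
		if (f == NULL)
			return -1;
		fprintf(f, "%lld %lld\n", recs[i].num, recs[i].freq);
		fclose(f);
	}
	return 0;
}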

You now have multiple sorted arrays and you need to produce a single sorted array - this is the standard n-way merge algorithm! Once you have such a consolidated index file, you can answer any frequency query you want.
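
A minimal sketch of the merge, assuming the sorted runs fit in memory as arrays of (number, frequency) records; a linear scan picks the smallest head for brevity, where a real n-way merge would use a min-heap keyed on the current head of each run:

#include <stdio.h>

struct record { long long num; long long freq; };

// Merge k arrays that are each sorted by num into out[], summing the
// frequencies of records that carry the same number. Returns how many
// merged records were written to out.
size_t merge_k(const struct record* arr[], const size_t len[], size_t k,
               struct record* out)
{
	size_t pos[64] = {0};		// current position in each run; k <= 64 assumed
	size_t written = 0;

	for (;;)
	{
		size_t i, best = k;	// index of the run with the smallest head
		for (i = 0; i < k; i++)
			if (pos[i] < len[i] &&
			    (best == k || arr[i][pos[i]].num < arr[best][pos[best]].num))
				best = i;
		if (best == k)
			break;			// every run is exhausted

		struct record r = arr[best][pos[best]++];
		if (written > 0 && out[written - 1].num == r.num)
			out[written - 1].freq += r.freq;	// same number: combine counts
		else
			out[written++] = r;
	}
	return written;
}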

How do you speed up this algorithm? Let's assume you have a cluster of 100 hosts available to you. Split the 1TB file into 100 chunks and let every host run this algorithm locally on its chunk. We have now reduced the problem to merging 100 sorted arrays, albeit on different computers. Reduce the 100 arrays to 50 by pairing hosts, then to 25, then 13, then 7, then 4, then 2, and finally the master host receives the single sorted array.

What if a host does not have enough disk space to accommodate the index? Quite simple, really: find out how much disk capacity can be made available. If 1 GB of space is available for the index, then chunk the data set into 1024 parts and run it on a cluster of 1024 hosts (the worst case being that the frequency of every number is 1, which means the index will be roughly as large as the original data set). Instead of merging the indices into one file (because no single host has that much space), rearrange the files as follows:

host 0 has index_0 files from all the computers.
host 1 has index_1 files from all the computers.
host 2 has index_2 files from all the computers.
host 3 has index_3 files from all the computers.
...
host 1023 has index_1023 files from all the computers.

Merge the index files on each computer (standard merge algorithm). Query every host for the topmost frequency and then find the winner of winners.

This algorithm scales horizontally - throw in more hosts and you'll get results faster.

- dr.house November 18, 2012

This is a nifty solution for efficiently obtaining the frequency of a number given the number.

I don't believe it quite answers the OP's question, though.

What we want is to determine the average frequency and the second-highest frequency.

- Devin November 19, 2012

Devin,

No. The question is to reduce the data set into something that can be queried easily. The challenge is to develop an algorithm for commodity hardware (limited memory, limited disk size), but with many such commodity machines available. To find the average frequency you can iterate over the entire index and compute it.

- dr.house November 19, 2012

- Store {number, freq} in a hash table; this requires one pass over the file.
- Find the average and the second-largest frequency with one pass over the hash table (sorting is not needed; see the sketch below).
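
A sketch of that second pass, assuming the hash table has already been dumped into an array of (number, freq) records (the struct and function names are illustrative):

#include <stdio.h>

struct record { long long num; long long freq; };

// One pass over the collected (number, freq) records: compute the average
// frequency and report the highest and second-highest frequencies.
void report(const struct record* recs, size_t n)
{
	long long total = 0;
	size_t top = 0, second = n;	// n means "no second-highest found yet"
	size_t i;

	if (n == 0)
		return;

	for (i = 0; i < n; i++)
	{
		total += recs[i].freq;
		if (recs[i].freq > recs[top].freq)
		{
			second = top;
			top = i;
		}
		else if (i != top && (second == n || recs[i].freq > recs[second].freq))
			second = i;
	}

	printf("average frequency: %lld\n", total / (long long)n);
	printf("highest:           %lld (%lld times)\n", recs[top].num, recs[top].freq);
	if (second != n)
		printf("2nd highest:       %lld (%lld times)\n",
		       recs[second].num, recs[second].freq);
}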

- alexey.stepanyan November 19, 2012

