We have 'n' patients and 'm' p

Epic Systems Interview Question for Software Engineer / Developers

1

of 1 vote

24
Answers
We have 'n' patients and 'm' problems. The problems are of boolean type. Eg diabetes problem would be 'T' if a patient has it or 'F' otherwise. Suggest the data structure you would store this scenario on?

Q: We have a set of problems {diabetes, liver disease, kidney disease} find all the patients who have at least the 3 problems from the set.

The number of patients can be huge (n).

The number of problems not comparatively huge (m).

Which would be the best data structure to store these kind of records, so that we have a better search time.
- Anon October 01, 2014 in United States | Report Duplicate | Flag | PURGE
Epic Systems Software Engineer / Developer Data Structures

Email me when people comment.

An error occurred in subscribing you.

Country: United States
Interview Type: Phone Interview

Email me when people comment.

An error occurred in subscribing you.

Comment hidden because of low score. Click to expand.

of 4 vote

The answer is a K-D Tree: Split on the m problems. Each leaf represents a combination of the problems, and stores a list of references to patients.

Assuming that m is significantly smaller than n:
Time to search is O(1) - basically the height of the tree, which is m
Space is O(n) since no patient objects have to be duplicated

Edit: This is assumed to be an OPTIMIZED tree. Meaning: the tree would ONLY store enough paths to divide all the patients.

A simplified example: There are 100 diseases, and only 3 of those diseases are seen in all the patients. Some patients have only disease A, and the rest have both disease B and C. Then, the tree will only have a single node, with 2 children (not 2^100 children). The left child stores all patients with disease A, and the right child stores all diseases WITHOUT disease A. In practice, trees like this will be very small, relative to your data set. They're used in machine learning and indexing for geographic related searches.

- Tsuna00 October 03, 2014 | Flag Reply

Comment hidden because of low score. Click to expand.

of 0 votes

I think this is the best solution. It's basically a balanced binary tree. Each path represents a combinations of problems and leads to the list of patients with these problems.

- undefined October 07, 2014 | Flag

Comment hidden because of low score. Click to expand.

of 0 votes

Remember the question is to find all the patients with at least the 3 particular diseases. It actually takes O(2^(m-3)) to find all the patients from the tree.

I think the best solution depends. If n >> 2^m, then the tree solution is good. If
n is comparable or even less than 2^m, a simple 2D array may be better.

- Anonymous October 07, 2014 | Flag

Comment hidden because of low score. Click to expand.

of 0 votes

2 issues with your response:
1. The problem states that n is relatively large compared to m
2. I didn't state it, but the tree would ONLY store enough paths to divide all the patients.

A simplified example: If there are 100 diseases, but only 3 of those diseases are seen in all the patients. Some patients have disease A, and the rest have both disease B and C. Then the optimized tree will only have a single node, with 2 children (not 2^100 children). The left child is all patients with disease A, and the right child is all diseases WITHOUT disease A. In practice, trees like this will be very small, relative to your data set.

- Tsuna00 November 13, 2014 | Flag

Comment hidden because of low score. Click to expand.

of 1 vote

^counter argument for above suggestion:
The problem demands solution for "find all the patients who have at least the 3 problems from the set."
So if you create a 2D array, you can surely access required disease in O(1). However, while searching for one patient who has at least 3 diseases, you have to access all the disease columns in worst case. So for one patient, the time complexity in you DS will be O(m). Even though m is small, this seems a trivial solution. The time complexity in your case will be O(n x m) for all the patients. Please correct me if I am wrong.

- left4dead October 02, 2014 | Flag Reply

Comment hidden because of low score. Click to expand.

of 2 votes

Maintain a extra column (Tot_diseases) containing total number of diseases per patient. Increment this attribute while inserting 'T' for a particular disease. This way time complexity will be O(1).

- khunima October 02, 2014 | Flag

Comment hidden because of low score. Click to expand.

of 0 votes

@khunima can you explain the additional column comment in some detail how would it reduce the time complexity to O(1)? (based on the argument to which it is commented)

- Anon October 04, 2014 | Flag

Comment hidden because of low score. Click to expand.

of 0 votes

Explanation for what @khunmia is trying to say: maintain a count of the number of diseases each patient has using a separate column. So now there are m+1 columns(m diseases + one Tot_diseases column). When returning the patients with atleast 3 diseases, check if the value of the Tot_diseases column is >=3. If yes, return that patient. This makes it O(1) to check if a patient has atleast 3 diseases and the over all time O(n)

- punter October 10, 2014 | Flag

Comment hidden because of low score. Click to expand.

of 0 votes

searching diseases for each patient is not o(n), because we have know the index of every disease. Even we don't know that, we can use ArrayList<HashMap<String, Boolean>> to do that.

- Anonymous November 06, 2014 | Flag

Comment hidden because of low score. Click to expand.

of 0 vote

I would use a hashtable where patients are mapped to the keys and the problems are values.

- Anonymous October 01, 2014 | Flag Reply

Comment hidden because of low score. Click to expand.

of 0 vote

Not sure how one can search based on values in a hash map?

- Anonymous October 02, 2014 | Flag Reply

Comment hidden because of low score. Click to expand.

of 0 votes

That's impossible, you have to iterate all the values of the hash map to find it.

- huangxiao09 October 02, 2014 | Flag

Comment hidden because of low score. Click to expand.

of 2 vote

Two dimensional array of size n x m work?
n - no of patients
m - no of diseases.

since n > m, one might want to go over the n less no of times compared to m. For every patient (i.e. for each n) , inspect the required diseases ( at specific columns in the row for the patient) in constant time, as array look is O(1).
time complexity would be O(n).
space complexity would be O(n x m)

Hi Jobseeker,
What was the answer you came up with?

- Anonymous October 02, 2014 | Flag Reply

Comment hidden because of low score. Click to expand.

of 0 votes

The data structure I came up with was also the dynamic 2 dimensional array. For space and time its efficient. Space complexity is O(n+m) and the time complexity to find an element is constant.

- Anon October 04, 2014 | Flag

Comment hidden because of low score. Click to expand.

of 0 vote

We could use integers represented by m-bits. Each bit could represent a disease(boolean values). space complexity is now O(n). But time complexity is still O(n*m):
Let x represent the set of given diseases as bits. Bitwise and it with all 'n' numbers. count no. of set bits in O(m) for each patient.

If we could store the no. of bits stored in all 2^m bits stored before hand:
space complexity: O(n)+O(2^m).
time complexity: O(n)

- Anonymous October 02, 2014 | Flag Reply

Comment hidden because of low score. Click to expand.

of 0 vote

I think in this case, using a database and create a table would be a good choice, the table stores all the patients and what kind of diseases they have.
Or for another thought, using XML to store this kind of information?

- huangxiao09 October 02, 2014 | Flag Reply

Comment hidden because of low score. Click to expand.

of 0 votes

The interviewer was keen on knowing the specific datastructure I would use. I never asked him about the DB part, I thought any of our specific datastructures would be the answer.

- Anon October 04, 2014 | Flag

Comment hidden because of low score. Click to expand.

of 0 vote

Here is my answer,

I would create a Map of problems, which would have two keys "T" and "F" with their values being a List of patients corresponding to that problem's key.

To find the patients with all three problems, I would create an empty list. I would then get the list of patients with one problem (eg: m_diabetes.get("T")) and iterate over that list and adding a patient to the empty list if the patient is in the other two lists (ie: m_liver.get("T") and m_kidney.get("T")).

Initialize empty list of Patients, m_all
for Patient p in list of patients with problem m_diabetes
	if p in list of patients with problem m_kidney
		if p in list of patients in problem m_liver
			add p to list m_all

My rationale is this; using the problem possibilities as the keys and the patients as the values allows us not to need to iterate over every patient to see what problems they have. It also allows us to "short-circuit" our search if say the patient doesn't have one of the problems.

My algorithm complexity analysis is super-rusty, so I'm not sure how much better this is.

I'd appreciate any counter-arguments or proofs.

- Dave October 04, 2014 | Flag Reply

Comment hidden because of low score. Click to expand.

of 0 vote

As it is request for Datastructure...
We can use Ordered double Linked list.

struct FindPatientProbelm
{
long Patient_ID; //Unique...
string Patient_Name;
string Patient_Problem;
struct FindPatientProblem *Left;
sturct FindPatientProbem *Right;
};
typedef struct FindPatientProblem *Node;

Need to have three Pointers to traverse;
First --> Always pointing to first node.
Mid --> For implementing search pattern,
Last --> To point last node.

Applying binary search..

- Johnson October 09, 2014 | Flag Reply

Comment hidden because of low score. Click to expand.

of 0 vote

Asked me the same question, but there was little variance. He asked how soon can you find out a disease which has maximum patients infected. Next, find maximum number of patients who have common maximum common diseases.

- Akki November 06, 2014 | Flag Reply

Comment hidden because of low score. Click to expand.

of 0 vote

We can have two hash table
1. key as 'problem' value as List of 'patients'.
2. key as 'patient' value as List of 'problems'

Basically, its type of indexing we have create for making search easy in all possible scenario.

- Sandeep December 27, 2014 | Flag Reply

Comment hidden because of low score. Click to expand.

of 0 vote

How about a Bipartite graph where we have two sets. One of patients and other of problems. An edge exists between a node in patient set and a node in problem set if that patient has that problem. That way picking a problem from the set we can easily find out any 3 nodes it is connected to in patient set.

- Meena Chaudhary February 13, 2015 | Flag Reply

Comment hidden because of low score. Click to expand.

of 0 vote

The answer to this question is using a maxheap. You can find the top indicators for the n patients and if more patients are added, then maxheap can fix itself and sort the tree

- Divya Sridharan March 05, 2015 | Flag Reply

Comment hidden because of low score. Click to expand.

of 0 vote

Why not use HashMap<Person, HashSet<Disease>> map ?
Then we iterate through all the people, then it should be O(n).

if( map.get(i).contains(diabetes) && map.get(i).contains(..) && map.get(i).contains(..)){
res.add(i);
}

return res;

Since the .contains() of HashSet should run in O(1), then the whole TIME complexity can be O(n) ?? Just asking. Not sure.

- nickhuang0810 July 13, 2015 | Flag Reply

CareerCup

Epic Systems Interview Question for Software Engineer / Developers

Books

Videos

Resume Review

Mock Interviews