Epic Systems Interview Question
Software Engineer / DevelopersCountry: United States
Interview Type: Phone Interview
I think this is the best solution. It's basically a balanced binary tree. Each path represents a combinations of problems and leads to the list of patients with these problems.
Remember the question is to find all the patients with at least the 3 particular diseases. It actually takes O(2^(m-3)) to find all the patients from the tree.
I think the best solution depends. If n >> 2^m, then the tree solution is good. If
n is comparable or even less than 2^m, a simple 2D array may be better.
2 issues with your response:
1. The problem states that n is relatively large compared to m
2. I didn't state it, but the tree would ONLY store enough paths to divide all the patients.
A simplified example: If there are 100 diseases, but only 3 of those diseases are seen in all the patients. Some patients have disease A, and the rest have both disease B and C. Then the optimized tree will only have a single node, with 2 children (not 2^100 children). The left child is all patients with disease A, and the right child is all diseases WITHOUT disease A. In practice, trees like this will be very small, relative to your data set.
^counter argument for above suggestion:
The problem demands solution for "find all the patients who have at least the 3 problems from the set."
So if you create a 2D array, you can surely access required disease in O(1). However, while searching for one patient who has at least 3 diseases, you have to access all the disease columns in worst case. So for one patient, the time complexity in you DS will be O(m). Even though m is small, this seems a trivial solution. The time complexity in your case will be O(n x m) for all the patients. Please correct me if I am wrong.
Maintain a extra column (Tot_diseases) containing total number of diseases per patient. Increment this attribute while inserting 'T' for a particular disease. This way time complexity will be O(1).
@khunima can you explain the additional column comment in some detail how would it reduce the time complexity to O(1)? (based on the argument to which it is commented)
Explanation for what @khunmia is trying to say: maintain a count of the number of diseases each patient has using a separate column. So now there are m+1 columns(m diseases + one Tot_diseases column). When returning the patients with atleast 3 diseases, check if the value of the Tot_diseases column is >=3. If yes, return that patient. This makes it O(1) to check if a patient has atleast 3 diseases and the over all time O(n)
Two dimensional array of size n x m work?
n - no of patients
m - no of diseases.
since n > m, one might want to go over the n less no of times compared to m. For every patient (i.e. for each n) , inspect the required diseases ( at specific columns in the row for the patient) in constant time, as array look is O(1).
time complexity would be O(n).
space complexity would be O(n x m)
Hi Jobseeker,
What was the answer you came up with?
We could use integers represented by m-bits. Each bit could represent a disease(boolean values). space complexity is now O(n). But time complexity is still O(n*m):
Let x represent the set of given diseases as bits. Bitwise and it with all 'n' numbers. count no. of set bits in O(m) for each patient.
If we could store the no. of bits stored in all 2^m bits stored before hand:
space complexity: O(n)+O(2^m).
time complexity: O(n)
I think in this case, using a database and create a table would be a good choice, the table stores all the patients and what kind of diseases they have.
Or for another thought, using XML to store this kind of information?
Here is my answer,
I would create a Map of problems, which would have two keys "T" and "F" with their values being a List of patients corresponding to that problem's key.
To find the patients with all three problems, I would create an empty list. I would then get the list of patients with one problem (eg: m_diabetes.get("T")) and iterate over that list and adding a patient to the empty list if the patient is in the other two lists (ie: m_liver.get("T") and m_kidney.get("T")).
Initialize empty list of Patients, m_all
for Patient p in list of patients with problem m_diabetes
if p in list of patients with problem m_kidney
if p in list of patients in problem m_liver
add p to list m_all
My rationale is this; using the problem possibilities as the keys and the patients as the values allows us not to need to iterate over every patient to see what problems they have. It also allows us to "short-circuit" our search if say the patient doesn't have one of the problems.
My algorithm complexity analysis is super-rusty, so I'm not sure how much better this is.
I'd appreciate any counter-arguments or proofs.
As it is request for Datastructure...
We can use Ordered double Linked list.
struct FindPatientProbelm
{
long Patient_ID; //Unique...
string Patient_Name;
string Patient_Problem;
struct FindPatientProblem *Left;
sturct FindPatientProbem *Right;
};
typedef struct FindPatientProblem *Node;
Need to have three Pointers to traverse;
First --> Always pointing to first node.
Mid --> For implementing search pattern,
Last --> To point last node.
Applying binary search..
How about a Bipartite graph where we have two sets. One of patients and other of problems. An edge exists between a node in patient set and a node in problem set if that patient has that problem. That way picking a problem from the set we can easily find out any 3 nodes it is connected to in patient set.
Why not use HashMap<Person, HashSet<Disease>> map ?
Then we iterate through all the people, then it should be O(n).
if( map.get(i).contains(diabetes) && map.get(i).contains(..) && map.get(i).contains(..)){
res.add(i);
}
return res;
Since the .contains() of HashSet should run in O(1), then the whole TIME complexity can be O(n) ?? Just asking. Not sure.
The answer is a K-D Tree: Split on the m problems. Each leaf represents a combination of the problems, and stores a list of references to patients.
- Tsuna00 October 03, 2014Assuming that m is significantly smaller than n:
Time to search is O(1) - basically the height of the tree, which is m
Space is O(n) since no patient objects have to be duplicated
Edit: This is assumed to be an OPTIMIZED tree. Meaning: the tree would ONLY store enough paths to divide all the patients.
A simplified example: There are 100 diseases, and only 3 of those diseases are seen in all the patients. Some patients have only disease A, and the rest have both disease B and C. Then, the tree will only have a single node, with 2 children (not 2^100 children). The left child stores all patients with disease A, and the right child stores all diseases WITHOUT disease A. In practice, trees like this will be very small, relative to your data set. They're used in machine learning and indexing for geographic related searches.