First, ask the interviewer to clarify:
- How many log entries per day, how many pages, and how many users are there? Can the data fit in a traditional database, or does it require distributed storage? I assume the analyzed results can be loaded into an RDBMS.
- Is real-time analysis required, or can we use a Hadoop job for ETL? I assume ETL is allowed.
- Should we optimize for query performance or write performance? I assume query performance is more critical.

The solution is to write the raw log data to HDFS, using Apache Flume to collect it from multiple data centers. A Hadoop job then analyzes the raw data and loads the results into an RDBMS for querying.

In the RDBMS, a table with columns "page, user-count, visit-count" supports the first two queries. For the third query, we can create a table with columns "page, user-id, count".
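
To make the two tables concrete, here is a minimal in-memory sketch in Java of the aggregation the Hadoop job would perform before loading. The names LogAggregator and LogEntry and the sample data are made up for illustration; a real job would read from HDFS and write to the RDBMS rather than build maps in memory.

```java
import java.util.*;

class LogAggregator {
    // Hypothetical raw log entry: one visit to a page by one user.
    static final class LogEntry {
        final String page;
        final String userId;
        LogEntry(String page, String userId) {
            this.page = page;
            this.userId = userId;
        }
    }

    // Mirrors the table "page, user-id, count": page -> (user-id -> visit count).
    static Map<String, Map<String, Integer>> perUserCounts(List<LogEntry> log) {
        Map<String, Map<String, Integer>> result = new TreeMap<>();
        for (LogEntry e : log) {
            result.computeIfAbsent(e.page, p -> new TreeMap<>())
                  .merge(e.userId, 1, Integer::sum);
        }
        return result;
    }

    // Mirrors the table "page, user-count, visit-count":
    // page -> int[]{userCount, visitCount}.
    static Map<String, int[]> pageSummary(Map<String, Map<String, Integer>> perUser) {
        Map<String, int[]> summary = new TreeMap<>();
        for (Map.Entry<String, Map<String, Integer>> e : perUser.entrySet()) {
            int visits = 0;
            for (int c : e.getValue().values()) {
                visits += c;
            }
            summary.put(e.getKey(), new int[] {e.getValue().size(), visits});
        }
        return summary;
    }

    public static void main(String[] args) {
        List<LogEntry> log = Arrays.asList(
            new LogEntry("/home", "alice"),
            new LogEntry("/home", "bob"),
            new LogEntry("/home", "alice"),
            new LogEntry("/about", "bob"));
        Map<String, Map<String, Integer>> perUser = perUserCounts(log);
        pageSummary(perUser).forEach((page, s) ->
            System.out.println(page + " users=" + s[0] + " visits=" + s[1]));
    }
}
```

The summary table is derived from the per-user table here, so the raw log is scanned only once; whether that matters depends on the answers to the clarifying questions above.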

- lngbrc February 05, 2015
In Java, you can represent the data with the following class and answer the queries from a pool of page objects.

public class Page {
    String url;            // page URL
    String user;           // user who visited the page
    int noOfTimesVisited;  // number of visits by this user to this page
}
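
A rough sketch of how queries over such a pool might look, assuming one object per (url, user) pair. The wrapper class PagePool, the three query methods, and the sample data are hypothetical, and the nested Page class just repeats the same three fields so the example compiles on its own.

```java
import java.util.*;

class PagePool {
    // Same three fields as the class above; one object per (url, user) pair.
    static final class Page {
        final String url;
        final String user;
        final int noOfTimesVisited;
        Page(String url, String user, int noOfTimesVisited) {
            this.url = url;
            this.user = user;
            this.noOfTimesVisited = noOfTimesVisited;
        }
    }

    // Total number of visits to a URL, summed over all users.
    static int totalVisits(List<Page> pool, String url) {
        return pool.stream()
                   .filter(p -> p.url.equals(url))
                   .mapToInt(p -> p.noOfTimesVisited)
                   .sum();
    }

    // Number of distinct users who visited a URL.
    static long distinctUsers(List<Page> pool, String url) {
        return pool.stream()
                   .filter(p -> p.url.equals(url))
                   .map(p -> p.user)
                   .distinct()
                   .count();
    }

    // Number of visits to a URL by one particular user.
    static int visitsByUser(List<Page> pool, String url, String user) {
        return pool.stream()
                   .filter(p -> p.url.equals(url) && p.user.equals(user))
                   .mapToInt(p -> p.noOfTimesVisited)
                   .sum();
    }

    public static void main(String[] args) {
        List<Page> pool = Arrays.asList(
            new Page("/home", "alice", 2),
            new Page("/home", "bob", 1),
            new Page("/about", "bob", 4));
        System.out.println(totalVisits(pool, "/home"));
        System.out.println(distinctUsers(pool, "/home"));
        System.out.println(visitsByUser(pool, "/about", "bob"));
    }
}
```

Each query is a linear scan of the pool, so this only works while the pool fits in memory.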


All of the queries can also be converted to Hadoop MapReduce jobs:
Mapper Key - URL
Mapper Value - UserName (emit 1 instead if you only want to count the users on a URL)
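
The key/value choice above can be sketched in plain Java without the Hadoop API. This only simulates the map, shuffle, and reduce phases in memory; the class name UrlUserMapReduce and the log line format "<userName> <url>" are assumptions made for the example.

```java
import java.util.*;

class UrlUserMapReduce {
    // Map phase: emit one (URL, userName) pair per log line.
    // Assumed line format: "<userName> <url>".
    static Map.Entry<String, String> map(String logLine) {
        String[] parts = logLine.trim().split("\\s+");
        return new AbstractMap.SimpleEntry<>(parts[1], parts[0]);
    }

    // Shuffle phase: group the emitted user names by URL.
    static Map<String, List<String>> shuffle(List<String> logLines) {
        Map<String, List<String>> grouped = new TreeMap<>();
        for (String line : logLines) {
            Map.Entry<String, String> kv = map(line);
            grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>())
                   .add(kv.getValue());
        }
        return grouped;
    }

    // Reduce phase for one URL: returns {distinctUsers, totalVisits}.
    static int[] reduce(List<String> users) {
        return new int[] {new HashSet<>(users).size(), users.size()};
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "alice /home", "bob /home", "alice /home", "bob /about");
        shuffle(lines).forEach((url, users) -> {
            int[] r = reduce(users);
            System.out.println(url + " users=" + r[0] + " visits=" + r[1]);
        });
    }
}
```

Emitting the user name as the value lets the reducer count both distinct users and total visits; emitting 1 would be enough for a plain count but loses the information needed to deduplicate users.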

- pk February 05, 2015

