Amazon Interview Question
SDE-2 | Country: India
One possible solution:
1- Each application may log different parameters in its log file, depending on its requirements.
2- Each log file has an APPLICATION_ID that is unique to the application.
3- Apart from the application ID, each log entry has a LOG_LEVEL, DATE, TIME_STAMP, BODY, and STATUS.
4- Store the logs from each application in the distributed log system.
5- Map each application_name to its application_id in the system; we can then query all logs by application_name.
6- We can also query logs by TIME_STAMP and LOG_LEVEL.
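The record layout and queries above can be sketched in a few lines of Python. The field and function names here are illustrative, not part of the question; a real system would back this with a distributed store rather than an in-memory list.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass(frozen=True)
class LogRecord:
    """One log entry, following the fields listed above."""
    application_id: str   # unique per application
    log_level: str        # e.g. "INFO", "WARN", "ERROR"
    timestamp: datetime   # DATE and TIME_STAMP combined
    body: str
    status: str

# Mapping of application_name -> application_id, so logs can be
# queried by name (step 5 above). Entries are illustrative.
app_registry = {"checkout-service": "app-001"}

def query(logs, app_name, level=None, since=None):
    """Filter logs by application name, optionally by level and time."""
    app_id = app_registry[app_name]
    return [r for r in logs
            if r.application_id == app_id
            and (level is None or r.log_level == level)
            and (since is None or r.timestamp >= since)]
```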
I am wondering how you would design the schema to store the log records so that queries by timestamp, by log level, and by application type all succeed. For the timestamp, I believe the logs can be ordered (via re-timestamping) at the time they are retrieved from Kafka (or another queuing system), rather than using the timestamp provided by the application server.
Once the logs are pushed to the log collector (Logstash etc.), some real-time analytics or decisions can be made per log instance (e.g. event triggers, actions on critical logs, raising alerts). Apart from these real-time actions, the logs need to be stored and queried.
Let's suppose the logs are stored in a database (RDBMS or NoSQL?); these need to be searched to answer the questions this problem poses.
How shall we store the log records to allow fast searches/lookups?
Here is a proposal for a solution:
1) Logs will be kept in clusters of in-memory databases, each cluster dedicated to a group of application servers running application X.
2) Each cluster will hold a data structure storing the logs for applications of a particular type.
3) The in-memory structure holding logs for application type X will be based on a set of N associative caches, with one cache per log level. The cache entries will be time ranges (say, minute-based), and each entry will rely on suffix trees for fast text search.
There will be one such in-memory structure per day.
This will allow a fast search by application type, filtered by level, timestamp, particular text, or any combination of the above.
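A minimal sketch of that per-day structure, assuming one instance per application type per day. Plain substring matching stands in for the suffix trees the proposal calls for; the class and method names are my own.

```python
from collections import defaultdict
from datetime import datetime

class DayLogIndex:
    """One in-memory index per day per application type.

    Buckets messages by log level, then by minute of day, as in
    points 2) and 3) above.
    """
    def __init__(self):
        # level -> minute-of-day -> list of message strings
        self.buckets = defaultdict(lambda: defaultdict(list))

    def add(self, ts: datetime, level: str, message: str):
        minute = ts.hour * 60 + ts.minute
        self.buckets[level][minute].append(message)

    def search(self, level=None, minute_range=None, text=None):
        """Filter by level, minute range, and/or substring, in any combination."""
        levels = [level] if level else list(self.buckets)
        out = []
        for lv in levels:
            for minute, msgs in self.buckets[lv].items():
                if minute_range and not (minute_range[0] <= minute <= minute_range[1]):
                    continue
                out.extend(m for m in msgs if text is None or text in m)
        return out
```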
Piped grep commands aside, you're probably better off with either an Amazon Kinesis -> Spark -> Elasticsearch (or similar) architecture, or with a 3rd-party solution like Loggly. That said, here's a typical recommended approach.
- Mike Sparr - www.goomzee.com, May 17, 2017
1. From an initial design perspective, each application's logger should include the application and component name, along with the timestamp, level, and message.
2. Either add a log "wrapper" to each application that persists its logs to a queue, cache, or central data mart. An alternative approach is to make sure log rotation is enabled on the server, then tail the logs and pipe them to a "wrapper" that exports them to a central location. Another alternative (my preference) is to use Logstash to consume log messages and ship them to an external store, usually Kafka or Kinesis topics.
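The log "wrapper" idea can be sketched as a custom `logging.Handler` that pushes each record onto a queue standing in for a Kafka/Kinesis topic. The handler class and field names are my own; a real shipper would batch, serialize, and retry.

```python
import logging
import queue

class QueueShipHandler(logging.Handler):
    """Formats each log record as a dict and pushes it onto a queue
    (a stand-in here for a Kafka/Kinesis producer)."""
    def __init__(self, q):
        super().__init__()
        self.q = q

    def emit(self, record):
        self.q.put({
            "application_id": getattr(record, "application_id", "unknown"),
            "level": record.levelname,
            "timestamp": record.created,
            "body": record.getMessage(),
        })

# Wire the wrapper into an application logger.
shipping_queue = queue.Queue()
logger = logging.getLogger("app-001")
logger.setLevel(logging.INFO)
logger.addHandler(QueueShipHandler(shipping_queue))
```

Downstream consumers would then read from the queue/topic and normalize, as step 3 describes.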
3. Consumer groups may grok/normalize the various log entries if they are not already standardized, and republish the normalized versions to another "topic" in Kafka or Kinesis. You could also use a stream processor built into Kafka, or a 3rd-party one like Spark, if necessary.
4. Again leverage Logstash to consume the normalized topics and publish to Elasticsearch or another Lucene-based search engine with a document store. Be sure to add index rotation to the data mart later (perhaps dropping records beyond 15 days) or your cluster will become massive.
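The index-rotation policy in step 4 amounts to time-based index names plus a retention check. A small sketch, assuming a `logs-<app>-YYYY.MM.dd` naming convention in the spirit of Logstash's default daily indices (the helper names are my own):

```python
from datetime import date, datetime, timedelta

def index_name(app: str, day: date) -> str:
    """Daily index name, e.g. 'logs-app-001-2017.05.17'."""
    return f"logs-{app}-{day:%Y.%m.%d}"

def is_expired(name: str, today: date, retain_days: int = 15) -> bool:
    """Parse the trailing date from an index name and check it
    against the retention window (15 days, per the advice above)."""
    day = datetime.strptime(name.rsplit("-", 1)[-1], "%Y.%m.%d").date()
    return (today - day).days > retain_days
```

A nightly job would list indices, apply `is_expired`, and delete the old ones.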
5. Spin up Kibana to provide a searchable browser-based interface with various dashboards, visualizations, and time-series data analysis plugins.
6. Add server-side alerting using Nagios or Monit, or use a commercial product like Elastic's X-Pack "Watcher". Personally I use Monit with a custom shell script that triggers a webhook message to a Slack channel for various system alerts.
;-) Enjoy!