Amazon Interview Question
SDE-2 | Country: India
One possible solution:
1- Each application may log different parameters in its log file, depending on its requirements.
2- Each log file has an APPLICATION_ID that is unique to the application.
3- Apart from the application ID, each log entry has a LOG_LEVEL, DATE, TIME_STAMP, BODY, and STATUS.
4- Store the logs from each application in the distributed log system.
5- Map each application_name to its application_id in the system; we can then query all logs by application_name.
6- We can also query logs by TIME_STAMP and LOG_LEVEL.
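The record layout and queries above can be sketched in a few lines of Python. The field and function names here are illustrative, not part of the question; a real system would back this with a distributed store rather than an in-memory list.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass(frozen=True)
class LogRecord:
    """One log entry, following the fields listed above."""
    application_id: str   # unique per application
    log_level: str        # e.g. "INFO", "WARN", "ERROR"
    timestamp: datetime   # DATE and TIME_STAMP combined
    body: str
    status: str

# Mapping of application_name -> application_id, so logs can be
# queried by name (step 5 above). Entries are illustrative.
app_registry = {"checkout-service": "app-001"}

def query(logs, app_name, level=None, since=None):
    """Filter logs by application name, optionally by level and time."""
    app_id = app_registry[app_name]
    return [r for r in logs
            if r.application_id == app_id
            and (level is None or r.log_level == level)
            and (since is None or r.timestamp >= since)]
```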
I am wondering how you would design the schema to store the log records so that queries by timestamp, by log level, and by application type all succeed. For the timestamp, I believe the logs can be ordered (via re-timestamping) at the time they are retrieved from Kafka (or another queuing system), rather than using the timestamp provided by the application server.
Once the logs are pushed to the log collector (Logstash etc.), some real-time analytics or decisions can be made per log instance (e.g. event triggers, actions on critical logs, raising alerts). Apart from these real-time actions, the logs need to be stored and queried.
Let's suppose the logs are stored in a database (RDBMS or NoSQL?); these need to be searched to answer the questions this problem poses.
How shall we store the log records to allow fast searches/lookups?
Here is a proposal for a solution:
1) Logs will be kept in clusters of in-memory databases, each cluster dedicated to a group of application servers running application X.
2) Each cluster will hold a data structure storing the logs for applications of a particular type.
3) The in-memory structure holding logs for application type X will be based on a set of N associative caches, with one cache per log level. The cache entries will be time ranges (say, minute-based), and each entry will rely on suffix trees for fast text search.
There will be one such in-memory structure per day.
This will allow a fast search by application type, filtered by level, timestamp, particular text, or any combination of the above.
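A minimal sketch of that per-day structure, assuming one instance per application type per day. Plain substring matching stands in for the suffix trees the proposal calls for; the class and method names are my own.

```python
from collections import defaultdict
from datetime import datetime

class DayLogIndex:
    """One in-memory index per day per application type.

    Buckets messages by log level, then by minute of day, as in
    points 2) and 3) above.
    """
    def __init__(self):
        # level -> minute-of-day -> list of message strings
        self.buckets = defaultdict(lambda: defaultdict(list))

    def add(self, ts: datetime, level: str, message: str):
        minute = ts.hour * 60 + ts.minute
        self.buckets[level][minute].append(message)

    def search(self, level=None, minute_range=None, text=None):
        """Filter by level, minute range, and/or substring, in any combination."""
        levels = [level] if level else list(self.buckets)
        out = []
        for lv in levels:
            for minute, msgs in self.buckets[lv].items():
                if minute_range and not (minute_range[0] <= minute <= minute_range[1]):
                    continue
                out.extend(m for m in msgs if text is None or text in m)
        return out
```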
Piped grep commands aside, you're probably better off with either an Amazon Kinesis -> Spark -> Elasticsearch (or similar) architecture, or with a 3rd-party solution like Loggly. That said, here's a typical recommended approach.
- Mike Sparr - www.goomzee.com, May 17, 2017
1. From an initial design perspective, each application's logger should include the application and component name, along with the timestamp, level, and message.
2. Either add a log "wrapper" to each application that persists its logs to a queue, cache, or central data mart. An alternative approach is to make sure log rotation is enabled on the server, then tail the logs and pipe them to a "wrapper" that exports them to a central location. Another alternative (my preference) is to use Logstash to consume log messages and ship them to an external store, usually Kafka or Kinesis topics.
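The log "wrapper" idea can be sketched as a custom `logging.Handler` that pushes each record onto a queue standing in for a Kafka/Kinesis topic. The handler class and field names are my own; a real shipper would batch, serialize, and retry.

```python
import logging
import queue

class QueueShipHandler(logging.Handler):
    """Formats each log record as a dict and pushes it onto a queue
    (a stand-in here for a Kafka/Kinesis producer)."""
    def __init__(self, q):
        super().__init__()
        self.q = q

    def emit(self, record):
        self.q.put({
            "application_id": getattr(record, "application_id", "unknown"),
            "level": record.levelname,
            "timestamp": record.created,
            "body": record.getMessage(),
        })

# Wire the wrapper into an application logger.
shipping_queue = queue.Queue()
logger = logging.getLogger("app-001")
logger.setLevel(logging.INFO)
logger.addHandler(QueueShipHandler(shipping_queue))
```

Downstream consumers would then read from the queue/topic and normalize, as step 3 describes.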
3. Consumer groups may grok/normalize the various log entries if they are not already standardized, and republish the normalized versions to another "topic" in Kafka or Kinesis. You could also use a stream processor built into Kafka, or a 3rd-party one like Spark, if necessary.
4. Again leverage Logstash to consume the normalized topics and publish to Elasticsearch or another Lucene-based search engine with a document store. Be sure to add index rotation to the data mart later (perhaps dropping records beyond 15 days) or your cluster will become massive.
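The index-rotation policy in step 4 amounts to time-based index names plus a retention check. A small sketch, assuming a `logs-<app>-YYYY.MM.dd` naming convention in the spirit of Logstash's default daily indices (the helper names are my own):

```python
from datetime import date, datetime, timedelta

def index_name(app: str, day: date) -> str:
    """Daily index name, e.g. 'logs-app-001-2017.05.17'."""
    return f"logs-{app}-{day:%Y.%m.%d}"

def is_expired(name: str, today: date, retain_days: int = 15) -> bool:
    """Parse the trailing date from an index name and check it
    against the retention window (15 days, per the advice above)."""
    day = datetime.strptime(name.rsplit("-", 1)[-1], "%Y.%m.%d").date()
    return (today - day).days > retain_days
```

A nightly job would list indices, apply `is_expired`, and delete the old ones.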
5. Spin up Kibana to provide a searchable browser-based interface with various dashboards, visualizations, and time-series data analysis plugins.
6. Add server-side alerting using Nagios or Monit, or use a commercial product like Elastic's X-Pack "Watcher". Personally I use Monit with a custom shell script that triggers a webhook message to a Slack channel for various system alerts.
;-) Enjoy!