Elasticsearch is a powerful software solution designed to quickly search information in a vast range of data. Combined with Logstash and Kibana, it forms the informally named "ELK stack," which is often used to collect, temporarily store, analyze, and visualize log data. A few other pieces of software are usually needed as well, such as Filebeat to send the logs from the server to Logstash, and Elastalert to generate alerts based on the result of some analysis run on the data stored in Elasticsearch.

The ELK Stack is Powerful, But…

My experience managing logs with ELK has been quite mixed. On the one hand, it is very powerful and its range of features is impressive. On the other hand, it is tricky to set up and cumbersome to maintain.

The fact is that Elasticsearch is very good in general and can be used in a wide variety of scenarios; it can even be used as a search engine! But since it is not specialized for managing log data, using it for that purpose requires extra configuration work to tailor its behavior to the specific needs of log management.

Setting up the ELK cluster was quite tricky and required me to play around with a number of parameters in order to finally get it up and running. Then came the work of configuring it. In my case, I had five different pieces of software to configure (Filebeat, Logstash, Elasticsearch, Kibana, and Elastalert). This can be quite a tedious job, as I had to read through the documentation and debug whichever element of the chain wasn't talking to the next one. Even after you finally get your cluster up and running, you still need to perform routine maintenance operations on it: patching, upgrading the OS packages, checking CPU, RAM, and disk usage, making minor adjustments as required, and so on.

My whole ELK stack stopped working after a Logstash update. Upon closer examination, it turned out that, for some reason, the ELK developers had decided to change a keyword in the configuration file and pluralize it. That was the last straw, and I decided to look for a better solution (at least a better solution for my particular needs).

I wanted to store the logs generated by Apache and by various PHP and Node apps, and to parse them for patterns indicating a bug in the software. The solution I found was the following:

  • Install CloudWatch Agent on the target.
  • Configure CloudWatch Agent to ship the logs to CloudWatch Logs (a sketch of this configuration follows the list).
  • Trigger the invocation of Lambda functions to process the logs.
  • The Lambda function posts a message to a Slack channel if a pattern is found.
  • Where possible, apply a filter to the CloudWatch log groups to avoid calling the Lambda function for every single log (which could ramp up the costs very quickly).
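
To give an idea of what the second step looks like, here is a minimal sketch of the "logs" section of the CloudWatch Agent configuration, built as a Python dict and dumped to the JSON file the agent reads. The Apache file paths and the log group names are placeholders: they need to match your distribution and the log groups defined in the CloudFormation template further down.

# Minimal sketch of a CloudWatch Agent "logs" configuration (placeholders only).
# The agent reads this as JSON, typically from
# /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.json.
import json

agent_config = {
    "logs": {
        "logs_collected": {
            "files": {
                "collect_list": [
                    {
                        "file_path": "/var/log/apache2/access.log",  # adjust to your distro
                        "log_group_name": "apache-access",           # must match the access log group
                        "log_stream_name": "{instance_id}",
                    },
                    {
                        "file_path": "/var/log/apache2/error.log",
                        "log_group_name": "apache-error",            # must match the error log group
                        "log_stream_name": "{instance_id}",
                    },
                ]
            }
        }
    }
}

with open("amazon-cloudwatch-agent.json", "w") as f:
    json.dump(agent_config, f, indent=2)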

And, at a high level, that’s it! A 100% serverless solution that will work fine without any need for maintenance and that will scale well without any additional effort. Compared to a cluster of servers, this serverless solution has many advantages:

  • In essence, all routine maintenance operations that you would periodically perform on your cluster servers are now the responsibility of the cloud provider. Any underlying server will be patched, upgraded and maintained for you without you even knowing it.
  • You don’t need to monitor your cluster anymore, and you delegate all scaling issues to the cloud provider. Indeed, a serverless setup such as the one described above will scale automatically without you having to do anything!
  • The solution described above requires less configuration, and it is very unlikely that a breaking change will be brought into the configuration formats by the cloud provider.
  • Finally, it is quite easy to write some CloudFormation templates to put all that into infrastructure-as-code. Doing the same for a full ELK cluster would require a lot more work.

Configuring Slack Alerts

So now let’s get into the details! Let’s look at what the CloudFormation template for such a setup looks like, complete with Slack webhooks for alerting engineers. We need to configure all the Slack settings first, so let’s dive into them.

AWSTemplateFormatVersion: 2010-09-09

Description: Setup log processing

Parameters:
  SlackWebhookHost:
    Type: String
    Description: Host name for Slack web hooks
    Default: hooks.slack.com

  SlackWebhookPath:
    Type: String
    Description: Path part of the Slack webhook URL
    Default: /services/YOUR/SLACK/WEBHOOK

For this, you will need to set up your Slack workspace; check out this WebHooks for Slack guide for additional info.

Once you have created your Slack app and configured an incoming hook, its URL will become a parameter of the CloudFormation stack.
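
If you want to verify the webhook itself before deploying anything, you can post a test message to it by hand. This is a quick sketch using the same http.client approach as the Lambda functions below; the host and path values are placeholders to replace with your actual webhook values.

# Quick manual test of a Slack incoming webhook (placeholder values).
import json
from http.client import HTTPSConnection

slack_host = "hooks.slack.com"
slack_path = "/services/YOUR/SLACK/WEBHOOK"  # replace with your real webhook path

cnx = HTTPSConnection(slack_host, timeout=5)
cnx.request("POST", slack_path, json.dumps({"text": "Hello from the log processing stack"}))
resp = cnx.getresponse()
print(resp.status, resp.read())  # a working hook should reply with HTTP 200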

Resources:
  ApacheAccessLogGroup:
    Type: AWS::Logs::LogGroup
    Properties:
      RetentionInDays: 100  # Or whatever is good for you

  ApacheErrorLogGroup:
    Type: AWS::Logs::LogGroup
    Properties:
      RetentionInDays: 100  # Or whatever is good for you

Here we created two log groups: one for the Apache access logs, the other for the Apache error logs.

I did not configure any lifecycle mechanism for the log data because it is out of the scope of this article. In practice, you would probably want to have a shortened retention window and to design S3 lifecycle policies to move them to Glacier after a certain period of time.
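
If you do go down that road, the CloudWatch Logs export API can copy older log data to an S3 bucket, where a lifecycle rule can then transition it to Glacier. The following is only a sketch: the log group name and bucket name are placeholders, and it assumes the bucket's policy allows CloudWatch Logs to write to it.

# Sketch: export the last day of a log group to S3 (placeholders throughout).
import time
import boto3

logs = boto3.client("logs")

now_ms = int(time.time() * 1000)
one_day_ms = 24 * 3600 * 1000

logs.create_export_task(
    taskName="archive-apache-access-logs",
    logGroupName="apache-access",   # placeholder: use your real log group name
    fromTime=now_ms - one_day_ms,
    to=now_ms,
    destination="my-log-archive",   # placeholder bucket name
    destinationPrefix="apache-access",
)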

Lambda Function to Process Access Logs

Now let’s implement the Lambda function that will process the Apache access logs.

BasicLambdaExecutionRole:
  Type: AWS::IAM::Role
  Properties:
    AssumeRolePolicyDocument:
      Version: 2012-10-17
      Statement:
        - Effect: Allow
          Principal:
            Service: lambda.amazonaws.com
          Action: sts:AssumeRole
    ManagedPolicyArns:
      - arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole

Here we created an IAM role that will be attached to the Lambda functions, to allow them to perform their duties. In effect, AWSLambdaBasicExecutionRole is (despite its name) an IAM managed policy provided by AWS. It only allows the Lambda function to create its log group and the log streams within that group, and then to send its own logs to CloudWatch Logs.
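
For reference, the permissions granted by that managed policy boil down to something like the following statement (shown here as a Python dict for readability; this is an approximation, not the authoritative definition of the managed policy).

# Rough equivalent of the AWSLambdaBasicExecutionRole managed policy:
# it only lets the function write its own logs to CloudWatch Logs.
basic_execution_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "logs:CreateLogGroup",
                "logs:CreateLogStream",
                "logs:PutLogEvents",
            ],
            "Resource": "*",
        }
    ],
}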

ProcessApacheAccessLogFunction:
  Type: AWS::Lambda::Function
  Properties:
    Handler: index.handler
    Role: !GetAtt BasicLambdaExecutionRole.Arn
    Runtime: python3.7
    Timeout: 10
    Environment:
      Variables:
        SLACK_WEBHOOK_HOST: !Ref SlackWebhookHost
        SLACK_WEBHOOK_PATH: !Ref SlackWebhookPath
    Code:
      ZipFile: |
        import base64
        import gzip
        import json
        import os
        from http.client import HTTPSConnection

        def handler(event, context):
            tmp = event['awslogs']['data']
            # `awslogs.data` is base64-encoded gzip'ed JSON
            tmp = base64.b64decode(tmp)
            tmp = gzip.decompress(tmp)
            tmp = json.loads(tmp)
            events = tmp['logEvents']
            for event in events:
                raw_log = event['message']
                log = json.loads(raw_log)
                if log['status'][0] == "5":
                    # This is a 5XX status code
                    print(f"Received an Apache access log with a 5XX status code: {raw_log}")
                    slack_host = os.getenv('SLACK_WEBHOOK_HOST')
                    slack_path = os.getenv('SLACK_WEBHOOK_PATH')
                    print(f"Sending Slack post to: host={slack_host}, path={slack_path}, content={raw_log}")
                    cnx = HTTPSConnection(slack_host, timeout=5)
                    cnx.request("POST", slack_path, json.dumps({'text': raw_log}))
                    # It's important to read the response; if the cnx is closed too quickly, Slack might not post the msg
                    resp = cnx.getresponse()
                    resp_content = resp.read()
                    resp_code = resp.status
                    assert resp_code == 200
So, here we define a Lambda function to process the Apache access logs. Note that I am not using the common log format, which is Apache’s default. I configured the access log format like so (and you will notice that it essentially generates logs formatted as JSON, which makes further processing much easier):

LogFormat "{\"vhost\": \"%v:%p\", \"client\": \"%a\", \"user\": \"%u\", \"timestamp\": \"%{%Y-%m-%dT%H:%M:%S}t\", \"request\": \"%r\", \"status\": \"%>s\", \"size\": \"%O\", \"referer\": \"%{Referer}i\", \"useragent\": \"%{User-Agent}i\"}" json

This Lambda function is written in Python 3. 它接收从CloudWatch发送的日志行,并可以搜索模式. In the example above, it just detects HTTP requests that resulted in a 5XX status code and posts a message to a Slack channel.

In terms of pattern detection, you can do whatever you like. And the fact that it is a real programming language (Python), rather than just regex patterns in a Logstash or Elastalert configuration file, gives you plenty of opportunities to implement complex pattern recognition.
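
If you want to check that kind of logic locally before deploying, you can feed the handler a synthetic CloudWatch Logs event. This is just a sketch: it builds the base64-encoded, gzip'ed payload that CloudWatch Logs would deliver, using a made-up access-log line in the JSON format shown above. With the placeholder webhook values, the final assert in the handler will fail; point them at a real webhook if you want the Slack post to go through.

# Sketch of a local test for the handler above (run outside AWS).
import base64
import gzip
import json
import os

os.environ['SLACK_WEBHOOK_HOST'] = 'hooks.slack.com'
os.environ['SLACK_WEBHOOK_PATH'] = '/services/YOUR/SLACK/WEBHOOK'  # placeholder

sample_log = {
    "vhost": "example.com:443", "client": "203.0.113.7", "user": "-",
    "timestamp": "2020-01-01T12:00:00", "request": "GET /broken HTTP/1.1",
    "status": "503", "size": "512", "referer": "-", "useragent": "curl/7.64.1",
}

payload = {"logEvents": [{"message": json.dumps(sample_log)}]}
data = base64.b64encode(gzip.compress(json.dumps(payload).encode())).decode()

# `handler` is the function defined in the ZipFile above.
handler({"awslogs": {"data": data}}, None)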

Revision Control

A quick word about revision control: I found that having the code inline in CloudFormation templates for small utility Lambda functions such as this one is quite acceptable and convenient. Of course, for a larger project involving many Lambda functions and layers, this would most probably be inconvenient and you would need to use SAM.

ApacheAccessLogFunctionPermission:
  Type: AWS::Lambda::Permission
  Properties:
    FunctionName: !Ref ProcessApacheAccessLogFunction
    Action: lambda:InvokeFunction
    Principal: logs.amazonaws.com
    SourceArn: !Sub arn:aws:logs:${AWS::Region}:${AWS::AccountId}:log-group:*

The code above grants CloudWatch Logs permission to invoke the Lambda function. One word of caution: I found that using the SourceAccount property can lead to conflicts with the SourceArn.

Generally speaking, I would suggest not including it when the service calling the Lambda function is in the same AWS account. The SourceArn will prevent other accounts from calling the Lambda function anyway.

ApacheAccessLogSubscriptionFilter:
  Type: AWS::Logs::SubscriptionFilter
  DependsOn: ApacheAccessLogFunctionPermission
  Properties:
    LogGroupName: !Ref ApacheAccessLogGroup
    DestinationArn: !GetAtt ProcessApacheAccessLogFunction.Arn
    FilterPattern: "{$.status = 5*}"

The subscription filter resource is the link between CloudWatch Logs and Lambda. Here, logs sent to the ApacheAccessLogGroup will be forwarded to the Lambda function we defined above, but only those logs that pass the filter pattern. Here, the filter pattern expects some JSON as input (the filter pattern starts with '{' and ends with '}'), and will match the log entry only if it has a field status that starts with "5".

This means that we call the Lambda function only when the HTTP status code returned by Apache is a 5XX code, which usually means something quite bad is going on. This ensures that we don’t call the Lambda function too often and thereby avoid unnecessary costs.

More information on filter patterns can be found in the Amazon CloudWatch documentation. CloudWatch filter patterns are quite good, although obviously not as powerful as Grok.
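
Metric filters and subscription filters use the same filter pattern syntax, so one way to sanity-check a pattern against sample log lines without deploying anything is boto3's test_metric_filter call. A rough sketch, using made-up sample lines:

# Sketch: check which sample log lines a filter pattern would match.
import json
import boto3

logs = boto3.client("logs")

samples = [
    json.dumps({"status": "200", "request": "GET / HTTP/1.1"}),
    json.dumps({"status": "503", "request": "GET /broken HTTP/1.1"}),
]

resp = logs.test_metric_filter(
    filterPattern="{$.status = 5*}",
    logEventMessages=samples,
)
print(resp["matches"])  # the sample lines that matched the pattern, if any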

Note the DependsOn field, which ensures CloudWatch Logs can actually call the Lambda function before the subscription is created. This is just a cherry on the cake; it is most probably unnecessary, as in a real-case scenario Apache would probably not receive requests until at least a few seconds after launch (e.g., the time needed to link the EC2 instance with a load balancer and for the load balancer to recognize the instance as healthy).

Lambda Function to Process Error Logs

Now let’s look at the Lambda function that processes the Apache error logs.

ProcessApacheErrorLogFunction:
  Type: AWS::Lambda::Function
  Properties:
    Handler: index.handler
    Role: !GetAtt BasicLambdaExecutionRole.Arn
    Runtime: python3.7
    Timeout: 10
    Environment:
      Variables:
        SLACK_WEBHOOK_HOST: !Ref SlackWebhookHost
        SLACK_WEBHOOK_PATH: !Ref SlackWebhookPath
    Code:
      ZipFile: |
        import base64
        import gzip
        import json
        import os
        from http.client import HTTPSConnection

        def handler(event, context):
            tmp = event['awslogs']['data']
            # `awslogs.data` is base64-encoded gzip'ed JSON
            tmp = base64.b64decode(tmp)
            tmp = gzip.decompress(tmp)
            tmp = json.loads(tmp)
            events = tmp['logEvents']
            for event in events:
                raw_log = event['message']
                log = json.loads(raw_log)
                if log['level'] in ["error", "crit", "alert", "emerg"]:
                    # This is a serious error message
                    msg = log['msg']
                    if msg.startswith("PHP Notice") or msg.startswith("PHP Warning"):
                        print(f"Ignoring PHP notices and warnings: {raw_log}")
                    else:
                        print(f"Received a serious Apache error log: {raw_log}")
                        slack_host = os.getenv('SLACK_WEBHOOK_HOST')
                        slack_path = os.getenv('SLACK_WEBHOOK_PATH')
                        print(f"Sending Slack post to: host={slack_host}, path={slack_path}, content={raw_log}")
                        cnx = HTTPSConnection(slack_host, timeout=5)
                        cnx.request("POST", slack_path, json.dumps({'text': raw_log}))
                        # It's important to read the response; if the cnx is closed too quickly, Slack might not post the msg
                        resp = cnx.getresponse()
                        resp_content = resp.read()
                        resp_code = resp.status
                        assert resp_code == 200

This second Lambda function processes Apache error logs and will post a message to Slack only when a serious error is encountered. In this case, PHP notice and warning messages are not considered serious enough to trigger an alert.

Again, this function expects the Apache error logs to be JSON-formatted. So here is the error log format string I have been using:

ErrorLogFormat "{\"vhost\": \"%v\", \"timestamp\": \"%{cu}t\", \"module\": \"%-m\", \"level\": \"%l\", \"pid\": \"%-P\", \"tid\": \"%-T\", \"oserror\": \"%-E\", \"client\": \"%-a\", \"msg\": \"%M\"}"

ApacheErrorLogFunctionPermission:
  Type: AWS::Lambda::Permission
  Properties:
    FunctionName: !Ref ProcessApacheErrorLogFunction
    Action: lambda:InvokeFunction
    Principal: logs.amazonaws.com
    SourceArn: !Sub arn:aws:logs:${AWS::Region}:${AWS::AccountId}:log-group:*
    SourceAccount: !Ref AWS::AccountId

This resource grants CloudWatch Logs permission to invoke the Lambda function.

ApacheErrorLogSubscriptionFilter:
  Type: AWS::Logs::SubscriptionFilter
  DependsOn: ApacheErrorLogFunctionPermission
  Properties:
    LogGroupName: !Ref ApacheErrorLogGroup
    DestinationArn: !GetAtt ProcessApacheErrorLogFunction.Arn
    FilterPattern: '{$.msg != "PHP Warning*" && $.msg != "PHP Notice*"}'

Finally, we link CloudWatch Logs with the Lambda function using a subscription filter for the Apache error log group. Note the filter pattern, which ensures that logs with a message starting with either “PHP Warning” or “PHP Notice” do not trigger a call to the Lambda function.

Final Thoughts, Pricing, and Availability

One final word about costs: this solution is much cheaper than operating an ELK cluster. Logs stored in CloudWatch are priced at the same level as S3, and Lambda allows one million calls per month as part of its free tier. This would probably be enough for a website with moderate to heavy traffic (provided you use CloudWatch Logs filters), especially if you coded it well and it doesn’t generate too many errors!

Also, please note that Lambda functions support a maximum of 1,000 concurrent invocations. At the time of writing, this is a hard limit in AWS that cannot be changed. However, you can expect calls to the functions described above to last around 30-40ms. This should be fast enough to handle rather heavy traffic. If your workload is so intense that you hit this limit, you will probably need a more complex solution based on Kinesis, which I might cover in a future article.
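
As a back-of-the-envelope check, here is a quick way to relate the free tier and the concurrency limit to your traffic. The traffic figures and error rate below are made-up assumptions, not measurements; only the 1,000,000 free-tier calls, the 1,000 concurrency limit, and the ~30-40ms duration come from the discussion above.

# Rough capacity estimate (all input numbers are assumptions; adjust for your site).
requests_per_day = 500_000     # assumed site traffic
error_fraction = 0.001         # assumed share of requests that pass the CloudWatch filter
lambda_duration_s = 0.04       # ~30-40 ms per invocation, as mentioned above

invocations_per_month = requests_per_day * error_fraction * 30
print(f"Invocations per month: {invocations_per_month:,.0f} (free tier: 1,000,000)")

# Average concurrency needed (Little's law): arrival rate * duration.
invocations_per_second = requests_per_day * error_fraction / 86_400
avg_concurrency = invocations_per_second * lambda_duration_s
print(f"Average concurrent executions: {avg_concurrency:.4f} (limit: 1,000)")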

Understanding the basics

  • What is the ELK stack?

    ELK is an acronym for Elasticsearch-Logstash-Kibana. Additional software items are often needed, such as Beats (a collection of tools to send logs and metrics to Logstash) and Elastalert (to generate alerts based on Elasticsearch time series data).

  • Is ELK stack free?

    The short answer is: yes. The various software items making up the ELK stack have various software licenses, but they usually offer free usage without any support. However, it is up to you to set up and maintain the ELK cluster.

  • How does the ELK stack work?

    The ELK stack is highly configurable, so there is no single way to make it work. For example, here is the path an Apache log entry might take: Filebeat reads the entry and sends it to Logstash, which parses it and sends it to Elasticsearch, which saves and indexes it. Kibana can then retrieve the data and display it.
