-
Notifications
You must be signed in to change notification settings - Fork 492
Adds support for directly querying cloudwatch metrics via the AWS SDK #2262
Conversation
captncraig
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few initial comments. I am not sure exatly what to make of the api before I see a concrete example in some docs.
| @@ -0,0 +1,105 @@ | |||
| // Package cloudwatch defines structures for interacting with Cloudwatch Metrics. | |||
| package cloudwatch // import "bosun.org/cloudwatch" | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm wondering if there is a better place for this package than off of the root? Perhaps cmd/bosun/cloudwatch
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd based it off the graphite backend, so was copying the pattern where there was an top level graphite and opentsdb folder with the struct representing the query objects. I notice that the newer data sources like elastic and influx are in cmd/bosun/expr/
I could refactor it to just all live in the cmd/bosun/expr/cloudwatch.go file if that works for you ?
cmd/bosun/expr/cloudwatch.go
Outdated
| return parsed | ||
| } | ||
|
|
||
| func CloudWatchQuery(e *State, T miniprofiler.Timer, region, namespace, metric, period, statistic, dimensions, sduration, eduration string) (*Results, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add some docs that explain how to use this please? Its a bit hard to visualize without an example as the end-user will call into this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added documentation for the public method, configuration section and a full example of an alert using cloud watch metrics
|
Like the idea, been thinking about doing something similar for azure. General thoughts on top of what Craig said:
|
d76c901 to
6bfb9b2
Compare
|
I've updated it so it now supports passing in a profile name in a prefix key similar to the elastic datasource. |
cmd/bosun/expr/cloudwatch.go
Outdated
| b, _ := json.MarshalIndent(req, "", " ") | ||
| T.StepCustomTiming("cloudwatch", "query", string(b), func() { | ||
|
|
||
| // disabled caching fuctionality |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why no caching?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually to put it differently querying is pretty essential for the expression language. Since Bosun expressions don't have "real" variables that hold values but just text expansion, operations that combine sets can result in queries being made repeatedly in a single expression. Therefore it is important that those queries be cache hits in this case.
| svc := new(mockCloudWatchClient) | ||
| c.Profiles["bosun-default"] = svc | ||
| e := State{ | ||
| now: time.Date(2018, time.January, 1, 0, 0, 0, 0, time.UTC), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add comment explaining this Time value
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry - I lost that I was in a test function - ignore comment :-/
| alert elb_error_rate { | ||
| critNotification = default | ||
| template = elb_error_rate | ||
| $region="eu-west-1" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Disclaimer: I don't know much about cloudwatch.
I wonder if we may want operations that can get N items and tag results for things like region and or/namespaces. So for example, might I want something like (again, making things up since I don't know CW): cw(getregions("myApp", "eu-*"), ...). Then one query is made for your regions and you are writing one alert that applies to whatever regions are for that application. However, maybe CW doesn't even work like this so this doesn't mean much. If that would be better, we can add new types to the language - like a awsRegions type that returns a list of regions easily. So if you think changing certain arguments to fetch and pass multiple items might reduce alert repition we can explore that
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We do have other ways to reduce duplication though through macros. That way things are a bit more declarative but you are not repeating all the contents of the alert. So as a CW person, asking you to imagine your own environment and what other people might be doing with CW for what might make sense.
It also isn't the end of the world if we go forward and make some new functions that can do more if they are needed. Don't want to over-engineer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've been having a think about this.
It would probably be most useful to have a function which can populate the dimensions parameter based on filter criteria similar to the grafana dimension_values templating function
http://docs.grafana.org/features/datasources/cloudwatch/#query-variable
A good example of where this is useful is that it's hard to write an alert to check the individual CPU usage of EC2 instances in a auto scaling group. Cloudwatch doesn't have the ability to wildcard on the instance id and if you only supply the id of the autoscaling group it will return a single series with the average for the whole group.
in theory this would look something like
$prodDim = getdimensions("instanceId","tag:asg=webservers-scaling-group")
["prod"]cw($region, $namespace, $metric, $period, $statistics, $prodDim, "1h" ,"")
The drawback is that you need to do a getMetricStatistics call for each unique dimension value
For namespace & metric I don't think it makes sense because it would be hard to write a query that would actually return anything due to the way cloud watch dimensions work.
Similarly I'm not sure how effective pulling from multiple regions in a single query is, because it's comparable to querying multiple tsdb servers with one call.
Is this something you'd want tackled immediately ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At Stack we are currently looking to do something similar with Azure monitoring this month. So the main reason I am interested is to see how our requirements for that will relate to this - and what the commonalities are.
In terms of which arguments are strings (maybe a csv within a string) there are two ways we might be able to get both things in certain instances:
- Make an argument type that takes either a string that has a list of dimensions, or a awsDemensionType that would be the return of a getdimensions call (or something like this, can replace dimensions with regions etc)
- Have multiple functions with different names that are similar, but take different sorts of arguments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've had an attempt at adapting to allow multiple dimensions, and think the approach of having multiple functions which return the groups of dimensions to use in the query is the cleanest.
The cw() function dimension parameter accepts a series of helper functions which give it a list of dimensions to apply.
cwdim() just parses the dimensions directly from a string. So in the simplest case
$dim=cwdim("LoadBalancerName:dynamic-content-elb")
Or if you wanted to specify multiple entries
$dim=cwdim("LoadBalancerName:dynamic-content-elb|LoadBalancerName:static-content-elb")
If for things like s3 where you need 2 or more dimensions per query do
$dim = cwdim("type:IA,name:prod|type:IA,name:test")
Which would perform two queries and return a seriesSet with with two series with tags
{type=IA,name=test}
{type=IA,name=prod}
End to end example
$region="eu-west-1"
$namespace="AWS/ELB"
$metric="UnHealthyHostCount"
$period="300"
$statistics="Maximum"
$dimensions=cwdim("LoadBalancerName:dynamic-content-elb|LoadBalancerName:static-content-elb")
$unhealthy_hosts=["production"]cw($region, $namespace, $metric, $period, $statistics, $dimensions, "12h" ,"")
$unhealthy_hosts
For queries where you want to dynamically determine the dimensions have functions like cwdesc to query for the values you want
cwdesc("eu-west-1","InstanceId", "tag:Stack:Elasticsearch")
This would use e2_describe api to get the instance ids for all ec2 instances tagged with Stack=Elasticsearch
I have a working prototype using cw & cwdim. If you think this approach is ok I'll finish the other functions and tidy the code up so it's suitable for review
@kylebrandt
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ya that sounds similar to like what I did with Azure. You probably want to create a type for it in the expression language (see the Azure commit) so the type checking can make sure the argument is sane.
The other outstanding commit which may get which may require some rebasing or whatever is the moving miniprofiler into the state, but that is just simple find and replace.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've updated the PR with the new functions to support dynamically generating the dimensions, and updated the docs. Query format looks like this now
Directly specifying dimension values
$region="eu-west-1"
$namespace="AWS/EC2"
$metric="CPUUtilization"
$period=60
$statistics="Average"
$dimensions=cwdim('''[ {"InstanceId":"i-0123456abcedfgh"} ] ''')
$start = 1h
cw($region, $namespace, $metric, $period, $statistics, $dimensions, "$start" ,"")
Getting dimension values by query ec2_describe api
$region="eu-west-1"
$namespace="AWS/EC2"
$metric="CPUUtilization"
$period=60
$statistics="Average"
$filter = '''[{"tag:Purpose":["database"]} , {"InstanceType":["m4.2xlarge"]}]'''
$dimensions = cwdimq("eu-west-1","EC2","InstanceId",$filter)
$start = 1h
cw($region, $namespace, $metric, $period, $statistics, $dimensions, "$start" ,"")
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kylebrandt @captncraig If you have time it be great to get feedback on the new query format. Thanks
206c778 to
8ade1e8
Compare
97108f6 to
b6e1314
Compare
Requires a role with the ListMetrics,GetMetricData and GetMetricStatistics capabilities. Role can be supplied using either the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables or by attaching it to the EC2 instance on which bosun runs.
gofmt files after rebase
Added function to dynamically generate dimensions by querying ec2_describe api Changed type of period parameter to be scalar so it can be used with d() properly
521584e to
2f36c61
Compare
|
Closing in favour of #2484 |
Requires a role with the ListMetrics,GetMetricData and GetMetricStatistics capabilities.
Role can be supplied using either the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables or by attaching it to the EC2 instance on which bosun runs.
Example query