We had a brief outage recently.
Given that the code had not been changed in a month, we suspected some maintenance in an Azure data center stepped on our application.
Ping tests and self-tests failed for approximately 10 minutes.
The outage resolved on its own without intervention.
I submitted a ticket to Azure Support to determine the cause of the outage, but the reason I'm writing this post is the behavior I observed in the Cloud Services CPU graphs while investigating it.
The CPU graphs show different results depending on the time range selected.
I would expect the CPU spike to appear with the same value no matter what time range I selected. Instead, to see the spike that fired the alert, I had to "Edit" the chart and try different time ranges. It wasn't until I selected a narrow custom time range that the CPU graph displayed the spike corresponding to the alert. The alert fires if the CPU percentage exceeds 80% over 15 minutes. So, if you "know" something happened, try different time ranges, and especially a custom range, to find what you are looking for.
This behavior has been documented and forwarded to the Azure portal team for review. It appears in both the Classic and current Azure portal.
Here is the response from Azure Support when I raised this concern:
"I have had discussion with our Azure UI team and Azure Monitoring team regarding the portal graph.
As they mentioned, When we look at the 24 hours of data in the portal, the data is aggregated at 1 hour granularity and the average is shown. Similar is the case for 1 week of data shown on the portal. Since the spike exist for 5 to 10 minutes, we need to see the custom data option instead of using the 24 hour and 1 week. These 24 hours/ 1 week graph will be helpful when you have spike for more than an hour."
The CPU spikes are lower in the graphs that have a longer time range because of the aggregation and averaging. This is not a bug with the Azure graphs, it's a feature. ;-)
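The averaging effect is easy to demonstrate. Here's a small sketch with made-up numbers (a 10-minute spike to 95% CPU in an otherwise idle day, roughly matching my outage) showing how a 1-hour aggregation granularity hides a spike that a narrow custom range would reveal:

```python
# Hypothetical 1-minute CPU samples over 24 hours: mostly idle at 10%,
# with a single 10-minute spike to 95% (the kind that fires an 80% alert).
samples = [10.0] * (24 * 60)
for i in range(600, 610):  # 10-minute spike starting at minute 600
    samples[i] = 95.0

# Narrow custom range: average over 5-minute buckets.
five_min = [sum(samples[i:i + 5]) / 5 for i in range(0, len(samples), 5)]
print(max(five_min))   # 95.0 -- the spike is fully visible

# 24-hour view: the portal averages over 1-hour buckets.
hourly = [sum(samples[i:i + 60]) / 60 for i in range(0, len(samples), 60)]
print(round(max(hourly), 1))   # 24.2 -- well under the 80% threshold
```

At 1-hour granularity, the 10-minute spike is diluted by the 50 idle minutes in the same bucket, so the graph peaks around 24% even though the machine actually hit 95%.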