

Setting up StackExchange's Opserver

posted 25 October 2013

Nick Craver from StackExchange recently posted StackExchange’s homegrown monitoring system. The system, named Opserver, is available on GitHub.

In this post I hope to go over some of the issues and configuration steps involved in running this slick monitor. So let's try this.

Installation:

  1. Grab the code from GitHub, either by cloning the repo or, for the lazy, via the zip file link.
  2. If using Visual Studio 2010 or earlier, make sure you have ASP.NET MVC installed. Grab that at the ASP.NET MVC 4 homepage and make sure to run the installer as administrator (Windows 7+).
  3. Crack open the solution file (either from your cloned repo or the unzipped package downloaded in step 1). There should be two projects visible and loaded. If you get a "The project file '.....csproj' cannot be opened. The project type is not supported by this installation." error, see this StackOverflow answer.
  4. Compile. The build should succeed with no errors.
Side note: This codebase was built using Visual Studio 2012 and recently updated to use Visual Studio 2013.

Configuration:

All configuration is handled by .json files in the Opserver/Config directory. Nick has included a .example file for each of the different dashboard types. Some of these appear to be very specific to StackExchange's build, so out of the box you may not be able to use all the configurations without reworking some areas of your codebase.
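Copying the example files by hand gets tedious, so here is a minimal Python sketch that does it in bulk. It assumes the examples sit in the config directory with names ending in `.example` (for instance `SQLSettings.json.example`). That naming is my assumption, so check it against your checkout. The helper name `copy_example_configs` is made up for illustration.

```python
import shutil
import tempfile
from pathlib import Path


def copy_example_configs(config_dir: Path) -> list[str]:
    """Copy every '*.example' file to the same name without that suffix.

    Existing files are left alone so local edits are never overwritten.
    Returns the names of the files that were created.
    """
    created = []
    for example in sorted(config_dir.glob("*.example")):
        target = example.with_suffix("")  # drops the trailing '.example'
        if not target.exists():
            shutil.copyfile(example, target)
            created.append(target.name)
    return created


if __name__ == "__main__":
    # Demonstrated on a throwaway directory; point it at Opserver/Config for real use.
    with tempfile.TemporaryDirectory() as tmp:
        config_dir = Path(tmp)
        (config_dir / "SQLSettings.json.example").write_text("{}")
        print(copy_example_configs(config_dir))
```

Files that already exist are skipped, so the script is safe to rerun after pulling a newer version of the repo.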

SecuritySettings.json

This is the first file you'll need to copy/paste in order to get into the system.
Edit this file to match the security settings you want.
The example looks like this:
	<?xml version="1.0" encoding="utf-8"?>
	<SecuritySettings provider="AD">
	 <!-- Optional, these networks can see the overview dashboard without authentication -->
	    <InternalNetworks>
	        <Network name="SE Internal" cidr="10.0.0.0/8" />
	    </InternalNetworks>
	</SecuritySettings>
	 
	<!-- Example of global access for everyone:<SecuritySettings provider="alladmin" />-->

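This example uses Active Directory and lets StackExchange’s internal network (10.0.0.0/8) see the overview dashboard without authentication, so you will need to change it to match your own network.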
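If CIDR notation is unfamiliar, this small Python sketch shows which client addresses a range like 10.0.0.0/8 covers. It only illustrates the address check; it is not how Opserver implements it, and the function name `is_internal` is hypothetical.

```python
import ipaddress


def is_internal(client_ip: str, internal_cidrs: list[str]) -> bool:
    """Return True if client_ip falls inside any of the listed CIDR ranges."""
    address = ipaddress.ip_address(client_ip)
    return any(address in ipaddress.ip_network(cidr) for cidr in internal_cidrs)


if __name__ == "__main__":
    networks = ["10.0.0.0/8"]  # the range from the example file
    for ip in ("10.12.34.56", "192.168.1.20"):
        print(ip, is_internal(ip, networks))
```

A /8 fixes only the first octet, so every 10.x.x.x address matches. Most internal networks will want a narrower range.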

Possible "provider" attributes are: "activedirectory", "ad", "alladmin", or no attribute at all, in which case everyone is read-only. "ad" and "activedirectory" are equivalent.
In my case everyone is an admin, since this is mainly for our internal development and QA teams.
Mine looks like this:
<?xml version="1.0" encoding="utf-8"?>
<SecuritySettings provider="alladmin">
</SecuritySettings> 

Note that even if everyone is an admin, they will still have to log in initially with an empty username/password combination to get into the system.

RedisSettings.json

If you do not know what Redis is, or do not use it, skip this section.
Adding a RedisSettings.json file allows you to monitor Redis memory use as well as perform some health checks.
The configuration for Redis is a tad cryptic for me still, so bear with me.
Here is our config file, modified to remove server names and specifics:
{
 "allServers": {
  "name": "All",
  "instances": [
   {
    "name": "Core",
    "analysisRegexes": {
     "**Dev**": "^dev-",
     "**Live**": "^prod-"
    }
   }
  ]
 },
 "Servers": [
  { "name": "server1" },
  { "name": "server1Slave1" },
  { "name": "server2" },
  { "name": "server2Slave1" },
  { "name": "server2Slave2" }
 ]
}
Under the instances array we have a "Core" name and two children that define regular expressions to match Redis key names for categorizing. That is, if a key starts with "dev-", it will be categorized as "**Dev**" under the "Core" instance when running analysis. The Servers array defines the server names your Redis instances are running on.
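To make the categorization concrete, here is a rough Python sketch of the idea. It reuses the two patterns from the config above and assigns each key to the first label whose pattern matches. The "Other" bucket and the function names are my own inventions, and Opserver's actual analysis code may behave differently.

```python
import re
from collections import Counter

# Labels and patterns copied from the "Core" instance in the config above.
ANALYSIS_REGEXES = {
    "**Dev**": r"^dev-",
    "**Live**": r"^prod-",
}


def categorize(key: str, regexes: dict[str, str]) -> str:
    """Return the label of the first pattern that matches, else 'Other'."""
    for label, pattern in regexes.items():
        if re.search(pattern, key):
            return label
    return "Other"  # hypothetical bucket name for unmatched keys


def summarize(keys: list[str], regexes: dict[str, str]) -> Counter:
    """Count how many keys land in each category."""
    return Counter(categorize(key, regexes) for key in keys)


if __name__ == "__main__":
    sample_keys = ["dev-session:1", "dev-cache:users", "prod-session:9", "misc"]
    print(summarize(sample_keys, ANALYSIS_REGEXES))
```

The leading `^` anchors each pattern to the start of the key, so a key such as "mydev-x" would not be counted as "**Dev**".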
Here's a screenshot of what the Redis monitor home page looks like:
You can see the "Core" name in use on the left.

SQLSettings.json

The entries in this file define the connection strings for any SQL Server instances you wish to monitor. There are two main arrays to populate, “clusters” and “instances”.

"clusters" refers to SQL Server 2012 cluster configurations. If the SQL instances you put here are not SQL Server 2012, you will get memory-use spark charts only; apart from that, the cluster will appear offline.
"instances" refers to standalone SQL Server instances, which, at least in our internal setup, make up the majority of our SQLSettings configuration.
Our configuration, truncated somewhat for brevity:
{
    "defaultConnectionString": "Data Source=$ServerName$;Initial Catalog=master;Integrated Security=SSPI;",
    "clusters": [
        {
            "name": "SDCluster01",
            "nodes": [
                { "name": "SDCluster01\\SDCluster01_01" },
                { "name": "SDCluster02\\SDCluster01_02" }
            ]
        }
    ],
    "instances": [
        { "name": "SDDB01" },
        { "name": "SDDB02" },
        { "name": "SDDB03" },
        { "name": "SDDB04" }
    ]
}
Any JSON object that has a "name" property can also specify a "connectionString" entry in case it differs from the "defaultConnectionString" entry.
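The fallback rule can be sketched in a few lines of Python. This assumes the `$ServerName$` token in the default string is replaced with the entry's "name", which is how I read the config rather than something I have checked in Opserver's source. The port override on SDDB02 and the function name are made up for the example.

```python
import json

# A trimmed settings document; the SDDB02 override is hypothetical.
SETTINGS = json.loads("""
{
    "defaultConnectionString": "Data Source=$ServerName$;Initial Catalog=master;Integrated Security=SSPI;",
    "instances": [
        { "name": "SDDB01" },
        { "name": "SDDB02",
          "connectionString": "Data Source=SDDB02,1444;Initial Catalog=master;Integrated Security=SSPI;" }
    ]
}
""")


def resolve_connection_string(entry: dict, default: str) -> str:
    """Use the entry's own connection string if present, else fill in the default."""
    if "connectionString" in entry:
        return entry["connectionString"]
    # Assumption: $ServerName$ stands for the entry's "name" value.
    return default.replace("$ServerName$", entry["name"])


if __name__ == "__main__":
    for instance in SETTINGS["instances"]:
        resolved = resolve_connection_string(instance, SETTINGS["defaultConnectionString"])
        print(instance["name"], "->", resolved)
```

With one default covering most servers, only the odd ones out (a non-standard port, SQL authentication, and so on) need their own entry.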
Here is how our non-2012 cluster and its information are displayed:
<amp-img alt="Opserver - SQL area" height="207" src="https://1.bp.blogspot.com/-mCUf9nkdEyU/Umr9z6sg7GI/AAAAAAAAAlg/sK3xXOjIQ9c/s400/blog_sql.png" width="400" />
As you can see, the top "cluster" is highlighted in red, even though the spark chart is coming in.
You can also see that under the standalone grouping we have one instance that's not up at all.
Each one of these instances you can drill into to obtain more detailed information, for example:

ExceptionsSettings.json

To use this monitor panel, you pretty much need to use StackExchange.Exceptional. This, again, is StackExchange's error logging system, which writes unhandled exceptions to a central database.
If you happen not to use it, skip this section for now. The configuration is pretty straightforward for a quick setup: the only necessary item is the connection string of the database your Exceptional logging system is writing to.
{
    "warningRecentCount": "100",
    "criticalRecentCount": "200",
    "viewGroups": "StatusExceptionsRO",
    "applications": [
        "Core",  
        "Customer Support",
        "API",
        "Intranet",
        "Polling"
    ],
    "stores": [
        {
            "name": "SanDiego",
            "queryTimeoutMs": 2000,
            "pollIntervalSeconds": 10,
            "connectionString": "Server=SDDB_01;Database=Exceptions;Integrated Security=SSPI;"
        }
    ]
}
The string items underneath "applications" define what is displayed by default in the Exceptions page's header. If an exception is written to the database that does not fit any of those application names, a new category will display in the pane.
I was going to post a screenshot, but after blurring out anything from our stack, it was useless. I would like to add, however, that both Exceptional and the Exceptions page are just beautiful.
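As a rough illustration of that behaviour, the Python sketch below builds the header list from the configured names plus any other application names found in the exception log. The ordering (configured names first, new ones in first-seen order) is my assumption, and the function name is hypothetical; it is not taken from Opserver's code.

```python
def header_applications(configured: list[str], logged: list[str]) -> list[str]:
    """Configured names first, then any other logged names in first-seen order."""
    result = list(configured)
    for app in logged:
        if app not in result:
            result.append(app)
    return result


if __name__ == "__main__":
    configured = ["Core", "Customer Support", "API", "Intranet", "Polling"]
    # "Billing" is a made-up application that is not in the config.
    logged = ["API", "Billing", "Core", "Billing"]
    print(header_applications(configured, logged))
```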

Conclusion

So far my coworkers and I love this, and I can see from the code it's only getting better (hello there sp_WhoIsActive, sp_BlitzIndex!).
I hope this helps someone. Together with Exceptional, it has helped me personally find 3 previously unknown issues in our codebase, which is scary but also eye-opening. And I think that is the goal: make your errors eye-opening and easy to see, so you can fix them.