TLDR: Network Performance Monitoring (NPM) tools are how we overcome the verdict: “The network is guilty until proven innocent!”
In this piece we’ll show how NPM is being redefined and which trends are shaping it, then we’ll look at a couple of examples of modern NPM tools.
Network performance monitoring is a process of measuring and monitoring the quality of service of a network to understand if it is properly operating. Network performance monitoring aids network administrators and analysts in gathering network data, allowing them to measure performance variables and identify potential issues or risks.
Behind the jargon are some very real human disputes!
When “my” app stops working on “your” network, and “users” start complaining: who’s the first to get the blame?
It has been this way since I started in IT in 1994. We blamed SNA, then TCP/IP and then DNS and the internet. The network team were always the target for our collective IT ire!
I was now The Network! My new peers would tell this common story.
“The problem for us network engineers is that the network is guilty until proven innocent.
I open (yet) another ticket that a user has titled ‘Is the network down?’ The fact that they could create a ticket over the network is lost on them.
We no longer laugh at the IT Crowd meme, ‘Have you tried turning it off and on again?’ We’re living it.”
The Mean Time To Innocence (MTTI) is how long it takes for the networking organization to prove it is not the network causing the degradation. Once that task is accomplished, it is common to assume some other component of IT such as the servers must be at fault.
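To make the metric concrete, here is a minimal sketch of how MTTI could be computed from ticket timestamps. The function name and the idea of recording a "network cleared" timestamp per incident are assumptions for illustration, not a feature of any particular ticketing or NPM tool.

```python
from datetime import datetime
from statistics import mean


def mean_time_to_innocence(incidents):
    """Return MTTI in minutes.

    `incidents` is a list of (opened, network_cleared) datetime pairs:
    when the ticket was opened, and when the network team showed that
    the network was not the cause. Both fields are hypothetical.
    """
    if not incidents:
        raise ValueError("need at least one incident")
    minutes = [
        (cleared - opened).total_seconds() / 60
        for opened, cleared in incidents
    ]
    return mean(minutes)


# Hypothetical sample data: one ticket cleared in 30 minutes, one in 90.
sample = [
    (datetime(2024, 1, 8, 9, 0), datetime(2024, 1, 8, 9, 30)),
    (datetime(2024, 1, 9, 14, 0), datetime(2024, 1, 9, 15, 30)),
]
mtti_minutes = mean_time_to_innocence(sample)
```

The lower this number, the sooner the rest of IT can start looking for the real cause.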
That’s why networks are drawn on whiteboards as a mysterious cloud: it’s complicated, and “Here’s where the magic happens!”
Network engineers use network performance monitoring to see what’s happening in the cloud, where magic happens.
And it’s changing all the time!
And the network performance monitoring tools have to keep up, or you’ll be flying blind in that cloud where magic happens.
What is changing all the time? Cloud workloads and their underlying infrastructure are dynamic and ephemeral in nature:
Times have changed, and so must NPM: modern NPM has entered the game.
Not every enterprise IT shop will be using these modern IT trends, and therefore they might not require modern network performance monitoring tools.
But for this article we’ll assume we’re talking about modern vendors and enterprises where all three trends apply.
The big three trends that are forcing enterprises to upgrade their NPM are IT that is distributed, dynamic and democratized.
Assuming we are talking about a modern IT shop, what are these three trends, how do they impact NPM, and why do they require a modern NPM?
Anything that is spread over a network is distributed, one might argue, but some things are more distributed than others.
We have moved from the 1990s client/server architecture to the internet, and then to the 3-tier web-app-db stack. By comparison, today’s applications can be hyper-distributed across public clouds, hybrid clouds, multi-clouds and edge CDNs. The reality is: it’s a mix of everything above.
So if the app is “down”, how do we know which part is broken?
How does the network engineer’s NPM tool see across all of these networks, some of which they don’t control? Do they need to use multiple NPMs? Or accept blindness in part of the network?
Other than Network Performance Monitoring, related approaches include Cloud Security Posture Management.
It’s so easy to add services to the network. And to then delete them. In fact, if you look at containers — ephemeral by nature — you could have hundreds/thousands appearing and disappearing, hidden behind virtual interfaces on a virtual machine in the cloud…
Serverless services take this even further. How do you know what good looks like on your network, if your network is constantly changing because of dynamic, transient, ephemeral services that connect and disconnect all the time?
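One common way to approach "what good looks like" when the environment keeps changing is a rolling baseline: compare each new measurement against recent history rather than a fixed threshold. The sketch below is a simplified illustration, not how any specific NPM product works. The class name, window size, threshold and the use of latency in milliseconds are all assumptions.

```python
from collections import deque
from statistics import mean, pstdev


class RollingBaseline:
    """Flag samples that deviate sharply from a sliding window of history."""

    def __init__(self, window=20, threshold=3.0, min_samples=5, min_sigma=1.0):
        self.samples = deque(maxlen=window)  # old samples fall off automatically
        self.threshold = threshold           # allowed deviation, in standard deviations
        self.min_samples = min_samples       # don't judge until we have some history
        self.min_sigma = min_sigma           # floor so a flat baseline isn't hypersensitive

    def observe(self, value):
        """Record a sample; return True if it looks anomalous."""
        anomalous = False
        if len(self.samples) >= self.min_samples:
            mu = mean(self.samples)
            sigma = max(pstdev(self.samples), self.min_sigma)
            anomalous = abs(value - mu) > self.threshold * sigma
        # Note: anomalous samples are still added, so a sustained shift
        # eventually becomes the new baseline.
        self.samples.append(value)
        return anomalous
```

Feeding it latencies of roughly 20 ms and then a single 80 ms sample would flag only the 80 ms sample. Because the window slides, a service that legitimately settles at a new level stops alerting once enough history has accumulated.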
NPM is no longer just the province of a battered and bruised central networking team.
SREs, application engineers and cloud engineers all have the power to create networks. Putting central command-and-control ITSM processes on these people just pushes them to Shadow IT, and they are still on the network.
You are forced to answer the question: “How can we let people go fast within guardrails that are monitored?”
We’ve talked about the problems facing NPM tools, who’s agitated by those problems (everyone) and why it’s getting harder (3 trends): but what about the solution?
The 7 features of a modern NPM tool include:
There is a broad market of dozens of NPM tools, but if you focus on tools that can handle modern distributed, dynamic and democratized IT, then here are two examples that SJULTRA has worked with:
Collecting data is cheap, but not having it when you need it can be expensive, so you should instrument everything, and collect all the useful data you reasonably can. — Datadog
Datadog’s Network Performance Monitoring provides full visibility across the distributed system of network components that make up your on-premises, cloud, and hybrid environments, with little to no overhead.
But Datadog is not just about the network.
Datadog handles dynamic services because it unifies network data with infrastructure, application, and user experience data in a single pane of glass. Network Device Monitoring (NDM) autodiscovers devices from a wide range of vendors and lets you drill down to monitor the health of individual devices.
Lastly, Datadog enables the democratization of NPM because anyone can use it: you don’t have to be a network engineer wearing a propeller hat. Here’s Alex Kanevsky, a Lead Architect:
With Datadog NPM, we can quickly determine if our network is at fault for slow traffic or low connectivity before our applications are affected, so that insuring travel is a seamless experience for our customers.
LEAD ARCHITECT AT GENERALI GLOBAL ASSISTANCE
Datadog provides a unified solution to seamlessly combine the three pillars of observability and enable full visibility across the application stack.
Other than that, the most important thing is the smooth, frictionless integration process with more than 400 built-in integrations and pre-defined dashboard templates for them.
The diagram below compiles what is offered by Datadog.
Here is an example of a simple, super-useful feature. One common cause of application failures is the expiration of SSL certificates. Because Datadog can analyze everything in the stack, you can set up rules for things that touch the network, infrastructure and applications.
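To show what a certificate-expiry rule checks under the hood, here is a minimal sketch in plain Python. It is not Datadog’s implementation or API. It uses only the standard library, and the function names and the 30-day warning threshold are assumptions made for the example.

```python
import socket
import ssl
from datetime import datetime, timezone


def fetch_not_after(host, port=443, timeout=5.0):
    """Connect to host:port over TLS and return the certificate's notAfter string."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def days_until_expiry(not_after, now):
    """Days between `now` (timezone-aware) and the certificate's expiry time."""
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    return (expires - now).total_seconds() / 86400


def expiry_status(not_after, now, warn_days=30):
    """Classify a certificate as 'expired', 'warning' or 'ok'."""
    days = days_until_expiry(not_after, now)
    if days < 0:
        return "expired"
    if days <= warn_days:
        return "warning"
    return "ok"
```

A scheduled job could call `expiry_status(fetch_not_after("example.com"), datetime.now(timezone.utc))` for each endpoint and raise an alert on anything other than `"ok"`. The parsing and classification are kept separate from the network call so they can be tested without a live connection.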
Grafana is a multi-platform, open source, analytics, and interactive visualization web application. It provides charts, graphs, and alerts for the web when connected to supported data sources.
Grafana becomes a network performance monitoring tool when you customize it through mixins, which is the first difference from Datadog’s bundled commercial off-the-shelf (COTS) model.
Here at Grafana Labs, we want to help facilitate the monitoring of those important network devices and provide a few simple ways to get started. Our SNMP mixin is a great baseline for monitoring network devices. A mixin is a configurable, reusable, and extensible package which bundles together Grafana dashboards and Prometheus rules and alerts.
Grafana’s extensibility is what enables it to meet the modern NPM needs for distributed and dynamic services and to democratize access by making the information visual and easy to understand.
SJULTRA is a cloud security consulting firm based in California. Our consulting team works with customers around the world to develop high-impact strategies that drive effective digital business transformation leveraging cloud based technology.
SJULTRA partners with vendors and enterprises to design, build and operate cloud security infrastructure stacks, including Network Performance Monitoring.
One of SJULTRA’s offerings is supporting cloud security vendors and their enterprise clients in building lab systems before production readiness.