Integrated Performance Monitoring Using Advanced Data Flows

Authored by

Andy Pernsteiner

This blog post was written in 2021 and reflects product capabilities at that time. Some information may be outdated.

In this blog post, we will examine new features that help you identify problematic workflows and optimize performance for end users in real time.

As one would expect with an all-flash, scale-out storage system, our customers consider performance to be one of their top priorities. VAST R&D has worked tirelessly to build a software architecture that consistently provides great performance across a wide variety of workloads. One might say that we’ve made high performance storage an easily attained commodity, and that administrators, application architects, and users alike no longer need to concern themselves with storage performance.

However, even when a system can deliver great performance, system administrators still want to be able to quickly and easily diagnose potential bottlenecks. Regardless of how fast a filesystem is, there will always be scenarios where admins will need to investigate issues reported by their users and application owners. To this end, VAST customers have always enjoyed an intuitive, responsive GUI, providing quick access to preloaded reports such as:

Cluster bandwidth and latency: For a quick snapshot view of overall system performance.
Breakdown of performance per CNode: To know if a CNode is overloaded or if it is experiencing abnormally high latency.
Breakdown of individual protocol operations, both for data and metadata: To narrow down which types of operations are perhaps performing slowly, and also to know more about the characteristics of the workload.

Figure 1: VAST Dashboard showing key metrics including cluster bandwidth and latency

Customizable Analytics For Your Unique Needs

More advanced users can create their own customized analytics, based on hundreds of different metric types. These can be visualized within the VAST GUI, or collected using our RESTful API for aggregation and analysis in other tools, from Elasticsearch and Prometheus to Grafana and Kibana. This helps administrators fine-tune their dashboards to answer more complex questions.

For a while now, VAST customers have been able to view real time and historical data showing the Top Actors on their systems, enabling administrators to easily see which users, hosts, and file system exports are busiest. This provides valuable insight which can greatly reduce the time it takes to diagnose and troubleshoot user complaints or performance issues.

These tools, while extremely useful, do not quite capture the full picture of what is happening in an environment. Our customers want to answer additional questions, such as:

Show me all the views that my user ‘Brett’ is reading from and writing to
Show me the top 50 machines sending I/O to CNode5
Show me whether my users are leveraging multiple client machines for their job, or incorrectly sending their I/O through a single machine
Show me who is writing to the /engineering-old export

These questions are difficult to answer on nearly every enterprise file server today. This is not because these file servers fail to collect sufficient metrics, but because of how those metrics are collected and stored. To illustrate, let’s think about how VAST collects most of its protocol metrics. From the time a CNode receives an I/O until the time it responds to (“acks”) that I/O, the CNode is responsible for servicing the I/O and tracking the counters and latency timers for each client request.

CNodes keep an in-memory store of everything related to an I/O, including:

client-IP
client-VAST-ID/UID/SID
Protocol
VAST-VIP
I/O size (in bytes)
I/O type and direction

Every 10 seconds, the VAST-VMS service (which runs in a separate Docker container that floats among the CNodes in the cluster) polls every CNode cluster-wide for metrics, including the ones above. VMS then inserts these metrics into its database for retrieval and analysis. Up until VAST-4.0, these individual elements were queried and stored separately from each other. For example, I/O metrics related to client IPs were stored separately from I/O metrics related to user (UID/SID) information.
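The polling model described above can be sketched in a few lines of Python. This is only an illustration of the idea, not VAST's implementation: the `IoSample` fields, the `poll_cluster` function, and the sample addresses are all hypothetical names chosen for this example. It shows how one poll cycle might fold per-I/O records into separate per-client and per-user tables.

```python
from collections import defaultdict
from dataclasses import dataclass

POLL_INTERVAL_SECONDS = 10  # polling period stated in the post


@dataclass(frozen=True)
class IoSample:
    """One I/O as a CNode might track it in memory (field names are hypothetical)."""
    client_ip: str
    user: str
    protocol: str
    vip: str
    size_bytes: int
    direction: str  # "read" or "write"


def poll_cluster(cnodes):
    """Simulate a single poll: sum bytes per facet, each facet in its own table."""
    by_client = defaultdict(int)
    by_user = defaultdict(int)
    for samples in cnodes.values():
        for sample in samples:
            # Each table is keyed by ONE attribute; the link between
            # client and user is discarded at this point.
            by_client[sample.client_ip] += sample.size_bytes
            by_user[sample.user] += sample.size_bytes
    return dict(by_client), dict(by_user)


# Made-up samples held by two CNodes.
cnodes = {
    "cnode1": [IoSample("10.0.0.1", "brett", "NFS", "172.16.0.1", 4096, "read")],
    "cnode2": [
        IoSample("10.0.0.2", "brett", "NFS", "172.16.0.2", 8192, "write"),
        IoSample("10.0.0.1", "kartik", "SMB", "172.16.0.2", 1024, "read"),
    ],
}

by_client, by_user = poll_cluster(cnodes)
print(by_client)
print(by_user)
```

Each table answers a "top N" question for its own facet, but nothing in either table records which user was on which client.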

Connecting The Dots: Build a Complete Picture to Solve Problems in Real Time

Storing different types of data in separate tables is a relatively common practice. It helps logically partition data, and it keeps queries efficient, so long as you are looking at one facet at a time. The issue is that this model does not lend itself to more complex correlations. For example, with this model you are able to answer the following questions:

Show me the top 500 users (in terms of bandwidth, IOPS, or metadata operations)
Show me the top 500 clients
Show me the top 500 views/exports

However, you do not have lineage between these object types. You do not know for certain whether “Kartik”, who is reading 600 MB/s from the cluster, is reading it from one client or from 20. You do not know whether he is reading that data from the /engineering export or the /qa export. You also do not know which VIPs or CNodes he is connected to.
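A small Python sketch makes the lost lineage concrete. The records, host names, and the `per_facet_tables` helper are invented for illustration. Two different workloads are aggregated into one table per facet, and the resulting tables come out identical, so the tables alone cannot tell you which workload is actually running.

```python
from collections import Counter


def per_facet_tables(ios):
    """Aggregate (user, client, export, mb_per_sec) records into one table per facet."""
    users, clients, exports = Counter(), Counter(), Counter()
    for user, client, export, mb_per_sec in ios:
        users[user] += mb_per_sec
        clients[client] += mb_per_sec
        exports[export] += mb_per_sec
    return users, clients, exports


# Scenario A: kartik reads /engineering from host1.
scenario_a = [
    ("kartik", "host1", "/engineering", 600),
    ("brett", "host2", "/qa", 600),
]

# Scenario B: kartik reads /qa from host2 instead.
scenario_b = [
    ("kartik", "host2", "/qa", 600),
    ("brett", "host1", "/engineering", 600),
]

tables_a = per_facet_tables(scenario_a)
tables_b = per_facet_tables(scenario_b)

# The workloads differ, yet every per-facet table is the same.
print(scenario_a != scenario_b)
print(tables_a == tables_b)
```

Both comparisons print `True`: the per-facet view reports 600 MB/s for each user, each host, and each export in both scenarios.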

To solve this problem, we came up with a new way to emit and store performance data. Rather than rewrite our existing metrics flows, we added this new technique specifically for our new “Data Flow” feature in VAST-4.0.

As mentioned above, each CNode already holds all the information we need about each I/O in memory, just waiting to be queried. The first thing we did was create a new Remote Procedure Call (RPC) that VMS can execute to request “Data Flow” metrics from a given CNode. At a high level, VMS executes the RPC as follows:

“Send me all DataFlows for the past 60 seconds”

The CNode would then respond to this RPC with a series of tuple-like objects with the following structure:

{timestamp_user_host_VIP_CNode_Export : {iops: 100, bw: 100…}}

It's important to note that each “tuple” represents an individual flow, end to end. A DataFlow, in VAST parlance, is a discrete object that shows the I/O for a given combination of object types. The key distinction of this approach is that we do not have to extrapolate (guess) how a user, host, CNode, VIP, or view is contributing to the overall I/O of the system; we know. This leaves administrators much better informed about how the system is being utilized and reduces the time required to narrow down an issue.
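To show why keeping the whole tuple matters, here is a minimal Python sketch of querying flow records shaped like the structure above. It assumes the key is modeled as a named tuple rather than a single underscore-joined string, and every name in it (`FlowKey`, the helper functions, the sample values) is hypothetical rather than part of the VAST API.

```python
from collections import defaultdict
from typing import NamedTuple


class FlowKey(NamedTuple):
    """Hypothetical model of the timestamp_user_host_VIP_CNode_Export key."""
    timestamp: int
    user: str
    host: str
    vip: str
    cnode: str
    export: str


# Made-up flows for one reporting window.
flows = {
    FlowKey(1000, "brett", "host1", "vip1", "CNode5", "/engineering"): {"iops": 100, "bw": 100},
    FlowKey(1000, "brett", "host2", "vip2", "CNode3", "/qa"): {"iops": 50, "bw": 200},
    FlowKey(1000, "kartik", "host3", "vip1", "CNode5", "/engineering-old"): {"iops": 10, "bw": 600},
}


def views_for_user(flows, user):
    """All exports a given user has I/O against."""
    return sorted({key.export for key in flows if key.user == user})


def top_hosts_for_cnode(flows, cnode, limit=50):
    """Hosts sending I/O to one CNode, busiest first by bandwidth."""
    bw = defaultdict(int)
    for key, metrics in flows.items():
        if key.cnode == cnode:
            bw[key.host] += metrics["bw"]
    return sorted(bw.items(), key=lambda item: item[1], reverse=True)[:limit]


def hosts_per_user(flows):
    """Which client machines each user is driving I/O from."""
    hosts = defaultdict(set)
    for key in flows:
        hosts[key.user].add(key.host)
    return {user: sorted(names) for user, names in hosts.items()}


def users_on_export(flows, export):
    """Users with I/O against one export."""
    return sorted({key.user for key in flows if key.export == export})


print(views_for_user(flows, "brett"))
print(top_hosts_for_cnode(flows, "CNode5"))
print(hosts_per_user(flows))
print(users_on_export(flows, "/engineering-old"))
```

Each helper is a simple filter or group-by over the same records, which is the point: once the attributes travel together, the earlier questions become straightforward queries. Telling reads apart from writes would need a direction field in the metrics, which this sketch leaves out.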

Figure 3: Data Flow example with focus on client machine and its IO flow across the cluster

Conclusion

As you can see, VAST has been constantly extending the platform so that customers can get more information about their system and about the way users and applications interact with it. Like every other piece of functionality on VAST, Data Flows is available out of the box at no additional charge. Customers running previous versions of VAST software can simply apply a non-disruptive upgrade to gain the new functionality.

To see this feature in action, check out the demo below or contact us.
