OVIS enables powerful open-source cluster management
Jan. 23, 2007
Once upon a time, clustering was restricted to mid-range computers like VAXes and SPARC servers. Now, however, clustering software for Linux has become commonplace. Unfortunately, real-time cluster-monitoring tools have lagged behind. This can make properly managing a cluster... interesting.
Sandia National Laboratories has a possible answer: OVIS, an open-source software tool that provides intelligent, real-time monitoring of computer clusters.
OVIS 1.1 takes a statistical approach to the problem of computational platform monitoring and analysis. Traditionally, cluster monitoring tools keep an eye on manufacturer-specified, "absolute" thresholds. OVIS takes a new tack. It observes the overall statistical properties and environmental effects of a cluster, characterizing individual device behaviors and comparing them to a large number of statistically similar devices.
Individual node values that deviate from this norm are then flagged as aberrant. The norm itself is established by real-time and historical analysis of the cluster and its nodes. Sandia's OVIS developers claim this technique can accurately expose problems much earlier than the current practice of simply waiting for a predetermined threshold to be crossed, a threshold that must be set high to avoid too many false alarms.
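The article doesn't publish the statistics OVIS uses, so the following Python sketch only illustrates the general idea of comparing a node against its peers rather than against a fixed limit. The node names, temperature readings, the 85 °C limit, and the median/MAD "modified z-score" test with its conventional 3.5 cutoff are all assumptions chosen for illustration, not OVIS's actual method.

```python
from statistics import median

# Hypothetical per-node CPU temperatures in degrees C (illustrative data only).
temps = {
    "n01": 52.0, "n02": 54.0, "n03": 53.0, "n04": 51.0,
    "n05": 55.0, "n06": 53.0, "n07": 68.0, "n08": 52.0,
}


def flag_absolute(readings, limit):
    """Traditional approach: flag only nodes above a fixed (assumed) limit."""
    return [name for name, v in readings.items() if v > limit]


def flag_outliers(readings, cutoff=3.5):
    """Peer comparison: flag nodes whose modified z-score exceeds `cutoff`.

    The modified z-score is built from the median and the median absolute
    deviation (MAD), so one hot node cannot drag the baseline toward itself.
    """
    values = list(readings.values())
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # Most peers report identical values; anything different stands out.
        return [name for name, v in readings.items() if v != med]
    return [
        name for name, v in readings.items()
        if 0.6745 * abs(v - med) / mad > cutoff
    ]


print(flag_absolute(temps, limit=85.0))  # fixed-threshold check
print(flag_outliers(temps))              # peer-comparison check
```

With this made-up data, the fixed limit flags nothing, because node n07 is still well under 85 °C, while the peer comparison flags n07 because it runs about 15 degrees hotter than its otherwise similar neighbors.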
OVIS does more, though, than keep a watchful eye out for misbehaving cluster nodes. It includes visualization and correlation reporting tools that enable the system builder or administrator to see what's what across nodes.
For example, the baseline capabilities of OVIS include the visual display of the nodes' individual internal temperature, CPU utilization, fan speed, etc., as well as a cluster's aggregate statistics. By letting administrators actually see how the cluster is doing as a whole, rather than as single nodes, OVIS makes it much easier to tune the cluster configuration and determine the effects of real-time changes.
Though not part of the current download distribution, OVIS also incorporates a Bayesian inference scheme that dynamically builds models of the cluster's normal behavior. Bayesian analysis uses a statistical record of past events on the cluster to estimate the likelihood of future events. A crude example would be a program that estimates how likely the cluster is to fail in the near future, given that a certain number of CPUs are overheating. These Bayesian features will appear in later versions of the program.
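To make that crude example concrete, here is a minimal Python sketch that applies Bayes' rule to counts gathered from past monitoring windows. The `FailureModel` class, the four-CPU overheating cutoff, and the counts fed to it are hypothetical; the article doesn't describe how OVIS's Bayesian scheme is built, and this simple frequency-based estimate is a stand-in, not Sandia's model.

```python
class FailureModel:
    """Hypothetical running model: counts of monitoring windows, split by
    whether at least `hot_cpus` CPUs overheated (H) and whether a failure
    followed (F)."""

    def __init__(self, hot_cpus: int):
        self.hot_cpus = hot_cpus
        self.counts = {(h, f): 0 for h in (True, False) for f in (True, False)}

    def observe(self, overheated: int, failed: bool) -> None:
        """Record one window; the model updates as new data arrives."""
        self.counts[(overheated >= self.hot_cpus, failed)] += 1

    def p_failure_given_hot(self) -> float:
        """P(F | H) = P(H | F) P(F) / P(H), estimated from the counts."""
        c = self.counts
        n_fail = c[(True, True)] + c[(False, True)]
        n_ok = c[(True, False)] + c[(False, False)]
        if n_fail == 0 or n_ok == 0:
            raise ValueError("need at least one failing and one healthy window")
        prior = n_fail / (n_fail + n_ok)            # P(F)
        like_fail = c[(True, True)] / n_fail        # P(H | F)
        like_ok = c[(True, False)] / n_ok           # P(H | not F)
        evidence = like_fail * prior + like_ok * (1 - prior)  # P(H)
        if evidence == 0:
            raise ValueError("no overheating windows observed yet")
        return like_fail * prior / evidence


# Hypothetical history: 1,000 windows, 10 of which ended in a failure.
model = FailureModel(hot_cpus=4)
for _ in range(8):
    model.observe(overheated=5, failed=True)
for _ in range(2):
    model.observe(overheated=0, failed=True)
for _ in range(20):
    model.observe(overheated=4, failed=False)
for _ in range(970):
    model.observe(overheated=1, failed=False)

print(round(model.p_failure_given_hot(), 3))
```

With these made-up counts the prior chance of failure is 1 percent, but seeing four or more overheating CPUs raises it to roughly 29 percent (8 of the 28 such windows). Because the model stores only counts, calling `observe()` as each new window arrives keeps the estimate current.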
OVIS is now available for free download at the OVIS project website, and is released under the BSD license. For more details about OVIS, including Linux installation instructions, refer to the PDF document Monitoring Computational Clusters with OVIS.
Sandia National Laboratories is operated by Sandia Corp. for the U.S. Department of Energy's National Nuclear Security Administration. With main facilities in Albuquerque, N.M., and Livermore, Calif., Sandia has major research and development responsibilities in national security, energy, and environmental technologies.
-- Steven J. Vaughan-Nichols