Part I
System Management

Springer International Publishing AG 2017

Michael M. Resch, Wolfgang Bez, Erich Focht, Michael Gienger and Hiroaki Kobayashi (eds.) Sustained Simulation Performance 2017

Theory and Practice of Efficient Supercomputer Management

Vadim Voevodin (1)

(1) Research Computing Center of Lomonosov Moscow State University, Moscow, Russia

Abstract

The efficiency with which modern supercomputer systems are used is very low because of their high complexity. It is getting harder to control the state of a supercomputer, yet the cost of low efficiency can be very significant. To address this issue, software for efficient supercomputer management is needed. This paper describes a set of tools being developed at the Research Computing Center of Lomonosov Moscow State University (RCC MSU) that is intended to provide a holistic approach to efficiency analysis from different points of view. The described tools address the efficiency of particular user applications and of the whole supercomputer job flow, the efficiency of computational resource utilization, supercomputer reliability, and HPC facility management.

1 Introduction

A modern supercomputing system consists of a huge number of different software and hardware components: compute nodes, network, storage, system software tools, software packages, etc. To manage a supercomputer efficiently, we need to consider every aspect of the behavior of these components. How efficiently do users of the supercomputer center consume computational resources? What jobs do they run and what projects do they form? How efficiently are partitions and quotas organized? Is the system software configured properly? All of these questions (and not only these) need to be taken into account; otherwise the efficiency of supercomputer usage can decrease significantly. This means that we need to control everything happening in the supercomputer.

As supercomputers get bigger and more complex, this task gets harder and harder. This explains why the efficiency of most supercomputing systems is very low. For example, the average Flops performance of one core on the old MSU system called Chebyshev over 3 days is just above 3% []. The situation is much the same on many other current supercomputer systems.

The analysis of supercomputer efficiency is further complicated by the fact that different user groups understand efficiency in different ways. Ordinary supercomputer users are primarily interested in solving their tasks, so they mostly think about the efficiency of their particular applications. System administrators are concerned with the general usage of computational resources, so they think about the efficiency of supercomputers. Management, in turn, thinks more globally, so its area of interest is the efficiency of the whole supercomputer center.

The task of efficient supercomputer management is really hard to solve, but the cost of low efficiency can be very high. Here is one example. One day of maintenance of the Lomonosov supercomputer [] (1.7 PFlops peak, currently #2 in Russia) costs $25,000. If the job scheduler hangs, half of the supercomputer will be idle in just 2-3 h. This means that a delayed reaction is very costly, so we need to keep supercomputers under control.
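To make the scale of this cost concrete, the following minimal sketch converts the quoted daily figure into an hourly one and estimates the cost of idle capacity. It assumes, purely for illustration, that the daily cost is spread uniformly over 24 hours and that the loss is proportional to the idle share of the machine; the function name `idle_cost` and the 8-hour outage are hypothetical and not taken from the paper.

```python
# Illustrative arithmetic only. The daily cost is the figure quoted in the text;
# the uniform-cost model and the 8-hour outage below are assumptions.
DAILY_COST_USD = 25_000
HOURS_PER_DAY = 24


def idle_cost(idle_fraction: float, hours: float) -> float:
    """Cost of unused capacity: idle share of the machine x hours x hourly cost."""
    hourly_cost = DAILY_COST_USD / HOURS_PER_DAY
    return idle_fraction * hours * hourly_cost


hourly = DAILY_COST_USD / HOURS_PER_DAY
print(f"hourly cost of the whole system: ${hourly:,.2f}")
# Hypothetical case: half of the machine stays idle for 8 hours (e.g. overnight).
print(f"half of the system idle for 8 h: ${idle_cost(0.5, 8):,.2f}")
```

Under these assumptions the whole system costs roughly $1,042 per hour, so even a few unnoticed hours of partial idleness add up to thousands of dollars.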

All this explains why we need efficient supercomputer management. The next question is: what is needed to achieve it? In our opinion, there are five major directions that need to be studied. First, it is necessary to collect detailed information about the current state of the supercomputer and all its components, so a monitoring system is needed. A monitoring system provides a huge amount of raw data that needs to be filtered to extract the valuable information about efficiency; intelligent analysis and convenient visualization systems are needed for that purpose. Furthermore, the efficiency of supercomputer usage directly depends on the reliability of supercomputer components, which means that all-round control of the correctness of system functioning is required. Finally, efficient supercomputer management can be very hard without an easy-to-use work management system for helpdesk, resource project management, hardware maintenance control, etc.

Various existing tools help to analyze and improve the efficiency of supercomputer functioning, but each is intended to solve only one or several of the tasks described above. Currently there is no unified approach that allows a holistic analysis of supercomputer efficiency from different points of view. At Moscow State University, we are developing a toolkit aimed at solving all of these tasks. In the rest of this paper, the six components that form this toolkit are described in detail.

2 Moscow State University HPC Toolkit for Efficiency Analysis

Figure 1 shows the main components of the HPC toolkit for efficiency analysis being developed at the Research Computing Center of Lomonosov Moscow State University. They are interconnected and complement each other, forming a holistic approach to the posed task.

Fig. 1 Main components of HPC toolkit for efficiency analysis developed in MSU

2.1 DiMMon Monitoring System

Many different monitoring systems are successfully applied in practice in supercomputer centers nowadays (Collectd [], Zabbix, Cacti, etc.). In our opinion, however, it will be hard to use such systems efficiently in the future, for several reasons dictated by the expected architecture of new supercomputers. First, future monitoring systems need to be very scalable, up to millions of nodes. They also need to be easily reconfigurable, extensible and portable. And, like current systems, they need to produce low overhead while dealing with a really huge amount of raw monitoring data.

With all this in mind, the Research Computing Center of MSU started to develop DiMMon [], a new system focused mostly on performance monitoring. Three main features form the basis of the DiMMon approach:

On-the-fly analysis: all relevant information should be extracted from the raw data before it is stored in the database. This greatly reduces the amount of data that needs to be stored and eases further data processing.
In-situ analysis: basic processing of the monitoring data should be performed where the data was collected (e.g., on a compute node); only after that is it sent to the server side. This significantly reduces the amount of data that needs to be sent over the communication network. Because only simple data processing is performed locally (such as simple aggregation or speed calculation), the overhead that can affect user job execution on the node is very low.
Dynamic reconfiguration: the monitoring system must be able to change its configuration (data transmission routes, collection parameters, processing rules) without restarting. Future supercomputers will be much more dynamic in nature, so in our opinion this feature of a monitoring system will be highly valuable.
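The in-situ idea can be illustrated with a minimal sketch of a node-side processing stage. This is not DiMMon code or its API: the class `NodeAggregator`, its methods and the sample values are hypothetical. The stage converts a cumulative counter into rates (the "speed calculation" mentioned above), forwards one averaged value per window instead of every raw sample, and lets the window be changed at run time without a restart.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class NodeAggregator:
    """Hypothetical node-side stage (not part of DiMMon): turns a cumulative
    counter into rates and emits one averaged rate per window."""
    window: int = 4                                # rate samples per forwarded value
    _last: Optional[Tuple[float, float]] = None    # previous (timestamp, counter)
    _rates: List[float] = field(default_factory=list)

    def reconfigure(self, window: int) -> None:
        """Change the aggregation window while running, without a restart."""
        self.window = window

    def feed(self, timestamp: float, counter: float) -> Optional[float]:
        """Accept one raw sample; return an averaged rate once a window is full."""
        if self._last is not None:
            t0, c0 = self._last
            if timestamp > t0:
                self._rates.append((counter - c0) / (timestamp - t0))
        self._last = (timestamp, counter)
        if len(self._rates) >= self.window:
            average = sum(self._rates) / len(self._rates)
            self._rates.clear()
            return average
        return None


# Usage: five raw samples of a cumulative counter (e.g. bytes sent) are reduced
# to two values that would actually travel over the network.
agg = NodeAggregator(window=2)
sent = []
for t, c in [(0, 0), (1, 100), (2, 300), (3, 600), (4, 1000)]:
    out = agg.feed(t, c)
    if out is not None:
        sent.append(out)
print(sent)
```

Keeping the local work to a subtraction, a division and a running list is what keeps the overhead on the compute node small; anything more expensive would belong on the server side.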

A monitoring system with such features provides useful capabilities. For example, the first two features enable DiMMon to calculate integral performance metrics for individual jobs while collecting the data. In this case there is no need to scan the whole database for information relevant to a particular job run after it has finished; integral characteristics such as minimum, maximum and average can be calculated on the fly. This enables prompt analysis of job execution and significantly reduces the amount of computation needed for this purpose.
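A minimal sketch of how such integral characteristics could be maintained on the fly is given below. It is an illustration under stated assumptions, not the DiMMon implementation: the names `RunningStats`, `job_stats` and `on_sample`, the job identifiers and the load values are all hypothetical. Each incoming sample updates a constant-size record per job, so no raw samples have to be stored or rescanned after the job finishes.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class RunningStats:
    """Constant-size summary of one metric for one job (hypothetical helper)."""
    count: int = 0
    minimum: float = float("inf")
    maximum: float = float("-inf")
    total: float = 0.0

    def update(self, value: float) -> None:
        """Fold one new sample into the summary without keeping the sample."""
        self.count += 1
        self.minimum = min(self.minimum, value)
        self.maximum = max(self.maximum, value)
        self.total += value

    @property
    def average(self) -> float:
        return self.total / self.count if self.count else float("nan")


job_stats: Dict[str, RunningStats] = {}


def on_sample(job_id: str, value: float) -> None:
    """Called for every monitoring sample attributed to a job."""
    job_stats.setdefault(job_id, RunningStats()).update(value)


# Usage with made-up CPU load samples arriving interleaved from two jobs.
for job_id, cpu_load in [("job-1", 0.5), ("job-2", 0.25),
                         ("job-1", 1.0), ("job-1", 0.75)]:
    on_sample(job_id, cpu_load)

for job_id, stats in job_stats.items():
    print(job_id, stats.minimum, stats.maximum, stats.average)
```

When a job ends, its record already holds the final minimum, maximum and average, which is why no pass over the stored data is needed.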
