5th Quattor Workshop Summary (March 2008)

Written
17 March 2008
Author
Michel Jouvin

5th Quattor Workshop - CNAF - 17-18/3/08

Agenda.

Site Reports

CERN - V. Lefebure

Several Quattor instances :

  • Main instance : 7600 profiles with 1700 not Quattor managed (for inventory only), in 140 clusters
  • 2 instances for Linux controls
  • Desktops : currently using only a small subset of NCM components to configure side-wide defaults
  • Special requirement on components : touch only the part they manage, don’t remove comments
  • Rely on lcm rather than ncm-ncd to allow users to select component to run

PAN : still using v6 because of a performance issue with duplicate() function. Fixed proposed for v8, CERN may skip v7.

Namespaces : used mainly for staging purposes (/prod, /test…)

  • Just started to use namespace to organize templates

CDB : currently on a 4-core machine

  • 33 minutes for 7600 profiles
  • Memory issue : compile by batch of 100 profiles
  • May be related to the big RPM repository
  • More and more CDB users : issues with ACL management and performances
  • Compilation serialization leads to long apparent commit time : thinking about allowing some // sessions if no interferences between them.

SPMA and SWrep

  • Plan to cleanup RPM repository to improve compilation memory requirement

CDB2SQL and Oracle

  • CDB2SQL being rewritten : plan for 3x speed improvement
  • CCM v2 deployed : SSL based

Quattor development activities focused on maintenance of CERN templates, in particular namespace migration.

Xen-based virtualization use increasing but needs a more flexible template structure.

Trying to define a profile template structure everyboday will use with aim to separate OS configuration, VO configuration…

Other needs and issues :

  • SMS needs to access CDB ACLs
  • Update in progress to Quattor 1.3 templates : performance issue with push and npush, 2 CERN specific properties in structure_interface
  • Reviewing how to better handle non Quattor managed objects : remove requirements for not used information
  • Plan to use a specific profile_base for these systems
  • Migration of service data historically in CDB to SDB (Service Database)

Manpower situation at CERN:

  • ME left for INFN. Still working on Quattor core component maintenance but not focused anymore on this
  • German is no longer involved
  • Veronique is the only remaining person involved in Quattor maintenance : will concentrate on Quattor for CERN. Not sure to be at CERN anymore in 2 years…

LAL - M. Jouvin

http://indico.cern.ch/materialDisplay.py?contribId=2&sessionId=0&materialId=slides&confId=28976

NIKHEF - R. Starink

Sites:

  • NIKHEF-ELPROD : ~150 hosts
  • Will increase to 300 in April
  • Testbed : ~15 nodes

Install with Quattor for all grid machines

  • CentOS 3, 4, 5

Grid MW deployment with ncm-yaim

  • Lack of time to loak at/swith to QWG : open to benefit from & contribute to communicty effort
  • Interest from another site : occasion to look at it

Straightforward implementation of Xen guests using S. Childs’s work.

Monitoring : Nagios + Ganglia

  • Ganglia presents per-node or summary view of successful/failed execution of ncm-ncd
  • Nagios : alarms in case of non-zero exist (using NPRE)
  • Server configured manually

Pan Compiler : using v7.

SCDB : non-SVN version

  • Using Makefiles to hide SCDB tools

AII : successful migration to v2.

Facing scaling issues with TFTP and/or RPM repositories

  • TFTP unlikely to be the cause
  • External reasons like network ?

Grid-Ireland - S. Childs

18 distributed sites managed : ~400 nodes

  • Single SVN repository with 6 active users
  • RPM repositories replicated by rsync on each site

New compute resources deployed : ~40 Condor VMs

Isntallation of LEMON monitoring

  • Just for TCD site currently
  • Starting with Stijn’s receipee
  • Work required for better integration : alarms, IPMI

Xen config tieded up.

Developping GraphXML output format for template dependencies in panc v8.

Range of new non-grid machines installed

  • Portal server
  • Data management

QWG pulled in via

  • “current” pointer to certified revision
  • “trunk” pointer for development
  • Need to specify ignore:externals in case QWG repository is not accessible
  • Using Stijn’s dummy WN trick for performance improvement

QWG structure based on a site hierarchy : target “compile.sites”

Deployment of OS errata : still not working seamlessly

  • Mainly kernel issues

Plan to bring more service nodes in Quattor (e.g. Web servers)

Morgan Stanley - N. Williams

20K machines in 4 sites

  • Looked at several solutions, including Quattor
  • Pressure to have 10K machines managed with the replacement system by June. Will be a major test before final decision.

Like the Quattor architecture.

Tried SCDB but don’t like Subversion and prefer CDB model with specialized commands for non specialist users

  • Try to write a new thing merging both : AQDB
  • Looking at performances for compiling 10K machines
  • Concerns about scalibility of build server : DHCP, HTTP…

Using AII to manage initial installation :

  • Would like to dramatically improve installation time from 20 mn down to 5 mn

10 people involved, 5 really writing templates : main complaint is difficulty to locate where are the things included

  • Would like to use namespace to provide better predictibily

SPMA : dependency management is very time consuming

Monitoring : very interested by LEMON but no time to look at it in details.

UAM - L. Munoz Mejias

Several clusters :

  • Atlas T2 : 150 nodes, SL 4.5, managed with QWG templates, monitoring with Nagios
  • GVM-UAM : private cluster for UAM users, not necessarily HEP. Many pb with HW support on SL 4.5.
  • 40-desktop cluster : still running old-CDB, will probably be closed (old HW)

Quattor changes :

  • Implementation of staged deployment
  • Dropped SWrep in favor of HTTPrep
  • Secure delivery of profiles, using certificates : lightweight alternative to SINDES developped.
  • AII v2
  • Nagios integration into templates

Plans for the short term:

  • Panc v8
  • Quattorization of Quattor instances

INFN - A. Chierici

Running SLC4/gLite 3.1, except for CE

  • Disk servers run 64-bit (except if machine doesn’t support it)

Quattor configuration :

  • Using ncm-yaim to configure grid services
  • Adopted a new NS schema, close to NIKHEF one
  • Xen studied and tested : Xen-UI to be deployed soon

M.E. back at CNAF : continue to support core development

  • But only for one year…

New LHCb Tier-2 hosted at CNAF and fully quattorized.

PANC : migrated to v7 after Madrid workshop. 40% perf improvement.

LEMON configured using Quattor

  • Storage nodes and WNs
  • On WNs, conflict with GridICE
  • Migration in progress to NS
  • LEMON used for monitoring, Nagios used for alarming

More than 10 people involved in Quattor template maintenance at CNAF

Concerns about CERN role in the future : is the community strong enough to take over CERN role ?

Developped a web application to display node Quattor status graphically

  • Organized by rack
  • When a pointer is on a machine, displays the status of the components
  • Status retrieved from a MySQL database, filled with a cron job
  • Able to send status through RSS

Philips Research - S. Vrijaldenhoven

Current internal cluster not managed by Quattor. Want to move to grid to get access to more resources.

Test grid cluster managed with Quattor (SCDB+QWG templates)

  • SL 4.5, gLite 3.1
  • Plan to move to production soon (planned March 25th).

Nagios work going on.

IBCP - C. Eloto

1 cluster running gLite 3.1

  • 1 CE, 1 SE, 18 WNs
  • Everything installed in VMs
  • Quattor choosen to compensate from lack of manpower : seem to provide efficient management

Site not yet certified : pb with SE.

Faced many problems with early release of QWG templates for gLite 3.1 and some unusual settings (RPM server different from TFTP server)

Documentation needs to be improved for AII configuration, Python 32-bit for 64-bit, gLite 3.0/3.1 coexistence.

BEgrid - Stijn De Weirdt

Quattor configuration :

  • Central configuration database (SCDB)
  • RPM repositories with SWrep
  • Certificates used for ACls on SVN and SWrep
  • Deployment in 2 phases :
  • When central admins update central repository, site admins are notified
  • Deploy the changes with a custom script if they are interested

Sensitive information deployed with SINDES but not properly integrated with AII

  • Need to be done for AII v2

Still not using QWG OS templates for historical reasons but plan to move.

Use of dummy WN build (idea from CERN) to speed up the compilation

  • Compile just once
  • Reuse compiled version of a WN in every real WN profile
  • Not yet integrated with QWG : some changes required, mainly in machine-types/base.tpl and machine-types/wn.tpl

Other issues :

  • Better integration of monitoring tools in Quattor
  • Integration of DNS management in a way similar to DHCP

Main Developments

QWG Templates - M. Jouvin

http://indico.cern.ch/materialDisplay.py?contribId=18&sessionId=2&materialId=slides&confId=28976

AII v2 - R. Starink

Reasons for v2 :

  • v1 limited to PXE + Kickstart. Want to support other installation infrastructure like JumpStart
  • Device schema limited to /dev/[hs]d[a-z] : other schema supported by some hacks
  • RAID and LVM support very limited and impossible to combine
  • KS templates were becoming too complex and not maintainable
  • Untyped schema : no validation possible at compile time
  • Code unnecessarily complex : 2800 lines in v1 vs 1600 in v2

v2 architecture :

  • Front-end is using plug-ins to do the real work
  • 3 types of plugins : NBP, DHCP, osinstall
  • Node profile determines which plug-ins are loaded

aii-pxelinux : default plug-in for NBP, no more use of template

aii-ks : default plug-in for osinstall, no more use of template

  • Generator for Kickstart files
  • Support complex blockdevice combinations : based on ncm-lib-blockdevices
  • Flexible partitioning and formatting of file systems… but a bit slower
  • Site specific setup through hooks : plug-ins for aii-ks organized in 3 groups (pre-install, post-install, post-reboot)
  • Configured through PAN
  • Ordered list

AII configuration has changed… but easy to adjust

  • NBP : configuration moved to /system/aii/nbp/pxelinux
  • osinstall configuration : configuration moved to /system/aii/osinstall/ks
  • Customizatble via 22 variables AII_OSINSTALL_*
  • hooks : excute NCM object, print what is desired in KS file
  • File system definitions : separated between /sys/blockdevices and /system/filesystems
  • Require new pan-templates and ncm-lib-blockdevices

Almost no change in command line interface : aii-shellfe, aii-installfe

  • –notify being reimplemented

Upgrading Quattor server

  • Pre-requisites : CCM >= 2.0.2, pan-templates >= 2.7.1, ncm-lib-blockevices >= 0.17
  • Packages : aii-server >= 2.0.4, aii-ks >= 1.0.1, aii-pxelinux >= 1.0.0
  • Configuration moved from /etc to /etc/aii
  • Move configuration for NBP and osinstall : remove everything under /system/aii/*/options as it will be assumed to be related to a plugin ‘option’

Possible contributions :

  • Alternative modules
  • Hooks : generic (e.g. SINDES support) or site specific to be used by others as a starting point
  • Separate directory for sharing contributions and enhancements : contrib
  • Extensions maintained by extension author
  • A generic extension may be moved to standard AII

Support for AII v1 has been dropped.

Already tested with Xen at NIKHEF.

Update on blockdevices layout - L. Munoz Mejias

blockdevices :

  • partitions_add : bulk addition of partitions to a disk. Partitions given as a pair name/size.
  • lvs_add : similar for LVM
  • Support for LVM stripping aded : stripe_size property on logical_volumes structure.
  • No support for HW raid yet
  • No support for quota yet : looking for suggestions about a quota schema

Core Components

PAN Compiler - Cal

Production version 7.2.9

  • Bug fixes only

Version 8 in development, still in trunk

  • 8.1.0 tagged last week and available for evaluation
  • Feedback from each site would help moving this to production : large chunks rewritten
  • Removed features deprecated in v7
  • Keyworks : define, delete, description, descro
  • Types : embed, fetch, stream
  • Newly deprecated features in v8
  • Bareword include : include mytemplate; changed to include { 'mytemplate };`
  • Using type for binding : must be replace by bind
  • Lowercase automatic variables : loadpath, self, object
  • Language changes :
  • External path syntax : //myboject/some/absolute/path to be deprecated in favor of /my/object/tpl:/some/absolute/path
    • Will allow eventually object templates to be namespaced
  • Literal escaping of paths : can escape part of a path with possibility to improve significantly performance of some templates
    • ‘/some/path/{a/b}/example’ would translate to ‘/some/path/a_2fb/example’
    • Avoid a call to escape()
    • Avoid using DML on large blocks like in SPMA functions because part of the path must be escaped (e.g. package name).
  • More limits on structure template contents : variable, functions, include no longer allowed
  • New and changed functions :
  • format() : printf-like capabilities. Can avoid incremental building of configuration file contents.
    • Not printing everything by itself
  • is_defined(), is_boolean()… return false instead of error if variable does not exist.
  • String manipulation : to_lowercase/uppercase(), split(), replace()
  • to_string() will accept any element, undefined, null and resources included
  • New automatic variable : TEMPLATE. Name of the template that initiated a DML block.
  • Not changed with function calls, including create()
  • New output format : write machine template as a dot (Graphviz) file
  • Logging capabilities added
  • Several logging types : task, call, memory, all, none
  • Messages short and easily parsed for analysis
  • Cost : 15% slower with all loging
  • Example analysis scripts for memory usage vs. time, task per thread vs. time, graph of call (include) structure, performance studies (how much time spent in every template)
  • Change in global variable handling : variable X = ...exists(X)... always true.
  • Must use null instead of undef for tri-state variables

Implementation changes in v8 :

  • Better handling of SELF : faster and less memory intensive, more consistent in various contexts
  • Some optimization : evaluation and use of compile-time expressions, specialized operators to aovid redundant checks at runtime, optimization to allow stricter syntax checking (earlier detection of some errors)
  • “Read-only” resources : infrastructure in place to avoid unnecessary copying of resources but not used yet. Some semantic issues to work out.
  • GRIF full build : 25% faster than v7

Documentation updated, including tutorial and man pages for PAN functions.

Migration of standard templates to v8 required to avoid deprecation warnings

  • CERN has to migrate to v8 to benefit from last change to improve performance : require to remove no longer supported keywords

Would like to clean up panc script options. What are the options used ?

  • To be discussed on the mailing list

Authorization/Entitlements (Morgan&Stanley request)

  • What change : template or configuration ? Real issue is probably configuration and it’s triky to know the reference value.
  • Who did it ? How to get user identity ?
  • Was it authorized ? How to define authorization ? One possibility is to define parts of configuration than can be modified by a template but at the price of flexibility.
  • Need to refine the use case and possible design : first discuss on mailing list and try to write a wrap up, then try to implement a prototype
  • To be acceptable the performance price for this should be only when you use the feature (and not for everybody like in panc v6).

=== CDB - M.E. Poleggi ===

Completed since last workshop :

  • Pan parser : cdb-tpl-view. Available and integrated with pangraph
    • Only includes until panc v8
  • Bug fixes

Postponed :

  • Fine-grained CDB locking with fair-queueing
  • Common authentication service

Issues:

  • Manpower : ME just back into active development and no other contributor foreseen, even though CDB is pretty complete and stable.

Other Core Modules - M.E. Poleggi

See presentation.

CDB :

  • Deployment based on stages in production at CERN

CDB2SQL : no progress, still alive at CERN

SWRepSOAP : collaboration with BARC ended, need for a new maintainer. Open issues :

  • Check signatures

CCM : no maintainer

  • Only bug fixes in the past months
  • Wish list :
  • De-privilege CCM execution
  • Global flag to disable CCM update
  • Luis as the next maintainer

cdispd : no maintainer. Several pending requests :

  • Add global activation switch [#13484]
  • Improve handling of pre dependencies [#23814]
  • Should be restarted after changes in modules [#27529]
  • Cal as the next maintainer ?

ncd : no maintainer: Open issues :

  • Provide a way to force the reconfig of a node [#7791]: could NotD be used?
  • Generic ‘provides’ for NCM components [ #17681]
  • Establish relationship between NCM-component RPM version and config info
    [#21204]
  • Add a new component dependency type to run a component as the last one
    [#27532]

Basic framework :

  • What’s the decision about enforcing LC::Process
  • Add to build tool ability to generate a skeleton for a new component

SPMA/rpmt : no maintainer. Main open issues :

  • Formattiong of SPMA output (#6203, #16825)
  • SPMA should upgrade and not deinstall/install when both arch and version are changed (#19934)
  • SPMA gives up at the first I/O error (#29029)
  • Michel as the next maintainer for SPMA

Build tools : ME will continue to maintain them

Quattor release 1.4 planned with AII 2.0

NotD : notification based on wassh. Used in production at CERN

Common Logging Service : NEw module CAF::RepLogger for applications not instantiated as CAF::Applications

Coding Conventions - M.E. Poleggi

Current recommendation available at https://twiki.cern.ch/twiki/bin/view/ELFms/WriteNCMComponent.

  • Use blank instead of tabs, recommendation : 4 spaces
  • Cleanup modules before making changes
  • Make a header with author, revision… : make header

Documentation :

  • CVS : README, POD files for man pages
  • Technical documents : LaTeX or DocBook, DocBook is an alternative also to POD
  • WWW : Wiki pages mainly, encourage online use instead of PDF copy

Quality assessment :

Other issues :

  • Call wrapper facilities :
  • Enforce LC::Process ? Some caveats (#31194) but maintained. Need to upgrade to last LC:: version
  • IPC::open3 as an alternative but requires a wrapper around it
  • open() wrapper : log which files are touched. Use CAF::Log ?
  • Luis agrees to look at it before next workshop. First contact L. Cons to know the exact status of LC::Process (ME).

Code reviews : time consuming, concentrate on most important things.

  • exit() calls in components
  • Ignored return values
  • Consistency of options with documentation
  • We probably don’t have the manpower to do it regularly, ‘'’concentrate on the process’’’ to ensure nobody new to the framework breaks a component/module because he is not aware of the ‘‘implicit rules’’.

Quattor Documentation

Current sources of information

  • Main web site
  • Twiki
  • QWG wiki
  • Savanah
  • CVS/packages : should be possible to generate web pages automatically from the contents in CVS and RPM packages. Difficult today

Developper’s documentation :

  • Reorganize a bit developper’s guide to have all the information in one place and in a not too long documentation, have it distinct from user’s documentation.
  • Add LC:: and CAF:: documentation pointers to Quattor web site

quattor.org web site :

  • Keep just one page with a short introduction about Quattor and referring to 2 wiki pages : user and developper
  • In each wiki page, have a list of reference to documentation relevant for each user category

Project Hosting Discussion

CERN is now more a user than a major player of Quattor which becomes more and more a community product. Moving hosting to a source forge would allow its administration to be done by the community, whatever where its members are from, and increase its visibility.

  • They provide a rather rich integrated infrstructure for hosting code, writing documentations, bug tracking…

Agreement on the principle of moving to SourceForge.

  • Need to figure out what is involved : start with a module like NCM components and review the implication with build tools
  • Need to define a strategy for documentation and how to incrementally move it to this site
  • Stephen will create/initialize the project
  • Need to discuss with German implications on quattor.org domain and validate with CERN they agree with this move (probably required for project approval by SourceForge).
  • Review the progress and precise plans for moving at next workshop or before on mailing list if we make quick progresses.

License : we need to clean up the license used by the project.

  • Let start with EDG license : every open source license is compatible with sourceforge.net

Release Process - M.E. Poleggi

Progress since the last workshop:

Next release planned: 1.4-1, possible release date May 1st.

  • Unmaintained components : review the list and either mark as obsolete or assign a maintainer
  • Skip obsolete components
  • Use STABLE tag rather than head when defined

Policy for accepting new developper : ensure people with write access to repository knows the rules, even the implicit one.

LISA Paper - Discussion

Reasons for rejection last year:

  • Technical level of the paper : we need to choose to be a user or technical paper.
  • Remove comparison with competitors : concentrate on 2 other solutions
  • Develop presentation of PAN as the distinctive value of Quattor
  • Take input from Nick about Morgan’s reasons to choose Quattor
  • More details on main use cases : GRIF, GRID-Ireland
  • But they as distributed sites they are unusal cases : need to better explain the reason and advantages for this
    and emphasize there is a growing need for this. Explain the benefit of distributed management.
  • Ephasize the specific Quattor features, in particular PAN, making it possible with QWG as an illustration.
  • Avoid a catalog of similar things : try to merge as much as possible.

Deadline : May 8th.

  • Requires almost final version to be ready May 1st.
  • Specific contribution to be available soon

Probably change the paper title to “Distributed management accross distributed domains”.

  • Mention CERN and desktop/latop management as an illustration you can also do traditional management and about Quattor scalability but not have them as a separate section.

First steps (2 weeks): phone call between Monday March 31st and Thurdasy April 3d

  • Stijn, Stephen and Michel exchange between themselves about what they see as the important points to develop about distributed management. Commonalities and differencies.
  • Cal reviews PAN section
  • Nick provides input on Morgan’s evaluation

== Conclusions ==

Next workshop in Amsterdam (NIKHEF), Oct. 27-29.

  • Start : Monday afternoon, Wednesday afternoon : specific // sessions (CDB, SCDB, QWG Templates…)
  • Open bugs must be reviewed before the meeting…