3d Quattor Workshop Summary (March 2007)

Written
12 March 2007
Author
Michel Jouvin

3rd Quattor Workshop - Dublin - 13/3/07

Agenda

Site Reports

CERN

2 instances :

  • Main for compute clusters (89) : #5000 nodes
  • Linux for controls

Some experiments are setting up their own instances

Desktops partially quattorized :

  • Installation with YUM
  • LCM used as a local interface to NCM

MAin instance setup :

  • CDB : still based on flat pseudo namespace for custom templates
  • SPMA + SWrep with extended authentication capabilities (X509, Krb, NICE)
  • LCG through YAIM

Quattor development activities :

  • 99% automated release process
  • Integration with ETICS to come soon
  • Thinking about better integration with Savanah (reports…)
  • Default templates set ready for namespaces
  • Core modules responsability : CCM, CDB, CDB2SQL, NCM framework, SPMA, SWrep, wash

Secure profile transfer over SSL : put in production and working but had to remove it because of the load in case of bulk transfers

  • Waiting for new HW

Xen based virtualization : started to look at Quattor integration with ncm-xen

Miscellanous :

  • Integration of Quattor services with SLS
  • SL5 : no real interest from Linux support, on hold. May be resumed for supporting new desktops

DESY

Quattor used to manage systems for D-GRID initiative :

  • 280 boxes : 250 WNs.

  • Still running Quattor 1.1 with AII, CDB, SPMA, YAIM
  • Home grown components for running YAIM and updating certificates
  • Web interface (DESY specific) developped for administration of systems with Quattor (e.g. install a new WN)

D-GRID reference installation requires SUSE Linux on servers !

  • Implemented a very rudimentary integration to produce a profile allowing installation through AII
  • No plan to quattorize SUSE !

Issues :

  • Compile time : what if we go to virtual machines and double the number of machines
  • How to improve errata handling ?

Dublin

Quattor configuration used :

  • Still a CVS repository with QWG + SCDB tools
  • Dublin fully controls grid systems at every site (no site customization)
  • One single repository for all the sites managed by Dublin, replicated SW repositories at sites
  • 357 machines

  • 3 Quattor deployment servers : production, testing, e-learning

Integration of external clusters required Condor support as a LRMS

  • Added ability to select LRMS on a per queue basis inside a CE

More systems moved to Quattor including SL4 and XEN

  • Moved to new QWG templates, structure stabilized (no change since October)
  • Released ncm-xen

Looking at LAL monitoring through Nagios to detect deployment status.

Issues :

  • How to agree on errata deployment ?
  • Need to cleanup local templates to be inline with QWG standards

BEgrid

UGent stopped using Quattor because of severe issues with vserver (virtualization server)

  • Pb with SPMA not able to install all RPMs in one go, had to rerun it severaltimes
  • Want to get back

QWG templates are going better and better

  • Still busy integrating local mods into default templates, when relevant

Work in progress :

  • gLite integration (Shkelzen)
  • dCache integration
  • Monitoring template integration
  • Solaris support : quite some work needed on AII. Plan to use it for storage.

UAM

2 grid clusters : UAM-LCG2 and GVMUAM-LCG2

  • SL 3.05 on WNs, SL 4.3 on other servers
  • Quattor release 1.2 with SCDB and QWG templates. Eclipe used for management
  • SE : dCache. Manually managed right now. Plan to use NCM components when ready.

1 non grid cluster :

  • Managed with Quattor. Still using CDB.
  • Quattor server : 2 machines with one virtual IP. Currently no HW software, based on manual replication.
  • Many local customizations to template to match specific requirements

New Quattor expert : Luis Munoz.

  • Currently working on AII developments and ncm-accesscontrol

CNAF

Using Quattor 1.2 without QWG templates

  • Implemented their own solution for namespaces and template access control

Future :

  • Upgrade to latest Quattor version : main problem is manpower
  • Migration to new namespace solution : difficult process for storage nodes
  • New blade farm to be installed soon : requires fixing handling of megaraid device in AII
  • Xen under study

GRIF

See agenda

Missing Sites

NIKHEF : Ronald unable to attend this meeting. Still using Quattor with YAIM

PCI : no news but seems to be still using Quattor.

University of Aachen

PAN Compiler - Cal

Current status :

  • panc 6.0.3 : production version, completely frozen, mno-threaded
  • panc v7.1.x : beta (production@GRIF), feature frozen, multi-threaded

New compiler nearly 100% backwards-compatible

  • ‘final’ keyword not enforced in structure templates
  • Additional operators and functions : bitwise operators, if_exists, is_number…
  • New function to force traceback at any point
  • Support for ‘bind’ statement to replace ‘type path = type’ (grammar simplification)
  • Ability to compress profiles and pre-compile individual pan template files (but seems to provide no perf gain in current implementation)
  • Deprecation option : display messages about deprecated features used in current version (for the next version or next 2 versions).
  • Properties no longer duplicated but shared between objects : reduces memory consumption. No change in behaviour at user level.
  • Ready for production (extensive testing done with test units and at GRIF)

Performances :

  • On single-CPU machine, comparable to panc v6
  • Significant improvement on multi-CPU machines but scaling currently far from linear. Built-in multi-thread features with 3 threads per CPUs allocated by default.
  • Need to search “tuning” space : many parameters that can be tuned with Java (memory, threads…)
  • Currently not scaling linearly with the number of cores : need more investigations to understand where the problems are (panc, OS, HW, Java VM…)
  • May look at building a compiler farm using Java integrated web services

Checks required moving to production :

  • Memory consumption acceptable for large builds (in particular the scalability with the number of RPMs in the repositories)
  • Command line scripts is compatible with CDB

Issues :

  • UTF-8 processing by downstream Quattor components : need to discuss what the best solution to limit the impact on components without loosing advantage of UTF-8 support.

Language and compiler documentation and evolution :

  • Description moved to LCGQWG wiki
  • Minor version without language changes
  • Major version involve language changes : planned changes target grammar simplification and comilation speed-up.
  • v8 : suppress ‘define’ and ‘delete’ keywords
  • v9 : switch to uppercase automatic variables
  • v10 : allow only ‘include {dml};’

Source and binary distribution :

  • Sources in a SVN repository (LCGQWG)
  • Binary RPMs built and distributed through ETICS

Future compiler optimization :

  • “Machine code” reduction/optimization : nothing done in current version, learnt from previous versions that this can lead to significant speed-up.
  • Function optimization : built implementation of list/nlist functions
  • Automatic caching of invariant sub-trees : not very easy to do when using lot of global variables as switches (like in QWG templates).

CDB - ME Poleggi

Mainly minor improvements and bug fixes since last meeting.

Still on todo list :

  • Fine-grained CDB locking with fair queuing
  • Common authentication service

Future ideas :

  • Make dependency information available to user interfaces for processing by GUI like pangraph
  • get rid of internal Config lib in favour of AppConfig::File (CCM affected too)
  • Look at lowering SOAP requirements for memory

SCDB Status - Michel Jouvin

See agenda.

Core Components Status - ME Poleggi

Done :

  • Support for namespaces in default templates, with ability to convert old templates. How and when to obsolete old ones ?
    • ncmtplconfvert does the conversion, called by ‘make tplconvert’
  • SWRepSOAP with some enhancements
  • getRecHash() function for converting config tree to PErl data structure
  • cdispd : PID registered in a file, log rotation
  • ncm-ncd : ccm-fetch information in log files, suppress unnecessary locking
  • ncm-templates : new tags supported (LFOR)

In progress :

  • CCM to accept non local profiles (as done with AII)
  • Release process integration with ETICS
  • SPMA/rpmt support for checking signatures : remains a problem with rpmt-py
  • New partition scheme
  • Release 1.3 : namespaces, cumulated fixes and core modules and components enhancements
  • Notification system : ability to notify a set of nodes for a set of task to run. First version in CVS, not functionnal yes. With contribution from BARC

Still pending :

  • CDB : controlling template naming per namespace
  • CDB2SQL : direct back-end interfacing, like XML DB, to avoid reparsing XML profiles
  • panc : template where only key/value are possible that can be used as input by other templates. CERN request for easier reading/writing by other apps.
  • New indirection level for device identification to tackle with device name change when installing a new OS release (e.g. 2.4 to 2.6)

CERN manpower very limited : 1,4 FTE

  • Includes 1 FTE contributed by CNAF and 0,2 by BARC

New component source structure : do it asap

  • Keep the current TPL/ directory as the place for .tpl.cin files (don’t replicated installation structure)
  • Convert current .tpl.cin structure to namespace and new names
  • Increase the minor version for the component

AII : move configuration information from /software/components/aii to /system/aii and related templates from components/ to quattor/ namespace

  • Can be done asap or later depending on manpower

SPMA configuration : may move /software/packages and /software/repositories to /software/components/spma

  • When done, would probably be no longer need for /software and we could move /software/components to /components
  • Let’s discuss more and take a decision at next meeting

Delay Quattor 1.3 until end of March to integrate :

  • Component sources conversion to namespaces
  • Moving AII configuration information in the config tree

wassh : parallel execution engine on top of ssh

  • Now part of Quattor, will be in 1.3
  • Support multiple clusters and sub-clusters : plugin to resolve clusters. Need a volunteer for writing a XMLDB back-end
  • Useful as manual notification tool

pangraph : graph representation of Pan templates. Produces a .png file.

  • Should handle any type of templates : use RE to match and parse the inclusion lines in the template
  • Doesn’t handle templates accessed by variables
  • Integration with Lemon’s XML parser
  • Could probably be improved if there was an ability to produce the include graph in panc (inclusion graph is internally build in v7)

QWG Templates - M. Jouvin

See agenda.

gLite Configuration - Shkelzen Rugovac

Status :

  • WMS : working, including VO configuration. WMS and WMSLB can be configured on separate machines but need to be tested.
  • gCE : still work in progress

RPMs list produced by script.

YAIM used as a source for configuration information.

Needed to adapt ncm-glite (committed to CVS)

  • nostart/noconfigure/noconfigfile lists
  • Change order of configure/start sequence : do configure/start for each service instead of all configure, then all start.

WMSLB :

  • Volist : redundant with the standard VO config. May be need a function to convert.

gCE :

  • Should be ready next week…

AII - Luis Munoz

Current status :

  • Several long lasting bugs and feature requests closed
  • Notification in case of success, not only errors
  • Namespace templates
  • Kickstart template refactoring

Working on improving and updating documenation (man page) to have a full coverage of options.

Pending problems :

  • AII templates schema less : easy to make mistakes detected only when deploying
  • Software raids

Future plans :

  • Define all the possible options explicitly in AII schema.tpl
  • Remove the use of Kickstart templates and ncm-templates processor in favor of direct use of NCM::Component. Try to use ncm-ncd rather that tweaking it, adding ability to use remote profiles (with an explicit URL).
  • No firm date, hopefully before next workshop

Not sure that a Kickstart template can be replaced by a generator that uses only the machine profile. Would probably be better to add to the template processor the ability to call an external function for complex things like producing partitions.

  • Already in use at CERN

Release Management - ME Poleggi

Goal is fully automized process with single configuration point and simple procedures accessed via a unique interface.

Several actors :

  • Contributors : responsible of their own modules. Should keep up to date official status information.
  • Uers : in the feedback loop about problem/fix status but should not decide a component status
  • Release manager : call for release phases, in charge of collecting, building, packaging.

Tools :

  • TWiki : release procedure documentation, official current status page, minimum developper’s convention page (coding standards, CVS handling, Savannah usage), templates for extra documentation (e.g. man pages).

  • CVS :
  • need for a code maturity indicator ? how to implement it : branches ?
  • stricter commit pre-requisites based on checking config.mk contents
  • make : the user interface through several targets.
  • Savannah :
  • Stricter bug workflow (leads to extra load and emails…), fixed mapping between CVS directories and component field. ‘Ready for test’ should not change the person it is assigned to.
  • Probably better workflow documentation could help
  • Automatic extraction of change log information should be possible but neither time, nor expertise to do it.

Release procedure : 3-step process

  • PREPARING : 10-15 days for code (feature) freeze
  • Beta : code frozen
  • Release : tag in CVS

Status page :

  • Meant to be a unique official reference
  • Each developper is responsible of a small part
  • Primary source for the list of candidate packages composing the next release
  • Minimum dynamic usage : release status, components status/maintainer
  • Possible add-ons : links to release notes, links to Savanah items… but links clog TWiki code !!!

Current release process status : 99% automated procedure

  • Mainly global ChangeLog needs to be produced manually
  • Parallel build on different platform (via different checkouts in AFS)
  • Building time is currently 2 hours (not including the time to produce documentation…)

Open issues :

  • Orphaned modules and components : who wants to adopt them ? What about getting rid of MAINTAINER file in favour of a tag in config.mk ?
  • ETICS integration : project area is ready. Still to decide how to enhance the current build framework.
  • Test framework : all core modules should have test units, all components should go through a template syntax check
  • Release updates could be managed as full releases : currently they are not going through the same formal process.

Cal remarks :

  • CVS branching add too much complication. Not a good idea.
  • TWiki is not good for generated information. Need to put it somewhere else (Quattor site ?) and have links to it.

Status tag for component version : add a ‘make stable’ that will move stable tag to last version or a specific version.

  • Use this stable tag to produce the status page.
  • ‘Note’ column in status page should come from config.mk
  • Add an OBSOLETED_BY tag to config.mk

Packaging :

  • Review the metapackages available
  • Do a nigthly build of stable version RPMs and build a RPM repository. Should be provided by ETICS framework.

XEN support - S. Childs

Grid Ireland has deployed service nodes on Xen since Jan. 2005.

  • configuration of Xen VMs largely manual
  • Would like management through Quattor to help creating on demand VMs

Issues :

  • AII integration to allow Kickstart installation of Xen VMs
  • ncm-xen : Quattor component for generating guest VM configurations
  • ncm-grub : support for Xen multi-boot

AII integration :

  • 2 VM types : PVM (Para-virtualized, no net card and unable to PXE), HVM (Hardware, fully, virtualized, PXE capable)
  • Simplest option is to have host do PXE on behalft of guest VM using pypxeboot
  • Installation requires a patched Anaconda for Xen : available from CERN for SL(C?)3/4 but not integrated in standard SL(C) distribution

pypxeboot : Python script invoked by Xen during VM boot acting as a Xen “bootload”

  • Retrieves kernel and config for the VM
  • Uses a customized udhcpc to get IP address for guest’s MAC address (MAC address added at VM creation time)
  • tftp client used to retrieve pxelinux config, kernel and initrd based on VM IP address

ncm-xen : handle VM configuration and creation

  • Creates VM configuration file
  • Schema allows to describe VM HW config (memory, disks, interfaces) boot information
  • May probably be retrieved from machine profile for the machine hosting VMs using ‘value(‘//VMHostProfile/path’)
  • Disks : support both file based disk and dedicated partitions. Create LVM volumes if needed. Would be easier if ncm-lvm component was available (one being written at CERN ?)
  • Launch installation of VM through AII (the machine hosting VMs may be installed with AII first)
  • First serious use for Grid-Ireland to start soon

NCM Components - G. Cancio

Virtual dependencies to allow to express a dependency based on a feature rather than an implementation

  • Typical use case is package updater : allow to depend on package updater rather than spma if people wants to use other package updaters

Meta-components : not really useful. Equivalent is ncm-ncd –all or –component

ncm-ncd –noaction : currently call all components but not enforced by components

  • Require a component to export ‘NOACTION_SUPPORTED’ to be called in no-action mode

Add a property (/software/components/cdispd) to disable cdispd on a particular node without shutting it down.

ncm-cdispd pre dependency ordering : have a global flag to select between current behaviour and a relaxed mode where dependencies are run only if their configuration changes.

Component development guidelines :

  • Move current component writing guide (and related) to TWiki
  • Document configuration options handled by the component
  • Version number for stable release should be >= 1.0

New NVA API to load configuration in a Perl hash :

  • Recommend as the default
  • Load boolean properties as Perl booleans, nlist as hash and list as arrays
  • Rename as getTree()

ncm-access_control replacement - L. Munoz

ncm-access_control :

  • Control users authorized to login, through with credential
  • sudo control

Current ncm-access_control is ugly and full of CERN specific features

  • Does too many unrelated things

Proposed splitting into several different components :

  • ncm-useraccess : control user access to a node (.klogin, SSH public keys), ACLs on services, implement the concept of role (same configuration shared by several users)
  • ncm-sudo : edits the /etc/sudoers, taking full control of it (except non-scalar variable like lists)

ncm-useraccess needs minor review :

  • roles defined in user resource must be added to role members defined with the role
  • SSHKey should be renamed SSHKeyURL and support for specifying the SSK key inside the profile should be added
  • Limit duplication of information (krb4/krb5) when possible

Dissemination

Usenix

Usenix paper submission : must be done before May 14th

  • Final paper : 16 pages max, including figures…
  • Draft paper to be submitted in May. Need to be almost complete.

Need to concentrate on :

  • Administration paradygm description : declarative model. Reward LCFG for introducing ideas
  • Illustrate rather than describe PAN powerfulness
  • Short description of main components
  • Description of some use cases, with a focus on distributed sites management. Should include simple site and desktop management as use cases. Don’t insist too much about grid.

Must include a short review/comparaison with other products we know.

Insist we share a lot of components and configuration information even if the use cases are different.

  • QWG templates as a proof of concept, even if grid is not the main focus for the paper

Mention on going work to integrate VM management.

Try to position ourselves against new management “standards” like WSDM, MUSE.

Try to get “external” opinion : e.g. person from Philips who attended the first workshop (contact : Ronald).

Mention future work and directions :

  • Workflows (moving from a test config to production…)
  • Porting

Others

CHEP : submit a talk proposal about QWG templates and distributed management

Web site : need to be refurbished. Both content and presentation.

  • Need to find somebody (probably outside CERN) who could contribute. LAL ?
  • First need to decide the appropriate structure and tools

Conclusions

Next meeting in Madrid around mid-October