9th Quattor Workshop Summary (March 2010)

Written
17 March 2010
Author
Michel Jouvin

9th Quattor Workshop - 17-19/3/10 - Thessaloniki

Agenda.

Site Reports

LAL

See slides.

MS

Configuration changes:

  • Still one configuration server
  • 2 set of boot servers (Europe, US)
  • 20K managed machines (+5K)
  • Virtual machines with ESX: 1200 ESX server, 1700 managed machines
  • AII rebuild time still 3h
  • Compile time increased from 10 to 18 minutes
  • Using MS-specific version of 8.2.9
  • 8 people involved in template development maintenance (+2)
  • LEMON used: 2x2 servers

Issues:

  • No code submissions recently: legal issues with SF…
  • aqd scalaibility: priority for next year
  • Pending open-source:
  • AII, CCM patches: in prod at MS
  • FUSE interface to configuration browsing: ready for distribution
  • Aquilon: RAL interested, still a significant MS-dependencies

AUTH

Migration from cfg/local to SCDB sites still in progress

Included monitoring information at profiles

Attempt to use Quattor as a central DB for all assets (KVM, StorageArrays…)

  • Implemented as a separate cluster group

SCDB clusters by machine criticality

Xen configuration now based on QWG with a few developments

  • Removed host dependency on guests
  • Trying to remove the use of DHCP

Templates for new machine types

  • Message broker, in production at AUTH
  • egee-NAGIOS/egee-SNAGIOS: work in progress, not based on previous Nagios templates from QWG

QAD: Nagios interface almost completed

JRuby added to SCDB to add task written in Ruby

  • Used for QAD integration

Issue: sync with QWG, how other sites are doing?

CERN

'’Note: report about Quattor in computing center only.’’

CDB 2.2.5 and panc 8.2.11: 10K machines

  • panc 8.3.0 tested

New development: Django-based search tool for CDBSQL

  • Using sqlalchemy: Oracle independent

Test of SCDB: work done by Daniel with the help of Luis

  • Test done with one profile and the CERN template structure
  • Observations:
  • profile location is a big change for CERN
  • Decoupling of commit and deploy may lead to potential problems
  • SVN ACLs are only at the directory level: having them at the file level is an advantage in CDB
  • +: build-in ACLs, local panc compilation giving access to log files

RFE: pkg_add/pkg_repl based on date to ensure people are not installing obsolete version. Is it feasible?

Can we agree on a policy for announcements of new components? Exact rules for using quattor-discuss rather than quattor-grid? Do we really need 2 lists?

NIKHEF

5 clusters, 500 managed machines

  • 350 clones
  • SCDB 2.3.1 + local changes
  • panc 8.2.8
  • Compilation time: ~30s

6 people involved in Quattor maintenance (4 active)

Virtualization: 13 hosts, 48 guests

Nagios: 1 master, 3 slaves

Refactoring of templates for easier usage: not using QWG

  • VO configuration, VOMS
  • Batch system access control, VOViews
  • User banning

Virtualization: currently not used with grid clusters, investigating KVM

Grid Ireland

18 sites, some have expanded.

SCDB with bzr and Git clients and an old version of QWG (r4921)

  • Not up-to-date in particular with errata and VO config

RAL

90% of T1 Quattor managed. T1 fully committed to Quattor.

  • Mostly WNs (700 already, 400+ soon)
  • Also other machine types: bdii, vobox, ….
  • Disk servers: OS + all package management operations, CASTOR config still relying on Puppet
  • Nagios servers: OS only

9 sysadmins involved + 6 more coming

SINDES on the point of deployment to ease server certificates handling.

Ready to start experimenting with Aquilon

Using SCDB+QWG (up-to-date with standard templates)

  • Happy with performances: 4 minutes for a full compile, without profile cloning
  • Would like to slim down RPM list used by various machine types. Is anyone else interested?

Very positive with Quattor so far

  • Much more consistent
  • Batch service much easier to manage
  • Errata deployment much easier
  • Much more agile: easier to deploy new versions

CERN LHCb Online Farm

Online computing

  • Event filtering farm: currently 700 nodes, 500 to 1000 to be added in the future
  • Diskless systems
  • Credit card PCs
  • Various control PCs and infrastructure nodes (Web servers, DNS, ….)
  • Also some Windows nodes (120) but mainly running Linux SLC4/SLC5/RHEL5

Linux machines managed by Quattor

  • Original setup done in 2006 with CDB/panc v6/aii v1
  • SWrep replaced by scripts
  • ViewVC web interface to templates and XM files
  • Problems with Krb authentication in CDB and the fact author of commit is always Apache
  • Decided to give a try to SCDB… see presentation

Maintainer of ncm-diskless_server and ncm-pvss

  • Also developped ncm-checklines as an improved ncm-filecopy

Plans for the future:

  • Finish move to SCDB
  • Cleanup of tempates
  • Update/rewrite the 2 components maintained as part of this use + ncm-network
  • Several issues with ncm-network: see presentation tomorrow

Some extensions made on Quattor schema

  • Additions to /system
  • /hardware/ipmi

LAPP

5 SCDB clusters

  • 1 machine acting both as SVN and SCDB server, still running SL4

10 Gb/s support: with Chelsio interfaces, found a way to do PXE

  • No more need to use 1 Gb/s for Quattor (PXE) boot

Install and configuration of GPFS through Quattor

  • Strongly depending of kernel and require to build RPM for each new version of the kernel
  • Unfortunatly hard to make generic, thus not contributed to standard templates

CNAF

Very positive after the move to SCDB, despite a few problems

  • Error caused by users who commit before compiling, planning a wrapper

3900 templates.

Working on improving QWG errata management.

Main versions deployed

  • SL5 (no longer SLC)
  • gLite 3.2
  • Grid services configured with YAIM

Virutalization

  • Using KVM, phasing out Xen
  • No connection between hypervisor and VMs in templates: not clear we want to do it
  • Hypervisors and VMs managed by Quattor

Issue: complexity is increasing a lot

  • Bug tracking is very difficult

Developments

Pan Compiler - C. Loomis

v8.2 in production: only serious bug will be fixed

v8.3 (development version): 8.3.1 is a release candidate

  • Dependency file format change, annotation syntax
  • Will become 8.4.0 when production version is released

v9 planned before next workshop

  • Will remove support for all deprecated syntax/features
  • See previous list
  • Will require .pan instead of .tpl
  • 8.4 will support both
  • Issue: CVS history cannot track renames…
  • Would probably be easier to add an option to change the default extension to avoid the penalty of trying both for every template

panc release management moved from ant to maven

Current work:

  • Allows sources from non-file locations (eg https)
  • Will enable an XML-based syntax imported as a structure template

Code refactoring:

  • clean up, internationalization
  • Make AST more accessible (e.g. editors, code formatters…)
  • Easier tight control by external tools (e.g. debugger)

Annotations: open the path for automatic documentation generation

  • Need to agree on tags used

Another issue: auto-escaping of nlist keys

  • Possible to look at it but potentially a huge work to take advantage of it

QWG Update - M. Jouvin

See slides.

SCDB Update - M. Jouvin

See slides.

Diskless Servers Support - L. Barda

For LHCb online farm

  • Filtering tasks can be done in memory, disk not used
  • Credit-card PC have no disks; still SLC4, i386

Initial diskless support in Quattor started in 2005 and taken over by LHCb end of 2007.

  • Read-only NFS shared root filesystem configured on the server and copied to the diskless server at the end of kickstart
  • RW files for each client in a specific file system (/.snapshot) remounted with -bind on the diskless server
  • 2 templates to describe a node configuration
  • Proto node: shared root FS content, main part of the configuration
  • Real node: node-specifc configuration

ncm-diskless_server is used to generate the boot image, manage DHCP/PXE for the managed systems, configure the ccm.conf on the node-specific file system.

Current status

  • In production for several years
  • Moved farm from SLC4 32-bit to SLC5 64-bit: was pretty easy
  • A student should work on SLC5 i386 for CCPC

Issues with the current solution

  • Very large mount table on nodes
  • Issues when modifying a RW file on a RO file system
  • List of RH RW files is too large
  • Lack of debug info in system-config-netboot used to generate images and initrd generation is not flexible enough
  • Shared root file system can only be a copy of the server file system

Plan is to rewrite ncm-diskless_server and use UnionFS to replace the NFS mounts and bind tricks.

Several issues with ncm-network too, in particular inappropriate restart of the network breaking the NFS access. Will start a rewrite (ncm-network_NG)

  • Should separate each part of the configuration: interfaces, routes, bonding…. Code restructuring.
  • Should use network utils to modifiy the running system
  • Modify sysconfig files for persistent modifications

Try to interact with the community during the design phase to ensure it could be used at some points in replacement of the current ncm-network.

Migration to SCDB at CERN LHCb Online - L. Barda

Motivations:

  • Wanted to have real names in commits
  • CVS too much hidden in CDB: no more conflict resolution
  • CDB2SQL too heavy

SCDB installation using standard documentation, but no QWG. A few modified components:

  • LDAP authentication to SVN
  • Next Apache will allow subgroups
  • svnperms.py to restrict who can do commits
  • DHCPD config / aii-dhcp
  • quattor.build.xml: no cluster used, profiles located in /profiles/*
  • New target added
  • Post-commit: added the ability to send an email before running post-commit.py
  • build-tag.py run under Apache rather than root without sudo or ssh
  • Also added the ability to run quatview as part of build-tag.py (new section in config file)

Also AII v1 to v2 migration.

Special use case: allow sub-detector people to change their CCPC Mac address

  • Allow them to commit only in the subfolder corresponding to their CCPC and make a deploy
  • Rely on svnperms.py precommit hook to restrict their right to commit

Currently a test instance, in production soon.

Monitoring

NCG-Monitoring and Quattor - R. Starink

Quattor implementation at NIKHEF based on standard/monitoring/quattor but a lot have changed and merge is challenging.

  • Changes both in QWG and NIKHEF templates

Enhancement at NIKHEF:

  • Hierarchy of Nagios servers
  • Slaves schedule probes and return result to master through passive checks. Uses ncm-nsca on master+slaves
  • Master monitors slaves with active checks
  • Master and slaves with the same host/service definition but with slightly different parameters. Quattor helps to keep things consistency.
  • Master is the only one to run Apache: configured with ncm-filecopy
  • pnp4nagios: RRD-based graphs (like Ganglia) integrated with Nagios, usable with any probe
  • Service checks: generic checks + grid-specific ones (DPM, WMS, Torque, MAUI…)

Grid monitoring: wanted to enable NCG-based grid monitoring solution for site

  • SAM test results
  • Local grid checks
  • Ideally entirely Quattor-managed but potential conflict: NCG implements a dynamic detection of tests to run where Quattor describes tests in a static configuration
  • NCG: risk (happened!) of Nagios killed by a dynamic update of the configuration
  • Requirements: co-exist with existing Quattor+Nagios setup, no gLite on Nagios server
  • Incompatibilities fed back to developers

Grid monitoring layout:

  • Master werver can either submit grid tests directly or through a separated gLite UI
  • NCG used to generate Nagios configuration but not to enable/configure Apache, sudo… (done by Quattor)
  • Files generated by NCG included as cfg_file in ncm-nagios
  • Potential conflict between Quattor config and NCG config (duplicated hosts…) preventing Nagios to start
  • Master has a check (delivered by NCG) to push checks generated on master on the gLite UI

Status

  • Started early in pre-OAT times where many manual steps were required
  • Deployed recently the Jan2010 release and now configuration almost fully automated
  • Many perl-xxx dependencies: better to use checkdeps
  • Setup is not straightforward: NCG is still work in progress
  • Not completely happy with NGC-magic but aternative is to describe services in Quattor configuration. Is it maintainable?

Monitoring based on OAT Work - C. Triantafillidys

See slides.

Prototype not ready for distribution, based on QAD and NCG

  • Both still under developments
  • QAD used as a management interface to add MetricSets and select nodes which they are applied to based on /monitoring/features.

NagiosBox for Biomed - M. Jouvin

See slides.

Monitoring Discussion

TCD still using QWG templates but short of manpower.

  • TCD should update QWG trunk so that we can understand/assess difference with NIKHEF
  • QWG templates for describing monitoring configuration for local resources
  • Must allow AUTH usage pattern where all the configuration of Nagios is done outside Quattor (through QAD)
  • RAL will not necessarily use it but could provide feedback on what is done/planned in QWG templates

Grid service monitoring will try to rely on NCG as much as possible

  • Need input from AUTH about possible usage patterns of NCG API, how to integrate it in components…

First step: 2 wiki pages describing NIKHEF configuration and NCG potential

  • Location to be discussed later

Should plan a Nagios node in future test cluster.

StratusLab - C. Loomis

Goal: enhancing grid infrastructure with virtualization and cloud technologies by incorporating could innovation into existing grid infrastructures

  • Simplify and optimize its use and operation
  • Enhance existing computing infrastructure with “Iaas”
  • Grid services will remain the glue among distributed resources
  • Possibility to scale out to commercial clouds
  • Deliver an open cloud API (à la Amazon EC2) to enable community services to take advantage of cloud resources

StratusLab aims to deliver a toolkit that can be easily installed by existing sites.

  • All the software produced will be openSource (exact license to be decided)

Interaction with Quattor community

  • Several partners member of Quattor community: TCD, LAL
  • Major goal is to make the life of sysadmins easier: need to ensure that it is easy to install and configure with automated tools
  • Want to leverage Quattor community’s expertise with grid deployment: don’t want to reinvent the wheel..
  • Would like to use Quattor to build appliance configurations

Project should start beginning of June.

  • Still in negociation

SF Migration and Build Tools - C. Loomis

Current situation

  • Need to converge to using tools on SF and having a sustainable set of build tools
  • Was planned to be done within QUEST and now we need to move forward ourselves
  • Need to have concrete relaistic plans
  • Build tools unmaintained(able) since ME left

Current tools:

  • Wiki
  • MediaWiki on SF
  • Trac on SF enabled but not used
  • Trac@lal: used extensively for QWG documentation
  • SVN: SF for Quattor core, QWG for QWG+SCDB
  • Bug/feature tracking
  • SF: Pan compiler only (based on Bugzilla, working well)
  • Trac@LAL: QWG todo list
  • Savanah@CERN: main place for user bugs

Build tools:

  • Current set of tools very complex and basically unmaintainable
  • Attempt with Ant but not clear there is a real gain as Ant config file is rather complex
  • Maven solution may be possible: many attractive features (dependency management, multi-level project configuration, RPM packaging), more language neutral than Ant
  • Willing to do limited test but will need help from others

Wiki future

  • Trac on SF is lacking to many features (plugins) to allow an easy migration from Trac@LAL
  • A temporary option could be to move current MediaWiki content to Trac@LAL until we have a usable Trac@SF
  • Then we could migrate with export/import of Trac DB
  • We gain some time for SF to install the same Trac version as LAL
  • Agreed but import of MediaWiki contents to Trac@LAL must be the occasion of restructuring current contents to be more generic and ready for import to SF

SVN future

  • Move SCDB to SF asap (and other generic tools if any)
  • QWG part kept at LAL temporarily

Bug/feature tracking

  • Current SF bug tracking is functional and fairly usable, only missing bit is ability to link to SVN
  • Moving all bug tracking of to SF would have the benefit to remove one of many places where users have to go
  • Agreement to close Savanah
  • Review exact requirements and issues in monthly meetings
  • Configure SF with appropriate tickets
  • Communicate with the community
  • Implement the change. Target: 2-3 months

Mailing list

  • quattor-grid: only QWG-related stuff
  • quattor-discuss: everything related to core components, including NCM components
  • quattor-nagios: discussion to Nagios bit and bytes (development list)
  • Announcements: evaluate SF announcement features as they are more appropriate than a mailing list

Build tools:

  • Cal ready to do a mini test with Maaven
  • Elisabetta from CNAF ready to help
  • Should try to remove the too many externals

Quattor Future

Agreement on a monthly meeting.

  • Doodle to find the convenient time slot (if possible allow participation from US but not a strong requirement)
  • Between 1 and 2h
  • Possible facility: Dutch, BELNET, IN2P3
  • Do a test asap: end of next week/beginning of the week after
  • First meeting: first week of April
  • ~10 people ready to participate

Work plan

  • Panc debugger: not in the short term
  • Panc editor: try to find a student who could do the work to have something basic but complete
  • Not scheduled until we find one
  • Component tooling + ability to run non-Perl components
  • Need to understand costs and benefits
  • Concentrate on a review of what this involves
  • Ask Nick; first goal may be to produce a list of issues as a FAQ (reasons why it is not possible)
  • Support for package managers other than RPM
  • SPMA replacement on other platforms
  • May also impact the build tools (ability to package Quattor as something else than RPM)
  • Not first priority for many sites but support for YUM/APT may be valuable for many sites
  • CERN had plans to use YUM of SPMA
  • Concentrate first on a review of what is exactly involved, potential benefits, minimum YUM version to use, understand APT and YUM differences
  • Aquilon integration: RAL will work on this anyway and expect progress before next workshop
  • Will depend a little bit on MS availability to help: MS cannot commit to spend much time but agrees to help
  • Ian taking this in charge but limited to getting it usable at RAL
  • IPv6: not a short-term requirement but needs to be done
  • NIKHEF could commit to try to install a test quattor server that will use IPv6 and check what is working/not working
  • Some expertise available in EGEE3 and several NRENs
  • Make a clear list of what are the changed required
  • Hooks for interoperation with management frameworks
  • Code review and best-practices
  • Define best-practices and recommended API for components and rewrite a documentation for developping components on this (Luis)
  • Set up of build and test infrastructure
  • Cal volunteers for the build tools and for adding unit tests
  • Do a review of the existing build system at LAPP and make proposal to have it more useful and more used (Eric)
  • Review existing components, identifying obsolete/duplicate, and possibly contributing to a documentation page about existing components and their coverage
  • No volunteer but remains an important action
  • QWG templates
  • Other MW support: wait for EMI to see if there is a real need
  • generic service support: Christos ready to work on it in the coming months
  • VM support: to be done in collaboration with StratusLab
  • Integration with monitoring: see previous discussion

Dissemination

  • Must remain/be a strong activity for the community: enlarging the community is a way to pave the way for the future
  • Proposal to make presentations at the EGI conferences and establish contact with NGIs
  • A booth/demo, a poster or a presentation?
  • Could take the opportunity of doing something in common with StratusLab at the first EGI conference
  • A presentation based on RAL experience?
  • Try to be present in NGI meetings
  • Increase our visibility: be present on sites like Oloho
  • Make a list of conference useful to attend, in particular sysadmins conference
  • Wiki page about events and conferences of interest
  • Training harder without funding, will also require work to get the appropriate material
  • Appliance: StratusLab has a need for it, may come as an outcome of the collaboration
  • COSTS project: possibility to get funding for dissemination actions but deadline very close (26/3)
  • 1st application only 4 pages (10K words), more detailed application later
  • Up to 100KE/year during 4 years
  • People (not institutions) from 5 countries
  • 3/4 of proposal passing the 1st stage being funded
  • Attempt to apply based on QUEST proposal and members + BELNET + CNAF (Cal/Michel). Note that initial member list can be refined later.

Documentation

  • Clarify the different targets for the documentation
  • General communication: a basis exists on MediaWiki, need to develop, in particular use case study
  • User documentation: mainly concentrated on QWG templates, missing bits (in particular components)
    • Need to ask users/NGIs what they need
    • Lack of step-by-step guides
  • Documentation for developpers
    • Many things missing
    • Need to document the build tools when we agreed on which ones to use
  • Existing documentation review (Andrea)

Commercial support and consultancy

  • No big interest in the community right now

Long-term organization of the community

  • Legal entity behind the name to support the branding with official logos, presentation layouts…
  • Probably not a short-term action
  • Licensing cleanup: ask PLUME people if they are still ready to contribute without funding

Quattor toolkit component status

  • AII: nothing major, maintained by Luis
  • Nick has some contributions pending
  • SPMA: still maintained, minor modifications
  • ncm-spma review and enhancements: schema adjustements (repository), function rewrite (Victor)
  • rpmt-py: no maintainer but no maintenance needed so far… German is the author.
  • panc: see discussion, maintained/developped by Cal
  • CCM: last contributor/maintainer was Nick
  • Need to enhance CCM to be able to handle gzip profiles without ungizipping them at Apache level
  • ncm-cdispd/ncm-ncd: Nick is the last one who had a look at it
  • Also the new daemon written by Nick to act a proxy for managing non Linux boxes
  • Add ability to run dependendy only if there was a config change affecting them
  • Tracking/querying status of last component run: mod may be already done by Nick
  • ncm-query: Michel maintainer but no change foreseen
  • FUSE file system to browse configuration on client (ncm-query alternative)
  • CDB: maintenance done to use last panc compiler (CERN)
  • cdb-sync may be of broader use: SCDB may have to use it to take benefit of ncm-cdispd alternative
  • ncm-templates: a template processor used in components, not really maintained but works!
  • perl-CAF/perl-LC: actively maintained by Luis

SINDES: an external potentially useful component, should check the status and the accessibility outside CERN

  • No longer any maintenance at CERN, frozen
  • Any possibility to add it to Quattor repository?
  • RAL to drive the effort, Veronique to check with CERN

Worshop preparation is now the responsibility of monthly meeting.

Conclusion

Next meeting

  • Several proposals: Philips (Eindhoven), RAL, Strasbourg
  • After a vote, RAL choosen if they confirm they can organize it
  • Sites are welcome to express their interest in hosting a workshop
  • 2+1 days