10th Quattor Workshop Summary (October 2010)

Written
11 October 2010
Author
Michel Jouvin

10th Quattor Workshop - RAL - 11-13/10/2010

Agenda

Quattor at RAL T1 - A. Sun

Started grid with some bricolage based on Kickstart, Puppet… In 2006 realized that this should be reenginered.

  • 500 WNs, 500 disk servers

MAin benefit of Quattor so far: huge improvement in system management efficiency.

  • But must not underestimate the difficulty of getting the whole team onboard: mostly done now. It takes time to get the existing knowledge put in Quattor config.
  • Experienced it during the last kernel update: full reinstallation would have been an affordable option if necessary

Quattor Usage Report - M. Jouvin

See slides.

Core Tools

ncm-filesystems and ncm-lib-blockdevices New Ideas - L. Munoz

ncm-filesystems: NCM components able to build/destroy block devices

  • Take advantage of advanced description available in Quattor
  • Able to do things not possible to do with Kickstart
  • A few bugs:
  • Logical partitions cannot be grown: use LVM instead, no plan to fix
  • preserve_partitions sometimes not honoured: pb understood, some time required to fix it
  • preserve and format: required by AII but should not be available in the component
  • Some requests for new file system types: tmpfs, iscsi (replacement for ncm-iscsitarget), smbfs, FUSE filesystems…

One of the problem is that ncm-filesystem also manages fstab: proposal to move this part to a specific component, ncm-fstab.

  • Add pseudo-file systems and network filesytems (without a block device) to fstab

This changes require some change in the schema

  • Some validation relaxation

Backward compatibility

  • Profiles 100% compatible
  • ncm-filesystems will require ncm-fstab: can be handled in the component templates

Remark on FUSE filesystems: an alternative to fstab is to use a specific daemon for that but this will not be managed by the component.

SCDB Update - M. Jouvin

See Slides.

PAN Update - C. Loomis

Status of v8 series

  • 8.4.2: last announced, the one everybody should be using
  • 8.4.3, 8.4.4: not yet announced
  • Maaven integration
  • 8.4.5: planned soon after the workshop, last v8 version
  • All deprecated features will provide warnings
  • Add prefix keyword to grammar (not active, implementation in v9)

v9: first beta planned soon after 8.4.5

  • Main feature for the first release:
  • Removal of deprecated features
  • Bareword includes
  • New syntax of external reference: machine:/path instead of //machine/path (will allow proper support of anespaces)
  • Search order between .pan and .tpl reversed?

Spare time going way down: no time to tackle performance issues

  • Not foreseen in the next 6 months

Discussion - Wishes:

  • MS: signing of XML profiles
  • May be easier to chain a signing task with Maaven, the same way it is done for gzipping profiles
  • Michel: auto-escaping of keys in nlist
  • Difficult to implement, pretty intrusive change in the compiler, risk of ambiguity

QWG Update - M. Jouvin

See Slides.

Scrum/Agile ideas agreed. Let’s decide later how to do it

Quattor FS - N. Williams

Alternative to ncm-query implemented as a FUSE file system

  • Written in Python (300 lines): manages access to XML profile
  • Currently requiring 2.5 but should be easy to backport to 2.4

Can specify a default mode for accessing the profile still restricting some parts of the namespace

  • Currently only explicit path but could be possible to add pattern matching to disable access to all ‘password’ attribute for example

nlist/list become directories, values become files whose content is the value.

Can use all the file commands to browse, show the differences between the configuration versions.

Escaped values (e.g. package names): a symlink is created with the unescaped value.

With --cache you can start quattorfs to browse profiles in arbitrary locations (e.g. SCDB build/xml).

Works on Linux and Mac.

Python code stats the NCM directory at every request to detect profile changes.

No Aquilon changes since last workshop.

  • Mainly worked on a DNS schema to produce DNS configuration, including host records, based on what is in the configuration database
  • Aquilon soon to be committed to SF repository
  • A few specific requirements in particular Python 2.6, Kerberos

The main task still to be done is QWG templates integration

  • Aquilon enforces the expected namespaces
  • Namespaces: more directories expected by Aquilon than QWG
  • E.g. cpu/intel/l5520 rather than cpu/intel_l5520, rhel/5.0-x86_64 rather than rhel5.0-x86_64

Schema extensions: need to way to allow optional extensions

  • Product lifecycle
  • Personality: what is used for, OS version/arch
  • Related to Aquilon
  • Function: development, production…
  • Threshold, maintenance windows
  • Start/stop jobs: scripts that must be executed at startup but are not a service

Component subclassing implement and working, including for exception handling

  • ncm-ncd (NCM::Component) enhanced to have a method prefix() returning the configuration path for the current component.
  • Replacement of “my $base” definition
  • Helps with support of subclassing

ncm-network: would like to add support for loopback aliases

Monitoring configuration: would be good to have it embedded into the service configuration rather than done at a later stage.

  • More discussion required on where the template configuring the information should site (service directory or monitoring directory)
  • May think about putting a meta-description of monitoring information in the service and generate the appropriate information on the fly

Versionned components: we need to be able to get several version of a component installed to deploy a new version still using the component described in the configuration description

  • One idea would be to install components at a location that includes the version number
  • Still to be checked if it really solves the problem: the real problem may be to enable SPMA to deploy only a subset of the RPM changes

@xxx@ substitution in source code: should be reduced and probably restricted to where it doesn’t break syntax checks (strings, comments)

Security Management

Quattor and Security - Mingchao Ma

Operational security aims to maintain normal operation at a reasonable cost and effort

  • Prevention and response

Problems at many sites run by Quattor because a lot of unnecessary stuff is installed on WNs

  • E.g. firefox: lot of severe well known vulnerabilities with known exploits
  • Other examples: Samba, Xorg, KDE…
  • XWindows used to triggered a kernel exploit with one recent vulnerability

Would be better not to install something which is not needed.

Also saw some sites claiming having upgraded but in fact still running the old kernel.

Cruft Removal - J. Adams

Work started after realizing that mrtg is running on WNs…

Started to remove the not so useful stuff

  • pkg_del
  • Removal of the inclusion of some RH groups

But got immediatly some complaints about favorite admin tools, editors missing

  • Also a few grid bits broken

Started to turn out in a nightmare: decided to improve checkdeps to produce .dot files showing dependencies.

  • Result: 223 packages removed, 29 running process disappeared, hundreds of MB freed in RAM and disk
  • Goal should be only to have ‘core’ group and add explicitly the required bits

QWG Errata Framework - I. Collier

See slides.

GRIF Approach to Errata Deployment - G. Philippon

GRIF would like to move to scheduled deployment of errata once a month.

  • Only SL security errata (no fastbug)
  • No kernel update except if there is a critical kernel vulnerability to avoid complex/disruptive reboots
  • Kernel updates controlled by each GRIF site admins

In case of a critical vulnerability, a specific out-of-schedule errata is produced.

Deployment strategy

  • First deployment on a test cluster representing the most common configurations to fix main problems
  • Then deployment on production clusters, under control of GRIF site admins
  • NODE_OS_ERRATA_TEMPLATE used to force a machine to stay with its current errata level
  • Cumbersome to maintain, some GRIF-specific scripts to help

Main issues is changed in RPM version, dependencies, arch

  • arch-specific to noarch
  • RPM splitted into -common and -libs
  • RPM name change
  • Pb with algorithm used to guess the most recent version: 4.6 considered more recent than 4.7

Useful companion tools

  • Pakiti: easy to see problems with undeployed/misdeployed errata
  • Nagios: specific probe to detect SPMA problems

Quattor Software Process

Build Tools - C. Loomis

Reasons to replace Quattor Build Tools

  • Broken
  • Incredibly complex
  • Linux dependent
  • No maintainer

Why Maven?

  • Portable, open to non-java language
  • Clean, standard build process
  • Integrated mechanisms for release management
  • All build information in a single file

First tests done with a few components and looked reasonably easy

Roadmap

  1. Components: may need some adjustements in configuration
  2. Primary tools: in particular NCM related stuff
  3. Other tools

Updated build for configuration components exists and ready to be applied to all components

  • Lots of changed required but most of them can be automated
  • Current features
  • Only build functionality, no tagging, code update…
  • Creates RPMs for Quattor clients but only with the Perl modules
  • Tarball with pan configuration and documentation
  • Configuration component archetype
  • Example component with simple command

Basically 3 commands:

  • mvn clean: clean up workspace
  • mvn install: build and store locally
  • mvn deploy: build and deploy

Build features:

  • Substitue values in source files but no need for specific extensions
  • Checks pan language syntax
  • Create RPM and tarballs

Still to be done

  • Script for automated conversion
  • Finish documentation on website
  • Apply to all existing components
  • Verify conversion/update of components
  • Determine integration continuous integration

Maven requirements

  • Java + Maven core (jar file)
  • Or Eclipse plugin

Discussion

Components

  • Review which ones are needed and/or actively maintained
  • Maven migration may be an occasion to contact official maintainers, discuss problems during monthly meetings
  • Ensure there is a reference person for every critical/important component

Quattor releases: not a big deal for people already using Quattor but important for new users

  • Maven should help as it allows to easily maintain a list of what is considered production
  • Component maintainers will keep the control/responsibility of updating the list

SVN repository usage

  • Encourage developpers to use Git at their SVN client to commit their work and restrict trunk to reasonably good stuff
  • trunk should be able to build successfully at every revision
  • May be enforced by rebuilding trunk every night
  • tag management: delay decision until we have some experience with Maven and possible workflow for tagging new versions

Site Experiences

NIKHEF Migration to QWG - R. Starink

Historically, NIKHEF started with pre-QWG templates and tried to remain as compatible as possible but lot of effort duplication

  • Some specific requirements, in particular wants to stay with YAIM

Approach for migration

  • Use 1 SCDB with duplicate namespaces
  • Migration per cluster
  • Focus host layout and services: generic hosts, non-gLite hosts, gLite hosts
  • Hit dirty workarounds

First impressions

  1. Lots of old stuff in our CBD: clean up before migrating
  2. gLite everywhere in QWG examples: in fact missed core machine type
  3. Various services not in QWG: authconfig, audit, psacct…

Decided to keep the current node layout/machine types

Kernel errata management: intentions ok but implementation/documentation not consistent across all OS versions

File system configuration: AII magic too complicated or risky

Node cloning: looks too complicated in QWG

User management via central LDAP

Hardware description: machine classes rather than individual machines

Migration done after 4 weeks without disrupting the site and without complaint from other admins.

  • Too early to say about real benefits
  • Now in a position to contribute to QWG but need to better understand how specific a contribution must be
  • In particular may contribute to better consistency in QWG templates
  • “One size fits all” wont work… but not a reason not to collaborate

SINDES - J. Dudziek

SINDES main purposes:

  • CA : manage the certificates, confirm identities, create/revoke certificates
  • Generated certificate intended to be used only for securing communication: different from the service certificates
  • Notion of time windows during which a given client can request a new certificate
  • Storage centre for secret files, passowrds…
  • Deliver them in a secure way

Based on Apache, openssl, mod_rewrite

Currently in use at CERN and serving 8000 hosts

  • Several applications relying on it

Weaknesses

  • No feature to delete files
  • Only 2 target types: host and cluster. Subclusters needed
  • No easy way to move a machine from a cluster to another one
  • No possibility to view files
  • No file versioning

Possibility for improvements

  • Enhance the current implementation
  • Provide the same features based on an exiting product, e.g. wallet
  • Manpower available: 1 year of technical student

SINDES at RAL - J. Adams

Background: desire to put passwords in templates but plain http serving not very appropriate

  • Anyone can access a machine profile
  • Every node can access another node profile

SINDES used only to deliver the certificates

  • File store has been disabled
  • Information that could be in the file store is put in the profile then transfered securely
  • Integration with AII through hooks (from BEGrid)

Used to secure transmission of profiles but don’t secure template files in the repository

  • Assumption that users accessing the repository can be trusted

Problems

  • SINDES version used for SLC4 only: required a lot of effort to port to SL5
  • In fact CERN has a SL5 version… but no official distribution point
  • Documentation: the most useful is BEGrid documentation

Question: would CERN accept that we import it to quattor.org for easier use by Quattor site and better integration with Quattor?

  • May become a standard component of a Quattor server

BEGrid Experience and Questions - D. Durvaux / S. Rugovac

BEGrid: several partners around BELNET

  • Need to reengineer current SCDB structure: looking for some input

Current configuration based on 2-tier infrastructure

  • 1 central national server running SCDB
  • 1 site server per site which is a SCDB client (doing a checkout): no ability to commit
  • runcheck script on each site server doing replacement in central configuration for site-specific parts and handling/triggering the deployment for the site

Problems:

  • Quattor out-of-sync with the community: way to use it, QWG templats, OS/errata
  • BELNET not yet enough skilled with Quattor to take over the coordination responsibility: still relying on IIHE team
  • SINDES support
  • dCache support: no other Quattor site using it? Relying on very old templates

Possible solution envisionned

  • Refactoring of SVN structure with SVN externals
  • SWrep replacement: is http-based repository suitable?

Discussion

  • Need to clarify the workflow
  • In particular how much control of the central team over the sites: ability to trig deployment…
  • Centrally-triggered deployment requires nothing more than ability to write the site tags/ branch
  • 1 specific branch per site (with its trunk/, tags/ structure) and an reference to the central server
  • An option is one SVN server per site
  • External reference can be or not to a fix revision

Monitoring Support

Future Changes in QWG Nagios Templates - R. Starink

Problems found in standard templates by NIKHEF

  • monitoring/nagios/config
  • Location of some RPM includes (minor)
  • Host list derived from HW database: a problem with a master/slave Nagios config, one function needlessly complicated
  • monitoring/nagios/command: huge list of command definitions
  • Should be break into a common part and a site-specific part
  • Some configuration variable have meaningless defaults rather than trig an error if undefined
  • pnp4nagios configuration missing: general setup easy to add, keep it optional
  • Currently requires modification (1 line) of service.tpl

Hierarchy of Nagios servers: slaves collect the information, master runs the web interface

  • Configuration of host list is different on slave and master: a problem for the current automatic determination of host list in QWG templates
  • Current NIKHEF implementation based on hosts and services groups
  • Grouping done by cluster
  • Master is configured with everything

Grid monitoring based on EGEE/EGI work

  • Using YAIM

Discussion

  • Build a RPM for probles related to monitoring Quattor activity
  • Probe sources should be put and built in the SF repository as any usual Quattor component

Build Awareness (and knowledge) of Quattor - D. O’Callaghan

We need to promote Quattor to:

  • have more users
  • have more contributors
  • to improve Quattor

This requires to make Quattor easier to discover and to make easier to join the community.

Functionality / complexity must be taken into account

  • Quattor’s narrow OS support counts against it
  • Pan language is powerful
  • Complexiting of creating a new configuration component and bringing it to the community

User-facing website: content is good but several problems

  • Server certificate check and user-certificate check
  • Server certificate should be from a well-known CA
  • Should not require a user certificate
  • Search results on Trac are not focused enough, e.g. tutorial. Should not return results about Trac documentation
  • Too much hierarchy exposed on the first page
  • quattor.org doesn’t appear in Google

Aims for user website

  • Create a user landing page outside Trac
  • Resolve security issues or solve it
  • Better internal search results
  • Better external search engine results
  • Links on sites “power by Quattor”

Worth looking at how the competition does things: Puppet, Chef

  • Document differences
  • Document how Quattor can be used in these other environements
  • In particular for OS configuration and initial installation

Marketing Pan outside the Quattor community and separately the other parts of Quattor framework

Need to check Quattor description on some well-known places: Wikipedia, freshmeat, Ohloh…

  • Fix wikipedia based on LISA paper introduction

David agrees to spend some effort on the new landing page before the next monthly meeting.

  • Landing page to be hosted on SF website
  • Try to define a stylesheet that could be reused by other pages, in particular those generated by Maven

Virtualisation Support - Cal

Quattor survey showed 80% of sites use some virtualization

  • Easy integration as PXE booting supported by all hypervisors
  • S. Childs provided configuration component for Xen with some support in QWG

StratusLab contributions

  • 2 new configuration components: source currently in StratusLab repository and will remain there during the duration of the project
  • ncm-libvirtd: configure and control libvirt
  • ncm-oned: configure and control OpenNebula
  • Quattor configuration of a cloud
  • Configures OpenNebula, incl. NFS mounts
  • Configures private and public network bridges
  • HTTPS proxy for OpenNebula XMLRPC server (which is plain http)
  • Ganglia for rudimentary monitoring of cloud infrastructure
  • 2 manual configuration required
  • Definition/addition of hosts in the cloud: in the future may be done by a cron/service with SINDES
  • NFS mounts verification: currently static mounts, should be solved using autofs

Contextualization

  • Networking done by DHCP
  • Files and parameters passed through a disk (ISO image) mounted on /dev/hdc
  • Provide a nice way of handling credentials: the disk is not exposed to other VMs
  • Initialization script on disk run through rc.local

Virtualization provides a cleaner implementation of profile cloning for WNs…

  • One reference WN compilation, without node specific information
  • Node specific information passed in trhgh contextualization
  • One profile per group of machines
  • Compilation time scales with number of groups

… but rely on an external tool (VM manager) to handle the deployment

  • Require a mechanism to specify state of the fabric
  • Still require profile cloning for efficient management of hypervisor machines

Quattor is a good appliance generator offering a bookkeeping to track VM configuration, save state information and regerate images if necessary

  • Would like to be able to interface with virt-install for automated image generation and deployment
  • Implementation could be a service listening profile changes and running virt-install
  • Deployement may use AII hooks

StratusLab developpements will be publically available on Nov. the 2nd.

Actions

QuattorFS

  • Backport to Python 2.4 if possible/easy (James)

SINDES

  • Check with CERN agreement to import it in SF repository (Véronique)
  • Check licensing (Véronique)
  • Clarify official distribution point/channel (CERN)
  • Integration Quattor server configuration (RAL)

Web site

  • Landing page for quattor.org on SF (David)
  • Develop a stylesheet that could be reused by pages generated from Maven
  • Fix Trac server certificate CA (Michel)
  • Remove Trac request for a user certificate for anonymous access (Michel)
  • Enable Trac indexing by robots (Michel)
  • Fix navigation menu behaviour: discuss by email what we want to implement, then implement it (Michel)

Documentation

  • Better integration of former MediaWiki content into existing section, remove duplicates
  • Update SINDES related documentation, improve based on BEGrid wiki and RAL experience
  • Clarify or add missing material to answer Ronald’s questions after his QWG migration experience
  • Implement changes based on Andrea’s review (to be monitored in monthly meetings)
  • Fix/improve Quattor description in Wikipedia and Freshmeat
  • Ensure Quattor is reference on the appropriate open-source or software project portals

QWG

  • Organize initial development meeting based on today’s proposal (Michel)
  • Implement namespace improvements as suggested by Nic
  • standard/hardware/… including a vendor directory
  • vendor/version-arch or vendor/version/arch for OS templates

Monitoring

  • Implement NIKHEF suggestions to add flexibility and support hierarchy of Nagios servers (Ronald?)
  • Collect existing Nagios probes related to Quattor activity monitoring, put them in SF and package them as a RPM (Christos)

Miscellaneous

  • Create a developer’s mailing list
  • Document/encourage the use of Git as a front-end to SVN to better evaluate potential benefits of a migration
  • Check how Git repositories can be configured on SF, migration from SVN, how many repositories by project…
  • Enable and create for new bits like Aquilon
  • Calendar of events on the wiki or in SF
  • May use “news” feature rather than calendar

Wrap-up

Next meetings

  • Spring: CERN
  • Véronique to start a Doodle to find the best dates
  • Fall: Strasbourg

Next monthly meeting ‘'’moved’’’ Tuesday, Nov. the 9th, 2pm CET