11th Quattor Workshop Summary (March 2001)

Written
16 March 2011
Author
Michel Jouvin

11th Quattor Workshop - CERN - 16-18/3/2011

Agenda.

Quattor Status Report - M. Jouvin

See Slides.

Core Tools

Quattor Configuration Modules - L. Munoz Meijias

Recent addition

  • ncm-ofed (from Stijn): Infiniband configuration, very preliminary

Obsolete/suspect modules

  • ncm-http: used only at CERN
  • ncm-tomcat: used only at CERN
  • ncm-tomcat2: who uses it?
  • ncm-rproxy: replaced by ncm-squid?
  • ncm-smartd and ncm-smartd5: same schema, what is the difference?

Component quality: testing is tedious, old bugs are often back…

  • CAF is a critical tool to help with component reliability
  • Can run commands, update/write files with traceability and a real test suite
  • LC status: still maintained, private copy in Quattor
  • LC::File should not be used directly, CAF::FileWriter and CAF::FileEditor instead
  • LC::Process no longer expands to a subshell: new version will allow to split command on whitespaces

ncm-accounts: rewritten twice in last 2 years, hope v6 will fix all problems definitely

  • Very fast with large number of accounts
  • It correctly handles creation of user’s home directories

ncm-networks: lot of complains, ugly code, need rewriting, must review what the component should really do

  • Define IP config for network cards
  • Define bridge devices: required for virtualization support
  • Tuning of device parameters
  • Configuration of channel bonding: kernel module manipulations should be done with ncm-modprobes
  • Must work on diskless servers: no more restart of the network just for testing
  • TCP-parameters should be controlled by ncm-sysctl
  • IPv6 support: schema extension needed
  • Loic may try to look at it

ncm-sudo: CERN needs to be able to manage sudoers.ldap

ncm-useraccess: proposal to remove PAM-based ACL support

  • Mistake to have put it here, use ncm-pam instead

ncm-authconfig rewritten with LDAP support

ncm-modprobe: support for SL4 to SL6

  • SL3 support removed

ncm-spma v2: new incompatible schema, 2-phase upgrade needed

ncm-chkconfig: problem with removal of ability to specify both on and off

  • Need ability to say “set service on to runlevel “,4,5 but off at other levels”

CCM: no changes since Feb. 2009, MS patches still not there?

  • Still no support for local profiles required for easier testing

ncm-ncd: version 1.3 by MS allows subclassing

  • Some components won’t work with 1.2 in the future
  • Should try to enable -t in Perl to get warnings about dangerous components
  • cdp-listend and ncm-cdispd should drop privileges when not needed

RHEL6 support: another platform to maintain… introduces Perl 5.12

  • Major language changes but backward compatible
  • Use stricter modes in Perl 5.12 to identify improvements needed to existing components
  • Don’t jump to quickly into new, incompatible features
  • Core libraries (CAF, LC) are not affected by the changes

Some actions needed

  • Identify obsolete components and move them to another location (delete them?)
  • Identify and fix components using backquotes to spawn commands instead of CAF::Process
  • Do a Quattor component release? or official/recommended list of components?
  • Can the new build tools help with this?
  • What process?

Luis will no longer be able to (significantly) contribute as he’s leaving CERN…

Aquilon and other MS Developments Status - N. Williams

Main developments: integration with ESX through QRD

Profile signature: decided it was not the way to go, just encrypting profiles over the wire through Krb host keytab

ncm-network: MS would like to be involved in rewrite, ready to contribute

ncm-modprobes: difficulties with last version because of removal of shell expansion

RHEL6: found various issues, some of them not identified at CERN (in particular AII-related ones)

  • Goal: support ready in a few months, fixing problems as they are discovered

Aquilon: will add support for “clusters” in a way similar to SCDB to help with use of QWG templates.

Will start looking at support Solaris 11 in the future.

  • Probably not before autumn

May also look at feeding their internal Windows management system from Quattor.

Pan Compiler Status - C. Loomis

v8.5-7:

  • Warning about all v9 deprecated feature
  • Fix for NS object templates
  • Fix for panc on Windows
  • Introduction of “prefix” statement to simplify paths in templates
  • In action from where it is declared to the end of the template, several possible in the same template but not recommended

v9

  • v9.0: available soon, development version
  • v9.2: early summer, production version

v9 main new features for the first versions

  • Fixed generation of annotation output for better handling of namespaces
  • Fix command line script
  • New ant task and maven plugin

v9 roadmap

  • Investigate use of clojure (lisp-like over JVM): better implementation of mem mgt in panc, agent model similar to panc task mgt, memorization of file system stats…
  • May allow to simplify/streamline panc code
  • Focus on code maintainability rather than new features (none requested)

Support

  • v8.4.7 is the last v8 version
  • v8.2.x is now unsupported, please upgrade
  • v9.x: all new developments in these releases

Requests from the discussion

  • debug() outside DML (Nick)
  • Produce a “map” of the includes without executing (compiling) them (Nick)
  • Difficult when using loadpath
  • clojure may help

QWG Templates

QWG Templates Update - M. Jouvin

See slides.

Nagios Templates - R. Starink

Almost done

  • monitoring/nagios/config: server configuration
  • RPM includes
  • Apache configuration

Difference between generic and local configuration: most work done locally, still need to be merged

Pnp4Nagios: service tpl needs site modifications

Hierarchy of Nagios servers: optional

Grid monitoring: based on EGEE/EGI work

  • Uses YAIM, not easy to integrate into existing QWG templates. More discussion required.
  • Replace ncm-yaim by ncm-filecopy?
  • Not generic

Summary

  • No visible progress yet but some progress behind the scenes
  • Merge into SVN requires time and dedication: would like to test results
  • Grid monitoring to be done after completion of base work

LEMON - I. Fedorko

Exception sensor: runs on the monitored node to report abnormal conditions (based on local conditions and metrics collected by other sensors on the node) and optionally run a corrective action.

LEMON usage: CERN (several instances: one at CC + online farms), CNAF, MS

Backends

  • Supported: Oracle and flat-files
  • No alarming possible with flat-files: rely on PL/SQL (available for most DBs)
  • Postgres/MySQL support should be possible to add as internally LEMON is DB vendor independent

LEMON usage at CERN: 11K entities (8K machines), 5+ Knodes with 150-200 metrics

  • Performance monitoring
  • Application monitoring
  • Infrastructure metrics: temperature, power…
  • 5 core sensors covering 60% of measurements
  • 1 LEMON server (+1 for historical data)

Quattor management of LEMON

  • LEMON agent configured with ncm-fmonagent
  • LEMON client authenticated to the server via the SINDES host certificates

Current activities

  • Federation of instances (worked with MS)
  • Virtual machines monitoring: Quattor used to generate images (golden nodes) leading to several VM instances sharing the same SINDES certificate
  • Also customized sensors for hypervisors
  • New LEMON web for better scalability and improved flexibility (more dynamic operations)

Future

  • Consolidation of monitoring activities at CERNIT, including experiment support
  • New DB schema for increased scalability
  • DB data export for data mining
  • Exception classification for alarming
  • Remote instances (monitoring of remote T0)
  • Probably integrating with Nagios for collection
  • Interfacing to Nagios: interested by ideas/experiences
  • Interfacing with Windows monitoring

LEMON development team at CERN: 0.5 to 1.5 FTE

  • Enough for maintenance for CERN usage
  • Not enough to do major reengineering: a problem as the use cases have changed
  • Consolidating monitoring activities (and manpower) is one direction
  • Interacting with other tools is the other direction

CMS Experience with Quattor - J. A. Coarasa Perez

CMS online computing infrastructure requirements

  • Complete autonomy from CERN IT and network
  • Myrinet core network
  • Redundant services (2 different computing rooms distant of 200m), managed remotely from a central control room
  • Fast configuration turnaround
  • Ability to read from electronics at 100 kHz (100 GB/s)
  • Computing power to do data selection (2 GB/s output): ~2500 cores
  • Enough disk storage for 2 days: ~300 TB
  • Spare computers for quick replacement in case of failure (~200)

Limited manpower: ~5 FTEs

Quattor is the core configuration management tool

  • Computers installed through Quattor but not using AII
  • Based on a (very) old Quattor version: 1.3
  • panc 6
  • RPM repositories hosted on NetApp filers with 6 SWrep servers
  • Performance: 10mn for compiling the whole 2500 machines
  • Time to deploy changes: ~10 mn (cdp notification used)
  • Reinstallation of 1K machines in 1h 1/4
  • Problem when SWrep servers were overloaded: spma timeout, no retry

Well defined template structure/layout (CMS specific)

Some CMS-specific developments

  • Dropbox for RPMs to allow people without Quattor expertise to update their SW in a traceable/reproducible way
  • Buggy components requiring workarounds (in particular for configuration removal)
  • Copy of configuration files to individual computers: copyd
  • Network configuration derived from DNS
  • RPMs used to create users
  • Web frontend to CDB configuration: template summarizer (parsing part of the profiles according to CMS way of configuring systems)
  • Software Update Tools: allow to update existing RPM in a group of computers (dropbox)
  • Upgrade or downgrade
  • Allows to rollback changes
  • Allow to check the status of the update
  • Perform checks on the RPM: already exists, not named properly…
  • Cannot be used to add a new RPM to the configuration: need to request addition of a new RPM
  • Only one person can run it at any time
  • Ability to run commands on a set of computers selected based on the use of a given template (like a type)

Conclusions

  • Quattor has been and is scalable to CMS needs
  • Not always as flexible as we’d like…

Security

SINDES Update - J. Dudziec

Certificate authority + secured file store.

Several issues requiring major reengineering

  • Only one SINDES user per system: no fine grain permissions for different users
  • Impossible to delete a file in the file store
  • Impossible a to move a machine from one cluster to another one
  • No support of subclusters
  • Unattended installation of machines not really possible: SINDES must be notified first

Progress since last workshop

  • Documentation on Quattor wiki
  • Code in SF with Apache2 license: still some issues (currently GPL)

SINDES v2 overview

  • Apache + SOAP
  • Internal DB for SINDES-specific information, eg. SINDES items (file + permissions)
  • Interaction with CDB through CDB2SQL to retrieve host information (eg. cluster/subcluster)
  • Krb used in replacement for SINDES CA for accessing SINDES server
  • Same client tool for users and hosts
  • Major improvement of code maintainability: 300 LOC instead of 10K in v1
  • Site-specific authorization modules to use whatever is appropriate at the site: CERN is using egroups, LANDB…
  • flexible item to host/cluster/subcluster mapping

Impact of SINDES CA removal: not necessarily more difficult to setup Krb than SINDES CA…

  • SINDES CA adds a lot of complexity to SINDES code, not necessarily secure…
  • May consider adding support for optional other authentication methods, like certificates

Unattended installation

  • CERN: initial keytab download allowed if request coming from low ports and reverse DNS matches (no time window)
  • MS: adding a time-window feature to Krb

Only client available in v1 is command line, other APIs under consideration for v2

  • May change the implication of a GPL license

Quattor and CERN Security - L. Munoz Meijias

Applying SW updates

  • CERN IT preparing snapshots of recommended versions every week, users picks one of them
  • In case of major security vulnerability/problem, CERN Security can make mandatory an update

Challenge of managing 250M accounts… in particular challenge of maintaining ACLs

  • Unix group are not an appropriate solution: difficult to maintain, no recursive groups
  • LDAP + egroups used to implement the required features
  • egroups: logical name for a set of accounts, recursive
  • Quattor used to configure LDAP authorization
  • Still an issue to configure non LDAP-aware services

Access to accounts: mainly through Krb (+PAM)

  • SSH keys are a potential problem in case of computers compromised outside CERN, no way to know if they have been removed from all CERN computers
  • Privileged accounts: no possibility to ssh out from a privilege account
  • Need to restrict more accounts (apache, oracle)
  • Would like to restrict the allowed origin of incoming connections to privilege accounts

Multi-factor authentication: no uniform way of doing it, work in progress to make progress on this

Logging of all commands and command arguments on sensitive systems using snoopy

  • Intercept calls to execv and sent to syslog what is executed

Monitoring with alarming of ~12 security-sensitive files per computer

  • File permissions

Development Process and Tools

End of current sprint

See wiki.

Maven-based Build Tools - C. Loomis

Tools available

  • Build plugins, configuration for full configuration module
  • Archetype to build config for a new configuration module
  • Creates tarballs as well as RPMs

Documentation available.

Usage status

  • pan
  • pan-templates
  • ncm-query an ncm-cdispd: prototype available

Making releases

  • “Magic” to do a release with Maven is easy: mvn clean release:prepare release:perform
  • Builds, tags, uploads packages, makes “site”, increments version
  • But requires some specific permissions: write access to nexus repository for packages, write access to SF website for “site” information
  • Allow anyone from command line to make release?
  • Use continuous integration server (eg. Hudson) for releases? Preferred solution… but it is an additional service (not available on SF, need to manage identities)

Continuous integration service: much more benefits than just the release making

  • Not necessary a big deal to set up the service
  • Already have the StratusLab experience
  • Is LAPP ready to do it (previously in charge of build infrastructure)?

More discussion tomorrow? Attempt to start a Hudson server?

Development Process Discussion

Generally good feeling for those who participated

  • But not a very good feeling about EVO
  • Let’s test the MS audio system for next standups

Meeting day and time: Thursday 2pm CET is not necessarily the best slot

  • Start a Doodle to find another possible slot, preferably during the morning
  • Wednesday may be a good choice, avoid Monday/Friday

Sprint duration: 1 month, smaller backlog

  • Monthly meeting to close the sprint and have the user interaction slot

Configuration module management: how to know/update what is the recommended (stable) version to use?

  • No great idea…
  • Involve manual steps/work
  • Must be sustainable…
  • To be reviewed when we have converted all configuration modules to use Maven tools and have a continuous integration server in place.

Migration from SVN to Git: is it desirable? if yes, which steps?

  • StratusLab feedback: required a bit of effort for SVN users but several benefits at the end
  • Smaller, more specific repositories: smaller number of users with write access per repository
  • External references are really external to the repository
  • At SF, all users with write access in the project have write access to Git repositories by default but can be restricted per repository
  • Short term steps: new projects (eg. SINDES) in Git, panc moved from SVN to Git
  • Need a candidate with interaction between several developers: CCM-related things (ccm, ncm-cdispd, ncm-ncd, QuattorFS), AII and its plugins

RHEL6

CERN Experience

  • Path to install perl modules has changed: fixed by Luis for every Perl modules in SVN
  • Now installed in /usr/lib/perl, that applications are not supposed to use (not used by anybody)
  • Using standard installation place makes RPM OS-version dependent
  • Use /opt/quattor/lib/perl: upgrade ncm-ncd to look in both places
  • rpmt-py broken by 2 changes in standard Python RPM bindings: fixed by Christos
  • curl bug breaking SINDES and potentially anything using wget: certificate not looked in the right place
  • No workaround/fix yet
  • No impact known yet on Quattor itself
  • ncm-grub doesn’t update the kernel after a kernel upgrade and %post script in RPM is failing
  • Christos will try to look at this
  • Also a need to add kernel architecture to grub command
  • SELinux: issue with rpmt-py, need to be documented on Quattor wiki
  • LDAP configuration is now in another place with another layout (previous information as a string now several tokens): ncm-authconfig has to be enhanced
  • Difficult to split the existing string in SL6 tokens: site specific
  • Rework the schema to have something more appropriate to handle different platforms?
  • Luis will try to summarize the problem and possible solutions on quattor-devel
  • Not clear if others are using LDAP for authentication…
  • Time synchronization in virtual machines run on HyperV: need to run new ntpdate service during startup to force synchronization
  • Initial time seems to be UTC…
  • ncm-modprobe: modprobe configuration files changed their location, new attribute in the schema to specify this location
  • Need to be handled by OS configuration templates
  • ncm-symlinks: use of last version required

AUTH

  • A few fixes to be committed

MS

  • Pb with AII file system partitioning (caches not flushed or something like that): thinking at using standard Anaconda for this, driving it from Quattor config
  • In the past, issues with Anaconda’s handling of file system partitioning
  • ncm-pam issue, about to commit the fix

Discussion

RHEL6 raises the question of the potential changes that may be required for multiple-platform support

  • MS wants to restart Solaris port
  • Some components may be valid only on one platform: how to prevent them to be used on the wrong one
  • Bind the path in configuration to something that prevents its use

Conclusions - Next Workshop

Next workshop in Strasbourg.

  • Best candidates look week 41 (10-14/10) and 42 (17-21/10)