6th Quattor Workshop - NIKHEF - 27-29/10/08
Introduction - S. Child
Quattor survey: 48 answer, 27 sites using Quattor, 14 countries
- Still a strong use of CDB: as used as SCDB
Evaluation of Quattor community:
- Strengths:
- Open and supportive community
- Improving integration with related tools (monitoring, VMs)
- Large suite of configuration components (254)
- Well defined language model
- Good support for gLite
- Weaknesses:
- Lack of effort for release management
- No automated build and test
- Lack of documentation, in specially quick-start style
- Difficult to contribute (access to repositories)
- Legacy architecture makes writing components difficult
- No official support for gLite configuration with Quattor
- Hard to bootstrap because of the many external dependencies
- Level of knowledge still low among prospective “customers”
Need to focus effort on completing migration from grid, CERN-centered usage model to a generic open source project.
- More information on documentation and packaging
- Quattor appliance for site bootstrap
- Dissemenation: several things done recently (EGEE, LISA, HEPiX). More to do: SC, Linux sysadmins…
- Seek funding for dedicated development efforts ? High overhead but may be worth
Goals for the workshop:
- Finalize structures for opening up community collaboration: accelerate move to SourceForge, ensure access rights are easy to manage
- Coordinate tool development: centralise and catalogue existing tools
- Decide on sustainable build and release procedures, as automated as possible, not dependent from one person
Site Reports
GRIF - M. Jouvin
UAM - L. Munoz
Using QWG + SCDB.
Manpower potential problem: Luis applied for a fellowship at CERN.
Documentation: would like a tool for automatic extraction of documentation from PAN code
- Doxygen plugin for PAN
Packages lists are much too long and sometimes cannot be handled at installation time.
- Currently 1200 packages on a 64-bit machine, was able to reduce it to 411 on a WN
SCDB at UAM configured to use
We’like to impement new ant targets without modifying build.xml.
Had like to implement a staging strategy for deploying changes: fits better on a distributed model.
- How specific is AQDB from Morgan and Stanley
- Is CDB over Git or Mercurial a project worth the effort ?
CERN - V. Lefébure
Main instances: 6300 profiles, 140 clusters
- Less than in March 08 because of retired HW
- panc: still using panc v6
- Cake now multithreaded
- Starting to use namespaces in SL5 templates
- SPMA and SWrep: starting to have VO-dedicated SWrep, still not enforcing RPM signing
- CDB2SQL: rewrite almost complete
- CCM: use SSL-based transfer, with fallback to http:
- More and more CDB users: ACL working well but management is an issue
- Update to Quattor 1.3 templates well under way
Xen-based virtualisation being taken over by Ewan Roche
CNAF - A. Chierici
Quattor 1.3 templates (namespace): still a lot to be adapted to new schema:
- Conversion going on smoothly, with help from ME. Storage people very happy.
- Imported templates for configuring ncm-yaim from NIKHEF
- CEs on a 64-bit OS using Michel’s tricks: improved the situation a lot.
- WN 64-bit tested and soon in production
Xen used on LHCb-T2 and a few nodes in T1
- Dom0 managed by hand
Building RPM lists: tried checkdeps.py but failed because of the YUM version required.
- Currently using a node in a VM to install metapackage and dependencies and produce the RPM list
CDB vs. SCDB: thinking about migrating, to take advantage of advanced logging features. Also concern on CDB support after ME leaves.
- No plan to use QWG at the beginning
- CNAF is very short on staff members and temporary contracts will be reduced
Philips - S.
2 instances based on SCDB + QWG:
- testgrid: 6 nodes, based on VMs
- biggrid: 206 nodes, including a SE with 3 disk servers
Future:
- panc v8
- Dummy template for WNs
- Nagios integration
- Xen to run biggrid services
Grid Ireland - S. Childs
[http://indico.cern.ch/materialDisplay.py?contribId=3&sessionId=4&materialId=slides&confId=40056].
1 SCDB for 22 sites
- 17 externals, 5 internals
- 7 site admins, 5.4 average commits per day
- 450 managed machines
- 20 Quattor servers
Monitoring expanded: Nagios deployed through Quattor
- Local sensors + read in EGEE SAMs
- LEMON erros fed to Nagios
panc v8 now in use: logging features already invaluable
Configured iSCSI+RAID storage (aggregating WN disks).
Issues and future work:
- Would like equivalent to WN speedup integrated into the compiler as dummy workaround is too fragile
- Dependency checker now incompatible with our SL4 server
- Get network, filesystems, IPMI under Quattor control
- Configuring cross-site redundancy for core services using drbd+Xen
- Get all TCD admins actively contributing to Quattor community
=== NIKHEF - R. Starink ===
Now using SCDB with some local modifications, with SCDB inside Subversion
- Still using panc 7.2.9, upgrade planned Dec./Jan.
AII: added new feature to give Anaconda literal definition of filesystems
- blockdevices and filesystems seen as too risky for SEs
Cloning WN definitions for speeding up but issue with dependencies when recompiling
Redundant repository servers with DNS load balancing
Projects:
- Virtualization: not yet based on QWG but considering to move. Want to work on management: where is my VM running, what should host X do ? See QuatView talk.
- Monitoring via Nagios: client side already done by Quattor, servers still manually configured but considering QWG.
- NCG: will try YAIM
- Planning OS setup via QWG
Issues:
- SW updates:
- Kernel: install only a limited number of updates
- gLite: update repository for WN separate from other node types
- SPMA pb with RPM file/pkg name mismatch
- PAN schema: would like a property describing node function (CE, WN…)
- ncm-xen: support for relocatable domUs
- ncm-yaim: remove siteinfo.def on failure
LAPP - E. Fede
SCDB + QWG templates used for 2 years for grid resource and started recently used it for internal systems.
- Quattor runs in a VM for easier backup and restore. But this is at the price of performances : 3mn for 100 templates.
- 4 people involved in Quattor management
- Management of GPFS nodes with Quattor: selection of appropriate RPMs based on kernel version, configuration of SSH keys…
Wish: web interface for AII
GRNET/AUTH - C. Triantafyllidis
Since Nov. 2007, using SCDB + QWG templates
- Started during EDG with LCFG and LCFGng
- First look at Quattor in 2006 and first attempt to use it mid-2007 with CDB + YAIM
- Remaining issue: installing a new node type is still not trivial but site is now consistent
- Troubleshooting errors often involved a node without Quattor and comparing
Synchronization with LCG QWG repository still a painful process
- Introduced a cfg/local to hold all local modifications common to both sites managed from SCDB and also to redefine machine-types
- Missed that standard machine-types could be customized without duplicating/redefining
- gLite updates is still not a straightforward process: need testing nodes
Components: several new components written (Pakiti, hydraclient, hydra).
- Ganglia component in progress
- A few components patches (krb5client)
Working on a web interface for management providing a web-based wizard for creating profiles from templates, creating new hardware templates and updating RPMs.
Current status in Greece: half of sites using Quattor for initial installation, only 2 sites for the full management
- Doing some efforts to get other site using it for the full site management too
Morgan & Stanley - N. Williams
Currently moving 6500 nodes under the new ‘‘Aquilon’’ system based on Quattor
- Alread a few hundreds
- 1 Quattor server, 6 boot servers (DHCP+TFTP), 14 admins among which 6 template amdins
- Monitoring with LEMON + perfdata: LEMON used for aggregated views
Aquilon vs. QWG templates: Aquilon is based on a relational model where a node is associated with a role
- Plenary templates: templates that provide date to the other templates. Typically generated on the fly from an external source.
- Template library: what configuration is required to provide specific behaviours, implementing configuration policies given the input plenary data.
- Only template library is versionned.
- Modeled after QWG ideas but with a different layout: every group of templates grouped by ‘‘archetype’’ (one archetype is ‘‘qwg’’, others are ‘‘aquilon’’, ‘‘service’’, ‘‘hardware’’, ‘‘pan’’, implemented as a pan namespace).
- Own standard schema to add M&S-specific information: location details, events/actions, archetype, personality, release and version allowed for components…
- Distributed services modeled with several instances for redundancy, scalability. Instances of each service are described into ‘‘service’’ archetype. One specific node uses one specific instance. Personality describes the services required but not the instance (selected using plenary template information).
Aquilon vs. CDB: decided not to use CDB but instead roll our own
- Replacing an aging asset database, integrating with legacy dbs.
- Based on Oracle
- Use AQD broker (written in Python) that mediates Aquilon commands to Oracle
- Entitlement done in AQD broker
Sandboxes used to implement staged deployment
- Hosts are associated with 1 sandbox and can be moved from one to another without any config change
- A sandbox implemented as a SCM branch: SCM used is Git
- Production templates cannot be edited but can only takes merges from sandboxes
- Still to come: unit testing of sandboxes, better cherry-picking tools and methodology (based on Git hooks)
Bootservers:
- DHCP, SPMA proxies, Kickstart servers
- CDB notification from AQD trigger aii-shellfe –notify: 2s per client to configure
- AQ translates pxeswitch commandes into ai-installfe requests to the appropriate servers
- Just switched to AII v2 to get advantage of more flexible file system and block device definitions. But problems with performances due to the plugin-based architecture. A few patches made, in particular to support selectable CCM database formats (will use CDB_file).
Statistics:
- Infrastructure servers are 8 core, 2.5Ghz, 16GB
- PAN: 13 profiles per second compile, <3 mn for 2K hosts, XML profiles are approx 230 KB
- Reconfiguring bootservers: 1 client takes 2s, reconfinguring 2K tagkes more than 1h…
- Installation: 700 RPMs, 1.8 GB, 15 minutes
- 10% transient failures on installing because of ccm-fetch unable to get its lock
Wishes and remarks:
- Boot servers don’t need whole profile
- ROCKS RPM download vi abittorient
- Components often have insufficient logging, not easy to get status of component configuration
- Really want to participate with development: SourceForge wil be a Really Good Thing
- Subclassing components allows for replicated functionality but better transactions
- Entitlements in PAN
- LEMON and SLS are ‘under-sold”
Core Components
CDB - M.E. Poleggi
Mainly busy with CNAF operations, not a lot of development
- Mainly cdb-tpl-view and pangraph ready for panc v8): not CDB specific
- No progress on fine grain locking and common authentication framework
cdb/cdb-soap
- Faster state management
- Multi-threaded dependency calculation
- Authentication supports locking
Wish list:
- ACLs restrictions on include
- Expose some CVS features to end-user
panc v8: still some problems, investigating…
Scalability of dependency calculation: requires a lot of operations and CPU, improved by moving to multi-threaded dependency calculation.
SCDB - M. Jouvin
QWG - M. Jouvin
PAN Compiler - C. Loomis
[http://indico.cern.ch/materialDisplay.py?contribId=14&sessionId=3&materialId=slides&confId=40056].
Current version: 8.2.2
- v8 is a major rewrite
- 8.2.3 should be released next week
- Significant language and implementation changes from v7
- Known intermittent threading problem since v7: not yet identified
- 7.2.9 is in maintenance: no new feature
Future plans:
- Facilitate pan language edtior and debugger
- Making grammar changes for better support: save comments…
- Internal refactoring to make such things easier: ability to put breakpoints…
- Performance: continued optimization for generated code, reduce memory footpring via object sharing, understand speed differences between machines, JVMs…
- SL4/SL5 difference: 2x slower
- XInclude support: replace stream, fetc, embed…
Language changes planned for v9 (not before next meeting..):
- Remove deprecated syntax from v8
- Authentitcation/authorization: still requires some discussion to find the appropriate solution (based on type extension) without performance degradation
- Support for annotation (doxygen)
AII - L. Munoz
[http://indico.cern.ch/materialDisplay.py?contribId=15&sessionId=3&materialId=slides&confId=40056].
Lots of feedback and bug fixes in the last months.
New hooks:
- Aconda hook
- NBP hook
AII runs in tainted mode: ill-formed profiles won’t force AII to do wrong things
- Still sensible to symlink attacks
Known issues:
- File system creation is slow: working on improvements for 1.4
- aii-dhcp still around and doesn’t fit v2’s philosphy based on plugins: slow down aii-shellfe because it is doing DNS requests
- Implement a ncm-dhcp component that can be used as a replacement ?
- Reduce the number of profile download: currently one per AII command
- Remove ‘base_url not defined’ message
Dropping root privilege for AII
- aii-shellfe running as user aii
- /osintall owned by aii
- sudo for running aii-dhcp or ncm-dhcp component
- Install CGI no longer running as root
Document SINDES hooks or similar solutions to install certificates during installation
- Link to BEGRID wiki
Other Modules - M.E. Poleggi
CAF:
- Ability to log to syslog
- Process.pm: wrapper over LC::Process with support for verbose run
- FileWriter.pm: wrapper for file writing operations
- Fixed Reporter.pm
LC: updated to last version, should try to pick up changes more frequently
- In fact it is currently maintained even though we don’t have much contact with Lionel
CCM: not unescape(), getUnescapedName(), fixed for taint mode
- Still on todo list: de-privileged CCM execution, flag for disabling CCM updates
cdispd:
- Redirected output to aovid SElinux related problems: but a pb in the implementation, needs to be fixed (Véronique)
- Wish list: disable further updates by Quattor (inhibition flag using a property in profile)
ncd:
- NCM::Check extended with ‘noaction’ for overriding the global flag
- NCD/* inheriting from CAF::RepoerterMany
- Sanitization to run in taint mode without warnings
- Wish list: protect against malicious modifications of profile, for example signing it
ncm-templates: no change
- Todo list: review it and remove everything that was AII v1 specific
SPMA/rpmt: ME reports situation where RPM encouters warning during scripts and SPMA returns a success even though, running the same command manuallu returns rc<>0
- To be investigated, not so clear…
pangraph + cdb-tpl-view ready for namespaces and navigation via panc logging information
New tools: checkdeps, QuatView…
New Tools
checkdeps.py - S. Childs
Goal: check dependency problems before deployement rather than after
- Only approach to make it efficient and robust is to use package metadata, normally not available on the machine you compile on
- When you start to write something, you realize you are rewritting YUM…
checkdeps concept:
- Retrieve package list from node profile parsing XML file
- Create YUM repository for each repository in profile
- Instruct YUM to use only these repositories and to use checkdeps configuration
- Create a transaction set for YUM will all packages in
Issues:
- YUM doesn’t have a real API: have to run in debug mode and parse YUM output
When it works it’s great but hard to get working!
- YUM parsing very version sensitive
- Current version requires up-to-date createrepo on server (SL5+)
- Even when up-to-date, can return misleading results…
Conclusion:
- This is definitely the way to go… almost there. Can get the result in a few seconds for a profile.
- Parsing YUM output really doesn’t work reliably: should consider adding “dtermineDesp” to YUM that returns the list of additional packages requires as a Python stucture
QuatView - T. Suerink (NIKHEF)
Web-based display of information from profiles to display OS, MAC address… of every node.
- SCDB compatible
- Flexible, one config file
- Profiles parsed into a SQL db: database updated with as script that can run as a cron
- QuatView backend may be merged with CDB2SQL
- For simple reports could produce an XML file rather than a database
Panc Logging Tools
Main logging categories:
- Includes
- Calls : includes + function calls
- Tasks: performance of each internal tasks
- Memory
- Threads
- All/None
panc delivered with a set of perl scripts getting the logging output and produces reasonable human-readable summaries.
New Tools : Requests and Status - S. Childs
Several categories identified:
- Debugging
- panc logging: part of PAN
- pangraph: able to graph output pan logging tools
- pantree: deprecated
- checkdeps: in QWG
- gencompswebdoc: generate html documentation for components, in CERN CVS
- cdb-pl-view: simple template viewer in PHP, not CDB-specific, must be renamed, CERN CVS
- cdb-getclusters: CERN specific
- lld2pkgs, rpmcheck.pl: obsolete, replaced by checkdeps, QWG
- xmldb2hw: very limited, superceded by QuatView, QWG
- xml2pkgs: generate RPM list from set of XML profiles, obsolete, QWG
- compare_xml: compare 2 XML profiles, QWG
- Setup
- rpmConflicts: very old stuff, obsolete, CERN CVS
- fill_swrep_server: import RPM to SWrep through SOAP interface, move to SWrep, CERN CVS
- getpkgarch: very old, probably obsolete, CERN CVS
- quattor-etics-add-component: generate and upload component config to ETICS, CERN CVS
- mac: tool for capturing MAC addresses, obsolete, CERN CVS
- quattor-client-install: installs Quattor SW on non-Quattor machine, QWG
- AII web interface: from TCD, not committed yet, more work needed
- Template generation
- buildOSTemplates and companion tools: generate OS RPM templates, QWG
- createPackagesTemplate: generate a templae list from a repository or URL, QWG
- html2pan: script used before ant update.rep.templates existed, obsolete, QWG
- generate-hw-templates: generate HW template from a CSV, quite LAL-specific currently, QWG
- rpmq2panpkg: generate a PAN pkg list from installed RPM on the current machine, CERN CVS
- groupandpasswd2tpl: same for user and groups, QWG
- rpmErrata, rpmUpdates: should move into createPackagesTemplates, QWG
- ncmtplconv: convert flat components to namespace, CERN CVS
- check-compile.sh: download and compile SCDB + QWG trees, should be renameed, QWG
- quattor build tools: CERN CVS
Tool requests from survey:
- GUI for CDB
- Pan parser for Doxygen
- The base “getting started tools”
- CDB (cdbop) grep functionality, dependency information: CDB-specific, SCDB has existing tools
- Display inclusion tree for templates: panc v8 logging
- Having the possibility to restore a machine configuration to a given date: just revert configuration DB
- XSL stylesheets fir visualization of node profiles so that they can be viewed in a browser
GUI: Eclipse for SCDB
- Benefit from work on Eclipse
- Effort of developing own GUI would be huge and better spent elsewhere
- Also the need for some web-based wizards
Need to merge in 1 location from the 3 repositories and merge duplicate functionalities
Need to document recipes on wik FAQ page
- A lot of documentation available, but need to rationalize
- SF area for documentation HTML
Monitoring in QWG - S. Childs
[http://indico.cern.ch/materialDisplay.py?contribId=31&sessionId=3&materialId=slides&confId=40056].
Objectives:
- Describe monitoring one for each machine class
- Configure the server too using nlist to describe each machine to monitor
- Support hierarchy: cluster aggregates similar nodes, super-cluster aggregates clusters
NCG: template added to QWG for Nagios NCG services
-
Currently adds services for SAM and NPM tests and provides some variables for local customizations
include {'standard/monitoring/nagios/ncg_services'};
TCD Usage: Lemon + Nagios host monitoring
- Nagios: active network checks
- LEMON: host checking via LEMON metrics
- LEMON results fed to Nagios, output sent to Nagios via nsca
It’d be nice to get the last version of LEMON sensors and metrics from CERN
- They are still in old format but this is a pretty easy task to convert them
OpenVZ - L. Munoz
OpenVZ is a virtualization solution with the following specific features:
- Run as a partition (container) in kernel host
- No specific kernel, HW… for the guest
Limitations:
- VMs can only run Linux
- VMs cannot use kernel thread (eg. run a NFS server)
Quattor implications:
- Use of a single file system for all VEs
- No Anoconda
- Host requires an OpenVZ kernel (namespacing patches) and OpenVZ tools (vzctl…)
- Host acts as a router for VMs
- Guest declaration: container ID, network information, limit on resources
- Configured with ncm-openvz
- Avoid to install all VMs at same time (SPMA…)
Guest installation: no Anaconda, only post-reboot script with aii-openvz
- file sysystem, HW description… must be empty
- Virtual Ethernet (veth): slower, less secure than OpenVZ default network but able to receive broadcasts
- Not handled by ncm-openvz, must be bridged and configured with ncm-network
To be committed soon…
Coding Practices and Conventions - L. Munoz
Identation: preferred is 4 whitespaces, no tab
- If identation style is changed, do it in a revision dedicated to formatting changes, not embedded into other changes
Coding style: be consistent with the existing coding style if any, else cleanup the code to adhere to one coding style
- If identation style is changed, do it in a revision dedicated to formatting changes, not embedded into other changes
Modularity: avoid very long function or instruction blocks
Comments:
- Comment PAN data structures in PAN code
- Comment function in function headers
- Don’t embed a lot of comment into the code
Other recommandations:
- Misc: use ‘our’ instead of ‘use vars’
- Don’t use system or qx// or LC::Process but CAF::Process as it logs the command being executed and stores the exit status in $?
- Writing files: use CAF::FileWriter
- Can also be used for temporary files
- For feeding a command, use a pipe with CAF::Process
- Use LC::File or <File::Copy> or <File::PAth> as it does many checks, in particular it doesn’t follow symlinks
- Make sure code is running in taint mode: don’t trust any input, sanitize command/system call input
Missing bit in CAF:
- CAF::FileReader ?
- Any need to rewrite/improve NCM::Check::lines ?
- A module to download and verify from the Internet ?
A wiki page on Twiki exists decribing these conventions, linked from SourceForge
SourceForge Migration
Resources Available - S. Childs
SVN Repositories: work in progress
File release; what is under download menu, no APT/YUM, size unlimited
- Easy to rsync what’s there
- Currently only panc and ncm-components
Wiki: up and running, need people to contribute
- Web space: up and running, redirected to the wiki, also used for docs, 100 MB
Mailing lists, issue trackers: not investigated yet
- Any need to migrate from CERN savannah ?
Forums: not investigated yet, are they useful for us?
Documentation: not investigated yet
- Try to redirect to wiki
Migration Status - S. Childs
First prototype based on an import from CVS of the agreed components from CVS with cvs2svn
- First version of SVN-ready quattor build tools
- trunk, tags, branches at the top level
Quattor build tools: 1 copy per “core component” (through