9th Quattor Workshop - 17-19/3/10 - Thessaloniki
Site Reports
LAL
See slides.
MS
Configuration changes:
- Still one configuration server
- 2 set of boot servers (Europe, US)
- 20K managed machines (+5K)
- Virtual machines with ESX: 1200 ESX server, 1700 managed machines
- AII rebuild time still 3h
- Compile time increased from 10 to 18 minutes
- Using MS-specific version of 8.2.9
- 8 people involved in template development maintenance (+2)
- LEMON used: 2x2 servers
Issues:
- No code submissions recently: legal issues with SF…
- aqd scalaibility: priority for next year
- Pending open-source:
- AII, CCM patches: in prod at MS
- FUSE interface to configuration browsing: ready for distribution
- Aquilon: RAL interested, still a significant MS-dependencies
AUTH
Migration from cfg/local to SCDB sites still in progress
Included monitoring information at profiles
Attempt to use Quattor as a central DB for all assets (KVM, StorageArrays…)
- Implemented as a separate cluster group
SCDB clusters by machine criticality
Xen configuration now based on QWG with a few developments
- Removed host dependency on guests
- Trying to remove the use of DHCP
Templates for new machine types
- Message broker, in production at AUTH
- egee-NAGIOS/egee-SNAGIOS: work in progress, not based on previous Nagios templates from QWG
QAD: Nagios interface almost completed
JRuby added to SCDB to add task written in Ruby
- Used for QAD integration
Issue: sync with QWG, how other sites are doing?
CERN
'’Note: report about Quattor in computing center only.’’
CDB 2.2.5 and panc 8.2.11: 10K machines
- panc 8.3.0 tested
New development: Django-based search tool for CDBSQL
- Using sqlalchemy: Oracle independent
Test of SCDB: work done by Daniel with the help of Luis
- Test done with one profile and the CERN template structure
- Observations:
- profile location is a big change for CERN
- Decoupling of commit and deploy may lead to potential problems
- SVN ACLs are only at the directory level: having them at the file level is an advantage in CDB
- +: build-in ACLs, local panc compilation giving access to log files
RFE: pkg_add/pkg_repl based on date to ensure people are not installing obsolete version. Is it feasible?
Can we agree on a policy for announcements of new components? Exact rules for using quattor-discuss rather than quattor-grid? Do we really need 2 lists?
NIKHEF
5 clusters, 500 managed machines
- 350 clones
- SCDB 2.3.1 + local changes
- panc 8.2.8
- Compilation time: ~30s
6 people involved in Quattor maintenance (4 active)
Virtualization: 13 hosts, 48 guests
Nagios: 1 master, 3 slaves
Refactoring of templates for easier usage: not using QWG
- VO configuration, VOMS
- Batch system access control, VOViews
- User banning
Virtualization: currently not used with grid clusters, investigating KVM
Grid Ireland
18 sites, some have expanded.
SCDB with bzr and Git clients and an old version of QWG (r4921)
- Not up-to-date in particular with errata and VO config
RAL
90% of T1 Quattor managed. T1 fully committed to Quattor.
- Mostly WNs (700 already, 400+ soon)
- Also other machine types: bdii, vobox, ….
- Disk servers: OS + all package management operations, CASTOR config still relying on Puppet
- Nagios servers: OS only
9 sysadmins involved + 6 more coming
SINDES on the point of deployment to ease server certificates handling.
Ready to start experimenting with Aquilon
Using SCDB+QWG (up-to-date with standard templates)
- Happy with performances: 4 minutes for a full compile, without profile cloning
- Would like to slim down RPM list used by various machine types. Is anyone else interested?
Very positive with Quattor so far
- Much more consistent
- Batch service much easier to manage
- Errata deployment much easier
- Much more agile: easier to deploy new versions
CERN LHCb Online Farm
Online computing
- Event filtering farm: currently 700 nodes, 500 to 1000 to be added in the future
- Diskless systems
- Credit card PCs
- Various control PCs and infrastructure nodes (Web servers, DNS, ….)
- Also some Windows nodes (120) but mainly running Linux SLC4/SLC5/RHEL5
Linux machines managed by Quattor
- Original setup done in 2006 with CDB/panc v6/aii v1
- SWrep replaced by scripts
- ViewVC web interface to templates and XM files
- Problems with Krb authentication in CDB and the fact author of commit is always Apache
- Decided to give a try to SCDB… see presentation
Maintainer of ncm-diskless_server and ncm-pvss
- Also developped ncm-checklines as an improved ncm-filecopy
Plans for the future:
- Finish move to SCDB
- Cleanup of tempates
- Update/rewrite the 2 components maintained as part of this use + ncm-network
- Several issues with ncm-network: see presentation tomorrow
Some extensions made on Quattor schema
- Additions to /system
- /hardware/ipmi
LAPP
5 SCDB clusters
- 1 machine acting both as SVN and SCDB server, still running SL4
10 Gb/s support: with Chelsio interfaces, found a way to do PXE
- No more need to use 1 Gb/s for Quattor (PXE) boot
Install and configuration of GPFS through Quattor
- Strongly depending of kernel and require to build RPM for each new version of the kernel
- Unfortunatly hard to make generic, thus not contributed to standard templates
CNAF
Very positive after the move to SCDB, despite a few problems
- Error caused by users who commit before compiling, planning a wrapper
3900 templates.
Working on improving QWG errata management.
Main versions deployed
- SL5 (no longer SLC)
- gLite 3.2
- Grid services configured with YAIM
Virutalization
- Using KVM, phasing out Xen
- No connection between hypervisor and VMs in templates: not clear we want to do it
- Hypervisors and VMs managed by Quattor
Issue: complexity is increasing a lot
- Bug tracking is very difficult
Developments
Pan Compiler - C. Loomis
v8.2 in production: only serious bug will be fixed
v8.3 (development version): 8.3.1 is a release candidate
- Dependency file format change, annotation syntax
- Will become 8.4.0 when production version is released
v9 planned before next workshop
- Will remove support for all deprecated syntax/features
- See previous list
- Will require .pan instead of .tpl
- 8.4 will support both
- Issue: CVS history cannot track renames…
- Would probably be easier to add an option to change the default extension to avoid the penalty of trying both for every template
panc release management moved from ant to maven
Current work:
- Allows sources from non-file locations (eg https)
- Will enable an XML-based syntax imported as a structure template
Code refactoring:
- clean up, internationalization
- Make AST more accessible (e.g. editors, code formatters…)
- Easier tight control by external tools (e.g. debugger)
Annotations: open the path for automatic documentation generation
- Need to agree on tags used
Another issue: auto-escaping of nlist keys
- Possible to look at it but potentially a huge work to take advantage of it
QWG Update - M. Jouvin
See slides.
SCDB Update - M. Jouvin
See slides.
Diskless Servers Support - L. Barda
For LHCb online farm
- Filtering tasks can be done in memory, disk not used
- Credit-card PC have no disks; still SLC4, i386
Initial diskless support in Quattor started in 2005 and taken over by LHCb end of 2007.
- Read-only NFS shared root filesystem configured on the server and copied to the diskless server at the end of kickstart
- RW files for each client in a specific file system (
/.snapshot
) remounted with -bind on the diskless server - 2 templates to describe a node configuration
- Proto node: shared root FS content, main part of the configuration
- Real node: node-specifc configuration
ncm-diskless_server
is used to generate the boot image, manage
DHCP/PXE for the managed systems, configure the ccm.conf
on the
node-specific file system.
Current status
- In production for several years
- Moved farm from SLC4 32-bit to SLC5 64-bit: was pretty easy
- A student should work on SLC5 i386 for CCPC
Issues with the current solution
- Very large mount table on nodes
- Issues when modifying a RW file on a RO file system
- List of RH RW files is too large
- Lack of debug info in system-config-netboot used to generate images and initrd generation is not flexible enough
- Shared root file system can only be a copy of the server file system
Plan is to rewrite ncm-diskless_server
and use UnionFS to replace the
NFS mounts and bind tricks.
Several issues with ncm-network
too, in particular inappropriate
restart of the network breaking the NFS access. Will start a rewrite
(ncm-network_NG
)
- Should separate each part of the configuration: interfaces, routes, bonding…. Code restructuring.
- Should use network utils to modifiy the running system
- Modify sysconfig files for persistent modifications
Try to interact with the community during the design phase to ensure it
could be used at some points in replacement of the current
ncm-network
.
Migration to SCDB at CERN LHCb Online - L. Barda
Motivations:
- Wanted to have real names in commits
- CVS too much hidden in CDB: no more conflict resolution
- CDB2SQL too heavy
SCDB installation using standard documentation, but no QWG. A few modified components:
- LDAP authentication to SVN
- Next Apache will allow subgroups
- svnperms.py to restrict who can do commits
- DHCPD config / aii-dhcp
- quattor.build.xml: no cluster used, profiles located in /profiles/*
- New target added
- Post-commit: added the ability to send an email before running post-commit.py
- build-tag.py run under Apache rather than root without sudo or ssh
- Also added the ability to run quatview as part of build-tag.py (new section in config file)
Also AII v1 to v2 migration.
Special use case: allow sub-detector people to change their CCPC Mac address
- Allow them to commit only in the subfolder corresponding to their CCPC and make a deploy
- Rely on svnperms.py precommit hook to restrict their right to commit
Currently a test instance, in production soon.
Monitoring
NCG-Monitoring and Quattor - R. Starink
Quattor implementation at NIKHEF based on standard/monitoring/quattor
but a lot have changed and merge is challenging.
- Changes both in QWG and NIKHEF templates
Enhancement at NIKHEF:
- Hierarchy of Nagios servers
- Slaves schedule probes and return result to master through
passive checks. Uses
ncm-nsca
on master+slaves - Master monitors slaves with active checks
- Master and slaves with the same host/service definition but with slightly different parameters. Quattor helps to keep things consistency.
- Master is the only one to run Apache: configured with
ncm-filecopy
- pnp4nagios: RRD-based graphs (like Ganglia) integrated with Nagios, usable with any probe
- Service checks: generic checks + grid-specific ones (DPM, WMS, Torque, MAUI…)
Grid monitoring: wanted to enable NCG-based grid monitoring solution for site
- SAM test results
- Local grid checks
- Ideally entirely Quattor-managed but potential conflict: NCG implements a dynamic detection of tests to run where Quattor describes tests in a static configuration
- NCG: risk (happened!) of Nagios killed by a dynamic update of the configuration
- Requirements: co-exist with existing Quattor+Nagios setup, no gLite on Nagios server
- Incompatibilities fed back to developers
Grid monitoring layout:
- Master werver can either submit grid tests directly or through a separated gLite UI
- NCG used to generate Nagios configuration but not to enable/configure Apache, sudo… (done by Quattor)
- Files generated by NCG included as
cfg_file
inncm-nagios
- Potential conflict between Quattor config and NCG config (duplicated hosts…) preventing Nagios to start
- Master has a check (delivered by NCG) to push checks generated on master on the gLite UI
Status
- Started early in pre-OAT times where many manual steps were required
- Deployed recently the Jan2010 release and now configuration almost fully automated
- Many perl-xxx dependencies: better to use
checkdeps
- Setup is not straightforward: NCG is still work in progress
- Not completely happy with NGC-magic but aternative is to describe services in Quattor configuration. Is it maintainable?
Monitoring based on OAT Work - C. Triantafillidys
See slides.
Prototype not ready for distribution, based on QAD and NCG
- Both still under developments
- QAD used as a management interface to add MetricSets and select
nodes which they are applied to based on
/monitoring/features
.
NagiosBox for Biomed - M. Jouvin
See slides.
Monitoring Discussion
TCD still using QWG templates but short of manpower.
- TCD should update QWG trunk so that we can understand/assess difference with NIKHEF
- QWG templates for describing monitoring configuration for local resources
- Must allow AUTH usage pattern where all the configuration of Nagios is done outside Quattor (through QAD)
- RAL will not necessarily use it but could provide feedback on what is done/planned in QWG templates
Grid service monitoring will try to rely on NCG as much as possible
- Need input from AUTH about possible usage patterns of NCG API, how to integrate it in components…
First step: 2 wiki pages describing NIKHEF configuration and NCG potential
- Location to be discussed later
Should plan a Nagios node in future test cluster.
StratusLab - C. Loomis
Goal: enhancing grid infrastructure with virtualization and cloud technologies by incorporating could innovation into existing grid infrastructures
- Simplify and optimize its use and operation
- Enhance existing computing infrastructure with “Iaas”
- Grid services will remain the glue among distributed resources
- Possibility to scale out to commercial clouds
- Deliver an open cloud API (à la Amazon EC2) to enable community services to take advantage of cloud resources
StratusLab aims to deliver a toolkit that can be easily installed by existing sites.
- All the software produced will be openSource (exact license to be decided)
Interaction with Quattor community
- Several partners member of Quattor community: TCD, LAL
- Major goal is to make the life of sysadmins easier: need to ensure that it is easy to install and configure with automated tools
- Want to leverage Quattor community’s expertise with grid deployment: don’t want to reinvent the wheel..
- Would like to use Quattor to build appliance configurations
Project should start beginning of June.
- Still in negociation
SF Migration and Build Tools - C. Loomis
Current situation
- Need to converge to using tools on SF and having a sustainable set of build tools
- Was planned to be done within QUEST and now we need to move forward ourselves
- Need to have concrete relaistic plans
- Build tools unmaintained(able) since ME left
Current tools:
- Wiki
- MediaWiki on SF
- Trac on SF enabled but not used
- Trac@lal: used extensively for QWG documentation
- SVN: SF for Quattor core, QWG for QWG+SCDB
- Bug/feature tracking
- SF: Pan compiler only (based on Bugzilla, working well)
- Trac@LAL: QWG todo list
- Savanah@CERN: main place for user bugs
Build tools:
- Current set of tools very complex and basically unmaintainable
- Attempt with Ant but not clear there is a real gain as Ant config file is rather complex
- Maven solution may be possible: many attractive features (dependency management, multi-level project configuration, RPM packaging), more language neutral than Ant
- Willing to do limited test but will need help from others
Wiki future
- Trac on SF is lacking to many features (plugins) to allow an easy migration from Trac@LAL
- A temporary option could be to move current MediaWiki content to Trac@LAL until we have a usable Trac@SF
- Then we could migrate with export/import of Trac DB
- We gain some time for SF to install the same Trac version as LAL
- Agreed but import of MediaWiki contents to Trac@LAL must be the occasion of restructuring current contents to be more generic and ready for import to SF
SVN future
- Move SCDB to SF asap (and other generic tools if any)
- QWG part kept at LAL temporarily
Bug/feature tracking
- Current SF bug tracking is functional and fairly usable, only missing bit is ability to link to SVN
- Moving all bug tracking of to SF would have the benefit to remove one of many places where users have to go
- Agreement to close Savanah
- Review exact requirements and issues in monthly meetings
- Configure SF with appropriate tickets
- Communicate with the community
- Implement the change. Target: 2-3 months
Mailing list
- quattor-grid: only QWG-related stuff
- quattor-discuss: everything related to core components, including NCM components
- quattor-nagios: discussion to Nagios bit and bytes (development list)
- Announcements: evaluate SF announcement features as they are more appropriate than a mailing list
Build tools:
- Cal ready to do a mini test with Maaven
- Elisabetta from CNAF ready to help
- Should try to remove the too many externals
Quattor Future
Agreement on a monthly meeting.
- Doodle to find the convenient time slot (if possible allow participation from US but not a strong requirement)
- Between 1 and 2h
- Possible facility: Dutch, BELNET, IN2P3
- Do a test asap: end of next week/beginning of the week after
- First meeting: first week of April
- ~10 people ready to participate
Work plan
- Panc debugger: not in the short term
- Panc editor: try to find a student who could do the work to have something basic but complete
- Not scheduled until we find one
- Component tooling + ability to run non-Perl components
- Need to understand costs and benefits
- Concentrate on a review of what this involves
- Ask Nick; first goal may be to produce a list of issues as a FAQ (reasons why it is not possible)
- Support for package managers other than RPM
- SPMA replacement on other platforms
- May also impact the build tools (ability to package Quattor as something else than RPM)
- Not first priority for many sites but support for YUM/APT may be valuable for many sites
- CERN had plans to use YUM of SPMA
- Concentrate first on a review of what is exactly involved, potential benefits, minimum YUM version to use, understand APT and YUM differences
- Aquilon integration: RAL will work on this anyway and expect progress before next workshop
- Will depend a little bit on MS availability to help: MS cannot commit to spend much time but agrees to help
- Ian taking this in charge but limited to getting it usable at RAL
- IPv6: not a short-term requirement but needs to be done
- NIKHEF could commit to try to install a test quattor server that will use IPv6 and check what is working/not working
- Some expertise available in EGEE3 and several NRENs
- Make a clear list of what are the changed required
- Hooks for interoperation with management frameworks
- Code review and best-practices
- Define best-practices and recommended API for components and rewrite a documentation for developping components on this (Luis)
- Set up of build and test infrastructure
- Cal volunteers for the build tools and for adding unit tests
- Do a review of the existing build system at LAPP and make proposal to have it more useful and more used (Eric)
- Review existing components, identifying obsolete/duplicate, and possibly contributing to a documentation page about existing components and their coverage
- No volunteer but remains an important action
- QWG templates
- Other MW support: wait for EMI to see if there is a real need
- generic service support: Christos ready to work on it in the coming months
- VM support: to be done in collaboration with StratusLab
- Integration with monitoring: see previous discussion
Dissemination
- Must remain/be a strong activity for the community: enlarging the community is a way to pave the way for the future
- Proposal to make presentations at the EGI conferences and establish contact with NGIs
- A booth/demo, a poster or a presentation?
- Could take the opportunity of doing something in common with StratusLab at the first EGI conference
- A presentation based on RAL experience?
- Try to be present in NGI meetings
- Increase our visibility: be present on sites like Oloho
- Make a list of conference useful to attend, in particular sysadmins conference
- Wiki page about events and conferences of interest
- Training harder without funding, will also require work to get the appropriate material
- Appliance: StratusLab has a need for it, may come as an outcome of the collaboration
- COSTS project: possibility to get funding for dissemination actions but deadline very close (26/3)
- 1st application only 4 pages (10K words), more detailed application later
- Up to 100KE/year during 4 years
- People (not institutions) from 5 countries
- 3/4 of proposal passing the 1st stage being funded
- Attempt to apply based on QUEST proposal and members + BELNET + CNAF (Cal/Michel). Note that initial member list can be refined later.
Documentation
- Clarify the different targets for the documentation
- General communication: a basis exists on MediaWiki, need to develop, in particular use case study
- User documentation: mainly concentrated on QWG templates, missing
bits (in particular components)
- Need to ask users/NGIs what they need
- Lack of step-by-step guides
- Documentation for developpers
- Many things missing
- Need to document the build tools when we agreed on which ones to use
- Existing documentation review (Andrea)
Commercial support and consultancy
- No big interest in the community right now
Long-term organization of the community
- Legal entity behind the name to support the branding with official logos, presentation layouts…
- Probably not a short-term action
- Licensing cleanup: ask PLUME people if they are still ready to contribute without funding
Quattor toolkit component status
- AII: nothing major, maintained by Luis
- Nick has some contributions pending
- SPMA: still maintained, minor modifications
- ncm-spma review and enhancements: schema adjustements (repository), function rewrite (Victor)
- rpmt-py: no maintainer but no maintenance needed so far… German is the author.
- panc: see discussion, maintained/developped by Cal
- CCM: last contributor/maintainer was Nick
- Need to enhance CCM to be able to handle gzip profiles without ungizipping them at Apache level
- ncm-cdispd/ncm-ncd: Nick is the last one who had a look at it
- Also the new daemon written by Nick to act a proxy for managing non Linux boxes
- Add ability to run dependendy only if there was a config change affecting them
- Tracking/querying status of last component run: mod may be already done by Nick
- ncm-query: Michel maintainer but no change foreseen
- FUSE file system to browse configuration on client
(
ncm-query
alternative) - CDB: maintenance done to use last panc compiler (CERN)
- cdb-sync may be of broader use: SCDB may have to use it to take benefit of ncm-cdispd alternative
- ncm-templates: a template processor used in components, not really maintained but works!
- perl-CAF/perl-LC: actively maintained by Luis
SINDES: an external potentially useful component, should check the status and the accessibility outside CERN
- No longer any maintenance at CERN, frozen
- Any possibility to add it to Quattor repository?
- RAL to drive the effort, Veronique to check with CERN
Worshop preparation is now the responsibility of monthly meeting.
Conclusion
Next meeting
- Several proposals: Philips (Eindhoven), RAL, Strasbourg
- After a vote, RAL choosen if they confirm they can organize it
- Sites are welcome to express their interest in hosting a workshop
- 2+1 days