Quattor Workshop - Strasbourg - 12-14/10/2011

Agenda

Core Tools

SINDES2 - V. Lefebure

Main new features (see presentations at previous workshops)

Based on Krb rather than a private CA/certificates.
Unique API for human clients and machines

Jan left CERN in July but still doing some follow-up to help with the move to production at CERN

1st production cluster expected at Christmas
Migration beginning of next year

Sources will be put in SF Quattor as soon as the code is considered ready for production.

Stijn: still using v1 at UGant.

Same for Bruxelles grid site where it is mainly used for certificate delivery…

MS: not interested by PKI, already have a production Krb infrastructure

Chose to add Krb encryption support in ncm-ccm and ncm-download
Profile encrypted before sending it but stored plain text on the server

RAL: gave up with SINDES

Will ncm-download with its new support of Krb as a replacement

Quattor Status at MS - N. Williams

MS is doing Quattor its own way… not using Quattor directly, Aquilon a layer above Quattor

Quattor used underneath
Aquilon providing many advanced features not found in other config DB implementation: personalities, archetype, template sandboxes for development/testing without interfering with production, multiple panc version support…
archetype implemented as a pan loadpath

Figures

20K real machines managed with Aquilon
20K virtual machines, including VMware ESX servers using Quattor Remote Deployer (QRD)
50 NetApp servers using QRD too
Some Windows management : pre-configuration (host name; IP config…) passed to Windows team
No plan to go really further: Windows team relies on its own tools
Everything managed from the same server
Reusing QWG machine type concept but extending it (personnalities)
Working on Hadoop configuration

Still wanting to open-source Aquilon and QRD but lack of time to do the code clean up.

Basically forked AII to fix the issues found.

Need to discuss how to streamline feeding MS contribution into the code mainstream.

Start an Aquilon section on Trac wiki…

Aquilon Appliance Walk-Through - N. Williams

Goal: enable an easier evaluation of Aquilon

Installing Aquilon to evaluate it can be painful
Can also be used to do development validation
Allow to import an existing Quattor configuration
Zero requirement on existinf infrastructure

Based on Ubuntu Turnkey distribution (intended to create appliances) + sqlite

Provides interconnection with a datawarehouse using cdb2sql
Web-based management interface to help creating/managing machines
Written in Pylons
ssh connection to appliance where Aquilon command line interface can be used

Several modes implemented using URLs

/reset: allow to reset Aquilon configuration and reinitialize it
11 steps before installing the 1st machine : add an archetype, personality, os, define network parameters…
Machine management
3 steps per machine: select machine HW, machine network config (including algorithm to allocate IP address), host associating a machine, an archetype, a personality, an OS version, build status…

Aquilon privilege management: 6 possible roles

Last version of the appliance not yet on SF… but soon!

QRC / QRD - N. Williams

Goal: manage non-Linux machines like appliances, ESX clusters…

QRC = Quattor Remote Configure (equivalent of ncm-ncd)
QRD = Quattor Remote Dispatcher (equivalent of listend + cdispd)

Idea is to have a machine that acts as a proxy and listens for changes in profiles to implement them on a remote device using a management API for the device.

Use components in a namespace specific to the device instead of NCM::Component
QRD is the complex part in charge of doing profile change analysis: support failover between 2 QRD servers
QRD synchronization rewritten using Apache Zookeeper

QRD listens to CDB notification rather than CDP: the notification is sent when there is a profile change, no matter what is the exact profile modified.

QRC in a Git repository on SF.

QRD currently being rewritten in Python, not yet in SF.

MS currently managing mainly ESX clusters with QRD/QRC. Planning at managing network switches

We may discuss in the future possibility for common schema for some appliances/devices and a library of connectors for widely used devices like CISCO switches…

Pan Compiler Update - C. Loomis

v9 is the actively developped version

Currently 8.4.7 with deprecated features removed
Main change is new options to fix annotation issues with namespaces
Documentation reorganized: just one book merging previous documentations
Simplified, streamlined code, no new feature (yet?)
Migration to clojure: improved support for multithreading
Is clojure licence (EPL 1.0, Eclipse) an issue?

Migration v8 to v9: use 8.4.7 with deprecrated feature warnings to identify problems

Use relevant option to turn them into fatal errors

v9 release candidate available but not yet in SF (soon)

Will be tagged as a release as soon as there is some feedback

QWG Templates

QWG Templates Update - M. Jouvin

See presentation.

StratusLab templates: not really part of the QWG templates (yet) but feedback from the community is welcome.

Monitoring Templates - R. Starink

Work in progress behind the scene…

Documentation available on the wiki.

UGant is using icinga and started with a previous version of Nagios templates: another fork…

Need to see if we can avoid the cost of another future merge…
icinga as a Nagios fork has a very similar configuration, check if there is the need for a new configuration component
Need to identify the use cases that cannot be implemented by current QWG templates

Discuss on quattor-devel

Development Process

See [/wiki/Meetings/Workshops/20110316#DevelopmentProcessDiscussion Michel’s presentation] at last workshop: basically the same questions after almost 1 year of experience with SCRUM.

Scrum / standup meeting

Agreemenent this is rather positive
Weekly meeting should happen every week, whether Michel is available or not
Weekly meeting should be on time: let’s move it to 2:15 pm to increase the chance to start immediatly
Meetings should remain short for everybody, not only for the late comers…!!!
Add EVO connection into the reminder
Have more formal planning/review of sprints and advertize sprint outcome on the mailing list
Let’s try 1 1/2 month sprint

People working on some developments need to let others know through a message to quattor-devel list or participation to standup meetings.

Monthly meeting

Ask the general mailing list about the interest of the community
If yes, propose to have a larger meeting for the sprint review meeting
Plan them in advance and add an event in Quattor Indico area (and reference it on the wiki)
Send a reminder 2 weeks in advance

Repository branching

Difficult to use with SVN + old build tools: need to put some effort to make progress on the migration to new build tools and Git
Not really a problem for QWG where trunk is really for work in progress
As discussed at the last workshop, good candidates for migration are CCM-related things (ccm, ncm-cdispd, ncm-ncd, QuattorFS), AII and its plugins
Encourage people who are contributing major changes to make a branch, as it has been done for ncm-network
Use stand-up/review meetings to decide merging into trunk

Round Tables

=== Kickstart / AII === #AII

Disk partitionning: current scheme has advantages (full control) but doesn’t allow to format large boot disks (> 2GB)

Support for GPT partitions: need to check if Anaconda supports it
Using standard Anaconda partitionning features would help taking advantage of new features but should accept to give up some control
Need to check that reinstallation works in all configuration

Kickstart configuration is done by an AII plugin: could if necessary replace the plugin or add a hook

NIKHEF already ignores pre-script and produces the kickstart instructions required to use standard Anaconda formatting features
NIKHEF has written a hook disklayout that could be used as an alternative to aii-ks call to lib-blockdevices
In fact aii-ks/lib-blockdevices already generates (almost) the required information for Anaconda, just with noformat

Need to add as a standard AII function the function allowing to add a new hook whatever are the hooks already defined

Such an helper function already exists in hook for SINDES

aii-ks: really fix pbs mentionned in ticket #235 (See #235#comment4)

aii-pxelinux: see if it possible to specify the MAC address to use rather than the interface name

From http://wiki.centos.org/TipsAndTricks/KickStart:
A third method works if you are doing PXE based installations. Then you add IPAPPEND 2 to the PXE configuration file and use ksdevice=bootif. In this case anaconda will use the interface that did the PXE boot (this does not necessarily needs to be the first one with a active link).

=== SL/RHEL6 Support === #RHEL6

MAC address binding to interfaces: now done by udev rather than the interface config file.

May be a source of problems as udev config is done by Kickstart before interface config files are configured by Quattor with some risk of inconsistency
1 possibility to explore: use non standard name in Quattor config and define udev rules appropriatly
Check impact on apps and operations

On compute nodes running a RHEL6 derivative, kernel option nohz should be turned off(nohz=off) in Grub config to get better performances (Stijn)

Scheduler option to make it more responsive on desktops
Define by defaults for RHEL6 derivative?

=== SPMA vs. YUM === #YUM

See Steve Traylen’s https://twiki.cern.ch/twiki/bin/view/Main/SteveTraylen/Spma2Yum.

All SPMA features can now be implemented with YUM
Main drawback is the cost of the migration…

Several open questions

`ncm-yum requires some extensions not handled by anybody
What is the right level of detail in package list: proposed use of repository snapshots will probably not work accross sites to implement things like OS errata
We need to implement some kind of metadependency for a packager: something like a componenent alias name implemented in ncm-ncd
This may replace the need for SPMA as an abstract layer to the packager, something that doesn’t work properly as the package description is quite tightly coupled with the actual packager, making a common schema difficult.

=== Change Scheduling === #ChangeScheduling

MS use cases

spma not allowed to run on a running host, can only run at reboot
ncm-spma runs but with the flag to run spma disabled to avoid preventing other components to run
Some actions cannot be scheduled without reinstalling: would be great to let the user knows that his changes will not be applied immediatly
May be handled at Aquilon (SCDB?) level: will require a partial diff of XML profiles: need to find a tool to do this

Partial configuration changes look difficult as Quattor is really designed to enforce consistency on a “all or nothing” basis.

The most important feature would be the implementation of changes at controlled time.

For SPMA, it’d be nice to establish a black list of packages that should not be upgraded except at boot time.

ncm-ncd: add a mechanism to blacklist some components if some conditions are not met.

List of possibile conditions/states (local time, boot context…) shoud be well defined and a library/common method should be available for all components to evaluate the state/condition the same way.

Exact actions done if a condition is met will remain the responsability of the components or ncm-ncd

Would be useful to let a component advertize that a change will require a reboot

Could be implemented through the status file produced when the component is run

=== Network Configuration === #NetworkConfig

MS main requirements

Bonding
policy routing
ethertools parameters

Some patches contributed for current ncm-network to handle this but would be good to rethink the schema

In particular the component configuration should be moved to /software/components and most of the things currently under /system/network should be moved there to allow an easier support of appliances and non Linux boxes
In particular most things currently under /system/network/interfaces
Some children of interfaces in the configuration should be probably at the same level (bonds, vlans…) in the same way we have blockdevices and filesystems
Gabor will initiate the discussion on the mailing list with a proposal

ncm-network rewrite: a basic version is available in SVN, ready for code review

Try to deploy on test machines after initial code review

=== QWG Global Variables === #QWGVars

Hot discussion about use of global variables in QWG

Contents unvalidated
Request to put them in the profile to get them validated
Michel strongly opposed as the state value at one point in time may be misleading: what is important is the value in the context of a given template, something that cannot be tracked in the XML profile
Side discussion: filecopy unvalided contents

Until we have a pan debugger (no concrete project anymore), some possibilities:

Ensure there is the appropriate debug statements in the template
Consider as bugs templates using a variable without any content validation (when it makes sense) and fix them!
See with Cal if a dump_global_variables() function would make sense in PAN to dump all global variable at one point in time
If only a subset of the variables are needed, can probably be done with a debug() statement
dump_global_variable could accept as an arguement a regexp to match the variable name to restrict to variables used in the template

Discuss with call the possibility to dislay with a Pan option the name of the template entered and wheter it is executed or ignored because it has the unique tag.

Document the global variable “namespaces” used in QWG variables.

=== Configuration Component Logging === #CompLogging

Document/agree on what is the expected level of default verbosity for a component (log/info) and what is the debug level to use for what.

=== Writing Components in Python === #CompPython

Several problems:

ncm-ncd should fork a process to run a component rather than running the component inside ncm-ncd: not a very big change…
Implication on locking?
CCM API in Python: almost done by MS as part of Quattor FS
No locking implemented yet to access the profile
Python version of CAF to ensure consistency of some low-level operation: more work…
Also the idea of using a standard library to access config files and ensure they use a consistent format

Add to future sprints a proof of concept development and attempt to address the required changes in ncm-ncd.

Sprint 3 Review

See [/wiki/Development/Scrum/Sprint-2011-03 sprint backlog].

New sprint is [/wiki/Development/Scrum/Sprint-2011-04 sprint 2011-04], due December 1st.

12th Quattor Workshop Summary (October 2011)