Quattor Workshop - Budapest - 20/3/2012

Agenda

Community News

No new sites that we are aware of in the last six months.

CERN looking at Puppet for their new infrastructure and possibly as a Quattor replacement on the future

Report at a future Quattor Workshop

DESY still using Quattor, may try to reestablish contact

SLAC looking at Quattor and requiring some help.

CNAF has contact

PAN Compiler

v8 frozen: 8.4.7 is the last version released

No more support
If still using, upgrade to v9!

v9 first versions

Last released version is 9.1
RC1, RC2, RC3 released
9.1 is RC3 + Windows-related fixes
v8 deprecated features removed
“panc nutshell book”
“root” element for build info
May be an interesting feature to implement “profile cloning”
various bug fixes

v9.2 recently released

plan for removal of escaping: deprecate XMLDB, change tests to use pan xml
Enabled by a switch, the only option in v10
Profile may contain a flag saying if escaping has been used or not
Impact on SCDB
Componenent need update to be prepared to handle unescaped keys (values)
Expanded maven support
Currently archetype, simple build
Future: Maven has a Pan interface, in addition to ant
Support for JSON output format
Will enable transparent gziping almost for free in term of performance during compilation or even a positive one when IO is the limiting factor (eg. laptops)
Need to document how to enable gzip in downstream components
Skeleton for updated panc command in clojure
More consistency in the options, simplified/cleaner code for processing options in the compiler

v9 series roeadmap

Streamlined, simplified code with limited changes to functionality
Gradual migration to clojure: CLI, template compilation, functions, output formats…
Can be done incrementally
Limited inclusion of other libraries (e.g. Clojure, JSON)
Until now panc has been self contained
All included components have an Apache 2 (compatible) license

New requests from discussion

Range starting with a negative number (Luis): may lead to some grammar ambiguity not trivial to fix
panc as a service (Gabor): idea is to have a continuously running service to avoid the starting time cost when compiling only one profile
Functions to convert IP addresses (v4 and v6) to long integer (and vice versa) to help with netmasking and other similar things (Gabor)
Readd debugging information during dependency calculation to help identify the reason a dependency is not considered up to date (Michel)
Disappeared during v9 code refactoring
Add an info() function in addition to the warning() one to avoid abuse of warning() (Gabor)
function or pragma to activate debugging in a given template as an alternative to command line switches (Gabor)
Easier to use for people who are not Pan experts
May be great to say debug this template and all included templates…
Not easy to implement as compilation occurs before execution and debug code is removed in the debug flag is not set for the template: possible performance impact if checked during execution
In include, support a list of template rather than just one (Ronald)
Rename nlist to hash
Keep both as a function and deprecate nlist at some point
Add a section to the manual explaining the Pan vocabulary and how it matches the vocabulary in other languages

Discussion but no agreement on how to allow structure templates with rights to modify other part of the configuration than the one they are assign to or to pass parameters to normal templates…

Several goals: make things cleaner than using global variables (variable scope), make easier to non expert to use pan, entitlement (restrict a user to use one specific template to achieve a specific configuration action)…

Annotation Support

2 possible approaches for building the output files produced by annotation processing

Hierarchy based on input templates hierarchy: related to layout of input files
Makes input options more complex as some specific options are required just for annotation
Namespace : conflict between similar namespace in different template sets
Requires a wrapper to handle possible conflicts.

Decision : move annotation processing outside of panc compiler to a separate tool

Will reuse the compilation module of panc
Makes having distinct options easier: keep the current behaviour
Will put less constraint on further extensions of the tool

CCM

Problems

Scalability because of the backend storage format: mainly MS and Gent
Difficult to support new backend format like JSON
Supports only http URL: makes testing difficult
Remove of unescape() function

MS contributions in the last years

CDB_File format x3 faster
Require an additional perl module: perl-CDB-File
Kerberos support for encryption

Recent work

JSON support: quite advanced, able to run components from the JSON file
Still need to test/Validate the results
Support of new protocols for downloading profiles: file://, ftp://, ssh://
But the underlying modules may not support all of them

Status

Rewritten modules have a large number of unit tests
Introduces new dependencies

JSON support issues

Some data types are lost: impossible to distinguish long and double
And Perl doesn’t distinguish them internally from string…: means that you cannot know if the information was a string or a number
Pb doesn’t exist with language distinguishing thems
No way to ask JSON about the type of a data
In fact getTree() already loose this information

unescape() removal

Will check the flag saying that the profile has been generated without escaping
IF the flag is set, will do nothing: just return the argument unmodified
No modification required to components, except if they use their private version of unescape()

Requests:

Have distinct exit code in ccm-fetch for failures related to network errors (Gabor)
May allow to switch between different profile sources

AII

Known limitations

SL6 issues
LVM and reinstallation
Large installation disk support
Scalability: lack of large installation

MS additions/improvemnents

Profile caching: Profiles are cached for future reuse for better performances, using CDB_FIle format for improved performances
Fine-grained locking: per-profile instead of globally
Allow to run aii-shellfe in // for several nodes
Allow monitoring of stalled installation by looking at lock timestamp
Partition alignment based on some provided information
Not really working with SL4
Parallel installation support: only partially merged as it requires PRocess::Parallel which is MS-specific
Some alternative may be found
New options: -firmware and -livecd to boot alternate images in a way similar to rescue
Target of these options defined in node profile

*Full RHEL6/SL6 support

Schema made more flexible to add/remove mandatory or deprecated options
GPT used for partition tables, using parted
Anaconda doesn’t die with large disks, much faster
Anaconda doesn’t allow to reinstall changing some partitions and protecting others…
Built with Maven tools

New issues

Code quality is becoming a problem: many contributors, few tests, spaghetti code
May require a significant rewrite with unit tests
Kickstart file more and more difficult to read….
Debugging of Kickstart %pre and %post scripts can be very difficult…
May redirect everything to a file in /tmp and ensure all commands are logged

Pending issues/requests

Implement deferred reboot during installation based on a profile property: needed by MS
DHCP plugin by MS: would be great to have it release to the community
Replacement for aii-dhcp able to do more things looking at the whole profile information (like adding info required by Windows machine)
MS saw problems with machines where boot disk is renamed after some events (change) on the machine or OS upgrade
Would like to think at a mechanism to give an abstract name to disks and do late binding but difficult to implement into Kickstart

MS Stanley still considering reviving the Solaris port, including installation port.

Network Configuration

ncm-network

Among the problems

Network restart during reconfiguration: potentially a problem on diskless systems
Code very difficult to maintain

Rewrite requirements

Separate each part of the configuration: interface, routing…
Reconfiguration without network restart
Support all current features
Add all the features that RH scripts support
Easy path from old to new component if possible
Use ip command rather than ifconfigand route

Routing configuration: rely on ip route rather than ADRESS/NETMASK/GATEWAY

Convert route definition to a nlist: key is the subnet usual notation xx.xx.xx.xx/nn (nn being the number of bits in the netmaks)
Policy routing contributed by Gabor
Multipath routing: done by ip route nexthop

Status

Main missing part is the routing… Loic expects to have some time to work on it in the coming months…

IPv6 support required: no real reason to manage it in a separate component, need to add it

Network schema

May require some refactoring: some information are improperly placed under /system/network and we need to support more information interface specific
An idea is to move this kind of information into the hardware description part and have the network configuration module looking there for this part of the information, based on the interface name

Other issues

loopback interface is not configured by network configuration module and is ignored. This doesn’t allow to define aliases on it (Gabor)

IPv6 readiness - R. Starink / P. Bernabé

Evaluation work done at NIKHEF recently: goal was to identify what was needed to be changed

Focus on mixed IPv4/v6 setup
Backward compatibility

Test cluster

Isolated from the world
OS: CentOS 5.5 64-bit
A dedicated Quattor server
Clients: basic OS to start, grid services later

Quattor server

PXE only supported for IPv4: mixed setup will be required to handle installation
As long as IPv4 is there, no change required on the server!

ncm-network

Schema: basic support for IPv6 but no support for mixed setup
Duplicate IP, gateway…
Some IPv6 enabling fields
How to specify IPv6 addresses: configure separately prefix and local part? What to bind prefixes to (interfaces? node?)?
Autoconfigure support?

QWG changes

Allow enabling IPv6 globally or per interface
A new table for IPv6 addresses with a more flexible layout allowing to handle multiple addresses per node
IPv6 parameters: per network?

ncm-iptables

A few mods already made to enhance patterns for IPv6 and add protocol options

Other components: difficult to predict needed changes

Service specific probably?

Conclusion (next steps)

Focus on schema for IPv6 description: ensure enough flexibility to cover different use cases without too much unnecessary complexity
Ask Paco if he can contributes
Ask IPv6 experts (in particular in HEP community) to assess the proposed schema

Aquilon and other MS Tools

Current status

AII, CCM mods released thanks to Luis
ncm-network changes contributed
ncm-sysctl not yet modified: waiting for escape/unescape removal
quattor-remote-configure available but quattor-remote-deployer not yet released as it is being rewritten

5 core developpers, many support engineers

Current focus is to allow more operations without editing templates as Aquilon usage extended to less standard machines (non “grid”)
Plan is to add more commands into Aquilon that will deal with more advanced operations

Quattor Remote Deployer (QRD) status: equivalent of ncm-ncd to manage non Linux devices

Python rewrite of the old Perl code
QRD relies on Quattor Remote Configure (QRC) to execute components: QRC already released
ython rewrite requires a Python CCM API

Aquilon changes and main developments

More changes made by Aquilon commands and less template editing, e.g. assign specific features to an existing personnality
Make easier for users to make trivial changes
Reduce the risk of a big mistake by exposure of templates and the right to modify them
HW features: generic description of HW features implemented using vendor tools and their specific options
1 template to enable/disable a feature including a vendor/model specific template definining the required configuration
Under namespace features/hardware
Added description of clusters (VMware clusters, high availability clusters): members (e.g. assignment of VMs to an ESX clusters), reboot constraints, file system descrption allowing attaching remote file systems…
Will add possibility to assign a VM to a single host (e.g. running KVM) rather than an ESX cluster
Planning to add support for defining service address (moving with VMs or shared by several) in Aquilon
Request to model Windows cluster in Aquilon
But real use may require a GUI…
Working on modelling network switches and configuring them QRD

Aquilon appliance

Still on SourceForge but not updated… lacking templates to be useful
Looking for “pop-up” installations via cloud
Appliance uses Debian

Mac OSX support: proof of concept of other OS support in Aquilon

ncm-directoryservices to configure OpenDirectory: ncm-mcx
ncm-ncd, ccm-fetch all works
CCM uses DB_File
A bit of fixing needed in quattor-build-tools
A single package created (alpha version) and put on SourceForge: “Managed Quattor Client for OS X”

Fostering Aquilon adoption

James’s experience

No major problem in using QWG templates, a few minor modifications needed
James agreeing to write a short report about these…
Main difficulty is the different workflow between SCDB and Aquilon
Aquilon upgrade: DB upgrade scripts only provided for Oracle
Aquilon supports whatever DB backend is supported by SQLAlchemy
Need to maintain backend specific upgrade scripts: PostgreSQL seems the most important to support

Appliance work required to allow early adopters to look at it

Update Aquilon version
Add QWG templates
Add ability to check out templates from appliance

RAL will have a summer student to work on Aquilon in July for 3 monthts

Try to get the new appliance ready by then.

Quattor Remote Deployer experience at MS

Goal: manage non Linux devices that offer some sort of API for remote management

E.g. ESX clusters, switches, file appliances…
Handle iniitial configuration and reconfiguration

2 parts in the system

QRD itself: equivalent of ncm-cdispd
Receives profiles and decide what to do
Uses a plugin for either installation (aii, configuration of the boot server) or for configuration (QRC)
Quattor Remote Configure (QRC) : equivalent of ncm-ncd
1 framework and specialized components

Choice of distributed (redundant) QRD/QRC configuration

Main goal is redundancy: multiple source of the configuration easy to implement with http
Every QRD receives the notification and they use a lock to ensure only one is really execute the configuration actions
Shared data (NFS) between all instances for locks and cached profiles… not ideal
Modified CAF::Lock to implement NFS locks
If possible, achieve better scalability but not really easy because of potential locking issues that may negatively impact the performances

In QRC, implemented conditional execution of dependencies based on the fact they have a change in their configuration…

Check whether it makes sense to feed back into ncm-ncd

QRD

Different modes and connectors implemented by plugins that have their own (simple) configuration
Analyze of the work to be done : multiple configuration changed notifications merged in one action
Does everything needed to move to the last config successfully deployed to the new one
Keep track of the plugin status to decide if a config was succesfully applied

QRD improvements needed

Too much locking that may lead to a configuration deployment being postponed indefinitely
Locking + grouping is overcomplicated and 2 levels of locking are probably unnecessary
IO performance problem with NFS
Lack of visibility of action progress: difficult to troubleshoot why a request is not executed
Impossible to execute an action without a CDB notification (profile change): no dry run, no possibility to force the redeployment of a configuration without a profile change…
Requires direct use of QRC
endless retry loop: useless…
After a failure, a component will be re-run regularly, without any config change (default: 5 mn): difference with ncm-cdispd

Current status

In production at MAS for 1 1/2 year
Known problems are well identified
Rewrite in progress but not yet ready

QWG Templates

Input based on RAL work.

YAIM Support

Goal: allow to reuse RPM lists for grid services as they exist in QWG but use ncm-yaim rather than standard QWG configuration to do the actual service configuration

BDII as a proof of concept

Proposal

Define a variable USE_QWG_CONFIG to select the configuration variant
Create for each service 2 namespace qwg and yaim
Rename current service.tpl into qwg\config.tpl
Create a new service.tpl that will include RPMs and acts as a switch between both variants

Open questions

Account configuration: with YAIM or with ncm-account?
Try to support both and compare

RPM list management

Would be desirable to manage separately the RPM list and the RPM version to use

RPM lists can be generated from an OS distribution
Would make easier the maintenance of templates in config/os: should not require a change with new versions as long as the RPMs are the same.

Need to improve the XSL stylesheet processing the distribution comps.xml to produce the information in this new format.

Template defining default version will be included at the beginning of every RPM template for a non disruptive migration

SL6 issues

ncm-network changes to manage udev made by MS but not yet contributed back.

grub/spma inconsistency with new kernel names including architecture: fixed by last spma (-12).

ldapauth configuration completely changed

New configution supported through a new configuration subtree in the component

SPMA and SELinux: SPMA requires a new SELinux effort to work

Added by last version in SF repository

Monitoring

Experience with Nagios and Icinga at UGent

Main goal: dynamic reconfiguration of monitoring

Currently done by a Python framework parsing the XML profiles to produce the required monitoring configuration
At UGent, everything is described in Quattor
The framework produces a template that is used by the Icinga server: currently no automatic reconfiguration of the monitoring server
Current QWG approach relies on a static definition of groups in a specific template and is difficult to keep in sync with the actual config: at least the risk of discrepancies
Plan to handle this with Aquilon in the future
Unclear if it will really impact the overall workflow

Host groups a host belong to are defined in the Quattor configuration

Processed by the Python framework to generate the appropriate monitoring configuration

ncm-icinga: written from scratch

Schema different from ncm-nagios
Configuration information required (variables) is basically the same

Status

Doing last tests
Production targeted end of April
Want to look at Aquilon this summer

QWG Todo: document the variables related to monitoring and common to both Nagios and Icinga

Already some information on the Trac wiki

Configuration Modules

Configuration module status report

State file managed/generated by ncm-ncd and ncm-cdispd in last versions
Requires a ‘state’ drective defining the state directory in the ncm-ncd and the ncm-cdispd configuration file
Presence of the file denotes a component that should run and indicate the error if there was one in a previous run
TODO: add an option to ncm-ncd to display the configuration modules that are waiting to run and whether they experienced an error in a previous run

Test mode: ‘noaction’ property should do the job if the configuration module uses CAF

Will display file to be opened, command to be executed rather than doing it

Would be great to keep not only the latest profile but the last one.

ccm-purge could keep a given number of profiles or all the profiles more recent that a certain time interval
ncm-query could be enhanced to list differences between profiles with several levels of details: components impacts, detailed config change for part of the configuration tree…

Delayed execution: see [/wiki/Meetings/Workshops/20111011#ChangeScheduling Strasbourg’s minutes]

No work since

Support for other languages: desirable but not urgent

YUM Support

3 good reasons to use YUM (at least!)

Scale down Quattor entry level: may be easier for small sites to add their packages
Remove a Quattor specific component to maintain
Enable support of other platforms

Concentrate first on making possible to use YUM as a packager rather than SPMA in a flexible way.

Proposal

Make the packager an implicit dependency of all components rather than an explicit one
ncm-cdispd will be no longer concerned by ensuring it is run first
ncm-ncd will be in charge of running the packager before any action: may require to pass a new option from cdispd or to add an option in the ncd configuration file

Corner case: ability to force a component like filecopy to run before the packager to fix a problem preventing its successful execution

See later if there is a real use case: may add a property to components like runFirst to force their run before the packager

packager specific configuration modules (ncm-yum, ncp-apt..) should take the package information from the same part of the configuration tree (/software/packages)

Configuration description should remain independent of the packager

Community Life and Development Process/Tools

Quattor Build and Test Process and Tools

StratusLab experience

Build with maven as the driver
Easy to wrap non-java builds and to use other build tools like make
Tools should have unit tests
Nexus as a package repository
Works well with maven: packages pushed automatically after build
Wth maven, has very clean dependency resolution process: dependencies automatically pulled in with transitive dependency resolution
A new snapshot entry created by maven every time it runs and produces something new
Package repository at the heart of the build process
Continous integration with Hudson/Jenkins
Continuous build of the SW, triggered by modifications to the package repository
Can also do build on demand
Installation of test infrastructures
Allows to manage dependencies between test stages but not working well as a workflow manager (ability to take different decisions based on result of a stage)
SlipStream for system testing
Workflow-like features to manage deployment of virtual machines
Growing use for system deployments and testings
Web server for release distributions
YUM based repositories for published releases

Proposal for Quattor

Move quattor.org DNS domain to an active institue to allow more flexible updates
Need ability to create entries in this domain
Discuss with CERN immediate transfer of primary server and possible transfer of ownership: LAL primary, RAL secondary
Start wth same toolset as for StratusLab
Decide where to put the tools: StratusLab cloud? persistent disk for data?
Move Nexus server from LAPP to the same place as other tools?
Demonstrate builds for core tools
Some already maven-ready
PRobably ~10 more to convert to new build tools: cdispd/listend/ncd, aquilon, rpmt-py/spma, ncm-templates, ncm-query
Not necessary to convert them to Maven
Deployment/test of Quattor “HelloWorld”
Simple test configuration based on Aquilon appliance?
Perform “release” of core tools
Devise same procedure for configuration modules
Think about unit test environment: Luis started some work around this…

Actions

DNS changes: Michel
Jenkins, Nexus deployment: Cal
Integration of core tools: Cal + tool maintainers
Target date for initial setup: ~1 month

Extend the number of people with Maven knowledge…

Not a requirement for the very short term… core tool maintainers are a very limited set of people
Trac page reasonnably up to date…

Quattor releases: let’s review how we do it when we have the build infrastructure in place

With YUM, no need of metarpm
The main issue is not bringing dependencies but identify what are the stable versions of all components that work together
“HelloWorld” test configuration may help to assess what is ready for releasing

Weekly standup meetings

Should run without Michel: Luis and Ronald agrees to backup

Main priorities should focus usability rather than new features

Cookbooks to start with Quattor and its core tools
Reuse the panc idea
Foster Aquilon adoption as a tool hiding the template complexity
Possibility for a Quattor dashboard
Possibility to reuse/adapt existing tools like Foreman?
Quattor as a tool to manage virtualized infrastructure
Start by documenting existing use cases and experiences in blog entries, e.g. LAL work within StratusLab to manage images rather than machines

Quattor Public Visibility and Community

Main page move to GitHub

Need to change DNS… see previous actions
Cal makes a proposal that can be reviewed
Populate with a few entries all the categories of information
Put Oloh! widget on the web page
Go live!

News broadcasting

Send more general/regular news should appear on the main web site
Broadcast it to the general mailing list
Tweet this news: any plugin to do it automatically?

Reserve Quattor name on Google+, LinkedIn, (Fabebook?)

Cal will do it

IRC: try to continue to listen it despite the low activity…

Dissemination materials

Build an internal repository for existing dissemination materials that could be reused by others to make presentations

Miscellaneous

Fix link to Quattor home page in WallStreet Tech article about MS usage of Linux/Quattor (Nick)
Or add a redirect in Mediawiki page

Conclusions

Next workshop: U. of Gent

Doodle asap to fix the dates after UGent fed back their constraints

13th Quattor Workshop Summary (March 2012)

Quattor Workshop - Budapest - 20/3/2012

Community News

PAN Compiler

Annotation Support

CCM

AII

Network Configuration

ncm-network

IPv6 readiness - R. Starink / P. Bernabé

Aquilon and other MS Tools

Fostering Aquilon adoption

Quattor Remote Deployer experience at MS

QWG Templates

YAIM Support

RPM list management

SL6 issues

Monitoring

Experience with Nagios and Icinga at UGent

Configuration Modules

YUM Support

Community Life and Development Process/Tools

Quattor Build and Test Process and Tools

Quattor Public Visibility and Community

Conclusions