13th Quattor Workshop Summary (March 2012)

Written
20 March 2012
Author
Michel Jouvin

Quattor Workshop - Budapest - 20/3/2012

Agenda

Community News

No new sites that we are aware of in the last six months.

CERN looking at Puppet for their new infrastructure and possibly as a Quattor replacement on the future

  • Report at a future Quattor Workshop

DESY still using Quattor, may try to reestablish contact

SLAC looking at Quattor and requiring some help.

  • CNAF has contact

PAN Compiler

v8 frozen: 8.4.7 is the last version released

  • No more support
  • If still using, upgrade to v9!

v9 first versions

  • Last released version is 9.1
  • RC1, RC2, RC3 released
  • 9.1 is RC3 + Windows-related fixes
  • v8 deprecated features removed
  • “panc nutshell book”
  • “root” element for build info
  • May be an interesting feature to implement “profile cloning”
  • various bug fixes

v9.2 recently released

  • plan for removal of escaping: deprecate XMLDB, change tests to use pan xml
  • Enabled by a switch, the only option in v10
  • Profile may contain a flag saying if escaping has been used or not
  • Impact on SCDB
  • Componenent need update to be prepared to handle unescaped keys (values)
  • Expanded maven support
  • Currently archetype, simple build
  • Future: Maven has a Pan interface, in addition to ant
  • Support for JSON output format
  • Will enable transparent gziping almost for free in term of performance during compilation or even a positive one when IO is the limiting factor (eg. laptops)
  • Need to document how to enable gzip in downstream components
  • Skeleton for updated panc command in clojure
  • More consistency in the options, simplified/cleaner code for processing options in the compiler

v9 series roeadmap

  • Streamlined, simplified code with limited changes to functionality
  • Gradual migration to clojure: CLI, template compilation, functions, output formats…
  • Can be done incrementally
  • Limited inclusion of other libraries (e.g. Clojure, JSON)
  • Until now panc has been self contained
  • All included components have an Apache 2 (compatible) license

New requests from discussion

  • Range starting with a negative number (Luis): may lead to some grammar ambiguity not trivial to fix
  • panc as a service (Gabor): idea is to have a continuously running service to avoid the starting time cost when compiling only one profile
  • Functions to convert IP addresses (v4 and v6) to long integer (and vice versa) to help with netmasking and other similar things (Gabor)
  • Readd debugging information during dependency calculation to help identify the reason a dependency is not considered up to date (Michel)
  • Disappeared during v9 code refactoring
  • Add an info() function in addition to the warning() one to avoid abuse of warning() (Gabor)
  • function or pragma to activate debugging in a given template as an alternative to command line switches (Gabor)
  • Easier to use for people who are not Pan experts
  • May be great to say debug this template and all included templates…
  • Not easy to implement as compilation occurs before execution and debug code is removed in the debug flag is not set for the template: possible performance impact if checked during execution
  • In include, support a list of template rather than just one (Ronald)
  • Rename nlist to hash
  • Keep both as a function and deprecate nlist at some point
  • Add a section to the manual explaining the Pan vocabulary and how it matches the vocabulary in other languages

Discussion but no agreement on how to allow structure templates with rights to modify other part of the configuration than the one they are assign to or to pass parameters to normal templates…

  • Several goals: make things cleaner than using global variables (variable scope), make easier to non expert to use pan, entitlement (restrict a user to use one specific template to achieve a specific configuration action)…

Annotation Support

2 possible approaches for building the output files produced by annotation processing

  • Hierarchy based on input templates hierarchy: related to layout of input files
  • Makes input options more complex as some specific options are required just for annotation
  • Namespace : conflict between similar namespace in different template sets
  • Requires a wrapper to handle possible conflicts.

Decision : move annotation processing outside of panc compiler to a separate tool

  • Will reuse the compilation module of panc
  • Makes having distinct options easier: keep the current behaviour
  • Will put less constraint on further extensions of the tool

CCM

Problems

  • Scalability because of the backend storage format: mainly MS and Gent
  • Difficult to support new backend format like JSON
  • Supports only http URL: makes testing difficult
  • Remove of unescape() function

MS contributions in the last years

  • CDB_File format x3 faster
  • Require an additional perl module: perl-CDB-File
  • Kerberos support for encryption

Recent work

  • JSON support: quite advanced, able to run components from the JSON file
  • Still need to test/Validate the results
  • Support of new protocols for downloading profiles: file://, ftp://, ssh://
  • But the underlying modules may not support all of them

Status

  • Rewritten modules have a large number of unit tests
  • Introduces new dependencies

JSON support issues

  • Some data types are lost: impossible to distinguish long and double
  • And Perl doesn’t distinguish them internally from string…: means that you cannot know if the information was a string or a number
  • Pb doesn’t exist with language distinguishing thems
  • No way to ask JSON about the type of a data
  • In fact getTree() already loose this information

unescape() removal

  • Will check the flag saying that the profile has been generated without escaping
  • IF the flag is set, will do nothing: just return the argument unmodified
  • No modification required to components, except if they use their private version of unescape()

Requests:

  • Have distinct exit code in ccm-fetch for failures related to network errors (Gabor)
  • May allow to switch between different profile sources

AII

Known limitations

  • SL6 issues
  • LVM and reinstallation
  • Large installation disk support
  • Scalability: lack of large installation

MS additions/improvemnents

  • Profile caching: Profiles are cached for future reuse for better performances, using CDB_FIle format for improved performances
  • Fine-grained locking: per-profile instead of globally
  • Allow to run aii-shellfe in // for several nodes
  • Allow monitoring of stalled installation by looking at lock timestamp
  • Partition alignment based on some provided information
  • Not really working with SL4
  • Parallel installation support: only partially merged as it requires PRocess::Parallel which is MS-specific
  • Some alternative may be found
  • New options: -firmware and -livecd to boot alternate images in a way similar to rescue
  • Target of these options defined in node profile

*Full RHEL6/SL6 support

  • Schema made more flexible to add/remove mandatory or deprecated options
  • GPT used for partition tables, using parted
  • Anaconda doesn’t die with large disks, much faster
  • Anaconda doesn’t allow to reinstall changing some partitions and protecting others…
  • Built with Maven tools

New issues

  • Code quality is becoming a problem: many contributors, few tests, spaghetti code
  • May require a significant rewrite with unit tests
  • Kickstart file more and more difficult to read….
  • Debugging of Kickstart %pre and %post scripts can be very difficult…
  • May redirect everything to a file in /tmp and ensure all commands are logged

Pending issues/requests

  • Implement deferred reboot during installation based on a profile property: needed by MS
  • DHCP plugin by MS: would be great to have it release to the community
  • Replacement for aii-dhcp able to do more things looking at the whole profile information (like adding info required by Windows machine)
  • MS saw problems with machines where boot disk is renamed after some events (change) on the machine or OS upgrade
  • Would like to think at a mechanism to give an abstract name to disks and do late binding but difficult to implement into Kickstart

MS Stanley still considering reviving the Solaris port, including installation port.

Network Configuration

ncm-network

Among the problems

  • Network restart during reconfiguration: potentially a problem on diskless systems
  • Code very difficult to maintain

Rewrite requirements

  • Separate each part of the configuration: interface, routing…
  • Reconfiguration without network restart
  • Support all current features
  • Add all the features that RH scripts support
  • Easy path from old to new component if possible
  • Use ip command rather than ifconfigand route

Routing configuration: rely on ip route rather than ADRESS/NETMASK/GATEWAY

  • Convert route definition to a nlist: key is the subnet usual notation xx.xx.xx.xx/nn (nn being the number of bits in the netmaks)
  • Policy routing contributed by Gabor
  • Multipath routing: done by ip route nexthop

Status

  • Main missing part is the routing… Loic expects to have some time to work on it in the coming months…

IPv6 support required: no real reason to manage it in a separate component, need to add it

Network schema

  • May require some refactoring: some information are improperly placed under /system/network and we need to support more information interface specific
  • An idea is to move this kind of information into the hardware description part and have the network configuration module looking there for this part of the information, based on the interface name

Other issues

  • loopback interface is not configured by network configuration module and is ignored. This doesn’t allow to define aliases on it (Gabor)

IPv6 readiness - R. Starink / P. Bernabé

Evaluation work done at NIKHEF recently: goal was to identify what was needed to be changed

  • Focus on mixed IPv4/v6 setup
  • Backward compatibility

Test cluster

  • Isolated from the world
  • OS: CentOS 5.5 64-bit
  • A dedicated Quattor server
  • Clients: basic OS to start, grid services later

Quattor server

  • PXE only supported for IPv4: mixed setup will be required to handle installation
  • As long as IPv4 is there, no change required on the server!

ncm-network

  • Schema: basic support for IPv6 but no support for mixed setup
  • Duplicate IP, gateway…
  • Some IPv6 enabling fields
  • How to specify IPv6 addresses: configure separately prefix and local part? What to bind prefixes to (interfaces? node?)?
  • Autoconfigure support?

QWG changes

  • Allow enabling IPv6 globally or per interface
  • A new table for IPv6 addresses with a more flexible layout allowing to handle multiple addresses per node
  • IPv6 parameters: per network?

ncm-iptables

  • A few mods already made to enhance patterns for IPv6 and add protocol options

Other components: difficult to predict needed changes

  • Service specific probably?

Conclusion (next steps)

  • Focus on schema for IPv6 description: ensure enough flexibility to cover different use cases without too much unnecessary complexity
  • Ask Paco if he can contributes
  • Ask IPv6 experts (in particular in HEP community) to assess the proposed schema

Aquilon and other MS Tools

Current status

  • AII, CCM mods released thanks to Luis
  • ncm-network changes contributed
  • ncm-sysctl not yet modified: waiting for escape/unescape removal
  • quattor-remote-configure available but quattor-remote-deployer not yet released as it is being rewritten

5 core developpers, many support engineers

  • Current focus is to allow more operations without editing templates as Aquilon usage extended to less standard machines (non “grid”)
  • Plan is to add more commands into Aquilon that will deal with more advanced operations

Quattor Remote Deployer (QRD) status: equivalent of ncm-ncd to manage non Linux devices

  • Python rewrite of the old Perl code
  • QRD relies on Quattor Remote Configure (QRC) to execute components: QRC already released
  • ython rewrite requires a Python CCM API

Aquilon changes and main developments

  • More changes made by Aquilon commands and less template editing, e.g. assign specific features to an existing personnality
  • Make easier for users to make trivial changes
  • Reduce the risk of a big mistake by exposure of templates and the right to modify them
  • HW features: generic description of HW features implemented using vendor tools and their specific options
  • 1 template to enable/disable a feature including a vendor/model specific template definining the required configuration
  • Under namespace features/hardware
  • Added description of clusters (VMware clusters, high availability clusters): members (e.g. assignment of VMs to an ESX clusters), reboot constraints, file system descrption allowing attaching remote file systems…
  • Will add possibility to assign a VM to a single host (e.g. running KVM) rather than an ESX cluster
  • Planning to add support for defining service address (moving with VMs or shared by several) in Aquilon
  • Request to model Windows cluster in Aquilon
  • But real use may require a GUI…
  • Working on modelling network switches and configuring them QRD

Aquilon appliance

  • Still on SourceForge but not updated… lacking templates to be useful
  • Looking for “pop-up” installations via cloud
  • Appliance uses Debian

Mac OSX support: proof of concept of other OS support in Aquilon

  • ncm-directoryservices to configure OpenDirectory: ncm-mcx
  • ncm-ncd, ccm-fetch all works
  • CCM uses DB_File
  • A bit of fixing needed in quattor-build-tools
  • A single package created (alpha version) and put on SourceForge: “Managed Quattor Client for OS X”

Fostering Aquilon adoption

James’s experience

  • No major problem in using QWG templates, a few minor modifications needed
  • James agreeing to write a short report about these…
  • Main difficulty is the different workflow between SCDB and Aquilon
  • Aquilon upgrade: DB upgrade scripts only provided for Oracle
  • Aquilon supports whatever DB backend is supported by SQLAlchemy
  • Need to maintain backend specific upgrade scripts: PostgreSQL seems the most important to support

Appliance work required to allow early adopters to look at it

  • Update Aquilon version
  • Add QWG templates
  • Add ability to check out templates from appliance

RAL will have a summer student to work on Aquilon in July for 3 monthts

  • Try to get the new appliance ready by then.

Quattor Remote Deployer experience at MS

Goal: manage non Linux devices that offer some sort of API for remote management

  • E.g. ESX clusters, switches, file appliances…
  • Handle iniitial configuration and reconfiguration

2 parts in the system

  • QRD itself: equivalent of ncm-cdispd
  • Receives profiles and decide what to do
  • Uses a plugin for either installation (aii, configuration of the boot server) or for configuration (QRC)
  • Quattor Remote Configure (QRC) : equivalent of ncm-ncd
  • 1 framework and specialized components

Choice of distributed (redundant) QRD/QRC configuration

  • Main goal is redundancy: multiple source of the configuration easy to implement with http
  • Every QRD receives the notification and they use a lock to ensure only one is really execute the configuration actions
  • Shared data (NFS) between all instances for locks and cached profiles… not ideal
  • Modified CAF::Lock to implement NFS locks
  • If possible, achieve better scalability but not really easy because of potential locking issues that may negatively impact the performances

In QRC, implemented conditional execution of dependencies based on the fact they have a change in their configuration…

  • Check whether it makes sense to feed back into ncm-ncd

QRD

  • Different modes and connectors implemented by plugins that have their own (simple) configuration
  • Analyze of the work to be done : multiple configuration changed notifications merged in one action
  • Does everything needed to move to the last config successfully deployed to the new one
  • Keep track of the plugin status to decide if a config was succesfully applied

QRD improvements needed

  • Too much locking that may lead to a configuration deployment being postponed indefinitely
  • Locking + grouping is overcomplicated and 2 levels of locking are probably unnecessary
  • IO performance problem with NFS
  • Lack of visibility of action progress: difficult to troubleshoot why a request is not executed
  • Impossible to execute an action without a CDB notification (profile change): no dry run, no possibility to force the redeployment of a configuration without a profile change…
  • Requires direct use of QRC
  • endless retry loop: useless…
  • After a failure, a component will be re-run regularly, without any config change (default: 5 mn): difference with ncm-cdispd

Current status

  • In production at MAS for 1 1/2 year
  • Known problems are well identified
  • Rewrite in progress but not yet ready

QWG Templates

Input based on RAL work.

YAIM Support

Goal: allow to reuse RPM lists for grid services as they exist in QWG but use ncm-yaim rather than standard QWG configuration to do the actual service configuration

  • BDII as a proof of concept

Proposal

  • Define a variable USE_QWG_CONFIG to select the configuration variant
  • Create for each service 2 namespace qwg and yaim
  • Rename current service.tpl into qwg\config.tpl
  • Create a new service.tpl that will include RPMs and acts as a switch between both variants

Open questions

  • Account configuration: with YAIM or with ncm-account?
  • Try to support both and compare

RPM list management

Would be desirable to manage separately the RPM list and the RPM version to use

  • RPM lists can be generated from an OS distribution
  • Would make easier the maintenance of templates in config/os: should not require a change with new versions as long as the RPMs are the same.

Need to improve the XSL stylesheet processing the distribution comps.xml to produce the information in this new format.

  • Template defining default version will be included at the beginning of every RPM template for a non disruptive migration

SL6 issues

ncm-network changes to manage udev made by MS but not yet contributed back.

grub/spma inconsistency with new kernel names including architecture: fixed by last spma (-12).

ldapauth configuration completely changed

  • New configution supported through a new configuration subtree in the component

SPMA and SELinux: SPMA requires a new SELinux effort to work

  • Added by last version in SF repository

Monitoring

Experience with Nagios and Icinga at UGent

Main goal: dynamic reconfiguration of monitoring

  • Currently done by a Python framework parsing the XML profiles to produce the required monitoring configuration
  • At UGent, everything is described in Quattor
  • The framework produces a template that is used by the Icinga server: currently no automatic reconfiguration of the monitoring server
  • Current QWG approach relies on a static definition of groups in a specific template and is difficult to keep in sync with the actual config: at least the risk of discrepancies
  • Plan to handle this with Aquilon in the future
  • Unclear if it will really impact the overall workflow

Host groups a host belong to are defined in the Quattor configuration

  • Processed by the Python framework to generate the appropriate monitoring configuration

ncm-icinga: written from scratch

  • Schema different from ncm-nagios
  • Configuration information required (variables) is basically the same

Status

  • Doing last tests
  • Production targeted end of April
  • Want to look at Aquilon this summer

QWG Todo: document the variables related to monitoring and common to both Nagios and Icinga

  • Already some information on the Trac wiki

Configuration Modules

Configuration module status report

  • State file managed/generated by ncm-ncd and ncm-cdispd in last versions
  • Requires a ‘state’ drective defining the state directory in the ncm-ncd and the ncm-cdispd configuration file
  • Presence of the file denotes a component that should run and indicate the error if there was one in a previous run
  • TODO: add an option to ncm-ncd to display the configuration modules that are waiting to run and whether they experienced an error in a previous run

Test mode: ‘noaction’ property should do the job if the configuration module uses CAF

  • Will display file to be opened, command to be executed rather than doing it

Would be great to keep not only the latest profile but the last one.

  • ccm-purge could keep a given number of profiles or all the profiles more recent that a certain time interval
  • ncm-query could be enhanced to list differences between profiles with several levels of details: components impacts, detailed config change for part of the configuration tree…

Delayed execution: see [/wiki/Meetings/Workshops/20111011#ChangeScheduling Strasbourg’s minutes]

  • No work since

Support for other languages: desirable but not urgent

YUM Support

3 good reasons to use YUM (at least!)

  • Scale down Quattor entry level: may be easier for small sites to add their packages
  • Remove a Quattor specific component to maintain
  • Enable support of other platforms

Concentrate first on making possible to use YUM as a packager rather than SPMA in a flexible way.

Proposal

  • Make the packager an implicit dependency of all components rather than an explicit one
  • ncm-cdispd will be no longer concerned by ensuring it is run first
  • ncm-ncd will be in charge of running the packager before any action: may require to pass a new option from cdispd or to add an option in the ncd configuration file

Corner case: ability to force a component like filecopy to run before the packager to fix a problem preventing its successful execution

  • See later if there is a real use case: may add a property to components like runFirst to force their run before the packager

packager specific configuration modules (ncm-yum, ncp-apt..) should take the package information from the same part of the configuration tree (/software/packages)

  • Configuration description should remain independent of the packager

Community Life and Development Process/Tools

Quattor Build and Test Process and Tools

StratusLab experience

  • Build with maven as the driver
  • Easy to wrap non-java builds and to use other build tools like make
  • Tools should have unit tests
  • Nexus as a package repository
  • Works well with maven: packages pushed automatically after build
  • Wth maven, has very clean dependency resolution process: dependencies automatically pulled in with transitive dependency resolution
  • A new snapshot entry created by maven every time it runs and produces something new
  • Package repository at the heart of the build process
  • Continous integration with Hudson/Jenkins
  • Continuous build of the SW, triggered by modifications to the package repository
  • Can also do build on demand
  • Installation of test infrastructures
  • Allows to manage dependencies between test stages but not working well as a workflow manager (ability to take different decisions based on result of a stage)
  • SlipStream for system testing
  • Workflow-like features to manage deployment of virtual machines
  • Growing use for system deployments and testings
  • Web server for release distributions
  • YUM based repositories for published releases

Proposal for Quattor

  • Move quattor.org DNS domain to an active institue to allow more flexible updates
  • Need ability to create entries in this domain
  • Discuss with CERN immediate transfer of primary server and possible transfer of ownership: LAL primary, RAL secondary
  • Start wth same toolset as for StratusLab
  • Decide where to put the tools: StratusLab cloud? persistent disk for data?
  • Move Nexus server from LAPP to the same place as other tools?
  • Demonstrate builds for core tools
  • Some already maven-ready
  • PRobably ~10 more to convert to new build tools: cdispd/listend/ncd, aquilon, rpmt-py/spma, ncm-templates, ncm-query
  • Not necessary to convert them to Maven
  • Deployment/test of Quattor “HelloWorld”
  • Simple test configuration based on Aquilon appliance?
  • Perform “release” of core tools
  • Devise same procedure for configuration modules
  • Think about unit test environment: Luis started some work around this…

Actions

  • DNS changes: Michel
  • Jenkins, Nexus deployment: Cal
  • Integration of core tools: Cal + tool maintainers
  • Target date for initial setup: ~1 month

Extend the number of people with Maven knowledge…

  • Not a requirement for the very short term… core tool maintainers are a very limited set of people
  • Trac page reasonnably up to date…

Quattor releases: let’s review how we do it when we have the build infrastructure in place

  • With YUM, no need of metarpm
  • The main issue is not bringing dependencies but identify what are the stable versions of all components that work together
  • “HelloWorld” test configuration may help to assess what is ready for releasing

Weekly standup meetings

  • Should run without Michel: Luis and Ronald agrees to backup

Main priorities should focus usability rather than new features

  • Cookbooks to start with Quattor and its core tools
  • Reuse the panc idea
  • Foster Aquilon adoption as a tool hiding the template complexity
  • Possibility for a Quattor dashboard
  • Possibility to reuse/adapt existing tools like Foreman?
  • Quattor as a tool to manage virtualized infrastructure
  • Start by documenting existing use cases and experiences in blog entries, e.g. LAL work within StratusLab to manage images rather than machines

Quattor Public Visibility and Community

Main page move to GitHub

  • Need to change DNS… see previous actions
  • Cal makes a proposal that can be reviewed
  • Populate with a few entries all the categories of information
  • Put Oloh! widget on the web page
  • Go live!

News broadcasting

  • Send more general/regular news should appear on the main web site
  • Broadcast it to the general mailing list
  • Tweet this news: any plugin to do it automatically?

Reserve Quattor name on Google+, LinkedIn, (Fabebook?)

  • Cal will do it

IRC: try to continue to listen it despite the low activity…

Dissemination materials

  • Build an internal repository for existing dissemination materials that could be reused by others to make presentations

Miscellaneous

  • Fix link to Quattor home page in WallStreet Tech article about MS usage of Linux/Quattor (Nick)
  • Or add a redirect in Mediawiki page

Conclusions

Next workshop: U. of Gent

  • Doodle asap to fix the dates after UGent fed back their constraints