Scaling OpenStack is hard. Sometimes the challenges come from gaps or issues in the software itself. Often, however, they stem from the mere fact that growth introduces complexity. As the number and variety of "nodes" grow, operators face challenges around how those nodes are deployed, tracked, and monitored, how they interact with each other, and how they are dealt with when they become problematic. There is no one-size-fits-all solution, or one "right" way to do it, but many of us are working to overcome the same obstacles. This talk will focus on Rackspace's approach to fleet management as we have continued to scale.
It's as much (if not more) about philosophy as it is about technology: how we choose to view various "nodes", the level of expendability we assign to each, and the amount of care (or not) we'd like to focus on them in the future. When thinking about a fleet 10x the size it is today, we look at how we manage the inventory, provision the nodes, and deal with any downstream issues that might arise. Most of our tooling has been built or assembled with those things in mind. In this talk we'll dive into:
- Our inventory management approach - it's all about data aggregation
- Host provisioning - look Ma, no hands!
- Remediation services - auto-correction of monitoring alerts and anomaly detection/correction
- Where we go from here - how we think about the future, what we still have to build, and yes, Virginia - we'd like to upstream the whole suite