We presented our cloud design at the OpenStack Paris Summit (http://bit.ly/1DbJPUO) and started operating the cloud after the conference. In this talk, we share important lessons and processes learned during one year of OpenStack operation. The talk will help people who are about to start operating OpenStack or who are considering operating it with a small team.
1. Team Building
A DevOps team is essential for keeping up with active OpenStack development. We created our DevOps team from scratch. We share the team-building process and the DevOps skills of each member.
2. Monitoring System
Monitoring is important for keeping the system stable. We share what we currently monitor (about 60,000 items) and highlight the items that matter most for preventing service disruption. We also share some custom scripts for OpenStack health checks (e.g. RabbitMQ, MySQL and OpenStack services).
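A health check of this kind can be as simple as probing each service's TCP port. The following is a minimal sketch, not the scripts presented in the talk: the host name `controller` and the `ENDPOINTS` table are hypothetical, and the ports are the common defaults for each service, which a real deployment may override.

```python
import socket

# Hypothetical endpoints; ports are the usual defaults and may differ per deployment.
ENDPOINTS = {
    "rabbitmq": ("controller", 5672),
    "mysql": ("controller", 3306),
    "keystone": ("controller", 5000),
    "nova-api": ("controller", 8774),
}


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False


def check_all(endpoints, timeout=3.0):
    """Map each service name to True (reachable) or False (unreachable)."""
    return {
        name: port_is_open(host, port, timeout)
        for name, (host, port) in endpoints.items()
    }
```

Calling `check_all(ENDPOINTS)` returns a dictionary that a monitoring agent could report as one item per service. An open port only shows that the process is listening; a deeper check would also issue a real request to the service.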
3. Log Analytics
Logs (e.g. OpenStack debug logs, syslog, auth logs and operation logs) contain very important information, and analyzing them reveals potential problems and risks. We collect more than 40 GB of logs a day, which makes it difficult to find the important information among them. We demonstrate our Elasticsearch-based log analytics/visualization tool, which we use to sort out useful information.
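To illustrate the idea of sorting out useful information, the sketch below counts error entries per module in plain Python. It is a simplified stand-in and does not use Elasticsearch. The line layout (`date time pid LEVEL module message`) is an assumption modeled on the common OpenStack service log format, and the function names are hypothetical.

```python
import re
from collections import Counter

# Assumed line layout: "<date> <time> <pid> <LEVEL> <module> <message>".
LOG_LINE = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2}) (?P<time>[\d:.]+) (?P<pid>\d+) "
    r"(?P<level>[A-Z]+) (?P<module>\S+) (?P<message>.*)$"
)


def parse_line(line):
    """Return the fields of one log line as a dict, or None if it does not match."""
    match = LOG_LINE.match(line.rstrip("\n"))
    return match.groupdict() if match else None


def count_by_level_and_module(lines, levels=("ERROR", "CRITICAL")):
    """Count matching log entries per (level, module) for the given levels."""
    counts = Counter()
    for line in lines:
        entry = parse_line(line)
        if entry and entry["level"] in levels:
            counts[(entry["level"], entry["module"])] += 1
    return counts
```

Because `count_by_level_and_module` accepts any iterable of lines, it can be fed an open file object directly, which avoids loading a large log file into memory.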
4. Continuous Integration
Once a cloud service is in production, it is difficult to stop it, even though many updates are necessary. We have updated our environment more than 100 times without downtime. We demonstrate a Neutron agent update, which is one of the most difficult parts of current OpenStack. We also share the CI/CD tools and our own tools used for system validation after updates.
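Validation after an update usually has to wait for a restarted service to come back before declaring success. The following is a minimal sketch of such a wait loop, assuming a caller-supplied `check` function (hypothetical, for example a port probe or an API request). It is an illustration only, not the validation tooling shown in the talk.

```python
import time


def wait_until_healthy(check, timeout=120.0, interval=5.0):
    """Poll check() until it returns True or timeout seconds elapse.

    Returns True if the check passed in time, False otherwise.
    """
    deadline = time.monotonic() + timeout
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

The check always runs at least once, even with a zero timeout, and `time.monotonic()` keeps the deadline unaffected by changes to the system clock during the update.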
5. Daily Operation
We share our daily work:
- Tools that help you monitor the system efficiently
- Tools that help you check security alerts
- Issue tracking and management
- Tools and procedures used for emergency operation (remote operation tools)
Thanks to the community, deploying OpenStack has become easier with many tools (e.g. Juju, RDO and Fuel); however, there is still little information about keeping OpenStack running and updating it without downtime. We are going to share our experiences and the tools we developed while operating our private cloud. We also share future challenges in making OpenStack operation more efficient. Today, some operations are still manual, but our goal is to help OpenStack operators sleep better by automating most of them.