Thursday, October 29 • 1:50pm - 2:30pm
'fsck' Your Cloud - Detect Resource Leaks and Keep Openstack Consistent

In very large-scale, heavily used OpenStack deployments like the ones we have at PayPal, resource leaks happen from time to time, and the contents of the OpenStack databases become inconsistent across components because of OpenStack's distributed nature.

These resource leaks and inconsistent data cause capacity shortages and operation failures in our cloud infrastructure.

Please note that it is important to find and fix the underlying issues in the code. In a production environment, however, third-party services and components can also push OpenStack into inconsistent states: for example, hardware failures in hypervisors or switches, backend storage issues, load balancer issues, out-of-sync database clusters, and RabbitMQ issues. As an enterprise cloud provider, we also need to resolve customer issues as quickly as possible, sometimes through a quick hack.

We would like to share our experiences and lessons learned on how to detect resource leaks and keep OpenStack consistent.

Just like fsck for a filesystem, we deployed a set of cleanup tools to check and repair the OpenStack cloud.

The tool set cleans up leaked resources and fixes inconsistent data not only in OpenStack itself, but also in other services used by OpenStack (DNS server, NSX controller, etc.).

Here is the list of items being cleaned up:

1. Zombie VMs: instances marked as deleted in the Nova DB but still running on hypervisors.
2. Zombie disk files on the hypervisor: the huge disk files left behind on the hypervisor for deleted VMs.
3. Inconsistent Cinder volume states across five different places: the Nova DB in the API cell, the Nova DB in the compute cells, the Cinder DB, the libvirt.xml of the instance on the hypervisor, and the iSCSI sessions on the hypervisor.
5. Unused DNS entries for deleted VMs, and duplicate DNS entries for the same IP.
6. Orphan ports in the Neutron DB that are no longer used by VMs or
7. Resource leaks in the NSX controller, for example virtual ports, virtual switches, virtual routers and security groups.
8. Nova and Cinder quotas that are out of sync.
9. Inconsistencies caused by stale RPC messages.
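
Several of the checks listed above come down to comparing two views of the same resources and reporting the difference. The sketch below illustrates that idea for zombie VMs and duplicate DNS entries. It is not the speakers' tool: the function names and the input shapes (plain sets of instance IDs, and (hostname, IP) pairs) are assumptions made for illustration, and a real check would have to collect this data from the Nova DB, the hypervisors, and the DNS server.

```python
from collections import defaultdict


def find_zombie_vms(deleted_in_db, running_on_hypervisors):
    """Return instance IDs marked deleted in the DB but still running.

    Both arguments are hypothetical inputs: iterables of instance IDs
    gathered elsewhere (for example from the DB and from each hypervisor).
    """
    return set(deleted_in_db) & set(running_on_hypervisors)


def find_duplicate_dns(records):
    """Return {ip: [hostnames]} for every IP with more than one hostname.

    `records` is a hypothetical iterable of (hostname, ip) pairs.
    """
    names_by_ip = defaultdict(set)
    for hostname, ip in records:
        names_by_ip[ip].add(hostname)
    return {
        ip: sorted(names)
        for ip, names in names_by_ip.items()
        if len(names) > 1
    }


if __name__ == "__main__":
    # Made-up sample data, for illustration only.
    deleted = {"vm-1", "vm-2"}
    running = {"vm-2", "vm-3"}
    print(find_zombie_vms(deleted, running))

    dns = [("web-a", "10.0.0.1"), ("web-b", "10.0.0.1"), ("db-a", "10.0.0.2")]
    print(find_duplicate_dns(dns))
```

A report-only pass like this keeps detection separate from repair, so the findings can be reviewed before anything is deleted.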

Speakers
Zhenhua Feng

Staff Software Engineer
Zhenhua is a staff software engineer with PayPal's cloud engineering team. He works on OpenStack and SDN to bring availability, scalability and security to one of the largest online payment systems in the world. Before that, he was with Cisco's Enterprise Networking Group building...
Wei Tian

Cloud Performance Lead at PayPal
15 years of enterprise software development experience, including architecture design and technical leadership. Over 10 years of experience in virtualization and cloud infrastructure implementation and deployment. Working in PayPal's cloud engineering team since 2013. Leading...


Thursday October 29, 2015 1:50pm - 2:30pm JST
Aoba
