Monday, December 12, 2011

The Fabric Controller

Operating systems have at their core a kernel. This kernel is responsible for being the traffic cop in the system. It manages the sharing of resources, schedules the use of precious assets (CPU time), allocates work streams as appropriate, and keeps an eye on security. The fabric has a kernel called the Fabric Controller (FC). Understanding these relationships will help you get the most out of the platform.

The FC handles all of the jobs a normal operating system’s kernel would handle. It manages the running servers, deploys code, and makes sure that everyone is happy and has a seat at the table.

The FC is an Azure application in and of itself, running multiple copies of itself for redundancy’s sake. It’s largely written in managed code. The FC contains the complete state of the fabric internally, which is replicated in real time to all the nodes that are part of the FC. If one of the primary nodes goes offline, the latest state information is available to the remaining nodes, which then elect a new primary node.

The FC manages a state machine for each service deployed, setting a goal state that’s based on what the service model for the service requires. Everything the FC does is in an effort to reach this state and then to maintain that state when it’s reached. We’ll go into the details of what the service model is in the next few pages, but for now, just think of it as a model that defines the needs and expectations that your service has.

The FC is obviously very busy. Let’s look at how it manages to seamlessly perform all these tasks.


How the FC works: the driver model
The FC follows a driver model, just like a conventional OS. Windows has no idea how to specifically work with your video card. What it does know is how to speak to a video driver, which in turn knows how to work with a specific video card. The FC works with a series of drivers for each type of asset in the fabric. These assets include the machines, as well as the routers, switches, and load balancers. Although the variability of the environment is low today, over time new types of each asset are likely to be introduced. The goal is to reduce unnecessary diversity, but you’ll have business needs that require breadth in the platform. Perhaps you’ll get a software load balancer for free, but you’ll have to pay a little bit more per month to use a hardware load balancer.

A customer might choose a certain option, such as a hardware load balancer, to meet a specific need. The FC would have a different driver for each piece of infrastructure it controls, allowing it to control and communicate with that infrastructure.

The FC uses these drivers to send commands to each device that help these devices reach the desired running state. The commands might create a new VLAN to a switch or allocate a pool of virtual IP addresses. These commands help the FC move the state of the service towards the goal state. While the FC is moving all your services toward the running state, it’s also allocating resources and managing the health of the nodes in the fabric and of your services.


Resource allocation
One of the key jobs of the FC is to allocate resources to services. It analyzes the service model of the service, including the fault and update domains, and the availability of resources in the fabric. Using a greedy resource allocation algorithm, it finds which nodes can support the needs of each instance in the model. When it has reserved the capacity, the FC updates its data structures in one transaction. After the update, the goal state of each node is changed, and the FC starts moving each node towards its goal state by deploying the proper images and bits, starting up services, and issuing other commands through the driver model to all the resources needed for the change.


Instance management
The FC is also responsible for managing the health of all of the nodes in the fabric, as well as the health of the services that are running. If it detects a fault in a service, it tries to remediate that fault, perhaps by restarting the node or taking it offline and replacing it with a different node in the fabric. When a new container is added to the data center, the FC performs a series of burn-in tests to ensure that the hardware delivered is working correctly. Part of this process results in the new resource being added into the inventory for the data center, making it available to be allocated by the FC. If hardware is determined to be faulty, either during installation or during a fault, the hardware is flagged in the inventory as being unusable and is left alone until later. When a container has enough failures, the remaining workloads are moved to different containers and then the whole container is taken offline for repair. After the problems have been fixed, the whole container is retested and returned into service.

Source of Information : Manning Azure in Action 2010
The Fabric ControllerSocialTwist Tell-a-Friend
Digg Google Bookmarks reddit Mixx StumbleUpon Technorati Yahoo! Buzz DesignFloat Delicious BlinkList Furl

0 comments: on "The Fabric Controller"

Post a Comment