Distributed data locality

NodeWeaver was designed, from scratch, as a single optimized entity – following an idea first taught to me at university by my electrical communications professor, during a discussion on Trellis coding: when you optimize only a single step, you miss the larger, potentially far more important optimization of the whole system. It’s a common problem, and in some cases it comes down to economics: by pooling resources you may get more performance or higher utilization, because you often need a resource to be available, but you rarely need all of it at once. What you really care about is the maximum possible performance, not the average. You want your car to go fast when you need it (for example when overtaking a slower vehicle), even though on average it travels much more slowly – the average speed of cars in London is a mere 19 km/h.

Take the example of the SAN, the storage area network. You know that disks are a slow resource, and that by pooling disks together (along with lots of cache, an intelligent controller and so on) you can deliver much better performance to the VMs that need it. You also know that the inter-arrival times and the IOPS requested by your virtual machines are usually uncorrelated, so the probability of everyone asking for lots of IOPS at the same time is very small; this means that by aggregating disks in a SAN, you get the advantage of speed at a comparatively much smaller price. Up to a point.
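The statistical-multiplexing argument above can be illustrated with a small simulation. This is a hypothetical sketch (the VM counts, burst rates and probabilities are made-up numbers, not measurements): if bursts are uncorrelated, the pooled demand stays far below the sum of the individual peaks.

```python
import random

# Illustrative parameters, not real workload data: 100 VMs, each idling at
# 50 IOPS and independently bursting to 1000 IOPS 5% of the time.
random.seed(42)
VMS, PEAK, IDLE, BURST_PROB = 100, 1000, 50, 0.05

def sample_aggregate_iops():
    """Aggregate demand at one instant: each VM independently bursts or idles."""
    return sum(PEAK if random.random() < BURST_PROB else IDLE for _ in range(VMS))

samples = [sample_aggregate_iops() for _ in range(10_000)]
worst_case = VMS * PEAK          # everyone bursting at once: what you'd
                                 # have to buy without pooling
observed_max = max(samples)      # what the pooled SAN actually has to serve

print(f"sum of peaks : {worst_case} IOPS")
print(f"observed max : {observed_max} IOPS")  # far below the sum of peaks
```

This is exactly why the SAN can be sized for the observed maximum rather than for the sum of every VM's peak – up to the point where the channel itself becomes the bottleneck, as discussed next.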

There are two problems with this approach. The first is space: what happens when you use up the space allocated in the SAN? Most small and medium-sized SANs are locked in terms of capacity: if you run out of space, you buy another SAN, which is a rather costly practice. If you paid (a lot more) up front, you may be able to extend your capacity with a second SAN, and things still don’t improve much (but at least you don’t have to move data off the first one). The second problem is that you have created a second bottleneck: the link between your computing nodes and the storage, usually Fibre Channel or a similar low-latency, high-throughput connection. This shows whenever there is substantial contention for the channel or for the available IOPS – for example, during a “boot storm”, when lots of VMs (usually virtual desktops) are launched at the same time. Many techniques have been introduced to reduce this effect, both on the virtualization side (for example VMware’s View Accelerator, limited however to 2 GB) and through dedicated acceleration appliances (which, of course, you pay for separately).

Our approach is different, and takes advantage of some specific properties of how we handle storage. As mentioned in a previous post, we treat all the storage available in a NodeWeaver appliance (two rotational disks and two caching SSD units) as a single, undifferentiated pool that we informally call the “sea of data”. But each node knows not only where a request has been filled, but also which VM asked for it and its “distance”, internally called topology. A topology of zero means that the request is served by the same node where the requesting VM runs – meaning that it does not need to go over the external network to be executed; a higher topology means that the “distance” is greater and that the request will be served through a network connection.
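The topology idea can be sketched in a few lines. This is only an illustration of the concept – the function and node names are hypothetical, not NodeWeaver's actual internals:

```python
# Hedged sketch: "topology" as the distance between the node running a VM
# and the node holding the block it asked for. Names are illustrative.

def topology(vm_node: str, block_node: str, network_hops: dict) -> int:
    """Return 0 for a local hit, otherwise the hop count to the block."""
    if vm_node == block_node:
        return 0                         # served locally: no network traffic
    return network_hops[(vm_node, block_node)]

# A toy three-node cluster with made-up hop counts.
hops = {("node-a", "node-b"): 1, ("node-a", "node-c"): 2}

print(topology("node-a", "node-a", hops))  # 0: local, stays off the wire
print(topology("node-a", "node-c", hops))  # 2: remote, goes over the network
```

A placement engine that minimizes this value for the hottest blocks is, in effect, keeping a VM's working set on the node where the VM runs.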

The interesting thing is that the system is dynamic – the distributed file system rearranges blocks to minimize the overall topology for the blocks that are written most, so that they do not clog the network with unneeded traffic. This way most requests can be served directly from the local node – and, thanks to the integrated SSD cache, at exceptional speed – while a background process continuously propagates changes to meet the replication guarantees.

Another advantage comes from our redirect-on-write system for creating snapshots and derived machines (explained here): when the user asks for 20 virtual desktops, our platform creates 20 new snapshots in which only the changed blocks are written, reducing the I/O effort substantially. Since most reads hit the same blocks anyway (because most of the guest operating system’s boot process never changes), those blocks are saved in the SSD cache and in the smaller internal memory cache and are shared by all the VMs – greatly increasing the cache hit ratio and the perceived speed. The combination of the two features – dynamic block placement and redirect-on-write caching amplification – allows our NodeWeaver appliances to serve even the most demanding workloads while maintaining total consistency and reliability.
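The redirect-on-write mechanism behind derived machines can be sketched as an overlay over a shared base image. This is a simplified, hypothetical model (the `BlockDevice` class and its methods are illustrative, not our actual on-disk format): a write lands in the clone's private overlay, while unchanged blocks keep pointing at the shared parent – which is why one cached copy serves every desktop.

```python
# Hedged sketch of redirect-on-write derived machines: each desktop starts
# as an empty overlay over a shared, immutable base; a write creates a new
# block in the overlay instead of copying the whole disk.

class BlockDevice:
    def __init__(self, base=None):
        self.base = base              # shared, immutable parent snapshot
        self.blocks = {}              # only the blocks changed by this machine

    def read(self, lba: int) -> bytes:
        if lba in self.blocks:        # block was changed locally
            return self.blocks[lba]
        if self.base is not None:     # unchanged: shared with every sibling,
            return self.base.read(lba)  # so it is cached once for all VMs
        return b"\x00"                # unwritten block reads as zeros

    def write(self, lba: int, data: bytes) -> None:
        self.blocks[lba] = data       # redirect: the parent is never touched

# A toy base image and 20 derived desktops.
base = BlockDevice()
base.blocks = {0: b"bootloader", 1: b"kernel"}
desktops = [BlockDevice(base) for _ in range(20)]

desktops[0].write(1, b"patched")

print(desktops[0].read(1))   # b'patched' - private copy in the overlay
print(desktops[1].read(1))   # b'kernel'  - still shared with the base
```

Note that creating the 20 desktops allocates no data blocks at all; only the blocks a desktop actually changes consume space and I/O.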