Welcome to End Point’s blog

Ongoing observations by End Point people

Take pleasure in small things

We're in the midst of our 2009 company meeting, and are having our first-ever "hackathon". The engineering team is divided into several working groups focusing on a variety of free software projects.

As a distributed organization, we don't always get a lot of opportunity to write code and do our "real work" side-by-side like this. And it's a pleasure to witness and to participate.

Just thought I'd share.

Why not OpenAFS?

OpenAFS is not always the right answer for a filesystem. While it is a good network filesystem, there are usage patterns that don't fit well with OpenAFS, and there are some issues with OpenAFS that should be considered before adopting or using it.

First, if you don't really need a network filesystem, the overhead of OpenAFS may not be worthwhile. If you mostly write data but seldom read it across a network, the cache of OpenAFS may hinder performance rather than help. OpenAFS might not be a good place to put web server logs, for example, which are written to very frequently but seldom read.

OpenAFS is neither a parallel filesystem nor a high-performance filesystem. In high-performance computing (HPC) situations, a single system (or small set of systems) may write a large amount of data, and then a large number of systems may read from that. In general, OpenAFS does not scale well for multiple parallel reads of read-write data, but it scales very well for parallel reads of replicated read-only data. Because read-only replication is not instantaneous, depending on the latencies that can be tolerated, OpenAFS may or may not be a good choice. If you need to write and immediately read gigabytes or terabytes of data, OpenAFS may not work well for you.

It should be noted, though, that Hartmut Reuter and others have developed extensions to OpenAFS that allow for parallel access to read-write data, and their testing has shown that accesses scale linearly with the degree of parallelism. Work to integrate their extensions into core OpenAFS is ongoing.

Additionally, if your environment needs to leverage special-purpose high-speed networks and does not use IP for connectivity, then OpenAFS will not be a good choice. It communicates only over IP and does not support InfiniBand or Myrinet, for example.

OpenAFS is also more difficult than NFS or CIFS to set up and administer. For those two products, simple configurations can be set up in minutes, often just requiring editing a few files and/or clicking on a simple GUI to 'share' some files.

OpenAFS, on the other hand, requires configuration on the client, and setup of both fileservers and the other infrastructure servers (e.g., Kerberos, the user and group management server, and the location server). Thus, OpenAFS has a higher hurdle for getting started.

As mentioned, OpenAFS requires Kerberos. For an environment that already has Kerberos infrastructure, whether via Active Directory, MIT Kerberos, Heimdal, or another implementation, this might not be a large challenge. For an environment that does not leverage Kerberos, though, determining the right Kerberos infrastructure, the policies to manage it, and getting the implementation done can be a significant hurdle.

Also, as OpenAFS has its own user and group management components, the interaction of those with existing components (or lack thereof) also needs to be resolved. An organization that uses LDAP (or Active Directory), for example, might need to leverage some add-ons to more smoothly integrate with OpenAFS, or new code might need to be written to make that integration work better.

While both Kerberos and integration of user and group management are good system administration practices, for an organization that does not already have these practices, needing to adopt them in order to reasonably evaluate and use OpenAFS can be daunting.

The filesystem semantics of OpenAFS can also be a barrier to adoption. OpenAFS consults only the owner bits of the Unix file permissions, for example, so the group and other bits are effectively unused (OpenAFS preserves them, but does not consult them for access control). This can cause issues with software that relies on group permissions to manage access. Instead, OpenAFS uses access control lists (ACLs), which are similar to those used on Windows but do not implement the traditional Unix semantics.

Another semantic difference is that OpenAFS does not implement byte-range locking but only implements file-level locking. Some software (e.g., Microsoft Access) requires byte-range locking in order to work properly; thus, OpenAFS is not a good place to store Microsoft Access databases.
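To make the distinction concrete, here is a minimal Python sketch of byte-range locking on a local POSIX filesystem. This illustrates the semantics that such software expects; it is not OpenAFS-specific code, and on a filesystem without byte-range locking a call like this cannot protect just a sub-range of the file.

```python
import fcntl
import os
import tempfile

# Create a scratch file with some content.
fd, path = tempfile.mkstemp()
os.write(fd, b"0123456789")

# Byte-range lock: lock only bytes 0-4, leaving the rest of the file
# available to other processes. Software like Microsoft Access relies
# on exactly this kind of sub-file locking.
fcntl.lockf(fd, fcntl.LOCK_EX, 5, 0, os.SEEK_SET)

# ... read or modify the locked range here ...

# Release the range and clean up.
fcntl.lockf(fd, fcntl.LOCK_UN, 5, 0, os.SEEK_SET)
os.close(fd)
os.remove(path)
print("locked and released bytes 0-4")
```

On OpenAFS, only whole-file locks are available, so the lock request above cannot be honored as a sub-range lock.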

Network filesystems often have semantic differences from local filesystems, and OpenAFS is no exception. For developers, the big difference is that OpenAFS does not implement write-on-commit semantics but rather write-on-close. In other words, when a client issues a write request, that request does not necessarily make the new contents visible to other clients reading the data. Instead, OpenAFS stores the data back to the server when the file is closed (or on an fsync() call). While this is not specific to OpenAFS, it is a subtlety of networked filesystems that many developers may not be aware of, so they need to be careful about checking the return status of file close() calls, and they need to be aware of the differences so that they can properly handle any cross-system coordination based on the contents of files.
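A defensive pattern for this, sketched in Python (the filename is just an example): treat close() as the commit point and check it, since on a write-on-close filesystem that is where a quota or server error will surface.

```python
import os

def write_config(path, data):
    """Write data, treating close() as the real commit point.

    On a write-on-close filesystem, the data is only shipped to the
    file server when the file is closed, so an over-quota or server
    error surfaces here, not at write() time.
    """
    f = open(path, "wb")
    try:
        f.write(data)  # may succeed even if the server will later reject it
    finally:
        try:
            f.close()  # the actual store happens here; check it!
        except OSError as e:
            raise RuntimeError(f"store of {path} failed at close: {e}")

write_config("/tmp/demo.conf", b"hello\n")
print("stored OK")
```

The same pattern applies to any networked filesystem that defers writes, which is why checking close() is good practice in general.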

While OpenAFS is a solid network filesystem, there are scenarios in which it might be too heavyweight, might not perform as well as needed, or might behave differently from what is required. Understanding these issues is helpful in making a reasoned choice about a network filesystem.

Slow Xen virtualization of RHEL 3 i386 guest on RHEL 5 x86_64

It seems somehow appropriate that this post so closely follows Ethan's recent note about patches vs. complaints in free software. Here's the situation and the complaint (no patch, I'm sorry to say):

We're migrating an old server into a virtual machine on a new server, because our client needs to get rid of the old server very soon. Then afterwards we will migrate the services piecemeal to run natively on RHEL 5 x86_64 with current versions of each piece of the software stack, so we have time to test compatibility and make adjustments without being in a big hurry.

The old server is running RHEL 3 i386 on 2 Xeon @ 2.8 GHz CPUs (hyperthreaded), 4 GB RAM, 2 SCSI hard disks in RAID 1 on MegaRAID, running Red Hat's old 2.4.21-4.0.1.ELsmp kernel.

The new server is running RHEL 5 x86_64 on 2 Xeon quad-core L5410 @ 2.33GHz CPUs, 16 GB RAM, 6 SAS hard disks in RAID 10 on LSI MegaRAID, running Red Hat's recent 2.6.18-92.1.22.el5xen kernel.

The virtual machine is using Xen full virtualization, with 4 virtual CPUs and 4 GB RAM allocated, with a nearly identical copy of the operating system and applications from the old server. And it is bog-slow. Agonizingly slow.

Under the load of even a single repeated web request to the web server (Apache) + app server (Interchange) + database server (MySQL), it breathes heavily and takes 1-2 seconds per request (wildly varying). The old physical machine takes 0.5-0.7 seconds per request under 2 concurrent users. Under heavier load (just a boring day of regular web traffic), the new VM groans and plods along.

The most noticeable metric is that the CPUs get pegged at 50%-90% system usage, with under 40% user usage. This is nearly the opposite of the physical machine, where system usage was always in the low-teens percent and user usage was around 50% per CPU. In both cases there's almost no I/O wait.

First, I'm really surprised it's this bad. We've done Xen full virtualization of RHEL 5 x86_64 and i386 guests on RHEL 5 x86_64, with no special handling, and it's always worked quite well with little performance degradation.

So, we know there are paravirtualized drivers you can use to speed up the network and disk devices even of otherwise fully virtualized guests. However, apparently you can't use the paravirtualized drivers in a 32-bit RHEL 3 guest on a 64-bit RHEL 5 host. That's really painful, since in my mind a very common use case for virtualization is consolidating a bunch of old 32-bit machines onto a big 64-bit machine with a lot of RAM. But ... not if you want it to even match the speed of the old servers!

We increased the number of virtual CPUs to 8. That took the edge off the worst slowdowns a bit, but only barely.

We tried upgrading the RHEL 3 guest to the very latest versions of everything from Red Hat Network (Update 9 IIRC) and upgrading the RHEL 5 host to the very latest RHEL 5.3, and saw the wild variation in performance from request to request moderate a lot. Also, performance under heavier concurrency was stable: 2.4-2.6 seconds per request in that scenario.

But that's still really slow. I hope we're just missing something obvious here. I'd love to know what the really stupid mistake we're making is. So far, the search has been fruitless, and this seemingly ideal use case for Xen virtualization is barely usable.

College District launches 4 additional sites

We built a system for one of our clients, College District, that allows them to launch e-commerce sites fairly easily using a shared framework, database, and administration panel. The first of the sites, Tiger District, launched over a year ago and has been successful in selling LSU branded merchandise. A few weeks ago the following sites were launched on the system: Sooner District, Longhorn District, Gator District and Roll Tide District.

The interesting parts of the system include a single Interchange installation serving two catalogs, one for the administration area and one for all of the stores. Each site gets its own htdocs area for its images and CSS files (which are generated by the site generator using the selected colors). A nice result of this setup is that a newly added feature appears on all sites instantly. The site code uses the request domain name to determine which user to connect to the database as. The heavy lifting of the multi-site capabilities is handled by a single Postgres database, which utilizes roles, schemas, and search paths to show or hide data based on the user that connected to the database. This works really well when it comes time to make changes to an underlying table: instead of having to update the same table in 10 different databases, the change is applied to a single table and all sites are affected by it.
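As a rough sketch of that technique (all names here are hypothetical, not the actual College District schema), one shared table plus a per-site view and a per-role search_path looks something like:

```sql
-- One shared table in a common schema; per-site schemas hold views
-- that expose only that site's rows.
CREATE SCHEMA shared;
CREATE TABLE shared.products (
    sku    text PRIMARY KEY,
    site   text NOT NULL,          -- e.g. 'tiger', 'sooner'
    name   text,
    price  numeric(10,2)
);

CREATE ROLE tiger_district LOGIN;
GRANT USAGE ON SCHEMA shared TO tiger_district;
GRANT SELECT ON shared.products TO tiger_district;

CREATE SCHEMA tiger AUTHORIZATION tiger_district;
CREATE VIEW tiger.products AS
    SELECT * FROM shared.products WHERE site = 'tiger';

-- Each site's role sees its own schema first, so an unqualified
-- "SELECT * FROM products" resolves to that site's view.
ALTER ROLE tiger_district SET search_path = tiger, shared;
```

With this arrangement, a schema change to shared.products is immediately visible to every site, which is the single-table maintenance win described above.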

Additional sites will be launched in the near future, along with some great community features.

Note to self:

In free software, patches are considerably more useful than complaints.

It's easy to forget.

The Orange Code

I've been reading the new book The Orange Code, the story of ING Direct by Arkadi Kuhlmann and Bruce Philp. Here are a few passages I liked from what I read today:

The commitment to constantly learn is the only fair way to bring everyone in the company under the same umbrella. It is a leveler. (p. 213)

... [W]e've got to earn it each day, and we need to feel that we have new challenges that can make us or break us every day. ... Each day's work will last only as long as it's relevant. ... [W]e did okay in each of the last seven years, but we are only ever as good as our last year, our last day, our last transaction. We still have a lot to do, since our competition is not resting. (pp. 208-209)

Trust and faith not only are built over time, but they actually need the passage of time to validate them. (p. 197)

Contributing is a privilege earned, not a right. And there are, indeed, bad ideas, most of which are answers to questions the contributors didn't really understand in the first place. There is a reason why some of the world's finest jazz musicians were classically trained: You have to understand the rules before you can intelligently improvise on them. (p. 195)

I haven't finished reading it yet and probably will have more to say about it when I have.

Greg Sabino Mullane @ US PostgreSQL Association

Belated congratulations to End Point's Greg Sabino Mullane for his election to the United States PostgreSQL Association's board for 2009-2011. It didn't really happen this way, but I think of Greg taking over Selena's board position there. (Actually Bruce Momjian filled that role till the elections.)

Anyway, nice work, all of you, on improving and promoting a great database and its equally important community.

What is OpenAFS?

A common question about OpenAFS adoption is "What is OpenAFS?" Usually, the person asking the question is somewhat familiar with filesystems, but doesn't follow the technical details of various filesystems. This article is designed to help that reader understand why OpenAFS could be a useful solution (and understand where it is not a useful solution).

First, the basics. OpenAFS is an open source implementation of AFS: from the OpenAFS website, OpenAFS is a heterogeneous system that "offers client-server architecture for federated file sharing and replicated read-only content distribution, providing location independence, scalability, security, and transparent migration capabilities".

Let's break that down:

First, OpenAFS is extremely cross-platform. OpenAFS clients exist for small devices (e.g., the Nokia tablet) up to mainframes. Do you want Windows with that? Not a problem. On the other hand, OpenAFS servers are primarily available on Unix-based platforms. Implementations of OpenAFS servers for Windows do exist, but they are not recommended or supported (if you'd like to change that, you are welcome to submit patches or to hire developers to make that change; that's a major advantage of an open source project).

The second part of OpenAFS is rather straightforward: it is a client-server distributed file system. Much like SMB/CIFS in the Windows world, and NFS in the Unix world, OpenAFS lets file accesses take place over a network. One feature that sets OpenAFS apart from CIFS and NFS, though, is its strong file consistency semantics based on its use of client-side caching and callbacks. Client-side caching lets clients access data from their local cache without going across the network for every access.

Other distributed filesystems allow this as well, but OpenAFS is rather unusual in that it guarantees that the clients will be notified if the file changes. This caching plus the consistency guarantees make OpenAFS especially useful across wide-area networks, not just local area networks. With respect to consistency, most other distributed filesystems use timeouts and/or some kind of FIFO or LRU algorithm for determining how a client handles content in a cache. OpenAFS uses callbacks, which are a promise from the file server to the client that if the file changes, the server will contact the client to tell the client to invalidate the cached contents. That notion of callbacks gives OpenAFS a much stronger consistency guarantee than most other distributed filesystems.

Another unusual feature in OpenAFS is that it provides a mechanism for replicated access to read-only data, without requiring any special hardware or additional high-availability or replication technology. In a sense, OpenAFS can be considered an inexpensive way to get a read-only SAN. OpenAFS does this by classifying data as read-write or read-only, and providing a mechanism to create replicas of read-only data. Up to 11 replicas of data can be made, allowing read access to be very widely distributed.

The last four features mentioned in the website description are also very interesting: location independence, scalability, security, and transparent migration.

OpenAFS provides location independence by separating information about where a file resides from the actual filesystem itself. This allows separation of name service from file service, which lets OpenAFS scale better. It also provides some functionality not present in other networked filesystems in that changing the location of the data can be more easily done. Because of the layer of indirection, OpenAFS is able to make a copy of data behind the scenes, and after that data has been migrated, to then update the location information. This allows for transparent migration of data.

Because the location of data is separate from the data itself, if some of the data is found to be more heavily used, that data can be migrated to a separate server, so as to better balance out the accesses across multiple servers. This can be done without negatively impacting the users. This kind of feature is not usually found in networked filesystems but only in either higher-end proprietary Network Attached Storage (NAS) systems, or in Storage Area Networks (SANs).
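For instance, with the standard OpenAFS vos administration tool, moving a volume between servers and publishing read-only replicas look roughly like this (server, partition, and volume names here are made up for illustration):

```
# Move the read-write volume "store.images" from server fs1 (partition
# /vicepa) to server fs2 (partition /vicepb); clients keep working while
# the data is copied and the location database is updated behind the scenes.
vos move store.images fs1 /vicepa fs2 /vicepb

# Add a read-only replica site and push the current contents to it.
vos addsite fs3 /vicepa store.images
vos release store.images
```

Clients never need to be told about either change; they simply look up the volume's new location the next time they ask the location server.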

Because of OpenAFS's use of client-side caching, read-only data, and separation of location information from the filesystem itself, OpenAFS can scale up quite well. The initial design of AFS was to be at least 10 times more scalable than the implementations of NFS at that time, with a client-to-server ratio of 200:1. While client-to-server ratios are highly dependent on hardware and filesystem access patterns, 200:1 is still easily achievable, and much higher ratios have been leveraged in production environments; 600:1 is achievable in an environment where the data is predominantly read-only.

OpenAFS provides built-in security by leveraging Kerberos to provide authentication services. The servers themselves rely on Kerberos to ensure that a rogue host cannot successfully masquerade as an OpenAFS server, even if DNS is compromised. OpenAFS itself is agnostic with respect to what kind of Kerberos server is used, as long as it supports the Kerberos 5 protocol standards: a Windows Kerberos Domain Controller can provide the Kerberos services for an OpenAFS installation, as can an MIT KDC or a Heimdal one.

Additionally, traffic between the clients and servers can be encrypted by OpenAFS itself (i.e., not just with SSH or VPN encryption). This can provide an extra layer of security.

Overall, OpenAFS provides some of the features of traditional network filesystems like CIFS and NFS, but with better scalability, consistency and security. Additionally, because of its ability to replicate and transparently migrate data, OpenAFS can be leveraged much like a SAN, but without the proprietary tie-ins to hardware.