Greyhole and Scientific Data Handling
I was delighted recently to discover Greyhole. Essentially, it’s a system that allows you to configure a Samba share at one end, and a bunch of disks at the other. The data is spread across the disks, with a configurable level of duplication. It’s aimed mainly at the home user, who wants a higher degree of data security than the single-drive approach provides, but is not going to go down the expensive and poorly scalable RAID route.
The implementation is fairly straightforward and elegant. The Samba share is provided by a customised Samba virtual file system. This augments the standard operations by logging each one to a spool region (one file per file operation). A daemon consumes these files, stuffing them into a database, and then works through the database entries. Essentially, if anything has changed, Greyhole rsyncs the change to one or more of the backend disks.
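To give a flavour of the pattern, here is a minimal Python sketch of that spool-to-database-to-rsync loop. The paths, spool file format and table layout are my own inventions for illustration, not Greyhole’s actual schema.

```python
# Sketch of the spool -> database -> rsync pattern described above.
# All paths and the record format are assumptions for illustration.
import os
import sqlite3
import subprocess
import time

SPOOL = "/var/spool/greyhole-sketch"             # hypothetical spool directory
SHARE_ROOT = "/srv/samba/share"                  # the share the VFS fronts
POOL_DRIVES = ["/mnt/hdd1/gh", "/mnt/hdd2/gh"]   # backend disks
COPIES = 2                                       # configurable duplication level

db = sqlite3.connect("/var/lib/greyhole-sketch.db")
db.execute("CREATE TABLE IF NOT EXISTS tasks "
           "(id INTEGER PRIMARY KEY, op TEXT, path TEXT)")

def ingest_spool():
    # One spool file per file operation, written by the VFS layer.
    for name in sorted(os.listdir(SPOOL)):
        full = os.path.join(SPOOL, name)
        with open(full) as f:
            op, path = f.read().splitlines()[:2]
        db.execute("INSERT INTO tasks (op, path) VALUES (?, ?)", (op, path))
        db.commit()
        os.unlink(full)

def process_tasks():
    # If anything has changed, rsync the change out to the backend disks.
    for task_id, op, path in db.execute("SELECT id, op, path FROM tasks").fetchall():
        src = os.path.join(SHARE_ROOT, path)
        if op == "write" and os.path.exists(src):
            for drive in POOL_DRIVES[:COPIES]:
                dest = os.path.join(drive, path)
                os.makedirs(os.path.dirname(dest), exist_ok=True)
                subprocess.run(["rsync", "-a", src, dest], check=True)
        db.execute("DELETE FROM tasks WHERE id = ?", (task_id,))
        db.commit()

if __name__ == "__main__":
    while True:
        ingest_spool()
        process_tasks()
        time.sleep(1)
```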
It’s a really nice system. I must admit that PHP wouldn’t have been my first choice, but that is horses for courses. Likewise, the dependency on Samba is unfortunate: I always found it a pig to configure, besides which I’d like to use this internally on a Linux box. I had a discussion with the author, Guillaume Boudreau, who confirmed my initial feeling that the Samba VFS could be easily replaced with another, such as FUSE. I’d like to have a go at doing this work, and it’s very possible; basically, it requires a big merge between Guillaume’s VFS and the FUSE-based loggedfs. If I had ever written any C, I could probably do it in a day or so, but as it stands, it is likely to take longer.
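As a very rough starting point, here is what a loggedfs-style FUSE passthrough might look like in Python using the fusepy library, logging a couple of operations to a Greyhole-style spool. The spool location, record format and the small set of operations covered are all assumptions; a real replacement for the Samba VFS would need to handle the full set of calls, which is where the work lies.

```python
# Hypothetical sketch: a loggedfs-like FUSE passthrough (fusepy) that records
# write and unlink operations to a spool directory, one file per operation.
import os
import sys
import time
from fuse import FUSE, Operations

SPOOL = "/var/spool/greyhole-fuse"   # hypothetical spool location

class LoggingPassthrough(Operations):
    def __init__(self, root):
        self.root = root
        os.makedirs(SPOOL, exist_ok=True)

    def _full(self, path):
        return os.path.join(self.root, path.lstrip("/"))

    def _log(self, op, path):
        # One spool file per operation, mirroring Greyhole's convention.
        name = "%.6f-%s" % (time.time(), op)
        with open(os.path.join(SPOOL, name), "w") as f:
            f.write("%s\n%s\n" % (op, path))

    # A handful of representative operations; a real merge would cover them all.
    def getattr(self, path, fh=None):
        st = os.lstat(self._full(path))
        return {key: getattr(st, key) for key in (
            "st_mode", "st_size", "st_uid", "st_gid",
            "st_atime", "st_mtime", "st_ctime", "st_nlink")}

    def readdir(self, path, fh):
        return [".", ".."] + os.listdir(self._full(path))

    def open(self, path, flags):
        return os.open(self._full(path), flags)

    def create(self, path, mode, fi=None):
        return os.open(self._full(path), os.O_WRONLY | os.O_CREAT, mode)

    def read(self, path, size, offset, fh):
        os.lseek(fh, offset, os.SEEK_SET)
        return os.read(fh, size)

    def write(self, path, data, offset, fh):
        os.lseek(fh, offset, os.SEEK_SET)
        written = os.write(fh, data)
        self._log("write", path)
        return written

    def unlink(self, path):
        os.unlink(self._full(path))
        self._log("unlink", path)

    def release(self, path, fh):
        return os.close(fh)

if __name__ == "__main__":
    # Usage: python loggingfs.py <backing directory> <mount point>
    FUSE(LoggingPassthrough(sys.argv[1]), sys.argv[2], foreground=True)
```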
As well as home usage, though, this could also be good for the researcher. While a small lab could pay for managed storage, this tends to come in at £1000 per TB, per annum. Most labs don’t need 24/7 recovery, though, and the data is often write once, read occasionally. Greyhole would work out, for 1TB, at 200 quid for a low-wattage PC server, 100 quid for two 1TB discs, and perhaps 40 quid a year to power the lot (say, 15W for the computer, 10W for the hard drives, and a bit more for networking, adaptors, USB hubs and so on). For lab usage, the drives would probably last 2-3 years at least, while an all-solid-state computer might last twice this long. More storage space could be added as needed, dropping the cost per TB substantially, although I don’t know how far Greyhole scales.
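For the sceptical, the back-of-the-envelope sum goes something like this; the electricity price and hardware lifetime are assumptions of mine.

```python
# Rough cost per usable TB per year for the Greyhole box described above.
# Electricity price and lifetimes are assumptions, not quotes.
server = 200.0           # GBP, low-wattage PC server
disks = 100.0            # GBP, two 1TB drives
lifetime_years = 3.0     # assumed hardware lifetime
watts = 30.0             # computer + drives + networking, roughly
price_per_kwh = 0.15     # GBP, assumed electricity price

power_per_year = watts / 1000 * 24 * 365 * price_per_kwh   # ~39 GBP
cost_per_year = (server + disks) / lifetime_years + power_per_year
usable_tb = 1.0          # two 1TB drives, fully duplicated

print(round(cost_per_year / usable_tb))   # ~140 GBP/TB/year, versus ~1000 for managed storage
```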
The general approach could be used more widely, though. As well as JBOD spanning, what about:
Blackhole | The lab runs a local disc for their own data access needs, which is backed up to an institutional data store somewhere off-site. The daemon could be configured to use late-night bandwidth, which would only compromise data security slightly. |
Whitehole | More in line with my style of science, the local disc would be backed up to a publicly accessible repository. Obviously this would require suitable metadata to describe the status of the data, but everything would be shareable and accessible as it was produced. |
Wormhole | Many labs collaborate with one or two others. A wormhole file system would be configured so that data placed on my file share would magically appear, read-only, in one or more places on the internet, using an rsync/ssh pipe (see the sketch after this list). My collaborators’ data would, likewise, appear on my disc. |
Plughole | This would replicate the normal scientific “supplementary data” process for releasing data publicly. Essentially, everything on the file system would, after a significant period, be converted into an Excel spreadsheet with no column titles or any additional metadata. This would then be placed in a web-accessible location for between 2 and 6 months, before being randomly deleted. |
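To make the wormhole variant concrete, a minimal sketch might simply loop over collaborators and push the local share with rsync over ssh; the host names, paths and hourly schedule are invented for the example, and in practice you would want it driven by the same kind of spool/daemon machinery as Greyhole itself.

```python
# Sketch of the "wormhole" idea: mirror the local share, read-only at the far
# end, to each collaborating lab over rsync/ssh. Hosts and paths are made up.
import subprocess
import time

LOCAL_SHARE = "/srv/samba/share/"
COLLABORATORS = [
    "alice@lab-two.example.org:/data/incoming/from-us/",
    "bob@lab-three.example.org:/data/incoming/from-us/",
]

while True:
    for remote in COLLABORATORS:
        # -a preserves metadata, --delete keeps the mirror exact,
        # -e ssh tunnels the transfer. Run late at night if bandwidth
        # matters, as with the blackhole variant.
        subprocess.run(["rsync", "-a", "--delete", "-e", "ssh",
                        LOCAL_SHARE, remote], check=False)
    time.sleep(3600)
```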
I’m buying a low-power-consumption PC to try out Greyhole in its current form, to see how it goes.