The evil a space can do
Recently, I was contacted by a Kcite (n.d.a) user who had found an interesting problem. They had cut and pasted a DOI from an American Society for Microbiology article [webcite], and then used it in a blog post, but it was not working. The user had in fact identified the problem themselves: a strange character in the DOI.
So, I decided to investigate a bit further. Looking at the source for the page, the DOI appears mostly fine; it is not formatted according to the CrossRef display guidelines (n.d.b), but they are hardly alone in this.
<span class="slug-doi">10.1128/AAC.01664-10
</span>
However, looking a bit further, at the raw bytes of this source, we see this:
00006260: 2020 2020 2020 2020 203c 7370 616e 2063 <span c
00006270: 6c61 7373 3d22 736c 7567 2d64 6f69 223e lass="slug-doi">
00006280: 3130 2e31 3132 382f e280 8b41 4143 2e30 10.1128/...AAC.0
00006290: 3136 3634 2d31 300a 2020 2020 2020 2020 1664-10.
The byte sequence “e2808b” is the UTF-8 encoding of U+200B, “zero width space”. The first time I saw this, my initial inclination was to suggest that the publishers were being a pain and trying to prevent automatic harvesting of DOIs.
Actually, I suspect that this is not the case, as the DOI is in the page metadata:
<meta content="10.1128/AAC.01664-10" name="citation_doi" />
It is also present in multiple other locations, in their social bookmarking widgets, and there it is unmolested by spaces. So, why have they done this? The answer, I think, is that they display their DOI in a widget which is “cleverly” written to appear static on the screen (well, sort of, but this is a different story). Their widget is not wide enough; the space is non-joining, so it allows them to control where the line break will happen. Nonetheless, this piece of insanity prevents cutting and pasting of the DOI and, worse, does so in a way which is very hard to detect, for humans at least. This kind of error even gets into institutional repositories, which significantly hinders their usefulness (n.d.c). A quick check suggests this is ubiquitous across the American Society for Microbiology website.
The CrossRef display guidelines are a little bit ambiguous here. Technically, as the zero-width space cannot be seen, it could be considered within the guidelines. I shall write to them to find out.
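Although the zero-width space cannot be seen, it is easy to find in code. The following is a minimal Python sketch of one way a pasted DOI could be checked and cleaned. The function names and the set of characters treated as invisible are my own assumptions for illustration, and the list is not exhaustive; the input bytes are those shown in the hex dump above.

```python
import unicodedata

# Invisible characters that commonly sneak into copied text.
# Illustrative assumption only: this set is not exhaustive.
INVISIBLE = {
    "\u200b",  # ZERO WIDTH SPACE
    "\u200c",  # ZERO WIDTH NON-JOINER
    "\u200d",  # ZERO WIDTH JOINER
    "\u2060",  # WORD JOINER
    "\ufeff",  # ZERO WIDTH NO-BREAK SPACE (byte order mark)
}


def find_invisible(text):
    """Return (index, code point, Unicode name) for each invisible character."""
    return [
        (i, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
        for i, ch in enumerate(text)
        if ch in INVISIBLE
    ]


def clean_doi(text):
    """Drop invisible characters and surrounding whitespace from a pasted DOI."""
    return "".join(ch for ch in text if ch not in INVISIBLE).strip()


# The bytes from the hex dump: "10.1128/", then e2 80 8b, then "AAC.01664-10\n".
pasted = b"10.1128/\xe2\x80\x8bAAC.01664-10\n".decode("utf-8")

print(find_invisible(pasted))  # [(8, 'U+200B', 'ZERO WIDTH SPACE')]
print(clean_doi(pasted))       # 10.1128/AAC.01664-10
```

A check like this could sit wherever DOIs are accepted from users, which is where a human reviewer is least likely to spot the problem.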
In case this article sounds overly pious, I have to raise my hand here in shame, as I have used the same technique for different purposes. An article that I published yesterday on inline citations for Kcite (n.d.d) uses zero-width joiners to break up a shortcode, so that it is displayed rather than interpreted. If the example is cut and pasted from the article into a new WordPress post, it will not work because of this. I will fix this soon, using Unicode entities for the brackets instead.
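As a rough sketch of the entity approach, the brackets in an example can be replaced with numeric character references. The raw post text then contains no literal bracket characters, while the rendered page shows ordinary brackets with nothing hidden between them. The function name and the shortcode tag in the example are hypothetical, and I have not checked how any particular shortcode parser treats entities.

```python
import html


def escape_shortcode(example):
    """Replace square brackets with numeric HTML character references.

    &#91; and &#93; render as [ and ], so the example still looks like a
    shortcode on the page, but the stored text has no literal brackets.
    """
    return example.replace("[", "&#91;").replace("]", "&#93;")


# Hypothetical shortcode example; the tag name is an assumption.
example = "[cite]10.1128/AAC.01664-10[/cite]"
escaped = escape_shortcode(example)

print(escaped)  # &#91;cite&#93;10.1128/AAC.01664-10&#91;/cite&#93;

# Decoding the entities, as a browser does when rendering, restores the
# original text, so a reader who copies it gets a working example.
print(html.unescape(escaped) == example)  # True
```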
Update
Thanks to some swift action by Geoff Bilder, the CrossRef display guidelines have now been updated. While it will take a while, the knock-on effects of this change will be significant.
———. n.d.c. https://erambler.co.uk/blog/doi2oa-status-update.
———. n.d.d. https://process.knowledgeblog.org/309.