Posted on September 7, 2012 in greycite, How To

What is Greycite?

Greycite is a tool that extracts bibliographic metadata from a web page. It stores this metadata and makes it available either for computational use by tools such as Kcite (http://knowledgeblog.org/kcite-plugin) or for viewing directly on the web; for example, consider the Greycite page for the previous link.


While the DOI system has a number of significant issues (http://www.russet.org.uk/blog/1849), it does have one advantage: new DOIs are minted through a central authority, the DOI registration agency, or rather through one of several such agencies (http://www.russet.org.uk/blog/2044). At the time of minting, some registration agencies, such as CrossRef or DataCite, require the registration of metadata about the DOI, which can in turn be retrieved via content negotiation (http://www.crossref.org/CrossTech/2011/04/content_negotiation_for_crossr.html) (http://www.russet.org.uk/blog/2006).
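As a rough illustration of content negotiation, the following Python sketch builds (but does not send) an HTTP request that asks a DOI resolver for BibTeX rather than the article's landing page. The resolver address `https://doi.org/`, the `application/x-bibtex` media type, the function names and the placeholder DOI are all assumptions made for this example; whether a given media type is honoured depends on the registration agency.

```python
import urllib.request

DOI_RESOLVER = "https://doi.org/"  # assumption: the public DOI resolver


def doi_metadata_request(doi, media_type="application/x-bibtex"):
    """Build a request asking the resolver for metadata, not the landing page."""
    return urllib.request.Request(
        DOI_RESOLVER + doi,
        headers={"Accept": media_type},  # the Accept header drives the negotiation
    )


def fetch_doi_metadata(doi, media_type="application/x-bibtex"):
    """Send the request; needs network access and agency support for media_type."""
    with urllib.request.urlopen(doi_metadata_request(doi, media_type)) as response:
        return response.read().decode("utf-8")


# Placeholder DOI, used only to show the shape of the request.
request = doi_metadata_request("10.5555/12345678")
print(request.full_url)
print(request.get_header("Accept"))
```

Calling `fetch_doi_metadata` with a real DOI would perform the lookup; it is left uncalled here so that the sketch runs without a network connection.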

While this DOI capability is useful, it is also problematic. It puts the registration agency in a privileged position, having primary access to the metadata. It also means that the process of producing a new referenceable artefact, such as an article or a piece of data, requires interacting with the registration agency. We have done this previously with Knowledgeblog, generating DOIs for a number of our articles; the process was manual, however, and unwieldy. It is also not suitable as a general method for publishing grey literature, which needs to be lightweight.

This, then, provides the motivation for Greycite (http://www.russet.org.uk/blog/2071). We wanted a system which, like the DOI registration agency, was capable of returning metadata about a given identifier. However, we wanted it to work using only URIs, so that it was entirely compatible with the web; we also wished to avoid a requirement for publishers to interact directly with Greycite to provide metadata.

How Greycite Works

Many webpages have embedded metadata; generally, this is placed into the underlying HTML to enable computational use of the page. Where a webpage is generated by a content management system, this will often happen automatically, without the explicit involvement of the author of the content. Greycite mines this metadata from the webpage; it is capable of recognising metadata in a number of different formats. It then stores this metadata and provides access to it in a number of ways.
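To make the idea of mining embedded metadata concrete, here is a minimal Python sketch that collects `<meta>` tags from a page and groups them by apparent format. It is not Greycite's implementation: the class and function names, the grouping rules (an `og:` prefix for Open Graph, a `citation_` prefix for Google Scholar) and the sample page are assumptions for illustration, and COinS, which lives in `<span>` elements, is not handled.

```python
from html.parser import HTMLParser


class MetaTagCollector(HTMLParser):
    """Collect <meta> key/content pairs, grouped by apparent metadata format."""

    def __init__(self):
        super().__init__()
        self.found = {"open_graph": {}, "google_scholar": {}, "other": {}}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        # Open Graph uses property="..."; most other schemes use name="...".
        key = attributes.get("property") or attributes.get("name")
        content = attributes.get("content")
        if not key or content is None:
            return
        if key.startswith("og:"):
            group = "open_graph"
        elif key.startswith("citation_"):
            group = "google_scholar"
        else:
            group = "other"
        # Keep every value, since tags such as citation_author may repeat.
        self.found[group].setdefault(key, []).append(content)


def extract_metadata(html):
    """Parse an HTML string and return the grouped metadata."""
    collector = MetaTagCollector()
    collector.feed(html)
    return collector.found


# Hypothetical page used only to exercise the sketch.
SAMPLE_PAGE = """
<html><head>
<title>Example post</title>
<meta property="og:title" content="Example post">
<meta name="citation_author" content="A. Author">
<meta name="citation_author" content="B. Author">
<meta name="citation_publication_date" content="2012/09/07">
</head><body></body></html>
"""

metadata = extract_metadata(SAMPLE_PAGE)
print(metadata["open_graph"])
print(metadata["google_scholar"])
```

Keeping the values grouped by format, rather than merging them immediately, preserves a simple record of where each value came from.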

It is straightforward to use Greycite. The main interface, shown in Figure 1, is very simple and allows you to enter a single URL.

Figure 1. The Greycite interface

Metadata about this URL will then be displayed. In this case, you can see the URL (http://www.russet.org.uk/blog/2071), when it was scanned, and how many times it has been requested. It is possible to access the metadata in a structured format, either as BibTeX or JSON, through the buttons on the right. In addition, a complete history is shown, which allows you to determine whether the title or authorship has changed over time. For each scan, basic provenance is available, showing from where Greycite has captured the metadata. For this URL, we have recovered metadata from COinS, Google Scholar, and Open Graph metadata. Where available, Greycite will also link through to services such as http://www.webarchive.org.uk/ukwa/ or http://archive.org, which provide archival versions of web pages.

Figure 2. Metadata for a URL

Providing Metadata for Greycite

Greycite currently accesses metadata in a variety of different formats. Unfortunately, one disadvantage of this decentralised system is that there are many different formats and they are not followed rigorously. Greycite takes a somewhat heuristic approach, applying a “best effort” rule to try to interpret the metadata it sees.
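One way to picture a “best effort” rule is as a priority list: for each field, try several candidate keys in turn and take the first usable value, recording which key supplied it. The Python sketch below shows that idea only. The key names, their ordering and the function name are hypothetical and are not Greycite's actual rules.

```python
# Hypothetical priority lists: the raw keys to try, in order, for each field.
CANDIDATE_KEYS = {
    "title": ["citation_title", "og:title", "dc.title"],
    "author": ["citation_author", "article:author", "dc.creator"],
    "date": ["citation_publication_date", "article:published_time", "dc.date"],
}


def best_effort(raw):
    """Return (field -> value, field -> source key) from the first usable candidate."""
    resolved, provenance = {}, {}
    # Publishers vary the case of key names, so compare in lower case.
    lowered = {key.lower(): value for key, value in raw.items()}
    for field, candidates in CANDIDATE_KEYS.items():
        for key in candidates:
            value = (lowered.get(key) or "").strip()
            if value:  # skip keys that are missing or present but blank
                resolved[field] = value
                provenance[field] = key
                break
    return resolved, provenance


# Hypothetical raw tags, as they might be scraped from a page.
raw_tags = {
    "og:title": "Example post",
    "DC.creator": "A. Author",
    "citation_publication_date": "  ",  # present but blank, so it is skipped
    "article:published_time": "2012-09-07",
}
resolved, provenance = best_effort(raw_tags)
print(resolved)
print(provenance)
```

Fields for which no candidate yields a value are simply left out of the result, so a caller can tell "unknown" apart from an empty string.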

If you are using WordPress, the most straightforward way to add metadata is to use our own Kblog-metadata (http://knowledgeblog.org/kblog-metadata) plugin. This has been tested with Greycite and, indeed, directly uses Greycite for part of its functionality. For other content management systems, we support Open Graph Protocol, Google Scholar, and COinS. Many CMSs already have appropriate plugin support for these systems. If you are using a hosted service, such as http://wordpress.com or http://blogger.com, there is a reasonable chance that Greycite will just work, although this is dependent on the theme you use. We hope to provide a more complete solution for hosted services in future.
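For anyone maintaining their own page templates rather than relying on a plugin, the following Python sketch renders `<meta>` tags in the Open Graph and Google Scholar styles for a single page. The function name and the example values are hypothetical, the tag names follow those two conventions as commonly documented, and this is not the output of Kblog-metadata.

```python
from html import escape


def render_meta_tags(title, authors, date, url):
    """Return <meta> lines in Open Graph and Google Scholar styles for one page."""
    tags = [
        ("property", "og:title", title),
        ("property", "og:url", url),
        ("property", "og:type", "article"),
        ("name", "citation_title", title),
        ("name", "citation_publication_date", date),
    ]
    # Google Scholar-style metadata uses one citation_author tag per author.
    tags += [("name", "citation_author", author) for author in authors]
    return "\n".join(
        # Escape values so quotes and ampersands cannot break the attribute.
        f'<meta {attribute}="{key}" content="{escape(value, quote=True)}">'
        for attribute, key, value in tags
    )


# Hypothetical values; the resulting lines belong in the page's <head>.
print(render_meta_tags(
    title="An example post",
    authors=["A. Author", "B. Author"],
    date="2012/09/07",
    url="http://example.org/post",
))
```

Emitting the same title in both styles is deliberate: a consumer that understands only one of the formats can still recover it.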


Phillip Lord
School of Computing Science
Newcastle University
Lindsay Marshall
School of Computing Science
Newcastle University

