Google Scholar and DSpace

Many institutional repositories aim to offer a discovery and access interface for scientific research. Even so, mainstream and specialized search engines are still the first tools of choice, linking out to metadata and files stored in institutional repositories. Google has been operating its dedicated search service for scholarly articles, Google Scholar, alongside its main search service for almost ten years. The popularity of the service has grown despite criticism that it is unconventional. Case law and clinical searches are two subject-specific examples that illustrate this. Google Scholar has also replaced Thomson ISI Web of Science as the data source for the popular Publish or Perish tool.

It is no secret that Google Scholar likes institutional repositories. But even though the use of EPrints, Digital Commons and DSpace is recommended in Google's Inclusion Guidelines for Webmasters, using one of these platforms does not by itself guarantee inclusion. Configuration choices, such as the metadata fields you decide to use and how these are exposed on the repository web pages, strongly affect your compliance with the inclusion guidelines. Unlike OpenAIRE, where getting your repository registered for harvesting is an interactive process, Google Scholar crawls automatically without interacting with your staff. This enables Google Scholar to scale, crawling thousands of academic resource sites without the need for a massive amount of support staff. The downside is that repository managers are left in the dark, not knowing why their repository or particular items are not being crawled and indexed.

Thanks to a close dialog between the DSpace community and the Google Scholar team, DSpace support for Google Scholar crawling has improved in recent years. We wanted to find out how well recent DSpace versions are actually being indexed by Google Scholar today. To do this we tried to get a general sense of overall coverage in one experiment and tried to verify if specific individual items were included in a second one.

Google Scholar Index Ratio

Because we are restricted to the public Google Scholar interface (API, anyone?), there is a limit to what we can learn about Scholar's coverage of specific repositories. Inspired by work from Patrick O'Brien and Kenning Arlitsch, we wanted to get a sense of coverage, i.e. the Google Scholar Index Ratio, for modern DSpace 3 and 1.8.x repositories. Like O'Brien and Arlitsch, we define the ratio as the number of items indexed by Google Scholar that show up using the "site:repo-url" operator, divided by the total number of items in the repository.

In its troubleshooting guidelines, Google warns that this operator is not a good indicator of coverage as its purpose is essentially to help users refine their queries and not coverage checking. This is why we also conducted a second experiment to investigate if particular items in these repositories are being indexed and if we could detect a pattern if they aren't. If you seriously want to investigate coverage of your repository, doing an in-depth verification of specific items is the way to go. Such an approach provides more meaningful results than solely relying on a GS Index ratio based on the "site:repo-url" search operator. Still, we are using this operator here to have a point of comparison with the aforementioned work by O'Brien and Arlitsch from October 2011.

Repository Name                                        Item Total   GS Item Total   GS Index Ratio
DSpace@MIT                                                  68784           35200           51.17%
Baylor BEARDOCS                                              1552            1390           89.56%
Gothenburg University Publications Electronic Archive       29000           22000           75.86%
SUNScholar Research Repository                              55398           19700           35.56%
Digital Access to Scholarship at Harvard                    14247            3740           26.25%
UBC cIRcle                                                  44188           43800           99.12%
Myrrh - The Moore Institutional Repository                   2577            2130           82.65%
The American University in Cairo DAR                         2897            1490           51.43%
IDS OpenDocs                                                 3124            1710           54.73%
Wits Institutional Repository on DSpace                      8104            6630           81.81%

O'Brien and Arlitsch reported an average of 30% for a sample of 20 repositories operating on a number of different platforms in October 2011. The average indexing ratio for our sample of 10 recent DSpace repositories is 64.8%. This could indicate that this set of repositories today is enjoying better coverage than the other set of repositories back in 2011. Improvements in DSpace itself almost certainly do not tell the whole story, as Google Scholar may have improved its indexing processes as well. While the methodology and tools seem too blunt to say anything sensible about a difference between 51% and 54%, they do raise curiosity as to why particular repositories sit at either end of the spectrum.
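The arithmetic behind the ratio and the sample average can be reproduced with a few lines of Python. This is only a sketch: the figures are copied from the table above, and the names SAMPLE and gs_index_ratio are ours, chosen for illustration.

```python
from statistics import mean

# (repository, items in the repository, items reported by Google Scholar)
# Figures copied from the table in this article.
SAMPLE = [
    ("DSpace@MIT", 68784, 35200),
    ("Baylor BEARDOCS", 1552, 1390),
    ("Gothenburg University Publications Electronic Archive", 29000, 22000),
    ("SUNScholar Research Repository", 55398, 19700),
    ("Digital Access to Scholarship at Harvard", 14247, 3740),
    ("UBC cIRcle", 44188, 43800),
    ("Myrrh - The Moore Institutional Repository", 2577, 2130),
    ("The American University in Cairo DAR", 2897, 1490),
    ("IDS OpenDocs", 3124, 1710),
    ("Wits Institutional Repository on DSpace", 8104, 6630),
]


def gs_index_ratio(item_total: int, gs_item_total: int) -> float:
    """Return the share of repository items reported by Google Scholar, in percent."""
    if item_total <= 0:
        raise ValueError("item_total must be positive")
    return 100.0 * gs_item_total / item_total


# One ratio per repository, then the unweighted mean of those ratios.
ratios = {name: gs_index_ratio(total, indexed) for name, total, indexed in SAMPLE}
average_ratio = mean(ratios.values())

if __name__ == "__main__":
    for name, ratio in sorted(ratios.items(), key=lambda kv: kv[1], reverse=True):
        print(f"{name:55s} {ratio:6.2f}%")
    print(f"Average of the {len(ratios)} ratios: {average_ratio:.1f}%")
```

Note that this is an unweighted mean of the per-repository ratios, so a small repository counts as much as a large one.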

Hunting for individual items

In another experiment we wanted to get a sense of which items are included and which are missing. We selected items from five repositories and explicitly searched for their titles in Scholar to see if we would receive a hit from the repository. For each repository we selected a very old item and a very recent item, to identify whether Scholar struggles to keep up with the most recent content. Where we could find any, we also looked at metadata-only items to see if they suffer from worse coverage than items that have full text attached to them. Every result links back to the item that we used for the experiment.
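For anyone repeating this kind of title check by hand, a small helper can build the search URL to paste into a browser. The sketch below assumes the public search address https://scholar.google.com/scholar with a q parameter; the function name title_query_url and the host repository.example.edu are hypothetical. It only builds the URL and does not query Scholar itself.

```python
from typing import Optional
from urllib.parse import urlencode

# Public Google Scholar search address (an assumption of this sketch).
SCHOLAR_SEARCH = "https://scholar.google.com/scholar"


def title_query_url(title: str, site: Optional[str] = None) -> str:
    """Build a Scholar search URL for an exact title.

    The title is wrapped in double quotes to request a phrase match.
    If `site` is given, the query is narrowed with the site: operator.
    """
    # Drop embedded double quotes and collapse whitespace so the phrase stays intact.
    cleaned = " ".join(title.replace('"', " ").split())
    query = f'"{cleaned}"'
    if site:
        query += f" site:{site}"
    return f"{SCHOLAR_SEARCH}?{urlencode({'q': query})}"


if __name__ == "__main__":
    # Hypothetical title and repository host, for illustration only.
    print(title_query_url("Open access", site="repository.example.edu"))
```

Leaving out the site argument searches all of Scholar, which is useful to see whether another copy of the item is indexed instead of the repository copy.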

Repository Name                              Recent item      Old item         Restricted item
UBC cIRcle                                   Item not found   Item found       *
Myrrh - The Moore Institutional Repository   Item found       Item not found   *
The American University in Cairo DAR         Item found       Item found       Item not found
IDS OpenDocs                                 Item not found   Item found       *
Wits Institutional Repository on DSpace      Item found       Item found       *

* To identify an item with no or restricted bitstreams, we accessed a maximum of 20 random items in the repository. For these repositories, we were not able to find an item that either had no files or non-public files attached to it.

Google's Troubleshooting page notes that new papers are normally added several times a week, while updates to existing papers can take up to 6-9 months. The above test with recent items has the problem that, even though the items are among the most recent for each repository, they were not all added on the same date. Not surprisingly, an item that is just 2 days old was probably too recent to be included in Google Scholar already. Among the older items, only one couldn't be found in Google Scholar. A brief look at its item page suggests that the very limited amount of metadata may be a reason for non-inclusion.

For the five repositories in this sample, items with restricted bitstreams or metadata-only items were hard to find. Overall, it is impossible to assert that metadata-only items are totally disregarded by Google Scholar, as there are many examples of "citation" listings (1, 2, 3, ...) that link out to metadata-only items in DSpace, especially if these are highly cited items.

DSpace 4 improvements

Google Scholar indexing and the associated ratios are likely to improve further for DSpace 4 repositories. This recent release of DSpace includes several enhancements explicitly requested by the Google Scholar team. The Google Scholar crawler should find it easier to retrieve recent submissions in DSpace 4, thanks to an extensive list of recent submissions that can be accessed from a link on the DSpace homepage. The logic behind which bitstream gets exposed in the citation_pdf_url meta tag has been enhanced as well.
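To see which citation meta tags an item page actually exposes, the page source can be inspected with Python's standard html.parser module. The sketch below is only an illustration. The tag citation_pdf_url is the one mentioned above; the other tag names follow Google's inclusion guidelines as we understand them, and treating these four as the expected set is our assumption, so adapt the list to your own configuration. The sample page is made up.

```python
from html.parser import HTMLParser

# Assumed set of tags to look for; adjust to your repository's configuration.
EXPECTED_TAGS = (
    "citation_title",
    "citation_author",
    "citation_publication_date",
    "citation_pdf_url",
)


class CitationMetaParser(HTMLParser):
    """Collect <meta name="citation_..." content="..."> tags from a page."""

    def __init__(self):
        super().__init__()
        self.tags = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        name = (attr.get("name") or "").lower()
        content = attr.get("content")
        if name.startswith("citation_") and content:
            # A tag such as citation_author may legitimately repeat.
            self.tags.setdefault(name, []).append(content)


def citation_tags(html: str) -> dict:
    """Return all citation_* meta tags found in the HTML, keyed by tag name."""
    parser = CitationMetaParser()
    parser.feed(html)
    parser.close()
    return parser.tags


def missing_tags(html: str) -> list:
    """Return the expected tag names that the page does not expose."""
    found = citation_tags(html)
    return [name for name in EXPECTED_TAGS if name not in found]


# Made-up item page without a citation_pdf_url tag, for illustration only.
SAMPLE_PAGE = """
<html><head>
<meta name="citation_title" content="An example thesis title" />
<meta name="citation_author" content="Doe, Jane" />
<meta name="citation_publication_date" content="2013" />
</head><body></body></html>
"""

if __name__ == "__main__":
    print(missing_tags(SAMPLE_PAGE))
```

For the sample page the check reports citation_pdf_url as missing, which is the kind of result you would expect for a metadata-only item or an item whose bitstreams are not public.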

Conclusion

Because Google Scholar indexing is still very much a black box today, improving your coverage can be particularly challenging. The examples in this article show that indexing ratios well over 50% are attainable for DSpace repositories, so such a ratio may be attainable for your DSpace as well. If your DSpace items are currently not being indexed in Google Scholar, chances are that a DSpace upgrade or specific customizations can increase your visibility in what has become one of the most popular academic search engines.

@mire assists institutions around the globe with a range of DSpace services and add-on modules. If you are experiencing difficulty with Google Scholar indexing or other aspects of your DSpace, we can help. Contact us now to benefit from our expertise.