Preliminary Inventory of Digital Collections

Incomplete thoughts on digital libraries.

Automated Testing of HTML5 Microdata in Rails

While HTML5 Microdata has the advantage of using visible markup, it can still be invisible enough that when your app changes, your Microdata goes out of sync. Testing your Microdata ensures that it parses correctly and that you’re communicating to the search engines what you think you are. I’ll show you a simple example of test-first addition of Microdata in a Rails project.

Spoiler: It is super easy using the microdata gem.

The Prevalence of Schema.org Book Properties in the Wild

The Web Data Commons is extracting the structured data discovered in the Common Crawl corpus, and they’re making the extracted data and some high-level analyzed data available for free to all. I took a look at which properties of http://schema.org/Book were actually used in the wild in the August 2012 corpus. My hope is to inform, in a small way, the discussion happening through the W3C Schema Bib Extend Community Group around extending Schema.org to better accommodate bibliographic data. By seeing what is actually being used, we might make better decisions about how Schema.org could be extended.

Common Crawl URL Index

The Common Crawl now has a URL index available. The Common Crawl has been making a large corpus of crawl data available for over a year now, but if you wanted to access the data you had to parse through it all yourself. While setting up a parallel Hadoop job running in AWS EC2 is cheaper than crawling the Web, it is still rather expensive for most. Now, with the URL index, it is possible to query for domains you are interested in to discover whether they are in the Common Crawl corpus. Then you can grab just those pages out of the crawl segments.

Detailed Video Engagement Analytics for HTML5 Video

If you have published video on the Web with HTML5 Video, then you will want to know how engaged your viewers are with your video. This post will lead you through collecting the data you need to assess your video publishing efforts. Services like Google Analytics only get you so far. To get deeper insights into how your video is being used, you may want to track detailed video engagement analytics.

HTML5 Video: Everything I Needed to Know

While there is a lot of good information on the Web about HTML5 Video, I wanted to put down everything that I gathered together about it. As I work on video projects, I will continue to add more and new information to this post.

This post is in the form of a tutorial where a single page is built up little by little as new problems arise or new features are implemented. I hope for this to be a gentle introduction to HTML5 Video.

Library Catalog Pages Ranking in Search Engines

Recently there was a thread on the code4lib list about local catalog records showing up in the results of search engines like Google, Bing, and Yahoo!. The anecdotal evidence is that Google is actively crawling and indexing library catalogs like Johns Hopkins’.

Some of the discussion has revolved around how useful this local catalog data is to folks coming from search engines. How many of these users are satisfied with coming to a local library catalog? I think many people will be unsatisfied because they have found something interesting that they cannot access. Much of what academic libraries have is only available to their own students, faculty, and staff or to other institutions through inter-library loan. This situation may be improving for users.

Digital Collections, Crawling, and Aggregating Content

Code4Lib 2012 Lightning Talk That Wasn’t

Lightning talks filled up fast this year at Code4Lib before I had a chance to sign up, which is probably for the best since I had already had the opportunity to give a full-length talk. Here is the lightning talk that I had prepared, with each slide followed by my draft speaker notes.

Hi.

Digital libraries have aspired to create the one big pot of digital library stuff to hold everything. For the most part we’ve used niche protocols, dumbed-down metadata, cumbersome workflows, and lots of time massaging metadata in an effort to achieve these big aggregations.

I want to talk about what I think is a better way to do aggregations. There’s a lot more you can build with what I’m talking about, but I want to set aggregations in my sights.