
Page 2 of 5

DLF Forum 2013 presentation: Embedded Semantic Markup, the Common Crawl, and Web Data Commons

Nov 4, 2013

I spoke at the 2013 DLF Forum about Embedded Semantic Markup, the Common Crawl, and Web Data Commons: What Big Web Data Means for Libraries and Archives. My slides, code, and data are all open.

Here’s the abstract:

read more

Automated Testing of HTML5 Microdata in Rails

Mar 22, 2013

While HTML5 Microdata has the advantage of using visible markup, it can still be invisible enough that when your app changes, your Microdata goes out of sync. Testing your Microdata is important for ensuring it parses correctly and that you're communicating to the search engines what you think you are. I'll show you a simple example of test-first addition of Microdata in a Rails project.

Spoiler: It is super easy using the microdata gem.
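For illustration, here's a minimal, framework-free sketch of the test-first idea: pull the `itemprop` values out of rendered HTML and assert they match what you expect. It uses a naive regex rather than the microdata gem's real parser, and the Book markup is a made-up example standing in for a rendered Rails view.

```ruby
# A naive scan for itemprop name/value pairs. The real microdata gem walks
# the DOM and handles nested items; this regex only covers flat <span>
# markup, which is enough to show the test-first shape.
def microdata_properties(html)
  html.scan(/itemprop="([^"]+)"[^>]*>([^<]*)</).to_h
end

# Made-up Book markup standing in for a rendered view.
html = <<~HTML
  <div itemscope itemtype="http://schema.org/Book">
    <span itemprop="name">The Wind in the Willows</span>
    <span itemprop="author">Kenneth Grahame</span>
  </div>
HTML

props = microdata_properties(html)
# props["name"] # => "The Wind in the Willows"
```

In a real Rails test you'd render the view (or fetch the page in an integration test) and run the actual parser over the response body instead of a regex.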

read more

Using the webvtt Ruby gem to display subtitles on the page

Feb 19, 2013

Using the webvtt gem, you can display the WebVTT subtitles, captions, or chapters you've created for HTML5 video or audio on the page. If you're already creating WebVTT files for your media, you ought to get as much use out of them as you can. I'll show you one way you could use them.
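As a rough sketch of what the gem does for you, here's a plain-Ruby parse of simple WebVTT cues into start time, end time, and text. Cue identifiers and settings are ignored, and the captions are invented; the webvtt gem handles the full grammar.

```ruby
# Plain-Ruby cue parser: split the WebVTT source on blank lines and keep
# blocks whose first line is a timing line ("start --> end").
Cue = Struct.new(:start, :finish, :text)

def parse_webvtt(source)
  source.split(/\n{2,}/).filter_map do |block|
    lines = block.strip.split("\n")
    next unless lines.first =~ /\A(\S+) --> (\S+)/
    Cue.new($1, $2, lines[1..].join("\n"))
  end
end

# Invented captions for demonstration.
vtt = <<~VTT
  WEBVTT

  00:00:01.000 --> 00:00:04.000
  Never drink liquid nitrogen.

  00:00:05.000 --> 00:00:09.000
  It will perforate your stomach.
VTT

cues = parse_webvtt(vtt)
# cues.first.text # => "Never drink liquid nitrogen."
```

Once the cues are plain data like this, displaying them on the page is just a loop that emits markup for each one.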

read more

The Prevalence of Book Properties in the Wild

Jan 24, 2013

The Web Data Commons is extracting the structured data discovered in the Common Crawl corpus, and they're making the extracted data and some high-level analyses available for free to all. I took a look at which Schema.org Book properties were actually used in the wild in the August 2012 corpus. My hope is to inform, in a small way, the discussion around extending Schema.org to better accommodate bibliographic data, which is taking place through the W3C Schema Bib Extend Community Group. By seeing what is actually being used, we might make better decisions about how Schema.org could be extended.
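The counting itself is simple once you have the extracted data in hand. Here's a sketch that tallies predicate (property URI) frequencies over a few made-up N-Quads-style lines; the real Web Data Commons dumps are far larger, and the exact property URIs they contain may differ from these invented ones.

```ruby
# Tally how often each predicate appears. In N-Quads-like lines the
# predicate is the second whitespace-separated field.
def property_counts(nquads)
  nquads.each_line.with_object(Hash.new(0)) do |line, counts|
    _subject, predicate, _rest = line.split(" ", 3)
    counts[predicate] += 1 if predicate
  end
end

# Invented sample quads for illustration only.
quads = <<~NQ
  _:b0 <http://schema.org/Book/name> "Example" <http://example.com/> .
  _:b0 <http://schema.org/Book/author> "A. Writer" <http://example.com/> .
  _:b1 <http://schema.org/Book/name> "Another" <http://example.org/> .
NQ

counts = property_counts(quads)
# counts["<http://schema.org/Book/name>"] # => 2
```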

read more

Common Crawl URL Index

Jan 15, 2013

The Common Crawl now has a URL index available. While the Common Crawl has been making a large corpus of crawl data available for over a year now, if you wanted to access the data you had to parse through it all yourself. Setting up a parallel Hadoop job running in AWS EC2 is cheaper than crawling the Web yourself, but it is still rather expensive for most. Now, with the URL index, it is possible to query for the domains you are interested in to discover whether they are in the Common Crawl corpus. Then you can grab just those pages out of the crawl segments.
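What makes domain queries cheap is the key format: the index stores each URL under its reversed host name, so all of a domain's pages share a common prefix and a lookup is a prefix scan over sorted keys. A toy sketch of that idea, with a sorted in-memory array of made-up URLs standing in for the real on-disk index (which is a block-compressed binary file, not plain strings):

```ruby
require "uri"

# Build an index key by reversing the host, so every URL on a domain
# shares a prefix ("com.example/...").
def index_key(url)
  uri = URI.parse(url)
  uri.host.split(".").reverse.join(".") + uri.path
end

# Invented URLs standing in for the real index contents.
index = [
  "http://example.com/about",
  "http://example.com/index.html",
  "http://archive.org/web/",
].map { |u| index_key(u) }.sort

# A domain query is then just a prefix scan over the sorted keys.
def lookup(index, reversed_domain)
  index.select { |key| key.start_with?(reversed_domain) }
end

hits = lookup(index, "com.example")
# hits # => ["com.example/about", "com.example/index.html"]
```

On the real sorted index the same scan can be done with binary search instead of a linear `select`, which is what keeps queries fast at corpus scale.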

read more
