This page has permanently moved to http://ronallo.com/page/2/
The stale content remains below for a limited time, for your convenience, while DNS caches expire.
Page 2 of 5
Nov 4, 2013
I spoke at the 2013 DLF Forum about “Embedded Semantic Markup, schema.org, the Common Crawl, and Web Data Commons: What Big Web Data Means for Libraries and Archives.” My slides, code, and data are all open.
Here’s the abstract:
Mar 22, 2013
While HTML5 Microdata has the advantage of living in visible markup, it is still invisible enough that when your app changes, your Microdata can silently go out of sync. Testing your Microdata ensures it parses correctly and that you’re communicating to the search engines what you think you are. I’ll show you a simple example of test-first addition of Microdata in a Rails project.
Spoiler: It is super easy using the microdata gem.
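To give a flavor of the approach, here’s a minimal sketch (not the post’s actual code): it leans on Nokogiri, which Rails already depends on, rather than the microdata gem’s own parser, and the book route, fixture, and properties are all illustrative.

    # A view-level test asserting the Microdata is present before you write
    # the markup. Sketch only: uses Nokogiri instead of the microdata gem;
    # the book route, fixture, and properties are hypothetical.
    require "test_helper"

    class BookMicrodataTest < ActionDispatch::IntegrationTest
      test "book page exposes schema.org/Book Microdata" do
        get book_path(books(:one)) # hypothetical route and fixture
        doc = Nokogiri::HTML(response.body)

        book = doc.at_css('[itemscope][itemtype="http://schema.org/Book"]')
        assert book, "expected an itemscope typed http://schema.org/Book"
        assert book.at_css('[itemprop="name"]'), "expected a name itemprop"
      end
    end

Write a test like this first, watch it fail, then add the itemscope and itemprop attributes to the view until it passes.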
Feb 19, 2013
Using the webvtt gem, you can display the WebVTT subtitles, captions, or chapters you’ve created for HTML5 video or audio right on the page. If you’re already creating WebVTT files for your media, you ought to get as much use out of them as you can. I’ll show you one way you could use them.
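As one rough sketch of the idea, assuming the webvtt gem’s WebVTT.read interface and cues that expose start and text, you could turn a chapters file into an HTML list to place alongside the player; the file path and data attribute are placeholders:

    # Sketch: render a WebVTT chapters file as an HTML list. Assumes the
    # webvtt gem exposes WebVTT.read and cues with start and text; the
    # path and data-start attribute are illustrative.
    require "webvtt"

    chapters = WebVTT.read("media/lecture.vtt") # placeholder path
    items = chapters.cues.map do |cue|
      %(<li data-start="#{cue.start}">#{cue.text}</li>)
    end
    puts %(<ol class="chapters">#{items.join}</ol>)

A little JavaScript wired to those data-start values could then seek the video when a chapter is clicked.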
Jan 24, 2013
The Web Data Commons is extracting the structured data discovered in the Common Crawl corpus, and they’re making the extracted data and some high-level analysis available for free to all. I took a look at which properties of http://schema.org/Book were actually used in the wild in the August 2012 corpus. My hope is to inform, in a small way, the discussion happening through the W3C Schema Bib Extend Community Group around extending schema.org to better accommodate bibliographic data. By seeing what is actually being used, we might make better decisions about how schema.org could be extended.
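The tally itself doesn’t need much machinery. Here’s a sketch under stated assumptions: the extraction is read as N-Quads, the filename is a placeholder, and blank-node scoping across pages is glossed over.

    # Two passes over a (placeholder) N-Quads dump: pass one collects
    # subjects typed as schema.org/Book; pass two counts the properties
    # attached to those subjects.
    require "set"

    RDF_TYPE = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"
    BOOK     = "<http://schema.org/Book>"

    books = Set.new
    File.foreach("wdc-microdata.nq") do |line|
      s, p, o, = line.split(" ", 4)
      books << s if p == RDF_TYPE && o == BOOK
    end

    counts = Hash.new(0)
    File.foreach("wdc-microdata.nq") do |line|
      s, p, = line.split(" ", 4)
      counts[p] += 1 if p != RDF_TYPE && books.include?(s)
    end

    counts.sort_by { |_, n| -n }.each { |prop, n| puts "#{n}\t#{prop}" }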
Jan 15, 2013
The Common Crawl now has a URL index available. While the Common Crawl has been making a large corpus of crawl data available for over a year now, if you wanted to access the data you had to parse through all of it yourself. Even though setting up a parallel Hadoop job running in AWS EC2 is cheaper than crawling the Web yourself, it is still rather expensive for most. Now, with the URL index, it is possible to query for the domains you are interested in and discover whether they are in the Common Crawl corpus. Then you can grab just those pages out of the crawl segments.
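Here’s a rough sketch of that last step. The segment path, offset, and length below are made-up stand-ins for what an index lookup would return; the sketch assumes each record in a segment is compressed as its own gzip member, which is what makes offset-based retrieval possible.

    # Pull a single record out of a crawl segment with a ranged GET.
    # Path, offset, and length are placeholders for index lookup results.
    require "net/http"
    require "zlib"
    require "stringio"

    file   = URI("https://commoncrawl.s3.amazonaws.com/example-segment.arc.gz") # placeholder
    offset = 1_234_567 # placeholder
    length = 8_910     # placeholder

    request = Net::HTTP::Get.new(file)
    request["Range"] = "bytes=#{offset}-#{offset + length - 1}"

    response = Net::HTTP.start(file.host, file.port, use_ssl: true) do |http|
      http.request(request)
    end

    record = Zlib::GzipReader.new(StringIO.new(response.body)).read
    puts record[0, 500] # record header plus the start of the archived page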