Embedded Semantic Markup, schema.org, the Common Crawl, and Web Data Commons: Big Web Data

Jason Ronallo
Associate Head, Digital Library Initiatives
North Carolina State University Libraries

@ronallo
jason_ronallo@ncsu.edu

Hi! I'm Jason Ronallo at NC State.

Outline

I'm going to cover a lot of ground in a short period of time just to give you an idea about each of these. Just enough to begin to see why this is important and give you ideas on what's becoming possible. Much of this is the necessary preface for the research I've begun.

How Search Engines Work

  1. Robots crawl the Web
  2. Process and index crawl data
  3. Try to answer search queries with the most relevant results

friendly robot 

This is basically how search engines work.

Problem is that for the most part they're having to do natural language processing to pull out semantics. It is really hard to do. Even the smart minds at big companies like Google can only do so much.

Embedded Semantic Markup

Jason Ronallo is the Associate Head of Digital Library Initiatives at NCSU Libraries.

This is part of the reason why embedded semantic markup is important. Here's a snippet of text that carries some, but you can't see the markup.

Embedded Semantic Markup

<span itemscope 
   itemtype="http://schema.org/Person">
   <a itemprop="url" 
      href="http://twitter.com/ronallo">
      <span itemprop="name">Jason Ronallo</span>
   </a> is the <span itemprop="jobTitle">
     Associate Head of Digital Library 
     Initiatives</span> at
   <span itemprop="affiliation" itemscope
     itemtype="http://schema.org/Library">
     <span itemprop="name"> <a itemprop="url"
       href="http://lib.ncsu.edu">NCSU Libraries</a></span>
  </span>.
</span>

But here's what the markup looks like. Some attributes have been added to some simple HTML to add some more structure to the data.

That's all that embedded semantic markup is. Embedded semantic markup provides the syntax (some extra markup) to structure data in your HTML pages.

Think of this like hidden annotations.
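To see how a program could read those hidden annotations, here is a minimal Python sketch that pulls the itemprop values out of the snippet above using only the standard library. It is a deliberate simplification, not the full Microdata extraction algorithm: it assumes well-formed markup, flattens nested items into one list, and the names ItemPropCollector and extract_itemprops are made up for this example.

```python
from html.parser import HTMLParser

SNIPPET = """
<span itemscope itemtype="http://schema.org/Person">
   <a itemprop="url" href="http://twitter.com/ronallo">
      <span itemprop="name">Jason Ronallo</span>
   </a> is the <span itemprop="jobTitle">
     Associate Head of Digital Library
     Initiatives</span> at
   <span itemprop="affiliation" itemscope
     itemtype="http://schema.org/Library">
     <span itemprop="name"> <a itemprop="url"
       href="http://lib.ncsu.edu">NCSU Libraries</a></span>
  </span>.
</span>
"""


class ItemPropCollector(HTMLParser):
    """Collect flat (itemprop, value) pairs; item nesting is ignored."""

    def __init__(self):
        super().__init__()
        self.pairs = []   # finished (property, value) pairs
        self._open = []   # stack of (tag, itemprop or None, text parts)

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        prop = attrs.get("itemprop")
        if prop and "href" in attrs:
            # For links, the property value is the href, not the link text.
            self.pairs.append((prop, attrs["href"]))
            prop = None
        elif prop and "itemscope" in attrs:
            # The value is a nested item; this sketch records only its type.
            self.pairs.append((prop, attrs.get("itemtype", "")))
            prop = None
        self._open.append((tag, prop, []))

    def handle_data(self, data):
        # Text counts toward every open element that carries an itemprop.
        for _tag, prop, parts in self._open:
            if prop:
                parts.append(data)

    def handle_endtag(self, tag):
        # Pop back to the matching start tag (assumes well-formed markup).
        while self._open:
            open_tag, prop, parts = self._open.pop()
            if prop:
                text = " ".join("".join(parts).split())
                self.pairs.append((prop, text))
            if open_tag == tag:
                break


def extract_itemprops(html):
    parser = ItemPropCollector()
    parser.feed(html)
    parser.close()
    return parser.pairs


pairs = extract_itemprops(SNIPPET)
for prop, value in pairs:
    print(prop, "=", value)
```

A real consumer would keep the nesting, so that the second name and url stay attached to the Library item rather than to the Person.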

Why use embedded semantic markup?

Since your eyes are more often on the web site, it can be better than trying to keep your data in sync with some external XML serialization.

We often have very rich metadata for the resources we describe in our databases. In the past, schemes to expose this metadata through HTML and the Web led to a lot of dumbing down. Using embedded semantic markup like Microdata or RDFa Lite allows us to expose richer metadata with more structure through HTML.

Schema.org

* Numbers from last time I checked early in 2013.

Using embedded semantic markup to structure your data is great, but if you're not using a vocabulary that someone else understands it is kind of pointless. This is where schema.org comes in. (Read slide.)

Examples

I'm going to show you a couple of very quick examples of what we've done so far with embedded semantic markup and schema.org at NCSU. I'm sure the Duke folks will be showing you more examples in a bit.

screenshot of home page of rare and unique materials site 

OK, here's a digital collections site where we've implemented embedded semantic markup and schema.org.

Schema.org types on the Rare & Unique Materials site

These are the types of things that we're describing on that site. And looking at Google Webmaster Tools we know these things are getting indexed.

Rich Snippets: Rare and Unique Materials Video


Third result in Google video search for "bug sprays and pets."

So the main benefit we get out of all this right now is what Google calls Rich Snippets.

This search result has a video thumbnail, the duration and a bit from the transcript of the video as the description. Rich snippets is really the only thing that Google has said it will use this data for.

You can see how having this extra information can make a particular search result stand out and be more likely to be clicked on. So it improves discoverability.

But how else could this benefit libraries and archives if all of this gets pushed further?

Library Home Page

NCSU Libraries Home Page 

We've included some Microdata and Schema.org on the Libraries' site.

Answers Instead of Search Results

There is a more general trend for search engines to give answers instead of a list of search results. You may have seen results like this in Google already. While the data is currently taken from Wikipedia, Freebase, and some other standard sources, I'd expect that more answers would start being sourced from the embedded semantic markup that gets published. I think this could go even further though.

Library Services and Google Now

22 minutes to Hunt Library, Centennial Campus

Wolfline #8 from Scott Hall

Hunt Library hours: 7:00 a.m. - 11:00 p.m.

View nearby events

Fabulous Faculty @ DH Hill Library - Brickyard Farmer's Market - Read Smart Book Discussion (Salt Sugar Fat)

Other good things we could add would be hours of operation, our exact geographic location, and events happening in the Libraries.

Who is familiar with Google Now?

It is like an automatic personal assistant that learns about your habits and gives you helpful information. If you enable Google Now you'll see information about how long it would take for you to get from home to work on the next bus.

This is totally speculative about where this could go, but wouldn't it be cool if it showed students the Libraries' hours for the day, when and where their study group is meeting, and what events are happening in the library? This is the kind of thing which is possible when lots of this data is published on the Web and combined with the student's data.

Research Question:
Are Academic Libraries Publishing Embedded Structured Data?

How about digital collections?

Get some rough idea of the landscape of use of embedded semantic markup and schema.org among academic institutions and academic libraries.

So one of the things I'd like to learn from this data is how many academic institutions are using these technologies, and especially how many academic libraries are. What are their patterns of use? How could we improve things for libraries in this new Big Web data environment? I've got lots of other questions.

Problem is that I'd need lots of data to answer these questions. I certainly don't have the means to go out and crawl the Web, store all the data, and parse it.

Common Crawl

Common Crawl is a non-profit foundation dedicated to providing an open repository of web crawl data that can be accessed and analyzed by everyone.

http://commoncrawl.org/

This is where the Common Crawl comes in to help. They don't crawl as much as Google, but it is still a lot. (Read slide.)

If you want to use this data, it is free. But to parse it all will cost you money. Not as much as crawling the Web on your own would, but still something.

Web Data Commons

Extracting Structured Data from the Common Web Crawl

http://webdatacommons.org/

Domains with Triples 2,286,277
Typed Entities 1,811,471,956
Triples/Statements 7,350,953,995

Cost to extract the data from the Common Crawl: $398

And this is where the Web Data Commons comes in. It parses the whole Common Crawl corpus to extract all of the embedded semantic markup into RDF triples, and it makes all of that data available for free.

Again we're talking some Big Data here with over 7 billion triples.

Even so, a data set of this size is approachable enough to just download and play with.

What's an N-Quad?

_:node6eecc231551a72e90e7efb3dc3fc26 http://schema.org/Photograph/name "Mary Travers singing live on stage" http://d.lib.ncsu.edu/collections/catalog/0228376 .

Subject Predicate Object Context

An N-Quad is an RDF statement that also includes a context piece at the end. Context is the URL of the HTML page from which the data was extracted.

Line-based format makes it easier to do some rough parsing.
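As a rough illustration of that kind of parsing, here is a small Python sketch that splits one N-Quad line into its four parts with a regular expression. One caveat: the slide shows the statement with the angle brackets dropped, while the N-Quads syntax wraps IRIs in angle brackets, so the sample line below puts them back. The names NQUAD, Quad and parse_nquad are invented for this example, and the pattern is only an approximation of the full grammar.

```python
import re
from typing import NamedTuple, Optional

# Approximate pattern: subject, predicate, object, context, then a final dot.
NQUAD = re.compile(
    r'^(?P<subject><[^>]*>|_:\S+)\s+'
    r'(?P<predicate><[^>]*>)\s+'
    r'(?P<object><[^>]*>|_:\S+|"(?:[^"\\]|\\.)*"(?:@[\w-]+|\^\^<[^>]*>)?)\s+'
    r'(?P<context><[^>]*>)\s*\.\s*$'
)


class Quad(NamedTuple):
    subject: str
    predicate: str
    object: str
    context: str


def parse_nquad(line: str) -> Optional[Quad]:
    """Return the four terms of an N-Quad line, or None if it does not match."""
    match = NQUAD.match(line.strip())
    if match is None:
        return None
    return Quad(
        subject=match["subject"],
        predicate=match["predicate"].strip("<>"),
        object=match["object"],
        context=match["context"].strip("<>"),
    )


LINE = (
    '_:node6eecc231551a72e90e7efb3dc3fc26 '
    '<http://schema.org/Photograph/name> '
    '"Mary Travers singing live on stage" '
    '<http://d.lib.ncsu.edu/collections/catalog/0228376> .'
)

quad = parse_nquad(LINE)
print(quad)
```

For anything beyond a rough pass, a real RDF parsing library would be the safer choice, since literals can contain escapes that a single regular expression handles only loosely.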

Before we get into some of the results let's cover a little bit of vocabulary.

Web Data Commons publishes its data as N-Quads. So what's an N-Quad?

Methodology

  1. Grab all of the Web Data Commons extracted N-Quads (7,350,953,995 of them) from the August 2012 Common Crawl Corpus
  2. Use command-line tools (cat & grep) to boil things down to just N-Quads that contain ".edu" somewhere, anywhere
  3. Further reduce by university (duke.edu, nccu.edu, ncsu.edu, unc.edu)
  4. Even further reduce to just libraries (library.duke.edu, lib.ncsu.edu, lib.unc.edu)
  5. Run some scripts over these smaller batches to get some results

All very much a crude pass at this!

I wanted to contain the big data element a bit so I used some crude methods to just begin to get some usable data out.
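Steps 2 through 4 of the methodology were done with cat and grep, but the same crude substring filtering could be sketched in Python like this. The sample lines are hypothetical stand-ins for the real files (only the first one is adapted from an actual NCSU quad), and the helper names grep and reduce_by are made up for the example.

```python
from typing import Dict, Iterable, List

UNIVERSITIES = ["duke.edu", "nccu.edu", "ncsu.edu", "unc.edu"]
LIBRARIES = ["library.duke.edu", "lib.ncsu.edu", "lib.unc.edu"]


def grep(lines: Iterable[str], needle: str) -> List[str]:
    """Keep lines that contain needle anywhere, like a fixed-string grep."""
    return [line for line in lines if needle in line]


def reduce_by(lines: List[str], needles: List[str]) -> Dict[str, List[str]]:
    """Split one batch of lines into one smaller batch per search string."""
    return {needle: grep(lines, needle) for needle in needles}


# Hypothetical sample input; the real input is billions of lines long.
SAMPLE = [
    '_:a <http://schema.org/Photograph/name> "Mary Travers singing live on stage" '
    '<http://d.lib.ncsu.edu/collections/catalog/0228376> .',
    '_:b <http://schema.org/Person/name> "Example Person" '
    '<http://www.ncsu.edu/example.html> .',
    '_:c <http://ogp.me/ns#title> "Example" <http://library.duke.edu/example> .',
    '_:d <http://ogp.me/ns#title> "Elsewhere" <http://example.com/page> .',
]

edu = grep(SAMPLE, ".edu")                    # step 2
by_university = reduce_by(edu, UNIVERSITIES)  # step 3
by_library = reduce_by(edu, LIBRARIES)        # step 4

university_counts = {name: len(batch) for name, batch in by_university.items()}
library_counts = {name: len(batch) for name, batch in by_library.items()}
print(university_counts)
print(library_counts)
```

Like grep, this matches the text anywhere in the line, so a quad that merely links to a university from some other site is counted too. For the full data set you would stream lines from the downloaded files rather than hold them in a list.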

OK, let's see what we're left with.

Total Statements (N-Quads)

All triples 7,350,953,995
All .edu 8,178,985
duke.edu 58,867
nccu.edu 79
ncsu.edu 9,339
unc.edu 52,751

These are all the statements that contain the text (.edu, duke.edu, nccu.edu, ncsu.edu, unc.edu) anywhere in the N-Quad.

Unique Contexts/Pages

Domain      Pages     Library domain      Pages
duke.edu    55,344    library.duke.edu    1,123
nccu.edu    2         n/a*
ncsu.edu    664       lib.ncsu.edu        155
unc.edu     2,837     lib.unc.edu         503

These are the number of unique contexts (HTML pages) that are included in the Common Crawl and that have included some embedded semantic markup that Web Data Commons has extracted.

* Uncertain how to target just NCCU Libraries.

So about 55 thousand pages on Duke's site included some structured data, and of those about 1 thousand are from the library's site.
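Counting unique contexts could look roughly like the following Python sketch. It is not necessarily how the numbers on the slide were produced: it assumes every line is a well-formed quad whose last term is the context, and it matches on the host name of the context URL rather than on text anywhere in the line. The function names and the second and third sample lines are hypothetical.

```python
from typing import Iterable, Optional, Set
from urllib.parse import urlsplit


def context_of(line: str) -> Optional[str]:
    """Return the context URL (the last <...> term before the final dot)."""
    body = line.rstrip()
    if not body.endswith("."):
        return None
    body = body[:-1].rstrip()
    start = body.rfind("<")
    if start == -1 or not body.endswith(">"):
        return None
    return body[start + 1:-1]


def unique_contexts(lines: Iterable[str], domain: str) -> Set[str]:
    """Collect the distinct pages on domain (or a subdomain) that have quads."""
    pages = set()
    for line in lines:
        url = context_of(line)
        if url is None:
            continue
        host = urlsplit(url).hostname or ""
        if host == domain or host.endswith("." + domain):
            pages.add(url)
    return pages


SAMPLE = [
    '_:a <http://schema.org/Photograph/name> "Mary Travers singing live on stage" '
    '<http://d.lib.ncsu.edu/collections/catalog/0228376> .',
    # Hypothetical second statement from the same page
    '_:a <http://schema.org/Photograph/url> <http://example.org/photo> '
    '<http://d.lib.ncsu.edu/collections/catalog/0228376> .',
    # Hypothetical statement from a non-library page
    '_:b <http://ogp.me/ns#title> "Example" <http://www.ncsu.edu/about> .',
]

ncsu_pages = unique_contexts(SAMPLE, "ncsu.edu")
library_pages = unique_contexts(SAMPLE, "lib.ncsu.edu")
print(len(ncsu_pages), len(library_pages))
```

Two statements from the same page count once, which is why the page counts are so much smaller than the statement counts.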

Digital Collections at NCSU: Rare & Unique Materials

  1. http://d.lib.ncsu.edu/collections/
  2. http://d.lib.ncsu.edu/collections/catalog/0228376
  3. http://d.lib.ncsu.edu/collections/catalog/bh2127pnc001
  4. http://d.lib.ncsu.edu/collections/catalog/unccmc00145-002-ff0003-002-004_0002

Mary Travers singing; Webb-Barron-Wells House; American Dry Cleaning building drawing

In looking closer at just the NCSU digital collections I see just these 4 URLs included as contexts. So these are the only NCSU digital collections pages with embedded semantic markup that were crawled. Many more pages have it--and many more interesting resources, in my opinion, in the collection as well--but these are the only ones that had enough PageRank to be included in the crawl in August of 2012.

Use of Schema.org

145,351 N-Quads from all 8,178,985 .edu N-Quads use schema.org.

Domain      N-Quads   Library domain      N-Quads
duke.edu    1,901     library.duke.edu    1,660
nccu.edu    3
ncsu.edu    326       lib.ncsu.edu        102
unc.edu     301       lib.unc.edu         25

* These numbers look at the whole quad and not just the context. So these universities and libraries might not actually be using schema.org (or may have been using schema.org but the documents that have schema.org have not been crawled by the Common Crawl).

So one thing I really need to do is do this calculation based on contexts instead.
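A context-based version of that calculation might look like this Python sketch. It is only an assumption about how the recount could be done: a quad is counted when its predicate starts with http://schema.org/ and its context page is on the domain in question. The helper names and the second and third sample lines are hypothetical.

```python
from typing import Iterable, Optional, Set, Tuple
from urllib.parse import urlsplit

SCHEMA_ORG = "http://schema.org/"


def predicate_and_context(line: str) -> Optional[Tuple[str, str]]:
    """Rough split of a quad line into its predicate and its context URL."""
    parts = line.split(None, 2)
    if len(parts) < 3:
        return None
    predicate = parts[1].strip("<>")
    rest = parts[2].rstrip()
    if not rest.endswith("."):
        return None
    rest = rest[:-1].rstrip()
    start = rest.rfind("<")
    if start == -1 or not rest.endswith(">"):
        return None
    return predicate, rest[start + 1:-1]


def on_domain(url: str, domain: str) -> bool:
    host = urlsplit(url).hostname or ""
    return host == domain or host.endswith("." + domain)


def schema_org_pages(lines: Iterable[str], domain: str) -> Set[str]:
    """Pages on domain that have at least one schema.org statement."""
    pages = set()
    for line in lines:
        parsed = predicate_and_context(line)
        if parsed is None:
            continue
        predicate, context = parsed
        if predicate.startswith(SCHEMA_ORG) and on_domain(context, domain):
            pages.add(context)
    return pages


SAMPLE = [
    '_:a <http://schema.org/Photograph/name> "Mary Travers singing live on stage" '
    '<http://d.lib.ncsu.edu/collections/catalog/0228376> .',
    # Hypothetical: another site uses schema.org and merely links to ncsu.edu
    '_:b <http://schema.org/Organization/url> <http://www.ncsu.edu/> '
    '<http://example.com/directory> .',
    # Hypothetical: an ncsu.edu page with markup that is not schema.org
    '_:c <http://ogp.me/ns#title> "Example" <http://www.ncsu.edu/about> .',
]

# Whole-quad matching, as in the crude pass
whole_quad = [l for l in SAMPLE if "ncsu.edu" in l and "schema.org" in l]
# Context-based matching
by_context = schema_org_pages(SAMPLE, "ncsu.edu")
print(len(whole_quad), len(by_context))
```

In this made-up sample the whole-quad match counts the example.com statement as NCSU's, while the context-based match does not, which is the kind of overcount the footnote warns about.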

OK, What Does This Mean? Preliminary Thoughts.

Still a lot more research to do--and do more exactly--but here are some of my initial thoughts on this.

What questions about Big Web Data would you be interested in?

Bonus Slide: What Could Libraries and Archives Do With These Web Technologies?

Libraries and archives can be both producers and consumers of this data.

So I think we can have a part to play both as producers of this data and as consumers of it.

So that's what others might do with this data, but what could Libraries do?

It could enable new services.

This might help us identify data sets we'd like to preserve.

Credits

Jason Ronallo

@ronallo

http://jronallo.github.io/

jason_ronallo@ncsu.edu

You can read more of what I've written on these topics and find other presentations I've given.