Embedded Semantic Markup, schema.org, the Common Crawl, and Web Data Commons: What Big Web Data Means for Libraries and Archives

Jason Ronallo
NCSU Libraries
@ronallo

Slides: https://ronallo.com/presentations/2013-dlf

Hi. I'm Jason Ronallo the Associate Head of Digital Library Initiatives at NCSU Libraries. Much of the work that I've done has been with digital special collections and especially with improving the discoverability of these collections on the open Web.

I'm going to give an introduction to each of these pieces of the title, what they are, and how we can use them, and then I'm going to show a little of the original research that I'm doing. So let's jump into it.

How Search Engines Work

  1. Robots crawl the Web
  2. Process and index crawl data
  3. Try to answer search queries with the most relevant results

[image: friendly robot]

But first I'd like to talk for a moment about how search engines work. Robots crawl the web. They process and index crawl data. Finally they try to get folks to relevant pages that match search queries. It is that step #2 that we're going to focus on today. There's a lot in that. The point I'd like to make about it now is that search engines have begun to reach the limits of what they can do with natural language processing alone. The problem is that there's a limit to what meaning you can pull out just from HTML tags and the text content.

Semantics

Semantics in HTML helps to solve this.

Semantics in HTML

<ol>
  <li>First item</li>
  <li>Second item</li>
  <li>Third item</li>
</ol>
There has always been some semantics in HTML. Here's a simple ordered list. We can't tell much from it, but we do know that these things are ordered.

HTML5 Semantics

<nav></nav>
<header></header>
<article>
  <section></section>
  <section></section>
</article>
<footer></footer>
HTML5 has added a bunch of new semantic elements like article. This can let us pick the article out of a page for a distraction-free reading experience.

Trapped Knowledge

But that still doesn't tell us much about what the content is about. There's still a lot of knowledge trapped in HTML pages that's difficult to get out.

Embedded Semantic Markup

And that's where embedded semantic markup helps.

Embedded Semantic Markup Example

Jason Ronallo is the Associate Head of Digital Library Initiatives at NCSU Libraries.

Here's a simple statement. It is easy for us as humans to know what this means, but you can imagine how much more complex it would be to try to instruct a computer to pull out these same pieces of data, especially if this was within a longer text.

There's actually some embedded semantic markup on this page. You can't see it?

Embedded Semantic Markup Is Hidden Annotations Meant for Machines

[image: friendly robot]

Well that's because embedded semantic markup is a bunch of hidden annotations on the page meant for machines.

Embedded Semantic Markup Exposed

Person has the properties name, url, jobTitle, and affiliation. The affiliation is with a Library that has a name and url.

Here's the embedded semantic markup exposed. You can see that this whole thing describes a Person who has some properties like a name, url, jobTitle, and affiliation. Breaking things down in this way makes it easy for machines to make sense of the content.

Embedded Semantic Markup Structure

[diagram: the Person and Library structure described above]

Embedded Semantic Markup HTML

<span itemscope itemtype="http://schema.org/Person">
  <a itemprop="url"
    href="http://twitter.com/ronallo">
    <span itemprop="name">Jason Ronallo</span>
  </a> is the <span itemprop="jobTitle">
    Associate Head of Digital Library
    Initiatives</span> at
  <span itemprop="affiliation" itemscope
    itemtype="http://schema.org/Library">
    <span itemprop="name">
      <a itemprop="url" href="http://lib.ncsu.edu">
       NCSU Libraries</a>
    </span>
  </span>.
</span>
Here's our example HTML. I'm using the Microdata syntax for the embedded semantic markup. I won't get into the particulars, but you can see that there are some extra attributes like itemscope, itemtype, and itemprop added to the HTML.

JSON Serialization

{"items": [
    { "type": [ "http://schema.org/Person" ],
      "properties": {
        "url": [ "http://twitter.com/ronallo" ],
        "name": [ "Jason Ronallo" ],
        "jobTitle": [ "Associate Head of Digital Library Initiatives" ],
        "affiliation": [
          { "type": [ "http://schema.org/Library" ],
            "properties": {
              "name": [ "NCSU Libraries" ],
              "url": [ "http://lib.ncsu.edu/" ]
            }
          }
        ]
      }
    }
  ]
}
You can extract the embedded semantic markup and serialize it as JSON.
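As a rough sketch of what that extraction can look like, here's one way to do it in Python with the third-party extruct library. This is purely illustrative and an assumption on my part--it isn't a tool from this talk:

import json

import extruct

html = """
<span itemscope itemtype="http://schema.org/Person">
  <a itemprop="url" href="http://twitter.com/ronallo">
    <span itemprop="name">Jason Ronallo</span>
  </a>
</span>
"""

# extract() returns a dict keyed by syntax name;
# "microdata" holds the list of extracted items
data = extruct.extract(html, syntaxes=["microdata"])
print(json.dumps(data["microdata"], indent=2))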

RDF (Turtle)

@prefix md: <http://www.w3.org/ns/md#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfa: <http://www.w3.org/ns/rdfa#> .
@prefix schema: <http://schema.org/> .

<> md:item ( [ a schema:Person;
                schema:affiliation [ a schema:Library;
                        schema:name "NCSU Libraries";
                        schema:url <http://lib.ncsu.edu> ];
                schema:jobTitle "Associate Head of Digital Library Initiatives";
                schema:name "Jason Ronallo";
                schema:url <http://twitter.com/ronallo> ] );
    rdfa:usesVocabulary schema: .
Or serialize to some RDF representation.

Types of
Embedded Semantic Markup

These are the syntaxes most commonly used for embedded semantic markup: Microformats, RDFa (including RDFa Lite), and Microdata. The example that I've shown is in the Microdata syntax. I mention them now since we'll see them again when we get to the research I've done.

Why use
embedded semantic markup?

Embedded semantic markup is a syntax for structuring data in HTML when you need to communicate unambiguously with machines. Since your eyes are more often on the website itself, embedding the data there can be easier than trying to keep it in sync with some external XML serialization. These syntaxes also give us the chance to go from rich metadata to rich embedded data.

That last point is something I'd like to stress. Too often in libraries, when we're looking to exchange data or expose it on the Web, we dumb it down. (OAI-PMH led to a lot of this.) You might also remember folks trying to put Dublin Core into the head of HTML documents.

We often have very rich metadata for the resources we describe in our databases.

Using embedded semantic markup like Microdata or RDFa Lite allows us to expose richer metadata with more structure through HTML.
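To make the comparison concrete, here's a sketch of the earlier Person statement expressed in RDFa Lite instead of Microdata. The vocabulary and values are the same; only the attribute names change:

<!-- the same statement as the Microdata example, in RDFa Lite -->
<span vocab="http://schema.org/" typeof="Person">
  <a property="url" href="http://twitter.com/ronallo">
    <span property="name">Jason Ronallo</span>
  </a> is the
  <span property="jobTitle">Associate Head of Digital Library Initiatives</span> at
  <span property="affiliation" typeof="Library">
    <a property="url" href="http://lib.ncsu.edu">
      <span property="name">NCSU Libraries</span>
    </a>
  </span>.
</span>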

The End of Dumbed Down Metadata

I hope we're nearing the end of dumbed down metadata.

Vocabularies for Understanding

Embedded semantic markup only gives us a syntax, and syntax isn't useful on its own. We also need vocabularies so that we understand each other and get past dumbed-down metadata.

Schema.org

This is where a vocabulary like schema.org comes in to help. There are all kinds of vocabularies that can be used for describing the content of Web pages, but I'll focus on schema.org for its ease of use, particular use cases, and growing implementation base.

Schema.org

[Read slide.] Yes, you can even describe a Volcano, which peculiarly has a property for phone number.

[screenshot: the schema.org type hierarchy]

Here's the schema.org tree of all the types of things you can describe with it.

[screenshot: schema.org documentation for a type]

This kind of simple documentation makes it easier to implement.

Why use Schema.org?

There are lots of reasons to use schema.org:

  - Growing implementation base
  - Software implementations (CMS)
  - When the data is out there, others will come along, discover it, and use it.

Improve Discoverability on the Open Web

But the main reason right now is to improve the discoverability of your services and collections on the open web. There are many facets to how to improve discoverability. In part it means improving things in Google.

Rich Snippets

[screenshot: Google rich snippet for a recipe page]

The way Google has promised to use embedded semantic markup with schema.org is for what it calls rich snippets.

You've probably already seen search results snippets in Google like this. There's a lot of information about this recipe page. You see an image of a cupcake, the number of reviews and stars, how long it takes to cook, and the number of calories. It even includes a list of some of the ingredients.

And you can imagine how the click-through rate on a search snippet like this could be higher than on a plain one. That's the main reason folks are currently using this markup.

Library Examples

So I'll show you some examples of how we've implemented embedded semantic markup and schema.org at NCSU Libraries, to give you an idea of how this can apply to libraries and digital collections, and then a couple of ideas on how we might see it being used in the future.

[screenshot: NCSU Libraries Rare and Unique Digital Collections page showing a Grove Arcade drawing]

Here's a page for the NCSU Libraries Rare and Unique Digital Collections site. And you can see a drawing of Grove Arcade here.

A Grove Arcade Drawing is a http://schema.org/CreativeWork

[screenshot: the same page with its CreativeWork markup exposed]

There's some hidden embedded semantic markup which says: The thing on this page is a CreativeWork with a name and here's an image of it.
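As a sketch of what that hidden markup amounts to (this is illustrative, not the site's actual HTML, and the image path is hypothetical):

<div itemscope itemtype="http://schema.org/CreativeWork">
  <h1 itemprop="name">Grove Arcade drawing</h1>
  <!-- hypothetical image path -->
  <img itemprop="image" src="grove-arcade.jpg" alt="Grove Arcade drawing">
</div>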

Item Information

[screenshot: item metadata on the page]

If we go further down the page we can see a bunch more metadata about this resource.

Embedded Item Information

[screenshot: the item metadata with the embedded annotations exposed]

And if we expose the annotations we can see that we're communicating more information about the resource.

Building Information

[screenshot: building information on the page]

And again if we go further down the page still, we can see some information about the building that's in the drawing.

Embedded Building Information

[screenshot: the building information with the embedded annotations exposed]

We can see that this page is packed with information about the building.

Rich Snippets: Video


Third result in Google video search for "bug sprays and pets."


Second result in Google for "jim hunt future farmers".

I've had the most success with getting rich snippets to display for video resources.

These search results have a video thumbnail, the duration and a bit from the transcript of the video if there is one.

And you can see again how having this extra information can make a particular search result stand out and be more likely to be clicked on. So it improves discoverability.

Future Possibilities

I want to suggest now some future possibilities of where this might be going and how it can benefit libraries.

Answers instead of
Search Results

There is a more general trend for search engines to give answers instead of just a list of search results. You may have seen results like this in Google already. You can see images of Alan Alda, but also some more structured data about him which is what you might have been looking for.

While the data is currently taken from Wikipedia, Freebase, and some other standard sources, I'd expect more answers to start being sourced from the embedded semantic markup that gets published.

[screenshot: Google answer panel for Alan Alda]

We already get something similar showing up for the D. H. Hill Library. It shows an image of the library as well as placing it on a map. You can see the address and hours.

[screenshot: Google answer panel for D. H. Hill Library]

Who is familiar with Google Now on Android?

It is like an automatic personal assistant that learns about your habits and gives you helpful information based on your current context. If you enable Google Now you'll see information about how long it would take for you to get from home to work on the next bus. You can think of this as personalized, predictive, contextual search.

This is totally speculative about where it could go, but wouldn't it be cool if it showed students the Libraries' hours for the day, when and where their study group is meeting, and what events are happening in the library? This is the kind of thing that becomes possible when lots of this data is published on the Web and combined with the data from the device.

I think there's a whole range of new services we can enable by making our data available in this open way. And they're likely to be ideas that we ourselves won't have.

Save the Time of the Reader

We can be doing more to save the time of the reader.

Schema.org Actions in Gmail

[screenshots: schema.org action buttons in a Gmail inbox]

Here's one other way we could be using this that would work now.

Using schema.org actions in email markup, you can trigger action buttons right in someone's Gmail inbox. If you email yourself a book citation, you can be offered a button to request the book. Or if your book is coming due soon, you can have a button to quickly renew it right from your inbox.
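As a rough sketch of the idea, here's schema.org's EmailMessage and Action pattern in Microdata, embedded in the body of an email. The exact markup Gmail accepts is defined in Google's developer documentation, and the renewal URL here is hypothetical:

<div itemscope itemtype="http://schema.org/EmailMessage">
  <div itemprop="potentialAction" itemscope
    itemtype="http://schema.org/ViewAction">
    <!-- hypothetical renewal URL -->
    <link itemprop="target" href="https://lib.example.edu/renew/12345">
    <meta itemprop="name" content="Renew book">
  </div>
</div>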

Embedded Semantic Markup +
a Web-scale Vocabulary =
The Semantic Web?

Yes

No

So is this the semantic web? I tend to think this is one good path towards it.

Research

Instead of that philosophical question, I've got some questions that may be a bit more practical.

Research Questions

Are academic institutions publishing embedded semantic markup?

Are academic libraries?

What kind of data are they publishing?

What syntaxes are they using?

What vocabularies?

How can you even answer these questions then?

In the past I've tried a little bit to ask libraries if they're publishing this kind of data in HTML, but haven't really heard of that many. My reach is only so far, and some libraries may not even know they're publishing data in this way: in some cases a content management system includes it out of the box, or a developer may have just added it without telling anyone.

One way I could solve this problem is to crawl the Web--maybe just from a list of certain domains that I'm interested in--but even that could be expensive and time consuming.

I could maybe go begging one of the big search engines for some information. The search engine Blekko allowed you to submit jobs to extract some information from its corpus, but there was only so much you could do.

And this is the position that many have been left in when wanting to answer these kinds of questions where you need lots of Web pages to get a good answer.

Common Crawl

Common Crawl is a non-profit foundation dedicated to providing an open repository of web crawl data that can be accessed and analyzed by everyone.

http://commoncrawl.org/

This is where the Common Crawl comes in to help. (Read slide.) They crawl a big portion of the highest-ranked pages on the internet and make the crawl data freely available.

So we really are talking big web data here.

This is really important and exciting. Before the Common Crawl came along you would have had to work for one of the big search engines to have this kind of access to this data. There's so much knowledge out there and now there's a way for many more to access that data.

They don't crawl as much as Google, but it is still a lot.

Uses of the Common Crawl

The best part is that the data is free to access. Creating an Amazon EC2 job to parse it all will cost you money, but certainly not as much as crawling the Web on your own would. They'll have saved you a lot of money.

Lots of startups are using the Common Crawl to try out business ideas. Instead of having to conduct your own crawl before trying out an idea, there's a ready crawl corpus to play with and test an idea.

One cool use you can read about is how the company SwiftKey is using the Common Crawl. SwiftKey creates alternative keyboards for Android phones and tablets that make typing faster and more accurate. One of the features they have is word prediction. They're using the Common Crawl as a large text corpus to analyze in order to improve their prediction algorithms for the 60+ languages they support.

You can also see the results of the Norvig Web Data Science Awards, where folks tried out ideas like associating concepts on the Web through word co-occurrence within sentences across this large corpus.

The founder of the Common Crawl also points out the effect this could have on teaching Big Data skills in colleges.

How to get to the Embedded Semantic Markup?

Even as easy as the Common Crawl makes it to get to the data I'm interested in, it would still be some work and come at some expense. So how can I get to the embedded semantic markup without a lot of work?

Web Data Commons

Extracting Structured Data from the Common Crawl

Domains with Triples 2,286,277
Typed Entities 1,811,471,956
Triples/Statements 7,350,953,995

Percentage of the Common Crawl corpus with embedded structured data: 12.3%

Cost to extract the data from the Common Crawl: $398

http://webdatacommons.org/

Luckily the Web Data Commons has done the work of extracting structured data from the Common Crawl for me. It parses the whole Common Crawl corpus and extracts all of the embedded semantic markup as RDF triples. It makes all the data available for free.

Again we're talking some Big Data here with over 7 billion triples.

Even so, the size of this data set is much more approachable to just download and play with. So now we've got a better way to get at our data.

What's an N-Quad?

_:node6eecc231551a72e90e7efb3dc3fc26
<http://schema.org/Photograph/name>
"Mary Travers singing live on stage"
<http://d.lib.ncsu.edu/collections/catalog/0228376> .

Subject Predicate Object Context

An N-Quad is an RDF statement that also includes a context piece at the end. Context is the URL of the HTML page from which the data was extracted.

The line-based format makes it easier to do some rough parsing.

You might know that an RDF triple has three parts. Web Data Commons publishes its data as N-Quads. So what's an N-Quad? (Read slide.)
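As a toy demonstration of that rough parsing, here's how the quad above splits into its four parts in Python. Whitespace-splitting like this works for simple statements but not for every literal, so real work should use an RDF library such as rdflib:

line = ('_:node6eecc231551a72e90e7efb3dc3fc26 '
        '<http://schema.org/Photograph/name> '
        '"Mary Travers singing live on stage" '
        '<http://d.lib.ncsu.edu/collections/catalog/0228376> .')

# split off subject and predicate from the left,
# then object, context, and the trailing dot from the right
subject, predicate, rest = line.split(" ", 2)
obj, context, _dot = rest.rsplit(" ", 2)
print(context)  # the page the statement was extracted from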

Summary Methodology

  1. Grab all of the Web Data Commons extracted N-Quads (7,350,953,995 of them) from the August 2012 Common Crawl corpus.
  2. Use command-line tools (cat & grep) to boil things down to just the N-Quads that contain ".edu" somewhere, anywhere (see the sketch after this list).
  3. Analyze all of these RDF statements and output CSV.
  4. Index in Solr and view with Blacklight.
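Here's a rough sketch of the kind of filtering in step 2, written in Python rather than cat and grep. This is an assumption for illustration--not my actual scripts--and the local file layout is hypothetical. It keeps statements that mention ".edu" anywhere, and then just those whose context (the fourth field) is an .edu page:

import glob
import gzip

with open("edu-context.nq", "w", encoding="utf-8") as out:
    for path in glob.glob("wdc/*.nq.gz"):  # hypothetical local paths
        with gzip.open(path, "rt", encoding="utf-8", errors="replace") as f:
            for line in f:
                if ".edu" not in line:  # cheap pre-filter, like grep
                    continue
                context = line.rsplit(" ", 2)[-2]  # fourth field of the quad
                if ".edu" in context:
                    out.write(line)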

Extraction code and documentation links: https://ronallo.com/presentations/2013-dlf

The Web Data Commons has a lot of other analysis on their site for the whole set, but I wanted to really just focus on an area closer to home for me.

Caveats

I have not done any calculations to determine what proportion of each site was crawled, so the comparisons here are just raw numbers. A very large university might have lots of crawled pages, yet those pages could be a smaller percentage of its whole website than a smaller raw number is for a university with a smaller site.

The libraries may also have lots of materials that are not under the library subdomain, which would skew those numbers significantly. And I was only looking for subdomains that included "lib.", "library.", and "libraries."; it could be that a lot of academic libraries have their site under a subdirectory or some other subdomain.

Suffice it to say that these are raw numbers and some of the processing was crude. This initial research is a rough cut.

Total Resulting Statements
(N-Quads)

All statements 7,350,953,995
All .edu 12,182,975
.edu context 8,178,985

So again what I started with was definitely big data for me.

I was able to pare that down to just over 12 million that include ".edu" somewhere in the statement. Crude but it worked well enough.

And the final number--8 million--is just those quads where the context--the page the triple was extracted from--included .edu in the domain name.

A much more manageable data set to deal with.

Research Questions

Are academic institutions publishing embedded semantic markup?

Are academic libraries?

What kind of data are they publishing?

What syntaxes are they using?

What vocabularies?

So now we're ready to ask our research questions again.

So with this kind of data now freely and easily available, I think we can begin to be a bit more scientific in how we're looking at our metadata on the Web. I could see something similar to what Roy Tennant does at OCLC where he reports on MARC tag usage. When we're making decisions about how to proceed, this data can help us make informed decisions.

NC State Univ. Peer Institutions

Instead of looking at everything even in that smaller 8 million statement set, I wanted to start off just with a comparison with the official peer institutions of NC State University where I work. Many of these institutions are DLF members.

Common Crawl URLs

[table: Common Crawl URL counts for NC State peer institutions and their libraries]

I looked these up using the Common Crawl URL index. We can see that all the universities had pages crawled by the Common Crawl. Some did not have any of their library pages crawled at all.

You can see that Virginia Tech, a DLF member, had the most library pages included in the crawl.

So one thing we could do is compare the library websites for those that were and weren't crawled and see if there are any differences. Is there a robots.txt that's preventing the Common Crawl robot from crawling? This could be one area of future research based on this data.
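For example, a robots.txt like the following would keep Common Crawl's crawler out of a site entirely (CCBot is the user agent the Common Crawl crawler identifies itself as):

# keep Common Crawl's crawler out of the whole site
User-agent: CCBot
Disallow: /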

% of Common Crawl URLs w/ Embedded Semantic Markup

psu.edu 18.35%
illinois.edu 8.87%
ucdavis.edu 5.58%
osu.edu 3.87%
rutgers.edu 3.45%
arizona.edu 1.97%
msu.edu 1.67%
tamu.edu 1.58%
ufl.edu 1.51%
gatech.edu 1.20%
ncsu.edu 1.19%
iastate.edu 0.71%
wisc.edu 0.69%
umd.edu 0.35%
colostate.edu 0.26%
purdue.edu 0.20%
vt.edu 0.16%

One simple thing to figure out was what percentage of the URLs in the Common Crawl included embedded semantic markup.

One of the things to note about this slide is that some of the URLs included in the Common Crawl are from non-HTML documents like PDFs and XML documents. Neither of those would have any embedded semantic markup.

We can see that Penn State has a high proportion of their crawled URLs that include at least some embedded semantic markup.

What kind of content is psu.edu describing?

So what kind of stuff is Penn State describing? hCard marks up people, companies, and organizations. hCalendar is for publishing events. The Open Graph Protocol is what Facebook promotes. geo marks up latitude/longitude pairs.

Web Data Commons
Library Contexts

When I looked for academic libraries, the pattern I was matching on was subdomains beginning with "lib.", "library.", or "libraries.". If your library lives at a different kind of URL I would have missed it--and I'd be interested in learning what it is.

Web Data Commons Library Triples and Unique Contexts

[chart: library triples and unique contexts by peer institution]

Again, some of these institutions had no contexts for their libraries, so they have been removed. Which of the peer institutions' libraries are publishing the most data? While Ohio State University publishes the most triples, NCSU has slightly more pages that include this markup.

This chart compares the number of triples with a library page as the context to the number of unique contexts. So while the OSU library published a lot of triples, they all came from just 138 pages.

General Academic Institution Stats

So let's take a broader look at some more basic stats on how all academic institutions are using embedded semantic markup.

Syntaxes Used by Academic Institutions

mf-hcard 5,854,493
rdfa 1,337,528
mf-hcalendar 770,228
mf-xfn 456,184
microdata 285,296
mf-geo 51,565
mf-hresume 5,363
mf-hreview 2,908
mf-hlisting 48
Most of what's being published is in the hCard Microformat syntax: information about people, places, and organizations.

Schema.org Types Used by Academic Institutions

LocalBusiness 20,565
PostalAddress 17,267
CollegeOrUniversity 11,554
Organization 10,172
WebPage 8,846
Article 8,351
BlogPosting 3,511
Person 2,071
Event 1,539
EducationalOrganization 1,508
CollectionPage 1,275
Blog 991
University 550
CollegeorUniversity 420
Review 372
NewsArticle 316
Place 313
ScholarlyArticle 298
SportsEvent 289
Thing 168

This is just counting the number of triples that include that schema.org type. One interesting thing is that this is almost exclusively in the Microdata syntax. It wasn't until later that the schema.org partners said they'd support RDFa Lite.

Note that ScholarlyArticle is represented a little, but Book, Photograph, and CreativeWork aren't among the top types. So either the places where we publish these kinds of things aren't being crawled, or we're not yet marking them up.

What about digital collections?

Have a digital collection with a sitemap? Please, go to this URL right now:

http://go.ncsu.edu/sitemap

But where are digital collections? The URLs for these sites tend to vary more, so I'm asking for folks to submit their sitemaps. Please help out my research and submit your digital collections sitemaps at this URL.

Digital Collections at NCSU:
Rare & Unique Materials

  1. http://d.lib.ncsu.edu/collections/
  2. http://d.lib.ncsu.edu/collections/catalog/0228376
  3. http://d.lib.ncsu.edu/collections/catalog/bh2127pnc001
  4. http://d.lib.ncsu.edu/collections/catalog/unccmc00145-002-ff0003-002-004_0002

(Items 2-4 are, respectively: Mary Travers singing, the Webb-Barron-Wells House, and an American Dry Cleaning building drawing.)

In looking closer at just the NCSU digital collections I see just these 4 URLs included as contexts. So these are the only NCSU digital collections pages with embedded semantic markup that were crawled. Many more pages have it--and many more interesting resources, in my opinion, in the collection as well--but these are the only ones that had enough PageRank to be included in the crawl in August of 2012.

2013 Crawl

It appears that the Common Crawl is getting ready to release some new data. And I hope to reproduce some of this work then and see if the situation has improved in the past year.

Libraries as Producers and Consumers of Big Web Data

Libraries as Producers

Libraries as Consumers

So I think we can have a part to play both as producers of this data and as consumers of it.

So that's what others might do with this data, but what could libraries do?

It could enable new services.

This might help us identify data sets we'd like to preserve.

Open Research

Code, documentation, data sets, and slides with speaker notes: http://ronallo.com/presentations/2013-dlf

Please submit your digital collection sitemaps: http://go.ncsu.edu/sitemap

One of my favorite things about this kind of research is that all of the data is totally in the open and any of the results can be independently confirmed. So here are links to my code and to the resulting data.

Credits

Jason Ronallo

@ronallo

http://ronallo.com

Please submit your digital collection sitemaps: http://go.ncsu.edu/sitemap

Code, documentation, data sets, and slides with speaker notes: http://ronallo.com/presentations/2013-dlf

Feel free to follow up with me about any of this through Twitter or email. Tweet me your questions about this data and I will put together an answer. Check out my blog at ronallo.com. Please submit your sitemap. You can find all the code, documentation, data sets and slides.