Embedded Semantic Markup, schema.org, the Common Crawl, and Web Data Commons: What Big Web Data Means for Libraries and Archives

Jason Ronallo
NCSU Libraries

Slides: https://ronallo.com/presentations/2013-dlf

Hi. I'm Jason Ronallo, the Associate Head of Digital Library Initiatives at NCSU Libraries. Much of the work that I've done has been with digital special collections, especially improving the discoverability of these collections on the open Web.

I'm going to give an introduction to each of these pieces of the title, what they are, and how we can use them, and then I'm going to show a little of the original research that I'm doing. So let's jump into it.

How Search Engines Work

  1. Robots crawl the Web
  2. Process and index crawl data
  3. Try to answer search queries with the most relevant results

friendly robot 

But first I'd like to talk for a moment about how search engines work. Robots crawl the web. They process and index crawl data. Finally they try to get folks to relevant pages that match search queries. It is that step #2 that we're going to focus on today. There's a lot in that. The point I'd like to make about it now is that search engines have begun to reach the limits of what they can do with natural language processing alone. The problem is that there's a limit to what meaning you can pull out just from HTML tags and the text content.
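To make steps 2 and 3 a little more concrete, here is a toy sketch of a keyword index. It is only an illustration and not how any real search engine works: the two pages and their example.org URLs are made up, and the function names build_index and search are my own.

```python
import re
from collections import defaultdict


def build_index(pages):
    """Map each lowercased word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(url)
    return index


def search(index, query):
    """Return the URLs that contain every word of the query (no ranking)."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results


# Hypothetical crawl data: URL -> text pulled out of the page.
PAGES = {
    "http://example.org/a": "Jason Ronallo is the Associate Head of Digital Library Initiatives",
    "http://example.org/b": "The head of the river is north of the library",
}

index = build_index(PAGES)
print(sorted(search(index, "head library")))
```

Both pages match the query "head library", because an index of bare words cannot tell a job title from the head of a river. That is the kind of limit I mean.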


Semantics in HTML helps to solve this.

Semantics in HTML

  <ol>
    <li>First item</li>
    <li>Second item</li>
    <li>Third item</li>
  </ol>

There has always been some semantics in HTML. Here's a simple ordered list. We can't tell much from it, but we do know that these things are ordered.

HTML5 Semantics

HTML5 has added a bunch of new semantic elements like the article element. This can let us pick the article out of a page for a distraction-free reading experience.
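As a rough sketch of what picking the article out of a page could look like, here is a small example using Python's standard html.parser module. The sample page and the names ArticleText and article_text are assumptions for illustration; real reader-mode tools do a good deal more than this.

```python
from html.parser import HTMLParser


class ArticleText(HTMLParser):
    """Collect only the text that appears inside <article> elements."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # how many <article> elements we are currently inside
        self.chunks = []  # pieces of text found inside an article

    def handle_starttag(self, tag, attrs):
        if tag == "article":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "article" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0 and data.strip():
            self.chunks.append(" ".join(data.split()))


def article_text(html):
    parser = ArticleText()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical page: navigation, an article, and a footer.
PAGE = """
<body>
  <nav><a href="/">Home</a> <a href="/about">About</a></nav>
  <article>
    <h1>Finding Aids on the Open Web</h1>
    <p>Search engines are how most people reach our collections.</p>
  </article>
  <footer>Contact us</footer>
</body>
"""

print(article_text(PAGE))
```

The navigation and footer text are dropped because they sit outside the article element.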

Trapped Knowledge

But that still doesn't tell us much about what the content is about. There's still a lot of knowledge trapped in HTML pages that's difficult to get out.

Embedded Semantic Markup

And that's where embedded semantic markup helps.

Embedded Semantic Markup Example

Jason Ronallo is the Associate Head of Digital Library Initiatives at NCSU Libraries.

Here's a simple statement. It is easy for us as humans to know what this means, but you can imagine how much more complex it would be to try to instruct a computer to pull out these same pieces of data, especially if this was within a longer text.

There's actually some embedded semantic markup on this page. You can't see it?

Embedded Semantic Markup Is Hidden Annotations Meant for Machines

friendly robot 

Well that's because embedded semantic markup is a bunch of hidden annotations on the page meant for machines.

Embedded Semantic Markup Exposed

Person has the properties name, url, jobTitle, and affiliation. The affiliation is with a Library that has a name and url.

Here's the embedded semantic markup exposed. You can see that this whole thing describes a Person who has some properties like a name, url, jobTitle, and affiliation. Broken down this way, the statement is easy for machines to make sense of.

Embedded Semantic Markup Structure


Embedded Semantic Markup HTML

<span itemscope itemtype="http://schema.org/Person">
  <a itemprop="url" href="http://twitter.com/ronallo">
    <span itemprop="name">Jason Ronallo</span>
  </a> is the <span itemprop="jobTitle">
    Associate Head of Digital Library
    Initiatives</span> at
  <span itemprop="affiliation" itemscope
    itemtype="http://schema.org/Library">
    <span itemprop="name">
      <a itemprop="url" href="http://lib.ncsu.edu">
       NCSU Libraries</a>
    </span>
  </span>.
</span>

Here's our example HTML. I'm using the Microdata syntax for the embedded semantic markup. I won't get into the particulars, but you can see that there are some extra attributes like itemscope, itemtype, and itemprop added to the HTML.

JSON Serialization

{"items": [
    { "type": [ "http://schema.org/Person" ],
      "properties": {
        "url": [ "http://twitter.com/ronallo" ],
        "name": [ "Jason Ronallo" ],
        "jobTitle": [ "Associate Head of Digital Library Initiatives" ],
        "affiliation": [
          { "type": [ "http://schema.org/Library" ],
            "properties": {
              "name": [ "NCSU Libraries" ],
              "url": [ "http://lib.ncsu.edu/" ]
            }
          }
        ]
      }
    }
]}

You can extract the embedded semantic markup and serialize it as JSON.
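If you're curious what that extraction step might look like, here is a minimal sketch built on Python's standard html.parser. It is a simplification under several assumptions: it is not the full Microdata algorithm, it ignores itemref and itemid, it does not resolve relative URLs, and it treats each itemprop as a single name. The class name MicrodataSketch and the function extract_microdata are my own.

```python
import json
from html.parser import HTMLParser

# Elements that never get an end tag, so they are not kept on the stack.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class MicrodataSketch(HTMLParser):
    """Simplified Microdata extraction; not the full specification algorithm."""

    def __init__(self):
        super().__init__()
        self.items = []  # top-level items found so far
        self.stack = []  # one frame per open element

    def _current_item(self):
        for frame in reversed(self.stack):
            if frame["item"] is not None:
                return frame["item"]
        return None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        parent = self._current_item()
        prop = attrs.get("itemprop")
        frame = {"tag": tag, "item": None, "prop": None, "text": []}
        if "itemscope" in attrs:
            # A new item: either a property value of its parent or top-level.
            item = {"type": (attrs.get("itemtype") or "").split(),
                    "properties": {}}
            frame["item"] = item
            if prop and parent is not None:
                parent["properties"].setdefault(prop, []).append(item)
            else:
                self.items.append(item)
        elif prop and parent is not None:
            if tag in ("a", "area", "link") and attrs.get("href"):
                parent["properties"].setdefault(prop, []).append(attrs["href"])
            elif tag == "meta":
                parent["properties"].setdefault(prop, []).append(
                    attrs.get("content") or "")
            else:
                frame["prop"] = (parent, prop)  # value is the element's text
        if tag not in VOID:
            self.stack.append(frame)

    def handle_data(self, data):
        # Text belongs to every open element that is a text-valued property.
        for frame in self.stack:
            if frame["prop"]:
                frame["text"].append(data)

    def handle_endtag(self, tag):
        # Find the nearest open element with this name; ignore stray end tags.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i]["tag"] == tag:
                break
        else:
            return
        while len(self.stack) > i:
            frame = self.stack.pop()
            if frame["prop"]:
                owner, name = frame["prop"]
                value = " ".join("".join(frame["text"]).split())
                owner["properties"].setdefault(name, []).append(value)


def extract_microdata(html):
    parser = MicrodataSketch()
    parser.feed(html)
    parser.close()
    return {"items": parser.items}


SAMPLE_HTML = """
<span itemscope itemtype="http://schema.org/Person">
  <a itemprop="url" href="http://twitter.com/ronallo">
    <span itemprop="name">Jason Ronallo</span>
  </a> is the <span itemprop="jobTitle">
    Associate Head of Digital Library
    Initiatives</span> at
  <span itemprop="affiliation" itemscope itemtype="http://schema.org/Library">
    <span itemprop="name">
      <a itemprop="url" href="http://lib.ncsu.edu">NCSU Libraries</a>
    </span>
  </span>.
</span>
"""

print(json.dumps(extract_microdata(SAMPLE_HTML), indent=2))
```

The result has the same shape as the JSON on the slide, though the order of the properties can differ and the library URL comes out exactly as written in the href.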

RDF (Turtle)

@prefix md: <http://www.w3.org/ns/md#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfa: <http://www.w3.org/ns/rdfa#> .
@prefix schema: <http://schema.org/> .

<> md:item ( [ a schema:Person;
                schema:affiliation [ a schema:Library;
                        schema:name "NCSU Libraries";
                        schema:url <http://lib.ncsu.edu> ];
                schema:jobTitle "Associate Head of Digital Library Initiatives";
                schema:name "Jason Ronallo";
                schema:url <http://twitter.com/ronallo> ] );
    rdfa:usesVocabulary schema: .
Or serialize to some RDF representation.
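Going from the JSON form to RDF can be sketched in a few lines too. This example writes N-Triples rather than Turtle because it is simpler to generate by hand. It rests on assumptions that are mine: every property name is taken to belong to the http://schema.org/ vocabulary, any string starting with http:// or https:// is treated as an IRI, and the md:item list and rdfa:usesVocabulary statements from the slide are left out. The function names are hypothetical.

```python
import itertools

SCHEMA = "http://schema.org/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def escape_literal(text):
    """Escape the characters that N-Triples string literals cannot hold raw."""
    return (text.replace("\\", "\\\\").replace('"', '\\"')
                .replace("\n", "\\n").replace("\r", "\\r"))


def item_to_triples(item, counter, triples):
    """Append N-Triples lines for one item and return its blank node label."""
    node = f"_:b{next(counter)}"
    for type_iri in item.get("type", []):
        triples.append(f"{node} <{RDF_TYPE}> <{type_iri}> .")
    for name, values in item.get("properties", {}).items():
        predicate = f"<{SCHEMA}{name}>"  # assumption: schema.org vocabulary
        for value in values:
            if isinstance(value, dict):
                obj = item_to_triples(value, counter, triples)  # nested item
            elif value.startswith(("http://", "https://")):
                obj = f"<{value}>"  # heuristic: looks like a URL, so an IRI
            else:
                obj = f'"{escape_literal(value)}"'
            triples.append(f"{node} {predicate} {obj} .")
    return node


def items_to_ntriples(data):
    counter = itertools.count()
    triples = []
    for item in data.get("items", []):
        item_to_triples(item, counter, triples)
    return "\n".join(triples)


# The same data as the JSON serialization shown earlier.
DATA = {"items": [
    {"type": ["http://schema.org/Person"],
     "properties": {
         "url": ["http://twitter.com/ronallo"],
         "name": ["Jason Ronallo"],
         "jobTitle": ["Associate Head of Digital Library Initiatives"],
         "affiliation": [
             {"type": ["http://schema.org/Library"],
              "properties": {
                  "name": ["NCSU Libraries"],
                  "url": ["http://lib.ncsu.edu/"]}}]}}]}

print(items_to_ntriples(DATA))
```

Each item becomes a blank node, which matches the square brackets in the Turtle above.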

Types of Embedded Semantic Markup

These are the different syntaxes that are most commonly used for embedded semantic markup. The example that I've shown is in the Microdata syntax. I mention them now since we'll see these again when we get to the research I've done.

Why use embedded semantic markup?

Embedded semantic markup is a syntax for structuring data in HTML when you need to communicate unambiguously with machines. Since your eyes are more often on the web site, it can be better than trying to keep your data in sync with some external XML serialization. These syntaxes also allow us the chance to go from rich metadata to rich embedded data.

That last point is something I'd like to stress. Too often in libraries when we're looking to exchange data or expose it on the Web we dumb it down. (OAI-PMH led to a lot of this.) You might also remember folks trying to put Dublin Core into the header of HTML.

We often have very rich metadata for the resources we describe in our databases.

Using embedded semantic markup like Microdata or RDFa Lite allows us to expose richer metadata with more structure through HTML.

The End of Dumbed Down Metadata

I hope we're nearing the end of dumbed down metadata.

Vocabularies for Understanding

Embedded semantic markup alone only gives the syntax and isn't useful on its own. We also need vocabularies so that we understand each other and get past dumbed down metadata.


This is where a vocabulary like schema.org comes in to help. There are all kinds of vocabularies that can be used for describing the content of Web pages, but I'll focus on schema.org for its ease of use, particular use cases, and growing implementation base.


[Read slide.] Yes, you can even describe a Volcano, which peculiarly has a property for phone number.


Here's the schema.org tree of all the types of things you can describe with it.


This kind of simple documentation makes it easier to implement.

Why use Schema.org?

There are lots of reasons to use schema.org:

  - Growing implementation base
  - Software implementations (CMS)
  - When the data is out there, others will come along, discover it, and use it.

Improve Discoverability on the Open Web

But the main reason right now is to improve the discoverability of your services and collections on the open web. There are many facets to how to improve discoverability. In part it means improving things in Google.

Rich Snippets


The way Google has promised to use embedded semantic markup with schema.org is for what it calls rich snippets.

You've probably already seen search results snippets in Google like this. There's a lot of information about this recipe page. You see an image of a cupcake, the number of reviews and stars, how long it takes to cook, and the number of calories. It even includes a list of some of the ingredients.

And you can imagine how the click-through rates on a search snippet like this could be higher than on a normal one. This is the main reason folks are currently using it.

Library Examples

So I'll show you some examples of how we've implemented embedded semantic markup and schema.org at NCSU Libraries and then show a couple ideas on how we might see it being used in the future.

That should give you an idea of how this can apply to libraries and digital collections.