Jason Ronallo
NCSU Libraries
@ronallo
Slides: https://ronallo.com/presentations/2013-dlf
Hi. I'm Jason Ronallo, the Associate Head of Digital Library Initiatives at NCSU Libraries. Much of the work I've done has been with digital special collections, especially improving the discoverability of these collections on the open Web.
I'm going to give an introduction to each of these pieces of the title, what they are, and how we can use them, and then I'm going to show a little of the original research that I'm doing. So let's jump into it.
<ol>
<li>First item</li>
<li>Second item</li>
<li>Third item</li>
</ol>
<nav></nav>
<header></header>
<article>
<section></section>
<section></section>
</article>
<footer></footer>
Jason Ronallo is the Associate Head of Digital Library Initiatives at NCSU Libraries.
Here's a simple statement. It is easy for us as humans to know what this means, but you can imagine how much more complex it would be to try to instruct a computer to pull out these same pieces of data, especially if this was within a longer text.
There's actually some embedded semantic markup on this page. You can't see it?
Person has the properties name, url, jobTitle, and affiliation. The affiliation is with a Library that has a name and url.
<span itemscope itemtype="Person">
<a itemprop="url"
href="http://twitter.com/ronallo">
<span itemprop="name">Jason Ronallo</span>
</a> is the <span itemprop="jobTitle">
Associate Head of Digital Library
Initiatives</span> at
<span itemprop="affiliation" itemscope
itemtype="Library">
<span itemprop="name">
<a itemprop="url" href="http://lib.ncsu.edu">
NCSU Libraries</a>
</span>
</span>.
</span>
{"items": [
{ "type": [ "http://schema.org/Person" ],
"properties": {
"url": [ "http://twitter.com/ronallo" ],
"name": [ "Jason Ronallo" ],
"jobTitle": [ "Associate Head of Digital Library Initiatives" ],
"affiliation": [
{ "type": [ "http://schema.org/Library" ],
"properties": {
"name": [ "NCSU Libraries" ],
"url": [ "http://lib.ncsu.edu/" ]
}
}
]
}
}
]
}
@prefix md:     <http://www.w3.org/ns/md#> .
@prefix rdf:    <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfa:   <http://www.w3.org/ns/rdfa#> .
@prefix schema: <http://schema.org/> .

<> md:item ( [ a schema:Person;
               schema:affiliation [ a schema:Library;
                                    schema:name "NCSU Libraries";
                                    schema:url <http://lib.ncsu.edu> ];
               schema:jobTitle "Associate Head of Digital Library Initiatives";
               schema:name "Jason Ronallo";
               schema:url <http://twitter.com/ronallo> ] );
   rdfa:usesVocabulary schema: .
That last point is something I'd like to stress. Too often in libraries, when we're looking to exchange data or expose it on the Web, we dumb it down. (OAI-PMH led to a lot of this.) You might also remember folks trying to put Dublin Core into the header of HTML.
We often have very rich metadata for the resources we describe in our databases.
Using embedded semantic markup like Microdata or RDFa Lite allows us to expose richer metadata with more structure through HTML.
The way Google has promised to use embedded semantic markup with schema.org is for what it calls rich snippets.
You've probably already seen search results snippets in Google like this. There's a lot of information about this recipe page. You see an image of a cupcake, the number of reviews and stars, how long it takes to cook, and the number of calories. It even includes a list of some of the ingredients.
And you can imagine how the click-through rates on a search snippet like this could be higher than on a normal one. That's the main reason folks are currently using this. So I'll show you some examples of how we've implemented embedded semantic markup and schema.org at NCSU Libraries, and then show a couple of ideas on how we might see it being used in the future.
To give you an idea of how this can apply to libraries and digital collections, here are a couple of our search results.
Third result in Google video search for "bug sprays and pets."
Second result in Google for "jim hunt future farmers".
I've had the most success with getting rich snippets to display for video resources.
These search results have a video thumbnail, the duration and a bit from the transcript of the video if there is one.
And you can see again how having this extra information can make a particular search result stand out and be more likely to be clicked on. So it improves discoverability.
There is a more general trend for search engines to give answers instead of just a list of search results. You may have seen results like this in Google already. You can see images of Alan Alda, but also some more structured data about him, which is what you might have been looking for.
While the data is currently taken from Wikipedia, Freebase, and some other standard sources, I'd expect that more answers would start being sourced from the embedded semantic markup that gets published.
Who is familiar with Google Now on Android?
It is like an automatic personal assistant that learns about your habits and gives you helpful information based on your current context. If you enable Google Now you'll see information about how long it would take for you to get from home to work on the next bus. You can think of this as personalized, predictive, contextual search.
This is totally speculative about where this could go, but wouldn't it be cool if it showed students the Libraries' hours for the day, when and where their study group is meeting, and what events are happening in the library? This is the kind of thing which is possible when lots of this data is published on the Web and combined with the data from the device.
I think there's a whole range of new services that we can enable by making our data available in this open way. And they're likely to be ideas that we wouldn't have come up with ourselves.
Here's one other way we could be using this that would work now.
Using schema.org actions in email, you can trigger action buttons right in someone's Gmail inbox. If you email yourself a book citation, you can be offered a button to request the book. Or if your book is coming due soon, you can have a button to quickly renew it right from your inbox.
Are academic institutions publishing embedded semantic markup?
Are academic libraries?
What kind of data are they publishing?
What syntaxes are they using?
What vocabularies?
In the past I've tried a bit to ask libraries whether they're publishing this kind of data in HTML, but I haven't heard of that many doing it. My reach is only so far, and some libraries may not even know they're publishing data in this way. In some cases a content management system includes it, or a developer may have added it without telling anyone.
There might be times when a library doesn't even know it is publishing this kind of data because the CMS or other system just does it.
One way I could solve this problem is to crawl the Web myself, maybe just from a list of domains I'm interested in, but even that could be expensive and time-consuming.
I could maybe go begging one of the big search engines for some information. The search engine Blekko let you submit jobs to extract some information from its corpus, but there was only so much you could do.
And this is the position that many have been left in when wanting to answer these kinds of questions where you need lots of Web pages to get a good answer.
This is where the Common Crawl comes in to help. (Read slide.) They crawl a big portion of the highest ranked pages on the internet and make it freely available.
So we really are talking big web data here.
This is really important and exciting. Before the Common Crawl came along you would have had to work for one of the big search engines to have this kind of access to this data. There's so much knowledge out there and now there's a way for many more to access that data.
They don't crawl as much as Google, but it is still a lot.
The best part is that if you want to use this data, it is free to access. Creating an Amazon EC2 job to parse it all will cost you money, but certainly not as much as crawling the Web on your own would. The Common Crawl will have saved you a lot of money.
Lots of startups are using the Common Crawl to try out business ideas. Instead of having to conduct your own crawl first, there's a ready corpus to play with and test an idea against.
One cool use you can read about is how the company SwiftKey is using the Common Crawl. SwiftKey creates alternative keyboards for Android phones and tablets that make typing faster and more accurate. One of the features they have is word prediction. They're using the Common Crawl as a large text corpus to analyze in order to improve their algorithms for the 60+ languages they support.
You can also see the results of the Norvig Web Data Science Awards, where folks tried out ideas like associating concepts on the Web through sentence word co-occurrence on this large corpus.
The founder of the Common Crawl also points out the effect this could have on teaching Big Data skills in colleges.
http://urlsearch.commoncrawl.org/?q=lib.ncsu.edu
Let's say you didn't want to go through all the work to process the whole data set. Common Crawl has also made available a URL search tool. You can search for a particular subdomain and it will return all pages it has crawled in that subdomain. This can be useful for seeing whether your site has been crawled at all.
It will also return data as JSON that includes information about which segment of the corpus to go to to find that page. This could significantly cut down on the amount of work you need to do to get to just the pages of the domains you're interested in.
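Here's a rough sketch of what a query like that could look like in code. I'm assuming the tool returns one JSON record per line pointing into the crawl segments; the exact response format is an assumption on my part, so adjust the parsing to whatever the service actually returns.

# A rough sketch of hitting the Common Crawl URL Search tool for a single
# subdomain. The newline-delimited JSON response format is an assumption;
# each record is expected to point at the crawl segment holding that page.
import json
import urllib.request

QUERY_URL = "http://urlsearch.commoncrawl.org/?q=lib.ncsu.edu"

with urllib.request.urlopen(QUERY_URL) as response:
    body = response.read().decode("utf-8")

# Treat the response as one JSON object per line; switch to json.loads(body)
# if the service returns a single JSON array instead.
records = [json.loads(line) for line in body.splitlines() if line.strip()]

print("pages crawled under lib.ncsu.edu:", len(records))
for record in records[:5]:
    print(record)  # includes the segment/file info needed to fetch the page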
I've used this URL Search tool some for the research you'll see later.
Extracting Structured Data from the Common Crawl
Domains with Triples | 2,286,277 |
Typed Entities | 1,811,471,956 |
Triples/Statements | 7,350,953,995 |
Percentage of the Common Crawl corpus with embedded structured data? 12.3%
Cost to extract the data from the Common Crawl: $398
Luckily the Web Data Commons has done the work of extracting structured data from the Common Crawl for me. It parses the whole Common Crawl corpus, extracts all of the embedded semantic markup as RDF triples, and makes all the data available for free.
Again we're talking some Big Data here with over 7 billion triples.
Even so, the size of this data set is more approachable to just download and play with. So now we've got a better way to get our data.
_:node6eecc231551a72e90e7efb3dc3fc26
  <http://schema.org/Photograph/name>
  "Mary Travers singing live on stage"
  <http://d.lib.ncsu.edu/collections/catalog/0228376> .
An N-Quad is an RDF statement (a triple) that also includes a context piece at the end. The context is the URL of the HTML page from which the data was extracted.
The line-based format makes it easier to do some rough parsing.
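Here's a rough sketch (not the actual extraction code linked below) of how little it takes to pull the context out of each line: just grab the last angle-bracketed term before the closing period.

# A rough, line-based way to get the context URL from a single N-Quad.
# This is not a full N-Quads parser; it only works for simple lines like
# the one above.
import re

def context_of(nquad_line):
    """Return the context (page URL) of an N-Quad line, or None."""
    match = re.match(r'.*<([^>]+)>\s*\.\s*$', nquad_line)
    return match.group(1) if match else None

example = ('_:node6eecc231551a72e90e7efb3dc3fc26 '
           '<http://schema.org/Photograph/name> '
           '"Mary Travers singing live on stage" '
           '<http://d.lib.ncsu.edu/collections/catalog/0228376> .')

print(context_of(example))
# http://d.lib.ncsu.edu/collections/catalog/0228376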
Extraction code and documentation links: https://ronallo.com/presentations/2013-dlf
I have not done any calculations to determine what proportion of each site was crawled, so the comparisons here are just raw numbers. A very large university site might have lots of crawled pages, yet those pages could be a smaller percentage of its whole website than a smaller raw number is for a university with a smaller site.
The libraries may have lots of materials that are not under the library subdomain, which would skew those numbers significantly. I was also only looking for subdomains that include "lib.", "library.", and "libraries."; it could be that a lot of academic libraries have their sites under a subdirectory or some other subdomain.
All statements | 7,350,953,995 |
All .edu | 12,182,975 |
.edu context | 8,178,985 |
So again what I started with was definitely big data for me.
I was able to pare that down to just over 12 million that include ".edu" somewhere in the statement. Crude but it worked well enough.
And then the final number, 8 million, is just those quads where the context, the page the triple was extracted from, included .edu in the domain name.
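Here's a rough sketch of that two-pass filter over a downloaded Web Data Commons N-Quads file. The file name is a placeholder and the host check is deliberately crude.

# A rough sketch of the two-pass filtering described above. The input file
# name is a placeholder for a downloaded Web Data Commons N-Quads file.
import gzip
from urllib.parse import urlparse

def edu_context(nquad_line):
    """True if the quad's context (the last <...> term) is on a .edu host."""
    end = nquad_line.rfind('>')
    start = nquad_line.rfind('<', 0, end)
    if start == -1 or end == -1:
        return False
    host = urlparse(nquad_line[start + 1:end]).netloc
    return host.endswith('.edu')

all_edu = 0       # quads with ".edu" anywhere in the statement (crude first pass)
edu_contexts = 0  # quads whose context page is on a .edu domain

with gzip.open('ccrdf.html-mdata.nq.gz', 'rt', encoding='utf-8') as quads:
    for line in quads:
        if '.edu' in line:
            all_edu += 1
            if edu_context(line):
                edu_contexts += 1

print(all_edu, edu_contexts)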
A much more manageable data set to deal with.
Are academic institutions publishing embedded semantic markup?
Are academic libraries?
What kind of data are they publishing?
What syntaxes are they using?
What vocabularies?
So now we're ready to ask our research questions again.
So with this kind of data now freely and easily available, I think we can begin to be a bit more scientific in how we look at our metadata on the Web. I could see something similar to what Roy Tennant does at OCLC, where he reports on MARC tag usage. When we're deciding how to proceed, this data can help us make informed decisions.
Looked up using the Common Crawl URL Search index. We can see that all the universities had pages crawled by the Common Crawl. Some did not have any of their library pages crawled at all.
You can see that Virginia Tech, a DLF member, had the most library pages included in the crawl.
So one thing we could do is compare the library websites for those that were and weren't crawled and see if there are any differences. Is there a robots.txt that's preventing the Common Crawl robot from crawling? This could be one area of future research based on this data.
psu.edu | 18.35% | gatech.edu | 1.20% |
illinois.edu | 8.87% | ncsu.edu | 1.19% |
ucdavis.edu | 5.58% | iastate.edu | 0.71% |
osu.edu | 3.87% | wisc.edu | 0.69% |
rutgers.edu | 3.45% | umd.edu | 0.35% |
arizona.edu | 1.97% | colostate.edu | 0.26% |
msu.edu | 1.67% | purdue.edu | 0.20% |
tamu.edu | 1.58% | vt.edu | 0.16% |
ufl.edu | 1.51% |
One simple thing to figure out was what percentage of the URLs in the Common Crawl included embedded semantic markup.
One of the things to note about this slide is that some of the URLs included in the Common Crawl are from non-HTML documents like PDFs and XML documents. Neither of those would have any embedded semantic markup.
We can see that Penn State has a high proportion of their crawled URLs that include at least some embedded semantic markup.
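Here's a rough sketch of how percentages like these could be computed: distinct context URLs per university domain divided by the number of URLs the URL Search tool reports for that domain. The inputs below are illustrative placeholders, not the real counts.

# A rough sketch of computing, per university domain, the share of crawled
# URLs that carried embedded semantic markup.
from collections import defaultdict
from urllib.parse import urlparse

def registered_domain(url):
    # Crude: keep the last two host labels, e.g. d.lib.ncsu.edu -> ncsu.edu.
    return '.'.join(urlparse(url).netloc.split('.')[-2:])

edu_context_urls = [                      # context URLs pulled from the quads
    "http://d.lib.ncsu.edu/collections/catalog/0228376",
    "http://www.lib.ncsu.edu/",
]
crawled_urls_by_domain = {"ncsu.edu": 1000}  # per-domain counts from URL Search

contexts_by_domain = defaultdict(set)
for context in edu_context_urls:
    contexts_by_domain[registered_domain(context)].add(context)

for domain, crawled in sorted(crawled_urls_by_domain.items()):
    with_markup = len(contexts_by_domain[domain])
    print("%s\t%.2f%%" % (domain, 100.0 * with_markup / crawled))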
Again, some of these institutions had no contexts for their libraries, so they have been removed. Which of the peer institutions' libraries are publishing the most data? While Ohio State University publishes the most triples, NCSU has slightly more pages that include this markup.
This compares the number of triples with the library as the context to the number of unique contexts. So while the OSU library had a lot of triples published, they all came from 138 pages.
mf-hcard | 5,854,493 |
rdfa | 1,337,528 |
mf-hcalendar | 770,228 |
mf-xfn | 456,184 |
microdata | 285,296 |
mf-geo | 51,565 |
mf-hresume | 5,363 |
mf-hreview | 2,908 |
mf-hlisting | 48 |
LocalBusiness | 20,565 | CollectionPage | 1,275 |
PostalAddress | 17,267 | Blog | 991 |
CollegeOrUniversity | 11,554 | University | 550 |
Organization | 10,172 | CollegeorUniversity | 420 |
WebPage | 8,846 | Review | 372 |
Article | 8,351 | NewsArticle | 316 |
BlogPosting | 3,511 | Place | 313 |
Person | 2,071 | ScholarlyArticle | 298 |
Event | 1,539 | SportsEvent | 289 |
EducationalOrganization | 1,508 | Thing | 168 |
This is just counting the number of triples that include that schema.org type. One interesting thing is that this is almost exclusively in the Microdata syntax. It wasn't until later that the schema.org partners said they'd support RDFa Lite.
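Here's a rough sketch of that counting over the same kind of N-Quads input: tally quads whose predicate is rdf:type and whose object is a schema.org class. The sample quad is just for illustration.

# A rough sketch of tallying schema.org types from the filtered .edu quads.
from collections import Counter
import re

RDF_TYPE = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"

edu_quads = [  # stand-in for the .edu quads filtered earlier
    '_:n1 ' + RDF_TYPE + ' <http://schema.org/CollectionPage> '
    '<http://d.lib.ncsu.edu/collections/> .',
]

type_counts = Counter()
for quad in edu_quads:
    if RDF_TYPE not in quad:
        continue
    match = re.search(r'<http://schema\.org/([^>/]+)>', quad)
    if match:
        type_counts[match.group(1)] += 1

for schema_type, count in type_counts.most_common(20):
    print(schema_type, count)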
Note that ScholarlyArticle is represented a little, but Book, Photograph, and CreativeWork aren't among the top types. So either the places where we have these kinds of things aren't being crawled, or we're not yet marking them up.
CollectionPage | 1275 | duke.edu ncsu.edu |
ScholarlyArticle | 298 | santafe.edu |
ContactPoint | 139 | lynn.edu duke.edu |
Photograph | 20 | ncsu.edu |
Library | 18 | berkeley.edu |
CreativeWork | 12 | ncsu.edu |
LandmarksOrHistoricalBuildings | 11 | ncsu.edu |
Book | 6 | ohiolink.edu |
GeoCoordinates | 4 | marshall.edu |
Have a digital collection with a sitemap? Please go to this URL right now: http://go.ncsu.edu/sitemap
So I think we can have a part to play both as producers of this data and as consumers of it.
So that's what others might do with this data, but what could Libraries do?
It could enable new services.
This might help us identify data sets we'd like to preserve.
Code, documentation, data sets, and slides with speaker notes: http://ronallo.com/presentations/2013-dlf
Please submit your digital collection sitemaps: http://go.ncsu.edu/sitemap
NCSU sites that use embedded semantic markup (Microdata) and Schema.org: