context:forge

improving the signal to noise ratio. information in context. web as knowledge.

Bookmarks for March 24th through April 25th

Links for March 24th through April 25th:

Bookmarks for December 27th through March 22nd

Links for December 27th through March 22nd:

OpenPublish; Deploy a high performance (semantic) web site in hours – not months.

A week ago or so our partner – Phase2Technology – announced the release of OpenPublish. The dust has settled from DrupalCon a bit and I wanted to take a few minutes to talk about what OpenPublish is and why it is very very important.

The quick background. Drupal is a hot Open Source content management and web site deployment platform. It has probably tens of thousands of users and thousands of internal and external deployments. Suffice it to say it’s the hot thing in Open Source CMS platforms right now.

Drupal lets you build a site fairly quickly. It won’t be pretty and it won’t have much functionality – but it can be up and running in a matter of minutes. Then you can spend the next few days, weeks or months giving it a nice look and feel, finding the extensions for the functionality you need and perhaps building some glue to hook it all together. Weeks or months later you’ll have the basics in place and can start to think about the advanced features you’d like to implement – in the next few weeks or months.

(Elapsed time – maybe 1-3 months)

Or, we can do it the OpenPublish way. Download the installation setup (from here), run the setup, get a key from Open Calais (here) and enter it into OpenPublish.

Done. Start writing or grabbing feeds. You’re finished.

(Elapsed time – maybe 1 hour)

But – here’s where things start to get very interesting. OpenPublish isn’t just a quick way to install Drupal. OpenPublish uses Calais semantic technology (look at that – seven paragraphs in and the first time we’ve used the word semantic) to provide features even the big guys don’t have. Here are a few examples:

  • Articles are automatically tagged with the people, places, companies, geographies and other elements inside them. You can do this automatically by setting relevance thresholds or do it interactively where Calais suggests and you approve.
  • You can automatically tag your archives. Thousands of articles – no problem. Millions – give us a call and we’ll work something out to get it done in a day or two.
  • You can automatically create topic hubs on any tag (e.g. a Drupal vocabulary), a set of tags or a logical expression over tags. Want a topic hub on “Natural Disasters” in California? About five clicks and it’s done – and it will maintain itself forever.
  • “More like this” functionality is built right in. Your readers can see other related content on your site or – at your option – on other blogs or mainline news sources.
  • Map integration, RDF generation and exposure, lots of other cool stuff.
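The choice in the first bullet between automatic tagging by threshold and interactive approval can be sketched roughly as follows. This is only an illustration, not OpenPublish or Calais code: the `SuggestedTag` type, the 0-to-1 `relevance` score and the 0.5 cutoff are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SuggestedTag:
    """A tag suggested for an article (hypothetical shape, not the Calais format)."""
    name: str
    entity_type: str
    relevance: float  # assumed to be a score between 0.0 and 1.0


def split_by_threshold(tags, threshold=0.5):
    """Return (auto_applied, needs_review) using an assumed relevance cutoff."""
    auto_applied = [t for t in tags if t.relevance >= threshold]
    needs_review = [t for t in tags if t.relevance < threshold]
    return auto_applied, needs_review


# Made-up suggestions for a single article.
suggestions = [
    SuggestedTag("IBM", "Company", 0.82),
    SuggestedTag("New York", "City", 0.41),
    SuggestedTag("Jane Doe", "Person", 0.67),
]

auto_applied, needs_review = split_by_threshold(suggestions, threshold=0.5)
print("applied automatically:", [t.name for t in auto_applied])
print("waiting for approval:", [t.name for t in needs_review])
```

Setting the threshold to 0 would tag everything automatically, while a threshold above 1 would send every suggestion to an editor.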

What we like is that the semantics aren’t the goal here – they’re simply the enabler for a high performance publishing platform.

If you’re a publisher and you want help customizing the installation you should contact our friends at Phase2 and they’d be happy to help. If you’re a smaller non-profit, an advocacy organization or generally someone who doesn’t have a lot of money or time – OpenPublish can literally get you up and running in hours.

The Calais Initiative is proud to sponsor the development of the Drupal modules underlying OpenPublish and proud to work with the Phase2 team – they’re a great group of people.

P.S. It’s all free.

P.P.S Nancy Kho wrote a great overview here.

Metadata as a Service

Kas Thomas (of CMS Watch) wrote two great back to back posts on his blog.

In the first post, Kas discusses the power of “Metadata as a Service” – in short what can you make happen if metadata generation is widely available to your content creation, management and consumption tools.

What’s great is that he doesn’t stop there. In his second post he goes on to construct an OpenOffice plugin that automatically meta-tags your content as you’re creating it. This has obvious benefits for content management and search across or outside the enterprise.

Now – take what Kas has done and extend it to the Linked Data cloud as we’ve done with Calais 4.0. Beyond metadata we now have super-metadata. By using the Linked Data capabilities built into Calais you could not only tag an article as being about, say, “IBM” – but also insert the facts that IBM is headquartered in New York, that New York is part of North America and that IBM has an SIC code of 8742, among others.
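To make the “super-metadata” idea a little more concrete, here is a small sketch that starts from the tag “IBM” and follows one link outward. It uses only the facts mentioned above, hard-coded as tuples. The predicate names and the `enrich` helper are invented for illustration and are not part of Calais.

```python
# Hypothetical facts, written as (subject, predicate, object) tuples.
# The predicate names are made up for this sketch.
FACTS = [
    ("IBM", "headquartered_in", "New York"),
    ("New York", "part_of", "North America"),
    ("IBM", "sic_code", "8742"),
]


def facts_about(subject, facts=FACTS):
    """Return (predicate, object) pairs for every fact with this subject."""
    return [(p, o) for s, p, o in facts if s == subject]


def enrich(tag, facts=FACTS):
    """Collect facts about a tag, then follow 'part_of' from its headquarters."""
    metadata = dict(facts_about(tag, facts))
    headquarters = metadata.get("headquartered_in")
    if headquarters is not None:
        for predicate, obj in facts_about(headquarters, facts):
            if predicate == "part_of":
                metadata["region"] = obj
    return metadata


ibm_metadata = enrich("IBM")
print(ibm_metadata)
```

The article was only tagged “IBM”, but the attached metadata now also records a headquarters, a region and an industry code.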

Here’s the Calais URI for IBM: Start exploring the DBPedia links at the bottom and I’m sure you’ll think of some interesting use cases.

Bookmarks for October 29th through December 27th

Links for October 29th through December 27th:

Life in the Linked Data Cloud: Calais Release 4

(Repurposed from the blog post on http://www.opencalais.com/node/9501)

The Gist: Release 4 of Calais will be a big deal. In that release we’ll go beyond the ability to extract semantic data from your content. We will link that extracted semantic data to datasets from dozens of other information sources, from Wikipedia to Freebase to the CIA World Fact Book. In short – instead of being limited to the contents of the document you’re processing, you’ll be able to develop solutions that leverage a large and rapidly growing information asset: the Linked Data Cloud.

The goal of this post is just to give our community a heads-up to start thinking and planning.

During the course of 2008 we’ve had three significant releases of Calais, with additional point releases nearly each month along the way. We’ve added new knowledge domains, improved performance, delivered integration with a range of tools and developed new user-facing applications. It’s been a year of amazing growth in our developer community and the capabilities of the Calais service.

While every previous release has accomplished something significant, Release 4 is going to introduce something that we think is game changing – and that’s life in the Linked Data cloud. It’s important enough that we want to give all the members of our community time to think about it, prepare for it and get your brains in gear on how you might use it.

Every release of Calais up to this point has focused on meeting the need to extract semantic information from text. Release 4 builds on this by creating the ability to harvest the Linked Data cloud using that semantic data.

For this all to make sense we need to introduce a few things. If you already know about de-referenceable URIs and the Linked Data cloud – skim ahead. If not – please take a moment to ingest the background you need.

When you send text to Calais it returns several things: entities, facts, events and categories. For purposes of today’s discussion we’re going to focus in on entities. Entities are just what they sound like – they are things. Some specific examples are people, companies, organizations, geographies, sports teams and music albums.

When Calais extracts an entity from your text it returns (at least) a few things. It tells you the name of the entity and it tells you what type of entity it is. Unlike other extraction services we don’t just return a list of things – Calais tells you it found a thing of type=Company and a value=IBM or type=Person and value=Jane Doe. But – there’s something else Calais returns that hasn’t meant very much up until now: it returns a Uniform Resource Identifier (URI) for that entity. There’s nothing magic about URIs – they are simply a unique identifier for every entity that Calais discovers. Here’s an example (it’s not pretty) of what the URI for the Company IBM looks like:

d.opencalais.com/comphash-1/7c375e93-de13-3f56-a42d-add43142d9d1

Well, that doesn’t look very useful does it? If you were to pull up that URI (when Release 4 is out) all you’d see is RDF with links to places called DBpedia and Freebase and Reuters. But keep those links in mind: they’re the key to a whole new world.
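As a rough idea of what a program might do with that RDF, the sketch below pulls the outbound links from a small document. The RDF/XML is made up for illustration and the real Release 4 response may be structured differently. The use of `owl:sameAs` for the links and the placeholder Freebase address are assumptions too.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
OWL_NS = "http://www.w3.org/2002/07/owl#"

# Invented sample standing in for what the entity URI might return.
SAMPLE_RDF = """
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:owl="http://www.w3.org/2002/07/owl#">
  <rdf:Description rdf:about="http://d.opencalais.com/comphash-1/7c375e93-de13-3f56-a42d-add43142d9d1">
    <owl:sameAs rdf:resource="http://dbpedia.org/resource/IBM"/>
    <owl:sameAs rdf:resource="http://example.org/freebase/ibm"/>
  </rdf:Description>
</rdf:RDF>
"""


def same_as_links(rdf_xml):
    """Return every owl:sameAs target found in an RDF/XML string."""
    root = ET.fromstring(rdf_xml.strip())
    links = []
    for element in root.iter(f"{{{OWL_NS}}}sameAs"):
        target = element.get(f"{{{RDF_NS}}}resource")
        if target:
            links.append(target)
    return links


def links_on_host(links, host):
    """Keep only the links that point at the given host name."""
    return [link for link in links if urlparse(link).netloc == host]


all_links = same_as_links(SAMPLE_RDF)
dbpedia_links = links_on_host(all_links, "dbpedia.org")
print(dbpedia_links)
```

Each link that comes back is itself something a program can retrieve, which is where the rest of this post is headed.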

Linked Data is the name of a movement underway (not too surprisingly, initiated by Sir Tim Berners-Lee) that sets a standard and expected behavior for publishing and connecting data on the web. This isn’t about publishing web pages – this is about turning those web pages into data that’s accessible to programs to work with. We’ll give you a quick example to make it real: Wikipedia is one of the single largest sets of information across a broad range of topics in the world. It’s really great if I’m a person who’s casually looking for information on a particular topic – but it’s not so great if I’m a computer program that wants to use that data. Why? Because it’s formatted and organized for people – not computers – to read.

But Wikipedia has a twin – in fact a Linked Data twin – called DBpedia. DBpedia has the same structured information as Wikipedia – but translated into a machine-readable format called RDF and accessible via the Linked Data standards. And, Wikipedia is not alone. A growing cloud of information sets from DBpedia to the CIA World Fact Book to U.S. Census data to Musicbrainz – and many others – is becoming available. What’s important is that this cloud is 1) growing, and 2) interoperable. There are “pointers” from entries in DBpedia to entries in Musicbrainz and back to entries in Geonames – it’s another big Web – but this time it’s a Web of Data.

So – lots of words and arcane concepts. Let’s try to bring it all together into something that makes sense. We’ll put one sentence out there – and then we’ll give a few examples.

Beginning with Calais Release 4 you and the programs you develop will be able to go from many of the entities Calais extracts directly to the Linked Data Cloud.

A simple example:

I want to process today’s business news. For each article I want to extract all of the companies mentioned – but only if the article also mentions a merger or acquisition. I am only interested in companies whose headquarters (or those of their subsidiaries) are located in New York State. Do all of that and give me a widget for my news site titled “Merger Activity for NY Consulting Companies”. And oh, by the way, this isn’t a research project – I want you to do it real time for the 10,000 pieces of news I process every day.

How would you do that? Option 1 is to hire a bunch of researchers, give them a fast internet connection and teach them to type very very fast.  Option 2 is to write some code that looks like this:

For each Article
    Submit to Calais, get response
    If MergerAcquisition exists then
        For each Company
            Retrieve Calais Company URI, extract DBpedia link
            Send Linked Data inquiry to DBpedia, get response
            If CompanyIndustry contains “Consulting”
                If CompanyHeadquarters = “New York”
                    Put them on the list
                For each subsidiary
                    Send Linked Data query to DBpedia, get result
                    If CompanyHeadquarters = “New York”
                        Put them on the list

(lots of endif’s)

Print the list
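For readers who prefer something they can run, the pseudocode above could look roughly like this in Python. The two “services” are canned dictionaries, so nothing here calls Calais or DBpedia, and every identifier, field name and company is invented. It also takes one reading of “put them on the list”: a mentioned company is listed when its own headquarters, or that of any subsidiary, is in New York.

```python
# Canned stand-ins for the two services. All identifiers and data are invented.
CALAIS_RESPONSES = {
    "article-1": {"events": ["MergerAcquisition"],
                  "companies": ["uri:acme", "uri:globex", "uri:hooli"]},
    "article-2": {"events": [], "companies": ["uri:initech"]},
}

LINKED_DATA = {
    "uri:acme": {"name": "Acme Consulting", "industry": "Consulting",
                 "headquarters": "New York", "subsidiaries": []},
    "uri:globex": {"name": "Globex", "industry": "Management Consulting",
                   "headquarters": "Ohio", "subsidiaries": ["uri:globex-ny"]},
    "uri:globex-ny": {"name": "Globex NY", "industry": "Consulting",
                      "headquarters": "New York", "subsidiaries": []},
    "uri:hooli": {"name": "Hooli", "industry": "Software",
                  "headquarters": "New York", "subsidiaries": []},
    "uri:initech": {"name": "Initech", "industry": "Consulting",
                    "headquarters": "New York", "subsidiaries": []},
}


def submit_to_calais(article_id):
    """Hypothetical extraction call: returns events and company URIs."""
    return CALAIS_RESPONSES[article_id]


def linked_data_lookup(uri):
    """Hypothetical Linked Data query: returns facts about one company."""
    return LINKED_DATA[uri]


def merger_activity(article_ids, state="New York"):
    """List consulting companies in merger news with a presence in `state`."""
    matches = []
    for article_id in article_ids:
        response = submit_to_calais(article_id)
        if "MergerAcquisition" not in response["events"]:
            continue
        for uri in response["companies"]:
            company = linked_data_lookup(uri)
            if "Consulting" not in company["industry"]:
                continue
            # Headquarters of the company itself, then of each subsidiary.
            headquarters = [company["headquarters"]]
            headquarters += [linked_data_lookup(sub)["headquarters"]
                             for sub in company["subsidiaries"]]
            if state in headquarters and company["name"] not in matches:
                matches.append(company["name"])
    return matches


ny_consulting = merger_activity(["article-1", "article-2"])
print(ny_consulting)
```

In this made-up data Globex qualifies only through its New York subsidiary, Hooli is dropped for not being a consulting company, and Initech is dropped because its article mentions no merger.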

That really is a pretty straightforward example. How about companies in the news with at least one subsidiary doing business in an area that the CIA Factbook considers dangerous? Or books released by authors who attended Harvard who live in Ohio? Or … . We think you get the idea.

So. The summary. The combination of semantic data extraction (generic extraction, tags, keywords won’t do the trick) + de-referenceable URIs (entity identifiers you and your programs can retrieve) + the Linked Data Cloud = amazing stuff.

We’d like you to start thinking about it.

Spinqing

We’ve all been there – you’re on a panel, giving a presentation or just having a discussion with colleagues. Then... someone asks a question. Well, it’s supposed to be a question but it’s really just an opportunity to look smart. At conferences at least it usually has a lot of meta-words and phrases like platform or paradigm or contextualize or whatever. It’s not a question – it’s a spinq – a Self Promotional Inquiry.

Developers! Developers! Developers!

One of the really fun parts of working on the Calais Initiative is our community of developers. They toil in quiet and then – surprise! – they release something really cool and interesting. So – I wanted to take just a moment to highlight two new Calais R3.1 applications that popped up this weekend.

iPlayerlist by Geography

iPlayerlist is an interesting application that takes shows available via the BBC iPlayer and allows you to find them by topics, times and other attributes. Andy @ mibly.com has just rolled out an enhancement that uses the new Calais geo-location capabilities to find shows based on the locations mentioned in their descriptions. It’s available here, and I think it’s a great example of a simple, clean way to improve the user experience using semantic metadata extraction. Unfortunately viewing many of the resulting videos won’t work unless you’re in the U.K. This isn’t iPlayer’s fault – it’s a limitation the BBC has put in place.

Calais Geo Location Tutorial and Demo App

Guilhem Vellut has put together a nice demonstration app that shows the Calais geo-location features in action. While I really like the application (you can see it here) it’s the blog post he wrote giving the details of exactly how he built the application – including code samples – that’s really great. By investing the time to document what he did and how he got everything working together he’s provided a great jumpstart for anyone else wanting to experiment with Calais geo-location. Thanks!

What is Web 3.0?

After participating in yet another “What is Web 3.0” panel I decided to strip my answer down to Twitterable size. Here it is:

Web 2.0 created a problem – overwhelming content overload. Web 3.0’s job is to solve that problem. That’s it.

Maybe later on I’ll write a few thousand more words around the details. But that’s what they are: details. Figure out how to decrease content overload in publishing, in user generated content, in social networks and in search. Stop worrying about the killer app. Just make things better.

Greg Boutin @ Semantics Incorporated

Greg Boutin wrote a fairly in-depth piece on SemanticProxy. In this article Greg reviews SemanticProxy’s performance and asks a number of questions about whether it’s truly “Semantic”. So – second in a series of cheating by republishing responses I’ve written… here we go.

Greg’s original article is located here

Greg:

I thought I had responded to this post – but it appears it was one of those many responses I’ve composed in my head while driving or whatever and never actually gotten down in writing.

First, a couple of things that may need clarification.

SemanticProxy is Calais. What SemanticProxy does is to take the burden of fetching a web page, cleaning HTML, calling Calais and all that off the developer. It does all of that for you and returns the results as RDF – or as HTML for demonstration purposes. So – any functionality in Calais is automatically reflected in SemanticProxy. The main technical challenge with SemanticProxy other than engineering for scalability is simply HTML cleaning. One thing we’re thinking of is the creation of a simple tag publishers can embed to indicate the start/stop of the “core” content on a page.
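To give a feel for the HTML-cleaning step, here is a deliberately naive sketch using only the Python standard library. It is not how SemanticProxy works internally. It only strips tags and drops script and style content, and it makes no attempt to find the “core” content of a page, which is the hard part. The `TextExtractor` and `clean_html` names are invented for the example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of script and style tags."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text that is not inside a skipped element.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def clean_html(html):
    """Return the plain text of an HTML document, ready to send for extraction."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


page = """
<html><head><title>News</title><style>p { color: red; }</style></head>
<body><script>var x = 1;</script>
<p>IBM announced a merger.</p><p>Shares rose in New York.</p></body></html>
"""
cleaned = clean_html(page)
print(cleaned)
```

Note that the page title still ends up in the output alongside the article text, which hints at why a publisher-supplied start/stop marker for the core content would help.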

The second area is around the engine underlying Calais. In your post you mention that you assume it’s a statistical engine – it isn’t. The Calais engine is built on core Natural Language Processing (NLP) technology augmented by lexicons and statistical methods. It works by parsing out the parts of speech into core elements and then applying a three-tiered set of pattern recognition and rule-based approaches, wrapping up with a voting and scoring system that selects from the candidate entities, facts and events. The rules and pattern recognition techniques are tuned to identify specific types of entities (people, places, organizations, etc), facts (Person:JobPosition, Person:PoliticalAffiliation, etc) and events (NaturalDisaster, SportingGame, EarningsAnnouncement, etc). The specific elements that Calais understands are documented on our site and expand by 5-15 each month.

Calais also supports “Semi-Exhaustive Extraction” (SEE) for those that want to dive into the deep end of the semantics pool. In SEE we extract all relationships between Thing1 and Thing2 if we can type at least one of the things.

Entity recognition will always be an “IS A” type predicate. “John Doe” “IS A” “Entity Type Person” – so all of our entity recognition will automatically fall into this category.

Facts and events are a little more complicated. For example let’s take something simple like Calais extracting that a person has a particular job title at a particular company. I’m not going to even attempt to write out the RDF – but the basics of that type of relationship would look like:

“John Doe” “IS A” “Person”
“John Doe” “Has the Title” “Chief Wrangler” “AFFILIATED WITH” “ACME”
“ACME” “IS A” “Company”

That’s not even close to RDF – but you get the idea.
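Still not RDF, but the same idea can be written down as data a program can query. In this sketch each statement is a three-part tuple, and the job position gets its own node so that the title and the company stay tied together. The node name `position-1` and the lower-case predicate names are invented for illustration and are not what Calais actually emits.

```python
# Each statement is a (subject, predicate, object) tuple.
STATEMENTS = [
    ("John Doe", "IS A", "Person"),
    ("ACME", "IS A", "Company"),
    ("position-1", "IS A", "JobPosition"),
    ("position-1", "person", "John Doe"),
    ("position-1", "title", "Chief Wrangler"),
    ("position-1", "company", "ACME"),
]


def subjects(predicate, obj):
    """Return every subject that has the given predicate and object."""
    return [s for s, p, o in STATEMENTS if p == predicate and o == obj]


def value(subject, predicate):
    """Return the first object for a subject and predicate, or None."""
    for s, p, o in STATEMENTS:
        if s == subject and p == predicate:
            return o
    return None


def job_positions(person):
    """Return (title, company) pairs for every position held by a person."""
    return [(value(node, "title"), value(node, "company"))
            for node in subjects("person", person)]


positions = job_positions("John Doe")
print(positions)
```

The entity statements are plain “IS A” tuples, while the fact needs several statements hanging off one node, which is the difference the surrounding paragraphs describe.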

So – are we using “Smart” predicates? I think so. Everything we identify (other than simple entity recognition – which is the easy part) is represented in RDF as a series of relationships and attributes. Every fact we identify is, in essence, its own smart predicate. Every event is built out of facts and entities.

What we don’t do is deliver any level of analysis beyond what’s presented to us. We don’t dip into the global linked data brain or DBpedia or other assets to find and deliver more information about what we’ve extracted. If we tell you someone is a “Person” – we don’t tell you that people are mammals. As far as I’m concerned – that’s where linked data and large scale “describe the world” ontologies come in.

So – in summary. Entity recognition (the relatively easy part of what we do) is always about “IS A” type relationships. The harder (and cooler in the long run) stuff is much more sophisticated.

Also – one (well two) exceptions to the “we don’t augment with external data” statement above. In our current technology preview release we’ve rolled out disambiguation around companies and geographies. What this means is that if an article says IBM, IBM Research, IBM Limited or IBM Labs – we’ll tell you it’s really “IBM” and give you the appropriate identifying information (Ticker, web site, etc). We do this using a BIG table – but we also go beyond that and look for contextual clues like industries and geographies that will help us narrow things down.

Geographies are similar – “Longhorns” are more likely to be associated with Paris, TX than with Paris, France.
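A toy version of both ideas, a lookup table for company names and contextual clues for places, might look like this. It is only a sketch of the general technique. The alias table, the clue words and the simple overlap count are invented and say nothing about how Calais actually scores candidates.

```python
import re

# A tiny stand-in for the "BIG table": surface forms mapped to one canonical name.
COMPANY_ALIASES = {
    "ibm": "IBM",
    "ibm research": "IBM",
    "ibm limited": "IBM",
}

# Candidate places for an ambiguous name, each with made-up context clues.
PLACE_CANDIDATES = {
    "Paris": {
        "Paris, TX": {"longhorns", "texas", "ranch"},
        "Paris, France": {"louvre", "seine", "eiffel"},
    },
}


def resolve_company(mention):
    """Map a company mention to its canonical name, or None if unknown."""
    return COMPANY_ALIASES.get(mention.strip().lower())


def resolve_place(name, context_text):
    """Pick the candidate whose clue words overlap most with the context."""
    words = set(re.findall(r"[a-z]+", context_text.lower()))
    best_place, best_score = None, 0
    for place, clues in PLACE_CANDIDATES.get(name, {}).items():
        score = len(clues & words)
        if score > best_score:
            best_place, best_score = place, score
    return best_place  # None when no clue matches at all


company = resolve_company("IBM Research")
place = resolve_place("Paris", "The Longhorns played near Paris last night.")
print(company, "|", place)
```

When no clue word appears in the text, `resolve_place` returns `None` rather than guessing, which leaves the decision to whatever calls it.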

Long response – but I felt a few of these things were worth clarifying. We’re really enjoying the widespread adoption of Calais (almost 1.5M transactions per day and climbing) – but at this point most of the use cases are barely scratching the surface of what Calais provides. Once people have gotten over the current focus on entity recognition (tag clouds anyone?) we hope they’ll step back and explore some of the more powerful semantic capabilities Calais has to offer.

Regards,

Tom
