Conversation on Semantic Linking


A while back, in an exchange spread over a year, two friends and I discussed the semantics of links, spurred by some irritating automated dictionary linking. I tried to summarize the discussion in a couple of ways, but I couldn't do better than the original. Here it is.

GH: Semantic Links 1 - Question

My friend GH posed this question in an email:

2. Hyperlink Semantics.

Or maybe that's not the right term. Here's an example:

I was reading Information Week the other night (one of the only tech news sites that have balanced mac articles). I was there reading about GPLv3, that freak Stallman, and Torvalds' preference of v2 over v3. I read this paragraph:

"Unless there's a radical reworking of GPL version 3 (GPLv3, in the programmer lexicon), a significant portion of the open source community will reject it, chief among them Linus Torvalds, the creator of Linux. "I will not sign on to GPLv3 if it limits how the code is used," Torvalds says in a lengthy E-mail [hyperlink] exchange with InformationWeek."

I think, "Sweet!", interview with Torvalds, keeping it real! And I click the link named "E-mail" to read this phat interview. And instead I get:

"Results found for: e-mail

(electronic-mail)

The transmission of text messages and optional file attachments over a

network. Within an enterprise, users can send mail to a single

recipient or broadcast it to multiple users. Mail is sent to a

simulated mailbox in the network mail server or host computer until it

is examined and deleted. The mail program (e-mail client) in your

computer queries the mail server every so many minutes and alerts you

if new mail has arrived."

WHAT THE FUCK!!

Don't get me wrong. Yes, these dictionary links are insidious and horrible. But that's not my problem. My problem is one of context. What informative content in the context of the main topic of this paragraph does linking to the definition of "email" provide? None. This is a huge problem on the web, and I see it every damn day. People link to things for the hell of it, because they can, just because there is a word that can be linked to something that is out on the web. As if the top priority in getting across an idea or in presenting a story is to be able to link to sites.

Why? Is this just an outgrowth of the reductionist ethos that one tends to find in technology oriented people? It can't be only that, because this is a problem all over the web, not only in tech sites. Is this the result of the on-going breakdown of knowledge into information? I really don't know why. I just know that it is annoying and ill-conceived.

(It's almost as if, on the web, there is no such thing as context, and instead we live in a world where words have one and only one informative meaning in any situation. Instead, of course, words have different **uses**, and those uses define their various meanings. Indeed, those uses **are** their meanings. By the way, for a really great discussion of this misunderstanding of language, see Wittgenstein's "Philosophical Investigations." It's worth the read.)

Scott, do you have anything on this over on Textuality? I did a brief buzz through, but wasn't sure how to look for this.

(BTW, Scott, what was the term you were using when we were all over at JL's? "Semantic Domains"? And what book was that in?)

JL: Semantic Links 2 - The Philosophy

My friend JL chimed in with this:

"Why? Is this just an outgrowth of the reductionist ethos that one tends to find in technology oriented people? It can't be only that, because this is a problem all over the web, not only in tech sites. Is this the result of the on-going breakdown of knowledge into information? I really don't know why. I just know that it is annoying and ill-conceived."

I'm sure Scott has a bazillion thoughts on this, but having once upon a time done a whole shitload of research on semi-structured data, information organization and retrieval, etc., I have some intuitions and semi-informed ideas on why this is.

  1. It's a general flaw in the way the Web was built. Links contain no semantic information at least partly for the same reasons HTML contains a blink tag and an "address" tag but no ability to explicitly place objects in relative positions to one another; the Web just wasn't originally conceived as including the variety of *kinds* of content that it does, so links weren't originally conceived as requiring more semantic information than, say, the text contained within them. Note that Tim Berners-Lee either quickly realized the problem there, or always thought of our current web as the first pass, because he's a major organizer of the Semantic Web project.

  2. Google has popularized the theory that you can extract meaningful, complex semantic information ("what is the relevance of this web page to some information need (i.e. web query) that I have?") from the structure of links into and out of a page (see: a million gallons of ink spilled about PageRank). This is at best a half-truth: that structure of links does have some predictive power, but we know that Google leverages an enormous amount of other contextual information about web pages to generate its rankings. (Besides, if semantic information were available, Google would presumably do much much better.)

  3. In general, the creation of semantics from syntax is a Great Mystery (possibly even "The Great Mystery") of human intelligence. The great frustration of building systems that perform tasks that we think of as requiring human intelligence is that, you know, in the end, everything is syntax. And yet we are able to extract meaning-heavy, context-heavy communication from each other and our signs every day, constantly. A huge amount of the history of building artificial semiotic systems (like the Web) is rather alchemical: it's about smushing together syntactic elements and hoping that meaning will just somehow appear. After all, natural language is just a bunch of syntactic elements, but somehow meaning appears from it. Since we have not the first clue how that happens, we do a lot of coming up with sign-systems and then seeing what they can do. The basic flaw of Web links is that they are meant to be an unobtrusive "markup" of complex, ambiguous natural language that creates clear, unambiguous connections between documents. Just thinking about this, it seems obvious that we have no idea how the hell you'd do it, and if you could do it perfectly, you would have essentially solved the AI problem. Techies, however, are extremely fond of syntax-only solutions that place the burden of meaning-making on the end-users, who as humans have a better sense of how to create meaning than any computer system (that is, they have *any* sense of it). Thus, our current syntax-bound links on the Web are exactly as good or bad at communicating meaning as the humans responsible are at using them for communicating meaning. (In the case you cite, they did a shitty job.)

  4. So the problem with adding context to web links as a matter of their technical implementation is that no one knows what the hell semantically-strong links would even look like at a systematic level. In a sense, the only naturally-occurring example we have of how to link one document to another is natural language. Since the whole point of hypertext is to allow the "hypertextuality" -- the linked-ness of the text -- to be unobtrusive, natural language won't do; the point is to create something *more* powerful than an article with footnotes that contain cross-references, not *less*. So there seems to be a way in which context-free links are the best we've got for now, since we don't know how to do better.

The general solutions I've seen to this so far have been:

  • Selecting a finite set of possible contextual meanings for a link, and using CSS to create a different appearance for each context. You can see this today in articles where, say, regular links that go to other similar information are underlined once in blue, but there are also links underlined twice in green that go to the Wikipedia article or dictionary definition of the word being linked. The NYT's bizarre system also applies here: its visible links are underlined as normal, but every single word in an article carries an "invisible" link, so that double-clicking any word takes you to a dictionary definition of it. (I hate this feature with the fire of a thousand suns, by the way.) This solution clearly works okay (a minimal markup sketch follows this list), but it has immediate limitations: for one thing, if you had more than two or at most three different contexts to represent, your document would become a huge mess, hard to read and understand. For another, the set of possible contextual meanings is so huge that users can't depend on consistently applied markup across sites (although obviously standards could arise). So this doesn't really do anything for us, especially since all the implementations of this idea I've seen have been of the form of one kind of link that goes somewhere specific (dictionary) and one kind that goes "everywhere else."

  • The Semantic Web solution, aka Jeff Bezos's Stupidest Idea For Amazon, in which a massive ontology of knowledge is created and built into a computer system to be used for "artificial reasoning" about links and contexts. We've talked about why this is a terrible idea for Amazon, and it's an insane idea for the Web at large. I could go on and on.

...That's about all we've got.
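To make the first of those two solutions concrete, here is a minimal sketch of two-context link styling. The class names and URLs are invented for illustration, and the green double rule is just one way to approximate the "underlined twice in green" convention JL mentions:

```html
<!-- A minimal sketch of two-context link styling. Class names and
     URLs are invented for illustration. -->
<style type="text/css">
  a.ref   { color: #2255cc; text-decoration: underline; }
  a.gloss { color: #227722; text-decoration: none;
            border-bottom: 3px double #227722; }
</style>

<p>
  Torvalds explains his objection in an
  <a class="ref" href="/articles/torvalds-on-gplv3.html">e-mail exchange</a>
  with the magazine; only the
  <a class="gloss" href="http://dictionary.example.com/e-mail">e-mail</a>
  link is a dictionary gloss, and the styling says so before you click.
</p>
```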

So I feel like the reason for your frustration, GH, is essentially that links suck, but they're basically all we've got for now.

To take a more positive view, links are exactly as good or bad as the people implementing them, and the web is such an incredibly young medium that part of what we're seeing is just the thrashing around of trying to figure out how we'll use these new media tools we've come up with.

I like your mention of Wittgenstein, mostly because I've often felt like the history of people using software to organize languages has been a desperate 60-year effort to prove Wittgenstein wrong by successfully engineering a system that couldn't work if Wittgenstein was right. Since I pretty much think Wittgenstein is right, you can see why eventually I had to leave grad school. :)

Scott: Semantic Links 3 - Some Examples

GH asked about links and context, and the pernicious laziness of automated linking to dictionary words. JL's elegant and concise reply is above.  Now for my rambling and verbose discussion of examples.

First, JL had a great answer regarding the base problem of web links. To be more prosaically technical: in standard HTML, even today, there's no standard way to include context, largely because, as JL describes, we just can't get computers to do context in any useful way. HTML doesn't include context because ... how the hell would it? What would a browser do with that tag attribute? Would *everyone* agree with that, or be willing to read in that manner?
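The closest thing standard HTML offers, for what it's worth, is the anchor tag's rel attribute, which HTML 4.01 defines with a small vocabulary of link types (next, prev, glossary, copyright, and so on). Browsers do essentially nothing with it on an ordinary link, which is exactly the "what would a browser do with that attribute?" problem. A sketch, with placeholder URLs:

```html
<!-- HTML 4.01's rel attribute can label a link's relationship, but no
     mainstream browser renders or acts on it for ordinary anchors.
     URLs are placeholders. -->
<p>
  Torvalds says so in a lengthy
  <a href="/2006/torvalds-interview.html">E-mail exchange</a>
  with InformationWeek; compare a link the author explicitly marks
  as a definition:
  <a href="http://dictionary.example.com/e-mail" rel="glossary">e-mail</a>.
</p>
```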

I'd like to emphasize the "standard" part there, though. Numerous hypertext systems have addressed rich context in various ways. As JL described, various sites do their own thing by styling links differently depending on whether they're dictionary links vs. links to other articles (many news sites) or on whether the link goes to another page within the site vs. outside it (Wikipedia).

The pre-web hypertext system Storyspace let an author set conditions on links -- if you'd been to certain lexia (the nodes of a hypertext), a link would go one place, but if you hadn't, it would go another. That is a bit of a roundabout way of trying to ensure context without forcing the user to articulate it, by the way -- a reader doesn't have to *say* that they're following a link while thinking about a certain theme, but if they've just been to three other pages about that theme and then follow the link, there's a good chance of it. Obviously, even that system can't guarantee context. You might have read the last three pages tracing an argument because you are trying to get what the author means, or you might have read them because you think the argument is horseshit and are jotting down points to debunk; you'll be hoping for very different things from any link under those two circumstances.
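For flavor, here's a toy browser-side approximation of that kind of guarded link. Everything here -- the cookie scheme, the lexia names, the URLs -- is invented for illustration, and real Storyspace guard fields were richer than this:

```html
<!-- A toy approximation of a Storyspace-style guarded link: readers
     who have visited all three "magma" lexia get routed to the
     advanced page instead of the overview. -->
<p>Follow the argument about
   <a id="magma-link" href="magma-overview.html">magma chambers</a>.</p>

<script type="text/javascript">
  // Read the list of visited lexia out of a hypothetical "lexia"
  // cookie, stored as a comma-separated list of page names.
  function visitedLexia() {
    var m = document.cookie.match(/(?:^|; )lexia=([^;]*)/);
    return m ? m[1].split(',') : [];
  }
  function contains(list, name) {
    for (var i = 0; i < list.length; i++) {
      if (list[i] === name) { return true; }
    }
    return false;
  }

  // Guard condition: retarget the link only for readers who have
  // already been through the whole magma thread.
  var seen = visitedLexia();
  var thread = ['magma-1', 'magma-2', 'magma-3'];
  var readAll = true;
  for (var i = 0; i < thread.length; i++) {
    if (!contains(seen, thread[i])) { readAll = false; break; }
  }
  if (readAll) {
    document.getElementById('magma-link').href = 'magma-advanced.html';
  }
</script>
```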

Tinderbox, which is a software package intended in part to adapt Storyspace to the broader world of non-narrative hypertext authorship, simplified that system and tried to make it richer at the same time. Tinderbox documents don't assume a linear ('narrative') reading; they might be rendered as a website or even used as a text-heavy database. So Tinderbox gives an author a few metadata fields for links and an arbitrary number for lexia themselves. You could have Tinderbox render out only links of a certain type (context) or list links with their contexts as a pop-up window on mouseover of a link.

Note as well that there *is* a standard in HTML now that can greatly help with context for links, but it is almost completely unused: the tooltip, via the anchor tag's title attribute. As people become more sophisticated hypertext readers, they generally hit a point where they preview a link. They'll mouseover it and look at the URL that pops up in the browser's status bar. That'll often tell you whether it goes to the NYT, to someone's blogspot account, etc. That's really useful for the Penny Arcade comic, for instance, where they write really well and link very well, except that you often can't tell which link goes to the comic of the day. I am rarely interested in following the political fights that inspire a particular Penny Arcade post, so I'll mouseover the links to see which one is a link to their own site rather than some player forum. Since standard HTML includes that title attribute on the anchor tag, we could take that behavior to the next level and put a link's context into the tooltip. It wouldn't be *responsive* to context, but it would at least communicate the context the author had in mind when forging the link.
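A sketch of what that might look like; the title attribute is real, standard HTML, while the URL and the context string are just whatever the author cares to supply:

```html
<!-- The anchor's title attribute is standard HTML; browsers show it
     as a hover tooltip. The URL is a placeholder. -->
<p>
  "I will not sign on to GPLv3 if it limits how the code is used,"
  Torvalds says in a lengthy
  <a href="/2006/torvalds-gplv3-interview.html"
     title="The full InformationWeek interview with Torvalds, not a dictionary entry">
  E-mail exchange</a> with InformationWeek.
</p>
```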

A bit of fantasy that I've indulged in on occasion: it would be pretty awesome if we could include reader-defined context in hypertext, something that I call Tiered Engagement. I think that the ultimate tutorial system, be it in a game, the next version of Word, or The Perfect Digital Projection of a Textbook, would be one that responds to the learner's level of experience. Imagine opening Photoshop and telling it that you're a novice user, you're really only retouching some photos that your aunt took at your cousin's bar mitzvah. It hides all the burning and dodging and masking tools and rearranges the Help feature to be really basic for you. Then later your wife opens Photoshop so that she can do her freelance colorist work, and she gets the full interface and a help system that basically shrugs its shoulders and sends her directly on to the Adobe forums. I'm drifting a bit by treating Photoshop as a hypertext, but you can imagine a similar system within a more traditional text: the fifth grader gets different links from the chapter on volcanoes than the 12th grader, and the geophysicist gets another set entirely.
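Purely as a sketch of that fantasy, here's one crude way a page could let the reader declare a tier and swap link destinations accordingly. The tiers, URLs, and lookup table are all invented for illustration:

```html
<!-- A toy sketch of "Tiered Engagement": the reader declares a level
     and the same anchor text resolves to a different destination. -->
<select id="tier" onchange="retier()">
  <option value="novice">5th grader</option>
  <option value="student">12th grader</option>
  <option value="expert">geophysicist</option>
</select>

<p>The chapter continues with
   <a id="formation-link" href="volcanoes-basic.html">how volcanoes form</a>.</p>

<script type="text/javascript">
  // One destination per tier for each tier-aware link on the page.
  var tiers = {
    'formation-link': {
      novice:  'volcanoes-basic.html',
      student: 'plate-tectonics.html',
      expert:  'mantle-convection-models.html'
    }
  };
  function retier() {
    var level = document.getElementById('tier').value;
    for (var id in tiers) {
      document.getElementById(id).href = tiers[id][level];
    }
  }
</script>
```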

The trick is, as JL notes, getting the reader to define the context in a way that the system will understand. He spent much of grad school watching his professors beat their heads against the fact that it's hard to get readers to define context for something they don't yet know, and trying to second-guess them by essentially modeling their knowledge is, really, The Problem of Artificial Intelligence. In the very limited domain of "using Photoshop" or "reading this specific article in a particular reference," you can assume a lot about the reader's perspective, and just have to determine the level of engagement that the reader wants. In more open-ended work, though, there are too many perspectives or levels of understanding for that.

JL noted that techies like to leave the burden of defining context up to the end-user. I'm probably a techie, but both my techie and my literary sides say this is okay if you include the author as an end-user. In a traditional printed book or article (still far and away the primary model for most online writing), you rely on the author to arrange their words in such a way that their point is comprehensible to the reader. If they do it well, they're a "clear" or "good" writer. If they don't, they're either "bad" or "difficult." I don't see why writing hypertexts should be any different. Jerry Parkinson is a good writer, and so includes links smoothly within his sentences in a way that, generally, means that if some link text interests you, you click on it and get something that tells you more about whatever interested you. The author of the article GH read, or their publishing website, is a poor writer, linking text in a pedantic, patronizing way rather than in a way that expands their argument.

That's a lot of words for what JL meant when he said that hypertext is a new medium, and we're still learning its conventions. I don't think that it's inherently related to technology or to a reductive techie mindset, though, beyond the hubristic idea that the formation or assessment of meaning can be automated.

I don't know jack about Wittgenstein, but now I'm interested, at least in the stuff that touches on this. Got any links? ;)
