
I was watching this Bloomberg video the other day featuring Shawn Carolan, the venture capitalist who backed the Siri electronic personal assistant startup and then sold it to Apple. His was the closest thing I’d heard to a technical explanation of how Siri works, and it surprised me because it sounded a lot like technology I remembered from years ago at Excite, the long-defunct search engine. Please look at the video and then meet me in the next paragraph. The part that excited me (no pun intended) is about four minutes in.

Okay, he said they used linguistic techniques to map blocks of words against 10 possible domains of expertise to figure out what the heck you are asking Siri to do, with the real breakthrough being treating the entire user question or order as a single linguistic unit.

Now let’s jump back to 1994.  Eighteen years ago the search engine technology standard was set by Alta Vista, which spun out of Digital Equipment Corporation. Alta Vista pioneered web indexing with spiders dragging back web pages and it pioneered keyword searches. But if a keyword wasn’t present, Alta Vista would never return the web page you really needed because Alta Vista wasn’t smart enough.

Today the search engine standard is of course Google, which uses PageRank to measure relevance: the more other pages link to a page, the higher it lands in the search results. Adding to this (yes, I know I’m oversimplifying; feel free to correct me), Google knows a lot about synonyms and how word meaning changes in different contexts, basic linguistic tools that were probably out of Alta Vista’s reach simply because of the processing power required.

And then there was Excite, which was completely different. When I first visited the company in 1994 it was called ArchiText and was six Stanford students operating from their Los Altos garage. I helped them find their first customer and their first venture capitalist, Steve Coit of Charles River Ventures. Vinod came along later.

Most of the ArchiText boys were symbolic systems majors and they took a very different technical approach to search than did Alta Vista or that other up-and-coming search engine, Yahoo, which in those days did the task the old-fashioned way: by hand.

ArchiText used spiders, too, and built its own web index, but from the start the company was dedicated to finding useful search results even if they didn’t include any search terms from the original user query — seemingly an enormous job.  Google does some of that through its elaborate algorithm, mentioned above, but Google’s technique is for the most part hard coded and brute force while ArchiText’s was very different and, well, elegant.

Here’s how the ArchiText (later Excite) search engine worked. Every query was stripped to its significant words (subjects, objects, verbs and adjectives), then each query became a vector in a multidimensional space with each unique word being a dimension. “How do space rockets stay in orbit when they are flying through space?” would become a vector that extends one unit along the dimension for each of those words but two units along the dimension for “space,” since that word appears twice. This bit of semantic DNA was then mapped against an index of millions of web pages that had all been similarly converted to multidimensional vectors.
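For readers who think better in code, here is a minimal sketch of that query-to-vector step. It is an illustration under assumptions, not Excite's actual code: the `STOP_WORDS` list and the `query_to_vector` name are made up for this example, and a crude stop-word filter stands in for real part-of-speech analysis.

```python
import re
from collections import Counter

# Hypothetical stop-word list: a stand-in for whatever linguistic
# filtering the real system used to keep only "significant" words.
STOP_WORDS = {"how", "do", "in", "when", "they", "are", "through",
              "the", "a", "an", "of", "to", "is"}

def query_to_vector(text):
    """Return a sparse vector: one dimension per unique significant
    word, with the word's count as its length along that dimension."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w for w in words if w not in STOP_WORDS)

vec = query_to_vector(
    "How do space rockets stay in orbit when they are flying through space?"
)
print(vec["space"], vec["rockets"])  # "space" counts double, "rockets" once
```

A `Counter` works as a sparse vector here because any word that never appears simply has a length of zero along its dimension.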

Finding the most relevant results then became a simple matter of grabbing the N vectors (web pages) nearest to the query vector in that multidimensional space.  It was quick, scalable, concentrated the processing load on the indexing where it didn’t bog down retrieval, and could reliably return pages like “Why satellites fall from the sky” that might answer the question even though none of the same words were used.
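The "grab the N nearest vectors" step might look something like the sketch below. Two assumptions are baked in: "nearest" is measured by cosine similarity, which is a common choice but not something confirmed for Excite, and the three-page index is invented toy data. Note also what the sketch cannot do: with raw word counts, a page that shares no words with the query scores zero, so matching a page like “Why satellites fall from the sky” must have relied on something beyond this, which isn't described here.

```python
import heapq
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def nearest_pages(query, index, n):
    """Return the n page names whose vectors lie closest to the query."""
    return heapq.nlargest(n, index, key=lambda page: cosine(query, index[page]))

# Toy index (made-up pages); a real one would be built at crawl time.
index = {
    "rockets.html": Counter({"space": 3, "rockets": 2, "orbit": 1}),
    "recipes.html": Counter({"bread": 2, "flour": 1}),
    "orbits.html": Counter({"orbit": 2, "gravity": 1}),
}
query = Counter({"space": 2, "rockets": 1, "stay": 1, "orbit": 1, "flying": 1})

top = nearest_pages(query, index, 2)
print(top)
```

All the expensive work of turning pages into vectors happens once, at indexing time; at query time the engine only has to build one small vector and compare.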

Compare that to the description of Siri from the Bloomberg video.  Siri takes the entire query as a single block and maps it against a corpus composed of 10 domains of expertise looking for a fit, or perhaps for the best fit.

Technically it sounds darned similar to me, but then I’m forever condemned to remember old crap like this.

In the long run PageRank was more useful to the real world. Excite got sucked into @Home and the whole mess blew up with the dot-com meltdown, but not before all this technology was patented. Those patents are owned today by Excite@Home’s creditors, which surprised me given that the original inventor, Graham Spencer, now works at Google.

Those old Excite patents, while nearing the end of their lives, could turn out to be very valuable to, say, a Google trying to compete with Siri on Android or even to an Apple trying to defend Siri from competitors.

I expect we’ll see those patents change hands sometime soon.