Sneak Preview of Siri – The Virtual Assistant that will Make Everyone Love the iPhone, Part 2: The Technical Stuff
In Part One of this article on TechCrunch,
I covered the emerging paradigm of Virtual Assistants and took a
first look at a new product in this category called Siri. In this
article, Part Two, I interview Tom Gruber, CTO of Siri, about the history, key ideas, and technical foundations of the product:
Nova Spivack: Can you give me a more precise definition of a Virtual Assistant?
Tom Gruber: A virtual personal assistant is a software system that
- Helps the user find or do something (focus on tasks, rather than information)
- Understands the user's intent (interpreting language) and context (location, schedule, history)
- Works on the user's behalf, orchestrating multiple services and information sources to help complete the task
In other words, an assistant helps me do things by understanding me and working for me. This may seem quite general, but it is a fundamental shift from the way the Internet works today. Portals,
search engines, and web sites are helpful but they don't do things for
me - I have to use them as tools to do something, and I have to adapt
to their ways of taking input.
Nova Spivack: Siri is hoping to kick-start the revival of the
Virtual Assistant category, for the Web. This is an idea which has a
rich history. What are some of the past examples that have influenced
your thinking?
Tom Gruber: The idea of interacting with a
computer via a conversational interface with an assistant has excited
the imagination for some time. Apple's famous Knowledge Navigator
video offered a compelling vision, in which a talking head agent helped
a professional deal with schedules and access information on the net.
The late Michael Dertouzos, head of MIT's Computer Science Lab, wrote
convincingly about the assistant metaphor as the natural way to
interact with computers in his book "The Unfinished Revolution:
Human-Centered Computers and What They Can Do For Us". These accounts
of the future say that you should be able to talk to your computer in
your own words, saying what you want to do, with the computer talking
back to ask clarifying questions and explain results. These are
hallmarks of the Siri assistant. Some of the elements of these visions
are beyond what Siri does, such as general reasoning about science in
the Knowledge Navigator. Or self-awareness a la Singularity. But Siri
is the real thing, using real AI technology, just made very practical
on a small set of domains. The breakthrough is to bring this vision to
a mainstream market, taking maximum advantage of the mobile context and
internet service ecosystems.
Nova Spivack: Tell me about the CALO project
that Siri spun out from. (Disclosure: my company, Radar Networks,
consulted to SRI in the early days on the CALO project, to provide
assistance with Semantic Web development)
Tom Gruber: Siri has its roots in the DARPA CALO project (“Cognitive Agent that Learns and Organizes”) which was led by SRI. The
goal of CALO was to develop AI technologies (dialog and natural
language understanding, machine learning, evidential
and probabilistic reasoning, ontology and knowledge representation,
planning, reasoning, service delegation) all integrated into a virtual
assistant that helps people do things. It pushed the limits on machine
learning and speech, and also showed the technical feasibility of a
task-focused virtual assistant that uses knowledge of user context and
multiple sources to help solve problems.
Siri is integrating, commercializing, scaling, and applying
these technologies to a consumer-focused virtual assistant. Siri was
under development for several years during and after the CALO project
at SRI. It was designed as an independent architecture, tightly
integrating the best ideas from CALO but free of the constraints of a
nationally distributed research project. The Siri.com team has been evolving and hardening the technology since January 2008.
Nova Spivack: What are the primary aspects of Siri that you would say are “novel”?
Tom Gruber: The demands of the consumer
internet focus -- instant usability and robust interaction with the
evolving web -- have driven us to come up with some new innovations:
- A conversational interface that combines the best of speech and semantic language understanding with an interactive dialog that helps guide
people toward saying what they want to do and getting it done. The
conversational interface allows for much more interactivity than
one-shot, search-style interfaces, which aids usability and improves
intent understanding. For example, if Siri didn't quite hear what you
said, or isn't sure what you meant, it can ask for clarifying
information. It can prompt on ambiguity: did you mean
pizza restaurants in Chicago or Chicago-style pizza places near you?
It can also make reasonable guesses based on context. Walking
around with the phone at lunchtime, if the speech interpretation comes
back with something garbled about food, you probably meant "places to
eat near my current location". If this assumption isn't right, it is easy to correct in a conversation.
- Semantic auto-complete - a
combination of the familiar "autocomplete" interface of search boxes
with a semantic and linguistic model of what might be worth saying.
The so-called "semantic completion" makes it possible to rapidly state
complex requests (Italian restaurants in the SOMA neighborhood of San
Francisco that have tables available tonight) with just a few clicks.
It's sort of like the power of faceted search a la Kayak, but packaged
in a clever command line style interface that works in small form
factor and low bandwidth environments.
- Service delegation - Siri
is particularly deep in technology for operationalizing a user's intent
into computational form, dispatching to
multiple, heterogeneous services, gathering and integrating results,
and presenting them back to the user as a set of solutions to their
request. In a restaurant selection task, for instance, Siri combines
information from many different sources (local business directories,
geospatial databases, restaurant guides, restaurant review sources,
online reservation services, and the user's own favorites) to show a
set of candidates that meet the intent expressed in the user's natural
language request.
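The interview doesn't describe how this dispatch-and-merge step is implemented internally, but its shape is easy to sketch. Below is a minimal, hypothetical illustration in Java (one of the languages Gruber mentions later in the interview): each external source sits behind a small adapter, and a delegator fans a structured intent out to whichever adapters can handle it, then merges the candidates. All of the type and method names are invented for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// A structured intent, e.g. {task=find_restaurant, cuisine=italian, location=SOMA}.
record Intent(String task, Map<String, String> constraints) {}

// A candidate answer from one source, e.g. a restaurant plus where it came from.
record Candidate(String key, String display, String source) {}

// Each external service (directory, review site, reservation system) gets an adapter.
interface ServiceConnector {
    boolean canHandle(Intent intent);      // does this source cover the task?
    List<Candidate> query(Intent intent);  // fetch candidates for the intent
}

// Fan the intent out to every applicable connector and merge the answers.
class Delegator {
    private final List<ServiceConnector> connectors;

    Delegator(List<ServiceConnector> connectors) { this.connectors = connectors; }

    List<Candidate> solve(Intent intent) {
        Map<String, Candidate> merged = new LinkedHashMap<>();
        for (ServiceConnector c : connectors) {
            if (!c.canHandle(intent)) continue;
            for (Candidate candidate : c.query(intent)) {
                // Keep the first answer per key; a real system would reconcile
                // conflicting fields and remember which source supplied what.
                merged.putIfAbsent(candidate.key(), candidate);
            }
        }
        return new ArrayList<>(merged.values());
    }
}
```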
Nova Spivack: Why do you think Siri will succeed when other AI-inspired projects have failed to meet expectations?
Tom Gruber: In general my answer is that Siri is more focused. We can break this down into three areas of focus:
- Task focus. Siri is very focused on a
bounded set of specific human tasks, like finding something to do,
going out with friends, and getting around town. This task focus
allows it to have a very rich model of its domain of competence, which
makes everything more tractable, from language understanding to reasoning to service invocation and results presentation.
- Structured data focus. The
kinds of tasks that Siri is particularly good at involve semistructured
data, usually with multiple criteria and drawing from
multiple sources. For example, to help find a place to eat, user
preferences for cuisine, price range, location, or even specific food
items come into play. Combining results from multiple sources requires
reasoning about domain entity identity and the relative capabilities of
different information providers. These are hard problems of semantic
information processing and integration, but they are feasible
today using the latest AI technologies; one such step, matching entities across sources, is sketched after this list.
- Architecture focus. Siri
is built from deep experience in integrating multiple advanced
technologies into a platform designed expressly for virtual assistants.
Siri co-founder Adam Cheyer was chief architect of the CALO project,
and has applied a career of experience to design the platform of the
Siri product. Leading the CALO project taught him a lot
about what works and doesn't when applying AI to build a virtual
assistant. Adam and I also have rather unique experience in combining
AI with intelligent interfaces and web-scale knowledge integration.
The result is a "pure play" dedicated architecture for virtual
assistants, integrating all the components of intent understanding,
service delegation, and dialog flow management. We have
avoided the need to solve general AI problems by concentrating on only
what is needed for a virtual assistant, and have chosen to begin with a
finite set of vertical domains serving mobile use cases.
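The "domain entity identity" problem mentioned in the structured-data item above (deciding whether records from two providers describe the same restaurant) can be pictured with a crude heuristic: normalize the names and require the coordinates to be close. This is an illustrative sketch only, not Siri's algorithm; real systems weigh many more signals.

```java
// Illustrative only: a crude entity-identity check for merging restaurant
// records from two different providers. Real systems combine many more
// signals (addresses, phone numbers, provider-specific ids, learned models).
record Place(String name, double lat, double lon) {}

final class EntityMatcher {
    // Rough distance in meters between two lat/lon points (equirectangular approximation).
    static double meters(Place a, Place b) {
        double dLat = Math.toRadians(b.lat() - a.lat());
        double dLon = Math.toRadians(b.lon() - a.lon())
                    * Math.cos(Math.toRadians((a.lat() + b.lat()) / 2));
        return 6_371_000 * Math.sqrt(dLat * dLat + dLon * dLon);
    }

    static String normalize(String name) {
        // "Luigi's Pizzeria, Inc." -> "luigis pizzeria"
        return name.toLowerCase()
                   .replaceAll("[^a-z0-9 ]", "")
                   .replaceAll("\\b(inc|llc|restaurant|cafe)\\b", "")
                   .trim();
    }

    static boolean sameEntity(Place a, Place b) {
        return normalize(a.name()).equals(normalize(b.name())) && meters(a, b) < 150;
    }
}
```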
Nova Spivack: Why did you design Siri primarily for mobile devices, rather than Web browsers in general?
Tom Gruber: Rather than trying to be like
a search engine to all the world's information, Siri is going after
mobile use cases where deep models of context (place, time, personal
history) and limited form factors magnify the power of an intelligent
interface. The smaller the form factor, the more mobile the context,
and the more limited the bandwidth, the more important it is that the
interface make intelligent use of the user's attention and the
resources at hand. In other words, "smaller needs to be smarter." And
the benefits of being offered just the right level of detail or being
prompted with just the right questions can make the difference between
task completion and failure. When you are on the go, you just don't
have time to wade through pages of links and disjoint interfaces, many
of which are not suitable to mobile at all.
Nova Spivack: What language and platform is Siri written in?
Tom Gruber: Java, JavaScript, and Objective-C (for the iPhone).
Nova Spivack: What about the Semantic Web? Is Siri built with Semantic Web open standards such as RDF, OWL, and SPARQL?
Tom Gruber: No, we connect to partners on
the web using structured APIs, some of which do use the Semantic Web
standards. A site that exposes RDF usually has an API that is easy to
deal with, which makes our life easier. For instance, we use geonames.org
as one of our geospatial information sources. It is a full-on Semantic
Web endpoint, and that makes it easy to deal with. The more the API
declares its data model, the more automated we can make our coupling to
it.
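For a sense of what "easy to deal with" can look like in practice, GeoNames also exposes simple structured web services alongside its RDF data. A minimal Java 11+ sketch of calling its search endpoint might look like the following; the "demo" username appears in GeoNames' own documentation examples and is heavily rate-limited, so a real integration would use its own registered account:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class GeoNamesLookup {
    public static void main(String[] args) throws Exception {
        // Ask GeoNames for places matching "SOMA San Francisco"; the JSON
        // response carries typed fields (lat, lng, featureClass, countryCode)
        // rather than free text, which is what makes coupling to it easy.
        String url = "http://api.geonames.org/searchJSON"
                   + "?q=SOMA+San+Francisco&maxRows=3&username=demo";
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());
    }
}
```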
Nova Spivack: Siri seems smart, at least
about the kinds of tasks it was designed for. How is the knowledge
represented in Siri – is it an ontology or something else?
Tom Gruber: Siri's knowledge is
represented in a unified modeling system that combines ontologies,
inference networks, pattern matching agents, dictionaries, and dialog
models. As much as possible we represent things declaratively (i.e.,
as data in models, not lines of code). This is a tried and true best
practice for complex AI systems. This makes the whole system more
robust and scalable, and the development process more agile. It also
helps with reasoning and learning, since Siri can look at what it knows
and think about similarities and generalizations at a semantic level.
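Gruber doesn't publish Siri's model format, but the declarative style he describes, with domain knowledge kept as data the system can inspect rather than buried in code, can be illustrated with a toy fragment. Everything below is invented for illustration:

```java
import java.util.List;
import java.util.Map;

// A toy declarative domain model: concepts, their properties, and the tasks
// they participate in, expressed as data the system can inspect. None of
// this is Siri's actual schema; it only illustrates "models as data".
public final class RestaurantDomainModel {
    public static final Map<String, List<String>> PROPERTIES = Map.of(
        "Restaurant",  List.of("name", "cuisine", "neighborhood", "priceRange", "acceptsReservations"),
        "Cuisine",     List.of("name", "region"),
        "Reservation", List.of("restaurant", "partySize", "time")
    );

    // Which tasks the assistant knows how to perform over these concepts.
    public static final Map<String, List<String>> TASKS = Map.of(
        "find_restaurant", List.of("Restaurant"),
        "book_table",      List.of("Restaurant", "Reservation")
    );

    // Because the model is data, generic code can reason over it, e.g. check
    // whether a parsed constraint is meaningful for a given concept.
    public static boolean isKnownProperty(String concept, String property) {
        return PROPERTIES.getOrDefault(concept, List.of()).contains(property);
    }
}
```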
Nova Spivack: Will Siri be part of the Semantic
Web, or at least the open linked data Web (by making open APIs available,
sharing linked data as RDF, etc.)?
Tom Gruber: Siri isn't a source of data,
so it doesn't expose data using Semantic Web standards. In the
Semantic Web ecosystem, it is doing something like the vision of a
semantic desktop - an intelligent interface that knows about user needs
and sources of information to meet those needs, and intermediates. The
original Semantic Web article in Scientific American included use cases
that an assistant would do (check calendars, look for things based on
multiple structured criteria, route planning, etc.). The Semantic Web
vision focused on exposing the structured data, but it assumes APIs
that can do transactions on the data. For example, if a virtual
assistant wants to schedule a dinner, it needs more than information
about the free/busy schedules of participants: it needs API access to
their calendars with appropriate credentials, ways of communicating
with the participants via APIs to their email/SMS/phone, and so forth.
Siri is building on the ecosystem of APIs, which are better if they
declare the meaning of the data in and out via ontologies. That is the
original purpose of ontologies-as-specification that I promoted in the
1990s - to help specify how to interact with these agents via
knowledge-level APIs.
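As a concrete reading of that dinner-scheduling example, the "knowledge-level" access the assistant would need might be sketched with interfaces like these (purely illustrative, not any real calendar or messaging service):

```java
import java.time.Instant;
import java.util.List;

// Purely illustrative "knowledge-level" interfaces for the dinner-scheduling
// example: the assistant needs transactional access, not just published data.
interface CalendarService {
    List<Instant> freeSlots(String userId, Instant from, Instant to);      // read free/busy
    String book(String userId, Instant start, Instant end, String title);  // write, with credentials handled elsewhere
}

interface MessagingService {
    void invite(String userId, String message);  // reach participants by email/SMS/phone
}
```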
Siri does, however, benefit greatly from standards for talking
about space and time, identity (of people, places, and things), and
authentication. As I called for in my Semantic Web talk in 2007, there
is no reason we should be string matching on city names, business
names, user names, etc.
All players near the user in the ecommerce value chain get
better when the information that users need can
be unambiguously identified, compared, and combined. Legitimate
service providers on the supply end of the value chain also benefit,
because structured data is harder to scam than text. So if some
service provider offers a multi-criteria decision making service, say,
to help make a product purchase in some domain, it is much easier to do
fraud detection when the product instances, features, prices, and
transaction availability information are all structured data.
Nova Spivack: Siri appears to be able to handle
requests in natural language. How good is the natural language
processing (NLP) behind it? How have you made it better than other NLP?
Tom Gruber: Siri's top line measure of
success is task completion (not relevance). A subtask is intent
recognition, and a subtask of that is NLP. Speech is another element,
which couples to NLP and adds its own issues. In this context, Siri's
NLP is "pretty darn good" -- if the user is talking about something in
Siri's domains of competence, its intent understanding is right the
vast majority of the time, even in the face of noise from speech,
single finger typing, and bad habits from too much keywordese. All NLP
is tuned for some class of natural language, and Siri's is tuned for
things that people might want to say when talking to a virtual
assistant on their phone. We evaluate against a corpus, but I don't
know how it would compare to the standard message and news corpora used
by the NLP research community.
Nova Spivack: Did you develop your own speech interface, or are you using a third-party system for that? How good is it? Is it battle-tested?
Tom Gruber: We use third party speech
systems, and are architected so we can swap them out and experiment.
The one we are currently using has millions of users and continuously
updates its models based on usage.
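Gruber doesn't name the vendor, but "architected so we can swap them out" typically just means the recognizer sits behind an interface that the rest of the system depends on. A hypothetical sketch:

```java
// Hypothetical sketch of how a third-party recognizer can be made swappable:
// the rest of the system depends only on this interface, so engines can be
// exchanged or A/B-tested without touching the dialog or intent layers.
interface SpeechRecognizer {
    // Return ranked transcription hypotheses for a chunk of recorded audio.
    java.util.List<String> transcribe(byte[] audio, String languageTag);
}

class AssistantFrontEnd {
    private final SpeechRecognizer recognizer;  // injected: vendor A, vendor B, a stub for tests

    AssistantFrontEnd(SpeechRecognizer recognizer) { this.recognizer = recognizer; }

    String bestGuess(byte[] audio) {
        var hypotheses = recognizer.transcribe(audio, "en-US");
        return hypotheses.isEmpty() ? "" : hypotheses.get(0);
    }
}
```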
Nova Spivack: Will Siri be able to talk back to users at any point?
Tom Gruber: It could use speech synthesis
for output, for the appropriate contexts. I have a long standing
interest in this, as my early graduate work was in communication
prosthesis. In the current mobile internet world, however,
iPhone-sized screens and 3G networks make it possible to do much more
than read menu items over the phone. For the blind, embedded
appliances, and other applications it would make sense to give Siri
voice output.
Nova Spivack: Can you give me more examples of how the NLP in Siri works?
Tom Gruber: Sure. There’s an example, published in Technology Review, that illustrates what’s going on in a typical dialogue with Siri (see the table in that article).
Nova Spivack: How personalized does Siri get – will it
recommend different things to me depending on where I am when I ask,
and/or what I’ve done in the past? Does it learn?
Tom Gruber: Siri does learn in simple ways
today, and it will get more sophisticated with time. As you said, Siri
is already personalized based on immediate context, conversational
history, and personal information such as where you live. Unlike
stateless systems such as search engines, Siri doesn't forget things
from request to request. It always considers the user model along with the
domain and task models when coming up with results. The evolution in
learning comes as users have a history with Siri, which gives it a
chance to make some generalizations about preferences. There is a
natural progression with virtual assistants from doing exactly what
they are asked, to making recommendations based on assumptions about
intent and preference. That is the curve we will explore with
experience.
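One simple way to read "considers the user model along with the domain and task models" is as extra terms in how candidate results are scored. The following is a toy illustration with arbitrary weights, not Siri's ranking:

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// Toy illustration of blending a user model into ranking: candidates that
// match the stated constraints score highest, while known preferences
// (favorite cuisines, home neighborhood) break ties and nudge the ordering.
record Option(String name, String cuisine, String neighborhood, double baseMatch) {}
record UserModel(Map<String, Double> cuisineAffinity, String homeNeighborhood) {}

final class Ranker {
    static List<Option> rank(List<Option> options, UserModel user) {
        return options.stream()
            .sorted(Comparator.comparingDouble((Option o) -> score(o, user)).reversed())
            .toList();
    }

    static double score(Option o, UserModel user) {
        double preference = user.cuisineAffinity().getOrDefault(o.cuisine(), 0.0);
        double proximity = o.neighborhood().equals(user.homeNeighborhood()) ? 0.1 : 0.0;
        return o.baseMatch() + 0.3 * preference + proximity;  // weights are arbitrary
    }
}
```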
Nova Spivack: How does Siri know what is in various
external services – are you mining and doing extraction on their data,
or is it all just real-time API calls?
Tom Gruber: For its current domains Siri
uses dozens of APIs, and connects to them in both realtime access and
batch data synchronization modes. Siri knows about the data because we
(humans) explicitly model what is in those sources. With declarative
representations of data and API capabilities, Siri can reason about the
various capabilities of its sources at run time to figure out which
combination would best serve the current user request. For sources
that do not have nice APIs or expose data using standards like the
Semantic Web, we can draw on a value chain of players that extract
structure by data mining and expose it through scraping-based APIs.
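The "declarative representations of data and API capabilities" that Gruber mentions can be pictured as a small registry the planner consults at run time. A hypothetical sketch:

```java
import java.util.List;
import java.util.Set;

// Hypothetical capability registry: each source declares, as data, which
// domains it covers, which fields it can return, and how it is accessed.
// At run time the planner filters on the current request instead of
// hard-coding a source per task.
record SourceCapability(String sourceId,
                        Set<String> domains,   // e.g. "restaurants", "events"
                        Set<String> fields,    // e.g. "rating", "availability"
                        boolean realtime) {}   // realtime API vs. batch-synced copy

final class SourcePlanner {
    static List<SourceCapability> pick(List<SourceCapability> registry,
                                       String domain, Set<String> neededFields) {
        return registry.stream()
            .filter(s -> s.domains().contains(domain))
            .filter(s -> s.fields().containsAll(neededFields))
            .toList();
    }
}
```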
Nova Spivack: Thank you for the information. Siri might actually make me like the iPhone enough to start using one again.
Tom Gruber: Thank you, Nova, it's a pleasure to discuss this with someone who really gets the technology and larger issues. I hope Siri does get you to use that iPhone again. But remember, Siri is just starting out and will sometimes say silly things. It's easy to project intelligence onto an assistant, but Siri isn't going to pass the Turing Test. It's just a simpler, smarter way to do what you already want to do. It
will be interesting to see how this space evolves, how people will come
to understand what to expect from the little personal assistant in
their pocket.