Web By Phone
Francesco Caruso
caruso@bellcore.com
Abstract.
WebByPhone is a system that answers a phone call, recognizes touch-tone
(DTMF) digits and then reads a web page aloud using a text-to-speech (TTS)
synthesizer. WebByPhone delivers the power of Web browsing without the
need for expensive equipment: it requires only a common PSTN telephone.
This not only extends web access to the casual user, but also provides
an effective aid to visually impaired individuals.
Introduction.
Web technologies have commonly been based on graphical interfaces ever
since the days of early Web browsers such as Mosaic. The use of graphical
user interfaces has simplified user interaction and contributed to the
spread of the web, but at the same time it has increased the need
to purchase expensive systems. Televisions and telephones are substantially
more widely used than personal computers. New attempts to penetrate the
market by offering web services on widely available appliances are
rapidly emerging. WebTV and IP phones are just two examples of this.
WebByPhone delivers the power of Web browsing without the need for
expensive equipment: it requires only a common PSTN telephone. This not
only extends web access to the casual user, but also provides an
effective aid to visually impaired individuals. Although
WebByPhone is not the only system providing web access through the PSTN, it
offers an open architecture and full platform independence, which makes it
portable across different systems.
WebByPhone is implemented entirely in Java. At design time I based my
choices on emerging standards, such as the Java Speech API (JSAPI),
and on well known and established design patterns, such as Observer/Observable
and the State pattern.
In this paper I will introduce the related work and provide a background
on the technologies and products interacting in the WebByPhone system.
I will then illustrate the architecture, describing both the main architecture
diagram and the components in detail. I will describe the design patterns
and programming techniques applied in each module, emphasizing the
benefits gained from those design choices.
This document also contains a functional description with various examples
of how the user can interact with the system. Since WebByPhone is entirely
coded in Java, the installation of the main program does not require any
additional compilation and is ready to run. I have added a minimal installation
procedure and requirements that include the essential steps needed to install
the peripheral software interacting with WebByPhone. At the conclusion of this
document I list possible directions in which the system may
evolve in the future.
Related Work.
There are several related works involving speech synthesis.
Most of these applications are aimed at blind users and provide access
to common programs and operating systems. BLYNX [1] and BLINUX [2]
are two examples of how a TTS can be coupled with an application.
The following is a partial list of applications that use speech
synthesis [3]:
- Emacspeak Driver for the DoubleTalk and LiteTalk Synthesizers
- Emacspeak Driver for Braille 'n Speak, Braille Lite, and Type 'n Speak
- Linux Device Driver for the DoubleTalk PC
- Screader: a Text-to-Speech Application for Linux
- Slackware96 Rootdisk for the Blind
There is also a company, NetPhonic Communications, Inc. [4], that provides
two products with similar features: Web-On-Call [5] and Email-On-Call [6].
I have tested these two products through the demos they offer by phone.
Judging from the demos, they appear to be well made and of production quality.
Although WebByPhone is not a complete product, I place it very close
to its competitor Web-On-Call.
The core functionality implemented in WebByPhone, together with the widespread
use of design patterns, provides an open architecture in which it
is easy to enrich the system with new features.
Background
The following is a brief introduction to the various software and devices
I used to build the WebByPhone system.
Speech and other Java APIs
JavaSoft has recently released a Java API for speech [7]. More information
is available on the Java(TM) Speech API page [8].
The Java Speech API is one of the Java Media and Communication APIs,
a suite of software interfaces that provide cross-platform access to audio,
video and other multimedia playback.
The Java Speech API, in combination with the other Java Media and Communication
APIs, allows developers to enrich Java applications and applets with rich
media and communications capabilities.
The Java Speech API leverages the capabilities of other Java APIs. The
Internationalization features of the Java programming language plus the
use of the Unicode character set simplify the development of multi-lingual
applications. Many of the classes and interfaces of JSAPI follow the patterns
of JavaBeans(TM). JSAPI events integrate with the event mechanisms of AWT,
JavaBeans and the Java Foundation Classes (JFC).
To use the Java Speech API, a user must have certain minimum software
and hardware available. The following is a broad sample of requirements.
The individual requirements of speech synthesizers and speech recognizers
can vary greatly and users should check product requirements closely.
Speech software: A JSAPI-compliant speech recognizer or synthesizer
is required.
System requirements: most speech recognizers and some speech synthesizers
require powerful desktop computers to run effectively.
It is important to check the minimum and recommended requirements for
CPU, memory and disk space when purchasing a speech product.
Audio Hardware: Speech synthesizers require audio output. Speech
recognizers require audio input. Most desktop and laptop computers now
sold have satisfactory audio support. Most dictation systems perform better
with good quality sound cards.
IBM Speech for Java.
IBM is one of the first companies to implement the JSAPI interfaces.
Speech for Java v0.61 [9] is a free evaluation release available for
testing on the web at http://www.alphaworks.ibm.com/formula/speech/.
Speech for Java is pre-release software from IBM that provides the Java
Speech API on top of ViaVoice.
Speech for Java is a Java programming interface for speech that
gives Java application developers access to the IBM ViaVoice speech
technology. Speech for Java supports voice command recognition, dictation,
and text-to-speech synthesis, based on the IBM
ViaVoice [10] technology.
Speech for Java is an alpha implementation of a core subset of the
beta Java Speech API (http://java.sun.com/products/java-media/speech/).
The Java Speech API is a cross-platform speech API that was developed by
Sun Microsystems Inc. in collaboration with IBM and other speech
technology companies. More information on the Java Speech API can be found
at the Java Speech API home page,
http://java.sun.com/products/java-media/speech.
Requirements
In much the same way that Java implementations on Windows are built on
top of the native Windows GUI capabilities, Speech for Java is built on
top of the native speech recognition and synthesis capabilities in IBM
ViaVoice. Thus Speech for Java requires installation of IBM ViaVoice Gold
on the computer. ViaVoice is not provided as part of this package.
More information about ViaVoice can be found at the VoiceType / ViaVoice
Home Page.
Minimum requirements for running IBM ViaVoice:
- 166MHz Pentium or 150MHz Pentium with MMX, running Windows 95 with 32MB
  of memory or Windows NT with 48MB.
- Speech for Java has only been tested on the JavaSoft JDK 1.1.5 version
  of Java.
- ViaVoice Gold is IBM software available off the shelf.
Microsoft Speech API.
The Microsoft® Speech Application Programming Interface (API) allows
application developers to incorporate both speech recognition and text-to-speech
into their applications.
More information can be found at [11]
http://www.microsoft.com/directx/pavilion/dsound/speechapi.htm
I am not using SAPI directly in my project.
Serial Port Java Drivers.
SerialPort, from Solutions Consulting [12]
(http://www.sc-systems.com/serPort.html), is a Java class that provides
access to serial ports from Java applications.
SerialPort is a high-performance class that also provides low-level serial
port control.
Web browser Lynx
Lynx is a full-featured World Wide Web (WWW) client for users running cursor-addressable,
character-cell display devices (e.g., vt100 terminals, vt100 emulators
running on PCs or Macs, or any other character-cell display). It will display
Hypertext Markup Language (HTML) documents containing links to files on
the local system, as well as files on remote systems running http, gopher,
ftp, wais, nntp, finger, or cso/ph/qi servers, and services accessible
via logins to telnet, tn3270 or rlogin accounts (see URL Schemes Supported
by Lynx). Current versions of Lynx run on Unix, VMS, Windows95/NT, 386DOS
and OS/2 EMX.
More information on Lynx [13] can be found at
http://www.slcc.edu/lynx/release2-8/lynx2-8/lynx_help/lynx_help_main.html
Telephone Access unit T-311.
With the Teltone T-311 Telephone Access Unit computers can make and answer
telephone calls, and information about those calls can be returned to the
computer. The T-311 allows communication between called and calling parties.
This communication is made possible by the conversion of DTMF-to-ASCII
and ASCII-to-DTMF.
With the T-311 computers and other terminal devices can control telephone
system functions such as answering and placing calls, observing call status,
sending or receiving DTMF signals, "flashing" the line and coupling audio
sources, like speech synthesizers, onto the line.
For more information see [14] http://www.teltone.com/cti/t-311.html
Architecture
WebByPhone is entirely implemented in Java and has been tested with the
JDK 1.1.5 JavaSoft VM. The whole architecture consists of more than 30
classes and includes about 2500 lines of code. The whole design and
implementation of the system took less than 100 hours in total, with a
single programmer.
Architecture modules:
The following is a diagram of the modules included in the system.
The following section contains a detailed description of the WebByPhone
modules depicted in the architectural diagram.
Phone gateway.
This component handles a physical call coming from the PSTN and establishes
an audio link connected to the sound card. The hardware part of this module
is the Teltone T-311. The T-311 has a serial RS-232 interface and
is connected to a PC serial port. The class ph_driver is the phone
resource manager, which acts as a Java wrapper for the T-311. Since Java
by itself does not handle the serial port device, ph_driver relies
on a commercial library to access the serial port.
The third-party software I used is called SerialPort and is produced
by Solutions Consulting Inc. This package provides access to the serial
port, abstracting it as if it were a standard Java IO stream.
Since this company supports the most common platforms, this
solution still preserves portability across platforms.
By design the ph_driver module generates events according
to the Observer/Observable design pattern [15]. The events are modeled
by the class SerialEv and are composed of two units of information:
a type and a value. Each module that wishes to receive SerialEv
events must implement the interface SerialEvObserver and subscribe
to the source.
SerialEv is specialized into two kinds of events: SerialDTMFLineEv
and SerialLineEv.
These two events are produced, respectively, when a DTMF sequence or
a line is read from the phone device.
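The subscription mechanism described above can be sketched as follows. This is only a minimal illustration written against a current Java release, not the actual WebByPhone source: the class names SerialEv, SerialDTMFLineEv, SerialLineEv and SerialEvObserver come from the text, while the field layout, the update method, the "DTMF"/"LINE" type tags and the PhDriverSketch class are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Event carrying the two units of information named in the text: a type and a value.
class SerialEv {
    final String type;   // assumed representation of the event type
    final String value;
    SerialEv(String type, String value) { this.type = type; this.value = value; }
}

// Specialized events; the constructors and type tags are assumptions.
class SerialDTMFLineEv extends SerialEv {
    SerialDTMFLineEv(String digits) { super("DTMF", digits); }
}

class SerialLineEv extends SerialEv {
    SerialLineEv(String line) { super("LINE", line); }
}

// Implemented by every module that wants to receive events.
interface SerialEvObserver {
    void update(SerialEv ev);   // hypothetical method name
}

// Stand-in for ph_driver: only the subscribe/notify logic, no serial port access.
class PhDriverSketch {
    private final List<SerialEvObserver> observers = new ArrayList<>();

    void subscribe(SerialEvObserver o) { observers.add(o); }

    // Would be called whenever a DTMF sequence or a line is read from the device.
    void publish(SerialEv ev) {
        // Iterate over a copy so observers may subscribe while being notified.
        for (SerialEvObserver o : new ArrayList<>(observers)) {
            o.update(ev);
        }
    }
}
```

A module would register itself with subscribe and then receive every event passed to publish, in subscription order.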
DTMF converter.
This module is responsible for the conversion of DTMF tones into ASCII
strings. It allows the user to control the web browser remotely by detecting
the keys pressed on a touch-tone phone. Although there are off-the-shelf
programs able to perform this kind of conversion, I am satisfied with
the quality of the Teltone T-311 DTMF detection and chose to use it in
WebByPhone.
Mediator.
This module acts as a coordinator among the other entities. I designed
this part of the system according to the Mediator design pattern.
In practice the Mediator is composed of several objects: webph and the
collection of CallState objects.
webph is the main program. It contains the main method
and is responsible for the initialization of the system. During the initialization
phase all the objects needed are created and initialized.
The mediator handles the phone sessions as a finite state
machine. In designing the mediator I took into account the extensibility
of the phone session model. I designed the system so as to simplify
changes to the state graph, such as adding new states and new transitions
between states. This flexibility is achieved using the State design
pattern. The session states are represented by distinct objects that
inherit from a base CallState class, coded as an abstract class. The
mediator receives events and forwards them to the current state, which is
responsible for reacting according to its embedded business logic. Each
state implements the SerialEvObserver interface and receives events from
the phone device. The behavior of the system is dictated by the reaction
of the current state to the events. Modifying the behavior of the system
is as easy as introducing a new CallState object and linking it to some
predecessor state. In this way I introduced user authentication simply by
adding a new state, Authentication, between the WaitingCall and CallConect
states. It is also possible to take advantage of the state hierarchy by
grouping common behavior in the base class CallState,
avoiding redundant implementation in every state. This is done, for example,
for the DTMF request for online help, for the call
termination event and for the explicit termination of the session by the
DTMF sequence '**#'.
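A minimal sketch of this state arrangement is shown below. It is an illustration under simplifying assumptions rather than the actual implementation: events are reduced to plain strings instead of SerialEv objects, the "RING" and "HANGUP" event names, the PIN "1234" and the MediatorSketch class are hypothetical, and only the state names CallState, WaitingCall, Authentication and CallConect are taken from the text.

```java
// Base state: behavior shared by all states (online help, call termination)
// is handled here, so concrete states implement only what is specific to them.
abstract class CallState {
    final CallState onEvent(String ev, StringBuilder log) {
        if (ev.equals("*#")) {                        // online help, valid in every state
            log.append("help;");
            return this;
        }
        if (ev.equals("**#") || ev.equals("HANGUP")) { // explicit or implicit termination
            log.append("bye;");
            return new WaitingCall();
        }
        return handle(ev, log);
    }

    // State-specific reaction; returns the next state.
    protected abstract CallState handle(String ev, StringBuilder log);
}

class WaitingCall extends CallState {
    protected CallState handle(String ev, StringBuilder log) {
        if (ev.equals("RING")) {                      // hypothetical event name
            log.append("welcome;");
            return new Authentication();
        }
        return this;
    }
}

// State inserted between WaitingCall and CallConect.
class Authentication extends CallState {
    protected CallState handle(String ev, StringBuilder log) {
        if (ev.equals("1234#")) {                     // hypothetical PIN
            return new CallConect();
        }
        log.append("bye;");
        return new WaitingCall();
    }
}

class CallConect extends CallState {
    protected CallState handle(String ev, StringBuilder log) {
        log.append("select ").append(ev).append(';');
        return this;
    }
}

// The mediator forwards every event to the current state.
class MediatorSketch {
    private CallState current = new WaitingCall();
    final StringBuilder log = new StringBuilder();

    void onEvent(String ev) { current = current.onEvent(ev, log); }

    String stateName() { return current.getClass().getSimpleName(); }
}
```

In this arrangement, adding a state means writing one subclass and returning it from the handle method of its predecessor; no other state changes.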
This state mechanism can be extended recursively to delegate part of the
behavior to other modules. In particular, the menu has a part common to
every state and a specialized part that is valid only while a web page is
being read. During such phases the system must react not only to
the standard instructions but also to link selection, by fetching the new
web page. This behavior is accomplished by introducing a menu object able
to process events and react when a link selection is detected. The menu
is created from the web page metadata object (webDoc) and contains
all the links available from the current page.
GUI.
Even if it is not strictly required by the project, it is useful to have
a GUI, especially in the intermediate phases of the project implementation.
This module allows testing the system without all the hardware. In particular
it simulates the events produced by touch-tone keys.
TTS.
This module provides vendor independence for the text-to-speech (TTS)
functionality. The object managing the speech generation is called TTS
and is currently based on IBM Speech for Java v0.61.
The choice of basing the TTS on the IBM product was essentially dictated
by the fact that the IBM Speech for Java was at the time the only implementation
of the Java Speech API available.
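One way to obtain this kind of vendor independence is to hide the synthesizer behind a small interface. The sketch below only illustrates the idea and makes no claim about the real TTS object: SpeechEngineSketch, TtsSketch, RecordingEngine and their methods are all hypothetical names, and no JSAPI calls are shown.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical vendor-neutral contract; a real engine would wrap a
// synthesizer such as the one provided by a JSAPI implementation.
interface SpeechEngineSketch {
    void speak(String text);
}

// Facade used by the rest of the system. Changing vendor means supplying
// a different SpeechEngineSketch, with no change to the callers.
class TtsSketch {
    private final SpeechEngineSketch engine;

    TtsSketch(SpeechEngineSketch engine) { this.engine = engine; }

    // Speaks the trimmed text; null or blank input is ignored.
    void say(String text) {
        if (text != null && !text.trim().isEmpty()) {
            engine.speak(text.trim());
        }
    }
}

// Stand-in engine that records what would be spoken, useful for testing
// without audio hardware.
class RecordingEngine implements SpeechEngineSketch {
    final List<String> spoken = new ArrayList<>();

    public void speak(String text) { spoken.add(text); }
}
```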
In particular, I evaluated the Lucent TTS application platform
2.0 beta for the PC platform (PC is the only platform supported…) and,
although I was extremely satisfied with the audio quality, I decided to
discard the product because it lacks a Java interface. Theoretically,
since the Lucent TTS implements SAPI (the Microsoft Speech API),
it should be possible to interface the TTS to Java using C++ native methods.
This requires a deep understanding of SAPI and a substantial programming
effort, which is currently beyond the scope of this project.
The Lucent researchers I contacted confirmed their intention to evolve
the product in this direction.
It should be noted that the Speech for Java v0.61, which is
an alpha version, does not implement the full JSAPI interface, and is not
stable and robust yet. I would like to acknowledge the IBM researchers
for providing me prompt feedback on the problems I encountered with this
implementation and for providing me with possible alternative ways to get
around the unimplemented features I needed. I still occasionally see
exceptions coming from the TTS module and am in the process of isolating
them.
Web Browser.
This web browser module is able to fetch a web page and extract the text
information along with all the meta-information needed to "surf"
the page. In order to de-couple the system from a particular browser I
have introduced an interface webDriver defining a basic method:
public webDoc getUrl(String url).
I decided not to implement yet another Java browser and HTML parser,
but to rely on a well known and maintained, freely available text web
browser: Lynx.
The main advantage of using well maintained third-party software is
the implicit assurance that the system will keep working consistently as
HTML evolves. I used a common technique called screen scraping to
extract information from the Lynx browser. Lynx is currently executed with
the options -number_links, -pseudo_inlines and -dump from inside a
class called webDriverLynx that implements the webDriver
interface. webDriverLynx reads the standard output of Lynx
through an InputStream and then parses it using a finite
state machine. The result of this process is a webDoc object. I
defined webDoc to contain the textual representation of the
web page and a data structure (links) listing the (URL,
description) tuples for the anchors contained in the web page.
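The screen-scraping step can be illustrated with a small two-state parser. This is a sketch under stated assumptions, not the webDriverLynx code: it assumes the dump ends with a "References" heading followed by lines of the form "1. URL", it keeps only the URLs and omits the link descriptions, it would be misled by a page whose body contains a line reading exactly "References", and the names WebDocSketch and LynxDumpParser are hypothetical. It is written against a current Java release (readAllBytes requires Java 9 or later).

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Simplified stand-in for webDoc: page text plus link URLs (index 0 = link 1).
class WebDocSketch {
    final StringBuilder text = new StringBuilder();
    final List<String> links = new ArrayList<>();
}

class LynxDumpParser {
    private enum State { BODY, REFERENCES }

    // Splits a dump into body text and the numbered reference list.
    static WebDocSketch parse(String dump) {
        WebDocSketch doc = new WebDocSketch();
        State state = State.BODY;
        for (String line : dump.split("\\r?\\n")) {
            String t = line.trim();
            switch (state) {
                case BODY:
                    if (t.equals("References")) {
                        state = State.REFERENCES;
                    } else {
                        doc.text.append(line).append('\n');
                    }
                    break;
                case REFERENCES:
                    // Expect "<number>. <url>"; ignore anything else.
                    int dot = t.indexOf(". ");
                    if (dot > 0 && t.substring(0, dot).chars().allMatch(Character::isDigit)) {
                        doc.links.add(t.substring(dot + 2).trim());
                    }
                    break;
            }
        }
        return doc;
    }

    // Runs Lynx with the options named in the text; assumes "lynx" is on the PATH.
    static WebDocSketch fetch(String url) throws IOException {
        Process p = new ProcessBuilder(
                "lynx", "-dump", "-number_links", "-pseudo_inlines", url).start();
        return parse(new String(p.getInputStream().readAllBytes()));
    }
}
```

Keeping parse separate from fetch lets the parser be exercised on a saved dump without Lynx installed.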
Program Documentation.
WebByPhone prints online help for its command-line
flags when started with the option '-' (minus).
One useful parameter that can be specified in the command line is the
home-page URL. For example to start WebByPhone with the Columbia
web page you can type:
WebByPhone -h http://www.cs.columbia.edu/
Once the program starts it waits for incoming calls.
WebByPhone will detect the phone ring and answer at the first
ring, playing the welcome message. An authentication phase then follows.
In this phase the user is asked to enter a code number and a PIN.
The system performs a lookup among the registered users and, if the user
is present with the correct PIN, grants access and continues
the phone session. If either the user is not registered or the PIN
is incorrect, WebByPhone will terminate the active session with a
"goodbye" message.
The authentication phase is very useful for two reasons: it not only
provides a basic way of performing access control, but it also allows the
system to personalize the session for the particular user. Each user in
the system has a user profile and it is possible to store user related
information such as home page, bookmarks and voice preferences.
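The lookup and the per-user profile can be sketched as below. The source does not show how users are stored, so everything here is an assumption made for illustration: the UserProfile and UserRegistry classes, their fields, and the in-memory map keyed by code number.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical profile holding a few of the items mentioned in the text.
class UserProfile {
    final String name;
    final String pin;
    final String homePage;

    UserProfile(String name, String pin, String homePage) {
        this.name = name;
        this.pin = pin;
        this.homePage = homePage;
    }
}

// In-memory registry of users, keyed by the code number entered on the keypad.
class UserRegistry {
    private final Map<String, UserProfile> users = new HashMap<>();

    void register(String code, UserProfile profile) { users.put(code, profile); }

    // Returns the profile only if the code is registered and the PIN matches.
    Optional<UserProfile> authenticate(String code, String pin) {
        UserProfile p = users.get(code);
        if (p != null && p.pin.equals(pin)) {
            return Optional.of(p);
        }
        return Optional.empty();
    }
}
```

An empty result covers both failure cases in the text (unknown user, wrong PIN), so the caller can end the session the same way for either.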
It is possible to achieve a higher level of authentication by gathering
the caller ID information from the Teltone T-311 unit. The unit I am using
for this prototype does not support caller ID at this time.
Once the user is authenticated WebByPhone refers to the
caller by name, using the corresponding user profile information.
Upon completion of the initial authentication phase, WebByPhone
begins fetching the requested page, playing the user instructions concurrently.
When the page is received and processed by the web browser wrapper, WebByPhone
will start reading the text.
Each link in the page is numbered, and the keyword 'link' is
pronounced before its content is read. At the end of the page the system
reminds the user of all the options and reads the links, starting from
the last one read.
By convention each input through the touch-tone keypad must terminate
with the special character '#'.
To repeat a menu or to get online help, press '*' followed by '#'.
To terminate the call, press '**' followed by '#'.
To select a link, enter the link number followed by '#'.
Other classes of functionality can be implemented by assigning them
to codes greater than 90. For instance, it is possible to map 91 to
a 'go to URL' function. The URL can be supplied to
the system either by selecting it directly from a personalized bookmark or by
typing it through the touch-tone keypad.
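The keypad conventions above can be summarized as a small classifier. This is a sketch only: the DtmfCommand class, its Kind values and the four-digit limit on numeric input are assumptions introduced for the example, while the meaning of '*#', '**#', link numbers and codes greater than 90 follows the text.

```java
// Classifies one '#'-terminated keypad input according to the conventions above.
class DtmfCommand {
    enum Kind { HELP, TERMINATE, LINK, EXTENDED, INVALID }

    final Kind kind;
    final int number;   // link number or extended function code, -1 otherwise

    private DtmfCommand(Kind kind, int number) {
        this.kind = kind;
        this.number = number;
    }

    static DtmfCommand parse(String input) {
        if (input == null || !input.endsWith("#")) {
            return new DtmfCommand(Kind.INVALID, -1);      // every input must end with '#'
        }
        String body = input.substring(0, input.length() - 1);
        if (body.equals("*")) {
            return new DtmfCommand(Kind.HELP, -1);
        }
        if (body.equals("**")) {
            return new DtmfCommand(Kind.TERMINATE, -1);
        }
        // Assumed limit of four digits keeps the value well inside int range.
        if (body.isEmpty() || body.length() > 4 || !body.chars().allMatch(Character::isDigit)) {
            return new DtmfCommand(Kind.INVALID, -1);
        }
        int n = Integer.parseInt(body);
        if (n < 1) {
            return new DtmfCommand(Kind.INVALID, -1);
        }
        // Codes greater than 90 are reserved for additional functions such as 'go to URL'.
        return new DtmfCommand(n > 90 ? Kind.EXTENDED : Kind.LINK, n);
    }
}
```

Under this reading, pages with more than 90 links could not address the links above 90 by number; the source does not say how that case is handled.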
Although entering text through a touch-tone keypad is slow
and tedious, the following algorithm can be used:
The user enters each alphabetical character from a to z with a combination
of two keys: the key containing the character, and the number specifying
the position of the character on that key.
For example, to enter 'home' the user types 42 63 61 32.
Special characters such as '@' and '.' can be mapped to particular
codes.
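The two-key scheme can be sketched as follows. The letter layout used here (abc on 2 through wxyz on 9) is an assumption, since the source does not list the letters on each key, and the KeypadText class is hypothetical; special characters are left out.

```java
// Encodes and decodes lowercase text using the two-key scheme described above:
// first digit = key carrying the letter, second digit = position on that key.
class KeypadText {
    // Assumed letters on keys 2..9 of a standard telephone keypad.
    private static final String[] KEYS =
            {"abc", "def", "ghi", "jkl", "mno", "pqrs", "tuv", "wxyz"};

    static String decode(String input) {
        StringBuilder out = new StringBuilder();
        for (String pair : input.trim().split("\\s+")) {
            if (pair.length() != 2) {
                throw new IllegalArgumentException("expected two digits: " + pair);
            }
            int key = pair.charAt(0) - '0';
            int pos = pair.charAt(1) - '0';
            if (key < 2 || key > 9 || pos < 1 || pos > KEYS[key - 2].length()) {
                throw new IllegalArgumentException("no such letter: " + pair);
            }
            out.append(KEYS[key - 2].charAt(pos - 1));
        }
        return out.toString();
    }

    static String encode(String word) {
        StringBuilder out = new StringBuilder();
        for (char c : word.toCharArray()) {
            boolean found = false;
            for (int k = 0; k < KEYS.length && !found; k++) {
                int pos = KEYS[k].indexOf(c);
                if (pos >= 0) {
                    if (out.length() > 0) out.append(' ');
                    out.append(k + 2).append(pos + 1);
                    found = true;
                }
            }
            if (!found) {
                throw new IllegalArgumentException("unsupported character: " + c);
            }
        }
        return out.toString();
    }
}
```

With this layout, decoding the example from the text, 42 63 61 32, yields 'home'.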
Installation notes:
The whole system code can be packaged and deployed as a single zip archive
file.
With this strategy, installing WebByPhone does not require
any operation other than copying the file to a destination directory,
since the Java VM is able to access the classes as needed directly from
the archive.
The following is a brief description of the installation sequence of
all the helper applications needed by WebByPhone.
ViaVoice Gold
Speech for Java 0.61
Install the Java lib and the related DLL.
Follow the IBM instructions.
SerialPort
Unzip the SerialPort installation and place the DLL in the windows/System32
directory.
Lynx
Install Lynx on the local disk and configure the system as follows:
lynx_cfg= c:\web\lynx32\lynx.cfg
PATH=%PATH%;c:\web\lynx32
Replace the path according to your actual installation path.
Java JDK 1.1.5.
Download the official JDK from JavaSoft and install it by running the
self-extracting executable and following the wizard.
Future Work.
Although WebByPhone implements all the core functionality that allows the
user to navigate a web page by phone, several features should be added
before the system can be used effectively.
It would be useful to emulate the behavior of a common browser by making
it possible to bookmark pages, to access a page directly from a bookmark
and to keep a history of the pages visited to enable back and forth navigation.
These features can be added by introducing a new module in the system without
changing the architecture. Another future enhancement is the handling of
form data entry. This would allow the user to input data and take advantage
of the search engines.
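As an illustration of how small such a module could be, the following sketch keeps a visit history with back and forward navigation. It is not part of WebByPhone; the PageHistory class and its methods are hypothetical, and the behavior of discarding forward entries on a new visit is borrowed from common browsers.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical history module: a list of visited URLs and a cursor.
class PageHistory {
    private final List<String> pages = new ArrayList<>();
    private int current = -1;   // index of the page being read, -1 if none

    // Records a newly fetched page and discards any forward entries.
    void visit(String url) {
        while (pages.size() > current + 1) {
            pages.remove(pages.size() - 1);
        }
        pages.add(url);
        current++;
    }

    // Moves one page back if possible; returns the current page or null if empty.
    String back() {
        if (current > 0) current--;
        return current >= 0 ? pages.get(current) : null;
    }

    // Moves one page forward if possible; returns the current page or null if empty.
    String forward() {
        if (current < pages.size() - 1) current++;
        return current >= 0 ? pages.get(current) : null;
    }
}
```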
An additional refinement can be made in the way the HTML is rendered.
Since the JSAPI makes it possible to select
a speech synthesizer with different attributes, it should be possible to
render different HTML tags with different voice tones, and even to change
from a male to a female voice to emphasize tags like anchors.
All of these possible enhancements can be introduced without
changing the main architecture, simply by adding or modifying a few modules.
DISCLAIMER: This document states my personal opinions
and I am fully responsible for them.
References
[1] BLYNX - http://leb.net/blinux/blynx/index.html
[2] BLINUX - http://leb.net/blinux/index.html
[3] BLINUX Project - http://leb.net/blinux/betas.html
[4] NetPhonic Communications - http://www.netphonic.com/company/company.htm
[5] Web-On-Call - http://www.netphonic.com/product/woc/wocprod.htm
[6] Email-On-Call - http://www.netphonic.com/product/eoc/eocprod.htm
[7] Java API for speech - http://java.sun.com/javaone/javaone98/sessions/T604/
[8] java-media - http://java.sun.com/products/java-media/speech
[9] Speech for Java v0.61 - http://www.alphaworks.ibm.com/
[10] IBM ViaVoice - http://www.software.ibm.com/is/voicetype/
[11] Microsoft Speech API - http://www.microsoft.com/directx/pavilion/dsound/speechapi.htm
[12] SerialPort from Solutions Consulting - http://www.sc-systems.com/serPort.html
[13] Lynx - http://www.slcc.edu/lynx/release2-8/lynx2-8/lynx_help/lynx_help_main.html
[14] Telephone Access unit T-311 - http://www.teltone.com/cti/t-311.html
[15] Design Patterns: Elements of Reusable Object-Oriented Software,
Erich Gamma et al., Addison-Wesley.
Last updated: Sunday, May 3, 1998 by Francesco Caruso