Integrating Email, Text to Speech Synthesis and Advanced Telephony Services

Jeremy Blumenfeld Columbia University, New York, NY USA jjb18@columbia.edu

Miriam Tauil Columbia University, New York, NY USA mtauil@cs.columbia.edu

Abstract
Software Overview
Related Work and Products
Overview of the system architecture

User Interface
Teltone
Pop Mail Interface
TTS

Group Task List
Software Documentation
Source Code
References

Abstract

The Email By Phone system allows the end user to dial in from a regular touch-tone phone, enter their login ID and password, retrieve their email messages and have these messages read to them over the phone. After dialing in, the user interacts with the system only through the touch-tone keys on a regular telephone. At the other end, these tones are received by a Teltone T-311 Telephone Access Unit. The Teltone unit converts the signals into the ASCII codes which are used to control the system.

The system interacts with a POP mail server. It authenticates the user's login name and password on that POP server, reports the number of messages received, and allows the user to listen to message headers and message text and to delete messages. The user may also navigate forwards, backwards or to a specific message. One important feature is that the user may interrupt the reading of any message at any point and cancel or back up the reading, or delete the message.

The messages are read back by using the Bell Labs TTS system. This speech output is directed to the Line-Out jack of the system which is hooked directly back into the Teltone T-311.

The system is interesting in its integration of several pre-existing products: the Teltone Telephone Access Unit and the Bell Labs Text to Speech C programming library.

Software Overview

On start-up the system initializes the Teltone to answer calls and starts up the TTS system as a daemon process to accept text data and output audio data.
When a call is received, the caller is prompted for their user ID and password and enters them using the digits on the phone, following a pre-defined translation from ASCII characters to the handset's digits.
The Teltone converts this information to ASCII.
After receiving this information, the system authenticates the user via the user's default POP server and retrieves the messages.
The user is told the number of messages and prompted for an action: listen to all messages or listen to a specific message number.
For each message played back, the user first hears the sender, subject and date header information for the message, and then the system begins to play back the message text.
The user may interrupt the playback of message or header information at any time and choose a new action.
The server plays back the retrieved text messages using the AT&T Text to Speech software.
After hearing a message the user has the option to delete the message.

Related Work and Products

The Email By Phone system is not based on any single new technology. Rather, it brings together a collection of existing disparate technologies to create a new functionality for users. Its utility is that it brings mobility to email. Email no longer requires the use of a computer, just a touch-tone telephone.

A number of private companies have begun to develop and market similar systems. These companies are all trying to provide a simple, single integrated solution for handling a variety of different communication services: voicemail, email and in some cases fax.

Pure Speech's SpeechMail software, which allows users to dial in and retrieve email from their PC using voice commands as well as touch-tone keys, was recently licensed by Compaq and is being packaged for sale along with their new PCs. A number of services have cropped up which offer email by telephone and other features to their subscribers. E-Now is a California-based start-up which offers access to user email accounts via a 1-800 number and voice/phone inputs. E-Now users don't need to install the software or the server themselves; rather, they access E-Now's system for a flat monthly fee plus additional per-transaction charges. VirtualOffice clients receive their voice mail and faxes on their telephone number, and can retrieve all their mail, including e-mail, via touch-tone phone or the Internet. In Germany, EteX software has developed a similar product for email by phone which has also been licensed for use.

In general, the main technical challenge presented by these systems is the quality of the text-to-speech system and, if offered, the voice recognition. For our system, we utilized the TTS synthesizer developed by Bell Labs. This is discussed in more detail in the Bell Labs TTS section.

Architecture

The following diagram provides a high-level view of the software modules developed in the project:

[Figure: System Architecture]

The User Interface

User Interface Overview:

One of the main design challenges of this project was to create a user interface on a telephone handset for a technology which normally uses a full computer keyboard and monitor. We needed to design an interface which was consistent and easy to use, yet powerful and also secure.

Our main difficulty here was the entry of the user's login ID and password. In our design, the user is required to enter their full login information in order to be authenticated. Although this is not the most user-friendly option, it provides a high level of security.

This required that our system support the mapping of every character that is acceptable in a login ID or password to a touch-tone key or sequence of keys. The following implementation attempts to map phone keys to keyboard characters in a way that is easy to remember (the user will mainly have to remember their password) and to assign shorter key sequences to the characters we expect to be used more frequently (based on our opinion only).

Telephone keys are mapped as follows:

A number in the password is mapped to the same number.
A lower case letter is mapped to two keys:

The touch-tone key on which the letter appears. For example: a->2, b->2, d->3. Since q and z do not appear on the touch-tone key pad, the "1" key is used for them.

"1", "2" or "3", to indicate the letter's position among the letters that appear on that key. For example: "21" is used for a, "22" for b and "23" for c.

An upper case letter is represented using:

The 2 keys that are used for the lower case of this letter, followed by a "1" key, which represents upper case.

Since punctuation can be used in passwords, "." is represented by "*" and ";" by "**".
Finally, every character is terminated by the # key.

Here we summarize the mappings from a keyboard character to a touch-tone key sequence:

Char  Keys    Char  Keys    Char  Keys    Char  Keys    Char  Keys    Char  Keys
a     21#     k     52#     u     82#     A     211#    K     521#    U     821#
b     22#     l     53#     v     83#     B     221#    L     531#    V     831#
c     23#     m     61#     w     91#     C     231#    M     611#    W     911#
d     31#     n     62#     x     92#     D     311#    N     621#    X     921#
e     32#     o     63#     y     93#     E     321#    O     631#    Y     931#
f     33#     p     71#     z     12#     F     331#    P     711#    Z     121#
g     41#     q     11#                   G     411#    Q     111#
h     42#     r     72#                   H     421#    R     721#    ;     **#
i     43#     s     73#                   I     431#    S     731#
j     51#     t     81#                   J     511#    T     811#
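The mapping rules above can be sketched as a small encoder. This is only an illustration of the scheme as described, not the project's actual code: the name encodeChar and the KEY_LETTERS table are hypothetical, and the keypad layout (no Q or Z on keys 2 to 9) is taken from the description above.

```c
#include <ctype.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Letters on keys 2..9 of a keypad without Q and Z (index 0 is key 2). */
static const char *const KEY_LETTERS[] = {
    "abc", "def", "ghi", "jkl", "mno", "prs", "tuv", "wxy"
};

/*
 * Hypothetical helper: write the touch-tone sequence for one character
 * into out, including the terminating '#'. out must hold at least 5 bytes.
 * Returns 0 on success, -1 if the character has no mapping.
 */
int encodeChar(char c, char *out, size_t outSize)
{
    char lower = (char)tolower((unsigned char)c);
    const char *upperMark = isupper((unsigned char)c) ? "1" : "";
    size_t key;

    if (outSize < 5)
        return -1;
    if (isdigit((unsigned char)c)) {          /* digits map to themselves */
        snprintf(out, outSize, "%c#", c);
        return 0;
    }
    if (c == '.') { strcpy(out, "*#");  return 0; }
    if (c == ';') { strcpy(out, "**#"); return 0; }
    if (lower == 'q' || lower == 'z') {       /* q and z live on the "1" key */
        snprintf(out, outSize, "1%c%s#", lower == 'q' ? '1' : '2', upperMark);
        return 0;
    }
    for (key = 0; key < sizeof KEY_LETTERS / sizeof KEY_LETTERS[0]; key++) {
        const char *pos = (lower != '\0') ? strchr(KEY_LETTERS[key], lower) : NULL;
        if (pos != NULL) {
            /* key digit, then position 1..3 on that key, then optional upper mark */
            snprintf(out, outSize, "%c%c%s#",
                     (char)('2' + key),
                     (char)('1' + (pos - KEY_LETTERS[key])),
                     upperMark);
            return 0;
        }
    }
    return -1;
}
```

On the receiving side the system would apply the inverse of this table to the digits reported by the Teltone, splitting the input at each # key.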

User Interface Implementation:

At each point in the user's interaction with the system, only certain options may be available. For example, before retrieving any messages the user must first pass through the authentication procedure. The system therefore needs to maintain some state information, and so we have modeled it as a state machine.

Each state is processed as follows: the state's message is played out; if any user input is required, it is accepted from the user; and if the input calls for any extra processing, the system performs it.

The following table summarizes the fields in the state table. Each state may specify a message to be played out, the user input to accept, any extra processing which needs to occur, and the next state, which depends on the result of that extra processing. To change the messages or other fields, edit the corresponding entries in the source code.

State Name | Message to Play | Input Digit(s) | Extra Process | Extra Process Result | Next State
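A table-driven state machine with these fields might look roughly like the sketch below. The state names, messages, and the helper functions inputPresent, choseListen and step are hypothetical placeholders chosen for illustration; they are not taken from the project's source.

```c
#include <stddef.h>

typedef enum { ST_LOGIN, ST_MENU, ST_PLAY, ST_HANGUP, ST_COUNT } StateId;

/* Extra processing hook: returns nonzero on success, zero on failure. */
typedef int (*ExtraProcess)(const char *input);

typedef struct {
    const char  *name;           /* State Name */
    const char  *message;        /* Message to Play */
    int          inputDigits;    /* Input Digit(s): 0 = none, -1 = variable length */
    ExtraProcess extra;          /* Extra Process, or NULL if none */
    StateId      nextOnSuccess;  /* Next State when the extra process succeeds */
    StateId      nextOnFailure;  /* Next State when it fails */
} State;

/* Placeholder checks standing in for real authentication and menu handling. */
int inputPresent(const char *input) { return input != NULL && input[0] != '\0'; }
int choseListen(const char *input)  { return input != NULL && input[0] == '1'; }

/* Hypothetical table; entries are indexed by StateId. */
const State STATE_TABLE[ST_COUNT] = {
    { "LOGIN",  "Please enter your login ID and password.", -1, inputPresent, ST_MENU,   ST_LOGIN  },
    { "MENU",   "Press 1 to listen to your messages.",       1, choseListen,  ST_PLAY,   ST_HANGUP },
    { "PLAY",   "Playing message.",                          0, NULL,         ST_MENU,   ST_MENU   },
    { "HANGUP", "Goodbye.",                                  0, NULL,         ST_HANGUP, ST_HANGUP }
};

/*
 * Process one state: a real system would play s->message and collect input
 * here; this sketch only runs the extra process and picks the next state.
 */
StateId step(const State *table, StateId current, const char *input)
{
    const State *s = &table[current];
    int ok = (s->extra == NULL) ? 1 : s->extra(input);
    return ok ? s->nextOnSuccess : s->nextOnFailure;
}
```

Keeping the transitions in a data table means a prompt or a transition can be changed by editing one row rather than the control flow.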

Security Considerations:

The security design goal is to allow each user to access their own messages and to prevent anybody else from accessing them. Since mail message security is provided by the user's login name and password, these must be entered at the beginning of each Email By Phone session before the user can retrieve messages. Each user is allowed three attempts to log in, after which the system automatically disconnects them. This is to prevent attempts at password guessing. To make guessing harder still, the system reports only limited information when a user enters invalid data: it does not say whether the problem is with the login name or with the password.
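The three-attempt rule could be expressed along the following lines. This is a sketch under stated assumptions: tryLogin, Credentials and demoAuth are hypothetical names, and demoAuth merely stands in for the real check against the POP server.

```c
#include <stddef.h>
#include <string.h>

#define MAX_LOGIN_ATTEMPTS 3

typedef struct {
    const char *user;
    const char *pass;
} Credentials;

/* Authentication callback: returns nonzero if the pair is accepted. */
typedef int (*AuthFn)(const char *user, const char *pass);

/* Stand-in for the POP server check (assumption, for illustration only). */
int demoAuth(const char *user, const char *pass)
{
    return strcmp(user, "demo") == 0 && strcmp(pass, "secret") == 0;
}

/*
 * Try the supplied credentials in order, stopping after three attempts.
 * Returns the 1-based number of the attempt that succeeded, or 0 if the
 * caller should be disconnected.
 */
int tryLogin(AuthFn auth, const Credentials *tries, size_t tryCount)
{
    size_t i;
    for (i = 0; i < tryCount && i < MAX_LOGIN_ATTEMPTS; i++) {
        if (auth(tries[i].user, tries[i].pass))
            return (int)(i + 1);
        /* On failure a single generic message would be played here, one that
           does not reveal whether the login name or the password was wrong. */
    }
    return 0;
}
```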

Security Limitations

Anybody listening on the user's phone line can compromise the security of the mail messages. The communication over this medium cannot be encrypted, since no decryption tool is available at the phone end.

This puts both the retrieved messages and the user's password at risk, but it is the price of the ease of use provided by a very common device: the phone.

This problem cannot be addressed by this project, since its intent is to retrieve mail messages with a regular phone and the problem is inherent in the phone system.

Scalability - Support of users from multiple pop servers:

The default mail server name is currently "cs.columbia.edu". The system can easily be extended to work with another single POP server by storing the default mail server in a configuration file. It could also be extended to support users from multiple POP servers, using a database that maps each "Email-by-Phone login ID" to a "POP server login ID" and a "POP server name". The purpose of the "Email-by-Phone login ID" is to resolve the possible conflict when the same login ID exists on different POP servers.

This database will have to be populated with the users' information (Email-by-Phone login ID, POP server login ID and POP server name). Users that belong to the default mail server will still be able to use the system without prior configuration (or database population).
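The lookup described above, with fallback to the default server, might be sketched as follows. The proposed database is represented here by a plain in-memory array, and the names UserMapping, PopAccount and resolveAccount are hypothetical.

```c
#include <stddef.h>
#include <string.h>

#define DEFAULT_POP_SERVER "cs.columbia.edu"

/* One row of the proposed mapping database. */
typedef struct {
    const char *phoneLogin;  /* Email-by-Phone login ID */
    const char *popLogin;    /* POP server login ID */
    const char *popServer;   /* POP server name */
} UserMapping;

typedef struct {
    const char *popLogin;
    const char *popServer;
} PopAccount;

/*
 * Resolve an Email-by-Phone login ID. Users not found in the table are
 * assumed to belong to the default server under the same login ID.
 */
PopAccount resolveAccount(const UserMapping *db, size_t count, const char *phoneLogin)
{
    PopAccount acct;
    size_t i;

    for (i = 0; i < count; i++) {
        if (strcmp(db[i].phoneLogin, phoneLogin) == 0) {
            acct.popLogin = db[i].popLogin;
            acct.popServer = db[i].popServer;
            return acct;
        }
    }
    acct.popLogin = phoneLogin;
    acct.popServer = DEFAULT_POP_SERVER;
    return acct;
}
```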

Teltone Interface

The Teltone Access Unit provides the communication between the phone line and the computer. The Teltone unit is responsible for answering phone calls and passing the user's input to the server and, in the other direction, for passing the spoken messages to the caller. The Teltone accepts AT commands through an RS-232-C serial port.

A small Teltone library was implemented in teltone.c to support the project. It includes low-level routines that write to and listen on the RS-232 port, send standard AT telephony commands as well as some commands specific to the Teltone, and check the return codes of these commands. It also includes high-level routines that hide this complexity and can be used without any prior knowledge of AT telephony commands. The teltoneInit() function should be called to initiate the connection, and teltoneEnd() to end it. Some examples of the high-level routines are:

setToAnsweringMode(int ringNum)
acceptIncommingCall( int timeout)
disconnect()
acceptUserInputFromTeltone(char *data, int timeoutSec, int charCount)

The low level routines were designed to be generous in what they accept. For example, when reading the return code of the commands that return OK or ERR, the function will read a maximum of 10 lines until OK or ERR are read. Of course, OK or ERR should normally appear on the next line or the second next line. If any minor changes are added in future versions of the Teltone unit these functions will still work.
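The tolerant return-code check could be sketched as below. This is not the code in teltone.c: readReturnCode and the Response type are hypothetical, a stdio stream stands in for the serial port, and matching "ERR" as a prefix (so that a longer word such as "ERROR" is also accepted) is an assumption made in the same generous spirit.

```c
#include <stdio.h>
#include <string.h>

#define MAX_RESPONSE_LINES 10

typedef enum { RESP_OK, RESP_ERR, RESP_NONE } Response;

/*
 * Read up to MAX_RESPONSE_LINES lines from the port, skipping echoes and
 * blank lines, until a line reading OK or starting with ERR is seen.
 * Returns RESP_NONE if neither appears or the stream ends first.
 */
Response readReturnCode(FILE *port)
{
    char line[128];
    int i;

    for (i = 0; i < MAX_RESPONSE_LINES; i++) {
        if (fgets(line, sizeof line, port) == NULL)
            return RESP_NONE;
        line[strcspn(line, "\r\n")] = '\0';   /* strip the line ending */
        if (strcmp(line, "OK") == 0)
            return RESP_OK;
        if (strncmp(line, "ERR", 3) == 0)
            return RESP_ERR;
    }
    return RESP_NONE;
}
```

Bounding the number of lines keeps a confused or silent device from stalling the caller indefinitely while still tolerating extra output before the result code.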

Also, the Teltone is set to return its return codes as words rather than numeric codes, to make debugging easier.

POP Mail Interface

The modules dealing with the POP mail server involve two parts: interactions with the POP mail server via TCP, and parsing of messages for headers and MIME attached files. The POP RFC provides a simple framework for interactions with the POP server. The pop.c module provides an interface layer through which the Email By Phone system can send messages to the POP server and receive back useful information for the system. The functions which directly interact with the POP server are intended to be as generous as possible in what they accept from the POP server. These functions were tested against the servers running on the SUN Solaris systems in the CS Lab and a Linux system. These functions should also interact with an IMAP server.

In retrieving messages from the POP server, the system places a limit on the size of each individual message. This limits the delay between the request for a message and playing it out. It also provides more stable memory management.
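One way to enforce such a limit is to ask the server for a message's size before retrieving it. In POP3 (RFC 1939) the reply to "LIST n" has the form "+OK n size", with the size in octets. The sketch below parses that reply; the function names and the 64 KB limit are hypothetical and are not taken from pop.c.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical per-message limit; the value actually used is not stated. */
#define MAX_MESSAGE_OCTETS 65536L

/* POP3 status lines start with "+OK" on success and "-ERR" on failure. */
int popIsOk(const char *line)
{
    return strncmp(line, "+OK", 3) == 0;
}

/*
 * Parse the reply to "LIST n", which has the form "+OK n size".
 * Returns 0 and fills msgNum and octets on success, -1 otherwise.
 */
int parseListReply(const char *line, int *msgNum, long *octets)
{
    if (!popIsOk(line))
        return -1;
    return sscanf(line + 3, "%d %ld", msgNum, octets) == 2 ? 0 : -1;
}

/* Decide whether a message of the given size should be retrieved in full. */
int messageFitsLimit(long octets)
{
    return octets >= 0 && octets <= MAX_MESSAGE_OCTETS;
}
```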

The header parsing returns information pulled out of the relevant header fields in a form which is intended to be more easily read out by the TTS. The RFCs describing mail messages and message headers allow a lot of latitude to the sender's mail client. The parsing functions attempt to deal with this. These functions were tested against mail messages composed with a variety of different clients.
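Two examples of that latitude are that header field names are case-insensitive and that a long field may be folded across several lines, with each continuation line starting with a space or tab (RFC 822). A minimal sketch of a tolerant field lookup follows; getHeader and sameNoCase are hypothetical names, not the functions in the project.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Compare the first n characters of a and b, ignoring case. */
static int sameNoCase(const char *a, const char *b, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++) {
        if (tolower((unsigned char)a[i]) != tolower((unsigned char)b[i]))
            return 0;
    }
    return 1;
}

/*
 * Copy the value of header field `name` into out, joining folded
 * continuation lines with a single space. Searching stops at the blank
 * line that ends the headers. Returns 0 if found, -1 otherwise.
 * Values longer than the buffer are truncated.
 */
int getHeader(const char *headers, const char *name, char *out, size_t outSize)
{
    size_t nameLen = strlen(name);
    const char *line = headers;

    if (outSize == 0)
        return -1;
    while (*line != '\0' && *line != '\r' && *line != '\n') {
        const char *eol = line + strcspn(line, "\n");
        if (sameNoCase(line, name, nameLen) && line[nameLen] == ':') {
            const char *p = line + nameLen + 1;
            size_t used = 0;
            while (*p == ' ' || *p == '\t') p++;
            for (;;) {
                while (*p != '\0' && *p != '\r' && *p != '\n') {
                    if (used + 1 < outSize) out[used++] = *p;
                    p++;
                }
                if (*p == '\r') p++;
                if (*p == '\n') p++;
                if (*p != ' ' && *p != '\t') break;   /* no continuation line */
                while (*p == ' ' || *p == '\t') p++;
                if (used + 1 < outSize) out[used++] = ' ';
            }
            out[used] = '\0';
            return 0;
        }
        line = (*eol == '\n') ? eol + 1 : eol;
    }
    return -1;
}
```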

The POP mail interface finds MIME boundaries and the MIME content-type and encoding fields. It also provides a separate function to decode Base64-encoded attachments.
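For reference, Base64 decoding as defined in RFC 2045 can be sketched in a few lines. The decoder below is an illustration only and is not the project's function: base64Decode and b64Value are hypothetical names. As the RFC requires, characters outside the Base64 alphabet (such as line breaks) are ignored.

```c
#include <stddef.h>

/* Map one Base64 alphabet character to its 6-bit value, or -1 if not in the alphabet. */
static int b64Value(int c)
{
    if (c >= 'A' && c <= 'Z') return c - 'A';
    if (c >= 'a' && c <= 'z') return c - 'a' + 26;
    if (c >= '0' && c <= '9') return c - '0' + 52;
    if (c == '+') return 62;
    if (c == '/') return 63;
    return -1;
}

/*
 * Decode NUL-terminated Base64 text into out. Decoding stops at the first
 * '=' padding character. Returns the number of bytes written, or -1 if
 * out is too small.
 */
long base64Decode(const char *in, unsigned char *out, size_t outSize)
{
    unsigned long acc = 0;   /* bit accumulator */
    int bits = 0;            /* number of pending bits in acc */
    size_t n = 0;

    for (; *in != '\0' && *in != '='; in++) {
        int v = b64Value((unsigned char)*in);
        if (v < 0)
            continue;        /* skip CR, LF and other non-alphabet characters */
        acc = (acc << 6) | (unsigned long)v;
        bits += 6;
        if (bits >= 8) {
            bits -= 8;
            if (n >= outSize)
                return -1;
            out[n++] = (unsigned char)((acc >> bits) & 0xFF);
        }
    }
    return (long)n;
}
```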

Bell Labs TTS

In using the TTS system, we were mainly constrained by the requirement that the user should be able to cancel play-out of the mail data at any point, without having to wait for all the data to be played out. This requires that the system implement some type of concurrency and that the play-out of TTS data be interruptible by the system.

The TTS library has two options: allow the TTS server to handle the speech output by sending it directly to the audio port, or send the data back to the client. To implement this type of concurrency we had to be able to interrupt the speech output, so we needed to maintain control over audio output within the Email By Phone system. Unfortunately, the current TTS library did not always act as expected when sending the speech data back to the client. In particular, the library is supposed to allow the client to specify a function to handle the speech output. This capability did not seem to be available. Instead, to control the output we specified a file into which TTS writes the speech data, the only other option TTS provides for returning data to a client. After writing out the data, the TTS call returns and the file can be processed by our system. This allows the system to interrupt speech play-out when the user presses a key on the phone.

The main drawback to this is that there is an increased latency between requesting a message and hearing it. In order to be sure that the file is ready to be safely played-out to the audio port, the entire message is first fully processed by the TTS server and saved to disk before being output to the audio port.
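Once the speech data is in a file, interruptible play-out can be approximated by writing the audio in small chunks and checking for a key press between chunks. The sketch below shows the idea with callbacks; playInterruptible, the callback types and the demo helpers are hypothetical, and the real system's audio and Teltone calls are not shown.

```c
#include <stddef.h>
#include <stdio.h>

#define CHUNK_BYTES 512

/* Returns nonzero if the caller has pressed a key. */
typedef int (*KeyPressedFn)(void *ctx);
/* Sends one chunk of speech data to the audio port. */
typedef void (*WriteAudioFn)(void *ctx, const unsigned char *data, size_t len);

/*
 * Play the speech file chunk by chunk, polling for a key press before
 * each chunk. Returns 1 if the whole file was played, 0 if interrupted.
 */
int playInterruptible(FILE *speech, KeyPressedFn keyPressed,
                      WriteAudioFn writeAudio, void *ctx)
{
    unsigned char chunk[CHUNK_BYTES];
    size_t got;

    while ((got = fread(chunk, 1, sizeof chunk, speech)) > 0) {
        if (keyPressed(ctx))
            return 0;
        writeAudio(ctx, chunk, got);
    }
    return 1;
}

/* Demo stand-ins: count bytes "played" and simulate a key press after a threshold. */
typedef struct {
    size_t bytesPlayed;
    size_t interruptAfter;
} DemoLine;

int demoKeyPressed(void *ctx)
{
    DemoLine *d = ctx;
    return d->bytesPlayed >= d->interruptAfter;
}

void demoWriteAudio(void *ctx, const unsigned char *data, size_t len)
{
    DemoLine *d = ctx;
    (void)data;
    d->bytesPlayed += len;
}
```

A smaller chunk size makes the system respond to a key press sooner, at the cost of more frequent polling.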

Task List

Design user interface at the phone-end: (Miriam Tauil)

Design and implement security, so each user can access only his own messages.
Allow the user to navigate and manage their list of messages.

Communicate with the Teltone T-311 through the serial port. (Miriam Tauil)

Accept the ASCII commands received from the "Teltone T-311" hardware.
Send this information to the POP server.

Communication with the POP server, including:

User authentication, message text and header retrieval, parsing of messages and headers. (Jeremy Blumenfeld)

Playback of user text messages into the "Teltone T-311" port using the AT&T Text to Speech software. (Jeremy Blumenfeld)

Control of system to allow user-input to cancel.

References

Stevens, W. R., Advanced Programming in the UNIX Environment, 1992, Chapter 11
Gallas, J., Introduction - Serial Programming Guide for POSIX Compliant Operating Systems
Teltone T-311 documentation
AT&T Text to Speech software
POP3 and MIME RFCs:

Post Office Protocol version 3.0 (POP3): RFC 1939
Standard for the Format of ARPA Internet Text Messages: RFC 822
Multimedia E-mail User Agent Checklist: RFC 1844
Multipurpose Internet Mail Extensions (MIME) Part One: Format of Internet Message Bodies: RFC 2045

Last updated: May 6, 1998.