Client-side versus Server-side Development of VoiceXML Applications

This post discusses some of the design decisions behind VoiceModel, an open source project that provides a framework to simplify developing VoiceXML applications with Microsoft ASP.NET MVC and Visual Studio. VoiceXML has made developing IVR voice/speech applications much more like developing web applications. The IVR acts like a web browser; in fact, the component of the IVR that consumes VoiceXML documents is often referred to as the VoiceXML browser. That makes the IVR the client side, and any application logic we put into a VoiceXML document is processed client-side. Later versions of VoiceXML added constructs that make it possible to access data sources directly from VoiceXML, but you usually still need server-side programming to reach back-end systems, retrieve or update data, and dynamically generate VoiceXML documents containing the data to be voiced back to the caller. So how much of the application should be processed on the client, and how much on the server?

Modern web applications have evolved to put much more software on the client side to provide a richer user interface that performs better. Technologies such as AJAX have made this possible. But even with this trend, well-designed web applications clearly separate the concerns of business logic (the Model in MVC), presentation to the user (the View in MVC), and application control or flow (the Controller in MVC). So how much logic should be in your VoiceXML documents? I would argue very little. Although VoiceXML, with its embedded ECMAScript, is expressive enough to hold a lot of business logic and call flow, only the presentation to the user should live there. The Voice Browser Working Group, which is in charge of the VoiceXML standard, seems to agree. The next version of VoiceXML (version 3.0) addresses the coupling of presentation with domain logic and control, as discussed in this preview of VoiceXML 3.0 put together by the Voice Browser Working Group. One of the things they have done is to provide a new language, State Chart XML (SCXML), to handle call control and remove that concern from VoiceXML. What is not clear to me is whether they intend SCXML to be processed in the IVR (client-side) or on the web/application server (server-side).

VoiceModel takes the approach of handling application control, or call flow, on the server, just like any web application developed using ASP.NET MVC. The VoiceModel architecture is very flexible in how you implement the controller. You can use the standard controllers provided by ASP.NET MVC, where you develop methods that represent actions; you can use the simple state machine provided by VoiceModel; or you can use any other method of choice, such as Windows Workflow Foundation. There is a good example of how you can use Windows Workflow Foundation in an ASP.NET MVC application here. I hope to explore using Windows Workflow Foundation with VoiceModel in a later post, as it seems like a good fit now that it supports state machines. Another approach would be to generate the state machine from SCXML, so that the call flow could be defined in that language. I have demonstrated how to use the built-in state machine in the post "Where is the Controller for an MVC VoiceXML Application".
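To make the idea of server-side call flow concrete, here is a minimal sketch of a state machine that maps caller events to the next step of a call. It is written in Python purely for brevity and is not VoiceModel's actual state machine or its API; the names `State`, `CallFlow`, `fire`, and `build_demo_flow`, as well as the demo states, are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class State:
    """One step in the call flow; `view` names what to present to the IVR."""
    name: str
    view: str
    # Maps an event (for example, the caller's input) to the next state's name.
    transitions: Dict[str, str] = field(default_factory=dict)


class CallFlow:
    """Hypothetical server-side controller: it owns the flow, not the VoiceXML."""

    def __init__(self, start: str) -> None:
        self.states: Dict[str, State] = {}
        self.current = start

    def add(self, state: State) -> "CallFlow":
        self.states[state.name] = state
        return self

    def fire(self, event: str) -> State:
        """Advance on an event; an unknown event leaves the state unchanged."""
        here = self.states[self.current]
        self.current = here.transitions.get(event, self.current)
        return self.states[self.current]


def build_demo_flow() -> CallFlow:
    """Assumed example flow: greeting, a menu collected as input, then exit."""
    return (
        CallFlow(start="greeting")
        .add(State("greeting", "Output", {"continue": "main_menu"}))
        .add(State("main_menu", "Input", {"1": "balance", "2": "goodbye"}))
        .add(State("balance", "Output", {"continue": "goodbye"}))
        .add(State("goodbye", "Exit"))
    )


if __name__ == "__main__":
    flow = build_demo_flow()
    for event in ["continue", "9", "1", "continue"]:
        state = flow.fire(event)
        print(event, "->", state.name, f"({state.view})")
```

Because the flow lives on the server, invalid input (the "9" above) is handled by the controller simply re-presenting the current state, and the IVR never needs to know the navigation rules.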

The presentation layer for VoiceModel takes a minimalist approach, presenting to the IVR only what is required to provide a voice user interface (VUI). VoiceModel currently has only three major objects in the model to support the VUI:

  • Output - Voices audio files and/or text-to-speech to the user.
  • Input - Prompts the user to enter some information on the telephone keypad or speak it and collects the user input.
  • Exit - Optionally voices information to the caller and then hangs up.

This is 95% of what you need to create a VUI. Some people would ask: what about a menu object for handling navigation by the user? It is not needed in the VoiceModel approach. Just use the Input object and pass the user's input back to the controller for navigation and error handling. I believe only two more objects are required in VoiceModel: one to handle recording the user, and one to handle transferring the user to another phone or device. These will be added in future versions of VoiceModel. A few more utility objects may be required as I explore using VoiceModel with CCXML for more advanced call control, such as outbound voice applications.
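As a rough illustration of how little VoiceXML the three objects above need to produce, the following sketch renders each one as a small VoiceXML document. It is a Python approximation, not VoiceModel's real classes or rendering code: the class fields, the `render` function, the URLs, and the use of the built-in `digits` field type are all assumptions made for the example.

```python
from dataclasses import dataclass
from xml.sax.saxutils import escape, quoteattr

VXML_OPEN = '<vxml version="2.1" xmlns="http://www.w3.org/2001/vxml">'


@dataclass
class Output:
    text: str      # text-to-speech to voice to the caller
    next_url: str  # hypothetical controller URL to fetch afterwards


@dataclass
class Input:
    name: str      # variable name posted back to the controller
    prompt: str
    next_url: str


@dataclass
class Exit:
    text: str = ""  # optional closing message


def render(view) -> str:
    """Wrap one view object in a complete, minimal VoiceXML document."""
    if isinstance(view, Output):
        body = (
            f"<block><prompt>{escape(view.text)}</prompt>"
            f"<goto next={quoteattr(view.next_url)}/></block>"
        )
    elif isinstance(view, Input):
        # No menu logic here: the raw input is submitted to the server,
        # which decides where to navigate and how to handle errors.
        body = (
            f'<field name={quoteattr(view.name)} type="digits">'
            f"<prompt>{escape(view.prompt)}</prompt>"
            f"<filled><submit next={quoteattr(view.next_url)} "
            f"namelist={quoteattr(view.name)}/></filled></field>"
        )
    elif isinstance(view, Exit):
        prompt = f"<prompt>{escape(view.text)}</prompt>" if view.text else ""
        body = f"<block>{prompt}<exit/></block>"
    else:
        raise TypeError(f"unsupported view: {type(view).__name__}")
    return f"{VXML_OPEN}<form>{body}</form></vxml>"


if __name__ == "__main__":
    print(render(Input("choice", "Press 1 for sales or 2 for support.", "/flow/main_menu")))
```

Each document contains a prompt and a single hand-off back to the server, which is the whole point: navigation is decided by the controller once the `<submit>` arrives, so no menu construct is needed on the client.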

VoiceModel handles the domain model just like any other well-designed ASP.NET MVC application: you define business objects that represent the application domain and decouple them from the method of data persistence. This is too large a topic to discuss here, but suffice it to say you would use the same approach as when developing web applications. This has the added advantage that any software already developed to define the domain model for a web application can be reused in your voice application.
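A small sketch may help show what that reuse looks like. The example below, in Python rather than C# and using entirely hypothetical names (`Account`, `AccountRepository`, `InMemoryAccountRepository`, and the two presentation functions), keeps a business object behind a persistence-agnostic interface and presents it once for a web page and once for a voice prompt.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Protocol


@dataclass
class Account:
    """Business object: knows nothing about storage or presentation."""
    number: str
    balance_cents: int


class AccountRepository(Protocol):
    """Persistence boundary; a database-backed class could satisfy this too."""

    def find(self, number: str) -> Account: ...


class InMemoryAccountRepository:
    """Stand-in store for the example, assumed in place of a real database."""

    def __init__(self, accounts: Iterable[Account]) -> None:
        self._accounts: Dict[str, Account] = {a.number: a for a in accounts}

    def find(self, number: str) -> Account:
        return self._accounts[number]


def balance_for_web(repo: AccountRepository, number: str) -> str:
    """Presentation for a web view."""
    account = repo.find(number)
    return f"${account.balance_cents / 100:.2f}"


def balance_for_voice(repo: AccountRepository, number: str) -> str:
    """Presentation for a voice prompt, built from the same domain object."""
    account = repo.find(number)
    dollars, cents = divmod(account.balance_cents, 100)
    return f"Your balance is {dollars} dollars and {cents} cents."


if __name__ == "__main__":
    repo = InMemoryAccountRepository([Account("1001", 12345)])
    print(balance_for_web(repo, "1001"))
    print(balance_for_voice(repo, "1001"))
```

Only the two small presentation functions differ between the modalities; the domain object and the repository are shared unchanged.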

As you can see from this description of the VoiceModel design, there is very little client-side processing: just what is required to define the presentation layer, or voice user interface. All processing of call flow and domain logic moves to the server side. This has a number of advantages.

  • The VoiceXML documents presented to the IVR are brief and quick to process and transport.  
  • Developing, debugging and testing domain logic and control are much easier in the Visual Studio environment. Visual Studio provides a rich set of debuggers and tools that simplify application development. I have not found any VoiceXML development environment that comes close to the productivity of Visual Studio. Debugging in VoiceXML development environments takes me back to the mainframe days of debugging through logs and trace files.
  • This approach scales like any other web application, since call flow and domain logic run on the web/application server.
  • It promotes reuse of .NET components across different modalities of user interfaces.
  • Developing in VoiceXML is very error-prone and cumbersome compared to C# or VB.NET. VoiceXML and ECMAScript are not type safe, so you tend to find more issues during testing than during programming. With C# and VB.NET you get IntelliSense, which I find to be a huge productivity gain.

Using VoiceModel has additional benefits not related to the minimalist client-side approach.

  • Leverage your ASP.NET MVC skills to develop voice/speech applications.
  • Reuse existing .NET artifacts.
  • Leverage the extensive .NET Framework.
  • It is open source, so you are in control of your voice/speech projects without being locked into a tool vendor that may not survive.

I recommend you explore the VoiceModel Project on CodePlex. I am always interested in feedback and contributions to this project. With contributions from the community, I feel VoiceModel has the potential to make it much easier to develop and deploy better quality voice/speech applications.
