Speech Recognition Grammar Specification

Speech Recognition Grammar Specification (SRGS) is a W3C standard for specifying speech recognition grammars. A speech recognition grammar is a set of word patterns that tells a speech recognition system what to expect a human to say. For instance, an auto-attendant application prompts the caller for the name of a person, with the expectation that the call will be transferred to that person's phone. It then starts a speech recognizer, giving it a speech recognition grammar that contains the names of the people in the auto attendant's directory and a collection of sentence patterns typical of callers' responses to the prompt.

SRGS specifies two alternative but equivalent syntaxes, one based on XML and one based on an augmented BNF (ABNF) format. In practice, the XML syntax is used more frequently.

Both the ABNF and XML forms have the expressive power of a context-free grammar. A grammar processor that does not support recursive grammars is limited to the expressive power of a finite-state machine or regular-expression language.
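
For illustration, the following XML-form fragment (constructed for this article, not taken from the specification) defines a right-recursive rule that matches one or more spoken digits. The rule references itself, which is legal SRGS but requires a processor that supports recursion; a processor without such support would instead need an equivalent iterative formulation such as <item repeat="1-"><ruleref uri="#digit"/></item>.

<rule id="digit">
  <one-of>
    <item>zero</item>
    <item>one</item>
    <item>two</item>
  </one-of>
</rule>

<!-- "digit-string" references itself: a recursive rule,
     giving the grammar context-free expressive power -->
<rule id="digit-string">
  <ruleref uri="#digit"/>
  <item repeat="0-1">
    <ruleref uri="#digit-string"/>
  </item>
</rule>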

If the speech recognizer returned just a string containing the words actually spoken by the user, the voice application would be left with the tedious job of extracting the semantic meaning from those words. For this reason, SRGS grammars can be decorated with tag elements which, when executed, build up the semantic result. SRGS does not specify the contents of the tag elements: that is done in a companion W3C standard, Semantic Interpretation for Speech Recognition (SISR). SISR is based on ECMAScript, and ECMAScript statements inside the SRGS tags build up an ECMAScript semantic result object that is easy for the voice application to process.
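
As a minimal sketch (the grammar and values here are invented for illustration), an XML-form rule annotated with SISR tags might look like the following. The grammar element declares tag-format="semantics/1.0" to identify SISR as the tag content language, and the out variable defined by SISR holds the rule's semantic result:

<grammar xmlns="http://www.w3.org/2001/06/grammar" xml:lang="en-US"
         version="1.0" root="size" tag-format="semantics/1.0">
  <rule id="size" scope="public">
    <one-of>
      <!-- each tag assigns the rule's semantic result -->
      <item>small <tag>out = "S";</tag></item>
      <item>medium <tag>out = "M";</tag></item>
      <item>large <tag>out = "L";</tag></item>
    </one-of>
  </rule>
</grammar>

A caller saying "medium" would then yield the semantic result "M" rather than the raw transcript, sparing the application any string parsing.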

Both SRGS and SISR are W3C Recommendations, the final stage of the W3C standards track. The W3C VoiceXML standard, which defines how voice dialogs are specified, depends heavily on SRGS and SISR.
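
For example, a VoiceXML field can reference an SRGS grammar by URI. In this sketch, request.grxml is a hypothetical file holding an XML-form grammar such as the one shown in the Examples section below:

<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form>
    <field name="person">
      <prompt>Who would you like to speak to?</prompt>
      <!-- the recognizer is constrained by the referenced SRGS grammar -->
      <grammar src="request.grxml" type="application/srgs+xml"/>
      <filled>
        <prompt>Transferring you to <value expr="person"/>.</prompt>
      </filled>
    </field>
  </form>
</vxml>

At runtime the voice browser passes the grammar to the recognizer and assigns the recognition result to the person field.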

Examples

Here is an example of the ABNF form of SRGS, as it could be used in an auto-attendant application:

#ABNF 1.0 ISO-8859-1;

// Default grammar language is US English
language en-US;

// Single language attachment to tokens
// Note that "fr-CA" (Canadian French) is applied to only
//  the word "oui" because of precedence rules
$yes = yes | oui!fr-CA;

// Single language attachment to an expansion
$people1 = (Michel Tremblay | André Roy)!fr-CA;

// Handling language-specific pronunciations of the same word
// A capable speech recognizer will listen for Mexican Spanish and
//   US English pronunciations.
$people2 = Jose!en-US | Jose!es-MX;

/**
 * Multi-lingual input possible
 * @example may I speak to André Roy
 * @example may I speak to Jose
 */
public $request = may I speak to ($people1 | $people2);

Here is the same SRGS example, using the XML form:

<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE grammar PUBLIC "-//W3C//DTD GRAMMAR 1.0//EN"
                  "http://www.w3.org/TR/speech-grammar/grammar.dtd">

<!-- the default grammar language is US English -->
<grammar xmlns="http://www.w3.org/2001/06/grammar"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://www.w3.org/2001/06/grammar
                             http://www.w3.org/TR/speech-grammar/grammar.xsd"
         xml:lang="en-US" version="1.0">

  <!--
     single language attachment to tokens
     "yes" inherits US English language
     "oui" is Canadian French language
  -->
  <rule id="yes">
    <one-of>
      <item>yes</item>
      <item xml:lang="fr-CA">oui</item>
    </one-of>
  </rule>

  <!-- Single language attachment to an expansion -->
  <rule id="people1">
    <one-of xml:lang="fr-CA">
      <item>Michel Tremblay</item>
      <item>André Roy</item>
    </one-of>
  </rule>

  <!--
     Handling language-specific pronunciations of the same word
     A capable speech recognizer will listen for Mexican Spanish
     and US English pronunciations.
  -->
  <rule id="people2">
    <one-of>
      <item xml:lang="en-US">Jose</item>
      <item xml:lang="es-MX">Jose</item>
    </one-of>
  </rule>

  <!-- Multi-lingual input is possible -->
  <rule id="request" scope="public">
    <example>may I speak to André Roy</example>
    <example>may I speak to Jose</example>
    may I speak to
    <one-of>
      <item><ruleref uri="#people1"/></item>
      <item><ruleref uri="#people2"/></item>
    </one-of>
  </rule>
</grammar>

See also

- World Wide Web Consortium (W3C)
- XML
- VoiceXML
- Voice browser
- Call Control eXtensible Markup Language (CCXML)
- Speech Synthesis Markup Language (SSML)
- Semantic Interpretation for Speech Recognition (SISR)
- Pronunciation Lexicon Specification (PLS)
- Natural Language Semantics Markup Language (NLSML)
- Speech Application Programming Interface (SAPI)
- State Chart XML (SCXML)
- Multimodal Architecture and Interfaces