GB/T 36464.1-2020: Information technology - Intelligent speech interaction system - Part 1: General specifications (Status: Valid)
Basic data
Standard ID: GB/T 36464.1-2020
Description (translated): Information technology - Intelligent speech interaction system - Part 1: General specifications
Sector / Industry: National Standard (Recommended)
Chinese Standard Classification: L77
International Standard Classification (ICS): 35.240.01
Word Count Estimation: 22,225
Date of Issue: 2020-04-28
Date of Implementation: 2020-11-01
Issuing agencies: State Administration for Market Regulation; Standardization Administration of China
ICS 35.240.01
L77
National Standard of the People's Republic of China
Information technology - Intelligent speech interaction system
Part 1: General specifications
Issued 2020-04-28; implemented 2020-11-01
Issued by the State Administration for Market Regulation and the Standardization Administration of China
Table of contents
Foreword
1 Scope
2 Normative references
3 Terms and definitions
4 General functional framework of the system
5 Voice interaction interface requirements
5.1 Voice collection
5.2 Voice broadcast
5.3 Input and output
5.4 Environmental noise adaptability
6 Data resource requirements
6.1 Audio data
6.2 Text data
7 Front-end processing requirements
7.1 Voice wake-up
7.2 Sound source localization
7.3 Voiceprint recognition
7.4 Speech enhancement
7.5 Format conversion
7.6 Resampling
8 Voice processing requirements
8.1 Speech recognition
8.2 Semantic understanding
8.3 Speech synthesis
8.4 Endpoint detection
8.5 Voice codec
8.6 Full-duplex interaction
8.7 Affective computing
9 Service interface requirements
10 Application business processing requirements
Appendix A (informative) Some parameters and their calculation methods
A.1 Overview
A.2 Pickup distance
A.3 Voice interaction success rate
A.4 Voice wake-up
A.5 Speech recognition
A.6 Semantic understanding
A.7 Speech synthesis
A.8 Voice quality
A.9 Voiceprint recognition rate
A.10 Speech coding/decompression ratio
A.11 Speech enhancement
A.12 Sound source localization
A.13 Voice interruption success rate
Bibliography
Information technology - Intelligent speech interaction system
Part 1: General specifications
1 Scope
This part of GB/T 36464 gives the general functional framework of the intelligent voice interaction system and specifies the requirements for its functional units: the voice interaction interface, data resources, front-end processing, voice processing, the service interface, and application business processing.
This part applies to the general design, development, application and maintenance of intelligent voice interaction systems.
2 Normative references
The following documents are indispensable for the application of this document. For dated references, only the edition cited applies to this document. For undated references, the latest edition (including all amendments) applies to this document.
GB/T 11460 Information technology - Chinese character font requirements and testing methods
GB 18030 Information technology - Chinese coded character set
GB/T 21024-2007 General technical specification for Chinese speech synthesis systems
GB/T 34083-2017 Chinese speech recognition Internet service interface specification
GB/T 34145-2017 Chinese speech synthesis Internet service interface specification
SJ/T 11380-2008 Technical specification for automatic voiceprint recognition (speaker recognition)
3 Terms and definitions
The following terms and definitions apply to this document.
3.1
Voice interaction
Information transmission and communication activities between humans and functional units through voice.
[GB/T 36464.2-2018, definition 3.1]
3.2
Voice interaction system
A system that is composed of functional units (or combinations thereof), data resources, etc. that can realize voice interaction with humans.
[GB/T 36464.2-2018, definition 3.2]
3.3
Intelligent voice interaction system
A voice interaction system that is built on all or some of the artificial intelligence technologies of speech recognition, semantic understanding, and speech synthesis, is composed of intelligent software and hardware, and has intelligent human-computer interaction capabilities.
3.4
Human-computer interaction
Information transmission and communication activities carried out, in a certain interactive manner, between humans and functional units in order to complete certain tasks.
3.5
Functional unit
A hardware entity, a software entity, or a combination of the two, that can complete a specific task.
[GB/T 5271.1-2000, definition 01.01.40]
3.6
Speech synthesis
The process of synthesizing human speech through mechanical and electronic methods.
[GB/T 21024-2007, definition 3.1]
3.30
Affective computing
The collection, recognition, decision-making and expression of specific emotions in the process of human-computer interaction.
4 System general function framework
The intelligent voice interaction system (hereinafter "the system") comprises functional units such as the voice interaction interface, front-end processing, voice processing, the service interface, application business processing, and data resources, where:
a) the voice interaction interface provides the human-machine interface through which the system and a person interact directly by voice, covering voice signal input and output and the voice capabilities supported by front-end processing and voice processing;
b) data resources comprise the audio data and text data processed by the system;
c) front-end processing provides functions such as voice wake-up, sound source localization, voiceprint recognition, speech enhancement, format conversion and resampling;
d) voice processing provides functions such as speech recognition, semantic understanding, speech synthesis, endpoint detection, voice codec, full-duplex interaction and affective computing;
e) the service interface provides an interface through which external equipment/facilities call the system's voice services;
f) application business processing converts the results of voice processing into corresponding application instructions and feeds back the service response results.
The general functional framework of the system is shown in Figure 1; some parameter definitions and calculation methods are given in Appendix A.
a) They should be independent of any specific operating system and platform, and be extensible;
b) they should be structured data, to facilitate processing by the system;
c) Chinese coded characters should meet the requirements of GB 18030 and be tested in accordance with GB/T 11460;
d) the data exchange format for Chinese speech synthesis should meet the requirements of Chapter 5 of GB/T 21024-2007.
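The GB 18030 requirement in item c) can be illustrated with a short check. This sketch is not part of the standard's text; it assumes only Python's built-in "gb18030" codec:

```python
# Illustrative sketch (not from the standard): verify that Chinese text
# round-trips through the GB 18030 encoding required by item c), using
# Python's built-in "gb18030" codec.
def gb18030_roundtrip(text: str) -> bool:
    """Return True if `text` survives a GB 18030 encode/decode cycle."""
    raw = text.encode("gb18030")   # bytes in the GB 18030 encoding
    return raw.decode("gb18030") == text

print(gb18030_roundtrip("智能语音交互系统"))  # True
```

A conformance test suite might run such a round-trip over every text resource the system ships.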
7 Front-end processing requirements
7.1 Voice wake-up
7.1.1 Command word wake up
The system should support the use of predefined command words to wake up the system by voice.
7.1.2 Command word voiceprint wake up
During voice wake-up, the system should support text-dependent voiceprint recognition combined with command word matching, and should wake the system only after the voiceprint is successfully confirmed.
7.1.3 Custom wake-up command word
The system should support customization of the command word used for voice wake-up.
7.1.4 Multiple wake-up command words
The system should support the use of different command words for voice wake-up, and may enter the corresponding state or mode according to the specific wake-up command word used.
7.1.5 Multi-audio stream monitoring
While listening for voice wake-up, the system should support monitoring multiple audio streams simultaneously.
7.2 Sound source localization
The system should support locating the sound source by calculating its plane angle, pitch angle and distance.
7.3 Voiceprint recognition
7.3.1 General requirements
The system should support the following voiceprint recognition functions.
a) Text-related voiceprint recognition;
b) Text-independent voiceprint recognition;
c) Voiceprint recognition of specified text;
d) Voiceprint model training;
e) Voiceprint model adaptation;
f) Voiceprint confirmation;
g) Voiceprint identification;
h) Voiceprint detection;
i) Voiceprint tracking;
j) Language-related voiceprint recognition;
k) Language-independent voiceprint recognition.
The above functional descriptions and requirements shall meet the requirements of Chapter 3 of SJ/T 11380-2008.
7.3.2 Voiceprint text acquisition
The system should support acquiring specified text or custom text for voiceprint model training, voiceprint model adaptation, voiceprint confirmation and voiceprint identification.
7.4 Speech enhancement
7.4.1 Noise suppression
The system should support the suppression of background noise in the input speech and improve the signal-to-noise ratio of the speech.
7.4.2 Reverberation cancellation
The system should support the suppression of late reverberation in the input speech to improve the clarity and intelligibility of the speech signal.
7.5 Format conversion
The system should support converting audio from one format to another to meet the requirements of voice processing.
7.6 Resampling
The system should support changing the sampling rate of digital voice signals to meet the requirements of voice processing.
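The standard does not prescribe a resampling algorithm. As an illustration only, the following minimal linear-interpolation resampler sketches the sample-rate change; a production system would use a polyphase or band-limited filter to avoid aliasing:

```python
# Illustrative sketch (not from the standard): change the sampling rate of
# a digital voice signal by linear interpolation between source samples.
def resample(samples, src_rate, dst_rate):
    """Resample `samples` from src_rate to dst_rate by linear interpolation."""
    if src_rate == dst_rate:
        return list(samples)
    n_out = int(len(samples) * dst_rate / src_rate)
    out = []
    for i in range(n_out):
        pos = i * src_rate / dst_rate          # fractional position in source
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out

halved = resample([0.0, 1.0, 0.0, -1.0], 16000, 8000)
print(len(halved))  # 2
```

Downsampling from 16 kHz to 8 kHz, as here, halves the number of samples; upsampling interpolates new samples between existing ones.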
8 Voice processing requirements
8.1 Speech recognition
8.1.1 General requirements
The system should support all or most of the following speech recognition functions.
a) Chinese speech recognition service;
b) Multilingual recognition;
c) Multi-dialect recognition;
d) Multilingual mixed reading recognition;
e) Custom syntax;
f) Personalized identification;
g) Multiple candidates for recognition results;
h) Custom hot words;
i) Advanced recognition results;
j) Language information recognition;
k) Speaker information recognition.
The above functional descriptions and requirements should meet the requirements of 4.2 and 4.3 in GB/T 34083-2017.
8.1.2 Voice recognition method
The system should support near-field audio processing and/or far-field audio processing, and should support at least one of keyword recognition, command word recognition and continuous speech recognition.
8.2 Semantic understanding
8.2.1 Custom Semantic Dictionary
The system can support application-defined semantic dictionaries and user-defined semantic dictionaries.
8.2.2 Custom semantic library
The system can support application-defined semantic libraries and user-defined semantic libraries.
8.2.3 Fuzzy recognition
The system should correctly handle typos, synonyms, and extra or missing words.
8.2.4 Semantic extraction
During interaction, the system should extract the semantic elements and the key intent of the user.
8.2.5 Semantic Sorting
In its semantic understanding results, the system may provide multiple ranked interpretations for the user to choose from or confirm.
8.3 Speech synthesis
The system should support all or most of the following speech synthesis functions.
a) Chinese speech synthesis;
b) Streaming speech synthesis;
c) Multiple synthetic text encodings;
d) Personalized synthesis;
e) Multilingual synthesis;
f) Multi-dialect synthesis;
g) Multilingual mixed reading synthesis;
h) Synthetic audio multi-timbre;
i) User-defined word segmentation;
j) User-defined pronunciation;
k) Synthetic text location information;
l) Text segmentation and pinyin information;
m) Audio time information.
The above functional descriptions and requirements should meet the requirements of 4.2 and 4.3 in GB/T 34145-2017.
8.4 Endpoint detection
8.4.1 Single Endpoint Detection
The system should support detecting the start point and end point of the first speech segment from a continuous audio stream.
8.4.2 Multi-endpoint detection
The system should support detecting the start and end points of multiple speech segments from a continuous audio stream.
8.4.3 Endpoint detection sensitivity setting
The system should support setting the voice waiting timeout and the trailing-silence length to adjust the sensitivity of voice endpoint detection.
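The standard does not prescribe a detection algorithm. The following energy-based sketch (not from the standard) illustrates single-endpoint detection per 8.4.1, with the trailing-silence length of 8.4.3 as a tunable parameter; the frame energies and the threshold are illustrative assumptions:

```python
# Illustrative sketch (not from the standard): energy-based detection of
# the first speech segment in a stream of per-frame energies. The segment
# closes after `tail_frames` consecutive low-energy frames (8.4.3).
def detect_endpoints(energies, threshold=0.1, tail_frames=5):
    """Return (start, end) frame indices of the first speech segment,
    where `end` is the first frame after the segment, or None if no
    speech is found."""
    start = None
    silence = 0
    for i, e in enumerate(energies):
        if e >= threshold:
            if start is None:
                start = i          # first voiced frame opens the segment
            silence = 0
        elif start is not None:
            silence += 1
            if silence >= tail_frames:
                return (start, i - tail_frames + 1)
    if start is not None:
        return (start, len(energies) - silence)
    return None

print(detect_endpoints([0.0, 0.0, 0.5, 0.6, 0.0, 0.0, 0.0], tail_frames=2))  # (2, 4)
```

Raising `tail_frames` makes the detector tolerate longer pauses before closing a segment, which corresponds to lowering the sensitivity described in 8.4.3.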
8.5 Voice codec
8.5.1 Variable rate encoding
The system should support changing the bit-stream rate of the coded speech output by the speech coding algorithm, by setting the coding level or by other means.
8.5.2 Compression level setting
The system should support setting the compression level of the speech coding algorithm according to the current network conditions, system performance and other requirements.
8.6 Full duplex interaction
The system should support full-duplex voice interaction. In this state it should support one-time voice wake-up and voice interruption at any time, enable contextual and open dialogue management, control the rhythm of the dialogue, and predict user intentions.
8.7 Affective computing
The system should support affective computing with the voice signal as the carrier.
9 Service interface requirements
The system should have a service interface that can be called externally. The Internet interface for Chinese speech recognition should comply with GB/T 34083, and the Internet interface for Chinese speech synthesis should comply with GB/T 34145.
10 Application business processing requirements
The system should support the conversion of user intentions into application and business control commands or system instructions to achieve application and business response.
Appendix A
(Informative appendix)
Some parameters and their calculation methods
A.1 Overview
This appendix gives some parameter definitions and calculation methods used to describe the intelligent voice interaction system.
A.2 Pickup distance
When the distance between the sound source and the pickup device is less than or equal to 1 m, it is the near field; when the distance is greater than 1 m, it is the far field.
A.3 Voice interaction success rate
The voice interaction success rate is the percentage, over a given period, of successful voice interaction sessions among all effective voice interaction sessions. A "successful voice interaction session" is one in which a complete voice service result is obtained without errors; "effective voice interaction sessions" are all sessions excluding those that failed because of user-terminal faults, user behavior, or parameter errors.
The interaction success rate is calculated by formula (A.1):

PS = S / (S + F) × 100%   (A.1)

where:
PS --- interaction success rate, %;
S --- number of successful interactions;
F --- number of failed interactions.
A.4 Voice wakeup
A.4.1 Wake-up rate
The wake-up rate is the ratio of the number of successful wake-ups to the total number of voice wake-up operations over a given period; it describes how correctly the system responds to voice wake-up operations. It is calculated by formula (A.2):

ρsw = Nsw / Nw × 100%   (A.2)

where:
ρsw --- wake-up rate, %;
Nsw --- number of successful wake-ups;
Nw --- number of voice wake-up operations.
A.4.2 Frequency of false wakeups
The false wake-up frequency describes how often false wake-ups occur per unit time, and is calculated by formula (A.3):

fFW = NFW / T   (A.3)

where:
fFW --- false wake-up frequency, in times per hour (times/h);
NFW --- number of false wake-ups during the evaluation period;
T --- duration of the evaluation, in hours (h).
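As an illustration only (not from the standard), formulas (A.2) and (A.3) translate directly into two small helpers:

```python
# Illustrative sketch (not from the standard) of formulas (A.2) and (A.3).
def wakeup_rate(n_success: int, n_total: int) -> float:
    """rho_sw = N_sw / N_w * 100, in percent (A.2)."""
    return 100.0 * n_success / n_total

def false_wakeup_frequency(n_false: int, hours: float) -> float:
    """f_FW = N_FW / T, in times per hour (A.3)."""
    return n_false / hours

print(wakeup_rate(95, 100))           # 95.0
print(false_wakeup_frequency(3, 24))  # 0.125
```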
A.5 Speech recognition
A.5.1 Word accuracy
For the calculation of word accuracy, see 5.2.1 of GB/T 21023-2007.
A.5.2 Sentence recognition rate
The sentence recognition rate is calculated by formula (A.4):

Psr = Nsr / Nsi × 100%   (A.4)

where:
Psr --- sentence recognition rate, %;
Nsr --- number of sentences correctly recognized by the intelligent voice interaction system;
Nsi --- total number of labeled sentences.
A.5.3 Initial response time
The initial response time is the time elapsed from when the user's valid voice input is detected to when the first part of the recognition result is obtained, in milliseconds (ms); it describes the real-time responsiveness of speech recognition.
A.5.4 End response time
The end response time is the time elapsed from when the user's valid voice input is detected to when the last part of the recognition result is obtained, in milliseconds (ms); it describes the real-time responsiveness of speech recognition.
A.6 Semantic understanding
A.6.1 Correct rate of semantic understanding
The semantic understanding accuracy is calculated by formula (A.5):

RSS = NSS / N × 100%   (A.5)

where:
RSS --- semantic understanding accuracy, %;
NSS --- number of times the operation intent and semantic elements are judged correctly;
N --- total number of user inputs whose text was correctly recognized.
A.6.2 Correct response rate
The response accuracy of semantic understanding is calculated by formula (A.6).
A.9 Voiceprint recognition rate
The voiceprint recognition rate includes parameters such as the false rejection rate, false acceptance rate, missed recognition rate and false alarm rate; for the calculation methods, refer to 3.3.2 of SJ/T 11380-2008.
A.10 Speech coding/decompression ratio
The speech coding/decompression ratio is the ratio of the bit-stream rate of the compressed audio output by the speech compression algorithm to that of the input audio to be compressed.
A.11 Speech enhancement
A.11.1 Signal-to-noise ratio improvement
The signal-to-noise ratio improvement is the ratio of the signal-to-noise ratio of the speech output by the speech enhancement functional unit to that of the input speech.
A.11.2 Noise suppression
The amount of noise suppression is calculated by formula (A.7):

DNR = 10 log( Σ_{n=0}^{N-1} |νin(n)|² / Σ_{n=0}^{N-1} |νout(n)|² )   (A.7)

where:
DNR --- amount of noise suppression, in decibels (dB);
νin(n) --- amplitude of the n-th noise component in the input signal;
νout(n) --- amplitude of the n-th noise component in the output signal;
N --- total number of frequency components in the input signal spectrum.
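As an illustration only (not from the standard), formula (A.7) can be computed as follows; plain lists of noise-component amplitudes stand in for the spectral components, where real code would work on FFT bins of the input and output signals:

```python
import math

# Illustrative sketch (not from the standard) of formula (A.7): noise
# suppression in dB from the noise-component amplitudes of the input and
# output signals.
def noise_suppression_db(noise_in, noise_out):
    """DNR = 10 * log10( sum |v_in(n)|^2 / sum |v_out(n)|^2 ), in dB."""
    p_in = sum(abs(v) ** 2 for v in noise_in)
    p_out = sum(abs(v) ** 2 for v in noise_out)
    return 10.0 * math.log10(p_in / p_out)

print(round(noise_suppression_db([1.0, 1.0], [0.1, 0.1]), 6))  # 20.0
```

A 20 dB result means the enhancement unit attenuated the noise power by a factor of 100.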
A.12 Sound source localization
A.12.1 Plane angle positioning error
The plane angle positioning error is the difference between the plane angle of the sound source position calculated by the sound source positioning function unit and the true value.
A.12.2 Pitch angle positioning error
The pitch angle positioning error is the difference between the pitch angle of the sound source position calculated by the sound source positioning function unit and the true value.
A.12.3 Distance positioning error
The distance positioning error is the difference between the sound source position distance calculated by the sound source positioning function unit and the true value.
A.13 Voice interruption success rate
In dialogue management, the voice interruption success rate is the ratio, over a given period, of the number of correct responses to voice interruption operations to the total number of such operations.