GB/T 36464.1-2020: Information technology - Intelligent speech interaction system - Part 1: General specifications (Status: Valid)
Basic data
Standard ID: GB/T 36464.1-2020
Description (translated): Information technology - Intelligent speech interaction system - Part 1: General specifications
Sector / Industry: National Standard (Recommended)
Chinese Standard Classification: L77
International Standard Classification (ICS): 35.240.01
Word Count Estimation: 22,225
Date of Issue: 2020-04-28
Date of Implementation: 2020-11-01
Issuing agencies: State Administration for Market Regulation; Standardization Administration of China
ICS 35.240.01
L77
National Standard of the People's Republic of China
Information technology - Intelligent speech interaction system
Part 1: General specifications
Issued 2020-04-28; implemented 2020-11-01
Issued by the State Administration for Market Regulation and the Standardization Administration of China
Table of contents
Foreword
1 Scope
2 Normative references
3 Terms and definitions
4 General functional framework of the system
5 Voice interaction interface requirements
5.1 Voice collection
5.2 Voice broadcast
5.3 Input and output
5.4 Environmental noise adaptability
6 Data resource requirements
6.1 Audio data
6.2 Text data
7 Front-end processing requirements
7.1 Voice wake-up
7.2 Sound source localization
7.3 Voiceprint recognition
7.4 Speech enhancement
7.5 Format conversion
7.6 Resampling
8 Voice processing requirements
8.1 Speech recognition
8.2 Semantic understanding
8.3 Speech synthesis
8.4 Endpoint detection
8.5 Voice codec
8.6 Full-duplex interaction
8.7 Affective computing
9 Service interface requirements
10 Application business processing requirements
Appendix A (informative) Some parameters and their calculation methods
A.1 Overview
A.2 Pickup distance
A.3 Voice interaction success rate
A.4 Voice wake-up
A.5 Speech recognition
A.6 Semantic understanding
A.7 Speech synthesis
A.8 Voice quality
A.9 Voiceprint recognition rate
A.10 Speech coding/decompression ratio
A.11 Speech enhancement
A.12 Sound source localization
A.13 Voice interruption success rate
Bibliography
Information technology - Intelligent speech interaction system
Part 1: General specifications
1 Scope
This part of GB/T 36464 gives the general functional framework of the intelligent voice interaction system and specifies the requirements for its functional units: the voice interaction interface, data resources, front-end processing, voice processing, the service interface, and application business processing.
This part applies to the general design, development, application and maintenance of intelligent voice interaction systems.
2 Normative references
The following documents are indispensable for the application of this document. For dated references, only the edition cited applies to this document. For undated references, the latest edition (including all amendments) applies to this document.
GB/T 11460 Information technology - Chinese character font requirements and testing methods
GB 18030 Information technology - Chinese coded character set
GB/T 21024-2007 General technical specification for Chinese speech synthesis systems
GB/T 34083-2017 Chinese speech recognition Internet service interface specification
GB/T 34145-2017 Chinese speech synthesis Internet service interface specification
SJ/T 11380-2008 Technical specification for automatic voiceprint recognition (speaker recognition)
3 Terms and definitions
The following terms and definitions apply to this document.
3.1
Voice interaction
Information transmission and communication activities between humans and functional units through voice.
[GB/T 36464.2-2018, definition 3.1]
3.2
Voice interaction system
A system that is composed of functional units (or combinations thereof), data resources, etc. that can realize voice interaction with humans.
[GB/T 36464.2-2018, definition 3.2]
3.3
Intelligent voice interaction system
A voice interaction system that is built on all or some of the artificial intelligence technologies of speech recognition, semantic understanding, and speech synthesis, is composed of intelligent software and hardware, and has intelligent human-computer interaction capabilities.
3.4
Human-computer interaction
Information transmission and communication activities carried out, in a certain interactive manner, between humans and functional units in order to complete certain tasks.
3.5
Functional unit
A hardware entity, a software entity, or a combination of the two, that can complete a specific task.
[GB/T 5271.1-2000, definition 01.01.40]
3.6
Speech synthesis
The process of synthesizing human speech through mechanical and electronic methods.
[GB/T 21024-2007, definition 3.1]
3.30
Affective computing
The collection, recognition, decision-making and expression of specific emotions in the process of human-computer interaction.
4 System general function framework
The intelligent voice interaction system (hereinafter "the system") comprises functional units such as the voice interaction interface, front-end processing, voice processing, the service interface, application business processing, and data resources, where:
a) the voice interaction interface provides the human-machine interface through which the system and a person interact directly by voice, covering voice signal input and output and the voice capabilities supported by front-end processing and voice processing;
b) data resources comprise the audio data and text data processed by the system;
c) front-end processing provides functions such as voice wake-up, sound source localization, voiceprint recognition, speech enhancement, format conversion and resampling;
d) voice processing provides functions such as speech recognition, semantic understanding, speech synthesis, endpoint detection, voice codec, full-duplex interaction and affective computing;
e) the service interface provides an interface through which external equipment/facilities call the system's voice services;
f) application business processing converts the results of voice processing into corresponding application instructions and feeds back the service response results.
The general functional framework of the system is shown in Figure 1; some parameter definitions and calculation methods are given in Appendix A.
a) They should be independent of any specific operating system and platform, and be extensible;
b) they should be structured data, to facilitate processing by the system;
c) Chinese coded characters should meet the requirements of GB 18030 and be tested in accordance with GB/T 11460;
d) the data exchange format for Chinese speech synthesis should meet the requirements of Chapter 5 of GB/T 21024-2007.
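The GB 18030 requirement in item c) can be illustrated with a short check. This sketch is not part of the standard's text; it assumes only Python's built-in "gb18030" codec:

```python
# Illustrative sketch (not from the standard): verify that Chinese text
# round-trips through the GB 18030 encoding required by item c), using
# Python's built-in "gb18030" codec.
def gb18030_roundtrip(text: str) -> bool:
    """Return True if `text` survives a GB 18030 encode/decode cycle."""
    raw = text.encode("gb18030")   # bytes in the GB 18030 encoding
    return raw.decode("gb18030") == text

print(gb18030_roundtrip("智能语音交互系统"))  # True
```

A conformance test suite might run such a round-trip over every text resource the system ships.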
7 Front-end processing requirements
7.1 Voice wake-up
7.1.1 Command word wake up
The system should support the use of predefined command words to wake up the system by voice.
7.1.2 Command word voiceprint wake up
During voice wake-up, the system should support text-dependent voiceprint recognition combined with command word matching, and should wake the system only after the voiceprint is successfully confirmed.
7.1.3 Custom wake-up command word
The system should support customization of the command word used for voice wake-up.
7.1.4 Multiple wake-up command words
The system should support the use of different command words for voice wake-up, and may enter the corresponding state or mode according to the specific wake-up command word used.
7.1.5 Multi-audio stream monitoring
While listening for voice wake-up, the system should support monitoring multiple audio streams simultaneously.
7.2 Sound source localization
The system should support locating the sound source by calculating its plane angle, pitch angle and distance.
7.3 Voiceprint recognition
7.3.1 General requirements
The system should support the following voiceprint recognition functions.
a) Text-related voiceprint recognition;
b) Text-independent voiceprint recognition;
c) Voiceprint recognition of specified text;
d) Voiceprint model training;
e) Voiceprint model adaptation;
f) Voiceprint confirmation;
g) Voiceprint identification;
h) Voiceprint detection;
i) Voiceprint tracking;
j) Language-related voiceprint recognition;
k) Language-independent voiceprint recognition.
The above functional descriptions and requirements shall meet the requirements of Chapter 3 of SJ/T 11380-2008.
7.3.2 Voiceprint text acquisition
The system should support acquiring specified text or custom text for voiceprint model training, voiceprint model adaptation, voiceprint confirmation and voiceprint identification.
7.4 Speech enhancement
7.4.1 Noise suppression
The system should support the suppression of background noise in the input speech and improve the signal-to-noise ratio of the speech.
7.4.2 Reverberation cancellation
The system should support the suppression of late reverberation in the input speech to improve the clarity and intelligibility of the speech signal.
7.5 Format conversion
The system should support converting audio from one format to another to meet the requirements of voice processing.
7.6 Resampling
The system should support changing the sampling rate of digital voice signals to meet the requirements of voice processing.
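The standard does not prescribe a resampling algorithm. As an illustration only, the following minimal linear-interpolation resampler sketches the sample-rate change; a production system would use a polyphase or band-limited filter to avoid aliasing:

```python
# Illustrative sketch (not from the standard): change the sampling rate of
# a digital voice signal by linear interpolation between source samples.
def resample(samples, src_rate, dst_rate):
    """Resample `samples` from src_rate to dst_rate by linear interpolation."""
    if src_rate == dst_rate:
        return list(samples)
    n_out = int(len(samples) * dst_rate / src_rate)
    out = []
    for i in range(n_out):
        pos = i * src_rate / dst_rate          # fractional position in source
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out

halved = resample([0.0, 1.0, 0.0, -1.0], 16000, 8000)
print(len(halved))  # 2
```

Downsampling from 16 kHz to 8 kHz, as here, halves the number of samples; upsampling interpolates new samples between existing ones.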
8 Voice processing requirements
8.1 Speech recognition
8.1.1 General requirements
The system should support all or most of the following speech recognition functions.
a) Chinese speech recognition service;
b) Multilingual recognition;
c) Multi-dialect recognition;
d) Multilingual mixed reading recognition;
e) Custom syntax;
f) Personalized identification;
g) Multiple candidates for recognition results;
h) Custom hot words;
i) Advanced recognition results;
j) Language information recognition;
k) Speaker information recognition.
The above functional descriptions and requirements should meet the requirements of 4.2 and 4.3 in GB/T 34083-2017.
8.1.2 Voice recognition method
The system should support near-field audio processing and/or far-field audio processing, and should support at least one of keyword recognition, command word recognition and continuous speech recognition.
8.2 Semantic understanding
8.2.1 Custom Semantic Dictionary
The system can support application-defined semantic dictionaries and user-defined semantic dictionaries.
8.2.2 Custom semantic library
The system can support application-defined semantic libraries and user-defined semantic libraries.
8.2.3 Fuzzy recognition
The system should correctly handle typos, synonyms, and extra or missing words.
8.2.4 Semantic extraction
During interaction, the system should extract the semantic elements and the key intent of the user.
8.2.5 Semantic Sorting
In its semantic understanding results, the system may provide multiple ranked interpretations for the user to choose from or confirm.
8.3 Speech synthesis
The system should support all or most of the following speech synthesis functions.
a) Chinese speech synthesis;
b) Streaming speech synthesis;
c) Multiple synthetic text encodings;
d) Personalized synthesis;
e) Multilingual synthesis;
f) Multi-dialect synthesis;
g) Multilingual mixed reading synthesis;
h) Synthetic audio multi-timbre;
i) User-defined word segmentation;
j) User-defined pronunciation;
k) Synthetic text location information;
l) Text segmentation and pinyin information;
m) Audio time information.
The above functional descriptions and requirements should meet the requirements of 4.2 and 4.3 in GB/T 34145-2017.
8.4 Endpoint detection
8.4.1 Single Endpoint Detection
The system should support detecting the start point and end point of the first speech segment from a continuous audio stream.
8.4.2 Multi-endpoint detection
The system should support detecting the start and end points of multiple speech segments from a continuous audio stream.
8.4.3 Endpoint detection sensitivity setting
The system should support setting the voice waiting timeout and the trailing-silence length to adjust the sensitivity of voice endpoint detection.
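The standard does not prescribe a detection algorithm. The following energy-based sketch (not from the standard) illustrates single-endpoint detection per 8.4.1, with the trailing-silence length of 8.4.3 as a tunable parameter; the frame energies and the threshold are illustrative assumptions:

```python
# Illustrative sketch (not from the standard): energy-based detection of
# the first speech segment in a stream of per-frame energies. The segment
# closes after `tail_frames` consecutive low-energy frames (8.4.3).
def detect_endpoints(energies, threshold=0.1, tail_frames=5):
    """Return (start, end) frame indices of the first speech segment,
    where `end` is the first frame after the segment, or None if no
    speech is found."""
    start = None
    silence = 0
    for i, e in enumerate(energies):
        if e >= threshold:
            if start is None:
                start = i          # first voiced frame opens the segment
            silence = 0
        elif start is not None:
            silence += 1
            if silence >= tail_frames:
                return (start, i - tail_frames + 1)
    if start is not None:
        return (start, len(energies) - silence)
    return None

print(detect_endpoints([0.0, 0.0, 0.5, 0.6, 0.0, 0.0, 0.0], tail_frames=2))  # (2, 4)
```

Raising `tail_frames` makes the detector tolerate longer pauses before closing a segment, which corresponds to lowering the sensitivity described in 8.4.3.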
8.5 Voice codec
8.5.1 Variable rate encoding
The system should support changing the bit-stream rate of the coded speech output by the speech coding algorithm, by setting the coding level or by other means.
8.5.2 Compression level setting
The system should support setting the compression level of the speech coding algorithm according to the current network conditions, system performance and other requirements.
8.6 Full duplex interaction
The system should support full-duplex voice interaction. In this state it should support one-time voice wake-up and voice interruption at any time, enable contextual and open dialogue management, control the rhythm of the dialogue, and predict user intentions.
8.7 Affective computing
The system should support affective computing with the voice signal as the carrier.
9 Service interface requirements
The system should have a service interface that can be called externally. The Internet interface for Chinese speech recognition should comply with GB/T 34083, and the Internet interface for Chinese speech synthesis should comply with GB/T 34145.
10 Application business processing requirements
The system should support the conversion of user intentions into application and business control commands or system instructions to achieve application and business response.
Appendix A
(Informative appendix)
Some parameters and their calculation methods
A.1 Overview
This appendix gives some parameter definitions and calculation methods used to describe the intelligent voice interaction system.
A.2 Pickup distance
When the distance between the sound source and the pickup device is less than or equal to 1 m, it is the near field; when the distance is greater than 1 m, it is the far field.
A.3 Voice interaction success rate
The voice interaction success rate is the percentage, over a given period, of successful voice interaction sessions among all effective voice interaction sessions. A "successful voice interaction session" is one in which a complete voice service result is obtained without errors; "effective voice interaction sessions" are all sessions excluding those that failed because of user-terminal faults, user behavior, or parameter errors.
The interaction success rate is calculated by formula (A.1):

PS = S / (S + F) × 100%   (A.1)

where:
PS --- interaction success rate, %;
S --- number of successful interactions;
F --- number of failed interactions.
A.4 Voice wakeup
A.4.1 Wake-up rate
The wake-up rate is the ratio of the number of successful wake-ups to the total number of voice wake-up operations over a given period; it describes how correctly the system responds to voice wake-up operations. It is calculated by formula (A.2):

ρsw = Nsw / Nw × 100%   (A.2)

where:
ρsw --- wake-up rate, %;
Nsw --- number of successful wake-ups;
Nw --- number of voice wake-up operations.
A.4.2 Frequency of false wakeups
The false wake-up frequency describes how often false wake-ups occur per unit time, and is calculated by formula (A.3):

fFW = NFW / T   (A.3)

where:
fFW --- false wake-up frequency, in times per hour (times/h);
NFW --- number of false wake-ups during the evaluation period;
T --- duration of the evaluation, in hours (h).
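As an illustration only (not from the standard), formulas (A.2) and (A.3) translate directly into two small helpers:

```python
# Illustrative sketch (not from the standard) of formulas (A.2) and (A.3).
def wakeup_rate(n_success: int, n_total: int) -> float:
    """rho_sw = N_sw / N_w * 100, in percent (A.2)."""
    return 100.0 * n_success / n_total

def false_wakeup_frequency(n_false: int, hours: float) -> float:
    """f_FW = N_FW / T, in times per hour (A.3)."""
    return n_false / hours

print(wakeup_rate(95, 100))           # 95.0
print(false_wakeup_frequency(3, 24))  # 0.125
```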
A.5 Speech recognition
A.5.1 Word accuracy
For the calculation of word accuracy, see 5.2.1 of GB/T 21023-2007.
A.5.2 Sentence recognition rate
The sentence recognition rate is calculated by formula (A.4):

Psr = Nsr / Nsi × 100%   (A.4)

where:
Psr --- sentence recognition rate, %;
Nsr --- number of sentences correctly recognized by the intelligent voice interaction system;
Nsi --- total number of labeled sentences.
A.5.3 Initial response time
The initial response time is the time elapsed from when the user's valid voice input is detected to when the first part of the recognition result is obtained, in milliseconds (ms); it describes the real-time responsiveness of speech recognition.
A.5.4 End response time
The end response time is the time elapsed from when the user's valid voice input is detected to when the last part of the recognition result is obtained, in milliseconds (ms); it describes the real-time responsiveness of speech recognition.
A.6 Semantic understanding
A.6.1 Correct rate of semantic understanding
The semantic understanding accuracy is calculated by formula (A.5):

RSS = NSS / N × 100%   (A.5)

where:
RSS --- semantic understanding accuracy, %;
NSS --- number of times the operation intent and semantic elements are judged correctly;
N --- total number of user inputs whose text was correctly recognized.
A.6.2 Correct response rate
The response accuracy of semantic understanding is calculated by formula (A.6).
A.9 Voiceprint recognition rate
The voiceprint recognition rate includes parameters such as the false rejection rate, false acceptance rate, missed recognition rate and false alarm rate; for the calculation methods, refer to 3.3.2 of SJ/T 11380-2008.
A.10 Speech coding/decompression ratio
The speech coding/decompression ratio is the ratio of the bit-stream rate of the compressed audio output by the speech compression algorithm to that of the input audio to be compressed.
A.11 Speech enhancement
A.11.1 Signal-to-noise ratio improvement
The signal-to-noise ratio improvement is the ratio of the signal-to-noise ratio of the speech output by the speech enhancement functional unit to that of the input speech.
A.11.2 Noise suppression
The amount of noise suppression is calculated by formula (A.7):

DNR = 10 log( Σ_{n=0}^{N-1} |νin(n)|² / Σ_{n=0}^{N-1} |νout(n)|² )   (A.7)

where:
DNR --- amount of noise suppression, in decibels (dB);
νin(n) --- amplitude of the n-th noise component in the input signal;
νout(n) --- amplitude of the n-th noise component in the output signal;
N --- total number of frequency components in the input signal spectrum.
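As an illustration only (not from the standard), formula (A.7) can be computed as follows; plain lists of noise-component amplitudes stand in for the spectral components, where real code would work on FFT bins of the input and output signals:

```python
import math

# Illustrative sketch (not from the standard) of formula (A.7): noise
# suppression in dB from the noise-component amplitudes of the input and
# output signals.
def noise_suppression_db(noise_in, noise_out):
    """DNR = 10 * log10( sum |v_in(n)|^2 / sum |v_out(n)|^2 ), in dB."""
    p_in = sum(abs(v) ** 2 for v in noise_in)
    p_out = sum(abs(v) ** 2 for v in noise_out)
    return 10.0 * math.log10(p_in / p_out)

print(round(noise_suppression_db([1.0, 1.0], [0.1, 0.1]), 6))  # 20.0
```

A 20 dB result means the enhancement unit attenuated the noise power by a factor of 100.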
A.12 Sound source localization
A.12.1 Plane angle positioning error
The plane angle positioning error is the difference between the plane angle of the sound source position calculated by the sound source positioning function unit and the true value.
A.12.2 Pitch angle positioning error
The pitch angle positioning error is the difference between the pitch angle of the sound source position calculated by the sound source positioning function unit and the true value.
A.12.3 Distance positioning error
The distance positioning error is the difference between the sound source position distance calculated by the sound source positioning function unit and the true value.
A.13 Voice interruption success rate
In dialogue management, the voice interruption success rate is the ratio, over a given period, of the number of correct responses to voice interruption operations to the total number of such operations.