Unipos is an overseas product procurement and consultation service for research and development companies that is supported by many research and educational institutions.
On this page, we introduce a selection of products that are attracting attention from customers involved in research and development, with a focus on speech corpora.
What is a speech corpus?
A speech corpus is a collection of speech data, and it plays a major role in natural language processing research and practice. Natural language processing is a technology that analyzes information related to language and communication. Within this field, speech corpora play an important role in structuring linguistic data, and for this reason they also occupy an important position in the field of AI.
Speech analysis is one specific example of a natural language processing method. It extracts linguistic information from speech data and is widely used in fields such as speech recognition and speech synthesis. Speech analysis lays the foundation for machines to understand speech and respond appropriately.
Furthermore, research on natural language processing is progressing rapidly thanks to the use of speech corpora and advances in deep learning. These advances are expanding the range of applications for natural language processing and opening up new possibilities. In this sense, the collections of data known as speech corpora are contributing to the development of language processing technology.
The role of linguistic speech corpora
- Providing training data: Supplying diverse audio data to strengthen the training of machine learning models
- Foundation of speech recognition: Laying the foundation for speech recognition technology through speech analysis, allowing machines to convert speech into text
- Development of conversational AI: Supporting the evolution of conversational AI and enabling machines to communicate with users in natural language
- Improving speech synthesis technology: Applying information obtained from speech analysis to speech synthesis to enable natural-sounding speech generation
- Application to real life: Enabling a wide range of everyday uses, such as voice control of smart homes, complaint handling, and learning assistance
Popular speech corpus products at Unipos
Speech corpora are used in a variety of fields, including language research, speech recognition technology, speech synthesis technology, and language processing technology, and demand is expected to increase in the future.
Unipos also handles many speech corpus related products. Some of the most popular ones are introduced below.
Speechocean Corpus | Various Languages Commercial Research Corpus
Commercial and research corpora in various languages
A range of corpora offered by Speechocean of China.
In addition to ASR-Corpus (automatic speech recognition corpus) and TTS-Corpus (speech synthesis corpus), the lineup includes computer vision corpora, vocabulary corpora, natural language processing corpora, and more.
A large number of corpora are available, including approximately 1,000 for commercial use and approximately 150 for research.
The corpora are categorized by more than 110 languages and dialects (accents), as well as by age, gender, recording time, recording platform, and so on. When making an inquiry, please let us know the name and SN (King-) of your desired corpus.
Beijing Haitian Ruisheng Science Technology Ltd / DataOcean AI (manufacturer site)
Main uses
- Automatic speech recognition corpus
- Speech synthesis corpus
- Text corpus
- Multilingual
- Commercial and research offerings
LDC Corpus | Language Corpus Database
Corpus of various languages (language database)
A corpus handled by the LDC (Linguistic Data Consortium), headquartered at the University of Pennsylvania in the United States.
It offers a rich collection of vocabulary and data in a variety of formats, including text databases, audio databases, and lexicons.
When contacting us, please let us know the name of your desired product.
LDC corpus catalog page (manufacturer site)
Main uses
- Natural language processing (NLP) research data
- Linguistic data annotation
- Used for syntactic analysis and morphological analysis
- Data from the Speech Resource Consortium
- Providing large-scale annotated linguistic data
ELRA GLOBALPHONE | Multilingual audio database
Multilingual speech database
A multilingual speech database (corpus) provided by ELRA (European Language Resources Association).
The GlobalPhone series consists of read newspaper speech (16-bit, 16 kHz, mono) recorded with a close-talking microphone (Sennheiser 440-6). As of 2023, data is provided in 22 languages.
Main uses
- Development of multilingual speech recognition systems
- Audio data for natural language processing (NLP) research
- Used for comparative research on pronunciation between languages
- Creation of language technology evaluation packages
- Research uses of audio data in various languages
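Before using a purchased corpus, it is often worth checking that each audio file matches the recording format stated in its catalog entry (bit depth, sampling rate, number of channels). The following is a minimal sketch using Python's standard `wave` module. It assumes the files are available as, or have been converted to, uncompressed PCM WAV; the function names `wav_format` and `matches_format` are hypothetical helpers introduced here for illustration, and the actual distribution format of a given corpus may differ.

```python
import wave


def wav_format(path):
    """Return (bits_per_sample, sample_rate_hz, channels) of a PCM WAV file."""
    with wave.open(path, "rb") as wf:
        # getsampwidth() is in bytes, so multiply by 8 to get bits per sample.
        return wf.getsampwidth() * 8, wf.getframerate(), wf.getnchannels()


def matches_format(path, bits, rate_hz, channels):
    """Check a file against the format given in the corpus documentation."""
    return wav_format(path) == (bits, rate_hz, channels)
```

For example, `matches_format("sample.wav", 16, 16000, 1)` checks for 16-bit, 16 kHz, mono audio. The file name and these values are placeholders; replace them with the values given in the documentation of the corpus you are using.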
AISHELL Corpus | Artificial Intelligence Chinese Corpus
Chinese corpus for artificial intelligence
A speech corpus for voice-based intelligent products such as smart homes, automobiles (smart cars), and robots, offered by Beijing Shell Shell Technology of China. Data is categorized by usage scenario.
Open source corpora for academic research purposes are also provided.
Main uses
- Voice recognition system training
- Chinese natural language processing research
- Speech synthesis database
- Dataset for multimodal learning
- Audio annotation and analysis
Example
Speech corpora are used as important speech resources in the field of natural language processing (NLP). They are leveraged for many NLP tasks, including speech recognition, speech-to-text conversion, automatic summarization, machine translation, and sentiment analysis.
1. Voice recognition
Speech corpora are used to train and evaluate speech recognition systems. Because a corpus covers a variety of pronunciations, accents, and linguistic expressions, speech recognition systems can adapt to different languages and dialects and produce accurate transcriptions.
2. Text conversion
Speech corpora are used in the work of converting speech data into text data. This improves the accuracy of speech-to-text conversion, and the resulting text can be used as input data for NLP tasks.
3. Automatic summary
Text extracted from audio data is used for automatic summarization. This makes it possible to generate summaries from large amounts of audio data and organize information efficiently.
4. Machine translation
Speech corpora are used to train machine translation systems. By converting voice data to text data, translation systems that support communication across multiple languages can be developed.
5. Sentiment analysis
Speech corpora are used to analyze speakers' emotions and the emotional expressions contained in speech data. This helps analyze product reputation and improve the quality of customer service.
Speech corpora are a necessary data source for training and evaluating NLP algorithms, and they provide richer information than text data alone. Speech corpora maximize the usefulness of audio data and enable efficient information extraction and processing.
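As an illustration of the evaluation use mentioned above, speech recognition output is commonly scored against a corpus's reference transcripts using the word error rate (WER): the number of word substitutions, deletions, and insertions divided by the number of words in the reference. The following is a minimal sketch, assuming whitespace-separated words and no text normalization; the function name `word_error_rate` is a hypothetical helper, not part of any product listed on this page.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)
```

For example, comparing the reference "the cat sat" with the recognized text "the dog sat" gives one substitution out of three reference words, a WER of 1/3. For languages written without spaces, such as Japanese or Chinese, the text would first need to be segmented, or the same calculation applied per character instead.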
Speech corpora provided free of charge
In addition to speech corpora provided for a fee, there are also free corpora that are independently collected, compiled, and provided by academic institutions. Certain conditions and procedures apply to their use, so please check the information on each site before using them.
University of Tsukuba Multilingual Speech Corpus (UT-ML)
The University of Tsukuba Multilingual Speech Corpus (UT-ML) is a speech database that covers languages from 11 countries. It contains audio from a total of 98 speakers of different languages and genders. You can apply for either the CD/DVD version or online distribution.
Chinese MULTEXT Corpus (MULTEXT-C)
The Chinese MULTEXT Corpus (MULTEXT-C) is the Chinese version of the Multilingual Text Tools and Corpora (MULTEXT) created in Europe. Forty passages, each lasting 1 to 5 minutes, were recorded with instructions to speak as naturally as possible. You can apply for either the CD/DVD version or online distribution.
Vowel speech database for men, women, and children with physical information (JVPD)
This vowel database was created to be published as standard scientific reference material on Japanese speech sounds. It covers 385 speakers, and height and weight data are included for 284 of them. You can apply for either the CD/DVD version or online distribution.
Fundamental Research (A) “Regional Differences in Japanese Dialects” Dialect Speech Corpus (GSR-JD)
This is a speech corpus of Japanese dialects that includes read-aloud utterances and natural discourse. It contains audio from a total of 9 people from 133 regions. You can apply for either the CD/DVD version or online distribution.
Summary
Although many people may not be familiar with the existence of speech corpora, the results of research and development using speech corpora are all around us.
Typical examples include the voice recognition functions of smartphones and smart speakers, and it is no longer unusual to use an AI chat service to convert speech to text and summarize it automatically. Demand is expected to keep growing, and applications will become more diverse.
At Unipos, we procure speech corpus products and related hardware and software from around the world to support the success of our customers' businesses and research. We are happy to investigate products that are not listed on the Unipos website, so please feel free to contact us.
■ Click here for Unipos service introductions and inquiries: Overseas product procurement and consulting service for R&D "Unipos"