
[Special article] Introducing popular products related to speech corpora

2023/11/27 TEGARA Co., Ltd. Humanities / Social Sciences, Informatics, Artificial intelligence, Application development and programming, Features

Unipos is an overseas product procurement and consultation service for research and development companies that is supported by many research and educational institutions.

On this page, we have selected products that are attracting attention from customers involved in research and development, focusing on speech corpora. Please take a look.

Table of contents

  • What is a speech corpus?
    • The role of linguistic speech corpora
  • Popular language audio corpus products at Unipos
    • Speechocean Corpus | Various Languages Commercial Research Corpus
    • LDC Corpus | Language Corpus Database
    • ELRA GLOBALPHONE | Multilingual audio database
    • AISHELL Corpus | Artificial Intelligence Chinese Corpus
  • Example
  • Language audio corpus provided free of charge
    • University of Tsukuba Multilingual Speech Corpus (UT-ML)
    • Chinese MULTEXT Corpus (MULTEXT-C)
    • Vowel speech database for men, women, and children with physical information (JVPD)
    • Fundamental Research (A) “Regional Differences in Japanese Dialects” Dialect Speech Corpus (GSR-JD)
  • Summary

What is a speech corpus?

A speech corpus is a collection of speech data, and it plays a major role in natural language processing research and practice. Natural language processing is a technology that analyzes information related to language and communication. Within it, speech corpora play an important role in structuring linguistic data, and for that reason they also occupy an important position in the field of AI.

Speech analysis is one specific natural language processing method. It extracts linguistic information from speech data and is widely used in fields such as speech recognition and speech synthesis. Speech analysis lays the foundation for machines to understand speech and respond appropriately.
Furthermore, research on natural language processing is progressing rapidly thanks to the use of speech corpora and advances in deep learning. These advances are expanding the range of applications for natural language processing and opening up new possibilities. In this sense, the collections of data known as speech corpora are contributing to the development of language processing technology.

The role of linguistic speech corpora

  • Providing training data: Supplies diverse audio data to strengthen the training of machine learning models.
  • Foundation of speech recognition: Speech analysis lays the foundation for speech recognition technology, allowing machines to convert speech into text.
  • Development of conversational AI: Supports the evolution of conversational AI, enabling machines to communicate with users in natural language.
  • Improving speech synthesis technology: Information obtained from speech analysis is used in speech synthesis technology to enable natural speech generation.
  • Application to real life: Can be widely used in everyday settings, such as voice control of smart homes, complaint handling, and learning assistance.

 

Popular language audio corpus products at Unipos

Speech corpora are used in a variety of fields, including language research, speech recognition technology, speech synthesis technology, and language processing technology, and demand is expected to increase in the future.

Unipos also handles many speech corpus related products. We would like to introduce some of the most popular ones, so please take a look.

Speechocean Corpus | Various Languages Commercial Research Corpus

Commercial and research corpora in various languages

Various corpora offered by Speechocean of China.
In addition to ASR corpora (automatic speech recognition) and TTS corpora (speech synthesis), the lineup includes computer vision corpora, vocabulary corpora, natural language processing corpora, and more.
A large number of corpora are available, including approximately 1,000 types for commercial use and approximately 150 types for research.

The corpora are categorized by language and dialect (accent), of which there are more than 110, as well as by age, gender, recording time, recording platform, and so on. When making an inquiry, please let us know the name and SN (King-) of your desired corpus.

Beijing Haitian Ruisheng Science Technology Ltd / DataOcean AI (manufacturer site)

Main uses
– Automatic speech recognition corpus
– Speech synthesis corpus
– Text corpus
– Multilingual
– Commercial and research offerings

LDC Corpus | Language Corpus Database

Corpus of various languages (language database)

Corpora offered by the LDC (Linguistic Data Consortium), headquartered at the University of Pennsylvania in the United States.
The catalog offers a rich collection of vocabulary and data in a variety of formats, including text databases, audio databases, and lexicons.

When contacting us, please let us know the name of your desired product.

LDC corpus catalog page (manufacturer site)

Main uses
– Natural language processing (NLP) research data
– Linguistic data annotation
– Used for syntactic analysis and morphological analysis
– Data from the Speech Resource Consortium
– Providing large-scale annotated linguistic data

ELRA GLOBALPHONE | Multilingual audio database

Multilingual speech database

A multilingual speech database (corpus) provided by ELRA (European Language Resources Association).
The GlobalPhone series is audio data of read newspaper text (16-bit, 16 kHz monaural) recorded with a close-talking microphone (Sennheiser 440-6). As of 2023, data is provided in 22 languages.
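
Before working with purchased audio data, it can help to confirm that each file matches the recording format stated for the corpus. The following is a minimal sketch in Python using the standard-library wave module. The function name check_wav_format and the default values (16 kHz, 16-bit, mono, a format commonly used for speech recognition data) are assumptions for illustration, and the check applies only to uncompressed WAV files, which is not necessarily how a given corpus is distributed.

```python
import os
import tempfile
import wave


def check_wav_format(path, sample_rate=16000, sample_width_bytes=2, channels=1):
    """Return a list of mismatches between a WAV file and the expected format.

    The defaults (16 kHz, 2 bytes = 16 bit, 1 channel) are illustrative
    assumptions; pass the values stated in the corpus documentation.
    """
    problems = []
    with wave.open(path, "rb") as wav:
        if wav.getframerate() != sample_rate:
            problems.append(
                f"sample rate is {wav.getframerate()} Hz, expected {sample_rate} Hz"
            )
        if wav.getsampwidth() != sample_width_bytes:
            problems.append(
                f"sample width is {wav.getsampwidth()} bytes, "
                f"expected {sample_width_bytes} bytes"
            )
        if wav.getnchannels() != channels:
            problems.append(
                f"channel count is {wav.getnchannels()}, expected {channels}"
            )
    return problems


if __name__ == "__main__":
    # Demo on a synthetic file: 0.1 s of silence in the expected format.
    with tempfile.TemporaryDirectory() as tmp:
        demo_path = os.path.join(tmp, "demo.wav")
        with wave.open(demo_path, "wb") as out:
            out.setnchannels(1)
            out.setsampwidth(2)
            out.setframerate(16000)
            out.writeframes(b"\x00\x00" * 1600)
        print(check_wav_format(demo_path) or "format matches")
```

An empty list means the file matches the expected format; otherwise each entry names one mismatch, which makes it easy to log problems across a whole directory of files.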

Main uses
– Development of multilingual speech recognition system
– Audio data for natural language processing (NLP) research
– Used for comparative research on pronunciation between languages
– Creation of language technology evaluation package
– Research uses of audio data in various languages

AISHELL Corpus | Artificial Intelligence Chinese Corpus

Chinese corpus for artificial intelligence

A speech corpus for voice-based intelligent products such as smart homes, automobiles (smart cars), and robots, handled by Beijing Shell Shell Technology in China. Data is categorized by usage scenario.

Open source corpora for academic research purposes are also provided.

Main uses
– Voice recognition system training
– Chinese natural language processing research
– Speech synthesis database
– Dataset for multimodal learning
– Audio annotation and analysis

Example

Speech corpora are used as important speech resources in the field of natural language processing (NLP). They are leveraged for many natural language processing tasks, including speech recognition, text conversion, automatic summarization, machine translation, and sentiment analysis.

1. Speech recognition
Speech corpora are used for training and evaluating speech recognition systems. Thanks to the variety of pronunciations, accents, and linguistic expressions in a corpus, speech recognition systems can adapt to different languages and dialects and produce accurate transcriptions.
2. Text conversion
Speech corpora are used in the work of converting speech data into text data. This improves the accuracy of speech-to-text conversion, and the resulting text can be used as input data for NLP tasks.
3. Automatic summarization
Text extracted from audio data is used for automatic summarization. This makes it possible to generate summaries from large amounts of audio data and organize information efficiently.
4. Machine translation
Speech corpora are used to train machine translation systems. By converting voice data to text data, translation systems that support communication between multiple languages can be developed.
5. Sentiment analysis
Speech corpora are used to analyze speakers' emotions and emotional expressions contained in speech data. This helps analyze product reputation and improve the quality of customer service.
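
To make the idea of evaluating speech recognition against a corpus concrete, the sketch below computes word error rate (WER), a commonly used metric that compares a recognizer's output with the reference transcript supplied by a corpus. WER is not named in this article; it is brought in here as an assumption about how such an evaluation is typically done, and the function name word_error_rate and the sample sentences are hypothetical. The sketch splits on whitespace, so it suits languages written with spaces between words; languages such as Japanese or Chinese would need a tokenizer or a character-level variant.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance between two transcripts, divided by the
    number of words in the reference transcript."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference transcript must not be empty")

    # previous[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            current.append(
                min(
                    previous[j] + 1,  # reference word missing from hypothesis
                    current[j - 1] + 1,  # extra word in hypothesis
                    previous[j - 1] + substitution_cost,  # match or substitution
                )
            )
        previous = current
    return previous[-1] / len(ref_words)


if __name__ == "__main__":
    # Hypothetical corpus transcript and recognizer output.
    print(word_error_rate("turn on the light", "turn off the light"))
```

A value of 0.0 means the recognizer output matches the reference exactly. Because extra words count as errors, the value can exceed 1.0 when the output is much longer than the reference.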

Speech corpora are a necessary data source for training and evaluating NLP algorithms and provide richer information compared to text data. Speech corpora maximize the usefulness of audio data and enable efficient information extraction and processing.

Language audio corpus provided free of charge

Speech corpora are not only provided for a fee; there are also free corpora that are independently collected, compiled, and provided by academic institutions. Certain conditions and procedures are required for use, so please check the information on each site before using them.

University of Tsukuba Multilingual Speech Corpus (UT-ML)

The University of Tsukuba Multilingual Speech Corpus (UT-ML) is a speech database that supports languages from 11 countries. It contains audio from a total of 98 speakers of different languages and genders. You can apply by choosing between the CD/DVD version and online distribution.

Chinese MULTEXT Corpus (MULTEXT-C)

The Chinese MULTEXT Corpus (MULTEXT-C) is the Chinese version of the Multilingual Text Tools and Corpora (MULTEXT) created in Europe. Forty manuscripts, each lasting 1 to 5 minutes, were recorded with instructions to speak as naturally as possible. You can apply by choosing between the CD/DVD version and online distribution.

Vowel speech database for men, women, and children with physical information (JVPD)

This is a vowel database created to be published as standard scientific material on Japanese speech sounds. It covers 385 speakers, and height and weight data are also available for 284 of them. You can apply by choosing between the CD/DVD version and online distribution.

Fundamental Research (A) “Regional Differences in Japanese Dialects” Dialect Speech Corpus (GSR-JD)

This is an audio corpus of Japanese dialects that includes read-aloud utterances and natural discourse. Contains audio from a total of 9 people from 133 regions. You can apply by choosing between the CD/DVD version and online distribution.

Summary

Although many people may not be familiar with speech corpora, the results of research and development using them are all around us.
Typical examples include the voice recognition functions of smartphones and smart speakers, and it is no longer unusual to use AI chat to convert speech to text and summarize it automatically. Demand is expected to keep growing, and applications will become more diverse.

At Unipos, we procure speech corpus products and related hardware and software from around the world to support the success of our customers' businesses and research. We will be happy to investigate products that are not listed on the Unipos website, so please feel free to contact us.

■ Click here for Unipos service introductions and inquiries

Overseas product procurement and consulting service for R & D "Unipos"

 


