[Special article] Introducing popular products related to speech corpora

December 27 2025 TEGARA Co., Ltd. Humanities / Social Sciences, Informatics, Artificial intelligence, Application development and programming, Features

Unipos is an overseas product procurement and consultation service for research and development companies that is supported by many research and educational institutions.

On this page, we have selected products that are attracting attention from customers involved in research and development, focusing on speech corpora. Please take a look.

Table of contents

  • What is a speech corpus?
    • The role of linguistic speech corpora
  • Popular language audio corpus products at Unipos
    • Speechocean Corpus | Various Languages Commercial Research Corpus
    • LDC Corpus | Language Corpus Database
    • ELRA GLOBALPHONE | Multilingual audio database
    • AISHELL Corpus | Artificial Intelligence Chinese Corpus
  • Example
  • Language audio corpus provided free of charge
    • University of Tsukuba Multilingual Speech Corpus (UT-ML)
    • Chinese MULTEXT Corpus (MULTEXT-C)
    • Vowel speech database for men, women, and children with physical information (JVPD)
    • Fundamental Research (A) “Regional Differences in Japanese Dialects” Dialect Speech Corpus (GSR-JD)
  • My Feelings, Then and Now

What is a speech corpus?

A speech corpus is a collection of speech data, and it plays a major role in natural language processing research and practice. Natural language processing is a technology that analyzes information related to language and communication. Because speech corpora play an important role in structuring linguistic data, they also occupy an important position in the field of AI.

Speech analysis is one specific natural language processing method. It extracts linguistic information from speech data and is widely used in fields such as speech recognition and speech synthesis. Speech analysis lays the foundation for machines to understand speech and respond appropriately.
Furthermore, research on natural language processing is progressing rapidly thanks to the use of speech corpora and advances in deep learning. These advances are expanding the range of applications for natural language processing and opening up new possibilities. In this sense, the collections of data known as speech corpora are contributing to the development of language processing technology.

The role of linguistic speech corpora

  • Providing training data: Supplies diverse audio data to strengthen the training of machine learning models
  • Foundation of speech recognition: Speech analysis lays the foundation for speech recognition technology, allowing machines to convert speech into text
  • Development of conversational AI: Supports the evolution of conversational AI and enables machines to communicate with users in natural language
  • Improving speech synthesis technology: Information obtained from speech analysis is used in speech synthesis to enable natural-sounding speech generation
  • Application to real life: Can be widely used in everyday settings, such as voice control of smart homes, complaint handling, and learning assistance


Popular language audio corpus products at Unipos

Speech corpora are used in a variety of fields, including language research, speech recognition technology, speech synthesis technology, and language processing technology, and demand is expected to increase in the future.

Unipos also handles many speech corpus related products. We would like to introduce some of the most popular ones, so please take a look.

Speechocean Corpus | Various Languages Commercial Research Corpus

Commercial and research corpora in various languages

Various corpora offered by Speechocean of China.
The lineup includes ASR corpora (automatic speech recognition), TTS corpora (speech synthesis), computer vision corpora, vocabulary corpora, natural language processing corpora, and more.
A large number of corpora are available, including approximately 1,000 for commercial use and approximately 150 for research.

The corpora are categorized by language and dialect or accent (more than 110), age, gender, recording time, recording platform, and so on. When making an inquiry, please let us know the name and SN (King-) of your desired corpus.

Beijing Haitian Ruisheng Science Technology Ltd / DataOcean AI (manufacturer site)

Main uses
– Automatic speech recognition corpus
– Speech synthesis corpus
– Text corpus
– Multilingual
– Commercial and research offerings

LDC Corpus | Language Corpus Database

Corpus of various languages (language database)

Corpora offered by the LDC (Linguistic Data Consortium), headquartered at the University of Pennsylvania in the United States.
A rich collection of data is available in a variety of forms, including text databases, speech databases, and lexicons.

When contacting us, please let us know the name of your desired product.

LDC corpus catalog page (manufacturer site)

Main uses
– Natural language processing (NLP) research data
– Linguistic data annotation
– Used for syntactic analysis and morphological analysis
– Data from the Speech Resource Consortium
– Providing large-scale annotated linguistic data

ELRA GLOBALPHONE | Multilingual audio database

Multilingual speech database

A multilingual speech database (corpus) provided by ELRA (European Language Resources Association).
The GlobalPhone series consists of read newspaper speech (16-bit, 16 kHz, mono) recorded with a close-talking microphone (Sennheiser 440-6). As of 2023, data is provided in 22 languages.
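
When working with a corpus whose documentation states a recording format, it can be useful to confirm that a file actually matches before training on it. The following is a minimal sketch using Python's standard `wave` module. It assumes the recording has already been converted to an uncompressed WAV file (corpora are often distributed in other formats), and the names `AudioFormat`, `read_wav_format`, and `matches_expected`, as well as the default expected values of 16-bit, 16 kHz, mono, are illustrative choices rather than anything provided by ELRA.

```python
import wave
from dataclasses import dataclass


@dataclass(frozen=True)
class AudioFormat:
    """Format fields read from a WAV header (hypothetical helper type)."""
    bits_per_sample: int
    sample_rate_hz: int
    channels: int


def read_wav_format(path: str) -> AudioFormat:
    """Read sample width, sample rate, and channel count from a WAV file."""
    with wave.open(path, "rb") as wav:
        return AudioFormat(
            bits_per_sample=wav.getsampwidth() * 8,  # getsampwidth() is in bytes
            sample_rate_hz=wav.getframerate(),
            channels=wav.getnchannels(),
        )


def matches_expected(
    path: str,
    expected: AudioFormat = AudioFormat(16, 16000, 1),  # assumed target format
) -> bool:
    """Return True if the file's header matches the expected format."""
    return read_wav_format(path) == expected
```

Calling `matches_expected("utterance.wav")` on a hypothetical file returns `False` when, for example, the file was resampled or saved in stereo, which is a cheap check to run before feature extraction.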

Main uses
– Development of multilingual speech recognition system
– Audio data for natural language processing (NLP) research
– Used for comparative research on pronunciation between languages
– Creation of language technology evaluation package
– Research uses of audio data in various languages

AISHELL Corpus | Artificial Intelligence Chinese Corpus

Chinese corpus for artificial intelligence

A speech corpus for voice-based intelligent products such as smart homes, automobiles (smart cars), and robots, handled by Beijing Shell Shell Technology in China. Data is categorized by usage scenario.

Open source corpora for academic research purposes are also provided.

Main uses
– Voice recognition system training
– Chinese natural language processing research
– Speech synthesis database
– Dataset for multimodal learning
– Audio annotation and analysis

Example

Speech corpora are used as important speech resources in the field of natural language processing (NLP). They are leveraged for many natural language processing tasks, including speech recognition, text conversion, automatic summarization, machine translation, and sentiment analysis.

1. Speech recognition
Speech corpora are used to train and evaluate speech recognition systems. The variety of pronunciations, accents, and expressions in a corpus helps a system adapt to different languages and dialects and produce accurate transcriptions.
2. Text conversion
Speech corpora are used in the work of converting speech data into text data. This improves the accuracy of speech-to-text conversion, and the resulting text can be used as input data for NLP tasks.
3. Automatic summarization
Text extracted from audio data is used for automatic summarization. This makes it possible to generate summaries from large amounts of audio data and organize information efficiently.
4. Machine translation
Speech corpora are used to train machine translation. By converting speech data into text data, translation systems that support communication across multiple languages can be developed.
5. Sentiment analysis
Speech corpora are used to analyze the emotions and emotional expressions of speakers contained in speech data. This helps analyze product reputation and improve the quality of customer service.
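
To make the evaluation step in item 1 concrete, a recognizer's output is commonly compared against a corpus's reference transcript using word error rate (WER): the number of word substitutions, deletions, and insertions divided by the number of reference words. The following is a minimal sketch rather than a production scorer. The function name `word_error_rate` is an illustrative choice, and it assumes whitespace-separated words with no text normalization (casing and punctuation are compared as-is).

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (before the current row) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    reference = "the cat sat on the mat"
    hypothesis = "the cat sit on mat"
    print(f"WER: {word_error_rate(reference, hypothesis):.3f}")
```

In the usage example, the hypothesis has one substitution ("sit" for "sat") and one deletion ("the"), so the score is 2 errors over 6 reference words. For languages written without spaces between words, such as Chinese or Japanese, the same calculation is usually applied to characters instead (character error rate).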

Speech corpora are a necessary data source for training and evaluating NLP algorithms and provide richer information than text data alone. They maximize the usefulness of audio data and enable efficient information extraction and processing.

Language audio corpus provided free of charge

In addition to speech corpora provided for a fee, there are free corpora that are independently collected, compiled, and provided by academic institutions. Certain conditions and procedures are required for use, so please check the information on each site before using them.

University of Tsukuba Multilingual Speech Corpus (UT-ML)

The University of Tsukuba Multilingual Speech Corpus (UT-ML) is a speech database that supports languages from 11 countries. It contains audio from a total of 98 speakers of different languages and genders. You can apply by choosing between the CD/DVD version and online distribution.

Chinese MULTEXT Corpus (MULTEXT-C)

The Chinese MULTEXT Corpus (MULTEXT-C) is the Chinese version of the Multilingual Text Tools and Corpora (MULTEXT) created in Europe. Forty manuscripts, each lasting 1 to 5 minutes, were recorded with instructions to speak as naturally as possible. You can apply by choosing between the CD/DVD version and online distribution.

Vowel speech database for men, women, and children with physical information (JVPD)

This is a vowel database created to be published as standard scientific reference material on Japanese speech sounds. It covers 385 speakers, and height and weight data are also included for 284 of them. You can apply by choosing between the CD/DVD version and online distribution.

Fundamental Research (A) “Regional Differences in Japanese Dialects” Dialect Speech Corpus (GSR-JD)

This is an audio corpus of Japanese dialects that includes read-aloud utterances and natural discourse. It contains audio from a total of 9 people from 133 regions. You can apply by choosing between the CD/DVD version and online distribution.

My Feelings, Then and Now

Although many people may not be familiar with speech corpora, the results of research and development using them are all around us.
Typical examples include the voice recognition functions of smartphones and smart speakers, and it is no longer unusual to use an AI chat service to convert speech to text and summarize it automatically. Demand is expected to keep growing, and applications will become more diverse.

At Unipos, we procure speech corpus products and related hardware and software from around the world to support the success of our customers' businesses and research. We will be happy to investigate products that are not listed on the Unipos website, so please feel free to contact us.

■ Click here for Unipos service introductions and inquiries

Overseas product procurement and consulting service for R & D "Unipos"

 

