Training machine for large-scale language models for biology | TEGAKARI, an information media site for research and development

A customer engaged in research and development of medical products consulted us about a machine for training large-scale language models for biology.
The intended use is to run large-scale language models used in biology, such as ProteinBERT, ChemBERTa, and HyenaDNA, starting from pre-training.

The customer asked us to prioritize GPU performance, having learned that ProteinBERT was trained on an NVIDIA Quadro RTX 5000, ChemBERTa on an NVIDIA Tesla T4, and HyenaDNA on an NVIDIA A100.

In addition, the customer requested a budget of 3 million yen or less, the fastest configuration possible within that budget, and a case of roughly mid-tower size that can be used in a 100V power environment.

Based on these conditions, we proposed the following configuration.

CPU	Intel Xeon W5-2455X (3.20GHz 12 cores)
Memory	128GB REG ECC
Storage 1	2TB M.2 SSD
Storage 2	4TB SATA SSD
Video	NVIDIA RTX A6000 48GB x2
Network	Onboard (1GbE x1 / 10GbE x1)
Case + power supply	Mid-tower case + 1500W
OS	Microsoft Windows 11 Professional 64bit

This configuration emphasizes GPU performance within the customer's budget and usage environment.
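
As a rough check of the 100V requirement, the power budget of such a configuration can be sketched as below. The wattage figures are illustrative assumptions rather than measurements from this case: 300W per RTX A6000 as the maximum board power, an assumed 200W for the CPU, and an assumed 150W for everything else. The 15A circuit limit is likewise an assumption about a typical 100V outlet.

```python
# Rough power-budget check for a 100V environment.
# All wattage values are illustrative assumptions, not measurements.

OUTLET_VOLTAGE_V = 100   # from the stated requirement: 100V power environment
CIRCUIT_LIMIT_A = 15     # assumption: typical 15A branch circuit
PSU_RATING_W = 1500      # from the proposed configuration

component_draw_w = {
    "RTX A6000 #1": 300,                          # assumption: max board power
    "RTX A6000 #2": 300,                          # assumption: max board power
    "CPU": 200,                                   # assumption: base power
    "Other (memory, storage, fans, board)": 150,  # assumption
}


def power_budget(draw_w, outlet_v, limit_a, psu_w):
    """Compare the summed component draw with the outlet and PSU limits."""
    total = sum(draw_w.values())
    outlet_capacity = outlet_v * limit_a
    ceiling = min(outlet_capacity, psu_w)
    return {
        "total_w": total,
        "outlet_capacity_w": outlet_capacity,
        "ceiling_w": ceiling,
        "headroom_w": ceiling - total,
        "fits": total <= ceiling,
    }


budget = power_budget(component_draw_w, OUTLET_VOLTAGE_V, CIRCUIT_LIMIT_A, PSU_RATING_W)
print(budget)
```

This sketch ignores power supply efficiency, so the draw at the wall would be somewhat higher than the summed component figures.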

The GPUs are two NVIDIA RTX A6000 cards.
According to the official site of the ProteinBERT developers, building the pretrained model took about a month on an NVIDIA Quadro RTX 5000.
The A6000 is a newer generation than the Quadro RTX 5000 and sits higher in the lineup, so it can be expected to deliver higher processing performance.

The NVIDIA Tesla T4 cited as an example is a product that is often used for inference. This configuration therefore uses the A6000, which offers higher per-card performance than the NVIDIA Tesla T4.

Also, unlike the A6000, the NVIDIA A100 is a compute-only (GPGPU) card.
It offers high FP64 performance and is well suited to scientific computing, but FP64 is rarely used in deep learning workloads such as this one.
In addition, it is much more expensive than the A6000 and can only be installed in a dedicated chassis, so we judged that it would not be a good match for the customer's usage conditions and purpose.

Regarding storage, the ProteinBERT developers recommend at least 1TB of storage capacity for users who train the model themselves, so the configuration includes a 2TB system disk and a 4TB data disk.
In addition, because frequent data access is expected during training, all storage is SSD.

Windows 11 is selected as the OS.
The language models the customer plans to use are generally provided as Python packages, so the OS can be changed on request to any OS that supports Python.
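
Because the models run wherever Python does, a short script can confirm what a given machine offers before any packages are installed. The following is a minimal sketch using only the standard library. The function name `describe_environment` is hypothetical, and checking for `nvidia-smi` on the PATH is only a rough indicator that an NVIDIA driver is installed.

```python
import platform
import shutil


def describe_environment():
    """Collect basic facts about the current OS and Python runtime."""
    return {
        "os": platform.system(),            # e.g. "Windows" or "Linux"
        "os_release": platform.release(),
        "python": platform.python_version(),
        # nvidia-smi ships with the NVIDIA driver; finding it on the PATH
        # suggests, but does not prove, that a usable GPU driver is present.
        "nvidia_smi_on_path": shutil.which("nvidia-smi") is not None,
    }


env = describe_environment()
for key, value in env.items():
    print(f"{key}: {value}")
```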

The configuration in this case study is based on the conditions given by the customer.
We propose machines flexibly to suit your requirements, so please feel free to contact us even if your conditions differ from those described here.

■ Keywords

・What is Deep Learning?
Deep learning is a type of machine learning that uses multilayer neural networks to perform advanced pattern recognition and prediction. Since it generally requires a large amount of data, it is considered an effective method when data is abundant. Deep learning is also widely used in fields such as image recognition, speech recognition, and natural language processing. Because it can learn complex features and relationships, it can achieve higher accuracy than traditional machine learning methods.

Reference: [Special article] What is machine learning? *Jumps to our own media site "TEGAKARI"

・What is Python?
Python is an object-oriented programming language copyrighted by the Python Software Foundation (PSF). Its syntax is simple and highly readable, and it offers a wide variety of components, such as libraries and frameworks, suited to different purposes. It is popular with everyone from programming beginners to advanced users.

Reference: Python *Jumps to an external site

・What is BERT?
BERT (Bidirectional Encoder Representations from Transformers) is a natural language processing (NLP) model developed by Google. It interprets words based on their surrounding context and is applied to a wide range of language processing tasks.
BERT training consists of two phases: pre-training and fine-tuning. Pre-training creates a general-purpose language model from a large corpus. Fine-tuning then adapts the pre-trained model to a specific task using a smaller dataset.
It is characterized by higher accuracy than conventional NLP models and by its ability to handle complex tasks, and it is widely applied to text generation, question answering, document classification, language translation, and similar tasks.
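
The pre-training phase of BERT-style models is commonly implemented as masked language modeling: some tokens are hidden and the model learns to predict them from context. The following is a simplified sketch of the masking step only. The function name `mask_tokens` and the example sentence are hypothetical. Actual BERT pre-training masks about 15% of tokens and sometimes substitutes a random token or keeps the original, which this sketch omits.

```python
import random

MASK_TOKEN = "[MASK]"


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Return (masked_tokens, labels) for masked language modeling.

    labels[i] holds the original token where a mask was applied and
    None elsewhere, so a model is scored only on masked positions.
    """
    rng = random.Random(seed)  # fixed seed keeps the example reproducible
    masked, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK_TOKEN)
            labels.append(token)
        else:
            masked.append(token)
            labels.append(None)
    return masked, labels


sentence = "the enzyme binds the substrate at the active site".split()
masked, labels = mask_tokens(sentence, mask_prob=0.3, seed=1)
print(masked)
print(labels)
```

The same idea carries over to biological sequences, where the tokens are amino acids, SMILES fragments, or nucleotides instead of words.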

・What is ProteinBERT?
ProteinBERT is a protein language model based on BERT. Pretrained on roughly 106 million proteins from the UniRef90 database, it can handle protein sequences of almost any length, including very long ones.

Reference: GitHub – nadavbra/protein_bert *Jumps to an external site

・What is ChemBERTa?
ChemBERTa is a large-scale language model built on RoBERTa (a variant of BERT) and trained on SMILES, a notation for chemical structures. It is used in drug design, chemical modeling, property prediction, and similar applications.

Reference: GitHub – seyonechithrananda/bert-loves-chemistry: bert-loves-chemistry: a repository of HuggingFace models applied on chemical SMILES data for drug design, chemical modeling, etc. *Jumps to an external site

・What is HyenaDNA?
HyenaDNA is a large-scale language model pre-trained on the human genome that handles base sequences of up to 1 million tokens. Tokenization at the single-nucleotide level (A, T, G, C) allows analysis at nucleotide resolution.

Reference: GitHub – HazyResearch/hyena-dna: Official implementation for HyenaDNA, a long-range genomic foundation model built with Hyena *Jumps to an external site
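
Single-nucleotide tokenization, as described for HyenaDNA, means that each base becomes exactly one token. The following is a minimal sketch of that idea. The vocabulary and ID values are hypothetical and chosen for illustration; they are not HyenaDNA's actual token IDs.

```python
# Hypothetical vocabulary for illustration; not HyenaDNA's actual token IDs.
VOCAB = {"A": 0, "C": 1, "G": 2, "T": 3, "N": 4}


def tokenize_dna(sequence):
    """Map each nucleotide to one integer ID (one token per base).

    Lowercase input is accepted, and any unknown character is mapped
    to the ID for "N" (unknown base).
    """
    return [VOCAB.get(base, VOCAB["N"]) for base in sequence.upper()]


ids = tokenize_dna("ATGCatgn")
print(ids)  # → [0, 3, 2, 1, 0, 3, 2, 4]
```

With one token per base, the token count equals the sequence length, which is why long context lengths matter for genome-scale input.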