NGS Analysis Workstation (November 2024 version)

A customer involved in plant genome research consulted us about the configuration of a PC for analysis.

The goal is to assemble the genome of a diploid organism with an estimated size of 280 Mb, and the research is expected to expand to RNA-seq analysis. The budget is about 1 million to 1.2 million yen, and the machine should come with Linux (Ubuntu) pre-installed.

For software, Trinity is planned for RNA-seq analysis. For genome analysis, SPAdes, Platanus, Racon, and medaka are expected for haploid samples, and FALCON or Canu for diploid samples.
The customer plans to start with assembly and synteny analysis on approximately 10 Gb of data; the final choice of software has yet to be decided.
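
As a rough sanity check on these figures, the expected sequencing depth can be estimated by dividing the data volume by the genome size. The sketch below assumes that "10 Gb" means 10 gigabases of sequence and "280 Mb" means 280 megabases; the function name is illustrative only.

```python
def estimated_coverage(total_bases: float, genome_size_bases: float) -> float:
    """Return the average sequencing depth (x-fold coverage)."""
    if genome_size_bases <= 0:
        raise ValueError("genome size must be positive")
    return total_bases / genome_size_bases


genome_size = 280e6  # 280 Mb estimated genome size (from the consultation)
data_volume = 10e9   # about 10 Gb of initial data (assumed to mean gigabases)

coverage = estimated_coverage(data_volume, genome_size)
print(f"Estimated coverage: about {coverage:.0f}x")
```

Whether this depth is sufficient depends on the read type and the assembler, so it is best treated as a starting point for discussion rather than a guarantee.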

In light of these requirements, we proposed the following configuration:

CPU: Intel Xeon W5-3435X 3.10GHz (up to 4.70GHz with Turbo Boost Max 3.0), 16C/32T
Memory: 256GB total, DDR5-4800 Registered ECC, 32GB x 8
Storage 1: 1TB SSD, M.2 NVMe Gen4
Storage 2: 16TB HDD, SATA
Video: NVIDIA T400 4GB (Mini DisplayPort x3)
Network: Onboard (1GbE x1 / 10GbE x1)
Chassis + power supply: Tower chassis + 1000W
OS: Ubuntu 22.04

When selecting a PC for analysis, the usual priority is to secure enough memory for the computation. Specifications are therefore chosen, and the budget allocated, in this order: first secure the required amount of memory, then use the remaining budget for the CPU and data storage.

In this case, the work starts with about 10 Gb of data, so we proposed a configuration that can handle a reasonable level of analysis within a budget of 1.2 million yen. The key point of this configuration is memory scalability: up to 768GB more can be added later. This leaves room to expand memory if a shortage occurs as the amount of data grows step by step.
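
The memory headroom described here can be written out as simple arithmetic. The sketch below is illustrative: it reads "+768GB" as additional capacity on top of the installed 256GB (the proposal does not state how the upgrade would be carried out), and the example requirements passed to `memory_status` are hypothetical.

```python
DIMM_SIZE_GB = 32        # per-module capacity in the proposed configuration
DIMM_COUNT = 8           # number of modules in the proposed configuration
MAX_ADDITIONAL_GB = 768  # "+768GB" from the proposal, read as extra capacity

installed_gb = DIMM_SIZE_GB * DIMM_COUNT
ceiling_gb = installed_gb + MAX_ADDITIONAL_GB


def memory_status(required_gb: float) -> str:
    """Classify a memory requirement against the installed and maximum capacity."""
    if required_gb <= installed_gb:
        return "fits as proposed"
    if required_gb <= ceiling_gb:
        return "fits after a memory upgrade"
    return "exceeds the upgrade ceiling"


print(f"Installed: {installed_gb} GB, ceiling: {ceiling_gb} GB")
for need_gb in (128, 512, 1500):  # hypothetical requirements
    print(f"{need_gb} GB -> {memory_status(need_gb)}")
```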

Please note that this configuration prioritizes having enough memory to run the analysis rather than processing speed itself. If you have any requirements regarding processing speed, please feel free to contact us.

Reference: Trinity memory requirements (Running Trinity · trinityrnaseq/trinityrnaseq Wiki · GitHub)
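
To illustrate how such memory guidance translates into sizing, the sketch below applies a commonly quoted rule of thumb of roughly 1 GB of RAM per 1 million read pairs. Both the ratio and the read-pair counts are assumptions for illustration; the current wording on the Trinity wiki should be checked before relying on it.

```python
GB_PER_MILLION_PAIRS = 1.0  # assumed rule of thumb; verify against the Trinity wiki
INSTALLED_GB = 256          # memory in the proposed configuration


def trinity_memory_estimate_gb(read_pairs: int) -> float:
    """Estimate RAM in GB for a Trinity run from the number of read pairs."""
    return read_pairs / 1_000_000 * GB_PER_MILLION_PAIRS


for pairs in (50_000_000, 200_000_000, 400_000_000):  # hypothetical datasets
    need_gb = trinity_memory_estimate_gb(pairs)
    verdict = "fits" if need_gb <= INSTALLED_GB else "needs more memory"
    print(f"{pairs:,} read pairs -> about {need_gb:.0f} GB ({verdict})")
```

Actual usage varies with transcriptome complexity, so an estimate like this is only a first filter when deciding whether a memory upgrade is needed.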

■ Keywords

・What is Trinity?

Trinity is software for de novo assembly of transcriptomes. It reconstructs transcript (mRNA) sequences from short RNA-seq reads, which makes it useful for organisms without a reference genome and for discovering novel transcripts.

Reference: trinityrnaseq · GitHub
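
As an illustration of how the workstation's resources map onto a Trinity run, the sketch below only builds a command line; it does not execute anything. The file names, the thread count (matching the 32 threads of the proposed CPU), and the 200G memory cap (leaving headroom under 256GB) are assumptions for this example, and `build_trinity_command` is a hypothetical helper.

```python
import shlex


def build_trinity_command(left, right, cpu=32, max_memory_gb=200,
                          output_dir="trinity_out"):
    """Build a Trinity command line for paired-end FASTQ input (not executed)."""
    return [
        "Trinity",
        "--seqType", "fq",                    # input reads are FASTQ
        "--left", left,                       # first mate file
        "--right", right,                     # second mate file
        "--CPU", str(cpu),                    # number of threads to use
        "--max_memory", f"{max_memory_gb}G",  # memory cap for the run
        "--output", output_dir,               # output directory
    ]


# Hypothetical input file names
cmd = build_trinity_command("reads_1.fq.gz", "reads_2.fq.gz")
print(shlex.join(cmd))
```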

・What is SPAdes?

SPAdes is software for de novo assembly of genome sequences from next-generation sequencing (NGS) data. It is particularly suited to the assembly of bacterial genomes and also supports single-cell sequencing (SCS) data.

Reference: GitHub – ablab/spades: SPAdes Genome Assembler

・What is Platanus?

Platanus is software for de novo assembly of genome sequences and is particularly suited to the assembly of highly heterozygous genomes. It reconstructs genome sequences with high accuracy from next-generation sequencer short-read data.

Reference: GitHub – rkajitani/Platanus_B: De novo genome assembler for bacterial genomes

・What is Racon?

Racon is a tool for generating consensus sequences quickly and accurately in de novo genome assembly from long-read data. It is designed to produce high-quality consensus sequences from long reads with high error rates, and is particularly suitable for data from PacBio and Oxford Nanopore Technologies.

Reference: GitHub – isovic/racon: Ultrafast consensus module for raw de novo genome assembly of long uncorrected reads. http://genome.cshlp.org/content/early/2017/01/18/gr.214270.116 (Note: this is the original repository, which is no longer officially maintained.)

・What is medaka?

medaka is a tool for generating consensus sequences and calling variants from long-read sequencing data, mainly from Oxford Nanopore Technologies (ONT). It is particularly suitable for polishing de novo assemblies and for variant calling against a known reference genome.

Reference: GitHub – nanoporetech/medaka: Sequence correction provided by ONT Research

・What is FALCON?

FALCON is a de novo genome assembler developed by Pacific Biosciences (PacBio) for PacBio long-read sequence data. It is suitable for assembling large-scale, complex genomes.

Reference: GitHub – PacificBiosciences/FALCON: FALCON: experimental PacBio diploid assembler (marked out-of-date; the repository asks users to use a binary release: https://github.com/PacificBiosciences/FALCON_unzip/wiki/Binaries)

・What is Canu?

Canu is a de novo genome assembler specialized for long-read data from PacBio and Oxford Nanopore Technologies. It is suitable for assembling large-scale, complex genomes.

Reference: GitHub – marbl/canu: A single molecule sequence assembler for genomes large and small.
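
For the 280 Mb genome in this consultation, a Canu run could be parameterized along the following lines. This is a sketch that only builds the command line without executing it: the file name, prefix, output directory, and resource limits are assumptions, `build_canu_command` is a hypothetical helper, and the read-type flag differs between Canu versions and sequencing platforms, so it should be checked against the installed version's documentation.

```python
import shlex


def build_canu_command(reads, prefix="plant_asm", out_dir="canu_out",
                       genome_size="280m", max_threads=32, max_memory_gb=240,
                       read_flag="-pacbio"):
    """Build a Canu command line for a long-read assembly (not executed)."""
    return [
        "canu",
        "-p", prefix,                   # prefix for output file names
        "-d", out_dir,                  # assembly working directory
        f"genomeSize={genome_size}",    # estimated genome size
        f"maxThreads={max_threads}",    # upper limit on threads
        f"maxMemory={max_memory_gb}",   # upper limit on memory, in GB
        read_flag,                      # read type; varies by version and platform
        reads,
    ]


# Hypothetical input file name
cmd = build_canu_command("long_reads.fastq.gz")
print(shlex.join(cmd))
```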

Feel free to request a quote based on your usage and budget - Tegsys' simple inquiry form