Home-Journal Online-2020 No.1

A study on the use of next-generation sequencing data for genome walking in the loquat

Online: 2020/3/19 10:00:45
Author: JIANG Shuang, AN Haishan, XU Fangjie, ZHANG Xueying
Keywords: Loquat; Genome walking; Promoter; Next-generation sequencing; Reads
DOI: 10.13925/j.cnki.gsxb.20190296

Abstract: 【Objective】The genome of loquat (Eriobotrya japonica) has not been published, which limits molecular biology research in this species. Many studies focus on the regulatory relationship between transcription factors and the promoters of their target genes, so obtaining promoter sequences is important. PCR-based genome walking is usually used to obtain promoter sequences, but it is time-consuming and often fails to yield results. In recent years, the cost of next-generation sequencing, represented by Illumina HiSeq, has been decreasing. Illumina HiSeq produces paired Reads of 150 bp each, and a Read pair spans a DNA region of 200 to 500 bp (longer fragments are feasible in library construction), which makes it possible to perform genome walking by matching Reads.【Methods】In April 2018, young leaves of loquat 'Huoju' were collected and genomic DNA was extracted using the CTAB method. Paired-end sequencing was performed on an Illumina HiSeq X Ten. The Clean Reads of 'Huoju' were used for genome walking. The sequences of target genes were retrieved from a transcript database, and the first 100 bp of each gene was used to search for the upstream sequence. Two methods for quickly obtaining promoter sequences are described: one uses an existing program, Magicblast, and the other uses Promoter_Scan, a script newly developed in this study. Because Magicblast is designed to handle Reads, the fastq files required no additional processing and could be used directly. First, the Makeblastdb command was used to build a database from the first 100 bp of the target gene, and then the main program Magicblast was used to search CleanReads.fq with the score parameter set to 60. At this stage, Magicblast did not terminate automatically once some matching Reads had been found, so the program had to be aborted manually after 30 min.
After each run, the matching Reads were collected and assembled with Seqman software to extend the sequence. A new database was built from the first 100 bp of the extended sequence, and the procedure was repeated until the extended promoter sequence reached 2 000 bp. Promoter_Scan is a newly developed program that builds four 10 bp index sequences for each pair of Reads. Each Read has two index sequences: the 10 bp at the end of the Read and the reverse complement of the 10 bp at the front of the Read. First, the four index sequences were matched against the first 100 bp of the target gene. If an index matched, the first 30 bp of the target gene was used as a bait sequence to match the candidate Reads again, and finally the target gene sequence and the mapped Reads were assembled. The procedure was repeated with the first 100 bp of the newly assembled sequence, and the program finished when the promoter reached 2 000 bp.【Results】Next-generation genome sequencing of loquat was performed. After poorly sequenced Reads (0.07%) were filtered out, 206 million paired-end 150 bp Reads remained, totalling 61.77 Gb of bases. The sequencing data have been uploaded to the Beijing Institute of Genomics database (BIG Data Center: CRR056810) (http://bigd.big.ac.cn/gsa). A previous report estimated the genome size of loquat at about 700 Mb, so the sequencing depth in this study was about 85×. CL15890.Contig2 (cell wall expansion protein EjEXP3) was used as the test gene: its sequence was retrieved from the transcript library, and the Magicblast database was built from its first 100 bp. A complete search of the entire dataset (61.77 Gb) took more than 5 hours, so each run was terminated manually after half an hour to obtain partial search results. The sequence of CL15890.Contig2 and the mapped Reads were then assembled.
By the time the EjEXP3 promoter exceeded 2 000 bp, a total of 147 matching Reads had been obtained over the whole search, which required 19 rounds of assembly and 9.5 hours. To estimate the time required by Promoter_Scan, eight genes were randomly selected. Their transcript sequences were obtained from the transcriptome, and genome walking was performed from the first 100 bp of each gene. The number of assembly rounds and the time required were recorded until the promoter sequence reached 2 000 bp. The number of assembly rounds ranged from 9 to 14, with an average of 11.8, and the time ranged from 15.2 to 28.4 minutes, with an average of 21.3 minutes. Promoter_Scan therefore greatly improved the efficiency of the experiment. To further validate the methods developed in this study, primers were designed for the assembled promoter sequences and the amplified products were verified by clone sequencing. We aligned the EjEXP3 promoter sequences obtained from Magicblast, Promoter_Scan, and clone sequencing. The sequences were highly consistent except for the unknown bases (N) in the Promoter_Scan sequence. In addition, for the promoters of the other 7 genes, the clone sequencing results were completely consistent with the sequences predicted by Promoter_Scan, indicating that the results obtained by Promoter_Scan were reliable.【Conclusion】In this study, the Magicblast search method met the experimental requirements without requiring bioinformatics skills, but it was time-consuming. Promoter_Scan enabled automatic extension and significantly reduced the time required. The two methods can be used not only for molecular biology studies in loquat but also in other species without published genomic data.
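
The index-and-bait screening described for Promoter_Scan in the Methods can be illustrated with a minimal Python sketch. This is not the authors' script: the function names (reverse_complement, pair_indexes, pair_is_candidate, contains_bait) are hypothetical, and exact substring matching is an assumption, since the abstract does not say how mismatches or unknown bases (N) are handled.

```python
# Illustrative sketch only; not the Promoter_Scan source code.
# Assumption: indexes and bait are compared by exact substring match.

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A, C, G, T, N)."""
    return seq[::-1].translate(_COMPLEMENT)


def pair_indexes(read1, read2, k=10):
    """Build the four k-bp index sequences for one pair of Reads.

    For each Read: its last k bases, and the reverse complement
    of its first k bases.
    """
    indexes = []
    for read in (read1, read2):
        indexes.append(read[-k:])
        indexes.append(reverse_complement(read[:k]))
    return indexes


def pair_is_candidate(read1, read2, target, k=10):
    """True if any index of the pair occurs in the target window
    (the first 100 bp of the gene in the abstract)."""
    return any(index in target for index in pair_indexes(read1, read2, k))


def contains_bait(read, bait):
    """Second check: the Read, on either strand, contains the bait
    (the first 30 bp of the target in the abstract)."""
    return bait in read or bait in reverse_complement(read)
```

In an iterative walk, Read pairs passing both checks would be assembled with the current sequence, and the first 100 bp of the assembly would become the next target window; the assembly step itself is outside the scope of this sketch.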