sanger-tol/genomeassembly is a bioinformatics pipeline for de novo genome assembly from long-read data (PacBio HiFi or ONT), long-range Hi-C data, and optionally Illumina WGS and Illumina 10X linked reads. It can produce primary/alternative assemblies, Hi-C phased assemblies, and trio-binned assemblies using parental sequencing data.
- If FastK databases and coverage information are not provided, the pipeline first builds the databases and estimates genome coverage using genomescope2.
- Assembles the provided long reads using hifiasm, optionally producing Hi-C phased or trio-binned assemblies.
- (optional) Purges retained haplotigs from the assembly using purge_dups.
- (optional) Polishes the combined assembly using Illumina 10X reads with Longranger and Freebayes.
- Maps Hi-C reads to each assembly using bwamem2 or minimap2.
- Scaffolds each assembly with YaHS using long-range Hi-C interactions.
- Produces numerical statistics for each assembly at each stage of the pipeline using GFASTATS (assembly statistics), BUSCO (single-copy ortholog statistics), and MERQURY.FK (QV and k-mer completeness).
- Assembles organelle genomes using oatk (de novo assembly) and MitoHiFi (reference-based assembly).
Note
If you are new to Nextflow and nf-core, please refer to this page on how to set up Nextflow. Make sure to test your setup with -profile test before running the workflow on actual data.
Currently, it is advised to run the pipeline with Docker or Singularity, as purge_dups and MitoHiFi do not support running with Conda.
Now, you can run the pipeline using:
nextflow run sanger-tol/genomeassembly \
-profile <docker/singularity/.../institute> \
--genomic_data genomic_data.yaml \
--assembly_specs assembly_specs.yaml \
--outdir <OUTDIR>
Warning
Please provide pipeline parameters via the CLI or Nextflow -params-file option. Custom config files including those provided by the -c Nextflow option can be used to provide any configuration except for parameters; see docs.
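As a minimal sketch of the -params-file route, the same three parameters shown in the command above could be collected in a YAML file. The file name params.yaml and the output directory results are placeholders chosen for illustration, not values required by the pipeline.

```shell
# Write the pipeline parameters to a YAML params file.
# Keys mirror the --genomic_data, --assembly_specs and --outdir CLI flags;
# "params.yaml" and "results" are illustrative placeholders.
cat > params.yaml <<'EOF'
genomic_data: genomic_data.yaml
assembly_specs: assembly_specs.yaml
outdir: results
EOF
```

The file can then be passed to nextflow run with -params-file params.yaml in place of the individual --genomic_data, --assembly_specs and --outdir flags; the -profile option is still given on the command line.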
sanger-tol/genomeassembly was originally written by Ksenia Krasheninnikova and Jim Downie.
We thank the following people for their extensive assistance in the development of this pipeline:
@priyanka-surana for the code review, very helpful coding suggestions, and assistance with pushing this pipeline forward through development.
@mcshane and @c-zhou for the design and implementation of the original pipelines for purging (@mcshane), polishing (@mcshane) and scaffolding (@c-zhou).
The TreeVal team, Damon-Lee Pointon (@DLBPointon), Yumi Sims (@yumisims) and William Eagles (@weaglesBio), for implementation of the Hi-C mapping pipeline.
@muffato for help with nf-core integration, dealing with infrastructure and troubleshooting, for the code reviews and valuable suggestions at the different stages of the pipeline development.
@gq1 for the code review, valuable suggestions for improving the code, and contributions to the full test setup.
@mahesh-panchal for the Nextflow implementation of the purging pipeline, code review, and valuable suggestions for the nf-core modules implementation.
If you would like to contribute to this pipeline, please see the contributing guidelines.
If you use sanger-tol/genomeassembly for your analysis, please cite it using the following DOI: 10.5281/zenodo.10391851.
An extensive list of references for the tools used by the pipeline can be found in the CITATIONS.md file.
This pipeline uses code and infrastructure developed and maintained by the nf-core community, reused here under the MIT license.
The nf-core framework for community-curated bioinformatics pipelines.
Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.
Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.