How Scaffolding Works
Scaffolding orders, orients, and joins contigs into larger records. In a chromosome-scale assembly workflow, the scaffold is often the file a researcher wants to inspect, compare, or submit: one record per chromosome, linkage group, or reference sequence.
The key point is that scaffolding joins existing sequence. It does not discover the missing bases between contigs. When the sequence between adjacent contigs is unknown, the scaffold uses Ns as an explicit placeholder.
The Core Idea
Scaffolding uses placement evidence to decide three things for adjacent contigs:
- order,
- orientation,
- gap or overlap representation.
A scaffold provides a useful coordinate system, but it should also make explicit what is known and what is not.
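The first two decisions, order and orientation, can be sketched in a few lines. This is a minimal illustration, not a real scaffolder: the `Placement` record, the `order_and_orient` helper, and the toy contigs are all hypothetical, and the sketch assumes each contig already has a single trusted placement with a start coordinate and a strand.

```python
from dataclasses import dataclass

_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string; N stays N."""
    return seq.translate(_COMPLEMENT)[::-1]


@dataclass
class Placement:
    """Hypothetical placement record for one contig."""
    contig: str     # contig name
    ref_start: int  # 0-based start of the placement
    ref_end: int    # end of the placement (exclusive)
    strand: str     # "+" or "-" relative to the placement coordinate system


def order_and_orient(placements, sequences):
    """Sort contigs by placement start and flip minus-strand contigs."""
    ordered = sorted(placements, key=lambda p: p.ref_start)
    result = []
    for p in ordered:
        seq = sequences[p.contig]
        if p.strand == "-":
            seq = reverse_complement(seq)
        result.append((p.contig, seq))
    return result


# Toy input: ctg2 is listed first but is placed second, on the minus strand.
placements = [
    Placement("ctg2", 500, 900, "-"),
    Placement("ctg1", 0, 400, "+"),
]
sequences = {"ctg1": "ACGT", "ctg2": "AACC"}
oriented = order_and_orient(placements, sequences)
print(oriented)  # → [('ctg1', 'ACGT'), ('ctg2', 'GGTT')]
```

The third decision, how to represent the gap or overlap at each junction, is deliberately left out of this sketch because it depends on evidence and review, not only on coordinates.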
Why We Think Scaffolding Works
Scaffolding works when independent evidence agrees on the relative placement of contigs. A close reference can provide approximate order and orientation. Paired reads, Hi-C contacts, long reads, optical maps, linkage maps, and assembly graph links can add support when they are available.
No single evidence type is perfect. References can be wrong or biologically different from the sample. Hi-C can join chromosome arms but has lower resolution near repeats. Long-read and graph-path support can be missing in repetitive regions. Good scaffolding treats evidence as support, not magic.
What Scaffolding Changes
Scaffolding can:
- order contigs,
- orient contigs,
- join contigs into larger FASTA records,
- insert N gaps for unknown sequence,
- report inferred gaps and overlaps,
- create provenance records such as AGP or scaffold-gap tables.
Scaffolding should not:
- pretend Ns are known bases,
- silently remove real sequence,
- resolve an ambiguous graph branch,
- fix a chimeric contig that should have been reviewed earlier.
Gaps And Overlaps
Adjacent contigs can have a positive gap, no inferred gap, or a negative gap that indicates overlap in the placement coordinate system.
This is why the scaffold FASTA alone is not enough. The FASTA contains sequence and Ns. It does not explain why a gap length was chosen, whether an overlap was trimmed, or whether a gap was manually reviewed.
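The three junction cases can be derived from placement coordinates. The sketch below assumes half-open placement intervals (end exclusive) and contigs already in scaffold order; the function name `classify_junctions`, the row fields, and the coordinates are made up for illustration.

```python
def classify_junctions(placements):
    """Classify each junction between adjacent contigs.

    placements: list of (name, start, end) tuples in scaffold order,
    using half-open coordinates (an assumption of this sketch).
    """
    rows = []
    for (left, _, left_end), (right, right_start, _) in zip(placements, placements[1:]):
        gap = right_start - left_end  # negative means the placements overlap
        if gap > 0:
            kind = "gap"
        elif gap == 0:
            kind = "abutting"
        else:
            kind = "overlap"
        rows.append({"left": left, "right": right, "inferred_gap": gap, "kind": kind})
    return rows


# Hypothetical placements covering all three cases.
placements = [
    ("ctg1", 0, 1000),
    ("ctg2", 1500, 2500),  # 500 bp after ctg1
    ("ctg3", 2500, 3000),  # starts exactly where ctg2 ends
    ("ctg4", 2900, 4000),  # overlaps ctg3 by 100 bp
]
rows = classify_junctions(placements)
for row in rows:
    print(row)
```

A table like this is the kind of information the FASTA cannot carry. It records what was inferred, and it leaves the decision about trimming or gap length to a later, reviewable step.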
Provenance Matters
A scaffold is easiest to trust when it has a map from scaffold coordinates back to the original components.
AGP files, scaffold-gap reports, and run summaries are not administrative afterthoughts. They are what let another person reproduce, inspect, and revise the scaffold later.
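As a rough illustration of what such a map looks like, the sketch below writes AGP-style lines for a scaffold made of two contigs and one gap. It assumes the nine-column, tab-delimited, 1-based layout of AGP 2.x, with `W` for sequence components and `N` for gaps of specified length. The helper `agp_lines`, its input tuples, and the default `align_genus` linkage evidence are choices made for this example; check the NCBI AGP specification for the values appropriate to a real submission.

```python
def agp_lines(obj, parts, gap_type="scaffold", linkage_evidence="align_genus"):
    """Build AGP-style lines for one scaffold object.

    parts: list of ("W", component_id, length, orientation) for contigs
           or ("N", gap_length) for gaps, in scaffold order.
    Each contig is assumed to be used in full (component range 1..length).
    """
    lines = []
    pos = 1  # AGP coordinates are 1-based and inclusive
    for number, part in enumerate(parts, start=1):
        if part[0] == "W":
            _, comp_id, length, orient = part
            tail = ["W", comp_id, 1, length, orient]
        else:
            _, length = part
            tail = ["N", length, gap_type, "yes", linkage_evidence]
        cols = [obj, pos, pos + length - 1, number] + tail
        lines.append("\t".join(str(c) for c in cols))
        pos += length
    return lines


# Hypothetical scaffold: ctgA (forward), a 200 bp gap, ctgB (reversed).
parts = [("W", "ctgA", 1000, "+"), ("N", 200), ("W", "ctgB", 500, "-")]
lines = agp_lines("chr1", parts)
print("\n".join(lines))
```

Each line ties a span of scaffold coordinates to either a named component or an explicit gap, which is what makes later auditing possible.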
A Bad Scaffold Can Look Convincing
Wrong order or orientation can create a chromosome-scale record that looks tidy in FASTA but creates problems in alignments, variant calling, and downstream annotation.
This is why high-confidence scaffolding usually comes late in the workflow. Fix bad contigs first, sort and orient contigs carefully, then scaffold the reviewed set.
Example Walkthrough
Three contigs align to the same chromosome in order: A, B, C. The right end of A maps before the left end of B, leaving an inferred 12 kb reference gap. B and C overlap by 400 bp in reference coordinates, but their sequences do not match well at the overlap.
At the concept level, the scaffolding decisions are:
- Join A and B with an N gap because the missing sequence is unknown.
- Record the inferred 12 kb gap in a gap report.
- Treat the B/C negative gap as an overlap decision, not as permission to delete sequence automatically.
- Preserve a component map so the scaffold can be audited.
- Re-align the scaffold if scaffold-level validation is needed.
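The decisions above can be sketched as code. This is a toy version under stated assumptions: the contig sequences are short made-up strings, `build_scaffold` is a hypothetical helper, and the policy for the B/C overlap (keep both contigs untrimmed, insert a 100 bp placeholder gap, and flag the junction for review) is one possible choice rather than a rule from any particular tool.

```python
PLACEHOLDER_GAP = 100  # assumed placeholder length for junctions without a positive gap


def build_scaffold(contigs, inferred_gaps, placeholder_gap=PLACEHOLDER_GAP):
    """Join ordered, oriented contigs with N gaps and keep provenance.

    contigs: list of (name, sequence) in scaffold order.
    inferred_gaps: one inferred gap per junction; negative means overlap.
    Returns (scaffold sequence, component map, gap report).
    """
    pieces, component_map, gap_report = [], [], []
    pos = 0  # 0-based position in the growing scaffold
    for i, (name, seq) in enumerate(contigs):
        if i > 0:
            inferred = inferred_gaps[i - 1]
            # Never delete sequence here: non-positive gaps get a placeholder.
            n_len = inferred if inferred > 0 else placeholder_gap
            gap_report.append({
                "left": contigs[i - 1][0],
                "right": name,
                "inferred_gap": inferred,
                "n_length": n_len,
                "needs_review": inferred < 0,
            })
            pieces.append("N" * n_len)
            pos += n_len
        component_map.append((name, pos, pos + len(seq)))  # half-open interval
        pieces.append(seq)
        pos += len(seq)
    return "".join(pieces), component_map, gap_report


# Toy stand-ins for contigs A, B, and C (real contigs would be far longer).
contigs = [("A", "ACGT" * 5), ("B", "GGCC" * 5), ("C", "TTAA" * 5)]
scaffold, component_map, gap_report = build_scaffold(contigs, [12000, -400])

print(len(scaffold))  # → 12160
print(component_map)
for row in gap_report:
    print(row)
```

The component map and gap report are returned alongside the sequence so that the scaffold never exists without its provenance. Re-aligning the result for validation would be a separate step outside this sketch.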
Common Traps
Do not confuse scaffolding with gap filling. Scaffolding can place Ns. Gap filling replaces some Ns with sequence.
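The distinction is easy to see in the sequence itself. The small sketch below lists the N runs in a scaffold, which are the placeholders that scaffolding introduces and that a separate gap-filling step might later replace. The helper name `n_runs` and the toy sequence are made up for illustration.

```python
import re


def n_runs(sequence):
    """Return half-open (start, end) intervals for each run of N or n."""
    return [(m.start(), m.end()) for m in re.finditer(r"[Nn]+", sequence)]


# Toy scaffold with two gaps of 10 and 3 Ns.
scaffold = "ACGT" + "N" * 10 + "GGCC" + "N" * 3 + "TT"
gaps = n_runs(scaffold)
print(gaps)  # → [(4, 14), (18, 21)]
```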
Do not trim overlaps just because reference coordinates overlap. Sequence confirmation and the biological context matter.
Do not scaffold before resolving obvious chimeric contigs. A scaffold cannot make a bad component safe.
Do not forget that reference-guided scaffolding inherits reference assumptions. If the sample has real structural differences, the reference is a guide, not a source of truth.
Brief History And Further Reading
Scaffolding has been part of genome assembly since whole-genome shotgun projects needed to connect contigs using mate pairs and clone-end information. As sequencing changed, scaffold evidence expanded: long reads can span repeats and gaps, Hi-C can provide chromosome-scale contact patterns, optical maps and linked reads can add long-range constraints, and reference-guided methods can use conserved synteny when an appropriate reference exists.
Modern chromosome-scale assemblies often combine these ideas. The conceptual lesson has stayed stable: a scaffold is a supported ordering of known sequence plus explicit uncertainty.
Further reading:
- Batzoglou et al. 2002. ARACHNE: a whole-genome shotgun assembler.
- Burton et al. 2013. Chromosome-scale scaffolding of de novo genome assemblies based on chromatin interactions.
- Dudchenko et al. 2017. De novo assembly of the Aedes aegypti genome using Hi-C yields chromosome-length scaffolds.
- Alonge et al. 2019. RaGOO: fast and accurate reference-guided scaffolding of draft genomes.
- NCBI. AGP specification.