How Scaffolding Works

Scaffolding turns ordered contigs into larger records. In a chromosome-scale assembly workflow, the scaffold is often the file a researcher wants to inspect, compare, or submit: one record per chromosome, linkage group, or reference sequence.

The key point is that scaffolding joins existing sequence. It does not discover the missing bases between contigs. When the sequence between adjacent contigs is unknown, the scaffold uses Ns as an explicit placeholder.

The Core Idea

Scaffolding uses placement evidence to decide three things for adjacent contigs:

  1. order,
  2. orientation,
  3. gap or overlap representation.

[Figure: three contigs (A, B, C) are placed along a reference, oriented, and joined into one scaffold, with N gaps where sequence is unknown. The output is a larger coordinate frame, not proof that every base between contigs is known.]
Figure 1. Scaffolding is ordering plus uncertainty. Contigs become one larger scaffold record. Ns mark sequence that is not known from the contigs themselves.

The scaffold is a useful coordinate system. It should also be honest about what is known and unknown.
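
The three decisions can be made concrete with a small sketch. It assumes order, orientation, and gap size have already been decided by some upstream evidence. The names PlacedContig, join_contigs, and gap_after are illustrative only and do not come from any particular scaffolding tool.

```python
from dataclasses import dataclass

# Complement table for DNA; N stays N.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


@dataclass
class PlacedContig:
    # Hypothetical record: one contig plus the decisions made about it.
    name: str
    sequence: str
    reverse: bool   # orientation: True means use the reverse complement
    gap_after: int  # Ns to write before the next contig (ignored for the last)


def join_contigs(placed: list[PlacedContig]) -> str:
    """Join contigs in list order, orienting each and padding gaps with Ns."""
    parts = []
    for i, contig in enumerate(placed):
        seq = reverse_complement(contig.sequence) if contig.reverse else contig.sequence
        parts.append(seq)
        if i < len(placed) - 1:
            parts.append("N" * contig.gap_after)
    return "".join(parts)


scaffold = join_contigs([
    PlacedContig("A", "ACGT", reverse=False, gap_after=3),
    PlacedContig("B", "AAAC", reverse=True, gap_after=0),
])
print(scaffold)  # ACGTNNNGTTT
```

The join adds no new bases: every non-N character in the output comes from an input contig.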

Why We Think Scaffolding Works

Scaffolding works when independent evidence agrees on the relative placement of contigs. A close reference can provide approximate order and orientation. Paired reads, Hi-C contacts, long reads, optical maps, linkage maps, and assembly graph links can add support when they are available.

[Figure: four evidence types support the same candidate junction between contig A and contig B. Reference: same chromosome order. Hi-C: contact enrichment. Reads: bridges or links. Graph: oriented link/path.]
Figure 2. Evidence can converge on an adjacency. A scaffold is more credible when multiple independent signals point to the same order and orientation.

No single evidence type is perfect. References can be wrong or biologically different from the sample. Hi-C can join chromosome arms but has lower resolution near repeats. Long reads and graph paths can be missing in repetitive regions. Good scaffolding treats each evidence type as support, not as proof.
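
One way to picture converging evidence is to count how many distinct sources support each oriented adjacency. The sketch below is a deliberate simplification: it assumes each observation has already been reduced to a (source, left, right) triple, and the source labels and the min_sources threshold are hypothetical. Real scaffolders typically weight evidence rather than count it.

```python
from collections import defaultdict


def sources_per_adjacency(observations):
    """Map each (left, right) adjacency to the set of sources supporting it."""
    support = defaultdict(set)
    for source, left, right in observations:
        support[(left, right)].add(source)
    return dict(support)


def well_supported(support, min_sources=2):
    """Return adjacencies backed by at least min_sources distinct sources."""
    return sorted(adj for adj, sources in support.items() if len(sources) >= min_sources)


# "A+" means contig A in forward orientation (an illustrative convention).
observations = [
    ("reference", "A+", "B+"),
    ("hic", "A+", "B+"),
    ("long_reads", "A+", "B+"),
    ("reference", "B+", "C+"),
]
support = sources_per_adjacency(observations)
print(well_supported(support))  # [('A+', 'B+')]
```

Here the B/C adjacency rests on the reference alone, so it would be a candidate for closer review rather than an automatic join.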

What Scaffolding Changes

Scaffolding can:

Scaffolding should not:

Gaps And Overlaps

Adjacent contigs can have a positive gap, no inferred gap, or a negative gap that indicates overlap in the placement coordinate system.

[Figure: three junction types. Positive gap: insert Ns. Touching spans: zero inferred gap. Negative gap: overlap, review before trimming. A scaffold-gap report is how the uncertainty stays visible.]
Figure 3. Gaps and overlaps are junction annotations. Positive gaps usually become Ns. Overlaps require a policy and should be reported clearly.

This is why the scaffold FASTA alone is not enough. The FASTA contains sequence and Ns. It does not explain why a gap length was chosen, whether an overlap was trimmed, or whether a gap was manually reviewed.
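
The three junction cases can be expressed as a small classifier. This is a minimal sketch that assumes 0-based, half-open placement coordinates, so the gap is simply the next contig's start minus the previous contig's end. The function name and labels are illustrative.

```python
def classify_junction(prev_end: int, next_start: int) -> tuple[str, int]:
    """Classify a junction from half-open placement coordinates.

    Returns a label and a non-negative size in bases.
    """
    gap = next_start - prev_end
    if gap > 0:
        return ("gap", gap)          # unknown sequence: usually written as Ns
    if gap == 0:
        return ("touching", 0)       # no inferred gap
    return ("overlap", -gap)         # needs a policy; do not trim blindly


print(classify_junction(100, 160))  # ('gap', 60)
print(classify_junction(100, 100))  # ('touching', 0)
print(classify_junction(100, 60))   # ('overlap', 40)
```

Writing the label and size to a report alongside the FASTA keeps the reasoning behind each junction available for later review.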

Provenance Matters

A scaffold is easiest to trust when it has a map from scaffold coordinates back to the original components.

[Figure: a scaffold FASTA with two N gaps, paired with AGP-like component and gap records. Every scaffold base has a source or a gap reason.]

Map records:

  1-170: contig A, forward
  171-230: gap, 60 Ns, inferred
  231-380: contig B, reverse
  381-460: gap, 80 Ns, reviewed
  461-625: contig C, forward
Figure 4. Provenance makes scaffolds reviewable. Component maps and gap reports explain where each scaffold interval came from.

AGP files, scaffold-gap reports, and run summaries are not administrative afterthoughts. They are what let another person reproduce, inspect, and revise the scaffold later.
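
A component map of this kind can be derived from the ordered parts and their lengths. The sketch below produces simplified AGP-like rows with 1-based inclusive coordinates, using the lengths from Figure 4. It does not reproduce the actual AGP column layout, and the tuple fields are an assumption made for illustration.

```python
def component_map(parts):
    """Assign 1-based inclusive scaffold coordinates to ordered parts.

    parts: list of (kind, label, length, note) in scaffold order.
    Returns rows of (start, end, kind, label, note).
    """
    rows = []
    position = 1
    for kind, label, length, note in parts:
        start = position
        end = position + length - 1
        rows.append((start, end, kind, label, note))
        position = end + 1
    return rows


parts = [
    ("contig", "A", 170, "forward"),
    ("gap", "N", 60, "inferred"),
    ("contig", "B", 150, "reverse"),
    ("gap", "N", 80, "reviewed"),
    ("contig", "C", 165, "forward"),
]
for row in component_map(parts):
    print(row)
```

Because each row starts where the previous one ended, the rows tile the scaffold with no unexplained interval.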

A Bad Scaffold Can Look Convincing

Wrong order or orientation can create a chromosome-scale record that looks tidy in FASTA but creates problems in alignments, variant calling, and downstream annotation.

[Figure: scaffolding errors become coordinate errors. Supported order A, B, C gives consistent downstream coordinates. Unsupported order A, C, B places C between A and B and creates a false rearrangement signal.]
Figure 5. A scaffold is a hypothesis about coordinate order. Wrong order can create artificial structural signals downstream.

This is why high-confidence scaffolding usually comes late in the workflow. Fix bad contigs first, sort and orient contigs carefully, then scaffold the reviewed set.
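
A simple consistency check is to compare scaffold order against the placement coordinates used to build it. The sketch below assumes each contig has a single reference start position. The function name and the positions are hypothetical. A reported conflict is a prompt for review, not proof of an error, because the sample may differ structurally from the reference.

```python
def order_conflicts(scaffold_order, ref_start):
    """Return adjacent pairs whose scaffold order disagrees with reference order."""
    conflicts = []
    for left, right in zip(scaffold_order, scaffold_order[1:]):
        if ref_start[left] > ref_start[right]:
            conflicts.append((left, right))
    return conflicts


# Hypothetical reference start positions for three contigs.
ref_start = {"A": 0, "B": 512_000, "C": 899_600}

print(order_conflicts(["A", "B", "C"], ref_start))  # []
print(order_conflicts(["A", "C", "B"], ref_start))  # [('C', 'B')]
```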

Example Walkthrough

Three contigs align to the same chromosome in order: A, B, C. The right end of A maps before the left end of B, leaving an inferred 12 kb reference gap. B and C overlap by 400 bp in reference coordinates, but their sequences do not match well at the overlap.

At the concept level, the scaffolding steps are:

  1. Join A and B with an N gap because the missing sequence is unknown.
  2. Record the inferred 12 kb gap in a gap report.
  3. Treat the B/C negative gap as an overlap decision, not as permission to delete sequence automatically.
  4. Preserve a component map so the scaffold can be audited.
  5. Re-align the scaffold if scaffold-level validation is needed.
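
The junction decisions in this walkthrough can be sketched as a small gap report. The reference coordinates below are hypothetical, chosen only so that the A/B junction has a 12 kb gap and the B/C junction has a 400 bp overlap. Coordinates are 0-based and half-open, and the action labels are illustrative.

```python
def gap_report(placements):
    """Build one report row per junction.

    placements: list of (name, ref_start, ref_end), sorted by ref_start.
    """
    report = []
    for (left, _, left_end), (right, right_start, _) in zip(placements, placements[1:]):
        delta = right_start - left_end
        if delta > 0:
            action = "insert_Ns"        # sequence between contigs is unknown
        elif delta == 0:
            action = "join_no_gap"
        else:
            action = "review_overlap"   # never trim automatically
        report.append({"left": left, "right": right, "delta_bp": delta, "action": action})
    return report


# Hypothetical placements matching the walkthrough.
placements = [
    ("A", 0, 500_000),
    ("B", 512_000, 900_000),
    ("C", 899_600, 1_300_000),
]
for row in gap_report(placements):
    print(row)
```

The B/C row is flagged rather than resolved, which matches step 3: the overlap is a decision for a reviewer or an explicit policy, not something the join performs silently.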

Common Traps

Do not confuse scaffolding with gap filling. Scaffolding can place Ns. Gap filling replaces some Ns with sequence.
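
To see what scaffolding has left unresolved, it can help to list the N runs in a scaffold. The sketch below uses only the Python standard library. The function name is illustrative and the returned coordinates are 0-based and half-open.

```python
import re


def n_runs(sequence: str):
    """Return (start, end) for each run of N or n in the sequence."""
    return [(m.start(), m.end()) for m in re.finditer(r"[Nn]+", sequence)]


print(n_runs("ACGTNNNNACNNGT"))  # [(4, 8), (10, 12)]
```

Each interval is a placeholder that gap filling might later replace with sequence. Scaffolding on its own leaves these intervals as Ns.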

Do not trim overlaps just because reference coordinates overlap. Sequence confirmation and the biological context matter.

Do not scaffold before resolving obvious chimeric contigs. A scaffold cannot make a bad component safe.

Do not forget that reference-guided scaffolding inherits reference assumptions. If the sample has real structural differences, the reference may be a guide rather than a truth source.

Brief History And Further Reading

Scaffolding has been part of genome assembly since whole-genome shotgun projects needed to connect contigs using mate pairs and clone-end information. As sequencing changed, scaffold evidence expanded: long reads can span repeats and gaps, Hi-C can provide chromosome-scale contact patterns, optical maps and linked reads can add long-range constraints, and reference-guided methods can use conserved synteny when an appropriate reference exists.

Modern chromosome-scale assemblies often combine these ideas. The conceptual lesson has stayed stable: a scaffold is a supported ordering of known sequence plus explicit uncertainty.
