How Gap Filling Works

Gap filling tries to replace unknown scaffold sequence with actual bases. That makes it more powerful than scaffolding, and riskier. A gap should be filled only when the candidate sequence is connected to the correct flanks, supported by evidence, and not confused with an equally plausible alternative.

The key question is:

Do we know the sequence that belongs between these two scaffold flanks?

The Core Idea

Scaffolding says “these contigs are adjacent, but the sequence between them is unknown.” Gap filling says “this candidate path supplies the missing sequence.”

[Diagram: Graph-supported gap filling replaces scaffold Ns. Top: a scaffold gap (left contig, Ns, right contig). Bottom: candidate graph paths between L and R. The path through A and B is supported and replaces the Ns after review; the alternative path through X and Y is left unresolved. A fill is safe only when one candidate path connects the correct flanks and alternatives are ruled out.]
Figure 1. Gap filling is path selection plus sequence validation. The graph may contain more than one path. Filling is appropriate only when one path is selected and its sequence fits the scaffold flanks.

The difference from scaffolding is simple but important: scaffolding can be honest with Ns. Gap filling must choose bases.
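To make "choosing bases" concrete, here is a minimal sketch of what a fill does to a scaffold string. The function names `find_gaps` and `apply_fill` and the toy sequences are illustrative assumptions, not part of any particular tool.

```python
import re

def find_gaps(seq):
    """Return half-open (start, end) intervals for every run of N in seq."""
    return [(m.start(), m.end()) for m in re.finditer(r"N+", seq, flags=re.IGNORECASE)]

def apply_fill(seq, gap, fill):
    """Replace one N run with fill bases; refuse anything that is not a pure N gap."""
    start, end = gap
    if set(seq[start:end].upper()) != {"N"}:
        raise ValueError("interval is not a pure N gap")
    if not fill or set(fill.upper()) - set("ACGT"):
        raise ValueError("fill must be a non-empty A/C/G/T sequence")
    return seq[:start] + fill + seq[end:]

# Toy scaffold: left contig, 10 Ns, right contig (hypothetical sequences).
scaffold = "ACGTACGT" + "N" * 10 + "TTGACCA"
gaps = find_gaps(scaffold)
filled = apply_fill(scaffold, gaps[0], "GGCATC")
```

The edit itself is trivial string surgery. Everything that makes a fill trustworthy happens before `apply_fill` is called.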

Why Gap Filling Is Harder Than Scaffolding

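Scaffolding can be useful even when the gap sequence is unknown. Gap filling must answer more questions: does a candidate sequence connect the correct flanks, is it supported by evidence, and is there an equally plausible alternative?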

If those questions do not converge, the honest result is usually to leave Ns.

Path Statuses

Most gap-filling decisions reduce to a small set of patterns.

[Diagram: Common gap-filling path statuses. Different graph situations need different actions. No path (L, R): leave Ns or inspect inputs. Unique path (L, A, R): fill after sequence validation and review. Ambiguous paths (L, A, B, R): do not guess; add evidence or leave unresolved.]
Figure 2. Path status controls the decision. A unique path is not the same as an ambiguous branch. No path and ambiguous paths should usually remain gaps.

Ambiguity is not failure. In many genomes, especially repeat-rich plant genomes, refusing to fill an ambiguous gap is the correct scientific choice.
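The three statuses can be sketched as a path count between the two flank nodes. This is a simplified illustration: the graph is a plain adjacency dict, node orientation and overlaps are ignored, and the names `simple_paths` and `path_status` are hypothetical.

```python
def simple_paths(graph, start, goal, max_nodes=50):
    """Enumerate simple paths (no repeated node) from start to goal by DFS."""
    paths, stack = [], [(start, [start])]
    while stack:
        node, path = stack.pop()
        if node == goal:
            paths.append(path)
            continue
        if len(path) >= max_nodes:  # guard against runaway search in tangled graphs
            continue
        for nxt in graph.get(node, ()):
            if nxt not in path:
                stack.append((nxt, path + [nxt]))
    return paths

def path_status(graph, left, right):
    """Map the number of candidate paths to one of the three statuses."""
    n = len(simple_paths(graph, left, right))
    if n == 0:
        return "no_path"
    if n == 1:
        return "unique_path"
    return "ambiguous_paths"

# Toy graphs mirroring the three rows of the figure.
no_path = {"L": [], "R": []}
unique = {"L": ["A"], "A": ["R"]}
ambiguous = {"L": ["A", "B"], "A": ["R"], "B": ["R"]}
statuses = [path_status(g, "L", "R") for g in (no_path, unique, ambiguous)]
```

A status of `unique_path` only says the topology is unambiguous. The path still has to pass sequence validation and review before any bases are written.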

Flank Validation

A graph path is not enough. The path must also fit the scaffold sequence at both ends.

[Diagram: Gap-fill flank validation. The path must fit the scaffold flanks. Pass: the graph fill sequence sits between the left and right flanks, and the path ends match the FASTA flanks. Fail: the path ends are mismatched, so the graph and scaffold likely come from different stages and the path should not be applied.]
Figure 3. Flank validation protects against stale evidence. A path from the wrong graph or assembly stage can look plausible until its ends are compared with the actual scaffold sequence.

If flank validation fails, stop. It usually means names, coordinates, sequence versions, or graph stages drifted.
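A flank check can be sketched as a comparison between the ends of the path sequence and the scaffold bases on either side of the gap. The sketch assumes the path sequence carries `k` bases of each flank at its ends and uses exact matching; a real workflow might align the ends and tolerate a few mismatches. The function name and coordinates are illustrative.

```python
def validate_flanks(scaffold, gap_start, gap_end, path_seq, k=4):
    """Return (left_ok, right_ok) for a path assumed to span flank + fill + flank.

    gap_start/gap_end are half-open coordinates of the N run in the scaffold.
    """
    left = scaffold[max(0, gap_start - k):gap_start].upper()
    right = scaffold[gap_end:gap_end + k].upper()
    # Too little flank or too little path: nothing can be validated.
    if len(left) < k or len(right) < k or len(path_seq) <= 2 * k:
        return (False, False)
    return (path_seq[:k].upper() == left, path_seq[-k:].upper() == right)

scaffold = "ACGTACGT" + "N" * 10 + "TTGACCA"   # gap at [8, 18)
good_path = "ACGT" + "GGCATC" + "TTGA"         # ends match both flanks
stale_path = "ACGA" + "GGCATC" + "TTGA"        # left end drifted from the scaffold

good = validate_flanks(scaffold, 8, 18, good_path)
stale = validate_flanks(scaffold, 8, 18, stale_path)
```

Returning the two sides separately makes it easier to report which flank drifted rather than a bare pass or fail.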

Evidence Can Support Or Conflict

When several paths exist, every line of evidence should select the same path. Conflicting evidence is a warning to investigate, not an inconvenience to override.

[Diagram: Evidence support and conflict in gap filling. Evidence should converge on one path. Between L and R, GAF read traversals support path A and reference placement supports path A, while Hi-C contacts weakly favor path B, which creates a review decision.]
Figure 4. Support can converge or conflict. A fill is easier to trust when reads, graph topology, and placement evidence point to the same path.

The conservative choice is to fill only when the evidence singles out one path clearly enough for the review goal. A benchmark run may apply all fillable paths to test behavior. A production curation run should usually require an explicitly accepted review row for each fill.
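The convergence test can be sketched as a check that every informative evidence source names the same path. The source labels and the `evidence_verdict` function are hypothetical, and the sketch deliberately ignores evidence strength, so even a weak dissenting signal is surfaced as a conflict for review.

```python
def evidence_verdict(votes):
    """votes maps an evidence source to the path it favors, or None if uninformative."""
    favored = {path for path in votes.values() if path is not None}
    if not favored:
        return ("unresolved", None)            # nothing selects any path
    if len(favored) == 1:
        return ("converges", next(iter(favored)))
    return ("conflict", None)                  # flag for review, do not pick a winner

# The situation in the figure: reads and placement agree, Hi-C disagrees.
figure_case = evidence_verdict({
    "gaf_reads": "A",
    "reference_placement": "A",
    "hic_contacts": "B",
})
# The same gap when Hi-C is uninformative.
agreeing_case = evidence_verdict({
    "gaf_reads": "A",
    "reference_placement": "A",
    "hic_contacts": None,
})
```

A weighted vote would resolve the figure case in favor of path A. Whether that is acceptable depends on the review goal, which is why the sketch reports the conflict instead of resolving it.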

Possible Outcomes

Gap filling is not simply pass or fail.

[Diagram: Gap-filling outcomes. A good result can still leave Ns. Accepted fill: unique path, validated flanks, accepted review row. Reviewed gap: a fill exists, but the reviewer keeps the Ns for caution. Unresolved: no path, missing sequence, or ambiguous support.]
Figure 5. Leaving Ns can be the correct output. The goal is not to fill every gap. The goal is to fill only gaps whose sequence is defensible.

This mindset is especially important in repeat-rich genomes. A confident unresolved gap is better than a confident wrong sequence.

Example Walkthrough

A scaffold has a 20 kb N gap between contig A and contig B. In the assembly graph, A and B connect through two possible paths. Long read alignments to the graph traverse only path 1, and the path sequence matches the suffix of A and the prefix of B. Path 2 has no read support and includes an unsequenced node.

The concept-level interpretation is:

  1. There is a candidate sequence between the correct flanks.
  2. One path is better supported than the alternative.
  3. The sequence validates against the scaffold boundaries.
  4. The fill can be accepted if the review goal is to replace Ns with graph sequence.

If either flank failed validation, or if read support tied between paths, the better result would be to leave the Ns and report why.
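The walkthrough can be expressed as a small decision function. This is a sketch under stated assumptions: the `Candidate` fields, the read counts, and the decision labels are invented for illustration, and a `fill_candidate` result still needs to be accepted in review before any bases are written.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    read_traversals: int    # long reads that traverse the whole path
    flanks_ok: bool         # path ends match suffix of A and prefix of B
    fully_sequenced: bool   # every node on the path has bases

def decide(candidates):
    """Return (decision, detail); detail is a path name or the reason Ns are kept."""
    if not candidates:
        return ("leave_Ns", "no candidate path")
    best = max(c.read_traversals for c in candidates)
    if best == 0:
        return ("leave_Ns", "no read support")
    leaders = [c for c in candidates if c.read_traversals == best]
    if len(leaders) > 1:
        return ("leave_Ns", "read support tied between paths")
    winner = leaders[0]
    if not winner.flanks_ok:
        return ("leave_Ns", "flank validation failed")
    if not winner.fully_sequenced:
        return ("leave_Ns", "path contains an unsequenced node")
    return ("fill_candidate", winner.name)

# The walkthrough: path 1 is read-supported and validated, path 2 is not.
walkthrough = decide([
    Candidate("path1", read_traversals=12, flanks_ok=True, fully_sequenced=True),
    Candidate("path2", read_traversals=0, flanks_ok=False, fully_sequenced=False),
])
# The counterfactual: reads split evenly between the two paths.
tied = decide([
    Candidate("path1", read_traversals=6, flanks_ok=True, fully_sequenced=True),
    Candidate("path2", read_traversals=6, flanks_ok=True, fully_sequenced=True),
])
```

Every `leave_Ns` result carries a reason, so the unresolved gap is reported with an explanation rather than silently skipped.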

Common Traps

Do not fill gaps from a graph that no longer matches the scaffold FASTA stage.

Do not treat the shortest path as the correct path. Repeats and bubbles can make the shortest graph path biologically wrong.

Do not override conflicting evidence just because a path can be reconstructed. Reconstruction asks “can sequence be built?” Review asks “should this sequence be placed here?”

Do not forget that unsequenced graph nodes cannot contribute bases. Topology without sequence can support review, but it cannot fill the FASTA interval.
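The last trap can be illustrated with a sketch that refuses to build a fill when any node on the path lacks bases. It treats `*` as "no sequence", following the placeholder that GFA segment records use, and it simply concatenates node sequences, ignoring overlaps and orientation. The function name and the toy nodes are assumptions.

```python
def path_sequence(path, node_seqs):
    """Concatenate node sequences along a path; return None if any node has no bases."""
    parts = []
    for node in path:
        seq = node_seqs.get(node)
        if seq is None or seq in ("", "*"):
            return None   # topology exists, but there is nothing to write into the FASTA
        parts.append(seq)
    return "".join(parts)

node_seqs = {"A": "ACGT", "B": "GGCA", "X": "*"}   # X is an unsequenced node
usable = path_sequence(["A", "B"], node_seqs)
unusable = path_sequence(["A", "X"], node_seqs)
```

A `None` result does not make the path useless for review. It only means the path cannot supply the bases for the gap interval.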

Brief History And Further Reading

Gap filling began as an attempt to use additional reads or read pairs to close N gaps left by short-read assemblies. As long reads became common, tools such as PBJelly used long-read alignments to bridge gaps and improve draft genomes. With modern graph-based assemblies, the same conceptual question appears in a new form: which path through the graph, if any, belongs between these scaffold flanks?

The history is a shift from “find any sequence that spans the gap” toward “choose a supported sequence path and preserve uncertainty when the evidence is ambiguous.”

Further reading: