How Gap Filling Works

Gap filling tries to replace unknown scaffold sequence with actual bases. That makes it more powerful than scaffolding, and riskier. A gap should be filled only when the candidate sequence is connected to the correct flanks, supported by evidence, and not confused with an equally plausible alternative.

The key question is:

Do we know the sequence that belongs between these two scaffold flanks?

The Core Idea

Scaffolding says “these contigs are adjacent, but the sequence between them is unknown.” Gap filling says “this candidate path supplies the missing sequence.”

Figure 1. Gap filling is path selection plus sequence validation. The graph may contain more than one path. Filling is appropriate only when one path is selected and its sequence fits the scaffold flanks.

The difference from scaffolding is simple but important: scaffolding can be honest with Ns. Gap filling must choose bases.

Why Gap Filling Is Harder Than Scaffolding

Scaffolding can be useful even when the gap sequence is unknown. Gap filling must answer more questions:

Do the scaffold flanks map to graph nodes or path ends?
Is there a path between those flanks?
Is there exactly one best path, or several plausible paths?
Does the graph path have sequence?
Do the path ends match the scaffold FASTA flanks?
Do reads, graph topology, Hi-C contacts, or reference placement support the same path?

If those questions do not converge, the honest result is usually to leave Ns.
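The checklist above can be sketched as a small decision function. This is a minimal illustration, not the behavior of any particular tool: the `GapChecks` record, its field names, and the `"fill"` / `"leave_Ns"` labels are all hypothetical, and each boolean is assumed to have been computed elsewhere.

```python
from dataclasses import dataclass


@dataclass
class GapChecks:
    """Hypothetical record holding one yes/no answer per checklist question."""
    flanks_mapped: bool        # flanks map to graph nodes or path ends
    path_exists: bool          # at least one path connects the flanks
    single_best_path: bool     # exactly one best path, not several
    path_has_sequence: bool    # every node on the path carries sequence
    flanks_match_fasta: bool   # path ends match the scaffold FASTA flanks
    evidence_agrees: bool      # independent evidence supports the same path


def decide(checks: GapChecks) -> tuple[str, list[str]]:
    """Return the decision and the names of any checks that failed."""
    failed = [name for name, ok in vars(checks).items() if not ok]
    decision = "fill" if not failed else "leave_Ns"
    return decision, failed
```

Returning the failed check names alongside the decision keeps the reason for leaving Ns visible in the report, instead of collapsing every failure into one generic status.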

Path Statuses

Most gap-filling decisions reduce to a small set of patterns.

Figure 2. Path status controls the decision. A unique path is not the same as an ambiguous branch. No path and ambiguous paths should usually remain gaps.

Ambiguity is not failure. In many genomes, especially repeat-rich plant genomes, refusing to fill an ambiguous gap is the correct scientific choice.
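One way to make path status concrete is to count the simple paths between the two flank nodes. The sketch below assumes a toy directed graph stored as an adjacency dictionary; real assembly graphs are bidirected and carry node orientation, which is ignored here. The status labels `no_path`, `unique_path`, and `ambiguous`, and the `max_nodes` bound, are illustrative names, not taken from any tool.

```python
def simple_paths(graph, start, end, max_nodes=50):
    """Enumerate simple (non-repeating) paths from start to end by DFS.

    `graph` maps a node name to a list of successor node names.
    `max_nodes` bounds path length so the search cannot run away in
    tangled regions.
    """
    paths = []
    stack = [(start, [start])]
    while stack:
        node, path = stack.pop()
        if node == end:
            paths.append(path)
            continue
        if len(path) >= max_nodes:
            continue
        for nxt in graph.get(node, []):
            if nxt not in path:  # never revisit a node within one path
                stack.append((nxt, path + [nxt]))
    return paths


def path_status(graph, left_flank, right_flank):
    """Classify a gap by how many paths connect its flank nodes."""
    n = len(simple_paths(graph, left_flank, right_flank))
    if n == 0:
        return "no_path"
    if n == 1:
        return "unique_path"
    return "ambiguous"
```

For a simple bubble such as `{"A": ["x", "y"], "x": ["B"], "y": ["B"]}`, `path_status(graph, "A", "B")` reports `ambiguous`, which is a signal to look for further evidence before filling.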

Flank Validation

A graph path is not enough. The path must also fit the scaffold sequence at both ends.

Figure 3. Flank validation protects against stale evidence. A path from the wrong graph or assembly stage can look plausible until its ends are compared with the actual scaffold sequence.

If flank validation fails, stop. It usually means names, coordinates, sequence versions, or graph stages drifted.
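A flank check can be as simple as comparing the ends of the candidate path sequence with the scaffold bases on either side of the gap. The sketch below makes several assumptions: the path sequence is assumed to include `k` anchor bases overlapping each flank, coordinates are 0-based with an exclusive gap end, and only exact matches are accepted. The function name and the returned keys are hypothetical, and a real workflow might tolerate a small number of mismatches.

```python
def validate_flanks(scaffold: str, gap_start: int, gap_end: int,
                    path_seq: str, k: int = 20) -> dict:
    """Compare k-base anchors of a candidate path with the scaffold flanks.

    scaffold[gap_start:gap_end] is the N run. The candidate is expected
    to start with the last k bases before the gap and to end with the
    first k bases after it.
    """
    left = scaffold[max(0, gap_start - k):gap_start].upper()
    right = scaffold[gap_end:gap_end + k].upper()
    candidate = path_seq.upper()

    long_enough = len(candidate) >= 2 * k
    # A flank that is too short or contains Ns cannot anchor anything.
    left_ok = (long_enough and len(left) == k and "N" not in left
               and candidate.startswith(left))
    right_ok = (long_enough and len(right) == k and "N" not in right
                and candidate.endswith(right))
    return {"left_ok": left_ok, "right_ok": right_ok,
            "valid": left_ok and right_ok}
```

Reporting the two ends separately helps diagnose drift: one failing flank often points to a coordinate or orientation problem, while two failing flanks suggest the graph and the FASTA come from different stages.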

Evidence Can Support Or Conflict

When several paths exist, every line of evidence should select the same path. Conflicting evidence is a warning to investigate, not a tie to be broken arbitrarily.

Figure 4. Support can converge or conflict. A fill is easier to trust when reads, graph topology, and placement evidence point to the same path.

The conservative choice is to fill only when support for a single path is strong enough for the review goal. A benchmark run may apply all fillable paths to test behavior. A production curation run should usually fill only gaps whose rows have been explicitly accepted.
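The convergence rule can be sketched as a vote that succeeds only when every informative evidence source names the same path. The source names, the `min_sources` threshold, and the status strings below are assumptions for illustration; how each source arrives at its preferred path is outside the sketch.

```python
from typing import Optional


def select_path(votes: dict[str, Optional[str]], min_sources: int = 2):
    """Pick a path only if all informative evidence sources agree.

    `votes` maps an evidence source (for example "reads" or "topology")
    to the path it prefers, or None when that source is uninformative.
    Returns (path_or_None, status).
    """
    informative = {src: p for src, p in votes.items() if p is not None}
    choices = set(informative.values())
    if len(choices) > 1:
        return None, "conflict"      # sources disagree: do not fill
    if len(informative) < min_sources:
        return None, "insufficient"  # too little support to decide
    return choices.pop(), "converged"
```

Checking for conflict before counting sources means that disagreement is always reported as such, even when one path would win a simple majority.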

Possible Outcomes

Gap filling is not simply pass or fail.

Figure 5. Leaving Ns can be the correct output. The goal is not to fill every gap. The goal is to fill only gaps whose sequence is defensible.

This mindset is especially important in repeat-rich genomes. An honestly reported unresolved gap is better than a confidently placed wrong sequence.

Example Walkthrough

A scaffold has a 20 kb N gap between contig A and contig B. In the assembly graph, A and B connect through two possible paths. Long-read alignments to the graph traverse only path 1, and the path 1 sequence matches the suffix of A and the prefix of B. Path 2 has no read support and includes an unsequenced node.

The concept-level interpretation is:

There is a candidate sequence between the correct flanks.
One path is better supported than the alternative.
The sequence validates against the scaffold boundaries.
The fill can be accepted if the review goal is to replace Ns with graph sequence.

If either flank had failed validation, or if read support had been tied between the two paths, the better result would be to leave the Ns and report why.
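Once a fill has been accepted, applying it amounts to replacing the N run with the interior of the path sequence. The sketch below is a toy version under stated assumptions: coordinates are 0-based with an exclusive gap end, the path sequence carries `k` exact anchor bases on each side, and the function name and status strings are hypothetical. On any failure the scaffold is returned unchanged together with a reason.

```python
def apply_fill(scaffold: str, gap_start: int, gap_end: int,
               path_seq: str, k: int):
    """Replace scaffold[gap_start:gap_end] with the interior of path_seq.

    Returns (new_scaffold, status). The scaffold is returned untouched
    unless the interval is a pure N run and both anchors match exactly.
    """
    gap = scaffold[gap_start:gap_end].upper()
    if not gap or set(gap) != {"N"}:
        return scaffold, "interval_is_not_an_N_gap"
    if gap_start < k or gap_end + k > len(scaffold):
        return scaffold, "flank_too_short"

    left = scaffold[gap_start - k:gap_start].upper()
    right = scaffold[gap_end:gap_end + k].upper()
    candidate = path_seq.upper()
    if (len(candidate) < 2 * k
            or not candidate.startswith(left)
            or not candidate.endswith(right)):
        return scaffold, "flank_mismatch"

    # Drop the k anchor bases on each side; they already exist in the scaffold.
    interior = path_seq[k:len(path_seq) - k]
    return scaffold[:gap_start] + interior + scaffold[gap_end:], "filled"
```

Note that the filled sequence does not have to be the same length as the N run, because scaffold gap sizes are often estimates. Downstream coordinates therefore shift after a fill.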

Common Traps

Do not fill gaps from a graph that no longer matches the scaffold FASTA stage.

Do not treat the shortest path as the correct path. Repeats and bubbles can make the shortest graph path biologically wrong.

Do not override conflicting evidence just because a path can be reconstructed. Reconstruction asks “can sequence be built?” Review asks “should this sequence be placed here?”

Do not forget that unsequenced graph nodes cannot contribute bases. Topology without sequence can support review, but it cannot fill the FASTA interval.
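The last trap can be enforced when reconstructing a path sequence: if any node on the path has no bases, the reconstruction returns nothing. The sketch below assumes forward-oriented nodes that abut with zero overlap, which real graphs often violate, and treats a missing entry, an empty string, or `*` (the placeholder GFA uses when a segment's sequence is not stored) as unsequenced. The function name and data layout are illustrative.

```python
def path_sequence(path, node_seqs):
    """Concatenate node sequences along a path, or return None.

    `path` is a list of node names and `node_seqs` maps node name to
    sequence. Any unsequenced node makes the whole path unusable for
    filling, so None is returned instead of a partial sequence.
    """
    parts = []
    for node in path:
        seq = node_seqs.get(node)
        if not seq or seq == "*":
            return None  # topology only: cannot supply bases
        parts.append(seq)
    return "".join(parts)
```

Returning `None` instead of a partial sequence prevents a downstream step from writing an incomplete fill into the FASTA interval.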

Brief History And Further Reading

Gap filling began as an attempt to use additional reads or read pairs to close N gaps left by short-read assemblies. As long reads became common, tools such as PBJelly used long-read alignments to bridge gaps and improve draft genomes. With modern graph-based assemblies, the same conceptual question appears in a new form: which path through the graph, if any, belongs between these scaffold flanks?

The history is a shift from “find any sequence that spans the gap” toward “choose a supported sequence path and preserve uncertainty when the evidence is ambiguous.”