How Gap Filling Works
Gap filling tries to replace unknown scaffold sequence with actual bases. That makes it more powerful than scaffolding, and riskier. A gap should be filled only when the candidate sequence is connected to the correct flanks, supported by evidence, and not rivaled by an equally plausible alternative.
The key question is:
Do we know the sequence that belongs between these two scaffold flanks?
The Core Idea
Scaffolding says “these contigs are adjacent, but the sequence between them is unknown.” Gap filling says “this candidate path supplies the missing sequence.”
The difference is simple but important: scaffolding can stay honest by writing Ns, whereas gap filling must commit to specific bases.
Why Gap Filling Is Harder Than Scaffolding
Scaffolding can be useful even when the gap sequence is unknown. Gap filling must answer more questions:
- Do the scaffold flanks map to graph nodes or path ends?
- Is there a path between those flanks?
- Is there exactly one best path, or several plausible paths?
- Does the graph path have sequence?
- Do the path ends match the scaffold FASTA flanks?
- Do reads, graph topology, Hi-C contacts, or reference placement support the same path?
If those questions do not converge, the honest result is usually to leave Ns.
Path Statuses
Most gap-filling decisions reduce to a small set of patterns.
Ambiguity is not failure. In many genomes, especially repeat-rich plant genomes, refusing to fill an ambiguous gap is the correct scientific choice.
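As a minimal sketch of how a gap might be sorted into such patterns, the snippet below enumerates simple paths between two flank nodes in a toy adjacency-list graph and labels the result. The function names and the status labels (`no_path`, `unique_path`, `multiple_paths`) are hypothetical and chosen for illustration; they are not taken from any particular tool.

```python
from typing import Dict, List


def simple_paths(graph: Dict[str, List[str]], start: str, end: str,
                 max_nodes: int = 50) -> List[List[str]]:
    """Enumerate paths from start to end that never revisit a node (iterative DFS)."""
    paths: List[List[str]] = []
    stack: List[List[str]] = [[start]]
    while stack:
        path = stack.pop()
        node = path[-1]
        if node == end:
            paths.append(path)
            continue
        if len(path) >= max_nodes:
            # Give up on very long walks; repeat-rich graphs can explode.
            continue
        for nxt in graph.get(node, []):
            if nxt not in path:
                stack.append(path + [nxt])
    return paths


def path_status(paths: List[List[str]]) -> str:
    """Hypothetical status labels for illustration only."""
    if not paths:
        return "no_path"
    if len(paths) == 1:
        return "unique_path"
    return "multiple_paths"


# Toy bubble: flank A reaches flank B through either n1 or n2.
toy_graph = {"A": ["n1", "n2"], "n1": ["B"], "n2": ["B"]}
print(path_status(simple_paths(toy_graph, "A", "B")))
```

A `multiple_paths` result is not an error here. It simply means the decision has to wait for evidence that can distinguish the candidates.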
Flank Validation
A graph path is not enough. The path must also fit the scaffold sequence at both ends.
If flank validation fails, stop. It usually means names, coordinates, sequence versions, or graph stages drifted.
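One way to picture flank validation is an exact-match check at both ends of the candidate path. The sketch below assumes the path sequence is reconstructed so that it begins with the last `k` bases of the left flank and ends with the first `k` bases of the right flank. The function name, the anchor length `k`, and the exact-match rule are illustrative assumptions; a real workflow may tolerate mismatches and must also handle strand orientation, which this sketch ignores.

```python
def validate_flanks(left_flank: str, right_flank: str, path_seq: str,
                    k: int = 20) -> dict:
    """Check that path_seq starts with the left anchor and ends with the right anchor."""
    if k <= 0:
        raise ValueError("k must be positive")
    left_anchor = left_flank[-k:].upper()   # last k bases before the gap
    right_anchor = right_flank[:k].upper()  # first k bases after the gap
    seq = path_seq.upper()
    # A flank shorter than k cannot provide a full anchor, so it fails.
    left_ok = len(left_anchor) == k and seq.startswith(left_anchor)
    right_ok = len(right_anchor) == k and seq.endswith(right_anchor)
    # The two anchors must not overlap inside the path sequence.
    long_enough = len(seq) >= 2 * k
    return {
        "left_ok": left_ok,
        "right_ok": right_ok,
        "valid": left_ok and right_ok and long_enough,
    }


# Toy example with a short anchor so the strings stay readable.
left = "ACGTACGTAA"
right = "TTGGCCAATT"
print(validate_flanks(left, right, "GTAA" + "CCCC" + "TTGG", k=4))
```

If `valid` is false, the sketch gives no fill decision at all, which mirrors the "stop" rule: the mismatch should be diagnosed before any path is considered further.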
Evidence Can Support Or Conflict
When several paths exist, every line of evidence should select the same path. Conflicting evidence is a warning to investigate, not a tie to break arbitrarily.
The conservative choice is to fill only when support is unique enough for the review goal. A benchmark run may apply all fillable paths to test behavior. A production curation run should usually require explicit accepted rows.
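The agreement rule can be sketched as a small function that takes one "vote" per evidence source and fills only when all informative sources point to the same path. The source names, the `min_sources` threshold, and the returned labels are hypothetical; they stand in for whatever a given review goal requires.

```python
from typing import Dict, Optional, Tuple


def select_path(votes: Dict[str, Optional[str]],
                min_sources: int = 2) -> Tuple[str, Optional[str], str]:
    """Return (action, path, reason). A vote of None means the source is uninformative."""
    informative = {src: p for src, p in votes.items() if p is not None}
    chosen = set(informative.values())
    if not chosen:
        return ("leave_gap", None, "no evidence supports any path")
    if len(chosen) > 1:
        # Conflict is reported, never resolved by picking a favorite source.
        return ("leave_gap", None, "evidence sources conflict")
    if len(informative) < min_sources:
        return ("leave_gap", None, "too few independent sources")
    return ("fill", next(iter(chosen)), "all informative sources agree")


print(select_path({"reads": "path1", "hic": "path1", "reference": None}))
print(select_path({"reads": "path1", "hic": "path2"}))
```

Under these assumptions, a permissive benchmark run might lower `min_sources` to 1, while a curation run might raise it or add a manual acceptance step on top.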
Possible Outcomes
Gap filling is not simply pass or fail.
This mindset is especially important in repeat-rich genomes. A gap confidently left unresolved is better than a confidently wrong sequence.
Example Walkthrough
A scaffold has a 20 kb N gap between contig A and contig B. In the assembly graph, A and B connect through two possible paths. Long-read alignments to the graph traverse only path 1, and the path 1 sequence matches the suffix of A and the prefix of B. Path 2 has no read support and includes an unsequenced node.
The concept-level interpretation is:
- There is a candidate sequence between the correct flanks.
- One path is better supported than the alternative.
- The sequence validates against the scaffold boundaries.
- The fill can be accepted if the review goal is to replace Ns with graph sequence.
If either flank failed validation, or if read support tied between paths, the better result would be to leave the Ns and report why.
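The walkthrough can be expressed as a toy decision function. Everything here is illustrative: the `Candidate` fields, the read counts, and the action labels are invented for the example, and a real review would weigh more evidence than a single read count.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Candidate:
    name: str
    read_support: int    # reads traversing the whole path (made-up numbers)
    has_sequence: bool   # False if any node on the path is unsequenced
    flanks_valid: bool   # path ends match the suffix of A and the prefix of B


def decide(candidates: List[Candidate]) -> Tuple[str, Optional[str], str]:
    """Return (action, path name, reason) for one gap."""
    ranked = sorted(candidates, key=lambda c: c.read_support, reverse=True)
    if not ranked or ranked[0].read_support == 0:
        return ("leave_Ns", None, "no read support")
    if len(ranked) > 1 and ranked[1].read_support == ranked[0].read_support:
        return ("leave_Ns", None, "read support tied between paths")
    best = ranked[0]
    if not best.has_sequence:
        return ("leave_Ns", None, "best path contains an unsequenced node")
    if not best.flanks_valid:
        return ("leave_Ns", None, "flank validation failed")
    return ("fill", best.name, "uniquely supported and validated")


path1 = Candidate("path1", read_support=12, has_sequence=True, flanks_valid=True)
path2 = Candidate("path2", read_support=0, has_sequence=False, flanks_valid=False)
print(decide([path1, path2]))
```

Changing `path1.flanks_valid` to `False`, or giving `path2` the same read support as `path1`, flips the result to `leave_Ns` with a reason attached, which is the reporting behavior the walkthrough describes.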
Common Traps
Do not fill gaps from a graph that no longer matches the scaffold FASTA stage.
Do not treat the shortest path as the correct path. Repeats and bubbles can make the shortest graph path biologically wrong.
Do not override conflicting evidence just because a path can be reconstructed. Reconstruction asks “can sequence be built?” Review asks “should this sequence be placed here?”
Do not forget that unsequenced graph nodes cannot contribute bases. Topology without sequence can support review, but it cannot fill the FASTA interval.
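The last trap can be made concrete with a sketch of sequence reconstruction that refuses to produce bases when any node on the path lacks sequence. It assumes node sequences are held in a dictionary, that a missing sequence is stored as `None` or as `*` (the placeholder GFA uses for an absent segment sequence), and that nodes are simply concatenated in forward orientation. Real graphs often have overlapping links and reverse-strand traversals, which this simplified example does not handle.

```python
from typing import Dict, List, Optional


def path_sequence(path: List[str],
                  node_seqs: Dict[str, Optional[str]]) -> Optional[str]:
    """Concatenate node sequences along a path, or return None if any node has no bases."""
    parts: List[str] = []
    for node in path:
        seq = node_seqs.get(node)
        if seq is None or seq == "*":
            # Topology exists, but there are no bases to write into the FASTA.
            return None
        parts.append(seq)
    return "".join(parts)


toy_seqs = {"A": "ACGT", "n1": "GG", "n2": "*", "B": "TTAA"}
print(path_sequence(["A", "n1", "B"], toy_seqs))
print(path_sequence(["A", "n2", "B"], toy_seqs))
```

Returning `None` rather than a partial string keeps an unsequenced path from being mistaken for a shorter fill.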
Brief History And Further Reading
Gap filling began as an attempt to use additional reads or read pairs to close N gaps left by short-read assemblies. As long reads became common, tools such as PBJelly used long-read alignments to bridge gaps and improve draft genomes. With modern graph-based assemblies, the same conceptual question appears in a new form: which path through the graph, if any, belongs between these scaffold flanks?
The history is a shift from “find any sequence that spans the gap” toward “choose a supported sequence path and preserve uncertainty when the evidence is ambiguous.”
Further reading:
- Boetzer and Pirovano 2012. Toward almost closed genomes with GapFiller.
- English et al. 2012. Mind the gap: upgrading genomes with Pacific Biosciences RS long-read sequencing technology.
- Walker et al. 2014. Pilon: an integrated tool for comprehensive microbial variant detection and genome assembly improvement.
- Kolmogorov et al. 2019. Assembly of long, error-prone reads using repeat graphs.
- GFA working group. Graphical Fragment Assembly format specification.