How To Decide When To Fix A Contig
Fixing a contig means changing sequence. That makes it one of the most useful and most dangerous parts of assembly curation. A suspicious plot is not enough. The better question is:
Is there strong evidence that one assembled sequence joins pieces that should not be joined in this assembly?
This lesson explains the reasoning before the ChromoSort command details. The goal is to teach when a split model is more believable than the current contig, why the evidence can support that conclusion, and why many odd-looking patterns should still be left alone.
The Core Idea
A contig is a candidate for fixing when its own coordinate order says “one continuous assembled molecule,” but the evidence says “two or more incompatible genomic neighborhoods.”
The split model becomes convincing when the alignment blocks are long, internally coherent, and separated by a sharp, local boundary. It becomes weak when the evidence consists of small repeat-like hits, stale alignments, low-confidence mappings, or patterns that are better explained as true structural variation.
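The contrast between long, coherent blocks and small, low-confidence hits can be sketched in a few lines of Python. This is a minimal illustration, not ChromoSort's method: the `Block` type, the function name, and every threshold below are assumptions chosen for the example.

```python
from collections import defaultdict
from typing import NamedTuple


class Block(NamedTuple):
    q_start: int  # start on the contig (0-based)
    q_end: int    # end on the contig (exclusive)
    ref: str      # reference sequence name
    mapq: int     # mapping quality reported by the aligner


def supported_references(blocks, min_len=100_000, min_mapq=30, min_total=1_000_000):
    """Return references backed by long, confident blocks on one contig.

    All thresholds are illustrative assumptions, not recommended values.
    """
    totals = defaultdict(int)
    for b in blocks:
        length = b.q_end - b.q_start
        if length >= min_len and b.mapq >= min_mapq:
            totals[b.ref] += length
    return {ref: total for ref, total in totals.items() if total >= min_total}


blocks = [
    Block(0, 5_000_000, "chrA", 60),
    Block(5_000_000, 9_000_000, "chrB", 60),
    Block(2_000_000, 2_020_000, "chrC", 5),  # short, low-confidence hit
]
refs = supported_references(blocks)
print(sorted(refs))  # ['chrA', 'chrB']
```

Two surviving references make the contig worth reviewing, nothing more. The short `chrC` hit drops out, which is the intended behavior for repeat-like evidence.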
Observation, Interpretation, Action
Good review has three steps:
- Observation: what pattern is visible?
- Interpretation: what biological or technical situations could create it?
- Action: should sequence be changed, reviewed manually, or left alone?
The same observation can lead to different actions. A contig with blocks on two reference sequences (for example, two chromosomes) might be a misjoin, a real translocation relative to the reference, an unresolved repeat, contamination, or a stale alignment. Fixing is justified only after the competing explanations have been narrowed.
The Evidence Ladder
Do not jump from “odd plot” to “split contig.” Climb the evidence ladder.
The first rung matters most. If an alignment was generated from raw.fa, it describes raw.fa. It does not validate fixed.fa, ordered.fa, a manual FASTA export, or a scaffold FASTA.
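A cheap check that an alignment file and a FASTA belong together is to compare query names and lengths. The sketch below assumes the alignment is in PAF (the tab-separated format written by minimap2), where column 1 is the query name and column 2 is the query length. The function names are hypothetical. Matching names and lengths is necessary but not sufficient: it catches split or renamed records, not same-length sequence edits.

```python
def fasta_lengths(fasta_text):
    """Map each record name (first word of the header) to its sequence length."""
    lengths = {}
    name = None
    for line in fasta_text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            parts = line[1:].split()
            name = parts[0] if parts else ""
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


def stale_queries(paf_text, lengths):
    """Return PAF query names whose name or length disagrees with the FASTA."""
    stale = set()
    for line in paf_text.splitlines():
        if not line.strip():
            continue
        fields = line.split("\t")
        q_name, q_len = fields[0], int(fields[1])
        if lengths.get(q_name) != q_len:
            stale.add(q_name)
    return stale


# Toy data: the PAF was produced from raw_fa, then ctg1 was split in fixed_fa.
raw_fa = ">ctg1\nACGTACGT\n>ctg2\nACGT\n"
fixed_fa = ">ctg1_a\nACGT\n>ctg1_b\nACGT\n>ctg2\nACGT\n"
paf = (
    "ctg1\t8\t0\t8\t+\tchr1\t100\t10\t18\t8\t8\t60\n"
    "ctg2\t4\t0\t4\t+\tchr2\t100\t0\t4\t4\t4\t60\n"
)

print(stale_queries(paf, fasta_lengths(raw_fa)))    # set()
print(stale_queries(paf, fasta_lengths(fixed_fa)))  # {'ctg1'}
```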
Pattern Gallery
Strong Fix Candidate
A strong candidate has a small number of large blocks that disagree with the single-contig model.
Review stance: evaluate a split plan, inspect the boundary, and ask whether each emitted piece would still have enough support.
Usually Not A Fix
Some patterns are real and important, but still should not be cut.
Review stance: use orientation, manual review, additional references, graph context, and read evidence before deciding whether any sequence edit is needed.
Repeat Noise Or Stale Evidence
Repeat-rich genomes can produce small off-target hits that look dramatic when compressed into a whole-genome plot. Stale alignments can make already-edited FASTA records look broken because names or coordinates no longer match.
Review stance: raise filters, inspect per-reference plots, confirm the exact FASTA pair, and avoid edits unless the pattern persists under better evidence.
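To see how little of a contig the speckles actually cover, it can help to measure off-target coverage before and after raising a length filter. The following is a rough sketch under assumed inputs: hits are `(contig_start, contig_end, reference)` tuples, and the function names and the 10 kb filter are illustrative only.

```python
def covered_bases(intervals):
    """Total bases covered by (start, end) intervals, counting overlaps once."""
    total = 0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end:
            if cur_end is not None:
                total += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start
    return total


def off_target_fraction(hits, contig_len, main_ref, min_len=0):
    """Fraction of the contig covered by hits to references other than main_ref."""
    spans = [(s, e) for s, e, ref in hits if ref != main_ref and e - s >= min_len]
    return covered_bases(spans) / contig_len


hits = [
    (0, 1_000_000, "chr1"),      # main alignment
    (100_000, 102_000, "chr5"),  # short off-target hits
    (101_000, 103_000, "chr7"),
    (500_000, 501_000, "chr9"),
]
print(off_target_fraction(hits, 1_000_000, "chr1"))                  # 0.004
print(off_target_fraction(hits, 1_000_000, "chr1", min_len=10_000))  # 0.0
```

Three off-target references look alarming in a compressed plot, yet here they cover 0.4% of the contig and vanish under a modest length filter.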
A Practical Decision Table
| If you see… | First interpretation | Conservative action |
|---|---|---|
| Strong blocks on different references | Possible misjoin, translocation, repeat, or contamination | Evaluate a reviewed split; inspect graph/read support. |
| Same reference, distant jump | Possible local misassembly or structural difference | Review boundary and compare with independent evidence. |
| Whole-contig reverse alignment | Orientation difference | Orient during sorting; do not split. |
| Blue-red-blue internal pattern | Possible inversion or reference difference | Review as inversion; do not automatically cut. |
| Many short off-target hits | Repeats, paralogs, or secondary alignments | Filter and inspect; do not cut from speckles. |
| Suspicious pattern after editing FASTA | Possible stale evidence | Re-align the exact edited FASTA before interpreting. |
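The two orientation rows in the table can be told apart by collapsing same-reference blocks into strand runs along the contig. The sketch below assumes blocks are `(contig_start, contig_end, strand)` tuples that all hit one reference. The function names and the returned labels are hypothetical, and the result is a prompt for review, not a decision.

```python
from itertools import groupby


def strand_runs(blocks):
    """Collapse blocks, sorted by contig position, into consecutive strand runs."""
    ordered = sorted(blocks)
    return [strand for strand, _ in groupby(b[2] for b in ordered)]


def orientation_call(blocks):
    """Suggest a conservative review stance from the strand pattern alone."""
    runs = strand_runs(blocks)
    if runs == ["+"]:
        return "forward: no action"
    if runs == ["-"]:
        return "whole-contig reverse: orient during sorting, do not split"
    if len(runs) == 3 and runs[0] == runs[2]:
        return "internal strand flip: review as possible inversion, do not auto-cut"
    return "complex: manual review"


print(orientation_call([(0, 100, "-"), (100, 200, "-")]))
print(orientation_call([(0, 100, "+"), (100, 150, "-"), (150, 300, "+")]))
```

Neither outcome calls for a cut. The first is handled by orientation, and the second needs inversion review with independent evidence.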
Example Walkthrough
Imagine a soybean contig with 11 Mb of high-identity alignment to chromosome 3, then a sharp transition, then 8 Mb of high-identity alignment to chromosome 11. Each block covers most of its own contig interval. The transition is not made of many tiny repeat hits.
The concept-level review is:
- The current contig asserts a single joined molecule.
- The evidence places the left and right intervals in incompatible reference neighborhoods.
- A single split explains the pattern better than the joined model.
- Each output piece would still have substantial support.
- The fixed FASTA must be re-aligned before sorting or final validation.
That is a reasonable fix candidate. It is not yet an automatic edit. The reviewer still needs an explicit accepted plan and provenance for the evidence used.
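The review above can be written down as a small check. This is a sketch under stated assumptions, not the ChromoSort procedure: the coordinates are idealized from the example (a 19 Mb contig with no gap at the transition), blocks are assumed not to overlap within a side, and the function name and the 0.8 support threshold are hypothetical.

```python
def propose_single_split(blocks, contig_len, min_support=0.8):
    """Propose one cut if the blocks switch reference exactly once.

    blocks: (contig_start, contig_end, reference) tuples.
    Returns None when the case is not a simple, well-supported single split.
    """
    blocks = sorted(blocks)
    switches = [i for i in range(1, len(blocks)) if blocks[i][2] != blocks[i - 1][2]]
    if len(switches) != 1:
        return None  # zero or several switches: not a one-cut case
    i = switches[0]
    window = (blocks[i - 1][1], blocks[i][0])  # where the reference changes
    if window[0] > window[1]:
        return None  # blocks overlap at the boundary: needs manual review
    cut = (window[0] + window[1]) // 2
    if not 0 < cut < contig_len:
        return None
    left = sum(e - s for s, e, _ in blocks[:i]) / cut
    right = sum(e - s for s, e, _ in blocks[i:]) / (contig_len - cut)
    if min(left, right) < min_support:
        return None  # one emitted piece would be poorly supported
    return {"window": window, "left_support": left, "right_support": right}


blocks = [
    (0, 11_000_000, "chr3"),
    (11_000_000, 19_000_000, "chr11"),
]
plan = propose_single_split(blocks, contig_len=19_000_000)
print(plan)
```

A non-`None` result is only a proposal: it records where the boundary should be inspected and how well each piece would be supported.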
Common Traps
Do not split a contig just to make a plot prettier. A plot that matches the reference more neatly can describe a worse assembly if the sample truly differs from the reference.
Do not treat all same-reference orientation changes as errors. Inversions and complex haplotype differences require more careful interpretation.
Do not apply a fix table after the FASTA changed. Breakpoint coordinates belong to a particular source sequence.
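One way to make that binding explicit is to store a checksum of the source sequence next to each breakpoint and refuse to apply the edit when it no longer matches. The sketch below uses Python's standard `hashlib`. The plan dictionary and its keys are hypothetical and do not represent any tool's fix-table format.

```python
import hashlib


def sequence_digest(seq):
    """SHA-256 of the uppercased sequence, used as a source fingerprint."""
    return hashlib.sha256(seq.upper().encode("ascii")).hexdigest()


def plan_matches_source(plan, sequences):
    """True if the planned contig exists and still has the recorded digest."""
    seq = sequences.get(plan["contig"])
    return seq is not None and sequence_digest(seq) == plan["source_sha256"]


raw = {"ctg1": "ACGTACGTAC"}
plan = {
    "contig": "ctg1",
    "breakpoint": 5,
    "source_sha256": sequence_digest(raw["ctg1"]),
}

edited = {"ctg1": "ACGTAC"}  # the record changed after the plan was written

print(plan_matches_source(plan, raw))     # True
print(plan_matches_source(plan, edited))  # False
```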
Do not let a single evidence stream overrule a clear contradiction from graph, read, or manual review evidence.
Brief History And Further Reading
Early genome assembly quality work made an important point that still matters: contiguity is not the same as correctness. Large contigs can contain structural errors, and breaking or editing them should be justified by evidence rather than by N50-style metrics alone.
Reference-based tools such as QUAST helped standardize language around misassemblies, relocations, inversions, and translocations. Read-backed tools such as REAPR emphasized that mapped reads can reveal local assembly problems that reference comparison alone may miss. Modern assembly review usually combines both ideas: compare to references, but keep read, graph, and k-mer evidence nearby.
Further reading:
- Gurevich et al. 2013. QUAST: quality assessment tool for genome assemblies.
- Hunt et al. 2013. REAPR: a universal tool for genome assembly evaluation.
- Li 2018. Minimap2: pairwise alignment for nucleotide sequences.
- Rhie et al. 2020. Merqury: reference-free quality, completeness, and phasing assessment for genome assemblies.