How To Decide When To Fix A Contig

Fixing a contig means changing sequence. That makes it one of the most useful and most dangerous parts of assembly curation. A suspicious plot is not enough. The better question is:

Is there strong evidence that one assembled sequence joins pieces that should not be joined in this assembly?

This lesson explains the reasoning before the ChromoSort command details. The goal is to teach when a split model is more believable than the current contig, why the evidence can support that conclusion, and why many odd-looking patterns should still be left alone.

The Core Idea

A contig is a candidate for fixing when its own coordinate order says “one continuous assembled molecule,” but the evidence says “two or more incompatible genomic neighborhoods.”

[Figure: joined contig model compared with split contig model. In the current assembly model, one contig contains a left interval aligned to the Chr03 neighborhood and a right interval aligned to the Chr11 neighborhood, with a candidate breakpoint between them. In the reviewed fix model, the split emits one piece on Chr03 and one piece on Chr11.]
Figure 1. Fixing is a model choice. The split model is appropriate only when it explains the evidence better than the original joined contig.

The split model becomes convincing when the alignment blocks are long, coherent, and separated by a local boundary. It becomes weak when the evidence is made of small repeat-like hits, stale alignments, low-confidence mappings, or patterns that are better explained as true structural variation.

Observation, Interpretation, Action

Good review has three steps:

  1. Observation: what pattern is visible?
  2. Interpretation: what biological or technical situations could create it?
  3. Action: should sequence be changed, reviewed manually, or left alone?

The same observation can lead to different actions. A contig with blocks on two references might be a misjoin, a real translocation relative to the reference, an unresolved repeat, contamination, or a stale alignment. Fixing is justified only after the competing explanations have been narrowed.

The Evidence Ladder

Do not jump from “odd plot” to “split contig.” Climb the evidence ladder.

[Figure: evidence ladder for fixing contigs. Evidence should accumulate before sequence changes; each step removes a common false-positive explanation. The rungs, from lowest to highest:]

  1. Confirm the alignment matches this exact FASTA.
  2. Confirm long coherent blocks in incompatible neighborhoods.
  3. Locate a plausible local boundary rather than a smear of tiny hits.
  4. Check independent support: graph, long reads, read pairs, or another reference.
  5. Accept an explicit reviewed action: split, manual review, or leave unchanged.
  6. Re-align the edited FASTA before validation or downstream sorting.

Figure 2. The evidence ladder. The higher the consequence of the action, the more provenance and support the decision needs.

The first rung matters most. If an alignment was generated from raw.fa, it describes raw.fa. It does not validate fixed.fa, ordered.fa, a manual FASTA export, or a scaffold FASTA.
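The first rung can be partly automated. The sketch below is a minimal illustration, not a ChromoSort feature: it compares the query names and lengths recorded in a PAF file (columns 1 and 2) with the records in a FASTA file. The function names are hypothetical. Matching names and lengths is a necessary check, not a sufficient one, because two different sequences can share a name and a length.

```python
def fasta_lengths(fasta_text):
    """Return {record_name: sequence_length} for FASTA text."""
    lengths = {}
    name = None
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first word of the header, as aligners usually do.
            name = line[1:].split()[0]
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


def paf_query_lengths(paf_text):
    """Return {query_name: query_length} from PAF columns 1 and 2."""
    lengths = {}
    for line in paf_text.splitlines():
        if not line.strip():
            continue
        fields = line.split("\t")
        lengths[fields[0]] = int(fields[1])
    return lengths


def stale_pairing_problems(fasta_text, paf_text):
    """List reasons the alignment cannot describe this FASTA (empty if none found)."""
    fasta = fasta_lengths(fasta_text)
    problems = []
    for name, qlen in sorted(paf_query_lengths(paf_text).items()):
        if name not in fasta:
            problems.append(f"{name}: in alignment but not in FASTA")
        elif fasta[name] != qlen:
            problems.append(
                f"{name}: alignment says {qlen} bp, FASTA has {fasta[name]} bp"
            )
    return problems
```

Any reported problem means the pairing should not be interpreted. An empty list only means this particular check found nothing wrong.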

Strong Fix Candidate

A strong candidate has a small number of large blocks that disagree with the single-contig model.

[Figure: patterns that deserve split review, shown as two cartoon dot plots of reference position against position on the same contig. Panel A, multi-reference blocks: one contig with blocks on Chr03 and Chr11. Panel B, same-reference jump: a sharp transition within one reference.]
Figure 3. Strong candidates have coherent blocks. A multi-reference split or a sharp same-reference jump deserves review when the blocks are long and the boundary is local.

Review stance: evaluate a split plan, inspect the boundary, and ask whether each emitted piece would still have enough support.
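One way to make "a small number of large blocks" concrete is to drop short hits, merge consecutive hits to the same reference along the contig, and list the gaps where the reference changes. The sketch below assumes hits are already reduced to contig coordinates and a reference name. The `Hit` type, the function names, and the 50 kb threshold are illustrative assumptions, not recommended values.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    q_start: int  # start on the contig
    q_end: int    # end on the contig
    ref: str      # reference sequence name


def coherent_blocks(hits, min_hit_len=50_000):
    """Merge consecutive long hits to the same reference, in contig order."""
    long_hits = sorted(h for h in hits if h.q_end - h.q_start >= min_hit_len)
    blocks = []
    for hit in long_hits:
        if blocks and blocks[-1].ref == hit.ref:
            last = blocks[-1]
            blocks[-1] = Hit(last.q_start, max(last.q_end, hit.q_end), hit.ref)
        else:
            blocks.append(hit)
    return blocks


def candidate_boundaries(blocks):
    """Return (start, end, left_ref, right_ref) for each change of reference."""
    boundaries = []
    for left, right in zip(blocks, blocks[1:]):
        start = min(left.q_end, right.q_start)
        end = max(left.q_end, right.q_start)
        boundaries.append((start, end, left.ref, right.ref))
    return boundaries
```

A short boundary interval between two long blocks is the pattern that deserves review. Many boundaries, or blocks that only barely pass the length filter, point toward repeats or noise rather than a single misjoin.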

Usually Not A Fix

Some patterns are real and important, but still should not be cut.

[Figure: odd is not the same as broken, shown as two cartoon dot plots. Panel A, whole-contig reverse alignment: usually orient, do not split. Panel B, internal inversion pattern: review biology before changing sequence.]
Figure 4. Some discordance is not a split request. Whole-contig reverse orientation is usually an orientation choice. Internal inversions may be real biology or reference difference.

Review stance: use orientation, manual review, additional references, graph context, and read evidence before deciding whether any sequence edit is needed.
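The difference between the two panels can be expressed as a strand fraction. The sketch below is a rough triage only, assuming all hits are on one reference and are given as contig intervals with a strand. The function name and the 0.9 cutoff are assumptions for illustration. A "mixed" result is a prompt for review, not a diagnosis.

```python
def orientation_call(hits, whole_fraction=0.9):
    """Triage strand evidence for one contig against one reference.

    hits: iterable of (q_start, q_end, strand), with strand "+" or "-".
    """
    hits = list(hits)
    total = sum(end - start for start, end, _ in hits)
    if total == 0:
        return "no evidence"
    reverse = sum(end - start for start, end, strand in hits if strand == "-")
    forward = total - reverse
    if reverse / total >= whole_fraction:
        return "orient whole contig; do not split"
    if forward / total >= whole_fraction:
        return "forward; no orientation change"
    return "mixed strands; review as possible inversion"
```

This counts aligned bases only. It does not check whether the reverse-strand bases form one contiguous internal segment, which is what a reviewer would look for in the plot.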

Repeat Noise Or Stale Evidence

Repeat-rich genomes can produce small off-target hits that look dramatic when compressed into a whole-genome plot. Stale alignments can make already-edited FASTA records look broken because names or coordinates no longer match.

[Figure: common false positives. A strong main alignment is surrounded by small speckled hits; repeat speckles should not drive a cut. Pairing raw.fa with raw.paf is valid evidence for raw contigs. Pairing fixed.fa with raw.paf is a stale pairing: do not interpret.]
Figure 5. False positives are common. Repeats, secondary hits, and stale FASTA/alignment pairings can make a harmless contig look suspicious.

Review stance: raise filters, inspect per-reference plots, confirm the exact FASTA pair, and avoid edits unless the pattern persists under better evidence.
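Raising filters can be as simple as dropping short, low-confidence, or secondary hits before plotting. The sketch below reads standard PAF columns (alignment block length in column 11, mapping quality in column 12) and the `tp:A:S` tag that minimap2 writes for secondary alignments. The function name and both thresholds are assumptions chosen for illustration. Suitable values depend on the genome and the aligner settings.

```python
def keep_hit(paf_line, min_block_len=50_000, min_mapq=30):
    """Return True if a PAF record passes simple length, MAPQ, and primary filters."""
    fields = paf_line.rstrip("\n").split("\t")
    block_len = int(fields[10])  # PAF column 11: alignment block length
    mapq = int(fields[11])       # PAF column 12: mapping quality
    is_secondary = "tp:A:S" in fields[12:]  # optional tags start at column 13
    return block_len >= min_block_len and mapq >= min_mapq and not is_secondary
```

If a suspicious pattern disappears under modest filtering, it was probably repeat speckle. If it survives, it moves up the evidence ladder but still needs a boundary and independent support.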

A Practical Decision Table

If you see… | First interpretation | Conservative action
Strong blocks on different references | Possible misjoin, translocation, repeat, or contamination | Evaluate a reviewed split; inspect graph/read support.
Same reference, distant jump | Possible local misassembly or structural difference | Review boundary and compare with independent evidence.
Whole-contig reverse alignment | Orientation difference | Orient during sorting; do not split.
Blue-red-blue internal pattern | Possible inversion or reference difference | Review as inversion; do not automatically cut.
Many short off-target hits | Repeats, paralogs, or secondary alignments | Filter and inspect; do not cut from speckles.
Suspicious pattern after editing FASTA | Possible stale evidence | Re-align the exact edited FASTA before interpreting.

Example Walkthrough

Imagine a soybean contig with 11 Mb of high-identity alignment to chromosome 3, then a sharp transition, then 8 Mb of high-identity alignment to chromosome 11. Each block covers most of its local contig span. The transition is not made of many tiny repeat hits.

The concept-level review is:

  1. The current contig asserts a single joined molecule.
  2. The evidence places the left and right intervals in incompatible reference neighborhoods.
  3. A single split explains the pattern better than the joined model.
  4. Each output piece would still have substantial support.
  5. The fixed FASTA must be re-aligned before sorting or final validation.

That is a reasonable fix candidate. It is not yet an automatic edit. The reviewer still needs an explicit accepted plan and provenance for the evidence used.
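The requirement for an explicit accepted plan with provenance can be sketched as a small record that ties a breakpoint to the exact sequence it was derived from. Everything named here (`SplitPlan`, `apply_split`, the `_1` and `_2` piece suffixes, the use of SHA-256) is a hypothetical illustration and not ChromoSort's plan format.

```python
import hashlib
from dataclasses import dataclass


def seq_digest(seq):
    """Digest of the sequence content, ignoring letter case."""
    return hashlib.sha256(seq.upper().encode("ascii")).hexdigest()


@dataclass(frozen=True)
class SplitPlan:
    contig: str
    breakpoint: int         # 0-based; the left piece is seq[:breakpoint]
    source_digest: str      # digest of the sequence the evidence describes
    accepted: bool = False  # set only after explicit review


def apply_split(plan, name, seq):
    """Split seq at the planned breakpoint, refusing unreviewed or stale plans."""
    if not plan.accepted:
        raise ValueError("split plan has not been accepted by a reviewer")
    if name != plan.contig or seq_digest(seq) != plan.source_digest:
        raise ValueError("plan was made for a different source sequence")
    if not 0 < plan.breakpoint < len(seq):
        raise ValueError("breakpoint must fall strictly inside the contig")
    return {
        f"{name}_1": seq[:plan.breakpoint],
        f"{name}_2": seq[plan.breakpoint:],
    }
```

The digest check is what stops a breakpoint computed on one FASTA stage from being applied to another. The emitted pieces are new sequences, so they need fresh alignments before sorting or validation.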

Common Traps

Do not split a contig just to make a plot prettier. A prettier reference-normal plot can be a worse assembly if the sample truly differs from the reference.

Do not treat all same-reference orientation changes as errors. Inversions and complex haplotype differences require more careful interpretation.

Do not apply a fix table after the FASTA changed. Breakpoint coordinates belong to a particular source sequence.

Do not let a single evidence stream override clear contradictory evidence from the assembly graph, mapped reads, or manual review.

Brief History And Further Reading

Early genome assembly quality work made an important point that still matters: contiguity is not the same as correctness. Large contigs can contain structural errors, and breaking or editing them should be justified by evidence rather than by N50-style metrics alone.

Reference-based tools such as QUAST helped standardize language around misassemblies, relocations, inversions, and translocations. Read-backed tools such as REAPR emphasized that mapped reads can reveal local assembly problems that reference comparison alone may miss. Modern assembly review usually combines both ideas: compare to references, but keep read, graph, and k-mer evidence nearby.

Further reading: