How To Decide When To Fix A Contig

Fixing a contig means changing sequence. That makes it one of the most useful and most dangerous parts of assembly curation. A suspicious plot is not enough. The better question is:

Is there strong evidence that one assembled sequence joins pieces that should not be joined in this assembly?

This lesson explains the reasoning before the ChromoSort command details. The goal is to teach when a split model is more believable than the current contig, why the evidence can support that conclusion, and why many odd-looking patterns should still be left alone.

The Core Idea

A contig is a candidate for fixing when its own coordinate order says “one continuous assembled molecule,” but the evidence says “two or more incompatible genomic neighborhoods.”

Figure 1. Fixing is a model choice. The split model is appropriate only when it explains the evidence better than the original joined contig.

The split model becomes convincing when the alignment blocks are long, internally coherent, and meet at a single, well-localized boundary. It becomes weak when the evidence consists of small repeat-like hits, stale alignments, low-confidence mappings, or patterns that are better explained as true structural variation.
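One way to make "long, coherent, and separated by a local boundary" concrete is a small filter over alignment blocks. The sketch below is illustrative only: the Block record, the thresholds, and the function name are assumptions made for this lesson, not ChromoSort's data model or defaults.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    """One alignment block in contig coordinates (hypothetical record layout)."""
    ref: str         # reference sequence the block aligns to
    start: int       # contig start, 0-based
    end: int         # contig end, exclusive
    identity: float  # fraction of matching bases, 0..1


def split_boundaries(blocks, min_len=1_000_000, min_identity=0.95, max_gap=50_000):
    """Return (left, right) pairs of strong adjacent blocks on different references.

    All thresholds are illustrative assumptions, not recommended defaults.
    """
    # Keep only long, high-identity blocks; short repeat-like hits are ignored.
    strong = sorted(
        (b for b in blocks
         if b.end - b.start >= min_len and b.identity >= min_identity),
        key=lambda b: b.start,
    )
    boundaries = []
    for left, right in zip(strong, strong[1:]):
        # "Local boundary": the two strong blocks nearly abut on the contig.
        is_local = abs(right.start - left.end) <= max_gap
        if left.ref != right.ref and is_local:
            boundaries.append((left, right))
    return boundaries
```

An empty result does not prove the contig is correct, and a non-empty result is only a candidate for review. This sketch also ignores same-reference jumps, which would need a comparison of reference coordinates as well.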

Observation, Interpretation, Action

Good review has three steps:

Observation: what pattern is visible?
Interpretation: what biological or technical situations could create it?
Action: should sequence be changed, reviewed manually, or left alone?

The same observation can lead to different actions. A contig with blocks on two references might be a misjoin, a real translocation relative to the reference, an unresolved repeat, contamination, or a stale alignment. Fixing is justified only after the competing explanations have been narrowed.

The Evidence Ladder

Do not jump from “odd plot” to “split contig.” Climb the evidence ladder.

Figure 2. The evidence ladder. The higher the consequence of the action, the more provenance and support the decision needs.

The first rung, provenance, matters most. An alignment describes only the FASTA it was generated from. If it was generated from raw.fa, it describes raw.fa. It does not validate fixed.fa, ordered.fa, a manual FASTA export, or a scaffold FASTA.
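A simple way to enforce this pairing is to record a checksum of the FASTA when the alignment is made and compare it before interpreting the alignment. The sketch below assumes such a recorded digest exists; that workflow and the function names are hypothetical, not a documented ChromoSort feature.

```python
import hashlib
from pathlib import Path


def file_sha256(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def evidence_matches_fasta(recorded_digest, fasta_path):
    """True only if the digest recorded at alignment time matches this FASTA."""
    return recorded_digest == file_sha256(fasta_path)
```

With this check, an alignment whose digest was recorded against raw.fa would fail against fixed.fa even if the two files share contig names, which is the case most likely to mislead a reviewer.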

Pattern Gallery

Strong Fix Candidate

A strong candidate has a small number of large blocks that disagree with the single-contig model.

Figure 3. Strong candidates have coherent blocks. A multi-reference split or a sharp same-reference jump deserves review when the blocks are long and the boundary is local.

Review stance: evaluate a split plan, inspect the boundary, and ask whether each emitted piece would still have enough support.

Usually Not A Fix

Some patterns are real and important, but still should not be cut.

Figure 4. Some discordance is not a split request. Whole-contig reverse orientation is usually an orientation choice. Internal inversions may be real biology or reference difference.

Review stance: use orientation, manual review, additional references, graph context, and read evidence before deciding whether any sequence edit is needed.

Repeat Noise Or Stale Evidence

Repeat-rich genomes can produce small off-target hits that look dramatic when compressed into a whole-genome plot. Stale alignments can make already-edited FASTA records look broken because names or coordinates no longer match.

Figure 5. False positives are common. Repeats, secondary hits, and stale FASTA/alignment pairings can make a harmless contig look suspicious.

Review stance: raise filters, inspect per-reference plots, confirm the exact FASTA pair, and avoid edits unless the pattern persists under better evidence.
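The effect of raising filters can be checked numerically before looking at any plot. The following is a minimal sketch, assuming hits are available as simple tuples; the tuple layout, thresholds, and function name are illustrative assumptions.

```python
from collections import defaultdict


def per_reference_bases(hits, min_len=0, min_mapq=0):
    """Sum aligned contig bases per reference after filtering.

    hits: iterable of (ref, contig_start, contig_end, mapq) tuples
    (a hypothetical minimal layout). Overlapping hits are counted twice.
    """
    totals = defaultdict(int)
    for ref, start, end, mapq in hits:
        length = end - start
        # Drop short or low-confidence hits, which are often repeat-driven.
        if length >= min_len and mapq >= min_mapq:
            totals[ref] += length
    return dict(totals)
```

Comparing the unfiltered and filtered summaries for one contig shows whether the off-target references survive stricter settings. If they disappear, the pattern was probably speckle rather than a second genomic neighborhood.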

A Practical Decision Table

If you see…	First interpretation	Conservative action
Strong blocks on different references	Possible misjoin, translocation, repeat, or contamination	Evaluate a reviewed split; inspect graph/read support.
Same reference, distant jump	Possible local misassembly or structural difference	Review boundary and compare with independent evidence.
Whole-contig reverse alignment	Orientation difference	Orient during sorting; do not split.
Blue-red-blue internal pattern	Possible inversion or reference difference	Review as inversion; do not automatically cut.
Many short off-target hits	Repeats, paralogs, or secondary alignments	Filter and inspect; do not cut from speckles.
Suspicious pattern after editing FASTA	Possible stale evidence	Re-align the exact edited FASTA before interpreting.
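The table can also be kept as a small lookup so that review notes use consistent wording. This is only a sketch of one possible encoding; the enum names and action strings are paraphrases chosen for illustration.

```python
from enum import Enum


class Pattern(Enum):
    MULTI_REFERENCE_BLOCKS = "strong blocks on different references"
    SAME_REFERENCE_JUMP = "same reference, distant jump"
    WHOLE_CONTIG_REVERSE = "whole-contig reverse alignment"
    INTERNAL_ORIENTATION_FLIP = "blue-red-blue internal pattern"
    SHORT_OFF_TARGET_HITS = "many short off-target hits"
    POST_EDIT_SUSPICION = "suspicious pattern after editing FASTA"


# Conservative first action per pattern, paraphrasing the table.
FIRST_ACTION = {
    Pattern.MULTI_REFERENCE_BLOCKS: "evaluate a reviewed split",
    Pattern.SAME_REFERENCE_JUMP: "review the boundary",
    Pattern.WHOLE_CONTIG_REVERSE: "orient during sorting",
    Pattern.INTERNAL_ORIENTATION_FLIP: "review as inversion",
    Pattern.SHORT_OFF_TARGET_HITS: "filter and inspect",
    Pattern.POST_EDIT_SUSPICION: "re-align the edited FASTA",
}


def first_action(pattern):
    """Return the conservative first action for an observed pattern."""
    return FIRST_ACTION[pattern]
```

Note that no entry maps directly to an automatic cut. Even the strongest pattern leads to a reviewed split plan rather than an edit.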

Example Walkthrough

Imagine a soybean contig with 11 Mb of high-identity alignment to chromosome 3, then a sharp transition, then 8 Mb of high-identity alignment to chromosome 11. Each block covers most of its own interval of the contig. The transition is not made of many tiny repeat hits.

The concept-level review is:

The current contig asserts a single joined molecule.
The evidence places the left and right intervals in incompatible reference neighborhoods.
A single split explains the pattern better than the joined model.
Each output piece would still have substantial support.
The fixed FASTA must be re-aligned before sorting or final validation.

That is a reasonable fix candidate. It is not yet an automatic edit. The reviewer still needs an explicit accepted plan and provenance for the evidence used.
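The claim that each output piece would still have substantial support can be checked with simple arithmetic. The sketch below uses hypothetical coordinates consistent with the walkthrough; the reference names, the exact contig length, the cut position, and the function name are all assumptions for illustration.

```python
def piece_support(blocks, piece_start, piece_end):
    """Return (best_ref, fraction) for one emitted piece.

    blocks: iterable of (ref, contig_start, contig_end) tuples, assumed
    non-overlapping on the contig. fraction is the share of the piece
    covered by blocks from its best-supported reference.
    """
    covered = {}
    for ref, start, end in blocks:
        overlap = min(end, piece_end) - max(start, piece_start)
        if overlap > 0:
            covered[ref] = covered.get(ref, 0) + overlap
    if not covered:
        return None, 0.0
    best = max(covered, key=covered.get)
    return best, covered[best] / (piece_end - piece_start)


# Hypothetical coordinates: 11 Mb on chromosome 3, a short transition,
# then 8 Mb on chromosome 11.
blocks = [("chr3", 0, 11_000_000), ("chr11", 11_040_000, 19_040_000)]
contig_length = 19_040_000
cut = 11_020_000  # assumed position inside the transition

left = piece_support(blocks, 0, cut)
right = piece_support(blocks, cut, contig_length)
```

Under these assumed numbers, both pieces are almost entirely covered by a single reference. A piece with a low fraction would be a reason to reconsider the cut position or the split itself.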

Common Traps

Do not split a contig just to make a plot prettier. A plot that looks more like the reference can reflect a worse assembly if the sample truly differs from the reference.

Do not treat all same-reference orientation changes as errors. Inversions and complex haplotype differences require more careful interpretation.

Do not apply a fix table after the FASTA has changed. Breakpoint coordinates belong to one particular source sequence and are meaningless against any other.

Do not let a single evidence stream overrule a clear contradiction from graph, read, or manual-review evidence.

Brief History And Further Reading

Early genome assembly quality work made an important point that still matters: contiguity is not the same as correctness. Large contigs can contain structural errors, and breaking or editing them should be justified by evidence rather than by N50-style metrics alone.

Reference-based tools such as QUAST helped standardize language around misassemblies, relocations, inversions, and translocations. Read-backed tools such as REAPR emphasized that mapped reads can reveal local assembly problems that reference comparison alone may miss. Modern assembly review usually combines both ideas: compare to references, but keep read, graph, and k-mer evidence nearby.