chromo clean
Use chromo clean when an assembly is already mostly correct but needs
reference-guided cleanup: removing unaligned or redundant fragments, splitting a
small number of strong misjoins, optionally orienting records to the reference,
and writing one cleaned FASTA in reference order.
chromo clean makes all decisions from the raw assembly and the raw alignment
file you provide. It does not run MUMmer or minimap2. After it writes
<prefix>.clean.fa, re-run MUMmer or minimap2 on that cleaned FASTA for final
validation plots.
What chromo clean Does
Given a reference FASTA, raw assembly FASTA, and MUMmer coords or minimap2 PAF,
chromo clean:
- Runs the same assignment and duplicate-overlap filtering logic as chromo sort on the raw assembly.
- Marks raw contigs as retained or discarded.
- Selects retained raw contigs for fix planning. The default is all retained contigs.
- Runs the same conservative breakpoint planner as chromo fix on selected retained raw contigs.
- Replaces accepted split contigs with reference-labeled pieces.
- Keeps retained unsplit contigs unchanged.
- Optionally reverse-complements emitted records with --orient-to-reference.
- Orders emitted records by reference FASTA order and reference placement.
- Writes a cleaned FASTA plus initial sort, fix, clean, and run-summary reports.
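The optional orientation step in the list above amounts to reverse-complementing any emitted record whose dominant alignment is on the reverse strand. The following is a minimal sketch of that operation, not chromo's actual code: the function names and the "-" strand flag are hypothetical, and sequences are assumed to be plain DNA strings using IUPAC codes.

```python
# Complement table for IUPAC nucleotide codes, upper and lower case.
# S and W are their own complements, so they pass through unchanged.
_COMPLEMENT = str.maketrans(
    "ACGTRYKMBDHVNacgtrykmbdhvn",
    "TGCAYRMKVHDBNtgcayrmkvhdbn",
)


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def orient_record(seq: str, dominant_strand: str) -> str:
    """Flip the record only when its dominant alignment is reverse-strand.

    `dominant_strand` is a hypothetical "+" / "-" flag for illustration.
    """
    return reverse_complement(seq) if dominant_strand == "-" else seq
```

Forward-strand records pass through untouched, so running the sketch without the flag set leaves the retained sequences byte-identical to the raw assembly.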
The command is not implemented as “write ordered.fa, then run chromo fix on
that file.” Sort and fix decisions are both made against raw contig IDs and
coordinates, so the raw coords or PAF file is never passed to a sub-step that
expects renamed or reoriented ordered.fa records.
Run chromo clean
chromo clean \
--ref-fasta reference.fa \
--assembly-fasta assembly.fa \
--coords mummer/raw.coords \
--output-prefix results/sample \
--orient-to-reference \
--discarded-fasta results/sample.discarded.fa
The same workflow can use minimap2 PAF:
chromo clean \
--ref-fasta reference.fa \
--assembly-fasta assembly.fa \
--paf paf/raw.paf \
--output-prefix results/sample \
--orient-to-reference
To run fix planning only on the split candidates surfaced by the sort step:
chromo clean \
--ref-fasta reference.fa \
--assembly-fasta assembly.fa \
--coords mummer/raw.coords \
--output-prefix results/sample \
--fix-scope split-candidates
chromo clean Outputs
| Output | Description |
|---|---|
| <prefix>.clean.fa | Retained raw contigs and accepted split pieces, oriented if requested and ordered by reference placement. |
| <prefix>.initial_sort.contig_assignments.tsv | Full first-pass sort assignment report for raw contigs. |
| <prefix>.initial_sort.contig_ref_matches.tsv | Per-contig/reference match report from the raw alignment. |
| <prefix>.initial_sort.chromosome_summary.tsv | Chromosome summary from the initial sort decision. |
| <prefix>.fix_targets.txt | Original raw contig IDs selected for fix planning. |
| <prefix>.fix_report.tsv | chromo fix-style report for selected retained contigs. |
| <prefix>.clean_contigs.tsv | Main unified audit table for discarded contigs, retained unsplit contigs, and retained split pieces. |
| <prefix>.clean_chromosome_summary.tsv | Final cleaned-record summary grouped by reference sequence. |
| <prefix>.run_summary.txt | Inputs, outputs, sort/fix/clean status counts, and validation guidance. |
| --discarded-fasta path | Optional FASTA of raw contigs discarded by the initial sort step. |
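A quick way to confirm that a run produced every prefix-derived file in the table is to build the expected paths from the output prefix. This is a small sketch using only the suffixes listed above; the helper names are hypothetical, and the optional --discarded-fasta path is left out because it is supplied separately.

```python
from __future__ import annotations

from pathlib import Path

# Suffixes taken from the outputs table above.
OUTPUT_SUFFIXES = [
    ".clean.fa",
    ".initial_sort.contig_assignments.tsv",
    ".initial_sort.contig_ref_matches.tsv",
    ".initial_sort.chromosome_summary.tsv",
    ".fix_targets.txt",
    ".fix_report.tsv",
    ".clean_contigs.tsv",
    ".clean_chromosome_summary.tsv",
    ".run_summary.txt",
]


def expected_outputs(prefix: str) -> list[Path]:
    """Build the list of files a run with this --output-prefix should write."""
    return [Path(prefix + suffix) for suffix in OUTPUT_SUFFIXES]


def missing_outputs(prefix: str) -> list[Path]:
    """Return the expected files that are not present on disk."""
    return [path for path in expected_outputs(prefix) if not path.exists()]
```

For example, `missing_outputs("results/sample")` returns an empty list once a run has written all nine files.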
Example clean_contigs.tsv Rows
| source_contig | clean_status | clean_name | kept_by_sort | sort_status | fix_selected | fix_status | dominant_ref | slice_start | slice_end |
|---|---|---|---|---|---|---|---|---|---|
| contig_01 | discarded_no_alignment | . | no | no_alignment | no | . | . | . | . |
| contig_04 | kept_split_piece | chrom02_contig_04_a | yes | kept_split_candidate | yes | split | chrom02 | 1 | 20 |
| contig_04 | kept_split_piece | chrom07_contig_04_b | yes | kept_split_candidate | yes | split | chrom07 | 21 | 40 |
| contig_inv_mid | not_split_single_target | chrom06_contig_inv_mid | yes | kept | yes | not_split_single_target | chrom06 | 1 | 55 |
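Because the audit table has one row per discarded contig, retained contig, or split piece, it is easy to summarize with a few lines of Python. The sketch below assumes the file is tab-separated with a header row using the column names shown above, and that "." marks an empty field. It reads an inline excerpt of the example rows, not a real file.

```python
import csv
import io
from collections import Counter

# Excerpt of the example rows, reduced to the columns used here.
EXAMPLE_TSV = (
    "source_contig\tclean_status\tclean_name\tslice_start\tslice_end\n"
    "contig_01\tdiscarded_no_alignment\t.\t.\t.\n"
    "contig_04\tkept_split_piece\tchrom02_contig_04_a\t1\t20\n"
    "contig_04\tkept_split_piece\tchrom07_contig_04_b\t21\t40\n"
)


def read_rows(handle):
    """Parse a clean_contigs.tsv-style table into a list of dicts."""
    return list(csv.DictReader(handle, delimiter="\t"))


def status_counts(rows):
    """Count rows per clean_status value."""
    return Counter(row["clean_status"] for row in rows)


def split_piece_names(rows):
    """Map each split source contig to the names of its emitted pieces."""
    pieces = {}
    for row in rows:
        if row["clean_status"] == "kept_split_piece":
            pieces.setdefault(row["source_contig"], []).append(row["clean_name"])
    return pieces


rows = read_rows(io.StringIO(EXAMPLE_TSV))
print(status_counts(rows))
print(split_piece_names(rows))
```

To run it on a real report, replace the `io.StringIO` handle with `open("results/sample.clean_contigs.tsv", newline="")`.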
chromo clean Parameters
| Parameter | Default | Meaning |
|---|---|---|
| --coords | required unless --paf | MUMmer show-coords file generated from the raw assembly FASTA. |
| --paf | required unless --coords | minimap2 PAF generated from the raw assembly FASTA. |
| --output-prefix | required | Prefix for clean FASTA and audit reports. |
| --discarded-fasta | none | Optional FASTA of raw contigs discarded by sort filtering. |
| --orient-to-reference | off | Reverse-complement emitted records whose dominant alignment is reverse-strand. |
| --fix-scope | kept | Which raw contigs to inspect with fix planning: kept, split-candidates, kept-and-split-candidates, or file. |
| --fix-targets-file | none | Original raw contig IDs to inspect when --fix-scope file is used. |
| --fix-mode | conservative | Breakpoint planner mode passed to the internal fix step. |
| --min-segment-bp | 10000 | Minimum query-aligned bp for an alignment segment to inform splitting. |
| --breakpoint-penalty-bp | 50000 | Smoothed breakpoint penalty for conservative/chromosome/comprehensive modes. |
| --min-piece-aligned-bp | 50000 | Minimum dominant aligned bp required in each smoothed split piece. |
| --min-piece-query-frac | 0.05 | Minimum query-span fraction required for each smoothed split piece. |
| --max-breakpoints-per-contig | 4 | Maximum accepted breakpoints per source contig. |
All standard chromo sort assignment and duplicate-overlap thresholds are also
available, including --min-aligned-bp, --min-query-cov,
--min-best-ref-share, --min-novel-ref-bp, --min-novel-ref-frac,
--overlap-mode, --novel-ref-criteria, terminal-overlap rescue thresholds,
and split-candidate thresholds.
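To make the per-piece thresholds concrete, here is a hedged sketch of how a candidate split could be checked against --min-piece-aligned-bp, --min-piece-query-frac, and --max-breakpoints-per-contig using their documented defaults. It assumes that every piece must pass both piece-level thresholds and that the query-span fraction is measured against the source contig length. The Piece class and function names are hypothetical, and the real planner may combine these criteria differently.

```python
from dataclasses import dataclass


@dataclass
class Piece:
    aligned_bp: int      # dominant aligned bp inside the piece
    query_span_bp: int   # length of the piece on the source contig


def piece_passes(piece, contig_len,
                 min_piece_aligned_bp=50_000, min_piece_query_frac=0.05):
    """Assumed rule: a piece must meet both piece-level thresholds."""
    return (
        piece.aligned_bp >= min_piece_aligned_bp
        and piece.query_span_bp / contig_len >= min_piece_query_frac
    )


def plan_is_acceptable(pieces, contig_len, max_breakpoints=4):
    """Accept a split only if it has 1..max breakpoints and all pieces pass."""
    breakpoints = len(pieces) - 1
    return (
        1 <= breakpoints <= max_breakpoints
        and all(piece_passes(p, contig_len) for p in pieces)
    )
```

Under these assumptions, a 1 Mb contig split into a 650 kb piece and a 350 kb piece, each with well over 50 kb aligned, is accepted. A plan that carves off a 40 kb piece is rejected because that piece covers less than 5% of the contig.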
Reasoning Behind chromo clean
Discard Before Fix
The cleanup target is a mostly good assembly. In that situation, obvious
unaligned fragments, redundant overlaps, and alternate fragments should not
receive automatic breakpoint surgery. chromo clean first decides which raw
contigs are worth retaining, then, by default, runs fix planning only on the
retained raw contigs. Use --fix-scope to change that selection.
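The discard-then-fix order can be pictured as a filter over the first-pass sort assignments. The sketch below is illustrative only: it reuses the kept_by_sort and sort_status values from the example audit rows and assumes that the kept scope means "retained by sort" and that the split-candidates scope means "sort_status ends with split_candidate". Neither assumption is confirmed here, and the other scopes are deliberately left out.

```python
def select_fix_targets(assignments, scope="kept"):
    """Pick raw contig IDs for fix planning (assumed scope semantics)."""
    if scope == "kept":
        chosen = [a for a in assignments if a["kept_by_sort"] == "yes"]
    elif scope == "split-candidates":
        chosen = [a for a in assignments
                  if a["sort_status"].endswith("split_candidate")]
    else:
        raise ValueError(f"scope not covered by this sketch: {scope}")
    return [a["source_contig"] for a in chosen]


# One row per raw contig, using values from the example audit table.
assignments = [
    {"source_contig": "contig_01", "kept_by_sort": "no",
     "sort_status": "no_alignment"},
    {"source_contig": "contig_04", "kept_by_sort": "yes",
     "sort_status": "kept_split_candidate"},
    {"source_contig": "contig_inv_mid", "kept_by_sort": "yes",
     "sort_status": "kept"},
]

print(select_fix_targets(assignments))
print(select_fix_targets(assignments, scope="split-candidates"))
```

In this sketch the unaligned contig_01 never reaches the planner under either scope, which is the point of discarding before fixing.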
Fix Before Final Order
Accepted splits create new pieces whose reference placement may differ from the
source contig’s best reference. chromo clean therefore plans splits before
the final cleaned FASTA is ordered. Split pieces and unsplit contigs are then
sorted together by final dominant reference placement.
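The final ordering can be sketched as a two-level sort key: the position of each record's dominant reference in the reference FASTA, then its placement on that reference. The ref_start field is a hypothetical stand-in for whatever placement coordinate chromo clean uses, and the placement values below are invented for illustration.

```python
def order_records(records, reference_order):
    """Sort records by reference FASTA order, then by placement on it."""
    rank = {ref: index for index, ref in enumerate(reference_order)}
    return sorted(
        records,
        key=lambda rec: (rank[rec["dominant_ref"]], rec["ref_start"]),
    )


reference_order = ["chrom02", "chrom06", "chrom07"]  # order in reference.fa
records = [
    {"name": "chrom07_contig_04_b", "dominant_ref": "chrom07", "ref_start": 500},
    {"name": "chrom06_contig_inv_mid", "dominant_ref": "chrom06", "ref_start": 10},
    {"name": "chrom02_contig_04_a", "dominant_ref": "chrom02", "ref_start": 900},
]

for rec in order_records(records, reference_order):
    print(rec["name"])
```

The two pieces of contig_04 end up on different references, so they are not adjacent in the output. This is why ordering has to happen after split planning, not before it.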
Still Re-Align The Cleaned FASTA
The cleaned FASTA is derived from raw alignment evidence. That is enough for an
auditable cleanup decision, but it is not a substitute for final validation.
After chromo clean, align <prefix>.clean.fa to the reference and inspect
fresh chromo plot or mummerplot output from that clean-FASTA alignment. The
dot-plot guide gives examples of the
patterns to look for during that validation step.
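As a starting point for that validation step, the sketch below builds a minimap2 command that aligns the cleaned FASTA to the reference and writes PAF to stdout. The asm5 preset is an assumption suited to closely related assemblies, so pick the preset that matches your divergence. The helper name is hypothetical and nothing is executed here.

```python
import shlex


def minimap2_validation_command(ref_fasta, clean_fasta, preset="asm5"):
    """Return an argv list for aligning the cleaned FASTA to the reference.

    minimap2 takes the target (reference) first and the query second,
    and prints PAF to stdout by default.
    """
    return ["minimap2", "-x", preset, ref_fasta, clean_fasta]


cmd = minimap2_validation_command("reference.fa", "results/sample.clean.fa")
print(shlex.join(cmd))
```

To actually run it, pass `cmd` to `subprocess.run` with `stdout` pointed at an open output file, then plot the resulting PAF.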