Agent And Review Playbook
This playbook turns common ChromoSort review work into a portable procedure. It is written for new project chats, coding agents, and users moving from one crop, species, or assembly batch to another. The examples use generic file names; replace reference, assembly, sample, and chromosome names with the names from your dataset.
The goal is not to make every decision automatic. The goal is to keep the inputs, evidence classes, and intervention choices explicit enough that another person or agent can reproduce the same reasoning later.
Start With A Dataset Manifest
Keep one small manifest per analysis batch. A plain TSV is enough:
sample reference_fasta assembly_fasta alignment_format alignment_path gfa read_paf gaf notes
sampleA reference.fa assembly.fa paf paf/sampleA.paf graphs/sampleA.gfa reads/sampleA.reads_to_assembly.paf graph_reads/sampleA.gaf raw assembly
At minimum, record:
- the exact reference FASTA and assembly FASTA used by the coords or PAF file,
- the aligner command and preset,
- whether PAF was generated with -c --secondary=no,
- the ChromoSort command and output prefix,
- any FASTA-changing step that requires a fresh downstream alignment.
The most important invariant is the FASTA/alignment compatibility rule: do not use an old raw-assembly alignment as evidence for a changed FASTA.
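One inexpensive guard for this rule is to confirm that the contig names and lengths recorded in a PAF still match the FASTA you are about to use. The sketch below assumes the PAF was produced with the reference as target and the assembly as query, so the assembly contigs are in PAF columns 1 and 2. The name paf_matches_fasta is a hypothetical helper, not a ChromoSort command.

```shell
# Hypothetical helper: succeed only if every PAF query name and length
# (columns 1 and 2) matches a record in the given FASTA.
# Usage: paf_matches_fasta paf/sampleA.paf assembly.fa
paf_matches_fasta() {
  awk '
    FNR == 1 { file++ }                                # a new input file starts
    file == 1 && /^>/ { name = substr($1, 2); next }   # FASTA header: keep the ID
    file == 1 { len[name] += length($0); next }        # FASTA sequence line
    !($1 in len) || len[$1] != $2 {                    # PAF row: unknown contig or changed length
      if (!reported[$1]++) print "mismatch: " $1 > "/dev/stderr"
      bad = 1
    }
    END { exit bad + 0 }
  ' "$2" "$1"
}
```

Matching names and lengths are necessary but not sufficient: a FASTA edited without changing any contig length would still pass, so keep the manifest entry as the primary record.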
Choose One Primary Alignment Evidence
MUMmer coords and minimap2 PAF are alternative ways to provide the same class
of evidence: a whole-genome reference-to-assembly alignment. You usually do
not need both for a production run. Pick one primary alignment source, run
sort, plot, fix, and eval fix from that source, and use independent
evidence streams such as long-read PAF, GFA, and GAF when a biological decision
needs more support.
PAF is the recommended primary source for most new runs because it is much faster in large plant-genome tests and supports MAPQ filtering. MUMmer coords remains a good alternative, especially for projects with existing nucmer pipelines or for a second aligner perspective on a surprising candidate.
Running both coords and PAF is still useful when benchmarking ChromoSort, checking parser parity, or tuning minimap2/MUMmer settings for a new genome group. Treat that comparison as a diagnostic, not as independent biological validation.
As a loose expectation from the soybean coords-vs-PAF chromo fix benchmark,
split counts were close, differing by about 5-10%, while the exact set of
marginal split contigs differed by about 20-30%. Those differences appeared to
come from aligner behavior and output structure, such as row fragmentation,
secondary/primary handling, MAPQ, and identity fields, not from ChromoSort
applying different post-normalization logic to coords and PAF.
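When you do run both inputs as a diagnostic, the overlap of the two candidate sets can be summarized with standard tools. This is a minimal sketch, assuming you have already extracted one contig ID per line from each run into two plain text files. The file names and the helper name compare_id_sets are illustrative only.

```shell
# Hypothetical helper: compare two lists of contig IDs (one ID per line),
# for example split candidates from a coords-backed and a PAF-backed run.
# Usage: compare_id_sets coords_candidates.txt paf_candidates.txt
compare_id_sets() {
  a=$(mktemp) && b=$(mktemp) || return 1
  sort -u "$1" > "$a"
  sort -u "$2" > "$b"
  # comm needs sorted input; -12 keeps shared lines, -23 and -13 keep unique ones.
  printf 'shared\t%d\n'      "$(comm -12 "$a" "$b" | awk 'END { print NR }')"
  printf 'only_first\t%d\n'  "$(comm -23 "$a" "$b" | awk 'END { print NR }')"
  printf 'only_second\t%d\n' "$(comm -13 "$a" "$b" | awk 'END { print NR }')"
  rm -f "$a" "$b"
}
```

Contigs that land in only_first or only_second are the marginal cases worth inspecting in plots before any fix.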
If you choose MUMmer, use a filtered reference-vs-assembly coords file:
nucmer -t 16 -c 500 -p mummer/${sample} reference.fa assembly.fa
delta-filter -i 95 -l 10000 -1 mummer/${sample}.delta > mummer/${sample}.filter
show-coords -r -c -l mummer/${sample}.filter > mummer/${sample}.coords
If you choose minimap2 PAF, request base-level alignment with -c and suppress secondary rows with --secondary=no:
minimap2 -x asm5 -c -t 16 --secondary=no reference.fa assembly.fa \
> paf/${sample}.paf
Choose the strictest minimap2 preset that still recovers the expected
chromosome-scale alignments. asm5 is the safest same-species default. Move to
asm10 or asm20 only when expected syntenic blocks are missing or badly
fragmented. When using permissive presets, inspect plots and consider
--min-mapq; avoid PAF identity filters until you have checked the PAF
identity distribution.
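A quick look at the identity and MAPQ distributions can be done directly on the PAF before choosing any filter. The sketch below computes identity as matching bases divided by alignment block length (PAF columns 10 and 11), which is an assumption: ChromoSort may define its identity field differently. Both helper names are hypothetical.

```shell
# Hypothetical helper: histogram of per-row identity, computed as
# matching bases (column 10) / alignment block length (column 11),
# in whole-percent bins. Output: percent<TAB>row_count.
paf_identity_histogram() {
  awk -F'\t' '
    $11 > 0 { n[int(100 * $10 / $11)]++ }
    END { for (p = 0; p <= 100; p++) if (p in n) print p "\t" n[p] }
  ' "$1"
}

# Hypothetical helper: count rows below a MAPQ cutoff (column 12).
# Usage: paf_mapq_summary paf/sampleA.paf 20
paf_mapq_summary() {
  awk -F'\t' -v cutoff="${2:-20}" '
    { total++; if ($12 < cutoff) low++ }
    END { printf "rows\t%d\nbelow_cutoff\t%d\n", total, low }
  ' "$1"
}
```

If most rows sit in a narrow high-identity band, an identity filter will do little. A long low-identity tail suggests checking the plots before filtering anything out.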
Run Sort And Plot From The Primary Alignment
For a PAF-backed run:
chromo sort \
--ref-fasta reference.fa \
--assembly-fasta assembly.fa \
--paf paf/${sample}.paf \
--output-prefix results/${sample} \
--orient-to-reference
chromo plot \
--ref-fasta reference.fa \
--assembly-fasta assembly.fa \
--paf paf/${sample}.paf \
--assignments results/${sample}.contig_assignments.tsv \
--output-prefix plots/${sample} \
--per-ref
For a coords-backed run, replace --paf paf/${sample}.paf with
--coords mummer/${sample}.coords.
For a focused replot, add --sel-ref with the reference IDs involved in the
decision.
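For a batch, the sort command can be generated from the manifest so that the --paf or --coords choice follows the recorded alignment_format. This is a sketch that only prints commands and does not run them. It assumes a tab-separated manifest with a header line and the column order shown earlier; print_sort_commands is a hypothetical helper name.

```shell
# Hypothetical helper: print (not run) one chromo sort command per manifest row.
# Assumes a tab-separated manifest with a header line and the column order
# sample, reference_fasta, assembly_fasta, alignment_format, alignment_path.
print_sort_commands() {
  awk -F'\t' '
    NR == 1 { next }                       # skip the header
    $4 == "paf"    { flag = "--paf" }
    $4 == "coords" { flag = "--coords" }
    $4 != "paf" && $4 != "coords" {
      print "skipping " $1 ": unknown alignment_format " $4 > "/dev/stderr"
      next
    }
    {
      printf "chromo sort --ref-fasta %s --assembly-fasta %s %s %s --output-prefix results/%s --orient-to-reference\n", $2, $3, flag, $5, $1
    }
  ' "$1"
}
```

Review the printed commands before executing them. Paths are not quoted, so this sketch is only suitable for paths without spaces.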
Review Primary Alignment Decisions
Start from three reports per sample:
- <prefix>.run_summary.txt: settings, input paths, status counts, and PAF diagnostics when PAF is the primary alignment.
- <prefix>.contig_assignments.tsv: one row per contig with assignment, status, second reference, and split-candidate flags.
- <prefix>.contig_ref_matches.tsv: per-contig, per-reference evidence.
Use these decision classes:
| Pattern | Interpretation | Next action |
|---|---|---|
| Contig is kept_split_candidate with two substantial reference matches | Review target | Inspect the plot and match rows; fix only reviewed contigs with chromo fix --mode conservative; re-align the fixed FASTA. |
| One strong same-reference placement plus tiny off-target matches | Usually repeat or paralog noise | Leave unsplit unless other evidence supports a real event. |
| Whole contig is reverse relative to the reference | Orientation issue | Use chromo sort --orient-to-reference; do not split. |
| Same-reference internal inversion | Biological inversion, reference difference, or assembly error | Evaluate with read and graph evidence; do not automatically reference-normalize. |
| Terminal overlap between separate contigs | Scaffold/overlap problem, not a within-contig fix | Review sort/scaffold overlap reports and use explicit scaffold overlap policies only when justified. |
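To pull the review targets out of a report, a header-aware filter avoids hard-coding column positions. The sketch below selects rows where a named column equals a given value. The column name status used in the usage line is an assumption about the contig_assignments.tsv header, so check the actual header first. tsv_rows_where is a hypothetical helper name.

```shell
# Hypothetical helper: print the header plus rows where the named column
# equals the given value, for a tab-separated file with a header line.
# Usage (column name is an assumption):
#   tsv_rows_where status kept_split_candidate results/sampleA.contig_assignments.tsv
tsv_rows_where() {
  awk -F'\t' -v col="$1" -v val="$2" '
    NR == 1 {
      for (i = 1; i <= NF; i++) if ($i == col) c = i   # locate the column by name
      if (!c) { print "column not found: " col > "/dev/stderr"; exit 2 }
      print
      next
    }
    $c == val
  ' "$3"
}
```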
If you also ran the other whole-genome alignment format, use it as a diagnostic cross-check:
| Optional coords-vs-PAF pattern | Interpretation |
|---|---|
| Both inputs flag the same contig and references | The alignment-format paths are coherent for that event; still use plot/read/graph evidence for the biological call. |
| Candidate appears only in one input | Aligner-specific evidence or threshold effect; inspect plots and eval evidence before fixing. |
| Dominant assigned reference differs, but the same references are involved | Usually a near-tie assignment difference; decide on the event, not the winner label alone. |
For production fixes, prefer targeted contig lists over --all:
chromo fix \
--assembly-fasta assembly.fa \
--paf paf/${sample}.paf \
--contigs contig_a contig_b \
--mode conservative \
--min-mapq 20 \
--orient-to-reference \
--output-fasta results/fix/${sample}.fixed.fa \
--report results/fix/${sample}.fix_report.tsv
If you use coords instead of PAF, replace --paf with --coords and drop the
PAF-specific --min-mapq filter.
Review Same-Reference Inversions
An internal inversion is different from a multi-reference chimeric contig. A dot plot may show one contig assigned cleanly to one reference, but with a large internal block in the opposite orientation. That pattern can be a real sample allele, a difference between the sample and reference, or an assembly error.
Ask two separate questions:
- Is the inversion real in this assembly or haplotype?
- If it is real, should the delivered FASTA preserve it or be deliberately reference-normalized?
For pangenome graph inputs, preserve real haplotype structure. Whole-contig orientation to the reference is often useful, but flipping a validated internal inversion removes the allele from that assembly path. Reference-normalized pseudoassemblies can be useful for a specific downstream comparison, but they should be labeled separately from biological assembly inputs.
Use chromo eval fix --mode comprehensive to make a review table without
immediately applying a fix:
chromo eval fix \
--assembly-fasta assembly.fa \
--paf paf/${sample}.paf \
--contigs contig_with_inversion \
--mode comprehensive \
--min-mapq 20 \
--gfa graphs/${sample}.gfa \
--gaf graph_reads/${sample}.gaf \
--read-paf reads/${sample}.reads_to_assembly.paf \
--read-window-bp 10000 \
--read-min-anchor-bp 2000 \
--orient-to-reference \
--output-prefix review/${sample}.inversion
Evidence that makes a same-reference inversion more believable includes:
- the primary alignment showing a sharp orientation switch with strong blocks,
- an optional second whole-genome alignment reproducing the same breakpoints,
- long alignments and high unique support on both sides of the event,
- long reads spanning the candidate breakpoints in the read-to-assembly PAF,
- graph paths or GFA topology consistent with the assembly sequence,
- confirmation in another assembly, map, or related accession when available.
Evidence that argues for caution includes low MAPQ rows, repeat-rich or homeologous placements, many small off-target hits, breakpoint pileups of edge-only reads without spanning reads, or disagreement between graph and read-to-assembly evidence.
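The first item in the supporting list, a sharp orientation switch between strong blocks, can be screened for directly in the whole-genome PAF. This is a simplified sketch, not ChromoSort's own logic: it keeps only blocks above a length and MAPQ cutoff for one contig, orders them along the contig, and reports places where consecutive blocks on the same reference change strand. The helper name and default cutoffs are illustrative assumptions.

```shell
# Hypothetical helper: report strand switches between consecutive strong
# alignment blocks of one contig (PAF query) on the same reference.
# Output: query_start<TAB>reference<TAB>previous_strand<TAB>new_strand
# Usage: paf_strand_switches contig_with_inversion paf/sampleA.paf [min_block_bp] [min_mapq]
paf_strand_switches() {
  awk -F'\t' -v c="$1" -v minlen="${3:-10000}" -v minq="${4:-20}" '
    # keep query start, query end, strand, reference for strong blocks only
    $1 == c && ($4 - $3) >= minlen && $12 >= minq { print $3 "\t" $4 "\t" $5 "\t" $6 }
  ' "$2" |
  sort -k1,1n |
  awk -F'\t' '
    NR > 1 && $4 == prev_ref && $3 != prev_strand { print $1 "\t" $4 "\t" prev_strand "\t" $3 }
    { prev_strand = $3; prev_ref = $4 }
  '
}
```

Two switches on the same reference, one in and one out, are what an internal inversion would produce. A single switch usually calls for a closer look at the plot instead.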
Add Long-Read And Graph Evidence
For read-to-assembly evidence:
minimap2 -x map-hifi -c -t 16 --secondary=no assembly.fa reads.fastq.gz \
> reads/${sample}.reads_to_assembly.paf
For read-to-graph evidence, use GraphAligner or another graph mapper that writes GAF:
GraphAligner \
-g graphs/${sample}.gfa \
-f reads.fastq.gz \
-a graph_reads/${sample}.gaf
chromo eval fix summarizes read support near proposed breakpoints as
longread_spanning_reads, longread_split_reads, edge-read counts, and nearby
read counts. GFA and GAF fields are advisory context for fix review. They do
not change sequence by themselves.
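To sanity-check a spanning-read count by hand, the read-to-assembly PAF can be queried directly. The sketch below is a simplified approximation and not the definition ChromoSort uses for longread_spanning_reads: it counts distinct reads with a single alignment row that extends at least anchor_bp on both sides of a breakpoint. It assumes the assembly was the minimap2 target, so contig coordinates are in PAF columns 6, 8, and 9. The helper name and default values are illustrative.

```shell
# Hypothetical helper: count distinct reads whose alignment covers a
# breakpoint with at least anchor_bp of contig sequence on each side.
# Usage: count_spanning_reads contig breakpoint reads.paf [anchor_bp] [min_mapq]
count_spanning_reads() {
  awk -F'\t' -v c="$1" -v bp="$2" -v a="${4:-2000}" -v q="${5:-20}" '
    $6 == c && $12 >= q && $8 <= bp - a && $9 >= bp + a && !seen[$1]++ { n++ }
    END { print n + 0 }
  ' "$3"
}
```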
For scaffold and gapfill review, GFA and GAF become more central because the
question is often whether graph paths connect adjacent contigs or explain a
gap. Use chromo eval scaffold and chromo eval gapfill when the decision is
between contig-end joins or candidate graph paths rather than within-contig
breakpoints.
Decide What To Deliver
Use explicit delivery labels:
| Delivery class | Use when | Typical command path |
|---|---|---|
| Native assembly | Structure appears real or unresolved | Sort, orient whole contigs if desired, scaffold conservatively. |
| Fixed assembly | Evidence supports an assembly misjoin or wrong breakpoint | Targeted chromo fix, chromo cut, or reviewed-plan application, then re-align. |
| Review-only table | Evidence is interesting but not ready for sequence editing | chromo eval fix/scaffold/gapfill plus focused plots or manual dashboard. |
| Reference-normalized experimental FASTA | Downstream analysis intentionally requires reference orientation at local events | Run explicit comprehensive/sensitive fix or manual edits into a clearly named output folder. |
| Pangenome graph input | Building a graph intended to represent real alleles | Preserve validated inversions and other real structure; avoid unlabelled reference normalization. |
Document the choice in the manifest or a short decision table. Future agents should be able to answer: what was changed, what was left native, which evidence supported that choice, and which FASTA each downstream alignment describes.
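If the decision table is kept as a TSV, a small check can catch free-text labels before handoff. The sketch below assumes an illustrative layout with a header line and the columns contig, decision, and evidence, and it accepts only the four decision labels named in the handoff checklist. Both the layout and the helper name check_decision_labels are assumptions.

```shell
# Hypothetical helper: list rows whose decision label is not one of the
# expected classes. Assumes a tab-separated table with a header line and
# the columns contig, decision, evidence (an illustrative layout).
check_decision_labels() {
  awk -F'\t' '
    BEGIN {
      ok["fix now"]; ok["review before fix"]; ok["leave native"]
      ok["reference-normalized experiment"]
    }
    NR > 1 && !($2 in ok) { print "line " NR ": unexpected decision \"" $2 "\""; bad = 1 }
    END { exit bad + 0 }
  ' "$1"
}
```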
Handoff Checklist For A New Chat Or Agent
Before handing off a review, provide:
- the manifest or a list of exact input paths,
- commands used to generate the primary MUMmer coords or minimap2 PAF,
- ChromoSort sort and plot commands,
- the relevant run_summary, contig_assignments, and contig_ref_matches files,
- focused plots for any contested contig or reference,
- any GFA, read-to-assembly PAF, and read-to-graph GAF evidence,
- a decision table with rows such as fix now, review before fix, leave native, and reference-normalized experiment,
- notes about any dataset-specific preset choices such as asm20 for a divergent reference.
Avoid crop-specific assumptions in the handoff. The same evidence logic works with soybean chromosomes, grass chromosomes, fungal scaffolds, animal chromosomes, or any other reference/assembly naming convention.
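Part of the checklist can be verified mechanically before handoff. This is a minimal sketch that checks the three per-sample reports exist and are non-empty for a given output prefix; check_handoff_files is a hypothetical helper name, and the check says nothing about whether the reports match the current FASTA.

```shell
# Hypothetical helper: report which of the three per-sample reports are
# missing or empty for a given output prefix.
# Usage: check_handoff_files results/sampleA
check_handoff_files() {
  missing=0
  for suffix in run_summary.txt contig_assignments.tsv contig_ref_matches.tsv; do
    if [ ! -s "$1.$suffix" ]; then
      echo "missing or empty: $1.$suffix"
      missing=1
    fi
  done
  return "$missing"
}
```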