Architecture
ChromoSort is a reference-guided genome assembly curation toolkit. It consumes existing whole-genome alignments, assembly graph records, and optional cross-evidence tables; it does not assemble reads, run aligners, polish bases, or infer variants. Its core responsibility is to convert heterogeneous evidence into deterministic, auditable contig-level decisions: order, split, cut, scaffold, review, plot, and optionally fill graph-supported gaps.
The implementation is intentionally organized around command boundaries. Each command writes explicit FASTA, TSV, HTML, or plot artifacts rather than sharing hidden state. This makes the pipeline inspectable by users and reviewers, and it makes sequence-changing steps depend on documented inputs and reports.
Design Principles
-
Evidence first, sequence edits second. Alignment metrics, graph context, gap decisions, and manual-review choices are written as reports. Sequence changes occur only in commands whose purpose is to write a new FASTA:
sort,clean,fix,cut,manual apply,scaffold, andgapfill --apply. -
Conservative defaults. The default behavior favors retaining ambiguous evidence for review over making irreversible edits. Examples include split-candidate protection in
sort, breakpoint smoothing infix, report-only graph evidence insort,fix, andscaffold, and ambiguous-path refusal ingapfill. -
Simple, explicit data models. FASTA records, alignment segments, merged intervals, assignment rows, graph nodes/edges, split pieces, scaffold gaps, and fill plans are represented with small dataclasses or TSV schemas. The project avoids an internal database and uses files as the public contract between stages.
-
Format normalization before decision logic. MUMmer
show-coordsand minimap2 PAF records are normalized into a sharedSegmentrepresentation before coverage, identity, orientation, and ordering decisions are made. GFA graph support is similarly normalized into oriented node and link records. -
Graph evidence is guarded. GFA, GAF, Hi-C-like contact tables, and reference-placement PAF can refine review and gap-filling decisions, but sequence insertion requires a validated graph path. Complex or unknown GFA overlaps are preserved as uncertainty rather than converted into trim lengths.
-
Reviewer-readable provenance. Important decisions are surfaced as status codes, reason strings, support counts, interval lengths, risk flags, and summary tables. The goal is to make each accepted or rejected decision reconstructable from input evidence.
System Map
The commands can be used as a pipeline, but the architecture does not require a single prescribed run order. Users can stop at any report, inspect it, edit manual inputs, re-align edited FASTA records, and resume with later commands.
Command/API Responsibilities
The public command surface is the chromo dispatcher in
src/chromosort/cli.py. pyproject.toml also exposes compatibility scripts:
chromosort, chromosort-clean, chromosort-fix-contigs,
chromosort-scaffold, and chromosort-gapfill.
| Command/API | Primary module | Main inputs | User-facing responsibility | Sequence-changing behavior |
|---|---|---|---|---|
chromo sort |
reference_order.py |
Reference FASTA, assembly FASTA, coords or PAF, optional GFA | Assign contigs to reference sequences, classify low-confidence and duplicate-overlap records, order kept contigs by reference position, and write match/assignment summaries. | Writes an ordered FASTA for kept contigs. Optionally reverse-complements records with --orient-to-reference. |
chromo clean |
clean.py |
Reference FASTA, raw assembly FASTA, coords or PAF | Apply sort assignment/filtering to raw contigs, select retained raw contigs for conservative fix planning, reconcile retained unsplit contigs and split pieces, and write unified audit reports. | Writes a cleaned FASTA. Optionally reverse-complements emitted records with --orient-to-reference. |
chromo eval fix |
evaluate.py and fix_contigs.py |
Assembly FASTA, coords or PAF, selected contigs or --all, optional GFA, long-read PAF, and GAF |
Prepare an editable review-event TSV for candidate split pieces using the fix planner plus optional graph and read evidence. | Does not change FASTA. Accepted rows can later be applied by chromo fix --reviewed-plan. |
chromo eval scaffold |
evaluate.py and scaffold.py |
Ordered FASTA, assignment TSV, optional GFA, long-read PAF, and GAF | Prepare an editable review-event TSV for adjacent scaffold junctions, inferred gaps/overlaps, graph context, GAF path support, and read-bridge support. | Does not change FASTA. Accepted rows can later pin gap lengths through chromo scaffold --reviewed-plan. |
chromo eval gapfill |
evaluate.py and gapfill.py |
Ordered FASTA, assignment TSV, GFA, optional GAF, Hi-C table, reference-placement PAF, and long-read PAF | Prepare an editable review-event TSV from the gapfill planner, including candidate path status, fillability, branch-support evidence, and read-bridge context. | Does not change FASTA. Accepted rows can later restrict chromo gapfill --reviewed-plan --apply. |
chromo fix |
fix_contigs.py |
Assembly FASTA, coords or PAF, selected contigs or --all, optional GFA |
Detect and split selected chimeric contigs into reference-labeled pieces using mode-specific breakpoint planning. | Writes a fixed FASTA and split report. Optionally writes only pieces and/or reverse-complements pieces. |
chromo cut |
cut.py |
Assembly FASTA plus explicit cut positions | Apply reviewed 1-based cut positions and report emitted pieces. | Writes a cut FASTA. No cut is inferred from alignments. |
chromo manual |
manual.py |
Reference FASTA, assembly FASTA, coords or PAF, optional GFA, long-read PAF, GAF, and review table | Build a self-contained review dashboard with alignment, contig, graph, initial piece information, and optional evidence panels. | Does not change FASTA. |
chromo manual fix/scaffold/gapfill |
manual.py and review.py |
Same dashboard inputs plus task-specific chromo eval review table |
Add a focused task-event queue and selected-event evidence panels for fix, scaffold, or gapfill review. | Does not change FASTA. Manual edits still require browser export plus manual apply; reviewed TSVs feed the matching executor. |
chromo manual apply |
manual.py |
Assembly FASTA and exported recipe JSON | Apply a dashboard recipe containing slices, orientation choices, removals, and optional scaffold grouping. | Writes a manually edited FASTA and optional report. |
chromo scaffold |
scaffold.py |
Ordered FASTA and sort assignment TSV, optional GFA, optional reviewed plan |
Join ordered contigs per reference with inferred, fixed, or reviewed gaps; report reference overlaps and graph adjacency. | Writes scaffold FASTA. Overlap trimming occurs only under explicit overlap policies. |
chromo gapfill |
gapfill.py |
Ordered FASTA, assignment TSV, GFA, optional GAF, Hi-C table, reference-placement PAF, reviewed plan | Plan graph-supported gap fills, annotate risk, resolve unique supported paths, and optionally apply fillable paths. | Writes gapfilled FASTA only with --apply; reviewed plans can further restrict which fillable gaps are applied. |
chromo plot |
plot.py |
Reference FASTA, assembly FASTA, coords or PAF, optional assignment TSV, optional selected reference IDs | Draw whole-genome, per-reference, and selected-reference dot plots from existing alignments. See the dot-plot guide for interpretation patterns. | Does not change FASTA. |
Algorithm And Data Model Activation Map
This section makes explicit where each documented algorithm or data model is used. “Activation” means the command path, mode, or parameter that calls the logic; it does not imply that the evidence source is allowed to change sequence by itself.
| Algorithm or data model | Activated by | Key parameter or mode controls | Main output fields or artifacts | Validation anchors |
|---|---|---|---|---|
Alignment normalization into Segment records |
sort, clean, fix, eval fix, manual, and plot |
--coords or --paf, --min-segment-bp, --min-segment-idy, --min-mapq, --include-secondary-paf |
Match metrics, embedded alignment rows, dot-plot segments, fix blocks | tests/test_reference_order.py, tests/test_fix_contigs.py, tests/test_plot.py, tests/test_manual.py |
| Match aggregation and best-reference ranking | sort, clean, manual, and plot --assignments indirectly through sort tables |
--min-aligned-bp, --min-query-cov, --min-best-ref-share |
match_report.tsv, contig_assignments.tsv, manual query records |
tests/test_reference_order.py, tests/test_clean.py, tests/test_manual.py |
| Duplicate-overlap filtering and terminal-extension rescue | sort and the sort/filter stage inside clean |
--no-overlap-filter, --overlap-mode, --min-novel-ref-bp, --min-novel-ref-frac, --novel-ref-criteria, terminal-extension thresholds |
Assignment statuses such as duplicate_overlap, terminal_overlap, and kept/rescued records |
tests/test_reference_order.py, tests/test_clean.py |
| Split-candidate protection before fixing | sort and clean |
--split-candidate-min-aligned-bp, --split-candidate-min-query-frac, --split-candidate-max-best-share |
kept_split_candidate assignment status and split-candidate annotations |
tests/test_reference_order.py, tests/test_clean.py |
| Fix breakpoint block planning | fix, eval fix, and the fix stage inside clean |
--mode sensitive/chromosome/comprehensive/conservative, --breakpoint-penalty-bp, --max-merge-gap, piece support thresholds |
fix_report.tsv, fix_review.tsv, cleaned FASTA pieces |
tests/test_fix_contigs.py, tests/test_eval.py, tests/test_clean.py |
| Reviewed fix-row application | fix --reviewed-plan |
Accepted chromosort-review-event-v1 rows with task fix and action split_piece |
Fixed FASTA and report rows generated from reviewed slices | tests/test_eval.py, tests/test_fix_contigs.py |
| Explicit cut planning | cut |
--cut, --cuts-file, --contig/--pos, --min-piece-bp |
Cut FASTA and cut report | tests/test_cut.py |
| GFA node and link evidence | sort --gfa, fix --gfa, eval fix --gfa, manual --gfa, scaffold --gfa, eval scaffold --gfa, gapfill, and eval gapfill |
--graph-guard, --graph-report, --graph-overlap-policy, graph path depth controls |
Graph assignment/report TSVs, graph panels, graph gap fields, fill-path nodes | tests/test_graph.py, tests/test_reference_order.py, tests/test_fix_contigs.py, tests/test_scaffold.py, tests/test_gapfill.py, tests/test_manual.py |
| Review-event TSV model | eval fix/scaffold/gapfill, manual fix/scaffold/gapfill, fix --reviewed-plan, scaffold --reviewed-plan, gapfill --reviewed-plan |
--review-table in manual modes; --reviewed-plan in executors |
*.fix_review.tsv, *.scaffold_review.tsv, *.gapfill_review.tsv using schema chromosort-review-event-v1 |
tests/test_review.py, tests/test_eval.py, executor tests |
| Long-read PAF evidence summaries | eval fix --read-paf, eval scaffold --read-paf, eval gapfill --read-paf, manual --read-paf panels |
--min-read-mapq, --read-window-bp, --read-terminal-window-bp, --read-min-anchor-bp |
longread_* review fields for breakpoint, bridge, orientation, read order, and median read-gap support |
tests/test_longreads.py, tests/test_eval.py, tests/test_manual.py |
| GAF parsing and graph traversal summaries | eval fix --gaf, eval scaffold --gaf, eval gapfill --gaf, manual --gaf panels, and gapfill --gaf |
--min-gaf-mapq, --min-gaf-path-support, candidate-path limits |
gaf_* review fields, gapfill GAF support/status/read fields, selected-event GAF panel rows |
tests/test_gaf.py, tests/test_eval.py, tests/test_gapfill.py, tests/test_manual.py |
| Scaffold gap and overlap construction | scaffold and eval scaffold |
--fixed-gap-bp, --overlap-policy, --trim-sequence-min-identity, --reviewed-plan |
Scaffold FASTA, scaffold_gaps.tsv, scaffold_review.tsv |
tests/test_scaffold.py, tests/test_eval.py |
| Scaffold graph-junction reporting | scaffold --gfa and eval scaffold --gfa |
--graph-overlap-policy, --graph-max-path-edges; eval scaffold --gaf adds path support context |
graph_gaps.tsv, graph_status, graph_path_nodes, gaf_* review columns |
tests/test_scaffold.py, tests/test_eval.py |
| Gapfill path enumeration and branch resolution | gapfill and eval gapfill |
Required --gfa; --gaf, --hic-pairs, --ref-paf, --max-path-edges, --max-candidate-paths, support thresholds |
gapfill_plan.tsv, gapfill_review.tsv, optional review HTML, optional gapfilled FASTA |
tests/test_gapfill.py, tests/test_eval.py |
| Manual dashboard recipe model | manual, manual fix/scaffold/gapfill, and manual apply |
Dashboard browser actions; manual apply --scaffold/--no-scaffold, --gap-bp |
chromosort-manual-v1 JSON recipe, manual FASTA, manual report |
tests/test_manual.py |
| Dot-plot rendering | plot and the canvas view in manual |
--formats, --per-ref, --sel-ref, --assignments, segment filters |
SVG/PDF/PNG plots and manual canvas lines | tests/test_plot.py, tests/test_manual.py |
Evidence Activation And Authority
| Evidence stream | Where it enters | Where it is report-only | Where it can affect sequence output |
|---|---|---|---|
| Whole-genome coords or PAF | sort, clean, fix, eval fix, manual, plot |
Manual and plot views; eval tables before execution | sort, clean, and planner-driven fix; reviewed fix rows can replace planner inputs |
| GFA graph topology | sort --gfa, fix --gfa, manual --gfa, scaffold --gfa, eval modes, gapfill |
Sort/fix/manual graph context; scaffold graph reports by default; eval advisory fields | scaffold --graph-overlap-policy confirm can confirm terminal overlap trimming; gapfill --apply can insert validated graph-path sequence |
| Long-read PAF to assembly | eval fix/scaffold/gapfill --read-paf, manual --read-paf |
Review-event fields and dashboard panels | Does not directly edit sequence; users may accept or edit reviewed rows informed by the fields |
| Long-read GAF to graph | eval fix/scaffold/gapfill --gaf, manual --gaf, gapfill --gaf |
Advisory in eval fix; support context in eval scaffold and manual panels |
In gapfill, unique non-conflicting support can choose among candidate GFA paths before sequence validation |
| Hi-C-like graph contacts | gapfill --hic-pairs and eval gapfill --hic-pairs |
Candidate path support and risk annotations | In gapfill, unique non-conflicting support can choose a candidate path before sequence validation |
| Reference-placement PAF for graph nodes | gapfill --ref-paf and eval gapfill --ref-paf |
Candidate path support fields and review HTML | In gapfill, unique non-conflicting support can choose a candidate path before sequence validation |
Formal Data Model
Coordinate Conventions
ChromoSort accepts two alignment coordinate systems:
- MUMmer
show-coordsrows, which are read as 1-based closed endpoints. - PAF rows, whose 0-based half-open endpoints are converted into equivalent 1-based closed endpoints by adding one to the start coordinate and preserving the end coordinate.
Most interval arithmetic then converts endpoints to half-open intervals $I = [a,b)$ with length
\[|I| = b - a.\]For reverse-strand rows, interval endpoints are normalized with min and
max for arithmetic, while the original strand is retained as an orientation
field. This separation keeps overlap and coverage computations independent of
strand direction.
Alignment Segment and Match Metrics
The shared Segment model records:
- reference name $r$ and query/contig name $q$,
- reference and query endpoints,
- aligned reference and query lengths,
- percent identity $p_s$,
- sequence lengths $L_r$ and $L_q$ when provided by the alignment format,
- orientation, MAPQ, and secondary-alignment status when available.
For a query $q$ and reference $r$, let $S_{q,r}$ be the set of passing alignment segments after identity, MAPQ, secondary-PAF, and length filters. For segment $s$, let $Q_s$ be its query interval, $R_s$ its reference interval, and $l_s = |Q_s|$ its query-aligned length. ChromoSort aggregates those segments as:
\[M^Q_{q,r} = \left|\bigcup_{s \in S_{q,r}} Q_s\right|\] \[M^R_{q,r} = \left|\bigcup_{s \in S_{q,r}} R_s\right|\] \[C^Q_{q,r} = \frac{M^Q_{q,r}}{L_q}, \qquad C^R_{q,r} = \frac{M^R_{q,r}}{L_r}\] \[\bar{p}_{q,r} = \frac{\sum_{s \in S_{q,r}} p_s l_s} {\sum_{s \in S_{q,r}} l_s}.\]The resulting MatchMetrics object stores raw aligned bp, merged query bp,
merged reference bp, query coverage, reference coverage, weighted mean identity,
reference start/end, reference midpoint, dominant orientation, segment count,
and merged reference intervals.
The best-reference share used by sort is:
where $r^*$ is the best-ranked reference for query $q$.
Graph Model
graph.py reads the subset of GFA needed by ChromoSort:
Srecords becomeGraphNodeobjects with name, length, optional sequence, tags, and source line number.Lrecords becomeGraphEdgeobjects with source node, target node, orientations, raw overlap CIGAR, parsed overlap bp when exact, tags, and line number.AssemblyGraphindexes outgoing and incoming edges by oriented key $(node, orientation)$ where $orientation \in {+,-}$.
The parser is strict for missing linked segments and inconsistent LN:i
lengths when sequence is present. It only converts simple overlap CIGARs made
of M, =, and X operations into exact trim lengths. * or complex CIGARs
containing insertion, deletion, clipping, skipped, or padding operations return
unknown overlap (None) so downstream sequence-changing code cannot treat
them as validated trim lengths.
Command-Level Records
The command modules layer additional records over these primitives:
Assignmentrecords store the selected reference, status, best and second match, overlap-filter annotations, split-candidate annotations, output name, and orientation outcome forsort.QueryBlock,BlockGroup,SplitPiece, andContigPlanrepresent alignment blocks, smoothed breakpoint groups, emitted pieces, and planner status forfix.ScaffoldMember,GapRecord,GraphGapRecord, andScaffoldRecordrepresent ordered contigs, inferred gaps or overlaps, graph evidence at adjacent contig junctions, and emitted scaffold records forscaffold.FillPlanrecords graph path status, support counts, risk flags, fill sequence, right-flank trim length, reviewed acceptance, and applied status forgapfill.ReviewEventrecords use schemachromosort-review-event-v1and carry task type, candidate action, target, default acceptance, status, confidence, reason, notes, and task-specific evidence fields. They are written bychromo eval fix/scaffold/gapfill, embedded bychromo manual fix/scaffold/gapfill, and consumed byfix,scaffold, orgapfillthrough--reviewed-plan.- Manual recipes use schema
chromosort-manual-v1and store 1-based inclusive source slices, output names, strand choices, scaffold labels, and removal flags.
Long-Read Evidence Records
longreads.py parses long-read-to-assembly PAF records into read-level
alignments keyed by read name and target contig. These records are not a
general-purpose aligner replacement. They are used only by chromo eval modes
to summarize breakpoint and contig-end support, and by chromo manual to label
the optional long-read PAF evidence panel when review-table fields are present.
gaf.py parses long-read-to-graph GAF rows into oriented node paths. The shared
summary functions are used in four places:
chromo eval fix --gafsummarizes reads touching a candidate graph node.chromo eval scaffold --gafcompares support for selected and alternate GFA paths between adjacent contigs.chromo eval gapfill --gafandchromo gapfill --gafsummarize candidate fill-path support from the same traversal logic.chromo manual --gafdisplaysgaf_*review-table fields for the selected event; it records the evidence input but does not independently recompute GAF support in the browser generator.
Core Algorithms
Alignment Normalization and Metric Construction
The parser layer streams coords or PAF rows, skips malformed rows, applies
filter thresholds, and yields normalized Segment objects. The aggregation
layer groups segments by (query, reference), merges query and reference
intervals independently, and computes the formal metrics above. This design is
appropriate for fragmented whole-genome alignments because it avoids double
counting overlapping alignment rows while preserving raw aligned bp and segment
count for review.
Reference Assignment and Duplicate Filtering
chromo sort evaluates every assembly FASTA record against its reference
matches:
- For each query, matches are ranked by merged query bp, raw query bp, and mean identity.
- The best match is rejected as
below_min_aligned_bp,below_min_query_cov, orambiguous_ref_matchif it fails the configured support thresholds. - A strong multi-reference record can be protected as
kept_split_candidatewhen at least two references meet the split-candidate bp and query-fraction thresholds and the best-reference share is not too high. This keeps likely chimeric contigs available forchromo fix. - A large alignment can be rescued as
kept_large_alignmentwhen it narrowly fails query coverage but satisfies the large-alignment thresholds. - Kept records are passed to the duplicate-overlap filter unless
--no-overlap-filteris used.
For each reference, the duplicate-overlap pass ranks candidate assignments by merged reference bp, merged query bp, query coverage, mean identity, and query length. Higher-ranked contigs claim reference intervals first. Candidate intervals are either:
- a broad first-to-last reference span (
--overlap-mode span), or - exact merged alignment intervals (
--overlap-mode alignment).
Let $I$ be the candidate interval set and $C$ be the already covered interval set on the same reference. Novel reference support is:
\[N(I,C) = |I \setminus C|, \qquad F(I,C) = \frac{N(I,C)}{|I|}.\]A lower-ranked contig is kept when the configured absolute and fractional novel
reference thresholds pass. If not, split-candidate protection and terminal
overlap rescue are considered before the contig is labeled
terminal_overlap or duplicate_overlap.
Terminal-overlap rescue is intentionally narrower than general overlap acceptance. It applies only when the novel interval is a one-sided extension of a single candidate interval:
\[E = N(I,C), \qquad E \ge E_{\min}, \qquad \frac{E}{|I|} \ge f_{\min}.\]This distinguishes plausible dovetail extensions from contained duplicate fragments.
Breakpoint Planning for chromo fix
chromo fix starts from filtered alignment segments and converts them into
query-ordered QueryBlock records. Adjacent blocks with the same target
reference and orientation can be merged when they are separated by at most the
configured query gap. Candidate split signals are then planned differently by
mode:
| Mode | Candidate signal | Planner signature | Smoothing path | Consequence |
|---|---|---|---|---|
chromosome |
Reference/chromosome changes only. | reference |
Dynamic-programming breakpoint smoothing. | Same-reference orientation changes are absorbed into the dominant reference piece. |
conservative |
Reference/chromosome changes, plus same-reference orientation events only when the orientation-aware pre-plan meets the complex-inversion criteria. | Usually reference; switches to (reference, orientation) only for qualifying same-reference complex orientation events. |
Dynamic-programming breakpoint smoothing. | Default breakpoint-averse behavior: strong inter-reference chimeras are eligible, simple same-reference inversions are usually left intact. |
comprehensive |
All reference/chromosome changes and all same-reference orientation changes. | (reference, orientation) |
Dynamic-programming breakpoint smoothing. | Broader orientation-aware review mode. It can expose inversion candidates, but it is not a guaranteed superset of conservative because the target signature changes the smoothing model and piece-support checks. |
sensitive |
Every passing reference/orientation transition after adjacent same-target collapse. | (reference, orientation) |
No breakpoint-penalty smoothing; only direct run collapse, minimum emitted-piece length, and the breakpoint guard apply. | Debugging or intentionally aggressive scans where every passing transition should be visible. |
The important implementation distinction is that comprehensive is not
implemented as “conservative plus extra calls.” It runs the smoothed planner
with orientation included in every target signature. That can split the evidence
into different candidate pieces or reject a plan for weak-piece support even
when conservative would split the same contig using reference-only signatures.
The smoothed planner uses identity-weighted bp:
\[w_i = l_i \frac{p_i}{100}\]for block $i$ with aligned length $l_i$ and identity $p_i$. For an interval of blocks $G$, support is accumulated by signature $\sigma$ (reference alone or reference plus orientation, depending on mode). The discordance cost of keeping that interval as one piece is:
\[D(G) = \sum_{i \in G} w_i - \max_{\sigma} \sum_{i \in G, \sigma_i=\sigma} w_i.\]Dynamic programming chooses a partition of the block list that minimizes:
\[\operatorname{cost}(\mathcal{P}) = \sum_{G \in \mathcal{P}} D(G) + \lambda \left(|\mathcal{P}| - 1\right),\]where $\lambda$ is --breakpoint-penalty-bp. The planner reports a score
equivalent to the reduction in discordance after paying breakpoint penalties:
After partitioning, terminal groups that fail support thresholds can be merged into neighbors, all remaining groups must pass minimum aligned-bp and query-fraction thresholds, and a per-contig breakpoint guard can reject overly fragmented plans. Accepted breakpoints are placed midway between adjacent alignment blocks, then clamped to the contig sequence length. This produces piece coordinates that are deterministic and easy to audit, while smoothing small discordant blocks and local structural-variant-like noise by default.
Explicit Cutting
chromo cut is deliberately not evidence-driven. It accepts repeatable
CONTIG:POS[,POS...] cut specifications, a cut file, or a single-contig
shorthand. Cut positions are 1-based and mean “cut after this base.” The
planner rejects missing contigs, duplicate cut positions, terminal positions,
and any plan that would create pieces shorter than --min-piece-bp. It also
checks that emitted FASTA IDs will remain unique.
Manual Review and Application
chromo manual computes the same best-match metrics used by sort, embeds
filtered alignment segments, optionally embeds graph node context, and writes a
browser-based dashboard. Initial pieces cover each source contig completely,
with the scaffold label initialized to the best reference or unplaced.
Task-specific manual fix, manual scaffold, and manual gapfill modes load
the shared review-event table with --review-table, open a focused event
queue, and render selected-event evidence panels for alignment, GFA,
long-read PAF, and GAF inputs when those data are available.
chromo manual apply is schema-driven. It validates the recipe schema, verifies
that requested source contigs and slices exist, reverse-complements pieces whose
strand is -, omits removed pieces, and optionally groups active pieces into
scaffolds joined by fixed N gaps. The recipe is therefore the sole authority
for manual edits.
Eval Tables and Reviewed Execution
chromo eval is the table-first counterpart to task-specific manual review.
It does not introduce a separate decision algorithm; it reuses the algorithm
that the matching executor already exposes, then serializes candidate decisions
as ReviewEvent rows with enough evidence fields for spreadsheet or dashboard
review.
eval fixactivates thefixbreakpoint planner and adds optionalgraph_*,gaf_*, andlongread_*fields. Accepted rows are executable only throughfix --reviewed-plan, which bypasses alignment re-planning and applies the reviewed source slices.eval scaffoldactivates scaffold gap/overlap calculation and optional graph junction evidence. Accepted rows are executable only throughscaffold --reviewed-plan, where they override matching gap lengths but do not reorder, orient, or trim contigs by themselves.eval gapfillactivates the gapfill planner and writes the same path, support, validation, and fillability concepts as a review-event table. Accepted rows are executable only throughgapfill --reviewed-plan --apply, where the current plan must still match the reviewed path.
Scaffold Construction
chromo scaffold groups kept assignment rows by assigned reference and uses
the ordered FASTA from the same sort run. For adjacent contigs $a$ and $b$ on
the same scaffold, the reference-inferred gap is:
If $g \ge 0$, the scaffold inserts either $g$ Ns or a fixed configured gap. If $g < 0$, the adjacent reference spans overlap by:
\[o(a,b) = -g(a,b).\]Overlap handling is policy-controlled:
zero-gapwrites no Ns and keeps both sequences unchanged.warndoes the same but emits warnings.trim-referencetrims the right-contig prefix by the reference-inferred terminal overlap.trim-sequencetrims only when the right-prefix/left-suffix sequence identity over the overlap meets the configured threshold.graph-overlap-policy confirmcan trim a terminal overlap when an oriented direct GFA edge confirms the junction.
Nonterminal overlaps are reported but not trimmed by the terminal-only trimming policies.
Graph-Supported Gap Filling
chromo gapfill plans fills between adjacent sorted contigs using oriented GFA
paths. Path enumeration is breadth-first, orientation-aware, bounded by
--max-path-edges, bounded by --max-candidate-paths, and cycle-aware. Missing
graph nodes produce missing_node, absent paths produce no_graph_path, and
multiple candidate paths remain ambiguous_paths unless exactly one path is
selected by support evidence.
Support evidence is computed as follows:
- GAF support counts read paths that contain the candidate oriented path or its reverse-oriented counterpart as a contiguous subpath.
- Hi-C support sums contact counts across adjacent node pairs on the candidate path.
- Reference-placement PAF support sums the best identity-weighted query bp for intermediate graph nodes whose placements match the scaffold reference, orientation, and expected reference gap window.
If different evidence sources uniquely select different paths, ChromoSort reports the conflict and refuses to fill automatically. When a path is selected, sequence reconstruction still must pass validation:
- the GFA path must include sequence for both flank nodes and all intermediate nodes,
- the oriented flank node sequences must exactly match the adjacent FASTA records,
- every applied GFA link overlap must have a known simple bp length,
- overlap lengths must be valid relative to sequence lengths,
- the resulting inserted sequence must not exceed
--max-fill-bp.
Only fillable plans can be applied. With a reviewed plan, the accepted row
must still exist in the current plan, the current path must match the reviewed
path exactly, and the current status must still be fillable.
Plotting
chromo plot draws direct visualizations from existing alignment rows. It
constructs cumulative offsets for reference and query FASTA records:
and maps normalized alignment endpoints into a whole-genome or per-reference
canvas. Forward and reverse alignments are rendered as line segments with
different colors. Optional assignment input reorders the query axis by kept
ChromoSort assignments without re-running placement. Optional selected-reference
input filters the reference axis, alignment rows, and --per-ref panel set
without changing the underlying FASTA or alignment file. The
dot-plot guide describes how these
segments map to common review patterns such as clean collinearity, reversed
contigs, internal inversions, chimeric candidates, duplicate signal, and
alignment gaps.
Evidence, Scoring, Ranking, or Decision Logic
ChromoSort does not currently implement a probabilistic confidence model. Instead, it uses deterministic scores and thresholded evidence quantities that are visible in reports.
| Decision point | Quantity | Interpretation |
|---|---|---|
Best reference in sort |
merged_query_bp, raw_query_bp, avg_identity |
Ranks each contig’s candidate reference matches. |
| Ambiguous reference match | $B_q$ | Fraction of all merged query-aligned bp assigned to the best reference. |
| Duplicate-overlap filtering | $N(I,C)$ and $F(I,C)$ | Absolute and fractional novel reference contribution relative to stronger kept contigs. |
| Split-candidate protection | Number of supported references and best-reference share | Prevents strong multi-reference contigs from being discarded before fix review. |
| Breakpoint smoothing | $\operatorname{cost}(\mathcal{P})$ and planner score | Balances discordant alignment support against a fixed breakpoint penalty. |
| Long-read breakpoint support | Spanning, split, edge, and nearby read counts in eval fix |
Reports support around candidate breakpoints without automatically changing the fix planner. |
| Scaffold overlap trimming | $g(a,b)$, $o(a,b)$, overlap class, optional sequence identity | Determines gap length and whether an explicitly allowed trim is safe to perform. |
| Long-read bridge support | Bridge count, orientation summary, read-order summary, and median read-space gap in eval scaffold/gapfill |
Reports whether reads support adjacent contig ends in assembly-coordinate space. |
| GAF node or path support | Node traversal counts, selected-path support, best alternate support, and support status | Reports graph-read support in eval/manual; can resolve gapfill branches only under gapfill support thresholds. |
| Reviewed row acceptance | accept field in chromosort-review-event-v1 rows |
Allows human-reviewed rows to drive fix, scaffold, or gapfill after task-specific validation. |
| Gapfill branch selection | Unique GAF, Hi-C, or reference-placement support | Resolves graph branches only when one path clears threshold and is not contradicted. |
| Gapfill risk annotation | Branching, high-degree nodes, self-loops, unsequenced nodes, cycles, weak/tied support | Flags cases that need review even when a path is reported. |
The gapfill branch-complexity score is additive and intentionally simple:
\[R = \max(0, P-1) + |H| + 2|S| + 2|U| + C + |X|,\]where $P$ is the number of candidate paths, $H$ is the set of high-degree nodes, $S$ is the set of self-loop nodes, $U$ is the set of unsequenced nodes, $C$ is the number of avoided cycles during path search, and $X$ is the set of extra warning flags such as weak, tied, conflicting, or sequence-validation failures. The score is a review aid, not a calibrated probability.
Region, Window, Graph, or Table Construction Logic
Interval Regions
All alignment-derived region logic is built from merged half-open intervals. This applies to coverage, duplicate-overlap filtering, terminal-overlap classification, and reference span reporting. A candidate’s overlap with prior claims is computed by two-pointer interval intersection after sorting; novel intervals are computed by subtracting the covered set from the candidate set.
Reference Gap Windows
Scaffold gaps and reference-placement gapfill support use adjacent assignment coordinates. For same-reference gapfill candidates, the expected placement window is bounded by the adjacent assignment edges:
\[W(a,b) = [\min(end_a, start_b), \max(end_a, start_b)].\]Intermediate graph-node placements from --ref-paf support a candidate path
only when their reference midpoint falls inside this window and the orientation
matches the candidate path.
Graph Paths
Graph path construction uses oriented keys. Starting keys are determined from the left assignment orientation when known, otherwise both orientations are allowed. Target keys are built the same way from the right assignment. The BFS state contains the current oriented node, the path of edges taken so far, and the seen oriented nodes. Revisiting a seen oriented node is counted as an avoided cycle and is not expanded.
Output Tables
TSV reports are not secondary decorations; they are command contracts. For example:
sortassignment tables feedscaffold,gapfill, and ordered plots.evalreview-event tables feed task-specificmanualdashboards and the matchingfix,scaffold, orgapfillreviewed execution path.fix,cut, andmanual applyreports describe every emitted piece.scaffold_gaps.tsvrecords raw inferred gaps, final gap sizes, overlap class, policy, action, trim length, and optional sequence identity.graph_gaps.tsvrecords direct GFA adjacency, shortest reported path, and missing-node status for scaffold junctions.gapfill_plan.tsvrecords selected paths, alternative support, risk flags, fill status, fill length, right trim, acceptance, and applied status.
Cross-source or Multi-stage Reconciliation
ChromoSort reconciles evidence sources through explicit names and stage-local tables:
- Alignment evidence is keyed by reference and query FASTA names. PAF and
coords inputs are interchangeable only after normalization into
Segmentrecords. - Graph evidence resolves contig names conservatively. Scaffold and gapfill
junctions try original contig names, ChromoSort
new_namevalues, and FASTA record names when mapping records back to GFA segments. - Review-event rows are task-scoped.
manual fix/scaffold/gapfillrejects rows whose task does not match the selected mode, and executors consume only accepted rows for their own task. - Manual-review recipes are validated against the original assembly FASTA; they do not inherit implicit state from the dashboard.
- Reviewed scaffold gaps are checked against the current scaffold, left contig, and right contig names before application.
- Reviewed gapfill plans are checked against the current graph path before application, preventing stale accepted paths from being silently applied.
- Sequence-editing stages can change contig IDs and coordinates. The intended workflow is to re-align edited FASTA records before using downstream alignment-dependent commands.
This reconciliation strategy favors reproducibility over convenience. It avoids implicit cross-stage mutation and forces each downstream stage to consume the current FASTA and the corresponding current reports.
Implementation Architecture
The source tree is a small Python package under src/chromosort/.
| Module | Role |
|---|---|
cli.py |
Top-level chromo dispatcher and subcommand routing. |
reference_order.py |
FASTA length scanning, alignment parsing, interval utilities, match aggregation, assignment decisions, duplicate-overlap filtering, ordered FASTA writing, and sort reports. |
fix_contigs.py |
Chimeric-contig block collection, breakpoint planning, smoothing, split-piece writing, graph guard/report integration, and split reports. |
evaluate.py |
Table-only review-event generation for eval fix, eval scaffold, and eval gapfill, reusing executor planners and adding optional graph, long-read PAF, and GAF evidence fields. |
review.py |
Shared review-event schema, TSV column ordering, accept-value parsing, duplicate-id checks, and read/write helpers used by eval, manual, and reviewed executors. |
longreads.py |
Long-read-to-assembly PAF parsing and summaries for breakpoint support and contig-end bridge evidence. |
gaf.py |
Long-read-to-graph GAF parsing, oriented path matching, candidate traversal support summaries, and node-level advisory summaries. |
cut.py |
Explicit cut parsing, validation, FASTA splitting, and cut reports. |
manual.py |
Dashboard data construction, graph context embedding, task review-event queues, modular evidence panels, recipe validation, manual piece/scaffold application, and manual reports. |
plot.py |
Alignment filtering, axis construction, SVG/PDF/PNG rendering, and per-reference plots. |
scaffold.py |
Assignment-table reading, scaffold grouping, reviewed gap overrides, reference gap/overlap logic, graph junction reporting, and scaffold FASTA writing. |
gapfill.py |
GAF/Hi-C/reference-placement support integration, graph path enumeration, branch resolution, sequence reconstruction, review dashboard generation, reviewed-plan validation, and optional gapfilled FASTA writing. |
graph.py |
GFA S/L parser, oriented graph indexes, direct-edge queries, node evidence, and link evidence. |
paths.py |
Output directory creation and small shared filesystem helpers. |
Most dependencies are from the Python standard library. Pillow is used for PNG plot output. Jekyll-related files support the documentation site and are separate from runtime package behavior.
Complexity and Scaling
The following bounds describe the implemented algorithms, not biological limits. Let $S$ be the number of passing alignment segments, $V$ graph nodes, $E$ graph edges, $Q$ query contigs, $B_q$ fix blocks for contig $q$, and $G$ adjacent scaffold gaps.
| Component | Time complexity | Memory behavior |
|---|---|---|
| Alignment parsing | $O(S)$ | Stores grouped intervals and segment summaries needed by the command. |
| Match aggregation | $O(S + \sum_{q,r} k_{q,r}\log k_{q,r})$ | Interval lists per query-reference pair before merging. |
| Assignment ranking | $O(\sum_q R_q \log R_q)$ where $R_q$ is matched references per query | Per-query match lists. |
| Duplicate-overlap filtering | $O(\sum_r A_r \log A_r + I_r)$ plus interval subtraction/intersection work | Covered interval sets per reference; worst cases grow with fragmentation. |
| Fix block merging | $O(B_q \log B_q)$ per contig | Query blocks for selected contigs. |
| Fix smoothing DP | $O(B_q^2)$ per selected contig | DP table and interval-cost cache over block ranges. |
| Cut planning | $O(C \log C)$ per cut contig plus FASTA streaming | Stores cut positions and FASTA length records. |
| GFA parsing | $O(V+E)$ | Stores all parsed nodes, edges, and orientation indexes. |
| Review-event TSV read/write | $O(R)$ for $R$ rows | Stores reviewed rows while checking task and event-id uniqueness. |
| Long-read PAF support summaries | Proportional to read alignments for the queried contig or contig pair | Stores filtered read alignments grouped by read and target. |
| GAF traversal summaries | Proportional to candidate paths times filtered GAF read-path length | Stores filtered GAF records and per-path read-name summaries. |
| Scaffold construction | $O(Q+G)$ without graph path reporting | Stores ordered FASTA records and scaffold gap reports. |
| Scaffold graph path report | $O(G \cdot d^D)$ in the bounded worst case | Bounded by --graph-max-path-edges; reports one shortest path per gap. |
| Gapfill path enumeration | $O(G \cdot \min(P,d^D) \cdot D)$ | Bounded by --max-candidate-paths $P$ and --max-path-edges $D$. |
| Gapfill GAF support | Proportional to candidate paths times read-path scan length | Stores filtered GAF path records. |
| Plot rendering | $O(S)$ draw elements | PNG additionally requires $O(width \cdot height)$ image memory. |
The most performance-sensitive code paths are interval merging, the fix planner’s quadratic DP over alignment blocks, graph BFS at scaffold/gapfill junctions, and large plot rendering. The exposed depth, candidate, segment, and format filters exist to bound those costs for larger assemblies.
Validation Strategy
The repository contains a pytest/unittest suite configured in pyproject.toml.
The tests are synthetic and deterministic; they are best interpreted as unit
and integration evidence for the implemented decision rules, not as biological
benchmarking.
Validation coverage currently includes:
- CLI dispatch and mutually exclusive argument checks.
- coords and PAF parity for sorting and fixing.
- best-reference assignment, duplicate-overlap rejection, split-candidate protection, terminal-overlap classification, terminal-extension rescue, and large-alignment rescue.
- graph assignment reports and graph guard warnings for sort/fix cases.
- breakpoint smoothing, sensitive/comprehensive/conservative fix modes, inversion handling, noisy small-event rejection, contig-subset parity, and breakpoint guardrails.
- review-event schema reading/writing, accepted/rejected value parsing, duplicate-id rejection, and task mismatch rejection.
eval fix,eval scaffold, andeval gapfilltable generation, including reviewed executor handoff cases.- long-read PAF breakpoint and contig-bridge summaries.
- GAF parsing, MAPQ filtering, node summaries, selected-path support, alternate support, and mixed GFA/PAF/GAF scaffold evidence.
- explicit cut validation for duplicate, terminal, and file-based cut inputs.
- manual dashboard ordering, optional sequence embedding, graph context, task-event queues, modular evidence panels, recipe slicing, reverse-complementing, removal, and scaffold grouping.
- plot generation for PDF, SVG, PNG, per-reference plots, and assignment-based query ordering.
- scaffold inferred/fixed gaps, negative reference-gap overlap reporting, reference trimming, sequence-confirmed trimming, and graph-confirmed trimming.
- GFA parsing of orientations, simple versus complex overlap CIGARs, missing links, and length mismatches.
- gapfill unique path application, reviewed-plan acceptance/rejection, stale path rejection, ambiguous branch refusal, GAF/Hi-C/reference-placement branch resolution, conflict refusal, risk flags, and review HTML embedding.
The tests/data/graph_gotchas/ fixture documents intentionally difficult graph
cases: an unambiguous path, an ambiguous branch, a cycle, orientation-specific
links, disconnected mapped nodes, unsequenced segments, and repeat/duplicate
warning contexts.
Runtime validation is also part of the architecture. Many user-facing errors
are raised as ValueError and converted to command exit status 2 with a clear
message. Examples include mutually exclusive alignment inputs, malformed GFA
records, missing assignment-table columns, stale reviewed gapfill paths, invalid
cut positions, missing source FASTA records, and unsupported graph safety flags
without --gfa.
The current GitHub Actions workflow deploys the documentation site from
docs/. At the time this document was written, test execution is configured
locally but is not represented as a repository CI workflow.
Non-goals and Limitations
- ChromoSort does not run MUMmer, minimap2, GraphAligner, Hi-C processing, or assembly tools internally. It consumes their outputs.
- ChromoSort does not polish base calls, call variants, estimate haplotypes, or validate biological correctness beyond the supplied evidence.
- The ranking and risk scores are deterministic heuristics, not calibrated probabilities or statistical confidence intervals.
- Reference-guided decisions inherit reference bias. Contigs absent from, highly diverged from, or structurally incompatible with the reference can be under-supported or classified for review.
- Duplicate-overlap filtering is per reference and based on reference interval novelty. It is not a global assembly graph optimization.
- GFA support is limited to segment (
S) and link (L) records. GFA path/walk records are not used by the core graph parser. - Sequence-changing graph fills require segment sequences and simple exact link overlap lengths. Missing sequences, complex overlap CIGARs, flank mismatches, invalid overlaps, overly long fills, or ambiguous/conflicting branches prevent automatic filling.
- Scaffold overlap trimming is policy-limited and terminal-focused. Nonterminal overlaps are reported for review rather than broadly edited.
- After FASTA-changing steps, downstream alignment-dependent commands should be run with fresh alignments against the edited FASTA.
- The package metadata currently classifies the project as alpha, so reviewers should treat command behavior as implemented and tested, but still evolving.
Reviewer’s Reading Guide
For code review, start with the command boundary and then follow the evidence models into the algorithms:
src/chromosort/cli.pyfor public command routing.src/chromosort/reference_order.pyfor alignment normalization, interval math, match aggregation, sort assignment, and duplicate-overlap logic.src/chromosort/fix_contigs.pyfor breakpoint planning, smoothing, mode behavior, and emitted split pieces.src/chromosort/evaluate.pyandsrc/chromosort/review.pyfor review-event table generation, schema constraints, and reviewed handoff fields.src/chromosort/graph.py,src/chromosort/gaf.py, andsrc/chromosort/longreads.pyfor graph and read evidence constraints.src/chromosort/scaffold.pyandsrc/chromosort/gapfill.pyfor junction-level gap, overlap, graph path, and fill behavior.src/chromosort/manual.py,src/chromosort/cut.py, andsrc/chromosort/plot.pyfor review, explicit edit, and visualization surfaces.tests/test_reference_order.py,tests/test_fix_contigs.py,tests/test_eval.py,tests/test_review.py,tests/test_longreads.py,tests/test_gaf.py,tests/test_scaffold.py,tests/test_gapfill.py,tests/test_graph.py,tests/test_manual.py,tests/test_cut.py,tests/test_plot.py, andtests/test_cli.pyfor executable examples of the expected behavior.
For a methods review, the most important invariants to check are:
- interval normalization and merged-bp accounting,
- conservative handling of ambiguous or duplicate alignments,
- breakpoint smoothing and support thresholds,
- explicitness of every sequence-changing operation,
- review-event task scoping and accepted-row validation,
- graph path validation before sequence insertion,
- distinction between report-only evidence and evidence allowed to select or apply a sequence-changing path,
- stale-plan prevention for reviewed gap fills,
- report schemas used as contracts between stages.