Genomi

Active Genome Index

The local, queryable genome artifact Genomi builds from VCF, gVCF, BAM, FASTQ, and consumer raw genotype exports.

An Active Genome Index (AGI) is Genomi's local, queryable representation of a genome source. It exists so that agents do not have to reason over giant raw files directly.

Supported sources

Consumer array sources are primarily presence/genotype evidence at assayed sites. Sequencing-derived sources can support richer depth, genotype quality, and callability checks when those fields are present.

What parsing creates

genomi.parse_source detects the source type and writes a durable record under GENOMI_HOME. The core artifacts are:

| Artifact | Purpose |
| --- | --- |
| work/active-genome-index.sqlite | Query tables for variants, reference spans, metadata, stats, and source header lines |
| canonical BGZF VCF | Genomi-owned normalized source used for random access |
| evidence/evidence.sqlite | Per-index evidence storage |
| shared-evidence.sqlite | Shared reviewed evidence storage under GENOMI_HOME |
| registry.json | Users, assigned AGIs, active AGI IDs, default user, and response profile |
| session context.json | Chat-scoped active user, active AGI, and access grants |

The SQLite schema currently requires metadata, stats, records, spans, source header lines, and region/variant/rsID query indexes. Rebuilding is handled by the lifecycle rules below.

Variants first, reference tail in the background

A whole-genome gVCF is ~96% reference blocks, so genomi.parse_source splits the work for gVCFs into two phases:

  • Phase A: variants pass (synchronous). Every variant row is parsed and written, and final stats are computed. The index reaches variants_ready, and the full interpretation surface (rsID, gene, region, exact-allele lookup, ClinVar, PRS) is queryable in minutes.
  • Phase B: reference pass (background). A detached background job, active_genome_index.build_reference_pass (auto-launched, idempotent, internal), coalesces and appends the reference-block tail, then flips the index to completed. Its job_id is surfaced in the parse result; agents poll genomi.check_background_job instead of reparsing.

Plain VCFs, small files, capped (max_records) parses, consumer arrays, and BAM/FASTQ stay single-phase because there is no reference tail to defer. Until Phase B reports completed, reference-dependent reads (callability, genotype-support, callset-QC, PRS scoring, ancestry overlap) carry a reference_pending marker, so that a host treats a transient empty or negative result as provisional rather than final.
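The polling guidance above can be sketched as a small loop. This is a minimal illustration, not Genomi's client code: `check_job` is a hypothetical callable standing in for however the host invokes genomi.check_background_job, the `{"status": ...}` result shape is an assumption, and `failed` is an assumed terminal status that the text above does not name.

```python
import time
from typing import Any, Callable, Dict

# "completed" comes from the text above; "failed" is an assumed terminal status.
TERMINAL_STATUSES = {"completed", "failed"}


def wait_for_reference_pass(
    check_job: Callable[[str], Dict[str, Any]],
    job_id: str,
    interval_s: float = 5.0,
    max_polls: int = 120,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Poll a background job until it is terminal or the poll budget is spent.

    Returns the last result seen. The caller never triggers a reparse here.
    """
    result: Dict[str, Any] = {"status": "unknown"}
    for _ in range(max_polls):
        result = check_job(job_id)
        if result.get("status") in TERMINAL_STATUSES:
            break
        sleep(interval_s)
    return result


if __name__ == "__main__":
    # Fake job that finishes on the third poll (stand-in for the real tool call).
    seen = []

    def fake_check(job_id: str) -> Dict[str, Any]:
        seen.append(job_id)
        return {"status": "completed" if len(seen) >= 3 else "in_progress"}

    final = wait_for_reference_pass(fake_check, "ref-job-1", sleep=lambda _s: None)
    print(final["status"], len(seen))
```

Passing `sleep` and `check_job` as parameters keeps the loop testable without a running Genomi instance. A bounded `max_polls` means the host gets the last known state back instead of blocking indefinitely.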

Access and gating

Every capability that touches per-sample genome rows goes through a single gated reader rather than opening the SQLite file directly. This one entry point combines two concerns that used to be hand-stamped across individual handlers:

  • Session authorization. genomi.approve_agi_access records explicit user approval for the current chat. The gate raises a structured active_genome_index_approval_required envelope when an unapproved capability tries to read.
  • Readiness. The reader knows the parse state (complete / variants_ready / reference_pending) and gates reference-dependent operations lazily — cheap public prerequisite checks run first, and reference_pending is stamped once at the operation boundary, not in every handler. Capabilities that only need variants (prs.calculate_score, ancestry.check_sample_overlap, variant.resolve) degrade gracefully at variants_ready instead of hard-failing.

The net effect for agents: you don't manage per-capability gates. Approve once with genomi.approve_agi_access, then any AGI-reading tool returns a typed readiness envelope when the index isn't fully ready yet.
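To make the "single gated reader" idea concrete, here is a minimal sketch of a reader that checks approval first and stamps reference_pending once at the read boundary. It is an illustration under stated assumptions, not Genomi's implementation: the `GatedReader` and `ApprovalRequired` names, the `read` signature, and the envelope shape are all hypothetical. Only the active_genome_index_approval_required code and the parse-state names come from the text above.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict


class ApprovalRequired(Exception):
    """Raised when a capability reads an AGI without session approval."""

    def __init__(self, capability: str) -> None:
        super().__init__(f"approval required for {capability}")
        # Hypothetical envelope shape; only the error code is from the docs.
        self.envelope = {
            "error": "active_genome_index_approval_required",
            "capability": capability,
        }


@dataclass
class GatedReader:
    parse_state: str  # e.g. "complete" or "variants_ready"
    approved: bool = False

    def read(
        self,
        capability: str,
        fetch: Callable[[], Any],
        needs_reference: bool = False,
    ) -> Dict[str, Any]:
        # Concern 1: session authorization is checked before any row is read.
        if not self.approved:
            raise ApprovalRequired(capability)
        result: Dict[str, Any] = {"capability": capability, "data": fetch()}
        # Concern 2: readiness is stamped once here, not in every handler.
        if needs_reference and self.parse_state != "complete":
            result["reference_pending"] = True
        return result


if __name__ == "__main__":
    reader = GatedReader(parse_state="variants_ready")
    try:
        reader.read("variant.resolve", lambda: [])
    except ApprovalRequired as exc:
        print(exc.envelope["error"])
    reader.approved = True  # stands in for genomi.approve_agi_access
    print(reader.read("region_callability", lambda: [], needs_reference=True))
```

Because the handler only supplies `fetch` and a `needs_reference` flag, a variants-only capability keeps working at variants_ready, while a reference-dependent one returns its data with a provisional marker rather than failing.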

Users and AGIs

User/profile nicknames belong to people or profiles. Active Genome Index IDs belong to genome artifacts. A user can have multiple genome records and one selected active index.
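The relationship between profiles and indexes can be sketched as a small data model. This is only an illustration: the `UserProfile` class, its field names, and the rule that the first assigned index becomes the active one are assumptions, not the actual layout of registry.json.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class UserProfile:
    nickname: str                                      # belongs to a person or profile
    agi_ids: List[str] = field(default_factory=list)  # belong to genome artifacts
    active_agi_id: Optional[str] = None                # at most one selected index

    def assign(self, agi_id: str) -> None:
        """Link an AGI to this profile (idempotent)."""
        if agi_id not in self.agi_ids:
            self.agi_ids.append(agi_id)
        # Assumption: the first assigned index becomes the active one.
        if self.active_agi_id is None:
            self.active_agi_id = agi_id

    def select(self, agi_id: str) -> None:
        """Make one of the already-assigned AGIs the active index."""
        if agi_id not in self.agi_ids:
            raise ValueError(f"{agi_id} is not assigned to {self.nickname}")
        self.active_agi_id = agi_id


if __name__ == "__main__":
    profile = UserProfile("sam")
    profile.assign("agi-array")
    profile.assign("agi-wgs")
    profile.select("agi-wgs")
    print(profile)
```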

Useful base operations:

| Operation | Use |
| --- | --- |
| genomi.parse_source | Detect and digitize a source into an AGI |
| genomi.list_users | List user/profile metadata |
| genomi.assign_user_genome | Link a source or AGI to a profile |
| genomi.select_user | Select a profile for the session |
| genomi.set_default_user | Persist one default profile for GENOMI_HOME |
| genomi.approve_agi_access | Record explicit approval to read an existing AGI |
| genomi.describe_context | Inspect active context and response-profile guidance |

Evidence operations

The AGI itself is technical sample evidence. Interpretation comes from focused capabilities.

| Operation | Use |
| --- | --- |
| active_genome_index.summarize | Compact readiness and artifact summary |
| active_genome_index.classify_callset_qc | Callset shape, QC field availability, and absence-claim boundaries |
| active_genome_index.classify_genotype_support | Whether one allele has enough sample support |
| active_genome_index.classify_region_callability | Whether a region can support reference or absence claims |

For rsIDs, genes, or public evidence around a locus, start with variant.resolve through genomi.invoke after reading the variant skill.

Lifecycle states

genomi.describe_context and read operations report AGI readiness through an active_genome_index_readiness block with status and a structured reason code:

| Status | Agent action |
| --- | --- |
| complete | Continue with focused evidence tools |
| variants_ready | Continue with variant queries; reference-dependent results carry reference_pending until Phase B finishes. Poll genomi.check_background_job with the surfaced reference job; don't reparse |
| needs_reparse | Reparse from the recorded source path if it still exists |
| schema_too_new | Upgrade Genomi; do not downgrade the index by reparsing |
| missing source | Ask for the current source path before reparsing |

If genomi.parse_source returns status="in_progress", poll genomi.check_background_job. Do not replace a complete parse with a capped sample parse for user-facing interpretation.
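The status table above amounts to a small dispatch on the readiness block. The sketch below shows one way a host might encode it. The status strings are copied from the table as printed (the exact wire values may differ), while the action labels, the `next_action` helper, and the fallback for unknown statuses are hypothetical.

```python
from typing import Any, Dict

# Keys: status strings as printed in the table. Values: illustrative labels.
ACTIONS: Dict[str, str] = {
    "complete": "continue",
    "variants_ready": "continue_and_poll_reference_job",
    "needs_reparse": "reparse_from_recorded_source",
    "schema_too_new": "upgrade_genomi",
    "missing source": "ask_for_source_path",
}


def next_action(readiness: Dict[str, Any]) -> str:
    """Map an active_genome_index_readiness block to a host-side action label.

    Unknown or missing statuses fall back to a conservative label (assumption)
    so the host never guesses at a reparse.
    """
    status = readiness.get("status")
    return ACTIONS.get(status, "stop_and_report")


if __name__ == "__main__":
    for status in ("complete", "variants_ready", "schema_too_new", "other"):
        print(status, "->", next_action({"status": status}))
```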

Boundaries

  • The original genome source remains local.
  • Parsing does not automatically run every interpretation tool.
  • Missing library evidence is not negative evidence.
  • Consumer arrays cannot prove broad absence or coverage claims the way sequencing sources sometimes can.
  • Clinical decisions require clinical confirmation.