This repository was archived by the owner on Jan 25, 2023. It is now read-only.

Start should be 1-based, not 0-based #251

Closed
teemukataja opened this issue Dec 11, 2018 · 12 comments

@teemukataja
Contributor

In the Beacon specification, the start key is described as 0-based, whereas the VCF specification describes the position as 1-based: "POS - position: The reference position, with the 1st base having position 1."

I verified this using the IGV genome browser. On further research, other genomic file types also appear to use the 1-based system.
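As a minimal sketch of the mismatch described above, the following shows how a 1-based VCF POS would map to a 0-based start value. The function name is hypothetical and is not part of the Beacon specification or of any library; the sketch covers ordinary base positions only and deliberately ignores the special telomere positions that VCF allows.

```python
def vcf_pos_to_zero_based_start(pos: int) -> int:
    """Convert a 1-based VCF POS to a 0-based start coordinate.

    Hypothetical helper for illustration only. Ordinary base positions
    start at 1 in VCF, so the first base maps to 0.
    """
    if pos < 1:
        # Assumption: this sketch does not handle VCF's telomere special case.
        raise ValueError("expected an ordinary 1-based position (pos >= 1)")
    return pos - 1


# The first base of the reference: VCF POS 1 becomes start 0.
print(vcf_pos_to_zero_based_start(1))
```

Going the other way is the mirror image: add 1 to a 0-based start to recover the VCF POS.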

@mbaudis
Member

mbaudis commented Dec 11, 2018

The decision early on was to follow GA4GH standards, which are 0-based, half-open.

(The lack of clear documentation of "GA4GH standards" strikes again ...).

So, 0 based it should be.

@mbaudis mbaudis closed this as completed Dec 11, 2018
@teemukataja
Contributor Author

I would like to understand this use case, and I couldn't find anything in past issues. Can you point me to where I can find information on this decision, if no such standards document exists?

@mbaudis
Member

mbaudis commented Dec 11, 2018

@teemukataja

  1. in VMC, which constitutes the main active GKS project https://docs.google.com/document/d/12E8WbQlvfZWk5NrxwLytmympPby6vsv60RxCeD5wc1E/edit (see page 16)
  2. in the frozen GA4GH schema https://github.com/ga4gh/ga4gh-schemas/blob/master/src/main/proto/ga4gh/variants.proto#L168

Hope this helps ...

@teemukataja
Contributor Author

Thank you.

@teemukataja
Contributor Author

@mbaudis

The VMC data model on page 16 suggests that nucleotides follow a 1-based counting convention.

Upon reading more about bases and interbases, I strongly feel that Beacon should follow the standards of genomic files, such as VCF, which uses the 1-based system. Because the 1-based system tells you the position of the base of interest, I think it fits the role of Beacon better. Interbases might be better for applying data science to datasets, but the role of Beacon is to find those datasets first.

@mbaudis
Member

mbaudis commented Dec 11, 2018

Interbase coordinates

I'm not married to any concept, but there have been endless discussions already. Also, this is a clear case where Beacon just has to pick up whatever format is selected as a "GA4GH standard". File formats and browsers all use different coordinate systems.

Quote:

Moving from UCSC browser/tools to Ensembl browser/tools or back
* Ensembl uses 1-based coordinate system
* UCSC uses 0-based coordinate system
* Some file formats are 1-based (GFF, SAM, VCF) and others are 0-based (BED, BAM)
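To make the difference between the conventions in the quoted list concrete, here is a small sketch of converting an interval between 1-based closed coordinates (as used by GFF) and 0-based half-open coordinates (as used by BED). The function names are hypothetical and chosen for illustration only.

```python
def one_based_closed_to_zero_based_half_open(start: int, end: int) -> tuple[int, int]:
    """1-based, both ends inclusive -> 0-based, end exclusive."""
    # Only the start shifts; the inclusive 1-based end equals the exclusive 0-based end.
    return start - 1, end


def zero_based_half_open_to_one_based_closed(start: int, end: int) -> tuple[int, int]:
    """0-based, end exclusive -> 1-based, both ends inclusive."""
    return start + 1, end


# The first ten bases of a sequence in each convention.
print(one_based_closed_to_zero_based_half_open(1, 10))   # (0, 10)
print(zero_based_half_open_to_one_based_closed(0, 10))   # (1, 10)
```

Note that only the start coordinate changes, which is why off-by-one errors between tools tend to appear at the start of a region rather than at its end.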

Pinging @andrewyatz @reece ...

@andrewyatz

You pinged?

With reference to GA4GH, there is additional context available from @jmarshall's comment on a PR of mine for refget. The use of 0-based, inclusive coordinates is now a convention of GA4GH specifications. It certainly isn't a standard. If it were, this has been left behind in the pre-Infinity-War-like snap of GA4GH, but the vague notion that we prefer 0-based, inclusive pervades.

@mbaudis
Member

mbaudis commented Dec 11, 2018 via email

@jmarshall

jmarshall commented Dec 11, 2018

Quote:

I would like to understand this use case

It is clear that 0-based half-inclusive intervals are the appropriate representation to use for arithmetic (and hence machine communications). If this doesn't seem clear to you, reread the epic threads linked to in the comment (samtools/hts-specs#327 (comment)) that @andrewyatz pointed to.
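A brief sketch of why this representation is convenient for arithmetic: with 0-based half-open intervals, lengths, splits, and overlap tests need no `+ 1` or `- 1` corrections. The helper names below are hypothetical and for illustration only.

```python
Interval = tuple[int, int]  # (start, end): 0-based, end exclusive


def length(iv: Interval) -> int:
    """Number of bases covered: simply end - start."""
    return iv[1] - iv[0]


def split(iv: Interval, at: int) -> tuple[Interval, Interval]:
    """Split at a coordinate; the two halves share the boundary value."""
    return (iv[0], at), (at, iv[1])


def overlaps(a: Interval, b: Interval) -> bool:
    """True if the intervals share at least one base."""
    return a[0] < b[1] and b[0] < a[1]


region = (0, 10)
left, right = split(region, 4)
print(left, right)                                     # (0, 4) (4, 10)
print(length(left) + length(right) == length(region))  # True
print(overlaps(left, right))                           # False: adjacent, not overlapping
```

With 1-based closed intervals, the same operations would need `end - start + 1` for the length and `at - 1` or `at + 1` for the halves of a split.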

[So it's obviously the right representation for APIs; whether it's GA4GH's policy to use this representation is a separate question.]

As Beacon is a web service API, its purpose is machine communications; therefore 0-based half-inclusive is the natural representation. This statement about purpose is a bit more quibble-able, so GA4GH codified this choice as a policy, or a “standard” if you will. (Those of us who were there at the time remember this use of the word “standard” with relief, as it was the end of endless discussions!) This is reflected in secondary sources such as the htsget spec:

We use the following pan-GA4GH standards:

  • 0 start, half open coordinates

Tragically the primary sources (some GA4GH press release or minutes of some meeting), if any, have been obfuscated by subsequent web site reorganisations…

@reece

reece commented Dec 11, 2018

Humans use 1-based inclusive. That shouldn't and won't change.

Interbase coordinates are conceptually cleaner than inclusive coordinates (regardless of base), especially when distinguishing insertions and deletions, and for edits at the termini. APIs should use interbase.

I can't think of any technical benefit for 0-based inclusive coordinates.

@jmarshall

(For the avoidance of doubt,) “interbase” and “zero-based half-inclusive” are two names for the same representation (and the latter name is more formally “zero-based half-open” I guess). I think in @andrewyatz's comment he was meaning the latter but inadvertently elided the “half-”.

@reece

reece commented Dec 12, 2018

@jmarshall: Funny, I removed a point clarifying this because I thought it was a distraction. I guess I should have left it in.

Although interbase and 0-based, right-open are numerically equivalent, they're semantically distinct. Interbase provides important conceptual clarity.

0-based, right-open refers to residues, which makes it awkward to refer to insertion points at the termini because you have to refer to imaginary residues. Also, with residue-based coordinates, insertions use exclusive coordinates but deletions and substitutions use inclusive coordinates. That is, 5_6 refers to the space between 5 and 6 for an insertion, but refers to 5 and 6 inclusively for a deletion or MNV.
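The distinction can be sketched with a toy sequence. In interbase coordinates, positions name the gaps between residues, so an insertion is a zero-length interval and a deletion is an interval spanning the removed residues; both use the same rule. The `apply_edit` helper and the example sequence are hypothetical, for illustration only. Python slice indices happen to behave numerically like interbase positions.

```python
def apply_edit(seq: str, start: int, end: int, replacement: str) -> str:
    """Replace the interbase interval [start, end) of seq with replacement.

    start == end denotes an insertion point (a gap between residues).
    """
    if not 0 <= start <= end <= len(seq):
        raise ValueError("interval must lie within 0..len(seq)")
    return seq[:start] + replacement + seq[end:]


seq = "ABCDEFGH"  # residues 1..8 when counted in the 1-based way

# Insertion between the 5th and 6th residues: zero-length interval (5, 5).
print(apply_edit(seq, 5, 5, "xx"))  # ABCDExxFGH

# Deletion of the 5th and 6th residues: interval (4, 6).
print(apply_edit(seq, 4, 6, ""))    # ABCDGH

# Insertions at the termini need no imaginary residues.
print(apply_edit(seq, 0, 0, "x"))   # xABCDEFGH
print(apply_edit(seq, 8, 8, "x"))   # ABCDEFGHx
```

In the residue-based notation from the comment above, the first two edits would both be written 5_6, with the meaning depending on the edit type; in interbase they are (5, 5) and (4, 6).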

5 participants