Start should be 1-based, not 0-based #251
Comments
The decision was made early on to follow GA4GH standards, which are 0-based half-open. (The lack of clear documentation of "GA4GH standards" strikes again ...). So, 0-based it should be. |
I would like to understand this use case and couldn't find anything on the past issues. Can you point me to where I can find information on this decision, if no such document of standards exist? |
Hope this helps ... |
Thank you. |
The VMC data model on page 16 suggests that nucleotides follow a 1-based counting convention. Having read more about bases and interbases, I strongly feel that Beacon should follow the conventions of genomic file formats such as VCF, which uses the 1-based system. Because the 1-based system tells you the position of the base of interest, I think it is a better fit for Beacon's role. Interbase coordinates might be better for applying data science to datasets, but the role of Beacon is to find those datasets first. |
I'm not married to any concept, but there have been endless discussions already. Also, this is a clear case where Beacon just has to pick up whatever format is selected as a "GA4GH standard". File formats and browsers all use different coordinate systems. Quote:
Pinging @andrewyatz @reece ... |
You pinged? With reference to GA4GH there is additional context available from @jmarshall comment on a PR of mine for refget. The use of |
That could have been me, repeatedly, again & again ;-)
"GA4GH really needs to have this kind of decision record in an easy to reference place"
So: Maybe this would be a tangible GKS product *before Christmas* - just confirm the coordinate system & document the choice?
I offer schemablocks.org :-)
… On 11 Dec 2018, at 17:34, Andrew Yates ***@***.***> wrote:
You pinged?
With reference to GA4GH there is additional context available from @jmarshall <https://github.com/jmarshall> comment on a PR of mine for refget <samtools/hts-specs#327 (comment)>. The use of 0-based, inclusive coordinates is now a convention of GA4GH specifications. It certainly isn't a standard. If it were this has been left behind in the pre infinity war like snap of GA4GH but the vague notion that we prefer 0-based, inclusive pervades.
|
It is clear that 0-based half-inclusive intervals are the appropriate representation to use for arithmetic (and hence machine communications). If this doesn't seem clear to you, reread the epic threads linked to in the comment (samtools/hts-specs#327 (comment)) that @andrewyatz pointed to. [So it's obviously the right representation for APIs; whether it's GA4GH's policy to use this representation is a separate question.] As Beacon is a web service API, its purpose is machine communications; therefore 0-based half-inclusive is the natural representation. This statement about purpose is a bit more quibble-able, so GA4GH codified this choice as a policy, or a “standard” if you will. (Those of us who were there at the time remember this use of the word “standard” with relief, as it was the end of endless discussions!) This is reflected in secondary sources such as the htsget spec:
Tragically the primary sources (some GA4GH press release or minutes of some meeting), if any, have been obfuscated by subsequent web site reorganisations… |
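To make the "appropriate for arithmetic" point concrete, here is a minimal Python sketch. It is only an illustration: the helper names (`length`, `overlaps`, `split`) are made up for this example and do not come from any GA4GH specification. Intervals are plain `(start, end)` tuples in 0-based half-open coordinates.

```python
# Hypothetical helpers, for illustration only.
# Intervals are (start, end) tuples: 0-based, end exclusive.

def length(iv):
    """Length is a plain subtraction, with no +1 correction."""
    return iv[1] - iv[0]

def overlaps(a, b):
    """Two half-open intervals overlap iff each starts before the other ends."""
    return a[0] < b[1] and b[0] < a[1]

def split(iv, at):
    """Split at a coordinate; the two parts share the boundary value."""
    if not iv[0] <= at <= iv[1]:
        raise ValueError("split point lies outside the interval")
    return (iv[0], at), (at, iv[1])

left, right = split((0, 10), 4)
total = length(left) + length(right)
```

With 1-based inclusive coordinates the same operations need `end - start + 1` for the length, and `at` / `at + 1` for the two halves of a split, which is where off-by-one errors tend to creep in.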
Humans use 1-based inclusive. That shouldn't and won't change. Interbase coordinates are conceptually cleaner than inclusive coordinates (regardless of base), especially when distinguishing insertions and deletions, and for edits at the termini. APIs should use interbase. I can't think of any technical benefit for 0-based inclusive coordinates. |
(For the avoidance of doubt,) “interbase” and “zero-based half-inclusive” are two names for the same representation (and the latter name is more formally “zero-based half-open” I guess). I think in @andrewyatz's comment he was meaning the latter but inadvertently elided the “half-”. |
@jmarshall: Funny, I removed a point clarifying this because I thought it was a distraction. I guess I should have left it in. Although interbase and 0-based, right-open are numerically equivalent, they're semantically distinct. Interbase provides important conceptual clarity. 0-based, right-open refers to residues, which makes it awkward to refer to insertion points at the termini because you have to refer to imaginary residues. Also, with residue-based coordinates, insertions use exclusive coordinates but deletions and substitutions use inclusive coordinates. That is, 5_6 refers to the space between 5 and 6 for an insertion, but refers to 5 and 6 inclusively for a deletion or MNV. |
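The insertion/deletion distinction above can be sketched in a few lines of Python. This is an illustrative sketch, not code from Beacon or VMC: `apply_edit` is a hypothetical helper and the sequence is invented. In interbase coordinates, one rule covers insertions, deletions and edits at the termini.

```python
def apply_edit(seq, start, end, alt):
    """Replace the interbase span [start, end) of seq with alt.

    Hypothetical helper: start == end denotes a zero-width insertion point.
    """
    if not 0 <= start <= end <= len(seq):
        raise ValueError("span lies outside the sequence")
    return seq[:start] + alt + seq[end:]

seq = "ACGTACGT"

# Insertion between residues 5 and 6 (1-based): zero-width span (5, 5).
inserted = apply_edit(seq, 5, 5, "TT")

# Deletion of residues 5 and 6 (1-based): span (4, 6).
deleted = apply_edit(seq, 4, 6, "")

# Insertions at the termini need no imaginary residues.
prefixed = apply_edit(seq, 0, 0, "N")
suffixed = apply_edit(seq, len(seq), len(seq), "N")
```

Here the residue-based "5_6" from the comment becomes two different spans, (5, 5) for the insertion and (4, 6) for the deletion, so the coordinates alone say which is meant.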
In the Beacon specification, the start key is described as 0-based, while the VCF specification describes the position as 1-based: "POS - position: The reference position, with the 1st base having position 1." I verified this information using the IGV genome browser. Upon further research, other genomic file types are also reported to use the 1-based system.
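For anyone querying a Beacon with positions taken from a VCF, the conversion is an off-by-one shift. The sketch below is a rough illustration that assumes start is 0-based as the specification currently describes. The function name is hypothetical, and returning an exclusive end computed from the REF allele is an assumption for illustration, not something the Beacon specification is quoted as requiring.

```python
def vcf_to_zero_based(pos, ref):
    """Convert a VCF POS (1-based) and REF allele to a 0-based half-open span.

    Hypothetical helper: the end coordinate is exclusive and is derived
    from the length of the REF allele (an assumption for this sketch).
    """
    if pos < 1:
        raise ValueError("VCF POS is 1-based and must be >= 1")
    start = pos - 1
    end = start + len(ref)
    return start, end

# A SNV reported by VCF at POS 100 with REF "A".
snv_start, snv_end = vcf_to_zero_based(100, "A")

# A three-base REF allele at the same POS.
mnv_start, mnv_end = vcf_to_zero_based(100, "ACG")
```

The 0-based start is always POS minus one, and for a span the exclusive end equals the 1-based inclusive end, so only the start value changes between the two conventions.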