Title: | Sequence Globally Unique Identifier ('SEGUID') Checksums for Linear, Circular, Single-Stranded and Double-Stranded Biological Sequences |
---|---|
Description: | An R implementation of the original Sequence Globally Unique Identifier ('SEGUID') algorithm [Babnigg and Giometti (2006) <doi:10.1002/pmic.200600032>] and 'SEGUID' v2 (<https://www.seguid.org>), which extends 'SEGUID' v1 with support for linear, circular, single- and double-stranded biological sequences, e.g. DNA, RNA, and proteins. |
Authors: | Henrik Bengtsson [aut, cre, cph] |
Maintainer: | Henrik Bengtsson <[email protected]> |
License: | MIT + file LICENSE |
Version: | 0.0.0-9076 |
Built: | 2024-11-12 04:55:30 UTC |
Source: | https://github.com/seguid/seguid-r |
SEGUID checksums for linear, circular, single- and double-stranded sequences
seguid(seq, alphabet = "{DNA}", form = c("long", "short", "both"))
lsseguid(seq, alphabet = "{DNA}", form = c("long", "short", "both"))
csseguid(seq, alphabet = "{DNA}", form = c("long", "short", "both"))
ldseguid(watson, crick, alphabet = "{DNA}", form = c("long", "short", "both"))
cdseguid(watson, crick, alphabet = "{DNA}", form = c("long", "short", "both"))
seq |
(character string) The sequence for which the checksum
should be calculated. The sequence may only comprise symbols
in the alphabet specified by the |
alphabet |
(character string) The type of sequence used.
If |
form |
(character string) How the checksum is presented.
If |
watson, crick |
(character strings) Two reverse-complementary DNA sequences. Both sequences should be specified in the 5'-to-3' direction. |
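To illustrate what "reverse-complementary, both in the 5'-to-3' direction" means for the watson and crick arguments, here is a minimal base-R sketch. The helper `reverse_complement()` is hypothetical and not part of the package; it assumes an upper-case sequence over A, C, G and T and leaves any other symbol unchanged.

```r
# Hypothetical helper (not part of the package): reverse complement of an
# upper-case DNA string made up of A, C, G and T.
reverse_complement <- function(seq) {
  stopifnot(is.character(seq), length(seq) == 1L)
  # Complement each base; symbols other than A, C, G, T are left as-is
  complemented <- chartr("ACGT", "TGCA", seq)
  # Reverse the order of the symbols so the result reads 5'-to-3'
  paste(rev(strsplit(complemented, "")[[1]]), collapse = "")
}

watson <- "GATTACA"
crick <- reverse_complement(watson)
print(crick)
```

For "GATTACA" this gives "TGTAATC", which is the crick strand used for the blunt double-stranded sequence in the examples.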
The SEGUID functions return a single character string if form is
either "long" or "short". If form is "both", then a character
vector of length two is returned, where the first component holds the
"short" checksum and the second the "long" checksum.
The long checksum, without the prefix, is a string of 27 characters.
The short checksum, without the prefix, is the first six characters
of the long checksum.
All checksums are prefixed with a label indicating which SEGUID
method was used.
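The relationship between prefix, long checksum and short checksum can be sketched in base R as follows. The input string is the lsseguid() value shown in the examples; the variable names are illustrative only, and rebuilding the short form by hand is an assumption about the layout described above, not a call into the package.

```r
# Long-form checksum as printed in the examples
checksum <- "lsseguid=tp2jzeCM2e3W4yxtrrx09CMKa_8"

# Split at the first "=" into the method label and the checksum proper
label <- sub("=.*$", "", checksum)
long <- sub("^[^=]*=", "", checksum)

# The short checksum is the first six characters of the long one
short <- substr(long, 1L, 6L)

print(label)        # method label
print(nchar(long))  # length of the long checksum without the prefix
print(paste0(label, "=", short))
```

Here the label is "lsseguid", the unprefixed long checksum has 27 characters, and the short checksum is "tp2jze".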
Except for seguid(), which uses base64 encoding, all functions
produce checksums using the base64url encoding ("Base 64 Encoding
with URL and Filename Safe Alphabet").
seguid() calculates the SEGUID v1 checksum for a linear,
single-stranded sequence.
lsseguid() calculates the SEGUID v2 checksum for a linear,
single-stranded sequence.
csseguid() calculates the SEGUID v2 checksum for a circular,
single-stranded sequence.
ldseguid() calculates the SEGUID v2 checksum for a linear,
double-stranded sequence.
cdseguid() calculates the SEGUID v2 checksum for a circular,
double-stranded sequence.
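A checksum for a circular sequence has to be the same for every rotation of that sequence (GATTACA = ATTACAG = ... = AGATTAC). One common way to get such rotation independence is to pick a canonical rotation, for example the lexicographically smallest one, before computing the checksum. The base-R sketch below shows that idea only; the helpers `rotations()` and `canonical_rotation()` are hypothetical, and whether csseguid() uses exactly this rule is an assumption here, not something this page states.

```r
# Hypothetical helper: all rotations of a single string
rotations <- function(seq) {
  n <- nchar(seq)
  if (n == 0L) return(seq)
  vapply(seq_len(n), function(i) {
    # Move the first (i - 1) symbols to the end
    paste0(substr(seq, i, n), substr(seq, 1L, i - 1L))
  }, character(1))
}

# Hypothetical helper: smallest rotation in C-locale (byte) order.
# method = "radix" makes the ordering independent of the session locale.
canonical_rotation <- function(seq) {
  sort(rotations(seq), method = "radix")[1]
}

print(rotations("GATTACA"))
print(canonical_rotation("GATTACA"))
print(canonical_rotation("ATTACAG"))
```

Both calls to `canonical_rotation()` return "ACAGATT", so any function applied to the canonical rotation gives the same result for all rotations of the input.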
The base64url encoding is the base64 encoding with non-URL-safe characters
substituted with URL-safe ones. Specifically, the plus symbol (+) is
replaced by the minus symbol (-), and the forward slash (/) is
replaced by the underscore symbol (_).
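The substitution described above can be sketched with base R's chartr(). The helper name `base64_to_base64url()` is hypothetical; the input is the unprefixed seguid() checksum for "GATTACA" from the examples, and the result matches the unprefixed lsseguid() checksum shown there.

```r
# Hypothetical helper: map the two non-URL-safe base64 symbols to their
# base64url counterparts ("+" -> "-", "/" -> "_")
base64_to_base64url <- function(x) chartr("+/", "-_", x)

v1 <- "tp2jzeCM2e3W4yxtrrx09CMKa/8"   # from seguid("GATTACA"), prefix removed
v2 <- base64_to_base64url(v1)
print(v2)
```

This prints "tp2jzeCM2e3W4yxtrrx09CMKa_8". All other characters are left untouched, so the two encodings differ only where a plus or a slash occurs.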
The Base64 checksum, which is used for the original SEGUID checksum,
is not guaranteed to comprise symbols that can safely be used as-is
in a Uniform Resource Locator (URL). Specifically, it may contain
forward slashes (/) and plus symbols (+), which are characters that
carry special meaning in a URL.
For the same reason, a Base64 checksum cannot safely be used
as a file or directory name, because it may contain a forward slash.
The checksum returned is always 27 characters long. This is because the
standard 28-character base64 representation always ends with a padding
character (=) so that the length is a multiple of four characters. We
relax this requirement by dropping the padding character.
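The length of 27 can be checked with a little arithmetic. The sketch below assumes the underlying digest is the 20-byte (160-bit) SHA-1 hash listed in the references; base64 encodes every 3 bytes as 4 characters and pads the last group with "=". The padded string is reconstructed by hand for illustration.

```r
digest_bytes <- 20L  # assumption: SHA-1 digest, 160 bits = 20 bytes

# base64 emits 4 characters per (possibly padded) group of 3 bytes
padded_length <- 4 * ceiling(digest_bytes / 3)
# number of "=" characters needed to fill the last group
n_padding <- (3L - digest_bytes %% 3L) %% 3L
unpadded_length <- padded_length - n_padding

print(c(padded = padded_length, padding = n_padding, unpadded = unpadded_length))

# Dropping the trailing padding from a padded checksum (reconstructed here)
padded <- "tp2jzeCM2e3W4yxtrrx09CMKa/8="
print(sub("=+$", "", padded))
```

Under that assumption the padded length is 28 with exactly one padding character, which leaves 27 characters once the padding is dropped.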
Babnigg, G., Giometti, CS. A database of unique protein sequence identifiers for proteome studies. Proteomics. 2006 Aug;6(16):4514-22. doi:10.1002/pmic.200600032.
Josefsson, S., The Base16, Base32, and Base64 Data Encodings, RFC 4648, doi:10.17487/RFC4648, October 2006, https://www.rfc-editor.org/info/rfc4648.
Wikipedia article 'Nucleic acid notation', February 2024. https://en.wikipedia.org/wiki/Nucleic_acid_notation.
Wikipedia article 'Amino acid', February 2024. https://en.wikipedia.org/wiki/Amino_acid.
Wikipedia article 'SHA-1' (Secure Hash Algorithm 1), December 2023. https://en.wikipedia.org/wiki/SHA-1.
## SEGUID v1 on linear single-stranded DNA
seguid("GATTACA")
#> seguid=tp2jzeCM2e3W4yxtrrx09CMKa/8

## SEGUID v2 on linear single-stranded DNA
lsseguid("GATTACA")
#> lsseguid=tp2jzeCM2e3W4yxtrrx09CMKa_8

## SEGUID v2 on circular single-stranded DNA
## GATTACA = ATTACAG = ... = AGATTAC
csseguid("GATTACA")
#> csseguid=mtrvbtuwr6_MoBxvtm4BEpv-jKQ

## SEGUID v2 on blunt, linear double-stranded DNA
## GATTACA
## CTAATGT
ldseguid("GATTACA", "TGTAATC")
#> ldseguid=AcRsEcNFrui5wCxI7xxo6wnDYPY

## SEGUID v2 on staggered, linear double-stranded DNA
## -ATTACA
## CTAAT--
ldseguid("-ATTACA", "--TAATC")
#> ldseguid=98Klwxd3ZQPGHqnH3BheIuZVHQQ

## SEGUID v2 on circular double-stranded DNA
## GATTACA = ATTACAG = ... = AGATTAC
## CTAATGT = TAATGTC = ... = TCTAATG
cdseguid("GATTACA", "TGTAATC")
#> cdseguid=zCuq031K3_-40pArbl-Y4N9RLnA

## SEGUID v2 on linear single-stranded expanded
## epigenetic sequence (Viner et al., 2024)
viner_DNA <- "{DNA},m1,h2,f3,c4"
lsseguid("AmT2C", alphabet = viner_DNA)
#> lsseguid=MW4Rh3lGY2mhwteaSKh1-Kn2fGA

## SEGUID v2 on linear double-stranded expanded
## epigenetic sequence (Viner et al., 2024)
ldseguid("AmT2C", "GhA1T", alphabet = viner_DNA)
#> ldseguid=rsPDjP4SWr3-ploCeXTdTA80u0Y