class Bio::Sequence
Bio::Sequence objects represent annotated sequences in bioruby. A Bio::Sequence object is a wrapper around the actual sequence, represented as either a Bio::Sequence::NA or a Bio::Sequence::AA object. For most users, this encapsulation will be completely transparent. Bio::Sequence responds to all methods defined for Bio::Sequence::NA/AA objects using the same arguments and returning the same values (even though these methods are not documented specifically for Bio::Sequence).
  require 'bio'

  # Create a nucleic or amino acid sequence
  dna = Bio::Sequence.auto('atgcatgcATGCATGCAAAA')
  rna = Bio::Sequence.auto('augcaugcaugcaugcaaaa')
  aa = Bio::Sequence.auto('ACDEFGHIKLMNPQRSTVWYU')

  # Print it out
  puts dna.to_s
  puts aa.to_s

  # Get a subsequence, bioinformatics style (first nucleotide is '1')
  puts dna.subseq(2,6)

  # Get a subsequence, informatics style (first nucleotide is '0')
  puts dna[2,6]

  # Print in FASTA format
  puts dna.output(:fasta)

  # Print all codons
  dna.window_search(3,3) do |codon|
    puts codon
  end

  # Splice or otherwise mangle your sequence
  puts dna.splicing("complement(join(1..5,16..20))")
  puts rna.splicing("complement(join(1..5,16..20))")

  # Convert a sequence containing ambiguity codes into a
  # regular expression you can use for subsequent searching
  puts aa.to_re

  # These should speak for themselves
  puts dna.complement
  puts dna.composition
  puts dna.molecular_weight
  puts dna.translate
  puts dna.gc_percent
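The two subsequence styles in the example differ in more than the starting index: subseq(2,6) takes a 1-based, inclusive start and end, while [2,6] is the usual Ruby 0-based start and length. A minimal sketch on a plain String, assuming subseq_1based as a hypothetical helper that mimics the subseq convention (it is not part of bioruby):

```ruby
# Hypothetical helper, not part of bioruby: 1-based, inclusive start and end.
def subseq_1based(str, first, last)
  str[(first - 1)..(last - 1)]
end

dna = 'atgcatgcATGCATGCAAAA'

bio_style  = subseq_1based(dna, 2, 6)  # positions 2 through 6
ruby_style = dna[2, 6]                 # 6 characters starting at index 2

puts bio_style   # tgcat
puts ruby_style  # gcatgc
```

The two calls therefore return strings of different lengths (5 and 6 characters) even though the arguments look identical.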
Organism classification, taxonomic classification of the source organism. (Array of String)
Comments (String or an Array of String)
Data Class defined by EMBL (String) See
Created date of the sequence entry (Date, DateTime, Time, or String)
Last modified date of the sequence entry (Date, DateTime, Time, or String)
Links to other database entries. (An Array of Bio::Sequence::DBLink objects)
A String with a description of the sequence (String)
Taxonomic Division defined by EMBL/GenBank/DDBJ (String) See
The sequence identifier (String). For example, for a sequence of GenBank origin, this is the locus name. For a sequence of EMBL origin, this is the primary accession number.
Version of the entry (String or Integer). Unlike sequence_version, entry_version is a database maintainer’s internal version number. The version number will be changed when the database maintainer modifies the entry. The same entry in EMBL, GenBank, and DDBJ may have different entry_version.
Error probabilities of the bases/residues in the sequence. (Array containing Float, or nil)
Features (An Array of Bio::Feature objects)
Namespace of the sequence IDs described in entry_id, primary_accession, and secondary_accessions methods (String). For example, ‘EMBL’, ‘GenBank’, ‘DDBJ’, ‘RefSeq’.
Keywords (An Array of String)
Molecular type (String). “DNA” or “RNA” for nucleotide sequence.
(not well supported) Organelle information (String).
Sequence identifiers which are not described in entry_id, primary_accession, and secondary_accessions methods (Array of Bio::Sequence::DBLink objects). For example, an NCBI GI number can be stored. Note that only identifiers of the entry itself should be stored. For database cross references, dblinks should be used.
Primary accession number (String)
The meaning (calculation method) of the quality scores stored in the quality_scores attribute. May be one of :phred, :solexa, or nil.
Note that if it is nil, and error_probabilities is empty, some methods implicitly assume that it is :phred (PHRED score).
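For reference, the two score types relate a quality score to an error probability differently. The sketch below uses the commonly cited definitions (PHRED: Q = -10 log10(p); Solexa: Q = -10 log10(p / (1 - p))). The helper names are hypothetical, and this is an illustration of the definitions, not bioruby's own conversion code.

```ruby
# Hypothetical helpers illustrating the usual score definitions.

# PHRED: Q = -10 * log10(p)  =>  p = 10^(-Q/10)
def phred_to_error_probability(q)
  10.0**(-q / 10.0)
end

# Solexa: Q = -10 * log10(p / (1 - p))  =>  p = 1 / (1 + 10^(Q/10))
def solexa_to_error_probability(q)
  1.0 / (1.0 + 10.0**(q / 10.0))
end

puts phred_to_error_probability(20)
puts solexa_to_error_probability(0)
```

The two scales nearly coincide for high scores and diverge for low ones, which is why the score type matters when quality_scores are interpreted.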
Quality scores of the bases/residues in the sequence. (Array containing Integer, or nil)
References (An Array of Bio::Reference objects)
Release information when created (String)
Release information when last-modified (String)
Secondary accession numbers (Array of String)
The sequence object, usually Bio::Sequence::NA/AA, but could be a simple String
Version number of the sequence (String or Integer). Unlike entry_version, sequence_version will be changed when the submitter of the sequence updates the entry. Normally, the same entry taken from different databases (EMBL, GenBank, and DDBJ) may have the same sequence_version.
Organism species (String). For example, “Escherichia coli”.
Strandedness (String). “single” (single-stranded), “double” (double-stranded), “mixed” (mixed-stranded), or nil.
Organism classification, taxonomic classification of the source organism. (Array of String)
Topology (String). “circular”, “linear”, or nil.
Public Class Methods
# File lib/bio/sequence.rb, line 463
def self.adapter(source_data, adapter_module)
  biosequence = self.new(nil)
  biosequence.instance_eval {
    remove_instance_variable(:@seq)
    @source_data = source_data
  }
  biosequence.extend(adapter_module)
  biosequence
end
Normally, users should not call this method directly. Use Bio::*#to_biosequence (e.g. Bio::GenBank#to_biosequence).
Creates a new Bio::Sequence object from database data with an adapter module.
# File lib/bio/sequence.rb, line 283
def self.auto(str)
  seq = self.new(str)
  seq.auto
  return seq
end
Given a sequence String, guess its type, Amino Acid or Nucleic Acid, and return a new Bio::Sequence object wrapping a sequence of the guessed type (either Bio::Sequence::AA or Bio::Sequence::NA).
  s = Bio::Sequence.auto('atgc')
  puts s.seq.class #=> Bio::Sequence::NA
(required) str: String or Bio::Sequence::NA/AA object
- Returns
Bio::Sequence object
# File lib/bio/sequence.rb, line 381
def self.guess(str, *args)
  self.new(str).guess(*args)
end
Guess the class of a given sequence. Returns the class (Bio::Sequence::AA or Bio::Sequence::NA) guessed. In general, used by developers only, but if you know what you are doing, feel free.
  puts Bio::Sequence.guess('atgc') #=> Bio::Sequence::NA
There are three optional parameters: `threshold`, `length`, and `index`.
The `threshold` value (defaults to 0.9) is the frequency of nucleic acid bases [AGCTUagctu] required in the sequence for this method to produce a Bio::Sequence::NA “guess”. The comparison is strict: in the default case, unless more than 90% of the bases (after excluding [Nn]) are in the set [AGCTUagctu], the guess is Bio::Sequence::AA.
  puts Bio::Sequence.guess('atgcatgcqq')      #=> Bio::Sequence::AA
  puts Bio::Sequence.guess('atgcatgcqq', 0.8) #=> Bio::Sequence::AA
  puts Bio::Sequence.guess('atgcatgcqq', 0.7) #=> Bio::Sequence::NA
The `length` value is how much of the total sequence to use in the guess (default 10000). If your sequence is very long, you may want to use a smaller amount to reduce the computational burden.
  # limit the guess to the first 1000 positions
  puts Bio::Sequence.guess('A VERY LONG SEQUENCE', 0.9, 1000)
The `index` value is where to start the guess. Perhaps you know there are a lot of gaps at the start…
  puts Bio::Sequence.guess('-----atgcc')               #=> Bio::Sequence::AA
  puts Bio::Sequence.guess('-----atgcc',0.9,10000,5)   #=> Bio::Sequence::NA
(required) str: String or Bio::Sequence::NA/AA object
(optional) threshold: Float in the range 0 to 1 (default 0.9)
(optional) length: Fixnum (default 10000)
(optional) index: Fixnum, 0-based start position (default 0)
- Returns
Bio::Sequence::NA or Bio::Sequence::AA (the class)
# File lib/bio/sequence.rb, line 436
def self.input(str, format = nil)
  if format then
    klass = format
  else
    klass = Bio::FlatFile::AutoDetect.default.autodetect(str)
  end
  obj = klass.new(str)
  obj.to_biosequence
end
Create a new Bio::Sequence object from a formatted string (GenBank, FASTA format, etc.)
s = Bio::Sequence.input(str)
(required) str: String
(optional) format: format specification (class or nil)
- Returns
Bio::Sequence object
# File lib/bio/sequence.rb, line 99
def initialize(str)
  @seq = str
end
Create a new Bio::Sequence object
  s = Bio::Sequence.new('atgc')
  puts s #=> 'atgc'
Note that this method does not initialize the contained sequence as any kind of bioruby object, only as a simple string.
  puts s.seq.class #=> String
See Bio::Sequence#na, Bio::Sequence#aa, and Bio::Sequence#auto for methods to transform the basic String of a just created Bio::Sequence object to a proper bioruby object.
(required) str: String or Bio::Sequence::NA/AA object
- Returns
Bio::Sequence object
# File lib/bio/sequence.rb, line 447
def self.read(str, format = nil)
  input(str, format)
end
Alias of Bio::Sequence.input.
Public Instance Methods
# File lib/bio/sequence.rb, line 422
def aa
  @seq = AA.new(seq)
  @moltype = AA
end
Transform the sequence wrapped in the current Bio::Sequence object into a Bio::Sequence::AA object. This method will change the current object. This method does not validate your choice, so be careful!
  s = Bio::Sequence.new('atgc')
  puts s.seq.class #=> String
  s.aa
  puts s.seq.class #=> Bio::Sequence::AA !!!
However, if you know your sequence type, this method may be constructively used after initialization,
  s = Bio::Sequence.new('RRLE')
  s.aa
- Returns
Bio::Sequence::AA object
# File lib/bio/sequence.rb, line 454
def accessions
  [ primary_accession, secondary_accessions ].flatten.compact
end
Accession numbers of the sequence
- Returns
Array of String
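The flatten.compact idiom in the source means the result is always a flat Array, even when either attribute is unset. A plain-Ruby illustration, assuming merge_accessions as a hypothetical stand-in for the method body and made-up placeholder accession strings:

```ruby
# Plain-Ruby illustration; merge_accessions is a hypothetical stand-in
# and the accession strings are made-up placeholders.
def merge_accessions(primary, secondaries)
  [primary, secondaries].flatten.compact
end

with_both    = merge_accessions('X00001', ['X00002', 'X00003'])
primary_only = merge_accessions('X00001', nil)
none_set     = merge_accessions(nil, nil)

p with_both     # ["X00001", "X00002", "X00003"]
p primary_only  # ["X00001"]
p none_set      # []
```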
# File lib/bio/sequence.rb, line 264
def auto
  @moltype = guess
  if @moltype == NA
    @seq = NA.new(seq)
  else
    @seq = AA.new(seq)
  end
end
Guess the type of sequence, Amino Acid or Nucleic Acid, and create a new sequence object (Bio::Sequence::AA or Bio::Sequence::NA) on the basis of this guess. This method will change the current Bio::Sequence object.
  s = Bio::Sequence.new('atgc')
  puts s.seq.class #=> String
  s.auto
  puts s.seq.class #=> Bio::Sequence::NA
- Returns
Bio::Sequence::NA/AA object
# File lib/bio/sequence.rb, line 328
def guess(threshold = 0.9, length = 10000, index = 0)
  str = seq.to_s[index,length].to_s.extend Bio::Sequence::Common
  cmp = str.composition

  bases = cmp['A'] + cmp['T'] + cmp['G'] + cmp['C'] + cmp['U'] +
          cmp['a'] + cmp['t'] + cmp['g'] + cmp['c'] + cmp['u']

  total = str.length - cmp['N'] - cmp['n']

  if bases.to_f / total > threshold
    return NA
  else
    return AA
  end
end
Guess the class of the current sequence. Returns the class (Bio::Sequence::AA or Bio::Sequence::NA) guessed. In general, used by developers only, but if you know what you are doing, feel free.
  s = Bio::Sequence.new('atgc')
  puts s.guess #=> Bio::Sequence::NA
There are three parameters: `threshold`, `length`, and `index`.
The `threshold` value (defaults to 0.9) is the frequency of nucleic acid bases [AGCTUagctu] required in the sequence for this method to produce a Bio::Sequence::NA “guess”. The comparison is strict: in the default case, unless more than 90% of the bases (after excluding [Nn]) are in the set [AGCTUagctu], the guess is Bio::Sequence::AA.
  s = Bio::Sequence.new('atgcatgcqq')
  puts s.guess      #=> Bio::Sequence::AA
  puts s.guess(0.8) #=> Bio::Sequence::AA
  puts s.guess(0.7) #=> Bio::Sequence::NA
The `length` value is how much of the total sequence to use in the guess (default 10000). If your sequence is very long, you may want to use a smaller amount to reduce the computational burden.
  s = Bio::Sequence.new('A VERY LONG SEQUENCE')
  puts s.guess(0.9, 1000) # limit the guess to the first 1000 positions
The `index` value is where to start the guess. Perhaps you know there are a lot of gaps at the start…
  s = Bio::Sequence.new('-----atgcc')
  puts s.guess              #=> Bio::Sequence::AA
  puts s.guess(0.9,10000,5) #=> Bio::Sequence::NA
(optional) threshold: Float in the range 0 to 1 (default 0.9)
(optional) length: Fixnum (default 10000)
(optional) index: Fixnum, 0-based start position (default 0)
- Returns
Bio::Sequence::NA or Bio::Sequence::AA (the class)
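The heuristic can be tried without bioruby installed. The following standalone sketch mirrors the logic of the guess source on a plain String. Note the assumptions: guess_type is a hypothetical name, it uses String#count in place of the composition hash, and it returns the symbols :NA and :AA rather than the bioruby classes.

```ruby
# Standalone sketch of the guessing heuristic (hypothetical helper, not bioruby API).
def guess_type(str, threshold = 0.9, length = 10_000, index = 0)
  # Look only at the requested window of the sequence.
  window = str[index, length].to_s
  # Count recognised nucleotide letters, then discount ambiguous N/n from the total.
  bases = window.count('ACGTUacgtu')
  total = window.length - window.count('Nn')
  # Strict comparison: a ratio equal to the threshold still yields :AA.
  bases.to_f / total > threshold ? :NA : :AA
end

p guess_type('atgcatgcqq')             # :AA  (8 of 10 is not above 0.9)
p guess_type('atgcatgcqq', 0.8)        # :AA  (0.8 is not above 0.8)
p guess_type('atgcatgcqq', 0.7)        # :NA
p guess_type('-----atgcc', 0.9, 10_000, 5)  # :NA  (gaps skipped)
```

Because the start position is used directly as a String index, an index of 5 skips exactly the five leading gap characters.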
# File lib/bio/sequence.rb, line 401
def na
  @seq = NA.new(seq)
  @moltype = NA
end
Transform the sequence wrapped in the current Bio::Sequence object into a Bio::Sequence::NA object. This method will change the current object. This method does not validate your choice, so be careful!
  s = Bio::Sequence.new('RRLE')
  puts s.seq.class #=> String
  s.na
  puts s.seq.class #=> Bio::Sequence::NA !!!
However, if you know your sequence type, this method may be constructively used after initialization,
  s = Bio::Sequence.new('atgc')
  s.na
- Returns
Bio::Sequence::NA object
# File lib/bio/sequence/compat.rb, line 27
def to_s
  String.new(self.seq)
end
Return sequence as String. The original sequence is unchanged.
  s = Bio::Sequence.new('atgc')
  puts s.to_s       #=> 'atgc'
  puts s.to_s.class #=> String
  puts s            #=> 'atgc'
  puts s.class      #=> Bio::Sequence
- Returns
String object