module Bio::Sequence::Common
DESCRIPTION
Bio::Sequence::Common is a Mixin implementing methods common to Bio::Sequence::AA and Bio::Sequence::NA. All of these methods are available to either Amino Acid or Nucleic Acid sequences, and by encapsulation are also available to Bio::Sequence objects.
USAGE
    # Create a sequence
    dna = Bio::Sequence.auto('atgcatgcatgc')

    # Splice out a subsequence using a Genbank-style location string
    puts dna.splice('complement(1..4)')

    # What is the base composition?
    puts dna.composition

    # Create a random sequence with the composition of a current sequence
    puts dna.randomize
Public Instance Methods
Source
    # File lib/bio/sequence/common.rb, line 122
    def +(*arg)
      self.class.new(super(*arg))
    end
Create a new sequence by adding to an existing sequence. The existing sequence is not modified.
    s = Bio::Sequence::NA.new('atgc')
    s2 = s + 'atgc'
    puts s2                                 #=> "atgcatgc"
    puts s                                  #=> "atgc"
The new sequence is of the same class as the existing sequence if the new data was added to an existing sequence,
    puts s2.class == s.class                #=> true
but if an existing sequence is added to a String, the result is a String:
    s3 = 'atgc' + s
    puts s3.class                           #=> String
Returns: new Bio::Sequence::NA/AA or String object
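The class-preserving behaviour described above can be reproduced without BioRuby. The following is a minimal sketch, assuming a plain String subclass; `PlusSketch` is a hypothetical name used only for illustration and is not part of the library.

```ruby
# Hypothetical stand-in for a sequence class; not part of BioRuby.
class PlusSketch < String
  # Re-wrap the plain String returned by String#+ in the receiver's class.
  def +(other)
    self.class.new(super(other))
  end
end

s = PlusSketch.new('atgc')
left  = s + 'atgc'    # receiver is the subclass, so our + runs
right = 'atgc' + s    # receiver is a String, so String#+ runs

puts left.class       # PlusSketch
puts right.class      # String
puts s                # atgc (unchanged)
```

Which class comes back depends only on the receiver of `+`, which is why the order of the operands matters.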
Source
    # File lib/bio/sequence/common.rb, line 216
    def composition
      count = Hash.new(0)
      self.scan(/./) do |x|
        count[x] += 1
      end
      return count
    end
Returns a hash of the occurrence counts for each residue or base.
    s = Bio::Sequence::NA.new('atgc')
    puts s.composition                      #=> {"a"=>1, "t"=>1, "g"=>1, "c"=>1}
Returns: Hash object
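The counting logic shown in the source can be tried on a plain String. This is a minimal sketch under that assumption; `residue_counts` is a hypothetical helper name, not a BioRuby method.

```ruby
# Hypothetical helper mirroring the counting loop; works on any String.
def residue_counts(sequence)
  counts = Hash.new(0)                       # unseen residues default to 0
  sequence.scan(/./) { |residue| counts[residue] += 1 }
  counts
end

counts = residue_counts('atgcatgcaa')
counts.each { |residue, n| puts "#{residue}: #{n}" }
# a: 4
# t: 2
# g: 2
# c: 2
```

Because the hash is created with a default of 0, looking up a residue that never occurred returns 0 rather than nil.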
Source
    # File lib/bio/sequence/common.rb, line 95
    def concat(*arg)
      super(self.class.new(*arg))
    end
Add new data to the end of the current sequence. The original sequence is modified.
    s = Bio::Sequence::NA.new('atgc')
    s << 'atgc'
    puts s                                  #=> "atgcatgc"
    s << s
    puts s                                  #=> "atgcatgcatgcatgc"
Returns: current Bio::Sequence::NA/AA object (modified)
Source
    # File lib/bio/sequence/common.rb, line 79
    def normalize!
      initialize(self)
      self
    end
Normalize the current sequence, removing all whitespace and transforming all positions to uppercase if the sequence is AA or transforming all positions to lowercase if the sequence is NA. The original sequence is modified.
    s = Bio::Sequence::NA.new('atgc')
    s.normalize!
Returns: current Bio::Sequence::NA/AA object (modified)
Source
    # File lib/bio/sequence/common.rb, line 244
    def randomize(hash = nil)
      if hash
        tmp = ''
        hash.each {|k, v|
          tmp += k * v.to_i
        }
      else
        tmp = self
      end
      seq = self.class.new(tmp)
      # Reference: http://en.wikipedia.org/wiki/Fisher-Yates_shuffle
      seq.length.downto(2) do |n|
        k = rand(n)
        c = seq[n - 1]
        seq[n - 1] = seq[k]
        seq[k] = c
      end
      if block_given? then
        (0...seq.length).each do |i|
          yield seq[i, 1]
        end
        return self.class.new('')
      else
        return seq
      end
    end
Returns a randomized sequence. The default is to retain the same base/residue composition as the original. If a hash of base/residue counts is given, the new sequence will be based on that hash composition. If a block is given, each new randomly selected position will be passed into the block. In all cases, the original sequence is not modified.
    s = Bio::Sequence::NA.new('atgc')
    puts s.randomize                        #=> "tcag"  (for example)

    new_composition = {'a' => 2, 't' => 2}
    puts s.randomize(new_composition)       #=> "ttaa"  (for example)

    count = 0
    s.randomize { |x| count += 1 }
    puts count                              #=> 4
Arguments:
- (optional) hash: Hash object
Returns: new Bio::Sequence::NA/AA object
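The source above shuffles with the Fisher-Yates algorithm. A minimal sketch of the same shuffle on a plain String follows; `shuffle_residues` and its `rng` parameter are hypothetical names introduced here so the example can be seeded, and are not part of BioRuby.

```ruby
# Hypothetical helper: Fisher-Yates shuffle of a String's characters.
def shuffle_residues(sequence, rng = Random.new)
  shuffled = sequence.dup                  # work on a copy; caller's string is untouched
  shuffled.length.downto(2) do |n|
    k = rng.rand(n)                        # random index in 0...n
    # swap the last unshuffled position with the randomly chosen one
    shuffled[n - 1], shuffled[k] = shuffled[k], shuffled[n - 1]
  end
  shuffled
end

original = 'atgcatgc'
shuffled = shuffle_residues(original, Random.new(42))
puts shuffled
puts shuffled.chars.sort.join == original.chars.sort.join   # true: composition is preserved
```

Every swap only exchanges two positions, so the shuffled string always has the same residue counts as the input.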
Source
    # File lib/bio/sequence/common.rb, line 66
    def seq
      self.class.new(self)
    end
Create a new sequence based on the current sequence. The original sequence is unchanged.
    s = Bio::Sequence::NA.new('atgc')
    s2 = s.seq
    puts s2                                 #=> 'atgc'
Returns: new Bio::Sequence::NA/AA object
Source
    # File lib/bio/sequence/common.rb, line 286
    def splice(position)
      unless position.is_a?(Locations) then
        position = Locations.new(position)
      end
      s = String.new
      position.each do |location|
        if location.sequence
          s << location.sequence
        else
          exon = self.subseq(location.from, location.to)
          begin
            exon.complement! if location.strand < 0
          rescue NameError
          end
          s << exon
        end
      end
      return self.class.new(s)
    end
Return a new sequence extracted from the original using a GenBank style position string. See also documentation for the Bio::Location class.
    s = Bio::Sequence::NA.new('atgcatgcatgcatgc')
    puts s.splice('1..3')                           #=> "atg"
    puts s.splice('join(1..3,8..10)')               #=> "atgcat"
    puts s.splice('complement(1..3)')               #=> "cat"
    puts s.splice('complement(join(1..3,8..10))')   #=> "atgcat"
Note that ‘complement’ed GenBank position strings will have no effect on Bio::Sequence::AA objects.
Arguments:
- (required) position: String or Bio::Locations object
Returns: Bio::Sequence::NA/AA object
Source
    # File lib/bio/sequence/common.rb, line 312
    def split(*arg)
      if block_given?
        super
      else
        ret = super(*arg)
        ret.collect! { |x| self.class.new('').replace(x) }
        ret
      end
    end
Acts almost the same as String#split.
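Judging from the source above, the difference from String#split is that, when no block is given, each piece is re-wrapped in the receiver's class. A minimal sketch of that idea, assuming a plain String subclass (`SplitSketch` is a hypothetical name, not part of BioRuby):

```ruby
# Hypothetical stand-in for a sequence class; not part of BioRuby.
class SplitSketch < String
  # Split as String does, then wrap every piece in the receiver's class.
  def split(*args)
    super(*args).map { |piece| self.class.new(piece) }
  end
end

parts = SplitSketch.new('atg-cat-gca').split('-')
parts.each { |part| puts "#{part} (#{part.class})" }
# atg (SplitSketch)
# cat (SplitSketch)
# gca (SplitSketch)
```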
Source
    # File lib/bio/sequence/common.rb, line 144
    def subseq(s = 1, e = self.length)
      raise "Error: start/end position must be a positive integer" unless s > 0 and e > 0
      s -= 1
      e -= 1
      self[s..e]
    end
Returns a new sequence containing the subsequence identified by the start and end numbers given as parameters. Important: Biological sequence numbering conventions (one-based) rather than Ruby’s (zero-based) numbering conventions are used.
    s = Bio::Sequence::NA.new('atggaatga')
    puts s.subseq(1,3)                      #=> "atg"
Start defaults to 1 and end defaults to the entire existing string, so subseq called without any parameters simply returns a new sequence identical to the existing sequence.
puts s.subseq #=> "atggaatga"
Arguments:
- (optional) s(start): Integer (default 1)
- (optional) e(end): Integer (default current sequence length)
Returns: new Bio::Sequence::NA/AA object
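The one-based to zero-based conversion can be illustrated on a plain String. This is a minimal sketch; `one_based_slice` is a hypothetical helper name, and it raises ArgumentError where the library source raises a plain RuntimeError.

```ruby
# Hypothetical helper: slice a String using one-based, inclusive positions.
def one_based_slice(sequence, first = 1, last = sequence.length)
  unless first > 0 && last > 0
    raise ArgumentError, 'start/end position must be a positive integer'
  end
  sequence[(first - 1)..(last - 1)]        # shift both ends to zero-based indices
end

seq = 'atggaatga'
puts one_based_slice(seq, 1, 3)   # atg
puts one_based_slice(seq, 4, 6)   # gaa
puts one_based_slice(seq)         # atggaatga
```

Both positions are inclusive, so positions 4 to 6 return three residues.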
Source
    # File lib/bio/sequence/compat.rb, line 49
    def to_fasta(header = '', width = nil)
      warn "Bio::Sequence#to_fasta is obsolete. Use Bio::Sequence#output(:fasta) instead" if $DEBUG
      ">#{header}\n" +
      if width
        self.to_s.gsub(Regexp.new(".{1,#{width}}"), "\\0\n")
      else
        self.to_s + "\n"
      end
    end
Bio::Sequence#to_fasta is DEPRECATED. Do not use Bio::Sequence#to_fasta! Use Bio::Sequence#output instead. Note that Bio::Sequence::NA#to_fasta, Bio::Sequence::AA#to_fasta, and Bio::Sequence::Generic#to_fasta can still be used, because there are no alternative methods.
Output the FASTA format string of the sequence. The 1st argument is used as the comment string. If the 2nd option is given, the output sequence will be folded.
Arguments:
- (optional) header: String object
- (optional) width: Fixnum object (default nil)
Returns: String
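The folding in the source is done with a single gsub that appends a newline after every run of up to `width` characters. A minimal sketch of the same technique on a plain String (`fasta_record` is a hypothetical helper name, not part of BioRuby):

```ruby
# Hypothetical helper: build a FASTA record, optionally folding the sequence.
def fasta_record(sequence, header = '', width = nil)
  body = if width
           # "\\0" re-inserts each match, so every chunk gets a trailing newline
           sequence.gsub(/.{1,#{width}}/, "\\0\n")
         else
           sequence + "\n"
         end
  ">#{header}\n" + body
end

puts fasta_record('atgcatgcat', 'example', 4)
# >example
# atgc
# atgc
# at
```

The `{1,width}` quantifier lets the final chunk be shorter than `width`, so no residues are dropped.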
Source
    # File lib/bio/sequence/common.rb, line 53
    def to_s
      String.new(self)
    end
Return sequence as String. The original sequence is unchanged.
    s = Bio::Sequence::NA.new('atgc')
    puts s.to_s                             #=> 'atgc'
    puts s.to_s.class                       #=> String
    puts s                                  #=> 'atgc'
    puts s.class                            #=> Bio::Sequence::NA
Returns: String object
Source
    # File lib/bio/sequence/common.rb, line 199
    def total(hash)
      hash.default = 0.0 unless hash.default
      sum = 0.0
      self.each_byte do |x|
        begin
          sum += hash[x.chr]
        end
      end
      return sum
    end
Returns a float total value for the sequence given a hash of base or residue values:
    values = {'a' => 0.1, 't' => 0.2, 'g' => 0.3, 'c' => 0.4}
    s = Bio::Sequence::NA.new('atgc')
    puts s.total(values)                    #=> 1.0
Arguments:
- (required) hash: Hash object
Returns: Float object
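The summation can be sketched on a plain String as follows. `weighted_total` is a hypothetical helper name and the weights are made-up values. Unlike the source above, which sets the passed hash's default to 0.0 when no default is set, this sketch uses `fetch` so the caller's hash is left untouched.

```ruby
# Hypothetical helper: sum per-residue values over a String.
def weighted_total(sequence, weights)
  # residues missing from the table contribute 0.0
  sequence.each_char.sum(0.0) { |residue| weights.fetch(residue, 0.0) }
end

weights = { 'a' => 1.5, 't' => 2.0, 'g' => 0.5, 'c' => 1.0 }   # illustrative values only
puts weighted_total('atgc', weights)    # 5.0
puts weighted_total('atgcn', weights)   # 5.0 ('n' is not in the table)
```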
Source
    # File lib/bio/sequence/common.rb, line 180
    def window_search(window_size, step_size = 1)
      last_step = 0
      0.step(self.length - window_size, step_size) do |i|
        yield self[i, window_size]
        last_step = i
      end
      return self[last_step + window_size .. -1]
    end
This method steps through a sequence in steps of ‘step_size’ by subsequences of ‘window_size’. Typically used with a block. Any remaining sequence at the terminal end will be returned.
Prints average GC% on each 100bp
    s.window_search(100) do |subseq|
      puts subseq.gc
    end
Prints every translated peptide (length 5aa) in the same frame
    s.window_search(15, 3) do |subseq|
      puts subseq.translate
    end
Split genome sequence by 10000bp with 1000bp overlap in fasta format
    i = 1
    remainder = s.window_search(10000, 9000) do |subseq|
      puts subseq.to_fasta("segment #{i}", 60)
      i += 1
    end
    puts remainder.to_fasta("segment #{i}", 60)
Arguments:
- (required) window_size: Fixnum
- (optional) step_size: Fixnum (default 1)
Returns: new Bio::Sequence::NA/AA object
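The stepping and the returned remainder can be seen on a short plain String. This is a minimal sketch that follows the loop in the source; `each_window` is a hypothetical helper name, not part of BioRuby.

```ruby
# Hypothetical helper: yield fixed-size windows, return the uncovered tail.
def each_window(sequence, window_size, step_size = 1)
  last_start = 0
  # the last window must still fit, hence the upper bound length - window_size
  0.step(sequence.length - window_size, step_size) do |start|
    yield sequence[start, window_size]
    last_start = start
  end
  sequence[(last_start + window_size)..-1]   # whatever follows the last window
end

windows = []
remainder = each_window('atgcatgcat', 4, 4) { |window| windows << window }
puts windows.inspect   # ["atgc", "atgc"]
puts remainder         # at
```

With a 10-residue input, a window of 4 and a step of 4, windows start at offsets 0 and 4, leaving the final two residues as the remainder.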