module Bio::Sequence::Common

DESCRIPTION

Bio::Sequence::Common is a Mixin implementing methods common to Bio::Sequence::AA and Bio::Sequence::NA. All of these methods are available to either Amino Acid or Nucleic Acid sequences, and by encapsulation are also available to Bio::Sequence objects.

USAGE

# Create a sequence
dna = Bio::Sequence.auto('atgcatgcatgc')

# Splice out a subsequence using a GenBank-style location string
puts dna.splice('complement(1..4)')

# What is the base composition?
puts dna.composition

# Create a random sequence with the composition of a current sequence
puts dna.randomize

Public Instance Methods

+ (*arg)

Source

# File lib/bio/sequence/common.rb
def +(*arg)
  self.class.new(super(*arg))
end

Create a new sequence by adding to an existing sequence. The existing sequence is not modified.

s = Bio::Sequence::NA.new('atgc')
s2 = s + 'atgc'
puts s2                                 #=> "atgcatgc"
puts s                                  #=> "atgc"

The new sequence is of the same class as the existing sequence if the new data was added to an existing sequence,

puts s2.class == s.class                #=> true

but if an existing sequence is added to a String, the result is a String

s3 = 'atgc' + s
puts s3.class                           #=> String

Returns: new Bio::Sequence::NA/AA or String object

Calls superclass method
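
The asymmetry described above follows from which object receives the + call. The following is a minimal sketch using a hypothetical stand-in class (SketchSeq is an illustration only, not part of BioRuby) that wraps the result of String#+ in the same way:

```ruby
# Hypothetical stand-in for a sequence class; not BioRuby's implementation.
class SketchSeq < String
  # String#+ returns a plain String, so wrap it back into the receiver's class.
  def +(other)
    self.class.new(super(other))
  end
end

s  = SketchSeq.new('atgc')
s2 = s + 'atgc'   # receiver is a SketchSeq, so SketchSeq#+ runs
s3 = 'atgc' + s   # receiver is a plain String, so String#+ runs

puts s2.class     # SketchSeq
puts s3.class     # String
puts s            # atgc (the original is not modified)
```

When the result class matters, putting the sequence object on the left-hand side of + is the safer choice.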

<< (*arg)

Source

# File lib/bio/sequence/common.rb
def <<(*arg)
  concat(*arg)
end

Add new data to the end of the current sequence by calling concat. The original sequence is modified.

composition ()

Source

# File lib/bio/sequence/common.rb
def composition
  count = Hash.new(0)
  self.scan(/./) do |x|
    count[x] += 1
  end
  return count
end

Returns a hash of the occurrence counts for each residue or base.

s = Bio::Sequence::NA.new('atgc')
puts s.composition              #=> {"a"=>1, "t"=>1, "g"=>1, "c"=>1}

Returns: Hash object
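
A composition hash is often a starting point for further statistics. The sketch below is a standalone illustration on a plain String (composition_of is a hypothetical helper, not a BioRuby method, and it uses each_char rather than scan) that counts residues and derives a GC fraction from the counts:

```ruby
# Hypothetical helper: count every character with a zero-default hash.
def composition_of(sequence)
  counts = Hash.new(0)
  sequence.each_char { |residue| counts[residue] += 1 }
  counts
end

counts = composition_of('atgcatgcaa')
# Fraction of positions that are g or c.
gc_fraction = (counts['g'] + counts['c']).to_f / counts.values.sum

puts counts['a']    # 4
puts gc_fraction    # 0.4
```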

concat (*arg)

Source

# File lib/bio/sequence/common.rb
def concat(*arg)
  super(self.class.new(*arg))
end

Add new data to the end of the current sequence. The original sequence is modified.

s = Bio::Sequence::NA.new('atgc')
s << 'atgc'
puts s                                  #=> "atgcatgc"
s << s
puts s                                  #=> "atgcatgcatgcatgc"

Returns: current Bio::Sequence::NA/AA object (modified)

Calls superclass method
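
Because concat modifies the receiver, every variable that refers to the same object sees the change. The following sketch uses a hypothetical stand-in class (SketchSeq is an assumption for illustration, not BioRuby code) with the same delegation pattern:

```ruby
# Hypothetical stand-in for a sequence class; not BioRuby's implementation.
class SketchSeq < String
  # Build a new object of the same class from the argument, then append in place.
  def concat(*args)
    super(self.class.new(*args))
  end

  def <<(*args)
    concat(*args)
  end
end

s = SketchSeq.new('atgc')
other_reference = s          # second name for the same object
s << 'atgc'

puts other_reference                 # atgcatgc
puts s.equal?(other_reference)       # true
```

If the original must be preserved, + is the non-mutating alternative.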

normalize! ()

Source

# File lib/bio/sequence/common.rb
def normalize!
  initialize(self)
  self
end

Normalize the current sequence, removing all whitespace and transforming all positions to uppercase if the sequence is AA or transforming all positions to lowercase if the sequence is NA. The original sequence is modified.

s = Bio::Sequence::NA.new('atgc')
s.normalize!

Returns: current Bio::Sequence::NA/AA object (modified)

Also aliased as: seq!
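
To make the effect of normalization concrete, here is a rough sketch on plain Strings. The helpers normalize_na and normalize_aa are hypothetical names, and they return new strings rather than modifying in place; they only mirror the behavior described above, not the library's actual constructor logic:

```ruby
# Hypothetical helpers illustrating the described normalization rules.
def normalize_na(raw)
  raw.gsub(/\s+/, '').downcase   # nucleic acid: strip whitespace, lowercase
end

def normalize_aa(raw)
  raw.gsub(/\s+/, '').upcase     # amino acid: strip whitespace, uppercase
end

puts normalize_na("ATG CAT\ngca")   # atgcatgca
puts normalize_aa("m k\tv")         # MKV
```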

randomize (hash = nil) { |seq| ... }

Source

# File lib/bio/sequence/common.rb
def randomize(hash = nil)
  if hash
    tmp = ''
    hash.each {|k, v|
      tmp += k * v.to_i
    }
  else
    tmp = self
  end
  seq = self.class.new(tmp)
  # Reference: http://en.wikipedia.org/wiki/Fisher-Yates_shuffle
  seq.length.downto(2) do |n|
    k = rand(n)
    c = seq[n - 1]
    seq[n - 1] = seq[k]
    seq[k] = c
  end
  if block_given? then
    (0...seq.length).each do |i|
      yield seq[i, 1]
    end
    return self.class.new('')
  else
    return seq
  end
end

Returns a randomized sequence. The default is to retain the same base/residue composition as the original. If a hash of base/residue counts is given, the new sequence is built from that composition instead. If a block is given, each randomly selected position is passed to the block and an empty sequence is returned. In all cases, the original sequence is not modified.

s = Bio::Sequence::NA.new('atgc')
puts s.randomize                        #=> "tcag"  (for example)

new_composition = {'a' => 2, 't' => 2}
puts s.randomize(new_composition)       #=> "ttaa"  (for example)

count = 0
s.randomize { |x| count += 1 }
puts count                              #=> 4

Arguments:

(optional) hash: Hash object

Returns: new Bio::Sequence::NA/AA object
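
The shuffle behind randomize is a Fisher-Yates shuffle. The sketch below reproduces the idea on a plain String; shuffle_sequence and its rng parameter are hypothetical, added here so that the random source can be seeded in a test:

```ruby
# Hypothetical helper: Fisher-Yates shuffle on a copy of the input string.
def shuffle_sequence(sequence, rng = Random.new)
  shuffled = sequence.dup
  (shuffled.length - 1).downto(1) do |last|
    pick = rng.rand(last + 1)   # random index in 0..last
    shuffled[last], shuffled[pick] = shuffled[pick], shuffled[last]
  end
  shuffled
end

original = 'atgcatgcatgc'
shuffled = shuffle_sequence(original, Random.new(42))

# Expanding a composition hash into a pool of residues before shuffling.
pool = { 'a' => 2, 't' => 2 }.map { |residue, n| residue * n }.join
from_hash = shuffle_sequence(pool)

puts shuffled.chars.sort == original.chars.sort   # true (same composition)
puts from_hash.length                             # 4
```

Shuffling changes only the order of positions, so the composition of the result always matches the input pool.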

seq ()

Source

# File lib/bio/sequence/common.rb
def seq
  self.class.new(self)
end

Create a new sequence based on the current sequence. The original sequence is unchanged.

s = Bio::Sequence::NA.new('atgc')
s2 = s.seq
puts s2                                 #=> 'atgc'

Returns: new Bio::Sequence::NA/AA object

seq! ()

Alias for: normalize!

splice (position)

Source

# File lib/bio/sequence/common.rb
def splice(position)
  unless position.is_a?(Locations) then
    position = Locations.new(position)
  end
  s = String.new
  position.each do |location|
    if location.sequence
      s << location.sequence
    else
      exon = self.subseq(location.from, location.to)
      begin
        exon.complement! if location.strand < 0
      rescue NameError
      end
      s << exon
    end
  end
  return self.class.new(s)
end

Return a new sequence extracted from the original using a GenBank style position string. See also documentation for the Bio::Location class.

s = Bio::Sequence::NA.new('atgcatgcatgcatgc')
puts s.splice('1..3')                           #=> "atg"
puts s.splice('join(1..3,8..10)')               #=> "atgcat"
puts s.splice('complement(1..3)')               #=> "cat"
puts s.splice('complement(join(1..3,8..10))')   #=> "atgcat"

Note that ‘complement’ed GenBank position strings have no effect on Bio::Sequence::AA objects.

Arguments:

(required) position: String or Bio::Locations object

Returns: Bio::Sequence::NA/AA object

Also aliased as: splicing
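
The loop in splice can be pictured without the location parser. The following is a minimal sketch that assumes the position string has already been parsed into [from, to, strand] triples. That triple format, splice_sketch, and reverse_complement are all hypothetical and handle only lowercase a, t, g, c:

```ruby
# Hypothetical helper: reverse complement of a lowercase DNA string.
def reverse_complement(dna)
  dna.reverse.tr('atgc', 'tacg')
end

# Each location is [from, to, strand], one-based and inclusive.
def splice_sketch(dna, locations)
  locations.map do |from, to, strand|
    exon = dna[(from - 1)..(to - 1)]
    strand < 0 ? reverse_complement(exon) : exon
  end.join
end

dna = 'atgcatgcatgcatgc'
puts splice_sketch(dna, [[1, 3, 1], [8, 10, 1]])   # atgcat  (like join(1..3,8..10))
puts splice_sketch(dna, [[1, 3, -1]])              # cat     (like complement(1..3))
```

For a complemented join, this sketch would need the triples supplied in reverse order with strand -1; whether the real parser orders them that way is an assumption here.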

splicing (position)

Alias for: splice

split (*arg)

Source

# File lib/bio/sequence/common.rb
def split(*arg)
  if block_given?
    super
  else
    ret = super(*arg)
    ret.collect! { |x| self.class.new('').replace(x) }
    ret
  end
end

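Acts almost the same as String#split. The difference is that, when no block is given, each element of the returned array is an object of the same class as the sequence rather than a plain String.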

Calls superclass method
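
A small sketch of the class-preserving behavior, using a hypothetical stand-in class (SketchSeq is not BioRuby code, and the block form is omitted for brevity):

```ruby
# Hypothetical stand-in for a sequence class; not BioRuby's implementation.
class SketchSeq < String
  # Convert every piece returned by String#split into the receiver's class.
  def split(*args)
    super(*args).map { |piece| self.class.new(piece) }
  end
end

parts = SketchSeq.new('atgnnncat').split('nnn')

puts parts.inspect                                # ["atg", "cat"]
puts parts.all? { |p| p.is_a?(SketchSeq) }        # true
```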

subseq (s = 1, e = self.length)

Source

# File lib/bio/sequence/common.rb
def subseq(s = 1, e = self.length)
  raise "Error: start/end position must be a positive integer" unless s > 0 and e > 0
  s -= 1
  e -= 1
  self[s..e]
end

Returns a new sequence containing the subsequence identified by the start and end positions given as parameters. Important: positions follow the biological sequence numbering convention (one-based, with an inclusive end) rather than Ruby’s zero-based convention.

s = Bio::Sequence::NA.new('atggaatga')
puts s.subseq(1,3)                      #=> "atg"

Start defaults to 1 and end defaults to the length of the sequence, so subseq called without any parameters simply returns a new sequence identical to the existing sequence.

puts s.subseq                           #=> "atggaatga"

Arguments:

(optional) s(start): Integer (default 1)
(optional) e(end): Integer (default current sequence length)

Returns: new Bio::Sequence::NA/AA object
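
The one-based, inclusive convention amounts to subtracting 1 from both positions before slicing. A minimal sketch on a plain String follows; subseq_sketch is a hypothetical helper, and raising ArgumentError is this sketch's own choice:

```ruby
# Hypothetical helper: one-based, inclusive slicing of a string.
def subseq_sketch(sequence, first = 1, last = sequence.length)
  unless first > 0 && last > 0
    raise ArgumentError, 'start/end position must be a positive integer'
  end
  sequence[(first - 1)..(last - 1)]   # shift to Ruby's zero-based indexing
end

s = 'atggaatga'
puts subseq_sketch(s, 1, 3)   # atg
puts subseq_sketch(s, 4, 6)   # gaa
puts subseq_sketch(s)         # atggaatga
```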

to_fasta (header = '', width = nil)

Source

# File lib/bio/sequence/compat.rb
def to_fasta(header = '', width = nil)
  warn "Bio::Sequence#to_fasta is obsolete. Use Bio::Sequence#output(:fasta) instead" if $DEBUG
  ">#{header}\n" +
  if width
    self.to_s.gsub(Regexp.new(".{1,#{width}}"), "\\0\n")
  else
    self.to_s + "\n"
  end
end

Bio::Sequence#to_fasta is DEPRECATED. Do not use Bio::Sequence#to_fasta! Use Bio::Sequence#output instead. Note that Bio::Sequence::NA#to_fasta, Bio::Sequence::AA#to_fasta, and Bio::Sequence::Generic#to_fasta can still be used, because there are no alternative methods.

Output the FASTA format string of the sequence. The first argument is used as the comment (header) string. If the second argument is given, the output sequence is folded into lines of at most that many characters.

Arguments:

(optional) header: String object
(optional) width: Fixnum object (default nil)

Returns: String
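
The folding step relies on a regular expression that matches up to width characters at a time and appends a newline to each match. A standalone sketch of the same idea on a plain String (fasta_sketch is a hypothetical helper, not a replacement for the output method recommended above):

```ruby
# Hypothetical helper: build a FASTA record, optionally folded to a width.
def fasta_sketch(sequence, header = '', width = nil)
  body = if width
           # \0 is the whole match, so each chunk gets its own line.
           sequence.gsub(/.{1,#{width}}/, "\\0\n")
         else
           "#{sequence}\n"
         end
  ">#{header}\n#{body}"
end

puts fasta_sketch('atgcatgcat', 'demo', 4)
# >demo
# atgc
# atgc
# at
```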

to_s ()

Source

# File lib/bio/sequence/common.rb
def to_s
  String.new(self)
end

Return sequence as String. The original sequence is unchanged.

s = Bio::Sequence::NA.new('atgc')
puts s.to_s                             #=> 'atgc'
puts s.to_s.class                       #=> String
puts s                                  #=> 'atgc'
puts s.class                            #=> Bio::Sequence::NA

Returns: String object

Also aliased as: to_str

to_str ()

Alias for: to_s

total (hash)

Source

# File lib/bio/sequence/common.rb
def total(hash)
  hash.default = 0.0 unless hash.default
  sum = 0.0
  self.each_byte do |x|
    begin
      sum += hash[x.chr]
    end
  end
  return sum
end

Returns a float total value for the sequence given a hash of base or residue values:

values = {'a' => 0.1, 't' => 0.2, 'g' => 0.3, 'c' => 0.4}
s = Bio::Sequence::NA.new('atgc')
puts s.total(values)                    #=> 1.0

Arguments:

(required) hash: Hash object

Returns: Float object
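
The source listing shows that residues missing from the hash contribute 0.0, because a default is set on the hash when it has none. The sketch below reproduces that summation on a plain String. total_sketch is a hypothetical helper; unlike the listing, it uses fetch with a fallback so the caller's hash is left untouched:

```ruby
# Hypothetical helper: sum per-residue values, treating unknown residues as 0.0.
def total_sketch(sequence, values)
  sequence.each_char.sum(0.0) { |residue| values.fetch(residue, 0.0) }
end

values = { 'a' => 0.1, 't' => 0.2, 'g' => 0.3, 'c' => 0.4 }
known   = total_sketch('atgc', values)   # all residues present in the hash
unknown = total_sketch('atgn', values)   # 'n' is absent and adds nothing

puts known.round(6)     # 1.0
puts unknown.round(6)   # 0.6
```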

window_search (window_size, step_size = 1) { |self| ... }

Source

# File lib/bio/sequence/common.rb
def window_search(window_size, step_size = 1)
  last_step = 0
  0.step(self.length - window_size, step_size) do |i|
    yield self[i, window_size]
    last_step = i
  end
  return self[last_step + window_size .. -1]
end

This method steps through a sequence in steps of ‘step_size’, yielding subsequences of ‘window_size’. Typically used with a block. Any sequence remaining after the last window is returned.

Prints average GC% on each 100bp

s.window_search(100) do |subseq|
  puts subseq.gc
end

Prints every translated peptide (length 5aa) in the same frame

s.window_search(15, 3) do |subseq|
  puts subseq.translate
end

Split genome sequence by 10000bp with 1000bp overlap in fasta format

i = 1
remainder = s.window_search(10000, 9000) do |subseq|
  puts subseq.to_fasta("segment #{i}", 60)
  i += 1
end
puts remainder.to_fasta("segment #{i}", 60)

Arguments:

(required) window_size: Fixnum
(optional) step_size: Fixnum (default 1)

Returns: new Bio::Sequence::NA/AA object
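
To see which windows are yielded and what is left over, here is a small standalone sketch of the same loop on a plain String (window_search_sketch is a hypothetical helper, not the library method):

```ruby
# Hypothetical helper: sliding windows over a string, returning the leftover tail.
def window_search_sketch(sequence, window_size, step_size = 1)
  last_start = 0
  0.step(sequence.length - window_size, step_size) do |start|
    yield sequence[start, window_size]
    last_start = start
  end
  # Everything after the last complete window.
  sequence[(last_start + window_size)..-1]
end

windows = []
remainder = window_search_sketch('atgcatgcatg', 4, 3) { |w| windows << w }

puts windows.inspect   # ["atgc", "catg", "gcat"]
puts remainder         # g
```

With a window of 4 and a step of 3, consecutive windows overlap by one position, and the single trailing base that cannot start a full window is returned as the remainder.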