class Bio::Sequence::NA

DESCRIPTION

Bio::Sequence::NA represents a bare Nucleic Acid sequence in bioruby.

USAGE

# Create a Nucleic Acid sequence.
dna = Bio::Sequence.auto('atgcatgcATGCATGCAAAA')
rna = Bio::Sequence.auto('augcaugcaugcaugcaaaa')

# What are the names of all the bases?
puts dna.names
puts rna.names

# What is the GC percentage?
puts dna.gc_percent
puts rna.gc_percent

# What is the molecular weight?
puts dna.molecular_weight
puts rna.molecular_weight

# What is the reverse complement?
puts dna.reverse_complement
puts dna.complement

# Is this sequence DNA or RNA?
puts dna.rna?

# Translate my sequence (see method docs for many options)
puts dna.translate
puts rna.translate

Public Class Methods

new (str)

Source

# File lib/bio/sequence/na.rb
def initialize(str)
  super
  self.downcase!
  self.tr!(" \t\n\r",'')
end

Generate a nucleic acid sequence object from a string.

s = Bio::Sequence::NA.new("aagcttggaccgttgaagt")

or, if the nucleic acid sequence is stored in a file,

s = Bio::Sequence::NA.new(File.read('dna.txt'))

Nucleic acid sequences are always stored in lowercase in bioruby.

s = Bio::Sequence::NA.new("AAGcTtGG")
puts s                                  #=> "aagcttgg"

Whitespace is stripped from the sequence

s = Bio::Sequence::NA.new("atg\nggg\ttt\r  gc")
puts s                                  #=> "atggggttgc"

Arguments:

(required) str: String

Returns: Bio::Sequence::NA object

Calls superclass method
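
The two normalization steps in the constructor can be tried on a plain String. This is only a sketch: normalize_na is a hypothetical helper, not part of bioruby, and it returns an ordinary String rather than a Bio::Sequence::NA object.

```ruby
# Sketch only: mirrors the downcase and whitespace-stripping steps of
# Bio::Sequence::NA#initialize on a plain String.
# normalize_na is a hypothetical name used for illustration.
def normalize_na(str)
  # String#tr with an empty replacement deletes the listed characters.
  str.downcase.tr(" \t\n\r", '')
end

puts normalize_na("AAGcTtGG")             # prints aagcttgg
puts normalize_na("atg\nggg\ttt\r  gc")   # prints atggggttgc
```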

randomize (*arg, &block)

Source

# File lib/bio/sequence/compat.rb
def self.randomize(*arg, &block)
  self.new('').randomize(*arg, &block)
end

Generate a new random sequence with the given frequency of bases. The sequence length is determined by their cumulative sum. (See also Bio::Sequence::Common#randomize which creates a new randomized sequence object using the base composition of an existing sequence instance).

counts = {'a'=>1,'c'=>2,'g'=>3,'t'=>4}
puts Bio::Sequence::NA.randomize(counts)  #=> "ggcttgttac" (for example)

You may also feed the output of randomize into a block

actual_counts = {'a'=>0, 'c'=>0, 'g'=>0, 't'=>0}
Bio::Sequence::NA.randomize(counts) {|x| actual_counts[x] += 1}
actual_counts                     #=> {"a"=>1, "c"=>2, "g"=>3, "t"=>4}

Arguments:

(optional) hash: Hash object

Returns: Bio::Sequence::NA object
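
To illustrate why the sequence length equals the sum of the counts, here is a minimal plain-Ruby sketch. It is not bioruby's implementation: random_sequence is a hypothetical helper that expands the counts into a pool of bases and shuffles the pool.

```ruby
# Hypothetical helper, not bioruby code: build a pool containing each
# base as many times as requested, then shuffle it.
def random_sequence(counts)
  pool = counts.flat_map { |base, n| [base] * n }
  pool.shuffle.join
end

counts = { 'a' => 1, 'c' => 2, 'g' => 3, 't' => 4 }
seq = random_sequence(counts)

puts seq.length                                          # prints 10
puts counts.all? { |base, n| seq.count(base) == n }      # prints true
```

The order of bases differs from run to run, but the composition always matches the requested counts.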

Public Instance Methods

at_content ()

Source

# File lib/bio/sequence/na.rb
def at_content
  count = self.composition
  at = count['a'] + count['t'] + count['u']
  gc = count['g'] + count['c']
  total = at + gc
  return 0.0 if total == 0
  return at.quo(total)
end

Calculate the ratio of AT / ATGC bases. U is regarded as T.

s = Bio::Sequence::NA.new('atggcgtga')
puts s.at_content                       #=> 4/9
puts s.at_content.to_f                  #=> 0.4444444444444444

In older Ruby versions, Float is always returned.

s = Bio::Sequence::NA.new('atggcgtga')
puts s.at_content                       #=> 0.444444444444444

Note that “u” is regarded as “t”. If there are no ATGC bases in the sequence, 0.0 is returned.

Returns: Rational or Float
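
The same calculation can be sketched on a plain String to show where the Rational comes from. The helper name at_ratio is hypothetical, and String#count stands in for the composition hash used in the source above.

```ruby
# Plain-String sketch of the AT ratio; at_ratio is a hypothetical helper.
def at_ratio(seq)
  at = seq.count('atu')   # U is counted together with A and T
  gc = seq.count('gc')
  total = at + gc
  return 0.0 if total == 0
  at.quo(total)           # Integer#quo returns a Rational in current Ruby
end

ratio = at_ratio('atggcgtga')
puts ratio                # prints 4/9
puts ratio.class          # prints Rational
puts at_ratio('nnnn')     # prints 0.0
```

Note that the empty case returns the Float 0.0 while the normal case returns a Rational, which is why the return type is documented as "Rational or Float".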

at_skew ()

Source

# File lib/bio/sequence/na.rb
def at_skew
  count = self.composition
  a = count['a']
  t = count['t'] + count['u']
  at = a + t
  return 0.0 if at == 0
  return (a - t).quo(at)
end

Calculate the ratio of (A - T) / (A + T) bases. U is regarded as T.

s = Bio::Sequence::NA.new('atgttgttgttc')
puts s.at_skew                          #=> -3/4
puts s.at_skew.to_f                     #=> -0.75

In older Ruby versions, Float is always returned.

s = Bio::Sequence::NA.new('atgttgttgttc')
puts s.at_skew                          #=> -0.75

Note that “u” is regarded as “t”. If there are no AT bases in the sequence, 0.0 is returned.

Returns: Rational or Float

codon_usage ()

Source

# File lib/bio/sequence/na.rb
def codon_usage
  hash = Hash.new(0)
  self.window_search(3, 3) do |codon|
    hash[codon] += 1
  end
  return hash
end

Returns counts of each codon in the sequence in a hash.

s = Bio::Sequence::NA.new('atggcgtga')
puts s.codon_usage                #=> {"gcg"=>1, "tga"=>1, "atg"=>1}

This method does not validate codons! Any three letter group is a ‘codon’. So,

s = Bio::Sequence::NA.new('atggNNtga')
puts s.codon_usage                #=> {"tga"=>1, "gnn"=>1, "atg"=>1}

s = Bio::Sequence::NA.new('atgg--tga')
puts s.codon_usage                #=> {"tga"=>1, "g--"=>1, "atg"=>1}

Also, there is no option to work in any frame other than the first.

Returns: Hash object
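
Because only the first frame is supported, counting codons in another frame means slicing the sequence first. The sketch below shows the idea on a plain String; codon_counts and its offset parameter are hypothetical and not part of bioruby.

```ruby
# Hypothetical helper on a plain String: count non-overlapping triplets,
# starting at a 0-based offset (0, 1 or 2). A trailing partial codon
# is ignored, and no triplet is validated.
def codon_counts(seq, offset = 0)
  counts = Hash.new(0)
  seq[offset..-1].scan(/.{3}/) { |codon| counts[codon] += 1 }
  counts
end

p codon_counts('atggcgtga')      # frame 1: atg, gcg, tga
p codon_counts('atggcgtga', 1)   # frame 2: tgg, cgt
```

With a Bio::Sequence::NA object, the same effect should be obtainable by taking a subsequence before calling codon_usage.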

complement ()

Alias for Bio::Sequence::NA#reverse_complement

Alias for: reverse_complement

complement! ()

Alias for Bio::Sequence::NA#reverse_complement!

Alias for: reverse_complement!

cut_with_enzyme (*args)

Source

# File lib/bio/sequence/na.rb
def cut_with_enzyme(*args)
  Bio::RestrictionEnzyme::Analysis.cut(self, *args)
end

Example:

seq = Bio::Sequence::NA.new('gaattc')
cuts = seq.cut_with_enzyme('EcoRI')

seq = Bio::Sequence::NA.new('gaattc')
cuts = seq.cut_with_enzyme('g^aattc')

See Bio::RestrictionEnzyme::Analysis.cut

Also aliased as: cut_with_enzymes

cut_with_enzymes (*args)

Alias for: cut_with_enzyme

dna ()

Source

# File lib/bio/sequence/na.rb
def dna
  self.tr('u', 't')
end

Returns a new sequence object with any ‘u’ bases changed to ‘t’. The original sequence is not modified.

s = Bio::Sequence::NA.new('augc')
puts s.dna                              #=> 'atgc'
puts s                                  #=> 'augc'

Returns: new Bio::Sequence::NA object

dna! ()

Source

# File lib/bio/sequence/na.rb
def dna!
  self.tr!('u', 't')
end

Changes any ‘u’ bases in the original sequence to ‘t’. The original sequence is modified.

s = Bio::Sequence::NA.new('augc')
puts s.dna!                             #=> 'atgc'
puts s                                  #=> 'atgc'

Returns: current Bio::Sequence::NA object (modified)

forward_complement ()

Source

# File lib/bio/sequence/na.rb
def forward_complement
  s = self.class.new(self)
  s.forward_complement!
  s
end

Returns a new complementary sequence object (without reversing). The original sequence object is not modified.

s = Bio::Sequence::NA.new('atgc')
puts s.forward_complement               #=> 'tacg'
puts s                                  #=> 'atgc'

Returns: new Bio::Sequence::NA object

forward_complement! ()

Source

# File lib/bio/sequence/na.rb
def forward_complement!
  if self.rna?
    self.tr!('augcrymkdhvbswn', 'uacgyrkmhdbvswn')
  else
    self.tr!('atgcrymkdhvbswn', 'tacgyrkmhdbvswn')
  end
  self
end

Converts the current sequence into its complement (without reversing). The original sequence object is modified.

s = Bio::Sequence::NA.new('atgc')
puts s.forward_complement!              #=> 'tacg'
puts s                                  #=> 'tacg'

Returns: current Bio::Sequence::NA object (modified)
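
The character mapping used for complementation also covers the ambiguity codes. The following plain-String sketch reuses the two translation tables from the source above; complement_bases is a hypothetical helper, and the presence of 'u' is used to choose the RNA table, as rna? does.

```ruby
# Hypothetical helper: complement a lowercase sequence string without
# reversing it, using the same tr tables as the source listing.
def complement_bases(seq)
  if seq.include?('u')
    seq.tr('augcrymkdhvbswn', 'uacgyrkmhdbvswn')   # RNA: a <-> u
  else
    seq.tr('atgcrymkdhvbswn', 'tacgyrkmhdbvswn')   # DNA: a <-> t
  end
end

puts complement_bases('atgc')           # prints tacg
puts complement_bases('augc')           # prints uacg
puts complement_bases('atgcrn')         # prints tacgyn
puts complement_bases('atgc').reverse   # prints gcat
```

Reversing the complemented string gives the reverse complement, which is how the two operations relate.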

gc_content ()

Source

# File lib/bio/sequence/na.rb
def gc_content
  count = self.composition
  at = count['a'] + count['t'] + count['u']
  gc = count['g'] + count['c']
  total = at + gc
  return 0.0 if total == 0
  return gc.quo(total)
end

Calculate the ratio of GC / ATGC bases. U is regarded as T.

s = Bio::Sequence::NA.new('atggcgtga')
puts s.gc_content                       #=> 5/9
puts s.gc_content.to_f                  #=> 0.5555555555555556

In older Ruby versions, Float is always returned.

s = Bio::Sequence::NA.new('atggcgtga')
puts s.gc_content                       #=> 0.555555555555556

Note that “u” is regarded as “t”. If there are no ATGC bases in the sequence, 0.0 is returned.

Returns: Rational or Float

gc_percent ()

Source

# File lib/bio/sequence/na.rb
def gc_percent
  count = self.composition
  at = count['a'] + count['t'] + count['u']
  gc = count['g'] + count['c']
  return 0 if at + gc == 0
  gc = 100 * gc / (at + gc)
  return gc
end

Calculate the ratio of GC / ATGC bases as a percentage, truncated to a whole number by integer division (so 55.56 becomes 55, not 56). U is regarded as T.

s = Bio::Sequence::NA.new('atggcgtga')
puts s.gc_percent                       #=> 55

Note that this method returns only an integer value. When digits after the decimal point are needed, use gc_content with sprintf as shown below:

s = Bio::Sequence::NA.new('atggcgtga')
puts sprintf("%3.2f", s.gc_content * 100)  #=> "55.56"

Returns: Integer (Fixnum in older Ruby versions)
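
The difference between the integer division used by gc_percent and true rounding can be checked with plain integers. This sketch uses the counts for 'atggcgtga' (5 GC bases out of 9) and does not call bioruby.

```ruby
gc = 5       # G and C bases in 'atggcgtga'
total = 9    # all A, T, G and C bases

truncated = 100 * gc / total            # Integer division drops the fraction
rounded   = (100 * gc.quo(total)).round
formatted = format('%.2f', (100 * gc.quo(total)).to_f)

puts truncated   # prints 55
puts rounded     # prints 56
puts formatted   # prints 55.56
```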

gc_skew ()

Source

# File lib/bio/sequence/na.rb
def gc_skew
  count = self.composition
  g = count['g']
  c = count['c']
  gc = g + c
  return 0.0 if gc == 0
  return (g - c).quo(gc)
end

Calculate the ratio of (G - C) / (G + C) bases.

s = Bio::Sequence::NA.new('atggcgtga')
puts s.gc_skew                          #=> 3/5
puts s.gc_skew.to_f                     #=> 0.6

In older Ruby versions, Float is always returned.

s = Bio::Sequence::NA.new('atggcgtga')
puts s.gc_skew                          #=> 0.6

If there are no GC bases in the sequence, 0.0 is returned.

Returns: Rational or Float

illegal_bases ()

Source

# File lib/bio/sequence/na.rb
def illegal_bases
  self.scan(/[^atgcu]/).sort.uniq
end

Returns an alphabetically sorted array of any non-standard bases (other than ‘atgcu’).

s = Bio::Sequence::NA.new('atgStgQccR')
puts s.illegal_bases                    #=> ["q", "r", "s"]

Returns: Array object
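
Since the pattern excludes only 'atgcu', gap characters and ambiguity codes such as 'n' are reported as well. A plain-String sketch of the same scan, with nonstandard_bases as a hypothetical helper name:

```ruby
# Hypothetical helper: list every character other than a, t, g, c, u,
# sorted and without duplicates. The downcase step stands in for the
# normalization that Bio::Sequence::NA.new performs.
def nonstandard_bases(seq)
  seq.downcase.scan(/[^atgcu]/).sort.uniq
end

p nonstandard_bases('atgStgQccR')   # prints ["q", "r", "s"]
p nonstandard_bases('atg--n')       # prints ["-", "n"]
p nonstandard_bases('augc')         # prints []
```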

molecular_weight ()

Source

# File lib/bio/sequence/na.rb
def molecular_weight
  if self.rna?
    Bio::NucleicAcid.weight(self, true)
  else
    Bio::NucleicAcid.weight(self)
  end
end

Estimate molecular weight (using the values from BioPerl’s SeqStats.pm module).

s = Bio::Sequence::NA.new('atggcgtga')
puts s.molecular_weight                 #=> 2841.00708

RNA and DNA do not have the same molecular weights,

s = Bio::Sequence::NA.new('auggcguga')
puts s.molecular_weight                 #=> 2956.94708

Returns: Float object

names ()

Source

# File lib/bio/sequence/na.rb
def names
  array = []
  self.each_byte do |x|
    array.push(Bio::NucleicAcid.names[x.chr.upcase])
  end
  return array
end

Generate a list of the full names of the nucleotides in the sequence, in order. The names are looked up in the hash returned by Bio::NucleicAcid.names.

s = Bio::Sequence::NA.new('atg')
puts s.names                    #=> ["Adenine", "Thymine", "Guanine"]

Returns: Array object

reverse_complement ()

Source

# File lib/bio/sequence/na.rb
def reverse_complement
  s = self.class.new(self)
  s.reverse_complement!
  s
end

Returns a new sequence object with the reverse complement sequence to the original. The original sequence is not modified.

s = Bio::Sequence::NA.new('atgc')
puts s.reverse_complement               #=> 'gcat'
puts s                                  #=> 'atgc'

Returns: new Bio::Sequence::NA object

Also aliased as: complement

reverse_complement! ()

Source

# File lib/bio/sequence/na.rb
def reverse_complement!
  self.reverse!
  self.forward_complement!
end

Converts the original sequence into its reverse complement.

The original sequence is modified.

s = Bio::Sequence::NA.new('atgc')
puts s.reverse_complement!              #=> 'gcat'
puts s                                  #=> 'gcat'

Returns: current Bio::Sequence::NA object (modified)

Also aliased as: complement!

rna ()

Source

# File lib/bio/sequence/na.rb
def rna
  self.tr('t', 'u')
end

Returns a new sequence object with any ‘t’ bases changed to ‘u’. The original sequence is not modified.

s = Bio::Sequence::NA.new('atgc')
puts s.rna                              #=> 'augc'
puts s                                  #=> 'atgc'

Returns: new Bio::Sequence::NA object

rna! ()

Source

# File lib/bio/sequence/na.rb
def rna!
  self.tr!('t', 'u')
end

Changes any ‘t’ bases in the original sequence to ‘u’. The original sequence is modified.

s = Bio::Sequence::NA.new('atgc')
puts s.rna!                             #=> 'augc'
puts s                                  #=> 'augc'

Returns: current Bio::Sequence::NA object (modified)

to_re ()

Source

# File lib/bio/sequence/na.rb
def to_re
  if self.rna?
    Bio::NucleicAcid.to_re(self.dna, true)
  else
    Bio::NucleicAcid.to_re(self)
  end
end

Create a ruby regular expression instance (Regexp)

s = Bio::Sequence::NA.new('atggcgtga')
puts s.to_re                            #=> /atggcgtga/

Returns: Regexp object

translate (frame = 1, table = 1, unknown = 'X')

Source

# File lib/bio/sequence/na.rb
def translate(frame = 1, table = 1, unknown = 'X')
  if table.is_a?(Bio::CodonTable)
    ct = table
  else
    ct = Bio::CodonTable[table]
  end
  naseq = self.dna
  case frame
  when 1, 2, 3
    from = frame - 1
  when 4, 5, 6
    from = frame - 4
    naseq.complement!
  when -1, -2, -3
    from = -1 - frame
    naseq.complement!
  else
    from = 0
  end
  nalen = naseq.length - from
  nalen -= nalen % 3
  aaseq = naseq[from, nalen].gsub(/.{3}/) {|codon| ct[codon] or unknown}
  return Bio::Sequence::AA.new(aaseq)
end

Translate into an amino acid sequence.

s = Bio::Sequence::NA.new('atggcgtga')
puts s.translate                        #=> "MA*"

By default, translate starts in reading frame position 1, but you can start in either 2 or 3 as well,

puts s.translate(2)                     #=> "WR"
puts s.translate(3)                     #=> "GV"

You may also translate the reverse complement in one step by using frame values of -1, -2, and -3 (or 4, 5, and 6)

puts s.translate(-1)                    #=> "SRH"
puts s.translate(4)                     #=> "SRH"
puts s.reverse_complement.translate(1)  #=> "SRH"

The default codon table in the translate function is the Standard Eukaryotic codon table. The translate function takes either a number or a Bio::CodonTable object for its table argument. The available tables are (NCBI):

1. "Standard (Eukaryote)"
2. "Vertebrate Mitochondrial"
3. "Yeast Mitochondorial"
4. "Mold, Protozoan, Coelenterate Mitochondrial and Mycoplasma/Spiroplasma"
5. "Invertebrate Mitochondrial"
6. "Ciliate Macronuclear and Dasycladacean"
9. "Echinoderm Mitochondrial"
10. "Euplotid Nuclear"
11. "Bacteria"
12. "Alternative Yeast Nuclear"
13. "Ascidian Mitochondrial"
14. "Flatworm Mitochondrial"
15. "Blepharisma Macronuclear"
16. "Chlorophycean Mitochondrial"
21. "Trematode Mitochondrial"
22. "Scenedesmus obliquus mitochondrial"
23. "Thraustochytrium Mitochondrial"

If you are using anything other than the default table, you must specify frame in the translate method call,

puts s.translate                #=> "MA*"  (using defaults)
puts s.translate(1,1)           #=> "MA*"  (same as above, but explicit)
puts s.translate(1,2)           #=> "MAW"  (different codon table)

You can also pass a Bio::CodonTable instance instead of a table number,

mt_table = Bio::CodonTable[2]
puts s.translate(1, mt_table)           #=> "MAW"

By default, any invalid or unknown codons (as could happen if the sequence contains ambiguities) will be represented by ‘X’ in the translated sequence. You may change this to any character of your choice.

s = Bio::Sequence::NA.new('atgcNNtga')
puts s.translate                        #=> "MX*"
puts s.translate(1,1,'9')               #=> "M9*"

The translate method considers gaps to be unknown characters and treats them as such (i.e. does not collapse sequences prior to translation), so

s = Bio::Sequence::NA.new('atgc--tga')
puts s.translate                        #=> "MX*"

Arguments:

(optional) frame: one of 1,2,3,4,5,6,-1,-2,-3 (default 1)
(optional) table: Integer table number between 1 and 23 (one of those listed above) or Bio::CodonTable object (default 1)
(optional) unknown: Character (default ‘X’)

Returns: Bio::Sequence::AA object
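
The frame arithmetic can be followed on a plain String. The sketch below mirrors the case statement in the source above but is not bioruby code: codons_in_frame and TINY_TABLE are hypothetical, the reverse complement handles only unambiguous DNA bases, and the table holds just three entries of the standard code for illustration.

```ruby
# Hypothetical helper: return the codons a given frame would read.
# Frames 1..3 read the forward strand; 4..6 and -1..-3 read the
# reverse complement. Any other frame value falls back to frame 1.
def codons_in_frame(seq, frame = 1)
  case frame
  when 1, 2, 3
    from = frame - 1
  when 4, 5, 6
    from = frame - 4
    seq = seq.reverse.tr('atgc', 'tacg')
  when -1, -2, -3
    from = -1 - frame
    seq = seq.reverse.tr('atgc', 'tacg')
  else
    from = 0
  end
  len = seq.length - from
  len -= len % 3                  # drop the trailing partial codon
  seq[from, len].scan(/.{3}/)
end

# Three entries of the standard code, enough for this example only.
TINY_TABLE = { 'atg' => 'M', 'gcg' => 'A', 'tga' => '*' }

p codons_in_frame('atggcgtga')        # prints ["atg", "gcg", "tga"]
p codons_in_frame('atggcgtga', 2)     # prints ["tgg", "cgt"]
p codons_in_frame('atggcgtga', -1)    # prints ["tca", "cgc", "cat"]

# Codons missing from the table map to the unknown character 'X'.
puts codons_in_frame('atgcnntga').map { |c| TINY_TABLE.fetch(c, 'X') }.join   # prints MX*
```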

Protected Instance Methods

rna? ()

Source

# File lib/bio/sequence/na.rb
def rna?
  self.index('u')
end
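
Because the method body is a call to index, the result is a position or nil rather than true or false. The plain-String sketch below shows the values that String#index produces; it does not call the protected method itself.

```ruby
# String#index returns the position of the first match, or nil.
rna_like   = 'augc'.index('u')
dna_like   = 'atgc'.index('u')
leading_u  = 'uagc'.index('u')

p rna_like    # prints 1
p dna_like    # prints nil
p leading_u   # prints 0

# 0 is truthy in Ruby, so a leading 'u' still counts as RNA.
puts(leading_u ? 'rna' : 'dna')   # prints rna
```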