class Bio::GCG::Seq
Bio::GCG::Seq
Parser class for the GCG sequence file format (.seq or .pep).
References
- Information about GCG Wisconsin Package®: www.accelrys.com/products/gcg_wisconsin_package
- EMBOSS sequence formats: www.hgmp.mrc.ac.uk/Software/EMBOSS/Themes/SequenceFormats.html
- BioPerl document
Constants
- DELIMITER: delimiter used by Bio::FlatFile
Attributes
- checksum: “Check:” field, the checksum of the current sequence.
- date: Date field of this entry.
- definition: Description field.
- entry_id: ID field.
- heading: Heading line (‘!!NA_SEQUENCE 1.0’ or similar).
- length: “Length:” field. Note that this value may differ from the actual sequence length.
- seq_type: “Type:” field, the sequence type. “N” means a nucleic acid sequence, “P” means a protein sequence.
Public Class Methods
Calculates the checksum of the given string.
# File lib/bio/appl/gcg/seq.rb
def self.calc_checksum(str)
  # Reference: Bio::SeqIO::gcg of BioPerl-1.2.3
  idx = 0
  sum = 0
  str.upcase.tr('^A-Z.~', '').each_byte do |c|
    idx += 1
    sum += idx * c
    idx = 0 if idx >= 57
  end
  (sum % 10000)
end
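As a worked illustration of the position-weighted sum in the listing above, here is a standalone sketch that does not depend on BioRuby. The method name gcg_checksum is hypothetical and exists only for this example; the loop itself mirrors calc_checksum.

```ruby
# Standalone sketch of the checksum loop shown above.
# gcg_checksum is a hypothetical name, not part of Bio::GCG::Seq.
def gcg_checksum(str)
  idx = 0
  sum = 0
  # After upcasing, keep only A-Z, '.' and '~'; everything else is ignored.
  str.upcase.tr('^A-Z.~', '').each_byte do |byte|
    idx += 1              # position weight, cycling through 1..57
    sum += idx * byte     # weight each character code by its position
    idx = 0 if idx >= 57  # restart the weights after 57 characters
  end
  sum % 10000
end

# 'ACGT' => 1*65 + 2*67 + 3*71 + 4*84 = 748
puts gcg_checksum('acgt')     # prints 748
puts gcg_checksum("ac gt\n")  # whitespace is stripped first, prints 748
```

Because the weights restart after every 57 characters and the result is taken modulo 10000, the checksum always fits in four decimal digits.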
Creates a new instance of this class. str must be a string in GCG seq format.
# File lib/bio/appl/gcg/seq.rb
def initialize(str)
  @heading = str[/.*/] # '!!NA_SEQUENCE 1.0' or like this
  str = str.sub(/.*/, '')
  str.sub!(/.*\.\.$/m, '')
  @definition = $&.to_s.sub(/^.*\.\.$/, '').to_s
  desc = $&.to_s
  if m = /(.+)\s+Length\:\s+(\d+)\s+(.+)\s+Type\:\s+(\w)\s+Check\:\s+(\d+)/.match(desc) then
    @entry_id = m[1].to_s.strip
    @length = (m[2] ? m[2].to_i : nil)
    @date = m[3].to_s.strip
    @seq_type = m[4]
    @checksum = (m[5] ? m[5].to_i : nil)
  end
  @data = str
  @seq = nil
  @definition.strip!
end
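To show how the descriptor line is split into fields, the sketch below applies the same regular expression as the constructor to a single line. The helper name parse_gcg_descriptor and the sample line are assumptions made for illustration; neither is part of BioRuby or taken from a real entry.

```ruby
# Hypothetical helper: applies the regular expression used by the
# constructor above to one GCG descriptor line (the line ending in '..').
def parse_gcg_descriptor(line)
  pattern = /(.+)\s+Length\:\s+(\d+)\s+(.+)\s+Type\:\s+(\w)\s+Check\:\s+(\d+)/
  m = pattern.match(line)
  return nil unless m  # not a descriptor line

  {
    entry_id: m[1].strip,
    length:   m[2].to_i,
    date:     m[3].strip,
    seq_type: m[4],
    checksum: m[5].to_i
  }
end

# Made-up descriptor line, for illustration only.
line = 'gi-1234567  Length: 12  April 1, 2004 12:00  Type: N  Check: 1234  ..'
p parse_gcg_descriptor(line)
```

Unlike this sketch, which returns nil for a line that does not match, the constructor leaves the corresponding attributes unset.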
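Creates text in GCG sequence format. Parameters other than :seq can be omitted. If :seq is a Bio::Sequence::NA or Bio::Sequence::AA object, the sequence type is taken from its class and :seq_type is ignored.
Example:
Bio::GCG::Seq.to_gcg(:definition=>'H.sapiens DNA', :seq_type=>'N', :entry_id=>'gi-1234567', :seq=>seq, :date=>date)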
# File lib/bio/appl/gcg/seq.rb
def self.to_gcg(hash)
  seq = hash[:seq]
  if seq.is_a?(Bio::Sequence::NA) then
    seq_type = 'N'
  elsif seq.is_a?(Bio::Sequence::AA) then
    seq_type = 'P'
  else
    seq_type = (hash[:seq_type] or 'P')
  end
  if seq_type == 'N' then
    head = '!!NA_SEQUENCE 1.0'
  else
    head = '!!AA_SEQUENCE 1.0'
  end
  date = (hash[:date] or Time.now.strftime('%B %d, %Y %H:%M'))
  entry_id = hash[:entry_id].to_s.strip
  len = seq.length
  checksum = self.calc_checksum(seq)
  definition = hash[:definition].to_s.strip
  seq = seq.upcase.gsub(/.{1,50}/, "\\0\n")
  seq.gsub!(/.{10}/, "\\0 ")
  w = len.to_s.size + 1
  i = 1
  seq.gsub!(/^/) { |x| s = sprintf("\n%*d ", w, i); i += 50; s }

  [ head, "\n", definition, "\n\n",
    "#{entry_id} Length: #{len} #{date} " \
    "Type: #{seq_type} Check: #{checksum} ..\n",
    seq, "\n" ].join('')
end
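The body layout produced by the listing above (50 residues per line, in blocks of 10, each line prefixed with the position of its first residue) can be sketched on its own as follows. The name format_gcg_body is hypothetical, and this simplified version does not reproduce the exact whitespace that to_gcg emits.

```ruby
# Simplified sketch of the sequence body layout. format_gcg_body is a
# hypothetical name; spacing differs slightly from the output of to_gcg.
def format_gcg_body(seq)
  width = seq.length.to_s.size + 1  # same column width rule as in to_gcg
  rows = seq.upcase.scan(/.{1,50}/).each_with_index.map do |chunk, n|
    blocks = chunk.scan(/.{1,10}/).join(' ')      # blocks of 10 residues
    format('%*d %s', width, n * 50 + 1, blocks)   # 1-based start position
  end
  rows.join("\n\n")
end

puts format_gcg_body('acgt' * 15)
```

For a 60-residue input this yields two rows, starting at positions 1 and 51.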
Public Instance Methods
Use this method if you know the sequence is an amino acid sequence. Returns a Bio::Sequence::AA object.
Calling naseq on a protein sequence, or aaseq on a nucleic acid sequence, raises RuntimeError.
# File lib/bio/appl/gcg/seq.rb
def aaseq
  if seq.is_a?(Bio::Sequence::AA) then
    @seq
  else
    raise 'seq_type != \'P\''
  end
end
Use this method if you know the sequence is a nucleic acid sequence. Returns a Bio::Sequence::NA object.
Calling naseq on a protein sequence, or aaseq on a nucleic acid sequence, raises RuntimeError.
# File lib/bio/appl/gcg/seq.rb
def naseq
  if seq.is_a?(Bio::Sequence::NA) then
    @seq
  else
    raise 'seq_type != \'N\''
  end
end
Sequence data. The class of the sequence is Bio::Sequence::NA, Bio::Sequence::AA or Bio::Sequence::Generic, according to the sequence type.
# File lib/bio/appl/gcg/seq.rb
def seq
  unless @seq then
    case @seq_type
    when 'N', 'n'
      k = Bio::Sequence::NA
    when 'P', 'p'
      k = Bio::Sequence::AA
    else
      k = Bio::Sequence
    end
    @seq = k.new(@data.tr('^-a-zA-Z.~', ''))
  end
  @seq
end
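The character filter applied to the raw body text before the sequence object is built can be tried in isolation. The helper name extract_residues and the sample text below are assumptions for illustration only.

```ruby
# Hypothetical helper mirroring the filter used by seq: keep letters and
# the characters '-', '.' and '~'; drop digits, whitespace and the rest.
def extract_residues(data)
  data.tr('^-a-zA-Z.~', '')
end

body = "\n  1 ACGTACGTAC GT-ACG..~T\n"
puts extract_residues(body)  # prints ACGTACGTACGT-ACG..~T
```

Note that calc_checksum uses a narrower filter ('^A-Z.~' after upcasing), so '-' characters kept here do not contribute to the checksum.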
Validates the checksum. Returns true if the “Check:” value matches the checksum calculated from the sequence; otherwise returns false.
# File lib/bio/appl/gcg/seq.rb
def validate_checksum
  checksum == self.class.calc_checksum(seq)
end