class Bio::Lasergene

bio/db/lasergene.rb - Interface for DNAStar Lasergene sequence file format

Author

Trevor Wennblom <trevor@corevx.com>

Copyright

Copyright © 2007 Center for Biomedical Research Informatics, University of Minnesota (cbri.umn.edu)

License

The Ruby License

Description

Bio::Lasergene reads sequence files in DNAStar Lasergene format (.seq files). It expects exactly one sequence per file.

Usage

require 'bio'
filename = 'MyFile.seq'
lseq = Bio::Lasergene.new( IO.readlines(filename) )
lseq.entry_id  # => "Contig 1"
lseq.seq  # => ATGACGTATCCAAAGAGGCGTTACC

Comments

I’m only aware of the following three kinds of Lasergene file formats. Feel free to send me other examples that may not currently be accounted for.

File format 1:

## begin ##
"Contig 1" (1,934)
  Contig Length:                  934 bases
  Average Length/Sequence:        467 bases
  Total Sequence Length:         1869 bases
  Top Strand:                       2 sequences
  Bottom Strand:                    2 sequences
  Total:                            4 sequences
^^
ATGACGTATCCAAAGAGGCGTTACCGGAGAAGAAGACACCGCCCCCGCAGTCCTCTTGGCCAGATCCTCCGCCGCCGCCCCTGGCTCGTCCACCCCCGCCACAGTTACCGCTGGAGAAGGAAAAATGGCATCTTCAWCACCCGCCTATCCCGCAYCTTCGGAWRTACTATCAAGCGAACCACAGTCAGAACGCCCTCCTGGGCGGTGGACATGATGAGATTCAATATTAATGACTTTCTTCCCCCAGGAGGGGGCTCAAACCCCCGCTCTGTGCCCTTTGAATACTACAGAATAAGAAAGGTTAAGGTTGAATTCTGGCCCTGCTCCCCGATCACCCAGGGTGACAGGGGAATGGGCTCCAGTGCTGWTATTCTAGMTGATRRCTTKGTAACAAAGRCCACAGCCCTCACCTATGACCCCTATGTAAACTTCTCCTCCCGCCATACCATAACCCAGCCCTTCTCCTACCRCTCCCGYTACTTTACCCCCAAACCTGTCCTWGATKCCACTATKGATKACTKCCAACCAAACAACAAAAGAAACCAGCTGTGGSTGAGACTACAWACTGCTGGAAATGTAGACCWCGTAGGCCTSGGCACTGCGTKCGAAAACAGTATATACGACCAGGAATACAATATCCGTGTMACCATGTATGTACAATTCAGAGAATTTAATCTTAAAGACCCCCCRCTTMACCCKTAATGAATAATAAMAACCATTACGAAGTGATAAAAWAGWCTCAGTAATTTATTYCATATGGAAATTCWSGGCATGGGGGGGAAAGGGTGACGAACKKGCCCCCTTCCTCCSTSGMYTKTTCYGTAGCATTCYTCCAMAAYACCWAGGCAGYAMTCCTCCSATCAAGAGcYTSYACAGCTGGGACAGCAGTTGAGGAGGACCATTCAAAGGGGGTCGGATTGCTGGTAATCAGA
## end ##

File format 2:

## begin ##
^^:                                  350,935
Contig 1 (1,935)
  Contig Length:                  935 bases
  Average Length/Sequence:        580 bases
  Total Sequence Length:         2323 bases
  Top Strand:                       2 sequences
  Bottom Strand:                    2 sequences
  Total:                            4 sequences
^^
ATGTCGGGGAAATGCTTGACCGCGGGCTACTGCTCATCATTGCTTTCTTTGTGGTATATCGTGCCGTTCTGTTTTGCTGTGCTCGTCAACGCCAGCGGCGACAGCAGCTCTCATTTTCAGTCGATTTATAACTTGACGTTATGTGAGCTGAATGGCACGAACTGGCTGGCAGACAACTTTAACTGGGCTGTGGAGACTTTTGTCATCTTCCCCGTGTTGACTCACATTGTTTCCTATGGTGCACTCACTACCAGTCATTTTCTTGACACAGTTGGTCTAGTTACTGTGTCTACCGCCGGGTTTTATCACGGGCGGTACGTCTTGAGTAGCATCTACGCGGTCTGTGCTCTGGCTGCGTTGATTTGCTTCGCCATCAGGTTTGCGAAGAACTGCATGTCCTGGCGCTACTCTTGCACTAGATACACCAACTTCCTCCTGGACACCAAGGGCAGACTCTATCGTTGGCGGTCGCCTGTCATCATAGAGAAAGGGGGTAAGGTTGAGGTCGAAGGTCATCTGATCGATCTCAAAAGAGTTGTGCTTGATGGCTCTGTGGCGACACCTTTAACCAGAGTTTCAGCGGAACAATGGGGTCGTCCCTAGACGACTTTTGCCATGATAGTACAGCCCCACAGAAGGTGCTCTTGGCGTTTTCCATCACCTACACGCCAGTGATGATATATGCCCTAAAGGTAAGCCGCGGCCGACTTTTGGGGCTTCTGCACCTTTTGATTTTTTTGAACTGTGCCTTTACTTTCGGGTACATGACATTCGTGCACTTTCGGAGCACGAACAAGGTCGCGCTCACTATGGGAGCAGTAGTCGCACTCCTTTGGGGGGTGTACTCAGCCATAGAAACCTGGAAATTCATCACCTCCAGATGCCGTTGTGCTTGCTAGGCCGCAAGTACATTCTGGCCCCTGCCCACCACGTTG
## end ##

File format 3 (non-standard Lasergene header):

## begin ##
LOCUS       PRU87392               15411 bp    RNA     linear   VRL 17-NOV-2000
DEFINITION  Porcine reproductive and respiratory syndrome virus strain VR-2332,
            complete genome.
ACCESSION   U87392 AF030244 U00153
VERSION     U87392.3  GI:11192298
[...cut...]
     3'UTR           15261..15411
     polyA_site      15409
ORIGIN      
^^
atgacgtataggtgttggctctatgccttggcatttgtattgtcaggagctgtgaccattggcacagcccaaaacttgctgcacagaaacacccttctgtgatagcctccttcaggggagcttagggtttgtccctagcaccttgcttccggagttgcactgctttacggtctctccacccctttaaccatgtctgggatacttgatcggtgcacgtgtacccccaatgccagggtgtttatggcggagggccaagtctactgcacacgatgcctcagtgcacggtctctccttcccctgaacctccaagtttctgagctcggggtgctaggcctattctacaggcccgaagagccactccggtggacgttgccacgtgcattccccactgttgagtgctcccccgccggggcctgctggctttctgcaatctttccaatcgcacgaatgaccagtggaaacctgaacttccaacaaagaatggtacgggtcgcagctgagctttacagagccggccagctcacccctgcagtcttgaaggctctacaagtttatgaacggggttgccgctggtaccccattgttggacctgtccctggagtggccgttttcgccaattccctacatgtgagtgataaacctttcccgggagcaactcacgtgttgaccaacctgccgctcccgcagagacccaagcctgaagacttttgcccctttgagtgtgctatggctactgtctatgacattggtcatgacgccgtcatgtatgtggccgaaaggaaagtctcctgggcccctcgtggcggggatgaagtgaaatttgaagctgtccccggggagttgaagttgattgcgaaccggctccgcacctccttcccgccccaccacacagtggacatgtctaagttcgccttcacagcccctgggtgtggtgtttctatgcgggtcgaacgccaacacggctgccttcccgctgacactgtccctgaaggcaactgctggtggagcttgtttgacttgcttccactggaagttcagaacaaagaaattcgccatgctaaccaatttggctaccagaccaagcatggtgtctctggcaagtacctacagcggaggctgca[...cut...]
## end ##
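
The three layouts above share one shape: an optional `^^:` line, a header, a lone `^^` line, and then the sequence. The following is a minimal standalone sketch of that split in plain Ruby, not the library's implementation. The helper name `split_lasergene`, the two pattern constants, and the shortened sample text are hypothetical, and the patterns are inferred from the example files rather than taken from the values of DELIMITER_1 and DELIMITER_2.

```ruby
# Hypothetical patterns, inferred from the example files above.
HEADER_START = /\A\^\^:/       # optional "^^:" line (seen in file format 2)
HEADER_END   = /\A\^\^\s*\z/   # lone "^^" line between header and sequence

# Hypothetical helper: returns [header_lines, sequence_string].
def split_lasergene(text)
  lines = text.delete("\r").split("\n")

  end_index = lines.index { |line| line =~ HEADER_END }
  raise ArgumentError, "no '^^' separator found" if end_index.nil?

  # Skip the "^^:" line and anything before it, when present.
  start_index = lines.index { |line| line =~ HEADER_START }
  first = start_index ? start_index + 1 : 0

  header = lines[first...end_index]
  sequence = lines[(end_index + 1)..-1].join
  [header, sequence]
end

sample = <<~TEXT
  ^^:                                  350,935
  Contig 1 (1,935)
    Contig Length:                  935 bases
  ^^
  ATGTCGGGGAAATGC
  TTGACCGCGGGCTAC
TEXT

header, sequence = split_lasergene(sample)
puts header.first  # prints: Contig 1 (1,935)
puts sequence      # prints: ATGTCGGGGAAATGCTTGACCGCGGGCTAC
```

A format 3 file goes through the same split; only the content of the header differs.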

Constants

DELIMITER_1
DELIMITER_2

Attributes

average_length[R]

Average length per sequence

bottom_strand_sequences[R]

Number of bottom strand sequences

comments[R]

Entire header before the sequence

contig_length[R]

Contig length, length of present sequence

name[R]

Name of sequence

top_strand_sequences[R]

Number of top strand sequences

total_length[R]

Length of parent sequence

total_sequences[R]

Number of sequences
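
As an illustration of where these values come from, here is a standalone sketch that extracts the numeric fields from a format 1 header. The names `FIELD_PATTERNS` and `parse_header_fields` are hypothetical; the patterns mirror the ones in the process method shown further down this page, but the sketch returns a Hash instead of setting instance variables.

```ruby
# Hypothetical table mapping attribute names to header patterns.
FIELD_PATTERNS = {
  contig_length:           /Contig Length:\s+(\d+)/,
  average_length:          %r{Average Length/Sequence:\s+(\d+)},
  total_length:            /Total Sequence Length:\s+(\d+)/,
  top_strand_sequences:    /Top Strand:\s+(\d+)/,
  bottom_strand_sequences: /Bottom Strand:\s+(\d+)/,
  total_sequences:         /Total:\s+(\d+)/
}.freeze

# Hypothetical helper: returns a Hash of the integer fields found.
def parse_header_fields(header_lines)
  fields = {}
  header_lines.each do |line|
    FIELD_PATTERNS.each do |name, pattern|
      match = pattern.match(line)
      next unless match

      fields[name] = match[1].to_i
      break # at most one field per line, like an if/elsif chain
    end
  end
  fields
end

header = [
  '"Contig 1" (1,934)',
  '  Contig Length:                  934 bases',
  '  Average Length/Sequence:        467 bases',
  '  Total Sequence Length:         1869 bases',
  '  Top Strand:                       2 sequences',
  '  Bottom Strand:                    2 sequences',
  '  Total:                            4 sequences'
]

fields = parse_header_fields(header)
puts fields[:contig_length]    # prints: 934
puts fields[:total_sequences]  # prints: 4
```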

Public Class Methods

new(lines)
    # File lib/bio/db/lasergene.rb
def initialize(lines)
  process(lines)
end

Public Instance Methods

entry_id()

Name of sequence

    # File lib/bio/db/lasergene.rb
def entry_id
  @name
end

seq()

Sequence

Bio::Sequence::NA or Bio::Sequence::AA object

    # File lib/bio/db/lasergene.rb
def seq
  @sequence
end

standard_comment?()

Is the comment header recognized as standard Lasergene format?

Arguments

  • none

Returns

true or false

    # File lib/bio/db/lasergene.rb
def standard_comment?
  @standard_comment
end
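
To make the "standard" test concrete, the sketch below applies the header pattern used in the process method on this page to the first header line of each example format. `STANDARD_HEADER` and `standard_header_name` are hypothetical names introduced for illustration; only the regular expression comes from the source.

```ruby
# Same pattern that process applies to the first header line.
STANDARD_HEADER = %r{(.+)\s+\(\d+,\d+\)}

# Hypothetical helper: returns the captured name, or nil if the
# line does not look like a standard Lasergene header.
def standard_header_name(first_line)
  match = STANDARD_HEADER.match(first_line)
  match && match[1]
end

format_1 = '"Contig 1" (1,934)'
format_2 = 'Contig 1 (1,935)'
format_3 = 'LOCUS       PRU87392               15411 bp    RNA     linear   VRL 17-NOV-2000'

puts standard_header_name(format_1)          # prints: "Contig 1"
puts standard_header_name(format_2)          # prints: Contig 1
puts standard_header_name(format_3).inspect  # prints: nil
```

The captured name keeps the quotation marks of a format 1 header. A format 3 header does not match, which corresponds to standard_comment? returning false.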

Protected Instance Methods

process(lines)
    # File lib/bio/db/lasergene.rb
def process(lines)
  delimiter_1_indices = []
  delimiter_2_indices = []

  # If the data from the file is passed as one big String instead of
  # broken into an Array, convert lines to an Array
  if lines.kind_of? String
    lines = lines.tr("\r", '').split("\n")
  end

  # Record the index of every line that matches a delimiter
  lines.each_with_index do |line, index|
    if line.match DELIMITER_1
      delimiter_1_indices << index
    elsif line.match DELIMITER_2
      delimiter_2_indices << index
    end
  end

  raise InputError, "More than one delimiter of type '#{DELIMITER_1}'" if delimiter_1_indices.size > 1
  raise InputError, "More than one delimiter of type '#{DELIMITER_2}'" if delimiter_2_indices.size > 1
  raise InputError, "No comment to data separator of type '#{DELIMITER_2}'" if delimiter_2_indices.size < 1

  if !delimiter_1_indices.empty?
    # toss out DELIMITER_1 and anything preceding it
    @comments = lines[ (delimiter_1_indices[0] + 1) .. (delimiter_2_indices[0] - 1) ]
  else
    @comments = lines[ 0 .. (delimiter_2_indices[0] - 1) ]
  end

  @standard_comment = false
  if @comments[0] =~ %r{(.+)\s+\(\d+,\d+\)} # if we have a standard Lasergene comment
    @standard_comment = true
    @name = $1
    # Pull the numeric fields out of the standard header
    comments.each do |comment|
      if comment.match('Contig Length:\s+(\d+)')
        @contig_length = $1.to_i
      elsif comment.match('Average Length/Sequence:\s+(\d+)')
        @average_length = $1.to_i
      elsif comment.match('Total Sequence Length:\s+(\d+)')
        @total_length = $1.to_i
      elsif comment.match('Top Strand:\s+(\d+)')
        @top_strand_sequences = $1.to_i
      elsif comment.match('Bottom Strand:\s+(\d+)')
        @bottom_strand_sequences = $1.to_i
      elsif comment.match('Total:\s+(\d+)')
        @total_sequences = $1.to_i
      end
    end
  end

  # Everything after DELIMITER_2 is sequence data
  @comments = @comments.join('')
  @sequence = Bio::Sequence.auto( lines[ (delimiter_2_indices[0] + 1) .. -1 ].join('') )
end
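
The first step of process accepts either a String or an Array of lines. The sketch below isolates that normalization so the two input styles can be compared. `normalize_lines` is a hypothetical name and the sample text is made up; the String branch is the same `tr` and `split` call as in the method above.

```ruby
# Hypothetical helper mirroring the String check at the top of process.
def normalize_lines(input)
  return input unless input.is_a?(String)

  # Drop carriage returns, then break on newlines.
  input.tr("\r", '').split("\n")
end

text = "Contig 1 (1,935)\r\n^^\r\nATGTCG\r\nGGGAAA\r\n"

from_string = normalize_lines(text)
from_array  = normalize_lines(text.lines) # similar to what IO.readlines returns

p from_string       # prints: ["Contig 1 (1,935)", "^^", "ATGTCG", "GGGAAA"]
p from_array.first  # prints: "Contig 1 (1,935)\r\n"
```

Array input is passed through untouched, so lines read with IO.readlines keep their line terminators, while String input has them removed before the header is joined.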