class Bio::Lasergene
bio/db/lasergene.rb - Interface for DNAStar Lasergene
sequence file format
- Author
-
Trevor Wennblom <trevor@corevx.com>
- Copyright
-
Copyright © 2007 Center for Biomedical Research Informatics, University of Minnesota (cbri.umn.edu)
- License
-
The Ruby License
Description¶ ↑
Bio::Lasergene
reads DNAStar Lasergene
formatted sequence files, or .seq
files. It only expects to find one sequence per file.
Usage¶ ↑
require 'bio' filename = 'MyFile.seq' lseq = Bio::Lasergene.new( IO.readlines(filename) ) lseq.entry_id # => "Contig 1" lseq.seq # => ATGACGTATCCAAAGAGGCGTTACC
Comments¶ ↑
I’m only aware of the following three kinds of Lasergene
file formats. Feel free to send me other examples that may not currently be accounted for.
File format 1:
## begin ## "Contig 1" (1,934) Contig Length: 934 bases Average Length/Sequence: 467 bases Total Sequence Length: 1869 bases Top Strand: 2 sequences Bottom Strand: 2 sequences Total: 4 sequences ^^ ATGACGTATCCAAAGAGGCGTTACCGGAGAAGAAGACACCGCCCCCGCAGTCCTCTTGGCCAGATCCTCCGCCGCCGCCCCTGGCTCGTCCACCCCCGCCACAGTTACCGCTGGAGAAGGAAAAATGGCATCTTCAWCACCCGCCTATCCCGCAYCTTCGGAWRTACTATCAAGCGAACCACAGTCAGAACGCCCTCCTGGGCGGTGGACATGATGAGATTCAATATTAATGACTTTCTTCCCCCAGGAGGGGGCTCAAACCCCCGCTCTGTGCCCTTTGAATACTACAGAATAAGAAAGGTTAAGGTTGAATTCTGGCCCTGCTCCCCGATCACCCAGGGTGACAGGGGAATGGGCTCCAGTGCTGWTATTCTAGMTGATRRCTTKGTAACAAAGRCCACAGCCCTCACCTATGACCCCTATGTAAACTTCTCCTCCCGCCATACCATAACCCAGCCCTTCTCCTACCRCTCCCGYTACTTTACCCCCAAACCTGTCCTWGATKCCACTATKGATKACTKCCAACCAAACAACAAAAGAAACCAGCTGTGGSTGAGACTACAWACTGCTGGAAATGTAGACCWCGTAGGCCTSGGCACTGCGTKCGAAAACAGTATATACGACCAGGAATACAATATCCGTGTMACCATGTATGTACAATTCAGAGAATTTAATCTTAAAGACCCCCCRCTTMACCCKTAATGAATAATAAMAACCATTACGAAGTGATAAAAWAGWCTCAGTAATTTATTYCATATGGAAATTCWSGGCATGGGGGGGAAAGGGTGACGAACKKGCCCCCTTCCTCCSTSGMYTKTTCYGTAGCATTCYTCCAMAAYACCWAGGCAGYAMTCCTCCSATCAAGAGcYTSYACAGCTGGGACAGCAGTTGAGGAGGACCATTCAAAGGGGGTCGGATTGCTGGTAATCAGA ## end ##
File format 2:
## begin ## ^^: 350,935 Contig 1 (1,935) Contig Length: 935 bases Average Length/Sequence: 580 bases Total Sequence Length: 2323 bases Top Strand: 2 sequences Bottom Strand: 2 sequences Total: 4 sequences ^^ ATGTCGGGGAAATGCTTGACCGCGGGCTACTGCTCATCATTGCTTTCTTTGTGGTATATCGTGCCGTTCTGTTTTGCTGTGCTCGTCAACGCCAGCGGCGACAGCAGCTCTCATTTTCAGTCGATTTATAACTTGACGTTATGTGAGCTGAATGGCACGAACTGGCTGGCAGACAACTTTAACTGGGCTGTGGAGACTTTTGTCATCTTCCCCGTGTTGACTCACATTGTTTCCTATGGTGCACTCACTACCAGTCATTTTCTTGACACAGTTGGTCTAGTTACTGTGTCTACCGCCGGGTTTTATCACGGGCGGTACGTCTTGAGTAGCATCTACGCGGTCTGTGCTCTGGCTGCGTTGATTTGCTTCGCCATCAGGTTTGCGAAGAACTGCATGTCCTGGCGCTACTCTTGCACTAGATACACCAACTTCCTCCTGGACACCAAGGGCAGACTCTATCGTTGGCGGTCGCCTGTCATCATAGAGAAAGGGGGTAAGGTTGAGGTCGAAGGTCATCTGATCGATCTCAAAAGAGTTGTGCTTGATGGCTCTGTGGCGACACCTTTAACCAGAGTTTCAGCGGAACAATGGGGTCGTCCCTAGACGACTTTTGCCATGATAGTACAGCCCCACAGAAGGTGCTCTTGGCGTTTTCCATCACCTACACGCCAGTGATGATATATGCCCTAAAGGTAAGCCGCGGCCGACTTTTGGGGCTTCTGCACCTTTTGATTTTTTTGAACTGTGCCTTTACTTTCGGGTACATGACATTCGTGCACTTTCGGAGCACGAACAAGGTCGCGCTCACTATGGGAGCAGTAGTCGCACTCCTTTGGGGGGTGTACTCAGCCATAGAAACCTGGAAATTCATCACCTCCAGATGCCGTTGTGCTTGCTAGGCCGCAAGTACATTCTGGCCCCTGCCCACCACGTTG ## end ##
File format 3 (non-standard Lasergene
header):
## begin ## LOCUS PRU87392 15411 bp RNA linear VRL 17-NOV-2000 DEFINITION Porcine reproductive and respiratory syndrome virus strain VR-2332, complete genome. ACCESSION U87392 AF030244 U00153 VERSION U87392.3 GI:11192298 [...cut...] 3'UTR 15261..15411 polyA_site 15409 ORIGIN ^^ atgacgtataggtgttggctctatgccttggcatttgtattgtcaggagctgtgaccattggcacagcccaaaacttgctgcacagaaacacccttctgtgatagcctccttcaggggagcttagggtttgtccctagcaccttgcttccggagttgcactgctttacggtctctccacccctttaaccatgtctgggatacttgatcggtgcacgtgtacccccaatgccagggtgtttatggcggagggccaagtctactgcacacgatgcctcagtgcacggtctctccttcccctgaacctccaagtttctgagctcggggtgctaggcctattctacaggcccgaagagccactccggtggacgttgccacgtgcattccccactgttgagtgctcccccgccggggcctgctggctttctgcaatctttccaatcgcacgaatgaccagtggaaacctgaacttccaacaaagaatggtacgggtcgcagctgagctttacagagccggccagctcacccctgcagtcttgaaggctctacaagtttatgaacggggttgccgctggtaccccattgttggacctgtccctggagtggccgttttcgccaattccctacatgtgagtgataaacctttcccgggagcaactcacgtgttgaccaacctgccgctcccgcagagacccaagcctgaagacttttgcccctttgagtgtgctatggctactgtctatgacattggtcatgacgccgtcatgtatgtggccgaaaggaaagtctcctgggcccctcgtggcggggatgaagtgaaatttgaagctgtccccggggagttgaagttgattgcgaaccggctccgcacctccttcccgccccaccacacagtggacatgtctaagttcgccttcacagcccctgggtgtggtgtttctatgcgggtcgaacgccaacacggctgccttcccgctgacactgtccctgaaggcaactgctggtggagcttgtttgacttgcttccactggaagttcagaacaaagaaattcgccatgctaaccaatttggctaccagaccaagcatggtgtctctggcaagtacctacagcggaggctgca[...cut...] ## end ##
Constants
- DELIMITER_1
- DELIMITER_2
Attributes
Average length per sequence
-
Parsed from standard
Lasergene
header
Number of bottom strand sequences
-
Parsed from standard
Lasergene
header
Entire header before the sequence
Contig length, length of present sequence
-
Parsed from standard
Lasergene
header
Name of sequence
-
Parsed from standard
Lasergene
header
Bio::Sequence::NA
or Bio::Sequence::AA
object
Number of top strand sequences
-
Parsed from standard
Lasergene
header
Length of parent sequence
-
Parsed from standard
Lasergene
header
Number of sequences
-
Parsed from standard
Lasergene
header
Public Class Methods
# File lib/bio/db/lasergene.rb 124 def initialize(lines) 125 process(lines) 126 end
Public Instance Methods
Name of sequence
-
Parsed from standard
Lasergene
header
# File lib/bio/db/lasergene.rb 147 def entry_id 148 @name 149 end
Bio::Sequence::NA
or Bio::Sequence::AA
object
# File lib/bio/db/lasergene.rb 141 def seq 142 @sequence 143 end
Is the comment header recognized as standard Lasergene
format?
Arguments
-
none
- Returns
-
true
orfalse
# File lib/bio/db/lasergene.rb 134 def standard_comment? 135 @standard_comment 136 end
Protected Instance Methods
# File lib/bio/db/lasergene.rb 155 def process(lines) 156 delimiter_1_indices = [] 157 delimiter_2_indices = [] 158 159 # If the data from the file is passed as one big String instead of 160 # broken into an Array, convert lines to an Array 161 if lines.kind_of? String 162 lines = lines.tr("\r", '').split("\n") 163 end 164 165 lines.each_with_index do |line, index| 166 if line.match DELIMITER_1 167 delimiter_1_indices << index 168 elsif line.match DELIMITER_2 169 delimiter_2_indices << index 170 end 171 end 172 173 raise InputError, "More than one delimiter of type '#{DELIMITER_1}'" if delimiter_1_indices.size > 1 174 raise InputError, "More than one delimiter of type '#{DELIMITER_2}'" if delimiter_2_indices.size > 1 175 raise InputError, "No comment to data separator of type '#{DELIMITER_2}'" if delimiter_2_indices.size < 1 176 177 if !delimiter_1_indices.empty? 178 # toss out DELIMETER_1 and anything preceding it 179 @comments = lines[ (delimiter_1_indices[0] + 1) .. (delimiter_2_indices[0] - 1) ] 180 else 181 @comments = lines[ 0 .. (delimiter_2_indices[0] - 1) ] 182 end 183 184 @standard_comment = false 185 if @comments[0] =~ %r{(.+)\s+\(\d+,\d+\)} # if we have a standard Lasergene comment 186 @standard_comment = true 187 @name = $1 188 comments.each do |comment| 189 if comment.match('Contig Length:\s+(\d+)') 190 @contig_length = $1.to_i 191 elsif comment.match('Average Length/Sequence:\s+(\d+)') 192 @average_length = $1.to_i 193 elsif comment.match('Total Sequence Length:\s+(\d+)') 194 @total_length = $1.to_i 195 elsif comment.match('Top Strand:\s+(\d+)') 196 @top_strand_sequences = $1.to_i 197 elsif comment.match('Bottom Strand:\s+(\d+)') 198 @bottom_strand_sequences = $1.to_i 199 elsif comment.match('Total:\s+(\d+)') 200 @total_sequences = $1.to_i 201 end 202 end 203 end 204 205 @comments = @comments.join('') 206 @sequence = Bio::Sequence.auto( lines[ (delimiter_2_indices[0] + 1) .. -1 ].join('') ) 207 end