class Bio::Fastq
Bio::Fastq is a parser for the FASTQ format.
Constants
- DefaultFormatName: Default format name.
- FLATFILE_SPLITTER: Splitter for Bio::FlatFile.
- FormatNames: Available format names.
- Formats: Available format name symbols.
Attributes
- definition: Definition; the ID line (begins with @).
- header: Miscellaneous lines before the entry (String or nil).
- quality_string: Quality as a String.
- sequence_string: Raw sequence data as a String object.
Public Class Methods
Source
# File lib/bio/db/fastq.rb
def initialize(str = nil)
  return unless str
  sc = StringScanner.new(str)
  while !sc.eos? and line = sc.scan(/.*(?:\n|\r|\r\n)?/)
    unless add_header_line(line) then
      sc.unscan
      break
    end
  end
  while !sc.eos? and line = sc.scan(/.*(?:\n|\r|\r\n)?/)
    unless add_line(line) then
      sc.unscan
      break
    end
  end
  @entry_overrun = sc.rest
end
Creates a new Fastq object from a formatted text string.
The format of the quality scores should be specified later by using the format= method.
Arguments:
- str: Formatted string (String)
Public Instance Methods
Source
# File lib/bio/db/fastq.rb
def add_header_line(line)
  @header ||= ""
  if line[0,1] == "@" then
    false
  else
    @header.concat line
    self
  end
end
Adds a header line if the header data has not yet been given and the given line is suitable as a header. Returns self if the header line was added; otherwise returns false (the line is not added).
Source
# File lib/bio/db/fastq.rb
def add_line(line)
  line = line.chomp
  if !defined? @definition then
    if line[0, 1] == "@" then
      @definition = line[1..-1]
    else
      @definition = line
      @parse_errors ||= []
      @parse_errors.push Error::No_atmark.new
    end
    return self
  end
  if defined? @definition2 then
    @quality_string ||= String.new
    if line[0, 1] == "@" and
        @quality_string.size >= @sequence_string.size then
      return false
    else
      @quality_string.concat line
      return self
    end
  else
    @sequence_string ||= String.new
    if line[0, 1] == '+' then
      @definition2 = line[1..-1]
    else
      @sequence_string.concat line
    end
    return self
  end
  raise "Bug: should not reach here!"
end
Adds a line to the entry if the given line is regarded as a part of the current entry.
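The rule for deciding where an entry ends can be sketched in isolation. Because "@" is also a valid quality character, a line beginning with "@" is treated as the start of the next entry only once the quality string is at least as long as the sequence. The following is a minimal standalone sketch of that rule, not the library's code; the method name split_first_entry and the returned Hash are hypothetical.

```ruby
# Hypothetical standalone sketch of the entry-boundary rule; it does not
# use Bio::Fastq and is not the library's implementation.
def split_first_entry(lines)
  definition = nil
  sequence = String.new
  quality = String.new
  in_quality = false
  consumed = 0

  lines.each do |raw|
    line = raw.chomp
    if definition.nil?
      # First line: the ID line, with the leading "@" removed if present.
      definition = line.sub(/\A@/, "")
    elsif in_quality
      # "@" can be a quality character, so it only marks a new entry
      # when the quality is already as long as the sequence.
      break if line.start_with?("@") && quality.size >= sequence.size
      quality << line
    elsif line.start_with?("+")
      in_quality = true
    else
      sequence << line
    end
    consumed += 1
  end

  { definition: definition, sequence: sequence, quality: quality, consumed: consumed }
end
```

Given a wrapped entry whose first quality line starts with "@", the sketch keeps that line as quality data and stops only at the following "@" line, where the lengths already match.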
Source
# File lib/bio/db/fastq.rb
def entry_id
  unless defined? @entry_id then
    eid = @definition.strip.split(/\s+/)[0] || @definition
    @entry_id = eid
  end
  @entry_id
end
Identifier of the entry. Normally, the first word of the ID line.
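As an illustration of the "first word" rule, the same expression can be applied to a plain String. The helper name first_word and the sample ID line are made up for this sketch.

```ruby
# Hypothetical helper applying the same expression to a plain String.
def first_word(definition)
  # Split on whitespace and take the first token; fall back to the
  # whole string when there is no token (e.g. blank input).
  definition.strip.split(/\s+/)[0] || definition
end

id = first_word("SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36")
puts id
```

For the sample line above, the result is "SRR001666.1".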
Source
# File lib/bio/db/fastq.rb
def error_probabilities
  unless defined? @error_probabilities then
    self.format ||= self.class::DefaultFormatName
    a = @format.q2p(self.quality_scores)
    @error_probabilities = a
  end
  @error_probabilities
end
Estimated probability of error for each base.
Returns: (Array containing Float) error probability values
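For reference, the standard relations between a quality score Q and an error probability p are Q = -10 log10(p) for PHRED scores and Q = -10 log10(p / (1 - p)) for Solexa scores. The sketch below inverts both; the method names are hypothetical, and whether the library's q2p uses exactly these expressions is an assumption, not something shown in the source above.

```ruby
# Hypothetical helpers based on the standard score definitions;
# they are not taken from the library.

# PHRED: Q = -10 * log10(p)  =>  p = 10^(-Q/10)
def phred_to_probability(q)
  10.0 ** (-q / 10.0)
end

# Solexa: Q = -10 * log10(p / (1 - p))  =>  p = 1 / (10^(Q/10) + 1)
def solexa_to_probability(q)
  1.0 / (10.0 ** (q / 10.0) + 1.0)
end

puts [10, 20, 30].map { |q| phred_to_probability(q) }.inspect
puts [-5, 0, 10].map { |q| solexa_to_probability(q) }.inspect
```

A PHRED score of 20 corresponds to an error probability of about 0.01, and a Solexa score of 0 corresponds to 0.5.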
Source
# File lib/bio/db/fastq.rb
def format
  ((defined? @format) && @format) ? @format.name : nil
end
Format name. One of “fastq-sanger”, “fastq-solexa”, “fastq-illumina”, or nil (when not specified).
Returns: (String or nil) format name
Source
# File lib/bio/db/fastq.rb
def format=(name)
  if name then
    f = FormatNames[name] || Formats[name]
    if f then
      reset_state
      @format = f.instance
      self.format
    else
      raise "unknown format"
    end
  else
    reset_state
    nil
  end
end
Specifies the format. Raises RuntimeError if the format is not found.
Available formats are:
"fastq-sanger" or :fastq_sanger
"fastq-solexa" or :fastq_solexa
"fastq-illumina" or :fastq_illumina
Arguments:
- (required) name: format name (String or Symbol).
Returns: (String) format name
Source
# File lib/bio/db/fastq.rb
def mask(threshold, mask_char = 'n')
  to_biosequence.mask_with_quality_score(threshold, mask_char)
end
Masks low-quality sequence regions. For each sequence position, if the quality score is smaller than the threshold, the base at that position is replaced with mask_char.
Note: This method does not take quality_score_type into account.
Arguments:
- (required) threshold: (Numeric) threshold
- (optional) mask_char: (String) character used for masking
Returns: Bio::Sequence object
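The masking rule described above can be illustrated on a plain String and an Array of scores. This is a sketch only: mask_low_quality is a hypothetical name, it returns a String rather than a Bio::Sequence object, and it leaves positions without a score unchanged, which is this sketch's own choice.

```ruby
# Hypothetical standalone sketch of threshold masking; it does not call
# Bio::Sequence#mask_with_quality_score.
def mask_low_quality(sequence, scores, threshold, mask_char = "n")
  sequence.each_char.with_index.map do |base, i|
    score = scores[i]
    # Replace the base only when a score exists and is below the threshold.
    score && score < threshold ? mask_char : base
  end.join
end

puts mask_low_quality("ACGT", [40, 5, 30, 2], 10)
```

With a threshold of 10, the second and fourth bases are masked and the result is "AnGn".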
Source
# File lib/bio/db/fastq.rb
def naseq
  unless defined? @naseq then
    @naseq = Bio::Sequence::NA.new(@sequence_string)
  end
  @naseq
end
Returns the sequence as a Bio::Sequence::NA object.
Source
# File lib/bio/db/fastq.rb
def quality_score_type
  self.format ||= self.class::DefaultFormatName
  @format.quality_score_type
end
The meaning of the quality scores. It may be one of :phred, :solexa, or nil.
Source
# File lib/bio/db/fastq.rb
def quality_scores
  unless defined? @quality_scores then
    self.format ||= self.class::DefaultFormatName
    s = @format.str2scores(@quality_string)
    @quality_scores = s
  end
  @quality_scores
end
Quality score for each base. For “fastq-sanger” or “fastq-illumina”, it is the PHRED score. For “fastq-solexa”, it is the Solexa score.
Returns: (Array containing Integer) quality score values
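Each FASTQ variant encodes a score as a single ASCII character by adding a fixed offset: 33 for fastq-sanger, and 64 for fastq-solexa and fastq-illumina. The following sketch decodes a quality string under that convention; decode_quality is a hypothetical name, the "fastq-sanger" default is this sketch's assumption, and the code is not the library's str2scores.

```ruby
# Hypothetical standalone decoder; not the library's str2scores.
def decode_quality(quality_string, format_name = "fastq-sanger")
  # ASCII offsets conventionally used by each FASTQ variant.
  offsets = {
    "fastq-sanger"   => 33,
    "fastq-solexa"   => 64,
    "fastq-illumina" => 64
  }
  offset = offsets.fetch(format_name) do
    raise ArgumentError, "unknown format: #{format_name}"
  end
  quality_string.each_byte.map { |byte| byte - offset }
end

puts decode_quality("II5+").inspect
puts decode_quality("hh", "fastq-illumina").inspect
```

Under the Sanger offset, "II5+" decodes to [40, 40, 20, 10]. Under the offset of 64, ";" decodes to -5, which matches the lowest character accepted for fastq-solexa in validate_format.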
Source
# File lib/bio/db/fastq.rb
def seq
  unless defined? @seq then
    @seq = Bio::Sequence::Generic.new(@sequence_string)
  end
  @seq
end
Returns the sequence as a Bio::Sequence::Generic object.
Source
# File lib/bio/db/fastq.rb
def to_biosequence
  Bio::Sequence.adapter(self, Bio::Sequence::Adapter::Fastq)
end
Returns the sequence as a Bio::Sequence object.
Note: If you modify the returned Bio::Sequence object, the sequence or definition in this Fastq object might also be changed (though not always), for efficiency reasons.
Source
# File lib/bio/db/fastq.rb
def to_s
  "@#{@definition}\n#{@sequence_string}\n+\n#{@quality_string}\n"
end
Returns a Fastq-formatted string constructed from instance variables. The string always consists of four lines, without wrapping of the sequence and quality strings, and the third line always contains only “+”. This may differ from the initial entry.
Note that this method may be inefficient because a new String object is created every time it is called. To show an entry as-is, consider using Bio::FlatFile#entry_raw. For output with various options, use Bio::Sequence#output(:fastq).
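To show why the output may differ from the input, the sketch below rebuilds a four-line record from an entry whose sequence and quality were wrapped over several lines. The method name to_four_line_fastq and its arguments are hypothetical; only the string template follows the source shown above.

```ruby
# Hypothetical standalone sketch using the same four-line template.
def to_four_line_fastq(definition, sequence_lines, quality_lines)
  # Wrapped lines are joined, and the third line is reduced to "+".
  "@#{definition}\n#{sequence_lines.join}\n+\n#{quality_lines.join}\n"
end

print to_four_line_fastq("r1", ["ACGT", "ACGT"], ["IIII", "IIII"])
```

An eight-line wrapped entry with a repeated ID on the "+" line is thus written back as four lines, with the sequence and quality unwrapped.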
Source
# File lib/bio/db/fastq.rb
def validate_format(errors = nil)
  err = []

  # if header exists, the format might be broken.
  if defined? @header and @header and !@header.strip.empty? then
    err.push Error::Skipped_unformatted_lines.new
  end

  # if parse errors exist, adding them
  if defined? @parse_errors and @parse_errors then
    err.concat @parse_errors
  end

  # check if identifier exists, and identifier matches
  if !defined?(@definition) or !@definition then
    err.push Error::No_ids.new
  elsif defined?(@definition2) and
      !@definition2.to_s.empty? and
      @definition != @definition2 then
    err.push Error::Diff_ids.new
  end

  # check if sequence exists
  has_seq = true
  if !defined?(@sequence_string) or !@sequence_string then
    err.push Error::No_seq.new
    has_seq = false
  end

  # check if quality exists
  has_qual = true
  if !defined?(@quality_string) or !@quality_string then
    err.push Error::No_qual.new
    has_qual = false
  end

  # sequence and quality length check
  if has_seq and has_qual then
    slen = @sequence_string.length
    qlen = @quality_string.length
    if slen > qlen then
      err.push Error::Short_qual.new
    elsif qlen > slen then
      err.push Error::Long_qual.new
    end
  end

  # sequence character check
  if has_seq then
    sc = StringScanner.new(@sequence_string)
    while sc.scan_until(/[ \x00-\x1f\x7f-\xff]/n)
      err.push Error::Seq_char.new(sc.pos - sc.matched_size)
    end
  end

  # sequence character check
  if has_qual then
    fmt = if defined?(@format) and @format then
            @format.name
          else
            nil
          end
    re = case fmt
         when 'fastq-sanger'
           /[^\x21-\x7e]/n
         when 'fastq-solexa'
           /[^\x3b-\x7e]/n
         when 'fastq-illumina'
           /[^\x40-\x7e]/n
         else
           /[ \x00-\x1f\x7f-\xff]/n
         end
    sc = StringScanner.new(@quality_string)
    while sc.scan_until(re)
      err.push Error::Qual_char.new(sc.pos - sc.matched_size)
    end
  end

  # if "errors" is given, set errors
  errors.concat err if errors
  # returns true if no error; otherwise, returns false
  err.empty? ? true : false
end
Format validation.
If an array is given as the argument, error objects are pushed to the array when errors are found. Currently, the following errors may be added to the array. (All errors are under the Bio::Fastq namespace, for example, Bio::Fastq::Error::Diff_ids.)
- Error::Diff_ids: the identifiers in the two lines are different
- Error::Long_qual: length of quality is longer than the sequence
- Error::Short_qual: length of quality is shorter than the sequence
- Error::No_qual: no quality characters found
- Error::No_seq: no sequence found
- Error::Qual_char: invalid character in the quality
- Error::Seq_char: invalid character in the sequence
- Error::Qual_range: quality score value out of range
- Error::No_ids: sequence identifier not found
- Error::No_atmark: the first identifier does not begin with “@”
- Error::Skipped_unformatted_lines: the parser skipped unformatted lines that could not be recognized as FASTQ format
Arguments:
- (optional) errors: (Array or nil) an array to which error objects are pushed. The array should be empty.
Returns: true if no error was found; false otherwise.
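The calling convention (an optional array that collects errors, plus a boolean result) can be sketched with the length check alone. The method name check_lengths is hypothetical, and the Symbols stand in for the Error::Short_qual and Error::Long_qual objects; this is not the library's validator.

```ruby
# Hypothetical standalone sketch of the "collect errors, return boolean"
# convention, limited to the sequence/quality length check.
def check_lengths(sequence, quality, errors = nil)
  found = []
  if sequence.length > quality.length
    found << :short_qual   # stand-in for Error::Short_qual
  elsif quality.length > sequence.length
    found << :long_qual    # stand-in for Error::Long_qual
  end
  # Report details only when the caller supplied an array.
  errors.concat(found) if errors
  found.empty?
end

errors = []
ok = check_lengths("ACGT", "III", errors)
puts ok
puts errors.inspect
```

Here the call returns false and the array contains :short_qual. Calling the method without an array gives only the boolean result.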