class Bio::Fastq

Bio::Fastq is a parser for the FASTQ format.

Constants

DefaultFormatName

Default format name

FLATFILE_SPLITTER

Splitter for Bio::FlatFile

FormatNames

Available format names.

Formats

Available format name symbols.

Attributes

definition[R]

definition; ID line (begins with @)

entry_overrun[R]

unparsed text remaining after the entry

header[R]

misc lines before the entry (String or nil)

quality_string[R]

quality as a string

sequence_string[R]

raw sequence data as a String object

Public Class Methods

new(str = nil)

Creates a new Fastq object from a formatted text string.

The format of the quality scores should be specified later by using the format= method.


Arguments:

  • str: Formatted string (String)

# File lib/bio/db/fastq.rb, line 383
def initialize(str = nil)
  return unless str
  sc = StringScanner.new(str)
  while !sc.eos? and line = sc.scan(/.*(?:\n|\r|\r\n)?/)
    unless add_header_line(line) then
      sc.unscan
      break
    end
  end
  while !sc.eos? and line = sc.scan(/.*(?:\n|\r|\r\n)?/)
    unless add_line(line) then
      sc.unscan
      break
    end
  end
  @entry_overrun = sc.rest
end
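
The two-phase scan above can be tried in isolation. The following is a standalone sketch, not the library source: `take_lines_while` is a hypothetical helper that consumes lines from a StringScanner until a block rejects one, then puts that line back with `unscan`. Whatever the scanner has not consumed is what the constructor would store in entry_overrun.

```ruby
require 'strscan'

# Hypothetical helper (not part of Bio::Fastq): read lines while the block
# accepts them; a rejected line is returned to the scanner with unscan.
def take_lines_while(scanner)
  taken = []
  while !scanner.eos? && (line = scanner.scan(/.*(?:\n|\r|\r\n)?/))
    unless yield(line)
      scanner.unscan # the rejected line stays available for the next phase
      break
    end
    taken << line
  end
  taken
end

text = "junk before entry\n@read1\nACGT\n+\nIIII\n@read2\nTTTT\n"
sc = StringScanner.new(text)

# Phase 1: header lines are those that do not begin with "@".
header = take_lines_while(sc) { |l| !l.start_with?("@") }

# Everything from the first "@" line onward is still unread.
rest = sc.rest
```

Here `header` holds the single stray line and `rest` begins at the `@read1` line.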

Public Instance Methods

add_header_line(line)

Adds a header line if the header data is not yet given and the given line is suitable as a header. Returns self if the header line was added. Otherwise, returns false (the line is not added).

# File lib/bio/db/fastq.rb, line 324
def add_header_line(line)
  @header ||= ""
  if line[0,1] == "@" then
    false
  else
    @header.concat line
    self
  end
end
add_line(line)

Adds a line to the entry if the given line is regarded as a part of the current entry. Returns self if the line was added; otherwise, returns false.

# File lib/bio/db/fastq.rb, line 339
def add_line(line)
  line = line.chomp
  if !defined? @definition then
    if line[0, 1] == "@" then
      @definition = line[1..-1]
    else
      @definition = line
      @parse_errors ||= []
      @parse_errors.push Error::No_atmark.new
    end
    return self
  end
  if defined? @definition2 then
    @quality_string ||= ''
    if line[0, 1] == "@" and
        @quality_string.size >= @sequence_string.size then
      return false
    else
      @quality_string.concat line
      return self
    end
  else
    @sequence_string ||= ''
    if line[0, 1] == '+' then
      @definition2 = line[1..-1]
    else
      @sequence_string.concat line
    end
    return self
  end
  raise "Bug: should not reach here!"
end
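
Because "@" is also a legal quality character, a line beginning with "@" only ends the entry once at least as many quality characters as sequence characters have been read. The sketch below mirrors that rule in a hypothetical stand-in class, `TinyFastqEntry`, which is an illustration only and omits the error bookkeeping of the real method.

```ruby
# Hypothetical mini-parser; a sketch of the rules above, not Bio::Fastq.
class TinyFastqEntry
  attr_reader :definition, :sequence, :quality

  def initialize
    @definition = nil
    @seen_plus = false
    @sequence = +""
    @quality = +""
  end

  # Returns self if the line was consumed, false if it starts a new entry.
  def add_line(line)
    line = line.chomp
    if @definition.nil?
      @definition = line.delete_prefix("@")
      return self
    end
    unless @seen_plus
      # Before the "+" line, everything is sequence data.
      if line.start_with?("+")
        @seen_plus = true
      else
        @sequence << line
      end
      return self
    end
    # In the quality block, "@" only marks a new entry once the quality
    # string is at least as long as the sequence.
    return false if line.start_with?("@") && @quality.size >= @sequence.size
    @quality << line
    self
  end
end

entry = TinyFastqEntry.new
lines = ["@read1 sample", "ACGTAC", "GT", "+", "@III", "IIII", "@read2"]
consumed = lines.take_while { |l| entry.add_line(l) }
```

The quality line `@III` is accepted because no quality characters had been read yet, while `@read2` is rejected because the quality string already matches the sequence length.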
entry_id()

Identifier of the entry. Normally, the first word of the ID line.

# File lib/bio/db/fastq.rb, line 446
def entry_id
  unless defined? @entry_id then
    eid = @definition.strip.split(/\s+/)[0] || @definition
    @entry_id = eid
  end
  @entry_id
end
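
As a small illustration of the "first word" rule, the same expression can be run on a plain string. The helper name `first_word_id` and the sample definition line are made up for this sketch.

```ruby
# Hypothetical helper applying the same expression as entry_id to a string.
def first_word_id(definition)
  definition.strip.split(/\s+/)[0] || definition
end

id = first_word_id("read_0001 lane=3 length=36")
empty_id = first_word_id("")
```

For an empty definition, `split` yields no words, so the expression falls back to the definition itself.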
error_probabilities()

Estimated probability of error for each base.


Returns

(Array containing Float) error probability values

# File lib/bio/db/fastq.rb, line 529
def error_probabilities
  unless defined? @error_probabilities then
    self.format ||= self.class::DefaultFormatName
    a = @format.q2p(self.quality_scores)
    @error_probabilities = a
  end
  @error_probabilities
end
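
The conversion from quality score to error probability follows from the standard score definitions: a PHRED score is Q = -10 log10(p), and a Solexa score is Q = -10 log10(p / (1 - p)). The sketch below inverts both formulas; the method names are hypothetical and this is not the library's q2p implementation.

```ruby
# PHRED:  Q = -10 * log10(p)            =>  p = 10 ** (-Q / 10)
def phred_to_probability(q)
  10.0**(-q / 10.0)
end

# Solexa: Q = -10 * log10(p / (1 - p))  =>  p = 1 / (10 ** (Q / 10) + 1)
def solexa_to_probability(q)
  1.0 / (10.0**(q / 10.0) + 1.0)
end

phred_probs = [10, 20, 30].map { |q| phred_to_probability(q) }
solexa_prob = solexa_to_probability(0)
```

PHRED scores of 10, 20, and 30 correspond to error probabilities of about 0.1, 0.01, and 0.001, and a Solexa score of 0 corresponds to 0.5.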
format()

Format name. One of “fastq-sanger”, “fastq-solexa”, “fastq-illumina”, or nil (when not specified).


Returns

(String or nil) format name

# File lib/bio/db/fastq.rb, line 497
def format
  ((defined? @format) && @format) ? @format.name : nil
end
format=(name)

Specify the format. If the format is not found, raises RuntimeError.

Available formats are:

"fastq-sanger" or :fastq_sanger
"fastq-solexa" or :fastq_solexa
"fastq-illumina" or :fastq_illumina

Arguments:

  • (required) name: format name (String or Symbol).

Returns

(String) format name

# File lib/bio/db/fastq.rb, line 476
def format=(name)
  if name then
    f = FormatNames[name] || Formats[name]
    if f then
      reset_state
      @format = f.instance
      self.format
    else
      raise "unknown format"
    end
  else
    reset_state
    nil
  end
end
mask(threshold, mask_char = 'n')

Masks low-quality sequence regions. At each sequence position, if the quality score is smaller than the threshold, the base at that position is replaced with mask_char.

Note: This method does not take quality_score_type into account.


Arguments:

  • (required) threshold : (Numeric) threshold

  • (optional) mask_char : (String) character used for masking

Returns

Bio::Sequence object

# File lib/bio/db/fastq.rb, line 668
def mask(threshold, mask_char = 'n')
  to_biosequence.mask_with_quality_score(threshold, mask_char)
end
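
The masking rule can be sketched on plain Ruby values. `mask_low_quality` is a hypothetical helper, not the Bio::Sequence implementation, and it assumes the score array has one entry per base.

```ruby
# Hypothetical helper: replace each base whose score is below the threshold.
# Assumes scores.length == sequence.length.
def mask_low_quality(sequence, scores, threshold, mask_char = "n")
  sequence.each_char.zip(scores).map { |base, score|
    score < threshold ? mask_char : base
  }.join
end

masked = mask_low_quality("ACGTACGT", [40, 40, 5, 38, 9, 10, 40, 2], 10)
```

A score equal to the threshold is kept, since only scores strictly smaller than the threshold are masked.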
nalen()

Length of naseq.

# File lib/bio/db/fastq.rb, line 433
def nalen
  naseq.length
end
naseq()

Returns the sequence as a Bio::Sequence::NA object.

# File lib/bio/db/fastq.rb, line 425
def naseq
  unless defined? @naseq then
    @naseq = Bio::Sequence::NA.new(@sequence_string)
  end
  @naseq
end
qualities()
Alias for: quality_scores
quality_score_type()

The meaning of the quality scores. It may be one of :phred, :solexa, or nil.

# File lib/bio/db/fastq.rb, line 504
def quality_score_type
  self.format ||= self.class::DefaultFormatName
  @format.quality_score_type
end
quality_scores()

Quality score for each base. For “fastq-sanger” or “fastq-illumina”, the values are PHRED scores. For “fastq-solexa”, they are Solexa scores.


Returns

(Array containing Integer) quality score values

# File lib/bio/db/fastq.rb, line 515
def quality_scores
  unless defined? @quality_scores then
    self.format ||= self.class::DefaultFormatName
    s = @format.str2scores(@quality_string)
    @quality_scores = s
  end
  @quality_scores
end
Also aliased as: qualities
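
Each quality character encodes one score as an ASCII code minus a format-specific offset: 33 for "fastq-sanger" and 64 for both "fastq-solexa" and "fastq-illumina". The sketch below decodes a quality string that way; `quality_offset` and `decode_quality` are hypothetical names, and the library's own str2scores may differ in detail.

```ruby
# Hypothetical helpers: ASCII offset per format name.
def quality_offset(format_name)
  case format_name
  when "fastq-sanger" then 33
  when "fastq-solexa", "fastq-illumina" then 64
  else raise ArgumentError, "unknown format: #{format_name.inspect}"
  end
end

# Convert each quality character to an integer score.
def decode_quality(quality_string, format_name)
  offset = quality_offset(format_name)
  quality_string.each_byte.map { |byte| byte - offset }
end

sanger_scores   = decode_quality("!+5?I", "fastq-sanger")
illumina_scores = decode_quality("@Jh", "fastq-illumina")
solexa_scores   = decode_quality(";@", "fastq-solexa")
```

The same quality string decodes to different scores under different formats, which is why the format has to be set before the scores are read.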
seq()

Returns the sequence as a Bio::Sequence::Generic object.

# File lib/bio/db/fastq.rb, line 438
def seq
  unless defined? @seq then
    @seq = Bio::Sequence::Generic.new(@sequence_string)
  end
  @seq
end
to_biosequence()

Returns sequence as a Bio::Sequence object.

Note: If you modify the returned Bio::Sequence object, the sequence or definition in this Fastq object might also change (though not always), for efficiency reasons.

# File lib/bio/db/fastq.rb, line 653
def to_biosequence
  Bio::Sequence.adapter(self, Bio::Sequence::Adapter::Fastq)
end
to_s()

Returns a FASTQ-formatted string constructed from instance variables. The string always consists of four lines, with no wrapping of the sequence or quality string, and the third line always contains only “+”. This may differ from the original entry.

Note that this method may be inefficient because a new string object is created every time it is called. To show an entry as-is, consider using Bio::FlatFile#entry_raw. For output with various options, use Bio::Sequence::Format#output(:fastq).

# File lib/bio/db/fastq.rb, line 420
def to_s
  "@#{@definition}\n#{@sequence_string}\n+\n#{@quality_string}\n"
end
validate_format(errors = nil)

Format validation.

If an array is given as the argument, error objects are pushed to the array when errors are found. Currently, the following errors may be added to the array. (All errors are under the Bio::Fastq namespace, for example, Bio::Fastq::Error::Diff_ids.)

  • Error::Diff_ids: the identifiers in the two lines are different
  • Error::Long_qual: length of quality is longer than the sequence
  • Error::Short_qual: length of quality is shorter than the sequence
  • Error::No_qual: no quality characters found
  • Error::No_seq: no sequence found
  • Error::Qual_char: invalid character in the quality
  • Error::Seq_char: invalid character in the sequence
  • Error::Qual_range: quality score value out of range
  • Error::No_ids: sequence identifier not found
  • Error::No_atmark: the first identifier does not begin with “@”
  • Error::Skipped_unformatted_lines: the parser skipped unformatted lines that could not be recognized as FASTQ format


Arguments:

  • (optional) errors: (Array or nil) an array for pushing error messages. The array should be empty.

Returns

true if no errors were found; false otherwise.

# File lib/bio/db/fastq.rb, line 562
def validate_format(errors = nil)
  err = []

  # if header exists, the format might be broken.
  if defined? @header and @header and !@header.strip.empty? then
    err.push Error::Skipped_unformatted_lines.new
  end

  # if parse errors exist, adding them
  if defined? @parse_errors and @parse_errors then
    err.concat @parse_errors
  end

  # check if identifier exists, and identifier matches
  if !defined?(@definition) or !@definition then
    err.push Error::No_ids.new
  elsif defined?(@definition2) and
      !@definition2.to_s.empty? and
      @definition != @definition2 then
    err.push Error::Diff_ids.new
  end

  # check if sequence exists
  has_seq  = true
  if !defined?(@sequence_string) or !@sequence_string then
    err.push Error::No_seq.new
    has_seq = false
  end

  # check if quality exists
  has_qual = true
  if !defined?(@quality_string) or !@quality_string then
    err.push Error::No_qual.new
    has_qual = false
  end

  # sequence and quality length check
  if has_seq and has_qual then
    slen = @sequence_string.length
    qlen = @quality_string.length
    if slen > qlen then
      err.push Error::Short_qual.new
    elsif qlen > slen then
      err.push Error::Long_qual.new
    end
  end

  # sequence character check
  if has_seq then
    sc = StringScanner.new(@sequence_string)
    while sc.scan_until(/[ \x00-\x1f\x7f-\xff]/n)
      err.push Error::Seq_char.new(sc.pos - sc.matched_size)
    end
  end

  # quality character check
  if has_qual then
    fmt = if defined?(@format) and @format then
            @format.name
          else
            nil
          end
    re = case fmt
         when 'fastq-sanger'
           /[^\x21-\x7e]/n
         when 'fastq-solexa'
           /[^\x3b-\x7e]/n
         when 'fastq-illumina'
           /[^\x40-\x7e]/n
         else
           /[ \x00-\x1f\x7f-\xff]/n
         end
    sc = StringScanner.new(@quality_string)
    while sc.scan_until(re)
      err.push Error::Qual_char.new(sc.pos - sc.matched_size)
    end
  end

  # if "errors" is given, set errors
  errors.concat err if errors
  # returns true if no error; otherwise, returns false
  err.empty? ? true : false
end
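
The quality character check depends on the format because each format has a different lowest valid character: 0x21 for "fastq-sanger", 0x3b for "fastq-solexa", and 0x40 for "fastq-illumina", all with an upper bound of 0x7e. The sketch below reports the positions of out-of-range characters using plain byte comparisons. `invalid_quality_positions` is a hypothetical helper that returns positions, not Error::Qual_char objects.

```ruby
# Hypothetical helper: positions of quality characters outside the range
# allowed for the given format. With no format, any printable non-space
# ASCII character (0x21..0x7e) is accepted.
def invalid_quality_positions(quality_string, format_name = nil)
  lowest = case format_name
           when "fastq-solexa"   then 0x3b
           when "fastq-illumina" then 0x40
           else 0x21 # "fastq-sanger" or unspecified
           end
  quality_string.each_byte.each_with_index
                .reject { |byte, _index| byte.between?(lowest, 0x7e) }
                .map { |_byte, index| index }
end

as_sanger   = invalid_quality_positions("II5I", "fastq-sanger")
as_illumina = invalid_quality_positions("II5I", "fastq-illumina")
with_space  = invalid_quality_positions("II I")
```

The character "5" is valid in "fastq-sanger" but falls below the "fastq-illumina" range, so the same string passes one check and fails the other.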