Pseudo K-tuple nucleotide composition

Pseudo K-tuple nucleotide composition, or PseKNC, is a method for converting a nucleotide sequence (DNA or RNA) into a numerical vector that can be used by pattern recognition techniques. The K-tuple is typically a dinucleotide (K = 2) or a trinucleotide (K = 3), in which case the technique is also called PseDNC or PseTNC, respectively.

The method was derived from an analogous method in proteomics known as PseAAC (Pseudo Amino Acid Composition) that is applied to protein sequences.

Background

PseAAC

PseAAC (Pseudo Amino Acid Composition) is the proteomics method from which PseKNC was derived. Previously, predictions of protein properties relied either on sequential models, which use the full amino acid sequence, or on discrete models, the simplest of which is the amino acid composition: a vector of twenty elements, each representing the frequency of one amino acid in the protein. The discrete model, however, fails to account for sequence-order information. The PseAAC model extends the 20-element vector of the discrete model with λ additional components, each of which captures some sequence-order information, and this extended vector becomes the basis for making predictions.
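The loss of sequence-order information in the discrete model can be illustrated with a minimal Python sketch. The function name `amino_acid_composition` and the two example peptides are hypothetical and chosen only for illustration; the second peptide is simply the first one reversed.

```python
from collections import Counter

# The 20 standard amino acids, in one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(protein):
    """Return the 20-element vector of normalized amino acid frequencies."""
    counts = Counter(protein.upper())
    total = sum(counts[a] for a in AMINO_ACIDS)
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[a] / total for a in AMINO_ACIDS]


# Two illustrative peptides with the same residues in a different order.
forward = amino_acid_composition("MKTAYIAK")
backward = amino_acid_composition("KAIYATKM")

# The discrete model cannot tell them apart.
print(forward == backward)  # prints True
```

Because both vectors are identical, any predictor built only on the amino acid composition treats the two peptides as the same input, which is the limitation the λ additional components are meant to address.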

Analogous problem in genomics

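Analogously, a discrete model of a nucleotide sequence based on its dinucleotide composition would involve a vector of 16 elements, each representing the frequency of one dinucleotide in the sequence:

$$\mathbf{D} = [f(\mathrm{AA})\ f(\mathrm{AC})\ f(\mathrm{AG})\ f(\mathrm{AT})\ \cdots\ f(\mathrm{TT})]^{\mathbf{T}}$$

where D is the DNA sequence, T is the transpose operator, and f(AA) is the normalized occurrence frequency of AA in the DNA sequence. A trinucleotide representation can be denoted as:

$$\mathbf{D} = [f(\mathrm{AAA})\ f(\mathrm{AAC})\ f(\mathrm{AAG})\ f(\mathrm{AAT})\ \cdots\ f(\mathrm{TTT})]^{\mathbf{T}}$$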

As can be seen, these discrete models fail to consider any global or long-range sequence-order information. To address this for both DNA and RNA sequences, the pseudo K-tuple nucleotide composition or PseKNC was proposed.
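The discrete K-tuple composition described above can be sketched in a few lines of Python. This is an illustrative sketch, not a reference implementation: the function name `knc`, the A, C, G, T ordering of the components, and the choice to skip windows containing other characters are all assumptions.

```python
from itertools import product


def knc(seq, k=2, alphabet="ACGT"):
    """Return (k-mers, normalized frequencies) for a nucleotide sequence.

    The vector has len(alphabet) ** k components, ordered lexicographically
    over the alphabet (an assumed ordering).
    """
    seq = seq.upper()
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    # Slide an overlapping window of width k along the sequence.
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in counts:  # skip windows with ambiguous bases such as N
            counts[kmer] += 1
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence has no valid k-mers")
    return kmers, [counts[m] / total for m in kmers]


kmers, freqs = knc("ACGT", k=2)
print(len(freqs))                  # 16 components for dinucleotides
print(freqs[kmers.index("AC")])    # AC is one of three overlapping windows
```

Calling `knc(seq, k=3)` gives the 64-component trinucleotide vector in the same way.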

PseKNC

PseKNC extends the discrete model by adding λ components that represent the sequence-order and physicochemical properties of the nucleotide sequence. The original KNC model involves 4^K components; in the dinucleotide case, where K = 2, there are 4^2 = 16 components. The extension by PseKNC results in (4^K + λ) components.
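A rough Python sketch of how such a (4^K + λ)-component vector might be assembled is given below. It makes several assumptions that do not come from the text above: the λ components are taken to be averaged squared differences of dinucleotide properties at lags 1 to λ, combined with the composition through a weight factor `weight`, as in a commonly described Type I formulation. The two properties used (GC count and purine count of each dinucleotide) are simple stand-ins computed from the sequence, not the physicochemical property tables used in practice, and the property standardization step is omitted. All function names are hypothetical.

```python
from itertools import product

ALPHABET = "ACGT"  # DNA assumed; an RNA version would use "ACGU"


# Stand-in "properties" of a dinucleotide (assumption, for illustration only).
def gc_count(dinuc):
    return sum(base in "GC" for base in dinuc)


def purine_count(dinuc):
    return sum(base in "AG" for base in dinuc)


PROPERTIES = (gc_count, purine_count)


def composition(seq, k):
    """Normalized K-tuple frequencies: 4**k components."""
    kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    for i in range(len(seq) - k + 1):
        counts[seq[i:i + k]] += 1
    total = sum(counts.values())
    return [counts[m] / total for m in kmers]


def coupling(d1, d2):
    """Mean squared property difference between two dinucleotides."""
    return sum((p(d1) - p(d2)) ** 2 for p in PROPERTIES) / len(PROPERTIES)


def theta(seq, lag):
    """Average coupling between dinucleotides that are `lag` positions apart."""
    dinucs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    pairs = list(zip(dinucs, dinucs[lag:]))
    return sum(coupling(a, b) for a, b in pairs) / len(pairs)


def pseknc(seq, k=2, lam=3, weight=0.5):
    """Return a vector of 4**k + lam components that sums to 1."""
    seq = seq.upper()
    if set(seq) - set(ALPHABET):
        raise ValueError("sequence must contain only A, C, G, T")
    if len(seq) < max(k, lam + 2):
        raise ValueError("sequence too short for the chosen k and lam")
    freqs = composition(seq, k)
    thetas = [theta(seq, j) for j in range(1, lam + 1)]
    denom = sum(freqs) + weight * sum(thetas)
    return [f / denom for f in freqs] + [weight * t / denom for t in thetas]


vector = pseknc("ACGTACGTAC", k=2, lam=3)
print(len(vector))  # 4**2 + 3 = 19
```

The first 4^K entries reflect local composition, while the last λ entries depend on how properties vary between positions up to λ apart, so two sequences with the same composition can still yield different vectors.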

Applications

The PseKNC method has been used in a wide range of applications. For example, it has become an integral component of many algorithms designed to predict the locations of recombination hotspots and coldspots from sequence information.

Web servers

For the convenience of the scientific community, a freely available web server called PseKNC and an open source package called PseKNC-General were developed in 2013 and 2014, respectively. Both can convert large-scale sequence datasets to pseudo nucleotide compositions with numerous choices of physicochemical property combinations. PseKNC-General can generate several modes of pseudo nucleotide compositions, including conventional k-tuple nucleotide compositions, Moreau–Broto autocorrelation coefficient, Moran autocorrelation coefficient, Geary autocorrelation coefficient, Type I PseKNC and Type II PseKNC.

Another web server, Pse-in-One, allows users to choose among all existing PseAAC and PseKNC methods for protein, RNA, and DNA sequences, together with any of the available physicochemical property combinations for these methods.