
Plotting with ggplot2 – L 01

In this post we will learn some plotting in R using the ggplot2 library. The tutorial covers several ggplot2 functions, including geom_point(), geom_errorbar(), and geom_line().

The overall goal is to overlay several geom layers on one canvas to create a feature-rich, comprehensive plot. One notable technique in this tutorial is the use of two separate geom_line() calls, which some readers may find helpful in certain contexts.


Hypothetical experiment:

I am creating simulated data based on a hypothetical experiment. The nature of the experiment may be modified to suit your field of study, and you may apply the example and code to your own problem.

Experiment and data:

We have several plants from one particular species (or population). The trait we are tracking is: "How does the diameter of the plants change as they grow?".

There is a plant hormone called auxin which, when altered, can affect plant growth. So, we design an experiment where several randomly selected individuals from this growing stock of plants are subjected to hormonal alteration using an auxin inhibitor (1-N-naphthylphthalamic acid, NPA).

We apply the hormone inhibitor in two ways, creating two treatment groups:

  1. Aux_Spray – by spraying the NPA solution all over the selected plants.
  2. Aux_Drop – by localized application of the NPA droplets in the center of the plant (which only affect the primary shoot).

Additionally, we create two control groups of plants by treating them with:

  1. Water – by spraying the plants with water.
  2. DMSO – by spraying the plants with DMSO solution.

So, altogether we have 4 treatment groups: Aux_Spray, Aux_Drop, Water, and DMSO.

Step 01 – 02 : Create the data (a hypothetical one, though).

## Step 01: Import the required libraries
library(ggplot2)
library(tidyverse)

#' ## Step 02: Let's create some data
# make 4 treatment groups with 6 replicates
treatmentGrp = sample(rep(c("Aux_Spray", "Aux_Drop", "DMSO", "Water"), each = 6))

# we now need 4x6 = 24 samples
sample_id = sample(rep(c(1:24)))

# let's create random diameter sizes for the months of January, February and March
diameterJan = sample(c(seq(from = 3, to = 7, by = 0.4)), size = 24, replace = TRUE)
diameterFeb = sample(c(seq(from = 4, to = 10, by = 0.3)), size = 24, replace = TRUE)
diameterMar = sample(c(seq(from = 3.5, to = 7, by = 0.7)), size = 24, replace = TRUE)

## Let's put all the above data into a dataframe
my_df <- data.frame(sampleID = sample_id, Treatment = treatmentGrp,
                    diameterInJan = diameterJan, diameterInFeb = diameterFeb,
                    diameterInMar = diameterMar)

# Now, we compute "diameter changes" between consecutive months
# and add it as new columns
my_df$diamChangeJanToFeb = my_df$diameterInFeb - my_df$diameterInJan
my_df$diamChangeFebToMar = my_df$diameterInMar - my_df$diameterInFeb

 

Overall research questions:

A) Do plants given different treatments show different diameter growth patterns across the months?

B) Does treatment affect the variation in diameter changes across the months? This question is a little different from the previous one: it asks whether the diameter changes between two consecutive months differ across treatment groups, after accounting for the starting diameter of each interval.

Here we will try to answer these questions using plots only (no statistics for now). I am using `R` and the required libraries (ggplot2, tidyverse):

Step 03 – 04 : Subset and reshape the data. Calculate the sub-means and the standard error of the mean.

#' ## Step 03: Subset and Reshape the data 

# Before we reshape, we need to subset the data required by the plot
# Here, we select the "Treatment" column and the "Diameter" columns for each month
my_dfSubset = subset(
  my_df, select = c("diameterInJan", "diameterInFeb", "diameterInMar",
                    "diamChangeJanToFeb", "diamChangeFebToMar", "Treatment"))
head(my_dfSubset)

# reshape the subsetted data from wide to long (thin) format
my_dfReshaped =
  my_dfSubset %>%  gather(Month, Diameter, -Treatment)
head(my_dfReshaped)

#' ## Step 04: Calculate the required values from the data
# Calculate "mean" and standard error of the mean (i.e SE)
  # by "Treatment" for each time period i.e "Month")
  # since the mean is not for overall month, but grouped by treatement,
  # we call it "SubMean".
my_dfSubMean = aggregate(my_dfReshaped$Diameter,
                         by=list(Treatment = my_dfReshaped$Treatment,
                                 Month = my_dfReshaped$Month), FUN = mean , na.rm=TRUE)
head(my_dfSubMean)

# Change the column name "x" to "SubMean"
colnames(my_dfSubMean)[colnames(my_dfSubMean)=="x"] <- "SubMean"
my_dfSubMean

# Calculate the "standard error" of the mean (for each "Treatment" for each "Month")
my_dfSubSE = aggregate(my_dfReshaped$Diameter,
                       by=list(Treatment = my_dfReshaped$Treatment,
                               Month = my_dfReshaped$Month), FUN = sd , na.rm=TRUE)
head(my_dfSubSE)

# Change column name "x" to "SubSE"
colnames(my_dfSubSE)[colnames(my_dfSubSE)=="x"] <- "SubSE"
head(my_dfSubSE) 

# Now, merge "my_dfSubMean" with "my_dfSubSE"
# (a plain merge is enough for two dataframes)
my_dfSubMean = merge(my_dfSubMean, my_dfSubSE,
                     by = c("Treatment", "Month"), all = TRUE)
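For reference, the quantity computed above is the standard error of the mean, i.e., the standard deviation scaled by the square root of the number of observations in each Treatment-by-Month group (here n = 6 replicates per treatment):

SE = \dfrac{s}{\sqrt{n}}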

 

Step 05  : Split the data

We split this combined data into two parts: one that contains only the "Diameter" values, and another containing the "diameter change" values.

#' ## Step 05: Split the data into smaller groups, so plotting can be made easier
## Split the "SubMean" and "SubSE" data into two sets
# one set containing the "diameter" values for each month
# another set containing the "diameter changes" values between consecutive months
# ** Note : This becomes helpful when we want to apply "geom_line" and other methods to ..
# .. the "Diameter" and the "Diameter changes" separately

# select values (from the "Month" column) that contain only the "diameter" for each month;
# we use grep to capture the values that have the "Change" string in them, then invert the selection
my_dfDiameterSubMean = my_dfSubMean[grep('Change', my_dfSubMean$Month, invert = TRUE), ]

# select values (from "Month" column) that represent "diameter changes" between two
# .. consecutive months
my_dfChangeSubMean = my_dfSubMean[grep('Change', my_dfSubMean$Month), ] 

 

Step 06 : Plot the data.

#' ## Step 06 : Now, start plotting
# we input the main reshaped data for all the observations
# pipe in the reshaped data
my_dfReshaped %>%
  # make a blank canvas to start with
  ggplot() + 

  # reduce the canvas to a minimal background
  # this should be placed early in the plot build; if added later it will strip the
  # .. custom theme settings added to the overall plot
  theme_minimal() + 

  # add the "main title" and then x,y axis title
  ggtitle("Effect of auxin inhibitor (NPA) on diameter and diameter changes.") +
  xlab("Month") +
  ylab("Diameter Changes (mm) \t|\t Diameter (mm)") + 

  # add a layer of points for all the observations (i.e., "Diameter" and "Diameter Changes")
  # the data from different treatments are colored distinctly
  geom_point(aes(x = factor(Month), y = Diameter, colour = Treatment),
             na.rm = TRUE, size = 2,
             position = position_dodge(width = 0.2)) +

  # add custom color to the "Treatment" levels
  scale_colour_manual(
    values = c("Aux_Drop" = "Purple", "Aux_Spray" = "Red",
               "DMSO" = "Orange", "Water" = "Green")) + 

  # add finer ticks to the y-axis values
  scale_y_continuous(breaks = pretty(my_dfReshaped$Diameter, n = 20)) +

  # Improve the quality of the plot title and x,y title by adjusting size and format
  theme(plot.title = element_text(face="bold", hjust=0, size = 18),
        axis.title.x = element_text(face="bold", size = 12),
        axis.title.y = element_text(color="red", face="bold", size = 12)) +

  # rearrange the x-axis (keep the months in order),
  # and rename the factors to make them neat
  # lower the position of the factors that indicate diameter changes by adding a line break (i.e., '\n');
  # this helps to show the "diameter changes" values separately from the "diameter" values
  scale_x_discrete(limits=c(
    "diameterInJan", "diamChangeJanToFeb", "diameterInFeb",
    "diamChangeFebToMar", "diameterInMar"),
    labels = c("Jan", "\nJanToFeb", "Feb", "\nFebToMar", "Mar")) + 

  # update the font quality of the renamed x-axis ticks and y-axis ticks value
  theme(axis.text.x = element_text(face = c(
    "plain", "bold", "plain", "bold", "plain"), size = 10),
    axis.text.y = element_text(size = 10)) +
  # we can add other flags like "vjust", "hjust", size to control the text quality in x-axis.

  # Now, add another layer of dot plots from the calculated "mean" (now we use another dataframe)
  # also make the size of the "mean" value larger so it can be seen as a mean value
  geom_point(data = my_dfSubMean, size = 5, aes(
    x = factor(Month), y = SubMean, colour = Treatment),
    na.rm = TRUE, position = position_dodge(width = 0.2)) +

  # add an error bar for "Mean" value for each "Treatment" for each time point
  geom_errorbar(data = my_dfSubMean, aes(
    x = Month, ymin=SubMean-SubSE, ymax=SubMean+SubSE, colour = Treatment),
    position = position_dodge(width = 0.2), width=.1, size = 0.5) + 

  # add horizontal abline to show positive vs. negative trend in diameter changes ..
  # .. between consecutive months
  geom_hline(yintercept=0, linetype="dashed", color = "black") +

  ## Add another layer of line plot using "geom_line()"
  # Connect the "subMeans" of the "Diameter" values across time points
  # for this we need to connect the means of "Diameter" and means of "Diameter Changes" separately

  # connect the mean of "Diameter" for month of Jan, Feb, Mar
  geom_line(data = my_dfDiameterSubMean, aes(
    x = Month, y = SubMean, group = Treatment, colour = Treatment),
    position = position_dodge(width = 0.2)) +

  # separately connect the mean of the "Diameter Changes" between Jan-Feb and Feb-Mar
  geom_line(data = <span class="pl-smi">my_dfChangeSubMean</span>, aes( 				id="mce_SELREST_start" 				data-mce-style="overflow:hidden;line-height:0" 				style="overflow:hidden;line-height:0" 			&amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;gt;&amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;#65279;&amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;lt;/span&amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;gt;ean, aes(
    x = Month, y = SubMean, group = Treatment, colour = Treatment),
    position = position_dodge(width = 0.2))

 

Here is the final plot :

[Figure: Rplot02 – the final combined plot]

 

This tutorial is also available on GitHub, with the R code and the R Markdown HTML output – https://github.com/everestial/PlotAwesome/tree/master/MakeGGplotLevel01


“Primality test” and algorithm optimization using python – a synced approach.

Algorithms are a means to an end for the problem at hand.

While there are several good programming tutorials, I have been a little underwhelmed by the available documents that show how a designed algorithm integrates with the programming/coding, and how further algorithm optimizations are approached to save the two most important computing resources: "computation time" and "memory".

For a newbie, the concept of an algorithm and its translation to working code (in any language) can be both exciting and frustrating. I speak from experience: I spent most of my life as an empirical biologist, then transitioned to bioinformatics as a front-end user, then jumped to back-end bioinformatics application development. Most importantly (and frustratingly), my ride was solo, navigating only by means of Stack Overflow and Google.

Understanding an algorithm and its translation to code should go hand in hand:

  • because working code that solves the problem at hand can never be written without a working algorithm.
  • because it also helps us find "the devil in the details". With the trend of increasingly high-level programming, we adhere to the assumption that the provided "black box" (packages, libraries, or even written code) will just work. But oftentimes it won't; so a perspective on algorithmic details and their translation into code is a must for a programmer's career.

The code for this tutorial can be found in FindMeAprime on my GitHub repo.



This tutorial is about “Primality Test”

Today, I will talk about a simple mathematical problem – the "test of primality", i.e., whether a given positive natural number is prime or not.

Finding primes has been a challenging mathematical pursuit since the time numbers were invented, and the test of primality has always fascinated us.

In this tutorial :

  • I will first take the existing definition of a "prime number" and use it as a base algorithm to write code in Python.
  • Subsequently, I will show how an existing algorithm can be optimized, and translated into optimized code, to achieve the same result.
  • Additionally, this tutorial has the minor goal of showing how a "for loop" vs. a "while loop" can be used on the same problem, and why a particular loop becomes more appropriate for the given problem. Compare the code between "level 02 and level 03".

So, overall we will evolve both in terms of thinking about an algorithm and in translating it into workable Python code.

Purpose of the tutorial:
    - Convert algorithms for primality test to python codes.
    - Use "for" vs. "while" loop for the same set of problem to understand their differences.
    - Algorithm and code optimization.

Definition of a prime:
    - num(x) is prime if "x" is divisible only by 1 and itself.
Expressing this mathematically, a positive natural number "x" is prime if,

(x \bmod n) = 0\,, \quad \text{only when} \begin{cases}n=1\\ n=x\end{cases}

    - i.e., the number cannot be split into a product of two positive integers that are both greater than 1.
    - So, the given input number has to be test-divided by every number from 2 to x-1.

Level 01 : Primality testing using the simplest algorithm

We will convert the above definition of a prime into an algorithm and Python code below.

import sys
import time

my_num = input("input a number: ")
my_num = int(my_num)
print()
print('The input number is %s' %my_num)

# writing a small function to test the primality of the input number
def test_prime(num):
    set_timer = time.time()
    print()
    print('Level 01 primality test.')
    is_prime = ''  # to store the boolean logic

    # in python, range(2, num) excludes its upper bound, so this tests divisors 2 .. num-1
    for nth in range(2, num):
        if num % nth == 0:
            # if a number gets divided into two positive natural numbers
            print('the input number %s is not a prime' % num)
            print('Completed primality test.')

            # now, update the boolean logic
            is_prime = False

            # and if the input number is found not to be a prime at some point
            # .. there is no point in testing it further. So, we can just break the for loop.
            break

    # now, if the boolean flag was not updated by any of the division tests in the above for-loop
    # we know the number is a prime.
    if is_prime == '':
        print('the input number %s is a prime' % num)
        print('Completed primality test in %s' %(time.time() - set_timer))

## Call the function to test the primality.
test_prime(my_num)

Begin test for level 01:

$ python3 FindMeAprime.py
input a number: 31

The input number is 31

Level 01 primality test.
the input number 31 is a prime
Completed primality test in 5.888938903808594e-05

Result: We can see that the primality test was completed in almost no time.



Level 02: Now, let's take the algorithm to the next level

Earlier, to test whether a number is prime, we divided the number (x) by all the numbers from n=2 to n=(x-1).

To save computation time in the primality test, we can instead divide the number (x) only by the numbers from n=2 to n=rounded (x+1)/2.

  • because the smallest possible factor of x is 2, no proper factor of x can be bigger than rounded (x+1)/2 (see the quick check below).
  • by reducing the "maximum required divisor" from "x-1" to "(x+1)/2", we simply reduced the range of divisors and thus updated our algorithm.
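In symbols: if x is composite, it can be written as a product of two proper factors a and b, and the larger factor is greatest exactly when the smaller one is 2:

x = a \cdot b,\quad 2 \le a \le b \implies b = \dfrac{x}{a} \le \dfrac{x}{2}

For example, the largest proper factor of 10 is 10/2 = 5, so trial divisors beyond 5 are wasted work.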

Now, let’s update the above code

def test_primeL2(num):
    set_timer = time.time()
    print()
    print("Level 02 primality test.")

    # largest factor to divide the number of interest
    max_factor = round((num+1)/2)
    is_prime = ''  # to store the boolean logic

    # range() excludes its upper bound, so this tests trial factors up to max_factor - 1
    for nth in range(2, max_factor):
        if num % nth == 0:
            # if a number gets divided into two positive natural numbers
            print('the input number %s is not a prime' % num)
            print('Completed primality test.')

            # now, update the boolean logic
            is_prime = False

            # and if the input number is found not to be a prime at some point
            # .. there is no point in testing it further. So, we can just break the for loop.
            break

    # now, if the boolean flag was not updated by any of the division tests in the above for-loop
    # we know the number is a prime.
    if is_prime == '':
        print('the input number %s is a prime' % num)
        print('Completed primality test in %s.' %(time.time() - set_timer))

## Call the function to test the primality.
test_primeL2(my_num)

Begin test for both level 01 and level 02 with a large prime number:

$ python3 FindMeAprime.py
input a number: 145897

The input number is 145897

Level 01 primality test.
the input number 145897 is a prime
Completed primality test in 0.01402425765991211

Level 02 primality test.
the input number 145897 is a prime
Completed primality test in 0.006134748458862305.

$ python3 FindMeAprime.py
input a number: 15487457

The input number is 15487457

Level 01 primality test.
the input number 15487457 is a prime
Completed primality test in 1.3484265804290771

Level 02 primality test.
the input number 15487457 is a prime
Completed primality test in 0.6822659969329834.

Result : You can see that the optimized algorithm (i.e., test_primeL2) completes the test in a much shorter time.


Level 03: Testing primality using while loop

Now, it’s time to see if we can implement the algorithm for the test of primality using “while loop”.

A few points to note:

  • Coming from a long biological background, I personally found the while loop a little difficult to comprehend in the beginning.
  • Another problem with while loops is that, if the loop condition isn't handled properly, they can turn into infinite loops.
  • For the test of primality the 'while loop' works better – you will see why below.
def test_prime_W1(num):
    set_timer = time.time()
    print()
    print("Primality test using while loop.")
    max_factor = round((num+1)/2)
    is_prime = ''  # to store the boolean logic

    # instead of using for-loop on a range of numbers ..
    # in while loop we run the loop until a condition is met ..
    # i.e here we start dividing by factor 2 and then increase the factor ..
    # .. by 1 until it becomes equal to "max_factor"
    factor = 2
    while factor <= max_factor:
        if num % factor == 0:
            # if a number gets divided into two positive natural numbers
            print('the input number %s is not a prime' % num)
            print('Completed primality test in %s.' %(time.time() - set_timer))

            # update the boolean logic
            is_prime = False

            # break the while loop if condition met
            break

        # update the factor size if above condition isn't met
        factor += 1

    if is_prime == '':
        print('the input number %s is a prime' % num)
        print('Completed primality test in %s.' % (time.time() - set_timer))

## Call the function to test the primality
test_prime_W1(my_num)

 

Begin test for primality using both For and While loops:

$ python3 FindMeAprime.py
input a number: 15487457

The input number is 15487457

Level 01 primality test.
the input number 15487457 is a prime
Completed primality test in 1.303039312362671

Level 02 primality test.
the input number 15487457 is a prime
Completed primality test in 0.6651513576507568.

Primality test using while loop.
the input number 15487457 is a prime
Completed primality test in 0.9351861476898193.

Same test with another large number:

$ python3 FindMeAprime.py
input a number: 49979591

The input number is 49979591

Level 01 primality test.
the input number 49979591 is a prime
Completed primality test in 4.481180906295776

Level 02 primality test.
the input number 49979591 is a prime
Completed primality test in 2.3850677013397217.

Primality test using while loop.
the input number 49979591 is a prime
Completed primality test in 3.1510345935821533.

Result :

  • We can see that the optimized "for loop, i.e., level 02" was the fastest of all. The same algorithm (level 02), when implemented with a `while loop`, was slower than the `for loop` implementation.
    I am not very sure whether `while loops` are generally slower, so there is more testing and reading to be done on this matter (a minimal timing sketch follows below).
  • But the 'while loop' is a better fit for this problem, because down the road we will deal with dynamically reducing the size of the "maximum divisor".
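As the minimal timing sketch promised above (the helper functions div_test_for and div_test_while are made up for this comparison; they are not part of FindMeAprime.py), Python's standard timeit module can time the two loop styles on the same trial-division work:

import timeit

def div_test_for(num, max_factor):
    # trial-divide num by 2 .. max_factor-1 using a for loop
    for nth in range(2, max_factor):
        if num % nth == 0:
            return False
    return True

def div_test_while(num, max_factor):
    # the same trial division, driven by a while loop
    factor = 2
    while factor < max_factor:
        if num % factor == 0:
            return False
        factor += 1
    return True

num = 15487457                     # a known prime used earlier in this post
max_factor = round((num + 1) / 2)
print('for  :', timeit.timeit(lambda: div_test_for(num, max_factor), number=1))
print('while:', timeit.timeit(lambda: div_test_while(num, max_factor), number=1))

One plausible explanation (in CPython) is that range() iteration is driven by C code, while the explicit factor += 1 executes extra Python bytecode on every pass; that is consistent with the for-loop timings above, but a single micro-benchmark is not a general rule.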

Level 04: dynamic “max factor” implementation.

Previously, setting a "max factor" did decrease our computation time, but it still may not be the fastest way to run the primality test when the number being tested is very large.
To speed up the process, we can update our algorithm to dynamically shrink the "max factor" as the test progresses.

Concept behind the updated (dynamic max factor reduction) algorithm:

  • Previously, when the input number wasn't perfectly divisible by 2, we reduced the max factor to half of the input number, since the number couldn't be divided by anything larger than half its size.
  • Similarly, if the input number is divisible by neither 2 nor 3, it cannot be divisible by any factor larger than rounded one-third of its size (the cofactor of such a factor would have to be smaller than 3, and those have already been ruled out).
  • So, this gives us the opportunity to reduce the size of the "max factor" as the size of the trial factor increases.
  • This dynamic increase in the factor size and reduction of the max factor size is more naturally embedded in a while loop than in a for loop.
def test_prime_dmf(num):
    # dmf = dynamic max factor
    set_timer = time.time()
    print()
    print("Primality test using dynamic max factor on while loop.")

    # we start with this max factor
    max_factor = round((num+1)/2)

    is_prime = ''  # to store the boolean logic

    factor = 2  # set the initial factor
    while factor <= max_factor:  # we update the condition here to avoid infinite looping
        if num % factor == 0:
            # if a number gets divided into two positive natural numbers
            print('the input number %s is not a prime' % num)
            print('Completed primality test in %s.' %(time.time() - set_timer))

            # update the boolean logic
            is_prime = False

            # break the while loop if condition met
            break

        # update the factor size if above condition isn't met
        factor += 1

        # also reduce the size of the "max factor" if above condition isn't met
        max_factor = round((num+1)/factor)

        #print('factor, max factor:', factor, max_factor)

    if is_prime == '':
        print('the input number %s is a prime' % num)
        print('Completed primality test in %s.' % (time.time() - set_timer))

## Call the function to test the primality
test_prime_dmf(my_num)

Now, let’s test the primality using “max factor reduction” implementation:

$ python3 FindMeAprime.py
input a number: 49979591

The input number is 49979591

Level 01 primality test.
the input number 49979591 is a prime
Completed primality test in 4.220755338668823

Level 02 primality test.
the input number 49979591 is a prime
Completed primality test in 2.097686290740967.

Primality test using while loop.
the input number 49979591 is a prime
Completed primality test in 3.07643723487854.

Primality test using dynamic max factor on while loop.
the input number 49979591 is a prime
Completed primality test in 0.0027065277099609375.

Another test using larger prime :

python3 FindMeAprime.py
input a number: 573259321

The input number is 573259321

Level 01 primality test.
the input number 573259321 is a prime
Completed primality test in 50.32965445518494

Level 02 primality test.
the input number 573259321 is a prime
Completed primality test in 24.995513439178467.

Primality test using while loop.
the input number 573259321 is a prime
Completed primality test in 35.36520004272461.

Primality test using dynamic max factor on while loop.
the input number 573259321 is a prime
Completed primality test in 0.008965015411376953.

Result : Wow, you can see that the dynamic "max factor reduction" implementation reduces the run time significantly.


Level 05: Primality test using O(sqrt(N)) implementation

If you take the code from the above level 04 "dynamic max factor" implementation, activate (uncomment) the line
print('factor, max factor:', factor, max_factor), and test a prime, let's say "49979591",

$ python3 FindMeAprime.py
input a number: 49979591

The input number is 49979591

Level 01 primality test.
the input number 49979591 is a prime
Completed primality test in 4.343443870544434

Level 02 primality test.
the input number 49979591 is a prime
Completed primality test in 2.1754212379455566.

Primality test using while loop.
the input number 49979591 is a prime
Completed primality test in 3.0253350734710693.

Primality test using dynamic max factor on while loop.
....................
....................
....................
factor, max factor: 7066 7073
factor, max factor: 7067 7072
factor, max factor: 7068 7071
factor, max factor: 7069 7070
factor, max factor: 7070 7069

the input number 49979591 is a prime
Completed primality test in 0.0029039382934570312.

… you can see how the 'factor' and 'max factor' values change as the test progresses. One thing to note: as the 'factor' increases, the 'max factor' decreases, and the two converge towards the `square root (√N)` of the input number.

This shows that for a "primality test" we can set the `max factor` directly to the rounded square root of the input number. – The aaahaaa moment, isn't it!!! :)
Hence the `O(sqrt(N))` test.
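This convergence has a one-line justification: in any factorization of x into two factors, the smaller factor can never exceed √x:

x = a \cdot b,\quad a \le b \implies a^{2} \le a \cdot b = x \implies a \le \sqrt{x}

So if no divisor is found up to √x, none exists above it either, because its cofactor would have to sit below √x, and those have all been tested already.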

Source code for “O(sqrt(N))” implementation:

import math  # needed below for math.sqrt()

def test_primeSqrt(num):
    set_timer = time.time()
    print()
    print("Primality test using O(sqrt(N)) implementation.")
    max_factor = round(math.sqrt(num))
    is_prime = ''  # to store the boolean logic

    #print('max factor', max_factor)

    # instead of using for-loop on a range of numbers ..
    # in while loop we run the loop until a condition is met ..
    # i.e here we start dividing by factor 2 and then increase the factor ..
    # .. by 1 until it becomes equal to "max_factor"
    factor = 2
    while factor <= max_factor:
        if num % factor == 0:
            # if a number gets divided into two positive natural numbers
            print('the input number %s is not a prime' % num)
            print('Completed primality test in %s.' %(time.time() - set_timer))

            # update the boolean logic
            is_prime = False

            # break the while loop if condition met
            break

        # update the factor size if above condition isn't met
        factor += 1

    if is_prime == '':
        print('the input number %s is a prime' % num)
        print('Completed primality test in %s.' % (time.time() - set_timer))

## Call the function to test the primality
test_primeSqrt(my_num)

Let's begin some tests with the sqrt(N) implementation.

$ python3 FindMeAprime.py
input a number: 49979591

The input number is 49979591

Level 01 primality test.
the input number 49979591 is a prime
Completed primality test in 4.27828311920166

Level 02 primality test.
the input number 49979591 is a prime
Completed primality test in 2.130234956741333.

Primality test using while loop.
the input number 49979591 is a prime
Completed primality test in 3.227029800415039.

Primality test using dynamic max factor on while loop.
the input number 49979591 is a prime
Completed primality test in 0.003675699234008789.

Primality test using O(sqrt(N)) implementation.
max factor 7070
the input number 49979591 is a prime
Completed primality test in 0.0009083747863769531.

Another test :

$ python3 FindMeAprime.py
input a number: 573259321

The input number is 573259321

Level 01 primality test.
the input number 573259321 is a prime
Completed primality test in 50.293437004089355

Level 02 primality test.
the input number 573259321 is a prime
Completed primality test in 24.826218366622925.

Primality test using while loop.
the input number 573259321 is a prime
Completed primality test in 36.55556273460388.

Primality test using dynamic max factor on while loop.
the input number 573259321 is a prime
Completed primality test in 0.00992727279663086.

Primality test using O(sqrt(N)) implementation.
max factor 23943
the input number 573259321 is a prime
Completed primality test in 0.0028390884399414062.

Result : Wow, the sqrt(N) implementation worked like a flash.


Level 06 : Primality test using “Sieve of Eratosthenes”

This method is mainly used to find all the prime numbers within a certain range. You can read about the details of the algorithm here: Sieve of Eratosthenes, and apply it in Python. I will add a Python implementation of this algorithm in the future.

In the meantime, you can take the above code and update it to find the prime numbers within a given range of inputs (say between 101-10001); a minimal sketch follows.
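Below is one such sketch of the classic sieve under its usual textbook formulation – treat it as an illustrative starting point rather than the promised final implementation (the function name sieve_primes is just made up here):

def sieve_primes(low, high):
    # classic Sieve of Eratosthenes: mark the multiples of each prime
    # as composite, then collect whatever stays unmarked
    is_composite = [False] * (high + 1)
    for n in range(2, int(high ** 0.5) + 1):
        if not is_composite[n]:
            # start marking at n*n; the smaller multiples of n were
            # already marked while sieving with smaller primes
            for multiple in range(n * n, high + 1, n):
                is_composite[multiple] = True
    return [n for n in range(max(low, 2), high + 1) if not is_composite[n]]

print(sieve_primes(101, 200))  # the primes between 101 and 200

Note the design trade-off: the sieve spends O(N) memory on the marking table, whereas the trial-division tests above use almost no memory – a concrete case of trading "memory" for "computation time" when testing many numbers at once.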

You can read more about primes online and find some large prime numbers to test.

 


Conclusion

So, in this tutorial we were able to:

  • run a "primality test" using the base algorithm derived from the definition of a "prime number".
  • update the algorithm and code to reduce the run time significantly.
  • apply the concepts of the `for loop` vs. the `while loop` to the same problem and see why a particular loop is better suited.
  • learn algorithm optimization by starting with the simple concept of a prime and the primality test, then reducing the test's run time by adding conditional measures.

Finally, algorithms are simply a means to an end for the problem at hand. Algorithm design is more an art of finding a pattern for solving the problem, and then adding more refined patterns for runtime and memory optimization.


Extension of the ReadBackPhased haplotypes using phASE-Extender.

 

There is a saying, “Devil is in the details“.

In this blog post, I will try to explain the algorithm that phASE-Extender uses under the hood for the preparation of long-range haplotypes (and possibly genome-wide haplotypes). phASE-Extender takes short blocks of ReadBackPhased haplotypes in a population of samples and applies an LD test between two consecutive blocks of one sample at a time. I am not delving into further details of haplotype phasing and RBphasing – I am sure there are other excellent sources of information on this matter.


See these links:
  - Purpose and operation of Read-backed-phasing GATK
  - Read backed phasing documentation GATK

Part A :  ReadBackPhased haplotype in a population of samples.

Here I start the procedure right from the short haplotype blocks, which are prepared using the ReadBackPhasing method employed by the phaser software (Castel et al. 2016). The phased genotypes in the VCF file produced by phaser carry PI and PG tags in the FORMAT field, which represent the unique block index and the phased state of the genotypes included in that block, respectively. The PI and PG values are extracted from the VCF file using the Python script vcf_to_table-v2.py, and a haplotype file is created.

Fig 01: A typical VCF file produced by phaser.

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT ms01e ms02g ms03g ms04h ms05h ms06h
2 15881764 . T C 1253 PASS . GT:PI:PG 0/1:4:0|1 0/1:6:0|1 0/0:.:0/0 0/0:.:0/0 . .
2 15881767 . C T 9683 PASS . GT:PI:PG 0/0:4:0|0 0/1:6:0|1 0/0:.:0/0 0/0:.:0/0 . .
2 .......
2 .......
2 15882451 . T C 9683 PASS . GT:PI:PG 0/1:4:1|0 0/1:4:0|1 0/0:.:0/0 0/0:.:0/0 . .
2 15882454 . T C 9683 PASS . GT:PI:PG 0/1:4:1|0 0/1:4:0|1 0/0:.:0/0 0/0:.:0/0 . .
2 ....... 
2 .......

**Note: This VCF has been edited to fit the screen.
        For more details, see: What is a VCF file?

Fig 02: A typical haplotype file produced from the above VCF (not exact though).

CHROM   POS     REF   all-alleles   ms01e:PI   ms01e:PG_al   ms02g:PI   ms02g:PG_al   ms03g:PI   ms03g:PG_al   ms04h:PI   ms04h:PG_al   ms05h:PI   ms05h:PG_al   ms06h:PI   ms06h:PG_al
2      15881764   .      .           4           C|T           6           C|T           7           T|T           7           T|T           7           C|T           7           C|T
2      15881767   .      .           4           C|C           6           T|C           7           C|C           7           C|C           7           T|C           7           C|C
2      15881989   .      .           4           C|C           6           A|C           7           C|C           7           C|C           7           A|T           7           A|C
2      15882091   .      .           4           G|T           6           G|T           7           T|A           7           A|A           7           A|T           7           A|C
2      15882451   .      .           4           C|T           4           T|C           7           T|T           7           T|T           7           C|T           7           C|A
2      15882454   .      .           4           C|T           4           T|C           7           T|T           7           T|T           7           C|T           7           C|T
2      15882493   .      .           4           C|T           4           T|C           7           T|T           7           T|T           7           C|T           7           C|T
2      15882505   .      .           4           A|T           4           T|A           7           T|T           7           T|T           7           A|C           7           A|T

The main idea is to prepare a pool of several samples that have short phased haplotype blocks, represented as PI (phase block index) and PG_al (phased genotype). For any sample, a site where two consecutive haplotype blocks are not joined represents a break point. In the above haplotype data, not all samples have their breakpoint at the same position, which is the most common occurrence, at least within a gene. Generally, other samples have a haplotype block that bridges the two consecutive blocks and therefore contains the information required to extend the phase state.
The main function of phASE-Extender is thus to mend this break point in a sample by taking two consecutive blocks at a time and computing likelihood estimates of the LD between the blocks, using data from the other samples.

Fig 03: Representing a break-point in the “sample – ms02g”

contig    pos    ms02g:PI    ms02g:PG_al
2     15881764     6         C|T
2     15881767     6         T|C
2     15881989     6         A|C
2     15882091     6         G|T
                   ×——————————×—————> Break Point
2     15882451     4         T|C
2     15882454     4         T|C
2     15882493     4         T|C
2     15882505     4         T|A

In the above haplotype file there is a breakpoint in sample ms02g between positions 15882091 and 15882451, but all the other samples have an intact haplotype that bridges this breakpoint. We can therefore use LD information from the other samples to join the two consecutive haplotypes in sample ms02g.

For sample ms02g at PI=6, the RBphased haplotypes are C-T-A-G and T-C-C-T. Similarly, at PI=4, the phased haplotypes are T-T-T-T and C-C-C-A. But since the haplotype is broken into two blocks, we don't know which phase of PI=6 goes with which phase of PI=4.

Part B : Markov chain, and emission and transition matrices between ReadBackPhased haplotypes.


To the human eye/mind it is clear that the left haplotype of ms02g in block PI-6 (i.e., C-T-A-G) is more likely to go with the right haplotype of block PI-4 (i.e., C-C-C-A). Therefore, the most likely extended haplotypes are C-T-A-G-C-C-C-A and T-C-C-T-T-T-T-T.
But how do you represent this logic mathematically and feed it to the computer?

Since all the other samples are completely phased across that breakpoint, I can estimate the LD between the blocks of sample ms02g by running a Markov chain and computing emission and transition probabilities between the sites of the former and latter blocks.

Below, I prepare the first-order Markov chain and transition matrix to compute the likelihood estimates, calculate the log2Odds, and then assign and extend the haplotype in the proper configuration.

Step 01:   Prepare the data structure of haplotype configuration between the blocks.

  • The top PI-6 is Block-1 and the bottom PI-4 is Block-2.
  • The phased haplotype in the left within each block is Hap-A and on the right is Hap-B.


Fig 04: Representation of two consecutive haplotype blocks before phase extension.

ms02g:PI    ms02g:PG_al
  6         C|T \
  6         T|C | Block 1
  6         A|C |
  6         G|T /
  ×——————————×—————> Break Point
  4         T|C \
  4         T|C |
  4         T|C | Block 2
  4         T|A /
           ↓   ↓
       Hap-A   Hap-B

So, the two consecutive blocks can be extended in one of the two possible haplotype configurations:

Fig 05: Possible configuration in which haplotype blocks can be extended.

Parallel Configuration:
 Block01-HapA with Block02-HapA, so B01-HapB with B02-HapB 
Alternate Configuration:
 Block01-HapA with Block02-HapB, so B01-HapB with B02-HapA

Step 02: Prepare the emission and transition probability matrices and estimate the likelihoods.

  • Estimate the emission counts of each nucleotide (A, T, G, C) at the genomic positions closest to the breakpoint (here positions 15882091 and 15882451); see fig 07-A.
    • Convert the emission counts to emission probabilities; see fig 07-B.
  • Estimate the transition counts from each nucleotide (A, T, G, C) of PI-6 to each nucleotide (A, T, G, C) of PI-4, for both haplotype configurations, across all the samples; see fig 08-A.
    • An observed transition is counted as "1" if the "PI value" matches between the former and latter sites, else as "0.5".
  • The transition counts are then converted to transition probabilities for each emission in the former block; see fig 08-B (a code sketch of this counting follows fig 08-B).
  • Along the chain, the likelihood estimates are maximized using either the max-sum or the max-product algorithm for each configuration.
  • Finally, the log2Odds (the log2 ratio of the maximum likelihood estimates) is computed to identify the most likely haplotype configuration.


Fig 06: Allele transition matrix (from all sites of former block-01 to all sites of later block-02)

Transitions     ms02g:PG_al                    ms02g:PI
  4th ->       ┌┬┬┬  C|T \                       6
 3rd ->       ┌┬┬┬  T|C | Block 1               6
 2nd ->       ┌┬┬┬  A|C |                       6
 1st ->       ┌┬┬┬  G|T /                       6
 └────────────> ││││   ×—————> Break Point
                │││└> T|C \                       4
                ││└─> T|C | Block 2               4
                │└──> T|C |                       4
                └───> T|A /                       4
                     ↓   ↓ 
                 Hap-A   Hap-B

**Note:
 - Transitions are computed starting with the closest heterozygous
sites between the two blocks. This method maximizes the LD
(linkage disequilibrium) estimates between the two blocks.
 - Therefore, the "1st" transition starts with the heterozygous
sites of the two blocks that are closest to each other.
 - The "2nd" transition starts from the 2nd-closest heterozygous
site of the former block to the first heterozygous site of the
latter block.
 - All the other transitions are computed in a similar manner.


Fig 07 : Nucleotide counts (emission counts) and probabilities at positions 15882091 and 15882451 across all the samples.

Fig 07 (A): Emission counts of each possible nucleotide
 pos\allele     A    T    G    C    total
 15882091       5    4    2    1    12
 15882451       1    7    0    4    12

**Note:
 - These emission counts are computed for all the nucleotides
(A,T,G,C) at each position of the two consecutive blocks.
 - The emission counts are then converted to emission probabilities.

Fig 07 (B): Emission probabilities of each possible nucleotide
 pos\allele     A        T        G        C        total
 15882091       5/12     4/12     2/12     1/12     1.0
 15882451       1/12     7/12     0        4/12     1.0

 

Fig 08 : Transition counts and probabilities for nucleotides from position 15882091 to position 15882451, across all the samples

Fig 08 (A): Representation of the transition matrix (counts)
                   to (pos 15882451)
 from               A    T    G    C      total
 (pos 15882091)
   A                0    3    0    2        5
   T                0    3.5  0    0.5      4
   G                0    0.5  0    1.5      2
   C                1    0    0    0        1

**Note:
 - The transition matrix above is only for the alleles at the last
position of the former block to the first position of the latter block.
 - Transitions are calculated in this way, from the nucleotides
(A,T,G,C) at block 01 to the nucleotides (A,T,G,C) at block 02,
for all the positions.
 - If the "PI" values match between the two blocks in a sample, the
observed transition is counted as 1, else as 0.5, because a
non-matching "PI" indicates that all possible configurations are
equally likely.
 - The transition counts are then converted to transition
probabilities.

Fig 08 (B): Representation of the transition matrix (probabilities)
                   to (pos 15882451)
 from               A        T        G        C        total
   A                0        3/5      0        2/5       1.0
   T                0        3.5/4    0        0.5/4     1.0
   G                0        0.5/2    0        1.5/2     1.0
   C                1/1      0        0        0         1.0
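To make the counting concrete, here is a small, self-contained Python sketch that rebuilds the emission counts of fig 07-A and the weighted transition counts of fig 08-A from the fig 02 alleles flanking the ms02g breakpoint. This is an illustrative re-implementation of the counting rules described above, not phASE-Extender's actual source code:

from collections import Counter, defaultdict

# Alleles at the two sites flanking the ms02g breakpoint (from fig 02).
# Each entry: (hapA@15882091, hapB@15882091, hapA@15882451, hapB@15882451,
#              does the sample's PI stay the same across the breakpoint?)
samples = {
    'ms01e': ('G', 'T', 'C', 'T', True),   # PI 4 -> PI 4
    'ms02g': ('G', 'T', 'T', 'C', False),  # PI 6 -> PI 4 (the breakpoint)
    'ms03g': ('T', 'A', 'T', 'T', True),   # PI 7 -> PI 7
    'ms04h': ('A', 'A', 'T', 'T', True),   # PI 7 -> PI 7
    'ms05h': ('A', 'T', 'C', 'T', True),   # PI 7 -> PI 7
    'ms06h': ('A', 'C', 'C', 'A', True),   # PI 7 -> PI 7
}

emission = Counter()             # allele counts at 15882091 (fig 07-A)
transition = defaultdict(float)  # weighted (from, to) counts  (fig 08-A)

for hA1, hB1, hA2, hB2, pi_match in samples.values():
    emission.update([hA1, hB1])
    if pi_match:
        # phase is known across the breakpoint: hapA stays with hapA and
        # hapB with hapB, so each observed transition counts as 1
        transition[(hA1, hA2)] += 1
        transition[(hB1, hB2)] += 1
    else:
        # phase is unknown (PI differs): all four pairings are equally
        # likely, so each possible transition counts as 0.5
        for frm in (hA1, hB1):
            for to in (hA2, hB2):
                transition[(frm, to)] += 0.5

print(emission)  # Counter({'A': 5, 'T': 4, 'G': 2, 'C': 1})

# normalize the counts to probabilities (fig 07-B and fig 08-B)
n_hap = sum(emission.values())   # 12 haplotypes in the pool
emission_prob = {a: c / n_hap for a, c in emission.items()}

row_total = defaultdict(float)
for (frm, to), c in transition.items():
    row_total[frm] += c
transition_prob = {pair: c / row_total[pair[0]]
                   for pair, c in transition.items()}

print(transition_prob[('T', 'T')])  # 3.5 / 4 = 0.875

The printed emission counts and the T-to-T transition probability match fig 07-A and fig 08-B above.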

Therefore, the emission and transition probabilities for a 4-state Markov process can be represented with the figures below:


Fig 09 (A) : “Emission probabilities” of nucleotides at position 15882091.


Fig 09 (B) : “Transition probabilities” of nucleotides to position 15882451 following emission at 15882091.


Latex source code for this figure is at: https://www.overleaf.com/16519256jwcxmrbkysqp#/63442557/

 

Step 03: Compute the maximum likelihood for each configuration


Maximum likelihood for a Markov chain can be estimated as the "maximized sum" or "maximized product" of all the "p(emission) x p(transition)" terms in the chain. Below, I show the maximum likelihood estimate for an example snippet (first using just two positions) with the max-sum approach, and then another example with the two small haplotype blocks.


Fig 10 : Example (01) snippet using “max-sum” algorithm with just two sites.

If we consider just two positions (15882091 & 15882451) we
can estimate the likelihood of each possible configuration as:

Parallel configuration:            Alternate configuration:
G|T                                G|T
T|C                                C|T
  p(G)*p(GtT) = 2/12 * 0.5/2         p(G)*p(GtC) = 2/12 * 1.5/2
+ p(T)*p(TtC) = 4/12 * 0.5/4       + p(T)*p(TtT) = 4/12 * 3.5/4
————————————————— = 0.08333        ————————————————— = 0.41667
Avg. likelihood = 0.041667         Avg. likelihood = 0.2083

**note: 
- "AtC" -> represent "A" to "C" transition 
- "+" represents the summation of the likelihoods
- "+" may be replaced with "*" for max-product.


Estimate Likelihood ratio:
= likelihood of parallel config / likelihood of alternate config
= 0.041667/0.2083
= 0.200033

Therefore, the alternate configuration is 4.99 (i.e., 0.200033⁻¹)
times more likely than the parallel configuration.

log₂Odds(likelihood ratio)=
log₂(0.200033) = -2.321689601
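This two-site arithmetic is small enough to verify mechanically. The few lines of Python below (simply re-doing fig 10's sums with values copied from figs 07-B and 08-B; they are not part of any package) reproduce the two likelihoods and the log2Odds:

import math

p_G, p_T = 2/12, 4/12         # emission probabilities at 15882091
p_GtT, p_GtC = 0.5/2, 1.5/2   # transition probabilities G->T and G->C
p_TtC, p_TtT = 0.5/4, 3.5/4   # transition probabilities T->C and T->T

parallel  = (p_G * p_GtT + p_T * p_TtC) / 2   # average over the 2 haplotypes
alternate = (p_G * p_GtC + p_T * p_TtT) / 2

print(parallel, alternate)               # 0.041666..., 0.208333...
print(math.log2(parallel / alternate))   # -2.3219...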

Example (02) – when two RBphased blocks are present.


RBphased haplotypes actually have multiple SNPs strung together. We can therefore prepare a Markov chain and apply Markov transitions over all possible nucleotide combinations in the two possible haplotype configurations. This then provides a better estimate of the likelihood of each haplotype configuration between the two consecutive blocks.

(put a figure here showing all possible transition?? )

Fig 11 – A : Likelihood estimates for the parallel configuration using the "max-sum" algorithm.

Parallel configuration:
Block-1-Hap-A (C-T-A-G) with Block-2-Hap-A (T-T-T-T)
  p(G) * [p(GtT) + p(GtT) + p(GtT) + p(GtT)] = (2/12) * [(0.5/2)+(0.5/2)+(0.5/2)+(0.5/2)] = 0.1667
+ p(A) * [p(AtT) + p(AtT) + p(AtT) + p(AtT)] = (3/12) * [(1.5/3)+(0.5/3)+(0.5/3)+(0.5/3)] = 0.25  
+ p(T) * [p(TtT) + p(TtT) + p(TtT) + p(TtT)] = (2/12) * [(1.5/2)+(0.5/2)+(0.5/2)+(0.5/2)] = 0.25
+ p(C) * [p(CtT) + p(CtT) + p(CtT) + p(CtT)] = (4/12) * [(1.5/4)+(0.5/4)+(0.5/4)+(0.5/4)] = 0.25
——————— ————————— ——————— ————————— Max Sum (score) = 0.9167
                                    Average (score) = 0.229167
+ (further summed with)
Block-1-Hap-B (T-C-C-T) with Block-2-Hap-B (C-C-C-A)
  p(T) * [p(TtC) + p(TtC) + p(TtC) + p(TtA)] = (4/12) * [(0.5/4)+(0.5/4)+(0.5/4)+(0.5/4)] = 0.1667
+ p(C) * [p(CtC) + p(CtC) + p(CtC) + p(CtA)] = (8/12) * [(1.5/8)+(1.5/8)+(1.5/8)+(1.5/8)] = 0.5
+ p(C) * [p(CtC) + p(CtC) + p(CtC) + p(CtA)] = (10/12) * [(2.5/10)+(2.5/10)+(2.5/10)+(2.5/10)] = 0.8333
+ p(T) * [p(TtC) + p(TtC) + p(TtC) + p(TtA)] = (8/12) * [(0.5/8)+(0.5/8)+(0.5/8)+(0.5/8)] = 0.1667
——————— ————————— ——————— ————————— Max Sum (score) = 1.6667 
                                    Average (score) = 0.416675


Final Max-Sum score (parallel configuration) = 
0.229167 + 0.416675 = 0.645842

Fig 11 – B: Likelihood estimates for the alternate configuration using the "max-sum" algorithm.

Alternate configuration:
Block-1-Hap-A (C-T-A-G) with Block-2-Hap-B (C-C-C-A)
  p(G) * [p(GtC) + p(GtC) + p(GtC) + p(GtA)] = (2/12) * [(1.5/2)+(1.5/2)+(1.5/2)+(1.5/2)] = 0.5  
+ p(A) * [p(AtC) + p(AtC) + p(AtC) + p(AtA)] = (3/12) * [(1.5/3)+(2.5/3)+(2.5/3)+(2.5/3)] = 0.75 
+ p(T) * [p(TtC) + p(TtC) + p(TtC) + p(TtA)] = (2/12) * [(0.5/2)+(1.5/2)+(1.5/2)+(1.5/2)] = 0.4167
+ p(C) * [p(CtC) + p(CtC) + p(CtC) + p(CtA)] = (4/12) * [(2.5/4)+(3.5/4)+(3.5/4)+(3.5/4)] = 1.0833  
——————— ————————— ——————— ————————— Max Sum (score) = 2.75 
                                    Average (score) = 0.6875

Block-1-Hap-B (T-C-C-T) with Block-2-Hap-A (T-T-T-T) 
  p(T) * [p(TtT) + p(TtT) + p(TtT) + p(TtT)] = (4/12) * [(3.5/4)+(3.5/4)+(3.5/4)+(2.5/4)] = 1.0833
+ p(C) * [p(CtT) + p(CtT) + p(CtT) + p(CtT)] = (8/12) * [(5.5/8)+(6.5/8)+(6.5/8)+(6.5/8)] = 2.0833
+ p(C) * [p(CtT) + p(CtT) + p(CtT) + p(CtT)] = (10/12) * [(6.5/10)+(7.5/10)+(7.5/10)+(6.5/10)] = 2.3333
+ p(T) * [p(TtT) + p(TtT) + p(TtT) + p(TtT)] = (8/12) * [(6.5/8)+(7.5/8)+(7.5/8)+(6.5/8)] = 2.3333
——————— ————————— ——————— ————————— Max Sum (score) = 7.8332
                                    Average (score) = 1.9583


Final Max-Sum score (alternate configuration) = 
0.6875 + 1.9583 = 2.6458

Fig 12 : Likelihood estimate of Parallel vs. Alternate configuration.

Likelihood ratio of Parallel vs. Alternate configuration

= likelihood of Parallel config / likelihood of Alternate config

= 0.645842 / 2.6458
= 0.2441  (i.e., 1/4.0967)

Therefore, the haplotype is 4.097 times more likely to be phased
in the "alternate configuration".

log2Odds = log2(0.2441) = -2.0344
So, with our default cutoff threshold of "|1.5|", we will extend
the haplotype in the "alternate configuration"

# Output data:
contig   pos   ref   all-alleles   ms02g_PI   ms02g_PG_al
2   15881764   .    .                  6      C|T
2   15881767   .    .                  6      T|C
2   15881989   .    .                  6      A|C
2   15882091   .    .                  6      G|T
2   15882451   .    .                  6      C|T
2   15882454   .    .                  6      C|T
2   15882493   .    .                  6      C|T
2   15882505   .    .                  6      A|T

Max-Product algorithm:
 – If we instead want to assume full independence between each "emission-transition" event, we can apply the "max-product" algorithm.
 – In the "max-product" algorithm we maximize the score by multiplying the likelihoods of each "emission-transition" estimate.
 – So, we can replace all the "+" with "*", and include p(emission of a nucleotide) raised to the number of transition events.

to continue ……

Mathematical representation of the algorithm:

Consider a chromosome from a diploid organism (with "y" heterozygous sites) split into "z" ReadBackPhased haplotype blocks of random size, such that the length of any RBphased haplotype block (z_x) satisfies:

1\ <\ len\left( z_{x}\right)\ <\ y-2

In a diploid organism, each RBphased block is further represented by two haplotype strings: H_{z} = \left\{ h_{z},\overline {h_{z}}\right\}

So, for "z" RBphased blocks there are 2^z possible haplotype configurations, with one particular configuration being the true chromosome-wide phase state.

The chromosome-wide haplotype can be represented as the combined phase state of the "z" haplotype blocks, each being a pair \left\{ h,\overline {h}\right\}:

CW_{hap}=\left\{ H_{1}\right\} \left\{ H_{2}\right\} \ldots \left\{ H_{z}\right\}

So, a chromosome-wide haplotype is:

CW_{hap}=\left\{\begin{aligned}\overline {h_{1}}\\ h_{1}\end{aligned}\right\} \left\{\begin{aligned}\overline {h_{2}}\\ h_{2}\end{aligned}\right\} \ldots \left\{\begin{aligned}\overline {h_{z}}\\ h_{z}\end{aligned}\right\}

But unlike the classical haplotype phasing problem, here we can solve the phase state at the chromosome level by solving the phase state between two consecutive haplotype blocks at a time.

The most likely haplotype state H = \left\{ h,\overline {h}\right\}, given two consecutive haplotype blocks,

Block 1:  H_{1} = \left\{ h_{1},\overline {h_{1}}\right\}

Block 2:  H_{2} = \left\{ h_{2},\overline {h_{2}}\right\}

can be one of:

H=\begin{cases}\left\{ h_{1}h_{2},\overline {h_{1}}\,\overline {h_{2}}\right\} ^{\nearrow\ parallel\ configuration}_{OR}\\ \left\{ h_{1}\overline {h_{2}},\overline {h_{1}}h_{2}\right\} ^{\searrow\ alternate\ configuration}\end{cases}

Using Markov chain transitions (from each site in Block 1 to each site in Block 2), we compute the likelihood of each possible configuration:

h,\overline {h}=\begin{cases}\left( h_{1}h_{2}\right) ,\left( \overline {h_{1}},\overline {h_{2}}\right) ^{\longrightarrow with likelihood (L{_1}) }\\ \left( h_{1}\overline {h_{2}}\right) ,\left( \overline {h_{1}}h_{2}\right) ^{\longrightarrow with likelihood (L{_2}) }\end{cases}

Say Block 1 has m sites (indexed by "i") and Block 2 has n sites (indexed by "j"), and let "D" numerically denote the four possible nucleotide bases "A, T, G, C" as:

D:\ A=1,\ T=2,\ G=3,\ C=4

then the estimated likelihoods are:

\begin{cases}L_{1}=\prod ^{2} _{k=1=l}\ \prod^{m} _{i=1}\ \prod^{n} _{j=1} \left[ \dfrac {D_{ghij}}{D_{g.ij}}\right] \\ \\ L_{2}=\prod ^{2} _{k=1\neq l}\ \prod^{m} _{i=1}\ \prod^{n} _{j=1} \left[ \dfrac {D_{ghij}}{D_{g.ij}}\right]\end{cases}

where bases "g" and "h" are what is observed at site pair "i, j" in this configuration, in this allele pair,

where,

D_{ij} = the 4 × 4 matrix of nucleotide combinations, with base g = (1 to 4) at site "i" × base h = (1 to 4) at site "j", and

D_{ghij} = \sum^{S} _{f=1} P_{phase}

where P_{phase} is the probability that the phase occurs: if the phase is unknown, p = 0.5; if the phase is known, p = 1 when it occurs and p = 0 when it does not.

contd…..

#########

Objective: Find the most likely haplotype configuration in a sample (Si), given the two consecutive haplotype blocks for that sample and given the haplotype configurations at those sites in the other samples, h^{n}_{S=1}. Formally, this objective can be written as shown below.

Consider a whole genome split into "y" short haplotypes, similar to ReadBackPhased haplotypes. Two consecutive ReadBackPhased haplotypes can therefore be represented as H1 = {h1, h1c} and H2 = {h2, h2c}. The most likely haplotype H = {h, hc}, conditional upon H1 = {h1, h1c} and H2 = {h2, h2c}, can be represented as:

H = \operatorname*{argmax}_{H} \Pr \left( H \mid h_{1}h_{2},\ h^{n}_{S=1}\right)

and can be calculated by computing the observed Markov transition probabilities from each site of haplotype H1 to each site of haplotype H2.

At any particular site, a haplotype can be represented by two of the four possible nucleotide bases:

set = {A,T,G,C} numerically denoted as

D : A=1, T=2, G=3, C=4

One character denotes the maternal allele while the other denotes paternal allele.

Assumptions: (Citation …Chistine Lo ?? )

  • A new haplotype in the population is more likely to match an existing haplotype that has occurred more frequently in the population.
  • The probability of seeing a novel haplotype decreases as the sample size, n, increases.
  • The probability of seeing a novel haplotype increases as the mutation rate increases.
  • If the next haplotype is not exactly the same as an existing haplotype, it will tend to differ by a small number of mutations from an existing haplotype rather than be completely different from all existing haplotypes.
  • Due to recombination, the next haplotype will tend to look somewhat similar to existing haplotypes over contiguous genomic regions. The length of such a region is longer in areas where the recombination rate is smaller.

In `phASE-Extender` we only have to calculate the likelihoods of two possible configurations (with two likelihood estimates in each), against the likelihoods of 2^k configurations in non-ReadBackPhased methods.

Mathematical notation and Statistical equation representing the above process:
Ask on stack overflow?

For a given pool of "S" sample haplotypes from a diploid organism,

number of sample haplotypes = S[f]

For a sample "S[f1]" we have two consecutive haplotype blocks = {1, 2}, with "m" and "n" heterozygous sites respectively. Each block has two phased haploid haplotypes: "haplotypes 1, 2 [k]" for "block 1" and "haplotypes 1, 2 [l]" for "block 2". Thus, for two consecutive blocks there are two possible phase configurations:

k = l, and k != l

In a genomic region with "k" heterozygous sites, there are 2^(k-1) possible haplotype configurations. The objective of haplotype phasing is to recover the two true haplotypes of an individual out of those 2^(k-1) configurations.

The transition probability from the "i-th" element of Block 01 to the "j-th" element of Block 02 can be represented by the function:

f(Y|X) =

The cumulative probability (or likelihood) of …