Detail of message

Online Judge

Problem Set

Authors

Online Contests

User

Web Board
Home Page
F.A.Qs
Statistical Charts

Current Contest
Past Contests
Scheduled Contests
Award Contest

Re:High paid Parttime job :AI/algorithims expert needed for genomic assembly research project

Posted by deanguo at 2006-04-28 07:54:34
In Reply To:High paid Parttime job :AI/algorithims expert needed for genomic assembly research project Posted by:deanguo at 2006-04-28 07:53:32

> 
> HI,
> I am looking for an AI/algorithims programmer for doing graduate school research project.
> It is very challenging and technically/financially awarding research project.
> The project may take 6-9 months based on  20hours/week.
> If you are interested, please send email with resume to familyg45@hmails.com
> We can discuss more details through email or phone.
> thanks
> Dean Guo
> 
> 
> 
> 
> >> Details of description of projects
> The problem I am working on is called genomic assembly.
> Basic description of the assembly problem: Given a large string of
> letters G,
> many smaller strings of letters S_i are independently and randomly
> sampled from G.
> Each S_i comes from an unknown place in G, we can only compare two S_i
> to see if they overlap.
> If they overlap, they probably were sampled from adjacent positions in
> G.
> We compare S_i to each other repeatedly until we form a good guess at
> the original genome.
> This is like coding an algorithm to solve a jigsaw puzzle.
> Difficulty: S_i can be sampled with errors from G, G may contain
> regions
> that are repetitive,
> making false S_i - S_j matches likely.
> 
> For example, if you have ACGT and GTCA maybe the pieces fit like this:
> 
> ACGT
> __GTCA
> 
> We repeatedly join together pieces until all of them are in a line like
> this:
> AAAAAAAAACCCCCCCC
>               AAACCCCCCCCCCCC
>                                    CCCCCCCCCGGG
> The basic goal is determine which pieces fit together and which do not.
> Need to avoid false matches, while also detecting true matches.
> We do not have time to compare every pair directly, it would be too
> slow.
> A lot of dynamic programming is used to optimize string
> comparisons/matching.
> 
> This problem of assembling genomes was first solved by the Human Genome
> Project in 1998.
> We are working on a similar problem, but with a twist: input data will
> have 20% error rates (very high).
> ie if a letter in the genome is an A,
> there is a 20% probability that the letter will appear as a C,G, or T
> in the sample.
> => Our algorithms must be robust to handle high error data,
> yet fine tuned enough to detect matches between two mutated samples.
> 
> Question:  what's is the length of the genome strings: The size of the single string and the total size.and give me the more samples?
>  ***** Each of the genome samples (strings) is of size 100,000. The human genome is of size 3,000,000,000.
> We have a total of 300,000 strings. This is 10x coverage, ie 3,000,000,000 * 10 / 100,000 = 300,000.
> The idea is that for each position in the genome, if we take 300,000 random indepedent samples,
> each position will be covered by 20 strings, 10 to its left and 10 to its right.
> 
> Question: what is the typical values of:
> 1. how many samples do we have, for a single gnome ?
> 2. how long is each sample?
> ****The project is for a "hypothetical" new technology that is able to sequence long, high-error samples.
> We create simulated samples of length approx. 100k letters.
> The human genome is of size 3,000,000,000. We have a total of 300,000 strings.
> This is 10x coverage, ie 3,000,000,000 * 10 / 100,000 = 300,000.
> The idea is that if we take 300,000 random indepedent samples,
>  each position will on average be covered by 20 strings, 10 to its left and 10 to its right.
> 
> The strings have 20-30% point mutation error rates.
> ie each letter in each sample has a prob 20-30% of being mutated to another letter.
> Also have small mutation % for insertions/deletions in the sample.
> The challenge is to detect when two samples come from the same region in the genome (match/overlap).
>  Need to avoid false positive matches, while also not filtering out all true positives.
> 
> >>

Followed by:

Sorry, my email address is familyg45@gmail.com (46) deanguo 2006-04-28 07:57:02

Post your reply here:

Home Page Go Back To top