| ||||||||||
| Online Judge | Problem Set | Authors | Online Contests | User | ||||||
|---|---|---|---|---|---|---|---|---|---|---|
| Web Board Home Page F.A.Qs Statistical Charts | Current Contest Past Contests Scheduled Contests Award Contest | |||||||||
Re:High paid Parttime job :AI/algorithims expert needed for genomic assembly research projectIn Reply To:High paid Parttime job :AI/algorithims expert needed for genomic assembly research project Posted by:deanguo at 2006-04-28 07:53:32 > > HI, > I am looking for an AI/algorithims programmer for doing graduate school research project. > It is very challenging and technically/financially awarding research project. > The project may take 6-9 months based on 20hours/week. > If you are interested, please send email with resume to familyg45@hmails.com > We can discuss more details through email or phone. > thanks > Dean Guo > > > > > >> Details of description of projects > The problem I am working on is called genomic assembly. > Basic description of the assembly problem: Given a large string of > letters G, > many smaller strings of letters S_i are independently and randomly > sampled from G. > Each S_i comes from an unknown place in G, we can only compare two S_i > to see if they overlap. > If they overlap, they probably were sampled from adjacent positions in > G. > We compare S_i to each other repeatedly until we form a good guess at > the original genome. > This is like coding an algorithm to solve a jigsaw puzzle. > Difficulty: S_i can be sampled with errors from G, G may contain > regions > that are repetitive, > making false S_i - S_j matches likely. > > For example, if you have ACGT and GTCA maybe the pieces fit like this: > > ACGT > __GTCA > > We repeatedly join together pieces until all of them are in a line like > this: > AAAAAAAAACCCCCCCC > AAACCCCCCCCCCCC > CCCCCCCCCGGG > The basic goal is determine which pieces fit together and which do not. > Need to avoid false matches, while also detecting true matches. > We do not have time to compare every pair directly, it would be too > slow. > A lot of dynamic programming is used to optimize string > comparisons/matching. > > This problem of assembling genomes was first solved by the Human Genome > Project in 1998. > We are working on a similar problem, but with a twist: input data will > have 20% error rates (very high). > ie if a letter in the genome is an A, > there is a 20% probability that the letter will appear as a C,G, or T > in the sample. > => Our algorithms must be robust to handle high error data, > yet fine tuned enough to detect matches between two mutated samples. > > Question: what's is the length of the genome strings: The size of the single string and the total size.and give me the more samples? > ***** Each of the genome samples (strings) is of size 100,000. The human genome is of size 3,000,000,000. > We have a total of 300,000 strings. This is 10x coverage, ie 3,000,000,000 * 10 / 100,000 = 300,000. > The idea is that for each position in the genome, if we take 300,000 random indepedent samples, > each position will be covered by 20 strings, 10 to its left and 10 to its right. > > Question: what is the typical values of: > 1. how many samples do we have, for a single gnome ? > 2. how long is each sample? > ****The project is for a "hypothetical" new technology that is able to sequence long, high-error samples. > We create simulated samples of length approx. 100k letters. > The human genome is of size 3,000,000,000. We have a total of 300,000 strings. > This is 10x coverage, ie 3,000,000,000 * 10 / 100,000 = 300,000. > The idea is that if we take 300,000 random indepedent samples, > each position will on average be covered by 20 strings, 10 to its left and 10 to its right. > > The strings have 20-30% point mutation error rates. > ie each letter in each sample has a prob 20-30% of being mutated to another letter. > Also have small mutation % for insertions/deletions in the sample. > The challenge is to detect when two samples come from the same region in the genome (match/overlap). > Need to avoid false positive matches, while also not filtering out all true positives. > > >> Followed by:
Post your reply here: |
All Rights Reserved 2003-2013 Ying Fuchen,Xu Pengcheng,Xie Di
Any problem, Please Contact Administrator