Detail of message

Online Judge

Problem Set

Authors

Online Contests

User

Web Board
Home Page
F.A.Qs
Statistical Charts

Current Contest
Past Contests
Scheduled Contests
Award Contest

High paid Parttime job :AI/algorithims expert needed for genomic assembly research project

Posted by deanguo at 2006-04-28 07:53:32

HI,
I am looking for an AI/algorithims programmer for doing graduate school research project.
It is very challenging and technically/financially awarding research project.
The project may take 6-9 months based on 20hours/week.
If you are interested, please send email with resume to familyg45@hmails.com
We can discuss more details through email or phone.
thanks
Dean Guo

>> Details of description of projects
The problem I am working on is called genomic assembly.
Basic description of the assembly problem: Given a large string of
letters G,
many smaller strings of letters S_i are independently and randomly
sampled from G.
Each S_i comes from an unknown place in G, we can only compare two S_i
to see if they overlap.
If they overlap, they probably were sampled from adjacent positions in
G.
We compare S_i to each other repeatedly until we form a good guess at
the original genome.
This is like coding an algorithm to solve a jigsaw puzzle.
Difficulty: S_i can be sampled with errors from G, G may contain
regions
that are repetitive,
making false S_i - S_j matches likely.

For example, if you have ACGT and GTCA maybe the pieces fit like this:

ACGT
__GTCA

We repeatedly join together pieces until all of them are in a line like
this:
AAAAAAAAACCCCCCCC
AAACCCCCCCCCCCC
CCCCCCCCCGGG
The basic goal is determine which pieces fit together and which do not.
Need to avoid false matches, while also detecting true matches.
We do not have time to compare every pair directly, it would be too
slow.
A lot of dynamic programming is used to optimize string
comparisons/matching.

This problem of assembling genomes was first solved by the Human Genome
Project in 1998.
We are working on a similar problem, but with a twist: input data will
have 20% error rates (very high).
ie if a letter in the genome is an A,
there is a 20% probability that the letter will appear as a C,G, or T
in the sample.
=> Our algorithms must be robust to handle high error data,
yet fine tuned enough to detect matches between two mutated samples.

Question: what's is the length of the genome strings: The size of the single string and the total size.and give me the more samples?
***** Each of the genome samples (strings) is of size 100,000. The human genome is of size 3,000,000,000.
We have a total of 300,000 strings. This is 10x coverage, ie 3,000,000,000 * 10 / 100,000 = 300,000.
The idea is that for each position in the genome, if we take 300,000 random indepedent samples,
each position will be covered by 20 strings, 10 to its left and 10 to its right.

Question: what is the typical values of:
1. how many samples do we have, for a single gnome ?
2. how long is each sample?
****The project is for a "hypothetical" new technology that is able to sequence long, high-error samples.
We create simulated samples of length approx. 100k letters.
The human genome is of size 3,000,000,000. We have a total of 300,000 strings.
This is 10x coverage, ie 3,000,000,000 * 10 / 100,000 = 300,000.
The idea is that if we take 300,000 random indepedent samples,
each position will on average be covered by 20 strings, 10 to its left and 10 to its right.

The strings have 20-30% point mutation error rates.
ie each letter in each sample has a prob 20-30% of being mutated to another letter.
Also have small mutation % for insertions/deletions in the sample.
The challenge is to detect when two samples come from the same region in the genome (match/overlap).
Need to avoid false positive matches, while also not filtering out all true positives.

Followed by:

Re:High paid Parttime job :AI/algorithims expert needed for genomic assembly research project (3830) deanguo 2006-04-28 07:54:34
- Sorry, my email address is familyg45@gmail.com (46) deanguo 2006-04-28 07:57:02
Correct: My email address is familyg45@gmail.com (0) deanguo 2006-04-28 07:58:23

Post your reply here:

User ID:
Password:
Title:

Content:

> 
> HI,
> I am looking for an AI/algorithims programmer for doing graduate school research project.
> It is very challenging and technically/financially awarding research project.
> The project may take 6-9 months based on  20hours/week.
> If you are interested, please send email with resume to familyg45@hmails.com
> We can discuss more details through email or phone.
> thanks
> Dean Guo
> 
> 
> 
> 
> >> Details of description of projects
> The problem I am working on is called genomic assembly.
> Basic description of the assembly problem: Given a large string of
> letters G,
> many smaller strings of letters S_i are independently and randomly
> sampled from G.
> Each S_i comes from an unknown place in G, we can only compare two S_i
> to see if they overlap.
> If they overlap, they probably were sampled from adjacent positions in
> G.
> We compare S_i to each other repeatedly until we form a good guess at
> the original genome.
> This is like coding an algorithm to solve a jigsaw puzzle.
> Difficulty: S_i can be sampled with errors from G, G may contain
> regions
> that are repetitive,
> making false S_i - S_j matches likely.
> 
> For example, if you have ACGT and GTCA maybe the pieces fit like this:
> 
> ACGT
> __GTCA
> 
> We repeatedly join together pieces until all of them are in a line like
> this:
> AAAAAAAAACCCCCCCC
>               AAACCCCCCCCCCCC
>                                    CCCCCCCCCGGG
> The basic goal is determine which pieces fit together and which do not.
> Need to avoid false matches, while also detecting true matches.
> We do not have time to compare every pair directly, it would be too
> slow.
> A lot of dynamic programming is used to optimize string
> comparisons/matching.
> 
> This problem of assembling genomes was first solved by the Human Genome
> Project in 1998.
> We are working on a similar problem, but with a twist: input data will
> have 20% error rates (very high).
> ie if a letter in the genome is an A,
> there is a 20% probability that the letter will appear as a C,G, or T
> in the sample.
> => Our algorithms must be robust to handle high error data,
> yet fine tuned enough to detect matches between two mutated samples.
> 
> Question:  what's is the length of the genome strings: The size of the single string and the total size.and give me the more samples?
>  ***** Each of the genome samples (strings) is of size 100,000. The human genome is of size 3,000,000,000.
> We have a total of 300,000 strings. This is 10x coverage, ie 3,000,000,000 * 10 / 100,000 = 300,000.
> The idea is that for each position in the genome, if we take 300,000 random indepedent samples,
> each position will be covered by 20 strings, 10 to its left and 10 to its right.
> 
> Question: what is the typical values of:
> 1. how many samples do we have, for a single gnome ?
> 2. how long is each sample?
> ****The project is for a "hypothetical" new technology that is able to sequence long, high-error samples.
> We create simulated samples of length approx. 100k letters.
> The human genome is of size 3,000,000,000. We have a total of 300,000 strings.
> This is 10x coverage, ie 3,000,000,000 * 10 / 100,000 = 300,000.
> The idea is that if we take 300,000 random indepedent samples,
>  each position will on average be covered by 20 strings, 10 to its left and 10 to its right.
> 
> The strings have 20-30% point mutation error rates.
> ie each letter in each sample has a prob 20-30% of being mutated to another letter.
> Also have small mutation % for insertions/deletions in the sample.
> The challenge is to detect when two samples come from the same region in the genome (match/overlap).
>  Need to avoid false positive matches, while also not filtering out all true positives.
> 
> >>

Home Page Go Back To top