DNA Database Searchable: Largest Ever Created

This is a breakdown of Prashant Pandey’s research and the problem it addresses:

The Problem:

* Vast, Untapped Data: The Sequence Read Archive (SRA) contains a massive amount of raw DNA sequencing data (petabytes). This data holds valuable insights, but it’s currently difficult to search effectively.
* Searchability Gap: Assembled genomes (finished DNA sequences) are easily searchable, but the raw, fragmented “reads” within the SRA are not.
* Need for Retrospective Analysis: Scientists often want to know if a newly discovered genetic sequence (like a virus or bacterium) has appeared in previous experiments stored in the SRA. This requires searching through the raw data.
* Transcript Search: The key is being able to search for longer genetic sequences (transcripts) within the millions of short reads in the SRA.

Prashant Pandey’s Solution:

* Building a Search Index: Pandey’s team is developing a system to create an index for the SRA data.
* K-grams & Embeddings: They break the short reads into small subsequences called “K-grams” and then map these into a “high-dimensional embedding”, essentially creating a digital fingerprint for each experiment.
* Digital Fingerprinting: These fingerprints are stored in an index, so a search compares compact fingerprints instead of scanning the raw reads.
* Query Comparison: When a new genetic sequence (transcript) is entered, its fingerprint is generated and compared to the index to quickly identify potential matches within the SRA data.
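
The steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the team's actual system: the text does not say how the embedding is computed, so the sketch swaps in a simple MinHash-style signature as the "fingerprint". The function names (`kgrams`, `fingerprint`, `build_index`, `search`), the values of `k` and `num_hashes`, and the toy reads are all hypothetical.

```python
import hashlib


def kgrams(seq, k):
    """Return the set of all overlapping length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def _hash(kgram, seed):
    """Deterministic 64-bit hash of a k-gram under a given seed."""
    data = f"{seed}:{kgram}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def fingerprint(reads, k=5, num_hashes=64):
    """MinHash-style signature over the k-grams of all reads (an assumed
    stand-in for the embedding described in the text)."""
    grams = set()
    for read in reads:
        grams |= kgrams(read, k)
    if not grams:
        raise ValueError("no read is at least k characters long")
    # For each seed, keep the smallest hash value seen across all k-grams.
    return [min(_hash(g, seed) for g in grams) for seed in range(num_hashes)]


def similarity(fp_a, fp_b):
    """Fraction of matching signature slots (estimates Jaccard similarity)."""
    return sum(a == b for a, b in zip(fp_a, fp_b)) / len(fp_a)


def build_index(experiments, k=5, num_hashes=64):
    """Map each experiment name to the fingerprint of its reads."""
    return {name: fingerprint(reads, k, num_hashes)
            for name, reads in experiments.items()}


def search(index, query, k=5, num_hashes=64):
    """Rank experiments by fingerprint similarity to the query sequence."""
    q = fingerprint([query], k, num_hashes)
    scores = [(name, similarity(q, fp)) for name, fp in index.items()]
    return sorted(scores, key=lambda s: s[1], reverse=True)


# Toy data: two hypothetical experiments, each a list of short reads.
experiments = {
    "run_A": ["ACGTACGG", "TACGGTCA"],
    "run_B": ["TTTTTTTTTT", "GGGGGCCCCC"],
}
index = build_index(experiments)
results = search(index, "ACGTACGGTCA")
print(results[0][0])  # the best-matching experiment name
```

Here the two reads in `run_A` together contain exactly the k-grams of the query, so `run_A` ranks first. A plain MinHash signature estimates overall set similarity; a transcript that makes up only a small part of a large experiment would score low, so a real system at petabyte scale would need a containment-oriented and far more compact structure, in line with the approximate indexing techniques Pandey mentions.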

Key Quotes from Pandey:

* “We have this treasure trove, this amazing and really insightful resource, which is just sitting around. We need the ability to search the raw sequencing data, all of it, at the petabyte scale.”
* “This requires innovating at all the levels of the stack, starting from new approximate indexing techniques, approximate data structures, building systems that can scale out in a distributed environment, hosting the whole thing in the cloud and making it publicly available for anyone to search.”

In essence, Pandey is working to unlock the potential of the SRA by making its vast amount of raw data searchable, enabling scientists to make new discoveries and understand the genetic landscape more comprehensively.
