Skip to content

[C4GT Community]: Support full GEO based downloads #229

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
5 tasks
saketkc opened this issue Apr 7, 2025 · 10 comments · May be fixed by #231 or #232
Open
5 tasks

[C4GT Community]: Support full GEO based downloads #229

saketkc opened this issue Apr 7, 2025 · 10 comments · May be fixed by #231 or #232

Comments

@saketkc
Copy link
Owner

saketkc commented Apr 7, 2025

Description

Currently, pysradb primarily focuses on fetching metadata and data from SRA. However, many GEO datasets are linked to SRA, and users often require access to GEO-specific files, especially GEO Matrix files, which contain processed expression data.

This feature request aims to extend pysradb to:

  1. Fetch GEO supplementary files, specifically identifying and downloading GEO Matrix files (typically .txt or .gz format).
  2. Provide an option to convert GEO Matrix files to a clean .tsv format, making it easier for users to load and analyze them.

Why this is useful:
Many GEO users want quick access to processed expression data. Integrating GEO matrix file support would make pysradb a one-stop tool for both raw and processed data, improving its utility for transcriptomics, genomics, and bioinformatics users.

Additional Context:


Goals

Goals

  • Add functionality to identify and download GEO Matrix files given a GEO accession (e.g., GSEXXXXXX).

  • Implement a parser that reads the downloaded matrix file and outputs it as a .tsv.

  • Handle common GEO Matrix file quirks (such as metadata headers or comments starting with !).

  • Update the documentation with examples.

  • Provide CLI flags/subcommands, e.g.,

pysradb geo-matrix --accession GSE12345 --to-tsv

Bonus (Optional):

  • Allow users to selectively download only the matrix file (and not other supplementary files).
  • Support both compressed (.gz) and uncompressed formats.
  • Add basic tests for downloading and parsing functionality.

Expected Outcome

The final module will allow a user to download the full GEO record with the matrix file parsed as a dataframe.

Acceptance Criteria

No response

Implementation Details

There is already some GEO support available, you will extend this class to add support for GEO based downloads

Mockups/Wireframes

No response

Product Name

pysradb

Organisation Name

C4GT

Domain

No response

Tech Skills Needed

Python

Organizational Mentor

Saket Choudhary

Angel Mentor

No response

Complexity

Medium

Category

Research

@jainrishi601
Copy link

Hi Saket Sir, I came from C4GT and would like to contribute to this feature request for pysradb. I'm interested in extending the GEO support to download and parse GEO Matrix files, and I'm ready to work on adding CLI flags, handling file quirks, and implementing conversion to a clean .tsv format. Please assign this issue to me.

@Piyush0000
Copy link

Piyush0000 commented Apr 12, 2025

hi @saketkc please assign me this issue, I can work on that as i have very good skills in python and machine learning

@hea7hen
Copy link

hea7hen commented Apr 12, 2025

Hi @saketkc, I’d love to contribute to this feature under C4GT. I have experience with Python, bioinformatics file parsing, and CLI tooling. I can implement GEO matrix file download, parsing to .tsv, CLI support, and basic tests. Could you please assign the issue to me?

@Urvashi2409
Copy link

Hi Saket,

I’ve thoroughly reviewed the GitHub repository and examined the codebase related to this project. I believe that adding support for GEO Matrix files, as outlined in the feature request, would be a valuable and impactful enhancement.

I'm confident in my ability to contribute to this feature and would love the opportunity to work on it. I’ve also drafted an initial approach for the implementation and would be happy to discuss it further.

Could you please assign this issue to me?

Best regards,
Urvashi Anand

@Parmarthcse
Copy link

Hi @saketkc, I'm from C4GT and would like to contribute to the GEO Matrix file support feature in pysradb. I'm ready to handle CLI flags, manage file quirks, and implement conversion to a clean .tsv format. Kindly assign this issue to me.

[email protected]

@NavinkumarD
Copy link

NavinkumarD commented Apr 19, 2025

Hello @saketkc,

I’ve successfully implemented the functionality to identify and download GEO Matrix files based on GEO accession numbers (e.g., GSE10072) and convert them into .tsv format. The implementation includes:

-Dynamic GEO Matrix File Downloading: Constructs URLs and downloads .gz files.

  • File Handling: Decompresses .gz files into .txt format, ensuring clean workspace management.
  • Parsing to .tsv Format: Handles quirks like skipping metadata rows prefixed by !, and outputs clean tab-separated files for easy analysis.
  • CLI Integration: Adds CLI subcommands (geo-matrix) for downloading and parsing GEO Matrix files, with options like --accession and --to-tsv.

I have thoroughly tested the script using the GEO accession GSE10072, and it performs as intended, producing accurate .tsv output files.

I’m now ready to submit my contribution and would appreciate any feedback or suggestions. Let me know if further refinements are required.

**Python Script of my project: **
import os
import requests
import gzip
import csv
import argparse

def download_geo_matrix(accession):
"""
Download and decompress the GEO Matrix file from the NCBI GEO repository.

Parameters:
accession (str): GEO accession number (e.g., 'GSE10072').

Returns:
str: Path to the decompressed GEO Matrix file.
"""
base_url = f"https://ftp.ncbi.nlm.nih.gov/geo/series/{accession[:5]}nnn/{accession}/matrix/"
file_name = f"{accession}_series_matrix.txt.gz"
file_url = base_url + file_name

print(f"Attempting to download from: {file_url}")

try:
    response = requests.get(file_url, stream=True)
    if response.status_code == 200:
        with open(file_name, 'wb') as gz_file:
            gz_file.write(response.content)
        print(f"Downloaded: {file_name}")

        decompressed_file = file_name.replace('.gz', '')
        with gzip.open(file_name, 'rb') as compressed_file, open(decompressed_file, 'wb') as output_file:
            output_file.write(compressed_file.read())
        print(f"Decompressed: {decompressed_file}")

        os.remove(file_name)
        print(f"Cleaned up: {file_name}")
        return decompressed_file
    else:
        print(f"Failed to download. Status code: {response.status_code}")
        return None
except requests.exceptions.RequestException as e:
    print(f"Request failed: {e}")
    return None

def parse_geo_matrix_to_tsv(input_file, output_file):
"""
Parse the GEO Matrix file and convert it into a .tsv file.

Parameters:
input_file (str): Path to the decompressed GEO Matrix file.
output_file (str): Path to the output .tsv file.
"""
try:
    with open(input_file, 'r') as infile, open(output_file, 'w', newline='') as outfile:
        writer = csv.writer(outfile, delimiter='\t')
        for line in infile:
            if not line.startswith('!'):  # Skip metadata lines starting with '!'
                row = line.strip().split('\t')
                writer.writerow(row)
    print(f"Parsed and saved as: {output_file}")
except Exception as e:
    print(f"Error parsing file: {e}")

def main():
"""
Main function to handle CLI commands for downloading and parsing GEO Matrix files.
"""
parser = argparse.ArgumentParser(description="Download and parse GEO Matrix files.")
parser.add_argument('--accession', type=str, required=True, help="GEO accession number (e.g., GSE10072).")
parser.add_argument('--to-tsv', type=str, required=True, help="Output .tsv file path.")
args = parser.parse_args()

accession = args.accession
tsv_file = args.to_tsv

decompressed_file = download_geo_matrix(accession)
if decompressed_file:
    parse_geo_matrix_to_tsv(decompressed_file, tsv_file)

if name == "main":
main()

Here’s what has been accomplished:

  1. Dynamic GEO Matrix File Downloading:

    • The script dynamically constructs URLs based on the GEO accession number provided (--accession).
    • Successfully downloads .gz compressed GEO Matrix files from the GEO repository.
  2. File Decompression:

    • Decompresses the downloaded .gz file into a readable .txt format.
    • Cleans up temporary .gz files to maintain a tidy workspace.
  3. Parsing to .tsv Format:

    • Skips metadata lines prefixed with ! to focus solely on the data table.
    • Converts the .txt file into a clean .tsv format for easy downstream analysis.
  4. CLI Integration:

    • Added a CLI interface with subcommands for downloading and parsing:
      pysradb geo-matrix --accession GSE10072 --to-tsv GSE10072_output.tsv
      
    • Users can specify the GEO accession (--accession) and desired output file path (--to-tsv).
  5. Error Handling:

    • Gracefully manages network issues, incorrect accession numbers, and file parsing errors.

Testing
The script has been tested using the GEO accession GSE10072, and it performs as expected:

Attempting to download from: https://ftp.ncbi.nlm.nih.gov/geo/series/GSE10nnn/GSE10072/matrix/GSE10072_series_matrix.txt.gz
Downloaded: GSE10072_series_matrix.txt.gz
Decompressed: GSE10072_series_matrix.txt
Cleaned up: GSE10072_series_matrix.txt.gz
Parsed and saved as: GSE10072_output.tsv```

**Here's the output I've got in Command prompt:**
![Image](https://github.com/user-attachments/assets/73815f12-2909-4ac6-a761-43affd1ade4a)

I’ve committed the code to the geo-matrix-support branch in my forked repository and am ready to submit a pull request. This contribution aligns with the goals outlined in the feature request, and I believe it will add significant value to the pysradb tool.

Please let me know if further refinements or adjustments are needed, and I’d be happy to collaborate!

Best regards, Navin

@aditi75432
Copy link

Hello @saketkc sir,

I'd love to take up this issue as part of my contribution to the C4GT Community. Here's how I plan to approach it:

🔧 Implementation Plan:

  1. Download GEO Matrix Files:

    • Use the NCBI FTP structure to locate and download matrix files based on the GSE accession.
    • Example: ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSEnnn/GSEXXXX/matrix/.
  2. Support for Compressed & Uncompressed Files:

    • Detect .txt and .txt.gz files and handle accordingly during parsing and conversion.
  3. Parse Matrix to TSV:

    • Implement a parser to:
      • Skip metadata/comment lines starting with !.
      • Parse the actual data matrix into a clean pandas.DataFrame.
      • Output the final file in .tsv format.
  4. CLI Integration:

    • Add a subcommand like:
      pysradb geo-matrix --accession GSE12345 --to-tsv
      
    • Optionally add flags like --only-matrix to avoid downloading other supplementary files.
  5. Documentation & Testing:

    • Include usage examples and edge case handling in the docs.
    • Add basic tests for both download and parsing functionalities.

🧪 Test Plan:

Would love your input on this approach! Let me know if there are any constraints or preferences I should keep in mind.

Thanks!

@tanishra
Copy link

Hi @saketkc,

I'm excited to express my interest in contributing to the pysradb project, particularly in extending its capabilities to support GEO Matrix file downloads and parsing.
With a strong foundation in Python, data parsing, and experience working with bioinformatics datasets, I’m eager to help make pysradb an even more powerful and user-friendly tool for the research community.

I’m particularly enthusiastic about the opportunity to simplify access to processed expression data for GEO users — transforming matrix files into clean, analysis-ready TSV formats. I’m confident that my skills align well with the goals of this enhancement, from developing efficient parsers to implementing intuitive CLI extensions, and ensuring robust documentation and testing.

I'm fully committed to delivering a seamless experience that will help researchers spend less time on data wrangling and more time on scientific discovery. I'm looking forward to learning from the team and contributing meaningfully to pysradb's growth!

Best regards,
Tanish Rajput

@n14rishitha
Copy link

Hi @saket Choudhary,

I’d like to work on extending pysradb to support GEO Matrix file downloads and parsing. Here’s how I’ll approach it:

GEO Matrix Download

  • Fetch matrix files from GEO FTP given an accession
  • Optionally support selective downloads (only matrix files, not all supplementary files).

TSV Conversion

  • Parse matrix files (handling ! metadata headers) and output clean .tsv.
  • Support both .gz and uncompressed formats.

CLI & Documentation

  • Add a geo-matrix subcommand (e.g., pysradb geo-matrix --accession GSE12345 --to-tsv).
  • Document with usage examples.

Testing

  • Add initial tests for downloading and parsing.
  • Make it backward compatible.

Can you pls assign this task to me? I will keep the implementation light and well-documented.

@Moses-Mk
Copy link

Moses-Mk commented May 3, 2025

Hello @saketkc sir,

I have worked on this issue and solved it. Please check out my PR #234 for any modifications so that I can rework on it. Thanks

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment