
Commit 91db7f0 - Adds pegi3s/id-mapping

1 parent 7fd585e commit 91db7f0

File tree

11 files changed: +304 -0 lines changed


docs/index.html

Lines changed: 3 additions & 0 deletions
```diff
@@ -243,6 +243,9 @@ <h5>Programs:</h5>
   <li><a href="https://hub.docker.com/r/pegi3s/hyphy/" target="_blank"><b>hyphy</b></a>
     <a href="http://hyphy.org/tutorials/CL-prompt-tutorial/" target="_blank">[doc]</a> - Phylogenetics inferences
   </li>
+  <li><a href="https://hub.docker.com/r/pegi3s/id-mapping" target="_blank"><b>id-mapping</b></a>
+    <a href="https://hub.docker.com/r/pegi3s/id-mapping" target="_blank">[doc]</a> - ID mapping
+  </li>
   <li><a href="https://hub.docker.com/r/pegi3s/igv/" target="_blank"><b>igv</b></a>
     <a href="https://software.broadinstitute.org/software/igv/UserGuide" target="_blank">[doc]</a> - Genomics viewer
   </li>
```

id_mapping/.vscode/tasks.json

Lines changed: 31 additions & 0 deletions
```jsonc
{
    // See https://go.microsoft.com/fwlink/?LinkId=733558
    // for the documentation about the tasks.json format
    "version": "2.0.0",
    "tasks": [
        {
            "label": "build docker",
            "type": "shell",
            "command": "CURRENT_VERSION=$(cat current.version) && docker build ./ -t pegi3s/id-mapping:${CURRENT_VERSION} --build-arg version=${CURRENT_VERSION} && docker tag pegi3s/id-mapping:${CURRENT_VERSION} pegi3s/id-mapping:latest",
            "problemMatcher": []
        },
        {
            "label": "id-mapping 1 [without cache]",
            "type": "shell",
            "command": "rm -f test.tsv && docker run --rm -v $(pwd):/data -w /data pegi3s/id-mapping map-ids --from-db UniProtKB_AC-ID --to-db Gene_Name --input test_data/ids.txt --batch-size 2 --output test.tsv && bat test.tsv",
            "problemMatcher": []
        },
        {
            "label": "id-mapping 2 [with cache]",
            "type": "shell",
            "command": "rm -f test.tsv && docker run --rm -v $(pwd):/data -w /data pegi3s/id-mapping map-ids --from-db UniProtKB_AC-ID --to-db Gene_Name --input test_data/ids.txt --batch-size 2 --output test.tsv --cache-dir tmp_cache && bat test.tsv",
            "problemMatcher": []
        },
        {
            "label": "list-from-dbs",
            "type": "shell",
            "command": "docker run --rm pegi3s/id-mapping list-from-dbs",
            "problemMatcher": []
        }
    ]
}
```

id_mapping/BUILD.md

Lines changed: 11 additions & 0 deletions
# Building instructions

Run:

```bash
CURRENT_VERSION=$(cat current.version) && docker build ./ -t pegi3s/id-mapping:${CURRENT_VERSION} --build-arg version=${CURRENT_VERSION} && docker tag pegi3s/id-mapping:${CURRENT_VERSION} pegi3s/id-mapping:latest
```

# Build log

- 1.0.0 - 28/07/2023 - Hugo López Fernández

id_mapping/Dockerfile

Lines changed: 32 additions & 0 deletions
```dockerfile
#
# Copyright 2018-2023 Hugo López-Fernández, Pedro M. Ferreira, Miguel
# Reboiro-Jato, Cristina P. Vieira, and Jorge Vieira
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

FROM ubuntu:22.04

RUN apt-get update && \
    apt-get install -y python3-pip && \
    pip install -Iv unipressed==1.2.0

ARG version

ENV VERSION=${version}

ADD scripts /opt/scripts

RUN chmod u+x /opt/scripts/*

ENV PATH=/opt/scripts/:${PATH}
```

id_mapping/README.md

Lines changed: 65 additions & 0 deletions
# This image belongs to a larger project called Bioinformatics Docker Images Project (http://pegi3s.github.io/dockerfiles)

# ID mapping

The `pegi3s/id-mapping` Docker image allows mapping identifiers using the [UniProt server](https://www.uniprot.org/id-mapping/) through the [Unipressed](https://github.com/multimeric/Unipressed) API client.

The main script is `map-ids`, so you should adapt and run the following command:

```sh
docker run --rm -v /your/data/dir:/data -w /data pegi3s/id-mapping map-ids --from-db <FROM_DB> --to-db <TO_DB> --input input.txt --output output.tsv
```

In this command, you should replace:
- `/your/data/dir` with the path of the directory that contains the file you want to process.
- `input.txt` with the actual name of your input TXT file with the identifiers to map (one per line).
- `output.tsv` with the actual name of your output TSV file.
- `<FROM_DB>` with the actual name of the source database of the input identifiers.
- `<TO_DB>` with the actual name of the destination database.

The valid names for `<FROM_DB>` and `<TO_DB>` can be obtained with `docker run --rm pegi3s/id-mapping list-from-dbs` and `docker run --rm pegi3s/id-mapping list-to-dbs`, respectively.

The script help can be obtained with `docker run --rm pegi3s/id-mapping map-ids -h`.

Advanced script options are described in the next subsections.

## Cache

To avoid running the same mapping queries over and over, you can enable a cache mechanism with the `--cache-dir <cache_dir_name>` parameter. The script then maintains a cache of previous queries for each combination of `<FROM_DB>` and `<TO_DB>`.

## Batch size and delay

By default, the script uses a batch size of 10, meaning that it sends queries with at most ten identifiers to the server. The default delay between queries is 1 second, meaning that the script waits this long before sending a new batch query.

These values can be changed by specifying `--batch-size <BATCH_SIZE> --delay <DELAY_SECONDS>`.

# Test data

To test the `map-ids` script, start by creating a new file named `ids.txt` with the following identifiers:

```
A1L190
A0JP26
A0PK11
```

Then run the following command (change `/your/data/dir` to the actual path to the `ids.txt` file):

```sh
docker run --rm -v /your/data/dir:/data -w /data \
    pegi3s/id-mapping map-ids \
    --from-db UniProtKB_AC-ID \
    --to-db Gene_Name \
    --input ids.txt \
    --output mapping.tsv \
    --cache-dir id_mapping_cache
```

The result will be available in the new `mapping.tsv` file created at `/your/data/dir`.

# Changelog

The `latest` tag always contains the most recent version.

## [1.0.0] - 28/07/2023

- Initial `id-mapping` image containing the `map-ids`, `list-from-dbs` and `list-to-dbs` scripts.
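The batch-and-delay behaviour described in the README can be sketched in plain Python. This is a standalone illustration, not the image's actual script: `fetch_batch` is a hypothetical stand-in for the UniProt query, and the lowercase mapper below exists only so the sketch runs without network access.

```python
import time


def map_in_batches(ids, fetch_batch, batch_size=10, delay=1):
    """Split `ids` into chunks of at most `batch_size`, querying one
    chunk at a time and sleeping `delay` seconds between queries."""
    results = {}
    for start in range(0, len(ids), batch_size):
        batch = ids[start:start + batch_size]
        results.update(fetch_batch(batch))  # one server query per batch
        time.sleep(delay)
    return results


# Fake mapper standing in for the UniProt server:
def fake(batch):
    return {i: i.lower() for i in batch}


# batch_size=2 sends two queries for three IDs, mirroring --batch-size.
mapped = map_in_batches(["A1L190", "A0JP26", "A0PK11"], fake, batch_size=2, delay=0)
```

With `--batch-size 2` the three test identifiers would be sent as one batch of two and one batch of one, with one delay between them.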

id_mapping/current.version

Lines changed: 1 addition & 0 deletions
```
1.0.0
```

id_mapping/scripts/id-mapping.py

Lines changed: 15 additions & 0 deletions
```python
import time

from unipressed import IdMappingClient
from unipressed.id_mapping.types import From, To

# Print the valid source and destination database types.
print(From)
print(To)

# Submit a mapping request and poll until it finishes.
request = IdMappingClient.submit(
    source="UniProtKB_AC-ID", dest="Gene_Name", ids={"A1L190", "A0JP26", "A0PK11"}
)

print(request.get_status())
while request.get_status() != "FINISHED":
    time.sleep(1)
print(list(request.each_result()))
```

id_mapping/scripts/list-from-dbs

Lines changed: 7 additions & 0 deletions
```python
#!/usr/bin/python3

from typing import get_args

from unipressed.id_mapping.types import From

for db in get_args(From):
    print(db)
```

id_mapping/scripts/list-to-dbs

Lines changed: 7 additions & 0 deletions
```python
#!/usr/bin/python3

from typing import get_args

from unipressed.id_mapping.types import To

for db in get_args(To):
    print(db)
```
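Both listing scripts rely on the fact that Unipressed exposes the valid database names as `typing.Literal` types, whose allowed values `typing.get_args` can enumerate. A minimal self-contained sketch of the same pattern, using a hypothetical `Literal` in place of Unipressed's own `From`/`To`:

```python
from typing import Literal, get_args

# Hypothetical stand-in for unipressed.id_mapping.types.From
ExampleDb = Literal["UniProtKB_AC-ID", "Gene_Name", "PDB"]

# get_args unpacks the Literal's allowed values as a tuple of strings,
# which is how list-from-dbs and list-to-dbs print the valid names.
for db in get_args(ExampleDb):
    print(db)
```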

id_mapping/scripts/map-ids

Lines changed: 129 additions & 0 deletions
```python
#!/usr/bin/python3

import argparse
import os
import time
from typing import get_args

from unipressed import IdMappingClient
from unipressed.id_mapping.types import From, To


def validate_db(db_str, valid_dbs, db_type):
    if db_str not in get_args(valid_dbs):
        print(
            f'Error: The specified {db_type} database is not valid. It must be one of: {get_args(valid_dbs)}')
        exit(1)


def map_ids_unipressed(from_db, to_db, ids):
    request = IdMappingClient.submit(source=from_db, dest=to_db, ids=ids)

    while request.get_status() != "FINISHED":
        time.sleep(1)

    return list(request.each_result())


def load_input(input_file):
    if input_file and os.path.isfile(input_file) and os.access(input_file, os.R_OK):
        with open(input_file, "r") as f:
            return [line.strip() for line in f]
    else:
        print("Error: The input file is missing or not readable.")
        exit(1)


def load_cache_and_subset_ids(cache_dir, from_db, to_db, source_ids):
    cached_data = {}
    source_ids_not_cached = source_ids

    if cache_dir and os.path.isdir(cache_dir) and os.access(cache_dir, os.R_OK):
        cache_file = os.path.join(cache_dir, f"cache_{from_db}_{to_db}.tsv")
        if os.path.isfile(cache_file) and os.access(cache_file, os.R_OK):
            with open(cache_file, "r") as f:
                for line in f:
                    key, value = line.strip().split("\t")
                    cached_data[key] = value
            print(f"Loaded data from cache. Size: {len(cached_data)}")

        source_ids_not_cached = [item for item in source_ids if item not in cached_data]

    return source_ids_not_cached, cached_data


def map_ids(ids, from_db, to_db, batch_size, delay):
    total_items = len(ids)
    num_batches = (total_items + batch_size - 1) // batch_size

    mapped_ids = []
    for i in range(num_batches):
        start_idx = i * batch_size
        end_idx = min(start_idx + batch_size, total_items)
        batch_data = ids[start_idx:end_idx]

        print(f"Mapping batch {i+1}")
        mapped_ids.extend(map_ids_unipressed(from_db, to_db, batch_data))

        time.sleep(delay)

    mapped_ids_dict = {}
    for mapping in mapped_ids:
        mapped_ids_dict[mapping['from']] = mapping['to']

    return mapped_ids_dict


def write_mapped_ids(output_file, source_ids, mapped_ids_dict, cached_data):
    with open(output_file, "w") as output:
        for source_id in source_ids:
            if source_id in cached_data:
                output.write(f"{source_id}\t{cached_data[source_id]}\n")
            elif source_id in mapped_ids_dict:
                output.write(f"{source_id}\t{mapped_ids_dict[source_id]}\n")
            else:
                output.write(f"{source_id}\t-\n")


def save_cache(cache_dir, from_db, to_db, mapped_ids_dict):
    os.makedirs(cache_dir, exist_ok=True)
    if os.path.isdir(cache_dir) and os.access(cache_dir, os.W_OK):
        cache_file = os.path.join(cache_dir, f"cache_{from_db}_{to_db}.tsv")
        with open(cache_file, "a") as f:
            for key in mapped_ids_dict:
                f.write(f"{key}\t{mapped_ids_dict[key]}\n")


def main(from_db, to_db, input_file, output_file, batch_size=10, delay=1, cache_dir=""):
    validate_db(from_db, From, 'from')
    validate_db(to_db, To, 'to')

    print(f"Mapping IDs from '{from_db}' to '{to_db}' in batches of {batch_size} with a delay of {delay} second(s).")
    print(f"Cache directory: '{cache_dir}'")
    print(f"Input file: '{input_file}'")
    print(f"Output file: '{output_file}'\n")

    source_ids = load_input(input_file)
    source_ids_not_cached, cached_data = load_cache_and_subset_ids(cache_dir, from_db, to_db, source_ids)
    mapped_ids_dict = map_ids(source_ids_not_cached, from_db, to_db, batch_size, delay)
    write_mapped_ids(output_file, source_ids, mapped_ids_dict, cached_data)

    if cache_dir:
        save_cache(cache_dir, from_db, to_db, mapped_ids_dict)


if __name__ == "__main__":
    print('Script version:', os.getenv('VERSION', 'NA'))
    parser = argparse.ArgumentParser(description="Converts identifiers using the UniProt ID mapping server.")

    parser.add_argument("--from-db", type=str, help="Source database.", required=True)
    parser.add_argument("--to-db", type=str, help="Destination database.", required=True)
    parser.add_argument("--input", type=str, help="Path to the input data file with the source IDs to be converted (one per line).", required=True)
    parser.add_argument("--output", type=str, help="Path to the output file.", required=True)

    parser.add_argument("--batch-size", type=int, default=10, help="Batch size for querying IDs to the UniProt server.")
    parser.add_argument("--delay", type=int, default=1, help="Delay in seconds between batches.")
    parser.add_argument("--cache-dir", type=str, default="", help="Cache directory.")

    args = parser.parse_args()
    main(args.from_db, args.to_db, args.input, args.output, args.batch_size, args.delay, args.cache_dir)
```
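The cache used by `map-ids` is a plain two-column TSV file per database pair (`cache_<FROM_DB>_<TO_DB>.tsv`, one `key<TAB>value` line per mapping, opened in append mode so successive runs extend it). That round-trip can be sketched independently of the script; the helper names and the `GENE1`/`GENE2` values below are illustrative placeholders, not real UniProt results.

```python
import os
import tempfile


def append_cache(cache_file, mappings):
    # One "source_id<TAB>mapped_id" line per entry, appended so that
    # successive runs keep extending the same cache file.
    with open(cache_file, "a") as f:
        for key, value in mappings.items():
            f.write(f"{key}\t{value}\n")


def load_cache(cache_file):
    # Read the TSV back into a dict; a missing file yields an empty cache.
    cached = {}
    if os.path.isfile(cache_file):
        with open(cache_file) as f:
            for line in f:
                key, value = line.strip().split("\t")
                cached[key] = value
    return cached


# Round-trip in a temporary directory, with placeholder gene names:
tmp = tempfile.mkdtemp()
path = os.path.join(tmp, "cache_UniProtKB_AC-ID_Gene_Name.tsv")
append_cache(path, {"A1L190": "GENE1"})
append_cache(path, {"A0JP26": "GENE2"})
cached = load_cache(path)
```

Because entries are only ever appended, IDs that the server failed to map (written as `-` in the output file) are not cached and will be retried on the next run.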

id_mapping/test_data/ids.txt

Lines changed: 3 additions & 0 deletions
```
A1L190
A0JP26
A0PK11
```
