Preface
I recently stumbled upon the InterPro database and found it quite useful.
Having previously scraped AlphaFold data with Selenium, I wanted to grab InterPro's domain data the same way.
It turns out the official site already provides download code — how considerate.
For batch downloads, though, the official code still needs some tweaking.
Step 1: Get the list of proteins you want to download
First, search for the proteins you need on the Browse - InterPro (ebi.ac.uk) page:
1. Select reviewed proteins (unreviewed entries are generally lower quality, but you can use them if you want)
2. Select the target species
3. Enter a keyword for the protein
4. Click the Export button
5. Click the Generate button
Once generation finishes, the button changes to Download; click it to download.
The downloaded file:
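The exported file is a tab-separated table with a header row; the download script below only relies on its first column holding the UniProt accession. As a quick sanity check, a small helper (my addition, not part of the official code) can list the accessions it contains:

```python
# Peek at the exported list: skip the header row, collect the first column.
# Assumes a tab-separated file whose first column is the UniProt accession.
def read_accessions(path):
    with open(path) as f:
        f.readline()  # discard the header row
        return [line.split("\t")[0] for line in f if line.strip()]
```

If the accessions printed here look wrong, check which column the export actually put them in before running the full download.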
Step 2: Download the domain data for each protein in the list
The code provided by the official site can only download the domain TSV for a single protein.
Official code: Results - InterPro (ebi.ac.uk)
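The official script queries one per-protein API endpoint; all the batch version has to change is the accession at the end of the URL. A sketch of the URL construction (the helper name is mine, but the URL is exactly the one the script below uses):

```python
def domain_url(accession, page_size=200):
    # Per-protein InterPro API endpoint (InterPro entries on reviewed proteins)
    return (f"https://www.ebi.ac.uk:443/interpro/api/entry/InterPro/"
            f"protein/reviewed/{accession}/?page_size={page_size}")
```

For example, `domain_url("A0A024R1R8")` reproduces the example URL from the official code.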
Let's tweak it so that it reads the list downloaded in step 1 and then downloads each protein's domain information into its own file:
(Run requirements: rename protein-sequences.tsv to export.tsv and place it in the same directory as the code below; also create a folder named domain in that directory to hold the output files.)
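The setup above amounts to the following (file names are the ones described in this post):

```shell
# Rename the export (if not done yet) and create the output folder
[ -f protein-sequences.tsv ] && mv protein-sequences.tsv export.tsv
mkdir -p domain
```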
```python
'''
Adapted from the code on the InterPro website.
Reads the InterPro search result export.tsv and downloads the domain
information for every protein it lists.
'''
# standard library modules
import sys, errno, re, json, ssl, os
from urllib import request
from urllib.error import HTTPError
from time import sleep

# BASE_URL = "https://www.ebi.ac.uk:443/interpro/api/entry/InterPro/protein/reviewed/A0A024R1R8/?page_size=200"


def parse_items(items):
    if type(items) == list:
        return ",".join(items)
    return ""


def parse_member_databases(dbs):
    if type(dbs) == dict:
        return ";".join([f"{db}:{','.join(dbs[db])}" for db in dbs.keys()])
    return ""


def parse_go_terms(gos):
    if type(gos) == list:
        return ",".join([go["identifier"] for go in gos])
    return ""


def parse_locations(locations):
    if type(locations) == list:
        return ",".join(
            [",".join([f"{fragment['start']}..{fragment['end']}"
                       for fragment in location["fragments"]])
             for location in locations]
        )
    return ""


def parse_group_column(values, selector):
    return ",".join([parse_column(value, selector) for value in values])


def parse_column(value, selector):
    if value is None:
        return ""
    elif "member_databases" in selector:
        return parse_member_databases(value)
    elif "go_terms" in selector:
        return parse_go_terms(value)
    elif "children" in selector:
        return parse_items(value)
    elif "locations" in selector:
        return parse_locations(value)
    return str(value)


def download_to_file(url, file_path):
    # disable SSL verification to avoid config issues
    context = ssl._create_unverified_context()
    # start from an empty file; result pages are appended below
    # (opening with "w+" inside the loop would overwrite earlier pages of a paginated result)
    open(file_path, "w").close()
    next = url
    last_page = False
    attempts = 0
    while next:
        try:
            req = request.Request(next, headers={"Accept": "application/json"})
            res = request.urlopen(req, context=context)
            # If the API times out due to a long-running query
            if res.status == 408:
                # wait just over a minute
                sleep(61)
                # then continue this loop with the same URL
                continue
            elif res.status == 204:
                # no data, so leave the loop
                break
            payload = json.loads(res.read().decode())
            next = payload["next"]
            attempts = 0
            if not next:
                last_page = True
        except HTTPError as e:
            if e.code == 408:
                sleep(61)
                continue
            else:
                # On any other HTTP error, retry 3 times before failing
                if attempts < 3:
                    attempts += 1
                    sleep(61)
                    continue
                else:
                    sys.stderr.write("LAST URL: " + next)
                    raise e

        with open(file_path, "a") as f:
            for i, item in enumerate(payload["results"]):
                f.write(parse_column(item["metadata"]["accession"], 'metadata.accession') + "\t")
                f.write(parse_column(item["metadata"]["name"], 'metadata.name') + "\t")
                f.write(parse_column(item["metadata"]["source_database"], 'metadata.source_database') + "\t")
                f.write(parse_column(item["metadata"]["type"], 'metadata.type') + "\t")
                f.write(parse_column(item["metadata"]["integrated"], 'metadata.integrated') + "\t")
                f.write(parse_column(item["metadata"]["member_databases"], 'metadata.member_databases') + "\t")
                f.write(parse_column(item["metadata"]["go_terms"], 'metadata.go_terms') + "\t")
                f.write(parse_column(item["proteins"][0]["accession"], 'proteins[0].accession') + "\t")
                f.write(parse_column(item["proteins"][0]["protein_length"], 'proteins[0].protein_length') + "\t")
                f.write(parse_column(item["proteins"][0]["entry_protein_locations"], 'proteins[0].entry_protein_locations') + "\t")
                f.write("\n")

        # Don't overload the server, give it time before asking for more
        sleep(1)


with open("export.tsv") as f:
    # discard the header line
    line = f.readline()
    line = f.readline()
    cnt = 0
    while line:
        cnt += 1
        print(cnt)
        protein_id = line.split("\t")[0]
        url = f"https://www.ebi.ac.uk:443/interpro/api/entry/InterPro/protein/reviewed/{protein_id}/?page_size=200"
        download_to_file(url, os.path.join('domain', protein_id + '.tsv'))
        line = f.readline()
```
Run it and the download starts.
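For a long protein list, it helps to make re-runs resumable. A small guard (my addition, not in the original script) skips accessions that already have a non-empty output file:

```python
import os

def needs_download(protein_id, out_dir="domain"):
    # True if the protein has no output file yet, or only an empty one
    path = os.path.join(out_dir, protein_id + ".tsv")
    return not (os.path.isfile(path) and os.path.getsize(path) > 0)
```

Wrap the `download_to_file` call in the main loop with this check and an interrupted run can simply be restarted.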
The code has been put in a Gitee repository and you are welcome to use it: interpro-domain-downloader: download protein domain data from the InterPro database (gitee.com)