BioLiP inquiry
Posted: Fri Feb 10, 2023 6:36 pm
Hello Zhang,
I once again turn to you for some clarification.
I have been working with the dataset and noticed a discrepancy. The main file "BioLiP.txt.gz" has more than 800,000 entries (redundant set), with all kinds of ligands included.
When downloading the weekly updates (redundant set), I get 586,605 ligands. Do you see the same number on your side, or is it an error in my download?
The Perl script seems to complete successfully.
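One way to check the two numbers against each other is to count, in the TSV file, both the rows and the distinct ligands. The sketch below is only an illustration under stated assumptions: the function name is hypothetical, the file is assumed to be tab-separated, and a ligand is assumed to be identified by PDB ID, ligand ID, ligand chain and ligand serial number (columns 1, 5, 6 and 7), which are the fields used in the ligand file names. Whether this accounts for the difference is not known.

```python
import gzip
from typing import Iterable, Tuple


def count_rows_and_ligands(lines: Iterable[str]) -> Tuple[int, int]:
    """Return (number of rows, number of distinct ligands) for BioLiP TSV lines.

    Assumption: columns 1, 5, 6 and 7 (PDB ID, ligand ID, ligand chain,
    ligand serial number) together identify one ligand.
    """
    rows = 0
    ligands = set()
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 7:
            continue  # skip blank or malformed lines
        rows += 1
        ligands.add((cols[0], cols[4], cols[5], cols[6]))
    return rows, len(ligands)


if __name__ == "__main__":
    # Assumes BioLiP.txt.gz has been downloaded to the working directory.
    with gzip.open("BioLiP.txt.gz", "rt") as handle:
        print(count_rows_and_ligands(handle))
```

If the second number comes out close to the number of downloaded ligand files, the difference would simply reflect ligands that appear in more than one row.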
Best,
Christos
On Thu, 9 Feb 2023 at 01:06, Christos Feidakis, M.Sc. <christos.feidakis@natur.cuni.cz> wrote:
Hello Zhang,
Great, thank you a lot for all your help!
About the sequence number implementation, I understand it would take some work. I will keep using the separate files for now.
For single-entity ligands (the majority, about 400,000 in BioLiP right now, I think), that is, excluding DNA, RNA and peptides, each has a single integer identifier that specifies it explicitly. Even the rest might be worth the trouble of including, as it could save many gigabytes of traffic overall.
If you decide to go for it and need any help or feedback on this, please feel free to let me know, I would be happy to help if I can.
Best,
Christos
On Thu, 9 Feb 2023 at 00:48, Chengxin Zhang <zcx@umich.edu> wrote:
Hello Christos,
We no longer separate the old set from the weekly sets. All tar datasets are updated weekly. Sorry for the typo. I have removed the wrong link in download_all_sets.pl.
I will consider implementing the residue sequence number for ligands, but it will take a while, as it will mean changing a lot of underlying code.
Chengxin Zhang, PhD
Howard Hughes Medical Institute,
Department of Molecular, Cellular and Developmental Biology, Yale University
On Wed, Feb 8, 2023 at 6:40 PM Christos Feidakis, M.Sc. <christos.feidakis@natur.cuni.cz> wrote:
Hello Zhang,
Thank you for the clarification, and once again for pointing me in the right direction.
So to have all the up-to-date ligands that are present in the "large TSV file" (BioLiP.txt.gz), should the weekly updates (https://zhanggroup.org/BioLiP/weekly.html) cover it?
I am asking as I saw a link with reference to the "old set" in the included Perl script (download_all_sets.pl), and I remember using this feature in the past, but I cannot find the ligand links for the "old set" at "https://zhanggroup.org/BioLiP2/download.html".
Is there any plan to include the sequence number for ligands into the main file at some point (BioLiP.txt.gz), as an extra column for example?
Personally, I would find it very useful: it would be much easier and faster to parse, and there would be no need (in this case) to download the additional ligand files (around 600,000 ligands, I think). Just a suggestion.
Best,
Christos
On Wed, 8 Feb 2023 at 22:49, Chengxin Zhang <zcx@umich.edu> wrote:
Hello Christos,
"first ligand in the chain" denotes the first mentioned ligand in the mmCIF file belonging to the given chain (regardless of protein residue in the ligand binding site).
Yes, as explained in the previous email, 4c20_MN_A_1.pdb can be downloaded from BioLiP.
In an individual webpage such as https://zhanggroup.org/BioLiP/pdb.cgi?p ... =A&bs=BS01, you can download it from "[Download ligand structure]".
To batch download all ligand PDB files, you need to go to https://zhanggroup.org/BioLiP/weekly.html. For example, 4c20_MN_A_1.pdb is available at https://zhanggroup.org/BioLiP/weekly/ligand_c2.tar.bz2.
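As a rough sketch of how a single ligand file could be read from one of these archives after download, the Python snippet below uses only the standard library. The function name is hypothetical, and because the directory layout inside the archive is not described here, members are matched by base name only.

```python
import os
import tarfile
from typing import Optional


def read_ligand_from_archive(archive_path: str, ligand_file: str) -> Optional[str]:
    """Return the text of `ligand_file` from a .tar.bz2 archive, or None if absent.

    Members are matched by base name because the internal folder layout of the
    archive is an unknown here.
    """
    with tarfile.open(archive_path, "r:bz2") as archive:
        for member in archive:
            if member.isfile() and os.path.basename(member.name) == ligand_file:
                handle = archive.extractfile(member)
                if handle is not None:
                    return handle.read().decode("ascii", errors="replace")
    return None
```

For example, `read_ligand_from_archive("ligand_c2.tar.bz2", "4c20_MN_A_1.pdb")` would return the contents of that file once the archive has been downloaded locally.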
Chengxin Zhang, PhD
Howard Hughes Medical Institute,
Department of Molecular, Cellular and Developmental Biology, Yale University
On Wed, Feb 8, 2023 at 4:23 PM Christos Feidakis, M.Sc. <christos.feidakis@natur.cuni.cz> wrote:
Dear Zhang,
Thank you very much for your email. I have indeed downloaded the file that you specify in your email and it is the file that I have been working with.
I am trying to understand the process that you describe for figuring out which is the actual ligand mentioned in a given line in BioLiP, but I am not certain that I do.
For example, would the "first ligand in the chain" denote the first mentioned ligand in the mmCIF file belonging to the given chain, or the ligand binding the most upstream in the residue sequence of that chain?
Either way, parsing the structure (the mmCIF file) seems to be necessary in order to find out which is the ligand mentioned. This can get computationally expensive at scale, so I assume I am missing something.
I may not have explained this properly in my email, and I apologize. By residue number I refer to the PDB-assigned sequence identifier, which in the mmCIF files is designated by the label "_atom_site.auth_seq_id".
https://mmcif.wwpdb.org/dictionaries/mm ... eq_id.html
I understand this label is necessary for depositing structures in the PDB and, to my knowledge, is present for all ligands (in addition to receptor residues) as well as water molecules in the mmCIF structure files. I have yet to come across a ligand without an index position in a PDB or mmCIF file. This way, with the chain and this sequence identifier (and also the ligand name), you can specify a ligand unambiguously, without parsing the structure.
In the previous version of BioLiP, this "seq. id" or "residue number" was not mentioned directly in the BioLiP TSV files, but it was available for all ligands when downloading them from BioLiP. I am providing an example in the attachment, which was downloaded from your webpage in the previous version of BioLiP. This PDB file was basically a stripped-down version of the complete PDB structure file, containing only the atoms of the given ligand. In there, it was possible to find the "residue number" or "seq. id", and thus the relevant ligand, unambiguously, without parsing the structure and performing search operations. I cannot seem to find such a file in the new version of BioLiP.
Is it possible that I have missed it or is it just not there anymore?
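To illustrate what I mean, here is a minimal Python sketch of reading the residue number from such a stripped-down ligand PDB file. It relies on the fixed columns of the PDB format (residue name in columns 18-20, chain in column 22, residue number in columns 23-26, insertion code in column 27). The function name is hypothetical, and the sketch assumes the ligand file keeps the residue numbering of the deposited structure, as the older files did.

```python
from typing import Iterable, Set, Tuple


def ligand_residue_keys(pdb_lines: Iterable[str]) -> Set[Tuple[str, str, int, str]]:
    """Collect (residue name, chain, residue number, insertion code) from a PDB file.

    Only ATOM/HETATM records are read; the fields come from the fixed-width
    columns of the PDB format.
    """
    keys = set()
    for line in pdb_lines:
        if not line.startswith(("HETATM", "ATOM  ")) or len(line) < 26:
            continue
        res_name = line[17:20].strip()   # columns 18-20
        chain_id = line[21]              # column 22
        res_seq = int(line[22:26])       # columns 23-26
        i_code = line[26:27].strip()     # column 27, usually blank
        keys.add((res_name, chain_id, res_seq, i_code))
    return keys
```

For a single-entity ligand the returned set should contain one key; for the heme of 101m that would be residue name HEM, chain A, residue number 155.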
Best regards,
Christos
On Wed, 8 Feb 2023 at 21:11, Chengxin Zhang <zcx@umich.edu> wrote:
Hello Christos,
For the "large TSV file that includes all protein-ligand interactions", you can find it as BioLiP.txt.gz at https://zhanggroup.org/BioLiP/download.html.
If you want the TSV file for the current webpage you are browsing, such as https://zhanggroup.org/BioLiP/qsearch.cgi, you can find it at the top of the page with the sentence "Download all results in tab-seperated text for xxx receptor-ligand interactions", just as in the old BioLiP browsing webpage.
We do not keep the "residue number" of the ligand in the table, in either the old or the new BioLiP. The technical reason is that many ligands are not assigned a residue number in the mmCIF file.
To distinguish ligands of the same type in the same chain, the TSV file provides the ligand serial number in column 7. For example, PDB 117e chain A (https://zhanggroup.org/BioLiP/qsearch.c ... 7e&chain=A) has three MN and two PO4. The first 7 columns of the corresponding TSV read:
117e A 2.15 BS01 MN A 1
117e A 2.15 BS02 MN A 2
117e A 2.15 BS03 MN A 3
117e A 2.15 BS04 PO4 A 1
117e A 2.15 BS05 PO4 A 2
For example, column 7 of the last row reads "2" because it is the second PO4 in the chain.
The corresponding ligand file name is 117e_PO4_A_2.pdb, whether you use "[Download ligand structure]" on its own page (https://zhanggroup.org/BioLiP/pdb.cgi?p ... =A&bs=BS05) or the batch download at https://zhanggroup.org/BioLiP/weekly.html.
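The naming rule above can be sketched in Python as follows. This is only an illustration: the function name is hypothetical, and the column positions (PDB ID in column 1, ligand ID in column 5, ligand chain in column 6, ligand serial number in column 7) are taken from the example rows in this message.

```python
def ligand_filename(row: str) -> str:
    """Build the ligand PDB file name from the first 7 columns of a BioLiP row.

    Tab-separated rows are split on tabs; rows copied as plain text (as in this
    message) are split on whitespace instead.
    """
    stripped = row.rstrip("\n")
    cols = stripped.split("\t") if "\t" in stripped else stripped.split()
    if len(cols) < 7:
        raise ValueError("expected at least 7 columns")
    pdb_id, ligand_id, ligand_chain, serial = cols[0], cols[4], cols[5], cols[6]
    return f"{pdb_id}_{ligand_id}_{ligand_chain}_{serial}.pdb"


print(ligand_filename("117e A 2.15 BS05 PO4 A 2"))  # 117e_PO4_A_2.pdb
```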
Chengxin Zhang, PhD
Howard Hughes Medical Institute,
Department of Molecular, Cellular and Developmental Biology, Yale University
From: Christos Feidakis, M.Sc. <christos.feidakis@natur.cuni.cz>
Date: Wednesday, February 8, 2023 at 22:24
To: yangzhanglab@umich.edu <yangzhanglab@umich.edu>
Subject: BioLiP inquiry
Dear Yang Zhang lab,
I have been using BioLiP in my research and I would like to thank you for providing such a useful resource to the community. I am studying protein-ligand interactions and it is very important to be able to distinguish between relevant and non-relevant ligands.
I am writing because I recently tried to access BioLiP for ligand information and noticed that the previous Perl script now seems to be obsolete. Instead, I can access the interactions through the webpage, which is very intuitive. While I can download the large TSV file that includes all protein-ligand interactions, I cannot find in it, or in another file, the index of each ligand as it is stated in the PDB. To be clear, I am referring to the "residue number" that ligands have in the PDB file of the deposited structure, just like residues do. This number represents the position of the ligand and, when stated together with the ligand's chain, can be used to directly specify the relevant ligand in the structure without relying on the binding residues (which are present in the file and can be shared by more than one ligand). These positions were previously available when downloading the ligands separately, but I cannot locate these files now. Could you please advise me on where to look for them, if they are still available?
I am quoting the first entry in the protein-ligand file of BioLiP as an example:
101m A 2.07 BS01 HEM A 1 ....
The residue number for this heme in the respective mmCIF file is 155 and it is assigned to chain A. While the ligand chain is available in the main file (6th column), the residue position is not. In this example, this is the only heme ligand in the structure. However, in other cases (e.g., metal ions or other small ligands), there can be multiple ligands binding the same chain.
Thank you very much for your time.
Best regards,
Christos Feidakis