I’m trying to reproduce your BioLiP2 curation pipeline from your source code in order to integrate it into my workflow. First — thank you for making the scripts and procedures available; the overall pipeline is very helpful.
While reproducing step 3 (Download binding affinity), I hit a reproducible problem that (as far as I can tell) is described in neither the paper nor the README. Below I describe the problem and what I tried. I’d appreciate any guidance or pointers you can share.
Your repo already produces data/chain2uniprot.tsv.gz. The file contains identifiers that look like mmCIF/BioLiP internal IDs (e.g., 101m, 102l, …) rather than the standard 4-character PDB codes. However, the downloaded BindingDB file (bind/BindingDB_All.tsv) contains standard PDB codes in the PDB column (e.g., 1W5Y,1W5X,...), and UniProt accessions such as P03367 appear at the end of each line. Because chain2uniprot appears to use the mmCIF/internal identifiers, a direct lookup from BindingDB’s PDB code → chain → UniProt fails (no matches), so the mapping step produces an empty data/BindingDB.tsv in my runs.
I attempted a few fixes:
- Verified the integrity and contents of bind/BindingDB_All.tsv (it contains UniProt accessions).
- Tried mapping PDB → chain using your chain2uniprot (no matches because the IDs differ).
- Wrote a helper script that falls back to mapping by UniProt only (skipping PDB → chain). That works, but it loses the PDB chain specificity I need for per-chain cleaning/assignment.
- Tried parsing different BindingDB column positions (some releases change the column order) and made sure the parser reads the last column (where the UniProt accession lives).
Even with these attempts, I could not produce the expected data/BindingDB.tsv in which BindingDB ligands are linked to BioLiP chains.
Reproducing BioLiP2 curation pipeline — PDB ID mapping mismatch with BindingDB
Lina Khelfet
Re: Reproducing BioLiP2 curation pipeline — PDB ID mapping mismatch with BindingDB
Dear user,
The identifiers in chain2uniprot.tsv.gz (e.g., 101m) are indeed standard PDB IDs (in lowercase). This file is built directly from the SIFTS mapping, using the common standard of “4-character PDB code + chain ID” to map to UniProt accessions. For more information, please refer to the email I previously sent you.
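If the only difference is letter case, normalizing both sides to lowercase before the lookup should be enough to get matches. Below is a minimal sketch of that idea, not the pipeline's actual code: the three-column row layout (PDB ID, chain, accession), the chain letters, and the function names are assumptions made for illustration, and the real column layout of chain2uniprot.tsv.gz may differ.

```python
from collections import defaultdict


def build_chain_index(rows):
    """Index (pdb_id, chain, accession) rows by lowercase PDB ID."""
    index = defaultdict(list)
    for pdb_id, chain, accession in rows:
        index[pdb_id.strip().lower()].append((chain, accession))
    return index


def split_pdb_field(field):
    """Split a comma-separated PDB field such as '1W5Y,1W5X' into lowercase IDs."""
    return [code.strip().lower() for code in field.split(",") if code.strip()]


def match_chains(pdb_field, target_accessions, index):
    """Return (pdb_id, chain, accession) triples whose accession is a target."""
    targets = set(target_accessions)
    hits = []
    for pdb_id in split_pdb_field(pdb_field):
        # .get avoids creating empty entries for PDB IDs missing from the index
        for chain, accession in index.get(pdb_id, []):
            if accession in targets:
                hits.append((pdb_id, chain, accession))
    return hits


# Hypothetical rows for illustration only; chain letters are made up.
rows = [("1w5y", "A", "P03367"), ("1w5y", "B", "P03367")]
index = build_chain_index(rows)

# Uppercase codes, as they appear in the BindingDB PDB column, now match.
hits = match_chains("1W5Y,1W5X", ["P03367"], index)
```

Filtering by UniProt accession after the PDB lookup keeps the per-chain specificity: only chains of the listed PDB entries that map to the target protein are returned.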
Best wishes,
Hao