Reproducing BioLiP2 curation pipeline — PDB ID mapping mismatch with BindingDB
Posted: Wed Nov 26, 2025 7:04 pm
I’m trying to reproduce your BioLiP2 curation pipeline from your source code in order to integrate it into my workflow. First — thank you for making the scripts and procedures available; the overall pipeline is very helpful.
While reproducing step 3 (Download binding affinity), I hit a reproducible problem that, as far as I can tell, is described in neither the paper nor the README. Below I describe the problem and what I tried. I’d appreciate any guidance or pointers you can share.
Your repo already produces data/chain2uniprot.tsv.gz. The file contains lowercase identifiers (e.g., 101m, 102l, …) that look to me like mmCIF/BioLiP internal IDs rather than the standard 4-character PDB codes. The downloaded BindingDB file (bind/BindingDB_All.tsv), however, contains standard uppercase PDB codes in its PDB column (e.g., 1W5Y,1W5X,...), with UniProt accessions such as P03367 at the end of the line. Because the BioLiP chain2uniprot file uses the mmCIF/internal identifiers, a direct lookup from BindingDB’s PDB code → chain → UniProt fails (no matches), so the mapping step produces an empty data/BindingDB.tsv in my runs.
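To make the failure concrete, here is a minimal sketch of the lookup as I understand it. The function names and the (pdb_id, chain, uniprot) row layout are my own assumptions, not taken from your scripts, and the sample rows are purely illustrative. One thing I have not ruled out is that the mismatch is partly a letter-case difference, so the sketch lets the ID normalization be swapped in and out for comparison.

```python
from collections import defaultdict


def build_chain_index(rows, normalize=str.lower):
    """Index (pdb_id, chain, uniprot) rows by normalized PDB ID.

    The three-field row layout is an assumption about chain2uniprot,
    not something confirmed from the repository.
    """
    index = defaultdict(list)
    for pdb_id, chain, uniprot in rows:
        index[normalize(pdb_id)].append((chain, uniprot))
    return index


def lookup_chains(index, pdb_field, normalize=str.lower):
    """Resolve a comma-separated BindingDB PDB field to (pdb, chain, uniprot) hits."""
    hits = []
    for raw in pdb_field.split(","):
        pdb_id = raw.strip()
        if not pdb_id:
            continue
        key = normalize(pdb_id)
        for chain, uniprot in index.get(key, []):
            hits.append((key, chain, uniprot))
    return hits


# Illustrative rows only; chain letters are placeholders.
rows = [("1w5y", "A", "P03367"), ("1w5x", "B", "P03367")]

def identity(s):
    return s

# Exact-match lookup (no normalization) versus case-insensitive lookup.
exact_hits = lookup_chains(build_chain_index(rows, identity), "1W5Y,1W5X", identity)
folded_hits = lookup_chains(build_chain_index(rows), "1W5Y,1W5X")
```

If the case-insensitive variant also returns nothing on the real files, then the identifiers really do differ in kind and I would need to know which ID scheme chain2uniprot uses.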
I attempted a few fixes:
Verified bind/BindingDB_All.tsv integrity and contents (it contains UniProt accessions);
Tried mapping by PDB → chain using your chain2uniprot (no matches because IDs differ);
Wrote a helper script that falls back to mapping by UniProt only (skips PDB → chain) — that works but loses the PDB chain specificity I need for per-chain cleaning/assignment;
Tried parsing different BindingDB column positions (some releases change column order), and ensured the parser reads the last column (where UniProt lives).
With these attempts I still could not produce the expected data/BindingDB.tsv where BindingDB ligands are linked to BioLiP chains.
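For the column-order issue in particular, this is roughly how my parser now reads the file: it locates the PDB and UniProt columns by header name instead of by position. It is only a sketch. The function name is hypothetical, and the header strings are passed in as parameters because I am not certain the column names are identical across BindingDB releases.

```python
import csv
import io


def iter_pdb_uniprot(handle, pdb_col, uniprot_col):
    """Yield (pdb_ids, uniprot) pairs from a tab-separated file with a header row.

    pdb_col and uniprot_col are header names supplied by the caller;
    the first column with each name is used.
    """
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    header = next(reader)
    pdb_idx = header.index(pdb_col)
    uni_idx = header.index(uniprot_col)
    for row in reader:
        # Skip truncated lines that do not reach both columns.
        if len(row) <= max(pdb_idx, uni_idx):
            continue
        pdb_ids = [p.strip() for p in row[pdb_idx].split(",") if p.strip()]
        uniprot = row[uni_idx].strip()
        if pdb_ids and uniprot:
            yield pdb_ids, uniprot


# Tiny in-memory stand-in for the real file; the header names here are placeholders.
sample = io.StringIO(
    "Ligand\tPDB\tUniProt\n"
    "lig1\t1W5Y,1W5X\tP03367\n"
    "lig2\t\tP03367\n"
    "lig3\n"
)
pairs = list(iter_pdb_uniprot(sample, "PDB", "UniProt"))
```

Reading by header name removes the dependence on the UniProt accession being in the last column, but it does not by itself solve the PDB → chain mismatch described above.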