Reproducing BioLiP2 curation pipeline — PDB ID mapping mismatch with BindingDB

Lina Khelfet · Post by **Lina Khelfet** » Wed Nov 26, 2025 7:04 pm

I’m trying to reproduce your BioLiP2 curation pipeline from your source code in order to integrate it into my workflow. First — thank you for making the scripts and procedures available; the overall pipeline is very helpful.

While reproducing step 3 (Download binding affinity), I hit a reproducible problem that, as far as I can tell, is described in neither the paper nor the README. Below I describe the problem and what I tried. I’d appreciate any guidance or pointers you can share.

Your repo already produces data/chain2uniprot.tsv.gz. The file contains identifiers that look like mmCIF/BioLiP internal IDs (e.g., 101m, 102l, …) rather than standard 4-character PDB codes. The downloaded BindingDB file (bind/BindingDB_All.tsv), however, contains standard PDB codes in the PDB column (e.g., 1W5Y,1W5X,...), and UniProt accessions such as P03367 appear at the end of each line. Because chain2uniprot appears to use different identifiers, a direct lookup from BindingDB’s PDB code → chain → UniProt fails (no matches), so the mapping step produces an empty data/BindingDB.tsv in my runs.

I attempted a few fixes:

Verified bind/BindingDB_All.tsv integrity and contents (it contains UniProt accessions);
Tried mapping by PDB → chain using your chain2uniprot (no matches because IDs differ);
Wrote a helper script that falls back to mapping by UniProt only (skips PDB → chain) — that works but loses the PDB chain specificity I need for per-chain cleaning/assignment;
Tried parsing different BindingDB column positions (some releases change column order), and ensured the parser reads the last column (where UniProt lives).
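
For reference, the column handling in that last attempt can be sketched roughly as below: a minimal example that looks columns up by header name instead of position, so a change in column order between releases does not matter. The header names `PDB` and `UniProt` and the function `read_columns` are placeholders for illustration; the real header text should be checked against the downloaded bind/BindingDB_All.tsv.

```python
import csv
import io

def read_columns(handle, wanted):
    """Yield the requested columns from a tab-separated file, located by header name."""
    # QUOTE_NONE: treat quote characters as literal text (an assumption about the file).
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    missing = [name for name in wanted if name not in (reader.fieldnames or [])]
    if missing:
        raise KeyError(f"missing columns: {missing}")
    for row in reader:
        yield tuple(row[name] for name in wanted)

# Toy stand-in for the BindingDB file; placeholder header names.
toy = "Ligand\tPDB\tUniProt\nL1\t1W5Y,1W5X\tP03367\n"
records = list(read_columns(io.StringIO(toy), ["PDB", "UniProt"]))
print(records)
```

Failing loudly on a missing header makes a renamed column visible immediately instead of silently producing an empty mapping.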

With these attempts I still could not produce the expected data/BindingDB.tsv where BindingDB ligands are linked to BioLiP chains.

LuHao · Post by **LuHao** » Tue Dec 09, 2025 3:16 pm

Dear user,

The identifiers in chain2uniprot.tsv.gz (e.g., 101m) are indeed standard PDB IDs (in lowercase). This file is built directly from the SIFTS mapping, using the common standard of “4-character PDB code + chain ID” to map to UniProt accessions. For more information, please refer to the email I previously sent you.

Best wishes,
Hao
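
A minimal sketch of what the reply implies for the lookup, not the pipeline's own code: if chain2uniprot holds lowercase PDB IDs and BindingDB holds uppercase, comma-separated ones, lowercasing and splitting the BindingDB field before the join should let the two match. The row layout `(pdb, chain, uniprot)`, the function names, and the toy rows are assumptions for illustration; the actual column layout of chain2uniprot.tsv.gz should be checked first.

```python
from collections import defaultdict

def build_chain_index(rows):
    """Index (pdb, chain, uniprot) rows by lowercase PDB ID."""
    index = defaultdict(list)
    for pdb, chain, uniprot in rows:
        index[pdb.strip().lower()].append((chain, uniprot))
    return index

def split_pdb_field(field):
    """Split a comma-separated PDB field and lowercase each code."""
    return [code.strip().lower() for code in field.split(",") if code.strip()]

def map_to_chains(pdb_field, target_uniprot, index):
    """Return (pdb, chain) pairs whose UniProt accession matches the target."""
    hits = []
    for pdb in split_pdb_field(pdb_field):
        for chain, uniprot in index.get(pdb, []):
            if uniprot == target_uniprot:
                hits.append((pdb, chain))
    return hits

# Toy rows standing in for chain2uniprot (assumed layout).
rows = [("1w5y", "A", "P03367"), ("1w5y", "B", "P03367"), ("101m", "A", "P02185")]
index = build_chain_index(rows)
hits = map_to_chains("1W5Y,1W5X", "P03367", index)
print(hits)
```

Matching on both the PDB ID and the UniProt accession keeps the per-chain specificity that a UniProt-only fallback loses.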
