1) I was surprised to find that Asparaginase had no fewer than 92 substance (SID) entries in PubChem but, as expected, no CIDs, because a chemical structure can't be rendered. These seem to be from many sources but SID 34449732 has the ChEMBL link. There are 148 Erwinia asparaginases in the protein database but the chrysanthemi sequence is only in PDB. You can find it as a patent sequence in gi 300626453.
2) Aflibercept, an engineered protein, has three substance entries including SID 24490314 from KEGG, which has the same sequence as in the JPO blog. They could have got this either directly from the INN document or from the CAS no., but it gives a clear chimeric match in the nr protein database as shown below. The patent protein database gives a full-length match against a Regeneron patent sequence, with a signal peptide, gi 12309729.
3) Ruxolitinib. There are some interesting structural comments on JPO’s blog, I think from a ChemSpider staff member, that I’ll leave you to read. The correct enantiomer, CID 25126798, was submitted by Prous Drugs of the Future, Thomson Pharma and DiscoveryGate. It also came through as a malate salt in CID 25126797. The flat version, CID 7754772, made it into the kinome kinase panel assay from Abbott.
4) Clobazam has the lowest-numbered CID, 2789. This marks it out as an early 2005 PubChem submission and the INN, as the oldest one in this set, has been in MeSH from 1976. Among the 65 synonyms this has accumulated we see a good ole' CAS no. 22316-47-8 linked to 9 of the 37 SIDs. I took this as a golden opportunity to try out the CAS Common Chemistry look-up service for ~8K reg. nos. Lo and behold, it returned a not very aesthetic rendering, but with the Wikipedia link to boot.
5) Deferiprone is also a 2005 entry, as CID 2972, entering MeSH in 1985. It went into the MLSCN screening collection early on and so holds the record in this set for wet (as opposed to ChEMBL journal-extracted) assays at 629, with active: 67 and inactive: 531.
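As an aside on the CAS no. quoted for Clobazam in 4): the final digit of any CAS registry number is a check digit, so you can sanity-check one before pasting it into a look-up service. The rule is to weight the other digits 1, 2, 3, ... from right to left, sum them, and take the result modulo 10. Here is a minimal sketch of that rule; the function names are my own and not part of any CAS or PubChem tool.

```python
def cas_check_digit(cas: str) -> int:
    """Return the check digit implied by the first two blocks of a CAS number."""
    body, _, _last = cas.rpartition("-")           # e.g. "22316-47"
    digits = [int(ch) for ch in body if ch.isdigit()]
    # The rightmost digit gets weight 1, the next weight 2, and so on.
    total = sum(w * d for w, d in enumerate(reversed(digits), start=1))
    return total % 10


def is_valid_cas(cas: str) -> bool:
    """Check the NNNNNNN-NN-N layout and the trailing check digit."""
    parts = cas.split("-")
    if len(parts) != 3 or not all(p.isdigit() for p in parts):
        return False
    if not (2 <= len(parts[0]) <= 7 and len(parts[1]) == 2 and len(parts[2]) == 1):
        return False
    return cas_check_digit(cas) == int(parts[2])


if __name__ == "__main__":
    # Clobazam: 7*1 + 4*2 + 6*3 + 1*4 + 3*5 + 2*6 + 2*7 = 78, and 78 mod 10 = 8
    print(is_valid_cas("22316-47-8"))
```

This only tells you the number is well formed, not that it belongs to the structure it is attached to; the same-CAS-no.-on-two-structures cases in this set would pass the check just as happily.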
6) Icatibant can be assigned to CID 71364 but, because this is a big peptidomimetic with lots of stereocentres, there are 9 CIDs with the same connectivity and Mw of 1304.5, of which ABIChem can apparently supply four different CIDs. There is only one assay link between all 9, and that is not really an assay but the GLIDA ligand compilation. Kudos to DrugBank here for picking up a direct off-target interaction with aminopeptidase N when TTD only gets the primary target, Bradykinin receptor B2. While there was no 3D rendering in PC3D, additional kudos in this case goes to the Jmol implementation in ChemSpider 6446 that, after about three coffee sips, manfully spawned the 3D below. However, TTD renders this off the bat.
7) Crizotinib = CID 11626560. This has the highest BioAssay count of 878 (active in 268) but most of these are cell-line screens from the Sanger Centre via ChEMBL. There is a synonym mis-map in this CID because Selleck Chemicals have linked the structure to SB 431542 in SID 124756975. But SB431542 (sans hyphen) hits two different PubChem structures: crizotinib, above, as CID 11626560 and what I assume is the correct SB 431542 as CID 4521392. There is also a flat version, CID 1597571, that has links to the Abbott kinome panel assay AID 493040_1 with no fewer than 172 kinases. You can also find three PDB entries, for c-MET, ALK and a mutant. The gi sequences for the catalytic domain constructs have their purification tags left on.
8) Vemurafenib = CID 42611257. This is one of relatively few "clean" CIDs (a single name match subsuming all the SIDs, and no "same connectivity" CIDs). There are 13 mixture records from patents and a protein structure link as B-Raf Kinase V600e Oncogenic Mutant In Complex With Plx4032. What is odd (since it has been around in the literature long enough to be co-crystallized) is the absence of any bioassay links, even from ChEMBL where this is captured as an approved drug. It is shown below in full PC3D pharmacophoric glory.
9) Brentuximab vedotin has three substance entries for the two terms, none for the antibody alone but a match for the conjugated drug vedotin as CID 53297465. We can award points here both to ChEMBL for now including mabs (with clinical trial pointers) and, as observed in a few entries here, to KEGG for getting things first. The interesting quirk here is the big ugly linker that ties the warhead to the mab. The ChEMBL blog renders this, presumably from the INN, but in SID 85152967 the US gov ChemIDplus has the full IUPAC string. Additional kudos here goes first to OPSIN for the instant conversion below (it looks OK to me at any rate),
and secondly to the PubChem structure search that converted the OPSIN-generated SMILES and crunched it against the 32 million within a couple of coffee sips. Not expecting any matches, I was surprised when it whacked 7 CIDs at 90%, all listed as neighbors of CID 46944733 at around Mw 1350. As these are Thomson Pharma entries my guess is they are patented linker analogues (but the exact match was not there?). As a final trick to square the circle, a substructure search with either the SMILES string or CID 46944733 whacked the Vedotin CIDs.
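For anyone wondering what the 90% in that similarity search means: PubChem scores 2D similarity with the Tanimoto coefficient computed on substructure fingerprints, i.e. the number of shared "on" bits divided by the number of bits set in either structure. Below is a minimal sketch of that calculation. The bit sets and compound labels are entirely made up for illustration (they are not real fingerprints or real CIDs), and the function names are my own.

```python
from typing import Dict, List, Set, Tuple


def tanimoto(bits_a: Set[int], bits_b: Set[int]) -> float:
    """Tanimoto coefficient between two fingerprints held as sets of 'on' bit positions."""
    if not bits_a and not bits_b:
        return 0.0
    shared = len(bits_a & bits_b)
    return shared / (len(bits_a) + len(bits_b) - shared)


def neighbours(query: Set[int],
               library: Dict[str, Set[int]],
               threshold: float = 0.90) -> List[Tuple[str, float]]:
    """Return (label, score) pairs at or above the threshold, best first."""
    scored = [(label, tanimoto(query, fp)) for label, fp in library.items()]
    hits = [(label, score) for label, score in scored if score >= threshold]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical fingerprints: 20 bits for the query, analogues missing a few.
    query = set(range(20))
    library = {
        "close_analogue": set(range(19)),    # 19 shared / 20 total
        "distant_analogue": set(range(10)),  # 10 shared / 20 total
    }
    for label, score in neighbours(query, library):
        print(label, round(score, 2))
```

With big molecules like this linker a handful of changed atoms barely dents the bit count, which is probably why seven analogues can sit above 90% while the exact match is missing.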
10) Ticagrelor = CID 9871419. This gets the second-highest score for virtual deuterated derivatives from patents, with over 80 (see this previous post). It has both in vitro assay results and in vivo DMPK data from the ChEMBL links in SID 103554312. It has only three similar conformers so it is quick to walk through one of the analogues shown below, CID 9847634, probably from the early patents.
11) Indacaterol Maleate = CID 9827599. This has the USAN via KEGG SID 96025999, and the parent "indacaterol" (also a USAN) is CID 6918554. The nice quirk here is that this compound found its way into the GSK antimalarial screening hits (1.2 uM) via CID 44532988 (a.k.a. TCMDC-138894) but as the (1S) acetate (maybe GSK made this?). Because they pick up all TCMDCs, ChEMBL have duly inherited both the 1R maleate and the 1S acetate but stripped them back to parents that get linked to the assay results (yep, that makes four ChEMBL ids and four CIDs). Not to be outdone, IBM patents, EPO/Sling patents and the GLIDA GPCR ligand database are all linked to the flat version, CID 6433117.
12) Rivaroxaban = CID 6433119 has an intriguing virtual assay title link of "Wombat Data for Belief Docking". You can take a look at AID 493017 to find out more.
13) Belatacept has eight SID name matches but they include some entries for Abatacept which, as the JPO blog entry points out, has a different sequence. It took a bit of ferreting around to work out what has happened but it turns out that MeSH have linked the two names as synonyms.
14) Ezogabine = CID 121892. This has an extended commercial history from Viatris > Xcel > Valeant > GSK. An IBM entry, SID 24263039, picks up possibly the first Viatris patent in EP0956281 from 1998. This CID also has Nature Chem Biol as depositor so PMID 17435769 is linked to 10 SIDs including this one.
15) Fidaxomicin could be CID 46174142 but the stereo spaghetti connects nine CIDs. The submitter popularity vote (11 SIDs) goes to CID 10034073. Wikipedia plumps for CID 11528171 and NextBio have hedged their representational bets by having links to no fewer than 5 CIDs including this "flat" one, SID 77445353.
16) Telaprevir = CID 3010818 but in this case both Therapeutic Target Database links go to different compounds. The assay quirk here is that you only see the cross-screens as protein targets because the viral protease has a polyprotein precursor. This also provides a case of the same CAS no. linked to two different structures, the one above plus CID 52914943. This compound has an impressive total deuteration rendering in CID 5255312 (via Thomson Pharma and probably extracted from US 20090082366) but best of luck if you want to synthesise it ...
17) Rilpivirine = CID 6451164 has 6 HIV mutants in PDB and tops the maximally deuterated list at 98. It also has a comprehensive set of 434 ADMET assays including 253 dog and rat plasma profiles from one publication.
18) Linagliptin = CID 10096344. KEGG picked up the JAN/INN/USAN on 2011-08-02. Of the 24 assays, four are chemical property measurements: valuable, but not strictly bioassays. As mentioned in a previous post, the MeSH term Dipeptidyl-Peptidase IV Inhibitors brings back 19 compounds including most of the other gliptins.
19) Boceprevir is a 1:1 mixture of stereoisomers, CID 10324367. We can find some challenging assay multiplexing around this compound and other NS3 protease inhibitors because of the importance of clinical viral isolates with active-site altering mutations in this target. Each mutation and/or pair is assigned a separate Assay ID by ChEMBL and hence in PubChem BioAssay. We thus see 262 assays with five or so inhibitor leads but 100 of these are from the same paper. This is further confounded because there are seven CIDs with the same connectivity, and a major SAR publication where 33 analogues were tested (AID 286744) got linked to CID 44422871, which is missing one of the stereocentres.
20) Abiraterone Acetate = CID 132970 is a prodrug of Abiraterone. It has 7 stereocentres but only 6 of the 63 deuterateds get this right (1S,4S,5R). A name search with abiraterone in PubChem comes up with 28 CIDs. These appear to have ChEMBL as the primary source for a published analogue series but the names have been generated by the use of the suffix "mimetic" in BindingDB.
21) Gabapentin carbamyl = CID 9883933 is a racemate. It has two SID links to TTD but I can't reconcile the Jmol rendering of SID 134339032 with the PubChem 2D. It's the pro-drug of gabapentin, CID 3446. This entry holds a few records for this set with 101 synonyms, 246 SIDs and 159 SID mixtures. While you can find them via the name, as far as I can ascertain there are no direct links between the drug and pro-drug structures. Also, kudos for the only formal depiction of the racemic position in a link-out from this CID goes to the package insert.
22) Vandetanib = CID 3081361 has 518 assays (active in 198), because it was included in a large kinase panel assay publication (PMID 18183025) captured by ChEMBL. It takes the biscuit here for being assigned against 318 protein targets across the 518 assays.
23) Gadobutrol is one of two contrast agents in this set. The collision between the submissions and PubChem rules produces an interesting set of 9 substance connectivities. For example, CID 15814656 has stripped out the gadolinium but the Molecular Imaging and Contrast Agent Database (MICAD) puts it in the molecular cage in SID 8149120. The different substance renderings make quite an aesthetic mix but I'll leave you to work out "who submitted what" if you are interested.
24) Belimumab, another mab, also has three SID matches, with the one from KEGG going back to 2006-11-22.
25) Roflumilast = CID 449193. Just for a change there seem to be no deuterateds here but you can get a tritiated version as CID 50924690. This entry has what I think is the second-highest number of mixtures (50), the "biggest" of which is CID 11961292. I can leave you to spot the five components.
26) Azilsartan Medoxomil = CID 9825285. This is also a "clean" CID in that all submissions and name mappings have merged into just one CID. IBM (SID 128380015) have extracted this structure from one PMID and eight patents and, for some reason, have a second identical entry (SID 28380016) for one extra patent.
27) Ioflupane 123I (USAN: Ioflupane I123; USAN date: 2009; tradename: DaTSCAN; NDA 022454), for the imaging of dopamine transporters, = CID 3086674. You can pick up 10 connectivities including some with unlabeled iodine, and two with F as a label. There are ChEBI, ChEMBL, assay data and an IBM link under CID 4286448. I had assumed the IBM entries were all extracted from patents, plus some Medline abstracts, but this is an exception with only a link to PMID 8201589.
28) Vilazodone hydrochloride = CID 6918313 but most links and the assay data map to the parent as CID 6918314. There is an IBM patent link and the ChemSpider entry links through to a number of SureChem patents as well. CID 6918314 has 23 mixture SIDs coming in via Thomson Pharma as primary source. DrugBank has the entry but has missed out the structure (so I've left a note ...).
29) Ipilimumab has the standard substance triplet for a mab: ChemIDplus, ChEMBL and KEGG.
30) Spinosad is a mixture of the natural products Spinosyn A (CID 115003) and Spinosyn D (CID 183094) in a ratio of approximately 5:1. You can find 8 Spinosyn compound term matches with two for A and 6 for B-to-P. It's not that big but it sure throws up some representational challenges. There is a second form for A as CID 443059 and you can find an A+D mixture in CID 17754356. Now, you get 10 SIDs under 115003 but 19 same-connectivity CIDs. Open these all up to SIDs and you get 51 same-connectivity SIDs. One is a deprecated CHEMBL449382 but seven are from NextBio and three from NovoSeek. My guess is that these depositors are "pulling and pushing" (i.e. just dropping in links). The seven NextBio inlinks are stamped 12-June-2009 but the outlinks are "dead" text look-ups so I'm none the wiser. Of more scientific interest is that SID 26514570 links to a Nature Chemical Biology article about the macrocyclization enzymes in spinosad synthesis. Just for a change I snapped the CHEBI 9230 view so you can see the stereo labels nice and clearly.
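All of the "same connectivity" expansions in this list were done by clicking through the web pages, but the CID-to-same-connectivity-CIDs-to-SIDs walk can also be scripted. Here is a minimal sketch that assumes the present-day PubChem PUG REST service, which I did not use for this post. It only builds the request URLs and parses the JSON shapes that service returns, so you would need to add your own HTTP fetch. The function names are my own, and the sample payloads in the demo are invented placeholders, not real PubChem results.

```python
import json
from typing import Iterable, List
from urllib.parse import urlencode

PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def same_connectivity_url(cid: int) -> str:
    """URL asking for every CID sharing the query CID's connectivity."""
    query = urlencode({"identity_type": "same_connectivity"})
    return f"{PUG_REST}/compound/fastidentity/cid/{cid}/cids/JSON?{query}"


def sids_url(cids: Iterable[int]) -> str:
    """URL asking for the SIDs attached to each CID in a comma-separated list."""
    joined = ",".join(str(cid) for cid in cids)
    return f"{PUG_REST}/compound/cid/{joined}/sids/JSON"


def parse_cids(payload: str) -> List[int]:
    """Pull the CID list out of an IdentifierList response."""
    return json.loads(payload).get("IdentifierList", {}).get("CID", [])


def count_unique_sids(payload: str) -> int:
    """Count distinct SIDs across all CIDs in an InformationList response."""
    records = json.loads(payload).get("InformationList", {}).get("Information", [])
    return len({sid for record in records for sid in record.get("SID", [])})


if __name__ == "__main__":
    print(same_connectivity_url(115003))  # Spinosyn A
    # Invented placeholder payloads, just to show the parsing step.
    fake_cids = '{"IdentifierList": {"CID": [1, 2, 3]}}'
    fake_sids = ('{"InformationList": {"Information": '
                 '[{"CID": 1, "SID": [10, 11]}, {"CID": 2, "SID": [11, 12]}]}}')
    print(sids_url(parse_cids(fake_cids)))
    print(count_unique_sids(fake_sids))
```

Counting distinct SIDs rather than summing per-CID totals matters here, because a depositor record can show up under more than one of the stereo variants.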