Microsoft BioGPT: Towards the ChatGPT of Life Science?

The use of language models (LMs) has exploded in recent years, and ChatGPT is just the tip of the iceberg. ChatGPT has been used to write code, recipes, and even sonnets and poems. All noble purposes, but there is also a huge body of scientific literature, so why not exploit this vast amount of textual data?

Microsoft recently unveiled BioGPT, which does exactly that. The new model has achieved state-of-the-art results on several tasks. Let's find out together.

First of all, why is this important? Thousands of scientific publications come out every year, and it is difficult to keep up with the growing literature. At the same time, the scientific literature is essential for developing new drugs, designing new trials, developing new algorithms, and understanding the mechanisms of disease.

In fact, NLP pipelines can be used to extract information from large numbers of scientific articles (named entities, relationships, document classes, and so on). However, general-purpose LMs usually perform poorly in the biomedical domain (they generalize badly to its specialized vocabulary). For this reason, researchers prefer to train models directly on scientific articles.
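To make this concrete, here is a minimal sketch of one such extraction step (named-entity recognition) with the Hugging Face pipeline API; the model name is a hypothetical placeholder, not a specific checkpoint:

```python
from transformers import pipeline

# A minimal sketch of one extraction step: named-entity recognition over an
# abstract. The model name below is a hypothetical placeholder; any
# biomedical NER checkpoint would do.
ner = pipeline(
    "token-classification",
    model="some-biomedical-ner-checkpoint",  # hypothetical checkpoint name
    aggregation_strategy="simple",
)
for entity in ner("Imatinib inhibits BCR-ABL in chronic myeloid leukemia."):
    print(entity["entity_group"], "->", entity["word"])
```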

PubMed, screenshot by the Author

In general, PubMed (the main repository of scientific articles) contains about 30M articles. So there are enough articles to train a model and then use this pre-trained model for downstream tasks.

Generally, two types of pre-trained models have been used:

  • BERT-like models, trained with masked language modeling. Given a sequence of tokens (subwords) in which some are masked, the model must use the remaining tokens (the context) to predict the masked ones.
  • GPT-like models, trained with auto-regressive language modeling. The model learns to predict the next token in a sequence, given the tokens that come before it. Both objectives are sketched in the code below.
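As a quick illustration of the two objectives, here is a minimal sketch using general-domain checkpoints (bert-base-uncased and gpt2) with the Hugging Face pipeline API:

```python
from transformers import pipeline

# BERT-style masked language modeling: recover the hidden token from context.
fill = pipeline("fill-mask", model="bert-base-uncased")
print(fill("The drug inhibits the [MASK] receptor.")[0]["token_str"])

# GPT-style auto-regressive modeling: predict the continuation token by token.
generate = pipeline("text-generation", model="gpt2")
print(generate("The drug inhibits the", max_new_tokens=10)[0]["generated_text"])
```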

BERT has been used extensively, and in fact there are several alternatives dedicated to the biomedical world: BioBERT, PubMedBERT, and so on. These models have shown superior capabilities on understanding tasks compared to other models. On the other hand, GPT-like models are superior on generative tasks but have been little explored in the biomedical field.

So, in summary, the authors of this paper used a GPT-like architecture:

we propose BioGPT, a domain-specific generative pre-trained Transformer language model for biomedical text generation and mining. BioGPT follows the Transformer language model backbone, and is pre-trained on 15M PubMed abstracts from scratch. We apply BioGPT to six biomedical NLP tasks: end-to-end relation extraction on BC5CDR [13], KD-DTI [14] and DDI [15], question answering on PubMedQA [16], document classification on HoC [17], and text generation. (original article)

The authors tested BioGPT (and compared it with previous methods) on three main tasks:

  • Relation extraction. The purpose is the joint extraction of both entities and their relationships (e.g., drugs, diseases, proteins, and how they interact).
  • Question answering. In this task, the model must provide an appropriate answer according to the context (reading comprehension).
  • Document classification. The model must classify (predict) a document with a label (or more than one label).
  image source: preprint


As the authors point out, when training a model from scratch it is important to make sure that the dataset comes from the target domain, is of high quality, and is large enough. In this case, they used 15M PubMed abstracts. In addition, the vocabulary must also be appropriate for the domain: here the vocabulary is learned with byte pair encoding (BPE) on the in-domain corpus rather than reused from a general-domain model (see the sketch below). Next, the architecture must be chosen, and the authors chose GPT-2.
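As a sketch of how such a domain vocabulary can be learned with the `tokenizers` library (the corpus file and vocabulary size below are illustrative placeholders, not the paper's exact setup):

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# A minimal sketch of learning a domain BPE vocabulary. The corpus file and
# vocabulary size are illustrative placeholders.
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
trainer = trainers.BpeTrainer(vocab_size=40000, special_tokens=["[UNK]"])
tokenizer.train(files=["pubmed_abstracts.txt"], trainer=trainer)  # hypothetical file

# Domain terms now split into far fewer subwords than with a generic vocabulary.
print(tokenizer.encode("acetylcholinesterase inhibitors").tokens)
```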

The authors specifically engineered the datasets to create training prompts (in which they provided both the prompt and the target). This enabled better training specific to the biomedical domain; a hypothetical example of this reformatting is sketched below.
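As an illustration (the template here is hypothetical, not the paper's exact wording), a labeled drug-target triplet could be rephrased into a prompt/target pair like this:

```python
def make_training_pair(abstract: str, drug: str, target: str, interaction: str):
    # Rephrase a structured triplet as a natural-language target sequence to be
    # appended to the source text during fine-tuning. Template is illustrative.
    source = abstract
    target_seq = f"the interaction between {drug} and {target} is {interaction}."
    return source, target_seq

src, tgt = make_training_pair(
    "Imatinib is a potent inhibitor of the BCR-ABL tyrosine kinase ...",
    drug="imatinib",
    target="BCR-ABL",
    interaction="inhibitor",
)
print(tgt)  # -> "the interaction between imatinib and BCR-ABL is inhibitor."
```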

  Framework of BioGPT when adapting to downstream tasks. image source: preprint


The authors evaluated the model on end-to-end relation extraction against REBEL, a recently published model based on BART (a sequence-to-sequence relative of BERT). They also used as a baseline the GPT-2 model, which was not trained specifically for the medical domain. The table shows that BioGPT achieves state-of-the-art results on a chemical-disease dataset:

“Results on BC5CDR chemical-disease-relation extraction task. ’gt+pred’ means using ground truth NER information for training and using open-source NER tool to annotate NER for inference. ’pred+pred’ means using open-source NER tool for both training and inference. ’†’ means training on training and validation set.” image source: preprint

Drug-drug interaction extraction is another task that is very useful in research. In fact, interactions between two different drugs are among the leading causes of adverse effects during treatment, so predicting potential interactions in advance is an asset for clinicians:

  image source: preprint


They also used a drug-target interaction dataset. The result is very interesting because predicting drug-target interactions can be useful for developing new drugs.

  image source: preprint


On this task, too, the model achieved state-of-the-art results:

  image source: preprint


And even on document classification, BioGPT outperformed previous models:

  image source: preprint


As said above, GPT-like models have generative capability: “they can continue to generate text that are syntactically correct and semantically smooth conditioning on the given text”. The authors decided to evaluate the model's ability to generate synthetic biomedical text.

Specifically, we extract all the entities within the triplets from the KD-DTI test set (i.e. drugs and targets). Then for each drug/target name, we provide it to the language model as the prefix and let the model generate text conditioned on it. We then investigate whether the generated text is meaningful and fluent. (original article)

The authors noted that the model worked well with known names as input, while with an unknown name the model either copies from an article seen in the training set or fails to generate informative text. A rough sketch of this conditional-generation setup is shown below.

  image source: preprint
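As a rough sketch of this setup, using the BioGPT checkpoint Microsoft released on the Hugging Face Hub (the prefix and decoding settings here are illustrative, not the paper's exact evaluation protocol):

```python
import torch
from transformers import BioGptForCausalLM, BioGptTokenizer

# Load the released checkpoint from the Hugging Face Hub.
tokenizer = BioGptTokenizer.from_pretrained("microsoft/biogpt")
model = BioGptForCausalLM.from_pretrained("microsoft/biogpt")

# Provide an entity name as the prefix and let the model continue from it.
inputs = tokenizer("Bicalutamide", return_tensors="pt")
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=40, num_beams=5)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```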


In conclusion:

BioGPT achieves SOTA results on three end-to-end relation extraction tasks and one question answering task. It also demonstrates better biomedical text generation ability compared to GPT-2 on the text generation task. (original article)

In the future, the authors would like to scale up to a bigger model and a bigger dataset:

For future work, we plan to train larger scale BioGPT on larger scale biomedical data and apply to more downstream tasks. (original article)

Microsoft is convinced the model can help biologists and scientists with scientific discoveries. In the future, it could be useful in the search for new drugs, included in pipelines that analyze the scientific literature.

article: here, preprint: here, GitHub repository: here, HuggingFace page: here.

What do you think about it?
