Microsoft BioGPT: Towards the ChatGPT of Life Science?

The use of language models (LMs) has exploded in recent years, and ChatGPT is just the tip of the iceberg. ChatGPT has been used to write code, recipes, and even sonnets and poems. All noble purposes, but there is also a huge body of scientific literature, so why not exploit that vast amount of textual data?

Microsoft recently unveiled BioGPT, which does exactly that. The new model has achieved state-of-the-art results in several tasks. Let’s find out more together.

First of all, why is this important? Thousands and thousands of scientific publications come out every year, and it is difficult to keep up with the growing literature. At the same time, the scientific literature is essential for developing new drugs, designing new trials, developing new algorithms, and understanding the mechanisms of disease.

In fact, NLP pipelines can be used to extract information from large collections of scientific articles (named entities, relationships, document classes, and so on). However, general-purpose LMs usually perform poorly in the biomedical area (they generalize badly to such a specialized domain). For this reason, researchers prefer to train models directly on scientific articles.

PubMed, screenshot by the Author

In general, PubMed (the main repository of biomedical articles) contains about 30M articles, so there is enough text to train a model and then use this pre-trained model for downstream tasks.

Generally, two types of pre-trained models have been used:

  • BERT-like models, trained with masked language modeling. Given a sequence of tokens (subwords) in which some are masked, the model must predict the masked tokens using the remaining tokens (the context).
  • GPT-like models, trained with autoregressive language modeling. The model learns to predict the next word in a sequence from the words that precede it.
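The difference between the two objectives is easiest to see in how training examples are built. A toy sketch (illustrative only; real models work on subword IDs and tensors, and BERT also replaces some masked tokens with random ones):

```python
import random

def masked_lm_example(tokens, mask_rate=0.15, mask_token="[MASK]", seed=0):
    """BERT-style: hide some tokens; the model must recover each one
    from the context on BOTH sides."""
    rng = random.Random(seed)
    inputs, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_rate:
            inputs.append(mask_token)
            targets[i] = tok  # predict the original token at position i
        else:
            inputs.append(tok)
    return inputs, targets

def causal_lm_examples(tokens):
    """GPT-style: at each step, predict the next token from the prefix only."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

toks = "aspirin inhibits platelet aggregation".split()
print(masked_lm_example(toks))
print(causal_lm_examples(toks))
```

The asymmetry matters: a BERT-like model sees the whole sentence at once (good for understanding tasks), while a GPT-like model only ever looks left-to-right, which is exactly what lets it generate text token by token.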

BERT has been used extensively, and in fact there are several alternatives dedicated to the biomedical world: BioBERT, PubMedBERT, and so on. These models have shown superior capabilities in understanding tasks compared to other models. On the other hand, GPT-like models are superior in generative tasks and have been little explored in the biomedical field.

So, in summary, the authors of this paper used a GPT-like architecture:

we propose BioGPT, a domain-specific generative pre-trained Transformer language model for biomedical text generation and mining. BioGPT follows the Transformer language model backbone, and is pre-trained on 15M PubMed abstracts from scratch. We apply BioGPT to six biomedical NLP tasks: end-to-end relation extraction on BC5CDR [13], KD-DTI [14] and DDI [15], question answering on PubMedQA [16], document classification on HoC [17], and text generation. (original article)

The authors tested (and compared with previous methods) BioGPT on three main tasks:

  • Relation extraction. The purpose is the joint extraction of both entities and their relationships (e.g., drugs, diseases, proteins, and how they interact).
  • Question answering. In this task, the model must provide an appropriate answer according to the context (reading comprehension).
  • Document classification. The model must classify (predict) a document with a label (or more than one label).
  image source: preprint


As the authors point out, when training a model from scratch it is important to make sure that the dataset is in-domain, of good quality, and in the right amount. In this case, they used 15M abstracts. In addition, the vocabulary must also be appropriate for the domain: here the vocabulary is learned with byte pair encoding (BPE). Next, the architecture must be chosen, and the authors chose GPT-2.
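BPE builds a domain-specific vocabulary by repeatedly fusing the most frequent adjacent symbol pair, so common biomedical substrings end up as single tokens. A minimal sketch of the merge-learning step (toy version; real vocabularies are learned with tools such as fastBPE or SentencePiece):

```python
from collections import Counter

def learn_bpe(corpus_words, num_merges):
    """Learn BPE merges: start from characters, then repeatedly fuse the
    most frequent adjacent pair of symbols across the corpus."""
    vocab = Counter(tuple(word) + ("</w>",) for word in corpus_words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])  # apply the merge
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] += freq
        vocab = new_vocab
    return merges

words = ["protein", "proteins", "proteomic", "kinase", "kinases"]
print(learn_bpe(words, 5))
```

Learning the merges on PubMed text (rather than reusing a general-domain vocabulary) is what lets terms like gene and drug names be tokenized into few, meaningful pieces.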

The authors specifically engineered the datasets to create training prompts (in which they provided both the prompt and the target). This enabled better adaptation to the biomedical domain.
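The idea is to recast a structured task (say, drug-target relation extraction) as text continuation, so a generative model can solve it. A sketch of such prompt/target formatting; the template below is hypothetical, the real templates are given in the paper:

```python
def format_dti_example(drug, target, interaction):
    """Turn a (drug, target, interaction) triple into a prompt/target pair
    of natural-language text, in the spirit of BioGPT's target-sequence
    formatting. The wording of the template here is illustrative only."""
    prompt = f"the interaction between {drug} and {target}"
    target_text = f"{prompt} is {interaction}."
    return prompt, target_text

p, t = format_dti_example("sorafenib", "BRAF", "inhibitor")
print(p)   # what the model is conditioned on
print(t)   # what it should generate during training
```

At inference time the model generates the continuation, and the structured triple is parsed back out of the generated sentence.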

  Framework of BioGPT when adapting to downstream tasks. image source: preprint


The authors evaluated the model on end-to-end relation extraction against REBEL, a recently published model based on BART (a sequence-to-sequence relative of BERT). They also used as a baseline the original GPT-2 model, which was not trained specifically for the medical domain. The table shows that BioGPT achieves state-of-the-art results on a chemical-disease dataset:

“Results on BC5CDR chemical-disease-relation extraction task. ’gt+pred’ means using ground truth NER information for training and using open-source NER tool to annotate NER for inference. ’pred+pred’ means using open-source NER tool for both training and inference. ’†’ means training on training and validation set.” image source: preprint

Drug-drug interaction is another type of relation that is very useful in research. In fact, interactions between two different drugs are among the leading causes of adverse effects during treatment, so predicting potential interactions in advance is an asset for clinicians:

  image source: preprint


They also used a drug-target interaction dataset. This result is particularly interesting because predicting drug-target interactions can be useful for developing new drugs.

  image source: preprint


In this task, too, the model achieved state-of-the-art results:

  image source: preprint


And even in document classification, BioGPT outperformed previous models:

  image source: preprint


As noted above, GPT-like models have generative capability: “they can continue to generate text that are syntactically correct and semantically smooth conditioning on the given text”. The authors decided to evaluate the model’s ability to generate biomedical text.

Specially, we extract all the entities within the triplets from the KD-DTI test set (i.e. drugs and targets). Then for each drug/target name, we provide it to the language model as the prefix and let the model generate text conditioned on it. We then investigate whether the generated text is meaningful and fluent. (original article)
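The prefix-conditioned generation described in the quote boils down to an autoregressive decoding loop: feed the prefix, repeatedly append the model’s next token. A self-contained sketch with a toy stand-in for the model (the continuation table below is invented for illustration; a real run would call BioGPT’s forward pass):

```python
def generate(prefix_tokens, next_token_fn, max_new_tokens=10, eos="</s>"):
    """Greedy autoregressive decoding: repeatedly ask the model for the
    most likely next token given everything generated so far.
    `next_token_fn` stands in for a real model's forward pass."""
    tokens = list(prefix_tokens)
    for _ in range(max_new_tokens):
        nxt = next_token_fn(tokens)
        if nxt == eos:
            break
        tokens.append(nxt)
    return tokens

# toy "model": a hard-coded continuation table, for illustration only
CONTINUATIONS = {
    ("imatinib",): "is",
    ("imatinib", "is"): "a",
    ("imatinib", "is", "a"): "tyrosine-kinase",
    ("imatinib", "is", "a", "tyrosine-kinase"): "inhibitor",
}

def toy_next(tokens):
    return CONTINUATIONS.get(tuple(tokens), "</s>")

print(" ".join(generate(["imatinib"], toy_next)))
# → imatinib is a tyrosine-kinase inhibitor
```

This is exactly the setup of the authors’ experiment: the drug or target name is the prefix, and the model is judged on whether the continuation is meaningful and fluent.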

The authors noted that the model worked well with known names as input, while with an unknown name it either copied from an article seen in the training set or failed to generate an informative text.

  image source: preprint


In conclusion:

BioGPT achieves SOTA results on three end-to-end relation extraction tasks and one question answering task. It also demonstrates better biomedical text generation ability compared to GPT-2 on the text generation task. (original article)

For the future, the authors would like to scale up with a bigger model and a bigger dataset:

For future work, we plan to train larger scale BioGPT on larger scale biomedical data and apply to more downstream tasks. (original article)

Microsoft is convinced the model can help biologists and scientists with scientific discoveries. In the future, it could be useful for the research of new drugs, included in pipelines that analyze the scientific literature.

article: here, preprint: here, GitHub repository: here, HuggingFace page: here.

What do you think about it?
