InstructMol: Multi-Modal Integration for Building a Versatile and Reliable Molecular Assistant in Drug Discovery

1Hong Kong University of Science and Technology, 2Tsinghua University, 3International Digital Economy Academy (IDEA)

Overview

[Teaser figure]

Abstract

Artificial intelligence is evolving rapidly in drug discovery, yet it still struggles with generalization and demands extensive training; Large Language Models (LLMs) offer promise for reshaping how researchers interact with complex molecular data. We present InstructMol, a multi-modal LLM that aligns molecular structures with natural language through instruction tuning, using a two-stage training strategy that combines limited domain-specific data with molecular and textual information. InstructMol delivers substantial performance improvements on drug-discovery-related molecular tasks, surpassing leading LLMs and significantly narrowing the gap with specialized models, thereby establishing a robust foundation for a versatile and reliable drug discovery assistant.


Technical Description


• Motivations and Challenges

  • Numerous studies have explored multimodal LLMs for visual understanding. In molecular research, however, the key challenges are integrating molecular representations with LLMs and text, compiling comprehensive datasets, and devising effective training methods that let LLMs adapt to diverse tasks. Prior studies fine-tuned generalist LLMs for the molecular domain; although these efforts improved on the base models, they exhibited several issues:
    • Inadequate alignment between modalities.
    • Little exploration of an optimal molecular structure encoder.
    • A rudimentary training pipeline that neglects updating the LLM's knowledge.
    These issues contribute to a notable performance gap between current AI assistants and traditional specialist models in practical tasks.
  • Our solution, InstructMol, is a multi-modal instruction-tuned LLM that aligns the molecular modality with text. It employs calibrated instruction datasets and a two-stage training scheme to align molecular information with natural language (a minimal architectural sketch follows this list). The model enhances the LLM's understanding of molecular data and significantly improves performance on drug discovery tasks, narrowing the gap with specialized models. Key contributions include introducing InstructMol, efficiently extracting molecular representations, and substantially improving over state-of-the-art LLMs in practical drug discovery assessments.
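To make the design concrete, below is a minimal PyTorch sketch of this kind of architecture: a molecular graph encoder, a lightweight alignment projector, and an LLM that consumes the projected "molecule tokens" alongside text embeddings. The module names and dimensions here (e.g., graph_dim=300, llm_dim=4096, a Hugging Face-style forward that accepts inputs_embeds) are illustrative assumptions, not the released implementation.

import torch
import torch.nn as nn

class AlignmentProjector(nn.Module):
    """Maps graph-encoder features into the LLM's token-embedding space."""
    def __init__(self, graph_dim: int = 300, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Linear(graph_dim, llm_dim)

    def forward(self, graph_feats: torch.Tensor) -> torch.Tensor:
        # graph_feats: (batch, num_nodes, graph_dim) -> (batch, num_nodes, llm_dim)
        return self.proj(graph_feats)

class InstructMolSketch(nn.Module):
    """Graph encoder + trainable alignment projector + LLM backbone (sketch)."""
    def __init__(self, graph_encoder: nn.Module, llm: nn.Module,
                 graph_dim: int = 300, llm_dim: int = 4096):
        super().__init__()
        self.graph_encoder = graph_encoder  # e.g., a pre-trained molecular GNN
        self.projector = AlignmentProjector(graph_dim, llm_dim)
        self.llm = llm                      # assumed Hugging Face-style decoder LLM

    def forward(self, mol_graph, text_embeds: torch.Tensor):
        # Encode the molecule, project node features into "molecule tokens",
        # and prepend them to the text token embeddings before the LLM.
        graph_feats = self.graph_encoder(mol_graph)   # (B, N, graph_dim)
        mol_tokens = self.projector(graph_feats)      # (B, N, llm_dim)
        inputs = torch.cat([mol_tokens, text_embeds], dim=1)
        return self.llm(inputs_embeds=inputs)         # assumes HF-style forward

The key design point the sketch illustrates is that the projector is the only new component between two pre-trained models, which is what makes the staged training below practical.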



• InstructMol's Architecture and Training Pipeline

The diagram presented below provides an overview of the architectural design of the InstructMol model, along with its two-stage training paradigm. The example molecule in the figure is Terephthalaldehyde (CID 12173).

    [Figure: InstructMol architecture and two-stage training pipeline]
  • Stage 1: Alignment Pretraining. The initial stage aligns the molecular modality with text, enabling the LLM to grasp structural and sequential molecular information. We use a dataset of 330K molecule-text pairs from PubChem, applying a self-instruction approach to diversify the task descriptions. Training focuses on fine-tuning the alignment projector while freezing the graph encoder and LLM, preventing overfitting and preserving pre-trained knowledge (a minimal sketch of this freezing scheme appears after this list). The goal is for the projector to map graph representations effectively into the text token space.
  • Stage 2: Task-specific Instruction Tuning. In the second stage, we address three drug-discovery scenarios: compound property prediction, chemical reaction analysis, and molecule description generation. We use a specific instruction dataset for each task and design corresponding instruction templates. Training initializes the alignment projector with its stage-1 parameters, keeps the molecular encoder frozen, and updates the projector and LLM weights. We use low-rank adaptation (LoRA) to tailor the LLM to diverse tasks while retaining its common-sense reasoning capabilities in dialogue (see the LoRA sketch below). This adaptable approach allows different adaptors to be plugged in for specific needs, or combined for modular knowledge integration.
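The following is a minimal sketch of the stage-1 freezing scheme, assuming a model shaped like the earlier InstructMolSketch; only the alignment projector receives gradients. The optimizer choice and learning rate are illustrative assumptions, not the paper's exact settings.

import torch

def configure_stage1(model):
    # Stage 1 (sketch): freeze the graph encoder and the LLM; only the
    # alignment projector is trained on the molecule-text pairs.
    for p in model.graph_encoder.parameters():
        p.requires_grad = False   # keep pre-trained structural knowledge
    for p in model.llm.parameters():
        p.requires_grad = False   # keep pre-trained language knowledge
    for p in model.projector.parameters():
        p.requires_grad = True    # only the projector is updated
    trainable = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.AdamW(trainable, lr=2e-3)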

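And a hedged sketch of the stage-2 setup using the Hugging Face peft library for LoRA; the rank, alpha, and target modules below are illustrative defaults rather than the paper's exact configuration.

from peft import LoraConfig, get_peft_model

def configure_stage2(model):
    # Stage 2 (sketch): keep the molecular encoder frozen, continue training
    # the projector (initialized from stage 1), and adapt the LLM with LoRA.
    lora_cfg = LoraConfig(
        r=16,                                 # low-rank update dimension
        lora_alpha=32,
        lora_dropout=0.05,
        target_modules=["q_proj", "v_proj"],  # typical attention projections
        task_type="CAUSAL_LM",
    )
    model.llm = get_peft_model(model.llm, lora_cfg)  # wrap LLM with adapters
    for p in model.graph_encoder.parameters():
        p.requires_grad = False               # encoder stays frozen
    return model

Because LoRA adapters are small relative to the base LLM, a separate adaptor can be trained per task and swapped in as needed, or several can be combined, which is what enables the modular knowledge integration described above.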


Demonstrations



• Example-1: Molecule Description Generation

• Example-2: Forward Reaction Prediction

• Example-3: Reagent Prediction

• Example-4: Retrosynthesis Prediction

Related Links

This work partially draws inspiration from LLaVA and Mol-Instructions. This website is inspired by NExT-GPT.

BibTeX

@misc{cao2023instructmol,
      title={InstructMol: Multi-Modal Integration for Building a Versatile and Reliable Molecular Assistant in Drug Discovery}, 
      author={He Cao and Zijing Liu and Xingyu Lu and Yuan Yao and Yu Li},
      year={2023},
      eprint={2311.16208},
      archivePrefix={arXiv},
      primaryClass={q-bio.BM}
}