Virtual characterization via knowledge-enhanced representation learning: from organic conjugated molecules to devices

Zhao, Guojiang; Ou, Qi; Zhao, Zifeng; Chen, Shangqian; Lin, Haitao; Ji, Xiaohong; Wang, Zhen; Wang, Hongshuai; Cai, Hengxing; Wu, Lirong; Lu, Shuqi; Yang, FengTianCi; Wen, Yaping; Zhang, Yingfeng; Ma, Haibo; Gao, Zhifeng; Cheng, Zheng; E, Weinan

doi:10.1038/s41524-025-01788-y

Article
Open access
Published: 16 October 2025

npj Computational Materials volume 11, Article number: 308 (2025)

Abstract

The rational design of organic functional devices relies on understanding structure-property-performance relationships through multi-scale characterization. However, traditional characterizations are costly and require multidisciplinary expertise. Here we present OCNet, a domain-knowledge-enhanced representation learning framework that, for the first time, enables unified virtual characterization from molecules to devices. Pre-trained on over ten million self-generated conjugated molecules and dimers, OCNet learns generalizable microscopic representations comparable to expert-crafted features. As a result, it surpasses state-of-the-art models by over 20% in predicting key computed and experimental molecular optoelectronic properties. OCNet further provides the first transferable model for predicting transfer integrals in thin films, enabling accurate mesoscale carrier mobility estimation via multiscale simulations. By integrating tight-binding-level electronic descriptors, OCNet achieves near real-time, accurate prediction of device power conversion efficiency. Together, OCNet offers a unified and scalable foundation for virtual characterization of organic materials across multiple scales, with broad applicability in photovoltaics, displays, and sensing.

Introduction

Understanding the structure-property-function relationships of materials is fundamental to the rational design of functional devices. For functional materials, diverse application scenarios such as displays, energy conversion, and sensing impose distinct property requirements from molecules to devices. This necessitates comprehensive, multi-scale, and multi-property characterization to guide material development. However, such characterization is inherently challenging¹: conventional methods remain costly, labor-intensive, and reliant on multidisciplinary expertise, creating major bottlenecks in the development of next-generation functional materials.

Organic functional materials, a burgeoning frontier in functional materials, have found extensive applications in diverse academic and commercial areas from light-emitting diodes^2,3 and organic photovoltaics^4,5,6 to chemical and biological sensors^7,8,9,10. These materials typically comprise conjugated molecules and are utilized in the form of solid films as the core constituents of optoelectronic devices, where the macroscopic device performance arises from a complex interplay of molecular-level optoelectronic properties (e.g., emission wavelength, photoluminescence quantum yield), mesoscale charge transport behaviors (e.g., carrier mobility), and their structural organization in thin films. Accurate and efficient characterization of these multi-level properties is critical for materials innovation¹¹, yet remains largely constrained by the high cost and complexity of current characterization approaches. This raises a central question: can we develop virtual characterization tools that provide accurate, scalable, and cost-effective access to various key material properties across multiple length scales, thereby substantially diminishing the reliance on experimental characterization during device development?

Quantum mechanics (QM) methods have long been used to evaluate optoelectronic properties and to model charge transport, typically through the calculation of intermolecular electronic couplings (transfer integrals) combined with multi-scale simulation^12,13,14. However, these approaches are computationally expensive, often scaling cubically with system size and requiring extensive sampling of microscopic configurations, which limits their practicality in large-scale materials discovery. On the other hand, data-driven approaches have exhibited great potential in predicting characterized properties with high accuracy based only on microscopic representations of materials^{15,16,17,18,19,20,21,22}. Recent 2D graph convolutional networks have attained state-of-the-art (SOTA) accuracy in multiple optoelectronic property predictions at the molecular level^23,24,25. Nevertheless, these approaches are inherently limited by their inability to incorporate essential 3D structural information, which is critical for modeling transport-related processes. Alternatively, strategies such as the Coulomb matrix representation²⁶ or 3D graph networks²⁷ have been employed to model transfer integrals, which are further used as the input for kinetic Monte Carlo (kMC) simulations to estimate mesoscale carrier mobility^27,28,29,30. However, there remains a lack of transferable models for accurately predicting transfer integrals in thin films and subsequently estimating film mobility. Moreover, device-level performance predictions continue to rely heavily on hand-crafted descriptors derived from computationally intensive DFT or TDDFT calculations^15,22. To date, no existing framework has simultaneously achieved high accuracy, efficiency, and transferability across molecular-, mesoscopic-, and device-scale virtual characterizations, leaving a long-standing gap that continues to hinder the rational design of organic functional devices.

To address this challenge, we propose OCNet, a domain-knowledge-enhanced representation learning framework for organic conjugated systems that, for the first time, enables unified and accurate virtual characterization of organic functional materials, from molecular-scale optoelectronic properties and mesoscale charge transport to device-level performance. Specifically, OCNet realizes the first deep-learning-derived molecular and bimolecular (intermolecular) representations for organic functional materials. Leveraging self-constructed databases of over ten million conjugated molecules and dimers, together with the pre-training strategy adopted in previous data-rich scenarios^{31,32,33,34,35,36}, OCNet captures generalizable 3D features that are comparable to domain-expert feature engineering in describing intramolecular optoelectronic properties and intermolecular electronic coupling. As a result, it outperforms reported SOTA models by over 20% in predicting various key computed or experimental optoelectronic properties and intermolecular transfer integrals. Subsequently, using a self-constructed million-scale database of transfer integrals at the DFT level, OCNet realizes the first transferable model for predicting transfer integrals in thin films, enabling accurate prediction of mesoscale carrier mobility through multi-scale simulation. Finally, by integrating tight-binding-level electronic descriptors with our microscopic representation, OCNet achieves accurate, near real-time prediction of device PCE, surpassing TDDFT-descriptor-based models by 12%. This bridges the longstanding gap between molecular design and device-level optimization. Overall, OCNet offers a unified and scalable foundation for multi-property and multi-scale virtual characterization in organic electronics. We anticipate this framework will broadly accelerate the discovery and development of organic materials for energy, display, and sensing applications.

Results

Overview of OCNet framework

Our OCNet framework (Fig. 1) employs a pre-trained Transformer architecture based on 3D geometries to extract microscopic information of organic conjugated systems. It establishes general molecular and bimolecular representations that capture microscopic optoelectronic and charge transport behaviors, including intramolecular electronic excitations and intermolecular electron hopping. At the molecular level, OCNet directly maps microscopic representations to molecular optoelectronic properties or intermolecular transfer integrals. At the mesoscopic and macroscopic scales, it connects microscopic representations to higher-level properties through either physics-driven multi-scale modeling or end-to-end data-driven pipelines. Moreover, for material properties governed by complex physical mechanisms, particularly device-level performance, OCNet supports the incorporation of expert-derived features such as electronic structure information to further enhance the expressiveness of its microscopic representations. In addition, to overcome the scarcity of large-scale databases required for effective pre-training, we construct the first 10-million-scale conjugated molecular and bimolecular databases including geometries and corresponding optoelectronic properties or transfer integrals at the tight-binding (TB) level (Fig. 1a), enabling OCNet to learn more comprehensive and transferable microscopic 3D representations.

**Fig. 1: Overview of the OCNet framework.**

Pre-training Database

For the molecular dataset, we incorporate 15 elements (H, B, C, N, O, F, Si, P, S, Cl, Br, I, Ir, Ge, Se) and cover three major classes of conjugated molecules: metal-organic complexes, fused-ring structures, and fragment-assembled conjugated systems, thus spanning a broad and representative chemical space. Specifically, we integrate 0.84 million Ir complex structures from a recent open-source dataset³⁷ and 0.5 million fused-ring systems from COMPAS-2x³⁸. Additionally, we generate 14 million molecular structures using ring fusion and fragment assembly methods (detailed in the supporting information, Figs. S1 and S2). In our ring fusion protocol, we allow carbon or heteroatoms to be shared by two or three rings (Fig. S1b), resulting in molecules with multiple resonance forms—an essential feature for optoelectronic applications such as display materials^39,40. Fragment assembly further extends chemical diversity by linking conjugated fragments via carbon-carbon connections. We compare the chemical diversity of our molecular database with the open-source COMPAS-2x dataset by analyzing the distributions of heavy atom count (Fig. 2a, b) and molecular weight (Fig. S3). Most molecules in COMPAS-2x contain fewer than 50 heavy atoms and have molecular weights below 600 Da, whereas approximately 60% of the molecules in our database exceed these thresholds, indicating the inclusion of larger and more complex structures. We further benchmark the chemical space coverage of our database against COMPAS-2x and the largest open-source conjugated fragment assembly dataset, FORMED⁴¹, using t-SNE visualization of our molecular representations (Fig. 2c). The results indicate that COMPAS-2x and FORMED occupy only limited regions of the projected space, while our dataset spans a broader and more diverse range, underscoring its comprehensive coverage of conjugated chemical space.

**Fig. 2: Distribution and visualization of molecular datasets.**

For the bimolecular database, we sample 9.5M dimer conformations from 100K molecular films, which represents the first large-scale bimolecular database derived from thin-film environments. To construct this database, we first select molecules from our molecular database that exhibit low electron or hole reorganization energies at the GFN2-xTB level. These selected molecules are then assembled into amorphous films via MD simulations, using our previously developed GAFF-compatible force field specifically tailored for organic conjugated systems⁴² (see Supporting Information for details). We further demonstrate the chemical diversity of the bimolecular database by analyzing the distributions of heavy atom counts and molecular weights (Fig. S4). Our database includes dimers with up to 350 heavy atoms and molecular weights exceeding 9000 Da, indicating its extensive structural and chemical complexity.

Domain-knowledge-enhanced Microscopic Representations for Conjugated Systems

We then leverage the self-constructed molecular and bimolecular databases to pre-train the SE(3) Transformer architecture³⁴ (Fig. 1b). In the first stage, OCNet is pre-trained to recover atomic positions of molecular and bimolecular structures using an SE(3)-equivariant head. In the second stage, we further pre-train the model to predict optoelectronic properties and intermolecular transfer integrals at the tight-binding (TB) level⁴³ (see Supporting Information and Fig. S5). This two-stage pre-training enables OCNet to acquire rich structural and physical knowledge, resulting in a general microscopic representation that matches or even surpasses expert-designed features in downstream tasks.

To further enhance OCNet’s capability in modeling complex physical quantities, especially device-level performance, we incorporate domain knowledge by fusing our deep-learning-derived representations with expert features (e.g., TB-level electronic structure descriptors) using multilayer perceptrons (Fig. 1c, see Methodology for details). This hybrid strategy establishes OCNet as a state-of-the-art framework for virtual characterization across a wide range of organic functional materials. In the following sections, we systematically evaluate OCNet on multiple representative tasks (Fig. 1d), including molecular-level optoelectronic property prediction, mesoscopic charge transport estimation, and macroscopic device performance (PCE) modeling, to demonstrate its universality, accuracy, and efficiency. Unless otherwise specified, we adopt an 8:2 training-to-test split and report model performance using the mean absolute error (MAE) and the coefficient of determination (R²).

Molecular-level optoelectronic property prediction

We first evaluate OCNet’s performance on predicting computed optoelectronic properties using the largest open-source dataset: OCELOT chromophores^24,44. Specifically, we focus on four molecular properties that are directly relevant to downstream device design: the HOMO-LUMO gap (H-L), the lowest singlet excitation energy (S0-S1), and electron and hole reorganization energies (ER and HR). To assess OCNet’s effectiveness, we define an accuracy score as the ratio between the MAE of the reported state-of-the-art (SOTA) model and that of OCNet. Across all four properties, OCNet achieves the highest accuracy, outperforming existing methods by at least 13%, and achieves up to 60% improvement in HR prediction (Fig. 3a). OCNet’s predictions show strong agreement with quantum mechanical results (Fig. 3d), with MAEs of 0.199 eV and 0.008 eV, and R² values of 0.803 and 0.987 for S0-S1 and H-L, respectively. For ER and HR, OCNet reaches semi-quantitative accuracy (MAEs of 0.082 eV and 0.087 eV; R² values of 0.575 and 0.511), which is sufficient for screening low-reorganization-energy candidates.

We further compare OCNet (with (w/) and without (w/o) pre-training) to reported SOTA models in terms of MAE and R² across all four properties (Tables S1 and S2). OCNet w/ pre-training significantly outperforms the other models on all four optoelectronic properties. For instance, in S0-S1 prediction, it achieves an MAE of 0.199 eV and an R² of 0.803, significantly better than both OCNet w/o pre-training (MAE: 0.318 eV; R²: 0.544) and the reported SOTA (MAE: 0.249 eV; R²: 0.76).
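The evaluation metrics used in this section can be written out in a few lines. The sketch below is illustrative rather than the authors' evaluation code: the function names are our own, and the only numbers taken from the text are the two S0-S1 MAEs quoted above.

```python
def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def r2(y_true, y_pred):
    """Coefficient of determination, 1 - SS_res / SS_tot."""
    mean_t = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def accuracy_score(mae_sota, mae_model):
    """Ratio of the baseline MAE to the model MAE; values above 1 favour the model."""
    return mae_sota / mae_model

# S0-S1 MAEs quoted in the text: reported SOTA 0.249 eV, OCNet 0.199 eV.
s0s1_score = accuracy_score(0.249, 0.199)
print(f"S0-S1 accuracy score: {s0s1_score:.3f}")
```

Under this definition, a score of about 1.25 for S0-S1 corresponds to the roughly 25% improvement over the baseline.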

Next, we evaluate OCNet’s performance on Deep4Chem⁴⁵, the largest open-source dataset of experimental optoelectronic properties. To account for solvent effects, we construct a unified representation for solute-solvent systems by concatenating the element and distance matrices of both components (Fig. S6, detailed in “Methods”). Additionally, we integrate the domain features defined in SuboptGraph²⁵ into OCNet’s molecular representation to further enhance its expressiveness. We benchmark OCNet against the reported SOTA model on four application-relevant optoelectronic properties: absorption wavelength (Abs.), emission wavelength (Emi.), photoluminescence quantum yield (PLQY), and full width at half maximum (FWHM) (Fig. 3b). OCNet outperforms the SOTA model across all four tasks, achieving 18% and 13% accuracy improvements in Abs. and Emi. predictions, respectively. While improvements for PLQY and FWHM are more modest (5%), this is expected given that these properties were not included in the pre-training stage. Correlation analysis further confirms OCNet’s strong predictive performance (Fig. 3e). For Abs. and Emi., the model achieves MAEs of 7.085 nm and 11.167 nm, with corresponding R² values of 0.982 and 0.949. For PLQY and FWHM, OCNet attains MAEs of 0.101 and 9.123 nm, and R² values of 0.722 and 0.719, respectively. These results are sufficient for screening candidates with desired light color, high quantum yield, and narrow emission bandwidths in future applications.

To evaluate the contributions of pre-training and the domain features, we compare OCNet’s performance (w/ and w/o domain features) against Uni-Mol (a general-purpose molecular representation model for drug discovery) and the reported SOTA neural network for Abs. and Emi. predictions (Fig. 3c; Tables S3 and S4). Uni-Mol exhibits significantly lower accuracy in this context, with an MAE of 16 nm for Emi., due to its lack of pre-training on a large-scale conjugated molecular database. In contrast, OCNet both w/ and w/o domain features outperforms the SOTA baseline, indicating OCNet’s strong expressiveness in experimental property prediction at the molecular scale. In addition, the integration of domain features yields only marginal improvements over OCNet w/o domain features, suggesting that for optoelectronic properties governed by relatively simple physical processes, deep-learning-derived representations are already sufficiently expressive.

Overall, all results validate the necessity of pre-training on large-scale conjugated molecular databases and the advantage of 3D deep learning over 2D graph-based approaches in predicting optoelectronic properties, demonstrating great potential for efficient, property-driven materials design.

Intermolecular charge transfer integrals prediction

We next evaluate OCNet’s performance on intermolecular electronic coupling (transfer integral) prediction, a key microscopic property that directly governs charge transport in organic semiconductors. To enhance the model’s geometric expressiveness, we incorporate physically meaningful structural descriptors inspired by Valeev et al.⁴⁶, including: centroid-to-centroid distance, the angle between molecular plane normals, and the angle between the centroid vector and each molecular plane normal. For benchmarking, we adopt the OCELOT dimer dataset^27,44, containing 438,000 DFT-calculated transfer integrals across approximately 25,000 molecular crystal structures. OCNet accurately predicts both HOMO-HOMO (H-H) and LUMO-LUMO (L-L) transfer integrals, achieving MAEs of 2.131 meV and 2.242 meV, and R² values of 0.909 for both cases (Fig. 4a). Compared to the reported SOTA model, OCNet demonstrates a 50% improvement in prediction accuracy for crystal transfer integrals (Fig. 4c; Tables S5 and S6), highlighting the effectiveness of its bimolecular representation.
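The three structural descriptors listed above can be computed directly from the atomic coordinates of a dimer. The sketch below is one possible reading, not the authors' implementation: it assumes unweighted geometric centroids, least-squares molecular planes obtained by SVD, and angles folded into [0°, 90°] because the sign of a plane normal is arbitrary. All function and key names are hypothetical.

```python
import numpy as np

def plane_normal(coords):
    """Unit normal of the least-squares plane through a set of atoms (N x 3)."""
    centered = coords - coords.mean(axis=0)
    # The right singular vector with the smallest singular value is the normal.
    _, _, vt = np.linalg.svd(centered)
    return vt[-1]

def angle_deg(u, v):
    """Angle between two directions, folded into [0, 90] degrees."""
    c = abs(np.dot(u, v)) / (np.linalg.norm(u) * np.linalg.norm(v))
    return float(np.degrees(np.arccos(np.clip(c, 0.0, 1.0))))

def dimer_descriptors(mol_a, mol_b):
    """Centroid distance plus the plane-normal angles for a pair of molecules."""
    ca, cb = mol_a.mean(axis=0), mol_b.mean(axis=0)
    r = cb - ca  # centroid-to-centroid vector
    na, nb = plane_normal(mol_a), plane_normal(mol_b)
    return {
        "centroid_distance": float(np.linalg.norm(r)),
        "normal_normal_angle": angle_deg(na, nb),
        "centroid_normal_angle_a": angle_deg(r, na),
        "centroid_normal_angle_b": angle_deg(r, nb),
    }

# Toy dimer: two flat squares stacked face to face, 3.5 Å apart along z.
square = np.array([[1.0, 1.0, 0.0], [1.0, -1.0, 0.0],
                   [-1.0, -1.0, 0.0], [-1.0, 1.0, 0.0]])
desc = dimer_descriptors(square, square + np.array([0.0, 0.0, 3.5]))
print(desc)
```

For this cofacial toy dimer the centroid distance is 3.5 Å and all three angles are zero. A slipped or tilted arrangement would change the angles while leaving the distance similar, which is the kind of geometric distinction these descriptors are meant to expose.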

**Fig. 4: Performance of OCNet on charge transfer integrals prediction.**

To further assess the contribution of pre-training, we compare the performance of OCNet (w/ and w/o pre-training), alongside the reported SOTA model, on the OCELOT dataset (Fig. 4d). Without pre-training, OCNet’s accuracy declines markedly, with MAEs of 4.100 meV (H-H) and 3.300 meV (L-L), substantially higher than that of OCNet w/ pre-training (2.131 meV) and even that of the reported SOTA baseline (3.000 meV). These findings emphasize the importance of both the large-scale bimolecular database and the pre-training strategy in acquiring an expressive and transferable representation for modeling intermolecular electronic couplings.

Mesoscopic-level charge transport prediction

At the mesoscopic level, carrier mobility in thin films serves as a crucial parameter for evaluating charge transport efficiency in organic electronic devices^14,47. However, its accurate estimation via multi-scale simulations remains a key challenge, primarily due to the reliance on transfer integrals derived from costly DFT calculations. To address this, we develop the first transferable model for predicting transfer integrals in disordered thin-film environments. We construct a large-scale DFT-level database comprising 1.8 million dimers extracted from 45,000 distinct molecular films (details in the Supporting Information). To capture the complexity of film environments, we enhance OCNet’s bimolecular representation by integrating both structural features and domain-specific, TB-level electronic descriptors, including overlap integrals and orbital-specific and total effective transfer integrals. Since no prior models exist for this task, OCNet’s performance is benchmarked with an assigned accuracy score of 1.0 (Fig. 4c). We further evaluate the correlation between OCNet-predicted and QM-calculated values of the H-H and L-L transfer integrals (TI.) (Fig. 4b). OCNet demonstrates high accuracy, with R² values of 0.844 and 0.872, and MAEs of 7.350 meV and 7.497 meV for H-H and L-L TI., respectively, indicating sufficient precision to support subsequent mobility evaluations through further multi-scale modeling.

We then randomly select 80 molecules from our molecular database and generate their thin-film structures via molecular dynamics simulations at 300 K. For each film, we evaluate all transfer integrals of dimers within a 10 Å center-of-mass distance using both OCNet and PW91/6-31G(d) methods. Reorganization energies of single molecules are also obtained at the same DFT level. These parameters are then fed into kinetic Monte Carlo (kMC) simulations to estimate charge carrier mobilities. We compare the electron mobilities of seven representative thin films based on transfer integrals from DFT, OCNet, and GFN1-xTB (Fig. 5a). The mobilities obtained using OCNet closely match those derived from DFT, whereas the GFN1-xTB-based values are significantly underestimated. Furthermore, we compare the correlation between logarithmic mobilities (log(μ)) predicted by OCNet and those calculated using DFT (Fig. 5b). The results show that the log(μ) predicted by OCNet is comparable with the DFT-calculated values, with an MAE of 0.291, an R² of 0.713, and an R of 0.939. Through physics-driven multi-scale modeling, OCNet bridges microscopic representation with mesoscopic charge transport properties, achieving a favorable balance between accuracy and efficiency for mobility evaluation. This establishes a foundation for high-throughput virtual screening of high-mobility organic semiconductors, with great potential to address the longstanding bottleneck in the discovery of efficient organic electron transport materials.
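The text does not state which hopping-rate expression the kMC simulations use. Assuming the semiclassical Marcus rate, a common choice for hopping transport in organic semiconductors, the sketch below shows how a transfer integral J and a reorganization energy λ combine into a single hopping rate. The function name, the zero site-energy difference, and the numerical values are illustrative assumptions, not values from the paper.

```python
import math

HBAR_EV_S = 6.582119569e-16   # reduced Planck constant in eV*s
KB_EV_K = 8.617333262e-5      # Boltzmann constant in eV/K

def marcus_rate(j_ev, lambda_ev, delta_g_ev=0.0, temperature_k=300.0):
    """Marcus hopping rate in 1/s (assumed rate model, not confirmed by the source).

    j_ev       : transfer integral between the two sites, in eV
    lambda_ev  : reorganization energy, in eV
    delta_g_ev : site-energy difference (final minus initial), in eV
    """
    kt = KB_EV_K * temperature_k
    prefactor = (2.0 * math.pi / HBAR_EV_S) * j_ev ** 2
    prefactor /= math.sqrt(4.0 * math.pi * lambda_ev * kt)
    return prefactor * math.exp(-((delta_g_ev + lambda_ev) ** 2) / (4.0 * lambda_ev * kt))

# Hypothetical inputs: J = 5 meV and 10 meV, lambda = 0.2 eV, T = 300 K.
k_small = marcus_rate(j_ev=0.005, lambda_ev=0.2)
k_large = marcus_rate(j_ev=0.010, lambda_ev=0.2)
print(f"rate ratio for doubled J: {k_large / k_small:.2f}")
```

Because the rate scales with J², an error in a predicted transfer integral is amplified in the hopping rate, which is one reason accurate transfer integrals matter for the mobility estimates discussed above.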

**Fig. 5: Performance of OCNet in predicting the mesoscopic carrier mobility and device-level PCE.**

Device-level performance prediction

Although device-level performance such as PCE arises from complex physical processes and depends on the collective optoelectronic and transport properties of multiple functional layers, it is fundamentally governed by the microscopic behavior of electrons. Previous efforts, such as the work by Sahu et al.¹⁵, have explored end-to-end data-driven pipelines that link microscopic electronic structure descriptors to device-level PCE. However, their approaches rely on computationally expensive TDDFT-derived features, which limit scalability for high-throughput material screening.

In principle, by leveraging its expressive microscopic representations, OCNet may achieve accurate device performance prediction either directly based on 3D structural information or in combination with low-cost, approximate TB-level descriptors. To evaluate this, we adopt the OPV-PCE dataset created by Sahu et al. as a benchmark. To maintain consistency with Sahu’s study, we partition the dataset into 250 molecules for training and validation and 30 for testing. OCNet with TB-level descriptors achieves a test-set MAE of 0.738% in predicting PCE (Fig. 5c), demonstrating that OCNet can reliably reproduce experimental PCE values with both high accuracy and high efficiency. Furthermore, OCNet attains an R² of 0.696 and a Pearson correlation coefficient R of 0.841, representing a significant improvement over Sahu et al.’s previous result (R = 0.79). We also evaluate the performance of OCNet w/o TB-level descriptors in predicting PCE (Fig. S7), which still surpasses Sahu et al.’s results, with an MAE of 0.756%, R² = 0.657, and R = 0.817. These findings indicate that both 3D structural information and electronic information contribute to enhancing PCE prediction accuracy, and their integration provides a more precise molecular representation for device modeling. We also assess the computational efficiency of OCNet. The generation of TB-level descriptors requires approximately 0.08 CPU hours on an Intel Xeon Platinum 8163 2.5 GHz processor, while inference with OCNet takes only 0.005 seconds on an NVIDIA 4090 GPU. In contrast, TDDFT-derived descriptors typically demand over 1000 CPU hours. Thus, OCNet enables near real-time predictions, combining high accuracy with exceptional computational efficiency. By bridging this critical gap, we believe OCNet offers a promising pathway toward the end-to-end design of high-performance organic electronic devices.

Discussion

In summary, we present OCNet, a domain-knowledge-enhanced representation learning framework for organic conjugated systems that, for the first time, enables multi-scale virtual characterization, spanning from molecular properties to mesoscale film behavior and macroscopic device performance. To achieve this, we construct the first deep-learning-derived molecular and bimolecular representations for organic functional materials. Leveraging a self-generated database of over ten million conjugated molecules and dimers and a pre-training strategy, OCNet learns generalizable 3D features comparable to domain-expert-crafted descriptors for modeling intramolecular and intermolecular electronic behaviors. As a result, it outperforms reported SOTA models by over 20% in predicting various key computed or experimental optoelectronic properties and intermolecular transfer integrals. Furthermore, trained on a self-constructed million-scale transfer integral database at the DFT level, OCNet provides the first transferable model for predicting thin-film transfer integrals, enabling accurate mesoscale carrier mobility estimation through multiscale simulations. At the device level, by integrating tight-binding-level electronic descriptors with our microscopic representation, OCNet is the first to achieve near real-time, high-accuracy prediction of PCE, surpassing TDDFT-descriptor-based models by 12%. Taken together, OCNet offers a unified and scalable tool for accurate virtual characterization of various key material properties across multiple length scales, significantly reducing the reliance on resource-intensive characterization to establish structure-property-function relationships. It is therefore broadly applicable to accelerating materials design in photovoltaics, displays, and sensing.

However, in this work we have not yet employed OCNet to design new molecules and validate their performance through wet-lab experiments. To further advance OCNet’s capabilities, we aim to integrate OCNet-based virtual characterization with high-throughput experiments in future studies. This closed-loop research paradigm will extend OCNet’s utility in data-scarce scenarios, ultimately enabling fully intelligent design of organic materials.

Methods

Architecture of microscopic representation

To construct general and transferable molecular and bimolecular representations of organic functional materials, we first encode atomic numbers and pairwise distances to capture both atomic and 3D spatial information. We then use the self-attention mechanism in the Transformer architecture to update and couple these representations, enabling the model to capture complex interactions within molecules or bimolecules. Similar to the CLS token in BERT³³, which aggregates sequence-level representations for 1D tasks, we select the geometric center of the molecule or bimolecule as the CLS atom to aggregate atomic features. This method reflects the overall structural characteristics. The initial atomic representation is given by:

$${{\bf{x}}}^{0}={[{\rm{emb}}({\rm{CLS}}),{\rm{emb}}({Z}_{0})\ldots {\rm{emb}}({Z}_{n}),{\rm{emb}}({\rm{PAD}}),\ldots {\rm{emb}}({\rm{PAD}})]}_{{n}_{\max }+1}$$

(1)

where Z_i represents the vocabulary index of the i-th atom in the molecule or bimolecule. All atoms within the molecule or bimolecule are encoded using an embedding layer according to their elements, while the first element in Eq. (1) is the embedding of the CLS atom. ${n}_{\max }$ refers to the maximum number of atoms in a molecule or bimolecule within the database. We use a PAD token to ensure a fixed input size when the number of atoms is less than ${n}_{\max }$.

The initial pair representation is the molecular or bimolecular distance kernel matrix P⁰, where P_ij = σ(a_ijD_ij + b_ij), with a_ij and b_ij determined by the elemental types of atoms i and j. The L2 distance matrix D is given by:

$${\bf{D}}={\left(\begin{array}{cccccc}{r}_{{\rm{CLS}},{\rm{CLS}}}&{r}_{{\rm{CLS}},1}&\cdots &{r}_{{\rm{CLS}},n}&\cdots &0\\ {r}_{1,{\rm{CLS}}}&{r}_{1,1}&\cdots &{r}_{1,n}&\cdots &0\\ \vdots &\vdots &\ddots &\vdots & &\vdots \\ {r}_{n,{\rm{CLS}}}&{r}_{n,1}&\cdots &{r}_{n,n}&\cdots &0\\ \vdots &\vdots & &\vdots &\ddots &\vdots \\ 0&0&\cdots &0&\cdots &0\end{array}\right)}_{({n}_{\max }+1)\times ({n}_{\max }+1)}$$

(2)
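As a concrete illustration of Eqs. (1) and (2), the sketch below builds the padded token list and the padded distance matrix for a toy two-atom system. It is a minimal sketch under stated assumptions: tokens are kept as strings rather than vocabulary indices, the CLS pseudo-atom is placed at the unweighted geometric center, the matrix is taken to be square so that it can later be added to the attention logits, and the function name `build_inputs` is hypothetical.

```python
import math

CLS, PAD = "CLS", "PAD"

def build_inputs(elements, coords, n_max):
    """Return padded tokens (Eq. 1) and the padded distance matrix (Eq. 2)."""
    n = len(elements)
    if n > n_max:
        raise ValueError("molecule has more atoms than n_max")
    # The CLS pseudo-atom sits at the geometric center of the real atoms.
    center = tuple(sum(c[k] for c in coords) / n for k in range(3))
    tokens = [CLS] + list(elements) + [PAD] * (n_max - n)
    points = [center] + [tuple(c) for c in coords]
    size = n_max + 1
    # Entries that involve a PAD position stay at zero.
    dist = [[0.0] * size for _ in range(size)]
    for i in range(n + 1):
        for j in range(n + 1):
            dist[i][j] = math.dist(points[i], points[j])
    return tokens, dist

# Toy system: a C-O pair 1.2 Å apart, padded to n_max = 4.
tokens, dist = build_inputs(["C", "O"], [(0.0, 0.0, 0.0), (0.0, 0.0, 1.2)], n_max=4)
print(tokens)
print(dist[0][:3])
```

In a full model, the token list would be mapped to embeddings and the distance matrix would pass through the element-pair-dependent kernel described above before entering the encoder.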

For systems in solution, we concatenate the initial atomic representations ${{\bf{x}}}_{{\rm{solu}}}^{0}$ and ${{\bf{x}}}_{{\rm{solv}}}^{0}$ of the solute and solvent molecules, along with the initial pair representations ${{\bf{P}}}_{{\rm{solu}}}^{0}$ and ${{\bf{P}}}_{{\rm{solv}}}^{0}$ (Figure S5), to form the initial atomic representations:

$$\begin{array}{ll}{{\bf{x}}}^{0}=\left[{\rm{emb}}({\rm{CLS}}),{\rm{emb}}({Z}_{0}^{{\rm{solu}}}),\ldots ,{\rm{emb}}({Z}_{m}^{{\rm{solu}}}),{\rm{emb}}({Z}_{0}^{{\rm{solv}}}),\right.\\\qquad\left.\ldots ,{\rm{emb}}({Z}_{n}^{{\rm{solv}}})\right]_{{n}_{\max }^{{\rm{solu}}}+{n}_{\max }^{{\rm{solv}}}+1}\end{array}$$

(3)

where Z_i denotes the vocabulary index of the i-th atom in the corresponding molecule, and ${n}_{\max }^{{\rm{solu}}}$ and ${n}_{\max }^{{\rm{solv}}}$ represent the maximum numbers of atoms in the solute and solvent molecules, respectively. Padding is applied when a molecule has fewer atoms than the corresponding maximum.

The initial pair representations are:

$${{\bf{P}}}^{0}=\left(\begin{array}{ll}{{\bf{P}}}_{{\rm{solu}}}^{0}&0\\ 0&{{\bf{P}}}_{{\rm{solv}}}^{0}\end{array}\right)$$

(4)

The first element of ${{\bf{x}}}^{0}$, denoted ${{\bf{x}}}_{0}^{0}$, is the initial overall representation of the gas-phase molecule or of the combined solute-solvent system.
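The block-diagonal structure of Eq. (4) means that solute-solvent atom pairs start with a zero pair feature, so any coupling between the two molecules has to be learned through attention. A minimal sketch of the assembly follows; the function name is hypothetical, the handling of the CLS row and of padding is omitted, and the input values are arbitrary.

```python
def block_diagonal(p_solu, p_solv):
    """Place two square pair matrices on the diagonal of a larger zero matrix."""
    m, n = len(p_solu), len(p_solv)
    out = [[0.0] * (m + n) for _ in range(m + n)]
    for i in range(m):
        for j in range(m):
            out[i][j] = p_solu[i][j]
    for i in range(n):
        for j in range(n):
            out[m + i][m + j] = p_solv[i][j]
    return out

# Arbitrary toy inputs: a 2-atom solute block and a 1-atom solvent block.
p0 = block_diagonal([[0.0, 0.7], [0.7, 0.0]], [[0.3]])
for row in p0:
    print(row)
```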

Based on the initial atomic and pair representations x⁰ and P⁰, we update the atomic and pair representations with 15 encoder layers (Figure S6). For the l-th layer, we compute the query, value, and key matrices as ${{\bf{Q}}}^{l}={{\bf{x}}}^{l-1}{{\bf{W}}}_{Q}^{l}$, ${{\bf{V}}}^{l}={{\bf{x}}}^{l-1}{{\bf{W}}}_{V}^{l}$ and ${{\bf{K}}}^{l}={{\bf{x}}}^{l-1}{{\bf{W}}}_{K}^{l}$. By aggregating the atomic and pair representations from the (l − 1)-th layer, the atomic and pair representations for the l-th layer are updated as:

$${{\bf{x}}}^{l}={{\bf{x}}}^{l-1}+({\rm{softmax}}\left(\frac{{{\bf{Q}}}^{l}{{{\bf{K}}}^{l}}^{T}}{\sqrt{{d}_{k}}}+{{\bf{P}}}^{l-1}\right){{\bf{V}}}^{l}){{\bf{W}}}_{O}^{l}$$

(5)

$${{\bf{x}}}^{l}={{\bf{x}}}^{l}+{\rm{MLP}}({{\bf{x}}}^{l})$$

(6)

$${{\bf{P}}}^{l}={{\bf{P}}}^{l-1}+\frac{{{\bf{Q}}}^{l}{{{\bf{K}}}^{l}}^{T}}{\sqrt{{d}_{k}}}$$

(7)

Here MLP denotes a multilayer perceptron, and ${{\bf{W}}}_{O}\in {{\mathbb{R}}}^{{d}_{v}\times {d}_{{\rm{model}}}}$ projects the output of the attention mechanism to the same dimension as ${{\bf{x}}}^{l}$.
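The layer update in Eqs. (5) to (7) can be sketched in a few lines of NumPy. This is a simplified single-head version written for illustration, not the OCNet implementation: multi-head attention, normalization layers, and padding masks are omitted, and the `mlp` argument is any callable supplied by the caller.

```python
import numpy as np

def softmax(a, axis=-1):
    """Numerically stable softmax."""
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def encoder_layer(x, P, W_q, W_k, W_v, W_o, mlp):
    """One single-head encoder layer with a pair-representation bias."""
    Q, K, V = x @ W_q, x @ W_k, x @ W_v
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # scaled dot-product logits
    x = x + softmax(scores + P) @ V @ W_o      # Eq. (5): pair-biased attention
    x = x + mlp(x)                             # Eq. (6): residual MLP
    P = P + scores                             # Eq. (7): pair update
    return x, P

# Toy example with random weights and a zero MLP.
rng = np.random.default_rng(0)
n, d = 4, 8
x0 = rng.normal(size=(n, d))
P0 = rng.normal(size=(n, n))
W_q, W_k, W_v, W_o = (0.1 * rng.normal(size=(d, d)) for _ in range(4))
zero_mlp = lambda h: np.zeros_like(h)
x1, P1 = encoder_layer(x0, P0, W_q, W_k, W_v, W_o, zero_mlp)
```

Note that the pair representation enters the attention logits as an additive bias and is itself updated by the same logits, so geometric information and attention patterns inform each other from layer to layer.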

After processing through the L encoder layers of our MRL model, each followed by an MLP, the feature of the CLS atom, initialized as ${{\bf{x}}}_{0}^{0}$, aggregates the atomic and pair representations within the molecule or bimolecule. We denote the resulting ${{\bf{x}}}_{0}^{L}$ as CLS_repr, which serves as the overall representation of the molecule or bimolecule. We pre-train the model on large-scale 3D geometries and TB-level properties to improve the expressiveness of CLS_repr. Pre-training proceeds in two stages: the first stage comprises masked atom prediction and 3D coordinate reconstruction using SE(3)-equivariant networks, and the second stage continues pre-training on a large-scale optoelectronic or transfer integral database.

For downstream property prediction, we use the following model:

$$y={\rm{MLP}}({{\rm{CLS}}}_{{\rm{repr}}})$$

(8)

Alternatively, we can integrate domain-specific features (Fea) with the microscopic representation:

$$y={\rm{MLP}}\left({\rm{concat}}({\rm{MLP}}({{\rm{CLS}}}_{{\rm{repr}}}),{\rm{MLP}}({\rm{Fea}}))\right)$$

(9)
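The two prediction heads in Eqs. (8) and (9) differ only in whether domain-specific features are embedded and concatenated before the final MLP. The NumPy sketch below illustrates this; the layer sizes, the ReLU activation, and the names `mlp` and `predict` are assumptions made for the example and are not specified in this section.

```python
import numpy as np

def mlp(h, layers):
    """Apply linear layers given as (W, b) pairs, with ReLU between them."""
    for i, (W, b) in enumerate(layers):
        h = h @ W + b
        if i < len(layers) - 1:
            h = np.maximum(h, 0.0)   # assumed activation
    return h

def predict(cls_repr, head, fea=None, cls_branch=None, fea_branch=None):
    """Eq. (8) when fea is None, otherwise Eq. (9)."""
    if fea is None:
        return mlp(cls_repr, head)
    z = np.concatenate([mlp(cls_repr, cls_branch), mlp(fea, fea_branch)], axis=-1)
    return mlp(z, head)

# Toy example with identity branches and a summing head.
cls_repr = np.array([1.0, 2.0])
fea = np.array([3.0])
y_plain = predict(cls_repr, head=[(np.ones((2, 1)), np.zeros(1))])
y_fused = predict(
    cls_repr,
    head=[(np.ones((3, 1)), np.zeros(1))],
    fea=fea,
    cls_branch=[(np.eye(2), np.zeros(2))],
    fea_branch=[(np.eye(1), np.zeros(1))],
)
```

Passing each input through its own MLP before concatenation lets the learned representation and the hand-crafted features be rescaled to comparable ranges before they are mixed.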

Model configuration and training

We construct molecular and bimolecular representations with 15 layers and an embedding dimension of 512, using a Gaussian kernel size of 128. Pre-training is performed on 8 Tesla A100 GPUs, taking approximately 20 days to complete. We use the Adam optimizer with a learning rate of 0.001, gradient clipping set to 1.0, and 8 million training steps with 20K warm-up steps. The batch size is 128, and training lasts for 1000 epochs. Hyperparameters for training downstream organic optoelectronic properties and transfer integrals are detailed in Table S7.
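For orientation, the warm-up setting above can be written as a step-dependent learning-rate function. The peak rate (0.001), the 20K warm-up steps, and the 8 million total steps come from the text; the linear shape of the warm-up and the linear decay afterwards are assumptions, since the schedule shape is not stated here.

```python
def learning_rate(step, peak_lr=1e-3, warmup_steps=20_000, total_steps=8_000_000):
    """Linear warm-up to peak_lr, then linear decay to zero (decay is assumed)."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / (total_steps - warmup_steps)

# Learning rate at the start, mid warm-up, end of warm-up, and final step.
schedule = [learning_rate(s) for s in (0, 10_000, 20_000, 8_000_000)]
```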

Data availability

The datasets used in this study are available at https://github.com/545487677/OCNet. The OCNet source code and associated pre-trained models are also accessible at https://github.com/545487677/OCNet.

Code availability

The OCNet code and relevant models can be obtained through https://github.com/545487677/OCNet.

References

Ortega, E. O. et al. Material characterization techniques and applications (Springer, 2022).
Qin, Z. et al. Intrinsically white organic polarized emissive semiconductors. Nat. Photonics 19, 1–9 (2025).
Zeng, J. et al. Purely organic room-temperature phosphorescence sensitizers for highly efficient hyperfluorescence oleds. Sci. Adv. 11, eadt7899 (2025).
Article CAS PubMed PubMed Central Google Scholar
Hu, Y., Wang, J., Yan, C. & Cheng, P. The multifaceted potential applications of organic photovoltaics. Nat. Rev. Mater. 7, 836–838 (2022).
Article CAS Google Scholar
Chen, H. et al. Organic solar cells with 20.82% efficiency and high tolerance of active layer thickness through crystallization sequence manipulation. Nat. Mater. 24, 1–10 (2025).
Ouyang, Y., Wang, R., Wang, X., Xiao, M. & Zhang, C. Ultrafast energy transfer beyond the förster approximation in organic photovoltaic blends with non-fullerene acceptors. Sci. Adv. 11, eadr5973 (2025).
Article CAS PubMed PubMed Central Google Scholar
Gkoupidenis, P. et al. Organic mixed conductors for bioinspired electronics. Nat. Rev. Mater. 9, 134–149 (2024).
Article CAS Google Scholar
Xie, Z., Liu, D., Gao, C., Dong, H. & Hu, W. High-mobility emissive organic semiconductors: an emerging class of multifunctional materials. Nat. Rev. Mater. 9, 837–839 (2024).
Article CAS Google Scholar
Yuan, L. et al. Improving both performance and stability of n-type organic semiconductors by vitamin C. Nat. Mater. 23, 1268–1275 (2024).
Article CAS PubMed Google Scholar
Shi, J. et al. Active biointegrated living electronics for managing inflammation. Science 384, 1023–1030 (2024).
Article CAS PubMed Google Scholar
Wang, J. et al. Physical insights into non-fullerene organic photovoltaics. Nat. Rev. Phys. 6, 1–17 (2024).
Ou, Q., Peng, Q. & Shuai, Z. Computational screen-out strategy for electrically pumped organic laser materials. Nat. Commun. 11, 4485 (2020).
Article CAS PubMed PubMed Central Google Scholar
Fallon, K. J. et al. Exploiting excited-state aromaticity to design highly stable singlet fission materials. J. Am. Chem. Soc. 141, 13867–13876 (2019).
Article CAS PubMed Google Scholar
Friederich, P. et al. Molecular origin of the charge carrier mobility in small molecule organic semiconductors. Adv. Funct. Mater. 26, 5757–5763 (2016).
Article CAS Google Scholar
Sahu, H., Rao, W., Troisi, A. & Ma, H. Toward predicting efficiency of organic solar cells via machine learning and improved descriptors. Adv. Energy Mater. 8, 1801032 (2018).
Article Google Scholar
Gómez-Bombarelli, R. et al. Design of efficient molecular organic light-emitting diodes by a high-throughput virtual screening and experimental approach. Nat. Mater. 15, 1120–1127 (2016).
Article PubMed Google Scholar
Rao, L., Yuan, Y., Shen, X., Yu, G. & Chen, X. Designing nanotheranostics with machine learning. Nat. Nanotechnol. 19, 1769–1781 (2024).
Article CAS PubMed Google Scholar
Hu, Y. et al. Identifying a highly efficient molecular photocatalytic CO2 reduction system via descriptor-based high-throughput screening. Nat. Catal. 8, 1–11 (2025).
Dral, P. O. & Barbatti, M. Molecular excited states through a machine learning lens. Nat. Rev. Chem. 5, 388–405 (2021).
Article CAS PubMed Google Scholar
Wang, H. et al. Scientific discovery in the age of artificial intelligence. Nature 620, 47–60 (2023).
Article CAS PubMed Google Scholar
De Keer, L. et al. Computational prediction of the molecular configuration of three-dimensional network polymers. Nat. Mater. 20, 1422–1430 (2021).
Article PubMed Google Scholar
Han, G. & Yi, Y. Singlet-triplet energy gap as a critical molecular descriptor for predicting organic photovoltaic efficiency. Angew. Chem. Int. Ed. 134, e202213953 (2022).
Article Google Scholar
Joung, J. F. et al. Deep learning optical spectroscopy based on experimental database: potential applications to molecular design. JACS Au 1, 427–438 (2021).
Article CAS PubMed PubMed Central Google Scholar
Bhat, V. et al. Electronic, redox, and optical property prediction of organic π-conjugated molecules through a hierarchy of machine learning approaches. Chem. Sci. 14, 203–213 (2023).
Article CAS Google Scholar
Sun, M. et al. Enhancing chemistry-intuitive feature learning to improve prediction performance of optical properties. Chem. Sci. 15, 17533–17546 (2024).
Article CAS PubMed PubMed Central Google Scholar
Rupp, M., Tkatchenko, A., Müller, K.-R. & Von Lilienfeld, O. A. Fast and accurate modeling of molecular atomization energies with machine learning. Phys. Rev. Lett. 108, 058301 (2012).
Article PubMed Google Scholar
Bhat, V., Ganapathysubramanian, B. & Risko, C. Rapid estimation of the intermolecular electronic couplings and charge-carrier mobilities of crystalline molecular organic semiconductors through a machine learning pipeline. J. Phys. Chem. Lett. 15, 7206–7213 (2024).
Article CAS PubMed Google Scholar
Rinderle, M., Kaiser, W., Mattoni, A. & Gagliardi, A. Machine-learned charge transfer integrals for multiscale simulations in organic thin films. J. Phys. Chem. C. 124, 17733–17743 (2020).
Article CAS Google Scholar
Wang, C.-I., Joanito, I., Lan, C.-F. & Hsu, C.-P. Artificial neural networks for predicting charge transfer coupling. J. Chem. Phys. 153, 214113 (2020).
Tan, T., Duan, L. & Wang, D. Elucidating morphology-mobility relationships of organic thin films through transfer learning-assisted multiscale simulation. Adv. Func. Mater. 34, 2313085 (2024).
Article CAS Google Scholar
Bengio, Y., Courville, A. & Vincent, P. Representation learning: A review and new perspectives. IEEE Trans. Pattern Anal. Mach. Intell. 35, 1798–1828 (2013).
Article PubMed Google Scholar
Zhang, D., Yin, J., Zhu, X. & Zhang, C. Network representation learning: A survey. IEEE Trans. Big Data 6, 3–28 (2018).
Article Google Scholar
Devlin, J. Bert: Pre-training of deep bidirectional transformers for language understanding. Proc. Conf. North Am. Chapter Assoc. Comput. Linguist. 1, 4171–4186 (2019).
Zhou, G. et al. Uni-mol: A universal 3d molecular representation learning framework. ChemRxiv. (2023).
Zhu, J. et al. Unified 2d and 3d pre-training of molecular representations. In Proc. 28th ACM SIGKDD Conf. Knowl. Discov. Data Min. (KDD), 2626–2636 (2022).
Ji, X. et al. Uni-mol2: Exploring molecular pretraining model at scale. Adv. Neural Inf. Process. Syst. 37, 46956–46978 (2024).
Cheng, Z. et al. Automatic screen-out of ir (iii) complex emitters by combined machine learning and computational analysis. Adv. Opt. Mater. 11, 2301093 (2023).
Article CAS Google Scholar
Mayo Yanes, E., Chakraborty, S. & Gershoni-Poranne, R. Compas-2: a dataset of cata-condensed hetero-polycyclic aromatic systems. Sci. Data 11, 97 (2024).
Article CAS PubMed PubMed Central Google Scholar
Wu, X. et al. The role of host–guest interactions in organic emitters employing mr-tadf. Nat. Photonics 15, 780–786 (2021).
Article CAS Google Scholar
Madayanad Suresh, S. et al. A deep-blue-emitting heteroatom-doped mr-tadf nonacene for high-performance organic light-emitting diodes. Angew. Chem. Int. Ed. 62, e202215522 (2023).
Article CAS Google Scholar
Blaskovits, J. T., Laplaza, R., Vela, S. & Corminboeuf, C. Data-driven discovery of organic electronic materials enabled by hybrid top-down/bottom-up design. Adv. Mater. 36, 2305602 (2024).
Article CAS Google Scholar
Zhao, G. et al. Data-driven parametrization of all-atom force fields for organic semiconductors. ChemRxiv (2024).
Grimme, S., Bannwarth, C. & Shushkov, P. A robust and accurate tight-binding quantum chemical method for structures, vibrational frequencies, and noncovalent interactions of large molecular systems parametrized for all spd-block elements (z= 1–86). J. Chem. Theory Comput. 13, 1989–2009 (2017).
Article CAS PubMed Google Scholar
Ai, Q. et al. Ocelot: an infrastructure for data-driven research to discover and design crystalline organic semiconductors. J. Chem. Phys. 154, 174705 (2021).
Joung, J. F., Han, M., Jeong, M. & Park, S. Experimental database of optical properties of organic compounds. Sci. Data 7, 295 (2020).
Article CAS PubMed PubMed Central Google Scholar
Valeev, E. F., Coropceanu, V., da Silva Filho, D. A., Salman, S. & Brédas, J.-L. Effect of electronic polarization on charge-transport parameters in molecular organic semiconductors. J. Am. Chem. Soc. 128, 9882–9886 (2006).
Article CAS PubMed Google Scholar
Shuai, Z., Geng, H., Xu, W., Liao, Y. & André, J.-M. From charge transport parameters to charge mobility in organic semiconductors through multiscale simulation. Chem. Soc. Rev. 43, 2662–2679 (2014).
Article CAS PubMed Google Scholar

Download references

Acknowledgements

The authors gratefully acknowledge the funding support from the AI for Science Institute, Beijing (AISI) and DP Technology Corporation. The computing resources for this work were provided by the Bohrium Cloud Platform (https://bohrium.dp.tech), which is supported by DP Technology, the Hefei Advanced Computing Center of Sugon, and the High-Performance Computing Platform at Peking University. E’s work is supported in part by NSFC’s Major Research Project 92270001. Z. Z.’s work is supported in part by the Beijing Nova Program (20250484934).

Author information

These authors contributed equally: Guojiang Zhao, Qi Ou, Zifeng Zhao.

Authors and Affiliations

DP Technology, Beijing, PR China
Guojiang Zhao, Shangqian Chen, Haitao Lin, Xiaohong Ji, Zhen Wang, Hongshuai Wang, Hengxing Cai, Lirong Wu, Shuqi Lu, FengTianCi Yang & Zhifeng Gao
SINOPEC Research Institute of Petroleum Processing Co. Ltd, Beijing, China
Qi Ou
AI for Science Institute, Beijing, PR China
Zifeng Zhao, Zhifeng Gao, Zheng Cheng & Weinan E
Key Laboratory of Green Chemical Media and Reactions, Ministry of Education, Collaborative Innovation Center of Henan Province for Green Manufacturing of Fine Chemicals, College of Chemistry and Chemical Engineering, Henan Normal University, Xinxiang, PR China
Yaping Wen
Faculty of Synthetic Biology, Shenzhen University of Advanced Technology, Shenzhen, China
Yingfeng Zhang
Key Laboratory of Quantitative Synthetic Biology, Shenzhen Institute of Synthetic Biology, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, PR China
Yingfeng Zhang
Key Laboratory for Colloid and Interface Chemistry, Ministry of Education, School of Chemistry and Chemical Engineering, Shandong University, Qingdao, PR China
Haibo Ma
School of Mathematical Sciences, Peking University, Beijing, PR China
Zheng Cheng & Weinan E

Contributions

Z.Z., Z.G., Z.C. and W.E. designed the project. G.Z., Z.G. and Z.C. designed the neural network architectures. Z.C. collected and organized the open-source datasets. G.Z., Z.Z., S.C., H.L., X.J., Z.G. and Z.C. constructed the pre-training models in OCNet. Q.O., Z.Z., Y.Z. and Z.C. performed the computational simulations, including quantum chemical calculations, molecular simulations, and multi-scale simulations. G.Z., Z.W., H.W. and H.C. fine-tuned the downstream optoelectronic tasks. L.W., S.L., F.Y. and Z.C. fine-tuned the downstream transport-related tasks. G.Z., Y.W., H.M. and Z.C. fine-tuned the downstream device performance tasks. Z.C. and Z.Z. wrote the original draft. All authors reviewed the manuscript.

Corresponding authors

Correspondence to Zifeng Zhao, Zhifeng Gao or Zheng Cheng.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Materials

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Zhao, G., Ou, Q., Zhao, Z. et al. Virtual characterization via knowledge-enhanced representation learning: from organic conjugated molecules to devices. npj Comput Mater 11, 308 (2025). https://doi.org/10.1038/s41524-025-01788-y

Received: 07 May 2025
Accepted: 30 August 2025
Published: 16 October 2025
Version of record: 16 October 2025
DOI: https://doi.org/10.1038/s41524-025-01788-y