Introduction

In today’s society, mental health issues among college students are increasingly becoming a focus of public concern1. Against this background, loneliness and online social support are considered two vital dimensions affecting the mental health of college students, and they continue to be the focus of researchers’ attention2,3,4. Loneliness is a negative social and emotional experience characterized by an unacceptable or unpleasant sense of lack in interpersonal relationships5,6. As one of the most common emotional experiences among humans, loneliness is typically associated with a variety of negative outcomes7. Extensive studies have indicated that loneliness significantly increases the risks of stress, distress, anxiety, depression, anger, insomnia, self-harm, and suicidal ideation8,9,10,11. College students are at an important stage of developing peer relationships and pursuing autonomy and individuality, and are susceptible to loneliness, which significantly affects their mental health12,13. Online social support refers to communication that occurs within the virtual realm, where individuals are understood and respected through the sharing of emotions, information, and materials, thereby attaining a sense of identity and belonging14. The rapid proliferation of the internet and social media has increased online interaction among college students, making online social support an important channel for them to obtain emotional support and exchange information15,16. Studies have shown that online social support is associated with fewer negative outcomes17. Increased online social support can reduce levels of loneliness and improve mental health to some extent18,19,20. Therefore, it is crucial to select appropriate methods to accurately and efficiently assess loneliness and online social support among college students.

Existing assessments often rely on validated questionnaires. The ULS-6 (University of California Los Angeles Loneliness Scale-6) is a unidimensional scale with good reliability and validity, capable of concisely and swiftly measuring loneliness in the Chinese adult population, including college students21. The OSSS-CS (Chinese Youth Version of the Online Social Support Scale) is specifically designed for the Chinese adolescent population to assess the social support individuals receive in the online environment, and it has demonstrated good reliability and validity22,23. The ULS-6 and OSSS-CS, two widely recognized and validated questionnaires, provide in-depth insights into college students’ loneliness and how they seek and obtain social support in the online environment24. However, validated questionnaires have shortcomings in mental health assessment, including low response rates and efficiency, and susceptibility to the subjectivity of respondents25. With the advancement of artificial intelligence technology, ChatGPT, as an emerging assessment tool, shows great potential in the field of mental health; it may compensate for some of the shortcomings of validated questionnaires and provide more efficient and objective assessments.

ChatGPT is a GPT-based chatbot trained to generate human-understandable text from inputs (prompts). Since its release in November 2022, ChatGPT has rapidly garnered significant attention from society26. Data showed that ChatGPT attracted more than 1 million users within just 5 days of launch and reached 100 million monthly active users within just 2 months, making it the fastest-growing consumer app to date27. With its powerful language generation and comprehension capabilities, ChatGPT has already been applied to mental health-related tasks28. In the field of mental health, ChatGPT can be used to provide mental health education, conduct mental status assessments, and assist in the implementation of psychological interventions28,29. ChatGPT can at times predict mental health symptoms and diagnoses accurately28. When pre-trained with mental health information, ChatGPT has proven accurate to varying degrees in mental health assessments30,31. A study found that ChatGPT’s ability to assess suicide risk was similar to that of mental health professionals, especially when using the ChatGPT-4 version32. In addition, ChatGPT demonstrated greater accuracy in recognizing suicidal ideation32. These studies suggest that ChatGPT has potential in mental health assessment, especially in providing initial screening and intervention recommendations. However, current research exploring the evaluation capabilities of ChatGPT has mostly compared ChatGPT’s output with assessments from professionals, and rarely with the results of validated questionnaire assessments. Given that most current mental health assessments use validated questionnaires, it was necessary to compare the output of ChatGPT with the results of a validated questionnaire assessment. If ChatGPT can assist in conducting mental health assessments, it could improve the efficiency of mental health care.

This study combined the core elements and scoring criteria of the ULS-6 and OSSS-CS to pre-train ChatGPT-4, creating a structured interview questionnaire. Participants were invited to complete both the ChatGPT-created questionnaire and the validated questionnaires. By comparing the consistency of scores from ChatGPT and the validated questionnaires, we evaluated the validity of ChatGPT as a mental health assessment tool for measuring college students’ loneliness and online social support. This study has significant implications for applying ChatGPT to the field of mental health assessment, especially the assessment of loneliness and online social support in the college student population. First, the study provides an innovative approach to mental health assessment, allowing ChatGPT to assist healthcare professionals in conducting assessments more efficiently. Second, the accessibility of ChatGPT enables more people to overcome time and geographic constraints and access real-time mental health assessment services.

Methods

Participants

From June 2024 to August 2024, 220 college students were enrolled in a cross-sectional study at several universities and colleges in Wuhan, Hubei Province, China, using a convenience sampling method. The inclusion criteria were: (1) aged 18–30 years; (2) current students with regular full-time enrollment (specifically including junior college students, undergraduates, and master’s and doctoral degree students); (3) provided informed consent and were willing to participate in this study.

Materials and measures

We designed an information sheet to gather demographic data from participants, including gender, age, experience as a student leader, high school location, father’s education, mother’s education, family monthly income, and monthly living expenses.

ChatGPT-4 was pre-trained using the items from the validated questionnaires (ULS-6 and OSSS-CS) and tailored to reflect the daily lives of college students. Specifically, the complete items of these questionnaires were input into ChatGPT-4, and the model was instructed to generate new questions more relevant to the everyday experiences of college students. This process used the original questionnaire items as prompts to guide ChatGPT-4 in generating contextually appropriate questions, without additional fine-tuning. The ChatGPT-4-generated questionnaire was systematically adapted from the original items to align with collegiate cultural and linguistic contexts. To ensure contextual appropriateness, the generated questions were reviewed by three researchers specializing in psychological assessment, who confirmed that the language, examples, and relevance aligned with the daily experiences of university students (e.g., academic workload, social media interactions, and campus life). Although no formal quantitative evaluation was conducted, this expert review process aimed to enhance the ecological validity of the AI-generated questions. The output of ChatGPT-4 was then adjusted by computer experts to ensure effective correspondence with the validated questionnaires. ChatGPT-4 constructed four scenarios to assess college students’ loneliness and online social support: campus activities, online recognition, online interactions, and online help-seeking (see Supplementary Information). These scenarios covered most of the items in the ULS-6 and OSSS-CS. It is worth noting that although the ChatGPT-4 questionnaire was generated from the two validated questionnaires, its question content differed from both.
These differences mainly took the form of changes in wording and presentation; the ChatGPT-4 questionnaire was tailored to the cultural background of college students, which helps enhance their understanding and participation. To better accommodate Chinese-speaking college students, the items were developed, answered, and evaluated in Chinese. After participants completed these scenarios in ChatGPT-4, their responses were automatically scored by the software based on the criteria of the ULS-6 and OSSS-CS. This automated scoring process ensured that the responses were evaluated for both accuracy and relevance.

The ULS-6 is a valid instrument for measuring loneliness in the Chinese adult population, with good reliability and applicability, and can be administered concisely and quickly. The ULS-6 scale was validated in Chinese adults with a Cronbach’s α coefficient of 0.89133. The questionnaire is composed of 6 items; subjects rated the frequency of the 6 symptoms over the past 3 weeks on a 4-point Likert scale (1 = never, 2 = sometimes, 3 = rarely, and 4 = often). Total ULS-6 scores range from 6 to 24, with higher scores indicating greater loneliness.

The OSSS-CS, a scale developed specifically for Chinese adolescents to assess the level of social support they receive online, has been validated in the Chinese adolescent population. The scale and each of its dimensions showed good internal consistency, with an overall Cronbach’s α coefficient of 0.969 and dimension-level coefficients of 0.898 to 0.93522. The OSSS-CS is composed of four dimensions: esteem/emotional support (EE), social companionship (SC), informational support (INF), and instrumental support (INS), each consisting of five items. Subjects rate each item on a 5-point Likert scale based on their experiences within the past 2 months, reflecting the frequency of social support they have received from interacting with others online (1 = never, 2 = sometimes, 3 = rarely, 4 = often, and 5 = always). Total scores range from 20 to 100, with higher scores indicating that individuals receive more social support online.

Data collection

First, college students were asked to complete the assessment for the four scenarios adjusted by ChatGPT-4, and ChatGPT-4 output a score for each item as the ChatGPT-created questionnaire scores. Additionally, college students were asked to complete the validated ULS-6 and OSSS-CS questionnaires to produce the validated questionnaire scores. All data were collected anonymously, and the investigators were not aware of the participants’ responses. Participants had the right to withdraw from the study at any time without suffering any adverse consequences.

Statistical analysis

The collected data were statistically analyzed using SPSS 27.0 and Origin 2024. A p value of < 0.05 was considered statistically significant. In this study, Cronbach’s α coefficient was calculated to assess the internal consistency of the ChatGPT-created questionnaire and the validated questionnaires.
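The internal-consistency computation can be sketched in Python (an illustrative re-implementation of the standard formula; the study itself used SPSS 27.0, and the score matrix here is hypothetical):

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha for an (n_respondents, n_items) score matrix."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]                         # number of items
    item_vars = items.var(axis=0, ddof=1)      # sample variance of each item
    total_var = items.sum(axis=1).var(ddof=1)  # variance of the total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical responses: 5 respondents x 3 items
responses = np.array([[1, 2, 1], [2, 2, 3], [3, 4, 3], [4, 3, 4], [4, 4, 4]])
print(f"alpha = {cronbach_alpha(responses):.2f}")
```

Perfectly correlated items yield α = 1; values of 0.87 (GPT-ULS-6) and 0.96 (GPT-OSSS-CS) reported later in the paper indicate high internal consistency.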

Spearman correlation analysis was used to explore the correlation between the ChatGPT-created questionnaire and the validated questionnaires. Correlation (r) levels were defined as: high (r ≥ 0.6), moderate (0.3 ≤ r < 0.6), low (r < 0.3)34. We hypothesized that the scores from ChatGPT-4-generated questionnaires would exhibit strong positive correlations (r > 0.6) with the validated ULS-6 and OSSS-CS. Spearman’s correlation was chosen because it is a non-parametric measure that assesses the monotonic relationship between two variables without assuming a normal distribution, making it suitable for non-normally distributed continuous data35.
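This analysis can be sketched as follows (the paired totals below are simulated for illustration only; the real data came from the 216 participants):

```python
import numpy as np
from scipy.stats import spearmanr

def correlation_level(r: float) -> str:
    """Classification thresholds used in this study (ref. 34)."""
    if abs(r) >= 0.6:
        return "high"
    if abs(r) >= 0.3:
        return "moderate"
    return "low"

# Simulated paired ULS-6 totals (range 6-24) with noisy agreement
rng = np.random.default_rng(42)
gpt_totals = rng.integers(6, 25, size=216)                       # ChatGPT-created
validated = np.clip(gpt_totals + rng.integers(-3, 4, 216), 6, 24)  # validated

r, p = spearmanr(gpt_totals, validated)  # rank-based, no normality assumption
print(f"r = {r:.2f} ({correlation_level(r)}), p = {p:.3g}")
```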

Intraclass correlation coefficients (ICC) were computed for each item and the total score to assess agreement between the ChatGPT-created questionnaire and the validated questionnaires. ICC was selected because it is a widely used measure for assessing the reliability and consistency of continuous data, providing valuable insights into the agreement between measurements even when the data do not strictly follow a normal distribution36. The analysis employed a one-way random effects model with absolute agreement based on single measurements. The ICC estimates and their 95% confidence intervals (CI) could range from 0 to 1, where 0 signifies no agreement and 1 denotes perfect agreement. Concerning interpretation, ICC values < 0.5, between 0.5 and 0.75, between 0.75 and 0.9, and > 0.9 indicate poor, moderate, good, and excellent reliability, respectively37.
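A minimal sketch of this model (ICC(1,1): one-way random effects, absolute agreement, single measurement) could look like the following, treating the two questionnaires as two "raters" of each participant:

```python
import numpy as np

def icc_oneway(ratings: np.ndarray) -> float:
    """ICC(1,1): one-way random effects, absolute agreement, single measurement.

    ratings: (n_subjects, k_raters) matrix, e.g. one column per questionnaire.
    """
    ratings = np.asarray(ratings, dtype=float)
    n, k = ratings.shape
    subject_means = ratings.mean(axis=1)
    grand_mean = ratings.mean()
    # Between-subjects and within-subjects mean squares
    msb = k * np.sum((subject_means - grand_mean) ** 2) / (n - 1)
    msw = np.sum((ratings - subject_means[:, None]) ** 2) / (n * (k - 1))
    return (msb - msw) / (msb + (k - 1) * msw)

# Hypothetical paired totals: column 0 = ChatGPT-created, column 1 = validated
pairs = np.array([[8., 10.], [12., 13.], [18., 17.], [6., 8.]])
print(f"ICC = {icc_oneway(pairs):.2f}")
```

Identical columns give ICC = 1; a constant offset between the two questionnaires lowers the ICC, because this form measures absolute agreement rather than mere consistency.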

Consistency was quantitatively assessed using Bland–Altman plots. By plotting the differences in scores between the ChatGPT-created questionnaire and the validated questionnaires (ULS-6 and OSSS-CS), the bias and 95% limits of agreement can be visualized. On these scatter plots, the y-axis represents the difference between scores from the ChatGPT-created questionnaire and the validated questionnaires, with a positive difference indicating that the ChatGPT-created questionnaire scored higher than the validated questionnaire. The x-axis represents the mean of the scores of the two questionnaires. According to the Bland–Altman analysis, at least 95% of the difference values should fall within the limits of agreement, which are then compared with a professionally acceptable limit. If the limits of agreement are clinically acceptable, the agreement between the two methods can be considered good and the methods used interchangeably.
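The quantities behind the plot can be sketched as follows (an illustrative re-implementation; the plots themselves were produced in Origin 2024). As a check, the bias of −2.02 and SD of 2.68 reported in the Results give limits of −2.02 ± 1.96 × 2.68 ≈ (−7.27, 3.23):

```python
import numpy as np

def bland_altman(gpt_scores, validated_scores):
    """Bias and 95% limits of agreement between two measurement methods."""
    a = np.asarray(gpt_scores, dtype=float)
    b = np.asarray(validated_scores, dtype=float)
    diff = a - b            # y-axis: positive = ChatGPT scored higher
    mean = (a + b) / 2      # x-axis: average of the two methods (for plotting)
    bias = diff.mean()
    sd = diff.std(ddof=1)
    lower, upper = bias - 1.96 * sd, bias + 1.96 * sd
    frac_within = np.mean((diff >= lower) & (diff <= upper))  # expect >= 0.95
    return bias, (lower, upper), frac_within
```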

The percentage of dimension/total scores reaching the highest or lowest possible score was calculated to evaluate ceiling and floor effects. If the proportion exceeds 15%, the dimension or total score is considered to have a ceiling or floor effect35.
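This check reduces to counting scores at the scale bounds, sketched below with hypothetical data (for the real ULS-6 data, 13.40% at the floor stays under the 15% threshold, as reported in the Results):

```python
import numpy as np

def floor_ceiling(scores, min_possible, max_possible, threshold=0.15):
    """Proportions at the lowest/highest possible score; effect flagged if > 15%."""
    scores = np.asarray(scores)
    floor = float(np.mean(scores == min_possible))
    ceiling = float(np.mean(scores == max_possible))
    return {
        "floor_pct": floor,
        "ceiling_pct": ceiling,
        "floor_effect": floor > threshold,
        "ceiling_effect": ceiling > threshold,
    }

# Hypothetical ULS-6 totals (possible range 6-24)
print(floor_ceiling([6, 6, 6, 24, 10, 10, 10, 10, 10, 10], 6, 24))
```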

Ethical statement

This research protocol was conducted in accordance with the Declaration of Helsinki and received ethical approval from the Nursing College at Hubei University of Chinese Medicine. All participants clearly understood the purpose of the study and signed an informed consent form before participating. We took stringent measures to ensure that participants’ information could not be accessed by third parties or individuals not involved in this research project. In addition, a series of measures were implemented to protect participants’ information. First, we anonymized all data collected from participants to remove any personally identifiable information. Second, only authorized personnel involved in the research project had access to the data. Finally, we adhered to relevant data protection regulations to ensure that all data handling practices were compliant.

Results

A total of 220 college students were recruited; after excluding unanswered questionnaires, 216 valid questionnaires were returned, an effective response rate of 98.18%. All 216 college students were able to complete the administered assessments without assistance. Ages ranged from 18 to 27 years, with a mean of 21.09 years (SD = 1.28), and 149 (69.00%) of the participants were female (Table 1). The median results of the questionnaires, with interquartile ranges and total score ranges, were as follows: GPT-ULS-6: 8 (IQR 6–11 [range 6–19]); ULS-6: 11.50 (IQR 8–13 [range 6–20]); GPT-OSSS-CS: 67 (IQR 56–74 [range 27–100]); OSSS-CS: 67.50 (IQR 58–78 [range 33–100]).

Table 1 Socio-demographic characteristics of college students (n = 216).

Spearman correlation analysis between the ChatGPT questionnaire and the validated loneliness and online social support questionnaires revealed positive intra- and inter-correlations. The analysis showed a high correlation between the total scores of the ChatGPT questionnaire and the validated questionnaires: ULS-6 (r = 0.64, p < 0.001) and OSSS-CS (r = 0.89, p < 0.001), both consistent with our hypothesis. Most of the other items on the ChatGPT questionnaire were also significantly correlated with the validated questionnaires. Moreover, the correlation for the OSSS-CS was stronger than that for the ULS-6 (Fig. 1).

Fig. 1
figure 1

Spearman correlation matrix of the ChatGPT questionnaire and the validated loneliness and online social support questionnaire scores: the heatmap visualizes the Spearman correlation matrix for the ChatGPT questionnaire and the validated questionnaire scores. Heatmap (a) compares the Spearman correlation coefficients and the degree of correlation between GPT-ULS-6 and ULS-6; heatmap (b) compares the Spearman correlation coefficients and the degree of correlation between GPT-OSSS-CS and OSSS-CS. Each cell in the heatmap represents the Spearman correlation coefficient between two variables, with the color intensity and the value indicating the strength and direction of the correlation. Asterisks indicate significant values: *p < 0.05, **p < 0.01, ***p < 0.001.

The Cronbach’s α for the GPT-ULS-6 scale was 0.87, indicating good reliability. For the GPT-OSSS-CS, the overall Cronbach’s α was 0.96, also demonstrating good internal consistency.

The ICC estimates and the corresponding 95% CI can be found in Table 2. The ICC for the ULS-6 was 0.81 (95% CI 0.75–0.85; p < 0.001), indicating slightly lower consistency compared to the OSSS-CS, which had an ICC of 0.95 (95% CI 0.94–0.96; p < 0.001). These findings suggest good agreement between the total scores of the ChatGPT questionnaire and those of the validated questionnaires.

Table 2 ICC to evaluate the consistency of the total scores between the ChatGPT-created questionnaire and the validated questionnaires.

Bland–Altman plots showed that the ULS-6 (limits of agreement, 95% CI − 7.27 to 3.23) and OSSS-CS (limits of agreement, 95% CI − 12.95 to 10.08) demonstrated adequate and acceptable concordance at the 95% level, suggesting that these scales could be used interchangeably in clinical settings or research. A slight bias was observed between the GPT-ULS-6 and ULS-6, with a mean difference of − 2.02 (SD: 2.68), and between the GPT-OSSS-CS and OSSS-CS, with a mean difference of − 1.44 (SD: 5.88) (Fig. 2).

Fig. 2
figure 2

(a) Bland–Altman plot of difference between GPT-ULS-6 and ULS-6 change total scores (n = 216). (b) Bland–Altman plot of difference between GPT-OSSS-CS and OSSS-CS change total scores (n = 216).

For the ULS-6, 13.40% of participants had the lowest possible score and 0.40% the highest; for the OSSS-CS, 0.40% had the lowest score and 1.30% the highest; for the ChatGPT-ULS-6, 11.50% had the lowest score and 0.40% the highest; and for the ChatGPT-OSSS-CS, 0.40% had the lowest score and 0.40% the highest. None of the four questionnaires showed a notable floor or ceiling effect (all proportions < 15%).

Discussion

This study compared scoring consistency between ChatGPT and the validated questionnaires in assessing college students’ loneliness and online social support. The results showed that ChatGPT may be effective in assessing loneliness and online social support among college students, and may also be applicable to the assessment of a broader range of mental health dimensions.

In assessing loneliness and online social support among college students, we found acceptable, significant consistency between the results of ChatGPT and the validated questionnaires, which is consistent with previous research38. This consistency may be related to two factors. First, ChatGPT adapted the questionnaire to the purpose of the study and the characteristics of the college student population, ensuring that the questionnaire was both relevant to loneliness and online social support items and appropriate for college students, with high reliability. Second, college students typically have more open and positive attitudes toward emerging technologies39. One study found that perceived usefulness plays a central role in shaping students’ trusting attitudes toward ChatGPT and is critical in predicting and encouraging successful technology adoption40. Higher acceptance of ChatGPT among college students may lead them to hold positive psychological expectations of it, favoring adoption of the new technology40,41. College students may be more willing to accept and trust an artificial intelligence tool like ChatGPT. This trust may motivate them to be more open in their interactions with ChatGPT, thus allowing ChatGPT to more accurately assess their level of loneliness and online social support. The results suggest that the ChatGPT-created questionnaire, specifically designed to assess loneliness and online social support among college students, shows promise in generating relevant items for mental health evaluations, though further validation is needed to establish its independent validity.

The questionnaire scores showed lower assessment scores on ChatGPT compared to the validated questionnaires, which may indicate that ChatGPT tends to reflect lower levels of loneliness when measuring loneliness and online social support among college students. This difference may stem from ChatGPT’s unique assessment methodology or from the positive experiences college students have when interacting with ChatGPT42,43. First, ChatGPT’s interactive design and natural language processing capabilities allow it to communicate with users in a way that is closer to everyday conversation44. This interactivity may make respondents feel more comfortable and at ease, and thus more willing to share their true feelings45. One study noted that ChatGPT showed positive results in increasing students’ behavioral engagement, which may indicate that ChatGPT is more effective in promoting student engagement and interaction when assessing college students’ loneliness and online social support45. Second, ChatGPT’s assessment may be more focused on respondents’ immediate responses, which may allow it to capture more subtle emotional changes when assessing loneliness and online social support46. This real-time feedback mechanism may allow ChatGPT to more accurately reflect respondents’ psychological states than traditional questionnaires in some cases47. Overall, ChatGPT’s lower scores in assessing college students’ feelings of loneliness and online social support may not mean that it underestimates these feelings, but rather that it provides a more interactive and personalized assessment, which may be more in line with the communication habits and psychological needs of the college student population. This difference emphasizes the importance of considering the characteristics of the target group when choosing an assessment tool.

The consistency in scores between ChatGPT and the validated questionnaires when assessing loneliness and online social support among college students suggests that ChatGPT may offer a level of reliability in the realm of mental health assessment. This consistency suggests that ChatGPT may be useful as a complementary tool, especially in capturing subtle changes in an individual’s psychological state48. Its real-time feedback and adaptive nature may provide a nuanced assessment perspective in specific contexts, offering a possible resource for clinical practice49. At the same time, the convenience and accessibility of ChatGPT may encourage more individuals to seek help, opening up new possibilities for mental health service delivery47. These findings provide a new perspective from which to explore the use of ChatGPT in mental health.

Additionally, the high correlation between the ChatGPT-created questionnaire and the validated questionnaires may be influenced by the pre-training process, which aligned the AI-generated content with the constructs measured by existing questionnaires. While this ensures the relevance of the generated questions, it also introduces a potential tautological relationship. Future research should explore additional validation methods to assess the unique contributions of AI-generated questionnaires, ensuring that these tools provide meaningful insights and enhance the measurement of psychological constructs.

Limitations

There are potential limitations of this study. First, due to time and effort limits, this study only surveyed college students in Wuhan, Hubei Province, China, resulting in a small sample size and potentially insufficient representation, which may affect the generalizability of the results. Future studies should expand the sample range, considering various demographic characteristics, to verify the validity and applicability of the ChatGPT multidimensional assessment in a broader social context.

Second, current ChatGPT-4 access is limited by a paywall, and access to the professional version of Bing BETA requires a $20 per month subscription fee, which may hinder its promotion in clinical assessments in China. Third, this study did not ask the respondents about the feasibility of the ChatGPT-4-created versus the validated questionnaires, nor did it ask which one was better and more practical for them. Future studies should fully consider the respondents’ feedback. Fourth, while we claimed that the ChatGPT-4-generated questions were contextually appropriate for college students, this assessment relied on expert judgment rather than standardized quantitative measures. Future research should employ systematic approaches (e.g., cognitive interviews or item relevance ratings) to objectively evaluate contextual appropriateness. Additionally, because our study did not include subgroups of college students with varying levels of loneliness and online social support, it was not possible to measure known-group validity; nor did we have the opportunity to assess responsiveness, given the cross-sectional nature of this study. Future research should emphasize known-group validity and responsiveness in psychometric assessments.

Conclusions

The results of this cross-sectional study indicated substantial agreement between the questionnaire created by ChatGPT and the validated questionnaires assessing loneliness and online social support among college students. This finding not only heralds the great potential of ChatGPT technology in mental health assessment, but also provides new ideas for our future research directions. In the future, we expect ChatGPT to be further integrated into more clinical assessments, improve the efficiency and accuracy of assessments through intelligent questionnaire design, and provide more comprehensive and in-depth data support for mental health research and clinical practice. At the same time, we should also think deeply about how to make better use of ChatGPT technology to meet the mental health assessment needs of different populations and situations, and to promote the personalized and precise development of mental health assessment.