In the past few decades, there has been a proliferation of self‐administered questionnaires designed to assist clinicians in improving the identification of various disorders, and researchers in estimating disorder prevalence rates in community‐based epidemiological settings. Most of these questionnaires focus on a single disorder, such as major depressive disorder, bipolar disorder, or generalized anxiety disorder. A minority evaluate a range of the most common disorders encountered in outpatient mental health settings.
Self‐administered questionnaires are not a substitute for an interviewer‐based diagnostic evaluation. They are screening instruments, and their use represents the first phase of a two‐stage diagnostic procedure. The purpose of a screening test is to cast a broad net to ensure that most patients with the disorder are captured in that net. Thus, a screening test is intended to reduce the frequency of missed diagnoses. That test is expected to be followed by a more definitive diagnostic assessment, an evaluation that is generally more expensive and/or invasive than the screening procedure. In psychiatry, a self‐administered screening questionnaire is intended to be followed by a diagnostic interview. In studies of the performance of screening questionnaires, a semi‐structured interview is the usual “gold standard”.
The two most commonly reported statistics when describing the performance of a screening measure are sensitivity and specificity. Sensitivity refers to the probability that a person who has the disorder screens positive on the test, whereas specificity refers to the probability that a person who does not have the disorder screens negative. Two other statistics important in understanding a screening test's clinical utility are positive and negative predictive value. Positive predictive value refers to the probability that a person who screens positive on the test has the disorder. Negative predictive value refers to the probability that a person who screens negative on the test does not have the disorder. Positive and negative predictive values are less commonly used to describe a screening test's performance, because these statistics are influenced by the prevalence of the disorder in the sample studied.
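These relationships can be written out explicitly. If Se denotes sensitivity, Sp specificity, and p the prevalence of the disorder in the population being screened, Bayes' theorem gives

\[ \mathrm{PPV} = \frac{Se \cdot p}{Se \cdot p + (1 - Sp)(1 - p)}, \qquad \mathrm{NPV} = \frac{Sp \cdot (1 - p)}{Sp \cdot (1 - p) + (1 - Se) \cdot p}. \]

Because p appears in both expressions, the same test can have a high positive predictive value in a specialty clinic where the disorder is common and a very low one in the general population where it is rare.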
From a clinical perspective, it is most important that a screening measure have good sensitivity and a correspondingly high negative predictive value. With a high negative predictive value, the clinician can be confident that, when the test indicates that the disorder is not present, there is little need to inquire about that disorder's presence. False positives (i.e., persons who screen positive but do not have the disorder) are less of a problem for a screening questionnaire, because their major cost is the time a clinician takes to determine that the disorder is not present. Presumably, this is time clinicians would have spent for the same purpose anyway had they conducted a comprehensive interview.
From an epidemiological perspective, it is most important that the self‐administered questionnaire provides an accurate estimate of the presence of a given condition. However, when questionnaires are used in this manner, the studies should refer to the prevalence of symptoms rather than disorder (e.g., prevalence of depressive symptoms rather than depressive disorder). The prevalence of the disorder should be assessed by the subsequent use of a diagnostic interview.
Self‐administered questionnaires for psychiatric disorders yield a continuous distribution of scores, and the developers of these instruments typically recommend a cutoff score to identify individuals who have screened positive. A major problem with the research using these questionnaires is that many scale developers take a case‐finding rather than a screening approach in deriving the cutoff score to indicate which patients have screened positive. That is, investigators select the cutoff that maximizes agreement with a diagnostic standard (such as a semi‐structured interview). From a screening viewpoint, a more appropriate approach is to select a cutoff that prioritizes a scale's sensitivity, so that diagnoses are not missed. A review of 68 reports of the performance of the three most researched screening scales for bipolar disorder found that only 11 (16.2%) studies recommended a cutoff that prioritized the scale's sensitivity 1 .
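To make the distinction concrete, the sketch below (hypothetical data and function names, not drawn from any of the cited studies) contrasts a case‐finding cutoff, chosen here by maximizing Youden's index (one common way of operationalizing agreement with the diagnostic standard), with a screening cutoff, chosen as the highest score that still keeps sensitivity at or above 90%:

import numpy as np

def sens_spec(scores, diagnosis, cutoff):
    """Sensitivity and specificity when 'screen positive' means score >= cutoff."""
    positive = scores >= cutoff
    sensitivity = positive[diagnosis == 1].mean()      # true cases screening positive
    specificity = (~positive)[diagnosis == 0].mean()   # non-cases screening negative
    return sensitivity, specificity

def choose_cutoffs(scores, diagnosis, min_sens=0.90):
    stats = [(c, *sens_spec(scores, diagnosis, c)) for c in np.unique(scores)]
    # Case-finding: maximize Youden's index (sensitivity + specificity - 1)
    case_finding = max(stats, key=lambda t: t[1] + t[2] - 1)[0]
    # Screening: highest cutoff whose sensitivity still meets the target
    screening = max(c for c, se, sp in stats if se >= min_sens)
    return case_finding, screening

# Hypothetical example: questionnaire scores and interview-based diagnoses
rng = np.random.default_rng(0)
diagnosis = rng.integers(0, 2, size=500)
scores = np.round(rng.normal(10 + 4 * diagnosis, 3))
print(choose_cutoffs(scores, diagnosis))

When the agreement‐maximizing cutoff has a sensitivity below the target, the screening cutoff will be lower, deliberately trading some specificity for fewer missed diagnoses.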
The failure to appreciate the difference between case‐finding and screening has led to inappropriate conclusions from studies using screening measures as diagnostic proxies. For example, a study of the impact of borderline personality disorder (BPD) on the response of depressed patients to electroconvulsive therapy (ECT) 2 used the McLean Screening Instrument for BPD (MSI‐BPD) to “diagnose” the personality disorder. A summary of the performance of the MSI‐BPD found that, across studies, the scale had a sensitivity of 80% and a specificity of 66% at the cutoff recommended by the scale's developers 3 . When this is taken into account, along with the prevalence of BPD in the sample, an analysis of this study suggested that the majority of the patients whom the authors considered to have BPD would not receive this diagnosis if administered a diagnostic interview. In other words, the screening scale's positive predictive value was well below 50%. Therefore, valid conclusions about the efficacy of ECT in depressed patients with comorbid BPD cannot be drawn from a study using a screening measure to “diagnose” the personality disorder.
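The arithmetic behind this conclusion follows from the formula above. With a sensitivity of 80% and a specificity of 66%, the positive predictive value falls below 50% whenever the prevalence p of BPD in the sample is below about 30%:

\[ \frac{0.80\,p}{0.80\,p + 0.34\,(1-p)} < 0.5 \;\Longleftrightarrow\; 0.80\,p < 0.34\,(1-p) \;\Longleftrightarrow\; p < \frac{0.34}{1.14} \approx 0.30. \]

For instance, at an illustrative prevalence of 20%, the positive predictive value is about 0.37, meaning that roughly two of every three screen positives would not receive the diagnosis on interview (the exact prevalence in the ECT sample is not reproduced here, so these numbers are illustrative only).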
Studies of community samples have used screening questionnaires for bipolar disorder to estimate the prevalence of this disorder, the psychosocial morbidity associated with it, the frequency of the disorder's underdiagnosis, and the frequency of its undertreatment with mood stabilizers and overtreatment with antidepressants 4 , 5 . None of these studies discuss the limited positive predictive value of bipolar disorder screening scales in the general population 6 . None of these reports note that most individuals who screened positive for bipolar disorder would not be diagnosed with the disorder if interviewed (because the positive predictive value is less than 50%). Thus, the public health concerns raised in the discussion sections of these studies are based on misconstruing screening instruments as diagnostic measures.
More recently, online surveys based on self‐administered questionnaires have been used to assess the psychological impact of the COVID‐19 pandemic and of the public health restrictions imposed to limit the spread of infection. A PubMed search conducted on November 24, 2023 using the terms “COVID‐19” and “depression” yielded 16,026 citations. In almost all of these studies, depression was assessed by self‐administered questionnaires. The literature has been sufficiently extensive to generate meta‐analyses of the prevalence of depression during the pandemic in specific populations such as health care workers, pregnant women, and college students. Similarly, meta‐analyses of the prevalence of depression have been conducted in various geographic regions of the world and have examined factors impacting that prevalence. The results of these studies have been used to influence public health discussions related to the funding of mental health services. However, screening questionnaires for depression, such as the Patient Health Questionnaire‐9 (PHQ‐9) – the self‐report questionnaire most frequently used in these studies – significantly overestimate the prevalence of depression compared to diagnostic interviews 7 . Again, this is not a problem with the questionnaires themselves, which are designed to identify individuals who might have a disorder, while the subsequent use of a diagnostic interview is expected to distinguish true cases from false positives. The problem is with the interpretation of the results based on screening instruments.
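The direction of this bias follows from the same quantities. Under the simplifying assumption that a questionnaire's sensitivity Se and specificity Sp are constant, the expected proportion of screen positives is

\[ P(\text{screen positive}) = Se \cdot p + (1 - Sp)(1 - p), \]

which exceeds the true prevalence p whenever (1 − Sp)(1 − p) > (1 − Se)p, a condition that is easily met when the disorder is relatively uncommon and specificity is imperfect.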
When our clinical research group developed the Psychiatric Diagnostic Screening Questionnaire (PDSQ), we intended it as a diagnostic aid to be used in clinical practice to reduce underdiagnosis of disorders comorbid with the principal diagnosis and improve clinicians’ efficiency in conducting the initial diagnostic evaluation 8 . Consequently, we recommended that a cutoff resulting in a sensitivity of 90% be chosen when using the scale in clinical practice, rather than a cutoff that optimized agreement with a diagnostic standard.
The bottom line is that a self‐report questionnaire with high sensitivity and negative predictive value can be a valuable tool in clinical practice by guiding the clinician towards inquiry about disorders for which the patient screens positive (thereby reducing missed diagnoses) and identifying disorders that are unlikely to be present and thus require little or no inquiry (thereby saving the clinician time). In epidemiological studies, screening instruments can give accurate prevalence estimates if the cutoff point is appropriate, but the findings should be viewed as estimates of the prevalence of symptoms of a disorder rather than of the disorder itself. Using expressions such as “prevalence of depression” or “prevalence of anxiety” (instead of “prevalence of depressive symptoms” or “prevalence of anxiety symptoms”) may generate misunderstandings – in particular, an overestimation of the clinical and public health implications of the findings.
2. Yip AG, Ressler KJ, Rodriguez‐Villa F et al. J Clin Psychiatry 2021;82:19m13202.