To be valid, the result of a test has to accurately (and therefore also reliably) answer the question that the test is intended to answer.
Reliability refers to the consistency of a measuring instrument. A consistent instrument can still measure the wrong thing, so reliability is necessary but not sufficient for validity. The inverse of reliability is "random error." Reliability in any area of medicine is affected by 1) how well patients present a sign or symptom (e.g. consistent answers, quality of an X-ray) and 2) how well the examiner evaluates the answer or the result.
- Intrarater reliability requires that the same rater “blindly” review the same patient material two or more times.
- Interrater reliability requires that two or more different raters review the same patient material.
- Test-retest reliability is the most inclusive measure of reliability: the same patient is observed separately by two or more raters within a close interval. It captures the variability created by the patient's inconsistent report (internal consistency reliability) as well as the rater variability.
The measure of reliability for a diagnosis is called the intraclass kappa coefficient, a measure of how well a first diagnosis (Patient A, Clinician A) predicts a second independent one (Patient A, Clinician B). Cohen’s kappa is a measure of validity.
A kappa of 1.0 means a 100% match in the diagnosis between clinicians, while a kappa of 0 means agreement no better than chance. For example, the diagnosis of skin and soft-tissue infections has kappa values between 0.39 and 0.43, and the assessment of hand films for osteoarthrosis yields κ = 0.54.
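The chance correction behind kappa can be sketched in a few lines. The following is a minimal illustration, assuming a binary diagnosis (present or absent) and two interchangeable ratings per patient, with chance agreement taken from the pooled prevalence. The function name `intraclass_kappa` and the ratings are hypothetical and are not taken from any study.

```python
from typing import Sequence


def intraclass_kappa(first: Sequence[int], second: Sequence[int]) -> float:
    """Chance-corrected agreement for a binary diagnosis rated twice per patient.

    first[i] and second[i] are the two independent ratings for patient i
    (1 = diagnosis present, 0 = diagnosis absent).
    """
    if len(first) != len(second) or not first:
        raise ValueError("need two non-empty rating lists of equal length")
    n = len(first)
    # Observed agreement: share of patients given the same rating both times.
    p_observed = sum(a == b for a, b in zip(first, second)) / n
    # Pooled prevalence: the two ratings are treated as interchangeable
    # (an assumption of this sketch).
    prevalence = (sum(first) + sum(second)) / (2 * n)
    # Agreement expected by chance if both ratings were independent draws.
    p_chance = prevalence ** 2 + (1 - prevalence) ** 2
    if p_chance == 1.0:
        raise ValueError("kappa is undefined when every rating is identical")
    return (p_observed - p_chance) / (1 - p_chance)


# Hypothetical ratings for ten patients (illustrative only).
clinician_a = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
clinician_b = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
print(round(intraclass_kappa(clinician_a, clinician_b), 3))  # 0.583
```

In this made-up example the clinicians agree on 8 of 10 patients, but because 52% agreement would be expected by chance at this prevalence, the chance-corrected value is only about 0.58.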
The DSM-5 field trials were designed to evaluate the reliability, utility, and, where possible, convergent validity of the proposed criteria. To accomplish this, patients were evaluated by two clinicians from various mental health disciplines. This approach permitted the estimation of test-retest reliability. The trials involved evaluating over 2,200 patients at 11 field trial sites (2). The goal was to establish an intraclass kappa between 0.4 and 0.6; values of κ > 0.2 were considered acceptable.
In psychiatry there is a particular issue with assessing the validity of the diagnostic categories themselves. In this context:
- Construct validity includes (at least) content, discriminant, and convergent validity. It answers the question: Are we measuring what we want to measure?
- Content validity refers to symptoms and diagnostic criteria, i.e. how well the test measures all aspects of a phenomenon.
- Discriminant/convergent validity assesses whether symptoms or properties that are supposed to be different/the same are found to be different/the same with a particular test.
- Concurrent validity refers to whether a test correlates well with a measure that has previously been validated.
- Predictive validity refers mainly to diagnostic stability over time.
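As an illustration of the concurrent validity item above, one common way to check whether a new test "correlates well" with an established one is a Pearson correlation between the two sets of scores. The sketch below assumes continuous scores from both instruments on the same patients; the function name `pearson_r` and the scores are hypothetical, and other agreement statistics may be more appropriate for categorical diagnoses.

```python
from math import sqrt
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation between paired scores x[i], y[i]."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two score lists of equal length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of cross-products of deviations from the means.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Sums of squared deviations for each instrument.
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)


# Hypothetical scores for six patients on a new scale and on a
# previously validated scale (illustrative only).
new_scale = [4, 7, 9, 12, 15, 18]
validated_scale = [5, 6, 10, 11, 16, 17]
print(round(pearson_r(new_scale, validated_scale), 2))
```

A value close to 1 would suggest that the new scale ranks patients much as the validated one does; what counts as "well" depends on the instruments being compared.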
1. Kraemer HC. Moving Toward DSM-5: The Field Trials. Am J Psychiatry 167:10 (2010)
2. Clarke DE et al. DSM-5 Field Trials in the US and Canada, Part I: Study Design, Sampling Strategy, Implementation and Analytic Approaches. Am J Psychiatry 170:1 (2013)
3. Kendell R, Jablensky A. Distinguishing Between the Validity and Utility of Psychiatric Diagnoses. Am J Psychiatry 160:1 (2003), 4-12