Systematic review of the measurement properties of instruments utilized to diagnose Temporomandibular Disorders according to the COSMIN checklist

Marília Barbosa Santos Garcia,1 Ana Paula Amaral,1 Cid Andre Fidelis de Paula Gomes,2 Fabiano Politti,1 Daniela Aparecida Biasotto-Gonzalez,1 1Postgraduate Program in Rehabilitation Sciences, Núcleo de Apoio a Pesquisa em Análise do Movimento, Universidade Nove de Julho (UNINOVE), Rua Profa Maria Jose Barone Fernandes, 300, São Paulo, SP 02117-020, Brazil. mariliabsg@gmail.com; ap.fisioterapeuta@gmail.com; politti@uninove.br; dani_atm@uninove.br 2Postgraduate Program in Biophotonics Applied to Health Sciences, Universidade Nove de Julho (UNINOVE), Rua Vergueiro, 235, São Paulo, SP, 01504-001, Brazil. cid.andre@gmail.com


Introduction
Temporomandibular disorder (TMD) refers to a set of clinical problems involving articular noises, limited range of motion and/or deviations in mandibular function, preauricular pain, as well as pain in the temporomandibular joint (TMJ) and/or masticatory muscles 1.

pain" and "questionnaire" or "index" or "scale" or "score" or "assessment" or "evaluation" or "self-report" or "inventory" and "Brasil" or "Brazil" or "Portuguese" or "Brazilian-Portuguese" or "Brazilian". No restrictions were imposed regarding language or publication date. The last search was performed on December 17th, 2015.

Inclusion criteria
The study included instruments used in Brazil to diagnose TMD that had been submitted to a procedure for testing the measurement properties. Instruments were eligible regardless of sample characteristics (with TMD, without TMD, or both), population (adult or pediatric) and statistical analyses. Only studies published as complete texts were included.

Exclusion criteria
Texts that were part of theses or dissertations, conference abstracts and books were excluded.

Data extraction and assessment of the methodological quality of the eligible studies
Data referring to measurement properties were extracted from each study and analyzed according to the COSMIN checklist [10][11][12][13][14].
When assessing the quality of an instrument, one can distinguish three domains of quality: reliability, validity and responsiveness. Each domain contains one or more measurement properties. The reliability domain contains three measurement properties: internal consistency, reliability and measurement error. The validity domain also contains three measurement properties: content validity, construct validity and criterion validity. The responsiveness domain contains only one measurement property, also called responsiveness. Some measurement properties comprise one or more aspects, which were defined separately: content validity includes face validity; construct validity includes structural validity, hypotheses testing and cross-cultural validity [10][11][12][13][14].
According to COSMIN, the main measurement properties are the following: internal consistency, validity, reliability, measurement error, and ceiling and floor effects. Internal consistency is a measure of the homogeneity of an instrument and indicates the degree to which the items of a given instrument are correlated and thereby measure the same construct. Validity indicates whether the instrument assesses the construct it proposes to measure and can be evaluated as criterion validity (when a "gold standard" exists) or construct validity (when there is no "gold standard" for comparison). Reliability refers to the capacity of a test to obtain similar results for stable individuals. Measurement error is the systematic and random error in a patient's score that is not attributable to true changes in the construct measured. Ceiling and floor effects refer to the number of respondents who reached the maximum and minimum possible score, respectively [10][11][12][13][14]. The COSMIN checklist has 12 boxes, ten of which can be used to evaluate whether a study meets the requirements of adequate methodological quality. Nine of these boxes contain standards for measurement properties: Box A (internal consistency); Box B (reliability); Box C (measurement error); Box D (content validity, including face validity); Box E (structural

masticatory muscles 1.
The literature offers different instruments to diagnose TMD, with distinct categorizations: questionnaires, anamnestic indices, clinical indices and diagnostic criteria [2][3][4][5][6][7]. Most of the instruments used in Brazil to diagnose temporomandibular disorders (TMD) were developed in another language. To be used effectively, instruments created in another language must be translated into the target language and culturally adapted. Clinimetric tests should also be performed to evaluate the measurement properties. This procedure is fundamental due to the different customs, cultures, languages and perceptions of health found in different countries. A culturally adapted questionnaire overcomes linguistic and cultural barriers 8.
Assessment instruments are only useful and capable of providing scientifically robust results when they demonstrate satisfactory measurement properties; that is, when all the measurement properties have been tested with an adequate sample and the values obtained meet the quality criteria adopted for the clinimetric tests 8.
Despite the significant increase in the quantity of assessment scales and/or questionnaires, many of them have not been developed and/or validated appropriately 9.
Studies that evaluate the measurement properties of assessment tools should have a high degree of methodological quality. To evaluate the quality of such studies, criteria are needed to classify the study design and statistical analyses. The COnsensus-based Standards for the selection of health status Measurement INstruments (COSMIN) checklist provides these criteria [10][11][12][13][14]. This checklist was developed in an international, multidisciplinary study involving 43 specialists in measurement properties in the field of health [10][11][12][13][14]. According to the COSMIN group, studies assessing measurement properties should exhibit a high methodological quality in order to ensure appropriate conclusions concerning the validity of the instrument [10][11][12][13][14].
The aim of the present systematic review was to employ the COSMIN checklist to analyze the methodological quality of measurement properties of TMD assessment tools for use in Brazil.

Methods
The present study was a systematic review, which followed the recommendations of the PRISMA checklist. It was registered under number 2014 CRD42014014286 in PROSPERO (International prospective register of systematic reviews) and can be accessed at http://www.crd.york.ac.uk/PROSPERO/display_record.asp?ID=CRD42014014286. The details of the protocol of this systematic review can be accessed using the following link: http://www.crd.york.ac.uk/PROSPEROFILES/14286_PROTOCOL_20140920.pdf.

Study selection
Systematic searches were performed of the PUBMED, SCIELO, LILACS and SCIENCE DIRECT databases. The search terms and operators (AND, OR or NOT) used in the electronic databases were as follows: "temporomandibular disorder" or "temporomandibular dysfunction" or "temporomandibular

Braz J Oral Sci. 15(4):308-314

validity), Box F (hypotheses testing); Box G (cross-cultural validity), Box H (criterion validity) and Box I (responsiveness). Box J is used for the interpretability of the results of a given study. Box IRT is used for Item Response Theory and the Generalizability Box is used to assess whether the results of a study on one or more measurement properties can be generalized 13.
Part of the COSMIN group developed a rating scale to classify each measurement property as excellent, good, fair or poor based on the scores of the items in the corresponding box. The methodological quality of a box is defined by the lowest score of any item in that box. Thus, a box with some items classified as excellent or good but one item classified as poor is classified as having poor methodological quality ("worst score counts") 14.
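The "worst score counts" rule can be illustrated with a minimal sketch. The function name and the example item ratings below are hypothetical and serve only to show how a single low item rating determines the rating of the whole box.

```python
# Ratings ordered from worst to best, following the four-point scale
# described above (poor, fair, good, excellent).
RATING_ORDER = ["poor", "fair", "good", "excellent"]


def box_rating(item_ratings):
    """Return the overall rating of one checklist box.

    The overall rating is the worst rating found among the items
    of the box ("worst score counts").
    """
    if not item_ratings:
        raise ValueError("a box needs at least one rated item")
    # list.index raises ValueError for a label that is not on the scale.
    return min(item_ratings, key=RATING_ORDER.index)


# Hypothetical item ratings for one reliability box (Box B).
example_items = ["excellent", "good", "fair", "good"]
print(box_rating(example_items))  # fair
```

In this sketch a box with mostly good and excellent items is still rated fair, which mirrors how a moderate sample size alone can lower the classification of an otherwise well-conducted study.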
Data extraction and the evaluations were performed by a single rater and verified by an independent reviewer; the two then met to discuss the findings. No divergences of opinion were found between the rater and the independent reviewer.

Results
In total, 513 studies were found in the searches, although only 11 were considered eligible for the data analysis (Figure 1).
reduced version of the QAADO 25. A number of the measurement properties of all the instruments found in the searches were tested.
Table 1 displays the sample size, sample description and statistical values of each measurement property assessed in the studies included in this systematic review. Table 2 displays the assessments of measurement properties according to the COSMIN checklist [10][11][12][13][14].
In summary, the measurement properties reliability, internal consistency and content validity were tested for the RDC/TMD 15,16. Reliability and internal consistency were tested for the multimedia version of the RDC/TMD 17. Reliability, internal consistency and content validity were tested for the Mandibular Function Impairment Questionnaire (MFIQ) 18. Internal consistency and reliability were analyzed for the FAI 19. Only content validity was analyzed for the CR-10 for TMD 20. Internal consistency, reliability and validity were tested for the reduced version of the FAI 21. Internal consistency, reliability and criterion validity were analyzed for the self-report of oral conditions 22. Internal consistency, reliability and structural validity were tested for the Brasil-MOPDS 23. Only reliability was tested for the QAADO 24, whereas reliability, internal consistency and content validity were tested for the reduced version of the QAADO 25.
Lucena et al. 15 analyzed the internal consistency (Box A), reliability (Box B) and structural validity (Box E) of the RDC/TMD, for which the respective classifications were good, fair (based on the moderate sample size [n = 45], although other items were classified as good and excellent) and poor. The poor classification for structural validity was due to an error regarding the formulation of hypotheses, which were not described prior to testing the validity. Campos et al. 16 tested the internal consistency (Box A) and reliability (Box B) of the RDC/TMD, for which the respective classifications were good and fair (due to the sample size [n = 36]).
Campos et al. 19 analyzed the internal consistency (Box A) and reliability (Box B) of the Fonseca Index, for which the respective classifications were good (due to the failure to perform factor analysis) and fair (due to the moderate sample size [n = 40]). Ferreira-Bacci et al. 20 analyzed the content validity (Box D) of the CR-10 questionnaire used to measure pain associated with TMD, for which the classification was fair.

Cavalcanti et al. 17

Campos et al. 18 tested the measurement properties of the MFIQ. Internal consistency (Box A) was classified as good, with some items classified as excellent, such as the use of factor analysis. Reliability (Box B) was classified as good due to the adequate sample size (n = 62). Content validity (Box D, including face validity) was classified as excellent. This was the only study to employ Cronbach's alpha coefficient in combination with factor analysis. This combination is important because factor analysis allows the identification of subscales in a questionnaire, and Cronbach's alpha coefficient should be calculated separately for each subscale 26. The studies that evaluated the RDC/TMD and the multimedia version of the RDC/TMD calculated Cronbach's alpha coefficients but did not employ factor analysis.

In the 11 eligible articles, ten instruments were identified: the Research Diagnostic Criteria for Temporomandibular Disorders (RDC/TMD) 15,16; the multimedia version of the RDC/TMD 17; the Mandibular Function Impairment Questionnaire (MFIQ) 18; the Fonseca anamnestic index (FAI) 19; the category-ratio scale (CR-10) 20; the reduced version of the FAI 21; the self-report for oral conditions 22; the Brazilian version of the Manchester Orofacial Pain Disability Scale (Brasil-MOPDS) 23; a screening questionnaire for orofacial pain and temporomandibular disorders, recommended by the American Academy of Orofacial Pain (QAADO) 24; and the reduced version of the QAADO 25.

construct validity presented a good methodological quality.

The measurement properties reliability and internal consistency of the self-report of oral conditions were tested in the study by Pinelli and Loffredo 22. Reliability was classified as fair because the ICC was not used in the statistical analysis. Internal consistency was classified as poor because Cronbach's alpha was not calculated. The criterion validity (Box H) of the self-report of oral conditions was also tested. It was classified as fair because no gold standard questionnaire was used in the statistical comparison.

Discussion
The aim of the present study was to assess the quality of the measurement properties of instruments to diagnose temporomandibular disorder that had been translated into Portuguese.
None of the questionnaires had all measurement properties tested. Reliability was tested in 81.8% of the questionnaires, with 66.7% classified as fair. This classification was mainly due to the fact that none of the articles mentioned which type of Intraclass Correlation Coefficient (ICC) was used to measure reliability. It is extremely important to specify which type of ICC was used, given that different ICCs can lead to completely different results and thereby underestimate or overestimate reliability, depending on the ICC used 27.
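Why the type of ICC matters can be shown with a short sketch. It assumes the single-measure forms usually labelled ICC(1,1), ICC(2,1) and ICC(3,1) in the Shrout and Fleiss convention, computed from a two-way analysis of variance; the test-retest scores are hypothetical, with every retest score exactly 3 points higher than the first score.

```python
def icc_forms(scores):
    """Compute three single-measure ICC forms.

    scores: one row per subject, one column per rater or occasion.
    Returns ICC(1,1) (one-way), ICC(2,1) (two-way, absolute agreement)
    and ICC(3,1) (two-way, consistency).
    """
    n = len(scores)
    k = len(scores[0])
    if n < 2 or k < 2:
        raise ValueError("need at least 2 subjects and 2 measurements")

    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]

    # Sums of squares of the two-way ANOVA decomposition.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)                   # between subjects
    ms_cols = ss_cols / (k - 1)                   # between occasions
    ms_error = ss_error / ((n - 1) * (k - 1))     # residual
    ms_within = (ss_cols + ss_error) / (n * (k - 1))

    return {
        "ICC(1,1)": (ms_rows - ms_within) / (ms_rows + (k - 1) * ms_within),
        "ICC(2,1)": (ms_rows - ms_error)
        / (ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n),
        "ICC(3,1)": (ms_rows - ms_error) / (ms_rows + (k - 1) * ms_error),
    }


# Hypothetical test-retest data: retest = test + 3 for every subject.
data = [[10, 13], [12, 15], [14, 17], [16, 19], [18, 21]]
for name, value in icc_forms(data).items():
    print(name, round(value, 3))
```

With these data the consistency form ignores the systematic 3-point shift, whereas the absolute-agreement and one-way forms penalize it, so the same scores yield clearly different coefficients. A study that reports only "ICC" therefore leaves the reader unable to judge which of these was meant.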
Internal consistency was also assessed in 81.8% of the instruments, with 77.8% receiving a classification of good. Questionnaires did not receive a classification of excellent if they did not use Cronbach's alpha coefficient in combination with factor analysis. This analysis is important since it can identify how many scales are present in a questionnaire. If there is more than one scale, Cronbach's alpha should be calculated for each subscale separately 26. Only the reduced version of the FAI and the MFIQ used Cronbach's alpha coefficient in combination with factor analysis.
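As an illustration of calculating Cronbach's alpha separately for each subscale, the sketch below uses the standard formula alpha = k/(k-1) x (1 - sum of item variances / variance of total scores). The subscale names, the assignment of items to subscales and the scores are hypothetical; in practice the assignment would come from a factor analysis, which is not performed here.

```python
from statistics import variance


def cronbach_alpha(items):
    """Cronbach's alpha for one (sub)scale.

    items: one list of scores per item, with respondents in the same
    order in every list.
    """
    k = len(items)
    if k < 2:
        raise ValueError("alpha needs at least two items")
    # Total score of each respondent across the items of this scale.
    totals = [sum(responses) for responses in zip(*items)]
    sum_item_variances = sum(variance(item) for item in items)
    return k / (k - 1) * (1 - sum_item_variances / variance(totals))


# Hypothetical questionnaire with two subscales (five respondents).
subscales = {
    "pain": [[3, 4, 2, 5, 4], [3, 5, 2, 4, 4], [2, 4, 3, 5, 5]],
    "function": [[1, 2, 4, 3, 5], [2, 2, 5, 3, 4]],
}

for name, items in subscales.items():
    print(name, round(cronbach_alpha(items), 2))
```

Pooling all five items into a single alpha would mix two constructs, which is why the coefficient is computed per subscale in this sketch.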
Measurement error was not tested for any of the questionnaires. Measurement error is the systematic and random error in a patient's score that is not attributable to true changes in the construct measured 14.
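One common way to quantify measurement error, given here only as a sketch, is the standard error of measurement (SEM) and the smallest detectable change (SDC). The formulas SEM = SD x sqrt(1 - ICC) and SDC = 1.96 x sqrt(2) x SEM are widely used conventions and are an assumption of this example; they were not applied in the studies reviewed. The input values are hypothetical.

```python
import math


def sem_from_icc(sd, icc):
    """Standard error of measurement from the score SD and an ICC."""
    if not 0.0 <= icc <= 1.0:
        raise ValueError("ICC is expected to lie between 0 and 1")
    return sd * math.sqrt(1.0 - icc)


def smallest_detectable_change(sem):
    """Change in an individual score that exceeds measurement error (95%)."""
    return 1.96 * math.sqrt(2.0) * sem


# Hypothetical questionnaire: SD of 10 points and ICC of 0.75.
sem = sem_from_icc(sd=10.0, icc=0.75)
print(round(sem, 2), round(smallest_detectable_change(sem), 2))
```

Under these assumptions, a change in an individual patient's score smaller than the SDC cannot be distinguished from measurement error.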
The criterion validity was tested in 9.09% of the instruments and was classified as fair. The content validity was analyzed in 18.9% of the studies and was classified as good. The structural validity was tested in 46.1% of the instruments. In these studies, there were no classifications of excellent. The studies used Pearson's correlation test to correlate a specific questionnaire with other similar measures. However, prior to testing the structural validity, it is important to formulate hypotheses that specify the expected range and direction of the correlation. Hypotheses were not formulated in any of the studies included in this review. Without specific hypotheses, the risk of bias is high, since it is easy to develop an alternative explanation for low correlations after the fact rather than concluding that the questionnaire may lack construct validity 26.
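The idea of stating a hypothesis before computing the correlation can be sketched as follows. The expected range (0.50 to 0.90, positive direction) and the scores are hypothetical and chosen only for illustration; the point is that the range is fixed before the data are analyzed.

```python
import math


def pearson_r(x, y):
    """Pearson's correlation coefficient between two score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length (at least 2 values)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    # A constant list has zero variance and makes r undefined.
    return sxy / math.sqrt(sxx * syy)


def hypothesis_supported(r, expected_low, expected_high):
    """Check r against a range defined a priori; the sign encodes direction."""
    return expected_low <= r <= expected_high


# Hypothesis written down BEFORE the analysis (hypothetical values):
# the new questionnaire correlates positively with a similar measure,
# with r between 0.50 and 0.90.
EXPECTED_LOW, EXPECTED_HIGH = 0.50, 0.90

new_questionnaire = [12, 15, 9, 20, 17, 11, 14, 18]
similar_measure = [30, 34, 22, 41, 33, 29, 27, 40]

r = pearson_r(new_questionnaire, similar_measure)
print(round(r, 2), hypothesis_supported(r, EXPECTED_LOW, EXPECTED_HIGH))
```

Because the expected range is fixed in advance, a correlation outside it counts against the hypothesis instead of being explained away afterwards.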
Manfredi et al. 24 analyzed the reliability (Box B of the COSMIN checklist) of the QAADO. Although the sample size (n = 46) was classified as fair and other items were classified as good, the study was classified as poor with regard to reliability due to the failure to calculate the intraclass correlation coefficient (ICC), since the "worst score counts" on the COSMIN checklist 14.
Only the studies analyzing the RDC/TMD 15 27 .
The study that analyzed the Brasil-MOPDS 23 questionnaire tested the following measurement properties: internal consistency (Box A), reliability (Box B) and structural validity (Box E). Internal consistency was classified as good. Reliability was classified as fair because the type of ICC used in the analysis was not reported. Structural validity was classified as good.
The study by Franco-Micheloni et al. 25 tested the following measurement properties of the reduced version of the QAADO: internal consistency (Box A), reliability (Box B) and construct validity (Box E). Internal consistency was classified as good due to the use of factor analysis. Reliability was classified as fair because the type of ICC used in the analysis was not reported. Finally, the construct validity presented a good methodological quality.

Responsiveness was not assessed in any of the studies included in this systematic review. Responsiveness represents the ability of a questionnaire to detect clinical changes over time 14. In addition, none of the studies tested ceiling/floor effects; consequently, it is not clear whether the instruments assessed would fail to detect an improvement or a worsening in certain patients.
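A check for ceiling and floor effects can be sketched as the proportion of respondents who reach the highest or lowest possible score. The 15% threshold used below is a commonly cited quality criterion and is an assumption of this example; the score range and the data are hypothetical.

```python
def floor_ceiling(scores, min_score, max_score, threshold=0.15):
    """Proportion of respondents at the scale limits.

    An effect is flagged when the proportion exceeds the threshold
    (0.15 by default, an assumed criterion).
    """
    if not scores:
        raise ValueError("no scores provided")
    n = len(scores)
    floor = sum(1 for s in scores if s == min_score) / n
    ceiling = sum(1 for s in scores if s == max_score) / n
    return {
        "floor": floor,
        "ceiling": ceiling,
        "floor_effect": floor > threshold,
        "ceiling_effect": ceiling > threshold,
    }


# Hypothetical total scores on a questionnaire ranging from 0 to 10.
scores = [0, 0, 5, 10, 10, 10, 7, 3, 4, 6]
print(floor_ceiling(scores, min_score=0, max_score=10))
```

In this hypothetical sample both effects are flagged: patients already at the maximum score cannot show worsening (or improvement, depending on the scoring direction), which is the situation the text warns about.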
Costa et al. 28 found the same problems in a systematic review on cross-cultural adaptations and measurement property tests of a questionnaire designed to assess pain intensity (McGill Pain Questionnaire). Among the 44 different versions of the questionnaire for 26 different languages/cultures, most measurement properties were either not tested or were tested inadequately. The same was found in a systematic review of assessment tools designed for the evaluation of low back pain 29, for which most studies evaluated reliability and structural validity, but failed to test internal consistency, responsiveness, the ceiling effect and the floor effect.
Bot et al. 30 conducted a systematic review to analyze the measurement properties of questionnaires that assessed shoulder disorders and found different assessment methods for measurement properties, as well as flaws in assessments of the structural validity, internal consistency and reliability. In most of the instruments assessed, hypotheses related to the range and direction of the expected correlations with other instruments were not formulated. Factor analysis was also frequently not conducted. When it was used, it did not always confirm the dimensions that the questionnaire proposed to measure. Responsiveness was usually tested in samples of inadequate sizes. Most of the studies did not adequately describe the study method and/or data analysis.
It should be pointed out that other guidelines, which do not require all the criteria found on the COSMIN checklist, can be used to evaluate procedures for testing measurement properties. However, the decision was made to employ this checklist because its quality criteria are the most up to date and widely accepted in the literature [10][11][12][13][14], 26.
Every effort was made in the systematic search of the electronic databases to identify studies on TMD assessment tools in Brazilian Portuguese.However, it is possible that unpublished data on the measurement properties of the assessment tools analyzed could be found in dissertations and theses.Such texts were not considered in the selection process, which may be interpreted as a possible limitation of the present review.
The measurement properties of the instruments included in this systematic review ranged in classification from good to poor, according to the criteria of the COSMIN checklist. These questionnaires are used in many Brazilian epidemiological and clinical studies. Thus, care must be taken when interpreting the scores of questionnaires that have not had their measurement properties completely tested or were not tested in accordance with quality criteria.
Finally, we recommend that instrument researchers consider conducting full psychometric tests of their instruments using adequate sample sizes.We also recommend they consider scoring methods and quality criteria to provide scientifically robust instruments that are easy to administer.
Cavalcanti et al. 17 analyzed the internal consistency (Box A) and reliability (Box B) of the multimedia version of the RDC/TMD, for which both classifications were fair due to the failure to perform factor analysis and the moderate sample size (n = 30).

Table 1 - Classification of the measurement properties of the articles included in the review according to the COSMIN checklist.

Table 2 - Classification of questionnaires related to TMD according to the COSMIN checklist.
and MFIQ 18 mentioned the type of Intraclass Correlation Coefficient (ICC) employed to measure reliability. Both studies used 95%. It is extremely important to state the type of ICC employed, as different ICCs can yield completely different results, which can either underestimate or overestimate reliability