Differential Item Functioning

Differential item functioning (DIF) is a statistical characteristic of an item that shows the extent to which the item might be measuring different abilities for members of separate subgroups. Average item scores for subgroups having the same overall score on the test are compared to determine whether the item is measuring in essentially the same way for all subgroups. The presence of DIF requires review and judgment; it does not necessarily indicate the presence of bias. DIF analysis provides an indication of unexpected item behavior on a test. An item does not display DIF simply because people from different groups have different probabilities of giving a certain response; it displays DIF if and only if people from different groups ''with the same underlying true ability'' have different probabilities of giving a certain response. Common procedures for assessing DIF are Mantel-Haenszel, item response theory (IRT) based methods, and logistic regression.


Description

DIF refers to differences in the functioning of items across groups, often demographic, that are matched on the latent trait or, more generally, the attribute being measured by the items or test. When examining items for DIF, the groups must be matched on the measured attribute; otherwise the analysis may detect DIF inaccurately. To build a general understanding of DIF or measurement bias, consider the following example offered by Osterlind and Everson (2009). Here, Y refers to a response to a particular test item, which is determined by the latent construct being measured. The latent construct of interest is referred to as theta (θ), and Y is an indicator of θ that can be arranged in terms of the probability distribution of Y on θ by the expression f(Y | θ). Therefore, response Y is conditional on the latent trait (θ). Because DIF examines differences in the conditional probabilities of Y between groups, let us label the groups the "reference" and "focal" groups. Although the designation does not matter, a typical practice in the literature is to designate as the reference group the group suspected of having an advantage, while the focal group is the group anticipated to be disadvantaged by the test. Therefore, given the functional relationship f(Y | θ), and under the assumption that there are identical
measurement error distributions for the reference and focal groups, it can be concluded that, under the null hypothesis,

  f(Y | θ, G = r) = f(Y | θ, G = f)

with G corresponding to the grouping variable, "r" the reference group, and "f" the focal group. This equation represents an instance where DIF is not present. Here the absence of DIF is determined by the fact that the conditional probability distribution of Y does not depend on group membership. To illustrate, consider an item with response options 0 and 1, where Y = 0 indicates an incorrect response and Y = 1 a correct response. The probability of correctly responding to the item is the same for members of either group. This indicates that there is no DIF or item bias, because members of the reference and focal groups with the same underlying ability or attribute have the same probability of responding correctly; neither group is advantaged over the other.

Now consider the instance where the conditional probability of Y is not the same for the reference and focal groups:

  f(Y | θ, G = r) ≠ f(Y | θ, G = f)

In other words, members of different groups with the same trait or ability level have unequal probability distributions on Y. Once θ is controlled for, there is a clear dependency between group membership and performance on an item. For
dichotomous items, this suggests that when the focal and reference groups are at the same location on θ, they have different probabilities of getting a correct response or endorsing the item. The group with the higher conditional probability of correctly responding is the group advantaged by the test item. This suggests that the test item is biased, functions differently for the groups, and therefore exhibits DIF.

It is important to distinguish between DIF, or measurement bias, and ordinary group differences. Whereas group differences indicate differing score distributions on Y, DIF explicitly involves conditioning on θ. For instance, consider the following:

  f(Y | G = r) ≠ f(Y | G = f)

This indicates that an examinee's score depends on grouping, such that having information about group membership changes the probability of a correct response. If the groups differ on θ, and performance depends on θ, then the above inequality would suggest item bias even in the absence of DIF. For this reason, it is generally agreed in the measurement literature that differences on Y conditional on group membership alone are inadequate for establishing bias. In fact, differences on θ or ability are common between groups and form the basis for much research. To establish bias or DIF, groups must be matched on θ and then demonstrate differential probabilities on Y as a function of group membership.
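The distinction above can be sketched numerically. The following illustration uses entirely hypothetical two-parameter logistic (2PL) item parameters: two examinees matched on θ have identical response probabilities when the groups share one item function, and unequal probabilities once the item is made harder for the focal group.

```python
import math

def p_correct(theta, a, b):
    """2PL item response function: probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

theta = 0.5  # two examinees matched at the same ability level

# No DIF: both groups share one item function, so matched examinees
# have identical probabilities of a correct response.
p_ref = p_correct(theta, a=1.2, b=0.0)
p_foc = p_correct(theta, a=1.2, b=0.0)

# Uniform DIF: the item is harder for the focal group (b_f > b_r),
# so matched examinees no longer have equal probabilities.
p_ref_dif = p_correct(theta, a=1.2, b=0.0)
p_foc_dif = p_correct(theta, a=1.2, b=0.8)

print(p_ref == p_foc)          # matched and unbiased: equal probabilities
print(p_ref_dif > p_foc_dif)   # matched but biased: reference advantaged
```

All parameter values here are illustrative, not drawn from any real calibration.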


Forms

Uniform DIF is the simplest type of DIF, where the magnitude of conditional dependency is relatively invariant across the latent trait continuum (θ): the item of interest consistently gives one group an advantage at all levels of ability. Within an item response theory (IRT) framework, this is evidenced when the two item characteristic curves (ICCs) are equally discriminating yet differ in the difficulty parameter (i.e., ''a''r = ''a''f and ''b''r < ''b''f), as depicted in Figure 1.

Nonuniform DIF presents a more complex case. Rather than giving a consistent advantage to the reference group across the ability continuum, the conditional dependency changes magnitude, and possibly direction, at different locations on the θ continuum. For instance, an item may give the reference group a minor advantage at the lower end of the continuum and a major advantage at the higher end. Also, unlike uniform DIF, an item can vary in discrimination between the two groups while also varying in difficulty (i.e., ''a''r ≠ ''a''f and ''b''r < ''b''f). Even more complex is "crossing" nonuniform DIF: as demonstrated in Figure 2, this occurs when an item gives an advantage to the reference group at one end of the θ continuum while favoring the focal group at the other end.

Differences in ICCs indicate that examinees from the two groups with identical ability levels have unequal probabilities of correctly responding to an item. When the curves differ but do not intersect, this is evidence of uniform DIF. However, if the ICCs cross at any point along the θ scale, there is evidence of nonuniform DIF.
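Crossing nonuniform DIF can be illustrated with a small sketch. The hypothetical 2PL parameters below give the groups a shared difficulty but unequal discriminations, so the ICCs intersect and the advantaged group flips across θ.

```python
import math

def icc(theta, a, b):
    # 2PL item characteristic curve
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# Crossing nonuniform DIF: equal difficulty, unequal discrimination,
# so the curves cross at theta = b and the advantage reverses.
a_ref, b_ref = 1.5, 0.0   # reference group: steeper curve
a_foc, b_foc = 0.7, 0.0   # focal group: flatter curve

low, high = -2.0, 2.0
# The focal group is advantaged at the low end of the continuum...
print(icc(low, a_foc, b_foc) > icc(low, a_ref, b_ref))
# ...while the reference group is advantaged at the high end.
print(icc(high, a_ref, b_ref) > icc(high, a_foc, b_foc))
```

Both comparisons print True, mirroring the Figure 2 description: examinees matched on θ have unequal success probabilities, with the direction of the advantage reversing across the scale.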


Procedures for detecting DIF


Mantel-Haenszel

A common procedure for detecting DIF is the Mantel-Haenszel (MH) approach. The MH procedure is a chi-squared contingency table based approach that examines differences between the reference and focal groups on all items of the test, one by one. The ability continuum, defined by total test scores, is divided into ''k'' intervals, which then serve as the basis for matching members of the two groups. A 2 x 2 contingency table is used at each of the ''k'' intervals, comparing the two groups on an individual item. The rows of the contingency table correspond to group membership (reference or focal) while the columns correspond to correct or incorrect responses. The following table presents the general form for a single item at the ''k''th ability interval.

              Correct response   Incorrect response   Total
  Reference   Ak                 Bk                   nRk
  Focal       Ck                 Dk                   nFk
  Total       m1k                m0k                  Nk


Odds ratio

The next step in the calculation of the MH statistic is to use data from the contingency table to obtain an odds ratio for the two groups on the item of interest at a particular interval ''k''. This is expressed in terms of ''p'' and ''q'', where ''p'' represents the proportion correct and ''q'' the proportion incorrect for the reference (R) and focal (F) groups. For the MH procedure, the obtained odds ratio is represented by α, with possible values ranging from 0 to ∞. An α value of 1.0 indicates an absence of DIF and thus similar performance by the two groups. Values greater than 1.0 suggest that the reference group outperformed, or found the item less difficult than, the focal group; a value less than 1.0 indicates that the item was less difficult for the focal group. Using the variables from the contingency table above, the calculation is as follows:

  αk = (pRk / qRk) / (pFk / qFk) = (Ak / Bk) / (Ck / Dk) = Ak Dk / Bk Ck

The above computation pertains to an individual item at a single ability interval. The estimate can be extended to reflect a common odds ratio across all ability intervals ''k'' for a specific item. The common odds ratio estimator, denoted αMH, is computed by the following equation:

  αMH = [Σk (Ak Dk / Nk)] / [Σk (Bk Ck / Nk)]

for all values of ''k'', where Nk represents the total sample size at the ''k''th interval. The obtained αMH is often standardized through a log transformation, centering the value around 0. The transformed estimator, MH D-DIF, is computed as follows:

  MH D-DIF = -2.35 ln(αMH)

Thus an obtained value of 0 indicates no DIF. Note that the minus sign reverses the interpretation of values less than or greater than 0: values less than 0 indicate a reference group advantage, whereas values greater than 0 indicate an advantage for the focal group.
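A minimal sketch of the computation, using made-up 2 x 2 tables for three score intervals (A = reference correct, B = reference incorrect, C = focal correct, D = focal incorrect):

```python
import math

# Hypothetical contingency tables (A, B, C, D) at each score interval k
tables = [
    (20, 10, 14, 16),   # interval 1
    (30,  8, 22, 15),   # interval 2
    (40,  5, 31, 12),   # interval 3
]

# Common odds ratio across intervals: sum(AD/N) / sum(BC/N)
num = sum(A * D / (A + B + C + D) for A, B, C, D in tables)
den = sum(B * C / (A + B + C + D) for A, B, C, D in tables)
alpha_mh = num / den

# Log transformation onto the ETS delta scale
mh_d_dif = -2.35 * math.log(alpha_mh)

print(round(alpha_mh, 3), round(mh_d_dif, 3))
# alpha_mh > 1 here: the item was easier for the reference group,
# and mh_d_dif < 0 correspondingly flags a reference-group advantage.
```

With these illustrative counts, αMH is about 2.61 and MH D-DIF about -2.25, i.e. a reference-group advantage on the delta scale.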


Item response theory

Item response theory (IRT) is another widely used method for assessing DIF. IRT allows a critical examination of responses to particular items from a test or measure. As noted earlier, DIF examines the probability of correctly responding to or endorsing an item conditional on the latent trait or ability. Because IRT examines the monotonic relationship between responses and the latent trait or ability, it is a fitting approach for examining DIF. Three major advantages of using IRT in DIF detection are:
*Compared to classical test theory, IRT parameter estimates are not as confounded by sample characteristics.
*Statistical properties of items can be expressed with greater precision, which increases the accuracy of interpreting DIF between two groups.
*These statistical properties of items can be expressed graphically, improving interpretability and understanding of how items function differently between groups.

In relation to DIF, item parameter estimates are computed and graphically examined via item characteristic curves (ICCs), also referred to as trace lines or item response functions (IRFs). After examination of the ICCs and subsequent suspicion of DIF, statistical procedures are implemented to test differences between parameter estimates. ICCs represent mathematical functions of the relationship between position on the latent trait continuum and the probability of giving a particular response. Figure 3 illustrates this relationship as a logistic function. Individuals lower on the latent trait, or with less ability, have a lower probability of getting a correct response or endorsing an item, especially as difficulty increases; those higher on the latent trait or in ability have a greater chance. For instance, on a depression inventory, highly depressed individuals have a greater probability of endorsing an item than individuals with lower depression. Similarly, individuals with higher math ability have a greater probability of answering a math item correctly than those with less ability. Another critical aspect of ICCs pertains to the
inflection point. This is the point on the curve where the probability of a particular response is .5 and where the slope reaches its maximum value. The inflection point indicates where the probability of a correct response or of endorsing an item becomes greater than 50%, except when a ''c'' parameter greater than 0 places the inflection point at a probability of (1 + ''c'')/2 (a description follows below). The inflection point is determined by the difficulty of the item, which corresponds to values on the ability or latent trait continuum. Therefore, for an easy item this inflection point may be lower on the ability continuum, while for a difficult item it may be higher on the same scale.

Before presenting statistical procedures for testing differences of item parameters, it is important to provide a general understanding of the different parameter estimation models and their associated parameters. These are the one-, two-, and three-parameter logistic (PL) models. All three models assume a single underlying latent trait or ability, and all three have an item difficulty parameter denoted ''b''. For the 1PL and 2PL models, the ''b'' parameter corresponds to the inflection point on the ability scale, as mentioned above. In the case of the 3PL model, the inflection point corresponds to the probability (1 + ''c'')/2, where ''c'' is a lower asymptote (discussed below). Difficulty values can, in theory, range from -∞ to +∞, but in practice they rarely exceed ±3. Higher values indicate harder test items; items with low ''b'' parameters are easy.

Another parameter that is estimated is the discrimination parameter, designated ''a''. This parameter pertains to an item's ability to discriminate among individuals. The ''a'' parameter is estimated in the 2PL and 3PL models; in the 1PL model it is constrained to be equal between groups. In relation to ICCs, the ''a'' parameter is the slope at the inflection point, where, as mentioned earlier, the slope is maximal. The ''a'' parameter, like the ''b'' parameter, can range from -∞ to +∞, but typical values are less than 2; higher values indicate greater discrimination between individuals.

The 3PL model has an additional parameter, referred to as the ''guessing'' or pseudochance parameter and denoted ''c''. This corresponds to a lower asymptote, which allows for the possibility that an individual low in ability answers a moderate or difficult item correctly. Values for ''c'' range between 0 and 1, but typically fall below .3.

When applying statistical procedures to assess for DIF, the ''a'' and ''b'' parameters (discrimination and difficulty) are of particular interest. Suppose, however, that a 1PL model was used, in which the ''a'' parameters are constrained to be equal for both groups, leaving only the ''b'' parameters to be estimated. After examining the ICCs, there is an apparent difference in the ''b'' parameters for the two groups. Using a method similar to a
Student's ''t''-test, the next step is to determine whether the difference in difficulty is statistically significant. Under the null hypothesis

  H0: br = bf

Lord (1980) provides an easily computed and normally distributed test statistic:

  d = (br - bf) / SE(br - bf)

The standard error of the difference between the ''b'' parameters is calculated as

  SE(br - bf) = √[SE(br)2 + SE(bf)2]
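A minimal sketch of Lord's d, using hypothetical difficulty estimates and standard errors (not taken from any real calibration):

```python
import math

# Hypothetical 1PL difficulty estimates and their standard errors
b_ref, se_ref = -0.25, 0.12   # reference group
b_foc, se_foc = 0.30, 0.15    # focal group

# Standard error of the difference: sqrt(SE(b_r)^2 + SE(b_f)^2)
se_diff = math.sqrt(se_ref**2 + se_foc**2)

# Lord's d is approximately standard normal under H0: b_r = b_f
d = (b_ref - b_foc) / se_diff

print(abs(d) > 1.96)  # compare |d| against the .05 critical value
```

Here d is about -2.86, so |d| exceeds 1.96 and the difficulty difference would be flagged as significant at the .05 level.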


Wald statistic

More often than not, however, a 2PL or 3PL model is more appropriate than a 1PL model, and thus both the ''a'' and ''b'' parameters should be tested for DIF. Lord (1980) proposed another method for testing differences in both the ''a'' and ''b'' parameters, with the ''c'' parameters constrained to be equal across groups. This test yields a Wald statistic, which follows a chi-square distribution. In this case the null hypothesis being tested is

  H0: ar = af and br = bf

First, a 2 x 2 covariance matrix of the parameter estimates is calculated for each group; these are represented by Sr and Sf for the reference and focal groups, and are computed by inverting the obtained information matrices. Next, the differences between the estimated parameters are put into a 2 x 1 vector, denoted

  V' = (ar - af, br - bf)

Then the covariance matrix S of the differences is estimated by summing Sr and Sf. Using this information, the Wald statistic is computed as follows:

  χ2 = V'S−1V

which is evaluated at 2 degrees of freedom.
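The Wald computation can be sketched as follows; the parameter estimates and covariance matrices here are hypothetical placeholders for values a 2PL calibration would produce.

```python
import numpy as np

# Hypothetical 2PL estimates (a, b) for each group
a_r, b_r = 1.30, 0.10   # reference
a_f, b_f = 0.95, 0.55   # focal

# Hypothetical 2x2 covariance matrices of the estimates
# (in practice, the inverse of each group's information matrix)
S_r = np.array([[0.020, 0.004], [0.004, 0.015]])
S_f = np.array([[0.025, 0.005], [0.005, 0.018]])

v = np.array([a_r - a_f, b_r - b_f])   # vector of parameter differences
S = S_r + S_f                          # covariance of the difference

wald = v @ np.linalg.solve(S, v)       # V'S^-1 V, chi-square with 2 df
p_value = np.exp(-wald / 2)            # chi-square(2) survival function

print(wald > 5.99)  # compare against the .05 critical value at 2 df
```

With these placeholder numbers the statistic is about 11.39, well beyond the 5.99 critical value, so the joint hypothesis of equal ''a'' and ''b'' parameters would be rejected.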


Likelihood-ratio test

The likelihood-ratio test is another IRT based method for assessing DIF. This procedure compares two models. Under one model (Mc), item parameters are constrained to be equal, or invariant, between the reference and focal groups; under the other (Mv), item parameters are free to vary. The likelihood function under Mc is denoted Lc, while the likelihood function under Mv is Lv. The items constrained to be equal serve as anchor items for this procedure, while items suspected of DIF are allowed to vary freely. By using anchor items and allowing the remaining item parameters to vary, multiple items can be assessed for DIF simultaneously. However, if the likelihood ratio indicates potential DIF, an item-by-item analysis is appropriate to determine which items, if not all, contain DIF. The likelihood ratio of the two models is computed by

  G2 = 2 ln(Lv / Lc)

or, equivalently,

  G2 = -2 ln(Lc / Lv)

G2 approximately follows a chi-square distribution, especially with larger samples. It is therefore evaluated at the degrees of freedom corresponding to the number of constraints required to derive the constrained model from the freely varying model. For instance, if a 2PL model is used and both the ''a'' and ''b'' parameters are free to vary under Mv but constrained under Mc, then the ratio is evaluated at 2 degrees of freedom.
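A minimal sketch of evaluating G2, assuming hypothetical maximized log-likelihoods from the two fitted models. For 2 degrees of freedom the chi-square survival function reduces to exp(-x/2), so no statistics library is needed.

```python
import math

# Hypothetical maximized log-likelihoods from the two fitted models
loglik_c = -1543.8   # Mc: studied item's a and b constrained equal
loglik_v = -1538.1   # Mv: studied item's a and b free to vary

# G2 = 2 ln(Lv / Lc) = 2 (ln Lv - ln Lc); always >= 0 since Lv >= Lc
g2 = 2 * (loglik_v - loglik_c)

# Two constraints released (a and b), so 2 degrees of freedom.
# For chi-square with 2 df, the survival function is exp(-x/2).
p_value = math.exp(-g2 / 2)

print(g2 > 5.99 and p_value < 0.05)  # True here -> DIF indicated
```

With these illustrative log-likelihoods, G2 = 11.4 on 2 degrees of freedom (p ≈ .003), so the freely varying model fits significantly better and DIF is indicated.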


Logistic regression

Logistic regression approaches to DIF detection involve running a separate analysis for each item. The independent variables included in the analysis are group membership, an ability matching variable (typically a total score), and an interaction term between the two. The dependent variable of interest is the probability, or likelihood, of getting a correct response or endorsing an item. Because the outcome of interest is expressed in terms of probabilities, maximum likelihood estimation is the appropriate procedure. This set of variables can be expressed by the following regression equation:

  Y = β0 + β1M + β2G + β3MG

where β0 corresponds to the intercept, or the probability of a response when M and G are equal to 0, and the remaining βs are weight coefficients for each independent variable. The first independent variable, M, is the matching variable used to link individuals on ability, in this case a total test score, similar to that employed by the Mantel-Haenszel procedure. The group membership variable is denoted G and, in the case of regression, is represented through dummy-coded variables. The final term, MG, corresponds to the interaction between the two variables above.

For this procedure, variables are entered hierarchically, following the structure of the regression equation provided above: first the matching variable M, then the grouping variable G, and finally the interaction variable MG. DIF is determined by evaluating the obtained chi-square statistic at 2 degrees of freedom; additionally, the significance of the parameter estimates is tested. From the results of the logistic regression, DIF is indicated if individuals matched on ability have significantly different probabilities of responding to an item, and thus differing logistic regression curves. Conversely, if the curves for the two groups are the same, then the item is unbiased and DIF is not present. In terms of uniform and nonuniform DIF: if the intercept and matching variable parameters differ between the groups, there is evidence of uniform DIF, whereas a nonzero interaction parameter indicates nonuniform DIF.
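The hierarchical procedure can be sketched end to end on simulated data. This illustration fits the two nested logistic models by Newton-Raphson, a standard way to maximize the likelihood (not necessarily the implementation any particular DIF package uses), and compares their deviances at 2 degrees of freedom; all data-generating values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_logistic(X, y, n_iter=50):
    """Fit a logistic regression by Newton-Raphson (IRLS); return
    the coefficients and the model deviance (-2 log-likelihood)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        W = p * (1 - p)
        H = X.T @ (X * W[:, None])                  # Hessian
        beta = beta + np.linalg.solve(H, X.T @ (y - p))
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    deviance = -2 * np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
    return beta, deviance

# Simulated item data: matching score M, group G, and uniform DIF
n = 2000
M = rng.normal(0, 1, n)                    # ability matching variable
G = rng.integers(0, 2, n)                  # 0 = reference, 1 = focal
logit = 1.0 * M - 0.8 * G                  # focal group disadvantaged
y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(float)

ones = np.ones(n)
X_base = np.column_stack([ones, M])            # matching variable only
X_full = np.column_stack([ones, M, G, M * G])  # add G and the M*G interaction

_, dev_base = fit_logistic(X_base, y)
_, dev_full = fit_logistic(X_full, y)

# Deviance drop from adding G and M*G: chi-square with 2 df under H0
chi2 = dev_base - dev_full
print(chi2 > 5.99)  # True here -> DIF indicated for this simulated item
```

Because the simulated item carries a genuine group effect, the two added terms reduce the deviance far beyond the 5.99 critical value; a nonzero G coefficient with a negligible M*G coefficient is the uniform-DIF pattern described above.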


Considerations


Sample size

The first consideration pertains to sample size, specifically with regard to the reference and focal groups. Prior to any analysis, information about the number of people in each group is typically known, such as the number of males/females or members of ethnic/racial groups. The issue, however, revolves around whether the number of people per group is sufficient to provide enough statistical power to identify DIF. In some instances, such as ethnicity, group sizes may be markedly unequal, with Whites, say, representing a far larger sample than each individual ethnic group. In such instances it may be appropriate to modify or adjust the data so that the groups being compared for DIF are equal or closer in size. Dummy coding or recoding is a common practice employed to adjust for disparities in the size of the reference and focal groups: for example, all non-White ethnic groups can be grouped together to obtain relatively equal reference and focal sample sizes, allowing a "majority/minority" comparison of item functioning. If such modifications are not made and DIF procedures are carried out, there may not be enough statistical power to identify DIF even where it exists.

Another sample size issue relates directly to the statistical procedure used to detect DIF. Aside from the sizes of the reference and focal groups, certain characteristics of the sample itself must be met to comply with the assumptions of each statistical test utilized in DIF detection. For instance, IRT approaches may require larger samples than the Mantel-Haenszel procedure; this is important, as investigation of group size may direct one toward one procedure over another. Within the logistic regression approach, high-leverage values and outliers are of particular concern and must be examined prior to DIF detection. Additionally, as with all analyses, statistical test assumptions must be met; some procedures are robust to minor violations while others are less so. Thus, the distributional nature of sample responses should be investigated before implementing any DIF procedures.


Items

The number of items used for DIF detection must also be considered. No standard exists for how many items should be tested, as this varies from study to study. In some cases it may be appropriate to test all items for DIF, whereas in others it may not be necessary. If only certain items are suspected of DIF, with adequate reasoning, it may be more appropriate to test those items rather than the entire set. However, it is often difficult to anticipate which items may be problematic, and for this reason it is commonly recommended to examine all test items for DIF simultaneously. This provides information about all items, shedding light on problematic items as well as those that function similarly for the reference and focal groups. With regard to statistical tests, some procedures, such as IRT likelihood-ratio testing, require the use of anchor items: some items are constrained to be equal across groups while items suspected of DIF are allowed to vary freely. In this instance, only a subset is tested as DIF items while the rest serve as a comparison group for DIF detection. Once DIF items are identified, the anchor items can themselves be analyzed by constraining the original DIF items and allowing the original anchor items to vary freely. Thus testing all items simultaneously may be the more efficient procedure, although, as noted, different methods for selecting DIF items are used depending on the procedure implemented.

Aside from the number of items tested for DIF, the number of items on the entire test or measure is also important. The typical recommendation, as noted by Zumbo (1999), is a minimum of 20 items, a figure that relates directly to the formation of matching criteria. As noted in earlier sections, a total test score is typically used to match individuals on ability. The total test score is usually divided into 3-5 ability levels (k), which are then used to match individuals on ability prior to the DIF analysis. Using a minimum of 20 items allows greater variance in the score distribution, which results in more meaningful ability-level groups. Although the psychometric properties of the instrument should have been assessed before it is utilized, it is important that the validity and reliability of the instrument be adequate: test items need to tap accurately into the construct of interest in order to derive meaningful ability-level groups. Of course, one does not want to inflate reliability coefficients by simply adding redundant items; the key is to have a valid and reliable measure with sufficient items to develop meaningful matching groups. Gadermann et al. (2012), Revelle and Zinbarg (2009), and John and Soto (2007) [John, O. P., & Soto, C. J. (2007). The importance of being valid: Reliability and the process of construct validation. In R. W. Robins, R. C. Fraley, & R. F. Krueger (Eds.), ''Handbook of research methods in personality psychology'' (pp. 461–494). New York, NY: Cambridge University Press.] offer more information on modern approaches to structural validation and more precise and appropriate methods for assessing reliability.


Statistics versus reasoning

As with all
psychological research
and psychometric evaluation,
statistics
play a vital role but should by no means be the sole basis for decisions and conclusions. Reasoned judgment is critically important when evaluating items for DIF.

For instance, different statistical procedures for DIF detection may yield different results, since some procedures are more precise than others. The Mantel-Haenszel procedure requires the researcher to construct ability levels based on total test scores, whereas IRT more effectively places individuals along the latent-trait or ability continuum. Thus, one procedure may indicate DIF for certain items while another does not.

Another issue is that DIF may sometimes be indicated with no clear reason why it exists. This is where reasoned judgment comes into play: the researcher must use common sense to derive meaning from DIF analyses. It is not enough to report that items function differently for different groups; there needs to be a theoretical reason why this occurs.

Furthermore, evidence of DIF does not directly translate into unfairness in the test. It is common in DIF studies to identify some items that suggest DIF; this may indicate problematic items that need to be revised or omitted, not necessarily an unfair test. DIF analysis can therefore be considered a useful tool for item analysis, but it is most effective when combined with theoretical reasoning.


Statistical software

Below are common statistical programs capable of performing the procedures discussed herein. By clicking on
list of statistical packages
, you will be directed to a comprehensive list of open source, public domain, freeware, and proprietary statistical software.

Mantel-Haenszel procedure
*SPSS
*SAS
*Stata
*R (e.g., 'difR' package)
*Systat
*Lertap 5

IRT-based procedures
*BILOG-MG
*MULTILOG
*PARSCALE
*TESTFACT
*EQSIRT
*R (e.g., 'difR' or 'mirt' package)
*IRTPRO

Logistic regression
*SPSS
*SAS
*Stata
*R (e.g., 'difR' package)
*Systat
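As an illustration of the logistic-regression approach listed above, the following Python sketch compares a model predicting an item response from the matching total score against a model that adds group and score-by-group terms; the resulting likelihood-ratio statistic (2 degrees of freedom) flags uniform and/or non-uniform DIF. The Newton-Raphson fitter and all names here are illustrative assumptions, not any listed package's API.

```python
import numpy as np

def fit_logistic(X, y, n_iter=50):
    """Maximum-likelihood logistic regression via Newton-Raphson;
    returns the maximized log-likelihood."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1 / (1 + np.exp(-X @ beta))
        W = p * (1 - p)
        grad = X.T @ (y - p)
        hess = X.T @ (X * W[:, None]) + 1e-9 * np.eye(X.shape[1])  # tiny ridge for stability
        beta += np.linalg.solve(hess, grad)
    p = 1 / (1 + np.exp(-X @ beta))
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def lr_dif_test(item, total, group):
    """Logistic-regression DIF: likelihood-ratio chi-square (2 df)
    comparing  item ~ total  against  item ~ total + group + total:group."""
    ones = np.ones_like(total, dtype=float)
    base = np.column_stack([ones, total])
    full = np.column_stack([ones, total, group, total * group])
    return 2 * (fit_logistic(full, item) - fit_logistic(base, item))
```

Because the group main effect (uniform DIF) and the score-by-group interaction (non-uniform DIF) are added together, the statistic is referred to a chi-square distribution with 2 degrees of freedom; the two terms can also be tested separately to distinguish the two kinds of DIF.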


See also

*Measurement invariance


References

*Magis, D., Béland, S., Tuerlinckx, F., & De Boeck, P. (2010). A general framework and an R package for the detection of dichotomous differential item functioning. ''Behavior Research Methods, 42''(3), 847–862. doi:10.3758/BRM.42.3.847
*Chalmers, R. P. (2012). mirt: A multidimensional item response theory package for the R environment. ''Journal of Statistical Software, 48''(6), 1–29. doi:10.18637/jss.v048.i06
*John, O. P., & Soto, C. J. (2007). The importance of being valid: Reliability and the process of construct validation. In R. W. Robins, R. C. Fraley, & R. F. Krueger (Eds.), ''Handbook of research methods in personality psychology'' (pp. 461–494). New York, NY: Cambridge University Press.