Several covariate-balancing methods based on the propensity score are widely used to estimate treatment effects in observational studies. If the treatment effect varies with the propensity score, however, different methods can give very different answers. The authors illustrate this effect using data from a United Kingdom-based registry of subjects treated with anti-tumor necrosis factor drugs for rheumatoid arthritis. Estimates of the effect of these drugs on mortality varied from a relative risk of 0.4 (95% confidence interval: 0.16, 0.91) to a relative risk of 1.3 (95% confidence interval: 0.8, 2.25), depending on the balancing method chosen. The authors show that these differences arose from two factors acting together: an interaction between propensity score and treatment effect, and differences in how the methods weight subjects with different propensity scores. Thus, the methods estimate average treatment effects in populations with very different distributions of effect-modifying variables, and so yield different overall estimates. This phenomenon highlights the importance of selecting the covariate-balancing method carefully so that the overall estimate has a meaningful interpretation.
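The mechanism can be illustrated with a minimal numerical sketch. It is not the authors' analysis: the three strata, their propensity scores, risks, and relative risks are hypothetical values chosen for illustration, and the helper `weighted_rr` is an assumed name. Each balancing method is represented only by the population it implicitly targets, expressed as a tilting function h(e) of the propensity score e (1 for the whole population, e for the treated, 1 - e for the untreated).

```python
import numpy as np

# Hypothetical strata (illustrative values only, not the registry data).
e = np.array([0.1, 0.5, 0.9])            # propensity score in each stratum
share = np.array([1 / 3, 1 / 3, 1 / 3])  # population share of each stratum
risk0 = np.array([0.10, 0.10, 0.10])     # mortality risk if untreated
rr = np.array([1.5, 1.0, 0.5])           # relative risk varies with e (the interaction)
risk1 = risk0 * rr                       # mortality risk if treated


def weighted_rr(h):
    """Marginal relative risk in the population that tilts each stratum by h(e)."""
    w = share * h
    return float(np.sum(w * risk1) / np.sum(w * risk0))


estimates = {
    "whole population": weighted_rr(np.ones_like(e)),  # h(e) = 1
    "treated": weighted_rr(e),                         # h(e) = e
    "untreated": weighted_rr(1 - e),                   # h(e) = 1 - e
}

for target, value in estimates.items():
    print(f"{target}: RR = {value:.2f}")
```

Under these assumed inputs the same stratum-specific effects give a relative risk of about 1.0 for the whole population, about 0.73 for the treated, and about 1.27 for the untreated. The estimates differ only because each target population weights the strata differently.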