Correlation Analysis and Nonparametric Statistics for Variables of Different Data Types in Graduation Design Projects

It’s graduation season again, and this year’s graduation defense is just around the corner. This year’s graduation design projects covered eye movement research, the campus “heads-down tribe”, human-computer interface interaction design, and applied queuing theory. Except for the queuing theory project, which the student chose independently, all topics came from this year’s list of suggested topics. The human-computer interface interaction design topic was a first attempt at a joint topic with the Information Management and Information System major: two students, one from each major, worked on the same subject from different angles (back-end database development and front-end interactive interface design). The specific projects are as follows:

Eye movement research project: an eye movement study of how visual contrast and intent affect attention to advertising keywords

Campus “heads-down tribe” research project: assessment of upper limb musculoskeletal discomfort in college students with different levels of mobile phone use; analysis of the influencing factors and prevention strategies for neck and shoulder pain in the campus “heads-down tribe”; an sEMG study of mobile phone use by the campus “heads-down tribe” in different scenarios

Human-computer interface interaction design project: human-computer interface interaction design for a campus C2C second-hand book information system

Applied queuing theory project: parking space matching and management fee standards for a shopping mall underground parking lot, based on a queuing theory model

Research on the campus “heads-down tribe” has been my main research direction in recent years. The three related topics here used different research methods to study upper limb musculoskeletal discomfort in this group, and survey tools such as questionnaires and Likert scales were used repeatedly. Questionnaire and scale data are clearly not continuous variables, so parametric statistical methods cannot be applied to them directly. Even the sEMG and eye movement data collected in the ergonomics experiments must be tested for normality before a parametric test is used. It is therefore worth summarizing the correlation analysis methods and nonparametric statistical methods that suit different data types.

1. Data types of variables

The most common way to classify data is by level of measurement, which gives categorical (nominal), ordinal, interval, and ratio variables. Interval and ratio variables are continuous variables, and categorical and ordinal variables are discrete variables. Interval variables have equal units but no absolute zero point, so they support addition and subtraction but not multiplication and division. Ratio variables have both equal units and an absolute zero point, so they support all four arithmetic operations. Likert scale data are ordinal variables. Questionnaire data and the independent variables in an experimental design are mostly categorical variables, and sEMG data and eye movement data are ratio variables. For ordinal variables such as Likert scale data, if the Mantel-Haenszel trend test (which assesses the linear trend between two ordinal variables based on the scores the researcher assigns to their categories) supports treating them as interval variables, they can be analyzed as continuous variables.

2. Correlation analysis of variables with different data types

Pearson correlation measures the strength of the linear association between two continuous variables. The populations from which the two variables are drawn must be normally or approximately normally distributed.

For the correlation between two ordinal variables, Spearman correlation (also called Spearman rank correlation) is generally used. It tests the strength and direction of association when at least one variable is ordinal, or when both variables are continuous but the populations they come from are not normally distributed or have unknown distributions.

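Kendall’s tau-b correlation is a nonparametric method that also tests the strength and direction of association when at least one variable is ordinal. Its scope is essentially the same as that of Spearman correlation, but it is better suited to data with many tied values, such as contingency table data.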
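
As a minimal sketch of how these three coefficients differ, the pure-Python code below computes Pearson’s r, Spearman’s rho (Pearson’s r applied to average ranks) and Kendall’s tau-b (concordant minus discordant pairs, adjusted for ties). The `likert` and `pain_score` data are made up for illustration and the function names are hypothetical. A statistics package would normally be used instead, since it also reports p-values.

```python
from itertools import combinations
from math import sqrt


def average_ranks(values):
    """Rank values from 1 to n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # sorted positions i..j hold equal values
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson_r(x, y):
    """Pearson product-moment correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def spearman_rho(x, y):
    """Spearman correlation: Pearson's r computed on the ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))


def kendall_tau_b(x, y):
    """Kendall's tau-b: pair counting with a correction for tied pairs."""
    concordant = discordant = ties_x = ties_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0:
            ties_x += 1
        if dy == 0:
            ties_y += 1
        if dx * dy > 0:
            concordant += 1
        elif dx * dy < 0:
            discordant += 1
    n0 = len(x) * (len(x) - 1) // 2  # total number of pairs
    return (concordant - discordant) / sqrt((n0 - ties_x) * (n0 - ties_y))


# Hypothetical data: a 5-point Likert item and a skewed continuous score.
likert = [1, 2, 2, 3, 4, 4, 5, 5]
pain_score = [1.0, 1.5, 2.0, 2.5, 4.0, 6.0, 9.0, 20.0]

print("Pearson r    :", round(pearson_r(likert, pain_score), 3))
print("Spearman rho :", round(spearman_rho(likert, pain_score), 3))
print("Kendall tau-b:", round(kendall_tau_b(likert, pain_score), 3))
```

Because the relationship in this made-up data is monotonic but not linear, the rank-based coefficients come out higher than Pearson’s r, which is the situation the text describes for non-normal or ordinal data.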

For the correlation between two categorical variables, the Chi-square test can be used to test their independence. This test only establishes whether the association is statistically significant and does not reflect its strength, so it is often combined with Cramér’s V, which indicates the strength of the association.
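
The relationship between the Chi-square statistic and Cramér’s V can be sketched as follows. This is an illustrative pure-Python version with hypothetical function names and a made-up 2 x 2 table. It applies no continuity correction, assumes no row or column of the table is empty, and leaves the p-value to a Chi-square table or a statistics package.

```python
from math import sqrt


def chi_square_independence(table):
    """Return (chi-square statistic, degrees of freedom) for an r x c table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, df


def cramers_v(table):
    """Cramér's V: chi-square rescaled to the range 0..1."""
    chi2, _ = chi_square_independence(table)
    n = sum(sum(row) for row in table)
    k = min(len(table), len(table[0]))
    return sqrt(chi2 / (n * (k - 1)))


# Hypothetical counts: rows = heavy / light phone use, columns = pain / no pain.
table = [[20, 10],
         [10, 20]]

chi2, df = chi_square_independence(table)
print("chi-square =", round(chi2, 3), "df =", df)
print("Cramér's V =", round(cramers_v(table), 3))
```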

For the correlation between an ordinal variable and a continuous variable, the continuous variable is first treated as an ordinal variable, so that the problem becomes the relationship between two ordinal variables, and Spearman correlation can be used.

For a detailed description of this part, see the article “要做相关性分析，该如何选择正确的统计方法？” (“How do you choose the right statistical method for a correlation analysis?”).

3. Normality test of sample data

The one-sample Kolmogorov-Smirnov (K-S) test checks whether a sample comes from a normally distributed population. The binomial test checks whether the observed distribution of a dichotomous variable conforms to a hypothesized, expected, or specified binomial distribution.
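
A rough sketch of the one-sample K-S idea, using only the Python standard library, is given below. It compares the empirical distribution of a sample with a normal distribution fitted to that sample and reports the largest gap D. The p-value uses the asymptotic Kolmogorov series, which is only an approximation, and because the mean and standard deviation are estimated from the same sample the result is conservative (a Lilliefors-type correction would be needed for an accurate p-value). The function name and the `semg_rms` values are hypothetical.

```python
from math import exp, sqrt
from statistics import NormalDist, mean, stdev


def ks_normality(sample, terms=100):
    """Return (D, approximate p-value) for a one-sample K-S normality check."""
    xs = sorted(sample)
    n = len(xs)
    fitted = NormalDist(mean(xs), stdev(xs))  # normal with sample mean and sd
    d = 0.0
    for i, x in enumerate(xs, start=1):
        cdf = fitted.cdf(x)
        # Gap just after x (i / n) and just before x ((i - 1) / n).
        d = max(d, i / n - cdf, cdf - (i - 1) / n)
    lam = sqrt(n) * d
    # Asymptotic Kolmogorov tail probability (alternating series).
    p = 2 * sum((-1) ** (k - 1) * exp(-2 * k * k * lam * lam)
                for k in range(1, terms + 1))
    return d, min(1.0, max(0.0, p))


# Hypothetical sEMG amplitude readings (arbitrary units).
semg_rms = [12.1, 13.4, 11.8, 14.9, 12.7, 13.0, 15.6, 12.2, 13.8, 30.5]

d, p = ks_normality(semg_rms)
print("D =", round(d, 3), "approximate p =", round(p, 3))
```

A large D with a small p-value suggests the data depart from normality, in which case the nonparametric methods discussed in the next section are the safer choice.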

4. Nonparametric statistics of variables with different data types

Nonparametric tests are more reliable with large samples. In the one-sample case, the Chi-square goodness-of-fit test can be used to analyze whether the observed frequencies of a variable’s values agree with the theoretical frequencies.
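
The goodness-of-fit calculation can be illustrated with a small sketch. The counts are made up, the function name is hypothetical, and the p-value line relies on the fact that the Chi-square tail probability with exactly 2 degrees of freedom is exp(-x/2). For other degrees of freedom a Chi-square table or a statistics package is needed.

```python
from math import exp


def chi_square_goodness_of_fit(observed, expected):
    """Sum of (observed - expected)^2 / expected over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical answers to a three-option questionnaire item,
# tested against equal theoretical frequencies.
observed = [30, 50, 40]
expected = [sum(observed) / len(observed)] * len(observed)

stat = chi_square_goodness_of_fit(observed, expected)
df = len(observed) - 1
p_value = exp(-stat / 2)  # valid only because df == 2 here

print("chi-square =", round(stat, 3), "df =", df, "p =", round(p_value, 4))
```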

The Mann-Whitney U test checks whether two independent samples come from the same population, that is, whether their data have the same distribution. It should be used when the data do not meet the normality condition or when the two samples are ordinal data, and it corresponds to the independent-samples t-test among parametric methods. It requires the independent variable to be a categorical variable with two levels and the dependent variable to be measured on at least an ordinal scale (ordinal or continuous).
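
The following sketch shows one way the U statistic can be computed from pooled ranks. It is an illustration under stated assumptions rather than a reference implementation: the p-value uses the large-sample normal approximation with a tie correction and no continuity correction, so it is rough for samples as small as the made-up ones here, and all names are hypothetical.

```python
from collections import Counter
from math import sqrt
from statistics import NormalDist


def average_ranks(values):
    """Rank values from 1 to n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def mann_whitney_u(a, b):
    """Return (U for a, U for b, approximate two-sided p-value)."""
    n1, n2 = len(a), len(b)
    pooled = list(a) + list(b)
    ranks = average_ranks(pooled)
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2  # rank sum of a, recentred
    u2 = n1 * n2 - u1
    n = n1 + n2
    tie_term = sum(t ** 3 - t for t in Counter(pooled).values())
    sigma = sqrt(n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1))))
    z = (u1 - n1 * n2 / 2) / sigma
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return u1, u2, p


# Hypothetical Likert discomfort ratings for two groups of students.
light_users = [1, 2, 2, 3, 3, 2, 1, 3]
heavy_users = [3, 4, 4, 5, 3, 5, 4, 2]

u1, u2, p = mann_whitney_u(light_users, heavy_users)
print("U (light) =", u1, "U (heavy) =", u2, "approximate p =", round(p, 4))
```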

Tests of the difference between two related samples usually apply to two experimental designs: repeated measures designs and paired-sample designs. Four tests can be used: the Wilcoxon signed-rank test, the Sign test, the McNemar test, and the Marginal Homogeneity test. They correspond to the paired-samples t-test and the significance test of the correlation coefficient among parametric methods. The Wilcoxon signed-rank test is the most widely used and suits data whose distribution is continuous and symmetric. The Sign test has slightly lower statistical precision. The McNemar test applies only to paired dichotomous variables. The Marginal Homogeneity test extends the McNemar test to variables with more than two response categories, but only for ordinal variables. The McNemar and Marginal Homogeneity tests are especially suitable for pretest-posttest experimental designs.
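
To make the differences among these paired tests concrete, here is a small sketch with hypothetical function names and made-up pretest-posttest scores. The Sign test and the exact McNemar test both reduce to a two-sided binomial test with p = 0.5, while the Wilcoxon signed-rank test also uses the size of each difference through its rank. The Wilcoxon p-value below is a normal approximation without tie or continuity correction, which is an assumption of this sketch and is rough for small samples.

```python
from math import comb, sqrt
from statistics import NormalDist


def average_ranks(values):
    """Rank values from 1 to n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def binom_two_sided_half(k, n):
    """Exact two-sided p-value for k successes in n trials when p = 0.5."""
    tail = sum(comb(n, i) for i in range(min(k, n - k) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def sign_test(before, after):
    """Sign test: only the direction of each non-zero difference counts."""
    diffs = [a - b for b, a in zip(before, after) if a != b]
    positives = sum(d > 0 for d in diffs)
    return binom_two_sided_half(positives, len(diffs))


def mcnemar_exact(b, c):
    """Exact McNemar test; b and c count the two kinds of discordant pairs."""
    return binom_two_sided_half(b, b + c)


def wilcoxon_signed_rank(before, after):
    """Return (W+, W-, approximate two-sided p-value)."""
    diffs = [a - b for b, a in zip(before, after) if a != b]
    ranks = average_ranks([abs(d) for d in diffs])
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    n = len(diffs)
    mu = n * (n + 1) / 4
    sigma = sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (min(w_plus, w_minus) - mu) / sigma  # never positive
    return w_plus, w_minus, min(1.0, 2 * NormalDist().cdf(z))


# Hypothetical discomfort scores before and after an intervention.
pretest = [5, 6, 7, 8, 9, 10, 6, 7]
posttest = [4, 4, 5, 8, 6, 7, 7, 3]

print("Sign test p     :", round(sign_test(pretest, posttest), 4))
print("Wilcoxon W+, W-, p:", wilcoxon_signed_rank(pretest, posttest))
print("McNemar exact p :", round(mcnemar_exact(2, 9), 4))
```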

To test for differences among multiple independent samples, the Kruskal-Wallis H test, the Median test, and the Jonckheere-Terpstra test can be used. They correspond to one-way ANOVA for a completely randomized design among parametric methods. They require the independent variable to be a categorical variable with more than two levels and the dependent variable to be measured on at least an ordinal scale (ordinal or continuous). The Kruskal-Wallis H test corresponds directly to one-way ANOVA in parametric statistics and is the most frequently used. The Median test is in effect a contingency table analysis and has low precision. The Jonckheere-Terpstra test is similar to the Kruskal-Wallis H test but has higher precision when the grouping variable is ordinal.
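
As an illustration of the Kruskal-Wallis H test, the sketch below ranks all observations together and compares the rank sums of the groups, with the usual tie correction. The function name and the data are hypothetical. H is referred to a Chi-square distribution with (number of groups - 1) degrees of freedom; the p-value is left to a table or a statistics package.

```python
from collections import Counter


def average_ranks(values):
    """Rank values from 1 to n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def kruskal_wallis_h(*groups):
    """Return (tie-corrected H statistic, degrees of freedom)."""
    pooled = [v for g in groups for v in g]
    n = len(pooled)
    ranks = average_ranks(pooled)
    total = 0.0
    start = 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        total += rank_sum ** 2 / len(g)
        start += len(g)
    h = 12 / (n * (n + 1)) * total - 3 * (n + 1)
    # Tie correction: divide by 1 - sum(t^3 - t) / (n^3 - n).
    tie_term = sum(t ** 3 - t for t in Counter(pooled).values())
    h /= 1 - tie_term / (n ** 3 - n)
    return h, len(groups) - 1


# Hypothetical Likert discomfort ratings for three levels of phone use.
low = [1, 2, 2, 1, 3]
medium = [2, 3, 3, 4, 2]
high = [4, 5, 3, 5, 4]

h, df = kruskal_wallis_h(low, medium, high)
print("H =", round(h, 3), "df =", df)
```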

To test for differences among multiple related samples, the Friedman test, Cochran’s Q test, and Kendall’s W test can be used. They correspond to ANOVA for a randomized block design among parametric methods. The Friedman test is an extension of the Wilcoxon signed-rank test. Cochran’s Q test applies only to several related dichotomous variables and is an extension of the McNemar test. Kendall’s W test checks whether the opinions of different raters agree. Both the Friedman test and Cochran’s Q test apply to repeated measures designs and paired-sample designs. If the test finds a significant difference, further post-hoc tests are required, for example pairwise Wilcoxon signed-rank tests.
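
The link between the Friedman test and Kendall’s W can be sketched in a few lines: both are built from the rank sums of the conditions after ranking within each subject (or rater). The version below is illustrative only, uses hypothetical names and made-up ratings, and omits the tie correction, so it assumes few or no ties within a row.

```python
def average_ranks(values):
    """Rank values from 1 to n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def friedman_and_kendall_w(rows):
    """rows: one list per subject or rater, one value per condition.

    Returns (Friedman chi-square, Kendall's W), both without tie correction.
    """
    n, k = len(rows), len(rows[0])
    rank_sums = [0.0] * k
    for row in rows:
        for j, r in enumerate(average_ranks(row)):
            rank_sums[j] += r  # ranks are assigned within each row
    chi2 = (12 / (n * k * (k + 1)) * sum(r ** 2 for r in rank_sums)
            - 3 * n * (k + 1))
    w = chi2 / (n * (k - 1))  # W rescales chi-square to the range 0..1
    return chi2, w


# Hypothetical sEMG amplitudes of 5 subjects under 3 phone-use postures.
measurements = [
    [10.2, 12.5, 15.1],
    [9.8, 11.9, 14.0],
    [11.0, 10.5, 16.2],
    [10.1, 13.3, 13.9],
    [12.4, 12.9, 17.5],
]

chi2, w = friedman_and_kendall_w(measurements)
print("Friedman chi-square =", round(chi2, 3), "Kendall's W =", round(w, 3))
```

The Friedman statistic is referred to a Chi-square distribution with (number of conditions - 1) degrees of freedom, and W close to 1 indicates that the rows rank the conditions in nearly the same order.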

For a detailed description of this part, see the following reference.

Ding Guosheng, Li Tao (eds.). SPSS统计教程——从研究设计到数据分析 [SPSS Statistics Tutorial: From Research Design to Data Analysis]. Beijing: China Machine Press, 2014.
