Observation vs. Observable: Maximum Likelihood Estimations according to the Assumption of Generalized Gauss and
1 Technical University of
2 "Iuliu Haţieganu" University of Medicine and Pharmacy
E-mail(s): lori@academicdirect.org; sbolboaca@umfcluj.ro.
(* Corresponding author)
Abstract
Aim: The paper aims to investigate the use of maximum likelihood estimation to infer measurement types with their distribution shape. Material and Methods: A series of twenty-eight sets of observed data (different properties and activities) were studied. The following analyses were applied in order to meet the aim of the research: precision, normality (Chi-square, Kolmogorov-Smirnov, and Anderson-Darling tests), the presence of outliers (Grubbs’ test), estimation of the population parameters (maximum likelihood estimation under Laplace, Gauss, and Gauss-Laplace distribution assumptions), and analysis of kurtosis (departure of sample kurtosis from the Laplace, Gauss, and Gauss-Laplace population kurtosis). Results: The mean of most investigated sets was likely to be Gauss-Laplace, while the standard deviation of most investigated sets of compounds was likely to be Gauss. The MLE analysis allowed assumptions to be made regarding the type of errors in the investigated sets. Conclusions: The proposed procedure proved to be useful in analyzing the shape of the distribution according to measurement type and generated several assumptions regarding their association.
Keywords
Statistical inference; Accuracy; Observation; Maximum likelihood estimation.
Introduction
Experimental data play an important role in the validity of quantitative Structure-Activity Relationship (qSAR) models. The precision and accuracy of experimental data influence the uncertainty of a qSAR model. The variability in the descriptor values used in modeling [1], the correct choice of the variables involved, and the factors that influence the activity/property [2] also influence the validity of qSAR models. Accuracy refers to how experiments are carried out. Gross errors that may occur can be eliminated by checking instruments against the standard, repeating measurements, using standard procedures, calibrating devices, etc. These errors can be classified into two types: instrumental (always limited by the equipment and protocol used) and human (natural human biases, for example reading errors). Experimental accuracy could also be related to the existence of systematic errors (e.g. differences between laboratories, differences between researchers, etc.) [3]. Consequently, the statistical identification of any type of error in experimental data is a relevant issue in qSAR analyses due to its impact on the estimation / prediction model.
Maximum likelihood (ML) [4] is a method used to find the parameters that maximize the probability of the observations. The main properties of the maximum likelihood method are as follows [5]: ▪ consistency (the maximum likelihood estimate of a parameter is asymptotically consistent, n→∞); ▪ normality (the maximum likelihood estimate is asymptotically normally distributed with minimal variance); ▪ invariance (the maximum likelihood solution is invariant under a change of parameterization); ▪ efficiency (if efficient estimators exist for a given problem, the maximum likelihood method will find them). The method may also be used to evaluate the uncertainty of qSAR models [6-9].
The present research aimed to use the maximum likelihood estimation method in order to assess the association between measurement types and the power of error according to error type.
Material and Method
Sets of Compounds
Twenty-eight sets of compounds, each with a different property / activity, were investigated. The measured property or activity was taken from previously reported research. A summary of the investigated sets of compounds (sample size, set abbreviation, activity/property, existence of ties, and associated references) is presented in Table 1.
Table 1. Investigated sets of compounds
| No. | n | Set [ref] | Activity / Property | Ties |
| 1 | 209 | Y209 [10] | Chromatographic retention times | Yes |
| 2 | 209 | RRF [11] | Relative response factor | Yes |
| 3 | 206 | Y206 [12] | Octanol-water partition coefficient (logKow) | Yes |
| 4 | 205 | Y205 [13] | Octanol-water partition coefficient (logKow) | Yes |
| 5 | 166 | C166 [14] | Thermodynamic solubility | Yes |
| 6 | 143 | OrgPest [15,16] | Soil sorption coefficients (KOC) | Yes |
| 7 | 126 | Anthra [17-23] | Toxicity on HepG2 cells (logIC50^c) | Yes |
| 8 | 111 | MPC [24-27] | Molecular partition coefficient in n-octanol / water system (logP) | Yes |
| 9 | 105 | MDL [28-38] | Brain-blood partition coefficient (logBBP) | Yes |
| 10 | 88 | Diamino [39,40] | Antibacterial inhibitory activity (-logIC50^f) | Yes |
| 11 | 87 | lnCHF [41] | Concentration high food (ng/g - lnCHF) | Yes |
| 12 | 69 | AAT [42] | Acute aquatic toxicity (-log[LC50]), LC50^a | Yes |
| 13 | 63 | DZGALYL [43] | Resistance index (RI^d) (-log(RI[taxoid]/RI[paclitaxel])) | No |
| 14 | 63 | IMHH [44] | Brain-blood partition coefficient (logBBP) | Yes |
| 15 | 57 | InHIV [45] | HIV1 inhibition (log(10^6/C50)), C50^b | No |
| 16 | 58 | InACE [46] | ACE inhibition activity (log(1/IC50)), IC50^c | Yes |
| 17 | 57 | Clark [47] | Brain-blood partition coefficient (logBBP) | Yes |
| 18 | 48 | BTA [46] | Bitter tasting activity (log(1/T)) | Yes |
| 19 | 47 | MASIS-CAII [48] | Carbonic anhydrase II inhibitory activity (KI, nM) | Yes |
| 20 | 45 | MCY [49,50] | Brain-blood partition coefficient (logBBP) | No |
| 21 | 43 | BKST [51] | Protonation constant (pKa) | No |
| 22 | 40 | CAI [52] | Carbonic anhydrase I inhibitory activity (logIC50, nM) | Yes |
| 23 | 40 | CAII [52] | Carbonic anhydrase II inhibitory activity (logIC50, nM) | Yes |
| 24 | 40 | CAIV [52] | Carbonic anhydrase IV inhibitory activity (logIC50, nM) | Yes |
| 25 | 39 | Nitro [53] | Toxicity (logLD50, LD50^f (mg/kg)) | Yes |
| 26 | 35 | MGWTI [54] | Cell growth inhibitory activity (log1/IC50, IC50^c) | Yes |
| 27 | 29 | TTKSS-CAII [55] | Carbonic anhydrase II inhibitory activity (logKc) | Yes |
| 28 | 25 | ERBAT [56] | Estrogen receptor binding affinity (logRBA, LBA^e) | Yes |
n = sample size; Ties = existence of more than one compound with the same value of property/activity
^a LC50 = 50% lethal dose concentration
^b C50 = compound concentration required to achieve 50% protection of MT-4 cells against HIV
^c IC50 = compound concentration required for 50% growth inhibition
^d Inhibitory effect (IC50) to drug sensitive human breast carcinoma (MCF-7S) and multidrug-resistance human breast carcinoma (MCF-7R), in vitro
^e Relative binding affinity to the estrogen receptor vis-à-vis E2
Method
Experimental data were analyzed progressively in order to achieve the aim of the research:
§ Precision analysis. A series of statistical parameters were calculated in order to characterize the observed data: minimum, maximum, skewness, kurtosis, standard deviation, coefficient of variation (CV = s/m), and variance-to-mean ratio (also known as index of dispersion, VMR = s²/m). Standard deviation is associated with errors in each individual measurement. The skewness evaluated the asymmetry of the distribution while the kurtosis showed how far the distribution of data was from the Gaussian shape. The following interpretations for skewness were used [57]: -0.5 < skewness < 0.5: distribution is approximately normal; -1 < skewness < -0.5 or 0.5 < skewness < 1: distribution is moderately skewed; skewness < -1 or skewness > 1: distribution is highly skewed. The data were considered normally distributed if the excess kurtosis (kurtosis - 3) was approximately zero; an excess kurtosis higher than 0 indicated a leptokurtic distribution; an excess kurtosis below 0 indicated a platykurtic distribution [58].
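The descriptive parameters listed above can be sketched in a few lines of Python. This is a minimal illustration, not the software used in the study: the function names `describe` and `skewness_class` are hypothetical, the moment-based estimators of skewness and kurtosis are an assumption (the source does not state which estimators were used), and CV and VMR are returned as `None` when a set contains negative values, mirroring the "n.a." entries reported for such sets.

```python
import math

def describe(values):
    """Descriptive statistics for one set of measurements (illustrative sketch)."""
    n = len(values)
    m = sum(values) / n
    # Sample standard deviation (n - 1 in the denominator).
    s = math.sqrt(sum((x - m) ** 2 for x in values) / (n - 1))
    # Central moments (n in the denominator) for skewness and kurtosis.
    m2 = sum((x - m) ** 2 for x in values) / n
    m3 = sum((x - m) ** 3 for x in values) / n
    m4 = sum((x - m) ** 4 for x in values) / n
    skew = m3 / m2 ** 1.5
    kurt = m4 / m2 ** 2          # equals 3 for a Gauss distribution
    # Assumption: CV and VMR are not reported when negative values occur.
    usable = min(values) >= 0 and m > 0
    return {
        "n": n, "mean": m, "sd": s, "min": min(values), "max": max(values),
        "skew": skew, "kurt": kurt, "excess_kurt": kurt - 3,
        "cv": s / m if usable else None,
        "vmr": s ** 2 / m if usable else None,
    }

def skewness_class(skew):
    """Classify skewness with the thresholds quoted in the text [57]."""
    a = abs(skew)
    if a < 0.5:
        return "approximately symmetric"
    if a <= 1:
        return "moderately skewed"
    return "highly skewed"
```

For example, `skewness_class(0.77)` returns `"moderately skewed"` under these thresholds.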
§ Distribution analysis. Three hypotheses regarding the distribution of the observed data (Laplace, Gauss, and Gauss-Laplace) were tested using the EasyFit software [59]. The following tests were applied: Chi-square [60], Kolmogorov-Smirnov [61], and Anderson-Darling [62]. The Anderson-Darling test was applied because it gives more weight to the tails than the Kolmogorov-Smirnov test. Moreover, the Anderson-Darling test is sensitive to ties [61]. Outliers seem to induce type II errors in the Kolmogorov-Smirnov test (the null hypothesis is not rejected even though it is false) and type I errors (the null hypothesis is rejected even though it is true) in the Anderson-Darling test [63].
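As an illustration of what the Kolmogorov-Smirnov test measures, the sketch below computes its statistic, the largest gap between the empirical distribution function of a sample and a hypothesized cumulative distribution function. It is not the EasyFit implementation; the function name `ks_statistic` is hypothetical, and no p-value or critical value is computed here.

```python
from statistics import NormalDist

def ks_statistic(values, cdf):
    """Kolmogorov-Smirnov statistic: sup |F_n(x) - F(x)| over the sample."""
    xs = sorted(values)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        # The empirical CDF jumps from i/n to (i + 1)/n at x,
        # so the gap is checked on both sides of the jump.
        d = max(d, (i + 1) / n - f, f - i / n)
    return d

# Usage: test a sample against a Gauss distribution with given mean and sd.
sample = [0.2, -0.4, 1.1, 0.0, -0.9]
gauss_cdf = NormalDist(mu=0.0, sigma=1.0).cdf
d_gauss = ks_statistic(sample, gauss_cdf)
```

Any other hypothesized distribution can be tested by passing its cumulative distribution function as `cdf`.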
§ Grubbs analysis. Grubbs test [64] was applied whenever appropriate in order to adjust the obliquity of experimental data (skewness; -0.5 < skewness < 0.5: distribution was considered as approximately symmetric). The characteristics of Grubbs test are as follows:
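a) Grubbs’ statistic:
$$G = \frac{\max_{1 \le i \le n} \lvert Y_i - m \rvert}{s} \qquad \text{Eq(1)}$$
where i = identification number of the compound in the data set (1 ≤ i ≤ n); Yi = observed value for compound i; m = sample mean; s = sample standard deviation.
b) The null hypothesis (no outliers in the data set) is rejected in the two-sided test if:
$$G > \frac{n-1}{\sqrt{n}} \sqrt{\frac{t_{\alpha/(2n),\,n-2}^{2}}{n - 2 + t_{\alpha/(2n),\,n-2}^{2}}} \qquad \text{Eq(2)}$$
where n = sample size and $t_{\alpha/(2n),\,n-2}$ = critical value of the t-distribution with (n-2) degrees of freedom at a significance level of α/(2n).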
§ Error analysis. Maximum likelihood estimation (MLE) was used as statistical method for fitting the experimental data of the investigated sets in order to estimate a series of parameters of the model. The following formulas were used:
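$$\mathrm{MLE} = \sum_{i=1}^{n} \ln GL(X_i; \mu, \sigma, p) \rightarrow \max \qquad \text{Eq(3)}$$
$$GL(x; \mu, \sigma, p) = \frac{p}{2\sigma} \cdot \frac{\Gamma(3/p)^{1/2}}{\Gamma(1/p)^{3/2}} \cdot \exp\left( -\left( \frac{\lvert x - \mu \rvert}{\sigma} \right)^{p} \left( \frac{\Gamma(3/p)}{\Gamma(1/p)} \right)^{p/2} \right) \qquad \text{Eq(4)}$$
where Xi = measured property / activity for compound i (1 ≤ i ≤ n); μ = population mean; σ = population standard deviation; p = power of error; Γ = gamma function.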
The GL(x;μ,σ,p) probability density function features two particular cases: when p = 1 (fixed) it becomes the Laplace (or error) distribution, and when p = 2 (fixed) it becomes the Gauss (or normal) distribution.
The sample mean of each set of compounds was considered the maximum likelihood estimator of the population mean; the sample variance was considered the maximum likelihood estimator of the population variance. Three cases of hypothetical distributions were investigated in this research: Laplace (p = 1), Gauss (p = 2), and Gauss-Laplace (power of error to be estimated) [13]. For each distribution, the population statistical parameters were calculated (mean and standard deviation; also power of error for Gauss-Laplace).
The association of measurement type with the power of error (p) according to the type of error was also investigated (Laplace (p = 1) as model for relative error and Gauss (p = 2) for absolute error).
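The error analysis can be illustrated with a short sketch of the log-likelihood under the Gauss-Laplace assumption. It assumes the standard generalized Gauss density parameterized by its mean μ, standard deviation σ and power p, which reduces to the Laplace density at p = 1 and to the Gauss density at p = 2. The function names are hypothetical, and the grid search over p with μ and σ fixed at the sample moments is a simplification for illustration only; a full fit would optimize all three parameters jointly.

```python
import math

def gl_logpdf(x, mu, sigma, p):
    """Log-density of the Gauss-Laplace distribution (mean mu, sd sigma, power p)."""
    # log of Gamma(3/p) / Gamma(1/p), kept on the log scale for stability.
    log_ratio = math.lgamma(3 / p) - math.lgamma(1 / p)
    log_norm = (math.log(p) - math.log(2 * sigma)
                + 0.5 * log_ratio - math.lgamma(1 / p))
    z = abs(x - mu) / sigma
    return log_norm - z ** p * math.exp(0.5 * p * log_ratio)

def log_likelihood(values, mu, sigma, p):
    """Sum of log-densities over the sample (the quantity to be maximized)."""
    return sum(gl_logpdf(x, mu, sigma, p) for x in values)

def fit_power(values, p_grid):
    """Pick the power of error from p_grid that maximizes the log-likelihood,
    with mu and sigma fixed at the sample moments (illustrative shortcut)."""
    n = len(values)
    mu = sum(values) / n
    sigma = math.sqrt(sum((x - mu) ** 2 for x in values) / n)
    return max(p_grid, key=lambda p: log_likelihood(values, mu, sigma, p))
```

Comparing `log_likelihood(values, mu, sigma, 1)` with `log_likelihood(values, mu, sigma, 2)` shows directly whether the Laplace or the Gauss assumption is better supported by a given set.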
§ Kurtosis analysis. The kurtosis of the samples was computed for Laplace (p = 1), Gauss (p = 2) and Gauss-Laplace (p as resulted from MLE). The following kurtosis formula for the investigated distributions was used to analyze the distance between the sample kurtosis and the expected population kurtosis:
$$\mathrm{Kurtosis}_{GL}(p) = \frac{\Gamma(1/p)\,\Gamma(5/p)}{\Gamma(3/p)^{2}} \qquad \text{Eq(5)}$$
The following two particular cases occurred: Laplace (p = 1) with KurtosisGL(1) = 6 and Gauss (p = 2) with KurtosisGL(2) = 3.
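The two particular cases can be checked numerically. The sketch below assumes the kurtosis of the generalized Gauss family, Γ(1/p)·Γ(5/p)/Γ(3/p)², which reproduces the values 6 and 3 quoted for p = 1 and p = 2. The helper `closest_model` is hypothetical and only illustrates the idea of comparing a sample kurtosis with the expected population kurtosis of each candidate distribution.

```python
import math

def kurtosis_gl(p):
    """Expected kurtosis of the Gauss-Laplace distribution with power p."""
    return math.exp(math.lgamma(1 / p) + math.lgamma(5 / p)
                    - 2 * math.lgamma(3 / p))

def closest_model(sample_kurt, p_mle):
    """Name of the candidate whose expected kurtosis is nearest to the sample's."""
    candidates = {"Laplace": 1.0, "Gauss": 2.0, "Gauss-Laplace": p_mle}
    return min(candidates,
               key=lambda name: abs(sample_kurt - kurtosis_gl(candidates[name])))
```

Here `p_mle` stands for the power of error estimated by maximum likelihood for the set under study.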
Results and Discussion
Descriptive statistics expressed as mean (m), standard deviation (s), minimum (min), maximum (max), skewness (skew), kurtosis (kurt), coefficient of variation (CV) and variance-to-mean ratio (VMR) were calculated for the investigated sets of compounds and are presented in Table 2.
Table 2. Descriptive statistics of investigated property / activity
| Set | n | min | max | m | s | skew | kurt | VMR | CV (%) |
| Y209 | 209 | 0.10 | 1.05 | 0.60 | 0.18 | -0.13 | 2.72 | 0.054 | 30 |
| RRF | 209 | 0.03 | 2.04 | 0.77 | 0.35 | 0.56 | 3.67 | 0.162 | 46 |
| Y206 | 206 | 4.15 | 9.60 | 6.48 | 0.83 | 0.25 | 3.85 | 0.106 | 13 |
| Y205 | 205 | 4.15 | 9.14 | 6.47 | 0.80 | 0.05 | 3.28 | 0.099 | 12 |
| C166 | 166 | -6.00 | 3.35 | -0.35 | 1.81 | -0.49 | 3.20 | n.a. | n.a. |
| OrgPest | 143 | 0.42 | 5.31 | 2.52 | 0.91 | 0.77 | 3.68 | 0.327 | 36 |
| Anthra | 126 | 3.45 | 7.70 | 4.74 | 0.78 | 1.60 | 5.94 | 0.127 | 16 |
| Anthra-GO | 124 | 3.45 | 7.05 | 4.70 | 0.69 | 1.36 | 5.17 | 0.103 | 15 |
| MPC | 111 | -0.44 | 4.79 | 1.90 | 1.01 | -0.03 | 2.98 | 0.538 | 53 |
| MDL | 105 | -2.00 | 1.44 | -0.09 | 0.77 | -0.47 | 2.86 | n.a. | n.a. |
| Diamino | 88 | 3.10 | 6.00 | 4.84 | 0.52 | -0.81 | 4.18 | 0.056 | 11 |
| Diamino-GO | 87 | 3.51 | 6.00 | 4.86 | 0.49 | -0.58 | 3.56 | 0.049 | 10 |
| lnCHF | 87 | 0.26 | 5.77 | 3.22 | 1.19 | -0.23 | 2.69 | 0.442 | 37 |
| AAT | 69 | 3.04 | 6.37 | 4.25 | 0.76 | 0.68 | 2.93 | 0.136 | 18 |
| DZGALYL | 63 | -0.57 | 2.28 | 0.74 | 0.68 | 0.34 | 2.66 | n.a. | n.a. |
| IMHH | 63 | -2.15 | 1.04 | -0.16 | 0.79 | -0.61 | 2.70 | n.a. | n.a. |
| InHIV | 57 | 3.07 | 8.62 | 6.54 | 1.50 | -0.60 | 2.36 | 0.345 | 23 |
| InACE | 58 | 1.77 | 5.80 | 3.05 | 1.00 | 1.09 | 3.62 | 0.329 | 33 |
| Clark | 57 | -2.15 | 1.04 | -0.14 | 0.79 | -0.68 | 2.89 | n.a. | n.a. |
| BTA | 48 | 1.13 | 3.60 | 1.98 | 0.63 | 0.84 | 2.91 | 0.199 | 32 |
| MASIS-CAII | 47 | 0.86 | 2.51 | 1.75 | 0.51 | -0.25 | 1.79 | 0.149 | 29 |
| MCY | 45 | -2.00 | 1.04 | 0.00 | 0.71 | -0.95 | 3.76 | n.a. | n.a. |
| ERBAT | 25 | -2.00 | 2.22 | 0.38 | 1.38 | -0.47 | 1.98 | n.a. | n.a. |
| CAI | 40 | 0.00 | 2.66 | 0.85 | 0.54 | 1.45 | 7.60 | 0.338 | 63 |
| CAII | 40 | -0.70 | 2.04 | 0.47 | 0.52 | 0.85 | 6.04 | n.a. | n.a. |
| CAIV | 40 | -0.30 | 2.51 | 0.74 | 0.54 | 0.98 | 6.49 | n.a. | n.a. |
| logCAII-GO | 38 | -0.70 | 0.95 | 0.39 | 0.38 | -0.95 | 3.55 | n.a. | n.a. |
| logCAIV-GO | 38 | -0.30 | 1.45 | 0.66 | 0.39 | -0.93 | 3.78 | n.a. | n.a. |
| Nitro | 39 | 3.38 | 8.77 | 6.50 | 1.37 | -0.53 | 3.07 | 0.291 | 21 |
| MGWTI | 35 | -2.00 | 1.74 | -0.69 | 1.25 | 0.78 | 2.15 | n.a. | n.a. |
| logCAI-GO | 34 | 0.30 | 1.28 | 0.85 | 0.25 | -0.25 | 2.78 | 0.076 | 30 |
| TTKSS-CAII | 29 | 4.41 | 9.39 | 7.44 | 1.41 | -0.48 | 2.29 | 0.267 | 19 |
| BKST | 43 | 5.51 | 10.53 | 8.46 | 1.13 | -0.49 | 3.13 | 0.151 | 13 |
n = sample size; min = minimum; max = maximum; m = sample mean; s = sample standard deviation; skew = skewness; kurt = kurtosis; VMR = variance-to-mean ratio; CV = coefficient of variation
Thirteen out of thirty-three sets of compounds had negative values. The dispersion index and the variance coefficient could not be analyzed for these sets due to these negative values.
The analysis of the skewness revealed that 11 sets of compounds had a moderately skewed distribution (probability to be observed between 1% and 5%), 7 sets had a highly skewed distribution (less than 1% probability to be observed), and 15 sets had an approximately symmetric distribution (no rejection of symmetry at a 5% risk of being in error). The highly skewed sets comprised Soil sorption coefficients (OrgPest), Relative response factor (RRF), and some sets which referred to the concentration of compounds required for 50% growth inhibition (Anthra, CAI, InACE and Diamino); the Anthra set remained highly skewed following Grubbs' test. According to this parameter, 15 sets of compounds were expected to have an approximately symmetric distribution. The analysis of kurtosis revealed that 18 sets of compounds were leptokurtic and 15 platykurtic. According to the kurtosis values, the toxicity on HepG2 cells (Anthra) and the carbonic anhydrase inhibitory activity sets (CAI, CAII and CAIV) were expected to have the Laplace distribution (kurtosis > 5).
The analysis of variance-to-mean ratio of the investigated sets of compounds concluded that the data were under-dispersed (0 < VMR < 1) without exception. The analysis of the results obtained by the variation coefficients (as a measure of relative variation) showed a great relative variation (CV ≥ 20) of the experimental data in 17 sets and a small variation (10 ≤ CV < 20) in 9 sets. MPC and CAI presented greatest data variation according to the variation coefficients (see Table 2). The removal of the outlier whenever identified by Grubbs test did not shift the set of compounds between variation classes (see Table 2).
The analysis of the results obtained following the investigation of the null hypothesis “the observed data followed the Laplace distribution” revealed the following (see Table 3):
§ All three applied tests rejected the null hypothesis at a significance level of 5% for 10 sets: RRF, OrgPest, Anthra, Anthra-GO, AAT, InHIV, InACE, BTA, CAII, and CAIV.
§ With two exceptions (AAT and IMHH sets), the Anderson-Darling test rejected the null hypothesis for the same sets of compounds as the Chi-square test: Y209, RRF, Y206, Y205, OrgPest, Anthra, Anthra-GO, MDL, InHIV, InACE, and BTA.
§ With few exceptions, the null hypothesis of Laplace distribution was rejected at different significance levels. The exceptions were: DZGALYL, Clark, MCY, BKST, CAI, Nitro, logCAI-GO, ERBAT.
The Chi-square test rejected the null hypothesis of normality at a significance level of 5% in 5 (RRF, Anthra, Anthra-GO, InACE, and BTA) out of 28 cases (see Table 4). Normality was also rejected by the Kolmogorov-Smirnov and Anderson-Darling tests for the Anthra and Anthra-GO sets. These two sets of compounds were the ones for which all three normality tests agreed on rejection at a 5% significance level. Thus, it can be concluded that the toxicity on HepG2 cells did not follow the normal distribution. Note that adjusting the obliquity of the experimental data (Grubbs' test) in the Anthra set did not lead to a normally distributed data set. This observation was also true, at different significance levels, for logCAII-GO and logCAIV-GO, which led to the conclusion that there were errors in the experimental data (unreliable data).
Table 3. Results of Laplace distribution testing: Chi square (CS), Kolmogorov Smirnov (KS) and Anderson Darling (AD) tests
| Set | CS Stat. | CS df | CS p | CS Reject5% | CS Rejectα% | KS Stat. | KS p | KS Reject5% | KS Rejectα% | AD Stat. | AD Reject5% | AD Rejectα% |
| Y209 | 19.49 | 7 | 0.0068 | Yes | ≥0.01 | 0.08769 | 0.0756 | No | ≥0.1 | 2.7752 | Yes | ≥0.05 |
| RRF | 28.99 | 7 | 1.44·10^-4 | Yes | ≥0.01 | 0.1121 | 0.0096 | Yes | ≥0.02 | 3.2920 | Yes | ≥0.02 |
| Y206 | 21.97 | 7 | 0.0026 | Yes | ≥0.01 | 0.0844 | 0.1000 | No | 0.2 | 2.7284 | Yes | ≥0.05 |
| Y205 | 25.19 | 7 | 7.03·10^-4 | Yes | ≥0.01 | 0.0920 | 0.0583 | No | ≥0.1 | 3.1799 | Yes | ≥0.05 |
| C166 | 11.13 | 7 | 0.1331 | No | 0.2 | 0.0996 | 0.0692 | No | ≥0.1 | 2.0107 | No | ≥0.1 |
| OrgPest | 24.76 | 7 | 8.36·10^-4 | Yes | ≥0.01 | 0.1299 | 0.0145 | Yes | ≥0.02 | 2.566 | Yes | ≥0.05 |
| Anthra | 35.32 | 6 | 3.74·10^-6 | Yes | ≥0.01 | 0.1784 | 5.56·10^-4 | Yes | ≥0.01 | 5.0544 | Yes | ≥0.01 |
| Anthra-GO | 35.32 | 6 | 3.74·10^-6 | Yes | ≥0.01 | 0.1610 | 0.0028 | Yes | ≥0.01 | 3.8716 | Yes | ≥0.01 |
| MPC | 10.57 | 6 | 0.1026 | No | 0.2 | 0.1002 | 0.2011 | No | n.a. | 1.5632 | No | 0.2 |
| MDL | 19.49 | 7 | 0.0068 | Yes | ≥0.01 | 0.0877 | 0.0756 | No | ≥0.10 | 2.7752 | Yes | ≥0.05 |
| Diamino | 7.61 | 6 | 0.2682 | No | n.a. | 0.1595 | 0.0202 | Yes | ≥0.05 | 2.040 | No | ≥0.10 |
| Diamino-GO | 9.52 | 6 | 0.1460 | No | 0.2 | 0.1518 | 0.0324 | Yes | ≥0.05 | 1.8791 | No | 0.2 |
| lnCHF | 9.17 | 6 | 0.1645 | No | 0.2 | 0.1086 | 0.2388 | No | n.a. | 1.5085 | No | 0.2 |
| AAT | 10.69 | 4 | 0.0303 | Yes | ≥0.05 | 0.1711 | 0.0309 | Yes | ≥0.05 | 2.0787 | No | ≥0.10 |
| DZGALYL | 3.83 | 5 | 0.5738 | No | n.a. | 0.1139 | 0.3598 | No | n.a. | 0.9349 | No | n.a. |
| IMHH | 11.28 | 4 | 0.0236 | Yes | ≥0.05 | 0.1316 | 0.2063 | No | n.a. | 1.8420 | No | 0.2 |
| InHIV | 13.09 | 4 | 0.0108 | Yes | ≥0.02 | 0.1870 | 0.0322 | Yes | ≥0.05 | 2.8312 | Yes | ≥0.05 |
| InACE | 14.26 | 5 | 0.0140 | Yes | ≥0.02 | 0.2011 | 0.0157 | Yes | ≥0.02 | 2.6301 | Yes | ≥0.05 |
| Clark | 7.79 | 4 | 0.0996 | No | ≥0.10 | 0.1306 | 0.2614 | No | n.a. | 1.5585 | No | 0.2 |
| BTA | 12.64 | 3 | 0.0055 | Yes | ≥0.01 | 0.2518 | 0.0036 | Yes | ≥0.01 | 2.6130 | Yes | ≥0.05 |
| MASIS-CAII | 8.46 | 4 | 0.0761 | No | ≥0.10 | 0.14928 | 0.2224 | No | n.a. | 2.0537 | No | ≥0.10 |
| MCY | 1.39 | 4 | 0.8458 | No | n.a. | 0.14979 | 0.2398 | No | n.a. | 1.1642 | No | n.a. |
| BKST | 4.01 | 4 | 0.4050 | No | n.a. | 0.1100 | 0.6351 | No | n.a. | 0.6320 | No | n.a. |
| CAI | 2.77 | 4 | 0.5967 | No | n.a. | 0.1110 | 0.6667 | No | n.a. | 0.6642 | No | n.a. |
| CAII | 15.34 | 3 | 0.0016 | Yes | ≥0.01 | 0.221 | 0.0658 | No | ≥0.10 | 2.6033 | Yes | ≥0.05 |
| CAIV | 15.34 | 3 | 0.0016 | Yes | ≥0.01 | 0.2021 | 0.0658 | No | ≥0.10 | 2.6033 | Yes | ≥0.05 |
| Nitro | 3.26 | 3 | 0.3527 | No | n.a. | 0.1573 | 0.2611 | No | n.a. | 0.9967 | No | n.a. |
| logCAII-GO | 6.67 | 3 | 0.0833 | No | ≥0.10 | 0.2667 | 0.0071 | Yes | ≥0.01 | 1.9159 | No | 0.2 |
| logCAIV-GO | 7.28 | 4 | 0.1216 | No | 0.2 | 0.2288 | 0.0313 | Yes | ≥0.05 | 1.515 | No | 0.2 |
| MGWTI | 6.07 | 3 | 0.1085 | No | 0.2 | 0.2556 | 0.0167 | Yes | ≥0.02 | 2.8245 | Yes | ≥0.05 |
| logCAI-GO | 0.43 | 4 | 0.9796 | No | n.a. | 0.1322 | 0.5477 | No | n.a. | 0.5747 | No | n.a. |
| TTKSS-CAII | 5.47 | 3 | 0.1402 | No | 0.2 | 0.1698 | 0.3344 | No | n.a. | 1.1505 | No | n.a. |
| ERBAT | 1.45 | 2 | 0.4831 | No | n.a. | 0.1519 | 0.5601 | No | n.a. | 1.1865 | No | n.a. |
CS = Chi-square; KS = Kolmogorov-Smirnov; AD = Anderson-Darling; Stat. = value of the statistic; df = degrees of freedom; Reject5% = reject the hypothesis at a significance level of 5%; Rejectα% = the significance level at which the hypothesis is rejected, whenever appropriate; p = p-value; n.a. = not applicable
The hypothesis of normality was rejected at different significance levels by the Chi-square test in 14 cases (α = 0.2: Y206, MPC, AAT, InHIV, MASIS-CAII, CAI, logCAII-GO; α ≥ 0.10: CAII; α ≥ 0.01: BTA, Anthra, Anthra-GO; α ≥ 0.02: RRF; α ≥ 0.05: IMHH, Clark). An agreement between the applied normality tests (at different significance levels, see Table 4) was observed for the RRF and BTA sets.
The Kolmogorov-Smirnov test rejected the hypothesis of normality at a 5% significance level in four sets: Anthra, Anthra-GO, MCY and logCAII-GO. Note that for the MCY and logCAII-GO sets, the hypothesis of normality was rejected only by the Kolmogorov-Smirnov test.
Anderson-Darling, a less conservative normality test, rejected the hypothesis of normality at a 5% significance level in only 2 cases (Anthra and Anthra-GO sets, see Table 4).
The normality analysis showed that the following sets of compounds were not expected to present the shortest distance between the population (modelled through MLE) and the sample mean and between the population and sample standard deviation according to the Gauss assumption (p = 2): RRF, Anthra, Anthra-GO, Clark, BTA, MCY, and logCAII-GO.
The analysis of the results obtained following the investigation of the null hypothesis “the observed data followed the Gauss-Laplace distribution” revealed the following (see Table 5):
§ The null hypothesis of Gauss-Laplace distribution was rejected at a 5% significance level in all three tests for the Anthra and Anthra-GO sets.
§ The null hypothesis of Gauss-Laplace distribution was rejected at different significance levels in all three tests for the RRF and logCAII-GO sets.
As far as the distribution analysis is concerned, the following conclusions could be drawn:
§ The null hypotheses of investigated distributions were rejected by at least two out of three applied tests at different significance levels in the following sets: RRF, Anthra, Anthra-GO, Clark, BTA, CAII, logCAIV-GO, and MGWTI.
§ The following data sets proved to be normally distributed: Y209, Y205, C166, MDL, Diamino-GO, lnCHF, DZGALYL, BKST, CAIV, Nitro, logCAI-GO, TTKSS-CAII, and ERBAT. An MLR analysis should be applied to these sets.
§ The Gauss-Laplace distribution proved to be less frequently rejected than the Gauss or Laplace distributions.
Table 4. Results of Gauss distribution testing: Chi square (CS), Kolmogorov Smirnov (KS) and Anderson Darling (AD) tests
| Set | CS Stat. | CS df | CS p | CS Reject5% | CS Rejectα% | KS Stat. | KS p | KS Reject5% | KS Rejectα% | AD Stat. | AD Reject5% | AD Rejectα% |
| Y209 | 1.92 | 7 | 0.9641 | No | n.a. | 0.0314 | 0.9823 | No | n.a. | 0.1423 | No | n.a. |
| RRF | 17.15 | 7 | 0.0165 | Yes | ≥0.02 | 0.0857 | 0.0873 | No | ≥0.10 | 1.545 | No | 0.20 |
| Y206 | 11.00 | 7 | 0.1386 | No | 0.20 | 0.0335 | 0.9691 | No | n.a. | 0.4443 | No | n.a. |
| Y205 | 8.64 | 7 | 0.2793 | No | n.a. | 0.0358 | 0.9469 | No | n.a. | 0.3788 | No | n.a. |
| C166 | 2.99 | 7 | 0.8862 | No | n.a. | 0.0551 | 0.6743 | No | n.a. | 0.5654 | No | n.a. |
| OrgPest | 8.06 | 7 | 0.3273 | No | n.a. | 0.0849 | 0.2400 | No | n.a. | 1.7042 | No | 0.20 |
| Anthra | 24.80 | 5 | 1.52·10^-4 | Yes | ≥0.01 | 0.1755 | 7.24·10^-4 | Yes | ≥0.01 | 5.6393 | Yes | ≥0.01 |
| Anthra-GO | 20.16 | 6 | 0.0026 | Yes | ≥0.01 | 0.1500 | 0.0067 | Yes | ≥0.01 | 4.3883 | Yes | ≥0.01 |
| MPC | 8.70 | 6 | 0.1914 | No | 0.20 | 0.0493 | 0.9378 | No | n.a. | 0.2463 | No | n.a. |
| MDL | 6.76 | 6 | 0.3438 | No | n.a. | 0.1033 | 0.1987 | No | n.a. | 1.0269 | No | n.a. |
| Diamino | 8.54 | 6 | 0.2008 | No | n.a. | 0.1121 | 0.2029 | No | n.a. | 1.4863 | No | 0.20 |
| Diamino-GO | 7.31 | 6 | 0.2936 | No | n.a. | 0.1079 | 0.2447 | No | n.a. | 1.2040 | No | n.a. |
| lnCHF | 2.17 | 6 | 0.9032 | No | n.a. | 0.0599 | 0.8954 | No | n.a. | 0.3052 | No | n.a. |
| AAT | 8.05 | 5 | 0.1535 | No | 0.20 | 0.1093 | 0.3557 | No | n.a. | 0.9161 | No | n.a. |
| DZGALYL | 4.37 | 5 | 0.4978 | No | n.a. | 0.0733 | 0.8626 | No | n.a. | 0.3885 | No | n.a. |
| IMHH | 11.39 | 4 | 0.0225 | No | ≥0.05 | 0.1398 | 0.1551 | No | 0.20 | 1.1324 | No | n.a. |
| InHIV | 7.59 | 4 | 0.1080 | No | 0.20 | 0.1472 | 0.1528 | No | 0.20 | 1.2268 | No | n.a. |
| InACE | 2.75 | 5 | 0.7384 | No | n.a. | 0.1393 | 0.1915 | No | 0.20 | 1.8257 | No | 0.20 |
| Clark | 10.90 | 4 | 0.0277 | Yes | ≥0.05 | 0.1479 | 0.1495 | No | 0.20 | 1.0176 | No | n.a. |
| BTA | 14.46 | 4 | 0.0060 | Yes | ≥0.01 | 0.1977 | 0.0405 | No | ≥0.05 | 1.4480 | No | 0.20 |
| MASIS-CAII | 6.37 | 4 | 0.1735 | No | 0.20 | 0.1099 | 0.5831 | No | n.a. | 0.9572 | No | n.a. |
| MCY | 5.93 | 4 | 0.2048 | No | n.a. | 0.2003 | 0.0466 | Yes | ≥0.05 | 1.5082 | No | 0.20 |
| BKST | 0.48 | 2 | 0.7855 | No | n.a. | 0.1293 | 0.7505 | No | n.a. | 0.6314 | No | n.a. |
| CAI | 5.55 | 5 | 0.1352 | No | 0.20 | 0.1643 | 0.2061 | No | n.a. | 1.7636 | No | 0.20 |
| CAII | 6.67 | 3 | 0.0833 | No | ≥0.10 | 0.1582 | 0.2427 | No | n.a. | 1.4951 | No | 0.20 |
| CAIV | 5.48 | 4 | 0.2413 | No | n.a. | 0.1512 | 0.2898 | No | n.a. | 1.2785 | No | n.a. |
| logCAII-GO | 7.01 | 4 | 0.1354 | No | 0.20 | 0.2197 | 0.0433 | Yes | ≥0.01 | 1.3180 | No | n.a. |
| logCAIV-GO | 0.84 | 3 | 0.8395 | No | n.a. | 0.2010 | 0.0804 | No | ≥0.10 | 1.4905 | No | 0.20 |
| Nitro | 0.34 | 3 | 0.9518 | No | n.a. | 0.0985 | 0.8083 | No | n.a. | 0.5312 | No | n.a. |
| MGWTI | 4.11 | 3 | 0.2498 | No | n.a. | 0.1953 | 0.1206 | No | 0.20 | 1.9225 | No | 0.20 |
| logCAI-GO | 0.43 | 4 | 0.9796 | No | n.a. | 0.1051 | 0.8093 | No | n.a. | 0.2895 | No | n.a. |
| TTKSS-CAII | 0.98 | 2 | 0.6125 | No | n.a. | 0.1159 | 0.7891 | No | n.a. | 0.4444 | No | n.a. |
| ERBAT | 2.46 | 5 | 0.7828 | No | n.a. | 0.1217 | 0.5086 | No | n.a. | 0.3568 | No | n.a. |
CS = Chi-square; KS = Kolmogorov-Smirnov; AD = Anderson-Darling; Stat. = value of the statistic; df = degrees of freedom; Reject5% = reject the hypothesis at a significance level of 5%; Rejectα% = the significance level at which the hypothesis is rejected, whenever appropriate; p = p-value; n.a. = not applicable
The maximum likelihood estimation was applied in order to estimate a series of population parameters. The obtained results expressed as MLE value, population mean and population standard deviation are presented in Table 6. The power of error and expected kurtosis (KuGL) were also investigated according to the Gauss-Laplace distribution (see Table 6).
Table 5. Results of Gauss-Laplace distribution testing: Chi square (CS), Kolmogorov Smirnov (KS) and Anderson Darling (AD) tests
|
Set |
Chi-square |
Kolmogorov-Smirnov |
Anderson-Darling |
|||||||||
|
Stat. |
df |
p |
Reject5% |
Rejectα% |
Stat. |
p |
Reject5% |
Rejectα% |
Stat. |
Reject5% |
Rejectα% |
|
|
Y209 |
1.37 |
7 |
0.9864 |
No |
n.a. |
0.0270 |
0.9971 |
No |
n.a. |
0.1246 |
No |
n.a. |
|
RRF |
17.94 |
7 |
0.0123 |
Yes |
≥0.02 |
0.0922 |
0.0537 |
No |
≥0.1 |
1.5687 |
No |
≥0.2 |
|
Y206 |
11.60 |
7 |
0.1144 |
No |
0.2 |
0.0511 |
0.6359 |
No |
n.a. |
0.7665 |
No |
n.a. |
|
Y205 |
7.65 |
7 |
0.3642 |
No |
n.a. |
0.0444 |
0.7976 |
No |
n.a. |
0.4958 |
No |
n.a. |
|
C166 |
2.98 |
7 |
0.8864 |
No |
n.a. |
0.0525 |
0.7286 |
No |
n.a. |
0.5541 |
No |
n.a. |
|
OrgPest |
7.37 |
7 |
0.3913 |
No |
n.a. |
0.0874 |
0.2116 |
No |
n.a. |
1.6051 |
No |
0.2 |
|
Anthra |
35.32 |
6 |
3.74∙10-6 |
Yes |
≥0.01 |
0.1779 |
5.87E-4 |
Yes |
≥0.01 |
5.0393 |
Yes |
≥0.01 |
|
Anthra-GO |
28.45 |
6 |
7.74∙10-5 |
Yes |
≥0.01 |
0.1528 |
0.0054 |
|||||