Eric C. Sayre, PhD - VWUO-MD Data mining for hypothesis generation

Variable-Weighted Ultrametric Optimization for Mixed-Type Data (VWUO-MD) - Data mining software for hypothesis generation

Abstract:

Scientific research begins with hypothesis generation, for which cluster analysis (CA) can be used. Traditionally, CA involves continuous variables weighted equally, and a subjective choice of linkage and stopping rules. Variable weighting for cluster analysis (VWCA), beginning with De Soete (1985/6), produces weights that may be useful for hypothesis generation. De Soete’s VWCA optimized ultrametricity, a property of better-separated clusters, without requiring CA.
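To make the ultrametricity property concrete: a distance matrix is ultrametric when, in every triple of records, the two largest of the three pairwise distances are equal. The sketch below is a generic illustration of that property only; the function name and the squared-gap score are assumptions chosen for this example and are not the objective function used by De Soete or by VWUO-MD.

```python
from itertools import combinations

def ultrametric_violation(dist):
    """Sum, over all triples, of the squared gap between the two largest distances.

    Returns 0 exactly when the matrix satisfies the ultrametric inequality
    d(i, j) <= max(d(i, k), d(k, j)) for every triple.
    (Illustrative score only; not the VWUO-MD objective.)
    """
    total = 0.0
    for i, j, k in combinations(range(len(dist)), 3):
        # Sort the three pairwise distances of the triple in ascending order.
        _, middle, largest = sorted((dist[i][j], dist[i][k], dist[j][k]))
        total += (largest - middle) ** 2
    return total

# Two tight pairs that are far from each other: well-separated clusters.
tree_like = [
    [0, 1, 4, 4],
    [1, 0, 4, 4],
    [4, 4, 0, 2],
    [4, 4, 2, 0],
]

# Four points evenly spaced on a line: no cluster structure.
line_like = [
    [0, 1, 2, 3],
    [1, 0, 1, 2],
    [2, 1, 0, 1],
    [3, 2, 1, 0],
]

print(ultrametric_violation(tree_like))
print(ultrametric_violation(line_like))
```

The clustered example scores zero and the evenly spaced example does not, which is the sense in which ultrametricity tracks cluster separation.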

We developed variable-weighted ultrametric optimization for mixed-type data (VWUO-MD), starting with a variable-weighted, multivariate distance for data with any number of continuous, ordinal, nominal, binary symmetric and binary asymmetric (e.g., rare disease) variables. In Monte Carlo simulations we found that weights are consistent with a priori relationships between variables, under several distributions. On some relationships (e.g., single group linear), the method performs poorly. Compared to De Soete, VWUO-MD better penalizes for 0-weights, and better ensures a unique solution with a strategic random restart procedure. The bootstrap covariance matrix is slightly conservative. For mixtures of at least four continuous/nominal variables, a U-statistic-based covariance matrix performs well. Point estimates and covariances are invariant to column/category/record order and affine transformations.
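The exact VWUO-MD distance is defined in the thesis. As a rough illustration of what a variable-weighted distance over mixed data types can look like, the following sketch uses a Gower-style stand-in. The function name, the type labels, the range scaling and the handling of joint absences are all assumptions made for this example, not the formulas implemented in VWUO.exe.

```python
def weighted_mixed_distance(x, y, kinds, weights, ranges):
    """Gower-style weighted distance between two records (illustrative only).

    kinds   : per-variable type label (hypothetical labels, see below)
    weights : non-negative variable weights; a weight of 0 drops the variable
    ranges  : observed range for continuous/ordinal variables, else None
    """
    num = 0.0
    den = 0.0
    for xi, yi, kind, w, r in zip(x, y, kinds, weights, ranges):
        if kind in ("continuous", "ordinal"):
            d = abs(xi - yi) / r              # scaled to [0, 1] by the range
        elif kind in ("nominal", "binary_symmetric"):
            d = 0.0 if xi == yi else 1.0      # simple mismatch
        elif kind == "binary_asymmetric":
            if xi == 0 and yi == 0:
                continue                      # joint absence is uninformative
            d = 0.0 if xi == yi else 1.0
        else:
            raise ValueError(f"unknown variable type: {kind}")
        num += w * d
        den += w
    return num / den if den > 0 else 0.0

kinds = ("continuous", "ordinal", "nominal",
         "binary_symmetric", "binary_asymmetric")
weights = (0.4, 0.2, 0.2, 0.1, 0.1)
ranges = (40.0, 4.0, None, None, None)

a = (30.0, 2, "red", 1, 0)
b = (50.0, 3, "blue", 1, 0)

d_ab = weighted_mixed_distance(a, b, kinds, weights, ranges)
print(d_ab)
```

In a scheme like this, the weights are the quantities an optimizer would tune; variables that end up with large weights are the ones driving the cluster structure.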

We analyzed a subset of the Joint Canada/United States Survey of Health: working, mature students 50+ years old who received health services in the past year (n=167), split into training and testing segments. Prescreening within types and backwards elimination with VWUO-MD reduced the variable space. The final 14 variable weights were plotted as a scree plot. On the testing segment, a model was fit from the upper scree plot variables. Similar models were fit from the lower scree plot variables, the prescreening reject variables and the backwards elimination reject variables. Models were ordered on overall statistical significance, and the upper model had the best fit, indicating that VWUO-MD had successfully mined these data for hypotheses. We learned that reduction in activities due to a long-term health condition was associated with consultations with a mental health professional in the past year (odds ratio=12.25, 95% CI=1.67, 90.02). While additional research is needed, in its present form VWUO-MD produces variable weights that may be informative for hypothesis generation on data with varied mixtures of data types.
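For readers who want to interpret the reported interval, the short sketch below back-calculates the implied standard error of the log odds ratio. It assumes the interval is a standard Wald interval that is symmetric on the log scale, which the abstract does not state; the function name is hypothetical.

```python
import math

def wald_summary(lower, upper, z=1.959964):
    """Recover the centre and log-scale standard error of a Wald interval.

    Assumes the interval was built as exp(log(OR) +/- z * SE).
    """
    log_lo, log_hi = math.log(lower), math.log(upper)
    center = math.exp((log_lo + log_hi) / 2)   # geometric midpoint of the interval
    se = (log_hi - log_lo) / (2 * z)           # standard error of log(OR)
    return center, se

center, se = wald_summary(1.67, 90.02)
print(round(center, 2), round(se, 2))
```

Under that assumption the geometric midpoint lands very close to the reported odds ratio of 12.25, and the implied standard error of about 1 on the log scale explains why the interval is so wide.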

Frequently asked questions:

Q. Can I download the FULL SOFTWARE FOR FREE?

A. Yes. The full VWUO-MD software, Version 1.12 (the version labeled PAID), is available as a free download.

Keywords:
data mining software, hypothesis generation, hypothesis generator, expert systems knowledge acquisition, ultrametric optimization, cluster analysis

References:

1. Witten IH, Frank E. Data Mining: Practical Machine Learning Tools and Techniques (Second Ed.). Morgan Kaufmann Publishers. San Francisco, CA, USA. 2005. p. 5.

2. Gilman EA, Knox EG. Childhood cancers: space-time distribution in Britain. Journal of Epidemiology and Community Health. 1995;49:158-163.

3. Pfizer’s Data Aggregation Cross Functional Team. Data Analytic Principles. June 15, 2007.

4. Theodoridis S, Koutroumbas K. Pattern recognition (Third Ed.). Academic Press. Burlington, MA, USA. 2006.

5. Bacardit J, Stout M, Hirst JD, Valencia A, Smith RE, Krasnogor N. Automated alphabet reduction for protein datasets. BMC Bioinformatics. 2009 Jan 6;10(1):6.

6. Huang da W, Sherman BT, Lempicki RA. Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat Protoc. 2009;4(1):44-57.

7. Kerstens HH, Kollers S, Kommadath A, Del Rosario M, Dibbits B, Kinders SM, Crooijmans RP, Groenen MA. Mining for Single Nucleotide Polymorphisms in Pig genome sequence data. BMC Genomics. 2009 Jan 6;10(1):4.

8. Moore M, Chan E, Lurie N, Schaefer AG, Varda DM, Zambrano JA. Strategies to improve global influenza surveillance: a decision tool for policymakers. BMC Public Health. 2008 May 28;8:186.

9. Bredel M, Bredel C, Sikic BI. Genomics-based hypothesis generation: a novel approach to unravelling drug resistance in brain tumours? Lancet Oncology. 2004 Feb;5(2):89-100.

10. Jacquez GM. Spatial Cluster Analysis. In Fotheringham S, Wilson J (Eds.), The Handbook of Geographic Information Science. Blackwell Publishing, Edinburgh, UK. 2008. pp. 395-416.

11. Stegmann J, Grohmann G. Hypothesis generation guided by co-word clustering. Scientometrics. 2003;56(1):111-35.

12. Johnson RA, Wichern DW. Applied Multivariate Statistical Analysis (Fifth Ed.). Prentice Hall, Inc. Upper Saddle River, NJ, USA. 2002.

13. Kwek S. Cluster Analysis. Presented at the Human Genome Laboratory in the Department of Computer Science at the University of Texas at San Antonio. 2005.

14. Hogeweg P. Iterative character weighing in numerical taxonomy. Computers in Biology and Medicine. 1976;6(3):199-223.

15. Art D, Gnanadesikan R, Kettenring JR. Data-Based Metrics for Cluster Analysis. Utilitas Mathematica. 1982;21A:75-99.

16. DeSarbo WS, Carroll JD, Clark LA, Green PE. Synthesized clustering: A method for amalgamating alternative clustering bases with differential weighting of variables. Psychometrika. 1984 March;49(1):57-78.

17. De Soete G, DeSarbo WS, Carroll JD. Optimal variable weighting for hierarchical clustering: An alternating least-squares algorithm. Journal of Classification. 1985 Dec;2(1):173-92.

18. De Soete G. Optimal variable weighting for ultrametric and additive tree clustering. Quality and Quantity. 1986 June;20(2-3):169-80.

19. Arabie P, Hubert LJ. Combinatorial data analysis. Annual Review of Psychology. 1992;43(1):169-203.

20. Arabie P, Hubert LJ. Clustering from the Perspective of Combinatorial Data Analysis. In Krzanowski WJ (Ed.), Recent advances in descriptive multivariate analysis. Oxford University Press. Oxford, UK. 1995. pp. 1-13.

21. Breckenridge JN. Validating Cluster Analysis: Consistent Replication and Symmetry. Multivariate Behavioral Research. 2000;35(2):261-85.

22. Brusco MJ, Cradit JD. A variable-selection heuristic for K-means clustering. Psychometrika. 2001 June;66(2):249-70.

23. Brusco MJ. Clustering Binary Data in the Presence of Masking Variables. Psychological Methods. 2004 Dec;9(4):510-23.

24. Bull JK, Basford KE, DeLacy IH, Cooper M. Classifying genotypic data from plant breeding trials: a preliminary investigation using repeated checks. Theoretical and Applied Genetics. 1992 Dec;85(4):461-9.

25. Carmone FJ Jr., Kara A, Maxwell S. HINoV: A New Model to Improve Market Segment Definition by Identifying Noisy Variables. Journal of Marketing Research. 1999 Nov;36(4):501-9.

26. Chun J. Computer Assisted Classification and Identification of Actinomycetes. University of Newcastle, Department of Microbiology, Doctor of Philosophy thesis, 1995.

27. Chung J, Choi I. A Non-parametric Method for Data Clustering with Optimal Variable Weighting. In Corchado E, Yin H, Botti V, Fyfe C (Eds.), Intelligent Data Engineering and Automated Learning - IDEAL 2006. Springer. Berlin/Heidelberg, Germany. 2006. pp. 807-14.

28. Corter JE. Tree Models of Similarity and Association (Quantitative Applications in the Social Sciences). Sage Publications, Inc. California, USA. 1996.

29. Debska B, Guzowska-Swider B. Analysis of the relationship between the structure and aromatic properties of chemical compounds. Analytical and Bioanalytical Chemistry. 2003 Apr;375(8):1049-61.

30. DeSarbo WS, Cron WL. A maximum likelihood methodology for clusterwise linear regression. Journal of Classification. 1988 Sep;5(2):249-82.

31. DeSarbo WS, Oliver RL, Rangaswamy A. A simulated annealing methodology for clusterwise linear regression. Psychometrika. 1989 Sep;54(4):707-36.

32. De Soete G, Carroll JD, DeSarbo WS. Least squares algorithms for constructing constrained ultrametric and additive tree representations of symmetric proximity data. Journal of Classification. 1987 Sep;4(2):155-73.

33. De Soete G. OVWTRE: A program for optimal variable weighting for ultrametric and additive tree fitting. Journal of Classification. 1988 March;5(1):101-4.

34. Donoghue JR. The Effects of Within-group Covariance Structure on Recovery in Cluster Analysis: I. The Bivariate Case. Multivariate Behavioral Research. 1995;30(2):227-54.

35. Donoghue JR. Univariate Screening Measures for Cluster Analysis. Multivariate Behavioral Research. 1995 July;30(3):385-427.

36. Everitt BS, Landau S, Leese M. Cluster Analysis (Fourth Ed.). Oxford University Press, Inc. New York, NY, USA. 2001.

37. Fovell RG, Fovell MC. Climate Zones of the Conterminous United States Defined Using Cluster Analysis. Journal of Climate. 1993;6(11):2103-35.

38. Fowlkes EB, Gnanadesikan R, Kettenring JR. Variable selection in clustering. Journal of Classification. 1988;5(2):205-28.

39. Friedman JH, Meulman JJ. Clustering Objects on Subsets of Attributes. Journal of the Royal Statistical Society. Series B (Statistical Methodology). 2004;66(4):815-49.

40. Gnanadesikan R, Kettenring JR, Tsao SL. Weighting and selection of variables for cluster analysis. Journal of Classification. 1995;12(1):113-36.

41. Gordon AD. Constructing dissimilarity measures. Journal of Classification. 1990 Sep;7(2):257-69.

42. Green PE, Kim J, Carmone FJ. A preliminary study of optimal variable weighting in k-means clustering. Journal of Classification. 1990 Sep;7(2):271-85.

43. Hand DJ, Heard NA. Finding groups in gene expression data. Journal of Biomedicine and Biotechnology. 2005 Jun;2:215-25.

44. Huang JZ, Ng MK, Rong H, Li Z. Automated variable weighting in k-means type clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2005 May;27(5):657-68.

45. Huang JZ, Xu J, Ng M, Ye Y. Weighting Method for Feature Selection in K-Means. In Liu H, Motoda H (Eds.), Computational Methods of Feature Selection. Chemical Rubber Company Press. New York, NY. 2007. pp. 193-210.

46. Jedidi K, DeSarbo WS. A stochastic multidimensional scaling procedure for the spatial representation of three-mode, three-way pick any/J data. Psychometrika. 1991 Sep;56(3):471-94.

47. Jing L, Ng MK, Huang JZ. An Entropy Weighting k-Means Algorithm for Subspace Clustering of High-Dimensional Sparse Data. IEEE Transactions on Knowledge and Data Engineering. 2007 June;19(8):1026-41.

48. Lapointe FJ, Legendre P. A Statistical Framework to Test the Consensus Among Additive Trees (Cladograms). Systematic Biology. 1992 Jun;41(2):158-71.

49. Leonard S, Droege M. The uses and benefits of cluster analysis in pharmacy research. Research in Social and Administrative Pharmacy. 2008 Mar;4(1):1-11.

50. Makarenkov V, Legendre P. Optimal Variable Weighting for Ultrametric and Additive Trees and K-means Partitioning: Methods and Software. Journal of Classification. 2001 Feb 1;18(2):245-71.

51. Meulman JJ, Verboon P. Points-of-view analysis revisited - fitting multidimensional structures to optimal distance components with cluster restrictions on the variables. Psychometrika. 1993 Mar;58(1):7-35.

52. Milligan GW, Cooper MC. Methodology Review: Clustering Methods. Applied Psychological Measurement. 1987;11(4):329-54.

53. Milligan GW, Cooper MC. A study of standardization of variables in cluster analysis. Journal of Classification. 1988 Sep;5(2):181-204.

54. Milligan GW. A validation study of a variable weighting algorithm for cluster analysis. Journal of Classification. 1989 Dec;6(1):53-71.

55. Milligan GW. Clustering Validation: Results and Implications for Applied Analyses. In Arabie P, Hubert LJ, De Soete G (Eds.), Clustering and Classification. World Scientific Publ. River Edge, NJ. 1996. pp. 341-76.

56. Milligan GW, Hirtle SC. Clustering and Classification Methods. In Weiner IB, Freedheim DK, Schinka JA (Eds.), Handbook of Psychology. John Wiley and Sons. 2003. pp. 165-86.

57. Morris L, Schmolze R. Consumer archetypes: A new approach to developing consumer understanding frameworks. Journal of Advertising Research. 2006 Sep;46(3):289-300.

58. Schweinberger M, Snijders TAB. Settings in Social Networks: A Measurement Model. Sociological Methodology. 2003;33:307-41.

59. Soffritti G. Identifying multiple cluster structures in a data matrix. Communications in Statistics-Simulation and Computation. 2003;32(4):1151-77.

60. Sokal RR. Phenetic taxonomy: Theory and methods. Annual Review of Ecology & Systematics. 1986;17:423-42.

61. Steinley D, Henson R. OCLUS: An Analytic Method for Generating Clusters with Known Overlap. Journal of Classification. 2005 Sep;22(2):221-50.

62. Steinley D. K-means clustering: A half-century synthesis. British Journal of Mathematical & Statistical Psychology. 2006;59(1):1-34.

63. Steinley D, Brusco MJ. Selection of variables in cluster analysis: An empirical comparison of eight procedures. Psychometrika. 2008 Mar;73(1):125-144.

64. Taylor & Francis, Ltd. for the Society of Systematic Biologists. 21st Numerical Taxonomy Conference. Systematic Zoology. 1988 Mar;37(1):91-3.

65. Tsai CY, Chiu CC. Developing a feature weight self-adjustment mechanism for a K-means clustering algorithm. Computational Statistics & Data Analysis. 2008 Jun;52(10):4658-72.

66. van Buuren S, Heiser WJ. Clustering n objects into k groups under optimal scaling of variables. Psychometrika. 1989 Sep;54(4):699-706.

67. Cochran WG. Sampling Techniques (Third Ed.). John Wiley & Sons, Inc. New York, NY, USA. 1977.

68. Casella G, Berger R. Statistical Inference (Second Ed.). Duxbury. Pacific Grove, CA, USA. 2002. pp. 240-44.

69. Serfling RJ. Approximation Theorems of Mathematical Statistics. John Wiley & Sons, Inc. New York, NY, USA. 1980.

70. Thomas PY, Sreekumar NV. Estimation of location and scale parameters of a distribution by U-statistics based on best linear functions of order statistics. Journal of Statistical Planning and Inference. 2008 Jul;138(7):2190-2200.

71. Lee AJ. U-statistics: theory and practice. Chemical Rubber Company Press. New York, NY. 1990.

72. Efron B, Gong G. A Leisurely look at the bootstrap, the jackknife, and cross-validation. The American Statistician. 1983 Feb;37(1):36-48.

73. SAS Institute Inc., Cary, NC, USA.

74. Henze N, Zirkler B. A class of invariant consistent tests for multivariate normality. Comm. Statist. Theory Methods. 1990;A19:3595-3617.

75. Yeo D, Mantel H, Liu T. Bootstrap variance estimation for the National Population Health Survey. Proceedings of the Survey Research Methods Section, ASA. 1999;778-83.

76. Vardeman S. Bootstrap percentile confidence intervals. [Available at https://www.public.iastate.edu/~vardeman/stat511/BootstrapPercentile.pdf]

77. The Joint Canada/United States Survey of Health. [Available at https://www.cdc.gov/nchs/about/major/nhis/jcush_mainpage.htm]

78. Hosmer DW, Lemeshow S. Applied Logistic Regression (Second Ed.). John Wiley & Sons, Inc. USA. 2000.

79. Roberts G, Binder D, Kovacevic M, Pantel M, Phillips O. Using an estimating function bootstrap approach for obtaining variance estimates when modeling complex health survey data. Proceedings of the Survey Methods Section, SSC Annual Meeting. June, 2003.

80. Fraley C, Raftery AE. Model-Based Clustering, Discriminant Analysis, and Density Estimation. Journal of the American Statistical Association. 2002;97:611-31.

81. Tzeng J, Lu HHS, Li WH. Multidimensional scaling for large genomic data sets. BMC Bioinformatics. 2008;9:179.

82. Pino-Mejias R, Jimenez-Gamero MD, Cubiles-de-la-Vega MD, Pascual-Acosta A. Reduced bootstrap aggregating of learning algorithms. Pattern Recognition Letters. 2008;29(3):265-271.

83. Buja A, Stuetzle W. The effect of bagging on variance, bias, and mean squared error. 2000. AT&T Labs-Research.

84. Shen SH, Liu YC. Efficient multiple faces tracking based on Relevance Vector Machine and Boosting learning. Journal of Visual Communication and Image Representation. 2008;19(6):382-91.

85. Ting KM, Witten IH. Issues in stacked generalization. Journal of Artificial Intelligence Research. 1999;10:271-89.

The VWUO.exe software was developed in C++ for Windows. The program comes with a VWUO.ini file containing all the default options described in the user's guide, with one exception: I provide the non-default normalizing multipliers obtained in the thesis (which you are free to change). The software also includes a short list of keys and command line parameters, not all of which are listed in the user's guide.

The PhD thesis provides relevant background and mathematical formulas. The self-contained, illustrated user's guide (approximately 40 pages) is Chapter 3. It contains screenshots, example data sets and commands, and walks you through sample analyses. Chapter 6 shows an application of successfully mining a real dataset for hypotheses using VWUO-MD. The thesis can be downloaded free of charge (with a limited license as explained in the PDF) from Simon Fraser University Library's Institutional Repository. (Full text and abstract: https://summit.sfu.ca/item/9744)

Software Performance Limitations

Before purchasing this software, it is STRONGLY RECOMMENDED that you download and try the free trial version of VWUO-MD (VWUO.exe) software. The ultrametric optimization algorithm used is approximately of order n-cubed in the number of data records, and its run time also depends on the number of input variables. As such, adding even a small number of variables and/or data records may dramatically increase the run time of this software. For example, 5 times the data records might translate into 125 times the run time. If the data get too large, the software may not run at all, due in part to memory limitations. These numbers are ballpark figures only; no guarantees of any kind about run time should be inferred from this paragraph. You may review the analyses I performed in my PhD thesis to get an approximate idea of realistic input sizes on a high performance machine, before purchasing the full, unlocked version. I also discuss the issue in Chapter 7 of the thesis, suggesting some approaches that may allow for analysis of bigger data sets (e.g., "bagging"). Refer to the thesis for more details.
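The scaling arithmetic in the paragraph above can be checked with a few lines of Python. This is a back-of-the-envelope sketch, not a run time model of VWUO.exe. The function names are hypothetical, and the link between cubic growth and the number of record triples is an assumption offered as one plausible intuition rather than a description of the actual implementation.

```python
from math import comb

def runtime_multiplier(record_factor, order=3):
    """Approximate run time growth when the record count is multiplied by record_factor."""
    return record_factor ** order

def triple_count(n_records):
    """Number of distinct record triples, which grows roughly as n^3 / 6 (assumed intuition)."""
    return comb(n_records, 3)

# Five times the records gives 5^3 times the run time under cubic scaling.
print(runtime_multiplier(5))

# Going from 100 to 500 records multiplies the number of triples by a similar factor.
print(triple_count(500) / triple_count(100))
```

Timing a small pilot run and multiplying by the cube of the planned size increase gives a rough, non-guaranteed estimate of whether a larger analysis is feasible.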

End User Software License Agreement ("Agreement")

This license agreement governs VWUO-MD (VWUO.exe) (tm) ("Software"). By downloading and/or running this software, you enter into the terms of this binding contract between you ("you" or "User") and Eric C. Sayre, PhD. If you do not agree with the terms of this license, do not download or run VWUO-MD (VWUO.exe). Installation of the Software constitutes acceptance of the terms of this License Agreement.

Grant of License: Subject to the terms and conditions of this Agreement, Eric C. Sayre hereby grants you a limited, nonexclusive license to install and use the object code version of the Software, a copy of which is provided herewith, on a single personal computer.

Limitations: The Software is licensed, not sold, to you. You must retain all copyright and related notices of Eric C. Sayre's ownership and other rights in the Software in the product, labeling and documentation provided. Furthermore, you may not: (a) modify, translate, de-compile, reverse engineer, disassemble or otherwise decode the Software; (b) copy any of the Software other than as reasonably required for your own personal use of the Software in accordance with this Agreement; or (c) sublicense, sell, rent, lend, transfer, post, transmit, distribute or otherwise make the Software available to anyone else, except that you may permanently transfer the Software and accompanying materials provided you retain no copies and the recipient agrees to the terms of this Agreement.

Trademarks: You acknowledge that Eric C. Sayre, VWUO-MD (VWUO.exe) (tm) and related logos and designs are trademarks of Eric C. Sayre, PhD and that no rights in such trademarks are granted to you by this Agreement.

Support: Eric C. Sayre reserves the right to modify the Software from time to time without obligation to notify you, or any other person or organization of such revision or change. Eric C. Sayre explicitly states that no technical support for this Software should be expected from him in any form. Eric C. Sayre's PhD thesis contains a user's guide for this Software. However, Eric C. Sayre does not guarantee that Simon Fraser University will continue to supply the PDF of the thesis free of charge or at all, nor does he attest to the completeness, accuracy or relevance of the user's guide for your particular purposes.

Limitation of Liability: IN NO EVENT WILL Eric C. Sayre BE LIABLE FOR ANY DAMAGES, INCLUDING LOSS OF DATA, LOST OPPORTUNITY OR PROFITS, COST OF COVER OR ANY SPECIAL, INCIDENTAL, CONSEQUENTIAL, DIRECT OR INDIRECT DAMAGES ARISING FROM OR RELATING TO THE USE OF THE SOFTWARE, HOWEVER CAUSED ON ANY THEORY OF LIABILITY. THIS LIMITATION WILL APPLY EVEN IF Eric C. Sayre HAS BEEN ADVISED OR GIVEN NOTICE OF THE POSSIBILITY OF SUCH DAMAGE. THE ENTIRE RISK AS TO THE USE OF THE SOFTWARE IS ASSUMED BY THE USER. BECAUSE SOME STATES DO NOT ALLOW THE EXCLUSION OR LIMITATION OF LIABILITY FOR CERTAIN INCIDENTAL, CONSEQUENTIAL OR OTHER DAMAGES, THIS LIMITATION MAY NOT APPLY TO YOU.

Disclaimer of Warranty: TO THE EXTENT PERMITTED BY APPLICABLE LAW ALL Eric C. Sayre SOFTWARE, INCLUDING THE IMAGES AND/OR COMPONENTS, IS PROVIDED "AS IS" AND WITHOUT EXPRESS OR IMPLIED WARRANTY OF ANY KIND BY EITHER Eric C. Sayre OR ANYONE ELSE WHO HAS BEEN INVOLVED IN THE CREATION, PRODUCTION OR DELIVERY OF SUCH SOFTWARE, INCLUDING BUT NOT LIMITED TO ANY IMPLIED WARRANTY OF MERCHANTABILITY, NONINFRINGEMENT OR FITNESS FOR A PARTICULAR PURPOSE. NO COVENANTS, WARRANTIES OR INDEMNITIES OF ANY KIND ARE GRANTED BY Eric C. Sayre TO THE USER.

Termination: If you do not accept the terms of this license, you agree to destroy all copies of the Software in your possession and control.

Government Rights: If used or acquired by the Government, the Government acknowledges that (a) the Software constitutes "commercial computer software" or "commercial computer software documentation" for purposes of 48 C.F.R. 12.212 and 48 C.F.R. 227.7202-3, as applicable and (b) the Government's rights are limited to those specifically granted to you pursuant to this License. The contractor/manufacturer is Eric C. Sayre, PhD, currently at www.ericsayre.com.

Export Control Obligations: You will not export or re-export any Licensed Software in violation of any law, regulation, order or other governmental requirement (including, without limitation, the U.S. Export Administration Act, regulations of the Department of Commerce and other export controls of the U.S.). You shall, at your own expense, promptly obtain and arrange for the maintenance of all non-U.S.A. government approvals, if any, and comply with all applicable local laws and regulations as may be necessary for performance under this Agreement.