Setting a generalized functional linear model (GFLM) for the classification of different types of cancer

  • Miguel Flores National Polytechnic School
  • Guido Saltos Escuela Politécnica del Ejército, ESPE
  • Sergio Castillo Páez National Polytechnic School
Keywords: Depth of functional data, DNA, functional data analysis, functional distances, statistical classification

Abstract

This work aims to classify DNA sequences as healthy or malignant. To do so, supervised and unsupervised classification methods are applied in a functional data setting, in which each DNA strand is one observation. Because the observations are discretized, several ways of representing them as functions are evaluated. An exploratory study is also carried out, estimating the functional mean and variance for each type of cancer. For unsupervised classification, hierarchical clustering with different measures of functional distance is used. For supervised classification, a functional generalized linear model is used, in which the first and second derivatives of the curves are included as discriminating variables. One advantage of working in the functional setting is that the resulting model classified 100% of the cancers correctly. The methods were implemented with the fda.usc R package, which includes all the functional data analysis techniques used in this work, as well as others developed in recent decades. For more details on these techniques, see Ramsay and Silverman (2005) and Ferraty and Vieu (2006).
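
As an illustration of the two ingredients named in the abstract, a functional distance between discretized curves and their derivatives, the following minimal Python sketch may help. It is not the authors' code (the paper uses the fda.usc R package), and everything in it is an assumption made for illustration: the function names are hypothetical, the L2 distance is only one possible choice of functional distance, and the derivative is approximated by simple finite differences.

```python
import math


def l2_distance(x, y, t):
    """Approximate the L2 distance between two curves sampled on the grid t.

    The integral of (x(t) - y(t))^2 is approximated with the trapezoid rule.
    """
    sq = [(a - b) ** 2 for a, b in zip(x, y)]
    integral = sum(
        (t[i + 1] - t[i]) * (sq[i] + sq[i + 1]) / 2.0 for i in range(len(t) - 1)
    )
    return math.sqrt(integral)


def first_derivative(x, t):
    """Finite-difference derivative of a sampled curve.

    One-sided differences at the two end points, central differences inside.
    Applying the function twice gives a rough second derivative.
    """
    n = len(t)
    d = [0.0] * n
    d[0] = (x[1] - x[0]) / (t[1] - t[0])
    d[-1] = (x[-1] - x[-2]) / (t[-1] - t[-2])
    for i in range(1, n - 1):
        d[i] = (x[i + 1] - x[i - 1]) / (t[i + 1] - t[i - 1])
    return d


def distance_matrix(curves, t):
    """Symmetric matrix of pairwise L2 distances between sampled curves."""
    n = len(curves)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            dist[i][j] = dist[j][i] = l2_distance(curves[i], curves[j], t)
    return dist


if __name__ == "__main__":
    # Toy data (made up for illustration): three curves on a common grid.
    grid = [0.0, 0.25, 0.5, 0.75, 1.0]
    curves = [
        [0.0, 0.25, 0.5, 0.75, 1.0],
        [0.0, 0.30, 0.55, 0.80, 1.05],
        [1.0, 0.75, 0.5, 0.25, 0.0],
    ]
    for row in distance_matrix(curves, grid):
        print([round(v, 3) for v in row])
    print(first_derivative(curves[0], grid))
```

A pairwise distance matrix of this kind is what a hierarchical clustering routine takes as input, and the derivative curves could be passed to a classifier as additional covariates, in the spirit of the model described above.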

Author Biographies

Miguel Flores, National Polytechnic School

Miguel Flores is a professor at the National Polytechnic School and a researcher at the Center for Modeling Mathematics at the same institution in Quito, Ecuador. He holds a BSc in Statistical Computing Engineering from the Polytechnic School of the Coast. In 2006 he received an MSc in Operations Research from the National Polytechnic School, and in 2013 an MSc in Technical Statistics from the University of A Coruña. He is currently a doctoral student at the University of A Coruña in the area of Statistics and Operations Research. He has over 15 years of professional experience in various areas of statistics, computing, and optimization, including multivariate data analysis, econometrics, market research, quality control, the definition and construction of indicator systems, application development, and optimization modeling. ORCID ID: 0000-0002-7742-1247

Guido Saltos, Escuela Politécnica del Ejército, ESPE

Guido Saltos S. received his Engineering degree in Electronics from Escuela Politécnica del Ejército (ESPE), Quito, Ecuador, in 1987, and his M.S. degree in Applied Statistics from the National Polytechnic School (EPN), Quito, Ecuador, in 2016. He worked for several years in the field of industrial automation and now works at Universidad de las Américas (UDLA) in Quito, Ecuador. His interests are data depth and non-parametric statistics.

Sergio Castillo Páez, National Polytechnic School

Sergio Castillo Páez is a mathematical engineer who graduated from the National Polytechnic School in 2002. He also studied finance at the Simon Bolivar Andean University and is currently pursuing a PhD in Statistics at the University of Vigo, Spain. He is a professor at the ESPE Armed Forces University in Ecuador. His current lines of research are geostatistics and multivariate data analysis.

References

Cuevas A, Febrero M, Fraiman R. 2001. Cluster analysis: a further approach based on density estimation. Computational Statistics and Data Analysis 36: 441–456.

Dudoit et al. 2002. Comparison of discrimination methods for the classification of tumors using gene expression data. Journal of the American Statistical Association 97(457): 77–87.

Febrero-Bande M, Oviedo de la Fuente M. 2012. Statistical computing in functional data analysis: The R package fda.usc. Journal of Statistical Software 51(4): 1–28.

Febrero-Bande M, Gonzalez-Manteiga W. 2012. Generalized additive models for functional data. TEST 22(2): 278–292.

Ferraty F, Vieu P. 2006. Nonparametric Functional Data Analysis: Theory and Practice. Springer-Verlag, New York, pp. 113–146.

Fraiman R, Muniz G. 2001. Trimmed means for functional data. TEST 10(2): 419–440.

Lopez-Pintado S, Romo J, Torrente A. 2010. Robust depth-based tool for the analysis of gene expression data. Biostatistics 11(2): 254–264.

Ramsay JO, Silverman BW. 2005. Functional Data Analysis, 2nd ed. Springer-Verlag, New York, pp. 147–325.

Romualdi C, Campanaro S, Campagna D, Celegato B, Cannata N, Toppo S, Valle G, Lanfranchi G. 2003. Pattern recognition in gene expression profiling using DNA array: a comparative study of different statistical methods applied to cancer classification. Human Molecular Genetics 12: 823–836.

Singh D et al. 2002. Gene expression correlates of clinical prostate cancer behavior. Cancer Cell 1(2): 203–209.

Tárraga J, Medina I, Carbonell J, Huerta-Cepas J, Mínguez P, Alloza E, Al-Shahrour F, Vegas-Azcarate S, Gotz S, Escobar P, and others. 2008. GEPAS, a web-based tool for microarray data analysis and interpretation. Nucleic Acids Research 36: W308–W314.

Wessels LFA, Reinders MJT, Hart AAM, Veenman CJ, Dai H, He YD, Van’t Veer LJ. 2005. A protocol for building and evaluating predictors of disease state based on microarray data. Bioinformatics 21: 3755–3762.

Zuo Y, Serfling R. 2000. General notions of statistical depth function. Annals of Statistics 28: 461–482.

Published
2016-12-09
How to Cite
[1]
M. Flores, G. Saltos, and S. Castillo Páez, “Setting a generalized functional linear model (GFLM) for the classification of different types of cancer”, LAJC, vol. 3, no. 2, p. 8, Dec. 2016.
Section
Research Articles for the Regular Issue