Data Science

  • Faculty: David Corrêa Martins Jr, Fabricio Olivetti de França, Raphael Yokoingawa de Camargo, Wagner Tanaka Botelho, Jesus Pascual Mena Chalco, João Paulo Gois, Flavio Eduardo Aoki Horita, Carlos Alberto Kamienski, Harlen Costa Batagelo, Ronaldo Cristiano Prati, Denise Hideko Goya, Edson Pinheiro Pimentel, Francisco Javier Ropero Pelaez, Itana Stiubiener, Saul de Castro Leite, Denis Gustavo Fantinato, Paulo Henrique Pisani, Emilio de Camargo Francesquini, Vladimir Moreira Rocha, Valerio Ramos Batista, Fabio Marques Simões de Souza.
  • Laboratories: various.

Data science is an interdisciplinary research field comprising the scientific methods, systems, and processes used to gain insight into a phenomenon of interest from data in distinct formats, i.e., structured, semi-structured, or unstructured. The area emerged as a consequence of advances in information technologies (e.g., GPS, wearable equipment, and hard sensors) that increased the volume of available data, the so-called Big Data. Data processing and analysis are supported by techniques and theories from domains such as mathematics, computer science, statistics, and information science, and in particular from machine learning, pattern recognition, data mining, graph theory, and data visualization. Nevertheless, several challenges remain, from applying existing methodologies in new domain contexts (e.g., social media analysis for weather forecasting) to developing new approaches to existing problems (e.g., text mining using deep learning). This project aims to establish an interdisciplinary research network with international collaboration to address some of the challenges in the following stages of the data science cycle:

1) Pre-processing and representation, including data integration, multidimensional databases, complex networks, text and multimedia mining, and interoperability among different information systems and the Internet of Things;

2) Feature engineering, to extract the relationships among features through neural networks and symbolic regression;

3) Creation of regression and classification models through semi-supervised and supervised learning;

4) Combinatorial and numerical optimization for feature selection;

5) Model validation through the interpretability of the generated model and applications to real-world scenarios;

6) Challenges associated with the use of a large amount of data, as well as the use of parallel and distributed computing for high-performance systems.

These studies will be applied to political science and sentiment analysis in social networks, text and multimedia mining, educational data mining, smart cities and agriculture, urban resilience against natural disasters, scientometrics, systems biology, neurocomputing, brain-computer interfaces, neuromorphic computing, aided image and video segmentation, and robotics.