Michał Woźniak is a professor of computer science at the Department of Systems and Computer Networks, Wrocław University of Science and Technology, Poland. His research focuses on machine learning, compound classification methods, classifier ensembles, data stream mining, and imbalanced data processing. Prof. Woźniak has been involved in research projects related to these topics and has been a consultant on several commercial projects for well-known Polish companies and public administration. He has published over 300 papers and three books. Prof. Woźniak has received numerous prestigious awards for his scientific achievements, such as the IBM Smarter Planet Faculty Innovation Award (twice) and the IEEE Outstanding Leadership Award, as well as several best paper awards at prestigious conferences. He is a member of the editorial boards of highly ranked journals. Prof. Woźniak is a senior member of the IEEE.
Chosen Challenges of Imbalanced Data Classification
Imbalanced data classification remains a focus of intense research because most learning methods work well only with a reasonably balanced data set, while many real-world applications have to deal with imbalanced ones. A data set is said to be imbalanced when some classes (minority classes) are under-represented in comparison with others (majority classes). Learning from imbalanced data is among the contemporary challenges in machine learning, and multi-class imbalance and imbalanced data streams stand out as the most challenging scenarios.
In binary imbalanced learning, the relationship between classes is easy to define: one class is the majority, while the other is the minority. In multi-class scenarios, however, this is no longer obvious, as the relationships among classes may vary; e.g., a class can be a minority with respect to one class and, at the same time, a majority with respect to another. Therefore, canonical methods designed for binary cases cannot be directly applied in such scenarios.
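To illustrate why the minority and majority roles become relative in the multi-class case, the following minimal sketch computes pairwise imbalance ratios from class counts. The class names and counts are invented for illustration, and `pairwise_imbalance` is a hypothetical helper, not a method presented in the talk.

```python
from collections import Counter
from itertools import permutations


def pairwise_imbalance(labels):
    """Return {(a, b): count(a) / count(b)} for every ordered pair of distinct classes."""
    counts = Counter(labels)
    return {(a, b): counts[a] / counts[b] for a, b in permutations(counts, 2)}


# Hypothetical label set: class "B" is a minority relative to "A"
# but a majority relative to "C".
labels = ["A"] * 900 + ["B"] * 90 + ["C"] * 10
ratios = pairwise_imbalance(labels)

for (a, b), r in sorted(ratios.items()):
    role = "majority" if r > 1 else "minority" if r < 1 else "balanced"
    print(f"{a} vs {b}: ratio {r:.2f} -> {a} is the {role} class")
```

In this toy example, class "B" plays both roles at once, which is exactly the situation a purely binary minority/majority treatment cannot express.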
Another topic discussed in the talk is imbalanced data stream classification, because only a few authors distinguish between the imbalanced data stream classification problem and a scenario in which prior knowledge about the entire data set is available. The difference stems from the lack of knowledge about the class distribution, an issue that is especially pronounced in the initial stages of data stream classification. Another difficulty is the presence of concept drift, which usually leads to a deterioration in classifier quality. Concept drift may take different forms, but it changes the probabilistic characteristics of the decision task; e.g., it can shift the prior probabilities, i.e., the frequency with which objects of the examined classes appear. A typical example is technical diagnosis, in which the fault probability increases with utilization time, possibly as a result of material fatigue. Sometimes the relationship between the minority and majority classes changes so much that the former becomes the majority class. This phenomenon can also be observed in tasks related to social media analysis, such as tracking the popularity of topics discussed on Twitter, and in environmental hazard detection systems, such as oil spill detection. Extreme imbalance is also common in practice: medical screening for a condition is usually performed on a large population of people without the disease in order to detect a small minority with it (e.g., HIV prevalence in the USA is ca. 0.4%), and the conversion rates of online ads have been estimated to lie between 10^-3 and 10^-6.
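The shift in prior probabilities described above can be sketched with a simulated stream. The example below is only an illustration under stated assumptions: the fault probability is assumed to rise linearly over time, the window size and probabilities are arbitrary, and `WindowedPrior` and `fault_stream` are hypothetical names rather than methods from the talk.

```python
import random
from collections import deque


class WindowedPrior:
    """Estimate the fraction of positive (fault) labels over the last `size` labels."""

    def __init__(self, size):
        self.window = deque(maxlen=size)

    def update(self, label):
        # The deque drops the oldest label automatically once it is full.
        self.window.append(label)
        return sum(self.window) / len(self.window)


def fault_stream(n, p_start, p_end, seed=0):
    """Yield n binary labels whose fault probability rises linearly from p_start to p_end."""
    rng = random.Random(seed)
    for t in range(n):
        p = p_start + (p_end - p_start) * t / (n - 1)
        yield 1 if rng.random() < p else 0


estimator = WindowedPrior(size=500)
for t, y in enumerate(fault_stream(5000, p_start=0.01, p_end=0.60), start=1):
    prior = estimator.update(y)
    if t % 1000 == 0:
        print(f"t={t}: estimated fault prior = {prior:.3f}")
```

Early in the stream the fault class is a rare minority and the estimate rests on few labels; by the end, the same class has become the majority, so any imbalance correction tuned to the initial class distribution would no longer fit.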
This talk will discuss the main problems of imbalanced data classification, such as multi-class imbalanced data analysis and imbalanced data stream classification, with particular attention to the methods developed by the Machine Learning team at the Department of Systems and Computer Networks, Wrocław University of Science and Technology.