Research Colloquium on Data and Knowledge Engineering

As part of this research colloquium, current research work in the field of data and knowledge engineering (DKE) will be presented. The colloquium usually takes place on Thursdays or Fridays, starting at 10:00 am or 1:00 pm, in room G29-301 or G29-130.
Please address questions about the colloquium to Andreas Nürnberger and Myra Spiliopoulou.

Current lectures:

Feb 28, 2025 (10:00-11:00 CET in room G29-301)

Unsupervised cardiac MRI phenotyping with 3D diffusion autoencoders reveals novel genetic insights
Dr. Soumick Chatterjee (Genomics Research Centre, Fondazione Human Technopole Italy)

https://ovgu.zoom-x.de/j/69108417582
Meeting ID: 691 0841 7582 (Passcode: 443936)

Biobank-scale imaging presents an unprecedented opportunity to characterise thousands of organ phenotypes (i.e. image features), examine their variability across populations, and explore their associations with disease outcomes. However, deriving specific phenotypes from imaging data, such as Magnetic Resonance Imaging (MRI), requires laborious expert annotation, which limits scalability and fails to fully exploit the information-rich nature of such image acquisitions. To address this, we developed a three-dimensional diffusion autoencoder to derive latent phenotypes (i.e. latent representations) from temporally resolved cardiac MRI data of 71,021 UK Biobank participants. These phenotypes were reproducible, heritable (h² = [4–18%]), and significantly associated with cardiometabolic traits and outcomes, including atrial fibrillation (P = 8.5 × 10⁻²⁹) and myocardial infarction (P = 3.7 × 10⁻¹²). By applying latent space manipulation techniques, we were able to interpret and visualise the specific features captured by each latent phenotype in a given MRI scan. To establish the genetic basis of these traits, we conducted a genome-wide association study, identifying 89 significant common variants (P < 2.3 × 10⁻⁹) across 42 loci, including seven novel loci. Extensive multi-trait colocalisation analyses (PP.H₄ > 0.8) linked these variants across phenotypic scales, from intermediate cardiac traits to clinical disease endpoints. Furthermore, Polygenic Risk Scores (PRS) derived from latent phenotypes demonstrated predictive utility for a range of cardiometabolic diseases, enabling effective stratification of individuals into distinct risk groups. In conclusion, this study highlights the potential of diffusion autoencoding methods as powerful tools for unsupervised phenotyping, genetic discovery, and disease risk prediction using cardiac MRI data. 
We are now extending this approach to a multi-modal, multi-organ framework to elucidate the shared genetic architecture underlying these traits.

Past lectures:

July 19, 2024 (09:00 CET in room G29-130)

Computational Modelling of Human Non-verbal Behaviour 'in-the-wild'
Dr. Shreya Ghosh (Curtin University, Australia)

https://ovgu.zoom-x.de/j/67619297805
Meeting ID: 651 3273 6966 (Passcode: 973495)

Non-verbal human behaviour understanding plays an important role in day-to-day communication and social interaction. The information gained from non-verbal cues can therefore help to gain insights into complex mental states. Over the past few years, several attempts have been made to bridge the gap between human cognitive ability and machine intelligence in understanding gestural cues in complex interaction scenarios. However, due to a lack of properly labelled data, annotation complexity and computational constraints, these models lag in their ability to encode gestural cues properly. To this end, my research aims to develop computational models that can learn and model non-verbal gestural cues in social interaction scenarios, preferably with limited supervision. This research is a stepping stone towards empowering AI models to understand complex human-human interactions. There are several real-world scenarios in which it could be deployed, such as monitoring students in classrooms for engagement, group cohesiveness analysis in task-driven environments, localising the influential or dominant person in an interaction environment, emergent leader detection, and mob monitoring.

May 28, 2024 (10:00 CET in room G29-301 - Talk organized jointly with KMD Colloquium)

ELADAIS: High Social Impact Data Analysis, Extraction and Storage
Prof. Ernestina Menasalvas (Universidad Politécnica de Madrid)

https://ovgu.zoom-x.de/j/67619297805
Meeting ID: 676 1929 7805 (Passcode: 451540)

Experience-based medicine is being replaced by an evidence-based, patient-centred approach. Artificial intelligence (AI) is emerging as a new tool to improve predictive capabilities in the laboratory and the clinic. Approaches are needed that address issues such as working with multiple heterogeneous data modalities, insufficient or low-quality data, interpretability and explanation, and alternative learning approaches. A digital framework that includes real-world data (RWD) pipelines can also leverage AI and natural language processing to discover insights that support decision-making in everyday clinical practice. Patients are also meant to become active participants in research and routine clinical practice, so that not only clinicians but also patients benefit. This talk addresses this problem and presents ELADAIS, whose main objective is to develop services, independent of the clinical history system in use and of the data storage standards, that make it possible to integrate, clean, enrich and subsequently exploit data by applying AI techniques. The aim is to generate developments for the different agents involved in health (health professionals, epidemiologists, pharmacologists, researchers, patients, family members, managers, ...), with applications that improve healthcare assistance, empower patients, prevent illnesses, enable a more personalised follow-up of patients, and exploit the huge amounts of data that exist in healthcare to gain a better understanding of illnesses and their treatments, and to use this knowledge for social purposes.

May 27, 2024 (11:00 CET in room G29-301 - Talk organized jointly with KMD Colloquium)

Estimating long-term cancer-related survival from multiple prophylactic strategies: a temporal Bayesian network simulation
Associate Prof. Pedro Pereira Rodrigues (Faculty of Medicine of the University of Porto)

https://ovgu.zoom-x.de/j/67619297805
Meeting ID: 676 1929 7805 (Passcode: 451540)

Estimating the comparative effectiveness of multiple prophylactic strategies used in clinical practice to prevent cancer-related mortality in patients with specific gene mutations is not something we can cleanly design in clinical trials. In the future, routinely collected electronic health records might offer new ways of estimating such comparative effectiveness from real-world data, but if the target population under study is too specific, collecting data from a sample large enough to compare multiple strategies might prove impossible. To empower clinical decisions, and aiming to develop a personalized risk management guideline, we have constructed a temporal Bayesian network model to simulate the expected overall mortality in patients who underwent different prevention strategies, taking into account the patient's prognostic parameters and received treatment. This allows the long-term survival under 9 prophylactic strategies to be compared. Transition probabilities were derived from the literature after a critical review of studies published in PubMed; all risk estimates were converted into yearly estimates by means of conditional probabilities, with the conversion depending on the original metric published in the literature. For each simulated patient, the first temporal node to be activated was identified, and survival was computed accordingly. Overall survival of patients from each subgroup × policy combination was then plotted as Kaplan-Meier curves. We illustrate our approach with a specific real-world problem in breast cancer survival analysis, simulating 2.5M patients across 144 subgroup cohorts and 9 different policies over a 40-year follow-up. The illustrated example was the result of joint work with Jelena Maksimenko (Riga Stradins University, Latvia) and Maria João Cardoso (Champalimaud Foundation, Portugal).

May 24, 2024 (11:00 CET in room G29-301 - Talk organized jointly with KMD Colloquium)

Explainable and actionable machine learning and its implications to healthcare
Prof. Panagiotis Papapetrou (Dept. of Computer and Systems Sciences, Stockholm University)

https://ovgu.zoom-x.de/j/67619297805
Meeting ID: 676 1929 7805 (Passcode: 451540)

In this talk, I will introduce some key concepts of explainability in relation to machine learning models. Examples of local and global explainers will be provided in the context of healthcare applications. Moreover, I will discuss the need for counterfactual explanations, and how these relate to causality. Some recent counterfactual explainers will be presented with emphasis on time series data sources. To further elaborate on the implications of machine learning in healthcare, we will explore the practical challenges and ethical considerations associated with deploying these models. This includes addressing issues of fairness and bias, which are crucial for ensuring that healthcare interventions are equitable across different patient demographics. Additionally, I will highlight the importance of actionable insights derived from machine learning models, illustrating how they can lead to improved patient outcomes and more efficient healthcare services. We will also examine case studies where explainable AI has been successfully integrated into clinical settings, showcasing the tangible benefits and potential hurdles of such implementations. This discussion aims to provide a comprehensive overview of the current landscape and future directions in the use of explainable, actionable, and fair machine learning in healthcare.

May 23, 2024 (10:00 CET in room G29-301)

Analyzing Behavior in Video Recordings
Prof. Dr. Jürgen Gall (Department of Information Systems and Artificial Intelligence, University of Bonn)

https://ovgu.zoom-x.de/j/67619297805
Meeting ID: 676 1929 7805 (Passcode: 451540)

In this talk, I will give an overview of some methods that we developed for analyzing human behavior in video recordings. This includes estimating the 2D pose of humans and tracking multiple subjects, as well as temporally detecting pre-defined behavior patterns in videos. While the algorithms are evaluated mainly for recognizing human behavior, the methods can be applied to recognizing animal behavior as well. I will also address the question of how the annotation effort required to train the approaches can be reduced. This includes different ways of annotating video recordings and self-supervised learning.

May 23, 2024 (11:00 CET in room G29-301 and Zoom - Talk organized jointly with KMD Colloquium)
Mobile Health and the Medical Informatics Working Group in Würzburg
Prof. Dr. Rüdiger Pryss (University of Würzburg)

https://ovgu.zoom-x.de/j/67619297805
Meeting ID: 676 1929 7805 (Passcode: 451540)

The Medical Informatics Working Group at the University of Würzburg (Institute for Clinical Epidemiology and Biometry) and the Würzburg University Hospital (Institute for Medical Data Sciences) are currently overseeing 11 major Mobile Health projects. These projects are receiving either national or international funding and span a range of fields including sensor technology, Patient Reported Outcome Measures (PROMs), Ecological Momentary Assessments (EMAs), machine learning, chatbot technology, and innovative interaction concepts. To date, the group has supported more than 25 digital studies. The latest development involves a spin-off from this research. This presentation will provide an overview of key topics, ongoing challenges, and a forward-looking perspective on the group’s ambitions for the coming years. One of the case studies discussed is the 6-minute walk test, which illustrates both the current limitations and the potential of modern smartphone technology.

Jan 15, 2024 (14:00 CET in room G29-336 and Zoom)
Molecular-Continuum Flow Simulation in the Era of Exascale Computing and Data Science
Prof. Dr. Philipp Neumann (Helmut Schmidt University)

Molecular-continuum simulations in fluid dynamics, the subject of this talk, couple computational fluid dynamics (CFD) solvers and molecular dynamics (MD) simulations in a domain decomposition sense. This makes it possible to invest in computationally intensive MD in small local spots where the molecular behavior needs to be resolved, and to rely on computationally cheap CFD everywhere else. Typically, the MD solver consumes most of the computational time in molecular-continuum simulations. Although this multiscale approach renders the respective flow simulations significantly cheaper than stand-alone MD systems, it can still easily require massive amounts of computational resources. Particular challenges arise from rapidly evolving exascale-enabling hardware technology and the vast amount of computational resources in exascale systems. Besides, machine learning and data science methods have evolved as an additional scientific research paradigm, extending the computational approach via numerical simulations.

The talk will address the aforementioned challenges. First, it will introduce molecular-continuum methods and explain how these methods can be ported to massively parallel HPC systems, leveraging ensembles for error control and, potentially, fault tolerance. Second, it will focus on data science aspects. In particular, it will present the simulation-internal data processing flow, which includes noise filtering, and a machine learning method that strives to temporarily replace the expensive molecular dynamics component, which is ongoing work. All developments have been implemented in a mature software package, the macro-micro-coupling tool MaMiCo.

July 13, 2023 (15:00 CET in room G29-412)
Potentials and Limitations of observational population-based studies
Dr. Till Ittermann (Head of the Statistical Method Unit, Institute for Community Medicine, University Medicine Greifswald)

There is a broad range of medical research questions that can be addressed by population-based studies, including the description of the prevalence and incidence of diseases and risk factors, the definition of reference intervals for clinical biomarkers, the investigation of associations between potential (genetic) risk factors and diseases, the calculation and validation of prediction models for certain diseases, and data mining analyses. Limitations that have to be taken into account derive from selection bias, confounding bias and information bias. This talk will give a summary of the potentials and limitations of population-based studies using examples from the Study of Health in Pomerania.

April 5, 2023 (09:40 am CET via Zoom)
Explaining Drug Recommendations with Deep Learning
Prof. Panagiotis Symeonidis (School of Information & Communication Systems Engineering, Aegean University, Greece)

In this talk, we will present methods for finding optimal drug combinations to support the work of medical doctors, by minimizing unwanted drug side effects (less toxicity) or improving recovery (e.g. faster healing). In particular, we will present state-of-the-art deep reinforcement learning algorithms for providing medicine recommendations. Moreover, we will present graph-based methods that can find interesting patterns in health-related knowledge graphs in order to provide explainable recommendations that support the decisions of medical doctors. Finally, we will present a demo application that can help medical doctors identify the most critical measurements (e.g. glucose index, heart rate, etc.) from lab tests related to a patient's clinical status, and we will demonstrate a proof of concept that can be used to predict optimal dosing of insulin for patients with diabetes.

Oct 21, 2019 (3:00 pm s.t. in room G29-301)
Headline/Summary Automated Evaluation - Challenges, SotA and HEvAS System
Dr. Marina Litvak (Sami Shamoon College of Engineering, Beer Sheva, Israel)

Automatic headline generation is a sub-task of one-line summarization with many reported applications. Evaluation of systems generating headlines is a very challenging and undeveloped area. In this talk, I will introduce multiple metrics for automatic evaluation of systems in terms of the quality of the generated headlines. The metrics measure the headlines' quality both from the informativeness and the readability perspectives, where informativeness is evaluated at the lexical and semantic levels.

March 14, 2019 (13:00 s.t. in room G29-130)
Interpretable feature learning and classification: from time series feature tweaking to temporal abstractions in medical records
Prof. Panagiotis Papapetrou (Faculty of Social Sciences, Stockholm University)

The first part of the talk will tackle the issue of interpretability and explainability of opaque machine learning models, with a focus on time series classification. Time series classification has received great attention over the past decade, with a wide range of methods focusing on predictive performance by exploiting various types of temporal features. Nonetheless, little emphasis has been placed on interpretability and explainability. This talk will formulate the novel problem of explainable time series tweaking, where, given a time series and an opaque classifier that provides a particular classification decision for the time series, the objective is to find the minimum number of changes to be performed on the given time series so that the classifier changes its decision to another class. Moreover, it will be shown that the problem is NP-hard. Two instantiations of the problem will be presented. The second part of the talk will focus on temporal predictive models and methods for learning from sparse Electronic Health Records. The main application area is the detection of adverse drug events by exploiting temporal features and applying different levels of abstraction, without compromising predictive performance in terms of AUC.

Jan. 24, 2019 (1:00 pm s.t. in room G29-301)
Data driven innovation - research challenges and opportunities
Prof. Dr. Barbara Dinter (Business Informatics, Chemnitz University of Technology)

Modern big data & analytics technologies and methods lead to manifold opportunities for innovative use cases and business models. Although organizations have started to establish appropriate technical and organizational infrastructures (e.g. big data labs), they still need advice on how best to benefit from such investments, in particular if the big data activities are to result not only in the optimization of existing applications and processes, but in true data-driven innovation. The talk will provide an overview of how the fields of big data & analytics and of innovation management converge, resulting in many challenging research questions. Following a framework with origins in the Service Dominant Logic, the potential mutual usage and impact of both fields will be presented. The role of open data and of open innovation for data-driven innovation will be illustrated by a research project in the field of open innovation for e-mobility. In addition, recent research on how to teach data-driven innovation will be presented.

Jan 18, 2019 (10:00 am s.t. in room G29-301)
From Ontology Development as Craft towards Ontology Engineering
Dr. Fabian Neuhaus (Institute for Intelligent Cooperating Systems, FIN, OVGU)

Ontologies have been in successful use for at least 20 years. Nevertheless, the development of ontologies is still a cumbersome and expensive process. In my presentation I will address three challenges for ontology developers: (1) It is difficult to reuse ontologies and adapt them for new purposes. (2) A plethora of representation languages leads to difficult choices and interoperability issues. (3) Ontology developers rarely evaluate their ontologies against requirements during development. These challenges are addressed by the Distributed Ontology, Modelling, and Specification Language (DOL), which is developed and implemented at OVGU and became an international standard of the Object Management Group (OMG) in 2018.

Jan 07, 2019 (3:00 pm s.t. in room G29-301)
How to Break an API: How Community Values Influence Practices
Prof. Dr. Christian Kästner (Carnegie Mellon University, Institute for Software Research)

Breaking the API of a package can create severe disruptions downstream, but package maintainers have flexibility in whether and how to perform a change. Through interviews and a survey, we found that developers within a community or platform often share cohesive practices (e.g., semver, backporting, synchronized releases), but that those practices differ from community to community, and that most developers are not aware of alternative strategies and practices, their tradeoffs, and why other communities adopt them. Most interestingly, practices and community consensus often seem to be driven by implicit values in each community, such as stability, rapid access, or ease of contribution. Understanding and discussing values openly can help to understand and resolve conflicts,

Nov. 15, 2018 (1:00 p.m. s.t. in room G22, 2nd floor, Faculty Center FWW)
Social Media Analytics - New Potentials and Challenges for Research and Practice
Prof. Dr. Stefan Stieglitz (University of Duisburg-Essen, Communication in Electronic Media / Social Media)

Researchers as well as companies collect and analyze social media communication for various reasons, e.g. to understand general patterns of interaction, but also to identify potential customers or to offer new services. A variety of methods are used to structure and visualize these heterogeneous data. By conducting a systematic literature review, we identified the major challenges in the context of social media analytics. Based on two case studies (one on crisis communication in social media and one on social bots), it will be highlighted why 'dynamics of communication' and the 'quality of data' need to be carefully considered for meaningful analyses of social media communication.

The colloquium takes place in cooperation with the Faculty of Economics and Management (FWW).

Note: The slides will be in English, the presentation itself will be given in German.

02 Oct 2018 (3:00 p.m. s.t. in room G29-035, SwarmLab)
Development of evolutionary computation methods for multi-objective design optimization and decision-making
Prof. Dr. Hemant Singh (The University of New South Wales, Canberra, Australia)

Simultaneous optimization of multiple conflicting criteria is a problem commonly encountered in several disciplines, such as engineering, operations research and finance. The solution to such problems consists of not one but a set of best trade-off designs in the objective space, known as the Pareto Optimal Front (POF). Metaheuristics such as evolutionary algorithms (EAs) are commonly used to solve these problems owing to several advantages, including parallelizability, the global nature of the search and the ability to deal with highly non-linear/black-box functions. However, in their native form, EAs require large numbers of function evaluations to deliver good results, which becomes prohibitive if each design evaluation is done using a computationally expensive experiment (such as Finite Element Analysis, Computational Fluid Dynamics, etc.). This has motivated a number of past and ongoing studies towards developing strategies for reducing the number of design evaluations during the search. This talk discusses some of the recent efforts undertaken by the speaker with his research group in overcoming this challenge using spatially distributed surrogates and decomposition-based methods. Thereafter, mechanisms to support informed decision making (i.e. selecting the solutions or regions of interest from the POF) will also be discussed. A brief snapshot of some practical applications will also be presented.
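
To make the notion of a Pareto Optimal Front concrete, the following is a minimal sketch, not taken from the talk, of how the non-dominated set can be filtered from a list of already evaluated designs. It assumes that all objectives are to be minimized; the function names `dominates` and `pareto_front` and the sample `designs` are hypothetical and chosen for illustration only.

```python
from typing import List, Sequence, Tuple


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """Return True if design a dominates design b (assumption: minimization).

    a dominates b when a is no worse in every objective and strictly
    better in at least one objective.
    """
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def pareto_front(points: List[Tuple[float, ...]]) -> List[Tuple[float, ...]]:
    """Keep only the points that no other point dominates (O(n^2) scan)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical designs, each evaluated on two conflicting objectives
# (for example cost and weight), both to be minimized.
designs = [(1.0, 9.0), (2.0, 7.0), (3.0, 8.0), (4.0, 3.0), (5.0, 4.0), (6.0, 1.0)]
front = pareto_front(designs)
print(front)
```

In this example, (3.0, 8.0) is dominated by (2.0, 7.0) and (5.0, 4.0) by (4.0, 3.0), so only the remaining four designs form the front. The quadratic scan is fine for small sets; it does nothing to reduce the number of expensive design evaluations, which is the problem the surrogate-based methods in the talk address.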

10 Sep 2018 (2:00 p.m. s.t. in room G29-301)
Modeling Attention for Post-Desktop User Interfaces
Dr. Felix Putze (Senior Researcher @ Cognitive Systems Lab, University of Bremen)

In recent years, many “post-desktop” user interfaces have emerged, for example the already omnipresent smart phones and smart watches, but also interfaces for Virtual and Augmented Reality. This paradigm shift results in a trend towards mobile and concurrent use of technology, with frequent side effects such as distraction and information overload. By employing biosignal-based user modeling, we can provide information sources to detect and respond to such effects. In this talk, I will focus on different biosignal-based models of attention as one of the central user states, for example to manage the amount and type of information presented as well as for understanding a user's implicitly communicated intent. I will show the results of multiple studies in which we monitor brain activity, eye gaze,

7 Sep 2018 (10:00 am s.t. in room G29-301)
Home Caring Robot and Its Key Technologies
Prof. Dr. Hon Chi Tin (Macau University of Science and Technology)

Robots have wide applications in elderly care scenarios, from lifting robots and social robots to companion robots. For an individual robot that accompanies an elderly person, several key technologies are required, namely obstacle avoidance, behavioral pattern detection, fall detection, natural language processing, remote diagnosis and the like. The research team from Macau University of Science and Technology has developed a robot, Singou Butler, that incorporates these key technologies. The talk will take Singou Butler as an example and discuss them one by one.

Feb 15, 2018 (08:30 s.t. in room G29-301)

Evolution of machine learning - the way from neural networks to deep learning
Prof. Dr. Ali Reza Samanpour (South Westphalia University of Applied Sciences, Department of Engineering and Economics)

The history of Artificial Intelligence suggests that there has been a gradual and evolutionary development of a specific part of computational science underlying machine learning technologies that has not been defined by this perception/conception.

The bulk of these technologies consisted of the methods defined by what is known as computational intelligence, which includes neural networks, evolutionary algorithms, and fuzzy systems. As more data mining topics have emerged, driven by rapidly growing data volumes (Big Data) and the related challenges of the Internet of Things (IoT), one can observe that the economic system is changing accordingly. Nowadays you can find a number of vendors offering machine learning frameworks. Some of them enable the use of machine learning tools in the cloud. This option is mainly offered by big players such as Microsoft Azure ML, Amazon Machine Learning, IBM Bluemix and Google Prediction API, to name just a few.

Machine learning algorithms extract complex, high-level abstractions as data representations through a hierarchical learning process. Based on relatively simple abstractions formulated at the previous level in the hierarchy, complex abstractions are learned at a given level. Deep learning is a sub-area of machine learning, but could also be described as a further development of the classic artificial neural networks. While traditional machine learning algorithms rely on fixed sets of models for detection and classification, deep learning algorithms independently evolve, guide, or create their own new model layers within the neural networks. This does not have to be developed and implemented manually again and again for new circumstances, as would be the case with classic machine learning algorithms. The advantage of deep learning lies in the analysis and learning of large amounts of data. This makes it a valuable tool for data analytics in the context of raw data that is largely unlabeled and uncategorized.

In other words, how can computers be made to do what needs to be done without being told how it should be done?

Feb. 8, 2018 (12:00 p.m. s.t. in room G29-301)
SAP Health: Applications and Analytics
Dr.-Ing. Matthias Steinbrecher (SAP, Potsdam)

This talk will cover cohort analysis applications and projects of the SAP Health organization. Cohort analysis is about finding and analyzing patient groups for research or therapy. The use cases will cover existing products like SAP Medical Research Insights, upcoming releases like SAP Health for Clinical Quality as well as research topics around pattern visualization in medical records.

November 9, 2017 (1:00 p.m. s.t. in room G29-301)
Cohort analysis made visual on explorative methods for medical research
Dr.-Ing. Thorsten May (Fraunhofer IGD, Darmstadt)

My talk will focus on the present, future, and past of medical visual analytics research at Fraunhofer IGD (in roughly that order). I will present two current examples from our projects on patient cohort analysis. Cohort analysis aims at defining subsets of patients that are comparable by virtue of properties that are relevant for prevention, diagnosis, or therapy. Visual Analytics research for cohort analysis aims at making this process visible and navigable for the medical researchers. Ideally, the visual cohort analysis enables the physician to embed her own knowledge into the cohort definition. We expect future research to extend the basis for cohort analysis beyond clinical, demographic, and follow-up data. Imaging-based approaches (MRI, CT, U/S, ... ) represent rich input that can be used for a more comprehensive analysis of the patients' situation. My talk will outline a number of challenges that remain to be solved. Our research line evolved from research on general multivariate visual data analysis and time-series analysis that started some 12 years ago. Hence, this talk concludes with the “tale of two arrows”, briefly reflecting on struggles to understand and explain what visual analytics actually is, beyond Keim's process model (with the arrows), and to structure our own lectures according to this understanding.

July 03, 2017 (1:15 pm s.t. in room G29-301)
Theory and Practice of Big Data Analytics for Railway Transportation Systems
Assoc. Prof. Luca Oneto (University of Genoa, Italy)

Big Data Analytics is one of the currently trending research interests in many industrial sectors, and in particular in the context of railway transportation systems. Indeed, many aspects of the railway world can greatly benefit from new technologies and methodologies able to collect, store, process, analyze and visualize large amounts of data, as well as from new methodologies coming from machine learning, artificial intelligence, and computational intelligence to analyze that data in order to extract actionable information. The EC H2020 In2Rail project is the perfect example of an initiative aimed at bringing big data technologies into the railway world. The purpose of this talk is to show how theory and practice must be exploited together in order to solve real big data analytics problems in the field of railway transportation systems. In particular, we will focus on one of the problems that we are facing in the In2Rail project: predicting train delays in the Italian railway network by exploiting both data coming from Rete Ferroviaria Italiana and exogenous data sources. For this purpose, we will make use of the most recent advances in the analytics field of research, from deep learning to the thresholdout model selection framework.

May 12, 2017 (10:00 am s.t. in room G29-301)
Big Data Visualization: Graphics quality factors
Prof. Dr. Juan J. Cuadrado Gallego (Universidad de Alcalá, Spain)

Nowadays Big Data is used in almost all fields of human knowledge. The main goal of Big Data is to analyze large databases to find useful information that expands knowledge in the field to which it is applied. In addition, the reason to gain knowledge is to share it. It is at these two points that data visualization plays a bigger role each day. Data visualization can help to analyze data faster and to share the acquired knowledge more easily. For these reasons, many new data graphics are used and published every day. But do all of them serve the purposes for which they are used? That is, do all of them allow an easier and faster analysis of the databases and an easier and faster transmission of the information and knowledge obtained from the analysis of big databases? The answer is no, because it is not enough to use graphics to improve big data analysis. The user must know when and how to use data visualization. It is not enough to know how a graphic must be developed; one must also know which design aspects must be applied to make a graphic useful. This talk introduces the quality aspects that must be applied to obtain not just data visualization but higher-quality data visualization.

May 11, 2017 (11:00 am s.t. in room G29-301)
Three Algorithms Inspired by Data from the Life Sciences
Dr. Allan Tucker (Brunel University London)

In this talk I will discuss how the analysis of real-world data from health and the environment can shape novel algorithms. Firstly, I will discuss some of our work on modelling clinical data. In particular I will discuss the collection of longitudinal data and how this creates challenges for diagnosis and the modelling of disease progression. I will then discuss how cross-sectional studies offer additional useful information that can be used to model disease diversity within a population but lack valuable temporal information. Finally, I will discuss the importance of inferring models that generalise well to new independent data and how this can sometimes lead to new challenges, where the same variables can represent subtly different phenomena. Some examples in ecology and genomics will be described.

May 10, 2017 (5:00 pm s.t. in room G29-301)
Multiobjective Clustering
Prof. Dr. Sanghamitra Bandyopadhyay (Indian Statistical Institute, Kolkata)

When the only data available is unlabelled, clustering is one of the primary operations applied. The objective is to group those data points that are similar to each other, while clearly separating dissimilar groups from each other. In clustering, usually some similarity/dissimilarity metric is optimized such that a pre-defined objective attains its optimal value. The problem of clustering is therefore essentially one of optimization. The use of metaheuristic methods like genetic algorithms has been demonstrated successfully in the past for clustering a data set. The clustering problem inherently admits a number of criteria or cluster validity indices that have to be simultaneously optimized for obtaining improved results. Hence in recent times the problem has been posed in a multiobjective optimization (MOO) framework and popular metaheuristics for multiobjective optimization have been applied. In this talk, we will first briefly discuss the fuzzy c-means algorithm, followed by an introduction to the basic principles of MOO and the popular NSGA-II algorithm. Subsequently it will be shown how the algorithm is useful for solving the clustering problem. Since such algorithms provide a number of solutions, a way of combining the multiple clustering solutions so obtained into a single one using supervised learning will be explained. Finally, results will be demonstrated on clustering of some popular gene expression data sets.
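
As a rough illustration of the multiobjective view described in this abstract, the following Python sketch filters candidate clusterings down to the non-dominated (Pareto-optimal) ones. It is not NSGA-II itself, only the dominance test that such algorithms build on. The candidate names and the two objective values are made up for illustration, and both objectives are assumed to be minimised.

```python
def dominates(a, b):
    """Return True if objective vector a dominates b (all objectives minimised)."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def pareto_front(solutions):
    """Keep only the solutions whose objective vectors no other solution dominates."""
    return [
        s for s in solutions
        if not any(dominates(o["objectives"], s["objectives"]) for o in solutions)
    ]


# Hypothetical clusterings, each scored on two validity indices to minimise.
candidates = [
    {"name": "A", "objectives": (0.20, 0.90)},
    {"name": "B", "objectives": (0.35, 0.40)},
    {"name": "C", "objectives": (0.80, 0.10)},
    {"name": "D", "objectives": (0.50, 0.60)},  # worse than B on both indices
]
front_names = [s["name"] for s in pareto_front(candidates)]
```

Here `front_names` contains A, B and C: each is best on some trade-off between the two indices, while D is dropped because B is better on both. A multiobjective method returns such a set rather than a single clustering, which is why the abstract mentions a further step for combining the solutions.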

Jan 19, 2017 (1:00 pm s.t. in room G29-301)
Random Shapelet Forests for time series classification
Prof. Panagiotis Papapetrou (Stockholm University)

In this talk I will present a novel technique for time series classification called random shapelet forest. Shapelets are discriminative subsequences of time series, usually embedded in shapelet-based decision trees. The enumeration of time series shapelets is, however, computationally costly, which in addition to the inherent difficulty of the decision tree learning algorithm to effectively handle high-dimensional data, severely limits the applicability of shapelet-based decision tree learning from large (multivariate) time series databases.

In the first part of the talk I will discuss a novel tree-based ensemble method for univariate and multivariate time series classification using shapelets, called the generalized random shapelet forest algorithm. The algorithm generates a set of shapelet-based decision trees, where both the choice of instances used for building a tree and the choice of shapelets are randomized. For univariate time series, it is demonstrated through an extensive empirical investigation that the proposed algorithm yields predictive performance comparable to the current state-of-the-art and significantly outperforms several alternative algorithms, while being at least an order of magnitude faster. Similarly for multivariate time series, it is shown that the algorithm is significantly less computationally costly and more accurate than the current state-of-the-art.

The second part of the talk will focus on early classification of time series. I will present a novel technique that extends the random shapelet forest to allow for early classification of time series. An extensive empirical investigation has shown that the proposed algorithm is superior to alternative state-of-the-art approaches when predictive performance is considered more important than earliness. The algorithm allows for tuning the trade-off between accuracy and earliness, thereby supporting the generation of early classifiers that can be dynamically adapted to specific needs at low computational cost.
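
To make the notion of a shapelet more concrete, here is a minimal Python sketch of the two building blocks a shapelet-based decision tree relies on: the distance between a shapelet and a time series, and a single node test that splits a data set by that distance. It is an illustration under simple assumptions (Euclidean distance, no normalisation, a hand-picked shapelet and threshold, toy data), not the random shapelet forest algorithm presented in the talk, which additionally randomizes the choice of instances and shapelets.

```python
import math


def shapelet_distance(series, shapelet):
    """Smallest Euclidean distance between the shapelet and any equally long window."""
    m = len(shapelet)
    if m == 0 or m > len(series):
        raise ValueError("shapelet must be non-empty and no longer than the series")
    best = math.inf
    for start in range(len(series) - m + 1):
        window = series[start:start + m]
        dist = math.sqrt(sum((w - s) ** 2 for w, s in zip(window, shapelet)))
        best = min(best, dist)
    return best


def split_by_shapelet(dataset, shapelet, threshold):
    """One tree-node test: route each (series, label) pair by its shapelet distance."""
    left = [(x, y) for x, y in dataset if shapelet_distance(x, shapelet) <= threshold]
    right = [(x, y) for x, y in dataset if shapelet_distance(x, shapelet) > threshold]
    return left, right


# Toy data (hypothetical): class "spike" contains a sharp peak, class "flat" does not.
data = [
    ([0, 0, 1, 5, 1, 0, 0], "spike"),
    ([0, 1, 5, 1, 0, 0, 0], "spike"),
    ([0, 0, 1, 1, 1, 0, 0], "flat"),
    ([1, 1, 1, 1, 1, 1, 1], "flat"),
]
left, right = split_by_shapelet(data, shapelet=[1, 5, 1], threshold=1.0)
```

The sliding-window search is also where the cost mentioned in the abstract comes from: every candidate shapelet has to be compared against every window of every series, which is what makes exhaustive shapelet enumeration expensive.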

Dec 15, 2016 (1:00 pm s.t. in room G29-301)
Handling Time-Series Data with Visual Analytics: Challenges and Examples
Dr. Theresia Gschwandtner (TU Wien)

Due to the ever-growing amounts of available data, we need effective ways to make these often complex and heterogeneous data accessible and analyzable. The aim of Visual Analytics (VA) is to support this information discovery process by combining humans’ outstanding capabilities of visual perception with the computational power of computers. By combining interactive visualizations for the visual exploration of trends, patterns and relationships with automatic methods such as machine learning and data mining, VA enables knowledge discovery in large and complex bodies of data. The design of such VA solutions, however, requires careful consideration of the data and tasks at hand, as well as the knowledge and capabilities of the user who is going to work with the solution. Dealing with time-oriented data makes this task even more complex, as time is an exceptional data dimension with special characteristics. In my talk, I will illustrate different aspects and characteristics of time-oriented data and how we tackled these problems in previous work with respect to data, users, and tasks. I will give several examples of VA solutions developed in our group and I will put a particular focus on examples from the healthcare domain.

Jul 7, 2016 (3:00 pm s.t. in room G29-301)
Active Learning for Classification Problems Using Structure Information
Dr. rer. nat. Tobias Reitmaier (Universität Kassel)

Nowadays media, commercial and personal content is increasingly consumed, exchanged and therefore stored in the digital world. IT companies increasingly try to exploit this data commercially by means of data mining or machine learning methods, which usually involves a time-consuming and costly categorization or classification of the data. An efficient approach to reducing these costs is active learning (AL), since AL steers the training process of a classifier by selectively querying individual data points, which are then labelled with a class by experts. However, an analysis of current methods shows that AL still has deficits. In particular, structure information, which is given by the spatial arrangement of the (un)labelled data, is insufficiently used. In addition, many existing AL techniques still pay too little attention to their practical applicability. To meet these challenges, this talk presents several solution approaches that build on one another: First, the structure of the data is captured with probabilistic generative models, and the self-adaptive, (almost) parameter-free selection strategy 4DS (Distance-Density-Distribution-Diversity Sampling) is developed, which uses structure information for sample selection. Next, the AL process is extended by a transductive learning process in order to iteratively refine the data model during learning on the basis of the class information that becomes known. Building on this, the new data-dependent kernel RWM (Responsibility Weighted Mahalanobis) is defined for the AL training of a support vector machine (SVM).

Jul 1, 2016 (1:00 pm s.t. in room G29-301)
Together Against Crime: How the Combination of Data Mining and Game Theory Can Help Fight Crime
Prof. Richard Weber (Department of Industrial Engineering, Universidad de Chile)

Data mining methods have been used successfully for many years to detect crime patterns. Applications include fraud detection, the prediction of crime in public spaces, and cybercrime. As a rule, the data analyzed describe the crime itself. In many cases, however, the interaction between criminals and those responsible for security is not adequately taken into account. In this talk we present a hybrid model for the classification of crime patterns that models this interaction explicitly. Using the identification of phishing emails as an example, we describe a game between “attacker” and “defender” that provides input information for a binary classifier based on support vector machines. Using an extensive data set, we show the advantages offered by the hybrid model described. Numerous starting points for further work indicate the potential for future applied research.

May 25, 2016 (2:15 pm in room 301)
Metro Maps: Straight-line, Curved, and Concentric
Prof. Dr. Alexander Wolff (Universität Würzburg)

The first schematic metro maps appeared in the 1930s when the networks became too big to be readable in a geographically correct layout. Only 70 years later did computer scientists start to investigate ways to automate the drawing of metro maps. In my talk, I will present a few of these approaches.

Apr 14, 2016 (1:00 pm in room 301)
Space, Time, and Visual Analytics
Prof. Natalia Andrienko, Prof. Gennady Andrienko (Fraunhofer IAIS and City University London)

Visual analytics aims to combine the strengths of human and computer data processing. Visualization, whereby humans and computers cooperate through graphics, is the means through which this is achieved. Sophisticated synergies are required for analyzing spatio-temporal data and solving spatio-temporal problems. It is necessary to take into account the specifics of geographic space, time, and spatio-temporal data. While a wide variety of methods and tools are available, it is still hard to find guidelines for considering a data set systematically from multiple perspectives. To fill this gap, we systematically consider the structure of spatio-temporal data and possible transformations, and demonstrate several workflows of comprehensive analysis of different data sets, paying special attention to the investigation of data properties. We shall show several workflows of analysis of real data sets on human mobility, city traffic, animal movement, and football. We finish the talk by outlining directions for future research, including semantic level analysis and big data.

Jan 21, 2016 (1:15 pm in room 301)
Learning Shortest Paths for Text Summarisation
Prof. Dr. Ulf Brefeld (Leuphana Universität Lüneburg)

We cast multi-sentence compression as a structured prediction problem. Related sentences are represented by a word graph such that every path in the graph is considered a (more or less meaningful) summary of the collection. We propose to adapt shortest path algorithms to the data at hand so that the shortest path realises the best possible summary. We report on empirical results and compare our approach to state-of-the-art baselines using word graphs. The proposed technique can be applied to a great variety of objectives that are traditionally solved by dynamic programming. I’ll conclude with a short discussion of learning knapsack-like problems using the same framework.
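
The word-graph idea in this abstract can be sketched in a few lines of Python. The example below merges related sentences into a directed word graph and runs Dijkstra's algorithm from a start marker to an end marker. The edge cost of 1 / count is a fixed heuristic assumed here purely for illustration; the talk proposes to learn the costs from data instead. The function names, the `<s>` and `</s>` markers and the example sentences are likewise hypothetical.

```python
import heapq
from collections import defaultdict


def build_word_graph(sentences):
    """Merge sentences into a directed graph; counts record how often b follows a."""
    counts = defaultdict(lambda: defaultdict(int))
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        for a, b in zip(tokens, tokens[1:]):
            counts[a][b] += 1
    return counts


def shortest_path(counts, source="<s>", target="</s>"):
    """Dijkstra with edge cost 1 / count, so frequently shared transitions are cheap."""
    queue = [(0.0, source, [source])]
    done = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == target:
            return path[1:-1]  # drop the start and end markers
        if node in done:
            continue
        done.add(node)
        for nxt, n in counts[node].items():
            if nxt not in done:
                heapq.heappush(queue, (cost + 1.0 / n, nxt, path + [nxt]))
    return []


sentences = [
    "the colloquium takes place on thursday",
    "the colloquium takes place in room 301",
    "the weekly colloquium takes place on thursday",
]
summary = " ".join(shortest_path(build_word_graph(sentences)))
```

With these three sentences the cheapest path follows the transitions that most sentences share, giving "the colloquium takes place on thursday". Replacing the fixed 1 / count cost with learned edge weights is the step the abstract refers to as adapting the shortest path algorithm to the data.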

Dec 10, 2015 (1:00 pm in room G26.1-010)
Trajectories Through the Disease Process: Cross Sectional and Longitudinal Data Analysis
Dr. Allan Tucker (Brunel University London)

Degenerative diseases such as cancer, Parkinson’s disease, and glaucoma are characterised by a continuing deterioration of organs or tissues over time. This monotonic increase in severity of symptoms is not always straightforward, however. The rate can vary in a single patient during the course of their disease, so that sometimes rapid deterioration is observed and at other times the symptoms of the sufferer may stabilise (or even improve, for example when medication is used). A characteristic of many degenerative diseases is, however, a general transition from healthy to early onset to advanced stages. Clinical trials are typically conducted over a population within a defined time period in order to illuminate certain characteristics of a health issue or disease process. These cross-sectional studies provide a snapshot of these disease processes over a large number of people but do not allow us to model the temporal nature of disease, which is essential for modelling detailed prognostic predictions. Longitudinal studies, on the other hand, are used to explore how these processes develop over time in a number of people but can be expensive and time-consuming, and many studies only cover a relatively small window within the disease process. This talk explores the application of intelligent data analysis techniques for building reliable models of disease progression from both cross-sectional and longitudinal studies. The aim is to learn disease 'trajectories' from cross-sectional data, integrating longitudinal data and taking into account the sometimes non-stationary nature of the disease process.

Nov 19, 2015 (1:15 pm in room G29-301)
Knowledge based Tax Fraud Fighting
Prof. Dr. Hans-Joachim Lenz (Freie Universität Berlin)

Tax fraud is a criminal activity committed by a manager of a firm or by at least one taxpayer who intentionally manipulates tax data to deprive the tax authorities or the government of money for his own benefit. Tax fraud is a kind of data fraud, and happens everywhere and at all times in daily life. Data fraud is extensionally characterized by four fields: spy-out, data plagiarism, manipulation and fabrication. Tax fraud investigations can be embedded into the methodology of knowledge-based reasoning. One way is to apply case-based reasoning, where similar stored cases are retrieved and their information re-used. Alternatively, we put the focus on Bayesian learning theory as a stepwise procedure integrating prior information, facts from first (and follow-up) investigations, and partial or background information. There is and will be no omnibus test available to detect the underlying manipulations of (even double-entry) bookkeeping data in business with high precision. However, a bundle of techniques such as probability distribution analysis, the application of Benford’s Law, inlier and outlier detection, and tests of conformity between data and business key indicator systems exists to give hints of tax fraud. Finally, investigators may be hopeful in the long run because fraudsters will never be able to construct a perfectly manipulated world of figures, cf. F. Wehrheim (2011).
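
Of the techniques listed in this abstract, Benford's Law is the easiest to illustrate. It states that in many naturally occurring collections of numbers the leading digit d occurs with probability log10(1 + 1/d). The Python sketch below compares observed leading-digit frequencies with these expected values. The deviation measure (a sum of absolute differences) and the function names are illustrative assumptions, not the procedure used by tax investigators, and a large deviation is only a hint that calls for closer inspection, not proof of manipulation.

```python
import math
from collections import Counter
from decimal import Decimal


def benford_expected(digit):
    """Benford's Law: probability that the leading digit equals `digit` (1 to 9)."""
    return math.log10(1 + 1 / digit)


def leading_digit(value):
    """First significant digit of a non-zero number."""
    return Decimal(str(abs(value))).as_tuple().digits[0]


def benford_deviation(values):
    """Sum of absolute gaps between observed and expected leading-digit frequencies."""
    nonzero = [v for v in values if v != 0]
    if not nonzero:
        raise ValueError("need at least one non-zero value")
    observed = Counter(leading_digit(v) for v in nonzero)
    return sum(
        abs(observed.get(d, 0) / len(nonzero) - benford_expected(d))
        for d in range(1, 10)
    )


# Powers of two follow Benford's Law closely; evenly spread figures do not.
powers = [2 ** k for k in range(100)]
uniform = list(range(100, 1000))
powers_dev = benford_deviation(powers)
uniform_dev = benford_deviation(uniform)
```

In this toy comparison `powers_dev` is much smaller than `uniform_dev`, because evenly spread three-digit figures use every leading digit equally often while Benford's Law expects about 30% of leading digits to be 1.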

Nov 12, 2015 (1:15 pm in room G29-301)
Ethical challenge for a mobility service provider when dealing with customer data in the digital age
Karl Partle (Head of the Institute for Computer Science of the Volkswagen Group)

As a branch of philosophy, information ethics is a field of ethics that examines moral issues in dealing with digitally available information in information and communication technologies. Challenges arise from the combination of ideas, some of which are more than 5000 years old, with technologies that are less than 50 years old. Four theses:

Ethics as a philosophical discipline is not being changed; rather, ethical guidelines are to be reinterpreted in the course of digitalization.
All data tends to be available at any time and any place in a defined quality in real time (up-to-date).
The distinction between the real world (being) and the digital world (fictitious image of the real) is becoming increasingly blurred.
The secrecy of one's own data becomes arbitrarily difficult; however, this also applies to states, secret services and companies.

In view of this, questions arise about social responsibility and the use of consciously drawn moral barriers as a benchmark and measure of ethical foundations in the digital age. This is to be examined using the example of the challenges faced by a mobility service provider when dealing with customer data.