As you probably know, I’m a big fan of Staticman’s approach to enable dynamic content on static web sites. When I introduced comments on this blog, things quickly got out of hand: Each day, I would receive roughly five comments that were posted by bots.
Spam comments as pull requests in GitHub
Manually approving each post quickly became a nuisance, which is why I deactivated Staticman again after some time.

Learn about the greatest differences between a data science role in academia and a software engineering role in industry. How to prepare for the transition?

Having recently transitioned from academia to industry, I'd like to share what I found to be the greatest differences between working in the two. Since this article is based on my personal experiences, I would first like to introduce my respective roles in research and in industry. After that, I will summarize the main differences between industry and academia. Finally, I offer some advice on how to prepare for an industry job when transitioning from academia.

Having obtained both a Bachelor’s and a Master’s degree in bioinformatics, I would like to describe how I experienced studying bioinformatics. Moreover, I would like to discuss whether it was worth studying in the first place, and, finally, to offer some advice to prospective students and graduates.
What is Bioinformatics?
Bioinformatics is an interdisciplinary field that is concerned with developing methods from computer science and applying them to biological problems.

During my time as a PhD student, I developed software in an academic setting. At that time, I was already under the impression that my work would probably not meet industry standards. Having recently transitioned to an industry job, I quickly realized how coding in academia differs from coding in industry. This post summarizes the main differences between the two fields and extrapolates what coders in academia can learn from industry.

Forecasting is concerned with making predictions about future observations by relying on past measurements. In this article, I will introduce how ARMA, ARIMA (Box-Jenkins), SARIMA, and ARIMAX models can be used for forecasting time-series data.
Preliminaries
Before we can talk about models for time-series data, we have to introduce two concepts.
The backshift operator
Given the time series \(y = \{y_1, y_2, \ldots \}\), the backshift operator (also called lag operator) is defined as

In supervised learning, we are often concerned with prediction. However, there is also the concept of forecasting. Here, I will discuss the differences between the two concepts so that we can answer the question of why weather forecasting is not called weather prediction.
Prediction and forecasting
Prediction is concerned with estimating the outcomes for unseen data. For this purpose, you fit a model to a training data set, which results in an estimator \(\hat{f}(x)\) that can make predictions for new samples \(x\).

Receiver operating characteristic (ROC) curves are probably the most commonly used measure for evaluating the predictive performance of scoring classifiers.
The confusion matrix of a classifier that predicts a positive class (+1) and a negative class (-1) has the following structure:
Prediction/Reference Class    +1    -1
+1                            TP    FP
-1                            FN    TN

Here, TP indicates the number of true positives (model predicts positive class correctly), FP indicates the number of false positives (model incorrectly predicts positive class), FN indicates the number of false negatives (model incorrectly predicts negative class), and TN indicates the number of true negatives (model correctly predicts negative class).
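As a rough illustration of how the four entries of the confusion matrix can be counted, here is a minimal sketch. The function name `confusion_counts` and the toy label vectors are assumptions made for this example only; labels are assumed to be encoded as +1 and -1, as in the table.

```python
def confusion_counts(predicted, reference):
    """Count TP, FP, FN, and TN for labels encoded as +1 / -1."""
    tp = fp = fn = tn = 0
    for pred, ref in zip(predicted, reference):
        if pred == 1 and ref == 1:
            tp += 1  # positive class predicted correctly
        elif pred == 1 and ref == -1:
            fp += 1  # positive class predicted incorrectly
        elif pred == -1 and ref == 1:
            fn += 1  # negative class predicted incorrectly
        else:
            tn += 1  # negative class predicted correctly
    return {"TP": tp, "FP": fp, "FN": fn, "TN": tn}


# Hypothetical toy data, not taken from the post
reference = [1, 1, 1, -1, -1, -1, -1, 1]
predicted = [1, -1, 1, -1, 1, -1, -1, 1]

counts = confusion_counts(predicted, reference)
print(counts)  # {'TP': 3, 'FP': 1, 'FN': 1, 'TN': 3}
```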

The terms inference and prediction both describe tasks where we learn from data in a supervised manner in order to find a model that describes the relationship between the independent variables and the outcome. Inference and prediction, however, diverge when it comes to the use of the resulting model:
Inference: Use the model to learn about the data generation process.
Prediction: Use the model to predict the outcomes for new data points.
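The distinction can be sketched with a simple linear regression fitted by ordinary least squares. This is only an illustrative sketch: the helper `fit_line` and the toy data are assumptions for this example. The same fitted model is used in two ways, once by inspecting its coefficients (inference) and once by evaluating it on a new data point (prediction).

```python
def fit_line(xs, ys):
    """Fit y = intercept + slope * x by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical toy data, not taken from the post
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

intercept, slope = fit_line(xs, ys)

# Inference: interpret the fitted parameters to learn about the
# relationship between x and y (here: y grows by about 2 per unit of x).
print(f"estimated slope: {slope:.2f}, intercept: {intercept:.2f}")

# Prediction: use the fitted model to estimate the outcome for a new x.
new_x = 6
prediction = intercept + slope * new_x
print(f"predicted outcome for x = {new_x}: {prediction:.2f}")
```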

For classification problems, classifier performance is typically defined according to the confusion matrix associated with the classifier. Based on the entries of the matrix, it is possible to compute sensitivity (recall), specificity, and precision. For a single cutoff, these quantities lead to balanced accuracy (sensitivity and specificity) or to the F1-score (recall and precision). To evaluate a scoring classifier at multiple cutoffs, these quantities can be used to determine the area under the ROC curve (AUC) or the area under the precision-recall curve (AUCPR).
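The single-cutoff quantities mentioned above follow directly from the confusion matrix entries. The following is a minimal sketch using the standard definitions; the function name `classification_metrics` and the example counts are assumptions for illustration, and the sketch assumes that none of the denominators is zero.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute single-cutoff metrics from confusion matrix entries.

    Assumes every denominator is non-zero.
    """
    sensitivity = tp / (tp + fn)  # recall: share of positives that are found
    specificity = tn / (tn + fp)  # share of negatives that are found
    precision = tp / (tp + fp)    # share of positive predictions that are correct
    balanced_accuracy = (sensitivity + specificity) / 2
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "balanced_accuracy": balanced_accuracy,
        "f1": f1,
    }


# Hypothetical counts, not taken from the post
metrics = classification_metrics(tp=8, fp=2, fn=4, tn=6)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Balanced accuracy averages sensitivity and specificity, whereas the F1-score is the harmonic mean of precision and recall, so the latter does not take true negatives into account.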

datascienceblog.net has now existed for one month, with the first post dating back to the 16th of October, 2018. I would like to use this opportunity to reflect on how the blog has developed since its inception.
Content
I am quite happy with the amount of content I was able to produce over the last couple of weeks. Especially when starting a blog, high-quality content is the most important criterion for developing a user base.