Data Scientist for Biology and Health Care, Cambridge (UK)

Getting Started

This article presents a short introduction to Markov Chains and Hidden Markov Models, with an emphasis on their application to bio-sequences. Markov models are probably not the most common machine learning methodology right now, but I believe they still represent an important stepping stone in the path of any data scientist.

Introduction to Markov Chains

Markov Chains (MC) and Hidden Markov Models (HMM) are powerful statistical models that can be applied in a variety of fields, such as protein homology detection, speech recognition, language processing, telecommunications, and tracking animal behaviour.

HMM has been widely used in bioinformatics since its inception. It…

Hands-on Tutorials

Applying machine learning to biological sequences is not an easy or straightforward task. Whether you are a Data Scientist working in biology or a Bioinformatician who wants to create predictive models, you have to understand a functional way to deal with biological sequences.

In this article, I do not want to discuss machine learning models, but to address a pre-processing system that I successfully used during my PhD to convert large databases into elements suitable for model training.

How to turn strings into numerical vectors

The analysis of sequences, be they DNA, RNA or proteins, is often a comparative study, in the sense…

Luigi is a Python (2.7, 3.6, 3.7 tested) package that helps you build complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization, handling failures, command line integration, and much more.

  • Intro
  • Pros and Cons
  • Pipeline Structure:
    . Tasks
    . Targets
    . Parameters
  • Building Blocks
  • Execute it
Image for post
Photo by Mike Benna on Unsplash


My experience with Luigi started a few years ago, when the sole person in charge of the company pipeline left, and I got “gifted” with a massive legacy codebase.

At the time, I had no knowledge about pipelines in general, or Luigi in particular, so I started looking for tutorials and…


Probability calibration is the process of calibrating an ML model to return the true likelihood of an event. This is necessary when we need the probability of the event in question rather than its classification.

Imagine that you have two models to predict rainy days, Model A and Model B. Both models have an accuracy of 0.8: for every 10 rainy days, each mislabels two.

But if we look at the probability attached to each prediction, we can see that Model A reports a probability of 80%, while Model B reports 100%.

This means that model B…
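The rainy-day comparison can be put into numbers with a small sketch. It assumes that each model attaches the same confidence to all ten predictions and that eight of them are correct, as in the example above. The helper names `accuracy`, `calibration_gap` and `brier_score` are hypothetical, chosen for illustration only.

```python
# Ten rainy days: 1 = the model labelled the day correctly, 0 = it did not.
# Both models get eight out of ten right (accuracy 0.8), as in the example.
outcomes = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]


def accuracy(outcomes):
    """Fraction of predictions that were correct."""
    return sum(outcomes) / len(outcomes)


def calibration_gap(confidence, outcomes):
    """Absolute difference between reported confidence and observed accuracy."""
    return abs(confidence - accuracy(outcomes))


def brier_score(confidence, outcomes):
    """Mean squared difference between the confidence and each 0/1 outcome."""
    return sum((confidence - o) ** 2 for o in outcomes) / len(outcomes)


# Assumption: a constant confidence per model (80% for A, 100% for B).
model_a_confidence = 0.8
model_b_confidence = 1.0

gap_a = calibration_gap(model_a_confidence, outcomes)
gap_b = calibration_gap(model_b_confidence, outcomes)
brier_a = brier_score(model_a_confidence, outcomes)
brier_b = brier_score(model_b_confidence, outcomes)

print(f"Model A: gap={gap_a:.2f}, Brier={brier_a:.2f}")
print(f"Model B: gap={gap_b:.2f}, Brier={brier_b:.2f}")
```

Under these assumptions the two models share the same accuracy, yet Model A's reported confidence matches how often it is right, while Model B's confidence sits about 0.2 above it. The Brier score (lower is better) reflects the same difference.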

In my job as a data scientist, I once needed to add detailed weather records to my project. I wanted things like temperature, humidity, rainfall, etc., given the space-time coordinates (time and GPS location).

I thought that finding an API that could provide this type of information was going to be easy. I didn’t know that weather data are among the most jealously guarded types of data.

If you search for “free weather API”, you will see plenty of similar websites with different services that are not actually free or have no historical weather records.

I really…


You know when you have coded your biggest project and, every time it runs, you can barely figure out what it is doing, relying only on a series of print statements and strategically saved files?

Well, if that is the case, you ought to learn logging and step up your game.

With a proper logging system, you will have a consistent, ordered and more reliable way to understand your own code, to time and track its progress and to capture bugs easily.

Let’s break down the advantages of logging:

  1. Formatting: Logging allows you to standardize every message…
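As a minimal sketch of the formatting point, the snippet below uses Python's standard `logging` module to give every message the same layout (timestamp, level, logger name, text). The logger name `pipeline`, the helper `build_logger` and the sample messages are hypothetical, chosen only for illustration. The output is written to an in-memory buffer so that it can be inspected afterwards.

```python
import io
import logging


def build_logger(name, stream):
    """Return a logger that writes uniformly formatted messages to `stream`."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)   # DEBUG messages are filtered out
    logger.handlers.clear()         # avoid duplicate handlers on re-runs
    logger.propagate = False        # keep messages out of the root logger

    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("%(asctime)s | %(levelname)s | %(name)s | %(message)s")
    )
    logger.addHandler(handler)
    return logger


buffer = io.StringIO()
log = build_logger("pipeline", buffer)

log.info("loaded %d records", 120)
log.debug("this message is below the INFO threshold")
log.warning("missing values in column %s", "rainfall")

lines = buffer.getvalue().splitlines()
for line in lines:
    print(line)
```

Passing `sys.stderr` or an open file instead of the buffer sends the same formatted lines to the console or to disk. The timestamp in `%(asctime)s` also gives a simple way to time each step.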

If you have spent more than five seconds on r/dataisbeautiful/, you have probably encountered a Sankey plot. Everyone uses it to track their expenses, job searches and every other multi-step process.
Indeed, it is well suited to visualizing the progression of events and their outcomes. And in my opinion, Sankey plots look great!

So, let’s see how to make one in Python:

In matplotlib

Personally, in matplotlib they look awful:

Image for post

The above plot is probably closer to the original concept of a Sankey plot (invented in 1898), but it is not something I would use in a publication.

The other solution is…

Image for post
(Image by author)

If you have to create a report, LaTeX is definitely the choice to make: everything looks better in LaTeX!

Chances are that you used it to write your thesis or some essays back in your academic years. Indeed, environments like texstudio or the more recent Overleaf are great for single projects. But now, in your day job, can they really scale up and keep LaTeX useful?

What you need is a method that generates reports automatically (perhaps right at the end of your pipeline) and that changes their content dynamically according to the findings of your code…
