6 things FELD M learned at useR!2019 in Toulouse

on 23.07.2019 by Linda Le

Hi, I'm Linda. I am part of the Data Science team at FELD M and was excited to participate in this year's useR!2019 conference, which took place in Toulouse.

That meant 4 days full of great content:

  • 3h tutorials
  • keynotes
  • 30 min blocks of 6*5 min lightning talks
  • 1.5h blocks of 5*18 min talks
  • sponsor talks
  • poster session
  • social events, all running on up to 6 parallel tracks!

The complete list of talks, including slides, can be found at http://www.user2019.fr/talk_schedule/, and the video recordings of the keynotes at https://www.youtube.com/channel/UC_R5smHVXRYGhZYDJsnXTwg/videos. The videos of the other talks will be online soon, too.

Let me share the conference's input as I guide you through a typical project's timeline. I took advantage of a nice machine learning workflow hexagon diagram and added a sixth hexagon for the 'Communication' of projects.

Let's go through the 2nd, 3rd and 6th hexagons to give some examples of what I took away from useR! and where we are now taking some deep dives to improve our workflow.

 

  • {tidyr} by the famous Hadley Wickham (a must-read for everyone advancing in R is the recent 2nd edition of his book “Advanced R”: https://adv-r.hadley.nz/index.html) has been updated. In the area of web analytics we at FELD M receive raw data in which all touchpoints of all visitors/customers are recorded in rows. In order to analyse customer journeys, we need to reshape this data so that we have the customers in rows and all touchpoints per customer, i.e. the customer journey, in further columns. Reshaping data from long format to wide format is therefore a regularly used transformation in Data Science projects. The current functions for reshaping data are spread() and gather(), whose logic many R users struggle with. So, Hadley Wickham showed us the work-in-progress functions pivot_longer() and pivot_wider(), which offer more intuitive function and argument names for reshaping data. https://tidyr.tidyverse.org/
  • When working with large data sets we usually use either data.table or SparkR (which we currently prefer over sparklyr because its syntax is more similar to PySpark, making it easier to switch between Python and R). Both rely on RAM for their performance. Since our datasets often no longer fit into RAM but are still below real big data (where calculations can no longer be handled by a single machine), the newly developed package {disk.frame} (https://rpubs.com/xiaodai/intro-disk-frame) offers an interesting possibility for storing and processing medium-sized datasets. Larger-than-RAM data is split up and stored in chunks on the hard drive, and {disk.frame} provides an API for manipulating these chunks. Unlike Spark, {disk.frame} does not require a cluster and can use any function in R.
  • Before we build a model, we first analyse the data on a descriptive level to decide which assumptions to make for the model. Visualizing high-dimensional data can be a cumbersome task. In a tutorial, Di Cook showed us her packages such as {tourr} (https://github.com/ggobi/tourr), which visualizes higher-dimensional (>3) data in an animated rotation. You can take a variable, rotate it out of the projection and see whether the structure persists or disappears. The package {nullabor} (https://github.com/dicook/nullabor) is a tool for graphical inference: your data plot is displayed among several random null plots (plots representing your null hypothesis). If your plot stands out visibly from the null plots, the structure it shows is probably statistically significant.
  • Due to the individual advantages of Python and R, Data/Software Engineering at FELD M is mainly done in Python, while the analysis work (building models, statistical tests) of the Data Science team is more focused on R. Our Data/Software Engineering and Data Science teams already work closely together on Advanced Analytics projects to take advantage of both kinds of expertise and both languages. Of course, our general goal is to build our (data) products in one programming language. Nevertheless, we sometimes build prototypes that have to live in both worlds and require both languages. The {reticulate} package (https://rstudio.github.io/reticulate/) makes it possible to call Python from R. Rounded off by RStudio's support for knitting R Markdown documents with Python chunks, it becomes easier to bridge language silos.
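To make the {tidyr} point concrete, here is a minimal sketch of the new pivot functions on a toy touchpoint dataset (the column names and data are made up purely for illustration):

```r
library(dplyr)
library(tidyr)  # needs the version that provides pivot_wider()/pivot_longer()

# Toy raw data: one row per touchpoint, as in our web-analytics exports
touchpoints <- tibble(
  customer = c("A", "A", "B", "B", "B"),
  step     = c(1, 2, 1, 2, 3),
  channel  = c("search", "direct", "display", "search", "direct")
)

# Long -> wide: one row per customer, one column per journey step
journeys <- touchpoints %>%
  pivot_wider(names_from = step, values_from = channel, names_prefix = "step_")
# customer A: search, direct, NA; customer B: display, search, direct

# Wide -> long again with pivot_longer(), dropping the padding NAs
long_again <- journeys %>%
  pivot_longer(starts_with("step_"), names_to = "step",
               values_to = "channel", values_drop_na = TRUE)
```

Compared to spread()/gather(), the direction of the reshape is explicit in the function name, and names_from/values_from read more naturally than key/value.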
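And a small {reticulate} sketch of calling Python from an R session. This assumes a Python installation with numpy that reticulate can discover; the choice of module is just an example:

```r
library(reticulate)

# Import a Python module; with convert = TRUE results come back as R objects
np <- import("numpy", convert = TRUE)
x  <- np$arange(1, 7)   # numpy array -> R numeric vector 1..6
mean(x)                 # 3.5

# Run inline Python and read its variables back into R via py$
py_run_string("squares = [i ** 2 for i in range(5)]")
py$squares              # 0, 1, 4, 9, 16
```

Together with Python chunks in R Markdown, this lets a prototype mix R analysis code with existing Python tooling in one document.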

 

  • When it comes to building a model, it is always important to know the causal relationships between variables – as we all know, “correlation != causation”. Under the assumption that causal relationships leave a structure in the data, there are many procedures that try to detect such causation. causalDisco summarizes the causal discovery procedures available in R and, once you specify the properties of your data, filters out the appropriate ones: http://biostatistics.dk/causaldisco/.

 

All in all, the success of a project depends not only on methods such as those mentioned above, but also on the environment you create in your company. Julia Stewart Lowndes showed us in her keynote (https://www.youtube.com/watch?v=Z8PqwFPqn6Y&t=2806s) how she and her team work by embracing open data science, openness and the power of welcome.

FELD M is now looking forward to taking some deep dives into the learnings listed above and putting them into practice to improve our workflow and smoothen the journey for our customers.

If you are interested in our work, come and check out our portfolio: https://www.feld-m.de/service/data-strategy-advanced-analytics/.

Or, if you are an NGO/NPO, come and check out our contribution to Data Science for good with our “Data Ambulance”: https://www.feld-m.de/datenambulanz/