Back from Dataxday 2018 | Toucan Toco

This is my personal feedback on my favorite talks of Dataxday 2018. It is highly subjective but, fortunately, you can form your own opinion by watching the recordings on their YouTube channel: Xebia TV.

Machine learning models at scale with Amazon SageMaker - AWS (FR)

Interesting talk: the speaker was knowledgeable and pleasant to listen to, even though the beginning leaned a little toward a product showcase. He gave a tour of what’s possible with Amazon Web Services and specifically Amazon SageMaker. He also showed us various possible workflows, from an IPython notebook to direct API/SDK calls, to generate a model and use it afterward, all using Amazon of course ;). If you have a need or a business case for leveraging AWS GPU instances to train your machine learning models, this is worth a look.

Direct link to this talk on Youtube

Tensors in the sky with CloudML - XEBIA (FR)

A really eloquent speaker with a subject that is simple for anyone familiar with infrastructure as a service. He started by introducing what the cloud is and how a neural network works. For the rest of the talk, he compared a local deployment against Google Cloud Platform and showed how to leverage its cloudiness for A/B testing, rollbacks, scaling etc. This presentation was smooth and overall focused on showing us what is possible with those tools.

Direct link to this talk on Youtube

Data lineage: visualize the data life cycle - ZEENEA (FR)

I was not sure if the talk was right for me, but I was very quickly and pleasantly surprised! The speaker was good and down to earth (i.e. he did not rely heavily on buzzwords); basically, he knew what he was talking about. He explained what his startup is doing with regard to tracing the life cycle of your data. To give concrete examples, he showed us the techniques used to retrace the various operations done to the data, such as parsing SQL queries or using Spline for Apache Spark. You can then explore the graph of transformations and retrace exactly where a piece of information came from. This is clearly something I would have been happy to use a few months ago!
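To make the idea of lineage from SQL parsing more concrete, here is a minimal sketch in Python. It is not the speaker's implementation: the queries and table names are invented, and a pair of regular expressions stands in for a real SQL parser, so it only handles simple `INSERT INTO ... SELECT` statements.

```python
import re
from collections import defaultdict

# Hypothetical queries; the table names are invented for illustration.
QUERIES = [
    "INSERT INTO clean_bookings SELECT * FROM raw_bookings",
    "INSERT INTO revenue SELECT b.id, p.amount "
    "FROM clean_bookings b JOIN payments p ON p.booking_id = b.id",
]

# Naive patterns: a real tool would use a proper SQL parser.
TARGET_RE = re.compile(r"\bINSERT\s+INTO\s+(\w+)", re.IGNORECASE)
SOURCE_RE = re.compile(r"\b(?:FROM|JOIN)\s+(\w+)", re.IGNORECASE)


def build_lineage(queries):
    """Map each written table to the set of tables it reads from."""
    lineage = defaultdict(set)
    for query in queries:
        target = TARGET_RE.search(query)
        if target is None:
            continue  # not a write, so it creates no lineage edge
        lineage[target.group(1)].update(SOURCE_RE.findall(query))
    return dict(lineage)


def upstream(lineage, table):
    """Return every table that directly or indirectly feeds `table`."""
    seen = set()
    stack = [table]
    while stack:
        for parent in lineage.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


LINEAGE = build_lineage(QUERIES)
```

Calling `upstream(LINEAGE, "revenue")` walks the graph backwards and answers the "where did this information come from" question for that table.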

Direct link to this talk on Youtube

A data scientist journey to industrialization of machine learning - AIR FRANCE (FR)

Really interesting feedback about a project that grew from a data scientist’s small POC in R into a big production system with Spark and Python. They gave a lot of advice on what to avoid or be careful about. The two subjects that clicked with me:

  • the importance of onboarding developers and ops at an early stage. This onboarding will help tremendously when you want to convert your POC into something industrial: easily shippable, repeatable, robust etc.
  • data people need to be developer-like: write unit tests, validate models, automate processes, avoid executing a Jupyter Notebook by hand every week etc.
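As an illustration of the second point, here is a small sketch of what "developer-like" can mean in practice. Everything in it is hypothetical (the `delay_minutes` feature, the accuracy threshold, the function names); it is not taken from the talk, only a plausible shape for unit tests and a model validation gate.

```python
def delay_minutes(scheduled, actual):
    """Hypothetical feature: delay in minutes, never negative."""
    return max(0, actual - scheduled)


def accuracy(predictions, labels):
    """Share of predictions that match the labels."""
    if len(predictions) != len(labels) or not labels:
        raise ValueError("predictions and labels must be non-empty and aligned")
    hits = sum(p == y for p, y in zip(predictions, labels))
    return hits / len(labels)


def validate_model(predictions, labels, threshold=0.75):
    """Release gate: accept the model only if accuracy reaches the threshold."""
    return accuracy(predictions, labels) >= threshold


def test_delay_minutes():
    assert delay_minutes(600, 615) == 15
    assert delay_minutes(600, 590) == 0  # early arrival is not a negative delay


def test_validate_model():
    assert validate_model([1, 0, 1, 1], [1, 0, 1, 0])      # 3 of 4 correct
    assert not validate_model([1, 1, 1, 1], [1, 0, 0, 0])  # 1 of 4 correct
```

Functions named `test_*` like these can be collected by a test runner such as pytest and run on every change, instead of relying on someone re-running a notebook by hand.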

This is nice to see because this is exactly what we strive to do with our Data division :).

Direct link to this talk on Youtube

Computer vision : a pragmatic alliance between deep learning and a more “traditional” technique. - XEBIA (FR)

The first part of this talk was the genesis: an insurance company wanted to automate the recognition of accident reports. Unfortunately, they had to bail because the problem is really hard to solve. For example, there is no limit on what people can draw (or on what it means).

Afterward, they decided to retry a similar experiment but with stricter rules: the national identity card. The speakers (one data scientist, one developer) showed how they mixed and matched machine learning with more conventional algorithms. The idea was to leverage the strengths of each approach and not focus too much on the machine learning hype bandwagon.

In the end, this was a nice and pleasant talk. Their solution works and is nothing to be ashamed of, especially when you compare what they built in less than 10 days to commercially available solutions.

Direct link to this talk on Youtube

Building a Real Time Analytics API at Scale - ALGOLIA (EN)

The speaker talked about the challenges Algolia faces with their Search as a Service, where the API needs to answer within milliseconds for big players (e.g. Twitch). The talk was more focused on analytics, but the data volumes are huge: they receive 40B searches per month. They originally hosted their own Elasticsearch cluster. But with more and more events coming in, half a billion searches at the time, ES was not able to keep up. They were also on an old version of ES, and upgrading to the latest version would have meant rewriting every query.

Their new stack is mostly composed of Go, Kubernetes, an API in C++ (for performance reasons) and now Citus as the database. They chose Citus, an extension of PostgreSQL that is also available as a hosted solution. It enables you to scale Postgres horizontally or vertically. The resulting performance is impressive, especially when you remember that, in the end, you are executing an SQL query on PostgreSQL.

Finally, the presentation focused on how everything works under the hood for this pipeline and how to optimize everything from rollups to aggregates.
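For readers unfamiliar with rollups, the idea is to pre-aggregate raw events into time buckets so that queries read a few counters instead of every row. The sketch below is plain Python with invented events and bucket sizes; the pipeline described in the talk runs inside Citus/PostgreSQL, and this is not its actual code.

```python
from collections import Counter

# Hypothetical raw events: (unix_timestamp_seconds, search_query).
EVENTS = [
    (0, "shoes"), (30, "shoes"), (65, "hats"),
    (3599, "shoes"), (3600, "hats"),
]


def rollup(events, bucket_seconds):
    """Count events per (bucket_start, query) instead of keeping raw rows."""
    counts = Counter()
    for timestamp, query in events:
        bucket_start = timestamp - timestamp % bucket_seconds
        counts[(bucket_start, query)] += 1
    return counts


def coarsen(counts, bucket_seconds):
    """Build a coarser rollup from a finer one without rereading raw events."""
    coarse = Counter()
    for (bucket_start, query), n in counts.items():
        coarse[(bucket_start - bucket_start % bucket_seconds, query)] += n
    return coarse


PER_MINUTE = rollup(EVENTS, 60)
PER_HOUR = coarsen(PER_MINUTE, 3600)
```

Because counts are additive, the hourly rollup derived from the per-minute one matches what you would get by aggregating the raw events directly, which is what makes stacking rollups cheap.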

This was one of my favorite talks of the day! :)

Direct link to this talk on Youtube

Real-Time Access log analysis - BLABLACAR (EN)

The talk began with a presentation of Blablacar and their worldwide presence. Shortly after, the speaker introduced the subject of log analysis, with use cases ranging from security and product simplification to API usage.

The stack is basically: Nginx with Hindsight/Lua, Kafka, Flink, a Schema Registry and a data lake. Everything is automated; no human is harmed during this execution. The schema is also updated automatically by Flink when it detects a new field (Flink is used as a glorified distributed regex engine). The schema registry is a key-value store for the various schemas, which are defined in JSON. Finally, Kafka is the piece that enables streams of data from every part of the system.
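The automatic schema update can be pictured with a toy example. This sketch assumes a plain dictionary as the key-value store and an invented `access_log` subject; it does not use the real Flink or Schema Registry APIs and only illustrates the "new field detected, schema extended" step.

```python
import json

# Hypothetical in-memory stand-in for a schema registry: subject -> JSON schema.
REGISTRY = {}


def infer_type(value):
    """Map a Python value to a coarse JSON type name."""
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, (int, float)):
        return "number"
    return "string"


def register_event(subject, event):
    """Add any unseen field of `event` to the stored schema; return new fields."""
    schema = json.loads(REGISTRY.get(subject, '{"fields": {}}'))
    new_fields = [name for name in event if name not in schema["fields"]]
    for name in new_fields:
        schema["fields"][name] = infer_type(event[name])
    if new_fields:
        # Only write back when the schema actually changed.
        REGISTRY[subject] = json.dumps(schema, sort_keys=True)
    return new_fields
```

Each parsed log line would go through `register_event`, so a field that appears for the first time extends the stored schema without anyone editing it by hand.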

The speaker gave an example of a real scenario used by this stack:

  • we detect a log spike (it is twelve o’clock, which is unusual for the platform)
  • we launch a count of user agents on the logs parsed in real time
  • we see that one version is far ahead of the others… after checking the trace with the dev team: bingo, we have found a bug in the latest deployment!
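The scenario above can be sketched in a few lines of Python. The log entries, the user-agent strings and the "three times the median" spike rule are all assumptions made for illustration; the real stack does this on streams with Kafka and Flink.

```python
from collections import Counter

# Hypothetical parsed access-log entries: (minute, user_agent).
LOGS = (
    [(m, "app/4.1") for m in range(10)]
    + [(10, "app/4.2")] * 30
    + [(10, "app/4.1")]
)


def requests_per_minute(logs):
    """Count log entries for each minute."""
    return Counter(minute for minute, _ in logs)


def find_spikes(per_minute, factor=3.0):
    """Flag minutes whose volume exceeds `factor` times the median volume."""
    volumes = sorted(per_minute.values())
    median = volumes[len(volumes) // 2]  # upper median, good enough for a sketch
    return sorted(m for m, n in per_minute.items() if n > factor * median)


def top_user_agents(logs, minute, limit=3):
    """Count user agents seen during the suspicious minute."""
    return Counter(ua for m, ua in logs if m == minute).most_common(limit)


SPIKES = find_spikes(requests_per_minute(LOGS))
```

Here `SPIKES` points at the unusual minute, and `top_user_agents(LOGS, SPIKES[0])` shows which client version dominates it, which is the clue you would then take to the dev team.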

You can also find crawlers or even bad API usage in the same way. I find this workflow really awesome, and I am now motivated to implement something similar for Toucan Toco! :)

Direct link to this talk on Youtube
