StreamTools: A tool for analyzing data streams : AntonioCortes.com

Every day we hear about Big Data, IoT, Smart Data, Machine Learning, semantic data, and so on. Often these terms are used out of context, or simply because they’re “trendy”.

One of the best examples is “Big Data”. We always talk about huge amounts of information, systems, platforms, and queries, but in my view we make the mistake of calling all of that information. It isn’t information: it’s data, raw or processed, and information is what we extract from that data. Often, under the “Big Data” label, we get lost in the storage side alone: huge, replicated, astronomical volumes of data. That isn’t “Big Data”. It’s only one part of it, the most mechanical one, and the one that contributes least to what we’re actually looking for: information. It’s just “data storage and management”, one leg of a much broader table.

As I see it, the mistake is the tendency to focus on the components rather than on the solution that is needed.

I love proofs of concept (those small, or not so small, tools and solutions that tackle an existing or new problem in a different way), and that’s how I discovered a New York Times R&D project: Streamtools.

According to New York Times R&D, Streamtools is based on three predictions for the next 3-5 years:

1) Data will be provided as streams: Because of the sheer volume, data will be collected through “sensors”, and stream-based APIs will prevail over data pulled from databases. To a large extent, data sources will “emit” data. Putting a database between the emitter and the people or machines that process the data will be too expensive at those volumes.

2) The use of streams will change how we draw conclusions about the world: This paradigm shift will change how we think about analysis, modeling, decision making, and visualization. Each new arrival on a stream will instantly affect our view of the world, and may change it.

3) Adaptable tools will enable new ways to give data meaning and extract information: Data analysis will tend toward “abductive reasoning”, in which the researcher explores and observes the data, forms hypotheses, and tries to reason about them.

With this in mind, they created Streamtools, a tool with a graphical interface for defining and managing these streams. It lets you not only work with existing streams but also define new ones from data or sources that appear to be “static”.

For such sources, you can configure where to read from and how often, as well as the cleaning, filtering, and other actions applied to the resulting stream.

Use cases they’ve given as examples are:

Analysis of NYT visits using a Queue to generate automatic daily reports.
An earthquake tracker using USGS real-time data.
A system to see “lost objects” from NY transit systems.
Citibike availability at the station nearest to the NYT offices.

See examples

Beyond the examples, there is also a list of data sources that could be used with the Streamtools paradigm: Data Sources

Streamtools is open source, licensed under Apache 2, and written in Go. As a research project, it’s a platform for exploring new algorithms and analysis methods. It lets you be extremely expressive (clear and visual) when building data analysis prototypes.

