Every day we hear about Big Data, IoT, Smart Data, Machine Learning, semantic data, and so on, often out of context or simply because they're "trendy".
One of the best examples is "Big Data". We always talk about huge amounts of information, systems, platforms, and queries, but, from my point of view, we make the mistake of treating all of that as information. It is not information: it is data, raw or processed; information is what we extract from that data. With the term "Big Data" we often get lost in just the storage side: huge, replicated volumes of data at astronomical scale. That is not "Big Data"; that is only one part of it, the most mechanical one and the one that contributes least to what we are really after, information. It is just "data storage and management", one leg of a much broader table.
I see the error as a tendency to focus more on the components than on the solution actually required.
I love proofs of concept (those small, or not so small, tools/solutions that apply a different approach to something existing or new), and that's how I discovered the New York Times R&D project Streamtools.
Streamtools is based on three predictions New York Times R&D makes for the next 3-5 years:
1) Data will be provided as streams: given the volume of data, it will be obtained through "sensors", and stream-based APIs will prevail over data pulled from databases. To a large extent, data sources will "emit" data; putting a database between the emitter and the people or machines that process the data will be too expensive, given the volumes involved.
2) The use of streams will change how we draw conclusions about the world: this paradigm shift will make us start thinking in stream terms for analysis, modeling, decision making, and visualization. Each new piece of a stream that arrives will instantly affect, and can change, our view of the world.
3) Adaptable tools will suggest new ways to add semantics to data and extract information from it: data analysis will tend toward abductive reasoning, where the researcher explores and observes the data and, through hypotheses, tries to reason about it.
With this in mind, they created Streamtools, a tool with a graphical interface for defining and managing these streams. It lets you not only work with existing streams but even define new streams from data or sources that are apparently "static": you set how often to read and from where, along with the cleaning, filtering, and other actions that manage that stream.
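The idea of turning a "static" source into a stream can be sketched in a few lines of Go (the language Streamtools itself is written in). This is not the Streamtools API — just a minimal, hypothetical illustration of the pattern: sample a source on an interval, emit each reading on a channel, and apply a filtering step before downstream analysis.

```go
package main

import (
	"fmt"
	"time"
)

// pollAsStream turns a "static" source (here, the hypothetical fetch
// function, standing in for e.g. an HTTP GET) into a stream by sampling
// it every `every` and emitting n readings on a channel.
func pollAsStream(fetch func() int, every time.Duration, n int) <-chan int {
	out := make(chan int)
	go func() {
		defer close(out)
		ticker := time.NewTicker(every)
		defer ticker.Stop()
		for i := 0; i < n; i++ {
			<-ticker.C
			out <- fetch()
		}
	}()
	return out
}

// filter keeps only the readings that satisfy pred: the "cleaning"
// step applied to the stream before it reaches the analysis.
func filter(in <-chan int, pred func(int) bool) <-chan int {
	out := make(chan int)
	go func() {
		defer close(out)
		for v := range in {
			if pred(v) {
				out <- v
			}
		}
	}()
	return out
}

func main() {
	reading := 0
	fetch := func() int { reading++; return reading } // fake sensor
	stream := pollAsStream(fetch, 10*time.Millisecond, 6)
	evens := filter(stream, func(v int) bool { return v%2 == 0 })
	for v := range evens {
		fmt.Println(v) // prints 2, 4 and 6, one per line
	}
}
```

Streamtools packages exactly this kind of plumbing behind visual blocks, so the polling interval, the source, and the filters are configured graphically instead of in code.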
Example use cases they have shown include:
- Analysis of NYT site visits, using a queue to generate automatic daily reports.
- An earthquake tracker using USGS real-time data.
- A system to track "lost objects" from NY transit systems.
- Citibike availability at the station nearest to the NYT offices.
Beyond the examples, there is also a list of data sources potentially usable under the Streamtools paradigm: Data Sources
Streamtools is open source, licensed under Apache 2, and written in Go. As a research project, it's a platform for exploring new algorithms and analysis methods, and it allows you to be extremely expressive (clear and visual) when prototyping data analysis.