In the world we live in, where fads and bubbles of every kind set the pace, it is hard not to notice how technological terms are, as Rudyard Kipling put it in his famous poem, "twisted by knaves to make a trap for fools." And "Big Data" is one of the most widely used terms today, often with disparate meanings and often simply used wrongly.
It is true that in data analysis it is difficult to find definitions that last over time and satisfy every professional in the field, but the author of these lines has sat across from technology managers at various companies and heard them use "Big Data" as a synonym for "Business Intelligence" or "Data Mining". It is neither one nor the other. Among people further removed from the world of technology, it is also common to hear the term used to mean "Social Networks", "Internet" or "Apps" (applications, usually for mobile devices), or even "Artificial Intelligence" (even if it is given another name, such as "robotics" or similar).
The interpretation gaining ground every day is that "Big Data" refers to the enormous amount of data generated globally, whether from the Apps we use, from banking transactions, from social networks, or from what we do with our mobile phones: data from which we can extract valuable information about our consumption habits and which should allow us to improve our standard of living. Still, "Big Data" literally means nothing more than "large data" or "massive data". It is a highly subjective term that, strictly speaking, refers only to the size of the data involved in a process, so that when there is "a lot" of it, we are talking about Big Data (in that case we can also say that we are in a "large data" or "massive data" scenario).
And when can we say that we are working with "a lot" of data? When the process (whether data analysis or any other) makes traditional solutions insufficient. By "traditional solution" we mean a single computer with a processor (or several), memory and a hard disk. In other words, when a data set can be analyzed on a single computer, and the analysis finishes within an acceptable time, we should not talk about massive data, however much we like to say that "in my company we do Big Data".
If, on the contrary, this traditional solution is not enough, we find ourselves, as mentioned, in a massive data scenario, and the analysis requires what is called "distributed computing": the use of several computers (sometimes even thousands of them) among which the process is distributed. Naturally, this distribution requires the computers to be coordinated with each other, which in turn requires specialized software (in data analysis, one of the most widely used is Apache Hadoop).
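The idea of distributing a process among several computers and then coordinating their results can be sketched in a few lines of Python. This is only an illustration under loud assumptions: the "workers" are simulated inside a single process, the `purchases` data and the function names (`split_into_chunks`, `count_locally`, `merge_counts`) are hypothetical, and nothing here uses Apache Hadoop or its API. It simply shows the split, work-independently, merge pattern that such software automates.

```python
from collections import Counter
from functools import reduce


def split_into_chunks(records, n_workers):
    """Partition the records into n_workers roughly equal chunks."""
    return [records[i::n_workers] for i in range(n_workers)]


def count_locally(chunk):
    """Work done independently by one (hypothetical) worker machine."""
    return Counter(chunk)


def merge_counts(partials):
    """Coordination step: combine the partial results into a single one."""
    return reduce(lambda a, b: a + b, partials, Counter())


# Hypothetical data set; in a real scenario it would not fit on one machine.
purchases = ["bread", "milk", "bread", "eggs", "milk", "bread"]

chunks = split_into_chunks(purchases, n_workers=3)
# In a real cluster each chunk would be processed on a different computer;
# here the workers are simulated one after another in the same process.
partials = [count_locally(chunk) for chunk in chunks]
distributed = merge_counts(partials)

# The "traditional" single-computer analysis of the same data.
single_machine = Counter(purchases)

print(distributed == single_machine)
```

Both routes produce the same counts; the only thing that changes is how the work is organized.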
On the other hand, the subjectivity of the term "Big Data" means that a given data set may or may not be considered "large" depending on the technology available: what we now consider Big Data will probably not be so in a few years. Put another way, if we had a computer with infinite computing capacity, Big Data would not exist and every analysis scenario would be considered traditional.
Although the term Big Data is, as mentioned, very fashionable, it is important to note that, if we accept the previous definition, whether or not we are working in a Big Data scenario is (almost) independent of the type of process we are executing: the only difference is the technology we rely on to execute it. It is true that some machine learning algorithms used in data mining are difficult to distribute among several computers (in some cases it is simply impossible), but they are exceptions that should not divert attention from what really matters: distributed computing is no more than a resource we turn to in order to satisfy a need for computing capacity imposed by greater-than-usual demands, and this is, from a conceptual point of view, irrelevant to the analysis.
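Why some computations distribute easily and others do not can be seen with a deliberately simple stand-in. The example below uses the mean and the median, not one of the machine learning algorithms alluded to above, and the data, the two-way partition and the helper names (`partial_sum_count`, `combine_mean`) are assumptions made up for illustration. The mean can be rebuilt exactly from small per-chunk summaries, whereas naively combining per-chunk medians gives a different answer from the median of the full data set.

```python
from statistics import mean, median


def partial_sum_count(chunk):
    """Summary a (hypothetical) worker sends back: the sum and the count."""
    return sum(chunk), len(chunk)


def combine_mean(partials):
    """Rebuild the global mean from the per-chunk (sum, count) summaries."""
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count


# Hypothetical data, split between two simulated workers.
values = [1, 2, 3, 4, 100, 200, 300]
chunks = [[1, 2, 3, 4], [100, 200, 300]]

# The mean distributes cleanly: partial results combine into the exact answer.
distributed_mean = combine_mean([partial_sum_count(c) for c in chunks])

# The median does not: the median of the per-chunk medians is not, in general,
# the median of all the values.
median_of_medians = median([median(c) for c in chunks])

print(distributed_mean, mean(values))
print(median_of_medians, median(values))
```

Even in the second case the question being asked of the data is the same; what changes is how much extra machinery is needed to answer it across several computers.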