Learning Data-Driven Decisions for Managers in New Style Companies with Professor Dutta from USA Part 7.3

The development of information technology today is very helpful in the company's business. However, if we do not understand the type of technology needed, we will choose the wrong technology. Especially in the field of decision making for companies, there is one information technology product that is very helpful, namely a decision support system.

STEKOM university's efforts to have a global reach include holding webinars on an international scale. On this occasion we will discuss an international webinar held by STEKOM University in which one of the speakers is a professor from the United States. The resource person is Kaushik Dutta who is a Professor and School Director at the University of South Florida. Professor Dutta in his presentation delivered material on decision support systems which are IT products that are very useful in corporate business.

The material presented by Professor Dutta includes Framework, Applications for Business, Techniques, and Infrastructure. Because the material presented is quite long, the news article that discusses Professor Dutta's presentation is divided into several parts. We are currently entering part 7.3. If the reader wants to know the previous presentation, please see the previous sections in the title of the same article.

Continuing from the previous section, it is still about the material presented by Professor Dutta. This article is the last article to discuss the material presented by Professor Dutta. At the end of his presentation, Professor Dutta discussed a summary of big data systems.

Big data is data that has a very large volume or size consisting of structured (structured), semi-structured (semi-structured), and unstructured (unstructured) data that can grow over time.

Big data storage in Professor Dutta's explanation is divided into two, namely databases and file systems. An explanation of the database has been discussed in the title of the same article in a different section. So we will not discuss it again here.

As for file systems, these are the methods and data structures that the operating system uses to control how data is stored and retrieved. Without a file system, data placed on storage media would become one huge data set with no way of knowing where one piece of data stopped and the next started, or where that piece of data would be when it was time to retrieve it. By separating data into sections and giving each section a name, the data is easy to isolate and identify. Taking its name from the name of a paper-based data management system, each group of data is called a "file". The structure and logical rules used to manage group data and its name are called "file systems".

Then about Query, Professor Dutta explained the two query models that exist in the Hadoop Pache service, namely Impala and Hive. Where Hive is for offline work, while impala is for online work.

In the field of computing, Professor Dutta exemplifies the Map-reduce model in Hadoop and apache spark. MapReduce is a programming model for retrieving large datasets via a distributed file system (such as HDFS). MapReduce is designed to be able to process unstructured data on large cluster commodities, handling hardware failures, duplication of tasks, and aggregated results.

The MapReduce programming model consists of a map function and a reduce function. The map function accepts each record of input data as a key-value and produces key-value output data. Each map function call is independent of each other which supports the framework for using divide and conquer strategies for parallel computing. It is also possible to duplicate execution or re-execute the map task in the event of a failure without affecting the computational results. Usually, Hadoop creates a single map task for each HDFS data block from the input data. The number of map function calls in a map task is equal to the number of records in
on the input data block on a particular task map.

Next, the key-values of all map tasks are grouped by key and then distributed into reduce tasks. The process of distributing and transmitting data to this reduce task is called the shuffle phase. The input data that goes into each reduce task is also sorted and grouped based on the key. The reduce function is called for each key and each group of values of that key. In a typical MapReduce program, users only need to implement the Map function and the Reduce and Hadoop functions that take care of scheduling and executing them in parallel. If any task fails, Hadoop will re-run the task.

Apache Spark is a distributed computing framework designed to perform general computing quickly. Spark developed the MapReduce programming model to be able to perform various types of computations such as interactive queries and streaming processing. Spark has an in-memory computation feature where computations are performed in the computer's memory, not on disk as Hadoop MapReduce does.

The last section of an important summary of big data is machine learning technology. Among them is Pyspark. PySpark is an interface for Apache Spark in Python. It not only allows you to write Spark applications using the Python API, but also provides a PySpark shell to interactively analyze your data in a distributed environment. PySpark supports most of Spark's features like Spark SQL, DataFrame, Streaming, MLlib (Machine Learning) and Spark Core.

Another machine learning technology alternative is tensorflow. Tensorflow is a machine learning framework that might become a friend when we play with data and if you are a fan of one of the areas in AI (artificial intelligence), namely deep learning. Tensorflow can help you build neural networks on a large scale. Tensorflow has helped scientists in projects such as the search for new planets, helping doctors prevent blindness in diabetic patients and others. Tensorflow is also a framework that underpins projects like AlphaGo and Google Cloud Vision that you can use.

Actually, there is another alternative to maschine learning technology discussed by Professor Dutta, namely H2O. But according to the author, the above technology is more interesting to write about in this article.