1. Data science, in simple terms, focuses on making sense of data (often in large amounts), i.e. a collection of figures/numbers. The data may be structured or unstructured; a data scientist collects it and uses combined knowledge of statistics, math, and programming to clean it and present it in a human-understandable form.
This in turn leads to making predictions, identifying patterns, etc., which can be used to create data products. A few examples of data products: self-driving technology uses computer vision to identify traffic lights, pedestrians, vehicles on the road, etc.; YouTube’s video suggestions are tailored to each person’s viewing history; email spam filters examine mails for certain keywords, the sender address, etc. and determine whether each one is junk or not.
2. Data science and big data analytics are closely related.
Data scientists collect data and use their knowledge of math, statistics, programming, information science, etc. to clean it, identify patterns, and extract information from it. Data science can thus be considered a broad term for the many different techniques used to extract information from data. Similarly, big data analytics can be described as the techniques used to process raw data in quantities often too large to store on a single machine, so that an organization can make effective decisions. An example is YouTube’s recommendations, which differ for each user based on search and viewing history collected over a long period of time. As data grows in size, new technologies are needed to analyze it. The sheer size of the data is in itself a major factor limiting the use of traditional techniques.
Often the data is so big that it has to be stored on multiple systems, and producing results in a reasonable amount of time with traditional techniques is rarely possible; thus we need a distributed way to process the data. One such technique is MapReduce, which uses a distributed file system and multiple machines working together to process and analyze the data fed to it. Some other tech
3. The MapReduce framework helps solve the problem of big data processing by processing the data in smaller chunks on a large number of commodity machines. MapReduce originated at Google in 2004, and the Hadoop architecture is built on it. It consists of two components, viz. map and reduce. The mapper takes the input and generates intermediate key-value pairs, which are then passed as input to the reducer to generate the final key-value pairs. The map and reduce functions are parallelized and run on a large number of commodity systems. A master node splits the work into map and reduce tasks, handles hardware failures, etc. There are many differences between MapReduce and a traditional parallel DBMS.
According to the article “Weighing MapReduce Against Parallel DBMS” by Ian Armas Foster:
a. MapReduce scales easily because it uses cheap commodity hardware, and adding a node to the cluster significantly increases processing power; thus MapReduce can handle significantly more data than a parallel DBMS.
b. A parallel DBMS is usually SQL-based and hence cannot effectively handle unstructured data; this is not a limitation for MapReduce.
c. A parallel DBMS is not fault tolerant: processing is stopped and restarted in case of a node failure. MapReduce can handle node failures by redistributing tasks between nodes.
d. If the data is structured, a parallel DBMS is better suited to repeated querying. MapReduce is slow at handling repeated queries.
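
As a rough illustration of the map and reduce steps described in answer 3, here is a minimal single-process sketch in Python. Word counting is chosen here only as an example task; it is not taken from the answers above. The function names (map_words, shuffle, reduce_counts, run_job) are hypothetical and are not part of Hadoop or any other framework's API. In a real cluster each chunk would be processed on a different machine, and the master node would schedule the tasks and handle failures.

```python
from collections import defaultdict
from itertools import chain


def map_words(chunk):
    """Map step: emit an intermediate (word, 1) pair for every word in one chunk."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Group intermediate values by key, as a framework would do between the phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(key, values):
    """Reduce step: combine all values for one key into a final (key, total) pair."""
    return key, sum(values)


def run_job(chunks):
    # Each chunk stands in for a block of input that a separate machine would handle.
    intermediate = chain.from_iterable(map_words(c) for c in chunks)
    grouped = shuffle(intermediate)
    return dict(reduce_counts(k, v) for k, v in grouped.items())


# Assumed toy input: two small chunks instead of files on a distributed file system.
chunks = ["big data needs big clusters", "data science uses data"]
word_counts = run_job(chunks)
print(word_counts)
```

Here the mapper runs once per chunk and the reducer once per distinct word, so both steps could in principle be spread across many machines. That is the property the answer attributes to MapReduce.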