Purpose – The purpose of this research paper is to outline the relationship between big data and social media data, and the extraction of the latter. In doing so, it links the content of many unfolding areas of research with the papers published in this special issue on big data: social data mining, social data extraction, and security.
Social networking platforms provide various means of communication between people around the globe by establishing networks in which information such as text, pictures, audio and video is shared. A further problem is the chance of fraudulent activity, such as the creation of fake profiles and the massive spread of fake and bogus news; in the age of big data, security is therefore also a major research paradigm in the field. Big data analytics is increasingly becoming a trending practice that many organizations adopt with the purpose of constructing valuable information from social media. Extremely large-scale big data poses challenges of quantity, complexity, semantic distribution and processing cost in computer science, web-based computing and computational intelligence, so the analysis of massive data requires considerable effort at multiple levels in order to extract knowledge from social media.
Design/methodology/approach – The paper surveys the existing literature, which is still in its infancy, and proposes ways in which to frame early and future research. The intention is not to offer a comprehensive review, but to stimulate conversation.
Findings – In this paper, the authors review ongoing studies that explore technology-enabled groups in the form of networks, focusing on key aspects such as social media data, its extraction, and big data, along with a few reviews on security; they classify existing research efforts and highlight future research opportunities. Several areas of investigation are identified, such as new tools for performance indication on social media data and big data, governance of massive social data and big data information resources, and, finally, how social media and big data alter the information and decision-making behaviors of organizations. New tools such as big web data mining are therefore needed. Visualization is a tool shown to be effective for gleaning insight from big data, and technologies such as Apache Hadoop are also emerging to support back-end concerns such as storage and processing, while visualization focuses on the front end of the data. Big data introduces unique computational and statistical challenges, including scalability, storage bottlenecks, noise accumulation, spurious correlation and measurement errors. The authors further pose two questions: Q1: What are the different types of big data challenges confronted by organizations? Q2: What are the different types of methods employed to overcome big data challenges with respect to big social media data, its extraction, and security measures?
Originality/value – The authors are currently experiencing a technological revolution that will fundamentally change the way in which organizations, as well as individuals, operate.
Keywords – Social Media Data, Social Media Data Extraction, Big Data, Big Data Analytics, Big Web Data Mining, Big Social Data Security, Apache Hadoop, Visualization.
Paper Type – Survey Paper
Nowadays, big social media data from platforms such as Facebook, Twitter, YouTube, Yuku and Taobao, and from many blogs, has grown enormously over the last few years, with most of the population under the age of 30 using technology-enabled networks in their day-to-day work, around the clock and without boundaries. A key characteristic of big social media data is that it makes it possible to connect with other users worldwide and to post and share information on a continuous, second-by-second basis. The emerging effect of the adoption of social media technologies has been the growth of so-called "big data", referred to in this paper as big social data.
Many users and organizations collect, collate and analyze the massive amount of social data available on different social network portals, so it is easy to be carried away by the inflation of social media and big data, here called big social data or big social media data.
Web mining is the technique used in data mining to extract valuable information, in the form of knowledge, from web data: web documents (webpages), hyperlinks, user logs and much more. Since there is a massive amount of data in webpages in the form of pictures, audio, videos, posts, likes, forums and blogs, web mining includes content mining, structure mining and usage mining, all of which attempt to extract useful, valuable information and, as a result, address real-world problems, security being one of them. A great deal of personal information resides on the internet, and in this regard web data mining helps to secure that information at the forefront.
IDC reports that 90% of the data in the world today was generated in the last two years. In order to tackle the challenges described above, such as big social web data, several proposed generic frameworks that apply data mining techniques to large web data sets will be discussed, with the goal of correlating the results of mining web usage logs and the extracted web structure with visualization. Visual web mining is thus the application of information visualization techniques to the outcomes of web mining, for the purpose of amplifying the perception of extracted patterns, rules and regularities.
Many of today's technologies have been used to produce large ontologies, such as Google's Knowledge Graph, which produces and maintains a large set of data by integrating high-quality structured sources and is purely concerned with semantic query interpretation (1: A. Singhal, "Introducing the Knowledge Graph: things, not strings").
DBpedia is continuously growing by extracting data from Wikipedia (2: J. Lehmann, R. Isele, M. Jakob, A. Jentzsch, D. Kontokostas, P. Mendes, S. Hellmann, M. Morsey, P. van Kleef, S. Auer, and C. Bizer, "DBpedia – a large-scale, multilingual knowledge base extracted from Wikipedia").
The research community has not yet focused on social content in the construction of ontological knowledge; DBpedia and the knowledge graphs at Google and Facebook derive from structured or semi-structured curated data (4: T. Rebele, F. M. Suchanek, J. Hoffart, J. Biega, E. Kuzey, and G. Weikum, "YAGO: A multilingual knowledge base from Wikipedia, WordNet, and GeoNames," International Semantic Web Conference).
(5: E. Sun and V. Iyer, "Under the hood: The entities graph").
However, with the increasing trends of social traffic and user interest, the new era of social media analytics, concerned with analyzing real-world phenomena using very large social media data, has in turn created a wealth of useful technologies, tools and linking technologies that apply natural language processing techniques to social content and its analytics (6: S. Stieglitz, L. Dang-Xuan, A. Bruns, and C. Neuberger, "Social media analytics – an interdisciplinary approach and its implications for information systems," Business & Information Systems Engineering, 6(2):89–96, 2014).
Overview of Web Data Mining
Web Mining Web mining is the term for applying data mining techniques to automatically discover and extract useful information from World Wide Web documents and services 5.
It is an application of data mining techniques used to extract knowledge from webpages, such as from hyperlinks, user logs and web documents, and it can be further classified into three categories according to content, structure and usage.
Web Content Mining It is the process of extracting valuable information from website contents, such as web documents and pages, where the contents are the facts a webpage is designed to hold, including text, images, audio, video, and structured forms such as lists and tables. Many issues can be addressed in text mining, such as topic discovery and tracking, extracting association patterns, clustering and classification. Much research is ongoing in Natural Language Processing (NLP) and Information Retrieval 6, and some work focuses on extracting knowledge from images in the areas of image processing and computer vision.
Web Structure Mining Normally the web structure, or web graph, is composed of webpages as nodes and hyperlinks as edges connecting related webpages, and web structure mining is the process of evaluating structural information from the web: the analysis of web resources comprising the website, its hyperlinks, and the paths users take to surf or explore the web 6. Further, the content inside a webpage is organized in a tree-structured format that relies on the HTML and XML tags of that page; mining the Document Object Model (DOM) is an effort that works on the structure of documents.
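To make the web-graph idea concrete, the following sketch uses Python's standard html.parser module to collect the hyperlinks (the out-edges) of a single page. The HTML snippet is a hypothetical example, not data from any cited system:

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect the href targets of anchor tags, i.e. the out-edges
    of one page (node) in the web graph."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

# Hypothetical page content for illustration.
page = ('<html><body><a href="/about">About</a>'
        '<p>text</p><a href="https://example.com">ext</a></body></html>')

extractor = LinkExtractor()
extractor.feed(page)
# extractor.links now holds the page's out-links, in document order.
```

Running the extractor over every crawled page and recording (page, link) pairs yields the node/edge structure that web structure mining algorithms operate on.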
Web Usage Mining is used to discover interesting usage patterns from web usage data for the purpose of understanding and better serving the requirements of web-based applications, because usage data captures the origin of users along with their browsing behavior. The gap between the raw data available for web usage mining on one side and the knowledge expected to be extracted from it on the other poses a special challenge. 7
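A basic step in web usage mining is sessionization: grouping a user's page requests into sessions. The sketch below assumes a hypothetical simplified log format ("ip timestamp url") and a common 30-minute inactivity timeout; real server logs and sessionizers are more involved:

```python
from datetime import datetime, timedelta

# Hypothetical, simplified log lines: "ip ISO-timestamp url".
LOG_LINES = [
    "10.0.0.1 2024-01-01T10:00:00 /home",
    "10.0.0.1 2024-01-01T10:05:00 /products",
    "10.0.0.2 2024-01-01T10:01:00 /home",
    "10.0.0.1 2024-01-01T11:00:00 /home",  # >30 min gap: new session
]

SESSION_TIMEOUT = timedelta(minutes=30)

def sessionize(lines):
    """Group requests into per-user sessions using an inactivity timeout.
    Assumes each user's lines appear in chronological order."""
    sessions = {}  # ip -> list of sessions, each a list of (time, url)
    for line in lines:
        ip, ts, url = line.split()
        t = datetime.fromisoformat(ts)
        user_sessions = sessions.setdefault(ip, [])
        if user_sessions and t - user_sessions[-1][-1][0] <= SESSION_TIMEOUT:
            user_sessions[-1].append((t, url))   # continue current session
        else:
            user_sessions.append([(t, url)])     # start a new session
    return sessions

sessions = sessionize(LOG_LINES)
```

Session counts and page sequences derived this way are the raw material for the pattern discovery discussed above.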
TECHNIQUES FOR HANDLING BIG WEB DATA
Visual Web Mining Architecture
The architecture for implementing visual web mining is shown in Figure 1 below. 8 proposed a system that consists of web pages and the respective server (log) files; the local file system provides access to the web logs, which can be downloaded from a remote server.
A web robot (webbot) is used to retrieve the pages of the website 10. In parallel, the web server log files are downloaded and processed through a sessionizer, and a LOGML 11 file is generated.
· For the input, a data mining suite, a link analysis suite and a user profiling/modeling suite are used. To produce output in proprietary formats, an integration engine is required for data preparation (extracting, cleaning, transforming, integrating and generating graphs) and conversion, using XQuery and XSLT on standard and proprietary data formats.
· Extracting user sessions produces results for a specific user; these are then converted to a format fit for sequence mining using cSPADE.
· In the proposed framework, the outputs are contiguous sequences at a given support, which are then imported into a database, and non-maximal sequences are removed. Later, different sampling queries are performed against this data following criteria such as the length and support of a particular pattern.
Handling Big Web Data with Hadoop MapReduce
With the increasing usage of the internet, as the web becomes more interesting to the world, the growth of data simultaneously increases beyond the imagination of users. Hadoop deals with massive quantities of big web data using clustering. Web server logs are usually semi-structured, computer-generated flat text files and can be processed efficiently by MapReduce.
Pranit B. Mohata 8 performs session identification in log files using Hadoop in a distributed cluster, with Hadoop MapReduce data processing used in pseudo-distributed mode. The framework identifies each user's sessions in order to recognize unique users and the pages they accessed. The identified sessions are then analyzed in R to produce a report based on the total count of a particular user's visits per day; compared with a Java environment, the proposed work achieves good time efficiency, storage and processing speed 12.
Hadoop allows many applications to work on thousands of computers and petabytes of data.
It is used to split a large amount of data into smaller chunks, which are then processed separately on different machines. In order to achieve parallel execution, it implements the MapReduce programming model. MapReduce is a Java-based distributed programming model consisting of two phases: a massively parallel "Map" phase, followed by an aggregating "Reduce" phase 13.
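Although Hadoop itself is Java-based and distributed, the two phases can be illustrated with a minimal single-process Python simulation of the classic word-count job (toy input, no Hadoop involved):

```python
from collections import defaultdict

def map_phase(documents):
    # "Map": each input split is processed independently,
    # emitting (word, 1) key-value pairs.
    for doc in documents:
        for word in doc.lower().split():
            yield (word, 1)

def reduce_phase(pairs):
    # "Shuffle" groups the emitted pairs by key,
    # then "Reduce" aggregates the values of each group.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: sum(values) for key, values in groups.items()}

docs = ["big data on the web", "mining big web data"]
counts = reduce_phase(map_phase(docs))
```

In a real Hadoop job the documents would be distributed across cluster nodes, each node would run the map function on its local split, and the framework would shuffle the intermediate pairs to the reducers; the logical structure, however, is exactly the one sketched here.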
Extraction and Analysis of Big Social Data
Social websites such as Facebook and Twitter are great sources of large shared data sets; on these websites people use informal language, normally unstructured or semi-structured, for communication, which leads to many problems at the lexical, syntactic and semantic levels 14. In this regard, extracting logical patterns with valuable, accurate and authentic information is a critical problem, so text mining is used to extract valuable information from these large data sets.
Machine Learning Based Text Classification
Machine learning techniques can be classified into supervised, unsupervised and semi-supervised methods.
Rocchio's Algorithm: It is based on a relevance feedback method that provides synonyms of words having the same meaning in context 15, but obtaining fully sufficient knowledge for relevance feedback is a problem, because users can write a word with different spellings in different ways. Rocchio uses the vector space method for filtering: it builds a prototype vector for each class from the training set, and calculates the relevance between the test data and the prototype vectors in order to assign documents to the class with maximum relevance.
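The core of Rocchio classification, building a prototype (centroid) vector per class and assigning a document to the nearest prototype, can be sketched as follows. The training texts and labels are made-up toy data, and term-frequency vectors stand in for the usual TF-IDF weighting:

```python
import math
from collections import Counter

def vectorize(text):
    # Bag-of-words term-frequency vector.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def centroid(vectors):
    # Rocchio prototype: the mean term vector of a class.
    total = Counter()
    for v in vectors:
        total.update(v)
    return Counter({t: c / len(vectors) for t, c in total.items()})

# Hypothetical labeled training documents.
train = {
    "sports": ["the team won the match", "great goal in the game"],
    "tech": ["new phone released today", "software update for the phone"],
}
prototypes = {label: centroid([vectorize(d) for d in docs])
              for label, docs in train.items()}

def classify(text):
    v = vectorize(text)
    return max(prototypes, key=lambda label: cosine(v, prototypes[label]))
```

A query such as classify("the team scored a goal") is assigned to the class whose prototype vector it is most similar to under the cosine measure.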
Instance-Based Learning Algorithm: It works by comparing new problem instances with instances already stored during training 16. Case-based reasoning (CBR) and k-nearest neighbor (k-NN) are examples of instance-based learning; k-NN identifies the closest points in feature space by calculating the distance between vectors. CBR uses textual CBR (TCBR) to handle textual knowledge, but this extracts similar cases with low knowledge. k-NN calculates the relevance between the test document and each neighbor, and assignment is done on the basis of the class held by most of the neighbors.
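A minimal k-NN text classifier over toy labeled documents (all names and data are illustrative) follows directly from the description above: rank the stored training documents by similarity to the query and take a majority vote among the k nearest:

```python
import math
from collections import Counter

def vec(text):
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(x * x for x in a.values()))
    nb = math.sqrt(sum(x * x for x in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical labeled training documents.
train = [
    ("spam", "win money now click here"),
    ("spam", "free money offer click"),
    ("ham", "meeting agenda for tomorrow"),
    ("ham", "lunch tomorrow with the team"),
]

def knn_classify(text, k=3):
    # Rank stored instances by similarity, then vote among the k nearest.
    q = vec(text)
    ranked = sorted(train, key=lambda item: cosine(q, vec(item[1])),
                    reverse=True)
    votes = Counter(label for label, _ in ranked[:k])
    return votes.most_common(1)[0][0]
```

Note that, true to instance-based learning, no model is trained: all work happens at query time by comparison against the stored instances.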
Decision Rules Classification Method: It handles large numbers of both sufficient and insufficient relevant features, which can sometimes cause poor performance in text classification 17. The figure below shows a decision tree with different levels of different entities; a drawback here is that we cannot always assign a document to a category, so decision rules may not work properly when the number of different features is too large.
Genetic Algorithm: It uses term weights assigned to each concept in a document, based on the same or related topics; it provides better results and is widely used for optimization problems. Due to recursion, an end function is normally needed to monitor the improvement of the results generated consecutively 18.
Support Vector Machine (SVM): It constructs the hyperplane that separates the positive and negative points of the training data set; the points closest to the hyperplane are called support vectors. SVM needs points residing on both sides of the hyperplane, which is not common in text classification. The documents close to the decision surface determine the performance of SVM classification, because the classifier remains unchanged even when a document that does not belong to the support vectors is removed from the training set 19.
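The separating-hyperplane idea can be illustrated with a linear SVM trained by sub-gradient descent on the hinge loss. This is a pedagogical sketch on made-up 2-D data, not how production SVM solvers work (those use specialized quadratic-programming or dual-coordinate methods):

```python
# Toy linearly separable data: label +1 above the line y = x, -1 below it.
data = [((0.0, 2.0), 1), ((1.0, 3.0), 1), ((2.0, 4.0), 1),
        ((2.0, 0.0), -1), ((3.0, 1.0), -1), ((4.0, 2.0), -1)]

def train_svm(samples, lam=1e-4, eta=0.1, epochs=300):
    # Sub-gradient descent on the regularized hinge loss
    # L = lam/2 * |w|^2 + sum(max(0, 1 - y * (w.x + b))).
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        for (x1, x2), y in samples:
            if y * (w[0] * x1 + w[1] * x2 + b) < 1:
                # Margin violated: move the hyperplane toward the point.
                w[0] += eta * (y * x1 - lam * w[0])
                w[1] += eta * (y * x2 - lam * w[1])
                b += eta * y
            else:
                # Margin satisfied: only the regularizer shrinks w.
                w[0] -= eta * lam * w[0]
                w[1] -= eta * lam * w[1]
    return w, b

w, b = train_svm(data)

def predict(x1, x2):
    # Sign of the signed distance to the learned hyperplane.
    return 1 if w[0] * x1 + w[1] * x2 + b >= 0 else -1
```

After training, only margin-violating points (the support vectors) have influenced w and b, which is exactly the property described in the paragraph above.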
Artificial Neural Networks: An ANN is constructed from a large set of artificial neurons, with an input fan-in orders of magnitude larger than that of the computational elements of conventional architectures. These neurons are connected to each other and use a mathematical model for the processing of information. The interconnected neurons can then be used for distortion-tolerant storage of high-dimensional vectors over a large number of cases. Among the many approaches, some researchers use a single-layer perceptron, which has only an input and an output layer, with inputs fed directly to the outputs via weights; the multilayer perceptron is more sophisticated, having one input layer, one or more hidden layers, and an output layer in its structure.
In classification tasks, an artificial neural network has the ability to deal with documents with high-dimensional features and with contradictory and noisy data. The drawback is the computing cost: ANNs consume a lot of CPU time, require substantial physical memory, and are difficult for the average user to understand 20.
Clustering is an unsupervised process of classifying text documents into groups by using different clustering algorithms, performed in either a top-down or a bottom-up manner. There are many different clustering techniques, among them hierarchical, distribution-based, density-based, centroid-based and k-means clustering 21.
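The centroid-based family can be illustrated with a bare-bones k-means (Lloyd's algorithm) on toy 2-D points; document clustering works the same way once documents are embedded as vectors. The data and the simple first-k initialization are illustrative choices:

```python
import math

def kmeans(points, k, iters=20):
    # Lloyd's algorithm: alternate between assigning each point to its
    # nearest centroid and recomputing each centroid as the cluster mean.
    centroids = points[:k]  # simple deterministic initialization
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[idx].append(p)
        centroids = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters

# Two obvious groups of toy points.
points = [(0.0, 0.0), (0.5, 0.2), (0.2, 0.4),
          (9.0, 9.0), (9.5, 9.2), (9.2, 8.8)]
centroids, clusters = kmeans(points, k=2)
```

In practice, k-means is sensitive to initialization, so real systems restart it several times or use smarter seeding; the sketch keeps the first k points as seeds only for determinism.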
Zhang et al. 22 use the cosine measure to calculate the correlation similarity between two documents projected into a low-dimensional semantic space, and also perform document clustering in the correlation-similarity-based measure space.
Hassan 23 performed partitional clustering of documents to maximize the sum of the distinct information provided by the documents, because distinct information tends to yield clusters that are definable by their highly discriminating terms.
Qimin et al. 24 work on a feature-based vector space model (FC-VSM), which uses a text feature cluster co-occurrence matrix to represent documents, and also work on the identification of non-contiguous phrases in the preprocessing phase.
Wei et al. 25 present an ontology hierarchical structure for word sense disambiguation in order to assess the similarity of words.
Text categorization is the task of automatically sorting a set of documents into categories, classes or topics from a predefined set 26.
Huang 27 proposed a text categorization technique called VSM-WN-TM, which is composed of the vector space model (VSM), the WordNet ontology, and probabilistic latent semantic analysis (PLSA) topic modeling.
Zheng 28 produced a novel approach to text categorization based on the regularized extreme learning machine (RELM), in which the weights can be obtained analytically and a bias-variance trade-off can be achieved in the linear system of a single-hidden-layer feedforward network by simply adding a regularization term.
Tang 29 presents a Bayesian classification approach to achieve automatic text categorization by using class-specific features.
Text summarization is the process of collecting and producing a concise representation of original text documents 30. Over time, further methods were introduced into the standard text mining process to improve the accuracy and relevance of the results accordingly 31.
Ferreira et al. 32 work on a new summarization system that easily combines different sentence-scoring methods in order to summarize, although the result depends on the context.
Pal et al. 33 proposed a WordNet-based method for identifying the semantics behind different inputs by using the Lesk algorithm.
Xiong and Lu 34 use latent semantic analysis (LSA), a distinctive approach because it uses latent semantic information instead of the original features, choosing sentences individually in order to remove redundant sentences.
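Extractive summarization in its simplest form can be sketched without LSA: score each sentence by how frequent its words are in the whole document, and keep the top-scoring sentences in their original order. This is a toy frequency-based baseline, not any of the cited methods:

```python
import re
from collections import Counter

def summarize(text, n=1):
    # Extractive summarization: score each sentence by the average
    # document-wide frequency of its words, keep the top-n sentences
    # in their original order.
    sentences = [s.strip()
                 for s in re.split(r"(?<=[.!?])\s+", text.strip())
                 if s.strip()]
    freq = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(sentence):
        tokens = re.findall(r"[a-z']+", sentence.lower())
        return sum(freq[t] for t in tokens) / (len(tokens) or 1)

    top = sorted(sentences, key=score, reverse=True)[:n]
    return [s for s in sentences if s in top]

doc = ("Big data grows quickly. "
       "Social media produces big data every day. "
       "Cats are fluffy.")
summary = summarize(doc, n=1)
```

LSA-based summarizers improve on this baseline by scoring sentences in a latent semantic space rather than on raw term frequencies, which is what makes them better at dropping redundant sentences.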
Sentiment analysis, also known as opinion mining, involves building a system to collect and examine public opinion, mood, reactions and behavior about a particular product as expressed in blog posts, comments, reviews and tweets.
Lin et al. 35 proposed a novel probabilistic modeling framework called joint sentiment-topic (JST), based on latent Dirichlet allocation (LDA); moreover, a re-parameterized version of the JST model called Reversed-JST reverses the sequence of sentiments in the modeling process, so that topics and topic sentiments are captured by JST.
Arun et al. 36 work on a review analyzer for the analysis of consumer reviews about a product, based on identifying the sentimental words.
Cui et al. 37 work on multilingual and informal messages, taking this as a challenging task and tackling it through the analysis of emotion tokens. This is done in two phases: first, the tokens are extracted; second, a graph propagation algorithm plots the tokens at different polarities, and the sentiment analysis algorithm then analyzes and classifies the emotion tokens.
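The simplest sentiment classifiers are lexicon-based: count positive and negative words and compare. The tiny hand-made lexicon below is purely illustrative; real systems use curated resources and handle negation, intensifiers and emoticons:

```python
# Tiny hand-made sentiment lexicon; real systems use curated resources.
POSITIVE = {"good", "great", "love", "excellent", "happy"}
NEGATIVE = {"bad", "terrible", "hate", "poor", "awful"}

def sentiment(text):
    # Score = (# positive hits) - (# negative hits); label by sign.
    tokens = text.lower().split()
    score = (sum(t in POSITIVE for t in tokens)
             - sum(t in NEGATIVE for t in tokens))
    if score > 0:
        return "positive", score
    if score < 0:
        return "negative", score
    return "neutral", 0
```

Approaches such as JST above go well beyond this by jointly modeling topics and sentiments, but the lexicon baseline makes the basic task of mapping text to a polarity concrete.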
Big Data in the Context of Social Media Data (Is It Really Big Social Media Data?)
Today, the emerging technological development of big data is recognized as one of the frontline areas of future information technology in every organization, and it is continuously evolving at a rapid speed due to the abnormal increase in social media activity.
Gartner 38 forecasted that the total number of connected objects, 4.9 billion in 2015, will reach 25 billion by 2020. Currently, big data in the context of social media data is composed of different factors: veracity, variability, complexity, value, velocity and volume. The integrated view of big data in the following figure shows that an increase in velocity affects either volume or variety, or increases both, and finally affects the other factors inside the triangle (veracity declines, while variability, complexity, decay and value tend to increase).
Big Data 1.0 (1994-2004)
At this stage, big data coincides with the rise of e-commerce in 1994, when the contribution of web content started, while user-generated content was only a marginal part of the contents of the web. At this time, web mining techniques (web usage mining, web structure mining and web content mining) were developed to analyze users' online activities.
Big Data 2.0 (2005-2014)
Big data 2.0 is driven by Web 2.0 and thus by the social media phenomenon, which evolved from the web technologies of the 1990s and allowed users to interact with websites and contribute their own content to them.
According to O'Reilly 39, social media embodied the principles of Web 2.0 and created a new way for organizations to operate and collaborate. Social media analysis also supports content mining, usage mining and structure mining.
ReportsnReports 40 states that the worldwide social media analytics market was worth $1.6 billion in 2015 and is estimated to reach $5.4 billion by 2020.
Big Data 3.0 (2015-Continue)
Building on big data 1.0 and big data 2.0, we are now in the age of big data 3.0, which adds IoT applications that generate data in the form of images, audio and video. These greatly contribute to social data areas such as streaming analysis, which involves real-time event analysis, and geographical data analysis.