Literature Review

Literature Review on Big Data Privacy & Security Challenges Faced by IT Companies.

Introduction
“Data really powers everything that we do.” – Jeff Weiner, LinkedIn. Data is not merely information about a particular thing; it is an essential resource for modern life. Data from many fields are stored and analyzed to produce information and knowledge, and analytics is the process of transforming data into insight for better decision making. Large volumes of data arise in many environments, for example the academic sector, weather forecasting, the IT sector, and industry. Big Data analysis now drives nearly every aspect of society, including mobile services, retail, manufacturing, financial services, life sciences, and the physical sciences, and the Internet itself is a vast space to which great amounts of information are added every day.

The importance of Big Data and Big Data analysis can be tied to the society we live in. Today we live in an information society and are moving towards a knowledge-based society, and extracting better knowledge requires a larger amount of data. In the information society, information plays a major role on the economic, cultural and political stage; in the knowledge society, competitive advantage is gained by understanding information and predicting the evolution of events from data. Every organization therefore needs to collect large data sets and extract correlations from them through data analysis as a basis for its decisions. Big Data is revolutionizing all aspects of our lives, from enterprises to consumers and from science to government, and the analysis of such large collections of data within a particular sector, firm or category is termed “Big Data analytics”.

The term “Big Data” was first introduced to the computing world by Roger Magoulas of O’Reilly Media in 2005, to describe amounts of data that traditional data management techniques cannot manage and process due to their complexity and size. Big Data is defined by its size: a large, complex and independent collection of data sets, each with the potential to interact. An equally important aspect is that it cannot be handled with standard data management techniques, owing to the inconsistency and unpredictability of the possible combinations. The main importance of Big Data lies in its potential to improve efficiency through the use of a large volume of data of different types.

General Challenges:
1. Heterogeneity: When humans consume information, a great deal of heterogeneity is comfortably tolerated; the nuance and richness of natural language can provide valuable depth. Machine analysis algorithms, however, expect homogeneous data and are poor at understanding nuance. An associated challenge is to automatically generate the right metadata to describe the data being recorded. Recording information about data at its creation is of little use unless that information can be interpreted and carried along through the data analysis pipeline; this is known as “data provenance”.
2. Inconsistency and incompleteness: Big Data includes information provided by increasingly diverse sources of varying reliability. Uncertainty, errors, and missing values are endemic and must be managed. The volume and redundancy of Big Data can often be exploited to compensate for missing data, to cross-check conflicting cases, to validate trustworthy relationships, to disclose inherent clusters and to uncover hidden relationships and models (see the sketch after this list). We need technologies that facilitate this, and managing incompleteness and errors correctly during data analysis remains a challenge.
3. Scale: Managing large and rapidly increasing volumes of data has been a challenge for many decades. In the past this challenge was mitigated by processors getting faster, but data volume is now increasing faster than CPU speeds and other computational resources. This has led to a need for global optimization across multiple user programs, even those performing complex machine learning tasks.
4. Timeliness: As data grow in volume, we need real-time techniques to summarize and filter what is to be stored, since in many cases it is not economically viable to store raw data. This gives rise to both an acquisition-rate challenge and a timeliness challenge. The fundamental problem is to provide interactive response times to complex queries, at scale, over high-volume event streams.
5. Privacy and data ownership: Privacy is one of the major challenges that grows in the context of Big Data. In location-based services, where the user’s location is needed for successful data access or data collection, getting this integration right is a challenge in itself. As the value of data is increasingly recognized, the data an organization owns becomes a central strategic consideration: organizations want to leverage this data while retaining their unique data advantage, and questions such as how to sell or share data without losing control are becoming important.
6. Visualization and collaboration: For Big Data to reach its full potential, it is not enough to consider scale at the system level; the human perspective matters too. Users must be able to absorb the results of analysis properly rather than get lost in an ocean of data. Systems should offer a rich palette of visualizations that can be created quickly and declaratively, helping users understand the output of an analysis in detail.
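To make challenge 2 concrete, the following minimal Python sketch (an illustration under assumed record structures and a simple majority-vote rule, not part of any surveyed system) shows how the redundancy of Big Data can be exploited to fill in missing values and cross-check conflicting cases across sources.

    from collections import Counter

    # Hypothetical records about the same entity collected from three sources;
    # the field names and values are illustrative assumptions.
    records = [
        {"name": "Acme Corp", "city": "Austin", "employees": 120},
        {"name": "Acme Corp", "city": "Austin", "employees": None},  # missing value
        {"name": "Acme Corp", "city": "Dallas", "employees": 120},   # conflicting value
    ]

    def consolidate(records):
        """Merge redundant records: for each field, take a majority vote over
        the non-missing observations, so redundancy compensates for missing
        data and resolves conflicts in favour of the most-observed value."""
        merged = {}
        for field in {key for record in records for key in record}:
            observed = [r[field] for r in records if r.get(field) is not None]
            if observed:
                merged[field] = Counter(observed).most_common(1)[0][0]
        return merged

    print(consolidate(records))
    # e.g. {'name': 'Acme Corp', 'city': 'Austin', 'employees': 120}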

Data Security & Privacy:
Security and privacy are among the most important challenges for Big Data. Because Big Data consists of a large amount of complex data, it is very difficult for a company to sort this data by privacy level and apply the corresponding security controls. Managing privacy effectively is both a technical and a sociological problem, and it must be addressed jointly from both perspectives to realize the promise of Big Data.
Example: consider data extracted from location-based services, which require a user to share his or her location with the service provider. There are obvious privacy concerns that are not addressed by hiding the user’s identity alone, without hiding the location: an attacker, or a potentially malicious location-based server, can infer the identity of the query source from its subsequent location information.
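One common mitigation in the location-privacy literature (not one proposed by the papers surveyed here) is spatial cloaking: reporting only a coarse grid cell instead of the exact position, so that successive queries are harder to chain into a precise trace. A minimal Python sketch, with an assumed grid size:

    import math

    def cloak_location(lat, lon, cell_deg=0.05):
        """Snap an exact coordinate to the centre of a coarse grid cell, so the
        location-based server sees only an approximate region. The cell size
        (cell_deg, roughly 5 km here) is an assumed tuning parameter."""
        cloaked_lat = math.floor(lat / cell_deg) * cell_deg + cell_deg / 2
        cloaked_lon = math.floor(lon / cell_deg) * cell_deg + cell_deg / 2
        return round(cloaked_lat, 4), round(cloaked_lon, 4)

    # The query carries only the cloaked point; nearby positions map to one cell.
    print(cloak_location(40.74812, -73.98553))  # -> (40.725, -73.975)

Cloaking alone reduces, but does not eliminate, the inference risk described above; in practice it is combined with identity hiding or anonymity guarantees over the users sharing a cell.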
Many companies do business across countries and continents, and the considerable differences in privacy laws have to be taken into consideration when starting a Big Data initiative. If data are not authentic, newly mined knowledge will be unconvincing, while if privacy is not well addressed, people may be reluctant to share their data [1]. Many privacy and security techniques have already been designed, but they are inadequate for newly emerging Big Data scenarios because they are tailored to securing traditional small-scale data. In-depth research efforts dedicated to security and privacy challenges in Big Data are therefore needed.

Strategies for Data Security & Privacy
Prevent Loss of Privacy Data Architecture (PLPD)
PLPD consists of four distinct parts (a sketch of how they might fit together follows this list):
(I) A Unique ID Authentication Mechanism (UAM): verifies whether a user is legitimate as a prerequisite to allowing access to personal information on the network.
(II) A Rule-based Control Mechanism (RCM): restricts access according to the rights of the user authenticated through the UAM. Access control here also detects abnormalities by checking access frequencies. Access to private data, including unique identifiers and sensitive information, is checked against conditions such as the consent of the data subject before personally identifying information is processed, and obligations are carried out after an access action is executed. To perform these checks, the mechanism encodes rules that ensure compliance with privacy laws and regulations.
(III) A Violation Check Mechanism (VCM): invoked when violations occur within the system; it automatically analyzes suspicious activity. Violations include queries that exceed the user’s authorized limit and query volumes that exceed specified thresholds.
(IV) A Detection Management Mechanism (DMM): records the whole process to support accountability efforts. The DMM keeps a log of who logged in, when they logged out, and which devices they used. The log file must be retained for accountability purposes, and it also records the basis of each decision by mapping it to the access control rules that applied. Understanding the conditions and obligations imposed by regulation is important, because users have the right to protect their own personal information.
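The following minimal Python sketch shows one way the RCM, VCM and DMM could interact; the policy table, role names, quotas and log format are all illustrative assumptions, since the PLPD description above does not fix them.

    import logging
    from datetime import datetime, timezone

    logging.basicConfig(filename="access.log", level=logging.INFO)

    # Illustrative RCM policy table: allowed fields and a daily query quota per
    # role. The roles, fields and thresholds are assumptions for this sketch.
    POLICY = {
        "analyst": {"fields": {"age", "region"}, "daily_quota": 100},
        "admin": {"fields": {"age", "region", "national_id"}, "daily_quota": 1000},
    }

    query_counts = {}  # per-user counters consulted by the VCM

    def access(user, role, field, consent):
        """The UAM is assumed to have authenticated `user` already. The RCM
        checks the requested field against the role's policy and the data
        subject's consent; the VCM flags over-quota activity; the DMM logs
        every decision for accountability."""
        rules = POLICY.get(role)
        query_counts[user] = query_counts.get(user, 0) + 1
        granted = (
            rules is not None
            and field in rules["fields"]
            and consent
            and query_counts[user] <= rules["daily_quota"]
        )
        # DMM: record who, when, what and the outcome.
        logging.info("%s user=%s role=%s field=%s granted=%s",
                     datetime.now(timezone.utc).isoformat(), user, role, field, granted)
        if not granted:
            # VCM: out-of-policy or over-quota activity is flagged for review.
            logging.warning("violation: user=%s field=%s count=%d",
                            user, field, query_counts[user])
        return granted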

System Architecture
This system consists of two major components: a Cloud Data Distributor and Cloud Providers. The Cloud Data Distributor receives data in the form of files from clients, splits each file into chunks and distributes these chunks among cloud providers. Cloud Providers store chunks and respond to chunk requests by returning the requested chunks.
(A) Cloud Data Distributor
The Cloud Data Distributor is the entity that receives data (files) from clients, fragments the data (splits files into chunks) and distributes these fragments among the Cloud Providers. It also participates in data retrieval by receiving chunk requests from clients and forwarding them to the Cloud Providers. Clients do not interact with Cloud Providers directly but via the Cloud Data Distributor, which deals with the providers as an agent of the clients.

To upload data, clients deliver files to the Cloud Data Distributor. Each file is given a privacy level, chosen by the client, indicating its mining sensitivity: the significance of the information that could be leaked by mining the data in the file. The proposed system suggests four privacy levels, PL 0 to PL 3, which denote respectively public data (accessible to everyone, including an adversary), low-sensitivity data (data that do not reveal any private or protected information but can be used to find patterns), moderately sensitive data (protected data that can be used to extract non-trivial financial, legal or health information about a company or an individual) and highly sensitive or private data (data that can be used to extract personal information about an individual or private information about a company, whose disclosure could prove disastrous). The higher the privacy level of a file, the more sensitive the data inside it.

After receiving files from clients, the Cloud Data Distributor partitions each file into chunks, with each chunk inheriting the privacy level of its parent file (a minimal sketch of this step appears below). The total number of chunks in each file is reported to the client, so that any chunk can later be requested by file name and serial number, where the serial number corresponds to the position of the chunk within the file. For an additional dimension of privacy, the Cloud Data Distributor may add misleading data into chunks at the client’s request; the positions of the misleading bytes are maintained by the distributor, and those bytes are removed when the chunks are returned to the clients.
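A minimal Python sketch of the chunking step described above; the chunk size is an assumed parameter, and misleading-byte insertion is omitted for brevity.

    import os

    CHUNK_SIZE = 4096  # bytes; an assumed, illustrative chunk size

    def split_file(path, privacy_level):
        """Split a file into fixed-size chunks, each inheriting the parent
        file's privacy level (PL 0-3) and carrying a serial number that gives
        its position within the file."""
        assert privacy_level in (0, 1, 2, 3), "privacy level must be PL 0-3"
        chunks = []
        with open(path, "rb") as f:
            serial = 0
            while data := f.read(CHUNK_SIZE):
                chunks.append({
                    "file": os.path.basename(path),
                    "serial": serial,  # position of this chunk in the file
                    "privacy_level": privacy_level,
                    "data": data,
                })
                serial += 1
        # The client is told len(chunks), so it can later request any chunk
        # by file name and serial number.
        return chunks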
(B) Cloud Providers
The main tasks of the Cloud Providers are storing chunks of data, responding to queries by providing the requested data, and removing chunks when asked. All of this is done using a virtual id, which serves as the object key in, for example, Amazon’s Simple Storage Service (a minimal sketch follows). Providers receive chunks from the distributor and store them; each provider is treated as a separate disk holding clients’ data. A provider responds to the distributor’s queries by returning data, and it also receives remove requests from the distributor and acts on them by deleting the corresponding chunk.
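As a sketch of these provider-side operations, the snippet below uses Amazon S3 via the boto3 library, with the virtual id serving as the object key as the description above suggests; the bucket name is a placeholder.

    import boto3

    s3 = boto3.client("s3")
    BUCKET = "example-provider-bucket"  # placeholder bucket name

    def store_chunk(virtual_id, data):
        # The chunk's virtual id doubles as the S3 object key.
        s3.put_object(Bucket=BUCKET, Key=virtual_id, Body=data)

    def fetch_chunk(virtual_id):
        # Respond to a chunk request from the distributor.
        return s3.get_object(Bucket=BUCKET, Key=virtual_id)["Body"].read()

    def remove_chunk(virtual_id):
        # Remove requests from the distributor delete the corresponding chunk.
        s3.delete_object(Bucket=BUCKET, Key=virtual_id)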

Conclusion
Progress towards the information society has increased the risk of privacy invasion through unfair or excessive collection of personal data, and the emergence of new technologies has made large-scale data an increasingly attractive target for hijacking. Moreover, multiple pieces of personal data whose release a user has approved can cumulatively expose sensitive information that he or she did not want others to know. Large-scale data stores, regarded as among the most valuable assets in recent years, are now becoming targets of attack, and as data protection becomes more secure and developed, attackers grow more organized and more professionally equipped, with focused intent. Ensuring the security of cloud data is still a challenging problem: cloud service providers, as well as other third parties, use various data mining techniques to acquire valuable information from user data hosted on the cloud. The approach combining categorization, fragmentation and distribution prevents such mining by maintaining privacy levels, splitting data into chunks and storing those chunks with appropriate cloud providers, and this helps to keep the data secure.
Although the strategies presented provide an effective way to protect privacy, future research should address a protection module that prevents large-scale data loss, and should tackle the performance overhead incurred when a client needs to access all of its data frequently.

References:
(1) Kim, Kyong-jin, Seng-phil Hong, and Joon Young Kim. “A Study of Privacy Protection from Risk of Hijacking Data.” International Journal of Multimedia and Ubiquitous Engineering 8.1 (2013).
(2) Chun, Byung-Tae, and Seong-Hoon Lee. “A Study on Big Data Processing Mechanism & Applicability.” International Journal of Software Engineering & Its Applications 8.8 (2014).
(3) Jagadish, H. V., et al. “Big data and its technical challenges.” Communications of the ACM 57.7 (2014): 86-94.
(4) Rajesh, K. V. N. “Big Data Analytics: Applications and Benefits.” IUP Journal of Information Technology 9.4 (2013).
(5) Chen, Hsinchun, Roger H. L. Chiang, and Veda C. Storey. “Business Intelligence and Analytics: From Big Data to Big Impact.” MIS Quarterly 36.4 (2012): 1165-1188.
(6) Ularu, Elena Geanina, et al. “Perspectives on Big Data and Big Data Analytics.” Database Systems Journal 3.4 (2012): 3-14.
(7) Daries, Jon P., et al. “Privacy, Anonymity, and Big Data in the Social Sciences.” Communications of the ACM 57.9 (2014): 56-63.
(8) Wang, Richard Y., and Diane M. Strong. “Beyond Accuracy: What Data Quality Means to Data Consumers.” Journal of Management Information Systems 12.4 (1996): 5-33.
(9) Lu, Rongxing, et al. “EPPA: An Efficient and Privacy-Preserving Aggregation Scheme for Secure Smart Grid Communications.” IEEE Transactions on Parallel and Distributed Systems 23.9 (2012): 1621-1631.
(10) Jutla, Dawn N., Peter Bodorik, and Sohail Ali. “Engineering Privacy for Big Data Apps with the Unified Modeling Language.” Big Data (BigData Congress), 2013 IEEE International Congress on. IEEE, 2013.
