T. Ravindra Babu, M. Narasimha Murty - Compression Schemes for Mining Large Datasets

T. Ravindra Babu, M. Narasimha Murty, and S. V. Subrahmanya, Compression Schemes for Mining Large Datasets: A Machine Learning Perspective, Advances in Computer Vision and Pattern Recognition, Springer-Verlag London, 2013. DOI 10.1007/978-1-4471-5607-9_1
1. Introduction
T. Ravindra Babu 1, M. Narasimha Murty 2 and S. V. Subrahmanya 1
(1)
Infosys Technologies Ltd., Bangalore, India
(2)
Indian Institute of Science, Bangalore, India
Abstract
Data mining aims at generating abstractions from large datasets through efficient algorithms. One approach to achieving efficiency is to arrive at valid representative subsets of the original data and feature sets; all further data mining analysis can then be based on these representative subsets alone, leading to significant reductions in storage space and time. Another important direction is to compress the data in some manner and operate directly in the compressed domain. In this chapter, we present a discussion of major data mining tasks such as clustering, classification, dimensionality reduction, association rule mining, and data compression; all these tasks may be viewed as some kind of data abstraction or compaction. We further discuss various aspects of compression schemes, both in the abstract sense and as practical implementations. We provide a brief summary of the content of each chapter of the book and discuss its overall organization. We provide pointers to the literature for further study at the end of the chapter.
In this book, we deal with data mining and compression; specifically, we deal with using several data mining tasks directly on the compressed data.
1.1 Data Mining and Data Compression
Data mining is concerned with generating an abstraction of the input dataset using a mining task.
1.1.1 Data Mining Tasks
Important data mining tasks are:
Clustering. Clustering is the process of grouping data points so that points in each group or cluster are more similar to each other than to points belonging to different clusters. Each resulting cluster is abstracted using one or more representative patterns. So, clustering is a kind of compression in which details of the data are ignored and only the cluster representatives are used in further processing or decision making.
Classification. In classification, a labeled training dataset is used to learn a model or classifier. This learnt model is then used to assign a label to a test (unlabeled) pattern; this process is called classification.
Dimensionality Reduction. A majority of the classification and clustering algorithms fail to produce the expected results when dealing with high-dimensional datasets. Also, computational requirements in the form of time and space can increase enormously with dimensionality. This prompts reducing the dimensionality of the dataset, either by feature selection or by feature extraction. In feature selection, an appropriate subset of the original features is selected; in feature extraction, a subset of features in some transformed space is selected.
Regression or Function Prediction. Here a functional form for a variable y is learnt (where y = f(X)) from given pairs (X, y); the learnt function is used to predict the value of y for new values of X. This problem may be viewed as a generalization of the classification problem: in classification, the number of class labels is finite, whereas in the regression setting y can take infinitely many values, typically y ∈ ℝ.
Association Rule Mining. Even though it is of relatively recent origin compared with the other tasks, it is the earliest task introduced specifically in data mining and is responsible for bringing visibility to the area. In association rule mining, we are interested in finding out how frequently two subsets of items co-occur.
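The co-occurrence counting at the heart of association rule mining can be sketched in a few lines. The following is a minimal illustration, not any particular algorithm from the book: it counts how often each pair of items appears together across transactions and keeps the pairs whose count reaches a minimum support threshold. The basket data and the threshold are illustrative assumptions.

```python
from collections import Counter
from itertools import combinations

def frequent_pairs(transactions, min_support):
    """Count how often each pair of items co-occurs; keep the frequent ones."""
    counts = Counter()
    for t in transactions:
        # sorted() gives each pair a canonical order, so ('a','b') == ('a','b')
        for pair in combinations(sorted(set(t)), 2):
            counts[pair] += 1
    return {pair: c for pair, c in counts.items() if c >= min_support}

baskets = [
    {"bread", "milk"},
    {"bread", "milk", "eggs"},
    {"milk", "eggs"},
    {"bread", "milk"},
]
print(frequent_pairs(baskets, min_support=2))
# {('bread', 'milk'): 3, ('eggs', 'milk'): 2} -- ('bread', 'eggs') occurs only once
```

Association rules such as bread → milk would then be derived from these frequent itemsets; infrequent combinations are discarded, which is exactly the lossy behaviour discussed later in this chapter.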
1.1.2 Data Compression
Another important topic in this book is data compression. A compression scheme CS may be viewed as a function from the set of patterns 𝒳 to a set of compressed patterns 𝒳′:

CS : 𝒳 → 𝒳′.

Specifically, CS(x) = x′ for x ∈ 𝒳 and x′ ∈ 𝒳′. In a more general setting, we may view CS as producing the output x′ from x and some knowledge structure or dictionary K, so that CS(x, K) = x′ for x ∈ 𝒳 and x′ ∈ 𝒳′. Such a dictionary is sometimes used both in compressing and in uncompressing the data. Schemes for compressing data are the following:
  • Lossless Schemes. These schemes are such that CS(x) = x′ and there is an inverse CS⁻¹ such that CS⁻¹(x′) = x. For example, consider the binary string 00001111 (x) as input; the corresponding run-length-coded string is 44 (x′), where the first 4 corresponds to a run of 4 zeros and the second 4 to a run of 4 ones. Also, from the run-length-coded string 44 we can get back the input string 00001111. Note that such a representation is lossless, as we get x′ from x using run-length encoding and x from x′ using decoding.
  • Lossy Schemes. In a lossy compression scheme, it is not possible in general to get back the original data point x from the compressed pattern x′. Pattern recognition and data mining are areas in which there are plenty of examples where lossy compression schemes are used.
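The run-length example above can be sketched as a pair of functions satisfying CS⁻¹(CS(x)) = x. This is a minimal illustration: the encoder emits (symbol, run-length) pairs rather than the compact digit string 44 of the text, but the lossless round trip is the same.

```python
def rle_encode(bits):
    """Run-length encode a string into (symbol, run-length) pairs."""
    runs = []
    for b in bits:
        if runs and runs[-1][0] == b:
            runs[-1] = (b, runs[-1][1] + 1)  # extend the current run
        else:
            runs.append((b, 1))              # start a new run
    return runs

def rle_decode(runs):
    """Invert the encoding: expand each (symbol, count) pair."""
    return "".join(b * n for b, n in runs)

x = "00001111"
encoded = rle_encode(x)           # [('0', 4), ('1', 4)] -- the "44" of the text
assert rle_decode(encoded) == x   # lossless: CS⁻¹(CS(x)) == x
```

Because the decoder recovers x exactly, this scheme sits on the lossless side of Fig. 1.1; the data mining tasks below do not admit such an inverse.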
We show some example compression schemes in Fig. 1.1.

Fig. 1.1 Compression schemes
1.1.3 Compression Using Data Mining Tasks
Among the lossy compression schemes, we consider the data mining tasks themselves. Each of them may be viewed as a lossy compression scheme:
  • Association rule mining deals with generating frequently co-occurring items/patterns from the given data; infrequent items are ignored. Rules of association are generated from the frequent itemsets. So, in general, association rules cannot be used to recover the original input data points.
  • Clustering is lossy because the output of clustering is a collection of cluster representatives. From the cluster representatives we cannot get back the original data points. For example, in K -means clustering, each cluster is represented by the centroid of the data points in it; it is not possible to get back the original data points from the centroids.
  • Classification is lossy, as the models learnt from the training data cannot be used to reproduce the input data points. For example, in the case of Support Vector Machines, a subset of the training patterns called support vectors is used to build the classifier; it is not possible to generate the input data points from the support vectors.
  • Dimensionality reduction schemes can ignore some of the input features. So, they are lossy because it is not possible to get the training patterns back from the dimensionality-reduced ones.
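The clustering case above can be made concrete. The following sketch assumes a set of centroids has already been produced by some clustering run (the points and centroids here are made-up values, not data from the book): "compression" replaces each point by the index of its nearest centroid, and "decompression" can only return the centroid itself, so the original points are unrecoverable.

```python
def nearest(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda k: sum((p - c) ** 2
                                 for p, c in zip(point, centroids[k])))

def compress(points, centroids):
    """Lossy CS: each point is replaced by its nearest centroid's index."""
    return [nearest(p, centroids) for p in points]

def decompress(labels, centroids):
    """Best possible reconstruction: every point becomes its centroid."""
    return [centroids[k] for k in labels]

points = [(0.0, 0.1), (0.2, 0.0), (5.0, 5.1), (4.9, 5.0)]
centroids = [(0.1, 0.05), (4.95, 5.05)]   # assumed output of a K-means run
labels = compress(points, centroids)      # [0, 0, 1, 1]
approx = decompress(labels, centroids)
# approx != points: only the centroids survive; the originals are gone
```

Storing one index per point plus two centroids instead of four full points is exactly the storage saving that motivates operating on representatives, at the cost of the lost detail.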