Reading ML complex dataset

7 SIGNS YOU’RE DEALING WITH COMPLEX DATA

The complexity of your data is a good indicator of how difficult it will be to translate it into business value – complex data is typically harder to prepare and analyze than simple data, and often requires a different set of BI tools. Complex data necessitates additional work to prepare and model the data before it is “ripe” for analysis and visualization. Hence it is important to understand the current complexity of your data, and its potential complexity in the future.

Two (alternative) initial warning signs: big or disparate data

Big: larger amounts of data pose a challenge in terms of the computational resources needed to process massive datasets, as well as the difficulty of separating the wheat from the chaff, i.e. distinguishing between signal and noise amid a huge deposit of raw information.

Disparate: data scattered across multiple locations and systems is harder to gather, and typically needs to be cleaned and standardized before the various datasets can be cross-referenced and analyzed.

Digging a bit deeper, here are seven more specific indicators:


1. Structure

Data from different sources, or even different tables within the same source, often refer to the same information but structure it entirely differently. For example, imagine your HR department keeps three different spreadsheets – one for employees’ personal details, another for their roles and salaries, a third for their qualifications, and so on – whereas your finance department records the same information in a single table, along with insurance, benefits and other costs. Additionally, some of these tables might refer to employees by their full name, others by initials, or some combination of the two.

Using data from all these different tables efficiently, without losing or duplicating information, requires data modeling and preparation work. And this is the simplest case: working with unstructured data sources (such as NoSQL databases) can complicate matters further, as these initially have no schema in place.
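To make this concrete, here is a minimal sketch in Python (pandas) of the kind of preparation described above. The file names, column names and name-normalization rule are illustrative assumptions, not a prescribed schema.

```python
# Sketch: join three hypothetical HR spreadsheets on a standardized employee key.
import pandas as pd

personal = pd.read_excel("hr_personal_details.xlsx")   # assumed columns: name, date_of_birth, ...
roles    = pd.read_excel("hr_roles_salaries.xlsx")     # assumed columns: employee, role, salary
quals    = pd.read_excel("hr_qualifications.xlsx")     # assumed columns: full_name, qualification

def standardize_name(value: str) -> str:
    """Normalize casing, punctuation and 'Last, First' ordering into one key form.
    (Matching initials to full names would need fuzzier logic than this sketch.)"""
    value = value.strip().lower().replace(".", "")
    if "," in value:                                    # 'Smith, John' -> 'john smith'
        last, first = [part.strip() for part in value.split(",", 1)]
        value = f"{first} {last}"
    return value

for frame, column in [(personal, "name"), (roles, "employee"), (quals, "full_name")]:
    frame["employee_key"] = frame[column].map(standardize_name)

# An outer join keeps employees that appear in only some spreadsheets, so nothing
# is silently lost; duplicated keys surface duplicated information explicitly.
merged = (personal.merge(roles, on="employee_key", how="outer")
                  .merge(quals, on="employee_key", how="outer"))
print(merged.head())
```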

2. Size

Returning again to the murky concept of “big data”, the amount of data you collect can affect the types of software or hardware you need to analyze it. This can be measured in raw size – gigabytes, terabytes or petabytes – where the larger the data grows, the more likely it is to “choke” popular in-memory databases that rely on shifting compressed data into your server’s RAM. Additional considerations include tall data – tables that contain many rows (Excel, arguably the most commonly used data analysis tool, is limited to 1,048,576 rows) – and wide data – tables that contain many columns. You’ll find that the tools and methods you use to analyze 100,000 rows are significantly different from those needed to analyze 1 billion.
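As a rough illustration of how the tooling changes with scale, here is a minimal Python sketch that streams a hypothetical "events.csv" in chunks and keeps only running aggregates, so the full table never has to fit in RAM. The file name and its columns are assumptions.

```python
# Sketch: process a table too tall for Excel (or for memory) in streaming chunks.
import pandas as pd

totals = {}
for chunk in pd.read_csv("events.csv", chunksize=1_000_000):   # one million rows at a time
    # Accumulate per-category row counts without holding the full table in memory.
    counts = chunk["category"].value_counts()
    for category, count in counts.items():
        totals[category] = totals.get(category, 0) + count

print(pd.Series(totals).sort_values(ascending=False))
```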

3. Detail

Detail is the level of granularity at which you wish to explore the data. When creating a dashboard or report, presenting summarized or aggregated data is often easier than giving end users the ability to drill into every last detail – however, this tradeoff comes at the price of limiting the possible depth of analysis and data discovery. Creating a BI system that enables granular drill-downs means having to process larger amounts of data on an ad-hoc basis (without relying on predefined queries, aggregations or summary tables).
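The tradeoff can be illustrated with a small, made-up sales table: the pre-aggregated view is cheap to serve in a dashboard but fixed in advance, while the ad-hoc drill-down must touch the detailed rows.

```python
# Sketch: pre-aggregated summary vs. on-demand drill-down (illustrative data).
import pandas as pd

sales = pd.DataFrame({
    "region":  ["North", "North", "South", "South", "South"],
    "store":   ["N1", "N2", "S1", "S1", "S2"],
    "product": ["widget", "widget", "gadget", "widget", "gadget"],
    "revenue": [120.0, 95.0, 40.0, 60.0, 75.0],
})

# Pre-aggregated view: cheap to serve, but the granularity is decided up front.
summary = sales.groupby("region", as_index=False)["revenue"].sum()

# Ad-hoc drill-down: computed on demand at whatever granularity the user asks for,
# which requires access to (and processing of) the detailed rows.
drill_down = sales.groupby(["region", "store", "product"], as_index=False)["revenue"].sum()

print(summary)
print(drill_down)
```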

4. Query language

Different data sources speak different languages: while SQL is the primary means of extracting data from common sources and RDBMS, a third-party platform will often require you to connect via its own API and syntax, and to understand the internal data model and protocols used to access its data. Your BI tools need to be flexible enough to allow this type of native connectivity to the data source, either via built-in connectors or API access; otherwise you will find yourself repeating a cumbersome process of exporting the data to a spreadsheet, SQL database or data warehouse and then pulling it into your Business Intelligence software from there.
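Here is a minimal sketch of what this looks like in practice, with a hypothetical SQL database and a hypothetical third-party API; the table name, endpoint, token and JSON response shape are all assumptions.

```python
# Sketch: one source reached via SQL, another via its own HTTP API.
import sqlite3
import requests
import pandas as pd

# SQL source: one query language, one connector.
with sqlite3.connect("crm.db") as conn:
    customers = pd.read_sql_query("SELECT id, name, segment FROM customers", conn)

# API source: a different "language" -- endpoints, auth, pagination, JSON fields.
response = requests.get(
    "https://api.example-platform.com/v1/tickets",       # hypothetical endpoint
    headers={"Authorization": "Bearer <token>"},
    params={"updated_since": "2024-01-01"},
    timeout=30,
)
tickets = pd.DataFrame(response.json()["results"])        # assumed response shape

# Only once both are in a common tabular form can they be cross-referenced.
joined = customers.merge(tickets, left_on="id", right_on="customer_id", how="left")
```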

5. Data type

Working with mostly numeric, operational data stored in tabular form is one thing, but massive and unstructured machine data is another thing entirely, as is a text-heavy dataset stored in MongoDB, not to mention video and audio recordings. Different types of data have different rules, and finding a way to forge a single source of truth from all of them is essential in order to base your business decisions on an integrated view of all your organization’s data.
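As one hedged illustration, the sketch below flattens a hypothetical text-heavy MongoDB collection into rows so it can sit next to tabular operational data; the database, collection, field names and "orders.csv" file are assumptions.

```python
# Sketch: derive simple tabular features from document data, then join to tabular data.
import pandas as pd
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
reviews = client["support"]["reviews"]            # document store: free text, flexible schema

# Flatten the fields we care about into rows, deriving a simple numeric feature
# (text length) so the text data can sit next to operational metrics.
docs = reviews.find({}, {"customer_id": 1, "text": 1, "_id": 0})
review_rows = pd.DataFrame([
    {"customer_id": doc.get("customer_id"), "review_length": len(doc.get("text", ""))}
    for doc in docs
])

orders = pd.read_csv("orders.csv")                # tabular, numeric, operational data
combined = orders.merge(review_rows, on="customer_id", how="left")
```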

6. Dispersed data

Data stored in multiple locations – for example, in different departments inside the organization, on-premises or in the cloud (either in purchased storage or via cloud applications), or as external data originating from clients or suppliers. This data is more difficult to gather (simply because of the number of stakeholders who need to be involved in order to receive it in a timely and effective manner), and once gathered it will typically require some cleaning or standardization before the various datasets can be cross-referenced and analyzed, since each local dataset will be collected according to the relevant organization’s or application’s own practices and priorities.
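A minimal sketch of that standardization step, with two hypothetical extracts whose column names, date formats and currencies differ (all of these conventions are assumed examples):

```python
# Sketch: reconcile naming, date formats and units before stacking extracts.
import pandas as pd

on_prem = pd.read_csv("sales_onprem.csv")     # assumed columns: OrderID, Amount (EUR), Date "31/12/2024"
cloud   = pd.read_csv("sales_cloud_app.csv")  # assumed columns: order_id, amount_usd, date "2024-12-31"

on_prem = on_prem.rename(columns={"OrderID": "order_id", "Amount": "amount", "Date": "date"})
on_prem["date"] = pd.to_datetime(on_prem["date"], dayfirst=True)
on_prem["currency"] = "EUR"

cloud = cloud.rename(columns={"amount_usd": "amount"})
cloud["date"] = pd.to_datetime(cloud["date"])
cloud["currency"] = "USD"

# Only after the conventions are reconciled can the datasets be analyzed as one.
combined = pd.concat([on_prem, cloud], ignore_index=True)
```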

7. Growth rate

Finally, you need to consider not only your current data, but also the speed at which your data is growing or changing. If the data sources are frequently updated, or new data sources are frequently added, this can tax your hardware and software resources (less advanced systems need to re-ingest the entire dataset from scratch whenever significant changes are made to the source data), as well as multiply the above-mentioned issues around structure, type, size, and so on.
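One common mitigation is incremental ingestion keyed on a high-water mark, sketched below; the file layout and the "updated_at" column are assumptions about the source.

```python
# Sketch: append only the rows that changed since the last run, not the whole dataset.
import pandas as pd
from pathlib import Path

STATE_FILE = Path("last_loaded.txt")          # high-water mark from the previous run
WAREHOUSE = Path("warehouse.csv")

last_loaded = (pd.Timestamp(STATE_FILE.read_text().strip())
               if STATE_FILE.exists() else pd.Timestamp.min)

source = pd.read_csv("source_extract.csv", parse_dates=["updated_at"])
new_rows = source[source["updated_at"] > last_loaded]

if not new_rows.empty:
    # Append only what changed instead of re-ingesting the entire dataset.
    new_rows.to_csv(WAREHOUSE, mode="a", header=not WAREHOUSE.exists(), index=False)
    STATE_FILE.write_text(str(new_rows["updated_at"].max()))
```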

Understanding the complexity of your data is the first step towards finding an appropriate solution.

Understanding Complex Datasets: Data Mining using Matrix Decompositions

From the table of contents:

1  Data Mining
  1.1  Data
  1.2  Data-mining techniques
    1.2.1  Prediction
    1.2.2  Clustering
    1.2.3  Finding outliers
    1.2.4  Finding local patterns
  1.3  Why use matrix decompositions?
    1.3.1  Data that comes from multiple processes
    1.3.2  Data with multiple causes
    1.3.3  What are matrix decompositions used for?
2  Matrix decompositions
  2.1  Definition
  2.2  Interpreting decompositions
    2.2.1  Factor interpretation -- signals from hidden sources
    2.2.2  Geometric interpretation -- hidden clusters
    2.2.3  Component interpretation -- underlying processes
    2.2.4  Graph interpretation -- hidden connections
    2.2.5  Summary
    2.2.6  Example
  2.3  Applying decompositions
    2.3.1  Selecting factors, dimensions, components, or waystations
    2.3.2  Similarity and clustering
    2.3.3  Finding local relationships
    2.3.4  Sparse representations
    2.3.5  Oversampling
  2.4  Algorithm issues
    2.4.1  Algorithms and complexity
    2.4.2  Data preparation issues
    2.4.3  Updating a decomposition
3  Singular Value Decomposition (SVD)
  3.1  Definition
  3.2  Interpreting SVD
    3.2.1  Factor interpretation
    3.2.2  Geometric interpretation
    3.2.3  Component interpretation
    3.2.4  Graph interpretation
  3.3  Applying SVD
    3.3.1  Selecting factors, dimensions, components, and waystations
    3.3.2  Similarity and clustering
    3.3.3  Finding local relationships
    3.3.4  Sampling and sparsifying by removing values
    3.3.5  Using domain knowledge or priors
  3.4  Algorithm issues
    3.4.1  Algorithms and complexity
    3.4.2  Updating SVD
  3.5  Applications of SVD
    3.5.1  The workhorse of noise removal
    3.5.2  Information retrieval -- Latent Semantic Indexing (LSI)
    3.5.3  Ranking objects and attributes by interestingness
    3.5.4  Collaborative filtering
    3.5.5  Winnowing microarray data
  3.6  Extensions
    3.6.1  PDDP
    3.6.2  CUR decomposition
4  Graph Partitioning
  4.1  Graphs versus datasets
  4.2  Adjacency matrix
  4.3  Eigenvalues and eigenvectors
  4.4  Connections to SVD
  4.5  A motivating example: Google's PageRank
  4.6  Overview of the embedding process
  4.7  Datasets versus graphs
    4.7.1  Mapping a Euclidean space to an affinity matrix
    4.7.2  Mapping an affinity matrix to a representation matrix
  4.8  Eigendecompositions
  4.9  Clustering
    4.9.1  Examples
  4.10  Edge prediction
  4.11  Graph substructures
  4.12  Bipartite graphs
5  SemiDiscrete Decomposition (SDD)
  5.1  Definition
  5.2  Interpreting SDD
    5.2.1  Factor interpretation
    5.2.2  Geometric interpretation
    5.2.3  Component interpretation
    5.2.4  Graph interpretation
  5.3  Applying SDD
    5.3.1  Truncation
    5.3.2  Similarity and clustering
  5.4  Algorithm issues
  5.5  Extensions
    5.5.1  Binary nonorthogonal matrix decomposition
6  Using SVD and SDD together
  6.1  SVD then SDD
    6.1.1  Applying SDD to A_k
    6.1.2  Applying SDD to the truncated correlation matrices
  6.2  Applications of SVD and SDD together
    6.2.1  Classifying galaxies
    6.2.2  Mineral exploration
    6.2.3  Protein conformation
7  Independent Component Analysis
  7.1  Definition
  7.2  Interpreting ICA
    7.2.1  Factor interpretation
    7.2.2  Geometric interpretation
    7.2.3  Component interpretation
    7.2.4  Graph interpretation
  7.3  Applying ICA
    7.3.1  Selecting dimensions
    7.3.2  Similarity and clustering
  7.4  Algorithm issues
  7.5  Applications of ICA
    7.5.1  Determining suspicious messages
    7.5.2  Removing spatial artifacts from microarrays
    7.5.3  Finding al Qaeda groups
8  Non-Negative Matrix Factorization (NNMF)
  8.1  Definition
  8.2  Interpreting NNMF
    8.2.1  Factor interpretation
    8.2.2  Geometric interpretation
    8.2.3  Component interpretation
    8.2.4  Graph interpretation
  8.3  Applying NNMF
    8.3.1  Selecting factors
    8.3.2  Denoising
    8.3.3  Similarity and clustering
  8.4  Algorithm issues
    8.4.1  Algorithms and complexity
    8.4.2  Updating
  8.5  Applications of NNMF
    8.5.1  Topic detection
    8.5.2  Microarray analysis
    8.5.3  Mineral exploration revisited
9  Tensors
  9.1  The Tucker3 tensor decomposition
  9.2  The CP decomposition
  9.3  Applications of tensors
    9.3.1  Citation data
    9.3.2  Words, documents, and links
    9.3.3  Users, keywords, and time in chat rooms
  9.4  Algorithmic issues
10  Conclusion

From the preface:

Many data-mining algorithms were developed for the world of business, for example for customer relationship management. The datasets in this environment, although large, are simple in the sense that a customer either did or did not buy three widgets; or did or did not fly from Chicago to Albuquerque.

In contrast, the datasets collected in scientific and engineering applications contain values that represent a combination of different properties of the real world. For example, an observation of a star produces some value for the intensity of its radiation at a particular frequency. But the observed value is the sum of (at least) three different components: the actual intensity of the radiation that the star is (was) emitting, properties of the atmosphere that the radiation encountered on its way from the star to the telescope, and properties of the telescope itself. Astrophysicists who want to model the actual properties of stars must remove (as far as possible) the other components to get at the 'actual' data value. And it is not always clear which components are of interest. For example, we could imagine a detection system for stealth aircraft that relied on the way they disturb the image of stellar objects behind them. In this case, a different component would be the one of interest.

Most mainstream data-mining techniques ignore the fact that real-world datasets are combinations of underlying data, and build single models from them. If such datasets can first be separated into the components that underlie them, we might expect that the quality of the models will improve significantly. Matrix decompositions use the relationships among large amounts of data and the probable relationships between the components to do this kind of separation. For example, in the astrophysical example, we can plausibly assume that the changes to observed values caused by the atmosphere are independent of those caused by the device. The changes in intensity might also be independent of changes caused by the atmosphere, except if the atmosphere attenuates intensity non-linearly.

Some matrix decompositions have been known for over a hundred years; others have only been discovered in the past decade. They are typically computationally intensive, so it is only recently that they have been used as analysis tools except in the most straightforward ways. Even when matrix decompositions have been applied in sophisticated ways, they have often been used only in limited application domains, and the experience and 'tricks' needed to use them well have not been disseminated to the wider community.

This book gathers together what is known about the commonest matrix decompositions – Singular Value Decomposition (SVD), graph partitioning via eigendecompositions, SemiDiscrete Decomposition (SDD), Independent Component Analysis (ICA), Non-Negative Matrix Factorization (NNMF), and tensor decompositions – and shows how they can be used as tools to analyze large datasets. Each matrix decomposition makes a different assumption about what the underlying structure in the data might be, so choosing the appropriate one is a critical choice in each application domain. Fortunately, once this choice is made, most decompositions have few other parameters to set.
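As a small, self-contained illustration of the general idea (not an example taken from the book), the following numpy sketch computes a truncated Singular Value Decomposition of a synthetic matrix and keeps only the components with large singular values – the basic move behind SVD-based noise removal. The data, noise level and choice of k are assumptions made for the sketch.

```python
# Sketch: truncated SVD separates low-rank structure from noise.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: two underlying "processes" (rank-2 structure) plus noise.
hidden = rng.normal(size=(100, 2)) @ rng.normal(size=(2, 20))
data = hidden + 0.1 * rng.normal(size=(100, 20))

U, s, Vt = np.linalg.svd(data, full_matrices=False)

k = 2                                             # number of components to keep
denoised = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]  # A_k = U_k * S_k * V_k^T

print("leading singular values:", np.round(s[:5], 2))
print("relative error vs. hidden structure:",
      np.linalg.norm(denoised - hidden) / np.linalg.norm(hidden))
```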

There are deep connections between matrix decompositions and structures within graphs. For example, the PageRank algorithm that underlies the Google search engine is related to Singular Value Decomposition; and both are related to properties of walks in graphs. Hence matrix decompositions can shed light on relational data, such as the connections in the Web, or transfers in the financial industry, or relationships in organizations.
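For instance, PageRank itself can be computed by power iteration on the (damped) transition matrix of a random walk over the link graph. The sketch below uses a made-up four-page web purely for illustration; the adjacency matrix and damping factor are assumptions.

```python
# Sketch: PageRank as power iteration on a random-walk transition matrix.
import numpy as np

# Toy web: adjacency[i, j] = 1 if page i links to page j.
adjacency = np.array([
    [0, 1, 1, 0],
    [0, 0, 1, 0],
    [1, 0, 0, 1],
    [0, 0, 1, 0],
], dtype=float)

# Column-stochastic transition matrix of the random surfer.
transition = (adjacency / adjacency.sum(axis=1, keepdims=True)).T

damping = 0.85
n = adjacency.shape[0]
rank = np.full(n, 1.0 / n)

for _ in range(100):                               # power iteration
    rank = damping * transition @ rank + (1 - damping) / n

print(np.round(rank / rank.sum(), 3))
```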

This book shows how matrix decompositions can be used in practice in a wide range of application domains. Data mining is becoming an important analysis tool in science and engineering in settings where controlled experiments are impractical. We show how matrix decompositions can be used to find useful documents on the web, make recommendations about which book or DVD to buy, find terrorists by their travel patterns, look for deeply buried mineral deposits without drilling, explore the structure of proteins, clean up the data from DNA microarrays, detect suspicious emails or cell phone calls, and figure out what topics a set of documents are about.

This book is intended for researchers who have complex datasets that they want to model, and are finding that other data-mining techniques do not perform well. It will also be of interest to researchers in computing who want to develop new data-mining techniques or investigate connections between standard techniques and matrix decompositions. It can be used as a supplement to graduate level data-mining textbooks.

The conventional presentations of this material tend to rely on a great deal of linear algebra. Most scientists and engineers will have encountered basic linear algebra; some social scientists may have as well. For example, most will be familiar (perhaps in a hazy way) with eigenvalues and eigenvectors; but singular value decomposition is often covered only in graduate linear algebra courses, so it is not as widely known as perhaps it should be. I have tried throughout to concentrate on intuitive explanations of what the linear algebra is doing. The software that implements the decompositions described here can be used directly -- there is little need to program algorithms. What is important is to understand enough about what is happening computationally to be able to set up sequences of analysis, to understand how to interpret the results, and to notice when things are going wrong.

I teach much of this material in an undergraduate data mining course. Although most of the students do not have enough linear algebra background to understand the deeper theory behind most of the matrix decompositions, they quickly learn to use them on real datasets, especially as visualization is often a natural way to interpret the results of a decomposition. I originally developed this material as background for my own graduate students, who go on either to use this approach in practical settings or to explore some of the important theoretical and algorithmic problems associated with matrix decompositions, for example reducing their computational cost.