Wednesday, June 5, 2019
Identifying Clusters in High Dimensional Data
Ask those who remember, are mindful if you do not know. (Holy Quran, 643)

Removal of Redundant Dimensions to Find Clusters in N-Dimensional Data Using Subspace Clustering

Abstract

Data mining has emerged as a powerful tool for extracting knowledge from huge databases. Researchers have introduced several machine learning algorithms to explore databases and discover information, hidden patterns, and rules that were not known when the data was recorded. Owing to remarkable developments in storage capacity, processing power, and algorithmic tools, practitioners are developing new and improved algorithms and techniques in several areas of data mining to discover the rules and relationships among attributes in simple and complex higher-dimensional databases. Data mining is applied in a wide variety of areas, from banking to marketing, engineering to bioinformatics, and investment to risk analysis and fraud detection. Practitioners are also analyzing and implementing artificial neural network techniques for classification and regression problems because of their accuracy and efficiency. The aim of this short research project is to develop a way of identifying the clusters in high-dimensional data, as well as the redundant dimensions that can create noise when identifying those clusters.
The technique used in this project exploits the projections of the data points along the dimensions: it measures the intensity of projection along each dimension in order to find the clusters and the redundant dimensions in high-dimensional data.

1 Introduction

In numerous scientific settings, engineering processes, and business applications, ranging from experimental sensor data and process control data to telecommunication traffic observation and financial transaction monitoring, huge amounts of high-dimensional measurement data are produced and stored. Whereas sensor equipment and large storage devices are getting cheaper day by day, data analysis tools and techniques lag behind. Clustering methods are common solutions to unsupervised learning problems where neither expert knowledge nor helpful annotation for the data is available. In general, clustering groups the data objects so that similar objects fall into the same cluster, whereas objects from different clusters are highly dissimilar. However, it is often observed that clustering discloses almost no structure even when it is known that there must be groups of similar objects. In many cases, the reason is that the cluster structure is induced by only some subsets of the space's dimensions, and the many additional dimensions contribute nothing but noise that hinders the discovery of the clusters within the data. As a solution to this problem, clustering algorithms are applied to the relevant subspaces only. Immediately, the new question is how to determine the relevant subspaces among the dimensions of the full space.
Faced with the power set of the set of dimensions, a brute-force trial of all subsets is infeasible, because their number grows exponentially with the original dimensionality.

In high-dimensional data, visualization and representation of the data become more difficult as the number of dimensions increases, and sometimes the increase in dimensions creates a bottleneck. More dimensions mean more visualization and representation problems. As dimensions are added, the data within them appears to disperse towards the corners of the space. Subspace clustering addresses both problems in parallel. It determines which subspaces are relevant, so that the remaining dimensions can be marked as redundant, and it finds the cluster structures within the dataset that become apparent in these subspaces. Subspace clustering is an extension of traditional clustering that automatically finds the clusters present in subspaces of a high-dimensional data space. It allows the data points to be clustered better than in the original space, and it works even when the curse of dimensionality occurs. Most clustering algorithms have been designed to discover clusters in the full-dimensional space, so they are not effective at identifying clusters that exist within a subspace of the original data space. Most clustering algorithms also produce results that depend on the order in which the input records were processed [2].

Subspace clustering can identify the different clusters within subspaces that exist in large amounts of sales data, and through it we can find which of the different attributes are related. This can be useful in promoting sales and in planning the inventory levels of different products.
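To make the point about brute-force search concrete, the following minimal sketch counts and enumerates the candidate subspaces for a given number of dimensions. The function names and the sample dimension labels are illustrative assumptions, not part of the original project.

```python
from itertools import combinations


def count_subspaces(num_dims: int) -> int:
    """Number of non-empty subsets of num_dims dimensions, i.e. 2^d - 1."""
    return 2 ** num_dims - 1


def enumerate_subspaces(dims):
    """Yield every non-empty subset of dims, smallest subsets first."""
    for size in range(1, len(dims) + 1):
        yield from combinations(dims, size)


if __name__ == "__main__":
    # Hypothetical three-dimensional space: small enough to list in full.
    print(list(enumerate_subspaces(["x", "y", "z"])))
    # The count grows exponentially, so full enumeration quickly stops being practical.
    for d in (3, 10, 20, 50):
        print(d, count_subspaces(d))
```

Enumerating subsets is only workable for a handful of dimensions, which is why a method that ranks dimensions directly is attractive.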
Subspace clustering can also be used to find subspace clusters in spatial databases, and useful decisions can be taken based on the subspace clusters identified [2].

The technique used here to identify the redundant dimensions that create noise, and thereby to identify the clusters, proceeds in steps. First, the data points are plotted in all dimensions. In the second step, the projections of all data points along each dimension are plotted. In the third step, the unions of the projections along each dimension are plotted for all possible combinations of the dimensions. Finally, the union of all projections along all dimensions is plotted and analyzed. This shows the contribution of each dimension to identifying the clusters, which is represented by the weight of projection. If a given dimension contributes very little to the weight of projection, that dimension can be considered redundant, meaning that it is not important for identifying the clusters in the given data. The details of this strategy will be covered in later chapters.

2 Data Mining

2.1 What is Data Mining?

Data mining is the process of analyzing data from different perspectives and summarizing it into useful information. The information can be used for many purposes, such as increasing revenue or cutting costs. The data mining process also finds hidden knowledge and relationships within the data that were not known when the data was recorded. Describing the data is the first step in data mining, followed by summarizing its attributes (standard deviation, mean, and so on). After that, the data is reviewed using visual tools such as charts and graphs, and meaningful relations are then determined. In the data mining process, the steps of collecting, exploring, and selecting the right data are critically important. Users can analyze data from different dimensions, categorize it, and summarize it.
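The attribute-summary step mentioned above can be sketched with the Python standard library. This is a minimal illustration that assumes the records are equal-length numeric tuples; the function name `summarize` and the sample values are hypothetical.

```python
from statistics import mean, stdev


def summarize(records):
    """Return a (mean, sample standard deviation) pair for each attribute.

    records is assumed to be a list of equal-length numeric tuples,
    with at least two records so that the sample deviation is defined.
    """
    columns = zip(*records)  # regroup the values attribute by attribute
    return [(mean(col), stdev(col)) for col in columns]


if __name__ == "__main__":
    # Hypothetical data: (units sold, unit price) for three transactions.
    sample = [(1.0, 10.0), (2.0, 10.0), (3.0, 10.0)]
    for index, (avg, spread) in enumerate(summarize(sample)):
        print(f"attribute {index}: mean={avg}, stdev={spread}")
```

An attribute whose spread is close to zero, like the second one here, carries little information for separating records, which is one reason summaries are inspected before any mining step.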
Data mining finds correlations or patterns among the fields in large databases.

Data mining has great potential to help companies focus on the most important information in their data warehouses. It can predict future trends and behaviors, allowing businesses to make more proactive, knowledge-driven decisions. It can answer business questions that were traditionally too time-consuming to resolve. It scours databases for hidden patterns, finding predictive information that experts may miss because it lies beyond their expectations. Data mining is normally used to transform data into information or knowledge. It is commonly used in a wide range of profiling practices such as marketing, fraud detection, and scientific discovery. Many companies already collect and refine their data. Data mining techniques can be implemented on existing platforms to enhance the value of information resources. Data mining tools can analyze massive databases to deliver answers to such questions.

Several other terms carry a similar meaning to data mining, such as knowledge mining, knowledge extraction, and pattern analysis. Data mining can also be treated as Knowledge Discovery from Data (KDD). Some people instead regard data mining as just one essential step in knowledge discovery from large data.
The process of knowledge discovery from data consists of the following steps.

* Data cleaning (removing noise and inconsistent data)
* Data integration (combining multiple data sources)
* Data selection (retrieving the data relevant to the analysis task from the database)
* Data transformation (transforming the data into forms appropriate for mining by performing summary or aggregation operations)
* Data mining (applying intelligent methods in order to extract data patterns)
* Pattern evaluation (identifying the truly interesting patterns representing knowledge, based on some measures)
* Knowledge representation (using knowledge representation techniques to present the mined knowledge to the user)

2.2 Data

Data can be any type of fact, text, image, or number that can be processed by a computer. Today's organizations are accumulating large and growing amounts of data in different formats and in different databases. This can include operational or transactional data, such as costs, sales, inventory, payroll, and accounting. It can also include non-operational data, such as industry sales and forecast data. Finally, it can include metadata, which is data about the data itself, such as logical database designs and data dictionary definitions.

2.3 Information

Information can be retrieved from the data via the patterns, associations, or relationships that may exist in the data. For example, retail point-of-sale transaction data can be analyzed to yield information about which products are being sold and when.

2.4 Knowledge

Knowledge can be retrieved from information via historical patterns and future trends. For example, analyzing retail supermarket sales data from the point of view of promotional efforts can provide knowledge of customer buying behavior.
Hence the items that are most susceptible to promotional efforts can be determined easily by the manufacturer.

2.5 Data Warehouse

Advances in data capture, processing power, data transmission, and storage technologies are enabling industry to integrate heterogeneous databases into data warehouses. The process of centralizing and retrieving the data is called data warehousing. Data warehousing is a new term, but the concept is somewhat older. A data warehouse is a store of a massive amount of data in electronic form. Data warehousing represents an ideal way of maintaining a central repository for all organizational data. The purpose of a data warehouse is to maximize user access and analysis. Data from different sources is extracted, transformed, and then loaded into the data warehouse. Users and clients can generate different types of reports and perform business analysis by accessing the data warehouse.

Data mining is primarily used today by companies with a strong consumer focus: retail, financial, communication, and marketing organizations. It allows these organizations to evaluate associations between certain internal and external factors. Product positioning, price, and staff skills are examples of internal factors; economic indicators, customer demographics, and competition are examples of external factors. It also allows them to determine the impact on sales, corporate profits, and customer satisfaction. Furthermore, it allows them to move from summarized information to a view of the detailed transactional data. Given databases of sufficient size and quality, data mining technology can generate new business opportunities.

Data mining normally automates the search for predictive information in huge databases. Questions that traditionally required extensive hands-on analysis can now be answered quickly and directly from the data. Targeted marketing is an example of a predictive problem.
Data mining uses data on previous promotional mailings to identify the targets most likely to maximize return on investment in future mailings. Data mining tools traverse huge databases and discover previously unseen patterns in a single step. Analyzing retail sales data to recognize apparently unrelated products that are commonly purchased together is one example. Other pattern discovery problems include identifying fraudulent credit card transactions and identifying anomalous data that could indicate data entry errors. When data mining tools run on high-performance parallel processing systems, they can analyze huge databases in very little time. Faster processing means that users can experiment in more detail in order to understand complex data. High speed and quick response make it practical for users to examine huge amounts of data. Larger databases, in turn, yield better predictions.

2.6 Descriptive and Predictive Data Mining

Descriptive data mining aims to find patterns in the data that provide some information about what the data contains. It describes patterns in existing data and is generally used to create meaningful subgroups such as demographic clusters. Examples of descriptions include summaries and visualization, clustering, and link analysis. Predictive data mining is used to forecast explicit values based on patterns determined from known results. For example, from a database of clients who have already responded to a specific offer, a model can be built that predicts which prospects are most likely to respond to the same offer. The term is usually applied to data mining projects whose goal is to identify a statistical or neural network model, or set of models, that can be used to predict some response of interest.
For example, a credit card company may want to engage in predictive data mining to derive a (trained) model, or set of models, that can quickly identify transactions with a high probability of being fraudulent. Other types of data mining projects may be more exploratory in nature (e.g. determining the clusters or segments of customers), in which case drill-down descriptive and tentative methods need to be applied. Predictive data mining is goal oriented. It can be decomposed into the following major tasks.

* Data preparation
* Data reduction
* Data modeling and prediction
* Case and solution analysis

2.7 Text Mining

Text mining, sometimes also called text data mining, is roughly equivalent to text analytics. It is the process of deriving high-quality information from text. High-quality information is typically derived by discovering patterns and trends through means such as statistical pattern learning. Text mining usually involves structuring the input text (usually parsing, along with the addition of some derived linguistic features, the removal of others, and subsequent insertion into a database), deriving patterns within the structured data, and finally evaluating and interpreting the output. "High quality" in text mining usually refers to some combination of relevance, novelty, and interestingness. Text mining tasks include text categorization, concept/entity extraction, text clustering, sentiment analysis, production of rough taxonomies, entity relation modeling, and document summarization.

Text mining is also described as the discovery by computer of new, previously unknown information by automatically extracting information from different written resources. Linking the extracted information together is the key to forming new facts or new hypotheses, which can then be examined further by more conventional means of experimentation.
In text mining, the goal is to discover unknown information, something that no one yet knows and so could not yet have written down. The difference between ordinary data mining and text mining is that in text mining the patterns are retrieved from natural language text instead of from structured databases of facts. Databases are designed for programs to process automatically; text is written for people to read. Most researchers think that a full-fledged simulation of how the brain works will be needed before programs that read the way people do can be written.

2.8 Web Mining

Web mining is the technique used to extract and discover information from web documents and services automatically. The interest of various research communities, the tremendous growth of information resources on the Web, and the recent interest in e-commerce have made this a very large area of research. Web mining can usually be decomposed into the following subtasks.

* Resource finding: fetching the intended web documents.
* Information selection and pre-processing: automatically selecting and pre-processing specific information from the fetched web resources.
* Generalization: automatically discovering general patterns at individual websites and across multiple websites.
* Analysis: validation and explanation of the mined patterns.

Web mining can be categorized into three main areas of interest, based on which part of the Web needs to be mined: web content mining, web structure mining, and web usage mining. Web content mining describes the discovery of useful information from web contents, data, and documents [10]. In the past, the internet consisted only of different types of services and data resources. Today most data is available over the internet, and even digital libraries are available on the Web. Web contents consist of several types of data, including text, images, audio, video, and metadata, as well as hyperlinks.
Most companies are trying to transform their business and services into electronic form and put them on the Web. As a result, company databases that previously resided on legacy systems are now accessible over the Web, so employees, business partners, and even end clients are able to access them. Users access applications through web interfaces, which is why most companies are trying to move their business onto the Web: the internet is capable of making a connection to any other computer anywhere in the world [11]. Some web contents are hidden and hence cannot be indexed. Data generated dynamically from the results of database queries, and private data, fall into this area. Unstructured data such as free text, semi-structured data such as HTML, and fully structured data such as tables or database-generated web pages can all be considered in this category. However, unstructured text makes up most of the web contents. Work on web content mining is mostly done from two points of view: the information retrieval (IR) view and the database (DB) view. From the IR view, web content mining assists and improves information finding or filtering for the user. From the DB view, web content mining models the data on the Web and integrates it so that queries more sophisticated than keyword searches can be performed [10].

In web structure mining, we are more concerned with the structure of the hyperlinks within the Web itself, which can be called the inter-document structure [10]. It is closely related to web usage mining [14]. Pattern detection and graph mining are essentially related to web structure mining. Link analysis techniques can be used to determine the patterns in the graph. Search engines such as Google usually use web structure mining. For example, the links are mined, and one can then determine the web pages that point to a particular web page.
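Finding the pages that point to a particular page amounts to inverting the link graph. The sketch below is a minimal illustration that assumes the mined links are available as (source, target) pairs; the function name and the sample URLs are hypothetical, and it is not a description of how any particular search engine works.

```python
from collections import defaultdict


def inbound_links(links):
    """Map each target page to the set of pages that link to it.

    links is assumed to be an iterable of (source, target) pairs
    extracted from hyperlinks; self-links are ignored.
    """
    incoming = defaultdict(set)
    for source, target in links:
        if source != target:
            incoming[target].add(source)
    return incoming


if __name__ == "__main__":
    # Hypothetical mined hyperlinks.
    mined = [
        ("a.example/home", "c.example/article"),
        ("b.example/blog", "c.example/article"),
        ("c.example/article", "a.example/home"),
    ]
    pointing = inbound_links(mined)
    print(sorted(pointing["c.example/article"]))
```

Using sets means that repeated links from the same page are counted once, which is a design choice of this sketch rather than something the text prescribes.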
When a string is searched, the web page with the largest number of links pointing to it may come first in the list. That is why web pages are listed by rank, where the rank of a page is calculated from the ranks of the web pages pointing to it [14]. Based on the web structural data, web structure mining can be divided into two categories. The first kind deals with extracting patterns from the hyperlinks in the Web. A hyperlink is a structural component that connects a web page to a different web page or a different location. The other kind deals with the document structure, using the tree-like structure to analyze and describe the HTML or XML tags within the web pages.

With the continuous growth of e-commerce, web services, and web applications, the volume of clickstream and user data collected by web-based organizations in their daily operations has increased. Organizations can analyze such data to determine the lifetime value of clients, design cross-marketing strategies, and so on [13]. Web usage mining deals with the data generated by users' clickstreams. Web usage data includes web server access logs, proxy server logs, browser logs, user profiles, registration data, user sessions, transactions, cookies, user queries, bookmark data, mouse clicks and scrolls, and any other data resulting from interaction [10]. Web usage mining is therefore considered the most important task of web mining [12]. Weblog databases can provide rich information about web dynamics. In web usage mining, web log records are mined to discover user access patterns, through which potential customers can be identified, the quality of internet services can be enhanced, and web server performance can be improved. Many techniques can be developed to implement web usage mining, but the success of such applications depends on what, and how much, valid and reliable knowledge can be discovered from the log data.
Most often, web logs are cleaned, condensed, and transformed before any useful and significant information is extracted from them. Web mining can be performed on web log records to find association patterns, sequential patterns, and trends in web access. The overall web usage mining process can be divided into three interdependent stages: data collection and pre-processing, pattern discovery, and pattern analysis [13]. In the data collection and pre-processing stage, the raw data is collected, cleaned, and transformed into a set of user transactions that represent the activities of each user during visits to the website. In the pattern discovery stage, statistical, database, and machine learning operations are performed to retrieve hidden patterns representing the typical behavior of users, as well as summary statistics on web resources, sessions, and users.

3 Classification

3.1 What is Classification?

As the quantity and variety of available data increase, robust, efficient, and versatile data categorization techniques are needed for exploration [16]. Classification is a method of assigning class labels to patterns. It is a data mining methodology used to predict group membership for data instances. For example, one may want to use classification to guess whether the weather on a specific day will be sunny, cloudy, or rainy. Data mining techniques that separate groups of similar data objects or points from other groups are called clustering. Classification uses the attribute values found in the data of one class to distinguish it from other types or classes. Data classification is mainly concerned with the treatment of large datasets. In classification we build a model by analyzing the existing data and describing the characteristics of the various classes of data. We can then use this model to predict the class or type of new data.
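The idea of building a model from existing data and using it to predict the class of new data can be sketched with a nearest-centroid classifier. This is only one simple way to illustrate the idea, not the method used in this project; the weather features (humidity and cloud cover in percent) and all sample values are invented for illustration.

```python
from collections import defaultdict
from math import dist


def fit_centroids(samples, labels):
    """Build the model: one mean vector (centroid) per class label."""
    sums = {}
    counts = defaultdict(int)
    for features, label in zip(samples, labels):
        if label not in sums:
            sums[label] = [0.0] * len(features)
        sums[label] = [s + v for s, v in zip(sums[label], features)]
        counts[label] += 1
    return {label: [s / counts[label] for s in total]
            for label, total in sums.items()}


def predict(centroids, features):
    """Predict the class whose centroid is closest in Euclidean distance."""
    return min(centroids, key=lambda label: dist(centroids[label], features))


if __name__ == "__main__":
    # Hypothetical training data: (humidity %, cloud cover %) per day.
    days = [(30, 10), (35, 20), (60, 70), (65, 80), (90, 95), (85, 90)]
    weather = ["sunny", "sunny", "cloudy", "cloudy", "rainy", "rainy"]
    model = fit_centroids(days, weather)
    print(predict(model, (88, 92)))
```

Here the centroids play the role of the "characteristics of the various classes", and prediction simply compares a new record against them.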
Classification is a supervised machine learning procedure in which individual items are placed in a group based on quantitative information about one or more of their characteristics. Decision trees and Bayesian networks are examples of classification methods. Clustering can be regarded as one type of classification: it is the process of finding similar data objects or points within a given dataset. This similarity can be expressed in terms of distance measures or any other parameter, depending on the need and the given data.

Classification is an ancient term as well as a modern one, since the classification of animals, plants, and other physical objects is still valid today. Classification is a way of thinking about things rather than a study of the things themselves, so it draws its theory and application from the complete range of human experience and thought [18]. In a broader picture, classification can include grouping medical patients by disease, retrieving the set of images containing a red rose from an image database, retrieving the set of documents describing classification from a document or text database, diagnosing equipment malfunction by cause, and rating loan applicants by their likelihood of payment. In the last case, for example, the problem is to predict a new applicant's loan eligibility given old data about customers. There are many techniques for data categorization or classification. The most common are the decision tree classifier and Bayesian classifiers.

3.2 Types of Classification

There are two types of classification: supervised classification and unsupervised classification. Supervised learning is a machine learning technique for discovering a function from training data. The training data contains pairs of input objects and their desired outputs. The output of the function can be a continuous value, in which case the task is called regression, or it can be a class label for the input object, in which case the task is called classification.
The task of the supervised learner is to predict the value of the function for any valid input object after having seen a number of training examples (i.e. pairs of input and target output). To achieve this goal, the learner needs to generalize from the presented data to unseen situations in a meaningful way.

Unsupervised learning is a class of machine learning problems in which one seeks to determine how the data is organized. It is distinguished from supervised learning in that the learner is given only unlabeled examples. Unsupervised learning is closely related to the problem of density estimation in statistics. However, unsupervised learning also covers many other techniques that are used to summarize and explain key features of the data. One form of unsupervised learning is clustering, which will be covered in the next chapter. Blind source separation based on Independent Component Analysis is another example. Neural network models, adaptive resonance theory, and self-organizing maps are the most commonly used unsupervised learning algorithms. There are also many techniques for implementing supervised classification. We will discuss the two that are most commonly used: decision tree classifiers and Naive Bayesian classifiers.

3.2.1 Decision Trees Classifier

There are many ways to represent classifiers. The decision tree is probably the most widely used approach for this purpose, and it is one of the most widely used supervised learning methods for data exploration. It is easy to use, can be represented as if-then-else statements or rules, and works well even on noisy data [16]. A decision tree uses a tree-like graph or model of decisions and their possible consequences, including chance event outcomes, resource costs, and utilities. Decision trees are most commonly used in decision analysis and operations research, to help identify the strategy most likely to reach a target.
In machine learning and data mining, a decision tree is used as a predictive model, meaning a mapping from observations about an item to conclusions about its target value. More descriptive names for such tree models are classification tree and regression tree. In these tree structures, leaves represent classifications and branches represent the conjunctions of features that lead to those classifications. The machine learning technique for inducing a decision tree from data is called decision tree learning, or simply decision trees. Decision trees are a simple but powerful form of multiple-variable analysis [15]. Classification is done by tree-like structures that have a different test criterion for a variable at each of the nodes. New leaves are generated based on the results of the tests at the nodes. A decision tree is a supervised learning system in which classification rules are constructed from the tree. Decision trees are produced by algorithms that identify various ways of splitting a dataset into branch-like segments. A decision tree tries to find a strong relationship between the input and target values within the dataset [15].

In classification tasks, decision trees normally visualize the steps that should be taken to reach a classification. Every decision tree starts with a node called the root node, which is considered the parent of every other node. Each node in the tree evaluates an attribute of the data and decides which path to follow. Typically the decision test is a comparison of a value against some constant. Classification with a decision tree is done by traversing from the root node down to a leaf node. Decision trees are able to represent and classify diverse types of data. The simplest and most familiar form of data is numerical data. Organizing nominal data is also required in many situations. Nominal quantities are normally represented by a distinguishable set of symbols.
For example, weather conditions can be described in either a nominal or a numeric fashion. Temperature can be quantified by saying that it is eleven degrees Celsius or fifty-two degrees Fahrenheit. The terms cool, mild, cold, warm, and hot can also be used. The former is numeric data, while the latter is an example of nominal data. More precisely, cool, mild, cold, warm, and hot are a special type of nominal data known as ordinal data. Ordinal data usually carries an implicit assumption of an ordered relationship among the values. In the weather example, purely nominal descriptions such as rainy, overcast, and sunny can also be added. These values have no ordering relationships or distance measures among each other.

Decision trees are trees in which each node is a question, each branch is an answer to a question, and each leaf is a result. Here is an example of a decision tree.

Roughly, the idea is that we make different purchasing decisions depending on the number of stock items we have. If we do not have much stock, we buy at any cost. If we have a lot of items, we buy only if the price is low. If there are fewer than 10 items in stock, buy all if the unit price is less than 10; otherwise buy only 10 items. If there are 10 to 40 items in stock, check the unit price: if it is less than 5, buy only 5 items; otherwise there is no need to buy anything expensive, since the stock is already good. If there are more than 40 items in stock, buy 5 if and only if the price is less than 2; otherwise there is no need to buy overly expensive items. In this way decision trees help us make a decision at each level.

Here is another example of a decision tree, representing the risk factor associated with rash driving.

The root node at the top of the tree structure covers the feature that is split first, for the highest discrimination. The internal nodes show decision rules on one or more attributes, while the leaf nodes are class labels.
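Because a decision tree can be written as if-then-else rules, the stock purchasing example above can be sketched as nested conditionals. The function name and the returned labels are illustrative assumptions, and the treatment of values exactly on a threshold follows a literal reading of "less than" in the text.

```python
def purchase_decision(stock: int, unit_price: float) -> str:
    """Walk the stock purchasing tree from the root (stock level) to a leaf."""
    if stock < 10:
        # Low stock: buy regardless of price, but cap the quantity if expensive.
        return "buy all" if unit_price < 10 else "buy 10"
    if stock <= 40:
        # Medium stock: buy a few only when the price is low.
        return "buy 5" if unit_price < 5 else "buy nothing"
    # High stock: buy only when the price is very low.
    return "buy 5" if unit_price < 2 else "buy nothing"


if __name__ == "__main__":
    for stock, price in [(5, 8), (5, 12), (25, 4), (25, 6), (50, 1), (50, 3)]:
        print(stock, price, purchase_decision(stock, price))
```

Each `if` corresponds to an internal node and each returned string to a leaf, so reading the function top to bottom mirrors a traversal of the tree.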
A person younger than 20 has very high risk, while a person older than 30 has very low risk. For the middle category, a person older than 20 but younger than 30, the risk depends on another attribute, the car type. If the car is a sports car, there is again high risk involved, while if a family car is used there is low risk involved.

In the sciences, engineering and applied areas including business intelligence and data mining, many useful features are being introduced as a result of the evolution of decision trees.

* With the help of transformation in decision trees, the volume of data can be reduced into a more compact form that preserves the major characteristic

Identifying Clusters in High Dimensional Data

(Ask those who remember, are mindful if you do not know). (Holy Quran, 643)

Removal Of Redundant Dimensions To Find Clusters In N-Dimensional Data Using Subspace Clustering

Abstract

Data mining has emerged as a powerful tool to extract knowledge from huge databases. Researchers have introduced several machine learning algorithms to explore databases and discover information, hidden patterns, and rules that were not known at the time the data was recorded. Due to remarkable developments in storage capacity, processing power and algorithmic tools, practitioners are developing new and improved algorithms and techniques in several areas of data mining to discover the rules and relationships among the attributes in simple and in complex, higher dimensional databases. Furthermore, data mining is applied in a large variety of areas ranging from banking to marketing, from engineering to bioinformatics, and from investment to risk analysis and fraud detection. Practitioners are analyzing and implementing artificial neural network techniques for classification and regression problems because of their accuracy and efficiency.
The aim of this short research project is to develop a way of identifying the clusters in high dimensional data, as well as the redundant dimensions that create noise when identifying those clusters. The technique used in this project uses the projections of the data points along the dimensions to measure the intensity of projection along each dimension, in order to find the clusters and the redundant dimensions in high dimensional data.

1 Introduction

In numerous scientific settings, engineering processes, and business applications, ranging from experimental sensor data and process control data to telecommunication traffic observation and financial transaction monitoring, huge amounts of high-dimensional measurement data are produced and stored. Whereas sensor equipment and large storage devices are getting cheaper day by day, data analysis tools and techniques lag behind. Clustering methods are common solutions to unsupervised learning problems where neither expert knowledge nor helpful annotation of the data is available. In general, clustering groups the data objects in such a way that similar objects end up together in clusters, whereas objects from different clusters are highly dissimilar. However, it is often observed that clustering discloses almost no structure even when it is known that there must be groups of similar objects. In many cases, the reason is that the cluster structure is induced by only some subsets of the space's dimensions, and the many additional dimensions contribute nothing but noise that hinders the discovery of the clusters within the data. As a solution to this problem, clustering algorithms are applied to the relevant subspaces only. Immediately, the new question is how to determine the relevant subspaces among the dimensions of the full space.
Being faced with the power set of the set of dimensions, a brute-force trial of all subsets is infeasible because their number grows exponentially with the original dimensionality.

In high dimensional data, as the number of dimensions increases, the visualization and representation of the data become more difficult, and sometimes the increase in dimensions can create a bottleneck. More dimensions mean more visualization and representation problems. As the dimensions increase, the data appear to disperse towards the corners of the space. Subspace clustering addresses both problems in parallel. It addresses the problem of determining the relevant subspaces, with dimensions outside them marked as redundant in high dimensional data. It also solves the problem of finding the cluster structures within the dataset that become apparent in these subspaces. Subspace clustering is an extension of traditional clustering that automatically finds the clusters present in subspaces of a high dimensional data space. It allows the data points to be clustered better than in the original space, and it works even when the curse of dimensionality occurs. Most clustering algorithms have been designed to discover clusters in the full dimensional space, so they are not effective at identifying clusters that exist within a subspace of the original data space. Most clustering algorithms also produce results that depend on the order in which the input records were processed [2].

Subspace clustering can identify the different clusters within subspaces that exist in huge amounts of sales data, and through it we can find which of the different attributes are related. This can be useful in promoting sales and in planning the inventory levels of different products. It can also be used for finding subspace clusters in spatial databases, and useful decisions can be taken based on the subspace clusters identified [2].
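To make the exponential growth of candidate subspaces discussed in this section concrete, here is a minimal sketch in Python. It is only an illustration, not the technique used in this project: the function names candidate_subspaces and count_subspaces and the dimension names x, y and z are assumptions chosen for the example.

```python
from itertools import combinations


def candidate_subspaces(dimensions):
    """Yield every non-empty subset of the given dimension names."""
    for size in range(1, len(dimensions) + 1):
        for subset in combinations(dimensions, size):
            yield subset


def count_subspaces(d):
    """Number of non-empty subsets of d dimensions: 2**d - 1."""
    return 2 ** d - 1


# Hypothetical three-dimensional data set with dimensions x, y and z.
subspaces = list(candidate_subspaces(["x", "y", "z"]))
print(subspaces)

# The count roughly doubles with every added dimension.
for d in (3, 10, 20, 50):
    print(d, count_subspaces(d))
```

With only three dimensions there are seven candidate subspaces, but the count grows as 2^d - 1, which is why a brute-force search over all subsets quickly becomes impractical.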
The technique used here to identify the redundant dimensions that create noise in the data, and thereby to identify the clusters, consists of several steps. In the first step, the data points are plotted in all dimensions. In the second step, the projections of all data points along each dimension are plotted. In the third step, the unions of projections are plotted using all possible combinations of dimensions, and finally the union of all projections along all dimensions is plotted and analyzed. This shows the contribution of each dimension to identifying the clusters, which is represented by the weight of projection. If a given dimension contributes very little to the weight of projection, that dimension can be considered redundant, which means it is not important for identifying the clusters in the given data. The details of this strategy will be covered in later chapters.

2 Data Mining

2.1 What is Data Mining?

Data mining is the process of analyzing data from different perspectives and summarizing it into useful information. The information can be used for many purposes, such as increasing revenue or cutting costs. The data mining process also finds hidden knowledge and relationships within the data that were not known when the data was recorded. Describing the data is the first step in data mining, followed by summarizing its attributes (such as mean and standard deviation). After that, the data is reviewed using visual tools like charts and graphs, and then meaningful relations are determined. In the data mining process, the steps of collecting, exploring and selecting the right data are critically important. Users can analyze data from different dimensions, categorize it and summarize it. Data mining finds correlations or patterns among the fields in large databases.

Data mining has great potential to help companies focus on the important information in their data warehouses.
It can predict future trends and behaviors and allows businesses to make more proactive, knowledge-driven decisions. It can answer business questions that were traditionally too time consuming to resolve. It scours databases for hidden patterns, finding predictive information that experts may miss because it lies beyond their expectations. Data mining is normally used to transform data into information or knowledge. It is commonly used in a wide range of profiling practices such as marketing, fraud detection and scientific discovery. Many companies already collect and refine their data. Data mining techniques can be implemented on existing platforms to enhance the value of information resources. Data mining tools can analyze massive databases to deliver answers to such questions.

Some other terms carry a meaning similar to data mining, such as knowledge mining, knowledge extraction or pattern analysis. Data mining can also be treated as Knowledge Discovery from Data (KDD). Some people regard data mining simply as an essential step in knowledge discovery from large data. The process of knowledge discovery from data consists of the following steps.

* Data cleaning (removing noise and inconsistent data)
* Data integration (combining multiple data sources)
* Data selection (retrieving the data relevant to the analysis task from the database)
* Data transformation (transforming the data into forms appropriate for mining by performing summary or aggregation operations)
* Data mining (applying intelligent methods in order to extract data patterns)
* Pattern evaluation (identifying the truly interesting patterns representing knowledge, based on some measures)
* Knowledge representation (using knowledge representation techniques to present the mined knowledge to the user)

2.2 Data

Data can be any type of fact, text, image or number that can be processed by a computer.
Today's organizations are accumulating large and growing amounts of data in different formats and in different databases. This can include operational or transactional data such as costs, sales, inventory, payroll and accounting. It can also include nonoperational data such as industry sales and forecast data. It can also include metadata, which is data about the data itself, such as logical database designs and data dictionary definitions.

2.3 Information

Information can be retrieved from the data via the patterns, associations or relationships that may exist in the data. For example, retail point-of-sale transaction data can be analyzed to yield information about which products are being sold and when.

2.4 Knowledge

Knowledge can be retrieved from information via historical patterns and future trends. For example, analyzing retail supermarket sales data in light of promotional efforts can provide knowledge of customer buying behavior. Hence a manufacturer can easily determine which items are most susceptible to promotional efforts.

2.5 Data warehouse

Advances in data capture, processing power, data transmission and storage technologies are enabling organizations to integrate their various databases into data warehouses. The process of centralizing and retrieving the data is called data warehousing. Data warehousing is a new term, but the concept is somewhat older. A data warehouse is a store of massive amounts of data in electronic form. Data warehousing represents an ideal way of maintaining a central repository for all organizational data. The purpose of a data warehouse is to maximize user access and analysis. The data from different data sources are extracted, transformed and then loaded into the data warehouse.
Users and clients can generate different types of reports and can do business analysis by accessing the data warehouse.

Data mining is primarily used today by companies with a strong consumer focus: retail, financial, communication, and marketing organizations. It allows these organizations to evaluate associations between certain internal and external factors. Product positioning, price or staff skills are examples of internal factors. Examples of external factors are economic indicators, customer demographics and competition. It also allows them to calculate the impact on sales, corporate profits and customer satisfaction. Furthermore, it allows them to drill down into summary information to view detailed transactional data. Given databases of sufficient size and quality, data mining technology can generate new business opportunities.

Data mining automates the procedure of searching for predictive information in huge databases. Questions that traditionally required extensive hands-on analysis can now be answered directly from the data very quickly. Targeted marketing is an example of a predictive problem. Data mining uses data on previous promotional mailings to identify the targets most likely to maximize return on investment in future mailings. Data mining tools traverse huge databases and discover previously unseen patterns in a single step. Analyzing retail sales data to recognize apparently unrelated products that are usually purchased together is an example. Other pattern discovery problems include identifying fraudulent credit card transactions and identifying irregular data that could indicate data entry errors. When data mining tools are used on high performance parallel processing systems, they are able to analyze huge databases in a very short time.
Faster processing means that users can automatically experiment with more details to understand complex data. High speed and quick response make it practical for users to examine huge amounts of data. Huge databases, in turn, give improved predictions.

2.6 Descriptive and Predictive Data Mining

Descriptive data mining aims to find patterns in the data that provide some information about what the data contains. It describes patterns in existing data, and is generally used to create meaningful subgroups such as demographic clusters. Examples of descriptive techniques are summaries and visualization, clustering, and link analysis. Predictive data mining is used to forecast explicit values, based on patterns determined from known results. For example, from a database of clients who have already responded to a specific offer, a model can be built that predicts which prospects are most likely to respond to the same offer. The term is usually applied to data mining projects whose goal is to identify a statistical or neural network model, or set of models, that can be used to predict some response of interest. For example, a credit card company may want to engage in predictive data mining to derive a (trained) model or set of models that can quickly identify transactions with a high probability of being fraudulent. Other types of data mining projects may be more exploratory in nature (e.g. determining the clusters or segments of customers), in which case drill-down descriptive and exploratory methods need to be applied. Predictive data mining is goal oriented. It can be decomposed into the following major tasks.

* Data Preparation
* Data Reduction
* Data Modeling and Prediction
* Case and Solution Analysis

2.7 Text Mining

Text mining is sometimes also called text data mining, which is more or less equivalent to text analytics. Text mining is the process of deriving high quality information from text.
High quality information is typically derived by discovering patterns and trends through means such as statistical pattern learning. Text mining usually involves structuring the input text (usually parsing, along with the addition of some derived linguistic features and the removal of others, and subsequent insertion into a database), deriving patterns within the structured data, and finally evaluating and interpreting the output. High quality in text mining usually refers to some combination of relevance, novelty, and interestingness. Text mining tasks include text categorization, concept/entity extraction, text clustering, sentiment analysis, production of rough taxonomies, entity relation modeling, and document summarization.

Text mining is also known as the discovery by computer of new, previously unknown information, by automatically extracting information from different written resources. Linking together the extracted information is the key element in creating new facts or new hypotheses to be examined further by more conventional means of experimentation. In text mining, the goal is to discover unknown information, something that no one yet knows and so could not yet have written down. The difference between ordinary data mining and text mining is that in text mining the patterns are retrieved from natural language text instead of from structured databases of facts. Databases are designed for programs to process automatically; text is written for people to read. Most researchers think that a full-fledged simulation of how the brain works will be needed before programs that read the way people do can be written.

2.8 Web Mining

Web mining is the technique used to extract and discover information from web documents and services automatically.
The interest of various research communities, the tremendous growth of information resources on the Web and the recent interest in e-commerce have made this a very large area of research. Web mining can usually be decomposed into the following subtasks.

* Resource finding: fetching the intended web documents.
* Information selection and pre-processing: automatically selecting and preprocessing specific information from the fetched web resources.
* Generalization: automatically discovering general patterns at individual web sites and across multiple sites.
* Analysis: verification and explanation of the mined patterns.

Web mining can be categorized into three areas of interest based on which part of the Web needs to be mined: Web Content Mining, Web Structure Mining and Web Usage Mining. Web Content Mining describes the discovery of useful information from web contents, data and documents [10]. In the past the internet consisted only of different types of services and data resources. Today most data is available over the internet, and even digital libraries are available on the Web. Web contents consist of several types of data including text, image, audio, video and metadata, as well as hyperlinks. Most companies are trying to transform their business and services into electronic form and put them on the Web. As a result, company databases that previously resided on legacy systems are now accessible over the Web. Thus employees, business partners and even end clients are able to access a company's databases over the Web. Users access applications through their web interfaces, which is why most companies are trying to move their business onto the Web: the internet is capable of connecting to any other computer anywhere in the world [11]. Some web contents are hidden and hence cannot be indexed. Dynamically generated data from the results of database queries, and private data, fall into this area.
Unstructured data such as free text, semi-structured data such as HTML, and fully structured data such as data in tables or database-generated web pages can all be considered in this category. However, unstructured text is what is mostly found in web contents. Work on web content mining is mostly done from two points of view, the information retrieval (IR) view and the database (DB) view. From the IR view, web content mining assists and improves information finding or filtering for the user. From the DB view, web content mining models the data on the Web and integrates it so that queries more sophisticated than keyword search can be performed [10].

In Web Structure Mining, we are more concerned with the structure of hyperlinks within the Web itself, which can be called the inter-document structure [10]. It is closely related to web usage mining [14]. Pattern detection and graph mining are essentially related to web structure mining. Link analysis techniques can be used to determine the patterns in the graph. Search engines like Google usually use web structure mining. For example, the links are mined and one can then determine the web pages that point to a particular web page. When a string is searched, the web page with the largest number of links pointing to it may come first in the list. That is why web pages are listed based on a rank that is calculated from the ranks of the web pages pointing to them [14]. Based on the web structural data, web structure mining can be divided into two categories. The first kind of web structure mining deals with extracting patterns from the hyperlinks in the Web. A hyperlink is a structural component that connects a web page to a different web page or a different location.
The other kind of web structure mining deals with the document structure, using a tree-like structure to analyze and describe the HTML or XML tags within the web pages.

With the continuous growth of e-commerce, web services and web applications, the volume of clickstream and user data collected by web-based organizations in their daily operations has increased. Organizations can analyze such data to determine the lifetime value of clients, design cross-marketing strategies, etc. [13]. Web usage mining deals with the data generated by users' clickstreams. Web usage data includes web server access logs, proxy server logs, browser logs, user profiles, registration data, user sessions, transactions, cookies, user queries, bookmark data, mouse clicks and scrolls, and any other data resulting from interaction [10]. Web usage mining is therefore regarded as the most important task of web mining [12]. Weblog databases can provide rich information about web dynamics. In web usage mining, web log records are mined to discover user access patterns, through which potential customers can be identified, the quality of internet services can be enhanced and web server performance can be improved. Many techniques can be developed to implement web usage mining, but it is important to know that the success of such applications depends on what and how much valid and reliable knowledge can be discovered from the log data. Most often, the web logs are cleaned, condensed and transformed before any useful and significant information is extracted from them. Web mining can be performed on web log records to find association patterns, sequential patterns and trends in web access. The overall web usage mining process can be divided into three inter-dependent stages: data collection and pre-processing, pattern discovery, and pattern analysis [13].
In the data collection and pre-processing stage, the raw data is collected, cleaned and transformed into a set of user transactions that represent the activities of each user during visits to the web site. In the pattern discovery stage, statistical, database, and machine learning operations are performed to retrieve hidden patterns representing the typical behavior of users, as well as summary statistics on Web resources, sessions, and users.

3 Classification

3.1 What is Classification?

As the quantity and variety of available data increase, robust, efficient and versatile data categorization techniques are needed for exploration [16]. Classification is a method of assigning class labels to patterns. It is a data mining methodology used to predict group membership for data instances. For example, one may want to use classification to guess whether the weather on a specific day will be sunny, cloudy or rainy. Data mining techniques that group similar data objects or points and separate them from others are called clustering. Classification uses the attribute values found in the data of one class to distinguish it from other types or classes. Data classification is mainly concerned with the treatment of large datasets. In classification we build a model by analyzing the existing data and describing the characteristics of the various classes of data. We can then use this model to predict the class or type of new data. Classification is a supervised machine learning procedure in which individual items are placed in a group based on quantitative information on one or more characteristics of the items. Decision trees and Bayesian networks are examples of classification methods. Clustering can be regarded as an unsupervised type of classification. It is the process of finding similar data objects or points within the given dataset.
This similarity can be defined in terms of distance measures or any other parameter, depending upon the need and the given data.

Classification is an ancient term as well as a modern one, since the classification of animals, plants and other physical objects is still valid today. Classification is a way of thinking about things rather than a study of the things themselves, so it draws its theory and application from the complete range of human experience and thought [18]. In the bigger picture, classification can cover medical patients grouped by disease, a set of images containing a red rose from an image database, a set of documents describing classification from a document/text database, equipment malfunctions grouped by cause, and loan applicants grouped by their likelihood of payment. In the latter case, for example, the problem is to predict a new applicant's loan eligibility given old data about customers. There are many techniques used for data categorization and classification. The most common are decision tree classifiers and Bayesian classifiers.

3.2 Types of Classification

There are two types of classification. One is supervised classification and the other is unsupervised classification. Supervised learning is a machine learning technique for discovering a function from training data. The training data consists of pairs of input objects and their desired outputs. The output of the function can be a continuous value, which is called regression, or it can predict a class label for the input object, which is called classification. The task of the supervised learner is to predict the value of the function for any valid input object after having seen a number of training examples (i.e. pairs of input and target output). To achieve this goal, the learner needs to generalize from the presented data to unseen situations in a meaningful way.

Unsupervised learning is a class of problems in machine learning in which one seeks to determine how the data are organized.
It is distinguished from supervised learning in that the learner is given only unlabeled examples. Unsupervised learning is closely related to the problem of density estimation in statistics. However, unsupervised learning also covers many other techniques that are used to summarize and explain key features of the data. One form of unsupervised learning is clustering, which will be covered in the next chapter. Blind source separation based on Independent Component Analysis is another example. Neural network models, adaptive resonance theory and self-organizing maps are the most commonly used unsupervised learning algorithms. There are many techniques for implementing supervised classification. We will discuss the two most commonly used, which are decision tree classifiers and Naive Bayesian classifiers.

3.2.1 Decision Trees Classifier

There are many alternatives for representing classifiers. The decision tree is probably the most widely used approach for this purpose. It is one of the most widely used supervised learning methods for data exploration. It is easy to use, can be represented as if-then-else statements or rules, and can work well on noisy data [16]. Decision trees use a tree-like graph or model of decisions and their possible consequences, including resource costs, chance events, outcomes, and utilities. Decision trees are most commonly used in decision analysis and operations research, to help identify the strategy most likely to reach a target. In machine learning and data mining, a decision tree is used as a predictive model, meaning a mapping from observations about an item to conclusions about its target value. More descriptive names for such tree models are classification tree or regression tree. In these tree structures, leaves represent classifications and branches represent conjunctions of features that lead to those classifications.
The machine learning technique for inducing a decision tree from data is called decision tree learning, or simply decision trees. Decision trees are a simple but powerful form of multiple variable analysis [15]. Classification is done by tree-like structures that have different test criteria for a variable at each of the nodes. New leaves are generated based on the results of the tests at the nodes. A decision tree is a supervised learning system in which classification rules are constructed from the tree. Decision trees are produced by algorithms that identify various ways of splitting a data set into branch-like segments. A decision tree tries to find a strong relationship between input and target values within the dataset [15].

In classification tasks, decision trees visualize the steps that should be taken to reach a classification. Every decision tree starts with a node called the root node, which is considered to be the parent of every other node. Each node in the tree evaluates an attribute in the data and decides which path to follow. Typically the decision test is a comparison of a value against some constant. Classification with a decision tree is done by traversing from the root node until a leaf node is reached.

Decision trees are able to represent and classify diverse types of data. The simplest and most familiar form of data is numerical data. Organizing nominal data is also required in many situations. Nominal quantities are normally represented by a discrete set of symbols. For example, a weather condition can be described in either a nominal or a numeric fashion. Temperature can be quantified by saying that it is eleven degrees Celsius or fifty-two degrees Fahrenheit. The terms cool, mild, cold, warm or hot can also be used. The former is numeric data, while the latter is an example of nominal data.
More precisely, the example of cool, mild, cold, warm and hot is a special type of nominal data, known as ordinal data. Ordinal data usually carries an implicit assumption of ordered relationships among the values. In the weather example, purely nominal descriptions like rainy, overcast and sunny can also be added. These values have no relationships or distance measures among each other.

Decision trees are trees in which each node is a question, each branch is an answer to a question, and each leaf is a result. Here is an example of a decision tree.

Roughly, the idea is to make different purchasing decisions based on the number of stock items we have. If we do not have much, we buy at any cost. If we have a lot of items, we buy only if the price is low. If there are fewer than 10 items in stock, buy all if the unit price is less than 10, otherwise buy only 10 items. If there are 10 to 40 items in stock, check the unit price: if it is less than 5, buy only 5 items, otherwise there is no need to buy anything expensive since the stock is already good. If there are more than 40 items in stock, buy 5 if and only if the price is less than 2, otherwise there is no need to buy expensive items. In this way decision trees help us to make a decision at each level.

Here is another example of a decision tree, representing the risk factor associated with rash driving.

The root node at the top of the tree structure shows the feature that is split first for highest discrimination. The internal nodes show decision rules on one or more attributes, while leaf nodes are class labels. A person younger than 20 has very high risk, while a person older than 30 has very low risk. For the middle category, a person older than 20 but younger than 30, the risk depends on another attribute, the car type.
If the car is a sports car, there is again high risk involved, while if a family car is used there is low risk involved.

In the sciences, engineering and applied areas including business intelligence and data mining, many useful features are being introduced as a result of the evolution of decision trees.

* With the help of transformation in decision trees, the volume of data can be reduced into a more compact form that preserves the major characteristic
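
The two example trees described in this section, the stock purchasing rule and the driving risk rule, can be sketched as nested conditionals, which is how a decision tree reads as if-then-else rules. This is only an illustrative sketch: the function names units_to_buy and driving_risk are hypothetical, and the handling of the exact boundary values (10 and 40 items, ages 20 and 30) and of unknown car types is an assumption, since the text does not specify it.

```python
def units_to_buy(stock_items, unit_price):
    """Stock purchasing tree: first split on stock level, then on price."""
    if stock_items < 10:
        # Low stock: always buy, the price only decides how much.
        return "buy all" if unit_price < 10 else "buy 10"
    if stock_items <= 40:
        # Medium stock (assumed to include exactly 10 and 40 items).
        return "buy 5" if unit_price < 5 else "buy nothing"
    # High stock: buy only when the price is very low.
    return "buy 5" if unit_price < 2 else "buy nothing"


def driving_risk(age, car_type):
    """Driving risk tree: the root splits on age, then on car type."""
    if age < 20:
        return "very high"
    if age > 30:
        return "very low"
    # Middle category (assumed to include ages 20 and 30 themselves).
    if car_type == "sports":
        return "high"
    if car_type == "family":
        return "low"
    # Assumption: car types not mentioned in the text are rejected.
    raise ValueError(f"unknown car type: {car_type!r}")


print(units_to_buy(5, 12))          # low stock, high price
print(driving_risk(25, "sports"))   # middle age group, sports car
```

Each if statement corresponds to an internal node of the tree and each return value to a leaf, so classifying an input is a single walk from the root to a leaf.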