Data Mining: Meaning and Approach

There is a huge amount of data that organizations generate, collect, and store. They are gradually relying more on new technologies to access, analyze, summarize, and interpret information intelligently. Data mining is the search for valuable information in large volumes of data. It can discover hidden relationships, patterns, and interdependencies and generate rules to predict the correlations, which can help the organizations make critical decisions faster or with a greater degree of confidence.

There is a wide range of data mining techniques, which have been successfully used in many applications. This article identifies three common application domains: bioinformatics, electronic commerce, and search engines. For each domain, how data mining can enhance its functions will be described. Subsequently, the limitations of current research will be addressed, followed by a discussion of directions for future research.

Data mining can be used to achieve many types of tasks. Based on the kinds of knowledge to be discovered, it can be broadly divided into supervised learning and unsupervised learning. The former requires the data to be pre-classified: each item is associated with a unique label, signifying the class to which the item belongs. In contrast, the latter does not require pre-classification of the data and can form groups that share common characteristics.

To achieve these two main tasks, four data mining approaches are commonly used:
1. Classification;
2. Clustering;
3. Association rules; and
4. Visualization.

Classification

Classification, which is a process of supervised learning, is an important issue in data mining. It refers to discovering predictive patterns where a predicted attribute is nominal or categorical. The predicted attribute is called the class. Subsequently, a data item is assigned to one of the predefined sets of classes by examining its attributes. One example of classification applications is to analyze the functions of genes on the basis of predefined classes that biologists set.
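
As a rough illustration of assigning an item to a predefined class by examining its attributes, the sketch below uses a one-nearest-neighbor rule. The choice of nearest-neighbor, the attribute vectors, and the labels class_a and class_b are all assumptions made for this example; the article does not prescribe a particular classifier.

```python
import math


def classify(item, training_set):
    """Return the label of the pre-classified record closest to `item`."""
    _, nearest_label = min(
        training_set, key=lambda record: math.dist(record[0], item)
    )
    return nearest_label


# Hypothetical pre-classified records: (attribute vector, class label).
training_set = [
    ((1.0, 1.2), "class_a"),
    ((0.8, 1.0), "class_a"),
    ((4.0, 4.2), "class_b"),
    ((4.5, 3.9), "class_b"),
]

print(classify((1.1, 0.9), training_set))
print(classify((4.2, 4.0), training_set))
```

The training set plays the role of the pre-classified data that supervised learning requires; a new item simply inherits the class of its most similar labeled record.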

Clustering

Clustering is also known as exploratory data analysis (EDA). This approach is used in situations where a training set of pre-classified records is unavailable. Objects are divided into groups based on their similarity. Each group, called a cluster, consists of objects that are similar to one another and dissimilar to objects in other groups. From a data mining perspective, clustering is an approach for unsupervised learning. One of the major applications of clustering is customer relationship management.
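
The idea of grouping objects by similarity without any labels can be sketched with a small k-means loop. K-means is only one of many clustering methods and is chosen here as an assumption; the customer records (two numeric attributes each) and the starting centroids are hypothetical.

```python
import math


def k_means(points, centroids, iterations=10):
    """Assign each point to its nearest centroid, then recompute centroids."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for point in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: math.dist(point, centroids[i]),
            )
            clusters[nearest].append(point)
        # Move each centroid to the mean of its cluster; an empty cluster
        # keeps its previous centroid.
        centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster))
            if cluster else centroid
            for cluster, centroid in zip(clusters, centroids)
        ]
    return clusters, centroids


# Hypothetical customers described by two numeric attributes.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
clusters, centers = k_means(points, [points[0], points[3]])
for cluster in clusters:
    print(cluster)
```

No class labels are supplied anywhere; the two groups emerge only from the distances between the records.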

Association Rules

Association rules, first proposed by Agrawal and Srikant (1994), are mainly used to find meaningful relationships between items or features that occur together in databases. This approach is useful when one has an idea of the different associations being sought, because all kinds of correlations can be found in a large data set. It has been widely applied to extract knowledge from Web log data. In particular, it is very popular among marketing managers and retailers in electronic commerce who want to find associative patterns among products.
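
To make the notion of an associative pattern among products concrete, the sketch below scores rules of the form "A implies B" with the two usual measures, support and confidence. It enumerates item pairs by brute force rather than using an optimized algorithm such as Apriori, and the baskets and thresholds are hypothetical.

```python
from itertools import combinations


def pair_rules(transactions, min_support=0.5, min_confidence=0.7):
    """Return (lhs, rhs, support, confidence) for item pairs above thresholds."""
    total = len(transactions)

    def support(items):
        # Fraction of transactions that contain every item in `items`.
        return sum(1 for t in transactions if items <= t) / total

    rules = []
    all_items = sorted(set().union(*transactions))
    for a, b in combinations(all_items, 2):
        pair_support = support({a, b})
        if pair_support < min_support:
            continue
        for lhs, rhs in ((a, b), (b, a)):
            confidence = pair_support / support({lhs})
            if confidence >= min_confidence:
                rules.append((lhs, rhs, pair_support, confidence))
    return rules


# Hypothetical shopping baskets.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
rules = pair_rules(baskets)
for lhs, rhs, sup, conf in rules:
    print(f"{lhs} -> {rhs}: support={sup:.2f}, confidence={conf:.2f}")
```

With these baskets, only the rule from butter to bread passes both thresholds: every basket containing butter also contains bread, while the reverse direction holds in only two of three bread baskets.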

Visualization

The visualization approach to data mining is based on an assumption that human beings are very good at perceiving structure in visual forms. The basic idea is to present the data in some visual form, allowing the human to gain insight from the data, draw conclusions, and directly interact with the data. Since the user is directly involved in the exploration process, shifting and adjusting the exploration goals is automatically done if necessary. This approach is especially useful when little is known about the data and the exploration goals are vague. One example of using visualization is author co-citation analysis.

(Source: “Data Mining”, by Sherry Y. Chen and Xiaohui Liu, Brunel University, UK)

Computer Systems Security: Authentication Methods

The rapid growth of e-commerce has increased the demand for reliable computer security. Most computer systems are protected through a process of user identification and authentication. Identification is usually non-private information that users provide to identify themselves and that may be known to system administrators and other system users, whereas authentication relies on secret, private user information that verifies their identity. Three main authentication approaches are frequently used, and this article briefly describes them.

The authentication approaches can be classified into three types according to the distinguishing characteristics they use:
1. What the user KNOWS: knowledge-based authentication (e.g., password, PIN, pass code);
2. What the user HAS: possession-based authentication (e.g., memory card and smart card tokens); and
3. What the user IS: biometric-based authentication, using physiological (e.g., fingerprint) or behavioral (e.g., keyboard dynamics) characteristics.

Knowledge-based Authentication

The most widely used type of authentication is knowledge-based authentication. Examples include passwords, pass phrases or pass sentences, graphical passwords, pass faces, and personal identification numbers (PINs). To verify and authenticate users over an unsecured public network, such as the Internet, digital certificates and digital signatures are used. They are provided using a public key infrastructure (PKI), which relies on a public and a private cryptographic key pair. The traditional, and by far the most widely used, form of authentication based on user knowledge is the password. Most computer systems are protected through user identification (such as a user name or e-mail address) and a password.

Possession-based Authentication

Possession-based authentication, also referred to as token-based authentication, is based on what the user has. It makes use mainly of physical objects that a user possesses, like tokens. Presentation of a valid token does not prove ownership, as the token may have been stolen or duplicated by some sophisticated fraudulent means. There are also problems of administration and of the inconvenience to users of having to carry tokens. Tokens are usually divided into two main groups: memory tokens and smart tokens.

Memory tokens store information but do not process it. The most common type of memory token is the magnetic card, used mainly for authentication together with a knowledge-based authentication mechanism such as a PIN. Memory tokens are inexpensive to produce. Using them with PINs provides significantly more security than PINs or passwords alone.

Unlike memory tokens, smart tokens incorporate one or more embedded integrated circuits which enable them to process information. Like memory tokens, most smart tokens are used for authentication together with a knowledge-based authentication mechanism such as a PIN. Of the various types of smart tokens, the most widely used are those that house an integrated chip containing a microprocessor. Their portability and cryptographic capacity have led to their wide use in many remote and e-commerce applications. Due to their complexity, smart tokens are more expensive than memory tokens but provide greater flexibility and security and are more difficult to forge. Because of their high security level, smart tokens are also used for one-time passwords for authentication across open networks.
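
One way a smart token can generate one-time passwords is the HMAC-based scheme standardized as HOTP in RFC 4226. The article does not say which scheme such tokens use, so HOTP is an assumption chosen for illustration. The sketch below derives a six-digit code from a secret shared by token and server plus a counter that advances on every use; the secret shown is the RFC's published test key, not a value to use in practice.

```python
import hashlib
import hmac
import struct


def hotp(secret: bytes, counter: int, digits: int = 6) -> str:
    """Compute an HMAC-based one-time password as described in RFC 4226."""
    # HMAC-SHA-1 over the counter encoded as an 8-byte big-endian integer.
    mac = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    # Dynamic truncation: the low 4 bits of the last byte select an offset.
    offset = mac[-1] & 0x0F
    code = struct.unpack(">I", mac[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)


# Test key from RFC 4226, Appendix D.
secret = b"12345678901234567890"
for counter in range(3):
    print(counter, hotp(secret, counter))
```

Because the server holds the same secret and tracks the same counter, it can recompute the expected code, and a code intercepted on an open network is of no use once the counter has moved on.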

Biometric-based Authentication

Biometric-based authentication is based on what the user is, namely, automatic identification using certain anatomical, physiological or behavioral features and characteristics associated with the user. Biometric authentications are based on the fact that certain physiological or behavioral characteristics reliably distinguish one person from another. Thus, it is possible to establish an identity based on who the user is, rather than on what the user possesses or knows and remembers.

Biometrics involves both the collection and the comparison of these characteristics. A biometric system can be viewed as a pattern recognition system consisting of three main modules: (1) the sensor module, (2) the feature extraction module, and (3) the feature matching module. The users’ personal attributes are captured and stored in reference files, against which later samples are compared to determine whether a match exists. The accuracy of the different biometric systems can be evaluated by measuring two types of errors: (1) erroneous rejection, that is, false non-match (type I error), and (2) erroneous acceptance, that is, false match (type II error). In a biometric system that provides a high level of authentication, the rates of both errors are low.

Biometric authentications are technically complex and usually expensive, as they require special hardware. Although all biometric technologies inherently suffer from some level of false match or false non-match, they have a high level of security. Despite their high security, they do not have a high acceptance rate among users, as they are perceived to be intrusive and an encroachment on privacy through automated means. They also raise ethical issues of potential misuse of personal biometrics, such as for tracking and monitoring productivity. Thus, they are not popular and are mainly used in systems with very high levels of security.
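
The two error types can be measured once the matching module produces a similarity score and the system accepts any score at or above a threshold. The sketch below is a minimal illustration under that assumption; the scores and thresholds are hypothetical.

```python
def error_rates(genuine_scores, impostor_scores, threshold):
    """Return (false non-match rate, false match rate) at `threshold`."""
    # A genuine user scoring below the threshold is wrongly rejected.
    false_non_matches = sum(1 for s in genuine_scores if s < threshold)
    # An impostor scoring at or above the threshold is wrongly accepted.
    false_matches = sum(1 for s in impostor_scores if s >= threshold)
    return (
        false_non_matches / len(genuine_scores),
        false_matches / len(impostor_scores),
    )


# Hypothetical similarity scores between a sample and a stored reference.
genuine = [0.91, 0.85, 0.78, 0.66, 0.95]
impostor = [0.12, 0.35, 0.71, 0.28, 0.40]

for threshold in (0.7, 0.8):
    fnmr, fmr = error_rates(genuine, impostor, threshold)
    print(f"threshold={threshold}: FNMR={fnmr:.2f}, FMR={fmr:.2f}")
```

Raising the threshold in this example removes the false match but rejects one more genuine user, which shows why the two error rates have to be judged together.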

(Source: “Authentication Methods for Computer Systems Security”, by Zippy Erlich and Moshe Zviran, short version)

Internet: A Brief History

by Martin Campbell-Kelly – Warwick University, England

A computer-based communications system allowing users to communicate quickly without relying upon telephone communication. The enabling technology of the Internet, packet switching, was invented in the early 1960s, but it took 30 years for the first primitive computer networks to evolve into today’s ubiquitous information infrastructure.

Until the invention of packet switching, users could be connected to only one computer at a time, using a long-distance telephone line. This was expensive, because the telephone connection was used an average of only 2 percent of the time, and unreliable, because if the telephone connection failed communication ceased altogether. In packet switching, data was transmitted not by a dedicated communications line, but by converting it into “packets,” rather like telegrams, containing the address of the sender and recipient. A packet-switched network contained many communications lines interconnected by small, message-processing computers—now called routers—that directed the flow of packets in the network.
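
The packet idea described above can be sketched in a few lines: a message is cut into small pieces, each carrying the sender and recipient addresses, and the recipient puts the pieces back together. The sequence number, the eight-character packet size, and the host names are assumptions added for this illustration; they do not describe the format of any real network.

```python
import random
from dataclasses import dataclass


@dataclass
class Packet:
    sender: str
    recipient: str
    sequence: int  # position of this piece in the original message
    payload: str


def packetize(message, sender, recipient, size=8):
    """Split a message into addressed packets of at most `size` characters."""
    return [
        Packet(sender, recipient, index, message[start:start + size])
        for index, start in enumerate(range(0, len(message), size))
    ]


def reassemble(packets):
    """Rebuild the message from packets that may arrive in any order."""
    ordered = sorted(packets, key=lambda p: p.sequence)
    return "".join(p.payload for p in ordered)


message = "Packets may take different routes through the network."
packets = packetize(message, "host-a", "host-b")
random.shuffle(packets)  # simulate packets arriving out of order
print(reassemble(packets))
```

Because every packet is self-describing, no single line has to stay reserved for the whole conversation.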

The pioneering packet-switched network was Arpanet, initially connecting just four “host” computers in 1969, which was funded by the U.S. Department of Defense’s Advanced Research Projects Agency. Development of the Arpanet was contracted out to a group of American universities, and this led to a uniquely democratic, occasionally anarchic, culture. By 1971, Arpanet had 23 computers attached to it. Originally, the network had been designed so that users could make use of specialized computers remote from their place of work. However, it turned out that the main use of the network was for electronic mail, something the designers had never envisioned.

In the period 1975–85, other computer networks sprang up around the world, usually based on some form of packet switching. Some of them were commercial networks, while others were private networks owned by governments or multinational corporations. The early 1980s also saw the development of on-line computer services such as CompuServe and America Online (AOL) for home computer users. The problem with these networks was that they could not communicate with each other. For example, users could e-mail only people within their own network, and could access only the information located on their particular network. However, in the late 1970s, the Advanced Research Projects Agency, the sponsor of Arpanet, began to address this problem, which it called inter-networking, or simply the Internet.

It devised a set of rules, known as a “protocol,” for communication between networks. This was the Transmission Control Protocol/Internet Protocol, or simply TCP/IP, a mysterious acronym familiar to most experienced users of the Internet. Gradually many of the world’s non-military networks began to connect with one another. Thus, the Internet is simply a network of computer networks, but it was a miracle of cooperation, each network adding to the telecommunications infrastructure piece-by-piece without payment from any centralized funding authority. By 1988, there were 50,000 host computers attached to the Internet. Three years later there were a million. The early 1990s saw the first commercial Internet Service Providers (ISPs), which gave inexpensive commercial and domestic access to the Internet. The issue of the Internet became highly politicized in the Clinton-Gore election campaign in 1992, in which the candidates expressed the need to provide Internet access to all Americans, just as earlier generations had had access to the postal service and the telephone.

Increasingly, the Internet came to be viewed not as a computing and communications resource but as an information repository, but it was difficult to access this information unless one was a trained information researcher. In 1989, a young, British-born researcher at the CERN nuclear research laboratory in Geneva, Tim Berners-Lee, invented a method of organizing information that he called the World Wide Web (WWW—or simply the Web). To view information on the Web, one would use a “browser” to view an on-line document, using navigation buttons and links to move within the document or to another document. The information itself, however, would be effectively disembodied in cyberspace—existing on computers here, there, and everywhere.

The World Wide Web liberated the Internet. In 1993, the primary users of the Internet had been academics and scientists; five years later, there were 130 million users around the world from all walks of life. The Internet became increasingly commercialized. One of the major commercial successes was the Netscape Corporation, whose Netscape Navigator browser, introduced in December 1994, did much to popularize the Internet. Other corporations such as Yahoo and Lycos were commercial spin-offs of “search engines” originally developed in universities to help locate information on the Web. In 1995, Microsoft introduced its Internet Explorer browser and the Microsoft Network (MSN), seeking to dominate the Internet as it had the personal computer. However, as the content of any one network was dwarfed by the riches of the Internet as a whole, full-service providers such as CompuServe, AOL, and MSN quickly changed their business model to become Internet Service Providers and mere “portals” to the World Wide Web.

By 1996, there were 10 million host computers on the Internet, a number that was doubling every 18 months. By 2000, there were more than 70 million. The Internet enabled a new commercial paradigm, based on the reduction of economic friction by eliminating middlemen and physical inventories. The best-known example was Amazon.com, the on-line bookstore established by Jeff Bezos, a 30-year-old entrepreneur, in 1995; five years later it had more than 10 million customers. The Internet was a Klondike for so-called dot-com entrepreneurs, with hundreds and eventually thousands of new businesses being formed, such as travel agencies, “e-tailers,” stockbrokers, and on-line auctioneers. By 2000, all significant businesses, whether new economy or old economy, found it necessary to have a Web “presence.”

Protecting Important Files

One problem computer users often face is the security of their documents and files. Is a document secure, or can someone else open and read it? How can files be protected?

One solution is encryption. There are two encryption options: encrypting specific files or encrypting all files.

Which option is better?
