

Dat Tran – Research Projects 

 

  1. Fuzzy Subspace Clustering Methods

·       Collaborators:

·       Description: We propose a generic model for applying fuzzy clustering methods to subspace clustering. Fuzzy c-means and fuzzy entropy clustering methods will be used to weight distances in multi-dimensional space. Extensions of this model to other fuzzy clustering methods will also be proposed. Applications of this model to biometric authentication and pattern recognition will be experimentally evaluated. If you are interested in this model, please contact me (dat.tran@canberra.edu.au).

 

  1. Spam Email Detection

·       Collaborators: Dr Wanli Ma, A/Prof Dharmendra Sharma, A/Prof John Campbell, Dr Shuangzhe Liu

·       Description: Several methods for detecting spam emails have been proposed in the literature. They are based on address lists, headers, keyword lists and statistical content analysis, known as Bayesian filtering. The project focuses on the keyword list-based system, which uses a blacklist of keywords to detect spam emails. Although not many keywords are found in spam emails, the problem for detection is that these keywords are written as misspelled words and the misspellings change from time to time. Users can still understand the content of an email containing such misspellings, but the keyword list-based system is unable to update the blacklist with them. For example, suppose an email is regarded as spam because it contains the keyword “virus”. After the blacklist is updated with this keyword, the system is still unable to detect spam emails containing “viirus”, “vi_rus”, “virrus”, or “virruus”. Since there are numerous ways to misspell a given word, the email detection system becomes ineffective.

     We propose a novel method to overcome the problem of misspelled words in spam emails. We consider the occurrences of letters in a keyword as a stochastic process, and hence the keyword can be represented as a Markov chain where letters are states. Given the blacklist as a training set, the initial and transition probabilities for all Markov chains representing all keywords in the blacklist are calculated, and the set of those probabilities is regarded as a Markov model for that blacklist. In order to detect a keyword and its misspellings, we build a Markov model for the keyword W and use a statistical hypothesis test. Our experiments showed that the scores obtained for the misspellings we produced of the keyword W were close to one another, and hence, with a preset threshold, we could detect those misspellings. As an extension, we believe that other misspellings of the keyword W that we have not tested will also be detected using the same preset threshold.
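     The idea of scoring a candidate word against a letter-level Markov chain can be sketched as follows. This is a minimal illustration, not the project's implementation: the function names, the additive smoothing constant, the 27-symbol alphabet and the use of a length-normalised log-probability in place of the statistical hypothesis test are all assumptions made for the example.

```python
import math
from collections import Counter

ALPHABET_SIZE = 27  # assumed: 26 letters plus "_" as an extra symbol
ALPHA = 0.1         # assumed additive-smoothing constant

def train(keywords):
    """Count initial letters and letter-to-letter transitions."""
    init = Counter(w[0] for w in keywords)
    trans = Counter((a, b) for w in keywords for a, b in zip(w, w[1:]))
    rows = Counter()  # total outgoing transitions per letter
    for (a, _), count in trans.items():
        rows[a] += count
    return init, trans, rows

def score(word, model):
    """Length-normalised log-probability of `word` under the model."""
    init, trans, rows = model
    smooth = ALPHA * ALPHABET_SIZE
    logp = math.log((init[word[0]] + ALPHA) / (sum(init.values()) + smooth))
    for a, b in zip(word, word[1:]):
        logp += math.log((trans[(a, b)] + ALPHA) / (rows[a] + smooth))
    return logp / len(word)

model = train(["virus"])
for candidate in ["virus", "viirus", "vi_rus", "virrus", "virruus", "hello"]:
    print(candidate, round(score(candidate, model), 3))
```

     In this sketch the misspellings of “virus” receive scores well above that of an unrelated word such as “hello”, so a single preset threshold between the two groups would separate them.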

 

  1. Observable Markov Models

·       Collaborators: A/Prof Dharmendra Sharma, Dr Wanli Ma, A/Prof Tuan Pham, Prof Michael Wagner, Girija Chetty

·       Description: The well-known hidden Markov model (HMM) is widely used in speech and speaker recognition to deal with the temporal information in speech. A feature vector sequence is extracted from the speech signal of an utterance. To model an utterance, a set of feature vector sequences obtained from speech signals of the utterance is used to train an HMM for the utterance. The HMM can represent the temporal information in those feature vector sequences. However, this modelling method cannot be applied directly to cell phase identification. A cell changes its phase from time to time, and the task is to identify the cell phase at a specific time. To apply the HMM to cell phase identification, we would need to group cells in the same phase into a subgroup, but the temporal information in the cell sequence is lost once cells are regrouped into subgroups. Therefore we propose to use the observable Markov model (OMM) to model this temporal information. Markov states are cell phases, and a change from one phase to another is modelled as a state transition in the OMM. The OMM is very promising, and we are applying this model to solve problems in security, spam emails and biometric authentication.
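     Because the states of an OMM are directly observed, its parameters can be estimated by counting. The sketch below shows this under stated assumptions: the function name `fit_omm` and the single-letter phase labels are hypothetical and chosen only for illustration.

```python
from collections import Counter, defaultdict

def fit_omm(sequences):
    """Estimate initial and transition probabilities from observed
    state sequences by relative frequency (no hidden states)."""
    init = Counter(seq[0] for seq in sequences)
    trans = defaultdict(Counter)
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            trans[a][b] += 1
    total = sum(init.values())
    init_p = {s: c / total for s, c in init.items()}
    trans_p = {
        a: {b: c / sum(row.values()) for b, c in row.items()}
        for a, row in trans.items()
    }
    return init_p, trans_p

# Hypothetical phase labels: I = interphase, P = prophase, M = metaphase.
sequences = [["I", "I", "P", "M"], ["I", "P", "P", "M"]]
init_p, trans_p = fit_omm(sequences)
print(init_p)
print(trans_p)
```

     Each row of the estimated transition table describes how likely a cell in one phase is to be found in each phase at the next time step.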

 

  1. Tablet PC-Based Handwriting Applications

·       Collaborators: Dr Wanli Ma, A/Prof Dharmendra Sharma, A/Prof John Campbell, Dr Shuangzhe Liu, Girija Chetty

·       Description: Tablet PCs are general-purpose computers with sensitive screens designed to interact with an accompanying pen. They run on the same processors as a laptop, have large hard drives, and have as much memory as any other computer. Tablet PCs are hybrids of handheld devices, laptops and other information tools. They are powered by special tablet PC versions of operating systems. The focus of this paper is on the Microsoft Windows XP Tablet PC Edition operating system and its related software. Windows XP Tablet PC Edition is a superset of Windows XP; therefore all applications that can run on a regular PC can also run on the Tablet PC.  This includes anything from MS Office to the applications we write ourselves.

     The current Tablet PC tools offered by Microsoft include Input Panel, Office OneNote, Windows Journal, Sticky Notes, and the Education Pack. Input Panel dynamically converts handwriting input by a user to text. The latest tool for the academic environment is the Education Pack, which includes Ink Flash Cards and Equation Writer as the main programs. With Ink Flash Cards, a student can create two-sided question-and-answer cards to test their knowledge. Equation Writer lets users handwrite a math equation and convert it to text with the touch of a pen, which is much more efficient than using Equation in Word.

     Academic staff members at universities around the world have developed Tablet PC-based applications for their academic environments. Typical applications include lecture presentations, teaching computer science and software engineering courses, and providing peer-review comments.

     We are focusing on signature identification and verification applications for the Tablet PC.  In the signature verification application, the user’s signature plays the role of a password. To register a new user, a Windows form is provided for the user to enter a username and two copies of their signature. The system extracts features from the entered signatures and builds a signature model using a modelling technique such as vector quantization, Gaussian mixture or hidden Markov modelling. After registration, the user can log on to the system using the registered username and signature via a login page. The entered signature is compared with the signature model of the claimed identity. If the match score is above a given threshold, the identity claim is accepted.

 

  1. Modelling Methods for Typist Recognition

·       Collaborators: Dr Wanli Ma, A/Prof Dharmendra Sharma, Girija Chetty

·       Description: User authentication (human-by-machine authentication) is the process of verifying the identity of a user: is this person really who he/she claims to be? User authentication has become more complicated and difficult with the onset of the computer age.

     Passwords are excellent authenticators, but they can be stolen if recorded or guessed. We propose to use keystroke biometrics to overcome this problem. Typing biometric-based user authentication has been a very active research area. Typing biometric features are time intervals measured when keys on a keyboard are pressed or released. The four main features are 1) the time interval during which a key remains pressed (down-up), 2) the time interval until the next key is pressed (up-down), 3) the time interval between two consecutive key presses (down-down), and 4) the time interval between two consecutive key releases (up-up). The four features are used to form a 4-dimensional feature vector. A set of feature vectors is collected to build a statistical model for a user based on current pattern recognition methods such as learning vector quantization, neural networks and hidden Markov models.
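     The four timing features can be computed directly from press and release timestamps. The sketch below assumes a hypothetical event format of (key, press time, release time) in milliseconds; the function name and the sample timings are illustrative only.

```python
def keystroke_features(events):
    """Build one 4-dimensional vector per pair of consecutive keys.

    `events` is a list of (key, down_time, up_time) tuples in typing order.
    Each vector is (down-up, up-down, down-down, up-up).
    """
    vectors = []
    for (_, d1, u1), (_, d2, u2) in zip(events, events[1:]):
        vectors.append((
            u1 - d1,  # down-up: how long the first key stays pressed
            d2 - u1,  # up-down: release of first key to press of next key
            d2 - d1,  # down-down: between the two presses
            u2 - u1,  # up-up: between the two releases
        ))
    return vectors

# Hypothetical timings in milliseconds.
events = [("a", 0, 80), ("b", 120, 190), ("c", 260, 330)]
print(keystroke_features(events))
```

     Note that the up-down interval can be negative when the next key is pressed before the previous one is released, which is common for fast typists.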

 

  1. Statistical and Fuzzy Techniques for Classification of Cell Nuclei in Different Mitotic Phases

·       Collaborators: A/Prof Tuan Pham

·       Description: New biological and biomedical problems in bioinformatics are very difficult to handle with conventional methods and manual procedures. Approaches and methodologies from computer science, information technology, and information sciences for solving these problems have not yet been fully explored and have high potential for addressing many significant issues in bioinformatics.

     This project aims to develop an innovative and comprehensive application of statistical and fuzzy pattern recognition for the computerized classification of cell nuclei in different mitotic phases.  A summary of the application is as follows.

     Advances in fluorescent probing and microscopic imaging technology provide important tools for biology and medicine research in studying the structures and functions of cells and molecules.  Such studies require the processing and analysis of huge amounts of image data, and manual image analysis is very time consuming, thus costly, and also potentially inaccurate and poorly reproducible. Stages of an automated cellular imaging analysis consist of segmentation, feature extraction, classification, and tracking of individual cells in a dynamic cellular population. Image classification of cell phases in a fully automatic manner presents the most difficult task of such analysis.

     We are interested in applying several advanced computational, probabilistic, and fuzzy-set methods that we have proposed for speech, speaker and image recognition to the computerized classification of cell nuclei in different mitotic phases. We have been testing several proposed computational procedures on real image sequences recorded every fifteen minutes over a period of twenty-four hours by time-lapse fluorescence microscopy. The data set was provided by the Harvard Medical School in Boston, USA. Experimental results have shown that the proposed methods are effective and have potential for higher performance with a better cellular feature extraction strategy.

 

  1. Fuzzy Pattern Recognition Methods for Intrusion Detection and Spam Emails Filtering Systems

·       Collaborators: Dr Wanli Ma, A/Prof Dharmendra Sharma, A/Prof John Campbell, Dr Shuangzhe Liu

·       Description: Nowadays, computers are connected through wired or wireless connections to form networks for resource sharing, through which large amounts of data are exchanged. Many current computer systems run critical business operations, yet little research and development work has been done to secure these systems. IT security covers many issues, such as intrusion detection and spam email filtering.

     In general, there are two types of intrusion detection systems: signature matching based intrusion detection systems and anomaly behaviour analysis based intrusion detection systems. The goal of an intrusion detection system is to efficiently and effectively detect intrusions. The requirements are so basic, yet they are far from achievable. Current technologies such as hidden Markov model (HMM) and information retrieval are used to detect intrusions.

     Spam email is more or less like the junk mail we receive in the mailboxes in front of our houses. Spam emails not only waste IT resources but also pose serious security threats. There are two types of spam email: unsolicited commercial email and email used as a delivery agent for malware (malicious software). The former uses email for commercial advertising, including illegal commercial activities. The latter has a more sinister intention. Any type of malware, be it a virus, a worm, or spyware, has to find a way to infect host computers after being developed. An easy and effective way to deliver malware is through unsolicited bulk email.

     We propose new fuzzy modelling methods to build the claimed identity model. These methods are fuzzy hidden Markov modelling, fuzzy Gaussian Markov modelling and fuzzy entropy clustering. We propose a new background modelling method to determine the probability of the alternative hypothesis. These proposed methods have been applied to biometric authentication problems and showed better performance than current methods.

 

  1. New Statistical Modelling Methods for Biometric Authentication and Intrusion Detection Systems

·       Collaborators: Dr Wanli Ma, A/Prof Dharmendra Sharma, Dr Shuangzhe Liu

·       Description: We propose new statistical methods that can avoid the limitations in the current hidden Markov modelling (HMM) approach and therefore enhance the performance of the user authentication and intrusion detection systems.

     We propose the temporal Markov modelling (TMM) method to avoid the limitations of the current HMM. The TMM has an observed Markov chain where cluster indices are states to represent the dependence of each observation on its predecessor. The hidden Markov chain in the HMM is still employed in the TMM. We use both the standard NIST databases and the biometric and network databases we collect to evaluate the proposed method.

     We propose to train the TMM using the quasi-likelihood estimation (QLE) method. The reason is that computer network data and biometric data do not have a statistical distribution; therefore the current MLE method is not appropriate. The proposed QLE method requires only assumptions on the mean and variance functions rather than the full likelihood. We also propose a new background modelling method to determine the probability of the alternative hypothesis from the null hypothesis.

 

  1. Written Language Recognition

·       Collaborators: A/Prof Tuan Pham, Dr Wanli Ma, A/Prof Dharmendra Sharma, Girija Chetty

·       Description: The main principles for a language identification system are that it should be fast enough for real-time processing, efficient, require minimal storage, and be robust against textual errors. Based on these principles, a Markov chain-based method is proposed for language identification in this paper. The occurrences of letters in a word can be regarded as a stochastic process, and hence the word can be represented as a Markov chain where letters are states. The occurrence of the first letter in the word is characterized by the initial probability of the Markov chain, and the occurrence of each subsequent letter given its previous letter is characterized by the transition probability. Given a text document in a specific language as a training set, the initial and transition probabilities for all Markov chains representing all words in the text document are calculated, and the set of those probabilities is regarded as a Markov model for that language. To identify the language of an unknown string, the maximum likelihood decision rule is used. Words in the string are regarded as Markov chains, and for each language model built in the training session, the initial and transition probabilities taken from the language model are used to calculate the probability of the unknown string for that language. The unknown string is then assigned to the language that has the maximum probability.
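     The decision rule described above can be written compactly. The notation is introduced here only for illustration and is not taken from the original description: a word w = w_1 ... w_n, a language model λ with initial probabilities π_λ and transition probabilities a_λ, and an unknown string S whose words are assumed to be scored independently.

```latex
P(w \mid \lambda) = \pi_\lambda(w_1) \prod_{i=2}^{n} a_\lambda(w_{i-1}, w_i),
\qquad
\hat{\lambda} = \arg\max_{\lambda} \sum_{w \in S} \log P(w \mid \lambda)
```

     Working with the sum of log-probabilities rather than the product of probabilities gives the same decision while avoiding numerical underflow on long strings.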

 

  1. New Background Modeling for Speaker Verification

·       Collaborators: Prof Michael Wagner, A/Prof Dharmendra Sharma, Girija Chetty

·       Description: Most of the current background modeling methods were discovered through various considerations of the impostors’ models. However, speaker verification systems based on such background models rely on the availability of speaker databases and on the acoustic conditions. It is known that the speech signal is influenced by the speaking environment, the channel used to transmit the signal, and, when it is recorded, the transducer used to capture the signal. For portable devices such as palm-top computers and wireless handsets, high computation and memory requirements are not desirable. A background model design for flexible and portable speaker verification systems has been proposed for this purpose. A speaker verification system using left-to-right hidden Markov models consisting of 25 states with 4 Gaussian mixtures per state showed good performance with this background model. In this approach, background speaker models were built directly from the claimed speaker’s enrolment utterances.

     In this project, we propose a background modelling method for text-independent speaker verification systems using Gaussian mixture models (GMMs), based on the above-mentioned background model design. We use the same training data to build the claimed speaker model and the background model. The difference between these two models is the number of Gaussian mixtures: the background model should have fewer Gaussian mixtures than the claimed speaker model. We performed experiments on the YOHO speech database with different numbers of Gaussian mixtures. Experimental results showed that a low verification error rate is obtained if the number of Gaussian mixtures in the background model is less than half of that in the claimed speaker model. Compared to current background model set methods, the proposed method using 64-mixture GMMs for claimed speaker models and 16-mixture GMMs for background speaker models showed better performance.
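     For orientation, a verification score of this kind is commonly written as an average log-likelihood ratio between the two models. The form below is the standard one for GMM-based speaker verification and is given as an assumption, since the description above does not state the exact score used. The symbols are illustrative: X = x_1 ... x_T is the sequence of feature vectors, λ_c is the claimed speaker model (for example, 64 mixtures), λ_b is the background model trained on the same data (for example, 16 mixtures), and θ is the decision threshold.

```latex
\Lambda(X) = \frac{1}{T} \sum_{t=1}^{T}
  \left[ \log p(x_t \mid \lambda_c) - \log p(x_t \mid \lambda_b) \right],
\qquad \text{accept the claim if } \Lambda(X) > \theta
```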

 

  1. Fuzzy Normalization Methods for Pattern Verification

·       Collaborators: Prof Michael Wagner, Dr Wanli Ma, A/Prof Dharmendra Sharma

·       Description: Consider the pattern verification problem in fuzzy set theory. To accept or reject the claimed pattern, the task is to decide whether the input object X is from the claimed pattern λ0 or from the set of impostors λ, based on comparing the score for X with a decision threshold θ. The space of input objects can be considered as consisting of two fuzzy subsets, one for the claimed pattern and one for impostors. The similarity score plays the role of the fuzzy membership function, which denotes the degree to which the input object belongs to the claimed pattern. Accepting (or rejecting) the claimed pattern is viewed as a defuzzification process, where the input object is (or is not) in the claimed pattern's fuzzy subset if the fuzzy membership value is (or is not) greater than the given threshold θ. According to this fuzzy set theory-based viewpoint, currently used scores might be viewed as fuzzy membership scores and, conversely, other fuzzy memberships can be used as the claimed pattern's scores.

     In theory, there are many ways to define the fuzzy membership function, therefore it can be said that this fuzzy approach proposes more general scores than the current likelihood ratio scores for pattern verification. These are termed fuzzy membership scores, which can denote the belonging of X to the claimed pattern.

     The use of the normalization term can cause false acceptances of impostors because of the relativity of the ratio-based values. For example, the two ratios of (0.08 / 0.04) and (0.0000008 / 0.0000004) have the same value of 2. The first ratio can lead to a correct decision whereas the second one is unlikely since both likelihood values are very low. This problem can be overcome by applying the idea of the well-known noise clustering method in fuzzy clustering, where impostors’ objects are considered as noisy data and thus should have arbitrarily small fuzzy membership scores in the claimed pattern's fuzzy subset. This is implemented by simply adding to the normalization term a constant membership value, which denotes the belonging of all input objects to impostors' fuzzy subset.
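     The numerical example above can be reproduced in a few lines. This is a sketch under stated assumptions: the function names and the value 0.01 for the constant membership are chosen only for illustration and are not taken from the method's published form.

```python
def ratio_score(p_claimed, p_impostor):
    """Plain ratio-based score: claimed likelihood over the normalization term."""
    return p_claimed / p_impostor

def noise_normalized_score(p_claimed, p_impostor, noise=0.01):
    """Ratio score with a constant membership added to the normalization term.

    `noise` is a hypothetical constant standing in for the membership of
    all input objects in the impostors' fuzzy subset.
    """
    return p_claimed / (p_impostor + noise)

for p_c, p_i in [(0.08, 0.04), (0.0000008, 0.0000004)]:
    print(p_c, p_i, ratio_score(p_c, p_i), noise_normalized_score(p_c, p_i))
```

     With the plain ratio both cases score 2 and cannot be told apart. With the added constant, the first case keeps a score above 1, while the second, where both likelihoods are very low, drops close to zero and would be rejected by any reasonable threshold.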