Dat Tran – Research Projects
·
Collaborators:
Description: We propose a generic model to
apply fuzzy clustering methods to subspace clustering. Fuzzy c-means and fuzzy entropy
clustering methods will be used to weight distances in multi-dimensional space.
Extensions of this model to other fuzzy clustering methods will also be
proposed. Applications of this model to biometric authentication and pattern
recognition will be experimentally evaluated. If you are interested in this
model, please contact me (dat.tran@canberra.edu.au).
·
Collaborators: Dr Wanli Ma, A/Prof Dharmendra
Sharma, A/Prof John Campbell, Dr Shuangzhe Liu
·
Description: Several methods for detecting spam
emails have been proposed in the literature. They are based on address lists,
headers, keyword lists, and statistical content analysis, known as Bayesian
filtering. This project focuses on the keyword list-based system, which uses a
blacklist of keywords to detect spam emails. Although relatively few
keywords are found in spam emails, they are difficult to detect because they
are written as misspelled words and the misspellings change from
time to time. Users can still understand the content of an email containing such
misspellings, but the keyword list-based system cannot keep the
blacklist up to date with them. For example, suppose an email is regarded as spam
because it contains the keyword “virus”. After the blacklist is updated
with this keyword, the system is still unable to detect spam emails containing
“viirus”, “vi_rus”, “virrus”, or “virruus”.
Since there are numerous ways to misspell a given word, the
detection system becomes ineffective.
We propose a
novel method to overcome the problem of misspelled words in spam emails. We
consider the occurrences of letters in a keyword as a stochastic process, so
the keyword can be represented as a Markov chain
in which letters are states. Given the blacklist as a training set, the initial
and transition probabilities for all Markov chains representing all keywords in
the blacklist are calculated, and the set of those probabilities is regarded as
a Markov model for that blacklist. To detect a keyword W and its
misspellings, we build a Markov model for W and use a statistical
hypothesis test. In our experiments, the scores obtained for the
misspellings of W that we produced differed little from one another, so
they could be detected with a preset threshold. We
expect that other misspellings of W that we have not
tested will also be detected using the same preset threshold.
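As a rough illustration of the idea, the following Python sketch trains a letter-level Markov chain on a keyword and scores candidate words against a preset threshold. The symbol set, the smoothing constant `alpha`, the length normalisation and the threshold value are all assumptions made for this example; the statistical hypothesis test used in the project is not specified here, so a simple thresholded log-probability stands in for it.

```python
import math
from collections import Counter

ALPHABET = "abcdefghijklmnopqrstuvwxyz_"  # assumed symbol set for this sketch


def clean(word):
    """Lower-case a word and keep only symbols in ALPHABET."""
    return "".join(c for c in word.lower() if c in ALPHABET)


def train_markov(keywords, alpha=0.01):
    """Estimate smoothed initial and transition probabilities from keywords."""
    init_counts, trans_counts = Counter(), Counter()
    for word in map(clean, keywords):
        if not word:
            continue
        init_counts[word[0]] += 1
        trans_counts.update(zip(word, word[1:]))  # count letter pairs
    n = len(ALPHABET)
    init_total = sum(init_counts.values())
    init = {c: (init_counts[c] + alpha) / (init_total + alpha * n)
            for c in ALPHABET}
    trans = {}
    for a in ALPHABET:
        row_total = sum(trans_counts[(a, b)] for b in ALPHABET)
        trans[a] = {b: (trans_counts[(a, b)] + alpha) / (row_total + alpha * n)
                    for b in ALPHABET}
    return init, trans


def score(word, model):
    """Length-normalised log-probability of a word under the Markov model."""
    init, trans = model
    word = clean(word)
    if not word:
        return float("-inf")
    logp = math.log(init[word[0]])
    for a, b in zip(word, word[1:]):
        logp += math.log(trans[a][b])
    return logp / len(word)


def is_keyword_variant(word, model, threshold=-2.0):
    """Accept the word as a variant of the keyword if its score is high enough."""
    return score(word, model) > threshold


model = train_markov(["virus"])
for candidate in ["virus", "viirus", "vi_rus", "virrus", "virruus", "hello"]:
    print(candidate, round(score(candidate, model), 3),
          is_keyword_variant(candidate, model))
```

The threshold here was chosen by hand for this toy keyword; in practice it would have to be set from experiments on real data.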
·
Collaborators: A/Prof Dharmendra Sharma, Dr Wanli
Ma, A/Prof Tuan Pham, Prof Michael Wagner, Girija Chetty
·
Description: The well-known hidden Markov model
(HMM) is widely used in speech and speaker recognition to deal with the
temporal information in speech. A feature vector sequence is extracted from the
speech signal of an utterance. To model an utterance, a set of feature vector
sequences obtained from speech signals of the utterance is used to train an HMM
for the utterance. The HMM can represent the temporal information in those
feature vector sequences. However, this modelling method cannot be applied directly to
cell phase identification. A cell changes its phase from time to time, and the
task is to identify the cell phase at a specific time. To apply the HMM to
cell phase identification, we would need to group cells in the same phase into a
subgroup, but the temporal information in the cell sequence is lost
after the cells are regrouped into subgroups. We therefore propose to use an observable
Markov model (OMM) to model this temporal information. The Markov states are the
cell phases, and a change from one phase to another is modelled as a state transition in
the OMM. The OMM is very promising, and we are applying this model to
problems in security, spam email detection and biometric authentication.
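To make the OMM concrete, the sketch below estimates initial and transition probabilities by counting over observed phase sequences. The phase labels and the two toy sequences are hypothetical and serve only as an example; plain relative-frequency estimation is assumed, which may differ from the estimation used in the project.

```python
from collections import Counter


def estimate_omm(sequences):
    """Estimate initial and transition probabilities from observed state sequences."""
    init_counts = Counter(seq[0] for seq in sequences if seq)
    trans_counts = Counter()
    for seq in sequences:
        trans_counts.update(zip(seq, seq[1:]))  # consecutive phase pairs
    states = sorted({s for seq in sequences for s in seq})
    n_sequences = sum(init_counts.values())
    init = {s: init_counts[s] / n_sequences for s in states}
    trans = {}
    for a in states:
        row_total = sum(trans_counts[(a, b)] for b in states)
        # A state that is never left in the data gets an all-zero row.
        trans[a] = {b: (trans_counts[(a, b)] / row_total if row_total else 0.0)
                    for b in states}
    return init, trans


# Hypothetical phase labels observed for two cells over time.
cell_tracks = [
    ["interphase", "interphase", "prophase", "metaphase"],
    ["interphase", "prophase", "prophase", "metaphase"],
]
init, trans = estimate_omm(cell_tracks)
print(init)
print(trans["interphase"])
```

Because the states are observed directly, no hidden-state inference is needed: the model is fully described by these two tables.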
·
Collaborators: Dr Wanli Ma, A/Prof Dharmendra
Sharma, A/Prof John Campbell, Dr Shuangzhe Liu, Girija Chetty
·
Description: Tablet PCs are general-purpose
computers with sensitive screens designed to interact with an accompanying pen.
They run on the same processors as laptops, have large hard drives, and have
as much memory as any other computer. Tablet PCs are hybrids of handheld
devices, laptops and other information tools. They are powered by special
Tablet PC versions of operating systems. The focus of this paper is on the
Microsoft Windows XP Tablet PC Edition operating system and its related
software. Windows XP Tablet PC Edition is a superset of Windows XP; therefore all
applications that can run on a regular PC can also run on the Tablet PC. This includes anything from MS Office to the
applications we write ourselves.
The current
Tablet PC tools offered by Microsoft include Input Panel, Office OneNote,
Windows Journal, Sticky Notes, and the Education Pack. Input Panel dynamically
converts handwriting input by a user to text. The latest tool for the academic
environment is the Education Pack which includes Ink Flash Cards and Equation
Writer as the main programs. With Ink Flash Cards, a student can create
two-sided question-and-answer cards to test their knowledge. Equation Writer
helps users handwrite a math equation and convert it to text with the touch of
a pen, which is much more efficient than using Equation in Word.
Academic staff
members at universities around the world have developed Tablet PC-based
applications in their academic environments. Typical applications include
lecture presentations, teaching computer science and software engineering
courses and providing peer-review comments.
We are focusing
on signature identification and verification applications for the Tablet
PC. In a signature verification
application, the user’s signature plays the role of a password. To register a new
user, a Windows form is provided for the user to enter a username and
two copies of their signature. The system extracts features from the entered
signatures and builds a signature model using a modelling technique such as
vector quantization, Gaussian mixture modelling or hidden Markov modelling. After
registration, the user can log on to the system using the registered username
and signature via a login page. The entered signature is compared with the
signature model of the claimed identity. If the match score is above a given
threshold, the identity claim is accepted.
· Collaborators: Dr Wanli Ma,
A/Prof Dharmendra Sharma, Girija Chetty
·
Description: User authentication
(human-by-machine authentication) is the process of verifying the identity of a
user: is this person really who he/she claims to be? User authentication has
become more complicated and difficult with the onset of the computer age.
Passwords are
excellent authenticators, but they can be stolen if recorded or guessed. We propose
to use keystroke biometrics to overcome this problem. Typing biometric-based
user authentication has been a very active research area. Typing biometric
features are time intervals between the events of pressing and releasing keys on a keyboard.
The four features mainly used are 1) the time interval during which a key
remains pressed (down-up), 2) the time interval from the release of a key until the next key is pressed
(up-down), 3) the time interval between two consecutive key presses
(down-down), and 4) the time interval between two consecutive key releases
(up-up). The four features form a 4-dimensional feature vector. A
set of feature vectors is collected to build a statistical model for a user
based on current pattern recognition methods such as learning vector
quantization, neural networks and hidden Markov models.
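The four timing features can be computed from raw key events as in the following sketch. The event format, a list of `(key, down_time, up_time)` tuples with times in milliseconds, is an assumption made for this example, as is the choice to produce one feature vector per pair of consecutive keys.

```python
def keystroke_features(events):
    """Return one 4-dimensional timing vector per pair of consecutive keys.

    Each event is (key, down_time, up_time); this layout is assumed
    for illustration only.
    """
    vectors = []
    for (_, down1, up1), (_, down2, up2) in zip(events, events[1:]):
        vectors.append((
            up1 - down1,    # down-up: how long the first key stays pressed
            down2 - up1,    # up-down: release of first key to press of next
            down2 - down1,  # down-down: between two consecutive presses
            up2 - up1,      # up-up: between two consecutive releases
        ))
    return vectors


# Hypothetical timings (milliseconds) for typing "abc".
events = [("a", 0, 80), ("b", 120, 210), ("c", 260, 330)]
for vector in keystroke_features(events):
    print(vector)
```

The up-down interval can be negative when the next key is pressed before the previous one is released, which is common in fast typing.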
·
Collaborators: A/Prof Tuan Pham
·
Description: New biological and biomedical problems in bioinformatics are very difficult to
handle with conventional methods and manual procedures. Approaches and
methodologies from computer science, information technology, and information sciences
for solving these problems have not yet been fully explored and have high potential for
addressing many significant issues in bioinformatics.
This project aims
to develop an innovative and comprehensive application of statistical and fuzzy
pattern recognition for the computerized classification of cell nuclei in
different mitotic phases. A summary of
the application is as follows.
Advances in
fluorescent probing and microscopic imaging technology provide important tools
for biological and medical research into the structures and functions of
cells and molecules. Such studies
require the processing and analysis of huge amounts of image data, and manual
image analysis is very time-consuming, and therefore costly, as well as potentially
inaccurate and poorly reproducible. An automated cellular imaging
analysis consists of segmentation, feature extraction, classification, and
tracking of individual cells in a dynamic cellular population. Fully automatic
classification of cell phases is the most
difficult task in such analysis.
We are interested
in applying several advanced computational, probabilistic, and fuzzy-set
methods that we have proposed for speech, speaker and image recognition to the
computerized classification of cell nuclei in different mitotic phases. We have
been testing several proposed computational procedures on real image
sequences recorded every fifteen minutes over a period of twenty-four hours
with time-lapse fluorescence microscopy. The data set was provided by the
·
Collaborators: Dr Wanli Ma, A/Prof Dharmendra
Sharma, A/Prof John Campbell, Dr Shuangzhe Liu
·
Description: Nowadays, computers are connected
through wired or wireless connections to form networks for resource sharing,
through which large amounts of data are exchanged. Many current computer
systems run critical business operations, yet little research and development work
has been done to secure these systems. IT security covers many issues, such as
intrusion detection and spam email filtering.
In general, there
are two types of intrusion detection systems: those based on signature matching
and those based on anomaly behaviour analysis. The goal of an intrusion detection
system is to detect intrusions efficiently
and effectively. These requirements are basic, yet they are
far from being achieved. Current technologies such as hidden Markov models (HMMs) and
information retrieval are used to detect intrusions.
Spam email is more or less like the junk mail we receive
in the mailboxes in front of our houses. Spam emails not only waste IT
resources, but also pose serious security threats. There are two types of spam
email: unsolicited commercial email and email used as a delivery agent for
malware (malicious software). The former uses email for commercial
advertising, including illegal commercial activities. The latter has
a more sinister intention. Any type of malware, be it a virus, a worm, or spyware, has to find a way
to infect host computers after it has been developed. An easy and effective way to deliver malware is
through unsolicited bulk email.
We propose new
fuzzy modelling methods to build the claimed identity model. These methods are
fuzzy hidden Markov modelling, fuzzy Gaussian Markov modelling and fuzzy
entropy clustering. We propose a new background modelling method to determine
the probability of the alternative hypothesis. These proposed methods have been
applied to biometric authentication problems and showed better performance than
current methods.
·
Collaborators: Dr Wanli Ma, A/Prof Dharmendra
Sharma, Dr Shuangzhe Liu
·
Description: We propose new statistical methods
that can avoid the limitations in the current hidden Markov modelling (HMM)
approach and therefore enhance the performance of user authentication and
intrusion detection systems.
We propose the
temporal Markov modelling (TMM) method to avoid the limitations of the current
HMM. The TMM has an observed Markov chain where cluster indices are states to
represent the dependence of each observation on its predecessor. The hidden
Markov chain in the HMM is still employed in the TMM. We use both the standard
NIST databases and the biometric and network databases we collect to evaluate
the proposed method.
We propose to
train the TMM using the quasi-likelihood estimation (QLE) method. The reason is
that computer network data and biometric data do not follow a known statistical
distribution; therefore the current maximum likelihood estimation (MLE) method is not appropriate. The proposed
QLE method requires only assumptions on the mean and variance functions rather
than the full likelihood. We also propose a new background modelling method to
determine the probability of the alternative hypothesis from the null
hypothesis.
· Collaborators: A/Prof Tuan
Pham, Dr Wanli Ma, A/Prof Dharmendra Sharma, Girija Chetty
·
Description: The main principles for a language
identification system are that it should be fast enough for real-time processing,
efficient, require minimal storage, and be robust against textual errors. Based
on these principles, a Markov chain-based method is proposed for language
identification in this paper. The occurrences of letters in a word can be
regarded as a stochastic process, so the word can be represented as a
Markov chain in which letters are states. The occurrence of the first letter in
the word is characterized by the initial probability of the Markov chain, and
the occurrence of each subsequent letter given its preceding letter
is characterized by the transition probability. Given a text document in a
specific language as a training set, the initial and transition probabilities
for all Markov chains representing all words in the text document are
calculated, and the set of those probabilities is regarded as a Markov model for
that language. To identify the language of an unknown string, the maximum
likelihood decision rule is used. Words in the string are regarded as Markov
chains and, for each language model built in the training session, the initial
and transition probabilities taken from the language model are used to
calculate the probability of the unknown string for that language. The unknown
string is then assigned to the language that has the maximum probability.
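The procedure described above can be sketched in a few lines of Python. The two one-sentence training texts, the 26-letter alphabet and the add-`alpha` smoothing for unseen letter pairs are assumptions introduced for this example and are far too small for real use; they only show the shape of the training and decision steps.

```python
import math
from collections import Counter

LETTERS = "abcdefghijklmnopqrstuvwxyz"  # assumed alphabet for this sketch


def words_of(text):
    """Lower-case the text and split it into words made of LETTERS only."""
    cleaned = "".join(c if c in LETTERS else " " for c in text.lower())
    return cleaned.split()


def train_language(text, alpha=0.5):
    """Estimate smoothed initial and transition probabilities for one language."""
    init_counts, trans_counts = Counter(), Counter()
    for word in words_of(text):
        init_counts[word[0]] += 1
        trans_counts.update(zip(word, word[1:]))
    n = len(LETTERS)
    init_total = sum(init_counts.values())
    init = {c: (init_counts[c] + alpha) / (init_total + alpha * n)
            for c in LETTERS}
    trans = {}
    for a in LETTERS:
        row_total = sum(trans_counts[(a, b)] for b in LETTERS)
        trans[a] = {b: (trans_counts[(a, b)] + alpha) / (row_total + alpha * n)
                    for b in LETTERS}
    return init, trans


def log_prob(text, model):
    """Log-probability of all words in the text under one language model."""
    init, trans = model
    total = 0.0
    for word in words_of(text):
        total += math.log(init[word[0]])
        total += sum(math.log(trans[a][b]) for a, b in zip(word, word[1:]))
    return total


def identify(text, models):
    """Maximum likelihood decision: pick the language with the highest score."""
    return max(models, key=lambda lang: log_prob(text, models[lang]))


# Toy training texts, chosen only for illustration.
models = {
    "english": train_language(
        "the quick brown fox jumps over the lazy dog and the cat"),
    "german": train_language(
        "der schnelle braune fuchs springt ueber den faulen hund und die katze"),
}
print(identify("the dog", models))
print(identify("der hund", models))
```

Working in log-probabilities avoids numerical underflow when the unknown string is long.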
·
Collaborators: Prof Michael Wagner, A/Prof
Dharmendra Sharma, Girija Chetty
·
Description: Most current background modelling methods were derived from various considerations
of the impostors’ models. However, speaker verification systems based on such
background models rely on the availability of speaker databases and on the
acoustic conditions. It is known that the speech signal is influenced by the
speaking environment, by the channel used to transmit the signal and, when it is
recorded, by the transducer used to capture it. For portable
devices such as palm-top computers and wireless handsets, high demands on
computation and memory are undesirable. A background model design
for flexible and portable speaker verification systems has been proposed for
this purpose. A speaker verification system using left-to-right hidden Markov
models consisting of 25 states with 4 Gaussian mixtures per state showed good
performance with this background model. In this approach, background speaker
models were built directly from the claimed speaker’s enrolment utterances.
In this project,
we propose a background modelling method for text-independent speaker
verification systems using Gaussian mixture models (GMMs),
based on the above-mentioned background model design. We use the same training
data to build the claimed speaker model and the background model. The difference
between the two models is the number of Gaussian mixtures: the background
model should have fewer Gaussian mixtures than the claimed
speaker model. We performed experiments on the YOHO speech database
with different numbers of Gaussian mixtures. The experimental results showed that a
low verification error rate is obtained if the number of Gaussian mixtures in
the background model is less than half of that in the claimed speaker model.
Compared with current background model set methods, the proposed method using
64-mixture GMMs for claimed speaker models and
16-mixture GMMs for background speaker models showed
better performance.
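For orientation, a verification score of the kind commonly used with such a pair of models is the log-likelihood ratio shown below. The exact score used in this project is not stated here, so this form and the symbols in it are assumptions: $\lambda_{c}$ denotes the claimed speaker GMM (for example 64 mixtures), $\lambda_{b}$ the background GMM trained on the same data (for example 16 mixtures), and $\theta$ a decision threshold.

```latex
S(X) = \log p(X \mid \lambda_{c}) - \log p(X \mid \lambda_{b}),
\qquad
\text{accept the identity claim if } S(X) > \theta .
```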
·
Collaborators: Prof Michael Wagner, Dr Wanli Ma,
A/Prof Dharmendra Sharma,
·
Description: Consider the pattern verification
problem in fuzzy set theory. To accept or reject the claimed pattern, the task
is to decide whether the input object X comes from the claimed
pattern λ0 or from the set of impostors λ, by comparing the score for X
with a decision threshold θ. The space of input objects can be considered as consisting of two
fuzzy subsets, one for the claimed pattern and one for impostors. The similarity score is interpreted as
the fuzzy membership function, which denotes the degree to which the
input object belongs to the claimed pattern. Accepting (or rejecting) the claimed
pattern is viewed as a defuzzification process, where
the input object is (or is not) in the claimed pattern's fuzzy subset if the
fuzzy membership value is (or is not) greater than the given threshold θ. From this fuzzy set
theory-based viewpoint, currently used scores can be viewed as fuzzy
membership scores and, conversely, other fuzzy memberships can be used as the
claimed pattern's scores.
In theory, there
are many ways to define the fuzzy membership function, so
this fuzzy approach provides more general scores than the current
likelihood ratio scores for pattern verification. These are termed fuzzy
membership scores, which denote the degree to which X belongs to the claimed pattern.
The use of the
normalization term can cause false acceptances of impostors because
ratio-based values are relative. For example, the two ratios (0.08 /
0.04) and (0.0000008 / 0.0000004) have the same value of 2. The first ratio can
lead to a correct decision, whereas the second is unlikely to, since both
likelihood values are very low. This problem can be overcome by applying the
idea of the well-known noise clustering method in fuzzy clustering, where
impostors’ objects are considered as noisy data and thus should have
arbitrarily small fuzzy membership scores in the claimed pattern's fuzzy
subset. This is implemented by simply adding to the normalization term a
constant membership value, which denotes the belonging of all input objects to
the impostors' fuzzy subset.
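The effect of the added constant can be checked numerically with the two ratios from the example above. In this sketch the constant `noise = 0.01` and the decision threshold of 1.0 are assumed values chosen only to make the contrast visible, and the function names are hypothetical.

```python
def likelihood_ratio(p_claimed, p_impostor):
    """Plain ratio score: claimed-pattern likelihood over the normalization term."""
    return p_claimed / p_impostor


def noise_adjusted_score(p_claimed, p_impostor, noise=0.01):
    """Ratio score with a constant membership value added to the normalization term."""
    return p_claimed / (p_impostor + noise)


THRESHOLD = 1.0  # assumed decision threshold for this illustration

for p_claimed, p_impostor in [(0.08, 0.04), (0.0000008, 0.0000004)]:
    plain = likelihood_ratio(p_claimed, p_impostor)
    adjusted = noise_adjusted_score(p_claimed, p_impostor)
    print(p_claimed, p_impostor, plain, adjusted, adjusted > THRESHOLD)
```

Both pairs give the same plain ratio, but with the constant in the denominator the pair with very low likelihoods falls far below the threshold while the first pair stays above it.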