Adams B, Phung D and Venkatesh S (2014), "Social reader: towards browsing the social web", Multimedia Tools and Applications. Vol. 69(3), pp. 951-990. Springer.
Abstract: We describe Social Reader, a feed-reader-plus-social-network aggregator that mines comments from social media in order to display a user’s relational neighborhood as a navigable social network. Social Reader’s network visualization enhances mutual awareness of blogger communities, facilitates their exploration and growth with a fully drag-n-drop interface, and provides novel ways to filter and summarize people, groups, blogs and comments. We discuss the architecture behind the reader, highlight tasks it adds to the workflow of a typical reader, and assess their cost. We also explore the potential of mood-based features in social media applications. Mood is particularly relevant to social media, reflecting the personal nature of the medium. We explore two prototype mood-based features: colour coding the mood of recent posts according to a valence/arousal map, and a mood-based abstract of recent activity using image media. A six-week study of the software involving 20 users confirmed the usefulness of the novel visual display, via a quantitative analysis of use logs, and an exit survey.
BibTeX:
@article{adams2014social,
  author = {Adams, Brett and Phung, Dinh and Venkatesh, Svetha},
  title = {Social reader: towards browsing the social web},
  journal = {Multimedia Tools and Applications},
  publisher = {Springer},
  year = {2014},
  volume = {69},
  number = {3},
  pages = {951--990},
  url = {http://link.springer.com/article/10.1007%2Fs11042-012-1138-5}
}
Arandjelović O, Pham D and Venkatesh S (2014), "Stream quantiles via maximal entropy histograms", In International Conference on Neural Information Processing (ICONIP). Vol. II, pp. 327-334.
Abstract: We address the problem of estimating the running quantile of a data stream when the memory for storing observations is limited. We (i) highlight the limitations of approaches previously described in the literature which make them unsuitable for non-stationary streams, (ii) describe a novel principle for the utilization of the available storage space, and (iii) introduce two novel algorithms which exploit the proposed principle. Experiments on three large real-world data sets demonstrate that the proposed methods vastly outperform the existing alternatives.
BibTeX:
@inproceedings{AranPhamVenk2014,
  author = {Arandjelović, O. and Pham, D. and Venkatesh, Svetha},
  title = {Stream quantiles via maximal entropy histograms},
  booktitle = {International Conference on Neural Information Processing (ICONIP)},
  year = {2014},
  volume = {II},
  pages = {327--334},
  url = {http://arxiv.org/pdf/1409.7289.pdf}
}
Beykikhoshk A, Arandjelovic O, Phung DQ, Venkatesh S and Caelli T (2014), "Data-mining twitter and the autism spectrum disorder: A Pilot study", In International Conference on Advances in Social Networks Analysis and Mining (ASONAM), pp. 349-356.
Abstract: The autism spectrum disorder (ASD) is increasingly being recognized as a major public health issue which affects approximately 0.5–0.6% of the population. Promoting the general awareness of the disorder, increasing the engagement with the affected individuals and their carers, and understanding the success of penetration of the current clinical recommendations in the target communities, is crucial in driving research as well as policy. The aim of the present work is to investigate if Twitter, as a highly popular platform for information exchange, can be used as a data-mining source which could aid in the aforementioned challenges. Specifically, using a large data set of harvested tweets, we present a series of experiments which examine a range of linguistic and semantic aspects of messages posted by individuals interested in ASD. Our findings, the first of their nature in the published scientific literature, strongly motivate additional research on this topic and present a methodological basis for further work.
BibTeX:
@inproceedings{beykikhoshkdata,
  author = {Adham Beykikhoshk and Ognjen Arandjelovic and Dinh Q. Phung and Svetha Venkatesh and Terry Caelli},
  title = {Data-mining twitter and the autism spectrum disorder: A Pilot study},
  booktitle = {International Conference on Advances in Social Networks Analysis and Mining (ASONAM)},
  year = {2014},
  pages = {349--356},
  url = {http://dx.doi.org/10.1109/ASONAM.2014.6921609},
  doi = {10.1109/ASONAM.2014.6921609}
}
Dao B, Nguyen T, Phung DQ and Venkatesh S (2014), "Effect of Mood, Social Connectivity and Age in Online Depression Community via Topic and Linguistic Analysis", In 15th International Conference on Web Information Systems Engineering (WISE), pp. 398-407.
Abstract: Depression afflicts one in four people during their lives. Several studies have shown that for the isolated and mentally ill, the Web and social media provide effective platforms for support and treatment, as well as a means to acquire scientific, clinical understanding of this mental condition. More and more individuals affected by depression join online communities to seek information, express themselves, share their concerns and look for support [12]. For the first time, we collect and study a large online depression community of more than 12,000 active members from Live Journal. We examine the effect of mood, social connectivity and age on the online messages authored by members in an online depression community. The posts are considered in two aspects: what is written (topic) and how it is written (language style). We use statistical and machine learning methods to discriminate the posts made by bloggers in low versus high valence mood, in different age categories and in different degrees of social connectivity. Using statistical tests, language styles are found to be significantly different between low and high valence cohorts, whilst topics are significantly different between people with different degrees of social connectivity. High performance is achieved for low versus high valence post classification using writing style as features. The finding suggests the potential of using social media in depression screening, especially in online settings.
BibTeX:
@inproceedings{daoeffect,
  author = {Bo Dao and Thin Nguyen and Dinh Q. Phung and Svetha Venkatesh},
  title = {Effect of Mood, Social Connectivity and Age in Online Depression Community via Topic and Linguistic Analysis},
  booktitle = {15th International Conference on Web Information Systems Engineering (WISE)},
  year = {2014},
  pages = {398--407},
  url = {http://link.springer.com/chapter/10.1007%2F978-3-319-11749-2_30},
  doi = {10.1007/978-3-319-11749-2_30}
}
Dao B, Nguyen T, Venkatesh S and Phung D (2014), "Analysis of circadian rhythms from online communities of individuals with affective disorders", In International Conference on Data Science and Advanced Analytics (DSAA), pp. 463-469.
Abstract: The circadian system regulates 24-hour rhythms in biological creatures. It impacts mood regulation. The disruptions of circadian rhythms cause destabilization in individuals with affective disorders, such as depression and bipolar disorders. Previous work has examined the role of the circadian system on effects of light interactions on mood-related systems, the effects of light manipulation on the brain, and the impact of chronic stress on rhythms. However, such studies have been conducted in small, preselected populations. The deluge of data is now changing the landscape of research practice. The unprecedented growth of social media data allows one to study individual behavior across large and diverse populations. In particular, individuals with affective disorders from online communities have not been examined rigorously. In this paper, we aim to use social media as a sensor to identify circadian patterns for individuals with affective disorders in online communities. We use a large-scale study cohort of data collected from online affective disorder communities. We analyze changes in hourly, daily, weekly and seasonal affect of these clinical groups in contrast with control groups of general communities. By comparing the behaviors between the clinical groups and the control groups, our findings show that individuals with affective disorders show a significant distinction in their circadian rhythms across the online activity. The results shed light on the potential of using social media for identifying diurnal individual variation in affective state, providing key indicators and risk factors for noninvasive wellbeing monitoring and prediction.
BibTeX:
@inproceedings{dao2014analysis,
  author = {Dao, Bo and Nguyen, Thin and Venkatesh, Svetha and Phung, Dinh},
  title = {Analysis of circadian rhythms from online communities of individuals with affective disorders},
  booktitle = {International Conference on Data Science and Advanced Analytics (DSAA)},
  year = {2014},
  pages = {463--469},
  url = {http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=7058113}
}
Gopakumar S, Tran T, Phung D and Venkatesh S (2014), "Stabilizing Sparse Cox Model using Clinical Structures in Electronic Medical Records", In Proceedings of the Second International Workshop on Pattern Recognition for Healthcare Analytics.
Abstract: Stability in clinical prediction models is crucial for transferability between studies, yet has received little attention. The problem is paramount in high dimensional data which invites sparse models with feature selection capability. We introduce an effective method to stabilize sparse Cox model of time-to-events using clinical structures inherent in Electronic Medical Records. Model estimation is stabilized using a feature graph derived from two types of EMR structures: temporal structure of disease and intervention recurrences, and hierarchical structure of medical knowledge and practices. We demonstrate the efficacy of the method in predicting time-to-readmission of heart failure patients. On two stability measures - the Jaccard index and the Consistency index - the use of clinical structures significantly increased feature stability without hurting discriminative power. Our model reported a competitive AUC of 0.64 (95% CIs: [0.58,0.69]) for 6 months prediction.
BibTeX:
@inproceedings{gopakumar_tran_phung_venkatesh_icpr_ws14,
  author = {Gopakumar, Shivapratap and Tran, Truyen and Phung, Dinh and Venkatesh, Svetha},
  title = {Stabilizing Sparse Cox Model using Clinical Structures in Electronic Medical Records},
  booktitle = {Proceedings of the Second International Workshop on Pattern Recognition for Healthcare Analytics},
  year = {2014},
  url = {http://arxiv.org/pdf/1407.6094v1.pdf}
}
Gupta S, Phung D and Venkatesh S (2014), "Modelling multilevel data in multimedia: A hierarchical factor analysis approach", Multimedia Tools and Applications, pp. 1-23. Springer.
BibTeX:
@article{gupta2014modelling,
  author = {Gupta, Sunil and Phung, Dinh and Venkatesh, Svetha},
  title = {Modelling multilevel data in multimedia: A hierarchical factor analysis approach},
  journal = {Multimedia Tools and Applications},
  publisher = {Springer},
  year = {2014},
  pages = {1--23}
}
Gupta S, Tran T, Luo W, Phung D, Kennedy RL, Broad A, Campbell D, Kipp D, Singh M, Khasraw M, Matheson L, Ashley D and Venkatesh S (2014), "Machine-learning prediction of cancer survival: a retrospective study using electronic administrative records and a cancer registry", BMJ Open. Vol. 4(3), pp. e004007. British Medical Journal Publishing Group.
Abstract: Using the prediction of cancer outcome as a model, we have tested the hypothesis that through analysing routinely collected digital data contained in an electronic administrative record (EAR), using machine-learning techniques, we could enhance conventional methods in predicting clinical outcomes.
BibTeX:
@article{gupta2014machine,
  author = {Gupta, Sunil and Tran, Truyen and Luo, Wei and Phung, Dinh and Kennedy, Richard Lee and Broad, Adam and Campbell, David and Kipp, David and Singh, Madhu and Khasraw, Mustafa and Matheson, Leigh and Ashley, David and Venkatesh, Svetha},
  title = {Machine-learning prediction of cancer survival: a retrospective study using electronic administrative records and a cancer registry},
  journal = {BMJ Open},
  publisher = {British Medical Journal Publishing Group},
  year = {2014},
  volume = {4},
  number = {3},
  pages = {e004007},
  url = {http://bmjopen.bmj.com/content/4/3/e004007.full.pdf+html}
}
Gupta SK, Phung D, Adams B and Venkatesh S (2014), "A matrix factorization framework for jointly analyzing multiple nonnegative data sources", In Data Mining for Service, pp. 151-170. Springer.
Abstract: Nonnegative matrix factorization based methods provide one of the simplest and most effective approaches to text mining. However, their applicability is mainly limited to analyzing a single data source. In this paper, we propose a novel joint matrix factorization framework which can jointly analyze multiple data sources by exploiting their shared and individual structures. The proposed framework is flexible to handle any arbitrary sharing configurations encountered in real world data. We derive an efficient algorithm for learning the factorization and show that its convergence is theoretically guaranteed. We demonstrate the utility and effectiveness of the proposed framework in two real-world applications–improving social media retrieval using auxiliary sources and cross-social media retrieval. Representing each social media source using their textual tags, for both applications, we show that retrieval performance exceeds the existing state-of-the-art techniques. The proposed solution provides a generic framework and can be applicable to a wider context in data mining wherever one needs to exploit mutual and individual knowledge present across multiple data sources.
BibTeX:
@incollection{gupta2014matrix,
  author = {Gupta, Sunil Kumar and Phung, Dinh and Adams, Brett and Venkatesh, Svetha},
  title = {A matrix factorization framework for jointly analyzing multiple nonnegative data sources},
  booktitle = {Data Mining for Service},
  publisher = {Springer},
  year = {2014},
  pages = {151--170},
  url = {https://www.researchgate.net/profile/Sunil_Gupta28/publication/260300069_A_Matrix_Factorization_Framework_for_Jointly_Analyzing_Multiple_Nonnegative_Data_Sources/links/02e7e537f3a455cc53000000.pdf}
}
Gupta SK, Rana S, Phung DQ and Venkatesh S (2014), "Keeping up with Innovation: A Predictive Framework for Modeling Healthcare Data with Evolving Clinical Interventions", In International Conference on Data Mining, pp. 235-243.
Abstract: Medical outcomes are inexorably linked to patient illness and clinical interventions. Interventions change the course of disease, crucially determining outcome. Traditional outcome prediction models build a single classifier by augmenting interventions with disease information. Interventions, however, differentially affect prognosis, thus a single prediction rule may not suffice to capture variations. Interventions also evolve over time as more advanced interventions replace older ones. To this end, we propose a Bayesian nonparametric, supervised framework that models a set of intervention groups through a mixture distribution building a separate prediction rule for each group, and allows the mixture distribution to change with time. This is achieved by using a hierarchical Dirichlet process mixture model over the interventions. The outcome is then modeled as conditional on both the latent grouping and the disease information through a Bayesian logistic regression. Experiments on synthetic and medical cohorts for 30-day readmission prediction demonstrate the superiority of the proposed model over clinical and data mining baselines.
BibTeX:
@inproceedings{guptakeeping,
  author = {Sunil Kumar Gupta and Santu Rana and Dinh Q. Phung and Svetha Venkatesh},
  title = {Keeping up with Innovation: A Predictive Framework for Modeling Healthcare Data with Evolving Clinical Interventions},
  booktitle = {International Conference on Data Mining},
  year = {2014},
  pages = {235--243},
  url = {http://dx.doi.org/10.1137/1.9781611973440.27},
  doi = {10.1137/1.9781611973440.27}
}
Li C, Rana S, Phung DQ and Venkatesh S (2014), "Regularizing Topic Discovery in EMRs with Side Information by Using Hierarchical Bayesian Models", In 22nd International Conference on Pattern Recognition, pp. 1307-1312.
Abstract: We propose a novel hierarchical Bayesian framework, word-distance-dependent Chinese restaurant franchise (wd-dCRF) for topic discovery from a document corpus regularized by side information in the form of word-to-word relations, with an application to Electronic Medical Records (EMRs). Typically, an EMR dataset consists of several patients (documents) and each patient contains many diagnosis codes (words). We exploit the side information available in the form of a semantic tree structure among the diagnosis codes for semantically-coherent disease topic discovery. We introduce novel functions to compute word-to-word distances when side information is available in the form of tree structures. We derive an efficient inference method for the wd-dCRF using an MCMC technique. We evaluate on a real world medical dataset consisting of about 1000 patients with PolyVascular disease. Compared with the popular topic analysis tool, hierarchical Dirichlet process (HDP), our model discovers topics which are superior in terms of both qualitative and quantitative measures.
BibTeX:
@inproceedings{liregularizing,
  author = {Cheng Li and Santu Rana and Dinh Q. Phung and Svetha Venkatesh},
  title = {Regularizing Topic Discovery in EMRs with Side Information by Using Hierarchical Bayesian Models},
  booktitle = {22nd International Conference on Pattern Recognition},
  year = {2014},
  pages = {1307--1312},
  url = {http://dx.doi.org/10.1109/ICPR.2014.234},
  doi = {10.1109/ICPR.2014.234}
}
Luo W, Phung D, Nguyen V, Tran T and Venkatesh S (2014), "Speed up health research through topic modeling of coded clinical data", In International Workshop on Pattern Recognition for Healthcare Analytics. Stockholm, Sweden.
Abstract: Although the randomized controlled trial is the gold standard in medical research, researchers are increasingly looking to alternative data sources for hypothesis generation and early-stage evidence collection. Coded clinical data are collected routinely in most hospitals. While they contain rich information directly related to the real clinical setting, they are both noisy and semantically diverse, making them difficult to analyze with conventional statistical tools. This paper presents a novel application of Bayesian nonparametric modeling to uncover latent information in coded clinical data. For a patient cohort, a Bayesian nonparametric model is used to reveal the common comorbidity groups shared by the patients and the proportion in which each comorbidity group is reflected in each individual patient. To demonstrate the method, we present a case study based on hospitalization coding from an Australian hospital. The model recovered 15 comorbidity groups among 1012 patients hospitalized during a month. When patients from two areas of unequal socio-economic status were compared, it revealed a higher prevalence of diverticular disease in the region of lower socio-economic status. The study builds a convincing case for using routine coded data to speed up hypothesis generation.
BibTeX:
@inproceedings{luo_phung_nguyen_tran_venkatesh_iapr14,
  author = {Wei Luo and Dinh Phung and Vu Nguyen and Truyen Tran and Svetha Venkatesh},
  title = {Speed up health research through topic modeling of coded clinical data},
  booktitle = {International Workshop on Pattern Recognition for Healthcare Analytics},
  year = {2014},
  url = {https://docs.google.com/viewer?a=v&pid=sites&srcid=ZGVmYXVsdGRvbWFpbnxpd3ByaGEyfGd4OjU0NmYxNDg4MmU2ZDA0ZDQ}
}
Nguyen T, Duong T, Phung D and Venkatesh S (2014), "Affective, Linguistic and Topic Patterns in Online Autism Communities", In Web Information Systems Engineering, pp. 474-488. Springer International Publishing.
Abstract: Online communities offer a platform to support and discuss health issues. They provide a more accessible way to bring together people with the same concerns or interests. This paper aims to study the characteristics of online autism communities (called Clinical) in comparison with other online communities (called Control) using data from 110 Live Journal weblog communities. Using machine learning techniques, we comprehensively analyze these online autism communities. We study three key aspects expressed in the blog posts made by members of the communities: sentiment, topics and language style. Sentiment analysis shows that the sentiment of the clinical group has lower valence, indicative of poorer moods than people in control. Topics and language styles are shown to be good predictors of autism posts. The result shows the potential of social media in medical studies for a broad range of purposes such as screening, monitoring and subsequently providing support for online communities of individuals with special needs.
BibTeX:
@incollection{nguyen2014affective,
  author = {Nguyen, Thin and Duong, Thi and Phung, Dinh and Venkatesh, Svetha},
  title = {Affective, Linguistic and Topic Patterns in Online Autism Communities},
  booktitle = {Web Information Systems Engineering},
  publisher = {Springer International Publishing},
  year = {2014},
  pages = {474--488},
  url = {http://link.springer.com/chapter/10.1007%2F978-3-319-11746-1_35}
}
Nguyen T, Gupta S, Venkatesh S and Phung D (2014), "Fixed-lag particle filter for continuous context discovery using Indian Buffet Process", In IEEE International Conference on Pervasive Computing and Communications (PerCom), pp. 20-28.
Abstract: Exploiting context from stream data in pervasive environments remains a challenge. We aim to extract proximal context from Bluetooth stream data, using an incremental, Bayesian nonparametric framework that estimates the number of contexts automatically. Unlike current approaches that can only provide final proximal grouping, our method provides proximal grouping and membership of users over time. Additionally, it provides an efficient online inference. We construct a co-location matrix over time using Bluetooth data. A Poisson-exponential model is used to factorize this matrix into a factor matrix, interpreted as proximal groups, and a coefficient matrix that indicates factor usage. The coefficient matrix follows the Indian Buffet Process prior, which estimates the number of factors automatically. The non-negativity and sparsity of factors are enforced by using the exponential distribution to generate the factors. We propose a fixed-lag particle filter algorithm to process data incrementally. We compare the incremental inference (particle filter) with full batch inference (Gibbs sampling) in terms of normalized factorization error and execution time. The normalized error obtained through our incremental inference is comparable to that of full batch inference, whilst execution is more than 100 times faster. The discovered factors have similar meaning to the results of the popular Louvain method for community detection.
BibTeX:
@inproceedings{nguyen2014fixed,
  author = {Nguyen, Thuong and Gupta, Sunil and Venkatesh, Svetha and Phung, Dinh},
  title = {Fixed-lag particle filter for continuous context discovery using Indian Buffet Process},
  booktitle = {IEEE International Conference on Pervasive Computing and Communications (PerCom)},
  year = {2014},
  pages = {20--28},
  url = {http://ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=6813939}
}
Nguyen T, Gupta SK, Venkatesh S and Phung DQ (2014), "A Bayesian Nonparametric Framework for Activity Recognition Using Accelerometer Data", In 22nd International Conference on Pattern Recognition, pp. 2017-2022.
Abstract: Monitoring the daily physical activity of humans plays an important role in preventing diseases as well as improving health. In this paper, we demonstrate a framework for monitoring the physical activity levels in daily life. We collect the data using accelerometer sensors in a realistic setting without any supervision. The ground truth of activities is provided by the participants themselves using an experience sampling application running on mobile phones. The original data is discretized by the hierarchical Dirichlet process (HDP) into different activity levels and the number of levels is inferred automatically. We validate the accuracy of the extracted patterns by using them for the multi-label classification of activities and demonstrate high performance on various standard evaluation metrics. We further show that the extracted patterns are highly correlated to the daily routine of users.
BibTeX:
@inproceedings{nguyenbayesian,
  author = {Thuong Nguyen and Sunil Kumar Gupta and Svetha Venkatesh and Dinh Q. Phung},
  title = {A Bayesian Nonparametric Framework for Activity Recognition Using Accelerometer Data},
  booktitle = {22nd International Conference on Pattern Recognition},
  year = {2014},
  pages = {2017--2022},
  url = {http://dx.doi.org/10.1109/ICPR.2014.352},
  doi = {10.1109/ICPR.2014.352}
}
Nguyen T, Phung D, Adams B and Venkatesh S (2014), "Mood sensing from social media texts and its applications", Knowledge and Information Systems. Vol. 39(3), pp. 667-702. Springer.
Abstract: We present a large-scale mood analysis in social media texts. We organise the paper in three parts: (1) we address the problem of feature selection and classification of mood in the blogosphere; (2) we extract global mood patterns at different levels of aggregation from a large-scale data set of approximately 18 million documents; and (3) we extract the mood trajectory for an egocentric user and study how it can be used to detect subtle emotion signals in a user-centric manner, supporting discovery of hyper-groups of communities based on sentiment information. For mood classification, two feature sets proposed in psychology are used, showing that these features are efficient, do not require a training phase and yield classification results comparable to state-of-the-art supervised feature selection schemes; on mood patterns, empirical results for mood organisation in the blogosphere are provided, analogous to the structure of human emotion proposed independently in the psychology literature; and on community structure discovery, a sentiment-based approach can yield useful insights into community formation.
BibTeX:
@article{nguyen2014mood,
  author = {Nguyen, Thin and Phung, Dinh and Adams, Brett and Venkatesh, Svetha},
  title = {Mood sensing from social media texts and its applications},
  journal = {Knowledge and Information Systems},
  publisher = {Springer},
  year = {2014},
  volume = {39},
  number = {3},
  pages = {667--702},
  url = {http://prada-research.net/~dinh/uploads/Main/Publications/Nguyen_etal_13mood.pdf}
}
Nguyen T, Phung DQ, Dao B, Venkatesh S and Berk M (2014), "Affective and Content Analysis of Online Depression Communities", IEEE Transactions on Affective Computing. Vol. 5(3), pp. 217-226.
Abstract: A large number of people use online communities to discuss mental health issues, thus offering opportunities for new understanding of these communities. This paper aims to study the characteristics of online depression communities (CLINICAL) in comparison with those joining other online communities (CONTROL). We use machine learning and statistical methods to discriminate online messages between depression and control communities using mood, psycholinguistic processes and content topics extracted from the posts generated by members of these communities. All aspects including mood, the written content and writing style are found to be significantly different between the two types of communities. Sentiment analysis shows the clinical group has lower valence than people in the control group. For language styles and topics, statistical tests reject the hypothesis of equality on psycholinguistic processes and topics between the two groups. We show good predictive validity in depression classification using topics and psycholinguistic clues as features. Clear discrimination between writing styles and contents, with good predictive power, is an important step in understanding social media and its use in mental health.
BibTeX:
@article{nguyenaffective,
  author = {Thin Nguyen and Dinh Q. Phung and Bo Dao and Svetha Venkatesh and Michael Berk},
  title = {Affective and Content Analysis of Online Depression Communities},
  journal = {IEEE Transactions on Affective Computing},
  year = {2014},
  volume = {5},
  number = {3},
  pages = {217--226},
  url = {http://dx.doi.org/10.1109/TAFFC.2014.2315623},
  doi = {10.1109/TAFFC.2014.2315623}
}
Nguyen T, Phung DQ, Luo W, Tran T and Venkatesh S (2014), "iPoll: Automatic Polling Using Online Search", In 15th International Conference on Web Information Systems Engineering, pp. 266-275.
Abstract: For years, opinion polls have relied on data collected through telephone or person-to-person surveys. The process is costly, inconvenient, and slow. Recently online search data has emerged as a potential proxy for the survey data. However, considerable human involvement is still needed for the selection of search indices, a task that requires knowledge of both the target issue and how search terms are used by the online community. The robustness of such manually selected search indices can be questionable. In this paper, we propose an automatic polling system through a novel application of machine learning. In this system, the need for examining, comparing, and selecting search indices has been eliminated through automatic generation of candidate search indices and intelligent combination of the indices. The results include a publicly accessible web application that provides real-time, robust, and accurate measurements of public opinions on several subjects of general interest.
BibTeX:
@inproceedings{nguyenipoll,
  author = {Thin Nguyen and Dinh Q. Phung and Wei Luo and Truyen Tran and Svetha Venkatesh},
  title = {iPoll: Automatic Polling Using Online Search},
  booktitle = {15th International Conference on Web Information Systems Engineering},
  year = {2014},
  pages = {266--275},
  url = {http://dx.doi.org/10.1007/978-3-319-11749-2_21},
  doi = {10.1007/978-3-319-11749-2_21}
}
Nguyen TV, Phung D, Nguyen X, Venkatesh S and Bui H (2014), "Bayesian Nonparametric Multilevel Clustering with Group-Level Contexts", In Proceedings of The 31st International Conference on Machine Learning, pp. 288-296.
Abstract: We present a Bayesian non-parametric framework for multilevel clustering which utilizes group-level context information to simultaneously discover low-dimensional structures of the group contents and partition groups into clusters. Using the Dirichlet process as the building block, our model constructs a product base-measure with a nested structure to accommodate content and context observations at multiple levels. The proposed model possesses properties that link the nested Dirichlet processes (nDP) and the Dirichlet process mixture models (DPM) in an interesting way: integrating out all contents results in the DPM over contexts, whereas integrating out group-specific contexts results in the nDP mixture over content variables. We provide a Polya-urn view of the model and an efficient collapsed Gibbs inference procedure. Extensive experiments on real-world datasets demonstrate the advantage of utilizing context information via our model in both text and image domains.
BibTeX:
@inproceedings{nguyen2014bayesian,
  author = {Nguyen, Tien Vu and Phung, Dinh and Nguyen, Xuanlong and Venkatesh, Svetha and Bui, Hung},
  title = {Bayesian Nonparametric Multilevel Clustering with Group-Level Contexts},
  booktitle = {Proceedings of The 31st International Conference on Machine Learning},
  year = {2014},
  pages = {288--296},
  url = {http://jmlr.org/proceedings/papers/v32/nguyenb14.html}
}
Nguyen T-B, Lou W, Caelli T, Venkatesh S and Phung D (2014), "Individualized arrhythmia detection with ECG signals from wearable devices", In International Conference on Data Science and Advanced Analytics (DSAA), pp. 570-576.
Abstract: Low cost pervasive electrocardiogram (ECG) monitors are changing how sinus arrhythmias are diagnosed among patients with mild symptoms. With the large amount of data generated from long-term monitoring come new data science and analytical challenges. Although traditional rule-based detection algorithms still work on relatively short clinical quality ECG, they are not optimal for pervasive signals collected from wearable devices: they don't adapt to individual differences and assume accurate identification of ECG fiducial points. To overcome these shortcomings of the rule-based methods, this paper introduces an arrhythmia detection approach for low quality pervasive ECG signals. To achieve the robustness needed, two techniques were applied. First, a set of ECG features with minimal reliance on fiducial point identification was selected. Next, the features were normalized using robust statistics to factor out baseline individual differences and clinically irrelevant temporal drift that is common in pervasive ECG. The proposed method was evaluated using pervasive ECG signals we collected, in combination with clinician-validated ECG signals from Physiobank. Empirical evaluation confirms accuracy improvements of the proposed approach over the traditional clinical rules.
BibTeX:
@inproceedings{nguyen2014individualized,
  author = {Nguyen, Thanh-Binh and Lou, Wei and Caelli, Terry and Venkatesh, Svetha and Phung, Dinh},
  title = {Individualized arrhythmia detection with ECG signals from wearable devices},
  booktitle = {International Conference on Data Science and Advanced Analytics (DSAA)},
  year = {2014},
  pages = {570--576},
  url = {http://ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=7058128}
}
Nguyen T-B, Nguyen T, Luo W, Venkatesh S and Phung D (2014), "Unsupervised inference of significant locations from WiFi data for understanding human dynamics", In 13th International Conference on Mobile and Ubiquitous Multimedia, pp. 232-235.
Abstract: Motion and location activities are essential to understanding human dynamics. This paper presents a method for discovering significant locations and individuals' daily routines from WiFi data, a data source considered more suitable for analyzing human dynamics than GPS data. Our method determines significant locations by clustering access points in close proximity using the Affinity Propagation algorithm. We demonstrate the method on the MDC dataset that includes more than 30 million WiFi scans. The experimental results show a high clustering performance for most of the users. The discovered location trajectories revealed interesting mobility patterns of mobile phone users. The human dynamics of participants is reflected through the entropy of the location distributions which shows interesting correlation with the age and occupations of users. Quantitative results are presented to support our proposed approach.
BibTeX:
@inproceedings{nguyen2014unsupervised,
  author = {Nguyen, Thanh-Binh and Nguyen, Thuong and Luo, Wei and Venkatesh, Svetha and Phung, Dinh},
  title = {Unsupervised inference of significant locations from WiFi data for understanding human dynamics},
  booktitle = {13th International Conference on Mobile and Ubiquitous Multimedia},
  year = {2014},
  pages = {232--235},
  url = {http://dl.acm.org/citation.cfm?id=2677997}
}
Pham D-S, Venkatesh S, Lazarescu M and Budhaditya S (2014), "Anomaly detection in large-scale data stream networks", Data Mining and Knowledge Discovery. Vol. 28(1), pp. 145-189. Springer.
Abstract: This paper addresses the anomaly detection problem in large-scale data mining applications using residual subspace analysis. We are specifically concerned with situations where the full data cannot be practically obtained due to physical limitations such as low bandwidth, limited memory, storage, or computing power. Motivated by the recent compressed sensing (CS) theory, we suggest a framework wherein random projection can be used to obtain compressed data, addressing the scalability challenge. Our theoretical contribution shows that the spectral property of the CS data is approximately preserved under such a projection and thus the performance of spectral-based methods for anomaly detection is almost equivalent to the case in which the raw data is completely available. Our second contribution is the construction of the framework to use this result and detect anomalies in the compressed data directly, thus circumventing the problems of data acquisition in large sensor networks. We have conducted extensive experiments to detect anomalies in network and surveillance applications on large datasets, including the benchmark PETS 2007 and 83 GB of real footage from three public train stations. Our results show that our proposed method is scalable, and importantly, its performance is comparable to conventional methods for anomaly detection when the complete data is available.
BibTeX:
@article{pham2014anomaly,
  author = {Pham, Duc-Son and Venkatesh, Svetha and Lazarescu, Mihai and Budhaditya, Saha},
  title = {Anomaly detection in large-scale data stream networks},
  journal = {Data Mining and Knowledge Discovery},
  publisher = {Springer},
  year = {2014},
  volume = {28},
  number = {1},
  pages = {145--189},
  url = {http://link.springer.com/article/10.1007/s10618-012-0297-3}
}
Phung D, Nguyen T, Gupta S and Venkatesh S (2014), "Learning latent activities from social signals with hierarchical Dirichlet processes", Handbook on Plan, Activity, and Intent Recognition, pp. 149-174.
Abstract: Understanding human activities is an important research topic, notably in assisted living and health monitoring. Beyond simple forms of activity (e.g., an RFID event of entering a building), learning latent activities that are more semantically interpretable, such as sitting at a desk, meeting with people or gathering with friends, remains a challenging problem. Supervised learning has been the typical modeling choice in the past. However, this requires labeled training data, is unable to predict never-seen-before activities and fails to adapt to the continuing growth of data over time. In this chapter, we explore the use of a Bayesian nonparametric method, in particular the Hierarchical Dirichlet Process, to infer latent activities from sensor data acquired in a pervasive setting. Our framework is unsupervised, requires no labeled data and is able to discover new activities as data grows. We present experiments on extracting movement and interaction activities from sociometric badge signals and show how to use them for detecting sub-communities. Using the popular Reality Mining dataset, we further demonstrate the extraction of collocation activities and use them to automatically infer the structure of social subgroups.
BibTeX:
@incollection{phung2014learning,
  author = {Phung, Dinh and Nguyen, Thuong and Gupta, Sunil and Venkatesh, Svetha},
  title = {Learning latent activities from social signals with hierarchical Dirichlet processes},
  booktitle = {Handbook on Plan, Activity, and Intent Recognition},
  year = {2014},
  pages = {149--174},
  url = {http://prada-research.net/~svetha/papers/2014/Phung_etal_pair14.pdf}
}
Rana S, Gupta SK, Phung D and Venkatesh S (2014), "Intervention-Driven Predictive Framework for Modeling Healthcare Data", In Advances in Knowledge Discovery and Data Mining, pp. 497-508. Springer.
Abstract: Assessing prognostic risk is crucial to clinical care, and critically dependent on both diagnosis and medical interventions. Current methods use this augmented information to build a single prediction rule. But this may not be expressive enough to capture differential effects of interventions on prognosis. To this end, we propose a supervised, Bayesian nonparametric framework that simultaneously discovers the latent intervention groups and builds a separate prediction rule for each intervention group. The prediction rule is learnt using diagnosis data through a Bayesian logistic regression. For inference, we develop an efficient collapsed Gibbs sampler. We demonstrate that our method outperforms baselines in predicting 30-day hospital readmission using two patient cohorts - Acute Myocardial Infarction and Pneumonia. The significance of this model is that it can be applied widely across a broad range of medical prognosis tasks.
BibTeX:
@incollection{rana2014intervention,
  author = {Rana, Santu and Gupta, Sunil Kumar and Phung, Dinh and Venkatesh, Svetha},
  title = {Intervention-Driven Predictive Framework for Modeling Healthcare Data},
  booktitle = {Advances in Knowledge Discovery and Data Mining},
  publisher = {Springer},
  year = {2014},
  pages = {497--508},
  url = {http://link.springer.com/chapter/10.1007%2F978-3-319-06608-0_41}
}
Rana S, Luo W, Tran T, Phung D, Venkatesh S and Harvey R (2014), "HealthMap: A visual platform for patient suicide risk review", In Big Data, pp. 42-43. Health Informatics Society of Australia.
Abstract: Misjudging suicide risk can be fatal. Risk assessment is complicated by a multiplicity of risk factors, none of which individually can reliably predict risk. This paper addresses the need for better clinical support, visualising risk factors scattered in raw electronic medical records. HealthMap is a visual tool that helps clinicians effectively examine patient histories during a suicide risk assessment. We characterise the information visualisation problems accompanying suicide risk assessments. A design driven by visualisation principles was implemented. The prototype was evaluated by clinicians and accepted into daily clinical workflow.
BibTeX:
@inproceedings{rana_et_al_bigdata14,
  author = {Santu Rana and Wei Luo and Truyen Tran and Dinh Phung and Svetha Venkatesh and Richard Harvey},
  title = {HealthMap: A visual platform for patient suicide risk review},
  booktitle = {Big Data},
  publisher = {Health Informatics Society of Australia},
  year = {2014},
  pages = {42--43},
  url = {http://ceur-ws.org/Vol-1149/bd2014_venkatesh.pdf}
}
Rana S, Tran T, Luo W, Phung D, Kennedy R and Venkatesh S (2014), "Predicting unplanned readmission after Myocardial Infarction from Routinely Collected Administrative Hospital Data", Australian Health Review, pp. 377-382. CSIRO.
Abstract: This paper presents a way to predict readmissions following myocardial infarction using routinely collected administrative data. The model performed better than the recently described HOSPITAL score and a model derived from Elixhauser comorbidities. Moreover, the model uses only data generally available in most hospitals.
BibTeX:
@article{rana2014predicting,
  author = {Rana, Santu and Tran, Truyen and Luo, Wei and Phung, Dinh and Kennedy, Richard and Venkatesh, Svetha},
  title = {Predicting unplanned readmission after Myocardial Infarction from Routinely Collected Administrative Hospital Data},
  journal = {Australian Health Review},
  publisher = {CSIRO},
  year = {2014},
  pages = {377--382},
  url = {http://www.publish.csiro.au/?paper=AH14059}
}
Tran T, Luo W, Phung D, Gupta S, Rana S, Kennedy RL, Larkins A and Venkatesh S (2014), "A framework for feature extraction from hospital medical data with applications in risk prediction", BMC bioinformatics. Vol. 15(1), pp. 425. BioMed Central.
Abstract: Background
Feature engineering is a time consuming component of predictive modeling. We propose a versatile platform to automatically extract features for risk prediction, based on a pre-defined and extensible entity schema. The extraction is independent of disease type or risk prediction task. We contrast auto-extracted features to baselines generated from the Elixhauser comorbidities.

Results
Hospital medical records were transformed to event sequences, to which filters were applied to extract feature sets capturing diversity in temporal scales and data types. The features were evaluated on a readmission prediction task, comparing with baseline feature sets generated from the Elixhauser comorbidities. The prediction model was logistic regression with elastic net regularization. Prediction horizons of 1, 2, 3, 6, 12 months were considered for four diverse diseases: diabetes, COPD, mental disorders and pneumonia, with derivation and validation cohorts defined on non-overlapping data-collection periods.

For unplanned readmissions, the auto-extracted feature set using socio-demographic information and medical records outperformed baselines derived from the socio-demographic information and Elixhauser comorbidities, over 20 settings (5 prediction horizons over 4 diseases). In particular, over 30-day prediction, the AUCs are as follows. COPD: baseline 0.60 (95% CI: 0.57, 0.63), auto-extracted 0.67 (0.64, 0.70); diabetes: baseline 0.60 (0.58, 0.63), auto-extracted 0.67 (0.64, 0.69); mental disorders: baseline 0.57 (0.54, 0.60), auto-extracted 0.69 (0.64, 0.70); pneumonia: baseline 0.61 (0.59, 0.63), auto-extracted 0.70 (0.67, 0.72).

Conclusions
The advantages of auto-extracted standard features from complex medical records, in a disease- and task-agnostic manner, were demonstrated. Auto-extracted features have good predictive power over multiple time horizons. Such feature sets have potential to form the foundation of complex automated analytic tasks.

BibTeX:
@article{tran2014framework,
  author = {Tran, Truyen and Luo, Wei and Phung, Dinh and Gupta, Sunil and Rana, Santu and Kennedy, Richard Lee and Larkins, Ann and Venkatesh, Svetha},
  title = {A framework for feature extraction from hospital medical data with applications in risk prediction},
  journal = {BMC bioinformatics},
  publisher = {BioMed Central},
  year = {2014},
  volume = {15},
  number = {1},
  pages = {425},
  url = {https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-014-0425-8},
  doi = {10.1186/s12859-014-0425-8}
}
Tran T, Luo W, Phung D, Harvey R, Berk M, Kennedy RL and Venkatesh S (2014), "Risk stratification using data from electronic medical records better predicts suicide risks than clinician assessments", BMC psychiatry. Vol. 14(1), pp. 76. BioMed Central Ltd.
Abstract: To date, our ability to accurately identify patients at high risk from suicidal behaviour, and thus to target interventions, has been fairly limited. This study examined a large pool of factors that are potentially associated with suicide risk from the comprehensive electronic medical record (EMR) and derived a predictive model for 1–6 month risk.
7,399 patients undergoing suicide risk assessment were followed up for 180 days. The dataset was divided into derivation and validation cohorts of 4,911 and 2,488 respectively. Clinicians used an 18-point checklist of known risk factors to divide patients into low, medium, or high risk. Their predictive ability was compared with a risk stratification model derived from the EMR data. The model was based on the continuation-ratio ordinal regression method coupled with lasso (which stands for least absolute shrinkage and selection operator).
In the year prior to suicide assessment, 66.8% of patients attended the emergency department (ED) and 41.8% had at least one hospital admission. Administrative and demographic data, along with information on prior self-harm episodes, as well as mental and physical health diagnoses were predictive of high-risk suicidal behaviour. Clinicians using the 18-point checklist were relatively poor in predicting patients at high risk in 3 months (AUC 0.58, 95% CIs: 0.50 – 0.66). The model derived from the EMR was superior (AUC 0.79, 95% CIs: 0.72 – 0.84). At specificity of 0.72 (95% CIs: 0.70-0.73) the EMR model had sensitivity of 0.70 (95% CIs: 0.56-0.83).
Predictive models applied to data from the EMR could improve risk stratification of patients presenting with potential suicidal behaviour. The predictive factors include known risks for suicide, but also other information relating to general health and health service utilisation.
BibTeX:
@article{tran2014risk,
  author = {Tran, Truyen and Luo, Wei and Phung, Dinh and Harvey, Richard and Berk, Michael and Kennedy, Richard Lee and Venkatesh, Svetha},
  title = {Risk stratification using data from electronic medical records better predicts suicide risks than clinician assessments},
  journal = {BMC psychiatry},
  publisher = {BioMed Central Ltd},
  year = {2014},
  volume = {14},
  number = {1},
  pages = {76},
  url = {http://bmcpsychiatry.biomedcentral.com/articles/10.1186/1471-244X-14-76}
}
Tran T, Phung D, Luo W and Venkatesh S (2014), "Stabilized sparse ordinal regression for medical risk stratification", Knowledge and Information Systems, pp. 1-28. Springer.
Abstract: The recent wide adoption of electronic medical records (EMRs) presents great opportunities and challenges for data mining. The EMR data are largely temporal, often noisy, irregular and high dimensional. This paper constructs a novel ordinal regression framework for predicting medical risk stratification from EMR. First, a conceptual view of EMR as a temporal image is constructed to extract a diverse set of features. Second, ordinal modeling is applied for predicting cumulative or progressive risk. The challenges are building a transparent predictive model that works with a large number of weakly predictive features, and at the same time, is stable against resampling variations. Our solution employs sparsity methods that are stabilized through domain-specific feature interaction networks. We introduce two indices that measure the model stability against data resampling. Feature networks are used to generate two multivariate Gaussian priors with sparse precision matrices (the Laplacian and Random Walk). We apply the framework on a large short-term suicide risk prediction problem and demonstrate that our methods outperform clinicians by a large margin, discover suicide risk factors that conform with mental health knowledge, and produce models with enhanced stability.
BibTeX:
@article{tran2014stabilized,
  author = {Tran, Truyen and Phung, Dinh and Luo, Wei and Venkatesh, Svetha},
  title = {Stabilized sparse ordinal regression for medical risk stratification},
  journal = {Knowledge and Information Systems},
  publisher = {Springer},
  year = {2014},
  pages = {1--28},
  url = {http://arxiv.org/pdf/1407.6084.pdf}
}
Vellanki P, Duong T, Venkatesh S and Phung D (2014), "Nonparametric Discovery of Learning Patterns and Autism Subgroups from Therapeutic Data", In 22nd International Conference on Pattern Recognition, pp. 1828-1833.
Abstract: Autism Spectrum Disorder (ASD) is growing at a staggering rate, but little is known about the cause of this condition. Inferring learning patterns from therapeutic performance data, and subsequently clustering ASD children into subgroups, is important to understand this domain, and more importantly to inform evidence-based intervention. However, this data-driven task was difficult in the past due to insufficiency of data to perform reliable analysis. For the first time, using data from a recent application for early intervention in autism (TOBY Play pad), whose download count is now exceeding 4500, we present in this paper the automatic discovery of learning patterns across 32 skills in sensory, imitation and language. We use unsupervised learning methods for this task, but a notorious problem with existing methods is the correct specification of the number of patterns in advance, which in our case is even more difficult due to the complexity of the data. To this end, we appeal to recent Bayesian nonparametric methods, in particular the use of Bayesian Nonparametric Factor Analysis. This model uses the Indian Buffet Process (IBP) as a prior on a binary matrix of infinite columns to allocate groups of intervention skills to children. The optimal number of learning patterns as well as subgroup assignments are inferred automatically from data. Our experimental results follow an exploratory approach and present different newly discovered learning patterns. To provide quantitative results, we also report the clustering evaluation against K-means and Nonnegative matrix factorization (NMF). In addition to the novelty of this new problem, we were able to demonstrate the suitability of Bayesian nonparametric models over parametric rivals.
BibTeX:
@inproceedings{vellanki2014nonparametric,
  author = {Vellanki, Pratibha and Duong, Thi and Venkatesh, Svetha and Phung, Dinh},
  title = {Nonparametric Discovery of Learning Patterns and Autism Subgroups from Therapeutic Data},
  booktitle = {22nd International Conference on Pattern Recognition},
  year = {2014},
  pages = {1828--1833},
  url = {http://ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=6977032}
}