This research evaluates the potential use of crowdsourced bike data and compares them with the traditional bike count data collected in the City of Charlotte, NC. Using bike data from both the Strava smartphone cycling application and the bicycle count stations, bicycle volume models are developed. Based on the results, a bicycle volume predictive model is presented, and a map illustrating the bicycle volume on most road segments in the City of Charlotte is generated. In addition, to gain a better understanding of the attributes that affect cycling, other supporting data are collected and combined with the Strava bicycle count data. Multiple discrete choice models are developed to analyze the Strava users’ cycling activities. Furthermore, a bicyclist injury risk analysis is conducted to explore the factors affecting biking safety by developing a series of safety performance functions. Several model comparison indicators are used to select the best-fitting model for bicyclist injury risk. Finally, recommendations are made to help improve the cycling environment and safety and to increase bicycle volume in the future.
Recent research has demonstrated the advantages of deep learning models in prediction tasks on
electronic health records (EHR) in the medical domain. However, the prediction results tend to be difficult to
explain due to the complex neuron structures. Without explainability and transparency, deep learning
models are not trustworthy or reliable for making real-world decisions, especially high-stakes ones in the
healthcare domain. To improve the trustworthiness of a deep learning model, quantifying its uncertainty is
crucial.
In this dissertation work, we proposed several Bayesian Neural Network (BNN) structures to estimate the data
uncertainty and model uncertainty associated with the EHR data and deep learning models, respectively. We
also proposed Variational Neural Network (VNN) algorithms to estimate the uncertainty of the variables to
investigate the medical and temporal features that contribute the most to the patient-level uncertainty. In order
to verify the validity of the uncertainty estimations, we designed a series of experiments to examine the
computational results against widely accepted facts about uncertainty. We also conducted post-hoc analysis to
evaluate whether the proposed models tend to specialize in one or more patient subgroups, at the cost of
model performance on others, as well as whether the treatment (improving uncertainty in one subgroup) will
mitigate such performance cost. The experiment results have confirmed the validity of our computational
approaches. Finally, we conducted a user study to understand the clinicians' perception of the proposed
uncertainty models.
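The split between data uncertainty and model uncertainty described above can be illustrated with a minimal sketch. It assumes, hypothetically, that a Bayesian network is run several times with different sampled weights and that each pass returns a predictive mean and a predictive variance for one patient. Under that assumption, the law of total variance gives data (aleatoric) uncertainty as the average predicted variance and model (epistemic) uncertainty as the variance of the predicted means. The function name and the sample values are illustrative only and are not taken from the dissertation.

```python
from statistics import fmean, pvariance


def decompose_uncertainty(samples):
    """Split predictive variance into data and model components.

    samples: list of (mean, variance) pairs, one per stochastic forward
    pass (hypothetical output format, assumed for this sketch).
    Returns (aleatoric, epistemic, total).
    """
    means = [m for m, _ in samples]
    variances = [v for _, v in samples]
    aleatoric = fmean(variances)   # average noise variance predicted by the model
    epistemic = pvariance(means)   # spread of the predicted means across weight samples
    return aleatoric, epistemic, aleatoric + epistemic


# Three illustrative forward passes for a single patient (made-up numbers).
passes = [(0.50, 0.04), (0.70, 0.06), (0.60, 0.05)]
aleatoric, epistemic, total = decompose_uncertainty(passes)
print(f"data={aleatoric:.4f} model={epistemic:.4f} total={total:.4f}")
```

A large epistemic term relative to the aleatoric term would suggest that more training data could help, whereas a dominant aleatoric term points to noise in the records themselves.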
As one of the most vulnerable entities within the transportation system, pedestrians may face greater danger and sustain more severe injuries in traffic crashes than other road users. The safety of pedestrians is particularly critical within the context of continuous traffic safety improvements in the US. Moreover, traffic crash data are inherently heterogeneous, and such heterogeneity can lead to incorrect conclusions in many ways. Therefore, proper modeling approaches need to be developed and applied to identify the causes of pedestrian-vehicle crashes and better ensure the safety of pedestrians.
On the other hand, with the development of artificial intelligence techniques, a variety of novel machine learning methods have been established. Compared to conventional discrete choice models (DCMs), machine learning models are more flexible, requiring few or no prior assumptions about input variables, and are better able to handle outliers and missing or noisy data. Furthermore, crash data have inherent patterns related to both space and time. Crashes that occurred in locations with highly aggregated uptrend patterns are worth exploring to examine the most recently deteriorative factors affecting pedestrian injury severities.
The major goal of this study is to develop a framework for modeling and analyzing pedestrian injury severities in single-pedestrian-single-vehicle crashes that provides a higher resolution in identifying contributing factors and their associated effects on pedestrian injury severities, particularly the most recently deteriorative factors. Both conventional DCMs and a selected machine learning model, namely XGBoost, are developed. Detailed comparisons among all developed models show that the XGBoost model outperforms all conventional DCMs on all selected measures. In addition, an emerging hotspot analysis is used to identify the most targeted hotspots, followed by a proposed XGBoost model that analyzes the most recently deteriorative factors affecting pedestrian injury severities. Completing these tasks helps bridge the gap between theory and practice. A summary and conclusions of the research are provided, and further research directions are given at the end.
Insertions and deletions (indels) represent the second largest variation type in human genomes and have been implicated in the development of cancer. Accurate indel annotation is of paramount importance in variant analysis in both healthy and disease genomes. Previous studies have shown that existing indel calling methods generally produce high numbers of false positives and false negatives, which limits downstream investigation of the structural and functional effects of indels.
To assess the accuracy of indel calling programs, we carried out a comparative analysis of 7 general indel calling programs and 4 somatic indel calling programs, using 78 healthy samples from the 1000 Genomes Project and 30 cancer samples from The Cancer Genome Atlas (TCGA). We adopted a comprehensive and more stringent indel comparison approach, and an efficient way to use a benchmark for improved performance comparisons of the general indel calling programs. We found that deriving germline indels in healthy genomes by combining several indel calling tools could help remove a large number of false positive indels produced by individual programs without compromising the number of true positives. Performance comparisons of somatic indel calling programs are more complicated due to the lack of a reliable and comprehensive benchmark.
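The idea of combining several indel callers can be sketched as a simple support-count filter. This is a minimal illustration under stated assumptions, not the comparison procedure used in the study: it assumes each call has already been normalized to a comparable (chromosome, position, reference allele, alternate allele) tuple, and the caller names, the example calls, and the `min_support` threshold are all hypothetical.

```python
from collections import Counter


def consensus_indels(call_sets, min_support=2):
    """Keep indels reported by at least `min_support` callers.

    call_sets: dict mapping a caller name to a set of
    (chrom, pos, ref, alt) tuples, assumed to be normalized already.
    """
    support = Counter()
    for calls in call_sets.values():
        support.update(set(calls))  # each caller votes at most once per indel
    return {indel for indel, n in support.items() if n >= min_support}


# Made-up calls from three hypothetical callers.
calls = {
    "caller_a": {("chr1", 100, "AT", "A"), ("chr1", 250, "G", "GTT")},
    "caller_b": {("chr1", 100, "AT", "A"), ("chr2", 500, "C", "CA")},
    "caller_c": {("chr1", 100, "AT", "A"), ("chr1", 250, "G", "GTT")},
}
kept = consensus_indels(calls, min_support=2)
print(sorted(kept))
```

In this toy input the call reported by only one caller is dropped, which mirrors how a consensus can remove caller-specific false positives. A real comparison would also need to handle equivalent representations of the same indel.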
We further performed a comparative analysis of somatic indels in two cancer types, BRCA and LUAD. We compared somatic indels in both coding regions and non-coding regulatory regions such as transcription factor binding sites (TFBSs). We used an improved algorithm to predict TFBSs in human genomes and analyzed their evolutionary and structural roles. Our comparative results indicated that, while there are differences between LUAD and BRCA genomes, both show a higher deletion rate, coding indel rate and frame-shift indel rate. Somatic indels tend to be located in sequences with important functions, in both coding and non-coding regions. This study can serve as a first step in future pan-cancer analysis for identifying key variant markers of cancer genomes.
The increasing performance of feature extraction and regression modeling in various domains raises the hope that machine learning and deep learning can assist clinicians in numerous healthcare applications. However, the complex and multimodal nature of the problems and the scarcity of high-quality labeled data in this domain introduce several challenges and limitations. These challenges, along with a lack of interpretability, undermine the generalizability and usability of many state-of-the-art machine learning models.
This dissertation focuses on using multimodal sources of data for regression modeling in healthcare applications. The argument is that domain knowledge describes the nature of each modality's relationship with the target function. This relationship can characterize the appropriate level of representation and an efficient integration method. We define a framework with two heterogeneous modalities: one modality provides more local features, while the other contains higher-level global information. We demonstrate the framework's applicability to multiple healthcare regression tasks.
In this framework, we propose two approaches for increasing the performance in the absence of large-scale data: leveraging the abstraction of the modality representations based on domain knowledge, and a tree-structure convolutional neural network for integrating the information from the heterogeneous modalities. This framework is discussed in more detail for two different cases of "Alzheimer's disease progression prediction" and "radiation therapy treatment planning." The former predicts a scalar target variable, while the latter approximates a two-dimensional one. The first application's performance is compared with the previous submissions for the same dataset; it outperforms the best-reported results.
Avena sativa, or common oat, is a staple crop and a member of the Poaceae (grass) family. Following wheat, maize and rice, oats account for 10.5 million hectares of the world’s produced crops as of 2017. Phytocompounds such as β-glucan and other phytochemicals such as avenanthramides and vanillic, syringic, ferulic, and caffeic acids are noted to benefit cardiovascular health or to offer prospective benefits to human health. However, further investigation into these potential benefits requires research that surpasses past work in breadth and scope. Much has been done to bridge the gap in resources for oats, such as the development of high-throughput markers, consensus linkage maps and, most recently, genome sequencing efforts. However, the relative complexity of cultivated oat, an allohexaploid with highly similar subgenomes, poses additional challenges to the development of these resources. A final layer of complexity is the genome size of hexaploid oat, believed to be approximately 12.8 gigabases, of which a significant portion is composed of complex repetitive elements. Characterization of these highly complex regions is difficult because repetitive regions contained within reads are characteristically difficult to map, which complicates assembly efforts and results in misassemblies and gaps. In this work, repetitive elements were investigated within well-characterized Avena genomes using a novel pipeline capable of offering enhanced resolution. The work concludes with phylogenetic analyses examining the evolutionary relationships between elements, in an effort to bolster overall knowledge of Avena and the role of transposable elements throughout the genus.
The objective of this dissertation is to inform clinicians, researchers and policy makers of the potential value of prosthesis intervention for individuals who experience a lower limb amputation. In addition, this dissertation supports the call for more studies of high methodological quality to provide evidence of the functional and economic value associated with prosthesis intervention after lower limb amputation.
The first chapter (study 1) measured the time to prosthesis receipt based on different demographic factors (e.g., amputation level and sex) and personal health factors (e.g., diabetes or vascular disease, and age) using administrative claims data. The Kaplan-Meier method and log-rank tests were used to examine overall survival based on covariates. Multivariable Cox proportional hazards models were fit to assess the overall risk of prosthesis receipt after amputation.
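The Kaplan-Meier method named above can be illustrated with a minimal sketch. It assumes a time-to-event setup in which the "event" is prosthesis receipt and individuals without an observed receipt are censored, so the estimated curve is the probability of not yet having received a prosthesis. The function name and the example durations are hypothetical and are not drawn from the study's claims data.

```python
def kaplan_meier(durations, observed):
    """Return [(time, survival)] at each distinct event time.

    durations: time from amputation to prosthesis receipt or censoring
    observed:  True if receipt was observed, False if censored
    """
    data = sorted(zip(durations, observed))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        events = 0
        removed = 0
        # Group all records that share the same time point.
        while i < len(data) and data[i][0] == t:
            events += data[i][1]   # True counts as 1, False as 0
            removed += 1
            i += 1
        if events:
            survival *= 1 - events / at_risk   # product-limit update
            curve.append((t, survival))
        at_risk -= removed   # events and censored cases both leave the risk set
    return curve


# Made-up example: months to receipt, with two censored individuals.
months = [2, 3, 3, 5, 8]
received = [True, True, False, True, False]
curve = kaplan_meier(months, received)
for t, s in curve:
    print(t, round(s, 3))
```

Censored individuals reduce the number at risk without lowering the curve, which is why the estimate is only updated at times where a receipt is actually observed.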
The second chapter (studies 2 and 3) investigated the cost and utilization of prosthesis receipt, stratified by time from surgery, up to 12 months post-amputation using administrative claims data. The adjusted analysis used general linear modeling with log-transformed cost, and logistic regression models were used to assess healthcare utilization.
The third chapter (study 4) assessed the relationship between injurious falls and self-perceived functional mobility. Multivariable logistic regression was applied to a cross-sectional sample using clinical outcomes data.
As demonstrated in this dissertation, earlier provision of prosthetic devices is associated with lower direct costs and reduced healthcare utilization. For those who are currently using a prosthesis, maintaining and improving mobility may help to reduce the burden and risk of injurious falls.
This study examined the influence of drama participation on the foundational, digital, and Black Girls’ literacies of Black girls in an urban middle school. This case study used the Culturally Relevant Arts Education framework with Black Feminist Thought epistemology to address the following research questions: What are the experiences of middle school Black girls who participate in drama classes in relation to language, identity, and social media engagement? What perceptions do drama teachers have of urban middle school Black girls who participate in drama classes as they address language, identity, and social media engagement? Purposive criterion sampling was used to recruit participants for this study. Semi-structured interviews and a focus group were conducted with five Black females; four participants were middle school students who attended Stonybrook School (pseudonym), and one participant was a teacher who taught at the school. The findings of the study suggest that Black girls who participated in drama 1) experienced enhanced foundational language, 2) acquired a more positive racial identity, and 3) demonstrated effective management of social media engagement. Additional findings suggest that the Black female drama teacher perceived Black middle school girls as mature enough to successfully navigate academics, identity, and social media engagement through practicing drama activities, despite the race and gender challenges they face. The findings from this study help inform educational practices, policies, and research aimed at improving outcomes for Black girls in urban middle schools.
Avena sativa, or common oat, is a staple crop and a member of the Poaceae (grass) family. Following wheat, maize and rice, oats account for 10.5 million hectares of the world’s produced crops as of 2017. Phytocompounds such as β-glucan and other phytochemicals such as avenanthramides and vanillic, syringic, ferulic, and caffeic acids are noted to benefit cardiovascular health or to offer prospective benefits to human health. However, further investigation into these potential benefits requires research that surpasses past work in breadth and scope. Much has been done to bridge the gap in resources for oats, such as the development of high-throughput markers, consensus linkage maps and, most recently, genome sequencing efforts. However, the relative complexity of cultivated oat, an allohexaploid with highly similar subgenomes, poses additional challenges to the development of these resources. A final layer of complexity is the genome size of hexaploid oat, believed to be approximately 12.8 gigabases, of which a significant portion is composed of complex repetitive elements. Characterization of these highly complex repetitive regions is difficult because repetitive regions contained within reads are characteristically difficult to map, which complicates assembly efforts and results in misassemblies and gaps. In this work, a novel pipeline capable of offering enhanced resolution and detection of repetitive elements was used to catalog all repetitive elements in a given genome, and this information was further examined within well-characterized Avena genomes. The work concludes with phylogenetic analyses examining the evolutionary relationships between elements, in an effort to gain insight into the role of transposable elements across Avena and to bolster overall knowledge of the genus.