Computer science
Historic Magnetogram Digitization
Historical analog images were converted to time series data by using deconvolution for pre-processing, followed by custom-built digitization algorithms. These algorithms were developed to be user-friendly, with the objective of aiding the creation of a data set from decades of mechanical observations collected at the Agincourt and Toronto geomagnetic observatories beginning in the 1840s. The algorithms follow a structure that begins with pre-processing, followed by tracing and pattern detection. Each digitized magnetogram was then visually inspected and the algorithm's performance verified, both to ensure accuracy and to allow the data to later be connected into a long-running time series.
Author Keywords: Magnetograms
Augmented Reality Sandbox (Aeolian Box): A Teaching and Presentation Tool for Atmospheric Boundary Layer Airflows over a Deformable Surface
The AeolianBox is an educational and presentation tool extended in this thesis to
represent the atmospheric boundary layer (ABL) flow over a deformable surface in the
sandbox. It is a hybrid hardware and mathematical model that helps users to visually,
interactively and spatially understand the natural laws governing ABL airflow. The
AeolianBox uses a Kinect V1 camera and a short focal length projector to capture the
Digital Elevation Model (DEM) of the topography within the sandbox. The captured
DEM is used to generate a Computational Fluid Dynamics (CFD) model and project the
ABL flow back onto the surface topography within the sandbox.
AeolianBox is designed to be used in a classroom setting. This requires a low
time cost for the ABL flow simulation to keep the students engaged in the classroom.
Thus, the processes of DEM capture and CFD modelling were investigated to lower the
time cost while maintaining key features of the ABL flow structure. A mesh-time
sensitivity analysis was also conducted to investigate the tradeoff between the number of
cells inside the mesh and the time cost of both the meshing process and CFD modelling. This
allows the user to make an informed decision regarding the level of detail desired in the
ABL flow structure by changing the number of cells in the mesh.
Infinitely many surface topographies can be created by molding
sand inside the sandbox. Therefore, in addition to keeping the time cost low while
maintaining key features of the ABL flow structure, the meshing process and CFD
modelling are required to be robust to a variety of different surface topographies.
To achieve these research objectives, the meshing process and CFD modelling are parametrized in this thesis.
The accuracy of the CFD model for ABL flow used in the AeolianBox was
qualitatively validated with airflow profiles captured in the Trent Environmental Wind
Tunnel (TEWT) at Trent University using the Laser Doppler Anemometer (LDA). Three
simple geometries, namely a hemisphere, a cube and a ridge, were selected because they are
well studied in the literature. The CFD model was scaled to the dimensions of the grid where
the airflow was captured in TEWT. The boundary conditions were also kept the same as
those of the model used in the AeolianBox.
The ABL flow is simulated by using software such as OpenFOAM and ParaView to
build and visualize a CFD model. The AeolianBox is interactive and capable of detecting
hands using the Kinect camera, which allows a user to interact with and change the topography
of the sandbox in real time. The AeolianBox's software built for this thesis uses only
open-source tools and is accessible to anyone with an existing hardware model of its
predecessors.
Author Keywords: Augmented Reality, Computational Fluid Dynamics, Kinect Projector Calibration, OpenFoam, Paraview
Predicting Irregularities in Arrival Times for Toronto Transit Buses with LSTM Recurrent Neural Networks Using Vehicle Locations and Weather Data
Public transportation systems play an important role in the quality of life of citizens
in any metropolitan city. However, public transportation authorities face
criticisms from commuters due to irregularities in bus arrival times. For example,
transit bus users often complain when they miss the bus because it arrived too
early or too late at the bus stop. Due to these irregularities, commuters may miss
important appointments, wait for too long at the bus stop, or arrive late for work.
This thesis seeks to predict the occurrence of irregularities in bus arrival times by
developing machine learning models that use GPS locations of transit buses provided
by the Toronto Transit Commission (TTC) and hourly weather data. We
found that nearly 37% of the time, buses arrive either early or late by more than
5 minutes, suggesting room for improvement in the current strategies employed by
transit authorities. We compared the performance of three machine learning models,
of which our Long Short-Term Memory (LSTM) [13] model outperformed the
others in terms of accuracy. The error rate of the LSTM model was lower than those
of the Artificial Neural Network (ANN) and support vector regression (SVR) models. The
improved accuracy achieved by LSTM is due to its ability to adjust and update the
weights of neurons while maintaining long-term dependencies when encountering
a new stream of data.
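
As an illustration of the irregularity criterion described above (more than 5 minutes early or late), the following minimal Python sketch labels arrivals and computes the share that is irregular. The function names and the sample records are hypothetical and are not taken from the thesis or from TTC data; the thesis's actual labelling pipeline may differ.

```python
from __future__ import annotations

from datetime import datetime, timedelta

# Threshold stated in the abstract: more than 5 minutes early or late.
THRESHOLD = timedelta(minutes=5)


def is_irregular(scheduled: datetime, actual: datetime) -> bool:
    """Return True when the arrival deviates from schedule by more than 5 minutes."""
    return abs(actual - scheduled) > THRESHOLD


def irregular_share(arrivals: list[tuple[datetime, datetime]]) -> float:
    """Fraction of (scheduled, actual) pairs that count as irregular."""
    if not arrivals:
        return 0.0
    flagged = sum(is_irregular(s, a) for s, a in arrivals)
    return flagged / len(arrivals)


# Made-up sample records for illustration only (not TTC data).
base = datetime(2020, 1, 6, 8, 0)
sample = [
    (base, base + timedelta(minutes=2)),   # within tolerance
    (base, base - timedelta(minutes=7)),   # too early
    (base, base + timedelta(minutes=12)),  # too late
    (base, base + timedelta(minutes=5)),   # exactly 5 minutes: not "more than"
]
print(irregular_share(sample))
```

Binary labels of this kind could serve as a classification target, while the signed deviation in minutes could serve as a regression target; which of the two the thesis models is not stated in the abstract.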
Author Keywords: ANN, LSTM, Machine Learning
Support Vector Machines for Automated Galaxy Classification
Support Vector Machines (SVMs) are deterministic, supervised machine learning algorithms that have been successfully applied to many areas of research. They are heavily grounded in mathematical theory and are effective at processing high-dimensional data. This thesis models a variety of galaxy classification tasks using SVMs and data from the Galaxy Zoo 2 project. SVM parameters were tuned in parallel using resources from Compute Canada, and a total of four experiments were completed to determine whether invariance training and ensembles can be utilized to improve classification performance. It was found that SVMs performed well at many of the galaxy classification tasks examined, and the additional techniques explored did not provide a considerable improvement.
Author Keywords: Compute Canada, Kernel, SDSS, SHARCNET, Support Vector Machine, SVM
Fraud Detection in Financial Businesses Using Data Mining Approaches
The purpose of this research is to apply four methods to two datasets, a Synthetic
dataset and a Real-World dataset, and to compare the results with the
intention of arriving at methods to prevent fraud. The methods used are Logistic Regression,
Isolation Forest, an Ensemble Method and Generative Adversarial Networks (GANs).
Results show that all four models achieve accuracies between 91% and 99%, except
that Isolation Forest gave 69% accuracy on the Synthetic dataset.
The four models detect fraud well when built on a training set and tested with
a test set. Logistic Regression achieves good results with less computational effort.
Isolation Forest achieves lower accuracy when the data is sparse and not preprocessed
correctly. Ensemble Models achieve the highest accuracy for both datasets.
The GAN achieves good results but overfits if a large number of epochs is used. Future
work could incorporate other classifiers.
Author Keywords: Ensemble Method, GAN, Isolation forest, Logistic Regression, Outliers
Representation Learning with Restorative Autoencoders for Transfer Learning
Deep Neural Networks (DNNs) have reached human-level performance in numerous tasks in the domain of computer vision. DNNs are efficient for both classification and the more complex task of image segmentation. These networks are typically trained on thousands of images, which are often hand-labelled by domain experts. This bottleneck creates a promising research area: training accurate segmentation networks with fewer labelled samples.
This thesis explores effective methods for learning deep representations from unlabelled images. We train a Restorative Autoencoder Network (RAN) to denoise synthetically corrupted images. The weights of the RAN are then fine-tuned on a labelled dataset from the same domain for image segmentation.
We use three different segmentation datasets to evaluate our methods. In our experiments, we demonstrate that through our methods, only a fraction of data is required to achieve the same accuracy as a network trained with a large labelled dataset.
Author Keywords: deep learning, image segmentation, representation learning, transfer learning
Cloud Versus Bare Metal: A comparison of a high performance computing cluster running in a commercial cloud and on a traditional hardware cluster using OpenMP and OpenMPI
A comparison of two high performance computing clusters running on AWS and Sharcnet was done to determine which scenarios yield the best performance. Algorithm complexity ranged from O(n) to O(n^3). Data sizes ranged from 195 KB to 2 GB. The Sharcnet hardware consisted of Intel E5-2683 and Intel E7-4850 processors with memory sizes ranging from 256 GB to 3072 GB. On AWS, C4.8xlarge instances were used, which run on Intel Xeon E5-2666 processors with 60 GB per instance. AWS was able to launch jobs immediately regardless of job size. The only limiting factors on AWS were algorithm complexity and memory usage, suggesting a memory bottleneck. Sharcnet had the best performance but could be hampered by the job scheduler. In conclusion, Sharcnet is best used when the algorithm is complex and has high memory usage. AWS is best used when immediate processing is required.
Author Keywords: AWS, cloud, HPC, parallelism, Sharcnet
Educational Data Mining and Modelling on Trent University Students' Academic Performance
Higher education is important. It enhances both individual and social welfare by improving productivity, life satisfaction, and health outcomes, and by reducing rates of crime. Universities play a critical role in providing that education. Because academic institutions face resource constraints, it is important that they deploy resources in support of student success as efficiently as possible. To inform that deployment, this research analyzes institutional data reflecting undergraduate student performance to identify predictors of student success as measured by GPA, rates of credit accumulation, and graduation rates. Using methods of cluster analysis and machine learning, the analysis yields predictions for the probabilities of individual success.
Author Keywords: Educational data mining, Students' academic performance modelling
Development of a Cross-Platform Solution for Calculating Certified Emission Reduction Credits in Forestry Projects under the Kyoto Protocol of the UNFCCC
This thesis presents an exploration of the requirements for, and the development of, a software tool to calculate Certified Emission Reduction (CER) credits for afforestation and reforestation projects conducted under the Clean Development Mechanism (CDM). We examine the relevant methodologies and tools to determine what is required to create a software package that can support a wide variety of projects involving a large variety of data and computations. During requirements gathering, it was determined that the software package would need to support entering and editing equations at runtime. To create the software, we used Java as the programming language, an H2 database to store our data, and an XML file to store our configuration settings. Through these choices, we can build a cross-platform software solution for the purpose outlined above. The end result is a versatile software tool through which users can create and customize projects to meet their unique needs, as well as use the features provided to streamline the management of their CDM projects.
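
The requirement to enter and edit equations at runtime can be illustrated with a minimal sketch. The thesis tool is written in Java; the Python below is only an assumed, simplified stand-in that shows one way to evaluate a user-supplied arithmetic expression against named project variables without executing arbitrary code. The `evaluate` function, the variable names, and the sample equation are hypothetical and are not taken from the thesis or from any CDM methodology.

```python
import ast
import operator

# Arithmetic operators the sketch accepts; anything else is rejected.
_BINARY = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.Pow: operator.pow,
}
_UNARY = {ast.USub: operator.neg, ast.UAdd: operator.pos}


def evaluate(equation: str, variables: dict) -> float:
    """Evaluate an arithmetic expression that was entered as text at runtime."""

    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.Constant):
            # Accept plain numbers only (bool is excluded on purpose).
            if isinstance(node.value, (int, float)) and not isinstance(node.value, bool):
                return node.value
            raise ValueError("only numeric constants are allowed")
        if isinstance(node, ast.Name):
            if node.id not in variables:
                raise KeyError(f"unknown variable: {node.id}")
            return variables[node.id]
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY:
            return _BINARY[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY:
            return _UNARY[type(node.op)](walk(node.operand))
        raise ValueError(f"unsupported expression element: {type(node).__name__}")

    return walk(ast.parse(equation, mode="eval"))


# Hypothetical equation and inputs, for illustration only.
user_equation = "area_ha * carbon_per_ha * 44 / 12"
inputs = {"area_ha": 10, "carbon_per_ha": 3}
print(evaluate(user_equation, inputs))
```

Walking the parsed syntax tree, rather than calling `eval` on the raw string, keeps the set of permitted operations explicit, so an edited equation can only combine numbers and known variables.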
Author Keywords: Carbon Emissions, Climate Change, Forests, Java, UNFCCC, XML
Exploring the Scalability of Deep Learning on GPU Clusters
In recent years, we have observed an unprecedented rise in the popularity of AI-powered systems. They have become ubiquitous in modern life, being used by countless people every day. Many of these AI systems are powered, entirely or partially, by deep learning models. From language translation to image recognition, deep learning models are being used to build systems with unprecedented accuracy. The primary downside is the significant time required to train the models. Fortunately, the time needed for training the models is reduced through the use of GPUs rather than CPUs. However, with model complexity ever increasing, training times even with GPUs are on the rise. One possible solution to ever-increasing training times is to use parallelization to enable the distributed training of models on GPU clusters. This thesis investigates how to utilise clusters of GPU-accelerated nodes to achieve the best scalability possible, thus minimising model training times.
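
Scalability of this kind is commonly summarised by speedup and parallel efficiency. The sketch below shows those two standard definitions in Python; the timing figures are invented placeholders, not measurements from the thesis, and the abstract does not state which metrics the thesis reports.

```python
def speedup(t_single: float, t_parallel: float) -> float:
    """Speedup: single-GPU training time divided by multi-GPU training time."""
    return t_single / t_parallel


def efficiency(t_single: float, t_parallel: float, n_gpus: int) -> float:
    """Parallel efficiency: speedup divided by the number of GPUs (1.0 is ideal)."""
    return speedup(t_single, t_parallel) / n_gpus


# Hypothetical seconds per epoch for 1, 2, 4 and 8 GPUs (placeholder values).
timings = {1: 1000.0, 2: 520.0, 4: 280.0, 8: 160.0}

for n_gpus, seconds in timings.items():
    s = speedup(timings[1], seconds)
    e = efficiency(timings[1], seconds, n_gpus)
    print(f"{n_gpus} GPUs: speedup {s:.2f}, efficiency {e:.2f}")
```

An efficiency that falls as GPUs are added indicates growing communication or synchronisation overhead, which is the cost that distributed training has to keep small.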
Author Keywords: Compute Canada, Deep Learning, Distributed Computing, Horovod, Parallel Computing, TensorFlow