Data science & analytics projects:

 PhD Dissertation Research

ABSTRACT: GROKHOWSKY, NICHOLAS ANDRÉ. Spatiotemporal Analysis of Innovation Using an Automated Publication Screening. (Under the direction of Helena Mitasova). Understanding production and innovation trends is important for decision makers in every discipline. This is particularly relevant in research and development, which has been historically used to represent innovation. As a result, researchers have analyzed research and development publications at global and local scales using a variety of methodologies. In these methodologies, space and time are utilized but not combined into a spatiotemporal approach. Furthermore, research and development expenditures are the commonly used variables in these analyses. This leaves out research and development spillover effects, which have been found to have a significant impact on innovation and economic growth. A solution to this is to use research locations as a measure for knowledge spillover. The research location is an important piece of analyzing innovation because it represents a location that gains knowledge from the research conducted, or it misses out on potential knowledge if research is not conducted in its region. Like natural and physical systems, production and innovation diffuse across space and time. Publication years are easily accessed in literature searches, and regularly used as a temporal variable. Similarly, publication locations are easily accessed, and commonly used as a spatial variable. However, the research locations, which are most relevant in geographically dependent fields, are not easy to extract. As a result, they are rarely used for analysis. Geographically dependent fields can benefit from analysis of research location, with and without publication location. Therefore, the second chapter of this dissertation specifies a methodology for spatiotemporal analysis using research locations and publication years for analyzing research and development publications. 
The methodology aggregates the frequencies of publications by geographic unit, and uses feature variables at the same geographic scale to perform hypothesis tests and infer publication estimates across a geographic region. Included with these feature variables are time-interval indicator variables, which were identified using time-series plots and validated using Chow tests. This approach differs from other methods because research publications are generalized by their quantity per research location, along with time intervals and subtopics. Furthermore, unlike other research publication analyses, a specific topic is used as a case study for generalization. The public health topic of childhood elevated blood lead levels (EBLL) was selected as the case study because of the strong impact that space and time have on public health topics.
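As a rough illustration of how a Chow test can check a candidate time interval, the sketch below compares one pooled linear trend against separate trends before and after a break. The yearly counts, the break position, and the function names are hypothetical and are not taken from the dissertation. The sketch assumes a simple one-variable trend, whereas the dissertation's models include additional feature variables.

```python
def fit_rss(x, y):
    """Fit y = a + b*x by ordinary least squares; return the residual sum of squares."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))


def chow_f_statistic(x, y, break_index, k=2):
    """Chow F statistic for a structural break starting at position break_index.

    k is the number of parameters estimated in each regression
    (here an intercept and a slope).
    """
    rss_pooled = fit_rss(x, y)
    rss_before = fit_rss(x[:break_index], y[:break_index])
    rss_after = fit_rss(x[break_index:], y[break_index:])
    n = len(x)
    numerator = (rss_pooled - (rss_before + rss_after)) / k
    denominator = (rss_before + rss_after) / (n - 2 * k)
    return numerator / denominator


# Hypothetical yearly publication counts whose trend changes after the sixth year.
years = list(range(2000, 2012))
counts = [3, 4, 4, 5, 6, 6, 12, 15, 19, 22, 26, 29]

f_stat = chow_f_statistic(years, counts, break_index=6)
print(f"Chow F statistic: {f_stat:.2f}")
```

A large F statistic, compared against the F distribution with k and n - 2k degrees of freedom, suggests that the two periods follow different trends and that a time-interval indicator variable is justified.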

Understanding the spatiotemporal patterns of research locations is helpful for decision makers, but combining the results with publication locations identifies both direct and indirect production. In this context, the combined pecuniary and non-pecuniary beneficiaries can be identified by calculating production efficiency across time and space. The second chapter investigates identifying beneficiaries using production efficiency calculated by a production econometric model. This method accounts for regional and industrial factors in the form of geographically aggregated capital and labor. The residual outputs of this method represent the growth beyond the capital and labor inputs. Hence, production efficiency, and therefore innovation and competition, are measured. To be clear, these are not predicted values. To obtain the values needed for this methodology, the frequencies of research locations and publication locations were estimated using ensemble models. The estimated frequencies of research locations and publication locations were summed and averaged over the total time period. This average number of articles written per year was used as the dependent variable in the productivity function, to account for the direct and indirect production. The capital inputs were the sum of federal, state, and private science and engineering research and development obligations per geographic unit, and the labor input was the number of graduate students per geographic unit. The measures of production efficiency were clustered using spatial autocorrelation and mapped using Local Indicators of Spatial Association (LISA) plots. These outputs identified significant clusters of high and low production efficiency, or innovation, and displayed how they progressed over the time intervals that were identified in the previous chapter.
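The idea of reading production efficiency from regression residuals can be sketched as follows. The abstract does not name the functional form, so this example assumes a log-linear (Cobb-Douglas style) production function. All input values and function names are hypothetical.

```python
import math


def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(col + 1, n):
            factor = m[row][col] / m[col][col]
            for j in range(col, n + 1):
                m[row][j] -= factor * m[col][j]
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        known = sum(m[row][j] * x[j] for j in range(row + 1, n))
        x[row] = (m[row][n] - known) / m[row][row]
    return x


def production_residuals(output, capital, labor):
    """Residuals of log(output) = b0 + b1*log(capital) + b2*log(labor).

    Assumed Cobb-Douglas form; a positive residual means more output than
    the capital and labor inputs alone would explain.
    """
    design = [[1.0, math.log(k), math.log(l)] for k, l in zip(capital, labor)]
    target = [math.log(y) for y in output]
    # Normal equations: (X'X) beta = X'y
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * t for r, t in zip(design, target)) for i in range(3)]
    beta = solve_linear_system(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, r)) for r in design]
    return [t - f for t, f in zip(target, fitted)]


# Hypothetical values for five geographic units.
articles_per_year = [12.0, 30.0, 8.0, 45.0, 20.0]
rd_obligations = [150.0, 400.0, 90.0, 520.0, 310.0]
graduate_students = [900.0, 2100.0, 700.0, 2500.0, 1200.0]

residuals = production_residuals(articles_per_year, rd_obligations, graduate_students)
for unit, residual in enumerate(residuals):
    print(f"unit {unit}: log-residual {residual:+.3f}")
```

In a workflow like the one described, residuals of this kind would be the values passed on to the spatial autocorrelation step.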

To reduce the time required for extracting research locations from a corpus of research and development publications, an automated data extraction tool was developed. The automation tool utilizes a sub-field of Natural Language Processing (NLP), known as Named Entity Recognition (NER). Two NER toolkits were used to extract geographic locations in each publication’s text, and a generic resolution step was applied. The generic resolution step filters geographic locations by the geographical unit and the geographical unit’s abbreviations. This was developed for country level and U.S. state level analyses. Additionally, state-of-the-art classification and transformer models were used to identify sub-topics within the corpus. First, a Latent Dirichlet Allocation (LDA) model was used to identify the potential sub-topics within the corpus. Next, a Bidirectional and Auto-Regressive Transformers (BART) model was used within a trained Zero-Shot Classifier to identify the sub-topics of each publication. This entire workflow was built into a web application that allows users quick access to research locations, sub-topic clusters, hypothesis testing, and inferential analysis. The entire workflow can be accomplished in under five minutes for our corpus of more than 1,000 public health publications. The final chapter of this dissertation reconstructs the first chapter using this automated data extraction tool with acceptable to strong results. It also includes synthetically generated datasets for comparison. Geoparsing returned sensitivity equal to 71%, accuracy equal to 85%, and precision equal to 97%. Sub-topic clustering returned sensitivity equal to 66%, accuracy equal to 78%, and precision equal to 71%. Correlations of the estimates from the ‘baseline’ study, taken from the first chapter, with the estimates from the automated method were greater than or equal to 0.86 (i.e., r >= 0.86) for all of the comparisons made.
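A minimal sketch of what a generic resolution step at the U.S. state level might look like is shown below. It assumes the NER toolkits have already returned a list of location strings. The lookup table is a small illustrative subset, and every name in the code is hypothetical rather than taken from the actual tool.

```python
from collections import Counter

# Small illustrative subset; a full lookup would cover every state.
STATE_ABBREVIATIONS = {
    "North Carolina": "NC",
    "Florida": "FL",
    "Maine": "ME",
    "Michigan": "MI",
}

NAME_LOOKUP = {name.lower(): name for name in STATE_ABBREVIATIONS}
ABBREVIATION_LOOKUP = {abbr: name for name, abbr in STATE_ABBREVIATIONS.items()}


def resolve_to_state(entity):
    """Return the canonical state name for an extracted entity, or None."""
    cleaned = entity.strip().strip(".,;")
    if cleaned.lower() in NAME_LOOKUP:
        return NAME_LOOKUP[cleaned.lower()]
    # Abbreviations are matched case-sensitively so that ordinary words
    # such as "me" are not read as the abbreviation "ME".
    return ABBREVIATION_LOOKUP.get(cleaned)


def count_research_locations(entities):
    """Count how often each state is mentioned, discarding unresolved entities."""
    resolved = (resolve_to_state(entity) for entity in entities)
    return Counter(state for state in resolved if state is not None)


# Hypothetical NER output for one publication.
extracted = ["Flint", "Michigan", "MI", "north carolina", "Raleigh", "FL", "me"]
state_counts = count_research_locations(extracted)
print(state_counts)
```

Entities below the chosen geographic unit, such as cities, are dropped in this sketch. Mapping them to their parent state would require a gazetteer, which is outside the scope of this example.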

This dissertation provides four important outputs. The first is a theoretical and applicable understanding of how research locations, extracted from a corpus of research and development publications, can be used to identify research trends in specified fields. Second, it offers a theoretical and applicable understanding of beneficiaries, using an updated approach to the initial analysis of research trends. Third, it provides a methodology for reducing the time commitment needed to perform an analysis of research and innovation using research locations and sub-topics. And fourth, it provides an example of a web application that allows any user with an internet connection to apply this methodology to their own corpus of publications. The intent of this dissertation is to provide a novel approach for understanding the diffusion of research production and innovation, biases, and opportunities for decision makers in all fields that depend on space and time.


 Atlantic Salmon

This is an ESRI story map that explores the largest dams (by capacity) in Maine in comparison to Atlantic salmon zones. This story map uses ESRI's ArcGIS Online to create the GIS and the story map feature to display it. The purpose of this analysis is to explore potential future studies.


 Everglades National Park

This is a short presentation using Google Earth Pro of a fishing trip I took with my son. The purpose of this presentation is to identify unique pieces of location data. My data was the latitude and longitude taken from the photographs that I took on the trip.
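Photographs commonly store GPS coordinates in their EXIF metadata as degrees, minutes, and seconds with a hemisphere reference, while tools such as Google Earth Pro also accept signed decimal degrees. The sketch below shows that conversion. It assumes the coordinates were read from EXIF tags, and the sample values are hypothetical rather than the actual trip data.

```python
def dms_to_decimal(degrees, minutes, seconds, reference):
    """Convert degrees/minutes/seconds plus a hemisphere letter to signed decimal degrees."""
    decimal = degrees + minutes / 60 + seconds / 3600
    # Southern latitudes and western longitudes are negative.
    return -decimal if reference in ("S", "W") else decimal


# Hypothetical coordinates in south Florida.
latitude = dms_to_decimal(25, 17, 9.6, "N")
longitude = dms_to_decimal(80, 53, 55.2, "W")
print(f"{latitude:.4f}, {longitude:.4f}")
```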


 Aquatic Vegetation

Using hyperspectral imagery captured by UF's GatorEye Unmanned Flying Laboratory (GE-UFL), which carries a Headwall Photonics VNIR hyperspectral sensor with 270 spectral bands, we attempted to determine aquatic plant species using supervised and unsupervised classification. The steps included image processing, visual examination, and species determination. Principal component analysis captured 90.01% of the variation in the first component. Further investigation showed that the first component contained green and infrared bands, and visualizations showed that this component did not overlap our water bodies. It was determined that terrestrial vegetation was primarily captured in this component. The second component captured 9.04% of the variation, and reflected blue and yellow. When visualized, it overlapped entirely with our water bodies. It is likely that the yellow reflectance was the aquatic vegetation we were looking to identify. Further analysis using a neural network and ISODATA confirmed that we were looking at aquatic vegetation, but it became clear that identifying the species was not likely. For now, manual determination is the best approach for identifying aquatic vegetation species.
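The share of variation captured by each principal component, like the 90.01% and 9.04% figures above, is each eigenvalue of the band covariance matrix divided by the sum of all eigenvalues. The sketch below illustrates this with three hypothetical bands and a handful of made-up pixel values. It uses power iteration with deflation to stay within the standard library; a real analysis of 270 bands would normally rely on a numerical library instead.

```python
def covariance_matrix(rows):
    """Sample covariance matrix; rows are pixels, columns are spectral bands."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    return [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
            for i in range(d)]


def dominant_eigenpair(matrix, iterations=500):
    """Largest-magnitude eigenvalue and its eigenvector, found by power iteration."""
    d = len(matrix)
    vector = [1.0 / d ** 0.5] * d
    for _ in range(iterations):
        product = [sum(matrix[i][j] * vector[j] for j in range(d)) for i in range(d)]
        norm = sum(p * p for p in product) ** 0.5
        if norm == 0.0:
            return 0.0, vector
        vector = [p / norm for p in product]
    # Rayleigh quotient of the normalized vector gives the eigenvalue.
    value = sum(vector[i] * sum(matrix[i][j] * vector[j] for j in range(d))
                for i in range(d))
    return value, vector


def explained_variance_ratios(rows):
    """Fraction of total variance captured by each principal component."""
    cov = covariance_matrix(rows)
    d = len(cov)
    total = sum(cov[i][i] for i in range(d))
    ratios = []
    for _ in range(d):
        value, vec = dominant_eigenpair(cov)
        ratios.append(value / total)
        # Deflation: remove the component just found before searching for the next.
        cov = [[cov[i][j] - value * vec[i] * vec[j] for j in range(d)]
               for i in range(d)]
    return ratios


# Hypothetical reflectance values: columns are green, near-infrared, and blue.
pixels = [
    [0.10, 0.40, 0.05],
    [0.12, 0.45, 0.07],
    [0.30, 0.80, 0.04],
    [0.28, 0.75, 0.06],
    [0.05, 0.10, 0.20],
    [0.06, 0.12, 0.22],
]

ratios = explained_variance_ratios(pixels)
for index, ratio in enumerate(ratios, start=1):
    print(f"component {index}: {ratio:.2%} of variation")
```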


Output Visualizations

PRINCIPAL COMPONENT ANALYSIS
SUPERVISED NEURAL NETWORK
UNSUPERVISED ISODATA

 Miami Crime

The goal of this analysis was to identify whether a time trend and a spatial trend existed within a data set taken from the Florida Association of Land Surveyors. The data covered robberies, thefts, and larcenies of land surveying equipment throughout Florida. Exploratory analysis can be seen on the visualizations page, under the Kepler visuals. Because more than 70% of the crimes occurred in the Miami tri-county area, the analysis focused on that area. However, because the data was sparse, traditional statistical models that rely on MCMC would not converge. A spatial autoregressive model using integrated nested Laplace approximations (INLA) was fitted to resolve this issue.

 Remote Sensing Software

Using Java SE and the Swing library, this software application allows a user to analyze a raster image along three channels (red, green, blue). This application is specific to RGB imagery, but it was originally built with the intent to use it for satellite imagery. Time constraints limited its function to RGB images instead of multi-spectral images. The application demonstrates Object-Oriented Programming applied to geospatial analysis. The object classes created are the main object, a GUI object, a backend object, and a processing object. The main object calls all of the object classes to build the program. The GUI object calls all of the Java Swing objects and methods for display. The backend and processing objects perform the necessary calculations for the raster image. The project link points to a GitHub repo where this software is available, along with a report about it.


SCREENSHOT OF RASTER ANALYSIS SOFTWARE BEING USED FOR SATELLITE IMAGE ANALYSIS