First Data Science Summer Internship Concludes with Posters
Students from CSUSB and SBCC presented posters describing their summer data science efforts.
On Thursday, August 13, participants from our first Summer Data Science Research Experience presented posters at our program's final group meeting. Faculty Co-PIs Yunfei Hou (CSUSB) and Nathalie Guebels (SBCC) joined the session to hear about the group's work over an intense eight weeks.
The summer program comprised four students from Cal State San Bernardino and four from Santa Barbara City College. These eight students were joined by two local high school students participating in a separate summer internship sponsored by the Army Educational Outreach Program (AEOP).
We split our program into two phases. In Phase 1 (June 21 - July 4), the students participated in a boot camp on data science tools and methods. Kha-Dinh Luong, a graduate student in Computer Science, led a series of lectures on Python programming, data visualization, statistical analysis, and machine learning. The boot camp consisted of daily 1-hour lectures and assigned homework. UCSB Data Science Undergraduate Fellows Robin Hollingsworth and Zoe Holzer served as tutors for the boot camp and held regular office hours outside of lecture.
Phase 2 (July 6 - August 14) comprised the final six weeks of the program. Ashley Bruce, another graduate student in Computer Science, and two DS Fellows mentored the students in two mini projects modeled after last year's data science capstone course. The undergraduate fellows, Holzer and Hollingsworth, met regularly with the students, while Bruce guided the group through weekly milestones. Ultimately, the students each delivered a poster and a report summarizing their efforts. See below for the abstracts.
All meetings and presentations were held remotely, via Zoom, due to COVID-19. Participants from CSUSB and SBCC received stipends from the National Science Foundation's HDR grant #1924205.
Investigating El Niño and Its Impact on Ocean Characteristics and Fish Larvae Abundance Off the California Coast
Students
- Kaylin Roberts (Santa Barbara City College)
- Elliot So (Dos Pueblos High School)
- Piero Trujillo (Santa Barbara City College)
- Corbin Ulloa (CSU San Bernardino)
- Natalie Fenkner (Dos Pueblos High School)
Abstract
Our project observed the effects of warming waters and climate change on marine life.
We used hypothesis testing, linear regression, and a variety of visualizations, such as scatterplots, maps, and 3D graphs, to find relationships between the water characteristics in our dataset. The data were collected along the California coast by CalCOFI and contain water characteristics at different locations over time, along with abundances of various fish larva species. Our group explored El Niño's effects on ocean temperature, species abundance, pH levels, salinity, and ocean depth. We found strong correlations among ocean temperature, pH levels, and salinity. Our linear regression models indicated a strong positive correlation between ocean temperature and pH levels (0.86) and strong negative correlations between salinity and pH levels (-0.895) and between ocean temperature and salinity (-0.815). Likewise, our correlation coefficient calculator showed how closely related our classifying features were. The heatmap of our data showed the change in California anchovy abundance across seasons over many years. Working with missing data, and trying to produce accurate calculations despite it, was the most challenging part of our research.
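For readers curious about what a correlation coefficient calculator of this kind might look like, here is a minimal Python sketch. It is not the students' code: the function names `pearson_r` and `fit_line` are hypothetical, and the sample numbers are made up for illustration rather than taken from CalCOFI. It computes the Pearson correlation coefficient and a simple least-squares line for two paired lists of measurements.

```python
import math


def pearson_r(xs, ys):
    """Return the Pearson correlation coefficient of two paired lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line y = slope * x + intercept."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var_x
    return slope, mean_y - slope * mean_x


# Made-up values for illustration only; these are NOT CalCOFI measurements.
temperature_c = [10.0, 12.0, 14.0, 16.0, 18.0]
salinity = [34.2, 33.9, 33.8, 33.5, 33.4]

r = pearson_r(temperature_c, salinity)
slope, intercept = fit_line(temperature_c, salinity)
print(f"r = {r:.3f}, slope = {slope:.3f}, intercept = {intercept:.3f}")
```

A value of r near 1 or -1 indicates a strong linear relationship, while a value near 0 indicates little or none. Rows with missing measurements would need to be dropped or filled in before calling either function, which is the kind of difficulty the abstract describes.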
Understanding Data Science Methodologies Using a Biological Interactions Dataset
Students
- Alejandro Barragan (CSU San Bernardino)
- Brent Coloma (CSU San Bernardino)
- Angelica Fernandez (Santa Barbara City College)
- Paolo Pedrigal (Santa Barbara City College)
- Cristina Ruiz (CSU San Bernardino)
Abstract
An open-access dataset from Global Biotic Interactions (GloBI) is analyzed to determine its suitability for various data science methods. The dataset primarily contains qualitative descriptors of bee and plant species and their biotic interactions. We present visualizations, including bar graphs, linear regression plots, heat maps, and map plots, and apply supervised learning algorithms, including linear regression and classification models. The key findings are positive correlations between (1) the number of citations per bee family and the number of bee species within that family, (2) the number of interactions per bee species and the number of plant species visited by that species, and (3) the number of citations per plant family and the number of interactions. Exploring the data shows that the dataset is biased: certain types of data are overrepresented, while other entries are missing, underrepresented, or vague, which limits how the dataset can be applied in future data science research.
Congratulations to all the participants! And a huge thank you to Kha-Dinh Luong, Ashley Bruce, Zoe Holzer, and Robin Hollingsworth for all their help during the inaugural program.
Applications for the 2022 Summer Data Science Internship at UCSB will open in March 2022.