Colloquium: ELI Data Mining Group, "Data Science Methods Applied to an ESL Learner Corpus: Opportunities and Challenges"

February 1, 2019 - 3:00pm to 4:15pm

Abstract

This presentation summarizes some of the work that the ELI Data Mining group has carried out over the past 12 months. 

The Pittsburgh English Language Institute Corpus (PELIC) is a collection of data from Intensive English Program students, gathered from 2007 to 2012 as part of the Pittsburgh Science of Learning Center (www.learnlab.org).

It includes written texts, spoken (.wav format) and written grammar exercises, and data from Recorded Speaking Activities (in .wav format) (McCormick & Vercellotti, 2013). 

The group has worked on tracking the development of lexical richness in the learners’ writing, including such measures as lexical diversity (vocD) and lexical sophistication (Advanced Guiraud), as well as L1 influence on L2 syllable structure (Li & Juffs, 2015). 
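As a rough illustration of the lexical richness measures named above, the sketch below computes Guiraud's index (types divided by the square root of tokens) and Advanced Guiraud as it is commonly defined (advanced types divided by the square root of tokens). It is not the group's own code: the tokenizer is deliberately crude, and `BASIC_WORDS` is a hypothetical toy stand-in for a real list of high-frequency words.

```python
import math
import re


def tokenize(text):
    """Lowercase the text and extract runs of letters/apostrophes (a crude tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def guiraud(tokens):
    """Guiraud's index: number of types over the square root of the number of tokens."""
    if not tokens:
        return 0.0
    return len(set(tokens)) / math.sqrt(len(tokens))


def advanced_guiraud(tokens, basic_words):
    """Advanced Guiraud: types NOT in the basic word list over sqrt(tokens)."""
    if not tokens:
        return 0.0
    advanced_types = {t for t in set(tokens) if t not in basic_words}
    return len(advanced_types) / math.sqrt(len(tokens))


# Hypothetical toy list; a real analysis would assume a proper frequency-based list.
BASIC_WORDS = {"the", "a", "is", "cat", "on", "mat", "and"}

sample = "The cat is on the mat and the feline is contemplating the mat"
tokens = tokenize(sample)

print(f"Guiraud: {guiraud(tokens):.3f}")
print(f"Advanced Guiraud: {advanced_guiraud(tokens, BASIC_WORDS):.3f}")
```

Dividing by the square root of the token count, rather than the raw count, is meant to reduce the measure's sensitivity to text length, which matters when learner texts vary widely in size.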

This presentation focuses specifically on how Data Science tools and techniques have allowed the group to analyze the lexis in the written data (more than 48,000 texts totaling over 4.2 million words) in a collaborative workspace powered by Git/GitHub.

Challenges that we face include appropriate lemmatization, anonymization of the texts, removal of special characters, and development of a suitable Python toolkit for analyzing the data.
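To make the special-character challenge concrete, here is a minimal sketch of one possible cleaning step, using only the Python standard library. The function name `clean_text` and the choice of which characters to drop are assumptions for illustration, not a description of the group's actual pipeline.

```python
import re
import unicodedata


def clean_text(text):
    """Normalize Unicode, drop invisible control/format characters, tidy spacing."""
    # NFKC folds compatibility characters (e.g. non-breaking spaces) into plain equivalents.
    text = unicodedata.normalize("NFKC", text)
    # Remove control (Cc) and format (Cf) characters, but keep newlines and tabs.
    text = "".join(
        ch for ch in text
        if ch in "\n\t" or unicodedata.category(ch) not in ("Cc", "Cf")
    )
    # Collapse runs of spaces and tabs into a single space.
    text = re.sub(r"[ \t]+", " ", text)
    return text.strip()


raw = "I\u00a0like\u200b  apples.\x00"
print(repr(clean_text(raw)))
```

Characters such as curly apostrophes are left untouched here, since they are legitimate parts of learner writing; whether to keep or fold them is a design decision that depends on the downstream tokenizer and lemmatizer.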

 

References

Li, N., & Juffs, A. (2015). The influence of moraic structure on English L2 syllable final consonants. 2014 Annual Meeting on Phonology.  

McCormick, D. E., & Vercellotti, M. L. (2013). Examining the impact of self-correction notes on grammatical accuracy in speaking. TESOL Quarterly, 47(2), 410-420.

 

A summary of activities and publications can be found at our GitHub repository:  

https://github.com/ELI-Data-Mining-Group/Pitt-ELI-Corpus

Location and Address

Frick Fine Art 204