Lsci 109: Linguistic Data Science

1 Course Information

Lecture times TR 2-3:20pm
Lecture Location DBH 1431
Canvas site

2 Instructor Information

Instructor Richard Futrell (
Instructor's office SSPB 2215
Instructor's office hours By appointment

3 Course Description

This course teaches how to perform basic quantitative analysis of text data using Python and how to communicate the results of such analyses. Text data includes corpora of books, transcripts of spoken language, online reviews, and social media feeds. Students will complete exercises in Jupyter notebooks in which they will manipulate text data, implement simple machine learning algorithms (classifiers and language models), apply them to bodies of text, and visualize and communicate the results using Python visualization tools such as matplotlib and seaborn. The goal of the class is to develop students' skills in two areas: (1) the practical programming required to analyze text data in Python, and (2) the communication skills needed to draw conclusions from algorithmic results and present them in a way that is both precise and intuitive. A secondary goal is to develop students' knowledge of what kinds of analyses are possible given the current landscape of Natural Language Processing (NLP) technologies.

4 Course Format

Class time will be spent on lectures and guided exercises. Homework will consist of mini-projects involving different bodies of text: students will be asked to perform their own analyses and draw their own conclusions.

Students are required to bring laptops to class in order to work together on pair-programming exercises.

Students are encouraged to collaborate on homework.

Readings are optional but strongly recommended. They are short and you will find them immediately useful.

There are no exams or tests.

5 Intended Audience

This course is intended for advanced undergraduates who already know the basics of Python programming, as demonstrated by passing ICS 033 or similar. The ability to program in Python is a prerequisite for enrollment in the course. I do not assume or require any background in statistics or math beyond high school algebra.

Information on UCI's Language Science major and minor.

6 Schedule (subject to modification)

Day | Topic | Deadlines | Readings | Notebook
1/7 | Introduction | | |
1/9 | Text Wrangling | | Tokenization | Notebook 1
1/14 | Bags of Words | | Generating n-grams | Notebook 2
1/16 | UNKs and Stop Words | | | Notebook 3
1/21 | N-Grams and Association Measures | | Tf-idf | Notebook 4
1/23 | Tf-idf and Topic models | | Topic Modeling with Gensim | Notebook 5
1/28 | Project Time | | |
1/30 | Linear Classifiers | | | Notebook 6
2/4 | Naive Bayes | Mini-project 1 due | Naive Bayes | Notebook 7
2/6 | Evaluation metrics | | Precision and recall | Notebook 8
2/11 | More Classifiers | | Logistic regression | Notebook 9
2/13 | Feature Engineering | | | Notebook 10
2/18 | Visualization and Dimensionality Reduction | | Visualizing high-dimensional datasets | Notebook 11
2/20 | Project Time | | |
2/25 | Word embeddings I | | Word embeddings | Notebook 12
2/27 | Word embeddings II | Mini-project 2 due | Historical word embeddings | Notebook 13
3/3 | Contextual embeddings | | | Notebook 14
3/5 | Multiclass Classifiers and Modern NLP | | Text classification with Transformers | Notebook 15
3/10 | Project Time | | |
3/12 | Project Time (Prof. Futrell out of town) | | |
3/17 | - | Mini-project 3 due | |

7 Requirements & Grading

  • Grade breakdown

    Work Grade percentage
    Mini-projects 80%
    Participation 20%

    All mini-projects are equally weighted.

  • Mini-projects

    In each mini-project, you will be given a dataset and some analytical tools. Your job is to apply those tools to discover something about the dataset and to produce a writeup that explains what you found and how you found it. The writeup will be in the form of a Jupyter or Colab notebook.

    You are highly encouraged to work in groups of up to 3 on mini-projects. Groups will turn in joint writeups. In a joint writeup, there should be a brief statement saying roughly who contributed what to the end product.

    Each mini-project is due at the beginning of class on the day indicated in the schedule. The final mini-project (Mini-project 3) is due at 2pm on 3/17.

  • Participation

    To receive full credit for participation, you must attend every class session and participate actively in the group exercises. If you cannot make it to class, inform the instructor to find out what make-up work you can do to recover the participation credit.

  • Mapping of class score to letter grade

    I guarantee minimum grades based on these thresholds:

    Threshold Guaranteed minimum grade
    >= 90% A
    >= 80% B
    >= 70% C
    >= 60% D
    < 60% F

    So, for example, a score of 90.0001% guarantees you at least an A-. It is unlikely that I will grade the course on a curve, but if I do, the curve can only raise your grade above the guaranteed minimum.
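As a rough illustration of how the grade breakdown and the thresholds above combine, here is a minimal Python sketch. The function names and the sample scores are hypothetical, and it assumes every component is scored as a percentage from 0 to 100. It only maps to whole letters, since the table does not spell out plus and minus cutoffs, and it is not the instructor's actual grading script.

```python
# Weights and thresholds are taken from the tables in this section.
# All names below are hypothetical, chosen for illustration only.
MINI_PROJECT_WEIGHT = 0.80
PARTICIPATION_WEIGHT = 0.20

# (threshold in percent, guaranteed minimum letter), highest first.
THRESHOLDS = [(90.0, "A"), (80.0, "B"), (70.0, "C"), (60.0, "D")]


def class_score(mini_project_scores, participation_score):
    """Combine percentage scores (0-100) into a single class score.

    Mini-projects are equally weighted, so their plain mean is used.
    """
    mini_project_mean = sum(mini_project_scores) / len(mini_project_scores)
    return (MINI_PROJECT_WEIGHT * mini_project_mean
            + PARTICIPATION_WEIGHT * participation_score)


def guaranteed_minimum_letter(score):
    """Return the guaranteed minimum letter for a class score in percent."""
    for threshold, letter in THRESHOLDS:
        if score >= threshold:
            return letter
    return "F"


# Made-up example: three mini-project scores and full participation credit.
score = class_score([92, 85, 88], participation_score=100)
print(round(score, 2), guaranteed_minimum_letter(score))
```

Because a curve, if applied, could only raise a grade, the letter returned here is a floor rather than a prediction.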

8 Academic Integrity

We will adhere fully to the standards and practices set out in UCI's policy on academic integrity. Any attempted academic misconduct or plagiarism will be met with consequences as per university regulations.

9 Disability

Any student requesting academic accommodations based on a disability is required to apply with the Disability Services Center at UCI. For more information, please visit

Author: Richard Futrell

Created: 2020-03-05 Thu 10:51