From Text to Tech

The TORCH network HiCor (A Cross-Disciplinary Network for History and Corpus Linguistics) is hosting a series of workshops, 'From Text to Tech', as part of the Digital Humanities Summer School 2016.

With large amounts of text becoming available through digitization efforts, there is a growing need for automatic analyses in the Digital Humanities to support distant reading. This workshop, originating from the HiCor research network, will impart some of the basics for working computationally and quantitatively with texts. It will take a hands-on approach to processing text, including cleaning and adding automatic linguistic annotation using freely available computational tools and the Python programming language, a very flexible tool with a wide range of applications in Humanities research.

The workshop proceeds in a stepwise manner, with an introduction to corpus linguistics followed by basic programming in Python. The workshop will also teach how to explore texts quantitatively, for example by creating frequency lists and visualizations, and more advanced types of analysis, such as topic modelling. The practical sessions are accompanied by lectures that discuss research which demonstrates concretely how Python and corpus linguistics can be applied to answer questions in a range of humanistic disciplines. The workshop rounds off with a practical problem-solving session covering the topics of the week.

No prior knowledge of programming is required, but attendees should be comfortable with identifying file paths on their own computer and installing software.

Schedule

Monday 4th July

11:00 - 12:30

Why should you learn Python?

Gard Jenset

This introductory session gives an overview of the workshop and discusses why programming is important in Digital Humanities.

Close versus distant reading and linguistic analysis in the Humanities

Gabor Toth

14:00 - 16:00

Introduction to Corpora

Barbara McGillivray

The session will give an introduction to the main concepts of corpus linguistics, including corpus creation and corpus processing for research in Digital Humanities.

Corpus tools

Gabor Toth

This session will introduce participants to selected tools for querying corpora, such as NoSketch Engine and Corpus Bench.

16:30 - 17:30

Corpus tools [Continued]

Tuesday 5th July

11:00 - 12:30

Introduction to programming in Python

Gard Jenset

The session provides a basic introduction to programming for digital humanities using the Python language. Among the topics covered are assignments and variables, data types, conditional statements, and reading/writing data.
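As a rough illustration of the topics listed above (not the session's actual material), the following sketch touches on variables, data types, a conditional statement, and reading/writing data. The file name `notes.txt` and the temporary directory are assumptions made for the example.

```python
import os
import tempfile

title = "From Text to Tech"      # a string assigned to a variable
words = title.split()            # a list of strings
n_words = len(words)             # an integer

# A conditional statement choosing between two labels.
if n_words > 3:
    label = "long title"
else:
    label = "short title"

# Write the result to a file in a temporary directory, then read it back.
# "notes.txt" is a hypothetical file name used only for this example.
path = os.path.join(tempfile.mkdtemp(), "notes.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write(f"{title}: {label}\n")
with open(path, encoding="utf-8") as f:
    contents = f.read()

print(contents)
```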

14:00 - 16:00

Basic natural language processing (NLP) with Python

Gard Jenset

The session gives an introduction to working with linguistic data in Python. Topics include simple regular expressions and other methods for handling text data.
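To give a flavour of what simple regular expressions can do with text, here is a minimal sketch using Python's built-in `re` module. The sample sentence and the patterns are assumptions chosen for illustration, not examples taken from the session.

```python
import re

# Hypothetical sample text; any plain-text string would do.
text = "The Queen's ship arriv'd in 1588; the ship was lost in 1589."

# Lowercase the text and extract runs of letters, allowing internal apostrophes.
tokens = re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())

# Find four-digit numbers, which in this text are years.
years = re.findall(r"\b\d{4}\b", text)

print(tokens)
print(years)
```

A pattern this simple will not handle every text (accented letters and hyphenated words, for instance, would need a broader pattern), but it shows the general approach.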

Going further with NLP in Python

Barbara McGillivray

This session introduces the NLTK library and shows how it can be used for tasks such as stemming and part-of-speech tagging in Python.

16:30 - 17:30

Going further with NLP in Python [Continued]

Wednesday 6th July

11:00 - 12:30

Corpus methods and social identity in historical texts

Heather Froehlich

This session will explore how researchers can use evidence from the Historical Thesaurus of the OED in combination with corpus methods to investigate lexical features of social identity, with the language of Shakespeare and his contemporaries as a case study.

14:00 - 16:00

Python and more NLTK

Gabor Toth

Corpus linguistics with Python: The session provides an introduction to doing corpus linguistics with Python and NLTK. Topics include collocations, frequency lists, and key words.
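As a minimal sketch of a frequency list, the example below uses only Python's standard library rather than NLTK, so that it runs without extra installation. The toy corpus is an assumption, and raw counts of adjacent word pairs are only a first step towards collocation analysis, which normally relies on a statistical association measure.

```python
from collections import Counter

# Hypothetical toy corpus, already tokenised into lowercase words.
tokens = "the king and the queen and the court".split()

# Frequency list: each word with its count.
freq = Counter(tokens)

# Counts of adjacent word pairs (bigrams), a starting point for collocations.
bigrams = Counter(zip(tokens, tokens[1:]))

# Print the three most frequent words, then the most frequent bigram.
for word, count in freq.most_common(3):
    print(word, count)
print(bigrams.most_common(1))
```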

16:30 - 17:30

Python and more NLTK [Continued]

Thursday 7th July

11:00 - 12:30

Creativity is what we say it is: using corpus linguistics to identify key aspects of creativity

Anna Jordanous

As a concept, creativity is complex and multi-dimensional, encompassing many related aspects, abilities, properties and behaviours. Using techniques from the field of statistical natural language processing, we have identified a collection of fourteen key components of creativity. Words were identified which appeared significantly often in connection with discussions of the concept, and a measure of lexical similarity was used to cluster these words. A number of distinct themes emerged, which collectively contribute to our understanding of how creativity is composed.

14:00 - 16:00

Extracting information from text

Barbara McGillivray

The session gives an introduction to how Python and the NLTK library can be used to extract structured information, such as named entities, from unstructured text.

Topic Modelling

Gard Jenset

This session gives a non-technical introduction to topic modelling along with examples of Python code.

16:30 - 17:30

Topic Modelling [Continued]

Friday 8th July

11:00 - 12:30

Corpora do what? On theory, method and data in Digital Humanities

Knut Melvær

Having stumbled my way into the Digital Humanities, I have had to overcome an array of challenges: messy data, undocumented and buggy software, rapid advances in the tech world, and the scarcity of theorizing about what digital methods such as “distant reading” really tell us. In this session I will invite you to explore some of these issues and discuss how we can make DH more approachable with regard to theory and method.

14:00 - 16:00

Problem solving session

The session will provide an opportunity to apply the skills taught during the week, with instructors present to provide guidance.

16:30 - 17:30

Problem solving session [Continued]

Conveners: Gard Jenset, Barbara McGillivray and Gabor Toth

Hashtags: #text2tech and #DHOxSS

Computers: Students are not required to bring their own laptops for this workshop. Desktop computers will be provided by DHOxSS.

HiCor: a Cross-Disciplinary Network for History and Corpus Linguistics

Digital Humanities

Contact name: Barbara McGillivray

Contact email: barbara.mcgilli@gmail.com

Audience: Open to all