From Text to Tech
The TORCH Network HiCor: A Cross-Disciplinary Network for History and Corpus Linguistics is hosting a series of workshops on 'From Text to Tech' as part of the Digital Humanities Summer School 2016.
With large amounts of text becoming available through digitization efforts, there is a growing need for automatic analyses in the Digital Humanities to support distant reading. This workshop, originating from the HiCor research network, will impart some of the basics for working computationally and quantitatively with texts. It will take a hands-on approach to processing text, including cleaning and adding automatic linguistic annotation using freely available computational tools and the Python programming language, a very flexible tool with a wide range of applications in Humanities research.
The workshop proceeds in a stepwise manner, with an introduction to corpus linguistics followed by basic programming in Python. The workshop will also teach how to explore texts quantitatively, for example by creating frequency lists and visualizations, and more advanced types of analysis, such as topic modelling. The practical sessions are accompanied by lectures that discuss research which demonstrates concretely how Python and corpus linguistics can be applied to answer questions in a range of humanistic disciplines. The workshop rounds off with a practical problem-solving session covering the topics of the week.
No prior knowledge of programming is required, but attendees should be comfortable with identifying file paths on their own computer and installing software.
Schedule
Monday 4th July
11:00 - 12:30
Why should you learn Python?
Gard Jenset
This introductory session gives an overview of the workshop and discusses why programming is important in Digital Humanities.
Close versus distant reading and linguistic analysis in the Humanities
Gabor Toth
14:00 - 16:00
Introduction to Corpora
Barbara McGillivray
The session will give an introduction to the main concepts of corpus linguistics, including corpus creation and corpus processing for research in Digital Humanities.
Corpus tools
Gabor Toth
This session will introduce participants to selected tools for querying corpora, such as NoSketch Engine and Corpus Bench.
16:30 - 17:30
Corpus tools [Continued]
Tuesday 5th July
11:00 - 12:30
Introduction to programming in Python
Gard Jenset
The session provides a basic introduction to programming for digital humanities using the Python language. Among the topics covered are assignments and variables, data types, conditional statements, and reading/writing data.
14:00 - 16:00
Basic natural language processing (NLP) with Python
Gard Jenset
The session gives an introduction to working with linguistic data in Python. Topics include simple regular expressions and other methods for handling text data.
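As a taste of what simple regular expressions can do with text, here is a minimal sketch using only Python's standard `re` module. The sample sentence and variable names are made up for illustration and are not taken from the workshop materials.

```python
import re

# Hypothetical sample sentence (an assumption, not workshop data).
text = "The workshop runs from 4 July to 8 July 2016."

# Lowercase the text and extract runs of letters as word tokens.
tokens = re.findall(r"[a-z]+", text.lower())

# Extract standalone four-digit numbers, e.g. years.
years = re.findall(r"\b\d{4}\b", text)

print(tokens)
print(years)
```

The pattern `[a-z]+` is a deliberately crude tokenizer: it drops digits and punctuation and would split words containing apostrophes or accented letters, so real texts usually need a more careful pattern.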
Going further with NLP in Python
Barbara McGillivray
This session introduces the NLTK library and shows how it can be used for tasks such as stemming and part-of-speech tagging with Python.
16:30 - 17:30
Going further with NLP in Python [Continued]
Wednesday 6th July
11:00 - 12:30
Corpus methods and social identity in historical texts
Heather Froehlich
This session will explore how researchers can use evidence from the Historical Thesaurus of the OED in combination with corpus methods to investigate lexical features of social identity, with the language of Shakespeare and his contemporaries as a case study.
14:00 - 16:00
Python and more NLTK
Gabor Toth
Corpus linguistics with Python: The session provides an introduction to doing corpus linguistics in Python and NLTK. Topics include collocations, frequency lists, and key words.
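To illustrate what a frequency list and a first look at word pairs might involve, here is a minimal sketch. It uses only the standard library's `collections.Counter` rather than NLTK, and the toy sentence is an invented example, so treat it as an illustration of the idea and not as the session's actual material.

```python
from collections import Counter

# Hypothetical toy corpus, already lowercased (an assumption for illustration).
tokens = "the cat sat on the mat and the cat slept".split()

# Frequency list: how often each word form occurs.
freq = Counter(tokens)

# Counts of adjacent word pairs (bigrams), a crude starting point
# for spotting candidate collocations.
bigrams = Counter(zip(tokens, tokens[1:]))

# The two most frequent word forms, most frequent first.
top = freq.most_common(2)

print(top)
print(bigrams[("the", "cat")])
```

Raw bigram counts only hint at collocations; in practice an association measure is normally applied so that frequent function words do not dominate the results.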
16:30 - 17:30
Python and more NLTK [Continued]
Thursday 7th July
11:00 - 12:30
Creativity is what we say it is: using corpus linguistics to identify key aspects of creativity
Anna Jordanous
As a concept, creativity is complex and multi-dimensional, encompassing many related aspects, abilities, properties and behaviours. Using techniques from the field of statistical natural language processing, we have identified a collection of fourteen key components of creativity. Words were identified which appeared significantly often in connection with discussions of the concept, and a measure of lexical similarity was used to cluster these words. A number of distinct themes emerged, which collectively contribute to our understanding of how creativity is composed.
14:00 - 16:00
Extracting information from text
Barbara McGillivray
The session gives an introduction to how Python and the NLTK library can be used to extract structured information such as named entities from unstructured text.
Topic Modelling
Gard Jenset
This session gives a non-technical introduction to topic modelling along with examples of Python code.
16:30 - 17:30
Topic Modelling [Continued]
Friday 8th July
11:00 - 12:30
Corpora do what? On theory, method and data in Digital Humanities
Knut Melvær
Having stumbled my way into the Digital Humanities, I have had to overcome an array of challenges when it comes to messy data, undocumented and buggy software, the rapid advancements in the tech world and the scarcity of theorizing about what digital methods such as “distant reading” really tell us. In this session I will invite you to explore some of these issues and discuss how we can make DH more approachable with regard to theory and method.
14:00 - 16:00
Problem solving session
The session will provide an opportunity to apply the skills taught during the week, with instructors present to provide guidance.
16:30 - 17:30
Problem solving session [Continued]
Conveners: Gard Jenset, Barbara McGillivray and Gabor Toth
Hashtags: #text2tech and #DHOxSS
Computers: Students are not required to bring their own laptops for this workshop. Desktop computers will be provided by DHOxSS.
HiCor: a Cross-Disciplinary Network for History and Corpus Linguistics
Digital Humanities
Contact name: Barbara McGillivray
Contact email: barbara.mcgilli@gmail.com
Audience: Open to all