Update - Introducing NER

Feb 20, 2017

I’ve now integrated the Stanford NER library, AWS S3 access, a Project Gutenberg cleaning function, and a few other things.

As each sentence is processed to build the Markov chain, it is now passed through the Stanford NER (Named Entity Recognition) library. This attempts to identify every mention of any person, organisation or location. For each name, I store the ‘surface form’ (i.e. the name as it appears in the text) in a list in Redis, one list for each of the three types. I then replace the surface form with a special token that stands for its type, such as a person tag. The Markov chain then represents how often various words appear in the context of people's names, so a sequence like a person tag followed by 'said "Hello."' is quite common compared to a sequence like 'red happy'. When generating new text, these tags appear quite often, so an extra post-processing step is now required. For this initial version, each tag is simply replaced with a surface form chosen randomly from the appropriate list in Redis.
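Here is a minimal sketch of that replace-then-refill idea, not the actual project code. It assumes the tagger output is already available as (word, label) pairs using the PERSON / ORGANIZATION / LOCATION / O labels from Stanford NER's three-class model. The token format `<PERSON>`, the function names, and the in-memory dict standing in for the Redis lists are all hypothetical, chosen just for illustration.

```python
import random
from collections import defaultdict

# Labels from Stanford NER's three-class model; "O" means "not an entity".
ENTITY_LABELS = {"PERSON", "ORGANIZATION", "LOCATION"}

# Hypothetical token format: "<PERSON>", "<ORGANIZATION>", "<LOCATION>".
PLACEHOLDERS = {f"<{label}>": label for label in ENTITY_LABELS}

# Stand-in for the three Redis lists (assumption: the real store is Redis).
surface_forms = defaultdict(list)


def replace_entities(tagged_tokens):
    """Collapse each run of same-label entity words into one placeholder token."""
    output, run, run_label = [], [], None
    # The trailing sentinel flushes an entity that ends the sentence.
    for word, label in list(tagged_tokens) + [("", "O")]:
        if label != run_label and run:
            surface_forms[run_label].append(" ".join(run))
            output.append(f"<{run_label}>")
            run = []
        if label in ENTITY_LABELS:
            run.append(word)
            run_label = label
        else:
            run_label = None
            if word:
                output.append(word)
    return output


def fill_placeholders(tokens, rng=random):
    """Swap each placeholder for a randomly chosen stored surface form.

    Raises IndexError if no surface form of that type has been stored yet.
    """
    filled = []
    for tok in tokens:
        label = PLACEHOLDERS.get(tok)
        filled.append(rng.choice(surface_forms[label]) if label else tok)
    return filled


tagged = [("Mr", "PERSON"), ("Darcy", "PERSON"), ("left", "O"),
          ("London", "LOCATION"), (".", "O")]
training_tokens = replace_entities(tagged)
```

The tokens in `training_tokens` are what would be fed to the Markov chain; `fill_placeholders` would then be run over whatever the chain generates.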

Note that these are lists of surface forms, not sets, so that commonly used names appear more often than rare names. Thus we can select a name at random and it will reflect some sensible underlying probability distribution. Eventually, the lists may get very long and I may need to revisit this, and store the names as a frequency map of some kind.
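For what it's worth, a frequency map could give the same distribution as the list with far less storage. This is only a sketch of one possible approach, using Python's standard `Counter` and `random.choices`; the function names are made up for illustration and nothing here touches Redis.

```python
import random
from collections import Counter


def to_frequency_map(names):
    """Collapse a list with repeats into a name -> count mapping."""
    return Counter(names)


def pick_name(freq, rng=random):
    """Pick one name with probability proportional to its count."""
    names = list(freq)
    weights = [freq[name] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]


people = ["Elizabeth", "Elizabeth", "Elizabeth", "Wickham"]
freq = to_frequency_map(people)
name = pick_name(freq)
```

Picking uniformly from the original list and picking from the map by weight should behave the same way: a name that appears three times is three times as likely as one that appears once.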

Stanford’s NER library is pretty good but not perfect, so sometimes people are mistaken for places, leading to generated texts

Also, each text from Gutenberg is trimmed to remove headers/footers, and then stored in an S3 bucket. This is a) to help me learn more about AWS tooling, and b) so I can later run some heavy code on an AWS EC2 machine with the text already in situ.
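Roughly, the trim-and-store step might look like the sketch below. It assumes the header and footer are delimited by the '*** START OF ...' and '*** END OF ...' marker lines that Project Gutenberg files normally carry; the exact wording varies between files, so the patterns are a best guess rather than a complete cleaner. The function names, bucket and key are hypothetical, and the upload uses boto3's `put_object`, imported lazily so the trimming part runs without it.

```python
import re

# Assumed marker lines; the wording differs a little from file to file.
START_RE = re.compile(
    r"^\*\*\* ?START OF (THE|THIS) PROJECT GUTENBERG EBOOK.*$", re.MULTILINE)
END_RE = re.compile(
    r"^\*\*\* ?END OF (THE|THIS) PROJECT GUTENBERG EBOOK.*$", re.MULTILINE)


def trim_gutenberg(text):
    """Return the text between the start and end markers, if they are present."""
    start = START_RE.search(text)
    end = END_RE.search(text)
    begin = start.end() if start else 0
    finish = end.start() if end and end.start() >= begin else len(text)
    return text[begin:finish].strip()


def store_text(bucket, key, text):
    """Upload the cleaned text to S3 (needs boto3 and AWS credentials)."""
    import boto3  # third-party, imported here so trimming works without it
    boto3.client("s3").put_object(
        Bucket=bucket, Key=key, Body=text.encode("utf-8"))


raw = (
    "Title and licence blurb\n"
    "*** START OF THIS PROJECT GUTENBERG EBOOK EMMA ***\n"
    "Chapter 1\nSome text.\n"
    "*** END OF THIS PROJECT GUTENBERG EBOOK EMMA ***\n"
    "More licence text"
)
cleaned = trim_gutenberg(raw)
```

If neither marker is found, the text is returned unchanged apart from stripped whitespace, which seems safer than guessing where the book starts.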

Note to self: in ordinary writing, I’d say the spellings ‘organisation’ and ‘organization’ are equally acceptable. However, defining a variable with either as the name is a recipe for bugs, as I forget which one I’m using and get ‘unknown variable’ errors. So it’s usually just ‘org’! Or even ner_org = "ORGANIZATION", so I only refer to the string itself once.