Practical Machine Learning Adventures

A selection of machine learning projects

2. Tweet2Bible - Exploring the Data

In the previous post we looked at the problem for our current project: how to match a tweet to passages of the Bible.

In this post we will look at the steps required to prepare some data for our algorithms. We will also look at the nature of the data, which may give us some insights as to how we transform our data in later stages.

Getting the Data

Tweets

Twitter offers an option in the settings to download your tweet archive. Log in to the web version, go to account settings and click the "Download Archive" button. You will then be sent an email with a link to the data.

Fair play to Twitter, the archive is quite cool. You get a CSV file with your tweets and some other useful data, plus a JSON archive (which can be viewed via a local HTML file). To keep things simple we'll just use the CSV file for now.

import csv
# Read the whole archive into a list of rows; the first row is the header
with open('tweets.csv', newline='', encoding='utf-8') as csvfile:
    reader = csv.reader(csvfile, delimiter=',')
    data = [row for row in reader]
data[0]
['tweet_id',
 'in_reply_to_status_id',
 'in_reply_to_user_id',
 'timestamp',
 'source',
 'text',
 'retweeted_status_id',
 'retweeted_status_user_id',
 'retweeted_status_timestamp',
 'expanded_urls']
data[1:3]
[['1009719610910957568',
  '1009527946036613121',
  '20524211',
  '2018-06-21 08:48:09 +0000',
  '<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>',
  "@patently It's okay - whatever photo storage app you use is already plugged into the system.",
  '',
  '',
  '',
  ''],
 ['1009409717205168128',
  '',
  '',
  '2018-06-20 12:16:45 +0000',
  '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
  'I love all the people the computer said “looked a bit like Rick Astley” including, I think, JFK, Jesus and Norman Bates. https://t.co/HHI0HFU0Cy',
  '',
  '',
  '',
  'https://twitter.com/quasimondo/status/1009369380042485760']]

For now we will just extract the text to get a list of strings.

# The tweet text is the sixth column (index 5, 'text' in the header row)
D1 = [d[5] for d in data[1:]]
D1[0:5]
["@patently It's okay - whatever photo storage app you use is already plugged into the system.",
 'I love all the people the computer said “looked a bit like Rick Astley” including, I think, JFK, Jesus and Norman Bates. https://t.co/HHI0HFU0Cy',
 "“These were ancient engineers with a genius that allowed people to walk multi-tonne statues and roll multi-tonne hats - which teaches us about the society's investment in honouring their ancestors. It's quite a remarkable accomplishment” https://t.co/DzscvwJ0do https://t.co/Q7B8ioRS0h",
 'KPMG audit work unacceptable - watchdog https://t.co/9np6lWkHTG [Average remuneration per partner in 2016 = £582k]',
 'Ooo Le Sud by Nino Ferrer - well done @SpotifyUK algorithms https://t.co/ANxMxVWiJV [Question is do I prefer the original French or Nino’s English version? Also check out the brilliantly proggy Métronomie]']
"We have {0} tweets.".format(len(D1))
'We have 9806 tweets.'
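
Indexing by position (d[5]) works, but it depends on the column order of the export. As a minimal sketch, the same extraction can be done by column name with csv.DictReader. The inline sample below is made up for illustration and uses a cut-down header; it stands in for tweets.csv.

```python
import csv
import io

# Hypothetical two-row sample standing in for tweets.csv (cut-down header).
sample = io.StringIO(
    "tweet_id,timestamp,source,text\n"
    '1,2018-06-21 08:48:09 +0000,web,"Hello, world"\n'
    "2,2018-06-20 12:16:45 +0000,web,Second tweet\n"
)

# DictReader uses the header row as keys, so we can ask for 'text' by name.
reader = csv.DictReader(sample)
texts = [row["text"] for row in reader]

print(texts)
```

With the real archive, the sample would be replaced by the file object from open('tweets.csv', newline='', encoding='utf-8').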

Looking at some of our tweets, we need to unescape the text so that HTML entities are decoded, e.g. "&amp;" is converted to "&". This can be done with the html module from the standard library.

import html

D1 = [html.unescape(t) for t in D1]
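
To see what html.unescape does, here is a small sketch on a few made-up strings (they are not taken from the archive):

```python
import html

# Made-up examples of escaped text as it might appear in the archive.
escaped = ["Fish &amp; chips", "1 &lt; 2", "plain text"]
unescaped = [html.unescape(t) for t in escaped]

print(unescaped)
```

Strings without any entities pass through unchanged, so it is safe to apply the call to every tweet.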

Bible

The Bible is actually quite a good source of text for natural language processing projects.

  • it is free;
  • people want to make it easy to distribute;
  • it is naturally broken down into short passages; and
  • it contains a variety of styles (I like to think of it as a 2000-year-old Wikipedia for middle-eastern farmers).

For this project I went to BibleHub.net which offers an Excel spreadsheet featuring 10 different versions of the Bible, where each row is a different verse. You get a free username and password in exchange for registration using an email address.

We can use Pandas to convert the spreadsheet into useful Python data. We then need to pick a Bible to use. I think the most modern translation will probably be best.

import pandas as pd
# Pandas needs the xlrd package to read excel files
!pip3 install xlrd
Collecting xlrd
  Downloading https://files.pythonhosted.org/packages/07/e6/e95c4eec6221bfd8528bcc4ea252a850bffcc4be88ebc367e23a1a84b0bb/xlrd-1.1.0-py2.py3-none-any.whl (108kB)
Installing collected packages: xlrd
Successfully installed xlrd-1.1.0
file = 'bibles.xls'
df = pd.read_excel(file)
df.head()
Verse King James Bible American Standard Version Douay-Rheims Bible Darby Bible Translation English Revised Version Webster Bible Translation World English Bible Young's Literal Translation American King James Version Weymouth New Testament
0 Genesis 1:1 In the beginning God created the heaven and th... In the beginning God created the heavens and t... In the beginning God created heaven, and earth. In the beginning God created the heavens and t... In the beginning God created the heaven and th... In the beginning God created the heaven and th... In the beginning God created the heavens and t... In the beginning of God's preparing the heaven... In the beginning God created the heaven and th... NaN
1 Genesis 1:2 And the earth was without form, and void; and ... And the earth was waste and void; and darkness... And the earth was void and empty, and darkness... And the earth was waste and empty, and darknes... And the earth was waste and void; and darkness... And the earth was without form, and void; and ... Now the earth was formless and empty. Darkness... the earth hath existed waste and void, and dar... And the earth was without form, and void; and ... NaN
2 Genesis 1:3 And God said, Let there be light: and there wa... And God said, Let there be light: and there wa... And God said: Be light made. And light was made. And God said, Let there be light. And there wa... And God said, Let there be light: and there wa... And God said, Let there be light: and there wa... God said, |Let there be light,| and there was ... and God saith, 'Let light be;' and light is. And God said, Let there be light: and there wa... NaN
3 Genesis 1:4 And God saw the light, that <i>it was</i> good... And God saw the light, that it was good: and G... And God saw the light that it was good; and he... And God saw the light that it was good; and Go... And God saw the light, that it was good: and G... And God saw the light, that it was good: and G... God saw the light, and saw that it was good. G... And God seeth the light that it is good, and G... And God saw the light, that it was good: and G... NaN
4 Genesis 1:5 And God called the light Day, and the darkness... And God called the light Day, and the darkness... And he called the light Day, and the darkness ... And God called the light Day, and the darkness... And God called the light Day, and the darkness... And God called the light Day, and the darkness... God called the light |day,| and the darkness h... and God calleth to the light 'Day,' and to the... And God called the light Day, and the darkness... NaN
df.describe()
Verse King James Bible American Standard Version Douay-Rheims Bible Darby Bible Translation English Revised Version Webster Bible Translation World English Bible Young's Literal Translation American King James Version Weymouth New Testament
count 31102 31102 31100 31092 31099 31086 31102 31098 31102 31102 7924
unique 31102 30840 30716 30886 30722 30687 30855 30776 30861 30825 7913
top Proverbs 26:24 And the LORD spake unto Moses, saying, And Jehovah spake unto Moses, saying, And the Lord spoke to Moses, saying: And Jehovah spoke to Moses, saying, And the LORD spake unto Moses, saying, And the LORD spoke to Moses, saying, Yahweh spoke to Moses, saying, And Jehovah speaketh unto Moses, saying, And the LORD spoke to Moses, saying, May grace and peace be granted to you from God...
freq 1 72 72 55 72 72 72 71 73 72 5

Bibles!

Cool. I think the World English Bible looks good for a modern translation. It does have some annoying "|" characters that we might want to scrub out.

worldbible = df[['Verse', 'World English Bible']]
worldbible.head()
Verse World English Bible
0 Genesis 1:1 In the beginning God created the heavens and t...
1 Genesis 1:2 Now the earth was formless and empty. Darkness...
2 Genesis 1:3 God said, |Let there be light,| and there was ...
3 Genesis 1:4 God saw the light, and saw that it was good. G...
4 Genesis 1:5 God called the light |day,| and the darkness h...
D2 = [tuple(x) for x in worldbible.to_records(index=False)]
D2[0:5]
[('Genesis 1:1', 'In the beginning God created the heavens and the earth.'),
 ('Genesis 1:2',
  "Now the earth was formless and empty. Darkness was on the surface of the deep. God's Spirit was hovering over the surface of the waters."),
 ('Genesis 1:3', 'God said, |Let there be light,| and there was light.'),
 ('Genesis 1:4',
  'God saw the light, and saw that it was good. God divided the light from the darkness.'),
 ('Genesis 1:5',
  'God called the light |day,| and the darkness he called |night.| There was evening and there was morning, one day.')]
# Just get rid of those annoying |
D2 = [(str(v), str(t).replace("|","")) for v,t in D2]
D2[0:5]
[('Genesis 1:1', 'In the beginning God created the heavens and the earth.'),
 ('Genesis 1:2',
  "Now the earth was formless and empty. Darkness was on the surface of the deep. God's Spirit was hovering over the surface of the waters."),
 ('Genesis 1:3', 'God said, Let there be light, and there was light.'),
 ('Genesis 1:4',
  'God saw the light, and saw that it was good. God divided the light from the darkness.'),
 ('Genesis 1:5',
  'God called the light day, and the darkness he called night. There was evening and there was morning, one day.')]
"We have {0} Bible passages.".format(len(D2))
'We have 31102 Bible passages.'
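
One thing to watch: df.describe() reports 31098 non-null values in the World English Bible column against 31102 verses, so a few cells appear to be missing. Calling str() on a missing value (NaN) produces the literal string 'nan', which would then be treated as a passage. The sketch below shows one way to skip such rows; clean_passages is a hypothetical helper and the sample records are made up. Whether to drop or keep these rows is a judgment call.

```python
import math

def clean_passages(records):
    """Strip '|' markers and skip rows whose text is missing (NaN or None).

    `clean_passages` is a hypothetical helper, not part of the original post.
    """
    cleaned = []
    for verse, text in records:
        # pandas represents a missing cell as float('nan')
        if text is None or (isinstance(text, float) and math.isnan(text)):
            continue
        cleaned.append((str(verse), str(text).replace("|", "")))
    return cleaned

# Made-up records: the second row imitates a missing verse;
# "Example 1:1" is a placeholder reference, not a real one.
records = [
    ("Genesis 1:3", "God said, |Let there be light,| and there was light."),
    ("Example 1:1", float("nan")),
]
print(clean_passages(records))
```

Applied to the real data this would be D2 = clean_passages(D2) in place of the list comprehension, and the passage count would then come out slightly lower.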

So now we have D1, a list of tweets, and D2, a list of Bible passages. Let's get matching!

# Let's save our data so we can easily load it in a future session
save_data = (D1, D2)
import pickle
with open("processed_data.pkl", 'wb') as f:
    pickle.dump(save_data, f)
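
Loading the data back in a later session mirrors the save. A minimal round-trip sketch follows; it uses made-up stand-ins for D1 and D2 and a temporary directory, so it does not touch the real processed_data.pkl:

```python
import os
import pickle
import tempfile

# Made-up stand-ins for D1 and D2.
D1 = ["a tweet"]
D2 = [("Genesis 1:1", "In the beginning God created the heavens and the earth.")]

# Write to a temporary directory so the sketch does not touch real files.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "processed_data.pkl")
    with open(path, "wb") as f:
        pickle.dump((D1, D2), f)
    # Loading mirrors saving: open in binary mode and unpack the tuple.
    with open(path, "rb") as f:
        loaded_D1, loaded_D2 = pickle.load(f)

print(loaded_D1 == D1, loaded_D2 == D2)
```

Pickle files can execute code when loaded, so only unpickle files you created yourself or otherwise trust.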