Analyzing NY Times Topics

This is a presentation I gave for prospective students at Flatiron School. It was meant as an engaging introductory example demonstrating some of the interesting insights students can draw after only a month or two of intensive work at the bootcamp. As such, the talk covers a wide variety of topics: the basic data science workflow, APIs, HTTP requests, natural language processing and visualization. Despite this range (including some fairly complex topics such as Latent Dirichlet Allocation), the talk is meant to be accessible to a wide audience and to demonstrate just how powerful newbie data scientists can be with proper guidance and modularization of knowledge. With that, I hope you enjoy it and start to get a glimpse of both the power and accessibility of many modern data science workflows!


Analyzing NY Times Articles

General Data Science Outline

Our Outline

Acquire Articles - The NY Times API

https://developers.nytimes.com/

HTTP Requests

HTTP stands for Hypertext Transfer Protocol. Like many internet protocols, it is standardized by the Internet Engineering Task Force (IETF) through documents known as requests for comments (RFCs). We're going to start with a very simple HTTP method: the GET method.

To learn more about HTTP methods see:
https://developer.mozilla.org/en-US/docs/Web/HTTP/Methods

Python's Requests Package

The first thing to understand when dealing with APIs is how to make GET requests in general. To do this, we'll use the Python requests package.

http://docs.python-requests.org/en/master/

Making a get request

In [1]:
import requests
In [9]:
response = requests.get('https://flatironschool.com')
print('Type:', type(response), '\n')
print('Response:', response, '\n')
print('Response text:\n', response.text)
Type: <class 'requests.models.Response'> 

Response: <Response [403]> 

Response text:
 <html>
<head><title>403 Forbidden</title></head>
<body bgcolor="white">
<center><h1>403 Forbidden</h1></center>
<hr><center>nginx</center>
</body>
</html>

Hmmm, well that was only partially helpful. You can see that our request was denied: the response carries the status code 403, meaning forbidden. Most likely, this is caused by Flatiron School's servers blocking requests that appear to come from an automated client.

HTTP Response Codes

In general, here are some common HTTP response codes you might come across:
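As a quick reference, the standard phrase for a few frequently seen codes can be looked up with Python's built-in `http.HTTPStatus` enum. The selection of codes below is an illustrative choice of mine, not an exhaustive list, and `describe_status` is a helper name made up for this sketch.

```python
from http import HTTPStatus

# An illustrative selection of codes; HTTPStatus defines many more
COMMON_CODES = [200, 301, 400, 401, 403, 404, 429, 500, 503]

def describe_status(code):
    """Return a string such as '403 Forbidden' for a standard HTTP status code."""
    status = HTTPStatus(code)
    return '{} {}'.format(status.value, status.phrase)

for code in COMMON_CODES:
    print(describe_status(code))
```

Broadly, codes in the 200s indicate success, the 300s redirection, the 400s a problem with the request (such as our 403), and the 500s a problem on the server.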

Let's try another get request in the hopes of getting a successful (200) response.

In [11]:
#The Electronic Frontier Foundation (EFF) website; advocating for data privacy and an open internet
response = requests.get('https://www.eff.org')
print(response)
print(response.text[:2500])
<Response [200]>
<!DOCTYPE html>
  <!--[if IEMobile 7]><html class="no-js ie iem7" lang="en" dir="ltr"><![endif]-->
  <!--[if lte IE 6]><html class="no-js ie lt-ie9 lt-ie8 lt-ie7" lang="en" dir="ltr"><![endif]-->
  <!--[if (IE 7)&(!IEMobile)]><html class="no-js ie lt-ie9 lt-ie8" lang="en" dir="ltr"><![endif]-->
  <!--[if IE 8]><html class="no-js ie lt-ie9" lang="en" dir="ltr"><![endif]-->
  <!--[if (gte IE 9)|(gt IEMobile 7)]><html class="no-js ie" lang="en" dir="ltr" prefix="fb: http://ogp.me/ns/fb# og: http://ogp.me/ns# article: http://ogp.me/ns/article# book: http://ogp.me/ns/book# profile: http://ogp.me/ns/profile# video: http://ogp.me/ns/video# product: http://ogp.me/ns/product#"><![endif]-->
  <!--[if !IE]><!--><html class="no-js" lang="en" dir="ltr" prefix="fb: http://ogp.me/ns/fb# og: http://ogp.me/ns# article: http://ogp.me/ns/article# book: http://ogp.me/ns/book# profile: http://ogp.me/ns/profile# video: http://ogp.me/ns/video# product: http://ogp.me/ns/product#"><!--<![endif]-->
<head>
  <meta charset="utf-8" />
<link href="https://www.eff.org/vi" rel="alternate" hreflang="vi" />
<link rel="apple-touch-icon-precomposed" href="https://www.eff.org/sites/all/themes/phoenix/apple-touch-icon-precomposed-114x114.png" sizes="114x114" />
<link href="https://www.eff.org/ur" rel="alternate" hreflang="ur" />
<link href="https://www.eff.org/tr" rel="alternate" hreflang="tr" />
<link href="https://www.eff.org/sh" rel="alternate" hreflang="sh" />
<link href="https://www.eff.org/sv" rel="alternate" hreflang="sv" />
<link href="https://www.eff.org/th" rel="alternate" hreflang="th" />
<link rel="apple-touch-icon-precomposed" href="https://www.eff.org/sites/all/themes/phoenix/apple-touch-icon-precomposed-72x72.png" sizes="72x72" />
<link rel="apple-touch-icon-precomposed" href="https://www.eff.org/sites/all/themes/phoenix/apple-touch-icon-precomposed-144x144.png" sizes="144x144" />
<link rel="profile" href="http://www.w3.org/1999/xhtml/vocab" />
<link rel="shortcut icon" href="https://www.eff.org/sites/all/themes/frontier/favicon.ico" type="image/vnd.microsoft.icon" />
<meta name="HandheldFriendly" content="true" />
<meta name="MobileOptimized" content="width" />
<link rel="apple-touch-icon-precomposed" href="https://www.eff.org/sites/all/themes/phoenix/apple-touch-icon-precomposed.png" />
<meta http-equiv="cleartype" content="on" />
<link href="https://www.eff.org/ru" rel="alternate" hreflang="ru" />
<link href="https://www.eff.org/es" rel="alternate" hreflang="es" />
<link href

Success! As you can see, response.text is the HTML code for the URL that we requested. This is also the basis for how web browsers work: every time you enter a new URL or click on a link, your computer makes a GET request for that page, and the browser then renders it into a visual display on screen.

OAuth

Some requests are a bit more complicated. Often, websites require identity verification such as logins. This helps with a variety of issues such as protecting privacy, limiting access to content and tracking users' history. OAuth extends this idea by allowing third parties such as apps to access user information without being given the underlying password itself.

In the words of the Internet Engineering Task Force, "The OAuth 2.0 authorization framework enables a third-party application to obtain limited access to an HTTP service, either on behalf of a resource owner by orchestrating an approval interaction between the resource owner and the HTTP service, or by allowing the third-party application to obtain access on its own behalf. This specification replaces and obsoletes the OAuth 1.0 protocol described in RFC 5849."

See https://oauth.net/2/ or https://tools.ietf.org/html/rfc6749 for more details.

Access Tokens

In order to make requests to many APIs, you are required to authenticate with an access token (often called an API key). As a result, the first step is to sign up through the web interface using your browser. Once you have an API key, you can then use it to make requests to the API. As with login passwords for your computer, these access tokens should be kept secret! For example, rather than including the key directly in this file, I have saved it to a separate file called 'ny_times_api_keys.py'. The file would look something like this:

api_key = 'blah_blah_blah_YOUR_KEY_HERE'
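Another common way to keep a key out of your code is an environment variable. The following is only a sketch of that alternative: the variable name `NYT_API_KEY` and the helper `load_api_key` are names I've made up for illustration, not anything the NY Times API requires.

```python
import os

def load_api_key(env_var='NYT_API_KEY'):
    """Read an API key from an environment variable, failing loudly if it is missing."""
    # env_var is a hypothetical name; use whatever name you exported in your shell
    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError('Set the {} environment variable first.'.format(env_var))
    return key
```

With the variable set in your shell, `api_key = load_api_key()` would take the place of the import from the keys file. If you stick with the separate file instead, it's worth making sure that file is excluded from version control.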

Now it's time to start making some API calls!

In [12]:
from ny_times_api_keys import api_key  # import only the name we need
In [8]:
import requests
In [70]:
url = "https://api.nytimes.com/svc/search/v2/articlesearch.json"
url_params = {"api-key" : api_key,
             'fq' : 'The New York Times',
             'sort' : "newest"}
response = requests.get(url, params=url_params)
In [71]:
response
Out[71]:
<Response [200]>

JSON Files

While you can see that we received an HTTP code of 200, indicating success, the actual data from the request is returned in JSON format. JSON stands for JavaScript Object Notation and is the standard format for most data requests from the web these days. You can read more about JSON here. With that, let's take a quick look at our data:

In [72]:
response.json()
Out[72]:
{'status': 'OK',
 'copyright': 'Copyright (c) 2018 The New York Times Company. All Rights Reserved.',
 'response': {'docs': [{'web_url': 'https://www.nytimes.com/interactive/2018/upshot/elections-poll-ny22.html',
    'snippet': 'The district stretches from Lake Ontario to the Pennsylvania border.',
    'blog': {},
    'source': 'The New York Times',
    'multimedia': [{'rank': 0,
      'subtype': 'xlarge',
      'caption': None,
      'credit': None,
      'type': 'image',
      'url': 'images/2018/11/01/upshot/elections-poll-ny22-1541083888183/elections-poll-ny22-1541083888183-articleLarge.png',
      'height': 368,
      'width': 600,
      'legacy': {'xlarge': 'images/2018/11/01/upshot/elections-poll-ny22-1541083888183/elections-poll-ny22-1541083888183-articleLarge.png',
       'xlargewidth': 600,
       'xlargeheight': 368},
      'subType': 'xlarge',
      'crop_name': 'articleLarge'},
     {'rank': 0,
      'subtype': 'wide',
      'caption': None,
      'credit': None,
      'type': 'image',
      'url': 'images/2018/11/01/upshot/elections-poll-ny22-1541083888183/elections-poll-ny22-1541083888183-thumbWide.png',
      'height': 126,
      'width': 190,
      'legacy': {'wide': 'images/2018/11/01/upshot/elections-poll-ny22-1541083888183/elections-poll-ny22-1541083888183-thumbWide.png',
       'widewidth': 190,
       'wideheight': 126},
      'subType': 'wide',
      'crop_name': 'thumbWide'},
     {'rank': 0,
      'subtype': 'thumbnail',
      'caption': None,
      'credit': None,
      'type': 'image',
      'url': 'images/2018/11/01/upshot/elections-poll-ny22-1541083888183/elections-poll-ny22-1541083888183-thumbStandard.png',
      'height': 75,
      'width': 75,
      'legacy': {'thumbnail': 'images/2018/11/01/upshot/elections-poll-ny22-1541083888183/elections-poll-ny22-1541083888183-thumbStandard.png',
       'thumbnailwidth': 75,
       'thumbnailheight': 75},
      'subType': 'thumbnail',
      'crop_name': 'thumbStandard'}
     
In [16]:
type(response.json())
Out[16]:
dict
In [29]:
response.json()['response']['docs'][0]['web_url']
Out[29]:
'https://www.nytimes.com/1989/10/21/world/poland-s-premier-in-rome-seeks-aid.html'
In [24]:
response.json()['response']['docs'][0]['headline']
Out[24]:
{'main': "POLAND'S PREMIER, IN ROME, SEEKS AID",
 'kicker': None,
 'content_kicker': None,
 'print_headline': None,
 'name': None,
 'seo': None,
 'sub': None}
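Since the parsed JSON is just nested dictionaries and lists, the same indexing can be wrapped in a small helper. This is a minimal sketch: `sample_payload` is a hand-written stand-in for `response.json()`, trimmed to the keys used above, and its URLs and headlines are placeholders rather than real API output. `extract_headlines` is a name made up for this example.

```python
# A hand-written stand-in for response.json(); values are placeholders
sample_payload = {
    'status': 'OK',
    'response': {
        'docs': [
            {'web_url': 'https://www.nytimes.com/example-1.html',
             'headline': {'main': 'First Example Headline', 'kicker': None}},
            {'web_url': 'https://www.nytimes.com/example-2.html',
             'headline': {'main': 'Second Example Headline', 'kicker': None}},
        ]
    }
}

def extract_headlines(payload):
    """Return the main headline of every article in the payload."""
    # .get() with a default avoids a KeyError if a level is missing
    docs = payload.get('response', {}).get('docs', [])
    return [doc.get('headline', {}).get('main') for doc in docs]

print(extract_headlines(sample_payload))
# → ['First Example Headline', 'Second Example Headline']
```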

Transforming Our Data

In [35]:
import pandas as pd
In [37]:
pd.DataFrame(response.json()['response']['docs'])
Out[37]:
_id abstract blog byline document_type headline keywords multimedia news_desk print_page pub_date score section_name snippet source type_of_material web_url word_count
0 4fd1aa418eb7c8105d6c7bc7 NaN {} {'original': 'By ALAN RIDING, Special to The N... article {'main': 'POLAND'S PREMIER, IN ROME, SEEKS AID... [{'name': 'persons', 'value': 'POPE', 'rank': ... [] Foreign Desk 4 1989-10-21T00:00:00Z 1.0 NaN LEAD: On his first trip abroad since taking of... The New York Times News https://www.nytimes.com/1989/10/21/world/polan... 694
1 4fc04afd45c1498b0d22d8e2 Cable opened {} NaN article {'main': 'PRESIDENT OPENS NEW MANILA CABLE; Me... [{'name': 'persons', 'value': 'ROOSEVELT, THEO... [] NaN 1 1903-07-05T00:00:00Z 1.0 NaN With the completion of the Commercial Pacific ... The New York Times Front Page https://query.nytimes.com/gst/abstract.html?re... 1140
2 50193d971c22dfde670b384a The N.H.L. gave the players' union thousands o... {} {'original': 'By JEFF Z. KLEIN', 'person': [{'... blogpost {'main': 'In N.H.L. Negotiation, 76,000 Pages,... [{'name': 'persons', 'value': 'Bettman, Gary',... [] NaN NaN 2012-08-01T10:24:31Z 1.0 Hockey The N.H.L. gave the players' union thousands o... The New York Times Blog https://slapshot.blogs.nytimes.com/2012/08/01/... 306
3 4fc0b42c45c1498b0d417411 Leaves for Portsmouth, Me {} NaN article {'main': 'Birth Notice 1 -- No Title', 'kicker... [{'name': 'persons', 'value': 'MEADOWCROFT, WM... [] NaN 19 1931-07-18T00:00:00Z 1.0 NaN Leaves for Portsmouth, Me The New York Times Birth Notice https://query.nytimes.com/gst/abstract.html?re... 139
4 5360131038f0d87ca6edfc10 chrysanthemum show, NY Botanical Garden; illus {} NaN article {'main': 'MUMS DISPLAYED IN MOTIF OF JAPAN', '... [{'name': 'subject', 'value': 'HORTICULTURE', ... [] NaN NaN 1964-11-09T00:00:00Z 1.0 NaN chrysanthemum show, NY Botanical Garden; illus The New York Times News https://www.nytimes.com/1964/11/09/mums-displa... 198
5 4fc04afd45c1498b0d22d8e7 Warner, J. De W., protest water fee tales {} {'original': 'JOHN DE WITT WARNER', 'person': ... article {'main': 'John De Witt Warner's Compensation.'... [{'name': 'persons', 'value': 'WARNER, J. DE W... [] NaN 16 1903-12-15T00:00:00Z 1.0 NaN Warner, J. De W., protest water fee tales The New York Times Letter https://query.nytimes.com/gst/abstract.html?re... 212
6 4fbfdf8145c1498b0d03fa63 NaN {} NaN article {'main': 'Amusements this Evening.', 'kicker':... [] [] NaN 4 1863-09-23T00:03:58Z 1.0 NaN The New York Times Article https://www.nytimes.com/1863/09/23/news/amusem... 74
7 501945961c22dfde670b3882 A new study shows less mislabeling of sturgeon... {} {'original': 'By FLORENCE FABRICANT', 'person'... blogpost {'main': 'A Caviar Crackdown Has Worked, Resea... [{'name': 'glocations', 'value': 'New York Cit... [] NaN NaN 2012-08-01T11:02:20Z 1.0 Dining &amp; Wine A new study shows less mislabeling of sturgeon... The New York Times Blog https://dinersjournal.blogs.nytimes.com/2012/0... 412
8 4fc0b42c45c1498b0d417415 Wilson E; kidnapping feared {} NaN article {'main': 'ACTRESS VANISHES; KIDNAPPING FEARED;... [{'name': 'persons', 'value': 'WILSON, EVELYN'... [] NaN 15 1931-07-06T00:00:00Z 1.0 NaN With a black silk purse as their most tangible... The New York Times Article https://query.nytimes.com/gst/abstract.html?re... 713
9 4fc4782845c1498b0d9f334b State Farm Mutual Auto Ins Co announces it wil... {} NaN article {'main': 'STATE FARM PLANS DIVIDEND PAYMENTS',... [{'name': 'subject', 'value': 'AUTOMOBILE INSU... [] NaN 52 1971-08-17T00:00:00Z 1.0 NaN The State Farm Mutual Automobile Insurance Com... The New York Times Article https://query.nytimes.com/gst/abstract.html?re... 171

Repeating the process programmatically

In [39]:
import time
In [ ]:
url = "https://api.nytimes.com/svc/search/v2/articlesearch.json"
responses = []
for i in range(10**3):
    # Request one page of results per iteration
    url_params = {"api-key": api_key,
                  "fq": "The New York Times",
                  "q": "politics",
                  "sort": "newest",
                  "page": i}
    response = requests.get(url, params=url_params)
    if response.status_code == 200:
        responses.append(response)
    else:
        # Note: after the pause, a failed page is skipped rather than retried
        print('Request Failed.')
        print(response)
        print('Pausing for 60 seconds.')
        time.sleep(60)
    time.sleep(2)  # Always pause 2 seconds between requests
print(len(responses))
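The loop above moves on to the next page after a failed request, so that page is never collected. One possible refinement, sketched here under my own assumptions, is to retry the same request a few times before giving up. `fetch_with_retries` is a hypothetical helper; it accepts any zero-argument callable so that it can be tried out without touching the network.

```python
import time

def fetch_with_retries(fetch, max_retries=3, pause=60):
    """Call fetch() until it returns a response with status code 200.

    fetch is any zero-argument callable returning an object with a
    status_code attribute, such as a requests.Response.
    Returns the successful response, or None if every attempt fails.
    """
    for attempt in range(max_retries):
        response = fetch()
        if response.status_code == 200:
            return response
        print('Attempt {} failed with status {}.'.format(attempt + 1, response.status_code))
        if attempt < max_retries - 1:
            time.sleep(pause)  # wait before trying the same request again
    return None
```

Inside the loop, the call would look like `fetch_with_retries(lambda: requests.get(url, params=url_params))`, appending the result to `responses` only when it is not `None`.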

Pulling Out Headline Text

In [77]:
dfs = []
for r in responses:
    dfs.append(pd.DataFrame(r.json()['response']['docs']))
df = pd.concat(dfs, ignore_index=True)
print(len(df))
df.head()
2010
Out[77]:
_id abstract blog byline document_type headline keywords multimedia news_desk print_page pub_date score section_name snippet source type_of_material uri web_url word_count
0 5bdb3ba600a1bc2872e91b3c NaN {} {'original': 'By MICHAEL TACKETT', 'person': [... article {'main': 'Writing Postcards Brings Voters Back... [{'name': 'subject', 'value': 'Politics and Go... [{'rank': 0, 'subtype': 'xlarge', 'caption': N... Washington NaN 2018-11-01T17:45:04+0000 16.240780 Politics A grass roots army of almost 40,000 is hand wr... The New York Times News nyt://article/5de9ee53-d584-5ef8-bbb4-7ee60a67... https://www.nytimes.com/2018/11/01/us/politics... 1001
1 5bdb39f600a1bc2872e91b39 NaN {} {'original': 'By MICHAEL S. SCHMIDT, MARK MAZZ... article {'main': 'Read the Emails: The Trump Campaign ... [{'name': 'subject', 'value': 'Presidential El... [{'rank': 0, 'subtype': 'xlarge', 'caption': N... Washington NaN 2018-11-01T17:37:56+0000 15.122690 Politics Newly revealed messages show how the political... The New York Times News nyt://article/02569d9a-d5b6-5e19-89c6-d74fc500... https://www.nytimes.com/2018/11/01/us/politics... 964
2 5bdb39b100a1bc2872e91b38 NaN {} {'original': 'By SHARON LaFRANIERE, MICHAEL S.... article {'main': 'Roger Stone Sold Himself to Trump’s ... [{'name': 'subject', 'value': 'Presidential El... [{'rank': 0, 'subtype': 'xlarge', 'caption': N... Washington NaN 2018-11-01T17:36:36+0000 13.804007 Politics The special counsel is investigating whether M... The New York Times News nyt://article/4ee43518-63b8-5f83-9077-c59eed29... https://www.nytimes.com/2018/11/01/us/politics... 1762
3 5bdb391e00a1bc2872e91b32 NaN {} {'original': 'By JASON FARAGO', 'person': [{'f... article {'main': 'How Conspiracy Theories Shape Art', ... [{'name': 'subject', 'value': 'Art', 'rank': 1... [{'rank': 0, 'subtype': 'xlarge', 'caption': N... Weekend NaN 2018-11-01T17:34:14+0000 9.288733 Art & Design At the Met Breuer, the crackpot exhibition “Ev... The New York Times Review nyt://article/1deb977a-def7-5508-999d-9394c566... https://www.nytimes.com/2018/11/01/arts/design... 1436
4 5bdb391f00a1bc2872e91b33 NaN {} {'original': 'By ALAN RAPPEPORT', 'person': [{... article {'main': 'Democrats Eye Trump’s Tax Returns, W... [{'name': 'persons', 'value': 'Trump, Donald J... [{'rank': 0, 'subtype': 'xlarge', 'caption': N... Washington NaN 2018-11-01T17:34:12+0000 16.596770 Politics Democrats intend to request the president’s ta... The New York Times News nyt://article/08b8183d-f48a-5027-8d1b-286cf30a... https://www.nytimes.com/2018/11/01/us/politics... 1016
In [79]:
df['main_headline'] = df.headline.map(lambda x: x['main'])
In [80]:
text = ''
for h in df.main_headline:
    text += str(h)
print(len(text), text[:50], text[-50:])
117759 Writing Postcards Brings Voters Back From the Edge ard Channing Is a Mother to Remember in ‘Apologia’
In [81]:
df.to_csv('Pulls_Nov1_2018_recent.csv', index=False)
In [1]:
import pandas as pd
df = pd.read_csv('Pulls_Nov1_2018_recent.csv')
df.head()
Out[1]:
_id abstract blog byline document_type headline keywords multimedia news_desk print_page pub_date score section_name snippet source type_of_material uri web_url word_count main_headline
0 5bdb3ba600a1bc2872e91b3c NaN {} {'original': 'By MICHAEL TACKETT', 'person': [... article {'main': 'Writing Postcards Brings Voters Back... [{'name': 'subject', 'value': 'Politics and Go... [{'rank': 0, 'subtype': 'xlarge', 'caption': N... Washington NaN 2018-11-01T17:45:04+0000 16.240780 Politics A grass roots army of almost 40,000 is hand wr... The New York Times News nyt://article/5de9ee53-d584-5ef8-bbb4-7ee60a67... https://www.nytimes.com/2018/11/01/us/politics... 1001 Writing Postcards Brings Voters Back From the ...
1 5bdb39f600a1bc2872e91b39 NaN {} {'original': 'By MICHAEL S. SCHMIDT, MARK MAZZ... article {'main': 'Read the Emails: The Trump Campaign ... [{'name': 'subject', 'value': 'Presidential El... [{'rank': 0, 'subtype': 'xlarge', 'caption': N... Washington NaN 2018-11-01T17:37:56+0000 15.122690 Politics Newly revealed messages show how the political... The New York Times News nyt://article/02569d9a-d5b6-5e19-89c6-d74fc500... https://www.nytimes.com/2018/11/01/us/politics... 964 Read the Emails: The Trump Campaign and Roger ...
2 5bdb39b100a1bc2872e91b38 NaN {} {'original': 'By SHARON LaFRANIERE, MICHAEL S.... article {'main': 'Roger Stone Sold Himself to Trump’s ... [{'name': 'subject', 'value': 'Presidential El... [{'rank': 0, 'subtype': 'xlarge', 'caption': N... Washington NaN 2018-11-01T17:36:36+0000 13.804007 Politics The special counsel is investigating whether M... The New York Times News nyt://article/4ee43518-63b8-5f83-9077-c59eed29... https://www.nytimes.com/2018/11/01/us/politics... 1762 Roger Stone Sold Himself to Trump’s Campaign a...
3 5bdb391e00a1bc2872e91b32 NaN {} {'original': 'By JASON FARAGO', 'person': [{'f... article {'main': 'How Conspiracy Theories Shape Art', ... [{'name': 'subject', 'value': 'Art', 'rank': 1... [{'rank': 0, 'subtype': 'xlarge', 'caption': N... Weekend NaN 2018-11-01T17:34:14+0000 9.288734 Art & Design At the Met Breuer, the crackpot exhibition “Ev... The New York Times Review nyt://article/1deb977a-def7-5508-999d-9394c566... https://www.nytimes.com/2018/11/01/arts/design... 1436 How Conspiracy Theories Shape Art
4 5bdb391f00a1bc2872e91b33 NaN {} {'original': 'By ALAN RAPPEPORT', 'person': [{... article {'main': 'Democrats Eye Trump’s Tax Returns, W... [{'name': 'persons', 'value': 'Trump, Donald J... [{'rank': 0, 'subtype': 'xlarge', 'caption': N... Washington NaN 2018-11-01T17:34:12+0000 16.596770 Politics Democrats intend to request the president’s ta... The New York Times News nyt://article/08b8183d-f48a-5027-8d1b-286cf30a... https://www.nytimes.com/2018/11/01/us/politics... 1016 Democrats Eye Trump’s Tax Returns, With Mnuchi...

Simple Visualizations

In [2]:
import matplotlib.pyplot as plt
import seaborn as sns
# sns.set_style('darkgrid')
%matplotlib inline
In [3]:
df.word_count.hist(figsize=(10,10))
plt.title('Distribution of Words Per Article')
plt.xlabel('Number of Words in Articles')
plt.ylabel('Number of Articles')
Out[3]:
Text(0,0.5,'Number of Articles')
In [4]:
word_counts = {}
for h in df.main_headline:
    for word in h.split():
        word = word.lower()
        word_counts[word] = word_counts.get(word, 0) + 1
word_counts = pd.DataFrame.from_dict(word_counts, orient='index')
word_counts.columns = ['count']
word_counts = word_counts.sort_values(by='count', ascending=False)
word_counts.head(10)
Out[4]:
count
to 500
in 445
the 443
of 301
a 283
trump 227
for 218
on 185
and 161
is 127
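Most of the top ten are function words such as "to" and "the", which say little about content. A rough way to see past them is to skip a list of stop words while counting, which is similar in spirit to the `stop_words='english'` option used with `CountVectorizer` later on. The stop word set and the example headlines below are hand-picked for illustration only, and `count_content_words` is a made-up helper name.

```python
from collections import Counter

# A small hand-picked stop word list for illustration; real lists are much longer
STOP_WORDS = {'to', 'in', 'the', 'of', 'a', 'for', 'on', 'and', 'is'}

def count_content_words(headlines, stop_words=STOP_WORDS):
    """Count lowercase words across headlines, skipping stop words."""
    counts = Counter()
    for headline in headlines:
        for word in str(headline).lower().split():
            if word not in stop_words:
                counts[word] += 1
    return counts

# Made-up headlines standing in for df.main_headline
example_headlines = ['Trump Heads to the Border', 'A Vote on the Border Plan']
print(count_content_words(example_headlines).most_common(3))
```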
In [5]:
word_counts.head(25).plot(kind='barh', figsize=(12,10))
plt.title('Most Frequent Headline Words',  fontsize=18)
plt.xlabel('Frequency', fontsize=14)
plt.ylabel('Word',  fontsize=14)
plt.xticks( fontsize=14)
plt.yticks( fontsize=14)
Out[5]:
(array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
        17, 18, 19, 20, 21, 22, 23, 24]),
 <a list of 25 Text yticklabel objects>)

Topic Modelling

Brief Background

In order to perform topic modelling on our data we will use two primary tools. The first turns our text into a vector of word frequency counts: each possible word is a feature, and the number of times that word occurs in a document is its value. We can then apply mathematical operations to this numerical representation. The second is a common natural language processing algorithm known as Latent Dirichlet Allocation (LDA). For more technical details, start here.

Count Vectorizer

In [82]:
from sklearn.feature_extraction.text import CountVectorizer
corpus = [
        'This is the first document.',
        'This document is the second document.',
        'And this is the third one.',
        'Is this the first document?',
        ]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
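# get_feature_names() was removed in scikit-learn 1.2; on older versions use it instead
print(vectorizer.get_feature_names_out().tolist())
print(X.toarray())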
['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
[[0 1 1 1 0 0 1 0 1]
 [0 2 0 1 0 1 1 0 1]
 [1 0 0 1 1 0 1 1 1]
 [0 1 1 1 0 0 1 0 1]]
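To make the counting less of a black box, here is a rough pure-Python sketch of the same idea. It assumes the vectorizer's default settings (lowercasing, and keeping only tokens of two or more word characters); `count_vectorize` is a name made up for this example and is not part of scikit-learn.

```python
import re

def count_vectorize(corpus):
    """Return (vocabulary, count matrix) for a list of documents."""
    # Lowercase, then keep tokens of two or more word characters
    tokenized = [re.findall(r'\b\w\w+\b', doc.lower()) for doc in corpus]
    # The vocabulary is every distinct token, in alphabetical order
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    # One row per document, one column per vocabulary word
    matrix = [[tokens.count(word) for word in vocabulary] for tokens in tokenized]
    return vocabulary, matrix

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
vocabulary, matrix = count_vectorize(corpus)
print(vocabulary)
for row in matrix:
    print(row)
```

On this small corpus the vocabulary and counts match the output above.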
In [ ]:
#Installing a new python package on the fly
!pip install lda

LDA: Latent Dirichlet Allocation

https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation

http://jmlr.org/papers/volume3/blei03a/blei03a.pdf

Latent Dirichlet allocation is a probabilistic model for discovering the topics in a collection of documents. It works by viewing documents as mixtures of topics. In turn, topics can be viewed as probability distributions over words. As we'll see, this allows us to model the topics of a corpus and then visualize each topic as a word cloud of its top words. While the mathematics behind LDA is fairly complex and outside the scope of this presentation, you can easily apply this powerful concept using prebuilt tools based on the academic research.
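To make "documents as mixtures of topics" a little more concrete, here is a toy sketch with numbers chosen by hand. Nothing below is fitted from data and all names are illustrative; it only shows how a document's topic weights and each topic's word distribution combine into word probabilities for that document.

```python
# Toy numbers chosen by hand for illustration; a fitted model would estimate these
toy_vocab = ['election', 'senate', 'border', 'migrant']
toy_topics = [
    [0.5, 0.4, 0.05, 0.05],   # a "politics" topic: P(word | topic 0)
    [0.05, 0.05, 0.5, 0.4],   # an "immigration" topic: P(word | topic 1)
]
doc_topic_mix = [0.7, 0.3]    # this document is 70% topic 0 and 30% topic 1

def word_probabilities(topic_mix, topic_word):
    """Combine per-topic word distributions using the document's topic weights."""
    n_words = len(topic_word[0])
    return [sum(weight * topic[j] for weight, topic in zip(topic_mix, topic_word))
            for j in range(n_words)]

probs = word_probabilities(doc_topic_mix, toy_topics)
for word, p in zip(toy_vocab, probs):
    print('{}: {:.3f}'.format(word, p))
```

LDA works in the opposite direction: given only the observed word counts, it estimates both the topic mixtures and the per-topic word distributions.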

In [6]:
from sklearn.feature_extraction.text import CountVectorizer
import lda
import numpy as np

tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2, max_features=10000,
                                stop_words='english')
tf = tf_vectorizer.fit_transform(df.main_headline)

model = lda.LDA(n_topics=6, n_iter=1500, random_state=1)
model.fit(tf)

topic_word = model.topic_word_  # model.components_ also works
# get_feature_names() was removed in scikit-learn 1.2; on older versions use it instead
vocab = tf_vectorizer.get_feature_names_out()
INFO:lda:n_documents: 2010
INFO:lda:vocab_size: 1943
INFO:lda:n_words: 10657
INFO:lda:n_topics: 6
INFO:lda:n_iter: 1500
WARNING:lda:all zero row in document-term matrix found
/Users/matthew.mitchell/anaconda3/lib/python3.6/site-packages/lda/utils.py:55: FutureWarning: Conversion of the second argument of issubdtype from `int` to `np.signedinteger` is deprecated. In future, it will be treated as `np.int64 == np.dtype(int).type`.
  if sparse and not np.issubdtype(doc_word.dtype, int):
INFO:lda:<0> log likelihood: -123726
INFO:lda:<10> log likelihood: -86737
INFO:lda:<20> log likelihood: -85052
INFO:lda:<30> log likelihood: -84339
INFO:lda:<40> log likelihood: -83882
INFO:lda:<50> log likelihood: -83550
INFO:lda:<60> log likelihood: -83251
INFO:lda:<70> log likelihood: -83088
INFO:lda:<80> log likelihood: -83000

...

INFO:lda:<1400> log likelihood: -81692
INFO:lda:<1410> log likelihood: -81674
INFO:lda:<1420> log likelihood: -81672
INFO:lda:<1430> log likelihood: -81723
INFO:lda:<1440> log likelihood: -81790
INFO:lda:<1450> log likelihood: -81680
INFO:lda:<1460> log likelihood: -81663
INFO:lda:<1470> log likelihood: -81623
INFO:lda:<1480> log likelihood: -81612
INFO:lda:<1490> log likelihood: -81633
INFO:lda:<1499> log likelihood: -81516
In [7]:
n_top_words = 10
for i, topic_dist in enumerate(topic_word):
    # argsort is ascending, so slice from the end; note that [:-n:-1] yields n - 1 words (9 here)
    topic_words = np.array(vocab)[np.argsort(topic_dist)][:-(n_top_words):-1]
    print('Topic {}: {}'.format(i, ' '.join(topic_words)))
Topic 0: trump bombs court bomb suspect ex political pipe charged
Topic 1: says trump china caravan migrant border vote mexico eu
Topic 2: new brazil president pm election pittsburgh right political sri
Topic 3: saudi khashoggi election midterm poll briefing vs district elections
Topic 4: trump democrats house race latest senate republicans governor campaign
Topic 5: trump new york debate tax america media white plan
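Note that the slice `[:-(n_top_words):-1]` stops one element short, which is why each topic above lists nine words rather than ten. If you want exactly `n` words, one option is a small helper like the sketch below. `top_words` is a hypothetical name, and the toy weights are made up for illustration.

```python
def top_words(topic_dist, vocab, n):
    """Return the n highest-weighted words for one topic, best first."""
    # Rank word indices by their weight in this topic, largest first
    ranked = sorted(range(len(topic_dist)), key=lambda j: topic_dist[j], reverse=True)
    return [vocab[j] for j in ranked[:n]]

# Toy weights for illustration only
toy_vocab = ['trump', 'court', 'bomb', 'china']
toy_dist = [0.4, 0.1, 0.3, 0.2]
print(top_words(toy_dist, toy_vocab, 2))
# → ['trump', 'bomb']
```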

Visualization

In [9]:
from wordcloud import WordCloud

topic = "new soviet russia national talks party music world minister"
# Generate a word cloud image
wordcloud = WordCloud().generate(topic)

# Display the generated image:
# the matplotlib way:
import matplotlib.pyplot as plt
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")

# lower max_font_size
wordcloud = WordCloud(max_font_size=40).generate(topic)
plt.figure()
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()
In [14]:
sns.set_style(None)
In [10]:
fig, axes = plt.subplots(3,2, figsize=(15,15))
for i, topic_dist in enumerate(topic_word):
    # Top words for this topic (the slice yields n_top_words - 1 words)
    topic_words = list(np.array(vocab)[np.argsort(topic_dist)][:-(n_top_words):-1])
    topic_words = ' '.join(topic_words)
    # Place topic i in the 3x2 grid, filling row by row
    row, col = divmod(i, 2)
    ax = axes[row, col]
    wordcloud = WordCloud().generate(topic_words)
    ax.imshow(wordcloud, interpolation='bilinear')
    ax.set_title('Topic {}'.format(i))
plt.tight_layout()

Summary

Congratulations! We've covered a lot here! We started with HTTP requests, which rely on one of the fundamental protocols underlying the internet that we know and love. From there, we looked at OAuth and saw how to use an access token with an API such as the NY Times Article Search API. Then we made some requests to retrieve information that came back in JSON format. We then transformed this data into a dataframe using the Pandas package and created some initial visualizations with matplotlib. Finally, we modelled the topics in the headlines with LDA and visualized them as word clouds!

Appendix Extensions

Scraping Full Articles

In [25]:
from bs4 import BeautifulSoup
In [33]:
def scrape_full_article_text(url):
    """Return the text of all story paragraphs on an article page as one string."""
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    # The 'story-body-text' class reflects the page markup when this was written
    paragraphs = soup.find_all('p', attrs={'class': 'story-body-text'})
    cleaned_paragraphs = []
    for paragraph in paragraphs:
        # get_text() drops any nested tags, so no manual tag stripping is needed
        cleaned = paragraph.get_text()
        # Drop non-ASCII characters, then decode back to str (encode alone returns bytes)
        cleaned = cleaned.encode('ascii', 'ignore').decode('ascii')
        cleaned_paragraphs.append(cleaned)
    # Join paragraphs with a space so words at paragraph boundaries don't run together
    return ' '.join(cleaned_paragraphs)