Importing data from Twitter with Python and tweepy


Twitter provides an API that lets you download data from the social network. In this post we will use it from Python through the tweepy library.

The aim is to retrieve tweets related to the word ‘NoSQL’ and store them in a file for later analysis.

The first thing to do is register a new Twitter application via the Twitter Application Management page.

Once the application is registered, Twitter gives us the four keys we need to access its API:

consumer_key = 'hRNtRgjzGq4wq3mt3fbuUkQ2c'
consumer_secret = 'yBbXnvNRpm93wvblpG9xhUMFF7w9sgxLfQT8k15Fs3k1RN4pnQ'
access_token_key = '12391902-qHOsgUBIvKuv7DjajXBmdm3SyZH8vgmR3jcpLVnnM'
access_token_secret = '9ViwfNW5FhOLhagaf4qfmDLXfY6qDtGzJ1MmAQM0gN3LK'
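
Hard-coding the keys in a script makes it easy to publish them by accident. Here is a minimal sketch of one alternative, assuming the keys have been exported as environment variables. The variable names (TWITTER_CONSUMER_KEY and so on) and the helper load_credentials are illustrative choices for this example, not something Twitter or tweepy requires.

```python
import os

# Hypothetical environment variable names, chosen for this example only.
REQUIRED_VARS = (
    "TWITTER_CONSUMER_KEY",
    "TWITTER_CONSUMER_SECRET",
    "TWITTER_ACCESS_TOKEN_KEY",
    "TWITTER_ACCESS_TOKEN_SECRET",
)


def load_credentials(env=None):
    """Return the four keys as a tuple, or raise if any of them is missing."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return tuple(env[name] for name in REQUIRED_VARS)
```

The returned tuple can be unpacked into consumer_key, consumer_secret, access_token_key and access_token_secret, the names used in the rest of this post.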

The next step is to import the library and log in to Twitter.

import tweepy

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token_key, access_token_secret)

api = tweepy.API(auth)

Testing the library

To test the library we will retrieve some information about a Twitter user: the name, the description, the number of followers, and so on.

user = api.get_user('NoSQLDigest')
print "Name:", user.screen_name
print "Description:", user.description
print "Followers count:", user.followers_count
print "Friends' count:", user.friends_count
print "Statuses Count [Number of Tweets]: ", user.statuses_count
Name: NoSQLDigest
Description: NoSQL Digest of tweets.
Followers count: 9785
Friends' count: 12
Statuses Count [Number of Tweets]:  668120

Retrieving Tweets via a search term

lookup ='NoSQL'

The following call downloads tweets quickly, but the number returned is capped by Twitter: we ask for 200 and, as the output shows, get only 100 back.

max_tweets = 200
search_results = api.search(q=lookup, lang = 'en', count=max_tweets)

print len(search_results)
100
from prettytable import PrettyTable

table = PrettyTable(["User", "Date", "Text"])
table.align["User"] = "l"
table.align["Text"] = "l"

for tweet in search_results[0:10]:
    table.add_row([tweet.user.screen_name, tweet.created_at, tweet.text[:80]])

print table
User            | Date                | Text
retweetjava     | 2015-10-21 04:00:19 | RT @MusicHackFest: #GoGettersNetwork #CodeTalk Apache Hadoop and NoSQL as Analy
NeuvooPhpCA     | 2015-10-21 04:00:11 | CyberCoders is hiring a #Senior #Backend Developer - Ruby, Python, PHP, Agile, S
GujaratiGuy789  | 2015-10-21 03:57:05 | RT @SoftwareJokes: 3 database admins walked into a NoSQL bar. A little while lat
astosyk         | 2015-10-21 03:53:11 | RT @couchbase: Get rapid ramp-up on #NoSQL application development with lessons
tjmickol        | 2015-10-21 03:40:36 | Listening to Sweet Home Wundermude by Spill & Freunde on #kexp #galvanize #z
MooglieTwitimon | 2015-10-21 03:38:26 | Couchbase CEO on rise of NoSQL - https://t.co/9Zeif7Afw4 https://t.co/yglu7h3WA0
MooglieTwitimon | 2015-10-21 03:38:24 | Couchbase CEO on rise of NoSQL - https://t.co/9Zeif7Afw4 https://t.co/yglu7h3WA0
bph             | 2015-10-21 03:19:39 | RT @couchbase: The 3 mega-trends that define businesses leading the digital econ
VVagias         | 2015-10-21 03:13:36 | RT @geneolot: It’s jut great Tuesday evening, Bambini Dahlonega, Georgia #BigD
retweetjava     | 2015-10-21 03:13:35 | RT @geneolot: It’s jut great Tuesday evening, Bambini Dahlonega, Georgia #BigD

Retrieving a user’s timeline

timeline_results = api.user_timeline(screen_name = 'NoSQLDigest', count = 1000, include_rts = True)
len(timeline_results)
198

Retrieving Tweets via a search term using a cursor

This method uses a cursor, which pages through the results for us, so we are no longer limited to the tweets returned by a single request ;-)

c = tweepy.Cursor(api.search, q= lookup).items()
    
search_results = []
while True:
    try:
        tweet = c.next()
        # Insert into db
        search_results.append(tweet)
    except tweepy.TweepError as e:
        print e
        break

print len(search_results)
{"errors":[{"message":"Rate limit exceeded","code":88}]}
2648

This fails too! In this case the cause is Twitter's rate limit (error code 88, "Rate limit exceeded") :-(
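
The error printed above is Twitter's rate limit (code 88), which resets in 15-minute windows. One common way around it is to pause and then carry on from where the iterator stopped. The sketch below shows the idea without tweepy: RateLimitError and FlakySource are stand-ins invented for this example, and it assumes the real iterator can be resumed after the error is raised. tweepy.API also accepts wait_on_rate_limit=True, which does the waiting for you; check the documentation of the version you have installed.

```python
import time


class RateLimitError(Exception):
    """Stand-in for the error the API client raises when the limit is hit."""


def collect_items(iterator, wait_seconds=15 * 60, max_waits=3, sleep=time.sleep):
    """Drain `iterator`, pausing whenever it reports a rate limit."""
    items = []
    waits = 0
    while True:
        try:
            items.append(next(iterator))
        except StopIteration:
            break  # normal end of the results
        except RateLimitError:
            if waits >= max_waits:
                break  # give up and keep what we have so far
            waits += 1
            sleep(wait_seconds)
    return items


class FlakySource(object):
    """Hypothetical iterator that raises RateLimitError once, then resumes."""

    def __init__(self, values, fail_at):
        self.values = list(values)
        self.fail_at = fail_at
        self.pos = 0
        self.failed = False

    def __iter__(self):
        return self

    def __next__(self):
        if self.pos == self.fail_at and not self.failed:
            self.failed = True
            raise RateLimitError("Rate limit exceeded")
        if self.pos >= len(self.values):
            raise StopIteration
        value = self.values[self.pos]
        self.pos += 1
        return value

    next = __next__  # Python 2 spelling of the same method


# The sleep function is replaced so the demonstration does not really wait.
collected = collect_items(FlakySource([1, 2, 3, 4], fail_at=2), sleep=lambda s: None)
```

With the real cursor, the default sleep would block for 15 minutes each time the limit is reached, so the download takes longer but does not stop early.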

Retrieving a user’s timeline using a cursor

import sys

c = tweepy.Cursor(api.user_timeline,id='NoSQLDigest').items()    
timeline_results = []
while True:
    try:
        tweet = c.next()
        # Insert into db
        timeline_results.append(tweet)
    except:
        print "Error: ", sys.exc_info()[0]
        break

print len(timeline_results)
Error:  <type 'exceptions.StopIteration'>
3136
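
The "Error" printed above is StopIteration, which only signals that the cursor has no more items; it is not a failure. A for loop, or anything else that consumes an iterator, handles that signal by itself. Here is a minimal sketch with itertools.islice, where the helper name take is made up for this example and a plain iterator stands in for the tweepy cursor:

```python
from itertools import islice


def take(iterable, limit=None):
    """Collect items until the iterator is exhausted or `limit` is reached."""
    return list(islice(iterable, limit))


# A plain iterator stands in for tweepy.Cursor(...).items() here.
fake_cursor = iter(["tweet 1", "tweet 2", "tweet 3"])
first_two = take(fake_cursor, 2)  # stops after two items
rest = take(fake_cursor)          # drains whatever is left
```

Written this way, a bare except is no longer needed, so real errors are not hidden behind the normal end of the results.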

Visualizing the content of a Tweet

Twitter returns data in JSON format. Let’s see what a tweet looks like:

import pprintpp

tweet = search_results[0]
pprintpp.pprint(tweet._json)
{
    u'contributors': None,
    u'coordinates': None,
    u'created_at': u'Wed Oct 21 04:00:19 +0000 2015',
    u'entities': {
        u'hashtags': [
            {u'indices': [19, 36], u'text': u'GoGettersNetwork'},
            {u'indices': [37, 46], u'text': u'CodeTalk'},
            {u'indices': [131, 135], u'text': u'IoT'},
            {u'indices': [136, 140], u'text': u'Java'},
            {u'indices': [139, 140], u'text': u'API'},
            {u'indices': [139, 140], u'text': u'Linux'},
        ],
        u'symbols': [],
        u'urls': [
            {
                u'display_url': u'bit.ly/1FAQ7B5',
                u'expanded_url': u'http://bit.ly/1FAQ7B5',
                u'indices': [107, 130],
                u'url': u'https://t.co/z2lcnaHeov',
            },
        ],
        u'user_mentions': [
            {
                u'id': 3243621325,
                u'id_str': u'3243621325',
                u'indices': [3, 17],
                u'name': u'Hackathon News',
                u'screen_name': u'MusicHackFest',
            },
        ],
    },
    u'favorite_count': 0,
    u'favorited': False,
    u'geo': None,
    u'id': 656681395348217856,
    u'id_str': u'656681395348217856',
    u'in_reply_to_screen_name': None,
    u'in_reply_to_status_id': None,
    u'in_reply_to_status_id_str': None,
    u'in_reply_to_user_id': None,
    u'in_reply_to_user_id_str': None,
    u'is_quote_status': False,
    u'lang': u'en',
    u'metadata': {u'iso_language_code': u'en', u'result_type': u'recent'},
    u'place': None,
    u'possibly_sensitive': False,
    u'retweet_count': 1,
    u'retweeted': False,
    u'retweeted_status': {
        u'contributors': None,
        u'coordinates': None,
        u'created_at': u'Tue Oct 20 22:37:58 +0000 2015',
        u'entities': {
            u'hashtags': [
                {u'indices': [0, 17], u'text': u'GoGettersNetwork'},
                {u'indices': [18, 27], u'text': u'CodeTalk'},
                {u'indices': [112, 116], u'text': u'IoT'},
                {u'indices': [117, 122], u'text': u'Java'},
                {u'indices': [123, 127], u'text': u'API'},
                {u'indices': [128, 134], u'text': u'Linux'},
            ],
            u'symbols': [],
            u'urls': [
                {
                    u'display_url': u'bit.ly/1FAQ7B5',
                    u'expanded_url': u'http://bit.ly/1FAQ7B5',
                    u'indices': [88, 111],
                    u'url': u'https://t.co/z2lcnaHeov',
                },
            ],
            u'user_mentions': [],
        },
        u'favorite_count': 0,
        u'favorited': False,
        u'geo': None,
        u'id': 656600272781877248,
        u'id_str': u'656600272781877248',
        u'in_reply_to_screen_name': None,
        u'in_reply_to_status_id': None,
        u'in_reply_to_status_id_str': None,
        u'in_reply_to_user_id': None,
        u'in_reply_to_user_id_str': None,
        u'is_quote_status': False,
        u'lang': u'en',
        u'metadata': {u'iso_language_code': u'en', u'result_type': u'recent'},
        u'place': None,
        u'possibly_sensitive': False,
        u'retweet_count': 1,
        u'retweeted': False,
        u'source': u'<a href="http://ifttt.com" rel="nofollow">IFTTT</a>',
        u'text': u'#GoGettersNetwork #CodeTalk  Apache Hadoop and NoSQL as Analysis Engines for IoT Data ▸ https://t.co/z2lcnaHeov #IoT #Java #API #Linux …',
        u'truncated': False,
        u'user': {
            u'contributors_enabled': False,
            u'created_at': u'Fri Jun 12 20:02:51 +0000 2015',
            u'default_profile': False,
            u'default_profile_image': False,
            u'description': u'Music + Technology + Hackathons = @MusicHackFest',
            u'entities': {u'description': {u'urls': []}},
            u'favourites_count': 105,
            u'follow_request_sent': False,
            u'followers_count': 2143,
            u'following': False,
            u'friends_count': 1388,
            u'geo_enabled': False,
            u'has_extended_profile': False,
            u'id': 3243621325,
            u'id_str': u'3243621325',
            u'is_translation_enabled': False,
            u'is_translator': False,
            u'lang': u'en',
            u'listed_count': 2879,
            u'location': u'Los Angeles, CA',
            u'name': u'Hackathon News',
            u'notifications': False,
            u'profile_background_color': u'000000',
            u'profile_background_image_url': u'http://abs.twimg.com/images/themes/theme1/bg.png',
            u'profile_background_image_url_https': u'https://abs.twimg.com/images/themes/theme1/bg.png',
            u'profile_background_tile': False,
            u'profile_banner_url': u'https://pbs.twimg.com/profile_banners/3243621325/1434140299',
            u'profile_image_url': u'http://pbs.twimg.com/profile_images/609454174841876481/GMtjuot9_normal.jpg',
            u'profile_image_url_https': u'https://pbs.twimg.com/profile_images/609454174841876481/GMtjuot9_normal.jpg',
            u'profile_link_color': u'ABB8C2',
            u'profile_sidebar_border_color': u'000000',
            u'profile_sidebar_fill_color': u'000000',
            u'profile_text_color': u'000000',
            u'profile_use_background_image': False,
            u'protected': False,
            u'screen_name': u'MusicHackFest',
            u'statuses_count': 301851,
            u'time_zone': u'Pacific Time (US & Canada)',
            u'url': None,
            u'utc_offset': -25200,
            u'verified': False,
        },
    },
    u'source': u'<a href="http://not.yet/" rel="nofollow">final one kk</a>',
    u'text': u'RT @MusicHackFest: #GoGettersNetwork #CodeTalk  Apache Hadoop and NoSQL as Analysis Engines for IoT Data ▸ https://t.co/z2lcnaHeov #IoT #Ja…',
    u'truncated': False,
    u'user': {
        u'contributors_enabled': False,
        u'created_at': u'Thu Dec 25 02:07:45 +0000 2014',
        u'default_profile': False,
        u'default_profile_image': False,
        u'description': u"Hey, I retweet #Java related tweets. Follow us and maybe you'll learn something new! Questions/concerns? Contact @jdf221 or @Jordanb844",
        u'entities': {u'description': {u'urls': []}},
        u'favourites_count': 62458,
        u'follow_request_sent': False,
        u'followers_count': 2819,
        u'following': False,
        u'friends_count': 19,
        u'geo_enabled': False,
        u'has_extended_profile': False,
        u'id': 2942356560,
        u'id_str': u'2942356560',
        u'is_translation_enabled': False,
        u'is_translator': False,
        u'lang': u'en',
        u'listed_count': 2696,
        u'location': u'',
        u'name': u'Java',
        u'notifications': False,
        u'profile_background_color': u'022330',
        u'profile_background_image_url': u'http://abs.twimg.com/images/themes/theme15/bg.png',
        u'profile_background_image_url_https': u'https://abs.twimg.com/images/themes/theme15/bg.png',
        u'profile_background_tile': False,
        u'profile_image_url': u'http://pbs.twimg.com/profile_images/547942714675712000/Kr9dDPXJ_normal.png',
        u'profile_image_url_https': u'https://pbs.twimg.com/profile_images/547942714675712000/Kr9dDPXJ_normal.png',
        u'profile_link_color': u'0084B4',
        u'profile_sidebar_border_color': u'A8C7F7',
        u'profile_sidebar_fill_color': u'C0DFEC',
        u'profile_text_color': u'333333',
        u'profile_use_background_image': True,
        u'protected': False,
        u'screen_name': u'retweetjava',
        u'statuses_count': 164272,
        u'time_zone': None,
        u'url': None,
        u'utc_offset': None,
        u'verified': False,
    },
}
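
In the output above the retweet's own text is cut off ("#Ja…"), while the original tweet nested under retweeted_status keeps the whole text and all six hashtags with their proper indices. Here is a small sketch of reading these nested fields from the dictionary. The helper names original_text and hashtags are made up for this example, and the sample is hand-made with the same shape as the real output.

```python
def original_text(status):
    """Return the text of the original tweet when `status` is a retweet."""
    source = status.get("retweeted_status", status)
    return source["text"]


def hashtags(status):
    """Return the hashtag texts, taken from the original tweet for retweets."""
    source = status.get("retweeted_status", status)
    return [tag["text"] for tag in source["entities"]["hashtags"]]


# Tiny hand-made example, not real Twitter data.
sample = {
    "text": "RT @someone: hello #No...",
    "entities": {"hashtags": []},
    "retweeted_status": {
        "text": "hello #NoSQL",
        "entities": {"hashtags": [{"indices": [6, 12], "text": "NoSQL"}]},
    },
}
full = original_text(sample)
tags = hashtags(sample)
```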

Parsing the result

The following functions simplify the information we keep on file, because Twitter provides far more than we need.

def parse_user(usr):
    user = {}  
    user["created_at"] = usr['created_at']
    user["description"] = usr['description']
    user["favourites_count"] = usr['favourites_count']
    user["followers_count"] = usr['followers_count']
    user["friends_count"] = usr['friends_count']
    user["geo_enabled"] = usr['geo_enabled']
    user["_id"] = usr['id']
    user["id_str"] =usr['id_str']
    user["name"] = usr['name']
    user["screen_name"] = usr['screen_name']
    user["statuses_count"] = usr['statuses_count']
    user["profile_image_url"] = usr['profile_image_url']
    if usr['time_zone'] is not None:
        user["time_zone"] = usr['time_zone']

    return user

def parse_tweet(t):
    tweet = {}
    tweet['created_at'] = t['created_at']
    #for ht in tweet.entities.hashtags:
    #    print ht.text

    tweet['entities'] = []
    for k in t['entities']['hashtags']:
        tweet['entities'].append(k['text'])
  
    tweet['user_mentions'] = []
    for k in t['entities']['user_mentions']:
        k.pop("indices", None)
        tweet['user_mentions'].append(k)

    tweet['favorite_count'] =  t['favorite_count']

    if t['geo'] is not None:
        tweet['geo'] = t['geo']

    tweet['_id'] = t['id']
    tweet['id_str'] = t['id_str']  

    tweet['lang'] = t['lang']
    tweet['retweet_count'] = t['retweet_count']
    tweet['source'] = t['source']
    tweet['text'] = t['text']
    tweet['user'] = parse_user(t['user'])

    if 'retweeted_status' in t:
        tweet['retweeted_status'] = parse_tweet(t['retweeted_status'])
        
    return tweet

Next we parse the content we downloaded earlier:

tweets = []

for tweet in search_results:
    tweets.append(parse_tweet(tweet._json))
    

Now the content is still in JSON format, but much simpler:

pprintpp.pprint(tweets[0])
{
    '_id': 656681395348217856,
    'created_at': u'Wed Oct 21 04:00:19 +0000 2015',
    'entities': [
        u'GoGettersNetwork',
        u'CodeTalk',
        u'IoT',
        u'Java',
        u'API',
        u'Linux',
    ],
    'favorite_count': 0,
    'id_str': u'656681395348217856',
    'lang': u'en',
    'retweet_count': 1,
    'retweeted_status': {
        '_id': 656600272781877248,
        'created_at': u'Tue Oct 20 22:37:58 +0000 2015',
        'entities': [
            u'GoGettersNetwork',
            u'CodeTalk',
            u'IoT',
            u'Java',
            u'API',
            u'Linux',
        ],
        'favorite_count': 0,
        'id_str': u'656600272781877248',
        'lang': u'en',
        'retweet_count': 1,
        'source': u'<a href="http://ifttt.com" rel="nofollow">IFTTT</a>',
        'text': u'#GoGettersNetwork #CodeTalk  Apache Hadoop and NoSQL as Analysis Engines for IoT Data ▸ https://t.co/z2lcnaHeov #IoT #Java #API #Linux …',
        'user': {
            '_id': 3243621325,
            'created_at': u'Fri Jun 12 20:02:51 +0000 2015',
            'description': u'Music + Technology + Hackathons = @MusicHackFest',
            'favourites_count': 105,
            'followers_count': 2143,
            'friends_count': 1388,
            'geo_enabled': False,
            'id_str': u'3243621325',
            'name': u'Hackathon News',
            'profile_image_url': u'http://pbs.twimg.com/profile_images/609454174841876481/GMtjuot9_normal.jpg',
            'screen_name': u'MusicHackFest',
            'statuses_count': 301851,
            'time_zone': u'Pacific Time (US & Canada)',
        },
        'user_mentions': [],
    },
    'source': u'<a href="http://not.yet/" rel="nofollow">final one kk</a>',
    'text': u'RT @MusicHackFest: #GoGettersNetwork #CodeTalk  Apache Hadoop and NoSQL as Analysis Engines for IoT Data ▸ https://t.co/z2lcnaHeov #IoT #Ja…',
    'user': {
        '_id': 2942356560,
        'created_at': u'Thu Dec 25 02:07:45 +0000 2014',
        'description': u"Hey, I retweet #Java related tweets. Follow us and maybe you'll learn something new! Questions/concerns? Contact @jdf221 or @Jordanb844",
        'favourites_count': 62458,
        'followers_count': 2819,
        'friends_count': 19,
        'geo_enabled': False,
        'id_str': u'2942356560',
        'name': u'Java',
        'profile_image_url': u'http://pbs.twimg.com/profile_images/547942714675712000/Kr9dDPXJ_normal.png',
        'screen_name': u'retweetjava',
        'statuses_count': 164272,
    },
    'user_mentions': [
        {
            u'id': 3243621325,
            u'id_str': u'3243621325',
            u'name': u'Hackathon News',
            u'screen_name': u'MusicHackFest',
        },
    ],
}
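
The created_at values are still plain strings, which is awkward for later analysis by date. Here is a minimal sketch of turning them into datetime objects. The helper name parse_created_at is made up for this example; it assumes the offset is always +0000, as in the output above, and that the program runs with English day and month names (the default C locale).

```python
from datetime import datetime

# Format of the created_at strings shown above. The "+0000" offset is matched
# literally, which assumes the timestamps are always given in UTC.
CREATED_AT_FORMAT = "%a %b %d %H:%M:%S +0000 %Y"


def parse_created_at(value):
    """Convert a created_at string into a naive datetime (UTC)."""
    return datetime.strptime(value, CREATED_AT_FORMAT)


created = parse_created_at("Wed Oct 21 04:00:19 +0000 2015")
```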

Saving the data in a file

Finally, we write the downloaded data to a file so that we can analyze it later.

import json
    
with open('./data/tweets.json',"w") as file:
    for t in tweets:
        r = json.dumps(t)
        file.write(r)
        file.write("\n")
print "Number of tweets ... ", len(tweets)
Number of tweets ...  3136

We do the same with the user's timeline:

tweets = []
for tweet in timeline_results:
    tweets.append(parse_tweet(tweet._json))
    
with open('./data/timeline.json',"w") as file:
    for t in tweets:
        r = json.dumps(t)
        file.write(r)
        file.write("\n")
        
print "Number of tweets ... ", len(tweets)
Number of tweets ...  3136
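
Because each line of these files holds one JSON document, they can be read back line by line for the later analysis. Here is a minimal round-trip sketch; the helper names write_json_lines and read_json_lines and the temporary path are choices made for this example.

```python
import json
import os
import tempfile


def write_json_lines(records, path):
    """Write one JSON document per line, as done above."""
    with open(path, "w") as handle:
        for record in records:
            handle.write(json.dumps(record))
            handle.write("\n")


def read_json_lines(path):
    """Read the file back into a list of dictionaries, skipping blank lines."""
    with open(path) as handle:
        return [json.loads(line) for line in handle if line.strip()]


# Round trip with two hand-made records in a temporary directory.
records = [{"_id": 1, "text": "first"}, {"_id": 2, "text": "second"}]
path = os.path.join(tempfile.mkdtemp(), "tweets.json")
write_json_lines(records, path)
loaded = read_json_lines(path)
```

Note that open() does not create missing directories, so the ./data directory used above has to exist before the files are written.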

Hope this helps!
