At 7:35 PM

Twitter incremental backup in YAML format - python, HTML get and parse

I use Twitter primarily as a bookmarking service.

I know that Twitter is not a bookmarking service and that there are specialized services for such needs. In my case, it started as an experiment, but over time I found this approach very convenient.

I wrote a script that lets me incrementally back up all my tweets to a textual file in YAML format, by parsing Twitter HTML pages. The explanation and the script follow …

My case

So this is my case:

  • I read a lot
  • Google Reader
  • HTC Desire smartphone
  • everything I find interesting is sent to Twitter using the Android Peep Twitter application
  • tweets are tagged by prefixing keywords with a hash (#)
  • tweets are backed up to a textual file kept in my Dropbox folder
  • backup should be incremental
  • backup should be readable by both humans and machines - YAML is chosen
  • (future) all tweets/bookmarks will be publicly available on my site with advanced browsing features
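The hash-prefix tagging convention above is easy to work with programmatically. A minimal illustration (extract_tags is a hypothetical helper, not part of the script):

```python
import re

def extract_tags(tweet_text):
    """Collect hashtag keywords (without the leading '#') from a tweet."""
    return re.findall(r"#(\w+)", tweet_text)

# e.g. extract_tags("python.mirocommunity.org #Python #Miro Community")
# yields the keywords "Python" and "Miro"
```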

The script

Based on a script by Scott Carpenter, I made a Python script that doesn't use the Twitter API; instead it reads Twitter HTML pages, parses them and collects tweets. The backup file, on the other hand, is in YAML format: it is parsed first, and then only the collected tweets that don't already exist in it are written to the top of the file. The script needs the BeautifulSoup HTML parsing package and pyyaml, so if you don't have them yet:

pip install beautifulsoup
pip install pyyaml

Example of usage


> dbox twitter-bak trebor74hr twitter-trebor74hr-backup.yaml
Reading http://twitter.com/trebor74hr, backup in twitter-trebor74hr-backup.yaml
started at 2011-09-24 18:30:13.937000
  1. page - read and parse
      20/ 20 tweets saved/processed. Waiting 2 seconds before fetching next page...
  2. page - read and parse
      40/ 40 tweets saved/processed. Waiting 2 seconds before fetching next page...
  3. page - read and parse
      60/ 60 tweets saved/processed. Waiting 2 seconds before fetching next page...
  4. page - read and parse
      80/ 80 tweets saved/processed. Waiting 2 seconds before fetching next page...
  5. page - read and parse
      89/100 tweets saved/processed. Waiting 2 seconds before fetching next page...
  6. page - read and parse
      No new tweets found, quit iteration ...
89/120 tweets saved in twitter-trebor74hr-backup.yaml

Try again, no new tweets:

> dbox twitter-bak trebor74hr twitter-trebor74hr-backup.yaml
Reading http://twitter.com/trebor74hr, backup in twitter-trebor74hr-backup.yaml,
started at 2011-09-24 18:30:49.906000
  1. page - read and parse
     No new tweets found, quit iteration ...
No new tweets found in 20 tweets analysed, file twitter-trebor74hr-backup.yaml

The result

The result is in twitter-trebor74hr-backup.yaml:

- content : "RT @williamtincup RT Don’t Be 'That Guy' as a Manager
            bit.ly/qJL8nx [http://t.co/fxy2IrIe] @greatleadership"
  date_time: "Fri Sep 23 20:17:32 +0000 2011"
  status_id: "117331787797102592"
  url: "http://twitter.com/#!/trebor74hr/status/117331787797102592"

- content : "python.mirocommunity.org [http://t.co/WYu9ygAb] #Python
            #Miro  Community - All Python #Video , All the Time"
  date_time: "Fri Sep 23 20:12:34 +0000 2011"
  status_id: "117330534635540481"
  url: "http://twitter.com/#!/trebor74hr/status/117330534635540481"

...


Again, the script is …

The script can be found here.

