Carl Burks is a software developer for a global financial institution. With over ten years of experience in technology and software development for financial organizations, and over twenty years of software experience overall, Carl Burks provides articles, musings, and insight into technology issues, software development, and other selected topics.

Continuing the Downloader Project, Part II

2016-10-07T20:09:00.002-07:00

Authors:
Carl Burks

I've been busy, so I have only been adding to this project a little bit here and there, but the rough alpha version is now checked into source control: RedditDL

I explained what I was doing to my wife, and since I am a visual person I made some notecards. She rewrote them to make them readable.


Diagram of Reddit Download Project

The initial flow starts when something kicks off check_posts.py. That could be the Windows Task Scheduler, cron, or you running it manually. It fires a queue message, so if you haven't got a queue set up yet, it won't work. I plan on adding a Dockerfile to the repo later which will do this for you. If you haven't renamed example.config.yaml, then you didn't read the README.md first, and shame on you. After you've supplied the appropriate config values, such as the following (a full sketch follows the list):

  • reddit - user
  • messagequeue - server
  • messagequeue - user
  • messagequeue - pass
  • database - location
  • database - engine
  • fileStore - location
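
For reference, a filled-in config might look something like this. This is only a sketch: the key names mirror the list above, but the actual example.config.yaml in the repo is the authority on what goes where.

```yaml
# Hypothetical sketch, mirroring the keys listed above; defer to the real
# example.config.yaml in the repo for the actual names and structure.
reddit:
  user: your_reddit_username
messagequeue:
  server: localhost
  user: mq_user
  pass: mq_password
database:
  engine: sqlite
  location: ./redditdl.db
fileStore:
  location: ./downloads
```
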
You should then be ready to start firing off listeners. One listens for the signal to start pulling posts from Reddit. Another waits for posts to log to the database; if a post has a new URL, it feeds the next listener. The final listener listens for URLs, downloads them to the file store, and sends a message back. The post listener gets that message and puts the file reference in the database. If everything works, you've got a directory full of downloads.
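
To make the chain concrete, here is a rough sketch of what one of those listeners could look like. I'm assuming a RabbitMQ-style broker spoken to via pika, which the post doesn't actually specify; the queue names, config file name, and the in-memory seen_urls stand-in for the database check are all made up for illustration.

```python
# Hypothetical listener sketch: consumes posts, forwards new URLs onward.
# Assumes RabbitMQ via pika; queue names and helpers are illustrative only.
import json
import pika
import yaml

with open('config.yaml') as f:             # the renamed example.config.yaml
    cfg = yaml.safe_load(f)['messagequeue']

seen_urls = set()                          # stand-in for the real database check

def on_post(ch, method, properties, body):
    """Log the post; if its URL is new, feed it to the next listener's queue."""
    post = json.loads(body)
    if post['url'] not in seen_urls:
        seen_urls.add(post['url'])
        ch.basic_publish(exchange='', routing_key='urls', body=body)

creds = pika.PlainCredentials(cfg['user'], cfg['pass'])
connection = pika.BlockingConnection(
    pika.ConnectionParameters(host=cfg['server'], credentials=creds))
channel = connection.channel()
channel.queue_declare(queue='posts')
channel.queue_declare(queue='urls')
channel.basic_consume(queue='posts', on_message_callback=on_post, auto_ack=True)
channel.start_consuming()
```

Each listener in the chain follows the same shape: consume from one queue, do its one job, publish to the next.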

This has been a fun project: a chance to once again code some Python, play with Docker, play with a message queue that isn't from Windows, and play with YAML, JSON, and the Reddit API.

Where am I going from here?

I might convert it to Python 3 next before adding features and fixing bugs.

Extracting the text of the target URL, keeping the core content, discarding the noise, and storing it in the database.
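
As a sketch of what that extraction step might look like (the project hasn't picked a library, so BeautifulSoup here is just my assumption):

```python
# Naive core-content extraction sketch; a real version would probably want
# a smarter readability-style algorithm than just stripping obvious noise tags.
import requests
from bs4 import BeautifulSoup

def extract_text(url):
    """Fetch the target URL and return its visible text with noise stripped."""
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, 'html.parser')
    for tag in soup(['script', 'style', 'nav', 'header', 'footer']):
        tag.decompose()                    # drop the obvious non-content tags
    return soup.get_text(separator=' ', strip=True)
```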

Adding a queue task for keyword analysis of the content.
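
The keyword analysis could start as simple as a stop-word-filtered frequency count; this sketch (all names mine, not the project's) shows the flavor:

```python
# Simplest-possible keyword analysis sketch: frequency count minus stop words.
import re
from collections import Counter

STOP_WORDS = {'the', 'a', 'an', 'and', 'or', 'of', 'to', 'in', 'is', 'it', 'that'}

def top_keywords(text, n=10):
    """Return the n most common non-stop-words from the extracted text."""
    words = re.findall(r'[a-z]+', text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS and len(w) > 2)
    return counts.most_common(n)
```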

I want to add a Docker image, which might include a web project to show the output files.