Lessons Learned From The Otter
2019-08-22 04:25:00 +0000 - Written by Carl Burks
I’ve been working on a not-so-secret project, Octotter. It isn’t a search engine; it goes back to the roots of the web.
I didn’t want a search engine. I wanted a categorized collection I could browse. Google doesn’t satisfy my wish to virtually walk through a library, look at the shelves, and grab a book. Casually browsing the spines, grabbing a “book,” and thumbing through the pages completes the story. Octotter is a collection of links.
One of the problems with collections is that they go stale. As I’ve looked at others who have gone before, I see the problem: a lack of community. If you are using humans to collect and categorize the links, then you must have… humans! I chatted with one of my long-time friends and coworkers. They suggested I do what all the cool kids are doing and replace the humans.
I’ve identified the tasks the humans solve:
- Finding the content
- Identifying and categorizing the content
- Filtering out the paywalls, the over-advertised, and the low-quality, non-entertaining links
The first of these problems is a basic spider. There isn’t much need to reinvent the wheel: lots of spiders have been written, so building one should be doable, borrowing one would be better, and hooking into a prebuilt cache of crawled sites would be better still.
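A minimal sketch of the spider idea, using only the standard library. The fetch step is omitted; the `page` string and the URLs here are made-up stand-ins for a downloaded document:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag, resolved against a base URL."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links so the frontier holds full URLs.
                    self.links.append(urljoin(self.base_url, value))

page = '<a href="/about">About</a> <a href="https://example.org">Elsewhere</a>'
parser = LinkExtractor("https://example.com/")
parser.feed(page)
print(parser.links)  # ['https://example.com/about', 'https://example.org']
```

A real spider would enqueue those links, fetch each one, and repeat; this only shows the extraction step.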
The next problem is categorizing the content. Google has tools for this; there’s even a web service call for it. The Python libraries, though, either didn’t work or wouldn’t install. More on this later. There are lots of natural language tools for building this sort of thing, and several guides, but once again I’ll say more on that shortly.
The last is filtering out the garbage. Just because the site has what you want doesn’t mean it is usable. I haven’t gotten around to figuring this bit out because I’m still on problem two. If I can’t get the categorizing just right, then the filtering is kind of pointless.
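One way the filtering step might look, as a rough sketch. The keyword lists and threshold below are entirely made up for illustration, not anything tuned or from the project:

```python
# Hypothetical heuristics: phrases that often signal a paywall or an
# over-advertised page. Real filtering would need far better signals.
PAYWALL_HINTS = ("subscribe to continue", "already a subscriber")
AD_HINTS = ("sponsored", "advertisement")

def looks_like_garbage(text, max_ad_hits=3):
    lowered = text.lower()
    if any(hint in lowered for hint in PAYWALL_HINTS):
        return True  # probably paywalled
    ad_hits = sum(lowered.count(hint) for hint in AD_HINTS)
    return ad_hits > max_ad_hits  # probably over-advertised

print(looks_like_garbage("Subscribe to continue reading this article."))  # True
print(looks_like_garbage("A short, pleasant write-up about otters."))     # False
```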
Circling back: I’m playing with Windows again, and Python on Windows has always been a hate/hate sort of thing for me. Why the double hate? One hate for 2.7 and one for 3.x. After trying to install some things, Python, the documentation, and all the slightly out-of-date tutorials failed horribly. SciPy didn’t want to work, and gensim gave DLL errors. I tried cheating: I used the Windows Subsystem for Linux. After that, things started working a little better. But here is the problem: a guide doesn’t say what you need to install to be able to write
from bar import foo
Python lets packages be named differently than what you would install with pip. This is a problem with more than just Python: proper namespaces that match the install names, with version numbers included, are important for replicating a web example that lacks a link to the source or a requirements file. Python has “requirements,” but it isn’t as robust as package.json. Enough whining. I titled this “lessons learned,” so here we go:
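To make the mismatch concrete, here are a few well-known packages whose pip name differs from their import name, exactly the kind of thing a guide’s bare import statement hides:

```python
# pip install name  ->  import name
PIP_TO_IMPORT = {
    "beautifulsoup4": "bs4",
    "scikit-learn": "sklearn",
    "python-dateutil": "dateutil",
    "Pillow": "PIL",
    "PyYAML": "yaml",
}

for pip_name, import_name in PIP_TO_IMPORT.items():
    print(f"pip install {pip_name:16} but: import {import_name}")
```

Pinning these in a requirements file (e.g. `beautifulsoup4==4.8.0`) at least captures the version, even if the import-name gap remains.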
- a corpus is a collection of written texts
- Windows probably isn’t robust enough for Python without the WSL
- import statements should require the name you would use in a package manager to add the dependency, including the version number
- tqdm is a Python package which gives a nifty command-line progress bar
- gensim is a topic modelling library
- PorterStemmer from nltk.stem lets you get stems from words
- Siraj Raval talks about building an abstract
- You can get newsgroup data for training here
- GloVe gives you Global Vectors for Word Representation; see the video by Siraj Raval
- Doc2Vec creates document embeddings
- BOW is a bag of words
- CBOW is a Continuous Bag of Words
- DBOW is a Distributed Bag of Words
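As a from-scratch illustration of the bag-of-words idea from the list above, here is a toy version using only the standard library. The `crude_stem` function is a deliberately crude stand-in for a real stemmer like nltk’s PorterStemmer, and the sample sentence is made up:

```python
from collections import Counter

def crude_stem(word):
    # Toy stand-in for PorterStemmer: chop a few common suffixes.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def bag_of_words(text):
    # A BOW ignores word order entirely and keeps only token counts.
    tokens = [crude_stem(w) for w in text.lower().split()]
    return Counter(tokens)

doc = "otters swimming otters played otter"
print(bag_of_words(doc))  # Counter({'otter': 3, 'swimm': 1, 'play': 1})
```

Those counts are what a topic model like gensim’s would consume; CBOW and DBOW are the order-free variants used inside Word2Vec and Doc2Vec training.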
I also built a classifier which I will probably push up to GitHub, but I’m on other priorities atm.