Lessons Learned From The Otter

2019-08-22 04:25:00 +0000 - Written by Carl Burks

I’ve been working on a not-so-secret project, Octotter. It isn’t a search engine; it goes back to the roots of the web.

I didn’t want a search engine. I wanted a categorized collection I could browse. Google doesn’t satisfy my wish to virtually walk through a library, look at the shelves, and grab a book. Casually browsing the spines, grabbing a “book”, and thumbing through the pages completes the story. Octotter is a collection of links.

One of the problems with collections is that they go stale. Looking at others who have gone before, I see the problem: a lack of community. If you are using humans to collect the links and categorize the links, then you must have… Humans! I chatted with one of my long-time friends and coworkers. They suggested I do what all the cool kids are doing and replace the humans.

I’ve identified the tasks the humans solve:

The first of these problems is a basic spider. There isn’t much need to reinvent the wheel. Lots of spiders have been written, so building one should be doable. Borrowing one would be better. Hooking into a prebuilt cache of sites would be even better.
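To give a sense of how small a basic spider can be, here is a rough sketch using only the Python standard library. It is not Octotter’s code. The names `LinkCollector`, `extract_links`, and `crawl` are made up for illustration, and `fetch` is assumed to be any callable that takes a URL and returns the page’s HTML as a string, or None on failure.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links and drop any "#fragment" part.
                absolute, _fragment = urldefrag(urljoin(self.base_url, value))
                is_web_link = absolute.startswith(("http://", "https://"))
                if is_web_link and absolute not in self.links:
                    self.links.append(absolute)


def extract_links(html, base_url):
    """Return the unique http(s) links found in an HTML string."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    return parser.links


def crawl(start_url, fetch, max_pages=10):
    """Breadth-first crawl; `fetch` is a hypothetical url -> html callable."""
    seen = {start_url}
    queue = deque([start_url])
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue  # skip pages that could not be fetched
        links = extract_links(html, url)
        pages[url] = links
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return pages
```

Passing `fetch` in as a parameter keeps the crawl logic separate from the network, so the same loop could sit on top of a live HTTP client or a prebuilt cache of sites. A real spider would also want to respect robots.txt and rate limits, which this sketch leaves out.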

The next problem is categorizing the content. Google has tools for this. There is even a web service call for it. The Python libraries didn’t work or wouldn’t install. More on this later. There are lots of natural language tools for building this sort of thing, and several guides, but once again I will talk more on that shortly.
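For a feel of what categorizing involves, here is a toy multinomial naive Bayes classifier in plain Python. It is a sketch of one common technique, not Google’s service and not the classifier mentioned in the update at the end of this post. The class name, the method names, and the two training sentences in the usage lines are all made up for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into runs of letters."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayesCategorizer:
    """Toy multinomial naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # category -> word -> count
        self.doc_counts = Counter()              # category -> training docs
        self.vocabulary = set()

    def train(self, text, category):
        words = tokenize(text)
        self.word_counts[category].update(words)
        self.doc_counts[category] += 1
        self.vocabulary.update(words)

    def categorize(self, text):
        """Return the most likely category, or None if nothing was trained."""
        words = tokenize(text)
        total_docs = sum(self.doc_counts.values())
        best_category, best_score = None, float("-inf")
        for category, doc_count in self.doc_counts.items():
            counts = self.word_counts[category]
            total_words = sum(counts.values())
            # Log prior plus the sum of smoothed log likelihoods per word.
            score = math.log(doc_count / total_docs)
            for word in words:
                score += math.log(
                    (counts[word] + 1) / (total_words + len(self.vocabulary))
                )
            if score > best_score:
                best_category, best_score = category, score
        return best_category


# Hypothetical usage with made-up training text.
categorizer = NaiveBayesCategorizer()
categorizer.train("python pip install package module", "programming")
categorizer.train("otter river fish swim fur", "animals")
print(categorizer.categorize("how to install a python package"))
```

With only two training sentences this is a demonstration of the mechanics and nothing more. A usable categorizer would need a real labeled set of pages and probably a library such as the natural language tools mentioned above.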

The last is filtering out the garbage. Just because the site has what you want doesn’t mean it is usable. I haven’t gotten to figure this bit out because I’m still on problem two. If I can’t get the categorizing just right, then the filtering is kind of pointless.

Circling back, I’m playing with Windows again, and Python on Windows has always been a hate/hate sort of thing for me. Why the double hate? That is one for 2.7 and one for 3.x. I tried to install some things, and Python, the documentation, and all the slightly out-of-date tutorials failed horribly. SciPy didn’t want to work, and gensim gave DLL errors. So I cheated and used the Windows Subsystem for Linux. After that, things started working a little better. Here is the problem. A guide doesn’t say what you need to import.

from bar import foo

Python lets you name things differently than what you would install with pip. This is a problem with more than just Python. Proper namespaces that match the installs and carry version numbers in the name are important for replicating a web example that has no link to the source or requirements. Python has “requirements”, but it isn’t as robust as package.json. Enough whining. I titled this “lessons learned”, so here we go:

Update 9/4/2019

I also built a classifier, which I will probably push up to GitHub, but I’m on other priorities atm.