There is such a thing as a Stupid Question.
I was going to post a complete how-to below, but I've realized that Mongo is junk: it has zero write durability out of the box. So we're going to remap this in Neo4j. (Maybe.)
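To be fair, the "zero write durability" gripe is about the driver's fire-and-forget default, not about acknowledged writes getting lost. A minimal sketch of opting in, assuming the 1.x mongo gem and its :safe option:

require 'mongo'

# Acknowledge every write on this connection (assumes the 1.x driver's :safe option).
conn = Mongo::Connection.new("localhost", 27017, :safe => true)
coll = conn.db("stackoverflow").collection("question-summary")

# Or opt in per operation instead:
coll.insert({ "_id" => "example", :title => "hello" }, :safe => true)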
Mongo is great if you know how to use it. I, as it turns out, do not. Hackers never learn things that aren't intuitive, so I'll break the laws of physics if I have to, to make this thing awesome. We'll do a small Mongo project later on.
Here’s what I’ve written before:
If you're going to learn one trick, it's how to ask. Ask correctly and you get smarter faster than by doing everything yourself. Leave your brain for novel ideas: things that no one else has figured out yet are the only things worth thinking about.
If it’s been done before, don’t re-invent the wheel. It’s dumb. Ask, learn, move on.
For programmers, the best site to go ask questions is Stack Overflow. I've made the mistake of not learning how to ask. Let's not repeat that, yeah?
To solve this, we hack!
1. First, we build a database of all the questions on Stack Overflow. Going through the individual questions themselves would be time-consuming, but Stack made a nice summary page here, which we're going to consume. I've made a dirty "works today" hack, because I'm most likely only going to do this once in my life.
require 'mongo'
require 'mechanize'
require 'digest/md5'

def stack_hash(page) # Dirty hack - not robust. Works today; YMMV if they change stuff.
  page.search(".question-summary").collect do |x|
    url = x.search("h3 a")[0].attributes["href"].value
    {
      "_id"       => Digest::MD5.hexdigest(url[/.*\d*\//]), # Base URL won't change, so use it as the id to prevent duplicates in Mongo.
      :title      => x.search("h3").text,
      :views      => x.search(".views")[0].attributes["title"].value.delete(" views"), # String#delete strips the characters " views", leaving digits (and commas).
      :vote_count => x.search(".vote-count-post").text,
      :answers    => x.search(".status strong").text,
      :tags       => x.search(".tags")[0].attributes["class"].value.gsub(/t\-|tags/, ""), # e.g. "tags t-ruby t-mongodb" -> " ruby mongodb"
      :url        => url,
    }
  end
end

agent = Mechanize.new { |a| a.user_agent_alias = 'Mac Safari' }
db    = Mongo::Connection.new.db("stackoverflow") # Mongo database.
coll  = db.collection("question-summary")

urls = (837..54126).collect { |x| "http://stackoverflow.com/questions?page=#{x}&sort=votes&pagesize=50" }

urls.each do |url|
  page = agent.get(url) # sauce
  coll.insert(stack_hash(page))
  File.open(ENV['HOME'] + "/log-stackoverflow.txt", 'a') { |f| f.puts("database size: #{coll.count}, on completed: #{url}") }
  sleep(rand(1.0..4.0)) # I'm a human, lawl ;)
end
The problem with this script is that it walks through all 54,126 pages one at a time. It'll take a while to grab everything, and it's a bit overkill for what we're trying to solve: how to phrase (ask) a question.
There's a lot to learn here. Next time we'll thread the requests with peach to make things faster, and eat the server on their end.
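As a teaser, here's a rough sketch of what the threaded version might look like. Assumptions: the peach gem's Enumerable#peach, the urls array and stack_hash method from above, and a fresh Mechanize agent per request since Mechanize isn't thread-safe.

require 'peach'     # assumed: the peach gem, which adds Enumerable#peach
require 'mechanize'
require 'mongo'

db   = Mongo::Connection.new("localhost", 27017, :pool_size => 4).db("stackoverflow")
coll = db.collection("question-summary")

urls.peach(4) do |url| # 4 worker threads; tune to taste
  agent = Mechanize.new { |a| a.user_agent_alias = 'Mac Safari' } # one agent per request
  coll.insert(stack_hash(agent.get(url)))
  sleep(rand(1.0..4.0))
end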
All we need is the top 5% and the bottom 5%, to see which words the best and the worst questions include.
I'm scraping everything because I'm having a little fun, but mostly because I want to see how Mongo scales.
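Once the scrape is done, the word analysis itself is small. A rough sketch against the collection above; vote_count is stored as a string, so it's cast client-side here, which is memory-hungry but fine for a first pass:

require 'mongo'

coll = Mongo::Connection.new.db("stackoverflow").collection("question-summary")

# Pull only titles and votes; cast the votes since they were stored as strings.
docs = coll.find({}, :fields => ["title", "vote_count"]).map do |d|
  { :title => d["title"], :votes => d["vote_count"].to_i }
end

sorted      = docs.sort_by { |d| -d[:votes] }
cut         = (sorted.size * 0.05).ceil
top, bottom = sorted.first(cut), sorted.last(cut)

def word_tally(docs)
  docs.each_with_object(Hash.new(0)) do |d, tally|
    d[:title].downcase.scan(/[a-z']+/).each { |w| tally[w] += 1 }
  end
end

puts word_tally(top).sort_by { |_, n| -n }.first(20).inspect
puts word_tally(bottom).sort_by { |_, n| -n }.first(20).inspect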
Estimating the size of our database: 54,126 pages at 50 records per page comes to roughly 2.7 million entries. It should be interesting to see how fast Mongo performs when I run non-indexed queries on light hardware (a 2.16 GHz Intel Core 2 Duo with 4 GB of DDR2 RAM).
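A quick way to poke at that once the data is in: time a query that has to scan the whole collection, and check explain. A sketch, assuming the same collection and the 1.x driver:

require 'mongo'
require 'benchmark'

coll = Mongo::Connection.new.db("stackoverflow").collection("question-summary")

query = { :tags => /mongodb/ } # regex against the tags string: full collection scan, no index

puts Benchmark.realtime { coll.find(query).count }
puts coll.find(query).explain.inspect # confirms nothing but a table scan

# An index (coll.create_index("tags")) would only help an anchored regex like /^mongodb/ or an exact match.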