Lately I've been trying to fetch a list of specific links from a website or forum so I can update a topic list almost automatically, using only Ruby and a few rubygems. Maybe this wasn't the best idea I ever had, but this scriptlet got me closer to what I actually want to achieve. (I forgot how to paste it here using BBCode or something like that...)
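require 'open-uri'
require 'nokogiri'
results = []
# URI.open needs Ruby 2.5+; plain open stopped accepting URLs in Ruby 3.0
doc = Nokogiri::HTML(URI.open('http://some.website.com/12345678-some-subforum/'))
# Visit every element that carries an href attribute
doc.search('//*[@href]').each do |m|
  href = m[:href]
  # Keep subforum links, skipping "last message" anchors and forum index links
  if href.include?('12345678') && !href.include?('#lastmsg') && !href.include?('forumid=')
    results << "<a href=\"#{href}\"></a>"
  end
end
# uniq rather than uniq!, because uniq! returns nil when there are no duplicates
File.open('href.txt', 'w') { |file| file.puts results.uniq }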
The results are several lines like this one...
<a href="/12345678/98766432-some-weird-topic/"></a>
...but I need to get the actual link name, too, so it looks like this...
<a href="/12345678/98766432-some-weird-topic/">Some Weird Topic</a>
...but I don't know how to get the "Some Weird Topic" string from the doc variable. I guess I should nest another each iterator, but as long as I don't know what value to pass to the search method, I won't be able to get the string...
I wonder if any of you have some experience with this kind of issue...