Python - URL parsing error [BeautifulSoup]


I'm trying to get a list of href links from website pages, but my code is not working properly. The code is appending to urllist when it shouldn't, and it is duplicating href links.

import urllib2
from BeautifulSoup import BeautifulSoup

# Fetch and parse the page (Python 2, BeautifulSoup 3)
response = urllib2.urlopen("http://www.gamefaqs.com")
html = response.read()
soup = BeautifulSoup(html)

# Links containing any of these domains should be skipped
donotprocesslist = ["gamespot.com", "cnet.com", "gamefaqs.com"]

urllist = []

for link in soup.findAll('a'):
    for bad in donotprocesslist:
        if bad not in link['href']:
            urllist.append(link['href'])

print urllist

Example of the erroneous output:

[u'http://cbsiprivacy.custhelp.com/app/answers/detail/a_id/1272/', u'http://cbsiprivacy.custhelp.com/app/answers/detail/a_id/1272/', u'http://www.cbsinteractive.com/terms_of_use.php?tag=ft', u'http://www.cbsinteractive.com/terms_of_use.php?tag=ft', u'http://www.cbsinteractive.com/terms_of_use.php?tag=ft', u'http://m.gamefaqs.com/?mob_on=1', u'http://m.gamefaqs.com/?mob_on=1']

The output above is with the "not" in the if statement. Removing the "not" results in the bad items being stored in the list, like so:

[u'http://membership.gamefaqs.com/1328-4-46.html', u'http://www.gamefaqs.com/user/register.html', u'http://www.gamespot.com/6316274', u'http://www.gamespot.com/6316274', u'http://www.gamespot.com/6316489', u'http://www.gamespot.com/6316489', u'http://www.gamespot.com/6316225', u'http://www.gamespot.com/6316225', u'http://www.gamespot.com/features/index.html', u'http://www.gamespot.com/news/6322016.html', u'http://www.gamespot.com/news/6322019.html', u'http://www.gamespot.com/news/6322017.html', u'http://www.gamespot.com/news/6322010.html', u'http://www.gamespot.com/news/6321996.html', u'http://www.gamespot.com/news/index.html', u'http://www.gamespot.com/features/6314339/index.html', u'http://www.gamespot.com/features/6313939/index.html', u'http://www.gamespot.com/features/6309202/index.html', u'http://www.gamespot.com/features/6320393/index.html', u'http://www.gamespot.com/features/6162248/index.html', u'http://www.gamespot.com/gameguides.html', u'http://www.gamespot.com/downloads/index.html', u'http://www.gamespot.com/news/index.html', u'http://www.gamespot.com/pc/index.html', u'http://www.gamespot.com/xbox360/index.html', u'http://www.gamespot.com/wii/index.html', u'http://www.gamespot.com/ps3/index.html', u'http://www.gamespot.com/psp/index.html', u'http://www.gamespot.com/ds/index.html', u'http://www.gamespot.com/ps2/index.html', u'http://www.gamespot.com/gba/index.html', u'http://www.gamespot.com/mobile/index.html', u'http://www.gamespot.com/cheats.html', u'http://www.gamespot.com/forums/index.html', u'http://www.gamespot.com/', u'http://www.gamefaqs.com/features/help/', u'http://sitemap.gamefaqs.com/', u'http://www.gamefaqs.com/features/aboutus.html', u'http://reviews.cnet.com/music/2001-6450_7-0.html', u'http://reviews.cnet.com/cell_phones/2001-3504_7-0.html', u'http://reviews.cnet.com/digital_cameras/2001-6501_7-0.html', u'http://reviews.cnet.com/notebooks/2001-3121_7-0.html', u'http://reviews.cnet.com/handhelds/2001-3127_7-0.html', u'http://reviews.cnet.com/4521-6531_7-5021436-3.html', u'http://reviews.cnet.com/web_hosting/2001-6540_7-0.html', u'http://clearance.cnet.com', u'http://shopper.cnet.com/4520-5-6276184.html', u'http://www.cnet.com', u'http://www.gamespot.com', u'http://www.gamespot.com/cheats.html', u'http://www.cnet.com/apple-iphone.html', u'http://www.gamespot.com/reviews.html', u'http://reviews.cnet.com/laptops', u'http://download.cnet.com/windows/antivirus-software/', u'http://m.gamefaqs.com/?mob_on=1']

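Your inner loop appends the link once for every entry in donotprocesslist that it does not contain, so a link can be added up to three times and links to blocked domains still get through. Test all entries at once with any() instead. List comprehension FTW: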

[link['href'] for link in soup.findAll('a')
 if not any(bad in link['href'] for bad in donotprocesslist)]
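To see the difference without hitting the network, here is a minimal sketch in Python 3 that contrasts the original nested loop with the any() version. The hrefs list is a hypothetical stand-in for what the page would return, and the names looped and filtered are made up for illustration.

```python
# Hypothetical sample hrefs standing in for the scraped page (assumption).
hrefs = [
    "http://www.cbsinteractive.com/terms_of_use.php?tag=ft",
    "http://m.gamefaqs.com/?mob_on=1",
    "http://www.gamespot.com/cheats.html",
]
donotprocesslist = ["gamespot.com", "cnet.com", "gamefaqs.com"]

# Original nested loop: appends the href once for every blocked
# domain that the href does NOT contain.
looped = []
for href in hrefs:
    for bad in donotprocesslist:
        if bad not in href:
            looped.append(href)

# any() version: keeps an href only if it contains none of the
# blocked domains, and adds it at most once per occurrence.
filtered = [href for href in hrefs
            if not any(bad in href for bad in donotprocesslist)]

print(len(looped))   # number of entries produced by the nested loop
print(filtered)      # entries that pass the any() filter
```

The first href matches no blocked domain, so the nested loop adds it three times; the other two each match one blocked domain and are still added twice.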

And, for readability...

def condition(x):
    return not any((bad in x) for bad in donotprocesslist)

[link['href'] for link in soup.findAll('a') if condition(link['href'])]
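Two things the comprehension does not cover: link['href'] will typically raise a KeyError for an anchor tag that has no href attribute, and a link that really does appear twice on the page is kept twice. Below is a rough sketch of one way to handle both, assuming Python 3 and the standard library's html.parser in place of BeautifulSoup. HrefCollector, allowed_links and sample_html are hypothetical names for illustration, not part of the code above.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collects the href of every <a> tag that actually has one."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:  # skip anchors with a missing or empty href
                self.hrefs.append(href)


def allowed_links(html, blocked):
    """Return unique hrefs, in page order, containing no blocked substring."""
    collector = HrefCollector()
    collector.feed(html)
    seen = set()
    result = []
    for href in collector.hrefs:
        if href in seen or any(bad in href for bad in blocked):
            continue
        seen.add(href)
        result.append(href)
    return result


# Hypothetical markup standing in for the downloaded page (assumption).
sample_html = """
<a name="top">no href here</a>
<a href="http://www.gamespot.com/cheats.html">cheats</a>
<a href="http://www.cbsinteractive.com/terms_of_use.php?tag=ft">terms</a>
<a href="http://www.cbsinteractive.com/terms_of_use.php?tag=ft">terms again</a>
"""
donotprocesslist = ["gamespot.com", "cnet.com", "gamefaqs.com"]

print(allowed_links(sample_html, donotprocesslist))
```

The seen set keeps the first occurrence of each link and preserves page order, which a plain set() over the results would not.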
