python - How to print certain lines and certain parts of a line in a text file


I'm trying to extract information from an HTML file (bookmarks exported from Google Chrome).

It contains text in the following format, and I want to extract the website addresses that appear after <dt><a href= and before add_date=.

I'm considering using sed, awk, or Python; answers in any of the three are welcome.

So far I know how to print the lines containing <dt><a href= with awk:

awk '/<dt><a href="*"/' favorit.html 

I suppose I should combine this with sed.

<!doctype netscape-bookmark-file-1> <!-- automatically generated file.      read , overwritten.      not edit! --> <meta http-equiv="content-type" content="text/html; charset=utf-8"> <title>bookmarks</title> <h1>bookmarks</h1> <dl><p>     <dt><h3 add_date="0" last_modified="1309451494" personal_toolbar_folder="true">barre de favoris</h3>     <dl><p>         <dt><h3 add_date="1281455379" last_modified="1309422816">brain</h3>         <dl><p>             <dt><a href="http://gmazars.info/conf/index.html" add_date="1281455379" icon="data:image/png;base64,ivborw0kggoaaaansuheugaaabaaaaaqcayaaaaf8/9haaabvkleqvq4jcwsovcjqqyhp3i4vm/ojiz+srthva7u2nqulit2/4haes5xm7wsh/suntuz+k5wb6atjxmioiyvk1+slzn7cxv7hy/y+vespxrwzqgqiejxjbullikoavev3gfwdw9yy0g5y63l1/0917mzqsrpuzsafk1z4bebev0dsfwuy8epc+y9p65rriscm6rk3vvtdlsveeij/h58lo8xrqqrk0iioodo2/zu4nimtesuydnz3hvuvcvehl7vgwm8hei952y2w1j7ymo9zte0aoy9r1w5pgr/oap/9iqxhs1mg4jqdv0zo+97ckpuvcv2uz0qopv2qyojrfyraha7hsjsimd/wgqykunwmlkxgy3ecstlxodn65oyixmrps4x9d1d17fogtq2zyzxtsc8vlyctpatqgjdmjbtyhggus6eelh8dy+fqjw+ro/su4j/swb4c6whlu0ncewkaaaaaelftksuqmcc">computer vision resource</a>             <dt><a href="http://research.google.com/" add_date="1281455379">google research</a>             <dt><a href="http://research.microsoft.com/en-us/" add_date="1281455379" icon="data:image/png;base64,ivborw0kggoaaaansuheugaaabaaaaaqcayaaaaf8/9haaabl0leqvq4jawstwutqrrgz52575uyhayf0vzbk0chyajwcvpv3opv0607qfzyubktgkttsjautfuskmtembku0uzaetodsxlmnjk8d2qws3eldb4zn00qqepo3ntbcwivbhkqopgxo3arjaiyzgvnqwmenvuzyjkcczjnebgqebyhvopsgktado8eprmn+plxa831dbzzpjsyjmc86xty6faimsiimnncrosfhzgztkivpg61epf2dd29p+wdhpcokckkztt6hxmhhec9xieeqiwvxz9/4tnzf2xsbkljmjfgbkd9zvmzv75/q7xgbrpj8favm1vbjm9oqulaeww73wlq7/p75yevx73ghsqqyft293hee1pk/smhdi7+yjapqsdxcei2ze6vy3q6zavdpro7y+r8grz/cwjzqskjegmkzocalluakuuaukyutzkildkdnxcn0udvu5ylmi0vrgz4vrpfwbj6lxrzo4aruule2hpm+wlqhpozy7uksysacw630q1rcku+fblmhpaqifz3ya+m8j7/gttd19uubm8aaaaasuvork5cyii=">microsoft research - turning ideas reality</a>    
         <dt><a href="http://techresearch.intel.com/articles/index.html" add_date="1281455379">intel labs</a>             <dt><a href="http://www.ibm.com/developerworks/" add_date="1309092502" icon="data:image/png;base64,ivborw0kggoaaaansuheugaaabaaaaaqcayaaaaf8/9haaaawuleqvq4jwnksj3xn4ecwesj5ufkwp9z6xcmjq8tqwywphmdawmdy9pmbmqaxmzmm4nvegqxflxogylcnhsnf7abbc6a8zenhycbutngyudibyyjkxeqgaaapiuygxrljgmaaaaasuvork5cyii=">ibm developerworks : ibm&#39;s resource developers , professionals</a>             <dt><a href="http://www.siam.org/" add_date="1281455379" icon="data:image/png;base64,ivborw0kggoaaaansuheugaaabaaaaaqcayaaaaf8/9haaaawkleqvq4jd2rwqraiaxdu7+8fvnbyr6s2smuxlguoyggvyrjkwdfj5ydfy8aejmatj3bko9zkwmdngzo/bhbxadftdkruhv9zkfv1xw5rgiaa7/cvjlz/knbbdfjeu7uim0taaaaaelftksuqmcc">siam: society industrial , applied mathematics</a>             <dt><a href="http://www.wolfram.com/" add_date="1281455379" icon="data:image/png;base64,ivborw0kggoaaaansuheugaaabaaaaaqcayaaaaf8/9haaac7uleqvq4jywtx0htcrzfz3y37931zv2x5dzuiwzz2tqfkuvyvepivfczib4coyayquegqmi9s0qreb3zq2aeovqp1bjsg7smpqkmu6boov3debfddb23p8xuovp2+f3o98eb7/kr+i88n1sogfc6ptxut/s3e3kypl7suonzv7szwqonjzbc9s6uw/ogemdttpehravflwbf8gm+gldafklvarspnlmg6rp2scctbnxejtiqq83hfmchza6mzcg4ylkxejoyjzhuq8uk89wpsa5znkgsvyscooyjyuxfomp4cyulsvboqfed539naabzkuk4h9eqrvazsqjiqookh4mvuuwpriwnwucg+yvkvwwawk9cvgp6onidyhfsq3qckvea+jwgyq4sy1wea0orknbtocdggfg8+k3y4qbemleaz+qamfdjuwcu1wurfcyki7kxq0sj0h06potsqgrwyddgkg9tafv6cculw4rcfrogn3oz0qltmquannq3hkvpnq6sb7xjssuatdy1md0lilcyipngiio9em06hg0p4di5igtugddqqm0p73r+rfed4jvhpuuy6f+oekyeqwyazloguepm5dbpypu8wjrmqaqo4cste533qtmdamrtl4u8ors+clwfonoxwgioeirdiaoieedxsdu7j1/dk/aos/cnxqejk0glsvaay/ufw7s/tbg128v3+yd9w32iuno1wvfsckujcbmeet/isb/e9dqeng5br8tezzc+cdkebaqpor/hd5+u82/qwyi9afm1w6yfgzsfhvia0z6h7oj8on+l/qffelst69rx8ec9pc1jnwenra/bxaet6ybvemoxs48mjvyz0ixoapcinrmuai0tkr3s26ymknlm3v+q4prkwmhsc1mmuzplph2hquih7kprkdesokrkfik2rgnmzo2p/6mx5acgxtaequ1ww5zd1xwjxz7s2vtlzhgvxe38vpdtgut8vvfc7x1k9vwgs2enws3t7r0aaaaa
suvork5cyii=">wolfram research: mathematica, technical , scientific software</a>             <dt><a href="http://www.mathworks.com/" add_date="1281455379">the mathworks - matlab , simulink technical computing</a>             <dt><a href="http://www.youtube.com/user/googletechtalks" add_date="1281455379" icon="data:image/png;base64,ivborw0kggoaaaansuheugaaabaaaaaqcayaaaaf8/9haaabs0leqvq4jz2sp0hcurtgf+clqdaq2tksuhnbq9sgci5bu/sbpsnawrprxamlyujsaklcwqiaoigkk+cknkvobyf6gu57r/fyt9a3no+ce893zr3nokr6vm7qtmzavvwtdkkplhs6lvajapvq2l/av/18nfxar36etyugdu7glxc0fqlvr0ckalxt0w+xewnvtm+ogmwy2n+m8dlvvnjeiyjiicptc+uc5doichbumvkjobuywuleimmxqglitvvvrkjvilpk6qfj6ilweovopko4ze+drczyidhvp2jhme9mza5i9sqmhhw6ulgmtto4ttp4ulxqnlxqhns6lc+awbrc8cbiirezilu7keq5wcyigd9a958qawqqfukfendv/jmt2nsgs4z/8ahrq2yk7c+w3/mwrqjsgwkkmzmznvzj0udasf3oj0u9jv8paccxycnu1gaaaabjru5erkjggg==">youtube - chaƮne de googletechtalks</a>             <dt><a href="http://groups.csail.mit.edu/vision/welcome/" add_date="1281455379" icon="data:image/png;base64,ivborw0kggoaaaansuheugaaabaaaaaqcayaaaaf8/9haaab0kleqvq4jy2t30ttyrjhp+mcp8mwzbfkjvk4cyvzi+wh1z1fqtfdv3uxr4bx/gks4exoquxgvogmuhg0jwrrwhgrbks7z81azzuy49kj+6fm62lsxafzyd+r9+f9v5/neb+875fyuvzgrjlpizr3mwdslqoadnwzpevilqcazmwaywuv0bnpakysjz8/kuizpni3mjl3dy3cqswv3wqu9xaa4clhfaufwwfymic6fbv1tbo982zniluzmgjarq4rdz6jxzum88wfcedyiv5ahs5r2xzt/d74vq/ifytx1n7bjd8ts1vvsysojsi4t/u1zdrib9yd1lmjovuvngtyk+6itqvauylsl/kc9n5ucfdyzkipkrpgay1ji9rmcksqtymqjofjylfk2zmk2xsafh45ivwyvqdp2bzhz182govgaox9kwaku5nuz00cxb2oitaanoctlqls89osl8fpv+shidu/zekbfwc4b9zn9puhhsn0tbjdxalloh5zomc7jms9rpt5on9jswyojni3mvbfdawhxht7csdhgj/cqsy9x28pzelfeauakj9ubbeyyvjoyvds/lztswu18yhmllyjokqnla+mnqh2s3k6us+j+g88skj7zlo6hwaaaabjru5erkjggg==">mit csail computer vision research group</a>             <dt><a href="http://www.youtube.com/watch?v=9k8x__i2o2a&feature=related" add_date="1281455379" 
icon="data:image/png;base64,ivborw0kggoaaaansuheugaaabaaaaaqcayaaaaf8/9haaabs0leqvq4jz2sp0hcurtgf+clqdaq2tksuhnbq9sgci5bu/sbpsnawrprxamlyujsaklcwqiaoigkk+cknkvobyf6gu57r/fyt9a3no+ce893zr3nokr6vm7qtmzavvwtdkkplhs6lvajapvq2l/av/18nfxar36etyugdu7glxc0fqlvr0ckalxt0w+xewnvtm+ogmwy2n+m8dlvvnjeiyjiicptc+uc5doichbumvkjobuywuleimmxqglitvvvrkjvilpk6qfj6ilweovopko4ze+drczyidhvp2jhme9mza5i9sqmhhw6ulgmtto4ttp4ulxqnlxqhns6lc+awbrc8cbiirezilu7keq5wcyigd9a958qawqqfukfendv/jmt2nsgs4z/8ahrq2yk7c+w3/mwrqjsgwkkmzmznvzj0udasf3oj0u9jv8paccxycnu1gaaaabjru5erkjggg==">youtube - hello world through custom uart hyperterminal</a> 

A quick solution is to use regular expressions in Python.

Assuming the variable s contains the HTML string:

import re

s = '''
<dt><a href="http://gmazars.info/conf/index.html"
<dt><a href="http://research.google.com/"
<dt><a href="http://research.microsoft.com/en-us/"
<dt><a href="http://techresearch.intel.com/articles/index.html"
'''

# Non-greedy match: capture everything between href=" and the next double quote.
print(re.findall(r'href="(.*?)"', s))
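The regular expression above captures every href="..." in the text, whatever tag it belongs to. As an alternative sketch, not part of the original answer, the standard library's html.parser module can collect only the href attributes of <a> tags. The names BookmarkLinkParser and extract_links are hypothetical, chosen for illustration, and the sample string is a shortened stand-in for the exported bookmarks.

```python
from html.parser import HTMLParser


class BookmarkLinkParser(HTMLParser):
    """Collects the href value of every <a> tag fed to the parser.

    Illustrative helper; the class name is an assumption, not from the answer.
    """

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, so <A HREF=...>
        # in an exported file is treated the same as <a href=...>.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html_text):
    """Return the href values of all <a> tags, in document order."""
    parser = BookmarkLinkParser()
    parser.feed(html_text)
    parser.close()
    return parser.links


if __name__ == "__main__":
    # Shortened stand-in for the bookmarks file; icon data is a placeholder.
    sample = '''
    <dl><p>
        <dt><a href="http://research.google.com/" add_date="1281455379">google research</a>
        <dt><a href="http://www.siam.org/" add_date="1281455379" icon="data:image/png;base64,AAAA">siam</a>
    </dl>
    '''
    for url in extract_links(sample):
        print(url)
```

To run this against the file from the question, something like extract_links(open("favorit.html", encoding="utf-8").read()) should work, assuming the file is UTF-8 as its meta tag declares.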
