Link List In Perl

Type at the (Unix/Linux/Cygwin )command line:

 curl -s TARGET_URL | perl -e "while (\$line=<stdin>) { while (\$line=~ /(href\\s*=\\s*\\\"([^\\\"]*))|(href\\s*=\\s*([^\\s\\\">]+)[>])|(href\\s*=\\s*\\'([^\\']*)\\')/gi ) { print \$2 . \$4 . \$6 .\"\\n\" ;} }"

Note: you can use curl -s, lynx -source, GET (if you have Perl's LWP modules installed) or your local preference.

Man, that's UgLy. Of course, typing Perl code into a shell (which requires escaping many of Perl's metacharacters like $ with backslashes) makes it even uglier....

Man, that's ConCise. and can be typed in one line on a CommandLineInterface. What could be simpler, VisualBasic? Particularly if appearance is the test of efficacy.

PerlGolf!

 GET TARGET_URL | perl -e "foreach(<stdin>){print\"\$2\$4\$6\\n\"if/(href\\s*=\\s*\\\"([^\\\"]*))|(href\\s*=\\s*([^\\s\\\">]+)[>])|(href\\s*=\\s*\\'([^\\']*)\\')/gi;}"

No doubt someone with better RegularExpression-Fu than I will be able to ReFactor the regex there a bit. -- EarleMartin

Er, the foreach doesn't work like that, but you can lose the explicit loop required with a command line switch:

  GET TARGET_URL | perl -ne "print\"\$2\$4\$6\\n\"if/(href\\s*=\\s*\\\"([^\\\"]*))|(href\\s*=\\s*([^\\s\\\">]+)[>])|(href\\s*=\\s*\\'([^\\']*)\\')/gi;"

The foreach did work for me, but that's cool anyway. -- EM

I'm sorry to spoil your PerlGolf enjoyment, but you might have noticed that my code has one extra inner loop. If you run it against this very page, you'll notice the difference. --CostinCozianu

Amateurs, all of you! :-) Here's a much shorter (and more readable) version. Enclosing in single quotes removes the need for shell backslashing, while the -l command line option auto-outputs a newline after each print. The single-quoted (href='foo') match is not necessary as that's not valid HTML.

Since when are single quotes invalid? http://www.w3.org/TR/html401/intro/sgmltut.html#h-3.2.2

  GET TARGET_URL | perl -lne 'print$2.$3while/href\s*=\s*("(.+?)"|([^\s">]+))/gi'
-- Mike Wittman

ThankYou. I had a feeling the backslashes could be eliminated, but wasn't sure how. -- EM

Oops, I take that back about the single quotes. That means we have to add them back in to the regular expression and escape them from the shell (ugh). It's still better than using double quotes and backslashing everything, I guess. Now if I could just figure out how to code pairs of adjacent single quotes in Wiki, I would be in good shape. The two spaces in the regex should not be there.

  GET TARGET_URL | perl -lne 'print$2.$3.$4while/href\s*=\s*('\' '(.+?)'\' '|"(.+?)"|([^\s">]+))/gi'
-- MW


CategoryPerl


EditText of this page (last edited May 23, 2005) or FindPage with title or text search