
Using lynx to convert HTML to text

You can extract the plain text from an HTML file or stream with lynx:

$ lynx -dump -stdin < file.html
$ cat file.html | lynx -dump -stdin
$ curl site | lynx -dump -stdin

For an HTML file such as:

<!DOCTYPE HTML>
<html><body>
<p>This is a link<a href='http://enki.com'> to enki. </a></p>
</body></html>

The output will look like:

This is a link[1] to enki.

References

1. http://enki.com

Lynx is a terminal-based browser that often proves useful for testing.
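
If the numbered References list is not wanted in the output, the lynx manual documents a -nolist option that disables the link list in dumps. The sketch below wraps that in a small shell function. The function name html_to_text is hypothetical, and the snippet assumes lynx is installed and on the PATH.

```shell
# Hypothetical helper (not part of lynx): convert HTML read from stdin
# to plain text, omitting the trailing "References" list.
html_to_text() {
  # Fail early with a message if lynx is missing.
  if ! command -v lynx >/dev/null 2>&1; then
    echo "lynx is not installed" >&2
    return 1
  fi
  # -dump   : write the rendered page to stdout and exit
  # -nolist : do not append the list of links
  # -stdin  : read the document from standard input
  lynx -dump -nolist -stdin
}

# Example (needs network access):
#   curl -s https://enki.com | html_to_text
```

It reads from stdin like the commands above, so html_to_text < file.html works as well.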


Comments

Do you mean `lynx -dump -stdin < file.html`? How do you supply a file?
@Arseny, 2 years ago
You can pipe it in with `cat file |`, or redirect it the way you specified.
@tuwi.dc, 2 years ago
The order of the flags should not matter.
@miknie, a year ago
Agreed, the flags should be accepted in either order.
@SitiSchu, a year ago
In the example, "link" is marked yellow but actually "to enki." is the link text.
@niwi, a year ago
Yep, parameter order doesn't matter.
@giorgio, a year ago
I bet this is how Stallman works with the net.
@Alkuzad, a year ago
Can't you use URLs with -dump?
@Sardtok, a year ago
That was a misleading title
@Linkyu, 10 months ago
Why use this over curl?
@jldugger, 10 months ago
The markup doesn't match the output; only "link" was probably meant to be enclosed.
@j16t, 9 months ago
Arguments in reversed order?
@NezumiRyu, 8 months ago
OMG this is awesome!!!
@pinkasey, 8 months ago
It is a cool lynx feature, agreed. But sed is a bit faster, and in most cases you already have it installed. Something like this:
$ < file.html sed -r 's/<tr[^<>]*>/\n/g; s/<t[dh][^<>]*>/,/g; s/<[^<>]+>//g'
This replaces <tr>s with newlines and <td>s and <th>s with commas, and removes all other tags, leaving plain text. You can additionally remove blank lines or do some scraping.
@XelorRelin, 8 months ago
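The sed approach in the comment above can be tried on a small table. This is a minimal sketch, assuming GNU sed (for -E, which GNU sed treats the same as -r, and for \n in the replacement). The function name html_table_to_csv and the second cleanup pass are illustrative additions, not part of the original command.

```shell
# Hypothetical helper: turn simple HTML table markup on stdin into
# comma-separated lines. Assumes GNU sed.
html_table_to_csv() {
  # Pass 1 (the substitutions from the comment): <tr> becomes a newline,
  # <td>/<th> become commas, every other tag is removed.
  # Pass 2 (added here): drop the leading comma on each row and
  # delete blank lines.
  sed -E 's/<tr[^<>]*>/\n/g; s/<t[dh][^<>]*>/,/g; s/<[^<>]+>//g' |
    sed -E 's/^,//; /^[[:space:]]*$/d'
}

printf '%s\n' \
  '<table>' \
  '<tr><th>name</th><th>lang</th></tr>' \
  '<tr><td>lynx</td><td>C</td></tr>' \
  '</table>' | html_table_to_csv
# prints:
# name,lang
# lynx,C
```

The patterns are purely textual, so <t[dh]...> also matches tags such as <thead>, and nothing decodes entities like &amp;. For anything beyond simple markup, lynx -dump is the more robust choice.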
Just remove the -stdin flag if you want to use it with a URL: lynx -dump https://enki.com
@XelorRelin, 8 months ago
