Parsing HTML files that may not comply with HTML DOM
$10-30 USD
Closed
Posted about 5 years ago
Paid on delivery
I have 1 TB of .htm files.
These files have been collected from the same 30 websites since 2011.
When I try to write a parser for these files (Python, NLP, BeautifulSoup, Decruft, etc.), large portions of known text do not appear. On review, I noticed that some pages put the actual page content into JSON strings that live inside JavaScript elements, which makes parsing very cumbersome for me. One example of this behavior is the current [login to view URL] index page.
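To make the JSON-in-JavaScript problem concrete, here is a minimal sketch of one way to recover that content using only the Python standard library. It assumes the embedded data is valid JSON sitting directly in the script text; `ScriptCollector`, `extract_json_blobs`, and the `window.__DATA__` variable in the sample are hypothetical names chosen for illustration, not anything taken from the actual pages.

```python
import json
from html.parser import HTMLParser


class ScriptCollector(HTMLParser):
    """Collect the raw text inside every <script> element."""

    def __init__(self):
        super().__init__()
        self.in_script = False
        self.scripts = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True
            self.scripts.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False

    def handle_data(self, data):
        if self.in_script:
            self.scripts[-1] += data


def extract_json_blobs(html):
    """Return every non-empty JSON object or array found in script text."""
    collector = ScriptCollector()
    collector.feed(html)
    collector.close()
    decoder = json.JSONDecoder()
    blobs = []
    for script in collector.scripts:
        pos = 0
        while pos < len(script):
            if script[pos] in "{[":
                try:
                    # raw_decode parses one JSON value starting at pos and
                    # reports where it ended, ignoring whatever follows.
                    value, end = decoder.raw_decode(script, pos)
                except json.JSONDecodeError:
                    pos += 1  # not JSON here (e.g. a JS function body)
                    continue
                if value:  # skip empty {} and [] noise
                    blobs.append(value)
                pos = end
            else:
                pos += 1
    return blobs


# Hypothetical sample page: the visible text lives only inside the script.
sample = """<html><body><div id="app"></div>
<script>window.__DATA__ = {"title": "Hello", "body": "Text the DOM never shows"};</script>
</body></html>"""

for blob in extract_json_blobs(sample):
    print(blob["title"], "-", blob["body"])
```

This will not catch JavaScript object literals with unquoted keys, or JSON that has been escaped again inside a JavaScript string literal; those cases would need an extra unescaping step or a real JavaScript engine.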
This project has one deliverable: a conversation with me to discuss modern technologies and techniques that can be used to parse these pages, store the results, and make them available for searching.
After our conversation, a second project may be opened on Freelancer exclusively for the winning bidder, to perform further HTML parsing work based on the suggestions they provided during our discussion on this project.
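For the "store the results and make them searchable" part of the discussion, one possible starting point is SQLite's FTS5 full-text index, which ships with most Python builds. The sketch below assumes FTS5 is compiled into the local SQLite; the table name `pages`, its columns, and the helpers `build_index` and `search` are hypothetical. At 1 TB of input, a dedicated search engine may turn out to be a better fit.

```python
import sqlite3


def build_index(pages, db_path=":memory:"):
    """Create an FTS5 table and load (url, fetched, body) rows into it."""
    conn = sqlite3.connect(db_path)
    # Only `body` is tokenized; url and fetched date are stored but not indexed.
    conn.execute(
        "CREATE VIRTUAL TABLE IF NOT EXISTS pages "
        "USING fts5(url UNINDEXED, fetched UNINDEXED, body)"
    )
    conn.executemany(
        "INSERT INTO pages (url, fetched, body) VALUES (?, ?, ?)", pages
    )
    conn.commit()
    return conn


def search(conn, query, limit=10):
    """Return (url, fetched) pairs for matching pages, best match first."""
    rows = conn.execute(
        "SELECT url, fetched FROM pages WHERE pages MATCH ? "
        "ORDER BY rank LIMIT ?",
        (query, limit),
    )
    return rows.fetchall()


# Hypothetical rows standing in for text extracted from the archived pages.
conn = build_index([
    ("http://example.com/a", "2011-05-01", "city council votes on budget"),
    ("http://example.com/b", "2012-01-01", "weather report for the weekend"),
])
print(search(conn, "budget"))
```

Storing the fetch date alongside each page keeps the 2011-onward history queryable, so the same URL can appear once per snapshot.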
This can be done by combining different techniques, including regular expressions. I have extensive experience parsing HTML files and am ready to start immediately. Please contact me with details if you are interested. Thank you, zeke.
Hi,
I am a web scraping expert, and I have worked extensively on NLP. I think you need a combination of both to parse your data correctly.
Let us discuss further details in chat. I will be glad to work with you.
Thanks,
Shubham Sharma
I'm an expert Python developer and I've done web scraping. For that reason I think I'm the best candidate for this work...
Have you tried running the JavaScript from the HTML?
This is a good solution but could be slower...