Saturday, 11 July 2015

Mobile app developers “duped” into distributing data-scraping malware: NICTA

The surge in mobile malware has led many to condemn developers' poor security practices, yet recent NICTA research suggests that – even though data-stealing is ubiquitous among both paid and free Android applications – many mobile application developers are in fact being “duped” into incorporating data-stealing routines into their applications.

A methodical analysis of Android applications and source code found that all of the top 100 paid and non-paid apps in Australia were collecting personal information, with 60 percent of the apps incorporating some sort of tracking library and 20 percent of the apps featuring more than three different tracking libraries.

While many have blamed developers for their poor security, NICTA mobile systems research group leader, Aruna Seneviratne, who leads the organisation's Networks Research Group, told CSO Australia that many tracking libraries were inadvertently added when developers incorporated third-party libraries into their mobile apps.

“In most cases app developers just use third-party libraries and don't know what's in them,” he said. “They're not being malicious for the sake of being malicious; they are just being duped into doing a thing that collects a lot of information.”

 And collect they do. Apps analysed by the team – whose paper 'early detection of spam mobile apps' was accepted for presentation at the recent WWW 2015 conference in Florence, Italy – were siphoning all kinds of personal information off of users' mobile devices, often sending it to enlarge what have become massive databases of personal preferences and behavioural modeling.

“It's amazing how much information each of those apps collects,” he said, “and the scary thing is that most of them actually go to a small number of sources – which means these guys can actually infer a lot of information about you. They have a very good idea of who you are and what you're doing – and they are cross-matching the information they collect.”

Ever more-clever data-siphoning routines were making data collection richer all the time, with many Android apps now being designed with libraries that collect information about nearby Wi-Fi access points and can correctly extrapolate the user's location 90 percent of the time.

Read more: The week in security: Android apps collecting your location data, home routers hit by drive-by malware

Seneviratne blamed Google's relatively lax app-approval process for the proliferation of such apps, which join the malware-laden apps that by the team's figures account for around 3 percent of all Google Play Store apps.

Recognising that developers are often as clueless as users about the extent of the data collection going on, the team has proposed an app-rating system that will give consumers a better idea of what they're enabling by downloading and installing a particular app.

A basic prototype has already been developed and a pilot site is expected to be up and running by the fourth quarter of this year. The service, which rates apps on criteria such as privacy and security, will be available to third parties as a Web service that Seneviratne hopes will eventually help it gain traction on app-rating and other sites.

Read more: Surveillance laws driving companies to limit data collection, developers to boost security

“We've been working to come up with a scheme that is similar to the energy-ratings system that you have for electrical appliances,” he said, noting that the site will also seek to boost developers' security awareness by correlating app ratings “to let consumers know they can download an alternate app that has the same functionality but a higher security rating”.

Israeli developer-tools firm Checkmarx has taken its own approach to improving developers' security skills, recently learning extensive lessons as hackers worked to manipulate its Game of Hacks security application – which is now under development to be sold to large corporates for developer training and testing.

This article is brought to you by Enex TestLab, content directors for CSO Australia.

Read more: The week in security: Budget flags encryption troubles, cross-government IAM

Feeling social? Follow us on Twitter and LinkedIn Now!

Read More:

    Victorian Commissioner for Privacy and Data Protection sorts sheep from the goats

    Better than email: VISA launches FireEye threat intel platform for merchants

Source: http://www.cso.com.au/article/576533/mobile-app-developers-duped-into-distributing-data-scraping-malware-nicta/

Friday, 26 June 2015

Data Scraping - Enjoy the Appeal of the Hand Scraped Flooring

Hand scraped flooring is appreciated for the character it brings into the home. This style of flooring relies on hand scraped planks of wood and not the precise milled boards. The irregularities in the planks provide a certain degree of charm and help to create a more unique feature in the home.

Distressed vs. Hand scraped

There are two types of flooring in the market that have an aged and unique charm with a non perfect finish. However, there is a significant difference in the process used to manufacture the planks. The more standard distresses flooring is cut on a factory production line. The grooves, scratches, dents, or other irregularities in these planks are part of the manufacturing process and achieved by rolling or pressed the wood onto a patterned surface.

The real hand scraped planks are made by craftsmen and they work on each plant individually. By using this working technique, there is complete certainty that each plank will be unique in appearance.

Scraping the planks

The hand scraping process on the highest-quality planks is completed by the trained carpenter or craftsmen who will produce a high-quality end product and take great care in their workmanship. It can benefit to ask the supplier of the flooring to see who completes the work.

Beside the well scraped lumber, there are also those planks that have been bought from the less than desirable sources. This is caused by the increased demand for this type of flooring. At the lower end of the market the unskilled workers are used and the end results aren't so impressive.

The high-quality plank has the distinctive look that feels and functions perfectly well as solid flooring, while the low-quality work can appear quite ugly and cheap.

Even though it might cost a little bit more, it benefits to source the hardwood floor dealers that rely on the skilled workers to complete the scraping process.

Buying the right lumber

Once a genuine supplier is found, it is necessary to determine the finer aspects of the wooden flooring. This hand scraped flooring is available in several hardwoods, such as oak, cherry, hickory, and walnut. Plus, it comes in many different sizes and widths. A further aspect relates to the finish with darker colored woods more effective at highlighting the character of the scraped boards. This makes the shadows and lines appear more prominent once the planks have been installed at home.

Why not visit Bellacerafloors.com for the latest collection of luxury floor materials, including the Handscraped Hardwood Flooring.

Source: http://ezinearticles.com/?Enjoy-the-Appeal-of-the-Hand-Scraped-Flooring&id=8995784

Saturday, 20 June 2015

Migrating Table-oriented Web Scraping Code to rvest w/XPath & CSS Selector Examples

My intrepid colleague (@jayjacobs) informed me of this (and didn’t gloat too much). I’ve got a “pirate day” post coming up this week that involves scraping content from the web and thought folks might benefit from another example that compares the “old way” and the “new way” (Hadley excels at making lots of “new ways” in R :-) I’ve left the output in with the code to show that you get the same results.

The following shows old/new methods for extracting a table from a web site, including how to use either XPath selectors or CSS selectors in rvest calls. To stave of some potential comments: due to the way this table is setup and the need to extract only certain components from the td blocks and elements from tags within the td blocks, a simple readHTMLTable would not suffice.

The old/new approaches are very similar, but I especially like the ability to chain output ala magrittr/dplyr and not having to mentally switch gears to XPath if I’m doing other work targeting the browser (i.e. prepping data for D3).

The code (sans output) is in this gist, and IMO the rvest package is going to make working with web site data so much easier.

library(XML)
library(httr)
library(rvest)
library(magrittr)

# setup connection & grab HTML the "old" way w/httr

freak_get <- GET("http://torrentfreak.com/top-10-most-pirated-movies-of-the-week-130304/")

freak_html <- htmlParse(content(freak_get, as="text"))

# do the same the rvest way, using "html_session" since we may need connection info in some scripts

freak <- html_session("http://torrentfreak.com/top-10-most-pirated-movies-of-the-week-130304/")

# extracting the "old" way with xpathSApply

xpathSApply(freak_html, "//*/td[3]", xmlValue)[1:10]

##  [1] "Silver Linings Playbook "           "The Hobbit: An Unexpected Journey " "Life of Pi (DVDscr/DVDrip)"      

##  [4] "Argo (DVDscr)"                      "Identity Thief "                    "Red Dawn "                       

##  [7] "Rise Of The Guardians (DVDscr)"     "Django Unchained (DVDscr)"          "Lincoln (DVDscr)"                

## [10] "Zero Dark Thirty "

xpathSApply(freak_html, "//*/td[1]", xmlValue)[2:11]

##  [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10"

xpathSApply(freak_html, "//*/td[4]", xmlValue)

##  [1] "7.4 / trailer" "8.2 / trailer" "8.3 / trailer" "8.2 / trailer" "8.2 / trailer" "5.3 / trailer" "7.5 / trailer"

##  [8] "8.8 / trailer" "8.2 / trailer" "7.6 / trailer"

xpathSApply(freak_html, "//*/td[4]/a[contains(@href,'imdb')]", xmlAttrs, "href")

##                                    href                                    href                                    href

##  "http://www.imdb.com/title/tt1045658/"  "http://www.imdb.com/title/tt0903624/"  "http://www.imdb.com/title/tt0454876/"

##                                    href                                    href                                    href

##  "http://www.imdb.com/title/tt1024648/"  "http://www.imdb.com/title/tt2024432/"  "http://www.imdb.com/title/tt1234719/"

##                                    href                                    href                                    href

##  "http://www.imdb.com/title/tt1446192/"  "http://www.imdb.com/title/tt1853728/"  "http://www.imdb.com/title/tt0443272/"

##                                    href

## "http://www.imdb.com/title/tt1790885/?"


# extracting with rvest + XPath

freak %>% html_nodes(xpath="//*/td[3]") %>% html_text() %>% .[1:10]

##  [1] "Silver Linings Playbook "           "The Hobbit: An Unexpected Journey " "Life of Pi (DVDscr/DVDrip)"      

##  [4] "Argo (DVDscr)"                      "Identity Thief "                    "Red Dawn "                       

##  [7] "Rise Of The Guardians (DVDscr)"     "Django Unchained (DVDscr)"          "Lincoln (DVDscr)"                

## [10] "Zero Dark Thirty "

freak %>% html_nodes(xpath="//*/td[1]") %>% html_text() %>% .[2:11]

##  [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10"

freak %>% html_nodes(xpath="//*/td[4]") %>% html_text() %>% .[1:10]

##  [1] "7.4 / trailer" "8.2 / trailer" "8.3 / trailer" "8.2 / trailer" "8.2 / trailer" "5.3 / trailer" "7.5 / trailer"

##  [8] "8.8 / trailer" "8.2 / trailer" "7.6 / trailer"

freak %>% html_nodes(xpath="//*/td[4]/a[contains(@href,'imdb')]") %>% html_attr("href") %>% .[1:10]

##  [1] "http://www.imdb.com/title/tt1045658/"  "http://www.imdb.com/title/tt0903624/"

##  [3] "http://www.imdb.com/title/tt0454876/"  "http://www.imdb.com/title/tt1024648/"

##  [5] "http://www.imdb.com/title/tt2024432/"  "http://www.imdb.com/title/tt1234719/"

##  [7] "http://www.imdb.com/title/tt1446192/"  "http://www.imdb.com/title/tt1853728/"

##  [9] "http://www.imdb.com/title/tt0443272/"  "http://www.imdb.com/title/tt1790885/?"

# extracting with rvest + CSS selectors

freak %>% html_nodes("td:nth-child(3)") %>% html_text() %>% .[1:10]

##  [1] "Silver Linings Playbook "           "The Hobbit: An Unexpected Journey " "Life of Pi (DVDscr/DVDrip)"      

##  [4] "Argo (DVDscr)"                      "Identity Thief "                    "Red Dawn "                       

##  [7] "Rise Of The Guardians (DVDscr)"     "Django Unchained (DVDscr)"          "Lincoln (DVDscr)"                

## [10] "Zero Dark Thirty "

freak %>% html_nodes("td:nth-child(1)") %>% html_text() %>% .[2:11]

##  [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10"

freak %>% html_nodes("td:nth-child(4)") %>% html_text() %>% .[1:10]

##  [1] "7.4 / trailer" "8.2 / trailer" "8.3 / trailer" "8.2 / trailer" "8.2 / trailer" "5.3 / trailer" "7.5 / trailer"

##  [8] "8.8 / trailer" "8.2 / trailer" "7.6 / trailer"

freak %>% html_nodes("td:nth-child(4) a[href*='imdb']") %>% html_attr("href") %>% .[1:10]

##  [1] "http://www.imdb.com/title/tt1045658/"  "http://www.imdb.com/title/tt0903624/"

##  [3] "http://www.imdb.com/title/tt0454876/"  "http://www.imdb.com/title/tt1024648/"

##  [5] "http://www.imdb.com/title/tt2024432/"  "http://www.imdb.com/title/tt1234719/"

##  [7] "http://www.imdb.com/title/tt1446192/"  "http://www.imdb.com/title/tt1853728/"

##  [9] "http://www.imdb.com/title/tt0443272/"  "http://www.imdb.com/title/tt1790885/?"

# building a data frame (which is kinda obvious, but hey)

data.frame(movie=freak %>% html_nodes("td:nth-child(3)") %>% html_text() %>% .[1:10],

           rank=freak %>% html_nodes("td:nth-child(1)") %>% html_text() %>% .[2:11],

           rating=freak %>% html_nodes("td:nth-child(4)") %>% html_text() %>% .[1:10],

           imdb.url=freak %>% html_nodes("td:nth-child(4) a[href*='imdb']") %>% html_attr("href") %>% .[1:10],

           stringsAsFactors=FALSE)

##                                 movie rank        rating                              imdb.url

## 1            Silver Linings Playbook     1 7.4 / trailer  http://www.imdb.com/title/tt1045658/

## 2  The Hobbit: An Unexpected Journey     2 8.2 / trailer  http://www.imdb.com/title/tt0903624/

## 3          Life of Pi (DVDscr/DVDrip)    3 8.3 / trailer  http://www.imdb.com/title/tt0454876/

## 4                       Argo (DVDscr)    4 8.2 / trailer  http://www.imdb.com/title/tt1024648/

## 5                     Identity Thief     5 8.2 / trailer  http://www.imdb.com/title/tt2024432/

## 6                           Red Dawn     6 5.3 / trailer  http://www.imdb.com/title/tt1234719/

## 7      Rise Of The Guardians (DVDscr)    7 7.5 / trailer  http://www.imdb.com/title/tt1446192/

## 8           Django Unchained (DVDscr)    8 8.8 / trailer  http://www.imdb.com/title/tt1853728/

## 9                    Lincoln (DVDscr)    9 8.2 / trailer  http://www.imdb.com/title/tt0443272/

## 10                  Zero Dark Thirty    10 7.6 / trailer http://www.imdb.com/title/tt1790885/?

Source: http://www.r-bloggers.com/migrating-table-oriented-web-scraping-code-to-rvest-wxpath-css-selector-examples/


Tuesday, 9 June 2015

Web Scraping Services : Data Discovery vs. Data Extraction

Looking at screen-scraping at a simplified level, there are two primary stages involved: data discovery and data extraction. Data discovery deals with navigating a web site to arrive at the pages containing the data you want, and data extraction deals with actually pulling that data off of those pages. Generally when people think of screen-scraping they focus on the data extraction portion of the process, but my experience has been that data discovery is often the more difficult of the two.

The data discovery step in screen-scraping might be as simple as requesting a single URL. For example, you might just need to go to the home page of a site and extract out the latest news headlines. On the other side of the spectrum, data discovery may involve logging in to a web site, traversing a series of pages in order to get needed cookies, submitting a POST request on a search form, traversing through search results pages, and finally following all of the "details" links within the search results pages to get to the data you're actually after. In cases of the former a simple Perl script would often work just fine. For anything much more complex than that, though, a commercial screen-scraping tool can be an incredible time-saver. Especially for sites that require logging in, writing code to handle screen-scraping can be a nightmare when it comes to dealing with cookies and such.

In the data extraction phase you've already arrived at the page containing the data you're interested in, and you now need to pull it out of the HTML. Traditionally this has typically involved creating a series of regular expressions that match the pieces of the page you want (e.g., URL's and link titles). Regular expressions can be a bit complex to deal with, so most screen-scraping applications will hide these details from you, even though they may use regular expressions behind the scenes.

As an addendum, I should probably mention a third phase that is often ignored, and that is, what do you do with the data once you've extracted it? Common examples include writing the data to a CSV or XML file, or saving it to a database. In the case of a live web site you might even scrape the information and display it in the user's web browser in real-time. When shopping around for a screen-scraping tool you should make sure that it gives you the flexibility you need to work with the data once it's been extracted.

Source: http://ezinearticles.com/?Data-Discovery-vs.-Data-Extraction&id=165396

Wednesday, 3 June 2015

WordPress Titles: scraping with search url

I’ve blogged for a few years now, and I’ve used several tools along the way. zachbeauvais.com began as a Drupal site, until I worked out that it’s a bit overkill, and switched to WordPress. Recently, I’ve been toying with the idea of using a static site generator (a lá Jekyll or Hyde), or even pulling together a kind of ebook of ramblings. I also want to be able to arrange the posts based on the keywords they contain, regardless of how they’re categorised or tagged.

Whatever I wanted to do, I ended up with a single point of messiness: individual blog posts, and how they’re formatted. When I started, I seem to remember using Drupal’s truly awful WYSIWYG editor, and tweaking the HTML soup it produced. Then, when I moved over to WordPress, it pulled all the posts and metadata through via RSS, and I tweaked with the visual and text tools which are baked into the engine.

A couple years ago, I started to write in Markdown, and completely apart from the blog (thanks to full-screen writing and loud music). This gives me a local .md file, and I copy/paste into WordPress using a plugin to get rid of the visual editor entirely.

So, I wrote a scraper to return a list of blog posts containing a specific term. What I hope is that this very simple scraper is useful to others—WordPress is pretty common, after all—and to get some ideas for improving it, and handle post content. If you haven’t used ScraperWiki before, you might not know that you can see the raw scraper by clicking “view source” from the scraper’s overview page (or going here if you’re lazy).

This scraper is based on WordPress’ built-in search, which can be used by passing the search terms to a url, then scraping the resulting page:

http://zachbeauvais.com/?s=search_term&submit=Search

The scraper uses three Python libraries:

    Requests
    ScraperWiki
    lxml.html

There are two variables which can be changed to search for other terms, or using a different WordPress site:

term = "coffee"

site = "http://www.zachbeauvais.com"

The rest of the script is really simple: it creates a dictionary called “payload” containing the letter “s”, the keyword, and the instruction to search. The “s” is in there to make up the search url: /?s=coffee …

Requests then GETs the site, passing payload as url parameters, and I use Request’s .text function to render the page in html, which I then pass through lxml to the new variable “root”.

payload = {'s': str(term), 'submit': 'Search'}

r = requests.get(site, params=payload)  # This'll be the results page

html = r.text

root = lxml.html.fromstring(html)  # parsing the HTML into the var root

Now, my WordPress theme renders the titles of the retrieved posts in <h1> tags with the CSS class “entry-title”, so I loop through the html text, pulling out the links and text from all the resulting h1.entry-title items. This part of the script would need tweaking, depending on the CSS class and h-tag your theme uses.

for i in root.cssselect("h1.entry-title a"):

    link = i.cssselect("a")

    text = i.text_content()

    data = {

        'uri': link[0].attrib['href'],

        'post-title': str(text),

        'search-term': str(term)

    }

    if i is not None:

        print link

        print text

        print data

        scraperwiki.sqlite.save(unique_keys=['uri'], data=data)

    else:

        print "No results."

These return into an sqlite database via the ScraperWiki library, and I have a resulting database with the title and link to every blog post containing the keyword.

So, this could, in theory, run on any WordPress instance which uses the same search pattern URL—just change the site variable to match.

Also, you can run this again and again, changing the term to any new keyword. These will be stored in the DB with the keyword in its own column to identify what you were looking for.

See? Pretty simple scraping.

So, what I’d like next is to have a local copy of every post in a single format.

Has anyone got any ideas how I could improve this? And, has anyone used WordPress’ JSON API? It might be a logical next step to call the API to get the posts directly from the MySQL DB… but that would be a new blog post!

Source: https://scraperwiki.wordpress.com/2013/03/11/wordpress-titles-scraping-with-search-url/

Friday, 29 May 2015

Data Scraping Services - Web Scraping Video Tutorial Collection for All Programming Language

Web scraping is a mechanism in which request made to website URL to get  HTML Document text and that text then parsed to extract data from the HTML codes.  Website scraping for data is a generalize approach and can be implemented in any programming language like PHP, Java, C#, Python and many other.

There are many Web scraping software available in market using which you can extract data with no coding knowledge. In many case the scraping doesn’t help due to custom crawling flow for data scraping and in that case you have to make your own web scraping application in one of the programming language you know. In this post I have collected scraping video tutorials for all programming language.

I mostly familiar with web scraping using PHP, C# and some other scraping tools and providing web scraping service.  If you have any scraping requirement send me your requirements and I will get back with sample data scrape and best price.

 Web Scraping Using PHP

You can do web scraping in PHP using CURL library and Simple HTML DOM parsing library.  PHP function file_get_content() can also be useful for making web request. One drawback of scraping using PHP is it can’t parse JavaScript so ajax based scraping can’t be possible using PHP.

Web Scraping Using C#

There are many library available in .Net for HTML parsing and data scraping. I have used Web Browser control and HTML Agility Pack for data extraction in .Net using C#

I have didn’t done web scraping in Java, PERL and Python. I had learned web scraping in node.js using Casper.JS and Phantom.JS library. But I thought below tutorial will be helpful for some one who are Java and Python based.

Web Scraping Using Jsoup in Java

Scraping Stock Data Using Python

Develop Web Crawler Using PERL

Web Scraping Using Node.Js

If you find any other good web scraping video tutorial then you can share the link in comment so other readesr get benefit form that.

Source: http://webdata-scraping.com/web-scraping-video-tutorial-collection-programming-language/

Tuesday, 26 May 2015

Web Scraping Services - Extracting Business Data You Need

Would you like to have someone collect, extract, find or scrap contact details, stats, list, extract data, or information from websites, online stores, directories, and more?

"Hi-Tech BPO Services offers 100% risk-free, quick, accurate and affordable web scraping, data scraping, screen scraping, data collection, data extraction, and website scraping services to worldwide organizations ranging from medium-sized business firms to Fortune 500 companies."

At Hi-Tech BPO Services we are helping global businesses build their own database, mailing list, generate leads, and get access to vast resources of unstructured data available on World Wide Web.

We scrape data from various sources such as websites, blogs, podcasts, and online directories; and convert them into structured formats such as excel, csv, access, text, My SQL using automated and manual scraping technologies. Through our web data scraping services, we crawl through websites and gather sales leads, competitor’s product details, new offers, pricing methodologies, and various other types of information from the web.

Our web scraping services scrape data such as name, email, phone number, address, country, state, city, product, and pricing details among others.

Areas of Expertise in Web Scraping:

•    Contact Details
•    Statistics data from websites
•    Classifieds
•    Real estate portals
•    Social networking sites
•    Government portals
•    Entertainment sites
•    Auction portals
•    Business directories
•    Job portals
•    Email ids and Profiles
•    URLs in an excel spreadsheet
•    Market place portals
•    Search engine and SEO
•    Accessories portals
•    News portals
•    Online shopping portals
•    Hotels and restaurant
•    Event portals
•    Lead generation

Industries we Serve:

Our web scraping services are suitable for industries including real estate, information technology, university, hospital, medicine, property, restaurant, hotels, banking, finance, insurance, media/entertainment, automobiles, marketing, human resources, manufacturing, healthcare, academics, travel, telecommunication and many more.

Why Hi-Tech BPO Services for Web Scraping?

•    Skilled and committed scraping experts
•    Accurate solutions
•    Highly cost-effective pricing strategies
•    Presence of satisfied clients worldwide
•    Using latest and effectual web scraping technologies
•    Ensures timely delivery
•    Round the clock customer support and technical assistance

Get Quick Cost and Time Estimate

Source: http://www.hitechbposervices.com/web-scraping.php

Monday, 25 May 2015

Which language is the most flexible for scraping websites?

3 down vote favorite

I'm new to programming. I know a little python and a little objective c, and I've been going through tutorials for each. Then it occurred to me, I need to know which language is more flexible (python, obj c, something else) for screen scraping a website for content.

What do I mean by "flexible"?

Well, ideally, I need something that will be easy to refactor and tweak for similar projects. I'm trying to avoid doing a lot of re-writing (well, re-coding) if I wanted to switch some of the variables in the program (i.e., the website to be scraped, the content to fetch, etc).

Anyways, if you could please give me your opinion, that would be great. Oh, and if you know any existing frameworks for the language you recommend, please share. (I know a little about Selenium and BeautifulSoup for python already).

4 Answers

I recently wrote a relatively complex web scraper to harvest a TON of data. It had to do some relatively complex parsing, I needed it to stuff it into a database, etc. I'm C# programmer now and formerly a Perl guy.

I wrote my original scraper using Python. I started on a Thursday and by Sunday morning I was harvesting over about a million scores from a show horse site. I used Python and SQLlite because they were fast.

HOWEVER, as I started putting together programs to regularly keep the data updated and to populate the SQL Server that would backend my MVC3 application, I kept hitting snags and gaps in my Python knowledge.

In the end, I completely rewrote the scraper/parser in C# using the HtmlAgilityPack and it works better than before (and just about as fast).

Because I KNEW THE LANGUAGE and the environment so much better I was able to add better database support, better logging, better error handling, etc. etc.

So... short answer.. Python was the fastest to market with a "good enough for now" solution, but the language I know best (C#) was the best long-term solution.

EDIT: I used BeautifulSoup for my original crawler written in Python.

5 down vote

The most flexible is the one that you're most familiar with.

Personally, I use Python for almost all of my utilities. For scraping, I find that its functionality specific to parsing and string manipulation requires little code, is fast and there are a ton of examples out there (strong community). Chances are that someone's already written whatever you're trying to do already, or there's at least something along the same lines that needs very little refactoring.

1 down vote

I think its safe to say that Python is a better place to start than Objective C. Honestly, just about any language meets the "flexible" requirement. All you need is well thought out configuration parameters. Also, a dynamic language like Python can go a long way in increasing flexibility, provided that you account for runtime type errors.

1 down vote

I recently wrote a very simple web-scraper; I chose Common Lisp as I'm learning the language.

On the basis of my experience - both of the language and the availability of help from experienced Lispers - I recommend investigating Common Lisp for your purpose.

There are excellent XML-parsing libraries available for CL, as well as libraries for parsing invalid HTML, which you'll need unless the sites you're parsing consist solely of valid XHTML.

Also, Common Lisp is a good language in which to implement DSLs; a DSL for web-scraping may be a solution to your requirement for flexibility & re-use.

Source: http://programmers.stackexchange.com/questions/74998/which-language-is-the-most-flexible-for-scraping-websites/75006#75006


Saturday, 23 May 2015

Scraping Data: Site-specific Extractors vs. Generic Extractors

Scraping is becoming a rather mundane job with every other organization getting its feet wet with it for their own data gathering needs. There have been enough number of crawlers built – some open-sourced and others internal to organizations for in-house utilities. Although crawling might seem like a simple technique at the onset, doing this at a large-scale is the real deal. You need to have a distributed stack set up to take care of handling huge volumes of data, to provide data in a low-latency model and also to deal with fail-overs. This still is achievable after crossing the initial tech barrier and via continuous optimizations. (P.S. Not under-estimating this part because it still needs a team of Engineers monitoring the stats and scratching their heads at times).

Social Media Scraping

Focused crawls on a predefined list of sites

However, you bump into a completely new land if your goal is to generate clean and usable data sets from these crawls i.e. “extract” data in a format that your DB can process and aid in generating insights. There are 2 ways of tackling this:

a. site-specific extractors which give desired results

b. generic extractors that result in few surprises

Assuming you still do focused crawls on a predefined list of sites, let’s go over specific scenarios when you have to pick between the two-

1. Mass-scale crawls; high-level meta data – Use generic extractors when you have a large-scale crawling requirement on a continuous basis. Large-scale would mean having to crawl sites in the range of hundreds of thousands. Since the web is a jungle and no two sites share the same template, it would be impossible to write an extractor for each. However, you have to settle in with just the document-level information from such crawls like the URL, meta keywords, blog or news titles, author, date and article content which is still enough information to be happy with if your requirement is analyzing sentiment of the data.

cb1c0_one-size

A generic extractor case

Generic extractors don’t yield accurate results and often mess up the datasets deeming it unusable. Reason being

programatically distinguishing relevant data from irrelevant datasets is a challenge. For example, how would the extractor know to skip pages that have a list of blogs and only extract the ones with the complete article. Or delineating article content from the title on a blog page is not easy either.

To summarize, below is what to expect of a generic extractor.

Pros-

•    minimal manual intervention
•    low on effort and time
•    can work on any scale

Cons-

•    Data quality compromised
•    inaccurate and incomplete datasets
•    lesser details suited only for high-level analyses
•    Suited for gathering- blogs, forums, news
•    Uses- Sentiment Analysis, Brand Monitoring, Competitor Analysis, Social Media Monitoring.

2. Low/Mid scale crawls; detailed datasets – If precise extraction is the mandate, there’s no going away from site-specific extractors. But realistically this is do-able only if your scope of work is limited i.e. few hundred sites or less. Using site-specific extractors, you could extract as many number of fields from any nook or corner of the web pages. Most of the times, most pages on a website share similar templates. If not, they can still be accommodated for using site-specific extractors.

cutlery

Designing extractor for each website

Pros-

•    High data quality
•    Better data coverage on the site

Cons-

High on effort and time

Site structures keep changing from time to time and maintaining these requires a lot of monitoring and manual intervention

Only for limited scale

Suited for gathering – any data from any domain on any site be it product specifications and price details, reviews, blogs, forums, directories, ticket inventories, etc.

Uses- Data Analytics for E-commerce, Business Intelligence, Market Research, Sentiment Analysis

Conclusion

Quite obviously you need both such extractors handy to take care of various use cases. The only way generic extractors can work for detailed datasets is if everyone employs standard data formats on the web (Read our post on standard data formats here). However, given the internet penetration to the masses and the variety of things folks like to do on the web, this is being overly futuristic.

So while site-specific extractors are going to be around for quite some time, the challenge now is to tweak the generic ones to work better. At PromptCloud, we have added ML components to make them smarter and they have been working well for us so far.

What have your challenges been? Do drop in your comments.

Source: https://www.promptcloud.com/blog/scraping-data-site-specific-extractors-vs-generic-extractors/

Wednesday, 20 May 2015

Social Media Crawling & Scraping services for Brand Monitoring

Crawling social media sites for extracting information is a fairly new concept – mainly due to the fact that most of the social media networking sites have cropped up in the last decade or so. But it’s equally (if not more) important to grab this ever-expanding User-Generated-Content (UGC) as this is the data that companies are interested in the most – such as product/service reviews, feedback, complaints, brand monitoring, brand analysis, competitor analysis, overall sentiment towards the brand, and so on.

Scraping social networking sites such as Twitter, Linkedin, Google Plus, Instagram etc. is not an easy task for in-house data acquisition departments of most companies as these sites have complex structures and also restrict the amount and frequency of the data that they let out to crawlers. This kind of a task is best left to an expert, such as PromptCloud’s Social Media Data Acquisition Service – which can take care of your end-to-end requirements and provide you with the desired data in a minimal turnaround time. Most of the popular social networking sites such as Twitter and Facebook let crawlers extract data only through their own API (Application Programming Interface), so as to control the amount of information about their users and their activities.

PromptCloud respects all these restrictions with respect to access to content and frequency of hitting their servers to make sure that user information is not compromised and their experience with the site is unhindered.

Social Media Scraping Experts

At PromptCloud, we have developed an expertise in crawling and scraping social media data in real-time. Such data can be from diverse sources such as – Twitter, Linkedin groups, blogs, news, reviews etc. Popular usage of this data is in brand monitoring, trend watching, sentiment/competitor analysis & customer service, among others.

Our low-latency component can extract data on the basis of specific keywords, categories, geographies, or a combination of these. We can also take care of complexities such as multiple languages as well as tweets and profiles of specific users (based on keywords or geographies). Sample XML data can be accessed through this link – demo.promptcloud.com.

Structured data is delivered via a single REST-based API and every time new content is published, the feed gets updated automatically. We also provide data in any other preferred formats (XML, CSV, XLS etc.).

If you have a social media data acquisition problem that you want to get solved, please do get in touch with us.

Source: https://www.promptcloud.com/social-media-networking-sites-crawling-service/

Sunday, 17 May 2015

Scraping Twitter Lists To Boost Social Outreach (+ Free Tool!)

I published a post a few weeks ago describing how to build your own twitter custom audience list, outlining a variety of techniques to build up your list.

This post outlines another method (hat tip to Ade Lewis for the idea) which requires you to scrape Twitter directly.

If you want to skip all the explanations and just want to download the Twitter List Scraper tool, here you go…

Download the Twitter Scraper Tool for Windows or Mac (completely free)

Disclaimer: Scraping Twitter is against their Terms of Service, so if you decide to do this you do it at your own risk.

Some Benchmarks

Building custom audiences on Twitter requires you to identify Twitter usernames that might be interested in your service or product.

In my previous posts, one of the methods I employed was to pull a competitor’s link profile and scrape social accounts from the linking domains.

Once you upload a custom list, Twitter goes through a process of ‘matching’ against profiles in their system, to make sure the user exists and hasn’t opted out of tailored ads.

As our data was scraped from a list of unqualified websites, the data matching wasn’t likely to be perfect.

Experiments

Since I published that post, I have been experimenting a fair bit with list building, and have built up around 10 custom audience lists. I‘ve uploaded a total of 48,857 Twitter usernames using this method, but only 29,260 were matched by Twitter (just less than 60% match rate).

From some other experiments where I have had better control over the input data, this match rate was between 70-80%.

Since we’ll be scraping Twitter directly, I expect our match rate to be much higher – 90%+

Finding Relevant Twitter Lists

So, we’re going to scrape Twitter, and the first step is to find Twitter lists that will contain users potentially interested in what we have to offer.

As an example, we’ll pretend we’re marketing a music website, and we’ve produced a survey we want to collect responses for.

An advanced Google query can give us lists of music bloggers: site:twitter.com inurl:lists inurl:members inurl:music “music blogger”

Source: http://urlprofiler.com/blog/scraping-twitter/

Wednesday, 13 May 2015

Web Scraping - Data Collection or Illegal Activity?

Web Scraping Defined

We've all heard the term "web scraping" but what is this thing and why should we really care about it?  Web scraping refers to an application that is programmed to simulate human web surfing by accessing websites on behalf of its "user" and collecting large amounts of data that would typically be difficult for the end user to access.  Web scrapers process the unstructured or semi-structured data pages of targeted websites and convert the data into a structured format.  Once the data is in a structured format, the user can extract or manipulate the data with ease.  Web scraping is very similar to web indexing (used by most search engines), but the end motivation is typically much different.  Whereas web indexing is used to help make search engines more efficient, web scraping is typically used for different reasons like change detection, market research, data monitoring, and in some cases, theft.

Why Web Scrape?

There are lots of reasons people (or companies) want to scrape websites, and there are tons of web scraping applications available today.  A quick Internet search will yield numerous web scraping tools written in just about any programming language you prefer.  In today's information-hungry environment, individuals and companies alike are willing to go to great lengths to gather information about all sorts of topics.  Imagine a company that would really like to gather some market research on one of their leading competitors...might they be tempted to invoke a web scraper that gathers all the information for them?  Or, what if someone wanted to find a vulnerable site that allowed otherwise not-so-free downloads?  Or, maybe a less than honest person might want to find a list of account numbers on a site that failed to properly secure them.  The list goes on and on.

I should mention that web scraping is not always a bad thing.  Some websites allow web scraping, but many do not.  It's important to know what a website allows and prohibits before you scrape it.

The Problem With Web Scraping

Web scraping rides a fine line between collecting information and stealing information.  Most websites have a copyright disclosure statement that legally protects their website information.  It's up to the reader/user/scraper to read these disclosure statements and follow along legally and ethically.  In fact, the F5.com website presents the following copyright disclosure:  "All content included on this site, such as text, graphics, logos, button icons, images, audio clips, and software, including the compilation thereof (meaning the collection, arrangement, and assembly), is the property of F5 Networks, Inc., or its content and software suppliers, except as may be stated otherwise, and is protected by U.S. and international copyright laws."  It goes on to say, "We reserve the right to make changes to our site and these disclaimers, terms, and conditions at any time."

So, scraper beware!  There have been many court cases where web scraping turned into felony offenses.  One case involved an online activist who scraped the MIT website and ultimately downloaded millions of academic articles.  This guy is now free on bond, but faces dozens of years in prison and $1 million if convicted.  Another case involves a real estate company who illegally scraped listings and photos from a competitor in an attempt to gain a lead in the market.  Then, there's the case of a regional software company that was convicted of illegally scraping a major database company's websites in order to gain a competitive edge.  The software company had to pay a $20 million fine and the guilty scraper is serving three years probation.  Finally, there's the case of a medical website that hosted sensitive patient information.  In this case, several patients had posted personal drug listings and other private information on closed forums located on the medical website.  The website was scraped by a media-rese
arch firm, and all this information was suddenly public.

While many illegal web scrapers have been caught by the authorities, many more have never been caught and still run loose on websites around the world.  As you can see, it's increasingly important to guard against this activity.  After all, the information on your website belongs to you, and you don't want anyone else taking it without your permission.

The Good News

As we've noted, web scraping is a real problem for many companies today.  The good news is that F5 has web scraping protection built into the Application Security Manager (ASM) of its BIG-IP product family.  As you can see in the screenshot below, the ASM provides web scraping protection against bots, session opening anomalies, session transaction anomalies, and IP address whitelisting.

The bot detection works with clients that accept cookies and process JavaScript.  It counts the client's page consumption speed and declares a client as a bot if a certain number of page changes happen within a given time interval.  The session opening anomaly spots web scrapers that do not accept cookies or process JavaScript.  It counts the number of sessions opened during a given time interval and declares the client as a scraper if the maximum threshold is exceeded.  The session transaction anomaly detects valid sessions that visit the site much more than other clients.  This defense is looking at a bigger picture and it blocks sessions that exceed a calculated baseline number that is derived from a current session table.  The IP address whitelist allows known friendly bots and crawlers (i.e. Google, Bing, Yahoo, Ask, etc), and this list can be populated as needed to fit the needs of your organization.

I won't go into all the details here because I'll have some future articles that dive into the details of how the ASM protects against these types of web scraping capabilities.  But, suffice it to say, ASM does a great job of protecting your website against the problem of web scraping.

I'm sure as you studied the screenshot above you also noticed lots of other protection capabilities the ASM provides...brute force attack prevention, customized attack signatures, Denial of Service protection, etc.  You might be wondering how it does all that stuff as well.  Give us a little feedback on the topics you would like to see, and we'll start posting some targeted tech tips for you!

Thanks for reading this introductory web scraping article...and, be sure to come back for the deeper look into how the ASM is configured to handle this problem. For more information, check out this video from Peter Silva where he discusses ASM botnet and web scraping defense.

Source: https://devcentral.f5.com/articles/web-scraping-data-collection-or-illegal-activity

Saturday, 2 May 2015

Lawyers & Attorneys Website Data Scraping Services

There are so many instances where one end’s up needing information from lawyers or bar associations. However, if you approach them directly or look for other ways to get information it might either be difficult or you might not get the information you are looking for. Thus, the best way to go about the scraping lawyer data.

Scraping lawyer data allow you to get information from various attorney websites, bar association websites, or other related websites. Using web scraping tools for getting such information makes it much easier to get all the relevant and important information without actually having to worry about the same.

If you wish to scrape data from lawyer, you are entitled to information such as lawyer name, firm names, address, contact details, history about the lawyers, educational qualifications, the bar association they are part of and much more.

Scraping lawyer data ensure that you also have images of the lawyer you are concentrating on. The result of scrape data form lawyer can be obtained in any format the user wants such as csv, excel, MySql etc. Scraping lawyer data also ensures that none of the information provided are repetitive or redundant.

If you are in need of information regarding any lawyer such as their contact details, address etc. it could end up being a huge and difficult task to get it manually or physically. Thus, taking off the help of scraping tools would ensure that you get all the needed information without actually having to bother about anything at all. The presence of lots of attorney websites and the fact that more and more lawyers are moving to the internet makes getting information easy with the help of some great tools. Scraping data is a very useful and handy method in which one can get all the required and relevant information and that too in a very easy to read format, which makes the method even worthier.

There are quite a few tools or services that you can take help of to get lawyers data scraped. Most of these services also provide with a sample demo and that free of cost. From the sample one can decide if they wish to continue with the services or try some other services. Thus, if you want any information from attorney websites or information about any lawyers, data scraping is a great way to get the same.

Source: https://3idatascraping.wordpress.com/2014/03/18/lawyers-attorneys-website-data-scraping-services/

Tuesday, 28 April 2015

Web Scraping – An Illegal Activity or Simple Data Collection?

Gone are the days when skillful extraction of information pertaining to real estate such as foreclosures, homes for sale, or mortgage records was considered difficult. Now, it is not only easy to extract data from real estate websites but also scrape real estate data on a consistent basis to add more value to your portal, or ensure that updated data is available to your visitors at all times. From downloading actual scanned documents in the form of PDF files to scraping websites for deeds or mortgages, smartly designer data extraction tools can do it all.

However, the one question that still manages to come to the front in the minds of those who scrape real estate listings and others are whether the act is illegal in nature or a simple way of collecting data.

Take a look.

Web Scraping—What is it?

Generally speaking, web scraping refers to programs that are designed to simulate human internet surfing and access websites on behalf of their users. These tools are effective in collecting large quantities of data that are otherwise difficult for end users to access. They process semi-structured or unstructured data pages of targeted websites and transform available data into a more structured format that can be extracted or manipulated by the user easily.

Quite similar to web indexing that is used by search engines, the end motivation of web scraping is much different. While web indexing makes search engines far more efficient, the latter is used for reasons like market research, change detection, data monitoring, or in some events, theft. But then, it is not always a bad thing. You just need to know if a website allows web scraping before proceeding with the act.

Fine Line between Stealing and Collecting Information

Web scraping rides an extremely fine line between the acts of collecting relevant information and stealing the same. The websites that have copyright disclosure statements in place to protect their website information are offended by outsiders raiding their data without due permission. In other words, it amounts to trespassing on their portal, which is unacceptable—both ethically and legally. So, it is very important for you to read all disclosure statements carefully and follow along in the right way. As web scraping cases may turn into felony offenses, it is best to guard against any kind of scrupulous activity and take permission before scraping data.

The Good News

However, all is not grey in data extraction processes. Reputed agencies are helping their clients scrape valuable data for gaining more value through legal means and carefully used tools. If you are looking for such services, then do get in touch with a reliable web scraping company of your choice and take your business to the next levels of success.

Source: https://3idatascraping.wordpress.com/2015/03/11/web-scraping-an-illegal-activity-or-simple-data-collection/

Sunday, 26 April 2015

I Don’t Need No Stinking API: Web Scraping For Fun and Profit

If you’ve ever needed to pull data from a third party website, chances are you started by checking to see if they had an official API. But did you know that there’s a source of structured data that virtually every website on the internet supports automatically, by default?

scraper toolThat’s right, we’re talking about pulling our data straight out of HTML — otherwise known as web scraping. Here’s why web scraping is awesome:

Any content that can be viewed on a webpage can be scraped. Period.

If a website provides a way for a visitor’s browser to download content and render that content in a structured way, then almost by definition, that content can be accessed programmatically. In this article, I’ll show you how.

Over the past few years, I’ve scraped dozens of websites — from music blogs and fashion retailers to the USPTO and undocumented JSON endpoints I found by inspecting network traffic in my browser.

There are some tricks that site owners will use to thwart this type of access — which we’ll dive into later — but they almost all have simple work-arounds.

Why You Should Scrape

But first we’ll start with some great reasons why you should consider web scraping first, before you start looking for APIs or RSS feeds or other, more traditional forms of structured data.

Websites are More Important Than APIs

The biggest one is that site owners generally care way more about maintaining their public-facing visitor website than they do about their structured data feeds.

We’ve seen it very publicly with Twitter clamping down on their developer ecosystem, and I’ve seen it multiple times in my projects where APIs change or feeds move without warning.

Sometimes it’s deliberate, but most of the time these sorts of problems happen because no one at the organization really cares or maintains the structured data. If it goes offline or gets horribly mangled, no one really notices.

Whereas if the website goes down or is having issues, that’s a more of an in-your-face, drop-everything-until-this-is-fixed kind of problem, and gets dealt with quickly.

No Rate-Limiting

Another thing to think about is that the concept of rate-limiting is virtually non-existent for public websites.

Aside from the occasional captchas on sign up pages, most businesses generally don’t build a lot of defenses against automated access. I’ve scraped a single site for over 4 hours at a time and not seen any issues.

Unless you’re making concurrent requests, you probably won’t be viewed as a DDOS attack, you’ll just show up as a super-avid visitor in the logs, in case anyone’s looking.

Anonymous Access

There are also fewer ways for the website’s administrators to track your behavior, which can be useful if you want gather data more privately.

With APIs, you often have to register to get a key and then send along that key with every request. But with simple HTTP requests, you’re basically anonymous besides your IP address and cookies, which can be easily spoofed.

The Data’s Already in Your Face

Web scraping is also universally available, as I mentioned earlier. You don’t have to wait for a site to open up an API or even contact anyone at the organization. Just spend some time browsing the site until you find the data you need and figure out some basic access patterns — which we’ll talk about next.

Let’s Get to Scraping

So you’ve decided you want to dive in and start grabbing data like a true hacker. Awesome.

Just like reading API docs, it takes a bit of work up front to figure out how the data is structured and how you can access it. Unlike APIs however, there’s really no documentation so you have to be a little clever about it.

I’ll share some of the tips I’ve learned along the way.

Fetching the Data

So the first thing you’re going to need to do is fetch the data. You’ll need to start by finding your “endpoints” — the URL or URLs that return the data you need.

If you know you need your information organized in a certain way — or only need a specific subset of it — you can browse through the site using their navigation. Pay attention to the URLs and how they change as you click between sections and drill down into sub-sections.

The other option for getting started is to go straight to the site’s search functionality. Try typing in a few different terms and again, pay attention to the URL and how it changes depending on what you search for. You’ll probably see a GET parameter like q= that always changes based on you search term.

Try removing other unnecessary GET parameters from the URL, until you’re left with only the ones you need to load your data. Make sure that there’s always a beginning ? to start the query string and a & between each key/value pair.

Dealing with Pagination

At this point, you should be starting to see the data you want access to, but there’s usually some sort of pagination issue keeping you from seeing all of it at once. Most regular APIs do this as well, to keep single requests from slamming the database.

Usually, clicking to page 2 adds some sort of offset= parameter to the URL, which is usually either the page number or else the number of items displayed on the page. Try changing this to some really high number and see what response you get when you “fall off the end” of the data.

With this information, you can now iterate over every page of results, incrementing the offset parameter as necessary, until you hit that “end of data” condition.

The other thing you can try doing is changing the “Display X Per Page” which most pagination UIs now have. Again, look for a new GET parameter to be appended to the URL which indicates how many items are on the page.

Try setting this to some arbitrarily large number to see if the server will return all the information you need in a single request. Sometimes there’ll be some limits enforced server-side that you can’t get around by tampering with this, but it’s still worth a shot since it can cut down on the number of pages you must paginate through to get all the data you need.

AJAX Isn’t That Bad!

Sometimes people see web pages with URL fragments # and AJAX content loading and think a site can’t be scraped. On the contrary! If a site is using AJAX to load the data, that probably makes it even easier to pull the information you need.

The AJAX response is probably coming back in some nicely-structured way (probably JSON!) in order to be rendered on the page with Javscript.

All you have to do is pull up the network tab in Web Inspector or Firebug and look through the XHR requests for the ones that seem to be pulling in your data.

Once you find it, you can leave the crufty HTML behind and focus instead on this endpoint, which is essentially an undocumented API.

(Un)structured Data?

Now that you’ve figured out how to get the data you need from the server, the somewhat tricky part is getting the data you need out of the page’s markup.

Use CSS Hooks

In my experience, this is usually straightforward since most web designers litter the markup with tons of classes and ids to provide hooks for their CSS.

You can piggyback on these to jump to the parts of the markup that contain the data you need.

Just right click on a section of information you need and pull up the Web Inspector or Firebug to look at it. Zoom up and down through the DOM tree until you find the outermost <div> around the item you want.

This <div> should be the outer wrapper around a single item you want access to. It probably has some class attribute which you can use to easily pull out all of the other wrapper elements on the page. You can then iterate over these just as you would iterate over the items returned by an API response.

A note here though: the DOM tree that is presented by the inspector isn’t always the same as the DOM tree represented by the HTML sent back by the website. It’s possible that the DOM you see in the inspector has been modified by Javascript — or sometime even the browser, if it’s in quirks mode.

Once you find the right node in the DOM tree, you should always view the source of the page (“right click” > “View Source”) to make sure the elements you need are actually showing up in the raw HTML.

This issue has caused me a number of head-scratchers.

Get a Good HTML Parsing Library

It is probably a horrible idea to try parsing the HTML of the page as a long string (although there are times I’ve needed to fall back on that). Spend some time doing research for a good HTML parsing library in your language of choice.

Most of the code I write is in Python, and I love BeautifulSoup for its error handling and super-simple API. I also love its motto:

    You didn’t write that awful page. You’re just trying to get some data out of it. Beautiful Soup is here to help. :)

You’re going to have a bad time if you try to use an XML parser since most websites out there don’t actually validate as properly formed XML (sorry XHTML!) and will give you a ton of errors.

A good library will read in the HTML that you pull in using some HTTP library (hat tip to the Requests library if you’re writing Python) and turn it into an object that you can traverse and iterate over to your heart’s content, similar to a JSON object.

Some Traps To Know About

I should mention that some websites explicitly prohibit the use of automated scraping, so it’s a good idea to read your target site’s Terms of Use to see if you’re going to make anyone upset by scraping.

For two-thirds of the website I’ve scraped, the above steps are all you need. Just fire off a request to your “endpoint” and parse the returned data.

But sometimes, you’ll find that the response you get when scraping isn’t what you saw when you visited the site yourself.

When In Doubt, Spoof Headers

Some websites require that your User Agent string is set to something they allow, or you need to set certain cookies or other headers in order to get a proper response.

Depending on the HTTP library you’re using to make requests, this is usually pretty straightforward. I just browse the site in my web browser and then grab all of the headers that my browser is automatically sending. Then I put those in a dictionary and send them along with my request.

Note that this might mean grabbing some login or other session cookie, which might identify you and make your scraping less anonymous. It’s up to you how serious of a risk that is.

Content Behind A Login

Sometimes you might need to create an account and login to access the information you need. If you have a good HTTP library that handles logins and automatically sending session cookies (did I mention how awesome Requests is?), then you just need your scraper login before it gets to work.

Note that this obviously makes you totally non-anonymous to the third party website so all of your scraping behavior is probably pretty easy to trace back to you if anyone on their side cared to look.

Rate Limiting

I’ve never actually run into this issue myself, although I did have to plan for it one time. I was using a web service that had a strict rate limit that I knew I’d exceed fairly quickly.

Since the third party service conducted rate-limiting based on IP address (stated in their docs), my solution was to put the code that hit their service into some client-side Javascript, and then send the results back to my server from each of the clients.

This way, the requests would appear to come from thousands of different places, since each client would presumably have their own unique IP address, and none of them would individually be going over the rate limit.

Depending on your application, this could work for you.

Poorly Formed Markup

Sadly, this is the one condition that there really is no cure for. If the markup doesn’t come close to validating, then the site is not only keeping you out, but also serving a degraded browsing experience to all of their visitors.

It’s worth digging into your HTML parsing library to see if there’s any setting for error tolerance. Sometimes this can help.

If not, you can always try falling back on treating the entire HTML document as a long string and do all of your parsing as string splitting or — God forbid — a giant regex.

Source: https://blog.hartleybrody.com/web-scraping/

Wednesday, 22 April 2015

Hand Scraped Flooring For a Natural and Unique Look

An option in hardwood flooring that is being increasingly adopted by those looking for something new, innovative and unique for their homes is hand scraped flooring. This type of wood flooring helps one achieve a distinct natural look on one's floor and also has a couple of advantages.

There are three types of scraping that you can get done on your wooden flooring: light, medium and hard. Preferably, if you have a light colored woodwork, then you should go for light scraping and if your floor has a darker shade, then you should opt for hard scraping. But, irrespective of the type of scraping you go for, you must ensure that the laborers doing the scraping are very skilled and impeccable in their job as hand scraping floors is an art that demands patience, time, talent and hard work.

Nowadays, many people tend to go for machine scraping, attracted by the lower investment involved in it. But such people are unable to achieve the requisite natural effect on their floors as machines create patterns on the floors that are easily detectable. These patterns do not emerge with hand scraping and the consequent look is as random and unique as it gets.

Though such scraped flooring is a costly option in flooring, it demands little maintenance. While with perfectly smooth surfaces, you will be always on the edge ensuring that there are no scratches, with hand scraped floors, you will not have to be concerned about this as any new scratches will only add to the already distressed appearance of the flooring.

Prefinished hand scraped wood flooring is also available in the market nowadays. These eliminate the need of any on-site scraping. But this option is of course unsuitable for those who have already got their floors installed. As it is, if you get on-site scraping done, you will have more control over things as you would be able to see the scraping as it develops and would be therefore in a position to exercise your preferences more.

Source: http://ezinearticles.com/?Hand-Scraped-Flooring-For-a-Natural-and-Unique-Look&id=4581623

Saturday, 18 April 2015

Data Mining and Predictive Analysis

Data collection and curing is the core foundation of most businesses. Database building thus is an important function and activity where enterprises invest heavily. With information now available on the Internet and easily obtained, it raises the importance of having professionals who crawl data and offer web scraping services.

Once the data is accessed, though, it is important to filter out the relevant data based on the business need. Although Many DaaS provider convert the unstructured web data into meaningful structured data it is recommended to be internally equipped to use the data to its maximum.

This understanding has given rise to the field of Data Mining. Data Mining is designed to explore large amounts of data in search of consistent patterns and connections between the variables and validate the findings by applying the detected patterns to the new sets of the data. Once these connections are established and understood, the end goal is to be able to predict the possible outcomes using predictive analysis techniques.

Together, both Data Mining and predictive analysis aid in making marketing campaigns more efficient. While predictive analysis helps simulate and understand what may happen, data mining helps identify exciting data patterns and connections.

The process of Data Mining and Predictive analysis consists of 3 steps

Exploration


Once a database is compiled, it needs to be cleaned, analysed and potential connections need to be built. This process involves filtering the relevant data and identifying the possible predictors. Data Exploration also sets a premise for preliminary feature selection to manage number of variables. This data is then prepared for statistical analysis using a wide variety of graphical and statistical parameters. This helps identify the most relevant variables and setups the predictive models to be built.

Data mining process

Validation


Next comes building various models and choosing the most relevant ones. This decision is based on their possible predictive performance and of being able to produce stable results across all the samples. Simple as it sounds, to truly get the results, all possible models must be treated with data to simulate scenarios. The model with most stable statistical feature is validated.

Application

Once the relevant models are finalised, the same is applied to new data to understand and predict the estimated outcomes. Application of data models is an ongoing and complex process since every new dataset needs to be configured in the model.

Data Mining and predictive analysis essentially involves blending statistical methodology where the traditional statistics machine learning and complex algorithms. This greatly increases the need for efficient and skilled data handlers. This could include data analysts and scientists.

See how you can become data scientist here:

Data crunchers use data mining and predictive analysis actively to get an edge in the big data management. Database platforms like Hadoop assist in database management and large-scale distribution. But the costs involved in setting up data centres and big data management capacity are high. Budgets allocated within the enterprise are more project-focussed and analytics budgets are usually limited. Quite often, big data and analytics project fail to launch because of this problem! The other problem is that to run effective predictive models, data requires to be handled by scientists with experience. Finding and setting together a technologically-advanced team is a daunting task most enterprises face outside the tech domain.

Predictive Analysis model

A predictive analysis model is essentially predicting the all possible outcomes from a given set of data. Here are a few steps that can be taken to help build and identify the “ideal” predictive analysis model. These steps more or less mirror the usual statistical methodology of building a test model.

Defining an objective

This is the first and a critical step. Unless the objective is identified and defined there can be no concrete results since there wouldn’t be clarity to compare the final outcome to the expected result. It also helps understand the scope of the project.

Preparing the data


This is more to do with data mining. Historic data used for training the model is scattered across multiple platforms and sources. To compound the problem, data can be unstructured with possible duplicate accounts and missing values! Data quality determines the quality of the model, and thus it becomes imperative that data is healthy and relevant.

Data Sampling

Once mined, Data is essentially split into 2 parts. One set is for training that is used to build the model and the second is the ‘test’ set that is used to verify the accuracy of the final output. This also helps identify and filter the noise component.

Model Building

Sampling cam equally result in a single algorithm or parallel & connected algorithms. In such a case the data goes through multiple testing and a decision is based on the final output.

Execution

Once a model gets finalised, the other teams in the organization need to be involved to build a deployable model and understand its impact on the overall business.

The possibilities with Data mining & Predictive analysis are huge. It also gives a huge room for learning and experimenting. There are several tools available in the industry to aid through all the steps of data mining and predictive analysis. The combination of human expertise and intellect along with the help of the available tools and the overall cooperation within the multiple channels within the organization essentially ensures a stronger grip on the ability to build a solid predictive model.

When used together, predictive analytics and data mining help marketing professionals anticipate and get ready for customer needs, rather than just reacting to them.

Source: https://www.promptcloud.com/blog/data-mining-and-predictive-analysis/

Tuesday, 7 April 2015

rvest: easy web scraping with R

rvest is new package that makes it easy to scrape (or harvest) data from html web pages, inspired by libraries like beautiful soup. It is designed to work with magrittr so that you can express complex operations as elegant pipelines composed of simple, easily understood pieces. Install it with:

install.packages("rvest")

rvest in action

To see rvest in action, imagine we’d like to scrape some information about The Lego Movie from IMDB. We start by downloading and parsing the file with html():

library(rvest)

lego_movie <- html("http://www.imdb.com/title/tt1490017/")

To extract the rating, we start with selectorgadget to figure out which css selector matches the data we want: strong span. (If you haven’t heard of selectorgadget, make sure to read vignette("selectorgadget") – it’s the easiest way to determine which selector extracts the data that you’re interested in.) We use html_node() to find the first node that matches that selector, extract its contents with html_text(), and convert it to numeric with as.numeric():

lego_movie %>%

  html_node("strong span") %>%

  html_text() %>%

  as.numeric()

#> [1] 7.9

We use a similar process to extract the cast, using html_nodes() to find all nodes that match the selector:

lego_movie %>%

  html_nodes("#titleCast .itemprop span") %>%

  html_text()

#>  [1] "Will Arnett"     "Elizabeth Banks" "Craig Berry"   

#>  [4] "Alison Brie"     "David Burrows"   "Anthony Daniels"

#>  [7] "Charlie Day"     "Amanda Farinos"  "Keith Ferguson"

#> [10] "Will Ferrell"    "Will Forte"      "Dave Franco"   

#> [13] "Morgan Freeman"  "Todd Hansen"     "Jonah Hill"

The titles and authors of recent message board postings are stored in a the third table on the page. We can use html_node() and [[ to find it, then coerce it to a data frame with html_table():

lego_movie %>%

  html_nodes("table") %>%

  .[[3]] %>%

  html_table()

#>                                              X 1            NA

#> 1 this movie is very very deep and philosophical   mrdoctor524

#> 2 This got an 8.0 and Wizard of Oz got an 8.1...  marr-justinm

#> 3                         Discouraging Building?       Laestig

#> 4                              LEGO - the plural      neil-476

#> 5                                 Academy Awards   browncoatjw

#> 6                    what was the funniest part? actionjacksin

Other important functions

•    If you prefer, you can use xpath selectors instead of css: html_nodes(doc, xpath = "//table//td")).

•    Extract the tag names with html_tag(), text with html_text(), a single attribute with html_attr() or all attributes with html_attrs().

•    Detect and repair text encoding problems with guess_encoding() and repair_encoding().

•    Navigate around a website as if you’re in a browser with html_session(), jump_to(), follow_link(), back(), and forward(). Extract, modify and submit forms with html_form(), set_values() and submit_form(). (This is still a work in progress, so I’d love your feedback.)

To see these functions in action, check out package demos with demo(package = "rvest").

Source: http://blog.rstudio.org/2014/11/24/rvest-easy-web-scraping-with-r/

Sunday, 5 April 2015

How to Generate Sales Leads Using Web Scraping Services

The first stage of any selling process is what is popularly known as “lead generation”. This phase is what most businesses place at the apex of their sales concerns. It is a driving force that governs decision-making at its highest levels, and influences business strategy and planning. If you are about to embark on an outbound sales campaign and are in the process of looking for leads, you would acknowledge the fact that lead generation process is of extreme importance for any business.

Different lead generation techniques have been used over and over again by companies around the world to satiate this growing business need. Newer, more innovative methods have also emerged to help marketers in this process. One such method of lead generation that is fast catching on, and is poised to play a big role for businesses in the coming years, is web scraping. With web scraping, you can easily get access to multiple relevant and highly customized leads – a perfect starting point for any marketing, promotional or sales campaign.

The prominence of Web Scraping in overall marketing strategy

At present, levels of competition have risen sky high for most businesses. For success, lead generation and gaining insight about customer behavior and preferences is an essential business requirement. Web scraping is the process of scraping or mining the internet for information. Different tools and techniques can be used to harvest information from multiple internet sources based on relevance, and the structured and organized in a way that makes sense to your business. Companies that provide web scraping services essentially use web scrapers to generate a targeted lead database that your company can then integrate into its marketing and sales strategies and plans.

The actual process of web scraping involves creating scraping scripts or algorithms which crawl the web for information based on certain preset parameters and options. The scraping process can be customized and tuned towards finding the kind of data that your business needs. The script can extract data from websites automatically, collate and put together a meaningful collection of leads for business development.

Lead Generation Basics

At a very high level, any person who has the resources and the intent to purchase your product or service qualifies as a lead. In the present scenario, you need to go far deeper than that. Marketers need to observe behavior patterns and purchasing trends to ensure that a particular person qualifies as a lead. If you have a group of people you are targeting, you need to decide who the viable leads will be, acquire their contact information and store it in a database for further action.

List buying used to be a popular way to get leads, but their efficacy has dwindled over time. Web scraping is the fast coming up as a feasible lead generation technique, allowing you to find highly focused and targeted leads in short amounts of time. All you need is a service provider that would carry out the data mining necessary for lead generation, and you end up with a list of actionable leads that you can try selling to.

How Web Scraping makes a substantial difference

With web scraping, you can extract valuable predictive information from websites. Web scraping facilitates high quality data collection and allows you to structure marketing and sales campaigns better. To drive sales and maximize revenue, you need strong, viable leads. To facilitate this, you need critical data which encompasses customer behavior, contact details, buying patterns and trends, willingness and ability to spend resources, and a myriad of other aspects critical to ascertain the potential of an entity as a rewarding lead. Data mining through web scraping can be a great way to get to these factors and identifying the leads that would make a difference for your business.

web-scraping-service


Crawling through many different web locales using different techniques, web scraping services pick up a wealth of information. This highly relevant and specialized information instantly provides your business with actionable leads. Furthermore, this exercise allows you to fine-tune your data management processes, make more accurate and reliable predictions and projections, arrive at more effective, strategic and marketing decisions and customize your workflow and business development to better suit the current market.

The Process and the Tools

Lead generation, being one of the most important processes for any business, can prove to be an expensive proposition if not handled strategically. Companies spend large amounts of their resources acquiring viable leads they can sell to. With web scraping, you can dramatically cut down the costs involved in lead generation and take your business forward with speed and efficiency. Here are some of the time-tested web scraping tools which can come in handy for lead generation –

•    Website download software – Used to copy entire websites to local storage. All website pages are downloaded and the hierarchy of navigation and internal links preserve. The stored pages can then be viewed and scoured for information at any later time.     Web scraper – Tools that crawl through bulk information on the internet, extracting specific, relevant data using a set of pre-defined parameters.

•    Data grabber – Sifts through websites and databases fast and extracts all the information, which can be sorted and classified later.

•    Text extractor – Can be used to scrape multiple websites or locations for acquiring text content from websites and web documents. It can mine data from a variety of text file formats and platforms.

With these tools, web scraping services scrape websites for lead generation and provide your business with a set of strong, actionable leads that can make a difference.

Covering all Bases


The strength of web scraping and web crawling lies in the fact that it covers all the necessary bases when it comes to lead generation. Data is harvested, structured, categorized and organized in such a way that businesses can easily use the data provided for their sales leads. As discussed earlier, cold and detached lists no longer provide you with enough actionable leads. You need to look at various factors and consider them during your lead generation efforts –

•    Contact details of the prospect

•    Purchasing power and purchasing history of the prospect

•    Past purchasing trends, willingness to purchase and history of buying preferences of the prospect

•    Social markers that are indicative of behavioral patterns

•    Commercial and business markers that are indicative of behavioral patterns

•    Transactional details

•    Other factors including age, gender, demography, social circles, language and interests

All these factors need to be taken into account and considered in detail if you have to ensure whether a lead is viable and actionable, or not. With web scraping you can get enough data about every single prospect, connect all the data collected with the help of onboarding, and ascertain with conviction whether a particular prospect will be viable for your business.

Let us take a look at how web scraping addresses these different factors –

1. Scraping website’s


During the scraping process, all websites where a particular prospect has some participation are crawled for data. Seemingly disjointed data can be made into a sensible unit by the use of onboarding- linking user activities with their online entities with the help of user IDs. Documents can be scanned for participation. E-commerce portals can be scanned to find comments and ratings a prospect might have delivered to certain products. Service providers’ websites can be scraped to find if the prospect has given a testimonial to any particular service. All these details can then be accumulated into a meaningful data collection that is indicative of the purchasing power and intent of the prospect, along with important data about buying preferences and tastes.

2. Social scraping

According to a study, most internet users spend upwards of two hours every day on social networks. Therefore, scraping social networks is a great way to explore prospects in detail. Initially, you can get important identification markers like names, addresses, contact numbers and email addresses. Further, social networks can also supply information about age, gender, demography and language choices. From this basic starting point, further details can be added by scraping social activity over long periods of time and looking for activities which indicate purchasing preferences, trends and interests. This exercise provides highly relevant and targeted information about prospects can be constructively used while designing sales campaigns.

Check out How to use Twitter data for your business

3. Transaction scraping


Through the scraping of transactions, you get a clear idea about the purchasing power of prospects. If you are looking for certain income groups or leads that invest in certain market sectors or during certain specific periods of time, transaction scraping is the best way to harvest meaningful information. This also helps you with competition analysis and provides you with pointers to fine-tune your marketing and sales strategies.

get-results-from-your-lead-generation-campaign

Using these varied lead generation techniques and finding the right balance and combination is key to securing the right leads for your business. Overall, signing up for web scraping services can be a make or break factor for your business going forward. With a steady supply of valuable leads, you can supercharge your sales, maximize returns and craft the perfect marketing maneuvers to take your business to an altogether new dimension.

Source: https://www.promptcloud.com/blog/how-to-generate-sales-leads-using-web-scraping-services/

Friday, 27 March 2015

Scraping expert's Amazon Scraper provides huge access to find your desired product on Amazon

Today, with latest advancement of technology we find plenty of ecommerce websites offering huge benefits to people by giving out various products from different categories at an affordable cost. One of the most renowned ecommerce website Amazon has come up with its all new launch of Amazon Scraper for the comfort of their customers. This product Amazon Scraper is also called web harvesting which is a computer software technique for getting out data from websites.

Today anyone can find such web scraping tools that are specifically designed for particular websites. Like for example, Amazon Scraper is also a web scraper tool or technique utilised to crawl, or scrap or even extract the data from the largest e commerce website called Amazon.com. Scrapingexpert.com offers best Amazon scraper for extracting plenty of products from websites easily.
  • Amazon scraper
  • Let us see how the Amazon Scraper works:

How to use: Step 1) Select the Category; Enter the (Keyword, UPC, and ASIN) Step 2) Set the delay in seconds Step 3) Click Start

Also you can Scrape the below given details from Amazon.com:
  •     Product Title & Description
  •     Category & Cost Manufacture,
  •     QTY Seller Name,
  •     Total Sellers Shipping Cost,
  •     Shipping / Product Weight ImageURL, IsBuyBoxFBA, Source Link
  •     Stars, Customer Reviews
  •     ASIN, UPC, Model Number Sales Rank,
  •     Sales Rank In Category

Here are some interesting Product Features:

  •     Single Screen Dashboard that shows total extracted records, extracted keywords, and elapse.
  •     Filter Search - Skip data that do not match phrases or keywords
  •     Compatible for Microsoft XP/Vista/Windows 7
  •     Option to set delay between requests to simulate a human surfing in a browser
  •     Extracted data is stored in CSV format, which you can easily open in excel
  • Benefits:
  •     Less Expensive - With our valuable services, we allow you to save both your efforts and money. We have some competitors who outsource their scraping projects to us.
  •     Guaranteed Accurate Results - We assure you get most reliable solutions with accurate results that cannot be collected by any ordinary human being or anyone else.
  •     Delivers Fast Results - We promise to get your work done in just few hours, which can take plenty of time if done by someone else. We save your time, workforce and money and give you an opportunity to stand at a distinction over your multiple competitors.
  •     System Requirement: Operating System - Windows XP, Windows Vista, Windows 7 Net Framework 2.0

Are you searching for some cost effective programs to extract data of other users? If your answer is yes, then we offer Amazon Screen Scraping which is the best Amazon Screen Scraping method of data extraction. Today, in this competitive world of advanced technology there are multiple companies who claim to offer best Amazon Screen Scraping services, so hiring their services for Amazon Screen Scraping can allow you to scrap almost any data in any format you wish to obtain. Well, we at Scrapingexpert.com study each and every single bit of little details of the scraping project and then provide you with a free quote and the date of completing the work

In order to get accurate data pertaining to a specific product, you can use our Awesome Amazon Scraper Tool. This Awesome Amazon Scraping Tool is very effective tool that will help you to extract information about any product from Amazon.

Websitedatascraping.com is enough capable to web data scraping, website data scraping, web scraping services, website scraping services, data scraping services, product information scraping and yellowpages data scraping.