R - httr Web interaction

October 09, 2020

Interact with web

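raw <- GET(url_string): retrieves a resource from the server. A status code starting with 2 or 3 is fine, one starting with 4 means the problem is on your side (a bad request), and one starting with 5 means the problem is on the server's side.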

POST(url_string, body = data): sends data to the server.

content(raw, as = "text") or content(raw, as = "parsed"), from httr: reads the body of the GET result. "parsed" is what you normally get by default.

http_error(raw) returns TRUE if the request failed, which makes error handling easier.
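The calls above can be combined into one small helper. This is only a sketch: the function name fetch_text is made up for illustration, and it assumes the server returns a text body.

```r
library(httr)

# Hypothetical helper: fetch a URL and return its body as text,
# stopping with a readable message if the request failed.
fetch_text <- function(url_string) {
  raw <- GET(url_string)

  # http_error() is TRUE for 4xx and 5xx status codes
  if (http_error(raw)) {
    stop("Request failed with status ", status_code(raw))
  }

  # as = "text" returns the body as a single character string
  content(raw, as = "text", encoding = "UTF-8")
}
```

Calling fetch_text("https://example.com") (a placeholder URL) would return the page body as a string, or stop with the status code if the request failed.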

GET(url, user_agent("my@email.address this is a test"), query = params): the user agent gives the webmaster extra info in case anything goes wrong. query adds parameters to the URL more easily than string concatenation or paste(). Example: params = list(x = "asd", y = "qwe") -> url?x=asd&y=qwe .
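To see what query does without sending a request, httr's modify_url() can build the final URL from the same kind of list. The URL below is a placeholder, and the GET call is left commented out so nothing is sent.

```r
library(httr)

# Parameters as a named list, as in the note above
params <- list(x = "asd", y = "qwe")

# Build the URL that a GET with query = params would request
full_url <- modify_url("https://example.com/api", query = params)
print(full_url)

# The user agent is a config object passed alongside the URL
ua <- user_agent("my@email.address this is a test")

# The real request would look like this (not run here):
# raw <- GET("https://example.com/api", ua, query = params)
```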

paste(..., sep = "/") glues strings together, separated by "/".
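A quick example of building an endpoint path with paste(). The base URL and path pieces are placeholders.

```r
# Glue URL pieces together with "/" between them
base_url <- "https://example.com"
endpoint <- paste(base_url, "api", "users", sep = "/")
print(endpoint)  # "https://example.com/api/users"
```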

http_type(raw) returns the content type of the response, e.g. "application/json".

JSON

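fromJSON(content(raw, as = "text")) converts the JSON text of the response into R lists and data frames. fromJSON needs the text, not the already parsed content.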

rlist package: list.select(json, var1, var2) picks the named fields out of each element of a JSON list, and list.stack() stacks the result of list.select into a data frame.

bind_rows, from dplyr, turns a list into a data frame.
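A small sketch of the JSON helpers above, using an inline JSON string instead of a real response so it runs offline. The field names (name, age, city) are made up, and fromJSON is assumed to be the one from jsonlite.

```r
library(jsonlite)
library(rlist)
library(dplyr)

# Stand-in for content(raw, as = "text")
json_text <- '[{"name": "a", "age": 1, "city": "X"},
               {"name": "b", "age": 2, "city": "Y"}]'

# simplifyVector = FALSE keeps the result as a list of records
records <- fromJSON(json_text, simplifyVector = FALSE)

# Keep only two fields from each record, then stack into a data frame
selected <- list.select(records, name, age)
df_rlist <- list.stack(selected)

# bind_rows keeps every field and returns one row per record
df_dplyr <- bind_rows(records)

print(df_rlist)
print(df_dplyr)
```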

XML

xml2 package

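read_xml() parses XML text (or a file or URL) into an XML document.

xml_structure() prints the node structure of the document.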

xml_find_all(xml_result, xpath = "api/abc/xyz/rev") extracts all nodes matching the XPath. To find all rev nodes regardless of path, use "//rev". Each node is something like <node> </node>.

xml_find_first is similar to xml_find_all but returns only the first match.

xml_text, xml_double, xml_integer extract data out of a nodeset (the result of xml_find_all).

xml_attrs returns all attributes of the nodes, xml_attr returns one specific attribute.
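The xml2 functions above, tried on a small inline document so no request is needed. The XML content and attribute names (id, user) are invented to match the api/abc/xyz/rev path used in the note.

```r
library(xml2)

xml_string <- '<api><abc><xyz>
  <rev id="r1" user="kim">10</rev>
  <rev id="r2" user="lee">20</rev>
</xyz></abc></api>'

doc <- read_xml(xml_string)
xml_structure(doc)  # prints the nesting of nodes

# Full path versus "anywhere in the document"
revs_by_path  <- xml_find_all(doc, "/api/abc/xyz/rev")
revs_anywhere <- xml_find_all(doc, "//rev")
first_rev     <- xml_find_first(doc, "//rev")

# Pull data out of the nodeset
rev_text   <- xml_text(revs_anywhere)     # character vector
rev_values <- xml_integer(revs_anywhere)  # integer vector

# One attribute for every node, or all attributes of one node
rev_ids   <- xml_attr(revs_anywhere, "id")
all_attrs <- xml_attrs(first_rev)
```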

Web scraping

rvest package

read_html(url)

html_node(result_above, xpath = "//node") returns the first node matching the XPath.

html_text(node) gives the text of the node, html_attr(node, name = "") gives the value of the named attribute, html_name(node) gives the name of the node.

html_table() accepts a table node object and outputs a data frame.

html_nodes(test_xml, css = "..") selects by CSS; html_nodes returns every match, not just the first. "tag" selects by tag name, ".classname" selects by class name, "#id" selects by ID.

xml_text(), from xml2, pulls the text out of a node in the parsed response. It works on HTML nodes as well.
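The rvest functions above, run against an inline HTML string rather than a live page. The markup, class names and link are made up for illustration. In rvest 1.0 and later, html_node and html_nodes still work but html_element and html_elements are the preferred names.

```r
library(rvest)

html_string <- '<html><body>
<h1 id="title">Scores</h1>
<a class="link" href="https://example.com">Example</a>
<p class="note">first note</p>
<p class="note">second note</p>
<table>
  <tr><th>name</th><th>score</th></tr>
  <tr><td>a</td><td>1</td></tr>
  <tr><td>b</td><td>2</td></tr>
</table>
</body></html>'

page <- read_html(html_string)

# Single node by XPath, then its text and tag name
heading      <- html_node(page, xpath = "//h1")
heading_text <- html_text(heading)
heading_name <- html_name(heading)

# Single node by CSS (tag plus class), then one attribute
link <- html_node(page, css = "a.link")
href <- html_attr(link, name = "href")

# html_nodes returns every match for the selector
notes     <- html_nodes(page, css = ".note")
note_text <- html_text(notes)

# A table node becomes a data frame
scores <- html_table(html_node(page, css = "table"))
print(scores)
```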

