R - httr Web interaction
Interact with web
raw <- GET(url_string) – retrieve a resource from the server. A status code starting with 2 or 3 is fine, 4 means the problem is on your side (bad request), 5 means it is on their side (server error).
POST(url_string, body = data) – send data to the server
content(raw, as = "text") or content(raw, as = "parsed") (the default), from httr – read the body of a GET result
http_error(raw) returns TRUE if the request failed (status code 400 or above), which makes error handling easier
GET(url, user_agent("my@email.address this is a test"), query = params) – the user agent gives the webmaster extra information, such as how to contact you in case anything goes wrong. query adds parameters to the URL more easily than string concatenation or paste(). Example: params = list(x = "asd", y = "qwe") -> url?x=asd&y=qwe.
paste(..., sep = "/") glues strings together with "/" between them, handy for building URL paths
http_type(raw) returns the content type of the response, e.g. "application/json"
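The httr calls above can be sketched roughly as follows. The example.com endpoint, the parameter values and the helper name fetch_text() are made up for illustration. modify_url() is used only to show the URL that query = params would produce, so the top-level code runs without a network connection.

```r
library(httr)

# Hypothetical endpoint and parameters, used only for illustration.
base_url <- paste("https://example.com", "api", "search", sep = "/")
params <- list(x = "asd", y = "qwe")

# modify_url() builds the URL that GET(base_url, query = params) would
# request, without sending anything.
full_url <- modify_url(base_url, query = params)

# Hypothetical helper: fetch a URL and return the body as text,
# stopping with an error message when the server reports a failure.
fetch_text <- function(url, params) {
  raw <- GET(
    url,
    user_agent("my@email.address this is a test"),
    query = params
  )
  if (http_error(raw)) {
    stop("Request failed with status ", status_code(raw))
  }
  content(raw, as = "text", encoding = "UTF-8")
}
```

Calling fetch_text(base_url, params) would perform the actual request. It is left uncalled here because the endpoint is not real.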
JSON
fromJSON(content(raw, as = "text")), from jsonlite – parse the JSON text of a response into R lists and data frames
rlist package: list.select(json, var1, var2) picks the named fields out of each element of a parsed JSON list; list.stack() stacks the result of list.select() into a data frame
bind_rows(), from dplyr, turns a list of records into a data frame.
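A small sketch of the JSON helpers above, assuming the jsonlite, rlist and dplyr packages are installed. The JSON string is made up and stands in for content(raw, as = "text"), so nothing is fetched from the web.

```r
library(jsonlite)
library(rlist)
library(dplyr)

# Made-up JSON standing in for the text of a real response.
json_text <- '[{"name": "a", "value": 1, "extra": "x"},
               {"name": "b", "value": 2, "extra": "y"}]'

# simplifyVector = FALSE keeps the result as a list of records
# instead of letting jsonlite build a data frame directly.
records <- fromJSON(json_text, simplifyVector = FALSE)

# Keep only two fields of every record, then stack them into a data frame.
selected <- list.select(records, name, value)
stacked <- list.stack(selected)

# bind_rows() turns the full list of records into a data frame.
df <- bind_rows(records)
```

With the default simplifyVector = TRUE, fromJSON() would already return a data frame for input shaped like this, so the list helpers matter mostly for nested or irregular JSON.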
XML
xml2 package
read_xml()
xml_structure()
xml_find_all(xml_result, "/api/abc/xyz/rev") extracts all nodes matching the XPath. To find all rev nodes regardless of their path, use "//rev". Each node looks something like <node> </node>.
xml_find_first() is similar to xml_find_all() but returns only the first match
xml_text(), xml_double(), xml_integer() extract data out of a nodeset (the result of xml_find_all())
xml_attrs() or xml_attr() return all attributes or one specific attribute of the nodes
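The xml2 functions above can be tried on a literal string. This is a sketch with a made-up document whose layout mirrors the api/abc/xyz/rev path from the notes. The user attribute and the numeric node contents are assumptions chosen so that xml_integer() has something to convert.

```r
library(xml2)

# Made-up XML standing in for a real API response.
xml_string <- '<api><abc><xyz>
  <rev user="ann">10</rev>
  <rev user="bob">25</rev>
</xyz></abc></api>'
doc <- read_xml(xml_string)

# Print the tree of node names to see how the document is laid out.
xml_structure(doc)

# Full path versus "anywhere in the document".
revs <- xml_find_all(doc, "/api/abc/xyz/rev")
same_revs <- xml_find_all(doc, "//rev")
first_rev <- xml_find_first(doc, "//rev")

# Pull values out of the nodeset.
texts <- xml_text(revs)          # node contents as character
sizes <- xml_integer(revs)       # node contents converted to integer
users <- xml_attr(revs, "user")  # one attribute per node
all_attrs <- xml_attrs(revs)     # every attribute of every node
```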
Web scraping
rvest package
read_html(url)
html_node(page, xpath = "//node"), where page is the result of read_html() above
html_text(node), html_attr(node, name = ""), html_name(node) return the text, an attribute, and the tag name of the node …
html_table() accepts a table node and outputs a data frame
html_nodes(test_xml, css = "..") selects by CSS; unlike html_node(), html_nodes() returns every match rather than only the first. "tag" selects by tag name, ".classname" by class name, "#id" by ID.
xml_text() pulls the text out of a node in an XML or HTML response
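A minimal scraping sketch using the rvest functions above. The HTML is a made-up page passed as a string, so the example does not depend on any real site; the ids, classes and table contents are assumptions for illustration.

```r
library(rvest)

# Made-up page standing in for read_html(url).
html_string <- '<html><body>
  <h1 id="title">Demo page</h1>
  <p class="note">First note</p>
  <p class="note">Second note</p>
  <a href="https://example.com/next">Next</a>
  <table>
    <tr><th>name</th><th>score</th></tr>
    <tr><td>ann</td><td>1</td></tr>
    <tr><td>bob</td><td>2</td></tr>
  </table>
</body></html>'
page <- read_html(html_string)

# html_node() returns the first match: select by ID, then read its text.
heading_text <- html_text(html_node(page, css = "#title"))

# html_nodes() returns every match: select by class name.
note_texts <- html_text(html_nodes(page, css = ".note"))

# Select by XPath, then inspect the tag name and one attribute.
link <- html_node(page, xpath = "//a")
link_tag <- html_name(link)
link_href <- html_attr(link, name = "href")

# Select by tag name and convert the table node to a data frame.
scores <- html_table(html_node(page, css = "table"))
```

Recent rvest versions also offer html_element() and html_elements() as the newer names for html_node() and html_nodes(); the older names from the notes are used here.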