In [2]:
import requests
In [3]:
url = 'https://techcrunch.com/2017/03/08/a-new-affordable-naming-startup-for-startups/'
res = requests.get(url)
In [4]:
res # Success
Out[4]:
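Printing res only displays its status; an explicit success check can be added with raise_for_status(), a minimal sketch:

res.raise_for_status()    # raises requests.HTTPError for 4xx/5xx responses
print(res.status_code)    # 200 on success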
In [5]:
res.text[:200] # The response body as text (first 200 characters).
Out[5]:
In [6]:
import lxml.html
from bs4 import BeautifulSoup
In [10]:
root = lxml.html.fromstring(res.text)
- How to access HTML elements
  - class: use "."
  - id: use "#"
- Accessing elements by class value
  - select with "." (see the selector sketch below)
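A minimal sketch of the two selector forms with lxml's cssselect, reusing the class and id names that appear later in this notebook:

# Class selectors start with ".", id selectors start with "#".
by_class = root.cssselect('.article-entry')     # every element with class="article-entry"
by_id = root.cssselect('#speakable-summary')    # the element with id="speakable-summary"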
In [14]:
entries = root.cssselect('.article-entry')
entries
Out[14]:
In [15]:
len(entries) # Only one tag contains the article body.
Out[15]:
In [16]:
article = entries[0]
content = article.text_content() # Extract the article body as plain text.
content[:100]
Out[16]:
- Accessing elements by id value
  - select with "#"
In [23]:
root.cssselect('#speakable-summary')[0].text_content()[:100]
Out[23]:
In [24]:
bs = BeautifulSoup(res.text, 'html.parser')
type(bs)
Out[24]:
In [25]:
bs.find_all("div", class_='article-entry')  # find_all is the preferred modern spelling of findAll
Out[25]:
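BeautifulSoup also accepts CSS selectors through select(); a minimal sketch, equivalent to the find_all call above:

# select() takes the same CSS selector syntax used with lxml's cssselect.
entries_bs = bs.select('div.article-entry')   # same elements as find_all("div", class_='article-entry')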
In [26]:
h2s = bs.find_all('h2')
len(h2s)
Out[26]:
In [27]:
h2s
Out[27]:
In [29]:
res = requests.get('https://techcrunch.com/startups/')
root = lxml.html.fromstring(res.text)
titles = root.cssselect('h2 a')
len(titles)
Out[29]:
In [30]:
i = 0
for title in titles:
    i += 1
    print("- " + str(i) + " " + title.text)
In [31]:
titles[0].attrib['href']
Out[31]:
In [32]:
for title in titles:
    print(title.attrib['href'])
In [34]:
articles = []
for title in titles:
    url = title.attrib['href']
    res_a = requests.get(url)
    articles.append(res_a.text)
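When fetching many article pages in a row, it can help to check the status code and pause briefly between requests. A hedged sketch of the same loop with those additions (the 1-second delay is an arbitrary choice):

import time

articles = []
for title in titles:
    url = title.attrib['href']
    res_a = requests.get(url)
    if res_a.status_code == 200:   # keep only successful responses
        articles.append(res_a.text)
    time.sleep(1)                  # be polite to the server between requests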
In [35]:
len(articles)
Out[35]:
In [38]:
articles[0][:100]
Out[38]:
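Each downloaded page can be parsed the same way as the single article above. A minimal sketch that extracts the body text of every fetched article, assuming each page uses the same article-entry class:

bodies = []
for html in articles:
    page = lxml.html.fromstring(html)
    entry = page.cssselect('.article-entry')
    if entry:                                  # skip pages without an article body
        bodies.append(entry[0].text_content())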
In [39]:
import tqdm
In [41]:
subTitles = root.cssselect('.river li .block .byline a')
In [44]:
for subTitle in tqdm.tqdm_notebook(subTitles[:10]):
    print(subTitle.text)
    print("https://techcrunch.com" + subTitle.attrib['href'])