Hackathon 6/4 summary

21:00 Start

21:30 Decide on the spec

21:45 Split the work into modules

 

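Import lxml and requests.

What can they do?

They can pull data from a website,

e.g. res = requests.get('http://homepage3.nifty.com/abe-hiroshi/')

requests fetches the raw page from the web.

lxml parses the HTML into a tree of tags, which makes individual elements easy to pick out.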

 

>>> import requests
>>> import lxml.html
>>> res = requests.get('http://homepage3.nifty.com/abe-hiroshi/top.htm')
>>> e = lxml.html.fromstring(res.text)
>>> lists = e.xpath('//p')
>>> lists[0].text
u' \x8f\x8a\x91\xae'

 

Fetch topics from Yahoo! News

>>> res = requests.get('http://news.yahoo.co.jp/')
>>> res
<Response [200]>
>>> e = lxml.html.fromstring(res.text)
>>> e
<Element html at 0x10ca68310>
>>> lists = e.xpath('//p[@class="ttl"]/a')
>>> lists
[<Element a at 0x10c6b8998>, <Element a at 0x10ca683c0>, <Element a at 0x10ca68470>, <Element a at 0x10ca68520>, <Element a at 0x10ca68578>, <Element a at 0x10ca6fd60>, <Element a at 0x10ca6fdb8>, <Element a at 0x10ca6fe10>, <Element a at 0x10ca6fe68>, <Element a at 0x10ca6fec0>, <Element a at 0x10ca6ff18>, <Element a at 0x10ca6ff70>, <Element a at 0x10ca6ffc8>, <Element a at 0x10ca71050>, <Element a at 0x10ca710a8>, <Element a at 0x10ca71100>, <Element a at 0x10ca71158>, <Element a at 0x10ca711b0>, <Element a at 0x10ca71208>, <Element a at 0x10ca71260>, <Element a at 0x10ca712b8>, <Element a at 0x10ca71310>, <Element a at 0x10ca71368>, <Element a at 0x10ca713c0>, <Element a at 0x10ca71418>, <Element a at 0x10ca71470>, <Element a at 0x10ca714c8>, <Element a at 0x10ca71520>, <Element a at 0x10ca71578>, <Element a at 0x10ca715d0>, <Element a at 0x10ca71628>, <Element a at 0x10ca71680>, <Element a at 0x10ca716d8>, <Element a at 0x10ca71730>, <Element a at 0x10ca71788>, <Element a at 0x10ca717e0>, <Element a at 0x10ca71838>, <Element a at 0x10ca71890>, <Element a at 0x10ca718e8>, <Element a at 0x10ca71940>, <Element a at 0x10ca71998>, <Element a at 0x10ca719f0>, <Element a at 0x10ca71a48>, <Element a at 0x10ca71aa0>, <Element a at 0x10ca71af8>, <Element a at 0x10ca71b50>, <Element a at 0x10ca71ba8>, <Element a at 0x10ca71c00>, <Element a at 0x10ca71c58>, <Element a at 0x10ca71cb0>, <Element a at 0x10ca71d08>, <Element a at 0x10ca71d60>, <Element a at 0x10ca71db8>, <Element a at 0x10ca71e10>, <Element a at 0x10ca71e68>, <Element a at 0x10ca71ec0>, <Element a at 0x10ca71f18>, <Element a at 0x10ca71f70>, <Element a at 0x10ca71fc8>, <Element a at 0x10ca72050>, <Element a at 0x10ca720a8>, <Element a at 0x10ca72100>, <Element a at 0x10ca72158>]
>>> len(lists)
63
>>> lists[0].text
u'\u6a4b\u4e0b\u65b0\u515a \u76ee\u6a19\u300c40\u4eba\u300d\u5c4a\u304b\u305a'

 

Showing the Japanese text properly

print lists[0].text is all it takes
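The difference between the two outputs can be reproduced offline. A minimal sketch, written for Python 3 (the session above is Python 2): ascii() gives the escaped form that the Python 2 REPL echoed, and print() writes the characters themselves. The string is copied from the session output.

```python
# Headline copied from the REPL output, written with the same escapes.
headline = '\u6a4b\u4e0b\u65b0\u515a \u76ee\u6a19\u300c40\u4eba\u300d\u5c4a\u304b\u305a'

# ascii() escapes every non-ASCII character, like the Python 2 REPL echo did.
escaped = ascii(headline)
print(escaped)

# print() writes the characters themselves, so the Japanese is readable.
print(headline)
```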

 

Get the URL

>>> lists[0].attrib['href']
'http://dailynews.yahoo.co.jp/fc/domestic/party/?id=6118781'
>>> res1 = requests.get(lists[0].attrib['href'])
>>> e1 = lxml.html.fromstring(res1.text)

>>> lists = e1.xpath('//h3/a')
>>> len(lists)
2
>>> print lists[0].text
石原新党22人以上、橋下新党「40人」届かず
>>> print lists[1].text
JR山手線の新駅名、どれがいい?
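The pattern used in both sessions (select the anchors with a path expression, then read .text and the href attribute) can be tried without network access. A minimal sketch, assuming a made-up, well-formed snippet and using the standard library's xml.etree.ElementTree as a stand-in for lxml.html; ElementTree only supports a subset of XPath and needs well-formed markup, so real pages still call for lxml.

```python
import xml.etree.ElementTree as ET

# Made-up markup imitating the p.ttl > a structure queried above.
SAMPLE = """<html><body>
<p class="ttl"><a href="http://example.com/1">First topic</a></p>
<p class="ttl"><a href="http://example.com/2">Second topic</a></p>
<p class="other"><a href="http://example.com/x">Not a topic</a></p>
</body></html>"""

root = ET.fromstring(SAMPLE)

# Same idea as e.xpath('//p[@class="ttl"]/a') in the lxml session.
links = root.findall(".//p[@class='ttl']/a")

# Collect (headline, url) pairs from each anchor element.
topics = [(a.text, a.get("href")) for a in links]
for text, url in topics:
    print(text, url)
```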

 

Japanese text in the response body gets garbled with Python's requests

http://amagitakayosi.hatenablog.com/entry/20130504/1367693593

Use content (the raw bytes) instead of text
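Why this helps can be shown with the garbled value from the first session. requests falls back to ISO-8859-1 for text responses whose header names no charset, so the bytes end up decoded with the wrong codec. A minimal sketch in Python 3, assuming the page is encoded as Shift_JIS (the notes do not say so); the garbled string is copied from the session output.

```python
# Mojibake copied from the first session's output (leading space dropped).
garbled = '\x8f\x8a\x91\xae'

# Latin-1 maps code points 0-255 back to the same byte values,
# which recovers the bytes the server actually sent.
raw = garbled.encode('latin-1')

# Assumption: the page is Shift_JIS.
fixed = raw.decode('shift_jis')
print(fixed)
```

Passing the bytes straight to the parser, as in lxml.html.fromstring(res.content), avoids the wrong decode in the first place.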

 

Using the Yahoo! morphological analysis API from Python

http://aidiary.hatenablog.com/entry/20090415/1239802199

 

Using the Yahoo key phrase extraction Web API from Python

 

http://hikm.hatenablog.com/entry/20110321/1300714396

 

1:00 Finished up through the key phrase part

 

Next: sqlite3

 

Get the maximum length among the strings in a list and print it

print max(len(x) for x in stringlist)
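A small Python 3 variant of the same one-liner, as a sketch on hypothetical sample data: key=len returns the longest string itself, and default=0 keeps an empty list from raising ValueError.

```python
# Hypothetical sample data standing in for the scraped headlines.
stringlist = ["short", "a longer headline", "mid length"]

# Length of the longest string; default=0 covers an empty list.
longest_len = max((len(x) for x in stringlist), default=0)

# The longest string itself.
longest = max(stringlist, key=len)

print(longest_len, longest)
```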

 

Using sqlite

>>> import sqlite3
>>> con = sqlite3.connect("test.db",isolation_level = None)

>>> sql = u"""
... create table topics (
... id integer,
... news varchar(200),
... url varchar(100)
... );
... """
>>> print sql

create table topics (
id integer,
news varchar(200),
url varchar(100)
);

>>> sql = u'insert into topics values (1,"お姉ちゃん","http://ane.com")'
>>> print sql
insert into topics values (1,"お姉ちゃん","http://ane.com")
>>> con.execute(sql)
<sqlite3.Cursor object at 0x10701aa40>

>>> sql = u"select * from topics"
>>> c = con.execute(sql)
>>> for t in c:
...     print t
...
(1, u'\u304a\u59c9\u3061\u3083\u3093', u'http://ane.com')
(1, u'\u304a\u59c9\u3061\u3083\u3093', u'http://ane.com')
>>> for t in c:
...     print t[0],t[1],t[2]
...
>>> c = con.execute(sql)
>>> for t in c:
...     print t[0],t[1],t[2]
...
1 お姉ちゃん http://ane.com
1 お姉ちゃん http://ane.com

 

sqlite> select * from topics inner join pas on topics.id == pas.id where pas.score = (select min(score) from pas where id = 1);

 

for i in range(len(stringlist)):
    sql = u'insert into topics values(%d,"%s","%s")' % (i,stringlist[i],urllist[i])
    con.execute(sql)
    for j in range(len(keyphraselist[i])):
        sql = u'insert into pas values(%d,"%s","%s")' % (i,keyphraselist[i][j],scorelist[i][j])
        con.execute(sql)
        print sql
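Building the statement with % breaks as soon as a headline contains a double quote. A minimal Python 3 sketch of the same loop with ? placeholders, which let sqlite3 handle the quoting. The sample lists and the pas schema (id, keyphrase, score) are assumptions; the notes never show how pas was created.

```python
import sqlite3

# Hypothetical sample data shaped like the lists in the loop above.
stringlist = ['He said "no comment"', "Second headline"]
urllist = ["http://example.com/0", "http://example.com/1"]
keyphraselist = [["comment", "said"], ["headline"]]
scorelist = [[100, 60], [100]]

con = sqlite3.connect(":memory:")  # in-memory database, test.db is untouched
con.execute("create table topics (id integer, news varchar(200), url varchar(100))")
# Assumed schema for pas; it does not appear in the notes.
con.execute("create table pas (id integer, keyphrase varchar(100), score integer)")

for i, (news, url) in enumerate(zip(stringlist, urllist)):
    # Values are passed separately, so quotes in the headline are harmless.
    con.execute("insert into topics values (?, ?, ?)", (i, news, url))
    rows = [(i, k, s) for k, s in zip(keyphraselist[i], scorelist[i])]
    con.executemany("insert into pas values (?, ?, ?)", rows)
con.commit()

topic_count = con.execute("select count(*) from topics").fetchone()[0]
pas_count = con.execute("select count(*) from pas").fetchone()[0]
print(topic_count, pas_count)
```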

sql = u'select * from topics inner join pas on topics.id == pas.id'
c = con.execute(sql)
sql = u'select * from topics inner join pas on topics.id == pas.id where pas.score = (select max (score) from pas where pas.id == 39) and pas.id == 39'
#sql = u'select * from topics inner join pas on topics.id == pas.id where pas.score = (select max (score) from pas)'
# For a given id x, fetch the record in table "pas" with the highest score among the rows with that id, joined with topics
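The query above hard-codes id 39. One way to get the top-scoring key phrase for every topic at once is a correlated subquery. A minimal Python 3 sketch on made-up rows; the keyphrase column name and the sample data are assumptions.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table topics (id integer, news varchar(200), url varchar(100))")
# Assumed schema for pas; only id and score appear in the notes.
con.execute("create table pas (id integer, keyphrase varchar(100), score integer)")
con.executemany("insert into topics values (?, ?, ?)", [
    (0, "Headline A", "http://example.com/a"),
    (1, "Headline B", "http://example.com/b"),
])
con.executemany("insert into pas values (?, ?, ?)", [
    (0, "alpha", 100), (0, "beta", 40),
    (1, "gamma", 70), (1, "delta", 100),
])

# For each joined row, keep it only if its score is the maximum
# among the pas rows that share the same id.
sql = """
select topics.id, topics.news, pas.keyphrase, pas.score
from topics inner join pas on topics.id = pas.id
where pas.score = (select max(p2.score) from pas as p2 where p2.id = pas.id)
order by topics.id
"""
best = con.execute(sql).fetchall()
for row in best:
    print(row)
```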

 

4:00 Data loading finished

 

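Put it on the server

Display is done in HTML

Embed JavaScript in the HTML

JavaScript can call Python

On request, show a random question

When an answer comes in, check it and show whether it is correct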
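The last two steps might look like this on the Python side. This is only a sketch in Python 3: the notes do not say what a question looks like, so it assumes a question is a headline and the correct answer is that headline's top-scoring key phrase. pick_question, check_answer, the pas schema, and the sample rows are all hypothetical.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table topics (id integer, news varchar(200), url varchar(100))")
# Assumed schema for pas; it does not appear in the notes.
con.execute("create table pas (id integer, keyphrase varchar(100), score integer)")
con.executemany("insert into topics values (?, ?, ?)", [
    (0, "Headline A", "http://example.com/a"),
    (1, "Headline B", "http://example.com/b"),
])
con.executemany("insert into pas values (?, ?, ?)", [
    (0, "alpha", 100), (0, "beta", 40), (1, "gamma", 100),
])

def pick_question(con):
    """Return (id, news) for one topic chosen at random by SQLite."""
    return con.execute(
        "select id, news from topics order by random() limit 1"
    ).fetchone()

def check_answer(con, topic_id, answer):
    """True if the answer matches the topic's highest-scoring key phrase."""
    row = con.execute(
        "select keyphrase from pas where id = ? order by score desc limit 1",
        (topic_id,),
    ).fetchone()
    return row is not None and row[0] == answer.strip()

qid, news = pick_question(con)
print(qid, news)
print(check_answer(con, 0, "alpha"), check_answer(con, 0, "beta"))
```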

 

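About the Python interpreter

Is PHP the only language that can be embedded in the page?

Placing Python there directly would be nice, but that is inelegant

We don't understand how it relates to the HTML

We will probably also need to touch the JS and CSS files

Issues for next time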
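One common arrangement, offered as a possible starting point and not as what the team settled on: Python is not embedded in the page at all. It runs as a small HTTP server that returns JSON, and the JavaScript in the HTML requests that URL. A minimal Python 3 sketch using only the standard library; the /question path, QUESTIONS, random_question, QuizHandler, and run are all hypothetical names.

```python
import json
import random
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-memory data; the real project would read from test.db.
QUESTIONS = [
    {"id": 0, "news": "Sample headline A"},
    {"id": 1, "news": "Sample headline B"},
]

def random_question():
    """Return one randomly chosen question as a JSON string."""
    return json.dumps(random.choice(QUESTIONS))

class QuizHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Only /question is served; anything else gets a 404.
        if self.path == "/question":
            body = random_question().encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "application/json; charset=utf-8")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

def run(port=8000):
    """Serve until interrupted; call run() by hand to start the server."""
    HTTPServer(("", port), QuizHandler).serve_forever()
```

With run() started, the page's JavaScript would request /question and render the JSON it gets back, so the HTML, JS, and CSS stay plain static files.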

 

6:00 Not finished this time; a follow-up session is very likely

 

If these notes don't make it clear, this is definitely the place to look

http://www18168ue.sakura.ne.jp/wiki/index.php