15:00:47 #startmeeting neutron_northbound
15:00:47 Meeting started Mon Jul 24 15:00:47 2017 UTC. The chair is yamahata. Information about MeetBot at http://ci.openstack.org/meetbot.html.
15:00:47 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
15:00:47 The meeting name has been set to 'neutron_northbound'
15:01:00 #chair mkolesni rajivk_
15:01:00 Current chairs: mkolesni rajivk_ yamahata
15:01:06 #topic agenda bashing and roll call
15:01:10 #info mkolesni
15:01:15 #info yamahata
15:01:22 #link https://wiki.opendaylight.org/view/NeutronNorthbound:Meetings
15:01:24 #info rajivk
15:01:45 any topics in addition to breakage and the usual patches/bugs?
15:02:23 I'd like to discuss the CI
15:02:36 not the u/t, the tempest one
15:02:45 yeah, the tempest CI is not in good shape right now.
15:03:07 well, we're making it better, but it's a slow process
15:03:31 anything else?
15:04:15 ok, move on
15:04:15 FF is Thursday
15:04:23 we need to merge all non-bugs by then
15:04:34 yamahata, are you cutting the branch?
15:04:49 or is it not automatic, since we're not in the independent release model?
15:04:56 the neutron team will do it with one patch.
15:05:07 Pike-2 was done that way.
15:05:17 So we'll review such a patch
15:05:33 ok
15:05:37 #topic Announcements
15:05:43 Pike-3 is this week.
15:05:44 afaik it's the 27th
15:05:52 so that leaves ~3 days
15:05:53 #info Feature freeze is Thursday
15:06:12 any other announcements?
15:06:31 do you know if you're going to the PTG yet?
15:06:34 or the summit?
15:06:39 Unfortunately not yet.
15:06:44 ok
15:06:52 I will ask again next week :)
15:07:07 #topic action items from last meeting
15:07:17 I suppose we don't have any (except patch review).
15:07:21 #topic Pike/Nitrogen planning
15:07:27 rajivk_'s patch is good to go but blocked by the CI breakage :/
15:07:46 So for Pike-3, feature patches need to be merged
15:07:59 #action everyone address ci breakage
15:08:20 rajivk_ mentioned it earlier
15:08:31 rajivk_, do you know the necessary fix for the u/t CI?
15:08:45 I will put up a patch.
15:09:09 But I don't know why the CI was passing after the ceilometer patch got merged.
15:09:11 ok, great. I haven't had time to look at it today, so if you have the fix we'll review it
15:09:18 Maybe my findings are not correct.
15:09:38 well, post the patch and we'll see :)
15:09:49 ok
15:09:54 yeah, we'll see the result.
15:10:16 So what are the remaining patches?
15:10:26 https://review.openstack.org/#/c/474851/
15:10:46 https://review.openstack.org/#/q/topic:bug/1683797
15:11:04 Oh, mkolesni, you uploaded a patch to make it a neutron worker.
15:11:07 great
15:11:17 yamahata, yes, I think it's a more elegant approach
15:11:31 it will also allow configuring multiple workers if we have a need for it
15:11:36 and the dhcp patch
15:11:46 https://review.openstack.org/#/c/465735/
15:12:06 The dhcp port patch needs review.
15:12:20 I will review it again tomorrow
15:12:27 yamahata, if you agree with https://review.openstack.org/486606
15:12:43 perhaps we can abandon all the other ones on the same bug
15:12:55 I haven't reviewed the patch yet. But that's how I'd like to cook it.
15:13:10 I think thread pooling still makes sense.
15:13:19 It's orthogonal to 486606.
15:13:20 sure, just saying there are a lot of patches there now
15:13:37 no problem with that, though nobody has addressed my comment there from PS5
15:13:49 the prepopulate-agentdb patches are floating around.
15:14:03 It's a bug fix patch, though.
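For context on the "neutron worker" approach discussed above (review 486606): a minimal sketch of what running the journal loop under neutron's worker framework could look like, assuming neutron-lib's BaseWorker class. The OdlJournalWorker class and its placeholder sync loop are illustrative, not the actual networking-odl patch.

```python
# Illustrative sketch only: run the ODL journal sync loop as a neutron
# worker so the service framework manages its lifecycle, instead of a
# bare thread started from the mechanism driver.
import threading

from neutron_lib import worker


class OdlJournalWorker(worker.BaseWorker):
    """Hypothetical worker wrapping the journal sync loop."""

    def __init__(self, worker_process_count=1):
        super().__init__(worker_process_count=worker_process_count)
        self._stop = threading.Event()
        self._thread = None

    def _run(self):
        # Placeholder for the journal loop: in the real driver this would
        # pick pending rows from the journal table and push them to ODL.
        while not self._stop.wait(timeout=5):
            pass  # sync_pending_rows() would go here

    def start(self):
        super().start()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def stop(self):
        self._stop.set()

    def wait(self):
        if self._thread:
            self._thread.join()

    def reset(self):
        self._stop.clear()
        self.start()
```

One upside of this shape, as noted in the meeting, is that worker_process_count makes running multiple journal workers a configuration change rather than new code.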
15:14:21 https://review.openstack.org/#/c/465735/ and https://review.openstack.org/#/c/484446/
15:14:26 bug fixes aren't first priority, so let's focus on the features first
15:14:34 Yeah.
15:14:42 and if we have time left, on the bug fixes
15:14:46 we have plenty of patches for Pike-3...
15:15:03 rajivk_, yamahata, please see my comment here: https://review.openstack.org/#/c/452647/5/networking_odl/journal/journal.py
15:15:06 After Pike-3, we can address bug fixes
15:15:35 of course then the neutron stable team has to approve the backports to stable/pike, right?
15:15:49 right.
15:16:01 hello
15:16:08 btw, thread pooling is also a feature, so if you want it in we can focus on it too
15:16:23 though I don't think it's critical for Pike, and it could slip to Queens
15:17:23 https://review.openstack.org/#/q/project:openstack/networking-odl+status:open
15:17:34 we have many bug fix patches floating around.
15:17:47 After Pike-3, let's wipe them out.
15:17:58 there's some cleaning required there; some of them are obviously obsolete
15:18:39 so to sum it up, for this week we need to focus on:
15:18:48 1. https://review.openstack.org/474851 - done, needs to be merged
15:19:28 2. https://review.openstack.org/#/c/465735/
15:19:38 3. https://review.openstack.org/#/c/452647
15:19:43 anything else?
15:20:07 that's the priority.
15:20:19 I think three is already many.
15:20:22 yes, those are the RFEs
15:20:34 well, the first one is +2 by both of us
15:20:44 it's a technicality to merge it after the gate is fixed
15:21:23 good summary. let's move on
15:21:25 #topic patches/bugs
15:21:31 we've already discussed patches.
15:21:43 and we'll look into the CI breakage.
15:21:48 #topic tempest CI
15:21:52 mkolesni: you're on stage
15:22:29 right
15:22:48 so as you know, I've been investigating the tempest CI breakage
15:23:02 found some bugs here and there, all fixed now
15:23:12 but the status is still dire
15:23:36 so let's discuss on a per-job basis..
15:23:52 first, gate-tempest-dsvm-networking-odl-boron-snapshot-v2driver, which is our only voting job (and also gating)
15:24:07 this job is very unstable
15:24:39 really unstable! It's with legacy netvirt, so I don't see much value in fixing it.
15:24:41 I believe the cause is some mess-up in the DHCP setup, so that somehow traffic slips across subnets on the DHCP nodes
15:24:48 indeed
15:24:53 but just to understand the cause
15:25:01 I suppose once we have Carbon with new netvirt voting, we can retire the Boron job or disable the unstable Boron tests.
15:25:13 Oh, great! what's that?
15:25:16 so basically when it fails, you'll see it's because VMs don't get an IP
15:25:32 and in the VM boot log you see it got a DHCP NAK
15:26:07 and you also see it in the dhcp log, where each request gets answered by the dnsmasq on that subnet (DHCP ACK)
15:26:20 and also by 2 other dnsmasq instances on other subnets (DHCP NAK)
15:26:33 so basically this sucks, but I didn't investigate further
15:26:50 are those dhcp agents on the same network?
15:26:52 because, as you said, it's old netvirt, so I doubt anyone's going to fix it
15:27:01 no, they're on different subnets
15:27:11 but somehow they get the dhcp request as well
15:27:12 I mean network, not subnet
15:27:32 mkolesni, I also noticed a disk write failure
15:27:34 no, I think they're even on different tenants, but I'm not sure
15:27:50 I see.
15:27:56 rajivk_, yes, there might be other failures; I'm just describing what I saw most of the time
15:28:07 anyway, old netvirt, not interesting
15:28:22 ok, can I move on to the next job?
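The cross-subnet DHCP NAK symptom described above (each request ACKed by its own subnet's dnsmasq but NAKed by dnsmasq instances on other subnets) is the kind of thing a quick log pass can confirm. A rough triage sketch, assuming the standard dnsmasq syslog line format; the regexes and the symptom heuristic are assumptions based on the meeting's description, not a vetted diagnostic.

```python
# Rough triage helper: scan a dnsmasq/syslog file for client IPs that were
# ACKed by one dnsmasq instance (pid) but NAKed by others, which matches
# the "traffic slips across subnets" symptom described in the meeting.
import re
from collections import defaultdict

# Assumed dnsmasq log shapes, e.g.:
#   dnsmasq-dhcp[1234]: DHCPACK(tapXXXX) 10.100.0.5 fa:16:3e:...
#   dnsmasq-dhcp[5678]: DHCPNAK(tapYYYY) 10.100.0.5 fa:16:3e:... wrong network
ACK_RE = re.compile(r'dnsmasq-dhcp\[(\d+)\]: DHCPACK\(\S+\) (\S+)')
NAK_RE = re.compile(r'dnsmasq-dhcp\[(\d+)\]: DHCPNAK\(\S+\) (\S+)')


def find_cross_subnet_naks(path):
    acks = defaultdict(set)  # client IP -> dnsmasq pids that ACKed it
    naks = defaultdict(set)  # client IP -> dnsmasq pids that NAKed it
    with open(path) as f:
        for line in f:
            m = ACK_RE.search(line)
            if m:
                acks[m.group(2)].add(m.group(1))
            m = NAK_RE.search(line)
            if m:
                naks[m.group(2)].add(m.group(1))
    # Clients both ACKed and NAKed (by different dnsmasq pids) are suspects.
    return {ip: pids for ip, pids in naks.items() if ip in acks}


if __name__ == '__main__':
    for ip, pids in find_cross_subnet_naks('syslog.txt').items():
        print('%s NAKed by dnsmasq pids %s' % (ip, sorted(pids)))
```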
15:28:29 Is it a failure to acquire the lease again, or just to get an IP the first time?
15:28:48 rajivk_, it fails to get an IP on VM boot
15:28:49 I mean, do they fail after machine reboots or in all the test cases?
15:29:00 about 10 times or something, and then it gives up
15:29:16 from what I saw, every time an IP is requested
15:29:39 it's consistent; all the same tests fail each time because of this issue
15:29:49 I checked one of the patch logs; it was requesting specific IPs but the server responded with NAK.
15:30:01 anyway, we can leave it, as you said.
15:30:08 yes, let's continue
15:30:12 next is gate-tempest-dsvm-networking-odl-carbon-snapshot-vpnservice-v1driver-nv
15:30:39 so this one had a problem that the port status updater wasn't loaded at all, causing random failures
15:30:44 that got fixed
15:31:01 I didn't continue too much on it since it's the v1 driver
15:31:19 but it's rather unstable, though it's non-voting, so meh
15:31:42 the only problem is that it stalls results until it times out, but I guess we can live with that for now
15:32:07 not sure how much value it provides, so we can decide to drop it entirely once P-3 is out
15:32:24 yamahata, what's the plan for v1? is it cut from the tree in Queens?
15:32:50 Maybe. If we can make the v2 driver job voting, it makes sense to retire the v1 driver.
15:32:58 ++
15:33:04 we can throw it out when we cut it out of the tree
15:33:24 Yeah. So far the v2 driver job isn't stable enough.
15:33:27 RH has no interest in the v1 driver, so as far as we're concerned, the sooner the better
15:33:36 problem is, no job is stable enough :)
15:33:52 so we had the v1 driver job for comparison, to understand where the issue exists.
15:34:03 but right now they are both too unstable.
15:34:31 we can send an email and see if there's any objection to throwing out the job
15:34:37 anyway, we should focus on the v2 driver job.
15:34:38 if not, we can remove the job at least
15:35:19 Once the v2 driver job is stable, it's okay to remove the v1 job.
15:36:10 ok, as you wish
15:36:34 ok, now the big guy: gate-tempest-dsvm-networking-odl-carbon-snapshot-vpnservice-v2driver-nv
15:36:51 so this one also had a bug that the provisioning block wasn't created
15:37:18 so the port status update failed to actually do anything, and then nova would randomly time out VMs
15:37:48 depending on a race there, so sometimes a VM would boot normally because the provisioning by dhcp was fast enough
15:37:52 anyway, that got fixed
15:38:12 now the major issue I'm noticing with it is something I believe is a problem in ODL
15:38:20 now we're seeing the carbon v2 job sometimes passing
15:38:36 I sent an email about it to netvirt-dev, let me find it
15:38:46 It's plausible that the issues are on the ODL side.
15:39:23 #info https://lists.opendaylight.org/pipermail/netvirt-dev/2017-July/005062.html
15:39:27 there are many ERROR logs in the karaf log.
15:39:38 so to sum it all up from the email, the FIP is sometimes broken
15:40:03 again, we're seeing a situation where either the tests are all green, or all tests requiring FIP fail
15:40:16 or at least the same tests fail every time
15:40:36 so this leads me to believe the problem happens when the public network gets created on ODL
15:40:54 All pass or all fail is an interesting observation.
15:41:02 the problem is I'm not that strong on the ODL side, so that's why I asked for assistance
15:41:16 but nobody has stepped up yet, and the mail saw little interest
15:41:45 if you guys have better netvirt knowledge, you can take a look
15:41:47 maybe we should try to replicate it with Nitrogen.
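For readers unfamiliar with the provisioning-block bug described above (port status updates doing nothing, nova randomly timing out VMs): neutron holds a port out of ACTIVE until every registered provisioning component releases its block, and nova waits on the resulting vif-plugged event. A minimal sketch of that dance, using neutron's provisioning_blocks module; the 'ODL' entity label and function names here are illustrative, not networking-odl's actual code, and import paths vary by release.

```python
# Minimal sketch of the provisioning-block pattern the v2 driver job was
# missing. If the block is never added (or never released), nova times out
# waiting for the network-vif-plugged notification.
from neutron.db import provisioning_blocks
from neutron_lib.callbacks import resources

ODL_ENTITY = 'ODL'  # illustrative entity label for the ODL component


def add_port_block(context, port):
    # Called while the port is being bound, before it can go ACTIVE:
    # register ODL as a component that must confirm provisioning.
    provisioning_blocks.add_provisioning_component(
        context, port['id'], resources.PORT, ODL_ENTITY)


def release_port_block(context, port_id):
    # Called when ODL reports the port is up. Once all components have
    # released their blocks, neutron flips the port to ACTIVE and nova
    # receives its vif-plugged event, so the VM boot can proceed.
    provisioning_blocks.provisioning_complete(
        context, port_id, resources.PORT, ODL_ENTITY)
```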
15:42:02 if not, I'm trying to get some help from our ODL team at RH
15:42:18 Cool.
15:42:25 hmm, the nitrogen jobs are all broken because it's not built by the integrated job yet
15:42:42 so yamahata, rajivk_ or manjeets, do you guys have the knowledge to debug this?
15:43:07 mkolesni, I guess they need to have the nitrogen snapshot available
15:43:09 of course we have. The issue is their bandwidth.
15:43:35 it fails at getting the nitrogen snapshot
15:43:42 Anyway, after Pike-3, I'll also look into it.
15:43:51 anyway, I believe the new-netvirt job should be the voting one and the old-netvirt one should be non-voting, the opposite of what happens today
15:44:00 With Nitrogen, the karaf distribution is not created yet.
15:44:18 yes, we can perhaps use only the netvirt karaf
15:44:18 the karaf or netvirt image needs to be used.
15:44:47 basically the netvirt karaf probably has everything we need
15:44:50 so we can try it
15:44:55 So far the ODL community doesn't have an ETA for creating the karaf distribution image.
15:45:00 I had an experimental patch to use the netvirt karaf
15:45:03 this seems plausible
15:45:23 https://review.openstack.org/#/c/482453/
15:45:34 but I didn't dig into the test failures too much
15:46:05 regarding sfc, I'm not sure
15:47:01 I think it does have sfc, but don't take my word for it
15:47:19 is sfc even tested by tempest, though?
15:47:41 I guess not. I guess no one has seriously tested sfc.
15:48:04 so for the gate, maybe it's enough to use the netvirt karaf
15:48:20 Probably the unit tests for sfc will be kept. tempest tests for sfc won't be enabled.
15:48:25 I can rebase that patch if you want to see what's up
15:48:42 I'd love to see the result.
15:48:50 luckily the unit tests don't care about which distribution we use :)
15:49:47 The ODL Nitrogen cycle is short, so we should know about issues early.
15:50:36 Nitrogen would be targeted by Queens though, right?
15:50:57 obviously we need to know asap, but I'm asking regarding the "optimal versions"
15:51:22 In that sense, yeah: Queens + Nitrogen, Pike + Carbon.
15:51:39 btw, with that experimental patch, obviously the old netvirt job fails because it's not in the distribution, even in Boron :)
15:52:01 Also, the netvirt folks have started a similar discussion.
15:52:24 again, it's hard to debug, because if the gate times out then no logs are collected
15:52:47 We'd probably like to disable some tests with floating IPs so that we can have logs.
15:52:49 also, something that's been bothering me, but I don't know how to solve, is that these damn logs are in html
15:53:14 https://review.openstack.org/#/c/486177/
15:53:19 and that's doubling their size, making reading them tougher
15:53:22 there is something wrong with the patch.
15:53:54 with what patch?
15:54:13 the one to disable some FIP tests.
15:54:48 ok, I didn't see that. I have some other ones to reduce the load so that logs do get collected, and that's what I've been using to debug the gate
15:54:48 we have 6 mins left.
15:55:10 yamahata, mkolesni, I switched the grenade job to new netvirt, v2 driver; I still see it failing on the floating IP access tests
15:55:18 do we have anything else?
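As a first-pass triage for the FIP failures discussed above (and the GARP errors mentioned just below), scanning karaf.log for ERROR lines is a quick way to decide whether a run hit the ODL-side issue. A rough sketch, assuming the default karaf log layout with "| ERROR |" columns and a literal "GARP" keyword; both are assumptions based on the symptoms described in the meeting.

```python
# Rough karaf.log triage: count ERROR lines and surface GARP-related ones,
# which the meeting associates with broken floating IP connectivity.
import re
import sys

ERROR_RE = re.compile(r'\| ERROR \|')  # assumed karaf log column format


def scan_karaf_log(path):
    errors, garp = [], []
    with open(path, errors='replace') as f:
        for line in f:
            if ERROR_RE.search(line):
                errors.append(line.rstrip())
                if 'garp' in line.lower():
                    garp.append(line.rstrip())
    print('%d ERROR lines, %d mention GARP' % (len(errors), len(garp)))
    for line in garp[:20]:  # show the first few GARP errors, if any
        print(line)


if __name__ == '__main__':
    scan_karaf_log(sys.argv[1] if len(sys.argv) > 1 else 'karaf.log')
```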
15:55:42 so just to make sure: if FIP fails, you should see some errors about GARP in karaf.log
15:55:50 manjeets, please check if that's the case ^
15:55:56 if so, it's the same as in tempest
15:56:28 I've got nothing else; basically, we should switch to the new netvirt job asap and keep the old job as non-voting for reference
15:56:48 problem is, right now they both seem to be failing about half of the time
15:56:54 so it's hard to say which is worse
15:56:59 but it's a true nightmare
15:57:17 also, on the gate queue it takes 3 hours until it fails :/
15:57:40 timeouts are very common these days
15:57:43 maybe we should consider a slimmer tempest run on the gate itself
15:57:53 and keep the heavy tests only on the check queue
15:58:20 parallel execution is one way, and neutron did it, but it's too early for us....
15:58:33 anyway, anything else to discuss/complain about?
15:58:41 hmm, yeah, that's a whole other discussion :)
15:58:50 #topic open mike
15:58:52 no, I'm done, stick a fork in me :)
15:59:19 ok, thanks guys
15:59:24 thank you everyone
15:59:25 have a good day/night
15:59:28 thank you
15:59:31 #topic cookies
15:59:31 bye :)
15:59:37 #endmeeting